The problem

Most small drones run PID controllers. You tune a set of gains for your airframe, test in a limited range of wind conditions, and it works. But every time you hit something the gains weren't tuned for (gusty crosswind, degraded GPS, a task change mid-flight), you're back to writing new logic and re-tuning. If you've done this across multiple airframes, you know how much time goes into it and how fragile the result can be.

We wanted a different approach. Reinforcement learning trains a control policy by running the vehicle through a wide range of simulated scenarios and letting the policy learn the sensor-to-control mapping directly. If the simulation is realistic enough and the training process has the right guardrails, you end up with a controller that can handle conditions nobody wrote explicit rules for.

Getting a policy to fly in simulation is the easy part. The actual engineering is in everything around it: reproducible training, staged validation, a clean handoff to real hardware. We spent most of our time building that infrastructure.

What we built

The system handles training, validation, and export for learned flight control policies. Everything runs on local compute, with no cloud dependency in the development loop.

Simulation environment. GPU-accelerated physics with atmosphere modeling, weather effects, and sensor simulation. Built for fast parallel training on local hardware.
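To make the parallel-training idea concrete, here is a minimal sketch of per-episode domain randomization, the standard way a simulator like this exposes a policy to varied conditions. The `EpisodeConditions` fields, `sample_conditions` helper, and all numeric ranges are illustrative assumptions, not our actual training envelope.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class EpisodeConditions:
    """Physical conditions randomized per simulated episode (illustrative)."""
    wind_speed_ms: float       # steady wind magnitude, m/s
    gust_amplitude_ms: float   # gust peak above the steady wind, m/s
    vehicle_mass_kg: float     # perturbed vehicle mass
    sensor_noise_std: float    # additive noise on sensor channels

def sample_conditions(rng: random.Random) -> EpisodeConditions:
    # Ranges here are made up for the sketch, not the real training envelope.
    return EpisodeConditions(
        wind_speed_ms=rng.uniform(0.0, 8.0),
        gust_amplitude_ms=rng.uniform(0.0, 3.0),
        vehicle_mass_kg=1.2 * rng.uniform(0.9, 1.1),  # +/-10% around nominal
        sensor_noise_std=rng.uniform(0.0, 0.05),
    )
```

Each parallel environment draws its own conditions, so a single training batch spans calm air through gusty wind rather than one fixed scenario.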

Training pipeline. Warm-start from an expert controller using behavior cloning, then refine with reinforcement learning against task objectives. The policies carry memory across timesteps so they can handle situations where context matters.
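The behavior-cloning warm start reduces to supervised regression: minimize the gap between the policy's actions and the expert's recorded actions. A minimal sketch, using a linear policy and a hand-written gradient step in place of the actual network and optimizer:

```python
import numpy as np

def bc_loss(w: np.ndarray, obs: np.ndarray, expert_act: np.ndarray) -> float:
    """Mean squared error between policy actions and the expert's actions.

    A linear policy (obs @ w) stands in for the real network here.
    """
    return float(np.mean((obs @ w - expert_act) ** 2))

def bc_step(w: np.ndarray, obs: np.ndarray, expert_act: np.ndarray,
            lr: float = 0.05) -> np.ndarray:
    """One gradient step of behavior cloning on a batch of demonstrations."""
    grad = 2.0 * obs.T @ (obs @ w - expert_act) / len(obs)
    return w - lr * grad
```

The RL stage then starts from the cloned weights instead of random ones, so early rollouts already fly.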

Evaluation gates. Every checkpoint gets tested under deterministic conditions before it can be promoted. The gates check task performance against defined criteria, and a checkpoint that doesn't pass doesn't advance.
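The gate logic itself is deliberately simple: every metric must clear its threshold or the checkpoint stops. A sketch with illustrative criteria names and thresholds (the real criteria are task-specific):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GateCriteria:
    """Illustrative pass/fail thresholds; real criteria are task-specific."""
    max_altitude_error_m: float = 0.25
    max_position_drift_m: float = 0.50
    min_completion_rate: float = 0.95

def passes_gate(metrics: dict, criteria: GateCriteria = GateCriteria()) -> bool:
    """A checkpoint advances only if every metric clears its threshold."""
    return (
        metrics["altitude_error_m"] <= criteria.max_altitude_error_m
        and metrics["position_drift_m"] <= criteria.max_position_drift_m
        and metrics["completion_rate"] >= criteria.min_completion_rate
    )
```

Because the evaluation runs under fixed seeds, a failed gate points at a real regression rather than simulation noise.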

Export pipeline. Targets PX4-compatible flight controllers. Each exported checkpoint ships with a manifest that locks down the interface contract between the model and the hardware.
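A sketch of what such a manifest might contain. The field names and `write_manifest` helper are hypothetical; the point is that the manifest records the interface the model was trained against and is bound to the exact weights by hash.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class ExportManifest:
    """Interface contract shipped with each exported checkpoint (sketch)."""
    observation_fields: list   # ordered sensor fields the model expects
    observation_dim: int
    action_dim: int
    control_rate_hz: float
    normalization: dict        # per-field normalization baked in at export
    weights_sha256: str = ""   # ties the manifest to the exact weight blob

def write_manifest(manifest: ExportManifest, weights: bytes) -> str:
    """Serialize the manifest, binding it to the weights by SHA-256 hash."""
    manifest.weights_sha256 = hashlib.sha256(weights).hexdigest()
    return json.dumps(asdict(manifest), indent=2)
```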

The surrounding system includes orchestration, health monitoring, and experiment tracking. It has been used across repeated training campaigns, each aimed at isolating a specific question about the training process. Everything is versioned and traceable.

The training loop

For a given flight task (takeoff to a low hover, for example), the pipeline goes through a few stages:

  1. Demonstration collection. A tuned PID controller flies the task across randomized initial conditions while we record sensor observations and control outputs.
  2. Warm start. A neural policy learns to reproduce the PID controller's behavior, which gives the RL stage a stable starting point. Skipping this step means the policy crashes constantly in early training, which makes learning basically impossible for flight control tasks.
  3. Policy refinement. The warm-started policy trains in the full simulation. Dense reward signals guide it through the early phase, and over time it learns to improve on the PID baseline, particularly in conditions where hand-tuned gains struggle.
  4. Evaluation. The trained policy runs in a fixed-seed environment where we measure altitude accuracy, position stability, control smoothness, and task completion against pass/fail criteria.
  5. Promotion or rejection. If it passes, it becomes the reference checkpoint for the next stage. If it fails, we log the failure mode and design the next experiment to go after that specific problem.
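Stage 1 is worth making concrete. Here is a toy version of demonstration collection: a PID expert flies a 1-D altitude model from a randomized start while we record (observation, action) pairs. The dynamics, gains, and `collect_demo` helper are all simplified assumptions for illustration, not our actual simulator or tuning.

```python
import random

class PID:
    """Minimal PID; stands in for the tuned expert controller."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_err = None

    def step(self, err):
        self.integral += err * self.dt
        deriv = 0.0 if self.prev_err is None else (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

def collect_demo(target_alt, rng, steps=600, dt=0.02):
    """Fly a toy 1-D altitude model under PID from a randomized initial
    condition, recording (observation, action) pairs for behavior cloning."""
    alt = target_alt + rng.uniform(-0.5, 0.5)  # randomized start
    vel = 0.0
    pid = PID(kp=4.0, ki=0.5, kd=2.5, dt=dt)
    demo = []
    for _ in range(steps):
        err = target_alt - alt
        u = pid.step(err)          # gravity-compensated thrust command
        demo.append(((alt, vel), u))
        vel += u * dt              # double-integrator dynamics
        alt += vel * dt
    return demo
```

Running this across many random seeds yields the demonstration dataset that the warm-start stage regresses against.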

A task can take many iterations before a model qualifies for promotion. That's deliberate: reward going up during training doesn't always mean the vehicle is actually doing the right thing, and the evaluation gates are there to catch that.

Sim-to-hardware handoff

Our export pipeline forces every checkpoint through a contract validation step before it can leave the training environment. The contract locks down observation dimensions, field ordering, control frequency, normalization state, and architecture parameters. If anything about the interface has changed since training (say a sensor field got added or reordered), the export just fails with a clear error instead of silently producing a broken controller.
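The shape of that check is simple: compare the manifest recorded at training time against the live interface, field by field, and fail with a message that names the mismatch. A sketch, with illustrative flat dicts standing in for the full contract (the real one also covers normalization state and architecture parameters):

```python
def validate_contract(manifest: dict, live_interface: dict) -> None:
    """Fail the export loudly if the deployed interface drifted from training.

    Keys checked here are illustrative; the real contract is broader.
    """
    checks = [
        ("observation_fields", "sensor fields added, removed, or reordered"),
        ("observation_dim", "observation dimension changed"),
        ("action_dim", "action dimension changed"),
        ("control_rate_hz", "control frequency changed"),
    ]
    for key, why in checks:
        if manifest[key] != live_interface[key]:
            raise ValueError(
                f"contract violation on '{key}': {why} "
                f"(trained with {manifest[key]!r}, got {live_interface[key]!r})"
            )
```

A hard failure with a named field is the whole point: a reordered sensor list should stop the export, not produce a controller that silently reads the wrong channel.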

Every deployment candidate also runs through PX4 software-in-the-loop testing before it goes near a real vehicle. SITL exercises the full inference path inside the actual PX4 flight stack. Standard safety interlocks like geofencing, attitude limits, and battery failsafes stay in the flight stack where they belong, and the learned policy operates inside those bounds.

Scope and limitations

Trained policies work within the envelope they were trained on. Push them outside it (higher wind speeds, different vehicle mass, sensor setups we haven't modeled) and performance drops off. We know that, which is why the promotion criteria exist and why we test hard before anything advances.

Sim-to-real transfer is the main technical risk with any approach like this. Our simulation covers atmosphere, wind, and sensor characteristics, but no simulation is perfect. The contract system catches structural mismatches. Physics modeling errors only show up in real flight testing.

We build the learned control layer and the training infrastructure around it. The flight stack, command and control, and task planning are separate systems; we integrate with them.

Need help with this kind of work?

If you need help evaluating learned control, building the validation path, or integrating autonomy software with an existing flight stack, contact us.