ArduPilot: An Unrelated Matter of Time

Written 2024-03-25

Tags: race-condition ArduPilot

I found another odd timer behaviour in ArduPilot on Linux while monitoring the scheduler with a GPIO. I had hooked up a low-priority thread to check in on the autopilot software every second: if all checks were good, set a pin, delay(1ms), clear the pin. The only problem was that, rarely, the pin wasn't going high, yet there were no problems recorded.

Historically, delay(N) on Arduino delayed at least N milliseconds. Practically, I couldn't find any caller of delay() in ArduPilot that cared about sub-millisecond precision - usually the caller wanted a rather long delay, or used it for something like throttling logging code. On the Linux scheduler backend, however, there was a rare race condition when calling delay(N) just at the top of a millisecond. Unknowingly, my test setup was configured to expose this discrepancy a couple of times per day. Here's the offending code:

void Scheduler::delay(uint16_t ms)
{
    if (_stopped_clock_usec) {
        return;
    }

    uint64_t start = AP_HAL::millis64();

    // Race condition occurs between the previous and next calls to millis64()

    while ((AP_HAL::millis64() - start) < ms) {
        // this yields the CPU to other apps
        microsleep(1000);
        if (in_main_thread() && _min_delay_cb_ms <= ms) {
            call_delay_cb();
        }
    }
}

Instead of a ~1ms pulse at 1Hz, a few times per day I would see a dropped pulse. At first this appeared as a slight deviation in frequency, since my frequency counter averages pulse count over the measurement period. It wasn't until I looked with an oscilloscope that I noticed these were actually runt pulses: the GPIO was being cleared low almost as soon as it was set high. With the frequency counter's low-pass filter enabled, the runt pulses were simply being skipped.

The bug was simple once I accepted it was actually happening. When calling delay(1) it is rare, but possible, for start = millis64() to land just on the edge of the next millisecond. When this happens, the loop condition degrades into while(((start + 1) - start) < 1){...}, so the loop exits immediately without ever calling microsleep() to delay execution.
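To make the arithmetic concrete, here's a hypothetical timeline (illustrative values, not a real trace):

// start is captured at 999.999ms of uptime, which millis64() truncates
// to 999; the clock ticks over before the loop condition is evaluated.
uint64_t start = 999;                     // captured just before rollover
uint64_t now   = 1000;                    // millis64() at the loop condition
bool loop_body_runs = (now - start) < 1;  // (1000 - 999) < 1 -> false
// microsleep() is never called, so the caller's GPIO pulse is a runt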

The fix consists of a few parts. First, the exit condition is based on the starting timestamp plus one extra tick, so we cannot return early. The internal resolution is also increased from milliseconds to microseconds, so less time is lost to round-off. Finally, each call to microsleep() is now based on how much time is still needed relative to the starting timestamp, so drift is accounted for everywhere except within the last call to microsleep(). The new delay(N):

void Scheduler::delay(uint16_t ms)
{
    if (_stopped_clock_usec) {
        return;
    }

    if (ms == 0) {
        return;
    }

    uint64_t now = AP_HAL::micros64();
    // work in microseconds, and push the deadline 1us past the request
    // so truncation can never cause an early return
    uint64_t end = now + 1000UL * ms + 1U;
    do {
        // this yields the CPU to other apps
        microsleep(MIN(1000UL, end-now));
        if (in_main_thread() && _min_delay_cb_ms <= ms) {
            call_delay_cb();
        }
        now = AP_HAL::micros64();
    } while (now < end);
}

I measured the impact of this on an otherwise idle CPU core on my Raspberry Pi 4, calling delay(100ms) after usleep(rand()%1000) to ensure the test wasn't getting locked to millisecond boundaries. I used delay(100) because delay(1) returning early roughly one time in 20,000 would not have made for as interesting a graph, though the original bimodal distribution below has the same cause:
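The measurement loop looked roughly like the sketch below. This is a reconstruction, not the exact harness; it assumes the ArduPilot Linux HAL is initialized and that hal.scheduler points at the Scheduler above.

// Time repeated calls to delay(100) and print each measured duration
// for offline histogramming. The usleep() randomizes the phase so the
// calls don't lock to millisecond boundaries.
void measure_delay_jitter(void)
{
    for (int i = 0; i < 10000; i++) {
        usleep(rand() % 1000);             // de-correlate from the 1ms tick
        uint64_t t0 = AP_HAL::micros64();
        hal.scheduler->delay(100);         // the call under test
        uint64_t t1 = AP_HAL::micros64();
        printf("%llu\n", (unsigned long long)(t1 - t0));
    }
}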

[Graph: ArduPilot-on-Linux delay() timing distribution]

Lauterbach Teardowns

Written 2024-02-27

Tags: Debugging Lauterbach 

I was curious what was inside the Lauterbach debug tips (also called probe heads around here). For the unfamiliar, the Lauterbach power-debug system consists of two parts. The first part is a universal module that connects to a PC over USB or Ethernet, or, in the old days, a parallel port. The second part adapts the debug-cable port on the module to a specific JTAG or SWD target, like 20-pin ARM. I took apart two of mine.

LA-7742 (ARM9)

I like this one because, while Lauterbach generally sticks to a vintage-1990s hardware design, this LA-7742 seems even older. If you look closely, the housing says W. Germany.

[Photos: LA-7742 exterior]

This one consists of a 24C02 serial EEPROM and an ALVC164245 level-translating tri-state buffer.

[Photos: LA-7742 internals]

LA-7747 (ARM7)

This one looks like the current debug adapter cables, except with a 20-pin IDC socket rather than a pigtail, even though it is for an older core design than the LA-7742.

[Photos: LA-7747 exterior]

Inside, it appears more complicated. It seems to have the same 24C02 serial EEPROM, but adds MAX4615/6 analog switches, DS90C032 quad differential receivers, and some PACDN006 ESD diode packs.

[Photos: LA-7747 internals]

ArduPilot: A Matter of Time

Written 2024-02-04

Tags: Floating-point ArduPilot

I have a couple of drones and an RC car controlled by a custom PCB I made, with a Raspberry Pi running OpenHD and ArduPilot. It's fun to play with when I have time.

Recently, I left one of my drones powered with the props removed while I was replacing a desk in my office. A surgery and some shaved yaks later, the old desks are gone, the new desk surface doesn't fit around a hallway corner, I'm building a new desk surface with a friend, and the drone has been plugged in the whole time, rebooting only for power outages.

When I finally checked in on it, the Linux OS seemed fine, but some periodic ArduPilot code I was tracing was reporting very strange timestamps - every timestamp was rounded to 1/4 second, or 250 milliseconds.

The root cause was a combination of events:

  1. The initial ArduPilot HAL for Linux, from 2013, used default-precision (double, at the time) floating-point math for converting between Linux timevals & timespecs and ArduPilot millis() & micros().
  2. Last summer, ArduPilot switched from double-precision to single-precision float literals with -fsingle-precision-constant. It's important to note that ArduPilot mostly targets microcontrollers, some with only single-precision FPUs, where double-precision math is expensive.
  3. I happened to leave my board running long enough for the rounding and conversion errors to be obvious.

Here's the offending code:

uint64_t micros64()
{
    const Linux::Scheduler* scheduler = Linux::Scheduler::from(hal.scheduler);
    uint64_t stopped_usec = scheduler->stopped_clock_usec();
    if (stopped_usec) {
        return stopped_usec;
    }

    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    // with -fsingle-precision-constant, these literals (and therefore
    // the whole expression) are evaluated in 32-bit float
    return 1.0e6*((ts.tv_sec + (ts.tv_nsec*1.0e-9)) -
                  (state.start_time.tv_sec +
                   (state.start_time.tv_nsec*1.0e-9)));
}

With the double-precision constants and calculations in use at the time, the rounding errors were very small (most notably, some microseconds were only 999 nanoseconds long), but this happened rarely unless Linux and ArduPilot had been running for a long time. After the change to 32-bit floats, things came apart quickly.

First, 1.0e-9 isn't exactly representable as a 32-bit float, so it gets rounded to the nearest representable value, about 0.9999999717e-9. This isn't really a problem: it only affects the fractional-seconds term, contributing a few tens of nanoseconds of error at most, and the error re-aligns every new second.
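You can see the rounded constant directly with a quick standalone check (nothing ArduPilot-specific here):

// Print the value the compiler actually stores for the 1.0e-9 literal
// when it is forced down to a 32-bit float.
#include <cstdio>

int main()
{
    float ns_to_sec = 1.0e-9f;              // what -fsingle-precision-constant produces
    printf("%.10e\n", (double)ns_to_sec);   // prints ~9.9999997172e-10
    return 0;
}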

I'll also define a term here, the ArduPilot epoch: it's the millis() or micros() count since ArduPilot read start_time from Linux. There's also Linux's CLOCK_MONOTONIC epoch, which counts seconds and nanoseconds since Linux system startup, possibly with NTP slewing.

We next need to consider the order of the addition and subtraction operations above: each timestamp is first combined into seconds plus fractional nanoseconds, and only then are the two subtracted. This means the intermediate representation has to hold seconds and nanoseconds combined, but a 32-bit float is only good for about 6-7 decimal digits, and one second is 10^9 nanoseconds. A partial improvement would be to add the difference of the nanoseconds to the difference of the seconds, so that the differences are computed exactly first. This would let ArduPilot compute the offset from the CLOCK_MONOTONIC epoch to the ArduPilot epoch exactly, but as ArduPilot runs longer, things still get weird.
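Something like this, as a minimal sketch of that reordering (hypothetical code, not the actual ArduPilot patch):

// Subtract seconds and nanoseconds separately: the integer differences
// are exact, so each term is rounded only once, on conversion to float.
// The offset from the CLOCK_MONOTONIC epoch cancels exactly, but the
// seconds-since-ArduPilot-startup value still loses precision as it grows.
uint64_t micros64_reordered(const struct timespec &ts,
                            const struct timespec &start)
{
    float dsec  = (float)(ts.tv_sec  - start.tv_sec);
    float dnsec = (float)(ts.tv_nsec - start.tv_nsec);
    return (uint64_t)(dsec * 1.0e6f + dnsec * 1.0e-3f);  // microseconds
}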

Finally, we should look directly at the 32-bit floating-point datatype. I really like this calculator for playing with 32-bit float quantization and rounding issues. In the middle part of micros64(), we're dealing with floating-point seconds since ArduPilot startup. To find the floating-point resolution at, say, 60 seconds of ArduPilot epoch, enter 60 into the calculator and hit '-1' to compute the next lowest 32-bit floating-point number: the result is ~4 microseconds away from 60. This means that calling micros() around 1 minute after ArduPilot startup can be off by up to 4 microseconds.

[Screenshot: 32-bit float representation near 60 seconds]
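The same check can be done without the calculator (a standalone snippet, nothing ArduPilot-specific):

// Measure the gap between 60.0f and its next-lowest float neighbour.
#include <cmath>
#include <cstdio>

int main()
{
    float t = 60.0f;                       // 60 seconds of ArduPilot epoch
    float below = nextafterf(t, 0.0f);     // next representable float down
    printf("%g us\n", (t - below) * 1e6);  // ~3.8 us between neighbours
    return 0;
}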

Things quickly get weirder the longer the system runs, or the later ArduPilot is started after Linux:

Approx epoch    Approx representation error
1 second        60 nanoseconds
1 minute        4 microseconds
1 hour          244 microseconds
1 day           8 milliseconds
1 week          62.5 milliseconds
3 weeks         125 milliseconds
1 month         250 milliseconds
2 months        500 milliseconds
4 months        1 second
6 months        1 second
8 months        2 seconds
1 year          2 seconds
14 years        2038 bug!

It's neat that we can see the floating-point format at play here. As the epoch runs on, the floating-point error goes up, but the time between increases in error goes up too. This makes sense: each time the floating-point exponent steps up, the space between representable numbers doubles. Granularity of time tends to cause measurement errors where one interval might seem short (because it contains fewer time ticks) or another might seem long (because it has an extra time tick). At the extreme, intervals of 0 ticks appear instant. What sort of practical issues can this cause? We can run one of ArduPilot's test programs for just this! Here's output from Scheduler_test at around 3 weeks of Linux uptime and 1 minute of ArduPilot uptime, where t = milliseconds since Scheduler_test launch:

five_seconds: t=53375 ins_counter=3750
one_hz: t=54000
one_hz: t=54500
one_hz: t=55000
one_hz: t=55500
one_hz: t=56000
five_seconds: t=56000 ins_counter=4000

The fp32 rounding error is quite obvious. But also note that the scheduler is running the one_hz task at 2Hz. Here's an example of where problems can occur when millis64() might jump by 8 milliseconds at a time after a day of uptime:

void Scheduler::delay(uint16_t ms)
{
    if (_stopped_clock_usec) {
        return;
    }

    uint64_t start = AP_HAL::millis64();

    while ((AP_HAL::millis64() - start) < ms) {
        // this yields the CPU to other apps
        microsleep(1000);
        if (in_main_thread() && _min_delay_cb_ms <= ms) {
            call_delay_cb();
        }
    }
}

After a day of uptime, a simple delay(1ms) will delay execution for anywhere between 0ms and 8ms, depending on when delay() is called relative to when the floating-point time increments.

I had started on a couple of patches to move back to double-precision math and reorder operations, but Andrew Tridgell beat me to it, using uint64_t nanoseconds and an optimized integer division routine.
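The shape of that fix is roughly the sketch below. This is my illustration of the integer approach, not the actual patch (which also includes the optimized division routine); it assumes the same state struct as the micros64() shown earlier.

uint64_t micros64()
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    // 64-bit integer differences are exact; no floating point involved
    int64_t dsec  = ts.tv_sec  - state.start_time.tv_sec;
    int64_t dnsec = ts.tv_nsec - state.start_time.tv_nsec;
    return (uint64_t)(dsec * 1000000000LL + dnsec) / 1000ULL;
}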
