ArduPilot: An Unrelated Matter of Time

Written 2024-03-25

Tags:race-condition ArduPilot 

I found another odd timer behaviour in ArduPilot on Linux, when monitoring the scheduler using a GPIO. I had hooked up a low-priority thread to check in on the autopilot software every second, and if all checks were good, set a pin, delay(1ms), clear pin. The only problem was that rarely, I would find that the pin wasn't going high, but there were no problems recorded.

Historically, delay(N) on Arduino delayed at least N milliseconds. Practically, I couldn't find anything in ArduPilot that cared about sub-millisecond precision that called delay() - usually the caller wanted a rather long amount of time, or it was used for something like throttling logging code. On the Linux scheduler backend, there was a rare race-condition when calling delay(N) just at the top of millisecond. Unknowingly my test setup was configured to expose this discrepancy a couple times per day. Here's the offending code:

void Scheduler::delay(uint16_t ms)
{
    if (_stopped_clock_usec) {
        return;
    }

    uint64_t start = AP_HAL::millis64();

    //Race condition occurs between previous and next calls to millis64()

    while ((AP_HAL::millis64() - start) < ms) {
        // this yields the CPU to other apps
        microsleep(1000);
        if (in_main_thread() && _min_delay_cb_ms <= ms) {
            call_delay_cb();
        }
    }
}

Instead of a ~1ms pulse at 1Hz, a few times per day I would see a dropped pulse. At first, this appeared as a slight deviation in the frequency, as my frequency counter averages pulse-count over measurement period. Until I started looking with an oscilloscope, then I noticed that what I was seeing was a runt pulse, where the GPIO was being cleared low just as soon as it was being set high. With the frequency counter's low-pass filter enabled, these runt pulses were being skipped.

The bug was simple once I accepted it was actually happening. When calling delay(1) it is rare, but possible to call start = milli64() just on the edge of the next millisecond. When this happens, the loop condition degrades into while(((start + 1) - start) < 1){...}, so the loop exits immediately without calling microsleep() to delay execution.

The fix consists of a few parts. First the exit condition is based on the starting timestamp +1 so we cannot return early. The internal resolution is also increased from milliseconds to microseconds, so that the amount of time lost to round-off is reduced. The final calls to microsleep() is now based on how much time is actually needed relative to the starting timestamp, so most drift is accounted except for the last call to microsleep(). The new delay(N):

void Scheduler::delay(uint16_t ms)
{
    if (_stopped_clock_usec) {
        return;
    }

    if (ms == 0) {
        return;
    }

    uint64_t now = AP_HAL::micros64();
    uint64_t end = now + 1000UL * ms + 1U;
    do {
        // this yields the CPU to other apps
        microsleep(MIN(1000UL, end-now));
        if (in_main_thread() && _min_delay_cb_ms <= ms) {
            call_delay_cb();
        }
        now = AP_HAL::micros64();
    } while (now < end);
}

I measured the impact of this on an otherwise idle CPU core on my Raspberry Pi 4, calling delay(100ms) after usleep(rand()%1000) to ensure the test wasn't getting locked to milliseconds. I used delay(100) because delay(1) 1/20000 times would not have made for as interesting of a graph, though the orignal bimodal distribution below is the same cause:

ArduPilotOnLinux delay()