ArduPilot: A Matter of Time

Written 2024-02-04

Tags:Floating-point ArduPilot 

I have a couple of drones and an RC car controlled by a custom PCB I made with a RaspberryPi running OpenHD and ArduPilot. It's fun to play with when I have time.

Recently, I left one of my drones powered with the props removed while I was replacing a desk in my office. A surgery and some shaved yak's later, the old desks are gone, the new desk surface doesn't fit around a hallway corner, I'm building a new desk surface with a friend, and the drone has been plugged in the whole time, rebooting only for power outages.

When I finally checked in on it, the Linux OS seemed fine, but some periodic ArduPilot code I was tracing was reporting very strange timestamps - every timestamp was rounded to 1/4 second, or 250 milliseconds.

The root cause was a couple events:

  1. The initial ArduPilot HAL for Linux in 2013 uses default-precision(double at the time) floats for converting between Linux timevals & timespecs and ArduPilot millis() & micros().
  2. Last summer ArduPilot switches from double-precision to single-precision float litterals with -fsingle-precision-constant. It's important to note that ArduPilot targets mostly microcontrollers, some with only single-precision FPUs, where double-precision math is expensive.
  3. I happened to leave my board running long enough for rounding and conversion errors to be obvious.

Here's the offending code:

uint64_t micros64()
{
    const Linux::Scheduler* scheduler = Linux::Scheduler::from(hal.scheduler);
    uint64_t stopped_usec = scheduler->stopped_clock_usec();
    if (stopped_usec) {
        return stopped_usec;
    }

    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return 1.0e6*((ts.tv_sec + (ts.tv_nsec*1.0e-9)) -
                  (state.start_time.tv_sec +
                   (state.start_time.tv_nsec*1.0e-9)));
}

With the double-precision constants and calculations in-use at the time, the rounding errors were very small, most notably some microseconds were only 999 nanoseconds long, but this wasn't very often if Linux and ArduPilot hadn't been running too long. After the change to 32-bit floats, things come apart quickly.

First, 1.0e-9 isn't an exact 32-bit number, so it gets rounded to about .9999999974752427078783512115478515625e-6 - this isn't really an issue, because it would only be responsible for around 80ns/year of clock drift, and only affects the fractional seconds count, so the drift re-aligns every new second.

I'll also define a term here, ArduPilot epoch - it's the millis() or micros() since ArduPilot read the start_time from Linux. There's also Linux's CLOCK_MONOTONIC epoch, which is seconds and nanoseconds since the Linux system startup, possibly with NTP skewing.

We next need to consider the order of addition and subtraction operations above - we're taking seconds plus fractional nanoseconds, and subtracting from that seconds plus fractional nanoseconds. This means that we need a precise enough representation to hold both seconds and nanoseconds combined, and a 32-bit float is often good for about 6-7 decimal digits, but a second is 10e9 nanoseconds. A partial improvement here would be add the difference of nanoseconds to the difference of seconds, this way the differences are first computed exactly. This would allow ArduPilot to compute the difference from CLOCK_MONOTONIC epoch to ArduPilot epoch exactly, but as ArduPilot runs longer things still get weird.

Finally, we should look directly at the 32-bit floating-point datatype. I really like this calculator for playing with 32-bit float quantization and rounding issues. At this middle part of micros64(), where we're dealing with floating-poing seconds since ArduPilot startup, if we want to calculate floating-point resolution at say, 60 seconds of ArduPilot epoch, enter 60 into the calculator and hit '-1' to compute the next lowest 32-bit floating-point number, we get a number that is ~4 microseconds away from 60. This means that calling micros() around 1 minute after ArduPilot startup can be off by up to 4 microseconds.

ardu_1min3

Things quickly get weirder the longer the system runs, or the later ArduPilot is started after Linux:

Approx epochApprox representation error
1 second60 nanoseconds
1 minute4 microseconds
1 hour244 microseconds
1 day8 milliseconds
1 week62.5 milliseconds
3 weeks125 milliseconds
1 month250 milliseconds
2 months500 milliseconds
4 months1 second
6 months1 second
8 months2 seconds
1 year2 seconds
14 years2038 bug!

It's neat that we can see the floating-point format at play here. As the epoch runs on, the floating-point error goes up, but the time between increases in floating-point error go up too. This makes sense because each time the floating-point moves up, the space between representable numbers increases. Granularity of time tends to cause measurement errors where one interval might seem short (because it contains fewer time ticks) or another seem long(because it has an extra time tick). At an extreme, intervals of 0 ticks are instant. What sort of practical issues can this cause? We can run one of ArduPilot's test programs for just this! Here's output from the Scheduler_test around 3 weeks of Linux uptime, 1 minute of ArduPilot uptime, t= milliseconds since Scheduler_test launch:

five_seconds: t=53375 ins_counter=3750
one_hz: t=54000
one_hz: t=54500
one_hz: t=55000
one_hz: t=55500
one_hz: t=56000
five_seconds: t=56000 ins_counter=4000

The fp32 rounding error is quite obvious. But also note that the Scheduler is running the one_hz task at 2Hz. Here's an example of where problems can occur when millis64() might increment by 8 milliseconds after a day of uptime:

void Scheduler::delay(uint16_t ms)
{
    if (_stopped_clock_usec) {
        return;
    }

    uint64_t start = AP_HAL::millis64();

    while ((AP_HAL::millis64() - start) < ms) {
        // this yields the CPU to other apps
        microsleep(1000);
        if (in_main_thread() && _min_delay_cb_ms <= ms) {
            call_delay_cb();
        }
    }
}

After a day of uptime, a simple delay(1ms) will either delay execution between 0ms and 8ms depending on when delay() is called relative to when the floating-point time increments.

I had started on a couple patches to move back to double-precision math and reorder operations, but Andrew Tridgell beat me to it, using uint64_t nanoseconds and an optimized integer division routine.

FireWall Stories: Your firewall is out of TCPs!

Written 2024-02-01

Tags:Firewall TLS HTTPS Linux 

Back in 2015, I worked in small security group at a medium-sized electronics company. Nominally, I was supposed be testing our products. My coworkers managed a small fleet of big firewalls, and I thought they were really neat. Nominally Linux but with 10gig and sometimes 40gig ethernet, dozens of cores, and weird IRQ steering. Big loud machines, a far cry from the embedded systems I usually work on. I'd help them out whenever I could.

One day, my coworker AaronV asked if I could make a bash script to take the text output of some of the firewall commands and process to make it easier to track down a problem one of our internal application server teams was experiencing. Between us, we put a quickly one-liner together to scan through all established TCP connections and group and count them as he needed. Aaron says, "sure enough, there they are".

I couldn't resist asking why we were doing this. The affected application was responsible for taking data from a queue uploaded from our users, and if the user had checked a box to enable 3rd party integration, the application would ship it off to the 3rd party for processing. But today was different. The 3rd party system was timing out, leading to our queue backing up. But why was this a firewall issue? The 3rd party said they were having a handful of load spikes in their fleet of VMs, but nothing significant, and had already scaled up the VM count to accommodate. Load balancing was done though DNS round-robin, so our load should be spread wide. And, if our server team went around the firewall and made HTTP requests manually, it always worked. Seems like it could be a firewall issue. The application team suggested we increase the number of TCP connections allocated on the firewall, as this had fixed some things before for them.

Indeed, the firewall running out of TCP connection tracking allocations is such a bad thing that would impact every system going through the it so we had plenty of connections allocated, and weren't running low on them today.

In fact, we didn't see the application team's VMs in our throughput load graph at all. But, when we sorted by how many connections were open per IP on our side, "sure enough, there they are", right at the top of the list we see our application team's servers each using more TCP connections than anyone else in the company.

A little more bash and we can watch the TCP connections forming live for a single one of our application VMs. It seemed around 20k connections to a 3rd party server were usually open, then about every five minutes around 5k of the connections would close and reopen to a different server from the same 3rd party.

We asked our team if they were re-using HTTPS connections and they assured us they were. But when we plotted data throughput over time for a single TCP connection, we saw the user data go out, the response come back, then only periodic packets - likely keepalives until the connection closed. Our application team confirmed that when it was working, the requests and responses were fairly quick, nothing as long as we were seeing the connections open for. So, why were multiple HTTPS requests not flowing through each TCP connection?

Recently our application team had their code to OpenGrok, so we decided to take a peak. At first everything seemed good, creating a Apache Java connection pool, and using HTTP keepalives. We were increasing the connections-per-route count because it defaults at 2 and I guess we'd like more concurrency?

But, something was amiss. We were constructing the pool in the code that generated the request, so each connection was getting its own pool. So the connections weren't being re-used, and both our application and the 3rd party server was spending most of their time doing TLS handshakes. To make matters worse, the HTTP connection pools weren't being destroyed until their connection closed, which meant we were progressively tying up per-connection resources on the 3rd party web server nutil the connections timed out. In general, it went something like this:

  1. Each thread on our server pulls an item from the upload queue.
  2. Create a connection pool for the request, query DNS for the 3rd party service, DNS request cache misses and we get a new IP pointing to one of their VMs.
  3. Open a new TLS connection, send the request, server responds, and we keep the connection open using keepalives, just in case we might want it later, for 20 minutes.
  4. Our thread pulls another item from the queue
  5. Create a connection pool for the request, query DNS for the 3rd party service, DNS cache hits, same IP points to same server
  6. goto #3 until the DNS TTL expires

About halfway though the DNS TTL, the 3rd party's VM would start to bog down under load of continuously establishing new TLS connections, the third party would see that machine's load spike, and we'd see a bunch of timeouts. Once we knew the issue, our app server team was able to rework the HTTP connection pool to be shared among threads, leading to much quicker processing for our customers and a reduction in load for everyone.

It turns out this is a key configuration in the Apache Java connection pool client - the limit for connections is separate from the limit for connections per host. This way the pool can limit itself in the face of DNS TTL caching to try to spread load wider, which would've helped a ton. Of course, this, and connection pooling in general requires creating a single pool and letting all the threads utilize it, rather an a pool per HTTP request :P

the Richards-Gebaur radioactive waste burial site.

Written 2023-11-24

Tags:Richards-Gebaur radiation ColdWar 

In the late 1950s and early 1960s, the US military created a large number of radioactive waste disposal sites. Most contained low-level waste - things like dosimeters and radium-painted airplane gauges. The Air Force in particular buried waste at a number of sites, then with base turnover, had a problem relocating some of them. This post is the story of what happened to the waste at Richards-Gebaur Air Force Base.

My introduction to this topic occured when Shane Lakin mentioned a map of the former Richards-Gebaur Air Reserve Station including an area marked 'RADIOACTIVE WASTE BURIAL SITE'[1] in the Local Kansas City History Buffs group. Much discussion was had by former staff and locals to the area. This base was around 15 minutes from my house, and has since been split up, and the airfield repurposed into an intermodal rail transfer station. Here's the map Shane posted:

unsourced_map

General timeline:

A note about TO 00-110A-l

Technical Order 00-110A-l describes how radioactive waste should be buried. Before around 1958, bases were authorized to bury waste, and this document specified design rules for how to do so in concrete-lined and capped wells.

General Excerpts from "BURIAL OF RADIOACTIVE WASTE IN THE USAF", 15 March 1972

The following excerpts give an overview of the problem: radioactive waste has been disposed of for some time, and due to turnover and records loss, in the 1970s we weren't certain where it all was. The following text is copied from the report, but may contain OCR transcription errors:

I.A. In the early 1950's, it was common practice in the United States
Air Force to bury radioactive waste. This was merely a reflection of the
AEC policy of those times. A technical order (T.O. 00-110A-l) specified .
procedures to be followed, including identification of the burial ::ite
location on appropriate maps and fencing to prevent entry of unauthorized
personnel (reference Appendix A). Although a wide range of radioactive
materials was buried throughout the Air Force, a majority of 1.1,e waste
volume could be divided into three categories. The first was electron
tubes containing small amounts of radioisotopes. These items were used
under the terms of a general license issued by the Atomic Energy
Commission. The second category was low-level wastes generated in
nuclear weapons maintenance operations. The last category war;
radioactive self-luminous instrument dials, usually containing radium.
I.B. In the period 1958-59, several important changes occurred. The
general authority to bury radioactive wastes was deleted from T.O.
00-11 -1. Instead, the Air Force converted to a contractor dhiposal
system, as outlined in T.O. 00-110N-2. The alternative of burial still
existed, but only with the permission of the USAF Radioisotop,Committee.
It is important to note that, when the authority to bury was deleted from
T.O. 00-11QA-1, no instructions were provided for maintaining the
existing burial sites and their associated records. 
I.E. In 1970 it became increasingly apparent that the records covering
ch disposal site were gradually being lost. In some cases, because of
transfer of personnel and disposal of records, current Base Civil
Engineering personnel were completely unaware of the existence of a
burial site at their installation. In keeping with a continuing responsibility
for radiological safety, it became imperative that the location and
condition of existing burial sites be identified and recorded. 
I.F. Accordingly, a Hq USAF survey letter (Radioactive Waste
Disposal, Hq USAF SCN 71-28, 19 Feb 71) was sent to all major
commands. (Reference Appendix C.) It directed that all existing burial
sites should be identified in the Base Master Plan (Tab C-1), and that
certain details concerning construction and utilization of the site be
forwarded to the USAF Radioisotope Committee, which would subsequently
make recommendations on the maintenance of the sites. This report will
summarize the survey returns, and make recommendations where
appropriate.
II.B. Replies were received from 136 USAF installations. Of these, 46
were classed as "positive." A positive reply is one which indicates that
either radioactive waste burial did in fact occur, or that a burial site
exists or existed. In the latter cases, it may not be possible to determine
if the burial sites were ever used. 

Richards-Gebaur AFB's entry in "BURIAL OF RADIOACTIVE WASTE IN THE USAF", 15 March 1972

A burial site is located about 1200' east of the north-south runway.
A 10" - 12" diameter cast iron pipe buried 23' deep, with concrete end
caps. No information available on nature of wastes, or on existence of
fence or sign.
Recommendations:
  1. Remove pipe, dispose of contents .1 5 per T. 0. 00-110N-2.
  2. Confirm presence of fence and warning sign, if no sign, erect one; schedule regular inspection and maintenance.

Excerpts from "Installation Restoration Program Records Search for Richards-Gebaur Air Force Base, Missouri", March 1983 (ADA126605)

IV.A.9 Findings, Activity Review, Other Activities

During the period that ADC was in command (1955-1970), nuclear weapons were stored on the base at the
1200 munitions area. Nuclear weapons were also stored at an off-base munitions storage area currently known as the
Belton Training Annex. Nuclear materials or weapons have not been buried and are no longer stored at the base or at
the Belton Training Annex (see Section VII).
Conventional explosive ordnance is sent offsite to Fort Leonardwood, Missouri. No site at Richards-Gebaur AFB
has been used for disposal of explosive ordnance or of nuclear materials.

IV.B.2 Findings, Disposal Sites Identification and Evaluation, Fire Department Training Areas

Site No. 7, the Radioactive Disposal Well located west of Scope Creek in the southern portion of the base,
was used intermittently between 1955 and about 1970 for disposal of low-level radioactive materials, primarily
dosimeters. Levels of radioactivity in the vicinity of the well have been measured and found to be at or near
background levels. The well has been tested and capped. An overall rating score of 4 is due to the low levels
of radioactivity and full containment of small waste quantities.

Table 1, PRIORITY LISTING OF DISPOSAL SITES

Site No.Site DescriptionOverall Score
1South Landfill55
2Northeast Landfill54
8Herbicide Burial Site51
3Contractor Rubble Burial Site48
5South Burn Pit48
9Oil-Saturated Area48
6North Burn Pit45
4West Burn Pit42
7Radioactive Disposal Well4

Executive Summary, C, Conclusions

Table 1 presents a priority listing of the rated sites and their overall scores. The sites designated as areas showing the most
significant potential (relative to other Richards-Gebaur sites) for environmental impact were the South Landfill (Site No. 1)
and the Northeast Landfill (Site No. 2). The remaining rated sites (Sites No. 3, 4, 5, 6, 7, 8, and 9) are not considered to
present significant environmental concerns and, therefore, no Phase II work is recommended.

Figure 6, Identified disposal sites, Richards-Gebaur AFB, Missouri

ADA126605_1983

My thoughts

Tables 1 and 7 lists the different waste disposal sites according to priority - these show that the radiactive disposal well is the lowest priority. The priorities range from 4(radioactive disposal well) to 55(south landfill), but the second-lowest priority is 42(west burn pit), so the well is the lowest by far.

It's important to consider that the prioritization considers many factors including that at the time the rating was based on the well being inside the base, as well as being well-contained. Some of the higher-priority sites were off-base or actively spreading, and were a more pressing need.

No mention of following 1972 recommendation for documenting the unremediated site are made.

Lack of mention in "FINAL ENVIRONMENTAL IMPACT STATEMENT DISPOSAL AND REUSE OF RICHARDS-GEBAUR AIR FORCE BASE, MISSOURI", July 1994

Under section 3.3, HAZARDOUS MATERIALS AND HAZARDOUS WASTE MANAGEMENT, the only mention of radioactive dangers is radon gas. The well is no longer present on any maps I have found after this point.

Where is the radioactive waste now?

After CH2M in 1983 did not recommend any action, there is a 10-year gap in my records before it disappears from the 1990s maps. It is possible the well was removed and there is no longer a need to monitor it.

I haven't heard back from the Air Force. I spoke to folks at Belton City Hall, who pointed me to Port of Kansas City, as the well is on their land. They pointed me back to the federal government. I was hoping someone would know when it was removed, or that it was being actively managed or planned for remediation, but of those I was able to reach, everyone seemed surprised to learn it was ever there.

So far everything I've mentioned comes from government sources or their contractors, and references to the source material are listed below. When cycling through the diferent aerial imagery available on Google Earth, there is one thing in that field that doesn't die off in the winter, green up in the spring, doesn't grow, and doesn't move. Here is my only conjecture: if it's still out there, this might be the concrete well cap.

CassCoMapOverlay

References

All of these documents come from public or now-public sources, and I've collected them into a github project along with a KMZ here.

  1. Shane Lakin, Untitled Map, https://www.facebook.com/groups/126159798374/permalink/10160323174818375/a
  2. "Grandview Airport", https://en.wikipedia.org/wiki/Richards-Gebaur_Memorial_Airport, accessed Dec 5, 2023
  3. Air Force "T.O. 00-110A-1", 25 May 1956
  4. USAF Radioisotope Committee, "BURIAL OF RADIOACTIVE WASTE IN THE USAF", 15 March 1972
  5. CH2M Hill, "INSTALLATION RESTORATION PROGRAM RECORDS SEARCH FOR RICHARDS-GEBAUR AIR FORCE BASE, MISSOURI", March 1983, ADA126605
  6. Water and Air Research, Inc. Consulting Environmental Engineers and Scientists, "INSTALLATION RESTORATION PROGRAM For RICHARDS-GEBAUR A. F. B. PHASE II- FIELD EVALUATION", December 1983, ADA137512
  7. HQ AFCEE/ECA, "FINAL ENVIRONMENTAL IMPACT STATEMENT DISPOSAL AND REUSE OF RICHARDS-GEBAUR AIR FORCE BASE, MISSOURI", July 1994, ADA292976
  8. Seagull Environmental Technologies, Inc, "PHASE I ENVIRONMENTAL SITE ASSESSMENT MARKEY BUSINESS PARK SITE, BELTON, MISSOURI", June 24, 2016
  9. Cass County Parcel Viewer, 05-02-03-000-000-001.001, https://cass.missouriassessors.com/parcel.php?pin=05-02-03-000-000-001.001, accessed Nov 2023
  10. Cass County GIS, https://maps.camavision.com/cassmo, 2020 Imagery Layer, accessed Dec 2023
  11. Special thanks to https://www.georgeafb.info/radioactive-waste-timeline/, which covers a similar site in California

Bonus technical content: Map Alignment Challenges

To locate the approximate area of the radioactive waste disposal site, I aligned images of the the available historical waste dump maps into Google Earth, marked the center of the radioactive waste dump and measured the side to be about 100ft, then hid that map layer.

The Air Force Base Maps

Some of these maps come from photocopies of microfilm or microfiche documents and some are difficult to read. A few required adjusting the gamma curve to make them more readable.

The Datum

I cannot say what datum they were in, likely NAD27, but they were not in North-up WGS84, and so required manually aligning map features into Google Earth.

The Taxiway

I had a lot of trouble when first aligning the maps. Eventually I worked out that I could align almost everything except for the taxiway crossing nearest the radioactive waste dump. It is east of the airstrip, but is it north or south of the taxiway crossing? The waste site isn't shown here, but do you see what changed?

1985Aerial 1990Aerial

Eventually I found that the southernmost connecting taxiway moved north about 620 feet sometime between 1985 and 1990. After correcting for this, the Air Force maps aligned nicely on to the aerial photography.

2023 Map

Richards_Gebaur2023

Older