A compiler bug

Written 2023-09-29

Tags:c compiler 

Around 10 years ago, I found a bug in a vendor's compiler.

The Driver

Quite often we needed to read an asynchronous hardware timer. Effectively like hooking a 32-bit parallel timer to a 32-bit input port, just entirely inside the chip.

Because the hardware timer isn't synchronous to the MCU clock, there's a very slim chance that reading the timer will occur just as the timer is incrementing, showing a mix of both old and new bits as the timer increment ripples through. This is a type of hardware race condition.

The hardware vendor said it would be enough to repeatedly read the timer until it returned the same value twice, then that was a safe value to use.

My first pass:

//pointer to memory-mapped hardware register
static const volatile uint32_t * async_timer = (const volatile uint32_t*)0x4003D000;

uint32_t read_timer_simple()
{
    uint32_t x,y;

    do
    {
        x = *async_timer;
        y = *async_timer;
    } while(x!=y);
    return x;
}

This had a slight downside: on the rare mismatch, it would require reading the timer port an even number of times, and the timer port wasn't the fastest to read. So I came up with the following, which takes only one additional timer read on a mismatch:


uint32_t read_timer()
{
    uint32_t x;
    uint32_t y = *async_timer;

    while(1)
    {
        x = y;
        y = *async_timer;
        if(x==y) return x;
    }
}

The good assembly

I no longer have access to the original, so I've run this through GCC 13.2.0, -O2, which roughly matches:

read_timer():
        ldr     r2, .L10 //Load address of hardware register into r2
        ldr     r0, [r2] //load value of timer into r0
.loop:
        mov     r3, r0
        ldr     r0, [r2] //load next value of timer into r0
        cmp     r3, r0   //compare previous and next value of timer
        bne     .loop    //if not equal, go to .loop to read timer again
        bx      lr       //return value in r0
.L10:
        .word   0x4003D000

The compiler upgrade

To support a new microchip with a newer CPU core, we upgraded from proprietary compiler 3.something to proprietary compiler 4.whatnot to gain support for some new instructions. This went generally well, except that every few hours to days, a device would lock up, and the user would restart the device, losing any chance to catch what was going on. Eventually, we were able to catch a live one and hook it up to a debugger.

The bug

read_timer():
        ldr     r2, .L10 //Load address of hardware register into r2
        ldr     r0, [r2] //load value of timer into r0
        mov     r3, r0
        ldr     r0, [r2] //load next value of timer into r0
.loop:
        cmp     r3, r0   //compare previous and next value of timer
        bne     .loop    //if not equal, go to .loop to read timer again
        bx      lr       //return value in r0
.L10:
        .word   0x4003D000

Did you see it? The instructions are almost the same, but the branch target moved slightly. In the common case, where the bus race condition does not occur, the timer is read twice, the values match, and the function returns. On a mismatch, though, the branch loops back to the compare without ever re-reading the timer. I think the compiler treated the repeated volatile read as if it were loop-invariant, even though the source reads the register on every iteration. Effectively, the compiler implemented the following:

uint32_t read_timer_bugged()
{
    uint32_t x = *async_timer;
    uint32_t y = *async_timer;

    if(x!=y)while(1);
    return x;
}

...which of course, locks up the machine whenever the hardware race occurs.

Closing

Once we understood the issue, it was simple enough to reproduce, and the compiler vendor took care of it promptly in a paid upgrade to proprietary compiler 5.next.

Stories of the famous Dr McCracken

Written 2023-09-12

Tags:UMR MST Rolla 

Back around 2008, a few friends and I had a computer architecture class that would change the course of my degree. There were two sections, and we started by designing a simple computer, maybe with a 4-bit ALU, from scratch.

By scratch, I mean that we started with diodes, resistors, and transistors; from those we built gates, latches, registers, address decoders, an ALU, and memory.

The class was difficult, yet highly productive. Eventually, students of the two sections that year found each other, and merged our study groups at night in the library.

Eventually we wrote simple programs for it in machine code, which we would submit to the professor; he would evaluate them, explain any errors or inefficiencies, and return them to us. Up to that class, most of my programming had been on UNIX and Linux systems, so it was strange to find myself writing for a hypothetical machine that couldn't really be tested. At some point I wrote a simple assembler and simulator to help me.

But this post isn't about me; it's about a few of the things the teacher, Ted McCracken, mentioned about his previous career during that semester - namely, that he had survived three airplane wrecks and had built one of the earliest 3D video cards, pushing the limits of display resolution...in the 1980s!

Thinking back on my career, I learned the most about system design in Dr Ted McCracken's class. He was one of the most knowledgeable teachers I had met in my major, but a few of his stories seemed maybe a little far-fetched, or at least unusual. Yet the more I learn, the more it all turns out to be true.

VECTRIX

VECTRIX was a manufacturer of video graphics accelerators; a great resource on these is bitsavers.org. We can also find the business registration for Vectrix of Missouri, a few of their known models, some of their competitors, and the contemporary IBM CGA modes for reference (my family had a CGA machine when I was a child).

One thing that didn't make sense to me, for a time, about some of the VECTRIX units was the framerate and resolution claim.

For many display technologies, pixels need to be periodically refreshed - this is true of phosphor CRTs, most LCDs, and AMOLED and PMOLED displays, though with differing decay rates and corresponding minimum refresh rates to limit flicker. When display refresh is driven from a framebuffer, it's usually known as scanout. It's fairly common to have at least a small amount of dedicated high-speed or dual-ported video memory for the framebuffer(s), both for performance reasons and because scanout has realtime requirements that must be met. For single-ported memories, scanout bandwidth takes away from bandwidth usable for rendering, which is why today's GPU memory tends to be some of the fastest memory in the system.

We can estimate the minimum average memory bandwidth required for scanout like so: columns transmitted * rows transmitted * framerate * framebuffer bits per pixel. And when we do 1024 x 1024 x 60Hz x 12bit, we find that at its maximum spec, the VECTRIX Pepe needs a minimum scanout bandwidth of ~95MB/s. An IBM CGA (Color Graphics Adapter) needs barely 1MB/s. What gives?

That's really an unfair comparison - the Pepe was a high-end device, so we should compare it with something like IBM's Professional Graphics Controller (PGC), at only 18.5 MB/s. The IBM PC/AT 16-bit ISA bus manages only around 8MB/s, a significant limitation if you're flinging pixels over the bus. The more I learned about embedded displays, the more I wondered how this thing could have worked, because the framebuffer bandwidth is several times that of the host system.

The truth is stranger than fiction - both the PGC and the Pepe kept the framebuffer entirely on the card and offloaded vector rendering to the display controller. The IBM PGC supported GKS, while VECTRIX had its own command set and a bit-planed framebuffer.

The original VECTRIX VX series were external units connected to a PC by serial or parallel cable. The command set included instructions for 3D rendering, matrix rotation/translation/scaling, and adjusting the synchronization between rendering and scanout (BLANK MODE vs FLASH MODE), but also arbitrary code execution - the U-command uploads user code to memory location 0x100 on the VX and jumps to it. The hardware registers are documented in the Vectrix Advanced Programming Manual. Eventually VECTRIX shipped their Pepe model, which fit everything onto a twin-slot AT card, removing the serial-port rendering-command bottleneck.

The key trick in attaining the needed framebuffer bandwidth was bit-planing. Rather than byte-addressable RAMs, single-bit-wide RAMs could be wired in parallel, as many as needed, so that framebuffer bandwidth scaled with the number of bit planes, which scales with log2(number of colors). Of course, the IBM PC, XT, and AT couldn't reach this bandwidth themselves, but they didn't need to: they only had to upload commands to the video system over whatever bus was available. A dedicated microprocessor managed the hardware, and you could even upload your own instruction streams to it, like an early proto-shader.

Third Airplane Wreck

In order to more efficiently deliver video cards, Ted said they bought an airplane. At some point, I think while landing, he misjudged the windsock, leading him to land with a tailwind instead of a headwind. After touching down, he realized the airplane wasn't going to stop in time, and throttled back up to take off. However, at the end of the landing strip was a berm with a fence, topped with a power line. He wasn't going to make it over the power lines, so he aimed under them. This almost worked, except that his landing gear caught the fence, pulling up fence posts until the airplane slowed to a stall and fell into a barn.

Around the time my friend Doug K became a pilot, he searched the accident reports for small airports around the area where we went to school. Sure enough, he found the report for this wreck, pretty much as described in class, registered to Video Systems Engineering.

ZFS Pool upgrade

Written 2023-09-10

Tags:ZFS 5big NAS Debian Lacie 

One of the old 500GB disks in my NAS has begun to throw errors.

Buy SSDs

I picked up 6x used PM863 SSDs from eBay. They came with 90%+ of their wear-leveling life remaining, except for one at 85%, which will be my cold spare and backup disk.

Checksum everything

find /media/ -type f -exec md5sum {} \+ >backup.md5sum 2>backup.failure
cat backup.failure #should be empty

zfs-send everything to the cold-spare

First, we need a root pool snapshot. In my case:
zfs snapshot -r tank@backup
Next, we send it to the external SSD that I've mounted at /media/usb, running in screen:
zfs send -vR tank@backup > /media/usb/backup.zfs

pro-tip: make sure you're plugged into the USB3 not the USB2 ports :p

Hardware swap

Take out the old array, put the new array into adapters and install.

recreate new zpool

sudo zpool create tank raidz <disk1> <disk2> <disk3> <disk4> <disk5> #substitute your device paths, e.g. from /dev/disk/by-id/

zfs-recv everything from the cold-spare to the pool

cat /media/usb/backup.zfs | sudo zfs receive -F tank

verify checksums

md5sum --check backup.md5sum
