Saturday 1 April 2017

Sign up to Signed Timeouts

I'm Julian Skidmore, the developer of the DIY 8-bit computer FIGnition. Most of my career has been spent doing embedded software, and timeouts have become a major bugbear of mine.

As I see it, the typical timeout mechanisms used on pretty much every low-end embedded system I've seen are terrible and most people probably don't even realise they are, so let's swallow the red pill and embrace a new way of doing them.

Let's start with typical timeout parameters for low-end systems.

On 8-bit and 16-bit embedded systems 16-bit jiffy clocks are often used, because we still want to economise on resources even in the 21st century. They typically have a 1KHz frequency (a millisecond timer) which means they wrap around every 64s, and this has to be carefully considered when calculating timeouts.

On ARMs and the arduino library and undoubtably many other systems, 32-bit jiffy clocks are often used which extends the range to 4096000 seconds, which is roughly 1.5 months; and so for long-running applications; wrap-around problems apply to them as well.

The Evolution Of Timeouts

Embedded programmers usually approach timeouts in a progressive way, from naive implementations to ones that make better use of system resources over time. A programmer might start off by implementing countdown timeouts in the polled part of the mainloop.

However, they then find out that the main loop isn't polled deterministically, so they move the countdowns to an interrupt routine, which does the same thing.



A timeout of this kind is a form of Signal. The timeout starts at 0, which signals to the consumer (the main loop) that a timeout has taken place (or not yet started) and to the producer (the interrupt routine), that it can't decrement it. The consumer then sets the timeout to the number of jiffy ticks desired and this signals to the consumer that it's not going to write to it until it's complete, and signals to the producer that it should decrement the countdown until it gets to 0 (so now the producer has possession of the countdown). The consumer then monitors the countdown until it reaches 0; whereupon it regains possession.

These forms of timeouts have two main strong points. Firstly, they're really simple to implement (just an assignment, a decrement, and a couple of if statements), and secondly, they have an indefinite polling window (the consumer can poll a 1ms countdown after an hour and it'll still report that it's complete). Unfortunately, they soak up CPU time for every countdown timer, even if the timer isn't running and they make it harder to write modular code (because you have to hardcode in the countdown to the interrupt, unless you use a data structure for your set of countdowns).

So, programmers then generally move onto timeouts based on simple comparisons. For example, they'll set the timeout expiry to the current time in milliseconds and then keep polling the milliseconds to see if it's passed the timeout:

... // The timeout is set up,

gMyTimeout=millis()+aPeriod;

... // then later in a main loop, we keep testing it:

if(millis()>=gMyTimeout) {
    // The timeout has expired.
}

The first timeout code is always constructed like this, because we see time as being linear. But of course it's buggy, because eventually millis() will get to its maximum, or near it, and when you add aPeriod, it wraps around to 0, so the first test for millis()>=gMyTimeout will think it's expired when it hasn't.

So, the programmer does the logical thing and incorporates the 'wrapping' issue into the algorithm by adding a flag to determine if the timeout wrapped around. Then we get something like:

gMyTimeout=millis()+aPeriod;
if(millis()>gMyTimeout)
    gMyTimeoutWrap=TRUE;
else
    gMyTimeoutWrap=FALSE;

...//
if(millis()>=gMyTimeout && gMyTimeoutWrap==FALSE ||

    millis()<=gMyTimeout && gMyTimeoutWrap==TRUE) {
    millis()
    // The Timeout flag has expired.
}


Except this doesn't work properly either, because if, for example, gMyTimeout is close to the maximum value, then there's only a small window where millis()>=gMyTimeout even though there was no wrap and similarly, in the wrapping case, there's only a small window where millis()<=gMyTimeout.


There are similar problems if, for example we just have the first test, and a different test for when the timeout wrap is TRUE, e.g:

if(gMyTimeoutWrap==TRUE) {
    if(millis()<gMyTimeoutWrap)

        gMyTimeoutWrap=FALSE; // we've wrapped!
}
else if(millis()>=gMyTimeout){
   // We've expired!
}


The code must be polled often enough for the condition to be met to reset the wrap flag.

A this point, assuming that the embedded programmer actually becomes aware of these issues, they'll realise that a simple concept like a timeout is far more complex than they'd imagined.

They may then try another common pattern. Instead of adding aPeriod to the current time to obtain the timeout time, both the aPeriod and the original time are maintained. We then compute a timeout as follows:

gMyTimeout=millis();
gMyTimeoutPeriod=aPeriod;

.../// and in the main loop
if(millis()-gMyTimeout>=gMyTimeoutPeriod) {
  .. // We've timed out.
}


In this code we solve the major issues with the previous code: by subtracting the initial timeout set time from the current time we end up calculating the relative time since gMyTimeout. It will always start out at 0 and will eventually become >= gMyTimeoutPeriod. At last we have code that isn't actually buggy!

But it's poor, very poor, because the timeout must be polled every gMyTimeoutPeriod milliseconds or the window can be missed. For example, a 1ms period means the polling must take place within a millisecond or it will look like the timeout won't expire for another 65535ms.


We ought to find a better way, and this is how:

GROT: The Golden Rules Of Timeouts

The First Golden Rule Of Timeouts is that for any circular jiffy time of n bits you always need at least n+1 bits of storage to maintain a timeout.

This is true in the first case, we had n-bits for millis() and the flag represented an extra bit of storage (assuming we managed to correct the issue). It's also true for the second case; we had n-bits for millis() and since both a timeout and the aPeriod were represented, we needed 2n bits of storage.

This rule is a direct consequence of the requirement we must have for timeouts, if a jiffy clock has a period of x before wrapping, then we need to be able to represent a time period of up to 2*x to determine a timeout:




This also means that because we have at least n+1 bits of storage for the timeout, we must perform at least n+1 bit arithmetic to calculate timeouts. Again, we can see that's true in both cases: in the first case we have some explicit code to extend the arithmetic by an extra bit and in the second case we must perform an initial subtraction (millis()-gMyTimeout) and then another subtraction will be performed for the comparison.

The second golden rule is this: timeouts must be computed as relative quantities if the jiffy clock wraps around. The primary reason why timeout calculations are buggy in the first case even when the extra bit is added, is because absolute comparisons are made instead of relative calculations. In the second algorithm, when we compute millis()-gMyTimeout, we actually compute a relative value. For example, when gMyTimeout is initially a high value, e.g. 0xff01, then when millis()==0, the calculation will be 0-0xff01 which is 0xfe, i.e. 254 milliseconds later.

Now the interesting thing about that calculation is that we're no longer strictly using unsigned arithmetic here, because 0 is much less than 0xff01 so the result is a negative number whose lower 16-bits result in a small positive number. So, what you have is a maths error, which happens to deliver the right results. In some (often older) architectures it's possible to trap on overflow; and this is what would happen here, a timeout would lead to a system error.

Signed Timeouts Are Sweet Timeouts

The observation that timeouts are relative leads neatly to this insight: timeouts should be signed, as should the jiffy clock.

It's counter-intuitive, but simply by defining, say the jiffy clock as:

int16_t gJiffy;

Incrementing it in the timer tick routine (let's assume it's 1KHz):

ISR(TimerTick) {
    // ClearTimerTickInterruptFlag
    ++gJiffy;
}

Let's say millis() is a int16_t:

int16_t millis() AtomicRead(gJiffy)

Defining a timeout as:

int16_t gMyTimeout=millis()+aPeriod;

And then to test for a timeout with:

if(millis()-gMyTimeout<0) {
    ..// We've timed out.
}

Solves every problem we've found with timeouts so far; does it neatly and cleanly; and maximises the timeout period, in this case to n-1 bits, i.e. a range of 32s when the whole jiffy clock period will be 64s.

Math Madness

Why is it that the timeout calculation is expressed as millis()-gMyTimeout<0 instead of millis()<gMyTimeout when surely they're the same thing? In basic algebra we're taught that:

a<b <=> a-b<0

You simply subtract (or add) b at both sides to derive the equivalent equation. The answer is that it's only true on an infinite number line: for modulo arithmetic, both unsigned and signed, there are cases where they are not equivalent.


a b a-b signed(a<b) a-b<0 unsigned(a<b)
0x00000x0001 0xffff True True True
0xfffe0x0001 0xfffd True True False
0x80000x7fff 0x0001 True False False
0x7fff0x8000 0xffff False True True
0x00010xfffe 0x0003 False False True
0x00010x0000 0xffff False False False

This is because when you take two quantities in the range 0..x and subtract one from the other, the total range of values you can get on an infinite number line is -x to x; and therefore if you can only represent x different values, then the same values must be re-used. This explains why, when you use unsigned values for timeouts, you get a large polling window if a is very low and b is really high, but as a gets higher, the polling window becomes increasingly squeezed.

It also explains why if we want to maximise the efficiency for timeouts, we should maximise (and standardise) the polling interval, and the way to do that is to limit the timeout range to maxTimeout=x/2, and then the range of results will be -x/2 to x/2.

The Intuitive Angle

The upshot is this: using signed timeouts is the cleanest solution out. It works, because when we subtract the current time from the timeout, we obtain the timeout relative to now: a positive number if it's in the future (it hasn't timed out) or a negative number if it's in the past (it has timed out). The window is the maximum theoretically possible, because given two arbitrary times, the only way to tell if a value has timed out is if you guarantee your timeouts and polling period are within half the maximum range, 180ยบ of your data type.



Conclusion

There are many ways in which software engineers perform timeouts, but only signed timeouts are clean and efficient in storage, computation and code size. We should use them as standard.