Sunday, 31 December 2017

Z180 MMu-tiny

There's surprisingly little in the way of clear explanations for Z180's MMU, which is mostly OK because as an 8-bit CPU, the Z180 isn't very popular any more.

So, here's a bit of an explanation. The Z180 is basically a Z80 with some extra built-in peripherals, one of which is a bank-switching MMU which provides the CPU with the ability to manage 1Mb of physical memory.

The MMU itself is very simplistic. It divides the normal 16-bit address space into 3 sections: a section (called Common 0) that's always mapped to the beginning of physical memory; a section (called the Banked Area) which can be mapped anywhere in the 1Mb of physical memory; and a final section (called the Common 1 area) which can also be mapped anywhere in the 1Mb of physical memory.



The two banked areas can be made to start at any 4Kb region in the logical 64Kb memory space; though the Common 1 area should be made to start after the Banked Area.

The MMU is controlled via 3 x 8-bit registers in I/O space, at addresses at 0x38 to 0x3A as follows:

CA and BA are both reset to 0xf, which means that only the top 4Kb is mapped on boot-up, and since BBR and CBR are reset to 0, then this means that the top 4Kb is mapped to 0x0f000, which is good because the MMU can't be switched off.

Zilog (and Hitachi who first implemented the scheme) intended the MMU to operate so that only the Bank Area would move during the lifetime of any Z180 process, which is why the top segment is called "Common Bank 1", rather than, say "Bank Area 2" (or '1' if we'd numbered the Bank Areas from 0).

But this scheme severely restricts the flexibility of the MMU. That's because a flexible banked system needs at least 4 moveable banks for a process. You need a non-moving banked area of data to hold the stack and key global variables. Similarly, you need a non-moving banked area of code to hold key common routines, e.g. the bank switching code or a vector to the OS. Then to be able to use the 1Mb space for larger applications you need a relocatable segment for code and another for data; which means 4 banks in total. The Z180 only provides 2 movable banks.

If I had designed the Z180 MMU I would have chosen a pure bank-switching technique where each virtually addressed 16Kb bank is independently mapped somewhere in the physical address space. What might that look like if we were confined to the equivalent set of registers as for the real Z180 MMU?

Here the registers are called the FBP (Fixed Bank Page), the DBP (the Data Bank page) and the CBP (the Code Bank Page).

Each register is 8-bits and they map the virtual address to a 22-bit virtual address space (providing 4Mb) as follows:

So, the Code and Data bank pages map to the first two 16Kb banks anywhere in the physical address space and the last 32Kb normally maps to a 32Kb bank. There are two use-cases for how we might want to do the mapping. Firstly, there's the general purpose OS case. Here, we assume that it's fairly easy to map fixed code and data to consecutive banks (moving a pre-allocated moveable bank if needed). Hence we can make do with a single bank register for this purpose. The second use-case is for embedded systems where there would be a single application (in ROM). Here, the X bit can be set so that the top 16Kb can be mapped into RAM in the other half of the physical address space.

The logic involved in this scheme would be at least as simple as the original Z180, because no adder is required to compute the virtual Address + the bank register and no comparison logic is needed to determine which segment any particular virtual address belongs to. Instead, the top two virtual address bits directly index the MMU registers.

One additional feature is implemented: a RST xx instruction, including RST 0 - reset, will clear the CBP. This means that it would be possible for an application or process running from banked or the fixed code bank to execute a ROM routine - in rather the same way it works with CP/M.

Thursday, 10 August 2017

Xenophobia on the Virgin Express

I've heard a lot about the growing Brexit xenophobia being nurtured by the likes of the Daily Mail, The Sun and Express, but prior to our holiday trip to the lovely North East town of Seahouses a couple of weeks ago I'd not really witnessed any.

I mean we've seen the video of anti-immigrant or racist abuse on the Manchester Metrolink right?

So, let's get down to the specifics. On Saturday, July 22 we were on the (I believe) Virgin Intercity (though it might have been Cross County) to Newcastle from Birmingham New Street. I think we would be arriving at 14:46 and then catching the next train to Berwick upon Tweed.

Me and my wife had an allocated pair of aisle seats at a table as it happened and there was a couple in the window seats too. I was travelling backwards. Slightly behind me and on the opposite side window seat was a lady - clearly very pregnant - enjoying the ride.

That is, until the conductor came along and asked for her ticket. Instead she presented him with a printed A4 sheet with her booking details including a code for the ticket. She didn't know this wasn't a proper ticket, because she was from Hungary and hadn't been in the country that long.

Instead of being gracious and explaining how she could get her ticket, the ticket inspector started to get really bolshy with her; threatening her with a fine or throwing her off the train - but she didn't have any real money with her either, her partner was the one who had actually bought the ticket online.

So, I said to the inspector it was obviously an innocent mistake so couldn't she get it sorted out at the end of the journey? To which he spouted "They wouldn't allow that where she comes from!" and stormed off down the carriage like a true Daily Express reader!

Me and my wife did a double-take. So, did the couple next to us at the window side. Then between us we all did a double-double take. Then we chatted with the woman for a few minutes and explained how she should be able to sort it out when she gets off the train by going to Information.

And the post-script for any xenophobe blog readers: I've no idea if they'd immediately chuck you off a train in Hungary for only having the booking information instead of an actual ticket, but who cares? Why should we define our behaviour by the worst we might imagine about another country's? To me it just sounds like a mirror image of your own inner thoughts and you can't fix that by targeting Eastern Europeans.

Saturday, 1 April 2017

Sign up to Signed Timeouts

I'm Julian Skidmore, the developer of the DIY 8-bit computer FIGnition. Most of my career has been spent doing embedded software, and timeouts have become a major bugbear of mine.

As I see it, the typical timeout mechanisms used on pretty much every low-end embedded system I've seen are terrible and most people probably don't even realise they are, so let's swallow the red pill and embrace a new way of doing them.

Let's start with typical timeout parameters for low-end systems.

On 8-bit and 16-bit embedded systems 16-bit jiffy clocks are often used, because we still want to economise on resources even in the 21st century. They typically have a 1KHz frequency (a millisecond timer) which means they wrap around every 64s, and this has to be carefully considered when calculating timeouts.

On ARMs and the arduino library and undoubtably many other systems, 32-bit jiffy clocks are often used which extends the range to 4096000 seconds, which is roughly 1.5 months; and so for long-running applications; wrap-around problems apply to them as well.

The Evolution Of Timeouts

Embedded programmers usually approach timeouts in a progressive way, from naive implementations to ones that make better use of system resources over time. A programmer might start off by implementing countdown timeouts in the polled part of the mainloop.

However, they then find out that the main loop isn't polled deterministically, so they move the countdowns to an interrupt routine, which does the same thing.



A timeout of this kind is a form of Signal. The timeout starts at 0, which signals to the consumer (the main loop) that a timeout has taken place (or not yet started) and to the producer (the interrupt routine), that it can't decrement it. The consumer then sets the timeout to the number of jiffy ticks desired and this signals to the consumer that it's not going to write to it until it's complete, and signals to the producer that it should decrement the countdown until it gets to 0 (so now the producer has possession of the countdown). The consumer then monitors the countdown until it reaches 0; whereupon it regains possession.

These forms of timeouts have two main strong points. Firstly, they're really simple to implement (just an assignment, a decrement, and a couple of if statements), and secondly, they have an indefinite polling window (the consumer can poll a 1ms countdown after an hour and it'll still report that it's complete). Unfortunately, they soak up CPU time for every countdown timer, even if the timer isn't running and they make it harder to write modular code (because you have to hardcode in the countdown to the interrupt, unless you use a data structure for your set of countdowns).

So, programmers then generally move onto timeouts based on simple comparisons. For example, they'll set the timeout expiry to the current time in milliseconds and then keep polling the milliseconds to see if it's passed the timeout:

... // The timeout is set up,

gMyTimeout=millis()+aPeriod;

... // then later in a main loop, we keep testing it:

if(millis()>=gMyTimeout) {
    // The timeout has expired.
}

The first timeout code is always constructed like this, because we see time as being linear. But of course it's buggy, because eventually millis() will get to its maximum, or near it, and when you add aPeriod, it wraps around to 0, so the first test for millis()>=gMyTimeout will think it's expired when it hasn't.

So, the programmer does the logical thing and incorporates the 'wrapping' issue into the algorithm by adding a flag to determine if the timeout wrapped around. Then we get something like:

gMyTimeout=millis()+aPeriod;
if(millis()>gMyTimeout)
    gMyTimeoutWrap=TRUE;
else
    gMyTimeoutWrap=FALSE;

...//
if(millis()>=gMyTimeout && gMyTimeoutWrap==FALSE ||

    millis()<=gMyTimeout && gMyTimeoutWrap==TRUE) {
    millis()
    // The Timeout flag has expired.
}


Except this doesn't work properly either, because if, for example, gMyTimeout is close to the maximum value, then there's only a small window where millis()>=gMyTimeout even though there was no wrap and similarly, in the wrapping case, there's only a small window where millis()<=gMyTimeout.


There are similar problems if, for example we just have the first test, and a different test for when the timeout wrap is TRUE, e.g:

if(gMyTimeoutWrap==TRUE) {
    if(millis()<gMyTimeoutWrap)

        gMyTimeoutWrap=FALSE; // we've wrapped!
}
else if(millis()>=gMyTimeout){
   // We've expired!
}


The code must be polled often enough for the condition to be met to reset the wrap flag.

A this point, assuming that the embedded programmer actually becomes aware of these issues, they'll realise that a simple concept like a timeout is far more complex than they'd imagined.

They may then try another common pattern. Instead of adding aPeriod to the current time to obtain the timeout time, both the aPeriod and the original time are maintained. We then compute a timeout as follows:

gMyTimeout=millis();
gMyTimeoutPeriod=aPeriod;

.../// and in the main loop
if(millis()-gMyTimeout>=gMyTimeoutPeriod) {
  .. // We've timed out.
}


In this code we solve the major issues with the previous code: by subtracting the initial timeout set time from the current time we end up calculating the relative time since gMyTimeout. It will always start out at 0 and will eventually become >= gMyTimeoutPeriod. At last we have code that isn't actually buggy!

But it's poor, very poor, because the timeout must be polled every gMyTimeoutPeriod milliseconds or the window can be missed. For example, a 1ms period means the polling must take place within a millisecond or it will look like the timeout won't expire for another 65535ms.


We ought to find a better way, and this is how:

GROT: The Golden Rules Of Timeouts

The First Golden Rule Of Timeouts is that for any circular jiffy time of n bits you always need at least n+1 bits of storage to maintain a timeout.

This is true in the first case, we had n-bits for millis() and the flag represented an extra bit of storage (assuming we managed to correct the issue). It's also true for the second case; we had n-bits for millis() and since both a timeout and the aPeriod were represented, we needed 2n bits of storage.

This rule is a direct consequence of the requirement we must have for timeouts, if a jiffy clock has a period of x before wrapping, then we need to be able to represent a time period of up to 2*x to determine a timeout:




This also means that because we have at least n+1 bits of storage for the timeout, we must perform at least n+1 bit arithmetic to calculate timeouts. Again, we can see that's true in both cases: in the first case we have some explicit code to extend the arithmetic by an extra bit and in the second case we must perform an initial subtraction (millis()-gMyTimeout) and then another subtraction will be performed for the comparison.

The second golden rule is this: timeouts must be computed as relative quantities if the jiffy clock wraps around. The primary reason why timeout calculations are buggy in the first case even when the extra bit is added, is because absolute comparisons are made instead of relative calculations. In the second algorithm, when we compute millis()-gMyTimeout, we actually compute a relative value. For example, when gMyTimeout is initially a high value, e.g. 0xff01, then when millis()==0, the calculation will be 0-0xff01 which is 0xfe, i.e. 254 milliseconds later.

Now the interesting thing about that calculation is that we're no longer strictly using unsigned arithmetic here, because 0 is much less than 0xff01 so the result is a negative number whose lower 16-bits result in a small positive number. So, what you have is a maths error, which happens to deliver the right results. In some (often older) architectures it's possible to trap on overflow; and this is what would happen here, a timeout would lead to a system error.

Signed Timeouts Are Sweet Timeouts

The observation that timeouts are relative leads neatly to this insight: timeouts should be signed, as should the jiffy clock.

It's counter-intuitive, but simply by defining, say the jiffy clock as:

int16_t gJiffy;

Incrementing it in the timer tick routine (let's assume it's 1KHz):

ISR(TimerTick) {
    // ClearTimerTickInterruptFlag
    ++gJiffy;
}

Let's say millis() is a int16_t:

int16_t millis() AtomicRead(gJiffy)

Defining a timeout as:

int16_t gMyTimeout=millis()+aPeriod;

And then to test for a timeout with:

if(millis()-gMyTimeout<0) {
    ..// We've timed out.
}

Solves every problem we've found with timeouts so far; does it neatly and cleanly; and maximises the timeout period, in this case to n-1 bits, i.e. a range of 32s when the whole jiffy clock period will be 64s.

Math Madness

Why is it that the timeout calculation is expressed as millis()-gMyTimeout<0 instead of millis()<gMyTimeout when surely they're the same thing? In basic algebra we're taught that:

a<b <=> a-b<0

You simply subtract (or add) b at both sides to derive the equivalent equation. The answer is that it's only true on an infinite number line: for modulo arithmetic, both unsigned and signed, there are cases where they are not equivalent.


a b a-b signed(a<b) a-b<0 unsigned(a<b)
0x00000x0001 0xffff True True True
0xfffe0x0001 0xfffd True True False
0x80000x7fff 0x0001 True False False
0x7fff0x8000 0xffff False True True
0x00010xfffe 0x0003 False False True
0x00010x0000 0xffff False False False

This is because when you take two quantities in the range 0..x and subtract one from the other, the total range of values you can get on an infinite number line is -x to x; and therefore if you can only represent x different values, then the same values must be re-used. This explains why, when you use unsigned values for timeouts, you get a large polling window if a is very low and b is really high, but as a gets higher, the polling window becomes increasingly squeezed.

It also explains why if we want to maximise the efficiency for timeouts, we should maximise (and standardise) the polling interval, and the way to do that is to limit the timeout range to maxTimeout=x/2, and then the range of results will be -x/2 to x/2.

The Intuitive Angle

The upshot is this: using signed timeouts is the cleanest solution out. It works, because when we subtract the current time from the timeout, we obtain the timeout relative to now: a positive number if it's in the future (it hasn't timed out) or a negative number if it's in the past (it has timed out). The window is the maximum theoretically possible, because given two arbitrary times, the only way to tell if a value has timed out is if you guarantee your timeouts and polling period are within half the maximum range, 180ยบ of your data type.



Conclusion

There are many ways in which software engineers perform timeouts, but only signed timeouts are clean and efficient in storage, computation and code size. We should use them as standard.

Saturday, 18 March 2017

New Record Low Arctic SIE Maximum Reached

The Arctic Sea Ice is dying. We've known this since the mid-1990s from satellite measurements available since the end of the 1970s and there is some pre-satellite evidence to show that it has been in decline since the 1960s via the early environmentalist Rachel Carson.

However, since the mid-2000s it's been accelerating. Normally the big news has been with the Sea Ice Extent minimum reached in September, but recently the decline in the Sea Ice Extent maximum in March is becoming increasingly concerning.

This year we have reached a new record low Arctic SIE Maximum, about 40,000Km2 lower than the previous maximum reached in 2015. This is after 6 months in 2016 where the Arctic SIE was at record low levels and even this year it has spent about 30% of the time in record low territory over and above (or should it be under and below) the record lows over that period in 2016.

The record itself was reached near the beginning of March (March 06), but because the extent can vary quite significantly up and down at the maximum point, it's not safe to call the maximum until it can be reasonably known that it's peak will be exceeded.

That point has been reached, the current extent reached 13.61mKm2 as of March 16 and there is no year from the year 2000 to 2016 where SIE has risen by more than the 270,000Km2 that would be required for 2017 to break its current peak.

Here's the graphic.


Wednesday, 8 March 2017

uxForth: Unexpanded forth for a standard VIC-20. Part 3, the memory map

I'm the developer of the DIY 8-bit computer FIGnition, but it doesn't mean I'm not interested in other retro computers and the idea of developing a minimal Forth for the ancient, but cute Commodore VIC-20 is irresistable!

Part 1 talks about the appeal of the VIC-20 and what a squeeze it will be to fit Forth into it's meagre RAM.

In Part 2 I discussed choices for the inner interpreter and found out that a token Forth model could be both compact and about as fast as DTC.

Now I'm going to allocate the various parts of the Forth system to VIC-20 memory to make the best of what's there. Some of it will be fairly conventional and some somewhat unorthodox.

(An Aside, The Slow uxForth Development Process)

From the presentation of the blog entries it looks like I'm working these things out as I'm going along. For example, it's worthwhile asking why it looks like I can leap to fairly concrete decisions about the inner interpreter or even that I think I'll be able to fit the entire system into the available space.

The simple answer is that I've already done much of the work to make this possible. I've already written the code that implements the primitives (in fact I've written, modified and rewritten it a few times as I've improved it). I've made use of the wonderful resources at 6502 org, particularly the idea of splitting the instruction pointer (called gIp in my implementation) into a page offset and using the Y register to hold the byte offset: it really does improve the performance of the core Next function.

Similarly, I've written the non-primitive code and accounted for the space. It's written in Forth with a home-brew meta-forth compiler written in 'C'. So, there will be a future blog on that too!

However, it's not a cheat as such. The code is not tested yet; nor even loaded into a real VIC-20 nor emulator (I don't have a real VIC-20 :-( ). I have real decisions to make as the blog continues, which means I can make real mistakes too and have to correct them. What I've done, really, is basically a feasibility study, so that you don't waste your time reading the blog. And of course, the whole of uxForth will be released publicly, on a GPL licence via my GitHub account.

Admittedly, it's being released slowly, a 2.75Kb program I hope to release over the course of 2017!

The Memory Map

Page 0

Page 0 is the gold dust of every 6502 system: versatile and in short supply. BASIC uses the first 0x90 bytes and the KERNAL uses the rest. We'll use all 0x90 bytes for the data stack and some key system variables:


Addr Size Name Comment
$00 2 gIp Instruction pointer, lower byte always 0.
$02 1 gTmpLo Temporary byte
$03 1 gTmpHi Temporary byte used for indirect access.
$04 2 gILimit The limit for the inner-most do.. loop. uxForth (and FIGnition Forth) differ from most Forths in that the inner most loops values, the limit and the current value are held in global locations. do causes the previous gILimit and gCurrent to be pushed to the stack; thus r is equivalent to j on other forths.
$06 2 gICount The current loop count for the inner-most do.. loop.
$08 1 gUpState The current compilation state.
$09 1 gUpBase The current number base
$0a 2 gUpDp The current dictionary pointer.
$0c 2 gUpLast A pointer to the header of the most recent dictionary entry compiled
$0e 2 gUpTib The pointer to the input buffer (I'm not sure if we need this)
$10 128 gDs The data stack
$fb 2 gTmpPtr0 Spare pointer 0
$fd 2 gTmpPtr1 Spare pointer 1

Page 1

Page 1 is the return stack as you might expect. Oddly enough, we only get 192b, because the KERNAL uses $100 to $13F.

Page 2

There are 89 bytes available here, because they're used by BASIC. I plan to use them for the byte code vectors which are:

# Name # Name # Name # Name
$00 (nop) $0b (+loop) $16 u/ $21 rp!
$01 ;s $0c 0< $17 @ $22 drop
$02 exec $0d 0= $18 c@ $23 dup
$03 (native) $0e + $19 ! $24 over
$04 (lit8) $0f neg $1a c! $25 swap
$05 (lit16) $10 and $1b r> $26 (vardoes)
$06 0 $11 or $1c >r $27 (constdoes)
$07 (0branch) $12 xor $1d r $28 inkey
$08 (branch) $13 >> $1e sp@ $29 emit
$09 (do) $14 << $1f sp! $2a at
$0a (loop) $15 * $20 rp@ $2b

The codes that are greyed out have no names in the dictionary to save space; the way you'd insert them into code would be with [ nn c, ] sequences.

Page 3 and Page 4

There are a total of 116 bytes free from $2A0 to $313, I'll fill that area with some of the actual native definitions.

The cassette buffer is at $33c to $3fb. We'll be using the cassette for storage so we can't use it for code. 

Pages 16 to 31 ish ($1000 to $1dff)

This is the area of RAM reserved for BASIC. It will contain the rest of the Forth system.

The screen RAM ($1e00 to $1ff9)

The end of RAM for an unexpanded VIC-20 is used for the screen. The plan here is to use that area for the editing space.  Instead of implementing a line editor (ACCEPT in FIG-forth and early FIGnition Forth), we use key to call the KERNAL editor and allow it to manage the editing of a line including cursor movement. Pressing Return doesn't execute the command line, instead, pressing F1 exits the editor and sets the interpretation point to the current cursor position. The end of the interpretation point is set to the end of the screen and emit is turned off until interpretation gets to the end of the screen. Importantly, pressing return doesn't start interpretation.

In addition, pressing F2 saves the screen bytes onto cassette.

This is how I'll implement storage in a fairly minimal way. By implementing save via F2 I can save a block (actually the 506 screen bytes are roughly half a traditional block), but LOAD is a normal word, so multiple blocks can be loaded (you just add load to the end of the block).

So, this is how you'd do normal editing operations. For normal words you would place the cursor near the end of the screen and edit to the end of the screen; cursor to return to the first character you want to interpret and then press F1. In a sense this is easy, because you can just press Return and then cursor up until you get there. The same method would also work if you wanted to compile a whole screen's worth of code. Load itself would reset the cursor position to [home] and then return to the interpreter, so placing a load at the end of the screen would load the next screen without any recursion. That way you'd be able to develop programs that were longer than just one screen without manual reloading.

Conclusion

In the memory allocation of uxForth, we've squirrelled away about 1053 bytes of RAM, embedding the line buffer in the screen and a number of system variables in page 0. We've also included 212 bytes of what we'd use for the program proper. It won't get much better than this!

In the next post I hope to talk in more detail about the implementation of the primitive words and the code used to test them.

Sunday, 12 February 2017

Arctic Brain Freeze

Another Geo-engineering project is doing the rounds.

https://www.theguardian.com/world/2017/feb/12/plan-to-refreeze-arctic-before-ice-goes-for-good-climate-change

In my opinion, it's not such a good idea, so sorry to dump on the scientist. There's grave concern for the arctic, I track it via the Jaxa arctic sea-ice extent vishop web page on a daily basis. It's scary.


Steven Desch's plan is still a bit like pumping harder on the gas to provide the power to apply some brakes, because the 1 million wind pumps would have to be currently made with fossil fuel industry power. I imagine, that because they're not generating electricity, just pumping cold, salty water up; they won't deplete the rare earth metals used to make electric wind-turbines. So, it's not as bad as it might be.


But, 1 million wind turbines: for electricity turbines that's enough to power 2 billion homes at a cost of £2000bn (though economy of scale would bring that down).

It's disappointing that he says: "“Our only strategy at present seems to be to tell people to stop burning fossil fuels.. It’s a good idea..." What he should be saying is "that is of course essential." Because it's not just a good idea, we have to get our emissions down to zero (and rapidly, and then go negative!), because CO2 will hang in the atmosphere for 100s to 1000s of years. Prioritising CO2 over arctic geoengineering is wiser, because it'll provide centuries of better breathing space (quite literally); whereas this, like all geoengineering projects encourages BAU and creates a maintenance issue: you're going to have to replace 5% of them every year: that's 50,000 wind pumps per year.

In addition, because global warming would get worse, the effort to keep the arctic frozen would increase. That's because of warm waters entering from the Atlantic and Pacific as well as warm air from the rest of the planet, pressures on the jetstream and higher radiative forcing due to greater CO2 in the atmosphere. These things wouldn't change and all of them are a product of burning FF.

The question though is that prioritising this would probably deplete renewable energy efforts, because the companies used to build turbine blades would first be diverted to build these things.

The definitive lay perspective on Geo-engineering, I think is the chapter on it in Naomi Klein's book "This Changes Everything."