Sunday, 6 November 2016

uxForth: Unexpanded forth for a standard VIC-20. Part 2, the inner interpreter.

I'm the developer of the DIY 8-bit computer FIGnition, but it doesn't mean I'm not interested in other retro computers and the idea of developing a minimal Forth for the ancient, but cute Commodore VIC-20 is irresistable!

In the first part I talked about the appeal of the VIC-20 and how much usable RAM I thought I could squeeze out of it.

That turned out to be between 3947 bytes and 4646 bytes depending on whether we count the screen and the CPU stack. And this sounded more credible, except that I want at least 1Kb of RAM for user programs which brings me back to 2923 to 3622 bytes. A terrible squeeze after all.

There's one obvious way to tackle that: use the Token Forth model. A definitive articles covering all the trade-offs with developing Forth are in the series "Moving Forth" by Brad Rodriguez, but here, we just need to recap on the most popular Forth models.

Forth Execution Models

Forth normally represents its programs as lists of addresses which eventually point to machine code. The mechanism is handled by the inner Forth interpreter called "Next". The traditional Forth model implements what's called Indirect Threaded Code.



Here, each forth command (in blue) points to an indirect address (in green) which points to some machine code (in pink). Primitive commands in Forth (like DUP, >R, SWAP and R> here) have an indirect address which points to the next RAM location where the machine code starts. But commands written in Forth itself (like ROT) start with an indirect address which points to ENTER which implements a function call and is then followed by more Forth commands (in blue). A Forth command like this then ends in EXIT, which returns Forth execution to the next calling function (MYFUNC).

The next Forth model to consider is Direct Threaded Code. Here's the same thing:


Here, every forth command (in blue) points directly to machine code (in pink). Primitive commands are executed directly, but commands written in Forth itself (like ROT) start with a "JSR Enter" machine code instruction which saves the return address (to Forth code) on the normal stack and in the DTC Forth, this return address is used as the new Forth Instruction Pointer after pushing the old IP. We can see that DTC will normally be faster than ITC because there's less indirection.

Token Threaded Forth is essentially a byte-coded Forth, except that in the case of commands written in Forth itself, the NEXT routine uses the top bit of the token to denote an address. Thus, only a maximum of 128 tokens can be supported and only 32Kb of Forth code.


In this example, we can see that the Forth code has been reduced from 14 bytes to 8b, but there is a jump table of addresses which is the same size as the indirect entries in ITC (10b used for these entries). DTC used an additional JSR (3 bytes) for the ':' defined word, but TTC didn't need any extra bytes for the ':' definition (it uses a single bit, encoded in the $93A0 address). Here, the overhead of ITC weighs in at 24 bytes, TTC weighs in at 18 bytes and DTC weighs in at 17 bytes.

We can see that TTC could significantly reduce the size of Forth code if the forth tokens are used often enough, but traditionally a byte-coded interpreter is slower than a threaded code interpreter. uxForth won't beat a DTC Forth, so the question is whether it can compete with an ITC Forth.

Execution Timings

ITC Forth:

NEXT      LDY #1
          LDA (IP),Y ;Fetch
          STA W+1    ;Indirect
          DEY        ;Addr
          LDA (IP),Y ;to
          STA W      ;W
          CLC        ;Inc
          LDA IP
          ADC #2
          STA IP     ;IP.lo
          BCC L54    ;If CY,
          INC IP+1   ;inc IP.hi
L54       JMP IndW
IndW:  .byte $6c ;JMP () opcode
W         .word 0

This is the implementation from the original 6502 FIG Forth. It uses zero-page for IP and W. The indirection is achieved by jumping to an indirect Jump.  It requires 41 cycles.

DTC Forth

NEXT      LDY #1
          LDA (IP),Y ;Fetch
          STA W+1    ;Indirect
          DEY        ;Addr
          LDA (IP),Y ;to
          STA W      ;W
          CLC        ;Inc
          LDA IP
          ADC #2
          STA IP     ;IP.lo
          BCC L54    ;If CY,
          INC IP+1   ;inc IP.hi
L54       JMP (W)
W         .word 0

This is a simple derivation from the original 6502 FIG Forth. As before it uses zero-page for IP and W. The indirection is achieved using a simple indirection.  It requires 36 cycles.

UxForth

Next:
lda (gIp),Y ;byte code.
asla ;*2 for gVecs index
                        ;Also 'Enter' bit in Carry
iny ;inc gIp.lo
beq Next10 ;page inc?
;no page inc, fall-through.
Next5:
bcc Enter ;Handle Enter.
Next7:
sta Next7+4 ;modify following Jmp.
jmp (gVecs) ;exec byte code.
Next10:
inc gIp+1 ;inc page
bcs Next7 ;now handle token/enter.

This is the proposed UxForth implementation. The UxForth version has to handle both multiplying the token by 2 to get the index for the jump table (gVecs) and testing to see if it's a call to another Forth routine (bcc Enter). It requires 22 cycles, so we can see that it's almost twice as fast as the ITC version. This is because it has one natural advantage and uses several techniques to improve the speed:

  • Y is used to hold the low byte of IP, thus when we execute lda (gIP),Y , only the upper byte of gIP is used, the lower byte is always 0.
  • Branches are arranged so that the common case is the fall-through case. Thus when IP increments over a page boundary two jumps are needed.
  • We normally only have to read one instruction byte instead of two. This is the one natural advantage TTC has over ITC or DTC.
  • The vector is stored directly in the code (the second byte of jmp (gVecs) ).
18b vs 26b for ITC Forth and 25b for DTC forth. It's possible to use most of these techniques to improve the speed of ITC and DTC 6502 Forth, but I'm not so concerned about that, because the easiest to access VIC-20 Forth is the Datatronic Forth (which is an 8Kb ROM cartridge) and Datatronic Forth uses exactly the same version of NEXT as FIG-Forth.


Conclusion

RAM is still very tight, but we can reduce its usage by implementing a byte-coded Forth and we should find it's perhaps up to twice as fast as a traditional FIG-FORTH implementation.

In the next post we'll look at how we might map our Forth implementation to the available RAM regions!

Thursday, 3 November 2016

uxForth: Unexpanded Forth For A Standard VIC-20

I'm the developer of the DIY 8-bit computer FIGnition, but it doesn't mean I'm not interested in other retro computers and as far as it goes, the Commodore VIC-20 is one of the cutest to come out of the 1980 stables.

The VIC-20 was cute, because it had a combination of fun and dumb features. Like: a full quality 65 key keyboard - and only two cursor keys!


Or the ability to support business applications with a floppy disk drive, but only having 23 column text. Or multi-colour graphics (and even a 2-bit per pixel mode that can co-exist with a 1-bit per pixel mode) with a near complete lack of support for bitmapped graphics. Or it's 16Kb of ROM and only 5Kb of RAM (with just 3582bytes free when Basic boots).

So, the fun challenge here is to see how much of a Forth I can squeeze into the Basic, unexpanded VIC-20 given the RAM limitations. I'm pretty confident I can do this, given that a super-tiny Forth subset has been crammed into just 1Kb of 8086 code (itsy Forth). I'm aiming for something that's kinda more usable.

Dude, Where's My RAM?

The first step (and this is the topic of this blog) is to find out how much RAM we can really use. A VIC-20 boots up and proudly displays: 

But it actually has 5Kb, so where has that other 1.5Kb gone? Armed with a detailed VIC-20 memory map we can see that areas of the first 1Kb have been nicked by Basic and the Kernal, which is a set of OS services abstracted from Basic and forms part of the ROM. For our purposes we don't want to use Basic, but we do want to use the Kernal, so we can read the keyboard, display to the screen and input/output between peripherals. For some of this 1Kb it's obvious which is used by Basic, but not all. So, here I decided to use the VIC-20 ROM disassembly. I first worked out that the Kernal starts at the address $E475, or thereabouts by observing that the rest of that code doesn't reference Basic. So, then I looked up all the system variables used by that section of code and found this set of addresses:

01,X 0100+1,X 0200,X 0259,Y 0263,Y 026D,Y 0277-1,X 0277
0277+1,X 0281 0282 0283 0284 0285 0286 0287
0288 0289 028A 028B 028C 028D 028E 028F
0290 0291 0292 0293 0293,Y 0294 0297 0298
0299 029A 029B 029C 029D 029E 029F 02A0
0300,X 0314 0314,Y 0315
(EABF) 0314 IRQ vector
(FED2) 0316 BRK vector
(FEAD) 0318 NMI vector
(F40A) 031A open a logical file
(F34A) 031C close a specified logical file
(F2C7) 031E open channel for input
(F309) 0320 open channel for output
(F3F3) 0322 close input and output channels
(F20E) 0324 input character from channel
(F27A) 0326 output character to channel
(F770) 0328 scan stop key
(F1F5) 032A get character from keyboard queue
(F3EF) 032C close all channels and files
(FED2) 032E user function
(F549) 0330 load
(F685) 0332 save

C533
C677

I also searched the Kernal code to find references to addresses within the Basic part of the ROM and found none, which meant that Basic sits properly on top of the Kernal. So, this tells us what areas of RAM we can use and it's as follows:

Address Range Size Owner Available for Forth?
$000 .. $08F $090 BASIC Yes, it's Page 0 how could we avoid it :-) ?
$090 .. $0FA KERNAL No.
$0FB .. $0FE $004 Nothing Yes, not sure yet what for.
$0100..$013E KERNAL Unlikely (tape error log/correction buffer)
$013F..$01FF $0C1 CPU Yes, for the return stack
$0200..$0259 $05A BASIC Yes, for Forth code
$025A..$029F KERNAL No.
$02A0..$02FF $060 None Free, more Forth code.
$0300..$0313 $013 BASIC Yes.
$0314..$03FF KERNAL No.
$033C..$03FB KERNAL Cassette buffer, maybe, but limited usage.
$03FC..$03FF $004 Free Some Vars?
$1000..$1DFF $E00 BASIC Forth code
$1E00..$1FF9 $1FA VIC (Screen)
$1FFA..$1FFF $006 Free 6b, more vars?

This gives us a total of $F6B (3947) bytes, or 4453 bytes if we can use the screen, or 4140 (4646) bytes if we include the CPU stack, which of course we will.

In the next part we'll make some basic decisions about the uxForth model, this will help us decide how to use all these areas.

Sunday, 7 August 2016

Glamourising Carmageddon In a 1ºC Warmer World

A recent Guardian article nails it. We've already hit >1ºC of warming, so it's virtually impossible to stay under 1.5ºC despite the fact that the Paris Climate Agreement in December 2015 stated it as a goal.

There's an interesting incidental quote in the article:

“And by 2030 you will have to get rid of the combustion engine entirely”

Indeed, that's where it must lead and the realisation that infernal combustion engines are going to have to be phased out rather more quickly is going to dawn on us pretty soon. For example, if we want to have 0 fossil fuel cars on the road by 2050, then we have to stop making them by 2025. Norway, India and the Netherlands are already considering this.

So what this article is saying - if it doesn't mean merely ending production - is that we should already have stopped building fossil fuel cars in 2005, which means they'll be forcibly taken off the road. The fact that cities like Paris ban odd or even numbered vehicles on high pollution days is an indication of how this might take place, but whatever, they have to go and go quickly.

Yet, and this is a big yet - we are still glorifying cars in every advertising space possible: on TV in virtually every commercial break; on billboards, in papers, on facebook, on the Guardian, even over Guardian articles warning about climate change. There they are, the car companies, rubbing our faces in the pollution they're causing - because they get to advertise all the time and the counter-narrative gets almost no air-time.

And this needs to stop as well, because it's no different to tobacco adverts during the 1970s, actually it's worse because these products are a threat to the whole planet, not primarily the individuals who smoke. Deglamourise those tail-pipes and the incessant urban noise that the fossil fuel industry has immersed us in. End those adverts, it'll make it easier to end fossil fuel car production.

Sunday, 27 December 2015

Capitalising the Cumbrian Floods

Naomi Klein's book: The Shock Doctrine (2007), makes the insightful, but surprising claim that free-market capitalists use economic and climate disasters to push forward neoliberal ideological policies. This is called #disastercapitalism . 

The idea is this: people normally resist free market solutions to all sorts of cultural norms. For example, the NHS in the United Kingdom is deeply valued to the extent that an amateur choir can achieve a Christmas number 1 in the pop charts by releasing a song that defends it.

What Disaster Capitalism does though is use the dislocation, caused by a traumatic event to railroad purported solutions to inject free-market answers while people are still reeling in confusion from the event. For example, in the case of climate change related events, you would think it would be the other way around: that people would rapidly form the association between global warming and and the current event and make strong calls for the government to turn more rapidly away from fossil fuels.

Surprisingly, although some people do this, by and large no such response occurs. Instead what happens is that people are preoccupied with the immediate need for relief and consequently suppress or ignore statements or opinions that relate the event to climate change. For example, after the Hebden Bridge flooding of 2012, they released a web page where they talk about "what work has been done since last Summer, and what’s also planned in the medium and long terms". There's no mention of climate anything. You can skip to the forum too. So, from the 2012 forum, topic "Floods Practical Solutions", of the 19 posts, there's one reference to climate ("Has anyone mentioned Treesponsibility? The group was set up to leessen (sic) the effects of climate change in the Valley by planting trees"). Another flood-related topic is no longer on the server. Of the other topics "Floods Sandbags" (no references); "Floods, The Politics" 20 refs: "Climate change certainly seems to be contributing to the problem, but it will be hundreds of years, if at all, before we will see the results of any effort we make to counter climate change."(i.e. let's not think about it right now); "Every Councillor and every senior Council Officer in Calderdale needs to read the Pitt Review of the 2007 floods and act on the recommendations. They should also understand where Climate Change is leading us." (which was written by the Green Party Candidate and brushed aside in the next post with "I'm not sure that right now is the time to be pushing 'I told you so' stuff"); "Only yesterday a report was mentioned on the BBC from the Committee on Climate Change that warns of more of these downpours in the future.". Another Green party comment "Our own Government and its agencies have to stop talking about flood threats in terms of one in 30 or 100 years. With climate change it could happen next month again". This was countered in the next post and in a couple of later posts: "As I said, there is no evidence to link the recent wet weather with man made climate change and for that reason the Green Party and other opportunists should not use terrible events such as the recent floods to progress their agendas."

Given that people are cautious to make any link, particularly during the events themselves, it becomes relatively easy for the government to exploit it. And indeed this is just what they've done. In response to questions about how the Government had made cuts to the Environmental Agency prior to the Cumbrian floods in early December 2015, the government then offered Council Tax and business tax cuts as relief.

Doesn't this strike anyone as being rather odd? As if anyone's need for flooding relief was somehow related to how much they pay in council tax? Isn't it also more likely that homes built closer to flood plains are likely to be in lower council tax areas and therefore there's less to claim back, though they're more likely to be flooded? And how is a business's profits related to the amount of damage they'd suffer? And isn't it likely that a business that was making more money would find it easier to cope with the cost anyway?

The only common factors in all of this is that:
  1. The 'flooding relief' potentially has greater benefits for Cumbrians who are better off, compared with those who aren't.
  2. The relief needs to be applied from across the country to those in need, not localised to the area in question. How is a community's ability to fund its own relief related to the extent of the disaster?
  3. Business taxes (and possibly Council taxes) pay for things like welfare, so in a sense the flooding disaster is being used to make cutbacks in welfare: although the rich across the country aren't paying for the flooding relief, the poor in the UK will have less money as a consequence.
And guess what, Liz Truss is planning to apply the same approach to the flooding in the North of England again - they have no idea if it works, but it's already policy.
“Of course, you’ve asked about funding and we’re looking at schemes similar to what we put in place in Cumbria to make sure families and businesses are supported,” said Truss.
 Disaster Capitalism to a tee.

Saturday, 7 November 2015

Parallel CRC16 Collection 3

There's so many processors to write CRC16 algorithms on, and I feel like I'm on a roll!

In this section we look at some other popular processors: the MSP430 (another RISC-but-in-many-ways-not MCU); the venerable 8051 and the current king of MCUs, the ARM (Cortex M0).

The MSP430 is like a stripped-down pdp-11 with 16x16-bit registers and single-cycle instructions. But it's not great for computing 16-bit CRCs. This routine weighs in at 58 bytes and 28 cycles per byte, making it the same length and relatively the same performance as an H8. Only the 6502 is longer! The Msp430 has some features going for it, like it's ability to swap bytes and shift whole words in a single cycle, but the reason it's not as fast as a PIC or AVR is primarily because the MSP430 has no nybble operations and because sometimes its byte operations (which modify the whole word of the destination register, by zero-extending the result) get in the way. The MSP430's straight-forward <<5 is faster than the rotate right 3 times technique used on some 8-bit CPUs, because it doesn't need to save an intermediate and apply mask operations. But the neatest trick here is the >>4 . By zero-extending r4.b when moving into r7, we clear the upper byte, so '0' bits get shifted into bits 4..7 of r7. Then by zero-extending r7 again, we clear r7's original bits 3..0; and so we get a byte shift, in 1 cycle less than applying a mask.

Crc16Msp430:
    push r7
Crc16Msp430Loop:
    swpb r5       ;crc=(crc>
>8)(crc<<8)
    mov.b @r4+,r7 ;r7=new byte.
    xor r7,r5     ;r5=crc^=newByte.
    mov.b r5,r7   ;r7=crc and 255;
    rra r7
    rra r7
    rra r7
    rra r7
    mov.b r7,r7 ; clears bits 7 to 15 giving the >>4
    xor r7,r5
    mov.b r5,r7  ;Sign extend again.
    swpb r7      ;<&lt8
    add r7,r7
    add r7,r7
    add r7,r7
    add r7,r7 ;<<12
    xor r7,r5
    mov r5,r7 ;crc and 255 again.
    add r7,r7
    add r7,r7
    add r7,r7
    add r7,r7
    add r7,r7 ;<<5
    xor r7,r5
    Dec r6
    jne Crc16Msp430Loop
    pop r7
    ret ;


The 8051 version takes 31 bytes and 27 instruction cycles per byte - it's the shortest implementation so far (the PIC version uses at least 12-bit opcodes). On an original 12MHz 8051, this would have meant 27µs per byte, about 76% faster than a pdp-11.

Crc16_8051:
    push r3
Crc16_8051Loop:
    movx acc,@dptr
    inc dptr
    xrl a,r1
    mov r3,a
    swap
    anl a,15;
    xrl a,r3
    mov r3,a
    swap
    anl a,240 ;<<12
    xrl a,r0
    mov r1,a
    mov a,r3
    swap
    rl a ;<<5
    mov r0,a ;save
    anl a,31
    xrl a, r1
    mov r1,a
    mov a,r1
    anl a,240
    xrl a,r3
    mov r0,a
    djnz r2,Crc16_8051Loop
    pop r3
    ret


The Arm Thumb version. Using just the original Thumb instruction set, it can perform a CRC16 in just 18 cycles (46b), making it about 5.1x more efficient than a 6502 and nearly 7.4x more efficient than a 68000. This is mostly because the ARM thumb can perform multi-bit shifts in a single cycle. The processor is only let down when swapping the Crc16 high and low bytes (which is done just like in 'C', crc16=(crc16 >> 8) | (crc16 <<8)&0xffff. Later versions of Thumb provided zero-extend and byte swap instructions.

c65535: .word 65536
Crc16_Thumb:
    push {r3,r4}
    ldr r4,[PC,c65535]
Crc16_ThumbLoop:
    lsl r3,r0,#8
    lsr r0,r0,#8
    orr r0,r3
    and r0,r4
    ldrb r3,[r1],#1
    eor r0,r3
    mov r3,#240
    and r3,r0
    lsr r3,r3,#4
    eor r0,r3
    lsl r3,r0,#12
    eor r0,r3
    lsl r3,r0,#5;
    eor r0,r3
    and r0,r4
    subs r2,#1
    bne Crc16_ThumbLoop
    pop {r3,r4}
    rts


Finally, an AVR version. The webpage I based all of these on uses CRC code that operates on a byte at a time, so you wouldn't think there's any point in publishing one here, but my DIY FIGnition computer uses a more efficient one in its audio loading firmware:

;z^Data, r25:r24=Crc16, r26=Len.
Crc16Lo=r24
Crc16Hi=r25
BuffLo=r26,
BuffHi=r27
temp=r26
Len=r28

Crc16_AtMega:
    push r16
    push r30
    push r31 ;used to hold buff.
    movw z,BuffLo
    ldi r16,16
    ldi r27,32
Crc16_AtMegaLoop:
    mov temp,Crc16Lo
    mov Crc16Lo,Crc16Hi
    mov Crc16Hi, temp ; Swap hi and lo.
    ld temp,Z+
    eor Crc16Lo, temp ; crc^=ser_data;
    mul r16,Crc16Lo ; Faster than executing lsr 4 times.
    eor Crc16Lo, r1 ; crc^=(crc&0xff)>>4
    mul r16,Crc16Lo ;(crc&0xff)<<4
    eor Crc16Hi,r0 ;crc^=(crc<<8)<<4
                    ;(r0 wonderfully contains the other 4 bits)
    mul r27,Crc16Lo ; (crc&0xff)<<5
    eor Crc16Lo,r0
    eor Crc16Hi,r1
    subi Len,1
    bne Crc16_AtMegaLoop
    pop r31
    pop r30
    pop r16
    ret
In the above version (designed for a Mega AVR), we can save a cycle from the mov, swap, andi sequence by making use of the mul instruction. We can save 5 cycles in the <<5 code. This gives us 19cycles per byte, just 1 cycle longer than an ARM thumb (and the same length). Yet it's 7 cycles shorter than the other AVR version (53% faster!).

Tuesday, 3 November 2015

Parallel CRC16 Collection 2


Hello folks,

I recently posted a collection of byte-parallel CRC16 implementations and thought I'd add a few more.

Firstly, a 6800 version. Since I don't have the original BSI volume, I don't know exactly how it was coded then, but I vaguely remember it being about 26 instructions long. The 6800 version is actually shorter and faster than my 6809 version despite the 6809 being a far better CPU with better CPI and more registers. That's because I end up having to use 0 page for speed and that's still faster than stack indexing:


Crc16_6800: ;a:b=CRC, X^data.
;uses page 0 temp0, temp1, Len.
Crc16_6800Loop:
eora 0,x
staa temp1 ;hi now swapped
lsra
lsra
lsra
lsra
eora temp1
staa temp1
stab temp0
lsla
lsla
lsla
lsla
eora temp0
staa temp0
rorb
rorb
rorb
tba
anda #31
eora temp0
rorb
andb #0xe0
eorb temp1
inx
dec len
bne Crc16_6800Loop
rts

Next, the Hitachi H8 version. In this version, R0=CRC, R1^data, R2=Len. R3 is a temp. The H8 is interesting, because it was part of a 1990s generation of Microcontrollers designed for high-level languages like 'C'. Hitachi claimed it was RISC, though it isn't due to the excessive number of instruction formats and addressing modes. It's more like a 16-bit micro-controller version of the 68000 or pdp-11. The performance looks decent at 57 ø-clock cycles, but it isn't really that great because a ø-clock cycle is half the XTAL clock frequency, making it 114 XTAL oscillations per CRC byte.

Crc16_H8:
push r3
Crc16_H8Loop: ;start by pretending to swap.
mov.b @r1+,r3h ;get the next byte.
xor.b r3h,r0h ;
mov.b r0h,r3h ;need a copy.
shlr r3h
shlr r3h
shlr r3h
shlr r3h ; >>4
xor.b r3h,r0h ;
mov.b r0h,r3h ;copy again.
shll r3h
shll r3h
shll r3h
shll r3h ; <<4
xor.b r3h,r0l ;Into what was the low byte.
mov.b r0l,r3h
rotr r3h
rotr r3h
rotr r3h
mov.b r3h,r3l
and.b #31,r3l
xor.b r0h,r3l
and.b #0xe0,r3h
xor.b r0l,r3h
mov.w r3,r0
subs #1,r2
bne Crc16_H8Loop
pop r3
rts

So, let's compare that with the PIC I originally coded it for, a PIC16C55. This isn't the actual code, I don't have rights to that, so I've just rewritten the equivalent algorithm, and anyway, the version I used would probably have calculated the CRC on the fly, as each byte was received. Surprisingly, the humble PIC cpu (which predates Commerical RISC CPUs by a decade), manages to implement the algorithm in 25 instructions and 25 instruction cycles per byte, which works out as 100 clocks per byte.

Crc16_Pic16C55: ;pretend we've swapped.
movf ind,w ;got the next byte.
incf fsr
xorwf gCrc16+1,f ;xor with lo byte.
swap gCrc16,w ;copy and swap.
andlw 15 ;this is the >>4!
xorwf gCrc16+1,w ;'lo' byte again.
movwf gTemp2
swap gCrc16,w ;copy and swap
andlw 0xf0
xorwf gCrc16,f ;this time into 'hi' byte.
rrf gTemp2,w ;get 'lo' mostly calc'd Crc.
movwf gTemp1 ;save in temp1.
rrf gTemp1,f ;Have to rotate into f, because
rrf gTemp1,f ;can't rotate w by itself.
movf gTemp1,w
andlw 31
xorwf gCrc16,w
movwf gCrc16+1 ;OK to overwrite, since new 'lo' byte in gTemp2
rrf gTemp1,w ;and again, because of the carry.
andlw 0xe0
xorwf gTemp2,w
movwf gCrc16
decfsz gCrcLen
goto Crc16_Pic16C55
retlw 0 ;done.

We've mentioned the 68000 CPU a couple of times, so let's compare it. Here, we can see how the 68000 is well suited to high-level languages, the straight-forward implementation closely matches the algorithm and is  only 19words, 38 bytes long, almost as short as a Z80! The timing on the other hand - yike, 134 cycles per byte, only slightly more efficient than a Z80. This is mostly due to the fact that a 68000 had a 4 clock bus cycle, but also because the 68000 isn't very good at byte manipulation. On the other hand, the 68000 has a decent decrement and branch instruction.

Crc16_68K:
move.l d2.w,-(sp)
bra.s Crc16_68KWhile
Crc16_68KLoop:
rol.w #8,d0 ;this time we have to rotate to begin with.
eor.b (a0)+,d0 ;
move.b d0,d2 ;crc&0xff just need byte ops this time
lsr.b #4,d2 ;..>>4
eor.b d2,d0 ;crc^= above calculation.
move.b d0,d2 ;crc&0xff (we'll shift out upper 8 bits).
lsl.w #8,d2 ;..<<8
lsl.w #4,d2 ;..<<4
eor.w d2,d0 ;crc^= above calculation.
moveq #0,d2 ;(Faster than a move.b and and.w #0xff)
move.b d0,d2 ;crc&0xff
lsl.w #5,d2 ;..<<5
eor.w d2,d0 ;done.
Crc16_68KWhile:
dbra d1,Crc16_68KLoop
move.l (sp)+,d2
rts

Finally, let's also compare it with the 68000's arch-rival, the 8086. Here we can see that the 8086 isn't so well suited to a high-level language, mostly due to the optimisations needed for the byte operations and judging single bit shifts vs shifting with cl. The length is the second worst at 54 bytes and the standard instruction timings for the 8086 would give a total cycle time of 80 cycles (as efficient as a 6809) but this fails to take into account how the prefetch queue empties when executing linear code whose instructions are < 4 cycles. This bumps it up to 112 cycles per byte, which means a standard 8MHz 68000 would be about 33% faster than a 5MHz 8086. Both of them, however, are much faster than a pdp-11/34, (2.8x and 2.13x respectively).

Crc16_8086:
push dx
Crc16_8086Loop:
mov dh,al ;2(4)
mov dl,ah ;swap to begin with. 2(2)
lodsb ;al=[si]+ 12(6)
xor dl,al ;3(5)
mov al,dl ;2(3)
shr al,1 ;2(1)
shr al,1 ;4(0)
shr al,1 ;4(0)
shr al,1 ;4(0), still faster than mov cl,4: shr al,cl
xor dl,al ;4(0)
mov al,dl ;4(0) upper byte.
shl al,1 ;4(0)
shl al,1 ;4(0)
shl al,1 ;4(0)
shl al,1 ;4(0)
xor dh,al ;4(0) the <<12
xor ah,ah ;4(0)
mov al,dl ;4(0)
shl ax,1 ;4(0)
shl ax,1 ;4(0)
shl ax,1 ;4(0)
shl ax,1 ;4(0)
shl ax,1 ;4(0) Perform a straight-forward <<5
xor ax,dx ;4(0)
loop Crc16_8086Loop ;17c. 2 bytes fetched = 4c, so 13c can refill BIU
pop dx
ret



Saturday, 31 October 2015

Parallel CRC16 Collection

I first came across CRCs in 1996, when I was writing the audio software for the Heathrow Express. A couple of engineers from the company I worked for opened a dusty BSI volume; pointed me to a short algorithm for an ancient CPU (even by 1990s standards), the 6800 and told me to use that algorithm.

I had to translate it for the PIC and Hitachi H8 MCUs I was using for the project, but that wasn't a major hassle. The puzzle was why they seized upon that moment to insist I use a specific, and short algorithm. Did they think I was a mathematical dunce who couldn't implement something involving a few XORs? Were they just trying to impress me with their claim (I'm willing to accept it) that they actually wrote that algorithm and got it published in BSI standards? Were they just sticklers for standardized solutions?

Well, I was quite happy, because it was a bit of an education. It wasn't until about 5 years later I started to realize something was amiss with CRC16 algorithms. And what was amiss, was that no-one else seemed to be using byte parallel algorithms. Surely a BSI (and presumably US / ISO) standard would dominate? But here I was in other commercial environments where the references and implementations were all bit-serial or table-driven algorithms of this kind:


uint16_t crc16(char *addr, int len, uint16_t crc)
{
  int bit;
  while(num-- >0) { // For each byte.
    crc = crc ^ (*addr++ <<8); // xor into high byte.
    for(bit=0; bit<8;bit++)
      if ((int16_t)crc<0) // top bit of poly set.
        crc = ((crc<<1) ^ 0x1021); // xor with poly
      else
        crc<<=1;
    }
  }
  return crc;
}

I could see they were all highly inefficient, or worse still, wrong - some of the table-driven versions actually contained wrong entries as I found out last year when I wasted a day or so trying to figure out why a colleague's Crc algorithm didn't generate the same results as mine. And that was because he'd just copied and pasted the first implementation he'd seen on the internet - and that was even after I'd pointed him to this byte-parallel version and asked him to translate it to C#.

The reason I figure table-driven algorithms became popular is because they're easy on the brain. It's easy to grasp how a bit-serial algorithm relates to the CRC's polynomial and then easy to jump to a bit-serial algorithm that generates a table or just copy a table version directly. However, byte-parallel algorithms are, thankfully, making a comeback. Why? I guess because constrained MCUs are still used in a lot of applications and because cache-misses on table-driven CRCs are pretty costly on higher-end processors.

This leads me to - an alternative for a set of CRC algorithms published on MDFS.net, a wonderful site that has conversions of BBC Basic for every processor you'd ever want :-) (with the exception of an AVR). Here are equivalent byte-parallel versions for the same set of ancient processors and they run at least twice as fast:

First, Crc16_6502, which clocks in at 64 bytes and 92cycles per byte, over twice as fast as the bit-algorithm. In common with most of the 8-bit implementations, it's more efficient to swap the CrcHi and CrcLo nearer the end and instead perform the calculations on the 'wrong' halves of the Crc until then. The 6502 version also saves cycles by using y to represent both an index into the buffer, and the length of the buffer (which is incremented until it gets to 0). This means we have to adjust the buffer pointer and negate the length.


Crc16_6502: ;buff^buffer, y=len.
clc
tya
        eor #255
        adc #1     ;Negate length
        beq Crc16_6502End
adc buff ;buff-len
;Carry would mean we don't need to dec buff+1
        ;but since we really have 256-length in y,
        ;then we need to inc buff+1 instead (no-carry
;means we don't need to inc buff+1)
bcc Crc16_6502Loop
inc buff+1

Crc16_6502Loop:
lda (buff),y
eor CrcHi
sta CrcHi ;really CrcLo
lsra
lsra
lsra
lsra ;(Crc&0xff)>>4.
eor CrcHi
sta Temp1
sta Temp0 ;Copy low byte for <<5 later.
asla
asla
asla
asla ;(Crc<<12)
eor CrcLo
sta CrcHi ;this is the new CrcHi
asl Temp0
rola
asl Temp0
rola
asl Temp0
rola
asl Temp0
rola
asl Temp0
rola ;<<5
eor CrcHi
sta CrcHi
lda Temp0
eor Temp1
sta CrcLo
iny
bne Crc16_6502
Crc16_6502End
rts

Then there's a Z80 version. This is more straight-forward, since there are enough registers to handle the entire algorithm. It clocks in at 33b and 139 T-states per byte, making it the shortest version and only 51% slower than a 6502 at the same clock speed. Here we use c as a temp while we perform the crc hi and lo swap over the course of the first two shifts, so that they end up nicely in hl ready for when we do the <<5 near the end.

Crc16_Z80: ;(Z80 style, b=length, de^data, hl=CRC).
ld a,(de)
inc de
xor h
ld c,a
rra ;the Z80 doesn't have a fast
rra ;right shift so we
rra ;rotate and mask.
rra
and a,15 ;>>4,
xor c
ld c,a ;new low byte, gets shifted.
add a,a
add a,a
add a,a
add a,a ;<<12.
xor l
ld h,a ;new crc hi
rrca
rrca
rrca ;<<5
ld l,a ;save in c
and 31
xor h
ld h,a
ld l,a
and 0E0h
xor c
ld l,a ;done.
djnz Crc16_Z80

Next up, the 6809. Despite having several 16-bit registers, the 6809's accumulator architecture means we need to allocate 2 temporary bytes on the stack and we can't make use of 16-bit operations on D. I estimate the length at 45bytes and the speed as 80 cycles per byte, a little faster than a 6502.

Crc16_6809: ;D=CRC, X^data, Y=Len.
leas -2,s ;Allocate 2 temp bytes.
Crc16_6809Loop:
eora ,x+
staa ,s
lsra
lsra
lsra
lsra
eora ,s
std ,s
lsla
lsla
lsla
lsla
eora 1,s
staa 1,s
ldab ,s
rorb
rorb
rorb
tfr b,a
anda #31
eora 1,s
rorb
andb #0xe0
eorb ,s
leay -1,y
bne Crc16_6809Loop
leas 2,s ;Deallocate 2 temp bytes.
rts


Finally, the pdp11 (with Extended Arithmetic). The pdp11's adequate number of 16-bit registers and programmer-friendly instruction set makes it easy to implement the algorithm. Nevertheless, if run on a typical 1970s pdp-11/34 it would require 40 bytes of code and 47.56µs per byte, roughly equivalent to a 2MHz 6502 or a 3MHz Z80. Yet more evidence to demonstrate that the pdp-11 in the 1970s wasn't theoretically much faster than a humble late 70s Microprocessor. Check out my Z80 Dhrystones article to discover why this might be.

Crc16_pdp11:
;With Extended Arithmetic. r0=crc, r1^data, r2=len, r3=tmp
swab r0
clr r4
movb r4,(r1)+
xor r0,r4 ;no byte version.
mov r3,#-4
movb r4,r0
ash r4,r3 ;>>4.
xor r0,r4 ;crc^=(crc&0xff)>>4
mov r4,r0 ;need copy
mov r3,#12 ;adds 1.6us vs much more for swap /and etc.
ash r4,r3 ;crc<<12
xor r0,r4
mov r4,r0
mov r3,#5
ash r4,r3 ;<<5
xor r0,r3 ;done.
sob r2,Crc16_pdp11