One Week Wonder: FIGnition

Showing posts with label FIGnition. Show all posts

Wednesday, 8 March 2017

uxForth: Unexpanded forth for a standard VIC-20. Part 3, the memory map

I'm the developer of the DIY 8-bit computer FIGnition, but it doesn't mean I'm not interested in other retro computers and the idea of developing a minimal Forth for the ancient, but cute Commodore VIC-20 is irresistable!

Part 1 talks about the appeal of the VIC-20 and what a squeeze it will be to fit Forth into it's meagre RAM.

In Part 2 I discussed choices for the inner interpreter and found out that a token Forth model could be both compact and about as fast as DTC.

Now I'm going to allocate the various parts of the Forth system to VIC-20 memory to make the best of what's there. Some of it will be fairly conventional and some somewhat unorthodox.

(An Aside, The Slow uxForth Development Process)

From the presentation of the blog entries it looks like I'm working these things out as I'm going along. For example, it's worthwhile asking why it looks like I can leap to fairly concrete decisions about the inner interpreter or even that I think I'll be able to fit the entire system into the available space.

The simple answer is that I've already done much of the work to make this possible. I've already written the code that implements the primitives (in fact I've written, modified and rewritten it a few times as I've improved it). I've made use of the wonderful resources at 6502 org, particularly the idea of splitting the instruction pointer (called gIp in my implementation) into a page offset and using the Y register to hold the byte offset: it really does improve the performance of the core Next function.

Similarly, I've written the non-primitive code and accounted for the space. It's written in Forth with a home-brew meta-forth compiler written in 'C'. So, there will be a future blog on that too!

However, it's not a cheat as such. The code is not tested yet; nor even loaded into a real VIC-20 nor emulator (I don't have a real VIC-20 :-( ). I have real decisions to make as the blog continues, which means I can make real mistakes too and have to correct them. What I've done, really, is basically a feasibility study, so that you don't waste your time reading the blog. And of course, the whole of uxForth will be released publicly, on a GPL licence via my GitHub account.

Admittedly, it's being released slowly, a 2.75Kb program I hope to release over the course of 2017!

The Memory Map

Page 0

Page 0 is the gold dust of every 6502 system: versatile and in short supply. BASIC uses the first 0x90 bytes and the KERNAL uses the rest. We'll use all 0x90 bytes for the data stack and some key system variables:

Addr	Size	Name	Comment
$00	2	gIp	Instruction pointer, lower byte always 0.
$02	1	gTmpLo	Temporary byte
$03	1	gTmpHi	Temporary byte used for indirect access.
$04	2	gILimit	The limit for the inner-most do.. loop. uxForth (and FIGnition Forth) differ from most Forths in that the inner most loops values, the limit and the current value are held in global locations. do causes the previous gILimit and gCurrent to be pushed to the stack; thus r is equivalent to j on other forths.
$06	2	gICount	The current loop count for the inner-most do.. loop.
$08	1	gUpState	The current compilation state.
$09	1	gUpBase	The current number base
$0a	2	gUpDp	The current dictionary pointer.
$0c	2	gUpLast	A pointer to the header of the most recent dictionary entry compiled
$0e	2	gUpTib	The pointer to the input buffer (I'm not sure if we need this)
$10	128	gDs	The data stack
$fb	2	gTmpPtr0	Spare pointer 0
$fd	2	gTmpPtr1	Spare pointer 1

Page 1

Page 1 is the return stack as you might expect. Oddly enough, we only get 192b, because the KERNAL uses $100 to $13F.

Page 2

There are 89 bytes available here, because they're used by BASIC. I plan to use them for the byte code vectors which are:

#	Name	#	Name	#	Name	#	Name
$00	(nop)	$0b	(+loop)	$16	u/	$21	rp!
$01	;s	$0c	0<	$17	@	$22	drop
$02	exec	$0d	0=	$18	c@	$23	dup
$03	(native)	$0e	+	$19	!	$24	over
$04	(lit8)	$0f	neg	$1a	c!	$25	swap
$05	(lit16)	$10	and	$1b	r>	$26	(vardoes)
$06	0	$11	or	$1c	>r	$27	(constdoes)
$07	(0branch)	$12	xor	$1d	r	$28	inkey
$08	(branch)	$13	>>	$1e	sp@	$29	emit
$09	(do)	$14	<<	$1f	sp!	$2a	at
$0a	(loop)	$15	*	$20	rp@	$2b

The codes that are greyed out have no names in the dictionary to save space; the way you'd insert them into code would be with [ nn c, ] sequences.

Page 3 and Page 4

There are a total of 116 bytes free from $2A0 to $313, I'll fill that area with some of the actual native definitions.

The cassette buffer is at $33c to $3fb. We'll be using the cassette for storage so we can't use it for code.

Pages 16 to 31 ish ($1000 to $1dff)

This is the area of RAM reserved for BASIC. It will contain the rest of the Forth system.

The screen RAM ($1e00 to $1ff9)

The end of RAM for an unexpanded VIC-20 is used for the screen. The plan here is to use that area for the editing space. Instead of implementing a line editor (ACCEPT in FIG-forth and early FIGnition Forth), we use key to call the KERNAL editor and allow it to manage the editing of a line including cursor movement. Pressing Return doesn't execute the command line, instead, pressing F1 exits the editor and sets the interpretation point to the current cursor position. The end of the interpretation point is set to the end of the screen and emit is turned off until interpretation gets to the end of the screen. Importantly, pressing return doesn't start interpretation.

In addition, pressing F2 saves the screen bytes onto cassette.

This is how I'll implement storage in a fairly minimal way. By implementing save via F2 I can save a block (actually the 506 screen bytes are roughly half a traditional block), but LOAD is a normal word, so multiple blocks can be loaded (you just add load to the end of the block).

So, this is how you'd do normal editing operations. For normal words you would place the cursor near the end of the screen and edit to the end of the screen; cursor to return to the first character you want to interpret and then press F1. In a sense this is easy, because you can just press Return and then cursor up until you get there. The same method would also work if you wanted to compile a whole screen's worth of code. Load itself would reset the cursor position to [home] and then return to the interpreter, so placing a load at the end of the screen would load the next screen without any recursion. That way you'd be able to develop programs that were longer than just one screen without manual reloading.

Conclusion

In the memory allocation of uxForth, we've squirrelled away about 1053 bytes of RAM, embedding the line buffer in the screen and a number of system variables in page 0. We've also included 212 bytes of what we'd use for the program proper. It won't get much better than this!

In the next post I hope to talk in more detail about the implementation of the primitive words and the code used to test them.

Thursday, 3 November 2016

uxForth: Unexpanded Forth For A Standard VIC-20

I'm the developer of the DIY 8-bit computer FIGnition, but it doesn't mean I'm not interested in other retro computers and as far as it goes, the Commodore VIC-20 is one of the cutest to come out of the 1980 stables.

The VIC-20 was cute, because it had a combination of fun and dumb features. Like: a full quality 65 key keyboard - and only two cursor keys!

Or the ability to support business applications with a floppy disk drive, but only having 23 column text. Or multi-colour graphics (and even a 2-bit per pixel mode that can co-exist with a 1-bit per pixel mode) with a near complete lack of support for bitmapped graphics. Or it's 16Kb of ROM and only 5Kb of RAM (with just 3582bytes free when Basic boots).

So, the fun challenge here is to see how much of a Forth I can squeeze into the Basic, unexpanded VIC-20 given the RAM limitations. I'm pretty confident I can do this, given that a super-tiny Forth subset has been crammed into just 1Kb of 8086 code (itsy Forth). I'm aiming for something that's kinda more usable.

Dude, Where's My RAM?

The first step (and this is the topic of this blog) is to find out how much RAM we can really use. A VIC-20 boots up and proudly displays:

But it actually has 5Kb, so where has that other 1.5Kb gone? Armed with a detailed VIC-20 memory map we can see that areas of the first 1Kb have been nicked by Basic and the Kernal, which is a set of OS services abstracted from Basic and forms part of the ROM. For our purposes we don't want to use Basic, but we do want to use the Kernal, so we can read the keyboard, display to the screen and input/output between peripherals. For some of this 1Kb it's obvious which is used by Basic, but not all. So, here I decided to use the VIC-20 ROM disassembly. I first worked out that the Kernal starts at the address $E475, or thereabouts by observing that the rest of that code doesn't reference Basic. So, then I looked up all the system variables used by that section of code and found this set of addresses:

01,X 0100+1,X 0200,X 0259,Y 0263,Y 026D,Y 0277-1,X 0277

0277+1,X 0281 0282 0283 0284 0285 0286 0287

0288 0289 028A 028B 028C 028D 028E 028F

0290 0291 0292 0293 0293,Y 0294 0297 0298

0299 029A 029B 029C 029D 029E 029F 02A0

0300,X 0314 0314,Y 0315

(EABF) 0314 IRQ vector

(FED2) 0316 BRK vector

(FEAD) 0318 NMI vector

(F40A) 031A open a logical file

(F34A) 031C close a specified logical file

(F2C7) 031E open channel for input

(F309) 0320 open channel for output

(F3F3) 0322 close input and output channels

(F20E) 0324 input character from channel

(F27A) 0326 output character to channel

(F770) 0328 scan stop key

(F1F5) 032A get character from keyboard queue

(F3EF) 032C close all channels and files

(FED2) 032E user function

(F549) 0330 load

(F685) 0332 save

C533

C677

I also searched the Kernal code to find references to addresses within the Basic part of the ROM and found none, which meant that Basic sits properly on top of the Kernal. So, this tells us what areas of RAM we can use and it's as follows:

Address Range	Size	Owner	Available for Forth?
$000 .. $08F	$090	BASIC	Yes, it's Page 0 how could we avoid it :-) ?
$090 .. $0FA		KERNAL	No.
$0FB .. $0FE	$004	Nothing	Yes, not sure yet what for.
$0100..$013E		KERNAL	Unlikely (tape error log/correction buffer)
$013F..$01FF	$0C1	CPU	Yes, for the return stack
$0200..$0259	$05A	BASIC	Yes, for Forth code
$025A..$029F		KERNAL	No.
$02A0..$02FF	$060	None	Free, more Forth code.
$0300..$0313	$013	BASIC	Yes.
$0314..$03FF		KERNAL	No.
$033C..$03FB		KERNAL	Cassette buffer, maybe, but limited usage.
$03FC..$03FF	$004	Free	Some Vars?
$1000..$1DFF	$E00	BASIC	Forth code
$1E00..$1FF9	$1FA	VIC	(Screen)
$1FFA..$1FFF	$006	Free	6b, more vars?

This gives us a total of $F6B (3947) bytes, or 4453 bytes if we can use the screen, or 4140 (4646) bytes if we include the CPU stack, which of course we will.

In the next part we'll make some basic decisions about the uxForth model, this will help us decide how to use all these areas.

Saturday, 7 November 2015

Parallel CRC16 Collection 3

There's so many processors to write CRC16 algorithms on, and I feel like I'm on a roll!

In this section we look at some other popular processors: the MSP430 (another RISC-but-in-many-ways-not MCU); the venerable 8051 and the current king of MCUs, the ARM (Cortex M0).

The MSP430 is like a stripped-down pdp-11 with 16x16-bit registers and single-cycle instructions. But it's not great for computing 16-bit CRCs. This routine weighs in at 58 bytes and 28 cycles per byte, making it the same length and relatively the same performance as an H8. Only the 6502 is longer! The Msp430 has some features going for it, like it's ability to swap bytes and shift whole words in a single cycle, but the reason it's not as fast as a PIC or AVR is primarily because the MSP430 has no nybble operations and because sometimes its byte operations (which modify the whole word of the destination register, by zero-extending the result) get in the way. The MSP430's straight-forward <<5 is faster than the rotate right 3 times technique used on some 8-bit CPUs, because it doesn't need to save an intermediate and apply mask operations. But the neatest trick here is the >>4 . By zero-extending r4.b when moving into r7, we clear the upper byte, so '0' bits get shifted into bits 4..7 of r7. Then by zero-extending r7 again, we clear r7's original bits 3..0; and so we get a byte shift, in 1 cycle less than applying a mask.


Crc16Msp430:

    push r7

Crc16Msp430Loop:

    swpb r5       ;crc=(crc>

>8)(crc<<8)


    mov.b @r4+,r7 ;r7=new byte.

    xor r7,r5     ;r5=crc^=newByte.

    mov.b r5,r7   ;r7=crc and 255;

    rra r7

    rra r7

    rra r7

    rra r7

    mov.b r7,r7 ; clears bits 7 to 15 giving the >>4

    xor r7,r5

    mov.b r5,r7  ;Sign extend again.

    swpb r7      ;<&lt8

    add r7,r7

    add r7,r7

    add r7,r7

    add r7,r7 ;<<12

    xor r7,r5

    mov r5,r7 ;crc and 255 again.

    add r7,r7

    add r7,r7

    add r7,r7

    add r7,r7

    add r7,r7 ;<<5

    xor r7,r5

    Dec r6

    jne Crc16Msp430Loop

    pop r7

    ret ;

The 8051 version takes 31 bytes and 27 instruction cycles per byte - it's the shortest implementation so far (the PIC version uses at least 12-bit opcodes). On an original 12MHz 8051, this would have meant 27µs per byte, about 76% faster than a pdp-11.

Crc16_8051:

    push r3

Crc16_8051Loop:

    movx acc,@dptr

    inc dptr

    xrl a,r1

    mov r3,a

    swap

    anl a,15;

    xrl a,r3

    mov r3,a

    swap

    anl a,240  ;<<12

    xrl a,r0

    mov r1,a

    mov a,r3

    swap

    rl a   ;<<5

    mov r0,a ;save

    anl a,31

    xrl a, r1

    mov r1,a

    mov a,r1

    anl a,240

    xrl a,r3

    mov r0,a

    djnz r2,Crc16_8051Loop

    pop r3

    ret

The Arm Thumb version. Using just the original Thumb instruction set, it can perform a CRC16 in just 18 cycles (46b), making it about 5.1x more efficient than a 6502 and nearly 7.4x more efficient than a 68000. This is mostly because the ARM thumb can perform multi-bit shifts in a single cycle. The processor is only let down when swapping the Crc16 high and low bytes (which is done just like in 'C', crc16=(crc16 >> 8) | (crc16 <<8)&0xffff. Later versions of Thumb provided zero-extend and byte swap instructions.

c65535: .word 65536

Crc16_Thumb:

    push {r3,r4}

    ldr r4,[PC,c65535]

Crc16_ThumbLoop:

    lsl r3,r0,#8

    lsr r0,r0,#8

    orr r0,r3

    and r0,r4

    ldrb r3,[r1],#1

    eor r0,r3

    mov r3,#240

    and r3,r0

    lsr r3,r3,#4

    eor r0,r3

    lsl r3,r0,#12

    eor r0,r3

    lsl r3,r0,#5; 

    eor r0,r3

    and r0,r4

    subs r2,#1

    bne Crc16_ThumbLoop

    pop {r3,r4}

    rts

Finally, an AVR version. The webpage I based all of these on uses CRC code that operates on a byte at a time, so you wouldn't think there's any point in publishing one here, but my DIY FIGnition computer uses a more efficient one in its audio loading firmware:

;z^Data, r25:r24=Crc16, r26=Len.

Crc16Lo=r24

Crc16Hi=r25

BuffLo=r26,

BuffHi=r27

temp=r26

Len=r28



Crc16_AtMega:

    push r16

    push r30

    push r31 ;used to hold buff.

    movw z,BuffLo

    ldi r16,16

    ldi r27,32

Crc16_AtMegaLoop:

    mov temp,Crc16Lo

    mov Crc16Lo,Crc16Hi

    mov Crc16Hi, temp ; Swap hi and lo.

    ld temp,Z+

    eor Crc16Lo, temp ; crc^=ser_data;

    mul r16,Crc16Lo ; Faster than executing lsr 4 times.

    eor  Crc16Lo, r1 ; crc^=(crc&0xff)>>4

    mul r16,Crc16Lo  ;(crc&0xff)<<4

    eor Crc16Hi,r0  ;crc^=(crc<<8)<<4

                    ;(r0 wonderfully contains the other 4 bits)

    mul r27,Crc16Lo  ; (crc&0xff)<<5

    eor Crc16Lo,r0

    eor Crc16Hi,r1 

    subi Len,1

    bne Crc16_AtMegaLoop

    pop r31

    pop r30

    pop r16

    ret

In the above version (designed for a Mega AVR), we can save a cycle from the mov, swap, andi sequence by making use of the mul instruction. We can save 5 cycles in the <<5 code. This gives us 19cycles per byte, just 1 cycle longer than an ARM thumb (and the same length). Yet it's 7 cycles shorter than the other AVR version (53% faster!).

Tuesday, 3 November 2015

Parallel CRC16 Collection 2

Hello folks,

I recently posted a collection of byte-parallel CRC16 implementations and thought I'd add a few more.

Firstly, a 6800 version. Since I don't have the original BSI volume, I don't know exactly how it was coded then, but I vaguely remember it being about 26 instructions long. The 6800 version is actually shorter and faster than my 6809 version despite the 6809 being a far better CPU with better CPI and more registers. That's because I end up having to use 0 page for speed and that's still faster than stack indexing:

Crc16_6800: ;a:b=CRC, X^data.

;uses page 0 temp0, temp1, Len.
Crc16_6800Loop:
eora 0,x
staa temp1 ;hi now swapped
lsra
lsra
lsra
lsra
eora temp1
staa temp1

stab temp0
lsla
lsla
lsla
lsla
eora temp0
staa temp0
rorb
rorb
rorb
tba
anda #31
eora temp0
rorb
andb #0xe0
eorb temp1

inx
dec len
bne Crc16_6800Loop
rts

Next, the Hitachi H8 version. In this version, R0=CRC, R1^data, R2=Len. R3 is a temp. The H8 is interesting, because it was part of a 1990s generation of Microcontrollers designed for high-level languages like 'C'. Hitachi claimed it was RISC, though it isn't due to the excessive number of instruction formats and addressing modes. It's more like a 16-bit micro-controller version of the 68000 or pdp-11. The performance looks decent at 57 ø-clock cycles, but it isn't really that great because a ø-clock cycle is half the XTAL clock frequency, making it 114 XTAL oscillations per CRC byte.

Crc16_H8:

push r3

Crc16_H8Loop: ;start by pretending to swap.

mov.b @r1+,r3h ;get the next byte.

xor.b r3h,r0h ;

mov.b r0h,r3h ;need a copy.

shlr r3h

shlr r3h ; >>4

xor.b r3h,r0h ;

mov.b r0h,r3h ;copy again.

shll r3h

shll r3h ; <<4

xor.b r3h,r0l ;Into what was the low byte.

mov.b r0l,r3h

rotr r3h

mov.b r3h,r3l

and.b #31,r3l

xor.b r0h,r3l

and.b #0xe0,r3h

xor.b r0l,r3h

mov.w r3,r0

subs #1,r2

bne Crc16_H8Loop

pop r3

rts

So, let's compare that with the PIC I originally coded it for, a PIC16C55. This isn't the actual code, I don't have rights to that, so I've just rewritten the equivalent algorithm, and anyway, the version I used would probably have calculated the CRC on the fly, as each byte was received. Surprisingly, the humble PIC cpu (which predates Commerical RISC CPUs by a decade), manages to implement the algorithm in 25 instructions and 25 instruction cycles per byte, which works out as 100 clocks per byte.

Crc16_Pic16C55: ;pretend we've swapped.

movf ind,w ;got the next byte.

incf fsr

xorwf gCrc16+1,f ;xor with lo byte.

swap gCrc16,w ;copy and swap.

andlw 15 ;this is the >>4!

xorwf gCrc16+1,w ;'lo' byte again.

movwf gTemp2

swap gCrc16,w ;copy and swap

andlw 0xf0

xorwf gCrc16,f ;this time into 'hi' byte.

rrf gTemp2,w ;get 'lo' mostly calc'd Crc.

movwf gTemp1 ;save in temp1.

rrf gTemp1,f ;Have to rotate into f, because

rrf gTemp1,f ;can't rotate w by itself.

movf gTemp1,w

andlw 31

xorwf gCrc16,w

movwf gCrc16+1 ;OK to overwrite, since new 'lo' byte in gTemp2

rrf gTemp1,w ;and again, because of the carry.

andlw 0xe0

xorwf gTemp2,w

movwf gCrc16

decfsz gCrcLen

goto Crc16_Pic16C55

retlw 0 ;done.

We've mentioned the 68000 CPU a couple of times, so let's compare it. Here, we can see how the 68000 is well suited to high-level languages, the straight-forward implementation closely matches the algorithm and is only 19words, 38 bytes long, almost as short as a Z80! The timing on the other hand - yike, 134 cycles per byte, only slightly more efficient than a Z80. This is mostly due to the fact that a 68000 had a 4 clock bus cycle, but also because the 68000 isn't very good at byte manipulation. On the other hand, the 68000 has a decent decrement and branch instruction.

Crc16_68K:

move.l d2.w,-(sp)

bra.s Crc16_68KWhile

Crc16_68KLoop:

rol.w #8,d0 ;this time we have to rotate to begin with.

eor.b (a0)+,d0 ;

move.b d0,d2 ;crc&0xff just need byte ops this time

lsr.b #4,d2 ;..>>4

eor.b d2,d0 ;crc^= above calculation.

move.b d0,d2 ;crc&0xff (we'll shift out upper 8 bits).

lsl.w #8,d2 ;..<<8

lsl.w #4,d2 ;..<<4

eor.w d2,d0 ;crc^= above calculation.

moveq #0,d2 ;(Faster than a move.b and and.w #0xff)

move.b d0,d2 ;crc&0xff

lsl.w #5,d2 ;..<<5

eor.w d2,d0 ;done.

Crc16_68KWhile:

dbra d1,Crc16_68KLoop

move.l (sp)+,d2

rts

Finally, let's also compare it with the 68000's arch-rival, the 8086. Here we can see that the 8086 isn't so well suited to a high-level language, mostly due to the optimisations needed for the byte operations and judging single bit shifts vs shifting with cl. The length is the second worst at 54 bytes and the standard instruction timings for the 8086 would give a total cycle time of 80 cycles (as efficient as a 6809) but this fails to take into account how the prefetch queue empties when executing linear code whose instructions are < 4 cycles. This bumps it up to 112 cycles per byte, which means a standard 8MHz 68000 would be about 33% faster than a 5MHz 8086. Both of them, however, are much faster than a pdp-11/34, (2.8x and 2.13x respectively).

Crc16_8086:

push dx

Crc16_8086Loop:

mov dh,al ;2(4)

mov dl,ah ;swap to begin with. 2(2)

lodsb ;al=[si]+ 12(6)

xor dl,al ;3(5)

mov al,dl ;2(3)

shr al,1 ;2(1)

shr al,1 ;4(0)

shr al,1 ;4(0), still faster than mov cl,4: shr al,cl

xor dl,al ;4(0)

mov al,dl ;4(0) upper byte.

shl al,1 ;4(0)

xor dh,al ;4(0) the <<12

xor ah,ah ;4(0)

mov al,dl ;4(0)

shl ax,1 ;4(0)

shl ax,1 ;4(0) Perform a straight-forward <<5

xor ax,dx ;4(0)

loop Crc16_8086Loop ;17c. 2 bytes fetched = 4c, so 13c can refill BIU

pop dx

ret

Saturday, 31 October 2015

Parallel CRC16 Collection

I first came across CRCs in 1996, when I was writing the audio software for the Heathrow Express. A couple of engineers from the company I worked for opened a dusty BSI volume; pointed me to a short algorithm for an ancient CPU (even by 1990s standards), the 6800 and told me to use that algorithm.

I had to translate it for the PIC and Hitachi H8 MCUs I was using for the project, but that wasn't a major hassle. The puzzle was why they seized upon that moment to insist I use a specific, and short algorithm. Did they think I was a mathematical dunce who couldn't implement something involving a few XORs? Were they just trying to impress me with their claim (I'm willing to accept it) that they actually wrote that algorithm and got it published in BSI standards? Were they just sticklers for standardized solutions?

Well, I was quite happy, because it was a bit of an education. It wasn't until about 5 years later I started to realize something was amiss with CRC16 algorithms. And what was amiss, was that no-one else seemed to be using byte parallel algorithms. Surely a BSI (and presumably US / ISO) standard would dominate? But here I was in other commercial environments where the references and implementations were all bit-serial or table-driven algorithms of this kind:

uint16_t crc16(char *addr, int len, uint16_t crc)
{
int bit;
while(num-- >0) { // For each byte.
crc = crc ^ (*addr++ <<8); // xor into high byte.
for(bit=0; bit<8;bit++)
if ((int16_t)crc<0) // top bit of poly set.
crc = ((crc<<1) ^ 0x1021); // xor with poly
else
crc<<=1;
}
}
return crc;
}

I could see they were all highly inefficient, or worse still, wrong - some of the table-driven versions actually contained wrong entries as I found out last year when I wasted a day or so trying to figure out why a colleague's Crc algorithm didn't generate the same results as mine. And that was because he'd just copied and pasted the first implementation he'd seen on the internet - and that was even after I'd pointed him to this byte-parallel version and asked him to translate it to C#.

The reason I figure table-driven algorithms became popular is because they're easy on the brain. It's easy to grasp how a bit-serial algorithm relates to the CRC's polynomial and then easy to jump to a bit-serial algorithm that generates a table or just copy a table version directly. However, byte-parallel algorithms are, thankfully, making a comeback. Why? I guess because constrained MCUs are still used in a lot of applications and because cache-misses on table-driven CRCs are pretty costly on higher-end processors.

This leads me to - an alternative for a set of CRC algorithms published on MDFS.net, a wonderful site that has conversions of BBC Basic for every processor you'd ever want :-) (with the exception of an AVR). Here are equivalent byte-parallel versions for the same set of ancient processors and they run at least twice as fast:

First, Crc16_6502, which clocks in at 64 bytes and 92cycles per byte, over twice as fast as the bit-algorithm. In common with most of the 8-bit implementations, it's more efficient to swap the CrcHi and CrcLo nearer the end and instead perform the calculations on the 'wrong' halves of the Crc until then. The 6502 version also saves cycles by using y to represent both an index into the buffer, and the length of the buffer (which is incremented until it gets to 0). This means we have to adjust the buffer pointer and negate the length.

Crc16_6502: ;buff^buffer, y=len.
clc
tya
eor #255
adc #1 ;Negate length
beq Crc16_6502End
adc buff ;buff-len
;Carry would mean we don't need to dec buff+1
;but since we really have 256-length in y,
;then we need to inc buff+1 instead (no-carry
;means we don't need to inc buff+1)
bcc Crc16_6502Loop
inc buff+1

Crc16_6502Loop:
lda (buff),y
eor CrcHi
sta CrcHi ;really CrcLo
lsra
lsra
lsra
lsra ;(Crc&0xff)>>4.
eor CrcHi
sta Temp1
sta Temp0 ;Copy low byte for <<5 later.
asla
asla
asla
asla ;(Crc<<12)
eor CrcLo
sta CrcHi ;this is the new CrcHi
asl Temp0
rola
asl Temp0
rola
asl Temp0
rola
asl Temp0
rola
asl Temp0
rola ;<<5
eor CrcHi
sta CrcHi
lda Temp0
eor Temp1
sta CrcLo
iny
bne Crc16_6502
Crc16_6502End
rts

Then there's a Z80 version. This is more straight-forward, since there are enough registers to handle the entire algorithm. It clocks in at 33b and 139 T-states per byte, making it the shortest version and only 51% slower than a 6502 at the same clock speed. Here we use c as a temp while we perform the crc hi and lo swap over the course of the first two shifts, so that they end up nicely in hl ready for when we do the <<5 near the end.

Crc16_Z80: ;(Z80 style, b=length, de^data, hl=CRC).
ld a,(de)
inc de
xor h
ld c,a
rra ;the Z80 doesn't have a fast
rra ;right shift so we
rra ;rotate and mask.
rra
and a,15 ;>>4,
xor c
ld c,a ;new low byte, gets shifted.
add a,a
add a,a
add a,a
add a,a ;<<12.
xor l
ld h,a ;new crc hi
rrca
rrca
rrca ;<<5
ld l,a ;save in c
and 31
xor h
ld h,a
ld l,a
and 0E0h
xor c
ld l,a ;done.
djnz Crc16_Z80

Next up, the 6809. Despite having several 16-bit registers, the 6809's accumulator architecture means we need to allocate 2 temporary bytes on the stack and we can't make use of 16-bit operations on D. I estimate the length at 45bytes and the speed as 80 cycles per byte, a little faster than a 6502.

Crc16_6809: ;D=CRC, X^data, Y=Len.
leas -2,s ;Allocate 2 temp bytes.
Crc16_6809Loop:
eora ,x+
staa ,s
lsra
lsra
lsra
lsra
eora ,s
std ,s
lsla
lsla
lsla
lsla
eora 1,s
staa 1,s
ldab ,s
rorb
rorb
rorb
tfr b,a
anda #31
eora 1,s
rorb
andb #0xe0
eorb ,s
leay -1,y
bne Crc16_6809Loop
leas 2,s ;Deallocate 2 temp bytes.
rts

Finally, the pdp11 (with Extended Arithmetic). The pdp11's adequate number of 16-bit registers and programmer-friendly instruction set makes it easy to implement the algorithm. Nevertheless, if run on a typical 1970s pdp-11/34 it would require 40 bytes of code and 47.56µs per byte, roughly equivalent to a 2MHz 6502 or a 3MHz Z80. Yet more evidence to demonstrate that the pdp-11 in the 1970s wasn't theoretically much faster than a humble late 70s Microprocessor. Check out my Z80 Dhrystones article to discover why this might be.

Crc16_pdp11:
;With Extended Arithmetic. r0=crc, r1^data, r2=len, r3=tmp
swab r0
clr r4
movb r4,(r1)+
xor r0,r4 ;no byte version.
mov r3,#-4
movb r4,r0
ash r4,r3 ;>>4.
xor r0,r4 ;crc^=(crc&0xff)>>4
mov r4,r0 ;need copy
mov r3,#12 ;adds 1.6us vs much more for swap /and etc.
ash r4,r3 ;crc<<12
xor r0,r4
mov r4,r0
mov r3,#5
ash r4,r3 ;<<5
xor r0,r3 ;done.
sob r2,Crc16_pdp11

One Week Wonder