Saturday 31 October 2015

Parallel CRC16 Collection

I first came across CRCs in 1996, when I was writing the audio software for the Heathrow Express. A couple of engineers from the company I worked for opened a dusty BSI volume, pointed me to a short algorithm for the 6800 (an ancient CPU, even by 1990s standards) and told me to use that algorithm.

I had to translate it for the PIC and Hitachi H8 MCUs I was using for the project, but that wasn't a major hassle. The puzzle was why they seized upon that moment to insist I use a specific, short algorithm. Did they think I was a mathematical dunce who couldn't implement something involving a few XORs? Were they just trying to impress me with their claim (I'm willing to accept it) that they actually wrote that algorithm and got it published in BSI standards? Were they just sticklers for standardized solutions?

Well, I was quite happy, because it was a bit of an education. It wasn't until about 5 years later that I started to realize something was amiss with CRC16 algorithms: no-one else seemed to be using byte-parallel algorithms. Surely a BSI (and presumably US / ISO) standard would dominate? But here I was in other commercial environments where the references and implementations were all bit-serial or table-driven algorithms of this kind:


uint16_t crc16(const unsigned char *addr, int len, uint16_t crc)
{
  int bit;
  while(len-- >0) { // For each byte.
    crc = crc ^ (*addr++ <<8); // xor into high byte.
    for(bit=0; bit<8;bit++) {
      if ((int16_t)crc<0) // top bit of crc set.
        crc = ((crc<<1) ^ 0x1021); // xor with poly
      else
        crc<<=1;
    }
  }
  return crc;
}

I could see they were all highly inefficient or, worse still, wrong. Some of the table-driven versions actually contained wrong entries, as I found out last year when I wasted a day or so trying to figure out why a colleague's CRC algorithm didn't generate the same results as mine. He'd just copied and pasted the first implementation he'd seen on the internet, even after I'd pointed him to this byte-parallel version and asked him to translate it to C#.

The reason I figure table-driven algorithms became popular is because they're easy on the brain. It's easy to grasp how a bit-serial algorithm relates to the CRC's polynomial and then easy to jump to a bit-serial algorithm that generates a table or just copy a table version directly. However, byte-parallel algorithms are, thankfully, making a comeback. Why? I guess because constrained MCUs are still used in a lot of applications and because cache-misses on table-driven CRCs are pretty costly on higher-end processors.
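To make the bit-serial-to-table jump concrete, here is a minimal C sketch that builds the 256-entry table from the bit-serial step and then uses it. The function names are my own for illustration, and it assumes the same polynomial (0x1021, MSB first) as the listing above. Generating the table this way, rather than pasting one in, also sidesteps the wrong-entry problem.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-serial update for a single byte (polynomial 0x1021, MSB first). */
uint16_t crc16_bitwise_byte(uint16_t crc, uint8_t byte)
{
    crc ^= (uint16_t)((uint16_t)byte << 8);      /* xor into high byte */
    for (int bit = 0; bit < 8; bit++) {
        if (crc & 0x8000u)
            crc = (uint16_t)((crc << 1) ^ 0x1021u);
        else
            crc = (uint16_t)(crc << 1);
    }
    return crc;
}

/* Entry i is the CRC of the single byte i, starting from a zero CRC. */
void crc16_make_table(uint16_t table[256])
{
    for (int i = 0; i < 256; i++)
        table[i] = crc16_bitwise_byte(0, (uint8_t)i);
}

/* Table-driven update: one lookup per byte instead of eight shifts. */
uint16_t crc16_table(const uint16_t table[256], const uint8_t *data,
                     size_t len, uint16_t crc)
{
    while (len--)
        crc = (uint16_t)((crc << 8) ^ table[(crc >> 8) ^ *data++]);
    return crc;
}
```

The table costs 512 bytes of ROM or cache, which is exactly the trade-off that hurts on small MCUs and on cache-sensitive processors.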

This leads me to an alternative to a set of CRC algorithms published on MDFS.net, a wonderful site that has conversions of BBC Basic for every processor you'd ever want :-) (with the exception of the AVR). Here are equivalent byte-parallel versions for the same set of ancient processors, and they run at least twice as fast:

First, Crc16_6502, which clocks in at 64 bytes and 92 cycles per byte, over twice as fast as the bit-serial algorithm. In common with most of the 8-bit implementations, it's more efficient to swap CrcHi and CrcLo nearer the end and instead perform the calculations on the 'wrong' halves of the Crc until then. The 6502 version also saves cycles by using y to represent both an index into the buffer and the length of the buffer (which is incremented until it gets to 0). This means we have to adjust the buffer pointer and negate the length.


Crc16_6502: ;buff^buffer, y=len.
clc
tya
eor #255
adc #1 ;Negate length
beq Crc16_6502End
adc buff ;buff-len
;Carry would mean we don't need to dec buff+1
;but since we really have 256-length in y,
;then we need to inc buff+1 instead (no-carry
;means we don't need to inc buff+1)
bcc Crc16_6502Loop
inc buff+1

Crc16_6502Loop:
lda (buff),y
eor CrcHi
sta CrcHi ;really CrcLo
lsra
lsra
lsra
lsra ;(Crc&0xff)>>4.
eor CrcHi
sta Temp1
sta Temp0 ;Copy low byte for <<5 later.
asla
asla
asla
asla ;(Crc<<12)
eor CrcLo
sta CrcHi ;this is the new CrcHi
asl Temp0
rola
asl Temp0
rola
asl Temp0
rola
asl Temp0
rola
asl Temp0
rola ;<<5
eor CrcHi
sta CrcHi
lda Temp0
eor Temp1
sta CrcLo
iny
bne Crc16_6502Loop ;next byte (not back to the length setup)
Crc16_6502End:
rts
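For reference, the byte-parallel steps named in the listing's comments ((Crc&0xff)>>4, Crc<<12 and <<5) can be written out in C. This is a sketch of the commonly published byte-wise form for polynomial 0x1021, not necessarily the exact BSI text. The function names are my own, and a bit-serial version is included only so the two can be compared.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-parallel CRC16, polynomial 0x1021: five steps per byte, no table. */
uint16_t crc16_byte_parallel(const uint8_t *data, size_t len, uint16_t crc)
{
    while (len--) {
        crc = (uint16_t)((crc >> 8) | (crc << 8)); /* swap CrcHi and CrcLo    */
        crc ^= *data++;                            /* fold in the next byte   */
        crc ^= (uint16_t)((crc & 0xFFu) >> 4);     /* (Crc&0xff)>>4           */
        crc ^= (uint16_t)(crc << 12);              /* Crc<<12                 */
        crc ^= (uint16_t)((crc & 0xFFu) << 5);     /* (Crc&0xff)<<5           */
    }
    return crc;
}

/* Bit-serial reference for the same polynomial, for comparison only. */
uint16_t crc16_bit_serial(const uint8_t *data, size_t len, uint16_t crc)
{
    while (len--) {
        crc ^= (uint16_t)((uint16_t)*data++ << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u)
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}
```

Both functions should return 0x29B1 for the nine ASCII bytes "123456789" when started from 0xFFFF, which is a handy check value when translating to another processor.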

Then there's a Z80 version. This is more straight-forward, since there are enough registers to handle the entire algorithm. It clocks in at 33b and 139 T-states per byte, making it the shortest version and only 51% slower than a 6502 at the same clock speed. Here we use c as a temp while we perform the crc hi and lo swap over the course of the first two shifts, so that they end up nicely in hl ready for when we do the <<5 near the end.

Crc16_Z80: ;(Z80 style, b=length, de^data, hl=CRC).
ld a,(de)
inc de
xor h
ld c,a
rra ;the Z80 doesn't have a fast
rra ;right shift so we
rra ;rotate and mask.
rra
and a,15 ;>>4,
xor c
ld c,a ;new low byte, gets shifted.
add a,a
add a,a
add a,a
add a,a ;<<12.
xor l
ld h,a ;new crc hi
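ld a,c ;reload new low byte (not in the 33b/139T quoted above: +1 byte, +4 T-states)
rrca
rrca
rrca ;rotate right 3: low 5 bits = lo>>3, top 3 bits = lo<<5
ld l,a ;save rotated value in l
and 31
xor h
ld h,a ;final crc hi
ld a,l ;recover rotated value
and 0E0h
xor c
ld l,a ;done.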
djnz Crc16_Z80
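The rrca / and 31 / and 0E0h sequence near the end relies on one 8-bit rotate supplying both halves of the <<5 step. Here is a small C model of that idea, assuming separate hi and lo bytes as in the 8-bit listings. The function names are hypothetical.

```c
#include <stdint.h>

/* 8-bit rotate right by 3, modelling three Z80 RRCA instructions. */
uint8_t rotr8_by3(uint8_t v)
{
    return (uint8_t)((v >> 3) | (v << 5));
}

/* Apply crc ^= (crc & 0xff) << 5 to a CRC held as two separate bytes. */
void crc16_shift5_step(uint8_t *hi, uint8_t *lo)
{
    uint8_t r = rotr8_by3(*lo);
    *hi ^= (uint8_t)(r & 0x1Fu); /* low 5 bits of r are lo>>3: they land in hi */
    *lo ^= (uint8_t)(r & 0xE0u); /* top 3 bits of r are lo<<5: they stay in lo */
}
```

Three rotates right stand in for five shifts left across a 16-bit pair, which is where much of the saving over a naive 16-bit shift comes from.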

Next up, the 6809. Despite having several 16-bit registers, the 6809's accumulator architecture means we need to allocate 2 temporary bytes on the stack and we can make little use of 16-bit operations on D. I estimate the length at 45 bytes and the speed at 80 cycles per byte, a little faster than a 6502.

Crc16_6809: ;D=CRC, X^data, Y=Len.
leas -2,s ;Allocate 2 temp bytes.
Crc16_6809Loop:
eora ,x+
sta ,s
lsra
lsra
lsra
lsra
eora ,s
std ,s
lsla
lsla
lsla
lsla
eora 1,s
sta 1,s
ldb ,s
rorb
rorb
rorb
tfr b,a
anda #31
eora 1,s
rorb
andb #0xe0
eorb ,s
leay -1,y
bne Crc16_6809Loop
leas 2,s ;Deallocate 2 temp bytes.
rts


Finally, the pdp-11 (with Extended Arithmetic). The pdp-11's adequate number of 16-bit registers and programmer-friendly instruction set make it easy to implement the algorithm. Nevertheless, if run on a typical 1970s pdp-11/34 it would require 40 bytes of code and 47.56µs per byte, roughly equivalent to a 2MHz 6502 (92 cycles is 46µs) or a 3MHz Z80 (139 T-states is about 46.3µs). Yet more evidence that the pdp-11 in the 1970s wasn't theoretically much faster than a humble late-70s microprocessor. Check out my Z80 Dhrystones article to discover why this might be.

Crc16_pdp11:
;With Extended Arithmetic. r0=crc, r1^data, r2=len, r3=shift count, r4=tmp
swab r0
clr r4
movb r4,(r1)+
xor r0,r4 ;no byte version.
mov r3,#-4
movb r4,r0
ash r4,r3 ;>>4.
xor r0,r4 ;crc^=(crc&0xff)>>4
mov r4,r0 ;need copy
mov r3,#12 ;adds 1.6us vs much more for swap /and etc.
ash r4,r3 ;crc<<12
xor r0,r4
mov r4,r0
mov r3,#5
ash r4,r3 ;<<5
xor r0,r4 ;done.
sob r2,Crc16_pdp11