Saturday 7 November 2015

Parallel CRC16 Collection 3

There's so many processors to write CRC16 algorithms on, and I feel like I'm on a roll!

In this section we look at some other popular processors: the MSP430 (another RISC-but-in-many-ways-not MCU); the venerable 8051 and the current king of MCUs, the ARM (Cortex M0).

The MSP430 is like a stripped-down pdp-11 with 16x16-bit registers and single-cycle instructions. But it's not great for computing 16-bit CRCs. This routine weighs in at 58 bytes and 28 cycles per byte, making it the same length and relatively the same performance as an H8. Only the 6502 is longer! The Msp430 has some features going for it, like it's ability to swap bytes and shift whole words in a single cycle, but the reason it's not as fast as a PIC or AVR is primarily because the MSP430 has no nybble operations and because sometimes its byte operations (which modify the whole word of the destination register, by zero-extending the result) get in the way. The MSP430's straight-forward <<5 is faster than the rotate right 3 times technique used on some 8-bit CPUs, because it doesn't need to save an intermediate and apply mask operations. But the neatest trick here is the >>4 . By zero-extending r4.b when moving into r7, we clear the upper byte, so '0' bits get shifted into bits 4..7 of r7. Then by zero-extending r7 again, we clear r7's original bits 3..0; and so we get a byte shift, in 1 cycle less than applying a mask.

Crc16Msp430:
    push r7
Crc16Msp430Loop:
    swpb r5       ;crc=(crc>
>8)(crc<<8)
    mov.b @r4+,r7 ;r7=new byte.
    xor r7,r5     ;r5=crc^=newByte.
    mov.b r5,r7   ;r7=crc and 255;
    rra r7
    rra r7
    rra r7
    rra r7
    mov.b r7,r7 ; clears bits 7 to 15 giving the >>4
    xor r7,r5
    mov.b r5,r7  ;Sign extend again.
    swpb r7      ;<&lt8
    add r7,r7
    add r7,r7
    add r7,r7
    add r7,r7 ;<<12
    xor r7,r5
    mov r5,r7 ;crc and 255 again.
    add r7,r7
    add r7,r7
    add r7,r7
    add r7,r7
    add r7,r7 ;<<5
    xor r7,r5
    Dec r6
    jne Crc16Msp430Loop
    pop r7
    ret ;


The 8051 version takes 31 bytes and 27 instruction cycles per byte - it's the shortest implementation so far (the PIC version uses at least 12-bit opcodes). On an original 12MHz 8051, this would have meant 27µs per byte, about 76% faster than a pdp-11.

Crc16_8051:
    push r3
Crc16_8051Loop:
    movx acc,@dptr
    inc dptr
    xrl a,r1
    mov r3,a
    swap
    anl a,15;
    xrl a,r3
    mov r3,a
    swap
    anl a,240 ;<<12
    xrl a,r0
    mov r1,a
    mov a,r3
    swap
    rl a ;<<5
    mov r0,a ;save
    anl a,31
    xrl a, r1
    mov r1,a
    mov a,r1
    anl a,240
    xrl a,r3
    mov r0,a
    djnz r2,Crc16_8051Loop
    pop r3
    ret


The Arm Thumb version. Using just the original Thumb instruction set, it can perform a CRC16 in just 18 cycles (46b), making it about 5.1x more efficient than a 6502 and nearly 7.4x more efficient than a 68000. This is mostly because the ARM thumb can perform multi-bit shifts in a single cycle. The processor is only let down when swapping the Crc16 high and low bytes (which is done just like in 'C', crc16=(crc16 >> 8) | (crc16 <<8)&0xffff. Later versions of Thumb provided zero-extend and byte swap instructions.

c65535: .word 65536
Crc16_Thumb:
    push {r3,r4}
    ldr r4,[PC,c65535]
Crc16_ThumbLoop:
    lsl r3,r0,#8
    lsr r0,r0,#8
    orr r0,r3
    and r0,r4
    ldrb r3,[r1],#1
    eor r0,r3
    mov r3,#240
    and r3,r0
    lsr r3,r3,#4
    eor r0,r3
    lsl r3,r0,#12
    eor r0,r3
    lsl r3,r0,#5;
    eor r0,r3
    and r0,r4
    subs r2,#1
    bne Crc16_ThumbLoop
    pop {r3,r4}
    rts


Finally, an AVR version. The webpage I based all of these on uses CRC code that operates on a byte at a time, so you wouldn't think there's any point in publishing one here, but my DIY FIGnition computer uses a more efficient one in its audio loading firmware:

;z^Data, r25:r24=Crc16, r26=Len.
Crc16Lo=r24
Crc16Hi=r25
BuffLo=r26,
BuffHi=r27
temp=r26
Len=r28

Crc16_AtMega:
    push r16
    push r30
    push r31 ;used to hold buff.
    movw z,BuffLo
    ldi r16,16
    ldi r27,32
Crc16_AtMegaLoop:
    mov temp,Crc16Lo
    mov Crc16Lo,Crc16Hi
    mov Crc16Hi, temp ; Swap hi and lo.
    ld temp,Z+
    eor Crc16Lo, temp ; crc^=ser_data;
    mul r16,Crc16Lo ; Faster than executing lsr 4 times.
    eor Crc16Lo, r1 ; crc^=(crc&0xff)>>4
    mul r16,Crc16Lo ;(crc&0xff)<<4
    eor Crc16Hi,r0 ;crc^=(crc<<8)<<4
                    ;(r0 wonderfully contains the other 4 bits)
    mul r27,Crc16Lo ; (crc&0xff)<<5
    eor Crc16Lo,r0
    eor Crc16Hi,r1
    subi Len,1
    bne Crc16_AtMegaLoop
    pop r31
    pop r30
    pop r16
    ret
In the above version (designed for a Mega AVR), we can save a cycle from the mov, swap, andi sequence by making use of the mul instruction. We can save 5 cycles in the <<5 code. This gives us 19cycles per byte, just 1 cycle longer than an ARM thumb (and the same length). Yet it's 7 cycles shorter than the other AVR version (53% faster!).

Tuesday 3 November 2015

Parallel CRC16 Collection 2


Hello folks,

I recently posted a collection of byte-parallel CRC16 implementations and thought I'd add a few more.

Firstly, a 6800 version. Since I don't have the original BSI volume, I don't know exactly how it was coded then, but I vaguely remember it being about 26 instructions long. The 6800 version is actually shorter and faster than my 6809 version despite the 6809 being a far better CPU with better CPI and more registers. That's because I end up having to use 0 page for speed and that's still faster than stack indexing:


Crc16_6800: ;a:b=CRC, X^data.
;uses page 0 temp0, temp1, Len.
Crc16_6800Loop:
eora 0,x
staa temp1 ;hi now swapped
lsra
lsra
lsra
lsra
eora temp1
staa temp1
stab temp0
lsla
lsla
lsla
lsla
eora temp0
staa temp0
rorb
rorb
rorb
tba
anda #31
eora temp0
rorb
andb #0xe0
eorb temp1
inx
dec len
bne Crc16_6800Loop
rts

Next, the Hitachi H8 version. In this version, R0=CRC, R1^data, R2=Len. R3 is a temp. The H8 is interesting, because it was part of a 1990s generation of Microcontrollers designed for high-level languages like 'C'. Hitachi claimed it was RISC, though it isn't due to the excessive number of instruction formats and addressing modes. It's more like a 16-bit micro-controller version of the 68000 or pdp-11. The performance looks decent at 57 ø-clock cycles, but it isn't really that great because a ø-clock cycle is half the XTAL clock frequency, making it 114 XTAL oscillations per CRC byte.

Crc16_H8:
push r3
Crc16_H8Loop: ;start by pretending to swap.
mov.b @r1+,r3h ;get the next byte.
xor.b r3h,r0h ;
mov.b r0h,r3h ;need a copy.
shlr r3h
shlr r3h
shlr r3h
shlr r3h ; >>4
xor.b r3h,r0h ;
mov.b r0h,r3h ;copy again.
shll r3h
shll r3h
shll r3h
shll r3h ; <<4
xor.b r3h,r0l ;Into what was the low byte.
mov.b r0l,r3h
rotr r3h
rotr r3h
rotr r3h
mov.b r3h,r3l
and.b #31,r3l
xor.b r0h,r3l
and.b #0xe0,r3h
xor.b r0l,r3h
mov.w r3,r0
subs #1,r2
bne Crc16_H8Loop
pop r3
rts

So, let's compare that with the PIC I originally coded it for, a PIC16C55. This isn't the actual code, I don't have rights to that, so I've just rewritten the equivalent algorithm, and anyway, the version I used would probably have calculated the CRC on the fly, as each byte was received. Surprisingly, the humble PIC cpu (which predates Commerical RISC CPUs by a decade), manages to implement the algorithm in 25 instructions and 25 instruction cycles per byte, which works out as 100 clocks per byte.

Crc16_Pic16C55: ;pretend we've swapped.
movf ind,w ;got the next byte.
incf fsr
xorwf gCrc16+1,f ;xor with lo byte.
swap gCrc16,w ;copy and swap.
andlw 15 ;this is the >>4!
xorwf gCrc16+1,w ;'lo' byte again.
movwf gTemp2
swap gCrc16,w ;copy and swap
andlw 0xf0
xorwf gCrc16,f ;this time into 'hi' byte.
rrf gTemp2,w ;get 'lo' mostly calc'd Crc.
movwf gTemp1 ;save in temp1.
rrf gTemp1,f ;Have to rotate into f, because
rrf gTemp1,f ;can't rotate w by itself.
movf gTemp1,w
andlw 31
xorwf gCrc16,w
movwf gCrc16+1 ;OK to overwrite, since new 'lo' byte in gTemp2
rrf gTemp1,w ;and again, because of the carry.
andlw 0xe0
xorwf gTemp2,w
movwf gCrc16
decfsz gCrcLen
goto Crc16_Pic16C55
retlw 0 ;done.

We've mentioned the 68000 CPU a couple of times, so let's compare it. Here, we can see how the 68000 is well suited to high-level languages, the straight-forward implementation closely matches the algorithm and is  only 19words, 38 bytes long, almost as short as a Z80! The timing on the other hand - yike, 134 cycles per byte, only slightly more efficient than a Z80. This is mostly due to the fact that a 68000 had a 4 clock bus cycle, but also because the 68000 isn't very good at byte manipulation. On the other hand, the 68000 has a decent decrement and branch instruction.

Crc16_68K:
move.l d2.w,-(sp)
bra.s Crc16_68KWhile
Crc16_68KLoop:
rol.w #8,d0 ;this time we have to rotate to begin with.
eor.b (a0)+,d0 ;
move.b d0,d2 ;crc&0xff just need byte ops this time
lsr.b #4,d2 ;..>>4
eor.b d2,d0 ;crc^= above calculation.
move.b d0,d2 ;crc&0xff (we'll shift out upper 8 bits).
lsl.w #8,d2 ;..<<8
lsl.w #4,d2 ;..<<4
eor.w d2,d0 ;crc^= above calculation.
moveq #0,d2 ;(Faster than a move.b and and.w #0xff)
move.b d0,d2 ;crc&0xff
lsl.w #5,d2 ;..<<5
eor.w d2,d0 ;done.
Crc16_68KWhile:
dbra d1,Crc16_68KLoop
move.l (sp)+,d2
rts

Finally, let's also compare it with the 68000's arch-rival, the 8086. Here we can see that the 8086 isn't so well suited to a high-level language, mostly due to the optimisations needed for the byte operations and judging single bit shifts vs shifting with cl. The length is the second worst at 54 bytes and the standard instruction timings for the 8086 would give a total cycle time of 80 cycles (as efficient as a 6809) but this fails to take into account how the prefetch queue empties when executing linear code whose instructions are < 4 cycles. This bumps it up to 112 cycles per byte, which means a standard 8MHz 68000 would be about 33% faster than a 5MHz 8086. Both of them, however, are much faster than a pdp-11/34, (2.8x and 2.13x respectively).

Crc16_8086:
push dx
Crc16_8086Loop:
mov dh,al ;2(4)
mov dl,ah ;swap to begin with. 2(2)
lodsb ;al=[si]+ 12(6)
xor dl,al ;3(5)
mov al,dl ;2(3)
shr al,1 ;2(1)
shr al,1 ;4(0)
shr al,1 ;4(0)
shr al,1 ;4(0), still faster than mov cl,4: shr al,cl
xor dl,al ;4(0)
mov al,dl ;4(0) upper byte.
shl al,1 ;4(0)
shl al,1 ;4(0)
shl al,1 ;4(0)
shl al,1 ;4(0)
xor dh,al ;4(0) the <<12
xor ah,ah ;4(0)
mov al,dl ;4(0)
shl ax,1 ;4(0)
shl ax,1 ;4(0)
shl ax,1 ;4(0)
shl ax,1 ;4(0)
shl ax,1 ;4(0) Perform a straight-forward <<5
xor ax,dx ;4(0)
loop Crc16_8086Loop ;17c. 2 bytes fetched = 4c, so 13c can refill BIU
pop dx
ret