Showing posts with label avr. Show all posts
Showing posts with label avr. Show all posts

Saturday, 7 November 2015

Parallel CRC16 Collection 3

There's so many processors to write CRC16 algorithms on, and I feel like I'm on a roll!

In this section we look at some other popular processors: the MSP430 (another RISC-but-in-many-ways-not MCU); the venerable 8051 and the current king of MCUs, the ARM (Cortex M0).

The MSP430 is like a stripped-down pdp-11 with 16x16-bit registers and single-cycle instructions. But it's not great for computing 16-bit CRCs. This routine weighs in at 58 bytes and 28 cycles per byte, making it the same length and relatively the same performance as an H8. Only the 6502 is longer! The Msp430 has some features going for it, like it's ability to swap bytes and shift whole words in a single cycle, but the reason it's not as fast as a PIC or AVR is primarily because the MSP430 has no nybble operations and because sometimes its byte operations (which modify the whole word of the destination register, by zero-extending the result) get in the way. The MSP430's straight-forward <<5 is faster than the rotate right 3 times technique used on some 8-bit CPUs, because it doesn't need to save an intermediate and apply mask operations. But the neatest trick here is the >>4 . By zero-extending r4.b when moving into r7, we clear the upper byte, so '0' bits get shifted into bits 4..7 of r7. Then by zero-extending r7 again, we clear r7's original bits 3..0; and so we get a byte shift, in 1 cycle less than applying a mask.

Crc16Msp430:
    push r7
Crc16Msp430Loop:
    swpb r5       ;crc=(crc>
>8)(crc<<8)
    mov.b @r4+,r7 ;r7=new byte.
    xor r7,r5     ;r5=crc^=newByte.
    mov.b r5,r7   ;r7=crc and 255;
    rra r7
    rra r7
    rra r7
    rra r7
    mov.b r7,r7 ; clears bits 7 to 15 giving the >>4
    xor r7,r5
    mov.b r5,r7  ;Sign extend again.
    swpb r7      ;<&lt8
    add r7,r7
    add r7,r7
    add r7,r7
    add r7,r7 ;<<12
    xor r7,r5
    mov r5,r7 ;crc and 255 again.
    add r7,r7
    add r7,r7
    add r7,r7
    add r7,r7
    add r7,r7 ;<<5
    xor r7,r5
    Dec r6
    jne Crc16Msp430Loop
    pop r7
    ret ;


The 8051 version takes 31 bytes and 27 instruction cycles per byte - it's the shortest implementation so far (the PIC version uses at least 12-bit opcodes). On an original 12MHz 8051, this would have meant 27µs per byte, about 76% faster than a pdp-11.

Crc16_8051:
    push r3
Crc16_8051Loop:
    movx acc,@dptr
    inc dptr
    xrl a,r1
    mov r3,a
    swap
    anl a,15;
    xrl a,r3
    mov r3,a
    swap
    anl a,240 ;<<12
    xrl a,r0
    mov r1,a
    mov a,r3
    swap
    rl a ;<<5
    mov r0,a ;save
    anl a,31
    xrl a, r1
    mov r1,a
    mov a,r1
    anl a,240
    xrl a,r3
    mov r0,a
    djnz r2,Crc16_8051Loop
    pop r3
    ret


The Arm Thumb version. Using just the original Thumb instruction set, it can perform a CRC16 in just 18 cycles (46b), making it about 5.1x more efficient than a 6502 and nearly 7.4x more efficient than a 68000. This is mostly because the ARM thumb can perform multi-bit shifts in a single cycle. The processor is only let down when swapping the Crc16 high and low bytes (which is done just like in 'C', crc16=(crc16 >> 8) | (crc16 <<8)&0xffff. Later versions of Thumb provided zero-extend and byte swap instructions.

c65535: .word 65536
Crc16_Thumb:
    push {r3,r4}
    ldr r4,[PC,c65535]
Crc16_ThumbLoop:
    lsl r3,r0,#8
    lsr r0,r0,#8
    orr r0,r3
    and r0,r4
    ldrb r3,[r1],#1
    eor r0,r3
    mov r3,#240
    and r3,r0
    lsr r3,r3,#4
    eor r0,r3
    lsl r3,r0,#12
    eor r0,r3
    lsl r3,r0,#5;
    eor r0,r3
    and r0,r4
    subs r2,#1
    bne Crc16_ThumbLoop
    pop {r3,r4}
    rts


Finally, an AVR version. The webpage I based all of these on uses CRC code that operates on a byte at a time, so you wouldn't think there's any point in publishing one here, but my DIY FIGnition computer uses a more efficient one in its audio loading firmware:

;z^Data, r25:r24=Crc16, r26=Len.
Crc16Lo=r24
Crc16Hi=r25
BuffLo=r26,
BuffHi=r27
temp=r26
Len=r28

Crc16_AtMega:
    push r16
    push r30
    push r31 ;used to hold buff.
    movw z,BuffLo
    ldi r16,16
    ldi r27,32
Crc16_AtMegaLoop:
    mov temp,Crc16Lo
    mov Crc16Lo,Crc16Hi
    mov Crc16Hi, temp ; Swap hi and lo.
    ld temp,Z+
    eor Crc16Lo, temp ; crc^=ser_data;
    mul r16,Crc16Lo ; Faster than executing lsr 4 times.
    eor Crc16Lo, r1 ; crc^=(crc&0xff)>>4
    mul r16,Crc16Lo ;(crc&0xff)<<4
    eor Crc16Hi,r0 ;crc^=(crc<<8)<<4
                    ;(r0 wonderfully contains the other 4 bits)
    mul r27,Crc16Lo ; (crc&0xff)<<5
    eor Crc16Lo,r0
    eor Crc16Hi,r1
    subi Len,1
    bne Crc16_AtMegaLoop
    pop r31
    pop r30
    pop r16
    ret
In the above version (designed for a Mega AVR), we can save a cycle from the mov, swap, andi sequence by making use of the mul instruction. We can save 5 cycles in the <<5 code. This gives us 19cycles per byte, just 1 cycle longer than an ARM thumb (and the same length). Yet it's 7 cycles shorter than the other AVR version (53% faster!).

Saturday, 5 July 2014

BootJacker: The Amazing AVR Bootloader Hack!

There's an old adage that says if you don't know it's impossible you could end up achieving it. BootJacker is that kind of hack: a way for ordinary firmware on an AVR to reprogram its bootloader. It's something Atmel's manual for AVR microcontrollers with Bootloaders says is impossible (Note the italics):

27.3.1 Application Section.
The Application section is the section of the Flash that is used for storing the application code. The protection level for the Application section can be selected by the application Boot Lock bits (Boot Lock bits 0), see Table 27-2 on page 284. The Application section can never store any Boot Loader code since the SPM instruction is disabled when executed from the Application section.

Here's the background: I'm the designer of FIGnition, the definitive DIY 8-bit computer. It's not cobbled together from hackware from around the web, instead three years of sweat and bare-metal development has gone into this tiny 8-bitter. I've been working on firmware version 1.0.0 for a few months; the culmination of claiming that I'll put audio data transfer on the machine (along with a fast, tiny Floating point library about 60% of the size of the AVR libc one).

Firmware 1.0.0 uses the space previously occupied by its 2Kb USB bootloader and so, needs its own migration firmware image to copy the V1.0.0 firmware to external flash. The last stage is to reprogram the bootloader with a tiny 128b bootloader which reads the new image from external flash. Just as I got to the last stage I came across section 27.3.1, which let me know in no uncertain terms that I was wasting my time.

I sat around dumbstruck for a while ("How could I have not read that?") before wondering whether[1], crazies of crazy, imagining that a solution to the impossible might actually lead me there. And it turns out it does.

The solution is actually conceptually fairly simple. A bootloader, by its very nature is designed to download new firmware to the device. Therefore it will contain at least one spm instruction. Because the spm configuration register must be written no more than 4 cycles before the spm instruction it means there are very few sequences that practically occur: just sts, spm or out, spm sequences. So, all you need to is find the sequence in the bootloader section; set up the right registers and call it.

However, it turned out there was a major problem with that too. The V-USB self-programming bootloader's spm instructions aren't a neat little routine, but are inlined into the main code; so calling it would just cause the AVR to crash as it tried to execute the rest of the V-USB bootloader.

Nasty, but again there's a solution. By using a timer clocked at the CPU frequency (which is easy on an AVR), you can create a routine in assembler which sets up the registers for the Bootloader's out, spm sequence; calls it and just at the moment when it's executed the first cycle of the spm itself, the timer interrupt goes off and the AVR should jump to your interrupt routine (in Application space). The interrupt routine pops the bootloader address and then returns to the previous code - which is the routine that sets up the outspm sequence. This should work, because when you apply spm instructions to the bootloader section the CPU is halted until it's complete.

Here's the key part of BootJacker:

The code uses the Bootloader's spm to first write a page of flash which also contains a usable outspm sequence and then uses that routine to write the rest (because of course you might end up overwriting the bootloader with your own new bootloader!)

BootJacker involves cycle counting, I used a test routine to figure out the actual number of instructions executed after you set the timer for x cycles in the future (it's x-2). In addition I found there was one other oddity: erase and writes always have a 1 cycle latency after the SPM in a bootloader. I fixed this with a nop instruction in my mini bootloader.

This algorithm, I think is pretty amazing. It means that most bootloaders can in fact be overwritten using application firmware containing a version of BootJacker!

[1] As a Christian, I also have to fess' up that I prayed about it too. Not some kind of desperation thing, but some pretty calm prayer, trusting it'll get sorted out :-)