Sunday 6 November 2016

uxForth: Unexpanded forth for a standard VIC-20. Part 2, the inner interpreter.

I'm the developer of the DIY 8-bit computer FIGnition, but it doesn't mean I'm not interested in other retro computers and the idea of developing a minimal Forth for the ancient, but cute Commodore VIC-20 is irresistable!

In the first part I talked about the appeal of the VIC-20 and how much usable RAM I thought I could squeeze out of it.

That turned out to be between 3947 bytes and 4646 bytes depending on whether we count the screen and the CPU stack. And this sounded more credible, except that I want at least 1Kb of RAM for user programs which brings me back to 2923 to 3622 bytes. A terrible squeeze after all.

There's one obvious way to tackle that: use the Token Forth model. A definitive articles covering all the trade-offs with developing Forth are in the series "Moving Forth" by Brad Rodriguez, but here, we just need to recap on the most popular Forth models.

Forth Execution Models

Forth normally represents its programs as lists of addresses which eventually point to machine code. The mechanism is handled by the inner Forth interpreter called "Next". The traditional Forth model implements what's called Indirect Threaded Code.



Here, each forth command (in blue) points to an indirect address (in green) which points to some machine code (in pink). Primitive commands in Forth (like DUP, >R, SWAP and R> here) have an indirect address which points to the next RAM location where the machine code starts. But commands written in Forth itself (like ROT) start with an indirect address which points to ENTER which implements a function call and is then followed by more Forth commands (in blue). A Forth command like this then ends in EXIT, which returns Forth execution to the next calling function (MYFUNC).

The next Forth model to consider is Direct Threaded Code. Here's the same thing:


Here, every forth command (in blue) points directly to machine code (in pink). Primitive commands are executed directly, but commands written in Forth itself (like ROT) start with a "JSR Enter" machine code instruction which saves the return address (to Forth code) on the normal stack and in the DTC Forth, this return address is used as the new Forth Instruction Pointer after pushing the old IP. We can see that DTC will normally be faster than ITC because there's less indirection.

Token Threaded Forth is essentially a byte-coded Forth, except that in the case of commands written in Forth itself, the NEXT routine uses the top bit of the token to denote an address. Thus, only a maximum of 128 tokens can be supported and only 32Kb of Forth code.


In this example, we can see that the Forth code has been reduced from 14 bytes to 8b, but there is a jump table of addresses which is the same size as the indirect entries in ITC (10b used for these entries). DTC used an additional JSR (3 bytes) for the ':' defined word, but TTC didn't need any extra bytes for the ':' definition (it uses a single bit, encoded in the $93A0 address). Here, the overhead of ITC weighs in at 24 bytes, TTC weighs in at 18 bytes and DTC weighs in at 17 bytes.

We can see that TTC could significantly reduce the size of Forth code if the forth tokens are used often enough, but traditionally a byte-coded interpreter is slower than a threaded code interpreter. uxForth won't beat a DTC Forth, so the question is whether it can compete with an ITC Forth.

Execution Timings

ITC Forth:

NEXT      LDY #1
          LDA (IP),Y ;Fetch
          STA W+1    ;Indirect
          DEY        ;Addr
          LDA (IP),Y ;to
          STA W      ;W
          CLC        ;Inc
          LDA IP
          ADC #2
          STA IP     ;IP.lo
          BCC L54    ;If CY,
          INC IP+1   ;inc IP.hi
L54       JMP IndW
IndW:  .byte $6c ;JMP () opcode
W         .word 0

This is the implementation from the original 6502 FIG Forth. It uses zero-page for IP and W. The indirection is achieved by jumping to an indirect Jump.  It requires 41 cycles.

DTC Forth

NEXT      LDY #1
          LDA (IP),Y ;Fetch
          STA W+1    ;Indirect
          DEY        ;Addr
          LDA (IP),Y ;to
          STA W      ;W
          CLC        ;Inc
          LDA IP
          ADC #2
          STA IP     ;IP.lo
          BCC L54    ;If CY,
          INC IP+1   ;inc IP.hi
L54       JMP (W)
W         .word 0

This is a simple derivation from the original 6502 FIG Forth. As before it uses zero-page for IP and W. The indirection is achieved using a simple indirection.  It requires 36 cycles.

UxForth

Next:
lda (gIp),Y ;byte code.
asla ;*2 for gVecs index
                        ;Also 'Enter' bit in Carry
iny ;inc gIp.lo
beq Next10 ;page inc?
;no page inc, fall-through.
Next5:
bcc Enter ;Handle Enter.
Next7:
sta Next7+4 ;modify following Jmp.
jmp (gVecs) ;exec byte code.
Next10:
inc gIp+1 ;inc page
bcs Next7 ;now handle token/enter.

This is the proposed UxForth implementation. The UxForth version has to handle both multiplying the token by 2 to get the index for the jump table (gVecs) and testing to see if it's a call to another Forth routine (bcc Enter). It requires 22 cycles, so we can see that it's almost twice as fast as the ITC version. This is because it has one natural advantage and uses several techniques to improve the speed:

  • Y is used to hold the low byte of IP, thus when we execute lda (gIP),Y , only the upper byte of gIP is used, the lower byte is always 0.
  • Branches are arranged so that the common case is the fall-through case. Thus when IP increments over a page boundary two jumps are needed.
  • We normally only have to read one instruction byte instead of two. This is the one natural advantage TTC has over ITC or DTC.
  • The vector is stored directly in the code (the second byte of jmp (gVecs) ).
18b vs 26b for ITC Forth and 25b for DTC forth. It's possible to use most of these techniques to improve the speed of ITC and DTC 6502 Forth, but I'm not so concerned about that, because the easiest to access VIC-20 Forth is the Datatronic Forth (which is an 8Kb ROM cartridge) and Datatronic Forth uses exactly the same version of NEXT as FIG-Forth.


Conclusion

RAM is still very tight, but we can reduce its usage by implementing a byte-coded Forth and we should find it's perhaps up to twice as fast as a traditional FIG-FORTH implementation.

In the next post we'll look at how we might map our Forth implementation to the available RAM regions!