
Wednesday, 8 March 2017

uxForth: Unexpanded forth for a standard VIC-20. Part 3, the memory map

I'm the developer of the DIY 8-bit computer FIGnition, but that doesn't mean I'm not interested in other retro computers, and the idea of developing a minimal Forth for the ancient, but cute, Commodore VIC-20 is irresistible!

Part 1 talks about the appeal of the VIC-20 and what a squeeze it will be to fit Forth into its meagre RAM.

In Part 2 I discussed choices for the inner interpreter and found out that a token Forth model could be both compact and about as fast as DTC.

Now I'm going to allocate the various parts of the Forth system to VIC-20 memory to make the best of what's there. Some of it will be fairly conventional and some somewhat unorthodox.

(An Aside: The Slow uxForth Development Process)

From the presentation of the blog entries it looks like I'm working these things out as I go along. It's worth asking, for example, how I can leap to fairly concrete decisions about the inner interpreter, or why I'm confident I'll be able to fit the entire system into the available space.

The simple answer is that I've already done much of the work to make this possible. I've already written the code that implements the primitives (in fact I've written, modified and rewritten it a few times as I've improved it). I've made use of the wonderful resources at 6502.org, particularly the idea of splitting the instruction pointer (called gIp in my implementation) so that the zero-page variable holds the page and the Y register holds the byte offset within the page: it really does improve the performance of the core Next routine.

Similarly, I've written the non-primitive code and accounted for the space. It's written in Forth with a home-brew Forth metacompiler written in C. So there will be a future blog post on that too!

However, it's not a cheat as such. The code isn't tested yet, nor even loaded into a real VIC-20 or an emulator (I don't have a real VIC-20 :-( ). I still have real decisions to make as the blog continues, which means I can make real mistakes too and have to correct them. What I've done, really, is basically a feasibility study, so that you don't waste your time reading the blog. And of course, the whole of uxForth will be released publicly, under a GPL licence, via my GitHub account.

Admittedly, it's being released slowly: a 2.75Kb program that I hope to release over the course of 2017!

The Memory Map

Page 0

Page 0 is the gold dust of every 6502 system: versatile and in short supply. BASIC uses the first 0x90 bytes and the KERNAL uses most of the rest. We'll take over all 0x90 BASIC bytes for the data stack and some key system variables, plus the four spare bytes at $FB to $FE:


Addr  Size  Name      Comment
$00   2     gIp       Instruction pointer; the lower byte is always 0.
$02   1     gTmpLo    Temporary byte.
$03   1     gTmpHi    Temporary byte used for indirect access.
$04   2     gILimit   The limit for the innermost do..loop. uxForth (and FIGnition Forth) differs from most Forths in that the innermost loop's values, the limit and the current count, are held in these global locations. do pushes the previous gILimit and gICount to the stack; thus r is equivalent to j in other Forths (see the sketch after this table).
$06   2     gICount   The current loop count for the innermost do..loop.
$08   1     gUpState  The current compilation state.
$09   1     gUpBase   The current number base.
$0a   2     gUpDp     The current dictionary pointer.
$0c   2     gUpLast   A pointer to the header of the most recently compiled dictionary entry.
$0e   2     gUpTib    The pointer to the input buffer (I'm not sure if we need this).
$10   128   gDs       The data stack.
$fb   2     gTmpPtr0  Spare pointer 0.
$fd   2     gTmpPtr1  Spare pointer 1.
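Because do saves the previous gILimit/gICount pair like this, inside a nested loop r picks up the outer loop's count. Here's a minimal sketch, assuming do, loop and . are available as ordinary secondaries (only their run-time primitives appear in the token table later):

: outer-counts   3 0 do  3 0 do  r .  loop  loop ;

Running outer-counts prints 0 0 0 1 1 1 2 2 2: each pass of the inner loop prints the outer loop's count via r, which is exactly what j would give you in a conventional Forth.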

Page 1

Page 1 is the return stack as you might expect. Oddly enough, we only get 192b, because the KERNAL uses $100 to $13F.

Page 2

There are 89 bytes available here because they're only used by BASIC, which we won't be running. I plan to use them for the byte-code vectors; 44 two-byte vectors come to 88 bytes, which just fits. The vectors are:

 #    Name        #    Name        #    Name     #    Name
$00   (nop)      $0b   (+loop)    $16   u/      $21   rp!
$01   ;s         $0c   0<         $17   @       $22   drop
$02   exec       $0d   0=         $18   c@      $23   dup
$03   (native)   $0e   +          $19   !       $24   over
$04   (lit8)     $0f   neg        $1a   c!      $25   swap
$05   (lit16)    $10   and        $1b   r>      $26   (vardoes)
$06   0          $11   or         $1c   >r      $27   (constdoes)
$07   (0branch)  $12   xor        $1d   r       $28   inkey
$08   (branch)   $13   >>         $1e   sp@     $29   emit
$09   (do)       $14   <<         $1f   sp!     $2a   at
$0a   (loop)     $15   *          $20   rp@     $2b

Some of these codes have no names in the dictionary, to save space; the way you'd insert them into code is with a [ nn c, ] sequence.
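For example (a sketch, assuming (nop), token $00, ends up as one of the headerless codes), you drop into interpret state in the middle of a definition and lay the token byte down by hand:

: nopped   [ 0 c, ] ;

Here 0 c, appends the single (nop) token byte directly into the body of nopped.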

Page 3 and Page 4

There are a total of 116 bytes free from $2A0 to $313; I'll fill that area with some of the actual native definitions.

The cassette buffer is at $33c to $3fb. We'll be using the cassette for storage, so we can't use it for code.

Pages 16 to 29 ($1000 to $1dff)

This is the area of RAM reserved for BASIC. It will contain the rest of the Forth system.

The screen RAM ($1e00 to $1ff9)

The end of RAM on an unexpanded VIC-20 is used for the screen. The plan here is to use that area as the editing space. Instead of implementing a line editor (ACCEPT in FIG-Forth and early FIGnition Forth), we use key to call the KERNAL screen editor and let it manage the editing of a line, including cursor movement. Importantly, pressing Return doesn't execute the command line or start interpretation; instead, pressing F1 exits the editor and sets the interpretation point to the current cursor position. Interpretation then runs from there to the end of the screen, with emit turned off until it reaches the end of the screen.

In addition, pressing F2 saves the screen bytes onto cassette.

This is how I'll implement storage in a fairly minimal way. By implementing save via F2 I can save a block (actually the 506 screen bytes are roughly half a traditional 1Kb block), but load is a normal word, so multiple blocks can be loaded (you just add load to the end of each block).

So, this is how you'd do normal editing operations. For ordinary interactive commands you'd place the cursor near the end of the screen, edit up to the end of the screen, cursor back to the first character you want interpreted, and then press F1. In a sense this is easy, because you can just press Return and then cursor up until you get there. The same method also works if you want to compile a whole screen's worth of code. load itself resets the cursor position to home and then returns to the interpreter, so placing a load at the end of a screen loads the next screen without any recursion. That way you'd be able to develop programs longer than a single screen without manual reloading.
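As a sketch of how that looks in source (star and stars are just example words), a longer program simply ends each screen with load:

: star    42 emit ;
: stars   5 0 do star loop ;
load

When the interpreter reaches the final load, it pulls the next screen in from tape and interprets that one too.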

Conclusion

In the memory allocation of uxForth, we've squirrelled away about 1053 bytes of RAM, embedding the line buffer in the screen and a number of system variables in page 0. We've also found 212 bytes we can use for the program proper. It won't get much better than this!

In the next post I hope to talk in more detail about the implementation of the primitive words and the code used to test them.

Sunday, 6 November 2016

uxForth: Unexpanded forth for a standard VIC-20. Part 2, the inner interpreter.

I'm the developer of the DIY 8-bit computer FIGnition, but that doesn't mean I'm not interested in other retro computers, and the idea of developing a minimal Forth for the ancient, but cute, Commodore VIC-20 is irresistible!

In the first part I talked about the appeal of the VIC-20 and how much usable RAM I thought I could squeeze out of it.

That turned out to be between 3947 and 4646 bytes, depending on whether we count the screen and the CPU stack. That sounded more credible, except that I want at least 1Kb of RAM for user programs, which brings me back to 2923 to 3622 bytes. A terrible squeeze after all.

There's one obvious way to tackle that: use the Token Forth model. The definitive articles covering all the trade-offs in developing a Forth are Brad Rodriguez's "Moving Forth" series, but here we just need to recap the most popular Forth models.

Forth Execution Models

Forth normally represents its programs as lists of addresses which eventually point to machine code. The mechanism is handled by the inner Forth interpreter called "Next". The traditional Forth model implements what's called Indirect Threaded Code.



Here, each Forth command (in blue) points to an indirect address (in green) which points to some machine code (in pink). Primitive commands in Forth (like DUP, >R, SWAP and R> here) have an indirect address which points to the next RAM location, where their machine code starts. But commands written in Forth itself (like ROT) start with an indirect address which points to ENTER, which implements a function call, and are then followed by more Forth commands (in blue). A Forth command like this ends in EXIT, which returns Forth execution to the calling word (MYFUNC).
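For concreteness, here's a hypothetical pair of definitions of the kind the diagram describes (not necessarily the exact example pictured): ROT is itself written in Forth from primitives, and MYFUNC is a word that calls it.

: rot     >r swap r> swap ;
: myfunc  rot dup ;

When myfunc runs, NEXT reaches rot, ENTER nests into it, and rot's final EXIT returns execution to the rest of myfunc.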

The next Forth model to consider is Direct Threaded Code. Here's the same thing:


Here, every Forth command (in blue) points directly to machine code (in pink). Primitive commands are executed directly, but commands written in Forth itself (like ROT) start with a "JSR Enter" machine-code instruction. The JSR saves the return address, which points back into the threaded Forth code, on the hardware stack; in a DTC Forth this return address becomes the new Forth instruction pointer after the old IP has been pushed. We can see that DTC will normally be faster than ITC because there's one less level of indirection.

Token Threaded Forth is essentially a byte-coded Forth, except that for commands written in Forth itself, the NEXT routine uses the top bit of the token to distinguish the address of a Forth definition from a primitive token. Thus a maximum of 128 tokens can be supported, and only 32Kb of Forth code can be addressed.


In this example, we can see that the Forth code has been reduced from 14 bytes to 8 bytes, but there is a jump table of addresses which is the same size as the indirect entries in ITC (10 bytes for these entries). DTC needs an additional JSR (3 bytes) for the ':'-defined word, but TTC doesn't need any extra bytes for the ':' definition (it uses a single bit, encoded in the $93A0 address). Here, the overhead of ITC weighs in at 24 bytes, TTC at 18 bytes and DTC at 17 bytes.

We can see that TTC could significantly reduce the size of Forth code if the tokens are used often enough, but traditionally a byte-coded interpreter is slower than a threaded-code interpreter. uxForth isn't expected to beat a DTC Forth, so the question is whether it can compete with an ITC Forth.

Execution Timings

ITC Forth:

NEXT      LDY #1
          LDA (IP),Y ;Fetch
          STA W+1    ;Indirect
          DEY        ;Addr
          LDA (IP),Y ;to
          STA W      ;W
          CLC        ;Inc
          LDA IP
          ADC #2
          STA IP     ;IP.lo
          BCC L54    ;If CY,
          INC IP+1   ;inc IP.hi
L54       JMP IndW
IndW:  .byte $6c ;JMP () opcode
W         .word 0

This is the implementation from the original 6502 FIG-Forth. It uses zero page for IP and W. The indirection is achieved by jumping to an indirect JMP (the $6C opcode planted at IndW). It requires 41 cycles.

DTC Forth

NEXT      LDY #1
          LDA (IP),Y ;Fetch
          STA W+1    ;Indirect
          DEY        ;Addr
          LDA (IP),Y ;to
          STA W      ;W
          CLC        ;Inc
          LDA IP
          ADC #2
          STA IP     ;IP.lo
          BCC L54    ;If CY,
          INC IP+1   ;inc IP.hi
L54       JMP (W)
W         .word 0

This is a simple derivation from the original 6502 FIG-Forth. As before, it uses zero page for IP and W, but the final jump is a single indirect JMP (W). Counted the same way as above, it requires 38 cycles: the only difference from the ITC version is that the 3-cycle absolute JMP has gone.

uxForth

Next:
    lda (gIp),Y   ;fetch the byte code
    asl a         ;*2 for the gVecs index; the 'Enter' bit ends up in Carry
    iny           ;inc gIp.lo
    beq Next10    ;page increment needed?
                  ;no page increment: fall through
Next5:
    bcc Enter     ;handle Enter
Next7:
    sta Next7+4   ;modify the operand of the following jmp
    jmp (gVecs)   ;exec the byte code
Next10:
    inc gIp+1     ;inc the page
    bcs Next7     ;now handle token/Enter

This is the proposed uxForth implementation. It has to both multiply the token by 2 to get the index into the jump table (gVecs) and test whether the byte is a call to another Forth routine (the bcc Enter). It requires 22 cycles (5 for the lda (gIp),Y; 2 each for the asl, the iny and the two untaken branches; 4 for the absolute sta; and 5 for the indirect jmp), so we can see that it's almost twice as fast as the ITC version. This is because it has one natural advantage and uses several techniques to improve the speed:

  • Y is used to hold the low byte of IP, so when we execute lda (gIp),Y only the upper byte of gIp matters: its lower byte is always 0.
  • Branches are arranged so that the common case is the fall-through case; only when IP increments over a page boundary are two branches taken.
  • We normally only have to read one instruction byte instead of two. This is the one natural advantage TTC has over ITC or DTC.
  • The vector is stored directly in the code: the sta patches the second byte (the low operand byte) of the jmp (gVecs) instruction.
The routine is also compact: 18 bytes, versus 26 bytes for the ITC NEXT and 25 bytes for the DTC NEXT. It's possible to use most of these techniques to improve the speed of an ITC or DTC 6502 Forth, but I'm not so concerned about that, because the easiest VIC-20 Forth to get hold of is Datatronic Forth (an 8Kb ROM cartridge), and it uses exactly the same version of NEXT as FIG-Forth.


Conclusion

RAM is still very tight, but we can reduce its usage by implementing a byte-coded Forth, and we should find it's perhaps up to twice as fast as a traditional FIG-Forth implementation.

In the next post we'll look at how we might map our Forth implementation to the available RAM regions!