Thursday 24 November 2011

FIGnition HiRes Graphics, API: RFC.

This is a short post describing a proposed Bitmapped Graphics API for FIGnition.

Bitmap Format

A FIGnition bitmap (including the display) is stored as a grid of 8x8 bit tiles:



A bitmap itself is specified explicitly using a pointer to the bitmap and its dimensions in tiles, packed as height*256+width.

In the case of the display, the bitmap is 20x20 tiles, representing 160x160 pixels. A tile format makes it efficient to copy bitmaps to the FIGnition's serial memory, because a single 8x8 tile requires only 11 SPI accesses, compared with 32 SPI accesses if the target bitmap were a simple raster. Better savings are made if the target bitmap is larger: an 8x16 bitmap requires only 19 SPI accesses, compared with 40 SPI accesses for a simple raster.
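The access counts above can be reproduced with a little arithmetic. The sketch below is only a guess at how they arise: it assumes each SPI transaction costs 3 accesses of overhead (a command byte plus a 16-bit address) before the data bytes, and that the "8x16" bitmap is two tiles side by side in one tile row. The function names are made up for illustration.

```python
SPI_OVERHEAD = 3  # assumed: 1 command byte + 2 address bytes per transaction

def tile_accesses(n_tiles):
    """Tiled target: adjacent tiles are contiguous, so one transaction covers them."""
    return SPI_OVERHEAD + 8 * n_tiles

def raster_accesses(n_tiles):
    """Raster target: each of the 8 pixel rows needs its own transaction."""
    return 8 * (SPI_OVERHEAD + n_tiles)

print(tile_accesses(1), raster_accesses(1))  # 11 32
print(tile_accesses(2), raster_accesses(2))  # 19 40
```

Under these assumptions the tiled layout pays the addressing overhead once per run of tiles, while the raster layout pays it once per pixel row.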

Video Prefetch

In the FIGnition's bitmap mode, program access to SRAM must be interleaved with video fetching from SRAM. The SRAM chip provides no means of correctly interrupting an SPI access (you can't read it to find out which operation was last in use or what the last address was), so we rely on SRAM access being interrupted by Forth ROM accesses, program jumps or memory accesses frequently enough to be able to prefetch the data during the video scanning. This means you can't unroll loops too much in hires mode, or video glitches will occur.

Video Prefetch fetches a whole 160-byte row of the frame buffer at a time and copies it to internal RAM (the RAM normally used as video RAM in text mode). This operation requires 160µs every 512µs and is thus within the bandwidth of the memory system, leaving around 352µs, of which 88µs is free for program execution (around 16µs for each of the 5.5 scans): typically enough for 11 instructions (at around 120KIPS, the raw average performance of FIGnition). It's therefore likely that program execution will be suspended during video scanning (currently it isn't), and this will reduce performance by about 0.5%.

The prefetch needs to fetch a whole 160 bytes, because the video scan must output the data in raster order: each scan displays every 8th byte of the buffer, and the next scan repeats the process starting one byte further on.
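To make the scan order concrete, here is a minimal sketch of how one prefetched row of tiles turns into raster lines. It assumes each tile is stored as 8 consecutive bytes, one per pixel row from top to bottom; `scan_line` and `row_buffer` are illustrative names, not firmware symbols.

```python
TILES_PER_ROW = 20   # 160 pixels / 8 pixels per tile
BYTES_PER_TILE = 8   # assumed: one byte per pixel row of a tile

def scan_line(row_buffer, scan):
    """Return the 20 bytes shown on pixel row `scan` (0..7) of a tile row."""
    return [row_buffer[tile * BYTES_PER_TILE + scan]
            for tile in range(TILES_PER_ROW)]

# Fake 160-byte row buffer in which every byte holds its own offset.
row_buffer = list(range(TILES_PER_ROW * BYTES_PER_TILE))

print(scan_line(row_buffer, 0)[:4])  # [0, 8, 16, 24]
print(scan_line(row_buffer, 1)[:4])  # [1, 9, 17, 25]
```

Every scan touches all 160 bytes' worth of tiles, which is why the whole row has to be in internal RAM before the first scan starts.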

Therefore, we have to use two 160-byte buffers for video scanning, leaving 728-320 = 408 bytes free for a blitter cache called the Tile Buffer.

The Tile Buffer

In general, using serial SRAM to blit data represents a major bottleneck in the system. Therefore it's advantageous to use internal RAM as much as possible, in our case as a cache of tiles for our bitmap images (such as sprites) as well as using it as a cache for parts of the frame buffer when we write bitmaps to the frame buffer.

408 bytes provide room for 51 tiles. One tile (tile 0) will be used as a single-tile cache for plot. The subsequent tiles (1..50) are intended to be used by your program, while the tiles working backwards from the end of the buffer (50..11) are to be used for caching the frame buffer; at most 40 tiles can ever be used for this. For example, if you use a number of 16x16 sprites and no larger graphic objects, in practice only 2 rows of 3 frame buffer tiles will be used: 6 tiles.
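The layout can be pictured as two regions growing towards each other. The sketch below is purely illustrative bookkeeping: nothing in this proposal includes an allocator, and `TileBuffer`, `alloc_program` and `alloc_fb` are hypothetical names.

```python
TILE_BUFFER_BYTES = 728 - 2 * 160      # 408 bytes left after the two row buffers
TOTAL_TILES = TILE_BUFFER_BYTES // 8   # 51 tiles, numbered 0..50
PLOT_TILE = 0                          # reserved single-tile cache for plot
MAX_FB_TILES = 40                      # frame buffer cache: tiles 50 down to 11

class TileBuffer:
    """Hypothetical tracker: program tiles grow up from 1, frame buffer tiles down from 50."""

    def __init__(self):
        self.next_program = PLOT_TILE + 1
        self.next_fb = TOTAL_TILES - 1

    def alloc_program(self, count):
        """Reserve `count` program tiles; return the first tile number."""
        if self.next_program + count - 1 > self.next_fb:
            raise MemoryError("tile buffer full")
        first = self.next_program
        self.next_program += count
        return first

    def alloc_fb(self, count):
        """Reserve `count` frame buffer tiles; return the highest tile number."""
        used = (TOTAL_TILES - 1) - self.next_fb
        if used + count > MAX_FB_TILES or self.next_fb - count + 1 < self.next_program:
            raise MemoryError("no room for frame buffer tiles")
        first = self.next_fb
        self.next_fb -= count
        return first

buf = TileBuffer()
print(buf.alloc_program(4))  # 1  (a 16x16 sprite is 2x2 = 4 tiles: 1..4)
print(buf.alloc_fb(6))       # 50 (2 rows of 3 frame buffer tiles: 50..45)
```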

Even though there are only 51 tiles available, this doesn't mean FIGnition graphics are limited to 51 tiles; it just means you can only cache 51 tiles at any one point. Changing the cached tiles won't change anything on the screen, because the frame buffer is stored independently. So you can cache a number of tiles, do some blitting, cache some other tiles, do some more blitting, and it'll all work out correctly.

Blitter API

Here's the proposed Blitter API (the parameters may change a bit, it's the concepts that are important here).

The first aspect is support for tiles. There's a single function, to copy from SRAM to tiles:


bitmap dim dx dy tile# tile

This copies the bitmap whose dim is height*256+width with a shift of (dx,dy) pixels to tile tile#. Note that this means you can pre-shift tiles by up to 8 pixels in 2 directions. If the tile is shifted in x by more than 0, it'll take up width+1 tiles per row; similarly, if it's shifted in y by more than 0, it'll take up height+1 rows.
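As a quick illustration of the dim encoding and of how a shift enlarges the cached footprint, here is a small sketch. `pack_dim` and `shifted_footprint` are hypothetical helpers written for this example, not part of the proposed API.

```python
def pack_dim(width_tiles, height_tiles):
    """Pack dimensions as described above: height*256+width."""
    return height_tiles * 256 + width_tiles

def shifted_footprint(dim, dx, dy):
    """Return (columns, rows) of tiles occupied after a (dx, dy) pixel shift."""
    width, height = dim & 0xFF, dim >> 8
    return (width + (1 if dx > 0 else 0),
            height + (1 if dy > 0 else 0))

dim = pack_dim(2, 2)                  # a 16x16 pixel sprite
print(hex(dim))                       # 0x202
print(shifted_footprint(dim, 0, 0))   # (2, 2) -> 4 tiles
print(shifted_footprint(dim, 3, 5))   # (3, 3) -> 9 tiles
```

So an unshifted 16x16 sprite costs 4 tiles of cache, while a copy pre-shifted in both directions costs 9.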

The blitter itself works a bit like emit; having a pen location where tiles are to be blitted next.


x y at ( in graphics mode sets the x y pen coordinate for the blitter)

dx dy clip ( clips from the pen coordinate to dx dy in the current frame buffer; blits outside the clip area aren't displayed).

tile# dim blt ( blits a single tile to the frame buffer in xor mode).

tile# dim tile2# dim2 x2 y2 2blt
( is a double-blit routine, blitting both tile# and tile2# to the frame buffer in order to eliminate flicker; it uses xor mode).

tile# dim xrep yrep blts ( is a background tile blit routine, copying a single tile repeatedly to the frame buffer at the current position; it doesn't use xor).
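The reason xor mode suits sprites is that an xor blit undoes itself: blitting the same tile twice at the same place restores the background. The sketch below models the frame buffer as a list of 8-byte tiles; `xor_blit` is an illustrative stand-in, not the real blt, and the idea that 2blt pairs an erase with a redraw is a plausible reading rather than something spelled out above.

```python
def xor_blit(frame_tiles, index, tile):
    """XOR one 8-byte tile into the frame buffer tile at `index`."""
    frame_tiles[index] = [f ^ t for f, t in zip(frame_tiles[index], tile)]

background = [[0xAA, 0x55] * 4 for _ in range(3)]   # three patterned tiles
frame = [tile[:] for tile in background]
sprite = [0x18, 0x3C, 0x7E, 0xFF, 0xFF, 0x7E, 0x3C, 0x18]

xor_blit(frame, 0, sprite)            # draw the sprite over tile 0
drawn = frame != background
xor_blit(frame, 0, sprite)            # the same blit again erases it
restored = frame == background
print(drawn, restored)                # True True

# Moving a sprite: erase at the old tile and draw at the new one back to back,
# which is presumably what 2blt does in a single call.
xor_blit(frame, 0, sprite)            # sprite currently at tile 0
xor_blit(frame, 0, sprite)            # erase old position
xor_blit(frame, 1, sprite)            # draw new position
print(frame[0] == background[0])      # True
```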

This graphics system has a number of useful characteristics.

  • It splits the blitting into SRAM->Internal RAM and Internal RAM -> SRAM operations; thus reducing the contention for SRAM at any one point.
  • It increases the re-use of Internal RAM by being able to cache tiles. Thus, repetitive tiles can be easily reproduced using blts and repeated use of sprites is possible (e.g. with galaxians or space invaders where the same graphics are used multiple times).
  • It allows blitting to be treated like printing; the blit position is advanced tile by tile.
  • It allows characters to be displayed easily: we use the address of a character from the character set in firmware as the tile to be displayed (simple tiling is just a cmove).
  • It's pretty minimal. You can get away with knowing only 3 routines and then later expand your knowledge to all 5!

I'm considering adding a built-in scroll routine too. Here, we'd fill tiles with the background data to be revealed by the scroll and then use dx dy scroll; where dx dy are values in the range -8 to 8. The problem here is that theoretically we'd need up to 80 tiles to do this, but we only have 51.

Performance Estimations

Some rough estimations follow, though they could be up to 2x slower or so than these estimates; be warned!

blts could operate at up to roughly 2µs/byte, or 500Kbytes/s, so a full-screen blts would take 6.4ms. Tile/blt would be slower. For 2x2 graphics, tile would operate at roughly 2µs/byte + 40µs(?); blt would operate at around 4µs/byte, because each byte must be read and written once. So that's 32*2+40 + 72*4+10 = 402µs for a 16x16 sprite. 2blt would be a bit slower, but it probably won't be worth it if it's more than 50% slower than using blt (otherwise we might as well just use blt twice), so call that 603µs for a 16x16 sprite. Let's say we need 12.5Hz for OK performance and we can't use more than 70% CPU: that's 603/.7 = 861µs for a 16x16 sprite, *12.5 = 10.8ms of every second per sprite, or about 92 16x16 sprites maximum, each updated 12.5 times a second.

Of course, the performance for blt can be better: if some sprites are always cached, then we save 100µs per sprite, which raises the limit to about 111 sprites.
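For anyone who wants to play with these guesses, the arithmetic is easy to reproduce. The sketch below uses only the figures quoted above; reading the 72 bytes as 3x3 frame buffer tiles is an assumption, and `max_sprites` is a made-up helper name.

```python
TILE_US_PER_BYTE = 2   # SRAM -> tile copy
BLT_US_PER_BYTE = 4    # each frame buffer byte is read and written once

tile_us = 32 * TILE_US_PER_BYTE + 40   # 2x2 tiles = 32 bytes, plus the guessed 40us
blt_us = 72 * BLT_US_PER_BYTE + 10     # 72 bytes (presumably 3x3 tiles), plus 10us
sprite_us = tile_us + blt_us           # tile + blt for one 16x16 sprite
two_blt_us = sprite_us * 1.5           # 2blt assumed 50% slower than blt

def max_sprites(cost_us, frame_rate_hz=12.5, cpu_share=0.7):
    """Sprites that fit in one second of CPU at the given update rate and CPU share."""
    per_second_us = cost_us / cpu_share * frame_rate_hz
    return int(1_000_000 / per_second_us)

print(sprite_us, two_blt_us)           # 402 603.0
print(max_sprites(two_blt_us))         # 92
print(max_sprites(two_blt_us - 100))   # 111
```

Changing `frame_rate_hz` or `cpu_share` shows how sensitive the sprite budget is to those two choices.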

I think these guesses are decent for the kind of architecture we have here, but the real question is: is this kind of performance good enough for 8-bit games? If not, we'll have to approach it differently.

Comments are very welcome!