Saturday, 30 April 2022

Cubase Re-Lited Part 1

In the mid-1980s my parents kindly bought me a Casio CZ-101. It was a great introduction to synthesisers and my way back into keyboard instruments. I had been introduced to synthesiser music thanks to some of my school friends lending me albums by Jean Michel Jarre (Oxygene and Equinoxe), Vangelis (L'Opera Sauvage, Apocalypse Des Animaux, Heaven and Hell) and Tomita (Pictures at An Exhibition, The Planet Suite).

At the time I'd been learning the trumpet. My mum was well known in the locality for playing the piano and organ in primary schools and working-men's clubs, alongside teaching piano and music theory, but I was never able to pick the piano up - my sister did much better than me, getting up to grade 3 or 4.

The CZ series has been largely dismissed since the 1990s, primarily because Casio wasn't associated with professional instruments and the Yamaha DX series was seen as more capable, but also because newer synthesis techniques superseded these early digital ones. However, in recent years, as artists have increasingly mined and revisited older technology, Phase Distortion and the CZ series have been re-evaluated, alongside the emergence of DAWless production.

Over time I acquired a few more instruments: a Yamaha SY-22, an early sample-playback and FM synth, but capable of creating interesting dynamic soundscapes thanks to its vector synthesis feature. For a while I had a Roland MT-32, which could handle up to 8 parts plus a drum kit channel.

Around the same time I bought a copy of Cubase Lite (bundled, I believe, with a MIDI interface) for my Macintosh Performa 400 and with it sequenced quite a number of pieces and songs. Cubase Lite was well within my needs and, even when running on a Mac Plus, it could handle everything I threw at it.

Stupidly I sold it because I didn't think its sounds were particularly interesting and I never took the opportunity to learn how to program it properly. Finally, I bought a super-ugly Yamaha TQ-5 (for £35 on eBay), which is a superb 4-operator FM synth / sound module with a built-in sequencer (horrible UI) and an effects unit. It's badly underrated, as it's better than a DX-11.

Over the past two decades, though, I've used GarageBand for my music creation - again because, although it's an entry-level program, it's still within my needs. What I did miss was being able to connect GarageBand to my earlier synths.

However, now that DAWless music creation has become more fashionable, I thought I'd have a go at trying to reunite my current keyboards and sound modules with a cheap (or free) but simple-to-use DAWless MIDI application.

And it turns out that's quite a rabbit-hole. It's not that nothing is available. For example, there's a Linux program called Rosegarden. Now, typically, the first question I ask about any application, whether it's for Linux, macOS or whatever, is: what are the requirements in terms of RAM and CPU? Rosegarden says it's "demanding". What does that mean? A 1GHz Athlon? A 2.5GHz Core i7? A Raspberry Pi 3? MusE looks good, but there's no description at all of the hardware requirements. And at this point I'm already turned off. Why make the effort if my hardware might not run it? Also, as software developers have shifted from producing sequencers to full Digital Audio Workstations, they've blurred the descriptions of what their software does. For example, GarageBand obviously does handle MIDI input, but it doesn't handle MIDI output, so I've found it progressively harder to figure out whether a given application would support what I want, or even how to ask the right kind of search question that would provide an answer.

Also, it's such a pain even just finding out about trivial things. Consider part of the MusE Support FAQ (which, of course, doesn't even tell you what CPU performance you need):

“MusE requires a high performance timer to be able to keep a stable beat. It is recommended that your operating system be setup to allow at least a 500hz timer, MusE will warn you if this is not available.”

An 8MHz Mac Plus could do this! An Atari ST could do this! An Amiga could do this! A 1980s PC running MS-DOS Could. Do. This! 

So, in the end, I figured that if a mid-80s, 16-bit era computer could manage this (yes, the 68000 is 32-bit internally), and emulators for these computers are available that run much faster than the original hardware, then surely it should be possible to get an emulation of a Mac to run my original Cubase Lite and communicate with a MIDI interface.

Picking An Emulator?

Mini vMac

My normal go-to Macintosh emulator is Mini vMac. It is great and easy to install on a number of platforms. It's also not too heavy on hardware requirements, as it makes efforts to maximise emulator performance.

Also, someone has tried to interface Mini vMac to real MIDI hardware.

“Hi, i've implemented a midi bridge for Mini vMac which exposes emulator modem and printer midi ports to the host OS. so far no stuck notes, and sysex seems to work both ways. there's systematic jitter though which needs to be resolved.”
The cause of the systematic jitter is fairly easy to understand. Mini vMac has a fairly simple emulation core, which aims to emulate execution in 1/60th-of-a-second chunks. That is, it works out how many instructions can be executed in 1/60th of a real second, according to the emulator's requested speed, and executes them all in one go. It's not quite that simple - within the basic emulation loop it actually executes instructions until the next interrupt - but the same principle applies. It means that Mini vMac only synchronises with the host every ≈16.67ms. The upshot is that Mini vMac receives MIDI events that arrived before its time slice all at once; sends multiple events too quickly during its time slice; and, worse still, is unable to send events before its time slice, as that would mean generating events backwards in time.
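To make that concrete, here's a minimal sketch of a chunked emulation loop - my own illustration with made-up names, not Mini vMac's actual code - showing why any MIDI traffic gets quantised to the 1/60th-of-a-second boundaries:

#include <stdint.h>

#define TICK_HZ       60        /* host sync rate: one chunk per 1/60s */
#define CPU_SPEED_HZ  8000000   /* requested emulated CPU speed (illustrative) */

/* Stand-ins for the emulator's real routines - purely illustrative names. */
extern void EmulateOneInstruction(void);
extern void DeliverPendingMidiInput(void);  /* host -> emulated serial port */
extern void FlushMidiOutput(void);          /* emulated serial port -> host */
extern void WaitForNextTick(void);          /* sleep until the next 1/60s tick */

void EmulationLoop(void)
{
    /* Roughly how many emulated instructions fit in one 1/60s chunk
       (a crude measure: the real core counts cycles, not instructions). */
    const uint32_t instructionsPerChunk = CPU_SPEED_HZ / TICK_HZ;

    for (;;) {
        /* 1. Everything the host received since the last tick arrives in one lump. */
        DeliverPendingMidiInput();

        /* 2. A whole 16.67ms of emulated time runs back-to-back, so bytes the
              emulated sequencer writes during this chunk are produced faster
              than real time and can only be handed to the host afterwards. */
        for (uint32_t i = 0; i < instructionsPerChunk; i++)
            EmulateOneInstruction();

        /* 3. Output is flushed in a burst at the chunk boundary: ~16.67ms of jitter. */
        FlushMidiOutput();
        WaitForNextTick();
    }
}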





This is why messages end up with jitter. Now, the jitter is often not very noticeable, because there often aren't that many events every 17ms, but for an application which requires sub-millisecond latency, it will be enough to mess up recording and playback.

The developer of Mini vMac is slowly making progress on this (see newsletters 9 and 13), but for me, it's a non-starter.

Basilisk II

Basilisk II isn't capable of emulating a Mac Plus; instead it aims to emulate 68020 to 68040 Macintosh computers - well, at least the most significant ones. Unfortunately, one of the ones it doesn't emulate is my Performa 400 (a.k.a. the LC II). In addition, Basilisk II substitutes the emulation of real Mac hardware with drivers for the equivalent host peripherals. This will work in most circumstances, but if I understand things correctly, applications like Cubase Lite actually addressed the real hardware - in this case the SCC serial communications controller - rather than going through the driver.

PCE/MacPlus Emulator

I'd come across the PCE MacPlus emulator because of the Javascript version:

https://jamesfriend.com.au/pce-js/

I had originally thought it would be rather sluggish, because it was in Javascript, but in reality it performs pretty well. It was only very recently that I actually followed the links and found out it was a conventional emulator written in C and ported to Javascript. Importantly, it's possible to browse through the source code, at which point I found this crucial line:

#define MAC_CPU_SYNC 250
With a comment to say that the emulator is synced to the host that many times per second. So I reasoned that, if I increased it to 1000, then the emulator's serial I/O latency would come down to about a millisecond - which should be good enough for MIDI.
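As a rough sketch of the change (I'm assuming the constant lives wherever that define appears in the Mac Plus part of the PCE source tree):

/* Hypothetical edit: raise the sync rate from 250 to 1000 times per second,
   so the emulated serial (MIDI) I/O is never more than ~1ms from a sync point. */
#define MAC_CPU_SYNC 1000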

Then I came across a ToughDev article about using the PCE emulator to develop classic Macintosh applications, which gave a decent walkthrough.


Building it shouldn't be that hard: you just need the X11 development and SDL (Simple DirectMedia Layer) libraries. I first tried to brew the build on my Mac mini, and later used apt-get on a Linux PC, but unfortunately in both cases the configuration stage said SDL wasn't there.

So, then I tried it on a Raspberry Pi 3 I had to hand, but the Raspberry Pi 3 had its own problems with the latest version of apt-get. It kept complaining with messages a bit like:

W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: http://archive.raspberrypi.org/debian buster InRelease: Splitting up /var/lib/apt/lists/archive.raspberrypi.org_debian_dists_buster_InRelease into data and signature failed
W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: http://raspbian.raspberrypi.org/raspbian buster InRelease: Splitting up /var/lib/apt/lists/raspbian.raspberrypi.org_raspbian_dists_buster_InRelease into data and signature failed
W: Failed to fetch http://raspbian.raspberrypi.org/raspbian/dists/buster/InRelease  Splitting up /var/lib/apt/lists/raspbian.raspberrypi.org_raspbian_dists_buster_InRelease into data and signature failed
W: Failed to fetch http://archive.raspberrypi.org/debian/dists/buster/InRelease  Splitting up /var/lib/apt/lists/archive.raspberrypi.org_debian_dists_buster_InRelease into data and signature failed
W: Some index files failed to download. They have been ignored, or old ones used instead.
I found that this was a common problem caused by the repositories being updated for the new Bullseye OS version, replacing Buster. Eventually I found a couple [1], [2] of Raspberry Pi forum threads that told me how to fix it. This meant typing in the following line:
sudo apt-get update --allow-releaseinfo-change
I was then able to follow the rest of the development installation as per the ToughDev web page on compiling PCE.
sudo apt update
sudo apt-get update
sudo apt-get install libx11-dev libx11-doc libxau-dev libxcb1-dev libxdmcp-dev x11proto-core-dev x11proto-input-dev x11proto-kb-dev xorg-sgml-doctools xtrans-dev
sudo apt-get install libsdl1.2-dev libsdl1.2debian
One problem I had at the configure stage was that I couldn't get it to acknowledge that SDL was installed. It would always say things like:
Terminals built:         null x11
Terminals not built:     sdl

 To make it compile with SDL I had to type:
./configure --with-x --with-sdl=1 --enable-char-ppp --enable-char-tcp --enable-sound-oss --enable-char-slip --enable-char-pty --enable-char-posix --enable-char-termios
cp /usr/local/etc/pce/pce-mac-plus.cfg .

At this point I had a version of PCE that could actually run. In part 2 I'll describe the steps needed to get PCE to work as I'd intended.




Saturday, 23 April 2022

A Toast to the PowerMac 4400

The PowerMac 4400 is probably one of the most hated Macs of all time. It wasn't really all that bad (merely compromised), but it was a very non-Mac-style Mac, which used a cheap PC case and had the floppy disk and CD drive the wrong way round. It's the Register's third most awful Mac:

“The Power Macintosh 4400 of November 1996 is widely regarded as one of - if not the - least distinguished Macs of all time. The best thing you could say about it was that it worked.”

And I had one. And I liked it. So, a couple of weeks ago I was wondering how much I'd actually paid for that 'lemon de jour' (thanks, Ars Technica) and I couldn't find out. It's like, surely everything is there on the internet? Well, that's not quite true: I could find out the price in USD, but I needed the price of the PowerMac 4400 in GBP, because that's what I paid for it. And I tried, for at least an hour or two: looking at EveryMac and LowEndMac, then just searching for reviews, and then checking out every online back issue of MacUser (UK) and Macworld (UK) I could find - all to no avail.

It's just not there. Until now! Ready?

The PowerMac 4400 cost £799 for a 160MHz machine with a 1.2GB HD, 16MB of RAM, an 8x CD-ROM (not CD-RW or DVD - this was November 1996!) and no display.

Now you know. But to tell you the truth, I really liked mine. Let me explain the back story.

BackStory

In late 1996, I'd started an MPhil in Computer Architecture at Manchester University. At the time, I had a Performa 400 (the first Macintosh I'd bought); I'd added a Zip drive, extra RAM and a hard disk, and put it on the internet, which I'd then used to get in touch with Steve Furber (co-designer of the ARM processor) to arrange an interview for the course I wanted to do.

But the Performa 400 was never going to be capable of handling an MPhil (in retrospect I think it might have been), so I was quite keen on replacing it with a funky new PowerPC Mac.

I was actually quite keen on an all-in-one Performa 5200/5300, as they were the cheapest and I was a student. In fact I'd used one to do my Manchester University application, while house-sitting for a friend, and was impressed, but 18 months later better ones were available.

I'd been most impressed by the Apus 2000 Mac clone, but no one seemed to have any; then I came across the Performa 5260, which seemed to have a similar spec. I'd gone as far as placing an order for one, and then discovered that it could only handle a 640x480 display! I was appalled, and cancelled the order.

I was getting fairly desperate for a new Mac (my Performa 400 was a shocking 3.5 years old by then 😮), but when I opened the latest issue of MacUser, I saw that there was a review of the PowerMac 4400, which, even after adding a monitor, was cheaper and better than the 5260.

And then in the same magazine I saw that they were on offer at a Mac dealer called Gordon Harwood, based in the backwoods of Alfreton, Derbyshire.


Weird, a decent, successful Mac dealer, not in a big city! And more importantly, on the way from Manchester to my parents' house.

So, I promptly rang them up, reserved one (they had them in stock), then went to my parents' for the weekend and bought it, either on the way down or on the way back. I added another 16MB of RAM (taking it, I believe, up to 32MB) and a Sony 15" Trinitron SX, along with a student discount.

In Use

And frankly, it was a great computer! It could do everything I could throw at it, including full-screen QuickTime videos. The 8x CD-ROM felt really nippy compared with my PowerCD. The hard disk did seem a little small, even then, and I had to shove quite a lot of stuff onto my Zip drive. I bought Nisus Writer 6 to write my thesis on, and it was an absolute joy of a word processor (I also had ClarisWorks 3 for day-to-day office documents). I even had a version of SoftWindows so that I could run Turbo C++ 4.5 and the IAR H8 compiler I was using to continue my Heathrow Express and Midlands Metro firmware.

I did most of my development in Metrowerks CodeWarrior 10 Gold, a multi-target 68K/PowerPC C compiler with a great object-oriented application framework called PowerPlant.

Later I bought a PAL based PCI TV card that worked with that Mac. 

Stupidly I Sold It

Things all went wrong when my PowerBook 100 got stolen from the University. It had been quite handy to have a laptop as a sort of satellite to my main computer back at home; I'd bought the PowerBook 100 (with a 30MB hard drive) just a few months before the PowerMac 4400, and already I couldn't imagine computing without a laptop.

But when it got stolen I had a dilemma, which basically revolved around moving to only a laptop or having both. For a while I used a PowerBook Duo 230 from Steve Furber himself (he'd just upgraded), but it came with the full dock, and it was impossible to sell just the dock and swap to a mini-dock, as people wanted both; and the full dock took up quite a bit of space. So, eventually I figured I should sell the PowerMac 4400 and get a PowerBook, and because the 4400 fetched quite a low price, I ended up with a low-end PowerBook 5300 (black and white), but with a second monitor card.

It wasn't terribly reliable and the HD was small, and then it started to fail in the spring of 2000.

So, in the end, I think it was a bit of a mistake going down that route. The PM4400 would have done a good job of running up to Mac OS 8.x and completing my thesis, while I could have bided my time and then chosen a more suitable laptop (e.g. a PowerBook 150, because although bulky, it supported an IDE drive), or perhaps even waited until after my thesis.

Conclusion

I originally wanted to know how much the PowerMac 4400 cost, to help work out whether I've been spending more or less on Macs over time - my MacBook Pro in particular. In reality it's been quite variable, with the PM 4400 in the middle of the pack. It was one of those rare occasions where the internet didn't have the answers, so I had to hunt down the facts myself.





Thursday, 26 August 2021

Virtually Lost: An Alternative Intel 80286 Protected Mode

The Intel 80286 was the true successor to the unexpectedly, overwhelmingly dominant 8086. This post is intended to be part of a larger series on its protected-mode architecture; here we consider an alternative, page-based memory management design. We show that a far simpler design could have achieved far more for Intel operating systems.

The development of the 80286 is covered pretty well in the 80286 Oral History and some anecdotal information can be found in Wikipedia.

The original 80286 protected memory mode is a highly sophisticated, purely segmented design modelled on Multics segmentation, which saw almost no practical use beyond providing access to the full 16MB physical address space in real Intel 80286 operating systems, including MS-DOS, OS/2, Xenix (probably) and Windows. We won't discuss that further here (it's for future posts); instead we'll discuss a paged alternative, which I'll call the 80286p.

80286p Overview

If the designer(s) of the 286 had had enough foresight, and a willingness to break with the ideology of Intel's i432 albatross, they could have implemented a paged-memory version of it with simple 4KB pages in a 16MB physical and virtual address space. This could have made a lot of sense given that the 80286 was only supposed to be a stop-gap CPU design until the i432 was released.


Address Translation


A simple 16-bit VM architecture for the 286 could have redefined a segment register to point to one of 64K 256-byte-aligned locations - i.e. an 8-bit segment shift. This would have extended the virtual address space to 16MB with the same kind of incompatibility as an actual 286, whilst being conceptually similar to the 8086.

In fact the 8086 designers did consider an 8-bit shift for segments; however, they rejected this in favour of a 4-bit shift on the grounds that a 16MB address space couldn't be accommodated in a 40-pin package without sacrificing other hardware design goals.
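To make the proposed translation concrete, here's a small sketch (my own illustration, not anything from real 286 documentation) of how a linear address and a 4KB page number would be formed under the 8-bit shift, compared with the 8086's 4-bit shift:

#include <stdint.h>

/* 8086: 20-bit linear address, segment shifted left by 4 bits. */
uint32_t lin8086(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 4) + off) & 0xFFFFF;
}

/* Hypothetical 80286p: 24-bit linear address, segment shifted left by 8 bits. */
uint32_t lin286p(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 8) + off) & 0xFFFFFF;
}

/* With 4KB pages, the virtual page number is the top 12 bits of the linear address,
   i.e. (seg >> 4) + (off >> 12) plus any carry out of the low 12 bits. */
uint16_t vpage286p(uint16_t seg, uint16_t off)
{
    return (uint16_t)(lin286p(seg, off) >> 12);
}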


The VM side comprises a 12-bit translation and a 12-bit tag (for user-mode accesses only), plus four access modes (none, code, read-only and read-write data), for 26-bit TLB entries. Assuming the same register resources as an actual 80286, the four 48-bit descriptor caches and the additional VM supporting registers could provide for up to 8 TLB entries, backed by a software page table (which would only need to be 8KB in size at maximum), with kernel mode purely physically addressed. 8 TLB entries isn't a lot, but even some DEC VAX computers only supported 8.
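As an illustration of that layout (again my own sketch of the hypothetical design, with invented names), a 26-bit TLB entry and its lookup might look like this:

#include <stdint.h>
#include <stdbool.h>

enum { kAccessNone = 0, kAccessCode = 1, kAccessRo = 2, kAccessRw = 3 };

/* One 26-bit TLB entry: 12-bit virtual tag, 12-bit physical translation
   and a 2-bit access mode. */
typedef struct {
    uint32_t tag    : 12;  /* virtual page number (top 12 bits of the 24-bit address) */
    uint32_t trans  : 12;  /* physical page number */
    uint32_t access : 2;   /* none / code / read-only / read-write */
} TlbEntry;

#define kNumTlbs 8
static TlbEntry gTlb[kNumTlbs];

/* Look a virtual page up in the TLB; true on a hit with sufficient rights.
   A miss, or a hit with insufficient rights, raises the software-handled MMU
   fault described below. (Treating the access modes as strictly ordered is a
   simplification.) */
bool TlbLookup(uint16_t vpage, uint8_t wanted, uint16_t *physPage)
{
    for (int i = 0; i < kNumTlbs; i++) {
        if (gTlb[i].access != kAccessNone && gTlb[i].tag == vpage) {
            if (wanted > gTlb[i].access)
                return false;          /* e.g. a write to a read-only page */
            *physPage = gTlb[i].trans;
            return true;
        }
    }
    return false;                      /* not present: fault, handled in software */
}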


Enabling the MMU


The PMMU is enabled using bit 15 of the original 8086 flags register (which is defined to be 0 on the real 80286 and 80386). Setting it to 1 enables the PMMU; resetting bit 14 selects physically addressed kernel mode, in which case bit 13 is "don't care" and full I/O access is automatically available.


An MMU fault pushes the access mode used and the virtual page tag onto the stack and switches to physical addressing (flags.i must be 0 and flags.k must be 1 for the MMU to translate addresses, and flags.i can't be changed while virtual addressing is on). All further MMU handling is in software. The MMU uses an LRU (least recently used) algorithm for replacing TLB entries (essentially a 3-bit counter): on return from the fault handler, the least recently used TLB entry gets replaced by the updated access mode and translation address.


The initial entry into user mode can be achieved by creating a system virtual page table containing translations for the current thread of execution, then setting bit 14. The following instruction fetch causes a TLB fault, leading to the VM entry being mapped to the current physical page, and execution continues. This implies at least one VM user process should be allocated to the kernel for 'booting' up user mode and user-side management (a 64KB kernel would only need 16 entries). Kernel-mode support requires a kernel-mode SS and SP register pair; this means that user mode is expected to provide its own settings for SS and SP.


Software Page Tables


A VM algorithm can be extremely simplistic if all we want to do is support a number of user processes in multiple virtual memory spaces, while caching a fixed swap space and ignoring any virtual kernel mode. The TLB uses a round-robin algorithm, and the instruction MOVT loads TLB[cl]'s physical translation from AX. A process's virtual address space has a simple organisation: a fixed region for code and read-only data, followed by a space for read/write heap memory, and finally a stack region which must be <64KB (because the stack is limited to a single 64KB segment). Each entry in the VM table references a physical PTE and, because we can deduce access rights from the VM tag, we don't need to store access rights within each VPte, so the PTE limit is up to 64K 4096-byte pages, or 256MB - easily enough for the lifespan of the 80286p (though only 16MB is actual physical memory; the rest are swap page entries).


We also assume that, although there's a single user space, an application will allocate a fixed code space plus stack space, and all data space is a shared, dynamically allocated and freed space. A virtual memory map is therefore simply the code region, then the heap, then the stack.

A physical memory map also contains the dynamic memory allocations and application allocations. Because the code and stack spaces are fixed, it's simple to test for an access violation by reading the page table (Page=(Seg>>4)|(Addr>>12)). The rules are fairly simple. On a fault, the attempted access should at least match the page's real access rights; otherwise it's a genuine access-rights violation (erroneous code). Then, if the translated address is in swap space, we page it into physical memory at the next page (mod user pages), paging out the previous virtual page at that physical address. If the page is RW, we initially load the TLB entry as read-only (so that the first write marks it dirty); otherwise we load the actual access mode.

A Simplistic Swap Algorithm

Although there's more to a Virtual Memory implementation, the Swap algorithm is central. Here's a simplistic swap algorithm, which supports up to 256MB of swap space and 16 processes each of which can be up to 16MB.

#include <stdint.h>

// A sketch for a hypothetical 80286p OS. The globals (gVmVPte, gVmPte, gVmVPteSet,
// gVmPhys, gVmPhysHead, gVmNextOut, gVmSwapBase, gVmUserBase, gVmUserLim, gVmProcess),
// the constants (kTlbAccessMask, kVmAccessRo, kVmAccessRw, kVmTlbs, kVmPageSize) and
// the helpers (VmAccess, SwapRead, SwapWrite, Trap) are assumed to be defined elsewhere.
// Returns the new TLB entry: access mode in the top bits, physical page in the low 12.
uint16_t VmSwap(uint16_t aAccess)
{
  uint16_t tag=aAccess&0xfff; // got the page.
  uint8_t fault=(aAccess>>12)&kTlbAccessMask;
  uint8_t realAccess=VmAccess(gVmVPte, tag); // proper access rights
  uint16_t trans;
  if(fault==kVmAccessRo && realAccess==kVmAccessRw) { // first write to a clean page:
    aAccess|=(kVmAccessRw<<12); // upgrade the TLB entry to read-write
    gVmVPte->iPages[tag]=aAccess; // and record the RW (dirty) state in the VPte entry.
    return aAccess;
  }
  trans=gVmVPte->iPages[tag]; // Phys page (possibly in swap)
  if(fault!=realAccess) {
    Trap(aAccess); // Application faulted: a genuine access-rights violation.
    return aAccess;
  }
  else if(trans>gVmPhys){ // access is OK but the page is swapped out; swap out the next page (if needed).
    uint16_t swapOut=gVmPte[gVmPhysHead]; // process and page of the victim frame.
    tVmVPte *vPte=gVmVPteSet[swapOut>>12]; // got the process vPte.
    uint16_t swapBlk;
    swapOut&=0xfff; // each virtual table is <=4095 pages.
    if(vPte==gVmVPte) { // the swapOut page might be in the TLB.
      uint16_t tlb, tlbTag;
      for(tlb=0;tlb<kVmTlbs; tlb++) {
        // Pseudo-assembler for the hypothetical MOVT instruction: read TLB[cl] into ax.
        __asm("mov cl,%1",tlb);
        __asm("movt ax,cl");
        __asm("mov %1,ax",tlbTag);
        if((tlbTag&0xfff)==swapOut) {
          __asm("xor ax,ax");
          __asm("movt cl,ax"); // clear the swapOut page from the TLB if so.
          tlb=kVmTlbs; // force end of for loop.
        }
      }
    }
    if(VmAccess(vPte, swapOut)==kVmAccessRw) { // dirty page: write back.
      swapBlk=gVmSwapBase+gVmNextOut; // the swap-out tail.
      SwapWrite(swapBlk, ((long)(gVmPhysHead+gVmUserBase)<<12), kVmPageSize); // physical byte address of the victim (4KB pages).
      vPte->iPages[swapOut]=gVmNextOut; // save the swapped-out location.
      gVmNextOut=gVmPte[gVmNextOut]; // the Pte entry for a free block points to the next free one.
    }
    else { // Code and Ro pages don't need writing back, but still need the vPte updated.
      vPte->iPages[swapOut]=vPte->iRoBase+swapOut;
    }
    SwapRead(trans, ((long)(gVmPhysHead+gVmUserBase)<<12), kVmPageSize); // swap the faulting page in.
    gVmPte[gVmPhysHead]=(gVmProcess<<12)|tag; // update Pte
    gVmVPte->iPages[tag]=gVmPhysHead; // update VPte to point to phys mem.
    if(++gVmPhysHead>gVmUserLim) {
      gVmPhysHead=0; // reset.
    }
  }
  // Return the original access bits plus the (possibly new) physical page.
  return (aAccess&0xf000)|((gVmVPte->iPages[tag]&0xfff)+gVmUserBase);
}

The PTE can do double duty: as a reference to a virtual table and a given entry within it, and, for spare pages, as a reference to the next free page for modified read/write pages. Swap-outs for non-modifiable pages don't require any writes and therefore they never move - they can be stored in contiguous swap blocks when the application is loaded (moving other pages out of the way if needed; and if there's no space, then the application can't load).

We have to provide a means of invalidating specific TLB entries, because it's possible that a swapped-out page is currently in the TLB (because it's part of the same process), and then two different tags could map to the same physical entry. Thus, instructions to load and store TLB entries (movt ax,cl and movt cl,ax) are the minimum needed.


In this VM system, dynamic memory allocations (including stack space) would allocate a read-write block in virtual memory space (which may currently be mapped into physical user space); code allocations would copy all the code to virtual memory. Similarly, deallocations would free the block in swap space. To create a new program with a given code and stack space, the heap between the end of the current code space and the additional code and data space must be free (the program must defragment the heap if needed to do this). User code can't access kernel space; instead, kernel services are accessed via the INT interface. The physical page table is much smaller than the VPte, comprising, for example, only 128 bytes for 256KB of physical memory (the IBM AT in 1984 only came with 256KB as standard), and it would be smaller still given that kernel space wouldn't be included.


But within these limitations, it can be seen that a virtual memory implementation would be relatively simple - easily possible in an early 80286p operating system.


The 80286p could also support 8086-compatible mapping, whereby the segment is only shifted 4 bits, providing a virtual memory space of 1MB (selected via a second flags register). The standard 80286p method for enabling the MMU and clearing the TLBs is to turn it off (by resetting the MMU flag) and then turn it on again. The TLB can have a simple 3-bit LRU head register, initialised to 0. Unmapped or access-rights faults lead to page faults, which cause the next TLB entry to be updated with the returned physical page and access rights (the virtual page is unchanged). Thus initialising the TLB with all 0s means that no accesses will initially map correctly.

More Limitations

The original 80286 could virtualise interrupts (by providing an interrupt trap), but in this implementation, user code can't service interrupts. However, OS routines could provide mechanisms for jumping to user interrupts if needed.

The original 80286 provided mechanisms for jumping to different protection levels, but the 80286p supports only a physically addressed kernel and a virtually addressed user mode.

The original 80286 supported user I/O access, so it's possible that the 80286p could too, on a global basis. This would allow Windows 3.1-style user-side I/O access.

The original 80286 could support thousands of processes, because every LDT (Local Descriptor Table) could be a process. The 80286p doesn't natively support processes, but the simple software implementation above supports 16. This would be a small number by the standards of Unix in the 1980s, but desktop operating systems such as OS/2 1.0 and classic Mac OS supported only a limited number of applications in memory (classic Mac OS had a shared memory space too). Extending the physical page table to 32 bits per entry could provide for up to 65536 address spaces, each with 256MB of virtual address space per VTable. However, it's unlikely this would be necessary, since the 80286 was superseded by the 80386 in 1986, and by the time computers were reaching 25% of its physical memory limit it had been replaced by the i386 and i486 in the early 1990s.

For the same reason, although it would be possible to increase the address space of the 80286p by having separate instruction and data spaces (so the virtual address space could be up to 32MB even though the physical address space would be 16MB, simply by differentiating TLB tag entries based on code vs data access rights), there's no point, because the processor would have been a minority player by the time this could be exploited.


Conclusion

Implementing a simpler, paged 80286 memory management unit would have enabled software developers to provide most of what's needed for virtual memory in an operating system, whilst allowing simple software implementations that would have better leveraged software on the 80286, supporting full compatibility with the 8086 and retaining a similar segmentation model.


In turn this would have led to a simpler 80386 implementation, accelerating the dominance of the IBM PC.

Saturday, 7 August 2021

Fig-Forth At PC=Forty (Part 5)

FIG-Forth was a popular and very compact, public-domain version of the medium speed Forth systems programming language and environment during the early 1980s. In part 1, I talked about how to get FIG-Forth for the IBM PC running on PCjs and in part 2 I implemented a very rudimentary interim disk-based line editor. Part 3 dives into machine code routines and a PC BIOS interface, so that I could implement the screen functions I needed for my full-screen editor and in Part 4 I used them to implement that full-screen editor.

Here I'd like to explore a bit more graphics, since the PC BIOS interface can plot pixels (in any of 4 colours).

So, let's start with Plot. I get most of my BIOS programming information from Wikipedia, though I've used an independent web page too. Implementing plot is just an INT 10h call (with AH=0Ch, i.e. 12, the write-pixel function), so let's try it:

: PLOT ( CLR X Y )
  >R >R 3072 + 0 R> R>
  INT10H
;

4 VMODE takes us into 320x200 graphics. You can still type in text, but you can't see the cursor. CLS fills the screen with a stripe - the BIOS call doesn't work the same way in graphics mode. We can create a graphics CLS with:

: CLG 1536 SWAP 21760 * 0 6183 INT10H 0 0 AT ;

We can fill the screen with a colour:

: FCOL 200 0 DO I 320 0 DO OVER OVER I SWAP PLOT LOOP DROP LOOP DROP ;

So, 0 CLS 1 FCOL will then fill the screen with cyan in 37.6s. That makes the plot rate (320*200=64000)/37.6 ≈ 1702 pixels per second; or, if we exclude the overhead of the Forth looping code itself ((43µs+32µs*4)*320*200 = 10.9s), the plotting takes 26.456s, or about 2419 pixels per second. Writing a simple random number generator:

0 VARIABLE SEED

: RND SEED @ 1+ 75 * DUP SEED ! U* SWAP DROP ;

Means we can fill the screen with random pixels with:

: RNDPIX 200 0 DO I 320 0 DO 4 RND OVER I SWAP PLOT LOOP DROP LOOP ;

This is what we get part of the way through running it after CLS. It randomly plots successive pixels with the colours black, cyan, magenta or grey.
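As an aside, for anyone not used to reading Forth stack juggling, RND above works like this in C (my own transliteration, not part of the original listing): the seed is stepped, then multiplied by the argument, and the high 16 bits of the 32-bit product give a result in the range 0..n-1.

#include <stdint.h>

static uint16_t seed = 0;  /* the Forth VARIABLE SEED */

/* Equivalent of : RND SEED @ 1+ 75 * DUP SEED ! U* SWAP DROP ; */
uint16_t rnd(uint16_t n)
{
    seed = (uint16_t)((seed + 1) * 75);             /* step the 16-bit generator */
    return (uint16_t)(((uint32_t)seed * n) >> 16);  /* high word of U*: 0..n-1 */
}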

Lines

More usefully, we can draw lines. The Jupiter Ace manual gives a nice routine for drawing lines (page 79); however, it uses the definition PICK, which isn't available in FIG-Forth. It will also turn out that the Bresenham routine, although faster in pure assembler, is slower when the instruction execution rate is much slower than division or multiplication. And that's true for the 8088, where each Forth instruction is about 30µs but a multiplication takes about 60µs (roughly two Forth instructions). Thus, if the Bresenham algorithm is at least 2 instructions longer per step, multiplies (or divides) are faster. The Bresenham algorithm on the Jupiter Ace takes 14 basic loop instructions, plus either DIAG (4 instructions) some of the time or SQUARE (7 or 8 instructions) the rest of the time - call it 6 on average - plus Step at 12 instructions. So that's about 14+6+12 = 32 instructions per loop. By comparison, the main loop in the FIGnition line-drawing routine is 22 instructions. On this basis we'd be able to achieve about 1500 points per second; a diagonal line across the screen would take about 0.2 seconds.

It's possible to consider the fastest potential line-drawing code and base the full line-drawing algorithm around that. The quickest approach is to note that, along the longest axis, the coordinate increments by 1 on each pass, while on the other axis it increments by some fraction of 1. So we can use a 16-bit fraction in the range 0..1 for the minor axis, which we multiply by the position along the major axis. In each case we need to add an offset to each coordinate to get the final location.
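Here's that idea sketched in C (my own illustration rather than a translation of the Forth that follows), for the case where x is the major axis:

#include <stdint.h>

extern void plot(uint8_t colour, uint16_t x, uint16_t y);  /* e.g. the PLOT word above */

/* Draw from (x0,y0) with 0 <= dy < dx: x steps by 1, y by a 16-bit fraction. */
void line_xmajor(uint8_t colour, uint16_t x0, uint16_t y0, uint16_t dx, uint16_t dy)
{
    /* grad is dy/dx as a 0.16 fixed-point fraction - the single division. */
    uint16_t grad = (uint16_t)(((uint32_t)dy << 16) / dx);
    for (uint16_t i = 0; i <= dx; i++) {
        /* The high word of grad*i is the integer minor-axis offset, which is
           exactly what the Forth "I U* SWAP DROP" produces below. */
        uint16_t yoff = (uint16_t)(((uint32_t)grad * i) >> 16);
        plot(colour, (uint16_t)(x0 + i), (uint16_t)(y0 + yoff));
    }
}

The Forth version, DAxis, does the same multiply-by-the-loop-index trick: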

: DAxis ( col grad offsetg offseth limh )
  0 DO ( col grad offg offh)
    OVER >R >R >R 2DUP ( c g c g : f h f)
    I U* SWAP DROP R> + ( c g c g*i+f : h f)
    R I + PLOT R> R> SWAP
  LOOP
;

So, in this version we need about 20 instructions and we need a second version where the x coordinates map to the do loop. The problem with this version is handling negative gradients, because using U* to generate gradients won't generate negative results in the high word. However, this can be fixed by sorting the coordinates. Consider (with a normal x-y coordinate system) a vector in the second quadrant at about 153º (a gradient of about -1/2). If we sort the coordinates so that we draw from right to left, then the y coordinates will draw upwards. Similarly a line in the 4th quadrant drawn at 288º (a gradient of 2/3), if we sort the coordinates so that we draw from top to bottom, then the x coordinates are ordered left to right. And we can achieve this by XOR'ing the DO loop coordinate by 0xffff (and adding 1 to the DO LOOP coordinate offset). Furthermore we can 'improve' the line drawing by modifying the plot routine to add an origin ( ox, oy) and set the colour ( fgCol). This gives us:

0 VARIABLE oxy 0 ,

: Oplot ( col dx dy )
  >R >R 3072 + 0 oxy 2@ R> + SWAP R> +
  INT10H
;

So, OPLOT is 4 words longer than PLOT. Also we want x in the first word of oxy and y in the second word. Then DAxis is:

0 VARIABLE XDIR

: DAXISY ( col grad limh sgn&dx sgn&dy)
  OXY >R R 2+ +! R> +!
  0 DO ( col grad)
    2DUP ( c g c g)
    I U* SWAP DROP ( c g c g*i)
    XDIR @ I XOR OPLOT ( c g )
  LOOP DROP DROP
;

: SGN 0< MINUS ;
( Here Y is the major axis. There are 4 cases,
  maj>0, min>0 = first quadrant, normal.
  maj >0, min <0 = second quadrant [ \ ]. Set ox+=dx.
               Since the gradient is always unsigned,
               a left to right draw will cover the correct x direction.
               However, y will draw from bottom to top, giving [ / ]
               So, y needs to be drawn from top to bottom too,
               XDIR=-1, oy+=dy.
  maj <0, min <0. = third quadrant, set ox+=dx and oy+=dy. That's because
                dy<0, dx<0 is / kind of line so simply moving oxy fixes
                the problem.
  maj <0, min >0 = fourth quadrant.
               XDIR=-1.
  So, if dy^dx <0, then XDIR=-1 else 0.
)
: QUADFIX ( c min maj )
  2DUP >R >R ( c min maj : min maj)
  ABS SWAP ABS SWAP ( c |min| |maj| : min maj )
  >R 0 SWAP R U/ SWAP DROP R> ( c  g |maj| : min maj)
  R> R> 2DUP XOR SGN XDIR ! ( c g |maj| min maj sgn(min^maj)!XDIR [Quadrants 2 and 4])
  OVER SGN SWAP OVER AND >R ( c g |maj| min sgn.min : maj&sgn.min)
  AND R> ( c g |maj| min&sgn.min maj&sgn.min)
;

: DAXISX ( col grad limh sgn&dy sgn&dx)
  OXY >R R +! R> 2+ +!
  0 DO ( col grad)
    2DUP ( c g c g)
    I U* SWAP DROP ( c g c g*i)
    XDIR @ I XOR SWAP OPLOT ( c g )
  LOOP DROP DROP
;

So, this is 13 words + 4 for OPLOT, saving 3 words. Now, drawing the longest line ought to take about (43µs+4*32µs+32µs*16)*320 = 0.22s. In practice, the timed plotting rate is 32000 pixels in 27.8s, or 0.278s for a 320-pixel line; so that's a maximum of 1151 pixels per second, which is slowish, but tolerable.


: DRAW ( col dx dy)
  2DUP OR 0= IF
    DROP DROP DROP
  ELSE
  OVER OXY SWAP OVER @ + >R ( col dx dy &OXY : X')
  2+ @ OVER + >R ( col dx dy : Y' X')
  OVER ABS OVER ABS > IF
    SWAP QUADFIX DAXISX
  ELSE
    QUADFIX DAXISY
  THEN
THEN
  R> R> OXY 2!
;

: RLINE
  3 RND 1+ ( COLOR)
  200 RND 320 RND 2DUP OXY 2!
  320 RND SWAP - SWAP 200 RND SWAP - DRAW
;

: RLINES BEGIN RLINE ?TERMINAL UNTIL ;



Circles

Our circle algorithm, on the other hand, will be copied straight from the equivalent FIGnition version.

Method: we know x^2+y^2 = constant. We start at [r, 0], which gives r^2. We can go straight up, which gives:
   [x^2 + (y+1)^2] - (x^2 + y^2) => a difference of +2y+1.
Or we can pull x in: (x-1)^2 - x^2 => a difference of -(2x-1).
So the rule is: accumulate 2y+1 for each step up, and when the accumulated error exceeds 2x-1, also step x in by one and subtract 2x-1 from the error. By only tracking the error, we don't need to calculate r*r, and therefore there is no danger of 16-bit arithmetic overflow even for radii larger than the width of the highest screen resolution.

: NEXTP ( X Y DIFF )
 OVER DUP + 1+ + >R  ( CALC WITH INC Y )
 OVER DUP + 1 - R> 2DUP > IF
   SWAP DROP
 ELSE
   SWAP - >R SWAP 1 - SWAP R>
 THEN SWAP 1+ SWAP
;
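NEXTP packs that stepping rule into stack operations; as a cross-check, here's the same rule written out in C (my own sketch, covering a single octant and leaving the eight-way mirroring to the caller, as OCTPLOT does below):

#include <stdint.h>

extern void plot8(int16_t dx, int16_t dy);  /* mirror (dx,dy) into all eight octants */

/* Error-accumulation circle, one octant: start at (r,0), always step y,
   and pull x in whenever the accumulated error exceeds the saving from doing so. */
void circle_octant(int16_t r)
{
    int16_t x = r, y = 0, diff = 0;
    while (x >= y) {
        plot8(x, y);
        diff += 2 * y + 1;        /* cost of stepping y: (y+1)^2 - y^2 */
        if (diff >= 2 * x - 1) {  /* time to step x in:  x^2 - (x-1)^2 */
            diff -= 2 * x - 1;
            x--;
        }
        y++;
    }
}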

0 VARIABLE FG

: DXYPLOT ( COL DX DY)
  >R 2DUP R OPLOT R> ;

: OCTPLOT ( CX CY DX DY)
  4 0 DO
    DXYPLOT SWAP
    DXYPLOT NEG
  LOOP
; ( -- CX CY )

: CIRC ( COL X Y R )
  >R SWAP OXY 2! ( COL : R)
  R> 0 0 >R ( COL DX=R, DY=0 : DIFF=0)
  BEGIN
   OCTPLOT R> NEXTP >R
  2DUP < UNTIL R>
  DROP DROP DROP
;

: CIRCS ( CIRCLES)
  0 DO 3 RND 1+ 160 100 I CIRC 2 +LOOP DROP DROP ;




: CIRCBM 0 DO I 3 AND 160 100 98 CIRC LOOP ;

And with a ticks routine of the form HEX : TICKS 40 6C L@ ; DECIMAL (reading the BIOS tick count at 0040:006C, which increments about 18.2 times per second) we can time it fairly well. 20 CIRCBM draws 98*2*pi ≈ 616 pixels per circle and takes 174 ticks. So that's 12315 pixels in 9.56s, or 1288 pixels per second. Amazingly, a bit faster than line drawing!

Conclusion

Once we can choose a graphics mode and implement a plot function, we can build on that with simple line-drawing and circle-drawing algorithms. An original IBM PC running FIG-Forth draws lines like a lazy 8-bit computer, but for graphics of moderate complexity that's tolerable. The real challenge is that a classic line-drawing algorithm involves quite a lot of stack variables with extensive stack shuffling, making an efficient program far more involved than the equivalent even in 8-bit assembler, and the overhead of the Forth interpreter means the Bresenham line-drawing algorithm is slower than one that uses division.

Sunday, 25 July 2021

Fig-Forth At PC=Forty (Part 4)

In part 1, I talked about how to get FIG-Forth for the IBM PC running on PCjs. FIG-Forth was a popular and very compact, public-domain version of the medium speed Forth systems programming language and environment during the early 1980s. Then part 2 covered how to implement a very rudimentary disk-based line editor as a precursor to an interactive full-screen editor; while part 3 dives into machine code routines and a PC BIOS interface, because there was no real screen cursor control via the existing commands.

Let's Edit!

Now, at last, we can implement a screen editor. One of the constraints I'll impose will be to keep the editor to within 1kB of source code. At first that'll be easy, because all I want to support is normal characters, cursor control, return, and escape to update. However, I know that I'll probably want to add the ability to copy text from a marker point using <ctrl-c>. But even a simple decision like this raises possible issues. Consider this: if I type VLIST I find I can press <ctrl-c> to stop the listing:

However, if I type this definition:

: T1 BEGIN KEY DUP . 27 = UNTIL ;

I find that <ctrl-c> doesn't break out of T1; instead it simply displays 3 and I really do have to press <esc> to quit the routine. But it could be that Forth still checks it automatically, just not with the above sequence of commands in T1. No - from the FIG-Forth source we can see that the breakout from VLIST is simply due to it executing ?TERMINAL and exiting if any key has been pressed. So that potential problem is sorted!

VLIST Source Code:

DB 85H
DB 'VLIS'
DB 'T'+80H
DW UDOT-5
VLIST DW DOCOL
DW LIT,80H
DW OUTT
DW STORE
DW CONT
DW AT
DW AT
VLIS1 DW OUTT ;BEGIN
DW AT
DW CSLL
DW GREAT
DW ZBRAN ;IF
DW OFFSET VLIS2-$
DW CR
DW ZERO
DW OUTT
DW STORE ;ENDIF
VLIS2 DW DUPE
DW IDDOT
DW SPACE
DW SPACE

DW PFA
DW LFA
DW AT
DW DUPE
DW ZEQU
DW QTERM
DW ORR
DW ZBRAN ;UNTIL

DW OFFSET VLIS1-$
DW DROP
DW SEMIS

The Editor Itself

The first goal in the editor is to convert y x coordinates in the current screen to a memory location in a buffer. This is a minor change from the initial part of the code in EDL:

: EDYX& ( Y X -- ADDR)
  >R 15 AND 8 /MOD SCR @ B/SCR * +
  BLOCK SWAP C/L * + R> +
;

We want to be able to constrain Y and X coordinates to within the bounds of the screen (in this case with wrap around):

: 1- 1 - ; ( oddly enough missing from FIG-Forth, but I use it quite a bit in the editor)

: EDLIM ( Y X -- Y' X')
  DUP 0< IF
    DROP 1- 0
  THEN
  DUP C/L 1- > IF
    DROP 1+ 0
  THEN
  SWAP 15 AND SWAP
  OVER 2+ OVER 4 + AT
;

The key part of a screen editor is to be able to process characters, I've picked the vi cursor keys:

: DOKEY ( R C K  )
  >R
  R 8 = IF ( CTRL-H, Left)
    1-
  THEN
  R 12 = IF ( CTRL-L, Right)
    1+
  THEN
  SWAP R 11 = IF ( CTRL-K, Up )
    1-
  THEN
  R 10 = IF ( CTRL-J, Down)
   1+
  THEN
  SWAP R 13 = IF ( CR or CTRL-M)
    64 +
  THEN
  EDLIM
  R 31 > R 128 < AND IF ( PRINTABLE)
    2DUP EDYX& R SWAP C!
    R EMIT 1+ UPDATE EDLIM
  THEN
  R>
;

Finally we want to put it all together in a top-level function:

: ED ( scr -- )
  CLS LIST 0 0 EDLIM
  BEGIN
   KEY DOKEY
  27 = UNTIL
  DROP DROP
;

Interestingly, once I'd cleared up an initial bug where the cursor wasn't advanced when I typed a character, and another where I'd missed an AND when checking for printable characters, I was able to use the editor itself to edit in improvements (namely, putting the initial cursor at the right location instead of (0,0)).

Finally, although Forth isn't always as compact as its proponents (like me) often claim, this editor in itself uses a mere 306 bytes - probably the most compact interactive editor I've seen - and there's still over half the screen left for improvements. For example, there's no support for delete (left, space, left), for inserting a line, or for copying text or blocks. But for the moment, it's easily far more enjoyable than the line editor it replaces.

Exercise For The Reader

The biggest user-interface problem I've found with extensions to the editor has been deciding which control character to use to mark the text location for copying. To explain: many archaic screen editors, including some early word processors, used a mark-then-edit sequence for text manipulation. The user would move the cursor to where they wanted to perform an 'advanced' edit operation; mark the initial location; then move the cursor to where they wanted the edit operation to finish; mark the end of that edit text; then finally, possibly move the cursor to some other location and complete the edit. For example, on the Quill word processor for the Sinclair QL, you'd type F3, 'E' (for Erase), move the cursor to where you wanted to erase a block, press enter, move to where the erase should finish (and it would highlight the text as you went along), press enter, confirm you wanted to erase it, and then it would. Or in the Turbo C editor you'd mark an initial starting location, then mark an ending location, and then perform an operation like 'Copy' to duplicate the text, or 'Delete' or 'Move' to move it.

Or on a BBC Micro, the BASIC editor was essentially a line editor which supported two cursor positions (!!). If you needed to manipulate a line instead of just retyping it, you'd list the line (if it wasn't on the screen), then press the cursor keys and a second cursor would appear, moving to where you wanted on the screen, while the cursor at your editing position would remain. You'd then hit COPY and it would copy from the second cursor to your editing position, advancing both cursors.

So, in my system, which is similar to how editing works on FIGnition, I'd want to mark the position I wanted to copy or erase from; move to where I wanted to paste to (or delete up to); and then COPY / MOVE a character at a time from the source to the destination (or erase the text).

However, the most obvious control character to use for marking, <ctrl-m>, is already used for Return, and everything else seems rather contrived. Then I thought: what happens if I use <ctrl-symbol> instead? Do they produce interesting control codes? I found that quite a number produce 0s, but some actually generate control codes in the range 0..31 that you can't generate from <ctrl-a> to <ctrl-z>.

This is what I found out:

 Ctrl+   Code
  \       28
  ]       29
  6       30
  -       31

So, what I'd like to know is whether this is just an artefact of the emulator being used on a Mac, or whether it's common to other emulators and to an actual IBM PC?

Conclusion

Once I'd implemented some earlier, critical definitions it turned out to be quite easy and satisfying to write a full-screen editor. The biggest challenges were in making sure certain key presses wouldn't collide with any system behaviour and finally thinking about some user-interface decisions for some future enhancements.

The editor also nicely illustrates some key Forth aspirations: the editor turns out to be very compact (though of course it's very rudimentary too); and I was able to use it to debug and improve itself once I'd reached some critical level of functionality. It was so easy and tiny, I wonder why it wasn't a standard part of FIG-FORTH, given that it was designed in an era when cursor addressable VDUs were already the norm.