Thursday, 26 August 2021

Virtually Lost: An Alternative Intel 80286 Protected Mode

The Intel 80286 was the true successor to the unexpectedly, overwhelmingly dominant 8086. This post is intended to be part of a larger series on its protected-mode architecture, and describes an alternative: paged memory management. We show that a far simpler design could have achieved far more for operating systems on Intel processors.

The development of the 80286 is covered pretty well in the 80286 Oral History and some anecdotal information can be found in Wikipedia.

The original 80286 protected mode is a highly sophisticated, purely segmented design modelled on Multics segmentation. In real 80286 operating systems, including MS-DOS, OS/2, Xenix (probably) and Windows, it saw almost no practical use beyond providing access to the full 16MB physical address space. We won't discuss that further here (it's for future posts); instead we'll discuss a paged alternative, called the 80286p.

80286p Overview

If the designer(s) of the 286 had had enough foresight and a willingness to break with the ideology of Intel's i432 albatross, they could have implemented a paged memory version of it with simple 4KB pages in a 16MB physical and virtual address space. This could have made a lot of sense given that the 80286 was only supposed to be a stop-gap CPU design until the i432 was released.


Address Translation


A simple 16-bit VM architecture for the 286 could have redefined a segment register to point to one of 64K 256-byte-aligned base addresses (an 8-bit shift rather than the 8086's 4-bit shift). This would have extended the virtual address space to 16MB with the same kind of incompatibility as an actual 286, whilst being conceptually similar to the 8086.

In fact, the 8086 designers did consider an 8-bit shift for segments; however, they rejected this in favour of a 4-bit shift on the grounds that the resulting 16MB address space couldn't be accommodated in a 40-pin package without sacrificing other hardware design goals.
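As a rough illustration of the 8-bit segment shift described above, here is a minimal sketch of how a segment:offset pair could form a 24-bit linear address and a 12-bit virtual page number. The names LinearAddress, VirtualPage, kSegShift and kPageShift are made up for this sketch, and wrapping at 16MB is an assumption rather than something the design specifies.

```c
#include <stdint.h>

/* Assumed constants: 8-bit segment shift (256-byte granularity), 4KB pages. */
enum { kSegShift = 8, kPageShift = 12 };

/* 24-bit linear address from a 16-bit segment and a 16-bit offset.
   Wrapping at 16MB is an assumption made for this sketch. */
static uint32_t LinearAddress(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << kSegShift) + off) & 0xFFFFFFu;
}

/* 12-bit virtual page number: the top 12 bits of the linear address. */
static uint16_t VirtualPage(uint16_t seg, uint16_t off)
{
    return (uint16_t)(LinearAddress(seg, off) >> kPageShift);
}
```

For example, segment 0x1234 with offset 0x0010 gives linear address 0x123410, which lies in virtual page 0x123. Note that the addition can carry into the page number (segment 0x000F with offset 0xF100 lands in page 0x10), so the page has to be taken from the full sum.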


The VM side comprises a 12-bit translation and a 12-bit tag, used for user code access only, plus four access modes (none, code, read-only and read-write data), for a total of 26 bits per TLB entry. Assuming the same register resources as an actual 80286, the four 48-bit descriptor caches and additional VM supporting registers could provide for up to 8 TLB entries, backed by a software page table (which would only need to be 8KB in size at maximum); kernel mode could be purely physically addressed. 8 TLB entries isn't a lot, but even some DEC VAX computers only supported 8 TLBs.
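To make the 26-bit figure concrete, here is a small sketch of one way such an entry could be packed. The bit positions (translation in bits 0-11, tag in bits 12-23, access mode in bits 24-25), the numeric values of the access modes, and the names TlbPack, TlbTag, TlbPhys and TlbAccess are all assumptions for illustration; the text only fixes the field widths.

```c
#include <stdint.h>

/* Four access modes need 2 bits; the numeric values are assumed. */
typedef enum { kAccNone = 0, kAccCode = 1, kAccRo = 2, kAccRw = 3 } Access;

/* Pack a 12-bit virtual tag, 12-bit physical page and 2-bit access mode
   into the low 26 bits of a 32-bit word (layout assumed for this sketch). */
static uint32_t TlbPack(uint16_t tag, uint16_t phys, Access acc)
{
    return ((uint32_t)acc << 24)
         | ((uint32_t)(tag & 0xFFFu) << 12)
         | (uint32_t)(phys & 0xFFFu);
}

static uint16_t TlbTag(uint32_t entry)    { return (uint16_t)((entry >> 12) & 0xFFFu); }
static uint16_t TlbPhys(uint32_t entry)   { return (uint16_t)(entry & 0xFFFu); }
static Access   TlbAccess(uint32_t entry) { return (Access)((entry >> 24) & 3u); }
```

Eight such entries come to 208 bits, which is in the same range as the 192 bits held by the four 48-bit descriptor caches plus a few supporting registers.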


Enabling the MMU


The PMMU is enabled using bit 15 of the original 8086 flags register (which is defined to be 0 for the 80286 and 80386). Setting it to 1 enables the PMMU. Resetting bit 14 selects physically addressed kernel mode, in which bit 13 is "don't care" and full I/O access is automatically supported.


An MMU fault pushes the access mode used and the virtual page tag onto the stack and switches to physical addressing (flags.i must be 0 and flags.k must be 1 for the MMU to translate addresses, and flags.i can't be changed if virtual addressing is on). All further MMU handling is in software. The MMU uses an LRU (least recently used) algorithm for replacing TLB entries (essentially a 3-bit counter): on return from the fault handler, the least recently used TLB entry is replaced by the updated access mode and translation address.
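The fault-and-refill cycle described above can be modelled in a few lines of C. This is only a software sketch of the hardware behaviour, assuming 8 entries and a plain 3-bit replacement counter; the Tlb structure and the names TlbLookup and TlbRefill are hypothetical, and the access-mode check is reduced to "mode 0 means the entry is invalid".

```c
#include <stdint.h>
#include <stdbool.h>

enum { kTlbs = 8 };

typedef struct {
    uint16_t tag[kTlbs];    /* 12-bit virtual page tag */
    uint16_t phys[kTlbs];   /* 12-bit physical page */
    uint8_t  access[kTlbs]; /* access mode; 0 (none) marks an invalid entry */
    uint8_t  head;          /* 3-bit replacement counter */
} Tlb;

/* Returns true and sets *phys on a hit. A miss models an MMU fault,
   after which the software handler supplies the translation. */
static bool TlbLookup(const Tlb *t, uint16_t tag, uint16_t *phys)
{
    for (int i = 0; i < kTlbs; i++) {
        if (t->access[i] != 0 && t->tag[i] == tag) {
            *phys = t->phys[i];
            return true;
        }
    }
    return false;
}

/* On return from the fault handler: overwrite the entry the counter
   points at, then advance the counter modulo 8. */
static void TlbRefill(Tlb *t, uint16_t tag, uint16_t phys, uint8_t access)
{
    uint8_t i = t->head;
    t->tag[i] = (uint16_t)(tag & 0xFFFu);
    t->phys[i] = (uint16_t)(phys & 0xFFFu);
    t->access[i] = access;
    t->head = (uint8_t)((i + 1u) & 7u);
}
```

With all entries zeroed, every lookup misses, so the first access after the MMU is enabled always faults. After eight further refills the counter wraps and the oldest entry is overwritten.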


The initial entry into user mode can be achieved by creating a system virtual page table containing translations for the current thread of execution, then setting bit 14. The next instruction fetch causes a TLB fault, the VM entry is mapped to the current physical page, and execution continues. This implies that at least one VM user process should be allocated to the kernel for 'booting' user mode and for user-side management (a 64kB kernel would only need 16 entries). Kernel mode support requires a kernel-mode SS and SP register pair; this means that user mode is expected to provide its own settings for SS and SP.


Software Page Tables


A VM algorithm can be extremely simplistic if all we want to do is support a number of user processes in a multiple virtual memory space, while caching a fixed swap space and ignoring any virtual kernel mode. The TLB uses a round-robin algorithm and the instruction MOVT loads TLB[cl]'s physical translation from AX. A process's virtual address space has a simple organisation: a fixed region for code and read-only data, followed by a space for read/write heap memory, and finally a stack region which must be <64kB (because the stack is limited to a single 64kB segment). Each entry in the VM table references a physical PTE and, because we can deduce access rights from the VM tag, we don't need to store access rights within each VPte. The PTE limit is therefore up to 64K pages of 4096 bytes, or 256MB, easily enough for the lifespan of the 80286p (though only 16MB is actual physical memory; the rest are swap page entries).
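Deducing access rights from the tag alone might look something like the following sketch. The VmLayout structure, its field names, the numeric values of the access modes and the function AccessFromTag are hypothetical; the only thing taken from the text is the ordering of the regions (code, read-only data, then read/write heap and stack).

```c
#include <stdint.h>

/* Access mode values are assumed for this sketch. */
typedef enum { kVmAccessNone, kVmAccessCode, kVmAccessRo, kVmAccessRw } VmAccessMode;

/* Hypothetical per-process layout, in units of 4KB virtual pages. */
typedef struct {
    uint16_t codeEnd; /* first page after the code region */
    uint16_t roEnd;   /* first page after the read-only data region */
    uint16_t limit;   /* first page after the heap and stack */
} VmLayout;

/* The access rights follow from where the tag falls in the layout,
   so no rights need to be stored in the page table entry itself. */
static VmAccessMode AccessFromTag(const VmLayout *layout, uint16_t tag)
{
    if (tag < layout->codeEnd) return kVmAccessCode;
    if (tag < layout->roEnd)   return kVmAccessRo;
    if (tag < layout->limit)   return kVmAccessRw;
    return kVmAccessNone;      /* outside the process's address space */
}
```

With a layout of 16 code pages, 4 read-only pages and a limit of 64 pages, tag 3 is code, tag 17 is read-only, tag 40 is read-write and tag 64 is unmapped.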


We also assume that, although there's a single user space, each application allocates a fixed code space and stack space, and that all data space is shared and dynamically allocated and freed. A virtual memory map looks like this:

A physical memory map also contains the dynamic memory allocations and application allocations. Because the code and stack spaces are fixed, it's simple to test for an access violation by reading the page table (Page=((Seg<<8)+Addr)>>12). The rules are fairly simple. When a fault occurs, the access mode used must at least match the page's access rights; otherwise it's a real access rights violation (erroneous code). Then, if the translation address is in swap space, we page it into physical memory at the next page (mod user pages), paging out the virtual page previously held at that physical address. If the page is RW, we update the TLB as read-only; otherwise we update it with the page's actual access rights.
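The rules above can be restated as a small decision function. This is one reading of them rather than a definitive version: it assumes that translations at or above the number of physical user pages refer to swap, that a read of an RW page is legal, and that an RW page first touched by a read is installed read-only so that a later write faults again. All names and enum values here are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum { kAccNone, kAccCode, kAccRo, kAccRw } Access;
typedef enum { kActTrap, kActSwapIn, kActMapOnly } FaultAction;

/* Assumed rule: the mode used must equal the page's rights,
   except that reading a read-write page is also allowed. */
static bool AccessPermitted(Access used, Access real)
{
    return used == real || (used == kAccRo && real == kAccRw);
}

/* Decide what the fault handler has to do for this fault. */
static FaultAction ClassifyFault(Access used, Access real,
                                 uint16_t trans, uint16_t physPages)
{
    if (real == kAccNone || !AccessPermitted(used, real))
        return kActTrap;                 /* real access rights violation */
    return (trans >= physPages) ? kActSwapIn : kActMapOnly;
}

/* Access mode to load into the TLB after the fault is resolved. */
static Access InstallAccess(Access used, Access real)
{
    return (real == kAccRw && used != kAccRw) ? kAccRo : real;
}

/* Next physical page to reuse, wrapping modulo the user page count. */
static uint16_t NextVictim(uint16_t head, uint16_t userPages)
{
    return (uint16_t)((head + 1u) % userPages);
}
```

A write to a read-only page traps, a code fetch from a page whose translation points into swap triggers a swap-in, and the victim pointer wraps from the last user page back to page 0.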

A Simplistic Swap Algorithm

Although there's more to a Virtual Memory implementation, the Swap algorithm is central. Here's a simplistic swap algorithm, which supports up to 256MB of swap space and 16 processes each of which can be up to 16MB.

void VmSwap(uint16_t aAccess)
{
  uint16_t tag=aAccess&0xfff; // got the page.
  uint8_t fault=(aAccess>>12)&kTlbAccessMask;
  uint8_t realAccess=VmAccess(gVmVPte, tag); // proper access rights
  uint16_t trans;
  if(fault==kVmAccessRo && realAccess==kVmAccessRw) { 
    gVmVPte->iPages[tag]=(aAccess|=(kVmAccessRw<<12));
    return;
  }
  trans=gVmVPte->iPages[tag]; // Phys page (possibly in swap)
  if(fault!=realAccess) {
    Trap(&aAccess, &aMap); // Application faulted.
    return; // VmSwap is void, so it can't return Trap's result.
  }
  else if(trans>gVmPhys){ // access is OK and paged out; swap out next page (if needed).
    uint16_t swapOut=gVmPte[gVmPhysHead]; // vpte and vpte entry.
    tVmVPte *vPte=gVmVPteSet[swapOut>>12]; // got the process vPte.
    uint16_t swapBlk;
    swapOut&=0xfff; // each virtual table is <=4095 pages.
    if(vPte==gVmVPte) { // the swapOut page might be in the TLB.
      uint16_t tlb, tlbTag;
      for(tlb=0;tlb<kVmTlbs; tlb++) {
        __asm("mov cl,%1",tlb);
        __asm("movt ax,cl");
        __asm("mov %1,ax",tlbTag);
        if((tlbTag&0xfff)==swapOut) {
          __asm("xor ax,ax");
          __asm("movt cl,ax"); // clear the swapOut page from the TLB if so.
          break; // found it; no need to scan the remaining TLB entries.
        }
      }
    }
    if(VmAccess(vPte, swapOut)==kVmAccessRw) { // write back.
      swapBlk=gVmSwapBase+gVmNextOut; // the swapout tail.
      SwapWrite(swapBlk, ((long)((gVmPhysHead)+gVmUserBase)<<20),kVmPageSize);
      vPte->iPages[swapOut]=gVmNextOut; // save swapped out location.
      gVmNextOut=gVmPte[gVmNextOut]; // Pte entry for free block points to next free.
    } // otherwise we don't need to write back.
    else { // Code and Ro pages still need to update the vPte.
      vPte->iPages[swapOut]=vPte->iRoBase+swapOut;
    }
    SwapRead(trans, ((long)((gVmPhysHead)+gVmUserBase)<<20),kVmPageSize);
    gVmPte[gVmPhysHead]=(gVmProcess<<12)|tag; // update Pte
    gVmVPte->iPages[tag]=gVmPhysHead; // update VPte to point to phys mem.
    swapBlk=gVmSwapBase+(gVmVPte->iPages[tag]<<gVmPtePerPage);
    if(++gVmPhysHead>gVmUserLim) {
      gVmPhysHead=0; // reset.
    }
  }
  aAccess=(aAccess&0xf000)|((gVmVPte->iPages[tag]&0xfff)+gVmUserBase); // Return the new Phys page
}

The PTE can do double duty: for a page in use, it references a virtual table and a given entry within it; for a spare page, it references the next free page available for modified read/write pages. Swap-outs for non-modifiable pages don't require any writes, so those pages never move. They can be found again because they are stored in contiguous swap blocks when the application is loaded (moving other pages out of the way if needed; if there's no space, the application can't load).

We have to provide a means of invalidating specific TLB entries, because a swapped-out page may currently be in the TLB (if it belongs to the same process), and then two different tags could map to the same physical page. Thus, instructions to load and store TLB entries (movt ax,cl and movt cl,ax) are the minimum needed.


In this VM system, dynamic memory allocations (including stack space) would allocate a read-write block in virtual memory space (which may currently be mapped into physical user space); code allocations would copy all the code to virtual memory. Similarly, deallocations would free the block in swap space. To create a new program with a given code and stack space, the heap between the end of the current code space and the additional code and data space must be free (the program must defragment the heap if needed to do this). User code can't access kernel space directly; instead, kernel services are reached via the INT interface. The physical page table is much smaller than the VPTE, comprising, for example, only 128 bytes for 256kB of physical memory (the IBM AT in 1984 only came with 256kB as standard), and would be smaller still given that the kernel space wouldn't be included.
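The table sizes quoted in this post follow from simple arithmetic, sketched below. It assumes 4KB pages and 16-bit (2-byte) table entries throughout; the names TableBytes and SwapReachBytes are made up for the sketch.

```c
#include <stdint.h>

/* Assumed parameters: 4KB pages, one 16-bit entry per page. */
enum { kPageSize = 4096, kEntryBytes = 2 };

/* Size in bytes of a page table covering the given amount of memory. */
static uint32_t TableBytes(uint32_t memoryBytes)
{
    return (memoryBytes / kPageSize) * kEntryBytes;
}

/* Memory reachable through a 16-bit page number: 64K pages of 4KB each. */
static uint32_t SwapReachBytes(void)
{
    return 65536u * (uint32_t)kPageSize;
}
```

256kB of physical memory is 64 pages, hence 128 bytes of table. A full 16MB virtual space is 4096 pages, hence the 8KB maximum for a process's virtual table, and 64K page numbers reach 256MB of physical memory plus swap.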


But within these limitations it can be seen that a virtual memory implementation would be relatively simple, easily possible with an early 80286p operating system.


The 80286p could also support 8086-compatible mapping, whereby the segment is only shifted 4 bits, providing a virtual memory space of 1MB (via a second flags register). The standard 80286p method for clearing the TLBs is to turn the MMU off (by resetting the MMU flag) and then turn it on again. The TLB can have a simple 3-bit LRU head register, initialised to 0. Unmapped accesses or access rights faults lead to page faults, which cause the next TLB entry to be updated with the returned physical page and access rights (the virtual page is unchanged). Thus initialising the TLB with all 0's means that no accesses will initially map correctly.

More Limitations

The original 80286 could virtualise interrupts (by providing an interrupt trap), but in this implementation, user code can't service interrupts. However, OS routines could provide mechanisms for jumping to user interrupts if needed.

The original 80286 provided mechanisms for jumping to different protection levels, but the 80286p supports only a physically addressed kernel and a virtually addressed user mode.

The original 80286 supported user I/O access, so it's possible that the 80286p could do too on a global basis. This would allow Windows 3.1 style user-side I/O access.

The original 80286 could support thousands of processes, because every LDT (Local Descriptor Table) could be a process. The 80286p doesn't really support any processes in hardware, but the simple software implementation above supports 16. This would be a small number by the standards of Unix in the 1980s, but desktop computer operating systems such as OS/2 1.0 and Mac OS Classic supported only a limited number of applications in memory (Mac OS Classic had a shared memory space too). Extending the physical page table to 32 bits per entry could provide for up to 65536 address spaces, each with 256MB of virtual address space per VTable. However, it's unlikely this would be necessary: the 80286 was superseded by the 80386 in 1986, and by the time computers were reaching 25% of its physical memory limit in the early 1990s, it had been replaced by the i386 and i486.

For the same reason, although it would be possible to increase the address space of the 80286 by having separate instruction and data spaces (so the virtual address space could be up to 32MB even though the physical address space would be 16MB, by simply differentiating TLB tag entries based on code vs data access rights), there's no point, because the processor would have been a minority player by the time this could be exploited.


Conclusion

Implementing a simpler, paged memory management unit for the 80286 would have given software developers most of what an operating system needs for virtual memory, while allowing simple software implementations that made better use of the 80286, supporting full compatibility with the 8086 and retaining a similar segmentation model.


In turn, this would have led to a simpler 80386 implementation, accelerating the dominance of the IBM PC.