Imagine a processor with no interrupts. We can do a lot better and get rid of most exceptions (e.g. system calls, page faults, etc.), most peripheral devices/buses, and even cache misses, but let’s start with interrupts. Modern microprocessors are bloated with circuitry designed to let the precious CPU switch back and forth between streams of code, because CPU cycles were once the scarcest resource. That was a great idea in the 1950s. But interrupts are expensive in compute time, circuit complexity, and chip area. Interrupts take microseconds to start processing, which is an eternity on a GHz processor. And they solve a problem that does not exist anymore: we have a lot of processor cores. In fact, one problem faced by computer architects is that it’s not easy to exploit parallel processing if you keep thinking in terms of 1950s computer architecture.
Suppose you have 32 cores and you make one or two of them poll I/O devices. The other 30 or so never get I/O interrupts, never run interrupt-handling code, and never have to save and restore state or context switch because of I/O interrupts. What they do is run application code to completion or for a specified number of cycles or clock ticks. The cores reserved for the operating system manage all devices, and even they don’t need interrupts because they can run a real-time OS that polls devices.
Suppose we have 100 devices and flags on device registers to indicate whether the device needs attention. We’d probably want devices to be able to do DMA (but see below). The polling software might just cycle through in arbitrary order or in order of priority or to optimize some function.
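To make the polling idea concrete, here is a minimal software sketch of one pass over 100 devices in priority order. Everything in it is an assumption for illustration: the `Device` class, the `needs_attention` flag standing in for a device register bit, and the rule that a lower `priority` number is served first. Real polling code would read memory-mapped registers rather than Python objects.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Device:
    """Hypothetical device model: one 'needs attention' flag plus a priority."""
    name: str
    priority: int                  # lower number is served first (an assumption)
    needs_attention: bool = False  # stands in for a flag in a device register

    def service(self) -> str:
        # Stand-in for reading registers or kicking off a DMA transfer.
        self.needs_attention = False
        return self.name


def poll_once(devices: List[Device]) -> List[str]:
    """One polling pass: visit devices in priority order, service flagged ones."""
    serviced = []
    for dev in sorted(devices, key=lambda d: d.priority):
        if dev.needs_attention:
            serviced.append(dev.service())
    return serviced


devices = [Device(f"dev{i}", priority=i % 4) for i in range(100)]
devices[7].needs_attention = True    # priority 3
devices[40].needs_attention = True   # priority 0
first_pass = poll_once(devices)      # dev40 is serviced before dev7
second_pass = poll_once(devices)     # nothing left to do
```

Swapping the sort key is all it takes to try a different visiting order, such as round-robin or some cost function.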
For an x86-type processor: throw away the IOAPIC, the IOAPIC bus, the APIC on each core, the logic for unwinding the pipeline on an interrupt, and the processor interrupt level logic. And then contemplate the layers of interrupt virtualization in modern processors. We could have a simple cycle timer switch on each core so that after the timer expires there is an interrupt-like jump to a function to see what to do next. That jump would be perfectly synchronous, since predicting the next jump can be done with 100% accuracy (or nearly 100%). (Note added 2022: one method, as noted in the Ycombinator discussion, is to have the core do a pseudo-fetch of a jump to some designated place, maybe the OS scheduling code in some protected memory or even the new co-routine/process. There could be multiple such timers with different jump targets or even different flavors of jump.)
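A rough software model of the cycle timer may help. This is only a sketch under stated assumptions: each `yield` in the hypothetical `task` generator stands for one cycle of application work, and `run_core` plays the role of the timer plus the "see what to do next" function. In the hardware I have in mind the jump would be a fetched instruction, not a Python loop.

```python
from collections import deque


def task(name, cycles, log):
    """Hypothetical application code; each yield stands for one cycle of work."""
    for _ in range(cycles):
        log.append(name)
        yield


def run_core(tasks, budget, log):
    """Run each task for at most `budget` cycles, then jump to the scheduler."""
    ready = deque(tasks)
    switches = 0
    while ready:
        current = ready.popleft()
        for _ in range(budget):
            try:
                next(current)
            except StopIteration:
                break              # task ran to completion before the timer
        else:
            # Timer expired: the synchronous jump to "what do we run next".
            ready.append(current)
            switches += 1
    return switches


log = []
switches = run_core([task("A", 3, log), task("B", 1, log)], budget=2, log=log)
```

Because the switch happens at a point the core can see coming, nothing has to be unwound; the task is simply put back on the ready queue.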
Once I/O interrupts go away, radical redesign of peripheral devices starts to seem sensible. Instead of putting an embedded computer on an Ethernet adapter card or chip, let’s use one of our I/O cores, which can already access memory and knows how to get data to the right processes or OS components. Current OS offload ends up recreating the same bottlenecks on the adapter, but the OS cores don’t have the same problems. The adapter can be simplified and optimized for speed and error correction. Perhaps a hardware FIFO of addresses for the TX and RX queues: header, payload start, payload end, condition code. The OS core software has to keep the TX queue from sitting idle and the RX queue from overflowing. RDMA is reasonably simple in this environment and can be used to do something more interesting than traditional “OS bypass” because the OS knows where the data should go or come from. Do nanosecond work in the devices.
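Here is one way the descriptor FIFO might look, modeled in software. The four fields come from the description above; the class names, the fixed `depth`, the drop-on-overflow behavior, and the convention that condition code 0 means OK are all assumptions made for the sketch.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Descriptor:
    header: int          # address of the packet header
    payload_start: int   # address of the first payload byte
    payload_end: int     # address one past the last payload byte
    condition: int = 0   # condition code; 0 meaning OK is an assumption


class HardwareFifo:
    """Software model of a fixed-depth descriptor FIFO on the adapter."""

    def __init__(self, depth: int):
        self.depth = depth
        self._slots = deque()
        self.dropped = 0

    def push(self, d: Descriptor) -> bool:
        if len(self._slots) >= self.depth:
            self.dropped += 1      # RX overflow: the OS core fell behind
            return False
        self._slots.append(d)
        return True

    def pop(self) -> Optional[Descriptor]:
        return self._slots.popleft() if self._slots else None


def drain_rx(rx: HardwareFifo) -> int:
    """OS-core side: empty the RX FIFO and return total payload bytes seen."""
    total = 0
    while (d := rx.pop()) is not None:
        total += d.payload_end - d.payload_start
    return total


rx = HardwareFifo(depth=2)
accepted = [
    rx.push(Descriptor(0x1000 * i, 0x1000 * i + 64, 0x1000 * i + 164))
    for i in range(3)              # third push overflows the 2-deep FIFO
]
received_bytes = drain_rx(rx)
```

The `dropped` counter shows what is at stake for the OS core: if it does not drain the RX side often enough, packets are lost at the adapter.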
Do we need system calls with all the complex context and permission-level changes? I don’t see why. A system call could queue a request for the OS and then either call code to voluntarily switch or keep going, which might allow the same application to respond to the result on a second core or speed up the event model or coroutines. For some application loads we’d maybe need to steal a few more cores for OS functions, but the OS is now much simplified without all these weird asynchronous operations, and code flow can be optimized for smarter caching (and this leads to a simplified cache design without all the expensive circuitry for snooping and cache coherency).
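A system call as a queued request could be sketched like this. It is a single-threaded model, so the "application core" and "OS core" are just two functions taking turns; the `SyscallQueue` class, the `Request` record, and the `strlen` operation are hypothetical names invented for the example, and a real design would need a shared-memory queue between cores.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Request:
    op: str
    arg: Any
    result: Optional[Any] = None
    done: bool = False


class SyscallQueue:
    """Hypothetical request queue between an application core and an OS core."""

    def __init__(self) -> None:
        self._pending = deque()

    def submit(self, op: str, arg: Any) -> Request:
        # Application side: enqueue and return at once, no mode switch.
        req = Request(op, arg)
        self._pending.append(req)
        return req

    def os_core_step(self, handlers: Dict[str, Callable[[Any], Any]]) -> bool:
        # OS-core side: handle one request if there is one.
        if not self._pending:
            return False
        req = self._pending.popleft()
        req.result = handlers[req.op](req.arg)
        req.done = True
        return True


handlers = {"strlen": len}          # stand-in for real OS services
q = SyscallQueue()
req = q.submit("strlen", "hello")
done_before = req.done              # application keeps going; not done yet
handled = q.os_core_step(handlers)  # later, the OS core picks it up
```

The application can check `req.done` whenever it likes, or hand the request to code on a second core that waits for the result.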
Getting rid of page faults/memory faults seems like it needs a slightly bigger change to programming models, but not all that radical.