Something about ISA Emulation -- 1
A complete ISA consists of many parts, including the register set and memory architecture, the instructions, and the trap and interrupt architecture. A virtual machine implementation is usually concerned with all aspects of ISA emulation.
Here we will focus on (user-level) instruction emulation.
Instruction set emulation can be carried out with two basic techniques: interpretation and binary translation.
Interpretation involves a cycle of fetching a source instruction, analyzing it, performing the required operation, and then fetching the next source instruction – all in software. Binary translation, on the other hand, attempts to amortize the fetch and analysis costs by translating a block of source instructions to a block of target instructions and saving the translated code for repeated use. In contrast to interpretation, binary translation has a higher initial translation cost but a lower execution cost. The choice of one or the other depends on the number of times a block of source code is expected to be executed by the guest software. Predictably, there are techniques that lie in between these extremes. For example, threaded interpretation eliminates the interpreter loop corresponding to the cycle mentioned earlier, and efficiency can be increased even further by predecoding the source instructions into a more efficiently interpretable intermediate form.[^1]
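As a concrete illustration, below is a minimal decode-and-dispatch interpreter loop in C for a tiny, made-up 32-bit guest ISA; the encoding, opcodes, and register-file layout are assumptions invented for this sketch, not taken from the book.

```c
#include <stdint.h>

/* Hypothetical fixed-width guest encoding: 8-bit opcode, three register fields. */
enum { OP_ADD = 0, OP_SUB = 1, OP_HALT = 2 };

typedef struct {
    uint32_t regs[16];
    uint32_t pc;                     /* source program counter, in bytes */
} GuestState;

void interpret(GuestState *s, const uint32_t *code)
{
    for (;;) {
        uint32_t inst = code[s->pc / 4];              /* fetch                */
        uint32_t op   =  inst >> 24;                  /* decode               */
        uint32_t rd   = (inst >> 16) & 0x0f;
        uint32_t ra   = (inst >>  8) & 0x0f;
        uint32_t rb   =  inst        & 0x0f;

        switch (op) {                                 /* dispatch and execute */
        case OP_ADD:  s->regs[rd] = s->regs[ra] + s->regs[rb]; break;
        case OP_SUB:  s->regs[rd] = s->regs[ra] - s->regs[rb]; break;
        case OP_HALT: return;
        default:      return;                         /* unknown opcode       */
        }
        s->pc += 4;                                   /* next source instruction */
    }
}
```

Every iteration pays for the fetch, the decode, and a hard-to-predict indirect branch in the switch; the rest of this section is about reducing exactly these costs.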
Speeding up Interpretation
Observation: Jumps are time consuming
- control conflicts in CPU pipeline
- lots of branch mispredictions in dispatcher
What are pipeline conflicts?
- data/control/resource conflicts
- conflicts enforce pipeline stalls
- a longer pipeline risks longer stalls
How to reduce the bad effects of control conflicts?
- predict jump targets
- execute in a speculative state
How do branch predictors typically work?
- static: used on the first execution (e.g. predict backward branches as taken)
- dynamic: often uses a table keyed by the instruction address (see the sketch below)
- conditional branches: saturating counters keyed by a history pattern
- returns: a stack of recently pushed return addresses
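To make the "table keyed by instruction address" idea concrete, here is a hedged sketch of a dynamic predictor built from 2-bit saturating counters indexed directly by the branch address (real predictors often fold a branch-history pattern into the index as well); the table size and indexing scheme are assumptions for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 1024                     /* illustrative table size         */
static uint8_t bht[BHT_ENTRIES];             /* 2-bit saturating counters, 0..3 */

/* Predict "taken" when the counter is in one of the two upper states. */
bool predict_taken(uint32_t branch_addr)
{
    return bht[(branch_addr >> 2) % BHT_ENTRIES] >= 2;
}

/* Train the counter with the actual outcome, saturating at 0 and 3. */
void train(uint32_t branch_addr, bool taken)
{
    uint8_t *c = &bht[(branch_addr >> 2) % BHT_ENTRIES];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}
```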
Efficiency Guidelines for Branches
- “Premature optimization is the root of all evil”
- run performance analysis tools on the final code
- reduce the number of branches
  - use inlining, which also helps through specialization
  - combine switch statements (see the sketch below)
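As a hedged illustration of "combine switch statements", the sketch below fuses two dependent switches (one on a major opcode, one on an extended opcode) into a single switch on a combined key, so each instruction pays for only one hard-to-predict dispatch branch; the field layout and opcode values are invented for the example.

```c
#include <stdint.h>

/* Invented encoding: 6-bit major opcode in the top bits, 6-bit extended
 * opcode in the bottom bits. Instead of switch (op) { ... switch (ex_op) ... },
 * dispatch once on a key that fuses both fields. */
#define KEY(op, ex_op) (((op) << 6) | (ex_op))

void emulate(uint32_t inst)
{
    uint32_t op    = inst >> 26;
    uint32_t ex_op = inst & 0x3f;

    switch (KEY(op, ex_op)) {          /* one dispatch instead of two */
    case KEY(0, 32): /* e.g. add  */ break;
    case KEY(0, 34): /* e.g. sub  */ break;
    case KEY(1,  0): /* e.g. load */ break;
    default:         /* not handled in this sketch */ break;
    }
}
```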
Optimization “Threaded Interpretation”
Observation: Jumps are time consuming
- control conflicts in CPU pipeline
- lots of branch mispredictions in dispatcher
Solution:
decode the next instruction at the end of the emulation routine of the current instruction.
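A minimal sketch of indirect threaded interpretation for a tiny made-up stack bytecode, using GCC's "labels as values" extension (a common way to express this in C, though not standard C): every handler ends by looking up the next opcode in the dispatch table and jumping there, so the central fetch–dispatch loop disappears and each handler gets its own indirect branch.

```c
#include <stdint.h>

/* Made-up bytecode: 0 = PUSH1 (push constant 1), 1 = ADD, 2 = HALT. */
uint32_t run_indirect(const uint8_t *code)
{
    static void *dispatch[] = { &&op_push1, &&op_add, &&op_halt };
    uint32_t stack[64], *sp = stack;
    const uint8_t *pc = code;

    goto *dispatch[*pc++];                  /* initial dispatch             */

op_push1:
    *sp++ = 1;
    goto *dispatch[*pc++];                  /* fetch + table lookup + jump  */
op_add:
    sp--; sp[-1] += sp[0];
    goto *dispatch[*pc++];                  /* ...repeated in every handler */
op_halt:
    return sp[-1];
}
```

For the program {0, 0, 1, 2} (push 1, push 1, add, halt) this returns 2.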
Indirect vs. direct threading
- Direct threaded code (sketched below) is very similar to the indirect threaded code, except the dispatch table lookup is removed. The address of the interpreter routine is loaded from a field in the intermediate code, and a register indirect jump goes directly to the routine. Although fast, this causes the intermediate form to become dependent on the exact locations of the interpreter routines and consequently limits portability. If the interpreter code is ported to a different target machine, it must be regenerated for the target machine that executes it. However, there are programming techniques and compiler features that can mitigate this problem to some extent.
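A hedged sketch of the direct threaded variant of the same toy interpreter, with one extension: PUSH now carries an immediate operand byte, so each instruction occupies two bytes. The predecoded program stores the handler addresses themselves, so the dispatch table lookup disappears. Because the label addresses are only known inside the interpreter function, the predecode step that installs them lives there too, which illustrates why direct threaded intermediate code is tied to one particular build of the interpreter.

```c
#include <stdint.h>

typedef struct { void *handler; uint32_t operand; } DecodedInst;

/* Toy bytecode: 0 = PUSH imm, 1 = ADD, 2 = HALT, two bytes per instruction.
 * `buf` must hold `n` predecoded entries. */
uint32_t run_direct(const uint8_t *code, DecodedInst *buf, int n)
{
    static void *labels[] = { &&op_push, &&op_add, &&op_halt };
    uint32_t stack[64], *sp = stack;

    for (int i = 0; i < n; i++) {                  /* predecode once        */
        buf[i].handler = labels[code[2 * i]];
        buf[i].operand = code[2 * i + 1];
    }

    DecodedInst *ip = buf;
    goto *ip->handler;                             /* initial dispatch      */

op_push:
    *sp++ = ip->operand; ip++;
    goto *ip->handler;                             /* no table lookup: the address
                                                      is in the intermediate code */
op_add:
    sp--; sp[-1] += sp[0]; ip++;
    goto *ip->handler;
op_halt:
    return sp[-1];
}
```

For the program bytes {0,2, 0,3, 1,0, 2,0} (push 2, push 3, add, halt) this returns 5.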
Interpretation using Predecoding
Motivation:
Although the centralized dispatch loop has been eliminated in the indirect threaded interpreter, there remains the overhead created by the centralized dispatch table. Looking up an interpreter routine in this table still requires a memory access and a register indirect branch. It would be desirable, for even better efficiency, to eliminate the access to the centralized table.
Observation:
- opcodes consist of multiple parts; faster: one combined opcode (instead of op + ex_op)
- operands are encoded in bit fields; faster: operands extracted and aligned
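A hedged sketch of what such a predecoded intermediate form could look like in C; the 32-bit source encoding, the (major, extended) opcode values, and the field layout are all invented for the example.

```c
#include <stdint.h>

/* Predecoded form: one combined opcode instead of op + ex_op, and the
 * operands already extracted into byte-aligned fields. */
typedef struct {
    uint8_t  op;              /* dense, combined internal opcode        */
    uint8_t  rd, ra, rb;      /* register operands, one byte each       */
    uint32_t spc;             /* source PC of the original instruction  */
} Predecoded;

/* Invented source encoding: 6-bit major opcode | 5-bit rd | 5-bit ra |
 * 5-bit rb | 11-bit extended opcode. */
Predecoded predecode(uint32_t inst, uint32_t spc)
{
    Predecoded d;
    uint32_t major = inst >> 26;
    uint32_t ex_op = inst & 0x7ff;

    /* A real predecoder would map (major, ex_op) to a dense internal
     * opcode with a lookup table; two made-up cases suffice here. */
    if      (major == 31 && ex_op == 266) d.op = 0;    /* "add"   */
    else if (major == 31 && ex_op == 40)  d.op = 1;    /* "sub"   */
    else                                  d.op = 0xff; /* unknown */

    d.rd  = (inst >> 21) & 0x1f;
    d.ra  = (inst >> 16) & 0x1f;
    d.rb  = (inst >> 11) & 0x1f;
    d.spc = spc;
    return d;
}
```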
Properties:
- space for the predecoded data is needed
- faster interpretation
- significant benefit for interpretation of CISC ISAs
TPC & SPC:
- Why still keep the SPC (source PC)?
  - It may still be used somewhere else; for example, guest code could try to read its own machine code through the PC for whatever reason.
- Why TPC + 1 but SPC + 4?
  - The TPC (target PC) indexes the predecode array, and both are in C, so the compiler scales the increment by the entry size. The SPC, on the other hand, is a byte address into the source code and advances by 4 if each source instruction is 4 bytes long (see the sketch below).
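A tiny sketch of the point about the increments, reusing a predecoded-entry struct like the one above (declared again here so the fragment stands alone): because TPC indexes a C array, `tpc + 1` is scaled by the entry size automatically, while SPC is a plain byte address into the source binary.

```c
#include <stdint.h>

typedef struct { uint8_t op, rd, ra, rb; uint32_t spc; } Predecoded;

void advance(const Predecoded *table, uint32_t *tpc, uint32_t *spc)
{
    const Predecoded *d = &table[*tpc];   /* index scaled by sizeof(Predecoded) by the compiler */
    (void)d;                              /* ... emulate d->op here ... */
    *tpc += 1;                            /* next predecoded entry                     */
    *spc += 4;                            /* next source instruction (4-byte encoding) */
}
```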
Interpretation - CISC
**Potential issues with predecoding:**
- much space needed
  - better: space-tuned formats for different instruction types
- detection of instruction borders
  - data could be interleaved with code
  - correct static predecoding is therefore almost impossible
Use a two-step process:
- at the first interpretation: predecode on the fly, filling the predecode table
- on all further executions: use the predecoded data generated by the first run (see the sketch below)
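A hedged sketch of this two-step scheme for a variable-length (CISC-like) source ISA: the first execution of an instruction decodes it and fills a predecode table indexed by the source PC; later executions just reuse the cached entry. The toy variable-length encoding and table layout are assumptions for the example.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;       /* filled in by the first execution        */
    uint8_t  op;          /* normalized opcode                       */
    uint8_t  len;         /* length of the source instruction, bytes */
    uint8_t  operand;     /* extracted operand, if any               */
} Decoded;

#define CODE_SIZE 4096
static Decoded predecode_table[CODE_SIZE];    /* one slot per source byte offset */

/* Toy variable-length encoding: opcodes below 0x80 are one byte long,
 * all others are two bytes (opcode + operand). */
static Decoded decode_one(const uint8_t *p)
{
    Decoded d = { .valid = true, .op = p[0] };
    d.len     = (p[0] < 0x80) ? 1 : 2;
    d.operand = (d.len == 2) ? p[1] : 0;
    return d;
}

/* First run: decode on the fly and cache. All later runs: use the cache,
 * so instruction borders only have to be found along paths actually executed. */
const Decoded *fetch_decoded(const uint8_t *code, uint32_t spc)
{
    Decoded *d = &predecode_table[spc];
    if (!d->valid)
        *d = decode_one(&code[spc]);
    return d;
}
```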
[^1]: James E. Smith, Ravi Nair, *Virtual Machines: Versatile Platforms for Systems and Processes*, Morgan Kaufmann, 2005, ISBN 9781558609105.