Contents

Something about ISA Emulation -- 2

Binary Translation

Work on multiple instructions at once

  • less management overhead (fewer dispatcher round-trips than single-instruction interpretation)

  • use sequences of instructions without control-flow changes (“basic block” = BB)

    • “dynamic BB”: ends at a jump/branch instruction
    • “static BB” (compiler terminology): ends at a jump instruction or before a jump target
  • piecewise translation

    • only when execution is required

    • detection of instruction borders now simple (compare with static translation: what is code?)
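The boundary detection described above can be sketched for a hypothetical toy ISA (the instruction encoding and `decode` helper are illustrative assumptions, not any real emulator's API): decode forward from the current PC until the first control-flow instruction, which by definition ends the dynamic BB.

```python
# Toy ISA sketch: code is a dict pc -> (mnemonic, length_in_bytes).
# All names here are illustrative assumptions.
CONTROL_FLOW = {"jmp", "jcc", "call", "ret"}

def decode(code, pc):
    """Decode one instruction at pc; returns (mnemonic, length)."""
    return code[pc]

def dynamic_bb(code, start_pc):
    """Collect instructions from start_pc up to and including the first
    jump/branch -- the definition of a dynamic basic block."""
    pc, insns = start_pc, []
    while True:
        mnem, length = decode(code, pc)
        insns.append((pc, mnem))
        pc += length
        if mnem in CONTROL_FLOW:   # dynamic BB ends *at* the jump
            return insns, pc

code = {0: ("add", 2), 2: ("mov", 3), 5: ("jcc", 2), 7: ("sub", 2)}
bb, next_pc = dynamic_bb(code, 0)   # BB = add, mov, jcc; next border at 7
```

Because translation happens only when execution reaches `start_pc`, every byte decoded this way is known to be code, which is exactly why the “what is code?” problem of static translation disappears.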

Dynamic Binary Translation

  • Components required
    • code cache (CC) sometimes called translation cache (TC)
      • used to store translated basic blocks (BBs)
      • has to be invalidated when original code
        • is removed (“unmapped”)
        • is modified
    • translation table
      • maps start addresses of BBs: SPC => TPC (source PC => translated PC)
      • looked up for (indirect) jumps
    • original code and data
      • are kept for potential accesses
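The interplay of these components can be sketched as a dispatch loop (a minimal sketch under assumed names: `translate` stands in for the real translator, and each “translated BB” is modeled as a callable returning the next SPC):

```python
# Sketch of the dispatcher around a code cache. Illustrative only:
# a real system stores machine code at a TPC, not Python callables.
code_cache = {}        # translation table: SPC -> translated BB

def translate(spc):
    """Hypothetical translator: the produced BB returns the next SPC."""
    def bb():
        return spc + 1 if spc < 3 else None   # toy successor logic
    return bb

def run(spc):
    trace = []
    while spc is not None:
        if spc not in code_cache:             # miss: piecewise translation,
            code_cache[spc] = translate(spc)  # only when execution requires it
        trace.append(spc)
        spc = code_cache[spc]()               # execute translated BB
    return trace

trace = run(0)
```

Every iteration of the loop pays one table lookup per BB; chaining (below) exists to remove exactly this cost on the hot path.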

Code Cache

  • properties in common with processor caches
    • limited size => replacement strategy (LRU, LFU, FIFO, …)
    • processor caches have
      • tags
      • associativity
      • cache line length
      • coherency protocol
  • differences to processor caches
    • on a cache miss …
      • different translations possible for same BB
      • cache hierarchy?
    • variable sizes of translated code
    • interdependencies among entries (see chaining …)

Code Cache: Invalidation

  • required for self-modifying code (“SMC”)
  • detection
    • use host MMU (Memory Management Unit)
      • guest code is write-protected
      • on a write: invalidate all code on given page
      • fall back: interpretation
      • very slow with frequent modifications
    • check for modification before every BB execution
      • copy of original guest code required
      • slow
      • modification of currently running BB?
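The second detection scheme (compare before every execution) can be sketched as follows; the class and field names are illustrative assumptions. Each cached BB keeps a snapshot of the original guest bytes and is only run if those bytes are unchanged:

```python
# Sketch of per-execution SMC checking. Slow in practice: one memory
# comparison per BB execution, plus the space for the snapshot copy.
class CachedBB:
    def __init__(self, guest_mem, start, length, translated):
        self.start, self.length = start, length
        # copy of original guest code, required for the later comparison
        self.snapshot = bytes(guest_mem[start:start + length])
        self.translated = translated

    def still_valid(self, guest_mem):
        """False once the guest has overwritten any byte of this BB."""
        return bytes(guest_mem[self.start:self.start + self.length]) == self.snapshot

guest = bytearray(b"\x01\x02\x03\x04")
bb = CachedBB(guest, 0, 4, translated=lambda: "run")
ok_before = bb.still_valid(guest)
guest[2] = 0xFF                       # guest modifies its own code
ok_after = bb.still_valid(guest)
```

Note that this check cannot help with the last bullet above: a BB that modifies *itself* passes the check on entry and only the MMU-based scheme (or re-checking at stores) catches the write.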

Code Cache: Eviction handling

  • Cache is full, but translation to be stored

  • LRU

    • evict the translated BB least recently used

    • problems

      • need to maintain order of uses => Overhead (time stamps / linked lists)

      • fragmentation of cache storage

        – possible solution: buddy lists

      • eviction events happen often, so need to be fast

=> LRU is rarely used. What is better?

  • Goal: minimum runtime overhead

  • Solutions:

    • complete flush when full

      • advantages
        • regular retranslation: optimizations adapt to current runtime behavior
        • old/unneeded translations get removed

      • disadvantage
        • frequently used BBs have to be retranslated often

    • flush on detection of execution phase change

    • blocks with FIFO replacement
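The complete-flush policy is attractive precisely because it needs no per-entry bookkeeping. A minimal sketch (capacity counted in entries rather than bytes, purely to keep the illustration small):

```python
# Sketch of "complete flush when full": no LRU timestamps or lists,
# just drop everything on overflow. Illustrative names throughout.
class FlushingCodeCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.table = {}           # SPC -> translated BB
        self.flushes = 0

    def insert(self, spc, translated):
        if len(self.table) >= self.capacity:
            self.table.clear()    # complete flush: constant-time eviction,
            self.flushes += 1     # no fragmentation, no ordering metadata
        self.table[spc] = translated

cc = FlushingCodeCache(capacity=2)
for spc in range(5):
    cc.insert(spc, f"bb@{spc}")   # triggers a flush at spc=2 and spc=4
```

Chained BBs make this even more appealing: after a full flush there are no dangling chain links to repair, whereas evicting a single BB under LRU would require unlinking every predecessor that chains into it.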

Chaining

  • Observation: expensive actions

    • lookup SPC => TPC
    • indirect jump
  • Chaining = Linking translated BBs

  • on known successor

    • last guest instruction is
      • unconditional jump
      • conditional jump (2 cases: follow/pass-through)
    • lookup jump targets at translation time
      • what if successor not translated yet?
  • on unknown successor

    • indirect jump
      • convert into if-then-else chains (profile targets!)
    • return from function
      • use shadow stack (similar to return prediction)
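The known-successor case can be sketched as follows (a simplified model under assumed names: the “patch” is a stored reference instead of rewriting a jump in the code cache, and every BB falls through to SPC + 1). The first execution pays the SPC => TPC lookup; every later one follows the chain directly:

```python
# Sketch of chaining: after the first run, a BB's exit is linked
# straight to its translated successor, bypassing the table lookup.
class TranslatedBB:
    def __init__(self, spc, succ_spc):
        self.spc, self.succ_spc = spc, succ_spc
        self.chained = None            # direct link, filled in on first use

code_cache = {}
lookups = 0                            # count slow-path table lookups

def dispatch(spc):
    bb = code_cache.get(spc)
    if bb is None:                     # translate on demand
        bb = code_cache[spc] = TranslatedBB(spc, spc + 1)
    return bb

def run_bb(bb):
    """Execute bb; resolve its successor, chaining on first use."""
    global lookups
    if bb.chained is None:             # slow path: SPC => TPC lookup
        lookups += 1
        bb.chained = dispatch(bb.succ_spc)
    return bb.chained                  # fast path on every later run

start = dispatch(0)
for _ in range(2):                     # run the chain 0 -> 1 -> 2 twice
    bb = start
    for _ in range(2):
        bb = run_bb(bb)                # only the first pass does lookups
```

Indirect jumps and returns cannot be patched this way, since the successor differs per execution; hence the if-then-else chains and the shadow stack above.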

Superblocks (SB)

  • Motivation
    • reduce number of jumps for given guest code (same as with chaining)
    • larger translation units allow for more optimization possibilities (see next slides)
  • Superblock
    • One entry, multiple “side exits”
      • branches may leave the SB in the middle: an unconditional jump in the original code simply continues the superblock at its target, while a conditional jump creates a side exit at that point
    • Combine sequence of BBs
      • works best if the execution path of the full SB is similar to later executions (it is “hot”) => predict from past
      • also called “trace” (“tracing JITs”) => execution path of BBs
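Superblock formation along a predicted hot path can be sketched on a toy CFG (all names and the CFG encoding are illustrative assumptions): follow the path the profile says is hot, and record a side exit at every conditional jump whose other arm leaves the SB.

```python
# Toy CFG sketch: spc -> (kind, taken_target, fallthrough).
cfg = {
    0: ("cond", 10, 1),    # conditional jump: possible side exit
    1: ("jmp",  2, None),  # unconditional: SB just goes on at the target
    2: ("cond", 20, 3),
    3: ("ret",  None, None),
}

def build_superblock(start, hot_path):
    """hot_path: for each cond jump, whether profiling saw it taken."""
    sb, side_exits, spc = [], [], start
    while spc is not None:
        kind, taken, fall = cfg[spc]
        sb.append(spc)
        if kind == "ret":
            break
        if kind == "cond":
            taken_hot = hot_path.get(spc, False)
            side_exits.append(fall if taken_hot else taken)  # the cold arm
            spc = taken if taken_hot else fall               # follow hot arm
        else:                          # unconditional jump: follow it
            spc = taken
    return sb, side_exits

# Profile says both branches fall through => SB covers 0,1,2,3
sb, exits = build_superblock(0, hot_path={0: False, 2: False})
```

The prediction is exactly the “hot” assumption from the slide: the SB pays off only while later executions keep taking the profiled path; otherwise control leaves early through a side exit.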