Fast Dynamic Binary Translation for the Kernel. Piyus Kedia and Sorav Bansal, SOSP 2013. Presenter: Apoorva Garg.
Slide1
CS 5204
Fast Dynamic Binary Translation for the Kernel
Piyus Kedia and Sorav Bansal (SOSP 2013)
Presenter: Apoorva Garg
Slide2
Outline
Paper Presentation
Background
DBT Architecture
DBT Optimizations
Kernel-level DBT
Kernel-level DBT - DRK
Interrupt/Exception handling
Faster Design
Design subtleties
Implementation
Results
Opinion
Slide3
Background – Binary Translator
Emulation of one instruction set by another through translation of machine code
Static – operates on the executable file
Difficult because of dynamic linking and self-modifying code
Dynamic – translates on the fly
Translation overhead affects runtime
Slide4
Background – Dynamic Binary Translator
Architecturally similar to a JIT compiler
Translates foreign code and executes the generated target code
General Architecture
System call interface might differ
Patching direct jumps, translation cache (code cache)
Unit of translation
Instruction selection, byte order, register mapping
Trivia: Apple used DBT (Rosetta) to ease the transition from PPC to x86
Slide5
DBT Application
Instrumentation - used in many user-space tools like Valgrind etc.
Valgrind runs the application in a sandbox – a synthetic CPU
Inserts its own instructions for debugging and profiling
As new code is executed for the first time, it is handed to a tool like memcheck, which introduces its own instrumentation
Example: Memcheck
Checks all reads/writes to memory, intercepts malloc/free
Reports an error if the program accesses memory it shouldn't, leaks, etc.
Virtualization – use DBT to replace instructions that "pierce the VM" with a VM-safe sequence of instructions
Slide6
Basic DBT Architecture
Translate blocks of native code
Blocks typically begin at the target of control transfer instructions such as jump, call, ret
Reasonable performance by caching the translated code
Control transfer between the code cache and the framework is costly
Slide7
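As a rough illustration of the basic DBT architecture (translate a block, cache the result, fall back to the framework on a miss), the following toy model may help. It is only a sketch: the guest program, the `tx_` prefix standing in for real translation, and the names `GUEST`, `dispatch` and `run` are all invented for this example and do not come from the paper.

```python
# Toy guest program (hypothetical): block start address ->
# (instructions in the block, address of the next block).
GUEST = {
    0x100: (["add", "sub"], 0x200),
    0x200: (["mul"], 0x100),
}

code_cache = {}  # guest block address -> translated block
stats = {"translations": 0, "dispatches": 0}


def translate(pc):
    """Translate one guest block; here we merely tag each instruction."""
    insns, next_pc = GUEST[pc]
    stats["translations"] += 1
    return ([f"tx_{insn}" for insn in insns], next_pc)


def dispatch(pc):
    """Framework entry: return the cached translation, translating on a miss."""
    stats["dispatches"] += 1
    if pc not in code_cache:
        code_cache[pc] = translate(pc)
    return code_cache[pc]


def run(pc, max_blocks):
    """Execute up to max_blocks blocks and return the executed trace."""
    trace = []
    for _ in range(max_blocks):
        body, pc = dispatch(pc)
        trace.extend(body)
    return trace
```

In this model each block is translated once and reused afterwards, but every block boundary still goes through `dispatch`, which is the costly transition that the optimizations on later slides try to avoid.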
Translation – Important
One native instruction can be translated into multiple target instructions
For example, rlwinm (Rotate Left Word Immediate Then AND with Mask) in PowerPC requires 6 instructions on Alpha
Instrumentation code might insert more instructions
This leads to most of the problems, as the native state is not precisely defined in the middle of such translations
Slide8
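To make the rlwinm example concrete, here is a small sketch of what that single PowerPC instruction computes, written as the separate shift, OR and AND steps a target without an equivalent instruction would need. The helper names are invented for this illustration, and the code models only the instruction's semantics, not the actual Alpha sequence.

```python
MASK32 = 0xFFFFFFFF


def rotl32(value, shift):
    """Rotate a 32-bit value left by shift bits."""
    shift %= 32
    return ((value << shift) | (value >> (32 - shift))) & MASK32


def ppc_mask(mb, me):
    """Ones from bit mb through bit me, PowerPC numbering (bit 0 is the MSB)."""
    from_mb = MASK32 >> mb                    # ones from bit mb to bit 31
    up_to_me = (MASK32 << (31 - me)) & MASK32 # ones from bit 0 to bit me
    # When mb > me the mask wraps around the word.
    return (from_mb & up_to_me) if mb <= me else (from_mb | up_to_me)


def rlwinm(rs, sh, mb, me):
    """Rotate rs left by sh, then AND with the mask selected by mb..me."""
    return rotl32(rs, sh) & ppc_mask(mb, me)
```

For instance, `rlwinm(0x12345678, 8, 24, 31)` rotates the top byte into the low byte and masks everything else away, giving `0x12`. An interrupt arriving between the rotate and the mask would observe a state that corresponds to no point in the native program.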
DBT Optimizations
Direct branch chaining
Replace a direct branch with a jump to the translated address
tx-nextpc -> address of the translated nextpc
Slide9
DBT Optimizations
For an indirect branch, nextpc is only determined at runtime
Jumptable – a small hashtable mapping nextpc to tx-nextpc
If (miss) call dispatcher
Slide10
Kernel-level DBT
Requires more mechanism to correctly handle interrupts, exceptions, reentrancy and concurrency issues
Interpose on all kernel execution by replacing kernel entry points with calls to the dispatcher
Slide11
Kernel DBT – Terminology
Guest kernel – the kernel being translated
Code block – a straight-line sequence of instructions that terminates at an unconditional control transfer
Dispatcher
Translation rulebook
Code cache
Slide12
Interrupt handling in DRK
PC value pushed on the hardware stack is either
Code cache address
Dispatcher address
A single guest instruction can be translated into multiple target instructions
Therefore interrupts and exceptions may arrive in the middle of a translation and not at native instruction boundaries
PC value needs to be changed to its native counterpart
Slide13
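One way to picture the mapping from an interrupted code-cache PC back to its native counterpart is a sorted table of translation start addresses. The table contents and the function name below are hypothetical, and the slides do not say how DRK stores this mapping; this is only a sketch of the idea.

```python
import bisect

# Hypothetical map: where each native instruction's translation starts in
# the code cache (sorted), and the native address it was translated from.
TX_STARTS = [0x9000, 0x9003, 0x900B, 0x900D]
NATIVE_PCS = [0x1000, 0x1002, 0x1005, 0x1006]


def native_pc_for(tx_pc):
    """Return the native instruction whose translation contains tx_pc."""
    index = bisect.bisect_right(TX_STARTS, tx_pc) - 1
    if index < 0:
        raise ValueError("address lies below the code cache")
    return NATIVE_PCS[index]
```

Here an interrupt at `0x9007`, in the middle of the multi-instruction translation that starts at `0x9003`, maps back to native address `0x1002`, the start of the native instruction being emulated.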
Synchronous exception in DRK
Precise exception behavior is emulated
If the exception happens in the middle of an instruction's translation
Roll back machine state to the start of the current native instruction
Executed by the dispatcher; part of the translation rulebook
Rollback is expensive
Slide14
Asynchronous Interrupts in DRK
Delivery of the interrupt is delayed until the next native instruction boundary
Patch the translation of the next native instruction with a software-interrupt instruction – again expensive
Slide15
Faster design
Idea – disallow interrupts and exceptions in the dispatcher. Now interrupts can occur in either
Code cache
User mode
Assumption – the guest kernel rarely relies on the PC value pushed on the stack and is indifferent to imprecise exception and interrupt behavior
IDT entries are replaced with the addresses of their translated counterparts during initialization
Identity translation for the iret instruction
Slide16
Fast DBT – Architecture
Slide17
Correctness concerns
An interrupt/exception handler sensitive to unexpected PC values might not behave correctly
It expects native address values, but finds addresses from the translation cache
Rare, and can be handled as special cases
Page fault handler in Linux – solved by adding entries to the module's exception table
Slide18
Design subtleties
Reentrancy and Concurrency
Code cache optimization
Function call/return optimization
Translation switchoff and cache replacement
Slide19
Reentrancy
Interrupts/exceptions occurring while in the dispatcher
Exceptions are disallowed by design. No page faults – kernel code is expected to be mapped
Interrupts are disabled (cli) during dispatcher execution
But a problem still arises at the boundary
Set the interrupt flag again and jump to tx-nextpc-loc
Slide20
Dispatcher exit
tx-nextpc cannot be stored in a register or on the stack
Save/restore tx-nextpc-loc at interrupt entry/exit
Not on the interrupt stack – that would destroy the interrupt frame layout
Identified a redundant location in the stack's interrupt frame structure
Slide21
Concurrency
Leads to data races on scratch space
Mandate that extra space used by a translation rule be allocated on the kernel's thread stack
Example: if EAX and ECX are used for translation, push them to the stack. The stack pointer is restored after interrupt execution.
push %eax
push %ecx
/* emulated code */
pop %ecx
pop %eax
If scratch space is used – an interrupt might clobber it
movl %eax, scratch1
movl %ecx, scratch2
/* emulated code */
movl scratch1, %eax
movl scratch2, %ecx
Slide22
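The difference between the two save/restore sequences for EAX and ECX can be simulated in a few lines. This is a toy model, not kernel code: registers are a dictionary, the interrupt is modelled as a nested call that runs the same translation rule, and the handler's register values (77 and 88) are arbitrary.

```python
def translated_sequence(regs, save, restore, interrupt_fires):
    """Save EAX/ECX, run 'emulated code', optionally take an interrupt, restore."""
    save(regs)
    regs["eax"] = regs["ecx"] = 0  # the emulated code overwrites both registers
    if interrupt_fires:
        # The handler is translated too, so it applies the same save/restore rule.
        handler_regs = {"eax": 77, "ecx": 88}
        translated_sequence(handler_regs, save, restore, interrupt_fires=False)
    restore(regs)
    return regs


def with_global_scratch(interrupt_fires):
    """Rule that spills to shared scratch1/scratch2 locations."""
    scratch = {}

    def save(r):
        scratch["scratch1"], scratch["scratch2"] = r["eax"], r["ecx"]

    def restore(r):
        r["eax"], r["ecx"] = scratch["scratch1"], scratch["scratch2"]

    return translated_sequence({"eax": 1, "ecx": 2}, save, restore, interrupt_fires)


def with_thread_stack(interrupt_fires):
    """Rule that spills to the current thread's stack (push/pop)."""
    stack = []

    def save(r):
        stack.append(r["eax"])
        stack.append(r["ecx"])

    def restore(r):
        r["ecx"] = stack.pop()
        r["eax"] = stack.pop()

    return translated_sequence({"eax": 1, "ecx": 2}, save, restore, interrupt_fires)
```

With shared scratch locations the interrupted code comes back with the handler's values (77 and 88) in its registers, while the stack-based rule restores the original 1 and 2 because each nesting level gets its own slots.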
Function CALL/RET optimization
Even with the jumptable always hitting for indirect branches, a 2-3x slowdown is observed on code with a high percentage of indirect branches
The most common indirect branches are ret instructions
Use identity translation for call/ret
All calls must push code cache addresses
Still a challenge for calls with indirect operands: "call *REG" and "call *MEM"
tx-retaddress needs to be pushed on the stack, but might not be computed yet
Without the optimization – native address is pushed on the stack
Slide23
Contd.
Solution – add an unconditional branch to tx-retaddress, in direct branch chaining style
Slide24
Implementation
Implemented as a loadable kernel module
Exports DBT functionality through switchon() and switchoff() ioctl calls to the user
switchon() replaces the IDT with the translated one
Slide25
Experimental Setup and Benchmark
Used kernel-intensive benchmarks such as programs in lmbench and filebench, and apache
Compared native, default (full optimization) and no-callret (all optimizations except call/return)
Also implemented a prof client which counts the number of instructions executed, number of indirect branches, etc.
Slide26
Results
DRK exhibits 2-3x slowdown on all these calls
Call/ret optimization has a significant effect on results.
Slide27
Conclusion
New design performs significantly better than previous work
Relaxes non-essential transparency requirements
Near-native performance achieved for most of the benchmarks
Slide28
Opinion
Simple and very effective design
Identified further bottlenecks like call/ret optimization
Currently supports only host kernel code
Translation rulebook should have been mentioned briefly
Slide29
Questions?
Slide30
Thank you