Presentation Transcript

Slide 1

CS 5204
Fast Dynamic Binary Translation for the Kernel
Piyus Kedia and Sorav Bansal (SOSP 2013)
Presenter: Apoorva Garg

Slide 2

Outline
Paper Presentation
Background
DBT Architecture
DBT Optimizations
Kernel-level DBT
Kernel-level DBT - DRK
Interrupt/Exception handling
Faster Design
Design subtleties
Implementation
Results
Opinion

Slide 3

Background – Binary Translator
Emulation of one instruction set by another through translation of machine code
Static – operates on the executable file
Difficult because of dynamic linking and self-modifying code
Dynamic – translates on the fly
Translation overhead affects runtime

Slide 4

Background – Dynamic Binary Translator
Architecturally similar to a JIT compiler
Translates foreign code & executes the generated target code
General architecture concerns:
System call interface might differ
Patching direct jumps, translation cache (code cache)
Unit of translation
Instruction selection, byte order, register mapping
Trivia: Apple used DBT (Rosetta) to ease the transition from PPC to x86

Slide 5

DBT Applications
Instrumentation – used in many user-space tools like Valgrind
Valgrind runs the application in a sandbox – a synthetic CPU
Inserts its own instructions for debugging and profiling
As new code is executed for the first time, it is handed to a tool like memcheck, which introduces its own instrumentation
Example: Memcheck
Checks all reads/writes to memory, intercepts malloc/free (sketched below)
Reports errors: accesses to memory it shouldn't touch, leaks, etc.
Virtualization – use DBT to replace instructions that "pierce the VM" with VM-safe sequences of instructions
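To make the instrumentation idea concrete, here is a minimal sketch in C of the kind of check a memcheck-like tool could ask the translator to insert before every guest load or store. The shadow array, its size, and check_access() are illustrative assumptions, not Valgrind's actual data structures or API:

    #include <stdint.h>
    #include <stdio.h>

    #define SHADOW_SIZE (1u << 20)             /* toy guest address range for this sketch */
    static unsigned char shadow[SHADOW_SIZE];  /* 1 = byte is addressable, 0 = not */

    /* The call the instrumentation inserts in front of each guest load/store. */
    void check_access(uintptr_t addr, size_t size) {
        for (size_t i = 0; i < size; i++) {
            if (addr + i >= SHADOW_SIZE || !shadow[addr + i]) {
                fprintf(stderr, "invalid %zu-byte access at 0x%lx\n",
                        size, (unsigned long)addr);
                return;
            }
        }
    }

    /* Guest block:          mov 0x10(%ebx), %eax
     * Instrumented version: check_access(ebx + 0x10, 4); then the original load */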

Slide 6

Basic DBT Architecture
Translate blocks of native code
Blocks typically begin at the target of control-transfer instructions like jump, call, ret
Reasonable performance obtained by caching the translated code
Control transfer between the code cache and the framework is costly (a minimal dispatch loop is sketched below)
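A minimal sketch of this basic architecture, assuming hypothetical cache_lookup() and translate_block() helpers (the names are illustrative, not the paper's):

    #include <stdint.h>

    typedef uintptr_t guest_pc_t;
    typedef guest_pc_t (*tx_block_fn)(void);   /* a translated block in the code cache */

    /* Hypothetical framework helpers. */
    tx_block_fn cache_lookup(guest_pc_t pc);      /* 0 if pc not yet translated */
    tx_block_fn translate_block(guest_pc_t pc);   /* translate one block, add to cache */

    /* The dispatcher loop.  Every return from the code cache back into this
     * loop is the costly control transfer mentioned above. */
    void dispatch(guest_pc_t pc) {
        for (;;) {
            tx_block_fn tx = cache_lookup(pc);
            if (!tx)
                tx = translate_block(pc);   /* translate on first execution */
            pc = tx();                      /* run the block; it returns the next guest PC */
        }
    }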

Slide 7

Translation – Important Points
One native instruction can be translated into multiple target instructions
For example, rlwinm (Rotate Left Word Immediate Then AND with Mask) in PowerPC requires 6 instructions on Alpha
Instrumentation code might insert even more instructions
This leads to most of the problems – the native state is not precisely defined in the middle of such translations

Slide 8

DBT Optimizations
Direct branch chaining
Replace a direct branch with a jump to the translated address
tx-nextpc -> address for the translated nextpc (see the sketch below)
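A sketch of one way the chaining could be done on x86, assuming each translated block ends with a 5-byte jmp back to the dispatcher that gets patched once tx-nextpc is known (the helper is illustrative, not the paper's code):

    #include <stdint.h>
    #include <string.h>

    /* Patch the 5-byte "jmp rel32" at the end of a translated block so that,
     * instead of exiting to the dispatcher, it jumps straight to tx-nextpc. */
    void chain_direct_branch(uint8_t *jmp_insn, const uint8_t *tx_nextpc) {
        int32_t rel = (int32_t)(tx_nextpc - (jmp_insn + 5)); /* rel32 counts from the next insn */
        jmp_insn[0] = 0xE9;                                  /* opcode: jmp rel32 */
        memcpy(jmp_insn + 1, &rel, sizeof rel);
        /* A real DBT would also worry about concurrent execution and icache coherence. */
    }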

Slide 9

DBT Optimizations
For an indirect branch, nextpc is only determined at runtime
Jumptable – a small hashtable mapping nextpc to tx-nextpc
If (miss), call the dispatcher (see the sketch below)
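An illustrative C sketch of the fast path, assuming a small direct-mapped hashtable; the table size, hash, and dispatcher() signature are assumptions, not the paper's definitions:

    #include <stdint.h>

    #define JTABLE_SIZE 1024   /* small, power of two */

    struct jentry {
        uintptr_t nextpc;      /* guest target of the indirect branch */
        void     *tx_nextpc;   /* its translation in the code cache   */
    };

    static struct jentry jumptable[JTABLE_SIZE];

    void *dispatcher(uintptr_t nextpc);   /* slow path: translate, then fill the table */

    /* Executed on every indirect branch: one hash, one compare.
     * On a miss, fall back to the dispatcher. */
    void *indirect_branch_lookup(uintptr_t nextpc) {
        struct jentry *e = &jumptable[(nextpc >> 2) & (JTABLE_SIZE - 1)];
        if (e->nextpc == nextpc && e->tx_nextpc)
            return e->tx_nextpc;          /* hit: continue inside the code cache */
        return dispatcher(nextpc);        /* miss: costly exit to the framework */
    }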

Slide 10

Kernel-level DBT
Requires more mechanism to correctly handle interrupts, exceptions, reentrancy, and concurrency
Interpose on all kernel execution by replacing kernel entry points with calls to the dispatcher

Slide 11

Kernel DBT – Terminology
Guest kernel – the kernel being translated
Code block – a straight-line sequence of instructions which terminates at an unconditional control branch
Dispatcher
Translation rulebook
Code cache

Slide 12

Interrupt handling in DRK
The PC value pushed on the hardware stack is either
a code cache address, or
a dispatcher address
A single guest instruction can be translated into multiple target instructions
Therefore interrupts & exceptions may happen in the middle of a translation, not at native instruction boundaries
The PC value needs to be changed to its native counterpart (see the sketch below)
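To illustrate what changing the PC to its native counterpart involves, here is a rough sketch, assuming the translator keeps a per-block record of code-cache and native start addresses (this metadata layout is an assumption; DRK's actual bookkeeping is finer-grained):

    #include <stddef.h>
    #include <stdint.h>

    /* Per-block metadata kept by the translator. */
    struct block_map {
        uintptr_t tx_start, tx_end;   /* translation's range in the code cache */
        uintptr_t native_start;       /* guest PC the block was translated from */
    };

    extern struct block_map blocks[];
    extern size_t nblocks;

    /* If the interrupted PC points into the code cache, map it back to a
     * native PC before the guest handler sees it.  Recovering the exact
     * native instruction (not just the block) needs finer metadata. */
    uintptr_t cache_pc_to_native(uintptr_t cache_pc) {
        for (size_t i = 0; i < nblocks; i++)
            if (cache_pc >= blocks[i].tx_start && cache_pc < blocks[i].tx_end)
                return blocks[i].native_start;
        return cache_pc;   /* not a code-cache address: already native */
    }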

Slide 13

Synchronous exceptions in DRK
Precise exception behavior is emulated
If the exception occurs in the middle of an instruction's translation:
Roll back the machine state to the start of the current native instruction
Executed by the dispatcher, as part of the translation rulebook
Rollback is expensive

Slide 14

Asynchronous interrupts in DRK
Delivery of the interrupt is delayed until the next native instruction boundary
The translation of the next native instruction is patched with a software-interrupt instruction – again expensive

Slide 15

Faster design
Idea – disallow interrupts and exceptions in the dispatcher. Now interrupts can occur only in either
the code cache, or
user mode
Assumption – the guest kernel rarely relies on the PC value pushed on the stack and is indifferent to imprecise exception and interrupt behavior
IDT entries are replaced with the addresses of their translated counterparts during initialization (see the sketch below)
Identity translation for the iret instruction
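A sketch of the initialization step, assuming a heavily simplified gate structure and a hypothetical translate_block() helper (a real x86 IDT packs the handler address into split offset fields and must be reloaded with lidt):

    #include <stdint.h>

    #define IDT_ENTRIES 256

    /* Simplified interrupt gate: only the handler address is modeled here. */
    struct idt_gate {
        uintptr_t handler;
    };

    extern struct idt_gate idt[IDT_ENTRIES];

    void *translate_block(uintptr_t native_pc);   /* hypothetical: returns tx address */

    /* At startup, point every vector at the translated handler so that all
     * kernel entries execute translated code from the very first instruction. */
    void patch_idt_with_translations(void) {
        for (int v = 0; v < IDT_ENTRIES; v++)
            idt[v].handler = (uintptr_t)translate_block(idt[v].handler);
    }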

Slide 16

Fast DBT – Architecture

Slide 17

Correctness concerns
An INT/exception handler sensitive to unexpected PC values might not behave correctly
It expects native addresses but finds addresses from the translation cache
Rare, and can be handled as special cases
Page fault handler in Linux – solved by adding entries to the module's exception table

Slide 18

Design subtleties
Reentrancy and concurrency
Code cache optimization
Function call/return optimization
Translation switchoff and cache replacement

Slide 19

Reentrancy
What if an INT/exception occurs while in the dispatcher?
Exceptions are disallowed by design. No page faults – kernel code is expected to be mapped
Interrupts are disabled (cli) during dispatcher execution
But a problem still remains at the boundary (dispatcher exit)
Set the interrupt flag again and jump to tx-nextpc-loc

Slide 20

Dispatcher exit
tx-nextpc cannot be stored in a register or on the stack
Save/restore tx-nextpc-loc at INT entry/exit
Not on the INT stack – that would destroy the interrupt frame layout
A redundant location in the stack's interrupt frame structure was identified and used instead

Slide 21

Concurrency
Leads to data races on scratch space
Mandate that extra space used by a translation rule be allocated on the kernel's thread stack
Example: if EAX and ECX are used by the translation, push them to the stack; the stack pointer is restored after INT execution.

push %eax
push %ecx
/* emulated code */
pop %ecx
pop %eax

If shared scratch space were used instead, an interrupt might clobber it:

movl %eax, scratch1
movl %ecx, scratch2
/* emulated code */
movl scratch1, %eax
movl scratch2, %ecx

Slide 22

Function CALL/RET optimization
Even with the jumptable always hitting for indirect branches,
a 2-3x slowdown is observed on code with a high percentage of indirect branches
The most common indirect branches are ret instructions
Use identity translation for call/ret
All calls must then push code cache addresses (see the sketch below)
Still a challenge for calls with indirect operands: "call *REG" and "call *MEM"
tx-retaddress needs to be pushed on the stack, but might not be computed yet
Without the optimization, the native address is pushed on the stack
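An illustrative sketch of the translation rule for a direct call under this optimization, with hypothetical emit_* helpers. It translates the return block eagerly for simplicity; as the next slide notes, tx-retaddress may not be computed yet, which is what the chaining-style fix addresses:

    #include <stdint.h>

    /* Hypothetical code-emission helpers of the translator. */
    void  emit_push_imm(uintptr_t value);        /* emit: push $imm   */
    void  emit_jmp(uintptr_t target);            /* emit: jmp target  */
    void *translate_block(uintptr_t native_pc);  /* tx address of a block */

    /* For "call target": push the *code cache* return address instead of the
     * native one, then jump to the translated callee.  A later identity-
     * translated "ret" pops that address and lands straight back in the
     * code cache, with no jumptable lookup and no dispatcher exit. */
    void translate_direct_call(uintptr_t target, uintptr_t native_return_pc) {
        uintptr_t tx_retaddress = (uintptr_t)translate_block(native_return_pc);
        uintptr_t tx_target     = (uintptr_t)translate_block(target);
        emit_push_imm(tx_retaddress);   /* tx-retaddress goes on the guest stack */
        emit_jmp(tx_target);
    }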

Slide 23

Contd.
Solution – add an unconditional branch to tx-retaddress, in direct branch chaining style

Slide 24

Implementation
Implemented as a loadable kernel module
Exports DBT functionality through switchon() & switchoff() ioctl calls to user space (see the sketch below)
switchon() replaces the IDT with the translated one
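A minimal user-space sketch of how such an interface might be driven; the device node name and ioctl command numbers are placeholders, since the talk does not give the real ones:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Placeholder device node and command numbers for this sketch. */
    #define DBT_DEV           "/dev/dbt"
    #define DBT_IOC_SWITCHON  _IO('d', 1)
    #define DBT_IOC_SWITCHOFF _IO('d', 2)

    int main(void) {
        int fd = open(DBT_DEV, O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        if (ioctl(fd, DBT_IOC_SWITCHON) < 0)    /* kernel now runs under translation */
            perror("switchon");

        /* ... run a kernel-intensive workload here ... */

        if (ioctl(fd, DBT_IOC_SWITCHOFF) < 0)   /* restore native execution */
            perror("switchoff");

        close(fd);
        return 0;
    }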

Slide 25

Experimental Setup and Benchmarks
Used kernel-intensive benchmarks: programs from lmbench and filebench, and apache
Compared native, default (full optimization), and no-callret (all optimizations except call/return)
Also implemented a prof client that counts the number of instructions executed, number of indirect branches, etc.

Slide 26

Results
DRK exhibits a 2-3x slowdown on all these calls
The call/ret optimization has a significant effect on the results

Slide 27

Conclusion
The new design performs significantly better than previous work
It relaxes non-essential transparency requirements
Near-native performance is achieved for most of the benchmarks

Slide 28

Opinion
Simple and very effective design
Identified further bottlenecks, like the call/ret optimization
Currently supports only host kernel code
The translation rulebook should have been mentioned briefly

Slide 29

Questions?

Slide 30

Thank you