Estimating Types in Binaries

Uploaded On 2016-09-11


Presentation Transcript

Slide1

Estimating Types in Binaries using Predictive Modeling

Omer Katz, Ran El-Yaniv, Eran Yahav

Technion, Israel

21/01/2016

Slide2

You get new binaries every day…

Most software reaches users as binaries

Usually stripped

Check for vulnerabilities in the software

Reverse engineering is used for software security

Very difficult and tedious

Slide3

Reverse Engineering (RE)

Statically understand control flow of program

Calls to virtual functions

Statically unknown target

Given a stripped binary, rank the most likely targets for each virtual call

[Figure: a "call eax" site in the binary with candidate targets f1 to f5, ranked 1 to 5]

Slide4

Virtual function calls and objects

Virtual call applied to an object

Target determined by object's runtime type

Rank most likely types for each object

Deduce ranking of targets

Slide5

Virtual function calls and objects

Virtual call applied to object o

Labeled objects

Known type

Single target

[Figure: "call eax" applied to object O, labeled with type3, among functions f1 to f5 in the binary]

Slide6

Virtual function calls and objects

Unlabeled objects

Unknown type

Many possible targets

[Figure: "call eax" applied to object O of unknown type; candidate types type1 to type5 and targets f1 to f5, each ranked 1 to 5]

Slide7

Our approach

Different types are used differently

Rely on combination of Program Analysis and Machine Learning

Program analysis extracts usage profiles of objects

Machine learning predicts type of object based on usage

[Figure: pipeline. Input: stripped binary. Program analysis extracts tracelets. Training: tracelets per object with known type are used to train the SLM. Prediction: tracelets per object with unknown type are used to query the SLM. Output: ranked list of targets per call site.]

Slide8

Extracting usages

call edx is a virtual function call

mov ebx,[ecx]
mov edx,[ebx+04h]
call edx
mov [ecx+0ch],ebx
loop:
mov [eax+10h],1
mov ebx,[eax]
mov edx,[ebx]
call edx
mov ebx,[eax+08h]
cmp ebx,0
jz loop

Win 32-bit x86 assembly

Slide9

Extracting usages

call edx is a virtual function call

edx is [[eax]]

mov ebx,[ecx]
mov edx,[ebx+04h]
call edx
mov [ecx+0ch],ebx
…
loop:
mov [eax+10h],1
mov ebx,[eax]
mov edx,[ebx]
call edx
mov ebx,[eax+08h]
cmp ebx,0
jz loop

[Figure: memory layout. The object holds the vtable address followed by field 1 and field 2; the vtable holds vfunc 1, vfunc 2 and vfunc 3. Registers shown: eax, edx, ebx.]

Win 32-bit x86 assembly

Slide10

Usage profiles

High-level usage of objects

Intra-procedural sequences of actions

Tracked actions:

R(x): field at offset x is read

W(x): field at offset x is written

C(x): the x'th virtual function is called

etc.
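The tracked actions can be illustrated with a minimal Python sketch that walks a straight-line list of instructions, written in the same syntax as the assembly on the slides, and emits R/W/C actions for one object register. Everything here is an assumption made for illustration: the function name `extract_actions`, the choice to count the vtable load as `R(0)`, the zero-based numbering of virtual functions, and 4-byte vtable slots on 32-bit x86. It is not the symbolic execution used in the presented work, and it ignores control flow and aliasing.

```python
import re

# Matches a memory operand such as [eax], [ecx+0ch] or [eax+10h].
MEM = re.compile(r"\[(\w+)(?:\+([0-9a-f]+)h)?\]")


def _offset(match):
    """Hexadecimal displacement of a memory operand (0 when absent)."""
    return int(match.group(2) or "0", 16)


def extract_actions(instructions, obj_reg):
    """Return the R/W/C actions applied to the object that obj_reg points to."""
    actions = []
    vtable_regs = set()  # registers currently holding the object's vtable address
    vfunc_index = {}     # register -> index of the virtual function it holds
    for line in instructions:
        op, _, rest = line.partition(" ")
        operands = [o.strip() for o in rest.split(",")]
        if op == "mov":
            dst, src = operands
            m_dst, m_src = MEM.fullmatch(dst), MEM.fullmatch(src)
            if m_dst:  # store to memory
                if m_dst.group(1) == obj_reg:
                    actions.append(f"W({_offset(m_dst)})")
                continue
            base = m_src.group(1) if m_src else None
            from_vtable = base in vtable_regs
            # dst is a register, so whatever it held before is overwritten.
            vtable_regs.discard(dst)
            vfunc_index.pop(dst, None)
            if base == obj_reg:
                actions.append(f"R({_offset(m_src)})")
                if _offset(m_src) == 0:  # assumed: vtable pointer at offset 0
                    vtable_regs.add(dst)
            elif from_vtable:
                vfunc_index[dst] = _offset(m_src) // 4  # assumed 4-byte slots
        elif op == "call" and operands[0] in vfunc_index:
            actions.append(f"C({vfunc_index[operands[0]]})")
    return actions


ecx_code = ["mov ebx,[ecx]", "mov edx,[ebx+04h]", "call edx", "mov [ecx+0ch],ebx"]
eax_code = ["mov [eax+10h],1", "mov ebx,[eax]", "mov edx,[ebx]", "call edx",
            "mov ebx,[eax+08h]", "cmp ebx,0", "jz loop"]
print(extract_actions(ecx_code, "ecx"))
print(extract_actions(eax_code, "eax"))
```

Under these assumptions, a single pass over the loop body yields one action sequence per object pointer; repeating the loop body would extend the sequence for eax.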

Slide11

Extracting usages

mov ebx,[ecx]
mov edx,[ebx+04h]
call edx
mov [ecx+0ch],ebx
…
loop:
mov [eax+10h],1
mov ebx,[eax]
mov edx,[ebx]
call edx
mov ebx,[eax+08h]
cmp ebx,0
jz loop

Object pointers in the code: eax, ecx

Track actions on objects

[Figure: action sequences tracked for eax and for ecx]

Win 32-bit x86 assembly

Slide12

From usages to tracelets

Split sequences to tracelets

Subsequences of fixed length

Exhaustively repeat loops

[Figure: an action sequence split into tracelets of length 5]
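Splitting into fixed-length subsequences can be sketched as follows. The slides do not say whether tracelets overlap or how many times loops are repeated, so this sketch assumes overlapping sliding windows and a caller-chosen unroll count; `split_to_tracelets` and `unroll_loop` are hypothetical helper names.

```python
def unroll_loop(prefix, body, times):
    """Action sequence obtained by running `body` `times` times after `prefix`."""
    return list(prefix) + list(body) * times


def split_to_tracelets(actions, length=5):
    """Return every contiguous window of `length` actions.

    Sequences shorter than `length` are kept whole (an assumption).
    """
    actions = list(actions)
    if len(actions) < length:
        return [tuple(actions)] if actions else []
    return [tuple(actions[i:i + length])
            for i in range(len(actions) - length + 1)]


# Hypothetical loop body: two iterations give 8 actions.
loop_body = ["W(16)", "R(0)", "C(0)", "R(8)"]
sequence = unroll_loop([], loop_body, times=2)
tracelets = split_to_tracelets(sequence, length=5)
for t in tracelets:
    print(t)
```

An object is then represented by the set of tracelets collected over all of its action sequences.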

 

 

 

 

 Slide13

Obtaining the tracelets

Track objects and actions statically

Use symbolic execution

Represent objects as sets of tracelets

Slide14

Statistical Language Models (SLMs)

Given a language L with alphabet Σ, an SLM M and a sentence s from Σ*, P_M(s) is the probability of s originating from L

For example: [example not preserved in the transcript]

Slide15

 

Variable-order Markov models

N-gram models

Sentences of length n

For example, bi-grams (n = 2)

Variable-order n-gram models

Let the data determine the order

Use high order when possible, lower order otherwise

In our setting:

Alphabet := set of all single actions

Sentences := tracelets
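A fixed-order n-gram model over actions can be sketched in a few lines. This is a generic maximum-likelihood model without smoothing, shown with n = 2 (bi-grams); the class name `NGramModel`, the `<s>` start padding and the toy training tracelets are assumptions for illustration, not taken from the talk.

```python
from collections import Counter, defaultdict


class NGramModel:
    """Maximum-likelihood n-gram model over sequences of actions."""

    def __init__(self, n):
        self.n = n
        self.counts = defaultdict(Counter)  # context tuple -> next-action counts

    def _padded(self, sentence):
        # Pad the start so the first actions also have a full context.
        return ["<s>"] * (self.n - 1) + list(sentence)

    def train(self, sentences):
        for sentence in sentences:
            padded = self._padded(sentence)
            for i in range(self.n - 1, len(padded)):
                context = tuple(padded[i - self.n + 1:i])
                self.counts[context][padded[i]] += 1

    def prob(self, sentence):
        """Probability of the whole sentence; 0.0 if any n-gram is unseen."""
        padded = self._padded(sentence)
        p = 1.0
        for i in range(self.n - 1, len(padded)):
            context = tuple(padded[i - self.n + 1:i])
            seen = self.counts.get(context)
            if not seen or padded[i] not in seen:
                return 0.0
            p *= seen[padded[i]] / sum(seen.values())
        return p


# Hypothetical training tracelets for one type.
bigram = NGramModel(n=2)
bigram.train([("R(0)", "C(1)", "W(12)"), ("R(0)", "C(0)", "R(8)")])
print(bigram.prob(("R(0)", "C(1)", "W(12)")))
```

Because unseen n-grams get probability zero here, a practical model needs smoothing or back-off to lower orders, which is what variable-order models provide.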

 Slide16

Training the models

Prediction by Partial Match (PPM)

Originally for compression, useful for prediction

Given a sentence, n-grams (n = 3):

[Figure: n-gram model diagram for the example sentence; probabilities shown include 1, 0.33, 0.67, 0.5, 0.2 and 0.6]

In practice a small probability is reserved for the smoothing and back-off mechanism
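The idea of reserving probability mass for back-off can be sketched with an escape mechanism in the style of PPM method C: in each context, the escape probability is the number of distinct symbols seen divided by the total count plus that number. This is a simplified illustration, not the model used in the talk: it omits symbol exclusion (so probabilities can sum to less than 1), and `BackoffModel` and its parameters are hypothetical names.

```python
from collections import Counter, defaultdict


class BackoffModel:
    """Variable-order model with PPM-C-style escape, without exclusion."""

    def __init__(self, max_order, alphabet):
        self.max_order = max_order
        self.alphabet = sorted(alphabet)
        self.counts = defaultdict(Counter)  # context tuple -> next-symbol counts

    def train(self, sentences):
        for sentence in sentences:
            s = tuple(sentence)
            for i, sym in enumerate(s):
                # Record the symbol under every context length 0..max_order.
                for k in range(min(i, self.max_order) + 1):
                    self.counts[s[i - k:i]][sym] += 1

    def symbol_prob(self, context, sym):
        context = tuple(context)
        keep = min(len(context), self.max_order)
        context = context[len(context) - keep:]
        escape = 1.0
        # Try the longest context first, then back off to shorter ones.
        for k in range(len(context), -1, -1):
            seen = self.counts.get(context[len(context) - k:])
            if not seen:
                continue  # context never observed: back off at no cost
            total, distinct = sum(seen.values()), len(seen)
            if sym in seen:
                return escape * seen[sym] / (total + distinct)
            escape *= distinct / (total + distinct)
        # Order -1: uniform over the alphabet.
        return escape / len(self.alphabet)

    def sentence_prob(self, sentence):
        s = tuple(sentence)
        p = 1.0
        for i, sym in enumerate(s):
            p *= self.symbol_prob(s[:i], sym)
        return p


model = BackoffModel(max_order=2, alphabet={"a", "b", "c", "d"})
model.train([("a", "b"), ("a", "c")])
print(model.symbol_prob(("a",), "b"))  # seen in the order-1 context
print(model.symbol_prob(("a",), "a"))  # needs one back-off step
```

A symbol seen after the current context is predicted at high order; otherwise the model pays the escape cost and retries with a shorter context, matching "use high order when possible, lower order otherwise".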

 

 

 

 

 

 Slide17

Ranking types

Compute probability for object and type pair

Rank types by probability

For object o and type t, where T_o is the set of tracelets for o: [formula not preserved in the transcript]

Alternative:

Train models for object and measure the distance between models

e.g. using Jensen-Shannon divergence
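Both ranking options can be sketched in Python. The scoring formula from the slide is not preserved, so the sketch assumes one plausible choice, the average log-probability of the object's tracelets under each type's model; `rank_types`, the `floor` parameter and the toy models are hypothetical. The Jensen-Shannon divergence itself is the standard definition, here with base-2 logarithms so that it lies in [0, 1].

```python
import math


def rank_types(tracelets, models, floor=1e-12):
    """Rank type names by average log-probability of the object's tracelets.

    `models` maps a type name to a function tracelet -> probability.
    The averaging rule is an assumption, not the talk's formula.
    """
    if not tracelets:
        raise ValueError("object has no tracelets")
    scores = {}
    for type_name, prob in models.items():
        logs = [math.log(max(prob(t), floor)) for t in tracelets]
        scores[type_name] = sum(logs) / len(logs)
    return sorted(scores, key=scores.get, reverse=True)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) of two discrete distributions."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mix(dist):
        return sum(v * math.log2(v / mix[k]) for k, v in dist.items() if v > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


# Hypothetical per-type models given as lookup tables over tracelets.
t1, t2 = ("R(0)", "C(1)"), ("W(12)", "R(8)")
models = {
    "type1": lambda t: {t1: 0.6, t2: 0.3}.get(t, 0.0),
    "type2": lambda t: {t1: 0.1, t2: 0.2}.get(t, 0.0),
}
print(rank_types([t1, t2], models))
print(js_divergence({"x": 1.0}, {"y": 1.0}))
```

With the likelihood option, one model per known type is queried with the unknown object's tracelets; with the distance option, a model is also trained for the object and compared against each type's model.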

Slide18

Evaluation

Evaluated on 20 open source C++ projects

Compiled with default optimizations and stripped

Ground truth determined manually

Evaluate results by rank of expected target

[Figure: a "call eax" site in the binary with candidate targets f1 to f5, ranked 1 to 5]

Slide19

Coverage graphs

Benchmark name: Smoothing.exe

Top 2 ranked targets => ~92% of calls
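A coverage figure such as "top 2 ranked targets cover about 92% of calls" can be computed from the rank of the expected target at each call site. The sketch below shows that computation on made-up ranks; `coverage_at_k` is a hypothetical helper name and the data is not from the benchmark.

```python
def coverage_at_k(expected_ranks, k):
    """Fraction of virtual calls whose expected target is ranked within the top k."""
    if not expected_ranks:
        raise ValueError("no call sites to evaluate")
    return sum(1 for rank in expected_ranks if rank <= k) / len(expected_ranks)


# Made-up ranks of the expected target for five virtual call sites (1 = best).
ranks = [1, 2, 1, 5, 3]
for k in range(1, 6):
    print(k, coverage_at_k(ranks, k))
```

Plotting coverage against k gives a curve of the kind a coverage graph reports; it is non-decreasing and reaches 1.0 once k is at least the worst rank.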

[Figure: coverage graph; values shown: 2, 71]

Slide20

SLMs outperform other models

Fixed-order models: n-gram model using KT-estimator for smoothing

Edit-distance metrics: based on Damerau-Levenshtein

SLMs require 9x fewer targets than fixed-order models and 11x fewer than edit-distance metrics
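For reference, an edit distance of the Damerau-Levenshtein family can be computed over action sequences as below. This is the common "optimal string alignment" variant (insertions, deletions, substitutions and adjacent transpositions, each at cost 1); the slides do not say which variant or costs the baseline used, so treat this as an illustration of the metric only.

```python
def osa_distance(a, b):
    """Optimal-string-alignment distance between two sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


print(osa_distance(["R(0)", "C(1)", "W(12)"], ["C(1)", "R(0)", "W(12)"]))
```

Works on any sequences with comparable elements, so tracelets can be compared action by action rather than character by character.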

[Figure: comparison chart; values shown: 2, 22, 18]

Slide21

Summary

Statistical solution to an otherwise extremely difficult problem

Drastically reduce number of targets per virtual call

Across all benchmarks, expected target was ranked among top 3 targets for 80% of virtual calls

Power lies in combination of program analysis and machine learning

Program analysis to extract usage profiles

Machine learning to reason and predict objects' types

Slide22

Questions?

Thank you for listening

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7) under grant agreement n° 615688 (ERC-COG-PRIME).