Slide1
Estimating Types in Binaries using Predictive Modeling
Omer Katz, Ran El-Yaniv, Eran Yahav
Technion, Israel
21/01/2016
Slide2
You get new binaries every day…
Most software reaches users as binaries
Usually stripped
Check for vulnerabilities in the software
Reverse engineering is used for software security
Very difficult and tedious
Slide3
Reverse Engineering (RE)
Statically understand the control flow of the program
Calls to virtual functions
Statically unknown target
Given a stripped binary, rank the most likely targets for each virtual call
[Figure: a binary containing the indirect call "call eax", with candidate targets f1 to f5 ranked 1 to 5]
Slide4
Virtual function calls and objects
Virtual call applied to an object
Target determined by object’s runtime type
Rank most likely types for each object
Deduce ranking of targets
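The step from a ranking of types to a ranking of targets can be sketched as follows. This is a minimal illustration and not the authors' implementation: the type names, the scores, and the `vtables` mapping (type name to list of function names, indexed by virtual-function slot) are all hypothetical.

```python
def rank_targets(type_scores, vtables, slot):
    """Turn a ranking of candidate types into a ranking of call targets.

    type_scores: dict mapping type name -> score (higher = more likely).
    vtables:     dict mapping type name -> list of function names by slot.
    slot:        index of the virtual function invoked at the call site.
    """
    ranked_types = sorted(type_scores, key=type_scores.get, reverse=True)
    targets = []
    for t in ranked_types:
        table = vtables[t]
        if slot >= len(table):
            continue  # this type has no function in that slot
        f = table[slot]
        if f not in targets:  # several types may share an inherited function
            targets.append(f)
    return targets


# Hypothetical data, for illustration only.
vtables = {
    "type1": ["f1", "f4"],
    "type2": ["f2", "f4"],
    "type3": ["f3", "f5"],
}
type_scores = {"type1": 0.2, "type2": 0.1, "type3": 0.7}
print(rank_targets(type_scores, vtables, slot=0))  # ['f3', 'f1', 'f2']
print(rank_targets(type_scores, vtables, slot=1))  # ['f5', 'f4']
```

Types that share an inherited implementation collapse into one target, so the target list can be shorter than the type list.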
Slide5
Virtual function calls and objects
Virtual call applied to object o
Labeled objects
Known type
Single target
[Figure: a binary with functions f1 to f5; object O has the known type type3, so "call eax" has a single target]
Slide6
Virtual function calls and objects
Unlabeled objects
Unknown type
Many possible targets
[Figure: a binary with functions f1 to f5; object O has an unknown type among type1 to type5, so "call eax" has candidate targets ranked 1 to 5]
Slide7
Our approach
Different types are used differently
Rely on combination of Program Analysis and Machine Learning
Program analysis extracts usage profiles of objects
Machine learning predicts type of object based on usage
[Figure: pipeline. Input: stripped binary. Program analysis: extracting tracelets. Training: tracelets per object with known type are used to train the SLM. Prediction: tracelets per object with unknown type are used to query the SLM. Output: ranked list of targets per call site.]
Slide8
Extracting usages
call edx is a virtual function call
Win 32-bit x86 assembly:
    …
    mov ebx,[ecx]
    mov edx,[ebx+04h]
    call edx
    mov [ecx+0ch],ebx
    …
loop:
    mov [eax+10h],1
    mov ebx,[eax]
    mov edx,[ebx]
    call edx
    mov ebx,[eax+08h]
    cmp ebx,0
    jz loop
Slide9
Extracting usages
call edx is a virtual function call
edx is [[eax]]
[Figure: the Win 32-bit x86 listing from the previous slide next to a memory diagram. Registers: eax, ebx, edx. The object holds the vtable address followed by field 1, field 2, …; the vtable holds vfunc 1, vfunc 2, vfunc 3, …]
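The double dereference "edx is [[eax]]" can be mimicked with a toy memory model. This is only a sketch: every address and field value below is made up, and the layout (vtable address at offset 0 of the object, 4-byte vtable entries) is assumed from the diagram.

```python
# Toy 32-bit memory: address -> 4-byte value. All addresses are made up.
memory = {
    0x1000: 0x2000,  # object + 00h: vtable address
    0x1004: 7,       # object + 04h: field 1
    0x1008: 0,       # object + 08h: field 2
    0x2000: 0x4010,  # vtable + 00h: address of vfunc 1
    0x2004: 0x4020,  # vtable + 04h: address of vfunc 2
    0x2008: 0x4030,  # vtable + 08h: address of vfunc 3
}


def virtual_target(memory, obj_ptr, vtable_offset):
    """Resolve the target of a virtual call on the object at obj_ptr."""
    vtable = memory[obj_ptr]               # like: mov ebx,[eax]
    return memory[vtable + vtable_offset]  # like: mov edx,[ebx+offset]


eax = 0x1000
print(hex(virtual_target(memory, eax, 0x00)))  # 0x4010, i.e. [[eax]]
print(hex(virtual_target(memory, eax, 0x04)))  # 0x4020, i.e. [[eax]+04h]
```

Statically, only the offsets are known; the value stored at the object's offset 0 depends on its runtime type, which is why the call target is unknown.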
Slide10
Usage profiles
High-level usage of objects
Intra-procedural sequences of actions
Tracked actions:
R(x) – field in offset x is read
W(x) – field in offset x is written
C(x) – the x’th virtual function is called
etc…
Slide11
Extracting usages
Object pointers in the code: eax, ecx
Track actions on objects
[Figure: the same Win 32-bit x86 listing, annotated with the action sequence tracked for each of eax and ecx]
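A rough sketch of how such action sequences could be read off the listing is given below. It is not the authors' analysis: it is a single linear scan with no control flow, it assumes that `eax` and `ecx` are already known to be object pointers, it treats the load from offset 0 as a vtable load rather than a field read, it writes offsets in decimal, and it numbers virtual functions from 0 by dividing the vtable offset by 4. All function names are hypothetical.

```python
import re

# Matches "[reg]" or "[reg+NNh]" (hex offset with an "h" suffix).
MEM = re.compile(r"\[(\w+)(?:\+([0-9a-fA-F]+)h)?\]$")


def parse_mem(operand):
    """Return (base register, offset) for a memory operand, else None."""
    m = MEM.match(operand)
    return (m.group(1), int(m.group(2) or "0", 16)) if m else None


def extract_actions(listing, objects):
    """Linear scan that records R/W/C actions per object pointer register."""
    actions = {o: [] for o in sorted(objects)}
    vtable_of = {}  # register -> object whose vtable address it holds
    vfunc_of = {}   # register -> (object, slot) of a loaded vfunc address
    for line in listing.strip().splitlines():
        line = line.strip()
        if not line or line.endswith(":"):
            continue  # skip blank lines and labels
        op, _, rest = line.partition(" ")
        args = [a.strip() for a in rest.split(",")]
        if op == "call":
            if args[0] in vfunc_of:
                obj, slot = vfunc_of[args[0]]
                actions[obj].append(f"C({slot})")
        elif op == "mov":
            dst, src = args
            dst_mem, src_mem = parse_mem(dst), parse_mem(src)
            if dst_mem:
                if dst_mem[0] in objects:
                    actions[dst_mem[0]].append(f"W({dst_mem[1]})")
                continue
            # dst is a register: look up the source before overwriting dst
            src_obj = vtable_of.get(src_mem[0]) if src_mem else None
            vtable_of.pop(dst, None)
            vfunc_of.pop(dst, None)
            if src_mem and src_mem[0] in objects:
                base, off = src_mem
                if off == 0:
                    vtable_of[dst] = base  # vtable load, not counted as R
                else:
                    actions[base].append(f"R({off})")
            elif src_obj is not None:
                vfunc_of[dst] = (src_obj, src_mem[1] // 4)  # 4-byte slots
    return actions


LISTING = """
    mov ebx,[ecx]
    mov edx,[ebx+04h]
    call edx
    mov [ecx+0ch],ebx
loop:
    mov [eax+10h],1
    mov ebx,[eax]
    mov edx,[ebx]
    call edx
    mov ebx,[eax+08h]
    cmp ebx,0
    jz loop
"""
print(extract_actions(LISTING, {"eax", "ecx"}))
# {'eax': ['W(16)', 'C(0)', 'R(8)'], 'ecx': ['C(1)', 'W(12)']}
```

The scan ignores the `jz loop` back edge, so the sequence for `eax` covers one pass through the loop body only.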
Slide12
From usages to tracelets
Split sequences to tracelets
Subsequences of fixed length
Exhaustively repeat loops
[Figure: an action sequence split into tracelets of length 5]
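The splitting step might look like the sketch below. Two assumptions are made that the slide does not confirm: tracelets are taken as overlapping sliding windows, and a loop is handled by unrolling its body a fixed number of times before splitting. The function names and the example sequence are hypothetical.

```python
def split_to_tracelets(actions, k=5):
    """Return every contiguous subsequence of length k (a sliding window).

    A sequence no longer than k yields itself as a single tracelet.
    """
    if len(actions) <= k:
        return [tuple(actions)]
    return [tuple(actions[i:i + k]) for i in range(len(actions) - k + 1)]


def unroll_loop(prefix, body, times):
    """Repeat a loop body so that windows can span the back edge."""
    return list(prefix) + list(body) * times


# Hypothetical action sequence: a loop body of three actions, unrolled twice.
sequence = unroll_loop([], ["W(16)", "C(0)", "R(8)"], times=2)
tracelets = set(split_to_tracelets(sequence, k=5))
for t in sorted(tracelets):
    print(t)
# ('C(0)', 'R(8)', 'W(16)', 'C(0)', 'R(8)')
# ('W(16)', 'C(0)', 'R(8)', 'W(16)', 'C(0)')
```

Storing the result in a `set` matches the idea of representing an object by its set of tracelets.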
Slide13
Obtaining the tracelets
Track objects and actions statically
Use symbolic execution
Represent objects as sets of tracelets
Slide14
Statistical Language Models (SLMs)
Given a language $L$ with alphabet $\Sigma$, an SLM $M$ and a sentence $s$ from $\Sigma^*$, $P_M(s)$ is the probability of $s$ originating from $L$
For example: [Equation: worked example, not recoverable]
Slide15
Variable-order Markov models
n-gram models
Sentences of length n
[Figure: bi-grams of an example sentence]
Variable-order n-gram models
Let the data determine the order
Use high order when possible, lower order otherwise
In our setting:
Alphabet := set of all single actions
Sentences := tracelets
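To make the n-gram idea concrete, here is a minimal fixed-order sketch with maximum-likelihood estimates and no smoothing. It is an illustration only: the training tracelets are hypothetical, and the variable-order models on this slide would fall back to a shorter context instead of returning 0 for an unseen event.

```python
from collections import Counter, defaultdict


def train_ngrams(sentences, n):
    """Count, for each context of up to n-1 symbols, which symbol follows."""
    counts = defaultdict(Counter)
    for s in sentences:
        for i in range(len(s)):
            context = tuple(s[max(0, i - n + 1):i])
            counts[context][s[i]] += 1
    return counts


def sentence_probability(counts, sentence, n):
    """Product of maximum-likelihood estimates P(symbol | previous n-1)."""
    p = 1.0
    for i in range(len(sentence)):
        context = tuple(sentence[max(0, i - n + 1):i])
        seen = counts.get(context)
        if not seen or sentence[i] not in seen:
            return 0.0  # unseen event; real models smooth instead
        p *= seen[sentence[i]] / sum(seen.values())
    return p


# Hypothetical tracelets observed for one type (alphabet = single actions).
training = [("W(16)", "C(0)", "R(8)"), ("W(16)", "C(0)", "C(0)")]
bigrams = train_ngrams(training, n=2)
print(sentence_probability(bigrams, ("W(16)", "C(0)", "R(8)"), n=2))  # 0.5
print(sentence_probability(bigrams, ("R(8)",), n=2))                  # 0.0
```

In the first query, `W(16)` always starts a tracelet and is always followed by `C(0)`, while `C(0)` is followed by `R(8)` in one of two cases, giving 1 × 1 × 0.5.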
Slide16
Training the models
Prediction by Partial Match (PPM)
Originally for compression, useful for prediction
Given sentence “…”
n-grams (n=3):
[Figure: tree of the n-grams of the example sentence, with each edge labeled by its conditional probability]
In practice a small probability is reserved for the smoothing and back-off mechanism
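The reserved probability and the back-off can be sketched with a simplified PPM-style predictor. The escape rule used here is the one commonly called method C (the escape mass in a context is proportional to the number of distinct symbols seen there), without symbol exclusion; whether the talk uses this exact variant is not stated, so treat it as an assumption. The class name and the training string are hypothetical, with each letter standing for one action.

```python
from collections import Counter, defaultdict


class PPMSketch:
    """Simplified PPM-style predictor (method C escapes, no exclusion)."""

    def __init__(self, max_context, alphabet):
        self.max_context = max_context  # n-grams of order n use n-1 here
        self.alphabet = sorted(alphabet)
        self.counts = defaultdict(Counter)  # context tuple -> next symbols

    def train(self, sentence):
        for i, sym in enumerate(sentence):
            for k in range(min(self.max_context, i) + 1):
                self.counts[tuple(sentence[i - k:i])][sym] += 1

    def prob(self, history, sym):
        p_escape = 1.0
        # Try the longest context first, then back off to shorter ones.
        for k in range(min(self.max_context, len(history)), -1, -1):
            ctx = tuple(history[len(history) - k:])
            seen = self.counts.get(ctx)
            if not seen:
                continue  # context never observed: back off at no cost
            total, distinct = sum(seen.values()), len(seen)
            if sym in seen:
                return p_escape * seen[sym] / (total + distinct)
            p_escape *= distinct / (total + distinct)  # pay for the escape
        return p_escape / len(self.alphabet)  # uniform fallback


model = PPMSketch(max_context=2, alphabet=set("abracadabra"))
model.train("abracadabra")
print(model.prob("ab", "r"))  # seen twice after "ab": 2 / (2 + 1)
print(model.prob("ab", "c"))  # two escapes, then the order-0 estimate
```

Because symbol exclusion is omitted, the probabilities over the alphabet sum to less than 1; that keeps the sketch short at the cost of some wasted mass.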
Slide17
Ranking types
Compute probability for object and type pair
Rank types by probability
For object $o$ and type $t$: [Equation: score for the pair $(o, t)$, defined in terms of the set of tracelets for $o$]
Alternative: train models for the object and measure the distance between models, e.g. using Jensen-Shannon divergence
Slide18
Evaluation
Evaluated on 20 open source C++ projects
Compiled with default optimizations and stripped
Ground truth determined manually
Evaluate results by rank of expected target
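Evaluating by the rank of the expected target can be summarized as top-k coverage. The sketch below shows one way to compute it; the call-site data is invented for illustration and does not come from the benchmarks.

```python
def coverage(results, k):
    """Fraction of call sites whose expected target is ranked in the top k.

    results: list of (expected target, ranked list of candidate targets).
    """
    hits = sum(1 for expected, ranked in results if expected in ranked[:k])
    return hits / len(results)


# Hypothetical results for four virtual call sites.
results = [
    ("f3", ["f3", "f1", "f2"]),  # expected target ranked 1st
    ("f1", ["f3", "f1", "f2"]),  # ranked 2nd
    ("f2", ["f3", "f1", "f2"]),  # ranked 3rd
    ("f5", ["f5", "f4"]),        # ranked 1st
]
for k in (1, 2, 3):
    print(k, coverage(results, k))
# 1 0.5
# 2 0.75
# 3 1.0
```

Plotting this fraction against k gives a coverage curve for a benchmark.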
[Figure: a binary containing the indirect call "call eax", with candidate targets f1 to f5 ranked 1 to 5]
Slide19
Coverage graphs
Benchmark name: Smoothing.exe
Top 2 ranked targets => ~92% of calls
[Figure: coverage graph for Smoothing.exe, with the values 2 and 71 marked]
Slide20
SLMs outperform other models
Fixed-order models: n-gram model using KT-estimator for smoothing
Edit-distance metrics: based on Damerau-Levenshtein
SLMs require 9x fewer targets than fixed-order models and 11x fewer than edit-distance metrics
[Figure: comparison graph, with the values 2, 22 and 18 marked]
Slide21
Summary
Statistical solution to an otherwise extremely difficult problem
Drastically reduce number of targets per virtual call
Across all benchmarks, expected target was ranked among top 3 targets for 80% of virtual calls
Power lies in combination of program analysis and machine learning
Program analysis to extract usage profiles
Machine learning to reason and predict objects’ types
Slide22
Questions?
Thank you for listening
The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7) under grant agreement n° 615688 – ERC-COG-PRIME.