Slide1
Simple Substitution Distance and Metamorphic Detection
Simple Substitution Distance
Gayathri Shanmugam
Richard M. Low
Mark Stamp
Slide2
The Idea
Metamorphic malware “mutates” with each infection
Measuring software similarity is a possible means of detection
But, how to measure similarity?
Much relevant previous work
Here, a novel distance measure is considered
Slide3
Simple Substitution Distance
We treat each metamorphic copy as if it is an “encrypted” version of a “base” virus
Where the “cipher” is a simple substitution
Why simple substitution?
Easy to work with, fast algorithm to solve
Why might this work?
Simple substitution “cryptanalysis” tends to yield results that match family statistics
Accounts for modifications to files similar to some common metamorphic techniques
Slide4
Motivation
Given a simple substitution ciphertext where plaintext is English…
If we cryptanalyze using English language statistics, we expect a good score
If we cryptanalyze using, say, French language statistics, we expect a not-so-good score
We can obtain opcode statistics for a metamorphic family
Using simple substitution cryptanalysis, a virus of the same family should score well…
…but, a benign exe should not score as well
Assuming statistics of these families differ
Slide5
Metamorphic Techniques
Many possible morphing strategies
Here, briefly consider
Register swapping
Garbage code insertion
Equivalent substitution
Transposition
Formal grammar mutation
At a high level --- substitution, transposition, insertion, and deletion
Slide6
Register Swap
Register swapping
E.g., replace EBX register with EAX, provided EAX not in use
Very simple and used in some of the first metamorphic malware
Not very effective
Why not?
Slide7
Garbage Insertion
Garbage code insertion
Two cases:
Dead code --- inserted, but not executed
We can simply JMP over dead code
Do-nothing instructions --- executed, but have no effect on program
Like NOP or ADD EAX,0
Relatively easy to implement
Effective at breaking signature detection
Slide8
Code Substitution
Equivalent instruction substitution
For example, can replace SUB EAX,EAX with XOR EAX,EAX
Does not need to be 1 for 1 substitution
That is, can include insertion/deletion
Unlimited number of substitutions
Very effective
Somewhat difficult to implement
Slide9
Transposition
Transposition
Reorder instructions that have no dependency
For example, the sequence MOV R1,R2 then ADD R3,R4 can be reordered as ADD R3,R4 then MOV R1,R2
Can be highly effective
But, can be difficult to implement
Sometimes applied only to subroutines
Slide10
Formal Grammar Mutation
Formal grammar mutation
View morphing engine as non-deterministic automaton
Allow transitions between any symbols
Apply formal grammar rules
Obtain many variants, high variation
Really just a formalization of other approaches, not a separate technique
Slide11
Previous Work
Easy to prove that “good” metamorphic code is immune to signature detection
Why?
But, many successes detecting hacker-produced metamorphic malware…
HMM/PHMM/machine learning
Graph-based techniques
Statistics (chi-squared, naïve Bayes)
Structural entropy
Linear algebraic techniques
Slide12
This Research
Measure similarity using “simple substitution distance”
We “decrypt” suspect file using statistics from a metamorphic family
If decryption is good, we classify it as a member of the same metamorphic family
If decryption is poor, we classify it as NOT a member of the given metamorphic family
Slide13
Simple Substitution Cipher
Simple substitution is one of the oldest and simplest means of encryption
A fixed key used to substitute letters
For example, Caesar’s cipher, substitute letter 3 positions ahead in alphabet
In general, any permutation can be key
Simple substitution cryptanalysis?
Statistical analysis of ciphertext
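A minimal sketch of the cipher described on this slide, assuming a 26-letter uppercase alphabet. The helper names `substitute` and `invert` are illustrative and do not come from the slides.

```python
import string

ALPHABET = string.ascii_uppercase

def substitute(text, key):
    """Map the i-th alphabet letter to key[i]; other characters pass through."""
    table = str.maketrans(ALPHABET, key)
    return text.upper().translate(table)

def invert(key):
    """Return the decryption key that undoes the given encryption key."""
    inverse = [""] * len(ALPHABET)
    for plain, cipher in zip(ALPHABET, key):
        inverse[ALPHABET.index(cipher)] = plain
    return "".join(inverse)

# Caesar's cipher is the special case where the key is the alphabet shifted by 3.
caesar_key = ALPHABET[3:] + ALPHABET[:3]
ciphertext = substitute("ATTACK AT DAWN", caesar_key)
plaintext = substitute(ciphertext, invert(caesar_key))
```

Any permutation of `ALPHABET` can be passed as `key`, which is the general case the slide mentions.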
Slide14
Simple Substitution Cryptanalysis
Suppose you observe the
ciphertext PBFPVYFBQXZTYFPBFEQJHDXXQVAPTPQJKTOYQWIPBVWLXTOXBTFXQWAXBVCXQWAXFQJVWLEQNTOZQGGQLFXQWAKVWLXQWAEBIPBFXFQVXGTVJVWLBTPQWAEBFPBFHCVLXBQUFEVWLXGDPEQVPQGVPPBFTIXPFHXZHVFAGFOTHFEFBQUFTDHZBQPOTHXTYFTODXQHFTDPTOGHFQPBQWAQJJTODXQHFOQPWTBDHHIXQVAPBFZQHCFWPFHPBFIPBQWKFABVYYDZBOTHPBQPQJTQOTOGHFQAPBFEQJHDXXQVAVXEBQPEFZBVFOJIWFFACFCCFHQWAUVWFLQHGFXVAFXQHFUFHILTTAVWAFFAWTEVOITDHFHFQAITIXPFHXAFQHEFZQWGFLVWPTOFFA
Analyze frequency counts…
Likely that ciphertext “F” represents “E”
And so on, at least for common letters
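The frequency-count step can be sketched as follows. The ordering in `ENGLISH_ORDER` is an approximate, commonly quoted ranking of English letters and is an assumption here, as are the function names.

```python
from collections import Counter

# Approximate ranking of English letters, most frequent first (an assumption).
ENGLISH_ORDER = "ETAOINSRHLDCUMFPGWYBVKXJQZ"

def letter_frequencies(ciphertext):
    """Return (letter, count) pairs sorted from most to least frequent."""
    counts = Counter(ch for ch in ciphertext.upper() if ch.isalpha())
    return counts.most_common()

def initial_guess(ciphertext):
    """Map the most frequent ciphertext letter to 'E', the next to 'T', and so on."""
    ranked = [letter for letter, _ in letter_frequencies(ciphertext)]
    return dict(zip(ranked, ENGLISH_ORDER))
```

A guess built this way is usually right only for the most common letters, which is why the following slides refine it by swapping key elements.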
Slide15
Simple Substitution Cryptanalysis
Can even automate attack
1. Make initial guess for key using frequency counts
2. Compute oldScore
3. Modify key by swapping adjacent elements
4. Compute newScore
5. If newScore > oldScore then oldScore = newScore
6. Else unswap elements
7. Goto 3
How to compute score?
Number of dictionary words in putative plaintext?
Much better to use English digraph statistics
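The loop above can be sketched as a simple hill climb. This is an illustration under stated assumptions: the scoring function is passed in by the caller, higher scores are treated as better (matching the `newScore > oldScore` test), and the `max_passes` cap is added here only to guarantee termination.

```python
import string

ALPHABET = string.ascii_uppercase

def hill_climb(ciphertext, key, score, max_passes=100):
    """Swap adjacent key elements and keep a swap only if the score improves.

    key[i] is the putative plaintext letter for the i-th alphabet letter.
    score(plaintext) returns a number where higher is better.
    """
    def decrypt(k):
        return ciphertext.translate(str.maketrans(ALPHABET, "".join(k)))

    key = list(key)
    old_score = score(decrypt(key))
    for _ in range(max_passes):
        improved = False
        for i in range(len(key) - 1):
            key[i], key[i + 1] = key[i + 1], key[i]      # swap adjacent elements
            new_score = score(decrypt(key))              # full decryption per trial
            if new_score > old_score:
                old_score = new_score
                improved = True
            else:
                key[i], key[i + 1] = key[i + 1], key[i]  # unswap
        if not improved:
            break
    return key, old_score
```

Note that every trial swap decrypts the whole ciphertext again, so the cost of each step grows with the message length.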
Slide16
Jakobsen’s Algorithm
Method on previous slide can be slow
Why?
Jakobsen’s algorithm uses similar idea, but fast and efficient
Ciphertext is only decrypted once
So algorithm is (essentially) independent of length of message
Then, only matrix manipulations required
Slide17
Jakobsen’s Algorithm: Swapping
Assume plaintext is English, 26 letters
Let $K = k_1, k_2, k_3, \ldots, k_{26}$ be putative key
And let “|” represent “swap”
Then we swap elements as follows
Also, we restart this swapping schedule from the beginning whenever score improves
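The swap schedule itself is not reproduced in this transcript. The sketch below assumes the schedule usually given for this algorithm: first all adjacent pairs, then all pairs two apart, and so on up to the single pair at distance 25. The function name `swap_schedule` is illustrative.

```python
def swap_schedule(n=26):
    """Yield index pairs (i, j): all pairs at distance 1, then distance 2, and so on."""
    for distance in range(1, n):
        for i in range(n - distance):
            yield i, i + distance

pairs = list(swap_schedule(26))
```

Under this assumption one full pass tries every unordered pair of key positions exactly once, n(n-1)/2 pairs in total, and the caller restarts the generator whenever a swap improves the score.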
Slide18
Jakobsen’s Algorithm: Swapping
Minimum swaps is 26 choose 2, or 325
Maximum is unbounded
Each swap requires a score computation
Average number of swaps? Experimentally
Ciphertext of length 500, average 1050 swaps
Ciphertext of length 8000, average just 630 swaps
So, work depends on length of ciphertext
More ciphertext, better scores, fewer swaps
Slide19
Jakobsen’s Algorithm: Scoring
Let $D = \{d_{ij}\}$ be digraph distribution corresponding to putative key $K$
Let $E = \{e_{ij}\}$ be digraph distribution of English language
These matrices are 26 x 26
Compute score as
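The score formula is not reproduced in this transcript. The sketch below assumes the sum of absolute differences between corresponding entries of D and E, the form usually associated with this algorithm. Under that assumption a lower score means a closer match. All function names are illustrative.

```python
def digraph_matrix(text, alphabet):
    """Entry [i][j] counts how often alphabet[i] is immediately followed by alphabet[j]."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    n = len(alphabet)
    matrix = [[0] * n for _ in range(n)]
    for first, second in zip(text, text[1:]):
        if first in index and second in index:
            matrix[index[first]][index[second]] += 1
    return matrix

def normalize(matrix):
    """Scale counts to relative frequencies so texts of different lengths compare fairly."""
    total = sum(map(sum, matrix))
    return [[x / total for x in row] for row in matrix] if total else matrix

def score(D, E):
    """Assumed score: sum over all i, j of |d_ij - e_ij| (lower is better)."""
    return sum(abs(d - e) for row_d, row_e in zip(D, E) for d, e in zip(row_d, row_e))
```

With a distance-style score like this, the acceptance test in the swapping loop becomes "keep the swap if the new score is lower".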
Slide20
Jakobsen’s Algorithm
So far, nothing fancy here
Could see all of this in a CS 265 assignment
Jakobsen’s trick: Determine new D matrix from old D without decrypting
How to do so?
It turns out that swapping elements of K swaps corresponding rows and columns of D
See example on next slides…
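The row-and-column claim can be checked directly on a small case. The sketch uses the 10-letter alphabet and the first line of ciphertext from the swapping example. The key is hypothetical and chosen only for illustration, and the convention that `key[i]` is the ciphertext letter decrypting to `ALPHABET[i]` is an assumption.

```python
ALPHABET = "ETAOINSRHD"  # 10-letter alphabet from the swapping example

def decrypt(ciphertext, key):
    """key[i] is the ciphertext letter that decrypts to ALPHABET[i]."""
    table = {cipher: plain for plain, cipher in zip(ALPHABET, key)}
    return "".join(table[c] for c in ciphertext)

def digraph_counts(text):
    """Entry [i][j] counts how often ALPHABET[i] is followed by ALPHABET[j]."""
    index = {s: i for i, s in enumerate(ALPHABET)}
    D = [[0] * len(ALPHABET) for _ in ALPHABET]
    for a, b in zip(text, text[1:]):
        D[index[a]][index[b]] += 1
    return D

def swap_rows_and_columns(D, a, b):
    """Return a copy of D with rows a, b exchanged and columns a, b exchanged."""
    order = list(range(len(D)))
    order[a], order[b] = order[b], order[a]
    return [[D[i][j] for j in order] for i in order]

ciphertext = "TNDEODRHISOADDRTEDOAHENSINEOAR"
key = list("DTENRAOSIH")  # hypothetical key, for illustration only
D_old = digraph_counts(decrypt(ciphertext, key))

key[0], key[1] = key[1], key[0]                    # swap first two key elements
D_new = digraph_counts(decrypt(ciphertext, key))   # slow way: decrypt again
D_fast = swap_rows_and_columns(D_old, 0, 1)        # fast way: no decryption
```

Here `D_new` and `D_fast` agree, so each trial swap costs only a rearrangement of a small matrix rather than another pass over the ciphertext.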
Slide21
Swapping Example
To simplify, suppose 10 letter alphabet
E, T, A, O, I, N, S, R, H, D
Suppose you are given the
ciphertext TNDEODRHISOADDRTEDOAHENSINEOAR
DTTDTINDDRNEDNTTTDDISRETEEEEEAA
Frequency counts given by
21
Simple Substitution DistanceSlide22
Swapping Example
We choose the putative key K given here
The corresponding putative plaintext is
AOETRENDSHRIEENATE
RIDTOHSOTRINEAAEAS
OEENOTEOAAAEESHNA
TTTTTII
Corresponding digraph distribution D is
Slide23
Swapping Example
Suppose we swap first 2 elements of K
Then decrypt using new K
And compute digraph matrix for new K
Previous key K
New key K
Slide24
Swapping Example
Old D matrix vs new D matrix
What do you notice?
So what’s the point here?
This is good!
Slide25
Jakobsen’s Algorithm
Slide26
Proposed Similarity Score
Extract opcode sequences from collection of viruses
All viruses from same metamorphic family
Determine n most common opcodes
Symbol n+1 used for all “other” opcodes
Use resulting digraph statistics to form matrix $E = \{e_{ij}\}$
Note that matrix is (n+1) x (n+1)
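A minimal sketch of building the family matrix E, assuming each virus has already been disassembled into a list of opcode mnemonics. The names `top_opcodes`, `family_digraph_matrix`, and the `OTHER` placeholder are illustrative.

```python
from collections import Counter

OTHER = "OTHER"  # stand-in for symbol n+1, covering all remaining opcodes

def top_opcodes(sequences, n):
    """Return the n most common opcodes across all sequences, most frequent first."""
    counts = Counter(op for seq in sequences for op in seq)
    return [op for op, _ in counts.most_common(n)]

def family_digraph_matrix(sequences, n):
    """Build the (n+1) x (n+1) digraph count matrix E for one family."""
    symbols = top_opcodes(sequences, n) + [OTHER]
    index = {op: i for i, op in enumerate(symbols)}
    other = index[OTHER]
    E = [[0] * len(symbols) for _ in symbols]
    for seq in sequences:
        mapped = [index.get(op, other) for op in seq]
        for a, b in zip(mapped, mapped[1:]):  # digraphs never span two files
            E[a][b] += 1
    return symbols, E
```

The counts can be normalized afterwards if the score is meant to compare files of different lengths.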
Slide27
Scoring a File
Given an executable we want to score
Extract its opcode sequence
Use opcode digraph stats to get $D = \{d_{ij}\}$
This matrix also (n+1) x (n+1)
Initial “key” K chosen to match monograph stats of virus family
Most frequent opcode in exe maps to most frequent opcode in virus family, etc.
Score based on distance between D and E
“Decrypt” D and score how closely it matches E
Jakobsen’s algorithm used for “decryption”
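The choice of initial “key” can be sketched as a rank-matching step. This assumes the exe's opcodes have already been reduced to the family's symbol set. The function name and the dictionary representation of the key are illustrative.

```python
from collections import Counter

def initial_key(exe_symbols, family_symbols_by_frequency):
    """Pair the exe's symbols with the family's symbols by frequency rank.

    exe_symbols: the exe's opcode sequence, reduced to the family's symbol set.
    family_symbols_by_frequency: family symbols, most frequent first.
    Returns a dict mapping each exe symbol to its putative family symbol.
    """
    counts = Counter(exe_symbols)
    # Symbols absent from the exe have count 0 and fall to the end of the ranking.
    ranked = sorted(family_symbols_by_frequency, key=lambda s: -counts[s])
    return dict(zip(ranked, family_symbols_by_frequency))
```

This mapping is only a starting point. The swapping procedure then adjusts it, and the best score reached serves as the distance between the file and the family.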
Slide28
Example
Suppose only 5 common opcodes in family viruses (in descending frequency)
Extract following sequence from an exe
Initial “key” is
And “decrypt” is
Slide29
Example
Given “decrypt”
Form D matrix
After swap…
And so on…
Slide30
Scoring Algorithm
Slide31
Quantifying Success
Consider these 2 scatterplots of scores
Which is better (and why)?
Slide32
ROC Curves
Plot true positive rate vs false positive rate
As “threshold” varies
Curve nearer 45-degree line is bad
Curve nearer upper-left is good
Slide33
ROC Curves
Use ROC curves to quantify success
Area under the ROC curve (AUC)
Probability that randomly chosen positive instance scores higher than a randomly chosen negative instance
AUC of 1.0 implies ideal detection
AUC of 0.5 means classification is no better than flipping a coin
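The probabilistic reading of AUC can be computed directly by comparing every positive score with every negative score. This sketch assumes higher scores indicate family members and counts ties as half, a common convention. If the score is a distance, where lower is better, the scores would be negated first.

```python
def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positive_scores) * len(negative_scores))
```

The pairwise loop is quadratic in the number of files, which is fine for test sets of a few hundred samples.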
Slide34
Parameter Selection
Tested the following parameters
Opcode matrix size
Scoring function
Normalization
Swapping strategy
None significant, except matrix size
So we only give results for matrix size here
Slide35
Opcode Matrix Size
Obtained following results
So, ironically, we use 26 x 26 matrix
Slide36
Test Data
Tested the following metamorphic families
G2 --- known to be weak
NGVCK --- highly metamorphic
MWOR --- highly metamorphic and stealthy
MWOR “padding ratios” of 0.5 to 4.0
For G2 and NGVCK
50 files tested, Cygwin utilities for benign files
For each MWOR padding ratio
100 files tested, Linux utilities for benign files
5-fold cross validation in each experiment
Slide37
NGVCK and G2 Graphs
Slide38
MWOR Score Graphs
Slide39
MWOR ROC Curves
Slide40
MWOR AUC Statistics
Slide41
Efficiency
Slide42
Conclusions
Simple substitution score, good results for challenging metamorphic viruses
Scoring is fast and efficient
Applicable to other types of malware
Requires opcodes
Slide43
References
G. Shanmugam, R.M. Low, and M. Stamp, Simple substitution distance and metamorphic detection, Journal of Computer Virology and Hacking Techniques, 9(3):159-170, 2013