Slide1
Simple Substitution Distance and Metamorphic Detection
Gayathri Shanmugam
Richard M. Low
Mark Stamp
Slide2
The Idea
Metamorphic malware “mutates” with each infection
Measuring software similarity is one method of detection
But how do we measure similarity?
Lots of relevant previous work
Here, an unusual and interesting distance measure is considered
Slide3
Simple Substitution Distance
We treat each metamorphic copy as if it is an “encrypted” version of a “base” virus
Where the cipher is a simple substitution
Why simple substitution?
Easy to work with, fast algorithm to solve
Why might this work?
Simple substitution cryptanalysis gives results that match family statistics
Accounts for modifications to files similar to some common metamorphic techniques
Slide4
Motivation
Given a simple substitution ciphertext where the plaintext is English…
If we cryptanalyze using English language statistics, we expect a good score
If we cryptanalyze using, say, French language statistics, we expect a not-so-good score
We can obtain opcode statistics for a metamorphic family
Using simple substitution cryptanalysis, a virus of the same family should score well…
…but a benign exe should not score as well
Assuming statistics of these families differ
Slide5
Metamorphic Techniques
Many possible morphing strategies
Here, we briefly consider
Register swapping
Garbage code insertion
Equivalent substitution
Transposition
Formal grammar mutation
At a high level --- substitution, transposition, insertion, and deletion
Slide6
Register Swap
Register swapping
E.g., replace EBX register with EAX, provided EAX not in use
Very simple and used in some of the first metamorphic malware
Not very effective
Why not?
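One likely reason: a register swap changes operands but leaves the opcodes alone, so anything that looks only at the opcode sequence sees the same file. A minimal sketch, assuming a made-up three-instruction program and hypothetical helpers `swap_register` and `opcodes`:

```python
# Illustrative only: instructions are (opcode, operands) pairs invented for this sketch.
original = [("MOV", "EBX, 5"), ("ADD", "EBX, ECX"), ("PUSH", "EBX")]

def swap_register(program, old, new):
    """Rename register `old` to `new` in every operand string."""
    return [(op, operands.replace(old, new)) for op, operands in program]

def opcodes(program):
    """Keep only the opcode of each instruction."""
    return [op for op, _ in program]

morphed = swap_register(original, "EBX", "EAX")
print(morphed)
print(opcodes(original) == opcodes(morphed))  # opcode sequence is untouched
```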
Slide7
Garbage Insertion
Garbage code insertion
Two cases:
Dead code --- inserted, but not executed
We can simply JMP over dead code
Do-nothing instructions --- executed, but have no effect on program
Like NOP or ADD EAX,0
Relatively easy to implement
Effective at breaking signatures
Changes the opcode statistics
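A small sketch of how do-nothing instructions shift opcode statistics. The opcode list and the helper names `insert_garbage` and `relative_freq` are assumptions made for illustration, not taken from any real morphing engine:

```python
from collections import Counter

base = ["MOV", "ADD", "MOV", "PUSH", "CALL"]  # made-up opcode sequence

def insert_garbage(opcodes, every=2, junk="NOP"):
    """Insert a do-nothing opcode after every `every` real opcodes."""
    out = []
    for i, op in enumerate(opcodes, start=1):
        out.append(op)
        if i % every == 0:
            out.append(junk)
    return out

def relative_freq(opcodes):
    """Fraction of the sequence taken up by each opcode."""
    counts = Counter(opcodes)
    return {op: c / len(opcodes) for op, c in counts.items()}

morphed = insert_garbage(base)
print(relative_freq(base))
print(relative_freq(morphed))  # MOV's share drops once NOPs are mixed in
```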
Slide8
Code Substitution
Equivalent instruction substitution
For example, can replace SUB EAX,EAX with XOR EAX,EAX
Does not need to be 1 for 1 substitution
That is, can also include insertion/deletion
Unlimited number of substitutions
And can be very effective
Somewhat difficult to implement
Slide9
Transposition
Transposition
Reorder instructions that have no dependency
For example, the sequence
MOV R1,R2
ADD R3,R4
can be reordered as
ADD R3,R4
MOV R1,R2
Can be highly effective
But, can be difficult to implement
Sometimes applied only to subroutines
Slide10
Formal Grammar Mutation
Formal grammar mutation
View morphing engine as non-deterministic automaton
Allow transitions between any symbols
Apply formal grammar rules
Obtain many variants, high variation
Really just a formalization of other approaches, not a separate technique
Slide11
Previous Work
Easy to prove that “good” metamorphic code is immune to signature detection
Why?
But, many successes detecting hacker-produced metamorphic malware…
HMM/PHMM/machine learning
Graph-based techniques
Statistics (chi-squared, naïve Bayes)
Structural entropy
Linear algebraic techniques
Slide12
Topic of This Research
Measure similarity using simple substitution distance
We “decrypt” suspect file using statistics from a metamorphic family
If decryption is good, we classify it as a member of the same metamorphic family
If decryption is poor, we classify it as NOT a member of the given family
Slide13
Simple Substitution Cipher
Simple substitution is one of the oldest and simplest means of encryption
A fixed key used to substitute letters
For example, Caesar’s cipher, substitute letter 3 positions ahead in alphabet
In general, any permutation can be key
Simple substitution cryptanalysis?
Statistical analysis of ciphertext
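A minimal sketch of a simple substitution cipher, with the Caesar shift as one particular key. The function names are chosen for this example, and the key is written as the ciphertext alphabet lined up under A..Z:

```python
import string

ALPHABET = string.ascii_uppercase

def caesar_key(shift=3):
    """Alphabet rotated `shift` positions ahead; one special case of a permutation key."""
    return ALPHABET[shift:] + ALPHABET[:shift]

def encrypt(plaintext, key):
    """Replace each plaintext letter with the key letter at the same index."""
    return plaintext.upper().translate(str.maketrans(ALPHABET, key))

def decrypt(ciphertext, key):
    """Invert the substitution by mapping key letters back to the alphabet."""
    return ciphertext.translate(str.maketrans(key, ALPHABET))

key = caesar_key(3)
ciphertext = encrypt("ATTACKATDAWN", key)
print(ciphertext)                 # DWWDFNDWGDZQ
print(decrypt(ciphertext, key))   # ATTACKATDAWN
```

Any other 26-letter permutation can be passed as `key` in place of the Caesar shift.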
Slide14
Simple Substitution Cryptanalysis
Suppose you observe the
ciphertext PBFPVYFBQXZTYFPBFEQJHDXXQVAPTPQJKTOYQWIPBVWLXTOXBTFXQWAXBVCXQWAXFQJVWLEQNTOZQGGQLFXQWAKVWLXQWAEBIPBFXFQVXGTVJVWLBTPQWAEBFPBFHCVLXBQUFEVWLXGDPEQVPQGVPPBFTIXPFHXZHVFAGFOTHFEFBQUFTDHZBQPOTHXTYFTODXQHFTDPTOGHFQPBQWAQJJTODXQHFOQPWTBDHHIXQVAPBFZQHCFWPFHPBFIPBQWKFABVYYDZBOTHPBQPQJTQOTOGHFQAPBFEQJHDXXQVAVXEBQPEFZBVFOJIWFFACFCCFHQWAUVWFLQHGFXVAFXQHFUFHILTTAVWAFFAWTEVOITDHFHFQAITIXPFHXAFQHEFZQWGFLVWPTOFFA
Analyze frequency counts…
Likely that ciphertext “F” represents “E”
And so on, at least for common letters
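The frequency-count guess can be sketched as follows. The ordering in `ENGLISH_ORDER` is one commonly quoted ranking and should be treated as an assumption, since published orderings differ slightly. The short sample string is invented for the example:

```python
from collections import Counter

# Assumed ranking of English letters, most to least frequent.
ENGLISH_ORDER = "ETAOINSRHDLUCMFYWGPBVKXQJZ"

def initial_guess(ciphertext):
    """Map the i-th most frequent ciphertext letter to the i-th most frequent English letter."""
    ranked = [c for c, _ in Counter(ciphertext).most_common()]
    return dict(zip(ranked, ENGLISH_ORDER))

guess = initial_guess("FBFQFPBFFQPF")
print(guess["F"])  # the most frequent ciphertext letter is guessed to be E
```

The guess is usually reliable only for the most common letters, which is why the key is refined afterwards.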
Slide15
Simple Substitution Cryptanalysis
Can automate the cryptanalysis
1. Make initial guess for key using frequency counts
2. Compute oldScore
3. Modify key by swapping adjacent elements
4. Compute newScore
5. If newScore > oldScore, let oldScore = newScore
6. Else unswap key elements
7. Goto 3
How to compute score?
Number of dictionary words in putative plaintext?
Much better to use English digraph statistics
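A rough sketch of this naive hill climb. The caller supplies `score` and `decrypt`, both hypothetical helpers, and higher scores are taken to be better, as in the steps above. Unlike the bare "Goto 3", this version stops after a full pass with no improvement, which is an added assumption:

```python
def hill_climb(ciphertext, key, score, decrypt):
    """Keep an adjacent swap only if it improves the score.

    key: list of key elements; score(text) -> number, higher is better;
    decrypt(ciphertext, key) -> putative plaintext.
    """
    key = list(key)
    old_score = score(decrypt(ciphertext, key))
    improved = True
    while improved:
        improved = False
        for i in range(len(key) - 1):
            key[i], key[i + 1] = key[i + 1], key[i]      # swap adjacent elements
            new_score = score(decrypt(ciphertext, key))  # full decryption on every swap
            if new_score > old_score:
                old_score = new_score
                improved = True
            else:
                key[i], key[i + 1] = key[i + 1], key[i]  # unswap
    return key, old_score
```

Every candidate key costs a full decryption of the ciphertext, so the work grows with message length. A hill climb like this can also stall at a local maximum.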
Slide16
Jakobsen’s Algorithm
Method on previous slide can be slow
Why?
Jakobsen’s algorithm uses similar idea, but fast and efficient
Ciphertext is only decrypted once
So algorithm is (essentially) independent of length of message
Then, only matrix manipulations required
Slide17
Jakobsen’s Algorithm: Swapping
Assume plaintext is English, 26 letters
Let K = k_1, k_2, k_3, …, k_26 be putative key
And let “|” represent “swap”
Then we swap elements as follows
Restart this swapping from the beginning whenever the score improves
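The swap diagram itself is not reproduced in this text. The schedule usually given for Jakobsen's algorithm tries all adjacent pairs first (k_1|k_2, k_2|k_3, …), then pairs two apart, and so on up to k_1|k_26. A sketch under that assumption, using 0-based indices and a hypothetical generator name:

```python
def swap_schedule(n=26):
    """Yield index pairs (i, j): all pairs at distance 1, then distance 2, and so on."""
    for distance in range(1, n):
        for i in range(n - distance):
            yield i, i + distance

pairs = list(swap_schedule(26))
print(len(pairs))             # one full round covers every unordered pair once
print(pairs[:3], pairs[-1])
```

One complete round with no improvement visits each of the 26 choose 2 = 325 pairs exactly once.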
Slide18
Jakobsen’s Algorithm: Swapping
Minimum swaps is 26 choose 2, or 325
Maximum is unbounded
Each swap requires a score computation
Average number of swaps, experimentally:
Ciphertext of length 500, average 1050 swaps
Ciphertext of length 8000, average just 630 swaps
So, work depends on length of ciphertext
More ciphertext, better scores, fewer swaps
Slide19
Jakobsen’s Algorithm: Scoring
Let D = {d_ij} be digraph distribution corresponding to putative key K
Let E = {e_ij} be digraph distribution of English language
These matrices are 26 x 26
Compute score as
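The formula is not reproduced in this text. The form usually given for Jakobsen's algorithm is the sum over all i, j of |d_ij - e_ij|, so a lower score means a closer match. A sketch under that assumption, with function names chosen for this example:

```python
def digraph_matrix(text, alphabet):
    """Entry [i][j] counts how often alphabet[i] is immediately followed by alphabet[j]."""
    index = {s: i for i, s in enumerate(alphabet)}
    m = [[0] * len(alphabet) for _ in alphabet]
    for a, b in zip(text, text[1:]):
        m[index[a]][index[b]] += 1
    return m

def score(D, E):
    """Sum of absolute differences between matching entries; lower is a closer match."""
    return sum(abs(d - e) for d_row, e_row in zip(D, E) for d, e in zip(d_row, e_row))

D = digraph_matrix("ABAB", "AB")
print(D)
print(score(D, [[0, 1], [1, 1]]))
```

In practice D and E would be normalized to the same scale (for example, relative frequencies) before comparing.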
Slide20
Jakobsen’s Algorithm
So far, nothing fancy here
Could see all of this in a CS 265 assignment
Jakobsen’s trick: Determine new D matrix from old D without decrypting
How to do so?
It turns out that swapping elements of K swaps corresponding rows and columns of D
See example on next slides…
Slide21
Swapping Example
To simplify, suppose 10 letter alphabet
E, T, A, O, I, N, S, R, H, D
Suppose you are given the ciphertext
TNDEODRHISOADDRTEDOAHENSINEOAR
DTTDTINDDRNEDNTTTDDISRETEEEEEAA
Frequency counts given by
Slide22
Swapping Example
We choose the putative key K given here
The corresponding putative plaintext is
AOETRENDSHRIEENATE
RIDTOHSOTRINEAAEAS
OEENOTEOAAAEESHNA
TTTTTII
Corresponding digraph distribution D is
Slide23
Swapping Example
Suppose we swap first 2 elements of K
Then decrypt using new K
And compute digraph matrix for new K
Previous key K
New key K
Slide24
Swapping Example
Old D matrix vs new D matrix
What do you notice?
So what’s the point here?
This is good!
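The observation can be checked directly: swapping two key elements and decrypting again gives the same digraph matrix as swapping the matching rows and columns of the old matrix. The sketch below uses the first line of the example ciphertext. The key shown in the deck is not reproduced here, so the key below is an arbitrary permutation chosen only for this sketch:

```python
def decrypt(ciphertext, key, alphabet):
    """key[i] is the ciphertext symbol that stands for alphabet[i]."""
    inverse = {c: alphabet[i] for i, c in enumerate(key)}
    return "".join(inverse[c] for c in ciphertext)

def digraph_matrix(text, alphabet):
    """Entry [i][j] counts how often alphabet[i] is immediately followed by alphabet[j]."""
    index = {s: i for i, s in enumerate(alphabet)}
    m = [[0] * len(alphabet) for _ in alphabet]
    for a, b in zip(text, text[1:]):
        m[index[a]][index[b]] += 1
    return m

def swap_rows_and_columns(D, a, b):
    """Return a copy of D with rows a, b exchanged and columns a, b exchanged."""
    D = [row[:] for row in D]
    D[a], D[b] = D[b], D[a]
    for row in D:
        row[a], row[b] = row[b], row[a]
    return D

alphabet = "ETAOINSRHD"
ciphertext = "TNDEODRHISOADDRTEDOAHENSINEOAR"
key = list("DTENORSIHA")   # arbitrary permutation, not the key from the deck
old_D = digraph_matrix(decrypt(ciphertext, key, alphabet), alphabet)

key[0], key[1] = key[1], key[0]                                        # swap first 2 key elements
slow_D = digraph_matrix(decrypt(ciphertext, key, alphabet), alphabet)  # decrypt again
fast_D = swap_rows_and_columns(old_D, 0, 1)                            # no decryption needed
print(slow_D == fast_D)
```

The fast path touches only the matrix, so its cost does not depend on the ciphertext length.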
Slide25
Jakobsen’s Algorithm
Slide26
Proposed Similarity Score
Extract opcode sequences from collection of (family) viruses
All viruses from same metamorphic family
Determine n most common opcodes
Symbol n+1 used for all “other” opcodes
Use resulting digraph statistics to form matrix E = {e_ij}
Note that matrix is (n+1) x (n+1)
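A sketch of the symbol mapping and digraph counting. The opcode sequences and helper names are made up for illustration. Symbols are numbered from 0, so index n plays the role of symbol n+1 for “other” opcodes:

```python
from collections import Counter

def build_symbol_map(family_sequences, n):
    """Map the n most common opcodes to 0..n-1; every other opcode will share symbol n."""
    counts = Counter(op for seq in family_sequences for op in seq)
    return {op: i for i, (op, _) in enumerate(counts.most_common(n))}

def to_symbols(sequence, symbol_map, n):
    """Translate an opcode sequence into symbols, using n for anything not in the map."""
    return [symbol_map.get(op, n) for op in sequence]

def digraph_counts(symbols, n):
    """(n+1) x (n+1) matrix of adjacent symbol pairs."""
    m = [[0] * (n + 1) for _ in range(n + 1)]
    for a, b in zip(symbols, symbols[1:]):
        m[a][b] += 1
    return m

family = [["MOV", "ADD", "MOV", "PUSH"], ["MOV", "ADD", "CALL"]]  # invented data
n = 2
symbol_map = build_symbol_map(family, n)
symbols = to_symbols(family[0], symbol_map, n)
print(symbol_map, symbols)
print(digraph_counts(symbols, n))
```

To form E, the counts from every family file would be accumulated into one matrix and then normalized.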
Slide27
Scoring a File
Given an executable we want to score…
Extract its opcode sequence
Use opcode digraph stats to get D = {d_ij}
This matrix is also (n+1) x (n+1)
Initial “key” K chosen to match monograph stats of virus family
Most frequent opcode in exe maps to most frequent opcode in virus family, etc.
Score based on distance between D and E
“Decrypt” D and score how closely it matches E
Jakobsen’s algorithm used for “decryption”
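Putting the pieces together, a rough sketch of the scoring loop. It assumes D is already ordered by the initial frequency-matched key, that D and E are on the same scale, and that the distance is the sum of absolute differences. The function names are hypothetical and this is not the authors' implementation:

```python
def distance(D, E):
    """Sum of absolute differences between matching entries (lower is closer)."""
    return sum(abs(d - e) for d_row, e_row in zip(D, E) for d, e in zip(d_row, e_row))

def swap_in_place(D, a, b):
    """Exchange rows a, b and columns a, b of D; mirrors swapping two key elements."""
    D[a], D[b] = D[b], D[a]
    for row in D:
        row[a], row[b] = row[b], row[a]

def substitution_distance(D, E):
    """Hill climb over row/column swaps of D; return the smallest distance to E found."""
    D = [row[:] for row in D]              # work on a copy
    n = len(D)
    best = distance(D, E)
    improved = True
    while improved:
        improved = False
        for gap in range(1, n):            # adjacent pairs first, then wider gaps
            for i in range(n - gap):
                swap_in_place(D, i, i + gap)
                candidate = distance(D, E)
                if candidate < best:
                    best = candidate       # keep the swap
                    improved = True
                    break
                swap_in_place(D, i, i + gap)   # undo the swap
            if improved:
                break                      # restart the schedule from the beginning
    return best

print(substitution_distance([[0, 0], [0, 5]], [[5, 0], [0, 0]]))
```

A low result suggests the file's opcode digraphs can be relabeled to look like the family's. A threshold on this value would then separate family members from other files.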
Slide28
Example
Suppose only 5 common opcodes in family viruses (in descending frequency)
Extract following sequence from an exe
Initial “key” is
And “decrypt” is
Slide29
Example
Given “decrypt”
Form D matrix
After swap
And so on…
Slide30
Scoring Algorithm
Slide31
Quantifying Success
Consider these 2 scatterplots of scores
Which is better (and why)?
Slide32
ROC Curves
Plot true positive rate vs false positive rate
As “threshold” varies
Curve nearer 45-degree line is bad
Curve nearer upper-left is better
Slide33
ROC Curves
Use ROC curves to quantify success
Area under the ROC curve (AUC)
Probability that randomly chosen positive instance scores higher than a randomly chosen negative instance
AUC of 1.0 implies ideal detection
AUC of 0.5 means classification is no better than flipping a coin
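The probabilistic reading of AUC can be computed directly by comparing every positive score with every negative score. A small sketch, with ties counted as half a win (a common convention, assumed here):

```python
def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

print(auc([3, 4], [1, 2]))   # perfectly separated scores
print(auc([1, 2], [1, 2]))   # indistinguishable scores
```

If the score is a distance, where lower means closer to the family, the scores would be negated before calling `auc`.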
Slide34
Parameter Selection
Tested the following parameters
Opcode matrix size
Scoring function
Normalization
Swapping strategy
None significant, except matrix size
So we only give results for matrix size
Slide35
Opcode Matrix Size
Obtained following results
So, ironically, we use 26 x 26 matrix
Slide36
Test Data
Tested the following metamorphic families
G2 --- known to be weak
NGVCK --- highly metamorphic
MWOR --- highly metamorphic and stealthy
MWOR “padding ratios” of 0.5 to 4.0
For G2 and NGVCK
50 files tested, cygwin utilities for benign files
For each MWOR padding ratio
100 files tested, Linux utilities for benign files
5-fold cross validation in each experiment
Slide37
NGVCK and G2 Graphs
Slide38
MWOR Score Graphs
Slide39
MWOR ROC Curves
Slide40
MWOR AUC Statistics
Slide41
Efficiency
Slide42
Conclusions
Simple substitution score, good results for challenging metamorphics
Scoring is fast and efficient
Applicable to other types of malware
Requires opcodes
Slide43
Related Work
Recently, we generalized Jakobsen’s algorithm to “combination” cipher
Simple substitution column transposition (SSCT)
Uses multiple D matrices
One D matrix for each column
Enables easy column manipulations
Overall, fast and effective SSCT attack
Slide44
SSCT
SSCT for malware detection
This might be stronger malware score
Why?
Finding good test data is an issue
Can we find/make data where SSCT outperforms simple substitution score?
Currently studying this problem
Slide45
Homophonic Substitution
Homophonic substitution allows more than one ciphertext symbol for each plaintext symbol
Easy to encrypt, but harder to break than simple substitution --- why?
Previous student developed Jakobsen-like algorithm for homophonic substitution
Uses a nested hill climb approach
This could be tested on malware
Slide46
Zodiac 408
Example of homophonic substitution
Slide47
Zodiac 340
Unsolved
What is it?
Slide48
HMM
A different way to attack simple substitution (and related) ciphers…
Train an HMM (of course!)
Let A be 26 x 26, English digraph stats
Then train, without updating A matrix
Resulting B matrix is the key
Can work for homophonic case too
Any problems with this?
Slide49
HMM with Random Restarts
HMM requires lots of data to converge
Often, we don’t have lots of data
In such cases, try random restarts
HMM should converge with less data if we start closer to the solution
Try enough random restarts, might start close enough to converge
How many random restarts?
Slide50
HMM with Random Restarts
Could be applied to malware detection
However, slow and expensive
More relevant for cryptanalysis
Zodiac 340 cipher, for example
This has previously been analyzed using millions of random restarts
Slide51
References
G. Shanmugam, R.M. Low, and M. Stamp, Simple substitution distance and metamorphic detection, Journal of Computer Virology and Hacking Techniques, 9(3):159-170, 2013
A. Dhavare, R.M. Low, and M. Stamp, Efficient cryptanalysis of homophonic substitution ciphers, Cryptologia, 37(3):250-281, 2013
Slide52
References
T. Berg-Kirkpatrick and D. Klein, Decipherment with a million random restarts, http://www.cs.berkeley.edu/~tberg/papers/emnlp2013.pdf