Simple Substitution Distance and Metamorphic Detection

Uploaded On 2016-07-22

Presentation Transcript

Slide1

Simple Substitution Distance and Metamorphic Detection

Simple Substitution Distance

Gayathri Shanmugam

Richard M. Low

Mark Stamp

Slide2

The Idea

Metamorphic malware “mutates” with each infection

Measuring software similarity is one method of detection

But how do we measure similarity?

Lots of relevant previous work

Here, an unusual and interesting distance measure is considered

Slide3

Simple Substitution Distance

We treat each metamorphic copy as if it is an “encrypted” version of a “base” virus

Where the cipher is a simple substitution

Why simple substitution?

Easy to work with, fast algorithm to solve

Why might this work?

Simple substitution cryptanalysis gives results that match family statistics

Accounts for modifications to files similar to some common metamorphic techniques

Slide4

Motivation

Given a simple substitution ciphertext where the plaintext is English…

If we cryptanalyze using English language statistics, we expect a good score

If we cryptanalyze using, say, French language statistics, we expect a not-so-good score

We can obtain opcode statistics for a metamorphic family

Using simple substitution cryptanalysis, a virus of the same family should score well…

…but a benign exe should not score as well

Assuming statistics of these families differ

Slide5

Metamorphic Techniques

Many possible morphing strategies

Here, we briefly consider

Register swapping

Garbage code insertion

Equivalent substitution

Transposition

Formal grammar mutation

At a high level: substitution, transposition, insertion, and deletion

Slide6

Register Swap

Register swapping

E.g., replace the EBX register with EAX, provided EAX is not in use

Very simple and used in some of the first metamorphic malware

Not very effective

Why not?

Slide7

Garbage Insertion

Garbage code insertion

Two cases:

Dead code: inserted, but not executed

We can simply JMP over dead code

Do-nothing instructions: executed, but have no effect on the program

Like NOP or ADD EAX,0

Relatively easy to implement

Effective at breaking signatures

Changes the opcode statistics

Slide8

Code Substitution

Equivalent instruction substitution

For example, can replace SUB EAX,EAX with XOR EAX,EAX

Does not need to be 1 for 1 substitution

That is, can also include insertion/deletion

Unlimited number of substitutions

And can be very effective

Somewhat difficult to implement

Slide9

Transposition

Transposition

Reorder instructions that have no dependency

For example, MOV R1,R2 followed by ADD R3,R4 can be reordered as ADD R3,R4 followed by MOV R1,R2

Can be highly effective

But can be difficult to implement

Sometimes applied only to subroutines

Slide10

Formal Grammar Mutation

Formal grammar mutation

View the morphing engine as a non-deterministic automaton

Allow transitions between any symbols

Apply formal grammar rules

Obtain many variants, high variation

Really just a formalization of the other approaches, not a separate technique

Slide11

Previous Work

Easy to prove that “good” metamorphic code is immune to signature detection

Why?

But many successes detecting hacker-produced metamorphic malware…

HMM/PHMM/machine learning

Graph-based techniques

Statistics (chi-squared, naïve Bayes)

Structural entropy

Linear algebraic techniques

Slide12

Topic of This Research

Measure similarity using simple substitution distance

We “decrypt” the suspect file using statistics from a metamorphic family

If decryption is good, we classify it as a member of the same metamorphic family

If decryption is poor, we classify it as NOT a member of the given family

Slide13

Simple Substitution Cipher

Simple substitution is one of the oldest and simplest means of encryption

A fixed key is used to substitute letters

For example, Caesar’s cipher: substitute the letter 3 positions ahead in the alphabet

In general, any permutation can be the key

Simple substitution cryptanalysis?

Statistical analysis of ciphertext
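
As a small illustration of the cipher described above, the following sketch builds a Caesar key (shift of 3) and applies it as a general substitution. The function names and the sample message are illustrative choices, not taken from the slides.

```python
import string

ALPHABET = string.ascii_uppercase


def caesar_key(shift=3):
    """Build a key mapping each plaintext letter to the letter `shift` ahead."""
    shifted = ALPHABET[shift:] + ALPHABET[:shift]
    return dict(zip(ALPHABET, shifted))


def substitute(text, key):
    """Apply a substitution key; characters outside the key pass through."""
    return "".join(key.get(ch, ch) for ch in text.upper())


def invert(key):
    """Turn an encryption key into the matching decryption key."""
    return {cipher: plain for plain, cipher in key.items()}


key = caesar_key(3)
ciphertext = substitute("ATTACK AT DAWN", key)
recovered = substitute(ciphertext, invert(key))
```

Any other permutation of the alphabet can be passed to `substitute` in the same way, which is what makes the Caesar cipher a special case of simple substitution.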

Slide14

Simple Substitution Cryptanalysis

Suppose you observe the

ciphertext PBFPVYFBQXZTYFPBFEQJHDXXQVAPTPQJKTOYQWIPBVWLXTOXBTFXQWAXBVCXQWAXFQJVWLEQNTOZQGGQLFXQWAKVWLXQWAEBIPBFXFQVXGTVJVWLBTPQWAEBFPBFHCVLXBQUFEVWLXGDPEQVPQGVPPBFTIXPFHXZHVFAGFOTHFEFBQUFTDHZBQPOTHXTYFTODXQHFTDPTOGHFQPBQWAQJJTODXQHFOQPWTBDHHIXQVAPBFZQHCFWPFHPBFIPBQWKFABVYYDZBOTHPBQPQJTQOTOGHFQAPBFEQJHDXXQVAVXEBQPEFZBVFOJIWFFACFCCFHQWAUVWFLQHGFXVAFXQHFUFHILTTAVWAFFAWTEVOITDHFHFQAITIXPFHXAFQHEFZQWGFLVWPTOFFA

Analyze frequency counts…

Likely that ciphertext “F” represents “E”

And so on, at least for common letters
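
The frequency analysis mentioned above can be sketched in a few lines. This is only an illustration: `frequency_ranking` is a hypothetical helper name, and the short sample string stands in for a real ciphertext.

```python
from collections import Counter


def frequency_ranking(ciphertext):
    """Return ciphertext letters ordered from most to least frequent."""
    counts = Counter(ch for ch in ciphertext if ch.isalpha())
    return [symbol for symbol, _ in counts.most_common()]


# Toy input: F occurs most often, so it is the first candidate for "E".
sample = "FFFFBBBQQ"
ranking = frequency_ranking(sample)
```

Pairing this ranking with the letters of English in descending frequency gives a first guess at the key, which is usually right only for the most common letters.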

Slide15

Simple Substitution Cryptanalysis

Can automate the cryptanalysis

1. Make initial guess for key using frequency counts

2. Compute oldScore

3. Modify key by swapping adjacent elements

4. Compute newScore

5. If newScore > oldScore, let oldScore = newScore

6. Else unswap key elements

7. Go to step 3

How to compute score?

Number of dictionary words in putative plaintext?

Much better to use English digraph statistics
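
The loop above might be sketched as follows. This is a minimal sketch under stated assumptions: `score_fn` is a caller-supplied function where higher is better, and the stopping rule (quit after a full pass with no improvement) is an assumption, since the steps above do not say when to stop. The toy score used in the example is not a real language statistic.

```python
def hill_climb(key, score_fn):
    """Swap adjacent key elements, keeping a swap only if the score improves."""
    key = list(key)
    old_score = score_fn(key)
    improved = True
    while improved:  # assumed stopping rule: one full pass without improvement
        improved = False
        for i in range(len(key) - 1):
            key[i], key[i + 1] = key[i + 1], key[i]
            new_score = score_fn(key)
            if new_score > old_score:
                old_score = new_score
                improved = True
            else:
                # Undo the swap when it does not help.
                key[i], key[i + 1] = key[i + 1], key[i]
    return key, old_score


# Toy score: negative total displacement from a known target ordering.
target = list("ABCD")


def toy_score(k):
    return -sum(abs(k.index(t) - i) for i, t in enumerate(target))


best_key, best_score = hill_climb("BACD", toy_score)
```

In a real attack `score_fn` would decrypt the ciphertext with the candidate key and compare its digraph statistics with English, which is why each step is expensive for long messages.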

Slide16

Jakobsen’s Algorithm

Method on previous slide can be slow

Why?

Jakobsen’s algorithm uses a similar idea, but is fast and efficient

Ciphertext is only decrypted once

So algorithm is (essentially) independent of length of message

Then, only matrix manipulations required

Slide17

Jakobsen’s Algorithm: Swapping

Assume plaintext is English, 26 letters

Let K = k1, k2, k3, …, k26 be the putative key

And let “|” represent “swap”

Then we swap elements as follows

Restart this swapping from the beginning whenever the score improves
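
One way to order the swaps, assumed here because it is consistent with the count of 26 choose 2 swaps per full pass, is to try all pairs at distance 1 (k1|k2, k2|k3, …), then all pairs at distance 2, and so on up to k1|k26. The generator below is a sketch of that assumed schedule using 0-based indices.

```python
def swap_schedule(n=26):
    """Yield index pairs (i, j): all pairs at distance 1, then distance 2, ..."""
    for distance in range(1, n):
        for i in range(n - distance):
            yield i, i + distance


pairs = list(swap_schedule(26))
```

Every unordered pair of key positions appears exactly once, so one pass with no improvement costs 325 score computations.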

Slide18

Jakobsen’s Algorithm: Swapping

Minimum number of swaps is 26 choose 2, or 325

Maximum is unbounded

Each swap requires a score computation

Average number of swaps, experimentally:

Ciphertext of length 500, average 1050 swaps

Ciphertext of length 8000, average just 630 swaps

So, work depends on length of ciphertext

More ciphertext, better scores, fewer swaps

Slide19

Jakobsen’s Algorithm: Scoring

Let D = {dij} be the digraph distribution corresponding to putative key K

Let E = {eij} be the digraph distribution of the English language

These matrices are 26 x 26

Compute score as
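
A score commonly used with Jakobsen’s algorithm is the sum of absolute differences between corresponding entries of D and E. Assuming that form (and that D and E are on the same scale, for example both normalized), a sketch looks like this. Note that with this distance-style score, smaller is better.

```python
def score(D, E):
    """Sum of |d_ij - e_ij| over all entries; 0 means the matrices agree."""
    return sum(
        abs(d - e)
        for row_d, row_e in zip(D, E)
        for d, e in zip(row_d, row_e)
    )


example = score([[1, 2], [3, 4]], [[1, 0], [5, 4]])
```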

Slide20

Jakobsen’s Algorithm

So far, nothing fancy here

Could see all of this in a CS 265 assignment

Jakobsen’s trick: determine new D matrix from old D without decrypting

How to do so?

It turns out that swapping elements of K swaps corresponding rows and columns of D

See example on next slides…
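
The row-and-column claim can be checked directly. In the sketch below the 10-letter alphabet and ciphertext are the ones used in the swapping example, while the key is a hypothetical permutation chosen only for illustration. The check decrypts twice purely to confirm that the shortcut gives the same matrix.

```python
ALPHABET = "ETAOINSRHD"


def digraph_matrix(text, alphabet=ALPHABET):
    """Count how often symbol i is immediately followed by symbol j."""
    index = {s: i for i, s in enumerate(alphabet)}
    n = len(alphabet)
    D = [[0] * n for _ in range(n)]
    for a, b in zip(text, text[1:]):
        D[index[a]][index[b]] += 1
    return D


def swap_rows_and_columns(D, p, q):
    """Return a copy of D with rows p, q exchanged and columns p, q exchanged."""
    n = len(D)
    perm = list(range(n))
    perm[p], perm[q] = perm[q], perm[p]
    return [[D[perm[i]][perm[j]] for j in range(n)] for i in range(n)]


def decrypt(ciphertext, key):
    return "".join(key[c] for c in ciphertext)


ciphertext = "TNDEODRHISOADDRTEDOAHENSINEOAR" "DTTDTINDDRNEDNTTTDDISRETEEEEEAA"
# Hypothetical key (ciphertext symbol -> plaintext symbol).
key = dict(zip("DETNORSAIH", ALPHABET))

# Swap the plaintext letters assigned to ciphertext D and E (plaintext E and T).
new_key = dict(key)
new_key["D"], new_key["E"] = key["E"], key["D"]

old_D = digraph_matrix(decrypt(ciphertext, key))
new_D = digraph_matrix(decrypt(ciphertext, new_key))
shortcut_D = swap_rows_and_columns(old_D, ALPHABET.index("E"), ALPHABET.index("T"))
```

Here `new_D` and `shortcut_D` are equal, so the second decryption is unnecessary: the matrix update costs time proportional to the alphabet size rather than the message length.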

Slide21

Swapping Example

To simplify, suppose a 10 letter alphabet

E, T, A, O, I, N, S, R, H, D

Suppose you are given the ciphertext

TNDEODRHISOADDRTEDOAHENSINEOAR

DTTDTINDDRNEDNTTTDDISRETEEEEEAA

Frequency counts given by

Slide22

Swapping Example

We choose the putative key K given here

The corresponding putative plaintext is

AOETRENDSHRIEENATE

RIDTOHSOTRINEAAEAS

OEENOTEOAAAEESHNA

TTTTTII

Corresponding digraph distribution D is

Slide23

Swapping Example

Suppose we swap first 2 elements of K

Then decrypt using new K

And compute digraph matrix for new K

Previous key K

New key K

Slide24

Swapping Example

Old D matrix vs new D matrix

What do you notice?

So what’s the point here?

This is good!

Slide25

Jakobsen’s Algorithm
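
Putting the swapping, the scoring, and the row/column trick together, the hill climb might be sketched as follows. This is a minimal sketch, not the authors’ implementation. It assumes the sum-of-absolute-differences score (smaller is better), a swap order that goes through pairs at distance 1, then 2, and so on, and a key stored as a permutation `perm` of matrix indices. `D0` is the digraph matrix computed once from the ciphertext.

```python
def score(D, E):
    """Assumed score: sum of absolute differences between D and E."""
    return sum(abs(d - e) for rd, re in zip(D, E) for d, e in zip(rd, re))


def permuted(D0, perm):
    """Digraph matrix of the putative plaintext, read off D0 without decrypting."""
    n = len(perm)
    return [[D0[perm[i]][perm[j]] for j in range(n)] for i in range(n)]


def jakobsen(D0, E, perm=None):
    n = len(E)
    perm = list(range(n)) if perm is None else list(perm)
    best = score(permuted(D0, perm), E)
    distance, i = 1, 0
    while distance < n:
        j = i + distance
        perm[i], perm[j] = perm[j], perm[i]
        new = score(permuted(D0, perm), E)
        if new < best:
            best = new
            distance, i = 1, 0  # restart the swapping from the beginning
        else:
            perm[i], perm[j] = perm[j], perm[i]  # undo the swap
            i += 1
            if i + distance >= n:
                distance, i = distance + 1, 0
    return perm, best


# Toy check: D0 is E with symbols 0 and 1 exchanged, so one swap recovers E.
E = [[5, 1, 0], [2, 7, 3], [0, 4, 9]]
sigma = [1, 0, 2]
D0 = [[E[sigma[i]][sigma[j]] for j in range(3)] for i in range(3)]
found_perm, found_score = jakobsen(D0, E)
```

For brevity `permuted` rebuilds the whole matrix on every step. Exchanging two rows and two columns in place would do the same job with less work, and either way the ciphertext itself is never touched again.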

Slide26

Proposed Similarity Score

Extract opcode sequences from a collection of (family) viruses

All viruses from the same metamorphic family

Determine the n most common opcodes

Symbol n+1 used for all “other” opcodes

Use resulting digraph statistics to form matrix E = {eij}

Note that matrix is (n+1) x (n+1)
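
The construction of E might be sketched as below. The function name, the toy opcode sequences, and the choice to normalize counts so that all entries sum to 1 are assumptions made for illustration. Digraphs are counted within each file, not across file boundaries.

```python
from collections import Counter


def family_matrix(sequences, n):
    """Return the n most common opcodes and an (n+1) x (n+1) digraph matrix."""
    counts = Counter(op for seq in sequences for op in seq)
    common = [op for op, _ in counts.most_common(n)]
    index = {op: i for i, op in enumerate(common)}
    other = len(common)  # last index is the catch-all "other" symbol
    size = other + 1
    E = [[0.0] * size for _ in range(size)]
    total = 0
    for seq in sequences:
        symbols = [index.get(op, other) for op in seq]
        for a, b in zip(symbols, symbols[1:]):
            E[a][b] += 1
            total += 1
    if total:
        E = [[v / total for v in row] for row in E]  # assumed normalization
    return common, E


# Toy opcode sequences standing in for disassembled family members.
family = [
    ["mov", "add", "mov", "add", "push"],
    ["mov", "mov", "call"],
]
common, E = family_matrix(family, n=2)
```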

Slide27

Scoring a File

Given an executable we want to score…

Extract its opcode sequence

Use opcode digraph stats to get D = {dij}

This matrix is also (n+1) x (n+1)

Initial “key” K chosen to match monograph stats of virus family

Most frequent opcode in exe maps to most frequent opcode in virus family, etc.

Score based on distance between D and E

“Decrypt” D and score how closely it matches E

Jakobsen’s algorithm used for “decryption”
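
The choice of initial key can be sketched as a rank-for-rank pairing. The helper name and the toy opcode lists are hypothetical. Opcodes in the exe that fall beyond the family’s ranked list are left unmapped here and would be treated as the “other” symbol.

```python
from collections import Counter


def initial_key(exe_opcodes, family_ranked):
    """Map the exe's k-th most frequent opcode to the family's k-th most frequent."""
    exe_ranked = [op for op, _ in Counter(exe_opcodes).most_common()]
    return dict(zip(exe_ranked, family_ranked))


# Toy data: family ranking in descending frequency, and one exe's opcodes.
family_ranked = ["mov", "add", "push"]
exe = ["push", "push", "push", "mov", "mov", "call"]
key = initial_key(exe, family_ranked)
```

This key is only a starting point: the swapping procedure then refines it, and the final distance between D and E is the similarity score for the file.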

Slide28

Example

Suppose only 5 common opcodes in family viruses (in descending frequency)

Extract following sequence from an exe

Initial “key” is

And “decrypt” is

Slide29

Example

Given “decrypt”

Form D matrix

After swap

And so on…

Slide30

Scoring Algorithm

Slide31

Quantifying Success

Consider these 2 scatterplots of scores

Which is better (and why)?

Slide32

ROC Curves

Plot true-positive rate vs false-positive rate

As “threshold” varies

Curve nearer 45-degree line is bad

Curve nearer upper-left is better

Slide33

ROC Curves

Use ROC curves to quantify success

Area under the ROC curve (AUC)

Probability that a randomly chosen positive instance scores higher than a randomly chosen negative instance

AUC of 1.0 implies ideal detection

AUC of 0.5 means classification is no better than flipping a coin
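
The probabilistic reading of AUC can be computed directly by comparing every positive score with every negative score. In this sketch ties count as half a win, which is a common convention but an assumption here. For a distance-style score, where lower means more similar, the scores would be negated first.

```python
def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # assumed tie convention
    return wins / (len(positive_scores) * len(negative_scores))


perfect = auc([3, 4], [1, 2])
coin_flip = auc([1, 2], [1, 2])
```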

Slide34

Parameter Selection

Tested the following parameters

Opcode matrix size

Scoring function

Normalization

Swapping strategy

None significant, except matrix size

So we only give results for matrix size

Slide35

Opcode Matrix Size

Obtained the following results

So, ironically, we use a 26 x 26 matrix

Slide36

Test Data

Tested the following metamorphic families

G2: known to be weak

NGVCK: highly metamorphic

MWOR: highly metamorphic and stealthy

MWOR “padding ratios” of 0.5 to 4.0

For G2 and NGVCK

50 files tested, cygwin utilities for benign files

For each MWOR padding ratio

100 files tested, Linux utilities for benign files

5-fold cross validation in each experiment

Slide37

NGVCK and G2 Graphs

Slide38

MWOR Score Graphs

Slide39

MWOR ROC Curves

Slide40

MWOR AUC Statistics

Slide41

Efficiency

Slide42

Conclusions

Simple substitution score gives good results for challenging metamorphics

Scoring is fast and efficient

Applicable to other types of malware

Requires opcodes

Slide43

Related Work

Recently, we generalized Jakobsen’s algorithm to a “combination” cipher

Simple substitution column transposition (SSCT)

Uses multiple D matrices

One D matrix for each column

Enables easy column manipulations

Overall, fast and effective SSCT attack

Slide44

SSCT

SSCT for malware detection

This might be a stronger malware score

Why?

Finding good test data is an issue

Can we find/make data where SSCT outperforms simple substitution score?

Currently studying this problem

Slide45

Homophonic Substitution

Homophonic sub. allows more than one ciphertext symbol for each plaintext symbol

Easy to encrypt, but harder to break than simple substitution. Why?

Previous student developed Jakobsen-like algorithm for homophonic sub.

Uses a nested hill climb approach

This could be tested on malware

Slide46

Zodiac 408

Example of homophonic substitution

Slide47

Zodiac 340

Unsolved

What is it?

Slide48

HMM

A different way to attack simple substitution (and related) ciphers…

Train an HMM (of course!)

Let A be 26 x 26, English digraph stats

Then train, without updating A matrix

Resulting B matrix is the key

Can work for homophonic case too

Any problems with this?

Slide49

HMM with Random Restarts

HMM requires lots of data to converge

Often, we don’t have lots of data

In such cases, try random restarts

HMM should converge with less data if we start closer to the solution

Try enough random restarts, might start close enough to converge

How many random restarts?

Slide50

HMM with Random Restarts

Could be applied to malware detection

However, slow and expensive

More relevant for cryptanalysis

Zodiac 340 cipher, for example

This has previously been analyzed using millions of random restarts

Slide51

References

G. Shanmugam, R.M. Low, and M. Stamp, Simple substitution distance and metamorphic detection, Journal of Computer Virology and Hacking Techniques, 9(3):159-170, 2013

A. Dhavare, R.M. Low, and M. Stamp, Efficient cryptanalysis of homophonic substitution ciphers, Cryptologia, 37(3):250-281, 2013

Slide52

References

T. Berg-Kirkpatrick and D. Klein, Decipherment with a million random restarts, http://www.cs.berkeley.edu/~tberg/papers/emnlp2013.pdf
