
Decompression-Free Inspection: DPI for Shared Dictionary Compression over HTTP

Anat Bremler-Barr (Interdisciplinary Center Herzliya), Shimrit Tzur David (Interdisciplinary Center Herzliya & The Hebrew University, Jerusalem), David Hay (The Hebrew University, Jerusalem).



Slide1

Decompression-Free Inspection: DPI for Shared Dictionary Compression over HTTP

Anat Bremler-Barr, Interdisciplinary Center Herzliya
Shimrit Tzur David, Interdisciplinary Center Herzliya & The Hebrew University, Jerusalem
David Hay, The Hebrew University, Jerusalem
Yaron Koral, Tel Aviv University

Slide2

Outline

Motivation
Background
AC algorithm
Our solution
  The offline phase
  The online phase
Experimental Results

Slide3

Deep Packet Inspection (DPI)

Search for patterns in the packets' payload
  Signature-based NIDS and intrusion prevention
  Web-application firewalls
  Leakage prevention
  Content filtering

Challenges:

Thousands of known malicious patterns

Real time, link rate

The performance of security tools is dominated by the pattern matching engine (Fisk & Varghese 2002)

Slide4

Compressed HTTP

419% increase in 8 months!

84.1% of the top 1,000 sites compress their traffic.

Data compression is done by adding references to repeated data.

There are two types of compression:

Intra-response compression: the references point to bytes within the response (Gzip/Deflate).

Inter-response/connection compression: the references point to bytes in a separate file, called a dictionary (Google's SDCH).

Slide5

Example - Intra-Response Compression

File1.html: abcdefgabcd
File2.html: abcdxyzbcdtr

Encode repeated strings by pointer: {distance, length}

TCP Connection Setup
GET File1.html
  abcdefg(7,4)
GET File2.html
  abcdxyz(6,3)tr

Slide6
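As a minimal sketch of the {distance, length} pointers in the intra-response example on the previous slide, the decoder below expands a token stream back into the original file. The token representation (Python strings for literals, tuples for pointers) and the name `decode_pointers` are assumptions made for illustration; real Gzip/Deflate streams are bit-level and Huffman-coded.

```python
def decode_pointers(tokens):
    """Expand literals and (distance, length) back-references."""
    out = []
    for tok in tokens:
        if isinstance(tok, str):
            out.extend(tok)                  # literal bytes are copied as-is
        else:
            distance, length = tok
            start = len(out) - distance      # pointer counts back from the current end
            for i in range(length):          # byte by byte, so overlapping copies also work
                out.append(out[start + i])
    return "".join(out)


# The two responses from the example slide.
print(decode_pointers(["abcdefg", (7, 4)]))
print(decode_pointers(["abcdxyz", (6, 3), "tr"]))
```

Because every pointer refers back into the same response, a scanner that wants the plain text has to reconstruct it per response.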

Example - Inter-Response Compression

Dictionary: abcd
File1.html: abcdefgabcd
File2.html: abcdxyzbcdtr

Copy repeated strings from the dictionary: (address, length)

TCP Connection Setup
GET File1.html
  Delta file: (0,4)efg(0,4)
GET File2.html
  Delta file: (0,4)xyz(1,3)tr
GET dictionary
  abcd

Slide7
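The (address, length) copy instructions in the inter-response example on the previous slide can be sketched the same way. The token representation and the name `apply_delta` are assumptions for illustration; SDCH itself carries deltas in a binary encoding, which this sketch does not attempt to parse.

```python
def apply_delta(dictionary, delta):
    """Rebuild a response from literals and (address, length) copies out of the dictionary."""
    parts = []
    for tok in delta:
        if isinstance(tok, str):
            parts.append(tok)                                   # literal bytes
        else:
            address, length = tok
            parts.append(dictionary[address:address + length])  # copy from the shared dictionary
    return "".join(parts)


dictionary = "abcd"
print(apply_delta(dictionary, [(0, 4), "efg", (0, 4)]))
print(apply_delta(dictionary, [(0, 4), "xyz", (1, 3), "tr"]))
```

Unlike the intra-response case, every copy refers to the same dictionary, which is what makes it possible to scan that dictionary once and reuse the result.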

Current NIDS Operation (1)

[Diagram: Client, NIDS, Server]

Request (shown on both sides of the NIDS):
  GET /index.html
  Accept-Encoding: SDCH
HTTP traffic: uncompressed
NIDS: scan for intrusions

Slide8

Current NIDS Operation (2)

[Diagram: Client, NIDS, Server]

Request (shown on both sides of the NIDS):
  GET /index.html
  Accept-Encoding: SDCH
HTTP traffic: compressed
NIDS: either do not scan, or decompress, scan, and compress

Slide9

Our Solution

[Diagram: Client, NIDS, Server]

Request (shown on both sides of the NIDS):
  GET /index.html
  Accept-Encoding: SDCH
HTTP traffic: compressed
NIDS: scan directly with no decompression

Slide10

Our Solution: Decompression-Free Scanning

Focused on inter-response compression
Our algorithm works in two phases
  Offline phase - scanning the dictionary
  Online phase - scanning the delta files
Works at the rate of the compressed traffic
Gains a 56% improvement compared with scanning the plain text directly

Slide11

Outline

Motivation
Background
Aho-Corasick (AC) algorithm
Our solution
  The offline phase
  The online phase
Experimental Results

Slide12

Aho-Corasick (AC) Algorithm

Finite State Machine (FSM)
  Regular states, accepting states
Goto function (black arrows)
  g(state, symbol) → state
  Each state corresponds to a label: the sequence of characters on its goto path from the root.
  The length of the label is the depth of the state
Failure function (red arrows)
  f(state) → state
  Taken when there is no goto transition
  Goes to the state whose label is the longest proper suffix of the current state's label

[Figure: AC automaton with states s0 to s14; goto transitions are labeled with the symbols A, B, C, D, E.]

Patterns: E, BE, BD, BCAA, BCD, CDBCAB

The label of s14 is BCAA
g(s11, B) = s12
g(s11, A) = ?
f(s11) = s13
g(s11, A) → g(s13, A) = s14

Slide13
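The goto and failure functions from the Aho-Corasick slide above can be sketched compactly if each state is identified by its label instead of a number. This is a deliberately simple, unoptimized illustration under that assumption (the helper names `build_prefixes`, `failure`, `step` and `scan` are invented here); a production automaton would precompute both functions into tables. In this representation the state the slide calls s11 is taken to be the one labeled CDBCA, which follows from f(s11) = s13 and g(s13, A) = s14 with label BCAA.

```python
PATTERNS = ["E", "BE", "BD", "BCAA", "BCD", "CDBCAB"]


def build_prefixes(patterns):
    """Every prefix of every pattern is the label of one state ('' is the root)."""
    return {p[:i] for p in patterns for i in range(len(p) + 1)}


def failure(label, prefixes):
    """f(state): longest proper suffix of the label that is itself a state label."""
    for i in range(1, len(label) + 1):
        if label[i:] in prefixes:
            return label[i:]
    return ""  # the root fails to itself


def step(label, symbol, prefixes):
    """Follow the goto transition if it exists; otherwise fail and retry."""
    while label + symbol not in prefixes:
        if not label:
            return ""  # no goto from the root on this symbol: stay at the root
        label = failure(label, prefixes)
    return label + symbol


def scan(text, patterns):
    """Return the final state label and all (end_position, pattern) matches."""
    prefixes = build_prefixes(patterns)
    label, matches = "", []
    for pos, symbol in enumerate(text):
        label = step(label, symbol, prefixes)
        suffix = label
        while suffix:  # every pattern on the failure chain ends at this position
            if suffix in patterns:
                matches.append((pos, suffix))
            suffix = failure(suffix, prefixes)
    return label, matches


P = build_prefixes(PATTERNS)
print(step("CDBCA", "B", P))  # goto transition exists
print(failure("CDBCA", P))    # failure transition of the same state
print(step("CDBCA", "A", P))  # no goto on A: fail first, then goto
```

The depth of a state is simply `len(label)` here, which keeps the boundary checks of the online phase easy to express.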

Aho-Corasick Insights

The automaton remembers only its current state
The input text ends with the label of the current state
This label is the longest suffix of the text that can be a prefix of a match
No future pattern can begin before this label

[Figure: the AC automaton with states s0 to s14.]

Slide14

Outline

Motivation
Background
Aho-Corasick (AC) algorithm
Our solution
  The offline phase
  The online phase
Experimental Results

Slide15

Accelerator Algorithm Idea

The algorithm operates in two phases:
  The Offline Phase: scan the dictionary and store information about the pattern matching results
  The Online Phase: scan the delta file and skip almost all referenced bytes that were already scanned for patterns.

Slide16

The Offline Phase

The dictionary is scanned using AC (from its first byte and from s0). We save the state after each byte.

Position: 0   1   2   3   4   5   6   7   8   9   10  11
Symbol:   D   B   E   A   A   C   D   B   C   A   B   C
State:    s0  s2  s3  s0  s0  s7  s8  s9  s10 s11 s12 s5

[Figure: the AC automaton with states s0 to s14.]
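A minimal sketch of this offline scan, assuming states are identified by their labels (so the empty label stands for s0, and the depth of a state is the length of its label). The helper names and the label-based representation are illustrative assumptions, not the authors' data layout; the slides describe the saved value as one pointer per dictionary symbol.

```python
PATTERNS = ["E", "BE", "BD", "BCAA", "BCD", "CDBCAB"]
DICTIONARY = "DBEAACDBCABC"


def build_prefixes(patterns):
    """State labels: every prefix of every pattern ('' is the root, s0)."""
    return {p[:i] for p in patterns for i in range(len(p) + 1)}


def failure(label, prefixes):
    """Longest proper suffix of the label that is also a state label."""
    for i in range(1, len(label) + 1):
        if label[i:] in prefixes:
            return label[i:]
    return ""


def step(label, symbol, prefixes):
    """One AC transition: goto if possible, otherwise failure transitions."""
    while label + symbol not in prefixes:
        if not label:
            return ""
        label = failure(label, prefixes)
    return label + symbol


def offline_scan(dictionary, patterns):
    """Scan the dictionary from its first byte and from the root; save the state after each byte."""
    prefixes = build_prefixes(patterns)
    saved, label = [], ""
    for symbol in dictionary:
        label = step(label, symbol, prefixes)
        saved.append(label)
    return saved


for position, label in enumerate(offline_scan(DICTIONARY, PATTERNS)):
    print(position, DICTIONARY[position], repr(label), "depth", len(label))
```

The saved array has exactly one entry per dictionary symbol, which is why the memory cost grows linearly with the dictionary size.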

We also save information about the matched patterns that are found in the dictionary.

Slide17

Challenges

Dictionary (positions 0 to 11): D B E A A C D B C A B C
Delta file: ABDB(5,4)AAB(1,4)
The uncompressed data is: ABDBCDBCAABBEAA

Patterns/Signatures: E, BE, BD, BCAA, BCD, CDBCAB

Types of matches:
  Right boundary
  Internal
  Left boundary

We copy from an arbitrary position in the dictionary when the automaton is in an arbitrary state.
We show that no matter in what state and at which symbol we start to copy, the resulting state is reachable via failure transitions from the saved state.

Slide18

The Online Phase

Scan the delta file:
  Uncompressed bytes - scan using AC.
  Copy instruction (p,x)
    The copied data was already scanned in the offline phase.
    We skip the scan for almost all of these bytes.
The internal match is trivial; see the paper for details.
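One way to picture the internal-match case is as a lookup into the match list saved during the offline phase: any dictionary match that lies entirely inside the copied range is also a match in the uncompressed data. The sketch below is an assumption about how such a lookup could look, not the structure used in the paper; the function names are invented, and a naive `str.find` search stands in for the AC scan that would normally produce the list.

```python
PATTERNS = ["E", "BE", "BD", "BCAA", "BCD", "CDBCAB"]
DICTIONARY = "DBEAACDBCABC"


def find_dictionary_matches(dictionary, patterns):
    """Offline: (start, end, pattern) for every occurrence in the dictionary."""
    found = []
    for pattern in patterns:
        start = dictionary.find(pattern)
        while start != -1:
            found.append((start, start + len(pattern) - 1, pattern))
            start = dictionary.find(pattern, start + 1)
    return sorted(found)


def internal_matches(saved, p, x):
    """Online: saved matches lying entirely inside the copied range [p, p + x - 1]."""
    return [(s, e, pat) for (s, e, pat) in saved if p <= s and e <= p + x - 1]


saved = find_dictionary_matches(DICTIONARY, PATTERNS)
print(saved)
print(internal_matches(saved, 1, 4))  # COPY(1,4) copies BEAA
print(internal_matches(saved, 7, 4))  # COPY(7,4) copies BCAB
```

Positions in the result are dictionary offsets; a match that starts before `p` is excluded because the bytes preceding the copy in the online text may differ from those in the dictionary.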

Slide19

The Online Phase - Right Boundary

When encountering copy instruction (p,x), we want to stop scanning and jump to state[p+x-1]
If the label of the state is longer than the copy value
  The label begins before the copy value
  The context of this state is not the same as in the online scan
  We take failure transitions to find a state with a sufficiently short label.
Otherwise
  The label of the state is contained in the copy value
  This is the longest suffix that can lead to a match
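The right-boundary rule above can be sketched as a short loop over failure transitions. States are again identified by their labels (an assumption of this sketch), and `SAVED` hard-codes the per-position states of the dictionary DBEAACDBCABC for the pattern set used throughout the slides; the names `right_boundary` and `SAVED` are illustrative.

```python
PATTERNS = ["E", "BE", "BD", "BCAA", "BCD", "CDBCAB"]

# Saved state label after each dictionary byte (dictionary: DBEAACDBCABC).
SAVED = ["", "B", "BE", "", "", "C", "CD", "CDB", "CDBC", "CDBCA", "CDBCAB", "BC"]


def build_prefixes(patterns):
    """State labels: every prefix of every pattern ('' is the root)."""
    return {p[:i] for p in patterns for i in range(len(p) + 1)}


def failure(label, prefixes):
    """Longest proper suffix of the label that is also a state label."""
    for i in range(1, len(label) + 1):
        if label[i:] in prefixes:
            return label[i:]
    return ""


def right_boundary(saved, p, x, prefixes):
    """State after a copy (p, x): start from saved[p + x - 1], then shorten the label."""
    label = saved[p + x - 1]
    while len(label) > x:  # label begins before the copied bytes
        label = failure(label, prefixes)
    return label


P = build_prefixes(PATTERNS)
print(repr(right_boundary(SAVED, 7, 4, P)))  # saved label CDBCAB is too long for 4 copied bytes
print(repr(right_boundary(SAVED, 5, 4, P)))  # saved label CDBC fits, so no failure transition
```

This sketch assumes the left boundary has already been handled, so that the label of the online state lies entirely inside the copied bytes.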

Slide20

Example – Right Boundary

Uncompressed data: …B, followed by COPY(7,4), which copies B C A B (dictionary positions 7 to 10)

[Figure: the AC automaton and the dictionary with its saved states (positions 0 to 11: D B E A A C D B C A B C).]

COPY(7,4):
  Go to State[10] = s12. depth(s12) > 4.
  Go to f(s12) = s2. depth(s2) ≤ 4.
  Current state is s2

Slide21

The Online Phase – Left Boundary

When encountering copy instruction (p,x), we want to stop scanning and jump to state[p+x-1]
If the number of bytes we read from the copy value is less than the depth of the current state
  The label of the state begins before the copied bytes
  We scan the copy value until we reach a state whose label is no longer than the number of bytes read.
Otherwise
  The label of the state is contained in the copy value
  Both offline and online scans have the same context
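Putting the left-boundary rule together with the right-boundary jump gives a sketch of how one copy instruction might be processed, and of a whole delta-file scan built on top of it. As before, states are identified by their labels, all helper names are invented for this illustration, and only the automaton state is tracked (match reporting is left out to keep the sketch short).

```python
PATTERNS = ["E", "BE", "BD", "BCAA", "BCD", "CDBCAB"]
DICTIONARY = "DBEAACDBCABC"


def build_prefixes(patterns):
    return {p[:i] for p in patterns for i in range(len(p) + 1)}


def failure(label, prefixes):
    for i in range(1, len(label) + 1):
        if label[i:] in prefixes:
            return label[i:]
    return ""


def step(label, symbol, prefixes):
    while label + symbol not in prefixes:
        if not label:
            return ""
        label = failure(label, prefixes)
    return label + symbol


def plain_scan(text, prefixes):
    """Ordinary AC scan from the root; returns the state after each byte."""
    states, label = [], ""
    for symbol in text:
        label = step(label, symbol, prefixes)
        states.append(label)
    return states


def handle_copy(label, saved, dictionary, p, x, prefixes):
    """Process COPY(p, x); returns (new state, number of copied bytes actually scanned)."""
    j = 0
    # Left boundary: keep scanning while the label still begins before the copy.
    while j < x and len(label) > j:
        label = step(label, dictionary[p + j], prefixes)
        j += 1
    if j == x:
        return label, j  # the whole copy value had to be scanned
    # Skip the remaining bytes: jump to the saved state, then fix the right boundary.
    label = saved[p + x - 1]
    while len(label) > x:
        label = failure(label, prefixes)
    return label, j


def scan_delta(delta, dictionary, patterns):
    prefixes = build_prefixes(patterns)
    saved = plain_scan(dictionary, prefixes)  # offline phase
    label, scanned = "", 0
    for tok in delta:
        if isinstance(tok, str):  # uncompressed bytes: scan using AC
            for symbol in tok:
                label = step(label, symbol, prefixes)
            scanned += len(tok)
        else:
            p, x = tok
            label, j = handle_copy(label, saved, dictionary, p, x, prefixes)
            scanned += j
    return label, scanned


P = build_prefixes(PATTERNS)
delta = ["ABDB", (5, 4), "AAB", (1, 4)]
label, scanned = scan_delta(delta, DICTIONARY, PATTERNS)
plain = plain_scan("ABDBCDBCAABBEAA", P)
print("same final state as a plain scan:", label == plain[-1])
print("bytes scanned:", scanned, "of", len(plain))
```

The final comparison against `plain_scan` on the uncompressed text is only a sanity check for the sketch; the point of the algorithm is that the online phase never needs that text.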

Slide22

Example – Left Boundary

Uncompressed data: …B, followed by COPY(5,4), which copies C D B C (dictionary positions 5 to 8)

[Figure: the AC automaton and the dictionary with its saved states (positions 0 to 11: D B E A A C D B C A B C).]

COPY(5,4):
  j=0: depth=1. Continue
  j=1: depth=2. Continue
  j=2: depth=3. Continue
  j=3: stop scanning (depth(s9) ≤ 3)

Slide23

Outline

Motivation
Background
Aho-Corasick (AC) algorithm
Our solution
  The offline phase
  The online phase
Experimental Results

Slide24

Experimental Results

Input:
  google.com dictionary
  Pages for the 1,000 most popular Google queries.
Patterns:
  Snort
  The synthetic case: a patterns file for each input file, so that each input file has a different percentage of matches, from 25% to 100%.

Slide25

The Algorithm Overheads

Traversing the failure transitions
  In the right boundary
Scanning the copy value
  In the left boundary
Memory consumption:
  The additional information of the offline phase.
  Total: 420 KB (per dictionary)
  Can be further reduced by a variable-length pointer encoding.

Slide26

Failure Transitions – Right Boundaries

If length ≥ depth, no failure transition is taken
In our experiments:
  The average is 2.35 failure transitions per file (with an average of 557 copy instructions per file)

Slide27

Scanning the Copy Value - Left Boundary

Compression ratio: compressed/uncompressed
Scan ratio: scanned/uncompressed
Snort
  Low percentage of matches
  Scan ratio ~ compression ratio
The synthetic case
  High percentage of matches
  Unrealistic case
  Scan ratio is between 1.05 and 1.2 times the compression ratio.

Slide28

Regular Expression Results

Strings were extracted from the regular expressions and added to the pattern set.
When needed, we use an off-the-shelf Perl-compatible regular expression engine to scan additional parts of the text.
The overhead of the regular expressions is around 1%, which is almost negligible.

Slide29

Questions??

Slide30

Regular Expression

Very common in security-purpose patterns.
In Snort, 55% of the rules contain a regular expression.
Composed of anchors and pcre tokens.
For example, in the pattern: abc[1-9]*xyza{3,7}
  The anchors are: abc, xyz
  The pcre tokens are: [1-9]*, a{3,7}
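To make the split into anchors and pcre tokens concrete, here is a rough sketch that handles only plain literals, `[...]` character classes and the quantifiers `*`, `+`, `?` and `{m,n}`; escapes, groups and alternation are deliberately not supported. The function names are invented for this illustration, the substring test stands in for the AC anchor matches, and Python's `re` module stands in for the off-the-shelf PCRE engine mentioned in the talk.

```python
import re


def split_anchors(regex):
    """Return (anchors, tokens) for a regex built from literals, [...] classes and quantifiers."""
    anchors, tokens, literal = [], [], ""
    i = 0
    while i < len(regex):
        if regex[i] == "[":  # character class: one atom up to the closing ']'
            end = regex.index("]", i)
            atom, i = regex[i:end + 1], end + 1
        else:                # plain literal character
            atom, i = regex[i], i + 1
        quant = ""           # quantifier that directly follows the atom, if any
        if i < len(regex) and regex[i] in "*+?":
            quant, i = regex[i], i + 1
        elif i < len(regex) and regex[i] == "{":
            end = regex.index("}", i)
            quant, i = regex[i:end + 1], end + 1
        if quant or atom.startswith("["):  # not a fixed literal: closes the current anchor
            if literal:
                anchors.append(literal)
                literal = ""
            tokens.append(atom + quant)
        else:
            literal += atom
    if literal:
        anchors.append(literal)
    return anchors, tokens


def regex_matches(regex, text):
    """Run the regex engine only if every anchor occurs in the text."""
    anchors, _ = split_anchors(regex)
    if not all(anchor in text for anchor in anchors):
        return False
    return re.search(regex, text) is not None


print(split_anchors("abc[1-9]*xyza{3,7}"))
print(regex_matches("abc[1-9]*xyza{3,7}", "..abc123xyzaaaa.."))
print(regex_matches("abc[1-9]*xyza{3,7}", "abcxyzaa"))
```

The anchor check is cheap and rejects most inputs, so the regular expression engine runs only on the rare texts where all anchors are present.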

Slide31

Dealing with Regular Expression

The anchors are extracted from the regular expression offline.
The anchors are added to the pattern set.
If there is a regular expression all of whose anchors were matched:
  Run an off-the-shelf regular expression engine until either a mismatch, a full pattern match, or the whole (limited) text is searched.

Slide32

Regular Expression – Limited Search

In most cases, we can limit the search in at least one direction.
If all tokens before the first anchor have a limited size, there is a bounded number of characters we should examine before the matched anchor.
If all tokens after the last anchor have a limited size, there is a bounded number of characters we should examine after the matched anchor.

Slide33

Memory Consumption

Doubling the size of the dictionary (for saving the offline scan results, one pointer per symbol)
Saving the matched list (for internal matches)
Our experiments:
  Match list size: 40,000
  Dictionary size: 116K symbols
  Pointer size: 17 bits
Total memory consumption is 420 KB (per dictionary)
Can be further reduced by a variable-length pointer encoding.