Slide1

Kargus: A Highly-scalable Software-based Intrusion Detection System

M. Asim Jamshed*, Jihyung Lee, Sangwoo Moon, Insu Yun*, Deokjin Kim, Sungryoul Lee, Yung Yi, KyoungSoo Park*

* Networked & Distributed Computing Systems Lab, KAIST

† Laboratory of Network Architecture Design & Analysis, KAIST

‡ Cyber R&D Division, NSRI

Slide2

Network Intrusion Detection Systems (NIDS)

Detect known malicious activities
  Port scans, SQL injections, buffer overflows, etc.
Deep packet inspection
  Detect malicious signatures (rules) in each packet
Desirable features
  High performance (> 10 Gbps) with precision
  Easy maintenance
    Frequent ruleset updates

[Figure: an attack arriving from the Internet passes through the NIDS]

Slide3

Hardware vs. Software

H/W-based NIDS
  Specialized hardware
    ASIC, TCAM, etc.
  High performance
  Expensive
    Annual servicing costs
  Low flexibility
S/W-based NIDS
  Commodity machines
  High flexibility
  Low performance
    DDoS/packet drops

IDS/IPS Sensors (10s of Gbps): ~ US$ 20,000 - 60,000
IDS/IPS M8000 (10s of Gbps): ~ US$ 10,000 - 24,000
Open-source S/W: ≤ ~2 Gbps

Slide4

Goals

S/W-based NIDS
  Commodity machines
  High flexibility
  High performance

Slide5

Typical Signature-based NIDS Architecture

Packet Acquisition
-> Preprocessing (Decode, Flow management, Reassembly)
-> Multi-string Pattern Matching
     Match Failure: Innocent Flow
     Match Success: continue to Rule Options Evaluation
-> Rule Options Evaluation
     Evaluation Failure: Innocent Flow
     Evaluation Success: Output (Malicious Flow)

Example rule:

    alert tcp $EXTERNAL_NET any -> $HTTP_SERVERS 80 (msg:"possible attack attempt BACKDOOR optix runtime detection"; content:"/whitepages/page_me/100.html"; pcre:"/body=\x2521\x2521\x2521Optix\s+Pro\s+v\d+\x252E\d+\S+sErver\s+Online\x2521\x2521\x2521/")
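The two matching stages in the pipeline above can be illustrated with a minimal sketch. It assumes a single rule (the one on this slide) and replaces the multi-string stage with a plain substring test; a real engine matches the content strings of all rules at once. The function name `evaluate` and the sample payloads are hypothetical.

```python
import re

# Both fields are copied from the example rule on this slide.
CONTENT = b"/whitepages/page_me/100.html"
PCRE = re.compile(
    rb"body=\x2521\x2521\x2521Optix\s+Pro\s+v\d+\x252E\d+\S+sErver\s+Online\x2521\x2521\x2521"
)


def evaluate(payload: bytes) -> str:
    """Classify one payload with the cheap string test first, the regex second."""
    # Stage 1: string match. Most innocent traffic is rejected here.
    if CONTENT not in payload:
        return "innocent (match failure)"
    # Stage 2: rule option evaluation, only for payloads that passed stage 1.
    if PCRE.search(payload) is None:
        return "innocent (evaluation failure)"
    return "malicious"


if __name__ == "__main__":
    samples = [
        b"GET /index.html HTTP/1.1",
        b"GET /whitepages/page_me/100.html HTTP/1.1",
        b"GET /whitepages/page_me/100.html?body=%21%21%21Optix Pro "
        b"v1%2E33+sErver Online%21%21%21 HTTP/1.1",
    ]
    for sample in samples:
        print(evaluate(sample))
```

The ordering matters because the regex is far more expensive per byte than the string test, so it should only run on the small fraction of payloads that survive the first stage.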

[Figure label: Bottlenecks]

* PCRE: Perl Compatible Regular Expression

Slide6

Contributions

A highly-scalable software-based NIDS for high-speed networks

Goal: slow software NIDS -> fast software NIDS
Bottlenecks
  Inefficient packet acquisition
  Expensive string & PCRE pattern matching
Solutions
  Multi-core packet acquisition
  Parallel processing & GPU offloading
Outcome
  Fastest S/W signature-based IDS: 33 Gbps
  100% malicious traffic: 10 Gbps
  Real network traffic: ~24 Gbps

Slide7

Challenge 1: Packet Acquisition

Default packet module: Packet CAPture (PCAP) library
  Unsuitable for multi-core environment
  Low performance
  More power consumption
Multi-core packet capture library is required

[Figure: Cores 1-5 with 10 Gbps NICs A and B; Cores 7-11 with 10 Gbps NICs C and D]

Packet RX bandwidth*: 0.4-6.7 Gbps
CPU utilization: 100%

* Intel Xeon X5680, 3.33 GHz, 12 MB L3 Cache

Slide8

Solution: PacketShader I/O*

PacketShader I/O
  Uniformly distributes packets based on flow info by RSS hashing
    Source/destination IP addresses, port numbers, protocol-id
  1 core can read packets from RSS queues of multiple NICs
  Reads packets in batches (32 to 4096)
Symmetric Receive-Side Scaling (RSS)
  Passes packets of 1 connection to the same queue

* S. Han et al., “PacketShader: a GPU-accelerated software router”, ACM SIGCOMM 2010
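As a rough illustration of why symmetric RSS keeps both directions of a connection on one queue, the sketch below implements the Toeplitz hash in software and uses the symmetric key shown in the backup slides (0x6d5a repeated 20 times). Several things here are assumptions: the helper names, hashing only the TCP/IPv4 4-tuple, and picking the queue with a modulo instead of a NIC indirection table.

```python
import socket
import struct

# Symmetric key from the backup slides: 0x6d5a repeated 20 times (40 bytes).
SYMMETRIC_RSK = bytes.fromhex("6d5a") * 20


def toeplitz_hash(data: bytes, key: bytes) -> int:
    """32-bit Toeplitz hash: XOR the key's sliding 32-bit window for each set input bit."""
    key_bits = len(key) * 8
    if key_bits < len(data) * 8 + 32:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for bit_index in range(len(data) * 8):
        byte = data[bit_index // 8]
        if byte & (0x80 >> (bit_index % 8)):
            # Window = the 32 key bits starting at position bit_index from the left.
            result ^= (key_int >> (key_bits - 32 - bit_index)) & 0xFFFFFFFF
    return result


def rss_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              num_queues: int, key: bytes = SYMMETRIC_RSK) -> int:
    """Map a TCP/IPv4 4-tuple to a queue index (modulo is a simplification)."""
    data = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))
    return toeplitz_hash(data, key) % num_queues


if __name__ == "__main__":
    forward = rss_queue("10.0.0.1", "192.168.1.9", 12345, 80, num_queues=5)
    reverse = rss_queue("192.168.1.9", "10.0.0.1", 80, 12345, num_queues=5)
    print(forward == reverse)  # both directions land on the same queue
```

The symmetry follows from the key repeating every 16 bits: swapping the two addresses moves bits by 32 positions and swapping the two ports moves them by 16, so every input bit still meets the same 32-bit key window.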

[Figure: RX queues RxQ A1-A5 and RxQ B1-B5 of 10 Gbps NICs A and B, served by Cores 1-5]

Packet RX bandwidth: 0.4-6.7 Gbps (PCAP) -> 40 Gbps (PacketShader I/O)
CPU utilization: 100% (PCAP) -> 16-29% (PacketShader I/O)

Slide9

Challenge 2: Pattern Matching

CPU intensive tasks for serial packet scanning
Major bottlenecks
  Multi-string matching (Aho-Corasick phase)
  PCRE evaluation (if ‘pcre’ rule option exists in rule)
On an Intel Xeon X5680, 3.33 GHz, 12 MB L3 Cache
  Aho-Corasick analyzing bandwidth per core: 2.15 Gbps
  PCRE analyzing bandwidth per core: 0.52 Gbps

Slide10

Solution: GPU for Pattern Matching

GPUs
  Containing 100s of SIMD processors
    512 cores for NVIDIA GTX 580
  Ideal for parallel data processing without branches
DFA-based pattern matching on GPUs
  Multi-string matching using Aho-Corasick algorithm
  PCRE matching
Pipelined execution in CPU/GPU
  Concurrent copy and execution
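To make the "DFA-based" point concrete, here is a minimal CPU-side sketch of Aho-Corasick compiled into a full transition table, so that scanning is one table lookup per input byte with no failure-link branching. This is plain Python for illustration only, not the Kargus GPU kernel; the names `build_dfa` and `search` are hypothetical. On a GPU, one would presumably have each thread walk the same table over a different packet.

```python
from collections import deque


def build_dfa(patterns):
    """Build a full Aho-Corasick DFA: one 256-entry transition row per state."""
    goto = [{}]        # trie edges per state
    outputs = [set()]  # patterns that end at each state
    for pat in patterns:
        state = 0
        for byte in pat:
            if byte not in goto[state]:
                goto.append({})
                outputs.append(set())
                goto[state][byte] = len(goto) - 1
            state = goto[state][byte]
        outputs[state].add(pat)

    fail = [0] * len(goto)
    table = [[0] * 256 for _ in goto]
    queue = deque()
    for byte, nxt in goto[0].items():
        table[0][byte] = nxt
        queue.append(nxt)
    # Breadth-first order guarantees the failure state's row is already complete.
    while queue:
        state = queue.popleft()
        outputs[state] |= outputs[fail[state]]
        for byte in range(256):
            if byte in goto[state]:
                nxt = goto[state][byte]
                fail[nxt] = table[fail[state]][byte]
                table[state][byte] = nxt
                queue.append(nxt)
            else:
                table[state][byte] = table[fail[state]][byte]
    return table, outputs


def search(table, outputs, payload):
    """Return (start_offset, pattern) for every match in payload."""
    state = 0
    hits = []
    for offset, byte in enumerate(payload):
        state = table[state][byte]  # single lookup per byte
        for pat in outputs[state]:
            hits.append((offset - len(pat) + 1, pat))
    return hits


if __name__ == "__main__":
    table, outputs = build_dfa([b"he", b"she", b"his", b"hers"])
    print(sorted(search(table, outputs, b"ushers")))
```

The full table costs 256 entries per state, which trades memory for a branch-free inner loop; that trade-off is what makes the approach attractive on SIMD hardware.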

[Figure: the engine thread (Packet Acquisition -> Preprocess -> Multi-string Matching -> Rule Option Evaluation) offloads work through the Multi-string Matching Queue and the PCRE Matching Queue to the GPU dispatcher thread, which runs multi-string matching and PCRE matching on the GPU]

Aho-Corasick bandwidth: 2.15 Gbps (CPU) -> 39 Gbps (GPU)
PCRE bandwidth: 0.52 Gbps (CPU) -> 8.9 Gbps (GPU)

Slide11

Optimization 1: IDS Architecture

How to best utilize the multi-core architecture?
Pattern matching is the eventual bottleneck
Run entire engine on each core

Function                    Time %   Module
acsmSearchSparseDFA_Full    51.56    multi-string matching
List_GetNextState           13.91    multi-string matching
mSearch                      9.18    multi-string matching
in_chksum_tcp                2.63    preprocessing

* GNU gprof profiling results

Slide12

Solution: Single-process Multi-thread

Runs multiple IDS engine threads & GPU dispatcher threads concurrently
  Shared address space
  Less GPU memory consumption
  Higher GPU utilization & shorter service latency

GPU memory usage: 1/6

[Figure: Cores 1-5 each run a full engine thread (Packet Acquisition -> Preprocess -> Multi-string Matching -> Rule Option Evaluation); Core 6 runs the GPU dispatcher thread. Figure note: "Single thread pinned at core 1"]

Slide13

Architecture

Non Uniform Memory Access (NUMA)-aware
Core framework as deployed in dual hexa-core system
Can be configured to various NUMA set-ups accordingly

[Figure: Kargus configuration on a dual-NUMA-node hexa-core machine with 4 NICs and 2 GPUs]

Slide14

Optimization 2: GPU Usage

Caveats
  Long per-packet processing latency:
    Buffering in GPU dispatcher
  More power consumption
    NVIDIA GTX 580: 512 cores
Use:
  CPU when ingress rate is low (idle GPU)
  GPU when ingress rate is high

Slide15

Solution: Dynamic Load Balancing

Load balancing between CPU & GPU
  Reads packets from NIC queues per cycle
  Analyzes smaller # of packets at each cycle (a < b < c)
  Increases analyzing rate if queue length increases
  Activates GPU if queue length increases

[Figure: internal packet queue (per engine) with queue-length thresholds α, β, γ; the per-cycle batch size grows from a to b to c, and processing moves from CPU to GPU as the queue length grows]

Packet latency with GPU: 640 μsecs
Packet latency with CPU: 13 μsecs
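One possible reading of the threshold scheme above, written as a small policy object. The slide does not spell out the exact rule, so the mapping from thresholds to batch sizes, the absence of hysteresis, the class name `LoadBalancer`, and all numeric values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LoadBalancer:
    """Queue-length thresholds (alpha < beta < gamma) and batch sizes (a < b < c)."""
    alpha: int
    beta: int
    gamma: int
    a: int
    b: int
    c: int

    def __post_init__(self):
        if not (self.alpha < self.beta < self.gamma):
            raise ValueError("thresholds must satisfy alpha < beta < gamma")
        if not (self.a < self.b < self.c):
            raise ValueError("batch sizes must satisfy a < b < c")

    def decide(self, queue_len: int) -> Tuple[int, str]:
        """Return (packets to analyze this cycle, device) for the current queue length."""
        if queue_len < self.alpha:
            return self.a, "cpu"   # light load: small batches, low latency
        if queue_len < self.beta:
            return self.b, "cpu"   # queue growing: analyze more per cycle
        if queue_len < self.gamma:
            return self.c, "cpu"   # largest CPU batch before offloading
        return self.c, "gpu"       # sustained backlog: activate the GPU


if __name__ == "__main__":
    lb = LoadBalancer(alpha=64, beta=256, gamma=1024, a=32, b=128, c=512)
    for qlen in (10, 100, 500, 5000):
        print(qlen, lb.decide(qlen))
```

Keeping the GPU off at low load is what saves both latency (13 μsecs on the CPU versus 640 μsecs through the GPU, per the slide) and power.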

Slide16

Optimization 3: Batched Processing

Huge per-packet processing overhead
  > 10 million packets per second for small-sized packets at 10 Gbps
  Reduces overall processing throughput
Function call batching
  Reads group of packets from RX queues at once
  Passes the batch of packets to each function

Per-packet: Decode(p) -> Preprocess(p) -> Multistring_match(p)
Batched:    Decode(list-p) -> Preprocess(list-p) -> Multistring_match(list-p)
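A small sketch of the two ideas on this slide: the packet-rate arithmetic behind "> 10 million packets per second", and how batching cuts the number of function calls. The stage bodies are trivial stand-ins, not the real Kargus functions, and the 20 bytes of per-frame overhead assume standard Ethernet preamble and inter-frame gap.

```python
from typing import Callable, List, Sequence, Tuple

Stage = Callable[[List[bytes]], List[bytes]]


def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Packets per second at line rate, adding 20 bytes of preamble and inter-frame gap."""
    return link_bps / ((frame_bytes + 20) * 8)


# Hypothetical stand-ins for Decode, Preprocess and Multistring_match.
def decode(pkts: List[bytes]) -> List[bytes]:
    return [p[14:] for p in pkts]              # drop a 14-byte Ethernet header


def preprocess(pkts: List[bytes]) -> List[bytes]:
    return [p.lower() for p in pkts]           # normalize case


def multistring_match(pkts: List[bytes]) -> List[bytes]:
    return [p for p in pkts if b"attack" in p]  # keep only suspicious payloads


STAGES: Sequence[Stage] = (decode, preprocess, multistring_match)


def run_pipeline(packets: Sequence[bytes], stages: Sequence[Stage],
                 batch_size: int) -> Tuple[List[bytes], int]:
    """Run all stages over the packets; batch_size=1 is per-packet processing."""
    results: List[bytes] = []
    calls = 0
    for start in range(0, len(packets), batch_size):
        batch = list(packets[start:start + batch_size])
        for stage in stages:
            batch = stage(batch)
            calls += 1
        results.extend(batch)
    return results, calls


if __name__ == "__main__":
    print(int(line_rate_pps(10e9, 64)))  # 64-byte frames at 10 Gbps
    packets = [b"\x00" * 14 + b"GET /Attack", b"\x00" * 14 + b"GET /index"] * 32
    print(run_pipeline(packets, STAGES, 1)[1], run_pipeline(packets, STAGES, 32)[1])
```

With 64 packets and three stages, per-packet processing makes 192 stage calls while a batch size of 32 makes 6, with identical results; the fixed cost of each call is amortized over the batch.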

2x faster processing rate

Slide17

Kargus Specifications

[Figure: two NUMA nodes (NUMA node 1, NUMA node 2) with the components below]

Intel X5680 3.33 GHz (hexacore), 12 MB L3 NUMA-shared cache
12 GB DRAM (3 GB x 4)
Intel 82599 10 Gigabit Ethernet Adapter (dual port)
NVIDIA GTX 580 GPU

Component prices shown: $1,210, $512, $370, $100
Total cost (incl. serverboard) = ~$7,000

Slide18

IDS Benchmarking Tool

Generates packets at line rate (40 Gbps)
  Random TCP packets (innocent)
  Attack packets are generated by attack rule-set
Supports packet replay using PCAP files
Useful for performance evaluation

Slide19

Kargus Performance Evaluation

Micro-benchmarks
Input traffic rate: 40 Gbps
Evaluate Kargus (~3,000 HTTP rules) against:
  Kargus-CPU-only (12 engines)
  Snort with PF_RING
  MIDeA*
Refer to the paper for more results

* G. Vasiliadis et al., “MIDeA: a multi-parallel intrusion detection architecture”, ACM CCS ‘11

Slide20

Innocent Traffic Performance

[Figure: actual payload analyzing bandwidth]

2.7-4.5x faster than Snort
1.9-4.3x faster than MIDeA

Slide21

Malicious Traffic Performance

5x faster than Snort

Slide22

Real Network Traffic

Three 10 Gbps LTE backbone traces of a major ISP in Korea:
  Time duration of each trace: 30 mins to 1 hour
  TCP/IPv4 traffic:
    84 GB of PCAP traces
    109.3 million packets
    845K TCP sessions
Total analyzing rate: 25.2 Gbps
  Bottleneck: Flow Management (preprocessing)

Slide23

Effects of Dynamic GPU Load Balancing

Varying incoming traffic rates
  Packet size = 1518 B

[Figure: power consumption (Watts) vs. offered incoming traffic (Gbps), packet size 1518 B; annotated differences of 8.7% and 20%]

Slide24

Conclusion

Software-based NIDS:
  Based on commodity hardware
  Competes with hardware-based counterparts
    5x faster than previous S/W-based NIDS
  Power efficient
  Cost effective

> 25 Gbps (real traffic)
> 33 Gbps (synthetic traffic)
~US$ 7,000

Slide25

Thank You

fast-ids@list.ndsl.kaist.edu

https://shader.kaist.edu/kargus/

Slide26

Backup Slides

Slide27


Kargus vs. MIDeA*

Packet acquisition
  MIDeA: PF_RING
  Kargus: PacketShader I/O
  Outcome: 70% lower CPU utilization
Detection engine
  MIDeA: GPU-support for Aho-Corasick
  Kargus: GPU-support for Aho-Corasick & PCRE
  Outcome: 65% faster detection rate
Architecture
  MIDeA: Process-based
  Kargus: Thread-based
  Outcome: 1/6 GPU memory usage
Batch processing
  MIDeA: Batching only for detection engine (GPU)
  Kargus: Batching from packet acquisition to output
  Outcome: 1.9x higher throughput
Power-efficient
  MIDeA: Always GPU (does not offload only when packet size is too small)
  Kargus: Opportunistic offloading to GPUs (ingress traffic rate)
  Outcome: 15% power saving

* G. Vasiliadis, M. Polychronakis, and S. Ioannidis, “MIDeA: a multi-parallel intrusion detection architecture”, ACM CCS 2011

Slide33

Receive-Side Scaling (RSS)

RSS uses Toeplitz hash function (with a random secret key)

Algorithm: RSS Hash Computation

function ComputeRSSHash(Input[], RSK)
    ret = 0;
    for each bit b in Input[] do
        if b == 1 then
            ret ^= (left-most 32 bits of RSK);
        endif
        shift RSK left 1 bit position;
    end for
    return ret;
end function

Slide34

Symmetric Receive-Side Scaling

Update RSK (Shinae et al.)

Original RSK:
0x6d5a 0x56da 0x255b 0x0ec2 0x4167 0x253d 0x43a3 0x8fb0 0xd0ca 0x2bcb
0xae7b 0x30b4 0x77cb 0x2d3a 0x8030 0xf20c 0x6a42 0xb73b 0xbeac 0x01fa

Updated (symmetric) RSK:
0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a
0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a

Slide35

Why use a GPU?

Xeon X5680: 6 cores
GTX 580: 512 cores

[Figure: CPU layout (Control, Cache, a few ALUs) vs. GPU layout (many ALUs)]

* Slide adapted from NVIDIA CUDA C Programming Guide Version 4.2 (Figure 1-2)

Slide36

GPU Microbenchmarks – Aho-Corasick

[Figure: Aho-Corasick bandwidth, 2.15 Gbps vs. 39 Gbps]

Slide37

GPU Microbenchmarks – PCRE

[Figure: PCRE bandwidth, 0.52 Gbps vs. 8.9 Gbps]

Slide38

Effects of NUMA-aware Data Placement

Minimal use of global variables
  Avoids compulsory cache misses
  Eliminates cross-NUMA cache bouncing effects

[Figure: performance speedup vs. packet size (bytes), up to 1518]

Slide39

CPU-only analysis for small-sized packets

Offloading small-sized packets to the GPU is expensive
  Contention across page-locked DMA accessible memory with GPU
  GPU operational cost of packet metadata increases

[Figure annotation: 82]

Slide40

Challenge 1: Packet Acquisition

Default packet module: Packet CAPture (PCAP) library
  Unsuitable for multi-core environment
  Low performance

Slide41

Solution: PacketShader* I/O
