Kargus: A Highly-scalable Software-based Intrusion Detection System

M. Asim Jamshed*, Jihyung Lee†, Sangwoo Moon†, Insu Yun*, Deokjin Kim‡, Sungryoul Lee‡, Yung Yi†, KyoungSoo Park*

* Networked & Distributed Computing Systems Lab, KAIST
† Laboratory of Network Architecture Design & Analysis, KAIST
‡ Cyber R&D Division, NSRI
Network Intrusion Detection Systems (NIDS)

Detect known malicious activities
- Port scans, SQL injections, buffer overflows, etc.
Deep packet inspection
- Detect malicious signatures (rules) in each packet
Desirable features
- High performance (> 10 Gbps) with precision
- Easy maintenance
- Frequent ruleset updates

[Figure: an NIDS inspecting traffic between the Internet and the protected network for attacks]
Hardware vs. Software

H/W-based NIDS
- Specialized hardware: ASIC, TCAM, etc.
- High performance
- Expensive, with annual servicing costs
- Low flexibility
- e.g., IDS/IPS Sensors (10s of Gbps): ~US$20,000-60,000; IDS/IPS M8000 (10s of Gbps): ~US$10,000-24,000

S/W-based NIDS
- Commodity machines
- High flexibility
- Low performance: DDoS/packet drops
- e.g., open-source S/W: ≤ ~2 Gbps
Goals

S/W-based NIDS
- Commodity machines
- High flexibility
- High performance
Typical Signature-based NIDS Architecture

Packet Acquisition → Preprocessing (decode, flow management, reassembly) → Multi-string Pattern Matching → Rule Options Evaluation → Output

- Multi-string match failure: innocent flow
- Rule option evaluation failure: innocent flow
- Evaluation success: malicious flow
- Bottlenecks: multi-string pattern matching and rule options evaluation

Example rule:

alert tcp $EXTERNAL_NET any -> $HTTP_SERVERS 80 (msg:"possible attack attempt BACKDOOR optix runtime detection"; content:"/whitepages/page_me/100.html"; pcre:"/body=\x2521\x2521\x2521Optix\s+Pro\s+v\d+\x252E\d+\S+sErver\s+Online\x2521\x2521\x2521/")

* PCRE: Perl Compatible Regular Expression
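The two-stage flow above can be sketched in a few lines — a minimal, hypothetical Python rendering of one rule's evaluation (the real engine uses Snort's rule parser and Aho-Corasick over all rules at once; the `!!!` literals are a simplified stand-in for the rule's escaped `\x2521` sequences):

```python
import re

# Hypothetical single-rule table, modeled on the BACKDOOR Optix rule above.
# The pcre is a simplified rendering of the original escaped pattern.
RULE = {
    "content": b"/whitepages/page_me/100.html",
    "pcre": re.compile(rb"body=!!!Optix\s+Pro\s+v\d+\.\d+\S+sErver\s+Online!!!"),
}

def inspect(payload: bytes) -> bool:
    """Return True if the payload matches both stages of the rule."""
    # Stage 1: cheap multi-string match (Aho-Corasick in a real NIDS).
    if RULE["content"] not in payload:
        return False  # match failure: innocent flow, no PCRE cost paid
    # Stage 2: expensive PCRE evaluation, run only on stage-1 survivors.
    return RULE["pcre"].search(payload) is not None
```

Most traffic exits at stage 1, which is why the multi-string phase dominates the profile and why the PCRE phase is only run when a rule's `content` strings already matched.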
Contributions

A highly-scalable software-based NIDS for high-speed networks

Goal: turn a slow software NIDS into a fast software NIDS
- Bottleneck: inefficient packet acquisition → Solution: multi-core packet acquisition
- Bottleneck: expensive string & PCRE pattern matching → Solution: parallel processing & GPU offloading

Outcome: fastest S/W signature-based IDS
- 33 Gbps (synthetic innocent traffic)
- 10 Gbps (100% malicious traffic)
- ~24 Gbps (real network traffic)
Challenge 1: Packet Acquisition

Default packet module: Packet CAPture (PCAP) library
- Unsuitable for multi-core environments
- Low performance
- Higher power consumption
- A multi-core packet capture library is required

[Figure: cores 1-11 reading from four 10 Gbps NICs (A-D)]
- Packet RX bandwidth*: 0.4-6.7 Gbps
- CPU utilization: 100%

* Intel Xeon X5680, 3.33 GHz, 12 MB L3 Cache
Solution: PacketShader I/O*

PacketShader I/O
- Uniformly distributes packets based on flow information via RSS hashing (source/destination IP addresses, port numbers, protocol ID)
- One core can read packets from the RSS queues of multiple NICs
- Reads packets in batches (32 ~ 4096)

Symmetric Receive-Side Scaling (RSS)
- Passes packets of one connection to the same queue

[Figure: cores 1-5 each reading their own RX queues (RxQ A1-A5, B1-B5) on 10 Gbps NICs A and B]
- Packet RX bandwidth: 0.4-6.7 Gbps → 40 Gbps
- CPU utilization: 100% → 16-29%

* S. Han et al., "PacketShader: a GPU-accelerated software router", ACM SIGCOMM 2010
Challenge 2: Pattern Matching

CPU-intensive tasks for serial packet scanning

Major bottlenecks
- Multi-string matching (Aho-Corasick phase)
- PCRE evaluation (if a 'pcre' rule option exists in the rule)

On an Intel Xeon X5680, 3.33 GHz, 12 MB L3 Cache:
- Aho-Corasick analyzing bandwidth per core: 2.15 Gbps
- PCRE analyzing bandwidth per core: 0.52 Gbps
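The Aho-Corasick phase above can be sketched compactly — one linear pass over a payload matches every string pattern simultaneously. This is an illustrative Python version (Kargus runs a DFA-converted table in C and CUDA, not this code):

```python
from collections import deque

def build_automaton(patterns):
    """Build Aho-Corasick goto/fail/output tables from string patterns."""
    goto = [{}]    # state -> {char: next state}
    fail = [0]     # state -> failure-link state
    out = [set()]  # state -> patterns ending at this state
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)
    # BFS from the root to fill failure links, merging outputs along them.
    q = deque(goto[0].values())
    while q:
        r = q.popleft()
        for ch, s in goto[r].items():
            q.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            out[s] |= out[fail[s]]
    return goto, fail, out

def search(text, automaton):
    """Return (start_index, pattern) for every pattern occurrence in text."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

The cost per byte is constant regardless of how many patterns are loaded, which is what makes it the standard prefilter for thousands of rules — and also why it dominates the CPU profile and is worth offloading.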
Solution: GPU for Pattern Matching

GPUs
- Contain 100s of SIMD processors (512 cores on an NVIDIA GTX 580)
- Ideal for parallel data processing without branches

DFA-based pattern matching on GPUs
- Multi-string matching using the Aho-Corasick algorithm
- PCRE matching

Pipelined execution in CPU/GPU
- Concurrent copy and execution

[Figure: each engine thread (packet acquisition → preprocess → multi-string matching → rule option evaluation) offloads work via multi-string and PCRE matching queues to a GPU dispatcher thread]
- Aho-Corasick bandwidth: 2.15 Gbps → 39 Gbps
- PCRE bandwidth: 0.52 Gbps → 8.9 Gbps
Optimization 1: IDS Architecture

How to best utilize the multi-core architecture?
- Pattern matching is the eventual bottleneck
- Run the entire engine on each core

GNU gprof profiling results*:

Function                 | Time % | Module
acsmSearchSparseDFA_Full | 51.56  | multi-string matching
List_GetNextState        | 13.91  | multi-string matching
mSearch                  |  9.18  | multi-string matching
in_chksum_tcp            |  2.63  | preprocessing
Solution: Single-process Multi-thread

Runs multiple IDS engine threads & GPU dispatcher threads concurrently
- Shared address space
- Less GPU memory consumption (1/6 the GPU memory usage)
- Higher GPU utilization & shorter service latency

[Figure: five engine threads, one pinned per core (cores 1-5), each running packet acquisition → preprocess → multi-string matching → rule option evaluation; a single GPU dispatcher thread pinned at core 6]
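The thread layout can be sketched with standard threading primitives — several engine threads share one offload queue feeding a single dispatcher, so only one copy of GPU-side state is needed. This is an illustrative Python sketch (names like `engine` and `dispatcher` are ours; the real system pins pthreads to cores and submits batched CUDA work):

```python
import queue
import threading

offload_q = queue.Queue()  # shared address space: one queue, one GPU context
results = queue.Queue()

def engine(core_id, packets):
    """Per-core engine thread: CPU-side stages, then offload matching work."""
    for p in packets:
        # ... acquisition / preprocessing would happen here ...
        offload_q.put((core_id, p))

def dispatcher(n_items):
    """Single GPU dispatcher thread: drains all engines' offloaded work."""
    for _ in range(n_items):
        core_id, p = offload_q.get()
        results.put((core_id, p.upper()))  # stand-in for GPU pattern matching

# Five engine threads (cores 1-5 in the figure) feed one dispatcher (core 6).
engines = [threading.Thread(target=engine,
                            args=(i, [f"pkt{i}-{j}" for j in range(3)]))
           for i in range(1, 6)]
disp = threading.Thread(target=dispatcher, args=(15,))
for t in engines:
    t.start()
disp.start()
for t in engines + [disp]:
    t.join()
```

Because all threads live in one process, the dispatcher sees every engine's work without any cross-process copies — in contrast to a process-per-core design, where each process would need its own GPU buffers.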
Architecture

- Non-Uniform Memory Access (NUMA)-aware
- Core framework as deployed in a dual hexa-core system
- Can be configured for various NUMA setups accordingly

▲ Kargus configuration on a dual-NUMA hexa-core machine with 4 NICs and 2 GPUs
Optimization 2: GPU Usage

Caveats of always using the GPU:
- Long per-packet processing latency (buffering in the GPU dispatcher)
- More power consumption (NVIDIA GTX 580: 512 cores)

Use:
- CPU when the ingress rate is low (idle GPU)
- GPU when the ingress rate is high
Solution: Dynamic Load Balancing

Load balancing between CPU & GPU, driven by the length of the internal packet queue (per engine):
- Reads packets from NIC queues each cycle
- The CPU analyzes a smaller number of packets per cycle (batch sizes a < b < c)
- Increases the analyzing rate (a → b → c) as the queue length grows past thresholds α, β, γ
- Activates the GPU if the queue length keeps increasing

Packet latency: ~640 μsecs with GPU offloading vs. ~13 μsecs CPU-only
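The policy above can be sketched as a small lookup from queue length to (device, batch size). The thresholds and batch sizes here are hypothetical placeholders for the slide's α < β < γ and a < b < c, which are tuned per deployment:

```python
# Hypothetical thresholds (queue lengths) and per-cycle batch sizes.
ALPHA, BETA, GAMMA = 1024, 4096, 16384  # alpha < beta < gamma
A, B, C = 32, 256, 2048                 # batch sizes, a < b < c

def schedule(queue_len):
    """Map the internal packet-queue length to (device, batch size)."""
    if queue_len < ALPHA:
        return ("cpu", A)  # light load: small batches, ~13 usec latency
    if queue_len < BETA:
        return ("cpu", B)  # load rising: larger CPU batches
    if queue_len < GAMMA:
        return ("cpu", C)  # near capacity: biggest CPU batches
    return ("gpu", C)      # sustained load: offload (~640 usec latency)
```

The design choice: latency stays low while the CPU can keep up, and throughput (at the cost of buffering latency and power) is bought only when the queue shows the CPU is falling behind.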
Optimization 3: Batched Processing

Huge per-packet processing overhead
- > 10 million packets per second for small-sized packets at 10 Gbps
- Reduces overall processing throughput

Function call batching
- Reads a group of packets from the RX queues at once
- Passes the batch of packets to each function:
  Decode(p); Preprocess(p); Multistring_match(p)  →  Decode(list-p); Preprocess(list-p); Multistring_match(list-p)
- 2x faster processing rate
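The transformation can be illustrated with toy stage functions — each stage is invoked once per batch instead of once per packet, amortizing per-call overhead (the 2x figure is the slide's measurement of the real C pipeline, not of this sketch; the stage bodies here are invented placeholders):

```python
# Toy per-packet stages (stand-ins for Snort-style decode/preprocess/match).
def decode(p):            return p[14:]           # strip a 14-byte Ethernet header
def preprocess(p):        return p.lower()        # e.g., HTTP normalization
def multistring_match(p): return b"attack" in p   # stand-in for Aho-Corasick

# Per-packet pipeline: three function calls for every packet.
def process_each(packets):
    return [multistring_match(preprocess(decode(p))) for p in packets]

# Batched pipeline: three function calls for the whole batch.
def decode_batch(ps):            return [p[14:] for p in ps]
def preprocess_batch(ps):        return [p.lower() for p in ps]
def multistring_match_batch(ps): return [b"attack" in p for p in ps]

def process_batch(packets):
    return multistring_match_batch(preprocess_batch(decode_batch(packets)))
```

In C the win comes from fewer calls plus better instruction-cache locality, since each stage's code runs over many packets before the next stage's code is loaded.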
Kargus Specifications

Per NUMA node (node 1 = node 2):
- Intel X5680 3.33 GHz (hexa-core), 12 MB L3 NUMA-shared cache
- 12 GB DRAM (3 GB x 4)
- Intel 82599 10 Gigabit Ethernet adapter (dual port)
- NVIDIA GTX 580 GPU

Component prices: $1,210 / $512 / $370 / $100
Total cost (incl. server board) = ~$7,000
IDS Benchmarking Tool

- Generates packets at line rate (40 Gbps)
- Random TCP packets (innocent)
- Attack packets generated from the attack ruleset
- Supports packet replay using PCAP files
- Useful for performance evaluation
Kargus Performance Evaluation

Micro-benchmarks
- Input traffic rate: 40 Gbps
- Evaluate Kargus (~3,000 HTTP rules) against:
  - Kargus-CPU-only (12 engines)
  - Snort with PF_RING
  - MIDeA*
- Refer to the paper for more results

* G. Vasiliadis et al., "MIDeA: a multi-parallel intrusion detection architecture", ACM CCS 2011
Innocent Traffic Performance

Actual payload analyzing bandwidth:
- 2.7-4.5x faster than Snort
- 1.9-4.3x faster than MIDeA
Malicious Traffic Performance

- 5x faster than Snort
Real Network Traffic

Three 10 Gbps LTE backbone traces from a major ISP in Korea:
- Time duration of each trace: 30 mins ~ 1 hour
- TCP/IPv4 traffic: 84 GB of PCAP traces, 109.3 million packets, 845K TCP sessions

Total analyzing rate: 25.2 Gbps
Bottleneck: flow management (preprocessing)
Effects of Dynamic GPU Load Balancing

Varying incoming traffic rates (packet size = 1518 B)

[Chart: power consumption (Watts) vs. offered incoming traffic (Gbps); savings of 8.7% and 20% marked]
Conclusion

Software-based NIDS:
- Based on commodity hardware (~US$7,000)
- Competes with hardware-based counterparts: > 25 Gbps (real traffic), > 33 Gbps (synthetic traffic)
- 5x faster than previous S/W-based NIDS
- Power efficient
- Cost effective
Thank You

fast-ids@list.ndsl.kaist.edu
https://shader.kaist.edu/kargus/
Backup Slides
Kargus vs. MIDeA*

Aspect             | MIDeA                                    | Kargus                                     | Outcome
Packet acquisition | PF_RING                                  | PacketShader I/O                           | 70% lower CPU utilization
Detection engine   | GPU support for Aho-Corasick             | GPU support for Aho-Corasick & PCRE        | 65% faster detection rate
Architecture       | Process-based                            | Thread-based                               | 1/6 GPU memory usage
Batch processing   | Batching only for detection engine (GPU) | Batching from packet acquisition to output | 1.9x higher throughput
Power efficiency   | Always GPU (does not offload only when packet size is too small) | Opportunistic offloading to GPUs (by ingress traffic rate) | 15% power saving

* G. Vasiliadis, M. Polychronakis, and S. Ioannidis, "MIDeA: a multi-parallel intrusion detection architecture", ACM CCS 2011
Receive-Side Scaling (RSS)

RSS uses the Toeplitz hash function (with a random secret key, RSK):

Algorithm: RSS Hash Computation
function ComputeRSSHash(Input[], RSK)
    ret = 0
    for each bit b in Input[] do
        if b == 1 then
            ret ^= (left-most 32 bits of RSK)
        end if
        shift RSK left 1 bit position
    end for
    return ret
end function
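The pseudocode above translates directly to Python — a sketch of the Toeplitz computation over a 40-byte RSK (field widths follow the standard RSS definition; this is our illustrative code, not the NIC's implementation):

```python
def rss_hash(data: bytes, rsk: bytes) -> int:
    """Toeplitz hash: XOR in the left-most 32 key bits for every set input
    bit, sliding the key window left one bit per input bit.
    Requires len(data) * 8 <= len(rsk) * 8 - 32 (96 <= 288 for IPv4+ports)."""
    key = int.from_bytes(rsk, "big")
    key_bits = len(rsk) * 8  # 320 bits for a 40-byte RSK
    ret = 0
    for i in range(len(data) * 8):
        if data[i // 8] >> (7 - i % 8) & 1:
            ret ^= (key >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return ret
```

For IPv4 TCP the input is the 12-byte concatenation of source address, destination address, source port, and destination port; the NIC takes the low-order bits of the 32-bit result to pick an RX queue.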
Symmetric Receive-Side Scaling

Update the RSK (Shinae Woo et al.): replace the random 16-bit words of the default key

0x6d5a 0x56da 0x255b 0x0ec2 0x4167 0x253d 0x43a3 0x8fb0 0xd0ca 0x2bcb 0xae7b 0x30b4 0x77cb 0x2d3a 0x8030 0xf20c 0x6a42 0xb73b 0xbeac 0x01fa

with a single 16-bit value repeated across the whole key:

0x6d5a 0x6d5a 0x6d5a ... 0x6d5a (x20)

Since every field of the connection tuple starts on a 16-bit boundary, a key that repeats every 16 bits hashes both directions of a connection to the same value.
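The effect of the repeated key can be checked directly — with every 16-bit word of the RSK equal to 0x6d5a, the hash of (src, dst, sport, dport) equals the hash of (dst, src, dport, sport), so both directions of a TCP connection land in the same queue. A self-contained sketch (`rss_hash` re-implements the Toeplitz function, and the example addresses and ports are arbitrary):

```python
import socket
import struct

def rss_hash(data: bytes, rsk: bytes) -> int:
    """Toeplitz hash as in the RSS pseudocode on the previous slide."""
    key = int.from_bytes(rsk, "big")
    key_bits = len(rsk) * 8
    ret = 0
    for i in range(len(data) * 8):
        if data[i // 8] >> (7 - i % 8) & 1:
            ret ^= (key >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return ret

def tuple_hash(src, dst, sport, dport, rsk):
    """Hash an IPv4 4-tuple in RSS input order: saddr, daddr, sport, dport."""
    data = (socket.inet_aton(src) + socket.inet_aton(dst)
            + struct.pack(">HH", sport, dport))
    return rss_hash(data, rsk)

SYM_RSK = b"\x6d\x5a" * 20  # every 16-bit word identical -> symmetric hashing

h1 = tuple_hash("10.0.0.1", "10.0.0.2", 1234, 80, SYM_RSK)
h2 = tuple_hash("10.0.0.2", "10.0.0.1", 80, 1234, SYM_RSK)
```

The symmetry follows because the key's 32-bit window is unchanged by any 16-bit shift, and swapping source and destination only moves fields by multiples of 16 bits, so each set input bit XORs in exactly the same window either way.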
Why Use a GPU?

Xeon X5680: 6 cores, each with large control logic and cache
GTX 580: 512 cores, mostly simple ALUs

[Figure: CPU (few cores, big control/cache) vs. GPU (many ALUs)]

* Slide adapted from NVIDIA CUDA C Programming Guide Version 4.2 (Figure 1-2)
GPU Microbenchmarks – Aho-Corasick

[Chart: multi-string matching throughput, 2.15 Gbps (single CPU core) vs. 39 Gbps (GPU)]
GPU Microbenchmarks – PCRE

[Chart: PCRE matching throughput, 0.52 Gbps (single CPU core) vs. 8.9 Gbps (GPU)]
Effects of NUMA-aware Data Placement

- Minimal use of global variables
- Avoids compulsory cache misses
- Eliminates cross-NUMA cache-bouncing effects

[Chart: performance speedup vs. packet size (bytes), up to 1518 B]
CPU-only Analysis for Small-sized Packets

Offloading small-sized packets to the GPU is expensive:
- Contention on page-locked, DMA-accessible memory shared with the GPU
- The GPU operational cost of packet metadata increases

[Chart: throughput for small packet sizes, down to 82 B]