Slide 1
Network Requirements for Resource Disaggregation
Peter Gao (Berkeley), Akshay Narayan (MIT), Sagar Karandikar (Berkeley), Joao Carreira (Berkeley), Sangjin Han (Berkeley), Rachit Agarwal (Cornell), Sylvia Ratnasamy (Berkeley), Scott Shenker (Berkeley/ICSI)
Slide 2
Disaggregated Datacenters
Current datacenter: server-centric. Future datacenter: disaggregated?
[Figure: server-centric racks versus a disaggregated design in which resource blades (e.g., GPUs) attach directly to the datacenter network.]
Existing prototypes: HP (The Machine), Intel (RSD), Facebook, Huawei (NUWA), SeaMicro, Berkeley (FireBox)
Slide 3
Disaggregation Benefits (Architecture Community)
- Overcome the memory capacity wall
- Higher resource density
- Simplified hardware design
- Relaxed power and capacity scaling
Slide 4
Network is the key
In a server-centric datacenter, resources inside a server communicate over internal interconnects (QPI, SMI, PCI-e). In a disaggregated datacenter, the datacenter network takes over that role.
Existing prototypes use specialized hardware, such as silicon photonics or PCI-e. Do we need specialized hardware?
Slide 5
- What end-to-end latency and bandwidth must the network provide for legacy applications?
- Do existing transport protocols meet these requirements?
- Do existing OS network stacks meet these requirements?
- Can commodity network hardware meet these requirements?
[Figure: the end-to-end path from Application through OS, Transport, NIC, and Switch to the Remote Resource.]
Answers, in brief:
- Commodity hardware solutions may be sufficient. ✔
- Current OS and network stacks are not, but feasible solutions exist. ✘
- The measured numbers represent worst-case performance degradation.
Slide 6
Assumptions
- CPU: limited cache coherence domain; a small amount of local cache (how much?)
- Memory: page-level remote memory access
- Storage: block-level distributed data placement
- Scale: rack-scale? datacenter-scale?
All resources communicate over the datacenter network.
Slide 7
Methodology: Workload Driven
Goal: derive latency and bandwidth requirements from real applications.
- 10 workloads on 8 applications
- ~125 GB input data
- 5 m3.2xlarge EC2 nodes, Virtual Private Cloud enabled
Workloads and applications:
- Batch processing workloads: Wordcount, Sort, Pagerank, Collaborative Filtering (on Spark, Hadoop, Timely Dataflow, Graphlab)
- Interactive workloads: Key-value Store, SQL, Streaming (on Memcached, HERD, Spark SQL, Spark Streaming)
Slide 8
Disaggregated Datacenter Emulator
- Partition each machine's memory into local RAM and emulated remote RAM, backed by the machine's own memory.
- Local RAM is free to access; remote RAM is reached via a special swap device that handles page faults.
- The swap device injects latency and bandwidth constraints on each remote access:
  delay = latency + request size / bandwidth
- This is akin to a dedicated link between CPU and remote memory.
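The injected delay is easy to reproduce. Below is a minimal sketch of the delay model in Python, assuming illustrative parameter values (a 4KB page, 3us latency, 40Gbps); the authors' actual implementation is an in-kernel swap device, not user-level Python.

    # Minimal sketch of the emulator's injected delay model:
    #   delay = latency + request_size / bandwidth
    # Parameter values are illustrative, not taken from the authors' code.

    PAGE_SIZE_BYTES = 4096       # page-level remote memory access
    LINK_LATENCY_US = 3.0        # emulated end-to-end latency, microseconds
    LINK_BANDWIDTH_GBPS = 40.0   # emulated link bandwidth

    def injected_delay_us(request_bytes: int) -> float:
        """Delay stalled into a remote-memory access, in microseconds."""
        # Gbps * 1e3 converts bandwidth to bits per microsecond.
        transmission_us = (request_bytes * 8) / (LINK_BANDWIDTH_GBPS * 1e3)
        return LINK_LATENCY_US + transmission_us

    # A 4KB page at 40Gbps adds ~0.82us of transmission on top of latency.
    print(f"{injected_delay_us(PAGE_SIZE_BYTES):.2f}us per page fault")  # ~3.82us

The size-dependent term is why bandwidth, and not just latency, matters for page-granularity remote access.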
Slide 9
Latency and Bandwidth Requirements
[Figure: per-workload performance degradation as injected latency varies over 1us, 5us, and 10us and bandwidth over 10Gbps, 40Gbps, and 100Gbps; the 5% degradation level is marked. Note: delay = latency + request size / bandwidth.]
Takeaway: ~3us latency and 40Gbps bandwidth are enough, ignoring queueing delay.
Slide 10
Understanding Performance Degradation
[Figure: degradation per workload: Spark Streaming Wordcount, Memcached YCSB, Graphlab CF, Hadoop Sort, Hadoop Wordcount, Timely Pagerank, HERD YCSB, SparkSQL BDB, Spark Sort, Spark Wordcount.]
Takeaway: performance degradation is correlated with an application's memory bandwidth usage.
Slide 11
[Figure: the end-to-end path (Application, OS, Transport, NIC, Switch, Remote Resource), annotated with the requirement so far: 3us end-to-end latency over a 40Gbps dedicated link (no queueing delay).]
Slide 12
Transport Simulation Setting
Pipeline: instrument the special swap device to collect a flow trace; feed the trace to a network simulator; obtain the flow completion time distribution.
Takeaway: need new transport protocols.
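As a concrete illustration of the last step, the sketch below derives per-flow slowdowns by comparing simulated flow completion times (FCTs) against the ideal FCT on a dedicated link. This is a minimal Python sketch; the (size, FCT) trace format and names are hypothetical, not the paper's actual tooling.

    # Sketch: per-flow slowdowns from a simulated flow-completion-time (FCT)
    # distribution. The (size_bytes, simulated_fct_us) pair format is hypothetical.

    LINK_LATENCY_US = 3.0        # end-to-end latency target, microseconds
    LINK_BANDWIDTH_GBPS = 40.0   # dedicated-link bandwidth

    def ideal_fct_us(size_bytes: int) -> float:
        """FCT on an idealized dedicated link: latency + size / bandwidth."""
        return LINK_LATENCY_US + (size_bytes * 8) / (LINK_BANDWIDTH_GBPS * 1e3)

    def slowdowns(flows):
        """flows: iterable of (size_bytes, simulated_fct_us) from the simulator."""
        return [fct / ideal_fct_us(size) for size, fct in flows]

    # Example: a 4KB page flow that finishes in 7.6us under congestion has a
    # ~2x slowdown relative to its ~3.8us ideal completion time.
    print(slowdowns([(4096, 7.6)]))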
Slide 13
Application Performance Degradation
[Figure: degradation under a 40Gbps network and a 100Gbps network, comparing a dedicated link (no queueing delay) against simulated queueing at rack scale and datacenter scale; the ~5% degradation level is marked.]
Takeaway: with a 100Gbps network, datacenter scale works for some applications, rack scale for others.
Slide 14
[Figure: the end-to-end path again, with the requirements updated: 3us end-to-end latency, a 40Gbps dedicated link, an efficient transport protocol, and a 100Gbps network.]
Slide 15
Is 100Gbps/3us achievable?
Slide 16
Feasibility of end-to-end latency within a rack
Baseline latency budget (numbers estimated optimistically based on existing hardware):
- Propagation: 0.32us
- Transmission: 0.8us
- Switching: 2us
- Data copying: 2us
- OS: 1.9us
Total: ~7us, well above the 3us target.

Slide 17
Feasibility of end-to-end latency within a rack (cont.)
A cut-through switch cuts switching from 2us to 0.48us.

Slide 18
Feasibility of end-to-end latency within a rack (cont.)
CPU-NIC integration cuts data copying from 2us to 1us.

Slide 19
Feasibility of end-to-end latency within a rack (cont.)
RDMA eliminates the 1.9us OS overhead.
Remaining: 0.32us + 0.8us + 0.48us + 1us, roughly 2.6us, within the 3us target.
Is it feasible to meet the target across the datacenter?
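The progression across slides 16 through 19 is simple arithmetic over the per-component estimates; the Python sketch below reproduces it. All values are the slides' optimistic estimates, not measurements, and the baseline data-copying figure (2us) is inferred from the NIC-integration step.

    # Sketch: the intra-rack latency budget from slides 16-19, in microseconds.
    # Values are the slides' optimistic estimates; the baseline data-copying
    # figure (2us) is inferred from the NIC-integration step.

    TARGET_US = 3.0
    budget = {
        "propagation":  0.32,
        "transmission": 0.8,
        "switching":    2.0,   # store-and-forward baseline
        "data_copying": 2.0,
        "os":           1.9,
    }

    print(f"baseline: {sum(budget.values()):.2f}us")   # ~7us, above the 3us target

    budget["switching"] = 0.48     # cut-through switch
    budget["data_copying"] = 1.0   # CPU-NIC integration
    budget["os"] = 0.0             # RDMA (kernel bypass)

    print(f"optimized: {sum(budget.values()):.2f}us")  # 2.60us
    print("meets 3us target:", sum(budget.values()) <= TARGET_US)  # True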
Slide 20
[Figure: the end-to-end path, with each requirement mapped to an existing or emerging solution:]
- Efficient transport: pFabric (SIGCOMM'13), pHost (CoNEXT'15)
- 100Gbps network: available
- Kernel bypass: RDMA is common
- CPU-NIC integration: coming soon
- Cut-through switches: common(?)
- 100Gbps links: available
What’s next?17
Please refer our paper for evaluations on improving application performance in disaggregated datacenters
Application Design
Rethinking OS Stack
Storage
Network Stack
Failure Models
Network Fabric DesignSlide22
Slide 22
Thank You!
Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Sylvia Ratnasamy, Scott Shenker, Rachit Agarwal