Networks ANCS 2010 Rohit Sunkam Ramanujam Bill Lin Electrical and Computer Engineering University of California San Diego Networks onChip Chipmultiprocessors CMPs increasingly popular ID: 757105
Slide1
Destination-Based Adaptive Routing for 2D Mesh Networks
ANCS 2010
Rohit Sunkam Ramanujam
Bill Lin
Electrical and Computer Engineering
University of California, San Diego
Slide2
Networks-on-Chip
Chip-multiprocessors (CMPs) increasingly popular
2D-mesh networks often used as on-chip fabric
Routing algorithm central in determining performance
Tilera Tile64
Intel 48-core data center on die (ISSCC 2010)
Slide3
Classes of Routing Algorithms
Oblivious routing
Simple and fast router designs
Poor load balancing under bursty traffic
Adaptive routing
Better performance (throughput, latency)
Better fault tolerance
Higher router complexity
Slide4
Related Work
Oblivious routing [Valiant, ROMM, O1TURN, Optimal oblivious routing]
Optimize for worst-case and average-case performance
Adaptive routing commercially used in multiprocessors from IBM, Cray, Compaq
On-chip routing very different from off-chip:
Lower power
Lower area
Lower router complexity
Slide5
Outline
Introduction
Motivation
Destination-Based Adaptive Routing (DAR)
Evaluation
Slide6
Minimal Adaptive Routing Model
Adaptive routing along minimal directions
[Figure: mesh with source S and destination D]
Slide7
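The routing model on the preceding slide (adaptive routing along minimal directions) can be sketched as a function that lists the output ports moving a packet closer to its destination. This is only an illustration: the name `minimal_ports`, the (x, y) addressing, and the convention that x grows eastward and y grows northward are assumptions, not details from the slides.

```python
def minimal_ports(cur, dst):
    """Return the output ports that move a packet closer to dst.

    cur and dst are (x, y) tuples; x grows to the east and y to the north
    (an assumed convention). "Ej" is the ejection port used on arrival.
    """
    (cx, cy), (dx, dy) = cur, dst
    ports = []
    if dx > cx:
        ports.append("E")
    elif dx < cx:
        ports.append("W")
    if dy > cy:
        ports.append("N")
    elif dy < cy:
        ports.append("S")
    return ports or ["Ej"]
```

At any hop there are at most two candidate ports, so the adaptive decision is a choice (or a split) between two outputs.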
Granularity of Congestion Estimation
Coarse to fine:
Local congestion
Slide8
Local Congestion
Local adaptive
Measure local congestion metric (free VCs, free buffers)
[Figure: paths from S to D through regions of low, moderate, and high congestion, comparing the optimal path with the local adaptive path]
Slide9
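A minimal sketch of the local adaptive policy from the preceding slide, assuming the congestion metric is the number of free virtual channels at the next hop. The function name and the dictionary layout are hypothetical.

```python
def pick_local_adaptive(candidates, free_vcs):
    """Pick the candidate port whose next hop has the most free VCs.

    candidates: list of minimal output ports, e.g. ["N", "E"]
    free_vcs:   dict mapping port -> number of free virtual channels
    Ties go to the first candidate listed.
    """
    return max(candidates, key=lambda p: free_vcs[p])
```

The decision only sees the next hop, so congestion further along the path does not influence it, which is consistent with the gap between the local adaptive and optimal paths in the figure.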
Granularity of Congestion Estimation
Coarse to fine:
Local congestion
Dimension-based congestion
Slide10
Dimension-based Congestion
RCA-1D (Gratz et al., HPCA '08)
Exponential moving average of congestion to all nodes along a dimension
[Figure: paths from S to D through regions of low, moderate, and high congestion, comparing the optimal path with the RCA-1D path]
Slide11
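To illustrate what an exponential moving average along a dimension means for RCA-1D on the preceding slide, the sketch below folds per-node congestion values from the far end of a row toward the current node. The equal 0.5 weighting and the function name are assumptions of this sketch; the slides do not give the weights.

```python
def rca_1d_estimate(congestion_along_dim, weight=0.5):
    """Combine congestion values of the nodes along one dimension.

    congestion_along_dim: values ordered from the nearest node to the
    farthest node in that direction.
    weight: share given to the nearer node at each step (assumed 0.5).
    Folding from the far end means a node k hops away contributes with a
    factor that shrinks geometrically in k.
    """
    estimate = 0.0
    for local in reversed(congestion_along_dim):
        estimate = weight * local + (1.0 - weight) * estimate
    return estimate
```

With these weights a congestion value of 8 at the nearest node yields an estimate of 4.0, while the same value three hops away yields 1.0, so distant congestion is heavily discounted.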
Granularity of Congestion Estimation
Coarse to fine:
Local congestion
Dimension-based congestion
Quadrant-based congestion
Slide12
Quadrant-based Congestion
RCA-Quadrant (Gratz et al., HPCA '08)
Exponential moving average of congestion to all nodes in the destination quadrant
[Figure: path from S to D through regions of low, moderate, and high congestion, showing the optimal path]
Slide13
Quadrant-based Congestion
RCA-Quadrant (Gratz et al., HPCA '08)
Exponential moving average of congestion to all nodes in the destination quadrant
[Figure: path from S to D through regions of low, moderate, and high congestion, showing the optimal path]
Slide14
Quadrant-based Congestion
RCA-Quadrant (Gratz et al., HPCA '08)
Exponential moving average of congestion to all nodes in the destination quadrant
[Figure: paths from S to D through regions of low, moderate, and high congestion, comparing the optimal path with the RCA-quad path]
Slide15
Granularity of Congestion Estimation
Coarse to fine:
Local congestion
Dimension-based congestion
Quadrant-based congestion
Destination-based congestion
Slide16
Ideally …
On a per-destination basis:
Estimate end-to-end delay along all minimal paths to destination
Choose path with least delay
[Figure: path from S to D through regions of low, moderate, and high congestion, showing the optimal path]
Slide17
Challenges
Limited bandwidth for congestion updates
Congestion notification not instantaneous
Limited storage in on-chip routers
Exponential number of paths to each destination
Limited hardware resources for computations
How can we practically emulate ideal adaptive routing?
Slide18
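On the "exponential number of paths" challenge: the number of distinct minimal paths between two mesh nodes is a binomial coefficient, since a path is just an ordering of the hops in each dimension. The helper below (its name is illustrative) makes the growth concrete.

```python
from math import comb


def minimal_path_count(dx, dy):
    """Number of distinct minimal paths between two mesh nodes that are
    dx hops apart in one dimension and dy hops apart in the other."""
    return comb(dx + dy, dx)
```

Corner to corner in an 8x8 mesh (7 + 7 hops) there are 3432 minimal paths; in a 16x16 mesh (15 + 15 hops) there are 155117520, which is why tracking delay per path is impractical.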
Destination-based adaptive routing (DAR)
A node estimates delay to all other nodes through candidate outputs every T cycles
[Figure: source S and destination D, with delay estimates through the two candidate ports]
L[N][D] = 20
L[E][D] = 30
Slide19
DAR - High Level
Traffic distribution to output ports controlled using per-destination split ratios W
Start with initial set of split ratios
Estimate delay to destination through candidate outputs
Shift traffic from more congested port to less congested port
[Figure: source S and destination D]
W[N][D] = 0.6
W[E][D] = 0.4
L[N][D] = 20
L[E][D] = 30
Slide20
DAR - High Level
Traffic distribution to output ports controlled using per-destination split ratios W
Start with initial set of split ratios
Estimate delay to destination through candidate outputs
Shift traffic from more congested port to less congested port
[Figure: source S and destination D]
W[N][D] = 0.8
W[E][D] = 0.2
L[N][D] = 20
L[E][D] = 30
Slide21
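One way to picture how per-destination split ratios control the traffic distribution is a weighted choice among the candidate ports. The slides do not say how the ratios are enforced in hardware, so the random selection below, the function name, and the `rng` parameter are all assumptions of this sketch.

```python
import random


def choose_port(split_ratios, rng=random):
    """Pick an output port with probability equal to its split ratio.

    split_ratios: dict port -> W[port][dest] for one destination; the
    values are expected to sum to 1.
    rng: any object with a random() method returning a float in [0, 1).
    """
    r = rng.random()
    cumulative = 0.0
    port = None
    for port, w in split_ratios.items():
        cumulative += w
        if r < cumulative:
            return port
    return port  # fallback if rounding leaves the sum just under 1
```

With W[N][D] = 0.8 and W[E][D] = 0.2, roughly four out of five packets for D leave through the north port over a long run.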
Outline
Introduction
Motivation
Destination-Based Adaptive Routing (DAR)
Distributed delay measurement
Split ratio adaptation
Scaling
Evaluation
Slide22
Distributed Delay Measurement
A node maintains:
Per-destination traffic split ratio through candidate output ports: W[p][j]
Delay to next-hop router/ejection interface through each output port (N, S, E, W, Ej): l[p]
Slide23
Distributed Delay Measurement
Every node estimates average delay to all other nodes in the network
[Figure: 4x4 mesh with nodes numbered 0 to 15; node 10 sends Avg_10[10] to each of its four neighbors]
Delay from 10 to itself: Avg_10[10] = l_10[Ej]
Avg_10[10] propagated to neighbors
Nodes 6, 9, 14, 11 add local delay to Avg_10[10] to compute delay to node 10
For example, at node 9:
L[E][10] = l[E] + Avg_10[10]
Avg_9[10] = L[E][10]
Slide24
Distributed Delay Measurement
Every node estimates delay to all other nodes in the network
[Figure: 4x4 mesh with nodes numbered 0 to 15; nodes 6, 9, 11, and 14 send Avg_6[10], Avg_9[10], Avg_11[10], and Avg_14[10] to their upstream neighbors]
Nodes 6, 9, 14, 11 propagate estimated delay to node 10 to upstream neighbors
For example, node 5 receives two delay updates, from nodes 9 and 6:
A[E][10] = Avg_6[10]
A[N][10] = Avg_9[10]
Node 5 adds local link delay to received delay update:
L[E][10] = A[E][10] + l[E]
L[N][10] = A[N][10] + l[N]
Finally, average delay from node 5 to node 10 is computed as:
Avg_5[10] = W[E][10] L[E][10] + W[N][10] L[N][10]
Slide25
Distributed Delay Measurement
Every node estimates delay to all other nodes in the network
[Figure: 4x4 mesh with nodes numbered 0 to 15]
Nodes 6, 9, 14, 11 propagate estimated delay to node 10 to upstream neighbors
For example, node 5 receives two delay updates, from nodes 9 and 6:
A[E][10] = Avg_6[10]
A[N][10] = Avg_9[10]
Node 5 adds local link delay to received delay update:
L[E][10] = A[E][10] + l[E]
L[N][10] = A[N][10] + l[N]
Finally, average delay from node 5 to node 10 is computed as:
Avg_5[10] = W[E][10] L[E][10] + W[N][10] L[N][10]
Slide26
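The node 5 example from the delay measurement slides can be worked through in a few lines. The two functions mirror the slide equations L[p][j] = A[p][j] + l[p] and Avg[j] = sum of W[p][j] L[p][j]; the function names and all numeric values are made up for illustration.

```python
def path_delay(link_delay, downstream_avg):
    """L[p][j] = A[p][j] + l[p]: delay to j through port p."""
    return downstream_avg + link_delay


def average_delay(W, L):
    """Avg[j] = sum over candidate ports p of W[p][j] * L[p][j]."""
    return sum(W[p] * L[p] for p in W)


# Node 5 estimating its delay to node 10 (all numbers below are made up).
A = {"E": 12.0, "N": 20.0}   # updates received from nodes 6 and 9
l = {"E": 4.0, "N": 2.0}     # local delay to the next hop on each port
W = {"E": 0.5, "N": 0.5}     # current split ratios toward node 10

L = {p: path_delay(l[p], A[p]) for p in A}
avg_5_10 = average_delay(W, L)
print(L, avg_5_10)
```

Node 5 would then send `avg_5_10` to its own upstream neighbors, so each node only ever exchanges one value per destination with its neighbors rather than per-path information.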
Outline
Introduction
Motivation
Destination-Based Adaptive Routing (DAR)
Distributed delay measurement
Split ratio adaptation
Scaling
Evaluation
Slide27
Adaptation of Split Ratio
Objective: Equalize delay on candidate output ports
If only one candidate output, split ratio is 1
If two candidate outputs:
Let ph be the port with higher delay to destination j
Let pl be the port with lower delay to destination j
W[ph][j] + W[pl][j] = 1
Δ traffic shifted from ph to pl every T cycles
Δ proportional to (L[ph][j] - L[pl][j]) / L[ph][j]
Slide28
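A minimal sketch of the split ratio update described on the preceding slide. The slide only states that Δ is proportional to the relative delay difference, so the `gain` constant, its default value, and the clamp that keeps ratios non-negative are assumptions of this sketch.

```python
def adapt_split(W, L, gain=0.5):
    """Shift traffic from the slower port to the faster one.

    W: dict port -> split ratio for one destination (one or two ports)
    L: dict port -> measured delay to that destination through the port
    gain: proportionality constant for the shift (hypothetical value)
    Returns a new dict of split ratios that still sums to 1.
    """
    if len(W) == 1:
        return {p: 1.0 for p in W}
    # ph is the port with the higher delay, pl the one with the lower delay.
    ph, pl = sorted(W, key=lambda p: L[p], reverse=True)
    if L[ph] <= 0:
        return dict(W)
    delta = gain * (L[ph] - L[pl]) / L[ph]
    delta = min(delta, W[ph])  # cannot shift more than ph currently carries
    return {ph: W[ph] - delta, pl: W[pl] + delta}
```

For example, `adapt_split({"N": 0.6, "E": 0.4}, {"N": 20.0, "E": 30.0}, gain=0.6)` moves 0.2 of the traffic from E to N. The gain of 0.6 is chosen here purely for the arithmetic and is not a value given in the slides.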
Granularity of Congestion Estimation
Coarse to fine:
Local congestion
Dimension-based congestion
Quadrant-based congestion
Destination-based congestion: does not scale!!
Slide29
Granularity of Congestion Estimation
Coarse to fine:
Local congestion
Dimension-based congestion
Quadrant-based congestion
Destination-based congestion
Scalable Destination-based congestion
Slide30
Outline
Introduction
Motivation
Destination-Based Adaptive Routing (DAR)
Distributed delay measurement
Split ratio adaptation
Scaling
Evaluation
Slide31
Look-ahead Window
[Figure: 2D mesh with numbered nodes illustrating look-ahead windows; labeled nodes S, A, B, C, and P]
Node S maintains delay estimate for MxM window centered at S.
Any node outside window mapped to closest node within window
A packet's look-ahead window shifts as it is routed from source to destination
Slide32
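The rule that a node outside the look-ahead window is mapped to the closest node within it can be sketched as a per-coordinate clamp. The (x, y) addressing and the function name are assumptions, and the sketch ignores what happens when the window overlaps the edge of the mesh.

```python
def map_to_window(src, dst, M):
    """Map dst to the closest node inside the MxM window centered at src.

    src, dst: (x, y) coordinates; M: odd window width.
    Clamping each coordinate independently gives the in-window node with
    the smallest hop distance to dst. Mesh boundaries are ignored here.
    """
    r = (M - 1) // 2
    sx, sy = src
    dx, dy = dst
    cx = min(max(dx, sx - r), sx + r)
    cy = min(max(dy, sy - r), sy + r)
    return (cx, cy)
```

With a 7x7 window centered at (8, 8), a destination at (15, 9) maps to (11, 9); as the packet moves toward the destination the window moves with it, until the true destination falls inside and no mapping is needed.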
Window Size
Destination D guaranteed to be within window when packet is (M-1)/2 hops away from D
Intuition: Packet has (M-1)/2 hops to route around congestion hot spots
7x7 look-ahead window in 16x16 mesh has comparable performance to DAR (equivalent to 31x31 look-ahead window)
Slide33
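The window-size figures quoted on the preceding slide follow from simple arithmetic, sketched below with illustrative helper names: an MxM window reaches (M-1)/2 hops in each direction, and a window must be 2(k-1)+1 wide to cover a k x k mesh from a corner node.

```python
def guaranteed_hops(M):
    """Hop distance within which the destination is always inside an
    MxM window centered on the current node (M odd)."""
    return (M - 1) // 2


def full_coverage_window(k):
    """Smallest odd window width that covers a whole k x k mesh from any
    node, including a corner node."""
    return 2 * (k - 1) + 1


def entries_per_node(M):
    """Number of window positions tracked per node with an MxM window."""
    return M * M
```

For the 16x16 case this gives 3 hops of guaranteed visibility with a 7x7 window, a 31x31 window for full coverage, and 49 tracked positions per node compared with the 256 nodes of the full mesh.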
Outline
Introduction
Related work
Destination-Based Adaptive Routing (DAR)
Evaluation
Slide34
Experimental Setup
Compare DAR with RCA-1D, RCA-quadrant, Local adaptive
SPLASH-2 benchmarks + synthetic traffic patterns (uniform, transpose, shuffle)
Cycle-accurate NoC simulator models 3-stage router pipeline
8 VCs, 5 flits deep
1 VC used as escape VC for deadlock prevention
Slide35
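The slides name the synthetic traffic patterns without defining them. The sketch below uses the commonly cited definitions of transpose and shuffle; whether the evaluation used exactly these forms is an assumption, and the function names are illustrative.

```python
def transpose_dest(x, y):
    """Transpose: node (x, y) sends to node (y, x)."""
    return (y, x)


def shuffle_dest(node_id, num_bits):
    """Shuffle: rotate the node id left by one bit within num_bits bits."""
    mask = (1 << num_bits) - 1
    return ((node_id << 1) | (node_id >> (num_bits - 1))) & mask
```

In an 8x8 mesh a node id fits in 6 bits, so `shuffle_dest(5, 6)` gives node 10. Uniform traffic, by contrast, picks each destination at random with equal probability.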
Splash results – 7x7 mesh
41%
Slide36
Splash results – 7x7 mesh
65%
Slide37
Uniform traffic – 8x8 mesh
Slide38
Transpose traffic – 8x8 mesh
Slide39
Shuffle traffic – 8x8 mesh
Slide40
SDAR - 16x16 mesh, 7x7 window
Average latency over 100 permutation traffic patterns at 18% injection load
Network saturation statistics at 18% injection load
Slide41
Summary
Destination-based Adaptive Routing (DAR) for 2D mesh networks
Scalable DAR (SDAR) uses look-ahead window and easily scales to large networks
DAR outperforms existing adaptive and oblivious routing
SDAR achieves comparable performance with significantly less overhead
Slide42
Thank you!!
Slide43
Key Implementation Details
Simple router implementation: low storage, low bandwidth
Synchronize delay updates to reuse delay computation and weight adaptation hardware
Approximate computations to simplify implementation
Slide44
Router architecture – Kim et al., DAC '05
[Figure: router block diagram. Labels: input ports (In, N, S, E, W) with VC-1 ... VC-v; Routing Unit; Quadrant; Port Pre-select; Preferred Output Registers; Congestion Value Registers (N, S, E, W, Ej); VC Allocator; XB Allocator; Credits; Override]
Slide45
DAR Router
[Figure: DAR router block diagram. Labels: input ports (In, N, S, E, W) with VC-1 ... VC-v; Destination; Port Pre-select; Preferred output registers p[0] ... p[N-1]; Per-destination split ratios W[px, py][0] ... W[px, py][N-1]; Adapt Weights; Latency measurement; Latency Propagation; A[px][0] ... A[px][N-1] and A[py][0] ... A[py][N-1]; L[px][0] ... L[px][N-1] and L[py][0] ... L[py][N-1]; Avg[0] ... Avg[N-1]; Local delay l[0] ... l[P-1] (exponentially averaged local delay); counters cnt[0] ... cnt[P-1] with Increment/Decrement; λ; VC Allocator; XB Allocator. Legend: Storage Overhead, Logic Overhead]
Slide46
Distributed Delay Measurement
A node maintains:
Per-destination traffic split ratio through candidate output ports: W[p][j]
Delay to next-hop router/ejection interface through each output port (N, S, E, W, Ej): l[p]
Using updates received from downstream nodes, a node computes:
L[p][j]: Average delay from current node to node j through output port p
Avg[j]: Average delay from current node to node j
Slide47
Destination-based Adaptive Routing (DAR)
Every router maintains per-destination split ratios which control traffic distribution to output ports
Split ratios adjusted every T cycles based on measured delay to D through the two ports
[Figure: path from S to D through regions of low, moderate, and high congestion, annotated with split ratios 0.8/0.2, 0.7/0.3, 1, and 1 along the way]