Slide1
Improved Flop Tray-Based Design Implementation for Power Reduction
Andrew B. Kahng, Jiajia Li and Lutong Wang
UC San Diego VLSI CAD Laboratory
Slide2
Outline
Background and Motivation
Related Work
Our Methodology
Experimental Setup and Results
Conclusion
Slide3
Flop Tray Benefits (1)
Flop tray = multi-bit flip-flop (MBFF)
Application of flop trays significantly reduces #sinks
Motivating “thought experiments”
  Replacing all single-bit flops in a clock tree (N sinks) with 64-bit flop trays can reduce #clock buffers by (N - N/64)/(N - 1) ≈ 98.4%!
  In a clock tree with N = 100K, F = 8, replacing all single-bit flops with 64-bit flop trays can reduce #levels from 6 to 4
  → Fewer clock buffers, smaller clock power
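The arithmetic behind the two thought experiments can be checked with a short sketch. It assumes the idealized balanced clock tree of this slide, in which every buffer drives exactly F fanouts; the function names are illustrative and not taken from the slides.

```python
import math

def num_buffers(sinks: float, fanout: int) -> float:
    """Approximate buffer count of a balanced clock tree: (sinks - 1) / (fanout - 1)."""
    return (sinks - 1) / (fanout - 1)

def num_levels(sinks: float, fanout: int) -> int:
    """Number of buffer levels needed to reach all sinks: ceil(log_F(sinks))."""
    return math.ceil(math.log(sinks) / math.log(fanout))

def buffer_reduction(n: int, k: int, fanout: int) -> float:
    """Fraction of clock buffers saved when N single-bit flops become N/K K-bit trays."""
    before = num_buffers(n, fanout)
    after = num_buffers(n / k, fanout)
    return (before - after) / before

N, F, K = 100_000, 8, 64
reduction = buffer_reduction(N, K, F)
levels_1b = num_levels(N, F)
levels_mb = num_levels(N / K, F)
print(f"buffer reduction: {reduction:.1%}")
print(f"levels: {levels_1b} -> {levels_mb}")
```

The fanout F cancels out of the reduction ratio, which is why the slide can state (N - N/64)/(N - 1) without fixing F.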
[Figure: clock tree with N sinks; log_F N levels; each buffer has F fanouts; #Buffers ≈ (N-1)/(F-1)]
[Figure: clock tree with N/K sinks after using K-bit flop trays; log_F (N/K) levels; each buffer has F fanouts; #Buffers ≈ (N/K-1)/(F-1)]
Slide4
Flop Tray Benefits (2)
Inverters for clock signals are shared within a flop tray → Power and area reductions
A recent work (Lin et al. TCAD 2015) achieves 22% flop power reduction by using 2-bit and 4-bit flop trays
[Figure: single-bit flop (master latch, slave latch, clk) and 2-bit flop tray (two master/slave latch pairs, clk)]
Slide5
Challenges of Flop Tray Generation
Flops occupy large portion of block area
  In VGA, 30% of instances are flops → 51% of block area
Flop trays can have high aspect ratio and distinct size
  4-bit flop tray = 1 row x 63 sites
  64-bit flop tray = 4 rows x 244 sites
Clustering of flops imposes additional placement constraints
  Small clusters do not fully exploit flop tray benefits
  Large clusters may sacrifice datapath wirelength / power
[Figure: power overhead on datapaths (flop tray w/ logical clustering vs. single-bit flop)]
Slide6
Outline
Background and Motivation
Related Work
Our Methodology
Experimental Setup and Results
Conclusion
Slide7
Related Works
Early-stage flop tray generation
  [Chen10] enables flop tray generation during synthesis
  [Hou09] splits flop trays to mitigate routing congestion
  But are not aware of physical layout
Flop tray generation during/after placement
  [Lin11] clusters flops by finding K-cliques in a merging graph
  [Jiang12] generates flop trays using interval graphs
  [Tsai13] guides placement of flops with bonding force → Hard to define feasible displacement region
  But ignore the shape (AR) of flop trays and timing paths
Our work: flop tray generation considering flop displacement, timing paths and flop tray shapes
Slide8
Outline
Background and Motivation
Related Work
Our Methodology
Experimental Setup and Results
Conclusion
Slide9
Overall Optimization Flow
In blue are our optimizations
Initial placement w/ single-bit flops == “optimal” placement
Objectives
  Minimize displacement of flops
  Minimize timing impact
  Minimize #flop trays
Two-step optimization
  Capacitated K-means clustering (in dotted red boxes)
  ILP-based selection of flop trays
Slide10
Example of Overall Flow
[Figure: layouts of the 4-bit only solution, 16-bit only solution, 64-bit only solution and ILP solution]
Design: AES
Technology: 28FDSOI
Slide11
Capacitated K-Means Clustering
Given N points (flops), a capacity of K (flop tray size), obtain (N/K) clusters.
Selection of starting points
  I. Randomly select one flop among single-bit flops
  II. For each flop (h), calculate the total Manhattan distance (d) from h to all selected flops
  III. Randomly select one new flop with probability proportional to d
  IV. Repeat Steps II and III until M flops are selected
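The starting-point selection can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes that "with probability d" means a probability proportional to d, it excludes already-selected flops from the draw, and the names `select_starting_points`, `m` and `seed` are hypothetical.

```python
import random

def manhattan(a, b):
    """Manhattan distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def select_starting_points(flops, m, seed=0):
    """Pick m starting flops; flops far from those already chosen are more likely.

    Assumption: the selection probability is proportional to the total distance d.
    """
    if m > len(flops):
        raise ValueError("cannot select more starting points than flops")
    rng = random.Random(seed)
    selected = [rng.randrange(len(flops))]               # Step I
    while len(selected) < m:                             # Step IV
        candidates = [i for i in range(len(flops)) if i not in selected]
        # Step II: total Manhattan distance from each remaining flop
        # to all flops selected so far.
        weights = [sum(manhattan(flops[i], flops[s]) for s in selected)
                   for i in candidates]
        if sum(weights) == 0:                            # all remaining flops coincide
            weights = [1] * len(candidates)
        # Step III: draw one new flop with probability proportional to d.
        selected.append(rng.choices(candidates, weights=weights, k=1)[0])
    return [flops[i] for i in selected]

flops = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (20, 0), (21, 1)]
starts = select_starting_points(flops, m=3)
print(starts)
```

Weighting by distance spreads the initial centers across the placement, in the same spirit as k-means++ seeding.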
Min-cost flow-based clustering
Update of cluster centers
  Minimize $\sum_k d_k$
  Such that $|x_i + x'_{ij} - x_k| + |y_i + y'_{ij} - y_k| = d_k$
  flop location: $(x_k, y_k)$; flop tray location: $(x_i, y_i)$; relative slot location: $(x'_{ij}, y'_{ij})$
$h_k$: $k$th flop (point)
$t_i$: $i$th flop tray (cluster)
$f_{ij}$: $j$th slot on $i$th flop tray
$d_{k,ij}$: Manhattan distance between $h_k$ and $f_{ij}$
By considering distances between flops and slots, we are aware of flop tray ARs
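Once each flop is matched to a slot, the center-update objective separates into an x part and a y part, and a sum of absolute values is minimized by a median. The sketch below uses that fact for a single tray. It is one possible way to solve the update, not necessarily the authors' method. It ignores legalization to placement sites, and the tray geometry and flop coordinates are made up for illustration.

```python
from statistics import median

def update_tray_location(assignments):
    """assignments: list of ((x_k, y_k), (x'_ij, y'_ij)) pairs for one tray i.

    Returns (x_i, y_i) minimizing sum_k |x_i + x'_ij - x_k| + |y_i + y'_ij - y_k|.
    """
    xs = [fx - sx for (fx, _), (sx, _) in assignments]
    ys = [fy - sy for (_, fy), (_, sy) in assignments]
    return median(xs), median(ys)

def total_displacement(tray, assignments):
    """Sum of Manhattan distances d_k between flops and their assigned slots."""
    tx, ty = tray
    return sum(abs(tx + sx - fx) + abs(ty + sy - fy)
               for (fx, fy), (sx, sy) in assignments)

# Hypothetical 1 x 4 tray: four slots side by side in one row.
slots = [(0, 0), (4, 0), (8, 0), (12, 0)]
flops = [(10, 5), (15, 6), (17, 4), (25, 5)]
pairs = list(zip(flops, slots))
tray = update_tray_location(pairs)
best = total_displacement(tray, pairs)
print(tray, best)
```

Because the slot offsets enter the distance, a wide single-row tray and a compact multi-row tray over the same flops generally end up at different locations, which is how the tray aspect ratio is taken into account.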
[Figure: flow of initial center → clustering → cluster center update → solution]
Slide12
Example on AES
Circles: initial flop locations
Red dots: flop tray locations
Slide13
Awareness of Flop Tray Shapes
Our clustering solution more closely matches the AR of flop trays → Smaller displacements
[Figure: layout without awareness of flop tray AR] Avg. displacement = 15 μm
[Figure: layout with awareness of flop tray AR] Avg. displacement = 5 μm
Design: AES
Technology: 28FDSOI
Slide14
ILP-Based Selection of Flop Tray Solutions
Formulate an ILP to select flop tray solutions with various flop tray sizes to minimize displacement, timing impact and flop tray cost
Minimize $\alpha \cdot W + D + \beta \cdot Z$
Such that
// flop displacements
$\left|\sum_{ij} (x_i + x'_{ij} - x_k) \cdot b_{k,ij}\right| + \left|\sum_{ij} (y_i + y'_{ij} - y_k) \cdot b_{k,ij}\right| = d_k$;  $\sum_k d_k = D$
// relative displacements between timing-critical flop pairs
$\left|\sum_{ij} (x_i + x'_{ij} - x_k) \cdot b_{k,ij} - \sum_{i'j'} (x_{i'} + x'_{i'j'} - x_{k'}) \cdot b_{k',i'j'}\right| + \left|\sum_{ij} (y_i + y'_{ij} - y_k) \cdot b_{k,ij} - \sum_{i'j'} (y_{i'} + y'_{i'j'} - y_{k'}) \cdot b_{k',i'j'}\right| = z_{kk'}$;  $\sum_{kk'} z_{kk'} = Z$
// cost of flop trays
$b_{k,ij} \le e_i$;  $e_i \le \sum_{kj} b_{k,ij}$;  $\sum_i (w_i \cdot e_i) = W$
// each flop has exactly one slot to match & each slot can have at most one flop to match
$\sum_{ij} b_{k,ij} = 1$;  $\sum_k b_{k,ij} \le 1$
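To make the roles of W, D and Z concrete, the sketch below evaluates the same objective on a tiny hand-made instance. It replaces the ILP solver with exhaustive enumeration of flop-to-slot assignments, which is only practical for a handful of flops. The instance (flop locations, candidate trays, costs) and all names are hypothetical.

```python
from itertools import permutations

def solve_by_enumeration(flops, trays, critical_pairs, alpha, beta):
    """Exhaustively minimize alpha*W + D + beta*Z on a tiny instance.

    flops: list of (x_k, y_k)
    trays: list of {"loc": (x_i, y_i), "slots": [(x'_ij, y'_ij), ...], "cost": w_i}
    critical_pairs: list of (k, k') indices of timing-critical flop pairs
    """
    slots = [(i, j) for i, t in enumerate(trays) for j in range(len(t["slots"]))]
    best = None
    # Each flop gets exactly one slot; each slot holds at most one flop.
    for choice in permutations(slots, len(flops)):
        shift = []  # displacement vector of each flop
        for (fx, fy), (i, j) in zip(flops, choice):
            tx, ty = trays[i]["loc"]
            sx, sy = trays[i]["slots"][j]
            shift.append((tx + sx - fx, ty + sy - fy))
        D = sum(abs(dx) + abs(dy) for dx, dy in shift)
        Z = sum(abs(shift[k][0] - shift[q][0]) + abs(shift[k][1] - shift[q][1])
                for k, q in critical_pairs)
        used = {i for i, _ in choice}        # e_i = 1 only for trays holding a flop
        W = sum(trays[i]["cost"] for i in used)
        cost = alpha * W + D + beta * Z
        if best is None or cost < best[0]:
            best = (cost, choice, used)
    return best

flops = [(0, 0), (2, 0), (20, 0), (22, 0)]
two_bit = [(0, 0), (2, 0)]
trays = [
    {"loc": (0, 0), "slots": two_bit, "cost": 2.0},    # 2-bit tray on the left pair
    {"loc": (20, 0), "slots": two_bit, "cost": 2.0},   # 2-bit tray on the right pair
    {"loc": (8, 0), "slots": [(0, 0), (2, 0), (4, 0), (6, 0)], "cost": 3.0},  # 4-bit tray
]
small_alpha = solve_by_enumeration(flops, trays, [(0, 1)], alpha=1, beta=1)
large_alpha = solve_by_enumeration(flops, trays, [(0, 1)], alpha=50, beta=1)
print(sorted(small_alpha[2]), sorted(large_alpha[2]))
```

On this toy instance, α = 1 keeps the two 2-bit trays with zero displacement, while α = 50 moves all four flops into the single 4-bit tray at the price of a total displacement of 32.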
Notations
$D$                     total displacement
$Z$                     total relative displacement of timing-critical flop pairs
$W$                     total cost of flop trays
$\alpha$, $\beta$       weighting parameters
$(x_i, y_i)$            location of $i$th flop tray
$(x'_{ij}, y'_{ij})$    relative location of $j$th slot on $i$th flop tray
$(x_k, y_k)$            location of $k$th flop
$b_{k,ij}$              binary indicator whether $k$th flop is assigned to $j$th slot on $i$th flop tray
$e_i$                   binary indicator whether $i$th flop tray is selected
$w_i$                   cost of $i$th flop tray
Slide15
Impact of α Value
Choice of α determines a tradeoff between clock power reduction and datapath power penalty
  Small value of α → Small-size flop trays, small displacement
  Large value of α → Large-size flop trays, large displacement
Slide16
Minimization of Relative Placement
Relative displacement between timing-critical start-end flop pairs degrades timing
  Move apart → wire↑ → delay↑
  Move closer → routing/placement congestion
Minimization of relative displacement reduces power penalty by 5%
[Figure: flop pair around a logic cone; move closer → placement/routing congestion; move apart → longer wire]
Slide17
Outline
Background and Motivation
Related Work
Our Methodology
Experimental Setup and Results
Conclusion
Slide18
Experimental Setup
Designs: AES, JPEG, MPEG, VGA (from OpenCores website)
Technology: 28nm FDSOI, dual-VT
Tools
  Synthesis: Synopsys Design Compiler vH-2013.12-SP3
  P&R: Cadence Innovus Implementation System v15.2
  Power/timing analysis: Cadence Innovus Implementation System v15.2
Candidate flop trays
Tray size                  4-bit   8-bit   16-bit  32-bit   64-bit
Norm. area/power per bit   0.875   0.854   0.854   0.844    0.844
AR (#rows x #columns)      1 x 4   2 x 4   4 x 4   4 x 8    4 x 16
AR (#rows x #sites)        1 x 63  2 x 62  4 x 62  4 x 122  4 x 244
Slide19
Power Benefits
Reference flows
  ref_1b: conventional implementation flow with single-bit flops
  ref_mb: flop tray-based implementation with logical clustering (flop tray generation during synthesis with commercial tools)
Up to 98% sink number reduction and 90% clock power reduction compared to ref_1b
Up to 16% more total power reduction and 40% more clock power reduction compared to ref_mb
Design  Flow    Clock power (mW)  Total power (mW)  #Sinks
AES     ref_1b  1.53              14.02             530
AES     ref_mb  0.72              13.35             227
AES     opt_mb  0.46              12.56             128
JPEG    ref_1b  13.37             84.54             4512
JPEG    ref_mb  6.1               76.2              1665
JPEG    opt_mb  2.28              69.24             515
MPEG    ref_1b  10.72             45.53             3181
MPEG    ref_mb  5.19              38.7              1316
MPEG    opt_mb  0.98              31.76             181
VGA     ref_1b  42.19             164.84            17053
VGA     ref_mb  20.73             138.99            7665
VGA     opt_mb  2.04              111.32            308
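The per-design reductions of opt_mb relative to ref_1b can be recomputed from the reported numbers. The sketch below only restates the table values; the dictionary layout and the `reduction` helper are illustrative.

```python
# (clock power in mW, total power in mW, #sinks) as reported in the table
results = {
    "AES":  {"ref_1b": (1.53, 14.02, 530),     "opt_mb": (0.46, 12.56, 128)},
    "JPEG": {"ref_1b": (13.37, 84.54, 4512),   "opt_mb": (2.28, 69.24, 515)},
    "MPEG": {"ref_1b": (10.72, 45.53, 3181),   "opt_mb": (0.98, 31.76, 181)},
    "VGA":  {"ref_1b": (42.19, 164.84, 17053), "opt_mb": (2.04, 111.32, 308)},
}

def reduction(base, new):
    """Percentage reduction of `new` relative to `base`."""
    return 100.0 * (base - new) / base

for design, flows in results.items():
    ref, opt = flows["ref_1b"], flows["opt_mb"]
    clock, total, sinks = (reduction(b, n) for b, n in zip(ref, opt))
    print(f"{design}: clock {clock:.1f}%, total {total:.1f}%, sinks {sinks:.1f}%")

vga_sinks = reduction(results["VGA"]["ref_1b"][2], results["VGA"]["opt_mb"][2])
```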
Slide20
Layout Examples
In red are flop trays and flops, in blue are combinational cells
Slide21
Optimization with Various Flop Tray Sizes
Flop tray-based optimization with various combinations of flop tray size candidates
Optimization with large-size (i.e., > 16-bit) flop trays achieves 11% more clock power reduction on average, especially on large designs
[Figure: results on AES, JPEG, MPEG and VGA for flop tray size candidates I to V]
  I: 1 bit
  II: {1, 4} bit
  III: {1, 4, 8} bit
  IV: {1, 4, 8, 16} bit
  V: {1, 4, 8, 16, 32, 64} bit
Slide22
Study of Useful Skew Optimization
Comparison of useful skew benefits (= datapath leakage power reductions) across various flows
  ref_1b: design with only single-bit flops
  opt_mb: flop tray-based design (w/o skew-aware clustering)
  opt_mb (skew aware): flop tray-based design (w/ skew-aware clustering)
Skew-aware clustering achieves useful skew benefits similar to ref_1b, but with 21% less sink number reduction
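One reading of the 21% figure is the average, over the four designs, of the gap in sink reduction (in percentage points) between opt_mb and skew-aware opt_mb. The sketch below checks that reading against the #sinks values reported on this slide. The design order (AES, JPEG, MPEG, VGA) is inferred from the ref_1b sink counts on the Power Benefits slide, and the interpretation itself is an assumption.

```python
# #sinks per flow, in the assumed design order AES, JPEG, MPEG, VGA
sinks = {
    "ref_1b":              [530, 4512, 3181, 17053],
    "opt_mb":              [128, 515, 181, 308],
    "opt_mb (skew aware)": [392, 1830, 205, 1245],
}

def avg_sink_reduction(flow):
    """Average percentage reduction in #sinks of `flow` relative to ref_1b."""
    pairs = zip(sinks["ref_1b"], sinks[flow])
    cuts = [100.0 * (ref - n) / ref for ref, n in pairs]
    return sum(cuts) / len(cuts)

gap = avg_sink_reduction("opt_mb") - avg_sink_reduction("opt_mb (skew aware)")
print(f"{gap:.1f} percentage points")
```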
#sinks
Flow                  AES   JPEG  MPEG  VGA
ref_1b                530   4512  3181  17053
opt_mb                128   515   181   308
opt_mb (skew aware)   392   1830  205   1245
Slide23
Outline
Background and Motivation
Related Work
Our Methodology
Experimental Setup and Results
Conclusion
Slide24
Conclusion
A novel flop tray-based optimization with capacitated K-means algorithm
Up to 16% total block power reduction compared to logical clustering
Useful skew optimization in the context of flop tray-based design
Ongoing / Future works
  Scalable optimization considering all flop tray sizes
  Floorplan blockage awareness
Slide25
Thank you!
UCSD ABKGroup is grateful to Qualcomm, Samsung, NXP, the IMPACT+/C-DEN centers, Mentor Graphics and the NSF for research support. We thank IMEC and Cadence for additional research enablements and collaborations.