Samyoung Bang, Kwangsoo Han, Andrew B. Kahng and Vaishnav Srinivas. ECE and CSE Departments, UC San Diego, La Jolla, CA 92093; Samsung Electronics Co. Ltd, Hwaseong-si, South Korea
Slide1
Clock Clustering and IO Optimization for 3D Integration
Samyoung Bang*, Kwangsoo Han‡, Andrew B. Kahng‡† and Vaishnav Srinivas‡
‡ECE and †CSE Departments, UC San Diego, La Jolla, CA 92093
*Samsung Electronics Co. Ltd, Hwaseong-si, South Korea
eva.bang@samsung.com, {kwhan, abk, vaishnav}@ucsd.edu
Slide2
Outline
Motivation
Power, Area and Timing Model
P&R and Timing Flow
Experimental Results
Conclusion
Slide3
Motivation
For 3D integration with large bandwidth needs between dies, the choice of clocking options needs to be made upfront
Tradeoff between area and power needed upfront
Affects floorplanning choices
[Figure: block diagram with serializer, deserializer, 3DIO and PLL blocks]
Slide4
Key Choices for Clocking Options
Local clustering
Partition a given region into sub-regions
Clock synchronization scheme
Synchronous
Source-synchronous
Asynchronous
3DIO frequency
# of 3DIO
To enable design space pathfinding/exploration:
Power/Area/Timing model based on total bandwidth, clustering, synchronization scheme, 3DIO frequency
Combine clocking and 3DIO power/area/timing
Slide5
3DIO Clustering
[Figure: The layout of the bottom die, showing the clock entry point, clusters, 3DIO arrays and data paths]
Localize the clock tree of the 3D interconnect
Advantages when the number of clusters increases:
Size of cluster clock tree ↓ (smaller skew, jitter)
Shorter data paths to the 3DIO array at the center of each cluster
Enables efficient clocking schemes (forwarded clock, asynchronous)
Disadvantages when the number of clusters increases:
Overhead to synchronize between clusters on top die
Overhead in cluster clock 3DIO per cluster
Slide6
Synchronization Schemes for 3DIO Clocking
Synchronous
Cluster clock tree is balanced to all F/Fs on both the bottom and the top die
Simplest clocking scheme (similar to on-die)
Vulnerable to inter-die process/voltage variation (large skews)
Source-synchronous
Forwarded clock from one die to another
No skew balancing needed across two dies
Requires balance delays (Tb) within each die on the data path to match the clock insertion delay
Asynchronous
Separate clocks on each die
FIFO to help clock domain crossing
Needs a much smaller number of 3DIOs due to higher speeds achievable with embedded clock and CDR techniques
[Figure: three clocking diagrams. Asynchronous clocking: serializer and deserializer with cluster clock and IO-clock on each side. Synchronous clocking: launch FFs (bottom), data path, capture FFs (top). Source-synchronous clocking: launch FFs (bottom), data path with balance delay Tb, capture FFs (top)]
Slide7
Our Work
Given the choices of clock synchronization scheme, number of clusters and 3DIO frequency, find the maximum bandwidth for the 3D interconnect under given maximum power and area constraints.
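The search stated above can be sketched as an exhaustive sweep over the three design choices plus the number of 3DIOs. This is a minimal illustration only: `toy_model` and all of its coefficients are hypothetical stand-ins for the fitted power/area/timing models, and the assumption that each 3DIO carries one bit per IO clock cycle is ours, not the slides'.

```python
from itertools import product

def toy_model(scheme, n_clusters, f_io_mhz, n_io):
    """Hypothetical stand-in for the fitted models; returns (BW, power, area)."""
    # Placeholder per-IO costs, NOT values from the presentation.
    per_io_area = {"sync": 1.0, "source-sync": 1.2, "async": 3.0}[scheme]
    per_io_power = {"sync": 1.0, "source-sync": 1.3, "async": 2.0}[scheme]
    # Assumption: one bit per IO per cycle, so Mbit/s / 8000 gives GB/s.
    bw = n_io * f_io_mhz / 8000.0
    area = n_io * per_io_area + 5.0 * n_clusters
    power = n_io * per_io_power * f_io_mhz / 1000.0 + 2.0 * n_clusters
    return bw, power, area

def max_bandwidth(max_power, max_area, schemes, cluster_counts, freqs,
                  io_counts, model=toy_model):
    """Return (bw, scheme, n_clusters, f_io, n_io) of the best feasible point,
    or None when no point meets both constraints."""
    best = None
    for s, nc, f, n in product(schemes, cluster_counts, freqs, io_counts):
        bw, power, area = model(s, nc, f, n)
        feasible = power <= max_power and area <= max_area
        if feasible and (best is None or bw > best[0]):
            best = (bw, s, nc, f, n)
    return best

if __name__ == "__main__":
    print(max_bandwidth(500.0, 400.0,
                        ["sync", "source-sync", "async"],
                        [1, 4, 9, 16, 25],
                        [1000, 2000, 4000],
                        [64, 128, 256]))
```

Sweeping `max_power` and `max_area` over a grid and recording the winning scheme, frequency and cluster count at each point would give maps of the kind this slide describes.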
[Figure: four maps over the max power constraint and max area constraint axes: Max Achievable BW; Optimal Clocking Scheme for Max BW (Synch., Source-synch., Asynch.); Optimal Clocking Frequency for Max BW; Optimal Number of Clusters for Max BW]
Slide8
Outline
Motivation
Power, Area and Timing Model
P&R and Timing Flow
Experimental Results
Conclusion
Slide9
3DIO/CTS Directed Graph
Primary inputs are indicated by circles
Rectangles are determined by the primary inputs
Solid and dotted arrows indicate positive and negative correlation, respectively
Estimate the rounded rectangles as analytic expressions
[Figure: directed graph. Nodes: #Clusters, Freq., Clocking scheme, Region Area, BW, # FFs, IO Freq., Per-IO power/area, # 3DIO, Max skew/transition, Jitter, Clock WL, Clock buf. area, Data WL, Data buf. area, Skew outcome / clock ins. delay, Area, Power, WNS. Legend: Input, Deterministic, Est. outcome, Estimated; Increase, Decrease]
Slide10
Clock Wirelength
Hierarchical approach to estimate clock wirelength
Assume clock tree is well balanced because FFs are uniformly distributed over the region area
Length of a Steiner minimal tree over N points uniformly distributed within a given region A_{reg} is proportional to \sqrt{N \cdot A_{reg}}
Total clock wirelength is [equation not preserved in the slide text]
Notation
Depth of clock tree (i = 0 for clock source)
Number of clusters
Total number of flip-flops
Fitted coefficients
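A minimal sketch of a two-level hierarchical estimate, assuming the standard square-root scaling of Steiner tree length with the number of points and the area. The split into one global tree over the cluster entry points plus one tree per cluster, the equal-sized clusters, and the names `w_global` and `w_cluster` are our assumptions for illustration; the slide's actual expression is not recoverable from the text.

```python
import math

def smt_length(n_points, area, coeff=1.0):
    """Steiner-tree length scaling for uniform points: coeff * sqrt(N * A)."""
    return coeff * math.sqrt(n_points * area)

def clock_wirelength(n_ff, n_cluster, a_reg, w_global=1.0, w_cluster=1.0):
    """Two-level estimate: global tree to the clusters plus per-cluster trees."""
    # Global tree: one sink per cluster, spread over the whole region (assumed).
    global_wl = smt_length(n_cluster, a_reg, w_global)
    # Cluster trees: each cluster holds n_ff / n_cluster FFs
    # inside a_reg / n_cluster of area (assumed equal-sized clusters).
    per_cluster_wl = smt_length(n_ff / n_cluster, a_reg / n_cluster, w_cluster)
    return global_wl + n_cluster * per_cluster_wl

if __name__ == "__main__":
    # 400 FFs, 4 clusters, region area 100 (arbitrary units).
    print(clock_wirelength(400, 4, 100.0))
```

In practice the coefficients would be fitted against implemented clock trees rather than left at 1.0.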
[Figure: hierarchical clock tree with levels i = 0, 1, 2 and coefficients w0, w1, w2, showing the global clock tree, the cluster clock trees and the FFs]
Slide11
Clock Buffer Area
Tellez and Sarrafzadeh propose a method to insert the minimum number of buffers under a given transition time (Tmax_tran) constraint
Linearize the problem by using the concept of maximum capacitance (Cmax)
Any buffer stage i with stage cap Ci < Cmax will have Ti_tran < Tmax_tran
Using Cmax, we estimate the number of clock buffers (Ncbuf)
Kashyap et al. discuss transition time degradation, and Cmax can be expressed as follows: [equation not preserved in the slide text]
Total clock buffer area is [equation not preserved in the slide text]
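One simple way to turn the Cmax idea into a buffer count is a capacitance budget: if every buffer may drive at most Cmax, the total load divided by Cmax gives a lower bound on the number of buffers. The sketch below assumes this budget form; the parameter names and the lumped wire-plus-pin load model are hypothetical and are not the expressions from the slide.

```python
import math

def clock_buffer_count(wirelength, n_sinks, c_wire_per_len, c_pin, c_max):
    """Lower-bound buffer count when each buffer drives at most c_max."""
    # Total load: wire capacitance plus sink pin capacitance (assumed model).
    total_cap = wirelength * c_wire_per_len + n_sinks * c_pin
    return math.ceil(total_cap / c_max)

def clock_buffer_area(n_buffers, area_per_buffer):
    """Total buffer area, assuming a single buffer size."""
    return n_buffers * area_per_buffer

if __name__ == "__main__":
    n_buf = clock_buffer_count(wirelength=1000, n_sinks=100,
                               c_wire_per_len=0.25, c_pin=1.5, c_max=60)
    print(n_buf, clock_buffer_area(n_buf, 2.0))
```

A fitted model would replace the single `area_per_buffer` with whatever buffer sizes the CTS tool actually selects.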
[Figure: wire (max length = Wmax) with transition times T0 and Tmax_tran]
Slide12
Data Wirelength and Data Buffer Area
Data path wirelength is proportional to the number of data wires and the cluster dimension
Distribution exists based on sink placement with respect to the 3DIO cluster
For data buffer area, we use a similar concept to clock buffer area estimation
Need to consider each data path separately
Cannot use total wirelength
Need minimum number of data buffers to meet hold timing
Slide13
3DIO/Overall Power and Area
3DIO power and area models are based on CACTI-IO
Overall (3DIO + clocking) power and area are [equations not preserved in the slide text; the power expression is annotated with switching power, internal and leakage power, and IO power]
Slide14
Outline
Motivation
Power, Area and Timing Model
P&R and Timing Flow
Experimental Results
Conclusion
Slide15
3D P&R Flow - Synchronous
Synchronous
Synthesize the cluster clock tree on the top die first to balance the clock tree on both dies
Extract maximum clock insertion delay (Tclk1)
Propagate the data path delay (Tdata) for the routing on top die
[Figure: Synchronous scheme. The layout of top die and the layout of the bottom die, with Tclk1 propagated through the clock 3DIO, and Tdata]
Slide16
3D P&R Flow – Source-synchronous and Asynchronous
Source-synchronous
Synthesize the clock tree and route on the bottom die, and separately synthesize the clock tree only for the top die
Extract balance delay Tb (i.e., Tclk1) for each capture FF and annotate the delays to the corresponding data 3DIOs
Asynchronous
Run traditional 2D flow on both dies separately
[Figure: Source-synchronous scheme. The layout of top die and the layout of the bottom die, with the propagated clock 3DIO, Tclk1, and the annotated Tb (i.e., Tclk1)]
Slide17
Conventional 2D STA vs. our 3D STA
We focus on inter-die variation, and do not consider intra-die variation, which can be accounted for by timing derate or OCV
Two process corners {BC, WC} for inter-die variation
Assign the same corner to all paths on the same die
Report worst setup WNS out of the four combinations of corners (i.e., BC-BC, BC-WC, WC-BC, WC-WC)
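The four-combination check can be sketched as follows. The sketch assumes, following this slide's slack expressions, that the launch clock, clock-to-Q and first data segment sit on the bottom die while the second data segment and capture clock sit on the top die. The function names and the delay values in the usage example are illustrative only.

```python
from itertools import product

def setup_slack(t_per, t_su, bottom, top):
    """Setup slack for one corner pair.

    bottom: dict with 'launch', 'c2q', 'data1' delays at the bottom-die corner.
    top:    dict with 'data2', 'capture' delays at the top-die corner.
    """
    return (t_per - t_su - bottom["c2q"] - bottom["data1"] - top["data2"]
            + (top["capture"] - bottom["launch"]))

def worst_setup_slack(t_per, t_su, bottom_by_corner, top_by_corner):
    """Evaluate BC-BC, BC-WC, WC-BC, WC-WC and return (corner pair, slack)."""
    slacks = {}
    for b, t in product(("BC", "WC"), repeat=2):
        slacks[(b, t)] = setup_slack(t_per, t_su,
                                     bottom_by_corner[b], top_by_corner[t])
    worst = min(slacks, key=slacks.get)
    return worst, slacks[worst]

if __name__ == "__main__":
    # Illustrative delays in ns, not taken from the presentation.
    bottom = {"BC": {"launch": 0.3, "c2q": 0.1, "data1": 0.1},
              "WC": {"launch": 0.6, "c2q": 0.2, "data1": 0.2}}
    top = {"BC": {"data2": 0.1, "capture": 0.1},
           "WC": {"data2": 0.2, "capture": 0.25}}
    print(worst_setup_slack(2.0, 0.0, bottom, top))
```

Each die gets a single corner per evaluation, so every delay on that die is read from the same dictionary entry.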
Conventional 2D STA (without inter-die variation)
Setup slack = T_{per} - T_{su} - T_{c2q,WC} - T_{data1,WC} - T_{data2,WC} + (T_{capture,BC} - T_{launch,WC})
Our 3D STA
slack1 = T_{per} - T_{su} - T_{c2q,BC} - T_{data1,BC} - T_{data2,BC} + (T_{capture,BC} - T_{launch,BC})
slack2 = T_{per} - T_{su} - T_{c2q,BC} - T_{data1,BC} - T_{data2,WC} + (T_{capture,WC} - T_{launch,BC})
slack3 = T_{per} - T_{su} - T_{c2q,WC} - T_{data1,WC} - T_{data2,BC} + (T_{capture,BC} - T_{launch,WC})
slack4 = T_{per} - T_{su} - T_{c2q,WC} - T_{data1,WC} - T_{data2,WC} + (T_{capture,WC} - T_{launch,WC})
Setup slack = min(slack1, slack2, slack3, slack4)
[Figure: timing path annotated with T_launch, T_c2q, T_data1, T_data2 and T_capture. Legend: buffer on bottom die, buffer on top die, FF on bottom die, FF on top die]
Slide18
Outline
Motivation
Power, Area and Timing Model
P&R and Timing Flow
Experimental Results
Conclusion
Slide19
Experimental Setup
P&R tool is Synopsys IC Compiler I-2013.12-SP1
Timing analysis tool is Synopsys PrimeTime H-2013.06-SP2
We use a 65nm TSMC library
Design of experiments
Bandwidth (10 – 200 GB/s)
Region area (25 – 100 mm^2)
3DIO clock frequency
Synchronous (100 – 2000 MHz)
Source-synchronous (1500 – 4000 MHz)
Asynchronous (3500 – 8000 MHz)
Number of clusters (1 – 25)
We select four data points for each parameter
256 design implementations for each clocking scheme
Slide20
Model Fitting Approach
We use an Artificial Neural Network (ANN) model for our fit, guided by the directed graph
Iteratively progress through the directed graph to fit each node
Clock wirelength
Data wirelength
Clock buffer area
Data buffer area
3DIO power/area
Total Area/Power/WNS
We use the Fmax for the timing model (instead of WNS)
Multiple runs with different training, validation and test data sets
Improved generality and robustness of the resulting models
[Figure: the 3DIO/CTS directed graph, repeated from the earlier slide]
Slide21
Area, Power and Timing Model Results
Min-Max error within +/-20%
For the synchronous scheme, the tool inserts a large number of hold buffers due to inter-die variation
Larger error
Mean error within +/-0.5%
[Figure: model results for Area, Power and Fmax]
Slide22
Design Space Results
Max BW:
Figure shows the iso-bandwidth curves
Vertical and horizontal walls show min power/area required to hit a bandwidth requirement
Clocking scheme:
The asynchronous scheme is area-efficient
The synchronous scheme is power-efficient
The source-synchronous scheme provides a valuable tradeoff between power and area along the knee of the iso-bandwidth curve. The interesting tradeoffs between the schemes occur along these knee points as we change the power/area constraint tradeoffs.
[Figure: Max BW; Optimal clocking scheme for Max BW]
Slide23
Design Space Results
Cluster clock frequency:
As power constraint gets tighter, frequency goes down
As area constraint gets tighter, frequency goes up
Source-synchronous schemes provide benefits at higher cluster frequencies
The asynchronous scheme provides a way to keep the cluster frequency down but still have high 3DIO frequency, through serialization
Number of clusters:
Not monotonic along edges of hypercube and clocking scheme boundaries
Also sensitive to the total region area
[Figure: Optimal # of Clusters for Max BW; Optimal Cluster Frequency for Max BW]
Slide24
Outline
Motivation
Power, Area and Timing Model
P&R and Timing Flow
Experimental Results
Conclusion
Slide25
Conclusion
We have developed a power, area and timing model for 3DIO and CTS that includes clustering and three different clock synchronization schemes (synchronous, source-synchronous, asynchronous)
Our model estimates power, area and timing within 20% error across a large range of bandwidths, region areas, numbers of clusters and 3DIO frequencies
Our modeling methodology will enable architects to study and optimize the design space upfront
Key takeaways:
Iso-bandwidth lines identify min area/power required to hit a particular BW
Clocking scheme tradeoffs are interesting along the knee of iso-bandwidth lines
Cluster frequency for asynchronous schemes can be kept low while still reducing the number of 3DIO due to serialization
Slide26
Future Work
Extend our model to be aware of
Placement uniformity
Technology dependence
Datapath logic
More comprehensive STA including intra-die variation
Blockages
Asymmetric clustering
Different 3DIO placement
Serial 3DIO circuit options for asynchronous scheme
2.5D (interposer-based) design
Slide27
Thank you
Slide28
BACKUP
Slide29
Synchronous
All end points on both dies are synchronized
Colored FFs are uniformly distributed over the region
Non-colored FFs are placed right next to the 3DIO array
Clock tree is vulnerable to the inter-die variation
Use DDR to minimize number of 3DIOs
Two factors determine the max 3DIO clock frequency (FIO)
Clock skew due to the inter-die variation
Jitter
Increasing #clusters increases max FIO because the clock tree becomes more robust to the inter-die variation
BW (GB/s)   Region Area (mm^2)   Ncluster   Fmax (MHz)   FIO (MHz)
12          25                   1          640          1280
11.25       25                   25         900          1800
12          100                  1          300          600
11.25       100                  25         600          1200
50.025      25                   1          460          920
50.625      25                   25         900          1800
200.25      100                  1          300          600
202.5       100                  25         600          1200
Slide30
Source-Synchronous
Forward the clock from one die to the other die
For any paths across dies, the launch and capture path delays from source to 3DIO at the bottom die are balanced
No inter-die variation
Require balance delay Tb to compensate clock insertion delay Tclk1
Two factors determine the max 3DIO clock frequency (FIO)
Skew between Tb and Tclk1 due to the intra-die variation
Jitter
BW (GB/s)   Region Area (mm^2)   Ncluster   Fmax (MHz)   FIO (MHz)
12.095      25                   1          820          1640
10.625      25                   25         1700         3400
50.02       25                   1          820          1640
46.875      25                   25         1500         3000
12          100                  1          500          1000
15          100                  25         1200         2400
200         100                  1          350          700
195         100                  25         1200         2400
Slide31
Asynchronous
BW (GB/s)   Region Area (mm^2)   Ncluster   Fmax (MHz)   FIO (MHz)
11.9        25                   1          700          5600
25          25                   25         1000         8000
49.7        25                   1          700          5600
40          25                   25         800          6400
12          100                  1          400          3200
20          100                  25         800          6400
200         100                  1          125          1000
200         100                  25         500          4000
Use FIFO (1:8 serializer, 8:1 deserializer) to separate clock domains
No inter-die variation
Minimize the number of 3DIOs
Require PLL for cluster clock for the top die and IO clock for both dies
Large power overhead
One factor determines the max 3DIO clock frequency (FIO)
Jitter
Slide32
Flow of Synch. Clocking Schemes
[Figure: bottom and top die paths annotated with delays 0.307ns (bc) / 0.618ns (wc), 0.089ns (bc) / 0.200ns (wc), 0.125ns (bc) / 0.247ns (wc), 0.147ns (bc) / 0.306ns (wc)]
Delay to balance the clock insertion delays across dies
0.307 – 0.125 = 0.182ns (bc)
0.618 – 0.247 = 0.371ns (wc)
Input delay to prevent unnecessary hold buffer insertions
0.307 + 0.089 – 0.182 = 0.214ns
Flow steps shown in the figure (labeled 1 and 2): CTS, CTO and Route on bottom die; Custom placement on bottom/top dies; CTS on top die; Extract delay
Example: BW: 12GB/s, Areg: 81mm^2, nc: 4, fclus: 1000MHz
[Figure labels: cluster buffers]
Slide33
Flow of Synch. Clocking Schemes
Flow steps shown in the figure (labeled 1 to 4): CTS on top die; Extract balance delay; CTO and Route on top die; STA
Run CTO and route at worst corner considering hold time and clock uncertainty
[Figure: top die annotated with 0.247ns (wc) and 0.214ns; bottom and top die paths annotated with 0.618ns (wc), 0.200ns (wc), 0.125ns (bc), 0.306ns (wc) in one case and 0.307ns (bc), 0.089ns (bc), 0.247ns (wc), 0.147ns (bc) in the other]
Setup: 0.5 (half cycle) + 0.802 (tclk) – 0.075 (tunc) – 0.008 (ts) – 1.195 (tdata) = 0.024ns
Hold: 0.683 (tdata) – 0.576 (tclk) – 0.060 (tunc) – 0.030 (th) = 0.017ns
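The setup and hold arithmetic above can be re-checked with a few lines of code. The function and argument names are ours and simply mirror the terms as labeled on the slide (half cycle, tclk, tunc, ts, th, tdata); all values are in ns and are copied from the slide.

```python
def setup_slack(half_cycle, t_clk, t_unc, t_s, t_data):
    """Setup slack: available time on the capture side minus data arrival."""
    return half_cycle + t_clk - t_unc - t_s - t_data

def hold_slack(t_data, t_clk, t_unc, t_h):
    """Hold slack: data arrival minus the capture-side hold requirement."""
    return t_data - t_clk - t_unc - t_h

if __name__ == "__main__":
    print(round(setup_slack(0.5, 0.802, 0.075, 0.008, 1.195), 3))  # 0.024
    print(round(hold_slack(0.683, 0.576, 0.060, 0.030), 3))        # 0.017
```

Both results are positive, so the synchronous example meets setup and hold at the corners shown.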
[Figure annotations: 0.371ns (wc), 0.071ns (bc), 0.140ns (wc), 0.182ns (bc); cluster buffers]
Slide34
Flow of Synch. Clocking Schemes
[Figure: bottom and top die paths annotated with delays 0.618ns, 0.200ns, 0.247ns, 0.306ns; cluster buffers]
Flow steps shown in the figure (labeled 1 and 2): CTS, CTO and Route on bottom die; Custom placement on bottom/top dies; CTS on top die; Extract delay
Example: BW: 12GB/s, Areg: 81mm^2, nc: 4, fclus: 1000MHz
Balance the delay from clock source to data 3DIO and the delay from clock source to clock 3DIO
0.618 + 0.200 = 0.818ns
Annotate "balancing delay"
0.247ns
Slide35
Flow of Source synch. Clocking Schemes
Flow steps shown in the figure (labeled 1 to 4): CTS on top die; Extract balance delay; CTO and Route on top die; STA
Run CTO and route at worst corner considering hold time and clock uncertainty
[Figure: top die annotated with 0.247ns (wc) and a 0.247ns balancing delay; bottom and top die paths annotated with 0.618ns, 0.200ns, 0.247ns, 0.306ns]
Setup: 0.5 (half cycle) + 1.371 (tclk) – 0.075 (tunc) – 0.008 (ts) – 1.471 (tdata) = 0.317ns
Hold: 1.471 (tdata) – 1.371 (tclk) – 0.060 (tunc) – 0.030 (th) = 0.010ns
[Figure annotations: 0.818ns, 0.100ns, 0.247ns; cluster buffers; balancing delays]