Clock Clustering and IO Optimization for 3D Integration

Uploaded On 2016-09-16


Samyoung Bang Kwangsoo Han Andrew B Kahng and Vaishnav Srinivas ECE and CSE Departments UC San Diego La Jolla CA 92093 Samsung Electronics Co Ltd Hwaseongsi South Korea ID: 467206

Presentation Transcript

Slide1

Clock Clustering and IO Optimization for 3D Integration

Samyoung Bang*, Kwangsoo Han, Andrew B. Kahng‡† and Vaishnav Srinivas

ECE and CSE Departments, UC San Diego, La Jolla, CA 92093

*Samsung Electronics Co. Ltd, Hwaseong-si, South Korea

eva.bang@samsung.com, {kwhan, abk, vaishnav}@ucsd.edu

Slide2

Outline

Motivation

Power, Area and Timing Model

P&R and Timing Flow

Experimental Results

Conclusion

Slide3

Motivation

For 3D integration with large bandwidth needs between dies, the choice of clocking options needs to be made upfront

The tradeoff between area and power must be settled upfront

It affects floorplanning choices

[Figure: block diagram with Serializer, Deserializer, 3DIO PLL and PLL blocks]

Slide4

Key Choices for Clocking Options

Local clustering

Partition a given region into sub-regions

Clock synchronization scheme

Synchronous

Source-synchronous

Asynchronous

3DIO frequency

# of 3DIO

To enable design space pathfinding/exploration:

Power/Area/Timing model based on total bandwidth, clustering, synchronization scheme, 3DIO frequency

Combine clocking and 3DIO power/area/timing

Slide5

3DIO Clustering

[Figure: the layout of the bottom die, with labels Clock entry point, Cluster, 3DIO array and Data path]

Localize the clock tree of the 3D interconnect

Advantages when the number of clusters increases:

Size of cluster clock tree ↓ (smaller skew, jitter)

Shorter data paths to the 3DIO array at the center of each cluster

Enables efficient clocking schemes (forwarded clock, asynchronous)

Disadvantages when the number of clusters increases:

Overhead to synchronize between clusters on the top die

Overhead in cluster clock 3DIO per cluster

Slide6

Synchronization Schemes for 3DIO Clocking

Synchronous

Cluster clock tree is balanced to all F/Fs on both the bottom and the top die

Simplest clocking scheme (similar to on-die)

Vulnerable to inter-die process/voltage variation (large skews)

Source-synchronous

Forwarded clock from one die to another

No skew balancing needed across two dies

Requires balance delays (Tb) within each die on the data path to match the clock insertion delay

Asynchronous

Separate clocks on each die

FIFO to help clock domain crossing

Needs a much smaller number of 3DIOs due to higher speeds achievable with embedded clock and CDR techniques

[Figure: three panels. Synchronous clocking (labels: Launch FFs, Capture FFs, Data path, Bottom, Top). Source-synchronous clocking (labels: Launch FFs, Capture FFs, Data path, Tb, Bottom, Top). Asynchronous clocking (labels: Serializer, Deserializer, Cluster clock, IO-clock)]

Slide7

Our Work

Given the choices of clock synchronization scheme, number of clusters and 3DIO frequency, find the maximum bandwidth for the 3D interconnect under maximum power and area constraints.
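The search stated above can be sketched as a brute-force sweep over candidate configurations. This is only an illustration under stated assumptions, not the authors' implementation: `Candidate`, `best_design` and `toy_model` are hypothetical names, and the toy model's numbers are placeholders standing in for the fitted power/area/timing models.

```python
from itertools import product
from typing import Callable, Iterable, NamedTuple, Optional, Tuple


class Candidate(NamedTuple):
    scheme: str        # "sync", "source-sync" or "async"
    n_clusters: int    # number of clusters
    f_io_mhz: float    # 3DIO clock frequency


def best_design(
    candidates: Iterable[Candidate],
    model: Callable[[Candidate], Tuple[float, float, float]],
    max_power: float,
    max_area: float,
) -> Optional[Tuple[Candidate, float]]:
    """Return (candidate, bandwidth) with the highest bandwidth that
    satisfies both constraints, or None if nothing is feasible."""
    best = None
    for cand in candidates:
        bandwidth, power, area = model(cand)
        if power > max_power or area > max_area:
            continue  # violates a constraint
        if best is None or bandwidth > best[1]:
            best = (cand, bandwidth)
    return best


def toy_model(cand: Candidate) -> Tuple[float, float, float]:
    # Placeholder relations only: NOT taken from the presentation.
    bandwidth = cand.f_io_mhz * cand.n_clusters / 100.0
    power = 0.002 * cand.f_io_mhz * cand.n_clusters
    area = 0.5 * cand.n_clusters + (10.0 if cand.scheme == "sync" else 4.0)
    return bandwidth, power, area


candidates = [
    Candidate(s, n, f)
    for s, n, f in product(("sync", "source-sync", "async"), (1, 4, 16), (1000.0, 2000.0))
]
result = best_design(candidates, toy_model, max_power=40.0, max_area=10.0)
```

Sweeping `max_power` and `max_area` over a grid and recording the winning scheme, cluster count and frequency at each point would give maps of the kind the slide describes.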

[Figure: four plots over the max power constraint and max area constraint axes: Max Achievable BW; Optimal Clocking Scheme for Max BW (Synch., Source-synch., Asynch.); Optimal Clocking Frequency for Max BW; Optimal Number of Clusters for Max BW]

Slide8

Outline

Motivation

Power, Area and Timing Model

P&R and Timing Flow

Experimental Results

Conclusion

Slide9

3DIO/CTS Directed Graph

Primary inputs are indicated by circles

Rectangles are determined by the primary inputs

Solid and dotted arrows indicate positive and negative correlation, respectively

Rounded rectangles are estimated as analytic expressions

[Figure: directed graph. Nodes: #Clusters, Freq., Clocking scheme, Region Area, BW, # FFs, IO Freq., Per-IO power/area, # 3DIO, Max skew/transition, Jitter, Clock WL, Clock buf. area, Data WL, Data buf. area, Skew outcome / clock ins. delay, WNS, Area, Power. Legend: Input, Deterministic, Est. outcome, Estimated, Increase, Decrease]

Slide10

Clock Wirelength

Hierarchical approach to estimate clock wirelength

Assume the clock tree is well balanced because FFs are uniformly distributed over the region area

The length of a Steiner minimal tree over N points uniformly distributed within a given region Areg is proportional to [expression not preserved in the transcript]

Total clock wirelength is [equation not preserved in the transcript]

Notation (symbols not preserved in the transcript):

Depth of clock tree (i = 0 for clock source)

Number of clusters

Total number of flip-flops

Fitted coefficients
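For reference, a classical result for points placed uniformly at random is that the Steiner minimal tree length grows in proportion to sqrt(N · A). The sketch below applies that scaling in a two-level (global tree plus cluster trees) estimate. The two-level split, the even division of FFs and area across clusters, and the coefficients `k_global` and `k_cluster` are assumptions for illustration; they are not the fitted expression from the slides.

```python
import math


def smt_length(n_points: float, area: float, k: float = 1.0) -> float:
    """Steiner-minimal-tree length estimate, k * sqrt(N * A)."""
    return k * math.sqrt(n_points * area)


def clock_wirelength(
    n_ffs: int,
    region_area: float,
    n_clusters: int,
    k_global: float = 1.0,
    k_cluster: float = 1.0,
) -> float:
    """Two-level estimate: one global tree reaching every cluster,
    plus one local tree per cluster."""
    # Global tree: one sink per cluster, spread over the whole region.
    global_wl = smt_length(n_clusters, region_area, k_global)
    # Cluster trees: FFs and area are assumed to split evenly.
    per_cluster_wl = smt_length(n_ffs / n_clusters, region_area / n_clusters, k_cluster)
    return global_wl + n_clusters * per_cluster_wl


wl_1 = clock_wirelength(n_ffs=10_000, region_area=100.0, n_clusters=1)
wl_25 = clock_wirelength(n_ffs=10_000, region_area=100.0, n_clusters=25)
```

Under these assumptions the sum of the cluster trees does not depend on the number of clusters, so the change in total wirelength comes from the global tree alone.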

[Figure: hierarchical clock tree with levels i = 0 (w0), i = 1 (w1) and i = 2 (w2), showing the global clock tree, the cluster clock tree and the FFs]

Slide11

Clock Buffer Area

Tellez and Sarrafzadeh propose a method to insert the minimum number of buffers under a given transition time (Tmax_tran) constraint

Linearize the problem by using the concept of maximum capacitance (Cmax)

Any buffer stage i with stage cap Ci < Cmax will have Ti_tran < Tmax_tran

Using Cmax, we estimate the number of clock buffers (Ncbuf)

Kashyap et al. discuss transition time degradation, from which Cmax can be expressed [equation not preserved in the transcript]

Total clock buffer area is [equation not preserved in the transcript]
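The slide's own expressions are not preserved, so the following is only a sketch of how a Cmax-based buffer count could work. It assumes a lumped model in which each buffer drives at most Cmax of wire and pin capacitance. The function names, parameters and numeric values are hypothetical and are not the authors' formula.

```python
import math


def clock_buffer_count(
    wirelength_um: float,
    wire_cap_ff_per_um: float,
    n_ffs: int,
    ff_clock_pin_cap_ff: float,
    c_max_ff: float,
) -> int:
    """Lower-bound style estimate: total switched capacitance divided by
    the maximum capacitance a single buffer stage may drive."""
    total_cap_ff = wirelength_um * wire_cap_ff_per_um + n_ffs * ff_clock_pin_cap_ff
    return math.ceil(total_cap_ff / c_max_ff)


def clock_buffer_area(n_buffers: int, area_per_buffer_um2: float) -> float:
    """Total buffer area, assuming one buffer size throughout."""
    return n_buffers * area_per_buffer_um2


# Illustrative numbers only (not from the presentation).
n_cbuf = clock_buffer_count(
    wirelength_um=10_000.0,
    wire_cap_ff_per_um=0.2,
    n_ffs=1_000,
    ff_clock_pin_cap_ff=1.0,
    c_max_ff=60.0,
)
area_um2 = clock_buffer_area(n_cbuf, area_per_buffer_um2=2.0)
```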

[Figure: buffer stage driving a wire (max length = Wmax), with transition times T0 and Tmax_tran annotated]

Slide12

Data Wirelength and Data Buffer Area

Data path wirelength is proportional to the number of data wires and the cluster dimension

A distribution exists based on sink placement with respect to the 3DIO cluster

For data buffer area, we use a similar concept to clock buffer area estimation

Need to consider each data path separately

Cannot use total wirelength

Need the minimum number of data buffers to meet hold timing

Slide13

3DIO/Overall Power and Area

3DIO power and area models are based on CACTI-IO

Overall (3DIO + clocking) power and area are [equations not preserved in the transcript]

Power terms labeled in the equation: switching power, internal and leakage power, IO power

Slide14

Outline

Motivation

Power, Area and Timing Model

P&R and Timing Flow

Experimental Results

Conclusion

Slide15

3D P&R Flow - Synchronous

Synchronous

Synthesize the cluster clock tree on the top die first to balance the clock tree on both dies

Extract maximum clock insertion delay (Tclk1)

Propagate the data path delay (Tdata) for the routing on the top die

[Figure: synchronous scheme, with labels The layout of top die, The layout of the bottom die, Clock 3DIO, Propagated, Tclk1 and Tdata]

Slide16

3D P&R Flow – Source-synchronous and Asynchronous

Source-synchronous

Synthesize the clock tree and route on the bottom die, and separately synthesize the clock tree only for the top die

Extract balance delay Tb (i.e., Tclk1) for each capture FF and annotate the delays to the corresponding data 3DIOs

Asynchronous

Run traditional 2D flow on both dies separately

[Figure: source-synchronous scheme, with labels The layout of top die, The layout of the bottom die, Clock 3DIO, Propagated, Tclk1 and Annotate Tb (i.e., Tclk1)]

Slide17

Conventional 2D STA vs. our 3D STA

We focus on inter-die variation, and do not consider intra-die variation, which can be handled by timing derate or OCV

Two process corners {BC, WC} for inter-die variation

Assign the same corner to all paths on the same die

Report the worst setup WNS out of the four corner combinations (i.e., BC-BC, BC-WC, WC-BC, WC-WC)
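The four-combination check described above can be sketched as follows. The slack expression mirrors the one on this slide, with the launch clock path, launch FF and first data segment on the bottom die and the second data segment and capture clock path on the top die. The function names and all delay values are hypothetical, chosen only to make the example run.

```python
from itertools import product

CORNERS = ("BC", "WC")


def setup_slack(t_per, t_su, bottom, top):
    """Setup slack for one (bottom corner, top corner) combination.

    bottom: delays on the bottom die (c2q, data1, launch clock path)
    top:    delays on the top die (data2, capture clock path)
    """
    return (
        t_per - t_su
        - bottom["c2q"] - bottom["data1"] - top["data2"]
        + (top["capture"] - bottom["launch"])
    )


def worst_setup_slack(t_per, t_su, bottom_by_corner, top_by_corner):
    """Evaluate BC-BC, BC-WC, WC-BC and WC-WC and return the worst case."""
    slacks = {
        (cb, ct): setup_slack(t_per, t_su, bottom_by_corner[cb], top_by_corner[ct])
        for cb, ct in product(CORNERS, repeat=2)
    }
    worst = min(slacks, key=slacks.get)
    return worst, slacks[worst], slacks


# Hypothetical delays in ns (not from the presentation).
bottom = {
    "BC": {"c2q": 0.10, "data1": 0.20, "launch": 0.30},
    "WC": {"c2q": 0.20, "data1": 0.40, "launch": 0.60},
}
top = {
    "BC": {"data2": 0.15, "capture": 0.30},
    "WC": {"data2": 0.30, "capture": 0.60},
}
worst_corner, worst_slack, all_slacks = worst_setup_slack(1.0, 0.05, bottom, top)
```

With these numbers the limiting case is a slow bottom die paired with a fast top die, because the launch clock arrives late while the capture clock arrives early.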

Conventional 2D STA (without inter-die variation)

$$\text{Setup slack} = T_{per} - T_{su} - T_{c2q,WC} - T_{data1,WC} - T_{data2,WC} + (T_{capture,BC} - T_{launch,WC})$$

Our 3D STA

$$\text{slack}_1 = T_{per} - T_{su} - T_{c2q,BC} - T_{data1,BC} - T_{data2,BC} + (T_{capture,BC} - T_{launch,BC})$$

$$\text{slack}_2 = T_{per} - T_{su} - T_{c2q,BC} - T_{data1,BC} - T_{data2,WC} + (T_{capture,WC} - T_{launch,BC})$$

$$\text{slack}_3 = T_{per} - T_{su} - T_{c2q,WC} - T_{data1,WC} - T_{data2,BC} + (T_{capture,BC} - T_{launch,WC})$$

$$\text{slack}_4 = T_{per} - T_{su} - T_{c2q,WC} - T_{data1,WC} - T_{data2,WC} + (T_{capture,WC} - T_{launch,WC})$$

$$\text{slack} = \min(\text{slack}_1, \text{slack}_2, \text{slack}_3, \text{slack}_4)$$

[Figure: timing path annotated with Tlaunch, Tc2q, Tdata1, Tdata2 and Tcapture. Legend: Buffer on bottom die, Buffer on top die, FF on bottom die, FF on top die]

Slide18

Outline

Motivation

Power, Area and Timing Model

P&R and Timing Flow

Experimental Results

Conclusion

Slide19

Experimental Setup

P&R tool is Synopsys IC Compiler I-2013.12-SP1

Timing analysis tool is Synopsys PrimeTime H-2013.06-SP2

We use a 65nm TSMC library

Design of experiments

Bandwidth (10 – 200 GB/s)

Region area (25 – 100 mm²)

3DIO clock frequency

Synchronous (100 – 2000 MHz)

Source-synchronous (1500 – 4000 MHz)

Asynchronous (3500 – 8000 MHz)

Number of clusters (1 – 25)

We select four data points for each parameter

256 design implementations for each clocking scheme

Slide20

Model Fitting Approach

We use an Artificial Neural Network (ANN) model for our fit, guided by the directed graph

Iteratively progress through the directed graph to fit each node

Clock wirelength

Data wirelength

Clock buffer area

Data buffer area

3DIO power/area

Total Area/Power/WNS

We use the Fmax for the timing model (instead of WNS)

Multiple runs with different training, validation and test data sets improve the generality and robustness of the resulting models
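The data handling behind such a fit can be sketched as follows: enumerate the full-factorial grid of four parameters with four levels each (4^4 = 256 points per clocking scheme), then draw several independent training/validation/test splits. The parameter levels and the 70/15/15 split ratio below are hypothetical, since the deck gives only the parameter ranges. The ANN fit itself is omitted.

```python
import random
from itertools import product

# Hypothetical levels inside the stated ranges (synchronous scheme shown).
levels = {
    "bandwidth_GBps": [10, 50, 100, 200],
    "region_area_mm2": [25, 50, 75, 100],
    "f_io_mhz": [100, 500, 1000, 2000],
    "n_clusters": [1, 4, 9, 25],
}


def design_grid(levels):
    """Full-factorial grid: one dict per design implementation."""
    names = list(levels)
    return [dict(zip(names, values)) for values in product(*levels.values())]


def repeated_splits(n_points, n_runs, train_frac=0.70, val_frac=0.15, seed=0):
    """Return n_runs independent (train, validation, test) index splits."""
    rng = random.Random(seed)
    n_train = int(train_frac * n_points)
    n_val = int(val_frac * n_points)
    splits = []
    for _ in range(n_runs):
        order = list(range(n_points))
        rng.shuffle(order)
        splits.append((
            order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:],
        ))
    return splits


grid = design_grid(levels)
splits = repeated_splits(len(grid), n_runs=5)
```

Each split would feed one training run, and comparing the fitted models across runs gives a rough check on how sensitive the fit is to the choice of training data.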

[Figure: the 3DIO/CTS directed graph from Slide9, repeated]

Slide21

Area, Power and Timing Model Results

Min-Max error within +/-20%

For the synchronous scheme, the tool inserts a large number of hold buffers due to inter-die variation, which leads to larger error

Mean error within +/-0.5%

[Figure: result plots for Area, Power and Fmax]

Slide22

Design Space Results

Max BW:

Figure shows the iso-bandwidth curves

Vertical and horizontal walls show the min power/area required to hit a bandwidth requirement

Clocking scheme:

The asynchronous scheme is area-efficient

The synchronous scheme is power-efficient

The source-synchronous scheme provides a valuable tradeoff between power and area along the knee of the iso-bandwidth curve. The interesting tradeoffs between the schemes occur along these knee points as we change the power/area constraints.

[Figure: Max BW; Optimal clocking scheme for Max BW]

Slide23

Design Space Results

Cluster clock frequency:

As the power constraint gets tighter, frequency goes down

As the area constraint gets tighter, frequency goes up

Source-synchronous schemes provide benefits at higher cluster frequencies

The asynchronous scheme provides a way to keep the cluster frequency down but still have high 3DIO frequency, through serialization

Number of clusters:

Not monotonic along edges of hypercube and clocking scheme boundaries

Also sensitive to the total region area

[Figure: Optimal # of Clusters for Max BW; Optimal Cluster Frequency for Max BW]

Slide24

Outline

Motivation

Power, Area and Timing Model

P&R and Timing Flow

Experimental Results

Conclusion

Slide25

Conclusion

We have developed a power, area and timing model for 3DIO and CTS that includes clustering and three different clock synchronization schemes (synchronous, source-synchronous, asynchronous)

Our model estimates power, area and timing within 20% error across a large range of bandwidths, region areas, numbers of clusters and 3DIO frequencies

Our modeling methodology will enable architects to study and optimize the design space upfront

Key takeaways:

Iso-bandwidth lines identify the min area/power required to hit a particular BW

Clocking scheme tradeoffs are interesting along the knee of iso-bandwidth lines

Cluster frequency for asynchronous schemes can be kept low while still reducing the number of 3DIOs due to serialization

Slide26

Future Work

Extend our model to be aware of

Placement uniformity

Technology dependence

Datapath logic

More comprehensive STA including intra-die variation

Blockages

Asymmetric clustering

Different 3DIO placement

Serial 3DIO circuit options for asynchronous scheme

2.5D (interposer-based) design

Slide27

Thank you

Slide28

BACKUP

Slide29

Synchronous

All end points on both dies are synchronized

Colored FFs are uniformly distributed over the region

Non-colored FFs are placed right next to the 3DIO array

Clock tree is vulnerable to the inter-die variation

Use DDR to minimize the number of 3DIOs

Two factors determine the max 3DIO clock frequency (FIO)

Clock skew due to the inter-die variation

Jitter

Increasing #clusters increases max FIO because the clock tree becomes more robust to the inter-die variation

BW (GB/s)   Region Area (mm²)   Ncluster   Fmax (MHz)   FIO (MHz)
12          25                  1          640          1280
11.25       25                  25         900          1800
12          100                 1          300          600
11.25       100                 25         600          1200
50.025      25                  1          460          920
50.625      25                  25         900          1800
200.25      100                 1          300          600
202.5       100                 25         600          1200

Slide30

Source-Synchronous

Forward the clock from one die to the other

For any path across dies, the launch and capture path delays from source to 3DIO at the bottom die are balanced, so there is no inter-die variation

Requires balance delay Tb to compensate for clock insertion delay Tclk1

Two factors determine the max 3DIO clock frequency (FIO)

Skew between Tb and Tclk1 due to the intra-die variation

Jitter

BW (GB/s)   Region Area (mm²)   Ncluster   Fmax (MHz)   FIO (MHz)
12.095      25                  1          820          1640
10.625      25                  25         1700         3400
50.02       25                  1          820          1640
46.875      25                  25         1500         3000
12          100                 1          500          1000
15          100                 25         1200         2400
200         100                 1          350          700
195         100                 25         1200         2400

Slide31

Asynchronous

BW (GB/s)   Region Area (mm²)   Ncluster   Fmax (MHz)   FIO (MHz)
11.9        25                  1          700          5600
25          25                  25         1000         8000
49.7        25                  1          700          5600
40          25                  25         800          6400
12          100                 1          400          3200
20          100                 25         800          6400
200         100                 1          125          1000
200         100                 25         500          4000
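The BW and FIO columns in the table above appear consistent with each 3DIO carrying one bit per FIO cycle, so the number of 3DIOs can be back-computed from the two columns. That relation is an inference from the tabulated values, not a formula stated in the slides, and `n_3dio` is a hypothetical helper name.

```python
def n_3dio(bw_gbytes_per_s: float, f_io_mhz: float) -> int:
    """Number of data 3DIOs, assuming one bit per 3DIO per FIO cycle."""
    bits_per_s = bw_gbytes_per_s * 8e9
    return round(bits_per_s / (f_io_mhz * 1e6))


# (BW in GB/s, Fmax in MHz, FIO in MHz) for four rows of the asynchronous table.
rows = [(11.9, 700, 5600), (25, 1000, 8000), (200, 125, 1000), (200, 500, 4000)]

counts = [n_3dio(bw, f_io) for bw, _, f_io in rows]
ratios = [f_io / f_max for _, f_max, f_io in rows]
```

In every asynchronous row FIO is eight times Fmax, which matches the 8x serialization on that slide. In the synchronous and source-synchronous tables the ratio is two, matching DDR.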

Use FIFO (1:8 serializer, 8:1 deserializer) to separate clock domains

No inter-die variation

Minimize the number of 3DIOs

Requires PLLs for the cluster clock on the top die and for the IO clock on both dies

Large power overhead

One factor determines the max 3DIO clock frequency (FIO)

Jitter

Slide32

Flow of Synch. Clocking Schemes

BW: 12GB/s, Areg: 81mm², nc: 4, fclus: 1000MHz

Flow steps 1 and 2: custom placement on bottom/top dies; CTS, CTO and route on bottom die; CTS on top die; extract delay

Delay to balance the clock insertion delays across dies:

0.307 - 0.125 = 0.182ns (bc)

0.618 - 0.247 = 0.371ns (wc)

Input delay to prevent unnecessary hold buffer insertions:

0.307 + 0.089 - 0.182 = 0.214ns

[Figure: bottom-die and top-die paths through the cluster buffer, annotated with delays 0.307ns (bc) / 0.618ns (wc), 0.089ns (bc) / 0.200ns (wc), 0.125ns (bc) / 0.247ns (wc) and 0.147ns (bc) / 0.306ns (wc)]

Slide33

Flow of Synch. Clocking Schemes

Flow steps: CTS on top die; extract balance delay; CTO and route on top die; STA

Run CTO and route at worst corner considering hold time and clock uncertainty

Setup: 0.5 (half cycle) + 0.802 (tclk) - 0.075 (tunc) - 0.008 (ts) - 1.195 (tdata) = 0.024ns

Hold: 0.683 (tdata) - 0.576 (tclk) - 0.060 (tunc) - 0.030 (th) = 0.017ns

[Figure: panel 3 shows the top die with 0.247ns (wc) and 0.214ns; panel 4 shows two bottom-die/top-die paths through the cluster buffer, one annotated with 0.618ns (wc), 0.200ns (wc), 0.125ns (bc) and 0.306ns (wc), the other with 0.307ns (bc), 0.089ns (bc), 0.247ns (wc) and 0.147ns (bc); further annotations: 0.371ns (wc), 0.071ns (bc), 0.140ns (wc), 0.182ns (bc)]

Slide34

Flow of Synch. Clocking Schemes

BW: 12GB/s, Areg: 81mm², nc: 4, fclus: 1000MHz

Flow steps 1 and 2: custom placement on bottom/top dies; CTS, CTO and route on bottom die; CTS on top die; extract delay

Balance the delay from clock source to data 3DIO and the delay from clock source to clock 3DIO:

0.618 + 0.200 = 0.818ns

Annotate "balancing delay": 0.247ns

[Figure: bottom-die and top-die paths through the cluster buffer, annotated with delays 0.618ns, 0.200ns, 0.247ns and 0.306ns]

Slide35

Flow of Source synch. Clocking Schemes

Flow steps: CTS on top die; extract balance delay; CTO and route on top die; STA

Run CTO and route at worst corner considering hold time and clock uncertainty

Setup: 0.5 (half cycle) + 1.371 (tclk) - 0.075 (tunc) - 0.008 (ts) - 1.471 (tdata) = 0.317ns

Hold: 1.471 (tdata) - 1.371 (tclk) - 0.060 (tunc) - 0.030 (th) = 0.010ns

[Figure: panel 3 shows the top die with 0.247ns (wc) and the 0.247ns balancing delay; panel 4 shows bottom-die/top-die paths through the cluster buffer, annotated with 0.618ns, 0.200ns, 0.247ns and 0.306ns, plus 0.818ns, 0.100ns and 0.247ns on the balancing-delay paths]
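The setup and hold arithmetic quoted on the synchronous and source-synchronous flow slides can be reproduced with a short check. The function and argument names are illustrative; the numbers are the ones printed on the slides, in ns.

```python
def setup_slack(half_cycle, t_clk, t_unc, t_s, t_data):
    """Setup slack: available time (half cycle + capture clock delay)
    minus uncertainty, setup time and data arrival."""
    return half_cycle + t_clk - t_unc - t_s - t_data


def hold_slack(t_data, t_clk, t_unc, t_h):
    """Hold slack: data arrival minus capture clock delay,
    uncertainty and hold time."""
    return t_data - t_clk - t_unc - t_h


# Synchronous flow (values from the slides)
sync_setup = setup_slack(0.5, 0.802, 0.075, 0.008, 1.195)
sync_hold = hold_slack(0.683, 0.576, 0.060, 0.030)

# Source-synchronous flow (values from the slides)
ss_setup = setup_slack(0.5, 1.371, 0.075, 0.008, 1.471)
ss_hold = hold_slack(1.471, 1.371, 0.060, 0.030)

for name, value in [
    ("sync setup", sync_setup),
    ("sync hold", sync_hold),
    ("source-sync setup", ss_setup),
    ("source-sync hold", ss_hold),
]:
    print(f"{name}: {value:.3f} ns")
```

The 0.5 ns term is half of the 1 ns cluster clock period at fclus = 1000MHz.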