Improved Flop Tray-Based Design Implementation for Power Reduction - PowerPoint Presentation

Uploaded On 2017-10-08

Andrew B. Kahng, Jiajia Li and Lutong Wang, UC San Diego VLSI CAD Laboratory


Presentation Transcript

Slide1

Improved Flop Tray-Based Design Implementation for Power Reduction

Andrew B. Kahng, Jiajia Li and Lutong Wang
UC San Diego VLSI CAD Laboratory

Slide2

Outline

Background and Motivation
Related Work
Our Methodology
Experimental Setup and Results
Conclusion

Slide3

Flop Tray Benefits (1)

Flop tray = multi-bit flip-flop (MBFF)
Application of flop trays significantly reduces #sinks
Motivating "thought experiments"
  Replacing all single-bit flops in a clock tree (N sinks) with 64-bit flop trays can reduce #clock buffers by (N - N/64)/(N - 1) ≈ 98.4%
  In a clock tree with N = 100K, F = 8, replacing all single-bit flops with 64-bit flop trays can reduce #levels from 6 to 4
Fewer clock buffers, smaller clock power
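The two thought experiments can be checked with a few lines of arithmetic. The sketch below assumes an idealized balanced clock tree in which every buffer drives exactly F fanouts and the number of levels is the ceiling of log_F(#sinks); the function names are illustrative and not taken from the presentation.

```python
import math

def clock_tree_estimate(num_sinks: int, fanout: int) -> tuple[float, int]:
    """Return (approximate #buffers, #levels) for an idealized balanced tree.

    Assumption: #buffers ~ (N - 1)/(F - 1) and #levels = ceil(log_F N).
    """
    buffers = (num_sinks - 1) / (fanout - 1)
    levels = math.ceil(math.log(num_sinks, fanout))
    return buffers, levels

def buffer_reduction(num_sinks: int, tray_bits: int) -> float:
    """Fractional buffer reduction (N - N/K)/(N - 1) for K-bit flop trays."""
    return (num_sinks - num_sinks / tray_bits) / (num_sinks - 1)

n, f, k = 100_000, 8, 64
_, levels_1b = clock_tree_estimate(n, f)
_, levels_tray = clock_tree_estimate(math.ceil(n / k), f)

print(levels_1b, levels_tray)                    # 6 4
print(round(100 * buffer_reduction(n, k), 1))    # 98.4
```

The fanout F cancels out of the reduction ratio, which is why the 98.4% figure depends only on N and the tray size K.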

[Figure: clock tree with N sinks. Each buffer has F fanouts; log_F N levels; #Buffers ≈ (N - 1)/(F - 1)]
[Figure: clock tree after using K-bit flop trays, with N/K sinks. Each buffer has F fanouts; log_F (N/K) levels; #Buffers ≈ (N/K - 1)/(F - 1)]

Slide4

Flop Tray Benefits (2)

Inverters for clock signals are shared within a flop tray
  Power and area reductions
A recent work (Lin et al., TCAD 2015) achieves 22% flop power reduction by using 2-bit and 4-bit flop trays

[Figure: single-bit flop (master latch, slave latch, clk) versus 2-bit flop tray (two master/slave latch pairs, shared clk)]

Slide5

Challenges of Flop Tray Generation

Flops occupy large portion of block area
  In VGA, 30% of instances are flops → 51% of block area
Flop trays can have high aspect ratio and distinct size
  4-bit flop tray = 1 row x 63 sites
  64-bit flop tray = 4 rows x 244 sites
Clustering of flops imposes additional placement constraints
  Small clusters do not fully exploit flop tray benefits
  Large clusters may sacrifice datapath wirelength / power

[Figure: Power overhead on datapaths (flop tray w/ logical clustering vs. single-bit flop)]

Slide6

Outline

Background and Motivation
Related Work
Our Methodology
Experimental Setup and Results
Conclusion

Slide7

Related Works

Early-stage flop tray generation
  [Chen10] enables flop tray generation during synthesis
  [Hou09] splits flop trays to mitigate routing congestion
  But are not aware of physical layout
Flop tray generation during/after placement
  [Lin11] clusters flops by finding K-cliques in a merging graph
  [Jiang12] generates flop trays using interval graphs
  [Tsai13] guides placement of flops with bonding force
    Hard to define feasible displacement region
  But ignore the shape (AR) of flop trays and timing paths
Our work: flop tray generation considering flop displacement, timing paths and flop tray shapes

Slide8

Outline

Background and Motivation
Related Work
Our Methodology
Experimental Setup and Results
Conclusion

Slide9

Overall Optimization Flow

In blue are our optimizations
Initial placement w/ single-bit flops == "optimal" placement
Objectives
  Minimize displacement of flops
  Minimize timing impact
  Minimize #flop trays
Two-step optimization
  Capacitated K-means clustering (in dotted red boxes)
  ILP-based selection of flop trays

Slide10

Example of Overall Flow

[Figure: 4-bit only solution, 16-bit only solution, 64-bit only solution, and the ILP solution]
Design: AES
Technology: 28FDSOI

Slide11

Capacitated K-Means Clustering

Given N points (flops), a capacity of K (flop tray size), obtain (N/K) clusters.
Selection of starting points
  I. Randomly select one flop among single-bit flops
  II. For each flop (h), calculate the total Manhattan distance (d) from h to all selected flops
  III. Randomly select one new flop with probability proportional to d
  IV. Repeat Steps II and III until M flops are selected
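The starting-point selection can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "probability d" means a probability proportional to d, that already selected flops get zero weight, and that a uniform fallback is acceptable when all remaining flops coincide with selected ones. The function names are hypothetical.

```python
import random

def manhattan(p, q):
    """Manhattan distance between two (x, y) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def select_starting_points(flops, m, rng=None):
    """Pick m distinct starting flops from a list of (x, y) locations."""
    if m > len(flops):
        raise ValueError("cannot select more starting points than flops")
    rng = rng or random.Random()
    chosen = [rng.randrange(len(flops))]              # Step I: uniform pick
    while len(chosen) < m:                            # Step IV: repeat II and III
        # Step II: total Manhattan distance from each flop to all selected flops
        # (assumption: flops that are already selected get weight 0)
        weights = [
            0.0 if i in chosen else sum(manhattan(h, flops[c]) for c in chosen)
            for i, h in enumerate(flops)
        ]
        if sum(weights) == 0:
            # Assumed fallback: remaining flops coincide with selected ones
            candidates = [i for i in range(len(flops)) if i not in chosen]
            chosen.append(rng.choice(candidates))
        else:
            # Step III: draw one new flop with probability proportional to d
            chosen.append(rng.choices(range(len(flops)), weights=weights, k=1)[0])
    return [flops[i] for i in chosen]

flops = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5), (20, 20)]
starts = select_starting_points(flops, 3, random.Random(0))
print(starts)
```

Weighting by distance tends to spread the starting points across the placement, in the same spirit as k-means++ seeding.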

Min-cost flow-based clustering
Update of cluster centers
  Minimize $\sum_k d_k$
  Such that $|x_i + x'_{ij} - x_k| + |y_i + y'_{ij} - y_k| = d_k$
  flop location: $(x_k, y_k)$; flop tray location: $(x_i, y_i)$; relative slot location: $(x'_{ij}, y'_{ij})$
$h_k$: kth flop (point)
$t_i$: ith flop tray (cluster)
$f_{ij}$: jth slot on ith flop tray
$d_{k,ij}$: Manhattan distance between $h_k$ and $f_{ij}$
By considering distances between flops and slots, we are aware of flop tray ARs
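For a fixed assignment of flops to slots, the center-update objective separates into an x part and a y part, and a one-dimensional sum of absolute deviations is minimized by a median. The sketch below uses that observation. It is an illustration under stated assumptions: it ignores legalization to placement rows and sites, and the function names and the example coordinates are hypothetical.

```python
from statistics import median

def update_tray_location(assigned):
    """Return (x_i, y_i) minimizing sum_k |x_i + dx - x_k| + |y_i + dy - y_k|.

    assigned: list of ((x_k, y_k), (dx, dy)) pairs, one per flop in the cluster,
    where (dx, dy) is the relative location of the slot the flop is matched to.
    """
    xs = [xk - dx for (xk, _), (dx, _) in assigned]
    ys = [yk - dy for (_, yk), (_, dy) in assigned]
    return median(xs), median(ys)

def total_displacement(tray_xy, assigned):
    """Sum of Manhattan distances between flops and their matched slots."""
    xi, yi = tray_xy
    return sum(abs(xi + dx - xk) + abs(yi + dy - yk)
               for (xk, yk), (dx, dy) in assigned)

# Four flops matched to the slots of a 1 x 4 tray (slot pitch of 1 unit assumed)
assigned = [((2, 5), (0, 0)), ((4, 5), (1, 0)), ((3, 6), (2, 0)), ((9, 5), (3, 0))]
center = update_tray_location(assigned)
print(center, total_displacement(center, assigned))   # (2.5, 5.0) 7.0
```

Because the slot offsets enter the distance computation, a wide 1 x 4 tray and a 2 x 2 tray would generally receive different centers for the same flops, which is how the tray aspect ratio influences the result.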

[Figure: Initial center → Clustering → Cluster center update → Solution]

Slide12

Example on AES

Circles: initial flop locations
Red dots: flop tray locations

Slide13

Awareness of Flop Tray Shapes

Our clustering solution more closely matches the AR of flop trays → Smaller displacements

[Figure: layout without awareness of flop tray AR, avg. displacement = 15 μm; layout with awareness of flop tray AR, avg. displacement = 5 μm]
Design: AES
Technology: 28FDSOI

Slide14

ILP-Based Selection of Flop Tray Solutions

Formulate an ILP to select flop tray solutions with various flop tray sizes to minimize displacement, timing impact and flop tray cost

Minimize $\alpha \cdot W + D + \beta \cdot Z$

Such that

// flop displacements
$\left|\sum_{ij} (x_i + x'_{ij} - x_k) \cdot b_{k,ij}\right| + \left|\sum_{ij} (y_i + y'_{ij} - y_k) \cdot b_{k,ij}\right| = d_k$; $\sum_k d_k = D$

// relative displacements between timing-critical flop pairs
$\left|\sum_{ij} (x_i + x'_{ij} - x_k) \cdot b_{k,ij} - \sum_{i'j'} (x_{i'} + x'_{i'j'} - x_{k'}) \cdot b_{k',i'j'}\right| + \left|\sum_{ij} (y_i + y'_{ij} - y_k) \cdot b_{k,ij} - \sum_{i'j'} (y_{i'} + y'_{i'j'} - y_{k'}) \cdot b_{k',i'j'}\right| = z_{kk'}$; $\sum_{kk'} z_{kk'} = Z$

// cost of flop trays
$b_{k,ij} \le e_i$; $e_i \le \sum_{kj} b_{k,ij}$; $\sum_i (w_i \cdot e_i) = W$

// each flop has exactly one slot to match & each slot can have at most one flop to match
$\sum_{ij} b_{k,ij} = 1$; $\sum_k b_{k,ij} \le 1$
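To make the terms of the objective concrete, the sketch below evaluates α·W + D + β·Z for a given candidate assignment and checks the two matching constraints. It is an evaluator, not an ILP solver, and it is not the authors' code: the data layout (`loc`, `slots`, `cost`), the function name and all numbers in the example are hypothetical.

```python
def evaluate_solution(flops, trays, assignment, critical_pairs, alpha, beta):
    """Return (objective, D, Z, W) for one candidate flop-to-slot assignment.

    flops:          list of (x_k, y_k)
    trays:          list of dicts with "loc" (x_i, y_i), "slots" [(x'_ij, y'_ij)], "cost" w_i
    assignment:     dict k -> (i, j), i.e. b_{k,ij} = 1
    critical_pairs: list of (k, k') timing-critical flop pairs
    """
    # Each flop has exactly one slot; each slot holds at most one flop
    if sorted(assignment) != list(range(len(flops))):
        raise ValueError("each flop needs exactly one slot")
    if len(set(assignment.values())) != len(assignment):
        raise ValueError("a slot can hold at most one flop")

    # Displacement vector of every flop: (x_i + x'_ij - x_k, y_i + y'_ij - y_k)
    disp = {}
    for k, (i, j) in assignment.items():
        (xi, yi), (sx, sy) = trays[i]["loc"], trays[i]["slots"][j]
        xk, yk = flops[k]
        disp[k] = (xi + sx - xk, yi + sy - yk)

    D = sum(abs(dx) + abs(dy) for dx, dy in disp.values())
    Z = sum(abs(disp[k][0] - disp[kp][0]) + abs(disp[k][1] - disp[kp][1])
            for k, kp in critical_pairs)
    used = {i for i, _ in assignment.values()}        # trays with e_i = 1
    W = sum(trays[i]["cost"] for i in used)
    return alpha * W + D + beta * Z, D, Z, W

flops = [(0, 0), (2, 0), (10, 0), (12, 0)]
trays = [
    {"loc": (0, 0), "slots": [(0, 0), (1, 0)], "cost": 1.0},
    {"loc": (10, 0), "slots": [(0, 0), (1, 0)], "cost": 1.0},
    {"loc": (5, 0), "slots": [(0, 0), (1, 0), (2, 0), (3, 0)], "cost": 1.5},
]
two_small = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
one_large = {0: (2, 0), 1: (2, 1), 2: (2, 2), 3: (2, 3)}
small_result = evaluate_solution(flops, trays, two_small, [(0, 1)], alpha=2, beta=1)
large_result = evaluate_solution(flops, trays, one_large, [(0, 1)], alpha=2, beta=1)
print(small_result, large_result)
```

In this toy instance the single large tray has the lower tray cost W but a much larger displacement D, so with α = 2 the two small trays win; a sufficiently large α would reverse the choice.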

Notations

| Symbol | Meaning |
|---|---|
| $D$ | total displacement |
| $Z$ | total relative displacement of timing-critical flop pairs |
| $W$ | total cost of flop trays |
| $\alpha$, $\beta$ | weighting parameters |
| $(x_i, y_i)$ | location of ith flop tray |
| $(x'_{ij}, y'_{ij})$ | relative location of jth slot on ith flop tray |
| $(x_k, y_k)$ | location of kth flop |
| $b_{k,ij}$ | binary indicator whether kth flop is assigned to jth slot on ith flop tray |
| $e_i$ | binary indicator whether ith flop tray is selected |
| $w_i$ | cost of ith flop tray |

Slide15

Impact of α Value

Choice of α determines a tradeoff between clock power reduction and datapath power penalty
  Small value of α → Small-size flop trays, small displacement
  Large value of α → Large-size flop trays, large displacement

Slide16

Minimization of Relative Placement

Relative displacement between timing-critical start-end flop pairs degrades timing
  Move apart → wire↑ → delay↑
  Move closer → routing/placement congestion
Minimization of relative displacement reduces power penalty by 5%

[Figure labels: logic cone; move closer: placement/routing congestion; move apart: longer wire; 5%]

Slide17

Outline

Background and Motivation
Related Work
Our Methodology
Experimental Setup and Results
Conclusion

Slide18

Experimental Setup

Designs: AES, JPEG, MPEG, VGA (from OpenCores website)
Technology: 28nm FDSOI, dual-VT
Tools
  Synthesis: Synopsys Design Compiler vH-2013.12-SP3
  P&R: Cadence Innovus Implementation System v15.2
  Power/timing analysis: Cadence Innovus Implementation System v15.2
Candidate flop trays

| Tray size | 4-bit | 8-bit | 16-bit | 32-bit | 64-bit |
|---|---|---|---|---|---|
| Norm. area/power per bit | 0.875 | 0.854 | 0.854 | 0.844 | 0.844 |
| AR (#rows x #columns) | 1 x 4 | 2 x 4 | 4 x 4 | 4 x 8 | 4 x 16 |
| AR (#rows x #sites) | 1 x 63 | 2 x 62 | 4 x 62 | 4 x 122 | 4 x 244 |

Slide19

Power Benefits

Reference flows
  ref_1b: conventional implementation flow with single-bit flops
  ref_mb: flop tray-based implementation with logical clustering (flop tray generation during synthesis with commercial tools)
Up to 98% sink number reduction and 90% clock power reduction compared to ref_1b
Up to 16% more total power reduction and 40% more clock power reduction compared to ref_mb

| Design | Flow | Clock power (mW) | Total power (mW) | #Sinks |
|---|---|---|---|---|
| AES | ref_1b | 1.53 | 14.02 | 530 |
| AES | ref_mb | 0.72 | 13.35 | 227 |
| AES | opt_mb | 0.46 | 12.56 | 128 |
| JPEG | ref_1b | 13.37 | 84.54 | 4512 |
| JPEG | ref_mb | 6.1 | 76.2 | 1665 |
| JPEG | opt_mb | 2.28 | 69.24 | 515 |
| MPEG | ref_1b | 10.72 | 45.53 | 3181 |
| MPEG | ref_mb | 5.19 | 38.7 | 1316 |
| MPEG | opt_mb | 0.98 | 31.76 | 181 |
| VGA | ref_1b | 42.19 | 164.84 | 17053 |
| VGA | ref_mb | 20.73 | 138.99 | 7665 |
| VGA | opt_mb | 2.04 | 111.32 | 308 |
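The relative reductions can be recomputed from the values reported above. The sketch below is one way to do so. It assumes that "more reduction compared to ref_mb" means the percentage-point difference between the reductions of opt_mb and ref_mb, both measured against ref_1b; the slide does not spell out this definition, so the headline figures need not match digit for digit. The helper names are hypothetical.

```python
# (clock power in mW, total power in mW, #sinks) per design and flow, as reported above
results = {
    "AES":  {"ref_1b": (1.53, 14.02, 530),     "ref_mb": (0.72, 13.35, 227),
             "opt_mb": (0.46, 12.56, 128)},
    "JPEG": {"ref_1b": (13.37, 84.54, 4512),   "ref_mb": (6.1, 76.2, 1665),
             "opt_mb": (2.28, 69.24, 515)},
    "MPEG": {"ref_1b": (10.72, 45.53, 3181),   "ref_mb": (5.19, 38.7, 1316),
             "opt_mb": (0.98, 31.76, 181)},
    "VGA":  {"ref_1b": (42.19, 164.84, 17053), "ref_mb": (20.73, 138.99, 7665),
             "opt_mb": (2.04, 111.32, 308)},
}

def reduction_pct(base, new):
    """Relative reduction of `new` with respect to `base`, in percent."""
    return 100.0 * (base - new) / base

def summarize(design):
    """Reductions of opt_mb vs. ref_1b, and extra percentage points over ref_mb."""
    ref_1b, ref_mb, opt_mb = (results[design][f] for f in ("ref_1b", "ref_mb", "opt_mb"))
    return {
        "sink_reduction_vs_1b": reduction_pct(ref_1b[2], opt_mb[2]),
        "clock_reduction_vs_1b": reduction_pct(ref_1b[0], opt_mb[0]),
        "extra_total_points_vs_mb":
            reduction_pct(ref_1b[1], opt_mb[1]) - reduction_pct(ref_1b[1], ref_mb[1]),
        "extra_clock_points_vs_mb":
            reduction_pct(ref_1b[0], opt_mb[0]) - reduction_pct(ref_1b[0], ref_mb[0]),
    }

for design in results:
    print(design, {key: round(value, 1) for key, value in summarize(design).items()})
```

Under this reading, VGA gives a sink reduction of about 98% and between 16 and 17 additional percentage points of total power reduction over ref_mb.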

Slide20

Layout Examples

In red are flop trays and flops, in blue are combinational cells

Slide21

Optimization with Various Flop Tray Sizes

Flop tray-based optimization with various combinations of flop tray size candidates
Optimization with large-size (i.e., > 16-bit) flop trays achieves 11% more clock power reduction on average, especially on large designs

[Figure: results for AES, JPEG, MPEG and VGA under flop tray size candidate sets I to V]
I: 1 bit
II: {1, 4} bit
III: {1, 4, 8} bit
IV: {1, 4, 8, 16} bit
V: {1, 4, 8, 16, 32, 64} bit

Slide22

Study of Useful Skew Optimization

Comparison of useful skew benefits (= datapath leakage power reductions) across various flows
  ref_1b: design with only single-bit flops
  opt_mb: flop tray-based design (w/o skew-aware clustering)
  opt_mb (skew aware): flop tray-based design (w/ skew-aware clustering)
Skew-aware clustering achieves useful skew benefits similar to ref_1b, but with 21% less sink number reduction
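One reading of "21% less sink number reduction" that is consistent with the #sinks values reported on this slide is the difference, in percentage points, between the average sink reductions of opt_mb and opt_mb (skew aware) over the four designs. The sketch below computes that quantity. The interpretation and the design order (AES, JPEG, MPEG, VGA) are assumptions, and the names are hypothetical.

```python
from statistics import mean

# #sinks per flow; assumed design order: AES, JPEG, MPEG, VGA
sinks = {
    "ref_1b":      [530, 4512, 3181, 17053],
    "opt_mb":      [128, 515, 181, 308],
    "opt_mb_skew": [392, 1830, 205, 1245],
}

def avg_sink_reduction(flow):
    """Average sink reduction (in percent) of `flow` relative to ref_1b."""
    return mean(100.0 * (base - s) / base
                for base, s in zip(sinks["ref_1b"], sinks[flow]))

gap = avg_sink_reduction("opt_mb") - avg_sink_reduction("opt_mb_skew")
print(round(avg_sink_reduction("opt_mb"), 1),
      round(avg_sink_reduction("opt_mb_skew"), 1),
      round(gap, 1))
```

With these inputs the gap comes out at roughly 21 percentage points.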

#sinks

| Flow | AES | JPEG | MPEG | VGA |
|---|---|---|---|---|
| ref_1b | 530 | 4512 | 3181 | 17053 |
| opt_mb | 128 | 515 | 181 | 308 |
| opt_mb (skew aware) | 392 | 1830 | 205 | 1245 |

Slide23

Outline

Background and Motivation
Related Work
Our Methodology
Experimental Setup and Results
Conclusion

Slide24

Conclusion

A novel flop tray-based optimization with capacitated K-means algorithm
Up to 16% total block power reduction compared to logical clustering
Useful skew optimization in the context of flop tray-based design
Ongoing / Future works
  Scalable optimization considering all flop tray sizes
  Floorplan blockage awareness

Slide25

Thank you!

UCSD ABKGroup is grateful to Qualcomm, Samsung, NXP, the IMPACT+/C-DEN centers, Mentor Graphics and the NSF for research support. We thank IMEC and Cadence for additional research enablements and collaborations.