Slide1
Efficient IP Design flow for Low-Power
High-Level Synthesis Quick & Accurate Power Analysis and Optimization Flow
JAN.20.2014
Asher Berkovitz
Yaniv Fais
Slide2
Authors Contact Details
Asher Berkovitz
Asher.Berkovitz@freescale.com
+972- 09-9522511
Yaniv Fais
Yaniv.Fais@freescale.com
+972- 09-9522179
Freescale Semiconductor Israel
Herzelia, Shenkar 3
Slide3
Outline
Challenges
High Level Synthesis flow
Power Efficiency
Problems at RTL
Proposed VSIM++ Flow
Analysis
Optimization
Results on Networking Algorithm (Non-Abstract Version)
Conclusions
Slide4
Challenges
IP blocks for networking applications must meet tight power budgets while also meeting aggressive performance requirements.
Changes to the micro-architecture and to other high-abstraction modeling styles can deliver the largest benefits in overall power.
It is hard to measure power accurately at higher abstraction levels.
Accurate power measurement at signoff comes late in the design process, when high-level changes are no longer possible.
Slide5
High Level Synthesis design Flow
[Figure: design flow from Algorithms Definition to Macro-Architecture Definition, Bit-exact SystemC® Model, HLS, RTL and RTL2GDSII ("Normal" flow); Commands (uArch) and Cell library (.lib) feed the HLS step]
Macro-architecture definition:
Based on an accelerator base class
Uses unified modules (FIFOs, interfaces etc.)
SystemC® Model:
Architecture evaluation and RTL generation
Accurate data path description according to macro-architecture
Design to meet processing requirements
HLS:
Builds pipelined data path and control logic
Considers real timings during RTL generation
Explores implementation tradeoffs
HLS: SystemC® to RTL, quick exploration (timing/area)
Slide6
Power Dissipation
Static power: approximately test independent
Dynamic power: highly dependent on the application (signal transitions)
Signal transitions can be divided into:
Functional change
Glitch (a signal change that is not captured by a sequential element)
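The split between functional changes and glitches can be illustrated with a minimal C++ sketch. It is not part of the original flow: the input format (a list of values seen per clock cycle) and the names `ToggleCounts` and `classify_toggles` are assumptions made for illustration. A net change of value across a cycle is counted as one functional transition, and every other toggle inside that cycle as a glitch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toggle counts for one signal over a simulated window.
struct ToggleCounts {
    std::size_t functional = 0;  // changes that are captured at the clock edge
    std::size_t glitch = 0;      // extra toggles that settle before capture
};

// cycles[i] holds the values the signal took during clock cycle i, in time
// order (hypothetical input format). `initial` is the value before cycle 0.
ToggleCounts classify_toggles(bool initial,
                              const std::vector<std::vector<bool>>& cycles) {
    ToggleCounts counts;
    bool settled = initial;  // value captured at the previous clock edge
    for (const auto& samples : cycles) {
        std::size_t toggles = 0;
        bool current = settled;
        for (bool v : samples) {
            if (v != current) {
                ++toggles;
                current = v;
            }
        }
        // A net change across the cycle is one functional transition;
        // all remaining toggles in the cycle are counted as glitches.
        std::size_t functional = (current != settled) ? 1 : 0;
        assert(toggles >= functional);
        counts.functional += functional;
        counts.glitch += toggles - functional;
        settled = current;
    }
    return counts;
}
```

An RTL simulation only sees the settled value per cycle, so it would report the `functional` count alone; the `glitch` count needs gate-level delays.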
Glitches are not visible in RTL simulation and can contribute ~20% of power dissipation
Slide7
Fast & Accurate power analysis flow (VSIM)
Quick Physical Design (PD) flow:
Timing violations allowed
DRC violations allowed
Less than 100% RTL to GL equivalence
Customized test bench enables cycle-accurate gate-level simulation
Power analysis is performed using the gate-level netlist and parasitics file.
Power analysis results are mapped back to the RTL netlist.
[Figure: flow diagram with blocks Quick PD flow, RTL DB, Power Analysis, GLV simulation, Test bench generation, Mapping GL 2 RTL]
Slide8
Test Bench Generation
Based on RTL to GL mapping, force RTL values on GLV simulation
Advantages:
Short run time:
Simulate selected window
Force correct value @ time point X
[Figure: flip-flop (D/Q) chains comparing the two test benches]
Std' test bench: GL delay for logic cones (SDF); timing violation! Values are a bit "off"
"VSIM" test bench: force the RTL value on the key point; correct values forced; GL & SDF
Slide9
Gate level results mapping to RTL netlist
RTL netlist:
  reg [1:0] cond;
  reg [1:0] count;
  always @(posedge clk)
    if (cond == 2'b11)
      count <= count + 1;
[Figure: the RTL netlist next to the GL netlist; the GL netlist shows Cond_0, Cond_1, count_0, count_1 and a Clock Gate, annotated with per-node power values]
Map RTL 2 GL
For each unmapped GL instance:
Divide the power between drive/load key points
Assign GL key point power to RTL key point
The power of each RTL hierarchy is the sum of the power assigned to its key points
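The mapping steps above can be sketched in C++ as follows. This is an illustration under stated assumptions, not the tool's implementation: the data model (`GlInstance`), the function names, the equal split of unmapped power between the drive/load key points, and the use of `/` as the hierarchy separator are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One gate-level instance with its measured power (hypothetical data model).
struct GlInstance {
    double power;
    // Hierarchical RTL key point name if the instance is mapped, else empty.
    std::string rtl_key_point;
    // For unmapped instances: key points that drive or load this logic.
    std::vector<std::string> neighbor_key_points;
};

// Returns the power per RTL key point. Power of an unmapped instance is
// split equally between its drive/load key points (assumed split rule).
std::map<std::string, double> power_per_key_point(
        const std::vector<GlInstance>& instances) {
    std::map<std::string, double> result;
    for (const auto& inst : instances) {
        assert(inst.power >= 0.0);
        if (!inst.rtl_key_point.empty()) {
            result[inst.rtl_key_point] += inst.power;
        } else if (!inst.neighbor_key_points.empty()) {
            double share = inst.power /
                static_cast<double>(inst.neighbor_key_points.size());
            for (const auto& kp : inst.neighbor_key_points) {
                result[kp] += share;
            }
        }
    }
    return result;
}

// Sums key point power per RTL hierarchy, taken as the name up to the
// last '/' (assumed separator).
std::map<std::string, double> power_per_hierarchy(
        const std::map<std::string, double>& key_point_power) {
    std::map<std::string, double> result;
    for (const auto& entry : key_point_power) {
        std::string::size_type slash = entry.first.rfind('/');
        std::string hier = (slash == std::string::npos)
                               ? std::string()
                               : entry.first.substr(0, slash);
        result[hier] += entry.second;
    }
    return result;
}
```

With two mapped registers of power 10 and 6 and one unmapped cell of power 4 between them, each register ends up with 2 extra units, and the enclosing hierarchy reports the full 20.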
Slide10
Mapping results to high-level language (VSIM++)
Using annotation of C++ class names and variable names, as well as file names and line numbers, we can map power consumption from the accurate gate level back to the C++ code.
This capability allows us to:
Analyze and fix clock gating
Redesign "power hungry" resources
Consider different architectures
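A minimal C++ sketch of the back-annotation idea, assuming RTL register names carry a `_Ln<line>` suffix as in the slide's `my_var_Ln123` example. The function names and the per-register power input are hypothetical; the real flow also uses class names and file names, which this sketch leaves out.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Extracts the C++ source line from a name such as "count_Ln124".
// Returns -1 when the name carries no "_Ln<digits>" suffix.
int source_line_of(const std::string& rtl_name) {
    const std::string tag = "_Ln";
    std::size_t pos = rtl_name.rfind(tag);
    if (pos == std::string::npos) return -1;
    std::size_t start = pos + tag.size();
    if (start >= rtl_name.size()) return -1;
    int line = 0;
    for (std::size_t i = start; i < rtl_name.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(rtl_name[i]);
        if (!std::isdigit(c)) return -1;
        line = line * 10 + (c - '0');
    }
    return line;
}

// Sums per-register power by C++ source line. Names without an
// annotation are collected under key -1.
std::map<int, double> power_per_source_line(
        const std::vector<std::pair<std::string, double>>& rtl_power) {
    std::map<int, double> result;
    for (const auto& entry : rtl_power) {
        assert(entry.second >= 0.0);  // power values are non-negative
        result[source_line_of(entry.first)] += entry.second;
    }
    return result;
}
```

Sorting the resulting map by power would point at the source lines that hold the most "power hungry" registers.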
RTL netlist:
  reg [1:0] my_var_Ln123;
  reg [1:0] count_Ln124;
  always @(posedge clk)
    if (my_var_Ln123 == 2'b11)
      count_Ln124 <= count_Ln124 + 1;
C++ code (shown in the original slide with line numbers 121 to 127):
  void process() {
    …
    while (true) {
      if (my_var==3) count++;
      …
    }
  }
Slide11
Example problem identified
Tool inserts "clock gating" enabler code for RTL automatically
C++ process condition, after HLS:
  always @(posedge clk)
    if (en)
      data[511:0] <= new_data;
[Figure: DFF with inputs clk, en and new_data and output data]
Gate-level implementation is not implemented as a gated clock but as data logic, due to timing violations
Solution: simplify clock gating enablers to meet timing constraints
Slide12
Clock gating enabler simplification
Original clock gating scheme:
Complicated enable logic
Synthesized to an inefficient enabler
[Figure: DFFs with clk, en, new_data and data; blocks labeled Hash Key, Header and Process control]
Simplified clock gating scheme:
Enable synthesized w/o changes
Leading to high clock gating efficiency
[Figure: DFFs with clk, en, new_data and data; blocks labeled Hash Key and Process control]
Slide13
Conclusions
Use High Level Synthesis for IP Design
Quick and easy to explore architecture alternatives
Quick front-end flow including verification
Power analysis:
Measure power on system level scenario
Quick (doesn’t require full physical design flow convergence)
Accurate (done on gate-level)
Analysis and Optimization in high-level design (C++)
Manual clock gating enable setting reduced dynamic power consumption by 19.4%
Early in the design cycle: easy to change the IP architecture!
Slide14
Backup
Slide15
Accuracy
Measured using a similar methodology on a different design
Si measurement compared to full T/O gate level data
Test                                                Dynamic power accuracy
Single core Fast Fourier Transform                  -7.59%
Single core Fast Fourier Transform, no memory miss  -8.40%
Dual core Fast Fourier Transform                    7.57%