Slide1
Power-Efficient Medical Image Processing using PUMA
Ganesh Dasika, Kevin Fan¹, Scott Mahlke
University of Michigan, Advanced Computer Architecture Laboratory
¹Parakinetics, Inc.

Slide2
The Advent of the GPGPU
Increasingly popular substrate for HPC
  Astrophysics
  Weather Prediction
  EDA
  Financial instrument pricing
  Medical Imaging

Slide3
Advantages of GPGPUs
High degree of parallelism
  Data-level
  Thread-level
High bandwidth
Commodity products
Increasingly programmable

Slide4
Disadvantages of GPGPUs
Gap between computation and bandwidth
  933 GFLOPS : 142 GB/s bandwidth (0.15 B of data per FLOP, ~26:1 compute:mem ratio)
Very high power consumption
  Graphics-specific hardware
  Multiple thread contexts
  Large register files and memories
  Fully general datapath
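The two figures quoted for the compute/bandwidth gap can be reproduced with a few lines of arithmetic. The sketch below assumes that the ~26:1 ratio counts floating-point operations per 32-bit word fetched; the slide does not state the word size, so `WORD_BYTES` is an assumption.

```python
# Figures quoted on the slide.
PEAK_GFLOPS = 933.0   # peak compute throughput, GFLOP/s
MEM_BW_GBPS = 142.0   # off-chip memory bandwidth, GB/s
WORD_BYTES = 4        # assumption: operands are 32-bit floats


def bytes_per_flop(gflops, gbps):
    """Bytes of off-chip data available for each floating-point operation."""
    return gbps / gflops


def flops_per_word(gflops, gbps, word_bytes):
    """Operations that must be performed per fetched word to stay compute-bound."""
    words_per_second = gbps / word_bytes  # in units of 1e9 words/s
    return gflops / words_per_second


print(round(bytes_per_flop(PEAK_GFLOPS, MEM_BW_GBPS), 2))              # 0.15
print(round(flops_per_word(PEAK_GFLOPS, MEM_BW_GBPS, WORD_BYTES), 1))  # 26.3
```

A kernel that performs fewer operations per fetched word than this is limited by memory bandwidth rather than by peak compute.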
Inefficiencies in all general-purpose architectures

Slide5
Programmability vs Efficiency?
[Figure: chart with axes Efficiency and Flexibility. Labeled design points: Loop Accelerators, ASICs; Domain-specific Accelerators, GPGPUs; DSPs; FPGAs; General Purpose Processors. A "???" marks the open region: highly efficient, some programmability.]

Slide6
Medical Image Reconstruction
Compute-intensive loops
32-bit floating point code
High data/bandwidth requirements
Increased demand for portability, low power
Much current research focuses on using GPGPUs for this domain

Slide7
CT Image Reconstruction
X-ray emitters and receptors sit on opposite sides of the patient
Received X-ray intensity corresponds to tissue density
Multiple scans (“slices”) taken around the patient are combined to reconstruct one 2D image

Slide8
Projection & Sinogram
Sinogram: all projections
Projection: all ray-sums in a direction
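The two definitions above can be made concrete with a small sketch. It assumes a square image stored as a list of rows and uses a simple pixel-driven projector that drops each pixel into the nearest unit-width detector bin; the function name `projection`, the bin layout, and the toy `phantom` are illustrative choices, not taken from the slides.

```python
import math


def projection(image, theta_deg):
    """All ray-sums of a square image in one direction (pixel-driven sketch)."""
    n = len(image)
    centre = (n - 1) / 2.0
    theta = math.radians(theta_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    sums = [0.0] * (2 * n)  # 2n unit-width bins cover the image diagonal
    for row in range(n):
        for col in range(n):
            x, y = col - centre, row - centre
            t = x * cos_t + y * sin_t           # detector coordinate of this pixel
            sums[int(math.floor(t + n))] += image[row][col]
    return sums


# Toy 4x4 "patient" with a dense 2x2 core (hypothetical values).
phantom = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 0, 0, 0],
]

# The sinogram is the stack of projections over all measured directions.
sinogram = [projection(phantom, angle) for angle in (0, 45, 90, 135)]
for angle, row in zip((0, 45, 90, 135), sinogram):
    print(angle, row)
```

At 0 degrees the ray-sums are the column sums of the image and at 90 degrees they are the row sums; every projection adds up to the same total because each pixel lands in exactly one bin.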
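[Figure: X-rays pass through an object f(x,y), drawn on axes x and y; the ray-sums along the detector axis t form the projection P(t). A second panel shows the sinogram, with axis labels t and p.]

Slide9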
Example: Backprojection
[Figure: a sinogram and the corresponding backprojected image.]

Slide10
Example: Filtered Backprojection
[Figure: a filtered sinogram and the corresponding reconstructed image.]

Slide11
Reconstruction: Solve for m’s
[Figure: an X-ray emitter illuminates a 4x4 grid of unknown densities (the “human body”):
  m11 m12 m13 m14
  m21 m22 m23 m24
  m31 m32 m33 m34
  m41 m42 m43 m44
Detector values: 16, 22, 11, 10 and 22, 12, 10, 15.]

Slide12
Real Reconstruction Problem
Intensity measured
Rays transmitted through multiple “pixels”
Find individual “pixel” values from transmission data
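One classical way to recover pixel values from ray-sums is the algebraic reconstruction technique (Kaczmarz's method), which repeatedly corrects the pixels along each ray so that their sum matches the measurement. The slides do not say which solver is used, so the sketch below is only an illustration; the 2x2 image, the ray layout, and the function name `kaczmarz` are assumptions.

```python
def kaczmarz(rays, measured, n_pixels, sweeps=200):
    """Estimate pixel values from ray-sums by cyclic row projection (ART).

    Each ray is the list of pixel indices it crosses, with unit weight.
    """
    x = [0.0] * n_pixels
    for _ in range(sweeps):
        for ray, value in zip(rays, measured):
            residual = value - sum(x[i] for i in ray)
            step = residual / len(ray)  # spread the error evenly along the ray
            for i in ray:
                x[i] += step
    return x


# Hypothetical 2x2 image, pixels numbered 0 1 / 2 3.
true_pixels = [1.0, 2.0, 3.0, 4.0]
rays = [
    [0, 1], [2, 3],  # two horizontal rays
    [0, 2], [1, 3],  # two vertical rays
    [0, 3],          # one diagonal ray, needed to make the solution unique
]
measured = [sum(true_pixels[i] for i in ray) for ray in rays]

estimate = kaczmarz(rays, measured, n_pixels=4)
print([round(v, 3) for v in estimate])  # [1.0, 2.0, 3.0, 4.0]
```

Row and column sums alone leave this 2x2 system underdetermined, which is why the real problem measures hundreds of diagonals at hundreds of angles.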
[Figure: a grid of unknown pixel values (“?”), 512 values by 512 values, with sample measured ray-sums 534, 417, 364, 555, 501, 355, 255, 712, 199; 100's of diagonals at 100's of angles.]

Slide13
Medical Imaging Applications
Image reconstruction for MRI/CT/PET scans
Large amounts of Vector/Thread-level parallelism
FP-intensive kernels
Often requiring math library functions
Data-intensive (~5:1 compute:mem ratio)
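To illustrate what a fully vectorizable inner loop with a do-all outer loop looks like, here is a minimal sketch of a 5-point Laplacian filter. The stencil and border handling are assumptions for illustration; the benchmark's actual kernel and its operation counts are not given in the slides.

```python
def laplacian_filter(image):
    """5-point Laplacian of a 2D list of floats; border pixels are left at 0."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):        # outer loop: iterations are independent (do-all)
        for c in range(1, cols - 1):    # inner loop: same operation on every pixel
            out[r][c] = (image[r - 1][c] + image[r + 1][c]
                         + image[r][c - 1] + image[r][c + 1]
                         - 4.0 * image[r][c])
    return out


# A single bright pixel in a 5x5 image (hypothetical input).
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 1.0
for row in laplacian_filter(impulse):
    print(row)
```

No output pixel depends on another output pixel, so rows can be spread across threads or tiles and the pixels within a row across vector lanes.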
Benchmark            | Inner-loop %Scalar/Vector          | Outer-loop TLP | Compute:Mem ratio
---------------------|------------------------------------|----------------|------------------
Segmentation         | Fully vectorizable                 | Do-all         | 4:1
Laplacian Filtering  | Fully vectorizable                 | Do-all         | 3:1
Gaussian Convolution | Fully vectorizable with predicates | Do-all         | 6:1
MRI FH Vector        | Fully vectorizable                 | Do-all         | 6:1
MRI Q Vector         | Fully vectorizable                 | Do-all         | 5.5:1

Slide14
Current Concerns: Portability/Power
Currently, most scans require moving the patient to an imaging room
  Consumes time
  Stress on patient
Studies show benefits of portable, bed-side scanners:
  86% increase in patients suitable for post-stroke thrombolytic therapy [Weinreb et al., RSNA]
  80-100% drop in scan-related complications [Gunnarsson et al., J. of Neurosurgery]
New X-Ray emitters push for mAs of current use

Slide15
Current Concerns: Performance
High-accuracy CT algorithms take too long
  Iterative forward/backward projection
  ~Hours on modern CT scanners instead of minutes
Interventional radiology
  Scans currently take minutes, but should take seconds
CT-Fluoroscopy
  Several scans done in succession

Slide16
Flexibility
Software algorithms change over time
NRE
Time-to-market

Slide17
PUMA
Tiled architecture
Bandwidth-matched for improved efficiency
Each tile is a “Programmable Loop Accelerator”
[Figure: PUMA block diagram with labels Extern. Interface, CPU, Mem, Disk, …]

Slide18
Programmable Loop Accelerator
Generalize accelerator without losing efficiency
[Figure: chart with axes Efficiency, Performance and Flexibility. Labeled design points: Loop Accelerators, ASICs; Programmable Loop Accelerators; Domain-specific Accelerators, GPGPUs; DSPs; FPGAs; General Purpose Processors; and a "???" marker.]

Slide19
Designing Loop Accelerators
[Figure: a C code loop is mapped to hardware: function units (+, &, *, <<, MEM) with local memories, joined by point-to-point connections, plus BR and CRF blocks.]

Slide20
Loop Accelerator Architecture
Hardware realization of a modulo-scheduled loop
Parameterized hardware:
  FUs
  Shift Register Files
  Static Control
  Point-to-point Interconnect
[Figure: loop accelerator datapath with FUs (+, &, MEM) and local memories joined by point-to-point connections, an FSM issuing control signals, and CRF and BR blocks.]

Slide21
Programmable Loop-Accelerator Architecture

              | LA                     | PLA
--------------|------------------------|-----------------------------
Functionality | Custom FU set          | Generalized FUs + MOVs
Storage       | Limited size, no addr. | Rotating Reg. Files
Connectivity  | Point-to-point         | Ring + Port-swapping
Control       | Hardwired Control      | Lit. Reg. File + Control Mem

[Figure: the LA datapath (FUs +, &; SRFs; FSM; point-to-point connections) beside the PLA datapath (FUs +/-, &/|, MEM with local memory; RRs; literals; ring; control memory issuing control signals; CRF; BR).]

Slide22
MRI.FH PLA
~0.6 mm² per tile
38 FUs
128 32-bit registers
Inter-FU BW 1 TB/sec
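The FU mix of a tile bounds how quickly loop iterations can be started. In modulo scheduling, the resource-constrained lower bound on the initiation interval (II) is the largest value of ceil(operations of a type / FUs of that type). The sketch below uses the FU counts listed for this tile; the two loop bodies are hypothetical, and real schedules are also limited by recurrences and interconnect, which this bound ignores.

```python
import math

# FU counts listed for the MRI.FH tile on this slide.
TILE_FUS = {"FP-ADDSUB": 6, "FP-MPY": 9, "I-ADDSUB": 8,
            "MEM": 9, "I-MPY": 1, "Other": 5}


def res_mii(op_counts, fu_counts):
    """Resource-constrained minimum II, or None if a needed FU type is missing."""
    bound = 1
    for fu_type, n_ops in op_counts.items():
        if n_ops == 0:
            continue
        n_fus = fu_counts.get(fu_type, 0)
        if n_fus == 0:
            return None  # no unit can execute this operation type
        bound = max(bound, math.ceil(n_ops / n_fus))
    return bound


# Hypothetical loop bodies (operation counts per iteration).
fits_exactly = {"FP-ADDSUB": 6, "FP-MPY": 9, "MEM": 9}
oversubscribed = {"FP-ADDSUB": 10, "FP-MPY": 9, "MEM": 12, "I-MPY": 2}

print(sum(TILE_FUS.values()))             # 38
print(res_mii(fits_exactly, TILE_FUS))    # 1
print(res_mii(oversubscribed, TILE_FUS))  # 2
```

A loop whose operation mix matches the tile keeps its II, while one that oversubscribes any FU type must run at a larger II.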
FU Type   | #
----------|---
FP-ADDSUB | 6
FP-MPY    | 9
I-ADDSUB  | 8
MEM       | 9
I-MPY     | 1
Other     | 5

Slide23
Performance on MRI.FH PLA
[Figure: chart with legend entries II preserved, II doubled, and Unschedulable.]

Slide24
Efficiency on MRI.FH PLA

Slide25
PUMA System Design
5 systems designed around 5 benchmarks
Each composed of identical tiles
Assume same B/W as GTX280 (142 GB/s)
# Tiles based on B/W requirements of benchmark
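The tile count for a bandwidth-matched system follows from dividing the off-chip bandwidth budget by what one tile consumes. The sketch below shows that sizing rule; the per-tile bandwidth demands are hypothetical placeholders, since the slides do not list them.

```python
SYSTEM_BW_GBPS = 142.0  # from the slide: same off-chip bandwidth as a GTX280


def tiles_for_bandwidth(system_bw_gbps, tile_bw_gbps):
    """Largest number of identical tiles whose combined demand fits the budget."""
    if tile_bw_gbps <= 0:
        raise ValueError("per-tile bandwidth demand must be positive")
    return int(system_bw_gbps // tile_bw_gbps)


# Hypothetical per-tile demands in GB/s; not taken from the slides.
hypothetical_demand = {"benchmark_a": 10.0, "benchmark_b": 20.0}

for name, demand in hypothetical_demand.items():
    print(name, tiles_for_bandwidth(SYSTEM_BW_GBPS, demand))
# benchmark_a 14
# benchmark_b 7
```

Adding tiles beyond this count would leave them waiting on memory, which is the compute/bandwidth imbalance the design aims to avoid.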
[Figure: PUMA system block diagram with labels Extern. Interface, CPU, Mem, Disk, …]

Slide26
System Performance
[Figure: system performance chart; power labels 4 W, 3 W, 2.8 W, 2.3 W, 2.7 W.]

Slide27
Performance vs. GPGPU
63% performance of GTX 295
2X performance of GTS 250

Slide28
Efficiency vs. GPGPU
[Figure: efficiency chart with labeled values 22X and 54X.]

Slide29
Conclusions
Power-efficient accelerator for medical imaging
ASIC-like efficiency with programmability
63-201% of GPU performance
22-54X GPU Performance/Power efficiency

Slide30
Thank you!!
Questions?