Power-Efficient Medical Image Processing using PUMA

Uploaded On 2016-12-05



Presentation Transcript

Slide1

Power-Efficient Medical Image Processing using PUMA

Ganesh Dasika, Kevin Fan¹, Scott Mahlke

¹Parakinetics, Inc.

University of Michigan
Advanced Computer Architecture Laboratory

Slide2

The Advent of the GPGPU

Increasingly popular substrate for HPC
  Astrophysics
  Weather prediction
  EDA
  Financial instrument pricing
  Medical imaging

Slide3

Advantages of GPGPUs

High degree of parallelism
  Data-level
  Thread-level
High bandwidth
Commodity products
Increasingly programmable

Slide4

Disadvantages of GPGPUs

Gap between computation and bandwidth
  933 GFLOPS : 142 GB/s bandwidth (0.15 B of data per FLOP, ~26:1 compute:mem ratio)
Very high power consumption
  Graphics-specific hardware
  Multiple thread contexts
  Large register files and memories
  Fully general datapath
Inefficiencies in all general-purpose architectures

Slide5
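The compute-to-bandwidth figures quoted on Slide4 (933 GFLOPS against 142 GB/s) can be cross-checked with a few lines of Python. This is only a sketch: the 4-byte word size is an assumption (32-bit operands), and the function names are illustrative.

```python
# Figures quoted on Slide4.
PEAK_GFLOPS = 933.0   # peak compute throughput, GFLOP/s
MEM_BW_GBPS = 142.0   # off-chip memory bandwidth, GB/s
WORD_BYTES = 4        # assumption: one 32-bit operand per memory word


def bytes_per_flop(gflops, gbps):
    """Bytes of off-chip data available per floating-point operation."""
    return gbps / gflops


def compute_to_mem_ratio(gflops, gbps, word_bytes=WORD_BYTES):
    """FLOPs the machine can issue per word it can fetch from memory."""
    words_per_sec = gbps / word_bytes  # in G words/s
    return gflops / words_per_sec


if __name__ == "__main__":
    print(f"{bytes_per_flop(PEAK_GFLOPS, MEM_BW_GBPS):.2f} B/FLOP")
    print(f"{compute_to_mem_ratio(PEAK_GFLOPS, MEM_BW_GBPS):.1f}:1")
```

Under the 4-byte assumption both quoted numbers are reproduced: roughly 0.15 B per FLOP and a ratio of about 26:1.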

Programmability vs Efficiency?

[Figure: design points plotted on Efficiency vs. Flexibility axes: Loop Accelerators and ASICs; Domain-specific Accelerators and GPGPUs; DSPs; FPGAs; General Purpose Processors; and an open point "???" labeled "Highly efficient, some programmability".]

Slide6

Medical Image Reconstruction

Compute-intensive loops
32-bit floating-point code
High data/bandwidth requirements
Increased demand for portability, low power
Much current research focuses on using GPGPUs for this domain

Slide7

CT Image Reconstruction

X-ray emitters and receptors on opposite sides of the patient
Received X-ray intensity corresponds to tissue density
Multiple scans (“slices”) taken around the patient are combined to reconstruct one 2D image

Slide8

Projection & Sinogram

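Sinogram: all projections
Projection: all ray-sums in a direction

[Figure: X-rays passing through an object f(x,y), drawn on x and y axes, produce a projection P(θ, t) along the detector axis t; a second panel shows the resulting sinogram.]

Slide9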

Example: Backprojection

[Figure: a sinogram and the corresponding backprojected image.]

Slide10

Example: Filtered Backprojection

[Figure: a filtered sinogram and the corresponding reconstructed image.]

Slide11

Reconstruction: Solve for m’s

[Figure: a 4x4 grid of unknown densities representing the “human body”, placed between an X-ray emitter and detectors:

m11 m12 m13 m14
m21 m22 m23 m24
m31 m32 m33 m34
m41 m42 m43 m44

Detector values shown: 16, 22, 11, 10 and 22, 12, 10, 15.]

Slide12

Real Reconstruction Problem

Intensity is measured for rays transmitted through multiple “pixels”
Find individual “pixel” values from transmission data
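A toy version of this problem can be sketched in Python using the 4x4 grid from Slide11. The sketch assumes that the two groups of detector values there are row sums and column sums of the grid (the slides do not say which is which), and the function names are illustrative. It shows plain, unfiltered backprojection and a mean-corrected variant; neither is the filtered or iterative method a real scanner would use.

```python
# Assumption: the two groups of detector values on Slide11 are read as
# row sums and column sums of the 4x4 density grid m11..m44.
ROW_SUMS = [16, 22, 11, 10]
COL_SUMS = [22, 12, 10, 15]


def backproject(row_sums, col_sums):
    """Smear each ray-sum evenly over the pixels that the ray crosses."""
    n_rows, n_cols = len(row_sums), len(col_sums)
    return [[row_sums[i] / n_cols + col_sums[j] / n_rows
             for j in range(n_cols)]
            for i in range(n_rows)]


def consistent_estimate(row_sums, col_sums):
    """Backprojection minus the double-counted mean density.

    The result reproduces both sets of ray-sums, but it is only one of
    many images that do so.
    """
    n_rows, n_cols = len(row_sums), len(col_sums)
    total = sum(row_sums)  # equals sum(col_sums) for consistent data
    offset = total / (n_rows * n_cols)
    return [[value - offset for value in row]
            for row in backproject(row_sums, col_sums)]


if __name__ == "__main__":
    for row in consistent_estimate(ROW_SUMS, COL_SUMS):
        print(" ".join(f"{value:6.2f}" for value in row))
```

With only two views the system is underdetermined, so the corrected image is just one of many that match the measurements. This is consistent with the slide's point that a real reconstruction needs ray-sums at hundreds of angles.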

[Figure: grid of unknown pixel values (“?”) with example measured ray-sums (534, 417, 364, 555, 501, 355, 255, 712, 199). The grid is 512 values by 512 values, measured along 100s of diagonals at 100s of angles.]

Slide13

Medical Imaging Applications

Image reconstruction for MRI/CT/PET scans
Large amounts of vector/thread-level parallelism
FP-intensive kernels
  Often requiring math library functions
Data-intensive (~5:1 compute:mem ratio)
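The ~5:1 figure can be checked against the per-benchmark Compute:Mem ratios listed for this slide's five benchmarks. The sketch below averages them and, as a rough roofline-style estimate that is not from the slides, compares the result with the ~26:1 ratio quoted for the GPGPU on Slide4. The function names are illustrative.

```python
# Compute:Mem ratios as listed in this slide's benchmark table.
RATIOS = {
    "Segmentation": 4.0,
    "Laplacian Filtering": 3.0,
    "Gaussian Convolution": 6.0,
    "MRI FH Vector": 6.0,
    "MRI Q Vector": 5.5,
}
GPGPU_RATIO = 26.0  # ~26:1 compute:mem ratio quoted on Slide4


def mean_ratio(ratios):
    """Arithmetic mean of the per-benchmark compute:mem ratios."""
    return sum(ratios.values()) / len(ratios)


def bandwidth_bound_fraction(app_ratio, machine_ratio):
    """Assumption: a bandwidth-bound kernel can keep at most
    app_ratio / machine_ratio of the machine's peak compute busy."""
    return min(1.0, app_ratio / machine_ratio)


if __name__ == "__main__":
    avg = mean_ratio(RATIOS)
    print(f"mean compute:mem ratio = {avg:.1f}:1")
    print(f"usable fraction of GPGPU peak = "
          f"{bandwidth_bound_fraction(avg, GPGPU_RATIO):.0%}")
```

The mean works out to 4.9:1, matching the quoted ~5:1. Under the stated bandwidth-bound assumption, such kernels could use only about a fifth of the GPGPU's peak compute.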

Benchmark             Inner-loop %Scalar/Vector           Outer-loop TLP  Compute:Mem ratio
Segmentation          Fully vectorizable                  Do-all          4:1
Laplacian Filtering   Fully vectorizable                  Do-all          3:1
Gaussian Convolution  Fully vectorizable with predicates  Do-all          6:1
MRI FH Vector         Fully vectorizable                  Do-all          6:1
MRI Q Vector          Fully vectorizable                  Do-all          5.5:1

Slide14

Current Concerns: Portability/Power

Currently, most scans require moving the patient to an imaging room
  Consumes time
  Stress on patient
Studies show benefits of portable, bedside scanners:
  86% increase in patients suitable for post-stroke thrombolytic therapy [Weinreb et al., RSNA]
  80-100% drop in scan-related complications [Gunnarsson et al., J. of Neurosurgery]
New X-ray emitters push for mAs of current use

Slide15

Current Concerns: Performance

High-accuracy CT algorithms take too long
  Iterative forward/backward projection
  ~Hours on modern CT scanners instead of minutes
Interventional radiology
  Scans currently take minutes, but should take seconds
CT fluoroscopy
  Several scans done in succession

Slide16

Flexibility

Software algorithms change over time
NRE (non-recurring engineering) costs
Time-to-market

Slide17

PUMA

Tiled architecture
Bandwidth-matched for improved efficiency
Each tile is a “Programmable Loop Accelerator”

[Figure: tiled PUMA system with an external interface to CPU, memory, disk, …]

Slide18

Programmable Loop Accelerator

Generalize accelerator without losing efficiency

[Figure: design points plotted on Efficiency/Performance vs. Flexibility axes: Loop Accelerators and ASICs; Programmable Loop Accelerators; Domain-specific Accelerators and GPGPUs; DSPs; FPGAs; General Purpose Processors; and a point marked "???".]

Slide19

Designing Loop Accelerators

[Figure: a loop in C code is mapped to hardware: FUs (+, &, *, <<, MEM), local memories, and BR and CRF blocks, joined by point-to-point connections.]

Slide20

Loop Accelerator Architecture

Hardware realization of modulo scheduled loop
Parameterized hardware:
  FUs
  Shift Register Files
Static Control
Point-to-point Interconnect

[Figure: loop accelerator datapath with FUs (+, &, MEM), local memory, and BR and CRF blocks on point-to-point connections, driven by control signals from an FSM.]

Slide21

Programmable Loop-Accelerator Architecture

[Figure: the LA datapath (FUs +, &; SRF blocks; FSM; point-to-point connections) next to the PLA datapath (FUs +/-, &/|, MEM; local memory; RR blocks; literals; CRF; BR; ring; control signals from a control memory).]

               LA                      PLA
Functionality  Custom FU set           Generalized FUs + MOVs
Storage        Limited size, no addr.  Rotating Reg. Files
Connectivity   Point-to-point          Ring + Port-swapping
Control        Hardwired Control       Lit. Reg. File + Control Mem

Slide22

MRI.FH PLA

~0.6 mm² per tile
38 FUs
128 32-bit registers
Inter-FU BW 1 TB/sec

FU Type    #
FP-ADDSUB  6
FP-MPY     9
I-ADDSUB   8
MEM        9
I-MPY      1
Other      5

Slide23

Performance on MRI.FH PLA
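The II in this slide's results is the initiation interval of the modulo-scheduled loop, i.e. the number of cycles between the starts of successive iterations. As a rough sketch, the standard resource-constrained lower bound on II can be computed from the FU counts listed for the MRI.FH PLA on Slide22. The per-iteration operation counts below are hypothetical, each FU is assumed to accept one operation per cycle, and the bound ignores recurrences, registers and interconnect, so it is not the scheduler used by the authors.

```python
import math

# FU counts from the MRI.FH PLA table on Slide22.
FU_COUNTS = {"FP-ADDSUB": 6, "FP-MPY": 9, "I-ADDSUB": 8,
             "MEM": 9, "I-MPY": 1, "Other": 5}


def res_mii(op_counts, fu_counts):
    """Resource-constrained lower bound on the initiation interval.

    op_counts maps an FU type to the number of operations of that type
    in one loop iteration. Returns None if the loop needs an FU type
    that the tile does not provide.
    """
    bound = 1
    for fu_type, n_ops in op_counts.items():
        if n_ops == 0:
            continue
        units = fu_counts.get(fu_type, 0)
        if units == 0:
            return None
        bound = max(bound, math.ceil(n_ops / units))
    return bound


if __name__ == "__main__":
    # Hypothetical loop bodies; these op counts are not from the slides.
    fits_exactly = {"FP-ADDSUB": 6, "FP-MPY": 9, "MEM": 9, "I-ADDSUB": 8}
    more_adds = {"FP-ADDSUB": 10, "FP-MPY": 9, "MEM": 9, "I-MPY": 2}
    print(res_mii(fits_exactly, FU_COUNTS))
    print(res_mii(more_adds, FU_COUNTS))
```

In this sketch a loop whose operation mix matches the FU mix keeps a bound of 1, while a loop needing 10 FP add/subtract operations on 6 units has a bound of 2. That is one way an II can double when a different loop is run on hardware sized for MRI.FH.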

[Figure: performance chart; legend: II preserved, II doubled, Unschedulable.]

Slide24

Efficiency on MRI.FH PLA

Slide25

PUMA System Design

5 systems designed around 5 benchmarks
Each composed of identical tiles
Assume same B/W as GTX280 (142 GB/s)
# Tiles based on B/W requirements of benchmark
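The tile-count rule on this slide can be sketched as a simple division of the available bandwidth by each tile's demand. Only the 142 GB/s figure comes from the slide; the per-tile traffic, the clock rate and the 4-byte word size are hypothetical placeholders, and the function names are illustrative.

```python
SYSTEM_BW_GBPS = 142.0  # from the slide: same bandwidth as GTX280
WORD_BYTES = 4          # assumption: 32-bit operands


def tile_bw_demand_gbps(words_per_cycle, clock_ghz, word_bytes=WORD_BYTES):
    """External bandwidth one tile needs to run without stalling."""
    return words_per_cycle * clock_ghz * word_bytes


def tiles_for_bandwidth(system_bw_gbps, tile_bw_gbps):
    """Largest number of identical tiles the external bandwidth can feed."""
    if tile_bw_gbps <= 0:
        raise ValueError("tile bandwidth demand must be positive")
    return int(system_bw_gbps // tile_bw_gbps)


if __name__ == "__main__":
    # Hypothetical tile: 2 words of external traffic per cycle at 0.5 GHz.
    demand = tile_bw_demand_gbps(words_per_cycle=2, clock_ghz=0.5)
    print(f"{demand} GB/s per tile, "
          f"{tiles_for_bandwidth(SYSTEM_BW_GBPS, demand)} tiles")
```

A benchmark with a lower compute:mem ratio moves more data per operation, so each tile demands more bandwidth and fewer tiles fit within the same 142 GB/s budget.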

[Figure: tiled PUMA system with an external interface to CPU, memory, disk, …]

Slide26

System Performance

[Figure: system performance chart; power annotations: 4 W, 3 W, 2.8 W, 2.3 W, 2.7 W.]

Slide27

Performance vs. GPGPU

63% performance of GTX 295
2X performance of GTS 250

Slide28

Efficiency vs. GPGPU

[Figure: efficiency comparison chart; annotations: 22X, 54X.]

Slide29

Conclusions

Power-efficient accelerator for medical imaging
ASIC-like efficiency with programmability
63-201% of GPU performance
22-54X GPU performance/power efficiency

Slide30

Thank you!!

Questions?