
Slide 1

Accelerating Persistent Scatterer Pixel Selection for InSAR Processing

Tahsin Reza†, Aaron Zimmer‡, Parwant Ghuman‡, Tanuj Kr Aasawat†, Matei Ripeanu†
†Electrical and Computer Engineering, University of British Columbia
‡3vGeomatics, Vancouver, British Columbia
†{treza, taasawat, matei}@ece.ubc.ca ‡{azimmer, parwant}@3vgeomatics.com

netsyslab.ece.ubc.ca

3vgeomatics.com

Slide 2

Interferometric Synthetic Aperture Radar (InSAR)

- Remote sensing technology for estimating displacement/deformation of the earth's surface
- Applications: monitoring earth subsidence and uplift due to urban infrastructure development, mining, oil and gas extraction, and permafrost thawing
- Satellites: Radarsat 1/2, ERS-1/2, ENVISAT, Sentinel 1A/1B, ALOS
- Radar signal: operating frequency 1-12 GHz

(Radarsat 1: image courtesy of CSA)

Slide 3

Interferometric Synthetic Aperture Radar (InSAR)

Uses differences in the phase (∆Φ) of the returning radar echo.

[Figure: before vs. after an earthquake. Ground movement changes the height of the surface, which appears as a phase difference ∆Φ between acquisitions.]
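For reference, the standard repeat-pass InSAR relation between line-of-sight displacement and phase (not spelled out on the slide) is

    \Delta\Phi = \frac{4\pi}{\lambda}\,\Delta r

where \Delta r is the line-of-sight displacement and \lambda the radar wavelength; the factor of 4\pi reflects the two-way signal path.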

Slide 4

Richmond, BC is Sinking

150 mm since 2009

Slide 5

Alyeska Pipeline, Alaska and Denali Earthquake 2003

Slide 6

Lions Gate Bridge, BC

High deformation

Slide 7

Mine in Chile

[Figure callouts: 20 mm, 12 mm, and 4 mm of deformation since 2006.]

Slide 8

Arctic Landslide

Slide 9

Challenges in InSAR Processing

Domain specific:
- Acquired data is noisy and prone to ambiguity
- Establishing ground truth is expensive and difficult

Computational:
- Large memory footprint
- Floating-point-heavy algorithms, e.g., FFT
- I/O intensive

Slide 10

Persistent Scatterer (PS) Pixel Selection

- Identify permanent objects, e.g., a hangar, not an airplane
- Avoid distributed scatterers: ambiguous signal echoes
- Ranking pixels helps improve the efficiency and quality of phase unwrapping

Slide 11

Persistent Scatterer (PS) Pixel Selection

State of the art, by Andy Hooper in 2006 [Hooper'06]:
- Requires model assumptions: free of different types of noise, height-error corrected
- Only considers an individual pixel's phase information

Slide 12

Persistent Scatterer (PS) Pixel Selection

New algorithm, PtSel:
- Does not require model assumptions
- Coherence computation considers nearby neighbours

[Figure: a search window around a pixel C, with arcs from C to each of its neighbours. The temporal arc coherence is computed per arc, where C is a pixel, N is one of its neighbours, and F is the neighbour count.]
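The coherence formula on this slide did not survive extraction. A common form of temporal arc coherence consistent with the variable names above, offered here as an assumption rather than the exact PtSel definition, is

    \gamma(C, N) = \frac{1}{n}\left|\sum_{i=1}^{n} e^{\,j\,(\Delta\Phi_i(C) - \Delta\Phi_i(N))}\right|

where \Delta\Phi_i(\cdot) is the phase in the i-th of n interferograms; each pixel C has F such arcs, one per neighbour N in its search window.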

Slide 13

InSAR Preliminaries

- SLC (single-look complex) images
- Interferogram: formed from a pair of SLC images, $I = \mathrm{SLC}_1 \cdot \mathrm{SLC}_2^{*}$
- Interferogram network: interferograms connecting acquisitions at times t1, t2, t3
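As an illustration of the preliminaries (not code from the talk), forming one interferogram from two co-registered SLC images is an element-wise conjugate product. A minimal CUDA sketch:

    #include <cuComplex.h>

    // One thread per pixel: ifg[i] = slc1[i] * conj(slc2[i]).
    // The interferometric phase is then atan2f(ifg[i].y, ifg[i].x).
    __global__ void form_interferogram(const cuFloatComplex* slc1,
                                       const cuFloatComplex* slc2,
                                       cuFloatComplex* ifg, int num_pixels) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < num_pixels)
            ifg[i] = cuCmulf(slc1[i], cuConjf(slc2[i]));
    }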

Slide 14

The PtSel Algorithm

Input: n interferograms
- Step 1: Create a network of interferograms
- Step 2: Compute temporal coherence on each arc (the figure shows a search window with eight arcs per pixel)
- Step 3: Find the arc with maximum coherence

Output: 2D image indicating the maximum coherence per pixel
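A minimal CUDA sketch of Steps 2 and 3 in the window-parallel mapping (one thread per pixel). This is an illustration, not the talk's code: the coherence form is the assumed one from Slide 12, and the stacked-phase layout is an assumption.

    // Temporal coherence of the arc (C, N): magnitude of the mean
    // phase-difference phasor over the n interferograms (assumed form;
    // see the formula sketched under Slide 12).
    __device__ float arc_coherence(const float* phase, int c, int nbr,
                                   int pixels_per_ifg, int n_ifgs) {
        float re = 0.0f, im = 0.0f;
        for (int i = 0; i < n_ifgs; ++i) {
            float d = phase[i * pixels_per_ifg + c] - phase[i * pixels_per_ifg + nbr];
            re += cosf(d);
            im += sinf(d);
        }
        return sqrtf(re * re + im * im) / n_ifgs;
    }

    // Steps 2-3, window-parallel: one thread per pixel scans its search
    // window, computes each arc's coherence, and keeps the maximum.
    __global__ void ptsel_window_parallel(const float* phase, float* max_coh,
                                          int width, int height,
                                          int window_radius, int n_ifgs) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;
        int c = y * width + x;
        float best = 0.0f;
        for (int dy = -window_radius; dy <= window_radius; ++dy)
            for (int dx = -window_radius; dx <= window_radius; ++dx) {
                if (dx == 0 && dy == 0) continue;  // no self-arc
                int nx = x + dx, ny = y + dy;
                if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
                best = fmaxf(best, arc_coherence(phase, c, ny * width + nx,
                                                 width * height, n_ifgs));
            }
        max_coh[c] = best;  // Step 3 output: maximum coherence per pixel
    }

With window_radius = 1, each pixel has the eight arcs shown on the slide.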

Slide 15

The Challenge: Volume of Data

- A 15,000 × 8,000-pixel interferogram: 1 GB
- Network of 100 interferograms: 5 TB
- Compute-to-memory ratio 1.5:1, fairly balanced and difficult to optimize
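Assuming 8-byte complex-float pixels (an assumption; the slide does not state the pixel format), the per-interferogram figure checks out:

    15{,}000 \times 8{,}000 \ \text{pixels} \times 8\ \text{B/pixel} \approx 0.96\ \text{GB} \approx 1\ \text{GB}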

Slide 16

GPU Architecture and Programming Model

- Hundreds of cores and thousands of threads
- Single Instruction Multiple Thread (SIMT)
- Memory request coalescing (illustrated below)
- 5-6x higher FLOP-rate and memory bandwidth than CPUs

Parallel Random Access Machine (PRAM) abstraction:
- Used by supercomputers like the Cray XMT
- Assumes a shared memory system, an infinite number of processors, and uniform memory latency
- Data-level parallelism, as opposed to instruction-level parallelism in SMP

Slide 17

Roofline Analysis: dual CPU vs. single GPU [Williams et al.'09]

[Figure, built up incrementally across Slides 17-20: a roofline plot with compute-bound and memory-bound regions. CPU peak memory bandwidth: 59 GB/s; GPU peak memory bandwidth: 288 GB/s; CPU and GPU peak GFLOP/s roofs. At the theoretical arithmetic intensity of 1.5 FLOP/byte, the bandwidth roofs give 88.5 GFLOP/s (CPU) and 432 GFLOP/s (GPU). The GPU's achieved arithmetic intensity of 7 FLOP/byte corresponds to 2016 GFLOP/s on the GPU roof; a 549 GFLOP/s level and an annotation reading "83% of peak CPU FLOP-rate" [Sterling et al.'06] are also marked.]

Slide 21

PtSel GPU: Overview

User inputs: interferograms, patch size, search window size.

Host:
- Divide the image into smaller patches
- Group patches into chunks based on the number of GPUs
- From each chunk, copy a patch to a GPU
- Copy output from the GPUs to a host buffer
- Write output to file

GPU:
- Kernel 1: temporal arc coherence compute kernel
- Kernel 2: max-reduction kernel (sketched below)

[Figure: GPU threads 0-15 mapped over the interferograms, in both window-parallel and arc-parallel arrangements.]
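In the arc-parallel variant, Kernel 1 writes each arc's coherence to a buffer and Kernel 2 reduces that buffer to one value per pixel. A sketch of what Kernel 2 could look like (the buffer layout and names are assumptions):

    // Kernel 2 (sketch): reduce per-arc coherences to a per-pixel maximum.
    // arc_coh is assumed pixel-major: arc_coh[p * arcs_per_pixel + a] is the
    // coherence of pixel p's a-th arc, as produced by Kernel 1.
    __global__ void max_reduce_arcs(const float* arc_coh, float* max_coh,
                                    int num_pixels, int arcs_per_pixel) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= num_pixels) return;
        float best = 0.0f;
        for (int a = 0; a < arcs_per_pixel; ++a)
            best = fmaxf(best, arc_coh[p * arcs_per_pixel + a]);
        max_coh[p] = best;
    }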

Slide 22

Evaluation Methodology

Metrics:
- CPU vs. GPU execution time (CPU baseline: a parallel OpenMP-based implementation)
- Scalability
- Energy consumption

Evaluation platform:
- Ivy Bridge dual-socket/16-core, 256 GB DDR3-1600
- Kepler GK110/6GB, GK110B/12GB

Workload: 1830 interferograms of 3260 × 2680 pixels each, about 2 TB

Slide 23

Evaluation: GPU Speedup and Scalability

[Figure: CPU vs. GPU execution time for various workload sizes, with the speedup of GPU over CPU.]

- 2 CPUs: 29 hours
- 1 GPU: 90 minutes (18x)
- 4 GPUs: 27 minutes (65x)

Slide 24

Evaluation: GPU Speedup and Scalability

[Figure: CPU vs. GPU execution time for different window sizes (left y-axis); speedup of GPU over CPU (right y-axis).]

Slide 25

Evaluation: Energy Consumption

Measured with a WattsUP meter (http://www.wattsupmeters.com).

            Idle system power   Average power draw   Energy consumption
  2x CPU    200 W               401 W                11.63 kWh
  4x GPU    280 W               893 W                0.4 kWh

A 29x improvement. Power rating (TDP): CPU 95 W, GPU 250 W.
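These figures are consistent with the runtimes reported on Slide 23, since energy is power times time:

    E_{2\times\text{CPU}} = 401\ \text{W} \times 29\ \text{h} \approx 11.63\ \text{kWh}, \qquad E_{4\times\text{GPU}} = 893\ \text{W} \times 0.45\ \text{h} \approx 0.40\ \text{kWh}

and 11.63 / 0.40 ≈ 29.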

Slide 26

Verifying Correctness

- PtSel identified 75% of the PS candidates detected by the state of the art [Hooper'06] (in pink)
- It also identified an additional 11% of pixels (in yellow) as stable

[Figure: area surrounding Vancouver airport.]

Slide 27

Impact

- Integrated into the production processing chain at 3vGeomatics
- Enables processing of very large data sets, which was previously not possible in a regular production setting
- New applications leverage the additional PtSel output

Slide 28

Lessons Learned

- Accelerator computing is important for HPC scientific applications
- Scale up / speed up at a fraction of the hardware and energy cost
- It takes effort and requires expertise, but the reward can be worth it

3vgeomatics.com
netsyslab.ece.ubc.ca

Slide 29

Thank you

3vgeomatics.com
netsyslab.ece.ubc.ca

Slide 30

Backup Slides

Slide 31

Future Work

- Explore opportunities for better locality and memory coalescing on the GPU
- Explore the likelihood of finding PS through up-sampling pixels: splitting a distributed scatterer into multiple persistent scatterers
- Explore possibilities for parameter auto-tuning, e.g., the search window size

Slide 32

References

A. Hooper, H. Zebker, P. Segall, and B. Kampes, "A new method for measuring deformation on volcanoes and other natural terrains using InSAR persistent scatterers," Geophys. Res. Lett., vol. 31, L23611, 2004.

S. Williams, A. Waterman, and D. Patterson, "Roofline: an insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, pp. 65-76, 2009.

T. Sterling, P. Kogge, W. J. Dally, S. Scott, W. Gropp, D. Keyes, and P. Beckman, "Multi-core for HPC: breakthrough or breakdown?," in Proc. of the 2006 ACM/IEEE Conference on Supercomputing, article 73, 2006.

Slide 33

PtSel Algorithm

[Algorithm listing not preserved in the transcript.]

Slide 34

PtSel CPU

[Figure: window-parallel CPU implementation. OpenMP threads 0-15 each take a pixel's search window and visit its neighbours in turn (pixel, next neighbour, visited).]
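For contrast with the GPU kernels above, a minimal sketch of a window-parallel CPU baseline (an illustration under the same assumed layout and coherence form, not 3vGeomatics' implementation):

    #include <omp.h>
    #include <cmath>
    #include <algorithm>

    // CPU analogue of the arc_coherence device function sketched earlier.
    static float arc_coherence_cpu(const float* phase, int c, int nbr,
                                   int pixels_per_ifg, int n_ifgs) {
        float re = 0.0f, im = 0.0f;
        for (int i = 0; i < n_ifgs; ++i) {
            float d = phase[i * pixels_per_ifg + c] - phase[i * pixels_per_ifg + nbr];
            re += std::cos(d);
            im += std::sin(d);
        }
        return std::sqrt(re * re + im * im) / n_ifgs;
    }

    // Window-parallel: OpenMP threads divide the pixels; each thread scans
    // one pixel's search window at a time, as in the figure above.
    void ptsel_cpu(const float* phase, float* max_coh, int width, int height,
                   int window_radius, int n_ifgs) {
        #pragma omp parallel for schedule(dynamic)
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x) {
                float best = 0.0f;
                for (int dy = -window_radius; dy <= window_radius; ++dy)
                    for (int dx = -window_radius; dx <= window_radius; ++dx) {
                        if (dx == 0 && dy == 0) continue;
                        int nx = x + dx, ny = y + dy;
                        if (nx < 0 || nx >= width || ny < 0 || ny >= height)
                            continue;
                        best = std::max(best, arc_coherence_cpu(
                            phase, y * width + x, ny * width + nx,
                            width * height, n_ifgs));
                    }
                max_coh[y * width + x] = best;
            }
    }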

Slide 35

PtSel GPU: Optimization

[Figure: pixel 0's search window overlaps pixel 5's search window. Because every arc is shared by its two endpoint windows, scanning only a "half-window" per pixel computes each arc once instead of twice.]
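One way to realize the half-window (a sketch of the idea, not the talk's code): each pixel enumerates only the neighbours that follow it in row-major order, so every arc is visited by exactly one of its two endpoint pixels.

    // Half-window scan for the pixel at (x, y): visit only neighbours with
    // dy > 0, or dy == 0 and dx > 0. Each arc (C, N) is then enumerated once,
    // by its row-major-first endpoint, halving the coherence work.
    for (int dy = 0; dy <= window_radius; ++dy) {
        int dx_start = (dy == 0) ? 1 : -window_radius;
        for (int dx = dx_start; dx <= window_radius; ++dx) {
            // compute the coherence of the arc between (x, y) and
            // (x + dx, y + dy), crediting the result to both endpoints
        }
    }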

Slide 36

PtSel GPU: Optimization

Overlapping communication with computation.

[Timeline figure: for each patch p, Stream 1 runs K0, K1, K2 and then D2H1 (output), while Stream 2 runs H2D2 to prefetch the patch data for p + 1 (except for the last patch); H2D1 carries the patch data only for p = 1, and host code H deletes p afterwards.]

Legend:
- p: patch ID
- H: host code
- S1: Stream 1; S2: Stream 2
- H2D: asynchronous host-to-device memory copy
- D2H: asynchronous device-to-host memory copy
- K0: kernel that identifies arcs to process (arc-parallel only)
- K1: kernel for temporal coherence computation
- K2: max-reduction kernel
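A double-buffered sketch of this schedule (illustrative only: the kernel names stand in for K0-K2, max_reduce_arcs is the Kernel 2 sketch from earlier, and all buffer and parameter names are assumptions):

    // K0 and K1 stand-ins, assumed defined elsewhere.
    __global__ void identify_arcs(float* patch);
    __global__ void arc_coherence_kernel(const float* patch, float* arc_coh);

    // While stream s1 processes patch p, stream s2 prefetches patch p + 1
    // into the other device buffer. h_patches[p] must be pinned host memory
    // for the async copies to overlap.
    void process_patches(float* const* h_patches, float* const* h_out,
                         float* d_buf[2], float* d_arc_coh, float* d_out,
                         size_t patch_bytes, size_t out_bytes, int num_patches,
                         dim3 grid, dim3 block,
                         int patch_pixels, int arcs_per_pixel) {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        // H2D1: only the first patch is copied on the compute stream.
        cudaMemcpyAsync(d_buf[0], h_patches[0], patch_bytes,
                        cudaMemcpyHostToDevice, s1);
        for (int p = 0; p < num_patches; ++p) {
            int cur = p & 1;
            if (p + 1 < num_patches)  // H2D2: prefetch p + 1 on stream 2
                cudaMemcpyAsync(d_buf[1 - cur], h_patches[p + 1], patch_bytes,
                                cudaMemcpyHostToDevice, s2);
            identify_arcs<<<grid, block, 0, s1>>>(d_buf[cur]);                   // K0
            arc_coherence_kernel<<<grid, block, 0, s1>>>(d_buf[cur], d_arc_coh); // K1
            max_reduce_arcs<<<grid, block, 0, s1>>>(d_arc_coh, d_out,
                                                    patch_pixels, arcs_per_pixel); // K2
            cudaMemcpyAsync(h_out[p], d_out, out_bytes,
                            cudaMemcpyDeviceToHost, s1);                         // D2H1
            cudaStreamSynchronize(s2);  // prefetch done before buffer reuse
            cudaStreamSynchronize(s1);  // output ready; host can delete patch p
        }
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }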

 

Slide 37

Systems used in experiments:

                       System A                      System B
  CPU                  Dual Intel Xeon E5-2650 v2    Dual Intel Xeon E5-2670 v2
  CPU cores/threads    16/32                         20/40
  System memory        DDR3, 256 GB                  DDR3, 512 GB
  PCIe                 3.0 x16                       3.0 x16
  GPU                  4x Nvidia GeForce GTX TITAN   2x Nvidia Tesla K40c
                       (GK110)                       (GK110B)
  GPU thread count     2048/multiprocessor           2048/multiprocessor
  GPU memory           GDDR5, 6 GB                   GDDR5, 12 GB

Slide 38

Roofline Model

The roofline model relates achievable performance (e.g., floating-point operations per second) and an application's arithmetic intensity to a processing platform's peak performance. Arithmetic intensity (AI) is the ratio of the application's FLOP count to the bytes it reads from DRAM. Attainable performance, measured in GFLOP/s, is

    \text{Attainable GFLOP/s} = \min\big(\text{peak GFLOP/s},\ \text{peak memory bandwidth (GB/s)} \times \text{AI (FLOP/byte)}\big)

This equation yields the roofline-shaped curve in the figure.
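Plugging in the numbers from Slides 17-20:

    59\ \text{GB/s} \times 1.5\ \text{FLOP/B} = 88.5\ \text{GFLOP/s} \quad \text{(CPU roof at the theoretical AI)}
    288\ \text{GB/s} \times 1.5\ \text{FLOP/B} = 432\ \text{GFLOP/s} \quad \text{(GPU roof at the theoretical AI)}
    288\ \text{GB/s} \times 7\ \text{FLOP/B} = 2016\ \text{GFLOP/s} \quad \text{(GPU roof at the achieved AI)}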

Slide 39

[Figure: CPU-only roofline, peak memory bandwidth 59 GB/s, with compute-bound and memory-bound regions under the CPU peak GFLOP/s roof. At the theoretical arithmetic intensity of 1.5, the bandwidth roof gives 88.5 GFLOP/s; at the achieved arithmetic intensity of 10.8, the achievable rate is 637.2 GFLOP/s, while the achieved rate is 18.31 GFLOP/s.]

Slide 40

Evaluation: PtSel GPU Scalability

[Figures: execution times for a varying number of GPUs (left y-axis), with speedup relative to a single GPU (right y-axis); and a comparison of execution time for different patch sizes.]

Slide 41

Evaluation: Window-Parallel vs. Arc-Parallel

[Figure: speedup of arc-parallel over window-parallel for different patch sizes on two workloads. The search window width is fixed at 35 pixels and the GPU count is four.]

Slide 42

Evaluation: Energy Consumption

Power ratings:
  1x CPU (TDP)            95 W
  1x GPU (TDP)            250 W
  DDR3 256 GB (loaded)    80 W
  Idle system power       280 W

Measured with a WattsUP meter (http://www.wattsupmeters.com):

            Average power draw   Energy consumption
  2x CPU    401 W                11.63 kWh
  4x GPU    893 W                0.4 kWh

A 29x improvement.