Accelerating Persistent Scatterer Pixel Selection for InSAR Processing
Tahsin Reza†, Aaron Zimmer‡, Parwant Ghuman‡, Tanuj Kr Aasawat†, Matei Ripeanu†
†Electrical and Computer Engineering, University of British Columbia
‡3vGeomatics, Vancouver, British Columbia
†{treza, taasawat, matei}@ece.ubc.ca  ‡{azimmer, parwant}@3vgeomatics.com
netsyslab.ece.ubc.ca
3vgeomatics.com
Interferometric Synthetic Aperture Radar (InSAR)
Remote sensing technology for estimating displacement/deformation of the earth's surface
Applications: monitoring earth subsidence and uplift due to urban infrastructure development, mining, oil and gas extraction, and permafrost thawing
Satellites: Radarsat 1/2, ERS-1/2, ENVISAT, Sentinel 1A/1B, ALOS
Radar signal: operating frequency 1-12 GHz
(Radarsat 1: image courtesy of CSA)
Interferometric Synthetic Aperture Radar (InSAR)
Uses differences in the phase (∆Φ) of the returning radar echo
(Figure: radar acquisitions before and after an earthquake; the phase difference ∆Φ between them captures the movement, i.e., the change in height)
Richmond, BC is Sinking
150 mm since 2009
Alyeska Pipeline, Alaska, and the Denali Earthquake 2003
Lions Gate Bridge, BC
High deformation
Mine in Chile
(Deformation at different points: 20 mm, 12 mm, and 4 mm since 2006)
Arctic Landslide
Challenges in InSAR Processing
Domain specific:
- Acquired data is noisy and prone to ambiguity
- Establishing ground truth is expensive and difficult
Computational:
- Large memory footprint
- Floating-point computation-heavy algorithms, e.g., FFT
- I/O intensive
Persistent Scatterer (PS) Pixel Selection
Identify permanent objects, e.g., a hangar, not an airplane
Avoid distributed scatterers: ambiguous signal echoes
Ranking pixels helps improve the efficiency and quality of phase unwrapping
Persistent Scatterer (PS) Pixel Selection
State of the art: by Andy Hooper in 2006 [Hooper'06]
Requires model assumptions: free of different types of noise, height-error corrected
Only considers an individual pixel's phase information
Persistent Scatterer (PS) Pixel Selection
New algorithm: PtSel
Does not require model assumptions
Coherence computation considers nearby neighbours
Temporal arc coherence, computed over the n interferograms for the arc between a pixel C and one of its F neighbours N in the search window:

γ(C, N) = | (1/n) · Σᵢ₌₁ⁿ exp(j · ∆φᵢ(C, N)) |

where ∆φᵢ(C, N) is the phase difference between C and N on interferogram i, and F is the neighbour count.
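A minimal sketch of this per-arc coherence in Python (the slides do not show the CUDA implementation; the function name and phase lists here are illustrative):

```python
import cmath

def arc_coherence(phase_c, phase_n):
    """Temporal coherence of the arc between pixel C and neighbour N:
    the magnitude of the mean complex phasor of the per-interferogram
    phase differences. Values near 1 mean the two pixels move together."""
    n = len(phase_c)
    phasor_sum = sum(cmath.exp(1j * (pc - pn))
                     for pc, pn in zip(phase_c, phase_n))
    return abs(phasor_sum / n)

# Identical phase histories give perfect coherence:
print(arc_coherence([0.1, 0.5, 2.0], [0.1, 0.5, 2.0]))  # 1.0
```

A constant phase offset between C and N across the stack also scores 1, since only the variation of the phase difference over time reduces coherence.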
InSAR Preliminaries
(Figure: two SLC images, SLC1 and SLC2, acquired at different times are combined into an interferogram; interferograms between acquisitions at times t1, t2, t3 form an interferogram network)
The PtSel Algorithm
Input: n interferograms
Step 1: Create a network of interferograms
Step 2: Compute the temporal coherence on each arc (a search window with eight arcs)
Step 3: Find the arc with the maximum coherence
Output: a 2D image indicating the maximum coherence
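Steps 2 and 3 for a single pixel can be sketched in plain Python (illustrative, not the CUDA kernels; a 3 × 3 search window gives the eight arcs mentioned above):

```python
import cmath

def max_arc_coherence(stack, row, col):
    """stack: n interferograms, each a 2D list of phase values.
    Step 2: compute the temporal coherence on each of the eight arcs
    from pixel (row, col) to its window neighbours.
    Step 3: return the maximum coherence over those arcs."""
    n = len(stack)
    rows, cols = len(stack[0]), len(stack[0][0])
    best = 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the centre pixel itself
            r, c = row + dr, col + dc
            if not (0 <= r < rows and 0 <= c < cols):
                continue  # neighbour falls outside the image
            phasor_sum = sum(cmath.exp(1j * (img[row][col] - img[r][c]))
                             for img in stack)
            best = max(best, abs(phasor_sum / n))
    return best
```

Running this for every pixel produces the output image of Step 3; the GPU version parallelises this work over windows or arcs instead of looping.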
The Challenge: Volume of Data
A 15,000 × 8,000-pixel interferogram: 1 GB
A network of 100 interferograms: 5 TB
Compute-to-memory ratio of 1.5:1; fairly balanced and difficult to optimize
GPU Architecture and Programming Model
Hundreds of cores and thousands of threads
Single Instruction Multiple Thread (SIMT)
Memory request coalescing
5-6x higher FLOP rate and memory bandwidth than CPUs
Parallel Random Access Machine (PRAM) abstraction
Used by supercomputers like the Cray XMT
Assumes a shared-memory system, an infinite number of processors, and uniform memory latency
Data-level parallelism, as opposed to the instruction-level parallelism of SMP
Roofline Analysis: dual CPU vs. single GPU [Williams et al.'09]
(Figure: roofline plot, built up over four slides; the labels below are recovered from the figure)
Memory-bound region vs. compute-bound region
CPU peak memory bandwidth: 59 GB/s; GPU peak memory bandwidth: 288 GB/s
Theoretical arithmetic intensity: 1.5 FLOP/byte, giving attainable performance of 88.5 GFLOP/s on the CPU and 432 GFLOP/s on the GPU
GPU achieved arithmetic intensity: 7 FLOP/byte, giving 2016 GFLOP/s attainable; 549 GFLOP/s achieved
83% of peak CPU FLOP-rate [Sterling et al.'06]
PtSel GPU: Overview
User inputs: interferograms, patch size, search window size
Host: divide the image into smaller patches; group the patches into chunks based on the number of GPUs; from each chunk, copy one patch to each GPU
Kernel 1: temporal arc coherence compute kernel
Kernel 2: max-reduction kernel
Host: copy the output from the GPUs to a host buffer; write the output to file
(Figure: GPU threads mapped over the interferograms, either window-parallel or arc-parallel)
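The host-side patch bookkeeping can be sketched as follows (hypothetical Python helpers, not code from the paper): the image is tiled into patches, and the patches are grouped into chunks of one patch per GPU.

```python
def make_patches(height, width, patch_size):
    """Tile an image into (top, left, bottom, right) patch coordinates;
    edge patches may be smaller than patch_size."""
    return [(r, c, min(r + patch_size, height), min(c + patch_size, width))
            for r in range(0, height, patch_size)
            for c in range(0, width, patch_size)]

def chunk_patches(patches, num_gpus):
    """Group patches into chunks; within a chunk, patch i is copied
    to GPU i, so every GPU works on one patch at a time."""
    return [patches[i:i + num_gpus] for i in range(0, len(patches), num_gpus)]

patches = make_patches(100, 100, 50)  # four 50x50 patches
chunks = chunk_patches(patches, 4)    # one chunk for four GPUs
print(len(patches), len(chunks))      # 4 1
```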
Evaluation Methodology
CPU vs. GPU execution time; CPU baseline: a parallel OpenMP-based implementation
Scalability
Energy consumption
Evaluation platform: Ivy Bridge dual-socket/16-core, 256 GB DDR3-1600; Kepler GK110/6GB and GK110B/12GB GPUs
Workload: 3260 × 2680 pixels, 1830 interferograms, about 2 TB
Evaluation: GPU Speedup and Scalability
CPU vs. GPU execution time for various workload sizes; speedup of GPU over CPU:
2 CPUs: 29 hours
1 GPU: 90 minutes (18x)
4 GPUs: 27 minutes (65x)
Evaluation : GPU Speedup and Scalability
CPU vs. GPU execution time for different window sizes (left y-axis); speedup of GPU over CPU (right y-axis).
Evaluation: Energy Consumption
Measured with a WattsUP meter (http://www.wattsupmeters.com)

         Idle system power   Average power consumption   Energy consumption
2x CPU   200 W               401 W                       11.63 kilowatt-hours
4x GPU   280 W               893 W                       0.4 kilowatt-hours

29x improvement in energy consumption
Power rating (TDP): CPU 95 W, GPU 250 W
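The energy numbers follow directly from the measured average power and the run times reported earlier (29 hours on 2 CPUs, 27 minutes on 4 GPUs); a quick arithmetic check:

```python
cpu_kwh = 401 * 29 / 1000          # 401 W for 29 hours
gpu_kwh = 893 * (27 / 60) / 1000   # 893 W for 27 minutes
print(round(cpu_kwh, 2))           # 11.63
print(round(gpu_kwh, 2))           # 0.4
print(round(cpu_kwh / gpu_kwh))    # 29
```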
Verifying Correctness
PtSel identified 75% of the PS candidates detected by the state of the art [Hooper'06] (in pink)
It also identified an additional 11% of pixels (in yellow) as stable
(Area surrounding Vancouver airport)
Impact
Integrated into the production processing chain at 3vGeomatics
Enables processing very large data sets, which was not possible before in a regular production setting
New applications leverage the additional PtSel output
Lessons Learned
Accelerator computing is important for HPC scientific applications
Scale up / speed up at a fraction of the hardware and energy cost
It takes effort and requires expertise, but the reward can be worth it
3vgeomatics.com
netsyslab.ece.ubc.ca
Thank you
3vgeomatics.com
netsyslab.ece.ubc.ca
Backup Slides
Future Work
Explore opportunities for better locality and memory coalescing on the GPU
Explore the likelihood of finding PS through up-sampling pixels: splitting a distributed scatterer into multiple persistent scatterers
Explore possibilities for parameter auto-tuning, e.g., the search window size
References
A. Hooper, H. Zebker, P. Segall, and B. Kampes, "A new method for measuring deformation on volcanoes and other natural terrains using InSAR persistent scatterers," Geophys. Res. Lett., vol. 31, p. L23611, 2004.
S. Williams, A. Waterman, and D. Patterson, "Roofline: an insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, pp. 65-76, 2009.
T. Sterling, P. Kogge, W. J. Dally, S. Scott, W. Gropp, D. Keyes, and P. Beckman, "Multi-core for HPC: breakthrough or breakdown?," in Proc. of the 2006 ACM/IEEE Conference on Supercomputing, article 73, 2006.
PtSel Algorithm
(Figure: pseudocode of the PtSel algorithm)
PtSel CPU
(Figure: OpenMP threads, window-parallel; each thread owns the search window around a pixel and visits its neighbours in turn, tracking the next neighbour and those already visited)
PtSel GPU: Optimization
(Figure: pixel 0's search window and pixel 5's search window overlap in a shared "half-window")
PtSel GPU: Optimization
Overlapping communication with computation
Legend: p: patch ID; H: host code; S1: stream 1; S2: stream 2; H2D: asynchronous host-to-device memory copy; D2H: asynchronous device-to-host memory copy; K0: kernel that identifies arcs to process (arc-parallel only); K1: temporal coherence computation kernel; K2: max-reduction kernel
Timeline: S2:H2D1 (patch data, only for p = 1); then S1:K0, S1:K1, S1:K2 (process p), while S2:H2D2 copies the patch data for p + 1 (except for the last patch); then S1:D2H1 (output), and the host deletes p
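The overlap pattern can be mimicked on the host with a one-worker prefetch thread: while patch p is being processed, the copy for patch p + 1 is already in flight. This Python sketch only simulates the schedule; the load/process functions are stand-ins, not real H2D copies or kernels:

```python
from concurrent.futures import ThreadPoolExecutor

def load_patch(p):
    """Stand-in for the asynchronous H2D copy of patch p's data (S2)."""
    return [p * 10 + i for i in range(4)]

def process_patch(data):
    """Stand-in for the kernels K0/K1/K2 on one patch (S1)."""
    return sum(data)

def pipeline(num_patches):
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(load_patch, 0)       # copy for p = 0
        for p in range(num_patches):
            data = pending.result()                  # wait for patch p's copy
            if p + 1 < num_patches:                  # start copying p + 1 ...
                pending = copier.submit(load_patch, p + 1)
            results.append(process_patch(data))      # ... while processing p
    return results

print(pipeline(3))  # [6, 46, 86]
```

On the GPU the same double-buffering is expressed with two CUDA streams instead of a thread, so the copy engine and the compute engine run concurrently.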
Systems used in experiments

                    System A                               System B
CPU                 Dual Intel Xeon E5-2650 v2             Dual Intel Xeon E5-2670 v2
CPU cores/threads   16/32                                  20/40
System memory       DDR3, 256 GB                           DDR3, 512 GB
PCIe                3.0 x16                                3.0 x16
GPU                 4x Nvidia GeForce GTX TITAN (GK110)    2x Nvidia Tesla K40c (GK110B)
GPU thread count    2048 per multiprocessor                2048 per multiprocessor
GPU memory          GDDR5, 6 GB                            GDDR5, 12 GB
Roofline Model
The roofline model relates the achievable performance of an application, e.g., floating-point operations (FLOP) per second, and its arithmetic intensity to a processing platform's peak performance.
Arithmetic intensity (AI) for an application is the ratio of its generated FLOP count to the bytes read from DRAM.
Attainable performance for a platform, measured in GFLOP/s, is computed as:
attainable GFLOP/s = min(peak GFLOP/s of the platform, peak memory bandwidth (GB/s) of the platform × arithmetic intensity (FLOP/byte))
This equation leads to the roofline-like curve in the figure.
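A tiny helper reproduces the attainable-performance numbers quoted on the roofline slides (59 GB/s and 288 GB/s peak bandwidths; the 10,000 GFLOP/s compute peak below is just a placeholder large enough that only the bandwidth ceiling binds):

```python
def attainable_gflops(peak_gflops, peak_bw_gbs, ai):
    """Roofline: performance is capped by compute or by memory bandwidth."""
    return min(peak_gflops, peak_bw_gbs * ai)

print(attainable_gflops(10_000, 59, 1.5))             # 88.5  (dual CPU, theoretical AI)
print(attainable_gflops(10_000, 288, 1.5))            # 432.0 (single GPU, theoretical AI)
print(round(attainable_gflops(10_000, 59, 10.8), 1))  # 637.2 (CPU, achieved AI)
```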
(Figure: CPU roofline, with memory-bound and compute-bound regions. Peak memory bandwidth: 59 GB/s. At the theoretical arithmetic intensity of 1.5 FLOP/byte, attainable performance is 88.5 GFLOP/s; at the achieved arithmetic intensity of 10.8 FLOP/byte, attainable performance is 637.2 GFLOP/s. Achieved: 18.31 GFLOP/s.)
Evaluation : PtSel GPU Scalability
Execution times for a varying number of GPUs (left y-axis); speedup compared to the execution time on a single GPU (right y-axis). Comparison of execution times for different patch sizes.
Evaluation: Window-parallel vs. Arc-parallel
Speedup achieved by arc-parallel over window-parallel for different patch sizes and two different workloads. Here, the search window width is fixed at 35 pixels and the GPU count is four.
Evaluation: Energy Consumption
Measured with a WattsUP meter (http://www.wattsupmeters.com)
Power ratings: 1x CPU (TDP): 95 W; 1x GPU (TDP): 250 W; DDR3 256 GB (loaded): 80 W; idle system power: 280 W

         Average power consumption   Energy consumption
2x CPU   401 W                       11.63 kilowatt-hours
4x GPU   893 W                       0.4 kilowatt-hours

29x improvement in energy consumption