Presentation Transcript

Slide1

A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches

Martin Burtscher¹ and Hassan Rabeti²
¹Department of Computer Science, Texas State University-San Marcos
²Department of Mathematics, Texas State University-San Marcos

Slide2

Problem: HPC is Hard to Exploit
HPC application writers are domain experts
They are not typically computer scientists and have little or no formal education in parallel programming
Parallel programming is difficult and error prone
Modern HPC systems are complex
They consist of interconnected compute nodes with multiple CPUs and one or more GPUs per node
They require parallelization at multiple levels (inter-node, intra-node, and accelerator) for best performance


Slide3

Target Area: Iterative Local Searches
Important application domain
Widely used in engineering & real-time environments
Examples: all sorts of random-restart greedy algorithms
Ant colony opt, Monte Carlo, n-opt hill climbing, etc.
ILS properties
Iteratively produce better solutions
Can exploit large amounts of parallelism
Often have an exponential search space


Slide4

Our Solution: ILCS Framework
Iterative Local Champion Search (ILCS) framework
Supports non-random-restart heuristics: genetic algorithms, tabu search, particle swarm opt, etc.
Simplifies the implementation of ILS on parallel systems
Design goal: ease of use and scalability
Framework benefits
Handles threading, communication, locking, resource allocation, heterogeneity, load balance, the termination decision, and result recording (checkpointing)


Slide5

User Interface
The user writes 3 serial C functions and/or 3 single-GPU CUDA functions, with some restrictions:

size_t CPU_Init(int argc, char *argv[]);
void CPU_Exec(long seed, void const *champion, void *result);
void CPU_Output(void const *champion);

See the paper for the GPU interface and sample code
The framework runs the Exec (map) functions in parallel
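To make the interface concrete, here is a minimal sketch of the three CPU functions for an invented toy problem (random-restart hill climbing on a 1-D function). It follows the signatures above, but the problem, constants, and search logic are illustrative assumptions, not the paper's sample code; how the framework compares two results to pick the champion is covered in the paper, not here.

/* Hypothetical user code: random-restart hill climbing on a toy 1-D function.
   The ILCS framework supplies main(), the threading, and the result handling. */
#include <math.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

size_t CPU_Init(int argc, char *argv[])
{
  /* parse inputs and set up read-only problem data if needed;
     return the size in bytes of one solution record */
  (void)argc; (void)argv;
  return sizeof(double);
}

void CPU_Exec(long seed, void const *champion, void *result)
{
  /* derive a starting point from the seed, run one local search,
     and write the locally best solution into result */
  (void)champion;  /* a non-random-restart heuristic would refine the champion instead */
  double best = (double)(seed % 1000) / 10.0;
  double bestval = cos(best) + 0.01 * best * best;
  for (int step = 0; step < 1000; step++) {
    double cand[2] = { best - 0.01, best + 0.01 };
    for (int i = 0; i < 2; i++) {
      double v = cos(cand[i]) + 0.01 * cand[i] * cand[i];
      if (v < bestval) { bestval = v; best = cand[i]; }
    }
  }
  memcpy(result, &best, sizeof(double));
}

void CPU_Output(void const *champion)
{
  /* report the globally best solution found by the framework */
  printf("best x = %f\n", *(double const *)champion);
}

The champion argument carries the currently best known solution, which non-random-restart heuristics (slide 4) can presumably use as their starting point.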


Slide6

Internal Operation: Threading


ILCS master thread starts
Master forks a worker per core and a handler per GPU
Workers evaluate seeds, record the local opt
GPU workers evaluate seeds, record the local opt
Handlers launch GPU code, sleep, record the result
Master sporadically finds the global opt via MPI, then sleeps
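A rough sketch of this per-node fork pattern follows. The summary slide mentions MPI and OpenMP support, but the details below (thread roles, GPU count, output) are invented for illustration and are not the actual ILCS code.

/* Hypothetical sketch of the per-node threading pattern, not the real framework. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
  int num_cores = omp_get_num_procs();   /* one CPU worker per core */
  int num_gpus  = 2;                     /* assumed; the real framework would query the GPUs */

  #pragma omp parallel num_threads(num_cores + num_gpus)
  {
    int id = omp_get_thread_num();
    if (id < num_gpus) {
      /* GPU handler: launch the GPU search code, sleep while it runs, record its result */
      printf("handler for GPU %d\n", id);
    } else {
      /* CPU worker: repeatedly evaluate seeds and record the local optimum */
      printf("worker on core %d\n", id - num_gpus);
    }
  }
  /* the master would sporadically combine the per-node optima across nodes via MPI,
     sleeping in between, until the termination condition is met */
  return 0;
}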

Slide7

Internal Operation: Seed Distribution
E.g., 4 nodes with 4 cores (a, b, c, d) and 2 GPUs (1, 2)

Benefits
Balanced workload irrespective of the number of CPU cores or GPUs (or their relative performance)
Users can generate other distributions from the seeds
Any injective mapping results in no redundant evaluations


each node gets chunk of 64-bit seed range

CPUs process chunk bottom up

GPUs process chunk top down
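A small sketch of this partitioning scheme is shown below; the chunk arithmetic and variable names are assumptions for illustration, and the real assignment logic may differ.

/* Hypothetical sketch of the seed distribution: the 64-bit seed range is split
   into one chunk per node; CPUs consume the chunk from the bottom, GPUs from the top. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
  const uint64_t num_nodes = 4;                 /* example from the slide */
  const uint64_t chunk = UINT64_MAX / num_nodes;

  uint64_t node = 2;                            /* this node's rank (illustrative) */
  uint64_t cpu_next = node * chunk;             /* CPU workers count upward from here */
  uint64_t gpu_next = cpu_next + chunk - 1;     /* GPU handlers count downward from here */

  printf("node %llu: CPU seeds start at %llu, GPU seeds start at %llu\n",
         (unsigned long long)node, (unsigned long long)cpu_next,
         (unsigned long long)gpu_next);
  return 0;
}

Because the CPUs and GPUs presumably work toward each other inside the same chunk, the split point adapts to their relative speeds and no seed is evaluated twice, consistent with the balanced-workload and injective-mapping points above.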

Slide8

Related Work
MapReduce/Hadoop/MARS and PADO
Their generality and features unnecessary for ILS incur overhead and increase the learning curve
Some do not support accelerators; some require Java
The ILCS framework is optimized for ILS applications
It provides reduction, does not require multiple keys, does not need secondary storage to buffer data, directly supports non-random-restart heuristics, allows early termination, works with GPUs and MICs, and targets everything from single-node workstations to HPC clusters


Slide9

Evaluation Methodology
Three HPC systems (at TACC and NICS)
Largest tested configuration


Slide10

Sample ILS Codes
Traveling Salesman Problem (TSP)
Find the shortest tour
4 inputs from TSPLIB
2-opt hill climbing (see the sketch after this list)
Finite State Machine (FSM)
Find the best FSM config to predict hit/miss events
4 sizes (n = 3, 4, 5, 6)
Monte Carlo method
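Since the slide names the techniques but not the code, the following is a generic sketch of 2-opt hill climbing for a symmetric TSP instance with a precomputed distance matrix; it is not the authors' implementation.

/* Generic 2-opt hill climbing for symmetric TSP, sketched for illustration only.
   dist is a precomputed n*n distance matrix; tour is a permutation of 0..n-1. */
double move_delta(const double *dist, const int *tour, int n, int i, int j)
{
  /* length change if the segment tour[i+1..j] is reversed */
  int a = tour[i], b = tour[i + 1], c = tour[j], d = tour[(j + 1) % n];
  return (dist[a * n + c] + dist[b * n + d]) - (dist[a * n + b] + dist[c * n + d]);
}

void two_opt_hill_climb(const double *dist, int *tour, int n)
{
  int improved = 1;
  while (improved) {                              /* repeat until no improving move is left */
    improved = 0;
    for (int i = 0; i < n - 1; i++) {
      for (int j = i + 2; j < n; j++) {
        if (i == 0 && j == n - 1) continue;       /* these two edges share a city; skip */
        if (move_delta(dist, tour, n, i, j) < -1e-9) {
          for (int l = i + 1, r = j; l < r; l++, r--) {   /* reverse the segment */
            int t = tour[l]; tour[l] = tour[r]; tour[r] = t;
          }
          improved = 1;
        }
      }
    }
  }
}

In the ILCS setting, each seed would presumably select a different starting tour, and each Exec call would run one such climb.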


Slide11

FSM Transitions/Second Evaluated


21,532,197,798,304 s⁻¹
(Chart annotations: GPU shmem limit; Ranger uses twice as many cores as Stampede)

Slide12

TSP Tour-Changes/Second Evaluated


12,239,050,704,370 s⁻¹ (based on the serial CPU code)
CPU pre-computes: O(n²) memory
GPU re-computes: O(n) memory (see the sketch below)
Each core evaluates a tour change every 3.6 cycles
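The memory trade-off noted above can be illustrated with two hypothetical distance routines; what is pre-computed or re-computed is presumably the inter-city distance, and the 2-D Euclidean form assumed here is not necessarily the paper's kernel code.

/* Illustrative contrast of the two memory strategies (not the actual ILCS kernels). */
#include <math.h>

/* CPU-style: O(n^2) memory, one table lookup per distance */
double dist_lookup(const double *dist, int n, int a, int b)
{
  return dist[a * n + b];
}

/* GPU-style: O(n) memory, distance recomputed from city coordinates each time */
double dist_recompute(const double *x, const double *y, int a, int b)
{
  double dx = x[a] - x[b], dy = y[a] - y[b];
  return sqrt(dx * dx + dy * dy);
}

Trading storage for recomputation keeps the GPU's working set at O(n) at the cost of extra arithmetic per move evaluation.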

Slide13

TSP Moves/Second/Node Evaluated


GPUs provide >90% of performance on Keeneland

Slide14

ILCS Scaling on Ranger (FSM)


>99% parallel efficiency on 2048 nodes

other two systems are similar

Slide15

ILCS Scaling on Ranger (TSP)


>95% parallel efficiency on 2048 nodes

longer runs are even better

Slide16

Intra-Node Scaling on Stampede (TSP)


>98.9% parallel efficiency on 16 threads

framework overhead is very small

Slide17

Tour Quality Evolution (Keeneland)


Quality depends on chance: ILS provides a good solution quickly, then progressively improves it

Slide18

Tour Quality after 6 Steps (Stampede)


larger node counts typically yield better results faster

Slide19

Summary and Conclusions

ILCS Framework
Automatic parallelization of iterative local searches
Provides MPI, OpenMP, and multi-GPU support
Checkpoints the currently best solution every few seconds
Scales very well (decentralized)
Evaluation
2-opt hill climbing (TSP) and Monte Carlo method (FSM)
AMD + Intel CPUs, NVIDIA GPUs, and Intel MICs
The ILCS source code is freely available: http://cs.txstate.edu/~burtscher/research/ILCS/

Work supported by NSF, NVIDIA and Intel; resources provided by TACC and NICS
