Slide1
A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches
Martin Burtscher (1) and Hassan Rabeti (2)
(1) Department of Computer Science, Texas State University-San Marcos
(2) Department of Mathematics, Texas State University-San Marcos
Slide2
Problem: HPC is Hard to Exploit
HPC application writers are domain experts
- They are typically not computer scientists and have little or no formal education in parallel programming
- Parallel programming is difficult and error prone
Modern HPC systems are complex
- Consist of interconnected compute nodes with multiple CPUs and one or more GPUs per node
- Require parallelization at multiple levels (inter-node, intra-node, and accelerator) for best performance
Slide3
Target Area: Iterative Local Searches
Important application domain
- Widely used in engineering & real-time environments
Examples
- All sorts of random-restart greedy algorithms
- Ant colony optimization, Monte Carlo, n-opt hill climbing, etc.
ILS properties
- Iteratively produce better solutions
- Can exploit large amounts of parallelism
- Often have an exponential search space
Slide4
Our Solution: ILCS Framework
Iterative Local Champion Search (ILCS) framework
- Supports non-random restart heuristics: genetic algorithms, tabu search, particle swarm optimization, etc.
- Simplifies implementation of ILS on parallel systems
Design goal
- Ease of use and scalability
Framework benefits
- Handles threading, communication, locking, resource allocation, heterogeneity, load balance, termination decision, and result recording (checkpointing)
Slide5
User Interface
User writes 3 serial C functions and/or 3 single-GPU CUDA functions with some restrictions:

  size_t CPU_Init(int argc, char *argv[]);
  void CPU_Exec(long seed, void const *champion, void *result);
  void CPU_Output(void const *champion);

See paper for GPU interface and sample code
Framework runs Exec (map) functions in parallel
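To make the interface concrete, here is a minimal sketch of the three CPU functions for a hypothetical toy objective (hill climbing on f(x) = -(x - 42)^2 from a seed-derived start). The signatures are the ones shown above; the Solution struct, the objective, and the search loop are illustrative assumptions, not ILCS sample code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* the solution record the framework passes around as champion/result
   (hypothetical layout for this toy problem) */
typedef struct { double x, value; } Solution;

/* tell the framework how large one solution record is */
size_t CPU_Init(int argc, char *argv[])
{
  (void)argc; (void)argv;
  return sizeof(Solution);
}

/* evaluate one seed: derive a deterministic starting point from the seed,
   run a short hill climb, and write the best solution found into *result */
void CPU_Exec(long seed, void const *champion, void *result)
{
  (void)champion;                    /* random-restart style: ignore champion */
  double x = (double)(seed % 1000);  /* seed-derived start */
  for (int i = 0; i < 100; i++) {
    double up = x + 0.5, down = x - 0.5;
    double fx = -(x - 42.0) * (x - 42.0);
    double fu = -(up - 42.0) * (up - 42.0);
    double fd = -(down - 42.0) * (down - 42.0);
    if (fu > fx) x = up;             /* climb toward the better neighbor */
    else if (fd > fx) x = down;
    else break;                      /* local optimum reached */
  }
  Solution s = { x, -(x - 42.0) * (x - 42.0) };
  memcpy(result, &s, sizeof s);
}

/* print the globally best solution at the end of the run */
void CPU_Output(void const *champion)
{
  Solution const *s = champion;
  printf("best x = %.1f, f(x) = %.1f\n", s->x, s->value);
}
```

The framework, not the user, decides which seeds each CPU_Exec call receives and how results are reduced into the champion.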
Slide6
Internal Operation: Threading
- ILCS master thread starts
- Master forks a worker per CPU core
- Master forks a handler per GPU
- CPU workers evaluate seeds, record local optimum
- GPU workers evaluate seeds, record local optimum
- Handlers launch GPU code, sleep, record result
- Master sporadically finds global optimum via MPI, sleeps
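The fork/evaluate/record pattern above can be miniaturized with plain pthreads. This is an assumption-laden sketch, not the ILCS implementation: evaluate(), WORKERS, and the static seed split are made up, and the real framework adds GPU handlers, periodic MPI reduction, and sleeping.

```c
#include <pthread.h>
#include <stddef.h>

#define WORKERS 4                /* stand-in for "one worker per core" */
#define SEEDS_PER_WORKER 1000

/* hypothetical objective: higher is better, maximal at seeds divisible by 97 */
static double evaluate(long seed) { long r = seed % 97; return -(double)(r * r); }

typedef struct { long first_seed; double best; long best_seed; } WorkerState;

/* worker: scan an assigned seed range and record the local optimum */
static void *worker(void *arg)
{
  WorkerState *w = arg;
  w->best = evaluate(w->first_seed);
  w->best_seed = w->first_seed;
  for (long s = w->first_seed + 1; s < w->first_seed + SEEDS_PER_WORKER; s++) {
    double v = evaluate(s);
    if (v > w->best) { w->best = v; w->best_seed = s; }
  }
  return NULL;
}

/* master: fork the workers, then reduce their local optima to the global one */
long find_best_seed(void)
{
  pthread_t tid[WORKERS];
  WorkerState st[WORKERS];
  for (int i = 0; i < WORKERS; i++) {
    st[i].first_seed = (long)i * SEEDS_PER_WORKER;
    pthread_create(&tid[i], NULL, worker, &st[i]);
  }
  double global = 0.0;
  long gseed = -1;
  for (int i = 0; i < WORKERS; i++) {
    pthread_join(tid[i], NULL);
    if (i == 0 || st[i].best > global) { global = st[i].best; gseed = st[i].best_seed; }
  }
  return gseed;
}
```

Because each worker only writes its own WorkerState and the master reads it after the join, no extra locking is needed in this reduced form.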
Slide7
Internal Operation: Seed Distribution
E.g., 4 nodes with 4 cores (a,b,c,d) and 2 GPUs (1,2)
- Each node gets a chunk of the 64-bit seed range
- CPUs process their chunk bottom up
- GPUs process their chunk top down
Benefits
- Balanced workload irrespective of the number of CPU cores or GPUs (or their relative performance)
- Users can generate other distributions from the seeds
- Any injective mapping results in no redundant evaluations
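The two-front chunk scheme can be sketched as follows (hypothetical helper names; the actual bookkeeping is internal to ILCS). Each node owns a contiguous slice of the 64-bit seed space; CPU workers consume it from the bottom and GPU handlers from the top, so the split point adapts automatically to their relative speeds.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } Chunk;   /* node's seed range [lo, hi) */

/* split the 64-bit seed space into one contiguous chunk per node */
Chunk node_chunk(int node, int num_nodes)
{
  uint64_t span = UINT64_MAX / (uint64_t)num_nodes;
  Chunk c;
  c.lo = (uint64_t)node * span;
  c.hi = (node == num_nodes - 1) ? UINT64_MAX : c.lo + span;
  return c;
}

/* CPU front takes the next seed from the bottom of the chunk;
   returns 0 once the two fronts meet and the chunk is exhausted */
int next_cpu_seed(uint64_t *cpu_next, uint64_t gpu_next, uint64_t *seed)
{
  if (*cpu_next >= gpu_next) return 0;
  *seed = (*cpu_next)++;
  return 1;
}

/* GPU front takes the next seed from the top of the chunk */
int next_gpu_seed(uint64_t cpu_next, uint64_t *gpu_next, uint64_t *seed)
{
  if (*gpu_next <= cpu_next) return 0;
  *seed = --(*gpu_next);
  return 1;
}
```

Since every seed in [lo, hi) is handed out exactly once, the seed-to-evaluation mapping stays injective and no work is duplicated, matching the benefit stated above.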
Slide8
Related Work
MapReduce/Hadoop/MARS and PADO
- Their generality and features unnecessary for ILS incur overhead and steepen the learning curve
- Some do not support accelerators, some require Java
ILCS framework is optimized for ILS applications
- Reduction is provided, does not require multiple keys, does not need secondary storage to buffer data, directly supports non-random restart heuristics, allows early termination, works with GPUs and MICs, targets single-node workstations through HPC clusters
Slide9
Evaluation Methodology
Three HPC systems (at TACC and NICS)
- Largest tested configuration
(image credit: datacenterknowledge.com)
Slide10
Sample ILS Codes
Traveling Salesman Problem (TSP)
- Find shortest tour
- 4 inputs from TSPLIB
- 2-opt hill climbing
Finite State Machine (FSM)
- Find best FSM configuration to predict hit/miss events
- 4 sizes (n = 3, 4, 5, 6)
- Monte Carlo method
Slide11
FSM Transitions/Second Evaluated
- 21,532,197,798,304 s^-1
- GPU shared-memory limit
- Ranger uses twice as many cores as Stampede
Slide12
TSP Tour-Changes/Second Evaluated
- 12,239,050,704,370 s^-1 (based on serial CPU code)
- CPU pre-computes: O(n^2) memory
- GPU re-computes: O(n) memory
- each core evaluates a tour change every 3.6 cycles
Slide13
TSP Moves/Second/Node Evaluated
- GPUs provide >90% of performance on Keeneland
Slide14
ILCS Scaling on Ranger (FSM)
- >99% parallel efficiency on 2048 nodes
- other two systems are similar
Slide15
ILCS Scaling on Ranger (TSP)
- >95% parallel efficiency on 2048 nodes
- longer runs are even better
Slide16
Intra-Node Scaling on Stampede (TSP)
- >98.9% parallel efficiency on 16 threads
- framework overhead is very small
Slide17
Tour Quality Evolution (Keeneland)
- quality depends on chance: ILS provides a good solution quickly, then progressively improves it
Slide18
Tour Quality after 6 Steps (Stampede)
- larger node counts typically yield better results faster
Slide19
Summary and Conclusions
ILCS Framework
- Automatic parallelization of iterative local searches
- Provides MPI, OpenMP, and multi-GPU support
- Checkpoints the currently best solution every few seconds
- Scales very well (decentralized)
Evaluation
- 2-opt hill climbing (TSP) and Monte Carlo method (FSM)
- AMD and Intel CPUs, NVIDIA GPUs, and Intel MICs
ILCS source code is freely available:
http://cs.txstate.edu/~burtscher/research/ILCS/
Work supported by NSF, NVIDIA and Intel; resources provided by TACC and NICS