Slide1
LLNL-PRES-xxxxxx
Gremlins’ the Sequel: the Horrors of Exascale
Judit Gimenez, BSC
Martin Schulz, LLNL
http://scalability.llnl.gov/
http://www.bsc.es/
Slide2
How (not) to Build an Exascale Machine
- Can we make a Petascale class machine behave like what we expect Exascale machines to look like?
  - Limit resources (power, memory, network, I/O, …)
  - Increase compute/bandwidth ratios
  - Increase fault rates and lower MTBF
  - In short: release GREMLINs into a petascale machine
- Goal: Emulation Platform for the Co-Design process
  - Evaluate proxy apps and compare to baseline
  - Determine bounds of behaviors proxy apps can tolerate
  - Drive changes in proxy apps to counteract exascale properties
Slide3
How to Release GREMLINs onto your Machine
- Techniques to force “bad” behavior on “good” systems
  - Target individual resources and artificially limit them
    - Using hardware techniques
    - Stealing resources by creating contention
  - Directly inject bad behavior through external events
- Individual GREMLINs are implemented as modules
  - One effect at a time
  - Orthogonal to each other
  - Each GREMLIN has “knobs” to control behavior
- The GREMLIN framework
  - Dynamic configuration and loading of individual GREMLINs
  - Ability to couple a range of “bad behaviors”
  - Transparent to system and (mostly) to applications
Slide4
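The module-and-knob design from the slide "How to Release GREMLINs onto your Machine" can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the class names Gremlin and GremlinFramework, the knob names cap_watts and steal_mb, and the factory functions are hypothetical and are not taken from the GREMLIN code base. It only shows the shape of the idea: one effect per module, knobs per module, and a framework that loads and couples modules at run time.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Gremlin:
    """One module that degrades exactly one resource (hypothetical interface)."""
    name: str
    knobs: Dict[str, float] = field(default_factory=dict)
    active: bool = False

    def release(self) -> None:
        # A real module would start limiting or stealing its resource here.
        self.active = True

    def recall(self) -> None:
        self.active = False


class GremlinFramework:
    """Loads gremlins by name at run time and keeps track of the active set."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[..., Gremlin]] = {}
        self.loaded: List[Gremlin] = []

    def register(self, name: str, factory: Callable[..., Gremlin]) -> None:
        self._factories[name] = factory

    def load(self, name: str, **knobs: float) -> Gremlin:
        gremlin = self._factories[name](**knobs)
        gremlin.release()
        self.loaded.append(gremlin)
        return gremlin

    def unload_all(self) -> None:
        for gremlin in self.loaded:
            gremlin.recall()
        self.loaded.clear()


def make_power(cap_watts: float = 80.0) -> Gremlin:
    return Gremlin("power", {"cap_watts": cap_watts})


def make_memory(steal_mb: float = 4.0) -> Gremlin:
    return Gremlin("memory", {"steal_mb": steal_mb})


framework = GremlinFramework()
framework.register("power", make_power)
framework.register("memory", make_memory)

# Couple two orthogonal effects in one run, each with its own knob setting.
framework.load("power", cap_watts=65.0)
framework.load("memory", steal_mb=8.0)
active = {g.name: g.knobs for g in framework.loaded if g.active}
```

Because each module owns a single effect, a sweep over one knob (for example the power cap) leaves the other modules untouched, which is what makes the effects separable in later analysis.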
The Gremlin Architecture
[Figure: architecture diagram. Labels: Multi node job (e.g., MPI) with Rank 0, Rank 1, ..., Rank N; Applications, Measurement, GREMLIN Env., and Architecture for each rank; Front end node with GREMLIN Control; Power GREMLIN, Fault GREMLIN, Memory GREMLIN.]
Slide5
Broad Classes of GREMLINs
- Power
  - Impact of changes in frequency/voltage
  - Impact of limits in available power per machine/rack/node/core
- Memory
  - Restrictions in bandwidth
  - Reduction of cache size
  - Limitations of memory size
- Resiliency
  - Injection of faults to understand impact of faults
  - Notification of “fake” faults to test recovery
- Noise
  - Injection of controlled or random noise events
  - Crosscut summarizing the effects of previous GREMLINs
Slide6
Slide7
Results from the Power Gremlins
- Using RAPL to install power caps
  - Exposes chip variations
  - Turns homogeneous machines into inhomogeneous ones
- Optimal configuration under a power cap
  - Widely differing performance
  - Application-specific characteristics
  - Need for models
Slide8
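To make "using RAPL to install power caps" concrete, here is a small Python sketch. It is not the libmsr path the slides describe: as a stand-in it uses the Linux powercap sysfs interface, which exposes the RAPL package limit as a file holding microwatts. The helper names, the 115 W starting value, and the 65 W cap are assumptions chosen for illustration, and the demo runs against a temporary directory so that no hardware or root access is needed.

```python
from pathlib import Path
import tempfile

POWERCAP_ROOT = Path("/sys/class/powercap")  # Linux powercap sysfs tree


def limit_file(package: int, root: Path = POWERCAP_ROOT) -> Path:
    # constraint_0 is usually the long-term package power limit.
    return root / f"intel-rapl:{package}" / "constraint_0_power_limit_uw"


def read_cap_watts(package: int, root: Path = POWERCAP_ROOT) -> float:
    return int(limit_file(package, root).read_text()) / 1_000_000


def set_cap_watts(package: int, watts: float, root: Path = POWERCAP_ROOT) -> None:
    # The file holds an integer number of microwatts; on a real system
    # writing it requires root privileges.
    limit_file(package, root).write_text(str(int(round(watts * 1_000_000))))


def demo() -> float:
    # Exercise the helpers on a throwaway directory instead of real hardware.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "intel-rapl:0").mkdir()
        limit_file(0, root).write_text("115000000\n")  # hypothetical default
        set_cap_watts(0, 65.0, root)                   # hypothetical cap
        return read_cap_watts(0, root)
```

Applying the same cap to every socket and then timing one code on each node is enough to surface the chip-to-chip variation the slide refers to, since identical limits do not yield identical sustained frequencies.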
Update on Power Gremlin Infrastructure
- Low-level infrastructure
  - libmsr: user-level API to enable access to MSRs (incl. RAPL capping)
  - msr-safe: kernel module to make MSR access “safe”
- Current status
  - Support for Intel Sandy Bridge
  - More CPUs (incl. AMD) in progress
  - Code released on GitHub: https://github.com/scalability-llnl/libmsr
  - Inclusion into RHEL pending
  - Deployed on TLCC cluster cab
- Analysis update
  - Full system characterization (see Barry’s talk)
  - Application analysis in progress
Slide9
Next Steps: Feeding the Gremlins After Midnight
- Scheduling research
  - Find optimal configurations for each code
  - Balance processor inhomogeneity
  - Understand relationship to load balancing
  - Integration into FLUX
- New Gremlins
  - Artificially introduce noise events
  - Network Gremlins
    - Limit network bandwidth or increase latency
    - Inject controlled cross traffic
- Adaptation of the Gremlins to new programming models
  - Initially developed for MPI (using PnMPI as base)
  - First new target: OmpSs
Slide10
BSC Experiments with Gremlins
- Gremlins integrated with
  - OmpSs runtime
  - Extrae instrumentation (online mode)
- Analysis of Gremlins’ impact
  - Living with Gremlins! Measure applications’ sensitivity
  - Growing our Gremlins! Uniform vs. non-uniform populations
  - Do Gremlins have side effects? Do they increase/affect variability? They should not affect other resources
- First results
  - Up to now playing with memory Gremlins
Slide11
Integrating Gremlins in OmpSs runtime system
- Gremlins are launched at runtime initialization but remain transparent
- Each gremlin thread is exclusively pinned on one core
- These processors then become inaccessible to the runtime
- Runtime parameters can be used to enable/disable gremlins and to define the number of gremlin threads, the resource type, and how much of that resource a single gremlin thread should use
Marc Casas
Slide12
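The integration described on the slide "Integrating Gremlins in OmpSs runtime system" (threads started at initialization, one pinned thread per reserved core, behavior controlled by runtime parameters) can be sketched in Python. This is an illustration under stated assumptions, not the OmpSs implementation: the environment variable names GREMLIN_ENABLE, GREMLIN_THREADS, GREMLIN_RESOURCE and GREMLIN_MB are hypothetical, only a cache-occupying gremlin is shown, and pinning uses os.sched_setaffinity, which is available on Linux only.

```python
import os
import threading
from dataclasses import dataclass
from typing import List, Mapping, Tuple


@dataclass
class GremlinConfig:
    enabled: bool
    threads: int
    resource: str
    amount_mb: int


def config_from_env(env: Mapping[str, str] = os.environ) -> GremlinConfig:
    # Variable names are hypothetical, not actual OmpSs runtime parameters.
    return GremlinConfig(
        enabled=env.get("GREMLIN_ENABLE", "0") == "1",
        threads=int(env.get("GREMLIN_THREADS", "1")),
        resource=env.get("GREMLIN_RESOURCE", "cache"),
        amount_mb=int(env.get("GREMLIN_MB", "4")),
    )


def cache_gremlin(core: int, amount_mb: int, stop: threading.Event) -> None:
    if hasattr(os, "sched_setaffinity"):
        try:
            os.sched_setaffinity(0, {core})  # pin the calling thread
        except OSError:
            pass  # pinning not permitted here; keep running unpinned
    buf = bytearray(amount_mb * 1024 * 1024)
    while not stop.is_set():
        # Touch one byte per 64-byte line so the buffer keeps occupying cache.
        for i in range(0, len(buf), 64):
            buf[i] = (buf[i] + 1) & 0xFF


def launch(cfg: GremlinConfig) -> Tuple[List[threading.Thread], threading.Event]:
    stop = threading.Event()
    if not cfg.enabled:
        return [], stop
    if hasattr(os, "sched_getaffinity"):
        cores = sorted(os.sched_getaffinity(0))
    else:
        cores = [0]
    count = max(0, min(cfg.threads, len(cores)))
    reserved = cores[len(cores) - count:]  # a runtime would exclude these cores
    threads = [
        threading.Thread(target=cache_gremlin,
                         args=(core, cfg.amount_mb, stop), daemon=True)
        for core in reserved
    ]
    for thread in threads:
        thread.start()
    return threads, stop
```

A runtime adopting this pattern would remove the reserved cores from its worker pool before starting its own threads, so the application never sees them and the gremlins stay transparent.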
Current Work
- Identify sensitive tasks
  - Classify how sensitive they are
  - Match them with tasks that can be run concurrently and are less sensitive to the respective resource type
- Implement smart scheduler in OmpSs
  - Modify OmpSs scheduler to identify resource-sensitive tasks with the use of gremlin threads
  - Implement a scheduler that takes this information into account when scheduling tasks for execution
Slide13
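One way to picture the "Current Work" plan (measure sensitivity with gremlin threads, classify tasks, then co-schedule sensitive tasks with tolerant ones) is the Python sketch below. It is a guess at the general idea rather than the OmpSs scheduler: the TaskProfile type, the 1.10 slowdown threshold, the pairing heuristic, and the task names and timings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TaskProfile:
    name: str
    base_s: float     # run time without gremlins
    gremlin_s: float  # run time while gremlin threads steal the resource

    @property
    def sensitivity(self) -> float:
        # Slowdown factor; 1.0 means the task did not notice the gremlins.
        return self.gremlin_s / self.base_s


def classify(task: TaskProfile, threshold: float = 1.10) -> str:
    return "sensitive" if task.sensitivity >= threshold else "tolerant"


def co_schedule(tasks: List[TaskProfile]) -> List[Tuple[str, str]]:
    # Pair the most sensitive remaining task with the least sensitive one,
    # so two heavy users of the same resource do not run side by side.
    ranked = sorted(tasks, key=lambda t: t.sensitivity, reverse=True)
    pairs = []
    while len(ranked) >= 2:
        most = ranked.pop(0)
        least = ranked.pop()
        pairs.append((most.name, least.name))
    return pairs


# Hypothetical measurements, for illustration only.
profiles = [
    TaskProfile("A", 1.0, 1.60),
    TaskProfile("B", 1.0, 1.20),
    TaskProfile("C", 1.0, 1.02),
    TaskProfile("D", 1.0, 1.05),
]
labels = {t.name: classify(t) for t in profiles}
pairs = co_schedule(profiles)
```

The greedy pairing is the simplest possible policy; a real scheduler would also have to respect task dependencies and track sensitivity per resource type.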
Results L3 cache capacity interference
0 gremlins - 20 MB
1 gremlin - 15 MB
2 gremlins - 12 MB
3 gremlins - 7 MB
4 gremlins - 4 MB
Slide14
Results with memory BW interference
0 gremlins - 40 GB/s
1 gremlin - 37.2 GB/s
2 gremlins - 34.3 GB/s
3 gremlins - 31.5 GB/s
4 gremlins - 28.7 GB/s
Slide15
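The two results slides (L3 cache capacity and memory bandwidth with 0 to 4 gremlins) can be summarized with a little arithmetic. The short Python sketch below copies the values from those slides and computes the average amount each gremlin takes away and the total reduction with four gremlins; the function and variable names are illustrative. The numbers work out to about 4 MB of L3 per gremlin (80% in total) and roughly 2.8 GB/s of bandwidth per gremlin (about 28% in total).

```python
from typing import Dict, List, Tuple

# Effective resource left for the application, indexed by gremlin count (0..4),
# copied from the two results slides.
l3_capacity_mb = [20.0, 15.0, 12.0, 7.0, 4.0]
memory_bw_gbs = [40.0, 37.2, 34.3, 31.5, 28.7]


def mean_loss_per_gremlin(series: List[float]) -> float:
    # Average drop per added gremlin between the first and last measurement.
    return (series[0] - series[-1]) / (len(series) - 1)


def total_loss_percent(series: List[float]) -> float:
    # Share of the undisturbed resource that is gone with all gremlins active.
    return 100.0 * (series[0] - series[-1]) / series[0]


summary: Dict[str, Tuple[float, float]] = {
    "l3_capacity_mb": (mean_loss_per_gremlin(l3_capacity_mb),
                       total_loss_percent(l3_capacity_mb)),
    "memory_bw_gbs": (mean_loss_per_gremlin(memory_bw_gbs),
                      total_loss_percent(memory_bw_gbs)),
}
```

The bandwidth steps are nearly uniform, while the cache steps alternate between 5 MB and 3 MB, so the per-gremlin average hides some structure in the cache case.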
CGPOP sensitivity to LLC cache size and bandwidth
[Figure. Panels: Without gremlins; Limiting size (4 MB); Limiting bandwidth (28.7 GB/s)]
Slide16
CGPOP sensitivity LLC cache: variability / unbalance
[Figure. Labels as extracted: 4lnj8dmatvec, 8hyn9lpcg; Limiting size (4 MB); Limiting bandwidth (28.7 GB/s)]
Slide17
CGPOP sensitivity LLC cache: unbalance along time
[Figure. Labels as extracted: 8hyn9lpcg; Limiting size (4 MB); Limiting bandwidth (28.7 GB/s)]
Slide18
CGPOP sensitivity LLC cache: tracking evolution
[Figure. Labels as extracted: 4lnj8dmatvec, 8hyn9lpcg, a, hyn9lpcg, 3hyn9lpcg; Limiting bandwidth]
Slide19
Integrating Gremlins in Extrae
- Extrae online analysis mode
  - Based on MRNet
  - Gremlins API to activate them locally
  - All Gremlins launched at initialization time
- First experiments with LLC cache size gremlins
  - Periodic increase of Gremlins
  - Unbalanced steal of resources
Slide20
LULESH sensitivity to LLC cache size
Can we extract insight from chaos?
Unbalanced Gremlins creation
Slide21
LULESH sensitivity to LLC cache size: tracking evolution
L3/instr. ratio
Slide22
Conclusions
- Different regions show different sensitivity to resource reductions
- Asynchrony affects actual sharing of resources
  - Today this happens without control, which leads to variability
- Detailed analysis detects increases in variability and potential non-uniform impact