LLNL-PRES - PowerPoint Presentation

Uploaded On 2016-04-30


Presentation Transcript

Slide1

LLNL-PRES-xxxxxx

Gremlins’ the Sequel – the Horrors of Exascale

Judit Gimenez, BSC
Martin Schulz, LLNL

http://scalability.llnl.gov/
http://www.bsc.es/

Slide2

Can we make a Petascale class machine behave like what we expect Exascale machines to look like?
  Limit resources (power, memory, network, I/O, …)
  Increase compute/bandwidth ratios
  Increase fault rates and lower MTBF
  In short: release GREMLINs into a petascale machine

Goal: emulation platform for the co-design process
  Evaluate proxy apps and compare to baseline
  Determine bounds of behaviors proxy apps can tolerate
  Drive changes in proxy apps to counteract exascale properties

How (not) to Build an Exascale Machine

Slide3

Techniques to force “bad” behavior on “good” systems
  Target individual resources and artificially limit them
    Using hardware techniques
    Stealing resources by creating contention
  Directly inject bad behavior through external events

Individual GREMLINs are implemented as modules
  One effect at a time
  Orthogonal to each other
  Each GREMLIN has “knobs” to control behavior

The GREMLIN framework
  Dynamic configuration and loading of individual GREMLINs
  Ability to couple a range of “bad behaviors”
  Transparent to system and (mostly) to applications
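
The modular design described above (one effect per GREMLIN, each with its own knobs, loaded according to a configuration) might be sketched roughly as follows. This is only an illustration under assumptions: the class, the registry, the knob names, and the default values are all hypothetical and are not the GREMLIN framework's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names below are hypothetical; this is not the GREMLIN framework's API.

@dataclass
class Gremlin:
    """One module that degrades exactly one resource."""
    name: str
    knobs: Dict[str, float]

    def describe(self) -> str:
        settings = ", ".join(f"{k}={v}" for k, v in sorted(self.knobs.items()))
        return f"{self.name}({settings})"

# Registry mapping a gremlin name to a factory that applies knob overrides.
REGISTRY: Dict[str, Callable[[Dict[str, float]], Gremlin]] = {}

def register(name: str, defaults: Dict[str, float]) -> None:
    def factory(overrides: Dict[str, float]) -> Gremlin:
        unknown = set(overrides) - set(defaults)
        if unknown:
            raise KeyError(f"unknown knobs for {name}: {sorted(unknown)}")
        return Gremlin(name, {**defaults, **overrides})
    REGISTRY[name] = factory

# Assumed example modules and default knob values (placeholders).
register("power", {"cap_watts": 80.0})
register("memory_bw", {"threads": 1})
register("cache", {"threads": 1, "mb_per_thread": 4.0})

def load(config: Dict[str, Dict[str, float]]) -> List[Gremlin]:
    """Instantiate only the gremlins named in the configuration."""
    return [REGISTRY[name](knobs) for name, knobs in config.items()]

# Couple two independent effects in one run.
active = load({"power": {"cap_watts": 50.0}, "cache": {"threads": 2}})
```

Keeping each module independent of the others is what allows a configuration to combine several effects without the modules knowing about one another.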

How to Release GREMLINs onto your Machine

Slide4

The Gremlin Architecture

[Diagram. Labels: Multi node job (e.g., MPI); Rank 0, Rank 1, Rank; Applications; Measurement; GREMLIN Env; Architecture; Front end node; GREMLIN Control; Power GREMLIN; Memory GREMLIN; Fault GREMLIN]

Slide5

Power
  Impact of changes in frequency/voltage
  Impact of limits in available power per machine/rack/node/core
Memory
  Restrictions in bandwidth
  Reduction of cache size
  Limitations of memory size
Resiliency
  Injection of faults to understand impact of faults
  Notification of “fake” faults to test recovery
Noise
  Injection of controlled or random noise events
  Crosscut summarizing the effects of previous GREMLINs

Broad Classes of GREMLINs

Slide6

Broad Classes of GREMLINs

Slide7

Using RAPL to install power caps
  Exposes chip variations
  Turns homogeneous machines into inhomogeneous ones
Optimal configuration under a power cap
  Widely differing performance
  Application-specific characteristics
  Need for models

Results from the Power Gremlins

Slide8

Low-level infrastructure
  Libmsr: user-level API to enable access to MSRs (incl. RAPL capping)
  Msr-safe: kernel module to make MSR access “safe”
Current status
  Support for Intel Sandy Bridge
  More CPUs (incl. AMD) in progress
  Code released on GitHub: https://github.com/scalability-llnl/libmsr
  Inclusion into RHEL pending
  Deployed on TLCC cluster cab
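
As a rough illustration of what RAPL capping involves at the MSR level, the sketch below converts a package power cap in watts into the bit fields of a power-limit register value. It is not libmsr code and makes no libmsr calls; the function names are hypothetical. The register layout assumed here (power unit in bits 3:0 of MSR_RAPL_POWER_UNIT, power limit #1 in bits 14:0 of MSR_PKG_POWER_LIMIT, enable flag in bit 15) follows Intel's published description and should be checked against the Intel SDM before any real use. The unit-register value in the example is an assumed sample, not a reading from any machine in the talk.

```python
# Hypothetical helpers; nothing here touches hardware or calls libmsr.
# Assumed layout (per the Intel SDM):
#   MSR_RAPL_POWER_UNIT bits 3:0  -> PU, power unit is 1 / 2**PU watts
#   MSR_PKG_POWER_LIMIT bits 14:0 -> power limit #1, in power units
#   MSR_PKG_POWER_LIMIT bit 15    -> enable power limit #1

def power_unit_watts(rapl_power_unit_msr: int) -> float:
    """Decode the power unit (in watts) from the unit register value."""
    pu = rapl_power_unit_msr & 0xF
    return 1.0 / (1 << pu)

def encode_pkg_power_limit(watts: float, rapl_power_unit_msr: int,
                           current_limit_msr: int = 0) -> int:
    """Return a new register value with limit #1 set and enabled."""
    raw = int(round(watts / power_unit_watts(rapl_power_unit_msr)))
    if not 0 <= raw <= 0x7FFF:
        raise ValueError("power limit does not fit in 15 bits")
    cleared = current_limit_msr & ~0xFFFF  # clear bits 15:0, keep the rest
    return cleared | raw | (1 << 15)

def decode_pkg_power_limit(limit_msr: int, rapl_power_unit_msr: int) -> float:
    """Recover power limit #1 in watts from a register value."""
    return (limit_msr & 0x7FFF) * power_unit_watts(rapl_power_unit_msr)

# Assumed sample unit register: PU = 3, so one unit is 0.125 W.
SAMPLE_UNIT_MSR = 0xA1003
value = encode_pkg_power_limit(80.0, SAMPLE_UNIT_MSR)
```

Writing such a value to the register requires privileged MSR access, which is the part that a user-level library and a kernel module of the kind listed above would mediate.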

Analysis update
  Full system characterization (see Barry’s talk)
  Application analysis in progress

Update on Power Gremlin Infrastructure

Slide9

Scheduling research
  Find optimal configurations for each code
  Balance processor inhomogeneity
  Understand relationship to load balancing
  Integration into FLUX
New Gremlins
  Artificially introduce noise events
  Network Gremlins
    Limit network bandwidth or increase latency
    Inject controlled cross traffic
Adaptation of the Gremlins to new programming models
  Initially developed for MPI (using PnMPI as base)
  First new target: OmpSs

Next Steps: Feeding the Gremlins After Midnight

Slide10

BSC Experiments with Gremlins

Gremlins integrated with
  OmpSs runtime
  Extrae instrumentation (online mode)
Analysis of Gremlins’ impact
  Living with Gremlins! Measure applications’ sensitivity
  Growing our Gremlins! Uniform vs. non-uniform populations
  Do Gremlins have side effects? Do they increase/affect variability? They should not affect other resources
First results
  Up to now playing with memory Gremlins

Slide11

Integrating Gremlins in OmpSs runtime system

Gremlins launched at runtime initialization but remain transparent
  Each gremlin thread is exclusively pinned on one core
  These processors then become inaccessible to the runtime
Runtime parameters can be used to enable/disable gremlins and to define the number of gremlin threads, the resource type, and how much of that resource a single gremlin thread should use
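
The idea of pinned gremlin threads that each occupy a fixed share of a resource can be sketched as below. This is an illustration only, not the OmpSs integration: the function names and parameters are hypothetical, only a cache-footprint gremlin is shown, and because Python threads are serialized by the interpreter lock, a real gremlin would be a native thread. Pinning uses `os.sched_setaffinity`, which is available on Linux only, so the sketch falls back to running unpinned elsewhere.

```python
import os
import threading
import time

CACHE_LINE = 64  # assumed cache-line size in bytes

def gremlin_worker(core, footprint_bytes, stop, sweeps, slot):
    """Keep a private buffer hot by touching one byte per cache line."""
    if hasattr(os, "sched_setaffinity"):  # Linux only
        try:
            # On Linux, pid 0 refers to the calling thread.
            os.sched_setaffinity(0, {core})
        except OSError:
            pass  # core not available to this process; continue unpinned
    buf = bytearray(footprint_bytes)
    while True:
        for i in range(0, footprint_bytes, CACHE_LINE):
            buf[i] = (buf[i] + 1) & 0xFF
        sweeps[slot] += 1
        if stop.is_set():
            break

def run_gremlins(n_threads, bytes_per_thread, seconds, first_core=0):
    """Launch n_threads gremlins, let them run, and return sweep counts.

    n_threads=0 corresponds to gremlins being disabled.
    """
    stop = threading.Event()
    sweeps = [0] * n_threads
    threads = [
        threading.Thread(
            target=gremlin_worker,
            args=(first_core + i, bytes_per_thread, stop, sweeps, i),
        )
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    time.sleep(seconds)
    stop.set()
    for t in threads:
        t.join()
    return sweeps
```

A call such as `run_gremlins(2, 4 * 1024 * 1024, 1.0)` would run two gremlins with a 4 MiB footprint each for about one second; these values are arbitrary examples.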

Marc Casas

Slide12

Current Work
  Identify sensitive tasks
  Classify how sensitive they are
  Match them with tasks that can be run concurrently and are less sensitive to the respective resource type
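
One possible reading of the three steps above is sketched below: sensitivity is taken as the relative slowdown of a task when gremlins are active, tasks are classified against a threshold, and the most sensitive tasks are paired with the least sensitive ones. The metric, the 10% threshold, the pairing rule, and the task names are all assumptions made for illustration; the slides do not specify them, and the timings are arbitrary placeholders rather than measurements.

```python
def sensitivity(baseline_s, with_gremlins_s):
    """Relative slowdown caused by the gremlins (0.0 means unaffected)."""
    return with_gremlins_s / baseline_s - 1.0

def classify(value, threshold=0.10):
    """Label a task by comparing its slowdown with an assumed threshold."""
    return "sensitive" if value >= threshold else "insensitive"

def pair_tasks(sensitivities, threshold=0.10):
    """Pair the most sensitive tasks with the least sensitive ones."""
    ranked = sorted(sensitivities, key=sensitivities.get, reverse=True)
    sensitive = [t for t in ranked if sensitivities[t] >= threshold]
    tolerant = [t for t in reversed(ranked) if sensitivities[t] < threshold]
    return list(zip(sensitive, tolerant))

# Placeholder timings in seconds: (without gremlins, with gremlins).
timings = {
    "task_a": (1.0, 1.5),
    "task_b": (2.0, 2.1),
    "task_c": (1.0, 1.2),
    "task_d": (4.0, 4.0),
}
scores = {name: sensitivity(base, slow) for name, (base, slow) in timings.items()}
pairs = pair_tasks(scores)
```

With these placeholder inputs, the two tasks above the threshold are each paired with one of the two tasks below it.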

Implement smart scheduler in OmpSs
  Modify OmpSs scheduler to identify resource-sensitive tasks with the use of gremlin threads
  Implement a scheduler that takes this information into account when scheduling tasks for execution

Slide13

Results L3 cache capacity interference

0 gremlins - 20 MB
1 gremlin - 15 MB
2 gremlins - 12 MB
3 gremlins - 7 MB
4 gremlins - 4 MB

Slide14

Results with memory BW interference

0 gremlins - 40 GB/s
1 gremlin - 37.2 GB/s
2 gremlins - 34.3 GB/s
3 gremlins - 31.5 GB/s
4 gremlins - 28.7 GB/s

Slide15
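
The 4 MB and 28.7 GB/s limits used in the CGPOP experiments are the four-gremlin entries of the two preceding result lists. The short sketch below works through the arithmetic behind those lists, using only the figures reported on the slides; the function names are illustrative.

```python
# Figures as reported on the slides, keyed by number of gremlins.
llc_mb = {0: 20, 1: 15, 2: 12, 3: 7, 4: 4}
mem_bw_gbs = {0: 40.0, 1: 37.2, 2: 34.3, 3: 31.5, 4: 28.7}

def remaining_fraction(series):
    """Fraction of the gremlin-free value still available to the application."""
    base = series[0]
    return {n: round(v / base, 4) for n, v in series.items()}

def per_gremlin_loss(series):
    """Amount removed by each additional gremlin."""
    counts = sorted(series)
    return [round(series[a] - series[b], 1) for a, b in zip(counts, counts[1:])]

llc_left = remaining_fraction(llc_mb)       # llc_left[4] is 0.2
bw_left = remaining_fraction(mem_bw_gbs)    # bw_left[4] is 0.7175
llc_steps = per_gremlin_loss(llc_mb)        # [5, 3, 5, 3] MB
bw_steps = per_gremlin_loss(mem_bw_gbs)     # [2.8, 2.9, 2.8, 2.8] GB/s
```

Four gremlins therefore leave 20% of the L3 capacity but about 72% of the memory bandwidth, so the two limits used for CGPOP are not equally severe reductions.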

CGPOP sensitivity to LLC cache size and bandwidth

Without gremlins
Limiting size (4MB)
Limiting bandwidth (28.7 GB/s)

Slide16

CGPOP sensitivity LLC cache – variability / unbalance

[Figure. Panels: matvec, pcg; limiting size (4MB), limiting bandwidth (28.7 GB/s)]

Slide17

CGPOP sensitivity LLC cache – unbalance along time

[Figure. Panel: pcg; limiting size (4MB), limiting bandwidth (28.7 GB/s)]

Slide18

CGPOP sensitivity LLC cache – tracking evolution

[Figure. Panels: matvec, pcg; limiting bandwidth]

Slide19

Integrating Gremlins in Extrae

Extrae online analysis mode
  Based on MRNet
  Gremlins API to activate them locally
  All Gremlins launched at initialization time
First experiments with LLC cache size gremlins
  Periodic increase of Gremlins
  Unbalanced steal of resources

Slide20

LULESH sensitivity to LLC cache size

Can we extract insight from chaos?
Unbalanced Gremlins creation

Slide21

LULESH sensitivity to LLC cache size – tracking evolution

L3/instr. ratio

Slide22

Conclusions
  Different regions show different sensitivity to resource reductions
  Asynchrony affects actual sharing of resources
    Today this happens without control, leading to variability
  Detailed analysis detects increases in variability and potential non-uniform impact