Chimera: Collaborative Preemption for Multitasking on a Shared GPU


Presentation Transcript

1. Chimera: Collaborative Preemption for Multitasking on a Shared GPU
Jason Jong Kyu Park¹, Yongjun Park², and Scott Mahlke¹
¹University of Michigan, Ann Arbor  ²Hongik University

2. GPUs in Modern Computer Systems
The GPU is now a default component in modern computer systems: servers, desktops, laptops, and mobile devices. It offloads data-parallel kernels written in CUDA, OpenCL, etc.

3. GPU Execution Model
[Figure: a kernel's threads are grouped into thread blocks.]
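Not from the slides, but as a minimal CUDA sketch of this hierarchy: a kernel launches a grid of thread blocks, and each block contains many threads (the kernel name and parameters here are illustrative).

__global__ void scale(float *data, float factor, int n)
{
    // Global thread index: block index * block size + thread index in block.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= factor;
}

// Host-side launch: 256 threads per block, enough blocks to cover n elements.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);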

4. Multitasking Needs in GPUs
Augmented reality example: Bodytrack (a data-parallel algorithm) and 3D rendering (graphics) must share the GPU.

5. Traditional Context Switching
[Timeline: K2 launches while K1 runs; K1's context is saved, then K2's is loaded, then K2 starts.]
On a GTX 780 (Kepler), each SM holds 256 kB of registers + 48 kB of shared memory, and 288.4 GB/s of memory bandwidth is shared across 12 SMs, so a context switch takes ~88 µs per SM (vs. ~1 µs on CPUs).

6. Challenge 1: Preemption Latency
K2 cannot start until K1's context save completes: ~88 µs per SM is too long for latency-critical kernels.

7. Challenge 2: Throughput Overhead
While the context is being saved and loaded, no useful work is done on the SM.

8. Objective of This Work
[Figure: preemption cost (0-100%) vs. thread block progress (%). Prior work (Switch) pays a constant cost regardless of progress.]

9. SM Draining [Tanasic '14]
When K2 launches, the SM stops issuing new thread blocks of K1 and lets the already-running thread blocks run to completion before K2 starts.

10. Chimera
[Figure: preemption cost vs. thread block progress for Switch and Drain. Switch costs the same at any progress; Drain is cheap near the end of a thread block but expensive near the beginning. The gap between the two curves is the opportunity.]

11. SM Flushing
Instant preemption: throw away what was running on the SM and re-execute it from the beginning later. This is safe for idempotent kernels: global memory is the only state observable outside the GPU, so a kernel is idempotent if the state it reads is never modified.

12. Finding Relaxed Idempotence
The region of a CUDA kernel from its start up to the first atomic operation or the first overwrite of read state is idempotent, and is detected by the compiler:

__global__ void
kernel_cuda(const float *in, float *out, float *inout)
{
    … = inout[idx];      // read: start of the idempotent region
    …
    atomicAdd(…);        // atomic operation: ends the idempotent region
    …
    out[idx] = …;        // global overwrite
    inout[idx] = …;      // overwrites previously read state
}
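For concreteness, a runnable sketch of the same pattern (the body and the extra sum parameter are hypothetical; the slide elides the details with ellipses): the region before the first atomic only reads, so a flushed thread block can safely re-execute it from the start.

__global__ void kernel_example(const float *in, float *out, float *inout,
                               float *sum, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    // ---- Idempotent region: only reads, no externally visible writes ----
    float v = inout[idx] + in[idx];

    // Atomic operation: the first visible side effect ends the region;
    // re-executing past this point would double-count the addition.
    atomicAdd(sum, v);

    // Global overwrites: out[idx] is write-only (harmless to redo), but
    // inout[idx] overwrites state that was read earlier, so the kernel
    // is idempotent only up to the atomic, not end to end.
    out[idx] = v * 2.0f;
    inout[idx] = v;
}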

13. Chimera
Flush near the beginning, context switch in the middle, and drain near the end of a thread block's execution.
[Figure: taking the minimum of the Flush, Switch, and Drain cost curves over thread block progress yields the optimal preemption cost.]

14. Independent Thread Block Execution
There is no shared state between SMs or between thread blocks, so each SM and each thread block can be preempted with a different preemption technique.

15. Chimera
[Figure: collaborative preemption. Based on each thread block's progress, Chimera flushes some SMs, context-switches others, and drains the rest.]

16. Architecture: Two-Level Scheduler
A kernel scheduler plus a thread block scheduler. The kernel scheduler's SM scheduling policy decides how many SMs each kernel will have and produces the thread-block-to-kernel mapping.

17. Architecture: Chimera
Chimera sits between the kernel scheduler and the thread block scheduler, deciding which SM will be preempted and which preemption technique to use.

18. Architecture: Thread Block Scheduler
The thread block scheduler decides which thread block is scheduled next and carries out the preemption decision. It keeps a thread block queue per kernel; on a preemption, the preempted thread block goes back into its queue ahead of the next thread block. A sketch of these structures follows.
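A minimal host-side sketch of the structures slides 16 through 18 describe (the names and the FIFO policy are assumptions, not the paper's exact design):

#include <deque>
#include <vector>

struct ThreadBlock { int kernel_id; int tb_id; };

// Kernel scheduler output (slide 16): how many SMs each kernel gets.
struct KernelScheduler {
    std::vector<int> sms_per_kernel;  // SM scheduling policy decision
};

// Thread block scheduler state (slide 18): one queue per kernel.
struct ThreadBlockScheduler {
    std::vector<std::deque<ThreadBlock>> queue_per_kernel;

    // A preempted thread block returns to the front of its kernel's
    // queue so it is the next one issued for that kernel.
    void requeue_preempted(const ThreadBlock &tb) {
        queue_per_kernel[tb.kernel_id].push_front(tb);
    }

    // Issue the next thread block for the kernel that owns a free SM.
    ThreadBlock next_for(int kernel_id) {
        ThreadBlock tb = queue_per_kernel[kernel_id].front();
        queue_per_kernel[kernel_id].pop_front();
        return tb;
    }
};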

19. Cost Estimation: Preemption Latency
Switch: context size / (memory bandwidth / # of SMs).
Drain: average execution instructions - progress in instructions = estimated remaining instructions; estimated remaining instructions x CPI = estimated preemption latency. Instructions are measured at warp granularity.
Flush: zero preemption latency.

20. Cost Estimation: Throughput Overhead
Switch: IPC * preemption latency * 2, doubled due to context saving and loading.
Drain: estimated from each thread block's progress in instructions relative to the most progressed thread block in the same SM, at warp granularity.
Flush: the executed instructions that get discarded, at warp granularity.
A code sketch of both estimates follows.
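Putting slides 19 and 20 together, a sketch of the per-technique estimates as host-side helpers (hypothetical names; units and counters simplified, with instructions tracked at warp granularity as the slides state):

struct TBState {
    long long avg_insts;       // average execution instructions for this kernel
    long long progress_insts;  // instructions this thread block has executed
};

// Switch: move the context through this SM's share of memory bandwidth.
double switch_latency_s(double context_bytes, double mem_bw_Bps, int num_sms) {
    return context_bytes / (mem_bw_Bps / num_sms);
}
double switch_overhead(double ipc, double latency_cycles) {
    return ipc * latency_cycles * 2.0;  // doubled: context save and load
}

// Drain: estimated remaining instructions times CPI approximate the wait.
double drain_latency_cycles(const TBState &tb, double cpi) {
    return (double)(tb.avg_insts - tb.progress_insts) * cpi;
}

// Flush: preemption is instant, but every instruction the flushed
// thread block already executed is thrown away and redone later.
double flush_latency_cycles() { return 0.0; }
double flush_overhead(const TBState &tb) {
    return (double)tb.progress_insts;
}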

21. Chimera Algorithm
[Figure: a GPU whose SMs are preempted with different techniques, Flush, Drain, or Switch, chosen per thread block.] The preemption victim is the choice with the least throughput overhead that meets the preemption latency constraint.

22. Chimera Algorithm
For each candidate, Chimera estimates the latency and overhead of each technique (Flush, Switch, ...), discards choices whose latency violates the constraint, and picks the victim with the least throughput overhead. A sketch of this selection follows.
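A minimal sketch of that selection (the flat Option structure is a simplifying assumption; the real algorithm evaluates a technique per thread block within each candidate SM):

#include <limits>
#include <vector>

struct Option {
    int sm_id;          // candidate victim SM
    double latency_us;  // estimated preemption latency (slide 19)
    double overhead;    // estimated throughput overhead (slide 20)
};

// Pick the option with the least throughput overhead among those that
// meet the preemption latency constraint (15 us in the evaluation).
int pick_victim(const std::vector<Option> &opts, double constraint_us) {
    int best_sm = -1;
    double best_overhead = std::numeric_limits<double>::infinity();
    for (const Option &o : opts) {
        if (o.latency_us > constraint_us) continue;  // violates the deadline
        if (o.overhead < best_overhead) {
            best_overhead = o.overhead;
            best_sm = o.sm_id;
        }
    }
    return best_sm;  // -1 if no candidate can meet the constraint
}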

23. Experimental Setup
Simulator: GPGPU-Sim v3.2.2; GPU model: Fermi architecture with up to 32,768 registers (128 kB) and up to 48 kB shared memory per SM.
Workloads: 14 benchmarks from the Nvidia SDK, Parboil, and Rodinia.
GPGPU benchmark + synthetic benchmark: the synthetic benchmark mimics a periodic real-time task (e.g., a graphics kernel) with a 1 ms period and 200 µs execution time.
GPGPU benchmark + GPGPU benchmark pairs.
Baseline: non-preemptive first-come first-served.

24. Preemption Latency Violations
GPGPU benchmark + synthetic benchmark, with a 15 µs preemption latency constraint (real-time task): only 0.2% of preemptions violate the constraint. Violations come from non-idempotent kernels with short thread block execution times, where Chimera estimated a shorter preemption latency than actually occurred.

25. System Throughput
Case study: LUD + another GPGPU benchmark. LUD has many kernel launches with a varying number of thread blocks; in this case Drain achieves a lower (better) average normalized turnaround time than Chimera (5.17x for Drain vs. 5.50x for Chimera).

26. Preemption Technique Distribution
[Figure: distribution of preemption techniques chosen for the GPGPU benchmark + synthetic benchmark pairs.]

27. Summary
Context switching can have high overhead on GPUs: both preemption latency and throughput overhead.
Chimera contributes SM flushing (instant preemption) and collaborative preemption (Flush + Switch + Drain).
It almost always meets the preemption latency constraint: 0.2% violations, caused by estimating a shorter preemption latency than actually occurred.
5.5x ANTT improvement and 12.2% STP improvement for GPGPU benchmark + GPGPU benchmark combinations.

28. Questions?