Distributed Order Scheduling and its Application to Multi-Core DRAM Controllers

Presentation Transcript

1. Distributed Order Scheduling and its Application to Multi-Core DRAM Controllers
Thomas Moscibroda, Distributed Systems Research, Redmond
Onur Mutlu, Computer Architecture Research, Redmond

2. Overview
We study an important problem in memory request scheduling in multi-core systems.
- It maps to a well-known scheduling problem: the order scheduling problem.
- But here it arises in a distributed setting: the distributed order scheduling problem.
How well can this scheduling problem be solved in a distributed setting? How much communication (information exchange) is needed for a good solution?

3. Multi-Core Architectures – DRAM Memory
Multi-core systems have many cores (processor, caches) on a single chip; the DRAM memory is typically shared.
[Diagram: Cores 1..N, each with an L2 cache, connected on-chip through a DRAM memory controller and DRAM bus to a DRAM memory system with Banks 1..8.]

4. DRAM Memory Controller
[Diagram: the same multi-core DRAM system as on slide 3, focusing on the DRAM memory controller.]

5. DRAM Memory Controller
[Diagram: the same multi-core DRAM system as on slide 3.]
- DRAM is partitioned into different banks.
- The DRAM controller consists of request buffers (typically one per bank) and a request scheduler that decides which request to schedule next. (A minimal data-structure sketch follows below.)
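To make this organization concrete, here is a minimal sketch in Python. The class and field names are invented for illustration; the slides do not prescribe any implementation.

```python
# Illustrative sketch only: per-bank request buffers plus a pluggable
# scheduling policy, mirroring the controller organization on this slide.
from collections import defaultdict, namedtuple

Request = namedtuple("Request", ["thread", "bank"])  # the (Thread_i, Bank_j) tuple

class DRAMController:
    def __init__(self, num_banks, policy):
        self.buffers = defaultdict(list)   # one request buffer per bank
        self.num_banks = num_banks
        self.policy = policy               # decides which buffered request goes next

    def enqueue(self, req):
        self.buffers[req.bank].append(req)

    def tick(self):
        """Each bank independently serves one request per cycle."""
        served = []
        for bank in range(self.num_banks):
            if self.buffers[bank]:
                req = self.policy(self.buffers[bank])
                self.buffers[bank].remove(req)
                served.append(req)
        return served

fifo = lambda buf: buf[0]  # e.g. FIFO arrival order as a baseline policy
```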

6. DRAM Memory Controller – Example
[Diagram: memory request buffers for Banks 1-4, each served by its own bank scheduler (BankScheduler 1-4), receiving requests (e.g. from thread T2) issued by Cores 1..N.]

7. DRAM Memory Controller – Example
[Diagram: animation step of the same figure as on slide 6.]

8. DRAM Memory Controller – Example
[Diagram: the same four banks, with the request buffers now filled with requests from threads T1, T2, T4, T5, and T7 across Banks 1-4.]

9. DRAM Memory Controller
- Cores issue memory requests (when missing in their caches).
- Each memory request is a tuple (Thread_i, Bank_j).
- Accesses to different banks can be served in parallel.
- A thread/core can run if no memory request is outstanding, and is blocked (stalled) if there is at least one request outstanding in the DRAM. (The above is a significant simplification, but accurate to a first approximation.)
- In combination with a fairness substrate, minimizing average stall times in the DRAM greatly improves application performance.
- PAR-BS scheduling algorithm [Mutlu, Moscibroda, ISCA'08].
Goal: Minimize the average stall time of threads!

10. Overview
- Distributed DRAM Controllers: Background & Motivation
- Distributed Order Scheduling Problem
- Base Cases: Complete information / No information
- Distributed Algorithm: Communication vs. Approximation trade-off
- Empirical Evaluation / Conclusions

11. Customer Order Scheduling
- Also known as the concurrent open shop scheduling problem.
- Given a set of n orders (= threads) T = {T_1, ..., T_n}.
- Given a set of m facilities (= banks) B = {B_1, ..., B_m}.
- Each thread T_i has a set of requests R_ij going to bank B_j.
- Let p_ij be the total processing time of all requests R_ij.
[Diagram: the request buffers from the earlier example, annotated with, e.g., p_21 = 2 for requests R_21 and p_33 = 3 for requests R_33.]

12. Customer Order Scheduling
- As before: n orders (= threads) T = {T_1, ..., T_n}, m facilities (= banks) B = {B_1, ..., B_m}; thread T_i has requests R_ij going to bank B_j, with total processing time p_ij.
- Let C_ij be the completion time of the requests R_ij.
- An order/thread is completed only when all of its requests are served. Its order completion time C_i = max_j C_ij corresponds to the thread's stall time.
- Goal: Schedule all orders/threads such that the average completion time is minimized.

13. Example
[Figure: the requests of threads T0-T3 spread across Banks 0-3, with a time axis.]
Baseline scheduling (FIFO arrival order): completion times C(T0) = 4, C(T1) = 4, C(T2) = 5, C(T3) = 7, so AVG = (4+4+5+7)/4 = 5.
Ordering-based scheduling with ranking T0 > T1 > T2 > T3: completion times C(T0) = 1, C(T1) = 2, C(T2) = 4, C(T3) = 7, so AVG = (1+2+4+7)/4 = 3.5.
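As a sanity check on this kind of arithmetic, here is a small Python sketch (not from the slides; the bank contents below are invented) that computes the average order completion time for a fixed per-bank schedule, where each request takes one time unit and a thread completes when its last request finishes:

```python
def avg_completion_time(schedules):
    """schedules: dict bank -> list of thread ids in service order,
    one unit-time request per entry. A thread's completion time is the
    latest finish time of any of its requests (max over banks)."""
    finish = {}
    for order in schedules.values():
        for t, thread in enumerate(order, start=1):
            finish[thread] = max(finish.get(thread, 0), t)
    return sum(finish.values()) / len(finish)

# Toy instance in the spirit of the slide (request placement invented):
fifo   = {0: ["T3", "T1", "T0"], 1: ["T2", "T3", "T1"], 2: ["T2", "T0"]}
ranked = {0: ["T0", "T1", "T3"], 1: ["T1", "T2", "T3"], 2: ["T0", "T2"]}
print(avg_completion_time(fifo), avg_completion_time(ranked))  # 2.25 vs. 2.0
```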

14. Distributed Customer Order Scheduling
- Each bank has its own bank scheduler, which computes its own schedule.
- A scheduler only knows the requests in its own buffer.
- Schedulers should exchange information in order to coordinate their decisions!
Simple distributed model (a round-loop sketch follows below):
- Time is divided into (synchronous) rounds.
- Initially, there is only local knowledge.
- In every round, every scheduler B_j ∈ B can broadcast one message of the form (T_i, p_ij) to all other schedulers.
- After n rounds, complete information is exchanged.
Trade-off: amount of communication (information exchange) vs. quality of the resulting global schedule.
[Figure: BankScheduler 3 sends "Thread 3 has 2 requests for bank 3" to all other schedulers.]
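A minimal simulation of this round model, with invented function names and a trivial "broadcast your buffer in sorted order" policy, might look as follows; it only illustrates the synchronous broadcast mechanics, not any particular scheduling strategy:

```python
def run_rounds(local_views, n_rounds):
    """local_views: dict bank -> {thread: p_ij} (each scheduler's own buffer).
    In each round every scheduler broadcasts one (thread, p_ij) pair;
    every scheduler folds all received pairs into its view of the instance."""
    views = {b: {b: dict(col)} for b, col in local_views.items()}
    queues = {b: sorted(col.items()) for b, col in local_views.items()}
    for r in range(n_rounds):
        broadcasts = [(b, q[r]) for b, q in queues.items() if r < len(q)]
        for src_bank, (thread, p) in broadcasts:  # delivered to every scheduler
            for view in views.values():
                view.setdefault(src_bank, {})[thread] = p
    return views  # after n rounds, all views agree on the full matrix

views = run_rounds({0: {"T1": 5, "T2": 1}, 1: {"T1": 2, "T3": 4}}, n_rounds=2)
```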

15. Related Work
I. Memory Request Scheduling
- Existing DRAM memory schedulers typically implement the FR-FCFS algorithm [Rixner et al., ISCA'00]: no coordination between bank schedulers!
- FR-FCFS is potentially unfair and insecure in multi-core systems [Moscibroda, Mutlu, USENIX Security'07].
- Fairness-aware scheduling algorithms have been proposed [Nesbit et al., MICRO'06; Mutlu & Moscibroda, MICRO'07; Mutlu & Moscibroda, ISCA'08].
II. Customer Order Scheduling
- The problem is NP-hard even for 2 facilities [Sung, Yoon'98; Roemer'06].
- Many heuristics have been extensively evaluated [Leung, Li, Pinedo'05].
- A 16/3-approximation algorithm exists for the weighted version [Wang, Cheng'03].
- A 2-approximation algorithm for the unweighted case is first implicitly contained in [Queyranne, Sviridenko, SODA'00] and later explicitly stated in [Chen, Hall'00; Leung, Li, Pinedo'07; Garg, Kumar, Pandit'07].

16. Overview
- Distributed DRAM Controllers: Background & Motivation
- Distributed Order Scheduling Problem
- Base Cases: Complete information / No information
- Distributed Algorithm: Communication vs. Approximation trade-off
- Empirical Evaluation / Conclusions

17. No Communication
- Each scheduler only knows its own buffer.
- We consider only "fair" algorithms: every scheduler decides on an ordering based only on processing times (not thread IDs).
Theorem 1: Every (possibly randomized) fair distributed order scheduling algorithm without communication has a worst-case approximation ratio of Ω(√n).
- Notice that most DRAM scheduling algorithms used in today's computer systems are fair and do not use communication, so the theorem applies to most currently used algorithms.

18. No Communication – Proof
- m singleton orders T_1, ..., T_m, each with only a single request, to bank B_i.
- β = n − m orders T_{m+1}, ..., T_n, each with a request for every bank.
- OPT is to schedule all singletons first, followed by T_{m+1}, ..., T_n.
- Fair algorithm: all orders look exactly the same, so there is no better strategy than a random order.
- A lower bound on the expected completion time of any singleton follows, and the theorem follows from setting the parameters m and β appropriately.
[Figure: a schedule placing the singletons T_1, ..., T_m and the full orders T_{m+1}, ..., T_n across the banks.]
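The slide's formulas did not survive extraction; the following is a plausible reconstruction of the elided calculation from the stated construction (our own derivation, not verbatim from the slide). Under a uniformly random order, a fixed singleton is preceded in expectation by half of the β full orders, each of which occupies the singleton's bank for one time unit:

$$\mathbb{E}\big[C_{\text{singleton}}\big] \;\ge\; 1 + \frac{\beta}{2}, \qquad \sum_i C_i^{ALG} \;\ge\; m\left(1 + \frac{\beta}{2}\right),$$

while OPT serves every singleton at time 1 and then the full orders back to back:

$$\sum_i C_i^{OPT} \;\le\; m + \sum_{j=1}^{\beta} (1 + j) \;=\; m + \beta + \frac{\beta(\beta+1)}{2}.$$

Choosing, e.g., β = Θ(√n) and m = n − β makes the ratio of the two sums Ω(√n), matching Theorem 1.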

19. Complete Communication
- Every scheduler has perfect global knowledge (the centralized case!).
- Algorithm: Solve an LP over completion times with machine capacity constraints (a standard reconstruction is given below), then globally schedule threads in non-decreasing order of C_i as computed by the LP.
Theorem 2 [based on Queyranne, Sviridenko'00]: There is a fair distributed order scheduling algorithm with communication complexity n and approximation ratio 2.
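The LP itself was lost in extraction. A standard formulation used for this kind of 2-approximation (Queyranne-style machine capacity constraints per facility; our reconstruction, not necessarily the slide's exact notation) is:

$$\min \sum_{i=1}^{n} C_i \quad \text{subject to} \quad \sum_{i \in S} p_{ij}\, C_i \;\ge\; \frac{1}{2}\left(\Big(\sum_{i \in S} p_{ij}\Big)^{2} + \sum_{i \in S} p_{ij}^{2}\right) \qquad \forall\, B_j \in B,\; \forall\, S \subseteq T.$$

The constraints say that no bank can finish the total work of any subset of threads faster than by processing it without idling; scheduling threads by non-decreasing LP value C_i then loses at most a factor of 2.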

20. Overview
- Distributed DRAM Controllers: Background & Motivation
- Distributed Order Scheduling Problem
- Base Cases: Complete information / No information
- Distributed Algorithm: Communication vs. Approximation trade-off
- Empirical Evaluation / Conclusions

21. Distributed Algorithm
- The 2-approximation algorithm inherently requires complete knowledge of all p_ij for the LP: only this way do all schedulers compute the same LP solution, and hence the same thread ordering.
- What happens if not all p_ij are known?
- Challenge: different schedulers have different views, so they compute different thread orderings, resulting in suboptimal performance!

22. Distributed Algorithm
- Input k: the algorithm has time complexity t = n/k.
- For each bank B_j, define L_j as the requests with the t longest processing times in this bank, and S_j as the other n − t requests.
- t rounds: broadcast exact information (T_i, p_ij) about all long requests in L_j.
- 1 round: broadcast the average value (T_i, P_j) over all short requests in S_j.
- Using the received information, every scheduler locally computes LP*: exact values for the long requests, per-bank averaged values for all short requests.
- Let C*_i be the completion times resulting from LP*; each scheduler schedules threads according to increasing C*_i. (A sketch of the communication phase follows below.)
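A minimal sketch of the communication phase in Python (function names and the handling of short-request identities are our own simplifications; the slides give no pseudocode):

```python
def build_broadcasts(p_col, t):
    """p_col: {thread: p_ij} for one bank. Returns exact pairs for the
    t longest requests plus one average standing in for all short ones."""
    ranked = sorted(p_col, key=p_col.get, reverse=True)
    long_ids, short_ids = ranked[:t], ranked[t:]
    exact = {i: p_col[i] for i in long_ids}
    avg = (sum(p_col[i] for i in short_ids) / len(short_ids)) if short_ids else 0.0
    return exact, avg, short_ids

def reconstruct_instance(broadcasts):
    """broadcasts: {bank: (exact, avg, short_ids)}. Every scheduler builds
    the same averaged matrix p*, so all of them solve the identical LP*
    and derive the identical thread ordering."""
    p_star = {}
    for bank, (exact, avg, short_ids) in broadcasts.items():
        col = dict(exact)
        col.update({i: avg for i in short_ids})  # short requests -> bank average
        p_star[bank] = col
    return p_star

# Toy instance with n = 4 threads, 2 banks, and t = 2 exact broadcasts per bank:
p = {0: {"T1": 5, "T2": 1, "T3": 4, "T4": 2},
     1: {"T1": 1, "T2": 6, "T3": 2, "T4": 3}}
msgs = {bank: build_broadcasts(col, t=2) for bank, col in p.items()}
p_star = reconstruct_instance(msgs)  # identical at every scheduler
```

Solving LP* itself (with any LP solver) is omitted; the point is that the averaged instance is identical everywhere, so the locally derived orderings agree.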

23. Distributed Algorithm
[Figure: a bank's buffer split into L_j (long requests, exact values) and S_j (short requests, averages only).]
Every scheduler locally invokes the LP using these averaged values, yielding LP*.

24. Distributed Algorithm – Results
Theorem 3: For any k, the distributed algorithm has a time complexity of n/k + 1 and achieves an approximation ratio of O(k).
- There are examples where the algorithm is Ω(k) worse than OPT, so our analysis is asymptotically tight; see the paper for details.
- The proof is challenging for several reasons...

25. Distributed Algorithm – Proof Overview
Distinguish four completion times:
- C_i^OPT: optimal completion time of T_i
- C_i^LP: completion time in the original LP
- C_i^LP*: completion time as computed by the averaged LP*
- C_i^ALG: completion time resulting from the algorithm
1) Show that the averaged LP* is within O(k) of the original LP.
2) Show that the algorithm's solution is also within O(k) of OPT. (See paper.)

26. Distributed Algorithm – Proof Overview
- Define Q_h: the t orders with the highest completion times in the original LP.
- Define virtual completion times, each defined as the average of all completion times in Q_h.
Three key lemmas about the virtual completion times (formulas lost in extraction):
1. They bound OPT.
2. They form a feasible solution to the (original) LP.
3. They bound ALG.
[Figure: completion-time axis with the set Q_h marked.]

27. Empirical Evaluation
- We evaluate our algorithm using SPEC CPU2006 benchmarks and two large Windows desktop applications (Matlab, an XML parsing app).
- Cycle-accurate simulator framework; models for processors & instruction windows, L2 caches, and DRAM memory.
- See the paper for further methodology.
[Plot: results for communication budgets ranging over k = n, k = n−1, ..., k = 0, compared against the Max-tot heuristic [Mutlu, Moscibroda'07] and a local shortest-job-first heuristic.]

28. Summary / Future Work
- DRAM memory scheduling in multi-core systems maps to the distributed order scheduling problem.
- Results:
  - No communication: Ω(√n)-approximation
  - Complete knowledge: 2-approximation
  - n/k communication rounds: O(k)-approximation
- No matching lower bound: better approximations may be possible.
- Distributed computing meets multi-core computing: so far, mainly new programming paradigms (transactional memory, parallel algorithms, etc.). In this paper: a new distributed computing problem arising in the microarchitecture of multi-core systems. Many more such problems in this space!