Presentation Transcript

Communication and Coordination Paradigms for Highly-parallel Accelerators

Marc S. Orr
PhD Defense, December 5, 2016
Advisor: David A. Wood
Committee: Brad Beckmann, Mark Hill, Nam Sung Kim, Karu Sankaralingam, and Mike Swift

Motivation

- Highly-parallel accelerators, like GPUs, are becoming ubiquitous.
- GPUs are performant and energy-efficient for data-parallel applications that accommodate the GPU's peculiarities.
- However, many data-parallel applications cannot accommodate those peculiarities.

Research Statement

- I focus on one of the GPU's peculiarities: it is difficult to synchronize on GPUs.
- My first goal is to enable GPU threads to coordinate efficiently.
- My second goal is to identify novel use cases for efficient GPU thread coordination:
  - Task Aggregation for GPUs (in ISCA-2014)
  - Message Aggregation for GPUs (under review)

Outline

- Motivation & Background
- Coordinating System-level Operations
  - Fine-grain System-level Operations
  - Task Aggregation for GPUs
  - Message Aggregation for GPUs
- Conclusions

Highly-parallel Accelerators

- Lots of simple threads
- Data-parallel hardware
- Examples: Intel Xeon Phi, GPUs
- My focus: GPUs
  - Abundant commodity hardware sold by many companies
  - Other accelerators are less common/more niche

GPUs are Everywhere

- Bottom tier (cell phones, tablets, and netbooks)
- Mid tier (laptops and desktops)
- Top tier (supercomputers and cloud computing)
- New programming abstractions are making them more accessible
  - CUDA, OpenCL
  - C++ AMP, OpenACC, Java Aparapi, etc.
- Integration is reducing their performance "tariff"

GPU Primer

- Work-item: GPU thread executed by a SIMD lane
- Wavefront: work-items executed by the same SIMD unit
- Work-group: wavefronts executing on the same compute unit (CU)
- SIMT group: a wavefront or a work-group

[Figure: a compute unit (CU) containing SIMD units that execute wavefronts of work-items grouped into a work-group, backed by an L1 cache and a shared L2/memory.]
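
For readers more familiar with CUDA, the sketch below maps these terms onto CUDA's hierarchy (work-item ≈ thread, wavefront ≈ warp, work-group ≈ thread block, CU ≈ streaming multiprocessor). This is an illustration added here, not part of the slides; it assumes NVIDIA's 32-lane warps, whereas AMD wavefronts are 64 lanes wide.

    // Each thread reports its position in the GPU's execution hierarchy.
    __global__ void who_am_i(int *out) {
        int work_item  = blockIdx.x * blockDim.x + threadIdx.x;  // global work-item id
        int lane       = threadIdx.x % 32;   // SIMD lane within the wavefront (warp)
        int wavefront  = threadIdx.x / 32;   // wavefront within the work-group
        int work_group = blockIdx.x;         // work-group (thread block), runs on one CU
        out[work_item] = (work_group << 16) | (wavefront << 8) | lane;
    }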

Paradigms that Map Poorly to GPUs

- System-level operations
  - Task scheduling
  - I/O
  - System calls
  - Memory allocation
- More generally, communication & coordination
  - Synchronization
  - Resource sharing

Result: many applications cannot easily benefit from the GPU's
- higher compute capability,
- higher memory bandwidth, and
- lower energy.

Communication & Coordination

- What do I mean? Locks? RMWs? Transactions? etc.
- Focus: producer/consumer queues
  - Producers: GPU threads
  - Consumers: CPUs, GPUs, network interface, etc.
  - Goal: arbitrate system-level operations
- Narrow enough for a well-defined problem; broad enough for novel research

What Makes Coordination Hard?

- Many threads → contention
  - AMD Kaveri (iGPU): 20,480 threads
  - AMD FirePro S9170 (dGPU): 112,640 threads
- Intra-wavefront dependencies can cause deadlock:

    while (trylock(&L)) ;  // spin: losing lanes keep looping in lockstep...
    unlock(&L);            // ...so the lane holding the lock may never reach unlock

Outline

- Motivation & Background
- Coordinating System-level Operations
  - Fine-grain System-level Operations
  - Task Aggregation for GPUs
  - Message Aggregation for GPUs
- Conclusions

Prior Work

- Coprocessor model: GPU threads don't execute system-level operations; instead, the CPU executes them
- Operation-per-lane model: GPU threads independently execute system-level operations
- Coalesced APIs: GPU work-groups (not threads) execute system-level operations

Examples of Prior Work

- Coprocessor: network I/O: GPUDirect RDMA, MVAPICH2-GPU
- Operation-per-lane: tasking: CUDA & OpenCL 2.0 dynamic parallelism; network I/O: NVSHMEM, DCGN, GGAS
- Coalesced APIs: network I/O: GPUnet, GPUrdma; file I/O: GPUfs

Coprocessor Model

System-level operations are executed by the host CPU.

[Figure: an SoC containing a GPU (SIMT groups 0 and 1, work-items wi0-wi7), a CPU, and a single producer/consumer queue in shared memory; the CPU performs system-level operations on the GPU's behalf.]

Operation-per-lane Model

System-level operations are executed by GPU threads.

[Figure: the same SoC, but each work-item (wi0-wi7) in SIMT groups 0 and 1 independently enqueues into the producer/consumer queues in memory, each update bumping a queue's write pointer (wrPtr).]

Coalesced APIs

System-level operations are executed by GPU work-groups.

[Figure: the same SoC, but each SIMT group enqueues into the producer/consumer queues as a unit, so the write pointers (wrPtr) are updated once per work-group rather than once per work-item.]

Takeaways

Model              | Programmable? | Prod/cons overhead | Pipelined execution?
Coprocessor        | no            | low                | requires programmer effort
Operation-per-lane | yes           | high               | yes
Coalesced APIs     | partially     | medium             | yes

MY GOAL: the operation-per-lane model's programmability with the coprocessor model's performance.

Outline

- Motivation & Background
- Coordinating System-level Operations
  - Fine-grain System-level Operations
  - Task Aggregation for GPUs
  - Message Aggregation for GPUs
- Conclusions

Implementation Overview

- Introduce leader-level synchronization
  - Amortize communication cost across a SIMT group
  - Eliminate dependencies within a SIMT group
- Describe two communication schemes that build off of leader-level synchronization
  - SIMT-direct aggregation
  - Indirect aggregation

Leader-level Synchronization

The SIMT group elects one work-item to synchronize on its behalf.

[Figure: the same SoC; only the elected leader of each SIMT group updates a queue's write pointer (wrPtr) on behalf of the whole group.]

Microbenchmark: ~118x speedup over work-item-level synchronization.
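
The dissertation has its own implementation; purely as an illustration, here is a minimal CUDA sketch of a leader-level enqueue, assuming a 32-lane wavefront, a 1-D work-group, and a simple array-based queue with a 64-bit write pointer (all names here are mine, not the thesis code):

    // One atomic per wavefront reserves slots for every active lane.
    __device__ void enqueue_leader_level(unsigned long long *wrPtr, int *slots, int my_item) {
        unsigned mask = __activemask();                  // lanes currently executing together
        int lane   = threadIdx.x & 31;                   // lane id (1-D block assumed)
        int leader = __ffs(mask) - 1;                    // lowest active lane is the leader
        unsigned long long base = 0;
        if (lane == leader)                              // one RMW per wavefront, not per lane
            base = atomicAdd(wrPtr, (unsigned long long)__popc(mask));
        base = __shfl_sync(mask, base, leader);          // broadcast the reserved base index
        int rank = __popc(mask & ((1u << lane) - 1));    // my position among the active lanes
        slots[base + rank] = my_item;                    // each lane fills its own slot
        // (wrap-around and full-queue checks omitted for brevity)
    }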

SIMT-direct Aggregation

The SIMT group invokes leader-level synchronization once for each producer/consumer queue that it targets.

[Figure: the same SoC; each SIMT group's leader updates the write pointer (wrPtr) of every queue the group is writing to, one leader-level sync per target queue.]
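
Extending the previous sketch, SIMT-direct aggregation can be pictured as a loop that peels off one destination queue per iteration. Again, this is an illustrative CUDA sketch under the same assumptions (full 32-lane wavefront, 1-D block, invented names), not the dissertation's code:

    struct Queue { unsigned long long wrPtr; int *slots; };

    // Each lane has one item destined for queues[my_queue]; the wavefront performs
    // one leader-level reservation per distinct destination queue.
    __device__ void simt_direct_enqueue(Queue *queues, int my_queue, int my_item) {
        const unsigned FULL = 0xffffffffu;   // assume all 32 lanes are active
        int lane = threadIdx.x & 31;
        bool pending = true;
        while (__any_sync(FULL, pending)) {
            unsigned want  = __ballot_sync(FULL, pending);
            int target     = __shfl_sync(FULL, my_queue, __ffs(want) - 1); // pick one queue
            unsigned group = __ballot_sync(FULL, pending && my_queue == target);
            int leader     = __ffs(group) - 1;
            unsigned long long base = 0;
            if (lane == leader)              // one atomic per queue per wavefront
                base = atomicAdd(&queues[target].wrPtr, (unsigned long long)__popc(group));
            base = __shfl_sync(FULL, base, leader);
            if (pending && my_queue == target) {
                int rank = __popc(group & ((1u << lane) - 1));
                queues[target].slots[base + rank] = my_item;
                pending = false;
            }
        }
    }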

Indirect Aggregation

The SIMT group uses leader-level synchronization to export its operations to a dedicated aggregator.

[Figure: the same SoC; each SIMT group's leader appends the group's operations to a staging queue (bumping its wrPtr), and a dedicated aggregator drains that queue through its read pointer (rdPtr) and redistributes the operations to the destination producer/consumer queues.]
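
On the consumer side, the aggregator can be sketched as the following C-style loop (queue layout and names are my assumptions; in the dissertation this role is played by the GPU's control processor or a CPU thread):

    // Drain the staging queue filled by the GPU and bin each exported operation
    // into its destination's producer/consumer queue (wrap-around checks omitted).
    typedef struct { int dest; int payload; } op_t;

    void aggregator_loop(op_t *staging, volatile unsigned *stagingWrPtr, unsigned *stagingRdPtr,
                         op_t **destQueue, unsigned *destWrPtr, volatile int *run) {
        while (*run) {
            while (*stagingRdPtr < *stagingWrPtr) {            // work exported by SIMT groups
                op_t op = staging[(*stagingRdPtr)++];          // single consumer: plain read
                destQueue[op.dest][destWrPtr[op.dest]++] = op; // append to destination queue
            }
        }
    }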

SIMT-direct vs. Indirect


Wavefront vs. Work-group


More on Work-groups

- The leader is elected by communicating across the threads in a work-group
- Problem: intra-work-group communication is not defined when a work-group's wavefronts diverge
- Two high-level solutions:
  - Track control flow at work-group granularity
  - Track which wavefronts are executing together
- Details in dissertation

Takeaways

- Leverage leader-level synchronization → efficient operation-per-lane model
  - SIMT-direct: scales to a small # of queues
  - Indirect: scales to a large # of queues
- SIMT group can be defined as a wavefront or a work-group
  - Wavefronts have well-defined behavior
  - Work-groups amortize communication across more work-items

Outline

- Motivation & Background
- Coordinating System-level Operations
  - Fine-grain System-level Operations
  - Task Aggregation for GPUs
  - Message Aggregation for GPUs
- Conclusions

Task Aggregation for GPUs: Executive Summary

- SIMT languages (e.g., CUDA & OpenCL) restrict GPU programmers to regular parallelism
- Goal: enable irregular parallelism on GPUs
  - Why? More GPU applications
  - How? Fine-grain task aggregation
  - What? Cilk on GPUs

GPUs Today

- GPU tasks are scheduled by the control processor (CP), a small, in-order programmable core
- Today's GPU abstractions are coarse-grain
  + Maps well to SIMD hardware
  - Limits fine-grain scheduling

[Figure: a GPU with SIMD units fed by a control processor (CP), attached to system memory.]

Fine-grain Scheduling via Channels

- The CP, or aggregator (agg), manages channels
- Channels are finite task queues, except:
  - User-defined scheduling
  - Dynamic aggregation
  - One consumption function

[Figure: the aggregator sits between the GPU's SIMD units and the channels held in system memory.]

Dynamic aggregation enables "CPU-like" scheduling abstractions on GPUs.
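
To give a flavor of the abstraction, a channel descriptor might look roughly like this C sketch; the field names and layout are my assumptions, not the ISCA-2014 implementation:

    // A channel is a bounded queue of same-sized task records that are all
    // executed by a single consumption function once the aggregator dispatches them.
    typedef struct channel {
        void    *records;           // backing array of fixed-size task records
        size_t   record_size;       // bytes per task record
        unsigned capacity;          // bound on outstanding tasks
        unsigned head, tail;        // consumer / producer positions
        void   (*consume)(void *);  // the one consumption function for this channel
        struct channel *next;       // aggregator's list of channels to scan (scheduling hook)
    } channel_t;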

Contributions

- First channel implementation (in software)
  - Lock-free, non-blocking, array-based
  - SIMT-direct, wavefront granularity
  - Implementation details
- Enable Cilk on GPUs by using channels as an intermediate representation
  - Breadth-first task traversal
  - Coarse-grain dependency resolution
  - Memory bounded with wavefront yield
  - Cilk-to-channels formulation

Methodology

- Implemented Cilk on channels on a simulated APU
- Caches are coherent
- The aggregator schedules Cilk tasks

Cilk Scales with the GPU Architecture

More compute units → faster execution. (Additional results appear in the backup slides.)

Takeaways

- Dynamic aggregation enables "CPU-like" scheduling abstractions for GPUs
- Achieved by extending the GPU's control processor to manage channels
- Channels can be used as an intermediate layer for higher-level abstractions like Cilk

Dynamic aggregation brings fine-grain task scheduling to GPUs.

Outline

- Motivation & Background
- Coordinating System-level Operations
  - Fine-grain System-level Operations
  - Task Aggregation for GPUs
  - Message Aggregation for GPUs
- Conclusions

Message Aggregation for GPUs: Executive Summary

- Irregular data-intensive workloads: small and hard-to-predict data access patterns
- Prior frameworks for these workloads focused on: CPUs (GraphChi); CPU-based clusters (Grappa, GraphLab, etc.); GPUs (GasCL)
- My focus: irregular data-intensive workloads on a cluster of GPUs
  - Aggregate fine-grain GPU-initiated messages

GasCL: Vertex-Centric Graph Model for GPUs

- Gather: vertex receives messages from neighbors
- Apply: compute and update the vertex's value
- Scatter: send messages to the vertex's neighbors
- PageRank example
- Limited to one GPU

    def scatter:
        foreach n in neighbors:
            sendMsg(n, my_value / my_edge_cnt)

    def gather-and-apply:
        my_value = 0.85 * combine_msg_sum() + 0.15 / vertex_cnt

[Figure: a five-vertex PageRank example; every vertex starts with rank 0.2, scatters messages of roughly 0.07-0.2 along its edges, and the ranks converge to values such as 0.31, 0.23, 0.14, and 0.09.]

Distributing GasCL

- Work-items initiate network messages
- One-sided communication: read data (GET), write data (PUT), higher-level operations (active messages)

    *dest = val;                                 // store to a neighbor in local memory
    post_write(dest, &val, sizeof(float), ...);  // one-sided PUT to a neighbor on a remote node

[Figure: three nodes connected by a network; a vertex's scatter updates neighbors on the same node with stores and neighbors on remote nodes with post_write PUTs.]

Approach

HSA unified virtual memory can help to offload network requests to the CPU.

[Figure: GPU work-items 0-3 on Node 0 post put requests (ST) into a request queue in shared memory; the CPU acknowledges them (ack) and issues the corresponding network PUTs into Node 1's memory, here for an array int out[4].]

Message Size vs. Sync Performance

Fine-grain producer/consumer synchronization is expensive!

Message Size vs. Network Performance

- Small messages degrade network performance
- Bidirectional Bandwidth Test* run on a pair of AMD Kaveris connected over a 10 Gb/s Ethernet link

* http://mvapich.cse.ohio-state.edu/benchmarks/

Aggregate Network Messages

Indirect aggregation at work-group granularity.

[Figure: on Node 0, GPU work-groups push messages into a GPU-producer/multi-consumer queue; CPU threads t0 and t1 drain it into per-destination bin buffers (one per remote node, n1 and n2) before sending over the network.]

GPU-producer/Multi-consumer Queue Organization

- Array-based queue
- Synchronization occurs at work-group granularity
- Payload organized so adjacent work-items activate the memory coalescer

[Figure: each queue slot holds a full bit, per-slot tickets (numServed, writeTicket, readTicket), and a lane-major payload (Lane 0 ... Lane N-1); global writeIdx and readIdx indices select slots.]
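
Interpreting the fields above, a plausible layout is the following C sketch; the sizes and exact field placement are my assumptions, not Gravel's:

    #define WG_SIZE 256   // work-group width (assumption)
    #define N_SLOTS 64    // ring size (assumption)

    struct gpmc_slot {
        int full;                 // 0 = empty, 1 = holds one work-group's worth of messages
        int writeTicket;          // orders producers that map to this slot
        int readTicket;           // orders consumers that map to this slot
        int numServed;            // ticket holders that have finished with this slot
        int payload[WG_SIZE];     // lane-major: payload[lane] written by lane -> coalesced stores
    };

    struct gpmc_queue {
        unsigned writeIdx;        // advanced once per producing work-group
        unsigned readIdx;         // advanced once per consuming CPU thread
        struct gpmc_slot slots[N_SLOTS];
    };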

GPU-producer/multi-consumer queue performance


Compared to CPU

- Expensive synchronization (e.g., acquiring a lock): once per work-group on the GPU vs. once per thread on the CPU
- GPU: adjacent work-items write the same cache lines
- CPU: threads write different (padded) cache lines → less false sharing, but lower cache utilization

[Figure: four CPU threads each acquire the lock and write padded values (v1.val ... v4.val) into separate cache lines, while one GPU work-group acquires the lock once and packs v1.val ... v4.val into a single CU cache line.]

Prototype Called Gravel

- Hardware: 8 AMD Kaveri processors (AMD A10-7850K); 56 Gb/s InfiniBand interconnect
- Software: mpic++ (Open MPI 1.6.5) & snack.sh
- Useful HSA features: unified virtual memory, C11 atomics
- Supports a subset of OpenSHMEM: work-items can initiate PUT and ATOMIC_INC
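
For a sense of the programming model, a Gravel-style kernel might look like the sketch below. The call gravel_put_float and its signature are hypothetical names I use for illustration, not Gravel's actual API:

    __device__ void gravel_put_float(int node, float *dst, float val);  // hypothetical Gravel call

    // Each work-item issues one fine-grain, GPU-initiated PUT to the node that owns
    // its destination vertex; the runtime aggregates the messages per destination node.
    __global__ void scatter_contribs(const int *owner_node, float * const *remote_addr,
                                     const float *contrib, int n) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v < n)
            gravel_put_float(owner_node[v], remote_addr[v], contrib[v]);
    }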

Gravel vs. CPU-only

Gravel vs. Prior Work

Takeaways

- Gravel's focus: small and unpredictable messages
- Dynamic aggregation amortizes network overhead
- Dynamic aggregation's own overhead is amortized across work-groups
- Gravel enables work-items to initiate network transactions
- Network transactions are offloaded to the CPU, which aggregates messages

Outline

- Motivation & Background
- Coordinating System-level Operations
  - Fine-grain System-level Operations
  - Task Aggregation for GPUs
  - Message Aggregation for GPUs
- Conclusions

Publications

Thesis chapters:
- Marc S. Orr, Bradford M. Beckmann, Steven K. Reinhardt, and David A. Wood, "Fine-grain Task Aggregation and Coordination on GPUs," ISCA 2014.
- Marc S. Orr, Shuai Che, Bradford M. Beckmann, Mark Oskin, Steven K. Reinhardt, and David A. Wood, "Gravel: Fine-grain GPU-initiated Network Messages," under review.

Other PhD research:
- Jason Power, Joel Hestness, Marc S. Orr, Mark D. Hill, and David A. Wood, "gem5-gpu: A Heterogeneous CPU-GPU Simulator," CAL 2014.
- Marc S. Orr, Shuai Che, Ayse Yilmazer, Bradford M. Beckmann, Mark D. Hill, and David A. Wood, "Synchronization Using Remote-Scope Promotion," ASPLOS 2015.
- Johnathan Alsop, Marc S. Orr, Bradford M. Beckmann, and David A. Wood, "Lazy Release Consistency for GPUs," MICRO 2016.

Conclusion

Three main contributions:
- Proposed efficient implementations for GPU-initiated system-level operations that follow the operation-per-lane model
- Explored task aggregation for GPUs
- Explored message aggregation for GPUs

Questions?

Backup


GPU Performance & Bandwidth

Source: NVIDIA, CUDA C Programming Guide, 2014.

GPU Energy

The GPU is 7.6x more energy efficient!

Source: Keckler, keynote presentation at MICRO, 2011.

First Channel Implementation

Our array-based, lock-free, non-blocking design accommodates SIMT access patterns.

[Figure: channels A and B live in system memory; each channel keeps a data array with head/tail pointers, control state with head, tail, and reserveTail pointers, and a done count, all managed by the CP on behalf of the GPU's SIMD units.]

Cilk Background

- Cilk extends C for divide-and-conquer parallelism
- Adds keywords:
  - spawn: schedule a thread to execute a function
  - sync: wait for prior spawns to complete

    int fib(int n) {
        if (n <= 2) return 1;
        int x = spawn fib(n - 1);
        int y = spawn fib(n - 2);
        sync;
        return (x + y);
    }

Enable Cilk on GPUs via Channels (Step 1)

Cilk routines are split at each sync into sub-routines.

Original:

    int fib(int n) {
        if (n <= 2) return 1;
        int x = spawn fib(n - 1);
        int y = spawn fib(n - 2);
        sync;
        return (x + y);
    }

Split into a "pre-sync" routine and a "continuation":

    int fib(int n) {                 // "pre-sync"
        if (n <= 2) return 1;
        int x = spawn fib(n - 1);
        int y = spawn fib(n - 2);
    }

    int fib_cont(int x, int y) {     // "continuation"
        return (x + y);
    }

Enable Cilk on GPUs via Channels (Step 2)

- Channels are instantiated for breadth-first traversal
  - Quickly populates the GPU's tens of thousands of lanes
- Facilitates coarse-grain dependency management

[Figure: the fib(5) spawn tree mapped onto a fib channel and a stack of fib_cont channels; the legend distinguishes "pre-sync" tasks that are ready, "pre-sync" tasks that are done, "continuation" tasks, "task A spawned task B," and "task B depends on task A."]

Bound Cilk's Memory Footprint

- Bound memory to the depth of the Cilk tree by draining channels closer to the base case
- The amount of work generated dynamically is not known a priori
- Proposal: GPUs allow SIMT threads to yield
  - Facilitates resolving conflicts on shared resources like memory

[Figure: the fib(5) spawn tree again, with the channels nearest the base case drained first.]

Divergence and Channels

- Branch divergence
- Memory divergence
  + Data in channels: good
  – Pointers to data in channels: bad

GPU NOT Blocked on Aggregator


GPU Cilk vs. Standard GPU Workloads

- Cilk is more succinct than SIMT languages
- Channels trigger more GPU dispatches

Benchmark | LOC reduction | Dispatch rate | Speedup
Strassen  | 42%           | 13x           | 1.06
Queens    | 36%           | 12.5x         | 0.98

Same performance, easier to program.

GPMC Queue Synchronization (part 1)

- Lots of synchronization, but only once per work-group
- Producers are ordered through a per-slot write ticket
- Consumers are ordered through a per-slot read ticket
- A producer and a consumer, each holding a valid ticket, are ordered through the full bit
- E.g., a queue with one slot and its per-slot state
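
Reading the tickets this way, the producer side can be sketched roughly as follows in CUDA; this is a minimal sketch under my assumptions about the protocol (ring sizes, field names, blockDim.x == WG_SIZE), not Gravel's implementation:

    #define WG_SIZE 256
    #define N_SLOTS 64

    struct slot  { int writeTicket; int full; int payload[WG_SIZE]; };
    struct queue { unsigned writeIdx; struct slot slots[N_SLOTS]; };

    // One work-group enqueues WG_SIZE words; only lane 0 synchronizes.
    __device__ void gpmc_enqueue(struct queue *q, int my_word) {
        __shared__ unsigned idx;
        int lane = threadIdx.x;                                // assume a 1-D work-group
        if (lane == 0) idx = atomicAdd(&q->writeIdx, 1);       // take a ticket: one RMW per group
        __syncthreads();
        struct slot *s = &q->slots[idx % N_SLOTS];
        int ticket = (int)(idx / N_SLOTS);                     // which lap of the ring we own
        if (lane == 0) {
            while (atomicAdd(&s->writeTicket, 0) != ticket) ;  // wait for earlier producers
            while (atomicAdd(&s->full, 0) != 0) ;              // wait for a consumer to drain it
        }
        __syncthreads();
        s->payload[lane] = my_word;                            // lane-major, coalesced stores
        __threadfence_system();                                // publish payload before the flag
        __syncthreads();
        if (lane == 0) {
            atomicExch(&s->full, 1);                           // hand the slot to consumers
            atomicAdd(&s->writeTicket, 1);                     // let the next producer proceed
        }
    }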

GPMC Queue Synchronization (part 2)

Below is a queue with one slot and its per-slot state.

[Figure: GPU work-groups (WG0, WG1, WG2, WG4) and CPU consumer threads (t0, t1) contend for the single slot; the walk-through shows writeIdx and readIdx advancing, tickets being assigned (ticket=0, ticket=1), and the slot's full, numServed, writeTicket, and readTicket fields stepping through the hand-off.]