Switch Design - PowerPoint Presentation
Presentation Transcript

Slide 1

Switch Design: a unified view of micro-architecture and circuits

Giorgos Dimitrakopoulos

Electrical and Computer Engineering

Democritus University of Thrace (DUTH)

Xanthi, Greece

dimitrak@ee.duth.gr

http://utopia.duth.gr/~dimitrak

Slide 2

Algorithms-Applications

System abstraction

Processors for computation
Memories for storage
IO for connecting to the outside world
Network for communication and system integration

Switch Design - NoCs 2012

Operating System

Instruction Set Architecture

Microarchitecture

Register-Transfer Level

Logic design

Circuits

Devices

Network

Processors

Memory

IO

G. Dimitrakopoulos - DUTH

Slide 3

Logic, State and Memory

Datapath functions

Controlled by FSMs

Can be pipelined

Mapped on silicon chips
Gate-level netlist from a cell library
Cells built from transistors after custom layout
Memory macros store large chunks of data
Multi-ported register files for fast local storage and access of data

G. Dimitrakopoulos - DUTH

Slide 4

On-Chip Wires

Passive devices that connect transistors

Many layers of wiring on

a chip

Wire width and spacing depend on the metal layer
High-density local connections: Metal 1-5
Upper metal layers (> 6) are wider and used for less dense, low-delay global connections

Slide 5

Future of wires: 2.5D – 3D integration

Evolution

G. Dimitrakopoulos - DUTH

Slide 6

Optical wiring
Optical connections will be integrated on chip
Useful when the power of electrical connections will limit the available chip IO bandwidth
A balanced solution that involves both optical and electrical components will probably win

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 7

Let’s send a word on a chip

Sender and receiver on the same clock domain

Clock-domain crossing just adds latency

Any relation of the sender-receiver clocks is exploited

Mesochronous interface
Tightly coupled synchronizers

[AMD Zacate]

G. Dimitrakopoulos - DUTH

Slide 8

Point-to-point links: Flow control


Synchronous operation: data on every cycle
Sender can stall: data valid signal
Receiver can stall: stall (back-pressure) signal
Either can stall: valid and stall back-pressure

Partially decouple Sender and Receiver by adding a buffer at the receive side


G. Dimitrakopoulos - DUTH

Slide 9

Sender and Receiver decoupled by a buffer
The receiver accepts some of the sender's traffic even if the transmitted words are not consumed
When to stop? How is buffer overflow avoided?
Let's see first how to build a buffer

Clock-domain crossing can be tightly coupled within the buffer

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 10

Buffer organization

A FIFO container that maintains order of arrival

4 interfaces (full, empty, put, get)

Elastic: a cascade of depth-1 stages with internal full/empty signals
Shift register in / parallel out: put shifts all entries, get uses a tail pointer
Circular buffer: memory with head/tail pointers, wrap-around array implementation; storage can be register based
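To make the circular-buffer option concrete, here is a minimal behavioral sketch in Python (illustrative only; the class and method names are assumptions, not from the slides):

```python
class CircularFIFO:
    """Wrap-around array with head/tail pointers and the
    full/empty/put/get interface described above."""

    def __init__(self, depth):
        self.mem = [None] * depth   # register- or SRAM-based storage
        self.depth = depth
        self.head = 0               # next entry to read (get)
        self.tail = 0               # next entry to write (put)
        self.count = 0              # occupancy, distinguishes full from empty

    def full(self):
        return self.count == self.depth

    def empty(self):
        return self.count == 0

    def put(self, word):
        assert not self.full(), "backpressure should have stopped the sender"
        self.mem[self.tail] = word
        self.tail = (self.tail + 1) % self.depth  # wrap around
        self.count += 1

    def get(self):
        assert not self.empty()
        word = self.mem[self.head]
        self.head = (self.head + 1) % self.depth
        self.count -= 1
        return word
```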

G. Dimitrakopoulos - DUTH

Slide 11

Buffer implementation

The same basic structure evolves with extra read/write flexibility

Multiplexers and head/tail pointers handle data movement and addressing

Variants: elastic, circular array, shift-in/parallel-out

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 12

Link-level flow control: Backpressure

Link-level flow control provides a closed feedback loop to control the flow of data from a sender to a receiver

Explicit flow control (stall-go): the receiver notifies the sender when to stop/resume transmission
Implicit flow control (credits): the sender knows when to stop to avoid buffer overflow
For unreliable channels we need extra mechanisms for detecting and handling transmission errors

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 13

STALL-GO flow control
One STALL/GO signal is sent back to the sender
STALL=0 (GO) means that the sender is allowed to send
STALL=1 (STALL) means that the sender should stop

The sender changes its behavior the moment it detects a change to the backpressure signal

Data valid (not shown) is asserted when new data are available

Switch Design - NoCs 2012

Stall

G. Dimitrakopoulos - DUTH

Slide 14

STALL-GO flow control: example

Stall

In-flight words will be dropped or they will replace the ones that wait to be consumed
In either case data are lost
STALL and GO should be connected with the buffer availability of the receiver's queue
The example assumes that the receiver is stalled or released for other network reasons

G. Dimitrakopoulos - DUTH

Slide 15

STALL should be asserted early enough: do not drop words in flight; the timing of STALL assertion guarantees lossless operation
GO should be asserted late enough: have words ready to consume before new words arrive; correct timing guarantees high throughput
Minimum buffering for full throughput and lossless operation should cover both the STALL and GO reaction cycles

Switch Design - NoCs 2012

Buffering requirements of STALL&GO

If the required buffering is not available, the link remains idle

G. Dimitrakopoulos - DUTH

Slide 16

STALL&GO on pipelined and elastic links
Traffic is "blind" during a time interval of one round-trip time (RTT)

the source will only learn about the effects of its transmission RTT after this transmission has started

the (corrective) effects of a contention notification will only appear at the site of contention RTT after that occurrence

G. Dimitrakopoulos - DUTH

Slide 17

Credit-based flow control
The sender keeps track of the available buffer slots of the receiver
The number of available slots is called credits
The available credits are stored in a credit counter

If #credits > 0 sender is allowed to send a new word

Credits are decremented by 1 for each transmitted word

When one buffer slot is made free in the receive side, the sender is notified to increase the credit count

An example where the credit update signal is registered first at the receive side
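A small behavioral sketch of the sender-side credit counting described above (Python, illustrative; the names are assumptions):

```python
class CreditSender:
    """Sender side of credit-based flow control: transmit only while the
    credit counter is positive, decrement per word sent, increment per
    credit update returned by the receiver."""

    def __init__(self, initial_credits):
        self.credits = initial_credits  # equals the receiver's free buffer slots

    def can_send(self):
        return self.credits > 0

    def send_word(self):
        assert self.can_send()
        self.credits -= 1            # one downstream slot is now reserved

    def credit_update(self, n=1):
        self.credits += n            # the receiver freed n buffer slots
```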

G. Dimitrakopoulos - DUTH

Slide 18

Credit-based flow control: Example

0* means that the credit counter is incremented and decremented in the same cycle (the count stays at 0)
[Timing diagram: credit updates and available credits per cycle]

G. Dimitrakopoulos - DUTH

Slide 19

Credit-based flow control: Buffers and Throughput

G. Dimitrakopoulos - DUTH

Slide 20

Condition for 100% throughput
The number of registers that the data and the credits pass through defines the credit loop
100% throughput is guaranteed only when the number of available buffer slots at the receive side equals the registers of the credit loop
Changing the available number of credits can reconfigure the maximum throughput at runtime

Credit-based FC is lossless with any buffer size > 0
Stall-and-go FC requires at least one loop of extra buffer space compared to credit-based FC
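As a rough illustration of the condition above, a first-order model of sustained throughput versus available buffer slots (an assumption-laden sketch, not a formula taken from the slides):

```python
def sustained_throughput(buffer_slots, credit_loop_cycles):
    """Each credit needs credit_loop_cycles to make a round trip, so at most
    buffer_slots words can be in flight per loop; throughput saturates at 1."""
    return min(1.0, buffer_slots / credit_loop_cycles)

# Example: a 6-register credit loop needs 6 buffer slots for 100% throughput.
print(sustained_throughput(3, 6))  # 0.5 -> link utilized half of the time
print(sustained_throughput(6, 6))  # 1.0 -> full throughput
```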

Credit loop

G. Dimitrakopoulos - DUTH

Slide 21

Link-level flow control enhancements
Reservation-based flow control: separate control and data functions
Control links race ahead of the data to reserve resources

When data words arrive, they can proceed with little overhead

Speculative flow control

The sender can transmit cells even without sufficient credits

Speculative transmissions occur when no other word with available credits is eligible for transmission
The receiver drops an incoming cell if its buffer is full
For every dropped word a NACK is returned to the sender
Each cell remains stored at the sender until it is positively acknowledged
Each cell may be speculatively transmitted at most once
All retransmissions must be performed when credits are available
The sender consumes a credit for every cell sent, i.e., for speculative as well as credited transmissions

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 22

Send a large message (packet)

Send a long packet of 1 Kbit over a 32-bit-wide channel

Serialize the message into 32 words of 32 bits

Need 32 cycles for packet transmission

Each packet is transmitted word-by-word

When the output port is free, send the next word immediately

Old fashioned Store-and-forward required the entire packet to reach each node before initiating next transmission
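A back-of-the-envelope sketch of the serialization arithmetic and of the store-and-forward comparison (a first-order model that ignores router pipeline depth and contention; purely illustrative):

```python
import math

def num_flits(packet_bits, channel_bits):
    return math.ceil(packet_bits / channel_bits)

def store_and_forward_cycles(packet_bits, channel_bits, hops):
    # every node waits for the whole packet before forwarding it
    return num_flits(packet_bits, channel_bits) * hops

def cut_through_cycles(packet_bits, channel_bits, hops):
    # the head advances one hop per cycle; the body pipelines behind it
    return hops + num_flits(packet_bits, channel_bits) - 1

print(num_flits(1024, 32))                    # 32 words of 32 bits
print(store_and_forward_cycles(1024, 32, 4))  # 128 cycles over 4 hops
print(cut_through_cycles(1024, 32, 4))        # 35 cycles over 4 hops
```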

G. Dimitrakopoulos - DUTH

Slide 23

Buffer allocation policies
Each transmitted word needs a free downstream buffer slot
When the output of the downstream node is blocked, the buffer will hold the arriving words
How much free buffering is guaranteed before sending the first word of a packet?

Virtual Cut Through (VCT): the available buffer slots equal the words of the packet

Each blocked packet stays together and consumes the buffers of only one node

Wormhole: just a few buffer slots are enough
A packet inevitably occupies the buffers of more nodes
Nothing is lost, thanks to the flow control backpressure policy

Slide 24

VCT and Wormhole in graphics

G. Dimitrakopoulos - DUTH

Slide 25

Link sharing
The number of wires of the link does not increase
One word can be sent on each clock cycle
The channel should be shared
A multiplexer is needed at the output port of the sender

Switch Design - NoCs 2012

Each packet is sent un-interrupted

Wormhole, and VCT behave this way

Connection is locked for a packet until the tail of the packet passes the output port

G. Dimitrakopoulos - DUTH

Slide 26

Who drives the select signals?
The arbiter is responsible for selecting which packet will gain access to the output channel
A word is sent if buffer slots are available downstream
The arbiter receives requests from the inputs and grants only one of them
Decisions are based on some internal priority state

G. Dimitrakopoulos - DUTH

Slide 27

Arbitration for Wormhole and VCT
In wormhole and VCT the words of each packet are not mixed with the words of other packets
Arbitration is performed once per packet and the decision is locked at the output for the whole packet duration
Even if a packet is blocked downstream, the connection does not change until the tail of the packet leaves the output port

Buffer utilization managed by flow control mechanism

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 28

How can I place my buffers?

G. Dimitrakopoulos - DUTH

Slide 29

Let's add some complexity: Networks
A network of terminal nodes; each node can be a source or a sink
Multiple point-to-point links connected with switches
Parallel communication between components

Switch Design - NoCs 2012

Source/Sink

Terminal Node

Switch

G. Dimitrakopoulos - DUTH

Slide 30

Multiple input-output permutations should be supported
Contention should be resolved and non-winning inputs should be handled: buffered locally or deflected to the network
Separate flow control for each link
Each packet needs to know/compute the path to its destination

Switch Design - NoCs 2012

Complexity affects the switches

G. Dimitrakopoulos - DUTH

Slide 31

More than one terminal node can connect per switch
Concentration is good for bursty traffic
A local switch isolates local traffic from the main network

How are the terminal nodes connected to the switch?

G. Dimitrakopoulos - DUTH

Slide 32

Switch design: IO interface

Separate flow control per link

G. Dimitrakopoulos - DUTH

Slide 33

Switch design: One output port

per-output requests

Let’s reuse the circuit we already have for one output port

G. Dimitrakopoulos - DUTH

Slide 34

Switch Design - NoCs 2012

Move buffers to the inputs

Switch design: Input buffers

Data from input #1

Requests for output #0

G. Dimitrakopoulos - DUTH

Slide 35

Switch design: Complete output ports

How are the output requests computed?

G. Dimitrakopoulos - DUTH

Slide 36

Routing computation
Routing computation generates per-output requests
The header of the packet carries the requests for each intermediate node (source routing)
The requests are computed/retrieved based on the packet's destination (distributed routing)

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 37

Routing logic
Routing logic translates a global destination address to a local output port request
For example, to reach node X from node Y, a packet should use output port #2 of Y

A Lookup-table is enough for holding the request vector that corresponds to each destination
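A sketch of such a lookup-table routing stage (a Python dictionary standing in for the hardware LUT; the node names and port numbering besides the X / port #2 case are made up for the example):

```python
NUM_PORTS = 4  # e.g. output ports 0..3 of node Y

# One request vector (one-hot over the output ports) per destination node.
ROUTE_LUT = {
    "X": 0b0100,  # to reach node X, use output port #2 of Y (from the slide)
    "Z": 0b0001,  # hypothetical destination mapped to output port #0
}

def route(destination):
    """Translate a global destination address into per-output requests."""
    return ROUTE_LUT[destination]

assert route("X") == 1 << 2  # request only output port #2
```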

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 38

Switch building blocks

G. Dimitrakopoulos - DUTH

Slide 39

Running example of switch operation

Switches transfer packets

Packets are broken to flits

Head flit only knows packet’s destination

The number of wires of each link equals the number of bits in each flit

G. Dimitrakopoulos - DUTH

Slide 40

Buffer access
Buffer incoming packets per link
Read the destination of the head of each queue

G. Dimitrakopoulos - DUTH

Slide 41

Routing Computation / Request Generation
Compute output requests and drive the output arbiters

G. Dimitrakopoulos - DUTH

Slide 42

Arbitration-Multiplexer path setup
Arbitrate per output
The grant signals drive the output multiplexers and notify the inputs about the arbitration outcome

G. Dimitrakopoulos - DUTH

Slide 43

Switch traversal
The words marked H will leave the switch on the next clock edge provided they have at least one credit

G. Dimitrakopoulos - DUTH

Slide 44

Link traversal
Words going to a non-blocked output leave the switch

The grants of a blocked output (due to flow control) are lost

An output arbiter can also stall in case of blocked output

G. Dimitrakopoulos - DUTH

Slide 45

Head-of-Line blocking: performance limiter
The FIFO order of the input buffers limits the throughput of the switch

The flit is blocked by the Head-of-Line that lost arbitration

A memory throughput problem

G. Dimitrakopoulos - DUTH

Slide 46

Wormhole switch operation
The operations can fit in the same cycle or they can be pipelined

Extra registers are needed in the control path

Registers in the input/output ports already present

LT at the end involves a register write

Body/tail flits inherit the decisions taken by the head flits

Slide 47

Look-ahead routing
Routing computation is based only on the packet's destination
It can be performed in switch A and used in switch B
Look-ahead routing computation (LRC)

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 48

Look-ahead routing
The LRC is performed in parallel to SA
The LRC should be completed before the ST stage in the same switch
The head flit needs the output port requests for the next switch

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 49

Look-ahead routing details
The head flit of each packet carries the output port requests for the next switch together with the destination address

G. Dimitrakopoulos - DUTH

Slide 50

Low-latency organizations
Baseline

SA precedes ST (no speculation)

SA decoupled from ST

Predict or Speculate arbiter’s decisions

When the prediction is wrong, replay all the tasks (same as baseline)
Do in different phases: circuit switching
Arbitration and routing at the setup phase
At transmit only ST is needed since contention is already resolved
Bypass switches
Reduce latency under certain criteria
When bypass is not enabled, same as baseline

[Pipeline diagrams: orderings of the SA, LRC, ST, and LT stages for the baseline, setup/transmit (circuit switching), and bypass organizations]

Slide 51

Prediction-based ST: Hit
Idle state: output port X+ is selected and reserved by the predictor (the crossbar is reserved)
1st cycle: the incoming flit is transferred to X+ without RC and SA; RC is performed in parallel and the prediction turns out to be correct
2nd cycle: the next flit is transferred to X+ without RC and SA

Slide 52

Prediction-based ST: Miss
Idle state: output port X+ is selected and reserved
1st cycle: the incoming flit is transferred to X+ without RC and SA; RC is performed in parallel and the prediction is wrong (X- is the correct port), so a kill signal to X+ is asserted and a dead flit is produced
2nd/3rd cycle: the dead flit is removed and the flit is retransmitted to the correct port
On a miss the tasks are replayed as in the baseline case

Slide 53

Speculative ST
Assume contention doesn't happen
If correct, the flit is transferred directly to the output port without waiting for SA
In case of contention, replay SA; a cycle is wasted in the event of contention

Arbiter decides what will be sent on the next cycle

Switch Design - NoCs 2012

[Waveform example: flits A (port 0) and B (port 1) contend for the output; the speculatively forwarded data are valid only in the cycles where the corresponding grant wins]

Slide 54

XOR-based ST
Assume contention never happens
If correct, the flit is transferred directly to the output port
If not, bitwise-XOR all the competing flits and send the encoded result on the link

At the same time arbitrate and mask (set to 0) the winning input

Repeat on the next cycle

In the case of contention, the encoded outputs are resolved at the receiver
This can also be done at the output port of the switch

[Waveform example: without contention A traverses the switch unchanged; under contention the link carries A^B and the losing flit is recovered later]

Slide 55

XOR-based ST: Flit recovery
Works upon the simple XOR property: (A^B^C) ^ (B^C) = A
Always able to decode by XORing two sequential coded values

Performs similarly to speculative switches

Only head-flit collisions matter

Maintains previous router’s arbitration order
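A tiny demonstration of the recovery property, with made-up flit values:

```python
A, B, C = 0x1111, 0x2222, 0x4444   # three flits that collided

# What the link carries on successive cycles under contention:
coded = [A ^ B ^ C, B ^ C, C]

# The receiver XORs each coded value with the next one to peel off a flit.
recovered = [coded[i] ^ coded[i + 1] for i in range(len(coded) - 1)] + [coded[-1]]
assert recovered == [A, B, C]
```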

Switch Design - NoCs 2012

[Example: the coded flit buffer holds A^B^C, B^C, C; XORing consecutive entries recovers A and B]

Slide 56

Bypassing intermediate nodes
Switch bypassing criteria: frequently used paths, packets continually moving along the same dimension
Most techniques can bypass some pipeline stages only for specific packet transfers and traffic patterns

Not generic enough

Switch Design - NoCs 2012

[Diagram: virtual bypassing paths from SRC to DST; bypassed intermediate switches take 1 cycle instead of 3]

Slide 57

Circuit switching

Network traversal done in phases

Path reservation (multiple switch allocations) is done all at once

Switch traversal finds no contention
Data buffers are avoided
Part of the reserved but unutilized path is needlessly blocked

Slide 58

Speculation-free low-latency switches
Prediction and speculation drawbacks: on a mis-prediction (mis-speculation) the tasks must be replayed, and latency is not always saved since it depends on network conditions

Merged Switch allocation and Traversal (SAT)

Latency always saved – no speculation

Delay of SAT smaller than SA and ST in series

Slide 59

Arbitration and Multiplexing

Stop thinking arbitration and multiplexing separately

One new algorithm that fits every policy

Generic priority-based solution that works even when arbitration and multiplexing are done separately

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 60

Round-robin arbitration
Most commonly used
Start from the high-priority position and grant the first active request found when cyclically searching all requests

Granted input becomes lowest-priority for the next arbitration

Cyclic search found in many other algorithms
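A behavioral sketch of this round-robin policy (Python; the class name and interface are assumptions):

```python
class RoundRobinArbiter:
    def __init__(self, n):
        self.n = n
        self.hp = 0  # high-priority (search start) position

    def arbitrate(self, requests):
        """requests: list of n booleans. Returns the granted index or None.
        Searches cyclically from the high-priority position and makes the
        winner the lowest-priority input for the next arbitration."""
        for offset in range(self.n):
            i = (self.hp + offset) % self.n
            if requests[i]:
                self.hp = (i + 1) % self.n   # winner becomes lowest priority
                return i
        return None

arb = RoundRobinArbiter(4)
print(arb.arbitrate([True, False, True, False]))  # grants 0
print(arb.arbitrate([True, False, True, False]))  # grants 2 (0 is now lowest priority)
```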

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 61

Let's think out of the box
Transform each request and its priority bit into a 2-bit unsigned arithmetic symbol; the request is the MSB
Round-robin arbitration is equivalent to finding the maximum symbol that lies in the rightmost position
The cyclic search disappears
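A sketch of the same idea in Python: round-robin recast as picking the maximum 2-bit symbol. The tie among equal maxima is broken toward the lowest index here (the slides draw this as the rightmost position), and a hardware implementation would use a tree of MAX nodes instead of the linear scan below; both points are assumptions of this sketch.

```python
def rr_via_max_selection(requests, hp):
    """Each position i gets a 2-bit symbol {request, priority}: the request is
    the MSB and the priority bit is 1 for positions at or after the
    high-priority pointer hp. Granting the first maximum symbol is equivalent
    to the cyclic search of a round-robin arbiter."""
    n = len(requests)
    symbols = [2 * int(requests[i]) + int(i >= hp) for i in range(n)]
    best = max(symbols)
    if best < 2:                        # no request asserted
        return None, hp
    winner = symbols.index(best)        # first position holding the maximum
    return winner, (winner + 1) % n     # winner becomes lowest priority

print(rr_via_max_selection([True, False, False, True], hp=2))  # grants 3
print(rr_via_max_selection([True, True, False, False], hp=2))  # grants 0
```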

G. Dimitrakopoulos - DUTH

Slide 62

Working examples

Switch Design - NoCs 2012

Maximum selection is done via a tree structure

The rightmost maximum symbol always wins

Direction flags (L, R) always point to the direction of the winning input
The direction flags form the path to the winning input

G. Dimitrakopoulos - DUTH

Slide 63

Why not switch data in parallel?

G. Dimitrakopoulos - DUTH

Slide 64

Grant signals are produced simultaneously
When F=0 the maximum came from the right
When F=1 the maximum came from the left
One-hot, thermometer, or weighted-binary grant signals can be derived from the tree of MAX nodes

Switch Design - NoCs 2012

Direction flag F

G. Dimitrakopoulos - DUTH

Slide 65

Wormhole/VCT MARX-based switches

G. Dimitrakopoulos - DUTH

Slide 66

SRAM-based input buffers
Buffer reads and writes are treated as separate tasks
A buffer write always occurs after link traversal
Separate read and write ports are required for maximum performance

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 67

Speculative Buffer Read
Buffer read occurs after SA for head flits (no speculation)
Buffer read can occur in parallel to SA (speculation): the HOL head flit is read out before knowing if it received a grant
Once SA has finished, speculation is removed for the remaining flits

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 68

Pipelining and credits

Credit loop begins from upstream SA stage

Deep pipelining increases the buffering requirements for 100% throughput

Elastic pipeline stages that can stall independently can partially alleviate the problem

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 69

Bufferless switching
Assume there are no buffers
When a packet loses switch allocation it is either dropped or deflected to any free output
Deflection spreads contention in space (in the network)

Allocation solves contention at each time slot but spreads it in time (next time slots)

Deflection (or misrouting) can occur in buffered switches too

Rotary router

Slide 70

Slicing

Introduces hierarchy inside the switch

When traffic is concentrated to certain outputs the switch suffers high performance penalties

Intermediate buffers partially alleviate the loss

Dimension slicing

Port slicing

G. Dimitrakopoulos - DUTH

Slide 71

How can we increase throughput?

The green flow is blocked until the red one passes the switch; the physical channel is left idle

G. Dimitrakopoulos - DUTH

Slide 72

Decouple output port allocation from next-hop buffer allocation
Contention is present on the output links (crossbar output ports) and on the input ports of the crossbar
Contention is resolved by time-sharing the resources
Words of two packets are mixed on the same channel; the words travel on different virtual channels

Separate buffers at the end of the link guarantee no interference between the packets

Switch Design - NoCs 2012

Virtual Channels

G. Dimitrakopoulos - DUTH

Slide 73

Virtual channels
Virtual-channel support does not mean extra links
They act as extra street lanes: traffic on each lane is time-shared on a common channel
Provide dedicated buffer space for each virtual channel

Decouple channels from buffers

Interleave flits from different packets

“The Swiss Army Knife for Interconnection Networks”

Prevent deadlocks
Reduce head-of-line blocking
Provide QoS

G. Dimitrakopoulos - DUTH

Slide 74

Datapath of a VC-based switch
Separate buffer for each VC

Separate flow control signals (credits) for each VC

The radix of the crossbar can stay the same

Input VCs can share a common input port of the crossbar

On each cycle at most one VC will receive a new word

G. Dimitrakopoulos - DUTH

Slide 75

Per-packet operation of a VC-based switch
A switch connects input VCs to output VCs
Routing computation (RC) determines the output port and may restrict the output VCs that can be used
An input VC should first allocate an output VC; the allocation is performed by the VC allocator (VA)
RC and VA are done per packet on the head flits and inherited by the remaining flits of the packet

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 76

Per-flit operation of a VC-based switch
Flits with an allocated output VC fight for an output port
The output port is allocated by the switch allocator
The VCs of the same input share a common input port of the crossbar

Each input has multiple requests (equal to the #input VCs)

The flit leaves the switch provided that credits are available downstream

Credits are counted per output VC

Switch Design - NoCs 2012

Input VCs

Output VCs

G. Dimitrakopoulos - DUTH

Slide 77

Switch allocation
All VCs at a given input port share one crossbar input port
The switch allocator matches ready-to-go flits with crossbar time slots

Switch Design - NoCs 2012

Allocation performed on a cycle-by-cycle basis

N×V requests (input VCs), N resources (output ports)

At most one flit at each input port can be granted

At most one flit at each output port can leave

Other options need more crossbar ports (input-output speedup)

G. Dimitrakopoulos - DUTH

Slide 78

Switch allocation example

One request (arc) for each input VC

Example with 2 VCs per input

At most 2 arcs leaving each input

At most 2 requests per row in the request matrix

Matching:

Each grant must satisfy a request

Each requester gets at most one grant

Each resource is granted at most once

Switch Design - NoCs 2012

[Figures: bipartite request graph and request matrix for the example]

Slide 79

Separable allocation
Matchings have at most one grant per row and per column
Two phases of arbitration, column-wise and row-wise, performed in either order

Arbiters in each stage are independent

But the outcome of each one affects the quality of the overall match

Fast and cheap

Bad choices in the first phase can prevent the second stage from generating a good matching
Multiple iterations are required for a good match
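A one-iteration sketch of an input-first separable allocator (Python; random choice stands in for the per-input and per-output arbiters, which would typically be round-robin):

```python
import random

def separable_input_first(requests, num_outputs):
    """requests[i] is the set of output ports requested by input i.
    Phase 1 (per input): pick one request from each input.
    Phase 2 (per output): among the surviving requests, grant one per output."""
    # Phase 1: input arbitration
    surviving = {}
    for inp, outs in enumerate(requests):
        if outs:
            surviving[inp] = random.choice(sorted(outs))
    # Phase 2: output arbitration
    grants = {}
    for out in range(num_outputs):
        contenders = [inp for inp, o in surviving.items() if o == out]
        if contenders:
            grants[random.choice(contenders)] = out
    return grants  # {input: granted output}, at most one grant per row/column

# Example: 3 inputs contending for 3 outputs.
print(separable_input_first([{0, 1}, {0}, {2}], num_outputs=3))
```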

[Diagrams: input-first and output-first separable allocator structures]

G. Dimitrakopoulos - DUTH

Slide 80

Implementation

Switch Design - NoCs 2012

[Diagrams: output-first and input-first allocator implementations]

Slide 81

Multi-cycle separable allocators
Allocators produce better results if they run for many cycles

On each cycle they try to fill the input-output match with new edges

We don’t have the time to wait more than one cycle

Run two or more allocators in parallel and interleave their grants to the switch

Slide 82

Centralized allocator: wavefront allocation

Pick initial diagonal

Grant all requests on the current diagonal; they never conflict
For each grant, delete the requests in the same row and column
Repeat for the next diagonal
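A behavioral sketch of wavefront allocation over an n x n request matrix (Python; the starting diagonal is fixed here, whereas a real implementation would rotate it for fairness):

```python
def wavefront_allocate(request, n):
    """Grant every request on the current diagonal (cells of a diagonal never
    share a row or column), remove the granted rows/columns, then move on
    to the next diagonal."""
    free_row = [True] * n
    free_col = [True] * n
    grants = []
    for d in range(n):                      # sweep the n wrapped diagonals
        for row in range(n):
            col = (row + d) % n             # cell (row, col) lies on diagonal d
            if request[row][col] and free_row[row] and free_col[col]:
                grants.append((row, col))
                free_row[row] = False
                free_col[col] = False
    return grants

req = [[1, 1, 0],
       [0, 1, 1],
       [1, 0, 1]]
print(wavefront_allocate(req, 3))  # [(0, 0), (1, 1), (2, 2)]: the whole starting diagonal is granted at once
```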

G. Dimitrakopoulos - DUTH

Slide 83

Switch allocation for adaptive routing
Input VCs can request more than one output port, called the set of Admissible Output Ports (AOP)
This adds an extra selection step (not arbitration)

Selection mostly tries to load balance the traffic

Input-first allocation

For each input VC select one request of the AOP

Arbitrate locally per input and select one input VC
Arbitrate globally per output and select one VC from all fighting inputs
Output-first allocation
Send all requests of the AOP of each input VC to the outputs
Arbitrate globally per output and grant one request
Arbitrate locally per input and grant an input VC
For this input VC select one out of the possibly multiple grants of the AOP set

G. Dimitrakopoulos - DUTH

Slide 84

VC allocation
Virtual channels (VCs) allow multiple packet flows to share physical resources (buffers, channels)
Before a packet can proceed through the router, it needs to claim ownership of a VC buffer at the next router
The VC is acquired by the head flit and is inherited by the body and tail flits

VC allocator assigns waiting packets at inputs to output VC buffers that are not currently in use

N×V inputs (input VCs), N×V outputs (output VCs)

Once assigned, VC is used for entire packet’s duration in the switch

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 85

VC allocation example
Each input VC is matched to an output VC simultaneously with the rest, even if it belongs to the same input
There is no port constraint as in switch allocators

VC allocation refers to allocating buffer id (output VC) on the next router

Allocation can be both separable (2 arbitration steps) or centralized

Switch Design - NoCs 2012

[Figures: request and grant bipartite graphs between the input VCs (In#0-In#2, two VCs each) and the output VCs (Out#0-Out#2)]

Slide 86

Any-to-any flexibility in the VC allocator is unnecessary
Partition the set of VCs to restrict legal requests
Different use cases for VCs restrict the possible transitions:
The message class never changes
Resource classes are traversed in order
VCs within a packet class are functionally equivalent
We can take advantage of these properties to reduce VC allocator complexity!

Switch Design - NoCs 2012

Input – output VC mapping

G. Dimitrakopoulos - DUTH

Slide 87

VA: single-cycle or pipelined organization
Header flits see longer latency than body/tail flits
RC and VA decisions are taken for head flits and inherited by the rest of the packet
Every flit fights for SA

Can we parallelize SA and VA?

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH

Slide 88

The order of VC and switch allocation
VA first, SA follows: only packets with an allocated output VC fight for SA
VA and SA can be performed concurrently: speculate that waiting packets will successfully acquire a VC

Prioritize non-speculative requests over speculative ones

Speculation holds only for the head flits

(The body/tail flits always know their output VC)

Switch Design - NoCs 2012

VA   | SA   | Description
Win  | Win  | Everything OK! Leave the switch
Win  | Lose | Allocated a VC; retry SA (not speculative: high priority next cycle)
Lose | Win  | Does not know the output VC; allocated the output port (grant lost, output idle)
Lose | Lose | Retry both VA and SA

G. Dimitrakopoulos - DUTH

Slide 89

Speculative switch allocation
Perform switch allocation in parallel with VC allocation

Speculate that the latter will be successful

If so, saves delay, otherwise try again

Reduces zero-load latency, but adds complexity

Prioritize non-speculative requests
Avoid performance degradation due to mis-speculation
Usually implemented through a secondary switch allocator

But need to prioritize non-speculative grants

Switch Design - NoCs 2012

G. Dimitrakopoulos - DUTH89Slide90

Free list of VCs per output
A VC can be assigned non-speculatively after SA
A free list of output VCs exists at each output
The flit that was granted access to this output receives the first free VC before leaving the switch

If no VC is available, the output port allocation slot is missed

Flit retries for switch allocation

VCs are not unnecessarily occupied for flits that don’t win SA

Slide 91

VC buffer implementation

Static partitioning

Dynamic partitioning

G. Dimitrakopoulos - DUTH

Linked-list shared buffer implementation

Slide 92

VC-based switches with MARX units
Merged Switch Allocation and Traversal (SAT) can be applied to VC-based switches too
VA can be run before or in parallel to SAT

G. Dimitrakopoulos - DUTH

Switch Design - NoCs 2012

Slide 93

VC-based switches with MARX units: Datapath

G. Dimitrakopoulos - DUTH

Slide 94

NoC: The science & art of on-chip connections

Micro-architecture of Network-on-Chip Routers
Giorgos Dimitrakopoulos, Springer, mid 2013

ADVERTISEMENT

Slide 95

References (1)

W. J. Dally and B. Towles, "Route packets, not wires: On-chip interconnection networks," DAC 2001
A. Kumar, et al., "A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS," ICCD 2007
A. Kumar, et al., "Express virtual channels: towards the ideal interconnection fabric," ISCA 2007
H. Matsutani, et al., "Prediction router: A low-latency on-chip router architecture with multiple predictors," IEEE Trans. Computers, 2011
G. Michelogiannakis, J. Balfour, and W. Dally, "Elastic buffer flow control for on-chip networks," HPCA 2009
M. Hayenga and M. Lipasti, "The NoX Router," MICRO 2011
T. Moscibroda and O. Mutlu, "A case for bufferless routing in on-chip networks," ISCA 2009
R. Mullins, A. West, and S. Moore, "Low-latency virtual-channel routers for on-chip networks," ISCA 2004
L.-S. Peh and W. J. Dally, "A delay model and speculative architecture for pipelined routers," HPCA 2001
D. Wentzlaff, et al., "On-chip interconnection architecture of the Tile processor," IEEE Micro, 2007
Y. J. Yoon, et al., "Virtual channels vs. multiple physical networks," DAC 2010
M. Azimi, et al., "Flexible and adaptive on-chip interconnect for terascale architectures," Intel Technology Journal, 2009
A. Golander, et al., "A cost-efficient L1–L2 multicore interconnect: Performance, power, and area considerations," IEEE TCAS-I, 2011
P. Kumar, et al., "Exploring concentration and channel slicing in on-chip network router," HPCA 2009
M. Galles, "Spider: A high-speed network interconnect," IEEE Micro, 1997
A. S. Vaidya, et al., "LAPSES: A recipe for high performance adaptive router design," HPCA 1999
C. Batten, Interconnection Networks Course, Columbia University
M. Katevenis, Packet Switch Architectures Course, University of Crete, Greece
W. J. Dally, "Virtual-channel flow control," ISCA 1990
D. U. Becker and W. J. Dally, "Allocator implementations for network-on-chip routers," SC 2009
S. S. Mukherjee, et al., "A comparative study of arbitration algorithms for the Alpha 21364 pipelined router," ASPLOS 2002
Y. Tamir and H.-C. Chi, "Symmetric crossbar arbiters for VLSI communication switches," IEEE Trans. on Parallel and Distributed Systems, 1993
J. Hurt, et al., "Design and implementation of high-speed symmetric crossbar schedulers," ICC 1999
G. Ascia, et al., "Implementation and analysis of a new selection strategy for adaptive routing in networks-on-chip," IEEE Trans. on Computers, 2008
P. Salihundam, et al., "A 2Tb/s 6x4 mesh network with DVFS and 2.3Tb/s/W router in 45nm CMOS," Symp. VLSI Circuits, 2010
P. Gupta and N. McKeown, "Design and implementation of a fast crossbar scheduler," IEEE Micro, 1999
J. Flich and D. Bertozzi (editors), "Network on Chip in the Nanoscale Era," CRC Press, 2010

G. Dimitrakopoulos - DUTH

Slide 96

References (2)

L. Pirvu, et al., "The impact of link arbitration on switch performance," HPCA 1999
M. Coppola, et al., "Spidergon: A novel on-chip communication network," IEEE SOC 2004
W. Dally and C. Seitz, "Deadlock-free message routing in multiprocessor interconnection networks," IEEE Trans. on Computers, 1987
M. Karol, "Input vs output queueing on a space-division packet switch," IEEE Transactions on Communications, 1987
Zhonghai Lu, et al., "Evaluation of on-chip networks using deflection routing," GLSVLSI 2006
Zhonghai Lu, et al., "Layered switching for networks on chip," DAC 2007
R. Ginosar, "Metastability and synchronizers: A tutorial," IEEE Design & Test, Sept/Oct 2011
G. Dimitrakopoulos and D. Bertozzi, "Switch architecture," in J. Flich and D. Bertozzi (editors), "Network on Chip in the Nanoscale Era," CRC Press, 2010
G. Dimitrakopoulos, "Logic-level design of basic switch components," in J. Flich and D. Bertozzi (editors), "Network on Chip in the Nanoscale Era," CRC Press, 2010
G. Dimitrakopoulos and E. Kalligeros, "Dynamic-priority arbiter and multiplexer soft macros for on-chip network switches," DATE 2012
G. Dimitrakopoulos, E. Kalligeros, and K. Galanopoulos, "Merged switch allocation and traversal in network-on-chip switches," to appear in IEEE Transactions on Computers (available at IEEE Xplore preprints)
Se-Joong Lee, et al., "Packet-switched on-chip interconnection network for system-on-chip applications," IEEE TCAS II, 2005
Donghyun Kim, et al., "A reconfigurable crossbar switch with adaptive bandwidth control for networks-on-chip," ISCAS 2005
Anh Tran and Bevan Baas, "RoShaQ: High-performance on-chip router with shared queues," ICCD 2011
Anh Tran, et al., "A reconfigurable source-synchronous on-chip network for GALS many-core platforms," IEEE Trans. on CAD, 2010
B. Dally and B. Towles, "Interconnection Networks," Morgan Kaufmann, 2004
C. A. Nicopoulos, "ViChaR: A dynamic virtual channel regulator for network-on-chip routers," MICRO 2006
Clive Maxfield, "2D vs. 2.5D vs. 3D ICs 101," EE Times, Design, 2012
Mike Santarini, "2.5D ICs are more than a stepping stone to 3D ICs," EE Times, Design, 2012
Nathan Binkert, et al., "The role of optics in future high radix switch design," ISCA 2011
Eylon Caspi, "Design Automation for Streaming Systems," PhD Thesis, Berkeley, 2005
C. Minkenberg and M. Gusat, "Design and performance of speculative flow control for high-radix datacenter interconnect switches," JPDC 2009
L.-S. Peh and W. J. Dally, "Flit-reservation flow control," HPCA 1999
M. Gerla and L. Kleinrock, "Flow control: A comparative survey," IEEE Transactions on Communications, 1980

G. Dimitrakopoulos - DUTH

Switch Design - NoCs 2012