Slide1
Switch Design: a unified view of micro-architecture and circuits
Giorgos Dimitrakopoulos
Electrical and Computer Engineering
Democritus University of Thrace (DUTH)
Xanthi, Greece
dimitrak@ee.duth.gr
http://utopia.duth.gr/~dimitrak
Slide2
System abstraction
Processors for computation
Memories for storage
IO for connecting to the outside world
Network for communication and system integration
[Figure: abstraction stack (Algorithms-Applications, Operating System, Instruction Set Architecture, Microarchitecture, Register-Transfer Level, Logic design, Circuits, Devices) next to the system components (Network, Processors, Memory, IO)]
Switch Design - NoCs 2012
G. Dimitrakopoulos - DUTH
Slide3
Logic, State and Memory
Datapath functions
Controlled by FSMs
Can be pipelined
Mapped on silicon chips
Gate-level netlist from a cell library
Cells built from transistors after custom layout
Memory macros store large chunks of data
Multi-ported register files for fast local storage and access of data
Slide4
On-Chip Wires
Passive devices that connect transistors
Many layers of wiring on a chip
Wire width and spacing depend on the metal layer
High-density local connections use Metal 1-5
Upper metal layers (above Metal 6) are wider and used for less dense, low-delay global connections
Slide5
Future of wires: 2.5D – 3D integration
Evolution
Slide6
Optical wiring
Optical connections will be integrated on chip
Useful when the power of electrical connections limits the available chip IO bandwidth
A balanced solution that involves both optical and electrical components will probably win
Slide7
Let’s send a word on a chip
Sender and receiver on the same clock domain
Clock-domain crossing just adds latency
Any relation of the sender-receiver clocks is exploited
Mesochronous interface
Tightly coupled synchronizers
[AMD Zacate]
Slide8
Point-to-point links: Flow control
Synchronous operation: data on every cycle
Sender can stall: Data valid signal
Receiver can stall: Stall (back-pressure) signal
Either can stall: Valid and Stall back-pressure
Partially decouple Sender and Receiver by adding a buffer at the receive side
[Figure: sender (S) and receiver (R) connected by Data, Valid and Stall signals]
Slide9
Sender and Receiver decoupled by a buffer
Receiver accepts some of the sender’s traffic even if the transmitted words are not consumed
When to stop? How is buffer overflow avoided?
Let’s see first how to build a buffer
Clock-domain crossing can be tightly coupled within the buffer
Slide10
Buffer organization
A FIFO container that maintains order of arrival
4 interfaces (full, empty, put, get)
Elastic: cascade of depth-1 stages with internal full/empty signals
Shift register in/Parallel out: Put shifts all entries; Get uses the tail pointer
Circular buffer: memory with head/tail pointers; wrap-around array implementation; storage can be register based
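The circular-buffer organization can be sketched in software as follows. This is a minimal behavioral model, not the slide's circuit: the class name `CircularFifo` and the explicit `count` field (used to tell full from empty) are assumptions made for illustration.

```python
class CircularFifo:
    """Behavioral FIFO with the four interfaces: full, empty, put, get."""

    def __init__(self, depth):
        self.slots = [None] * depth  # storage array (register based in hardware)
        self.depth = depth
        self.head = 0    # index of the next word to read
        self.tail = 0    # index of the next free slot to write
        self.count = 0   # occupied slots; distinguishes full from empty

    def full(self):
        return self.count == self.depth

    def empty(self):
        return self.count == 0

    def put(self, word):
        if self.full():
            raise OverflowError("put on a full FIFO")
        self.slots[self.tail] = word
        self.tail = (self.tail + 1) % self.depth  # wrap around the array
        self.count += 1

    def get(self):
        if self.empty():
            raise IndexError("get on an empty FIFO")
        word = self.slots[self.head]
        self.head = (self.head + 1) % self.depth  # wrap around the array
        self.count -= 1
        return word
```

Words never move inside the array; only the two pointers advance, which is what separates this organization from the shift-register one.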
Slide11
Buffer implementation
The same basic structure evolves with extra read/write flexibility
Multiplexers and head/tail pointers handle data movement and addressing
[Figure: Elastic, Shift In/Parallel Out, and Circular array implementations]
Slide12
Link-level flow control: Backpressure
Link-level flow control provides a closed feedback loop to control the flow of data from a sender to a receiver
Explicit flow control (stall-go): receiver notifies the sender when to stop/resume transmission
Implicit flow control (credits): sender knows when to stop to avoid buffer overflow
For unreliable channels we need extra mechanisms for detecting and handling transmission errors
Slide13
STALL-GO flow control
One signal, STALL/GO, is sent back from the receiver to the sender
STALL=0 (GO) means that the sender is allowed to send
STALL=1 (STALL) means that the sender should stop
The sender changes its behavior the moment it detects a change in the backpressure signal
Data valid (not shown) is asserted when new data are available
Slide14
STALL-GO flow control: example
In-flight words will be dropped or they will replace the ones that wait to be consumed
In every case data are lost
STALL and GO should be connected with the buffer availability of the receiver’s queue
The example assumes that the receiver is stalled or released for other network reasons
Slide15
Buffering requirements of STALL&GO
STALL should be asserted early enough so that in-flight words are not dropped
Timing of STALL assertion guarantees lossless operation
GO should be asserted late enough: have words ready to consume before new words arrive
If not available the link remains idle
Correct timing guarantees high throughput
Minimum buffering for full throughput and lossless operation should cover both the STALL and GO reaction cycles
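The "assert STALL early enough" requirement can be checked with a small cycle-level model. This is a sketch under stated assumptions, not the slide's exact timing: the receiver never drains, data and STALL each take `link_delay` cycles to cross the link, and STALL is raised once the free slots shrink to `headroom`. The function and parameter names are hypothetical.

```python
from collections import deque

def dropped_words(buffer_slots, headroom, link_delay, cycles=100):
    """Count words lost at a receiver that never consumes anything."""
    data_pipe = deque([False] * link_delay)   # words travelling to the receiver
    stall_pipe = deque([False] * link_delay)  # STALL travelling back to the sender
    occupancy = 0
    dropped = 0
    for _ in range(cycles):
        # A word reaches the receiver; it is lost if no slot is free.
        if data_pipe.popleft():
            if occupancy < buffer_slots:
                occupancy += 1
            else:
                dropped += 1
        # Receiver raises STALL when the free slots shrink to the headroom.
        stall_now = (buffer_slots - occupancy) <= headroom
        # Sender reacts to the STALL value issued link_delay cycles earlier.
        stall_seen = stall_pipe.popleft()
        stall_pipe.append(stall_now)
        data_pipe.append(not stall_seen)      # send a word unless stalled
    return dropped
```

In this model, with a link delay of 2 cycles each way and 8 buffer slots, a headroom of 3 slots loses nothing while a headroom of 2 slots drops one word: the headroom has to absorb the words already in flight plus those sent before STALL reaches the sender.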
Slide16
STALL&GO on pipelined and elastic links
Traffic is “blind” during a time interval of one Round-Trip Time (RTT)
The source will only learn about the effects of its transmission one RTT after this transmission has started
The (corrective) effects of a contention notification will only appear at the site of contention one RTT after that occurrence
Slide17
Credit-based flow control
Sender keeps track of the available buffer slots of the receiver
The number of available slots is called credits
The available credits are stored in a credit counter
If #credits > 0 the sender is allowed to send a new word
Credits are decremented by 1 for each transmitted word
When one buffer slot is freed at the receive side, the sender is notified to increase the credit count
An example where the credit update signal is registered first at the receive side
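The sender-side bookkeeping described above fits in a few lines. This is a minimal sketch; the class and method names (`CreditSender`, `credit_update`) are assumptions chosen for illustration.

```python
class CreditSender:
    """Sender-side credit counter for one link."""

    def __init__(self, receiver_slots):
        # Initially every buffer slot of the receiver is free.
        self.credits = receiver_slots

    def can_send(self):
        return self.credits > 0

    def send(self):
        # One credit is consumed for each transmitted word.
        if not self.can_send():
            raise RuntimeError("no credits: sending would overflow the receiver")
        self.credits -= 1

    def credit_update(self):
        # The receiver freed one slot and notified the sender.
        self.credits += 1
```

Because the sender stops by itself at zero credits, no word is ever in flight without a reserved slot, which is why this scheme needs no STALL signal.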
Slide18
Credit-based flow control: Example
0* means that the credit counter is incremented and decremented in the same cycle (and stays at 0)
[Figure: timing of Credit Update and Available Credits]
Slide19
Credit-based flow control: Buffers and Throughput
Slide20
Condition for 100% throughput
The number of registers that the data and the credits pass through defines the credit loop
100% throughput is guaranteed only when the number of available buffer slots at the receive side is at least equal to the number of registers in the credit loop
Changing the available number of credits can reconfigure the maximum throughput at runtime
Credit-based FC is lossless with any buffer size > 0
Stall and Go FC requires at least one loop of extra buffer space compared with credit-based FC
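The relation between buffer slots and the credit loop can be illustrated with a cycle-level sketch. The model is an assumption made for illustration: the receiver consumes every word on arrival, data take `fwd_delay` cycles, credits take `credit_delay` cycles (both at least 1), and a returning credit is usable in the cycle it arrives, so the loop is `fwd_delay + credit_delay` cycles long.

```python
from collections import deque

def credit_throughput(buffer_slots, fwd_delay, credit_delay, cycles=1000):
    """Fraction of cycles in which a word is delivered to the receiver."""
    credits = buffer_slots
    data_pipe = deque([0] * fwd_delay)        # words in flight (0 or 1 per stage)
    credit_pipe = deque([0] * credit_delay)   # credits in flight back to the sender
    delivered = 0
    for _ in range(cycles):
        credits += credit_pipe.popleft()      # a credit returns to the sender
        arrived = data_pipe.popleft()         # a word reaches the receiver
        delivered += arrived
        credit_pipe.append(arrived)           # slot freed at once, credit sent back
        if credits > 0:                       # sender transmits only with a credit
            credits -= 1
            data_pipe.append(1)
        else:
            data_pipe.append(0)
    return delivered / cycles
```

With a loop of 4 cycles, 4 buffer slots keep the link busy on every cycle after start-up, while 2 slots give roughly half the throughput: the sender runs out of credits and idles until the first ones return.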
Slide21
Link-level flow control enhancements
Reservation-based flow control
Separate control and data functions
Control links race ahead of the data to reserve resources
When data words arrive, they can proceed with little overhead
Speculative flow control
The sender can transmit cells even without sufficient credits
Speculative transmissions occur when no other word with available credits is eligible for transmission
The receiver drops an incoming cell if its buffer is full
For every dropped word a NACK is returned to the sender
Each cell remains stored at the sender until it is positively acknowledged
Each cell may be speculatively transmitted at most once
All retransmissions must be performed when credits are available
The sender consumes credit for every cell sent, i.e., for speculative as well as credited transmissions
Slide22
Send a large message (packet)
Send a long packet of 1 Kbit over a 32-bit-wide channel
Serialize the message to 32 words of 32 bits
Need 32 cycles for packet transmission
Each packet is transmitted word-by-word
When the output port is free, send the next word immediately
Old-fashioned store-and-forward required the entire packet to reach each node before initiating the next transmission
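The serialization arithmetic and the cost of store-and-forward can be worked out with a short sketch. It assumes 1 Kbit means 1024 bits, one cycle per word per hop, and no contention; the function names are hypothetical.

```python
import math

def words_per_packet(packet_bits, link_bits):
    """Number of link-wide words needed to carry the packet."""
    return math.ceil(packet_bits / link_bits)

def store_and_forward_cycles(words, hops):
    # Every node waits for the whole packet before forwarding it.
    return words * hops

def cut_through_cycles(words, hops):
    # The head word advances one hop per cycle; the rest follow in a pipeline.
    return hops + words - 1
```

For a 1024-bit packet on a 32-bit link this gives 32 words; over 3 hops, store-and-forward needs 96 cycles while word-by-word forwarding needs 34.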
Slide23
Buffer allocation policies
Each transmitted word needs a free downstream buffer slot
When the output of the downstream node is blocked the buffer will hold the arriving words
How much free buffering is guaranteed before sending the first word of a packet?
Virtual Cut Through (VCT): the available buffer slots equal the words of the packet
Each blocked packet stays together and consumes the buffers of only one node
Wormhole: just a few buffer slots are enough
Packet inevitably occupies the buffers of more nodes
Nothing is lost due to the flow control backpressure policy
Slide24
VCT and Wormhole in graphics
Slide25
Link sharing
The number of wires of the link does not increase
One word can be sent on each clock cycle
The channel should be shared
A multiplexer is needed at the output port of the sender
Each packet is sent uninterrupted
Wormhole and VCT behave this way
Connection is locked for a packet until the tail of the packet passes the output port
Slide26
Who drives the select signals?
The arbiter is responsible for selecting which packet will gain access to the output channel
A word is sent if buffer slots are available downstream
It receives requests from the inputs and grants only one of them
Decisions are based on some internal priority state
Slide27
Arbitration for Wormhole and VCT
In wormhole and VCT the words of each packet are not mixed with the words of other packets
Arbitration is performed once per packet and the decision is locked at the output for the whole packet duration
Even if a packet is blocked downstream the connection does not change until the tail of the packet leaves the output port
Buffer utilization is managed by the flow control mechanism
Slide28
How can I place my buffers?
Slide29
Let’s add some complexity: Networks
A network of terminal nodes
Each node can be a source or a sink
Multiple point-to-point links connected with switches
Parallel communication between components
[Figure: terminal nodes (source/sink) connected through switches]
Slide30
Complexity affects the switches
Multiple input-output permutations should be supported
Contention should be resolved and non-winning inputs should be handled: buffered locally or deflected to the network
Separate flow control for each link
Each packet needs to know/compute the path to its destination
Slide31
How are the terminal nodes connected to the switch?
More than one terminal node can connect per switch
Concentration is good for bursty traffic
Local switch isolates local traffic from the main network
Slide32
Switch design: IO interface
Separate flow control per link
Slide33
Switch design: One output port
per-output requests
Let’s reuse the circuit we already have for one output port
Slide34
Switch design: Input buffers
Move buffers to the inputs
[Figure labels: Data from input #1, Requests for output #0]
Slide35
Switch design: Complete output ports
How are the output requests computed?
Slide36
Routing computation
Routing computation generates per-output requests
The header of the packet carries the requests for each intermediate node (source routing)
The requests are computed/retrieved based on the packet’s destination (distributed routing)
Slide37
Routing logic
Routing logic translates a global destination address to a local output port request
To reach node X from node Y, the packet should use output port #2 of Y
A lookup table is enough for holding the request vector that corresponds to each destination
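A table-driven routing function can be sketched as below. The table contents, the port count, and the names `ROUTING_TABLE`, `NUM_PORTS` and `request_vector` are all hypothetical, chosen only to show the lookup of a one-hot request vector per destination.

```python
# Hypothetical routing table of one switch: destination node id -> output port.
ROUTING_TABLE = {0: 1, 1: 1, 2: 2, 3: 0}
NUM_PORTS = 3

def request_vector(destination):
    """Return the one-hot per-output request vector for this destination."""
    port = ROUTING_TABLE[destination]
    return [1 if p == port else 0 for p in range(NUM_PORTS)]
```

With this table a packet heading to node 2 requests output port #2 only, so its request vector is `[0, 0, 1]`.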
Slide38
Switch building blocks
Slide39
Running example of switch operation
Switches transfer packets
Packets are broken into flits
Only the head flit knows the packet’s destination
The number of wires of each link equals the number of bits of each flit
Slide40
Buffer access
Buffer incoming packets per link
Read the destination of the head of each queue
Slide41
Routing Computation/Request Generation
Compute output requests and drive the output arbiters
Slide42
Arbitration-Multiplexer path setup
Arbitrate per output
The grant signals drive the output multiplexers and notify the inputs about the arbitration outcome
Slide43
Switch traversal
Words H will leave the switch on the next clock edge provided they have at least one credit
Slide44
Link traversal
Words going to a non-blocked output leave the switch
The grants of a blocked output (due to flow control) are lost
An output arbiter can also stall in case of blocked output
Slide45
Head-Of-Line blocking: performance limiter
The FIFO order of the input buffers limits the throughput of the switch
The flit is blocked by the Head-of-Line flit that lost arbitration
A memory throughput problem
Slide46
Wormhole switch operation
The operations can fit in the same cycle or they can be pipelined
Extra registers are needed in the control path
Registers in the input/output ports are already present
LT at the end involves a register write
Body/tail flits inherit the decisions taken by the head flits
Slide47
Look-ahead routing
Routing computation is based only on the packet’s destination
Can be performed in switch A and used in switch B
Look-ahead routing computation (LRC)
Slide48
Look-ahead routing
The LRC is performed in parallel to SA
LRC should be completed before the ST stage in the same switch
The head flit needs the output port requests for the next switch
Slide49
Look-ahead routing details
The head flit of each packet carries the output port requests for the next switch together with the destination address
Slide50
Low-latency organizations
Baseline: SA precedes ST (no speculation)
SA decoupled from ST: predict or speculate the arbiter’s decisions
When the prediction is wrong, replay all the tasks (same as baseline)
Do in different phases: circuit switching
Arbitration and routing at the setup phase
At transmit only ST is needed since contention is already resolved
Bypass switches: reduce latency under certain criteria
When bypass is not enabled, same as baseline
[Figure: pipeline diagrams of the LRC, SA, ST and LT stages for each organization, including the Setup and Xmit phases of circuit switching]
Slide51
Prediction-based ST: Hit
Idle state: output port X+ is selected and reserved; the crossbar is reserved
1st cycle: incoming flit is transferred to X+ without RC and SA
1st cycle: RC is performed; the prediction is correct!
2nd cycle: next flit is transferred to X+ without RC and SA
[Figure: buffer, predictor, arbiter and crossbar with ports X+, X-, Y+, Y-]
Slide52
Prediction-based ST: Miss
Idle state: output port X+ is selected and reserved
1st cycle: incoming flit is transferred to X+ without RC and SA
1st cycle: RC is performed; the prediction is wrong! (X- is correct)
Kill signal to X+ is asserted
2nd/3rd cycle: dead flit is removed; retransmission to the correct port
@Miss: tasks replayed as in the baseline case
[Figure: buffer, predictor, arbiter and crossbar with ports X+, X-, Y+, Y-; the dead flit on X+ is killed]
Slide53
Speculative ST
Assume contention doesn’t happen
If correct then the flit is transferred directly to the output port without waiting for SA
In case of contention replay SA
Wasted cycle in the event of contention
Arbiter decides what will be sent on the next cycle
[Figure: switch fabric and control, with a timing diagram (clk, port 0, port 1, grant, valid out, data out) over cycles 0-4 for colliding flits A and B]
Slide54
XOR-based ST
Assume contention never happens
If correct then the flit is transferred directly to the output port
If not then bitwise-XOR all the competing flits and send the encoded result to the link
At the same time arbitrate and mask (set to 0) the winning input
Repeat on the next cycle
Encoded outputs (due to contention) are resolved at the receiver
Can be done at the output port of the switch too
[Figure: switch fabric and control, with a timing diagram (clk, port 0, port 1, grant, valid out, data out) over cycles 0-4 for the no-contention and contention cases]
Slide55
XOR-based ST: Flit recovery
Works upon a simple XOR property: (A^B^C) ^ (B^C) = A
Always able to decode by XORing two sequential values
Performs similarly to speculative switches
Only head-flit collisions matter
Maintains previous router’s arbitration order
[Figure: coded flit buffer decoding the sequence A^B^C, B^C, C into A, B, C]
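The recovery property can be tried out with integers standing in for flits. This is a sketch of the encoding and decoding steps only; the function names are hypothetical and the model assumes the colliding flits are listed in the order in which the arbiter grants them.

```python
def encode_collision(flits_in_grant_order):
    """Each cycle sends the XOR of all flits that have not been granted yet."""
    coded = []
    remaining = list(flits_in_grant_order)
    while remaining:
        word = 0
        for flit in remaining:
            word ^= flit
        coded.append(word)
        remaining.pop(0)  # mask the input that won arbitration in this cycle
    return coded

def decode_collision(coded):
    """Recover each flit by XORing two sequential coded values."""
    flits = []
    for k, word in enumerate(coded):
        # The last value is a plain flit, so it is XORed with 0.
        following = coded[k + 1] if k + 1 < len(coded) else 0
        flits.append(word ^ following)
    return flits
```

For three colliding flits A, B, C the link carries A^B^C, B^C, C, and the decoder returns A, B, C in the same order as the arbitration.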
Slide56
Bypassing intermediate nodes
Switch bypassing criteria: frequently used paths; packets continually moving along the same dimension
Most techniques can bypass some pipeline stages only for specific packet transfers and traffic patterns
Not generic enough
[Figure: virtual bypassing paths between SRC and DST; regular switches take 3 cycles, bypassed switches take 1 cycle]
Slide57
Circuit switching
Network traversal done in phases
Path reservation (multiple switch allocations) is done all at once
Switch traversal finds no contention
Data buffers are avoided
Part of the reserved and unutilized path is needlessly blocked
Slide58
Speculation-free low-latency switches
Prediction and speculation drawbacks
On misprediction (mis-speculation) the tasks should be replayed
Latency is not always saved; it depends on network conditions
Merged Switch allocation and Traversal (SAT)
Latency always saved, no speculation
Delay of SAT is smaller than that of SA and ST in series
Slide59
Arbitration and Multiplexing
Stop thinking of arbitration and multiplexing separately
One new algorithm that fits every policy
Generic priority-based solution that works even when arbitration and multiplexing are done separately
Slide60
Round-robin arbitration
Most commonly used
Start from the high-priority position and grant the first active request found after cyclically searching all requests
Granted input becomes lowest-priority for the next arbitration
Cyclic search found in many other algorithms
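The cyclic search can be written down directly. This is a behavioral sketch; the function name and the choice to return the updated priority pointer together with the grant are assumptions for illustration.

```python
def round_robin_arbiter(requests, pointer):
    """Grant the first active request found in a cyclic search from `pointer`.

    Returns (granted index or None, pointer for the next arbitration).
    """
    n = len(requests)
    for offset in range(n):
        i = (pointer + offset) % n
        if requests[i]:
            # The winner becomes lowest priority: the next search starts after it.
            return i, (i + 1) % n
    return None, pointer  # no request: the priority state is unchanged
```

For requests `[1, 0, 1, 0]` and a pointer at position 1, input 2 is granted and the pointer moves to position 3.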
Slide61
Let’s think out of the box
Transform each request and priority bit to a 2-bit unsigned arithmetic symbol
The request is the most significant bit
Round-robin arbitration is equivalent to finding the maximum symbol that lies in the rightmost position
Cyclic search disappears
Slide62
Working examples
Maximum selection is done via a tree structure
The rightmost maximum symbol always wins
Direction flags (L, R) always point to the direction of the winning input
Direction flags form the path to the winning input
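The equivalence between round-robin arbitration and rightmost-maximum selection can be checked with a small sketch. Two conventions here are assumptions, not taken from the slides: index 0 is treated as the rightmost position, and the priority bits form a thermometer code that is 1 for every position at or above the priority pointer. The function names are hypothetical.

```python
def cyclic_search(requests, pointer):
    """Reference round-robin: first active request at or after pointer, cyclically."""
    n = len(requests)
    for offset in range(n):
        i = (pointer + offset) % n
        if requests[i]:
            return i
    return None

def max_symbol_arbiter(requests, pointer):
    """Round-robin grant computed as the rightmost maximum 2-bit symbol."""
    n = len(requests)
    # Symbol = 2*request + priority bit, so the request is the MSB.
    symbols = [2 * requests[i] + (1 if i >= pointer else 0) for i in range(n)]

    def max_node(lo, hi):
        # Returns (maximum symbol, index of its rightmost occurrence) in [lo, hi).
        if hi - lo == 1:
            return symbols[lo], lo
        mid = (lo + hi) // 2
        right = max_node(lo, mid)   # lower indices form the right subtree
        left = max_node(mid, hi)
        # On a tie the right subtree wins, so the rightmost maximum is kept.
        return left if left[0] > right[0] else right

    best, index = max_node(0, n)
    return index if best >= 2 else None  # symbols 0 and 1 carry no request
```

The comparison at each tree node plays the role of the direction flag: it records whether the maximum came from the left or the right subtree, and the sequence of those choices is the path to the winning input.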
Slide63
Why not switch data in parallel?
Slide64
Grant signals produced simultaneously
Direction flag F
When F=0 the maximum came from the Right
When F=1 the maximum came from the Left
One-hot, thermometer, and weighted-binary grant signals can be derived from the tree of MAX nodes
Slide65
Wormhole/VCT MARX-based switches
Slide66
SRAM-based input buffers
Buffer reads and writes are treated as separate tasks
Buffer write always occurs after link traversal
A separate read and write port is required for maximum performance
Slide67
Speculative Buffer Read
Buffer read occurs after SA for head flits (no speculation)
Buffer read can occur in parallel to SA (speculation)
The HOL head flit is read out before knowing if it received a grant
Once SA has finished, speculation is removed for the remaining flits
Slide68
Pipelining and credits
Credit loop begins from the upstream SA stage
Deep pipelining increases the buffering requirements for 100% throughput
Elastic pipeline stages that can stall independently can partially alleviate the problem
Slide69
Bufferless
Assume there are no buffers
When a packet loses switch allocation it is either dropped or deflected to any free output
Deflection spreads contention in space (in the network)
Allocation solves contention at each time slot but spreads it in time (next time slots)
Deflection (or misrouting) can occur in buffered switches too
Rotary router
Slide70
Slicing
Introduces hierarchy inside the switch
When traffic is concentrated to certain outputs the switch suffers high performance penalties
Intermediate buffers partially alleviate the loss
Dimension slicing
Port slicing
Slide71
How can we increase throughput?
Green flow is blocked until red passes the switch. Physical channel left idle
Slide72
Virtual Channels
Decouple output port allocation from next-hop buffer allocation
Contention present on: output links (crossbar output port); input port of the crossbar
Contention is resolved by time sharing the resources
Mixing words of two packets on the same channel
The words are on different virtual channels
Separate buffers at the end of the link guarantee no interference between the packets
Slide73
Virtual channels
Virtual-channel support does not mean extra links
They act as extra street lanes
Traffic on each lane is time shared on a common channel
Provide dedicated buffer space for each virtual channel
Decouple channels from buffers
Interleave flits from different packets
“The Swiss Army Knife for Interconnection Networks”
Prevent deadlocks
Reduce head-of-line blocking
Provide QoS
Slide74
Datapath of a VC-based switch
Separate buffer for each VC
Separate flow control signals (credits) for each VC
The radix of the crossbar can stay the same
Input VCs can share a common input port of the crossbar
On each cycle at most one VC will receive a new word
Slide75
Per-packet operation of a VC-based switch
A switch connects input VCs to output VCs
Routing computation (RC) determines the output port
May restrict the output VCs that can be used
An input VC should first allocate an output VC
Allocation is performed by the VC allocator (VA)
RC and VA are done per packet on the head flits and inherited by the remaining flits of the packet
[Figure: input VCs and output VCs]
Slide76
Per-flit operation of a VC-based switch
Flits with an allocated output VC fight for an output port
Output port allocated by the switch allocator
The VCs of the same input share a common input port of the crossbar
Each input has multiple requests (equal to the number of input VCs)
The flit leaves the switch provided that credits are available downstream
Credits are counted per output VC
[Figure: input VCs and output VCs]
Slide77
Switch allocation
All VCs at a given input port share one crossbar input port
Switch allocator matches ready-to-go flits with crossbar time slots
Allocation performed on a cycle-by-cycle basis
N×V requests (input VCs), N resources (output ports)
At most one flit at each input port can be granted
At most one flit at each output port can leave
Other options need more crossbar ports (input-output speedup)
Slide78
Switch allocation example
One request (arc) for each input VC
Example with 2 VCs per input
At most 2 arcs leaving each input
At most 2 requests per row in the request matrix
Matching:
Each grant must satisfy a request
Each requester gets at most one grant
Each resource is granted at most once
[Figure: bipartite graph and request matrix for inputs 0-2 and outputs 0-2]
Slide79
Separable allocation
Matchings have at most one grant per row and per column
Two phases of arbitration: column-wise and row-wise
Perform in either order (input-first or output-first)
Arbiters in each stage are independent
But the outcome of each one affects the quality of the overall match
Fast and cheap
Bad choices in the first phase can prevent the second phase from generating a good matching
Multiple iterations required for a good match
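An input-first separable allocator can be sketched with two rounds of independent arbiters. This is an illustration only: the helper `rr_pick`, the fixed round-robin pointers (never updated here), and the request-matrix layout `request_matrix[input][output]` are assumptions.

```python
def rr_pick(requests, pointer):
    """First active index at or after pointer, searched cyclically, or None."""
    n = len(requests)
    for offset in range(n):
        i = (pointer + offset) % n
        if requests[i]:
            return i
    return None

def input_first_allocate(request_matrix, in_ptr, out_ptr):
    """Return the (input, output) grants of one input-first allocation pass."""
    n_in = len(request_matrix)
    n_out = len(request_matrix[0])
    # Phase 1 (row-wise): every input keeps only one of its requests.
    chosen = [rr_pick(request_matrix[i], in_ptr[i]) for i in range(n_in)]
    # Phase 2 (column-wise): every output grants one of the inputs that chose it.
    grants = []
    for o in range(n_out):
        contenders = [chosen[i] == o for i in range(n_in)]
        winner = rr_pick(contenders, out_ptr[o])
        if winner is not None:
            grants.append((winner, o))
    return grants
```

With requests `[[1, 1, 0], [1, 0, 0], [0, 0, 1]]` and all pointers at 0, inputs 0 and 1 both pick output 0 in the first phase, so only two grants come out although a three-grant matching exists. Starting input 0's pointer at position 1 yields all three grants, which shows how first-phase choices limit the final match.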
Slide80
Implementation
Output first allocation
Input first allocation
Slide81
Multi-cycle separable allocators
Allocators produce better results if they run for many cycles
On each cycle they try to fill the input-output match with new edges
We don’t have the time to wait more than one cycle
Run two or more allocators in parallel and interleave their grants to the switch
Slide82
Centralized allocator
Wavefront allocation
Pick initial diagonal
Grant all requests on each diagonal
Never conflict!
For each grant, delete requests in same row, column
Repeat for next diagonal
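The diagonal sweep can be sketched sequentially as below. This is a behavioral illustration under assumptions: the request matrix is square, diagonals wrap around so that each one touches every row and column exactly once, and the function name and diagonal orientation are chosen here, not taken from the slides.

```python
def wavefront_allocate(request_matrix, start_diagonal=0):
    """Return (input, output) grants found by sweeping wrapped diagonals."""
    n = len(request_matrix)          # square N x N request matrix assumed
    row_free = [True] * n            # inputs that have no grant yet
    col_free = [True] * n            # outputs that have no grant yet
    grants = []
    for d in range(n):
        diag = (start_diagonal + d) % n
        for i in range(n):
            o = (diag + i) % n       # cells of one diagonal never share a row or column
            if request_matrix[i][o] and row_free[i] and col_free[o]:
                grants.append((i, o))
                # Deleting the requests of this row and column is modelled
                # by marking both as taken.
                row_free[i] = False
                col_free[o] = False
    return grants
```

For requests `[[1, 1, 0], [1, 0, 0], [0, 0, 1]]`, starting from diagonal 0 gives two grants, while starting from diagonal 1 gives three, so the choice of the initial diagonal affects both the matching and fairness.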
Slide83
Switch allocation for adaptive routing
Input VCs can request more than one output port
Called the set of Admissible Output Ports (AOP)
This adds an extra selection step (not arbitration)
Selection mostly tries to load balance the traffic
Input-first allocation
For each input VC select one request of the AOP
Arbitrate locally per input and select one input VC
Arbitrate globally per output and select one VC from all fighting inputs
Output-first allocation
Send all requests of the AOP of each input VC to the outputs
Arbitrate globally per output and grant one request
Arbitrate locally per input and grant an input VC
For this input VC select one out of the possibly multiple grants of the AOP set
Slide84
VC allocation
Virtual channels (VCs) allow multiple packet flows to share physical resources (buffers, channels)
Before packets can proceed through the router, they need to claim ownership of a VC buffer at the next router
The VC acquired by the head flit is inherited by body and tail flits
VC allocator assigns waiting packets at inputs to output VC buffers that are not currently in use
N×V inputs (input VCs), N×V outputs (output VCs)
Once assigned, the VC is used for the entire packet’s duration in the switch
[Figure: input VCs and output VCs]
Slide85
VC allocation example
An input VC can be matched to an output VC simultaneously with the rest
Even if it belongs to the same input
No port constraint as in switch allocators
VC allocation refers to allocating a buffer id (output VC) on the next router
Allocation can be either separable (2 arbitration steps) or centralized
[Figure: requests and grants between input VCs 0-5 (In#0 to In#2) and output VCs 0-5 (Out#0 to Out#2)]
Slide86
Input-output VC mapping
Any-to-any flexibility in the VC allocator is unnecessary
Partition the set of VCs to restrict legal requests
Different use cases for VCs restrict possible transitions:
Message class never changes
Resource classes are traversed in order
VCs within a packet class are functionally equivalent
Can take advantage of these properties to reduce VC allocator complexity!
Slide87
VA single cycle or pipelined organization
Header flits see longer latency than body/tail flits
RC, VA decisions taken for head flits and inherited by the rest of the packet
Every flit fights for SA
Can we parallelize SA and VA?
Slide88
The order of VC and switch allocation
VA first, SA follows
Only packets with an allocated output VC fight for SA
VA and SA can be performed concurrently:
Speculate that waiting packets will successfully acquire a VC
Prioritize non-speculative requests over speculative ones
Speculation holds only for the head flits
(The body/tail flits always know their output VC)
VA | SA | Description
Win | Win | Everything OK! Leave the switch
Win | Lose | Allocated a VC. Retry SA (not speculative, high priority next cycle)
Lose | Win | Does not know the output VC. Allocated output port (grant lost, output idle)
Lose | Lose | Retry both VA and SA
Slide89
Speculative switch allocation
Perform switch allocation in parallel with VC allocation
Speculate that the latter will be successful
If so, saves delay, otherwise try again
Reduces zero-load latency, but adds complexity
Prioritize non-speculative requests
Avoid performance degradation due to mis-speculation
Usually implemented through a secondary switch allocator
But need to prioritize non-speculative grants
Slide90
Free list of VCs per output
Can assign a VC non-speculatively after SA
A free list of output VCs exists at each output
The flit that was granted access to this output receives the first free VC before leaving the switch
If no VC is available the output port allocation slot is missed
Flit retries for switch allocation
VCs are not unnecessarily occupied for flits that don’t win SA
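A per-output free list can be modelled as a simple queue of VC identifiers. This is a sketch only: the class and method names are hypothetical, and the point at which a VC is released (here, an explicit `release` call once the packet has left the downstream buffer) is an assumption that the slides do not spell out.

```python
from collections import deque

class OutputVcFreeList:
    """Free output VCs of one output port."""

    def __init__(self, num_vcs):
        self.free = deque(range(num_vcs))   # initially every output VC is free

    def allocate(self):
        """Give the first free VC to the head flit that won SA, or None if none is free."""
        return self.free.popleft() if self.free else None

    def release(self, vc):
        """Return a VC to the list (assumed to happen when its packet has drained)."""
        self.free.append(vc)
```

A `None` result corresponds to the missed allocation slot: the flit keeps no VC and simply retries switch allocation on a later cycle.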
Switch Design - NoCs 2012G. Dimitrakopoulos - DUTH90Slide91
VC buffer implementationSwitch Design - NoCs 2012
Static partitioning
Dynamic partitioning
G. Dimitrakopoulos - DUTH
91
Linked-List Shared Buffer ImplementationSlide92
VC-based switches with MARX unitsMerged Switch Allocation and Traversal can be applied to VC-based switches tooVA can be run before or in parallel to SAT
G. Dimitrakopoulos - DUTH
Switch Design - NoCs 2012
92Slide93
VC-based switches with MARX units: Datapath
Slide94
NoC: The science & art of on-chip connections
Micro-architecture of Network-on-Chip Routers
Giorgos Dimitrakopoulos, Springer, mid 2013