Uploaded On 2016-03-12

Presentation Transcript

Slide1

Presented by: Priyank Gupta

04/02/2012

Generic Low Latency NoC Router Architecture for FPGA Computing Systems & A Complete Network on Chip Emulation Framework

Slide2

Introduction

Moore's law is pushing towards more complex SoCs.
Network on Chip (NoC) is one of the technologies that enable us to keep up with the law.
Designing and evaluating NoCs has been a challenge: complexity of design, inaccurate network/component models, and simulation/CAD tools.
FPGAs provide a platform where the above concerns can be addressed and real-life data can be measured.
Present on-chip FPGA resources can be used as the underlying interconnect fabric.

Slide3

NoC Fundamentals

Block diagram of a router.

Slide4

NoC Fundamentals

Slide5

Router Architecture

Simplified block diagram of a router. Flits arriving over the input channels are stored in buffers associated with each input. A set of allocators assigns buffers on the next node and channel bandwidth to pending flits. When a flit has been allocated the resources it needs, it is forwarded by the crossbar switch to an output channel.

Slide6

Crossbar Example

A 4×5 crossbar switch as implemented with five 4:1 multiplexers. Each multiplexer selects the input to be connected to the corresponding output.
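The multiplexer view of the crossbar can be sketched in a few lines of Python. This is only an illustrative model, not the router's actual implementation: the function names `mux` and `crossbar` and the flit values are assumptions made for the example.

```python
from typing import List, Sequence


def mux(inputs: Sequence[int], select: int) -> int:
    """An N:1 multiplexer: forward the selected input to the output."""
    return inputs[select]


def crossbar(inputs: Sequence[int], selects: Sequence[int]) -> List[int]:
    """Model a crossbar as one multiplexer per output port.

    inputs  -- one value per input port (4 in the 4x5 example)
    selects -- one select value per output port (5 in the 4x5 example)
    """
    return [mux(inputs, s) for s in selects]


# Hypothetical flit values present on the four inputs.
flits = [0xA, 0xB, 0xC, 0xD]
# Output 0 takes input 2, output 1 takes input 0, and so on.
outputs = crossbar(flits, [2, 0, 3, 1, 0])
print(outputs)
```

In a real router the select values would come from the allocators, which also ensure that no two outputs claim the same input in a cycle; this sketch does not enforce that.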

Slide7

Crossbar Symbol

Slide8

Data Packet Example

Packet format for our simple network. Time, in cycles, is shown in the vertical direction, while the 18 signals of a channel are shown in the horizontal direction. The leftmost signals contain the phit type, while the 16 remaining signals contain either a destination address or data, or are unused in the case of a null phit.
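A minimal sketch of this 18-signal phit layout follows. The field widths (2 type bits plus 16 payload bits) follow from the caption; the numeric type codes and the names `PhitType`, `pack_phit` and `unpack_phit` are assumptions made for illustration.

```python
from enum import IntEnum

PAYLOAD_BITS = 16  # from the caption: 16 signals carry address or data
TYPE_BITS = 2      # 18 - 16 = 2 leftmost signals carry the phit type


class PhitType(IntEnum):
    # The numeric codes are assumed; the caption only names the field.
    NULL = 0
    HEAD = 1   # payload holds the destination address
    DATA = 2   # payload holds data


def pack_phit(kind: PhitType, payload: int = 0) -> int:
    """Pack a phit into an 18-bit word with the type field in the top bits."""
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload must fit in 16 bits")
    if kind is PhitType.NULL:
        payload = 0  # payload signals are unused for a null phit
    return (int(kind) << PAYLOAD_BITS) | payload


def unpack_phit(word: int):
    """Split an 18-bit word back into (type, payload)."""
    return PhitType(word >> PAYLOAD_BITS), word & ((1 << PAYLOAD_BITS) - 1)


# One phit per channel cycle: the head phit first, then data, then idle.
packet = [
    pack_phit(PhitType.HEAD, 0x0003),
    pack_phit(PhitType.DATA, 0xBEEF),
    pack_phit(PhitType.NULL),
]
```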

Slide9

Data Flow Control

Units of resource allocation. Messages are divided into packets for allocation of control state. Each packet includes routing information (RI) and a sequence number (SN). Packets are further divided into flits for allocation of buffer capacity and channel bandwidth. Flits include no routing or sequencing information beyond that carried in the packet, but may include a virtual-channel identifier (VCID) to record the assignment of packets to control state.
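The message, packet and flit hierarchy described above can be sketched as follows. The packet and flit sizes, the "head"/"body"/"tail" labels and all class and function names are assumptions chosen for the example, not values from the presentation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Flit:
    vcid: int       # virtual-channel identifier
    kind: str       # "head", "body" or "tail" (labels assumed here)
    payload: bytes


@dataclass
class Packet:
    route_info: int  # RI: here simply the destination node
    seq_num: int     # SN: position of the packet within its message
    flits: List[Flit]


def chunks(data: bytes, size: int) -> List[bytes]:
    """Split data into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def packetize(message: bytes, dest: int, vcid: int,
              packet_bytes: int = 8, flit_bytes: int = 2) -> List[Packet]:
    """Divide a message into packets, and each packet into flits."""
    packets = []
    for sn, pdata in enumerate(chunks(message, packet_bytes)):
        parts = chunks(pdata, flit_bytes)
        flits = []
        for i, part in enumerate(parts):
            if i == 0:
                kind = "head"
            elif i == len(parts) - 1:
                kind = "tail"
            else:
                kind = "body"
            flits.append(Flit(vcid, kind, part))
        packets.append(Packet(dest, sn, flits))
    return packets


packets = packetize(b"0123456789ABCDEFGHIJ", dest=5, vcid=1)
```

Only the packet carries RI and SN; each flit carries just the VCID that ties it to the packet's control state.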

Slide10

Data Flow Algorithms

Deterministic: The algorithm always chooses the same path between x and y, even if multiple options exist. It is simple to implement but does a poor job of balancing load.
Oblivious: The algorithm decides on a path between x and y without using any information about the network's state. Deterministic algorithms are a subset.
Adaptive: The algorithm adapts to the state of the network. Information used includes node status, channel load, queue length, etc.

Slide11
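To contrast the three routing classes listed under Data Flow Algorithms (deterministic, oblivious, adaptive), here is a small sketch on a 2-D mesh. The choice of XY/YX paths, the `load` dictionary and all function names are assumptions made for illustration; the presentation does not prescribe these particular algorithms.

```python
import random
from typing import Dict, List, Tuple

Node = Tuple[int, int]


def step_toward(a: int, b: int) -> int:
    """Move coordinate a one step closer to b (caller ensures a != b)."""
    return a + (1 if b > a else -1)


def dimension_order_path(src: Node, dst: Node, x_first: bool = True) -> List[Node]:
    """Path that finishes one dimension completely before the other."""
    x, y = src
    path = [src]
    for dim in (("x", "y") if x_first else ("y", "x")):
        if dim == "x":
            while x != dst[0]:
                x = step_toward(x, dst[0])
                path.append((x, y))
        else:
            while y != dst[1]:
                y = step_toward(y, dst[1])
                path.append((x, y))
    return path


def deterministic(src: Node, dst: Node) -> List[Node]:
    # Always the same path for a given (src, dst) pair.
    return dimension_order_path(src, dst, x_first=True)


def oblivious(src: Node, dst: Node, rng: random.Random) -> List[Node]:
    # Random choice between two paths; no network state is consulted.
    return dimension_order_path(src, dst, x_first=rng.random() < 0.5)


def adaptive(src: Node, dst: Node,
             load: Dict[Tuple[Node, Node], int]) -> List[Node]:
    """At each hop, take the productive channel with the lowest load."""
    path = [src]
    while path[-1] != dst:
        here = path[-1]
        options = []
        if here[0] != dst[0]:
            options.append((step_toward(here[0], dst[0]), here[1]))
        if here[1] != dst[1]:
            options.append((here[0], step_toward(here[1], dst[1])))
        path.append(min(options, key=lambda n: load.get((here, n), 0)))
    return path


# Hypothetical congestion on the channel leaving (0, 0) in the x direction.
load = {((0, 0), (1, 0)): 5}
xy = deterministic((0, 0), (2, 1))
adapted = adaptive((0, 0), (2, 1), load)
```

With this load, the deterministic path still crosses the congested channel, while the adaptive path steps in y first to avoid it.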

Routing Mechanics

The mechanism used to implement any routing algorithm.
Table-based routing: The set of paths for each pair of nodes is stored in a table, and the table is indexed by the source and destination node. Only the portion of the table that is needed on a particular node need be stored on that node.
Algorithmic routing: The routing relation is computed using a network-specific algorithm. This is more efficient in terms of both area and speed.

Slide12
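The two routing mechanics (table-based and algorithmic) can be compared with a short sketch. It assumes a 3×3 mesh, XY dimension-order routing and port names "E", "W", "N", "S" and "LOCAL"; none of these specifics come from the presentation.

```python
from typing import Dict, Tuple

Node = Tuple[int, int]


def route_algorithmic(here: Node, dst: Node) -> str:
    """Algorithmic routing: compute the output port from coordinates (XY order)."""
    if dst[0] > here[0]:
        return "E"
    if dst[0] < here[0]:
        return "W"
    if dst[1] > here[1]:
        return "N"
    if dst[1] < here[1]:
        return "S"
    return "LOCAL"


def build_node_table(here: Node, width: int, height: int) -> Dict[Node, str]:
    """Table-based routing: the slice of the table stored at one node.

    Because the source is fixed (it is this node), the slice only needs
    to be indexed by destination.
    """
    return {(x, y): route_algorithmic(here, (x, y))
            for x in range(width) for y in range(height)}


table_at_1_1 = build_node_table((1, 1), width=3, height=3)
port = table_at_1_1[(2, 0)]  # lookup instead of computation
```

The table grows with the number of destinations, while the algorithmic version needs only a few comparisons, which is one way to read the area and speed remark.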

Flow Control

Determines how network resources such as channel bandwidth, buffer capacity and control state are allocated to the data traversing the network.
Bufferless flow control: Simple to implement. Packets are dropped or misrouted if a resource is not available.
Buffered flow control: Packets are temporarily stored in a buffer, so fewer packets are dropped, at the additional cost of hardware for implementation.

Slide13

Network Topologies

Torus and mesh networks: (a) a torus network (4-ary 2-cube) includes the connection from node 3 to node 0 in both dimensions, but (b) a mesh network (4-ary 2-mesh) omits this connection.
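The difference between the two topologies comes down to whether coordinates wrap around. A minimal sketch, in which the function name `neighbors` and the `wrap` flag are assumptions made for the example:

```python
from typing import List, Tuple

Node = Tuple[int, int]


def neighbors(node: Node, k: int, wrap: bool) -> List[Node]:
    """Neighbors of a node in a k-ary 2-cube (wrap=True) or k-ary 2-mesh."""
    result = []
    for dim in (0, 1):
        for delta in (-1, 1):
            coord = list(node)
            coord[dim] += delta
            if wrap:
                coord[dim] %= k      # torus: node k-1 connects back to node 0
            elif not 0 <= coord[dim] < k:
                continue             # mesh: no link past the edge
            result.append(tuple(coord))
    return result


# Corner node (3, 0) of a 4x4 network.
torus_nbrs = neighbors((3, 0), k=4, wrap=True)
mesh_nbrs = neighbors((3, 0), k=4, wrap=False)
```

Every torus node has four neighbors, whereas mesh nodes on an edge or corner have fewer.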

Slide14

FPGA Emulation Framework

Emulation environment developed to explore, evaluate and compare a variety of NoC solutions.
Current FPGA implementations are limited in flexibility and do not allow a full test of NoC implementations.
Cycle-accurate simulations using a combination of hardware and software modules.
Added flexibility due to the modular approach of the architecture.
16,000 times faster than an HDL simulator.

Slide15

NoC Emulation Architecture

Slide16

NoC Emulation Architecture

Xilinx Virtex 2 Pro V20 with an embedded PowerPC: The processor provides the much-needed flexibility for the emulation process.
Monitor module: Responsible for the interface between the host PC and the FPGA board. Also streams out data generated from various tests.
Programmable NoC platform: Responsible for traffic generators and receptors. Also keeps the user-defined interconnection set between switches and network.

Slide17

Data Flow

The Traffic Generator (TG) above can provide an image of the congestion of the network at each moment in time of the emulation. Each time a flit is not acknowledged by its receptor (i.e. switch or TR) and has to be resent, a counter readable by the processor is incremented.

Slide18

NoC Emulation Flow

Slide19

NoC Emulation Flow

No re-synthesis or re-mapping of the system is needed, owing to the HW-SW structure.
The processor is able to initialize parameters in hardware.
The emulation flow is categorized as:
Stochastic emulation flow: Implemented at the hardware level only. Configuration is implemented at the compilation level.
Trace-based emulation flow: The entire NoC trace is loaded via software located in RAM. The processor streams the data into the emulated NoC and collects information on latency and congestion.

Slide20

Results – Run Time

The total delivery time for the same number of packets is higher in burst mode than for uniform traffic. This is because the probability of collisions between packets in burst mode is significantly higher.

Slide21

Results – Congestion Rate

Plots indicate that the congestion rate does not increase linearly with the number of delivered packets in burst mode.

Slide22

Results – Avg. Latency

The average latency of packets reaches a limit of congestion, which is the limit of the NoC in terms of latency.

Slide23

Generic Low Latency Router - Motivation

New generations of FPGAs comprise millions of LUTs and will contain many parallel soft processor cores and glue/extra logic.
Use of traditional interconnect schemes will lead to underutilization.
Future designs are perceived to be at a higher level than the traditional gate level. Functionality will be implemented through the programmability of such cores.
Increased FPGA complexity will make an RTL-based design flow inefficient.

Slide24

Proposed Solution

Network on Chip can provide a flexible, scalable and reliable communication solution for such large and complex designs.
NoC provides the ability to change bandwidth and add processing elements. Cost is linear in this case, whereas traditional crossbar interconnects scale roughly quadratically with the number of ports.
FPGAs contain significant global and local routing resources, which can be used to construct the interconnect fabric and implement the routing algorithm.

Slide25

Prior Work

Many routers have been designed for NoC FPGA implementation.
Circuit-switched router: The head flit charts out the path and the body follows. It has long circuit setup latency and low bandwidth utilization, but once the path is set up, QoS is guaranteed and data latency is low.
Time-multiplexed router: Precomputed communication pattern. Less flexible. Works well when communication load is 100%, but performance drops significantly when load < 40%.
Packet-switched router: Negotiates network resources dynamically at run time. Flexible and scalable with low resource utilization, but has high latency (about 8 clock cycles per hop).
FPGAs are primarily used for prototyping and evaluating latency, throughput, cost and power.

Slide26

Generic Low Latency NoC Router - Overview

Reconfigurable wormhole router for packet-switched NoC designs.
Low routing latency, low complexity and high buffer utilization.
Designed to be scalable, flexible and reliable for a variety of FPGA platforms and network configurations.
1-D ring, 2-D mesh and 3-D cube network topologies were used to measure the feasibility of design and implementation on the FPGA.
2 cycles per hop latency.

Slide27

Wormhole Router Block Diagram

Three main parts: flow control, router components and pipeline control.
Wormhole flow control.
Components include input and output ports, an arbiter to arbitrate between multiple requests, and an FSM to maintain the state of each output port.
Pipeline control is instrumental in achieving low latency per hop and parallel computation.

Slide28

Wormhole Flow Control

(a) The header arrives at the node while the virtual channel is in the idle state (I) and the desired upper (U) output channel is busy, allocated to the lower (L) input.
(b) The header is buffered and the virtual channel is in the waiting state (W), while the first body flit arrives.

Slide29

Wormhole Flow Control

(c) The header and first body flit are buffered, while the virtual channel is still in the waiting state. In this state, the input channel is blocked. The second body flit cannot be transmitted, since it cannot acquire a flit buffer.
(d) The output virtual channel becomes available and is allocated to this packet. The state moves to active (A) and the head is transmitted to the next node.

Slide30

Wormhole Flow Control

(e) The body flits follow.
(f) The body flits follow.

Slide31

Wormhole Flow Control

(g) The tail flit is transmitted and frees the virtual channel, returning it to the idle state.
(h) A time-space diagram showing this process.

Slide32

Packet Format

Packet length is not fixed.
Format is defined by the network topology.
The Output Channel (OC) field stores the output channel used by the packet.

Slide33

Input Port

Single-entry flit buffer uses dual-port memory.
Dimension-ordered routing.
Routing computation is decoupled from arbitration.
Head and tail pointers are used to evaluate whether a flit is present in the buffer.

Slide34
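The head and tail pointer check mentioned for the input port can be sketched as a small FIFO. This is a software model under assumptions: the class name `FlitBuffer`, the configurable `depth` and the trick of counting pointers modulo twice the depth are illustrative choices, not details taken from the router design.

```python
class FlitBuffer:
    """Small FIFO whose occupancy is derived from head and tail pointers."""

    def __init__(self, depth: int = 1):
        self.depth = depth
        self.mem = [None] * depth
        # Pointers count modulo 2*depth so full and empty can be told apart.
        self.head = 0  # next slot to read
        self.tail = 0  # next slot to write

    def has_flit(self) -> bool:
        """A flit is present whenever the two pointers differ."""
        return self.head != self.tail

    def is_full(self) -> bool:
        return (self.tail - self.head) % (2 * self.depth) == self.depth

    def write(self, flit) -> bool:
        """Store a flit; returns False if the buffer is full."""
        if self.is_full():
            return False
        self.mem[self.tail % self.depth] = flit
        self.tail = (self.tail + 1) % (2 * self.depth)
        return True

    def read(self):
        """Remove and return the oldest flit, or None if empty."""
        if not self.has_flit():
            return None
        flit = self.mem[self.head % self.depth]
        self.head = (self.head + 1) % (2 * self.depth)
        return flit


buf = FlitBuffer(depth=1)   # single-entry buffer
accepted_first = buf.write("head flit")
accepted_second = buf.write("body flit")  # rejected: the entry is occupied
```

In hardware, a dual-port memory would let the write and read sides operate in the same cycle; this model handles them as separate calls.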

Output FSM

Maintains the state of the output port.
The active state indicates that an output port has been matched with a downstream input port.
Tail flit departure puts the router in the wait state.
Once all flits leave the downstream routers, the router goes into idle mode.

Slide35
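The idle, active and wait states of the output FSM can be written down as a small state machine. The event names `granted`, `tail_sent` and `downstream_empty` are hypothetical labels for the three conditions the slide describes; the actual signal names are not given in the presentation.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    ACTIVE = "active"
    WAIT = "wait"


class OutputFSM:
    """Tracks the state of one output port."""

    def __init__(self):
        self.state = State.IDLE

    def step(self, granted: bool = False, tail_sent: bool = False,
             downstream_empty: bool = False) -> State:
        if self.state is State.IDLE and granted:
            # Output port matched with a downstream input port.
            self.state = State.ACTIVE
        elif self.state is State.ACTIVE and tail_sent:
            # Tail flit has departed.
            self.state = State.WAIT
        elif self.state is State.WAIT and downstream_empty:
            # All flits have left downstream.
            self.state = State.IDLE
        return self.state


fsm = OutputFSM()
trace = [
    fsm.step(granted=True),
    fsm.step(),                       # body flits pass, state unchanged
    fsm.step(tail_sent=True),
    fsm.step(downstream_empty=True),
]
```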

Router Pipeline Organization

Clock 1: The destination address and output channel are latched in. The flit is written into the flit buffer. During this period, the arbitration result and look-ahead routing are computed.
Clock 2: The crossbar control signal is latched in and the flit is read from the granted port.

Slide36

Timing of Second Pipeline Stage

$T = T_{co} + T_{lut} + T_{rot} + T_{su}$

$T_{co}$ is the clock-to-register (or memory) output delay
$T_{lut}$ is the delay of the LUT cells (multiplexer logic)
$T_{rot}$ is the delay due to programmable wire routing
$T_{su}$ is the setup time of the device

Slide37
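As a worked example of the second-stage timing sum (T = Tco + Tlut + Trot + Tsu), the sketch below adds the four delays and converts the resulting period to a maximum clock rate. The delay values are hypothetical placeholders, not measurements from the presentation, and the function names are assumptions.

```python
def stage_period_ns(t_co: float, t_lut: float, t_rot: float, t_su: float) -> float:
    """T = Tco + Tlut + Trot + Tsu, all in nanoseconds."""
    return t_co + t_lut + t_rot + t_su


def max_clock_mhz(period_ns: float) -> float:
    """Maximum clock rate for a given minimum period (1000 / ns = MHz)."""
    return 1000.0 / period_ns


# Hypothetical delays, chosen only to make the arithmetic easy to follow.
period = stage_period_ns(t_co=0.5, t_lut=1.0, t_rot=2.0, t_su=0.5)
fmax = max_clock_mhz(period)
print(f"T = {period} ns, fmax = {fmax} MHz")
```

With these placeholder numbers the wire routing term dominates the sum, which is why reducing it matters most for the clock rate in this example.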

Pipeline Diagram ASIC vs FPGA Router

Slide38

Network Topologies Used

Slide39

Credit Based Routing

The upstream router keeps a count of the free buffers in the downstream router.
The credit count is decremented every time a buffer is consumed. If the count is zero, all downstream buffers are full.
Data is transferred only if the credit count > 0.

Slide40
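The credit-counting rule from the Credit Based Routing slide can be sketched as follows. The class name `CreditLink` and its methods are assumptions. The step where a credit is returned once the downstream router forwards a flit is not stated on the slide; it is added here as the usual counterpart of the decrement.

```python
class CreditLink:
    """Upstream view of a link, with a credit count for downstream buffers."""

    def __init__(self, downstream_buffers: int):
        self.credits = downstream_buffers  # free buffers downstream
        self.downstream = []               # flits held downstream

    def can_send(self) -> bool:
        # Data is transferred only if the credit count is above zero.
        return self.credits > 0

    def send(self, flit) -> bool:
        if not self.can_send():
            return False        # all downstream buffers are full
        self.credits -= 1       # a downstream buffer is consumed
        self.downstream.append(flit)
        return True

    def downstream_forward(self):
        """Downstream forwards a flit, freeing a buffer (assumed credit return)."""
        if not self.downstream:
            return None
        flit = self.downstream.pop(0)
        self.credits += 1
        return flit


link = CreditLink(downstream_buffers=2)
results = [link.send("a"), link.send("b"), link.send("c")]  # third is blocked
forwarded = link.downstream_forward()
retry = link.send("c")  # succeeds now that a credit has come back
```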

Results – Resource Utilization

Logic cost vs. router radix (32-bit data-path width)

Slide41

Results – Resource Utilization

Logic cost vs. data-path width

Slide42

Results - Timing

(a) Maximum clock rate vs. router radix
(b) Two important critical paths within a router

Slide43

Results - Power

Static and dynamic power (mW) vs. router radix
Normalized per-packet power consumption vs. radix

Slide44

Packet Generator and Receiver

Due to pin limitations, packets must be generated using on-chip logic within the FPGA rather than external sources. Each node of the NoC system is attached to a packet generator and receiver.

Slide45

Results – Resource Utilization per Configuration

Resource utilization of different network configurations

Slide46

Results – Resource Utilization per Configuration

Slide47

Summary

A highly scalable router that is easily used across different network topologies.
Low hop-by-hop propagation delay using a packet-switched NoC router.
Analysis of the router in terms of scalability, hardware cost, operation speed and power dissipation.
Real-world feasibility of such a router architecture has been demonstrated, and its usage within an FPGA platform provides a very robust and cost-effective solution.