Clocking

Slide1

Clocking and Timing in Fault-Tolerant Systems-on-Chip

Andreas Steininger

Slide2

Outline

- The Clock as a Blessing
- The Clock as a Curse
- Alternative Synchronization Schemes
  - GALS
  - fully asynchronous
  - the DARTS approach
- Conclusion

Slide3

Contributors to this Work

The DARTS project team
TU Vienna: Gottfried Fuchs, Matthias Fuegger, Ulrich Schmid, Thomas Handl
RUAG Space: Gerald Kempf, Manfred Sust, Wolfgang Zangerl

Slide4

The Need for Fault Tolerance

miniaturization is key to progress in VLSI
=> smaller structures
=> lower voltage swing
=> smaller critical charge
=> higher operating frequencies
… result in higher susceptibility to faults (SET, EMI, …)
=> cannot avoid faults, need to tolerate them

Slide5

The Role of Time

“The only reason for time is so that everything doesn’t happen at once”, Albert Einstein


Slide6

The Need for Clocking

activities need to be co-ordinated
- on system level (braking of wheels, …)
- on algorithmic level (consensus, …)
- on communication level
- on logic level (state machine switching, …)
co-ordination in the time domain (synchronization) is an efficient way to attain this
=> need a global notion of time (discrete "ticks")

Slide7

The Quality of Synchronization

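[Figure: local time (number of ticks) plotted over real time for several clocks; the precision π is the maximum deviation between the local times of any two clocks]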

Slide8

Typical Precision Values

on system level: µs … ms
on algorithm level: µs … ms
on communication level: ns … µs
on logic level: ps … ns

Slide9

Synchronization Requirements

phase synchronisation (for "hardware clock" on logic level)
clock synchronisation (for distributed time base on algorithmic level)
1 µs is excellent precision for a distributed clock
=> at 1 GHz this means 360,000° phase shift
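The 360,000° figure is plain arithmetic: a time offset multiplied by the clock frequency gives the number of clock periods it spans, and each period corresponds to 360°. A minimal Python sketch (the function name is illustrative, not from the slides):

```python
def phase_shift_deg(offset_s: float, clock_hz: float) -> float:
    """Phase shift in degrees that a time offset of offset_s represents at clock_hz."""
    periods = offset_s * clock_hz   # number of clock periods covered by the offset
    return periods * 360.0          # 360 degrees per clock period

# 1 µs distributed-clock precision, viewed by logic running at 1 GHz:
# about 1000 clock periods, i.e. roughly 360,000 degrees
print(phase_shift_deg(1e-6, 1e9))
```

Precision that is excellent for a distributed time base is therefore useless as phase synchronisation for a hardware clock.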

Slide10

Globally Synchronous Design

whole design is "isochronic" ("perfect" precision)
time conveyed by clock transitions
perfect co-ordination of all activities
very efficient design:
- can assume consistent states
- high level of abstraction
very efficient implementation:
- single crystal oscillator
- single control line (clock net)

Slide11

"Isochronic" Regions?
speed of light (in medium) = 2 × 10^8 m/s = 20 cm/ns
[Figure: extent of "isochronic" regions at 1 GHz, 4 GHz and 8 GHz, compared with a 2 cm reference]
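The slide's point can be checked by computing how far a signal travels within one clock period, assuming the 2 × 10^8 m/s propagation speed given above (a sketch; the helper name is hypothetical):

```python
SIGNAL_SPEED_M_PER_S = 2e8  # speed of light in the medium, as on the slide (20 cm/ns)

def distance_per_period_cm(clock_hz: float) -> float:
    """Distance a signal travels during one clock period, in cm."""
    return SIGNAL_SPEED_M_PER_S / clock_hz * 100.0

for f_hz in (1e9, 4e9, 8e9):
    print(f"{f_hz / 1e9:.0f} GHz: {distance_per_period_cm(f_hz):.1f} cm per clock period")
```

This yields 20 cm, 5 cm and 2.5 cm. An isochronic region must be considerably smaller than these distances, because the wire delay may only be a small fraction of the period; at 8 GHz, even the 2 cm reference already corresponds to 80% of a period.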

Slide12

The Variation Problem

[Figure: the designer bases the timing on a system model under projected conditions, adding worst-case safety margins; the user operates the actual system under actual conditions, which deviate in unknown ways and through imperfections]
Timing completely fixed after design
No way to react to actual conditions & system ("PVT variations")

Slide13

Fault-Tolerant Architectures
[Figure: Duplication & Comparison: two FUs feed a comparator (=?) that signals ERR on mismatch. Triple-Modular Redundancy: three FUs feed a voter that produces the output Y.]
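The two architectures can be sketched behaviourally in Python. This is an illustration under simple assumptions (integer results, names chosen here), not the circuits from the slides; the bit-wise voter shows the 2-out-of-3 majority that a gate-level voter forms for each bit:

```python
from typing import Optional

def compare(a: int, b: int) -> bool:
    """Duplication & comparison: True (ERR) if the two replicas disagree."""
    return a != b

def vote(a: int, b: int, c: int) -> Optional[int]:
    """TMR word voter: the value delivered by at least two replicas, None if all differ."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None

def vote_bitwise(a: int, b: int, c: int) -> int:
    """Bit-wise 2-out-of-3 majority, as a gate-level voter would compute it."""
    return (a & b) | (a & c) | (b & c)

print(compare(3, 4))                          # True -> ERR raised
print(vote(3, 4, 3))                          # 3 -> single wrong replica masked
print(vote_bitwise(0b1100, 0b1010, 0b1001))   # 8 (0b1000)
```

Comparison only detects a fault, while voting masks one. Both only work if the replicas present corresponding results at the same time, which is the subject of the lock-step slides.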

Slide14

Lock-Step Operation

single clock
[Figure: three FUs and a voter (output Y), all driven by one clock; all replicas move from state "3" to state "4" on the same tick]
single point of failure
good replica determinism

Slide15

Lock-Step Operation

independent clocks
[Figure: three FUs and a voter (output Y), each FU with its own clock; the replicas move from state "3" to state "4" at different times]
single fault tolerant
bad replica determinism

Slide16

Fault-Tolerant HW-Clocking

[Figure: TMR arrangement of three FUs and a voter (output Y), with a voter (v) in the clock path of each FU]

Slide17

Fault-Tolerant HW-Clocking

[Figure: the same arrangement with delays (D) in the clock paths; the resulting behaviour at the clock voters is marked "?"]

Slide18

The Charm of SoCs
billions of transistors fit on one die
=> structuring into (IP) modules: "System-on-Chip"
BUT:
- large clock distribution networks => "isochronic"??
- FT clocking does not work with large skew
- may need individual clocks for function modules
=> clock synchrony neither attainable nor desirable

Slide19

Co-ordination of Data Exchange

[Figure: a source (SRC) feeding a sink (SNK) through a function f(x)]
When can SNK use its input? When it is valid and consistent.
When can SRC apply the next input? When SNK has consumed the previous one.

Slide20

The Synchronous Approach

[Figure: SRC → f(x) → SNK]
co-ordination based on (global) time

Slide21

Alternative: Asynchronous Design

[Figure: SRC → f(x) → SNK, with REQ and ACK signals between SRC and SNK]
co-ordination based on handshaking
REQ: "Data word valid, you can use it"
ACK: "Data word consumed, send the next"
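A small timing model illustrates this closed-loop behaviour. It assumes a four-phase (return-to-zero) REQ/ACK protocol, which the slide does not specify, and uses invented delay ranges as a stand-in for PVT variation; function and parameter names are hypothetical:

```python
import random

def four_phase_transfer(words, src_delay, snk_delay, seed=0):
    """
    Timing sketch of a four-phase (return-to-zero) REQ/ACK channel.
    src_delay, snk_delay: (min, max) response times in ns, drawn at random
    to mimic PVT variations. Returns the received words and the total time.
    """
    rng = random.Random(seed)
    t = 0.0
    received = []
    for word in words:
        t += rng.uniform(*src_delay)  # SRC drives the word, then raises REQ
        t += rng.uniform(*snk_delay)  # SNK sees REQ, takes the word, raises ACK
        received.append(word)
        t += rng.uniform(*src_delay)  # SRC sees ACK, lowers REQ
        t += rng.uniform(*snk_delay)  # SNK sees REQ low, lowers ACK: channel idle
    return received, t

words = list(range(100))
received, t_async = four_phase_transfer(words, src_delay=(1.0, 3.0), snk_delay=(0.5, 2.0))
# A fixed schedule would have to budget the worst case for every word:
t_worst = len(words) * 2 * (3.0 + 2.0)
print(received == words, round(t_async), t_worst)
```

Every word arrives intact whatever the delays, and each transfer takes only as long as the actual delays require, whereas a clocked schedule must reserve the worst case for every word.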

Slide22

Async. Design – Advantages

closed-loop control makes timing much more robust and adaptive to PVT variations
no need for worst-case timing
local handshakes replace global clock
activity only when needed
beneficial for EMI
tends to stop operation in case of fault

Slide23

Async. Design – Disadvantages

Need to handle race between REQ and data


Slide24

Async. Design – Disadvantages

Need to handle race between REQ and data

[Figure: SRC → f(x) → SNK; REQ ("Data word valid, you can use it") travels separately from the data]

Slide25

Async. Design – Disadvantages

Need to handle race between REQ and data
Solution 1: "Bundled Data"
[Figure: SRC → f(x) → SNK; REQ ("Data word valid, you can use it") is sent alongside (bundled with) the data word]

Slide26

Async. Design – Disadvantages

Need to handle race between REQ and data
Solution 2: "Delay Insensitive" (Coding)
[Figure: SRC → f(x) → SNK; REQ ("Data word valid, you can use it") is generated from the coded data by completion detection]
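The slide leaves the code open. Dual-rail encoding is one common delay-insensitive code and is assumed in this sketch: each bit uses two wires, (0, 0) serves as the "no data" spacer, and completion detection raises the equivalent of REQ only once every bit carries a valid code word, so no bit can be read too early whatever its wire delay. Names are chosen for illustration:

```python
import random
from typing import List, Optional, Tuple

Rail = Tuple[int, int]      # (true wire, false wire) of one dual-rail bit
NULL: Rail = (0, 0)         # spacer: no data present on this bit

def encode(bits: List[int]) -> List[Rail]:
    """Dual-rail encoding: 1 -> (1, 0), 0 -> (0, 1)."""
    return [(1, 0) if b else (0, 1) for b in bits]

def complete(word: List[Rail]) -> bool:
    """Completion detection: every bit has exactly one rail high."""
    return all(t + f == 1 for t, f in word)

def decode(word: List[Rail]) -> Optional[List[int]]:
    """Return the data bits once the word is complete, else None."""
    return [t for t, _ in word] if complete(word) else None

bits = [1, 0, 1, 1]
target = encode(bits)
wires = [NULL] * len(bits)             # channel starts in the spacer state
order = list(range(len(bits)))
random.Random(1).shuffle(order)        # bits arrive in arbitrary order (unknown wire delays)
for k, i in enumerate(order):
    wires[i] = target[i]
    # the derived REQ (completion) rises only after the last bit has arrived
    assert complete(wires) == (k == len(bits) - 1)
print(decode(wires))                   # [1, 0, 1, 1]
```

The price is the hardware overhead listed on the next slide: two wires per bit plus the completion-detection logic.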

Slide27

Async. Design – Disadvantages

Need to handle race between REQ and data
significant HW overhead (coding, delay elements)
"adaptive" timing not as predictable
more difficult to design
classical fault-tolerance schemes not applicable
tends to stop operation in case of fault

Slide28

Best of Both Worlds

GALS: Globally Asynchronous Locally Synchronous

retain efficiency of synchronous design wherever possible: "intra-module"
use asynchronous principle where clock distribution is too cumbersome: "inter-module"
First mentioned in the PhD thesis by Chapiro (Stanford, 1984)

Slide29

A GALS Example

[Figure: GALS system with modules in separate clock domains: CPU (2 GHz), PCI-IF (533 MHz), DSP (2.7 GHz), USB-IF (24 MHz)]

Slide30

Communication in GALS

Shared Memory
- producer writes to memory, consumer reads from there
- pro: control flow stays independent
- shared single-port memory
- true dual-port memory
Direct Messages (Data words)
- move data word from producer's output register to consumer's input register
- non-buffered / buffered (FIFO queues)
- clock fixed, data-driven or pausible

Slide31

Shared Memory

decoupling of clock domains by memory acting as a third party
=> high area overhead => unusual
for single-port memory, arbitration is required
- arbitration problem (unbounded delay …)
- one side may block the other at the arbiter
for multi-port memory, problems are confined to access to the same cell
- busy flag may become metastable
- blocking still possible for one specific address

Slide32

Shared Memory

[Figure: CPU (2 GHz) and DSP (2.7 GHz) exchanging data via a shared memory (e.g. address 0xff14) with arbitration logic]
perfect decoupling of data path
potential metastability problems at arbitration logic
potential blocking through arbitration

Slide33

Direct Messages

clock domain boundary is between producer's output register and consumer's input register
in general, a synchronizer is needed at the consumer's input
- definitely for a conventional (fixed) clock
- can be avoided by data-driven / pausible clocking
control flows of producer and consumer are strongly coupled: not maintaining the input/output register blocks the other party
buffers/queues/FIFOs can mitigate, but not avoid, this problem (full/empty)
- they compensate variations in the data rate on both sides, but not different average data rates
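The last point can be illustrated with a cycle-based sketch of a bounded FIFO between producer and consumer. Rate patterns, depths and function names are invented for the example, and the clock domain crossing itself is not modelled:

```python
from collections import deque

def simulate(produce, consume, depth, cycles):
    """
    Cycle-based sketch of a bounded FIFO between producer and consumer.
    produce(t) / consume(t): does the producer / consumer want to act in cycle t?
    Returns (words transferred, cycles in which the producer found the FIFO full).
    """
    fifo = deque()
    transferred = 0
    blocked = 0
    for t in range(cycles):
        if consume(t) and fifo:
            fifo.popleft()
            transferred += 1
        if produce(t):
            if len(fifo) < depth:
                fifo.append(t)
            else:
                blocked += 1          # producer would have to stall this cycle
    return transferred, blocked

def bursty(t):   # 4 words, then 4 idle cycles: average 0.5 words/cycle
    return t % 8 < 4

def steady(t):   # one word every 2nd cycle: average 0.5 words/cycle
    return t % 2 == 0

def slow(t):     # one word every 4th cycle: average 0.25 words/cycle
    return t % 4 == 0

print(simulate(bursty, steady, depth=4, cycles=1000))   # rate variation absorbed
print(simulate(bursty, steady, depth=2, cycles=1000))   # FIFO too shallow for the burst
print(simulate(bursty, slow, depth=64, cycles=1000))    # average rates differ
```

With equal average rates, a FIFO deep enough for the burst absorbs the variation (no blocked cycles); with different average rates it eventually fills and the producer blocks whatever the depth.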

Slide34

Direct Messages

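data moving over clock domain boundary
=> metastability problems
=> need to insert handshake … with synchronizers (S) and (optional) buffers
[Figure: CPU (2 GHz) sending a data word (e.g. 0xff14) directly to DSP (2.7 GHz), with synchronizers (S) at the clock domain boundary]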

Slide35

Arbiter: Principle

purpose:
○ manage concurrent requests to a shared resource
method:
○ handle pairs of request_in / grant_out
○ requests may arrive in any order
○ arbiter must activate only one grant_out at a time (respond to the first requester)
=> Mutual Exclusion (MUTEX)
problem:
○ resolve concurrent requests => metastability problem
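The MUTEX behaviour can be sketched as a Python model (class and method names are hypothetical). It captures exclusivity and "first come, first served", but not the metastability problem itself: a true tie is simply broken at random here, whereas a circuit may need unbounded time to decide:

```python
import random

class Mutex:
    """
    Behavioural sketch of a two-way MUTEX element (hypothetical model).
    At most one grant is active; a grant stays active until its request is
    withdrawn. A real circuit resolves (near-)simultaneous requests through
    a metastable state of unbounded duration; here the tie is simply broken
    by a random, but always exclusive, decision.
    """

    def __init__(self, rng: random.Random):
        self.grant = [False, False]
        self.rng = rng

    def step(self, r1: bool, r2: bool) -> list:
        req = [r1, r2]
        for i in range(2):
            if not req[i]:
                self.grant[i] = False          # grant released with its request
        if not any(self.grant):
            pending = [i for i in range(2) if req[i]]
            if pending:
                # first requester wins; two pending requests model a tie
                self.grant[self.rng.choice(pending)] = True
        return list(self.grant)

m = Mutex(random.Random(0))
print(m.step(True, False))   # [True, False]  R1 first -> G1
print(m.step(True, True))    # [True, False]  R2 must wait while G1 is held
print(m.step(False, True))   # [False, True]  R1 withdrawn -> G2
```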

Slide36

Arbiter: Circuit

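[Figure: MUTEX element: an SR latch with request inputs R1, R2 and outputs G1', G2', followed by a "metastability filter" (e.g., a high-threshold inverter) that produces the grants G1, G2; plot of the latch output V_out,FF over time t against the metastable level V_meta and the inverter threshold V_th,inv]
[from D. J. Kinniment, "Synchronization and Arbitration in Digital Systems", Wiley]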

Slide37

Arbiter: Operation

[Figure: timing diagram of R1, R2, G1', G2', G1 and G2 during arbitration]

Slide38

Muller C-Element

[Figure: symbol of the C-element (inputs a, b; output y) and an implementation based on an RS latch with set and reset logic]
IF a = b THEN y = a ELSE hold y
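A behavioural Python model of the C-element rule (a sketch; names chosen here). The output changes only after both inputs have changed, which is why the C-element serves as the "wait for both" element in handshake circuits:

```python
class CElement:
    """Muller C-element: output follows the inputs when they agree, otherwise holds."""

    def __init__(self, y: int = 0):
        self.y = y  # stored output state

    def update(self, a: int, b: int) -> int:
        if a == b:      # IF a = b THEN y = a
            self.y = a
        return self.y   # ELSE hold y

c = CElement()
trace = [c.update(a, b) for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]]
print(trace)  # [0, 0, 1, 1, 0]
```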

Slide39

Muller C-Element: Circuit

[Figure: transistor-level circuit of the C-element] [Alan Martin, Caltech]

Slide40

Data-Driven Clocking

Principle:
○ as soon as new data arrive => start clocking
○ determine number k of clock cycles required to process new data
○ stop clocking after k cycles, wait for next data
Properties:
○ need to switch clock on and off => beware spurious clock pulses!
○ no metastability problem: data stable as soon as consumer clock starts
○ potential for power saving
○ useful for specific applications only (no pipe!)
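The principle can be sketched as a model that turns data arrival times into clock edges. The cycle count k, the fixed period, and the queuing of data that arrive while a burst is still running are assumptions made for this example; the function name is hypothetical:

```python
def data_driven_clock(arrivals, k, period):
    """
    Hypothetical model of data-driven clocking: for each data arrival time,
    emit k rising-edge times spaced by `period`, the first one `period`
    after the data arrived (the data is already stable at the first edge).
    Data arriving while a burst is running is handled after that burst.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    edges = []
    t_free = 0.0  # time at which the clock generator is idle again
    for t_arr in sorted(arrivals):
        start = max(t_arr, t_free)
        burst = [start + (i + 1) * period for i in range(k)]
        edges.extend(burst)
        t_free = burst[-1]
    return edges

print(data_driven_clock([0, 10], k=3, period=1))  # [1, 2, 3, 11, 12, 13]
```

No edges are produced between bursts, which is where the power saving comes from.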

Slide41

Data-Driven Clock: Circuit / 1

[Figure: clock generator loop with delay element D; output CLK out]
CLK half period determined by D

Slide42

Data-Driven Clock: Circuit / 2

[Figure: Muller C-element (C) with inputs REQ and ACK and delay element D in the loop; output CLK out]
transition on REQ answered by transition on CLK out
min CLK half period determined by D

Slide43

Pausible Clocking

Principle:
○ producer requests consumer's clock to pause
○ data provided to input register during idle time
○ consumer's clock may resume
  - free running ("pausible clock")
  - with one cycle only ("stoppable clock")
Properties:
○ need to switch clock on and off
  => beware spurious clock pulses!
  => beware of clock tree delays!
○ producer controls consumer's clock (blocking!)
○ applications must cope with paused clock

Slide44

Pausible Clock: Circuit / 1

[Figure: C-element (C) and delay D in a loop; an inverter feeds ACK back as the next REQ; output CLK out]
inverter generates next REQ from ACK => self-oscillation

Slide45

Pausible Clock: Circuit / 2

[Figure: self-oscillating loop (C-element C, delay D, output CLK out) with an arbiter (Arb) handling the external handshake REQ'/ACK']
external unit can safely stop CLK by activating REQ'
… and gets ACK' as a response

Slide46

Pausible Clock: Circuit / 3

[Figure: self-oscillating loop with arbiters (Arb) for external handshakes REQ1/ACK1 … REQn/ACKn; output CLK out]
for more external sources, arbiters can be added and "ANDed" before the Muller C-element
the two inverters can be eliminated by using a Muller C-element with inverting output

Slide47

Advantages of GALS

synchronous islands can be designed efficiently
modules operate independently
can use module-specific clock & timing
clocking is no single point of failure

Slide48

Problems with GALS

operation of modules not (inherently) co-ordinated
- synchrony for communication, but not on system / algorithm level
communication has to cross clock boundaries
- potential for metastability
  => performance penalty through synchronizers OR
  => module must handle irregular clocking

Slide49

The DARTS Idea

[Figure: tick synchronisation placed between phase synchronisation and clock synchronisation]
DARTS: Distributed Algorithms for Robust Tick Synchronization

Slide50

The DARTS Approach

Concept:
Multiple synchronized tick generators
Method:
Distributed algorithm for fault-tolerant tick generation, implemented in (asynchronous) digital logic
Advantages:
- No crystal oscillator(s)
- No critical clock tree
- Clock is no single point of failure!
- Reasonable synchrony

Slide51

The DARTS Principle

51

Every function unit Fu_i is augmented with a simple local clock unit (TG-Alg)
TG-Algs communicate over a dedicated TG-Net to generate tick-synchronized local clock signals
Up to f TG-Algs can be Byzantine faulty => need n ≥ 3f + 2 TG-Algs
[Figure: function units Fu1, Fu2, Fu3 on a data bus; standard synchronous clocking with a single clock tree compared with DARTS clocks generated by TG-Algs communicating over the TG-Net]
Formally proven synchronization properties

Slide52

A Comparison

[Figure: clocks (clk) of Fu1 and Fu2 with ticks tick(3) and tick(4) for each clocking scheme]
synchronous SoC: global synchrony (< 1 tick); single point of failure
DARTS: global synchrony (potentially 1 tick); no single point of failure
GALS: NO (inherent) global synchrony; no single point of failure

Slide53

The Distributed Algorithm

Initially:
  send tick(0) to all; clock := 0;
"Relay Rule"
  If received tick(m) from at least f+1 remote nodes and m > clock:
    send tick(clock+1), …, tick(m) to all [once]; clock := m;
"Increment Rule"
  If received tick(m) from at least 2f+1 remote nodes and m >= clock:
    send tick(m+1) to all [once]; clock := m+1;
[Srikanth & Toueg, 87]
[Figure: TG-Alg 1 … TG-Alg 6 interconnected via the TG-Net]
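The two rules can be exercised in a message-level Python simulation. This is a simplified sketch, not the DARTS hardware: faulty nodes are modelled as silent (a special case of Byzantine behaviour), message delays are replaced by a random delivery order, and all names are hypothetical. It uses n = 5 TG-Algs and f = 1, the smallest configuration satisfying n ≥ 3f + 2:

```python
import random
from collections import defaultdict

def simulate(n, f, faulty, steps, seed=0):
    """
    Message-level sketch of the tick algorithm (hypothetical model).
    Faulty nodes are modelled as silent (crashed), a special case of Byzantine.
    Messages are delivered one at a time in random order (unknown delays).
    Returns the final clocks of correct nodes and the largest clock spread seen.
    """
    assert n >= 3 * f + 2, "need n >= 3f + 2 TG-Algs"
    rng = random.Random(seed)
    correct = [i for i in range(n) if i not in faulty]
    clock = {i: 0 for i in correct}
    sent = {i: set() for i in correct}            # ticks already sent ("[once]")
    got = {i: defaultdict(set) for i in correct}  # got[j][m]: senders of tick(m) seen by j
    in_flight = []                                # pending (sender, receiver, m)

    def send(i, m):
        if m not in sent[i]:
            sent[i].add(m)
            in_flight.extend((i, j, m) for j in correct if j != i)

    for i in correct:
        send(i, 0)                                # Initially: send tick(0); clock := 0

    spread = 0
    for _ in range(steps):
        if not in_flight:
            break
        sender, j, m = in_flight.pop(rng.randrange(len(in_flight)))
        got[j][m].add(sender)
        fired = True
        while fired:                              # apply the rules until none fires
            fired = False
            for mm, senders in list(got[j].items()):
                if len(senders) >= f + 1 and mm > clock[j]:        # Relay Rule
                    for t in range(clock[j] + 1, mm + 1):
                        send(j, t)
                    clock[j] = mm
                    fired = True
                if len(senders) >= 2 * f + 1 and mm >= clock[j]:   # Increment Rule
                    send(j, mm + 1)
                    clock[j] = mm + 1
                    fired = True
        spread = max(spread, max(clock.values()) - min(clock.values()))
    return clock, spread

clocks, spread = simulate(n=5, f=1, faulty={4}, steps=2000)
print(clocks, spread)
```

In this setting a correct node needs tick(m) from all three other correct nodes before the Increment Rule fires, and the Relay Rule never takes a node beyond a tick that others have already sent, so the spread between correct clocks stays at most one tick. The sketch does not cover genuinely Byzantine nodes.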

Slide54

Implementation Challenges

Algorithm as on Slide53, annotated with the implementation challenges:
- software-based algorithm => glitch-free asynchronous implementation
- k-bit messages (k unbounded) => replacement by zero-bit messages (k-bit msg vs. zero-bit tick)
- atomicity of actions => to be ensured by the architecture and delay constraints
- thresholds => functions for fault tolerance

Slide55

The DARTS Prototype

ASIC design:
- radhard 180 nm technology
- 2 designs: flexible, fast
Prototype board:
- 8 chips plus fixed & programmable interconnect

Slide56

Proof of Concept


Slide57

Frequency Stability (Warm-up)


Slide58

Frequency Stability (detail)


Slide59

DARTS – General Properties

Fully asynchronous implementation => NO oscillators
Tolerates up to three Byzantine faulty nodes (configurable number of TG-Algs: 5 to 12)
Adapts to operating conditions (asynchronous logic)

Slide60

Still Room for Improvements

Transient faults are permanently stored in the elastic pipelines
No on-the-fly integration of TG-Algs
Relatively low clock speed
Interfacing to traditional synchronous designs
Scaling with number of faults is costly

Slide61

Summary: Trends & Needs

Progressing miniaturization necessitates fault tolerance
Co-ordination of activities is fundamental, thus tight synchrony is a desirable feature on all levels
SoCs are large modular designs on a single die

Slide62

Summary: SoC Clocking

globally synchronous clock:
+ ideal synchrony, efficient in design & implementation
- isochrony unrealistic, single point of failure
DARTS clock:
+ best attainable global synchrony, adaptive timing, FT
- high implementation efforts, frequency not stable
GALS:
+ uses best of synchronous & asynchronous design, independent & module-specific clocks
- no global synchrony, metastability issues
asynchronous design:
+ power-efficient, robust against faults & PVT
- high overheads, difficult to design, timing hard to predict

Slide63

More information on DARTS

http://ti.tuwien.ac.at/ecs/research/projects/darts

