
ECE/CS 552: Nanophotonics - PowerPoint Presentation

Uploaded On 2018-03-16



Presentation Transcript

Slide1

ECE/CS 552: Nanophotonics

Instructor: Mikko H. Lipasti

Fall 2010

University of Wisconsin-Madison

Optional lecture, just for “fun”

Slide2

Good News

Technology advances at astounding rate

19th century: attempts to build mechanical computers

Early 20th century: mechanical counting systems (cash registers, etc.)

Mid 20th century: vacuum tubes as switches

Since: transistors, integrated circuits

1965: Moore’s law [Gordon Moore]

Predicted doubling of IC capacity every 18 months

Drives functionality, performance, cost

Exponential improvement for 40+ years

Built on Von Neumann model (fetch/execute)

Slide3

Distributed processing on chip

Future chips rely on distributed processing

Many computation/cache/DRAM/IO nodes

Placement, topology, core uarch/strength, tbd

Conventional interconnects may not suffice

Buses not viable

Crossbars are slow, power-hungry, expensive

NOCs impose latency, power overhead

Nanophotonics to the rescue

Communicate with photons

Inherent bandwidth, latency, energy advantages

Silicon integration becoming a reality

Challenges & opportunities remain

Slide4

Si Photonics: How it works

Laser: off-chip power [Koch ‘07]

Waveguide: optical wire, 0.5 μm [Intel]

Ring resonator: wavelength magnet, ~3.5 μm [HP], shown OFF and ON

Slide5

Ring Resonators

[Figure: ring resonator states]

OFF

ON: Diverting

ON: Diverting + Detecting

ON: Injecting

Slide6

Key attributes of Si photonics

Very low latency, very high bandwidth

Up to 1000x energy efficiency gain

Challenges

Resonator thermal tuning: heaters

Integration, fabrication, is this real?

Opportunities

Static power dominant (laser, thermal)

Destructive reads: fast wired-OR

Slide7

© Hill, Lipasti

Nanophotonics

Nanophotonics overview

Sharing the nanophotonic channel

Light-speed arbitration [MICRO 09]

Utilizing the nanophotonic channel

Atomic coherence [HPCA 11]

Slide8

Corona substrate [ISCA08]

Targeting Year 2017

Logically a ring topology

One concentric ring per node

3D stacked: optical, analog, digital

Slide9

Multiple writer single reader (MWSR) interconnects

Arbitration prevents corruption of in-flight data

Latchless/wave-pipelined

Slide10

Motivating an optical arbitration solution

MWSR arbiter must be:

Global: many writers requesting access

Very fast: otherwise it becomes a bottleneck

Optical arbiter avoids OEO conversion delays, provides light-speed arbitration

Slide11

Proposed optical protocols

Token-based protocols

Inspired by classic token ring

Token == transmission rights

Fits well with ring-shaped interconnect

Distributed, scalable (limited to ring)

Slide12

Baseline

Based on traditional token protocols

Repeat token at each node

But data is not repeated!

Poor utilization

Interconnect bubble (grows linearly with # of non-requesters)

Slide13

Optical arbitration basics

[Figure: waveguide, ring resonators, and power; token operations shown: inject, seize, pass]

No repeat!
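A minimal sketch of the contrast with the baseline protocol, under assumptions that are not in the slides: nodes are evenly spaced on a unidirectional ring, `hop_ps` is a hypothetical time of flight between neighbouring nodes, and `repeat_ps` is a hypothetical delay added by each intermediate node that has to detect and re-inject the token. All numbers are illustrative only.

```python
def token_latency_ps(src, dst, n_nodes, hop_ps, repeat_ps=0.0):
    """Time for a token released at node `src` to reach requester `dst`.

    Assumptions (not from the slides): evenly spaced nodes on a
    unidirectional ring; `hop_ps` is the time of flight between neighbours;
    `repeat_ps` is the extra delay each intermediate node adds when it
    must repeat the token (zero when the token is not repeated).
    """
    hops = (dst - src) % n_nodes or n_nodes  # a full lap if src == dst
    intermediate = hops - 1                  # non-requesters passed on the way
    return hops * hop_ps + intermediate * repeat_ps


# Hypothetical parameters, chosen only for illustration.
N_NODES, HOP_PS, REPEAT_PS = 64, 25.0, 200.0

baseline = token_latency_ps(0, 8, N_NODES, HOP_PS, REPEAT_PS)  # token repeated at each node
optical = token_latency_ps(0, 8, N_NODES, HOP_PS)              # no repeat: flight time only
```

In this toy model the baseline latency grows with the number of non-requesting nodes between the two requesters, while the no-repeat case depends only on the distance travelled.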

Token latency bounded by the time of flight between requesters.

Slide14

Arbitration solutions

Token Channel: single token / serial writes. Token passing allows token to pace transmission tail (no bubbles)

Token Slot: multiple tokens / simultaneous writes. Token passing allows token to directly precede slot

Slide15

Flow control and fairness

Flow Control:

Use token refresh as opportunity to encode flow control information (credits available)

Arbitration winners decrement credit count

Fairness:

Upstream nodes get first shot at tokens

Need mechanism to prevent starvation of downstream nodes

Slide16

Results - Performance

[Figure panels: Uniform, HotSpot]

Token Slot benefits from the availability of multiple tokens (multiple writers) and the fast turn-around time of the flow-control mechanism

Slide17

Results - Latency

Token Slot has the lowest latency and saturates at 80%+ load

[Figure panels: Uniform, HotSpot]

Slide18

Optical arbitration summary

Arbitration speed has to match transfer speed for fine-grained communication

Arbiter has to be optical

High throughput is achievable

85+% for token slot

Limited to simple topologies (MWSR)

Implementation challenges

Opt-elec-logic-elec-opt in 200 ps (@ 5 GHz)

Slide19

Nanophotonics

Nanophotonics overview

Sharing the nanophotonic channel

Light-speed arbitration [MICRO 09]

Utilizing the nanophotonic channel

Atomic coherence [HPCA 11]

Slide20

What makes coherence hard?

Unordered interconnects

Split transaction buses, meshes, etc.

Speculation

Sharer-prediction, speculative data use, etc.

Multiple initiators of coherence requests

L1-to-L2, Directory Caches, Coherence Domains, etc.

State-event pair explosion

Verification headache

Slide21

Example: MSI (SGI-Origin-like, directory, invalidate)

Stable States

Slide22

Example: MSI (SGI-Origin-like, directory, invalidate)

Stable States

Busy States

Slide23

Example: MSI (SGI-Origin-like, directory, invalidate)

Stable States

Busy States

Races: “unexpected” events from concurrent requests to same block

Slide24

Cache coherence complexity

L2 MOETSI Transitions [Lepak Thesis, ‘03]

Slide25

Cache coherence verification headache

Papers: So Many States, So Little Time: Verifying Memory Coherence in the Cray X1

Formal Methods: e.g. Leslie Lamport’s TLA+ specification language @ Intel

Intel Core 2 Duo Errata: AI39. Cache Data Access Request from One Core Hitting a Modified Line in the L1 Data Cache of the Other Core May Cause Unpredictable System Behavior

Complex Protocol = Complex Verification

Simple Protocol = Simple Verification

Slide26

Atomic Coherence: Simplicity

[Figure panels: w/ races, w/o races]

Slide27

Race resolution

Cause: Concurrently active coherence requests to block A

Remedy: Only allow one coherence request to block A to be active at a time.
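The remedy can be sketched as a per-block mutex that a core must win before starting a coherence request. This is a toy software model, not the lecture's hardware mechanism: the class name `BlockMutexTable` and its methods are hypothetical, introduced only for illustration.

```python
class BlockMutexTable:
    """Toy model: at most one active coherence request per block.

    Hypothetical helper for illustration; not an API from the slides.
    """

    def __init__(self):
        self._holder = {}  # block address -> id of the core holding its mutex

    def try_acquire(self, block, core):
        """Grant the block's mutex to `core` if no request to it is active."""
        if block in self._holder:
            return False
        self._holder[block] = core
        return True

    def release(self, block, core):
        """Release the mutex once the coherence request has completed."""
        if self._holder.get(block) != core:
            raise ValueError("core does not hold this block's mutex")
        del self._holder[block]


table = BlockMutexTable()
log = []
log.append(table.try_acquire("A", core=0))  # Core 0 starts a request to A
log.append(table.try_acquire("A", core=1))  # Core 1 is refused: A is busy
log.append(table.try_acquire("B", core=1))  # requests to other blocks proceed
table.release("A", core=0)
log.append(table.try_acquire("A", core=1))  # now Core 1 may proceed
```

Because only one request per block is ever in flight, the protocol never has to handle an “unexpected” event caused by a concurrent request to the same block.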

[Figure: Core 0 and Core 1, each with a cache ($), and block A at both]

Slide28

Race resolution

[Figure: Core 0 and Core 1, each with a cache ($), and block A, above an Atomic Substrate and a Coherence Substrate]

Slide29

Race resolution

Atomic Substrate

Coherence Substrate

- Atomic Substrate is on critical path

+ Can optimize substrates separately

Slide30

Atomic & Coherence Substrates

Coherence Substrate (Add speculation to a traditional protocol)

Atomic Substrate (Apply Fancy Nanophotonics Here)

[Figure labels: aggressive, aggressive]

Slide31

Mutexes circulate on ring

Single out mutex: hash(addr X) → λ Y @ cycle Z

Slide32

Mutex acquire

[Figure labels: Requesting Mutex, Requesting Mutex, Won Mutex]

Exploits OFF-resonance rings: mutex passes P1, P2 uninterrupted

Slide33

Mutex release

[Figure labels: Requesting Mutex, Won Mutex, Release Mutex, Requesting Mutex, Won Mutex]

Slide34

Mutexes on ring

[Figure: ring with detectors and injectors, 2 cm scale]

1 mutex = 200 ps = ~2 cm = 1 cycle @ 5 GHz
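The figures on this slide can be checked with a few lines of arithmetic. The clock rate and the ~2 cm length come from the slide; the propagation speed and group index computed below are only what those two numbers imply and are not stated in the lecture.

```python
CLOCK_HZ = 5e9         # 5 GHz, from the slide
SLOT_LENGTH_M = 0.02   # ~2 cm of waveguide per mutex, from the slide

cycle_s = 1.0 / CLOCK_HZ                 # duration of one clock cycle
cycle_ps = cycle_s * 1e12                # the same duration in picoseconds
implied_speed = SLOT_LENGTH_M / cycle_s  # propagation speed implied by the slide

C_VACUUM = 299_792_458.0                 # speed of light in vacuum, m/s
implied_group_index = C_VACUUM / implied_speed

print(f"cycle = {cycle_ps:.0f} ps")
print(f"implied speed = {implied_speed:.2e} m/s")
print(f"implied group index = {implied_group_index:.2f}")
```

One cycle at 5 GHz is 200 ps, and covering about 2 cm in that time corresponds to light travelling at roughly one third of its vacuum speed in the waveguide.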

# Mutex

Latency to:

seize free mutex: ≤ 4 cycles

tune ring resonator: < 1 cycle

Slide35

Atomic Coherence: Complexity

[Figure panels: Static; Dynamic (random tester)]

* Atomic Coherence reduces complexity

Slide36

Performance

(128 in-order cores, optical data interconnect, MOEFSI directory)

[Figure: slowdown relative to non-atomic MOEFSI; label: coherence agnostic]

What is causing the slowdown?

Slide37

Optimizing coherence

Owned (O) and Forward (F) state: responsible for satisfying on-chip read misses

Opportunity: Try to keep O/F alive

If O (or F) block evicted: While mutex is held, ‘shift’ O/F state to sharer

Observation: Holding block B’s mutex gives the holder free rein over coherence activity related to block B

(or hand-off responsibility)

Slide38

Optimizing coherence

If O (or F) block evicted: ‘Shift’ O/F state to sharer
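A rough sketch of the shift, assuming a hypothetical directory record with a `supplier` field (the core currently in O or F state) and a `sharers` set. The record layout, the `evict` function, and the choice of which sharer inherits the state are illustrative assumptions, not details from the slides or the HPCA-11 paper.

```python
def evict(entry, core):
    """Evict `core`'s copy; if it held O/F, shift that role to a sharer.

    `entry` is a hypothetical directory record:
    {"supplier": id of the core in O/F state (or None), "sharers": set of ids}.
    The caller is assumed to hold the block's mutex, so no other coherence
    request to this block can interleave with the update.
    """
    entry["sharers"].discard(core)
    if entry["supplier"] == core:
        # Hand the O/F role to a remaining sharer (lowest id: arbitrary choice).
        entry["supplier"] = min(entry["sharers"]) if entry["sharers"] else None
    return entry


entry = {"supplier": 2, "sharers": {2, 5, 7}}
evict(entry, 2)  # the O/F holder leaves: the role shifts to core 5
evict(entry, 7)  # a plain sharer leaves: the supplier is unchanged
```

Keeping some on-chip copy in O/F state means later read misses can still be satisfied on chip.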

[Figure panels: Complexity: # L2 transitions (b/c less variety in sharing possibilities); Performance: speedup relative to atomic MOEFSI]

Slide39

Atomic Coherence Summary

Nanophotonics as enabler

Very fast chip-wide consensus

Atomic Protocols are simpler protocols

And can have minimal cost to performance (w/ nanophotonics)

Opportunity for straightforward protocol enhancements: ShiftF

More details in HPCA-11 paper

Push protocol (update-like)

[Figure labels: races, coherence]

Slide40

Nanophotonics

Nanophotonics overview

Sharing the nanophotonic channel

Light-speed arbitration [MICRO 09]

Utilizing the nanophotonic channel

Atomic coherence [HPCA 11]