Presentation Transcript

COMP 740: Computer Architecture and Implementation

Montek Singh

Nov 14, 2016

Topic:

Intro to Multiprocessors and Thread-Level Parallelism

Outline

Motivation

Multiprocessors

SISD, SIMD, MIMD, and MISD

Memory organization

Communication mechanisms

Multithreading

Motivation

Instruction-Level Parallelism (ILP): what we have covered so far:

simple pipelining

dynamic scheduling: scoreboarding and Tomasulo's algorithm

dynamic branch prediction

multiple-issue architectures: superscalar, VLIW

hardware-based speculation

compiler techniques and software approaches

Bottom line: There just aren't enough instructions that can actually be executed in parallel!

instruction issue: limit on maximum issue count

branch prediction: imperfect

# registers: finite

functional units: limited in number

data dependencies: hard to detect dependencies via memory

So, What do we do?

Key Idea: Increase the number of running processes

multiple processes: at a given point in time

i.e., at the granularity of one (or a few) clock cycles

not sufficient to have multiple processes at the OS level!

Two Approaches:

multiple CPUs: each executing a distinct process

"Multiprocessors" or "Parallel Architectures"

single CPU: executing multiple processes ("threads")

"Multi-threading" or "Thread-level parallelism"
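As a toy sketch of the second approach (the names and numbers here are ours, not from the slides), two threads execute their own instruction streams within a single program, sharing its address space:

```python
import threading

# Shared state: both threads run in the same address space
results = {}

def task(name, n):
    # each thread executes its own stream of instructions
    results[name] = sum(range(n))

t1 = threading.Thread(target=task, args=("a", 10))
t2 = threading.Thread(target=task, args=("b", 5))
t1.start(); t2.start()
t1.join(); t2.join()
# results now holds {"a": 45, "b": 10}
```

Note that this shows software threads scheduled by the OS; the slides' point is that hardware-level multithreading switches at the granularity of clock cycles, far finer than an OS can.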

Taxonomy of Parallel Architectures

Flynn's Classification:

SISD: Single instruction stream, single data stream

uniprocessor

SIMD: Single instruction stream, multiple data streams

same instruction executed by multiple processors

each has its own data memory

Ex: multimedia processors, vector architectures

MISD: Multiple instruction streams, single data stream

successive functional units operate on the same stream of data

rarely found in general-purpose commercial designs

special-purpose stream processors (digital filters etc.)

MIMD: Multiple instruction stream, multiple data stream

each processor has its own instruction and data streams

most popular form of parallel processing

single-user: high-performance for one application

multiprogrammed: running many tasks simultaneously (e.g., servers)
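The SIMD/MIMD distinction can be sketched in plain Python (a toy analogy with names of our own choosing; real SIMD hardware executes one instruction across parallel data lanes, while real MIMD processors fetch independent instruction streams):

```python
from concurrent.futures import ThreadPoolExecutor

data = [0, 1, 2, 3]

# SIMD flavor: a single instruction stream ("multiply by 2") applied to many data items
simd_out = [x * 2 for x in data]

# MIMD flavor: independent instruction streams, each with its own data
def stream_a(x):
    return x + 100     # one processor's program

def stream_b(x):
    return x * x       # another processor's program

with ThreadPoolExecutor(max_workers=2) as pool:
    mimd_out = (pool.submit(stream_a, 3).result(),
                pool.submit(stream_b, 4).result())
# simd_out == [0, 2, 4, 6]; mimd_out == (103, 16)
```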

Multiprocessor: Memory Organization

Centralized, shared-memory multiprocessor:

usually few processors

share single memory & bus

use large caches

Multiprocessor: Memory Organization

Distributed-memory multiprocessor:

can support large processor counts

cost-effective way to scale memory bandwidth

works well if most accesses are to local memory node

requires interconnection network

communication between processors becomes more complicated and slower

Multiprocessor: Hybrid Organization

Use a distributed-memory organization at the top level

Each node itself may be a shared-memory multiprocessor (2-8 processors)

Communication Mechanisms

Shared-Memory Communication

around for a long time, so well understood and standardized

memory-mapped

ease of programming when communication patterns are complex or dynamically varying

better use of bandwidth when items are small

Problem:

cache coherence harder

use "snoopy" and other protocols

Message-Passing Communication

simpler hardware, because caches need not be kept coherent across processors

communication is explicit, simpler to understand

focuses programmer attention on communication

synchronization: naturally associated with communication

fewer errors due to incorrect synchronization
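The two communication styles can be contrasted in a short sketch (all names here are ours; threads stand in for processors, and the lock stands in for the coherence/synchronization machinery that shared memory requires):

```python
import threading
import queue

# --- Shared-memory style: communicate through a shared variable ---
counter = 0
lock = threading.Lock()            # synchronization is the programmer's job

def add(n):
    global counter
    for _ in range(n):
        with lock:                 # without the lock, updates could be lost
            counter += 1

workers = [threading.Thread(target=add, args=(1000,)) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()

# --- Message-passing style: communicate only through explicit messages ---
q = queue.Queue()
received = []

def producer():
    for i in range(3):
        q.put(i)                   # explicit send
    q.put(None)                    # sentinel: no more messages

def consumer():
    while True:
        msg = q.get()              # explicit receive; blocks until data arrives
        if msg is None:
            break
        received.append(msg)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
# counter == 2000; received == [0, 1, 2]
```

In the message-passing half, synchronization comes for free with the blocking receive, illustrating the slide's point that it is "naturally associated with communication."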

Multi-threading

Performance Beyond a Single Thread

Motivation:

There is much higher natural parallelism in some applications, e.g., database or scientific workloads

Explicit Thread-Level Parallelism or Data-Level Parallelism

What is a Thread?

a "process" with its own instructions and data

a thread may be a process, part of a parallel program of multiple processes, or an independent program

each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute

What is Data-Level Parallelism?

Perform identical operations on lots of data
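A minimal sketch of data-level parallelism (values are ours; written sequentially here, whereas SIMD or vector hardware would perform these identical adds in parallel lanes):

```python
from array import array

# The same add is applied to every element pair; on SIMD hardware these
# operations would run in parallel lanes instead of one at a time.
a = array("d", [1.0, 2.0, 3.0, 4.0])
b = array("d", [10.0, 20.0, 30.0, 40.0])
c = array("d", (x + y for x, y in zip(a, b)))
# list(c) == [11.0, 22.0, 33.0, 44.0]
```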

Multithreading

Threads: multiple processes that share code and data (and much of their address space)

recently, the term has come to include processes that may run on different processors and even have disjoint address spaces, as long as they share the code

Multithreading: exploit thread-level parallelism within a processor

fine-grain multithreading

switch between threads on each instruction!

coarse-grain multithreading

switch to a different thread only if current thread has a costly stall

E.g., switch only on a level-2 cache miss

Thread-Level Parallelism (TLP)

ILP vs. TLP

ILP

exploits implicit parallel operations within a loop or straight-line code segment

TLP explicitly represented by the use of multiple threads of execution that are inherently parallel

each thread needs its own PC and its own register file

Goal: Use multiple instruction streams to improve

Throughput of computers that run many programs

Execution time of multi-threaded programs

TLP could be more cost-effective to exploit than ILP
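The per-thread state requirement above can be pictured with a toy model (all names are ours, not hardware terminology from the slides): each thread carries a private PC and register file, while memory remains shared.

```python
# Toy model of per-thread hardware state: each thread gets its own PC and
# register file, while memory remains shared by all threads.
class ThreadContext:
    def __init__(self):
        self.pc = 0               # private program counter
        self.regs = [0] * 8       # private register file

shared_memory = [0] * 16          # one memory, visible to all threads

t0, t1 = ThreadContext(), ThreadContext()
t0.regs[0] = 5                    # writes to t0's registers are private...
shared_memory[0] = 7              # ...but a store to memory is visible to every thread
```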

Multithreaded Execution

Multithreading: multiple threads share the functional units of one processor via overlapping

processor must duplicate independent state of each thread e.g., a separate copy of register file, a separate PC, and for running independent programs, a separate page table

memory shared through the virtual memory mechanisms, which already support multiple processes

HW for fast thread switch; much faster than a full process switch (100s to 1000s of clocks)

When to switch?

Alternate instruction per thread (fine grain)

When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)

Fine-Grain Multithreading

switch between threads on each instruction!

multiple threads executed in interleaved manner

interleaving is usually round-robin

CPU must be capable of switching threads on every cycle!

fast, frequent switches

main disadvantage:

slows down the execution of individual threads

that is, latency is traded off for better throughput

example: Sun's Niagara

Coarse-Grain Multithreading

switch only if current thread has a costly stall

E.g., level-2 cache miss

can accommodate slightly costlier switches

less likely to slow down an individual thread

a thread is switched "off" only when it has a costly stall

main disadvantage:

limited in ability to overcome throughput losses

shorter stalls are ignored, and there may be plenty of those

issues instructions from a single thread

every switch involves emptying and restarting the instruction pipeline

hence, better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time

example: IBM AS/400

Simultaneous Multithreading (SMT)

Example: newer Pentium processors with Hyperthreading

Key Idea: Exploit ILP across multiple threads!

i.e., convert thread-level parallelism into more ILP

exploit following features of modern processors:

multiple functional units

modern processors typically have more functional units available than a single thread can utilize

register renaming and dynamic scheduling

multiple instructions from independent threads can co-exist and co-execute!
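SMT's key idea (issue slots in one cycle filled from several threads) can be sketched with a toy issue model; the names are ours, and real SMT hardware selects ready instructions dynamically rather than from pre-built lists:

```python
def smt_issue(threads, width):
    """Each cycle fills up to `width` issue slots, drawing instructions from
    any thread that still has work (round-robin across threads)."""
    cycles = []
    while any(threads.values()):
        slots = []
        progress = True
        while len(slots) < width and progress:
            progress = False
            for instrs in threads.values():
                if instrs and len(slots) < width:
                    slots.append(instrs.pop(0))
                    progress = True
        cycles.append(slots)
    return cycles

cycles = smt_issue({"A": ["a0", "a1", "a2"], "B": ["b0"]}, width=2)
# cycles == [["a0", "b0"], ["a1", "a2"]]: cycle 1 fills its slots from both
# threads, and thread A fills both slots once B has run out of work
```

This is what distinguishes SMT from a superscalar running one thread: slots a single thread cannot fill are offered to other threads instead of going idle.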

Multithreaded Categories

[Figure: issue slots over time (one row per processor cycle) for Superscalar, Coarse-Grained, Fine-Grained, Multiprocessing, and Simultaneous Multithreading; shading distinguishes Threads 1-5 and idle slots]

SMT: Design Challenges

Dealing with a large register file

needed to hold multiple contexts

Maintaining low overhead on clock cycle

fast instruction issue: choosing what to issue

instruction commit: choosing what to commit

keeping cache conflicts within acceptable bounds

Example: Power 4

Single-threaded predecessor to Power 5.

8 execution units in the out-of-order engine; each may issue an instruction each cycle.

Power 4 vs. Power 5

[Figure: pipeline diagrams; to support two threads, Power 5 adds 2 fetch (PC) stages, 2 initial decodes, and 2 commits]

Power 5 Data Flow

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.

Power 5 Performance

On 8-processor IBM servers: single-threaded baseline with 8 threads vs. SMT with 16 threads

[Figure: per-benchmark speedup from SMT; note that only a few benchmarks show a performance loss]

Changes in Power 5 to Support SMT

Increased associativity of the L1 instruction cache and the instruction address translation buffers

Added per thread load and store queues

Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches

Added separate instruction prefetch and buffering per thread

Increased the number of virtual registers from 152 to 240

Increased the size of several issue queues

The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support