Slide 1: COMP 740: Computer Architecture and Implementation
Montek Singh
Nov 14, 2016
Topic: Intro to Multiprocessors and Thread-Level Parallelism
Slide 2: Outline
Motivation
Multiprocessors
  SISD, SIMD, MIMD, and MISD
  Memory organization
  Communication mechanisms
Multithreading
Slide 3: Motivation
Instruction-Level Parallelism (ILP): what we have covered so far:
  simple pipelining
  dynamic scheduling: scoreboarding and Tomasulo's algorithm
  dynamic branch prediction
  multiple-issue architectures: superscalar, VLIW
  hardware-based speculation
  compiler techniques and software approaches
Bottom line: there just aren't enough instructions that can actually be executed in parallel!
  instruction issue: limit on maximum issue count
  branch prediction: imperfect
  # registers: finite
  functional units: limited in number
  data dependencies: hard to detect dependencies via memory
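To make the last point concrete, here is a small hypothetical C++ fragment (not from the slides): if the hardware cannot prove that the pointers a and b refer to different arrays, the load in the second statement may depend on the store in the first, so the two statements cannot be overlapped.

    // Hypothetical example: memory aliasing limits ILP.
    // If a[i] and b[i] might refer to the same location, the hardware
    // (and the compiler) must execute these statements in program order.
    void scale_and_accumulate(float* a, float* b, int n) {
        for (int i = 0; i < n; ++i) {
            a[i] = a[i] * 2.0f;   // store through a
            b[i] = b[i] + a[i];   // load through b: may alias the store above
        }
    }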
Slide 4: So, What do we do?
Key Idea: increase the number of running processes
  multiple processes at a given "point" in time
    i.e., at the granularity of one (or a few) clock cycles
    not sufficient to have multiple processes at the OS level!
Two Approaches:
  multiple CPUs, each executing a distinct process: "Multiprocessors" or "Parallel Architectures"
  single CPU executing multiple processes ("threads"): "Multi-threading" or "Thread-Level Parallelism"
Slide 5: Taxonomy of Parallel Architectures
Flynn's Classification:
  SISD: Single instruction stream, single data stream
    uniprocessor
  SIMD: Single instruction stream, multiple data streams
    same instruction executed by multiple processors
    each has its own data memory
    Ex: multimedia processors, vector architectures
  MISD: Multiple instruction streams, single data stream
    successive functional units operate on the same stream of data
    rarely found in general-purpose commercial designs
    special-purpose stream processors (digital filters, etc.)
  MIMD: Multiple instruction streams, multiple data streams
    each processor has its own instruction and data streams
    most popular form of parallel processing
    single-user: high performance for one application
    multiprogrammed: running many tasks simultaneously (e.g., servers)
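As a concrete illustration of the SIMD category, here is a minimal C++ sketch (an addition of this write-up, not from the slides) using x86 SSE intrinsics: a single add instruction operates on four floats at once, i.e., one instruction stream driving multiple data streams.

    #include <immintrin.h>   // SSE intrinsics (assumes an x86 CPU with SSE)

    // One instruction stream, multiple data streams: _mm_add_ps issues a
    // single add that operates on four packed floats simultaneously.
    void add_arrays(const float* a, const float* b, float* out, int n) {
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);              // load 4 floats
            __m128 vb = _mm_loadu_ps(b + i);              // load 4 floats
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb));   // 4 adds, one instruction
        }
        for (; i < n; ++i)                                // scalar cleanup (SISD)
            out[i] = a[i] + b[i];
    }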
Slide 6: Multiprocessor: Memory Organization
Centralized, shared-memory multiprocessor:
  usually few processors
  share a single memory and bus
  use large caches
Slide 7: Multiprocessor: Memory Organization
Distributed-memory multiprocessor:
  can support large processor counts
  cost-effective way to scale memory bandwidth
  works well if most accesses are to the local memory node
  requires an interconnection network
  communication between processors becomes more complicated and slower
Slide 8: Multiprocessor: Hybrid Organization
Use a distributed-memory organization at the top level
Each node itself may be a shared-memory multiprocessor (2-8 processors)
Slide 9: Communication Mechanisms
Shared-Memory Communication
  around for a long time, so well understood and standardized
  memory-mapped
  ease of programming when communication patterns are complex or dynamically varying
  better use of bandwidth when items are small
  Problem: cache coherence is harder
    use "snooping" and other coherence protocols
Message-Passing Communication
  simpler hardware, because caches need not be kept coherent across processors
  communication is explicit, so it is simpler to understand
  focuses programmer attention on communication
  synchronization is naturally associated with communication
    fewer errors due to incorrect synchronization
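The two styles can be sketched side by side in C++ (illustrative only; the function names are made up for this sketch): shared-memory communication writes a location that another processor later reads, while message passing moves data through an explicit channel whose receive operation also provides the synchronization.

    #include <condition_variable>
    #include <mutex>
    #include <queue>

    // Shared-memory style: communicate by writing a location the other
    // processor reads; a lock (or the hardware coherence protocol) keeps it consistent.
    std::mutex m;
    int shared_value = 0;
    void producer_shared(int v) { std::lock_guard<std::mutex> g(m); shared_value = v; }
    int  consumer_shared()      { std::lock_guard<std::mutex> g(m); return shared_value; }

    // Message-passing style: communication is explicit, and synchronization
    // comes with the message (receive blocks until data arrives).
    std::queue<int> channel;
    std::condition_variable cv;
    std::mutex cm;
    void send(int v) {
        { std::lock_guard<std::mutex> g(cm); channel.push(v); }
        cv.notify_one();
    }
    int receive() {
        std::unique_lock<std::mutex> g(cm);
        cv.wait(g, [] { return !channel.empty(); });
        int v = channel.front(); channel.pop();
        return v;
    }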
Slide 10: Multi-threading
Slide 11: Performance Beyond a Single Thread
Motivation: there is much higher natural parallelism in some applications
  e.g., database or scientific workloads
  explicit Thread-Level Parallelism or Data-Level Parallelism
What is a thread?
  a process with its own instructions and data
  a thread may be a process, part of a parallel program of multiple processes, or an independent program
  each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
What is Data-Level Parallelism?
  performing identical operations on lots of data
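A minimal C++ sketch of these definitions (illustrative only): each spawned thread receives its own PC, registers, and stack from the runtime, all threads share the program's code and the vector data, and the parallelism is data-level because every thread performs the identical operation on its own slice.

    #include <functional>
    #include <thread>
    #include <vector>

    // Each thread runs the same code on its own chunk of the data;
    // the code and the vector are shared, but each thread has its own PC and stack.
    void scale_chunk(std::vector<double>& data, size_t begin, size_t end) {
        for (size_t i = begin; i < end; ++i)
            data[i] *= 2.0;
    }

    int main() {
        std::vector<double> data(1000000, 1.0);
        const size_t nthreads = 4;
        const size_t chunk = data.size() / nthreads;
        std::vector<std::thread> workers;
        for (size_t t = 0; t < nthreads; ++t)
            workers.emplace_back(scale_chunk, std::ref(data), t * chunk,
                                 (t + 1 == nthreads) ? data.size() : (t + 1) * chunk);
        for (auto& w : workers) w.join();   // wait for all threads to finish
    }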
Slide 12: Multithreading
Threads: multiple processes that share code and data (and much of their address space)
  recently, the term has come to include processes that may run on different processors and even have disjoint address spaces, as long as they share the code
Multithreading: exploit thread-level parallelism within a processor
  fine-grain multithreading: switch between threads on each instruction!
  coarse-grain multithreading: switch to a different thread only if the current thread has a costly stall
    e.g., switch only on a level-2 cache miss
Slide 13: Thread-Level Parallelism (TLP)
ILP vs. TLP
  ILP exploits implicit parallel operations within a loop or straight-line code segment
  TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel
    each thread needs its own PC and its own register file
Goal: use multiple instruction streams to improve
  throughput of computers that run many programs
  execution time of multi-threaded programs
TLP could be more cost-effective to exploit than ILP
Slide 14: Multithreaded Execution
Multithreading: multiple threads share the functional units of one processor via overlapping
  processor must duplicate the independent state of each thread: e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table
  memory is shared through the virtual memory mechanisms, which already support multiple processes
  HW support for fast thread switch; much faster than a full process switch, which takes 100s to 1000s of clocks
When to switch?
  alternate instructions per thread (fine grain)
  when a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
Slide 15: Fine-Grain Multithreading
Fine-grain multithreading
  switch between threads on each instruction!
  multiple threads executed in an interleaved manner
    interleaving is usually round-robin
  CPU must be capable of switching threads on every cycle!
    fast, frequent switches
  main disadvantage:
    slows down the execution of individual threads
    that is, latency is traded off for better throughput
  example: Sun's Niagara
Slide 16: Coarse-Grain Multithreading
Coarse-grain multithreading
  switch only if the current thread has a costly stall
    e.g., a level-2 cache miss
  can accommodate slightly costlier switches
  less likely to slow down an individual thread
    a thread is switched "off" only when it has a costly stall
  main disadvantage:
    limited in its ability to overcome throughput losses
      shorter stalls are ignored, and there may be plenty of those
    issues instructions from a single thread; every switch involves emptying and restarting the instruction pipeline
    hence, better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
  example: IBM AS/400
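A toy model of the two policies (a sketch with made-up numbers, not from the slides): the fine-grain scheduler rotates to the next ready thread every cycle at no cost, while the coarse-grain scheduler stays with one thread and pays a pipeline-refill penalty each time it switches on a long stall.

    #include <cstdio>

    // Toy single-issue model with two threads (illustrative numbers only).
    // A thread that "misses in the cache" cannot issue for STALL_CYCLES.
    const int STALL_CYCLES  = 20;
    const int REFILL_CYCLES = 5;   // pipeline refill cost of a coarse-grain switch

    struct Thread { long issued = 0; int stalled_until = 0; };

    long run(bool fine_grain, int total_cycles) {
        Thread t[2];
        int current = 0, refill_until = 0;
        long total = 0;
        for (int cycle = 0; cycle < total_cycles; ++cycle) {
            if (fine_grain) {
                current = 1 - current;                      // rotate every cycle, no cost
                if (t[current].stalled_until > cycle)       // skip a stalled thread
                    current = 1 - current;
            } else if (t[current].stalled_until > cycle) {  // coarse grain: switch on stall
                current = 1 - current;
                refill_until = cycle + REFILL_CYCLES;       // pipeline must refill
            }
            Thread& th = t[current];
            if (cycle >= refill_until && th.stalled_until <= cycle) {
                ++th.issued; ++total;
                if (th.issued % 25 == 0)                    // pretend every 25th instruction misses
                    th.stalled_until = cycle + STALL_CYCLES;
            }
        }
        return total;                                       // instructions issued = utilization
    }

    int main() {
        std::printf("fine-grain issued:   %ld\n", run(true,  100000));
        std::printf("coarse-grain issued: %ld\n", run(false, 100000));
    }

With a 20-cycle stall and a 5-cycle refill, the coarse-grain policy still pays off because refill time is much smaller than the stall, which is exactly the condition stated above.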
Slide 17: Simultaneous Multithreading (SMT)
Example: Pentium 4 with "Hyper-Threading"
Key Idea: exploit ILP across multiple threads!
  i.e., convert thread-level parallelism into more ILP
  exploit the following features of modern processors:
    multiple functional units
      modern processors typically have more functional units available than a single thread can utilize
    register renaming and dynamic scheduling
      multiple instructions from independent threads can co-exist and co-execute!
Slide 18: Multithreaded Categories
[Figure: issue slots over time (processor cycles) for Superscalar, Coarse-Grained, Fine-Grained, Multiprocessing, and Simultaneous Multithreading; slots are colored by thread (Threads 1-5) or marked as idle]
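The contrast the figure captures can also be expressed as a tiny issue-slot model (illustrative numbers only, not from the slides): a 4-wide superscalar can fill its slots only from one thread's ready instructions, whereas SMT fills the same slots from any thread in the same cycle.

    #include <algorithm>
    #include <cstdio>

    // Toy model of one cycle of a 4-wide issue stage.
    // Each thread can supply at most 'ready' independent instructions this cycle.
    const int ISSUE_WIDTH = 4;

    int issue_single_thread(int ready) {
        // Superscalar: only one thread's instructions may fill the slots.
        return std::min(ready, ISSUE_WIDTH);
    }

    int issue_smt(const int ready[], int nthreads) {
        // SMT: slots are filled from any thread with ready instructions.
        int used = 0;
        for (int t = 0; t < nthreads && used < ISSUE_WIDTH; ++t)
            used += std::min(ready[t], ISSUE_WIDTH - used);
        return used;
    }

    int main() {
        int ready[2] = {2, 3};   // thread 0 has 2 ready instructions, thread 1 has 3
        std::printf("single thread fills %d of %d slots\n",
                    issue_single_thread(ready[0]), ISSUE_WIDTH);
        std::printf("SMT fills %d of %d slots\n", issue_smt(ready, 2), ISSUE_WIDTH);
    }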
Slide 19: SMT: Design Challenges
Dealing with a large register file
  needed to hold multiple contexts
Maintaining low overhead on the clock cycle
  fast instruction issue: choosing what to issue
  instruction commit: choosing what to commit
Keeping cache conflicts within acceptable bounds
Slide 20: Example: Power4
Single-threaded predecessor to the Power5
8 execution units in the out-of-order engine; each may issue an instruction each cycle
Slide 21: Power4 vs. Power5
[Figure: Power4 and Power5 pipeline diagrams; to support two threads, the Power5 adds 2 fetch stages (2 PCs), 2 initial decodes, and 2 commits]
Slide 22: Power5 Data Flow
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to becoming a bottleneck
Slide 23: Power5 Performance
On 8-processor IBM servers
  single-threaded (ST) baseline with 8 threads
  SMT with 16 threads
Note: few applications show a performance loss
Slide 24: Changes in Power5 to Support SMT
Increased the associativity of the L1 instruction cache and the instruction address translation buffers
Added per-thread load and store queues
Increased the size of the L2 (1.92 MB vs. 1.44 MB) and L3 caches
Added separate instruction prefetch and buffering per thread
Increased the number of virtual registers from 152 to 240
Increased the size of several issue queues
The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support