
Slide1

Prepared 6/4/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

CUDA Lecture 1

Introduction to Massively Parallel Computing

Slide2

A quiet revolution and potential buildup:
- Computation: TFLOPs vs. 100 GFLOPs
- GPU in every PC – massive volume and potential impact

Introduction to Massively Parallel Processing – Slide 2

CPUs and GPUs

[Chart: peak floating-point performance over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3 GHz Core 2 Duo, 3 GHz Xeon Quad, Westmere)]

Slide3

Introduction to Massively Parallel Processing – Slide 3

Topic 1: The Demand for Computational Speed

Slide4

- Solve computation-intensive problems faster
  - Make infeasible problems feasible
  - Reduce design time
- Solve larger problems in the same amount of time
  - Improve an answer's precision
  - Reduce design time
- Gain competitive advantage

Introduction to Massively Parallel Processing – Slide 4

Why Faster Computers?

Slide5

Another factor: modern scientific method

Introduction to Massively Parallel Processing – Slide 5

Why Faster Computers?

[Diagram: the modern scientific method – Nature → Observation → Theory → Physical Experimentation and Numerical Simulation]

Slide6

A growing number of applications in science, engineering, business and medicine require computing speeds that cannot be achieved by today's conventional computers because:
- The calculations are too computationally intensive
- The problems are too complicated to model
- Laboratory experiments are too hard, expensive, slow or dangerous
Parallel processing is an approach that helps make these computations feasible.

Introduction to Massively Parallel Processing – Slide 6

The Need for Speed

Slide7

One example: grand challenge problems that cannot be solved in a reasonable amount of time with today's computers:
- Modeling large DNA structures
- Global weather forecasting
- Modeling the motion of astronomical bodies
Another example: real-time applications, which have time constraints. If the data are not processed within some time limit, the results become meaningless.

Introduction to Massively Parallel Processing – Slide 7

The Need for Speed

Slide8

The atmosphere is modeled by dividing it into three-dimensional cells, and the calculation for each cell is repeated many times to model the passage of time. For example:
- Suppose the whole global atmosphere is divided into cells of size 1 mile × 1 mile × 1 mile to a height of 10 miles (10 cells high) – about 5 × 10^8 cells.
- If each calculation requires 200 floating point operations, 10^11 floating point operations are necessary in one time step.

Introduction to Massively Parallel Processing – Slide 8

One Example: Weather Forecasting

Slide9

Example continued:
- To forecast the weather over 10 days using 10-minute intervals, a computer operating at 100 megaflops (10^8 floating point operations per second) would take 10^7 seconds, or over 100 days.
- To perform the calculation in 10 minutes would require a computer operating at 1.7 teraflops (1.7 × 10^12 floating point operations per second).

Introduction to Massively Parallel Processing – Slide 9

One Example: Weather Forecasting
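A quick check, using only the totals quoted on the slide above (total work = 10^7 s at 10^8 flop/s), of the rate needed to finish in 10 minutes:

\[
  \frac{10^{7}\,\text{s} \times 10^{8}\,\text{flop/s}}{600\,\text{s}}
  \approx 1.7 \times 10^{12}\ \text{flop/s} \approx 1.7\ \text{teraflops}
\]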

Slide10

Two main types:
- Parallel computing: using a computer with more than one processor to solve a problem
- Distributed computing: using more than one computer to solve a problem
Motives:
- Usually faster computation: the very simple idea that n computers operating simultaneously can achieve the result n times faster (it will not be n times faster, for various reasons)
- Other motives include fault tolerance, a larger amount of available memory, …

Introduction to Massively Parallel Processing – Slide 10

Concurrent Computing

Slide11

“... There is therefore nothing new in the idea of parallel programming, but its application to computers. The author cannot believe that there will be any insuperable difficulty in extending it to computers. It is not to be expected that the necessary programming techniques will be worked out overnight. Much experimenting remains to be done. After all, the techniques that are commonly used in programming today were only won at considerable toil several years ago. In fact the advent of parallel programming may do something to revive the pioneering spirit in programming which seems at the present to be degenerating into a rather dull and routine occupation …” - S. Gill, “Parallel Programming”, The Computer Journal, April 1958

Parallel Programming No New Concept

Introduction to Massively Parallel Processing – Slide 11

Slide12

Applications are typically written from scratch (or manually adapted from a sequential program) assuming a simple model, in a high-level language (e.g. C) with explicit parallel computing primitives. Which means:
- Components are difficult to reuse, so similar components are coded again and again
- Code is difficult to develop, maintain or debug
- We are not really developing a long-term software solution

Introduction to Massively Parallel Processing – Slide 12

The Problems of Parallel Programs

Slide13

Introduction to Massively Parallel Processing – Slide 13

Another Complication: Memory

[Diagram: the memory hierarchy – processors and registers at the highest level, then memory levels 1, 2 and 3, then main memory at the lowest level; higher levels are small, fast and expensive, lower levels are large, slow and cheap]

Slide14

Execution speed relies on exploiting data locality:
- Temporal locality: a data item just accessed is likely to be used again in the near future, so keep it in the cache
- Spatial locality: neighboring data is also likely to be used soon, so load it into the cache at the same time using a 'wide' bus (like a multi-lane motorway)
- From the programmer's point of view, all of this is handled automatically – good for simplicity but maybe not for best performance

Introduction to Massively Parallel Processing – Slide 14

Review of the Memory Hierarchy
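To make the spatial-locality point concrete, here is a minimal host-side C sketch (the array size and function names are illustrative, not from the slides). Both loops do the same work, but the row-by-row traversal of a row-major array touches consecutive addresses and uses each cache line fully, while the column-by-column traversal strides through memory and wastes most of each line.

#include <stddef.h>

#define N 1024

/* Row-major traversal: consecutive elements of a row are adjacent in
   memory, so each cache line that is loaded is fully used. */
double sum_row_major(const double a[N][N]) {
    double sum = 0.0;
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal of the same array: successive accesses are
   N*sizeof(double) bytes apart, so most of each cache line goes unused
   and this version is typically much slower. */
double sum_col_major(const double a[N][N]) {
    double sum = 0.0;
    for (size_t j = 0; j < N; ++j)
        for (size_t i = 0; i < N; ++i)
            sum += a[i][j];
    return sum;
}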

Slide15

Useful work (i.e. floating point operations) can only be done at the top of the hierarchy. So data stored lower down must be transferred to the registers before we can work on it, possibly displacing other data already there. This transfer between levels is slow, and it becomes the bottleneck in most computations: we spend more time moving data than doing useful work. The result: speed depends to a large extent on problem size.

Introduction to Massively Parallel Processing – Slide 15

The Problems of the Memory Hierarchy

Slide16

Good algorithm design then requires:
- Keeping active data at the top of the hierarchy as long as possible
- Minimizing movement between levels
- Having enough work to do at the top of the hierarchy to mask the transfer time at the lower levels
The more processors you have to work with, the larger the problem has to be to accomplish this.

Introduction to Massively Parallel Processing – Slide 16

The Problems of the Memory Hierarchy

Slide17

Introduction to Massively Parallel Processing – Slide 17

Topic 2: CPU Architecture

Slide18

Each cycle, the CPU takes data from registers, does an operation, and puts the result back. Load/store operations (between memory and registers) also take one cycle. The CPU can do different operations each cycle, and the output of one operation can be the input to the next. CPUs haven't been this simple for a long time!

Introduction to Massively Parallel Processing – Slide 18

Starting Point: von Neumann Processor

Slide19

“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year... Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer.” - Gordon E. Moore, co-founder of Intel, Electronics Magazine, 4/19/1965

Why? Moore’s Law

Introduction to Massively Parallel Processing – Slide 19

Photo credit: Wikipedia

Slide20

Translation: the number of transistors on an integrated circuit doubles every 1½–2 years.
Data credit: Wikipedia

Introduction to Massively Parallel Processing – Slide 20

Moore’s Law
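Stated as a formula (a restatement of the slide's claim, with T the doubling period):

\[
  N(t) = N_0 \cdot 2^{\,t/T}, \qquad T \approx 1.5\text{--}2\ \text{years}
\]

Over a decade this compounds to a factor of roughly 32 (doubling every 2 years) up to about 100 (doubling every 18 months).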

Slide21

As the chip size (amount of circuitry) continues to double every 18-24 months, the speed of a basic microprocessor also doubles over that time frame. This happens because, as we develop tricks, they are built into the newest generation of CPUs by the manufacturers.

Introduction to Massively Parallel Processing – Slide 21

Moore’s Law

Slide22

Instruction-level parallelism (ILP)
- Out-of-order execution, speculation, …
- Vanishing opportunities in a power-constrained world
Data-level parallelism
- Vector units, SIMD execution, …
- Increasing opportunities: SSE, AVX, Cell SPE, ClearSpeed, GPU, …
Thread-level parallelism
- Increasing opportunities: multithreading, multicore, …
- Intel Core 2, AMD Phenom, Sun Niagara, STI Cell, NVIDIA Fermi, …

Introduction to Massively Parallel Processing – Slide 22

So What To Do With All The Extra Transistors?

Slide23

Most processors have multiple pipelines for different tasks and can start a number of different operations each cycle. For example, consider the Intel Core architecture.

Modern Processor: Superscalar

Introduction to Massively Parallel Processing – Slide 23

Slide24

Each core in an Intel Core 2 Duo chip has:
- A 14-stage pipeline
- 3 integer units (ALUs)
- 1 floating-point addition unit (FPU)
- 1 floating-point multiplication unit (FPU); FP division is very slow
- 2 load/store units
In principle it is capable of producing 3 integer and 2 FP results per cycle.

Introduction to Massively Parallel Processing – Slide 24

Modern Processor: Superscalar

Slide25

Technical challenges:
- Compiler must extract the best performance, reordering instructions if necessary
- Out-of-order CPU execution to avoid delays waiting for reads/writes or earlier operations
- Branch prediction to minimize delays due to conditional branching (loops, if-then-else)
- Memory hierarchy to deliver data to registers fast enough to feed the processor
These all limit the number of pipelines that can be used and increase the chip complexity: about 90% of the Intel chip is devoted to control and data.

Introduction to Massively Parallel Processing – Slide 25

Modern Processor: Superscalar

Slide26

(Courtesy Mark Horowitz and Kevin Skadron)

Introduction to Massively Parallel Processing – Slide 26

Tradeoff: Performance vs. Power


Slide27

The CPU clock has been stuck at about 3 GHz since 2006 due to problems with power consumption (up to 130 W per chip). Chip circuitry is still doubling every 18-24 months:
- More on-chip memory and memory management units (MMUs)
- Specialized hardware (e.g. multimedia, encryption)
- Multi-core (multiple CPUs on one chip)
Thus the peak performance of a chip is still doubling every 18-24 months.

Introduction to Massively Parallel Processing – Slide 27

Current Trends

Slide28

- A handful of processors, each supporting ~1 hardware thread
- On-chip memory near the processors (cache, RAM or both)
- Shared global memory space (external DRAM)

Introduction to Massively Parallel Processing – Slide 28

A Generic Multicore Chip

[Diagram: a few processors, each with its own on-chip memory, sharing a global memory]

Slide29

- Many processors, each supporting many hardware threads
- On-chip memory near the processors (cache, RAM or both)
- Shared global memory space (external DRAM)

Introduction to Massively Parallel Processing – Slide 29

A Generic Manycore Chip

[Diagram: many processors, each with its own on-chip memory, sharing a global memory]

Slide30

- Four 3.33 GHz cores, each of which can run 2 threads
- Integrated MMU

Introduction to Massively Parallel Processing – Slide 30

Example: Intel Nehalem 4-core processor

Slide31

- Cannot continue to scale processor frequencies: no 10 GHz chips
- Cannot continue to increase power consumption: can't melt the chip
- Can continue to increase transistor density, as per Moore's Law

Introduction to Massively Parallel Processing – Slide 31

So The Days of Serial Performance Scaling are Over

Slide32

Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications":
- Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, virtual reality products, …
- These "super apps" represent and model a physical, concurrent world
Various granularities of parallelism exist, but…
- The programming model must not hinder parallel implementation
- Data delivery needs careful management

Introduction to Massively Parallel Processing – Slide 32

Future Applications Reflect a Concurrent World

Slide33

Computers no longer get faster, just wider. You must rethink your algorithms to be parallel! Data-parallel computing is the most scalable solution:
- Otherwise you must refactor your code for 2 cores, 4 cores, 8 cores, 16 cores, …
- You will always have more data than cores – build the computation around the data

Introduction to Massively Parallel Processing – Slide 33

The “New” Moore’s Law

Slide34

Traditional parallel architectures cover some super apps: DSP, GPU, network and scientific applications, …
The game is to grow mainstream architectures "out" or domain-specific architectures "in". CUDA is the latter.

Introduction to Massively Parallel Processing – Slide 34

Stretching Traditional Architectures

Slide35

Introduction to Massively Parallel Processing – Slide 35

Topic 3: Graphics Processing Units (GPUs)

Slide36

- Massive economies of scale
- Massively parallel

Introduction to Massively Parallel Processing – Slide 36

GPUs: The Big Development

Slide37

Produced in vast numbers for computer graphics, GPUs are increasingly being used for:
- Computer game "physics"
- Video (e.g. HD video decoding)
- Audio (e.g. MP3 encoding)
- Multimedia (e.g. Adobe software)
- Computational finance
- Oil and gas
- Medical imaging
- Computational science

Introduction to Massively Parallel Processing – Slide 37

GPUs: The Big Development

Slide38

The GPU sits on a PCIe graphics card inside a standard PC/server with one or two multicore CPUs

Introduction to Massively Parallel Processing – Slide 38

GPUs: The Big Development

Slide39

- Up to 448 cores on a single chip
- Simplified logic (no out-of-order execution, no branch prediction) means much more of the chip is devoted to computation
- Arranged as multiple units, each unit effectively a vector unit with all of its cores doing the same thing at the same time
- Very high bandwidth (up to 140 GB/s) to graphics memory (up to 4 GB)
- Not general purpose – for parallel applications like graphics and Monte Carlo simulations
- Can also build big clusters out of GPUs

Introduction to Massively Parallel Processing – Slide 39

GPUs: The Big Development

Slide40

Four major vendors:
- NVIDIA
- AMD: bought ATI several years ago
- IBM: co-developed the Cell processor with Sony and Toshiba for the Sony PlayStation, but has now dropped it for high-performance computing
- Intel: was developing the "Larrabee" chip as a GPU, but has now aimed it at high-performance computing

Introduction to Massively Parallel Processing – Slide 40

GPUs: The Big Development

Slide41

A quiet revolution and potential buildup:
- Computation: TFLOPs vs. 100 GFLOPs
- GPU in every PC – massive volume and potential impact

Introduction to Massively Parallel Processing – Slide 41

CPUs and GPUs

[Chart: peak floating-point performance over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3 GHz Core 2 Duo, 3 GHz Xeon Quad, Westmere)]

Slide42

A quiet revolution and potential buildup:
- Bandwidth: ~10×
- GPU in every PC – massive volume and potential impact

Introduction to Massively Parallel Processing – Slide 42

CPUs and GPUs

[Chart: memory bandwidth over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3 GHz Core 2 Duo, 3 GHz Xeon Quad, Westmere)]

Slide43

- High-throughput computation: GeForce GTX 280 delivers 933 GFLOPs/sec
- High-bandwidth memory: GeForce GTX 280 delivers 140 GB/sec
- High availability to all: 180M+ CUDA-capable GPUs in the wild

Introduction to Massively Parallel Processing – Slide 43

GPU Evolution

Slide44

Introduction to Massively Parallel Processing – Slide 44

GPU Evolution

[Timeline, 1995-2010: RIVA 128 (3M transistors), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), "Fermi" (3B)]

Slide45

- Throughput is paramount: must paint every pixel within the frame time
- Scalability: create, run and retire lots of threads very rapidly (measured 14.8 Gthreads/sec on an increment() kernel)
- Use multithreading to hide latency: one stalled thread is OK if 100 are ready to run

Introduction to Massively Parallel Processing – Slide 45

Lessons from the Graphics Pipeline

Slide46

Different goals produce different designs: the GPU assumes the workload is highly parallel, while the CPU must be good at everything, parallel or not.
CPU: minimize latency experienced by one thread
- Big on-chip caches, sophisticated control logic
GPU: maximize throughput of all threads
- Number of threads in flight is limited by resources, so provide lots of resources (registers, bandwidth, etc.)
- Multithreading can hide latency, so skip the big caches
- Share control logic across many threads

Introduction to Massively Parallel Processing – Slide 46

Why is this different from a CPU?

Slide47

GeForce 8800 (2007)

Introduction to Massively Parallel Processing – Slide 47

NVIDIA GPU Architecture

[Diagram: GeForce 8800 architecture – the host feeds an input assembler and thread execution manager; an array of streaming multiprocessors, each with a parallel data cache and texture units, connects through load/store units to global memory]

Slide48

- 16 highly threaded streaming multiprocessors (SMs), > 128 floating point units (FPUs)
- 367 GFLOPs peak performance (25-50 times that of current high-end microprocessors)
- 265 GFLOPs sustained for applications such as visual molecular dynamics (VMD)
- 768 MB DRAM, 86.4 GB/sec memory bandwidth
- 4 GB/sec bandwidth to the CPU

Introduction to Massively Parallel Processing – Slide 48

G80 Characteristics

Slide49

- Massively parallel: 128 cores, 90 watts
- Massively threaded: sustains 1000s of threads per application
- 30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics, …

Introduction to Massively Parallel Processing – Slide 49

G80 Characteristics

Slide50

Fermi GF100 (2010):
- ~1.5 TFLOPs (single precision), ~800 GFLOPs (double precision)
- 230 GB/sec DRAM bandwidth

Introduction to Massively Parallel Processing – Slide 50

NVIDIA GPU Architecture

Slide51

Introduction to Massively Parallel Processing – Slide 51

Streaming Multiprocessors

Slide52

32 CUDA cores per SM (512 total)
- 8× peak FP64 performance
- 50% of peak FP32 performance
Direct load/store to memory
- Usual linear sequence of bytes
- High bandwidth (hundreds of GB/sec)
64 KB of fast, on-chip RAM
- Software or hardware managed
- Shared amongst CUDA cores
- Enables thread communication

Introduction to Massively Parallel Processing – Slide 52

Streaming Multiprocessors (SMs)
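The fast on-chip RAM mentioned above is what CUDA exposes as __shared__ memory. A minimal sketch (the kernel name and block size are illustrative, not from the slides) of threads in one block communicating through it:

// Reverse a 256-element array within one thread block using the SM's
// on-chip shared memory. __syncthreads() makes every thread's write to
// the shared array visible before any thread reads it back.
__global__ void reverse_block(float *d_data)
{
    __shared__ float tile[256];      // lives in the SM's on-chip RAM

    int t = threadIdx.x;             // 0 .. 255 within this block
    tile[t] = d_data[t];             // stage the data on chip
    __syncthreads();                 // wait for the whole block

    d_data[t] = tile[255 - t];       // read back in reversed order
}

// Illustrative launch: one block of 256 threads.
// reverse_block<<<1, 256>>>(d_data);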

Slide53

SIMT (Single Instruction, Multiple Thread) execution:
- Threads run in groups of 32 called warps: a minimum of 32 threads all doing the same thing at (almost) the same time
- Threads in a warp share an instruction unit (IU): all cores execute the same instructions simultaneously, but with different data
- Similar to vector computing on CRAY supercomputers
- Hardware automatically handles divergence
This is natural for graphics processing and much scientific computing; it is also a natural choice for many-core chips, since it simplifies each core.

Introduction to Massively Parallel Processing – Slide 53

Key Architectural Ideas

Slide54

Hardware multithreading:
- Hardware resource allocation and thread scheduling
- Hardware relies on threads to hide latency
- Threads have all the resources needed to run
- Any warp not waiting for something can run
- Context switching is (basically) free

Introduction to Massively Parallel Processing – Slide 54

Key Architectural Ideas

Slide55

Lots of active threads is the key to high performance:
- No "context switching"; each thread has its own registers, which limits the number of active threads
- Threads on each SM execute in warps – execution alternates between "active" warps, with warps becoming temporarily "inactive" when waiting for data
- For each thread, one operation completes long before the next starts – this avoids the complexity of pipeline overlaps which can limit the performance of modern processors
- Memory access from device memory has a delay of 400-600 cycles; with 40 threads this is equivalent to 10-15 operations, so hopefully there's enough computation to hide the latency

Introduction to Massively Parallel Processing – Slide 55

Lessons from the Graphics Pipeline
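The last figure on the slide above is just the memory latency divided across the active threads:

\[
  \frac{400\text{--}600\ \text{cycles}}{40\ \text{threads}} \approx 10\text{--}15\ \text{cycles per thread}
\]

i.e. roughly 10-15 operations' worth of other work per thread is enough to hide the delay.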

Slide56

Intel Core 2 / Xeon / i7:
- 4-6 MIMD (Multiple Instruction / Multiple Data) cores
- Each core operates independently; each can be working with a different code, performing different operations with entirely different data
- Few registers, multilevel caches
- 10-30 GB/s bandwidth to main memory

Introduction to Massively Parallel Processing – Slide 56

Chip Comparison

Slide57

NVIDIA GTX 280:
- 240 cores, arranged as 30 units, each with 8 SIMD (Single Instruction / Multiple Data) cores
- All cores execute the same instruction at the same time, but work on different data
- Only one instruction decoder is needed to control all the cores; each unit functions like a vector unit
- Lots of registers, almost no cache
- 5 GB/s bandwidth to the host processor (PCIe x16 gen 2)
- 140 GB/s bandwidth to graphics memory

Introduction to Massively Parallel Processing – Slide 57

Chip Comparison

Slide58

One issue with SIMD / vector execution: what happens with if-then-else branches?
- Standard vector treatment: execute both branches, but only store the results for the correct branch
- NVIDIA treatment: works with warps of length 32; if a particular warp only goes one way, then that is all that is executed

Introduction to Massively Parallel Processing – Slide 58

Chip Comparison
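A minimal sketch of the branching situation just described (the kernel name and the particular branches are illustrative, not from the slides). If all 32 threads of a warp take the same side of the if, only that side is executed; otherwise the warp runs the two sides one after the other, with the non-participating threads masked off.

// Each thread processes one element and may take either branch.
__global__ void branch_demo(float *d_x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (d_x[i] < 0.0f)
            d_x[i] = -d_x[i];          // path A: negate
        else
            d_x[i] = 2.0f * d_x[i];    // path B: double
    }
}

// Illustrative launch for n elements with 256-thread blocks:
// branch_demo<<<(n + 255) / 256, 256>>>(d_x, n);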

Slide59

How do NVIDIA GPUs deal with the same architectural challenges as CPUs? They are very heavily multi-threaded, with at least 8 threads per core:
- This solves the pipeline problem, because the output of one calculation can be used as an input for the next calculation in the same thread
- The hardware switches from one set of threads to another when waiting for data from graphics memory

Introduction to Massively Parallel Processing – Slide 59

Chip Comparison

Slide60

Introduction to Massively Parallel Processing – Slide 60

Topic 4: Compute Unified Device Architecture (CUDA)

Slide61

- Scalable programming model
- Minimal extensions to a familiar C/C++ environment
- Heterogeneous serial-parallel computing

Introduction to Massively Parallel Processing – Slide 61

Enter CUDA

Slide62

Introduction to Massively Parallel Processing – Slide 62

Motivation

[Chart: reported GPU speedups over CPU for various applications – 110-240×, 45×, 100×, 35×, 17×, 13-457×]

Slide63

The big breakthrough in GPU computing has been NVIDIA's development of the CUDA programming environment:
- Initially driven by the needs of computer games developers
- Now being driven by new markets (e.g. HD video decoding)
- C plus some extensions and some C++ features (e.g. templates)

Introduction to Massively Parallel Processing – Slide 63

GPU Programming

Slide64

- Host code runs on the CPU, CUDA code runs on the GPU
- Explicit movement of data across the PCIe connection
- Very straightforward for Monte Carlo applications, once you have a random number generator
- Harder for finite difference applications

Introduction to Massively Parallel Processing – Slide 64

GPU Programming

Slide65

At the top level, we have a master process which runs on the CPU and performs the following steps:
- Initialize the card
- Allocate memory on the host and on the device
- Repeat as needed:
  - Copy data from host to device memory
  - Launch multiple copies of the execution "kernel" on the device
  - Copy data from device memory back to the host
- De-allocate all memory and terminate

Introduction to Massively Parallel Processing – Slide 65

Software View
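A minimal host-side sketch of those steps (the array size, names and kernel body are illustrative, not from the slides):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Illustrative kernel: each thread scales one element.
__global__ void scale(float *d_x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);                  // allocate on host
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x;
    cudaMalloc(&d_x, bytes);                              // allocate on device

    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // host -> device
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);        // launch the kernel
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // device -> host

    printf("h_x[0] = %f\n", h_x[0]);                      // expect 2.0

    cudaFree(d_x);                                        // de-allocate and terminate
    free(h_x);
    return 0;
}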

Slide66

At a lower level, within the GPU:
- Each copy of the execution kernel executes on an SM
- If the number of copies exceeds the number of SMs, then more than one will run at a time on each SM if there are enough registers and shared memory, and the others will wait in a queue and execute later
- All threads within one copy can access local shared memory, but can't see what the other copies are doing (even if they are on the same SM)
- There are no guarantees on the order in which the copies will execute

Introduction to Massively Parallel Processing – Slide 66

Software View

Slide67

Augment C/C++ with minimalist abstractions:
- Let programmers focus on parallel algorithms, not the mechanics of a parallel programming language
Provide a straightforward mapping onto hardware:
- Good fit to the GPU architecture
- Maps well to multi-core CPUs too
Scale to 100s of cores and 10,000s of parallel threads:
- GPU threads are lightweight – create/switch is free
- The GPU needs 1000s of threads for full utilization

Introduction to Massively Parallel Processing – Slide 67

CUDA: Scalable Parallel Programming

Slide68

- Hierarchy of concurrent threads
- Lightweight synchronization primitives
- Shared memory model for cooperating threads

Introduction to Massively Parallel Processing – Slide 68

Key Abstractions in CUDA

Slide69

- Parallel kernels are composed of many threads; all threads execute the same sequential program
- Threads are grouped into thread blocks; threads in the same block can cooperate
- Threads and blocks have unique IDs

Introduction to Massively Parallel Processing – Slide 69

Hierarchy of Concurrent Threads

[Diagram: a single thread t; a thread block b containing threads t0, t1, …, tB]
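A minimal sketch of how this hierarchy appears in code (function and variable names are illustrative, not from the slides): every thread runs the same kernel and uses its block and thread IDs to pick out its own piece of the data.

// Element-wise vector addition: the thread with IDs (blockIdx.x, threadIdx.x)
// computes exactly one output element.
__global__ void vec_add(const float *d_a, const float *d_b,
                        float *d_c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
    if (i < n)
        d_c[i] = d_a[i] + d_b[i];
}

// Illustrative launch: blocks of 128 threads, enough blocks to cover n.
// vec_add<<<(n + 127) / 128, 128>>>(d_a, d_b, d_c, n);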

Slide70

CUDA virtualizes the physical hardware:
- A thread is a virtualized scalar processor: registers, PC, state
- A block is a virtualized multiprocessor: threads, shared memory
Blocks are scheduled onto the physical hardware without pre-emption:
- Threads/blocks launch and run to completion
- Blocks should be independent

Introduction to Massively Parallel Processing – Slide 70

CUDA Model of Parallelism

Slide71

Introduction to Massively Parallel Processing – Slide 71

CUDA Model of Parallelism

[Diagram: many blocks, each with its own memory, above a shared global memory]

Slide72

Cf. the PRAM (Parallel Random Access Machine) model:
- Global synchronization isn't cheap
- Global memory access times are expensive

Introduction to Massively Parallel Processing – Slide 72

Not a Flat Multiprocessor

[Diagram: processors attached directly to a single flat global memory]

Slide73

Cf. the BSP (Bulk Synchronous Parallel) model and MPI:
- Distributed computing is a different setting

Introduction to Massively Parallel Processing – Slide 73

Not Distributed Processors

[Diagram: separate processor-memory pairs connected by an interconnection network]

Slide74

Multicore CPU + Manycore GPU

Introduction to Massively Parallel Processing – Slide 74

Is Heterogeneous Computing

Slide75

Philosophy: provide the minimal set of extensions necessary to expose power
- Function and variable qualifiers
- Execution configuration
- Built-in variables and functions valid in device code

Introduction to Massively Parallel Processing – Slide 75

C for CUDA
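A short sketch pulling those three kinds of extension together (the kernel and helper names are illustrative, not from the slides): function qualifiers (__global__, __device__), an execution configuration (<<<blocks, threads>>>), and built-in variables (blockIdx, blockDim, threadIdx) used inside device code.

// Function qualifier: __device__ marks a helper callable from device code.
__device__ float square(float x) { return x * x; }

// Function qualifier: __global__ marks a kernel launchable from the host.
__global__ void square_all(float *d_x, int n)
{
    // Built-in variables, valid in device code.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_x[i] = square(d_x[i]);
}

// Host-side launch with an execution configuration <<<blocks, threads>>>:
// square_all<<<(n + 255) / 256, 256>>>(d_x, n);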

Slide76

Introduction to Massively Parallel Processing – Slide 76

Topic 5: Case Studies

Slide77

Introduction to Massively Parallel Processing – Slide 77

Previous Projects, UIUC ECE 498AL

Application | Description | Source (lines) | Kernel (lines) | % time
H.264 | SPEC '06 version, change in guess vector | 34,811 | 194 | 35%
LBM | SPEC '06 version, change to single precision and print fewer reports | 1,481 | 285 | >99%
RC5-72 | Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99%
FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99%
RPES | Rye Polynomial Equation Solver, quantum chem, 2-electron repulsion | 1,104 | 281 | 99%
PNS | Petri Net simulation of a distributed system | 322 | 160 | >99%
SAXPY | Single-precision implementation of saxpy, used in Linpack's Gaussian elim. routine | 952 | 31 | >99%
TPACF | Two Point Angular Correlation Function | 536 | 98 | 96%
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16%
MRI-Q | Computing a matrix Q, a scanner's configuration in MRI reconstruction | 490 | 33 | >99%

Slide78

GeForce 8800 GTX vs. 2.2 GHz Opteron 248

Introduction to Massively Parallel Processing – Slide 78

Speedup of Applications

Slide79

- A 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
- 25× to 400× speedup if the function's data requirements and control flow suit the GPU and the application is optimized

Introduction to Massively Parallel Processing – Slide 79

Speedup of Applications

Slide80

- Parallel hardware is here to stay
- GPUs are massively parallel manycore processors: easily available and fully programmable
- Parallelism and scalability are crucial for success
- This presents many important research challenges, not to mention the educational challenges

Introduction to Massively Parallel Processing – Slide 80

Final Thoughts

Slide81

Reading: Chapter 1, "Programming Massively Parallel Processors" by Kirk and Hwu.
Based on original material from:
- The University of Akron: Tim O'Neil, Kathy Liszka
- Hiram College: Irena Lomonosov
- The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
- The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
- Oregon State University: Michael Quinn
- Oxford University: Mike Giles
- Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 6/4/2011.

Introduction to Massively Parallel Processing – Slide 81

End Credits


