Slide 2: The Manycore Shift: Making Parallel Computing Mainstream
Bart J.F. De Smet
bartde@microsoft.com
http://blogs.bartdesmet.net/bart
Software Development Engineer, Microsoft Corporation
Session Code: DTL206
Wishful thinking?
Slide 3: Agenda
The concurrency landscape
Language headaches
.NET 4.0 facilities
Task Parallel Library
PLINQ
Coordination Data Structures
Asynchronous programming
Incubation projects
Summary
Slide 4: Moore's law
"The number of transistors incorporated in a chip will approximately double every 24 months." Gordon Moore – Intel – 1965
Let’s sell processors
Slide 5: Moore's law today
"It can't continue forever. The nature of exponentials is that you push them out and eventually disaster happens." Gordon Moore – Intel – 2005
Let's sell even more processors
Slide 6: Hardware Paradigm Shift
"… we see a very significant shift in what architectures will look like in the future ... fundamentally the way we've begun to look at doing that is to move from instruction level concurrency to … multiple cores per die. But we're going to continue to go beyond there. And that just won't be in our server lines in the future; this will permeate every architecture that we build. All will have massively multicore implementations."
Pat Gelsinger, Chief Technology Officer, Senior Vice President, Intel Corporation
Intel Developer Forum, Spring 2004 (February 19, 2004)
[Chart: power density (W/cm²), log scale from 1 to 10,000, for Intel processors from the 4004 through the 8008, 8080, 8085, 8086, 286, 386, 486, and Pentium families, 1970–2010; reference lines mark "hot plate", "nuclear reactor", "rocket nozzle", and "sun's surface". Source: Intel Developer Forum, Spring 2004 – Pat Gelsinger]
[Chart: projected GOPS, 2004–2015, log scale from 16 to 32,768; many-core peak parallel GOPS diverges from single-threaded performance, which grows only ~10% per year, leaving an ~80x parallelism opportunity. In today's architecture, heat is becoming an unmanageable problem. To grow, to keep up, we must embrace parallel computing.]
Slide 7: Problem statement
Shared mutable state
Needs synchronization primitives
Locks are problematic
Risk of contention
Poor discoverability (SyncRoot anyone?)
Not composable
Difficult to get right (deadlocks, etc.)
Coarse-grained concurrency
Threads well-suited for large units of work
Expensive context switching
Asynchronous programming
Slide 8: What can go wrong?
Races
Deadlocks
Livelocks
Lock convoys
Cache coherency overheads
Lost event notifications
Broken serializability
Priority inversion
Slide 9: Microsoft Parallel Computing Initiative
Languages: VB, C#, F#
Constructing parallel applications
Executing fine-grain parallel applications
Coordinating system resources/services
Slide 10: Agenda
The concurrency landscape
Language headaches
.NET 4.0 facilities
Task Parallel Library
PLINQ
Coordination Data Structures
Asynchronous programming
Incubation projects
Summary
Slide 11: Languages: two extremes
[Spectrum: Fortran heritage (C, C++, C#, VB) with mutable state at one end; LISP heritage (Haskell, ML) with no mutable state, i.e. fundamentalist functional programming, at the other; F# sits in between]
Slide 12: Mutability
Mutable by default (C# et al) – synchronization required:

    int x = 5;
    // Share out x
    x++;

Immutable by default (F# et al) – no locking required:

    let x = 5
    // Share out x
    // Can't mutate x

Explicit opt-in to mutation:

    let mutable x = 5
    // Share out x
    x <- x + 1
Slide 13: Side-effects will kill you
Elimination of common sub-expressions? The runtime is out of control and can't optimize the code
Types don't reveal side-effects
Haskell's answer: the concept of the IO monad
Did you know? LINQ is a monad!
Because DateTime.Now (static DateTime Now { get; }) reads a different value on each access, these two expressions are not interchangeable, so the compiler may not rewrite one into the other:

    let now = DateTime.Now
    in (now, now)

    (DateTime.Now, DateTime.Now)

Source: www.cse.chalmers.se
Slide 14: Monads for dummies – Promote (Return)
Promote (Return) lifts a plain T into an IO<T>
Slide 15: Monads for dummies – Combine (Bind)
Combine (Bind) chains an IO<T> with a function T -> IO<R> into an IO<R>
Compare LINQ: IEnumerable<R> SelectMany(IEnumerable<T>, Func<T, IEnumerable<R>>)
Source: www.arcanux.org
Slide 16: Languages: two roadmaps?
Making C# better
Add safety nets? Immutability, purity constructs, linear types
Software Transactional Memory: kamikaze-style concurrency
Simplify common patterns
Making Haskell mainstream
Just right? Too academic? Not a smooth upgrade path?
[Diagram: C# and Haskell as two roads converging on "Nirvana"]
Slide 17: demo – Taming side-effects in F#
Bart J.F. De Smet, Software Development Engineer, Microsoft Corporation
Slide 18: Agenda
The concurrency landscape
Language headaches
.NET 4.0 facilities
Task Parallel Library
PLINQ
Coordination Data Structures
Asynchronous programming
Incubation projects
Summary
Slide 19: Parallel Extensions Architecture
[Diagram, roughly top-down:]
Compilers: the C#, VB, C++, F#, and other .NET compilers emit IL; a .NET program targets PLINQ, TPL, or CDS
PLINQ execution engine:
Query analysis (declarative queries)
Data partitioning: chunk, range, hash, striped, repartitioning
Operator types: map, scan, build, search, reduction
Merging: async (pipeline), synch, order preserving, sorting, ForAll
Parallel algorithms
Task Parallel Library (TPL): task APIs, task parallelism, futures, scheduling
Coordination Data Structures: thread-safe collections, synchronization types, coordination types
OS scheduling primitives across processors 1..p (also UMS in Windows 7 and up)
Slide 20: Task Parallel Library – Tasks
System.Threading.Tasks
Task: parent-child relationships, explicit grouping, waiting and cancelation
Task<T>: tasks that produce values, also known as futures
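The slide's Task<T> is a future: work that is started now and whose value is fetched later. As a language-neutral illustration (not the TPL API itself), the same idea in Python's concurrent.futures:

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    # A "task that produces a value": TPL's Task<T> and a future are the same idea.
    return n * n

with ThreadPoolExecutor() as pool:
    future = pool.submit(square, 7)   # start the task
    # ... other work could run here while the task executes ...
    answer = future.result()          # wait for and fetch the produced value

print(answer)
```

Waiting (`result()`) corresponds to Task.Wait / Task<T>.Result on the slide.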
Slide 21: Work Stealing
Internally, the runtime uses work-stealing techniques and lock-free concurrent task queues
Work stealing has provably good locality and work-distribution properties
[Diagram: per-processor task queues p1, p2, p3; tasks 1–4 queued on p1, with an idle processor stealing from the opposite end of p1's queue]
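The diagram's idea can be sketched in a few lines. This is a toy, single-threaded Python model (the class name and structure are made up for this example; the real runtime uses lock-free deques): the owner pushes and pops at one end of its own queue, and idle workers steal from the other end.

```python
from collections import deque

class WorkStealingQueue:
    """Toy sketch: the owner works LIFO at one end, thieves steal FIFO at the other."""
    def __init__(self, tasks=()):
        self._q = deque(tasks)

    def push(self, task):          # owner enqueues locally
        self._q.append(task)

    def pop(self):                 # owner takes the newest task (good cache locality)
        return self._q.pop() if self._q else None

    def steal(self):               # thief takes the oldest task (often a big subtree of work)
        return self._q.popleft() if self._q else None

p1, p2 = WorkStealingQueue([1, 2, 3, 4]), WorkStealingQueue()
stolen = p1.steal()    # idle worker p2 steals task 1 from p1's queue
p2.push(stolen)
print(p1.pop(), stolen)
```

Stealing from the opposite end keeps the owner and the thief from contending on the same entries, which is why the technique distributes work well.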
Slide 22: Example code to parallelize

    void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
    {
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                result[i, j] = 0;
                for (int k = 0; k < size; k++) {
                    result[i, j] += m1[i, k] * m2[k, j];
                }
            }
        }
    }
Slide 23: Solution today

    int N = size;
    int P = 2 * Environment.ProcessorCount;
    int Chunk = N / P;                          // size of a work chunk
    ManualResetEvent signal = new ManualResetEvent(false);
    int counter = P;                            // counter limits kernel transitions
    for (int c = 0; c < P; c++) {               // for each chunk
        ThreadPool.QueueUserWorkItem(o => {
            int lc = (int)o;
            for (int i = lc * Chunk;                         // process one chunk
                 i < (lc + 1 == P ? N : (lc + 1) * Chunk);   // respect upper bound
                 i++) {
                // original loop body
                for (int j = 0; j < size; j++) {
                    result[i, j] = 0;
                    for (int k = 0; k < size; k++) {
                        result[i, j] += m1[i, k] * m2[k, j];
                    }
                }
            }
            if (Interlocked.Decrement(ref counter) == 0) {   // efficient interlocked ops
                signal.Set();                                // kernel transition only when done
            }
        }, c);
    }
    signal.WaitOne();

Error prone, high overhead, full of tricks: static work distribution, knowledge of synchronization primitives, heavy synchronization, lack of thread reuse
Slide 24: Solution with Parallel Extensions

    void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
    {
        Parallel.For(0, size, i => {
            for (int j = 0; j < size; j++) {
                result[i, j] = 0;
                for (int k = 0; k < size; k++) {
                    result[i, j] += m1[i, k] * m2[k, j];
                }
            }
        });
    }

Structured parallelism
Slide 25: Task Parallel Library – Loops
Loops are a common source of work in programs
System.Threading.Parallel class
Parallelism when iterations are independent: the body doesn't depend on mutable state, e.g. static variables or local variables written in earlier iterations
Synchronous: all iterations finish, regularly or exceptionally

    for (int i = 0; i < n; i++) work(i);
    ...
    foreach (T e in data) work(e);

    Parallel.For(0, n, i => work(i));
    ...
    Parallel.ForEach(data, e => work(e));
Why immutability gains attention
Slide 26: demo – Task Parallel Library
Bart J.F. De Smet, Software Development Engineer, Microsoft Corporation
Slide 27: Amdahl's law
Maximum speedup: S = 1 / Σk (Pk / Sk), where Sk is the speed-up factor for portion k and Pk is the percentage of instructions in part k that can be parallelized
Simplified: S = 1 / ((1 − P) + P / N), where P is the percentage of instructions that can be parallelized and N is the number of processors
The sky is not the limit
Slide 28: Amdahl's law by example
Theoretical maximum speedup is determined by the amount of serial (linear) code
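The simplified formula is easy to play with numerically. A small Python sketch of Amdahl's law, using an assumed 95%-parallelizable workload as the example:

```python
def amdahl_speedup(p, n):
    """Amdahl's law: p = parallelizable fraction, n = processor count."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% parallel code, 128 cores give nowhere near 128x:
for n in (2, 8, 128):
    print(n, round(amdahl_speedup(0.95, n), 1))
```

As n grows without bound the speedup approaches 1 / (1 − p), i.e. the serial 5% caps the speedup at 20x no matter how many cores you add.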
Slide 29: Performance Tips
Compute-intensive and/or large data sets: work done should be at least 1,000s of cycles
Do not be gratuitous in task creation: tasks are lightweight, but still require object allocation, etc.
Parallelize only outer loops where possible, unless N is insufficiently large to offer enough parallelism
Prefer isolation and immutability over synchronization: synchronization == !scalable; try to avoid shared data
Have realistic expectations
Amdahl's Law: speedup will be fundamentally limited by the amount of sequential computation
Gustafson's Law: but what if you add more data, thus increasing the parallelizable percentage of the application?
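Gustafson's point can be made concrete. A sketch of the scaled-speedup formula (symbols chosen here to mirror the Amdahl notation above):

```python
def gustafson_speedup(p, n):
    """Gustafson's law: scale the data with the machine.
    p = parallel fraction of the scaled workload, n = processor count.
    Speedup grows linearly in n instead of saturating."""
    return (1 - p) + p * n

print(gustafson_speedup(0.95, 128))
```

Where Amdahl fixes the problem size and asks how much faster it runs, Gustafson fixes the run time and asks how much more work n processors can do; for p = 0.95 and n = 128 that is about 121.7x rather than Amdahl's ~17x.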
Slide30Enable LINQ developers to leverage parallel hardwareFully supports all .NET Standard Query OperatorsAbstracts away the hard work of using parallelismPartitions and merges data intelligently (classic data parallelism)Minimal impact to existing LINQ programming modelAsParallel extension methodOptional preservation of input ordering (AsOrdered)Query syntax enables runtime to auto-parallelizeAutomatic way to generate more Tasks, like ParallelGraph analysis determines how to do itVery little synchronization internally: highly efficient
Parallel LINQ (PLINQ)
var
q = from p in people
where p.Name == queryInfo.Name && p.State == queryInfo.State && p.Year >= yearStart && p.Year <= yearEnd orderby p.Year ascending select p;
.AsParallel()
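The partition/process/merge pipeline PLINQ runs under the covers can be sketched outside of .NET. An illustrative Python model (the data and the two-way range partitioning are invented for this example; PLINQ picks partitioning strategies itself):

```python
from concurrent.futures import ThreadPoolExecutor

people = [{"name": "Ada", "year": 1815}, {"name": "Bart", "year": 1983},
          {"name": "Grace", "year": 1906}, {"name": "Alan", "year": 1912}]

def query(chunk):
    # The "where" clause of the query, applied to one partition.
    return [p for p in chunk if 1900 <= p["year"] <= 1950]

# Range-partition the data, run the query fragment on each chunk, merge the results.
chunks = [people[:2], people[2:]]
with ThreadPoolExecutor() as pool:
    merged = [p for part in pool.map(query, chunks) for p in part]
print(sorted(p["name"] for p in merged))
```

The final merge step is where PLINQ restores input ordering when AsOrdered is requested.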
Slide 31: demo – PLINQ
Bart J.F. De Smet, Software Development Engineer, Microsoft Corporation
Slide 32: Coordination Data Structures
New synchronization primitives (System.Threading)
Barrier: multi-phased algorithms; tasks signal and wait for phases
CountdownEvent: has an initial counter value; gets signaled when the count reaches zero
LazyInitializer: lazy initialization routines; a reference-type variable gets initialized lazily
SemaphoreSlim: slim brother to Semaphore (which goes kernel mode)
SpinLock, SpinWait: loop-based wait ("spinning"); avoids a context switch or kernel-mode transition
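To make the CountdownEvent semantics concrete, here is a minimal Python sketch of the same contract (this is not the .NET type, just a model built from a lock and an event):

```python
import threading

class CountdownEvent:
    """Minimal sketch of .NET's CountdownEvent: starts at an initial
    count and becomes signaled when the count reaches zero."""
    def __init__(self, count):
        self._count = count
        self._lock = threading.Lock()
        self._done = threading.Event()

    def signal(self):
        with self._lock:
            self._count -= 1
            if self._count == 0:
                self._done.set()

    def wait(self, timeout=None):
        return self._done.wait(timeout)

cde = CountdownEvent(3)
workers = [threading.Thread(target=cde.signal) for _ in range(3)]
for w in workers:
    w.start()
cde.wait()                     # unblocks once all three workers have signaled
print("all workers finished")
```

This is exactly the fork/join pattern the hand-rolled ManualResetEvent + Interlocked counter implemented on the "Solution today" slide.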
Slide 33: Coordination Data Structures
Concurrent collections (System.Collections.Concurrent)
BlockingCollection<T>: producer/consumer scenarios; blocks when no data is available (consumer) and when no space is available (producer)
ConcurrentBag<T>, ConcurrentDictionary<TKey, TElement>, ConcurrentQueue<T>, ConcurrentStack<T>: thread-safe and scalable collections, as lock-free as possible
Partitioner<T>: facilities to partition data in chunks, e.g. for PLINQ partitioning problems
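The two blocking behaviors BlockingCollection<T> bundles are easy to demonstrate with a bounded queue. An illustrative Python producer/consumer (queue.Queue here stands in for the .NET type; the sentinel-based shutdown is a choice made for this example):

```python
import queue
import threading

# A bounded queue: put() blocks when full (producer side),
# get() blocks when empty (consumer side).
q = queue.Queue(maxsize=2)
results = []

def consumer():
    while True:
        item = q.get()           # blocks while the queue is empty
        if item is None:         # sentinel: the producer is done
            break
        results.append(item * item)

t = threading.Thread(target=consumer)
t.start()
for i in range(5):
    q.put(i)                     # blocks whenever the queue is full
q.put(None)
t.join()
print(results)
```

The bound (maxsize=2) is what gives back-pressure: a fast producer is throttled instead of filling memory.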
Slide 34: demo – Coordination Data Structures
Bart J.F. De Smet, Software Development Engineer, Microsoft Corporation
Slide 35: Asynchronous workflows in F#
Language feature unique to F#, based on the theory of monads (but much more exhaustive compared to LINQ…)
Overloadable meaning for specific keywords
Continuation passing style: not 'a -> 'b, but 'a -> ('b -> unit) -> unit; in C# style: Action<T, Action<R>> (the function takes a continuation that receives the computation's result)
Core concept: async { /* code */ }
Syntactic sugar for keywords inside the block, e.g. let!, do!, use!
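The 'a -> ('b -> unit) -> unit shape can be shown in any language with first-class functions. A toy Python sketch of continuation-passing style (the function names are invented for this illustration):

```python
def add_async(a, b, on_result):
    # CPS: instead of returning a value, the function takes a
    # continuation and passes the result to it ('a -> ('b -> unit) -> unit).
    on_result(a + b)

def times_ten_async(x, on_result):
    on_result(x * 10)

out = []
# Composing CPS functions nests the continuations. This nesting is
# exactly what F#'s async { ... } notation (let!, do!) flattens away.
add_async(1, 2, lambda s: times_ten_async(s, out.append))
print(out)
```

Each let! in an async block desugars to one such nested callback, which is why large CPS programs are unreadable without the syntactic sugar.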
Slide 36: Asynchronous workflows in F#

    let processAsync i = async {
        use stream = File.OpenRead(sprintf "Image%d.tmp" i)
        let! pixels = stream.AsyncRead(numPixels)
        let pixels' = transform pixels i
        use out = File.OpenWrite(sprintf "Image%d.done" i)
        do! out.AsyncWrite(pixels') }

    let processAsyncDemo =
        printfn "async demo..."
        let tasks = [ for i in 1 .. numImages -> processAsync i ]
        Async.RunSynchronously (Async.Parallel tasks) |> ignore
        printfn "Done!"

Async.Parallel runs the tasks in parallel. Behind the scenes, the let! is continuation passing; conceptually:

    stream.Read(numPixels, pixels ->
        let pixels' = transform pixels i
        use out = File.OpenWrite(sprintf "Image%d.done" i)
        do! out.AsyncWrite(pixels'))
Slide 37: demo – Asynchronous workflows in F#
Bart J.F. De Smet, Software Development Engineer, Microsoft Corporation
Slide 38: Reactive Fx
First-class events in .NET
IObservable<T>: the dual of the IEnumerable<T> interface
Pull versus push
Pull (active): IEnumerable<T> and foreach
Push (passive): raise events and event handlers
Events based on functions: composition at its best
Definition of operators: LINQ to Events
Realization of the continuation monad
Slide 39: IObservable<T> and IObserver<T>

    // Dual of IEnumerable<out T>; T is co-variant
    public interface IObservable<out T>
    {
        // The IDisposable is the way to unsubscribe
        IDisposable Subscribe(IObserver<T> observer);
    }

    // Dual of IEnumerator<out T>; T is contra-variant
    public interface IObserver<in T>
    {
        // IEnumerator<T>.MoveNext return value; signals the last event
        void OnCompleted();

        // IEnumerator<T>.MoveNext exceptional return
        void OnError(Exception error);

        // IEnumerator<T>.Current property
        void OnNext(T value);
    }

OnCompleted and OnError give the observer virtually two return types
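The push protocol above is simple enough to sketch from scratch. A toy Python subject/observer pair modeled on the interfaces (class names Subject and Printer are invented for this example; this is not the Rx library):

```python
class Subject:
    """Sketch of the IObservable<T> side: observers subscribe and
    are pushed OnNext/OnCompleted notifications."""
    def __init__(self):
        self._observers = []

    def subscribe(self, observer):
        self._observers.append(observer)
        # Stand-in for the IDisposable returned by Subscribe:
        # calling it unsubscribes the observer.
        return lambda: self._observers.remove(observer)

    def on_next(self, value):
        for o in list(self._observers):
            o.on_next(value)

    def on_completed(self):
        for o in list(self._observers):
            o.on_completed()

class Printer:
    """Sketch of an IObserver<T>: records what gets pushed at it."""
    def __init__(self):
        self.seen = []
    def on_next(self, value):
        self.seen.append(value)
    def on_completed(self):
        self.seen.append("done")

subject, printer = Subject(), Printer()
unsubscribe = subject.subscribe(printer)
subject.on_next(1)
subject.on_next(2)
subject.on_completed()
print(printer.seen)
```

Note the inversion relative to foreach: the consumer never asks for the next element; the producer pushes it.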
Slide 40: demo – Reactive Fx
Bart J.F. De Smet, Software Development Engineer, Microsoft Corporation
Visit channel9.msdn.com for info
Slide 41: Agenda
The concurrency landscape
Language headaches
.NET 4.0 facilities
Task Parallel Library
PLINQ
Coordination Data Structures
Asynchronous programming
Incubation projects
Summary
Slide 42: Axum
DevLabs project (previously "Maestro")
Coordination between components: "disciplined sharing"
Actor model: agents communicate via messages; channels to exchange data via ports
Language features (based on C#): declarative data pipelines and protocols, side-effect-free functions, asynchronous methods, isolated methods
Also suitable in a distributed setting
Slide 43: Channels for message exchange

    agent Program : channel Microsoft.Axum.Application
    {
        public Program()
        {
            string[] args = receive(PrimaryChannel::CommandLine);
            PrimaryChannel::ExitCode <-- 0;
        }
    }
Slide 44: Agents and channels

    channel Adder
    {
        input int Num1;
        input int Num2;
        output int Sum;
    }

    agent AdderAgent : channel Adder
    {
        public AdderAgent()
        {
            int result = receive(PrimaryChannel::Num1)
                       + receive(PrimaryChannel::Num2);
            PrimaryChannel::Sum <-- result;
        }
    }

Send (<--) and receive primitives
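The AdderAgent above can be mimicked with queues standing in for the channel's ports. An illustrative Python sketch (not Axum; the queue-per-port layout is an assumption made for this example):

```python
import queue
import threading

# One queue per port: Num1 and Num2 are input ports, total is the Sum output port.
num1, num2, total = queue.Queue(), queue.Queue(), queue.Queue()

def adder_agent():
    # Mirrors: receive(PrimaryChannel::Num1) + receive(PrimaryChannel::Num2)
    result = num1.get() + num2.get()
    total.put(result)              # mirrors: PrimaryChannel::Sum <-- result

threading.Thread(target=adder_agent).start()
num1.put(3)                        # Num1 <-- 3
num2.put(4)                        # Num2 <-- 4
answer = total.get()               # receive the Sum
print(answer)
```

The agent owns no shared mutable state; all coordination is message passing, which is the "disciplined sharing" the slide describes.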
Slide 45: Protocols

    channel Adder
    {
        input int Num1;
        input int Num2;
        output int Sum;

        Start: { Num1 -> GotNum1; }
        GotNum1: { Num2 -> GotNum2; }
        GotNum2: { Sum -> End; }
    }

The protocol is a state-transition diagram
Slide 46: Use of pipelines

    agent MainAgent : channel Microsoft.Axum.Application
    {
        // A mathematical (side-effect-free) function
        function int Fibonacci(int n)
        {
            if (n <= 1) return n;
            return Fibonacci(n - 1) + Fibonacci(n - 2);
        }

        int c = 10;

        void ProcessResult(int n)
        {
            Console.WriteLine(n);
            if (--c == 0) PrimaryChannel::ExitCode <-- 0;
        }

        public MainAgent()
        {
            var nums = new OrderedInteractionPoint<int>();

            // ==> describes the data flow through the pipeline
            nums ==> Fibonacci ==> ProcessResult;

            for (int i = 0; i < c; i++)
                nums <-- 42 - i;
        }
    }
Slide 47: Domains

    domain Chatroom
    {
        private string m_Topic;
        private int m_UserCount;

        reader agent User : channel UserCommunication
        {
            // ...
        }

        writer agent Administrator : channel AdminCommunication
        {
            // ...
        }
    }

A domain is the unit of sharing between agents
Slide 48: Asynchronous methods

    private asynchronous void ReadFile(string path)
    {
        Stream stream = new Stream(...);
        int numRead = stream.Read(...);
        while (numRead > 0)
        {
            ...
            numRead = stream.Read(...);
        }
    }

Blocking operations are allowed inside an asynchronous method
Slide 49: demo – Axum in a nutshell
Bart J.F. De Smet, Software Development Engineer, Microsoft Corporation
Slide 50: STM.NET
Another DevLabs project; cutting edge, released 7/28
Specialized fork of .NET 4.0 Beta 1 (CLR modifications required)
First-class transactions on memory, as an alternative to locking
"Optimistic" concurrency methodology: make modifications, roll back changes on conflict
Core concept: atomic { /* code */ }
Slide 51: Transactional memory
Problems with locks: potential for deadlocks (and more ugliness), granularity matters a lot, and they don't compose well
Subtle difference between the two fragments below: the atomic block rolls m_x and m_y back when the exception escapes, while the lock-based version leaves them modified

    atomic {
        m_x++;
        m_y--;
        throw new MyException();
    }

    lock (GlobalStmLock) {
        m_x++;
        m_y--;
        throw new MyException();
    }
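The rollback-on-exception behavior can be modeled in a few lines. A toy Python sketch of the optimistic approach (TVar and this shadow-copy scheme are invented for the illustration; real STM also detects conflicting writers and re-executes):

```python
class TVar:
    """Toy transactional variable for an optimistic-concurrency sketch."""
    def __init__(self, value):
        self.value = value

def atomic(block, *tvars):
    """Run block against shadow copies; commit only if it succeeds.
    An exception rolls everything back, unlike a lock-based version,
    which would leave the first increment already applied."""
    shadow = [t.value for t in tvars]
    try:
        block(shadow)
    except Exception:
        return False                      # rollback: originals untouched
    for t, v in zip(tvars, shadow):       # commit
        t.value = v
    return True

m_x, m_y = TVar(10), TVar(10)

def transfer(s):
    s[0] += 1
    s[1] -= 1
    raise RuntimeError("failure mid-transaction")

atomic(transfer, m_x, m_y)
print(m_x.value, m_y.value)   # both still 10: the transaction rolled back
```

With a lock instead of the transaction, m_x would be 11 and m_y 9 after the exception, which is exactly the subtle difference the slide points at.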
Slide 52: Bank account sample

    public static void Transfer(BankAccount from, BankAccount backup,
                                BankAccount to, int amount)
    {
        Atomic.Do(() =>
        {
            // Be optimistic, credit the beneficiary first
            to.ModifyBalance(amount);

            // Find the appropriate funds in source accounts
            try
            {
                from.ModifyBalance(-amount);
            }
            catch (OverdraftException)
            {
                backup.ModifyBalance(-amount);
            }
        });
    }
Slide 53: Atomic cell update

    public class SingleCellQueue<T> where T : class
    {
        T m_item;

        public T Get()
        {
            atomic {
                T temp = m_item;
                if (temp == null) retry;
                m_item = null;
                return temp;
            }
        }

        public void Put(T item)
        {
            atomic {
                if (m_item != null) retry;
                m_item = item;
            }
        }
    }

Don't forget the retry statements
Slide 54: The hard truth about STM
Great features: ACID, optimistic concurrency, transparent rollback and re-execute, System.Transactions (LTM) and DTC support
Implementation: instrumentation of shared-state access, JIT compiler modification, no hardware support currently
Result: 2x to 7x serial slowdown (in the alpha prototype), but improved parallel scalability
Slide 55: demo – STM.NET
Bart J.F. De Smet, Software Development Engineer, Microsoft Corporation
Visit msdn.microsoft.com/devlabs
Slide 56: DryadLINQ
Dryad: infrastructure for cluster computation; concept of a job
DryadLINQ: LINQ over Dryad
Decomposition of the query; distribution over computation nodes
Roughly similar to PLINQ, a la "map-reduce"
The declarative approach works
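The decomposition described above, i.e. split the collection over nodes, run the same query fragment on each, then merge, is the map-reduce shape. A toy single-process Python sketch (the four-way striped partitioning and the even-number query are invented for this example; Dryad does this across a cluster):

```python
from functools import reduce

collection = list(range(1, 101))
partitions = [collection[i::4] for i in range(4)]   # 4 "computation nodes"

def node_query(part):
    # Per-node work: the where clause plus a local aggregate.
    return sum(x for x in part if x % 2 == 0)

partials = [node_query(p) for p in partitions]      # would run on the cluster
result = reduce(lambda a, b: a + b, partials)       # final merge step
print(result)
```

Because the query is declarative, the system, not the programmer, decides how to partition and where each fragment runs; the same property PLINQ exploits on a single machine.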
Slide 57: DryadLINQ = LINQ + Dryad
[Diagram: the LINQ query over a data collection is compiled into a query plan (a Dryad job); C# vertex code runs on the cluster nodes and produces the results]

    Collection<T> collection;
    bool IsLegal(Key k);
    string Hash(Key);

    var results = from c in collection
                  where IsLegal(c.key)
                  select new { Hash(c.key), c.value };
Slide 58: demo – DryadLINQ
Bart J.F. De Smet, Software Development Engineer, Microsoft Corporation
Visit research.microsoft.com/dryad
Slide 59: Agenda
The concurrency landscape
Language headaches
.NET 4.0 facilities
Task Parallel Library
PLINQ
Coordination Data Structures
Asynchronous programming
Incubation projects
Summary
Slide 60: Summary
Parallel programming requires thinking: avoid side-effects, prefer immutability
Act 1 = Library approach in .NET 4.0: Task Parallel Library, Parallel LINQ, Coordination Data Structures, asynchronous patterns (+ a bit of language sugar)
Act 2 = Different approaches are lurking: Software Transactional Memory, purification of languages
Slide 61: Question & Answer
Slide 62: Resources
www.microsoft.com/teched – Sessions On-Demand & Community
http://microsoft.com/technet – Resources for IT Professionals
http://microsoft.com/msdn – Resources for Developers
www.microsoft.com/learning – Microsoft Certification & Training Resources
Slide 65: Complete an evaluation on CommNet and enter to win!
Slide 66: © 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.