Parallelism in the Standard C++: What to Expect in C++ 17
Artur Laksberg
arturl@microsoft.com
Visual C++ Team, Microsoft
September 17, 2014
Agenda
Parallel Fundamentals
Task regions
Parallel Algorithms
Parallelization
Vectorization
Part 1: The Fundamentals
OpenMP
TBB
PPL
MPI
OpenCL
OpenACC
CUDA
C++ AMP
Renderscript
Cilk Plus
GCD
Parallelism in C++11/14
Fundamentals:
  Memory model
  Atomics
Basics:
  thread
  mutex
  condition_variable
  async
  future
Quicksort: Serial
void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);
        quicksort(v, pivot + 1, end);
    }
}
Quicksort: Use Threads
void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        std::thread t1([&] { quicksort(v, start, pivot - 1); });
        std::thread t2([&] { quicksort(v, pivot + 1, end); });
        t1.join();
        t2.join();
    }
}
Problem 1: expensive
Problem 2: Fork-join not enforced
Problem 3: Exceptions??
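Problem 3 can be made concrete. An exception that escapes the function run by a std::thread ends the program through std::terminate, whereas std::async stores the exception and rethrows it from future::get(). The following is a minimal sketch using only C++11 facilities; risky_task and run_and_report are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <future>
#include <stdexcept>
#include <string>

// Hypothetical task that may fail; stands in for one recursive quicksort call.
int risky_task(bool fail) {
    if (fail) throw std::runtime_error("task failed");
    return 42;
}

// Runs the task on another thread and reports what the caller observes.
std::string run_and_report(bool fail) {
    std::future<int> result = std::async(std::launch::async, risky_task, fail);
    try {
        // get() waits for the task and rethrows any exception it raised.
        return "value " + std::to_string(result.get());
    } catch (const std::runtime_error& e) {
        return std::string("caught: ") + e.what();
    }
}
```

Calling run_and_report(true) returns a "caught: ..." string instead of terminating the process, which is the behavior the naked-thread version lacks.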
Andrzej Krzemieński:
“Do not use naked threads in the program: use RAII-like wrappers instead”
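One way to read this advice is a wrapper that joins its thread in the destructor, so that the join happens on every exit path. The sketch below is an assumption about what such a wrapper could look like; joining_thread and worker_completed are hypothetical names, not part of the talk.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical RAII wrapper: joins the thread when the wrapper is destroyed.
class joining_thread {
public:
    template <class F>
    explicit joining_thread(F&& f) : t_(std::forward<F>(f)) {}
    joining_thread(const joining_thread&) = delete;
    joining_thread& operator=(const joining_thread&) = delete;
    ~joining_thread() {
        if (t_.joinable()) t_.join();
    }
private:
    std::thread t_;
};

// Returns true if the worker had finished by the time the wrapper was destroyed.
bool worker_completed() {
    std::atomic<bool> done{false};
    {
        joining_thread worker([&done] { done = true; });
    }   // the destructor joins here, also when the scope is left by an exception
    return done;
}
```

C++20 later standardized a similar idea as std::jthread.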
Quicksort: Fork-Join Parallelism
void quicksort(int *v, int start, int end) {
    if (start < end) {                          // parallel region
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);         // task
        quicksort(v, pivot + 1, end);           // task
    }
}
Quicksort: Using Task Regions (N3832)
void quicksort(int *v, int start, int end) {
    if (start < end) {
        task_region([&] (auto& r) {                           // parallel region
            int pivot = partition(v, start, end);
            r.run([&] { quicksort(v, start, pivot - 1); });   // task
            r.run([&] { quicksort(v, pivot + 1, end); });     // task
        });
    }
}
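task_region is a proposal (N3832) and may not be available in a given toolchain. The same fork-join shape can be approximated with std::async, at the cost of creating threads rather than lightweight tasks. The sketch below makes that substitution explicit; partition_range, parallel_quicksort and the depth limit are hypothetical and chosen only to keep thread creation bounded.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <utility>
#include <vector>

// Lomuto partition on the inclusive range [start, end]; returns the pivot index.
int partition_range(int* v, int start, int end) {
    int pivot = v[end];
    int i = start;
    for (int j = start; j < end; ++j) {
        if (v[j] < pivot) std::swap(v[i++], v[j]);
    }
    std::swap(v[i], v[end]);
    return i;
}

// Fork-join quicksort: the left half is forked, the right half runs inline.
// 'depth' limits how many recursion levels are allowed to fork.
void parallel_quicksort(int* v, int start, int end, int depth = 3) {
    if (start >= end) return;
    int pivot = partition_range(v, start, end);
    if (depth > 0) {
        auto left = std::async(std::launch::async, parallel_quicksort,
                               v, start, pivot - 1, depth - 1);
        parallel_quicksort(v, pivot + 1, end, depth - 1);
        left.get();   // join; also rethrows an exception from the forked half
    } else {
        parallel_quicksort(v, start, pivot - 1, 0);
        parallel_quicksort(v, pivot + 1, end, 0);
    }
}
```

The join is enforced by left.get() before the function returns, which is the property the task region provides structurally.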
Under The Hood…
Work Stealing Scheduling
[Figure: four processors, proc 1, proc 2, proc 3 and proc 4]
Work Stealing Scheduling
[Figure: proc 1, proc 2, proc 3 and proc 4; a work queue with one end labeled "Old items" and the other "New items"]
Work Stealing Scheduling
[Figure: proc 1, proc 2, proc 3 and proc 4; a work queue with ends labeled "Old items" and "New items"; one processor labeled "Thief"]
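The labels in the diagram match the usual work-stealing arrangement: a processor adds and removes its own work at the "new" end of its queue, while an idle processor (the thief) takes work from the "old" end. The sketch below illustrates only that queue discipline. work_queue is a hypothetical class, tasks are plain integers, and a mutex is used where a real scheduler would likely use a lock-free deque.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical per-processor queue illustrating which end each party uses.
class work_queue {
public:
    // Owner adds a newly spawned task at the "new" end.
    void push(int task) {
        std::lock_guard<std::mutex> guard(m_);
        items_.push_back(task);
    }
    // Owner takes the newest task. Returns false if the queue is empty.
    bool pop(int& task) {
        std::lock_guard<std::mutex> guard(m_);
        if (items_.empty()) return false;
        task = items_.back();
        items_.pop_back();
        return true;
    }
    // Thief takes the oldest task from the opposite end.
    bool steal(int& task) {
        std::lock_guard<std::mutex> guard(m_);
        if (items_.empty()) return false;
        task = items_.front();
        items_.pop_front();
        return true;
    }
private:
    std::mutex m_;
    std::deque<int> items_;
};
```

After pushing tasks 1, 2 and 3, the owner's pop yields 3 and a thief's steal yields 1, so the two rarely contend for the same item.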
Fork-Join Parallelism and Work Stealing
e();
task_region([] (auto& r) {
    r.run(f);
    g();
});
h();
[Figure: task graph with nodes e(), f(), g() and h()]
Q1: What thread runs f?
Q2: What thread runs g?
Q3: What thread runs h?
Work Stealing Design Choices
What Thread Executes After a Spawn?
  Child Stealing
  Continuation (parent) Stealing
What Thread Executes After a Join?
  Stalling: initiating thread waits
  Greedy: the last thread to reach join continues
task_region([] (auto& r) {
    for (int i = 0; i < n; ++i)
        r.run(f);
});
Thread Switching
std::thread::id thread_id1, thread_id2, thread_id3, thread_id4;
thread_id1 = std::this_thread::get_id();
task_region([&] (auto& r) {
    thread_id2 = std::this_thread::get_id();
    r.run(f);
    thread_id3 = std::this_thread::get_id();
});
thread_id4 = std::this_thread::get_id();
assert(thread_id1 == thread_id4); // huh ???
assert(thread_id2 == thread_id3); // huh ???
?
Part 2: The Algorithms
Alex Stepanov: Start With The Algorithms
Inspiration
Performing Parallel Operations On Containers
Intel: Threading Building Blocks
Microsoft: Parallel Patterns Library, C++ AMP
Nvidia: Thrust
Parallel STL
Just like STL, only parallel…
Can be faster
  If you know what you’re doing
Two Execution Policies:
  std::par
  std::par_vec
Parallelization: What’s the Big Deal?
Why not already parallel?
std::sort(begin, end, [](int a, int b) { return a < b; });
User-provided closures must be thread safe:
int comparisons = 0;
std::sort(begin, end, [&](int a, int b) { comparisons++; return a < b; });
But also special-member functions, std::swap etc.
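The counting closure above has a data race if the algorithm calls it from more than one thread. One possible fix is to make the counter atomic. The sketch below uses the ordinary serial std::sort so that it builds without a parallel library; sort_and_count is a hypothetical helper name.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <vector>

// Sorts 'v' and returns how many comparisons were made. The counter is atomic,
// so the closure stays correct if an implementation calls it from several threads.
int sort_and_count(std::vector<int>& v) {
    std::atomic<int> comparisons{0};
    std::sort(v.begin(), v.end(), [&](int a, int b) {
        comparisons.fetch_add(1, std::memory_order_relaxed);
        return a < b;
    });
    return comparisons.load();
}
```

Relaxed ordering is enough here because only the final total is read, after the sort has returned.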
It’s a Contract
What the user can do
What the implementer can do
Asymptotic Guarantees:
  std::sort: O(n*log(n)), std::stable_sort: O(n*log²(n)), what about parallel sort?
What is a valid implementation? (see next slide)
Chaos Sort
template<typename Iterator, typename Compare>
void chaos_sort(Iterator first, Iterator last, Compare comp) {
    auto n = last - first;
    std::vector<char> c(n);
    for (;;) {
        bool flag = false;
        // c[i] records whether first[i] and first[i-1] are out of order
        for (size_t i = 1; i < n; ++i) {
            c[i] = comp(first[i], first[i-1]);
            flag |= c[i];
        }
        // stop once no adjacent pair is out of order
        if (!flag) break;
        // swap every pair that was marked
        for (size_t i = 1; i < n; ++i)
            if (c[i])
                std::swap(first[i-1], first[i]);
    }
}
Execution Policies
Built-in Execution Policies:
extern const sequential_execution_policy seq;
extern const parallel_execution_policy par;
extern const parallel_vector_execution_policy par_vec;
Dynamic Execution Policy:
class execution_policy {
public:
    // ...
    const type_info& target_type() const;
    template<class T> T *target();
    template<class T> const T *target() const;
};
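The target_type() and target() members suggest a type-erased holder that remembers which concrete policy it was given. The sketch below shows one way such a holder could work. Everything in namespace sketch is hypothetical and only mirrors the shape of the interface above; it is not the specification's implementation.

```cpp
#include <cassert>
#include <typeinfo>

namespace sketch {

struct sequential_policy {};
struct parallel_policy {};

// Hypothetical stand-in for execution_policy: remembers the type it refers to.
// It stores a pointer, so the policy object must outlive the holder.
class dynamic_policy {
public:
    dynamic_policy(const sequential_policy& p) : type_(&typeid(p)), ptr_(&p) {}
    dynamic_policy(const parallel_policy& p) : type_(&typeid(p)), ptr_(&p) {}

    const std::type_info& target_type() const { return *type_; }

    // Returns the held policy if it has type T, otherwise nullptr.
    template <class T>
    const T* target() const {
        return *type_ == typeid(T) ? static_cast<const T*>(ptr_) : nullptr;
    }
private:
    const std::type_info* type_;
    const void* ptr_;
};

const sequential_policy seq{};
const parallel_policy par{};

}  // namespace sketch
```

An algorithm receiving a dynamic_policy can test target<sketch::parallel_policy>() against nullptr to decide which code path to take.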
Using Execution Policy To Write Parallel Code
std::vector<int> vec = ...
// standard sequential sort
std::sort(vec.begin(), vec.end());
using namespace std::experimental::parallel;
// explicitly sequential sort
sort(seq, vec.begin(), vec.end());
// permitting parallel execution
sort(par, vec.begin(), vec.end());
// permitting vectorization as well
sort(par_vec, vec.begin(), vec.end());
Picking Execution Policy Dynamically
size_t threshold = ...
execution_policy exec = seq;
if (vec.size() > threshold) {
    exec = par;
}
sort(exec, vec.begin(), vec.end());
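The same run-time choice can be tried out with standard C++11 alone. In the sketch below, the policy enum, sort_with and pick_policy are hypothetical stand-ins for the proposal's types, and the "parallel" path simply sorts two halves concurrently with std::async and then merges them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

enum class policy { seq, par };   // hypothetical stand-in for seq and par

// Sorts sequentially, or sorts the two halves concurrently and merges them.
void sort_with(policy exec, std::vector<int>& vec) {
    auto first = vec.begin();
    auto last = vec.end();
    if (exec == policy::seq) {
        std::sort(first, last);
        return;
    }
    auto mid = first + static_cast<std::ptrdiff_t>(vec.size() / 2);
    auto left = std::async(std::launch::async,
                           [first, mid] { std::sort(first, mid); });
    std::sort(mid, last);
    left.get();                              // join before merging
    std::inplace_merge(first, mid, last);
}

// Picks the policy at run time from the input size.
policy pick_policy(std::size_t size, std::size_t threshold) {
    return size > threshold ? policy::par : policy::seq;
}
```

A sensible threshold depends on the element type and the machine, which is why the slide leaves it unspecified.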
Exception Handling
In C++ philosophy, no exception is silently ignored
Exception list: container of exception_ptr objects
try {
    r = std::inner_product(std::par, a.begin(), a.end(), b.begin(), func1, func2, 0);
} catch (const exception_list& list) {
    for (auto& exptr : list) {
        // process exception pointer exptr
    }
}
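The idea of an exception list can be illustrated with std::exception_ptr alone. In the sketch below, run_all and describe are hypothetical helpers, and the tasks run one after another rather than in parallel; the point is only how several failures are collected and later processed.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Runs every task; instead of stopping at the first failure, records each
// exception as an exception_ptr.
std::vector<std::exception_ptr> run_all(const std::vector<std::function<void()>>& tasks) {
    std::vector<std::exception_ptr> errors;
    for (const auto& task : tasks) {
        try {
            task();
        } catch (...) {
            errors.push_back(std::current_exception());
        }
    }
    return errors;
}

// Processes one exception pointer by rethrowing it and reading its message.
std::string describe(const std::exception_ptr& exptr) {
    try {
        std::rethrow_exception(exptr);
    } catch (const std::exception& e) {
        return e.what();
    } catch (...) {
        return "unknown exception";
    }
}
```

Rethrowing is the standard way to inspect an exception_ptr, since the pointer itself exposes no type or message.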
Vectorization: A Tale From Agriculture
Idea: Fewer Tractors, Wider Plows
Vectorization: What’s the Big Deal?
int a[n] = ...;
int b[n] = ...;
for (int i = 0; i < n; ++i) {
    a[i] = b[i] + c;
}
movdqu  xmm1, XMMWORD PTR _b$[esp+eax+132]
movdqu  xmm0, XMMWORD PTR _a$[esp+eax+132]
paddd   xmm1, xmm2
paddd   xmm1, xmm0
movdqu  XMMWORD PTR _a$[esp+eax+132], xmm1
a[i:i+3] = b[i:i+3] + c;
movdqu: Move Unaligned Double Quadword
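The a[i:i+3] notation describes processing four elements per step. Written out by hand in plain C++, that blocking looks roughly like the sketch below. add_blocked is a hypothetical name, the width of 4 assumes 32-bit int in a 128-bit register, and b is assumed to be at least as long as a. Whether the inner loop becomes a single SIMD instruction is up to the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stores b[i] + c in a[i], four elements per outer iteration, mirroring
// a[i:i+3] = b[i:i+3] + c.
void add_blocked(std::vector<int>& a, const std::vector<int>& b, int c) {
    const std::size_t n = a.size();
    const std::size_t width = 4;              // assumed number of vector lanes
    std::size_t i = 0;
    for (; i + width <= n; i += width) {
        for (std::size_t lane = 0; lane < width; ++lane)
            a[i + lane] = b[i + lane] + c;
    }
    for (; i < n; ++i)                        // remainder that does not fill a block
        a[i] = b[i] + c;
}
```

The remainder loop is needed whenever n is not a multiple of the block width, a detail the slide notation leaves out.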
Vector Lane is not a Thread!
Taking locks
  Thread with thread_id x takes a lock…
  Then another “thread” with the same thread_id enters the lock…
  Deadlock!!!
Exceptions
  Can we unwind 1/4th of the stack?
Vectorization: Not So Easy Any More…
void f(int* a, int* b) {
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c;
        func();
    }
}
mov   ecx, DWORD PTR _b$[esp+esi+140]
add   ecx, edi
add   DWORD PTR _a$[esp+esi+140], ecx
call  func
Aliasing?
Side effects?
Dependence?
Exceptions?
How Do We Get This?
void f(float* a, float* b) {
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c;
        func();
    }
}
for (int i = 0; i < n; i += 4) {
    a[i:i+3] = b[i:i+3] + c;
    for (int j = 0; j < 4; ++j)
        func();
}
Need a helping hand from the programmer!
Vectorization Hazard: Locks
for (int i = 0; i < n; ++i) {
    lock.enter();
    a[i] = b[i] + c;
    lock.release();
}
for (int i = 0; i < n; i += 4) {
    for (int j = 0; j < 4; ++j)
        lock.enter();
    a[i:i+3] = b[i:i+3] + c;
    for (int j = 0; j < 4; ++j)
        lock.release();
}
This transformation is not safe!
Consider: f takes a lock, g releases the lock:
?
But Wait, There Is One Little Problem…
Element-based algorithm:
void f(float* a, float* b) {
    std::for_each(a, b, [&](float f) {
        // Oops, no ‘i’:
        a[i] = b[i] + c;
        func();
    });
}
Index-based algorithm:
void f(float* a, float* b) {
    for (int i = 0; i < n; ++i) {
        // OK:
        a[i] = b[i] + c;
        func();
    }
}
Vector Loop with Parallel STL
void f(float* a, float* b) {
    integer_iterator begin {0};    // almost, see N3976
    integer_iterator end {b-a};
    std::for_each(std::par_vec, begin, end, [&](int i) {
        a[i] = b[i] + c;
        func();
    });
}
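The trick is an iterator whose "elements" are the indices themselves, so an element-based algorithm ends up handing the closure an index. The sketch below shows the idea with the ordinary serial std::for_each. index_iterator and add_constant are hypothetical names and are not the interface proposed in N3976; the element count is passed explicitly as n.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical minimal iterator that yields consecutive integers.
class index_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = int;

    explicit index_iterator(int i = 0) : i_(i) {}
    int operator*() const { return i_; }
    index_iterator& operator++() { ++i_; return *this; }
    index_iterator operator++(int) { index_iterator old = *this; ++i_; return old; }
    bool operator==(const index_iterator& other) const { return i_ == other.i_; }
    bool operator!=(const index_iterator& other) const { return i_ != other.i_; }
private:
    int i_;
};

// Index-based loop written as an element-based algorithm call.
void add_constant(float* a, const float* b, int n, float c) {
    std::for_each(index_iterator(0), index_iterator(n), [&](int i) {
        a[i] = b[i] + c;
    });
}
```

With an execution policy as the first argument, the same call shape would let the library decide how to distribute the index range.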
Parallelization vs. Vectorization
Parallelization
  Threads
  Stack
  Good for divergent code
  Relatively heavy-weight
Vectorization
  Vector Lanes
  No stack
  Lock-step execution
  Very light-weight
When To Vectorize
std::par
  No race conditions
  No aliasing
std::par_vec
  Same as std::par, plus:
  No Exceptions
  No Locks
  No/Little Divergence
References
N3991: Task Region
N3872: A Primer on Scheduling Fork-Join Parallelism with Work Stealing
N3724: A Parallel Algorithms Library
N3989: Working Draft, Technical Specification for C++ Extensions for Parallelism
N3976: Multidimensional bounds, index and array_view
parallelstl.codeplex.com