Parallelism in the Standard C++: What to Expect in C++ 17

Uploaded On 2016-07-19

Presentation Transcript

Slide1

Parallelism in the Standard C++: What to Expect in C++ 17

Artur Laksberg
arturl@microsoft.com
Visual C++ Team, Microsoft
September 17, 2014

Slide2

Agenda

Parallel Fundamentals
Task regions
Parallel Algorithms
Parallelization
Vectorization

Slide3

Part 1: The Fundamentals

Slide4

OpenMP

TBB

PPL

MPI

OpenCL

OpenACC

CUDA

C++ AMP

Renderscript

Cilk Plus

GCD

Slide5

Parallelism in C++11/14

Fundamentals:
Memory model
Atomics

Basics:
thread
mutex
condition_variable
async
future

Slide6

Quicksort: Serial

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);
        quicksort(v, pivot + 1, end);
    }
}

Slide7

Quicksort: Use Threads

void quicksort(int *v, int start, int end) {
    if (start < end)
    {
        int pivot = partition(v, start, end);
        std::thread t1([&] { quicksort(v, start, pivot - 1); });
        std::thread t2([&] { quicksort(v, pivot + 1, end); });
        t1.join();
        t2.join();
    }
}

Problem 1: expensive
Problem 2: Fork-join not enforced
Problem 3: Exceptions??

Slide8

Andrzej Krzemieński:

“Do not use naked threads in the program: use RAII-like wrappers instead”

Slide9

Quicksort: Fork-Join Parallelism

void quicksort(int *v, int start, int end) {
    if (start < end)
    {
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);
        quicksort(v, pivot + 1, end);
    }
}

(Slide callouts: parallel region, task, task)

Slide10

Quicksort: Using Task Regions (N3832)

void quicksort(int *v, int start, int end) {
    if (start < end)
    {
        task_region([&] (auto& r) {
            int pivot = partition(v, start, end);
            r.run([&] { quicksort(v, start, pivot - 1); });
            r.run([&] { quicksort(v, pivot + 1, end); });
        });
    }
}
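The same fork-join shape can be tried out with C++11 facilities alone. The following is a minimal sketch, not the task_region proposal itself: it swaps in std::async for r.run, and the names partition_range, quicksort_async, and the depth cutoff are assumptions made for illustration. The future's get() plays the role of the join at the end of the region.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <utility>
#include <vector>

// Hypothetical helper: Lomuto partition over the inclusive range [start, end].
// Returns the final index of the pivot element.
inline int partition_range(int* v, int start, int end) {
    int pivot = v[end];
    int i = start;
    for (int j = start; j < end; ++j) {
        if (v[j] < pivot) std::swap(v[i++], v[j]);
    }
    std::swap(v[i], v[end]);
    return i;
}

// Fork-join quicksort. `depth` (an assumption, not on the slide) limits how
// many recursion levels start a new thread; below that the sort is serial.
inline void quicksort_async(int* v, int start, int end, int depth = 3) {
    if (start >= end) return;
    int pivot = partition_range(v, start, end);
    if (depth <= 0) {
        quicksort_async(v, start, pivot - 1, 0);
        quicksort_async(v, pivot + 1, end, 0);
        return;
    }
    // Fork: the left half runs as a separate task.
    auto left = std::async(std::launch::async, [=] {
        quicksort_async(v, start, pivot - 1, depth - 1);
    });
    // The right half runs on the current thread.
    quicksort_async(v, pivot + 1, end, depth - 1);
    // Join: get() waits for the task and rethrows any exception it raised.
    left.get();
}
```

A call such as quicksort_async(data, 0, n - 1) sorts n elements in place. Because a future returned by std::async blocks in its destructor, the join still happens if the right half throws first.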

(Slide callouts: task, task, parallel region)

Slide11

Under The Hood…

Slide12

Work Stealing Scheduling

[Figure: four processors, proc 1 to proc 4]

Slide13

Work Stealing Scheduling

[Figure: proc 1 to proc 4, with work items ordered from old items to new items]

Slide14

Work Stealing Scheduling

[Figure: proc 1 to proc 4, with work items ordered from old items to new items]

Slide15

Work Stealing Scheduling

[Figure: proc 1 to proc 4, with work items ordered from old items to new items]

Slide16

Work Stealing Scheduling

[Figure: proc 1 to proc 4, with work items ordered from old items to new items]
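Work-stealing schedulers commonly let the owning processor push and pop at the new end of its queue while a thief takes from the old end. The sketch below assumes that convention; work_queue and its member names are hypothetical, and a plain mutex stands in for the lock-free deques that real schedulers tend to use.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical per-processor work queue, guarded by a mutex for simplicity.
class work_queue {
    std::deque<int> items_;   // front = oldest item, back = newest item
    std::mutex m_;
public:
    // Owner: add a work item at the "new" end.
    void push(int item) {
        std::lock_guard<std::mutex> lock(m_);
        items_.push_back(item);
    }
    // Owner: take the newest item. Returns false if the queue is empty.
    bool pop(int& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return false;
        out = items_.back();
        items_.pop_back();
        return true;
    }
    // Thief: take the oldest item. Returns false if the queue is empty.
    bool steal(int& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop_front();
        return true;
    }
};
```

With items 1, 2, 3 pushed in that order, a thief gets 1 while the owner gets 3, so the two rarely contend for the same item.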

[Figure callout: “Thief”]

Slide17

Fork-Join Parallelism and Work Stealing

e();
task_region([] (auto& r) {
    r.run(f);
    g();
});
h();

[Figure: execution graph of e(), f(), g(), h()]

Q1: What thread runs f?
Q2: What thread runs g?
Q3: What thread runs h?

Slide18

Work Stealing Design Choices

What Thread Executes After a Spawn?
Child Stealing
Continuation (parent) Stealing

What Thread Executes After a Join?
Stalling: initiating thread waits
Greedy: the last thread to reach join continues

task_region([] (auto& r) {
    for (int i = 0; i < n; ++i)
        r.run(f);
});

Slide19

Thread Switching

std::thread::id thread_id1, thread_id2, thread_id3, thread_id4;

thread_id1 = std::this_thread::get_id();

task_region([&] (auto& r) {
    thread_id2 = std::this_thread::get_id();
    r.run(f);
    thread_id3 = std::this_thread::get_id();
});

thread_id4 = std::this_thread::get_id();

assert(thread_id1 == thread_id4); // huh ???
assert(thread_id2 == thread_id3); // huh ???

Slide20

Part 2: The Algorithms

Slide21

Alex Stepanov: Start With The Algorithms

Slide22

Inspiration

Performing Parallel Operations On Containers

Intel Threading Building Blocks

Microsoft Parallel Patterns Library, C++ AMP

Nvidia Thrust

Slide23

Parallel STL

Just like STL, only parallel…
Can be faster
If you know what you’re doing

Two Execution Policies:
std::par
std::par_vec

Slide24

Parallelization: What’s a Big Deal?

Why not already parallel?

std::sort(begin, end, [](int a, int b) { return a < b; });

User-provided closures must be thread safe:

int comparisons = 0;
std::sort(begin, end, [&](int a, int b) { comparisons++; return a < b; });
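One way to make the counting closure safe to call from several threads is to replace the plain int with std::atomic<int>. The sketch below is an illustration under assumptions: sort_and_count is a hypothetical name, and it calls the ordinary sequential std::sort so that it compiles without parallel algorithm support. Only the shape of the closure is the point.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <vector>

// Sorts `v` in place and returns how many comparisons were made.
inline int sort_and_count(std::vector<int>& v) {
    std::atomic<int> comparisons{0};
    std::sort(v.begin(), v.end(), [&](int a, int b) {
        // fetch_add is an atomic read-modify-write, so concurrent calls
        // from several threads would not lose updates.
        comparisons.fetch_add(1, std::memory_order_relaxed);
        return a < b;
    });
    return comparisons.load();
}
```

The atomic counter removes the data race, but it also becomes a point of contention if many threads increment it, which is part of why a parallel sort cannot simply be swapped in behind the user's back.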

But also special-member functions, std::swap etc.

Slide25

It’s a Contract

What the user can do
What the implementer can do
Asymptotic Guarantees: std::sort: O(n*log(n)), std::stable_sort: O(n*log²(n)), what about parallel sort?
What is a valid implementation? (see next slide)

Slide26

Chaos Sort

template<typename Iterator, typename Compare>
void chaos_sort(Iterator first, Iterator last, Compare comp) {
    auto n = last - first;
    std::vector<char> c(n);
    for (;;) {
        bool flag = false;
        // Mark every adjacent pair that is out of order.
        for (size_t i = 1; i < n; ++i) {
            c[i] = comp(first[i], first[i-1]);
            flag |= c[i];
        }
        if (!flag) break;   // no pair was out of order
        // Swap at every marked position.
        for (size_t i = 1; i < n; ++i)
            if (c[i])
                std::swap(first[i-1], first[i]);
    }
}

Slide27

Execution Policies

Built-in Execution Policies:

extern const sequential_execution_policy seq;
extern const parallel_execution_policy par;
extern const parallel_vector_execution_policy par_vec;

Dynamic Execution Policy:

class execution_policy
{
public:
    // ...
    const type_info& target_type() const;
    template<class T> T *target();
    template<class T> const T *target() const;
};

Slide28

Using Execution Policy To Write Parallel Code

std::vector<int> vec = ...

// standard sequential sort
std::sort(vec.begin(), vec.end());

using namespace std::experimental::parallel;

// explicitly sequential sort
sort(seq, vec.begin(), vec.end());

// permitting parallel execution
sort(par, vec.begin(), vec.end());

// permitting vectorization as well
sort(par_vec, vec.begin(), vec.end());

Slide29

Picking Execution Policy Dynamically

size_t threshold = ...

execution_policy exec = seq;

if (vec.size() > threshold)
{
    exec = par;
}

sort(exec, vec.begin(), vec.end());

Slide30

Exception Handling

In C++ philosophy, no exception is silently ignored
Exception list: container of exception_ptr objects
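The idea of gathering several exceptions into one container can be sketched with C++11 facilities alone. This is an assumption-laden illustration, not the proposed exception_list type: run_all and message_of are hypothetical names, and each task simply runs through std::async.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <future>
#include <stdexcept>
#include <string>
#include <vector>

// Runs every task on its own thread and returns the exceptions they threw,
// in task order. Tasks that finish normally contribute nothing.
inline std::vector<std::exception_ptr>
run_all(const std::vector<std::function<void()>>& tasks) {
    std::vector<std::future<void>> futures;
    for (const auto& task : tasks)
        futures.push_back(std::async(std::launch::async, task));

    std::vector<std::exception_ptr> errors;
    for (auto& f : futures) {
        try {
            f.get();   // rethrows the task's exception, if any
        } catch (...) {
            errors.push_back(std::current_exception());
        }
    }
    return errors;
}

// Turns one stored exception back into a message by rethrowing it.
inline std::string message_of(const std::exception_ptr& p) {
    try {
        std::rethrow_exception(p);
    } catch (const std::exception& e) {
        return e.what();
    } catch (...) {
        return "unknown exception";
    }
}
```

Rethrowing each stored pointer inside a try block is the usual way to inspect an exception_ptr, since the pointer itself exposes no type information.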

try
{
    r = std::inner_product(std::par, a.begin(), a.end(), b.begin(),
                           0, func1, func2);
}
catch (const exception_list& list)
{
    for (auto& exptr : list)
    {
        // process exception pointer exptr
    }
}

Slide31

Vectorization: A Tale From Agriculture

Slide32

A Tale From Agriculture

Slide33

A Tale From Agriculture

Slide34

Idea: Fewer Tractors, Wider Plows

Slide35

Vectorization: What’s a Big Deal?

int a[n] = ...;
int b[n] = ...;

for (int i = 0; i < n; ++i)
{
    a[i] = b[i] + c;
}

movdqu  xmm1, XMMWORD PTR _b$[esp+eax+132]
movdqu  xmm0, XMMWORD PTR _a$[esp+eax+132]
paddd   xmm1, xmm2
paddd   xmm1, xmm0
movdqu  XMMWORD PTR _a$[esp+eax+132], xmm1

a[i:i+3] = b[i:i+3] + c;
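The a[i:i+3] notation is shorthand for handling four elements per step. In portable C++ the same blocking can be written out by hand, as in the sketch below. The function name add_blocked and the fixed width of 4 are assumptions for illustration; whether a compiler turns the block into a single vector instruction depends on the target and optimization settings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes a[i] = b[i] + c, four elements ("lanes") per step.
inline void add_blocked(int* a, const int* b, int c, std::size_t n) {
    const std::size_t width = 4;   // lanes per step (assumed)
    std::size_t i = 0;
    for (; i + width <= n; i += width) {
        // Corresponds to a[i:i+3] = b[i:i+3] + c.
        a[i + 0] = b[i + 0] + c;
        a[i + 1] = b[i + 1] + c;
        a[i + 2] = b[i + 2] + c;
        a[i + 3] = b[i + 3] + c;
    }
    // Scalar tail for the remaining n % 4 elements.
    for (; i < n; ++i) a[i] = b[i] + c;
}
```

The scalar tail matters whenever n is not a multiple of the lane count.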

movdqu: Move Unaligned Double Quadword

Slide36

Vector Lane is not a Thread!

Taking locks
Thread with thread_id x takes a lock…
Then another “thread” with the same thread_id enters the lock…
Deadlock!!!

Exceptions
Can we unwind 1/4th of the stack?

Slide37

Vectorization: Not So Easy Any More…

void f(int* a, int* b)
{
    for (int i = 0; i < n; ++i)
    {
        a[i] = b[i] + c;
        func();
    }
}

mov   ecx, DWORD PTR _b$[esp+esi+140]
add   ecx, edi
add   DWORD PTR _a$[esp+esi+140], ecx
call  func

Aliasing?
Side effects?
Dependence?
Exceptions?

Slide38

How Do We Get This?

void f(float* a, float* b)
{
    for (int i = 0; i < n; ++i)
    {
        a[i] = b[i] + c;
        func();
    }
}

for (int i = 0; i < n; i += 4)
{
    a[i:i+3] = b[i:i+3] + c;
    for (int j = 0; j < 4; ++j)
        func();
}
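The transformed loop performs four stores and then four calls to func(), whereas the original alternates store and call. The sketch below records that ordering so the difference is visible. The function names and the log letters (S for a store, F for a call to func) are assumptions for illustration, and n is assumed to be a multiple of 4.

```cpp
#include <cassert>
#include <string>

// Original loop: each store is followed at once by a call to func().
inline std::string scalar_order(int n) {
    std::string log;
    for (int i = 0; i < n; ++i) {
        log += 'S';   // a[i] = b[i] + c
        log += 'F';   // func()
    }
    return log;
}

// Transformed loop: four stores, then four calls (n a multiple of 4).
inline std::string blocked_order(int n) {
    std::string log;
    for (int i = 0; i < n; i += 4) {
        log += "SSSS";                 // a[i:i+3] = b[i:i+3] + c
        for (int j = 0; j < 4; ++j)
            log += 'F';                // func()
    }
    return log;
}
```

If func() reads or writes a, or has any other side effect tied to the store it follows, the two orders are observably different, and the compiler cannot rule that out on its own.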

Need a helping hand from the programmer!

Slide39

Vectorization Hazard: Locks

for (int i = 0; i < n; ++i)
{
    lock.enter();
    a[i] = b[i] + c;
    lock.release();
}

for (int i = 0; i < n; i += 4)
{
    for (int j = 0; j < 4; ++j)
        lock.enter();
    a[i:i+3] = b[i:i+3] + c;
    for (int j = 0; j < 4; ++j)
        lock.release();
}

This transformation is not safe!
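One way to see the problem is to count how many back-to-back enter attempts succeed on a non-recursive lock when nothing releases it in between. The sketch below uses a hypothetical simple_lock whose try_enter reports failure instead of blocking, so the hazard can be observed without hanging. The names are assumptions and do not come from any library.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical non-recursive lock. try_enter returns false when the lock
// is already held, where a blocking enter() would wait forever.
class simple_lock {
    std::atomic<bool> held_{false};
public:
    bool try_enter() {
        return !held_.exchange(true, std::memory_order_acquire);
    }
    void release() {
        held_.store(false, std::memory_order_release);
    }
};

// Counts how many of `lanes` consecutive enter attempts succeed when no
// release happens in between, as in the transformed loop.
inline int enters_before_release(int lanes) {
    simple_lock lock;
    int succeeded = 0;
    for (int j = 0; j < lanes; ++j)
        if (lock.try_enter()) ++succeeded;
    lock.release();
    return succeeded;
}
```

Only the first of the four attempts succeeds. With a blocking lock, the second attempt would never return, because the release it waits for comes later in the same thread.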

Consider: f takes a lock, g releases the lock.

Slide40

But Wait, There Is One Little Problem…

Element-based algorithm:

void f(float* a, float* b)
{
    std::for_each(a, b, [&](float f)
    {
        // Oops, no ‘i’: a[i] = b[i] + c;
        func();
    });
}

Index-based algorithm:

void f(float* a, float* b)
{
    for (int i = 0; i < n; ++i)
    {
        // OK:
        a[i] = b[i] + c;
        func();
    }
}

Slide41

Vector Loop with Parallel STL

void f(float* a, float* b)
{
    integer_iterator begin {0};   // almost, see N3976
    integer_iterator end {b-a};
    std::for_each(std::par_vec, begin, end, [&](int i)
    {
        a[i] = b[i] + c;
        func();
    });
}

Slide42
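The index-based for_each from the "Vector Loop with Parallel STL" slide can be tried with a small hand-written iterator. The sketch below is a stand-in under assumptions: index_iterator and add_indexed are hypothetical names, the element count n is passed explicitly, and the ordinary sequential std::for_each is used instead of an execution policy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical stand-in for the slide's integer_iterator: an input
// iterator whose dereferenced value is the index itself.
class index_iterator {
    int value_;
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = int;

    explicit index_iterator(int value) : value_(value) {}
    int operator*() const { return value_; }
    index_iterator& operator++() { ++value_; return *this; }
    index_iterator operator++(int) {
        index_iterator old = *this;
        ++value_;
        return old;
    }
    bool operator==(const index_iterator& o) const { return value_ == o.value_; }
    bool operator!=(const index_iterator& o) const { return value_ != o.value_; }
};

// a[i] = b[i] + c for i in [0, n), written as an index-based for_each.
inline void add_indexed(float* a, const float* b, float c, int n) {
    std::for_each(index_iterator(0), index_iterator(n), [&](int i) {
        a[i] = b[i] + c;
    });
}
```

Because the closure receives the index rather than the element, it can address both arrays, which the element-based form cannot do.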

Parallelization vs. Vectorization

Parallelization
Threads
Stack
Good for divergent code
Relatively heavy-weight

Vectorization
Vector Lanes
No stack
Lock-step execution
Very light-weight

Slide43

When To Vectorize

std::par
No race conditions
No aliasing

std::par_vec
Same as std::par, plus:
No Exceptions
No Locks
No/Little Divergence

Slide44

References

N3991: Task Region
N3872: A Primer on Scheduling Fork-Join Parallelism with Work Stealing
N3724: A Parallel Algorithms Library
N3989: Working Draft, Technical Specification for C++ Extensions for Parallelism
N3976: Multidimensional bounds, index and array_view
parallelstl.codeplex.com

Slide45