Parallelism in the Standard C++: What to Expect in C++ 17 - PowerPoint Presentation

Uploaded On 2016-07-19


Presentation Transcript

Slide1

Parallelism in the Standard C++: What to Expect in C++ 17

Artur Laksberg
arturl@microsoft.com
Visual C++ Team, Microsoft
September 17, 2014

Slide2

Agenda

Parallel Fundamentals
Task regions
Parallel Algorithms
Parallelization
Vectorization

Slide3

Part 1: The Fundamentals

Slide4

OpenMP

TBB

PPL

MPI

OpenCL

OpenACC

CUDA

C++ AMP

Renderscript

Cilk Plus

GCD

Slide5

Parallelism in C++11/14

Fundamentals:
  Memory model
  Atomics
Basics:
  thread
  mutex
  condition_variable
  async
  future

Slide6

Quicksort: Serial

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);
        quicksort(v, pivot + 1, end);
    }
}

Slide7
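The serial quicksort on the previous slide calls a partition helper that the slides never show. The following is a minimal sketch of one way to complete it, assuming a Lomuto-style partition over the inclusive range [start, end]; the partition body and the sorted_copy wrapper are illustrative assumptions, not code from the talk.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Assumed helper (not shown in the slides): Lomuto partition over the
// inclusive range [start, end]. Uses v[end] as the pivot value and
// returns the index where the pivot ends up.
int partition(int* v, int start, int end) {
    int pivot_value = v[end];
    int store = start;                  // next slot for an element < pivot
    for (int i = start; i < end; ++i) {
        if (v[i] < pivot_value) {
            std::swap(v[i], v[store]);
            ++store;
        }
    }
    std::swap(v[store], v[end]);        // move the pivot into its final place
    return store;
}

// Same shape as the slide's serial quicksort.
void quicksort(int* v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);
        quicksort(v, pivot + 1, end);
    }
}

// Hypothetical convenience wrapper that hides the inclusive-index convention.
std::vector<int> sorted_copy(std::vector<int> values) {
    if (!values.empty())
        quicksort(values.data(), 0, static_cast<int>(values.size()) - 1);
    return values;
}
```

Note that end is the index of the last element, not one past it, which is why the wrapper passes size() - 1.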

Quicksort: Use Threads

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        std::thread t1([&] { quicksort(v, start, pivot - 1); });
        std::thread t2([&] { quicksort(v, pivot + 1, end); });
        t1.join();
        t2.join();
    }
}

Problem 1: expensive
Problem 2: Fork-join not enforced
Problem 3: Exceptions??

Slide8

Andrzej Krzemieński:

“Do not use naked threads in the program: use RAII-like wrappers instead”

Slide9
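The advice quoted on the previous slide can be sketched with a small wrapper that joins its thread in the destructor. The class name joining_thread and the helper run_both_and_sum are hypothetical; this is one possible shape for such a wrapper, not a standard facility referenced by the talk.

```cpp
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical RAII wrapper: owns a std::thread and joins it on destruction,
// so the join also happens when the enclosing scope exits via an exception.
class joining_thread {
    std::thread t_;
public:
    template <class F>
    explicit joining_thread(F&& f) : t_(std::forward<F>(f)) {}

    joining_thread(const joining_thread&) = delete;
    joining_thread& operator=(const joining_thread&) = delete;

    ~joining_thread() {
        if (t_.joinable()) t_.join();
    }
};

// Runs two pieces of work on separate threads and combines the results.
int run_both_and_sum() {
    int left = 0;
    int right = 0;
    {
        joining_thread t1([&] { left = 20; });
        joining_thread t2([&] { right = 22; });
    }   // both destructors have run here, so both threads are joined
    return left + right;
}
```

With a naked std::thread, an exception thrown between construction and join() would destroy a joinable thread and call std::terminate; the wrapper addresses that part of "Problem 3", though it does not propagate exceptions thrown inside the threads themselves.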

Quicksort: Fork-Join Parallelism

void quicksort(int *v, int start, int end) {
    if (start < end) {                        // parallel region
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);       // task
        quicksort(v, pivot + 1, end);         // task
    }
}

Slide10

Quicksort: Using Task Regions (N3832)

void quicksort(int *v, int start, int end) {
    if (start < end) {
        task_region([&] (auto& r) {           // parallel region
            int pivot = partition(v, start, end);
            r.run([&] { quicksort(v, start, pivot - 1); });   // task
            r.run([&] { quicksort(v, pivot + 1, end); });     // task
        });
    }
}

Slide11
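The task_region interface from the previous slide is a proposal and is not available in standard libraries, so the following sketch swaps in std::async to imitate its fork-join shape. The names simple_region and simple_task_region, the 1024-element serial cutoff, and the std::partition-based split are all assumptions made for illustration; a real implementation would schedule tasks on a work-stealing pool rather than start a thread per task.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for the region handle: run() forks a task,
// join_all() is the join point.
class simple_region {
    std::vector<std::future<void>> tasks_;
public:
    template <class F>
    void run(F&& f) {
        tasks_.push_back(std::async(std::launch::async, std::forward<F>(f)));
    }
    void join_all() {
        for (auto& t : tasks_) t.wait();   // wait for every task first
        for (auto& t : tasks_) t.get();    // then rethrow a task's exception, if any
    }
};

// Hypothetical stand-in for task_region: every task forked inside body
// has finished by the time this function returns.
template <class Body>
void simple_task_region(Body&& body) {
    simple_region r;
    body(r);
    r.join_all();
}

void quicksort(int* v, int start, int end) {
    if (start >= end) return;
    if (end - start < 1024) {              // assumed cutoff: small ranges stay serial
        std::sort(v + start, v + end + 1);
        return;
    }
    // The pivot is computed before the region so that the tasks only
    // reference variables that outlive the join.
    int pivot_value = v[end];
    int* mid = std::partition(v + start, v + end,
                              [=](int x) { return x < pivot_value; });
    std::swap(*mid, v[end]);
    int pivot = static_cast<int>(mid - v);

    simple_task_region([&](simple_region& r) {
        r.run([&] { quicksort(v, start, pivot - 1); });
        r.run([&] { quicksort(v, pivot + 1, end); });
    });
}
```

Because futures returned by std::async with std::launch::async block in their destructor, the join is enforced even if the region body throws, which is the property the thread-based version lacked.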

Under The Hood…

Slide12

Work Stealing Scheduling

[Diagram: proc 1 through proc 4]

Slide13

Work Stealing Scheduling

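[Diagram: two procs, queue ends labeled "Old items" and "New items"]

Slide14

Work Stealing Scheduling

[Diagram: two procs, queue ends labeled "Old items" and "New items"]

Slide15

Work Stealing Scheduling

[Diagram: two procs, queue ends labeled "Old items" and "New items"]

Slide16

Work Stealing Scheduling

[Diagram: two procs, queue ends labeled "Old items" and "New items", one proc marked “Thief”]

Slide17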
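The "Old items", "New items" and "Thief" labels on the preceding slides match the usual work-stealing layout, in which the owning processor pushes and pops at the new end of its queue while an idle processor steals from the old end. The sketch below illustrates that discipline with a mutex-protected std::deque; the class name work_queue and the use of integer task ids are assumptions, and production schedulers typically use lock-free deques instead.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical per-processor work queue.
// front of the deque = oldest item, back of the deque = newest item.
class work_queue {
    std::deque<int> items_;
    std::mutex m_;
public:
    // Owner adds new work at the "new items" end.
    void push(int task_id) {
        std::lock_guard<std::mutex> guard(m_);
        items_.push_back(task_id);
    }
    // Owner takes the newest item (LIFO). Returns false if the queue is empty.
    bool pop(int& task_id) {
        std::lock_guard<std::mutex> guard(m_);
        if (items_.empty()) return false;
        task_id = items_.back();
        items_.pop_back();
        return true;
    }
    // A thief takes the oldest item from the opposite end.
    bool steal(int& task_id) {
        std::lock_guard<std::mutex> guard(m_);
        if (items_.empty()) return false;
        task_id = items_.front();
        items_.pop_front();
        return true;
    }
};
```

Working from opposite ends keeps the owner and the thief away from each other most of the time, and in recursive algorithms the oldest items tend to be the largest pieces of work, so one steal moves a lot of work at once.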

Fork-Join Parallelism and Work Stealing

e();
task_region([] (auto& r) {
    r.run(f);
    g();
});
h();

[Diagram: e(), f(), g()]

Q1: What thread runs f?
Q2: What thread runs g?
Q3: What thread runs h?

Slide18

Work Stealing Design Choices

What Thread Executes After a Spawn?
  Child Stealing
  Continuation (parent) Stealing
What Thread Executes After a Join?
  Stalling: initiating thread waits
  Greedy: the last thread to reach join continues

task_region([&] (auto& r) {
    for(int i=0; i<n; ++i)
        r.run(f);
});

Slide19

Thread Switching

std::thread::id thread_id1, thread_id2, thread_id3, thread_id4;

thread_id1 = std::this_thread::get_id();

task_region([&] (auto& r) {
    thread_id2 = std::this_thread::get_id();
    r.run(f);
    thread_id3 = std::this_thread::get_id();
});

thread_id4 = std::this_thread::get_id();

assert(thread_id1 == thread_id4); // huh ???
assert(thread_id2 == thread_id3); // huh ???

Slide20

Part 2: The Algorithms

Slide21

Alex Stepanov: Start With The Algorithms

Slide22

Inspiration

Performing Parallel Operations On Containers

Intel: Threading Building Blocks
Microsoft: Parallel Patterns Library, C++ AMP
Nvidia: Thrust

Slide23

Parallel STL

Just like STL, only parallel…
Can be faster
  If you know what you’re doing
Two Execution Policies:
  std::par
  std::par_vec

Slide24

Parallelization: What’s a Big Deal?

Why not already parallel?

std::sort(begin, end, [](int a, int b) { return a < b; });

User-provided closures must be thread safe:

int comparisons = 0;
std::sort(begin, end, [&](int a, int b) { comparisons++; return a < b; });

But also special-member functions, std::swap etc.

Slide25
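The comparison counter on the previous slide would be a data race if the comparator ran on several threads at once, because comparisons++ is an unsynchronized read-modify-write. One way to make the closure safe is an atomic counter, sketched below; the function name sort_and_count is hypothetical, and the sketch calls the ordinary serial std::sort so that it runs with any standard library.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <vector>

// Sorts values in place and returns how many comparisons were made.
// The counter is atomic, so the closure would remain correct even if an
// implementation invoked it concurrently from several threads.
int sort_and_count(std::vector<int>& values) {
    std::atomic<int> comparisons{0};
    std::sort(values.begin(), values.end(), [&](int a, int b) {
        comparisons.fetch_add(1, std::memory_order_relaxed);
        return a < b;
    });
    return comparisons.load();
}
```

An atomic counter removes the race but not the contention: every comparison still touches the same cache line, so in a genuinely parallel sort a per-thread counter summed at the end would usually scale better.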

It’s a Contract

What the user can do
What the implementer can do
Asymptotic Guarantees:
  std::sort: O(n*log(n)), std::stable_sort: O(n*log²(n)), what about parallel sort?
What is a valid implementation? (see next slide)

Slide26

Chaos Sort

template<typename Iterator, typename Compare>
void chaos_sort( Iterator first, Iterator last, Compare comp ) {
    auto n = last-first;
    std::vector<char> c(n);
    for(;;) {
        bool flag = false;
        for( size_t i=1; i<n; ++i ) {
            c[i] = comp(first[i],first[i-1]);
            flag |= c[i];
        }
        if( !flag ) break;
        for( size_t i=1; i<n; ++i )
            if( c[i] )
                std::swap( first[i-1], first[i] );
    }
}

Slide27

Execution Policies

Built-in Execution Policies:

extern const sequential_execution_policy seq;
extern const parallel_execution_policy par;
extern const parallel_vector_execution_policy par_vec;

Dynamic Execution Policy:

class execution_policy
{
public:
    // ...
    const type_info& target_type() const;
    template<class T> T *target();
    template<class T> const T *target() const;
};

Slide28
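The dynamic execution_policy class on the previous slide is a type-erased holder: it stores one of the concrete policy objects and lets the caller ask which one it holds. The sketch below shows one way such a holder could work. The names seq_tag, par_tag and dynamic_policy are hypothetical stand-ins, and the shared_ptr<void> storage is an assumption about the mechanism, not the specified implementation.

```cpp
#include <cassert>
#include <memory>
#include <typeinfo>

// Hypothetical concrete policy types.
struct seq_tag {};
struct par_tag {};

// Hypothetical type-erased holder mirroring the target_type()/target<T>()
// interface listed on the slide.
class dynamic_policy {
    std::shared_ptr<void> held_;      // copy of the concrete policy object
    const std::type_info* type_;      // remembers which type was stored
public:
    template <class T>
    dynamic_policy(const T& policy)
        : held_(std::make_shared<T>(policy)), type_(&typeid(T)) {}

    const std::type_info& target_type() const { return *type_; }

    // Returns a pointer to the stored policy if it has type T, else nullptr.
    template <class T>
    T* target() {
        return *type_ == typeid(T) ? static_cast<T*>(held_.get()) : nullptr;
    }
    template <class T>
    const T* target() const {
        return *type_ == typeid(T) ? static_cast<const T*>(held_.get()) : nullptr;
    }
};
```

An algorithm that receives such an object can test target<par_tag>() at run time and dispatch to its parallel or sequential implementation accordingly.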

Using Execution Policy To Write Parallel Code

std::vector<int> vec = ...

// standard sequential sort
std::sort(vec.begin(), vec.end());

using namespace std::experimental::parallel;

// explicitly sequential sort
sort(seq, vec.begin(), vec.end());

// permitting parallel execution
sort(par, vec.begin(), vec.end());

// permitting vectorization as well
sort(par_vec, vec.begin(), vec.end());

Slide29

Picking Execution Policy Dynamically

size_t threshold = ...
execution_policy exec = seq;
if(vec.size() > threshold)
{
    exec = par;
}
sort(exec, vec.begin(), vec.end());

Slide30
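The threshold test on the previous slide can be tried out without the Technical Specification by substituting a small enum for execution_policy. In the sketch below, the policy enum, choose_policy and sort_with are hypothetical names, and the "parallel" branch is an assumed two-way split using std::async followed by std::inplace_merge, which is far simpler than what a real parallel sort would do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-in for the dynamic execution_policy.
enum class policy { seq, par };

// Picks the policy at run time, as on the slide.
policy choose_policy(std::size_t size, std::size_t threshold) {
    return size > threshold ? policy::par : policy::seq;
}

// Sorts vec according to the chosen policy.
void sort_with(policy exec, std::vector<int>& vec) {
    if (exec == policy::seq || vec.size() < 2) {
        std::sort(vec.begin(), vec.end());
        return;
    }
    // Assumed parallel strategy: sort the two halves concurrently, then merge.
    auto mid = vec.begin() + vec.size() / 2;
    auto left = std::async(std::launch::async,
                           [&] { std::sort(vec.begin(), mid); });
    std::sort(mid, vec.end());
    left.get();                                   // join before merging
    std::inplace_merge(vec.begin(), mid, vec.end());
}
```

The point of the run-time choice is that starting parallel work has a fixed cost, so below some input size the sequential path is faster; the right threshold depends on the machine and has to be measured.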

Exception Handling

In C++ philosophy, no exception is silently ignored
Exception list: container of exception_ptr objects

try
{
    std::inner_product(std::par, a.begin(), a.end(), b.begin(), func1, func2, 0);
}
catch(const exception_list& list)
{
    for(auto& exptr : list)
    {
        // process exception pointer exptr
    }
}

Slide31
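The exception list described on the previous slide exists because several element operations can fail independently during one parallel call, and none of those failures should be dropped. The sketch below shows the collection mechanism with std::exception_ptr. The names exception_ptr_list, run_all and describe are hypothetical, and the tasks run one after another here; a parallel algorithm would gather the pointers from its worker threads in the same way.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for exception_list: a container of exception_ptr.
using exception_ptr_list = std::vector<std::exception_ptr>;

// Runs every task. A failing task does not stop the others; its exception
// is captured and stored instead.
exception_ptr_list run_all(const std::vector<std::function<void()>>& tasks) {
    exception_ptr_list errors;
    for (const auto& task : tasks) {
        try {
            task();
        } catch (...) {
            errors.push_back(std::current_exception());
        }
    }
    return errors;
}

// Processes one exception pointer by rethrowing it and inspecting the result.
std::string describe(const std::exception_ptr& exptr) {
    try {
        std::rethrow_exception(exptr);
    } catch (const std::exception& e) {
        return e.what();
    } catch (...) {
        return "unknown exception";
    }
}
```

Rethrowing is the only portable way to inspect what an exception_ptr holds, which is why the handler loop on the slide would typically contain a nested try block like the one in describe.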

Vectorization: A Tale From Agriculture

Slide32

A Tale From Agriculture

Slide33

A Tale From Agriculture

Slide34

Idea: Fewer Tractors, Wider Plows

Slide35

Vectorization: What’s a Big Deal?

int a[n] = ...;
int b[n] = ...;
for(int i=0; i<n; ++i)
{
    a[i] = b[i] + c;
}

movdqu xmm1, XMMWORD PTR _b$[esp+eax+132]
movdqu xmm0, XMMWORD PTR _a$[esp+eax+132]
paddd  xmm1, xmm2
paddd  xmm1, xmm0
movdqu XMMWORD PTR _a$[esp+eax+132], xmm1

a[i:i+3] = b[i:i+3] + c

movdqu: Move Unaligned Double Quadword

Slide36
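The a[i:i+3] notation on the previous slide stands for four neighboring elements handled by one vector instruction. Written out in plain C++, the transformation looks like the sketch below: the loop is split into groups of four plus a scalar remainder. The function names add_scalar and add_by_fours are hypothetical, and whether a compiler actually emits vector instructions for either form depends on its optimizer and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference loop: a[i] = b[i] + c.
void add_scalar(int* a, const int* b, int c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + c;
}

// The same computation arranged in groups of four lanes.
void add_by_fours(int* a, const int* b, int c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // Conceptually a[i:i+3] = b[i:i+3] + c. Four 32-bit ints fill one
        // 128-bit register, so one packed add can cover all four lanes.
        a[i]     = b[i]     + c;
        a[i + 1] = b[i + 1] + c;
        a[i + 2] = b[i + 2] + c;
        a[i + 3] = b[i + 3] + c;
    }
    for (; i < n; ++i)                 // remainder when n is not a multiple of 4
        a[i] = b[i] + c;
}
```

The two functions produce identical results because the four statements in each group are independent of one another; that independence is exactly what the later slides show can break down.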

Vector Lane is not a Thread!

Taking locks
  Thread with thread_id x takes a lock…
  Then another “thread” with the same thread_id enters the lock…
  Deadlock!!!
Exceptions
  Can we unwind 1/4 of the stack?

Slide37

Vectorization: Not So Easy Any More…

void f(int* a, int* b)
{
    for(int i=0; i<n; ++i)
    {
        a[i] = b[i] + c;
        func();
    }
}

mov  ecx, DWORD PTR _b$[esp+esi+140]
add  ecx, edi
add  DWORD PTR _a$[esp+esi+140], ecx
call func

Aliasing?
Side effects?
Dependence?
Exceptions?

Slide38

How Do We Get This?

void f(float* a, float* b)
{
    for(int i=0; i<n; ++i)
    {
        a[i] = b[i] + c;
        func();
    }
}

for(int i=0; i<n; i+=4)
{
    a[i:i+3] = b[i:i+3] + c;
    for(int j=0; j<4; ++j)
        func();
}

Need a helping hand from the programmer!

Slide39

Vectorization Hazard: Locks

for(int i=0; i<n; ++i)
{
    lock.enter();
    a[i] = b[i] + c;
    lock.release();
}

for(int i=0; i<n; i+=4)
{
    for(int j=0; j<4; ++j)
        lock.enter();
    a[i:i+3] = b[i:i+3] + c;
    for(int j=0; j<4; ++j)
        lock.release();
}

This transformation is not safe!

Consider: […] takes a lock, […] releases the lock: ?

Slide40
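The unsafe transformation on the previous slide can be demonstrated without risking a real deadlock by using a lock that reports re-entry instead of blocking. In the sketch below, simple_lock and both checking functions are hypothetical; try_enter returning false marks the point where a blocking, non-recursive lock would hang.

```cpp
#include <cassert>

// Hypothetical non-recursive lock that reports re-entry instead of blocking.
struct simple_lock {
    bool held = false;
    bool try_enter() {
        if (held) return false;   // a blocking lock would deadlock here
        held = true;
        return true;
    }
    void release() { held = false; }
};

// Scalar order: enter, work, release for each element.
// Returns true if every enter succeeded.
bool scalar_order_ok(int n) {
    simple_lock lock;
    for (int i = 0; i < n; ++i) {
        if (!lock.try_enter()) return false;
        // ... a[i] = b[i] + c ...
        lock.release();
    }
    return true;
}

// Vectorized order: four enters, then the work, then four releases.
// Returns false as soon as an enter finds the lock already held.
bool vectorized_order_ok(int n) {
    simple_lock lock;
    for (int i = 0; i < n; i += 4) {
        for (int j = 0; j < 4; ++j)
            if (!lock.try_enter()) return false;
        // ... a[i:i+3] = b[i:i+3] + c ...
        for (int j = 0; j < 4; ++j)
            lock.release();
    }
    return true;
}
```

The second lane's enter fails because the first lane, running on the same thread, already holds the lock and cannot release it until all lanes have entered. A recursive lock would avoid the hang but would no longer provide mutual exclusion between lanes.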

But Wait, There Is One Little Problem…

Element-based algorithm:

void f(float* a, float* b)
{
    std::for_each(a, b, [&](float f)
    {
        // Oops, no ‘i’:
        a[i] = b[i] + c;
        func();
    });
}

Index-based algorithm:

void f(float* a, float* b)
{
    for(int i=0; i<n; ++i)
    {
        // OK:
        a[i] = b[i] + c;
        func();
    }
}

Slide41

Vector Loop with Parallel STL

void f(float* a, float* b)
{
    integer_iterator begin {0};   // almost, see N3976
    integer_iterator end {b-a};
    std::for_each(std::par_vec, begin, end, [&](int i)
    {
        a[i] = b[i] + c;
        func();
    });
}

Slide42
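The integer_iterator used on the previous slide is not a standard type; the slide itself points to N3976 for the real design. The sketch below shows the underlying idea with a hypothetical index_iterator whose dereference yields the index itself, so that std::for_each hands the lambda an i rather than an element. It calls the ordinary serial std::for_each, and add_constant is an assumed example function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical counting iterator: dereferencing it yields the current index.
struct index_iterator {
    using iterator_category = std::input_iterator_tag;
    using value_type        = std::ptrdiff_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const std::ptrdiff_t*;
    using reference         = std::ptrdiff_t;

    std::ptrdiff_t i;

    reference operator*() const { return i; }
    index_iterator& operator++() { ++i; return *this; }
    index_iterator operator++(int) { index_iterator old = *this; ++i; return old; }
    bool operator==(const index_iterator& other) const { return i == other.i; }
    bool operator!=(const index_iterator& other) const { return i != other.i; }
};

// Index-based loop expressed through an element-based algorithm.
void add_constant(float* a, const float* b, float c, std::ptrdiff_t n) {
    index_iterator begin{0};
    index_iterator end{n};
    std::for_each(begin, end, [&](std::ptrdiff_t i) {
        a[i] = b[i] + c;
    });
}
```

Iterating over indices instead of elements is what lets one lambda read b[i] and write a[i] in the same call, which the element-based form two slides earlier could not express.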

Parallelization vs. Vectorization

Parallelization
  Threads
  Stack
  Good for divergent code
  Relatively heavy-weight
Vectorization
  Vector Lanes
  No stack
  Lock-step execution
  Very light-weight

Slide43

When To Vectorize

std::par
  No race conditions
  No aliasing
std::par_vec
  Same as std::par, plus:
  No Exceptions
  No Locks
  No/Little Divergence

Slide44

References

N3991: Task Region
N3872: A Primer on Scheduling Fork-Join Parallelism with Work Stealing
N3724: A Parallel Algorithms Library
N3989: Working Draft, Technical Specification for C++ Extensions for Parallelism
N3976: Multidimensional bounds, index and array_view
parallelstl.codeplex.com

Slide45
