
Slide1

Optimizing compiler.

Vectorization


Slide2

The trend in microprocessor development: from sequential instruction execution to parallel instruction execution.

Different kinds of parallelism:
- Pipeline
- Superscalar
- Vector operations
- Multi-core and multiprocessor tasks

An optimizing compiler is a tool which translates source code into an executable module and optimizes the code for better performance. Parallelization is a transformation of a sequential program into a multi-threaded or vectorized one (or both) that utilizes multiple execution units simultaneously without breaking the correctness of the program.

Slide3

Vectorization is an example of data parallelism (SIMD):

C[1] = A[1]+B[1]
C[2] = A[2]+B[2]
C[3] = A[3]+B[3]
C[4] = A[4]+B[4]

A single vector operation computes all four sums at once:

[A[1] A[2] A[3] A[4]] + [B[1] B[2] B[3] B[4]] = [C[1] C[2] C[3] C[4]]

Slide4

10/17/10

The approximate scheme of loop vectorization for

for(i=0;i<100;i++) p[i]=b[i];

Scalar execution, 100 operations:
0: p[0]=b[0]
1: p[1]=b[1]
2: p[2]=b[2]
3: p[3]=b[3]
...
97: p[97]=b[97]
98: p[98]=b[98]
99: p[99]=b[99]

Vectorized execution, with a vector register holding 8 elements:
Non-aligned starting elements peeled off:
p[0]=b[0]
p[1]=b[1]
Vector operations:
p[2:9]=b[2:9]
...
p[90:97]=b[90:97]
Tail loop:
p[98]=b[98]
p[99]=b[99]

Total: 16 operations.

Slide5

A typical vector instruction is an operation on two vectors in memory or in fixed-length registers. These vectors can be loaded from memory by a single operation or by multiple operations.

Vectorization is a compiler optimization that inserts vector instructions instead of scalar ones. This optimization "wraps" the data into vectors; scalar operations are replaced by operations on these vectors (packets). Such an optimization can also be performed manually by the developer.

The Fortran array section notation A(l:u:s), for example A(1:10:3), is very convenient for representing vector registers.

for(i=0;i<U;i++) {
  S1: lhs1[i] = rhs1[i];
  ...
  Sn: lhsn[i] = rhsn[i];
}

becomes

for(i=0;i<U;i+=vl) {
  S1: lhs1[i:i+vl-1:1] = rhs1[i:i+vl-1:1];
  ...
  Sn: lhsn[i:i+vl-1:1] = rhsn[i:i+vl-1:1];
}

Slide6

MMX, SSE vector instruction sets

MMX is a single instruction, multiple data (SIMD) instruction set designed by Intel, introduced in 1996 for the P5-based Pentium line of microprocessors, known as "Pentium with MMX Technology". MMX (Multimedia Extensions) is a set of instructions that perform specific actions for streaming audio/video encoding and decoding.

MMX provides:
- eight 64-bit registers MM0-MM7 (aliased with the existing 80-bit FP stack registers)
- the concept of packed data: each register can store one 64-bit integer, two 32-bit, four 16-bit, or eight 8-bit integers
- 57 instructions, divided into groups: data movement, arithmetic, comparison, conversion, logical, shift, shuffle, unpack, cache control and prefetch, and state management

MMX provides only integer operations. Simultaneous operations with floats and packed integers are not possible.

Slide7

Streaming SIMD Extensions (SSE) is a SIMD instruction set extension of the x86 architecture, designed by Intel and introduced with the Pentium III series processors in 1999. SIMD instructions can greatly increase performance when exactly the same operations are performed on multiple data objects. Typical applications are digital signal processing and computer graphics.

SSE provides:
- eight 128-bit registers (xmm0 to xmm7)
- a set of instructions for operations with scalar and packed data types
- 70 new instructions, mainly for single-precision floating-point data

SSE2, SSE3, SSSE3, and SSE4 are further extensions of SSE. SSE2 adds packed data types with double-precision floating point.

Advanced Vector Extensions (AVX) is an extension of the x86 instruction set architecture for Intel and AMD microprocessors, proposed by Intel in March 2008, first supported by the Intel Sandy Bridge processors in Q1 2011 and by the AMD Bulldozer processors in Q3 2011. AVX provides new features, new instructions, and a new coding scheme. The size of the vector registers is increased from 128 to 256 bits (YMM0-YMM15). The existing 128-bit instructions use the lower half of the YMM registers.

Slide8

Different data types can be packed in vector registers as follows:

Packed data type       Vector length   Bits per element   Data type range
signed bytes           16              8                  -2**7 to 2**7-1
unsigned bytes         16              8                  0 to 2**8-1
signed words           8               16                 -2**15 to 2**15-1
unsigned words         8               16                 0 to 2**16-1
signed doublewords     4               32                 -2**31 to 2**31-1
unsigned doublewords   4               32                 0 to 2**32-1
signed quadwords       2               64                 -2**63 to 2**63-1
unsigned quadwords     2               64                 0 to 2**64-1
single-precision fps   4               32                 2**-126 to 2**127
double-precision fps   2               64                 2**-1022 to 2**1023

Selecting the appropriate data type for calculations can significantly affect application performance.

Slide9

Optimization with switches

SIMD support in SSE, SSE2, SSE3, and SSE4.2 covers the packed data types:
16x bytes
8x words
4x dwords
2x qwords
1x dqword
4x floats
2x doubles

These types are covered progressively by MMX*, SSE, SSE2, SSE3, and SSE4.2.

* MMX actually used the x87 floating-point registers; SSE, SSE2, and SSE3 use the new SSE registers.

Slide10

Instruction groups

Data movement instructions:

Instruction       Suffix             Description
movdqa                               move double quadword aligned
movdqu                               move double quadword unaligned
mova              [ ps, pd ]         move floating-point aligned
movu              [ ps, pd ]         move floating-point unaligned
movhl             [ ps ]             move packed floating-point high to low
movlh             [ ps ]             move packed floating-point low to high
movh              [ ps, pd ]         move high packed floating-point
movl              [ ps, pd ]         move low packed floating-point
mov               [ d, q, ss, sd ]   move scalar data
lddqu                                load double quadword unaligned
mov<d/sh/sl>dup                      move and duplicate
pextr             [ w ]              extract word
pinsr             [ w ]              insert word
pmovmsk           [ b ]              move mask
movmsk            [ ps, pd ]         move mask

An aligned data movement instruction cannot be applied to a memory location which is not aligned to 16 bytes.

Slide11

Intel arithmetic instructions:

Instruction   Suffix          Description
padd          [ b, w, d, q ]  packed addition (signed and unsigned)
psub          [ b, w, d, q ]  packed subtraction (signed and unsigned)
padds         [ b, w ]        packed addition with saturation (signed)
paddus        [ b, w ]        packed addition with saturation (unsigned)
psubs         [ b, w ]        packed subtraction with saturation (signed)
psubus        [ b, w ]        packed subtraction with saturation (unsigned)
pmins         [ w ]           packed minimum (signed)
pminu         [ b ]           packed minimum (unsigned)
pmaxs         [ w ]           packed maximum (signed)
pmaxu         [ b ]           packed maximum (unsigned)

Slide12

Floating-point arithmetic instructions:

Instruction   Suffix              Description
add           [ ss, ps, sd, pd ]  addition
div           [ ss, ps, sd, pd ]  division
min           [ ss, ps, sd, pd ]  minimum
max           [ ss, ps, sd, pd ]  maximum
mul           [ ss, ps, sd, pd ]  multiplication
sqrt          [ ss, ps, sd, pd ]  square root
sub           [ ss, ps, sd, pd ]  subtraction
rcp           [ ss, ps ]          approximated reciprocal
rsqrt         [ ss, ps ]          approximated reciprocal square root

Idiomatic arithmetic instructions:

Instruction          Suffix      Description
pavg                 [ b, w ]    packed average with rounding (unsigned)
pmulh/pmulhu/pmull   [ w ]       packed multiplication
psad                 [ bw ]      packed sum of absolute differences (unsigned)
pmadd                [ wd ]      packed multiplication and addition (signed)
addsub               [ ps, pd ]  floating-point addition/subtraction
hadd                 [ ps, pd ]  floating-point horizontal addition
hsub                 [ ps, pd ]  floating-point horizontal subtraction

Slide13

Logical instructions:

Instruction   Suffix      Description
pand                      bitwise logical AND
pandn                     bitwise logical AND-NOT
por                       bitwise logical OR
pxor                      bitwise logical XOR
and           [ ps, pd ]  bitwise logical AND
andn          [ ps, pd ]  bitwise logical AND-NOT
or            [ ps, pd ]  bitwise logical OR
xor           [ ps, pd ]  bitwise logical XOR

Comparison instructions:

Instruction   Suffix              Description
pcmp<cc>      [ b, w, d ]         packed compare
cmp<cc>       [ ss, ps, sd, pd ]  floating-point compare

<cc> defines the comparison operation: lt - less, gt - greater, eq - equal.

Slide14

Conversion instructions:

Instruction   Suffix      Description
packss        [ wb, dw ]  pack with saturation (signed)
packus        [ wb ]      pack with saturation (unsigned)
cvt<s2d>                  conversion
cvtt<s2d>                 conversion with truncation

Shift instructions:

Instruction   Suffix           Description
psll          [ w, d, q, dq ]  shift left logical (zero in)
psra          [ w, d ]         shift right arithmetic (sign in)
psrl          [ w, d, q, dq ]  shift right logical (zero in)

Shuffle instructions:

Instruction   Suffix      Description
pshuf         [ w, d ]    packed shuffle
pshufh        [ w ]       packed shuffle high
pshufl        [ w ]       packed shuffle low
shuf          [ ps, pd ]  shuffle

Slide15

Unpack instructions:

Instruction   Suffix              Description
punpckh       [ bw, wd, dq, qdq ] unpack high
punpckl       [ bw, wd, dq, qdq ] unpack low
unpckh        [ ps, pd ]          unpack high
unpckl        [ ps, pd ]          unpack low

Cacheability control and prefetch instructions:

Instruction      Suffix              Description
movnt            [ ps, pd, q, dq ]   move aligned non-temporal
prefetch<hint>                       prefetch with hint

State management instructions: these instructions are commonly used by the operating system.

Slide16

Three sets of switches to enable processor-specific extensions

Switches -x<EXT>, like -xSSE4_1:
  Imply an Intel processor check. Run-time error message at program start when launched on a processor without <EXT>.

Switches -m<EXT>, like -mSSE3:
  No processor check. Illegal instruction fault when launched on a processor without <EXT>.

Switches -ax<EXT>, like -axSSE4_2:
  Automatic processor dispatch: multiple code paths. The processor check is only available on Intel processors; non-Intel processors take the default path. The default path is -mSSE2 but can be modified by another -x<EXT> switch.

Slide17

Simple estimation of vectorization profitability

A typical vector instruction is an operation on two vectors in memory or in fixed-length registers. These vectors can be loaded from memory in a single operation or in parts. A description of the foundations of SSE technology can be found in Chapter 10, "Programming with Streaming SIMD Extensions (SSE)", of the "Intel 64 and IA-32 Architectures Software Developer's Manual", Volume 1.

Microsoft Visual Studio supports a set of SSE intrinsics that allows you to use SSE instructions directly from C/C++ code. You need to include xmmintrin.h, which defines the vector type __m128 and vector operations.

For example, suppose we need to vectorize the following loop manually:

for(i=0;i<N;i++)
  C[i]=A[i]*B[i];

To do this:
1) organize and fill the vector variables
2) use the multiply intrinsic on the vector variables
3) write the results of the calculations back to memory

Slide18

An example illustrating vectorization with SSE intrinsics:

#include <stdio.h>
#include <xmmintrin.h>
#define N 40

int main() {
  float a[N][N][N],b[N][N][N],c[N][N][N];
  int i,j,k,rep;
  __m128 *xa,*xb,*xc;
  for(i=0;i<N;i++)
    for(j=0;j<N;j++)
      for(k=0;k<N;k++) {
        a[i][j][k]=1.0; b[i][j][k]=2.0;
      }
  for(rep=0;rep<10000;rep++) {
#ifdef PERF
    for(i=0;i<N;i++)
      for(j=0;j<N;j++)
        for(k=0;k<N;k+=4) {
          xa=(__m128*)&(a[i][j][k]);
          xb=(__m128*)&(b[i][j][k]);
          xc=(__m128*)&(c[i][j][k]);
          *xc=_mm_mul_ps(*xa,*xb);
        }
#else
    for(i=0;i<N;i++)
      for(j=0;j<N;j++)
        for(k=0;k<N;k++)
          c[i][j][k]=a[i][j][k]*b[i][j][k];
#endif
  }
  printf("%f\n",c[21][11][18]);
}

Intel compiler 12.0 was used for this experiment:

icl -Od test.c -Fetest1.exe
icl -Od test.c -DPERF -Fetest_opt.exe

time test1.exe
2.000000
CPU time for command: 'test1.exe'
real 3.406 sec
user 3.391 sec
system 0.000 sec

time test_opt.exe
2.000000
CPU time for command: 'test_opt.exe'
real 1.281 sec
user 1.250 sec
system 0.000 sec

Slide19

The resulting speedup is 2.7x. We used 16-byte-aligned memory access instructions in this example. The alignment happened to match accidentally; in a real case you need to take care of it. The compiler-optimized test shows the following result:

icl test.c -Qvec_report3 -Fetest_intel_opt.exe

time test_intel_opt.exe
2.000000
CPU time for command: 'test_intel_opt.exe'
real 0.328 sec
user 0.313 sec
system 0.000 sec

Slide20

Admissibility of vectorization

Vectorization is a permutation optimization: the initial execution order is changed during vectorization. A permutation optimization is acceptable if it preserves the order of dependencies. Thus we have a criterion for the admissibility of vectorization in terms of dependencies.

The simplest case is when there are no dependencies inside the processed loop. In a more complicated case there are dependencies inside the vectorized loop, but their order is the same as inside the initial scalar loop.

Let's recall the definition of a dependency in a loop. There is a loop dependency between statements S1 and S2 in a set of nested loops if and only if:
1) there are two loop nest iteration vectors i and j such that i < j, or i = j and there is a path from S1 to S2 inside the loop;
2) statement S1 on iteration i and statement S2 on iteration j refer to the same memory area;
3) one of these statements writes to this memory.

Slide21

Options for vectorization control

/Qvec-report[n]  control the amount of vectorizer diagnostic information
  n=0  no diagnostic information
  n=1  indicate vectorized loops (DEFAULT)
  n=2  indicate vectorized/non-vectorized loops
  n=3  indicate vectorized/non-vectorized loops and prohibiting data dependence information
  n=4  indicate non-vectorized loops
  n=5  indicate non-vectorized loops and prohibiting data dependence information

Usage: icl -c -Qvec_report3 loop.c

Diagnostic examples:
C:\loops\loop1.c(5) (col. 1): remark: LOOP WAS VECTORIZED.
C:\loops\loop3.c(5) (col. 1): remark: loop was not vectorized: vectorization possible but seems inefficient.
C:\loops\loop6.c(5) (col. 1): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

Slide22

Simple criteria of vectorization admissibility

Let's express loop vectorization using Fortran array sections. A good criterion for vectorizability is that introducing array sections does not create a dependency.

DO I=1,N
  A(I)=A(I)
END DO

DO I=1,N,VL
  A(I:I+VL-1)=A(I:I+VL-1)
END DO

DO I=1,N
  A(I+1)=A(I)
END DO

DO I=1,N,VL
  A(I+1:I+VL)=A(I:I+VL-1)
END DO

There is a dependency: the elements written by A(I+1:I+VL) on iteration I intersect the elements read on the next iteration, so the vectorized loop is not equivalent to the scalar one.

DO I=1,N
  A(I-1)=A(I)
END DO

DO I=1,N,VL
  A(I-1:I+VL-2)=A(I:I+VL-1)
END DO

There is no dependency: the elements written by A(I-1:I+VL-2) on iteration I do not intersect the elements read on the next iteration.

Slide23

Let's check an assumption: a loop can be vectorized if the dependence distance is greater than or equal to the number of array elements within the vector register. Check this with the compiler:

PROGRAM TEST_VEC
INTEGER,PARAMETER :: N=1000
#ifdef PERF
INTEGER,PARAMETER :: P=4
#else
INTEGER,PARAMETER :: P=3
#endif
INTEGER A(N)
DO I=1,N-P
  A(I+P)=A(I)
END DO
PRINT *,A(50)
END

ifort test.F90 -o a.out -vec_report3
echo -------------------------------------
ifort test.F90 -DPERF -o b.out -vec_report3

./build.sh
test.F90(11): (col. 1) remark: loop was not vectorized: existence of vector dependence.
-------------------------------------
test.F90(11): (col. 1) remark: LOOP WAS VECTORIZED.

Slide24

Dependency analysis and directives

There are two tasks the compiler should perform for dependency evaluation:
- alias analysis (pointers which can address the same memory should be detected)
- definition-use chain analysis

The compiler should prove that there are no aliased objects and precisely calculate the dependencies. This is a hard task and sometimes the compiler isn't able to solve it. There are methods of providing additional information to the compiler:
- option -ansi_alias (pointers can refer only to objects of the same or a compatible type)
- the restrict attribute for pointer arguments (C/C++)
- #pragma ivdep, which says that there are no dependencies in the following loop (C/C++)
- !DEC$ IVDEP, the Fortran analogue of #pragma ivdep

Slide25

Some performance issues for vectorized code

Let's consider a simple test with an assignment that is appropriate for vectorization, and obtain vectorized code with the Intel Fortran compiler for different values of the SHIFT macro (/fpp is the preprocessor option). The Intel compiler vectorizes at optimization level 2 or 3 (-O2 or -O3); option -Ob0 is used to forbid inlining.

INTEGER :: A(1000),B(1000)
INTEGER I,K
INTEGER, PARAMETER :: REP = 500000
A = 2
DO K=1,REP
  CALL ADD(A,B)
END DO
PRINT *,SHIFT,B(101)
CONTAINS
SUBROUTINE ADD(A,B)
INTEGER A(1000),B(1000)
INTEGER I
!DEC$ UNROLL(0)
DO I=1,1000-SHIFT
  B(I) = A(I+SHIFT)+1
END DO
END SUBROUTINE
END

Slide26

Experiment results

ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=0 -Fea.exe -Qvec_report >a.out 2>&1
ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=1 -Feb.exe -Qvec_report >b.out 2>&1

time.exe a.exe
0 3
CPU time for command: 'a.exe'
real 0.125 sec
user 0.094 sec
system 0.000 sec

time.exe b.exe
1 3
CPU time for command: 'b.exe'
real 0.297 sec
user 0.281 sec
system 0.000 sec

Slide27

ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=0 /Fas -Ob0 -S -Fafast.s

fast.s:

.B2.5:          ; Preds .B2.5 .B2.4
$LN83:
;;; B(I) = A(I+SHIFT)+1
        movdqa xmm1, XMMWORD PTR [eax+ecx*4]    ;17.11
$LN84:
        paddd xmm1, xmm0                        ;17.4
$LN85:
        movdqa XMMWORD PTR [edx+ecx*4], xmm1    ;17.4
$LN86:
        add ecx, 4                              ;16.3
$LN87:
        cmp ecx, 1000                           ;16.3
$LN88:
        jb .B2.5        ; Prob 99%              ;16.3

Slide28

ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=1 /Fas -Ob0 -S -Faslow.s

slow.s:

.B2.5:          ; Preds .B2.5 .B2.4
$LN81:
;;; B(I) = A(I+SHIFT)+1
        movdqu xmm1, XMMWORD PTR [4+eax+ecx*4]  ;17.11
$LN82:
        paddd xmm1, xmm0                        ;17.4
$LN83:
        movdqa XMMWORD PTR [edx+ecx*4], xmm1    ;17.4
$LN84:
        add ecx, 4                              ;16.3
$LN85:
        cmp ecx, 996                            ;16.3
$LN86:
        jb .B2.5        ; Prob 99%              ;16.3

CONCLUSION:
MOVDQA is Move Aligned Double Quadword; MOVDQU is Move Unaligned Double Quadword. In the fast version aligned instructions are used and the vector registers are filled faster. Unaligned instructions are slower; on the latest architectures they show the same performance as aligned instructions when applied to aligned data.

Slide29

Performance of a vectorized loop depends on the memory location of the objects used. An important aspect of program performance is the memory alignment of the data.

Data structure alignment is the way data is placed in computer memory. This concept includes two distinct but related issues: alignment of the data itself (data alignment) and data structure padding.

Data alignment specifies how data is located relative to memory boundaries. This property is usually associated with a data type.

Data structure padding involves inserting unnamed fields into a data structure in order to preserve the relative alignment of the structure fields.

Slide30

Data alignment

Information about alignment can be obtained with the intrinsic __alignof__. The size and default alignment of a variable of a given type may depend on the compiler target (ia32 or intel64):

printf("int: sizeof=%d align=%d\n", sizeof(a), __alignof__(a));

Alignment for the ia32 Intel C++ compiler:

bool            sizeof = 1  alignof = 1
wchar_t         sizeof = 2  alignof = 2
short int       sizeof = 2  alignof = 2
int             sizeof = 4  alignof = 4
long int        sizeof = 4  alignof = 4
long long int   sizeof = 8  alignof = 8
float           sizeof = 4  alignof = 4
double          sizeof = 8  alignof = 8
long double     sizeof = 8  alignof = 8
void*           sizeof = 4  alignof = 4

The same rules are used for array alignment. It is possible to force the compiler to align an object in a certain way:

__declspec(align(16)) float x[N];

Slide31

Data structure alignment

struct foo {
  bool a;
  short b;
  long long c;
  bool d;
};

is laid out in memory as

struct foo {
  bool a;  char pad1[1];
  short b; char pad2[4];
  long long c;
  bool d;  char pad3[7];
};

The order of fields in a structure affects the size of objects of the derived type. To reduce the size of the object, structure fields should be sorted by descending size. You can use __declspec to align structure fields:

typedef struct aStruct {
  __declspec(align(16)) float x[N];
  __declspec(align(16)) float y[N];
  __declspec(align(16)) float z[N];
} aStruct;

Slide32

The approximate scheme of loop vectorization for

for(i=0;i<100;i++) p[i]=b[i];

Scalar execution, 100 operations:
0: p[0]=b[0]
1: p[1]=b[1]
2: p[2]=b[2]
3: p[3]=b[3]
...
97: p[97]=b[97]
98: p[98]=b[98]
99: p[99]=b[99]

Vectorized execution, with a vector register holding 8 elements:
Non-aligned starting elements peeled off:
p[0]=b[0]
p[1]=b[1]
Vector operations:
p[2:9]=b[2:9]
...
p[90:97]=b[90:97]
Tail loop:
p[98]=b[98]
p[99]=b[99]

Loop vectorization usually produces three loops: a loop for the non-aligned starting elements, the vectorized loop, and the tail. Vectorization of a loop with a small number of iterations can be unprofitable.

Slide33

Additional vectorization example

vec.c:

void Calculate(float * a, float * b, float * c, int n) {
  int i;
  for(i=0;i<n;i++) {
    a[i] = a[i]+b[i]+c[i];
  }
  return;
}

main.c:

#include <stdio.h>
#define N 1000
extern void Calculate(float *, float *, float *, int);
int main() {
  float x[N],y[N],z[N];
  int i,rep;
  for(i=0;i<N;i++) {
    x[i] = 1; y[i] = 0; z[i] = 1;
  }
  for(rep=0;rep<10000000;rep++) {
    Calculate(&x[1],&y[0],&z[0],N-1);
  }
  printf("x[1]=%f\n",x[1]);
}

The alignment of the first argument differs from the others.

icl main.c vec.c -O1 -FeA
time a.exe
12.6 s

Slide34

The compiler performs auto-vectorization at -O2 or -O3. Option -Qvec_report informs about vectorized loops.

icl main.c vec.c -O2 -Qvec_report -Feb
vec.c(3): (col. 3) remark: LOOP WAS VECTORIZED.
time b.exe
3.67 s

Vectorization is possible because the compiler inserts a run-time check for the case when some of the pointers may be aliased. The application size is enlarged.

1) Add restrict to the pointer arguments:

void Calculate(float * restrict a, float * restrict b, float * restrict c, int n) {

2) To use the restrict attribute we need to add the option -Qstd=c99:

icl main.c vec.c -Qstd=c99 -O2 -Qvec_report -Fec
vec.c(3): (col. 3) remark: LOOP WAS VECTORIZED.
time c.exe
3.55 s

A small improvement, because the run-time check is avoided.

Useful fact: on modern systems the performance of aligned and unaligned instructions is almost the same when applied to aligned objects.

Slide35

3) Align the arrays in main and inform the compiler about the alignment:

int main() {
  __declspec(align(16)) float x[N];
  __declspec(align(16)) float y[N];
  __declspec(align(16)) float z[N];
  ...
  Calculate(&x[0],&y[0],&z[0],N-1);

void Calculate(float * restrict a, float * restrict b, float * restrict c, int n) {
  __assume_aligned(a,16);
  __assume_aligned(b,16);
  __assume_aligned(c,16);
  ...

icl main.c vec.c -Qstd=c99 -O2 -Qvec_report -Fed
vec.c(3): (col. 3) remark: LOOP WAS VECTORIZED.
time d.exe
3.20 s

This update demonstrates an improvement because of the better alignment of the vectorized objects. The arrays in main are aligned to 16, so all argument pointers are well aligned, and the compiler is informed by the __assume_aligned directive. This allows it to remove the first scalar (peel) loop.

Slide36

Data alignment

Good array data alignment: 16 bytes for SSE, 32 bytes for AVX.

Data alignment directives:
C/C++ Windows: __declspec(align(16)) float X[N];
Linux/MacOS:   float X[N] __attribute__((aligned(16)));
Fortran:       !DIR$ ATTRIBUTES ALIGN: 16 :: A

Aligned malloc: _aligned_malloc(), _mm_malloc()

Data alignment assertion (16-byte example):
C/C++:   __assume_aligned(p,16);
Fortran: !DIR$ ASSUME_ALIGNED A(1):16

Aligned loop assertion:
C/C++:   #pragma vector aligned
Fortran: !DIR$ VECTOR ALIGNED

Slide37

Non-unit stride and unit stride access

Well-aligned data is better for vectorization because in this case a vector register is filled by a single operation. In the case of non-unit-stride access to an array, filling the register is a more complicated task and vectorization is less profitable. The auto-vectorizer cooperates with loop optimizations to improve access to objects.

There are compiler directives which recommend vectorization in cases where the compiler does not perform it because it looks unprofitable:

C/C++:
#pragma vector {aligned|unaligned|always}
#pragma novector

Fortran:
!DEC$ VECTOR ALWAYS
!DEC$ NOVECTOR

Slide38

Vectorization of an outer loop

Usually the auto-vectorizer processes the innermost loop of a nest. Vectorization of the outer loop can be requested with the "simd" directive.

#define N 200
#include <stdio.h>
int main() {
  int A[N][N],B[N][N],C[N][N];
  int i,j,rep;
  for(i=0;i<N;i++)
    for(j=0;j<N;j++) {
      A[i][j]=i+j; B[i][j]=2*j-i; C[i][j]=0;
    }
  for(rep=0;rep<10000000;rep++) {
#pragma simd
    for(i=0;i<N;i++) {
      j=0;
      while(j<N && A[i][j]<=B[i][j]) {
        C[i][j]=C[i][j]+B[j][i]-A[j][i];
        j++;
      }
    }
  }
  printf("%d\n",C[0][2]);
}

icl vec.c -O3 -Qvec- -Fea    (-Qvec- disables vectorization)    20.7 s
icl vec.c -O3 -Qvec_report -Feb    17 s
vec.c(17): (col. 3) remark: SIMD LOOP WAS VECTORIZED.

Slide39


Thank you.