Slide1
Optimizing compiler. Vectorization
Slide2
The trend in microprocessor development
Sequential instruction execution -> Parallel instruction execution
Different kinds of parallelism:
Pipeline
Superscalar
Vector operations
Multi-core and multiprocessor tasks
An optimizing compiler is a tool that translates source code into an executable module while transforming it for better performance. Parallelization is the transformation of a sequential program into a multi-threaded or vectorized one (or both) so that it utilizes multiple execution units simultaneously without breaking the correctness of the program.
Slide3
Vectorization is an example of data parallelism (SIMD):
C[1] = A[1]+B[1]
C[2] = A[2]+B[2]
C[3] = A[3]+B[3]
C[4] = A[4]+B[4]
A single vector operation computes all four sums at once:
[A[1] A[2] A[3] A[4]] + [B[1] B[2] B[3] B[4]] = [C[1] C[2] C[3] C[4]]
Slide4
10/17/10
for(i=0;i<100;i++) p[i]=b[i]

Scalar execution, 100 operations:
0:  p[0]=b[0]
1:  p[1]=b[1]
2:  p[2]=b[2]
...
98: p[98]=b[98]
99: p[99]=b[99]

The approximate scheme of the loop vectorization, with a vector register holding 8 elements, takes 16 operations:
Peel loop (the non-aligned starting elements are passed scalar): p[0]=b[0], p[1]=b[1]
Vector operations (12 of them): p[2:9]=b[2:9], ..., p[90:97]=b[90:97]
Tail loop: p[98]=b[98], p[99]=b[99]
Slide5
A typical vector instruction is an operation on two vectors held in memory or in fixed-length registers. These vectors can be loaded from memory by a single operation or by several.
Vectorization is a compiler optimization that emits vector instructions instead of scalar ones. The optimization "wraps" the data into vectors; scalar operations are replaced by operations on these vectors (packets). The same optimization can also be performed manually by the developer.
A Fortran array section A(1:n:k) — for example A(1:10:3) — is a convenient notation for the contents of a vector register.

for(i=0;i<U;i++) {
  S1: lhs1[i] = rhs1[i];
  ...
  Sn: lhsn[i] = rhsn[i];
}

becomes, for vector length vl:

for(i=0;i<U;i+=vl) {
  S1: lhs1[i:i+vl-1:1] = rhs1[i:i+vl-1:1];
  ...
  Sn: lhsn[i:i+vl-1:1] = rhsn[i:i+vl-1:1];
}
Slide6
MMX, SSE vector instruction sets

MMX is a single instruction, multiple data (SIMD) instruction set designed by Intel, introduced in 1996 for the P5-based Pentium line of microprocessors, known as "Pentium with MMX Technology". MMX (Multimedia Extensions) is a set of instructions that perform specific actions for streaming audio/video encoding and decoding.
MMX provides:
MM0-MM7: 64-bit registers (aliased onto the existing 80-bit FP stack registers)
the concept of packed data (each register can store one 64-bit integer, two 32-bit, four 16-bit, or eight 8-bit integers)
57 instructions, divided into groups: data movement, arithmetic, comparison, conversion, logical, shift, shuffle, unpack, cache control and prefetch, state management
MMX provides only integer operations. Simultaneous operations on floats and packed integers are not possible.
Slide7
Streaming SIMD Extensions (SSE) is a SIMD instruction set extension of the x86 architecture, designed by Intel and introduced with the Pentium III series processors in 1999. SIMD instructions can greatly increase performance when exactly the same operation is performed on multiple data objects. Typical applications are digital signal processing and computer graphics.
SSE provides:
8 128-bit registers (xmm0 to xmm7)
a set of instructions for operations on scalar and packed data types
70 new instructions, mainly for single-precision floating-point data
SSE2, SSE3, SSSE3 and SSE4 are further extensions of SSE. SSE2 adds packed data types with double-precision floating point.
Advanced Vector Extensions (AVX) is an extension of the x86 instruction set architecture for Intel and AMD microprocessors, proposed by Intel in March 2008. It was first supported by the Intel Sandy Bridge processors in Q1 2011 and by the AMD Bulldozer processors in Q3 2011.
AVX provides new features, new instructions and a new coding scheme. The size of the vector registers is increased from 128 to 256 bits (YMM0-YMM15). The existing 128-bit instructions use the lower half of the YMM registers.
Slide8
Different data types can be packed in vector registers as follows:

Packed data type          Vector length   Bits per element   Data type range
signed bytes              16              8                  -2**7 to 2**7-1
unsigned bytes            16              8                  0 to 2**8-1
signed words              8               16                 -2**15 to 2**15-1
unsigned words            8               16                 0 to 2**16-1
signed doublewords        4               32                 -2**31 to 2**31-1
unsigned doublewords      4               32                 0 to 2**32-1
signed quadwords          2               64                 -2**63 to 2**63-1
unsigned quadwords        2               64                 0 to 2**64-1
single-precision floats   4               32                 2**-126 to 2**127
double-precision floats   2               64                 2**-1022 to 2**1023

Selecting the appropriate data type for calculations can significantly affect application performance.
Slide9
Optimization with switches
SIMD - SSE, SSE2, SSE3, SSE4.2 support
Packed operand shapes: 16x bytes, 8x words, 4x dwords, 2x qwords, 1x dqword, 4x floats, 2x doubles
Instruction sets: MMX*, SSE, SSE2, SSE3, SSE4.2
* MMX actually used the x87 floating-point registers; SSE, SSE2 and SSE3 use the new SSE registers.
Slide10
Instruction groups

Data movement instructions:
Instruction       Suffix            Description
movdqa                              move double quadword aligned
movdqu                              move double quadword unaligned
mova              [ps, pd]          move floating-point aligned
movu              [ps, pd]          move floating-point unaligned
movhl             [ps]              move packed floating-point high to low
movlh             [ps]              move packed floating-point low to high
movh              [ps, pd]          move high packed floating-point
movl              [ps, pd]          move low packed floating-point
mov               [d, q, ss, sd]    move scalar data
lddqu                               load double quadword unaligned
mov<d/sh/sl>dup                     move and duplicate
pextr             [w]               extract word
pinsr             [w]               insert word
pmovmsk           [b]               move mask
movmsk            [ps, pd]          move mask

An aligned data movement instruction cannot be applied to a memory location that is not aligned to 16 bytes.
Slide11
Integer arithmetic instructions:
Instruction   Suffix          Description
padd          [b, w, d, q]    packed addition (signed and unsigned)
psub          [b, w, d, q]    packed subtraction (signed and unsigned)
padds         [b, w]          packed addition with saturation (signed)
paddus        [b, w]          packed addition with saturation (unsigned)
psubs         [b, w]          packed subtraction with saturation (signed)
psubus        [b, w]          packed subtraction with saturation (unsigned)
pmins         [w]             packed minimum (signed)
pminu         [b]             packed minimum (unsigned)
pmaxs         [w]             packed maximum (signed)
pmaxu         [b]             packed maximum (unsigned)
Slide12
Floating-point arithmetic instructions:
Instruction   Suffix              Description
add           [ss, ps, sd, pd]    addition
div           [ss, ps, sd, pd]    division
min           [ss, ps, sd, pd]    minimum
max           [ss, ps, sd, pd]    maximum
mul           [ss, ps, sd, pd]    multiplication
sqrt          [ss, ps, sd, pd]    square root
sub           [ss, ps, sd, pd]    subtraction
rcp           [ss, ps]            approximated reciprocal
rsqrt         [ss, ps]            approximated reciprocal square root

Idiomatic arithmetic instructions:
Instruction          Suffix     Description
pavg                 [b, w]     packed average with rounding (unsigned)
pmulh/pmulhu/pmull   [w]        packed multiplication
psad                 [bw]       packed sum of absolute differences (unsigned)
pmadd                [wd]       packed multiplication and addition (signed)
addsub               [ps, pd]   floating-point addition/subtraction
hadd                 [ps, pd]   floating-point horizontal addition
hsub                 [ps, pd]   floating-point horizontal subtraction
Slide13
Logical instructions:
Instruction   Suffix     Description
pand                     bitwise logical AND
pandn                    bitwise logical AND-NOT
por                      bitwise logical OR
pxor                     bitwise logical XOR
and           [ps, pd]   bitwise logical AND
andn          [ps, pd]   bitwise logical AND-NOT
or            [ps, pd]   bitwise logical OR
xor           [ps, pd]   bitwise logical XOR

Comparison instructions:
Instruction   Suffix             Description
pcmp<cc>      [b, w, d]          packed compare
cmp<cc>       [ss, ps, sd, pd]   floating-point compare
<cc> defines the comparison operation: lt - less, gt - greater, eq - equal.
Slide14
Conversion instructions:
Instruction   Suffix     Description
packss        [wb, dw]   pack with saturation (signed)
packus        [wb]       pack with saturation (unsigned)
cvt<s2d>                 conversion
cvtt<s2d>                conversion with truncation

Shift instructions:
Instruction   Suffix          Description
psll          [w, d, q, dq]   shift left logical (zero in)
psra          [w, d]          shift right arithmetic (sign in)
psrl          [w, d, q, dq]   shift right logical (zero in)

Shuffle instructions:
Instruction   Suffix     Description
pshuf         [w, d]     packed shuffle
pshufh        [w]        packed shuffle high
pshufl        [w]        packed shuffle low
shuf          [ps, pd]   shuffle
Slide15
Unpack instructions:
Instruction   Suffix              Description
punpckh       [bw, wd, dq, qdq]   unpack high
punpckl       [bw, wd, dq, qdq]   unpack low
unpckh        [ps, pd]            unpack high
unpckl        [ps, pd]            unpack low

Cacheability control and prefetch instructions:
Instruction      Suffix            Description
movnt            [ps, pd, q, dq]   move aligned non-temporal
prefetch<hint>                     prefetch with hint

State management instructions are commonly used by the operating system.
Slide16
Three sets of switches to enable processor-specific extensions

Switches -x<EXT>, like -xSSE4_1: imply an Intel processor check. A run-time error message is issued at program start when launched on a processor without <EXT>.
Switches -m<EXT>, like -mSSE3: no processor check. An illegal instruction fault occurs when launched on a processor without <EXT>.
Switches -ax<EXT>, like -axSSE4_2: automatic processor dispatch with multiple code paths. The processor check is only available on Intel processors; non-Intel processors take the default path. The default path is -mSSE2, but it can be modified by another -x<EXT> switch.
Slide17
Simple estimation of vectorization profitability

A typical vector instruction is an operation on two vectors in memory or in registers of fixed length. These vectors can be loaded from memory in a single operation or in parts. A description of the foundations of SSE technology can be found in Chapter 10, "Programming with Streaming SIMD Extensions (SSE)", of the "Intel 64 and IA-32 Intel Architecture Software Developer's Manual" (Volume 1).
Microsoft Visual Studio supports a set of SSE intrinsics that allows you to use SSE instructions directly from C/C++ code. You need to include xmmintrin.h, which defines the vector type __m128 and the vector operations.
For example, suppose we need to vectorize the following loop manually:

for(i=0;i<N;i++) C[i]=A[i]*B[i];

To do this:
1) organize / fill the vector variables
2) use the multiply intrinsic on the vector variables
3) write the results of the calculations back to memory
Slide18
An example illustrating vectorization with SSE intrinsics:

#include <stdio.h>
#include <xmmintrin.h>
#define N 40
int main() {
  float a[N][N][N],b[N][N][N],c[N][N][N];
  int i,j,k,rep;
  __m128 *xa,*xb,*xc;
  for(i=0;i<N;i++)
    for(j=0;j<N;j++)
      for(k=0;k<N;k++) {
        a[i][j][k]=1.0; b[i][j][k]=2.0;
      }
  for(rep=0;rep<10000;rep++) {
#ifdef PERF
    for(i=0;i<N;i++)
      for(j=0;j<N;j++)
        for(k=0;k<N;k+=4) {
          xa=(__m128*)&(a[i][j][k]);
          xb=(__m128*)&(b[i][j][k]);
          xc=(__m128*)&(c[i][j][k]);
          *xc=_mm_mul_ps(*xa,*xb);
        }
#else
    for(i=0;i<N;i++)
      for(j=0;j<N;j++)
        for(k=0;k<N;k++)
          c[i][j][k]=a[i][j][k]*b[i][j][k];
#endif
  }
  printf("%f\n",c[21][11][18]);
}

icl -Od test.c -Fetest1.exe
icl -Od test.c -DPERF -Fetest_opt.exe

time test1.exe
2.000000
CPU time for command: 'test1.exe'
real 3.406 sec   user 3.391 sec   system 0.000 sec

time test_opt.exe
2.000000
CPU time for command: 'test_opt.exe'
real 1.281 sec   user 1.250 sec   system 0.000 sec

Intel compiler 12.0 was used for this experiment.
Slide19
The resulting speedup is 2.7x. We used aligned 16-byte memory access instructions in this example. The alignment matched accidentally; in a real case, you need to take care of it. The compiler-optimized test shows the following result:

icl test.c -Qvec_report3 -Fetest_intel_opt.exe
time test_intel_opt.exe
2.000000
CPU time for command: 'test_intel_opt.exe'
real 0.328 sec   user 0.313 sec   system 0.000 sec
Slide20
Admissibility of vectorization

Vectorization is a permutation optimization: the initial execution order is changed during vectorization. A permutation optimization is acceptable if it preserves the order of dependencies. This gives a criterion for the admissibility of vectorization in terms of dependencies.
The simplest case is when there are no dependencies inside the processed loop. In a more complicated case there are dependencies inside the vectorized loop, but their order is the same as inside the initial scalar loop.
Recall the definition of a dependency in a loop: there is a loop dependency between statements S1 and S2 in a set of nested loops if and only if:
1) there are two loop nest iteration vectors i and j such that i < j, or i = j and there is a path from S1 to S2 inside the loop;
2) statement S1 on iteration i and statement S2 on iteration j refer to the same memory area;
3) one of these statements writes to this memory.
Slide21
Options for vectorization control

/Qvec-report[n] controls the amount of vectorizer diagnostic information:
n=0  no diagnostic information
n=1  indicate vectorized loops (DEFAULT)
n=2  indicate vectorized/non-vectorized loops
n=3  indicate vectorized/non-vectorized loops and prohibiting data dependence information
n=4  indicate non-vectorized loops
n=5  indicate non-vectorized loops and prohibiting data dependence information

Usage: icl -c -Qvec_report3 loop.c

Diagnostic examples:
C:\loops\loop1.c(5) (col. 1): remark: LOOP WAS VECTORIZED.
C:\loops\loop3.c(5) (col. 1): remark: loop was not vectorized: vectorization possible but seems inefficient.
C:\loops\loop6.c(5) (col. 1): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
Slide22
Simple criteria of vectorization admissibility

Let's write the vectorization of a loop using Fortran array sections. A good criterion for admissibility is that introducing the array sections does not create a dependency.

DO I=1,N
  A(I)=A(I)
END DO

vectorizes as

DO I=1,N,VL
  A(I:I+VL-1)=A(I:I+VL-1)
END DO

DO I=1,N
  A(I+1)=A(I)
END DO

cannot be vectorized as

DO I=1,N,VL
  A(I+1:I+VL)=A(I:I+VL-1)
END DO

There is a dependency, because the elements written on iteration I intersect the elements read on iteration I+VL.

DO I=1,N
  A(I-1)=A(I)
END DO

vectorizes as

DO I=1,N,VL
  A(I-1:I+VL-2)=A(I:I+VL-1)
END DO

There is no dependency, because the elements written on iteration I do not intersect the elements read on any later iteration.
Slide23
Let's check an assumption: a loop can be vectorized if the dependence distance is greater than or equal to the number of array elements in the vector register. Check this with the compiler:

ifort test.F90 -o a.out -vec_report3
echo -------------------------------------
ifort test.F90 -DPERF -o b.out -vec_report3

./build.sh
test.F90(11): (col. 1) remark: loop was not vectorized: existence of vector dependence.
-------------------------------------
test.F90(11): (col. 1) remark: LOOP WAS VECTORIZED.

PROGRAM TEST_VEC
INTEGER,PARAMETER :: N=1000
#ifdef PERF
INTEGER,PARAMETER :: P=4
#else
INTEGER,PARAMETER :: P=3
#endif
INTEGER A(N)
DO I=1,N-P
  A(I+P)=A(I)
END DO
PRINT *,A(50)
END
Slide24
Dependency analysis and directives

There are two tasks the compiler must perform for dependency evaluation:
alias analysis (pointers that can address the same memory must be detected)
definition-use chain analysis
The compiler must prove that objects are not aliased and calculate the dependencies precisely. This is a hard task, and sometimes the compiler is not able to solve it. There are several ways of providing additional information to the compiler:
- the option -ansi_alias (pointers can refer only to objects of the same or compatible type)
- the restrict attribute for pointer arguments (C/C++)
- #pragma ivdep, which states that there are no dependencies in the following loop (C/C++)
- !DEC$ IVDEP, the Fortran analogue of #pragma ivdep
Slide25
Some performance issues for vectorized code

INTEGER :: A(1000),B(1000)
INTEGER I,K
INTEGER, PARAMETER :: REP = 500000
A = 2
DO K=1,REP
  CALL ADD(A,B)
END DO
PRINT *,SHIFT,B(101)
CONTAINS
SUBROUTINE ADD(A,B)
INTEGER A(1000),B(1000)
INTEGER I
!DEC$ UNROLL(0)
DO I=1,1000-SHIFT
  B(I) = A(I+SHIFT)+1
END DO
END SUBROUTINE
END

Let's consider this simple test with an assignment that is appropriate for vectorization, and obtain the vectorized code with the Intel Fortran compiler for different values of the SHIFT macro (/fpp is the preprocessor option). The Intel compiler vectorizes at optimization level 2 or 3 (-O2 or -O3); the option -Ob0 is used to forbid inlining.
Slide26
Experiment results

ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=0 -Fea.exe -Qvec_report >a.out 2>&1
ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=1 -Feb.exe -Qvec_report >b.out 2>&1

time.exe a.exe
0 3
CPU time for command: 'a.exe'
real 0.125 sec   user 0.094 sec   system 0.000 sec

time.exe b.exe
1 3
CPU time for command: 'b.exe'
real 0.297 sec   user 0.281 sec   system 0.000 sec
Slide27
ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=0 /Fas -Ob0 -S -Fafast.s

fast.s:
.B2.5:                            ; Preds .B2.5 .B2.4
;;; B(I) = A(I+SHIFT)+1
        movdqa xmm1, XMMWORD PTR [eax+ecx*4]    ;17.11
        paddd  xmm1, xmm0                       ;17.4
        movdqa XMMWORD PTR [edx+ecx*4], xmm1    ;17.4
        add    ecx, 4                           ;16.3
        cmp    ecx, 1000                        ;16.3
        jb     .B2.5               ; Prob 99%   ;16.3
Slide28
ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=1 /Fas -Ob0 -S -Faslow.s

slow.s:
.B2.5:                            ; Preds .B2.5 .B2.4
;;; B(I) = A(I+SHIFT)+1
        movdqu xmm1, XMMWORD PTR [4+eax+ecx*4]  ;17.11
        paddd  xmm1, xmm0                       ;17.4
        movdqa XMMWORD PTR [edx+ecx*4], xmm1    ;17.4
        add    ecx, 4                           ;16.3
        cmp    ecx, 996                         ;16.3
        jb     .B2.5               ; Prob 99%   ;16.3

CONCLUSION:
MOVDQA - move aligned double quadword; MOVDQU - move unaligned double quadword. In the fast version aligned instructions are used and the vector registers are filled faster. Unaligned instructions are slower, although on the latest architectures they show the same performance as aligned instructions when applied to aligned data.
Slide29
The performance of a vectorized loop depends on the memory location of the objects it uses. An important aspect of program performance is the memory alignment of the data.
Data structure alignment is the way data is placed in computer memory. The concept includes two distinct but related issues: data alignment and data structure padding.
Data alignment specifies how data is located relative to memory boundaries. This property is usually associated with a data type.
Data structure padding is the insertion of unnamed fields into a data structure in order to preserve the relative alignment of the structure's fields.
Slide30
Data alignment

Information about the alignment of a type can be obtained with the __alignof__ intrinsic. The size and default alignment of a variable may depend on the compiler target (ia32 or intel64):

printf("int: sizeof=%d align=%d\n", sizeof(a), __alignof__(a));

Alignment for the ia32 Intel C++ compiler:
bool           sizeof = 1   alignof = 1
wchar_t        sizeof = 2   alignof = 2
short int      sizeof = 2   alignof = 2
int            sizeof = 4   alignof = 4
long int       sizeof = 4   alignof = 4
long long int  sizeof = 8   alignof = 8
float          sizeof = 4   alignof = 4
double         sizeof = 8   alignof = 8
long double    sizeof = 8   alignof = 8
void*          sizeof = 4   alignof = 4

The same rules are used for array alignment. It is possible to force the compiler to align an object in a certain way:

__declspec(align(16)) float x[N];
Slide31
Data structure alignment

struct foo {
  bool a;
  short b;
  long long c;
  bool d;
};

is laid out as

struct foo {
  bool a;       char pad1[1];
  short b;      char pad2[4];
  long long c;
  bool d;       char pad3[7];
};

The order of fields in a structure affects the size of objects of that type. To reduce the size of the object, structure fields should be sorted by descending size. __declspec can also be used to align structure fields:

typedef struct aStruct {
  __declspec(align(16)) float x[N];
  __declspec(align(16)) float y[N];
  __declspec(align(16)) float z[N];
} aStruct;
Slide32
for(i=0;i<100;i++) p[i]=b[i]

Scalar execution, 100 operations:
0:  p[0]=b[0]
1:  p[1]=b[1]
2:  p[2]=b[2]
...
98: p[98]=b[98]
99: p[99]=b[99]

The approximate scheme of the loop vectorization, with a vector register holding 8 elements:
Peel loop (the non-aligned starting elements are passed scalar): p[0]=b[0], p[1]=b[1]
Vector operations (12 of them): p[2:9]=b[2:9], ..., p[90:97]=b[90:97]
Tail loop: p[98]=b[98], p[99]=b[99]

Loop vectorization usually produces three loops: a loop for the non-aligned starting elements, the vectorized loop, and the tail. Vectorization of a loop with a small number of iterations can be unprofitable.
Slide33
Additional vectorization example

vec.c:
void Calculate(float *a, float *b, float *c, int n) {
  int i;
  for(i=0;i<n;i++) {
    a[i] = a[i]+b[i]+c[i];
  }
  return;
}

main.c:
#include <stdio.h>
#define N 1000
extern void Calculate(float *,float *, float *,int);
int main() {
  float x[N],y[N],z[N];
  int i,rep;
  for(i=0;i<N;i++) {
    x[i] = 1; y[i] = 0; z[i] = 1;
  }
  for(rep=0;rep<10000000;rep++) {
    Calculate(&x[1],&y[0],&z[0],N-1);
  }
  printf("x[1]=%f\n",x[1]);
}

Note that the alignment of the first argument differs from the others.

icl main.c vec.c -O1 -FeA
time a.exe
12.6 s.
Slide34
The compiler performs auto-vectorization at -O2 or -O3. The option -Qvec_report reports vectorized loops.

icl main.c vec.c -O2 -Qvec_report -Feb
vec.c(3): (col. 3) remark: LOOP WAS VECTORIZED.
time b.exe
3.67 s.

Vectorization is possible because the compiler inserts a run-time check for the case when some of the pointers may be aliased. The application size is enlarged.

1) Add restrict qualifiers:

void Calculate(float * restrict a, float * restrict b, float * restrict c, int n) {

2) To use the restrict qualifier we need to add the option -Qstd=c99:

icl main.c vec.c -Qstd=c99 -O2 -Qvec_report -Fec
vec.c(3): (col. 3) remark: LOOP WAS VECTORIZED.
time c.exe
3.55 s. A small improvement, from avoiding the run-time check.

Useful fact: on modern computation systems the performance of aligned and unaligned instructions is almost the same when applied to aligned objects.
Slide35
3) Align the arguments and tell the compiler about it. In main.c:

int main() {
  __declspec(align(16)) float x[N];
  __declspec(align(16)) float y[N];
  __declspec(align(16)) float z[N];
  ...
  Calculate(&x[0],&y[0],&z[0],N-1);

and in vec.c:

void Calculate(float * restrict a, float * restrict b, float * restrict c, int n) {
  __assume_aligned(a,16);
  __assume_aligned(b,16);
  __assume_aligned(c,16);

icl main.c vec.c -Qstd=c99 -O2 -Qvec_report -Fed
vec.c(3): (col. 3) remark: LOOP WAS VECTORIZED.
time d.exe
3.20 s.

This update demonstrates an improvement because of the better alignment of the vectorized objects. The arrays in main are aligned to 16. With this update all argument pointers are well aligned and the compiler is informed by the __assume_aligned directive, which allows it to remove the first scalar (peel) loop.
Slide36
Data alignment

Good array data alignment: for SSE, 16 bytes; for AVX, 32 bytes.
Data alignment directives:
C/C++ Windows: __declspec(align(16)) float X[N];
Linux/MacOS: float X[N] __attribute__((aligned(16)));
Fortran: !DIR$ ATTRIBUTES ALIGN: 16 :: A
Aligned allocation: malloc, _aligned_malloc(), _mm_malloc()
Data alignment assertion (16-byte example):
C/C++: __assume_aligned(p,16);
Fortran: !DIR$ ASSUME_ALIGNED A(1):16
Aligned loop assertion:
C/C++: #pragma vector aligned
Fortran: !DIR$ VECTOR ALIGNED
Slide37
Non-unit stride and unit stride access

Well-aligned, contiguous data is better for vectorization because a vector register can then be filled by a single operation. With non-unit stride access to an array, filling the register is a more complicated task and vectorization is less profitable. The auto-vectorizer cooperates with the loop optimizations to improve access to objects.
There are compiler directives that recommend vectorization in cases when the compiler does not perform it because it looks unprofitable:
C/C++: #pragma vector {aligned|unaligned|always}, #pragma novector
Fortran: !DEC$ VECTOR ALWAYS, !DEC$ NOVECTOR
Slide38
Vectorization of an outer loop

Usually the auto-vectorizer processes the innermost loop of a nest. Vectorization of the outer loop can be requested with the "simd" directive.

#define N 200
#include <stdio.h>
int main() {
  int A[N][N],B[N][N],C[N][N];
  int i,j,rep;
  for(i=0;i<N;i++)
    for(j=0;j<N;j++) {
      A[i][j]=i+j; B[i][j]=2*j-i; C[i][j]=0;
    }
  for(rep=0;rep<10000000;rep++) {
#pragma simd
    for(i=0;i<N;i++) {
      j=0;
      while(A[i][j]<=B[i][j] && j<N) {
        C[i][j]=C[i][j]+B[j][i]-A[j][i];
        j++;
      }
    }
  }
  printf("%d\n",C[0][2]);
}

icl vec.c -O3 -Qvec- -Fea   (-Qvec- disables vectorization)   20.7 s
icl vec.c -O3 -Qvec_report -Feb                               17 s
vec.c(17): (col. 3) remark: SIMD LOOP WAS VECTORIZED.
Slide39
Thank you.