Slide 1: libFLAME Optimizations with BLIS
Kiran Varaganti
19 September 2016
Slide 2: Contents
- Introduction
- libFLAME
- Baseline performance: Cholesky, QR, LU factorization
- Analysis
- Optimizations
- Summary
Slide 3: Introduction
AMD provides high-performance computing libraries for various verticals:
- Oil & gas exploration
- Computer vision
- Machine learning
- ...
We will provide BLAS and LAPACK functionality through BLIS and libFLAME respectively, optimizing these open-source libraries for AMD architectures.
Slide 4: Benefits of BLIS & libFLAME
The benefits are numerous; to name a few:
- No inconvenient dependency on the Fortran runtime: the source for both BLIS and libFLAME is written in C, or in assembly where performance requires it.
- Object-based abstractions and API:
  - Built around opaque structures that hide matrix implementation details (data layout).
  - Exports object-based programming interfaces to operate on these objects.
  - An expanded API that is a strict superset of the traditional BLAS and LAPACK libraries.
- A high-performance dense linear algebra library framework:
  - The abstraction supports programming without array or loop indices, which lets the user avoid painful index-related programming errors.
  - Provides algorithm families for each operation, so developers can choose the one that best suits their needs.
  - Provides a framework for building complete custom linear algebra codes.
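To make the object-based style concrete, the following is a minimal sketch of a libFLAME driver that Cholesky-factors a matrix with no explicit array or loop indices. It assumes the standard libFLAME front-end API (FLA_Init, FLA_Obj_create, FLA_Chol); FLA_Random_spd_matrix is the test-suite helper for generating a symmetric positive-definite operand, so treat that exact helper name as an assumption for your libFLAME version.

```c
/* Minimal libFLAME object-API sketch: factor a random SPD matrix
 * without any index arithmetic.  Link against libflame and a BLAS
 * library such as BLIS.  FLA_Random_spd_matrix is assumed available
 * (it is used by libFLAME's test code). */
#include "FLAME.h"

int main( void )
{
    FLA_Obj A;
    int     n = 1000;

    FLA_Init();

    /* Create an n x n double-precision matrix object; the two zeros
     * select the default (column-major) row/column strides. */
    FLA_Obj_create( FLA_DOUBLE, n, n, 0, 0, &A );

    /* Fill A with a random symmetric positive-definite matrix. */
    FLA_Random_spd_matrix( FLA_LOWER_TRIANGULAR, A );

    /* Overwrite the lower triangle of A with its Cholesky factor. */
    FLA_Chol( FLA_LOWER_TRIANGULAR, A );

    FLA_Obj_free( &A );
    FLA_Finalize();
    return 0;
}
```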
Slide 5: libFLAME
Goal: optimize libFLAME using BLIS as the BLAS library.
- LAPACK functionality is implemented on top of BLAS routines.
- libFLAME can be configured to use any external BLAS-compatible library.
[Diagram: libFLAME layered on top of BLAS-compatible libraries.]
Slide 6: libFLAME Functionalities
Currently focusing on:
- Cholesky
- LU
- QR
These factorizations are very useful in solving linear equations.
Slide 7: Cholesky, LU, QR Benchmarking on AMD Reference CPU
Ubuntu 14.04 LTS 64-bit OS; no optimizations applied at this baseline.
- Cholesky and LU performance is comparable with OpenBLAS, except for small dimensions.
- QR needs to be improved for all dimensions.
Slide 8: Block Sizes - Cholesky
[Diagram: recursive partitioning of the m x m input matrix. The leading 128 x 128 diagonal block is split into 32 x 32, 96 x 32, 64 x 32, 96 x 96, and 64 x 64 sub-blocks; the (m-128) x 128 panel and the (m-128) x (m-128) trailing matrix follow.]
- m x m = input matrix size
- 128 x 128 = block size (b = 128)
- 32 x 32 = sub-block size (Cholesky computed with Level-1 & Level-2 BLAS)
- The sub-block partitioning is repeated after each HERK update, on the (m-128) x (m-128) trailing matrix.
BLAS operations:
- Level 1 & 2: block size < 32
- Level-3 TRSM: 32 x 32 <= block size <= (m-128) x 128
- Level-3 HERK: 32 x 32 <= block size <= (m-128) x (m-128)
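The structure above corresponds to a standard blocked right-looking Cholesky: factor a b x b diagonal block with an unblocked (Level-1/Level-2) routine, apply a TRSM to the panel below it, then a HERK (SYRK in real arithmetic) update to the trailing matrix, and continue on the trailing submatrix. Below is a minimal sketch in plain C, assuming column-major storage and the standard CBLAS/LAPACKE interfaces; it illustrates the technique, not libFLAME's internal variant.

```c
/* Blocked right-looking Cholesky (lower triangular), mirroring the
 * partitioning on the slide: unblocked factorization of each b x b
 * diagonal block, TRSM on the panel below it, SYRK/HERK on the
 * trailing matrix.  Standard CBLAS/LAPACKE interfaces assumed. */
#include <cblas.h>
#include <lapacke.h>

void chol_blocked_lower( int m, double* A, int lda, int b )
{
    for ( int k = 0; k < m; k += b )
    {
        int kb = ( m - k < b ) ? m - k : b;   /* current block size */

        /* Factor the kb x kb diagonal block A11 = L11 * L11^T. */
        LAPACKE_dpotrf( LAPACK_COL_MAJOR, 'L', kb,
                        &A[ k + k*lda ], lda );

        if ( k + kb < m )
        {
            int mr = m - k - kb;              /* rows remaining below */

            /* Panel solve: A21 := A21 * inv(L11^T)  (Level-3 TRSM). */
            cblas_dtrsm( CblasColMajor, CblasRight, CblasLower,
                         CblasTrans, CblasNonUnit, mr, kb,
                         1.0, &A[ k + k*lda ], lda,
                              &A[ (k+kb) + k*lda ], lda );

            /* Trailing update: A22 := A22 - A21 * A21^T  (Level-3
             * SYRK; HERK in the complex case). */
            cblas_dsyrk( CblasColMajor, CblasLower, CblasNoTrans,
                         mr, kb, -1.0, &A[ (k+kb) + k*lda ], lda,
                         1.0, &A[ (k+kb) + (k+kb)*lda ], lda );
        }
    }
}
```

Most of the floating-point work lands in the TRSM and SYRK/HERK calls, which is why the slide that follows concludes that optimizing those two routines matters far more than the Level-1/Level-2 kernels.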
Slide 9: BLAS Dependencies - Cholesky
Level 1:
- dot (data size < 32)
- scalv (data size < 32)
Level 2:
- GEMV (data size < 32)
Level 3:
- TRSM (data size < input matrix size)
- HERK (data size < input matrix size)
Summary: the impact of the Level-1 and Level-2 routines is small compared to the Level-3 BLAS routines.
Optimization: improve TRSM and HERK for all block sizes.
Slide 10: Block Sizes - LU & QR
- LU with pivoting: 128 x 128 blocks; 16 x 16 sub-blocks (Level 1 & Level 2).
- QR: Level-1 & Level-2 BLAS operate on all higher block sizes as well, so they have a high impact on QR performance.
Slide 11: BLAS Dependencies - LU with Pivoting & QR
LU with pivoting:
- Level 1: dot, amax, scalv
- Level 2: GEMV
- Level 3: TRSM, GEMM
QR factorization:
- Level 1: copyv, scalv, axpyv, axpyt
- Level 2: GER (general rank-1 update), GEMV
- Level 3: TRMM, GEMM, TRSM
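To see where the LU dependencies come from, here is a sketch of unblocked LU with partial pivoting in plain C: IAMAX finds the pivot, SCAL scales the multipliers, and GER performs the rank-1 trailing update; a blocked algorithm then replaces most of this work with TRSM/GEMM on panels. This is a generic textbook variant using standard CBLAS calls, not libFLAME's internal code.

```c
/* Generic unblocked LU with partial pivoting (column-major), showing
 * where the Level-1/2 dependencies arise: IAMAX for the pivot search,
 * SWAP for the row interchange, SCAL for the column scaling, GER for
 * the rank-1 update.  Assumes nonzero pivots after pivoting. */
#include <cblas.h>

void lu_unblocked( int m, int n, double* A, int lda, int* ipiv )
{
    int kmax = ( m < n ) ? m : n;

    for ( int k = 0; k < kmax; k++ )
    {
        /* Pivot search down column k (Level-1 IAMAX). */
        int p = k + (int) cblas_idamax( m - k, &A[ k + k*lda ], 1 );
        ipiv[ k ] = p;

        /* Interchange rows k and p across all n columns. */
        if ( p != k )
            cblas_dswap( n, &A[ k ], lda, &A[ p ], lda );

        /* Scale the multipliers below the pivot (Level-1 SCAL). */
        cblas_dscal( m - k - 1, 1.0 / A[ k + k*lda ],
                     &A[ (k+1) + k*lda ], 1 );

        /* Rank-1 update of the trailing matrix (Level-2 GER). */
        if ( k + 1 < m && k + 1 < n )
            cblas_dger( CblasColMajor, m - k - 1, n - k - 1, -1.0,
                        &A[ (k+1) + k*lda ], 1,
                        &A[ k + (k+1)*lda ], lda,
                        &A[ (k+1) + (k+1)*lda ], lda );
    }
}
```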
Slide 12: BLIS Optimizations - AVX x86 SIMD
Optimized kernels:
- axpyf - which in turn optimizes GEMV (column-major)
- dotxf - which in turn optimizes GEMV (row-major)
- axpyv
[Diagram: example vector update 3 * (3, 1, 9, 7) = (9, 3, 27, 21) computed in 256-bit YMM registers, illustrating the vectorized axpy-family kernels.]
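The axpyf ("fused axpy") kernel applies a small, fixed number of axpyv updates at once, y := y + alpha * (chi0*a0 + chi1*a1 + ...), which is exactly the inner loop of a column-major GEMV; fusing the columns keeps each element of y in a register across all the updates. A scalar reference sketch in C follows; the optimized BLIS kernel performs the same loop with AVX intrinsics, and the fusing factor of 4 here is illustrative rather than the value BLIS actually configures.

```c
/* Scalar reference for an axpyf kernel with fusing factor 4:
 * y := y + alpha * A(:, 0:3) * x, the inner kernel of a column-major
 * GEMV.  Each y[i] is loaded and stored once for all four column
 * updates; the production kernel vectorizes this loop with 256-bit
 * AVX (four doubles per YMM register). */
#include <stddef.h>

void daxpyf_ref( size_t m, double alpha,
                 const double* a, size_t lda,  /* m x 4, column-major */
                 const double* x,              /* 4 x 1 coefficients  */
                 double* y )                   /* m x 1, updated      */
{
    /* Pre-scale the four column coefficients once. */
    const double c0 = alpha * x[0], c1 = alpha * x[1];
    const double c2 = alpha * x[2], c3 = alpha * x[3];

    for ( size_t i = 0; i < m; i++ )
    {
        /* One load/store of y[i] covers all four column updates. */
        y[i] += c0 * a[ i + 0*lda ] + c1 * a[ i + 1*lda ]
              + c2 * a[ i + 2*lda ] + c3 * a[ i + 3*lda ];
    }
}
```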
Slide 13: BLIS Optimizations Impact - QR Factorization
AMD reference CPU, Ubuntu 14.04 LTS 64-bit OS.
[Charts: QR performance before and after the BLIS kernel optimizations, showing 1.9x and 1.11x improvements, and a 2.12x improvement after bli_ssumsqv_unb_var1() was replaced with dotv for the 2-norm calculation.]
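The sumsqv-to-dotv replacement works because, for vectors whose elements are far from the overflow/underflow limits, ||x||_2 can be computed directly as sqrt(x . x) instead of via the slower scaled sum-of-squares that bli_ssumsqv_unb_var1() performs. A sketch using BLIS's typed API follows; the bli_sdotv signature is taken from the BLIS typed-API documentation, so verify it against your BLIS version.

```c
/* Computing ||x||_2 as sqrt( x . x ) with BLIS's typed dotv, the
 * faster alternative to the scaled sum-of-squares.  Safe when the
 * elements of x are far from the underflow/overflow limits, as they
 * are for the Householder norms in this QR path. */
#include <math.h>
#include "blis.h"

float norm2f( dim_t n, float* x, inc_t incx )
{
    float rho;

    /* rho := x^T x  (no conjugation needed for real data). */
    bli_sdotv( BLIS_NO_CONJUGATE, BLIS_NO_CONJUGATE,
               n, x, incx, x, incx, &rho );

    return sqrtf( rho );
}
```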
Slides 14-15: Profile Data - Cholesky, LU & QR
[Profiling charts for Cholesky, QR, and LU factorization; chart contents not reproduced in this extraction.]
Slide 16: BLIS Optimizations Impact - Cholesky & LU Factorization
AMD reference CPU, Ubuntu 14.04 LTS 64-bit OS.
[Charts: ~7% improvement for Cholesky and ~14% improvement for LU; each chart marks the size range where performance is better than OpenBLAS and the range that still needs improvement.]
Slide 17: Profile Data - Cholesky
[Profiles at square matrix sizes 128, 160, 640, and 1600.]
To improve performance for small matrices, optimize dotxv.
Slide 18: Profile Data - LU
[Profiles at square matrix sizes 128, 320, 640, and 1600.]
To improve performance for small matrices, optimize the framework code.
Slide 19: Profile Data - QR
[Profiles at square matrix sizes 160, 480, 640, and 1600.]
For small matrices, the Level-1 routines dominate the run time.
Slide 20: Summary
- For larger matrix sizes, Cholesky factorization performance is better than OpenBLAS.
- To improve Cholesky performance for smaller matrices, optimize the dotxv routine and the framework code.
- LU performance is better than OpenBLAS for matrices larger than 320.
- QR factorization performance improved by more than 2x, but still falls short of OpenBLAS for all matrix sizes.
Slide 21: Thank You
Slide 22: libFLAME Test Parameters
3     Number of repeats per experiment
c     Flat matrix storage scheme(s) to test ('c' = col-major; 'r' = row-major; 'g' = general; 'm' = mixed)
s     Datatype(s) to test
40    Algorithmic blocksize for blocked algorithms
10    Algorithmic blocksize for algorithms-by-blocks
40    Storage blocksize for algorithms-by-blocks
160   Problem size: first to test
1600  Problem size: maximum to test
160   Problem size: increment between experiments
0     Number of SuperMatrix threads (0 = disable SuperMatrix in FLASH front-ends)
i     Reaction to test failure ('i' = ignore; 's' = sleep() and continue; 'a' = abort)
Slide 23: Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.