
Slide 1: libFLAME optimizations with BLIS

Kiran Varaganti
19 September 2016

Slide 2: Contents

- Introduction
- libFLAME
- Baseline performance: Cholesky, QR, LU factorization
- Analysis
- Optimizations
- Summary

Slide 3: Introduction

AMD provides high-performance computing libraries for various verticals:

- Oil & Gas exploration
- Computer vision
- Machine learning

We will provide BLAS & LAPACK functionality through BLIS and libFLAME, respectively, essentially optimizing these open-source libraries for AMD architectures.

Slide 4: Benefits of BLIS & libFLAME

The benefits are enormous, but to name a few:

- Removes the inconvenient dependency on the Fortran runtime: the source for both BLIS and libFLAME is written in C, or in assembly where performance requires it.
- Object-based abstractions and API:
  - Built around opaque structures that hide matrix implementation details (data layout)
  - Exports object-based programming interfaces to operate on these objects (see the sketch after this list)
- An expanded API that is a strict superset of the traditional BLAS and LAPACK libraries
- A high-performance dense linear algebra library framework
- Abstraction facilitates programming without array or loop indices, which lets the user avoid painful index-related programming errors
- Provides algorithm families for each operation, so developers can choose the one that best suits their needs
- Provides a framework for building complete custom linear algebra codes
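A hedged sketch of what this object-based style looks like in practice (a minimal example using routine names from the libflame reference manual; FLA_Random_spd_matrix is assumed from the test suite, and the link line is indicative only):

/* Build roughly as: gcc chol_demo.c -lflame -lblis -lm */
#include "FLAME.h"

int main( void )
{
    FLA_Obj A;
    dim_t   n = 1000;

    FLA_Init();

    /* Create an n x n double-precision matrix object; the object
       hides the data layout (storage scheme, leading dimension). */
    FLA_Obj_create( FLA_DOUBLE, n, n, 0, 0, &A );

    /* Assumed test-suite helper: fill A with a random SPD matrix. */
    FLA_Random_spd_matrix( FLA_LOWER_TRIANGULAR, A );

    /* Cholesky factorization: no indices or leading dimensions in sight. */
    FLA_Chol( FLA_LOWER_TRIANGULAR, A );

    FLA_Obj_free( &A );
    FLA_Finalize();
    return 0;
}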

Slide 5: libFLAME

Goal: optimize libFLAME using BLIS as the BLAS library.

- LAPACK functionality is implemented on top of BLAS routines
- libFLAME can be configured to use any external BLAS-compatible library

[Diagram: libFLAME layered on top of BLAS-compatible libraries.]

Slide 6: libFLAME functionalities

Currently focusing on:

- Cholesky
- LU
- QR

These factorizations are very useful in solving linear equations (see the sketch below).
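For example, solving A x = b for symmetric positive-definite A reduces to a Cholesky factorization plus a triangular solve. A hedged sketch with the object-based API (FLA_Chol_solve is listed in the libflame reference manual; verify against your version):

#include "FLAME.h"

/* Solve A x = b for SPD A; A, b, x are FLA_Objs created as in the
   earlier sketch, with A filled with SPD data. A is overwritten. */
void solve_spd( FLA_Obj A, FLA_Obj b, FLA_Obj x )
{
    FLA_Chol( FLA_LOWER_TRIANGULAR, A );             /* A := L, A = L L^T */
    FLA_Chol_solve( FLA_LOWER_TRIANGULAR, A, b, x ); /* x := A^{-1} b     */
}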

Slide 7: Cholesky, LU, QR benchmarking on AMD reference CPU

Ubuntu 14.04 LTS 64-bit OS; no optimizations yet.

- Cholesky & LU performance is comparable with OpenBLAS, except for small dimensions
- QR needs to be improved for all dimensions

Slide 8: Block Sizes – Cholesky

[Diagram: recursive block partitioning of the input matrix for blocked Cholesky. Legend:]

- m x m = input matrix size
- 128 x 128 = block size (b = 128)
- 32 x 32 = sub-block size (Cholesky on the sub-block uses Level-1 & Level-2 BLAS)

The sub-block partitioning is repeated after the HERK update on the trailing (m-128) x (m-128) matrix.

BLAS operations (see the loop sketch below):

- Level 1 & 2: block size < 32
- Level-3 TRSM: 32 x 32 <= block size <= (m-128) x 128
- Level-3 HERK: 32 x 32 <= block size <= (m-128) x (m-128)
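A minimal loop-structure sketch of the partitioning just described (hedged: a standard right-looking blocked Cholesky in plain C with CBLAS, not libFLAME's actual FLAME-notation variant; for real double data the HERK update is SYRK):

#include <cblas.h>
#include <math.h>

/* Unblocked Level-1/2 Cholesky on an n x n lower-triangular block. */
static void chol_unb( double* A, int lda, int n )
{
    for ( int j = 0; j < n; ++j )
    {
        /* a_jj := sqrt( a_jj - dot( A[j,0:j], A[j,0:j] ) )   (Level-1 DOT) */
        double d = A[ j + j*lda ]
                 - cblas_ddot( j, &A[ j ], lda, &A[ j ], lda );
        A[ j + j*lda ] = sqrt( d );
        if ( j + 1 < n )
        {
            /* A[j+1:n,j] -= A[j+1:n,0:j] * A[j,0:j]^T        (Level-2 GEMV) */
            cblas_dgemv( CblasColMajor, CblasNoTrans, n-j-1, j,
                         -1.0, &A[ j+1 ], lda, &A[ j ], lda,
                          1.0, &A[ j+1 + j*lda ], 1 );
            /* scale the column by 1/a_jj                     (Level-1 SCAL) */
            cblas_dscal( n-j-1, 1.0 / A[ j + j*lda ], &A[ j+1 + j*lda ], 1 );
        }
    }
}

void chol_blocked( double* A, int lda, int m, int b /* = 128 on the slide */ )
{
    for ( int k = 0; k < m; k += b )
    {
        int nb = ( m - k < b ) ? m - k : b;   /* current block size */
        int r  = m - k - nb;                  /* trailing dimension */

        chol_unb( &A[ k + k*lda ], lda, nb ); /* Level-1/2 on diagonal block */

        if ( r > 0 )
        {
            /* Panel solve: A21 := A21 * inv(L11)^T           (Level-3 TRSM) */
            cblas_dtrsm( CblasColMajor, CblasRight, CblasLower,
                         CblasTrans, CblasNonUnit, r, nb,
                         1.0, &A[ k + k*lda ], lda,
                              &A[ k+nb + k*lda ], lda );
            /* Trailing update: A22 -= A21 * A21^T       (Level-3 SYRK/HERK) */
            cblas_dsyrk( CblasColMajor, CblasLower, CblasNoTrans,
                         r, nb, -1.0, &A[ k+nb + k*lda ], lda,
                         1.0, &A[ k+nb + (k+nb)*lda ], lda );
        }
    }
}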

Slide 9: BLAS Dependencies – Cholesky

Level 1:
- dotv (data size < 32)
- scalv (data size < 32)

Level 2:
- GEMV (data size < 32)

Level 3:
- TRSM (data size < input matrix size)
- HERK (data size < input matrix size)

Summary: the impact of the Level-1 & Level-2 routines is small compared to the Level-3 BLAS routines. Optimization: improve TRSM & HERK for all block sizes.

Slide 10: Block sizes – LU & QR

- LU with pivoting: 128 x 128 blocks, with 16 x 16 sub-blocks handled by Level-1 & Level-2 BLAS
- QR: Level-1 & Level-2 BLAS work on all higher block sizes as well, which has a high impact on QR performance

Slide 11: BLAS Dependencies – LU with pivoting & QR

LU with pivoting:
- Level 1: dotv, amaxv, scalv
- Level 2: GEMV
- Level 3: TRSM, GEMM

QR factorization:
- Level 1: copyv, scalv, axpyv, axpyt
- Level 2: GER (general rank-1 update), GEMV
- Level 3: TRMM, GEMM, TRSM

(A sketch of why GEMV and GER appear at every QR step follows.)
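Each step of Householder QR applies a reflector H = I - tau*v*v^T to the trailing panel, which is exactly one GEMV plus one GER; this is why Level-1/2 kernels dominate the QR profile for small matrices. A hedged, textbook-style sketch (not libFLAME's actual variant):

#include <cblas.h>

/* Apply H = I - tau*v*v^T to an m x n column-major panel A. */
void apply_householder( int m, int n, double tau,
                        const double* v,    /* reflector, length m  */
                        double* A, int lda,
                        double* w )         /* workspace, length n  */
{
    /* w := A^T v             (Level-2 GEMV) */
    cblas_dgemv( CblasColMajor, CblasTrans, m, n,
                 1.0, A, lda, v, 1, 0.0, w, 1 );
    /* A := A - tau * v w^T   (Level-2 GER, general rank-1 update) */
    cblas_dger( CblasColMajor, m, n, -tau, v, 1, w, 1, A, lda );
}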

Slide 12: BLIS optimizations – AVX x86 SIMD optimization

- axpyf, which in turn optimizes GEMV (column-major)
- dotxf, which in turn optimizes GEMV (row-major)
- axpyv

[Diagram: axpyv on 256-bit YMM registers; e.g. alpha = 3 with x = (3, 1, 9, 7) gives alpha * x = (9, 3, 27, 21), which is then accumulated into y. axpyf fuses f such axpyv updates across the columns of a matrix.]
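A minimal sketch of the vectorization idea (assumptions: unit stride and plain AVX without FMA, matching the 256-bit YMM registers in the diagram; BLIS's real kernel handles strides, alignment, and unrolling differently):

/* Double-precision axpyv, y := alpha*x + y, 4 doubles per YMM register.
   Compile with: gcc -mavx */
#include <immintrin.h>

void daxpyv_avx( int n, double alpha, const double* x, double* y )
{
    __m256d valpha = _mm256_broadcast_sd( &alpha );
    int i = 0;

    /* Main loop: 4 elements per iteration in one YMM register. */
    for ( ; i + 4 <= n; i += 4 )
    {
        __m256d vx = _mm256_loadu_pd( x + i );
        __m256d vy = _mm256_loadu_pd( y + i );
        vy = _mm256_add_pd( _mm256_mul_pd( valpha, vx ), vy );
        _mm256_storeu_pd( y + i, vy );
    }

    /* Scalar cleanup for the last n % 4 elements. */
    for ( ; i < n; ++i )
        y[ i ] += alpha * x[ i ];
}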

Slide 13: BLIS optimizations impact – QR factorization

AMD reference CPU, Ubuntu 14.04 LTS 64-bit OS.

[Charts: 1.9x and 1.11x improvements from the SIMD kernels; replacing bli_ssumsqv_unb_var1() with dotv for the norm2 calculation gives a 2.12x improvement.]
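The dotv replacement works because the two-norm is sqrt(x^T x), so one fused dot product can stand in for the unblocked sum-of-squares variant. A hedged sketch against the BLIS typed API (signature per the BLIS docs; note the unscaled form can overflow or underflow where a sumsqv-based implementation would not):

#include <math.h>
#include "blis.h"

float snorm2_via_dotv( dim_t n, float* x, inc_t incx )
{
    float rho;
    bli_sdotv( BLIS_NO_CONJUGATE, BLIS_NO_CONJUGATE,
               n, x, incx, x, incx, &rho );  /* rho := x^T x          */
    return sqrtf( rho );                     /* ||x||_2 = sqrt(x^T x) */
}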

Slides 14-15: Profile Data – Cholesky, LU & QR

[Charts: profile data for Cholesky factorization, QR factorization, and LU factorization.]

Slide 16: BLIS optimizations impact – Cholesky & LU factorization

AMD reference CPU, Ubuntu 14.04 LTS 64-bit OS.

[Charts: Cholesky ~7% improvement, LU ~14% improvement; each chart marked "better than OpenBLAS" at larger matrix sizes and "needs improvement" at smaller ones.]

Slide 17: Profile Data – Cholesky

[Charts: profiles at square matrix sizes 128, 160, 640, and 1600.]

To improve performance for small matrices: optimize dotxv.

Slide 18: Profile Data – LU

[Charts: profiles at square matrix sizes 128, 320, 640, and 1600.]

To improve performance for small matrices: optimize the framework code.

Slide 19: Profile Data – QR

[Charts: profiles at square matrix sizes 160, 480, 640, and 1600.]

For small matrices, the Level-1 routines dominate.

Slide 20: Summary

- For larger matrix sizes, Cholesky factorization performance is better than OpenBLAS.
- To improve Cholesky performance for smaller matrices, optimize the dotxv routine and the framework code.
- LU performance is better than OpenBLAS for matrices larger than 320.
- QR factorization performance has improved by more than 2x, but still falls short of OpenBLAS for all matrix sizes.

Slide 21: Thank You

Slide 22: libflame – Parameters

- 3: number of repeats per experiment
- c: flat matrix storage scheme(s) to test ('c' = col-major; 'r' = row-major; 'g' = general; 'm' = mixed)
- s: datatype(s) to test
- 40: algorithmic blocksize for blocked algorithms
- 10: algorithmic blocksize for algorithms-by-blocks
- 40: storage blocksize for algorithms-by-blocks
- 160: problem size, first to test
- 1600: problem size, maximum to test
- 160: problem size, increment between experiments
- 0: number of SuperMatrix threads (0 = disable SuperMatrix in FLASH front-ends)
- i: reaction to test failure ('i' = ignore; 's' = sleep() and continue; 'a' = abort)

Slide 23: Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.