Elementary Functions in Julia
Mustafa Mohamad
Department of Mechanical Engineering, MIT
18.337, Fall 2016
Problem
Develop pure Julia code for the elementary functions
Trigonometric functions
sin, cos, tan, sincos, asin, acos, atan, atan2
Hyperbolic functions
sinh, cosh, tanh, asinh, acosh, atanh
Exponential and logarithmic
log, log2, log10, log1p, ilog2, exp, exp2, exp10, expm1, cbrt, pow
Floating point manipulation
ldexp, ilog2, frexp
Grand Goal
Excise the dependency on openlibm
openlibm is basically a fork of FreeBSD's msun math library plus patches (originally developed by Sun Microsystems)
Patches have accumulated over the years that have not been ported over to openlibm
The Julia Promise
Multiple dispatch, macros, and other features make it possible to write generic versions of algorithms and still get maximal performance
Generic versions are often easier to understand
Get efficient code for Float16 and Float128
Float16: forthcoming in Julia 0.6/1.0
Float128: forthcoming in Julia 0.6/1.0
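That promise can be illustrated with a tiny sketch (`two_term_exp` is a made-up name for this example): one generic definition, with every constant created in the argument's type, compiles to specialized native code for each float width.

```julia
# Sketch: one generic definition serves every AbstractFloat width.
# `two_term_exp` evaluates 1 + x + x^2/2, the first terms of the Taylor
# series for exp near zero, with all constants in the argument's type T.
function two_term_exp(x::T) where T<:AbstractFloat
    # muladd lets the compiler emit an FMA instruction when available
    return muladd(x, muladd(x, T(0.5), one(T)), one(T))
end

two_term_exp(0.1f0)  # Float32 path
two_term_exp(0.1)    # Float64 path, same source code
```

Because the coefficients are written as `T(0.5)` and `one(T)` rather than as `Float64` literals, no promotion sneaks into the `Float32` or `Float16` paths.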
Background: Julia port of SLEEF C library
Takes advantage of SIMD and minimal code branches
Look at the generated code with @code_llvm or @code_native
Julia-ported code benefits from automatic SIMD vectorization
See: https://github.com/JuliaMath/Sleef.jl
SLEEF: good accuracy but too slow
Avoiding branches necessitates unnecessary computations for some branches
Double-double arithmetic is used to squeeze out extra bits of precision and avoid branching (the library is too slow on non-fma systems, ~6-10x slower than Base)
Benchmarks
MEDIAN TIME RATIO SLEEF/BASE
Performance critically depends on the availability of a fused multiply-add instruction (FMA)
sin/cos/tan have bugs, so the benchmark here is not fair; same for the log functions
The exp10 function is very fast compared to Base's exp10, but the exp10 function in openlibm is dumb; it just calls the pow function
Amal.jl
https://github.com/JuliaMath/Amal.jl
General lessons
Developing accurate and fast elementary functions is very hard and time consuming
Strict accuracy constraints for a high-quality implementation require errors to be < 1 ulp
Must take advantage of the IEEE form of floating point numbers to extract maximal speed
Important considerations
Ideally prefer polynomial approximants over rational approximants (division is much slower than multiplication or addition)
Availability of the FMA instruction (fused multiply-add with no intermediate rounding)
Avoid algorithms that depend on table lookups (throughput of CPU instructions is outpacing memory access speed; as a result, table-based methods have fallen out of favor)
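The FMA point can be made concrete with a small sketch: Horner evaluation of a cubic via `muladd`, which lowers to one fused multiply-add per step on FMA-capable CPUs (one rounding instead of two). The coefficients here are arbitrary, for illustration only.

```julia
# Sketch: Horner evaluation of c0 + c1*x + c2*x^2 + c3*x^3 using muladd.
# Each muladd becomes a single FMA instruction when the hardware has one.
horner(x, c0, c1, c2, c3) = muladd(x, muladd(x, muladd(x, c3, c2), c1), c0)

# Modern Julia also ships evalpoly in Base, which does the same thing:
horner(2.0, 1.0, 1.0, 0.5, 1/6) ≈ evalpoly(2.0, (1.0, 1.0, 0.5, 1/6))
```

Unlike `fma`, `muladd` does not require fusing, so it stays fast on hardware without an FMA unit.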
Testing
Critical
Possible to test all Float32 numbers in a couple of minutes
Unfortunately it’s not possible to test all Float64 numbers
Benchmarking: @code_llvm/@code_native, @code_warntype, BenchmarkTools.jl
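An exhaustive Float32 test can be sketched as follows. `max_ulp_error` is a made-up helper, and the ulp measure below is the simple distance-to-reference divided by `eps`, not a rigorously defined ulp count:

```julia
# Sketch: worst-case ulp error of a Float32 function `f` against a Float64
# reference `ref` over a collection of bit patterns. Passing all of
# UInt32(0):UInt32(0xffffffff) covers every Float32 value.
function max_ulp_error(f, ref, bits)
    worst = 0.0
    for u in bits
        x = reinterpret(Float32, u)
        isfinite(x) || continue
        y = f(x)                    # value under test
        z = ref(Float64(x))         # reference, computed in higher precision
        isfinite(z) || continue
        err = abs(Float64(y) - z) / eps(Float32(z))  # rough distance in ulps
        worst = max(worst, err)
    end
    return worst
end

# spot-check Base's sin on a small slice of bit patterns near x = 1.0f0
max_ulp_error(sin, sin, UInt32(0x3f800000):UInt32(0x3f8000ff))
```

Iterating bit patterns rather than values guarantees no input is skipped, including subnormals.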
Developing approximants
1. Range reduction on the argument to the smallest set possible
2. Compute a minimax approximation polynomial on the reduced interval
3. Scale back if necessary
A Remez-based algorithm is used to compute the minimax polynomial
Minimizes the L∞ (maximum error) norm
For rational approximations use a Padé approximant
Developed Public Functions
exp, exp2, exp10
exp, exp2 have non-fma versions (msun based)
exp10 (polynomial, < 1 ulp for both fma and non-fma)
log
Floating point manipulation functions
frexp, ldexp, significand, exponent
Contributing to Base’s math.jl
Pure Julia ldexp function
regular branch: ~ 1.5x speedup
subnormal number and subnormal output: ~ 3.6x speedup
normal number and subnormal output: ~ 2.6x speedup
subnormal number and normal output: ~ 13x speedup
Generic over both arguments x AND e (may be Int128, BigInt, etc., without sacrificing Int64/Int32 performance)
Computes significand × 2^exponent without multiplication
Wins on readability and speed!
Generic!
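The "no multiplication" trick can be sketched for Float64 by adding e directly into the exponent field of the bit pattern. This toy version covers only the regular branch, normal input and normal output with no overflow checks, which the actual Base implementation of course handles:

```julia
# Minimal sketch of ldexp for Float64 via exponent-field manipulation.
# Only normal, non-overflowing cases are handled here.
function ldexp_sketch(x::Float64, e::Integer)
    u = reinterpret(UInt64, x)
    k = Int((u >> 52) & 0x7ff)   # extract the biased exponent field
    k += Int(e)                  # adjusting the exponent multiplies by 2^e
    @assert 0 < k < 0x7ff "sketch covers only normal results"
    return reinterpret(Float64, (u & ~(UInt64(0x7ff) << 52)) | (UInt64(k) << 52))
end

ldexp_sketch(1.5, 4)   # 1.5 * 2^4 = 24.0
```

No floating-point multiply is issued; the whole operation is integer shifts and masks on the bit pattern.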
exp function example
https://github.com/JuliaMath/Amal.jl/blob/master/src/exp.jl
Method:
1. Argument reduction: reduce x to an r so that |r| <= 0.5*ln(2). Given x, find r and integer k such that x = k*ln(2) + r, |r| <= 0.5*ln(2)
2. Approximate exp(r) by a polynomial on the interval [-0.5*ln(2), 0.5*ln(2)]: exp(r) = 1.0 + r + polynomial(r)
3. Scale back: exp(x) = 2^k * exp(r)
We have taken advantage of the float format!
Note the special representation!
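Those three steps can be sketched in plain Julia. This is a toy version: it uses a Taylor stub instead of a minimax polynomial and a single ln(2) constant instead of the hi/lo split a production implementation uses for an exact reduction:

```julia
const LN2 = 0.6931471805599453  # ln(2)

function exp_sketch(x::Float64)
    # 1. argument reduction: x = k*ln2 + r with |r| <= 0.5*ln2
    k = round(x / LN2)
    r = muladd(-k, LN2, x)       # r = x - k*ln2; FMA limits rounding error
    # 2. polynomial for exp(r): a plain Taylor stub here, not minimax
    p = evalpoly(r, (1.0, 1.0, 0.5, 1/6, 1/24, 1/120, 1/720))
    # 3. scale back: exp(x) = 2^k * exp(r), a pure exponent adjustment
    return ldexp(p, Int(k))
end

exp_sketch(1.0)   # close to e = 2.71828...
```

Step 3 is where the float format pays off: multiplying by 2^k is just `ldexp`, an exponent-field update, not a floating-point multiply.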
Median speed relative to Base: 0.95 (Amal/Base)
Possible to make this a tad faster
Custom macro that converts numeric coefficients to the same type as the argument, with no overhead (no dispatch)
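Such a macro can be sketched in the style of Base.Math's @horner. `@horner_sketch` below is a hypothetical version that wraps each literal coefficient in `oftype`, a conversion the compiler folds away at compile time since the coefficients are constants:

```julia
# Hypothetical sketch: unroll Horner's rule at macro-expansion time,
# converting each literal coefficient to the type of x with no runtime
# dispatch (the oftype calls on constants are folded by the compiler).
macro horner_sketch(x, coeffs...)
    ex = :(oftype(t, $(esc(coeffs[end]))))
    for c in reverse(coeffs[1:end-1])
        ex = :(muladd(t, $ex, oftype(t, $(esc(c)))))
    end
    return quote
        let t = $(esc(x))
            $ex
        end
    end
end

@horner_sketch(2.0, 1.0, 3.0, 2.0)   # 1 + 3*2 + 2*4 = 15.0
```

Because the coefficients track the argument's type, the same macro call stays Float32-pure for a Float32 argument instead of silently promoting to Float64.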
exp for non-fma systems
Padé approximation to
A Remez algorithm is used to generate a polynomial to approximate
Therefore we can write
(it’s actually better to then carry out another long division step to minimize cancellation effects)
Thank you for listening!