CS 295: Modern Systems

Uploaded by jane-oiler on 2019-11-22

Presentation Transcript

CS 295: Modern Systems
Cache-Efficient Algorithms
Sang-Woo Jun, Spring 2019

Back To The Matrix Multiplication Example
Blocked matrix multiplication recap: the C1 sub-matrix = A1×B11 + A1×B21 + A1×B31 + …
Intuition: one full read of B^T per S rows of A, repeated N/S times.
[Figure: C = A × B^T, with A split into row blocks A1…A4 of height S, B^T split into S-wide blocks B11…B34, and N the full matrix dimension]
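The blocked scheme described above can be sketched as follows; this is a minimal illustration with assumed names (the lecture does not show code), multiplying C = A × B with the inner work done block by block so each block's working set can fit in cache.

```python
# Blocked (tiled) matrix multiply sketch: C = A x B for N x N lists of lists.
# S is the block size; each inner triple loop touches only O(S^2) elements
# of each operand, so with a suitable S the working set stays cache-resident.
def blocked_matmul(A, B, N, S):
    C = [[0.0] * N for _ in range(N)]
    for ii in range(0, N, S):          # block row of A / C
        for jj in range(0, N, S):      # block column of B / C
            for kk in range(0, N, S):  # block along the shared dimension
                for i in range(ii, min(ii + S, N)):
                    for k in range(kk, min(kk + S, N)):
                        a = A[i][k]
                        for j in range(jj, min(jj + S, N)):
                            C[i][j] += a * B[k][j]
    return C
```

The i-k-j loop order inside each block keeps the innermost accesses to B and C sequential in memory.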

Back To The Matrix Multiplication Example
For sub-block size S × S -> N × N × (N/S) reads. What S do we use?
Optimized for L1? (64 KiB for me, but who knows for anyone else?)
If S × S exceeds the cache, we lose performance.
If S × S is too small, we lose performance.
Do we ignore the rest of the cache hierarchy?
Say S is optimized for L3: the S × S multiplication is further divided into T × T blocks for the L2 cache, the T × T multiplication is further divided into U × U blocks for the L1 cache, …

Solution: Cache-Oblivious Algorithms
No explicit knowledge of the cache architecture/structure, except that one exists and is hierarchical.
Also uses the “tall cache assumption”, which is natural: B² < cM for a small constant c.
(e.g., a modern Intel L1: M = 64 KiB, B = 64 B)
Still (mostly) cache-optimal.
Typically divide-and-conquer algorithms.

Aside: Even More Important With Storage/Network
The latency difference becomes even larger:
Cache: ~5 ns
DRAM: 100+ ns
Network: 10,000+ ns
Storage: 100,000+ ns
The access granularity also becomes larger:
Cache/DRAM: cache lines (64 B)
Storage: pages (4 KB – 16 KB)
Also see: “Latency Numbers Every Programmer Should Know”
https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html

Applications of Interest
Matrix Multiplication
Merge Sort
Stencil Computation
Trees And Search

Recursive Matrix Multiplication
C = [ C11 C12 ; C21 C22 ],  A = [ A11 A12 ; A21 A22 ],  B = [ B11 B12 ; B21 B22 ]
C = A × B = [ A11B11 + A12B21   A11B12 + A12B22 ; A21B11 + A22B21   A21B12 + A22B22 ]
8 multiply-adds of (n/2) × (n/2) matrices.
Recurse down until very small.
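The eight (n/2) × (n/2) multiply-adds can be sketched like this (an illustrative helper, not the lecture's code; the cutoff value and quadrant-offset encoding are assumptions):

```python
# Cache-oblivious recursive matrix multiply sketch: C += A x B for n x n
# matrices, n a power of two. Each call splits every operand into four
# quadrants and performs the 8 recursive multiply-adds; below `cutoff`,
# it falls back to the naive triple loop.
def rec_matmul(A, B, C, n, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0, cutoff=8):
    if n <= cutoff:
        for i in range(n):
            for k in range(n):
                a = A[ai + i][aj + k]
                for j in range(n):
                    C[ci + i][cj + j] += a * B[bi + k][bj + j]
        return
    h = n // 2
    # Quadrant of C -> the two (A-quadrant, B-quadrant) products summed into it,
    # e.g. C11 += A11*B11 + A12*B21.
    for (di, dj), terms in {
        (0, 0): [((0, 0), (0, 0)), ((0, h), (h, 0))],
        (0, h): [((0, 0), (0, h)), ((0, h), (h, h))],
        (h, 0): [((h, 0), (0, 0)), ((h, h), (h, 0))],
        (h, h): [((h, 0), (0, h)), ((h, h), (h, h))],
    }.items():
        for (da_i, da_j), (db_i, db_j) in terms:
            rec_matmul(A, B, C, h, ai + da_i, aj + da_j,
                       bi + db_i, bj + db_j, ci + di, cj + dj, cutoff)
```

No block size appears anywhere: the recursion itself produces sub-problems at every scale, so some level fits each cache in the hierarchy.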

Performance Analysis
Work: the recursion tree has depth log(N) and each node has fan-out 8 -> same amount of work as the iterative version!
Cache misses: the recursion tree for cache accesses has depth log(N) - (1/2)log(cM), because we stop recursing once n² < cM for a small constant c.
So the number of leaves = 8^(log(N) - (1/2)log(cM)) = N³/(cM)^(3/2).
At each leaf, we load Θ(M/B) cache lines.
Total cache lines read = Θ(N³/(B√M)) -> matches the optimal blocked solution.
Also, with only log(N) recursion depth, the function call overhead is not high.

Bonus: Cache-Oblivious Matrix Transpose
Also possible to define recursively:
A = [ A11 A12 ; A21 A22 ]  ->  A^T = [ A11^T A21^T ; A12^T A22^T ]
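The recursive transpose can be sketched as below (out-of-place for simplicity; the function and parameter names are illustrative assumptions):

```python
# Cache-oblivious transpose sketch: writes T = A^T for an n x n matrix,
# n a power of two. Transposing the quadrant of A at (r, c) lands it at
# (c, r) in T, so the four recursive calls realize the block identity
# A^T = [A11^T A21^T ; A12^T A22^T].
def rec_transpose(A, T, n, r=0, c=0, cutoff=8):
    if n <= cutoff:
        for i in range(n):
            for j in range(n):
                T[c + j][r + i] = A[r + i][c + j]
        return
    h = n // 2
    rec_transpose(A, T, h, r,     c,     cutoff)  # A11 -> T11
    rec_transpose(A, T, h, r,     c + h, cutoff)  # A12 -> T21
    rec_transpose(A, T, h, r + h, c,     cutoff)  # A21 -> T12
    rec_transpose(A, T, h, r + h, c + h, cutoff)  # A22 -> T22
```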

Applications of Interest
Matrix Multiplication
Trees And Search
Merge Sort
Stencil Computation

Trees And Search
Binary search trees are cache-inefficient.
e.g., searching for 72 results in 3 cache line reads. Not to mention trees in the heap!
[Figure: a 4-level binary search tree (50; 20, 70; 10, 30, 60, 90; 1, 11, 25, 33, 55, 66, 72, 99), with each tree layer L1–L4 stored in its own cache block B1–B4]
Each traversal step pretty much hits a new cache line: Θ(log(N)) cache lines read.

Better Layout For Trees
The tree can instead be organized into locally encoded sub-trees: much better cache characteristics!
[Figure: the same tree, with layers L1+L2 stored together in one cache block B1, and each bottom sub-tree (L3+L4)1 … (L3+L4)4 stored in its own cache block B2, B3, B4, …]
But we want cache-obliviousness: how do we choose the size of the sub-trees?

van Emde Boas Layout
A recursively organized binary tree (it needs to be balanced to be efficient):
Split a tree of height h at the middle into a top sub-tree A of height ceiling(h/2) and bottom sub-trees B1 … Bk hanging off its leaves. Lay out A first, then B1 … Bk contiguously, and recurse into each (heights ceiling(h/4), …) until the sub-trees have size 1.
In terms of cache accesses: the recursion bottoms out at sub-trees that fit in a cache line (≤ B bytes), so such a sub-tree has height ~log(B). A root-to-leaf search therefore traverses only Θ(log(N)/log(B)) of these leaf sub-trees.
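As an illustrative sketch (assumed helper names, not the paper's code), the layout order for a complete binary tree stored by BFS index (root = 1, children of node i at 2i and 2i+1) can be computed recursively:

```python
# van Emde Boas layout sketch: returns the BFS indices of a complete binary
# tree of the given height, in the order they would be stored. A tree of
# height h is split into a top tree of height ceiling(h/2) and bottom trees
# of the remaining height; the top tree is laid out first, then each bottom
# sub-tree contiguously, recursing into all of them.
def veb_order(root=1, height=4):
    if height == 1:
        return [root]
    top_h = (height + 1) // 2           # ceiling(h/2)
    bot_h = height - top_h
    order = list(veb_order(root, top_h))
    # Leaves of the top tree, left to right (depth top_h - 1 below root):
    leaves = [root]
    for _ in range(top_h - 1):
        leaves = [c for n in leaves for c in (2 * n, 2 * n + 1)]
    for leaf in leaves:                  # bottom sub-trees, left to right
        order += veb_order(2 * leaf, bot_h)
        order += veb_order(2 * leaf + 1, bot_h)
    return order
```

For height 4 this yields [1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]: the top 3-node triangle, then each bottom 3-node triangle stored contiguously.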

Performance Evaluations Against Binary Tree
Brodal et al., “Cache Oblivious Search Trees via Binary Trees of Small Height,” SODA ’02.
Setup: 1 GHz Pentium III (Coppermine), 256 KB cache, 1 GB DRAM.
* Pointer: each node stores pointers to its two children.
* Implicit: children locations are calculated.

Performance Evaluations Against Binary Tree And B-Tree
Brodal et al., “Cache Oblivious Search Trees via Binary Trees of Small Height,” SODA ’02.
* High1024: 1024 elements per node, to make use of the whole cache line (B-tree).
Question: how do we optimize N in HighN?
Databases use an N optimized for the storage page size.

More on the van Emde Boas Tree
It is actually a tricky data structure for inserts/deletions, since the tree needs to stay balanced to be effective.
van Emde Boas trees with van Emde Boas trees as leaves?
A good thing to have in the back of your head!

Applications of Interest
Matrix Multiplication
Trees And Search
Merge Sort
Stencil Computation

Merge Sort
[Animation: depth-first vs. breadth-first merge sort]
Source: https://imgur.com/gallery/voutF, created by morolin

Merge Sort Cache Effects
Depth-first binary merge sort is relatively cache-efficient: log(N) merge levels, with a full pass over the data only at the levels whose blocks are larger than M.
Merge sort with a higher fan-in (say, R) is more cache-efficient, using a tournament of mergers: log_R passes instead of log_2.
Cache obliviousness: how do we choose R?
Too large an R spills the merge out of the cache -> thrashing -> performance loss!

Lazy K-Merger
Again, a recursive definition of mergers!
A k-merger has a k³-element output buffer.
It is built from √k-mergers: √k bottom-level sub-mergers feed into 1 top-level sub-merger, each sub-merger has √k inputs, and there is a k^(3/2)-element buffer per bottom sub-merger.
Recurses until very small fan-in (two?).

Lazy K-Merger
Procedure Fill(v):
  while v’s output buffer is not full:
    if the left input buffer is empty: Fill(left child of v)
    if the right input buffer is empty: Fill(right child of v)
    perform one merge step
Each k-merger fits in O(k²) space: ideal cache effects! (Proof too complex to show today…)
What should k be? Given N elements, k = N^(1/3): “Funnelsort”.
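The Fill procedure can be sketched on a simplified, hypothetical merger tree: every merger here is binary and all buffers share one small size, rather than the √k-way structure with k^(3/2)-element buffers of the real funnel, but the lazy on-demand refilling is the same idea.

```python
# Lazy binary merger-tree sketch. Leaves hold sorted runs; each internal
# merger owns a small output buffer and refills its children's buffers
# lazily, only when it needs their next element.
class Node:
    def __init__(self, runs, buf_size=4):
        self.buf = []                    # output buffer; front = smallest
        self.buf_size = buf_size
        self.exhausted = False           # True once no more output can come
        if len(runs) == 1:
            self.leaf = runs[0][:]       # a sorted input run
            self.left = self.right = None
        else:
            self.leaf = None
            mid = len(runs) // 2
            self.left = Node(runs[:mid], buf_size)
            self.right = Node(runs[mid:], buf_size)

    def fill(self):
        # Procedure Fill(v): refill v's output buffer.
        if self.leaf is not None:        # leaf: copy from the input run
            take = self.buf_size - len(self.buf)
            self.buf.extend(self.leaf[:take])
            del self.leaf[:take]
            self.exhausted = not self.leaf
            return
        while len(self.buf) < self.buf_size:
            for c in (self.left, self.right):
                if not c.buf and not c.exhausted:
                    c.fill()             # refill an empty input buffer
            ready = [c for c in (self.left, self.right) if c.buf]
            if not ready:
                self.exhausted = True
                break
            c = min(ready, key=lambda n: n.buf[0])
            self.buf.append(c.buf.pop(0))   # one merge step

def funnel_merge(runs, buf_size=4):
    """Merge a list of sorted runs through the lazy merger tree."""
    root = Node(runs, buf_size)
    out = []
    while not root.exhausted or root.buf:
        if not root.buf:
            root.fill()
        out.extend(root.buf)
        root.buf = []
    return out
```

Because a merger only recurses into a child when that child's buffer is empty, whole sub-merger invocations run to completion on data that stays resident, which is the property the O(k²) space bound exploits.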

In-Memory Funnelsort Empirical Performance
Overhead…
Source: Brodal et al., “Engineering a Cache-Oblivious Sorting Algorithm”
Probably need SIMD to really test the memory limitations.

In-Memory Funnelsort Empirical Performance
Improvement! Overhead…
The Athlon has more L1 and less L2. Is that why?
Source: Brodal et al., “Engineering a Cache-Oblivious Sorting Algorithm”

In-Storage Funnelsort Empirical Performance
Storage-optimized library! Improvement!
Source: Brodal et al., “Engineering a Cache-Oblivious Sorting Algorithm”

Applications of Interest
Matrix Multiplication
Trees And Search
Merge Sort
Stencil Computation

Stencil Computation
Example: heat diffusion.
Uses a parabolic partial differential equation to simulate heat diffusion.

Heat Equation In Stencil Form
Simplified model: 1-dimensional heat diffusion.

A 3-point Stencil
u(x, t + Δt) can be calculated using u(x, t), u(x + Δx, t), and u(x - Δx, t).
[Figure: the space-time grid with axes x and t; boundary cells act as sentries]
A “stencil” updates each position using the surrounding values as input. This is a 1D 3-point stencil; 2D 5-point, 2D 9-point, 3D 7-point, and 3D 25-point stencils are also popular.
Popular for simulations, including fluid dynamics and solving linear equations and PDEs.
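The update above can be sketched with a naive time loop; `r` stands for αΔt/Δx², an assumed parameter name, and the stability condition r ≤ 1/2 for this explicit scheme is a standard fact not stated on the slide.

```python
# 1D 3-point heat stencil sketch, naive (loop over time, then space).
def step(u, r):
    """One step: u(x, t+dt) = u(x,t) + r*(u(x-dx,t) - 2*u(x,t) + u(x+dx,t))."""
    new = u[:]                       # boundary "sentries" stay fixed
    for x in range(1, len(u) - 1):
        new[x] = u[x] + r * (u[x - 1] - 2 * u[x] + u[x + 1])
    return new

def simulate(u, r, steps):
    for _ in range(steps):
        u = step(u, r)
    return u
```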

Some Important Stencils
[1] 19-point 3D stencil for Lattice Boltzmann Method flow simulation.
[2] 25-point 3D stencil for seismic wave propagation applications.
[1] Peng et al., “High-Order Stencil Computations on Multicore Clusters”
[2] Gentryx, Wikipedia

Cache Behavior of Naïve Loops
Using the 1D 3-point stencil: unless the x dimension is small enough to fit in cache, there is no cache reuse between time steps.
Continuing the theme: can we recursively process the data in a cache-optimal way?

Cache-Efficient Processing: Trapezoid Units
Computation in a trapezoid is either:
Self-contained: it does not require anything from outside, or
Only uses data that has already been computed and is ready (processed after its dependencies).

Recursion #1: Space Cut
If width >= height × 2:
Cut the trapezoid through the center using a line of slope -1.
Process the left piece, then the right.

Recursion #2: Time Cut
If width < height × 2:
Cut the trapezoid with a horizontal line through the center.
Process the bottom half, then the top.
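The two cuts can be sketched together, patterned after Frigo and Strumpen's published trapezoidal walk; the full space-time `grid` used here is an illustrative simplification (real implementations reuse a couple of rows), and the `kernel` callback is an assumed interface.

```python
# Trapezoidal recursion sketch for a 1D 3-point stencil.
# The trapezoid spans times [t0, t1); at time t it covers positions
# [x0 + dx0*(t - t0), x1 + dx1*(t - t0)), with edge slopes dx in {-1, 0, 1}.
# kernel(grid, x, t) computes the value at (x, t + 1) from row t.
def walk(grid, t0, t1, x0, dx0, x1, dx1, kernel):
    dt = t1 - t0
    if dt == 1:
        for x in range(x0, x1):
            kernel(grid, x, t0)
    elif dt > 1:
        if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt:
            # Space cut: slice through the center with a line of slope -1;
            # the left piece never depends on the right, so do left first.
            xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) // 4
            walk(grid, t0, t1, x0, dx0, xm, -1, kernel)
            walk(grid, t0, t1, xm, -1, x1, dx1, kernel)
        else:
            # Time cut: halve the time range; process the bottom half first.
            s = dt // 2
            walk(grid, t0, t0 + s, x0, dx0, x1, dx1, kernel)
            walk(grid, t0 + s, t1, x0 + dx0 * s, dx0,
                 x1 + dx1 * s, dx1, kernel)
```

For N points with fixed boundaries and T steps, the whole computation is the rectangle walk(grid, 0, T, 1, 0, N-1, 0, kernel); the recursion carves it into trapezoids whose working sets fit in cache at some level.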

Analysis
Intuitively, trapezoids are split until they are of size M (the cache size).
Data read = Θ(NT/M)
Cache lines read = Θ(NT/MB)
Let’s look at a performance video!