Presentation Transcript

Slide 1

Advancement of Buffer Management Research and Development in Computer and Data Systems

Xiaodong Zhang

The Ohio State University

Slide 2

Numbers Everyone Should Know (Jeff Dean, Google)

- L1 cache reference: 0.5 ns
- Branch mispredict: 5 ns
- L2 cache reference: 7 ns
- Mutex lock/unlock: 25 ns
- Main memory reference: 100 ns
- Compress 1 KB with Zippy: 3,000 ns
- Send 2 KB over a 1 Gbps network: 20,000 ns
- Read 1 MB sequentially from memory: 250,000 ns
- Round trip within a data center: 500,000 ns
- Disk seek: 10,000,000 ns
- Read 1 MB sequentially from disk: 20,000,000 ns
- Send a packet from CA to Europe and back: 150,000,000 ns

Slide 3

Replacement Algorithms in Data Storage Management

A replacement algorithm decides which data entry to evict when the data storage is full.

- Objective: keep to-be-reused data; replace data that will not be reused.
- It makes a critical decision: a miss means an increasingly long delay.
- Widely used in all memory-capable digital systems: small buffers (cell phones, Web browsers, e-mail boxes, ...) and large buffers (virtual memory, I/O buffers, databases, ...).
- A simple concept, but hard to optimize: more than 40 years of tireless algorithmic and system efforts, yet LRU-like algorithms/implementations still have serious limitations.

Slide 4

Least Recently Used (LRU) Replacement

LRU is the most commonly used replacement policy for data management.

- Blocks are ordered in LRU order: they enter from the top (MRU end) and leave from the bottom (LRU end).
- Recency: the distance from a block to the top of the LRU stack.
- Upon a hit: move the block to the top.

[Figure: an LRU stack holding blocks such as 5, 3, 2, 9, 1. The stack is long, and the bottom is the only exit. The recency of block 2 is its distance to the top of the stack (here 2); upon a hit to block 2, it moves to the top of the stack.]

Slide 5

Least Recently Used (LRU) Replacement

LRU is the most commonly used replacement policy for data management. Blocks are ordered in LRU order: they enter from the top and leave from the bottom.

- Recency: the distance from a block to the top of the LRU stack.
- Upon a hit: move the block to the top.
- Upon a miss: evict the block at the bottom.

[Figure: upon a miss to block 6, block 6 is loaded from disk and put on the stack top; block 1 at the stack bottom is evicted. The stack is long, and the bottom is the only exit.]
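To make the stack operations concrete, here is a minimal LRU sketch in Python (illustrative only; the class name and the use of OrderedDict are my choices, not from the talk):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU: the last item of the OrderedDict is the stack top (MRU)."""
    def __init__(self, size):
        self.size = size
        self.stack = OrderedDict()

    def access(self, block):
        if block in self.stack:              # hit: move the block to the top
            self.stack.move_to_end(block)
            return True
        if len(self.stack) >= self.size:     # miss on a full cache:
            self.stack.popitem(last=False)   # evict the block at the bottom
        self.stack[block] = True             # the new block enters at the top
        return False
```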

Slide 6

LRU is a Classical Problem in Theory and Systems

First LRU paper: L. Belady, IBM Systems Journal, 1966.

Analyses of LRU algorithms:
- Aho, Denning & Ullman, JACM, 1971
- Rivest, CACM, 1976
- Sleator & Tarjan, CACM, 1985
- Knuth, J. Algorithms, 1985
- Karp et al., J. Algorithms, 1991

Many papers in systems and databases: ASPLOS, ISCA, SIGMETRICS, SIGMOD, VLDB, USENIX, ...

Slide 7

The Problem of LRU: Inability to Deal with Certain Access Patterns

- File scanning: one-time accessed data evicts to-be-reused data (cache pollution). This is a common access pattern (50% of the data in NCAR traces is accessed only once), and the LRU stack holds such blocks until they reach the bottom.
- Loop-like accesses: a loop over k+1 blocks misses on every access with an LRU stack of size k, as the demo below shows.
- Accesses with different frequencies (mixed workloads): frequently accessed data can be replaced by infrequently accessed data.
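Using the LRUCache sketch from above, the loop pathology is easy to reproduce:

```python
# Loop over k+1 = 5 blocks with an LRU cache of size k = 4:
# every access misses, even though the working set barely exceeds the cache.
cache = LRUCache(4)
hits = sum(cache.access(b) for b in list(range(5)) * 3)
print(hits)   # -> 0
```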

Slide 8

Why the Flawed LRU is so Powerful in Practice

What is the major flaw?
- The assumption that "recently used will be reused" is not always right; the prediction is based on the simple metric of recency.
- Some blocks are cached too long, and some are evicted too early.

Why is it so widely used?
- It works well for data accesses that follow the LRU assumption.
- It is a simple data structure to implement.

Slide 9

Challenges of Addressing the LRU Problem

Two types of efforts have been made:
- Detect specific access patterns and handle each case by case.
- Learn insights into accesses with complex algorithms.
Most published papers could not be turned into reality.

Two critical goals:
- Fundamentally address the LRU problem.
- Retain LRU's merits: low overhead and its underlying assumption.

The goals are achieved by a set of three papers:
- The LIRS algorithm (SIGMETRICS'02)
- Clock-Pro: a system implementation (USENIX'05)
- BP-Wrapper: lock-contention-free assurance (ICDE'09)

Slide 10

Outline

- The LIRS algorithm: how the LRU problem is fundamentally addressed, and how a data structure with low complexity is built.
- Clock-Pro: turning the LIRS algorithm into system reality.
- BP-Wrapper: freeing lock contention so that LIRS and others can be implemented without approximation.
- What would we do for multicore processors?
- Research impact in daily computing operations.

Slide 11

Recency vs. Reuse Distance

[Figure: an LRU stack under an access trace, illustrating Recency = 1 and Recency = 2 for recently referenced blocks.]

- Recency: the distance between the last reference to a block and the current time.
- Reuse distance (inter-reference recency): the distance between two consecutive references to the block (deeper and more useful information).

Slide 12

Recency vs. Reuse Distance

Inter-Reference Recency (IRR): the number of other unique blocks accessed between two consecutive references to a block.

[Figure: an LRU stack and an access trace for block 3: at its previous reference, block 3 had recency 2; when it is referenced again, its recency resets to 0 and its IRR is 2. A separate small LRU stack for HIR blocks is shown alongside.]

IRR is the recency the block had when it was last accessed; measuring it needs an extra stack to help, which increases complexity.

- Recency: the distance between the last reference to a block and the current time.
- Reuse distance (inter-reference recency): the distance between two consecutive references to the block (deeper and more useful information).
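As a concrete illustration, a small Python sketch (my own, not from the talk) that computes the IRR values observed in a reference trace:

```python
def irr_values(trace):
    """For each re-reference in `trace`, return (block, IRR), where IRR is the
    number of other distinct blocks touched between the two consecutive
    references to that block."""
    last_index = {}     # block -> position of its most recent reference
    result = []
    for i, block in enumerate(trace):
        if block in last_index:
            between = set(trace[last_index[block] + 1 : i])
            result.append((block, len(between)))
        last_index[block] = i
    return result

# Example: block 3 is re-referenced after blocks 4 and 5 were touched,
# so its IRR is 2; block 4 is re-referenced after only block 5, so IRR is 1.
print(irr_values([3, 4, 5, 4, 3]))   # -> [(4, 1), (3, 2)]
```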

Slide 13

Diverse Locality Patterns on an Access Map

[Figure: an access map with virtual time (the reference stream) on the x-axis and logical block number on the y-axis, showing regions of strong locality, loops, and one-time accesses.]

Slide 14

What Blocks Does LRU Cache (Measured by IRR)?

[Figure: IRR (reuse distance in blocks) vs. virtual time (the reference stream) for the MULTI2 workload, with the cache size marked on the locality-strength axis.]

- LRU holds frequently accessed blocks with "absolutely" strong locality.
- It also holds one-time accessed blocks (zero locality).
- It is likely to replace other blocks with relatively strong locality.

Slide 15

LIRS: Only Cache Blocks with Low Reuse Distances

[Figure: the same IRR vs. virtual time map for MULTI2; LIRS holds the strong-locality blocks (ranked by reuse distance) that fit within the cache size.]

Slide 16

Basic Ideas of LIRS (SIGMETRICS'02)

LIRS: Low Inter-reference Recency Set.

- Low-IRR blocks are kept in the buffer cache; high-IRR blocks are candidates for replacement.
- Two stacks are maintained: a large LRU stack contains the low-IRR resident blocks, and a small LRU stack contains the high-IRR blocks. The large stack also records resident and non-resident high-IRR blocks.
- IRRs are measured by the two stacks. Upon a hit to a resident high-IRR block in the small stack: if the block can also be found in the large stack, its IRR is low and it moves to the large stack; otherwise it is moved to the top of the small stack.
- When the large stack is full, the low-IRR block at its bottom becomes a high-IRR block and goes to the small stack.
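The following compact Python sketch (my own simplification with assumed names, not the paper's pseudocode) illustrates the two-stack bookkeeping: stack S holds LIR entries plus resident and non-resident HIR records, and queue Q holds the resident HIR blocks:

```python
from collections import OrderedDict

LIR, HIR = "LIR", "HIR"

class LIRSCache:
    """Simplified LIRS sketch: stack S (last item = top) and HIR queue Q."""
    def __init__(self, llirs, lhirs):
        self.llirs, self.lhirs = llirs, lhirs   # cache size L = llirs + lhirs
        self.S = OrderedDict()      # block -> LIR/HIR (HIR may be non-resident)
        self.Q = OrderedDict()      # resident HIR blocks, FIFO order
        self.resident = set()
        self.lir_count = 0

    def _prune(self):
        # Invariant: the bottom of S is always a LIR block.
        while self.S and next(iter(self.S.items()))[1] != LIR:
            self.S.popitem(last=False)

    def _demote_bottom_lir(self):
        # The bottom LIR block of S becomes a resident HIR block at Q's tail.
        block, _ = self.S.popitem(last=False)
        self.lir_count -= 1
        self.Q[block] = True
        self._prune()

    def _make_lir(self, block):
        self.S[block] = LIR
        self.S.move_to_end(block)
        self.lir_count += 1

    def access(self, block):
        hit = block in self.resident
        if self.S.get(block) == LIR:                    # LIR hit
            self.S.move_to_end(block)
            self._prune()
        elif hit:                                       # resident HIR hit
            if block in self.S:      # record found: its IRR is low, promote
                del self.Q[block]
                self._make_lir(block)
                self._demote_bottom_lir()
            else:                    # no record in S: stays HIR
                self.S[block] = HIR
                self.Q.move_to_end(block)
        else:                                           # miss
            if self.lir_count < self.llirs:             # warm-up: fill LIR set
                self._make_lir(block)
            else:
                if len(self.Q) >= self.lhirs:           # evict a resident HIR block
                    victim, _ = self.Q.popitem(last=False)
                    self.resident.discard(victim)
                if block in self.S:  # non-resident HIR with a record: promote
                    self._make_lir(block)
                    self._demote_bottom_lir()
                else:                # brand-new block: becomes resident HIR
                    self.S[block] = HIR
                    self.Q[block] = True
            self.resident.add(block)
        return hit
```

Under this scheme a one-time scan enters the cache only as HIR blocks and can never displace the LIR set.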

Slide 17

Low Complexity of LIRS

- Both recencies and IRRs are recorded in the two stacks: the block at the bottom of the LIRS stack has the maximum recency, and a block has low IRR if it can be found in both stacks.
- No explicit comparisons or measurements are needed.
- Complexity of LIRS = LRU = O(1), despite the additional object movements between the two stacks and the pruning operations in the stacks.

Slide 18

Data Structure: Keep LIR Blocks in Cache

Blocks are divided into low-IRR (LIR) blocks and high-IRR (HIR) blocks.

- LIR block set: size L_lirs.
- HIR block set: size L_hirs.
- Cache size: L = L_lirs + L_hirs.

[Figure: the physical cache partitioned between the two block sets, L_lirs for LIR blocks and L_hirs for HIR blocks.]

Slide 19

LIRS Operations

[Figure: a LIRS stack holding blocks 5, 3, 2, 1, 6, 9, 4, 8 and a small LRU stack for HIR blocks holding 5 and 3; the legend marks blocks as resident in cache, LIR, or HIR; cache size L = 5, L_lir = 3, L_hir = 2.]

Initialization: all referenced blocks are given LIR status until the LIR block set is full. Resident HIR blocks are placed in the small LRU stack.

The cases that follow:
- Upon accessing an LIR block (a hit)
- Upon accessing a resident HIR block (a hit)
- Upon accessing a non-resident HIR block (a miss)

Slide 20

Access an LIR Block (a Hit)

[Figure: example state with cache size L = 5, L_lir = 3, L_hir = 2. The LIRS stack holds blocks 5, 3, 2, 1, 6, 9, 4, 8; the small LRU stack for HIR blocks holds 5 and 3; the upcoming access sequence is 4, 8, 3, 5, 7, 9, 5. Block 4, an LIR block, is accessed.]

Slide 21

Access an LIR Block (a Hit)

[Figure: the hit block 4 is moved to the top of the LIRS stack.]

Slide 22

Access an LIR Block (a Hit)

[Figure: the next access hits LIR block 8, which also moves to the top of the LIRS stack.]

Slide 23

Access a Resident HIR Block (a Hit)

[Figure: the next access hits block 3, a resident HIR block that still has a record in the LIRS stack.]

Slide 24

Access a Resident HIR Block (a Hit)

[Figure: block 3 is promoted: it becomes an LIR block, moves to the top of the LIRS stack, and leaves the small HIR stack.]

Slide 25

Access a Resident HIR Block (a Hit)

[Figure: to keep the LIR set at L_lir = 3, the LIR block at the LIRS stack bottom (block 1) is demoted to a resident HIR block and enters the small HIR stack.]

Slide 26

Access a Resident HIR Block (a Hit)

[Figure: the LIRS stack is pruned so that an LIR block is again at its bottom.]

Slide 27

Access a Non-Resident HIR Block (a Miss)

[Figure: the next access, to block 7, misses; block 7 is not resident in the cache.]

Slide 28

Access a Non-Resident HIR Block (a Miss)

[Figure: a resident HIR block is evicted to free a buffer, and block 7 is fetched and placed on top of the LIRS stack as a resident HIR block.]

Slide 29

Access a Non-Resident HIR Block (a Miss)

[Figure: a miss to a block that still has a (non-resident) record in the LIRS stack promotes it to LIR; the bottom LIR block is demoted to the HIR stack.]

Slide 30

Access a Non-Resident HIR Block (a Miss)

[Figure: the resulting stacks after the misses are served.]

Slide 31

A Simplified Finite Automaton for LRU

Operations on the LRU stack upon a block access:
- Hit: place the block on the top.
- Miss: fetch the data and place the block on the top.
- Eviction: remove the block at the bottom.

Slide 32

A Simplified Finite Automaton for LIRS

[Figure: state diagram of LIRS operations upon a block access. LIR-stack operations: a hit moves the block up, followed by pruning; a demotion sends the bottom LIR block to the HIR stack. HIR-stack operations: a hit promotes the block to the LIR stack; a miss with a non-resident record also promotes the block to the LIR stack; a miss with no record adds a record on the resident HIR block; blocks are evicted from the HIR stack.]

Slide 33

A Simplified Finite Automaton for LIRS

[Figure: the same LIRS automaton, shown again.]

Slide 34

How LIRS Addresses the LRU Problem

- File scanning: one-time accessed blocks are replaced in a timely fashion, due to their high IRRs (see the demo below).
- Loop-like accesses: a section of the loop's data is protected in the low-IRR stack (misses happen only in the high-IRR stack).
- Accesses with distinct frequencies: frequently accessed blocks with short reuse distances will NOT be replaced (dynamic status changes).
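A small scan-resistance demo using the LIRSCache sketch from above (illustrative; block names are arbitrary):

```python
# Warm up three LIR blocks, then interleave re-references to a hot block
# with a long one-time scan.
cache = LIRSCache(llirs=3, lhirs=2)
for b in "ABC":
    cache.access(b)                    # A, B, C become LIR blocks

scan = [chr(ord("D") + i) for i in range(20)]   # one-time blocks D, E, ...
hot_hits = 0
for x in scan:
    cache.access(x)                    # scan block: enters only as HIR
    hot_hits += cache.access("A")      # the hot block keeps hitting
print(hot_hits)    # -> 20: the scan never evicts the LIR block A
```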

Slide 35

Performance Evaluation

- Trace-driven simulation on different patterns shows that LIRS outperforms existing replacement algorithms in almost all cases.
- The performance of LIRS is not sensitive to its only parameter, L_hirs.
- Performance is not affected even when the LIRS stack size is bounded.
- The time/space overhead is as low as LRU's.
- LRU is a special case of LIRS (without recording resident and non-resident HIR blocks in the large stack).

Slide 36

Looping Pattern: postgres (Time-Space Map)

[Figure: time-space access map for the postgres workload.]

Slide 37

Looping Pattern: postgres (IRR Map)

[Figure: IRR (reuse distance in blocks) vs. virtual time (the reference stream) for postgres, comparing the blocks cached by LRU and by LIRS.]

Slide 38

Looping Pattern: postgres (Hit Rates)

[Figure: hit rates on postgres for the compared replacement algorithms.]

Slide 39

Two Technical Issues in Turning It into Reality

- High overhead in implementations: for each data access, a set of operations defined by the replacement algorithm (e.g., LRU or LIRS) must be performed. This is not affordable for real systems (e.g., OS kernels and buffer caches); an approximation with reduced operations is required in practice.
- High lock-contention cost: for concurrent accesses, the stack(s) must be locked for each operation, and lock contention limits the scalability of the system.

Clock-Pro and BP-Wrapper address these two issues.

Slide 40

Only Approximations Can Be Implemented in an OS

- The dynamic changes in LRU and LIRS cause computing overhead, so OS kernels cannot directly adopt them.
- An approximation reduces the overhead at the cost of lower accuracy.
- The CLOCK algorithm for LRU approximation was first implemented in the Multics system in 1968 at MIT by Corbato (1990 Turing Award laureate).
- Objective: a LIRS approximation for OS kernels.

Slide 41

Basic Operations of CLOCK Replacement

- All resident pages are placed on a circular list, like a clock.
- Each page is associated with a reference bit, indicating whether the page has been accessed.

[Figure: a clock of pages with 0/1 reference bits and a rotating CLOCK hand.]

On a block HIT: set the reference bit to 1. No algorithm operations are needed.

Slide 42

Basic CLOCK Replacement

[Figure: the clock on a sequence of two MISSes.]

Upon a MISS: starting from the currently pointed page, the hand evicts the page if its bit is "0". A "1" page gets a second chance: its bit is reset from "1" to "0", and the hand moves on until it reaches a "0" page. The new block is inserted at the eviction spot with its reference bit set to 0. Upon the second MISS, the hand again evicts the next block with reference bit 0.
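A minimal CLOCK sketch in Python (illustrative; the names are mine, not from the talk):

```python
class ClockCache:
    """Minimal CLOCK: a circular list of pages, one reference bit each."""
    def __init__(self, size):
        self.size = size
        self.pages = []          # circular list of resident pages
        self.ref = {}            # page -> reference bit
        self.hand = 0

    def access(self, page):
        if page in self.ref:     # HIT: set the bit; no hand movement
            self.ref[page] = 1
            return True
        if len(self.pages) < self.size:      # cache not yet full
            self.pages.append(page)
            self.ref[page] = 0
            return False
        # MISS: give "1" pages a second chance (reset their bit to 0)
        # until the hand reaches a "0" page, which is evicted.
        while self.ref[self.pages[self.hand]] == 1:
            self.ref[self.pages[self.hand]] = 0
            self.hand = (self.hand + 1) % self.size
        victim = self.pages[self.hand]
        del self.ref[victim]
        self.pages[self.hand] = page         # insert the new page here, bit 0
        self.ref[page] = 0
        self.hand = (self.hand + 1) % self.size
        return False
```

Note that a hit touches only one bit, which is why CLOCK's hit path needs no lock.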

Slide 43

Unbalanced R&D on LRU versus CLOCK

LRU-related work:
- FBR (1990, SIGMETRICS)
- LRU-2 (1993, SIGMOD)
- 2Q (1994, VLDB)
- SEQ (1997, SIGMETRICS)
- LRFU (1999, OSDI)
- EELRU (1999, SIGMETRICS)
- MQ (2001, USENIX)
- LIRS (2002, SIGMETRICS)
- ARC (2003, FAST, IBM patent)

CLOCK-related work:
- CLOCK (1968, Corbato)
- GCLOCK (1978, ACM TODS)
- CAR (2004, FAST, IBM patent)
- CLOCK-Pro (2005, USENIX)

Slide 44

Basic Ideas of CLOCK-Pro

- It is an approximation of LIRS based on the CLOCK infrastructure.
- Pages are categorized into two groups, cold pages and hot pages, based on their reuse distances (IRRs).
- There are three hands: hand-hot for hot pages, hand-cold for cold pages, and hand-test for running a reuse-distance test on a block.
- The allocation of memory pages between hot pages (M_hot) and cold pages (M_cold) is adaptively adjusted (M = M_hot + M_cold).
- All hot pages are resident (= LIR blocks); some cold pages are also resident (= HIR blocks); recently replaced pages are tracked (= non-resident HIR blocks).

Slide 45

CLOCK-Pro (USENIX'05)

Two reasons for a resident cold page:
- A fresh replacement: a first access.
- It was demoted from a hot page.

[Figure: a CLOCK-Pro clock of pages numbered 0-24, each marked hot, cold resident, or cold non-resident with a reference bit, and three hands: hand-hot, hand-cold, and hand-test.]

- All hands move in the clockwise direction.
- Hand-cold is used to find a page for replacement.
- Hand-test: (1) determines whether a cold page should be promoted to hot; (2) removes non-resident cold pages from the clock.
- Hand-hot: finds a hot page to be demoted into a cold page.

Slide 46

Slide 47

Slide 48

Concurrency Management in Buffer Management

- Hit ratio is largely determined by the effectiveness of the replacement algorithm, which decides which pages are kept and which are evicted (LRU-k, 2Q, LIRS, ARC, ...).
- A lock (latch) is required to serialize the updates after each page request.

[Figure: page accesses pass through a lock (latch) guarding the replacement management of the buffer pool (in DRAM), which caches pages from the hard disk.]

- Concurrent accesses to buffer caches need a critical section.
- The buffer cache (pool) keeps hot pages; maximizing the hit ratio is the key.

Slide 49

Accurate Algorithms and Their Approximations

- Accurate algorithms: LRU, LIRS, ARC, ....
- Approximations: CLOCK (for LRU), CLOCK-Pro (for LIRS), CAR (for ARC).

[Figure: a clock of pages with reference bits and a CLOCK hand.]

- CLOCK sets the reference bit to 1 without a lock on a page hit; lock synchronization is used only for misses.
- The clock approximation reduces lock contention at the price of reduced hit ratios.

Slide 50

History of Buffer Pool Caching Management in PostgreSQL

- 1996-2000: LRU (suffered lock contention moderately, due to low concurrency).
- 2000-2003: LRU-k (hit ratio outperforms LRU, but lock contention became more serious).
- 2004: ARC/CAR were implemented, but quickly removed due to an IBM patent protection.
- 2005: 2Q was implemented (hit ratios were further improved, but lock contention was high).
- 2006 to now: CLOCK (an approximation of LRU; lock contention is reduced, but the hit ratio is the lowest compared with all the previous ones).

Slide 51

Trade-offs between Hit Ratios and Low Lock Contention

- For high hit ratio: LRU-k, 2Q, LIRS, ARC, SEQ, ... update page metadata and modify data structures under lock synchronization.
- For high scalability: CLOCK, CLOCK-Pro, and CAR need only low lock synchronization, but the clock-based approximations lower hit ratios (compared to the original algorithms).
- The transformation can be difficult and demands great effort; some algorithms do not have clock-based approximations.

Our goal: to have both!

Slide 52

Reducing Lock Contention by Batching Requests

[Figure: without batching, every page request enters the replacement algorithm (modifying its data structures) over the buffer pool. With one batch queue per thread, a page hit fetches the page directly, and the access history is committed to the replacement algorithm as a set of operations at a time.]
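A sketch of the batching idea in Python (my own illustration of the technique, not the paper's code; `replay` is an assumed bulk-update hook on the replacement policy):

```python
import threading

class BatchedReplacement:
    """BP-Wrapper-style batching sketch: accesses are buffered per thread and
    committed to the real replacement policy in one lock acquisition."""
    BATCH = 64

    def __init__(self, policy):
        self.policy = policy            # assumed to expose replay(pages)
        self.lock = threading.Lock()
        self.tls = threading.local()

    def record_access(self, page):
        buf = getattr(self.tls, "buf", None)
        if buf is None:
            buf = self.tls.buf = []
        buf.append(page)                # hit path: no lock taken here
        if len(buf) >= self.BATCH:
            # Try to commit the whole batch with a single, non-blocking
            # lock acquisition; if the lock is busy, keep batching.
            if self.lock.acquire(blocking=False):
                try:
                    self.policy.replay(buf)
                finally:
                    self.lock.release()
                del buf[:]
```

The point is that the lock is taken once per batch rather than once per access, and a busy lock never blocks the hit path.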

Slide 53

Reducing Lock Holding Time by Prefetching

[Figure: two threads on a timeline. Without prefetching, data-cache-miss stalls occur inside the critical section, lengthening the lock holding time; pre-reading the data that will be accessed in the critical section moves those stalls outside the lock.]
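In code, the idea looks roughly like this (an illustrative Python sketch; the dictionary update stands in for the real replacement-metadata changes):

```python
import threading

def commit_batch(lock: threading.Lock, batch, metadata: dict):
    # Pre-read (prefetch) everything the critical section will touch,
    # so the data-cache-miss stalls happen before the lock is held.
    for page in batch:
        _ = metadata.get(page)
    with lock:
        # The critical section now runs with warm caches and stays short.
        for page in batch:
            metadata[page] = metadata.get(page, 0) + 1
```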

Slide 54

Lock Contention Reduction by BP-Wrapper (ICDE'09)

Lock contention: a lock cannot be obtained without blocking. Measured as the number of lock acquisitions (contentions) per million page accesses, BP-Wrapper reduces contention by over 7,000 times!

Slide 55

Impact of LIRS in the Academic Community

- LIRS is a benchmark against which replacement algorithms are compared.
- Reuse distance was first used in replacement-algorithm design.
- A paper in SIGMETRICS'05 confirmed that LIRS outperforms all the other replacement algorithms.
- LIRS has become a topic taught in graduate and undergraduate classes on OS, performance evaluation, and databases at many US universities.
- The LIRS paper (SIGMETRICS'02) is highly and continuously cited.
- The Linux memory-management group has established an Internet forum on advanced replacement for LIRS.

Slide 56

LIRS Has Been Adopted in MySQL

- MySQL is the most widely used relational database, with 11 million installations in the world.
- The busiest Internet services use MySQL to maintain their databases for high-volume Web sites: Google, YouTube, Wikipedia, Facebook, Taobao, ....
- LIRS manages the buffer pool of MySQL; the adoption is in the most recent version (5.1), November 2008.

Slide 57

Slide 58

Slide 59

Infinispan (a Java-Based Open-Source Project)

The data grid forms a huge in-memory cache, which is managed using LIRS. BP-Wrapper is used to keep the cache free of lock contention.

Slide 60

Concurrentlinkedhashmap as a Software Cache

- A linked-list structure (a Java class): http://code.google.com/p/concurrentlinkedhashmap/wiki/Design
- Elements are linked and managed using the LIRS replacement policy.
- BP-Wrapper ensures freedom from lock contention.

Slide 61

LIRS in Management of Big Data

- LIRS has been adopted in GridGain software, a Java-based open-source middleware for real-time big-data processing and analytics (www.gridgain.com); LIRS makes replacement decisions for the in-memory data grid.
- Over 500 products and organizations use GridGain software daily: Sony, Cisco, Canon, Johnson & Johnson, Deutsche Bank, ....
- LIRS has been adopted in SYSTAP's storage management: big-data scale-out storage systems (www.bigdata.com).

Slide 62

LIRS in a Functional Programming Language: Clojure

- Clojure is a dynamic programming language that targets the Java Virtual Machine (http://clojure.org).
- A dialect of Lisp: functional programming, designed for concurrency, and used by many organizations.
- LIRS is a member of the Clojure cache library: LIRSCache.

Slide 63

LIRS Principle in Hardware Caches

- A cache-replacement hardware implementation based on Re-Reference Interval Prediction (RRIP), presented in ISCA'10 by Intel.
- Two bits are added to each cache line to measure reuse distance in a static and dynamic way.
- Performance gains are up to 4-10%.
- The hardware cost may not be affordable in practice.

Slide 64

Impact of Clock-Pro in OS and Other Systems

- Clock-Pro has been adopted in FreeBSD/NetBSD (open-source Unix).
- Two patches in the Linux kernel for users: Clock-Pro patches in 2.6.12 by Rik van Riel, and PeterZClockPro2 in 2.6.15-17 by Peter Zijlstra.
- Clock-Pro is patched in Apache Derby (a relational DB).
- Clock-Pro is patched in OpenLDAP (directory accesses).

Slide 65

Impact of Multicore Processors in Computer Systems

[Figure: a Dell Precision GX620 purchased in 2004 (a single core with L1, a 2 MB L2 cache, 256 MB of memory, and a disk) next to a Dell Precision 1500 purchased in 2009 at a similar price (four cores, each with L1 and L2, an 8 MB shared L3 cache, 8 GB of memory, and a disk).]

Slide 66

Performance Issues with the Multicore Architecture

[Figure: the Dell Precision 1500, purchased in 2009 at a similar price: four cores, each with L1 and L2, sharing an 8 MB L3 cache, 8 GB of memory, and a disk.]

- Slow data accesses to memory and disks continue to be major bottlenecks, and almost all the CPUs in Top-500 supercomputers are multicores.
- Cache contention and pollution: conflict cache misses among multiple threads can significantly degrade performance.
- Memory bus congestion: bandwidth is limited as the number of cores increases.
- "Disk wall": data-intensive applications also demand high throughput from disks.

Slide 67

Multicore Cannot Deliver the Expected Performance as It Scales

[Figure: ideal vs. real scaling curves as core counts grow.]

- "The Troubles with Multicores", David Patterson, IEEE Spectrum, July 2010
- "Finding the Door in the Memory Wall", Erik Hagersten, HPCwire, March 2009
- "Multicore Is Bad News for Supercomputers", Samuel K. Moore, IEEE Spectrum, November 2008

Performance: Throughput = Concurrency / Latency, improved by exploiting parallelism and exploiting locality.

Slide 68

Challenges of Managing the LLC in Multicores

Recent theoretical results about the LLC in multicores:
- Single core: an optimal offline LRU algorithm exists; online LRU is k-competitive (k is the cache size).
- Multicore: finding an offline optimal LRU is NP-complete; cache partitioning among threads is an optimal solution in theory.

System challenges in practice:
- The LLC lacks the necessary hardware mechanisms to control inter-thread cache contention; it shares the same design with single-core caches.
- System software has limited information and methods to effectively control cache contention.

Slide 69

OS Cache Partitioning in Multicores (HPCA'08)

[Figure: address translation maps a virtual address (virtual page number + page offset) to a physical address (physical page number + page offset); the physically indexed cache address consists of the cache tag, set index, and block offset. The page color bits overlap the low-order set-index bits and are under OS control.]

- Physically indexed caches are divided into multiple regions (colors).
- All cache lines in a physical page are cached in one of those regions (colors).
- The OS can control the page color of a virtual page through address mapping, by selecting a physical page with a specific value in its page color bits.
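The arithmetic behind page colors, as a small Python sketch (the cache parameters are example values, not from the talk):

```python
PAGE_SIZE  = 4096                 # bytes (4 KB pages)
LINE_SIZE  = 64                   # bytes per cache line
CACHE_SIZE = 8 * 1024 * 1024      # example: an 8 MB physically indexed LLC
ASSOC      = 16                   # example associativity

num_sets      = CACHE_SIZE // (ASSOC * LINE_SIZE)   # 8192 sets
sets_per_page = PAGE_SIZE // LINE_SIZE              # 64 sets covered by one page
num_colors    = num_sets // sets_per_page           # 128 colors

def page_color(physical_page_number: int) -> int:
    # The color is given by the set-index bits that lie above the page offset;
    # pages with the same color map to the same cache region.
    return physical_page_number % num_colors

# Two physical pages with different colors never conflict in the LLC, so the
# OS can partition the cache by allocating pages of disjoint colors.
print(page_color(0x12345))   # -> 69
```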

Slide 70

Shared LLC Can Be Partitioned into Multiple Regions

[Figure: physical pages are grouped into bins based on their page colors (1, 2, 3, 4, ..., i, i+1, i+2, ...); through OS address mapping, the physically indexed shared cache is partitioned between two processes.]

Main memory space needs to be partitioned too (co-partitioning).

Slide 71

Implementations in Linux and Their Impact

- Static partitioning: predetermines the amount of cache blocks allocated to each running process at the beginning of its execution.
- Dynamic cache partitioning: adjusts cache allocations among processes dynamically, changing processes' cache usage through OS page address re-mapping (page re-coloring).

Current status of the system facility:
- Open source in Linux kernels.
- Adopted as a software solution by Intel SSG in May 2010.
- Used in applications on Intel platforms, e.g., automation.

Slide 72

Final Remarks: Why Do LIRS-Related Efforts Make the Difference?

- Caching the most deserving data blocks: using reuse distance as the ruler, approaching the optimal; 2Q, LRU-k, ARC, and others can still cache undeserving blocks.
- LIRS with its two stacks yields constant-time operations, O(1): consistent with LRU, but recording much more useful information.
- Clock-Pro turns LIRS into reality in production systems; none of the other algorithms except ARC have approximation versions.
- BP-Wrapper ensures freedom from lock contention in DBMSs.
- OS partitioning executes the LIRS principle in the LLC of multicores: protect strong-locality data, and control weak-locality data.

Slide 73

Acknowledgement to Co-authors and Sponsors

- Song Jiang, Ph.D. '04 at William and Mary; faculty at Wayne State
- Feng Chen, Ph.D. '10; Intel Labs (Oregon)
- Xiaoning Ding, Ph.D. '10; Intel Labs (Pittsburgh)
- Qingda Lu, Ph.D. '09; Intel (Oregon)
- Jiang Lin, Ph.D. '08 at Iowa State; AMD
- Zhao Zhang, Ph.D. '02 at William and Mary; faculty at Iowa State
- P. Sadayappan, Ohio State

Continuous support from the National Science Foundation.

Slide 74

CSE 788, Winter Quarter 2011: Principle of Locality in Design and Implementation of Computer and Distributed Systems

- Exploiting locality at different levels of computer systems.
- Challenges of algorithm design and implementation.
- Readings of both classical and new papers.
- A proposals- and projects-based class.
- Much high-quality research started from this class and was published in FAST, HPCA, Micro, PODC, PACT, SIGMETRICS, USENIX, and VLDB.
- You are welcome to take the class next quarter.

Slide 75

Xiaodong Zhang: zhang@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~zhang

Thank You!