Scaling a file system to many cores using an operation log
Presentation Transcript

Slide1

Scaling a file system to many cores using an operation log

Srivatsa S. Bhat, Rasha Eqbal, Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich (MIT CSAIL)

Slide2

Motivation: Current file systems don’t scale well

Filesystem: Linux ext4 (4.9.21)
Benchmark: dbench [https://dbench.samba.org]
Experimental setup: 80 cores, 256 GB RAM
Backing store: "RAM" disk

Slide3

Linux ext4 scales poorly on multicore machines

Slide4

Concurrent file creation in Linux ext4

[Diagram: CORE 1 issues creat(dirA/file1) and CORE 2 issues creat(dirA/file2); both update dirA's block in memory, which ext4 writes to the journal on disk.]

Slide5

Block contention limits scalability of file creation

[Diagram: CORE 1 issues creat(dirA/file1) and CORE 2 issues creat(dirA/file2); both insert entries (file1: 100, file2: 200) into dirA's block, which ext4 journals to disk.]

Both cores contend on the directory block!
Contention on blocks limits scalability on 80 cores.
Even apps not limited by disk I/O don't scale.
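To make the contention concrete, here is a minimal sketch (an illustrative model, not ext4's actual code) of why two creats in the same directory serialize: both threads funnel through the same directory-block lock and dirty the same cache lines.

#include <cstdint>
#include <cstring>
#include <functional>
#include <mutex>
#include <thread>

// Simplified model of an on-disk directory block: one shared buffer
// protected by one lock (hypothetical layout, not ext4's real format).
struct DirBlock {
    std::mutex lock;      // every creat in this directory takes this lock
    char entries[4096];   // packed (name, inode#) records
    size_t used = 0;
};

// Both cores funnel through dir.lock and write the same cache lines, so
// creat(dirA/file1) and creat(dirA/file2) cannot proceed in parallel.
void creat_in_dir(DirBlock& dir, const char* name, uint32_t inum) {
    std::lock_guard<std::mutex> g(dir.lock);
    size_t len = std::strlen(name) + 1;
    std::memcpy(dir.entries + dir.used, name, len);
    dir.used += len;
    std::memcpy(dir.entries + dir.used, &inum, sizeof(inum));
    dir.used += sizeof(inum);
}

int main() {
    DirBlock dirA;
    std::thread c1(creat_in_dir, std::ref(dirA), "file1", 100);
    std::thread c2(creat_in_dir, std::ref(dirA), "file2", 200);
    c1.join();
    c2.join();
}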

Slide6

Goal: Multicore scalability

Problem: Contention limits scalability
Contention involves cache-line conflicts
Goal: Multicore scalability = no cache-line conflicts
Even a single contended cache line can wreck scalability
Commutative operations can be implemented without cache-line conflicts

[Scalable Commutativity Rule, Clements SOSP ’13]

How do we scale all commutative operations in file systems?

Slide7

ScaleFS approach: Two separate file systems

MemFS (in memory): directories as hash tables mapping link names to inode numbers; designed for multicore scalability.
DiskFS (on disk): block cache and journal; designed for durability.
fsync carries updates from MemFS to DiskFS.
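As a rough illustration of the split (an assumed structure, not the ScaleFS sources), a MemFS directory can be pictured as an in-memory hash table from link name to file number; the real design uses lock-free reads and finer-grained concurrency, which the single lock below elides for brevity.

#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Hypothetical in-memory directory for a MemFS-like design: link names map
// to in-memory file numbers. Inserts for different names land in different
// buckets, so concurrent creates in one directory need not collide on data.
class MemDir {
    std::unordered_map<std::string, uint64_t> entries_;
    mutable std::shared_mutex mu_;
public:
    // creat(dir/name): record the name -> file-number mapping in memory only.
    bool create(const std::string& name, uint64_t mnum) {
        std::unique_lock<std::shared_mutex> g(mu_);
        return entries_.emplace(name, mnum).second;
    }
    // Lookups take a shared lock; a scalable implementation would make
    // these reads lock-free (e.g. with seqlocks), as the design slides note.
    bool lookup(const std::string& name, uint64_t* mnum) const {
        std::shared_lock<std::shared_mutex> g(mu_);
        auto it = entries_.find(name);
        if (it == entries_.end()) return false;
        *mnum = it->second;
        return true;
    }
};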

Slide8

Concurrent file creation scales in ScaleFS

[Diagram: CORE 1 issues creat(dirA/file1) and CORE 2 issues creat(dirA/file2); both operate on dirA's hash table in MemFS, while DiskFS's block cache and journal on disk are untouched.]

Slide9

Concurrent file creation scales in ScaleFS

[Diagram: dirA's hash table in MemFS now holds file1: 100 and file2: 200; the two creats completed without writing to DiskFS.]

No contention, no cache-line conflicts, scalability!

Slide10

Challenge: How to implement fsync?

[Diagram: MemFS's dirA holds file1: 100 and file2: 200; an fsync must carry these updates down to DiskFS's block cache and journal on disk.]

Slide11

Challenge: How to implement fsync?

[Diagram: fsync writes dirA's entries (file1: 100, file2: 200) from MemFS's hash table into dirA's block in DiskFS's block cache and journal.]

DiskFS updates must be consistent with MemFS.
fsync must preserve conflict-freedom for commutative ops.

Slide12

Contributions

ScaleFS, a file system that achieves excellent multicore scalability
Two separate file systems: MemFS and DiskFS
Design for fsync:
Per-core operation logs to scalably defer updates to DiskFS
Ordering operations using Time Stamp Counters

Evaluation:
Benchmarks on ScaleFS scale 35x-60x on 80 cores
Workload/machine-independent analysis of cache-line conflicts
Suggests ScaleFS is a good fit for workloads not limited by disk I/O

Slide13

ScaleFS design: Two separate file systems

MemFS (in memory): designed for multicore scalability; uses hash tables, radix trees, and seqlocks for lock-free reads.
DiskFS (on disk): designed for durability; uses blocks, transactions, and journaling.
fsync moves updates from MemFS to DiskFS through per-core operation logs.
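A minimal sketch of the per-core operation log idea (types and names are assumptions, not the actual ScaleFS code): each core appends timestamped records of directory operations to its own cache-line-aligned log, so logging never writes a cache line shared with another core.

#include <cstdint>
#include <string>
#include <vector>

// One logged directory operation (hypothetical record format).
struct LogEntry {
    uint64_t tsc;          // timestamp from the core's Time Stamp Counter
    enum { CREATE, UNLINK, RENAME } op;
    uint64_t dir_mnode;    // directory the operation applies to
    std::string name;      // link name
    uint64_t mnode;        // file the operation refers to
};

// Per-core log: each core appends only to its own slot, so appending never
// touches a cache line that another core is writing.
struct alignas(64) PerCoreLog {
    std::vector<LogEntry> entries;
};

constexpr int kMaxCores = 80;
PerCoreLog oplogs[kMaxCores];

void log_op(int core, const LogEntry& e) {
    oplogs[core].entries.push_back(e);   // no shared lock, no shared data
}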

Slide14

Design challenges

How to order operations in the per-core operation logs?
How to operate MemFS and DiskFS independently:
How to allocate inodes in a scalable manner in MemFS?
. . .

Slide15

Problem: Preserve ordering of non-commutative ops

[Diagram: MemFS's dirA maps file1 to inode 100; CORE 1 issues unlink(file1), which will be recorded in the per-core operation logs rather than applied to DiskFS immediately.]

Slide16

Problem: Preserve ordering of non-commutative ops

[Diagram: the unlink(file1) from CORE 1 is recorded as op1: UNLINK in CORE 1's operation log.]

Slide17

Problem: Preserve ordering of non-commutative ops

[Diagram: after the unlink, CORE 2 issues creat(file1) while op1: UNLINK still sits in CORE 1's log.]

Slide18

Problem: Preserve ordering of non-commutative ops

[Diagram: the creat(file1) is recorded as op2: CREATE in CORE 2's log; MemFS's dirA now maps file1 to inode 200.]

Slide19

Problem: Preserve ordering of non-commutative ops

[Diagram: CORE 3 calls fsync with op1: UNLINK and op2: CREATE sitting in different per-core logs. In what order should they be applied to DiskFS? Order: how??]

Slide20

Solution: Use synchronized Time Stamp Counters

[Diagram: each operation appended to the per-core logs is tagged with a timestamp from the issuing core's synchronized Time Stamp Counter.]

[ RDTSCP does not incur cache-line conflicts ]

Slide21

Solution: Use synchronized Time Stamp Counters

[Diagram: op1: UNLINK carries timestamp ts1 and op2: CREATE carries timestamp ts2; since ts1 < ts2, the fsync on CORE 3 applies the unlink before the create when flushing to DiskFS.]

[ RDTSCP does not incur cache-line conflicts ]
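An illustrative sketch of the ordering scheme (helper names are hypothetical, not the ScaleFS code): every logged operation is stamped with the issuing core's TSC via RDTSCP, and at fsync the per-core logs are merged and sorted by timestamp so non-commutative operations reach DiskFS in the order they occurred.

#include <algorithm>
#include <cstdint>
#include <vector>
#include <x86intrin.h>   // __rdtscp

// Read the synchronized Time Stamp Counter. RDTSCP reads a per-core
// register, so timestamping does not touch any shared cache line.
static inline uint64_t read_tsc() {
    unsigned aux;
    return __rdtscp(&aux);
}

struct LogEntry { uint64_t tsc; /* op, name, mnode, ... */ };
struct PerCoreLog { std::vector<LogEntry> entries; };

// At fsync time: gather the relevant entries from every core's log and
// sort them by timestamp, recovering the global order of non-commutative
// operations (e.g. unlink(file1) at ts1 before creat(file1) at ts2).
std::vector<LogEntry> merge_logs(std::vector<PerCoreLog>& logs) {
    std::vector<LogEntry> merged;
    for (auto& log : logs)
        merged.insert(merged.end(), log.entries.begin(), log.entries.end());
    std::sort(merged.begin(), merged.end(),
              [](const LogEntry& a, const LogEntry& b) { return a.tsc < b.tsc; });
    return merged;   // apply to DiskFS in this order inside a journal transaction
}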

Slide22

Problem: How to allocate inodes scalably in MemFS?

[Diagram: creat(dirA/file1) on CORE 1 needs a number for file1 in dirA, but obtaining one from DiskFS's shared inode allocator would contend across cores.]

Slide23

Solution (1): Separate mnodes in MemFS from inodes in DiskFS

[Diagram: MemFS directories now map link names to mnode numbers; a per-core mnode allocator in MemFS sits alongside DiskFS's inode allocator.]

Slide24

Solution (1): Separate mnodes in MemFS from inodes in DiskFS

[Diagram: creat(dirA/file1) on CORE 1 obtains mnode number 100 from the per-core mnode allocator, without touching DiskFS's inode allocator.]

Slide25

Solution (2): Defer allocating inodes in DiskFS until an fsync

[Diagram: at fsync, DiskFS's inode allocator assigns an on-disk inode to each new mnode; an mnode-to-inode table records the mapping, e.g. mnode# 100 → inode# 456.]
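A rough sketch combining the two solutions (identifiers are hypothetical, not the ScaleFS code): MemFS hands out mnode numbers from a per-core allocator, and only at fsync does DiskFS allocate a real inode and record the mapping in an mnode-to-inode table.

#include <cstdint>
#include <mutex>
#include <unordered_map>

constexpr int kMaxCores = 80;

// Solution (1): per-core mnode allocator. Each core draws from its own
// numbering range (core id in the high bits), so allocation never contends.
struct alignas(64) MnodeAllocator {
    uint64_t next = 0;
};
MnodeAllocator mnode_alloc[kMaxCores];

uint64_t alloc_mnode(int core) {
    return (uint64_t(core) << 48) | mnode_alloc[core].next++;
}

// Solution (2): the DiskFS inode is allocated lazily, at fsync time.
// The mnode-to-inode table remembers which on-disk inode backs each mnode.
std::unordered_map<uint64_t, uint32_t> mnode_to_inode;
std::mutex table_mu;

uint32_t disk_inode_for(uint64_t mnode, uint32_t (*alloc_disk_inode)()) {
    std::lock_guard<std::mutex> g(table_mu);
    auto it = mnode_to_inode.find(mnode);
    if (it != mnode_to_inode.end()) return it->second;
    uint32_t inum = alloc_disk_inode();   // e.g. returns 456 for mnode 100
    mnode_to_inode.emplace(mnode, inum);
    return inum;
}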

Slide26

Other design challenges

How to scale concurrent fsyncs?
How to order lock-free reads?

How to resolve dependencies affecting multiple inodes?

How to ensure internal consistency despite crashes?

Slide27

Implementation

ScaleFS component                Lines of C++ code
MemFS (based on FS from sv6)     2,458
DiskFS (based on FS from xv6)    2,331
Operation Logs                   4,094

ScaleFS is implemented on the sv6 research operating system.

Supported filesystem system calls: creat, open, openat, mkdir, mkdirat, mknod, dup, dup2, lseek, read, pread, write, pwrite, chdir, readdir, pipe, pipe2, stat, fstat, link, unlink, rename, fsync, sync, close

Slide28

Evaluation

Does ScaleFS achieve good scalability?
Measure scalability on 80 cores
Observe conflict-freedom for commutative operations
Does ScaleFS achieve good disk throughput?
What memory overheads are introduced by ScaleFS's split of MemFS and DiskFS?

Slide29

Evaluation methodology

Machine configuration: 80 cores, with Intel E7-8870 2.4 GHz CPUs, 256 GB RAM
Backing store: "RAM" disk

Benchmarks:
mailbench: mail server workload
dbench: file server workload
largefile: creates a file, writes 100 MB, fsyncs, and deletes it
smallfile: creates, writes, fsyncs, and deletes lots of 1 KB files

Slide30

ScaleFS scales 35x-60x on a RAM disk

[ Single-core performance of ScaleFS is on par with Linux ext4. ]

Slide31

Machine-independent methodology

Use Commuter [Clements SOSP '13] to observe conflict-freedom for commutative ops
Commuter:
Generates testcases for pairs of commutative ops
Reports observed cache conflicts

Slide32

Conflict-freedom for commutative ops on Linux ext4: 65%


Slide33

Conflict-freedom for commutative ops on ScaleFS: 99.2%

Slide34

Conflict-freedom for commutative ops on ScaleFS: 99.2%

Why not 100% conflict-free?
Tradeoff scalability for performance
Probabilistic conflicts

Slide35

Evaluation summary

ScaleFS scales well on an 80-core machine
Commuter reports 99.2% conflict-freedom on ScaleFS
Workload/machine independent
Suggests scalability beyond our experimental setup and benchmarks

Slide36

Related Work

Scalability studies: FxMark [USENIX '16], Linux Scalability [OSDI '10]
Scaling file systems using sharding: Hare [Eurosys '15], SpanFS [USENIX '15]

ScaleFS uses similar techniques:
Operation logging: OpLog [CSAIL TR '14]
Per-inode / per-core logs: NOVA [FAST '16], iJournaling [USENIX '17], Strata [SOSP '17]
Decoupling in-memory and on-disk representations: Linux dcache, ReconFS [FAST '14]

ScaleFS focus: Achieve scalability by avoiding cache-line conflicts

Slide37

Conclusion

ScaleFS, a novel file system design for multicore scalability
Two separate file systems: MemFS and DiskFS
Per-core operation logs
Ordering using Time Stamp Counters

ScaleFS scales 35x-60x on an 80-core machine
ScaleFS is conflict-free for 99.2% of testcases in Commuter

https://github.com/mit-pdos/scalefs