Slide 1: Scaling a file system to many cores using an operation log
Srivatsa S. Bhat, Rasha Eqbal, Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich
MIT CSAIL
Slide 2: Motivation: Current file systems don't scale well
File system: Linux ext4 (4.9.21)
Benchmark: dbench [https://dbench.samba.org]
Experimental setup: 80 cores, 256 GB RAM
Backing store: "RAM" disk
Slide 3: Linux ext4 scales poorly on multicore machines
Slide 4: Concurrent file creation in Linux ext4
[Diagram: core 1 runs creat(dirA/file1) and core 2 runs creat(dirA/file2); both operations update dirA's block in the in-memory ext4 journal before it is written to disk.]
Slide 5: Block contention limits scalability of file creation
[Diagram: both creat calls insert their entries (file1 : 100, file2 : 200) into the same dirA block in memory, so core 1 and core 2 contend on the directory block.]
Contention on blocks limits scalability on 80 cores.
Even apps not limited by disk I/O don't scale.
Slide 6: Goal: Multicore scalability
Problem: Contention limits scalability.
Contention involves cache-line conflicts.
Goal: Multicore scalability = no cache-line conflicts.
Even a single contended cache line can wreck scalability.
Commutative operations can be implemented without cache-line conflicts [Scalable Commutativity Rule, Clements SOSP '13].
How do we scale all commutative operations in file systems?
Slide 7: ScaleFS approach: Two separate file systems
MemFS (in memory): designed for multicore scalability; directories are hash tables mapping link name to inode number.
DiskFS (on disk): designed for durability; maintains the block cache and the journal.
fsync propagates updates from MemFS to DiskFS.
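To make the hash-table directory concrete, here is a minimal C++ sketch (an illustration of the idea, not ScaleFS's actual code): an in-memory directory mapping link names to inode/mnode numbers, with one lock per bucket so that creates of different names in the same directory touch different buckets and avoid sharing a cache line.

```cpp
// Minimal sketch of a MemFS-style in-memory directory (illustrative only):
// a fixed-size hash table from link name to inode/mnode number with one
// lock per bucket, so creat(dirA/file1) and creat(dirA/file2) on different
// cores usually touch different buckets and avoid cache-line conflicts.
#include <cstdint>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

class MemDir {
public:
    // Insert a new link; returns false if the name already exists.
    bool create(const std::string& name, uint64_t num) {
        Bucket& b = bucket(name);
        std::lock_guard<std::mutex> g(b.lock);
        return b.entries.emplace(name, num).second;
    }

    // Look up a link name; returns the inode/mnode number if present.
    std::optional<uint64_t> lookup(const std::string& name) {
        Bucket& b = bucket(name);
        std::lock_guard<std::mutex> g(b.lock);
        auto it = b.entries.find(name);
        if (it == b.entries.end()) return std::nullopt;
        return it->second;
    }

private:
    static constexpr size_t kBuckets = 256;
    struct Bucket {
        std::mutex lock;
        std::unordered_map<std::string, uint64_t> entries;  // link name -> number
    };
    Bucket& bucket(const std::string& name) {
        return buckets_[std::hash<std::string>{}(name) % kBuckets];
    }
    Bucket buckets_[kBuckets];
};
```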
Slide 8: Concurrent file creation scales in ScaleFS
[Diagram: core 1 runs creat(dirA/file1) and core 2 runs creat(dirA/file2); both operate on dirA's hash table in MemFS, while DiskFS (block cache, journal, disk) is untouched.]
Slide 9: Concurrent file creation scales in ScaleFS
[Diagram: dirA's hash table in MemFS now holds file1 → 100 and file2 → 200; DiskFS remains untouched.]
No contention, no cache-line conflicts: scalability!
Slide 10: Challenge: How to implement fsync?
[Diagram: dirA's MemFS hash table holds file1 → 100 and file2 → 200; an fsync must carry these updates down to DiskFS.]
Slide 11: Challenge: How to implement fsync?
[Diagram: fsync copies dirA's entries (file1 : 100, file2 : 200) from the MemFS hash table into dirA's block in the DiskFS block cache and journal.]
DiskFS updates must be consistent with MemFS.
fsync must preserve conflict-freedom for commutative ops.
Slide 12: Contributions
ScaleFS, a file system that achieves excellent multicore scalability
Two separate file systems: MemFS and DiskFS
Design for fsync: per-core operation logs to scalably defer updates to DiskFS; ordering operations using Time Stamp Counters
Evaluation: benchmarks on ScaleFS scale 35x-60x on 80 cores
Workload/machine-independent analysis of cache conflicts suggests ScaleFS is a good fit for workloads not limited by disk I/O
Slide 13: ScaleFS design: Two separate file systems
MemFS (in memory): designed for multicore scalability; uses hash tables, radix trees, and seqlocks for lock-free reads.
DiskFS (on disk): designed for durability; uses blocks, transactions, and journaling.
fsync applies the per-core operation logs from MemFS to DiskFS.
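The seqlocks mentioned above can be illustrated with the following minimal sketch (an assumption for illustration, simplified with respect to the C++ memory model; not ScaleFS's actual implementation): readers never write shared state and simply retry if a writer was active, so concurrent read-only operations cause no cache-line conflicts.

```cpp
// Minimal seqlock sketch (illustrative only): writers bump a sequence
// counter before and after updating the data; readers retry until they see
// an even, unchanged counter. Readers never write shared memory, so
// lock-free reads stay conflict-free. A production seqlock needs more care
// with the C++ memory model than this simplified version.
#include <atomic>
#include <cstdint>

template <typename T>
class SeqLock {
public:
    void write(const T& value) {
        seq_.fetch_add(1, std::memory_order_acquire);   // odd: write in progress
        data_ = value;
        seq_.fetch_add(1, std::memory_order_release);   // even: write complete
    }

    T read() const {
        T copy;
        uint64_t start;
        do {
            start = seq_.load(std::memory_order_acquire);
            copy = data_;
        } while ((start & 1) != 0 ||                     // a writer was active
                 seq_.load(std::memory_order_acquire) != start);
        return copy;
    }

private:
    std::atomic<uint64_t> seq_{0};
    T data_{};
};
```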
Slide 14: Design challenges
How to order operations in the per-core operation logs?
How to operate MemFS and DiskFS independently; for example, how to allocate inodes in a scalable manner in MemFS?
. . .
Slide 15: Problem: Preserve ordering of non-commutative ops
[Diagram: dirA's MemFS hash table holds file1 → 100; core 1 runs unlink(file1). The per-core operation logs sit between MemFS and DiskFS.]
Slide 16: Problem: Preserve ordering of non-commutative ops
[Diagram: core 1's unlink(file1) appends op1: UNLINK to its per-core operation log.]
Slide 17: Problem: Preserve ordering of non-commutative ops
[Diagram: after core 1's unlink(file1), core 2 runs creat(file1).]
Slide 18: Problem: Preserve ordering of non-commutative ops
[Diagram: core 2's creat(file1) adds file1 → 200 to dirA's hash table and appends op2: CREATE to core 2's operation log; core 1's log holds op1: UNLINK.]
Slide 19: Problem: Preserve ordering of non-commutative ops
[Diagram: core 3 calls fsync; the logs hold op1: UNLINK (core 1) and op2: CREATE (core 2). In what order should fsync apply them?]
Slide 20: Solution: Use synchronized Time Stamp Counters
[RDTSCP does not incur cache-line conflicts.]
Slide 21: Solution: Use synchronized Time Stamp Counters
[Diagram: core 1's log holds op1: UNLINK with timestamp ts1; core 2's log holds op2: CREATE with timestamp ts2. When core 3 calls fsync, it orders the operations by timestamp: ts1 < ts2.]
[RDTSCP does not incur cache-line conflicts.]
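A minimal sketch of this mechanism is shown below (an illustration of the technique, not ScaleFS's actual code): each core appends operations to its own log and tags them with an RDTSCP timestamp, and fsync merges the per-core logs and applies the entries in timestamp order. The `diskfs_apply` step is left as a placeholder.

```cpp
// Illustrative sketch only: each core appends directory operations to its
// own log, tagging them with the synchronized timestamp counter read by
// RDTSCP; fsync later merges all per-core logs and replays the entries in
// timestamp order.
#include <x86intrin.h>   // __rdtscp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

enum class OpType { Create, Unlink };

struct LoggedOp {
    uint64_t ts;        // RDTSCP timestamp, used to order non-commutative ops
    OpType type;
    std::string name;   // link name, e.g. "file1"
};

constexpr int kMaxCores = 80;
static std::vector<LoggedOp> per_core_log[kMaxCores];  // one log per core, no sharing

// Called by MemFS after it updates its in-memory structures.
void log_op(int core, OpType type, const std::string& name) {
    unsigned int aux;                     // receives the CPU/node id from RDTSCP
    uint64_t ts = __rdtscp(&aux);         // reading the TSC causes no cache-line conflicts
    per_core_log[core].push_back({ts, type, name});
}

// Called by fsync: gather every core's log, sort by timestamp, and apply
// the operations to DiskFS in that order.
void apply_logs_to_diskfs() {
    std::vector<LoggedOp> merged;
    for (auto& log : per_core_log) {
        merged.insert(merged.end(), log.begin(), log.end());
        log.clear();
    }
    std::sort(merged.begin(), merged.end(),
              [](const LoggedOp& a, const LoggedOp& b) { return a.ts < b.ts; });
    for (const auto& op : merged) {
        // e.g. op1: UNLINK(file1) at ts1 is applied before op2: CREATE(file1) at ts2
        (void)op;   // placeholder for applying the op to DiskFS
    }
}
```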
Slide 22: Problem: How to allocate inodes scalably in MemFS?
[Diagram: core 1 runs creat(dirA/file1); MemFS must pick a number for file1 ("???"), but the inode allocator lives in DiskFS.]
Slide 23: Solution (1): Separate mnodes in MemFS from inodes in DiskFS
[Diagram: MemFS directories map link names to mnode numbers; each core allocates mnodes from its own per-core mnode allocator, while DiskFS keeps its inode allocator.]
Slide 24: Solution (1): Separate mnodes in MemFS from inodes in DiskFS
[Diagram: creat(dirA/file1) on core 1 allocates mnode 100 from its per-core mnode allocator and inserts file1 → 100 into dirA's hash table.]
Slide 25: Solution (2): Defer allocating inodes in DiskFS until an fsync
[Diagram: at fsync time, DiskFS allocates an inode for the mnode and records the mapping in an mnode-to-inode table, e.g. mnode 100 → inode 456.]
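The two solutions can be sketched together as follows (an illustrative sketch under assumed names such as `alloc_mnode` and `inode_for`, not ScaleFS's actual code): mnode numbers are handed out per core with no shared state, and the mnode-to-inode mapping is filled in lazily when fsync first needs an on-disk inode.

```cpp
// Illustrative sketch only: per-core mnode allocation (so creat never
// contends on a shared allocator) plus deferred inode allocation at fsync,
// recorded in an mnode -> inode table.
#include <cstdint>
#include <unordered_map>

constexpr uint64_t kMaxCores = 80;

// Per-core mnode allocator: encode the core id in the number so that
// counters on different cores never collide and never share a cache line.
struct MnodeAllocator {
    uint64_t next = 0;
    uint64_t alloc(uint64_t core) { return (next++) * kMaxCores + core; }
};
static MnodeAllocator per_core_alloc[kMaxCores];

// Called by creat() in MemFS: pick a fresh mnode number without touching
// any shared state.
uint64_t alloc_mnode(uint64_t core) {
    return per_core_alloc[core].alloc(core);
}

// Stand-in for DiskFS's real inode allocator.
static uint64_t next_disk_inode = 1;
uint64_t alloc_inode_on_disk() { return next_disk_inode++; }

// mnode -> inode table, consulted and extended only during fsync.
static std::unordered_map<uint64_t, uint64_t> mnode_to_inode;

// Called from fsync when applying a logged operation that refers to `mnode`.
uint64_t inode_for(uint64_t mnode) {
    auto it = mnode_to_inode.find(mnode);
    if (it != mnode_to_inode.end())
        return it->second;                    // inode already assigned
    uint64_t ino = alloc_inode_on_disk();     // e.g. mnode 100 -> inode 456
    mnode_to_inode.emplace(mnode, ino);
    return ino;
}
```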
Slide 26: Other design challenges
How to scale concurrent fsyncs?
How to order lock-free reads?
How to resolve dependencies affecting multiple inodes?
How to ensure internal consistency despite crashes?
Slide 27: Implementation
ScaleFS is implemented on the sv6 research operating system.

ScaleFS component               Lines of C++ code
MemFS (based on FS from sv6)    2,458
DiskFS (based on FS from xv6)   2,331
Operation Logs                  4,094

Supported system calls: creat, open, openat, mkdir, mkdirat, mknod, dup, dup2, lseek, read, pread, write, pwrite, chdir, readdir, pipe, pipe2, stat, fstat, link, unlink, rename, fsync, sync, close
Slide 28: Evaluation
Does ScaleFS achieve good scalability?
Measure scalability on 80 cores; observe conflict-freedom for commutative operations.
Does ScaleFS achieve good disk throughput?
What memory overheads are introduced by ScaleFS's split of MemFS and DiskFS?
Slide 29: Evaluation methodology
Machine configuration: 80 cores with Intel E7-8870 2.4 GHz CPUs, 256 GB RAM
Backing store: "RAM" disk
Benchmarks:
mailbench: mail server workload
dbench: file server workload
largefile: creates a file, writes 100 MB, fsyncs and deletes it
smallfile: creates, writes, fsyncs and deletes lots of 1 KB files
Slide 30: ScaleFS scales 35x-60x on a RAM disk
[Single-core performance of ScaleFS is on par with Linux ext4.]
Slide 31: Machine-independent methodology
Use Commuter [Clements SOSP '13] to observe conflict-freedom for commutative ops.
Commuter generates test cases for pairs of commutative ops and reports observed cache conflicts.
Slide 32: Conflict-freedom for commutative ops on Linux ext4: 65%
Slide 33: Conflict-freedom for commutative ops on ScaleFS: 99.2%
Slide 34: Conflict-freedom for commutative ops on ScaleFS: 99.2%
Why not 100% conflict-free?
Trade off scalability for performance
Probabilistic conflicts
Slide 35: Evaluation summary
ScaleFS scales well on an 80-core machine.
Commuter reports 99.2% conflict-freedom on ScaleFS.
This analysis is workload- and machine-independent, suggesting scalability beyond our experimental setup and benchmarks.
Slide 36: Related Work
Scalability studies: FxMark [USENIX '16], Linux Scalability [OSDI '10]
Scaling file systems using sharding: Hare [EuroSys '15], SpanFS [USENIX '15]
ScaleFS uses similar techniques:
Operation logging: OpLog [CSAIL TR '14]
Per-inode / per-core logs: NOVA [FAST '16], iJournaling [USENIX '17], Strata [SOSP '17]
Decoupling in-memory and on-disk representations: Linux dcache, ReconFS [FAST '14]
ScaleFS focus: achieve scalability by avoiding cache-line conflicts
Slide 37: Conclusion
ScaleFS: a novel file system design for multicore scalability
Two separate file systems: MemFS and DiskFS
Per-core operation logs
Ordering using Time Stamp Counters
ScaleFS scales 35x-60x on an 80-core machine
ScaleFS is conflict-free for 99.2% of test cases in Commuter
https://github.com/mit-pdos/scalefs