Slide1
Split-Level I/O Scheduling
Suli Yang, Tyler Harter, Nishant Agrawal, Samer Al-Kiswany, Salini Selvaraj Kowsalya, Anand Krishnamurthy, Rini T. Kaushik, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Slide2
…yet another I/O scheduling paper?
CFQ (2003), BFQ (2010), Deadline (2002), mClock (2011), Token-Bucket (2008), Libra (2014), pClock (2007), Fahrrad (2008), YFQ (1999), Facade (2003)
Slide3
Some mistakes we have been making for decades…
(in trying to build better schedulers)
Slide4
Problem: Current Frameworks Are Fundamentally Limited
- CFQ, Deadline, Token-Bucket
- Important policies cannot be realized: fairness, latency guarantees, isolation
- Wasted effort trying to build new schedulers without fixing the framework
Slide5
Can we design a simple and effective framework that lets us build schedulers to correctly realize important I/O policies?
Slide6
Solution: Split-Level Framework
- Control: allow scheduling at multiple levels (block level, system-call level, page-cache level)
- Information: tag requests to identify their origin
- Simplicity: small set of hooks at key junctions within the storage stack
Slide7
Results
- Three distinct policies implemented: priority, deadline, isolation
- Large performance improvements: fairness 12x, tail latency 4x, isolation 6x
- Good foundation for applications: reduced transaction latency for databases, improved isolation for virtual machines, effective rate limiting for HDFS
Slide8
Overview
- How I/O scheduling frameworks work
- Split-Level Scheduling Framework: Design
- Split-Level Scheduler Case Study
- Conclusion
Slide9
Framework vs. Scheduler
- Framework: a running environment (mechanism)
- Schedulers: implement different policies
- How it works: the framework provides callbacks to the schedulers
Slide10
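The framework/scheduler split described above can be sketched in Python (all class and method names are hypothetical illustrations; the real framework lives inside the kernel): the framework owns the mechanism and invokes whatever callbacks the scheduler supplies.

```python
# Hypothetical sketch: the framework defines the callback points;
# a scheduler is just a set of callbacks implementing one policy.

class Scheduler:
    """Callbacks a framework invokes on a scheduler."""
    def add_req(self, req): ...        # a new request enters the framework
    def dispatch_req(self): ...        # framework asks: which request next?
    def complete_req(self, req): ...   # the device finished a request

class Framework:
    def __init__(self, scheduler):
        self.scheduler = scheduler

    def submit(self, req):
        self.scheduler.add_req(req)

    def run_once(self):
        req = self.scheduler.dispatch_req()
        if req is not None:
            # ...hand the request to the device, wait for completion...
            self.scheduler.complete_req(req)
        return req

class FIFO(Scheduler):
    """Trivial policy: serve requests in arrival order."""
    def __init__(self):
        self.queue = []
    def add_req(self, req):
        self.queue.append(req)
    def dispatch_req(self):
        return self.queue.pop(0) if self.queue else None
    def complete_req(self, req):
        pass

fw = Framework(FIFO())
fw.submit("r1")
fw.submit("r2")
assert fw.run_once() == "r1"   # FIFO policy dispatches the oldest request
```

Swapping `FIFO` for another `Scheduler` subclass changes the policy without touching the framework, which is exactly the separation the slide describes.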
Traditional Approach: Block-Level I/O Scheduling
[Diagram: apps issue I/O through the page cache and file system into block-level queues; the block-level scheduler controls the device through the add_req, dispatch_req, and req_complete hooks]
Slide11
Block-Level I/O Scheduling
Simplified Completely Fair Queueing (CFQ) implementation:

add_req(r) {
  p = r.submit_process
  q = get_queue(p)
  enqueue(q, r)
}

dispatch_req() {
  q = get_high_prio_queue()
  r = dequeue(q)
  dispatch(r)
}

complete_req(r) {
  // clean up
}
Slide12
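The CFQ pseudocode above can be made into a small runnable sketch (Python for illustration; the field and function names follow the slide): one queue per submitting process, dispatch from the highest-priority non-empty queue.

```python
from collections import defaultdict, deque

# Runnable sketch of the slide's simplified CFQ: per-process queues,
# highest-priority non-empty queue served first (names follow the slide).

class SimpleCFQ:
    def __init__(self, prio_of):
        self.prio_of = prio_of              # process -> priority (lower = better)
        self.queues = defaultdict(deque)    # process -> its request queue

    def add_req(self, req):
        p = req["submit_process"]
        self.queues[p].append(req)          # enqueue(get_queue(p), r)

    def dispatch_req(self):
        live = [p for p, q in self.queues.items() if q]
        if not live:
            return None
        p = min(live, key=self.prio_of)     # get_high_prio_queue()
        return self.queues[p].popleft()     # dequeue(q); dispatch(r)

    def complete_req(self, req):
        pass                                # clean up

s = SimpleCFQ(prio_of=lambda p: {"A": 0, "B": 1}[p])
s.add_req({"submit_process": "B", "block": 10})
s.add_req({"submit_process": "A", "block": 99})
assert s.dispatch_req()["submit_process"] == "A"   # higher-priority queue wins
```

Note that attribution is by `submit_process`, which is precisely what breaks once a write-back daemon submits requests on behalf of applications, as later slides show.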
Overview
- What is an I/O scheduling framework
- Split-Level Scheduling Framework: Design
  - The reordering problem
  - The cause-mapping problem
  - The cost-estimation problem
- Split-Level Scheduler Case Study
- Conclusion
Slide13
Reordering
Scheduling is just reordering I/O requests.
Slide14
Data Entanglement
The file system tangles data from different applications (App1, App2) into one bundle, such as a journal transaction or a shared metadata block, making it impossible for a block-level scheduler to reorder.
Slide15
Write Dependencies
File systems carefully order writes (e.g., transaction tx1 before tx2); schedulers cannot reorder them unless the file system allows it.
Slide16
Fundamental Limitation #1 (of block-level scheduling)
- The file system imposes ordering requirements contrary to the scheduling goals
- The scheduler cannot reorder: it is too late once the data is in the file system
- Need admission control
Slide17
Split-Level I/O Scheduling: Multi-Layer Hooks
The split-level scheduler hooks in at the system-call level (write(), fsync()), above the file system, to avoid data entanglement and ordering, as well as at the page-cache level and the block level (add_req, dispatch_req, req_complete).
Slide18
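A minimal sketch of the multi-layer idea, assuming a hypothetical hook interface (the real hooks live in the Linux kernel): one scheduler object observes all layers, so it can act above the file system, before entanglement happens.

```python
# Hypothetical sketch: one scheduler sees hooks at every layer of the
# storage stack, so it can apply admission control above the file system.

class SplitScheduler:
    def __init__(self):
        self.events = []

    # system-call level: can delay or admit before data enters the FS
    def on_write(self, proc, nbytes):
        self.events.append(("write", proc))
        return True                         # admit (could block instead)

    def on_fsync(self, proc):
        self.events.append(("fsync", proc))
        return True

    # page-cache level: early notification of dirtied memory
    def on_dirty(self, proc, page):
        self.events.append(("dirty", proc))

    # block level: the classic hook
    def add_req(self, req):
        self.events.append(("add_req", req["cause"]))

sched = SplitScheduler()
sched.on_write("app1", 4096)   # decision point above the file system
sched.on_dirty("app1", page=7)
sched.add_req({"cause": "app1"})
assert [e[0] for e in sched.events] == ["write", "dirty", "add_req"]
```

The key design point the slide makes is visible here: by the time `add_req` fires it is too late to reorder entangled data, but `on_write`/`on_fsync` run before the file system ever sees it.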
Cause Mapping
A scheduler needs to map an I/O request to the originating application.
Slide19
Write Delegation
App1 and App2 write() into the page cache, but the write-back daemon submits all block-level requests: cause information is lost! The same happens with journaling, delayed allocation, and so on.
Slide20
Fundamental Limitation #2 (of block-level scheduling)
Cause-mapping information is lost within the framework: it is impossible to map an I/O request back to its originating application, no matter how the scheduler is implemented.
Slide21
Split-Level I/O Scheduling: Tags
Requests are tagged to identify their origin, and the tags pass across layers: even requests submitted by the write-back daemon still carry the tag of the application (1 or 2) that caused them.
Slide22
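How tags survive write delegation can be sketched as follows (names hypothetical). Per the backup slides, a tag identifies a *set* of processes, since several applications may be responsible for one request (e.g., a shared page).

```python
# Sketch of cross-layer tagging: the cause is recorded when a page is
# dirtied, so when the write-back daemon later submits the block request,
# the scheduler still recovers the originating application(s).

page_tags = {}   # page -> set of causes (a tag names a set of processes)

def app_write(app, page):
    # tag at dirty time, in application context
    page_tags.setdefault(page, set()).add(app)

def writeback_submit(page, submitter="writeback-daemon"):
    # without tags the cause would appear to be the daemon;
    # with tags the original applications are recovered
    return {"submitter": submitter, "causes": page_tags.get(page, {submitter})}

app_write("app1", page=3)
app_write("app2", page=3)             # shared page: two responsible apps
req = writeback_submit(page=3)
assert req["submitter"] == "writeback-daemon"
assert req["causes"] == {"app1", "app2"}
```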
Cost Estimation
- A scheduler needs to estimate the cost of I/O
- Memory-level notification for a timely estimate
- Block-level notification for an accurate estimate
- Details in the paper
Slide23
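One way the two notifications could combine, sketched under assumed semantics (the paper has the actual details): charge a rough cost as soon as memory is dirtied, then correct it once the real block-level request is known.

```python
# Hypothetical sketch: a timely (memory-level) estimate, later corrected
# by the accurate (block-level) cost. Names and semantics are illustrative.

class CostEstimator:
    def __init__(self):
        self.charged = {}                  # app -> bytes charged so far

    def on_dirty(self, app, nbytes):
        # memory level: timely but rough; assume cost ~ bytes dirtied
        self.charged[app] = self.charged.get(app, 0) + nbytes

    def on_block_request(self, app, actual_bytes, estimated_bytes):
        # block level: accurate; replace the earlier estimate
        self.charged[app] += actual_bytes - estimated_bytes

est = CostEstimator()
est.on_dirty("app1", 4096)                           # timely estimate
est.on_block_request("app1", actual_bytes=8192,      # accurate correction
                     estimated_bytes=4096)
assert est.charged["app1"] == 8192
```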
Split-Level I/O Scheduling Framework: Summary
Three key pieces:
- Multi-layer hooks to prevent adverse file-system interaction
- Tags to track causes across layers
- Early memory-level notification of write work
Easy implementation: ~300 LOC in Linux, little added complexity for building schedulers
Slide24
Overview
- How I/O scheduling frameworks work
- Split-Level Scheduling Framework: Design
- Split-Level Scheduler Case Study
- Conclusion
Slide25
Challenge #1: Priority Scheduler
Fairly allocate I/O resources based on the processes' priorities.
Slide26
Block-Level: CFQ
Workload: eight processes with different priorities (0-7), each sequentially writing its own file.

add_req(r) {
  p = r.submit_process
  q = get_queue(p)
  enqueue(q, r)
}
Slide27
Block-Level: CFQ
All delegated requests arrive from the write-back thread, so CFQ attributes them to that thread rather than to the originating processes.

add_req(r) {
  p = r.submit_process
  q = get_queue(p)
  enqueue(q, r)
}
Slide28
Split-Level: AFQ
CFQ deviates from the goal by 82%, AFQ by only 7%: a 12x improvement.

add_req(r) {
  p = r.tagged_cause
  q = get_queue(p)
  enqueue(q, r)
}
Slide29
Challenge #2: Deadline Scheduler
Provide guaranteed latency for I/O requests.
Slide30
Block-Deadline
Block-Deadline cannot serve the low-latency requests until the previous file-system transaction (tx1) has completed.
Slide31
Block-Deadline
Workload: flush 4KB of data to disk, with or without background writes.
Expected result: the operation finishes within the deadline (100ms).
Slide32
Split-Deadline
Split-Deadline suspends write() and fsync() to prevent many high-latency requests from accumulating in one transaction: writes are blocked before high-latency data can enter the file system.
Slide33
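The admission-control idea can be illustrated with a toy model (the deadline and flush-rate numbers, and the whole accounting scheme, are hypothetical simplifications, not the paper's actual algorithm): admit a write() only if the next transaction can still be flushed within the deadline.

```python
# Toy sketch of deadline-driven admission control: block write() when
# admitting more dirty data would make the next transaction miss its
# deadline. All parameters and the cost model are illustrative.

class SplitDeadline:
    def __init__(self, deadline_ms, flush_rate_bytes_per_ms):
        self.deadline_ms = deadline_ms
        self.rate = flush_rate_bytes_per_ms
        self.pending_dirty = 0        # bytes bound for the next transaction

    def flush_time_ms(self, extra=0):
        return (self.pending_dirty + extra) / self.rate

    def try_write(self, nbytes):
        # admit only if the transaction still fits within the deadline;
        # otherwise the caller would be suspended instead of admitted
        if self.flush_time_ms(nbytes) > self.deadline_ms:
            return False
        self.pending_dirty += nbytes
        return True

sd = SplitDeadline(deadline_ms=100, flush_rate_bytes_per_ms=1000)
assert sd.try_write(50_000)        # 50 ms of flushing: admitted
assert not sd.try_write(80_000)    # would need 130 ms: writer is blocked
```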
Split-Level: Split-Deadline
Split-Deadline maintains the deadline regardless of background writes.
Slide34
The Fsync-Freeze Problem
"During checkpointing, the system begins writing out the data that needs to be fsync()'d so aggressively that the service time for I/O requests from other processes goes through the roof." --Robert Haas (PostgreSQL)
Slide35
The Fsync-Freeze Problem
Workload: SQLite transactions with different checkpoint intervals.
Expected result: consistent transaction latency.
Split-Deadline solves the fsync-freeze problem, with a 4x tail-latency reduction.
Slide36
Other Evaluation Results
- Low overhead: <1% runtime overhead, <50 MB memory overhead
- Other schedulers: token-bucket for performance isolation
- Other applications: PostgreSQL (latency guarantees for TPC-B workloads), QEMU (isolation across VMs), HDFS (effective I/O rate limiting)
Slide37
Overview
- What is an I/O scheduling framework and how does it work
- Split-Level Scheduling Framework: Design
- Split-Level Scheduler Case Study
- Conclusion
Slide38
Conclusion
- For decades, people have been trying to build better block-level schedulers, an effort bound to fail without appropriate framework support
- The split-level framework enables correct scheduler implementations: cross-layer tags, multi-level hooks, memory-level notification
Source code and more information: http://research.cs.wisc.edu/adsl/Software/split/
Slide39
Backup slides
Slide40
Slide41
File System Write Dependencies
Modern file systems maintain data consistency by carefully ordering writes (e.g., transaction tx1 before tx2). Schedulers cannot reorder them unless the file system allows it.
Slide42
Split-Level I/O Scheduling: Multi-Layer Hooks
System-call scheduling (read(), write(), fsync()) above the file system avoids data entanglement; block-level scheduling (add_req, dispatch_req, req_complete) below the file system maximizes performance.
[Diagram: apps → page cache → file system (write-back) → block-level queues → disk/SSD, with the scheduler attached at each layer]
Slide43
Split-Level I/O Scheduling: Tags
Write-heavy HDFS workload on a machine with 8GB of RAM.
Slide44
Slide45
Split-Level Framework Overhead
I/O performance with the noop scheduler:
Slide46
Split-Level I/O Scheduling: Tags
Write-heavy HDFS workload on a machine with 8GB of RAM. Worst-case memory overhead of tags: 50MB.
Slide47
Block-Level: Windows
Slide48
Performance Isolation
Sequential readers: A unthrottled, B throttled to 10MB/s.
Slide49
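A token bucket of the kind used for throttling above can be sketched as follows (rate and burst parameters are hypothetical, not taken from the evaluation): requests spend tokens, tokens refill at the configured rate, and a request that cannot pay is delayed.

```python
# Sketch of a token-bucket rate limiter (parameters hypothetical):
# tokens accumulate at `rate` up to `capacity`; each request spends
# tokens equal to its size, and is delayed when the bucket is empty.

class TokenBucket:
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = 0.0

    def allow(self, now, nbytes):
        # refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False                  # request is delayed, not dropped

tb = TokenBucket(rate_bytes_per_s=10 * 1024 * 1024, burst_bytes=1024 * 1024)
assert tb.allow(0.0, 512 * 1024)          # within the initial burst
assert not tb.allow(0.0, 1024 * 1024)     # bucket too empty: delayed
assert tb.allow(0.1, 1024 * 1024)         # refilled after 0.1 s
```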
Real Applications
Slide50
Write Delegation
App1 and App2 write() into the page cache, but write-back submits the block-level requests: loss of cause information! The process that submits a block-level request may not be the process that issued the I/O (write-back, journaling, delayed allocation, …).
Slide51
Split-Level I/O Scheduling: Tags
Tags track I/O requests across layers and identify the originating application; a tag identifies the set of processes responsible for an I/O request. Even requests submitted by write-back carry the tags (1 or 2) of the applications that caused them.
Slide52
Myth #1 in I/O Scheduling: I don't have to care about I/O scheduling. It is someone else's problem…
Slide53
Why Is I/O Scheduling Relevant (to You)?
- Bottleneck of many systems, from phones to servers ["…our servers appear to freeze for tens of seconds during disk writes…"]
- Foundation of performance isolation ["…the interference as a result of competing I/Os remains problematic in a virtualized environment…"]
- Pain point for databases, hypervisors, key-value stores, and more ["…one customer reported that just changing cfq to noop solved their InnoDB IO problems…"]
Slide54
Myth #1 in I/O Scheduling: I don't have to care about I/O scheduling. It is someone else's problem…
Fact #1: If you care about performance, you should care about I/O scheduling.
Slide55
Myth #2 in I/O Scheduling: Can't the disk (or SSD) handle all I/O scheduling? (Do I still need I/O scheduling in the era of SSDs?)
Slide56
Why Should the OS Do I/O Scheduling?
- The device is powerless when handed the "wrong" requests from the OS: the file system may withhold requests
- Devices rely on OS-provided information but lack the mechanisms to obtain it
- Other common reasons: more contextual information, OS-level isolation units, multi-device I/O scheduling
Slide57
Why Should the OS Do I/O Scheduling?
- The device is powerless when handed the "wrong" requests from the OS
- Isolation can only be done at the OS level, as only the OS knows about the isolation unit (processes, containers, or virtual machines)
- The OS has more contextual information to assist I/O scheduling (e.g., file-based prefetching)
- Multi-device I/O scheduling can only be done at the OS level
Slide58
Myth #2 in I/O Scheduling: Shouldn't the disk (or SSD) handle all the I/O scheduling?
Fact #2: The OS has to issue the right request at the right time.
Slide59
Why Is I/O Scheduling Still an Open Problem?
- Current I/O scheduling frameworks are fundamentally limited (as is any scheduler built on them)
- Important policies (isolation, fairness, meeting deadlines, …) cannot be realized in current frameworks
- Applications suffer as a result (databases, hypervisors, and more) ["…one customer reported that just changing cfq to noop solved their InnoDB IO problems…"]
Slide60
Myth #3 in I/O Scheduling: Isn't it a solved problem? After all, we have many different I/O schedulers.
Slide61
Myth #3 in I/O Scheduling: Isn't it a solved problem? After all, we have many different I/O schedulers.
Fact #3: Fundamental limitations in the framework mean important policies cannot be realized.
Slide62
What Is I/O Scheduling?
- Applications submit I/O requests to storage devices
- I/O scheduling decides which requests, and when, to send to the device
- Different schedulers have different scheduling goals and strategies
Slide63
Block-Level I/O Scheduling
Simplified Completely Fair Queueing (CFQ) implementation:

add_req(r) {
  p = r.submit_process
  q = get_queue(p)
  enqueue(q, r)
}

dispatch_req() {
  q = get_high_prio_queue()
  r = dequeue(q)
  dispatch(r)
}

complete_req(r) {
  // clean up
}