Slide 1
RIPQ: Advanced Photo Caching on Flash for Facebook
Linpeng Tang (Princeton), Qi Huang (Cornell & Facebook), Wyatt Lloyd (USC & Facebook), Sanjeev Kumar (Facebook), Kai Li (Princeton)
Slide 2
2 billion photos shared daily.*
(* Facebook 2014 Q4 Report)
Photo Serving Stack: Storage Backend
Slide 3
Photo Caches
- Edge Cache: close to users; reduces backbone traffic
- Origin Cache: co-located with the backend, on flash; reduces backend IO
Photo Serving Stack: Edge Cache, Origin Cache, Storage Backend
Slide 4
An Analysis of Facebook Photo Caching [Huang et al. SOSP'13]
- Segmented-LRU-3: 10% less backbone traffic
- Greedy-Dual-Size-Frequency-3: 23% fewer backend IOs
Advanced caching algorithms help!
Slide 5
In Practice
- FIFO was still used on flash
- No known way to implement advanced algorithms efficiently
Slide 6
Theory: advanced caching helps (23% fewer backend IOs, 10% less backbone traffic).
Practice: difficult to implement on flash, so FIFO is still used.
RIPQ (Restricted Insertion Priority Queue): efficiently implements advanced caching algorithms on flash.
Slide 7
Outline
1. Why are advanced caching algorithms difficult to implement efficiently on flash?
2. How does RIPQ solve this problem?
   - Why use a priority queue?
   - How to efficiently implement one on flash?
3. Evaluation: 10% less backbone traffic, 23% fewer backend IOs
Slide 8
Outline
1. Why are advanced caching algorithms difficult to implement efficiently on flash?
   - Write patterns of FIFO and LRU
2. How does RIPQ solve this problem?
   - Why use a priority queue?
   - How to efficiently implement one on flash?
3. Evaluation: 10% less backbone traffic, 23% fewer backend IOs
Slides 9-12
FIFO Does Sequential Writes
Cache space of FIFO: a miss writes the new object at the head of the queue; a hit leaves the layout untouched; eviction removes objects from the tail.
No random writes needed for FIFO.
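The FIFO miss/hit/evict behavior above can be sketched as a small in-memory model (illustrative only; the real cache stores object data on flash):

```python
from collections import OrderedDict

class FIFOCache:
    """Toy FIFO cache: misses append at the head, hits leave the
    layout untouched, evictions pop the tail (oldest entry).
    Appending only at the head is what makes the flash write
    pattern purely sequential."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = OrderedDict()  # insertion order = write order

    def get(self, key):
        return self.queue.get(key)  # hit: no write, no reordering

    def put(self, key, value):
        if key in self.queue:
            return
        while len(self.queue) >= self.capacity:
            self.queue.popitem(last=False)  # evict oldest (queue tail)
        self.queue[key] = value             # sequential append at head
```

Note that `get` never moves an entry, so a long-lived FIFO cache never rewrites previously written space out of order.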
Slides 13-14
LRU Needs Random Writes
Cache space of LRU: a hit moves the object to the queue head, but locations on flash ≠ locations in the LRU queue. Reclaimed space becomes non-contiguous on flash, so random writes are needed to reuse it.
Slide 15
Why Care About Random Writes?
- Write-heavy workload: a long-tail access pattern and moderate hit ratio mean each miss triggers a write to the cache
- Small random writes are harmful for flash (e.g., Min et al. FAST'12):
  - High write amplification
  - Low write throughput
  - Short device lifetime
Slide 16
What Write Size Do We Need?
- Large writes give high write throughput at high space utilization: 16-32 MiB in Min et al. FAST'12
- What's the trend since then? Random writes tested on 3 modern devices: 128-512 MiB needed now
- 100 MiB+ writes are needed for efficiency
Slide 17
Outline
1. Why are advanced caching algorithms difficult to implement efficiently on flash?
2. How does RIPQ solve this problem?
3. Evaluation
Slide 18
RIPQ Architecture (Restricted Insertion Priority Queue)
An advanced caching policy (SLRU, GDSF, ...) talks to RIPQ through a priority queue API. RIPQ maintains an approximate priority queue across RAM and flash, producing flash-friendly workloads.
- Efficient caching on flash
- The caching algorithms are approximated as well
Slide 19
RIPQ Architecture (Restricted Insertion Priority Queue)
The approximate priority queue is built from four techniques:
- Restricted insertion
- Section merge/split
- Large writes
- Lazy updates
Slide 20
Priority Queue API
No single best caching policy:
- Segmented LRU [Karedla'94]: reduces both backend IO and backbone traffic; SLRU-3 is the best algorithm for Edge so far
- Greedy-Dual-Size-Frequency [Cherkasova'98]: favors small objects and further reduces backend IO; GDSF-3 is the best algorithm for Origin so far
Slides 21-24
Segmented LRU
Concatenation of K LRU caches.
Cache space of SLRU-3: segments L1, L2, L3, with L3 at the queue head and the tail of L1 at the queue tail.
- Miss: insert the object at the head of L1
- Hit: promote the object to the head of L2
- Hit again: promote it to the head of L3
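The miss/hit/hit-again promotion above can be sketched as a toy in-RAM SLRU (an illustrative model under assumed per-segment capacities, not the paper's implementation):

```python
from collections import OrderedDict

class SLRU:
    """Toy Segmented LRU with K segments. A miss inserts into the
    lowest segment L1; a hit promotes the object one segment toward
    the head (capped at the top segment); an overflowing segment
    demotes its LRU entry downward; eviction happens from L1."""
    def __init__(self, k, seg_capacity):
        self.segments = [OrderedDict() for _ in range(k)]  # L1..LK
        self.cap = seg_capacity

    def _trim(self):
        # Push overflow downward; segment 0 (L1) evicts outright.
        for i in range(len(self.segments) - 1, 0, -1):
            while len(self.segments[i]) > self.cap:
                key, val = self.segments[i].popitem(last=False)
                self.segments[i - 1][key] = val
        while len(self.segments[0]) > self.cap:
            self.segments[0].popitem(last=False)  # evict LRU of L1

    def access(self, key, value=None):
        # Hit: promote one segment toward the head.
        for i, seg in enumerate(self.segments):
            if key in seg:
                val = seg.pop(key)
                dst = min(i + 1, len(self.segments) - 1)
                self.segments[dst][key] = val
                self._trim()
                return val
        # Miss: insert at the head of the lowest segment L1.
        self.segments[0][key] = value
        self._trim()
        return None
```

Repeated hits walk an object up through the segments, which is exactly the behavior the slides animate.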
Slides 25-28
Greedy-Dual-Size-Frequency
Favors small objects.
Cache space of GDSF-3: on a miss, the object is inserted at a queue position determined by its priority, so small objects land closer to the head. The write workload is more random than LRU's, and the operations are similar to those of a priority queue.
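A toy version of the GDSF idea can make the size bias concrete. The standard Greedy-Dual-Size-Frequency priority (Cherkasova'98) is H = clock + frequency * cost / size; the sketch below is an assumed minimal in-RAM model (lazy-deletion heap, unit cost), not the paper's flash implementation:

```python
import heapq

class GDSF:
    """Toy Greedy-Dual-Size-Frequency cache. Priority
    H = clock + freq * cost / size, so small and frequently accessed
    objects rank higher; the clock inflates to the priority of the
    last victim so stale entries age out."""
    def __init__(self, capacity_bytes, cost=1.0):
        self.capacity = capacity_bytes
        self.cost = cost
        self.clock = 0.0
        self.used = 0
        self.entries = {}  # key -> [priority, freq, size]
        self.heap = []     # (priority, key) min-heap; may hold stale entries

    def access(self, key, size):
        if key in self.entries:
            ent = self.entries[key]
            ent[1] += 1  # bump frequency on a hit
            ent[0] = self.clock + ent[1] * self.cost / ent[2]
            heapq.heappush(self.heap, (ent[0], key))
            return True
        # Miss: evict lowest-priority objects until the new one fits.
        while self.used + size > self.capacity and self.heap:
            prio, victim = heapq.heappop(self.heap)
            ent = self.entries.get(victim)
            if ent is None or ent[0] != prio:
                continue  # stale heap entry from an earlier update
            self.clock = prio  # inflate the clock
            self.used -= ent[2]
            del self.entries[victim]
        prio = self.clock + self.cost / size
        self.entries[key] = [prio, 1, size]
        self.used += size
        heapq.heappush(self.heap, (prio, key))
        return False
```

With unit cost, a 10 KiB object starts with 8x the priority of an 80 KiB one, which is why GDSF keeps many small objects and cuts backend IOs.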
Slides 29-32
Relative Priority Queue for Advanced Caching Algorithms
Cache space spans relative priorities from 0.0 (tail) to 1.0 (head).
- Miss: insert(x, p) places object x at relative priority p
- Hit: increase(x, p') raises the priority of x to p'
- Implicit demotion on insert/increase: objects with lower priorities move towards the tail
- Eviction happens at the queue tail
A relative priority queue captures the dynamics of many caching algorithms!
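The insert/increase/evict API above can be modeled with a sorted list (illustrative only; the mapping of SLRU-3 onto fixed priorities 1/3, 2/3, 1.0 below is one plausible encoding, not necessarily the paper's exact one):

```python
class RelPQCache:
    """Minimal in-RAM model of the relative priority queue API.
    Priorities are relative positions in [0, 1]; implicit demotion
    falls out of keeping the list sorted, since newly inserted or
    promoted objects push lower-priority ones toward the tail."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []  # (priority, key), sorted ascending (tail first)

    def insert(self, key, p):
        self.items.append((p, key))
        self.items.sort()
        if len(self.items) > self.capacity:
            return self.items.pop(0)[1]  # evict from the tail
        return None

    def increase(self, key, p):
        self.items = [(pr, k) for pr, k in self.items if k != key]
        self.items.append((p, key))
        self.items.sort()

# SLRU-3 expressed on this API (hypothetical encoding):
# miss -> insert(x, 1/3); first hit -> increase(x, 2/3);
# later hits -> increase(x, 1.0).
```

The point of the API is that many algorithms reduce to choosing the priorities passed to insert/increase, while eviction is always from the tail.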
Slide 33
RIPQ Design: Large Writes
- Buffer object writes (tens of KiB) into block writes; once written, blocks are immutable
- 256 MiB block size at 90% utilization: large caching capacity and high write throughput
Slide 34
RIPQ Design: Restricted Insertion Points
An exact priority queue could insert into any block in the queue, so each block would need a separate RAM buffer: the whole flash space would be buffered in RAM!
Slide 35
Solution: restricted insertion points.
Slide 36
Section is the Unit for Insertion
The queue (head to tail) is divided into sections with priority ranges, e.g. 1..0.6, 0.6..0.35, 0.35..0. Each section has one insertion point: its active block with a RAM buffer. All other blocks are sealed blocks on flash.
Slide 37
Section is the Unit for Insertion
Insert procedure for insert(x, 0.55):
1. Find the corresponding section (here the one spanning 0.6..0.35)
2. Copy the data into that section's active block
3. Update the section priority ranges: 1..0.6, 0.6..0.35, 0.35..0 become 1..0.62, 0.62..0.33, 0.33..0
Slide 38
Section is the Unit for Insertion
After the insertion the ranges are 1..0.62, 0.62..0.33, 0.33..0. Relative order within one section is not guaranteed!
Slide 39
Trade-off in Section Size
Section size controls the approximation error: more (smaller) sections mean lower approximation error but more RAM buffer, since each section keeps one active block buffered in RAM.
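The section mechanics from the last few slides can be sketched as follows (a toy model under assumed sizes; section merge/split, range renumbering, and eviction are omitted for brevity):

```python
class Section:
    """One RIPQ section: a priority range with a single insertion
    point (the active block's RAM buffer). When the buffer fills,
    the block is sealed, mirroring one large sequential flash write."""
    def __init__(self, lo, hi, block_size):
        self.lo, self.hi = lo, hi
        self.block_size = block_size
        self.active = []       # RAM buffer for the active block
        self.active_bytes = 0
        self.sealed = []       # immutable "on-flash" blocks

    def insert(self, key, size):
        self.active.append((key, size))
        self.active_bytes += size
        if self.active_bytes >= self.block_size:
            self.sealed.append(self.active)  # seal: one large write
            self.active, self.active_bytes = [], 0

class RIPQSketch:
    """Restricted insertion: find the section whose priority range
    contains p and append into that section's active block only."""
    def __init__(self, boundaries, block_size):
        # boundaries like [1.0, 0.6, 0.35, 0.0] -> 3 sections
        self.sections = [Section(lo, hi, block_size)
                         for hi, lo in zip(boundaries, boundaries[1:])]

    def insert(self, key, p, size):
        for s in self.sections:
            if s.lo <= p <= s.hi:
                s.insert(key, size)
                return s
        raise ValueError("priority out of range")
```

With only one RAM buffer per section, the RAM cost scales with the number of sections rather than with the number of flash blocks, which is the trade-off slide 39 describes.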
Slide 40
RIPQ Design: Lazy Update
Naive approach to increase(x, 0.9): copy x into the corresponding section's active block.
Problem: data copying/duplication on flash.
Slides 41-42
RIPQ Design: Lazy Update
Solution: use a virtual block to track the updated location! Each section gets a virtual block that holds no data.
Slide 43
Virtual Block Remembers the Update Location
On increase(x, 0.9), only the target section's virtual block is updated to record x's new location. No data is written during a virtual update.
Slides 44-45
Actual Update During Eviction
When x reaches the tail block, its pending virtual update is carried out: x's data is copied to the active block of the section holding its virtual update. There is always exactly one copy of the data on flash.
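The lazy-update flow (virtual update on hit, actual copy at eviction time) can be modeled in a few lines. This is a deliberately simplified sketch: sections are plain lists indexed from the tail, and block structure is ignored:

```python
class LazyUpdateQueue:
    """Toy model of RIPQ's lazy update. increase() only records the
    object's new target section in a virtual-location map (no data is
    written); when the object reaches the tail and would be evicted,
    its data is copied to the target section instead, so flash always
    holds exactly one copy of each object."""
    def __init__(self, num_sections):
        self.sections = [[] for _ in range(num_sections)]  # 0 = tail
        self.virtual = {}  # key -> pending target section index

    def insert(self, key, section):
        self.sections[section].append(key)

    def increase(self, key, section):
        self.virtual[key] = section  # virtual update: no flash write

    def evict_tail(self):
        key = self.sections[0].pop(0)
        target = self.virtual.pop(key, None)
        if target is not None:
            self.sections[target].append(key)  # actual copy happens now
            return None  # object survived via its pending update
        return key       # no pending update: truly evicted
```

Deferring the copy until eviction means a hot object that is hit many times still costs at most one rewrite per trip through the queue.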
Slide 46
RIPQ Design
- Relative priority queue API
- RIPQ design points: large writes, restricted insertion points, lazy update, section merge/split (balances section sizes and RAM buffer usage)
- Static caching: photos are static
Slide 47
Outline
1. Why are advanced caching algorithms difficult to implement efficiently on flash?
2. How does RIPQ solve this problem?
3. Evaluation
Slide 48
Evaluation Questions
- How much RAM buffer is needed?
- How good is RIPQ's approximation?
- What's the throughput of RIPQ?
Slide 49
Evaluation Approach
- Real-world Facebook workloads: Origin and Edge
- 670 GiB flash card, 256 MiB block size, 90% utilization
- Baselines: FIFO and SIPQ (Single Insertion Priority Queue)
RIPQ Needs Small Number of Insertion PointsInsertion
points50Object-wise hit-ratio (%)
+6%
+16%Slide51
RIPQ Needs Small Number of Insertion Points51
Object-wise hit-ratio (%) Insertion pointsSlide52
RIPQ Needs Small Number of Insertion Points52
Object-wise hit-ratio (%) You don’t need much RAM buffer (2GiB)!Insertion pointsSlide53
Slides 53-56
RIPQ Has High Fidelity
[Plot: object-wise hit ratio (%) per algorithm.] RIPQ achieves ≤0.5% hit-ratio difference from the exact algorithms, for all algorithms. The +16% hit-ratio gain translates into 23% fewer backend IOs.
Slide 57
RIPQ Has High Throughput
[Plot: throughput in req./sec.] RIPQ's throughput is comparable to FIFO's (≤10% difference).
Slide 58
Related Work
- RAM-based advanced caching: SLRU (Karedla'94), GDSF (Young'94, Cao'97, Cherkasova'01), SIZE (Abrams'96), LFU (Maffeis'93), LIRS (Jiang'02), ... RIPQ enables their use on flash.
- Flash-based caching solutions: Facebook FlashCache, Janus (Albrecht'13), Nitro (Li'13), OP-FCL (Oh'12), FlashTier (Saxena'12), Hec (Yang'13), ... RIPQ supports advanced algorithms.
- Flash performance: Stoica'09, Chen'09, Bouganim'09, Min'12, ... The trend continues for modern flash cards.
Slide 59
RIPQ: the first framework for advanced caching on flash
- Relative priority queue interface
- Large writes, restricted insertion points, lazy update, section merge/split
- Enables SLRU-3 & GDSF-3 for Facebook photos: 10% less backbone traffic, 23% fewer backend IOs