RL with subsampling
Learning Caching Policies with Subsampling
Haonan Wang, Hao He, Mohammad Alizadeh, Hongzi Mao
MIT Computer Science and Artificial Intelligence Laboratory
Motivation
Why a learning approach for caching policies?
Handcrafting an optimal policy can be tedious
Object size distributions and arrival patterns change over time
Caches may need to optimize for different objectives (e.g., minimizing overwrites for SSD-based caches)
Challenges
Large caches significantly delay the reward feedback
Typical CDN caches are hundreds of GBs and host many millions of objects, so reward feedback spans horizons 100× longer than in typical RL applications (e.g., AlphaGo only deals with MDPs of fewer than 400 steps)
Problem setup
Cache admission setting (LRU for eviction)
State: size of the incoming object, steps since the object's last visit, the object's total visit count, and the remaining cache size
Action: admit or drop the object
Reward: total byte hits since the last action
Step: proceed until the next cache miss
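A minimal sketch of this MDP in Python, assuming the trace is a list of (object_id, size_bytes) requests; the class name CacheAdmissionEnv and its reset/step interface are hypothetical, not from the slides:

```python
import collections

class CacheAdmissionEnv:
    """Sketch of the cache-admission MDP above (not the authors' code).

    Eviction is plain LRU; the learned policy only decides admit/drop,
    and each step proceeds until the next cache miss.
    """

    def __init__(self, trace, capacity_bytes):
        self.trace = trace                      # list of (object_id, size) requests
        self.capacity = capacity_bytes
        self.cache = collections.OrderedDict()  # object_id -> size, in LRU order
        self.used = 0                           # bytes currently cached
        self.last_seen = {}                     # object_id -> index of last visit
        self.visits = collections.Counter()     # object_id -> total visit count
        self.t = 0                              # index into the trace

    def _serve_until_miss(self):
        """Serve hits, accumulating byte-hit reward, until the next miss."""
        reward = 0
        while self.t < len(self.trace):
            obj, size = self.trace[self.t]
            self.visits[obj] += 1
            if obj not in self.cache:
                return reward, False            # miss: the agent must act
            self.cache.move_to_end(obj)         # refresh LRU position
            self.last_seen[obj] = self.t
            reward += size                      # reward = bytes served from cache
            self.t += 1
        return reward, True                     # trace exhausted

    def _state(self):
        obj, size = self.trace[self.t]
        steps_since = self.t - self.last_seen.get(obj, self.t)
        # The remaining cache size is normalized to [0, 1] so the same
        # policy can transfer across cache capacities (as proposed below).
        return (size, steps_since, self.visits[obj],
                (self.capacity - self.used) / self.capacity)

    def reset(self):
        _, done = self._serve_until_miss()      # cache starts empty: first request misses
        return None if done else self._state()

    def step(self, admit):
        obj, size = self.trace[self.t]
        self.last_seen[obj] = self.t
        if admit and size <= self.capacity:     # skip objects that can never fit
            while self.used + size > self.capacity:   # LRU eviction to make room
                _, evicted_size = self.cache.popitem(last=False)
                self.used -= evicted_size
            self.cache[obj] = size
            self.used += size
        self.t += 1
        reward, done = self._serve_until_miss()       # total byte hits since this action
        return (None if done else self._state()), reward, done
```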
Theoretical analysis
Subsample the trace by hashing on object IDs, and reduce the cache size proportionally
The caching statistics remain unchanged: the state transition and the reward function of the reduced caching problem match those of the original
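Hashing on object IDs keeps either all or none of an object's requests, so per-object reuse patterns survive and only the set of distinct objects shrinks. A minimal sketch of the subsampling, where the MD5 hash and the toy trace are illustrative choices, not from the slides:

```python
import hashlib

def subsample_trace(trace, ratio):
    """Keep a request iff its object ID hashes below the sampling threshold.

    Because the decision depends only on the object ID, an object's
    requests are kept or dropped together, preserving its reuse pattern.
    """
    threshold = ratio * 2**32
    return [(obj, size) for obj, size in trace
            if int.from_bytes(hashlib.md5(str(obj).encode()).digest()[:4],
                              "big") < threshold]

# Toy usage; a real CDN trace would come from access logs.
full_trace = [(1, 100), (2, 5000), (1, 100), (3, 800), (2, 5000)]
full_capacity_bytes = 10_000
ratio = 0.5
small_trace = subsample_trace(full_trace, ratio)
small_capacity_bytes = int(full_capacity_bytes * ratio)  # shrink the cache by the same ratio
```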
Experiment
Train RL on the reduced caching problem (small cache and subsampled trace)
Generalize the policy to the larger problem by normalizing the remaining cache size to [0, 1] (see the sketch after this list)
As a reference, directly training RL on a small cache (with both the subsampled and the original trace) reaches state-of-the-art performance
For large caches, only the subsampled RL trains successfully
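A sketch of the resulting train-small, deploy-large workflow, reusing the hypothetical CacheAdmissionEnv and subsample_trace sketches above; policy is a placeholder for the trained admit/drop model:

```python
# Train on the reduced problem (small cache + subsampled trace), then
# evaluate the same policy on the full-size problem. The only
# capacity-dependent state feature is already normalized to [0, 1],
# so the policy transfers without retraining.
small_env = CacheAdmissionEnv(small_trace, small_capacity_bytes)
# ... train `policy` on small_env with any RL algorithm ...
policy = lambda state: True   # placeholder: always admit; swap in the trained policy

large_env = CacheAdmissionEnv(full_trace, full_capacity_bytes)
state, total_byte_hits, done = large_env.reset(), 0, False
while not done:
    state, reward, done = large_env.step(policy(state))
    total_byte_hits += reward
print(f"byte hits on the large problem: {total_byte_hits}")
```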