Error-Runtime Trade-offs in Distributed SGD — Sanghamitra Dutta. Joint work with Jianyu Wang, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, Priya Nagpurkar. Presented in part at AISTATS 2018.
1. Slow and Stale Gradients can win the Race: Error-Runtime Trade-offs in Distributed SGD
Sanghamitra Dutta. Joint work with: Jianyu Wang, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, Priya Nagpurkar. Presented in part at AISTATS 2018.
2. Stochastic Gradient Descent is the backbone of ML. Speeding up SGD convergence is of critical importance!
3. This Work: Speeding Up Distributed SGD via Scheduling + Algorithmic Techniques
4. Batch Gradient Descent
F(w) is the empirical risk function, averaged over the labeled samples (the n-th term is the loss on the n-th labeled sample). Too expensive for large datasets.
5. Mini-batch SGD
F(w) is the empirical risk function; the gradient is estimated on a mini-batch of samples per step. Noisier, but computationally tractable. For large training datasets, even single-node SGD can be prohibitively slow…
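The single-node update above can be sketched as a minimal mini-batch SGD loop on a toy least-squares risk (the toy data, step size, and all names here are illustrative, not from the talk):

```python
import numpy as np

def minibatch_sgd(X, y, m=32, eta=0.05, iters=500, seed=0):
    """Mini-batch SGD on the empirical risk F(w) = (1/2N) * ||Xw - y||^2."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        idx = rng.choice(N, size=m, replace=False)    # sample a mini-batch of m points
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / m   # noisy but cheap gradient estimate
        w -= eta * grad                               # descent step
    return w

# Toy problem: recover a planted weight vector from noiseless linear measurements.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true
w_hat = minibatch_sgd(X, y)
```

Each step touches only m samples instead of all N, which is the "noisier, but computationally tractable" trade-off the slide refers to.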
6. Synchronous Distributed SGD (Parameter Server)
Learners 1, …, P each compute a gradient on their own mini-batch of m samples and push it to the parameter server, which broadcasts the updated parameters w_j. Can process a P-times larger mini-batch in each iteration, but is bottlenecked by one or more slow/straggling learners.
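Why a parameter server with P learners processes a P-times larger mini-batch: averaging the P per-learner gradients (each on m samples) equals the gradient on the union of the P mini-batches. A small check on a least-squares loss, with hypothetical names:

```python
import numpy as np

rng = np.random.default_rng(0)
P, m, d = 4, 8, 3
X = rng.normal(size=(P * m, d))
y = rng.normal(size=P * m)
w = rng.normal(size=d)

def grad(Xb, yb, w):
    """Gradient of the least-squares loss on one mini-batch."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Learner l computes a gradient on its own mini-batch of m samples...
learner_grads = [grad(X[l * m:(l + 1) * m], y[l * m:(l + 1) * m], w) for l in range(P)]
ps_grad = np.mean(learner_grads, axis=0)   # ...and the parameter server averages them.
full_grad = grad(X, y, w)                  # gradient on one P*m-sample mini-batch
# ps_grad and full_grad agree, so one sync iteration = one Pm-size mini-batch step.
```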
7. Asynchronous Distributed SGD [Recht 2011, Dean 2012, Cipar 2013 …]
The parameter server updates w (w1, w2, w3, …) as soon as any learner (Learner 1, …, Learner P) pushes a gradient. Don't have to wait for straggling learners, but gradient staleness can increase the error.
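A minimal event-level sketch of the asynchronous update, showing how staleness arises (the quadratic objective, the uniform "who finishes next" model, and all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, P, eta, iters = 3, 4, 0.1, 60

def grad(w):
    return w          # gradient of the toy objective F(w) = ||w||^2 / 2

w = np.ones(d)
read = [(w.copy(), 0) for _ in range(P)]   # (params each learner last read, read index)
staleness = []
for t in range(iters):
    l = int(rng.integers(P))               # some learner finishes next (uniform model)
    w_old, t_old = read[l]
    staleness.append(t - t_old)            # staleness of this gradient push
    w = w - eta * grad(w_old)              # PS applies a gradient computed at a STALE w
    read[l] = (w.copy(), t + 1)            # the learner reads the fresh parameters
```

No one waits for stragglers, but the applied gradients lag the current iterate by `staleness` steps, which is exactly the error source the slide warns about.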
10. Performance Comparison
Synchronous vs. Asynchronous: log loss vs. iterations, and log loss vs. wall-clock time. Need to understand convergence with wall-clock time and not only with the number of iterations!
13. Question: How do the Error-Runtime trade-offs compare?
14. Assumptions for Runtime Analysis
Runtime is a random variable, independent and identically distributed across different learners and mini-batches.
17. Expected Runtimes of SGD Variants: Sync variants [Gupta et al. ICDM 2016] [Chen et al. 2016]
Fully Sync-SGD: the parameter server (PS) waits for all P learners before each update w0 → w1 → w2.
K-sync SGD: the PS updates as soon as the fastest K of the P learners push their gradients.
K-batch-sync SGD: the PS waits for K mini-batch gradients in total, from whichever learners finish first; a fast learner may contribute more than one.
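Under an iid exponential runtime model (an assumption, matching the deck's runtime analysis), the per-iteration time of K-sync SGD is the K-th order statistic of P exponential compute times, with mean sum_{j=0}^{K-1} 1/(P-j); fully sync (K = P) pays the harmonic number H_P. A quick Monte Carlo check:

```python
import numpy as np

def expected_ksync_time(K, P):
    """E[K-th order statistic of P iid Exp(1) variables] = sum_{j=0}^{K-1} 1/(P-j)."""
    return sum(1.0 / (P - j) for j in range(K))

rng = np.random.default_rng(0)
P, K, trials = 8, 4, 200_000
samples = np.sort(rng.exponential(1.0, size=(trials, P)), axis=1)
sim = samples[:, K - 1].mean()     # simulated mean of the K-th fastest finish time
ana = expected_ksync_time(K, P)    # analytic value: 1/8 + 1/7 + 1/6 + 1/5
```

Waiting for only the K fastest learners avoids the harmonic (log P-like) straggler penalty of waiting for all P.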
35. Expected Runtimes of SGD Variants: K-sync SGD vs. K-batch-sync SGD
Key idea of the proof: every gradient push in K-batch-sync SGD is the minimum of P exponentials.
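The min-of-P-exponentials idea can be checked numerically (the exponential runtimes and unit rate are modeling assumptions): by memorylessness, every inter-push gap in K-batch-sync is Exp(P·λ), so an iteration of K pushes takes K/(P·λ) in expectation.

```python
import numpy as np

P, K, lam, trials = 8, 4, 1.0, 100_000
rng = np.random.default_rng(0)

def kbatch_sync_iteration(rng):
    """Time until K gradient pushes arrive; a finishing learner restarts at once."""
    finish = rng.exponential(1.0 / lam, size=P)   # absolute finish times
    t = 0.0
    for _ in range(K):
        i = int(np.argmin(finish))
        t = finish[i]                             # time of this push
        finish[i] += rng.exponential(1.0 / lam)   # that learner starts a new mini-batch
    return t

sim = float(np.mean([kbatch_sync_iteration(rng) for _ in range(trials)]))
ana = K / (P * lam)    # each of the K gaps is a min of P exponentials: mean 1/(P*lam)
```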
40. Expected Runtimes of SGD Variants: Async variants [Lian et al. NIPS 2015]
Async SGD: the PS updates (w0 → w1 → w2 → w3) as soon as any single learner pushes a gradient.
K-Async SGD: the PS waits for gradient pushes from K of the P learners before updating.
K-batch-async SGD: the PS waits for K mini-batch gradients in total, from any learners.
62. Expected Runtimes of SGD Variants: K-Async SGD vs. K-batch-async SGD
Observation: the K-Async runtime per iteration is the K-th statistic of all learners' remaining times.
Key observation: for each learner, the gradient pushes form a renewal process (inter-arrival times are iid).
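The renewal-process observation is what makes K-batch-async tractable beyond exponential runtimes: each learner pushes at long-run rate 1/E[X] for any compute-time distribution X, so K pushes from P learners take about K·E[X]/P. A sketch with deliberately non-exponential (uniform) runtimes, an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
P, K, n_pushes = 8, 4, 200_000
# Each learner's pushes form a renewal process; uniform [0.5, 1.5] times, mean E[X] = 1.
push_times = np.sort(np.concatenate(
    [np.cumsum(rng.uniform(0.5, 1.5, n_pushes // P)) for _ in range(P)]
))
iteration_ends = push_times[K - 1 :: K]          # every K-th push completes an iteration
avg_iter_time = float(np.diff(iteration_ends).mean())
ana = K * 1.0 / P                                # K * E[X] / P
```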
67. Comparison of Expected Runtime
69. How to get the error-runtime trade-off?
70. Assumptions for Error Analysis
Lipschitz-smooth loss function with parameter L. Strongly convex with parameter c. Unbiased estimate of the gradient. Upper bound on the variance of the stochastic gradient.
71. Fully Sync SGD: Error Analysis
Update rule: equivalent to mini-batch SGD with batch size Pm. For c-strongly convex, L-smooth functions [Bottou, 2016], the error bound splits into a decay-rate term and an error floor.
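The decay-rate/error-floor split can be seen on a toy 1-D quadratic with noisy gradients (the objective, step size, and noise model are illustrative assumptions): the error first decays geometrically, then plateaus at a floor set by the step size and the gradient variance.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, sigma, steps = 0.1, 1.0, 5000
w = 10.0
tail = []
for t in range(steps):
    g = w + sigma * rng.normal()       # unbiased gradient of F(w) = w^2/2, variance sigma^2
    w -= eta * g
    if t >= steps // 2:
        tail.append(w * w)             # squared error once the decay phase is over
floor = float(np.mean(tail))
# For this recursion the stationary E[w^2] is eta*sigma^2 / (2 - eta) ≈ 0.053:
# shrinking eta lowers the floor but slows the geometric decay.
```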
74. K-sync and K-batch-sync SGD: Error Analysis
Error after J iterations for the sync variants.
75. Async SGD: Error Analysis
Update rule: hard to analyze due to stale gradients. Assumptions in previous works: an upper bound on staleness [Lian et al. 2015]; a geometric staleness distribution [Mitiliagkas et al. 2016]; a staleness process independent of w [Mitiliagkas et al. 2016]. We remove these assumptions, and instead consider …
78. Async SGD: Error Analysis
For c-strongly convex, L-smooth functions, where γ is the staleness bound and p0 is the probability of getting a fresh gradient, i.e., that the same learner pushes the next update. The error floor is larger than for Sync-SGD, but Async SGD can be faster than Sync SGD with the same number of iterations if p0/2 > γ. The analysis can be generalized to non-convex objectives.
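The quantity p0 has a clean value under an exponential runtime model (an assumption): by memorylessness, after any push all P learners are equally likely to finish next, so p0 = 1/P. A Monte Carlo sanity check of that claim:

```python
import numpy as np

rng = np.random.default_rng(0)
P, n_pushes = 4, 200_000
finish = rng.exponential(1.0, size=P)     # absolute finish times of the P learners
last = int(np.argmin(finish))             # who pushed most recently
fresh = 0
for _ in range(n_pushes):
    finish[last] += rng.exponential(1.0)  # the pusher starts a new mini-batch
    nxt = int(np.argmin(finish))          # next learner to push
    fresh += (nxt == last)                # fresh gradient: same learner pushes again
    last = nxt
p0_hat = fresh / n_pushes                 # should be close to 1/P
```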
81. K-async and K-batch-async SGD: Error Analysis
Error after J iterations for the async variants.
82. Error-Runtime Trade-offs
83. Spanning the spectrum between Synchronous and Asynchronous SGD
[Plot: error at convergence vs. runtime. Fully Sync and Async SGD sit at the two extremes; Batch Sync, K-sync, K-Async, and K-Batch Async SGD (K = 2, 3, 4) trace out intermediate operating points.]
85. Key Takeaways
Straggling learners and gradient staleness: true SGD convergence is with respect to wall-clock time!
Contributions: a theoretical understanding of error-runtime trade-offs; integrating scheduling and algorithmic techniques for improved performance.
[Plot: error at convergence vs. runtime for Fully Sync, Batch Sync, K-Async, K-Batch Async (K = 2, 3, 4), and Async SGD.]
S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar, "Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD", AISTATS 2018: 803–812.
87. Ongoing & Future Directions
Gradually increasing synchrony: interpolate between Async-SGD and Sync-SGD to trace out the envelope between the async and sync error floors. Stochastic staleness analysis.
S. Dutta, J. Wang, G. Joshi, "Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD", submitted to JMLR (2019).
88. Sync SGD: Choosing the Best K
The error is equivalent to mini-batch SGD with batch size Km. On the MNIST dataset, K = 4 strikes a good balance between convergence speed and error floor.
89. Thank You!