
Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD
Sanghamitra Dutta. Joint work with Jianyu Wang, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, and Priya Nagpurkar. Presented in part at AISTATS 2018.



Presentation Transcript

1. Slow and Stale Gradients can win the Race: Error-Runtime Trade-offs in Distributed SGD. Sanghamitra Dutta. Joint work with: Jianyu Wang, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, Priya Nagpurkar. Presented in part at AISTATS 2018.

2. Stochastic Gradient Descent is the backbone of ML. Speeding up SGD convergence is of critical importance!

3. This Work: Speeding Up Distributed SGD via Scheduling + Algorithmic Techniques

4. Batch Gradient Descent. $F(w)$ is the empirical risk function, $F(w) = \frac{1}{N}\sum_{n=1}^{N} f(w; \xi_n)$, where $\xi_n$ is the n-th labeled sample. Each update $w_{j+1} = w_j - \eta \nabla F(w_j)$ touches all N samples: too expensive for large datasets.

5. Mini-batch SGD. $F(w)$ is the empirical risk function and $\xi_n$ is the n-th labeled sample; each update uses the average gradient over a random mini-batch of m samples. Noisier, but computationally tractable. Still, for large training datasets single-node SGD can be prohibitively slow...
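A minimal sketch of this update on a synthetic least-squares problem (not from the slides; the quadratic loss, step size, and batch size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: N labeled samples (x_n, y_n).
N, d = 10_000, 20
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)

def minibatch_grad(w, idx):
    """Stochastic gradient of F(w) = (1/N) sum_n (x_n^T w - y_n)^2 / 2,
    estimated on the mini-batch indexed by idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

m, eta = 32, 0.05                       # mini-batch size and step size (assumed)
w = np.zeros(d)
for j in range(2_000):
    idx = rng.choice(N, size=m, replace=False)
    w -= eta * minibatch_grad(w, idx)   # w_{j+1} = w_j - eta * g(w_j; mini-batch)

print("final training loss:", 0.5 * np.mean((X @ w - y) ** 2))
```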

6. Synchronous Distributed SGD. A parameter server coordinates Learners 1 through P; each learner holds the current model w_j and computes a gradient on its own mini-batch of m samples. Can process a P-times larger mini-batch in each iteration, but is bottlenecked by one or more slow/straggling learners.
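One way to picture the bottleneck: the server waits for all P gradients before updating, so each iteration takes as long as the slowest learner. A minimal simulation sketch (the learner count, Exp(μ) runtimes, and toy quadratic gradient are assumptions for illustration, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)
P, mu, eta = 8, 1.0, 0.1   # learners, runtime rate, step size (all assumed)

def toy_grad(w):
    """Noisy gradient of the toy objective F(w) = ||w||^2 / 2."""
    return w + 0.1 * rng.normal(size=w.shape)

def sync_iteration(w):
    # All P learners compute gradients at the SAME model w_j ...
    grads = [toy_grad(w) for _ in range(P)]
    # ... but the server must wait for the slowest (straggler) learner.
    wall_clock = rng.exponential(1 / mu, size=P).max()
    return w - eta * np.mean(grads, axis=0), wall_clock

w, total_time = np.ones(4), 0.0
for _ in range(100):
    w, t = sync_iteration(w)
    total_time += t

# E[max of P Exp(mu)] = H_P / mu, noticeably worse than 1/mu for one learner.
print(f"avg time/iteration: {total_time / 100:.2f}  (single learner: {1 / mu:.2f})")
```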

7–9. Asynchronous Distributed SGD [Recht 2011, Dean 2012, Cipar 2013, ...]. (Animation frames: learners 1 through P push gradients to the parameter server without waiting for each other, so model versions w1, w2, w3 interleave across learners.) We don't have to wait for straggling learners, but gradient staleness can increase the error.
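A sketch of how staleness arises: each learner reads the model, computes for a while, and pushes; meanwhile the server has applied other learners' updates, so the arriving gradient is some number of versions stale. A minimal event-driven simulation (Exp(μ) runtimes and the bookkeeping are illustrative assumptions):

```python
import heapq
import numpy as np

rng = np.random.default_rng(2)
P, mu, num_updates = 8, 1.0, 10_000   # learners, runtime rate (assumed)

server_version, staleness = 0, []
# Event queue of (finish_time, learner, version_read): each learner starts
# computing on the current model and pushes whenever it finishes.
events = [(rng.exponential(1 / mu), l, 0) for l in range(P)]
heapq.heapify(events)

while server_version < num_updates:
    t, learner, version_read = heapq.heappop(events)
    # Staleness = number of server updates applied since this gradient's
    # model version was read.
    staleness.append(server_version - version_read)
    server_version += 1   # apply the (possibly stale) update immediately
    heapq.heappush(events, (t + rng.exponential(1 / mu), learner, server_version))

# With iid Exp(mu) runtimes the other P-1 learners push, on average,
# P-1 updates during one computation, so mean staleness is about P-1.
print("mean staleness:", np.mean(staleness))
```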

10–12. Performance Comparison. (Plots: log loss vs. iterations and log loss vs. wall-clock time, for Synchronous and Asynchronous SGD.) We need to understand convergence with respect to wall-clock time, and not only the number of iterations!

13. Question: How do the Error-Runtime trade-offs compare?

14–16. Assumptions for Runtime Analysis. Runtime is a random variable, independent and identically distributed across different learners and mini-batches.

17–34. Expected Runtimes of SGD Variants: Sync variants [Gupta et al. ICDM 2016] [Chen et al. 2016]. (Animation frames of timeline diagrams: learners L1–L3 push gradients to the parameter server PS across model versions w0, w1, w2 for Fully Sync-SGD, K-sync SGD, and K-batch-sync SGD.)

35–39. Expected Runtimes of SGD Variants. (Animation frames: K-sync SGD and K-batch-sync SGD timeline diagrams with learners L1–L3, parameter server PS, and model versions w0–w2.) Key idea of the proof: every gradient push is the minimum of P exponentials.
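A quick Monte Carlo check of that order-statistics picture, assuming Exp(μ) computation times (the rates and sizes are illustrative): K-sync SGD waits for the K-th fastest of P learners each iteration, while in K-batch-sync each gradient push is, by memorylessness, the minimum of P fresh exponentials:

```python
import numpy as np

rng = np.random.default_rng(3)
P, K, mu, trials = 8, 4, 1.0, 200_000   # all assumed for illustration

X = rng.exponential(1 / mu, size=(trials, P))

# K-sync SGD: per-iteration time is the K-th order statistic of P runtimes.
t_ksync = np.sort(X, axis=1)[:, K - 1].mean()

# K-batch-sync SGD: each of the K pushes is a minimum of P fresh
# exponentials, so the per-iteration time is K * E[min] in expectation.
t_kbatch = K * X.min(axis=1).mean()

# Closed forms: E[K-th of P] = (H_P - H_{P-K}) / mu, E[min] = 1 / (P mu).
H = lambda n: sum(1 / i for i in range(1, n + 1))
print(f"K-sync:       sim {t_ksync:.3f}  theory {(H(P) - H(P - K)) / mu:.3f}")
print(f"K-batch-sync: sim {t_kbatch:.3f}  theory {K / (P * mu):.3f}")
```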

40–61. Expected Runtimes of SGD Variants: Async variants [Lian et al. NIPS 2015]. (Animation frames of timeline diagrams: learners L1–L3 push to the parameter server PS across model versions w0–w3 for Async SGD, K-Async SGD, and K-Batch Async SGD.)

62–66. Expected Runtimes of SGD Variants. (Animation frames: K-Async SGD and K-Batch Async SGD timeline diagrams.) Observation: the per-iteration runtime is the K-th order statistic of all learners' remaining times. Key observation: for each learner, gradient pushes form a renewal process (inter-arrival times are i.i.d.).
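The renewal observation in simulation form (again with illustrative Exp(μ) runtimes): each learner's pushes form a renewal process with rate μ, so the server receives pushes at aggregate rate Pμ, and K-batch-async completes an iteration roughly every K/(Pμ) time units, with no learner ever idling:

```python
import heapq
import numpy as np

rng = np.random.default_rng(4)
P, K, mu, iterations = 8, 4, 1.0, 50_000   # assumed for illustration

# Each learner computes back-to-back; every finished mini-batch is one push.
events = [(rng.exponential(1 / mu), l) for l in range(P)]
heapq.heapify(events)

pushes, t = 0, 0.0
while pushes < K * iterations:      # one iteration = K gradient pushes
    t, learner = heapq.heappop(events)
    pushes += 1
    heapq.heappush(events, (t + rng.exponential(1 / mu), learner))

print(f"time/iteration: sim {t / iterations:.4f}  theory {K / (P * mu):.4f}")
```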

67–68. Comparison of Expected Runtime

69. How to get the error-runtime trade-off?

70. Assumptions for Error Analysis: Lipschitz-smooth loss function with parameter L; strongly convex with parameter c; unbiased estimate of the gradient; upper bound on the variance of the stochastic gradient.
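Written out, these are the standard four conditions (a conventional rendering; the slide's own notation for the stochastic gradient g and variance bound σ² is assumed):

```latex
\begin{align*}
&\text{L-smoothness:} && \|\nabla F(w) - \nabla F(w')\| \le L\,\|w - w'\| \\
&\text{c-strong convexity:} && F(w') \ge F(w) + \nabla F(w)^\top (w' - w) + \tfrac{c}{2}\,\|w' - w\|^2 \\
&\text{unbiasedness:} && \mathbb{E}_{\xi}\big[g(w;\xi)\big] = \nabla F(w) \\
&\text{bounded variance:} && \mathbb{E}_{\xi}\big[\|g(w;\xi) - \nabla F(w)\|^2\big] \le \sigma^2
\end{align*}
```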

71–73. Fully Sync SGD: Error Analysis. Update rule: equivalent to mini-batch SGD with batch size Pm. For c-strongly convex, L-smooth functions, the expected error after J iterations decomposes into a decay-rate term and an error floor [Bottou, 2016].
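A hedged reconstruction of the bound the slide attributes to [Bottou, 2016], for a fixed step size η and effective batch size Pm (constants may differ slightly from the slide's rendering):

```latex
\mathbb{E}\big[F(w_J)\big] - F^{*} \;\le\;
\underbrace{\frac{\eta L \sigma^{2}}{2 c \, P m}}_{\text{error floor}}
\;+\;
\underbrace{(1 - \eta c)^{J}}_{\text{decay rate}}
\left( F(w_0) - F^{*} - \frac{\eta L \sigma^{2}}{2 c \, P m} \right)
```

The first term is the noise floor the iterates hover around; the second decays geometrically in J, which is why a larger effective batch lowers the floor without changing the decay rate.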

74. K-sync and K-batch-sync SGD: Error Analysis. Error after J iterations for the sync variants.

75–77. Async SGD: Error Analysis. The update rule is hard to analyze due to stale gradients. Assumptions in previous works: an upper bound on staleness [Lian et al. 2015]; a geometric staleness distribution [Mitiliagkas et al. 2016]; a staleness process independent of w [Mitiliagkas et al. 2016]. We remove these assumptions, and instead consider...

78–80. Async SGD: Error Analysis. For c-strongly convex, L-smooth functions, where γ is the staleness bound and p₀ is the probability of getting a fresh gradient (i.e., the same learner pushes the next update): the error floor is larger than for Sync-SGD, but Async SGD can be faster than Sync SGD if p₀/2 > γ. The analysis can be generalized to non-convex objectives.

81. K-async and K-batch-async SGD: Error Analysis. Error after J iterations for the async variants.

82. Error-Runtime Trade-offs

83–84. Spanning the spectrum between Synchronous and Asynchronous SGD. (Plot: error at convergence vs. runtime; K-Async SGD and K-Batch Async with K = 2, 3, 4 interpolate between Async SGD on one end and Fully Sync / Batch Sync on the other.)

85–86. Key Takeaways: straggling learners, gradient staleness, and that true SGD convergence is with respect to wall-clock time! Contributions: a theoretical understanding of error-runtime trade-offs; integrating scheduling and algorithmic techniques for improved performance. (Plot repeated from slide 84: error at convergence vs. runtime.) S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar, "Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD," AISTATS 2018, pp. 803–812.

87. Ongoing & Future Directions: gradually increasing synchrony (plot: Async-SGD and Sync-SGD error curves with their respective error floors; we want the envelope of the two), and stochastic staleness analysis. S. Dutta, J. Wang, and G. Joshi, "Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD," submitted to JMLR (2019).

88. Sync SGD: Choosing the best K. The error is equivalent to that of mini-batch SGD with batch size Km. On the MNIST dataset, K = 4 strikes a good balance between convergence speed and error floor.
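One way to see why an intermediate K wins, combining the two ingredients from the talk: the error floor shrinks like 1/(Km) by the mini-batch equivalence, while the per-iteration runtime grows as the K-th order statistic of the learners' Exp(μ) runtimes (all constants here are illustrative assumptions):

```python
P, m, mu, sigma2 = 8, 32, 1.0, 1.0  # learners, batch size, rate, variance (assumed)
H = lambda n: sum(1 / i for i in range(1, n + 1))

print(" K  time/iter  error floor (prop.)")
for K in range(1, P + 1):
    t_iter = (H(P) - H(P - K)) / mu   # wait for the K-th fastest of P learners
    floor = sigma2 / (K * m)          # floor ~ sigma^2 / (Km)
    print(f"{K:2d}  {t_iter:9.3f}  {floor:.5f}")

# Small K: cheap iterations, high floor. Large K: low floor, but the wait
# time blows up as K -> P (straggler-dominated). An intermediate K, like
# K = 4 on MNIST in the slides, balances the two.
```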

89. Thank You!