1. Adding Algorithm Based Fault-Tolerance to BLIS
Tyler Smith, Robert van de Geijn, Mikhail Smelyanskiy, Enrique Quintana-Ortí
2. Introduction
Soft errors: transient hardware failures caused by high-energy particle incidence. They cause crashes and numerical errors.
Present-day supercomputers: mean time between failures (MTBF) is already quite high.
3. Motivation
Future supercomputers: exascale systems will have 3 orders of magnitude more components, so MTBF will deteriorate.
Resilience will be a fundamental problem [3].
4. Some solutions
Checkpoint and restart of the entire application: recovers from hard errors but not from soft errors.
Redundancy: double redundancy to detect, triple redundancy to correct.
These solutions may cost too much in terms of power budget.
5. Algorithm-Based Fault Tolerance (ABFT)
ABFT [11]: low overhead, but needs to be integrated into the application.
FLARE [10].
Fault-tolerant ITXGEMM [9].
Our work: fault-tolerant BLIS.
6. Outline
Detecting and Correcting Errors
Integrating ABFT into BLIS
Performance Results
7. Detecting Errors
Our GEMM operation.
8-10. Detecting Errors
Right checksum.
11-13. Detecting Errors
Left checksum.
14. Detecting Errors
Error location.
15. Detecting Errors
Multiple errors.
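The figures for these slides are not preserved, so the following is only a minimal sketch of checksum-based detection and location, using the quantities named later in the talk (Bw, A(Bw), Cw and v^T A, (v^T A) B, v^T C). It assumes a plain product C = AB, all-ones checksum vectors w and v, and a fixed tolerance; the helper names are hypothetical and this is not the BLIS implementation.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][p] * B[p][j] for p in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(M, x):
    """M times a column vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def vecmat(y, M):
    """Row vector y times M."""
    return [sum(y[i] * M[i][j] for i in range(len(M))) for j in range(len(M[0]))]

def mismatches(expected, actual, tol=1e-9):
    """Indices where two checksum vectors disagree by more than tol (assumed tolerance)."""
    return [i for i, (e, a) in enumerate(zip(expected, actual)) if abs(e - a) > tol]

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 x 2
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]     # 2 x 3
C = matmul(A, B)                            # 3 x 3

w = [1.0] * len(B[0])   # right checksum vector (all ones is an assumption)
v = [1.0] * len(A)      # left checksum vector (all ones is an assumption)

C[1][2] += 10.0         # inject a simulated soft error into one element of C

# Right checksum: compare A(Bw) with Cw; a mismatch flags a corrupted row.
bad_rows = mismatches(matvec(A, matvec(B, w)), matvec(C, w))
# Left checksum: compare (v^T A) B with v^T C; a mismatch flags a corrupted column.
bad_cols = mismatches(vecmat(vecmat(v, A), B), vecmat(v, C))

# Candidate error locations are the intersections of flagged rows and columns.
candidates = [(i, j) for i in bad_rows for j in bad_cols]
print(bad_rows, bad_cols, candidates)
```

With several simultaneous errors, every intersection of a flagged row and a flagged column is only a candidate location, so some candidates may turn out to be correct elements.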
16. Errors in A and B
A single error in A or B can corrupt multiple elements of C: one corrupted element in A can corrupt a whole row of C, and one corrupted element in B can corrupt a whole column of C.
Our approach handles this.
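The row and column statements follow from a short derivation. The notation here is an assumption for illustration: a single element of A (or B) is perturbed by an amount \delta, and e_i denotes the i-th unit vector.

```latex
% One corrupted element A(i,p): the error spreads along row i of C.
\hat{A} = A + \delta\, e_i e_p^T
\quad\Longrightarrow\quad
\hat{A} B = AB + \delta\, e_i \left( e_p^T B \right)

% One corrupted element B(p,j): the error spreads along column j of C.
\hat{B} = B + \delta\, e_p e_j^T
\quad\Longrightarrow\quad
A \hat{B} = AB + \delta \left( A e_p \right) e_j^T
```

In the first case row i of C is perturbed by \delta times row p of B; in the second, column j of C is perturbed by \delta times column p of A.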
17. Correcting Errors
Traditional ABFT approach: calculate what the error is and subtract it away. This raises questions about numerical stability.
We do checkpoint-and-rollback: checkpoint C to main memory and, if an error is detected, roll back and recompute.
We roll back and recompute only the corrupted elements.
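A minimal sketch of checkpoint-and-rollback restricted to the corrupted elements. It assumes the update has the form C := C + AB and that detection has already produced the flagged rows and columns; the function names are hypothetical and the checkpoint is a simple in-memory copy.

```python
import copy

def gemm_update(C, A, B):
    """In-place update C := C + A B (the update form is an assumption)."""
    for i in range(len(A)):
        for j in range(len(B[0])):
            C[i][j] = C[i][j] + sum(A[i][p] * B[p][j] for p in range(len(B)))

def rollback_and_recompute(C, C_ckpt, A, B, bad_rows, bad_cols):
    """Restore only the flagged elements from the checkpoint and redo their update."""
    for i in bad_rows:
        for j in bad_cols:
            C[i][j] = C_ckpt[i][j] + sum(A[i][p] * B[p][j] for p in range(len(B)))

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]

C_ckpt = copy.deepcopy(C)      # checkpoint C before the update
gemm_update(C, A, B)
correct = copy.deepcopy(C)     # reference result, kept only to check the demo

C[0][1] = -999.0               # simulated soft error in one element of C
rollback_and_recompute(C, C_ckpt, A, B, bad_rows=[0], bad_cols=[1])
print(C == correct)
```

Only one element is recomputed here, rather than the whole update, which is the point of restricting recovery to the corrupted elements.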
18. Outline
Detecting and Correcting Errors
Integrating ABFT into BLIS
Performance Results
19. Integrating ABFT into BLIS
Each loop here represents a different layer within BLIS, and ABFT can be implemented at the layer of your choice.
Tradeoff:
Higher levels: cheaper ABFT, but errors are detected later.
Lower levels: more expensive ABFT, but errors are caught quickly.
We implement ABFT at the macro-kernel level.
20. Integrating ABFT into BLIS
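To make the layering concrete, here is a heavily simplified sketch of a blocked GEMM with a hook after each macro-kernel call, which is where a macro-kernel-level check could run. It assumes the usual BLIS loop ordering (column panels of width nc, rank-kc updates, row blocks of height mc) and collapses the macro-kernel's inner loops and the micro-kernel into plain element updates. The block sizes, the function name `blocked_gemm`, and the hook `after_macro_kernel` are all hypothetical.

```python
def blocked_gemm(A, B, C, nc=4, kc=2, mc=2, after_macro_kernel=None):
    """C := C + A B, blocked in the jc / pc / ic order (simplified sketch)."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, nc):            # column panels of B and C
        for pc in range(0, k, kc):        # rank-kc update; a panel of B would be packed here
            for ic in range(0, m, mc):    # row blocks; a block of A would be packed here
                # "Macro-kernel": update one mc x nc block of C.
                for j in range(jc, min(jc + nc, n)):
                    for i in range(ic, min(ic + mc, m)):
                        C[i][j] += sum(A[i][p] * B[p][j]
                                       for p in range(pc, min(pc + kc, k)))
                # A macro-kernel-level ABFT check would run here, once per block.
                if after_macro_kernel is not None:
                    after_macro_kernel(ic, jc, pc)

A = [[float(i + p) for p in range(3)] for i in range(3)]       # 3 x 3
B = [[float(p * j + 1) for j in range(5)] for p in range(3)]   # 3 x 5
C = [[0.0] * 5 for _ in range(3)]

visited = []
blocked_gemm(A, B, C, after_macro_kernel=lambda ic, jc, pc: visited.append((ic, jc, pc)))

expected = [[sum(A[i][p] * B[p][j] for p in range(3)) for j in range(5)] for i in range(3)]
print(C == expected, len(visited))
```

Moving the hook to an outer loop would make it fire less often (cheaper, but later detection), while moving it inward would make it fire more often, which mirrors the tradeoff stated on the slide.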
21. Fault Tolerance at the Macro-kernel Level
Things to add to BLIS: right checksum, left checksum, checkpointing C, rollback and recovery.
22. Right Checksum
Must compute: Bw, A(Bw), Cw.
Goal: reduce extra memory movements.
23-26. Right Checksum
Bw, A(Bw), Cw.
27. Right Checksum
28. Left Checksum
Must compute: v^T A, (v^T A) B, v^T C.
29-32. Left Checksum
v^T A, (v^T A) B, v^T C.
33. Left Checksum
34. Left Checksum
v^T A can be computed while packing.
Problem: (v^T A) B must be performed once per macro-kernel, so the left checksum has a higher overhead than the right checksum.
Solution: perform the left checksum lazily, only if the right checksum detects an error.
35. Lazy Left Checksum
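A minimal sketch of the lazy scheme: the right checksum is always evaluated, and the left checksum is computed only when the right one reports a mismatch. It assumes a plain product C = AB, all-ones checksum vectors, and a fixed tolerance; `check_block` and the `stats` counter are hypothetical names used only to show that the left checksum is skipped on clean data.

```python
def matvec(M, x):
    """M times a column vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def vecmat(y, M):
    """Row vector y times M."""
    return [sum(y[i] * M[i][j] for i in range(len(M))) for j in range(len(M[0]))]

def mismatches(expected, actual, tol):
    return [i for i, (e, a) in enumerate(zip(expected, actual)) if abs(e - a) > tol]

def check_block(A, B, C, stats, tol=1e-9):
    """Return (bad_rows, bad_cols); the left checksum runs only after a right-checksum failure."""
    w = [1.0] * len(B[0])                       # all-ones vector is an assumption
    bad_rows = mismatches(matvec(A, matvec(B, w)), matvec(C, w), tol)
    if not bad_rows:
        return [], []                           # common case: no left checksum needed
    stats["left_checksums"] += 1
    v = [1.0] * len(A)                          # all-ones vector is an assumption
    bad_cols = mismatches(vecmat(vecmat(v, A), B), vecmat(v, C), tol)
    return bad_rows, bad_cols

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[19.0, 22.0], [43.0, 50.0]]               # equals A B

stats = {"left_checksums": 0}
clean = check_block(A, B, C, stats)
left_after_clean = stats["left_checksums"]

C[0][1] += 1.0                                  # simulated soft error
faulty = check_block(A, B, C, stats)
print(clean, faulty, stats)
```

One consequence of this design, under the assumptions above, is that an error pattern whose contributions cancel in the right checksum would never trigger the left checksum.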
36-37. Checkpointing
38. Multithreading Issues
Fewer loops have independent iterations.
Checksum vector computation: solved by giving each thread its own checksum vectors and then doing a reduction.
Load imbalance: when one thread is busy doing recovery, the other threads wait.
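The per-thread checksum vectors can be sketched as follows. This assumes the columns of C are divided among threads and that the checksum vector is all ones, so each thread accumulates row sums over its own columns into a private vector and the private vectors are then added together. The function names and the column partitioning are hypothetical, and Python threads are used only to show the structure, not for speed.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_right_checksum(C, cols):
    """Private checksum vector of one thread: row sums of C over that thread's columns."""
    return [sum(row[j] for j in cols) for row in C]

def threaded_right_checksum(C, n_threads):
    n = len(C[0])
    # Assumed partitioning: thread t owns columns t, t + n_threads, ...
    chunks = [range(t, n, n_threads) for t in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(lambda cols: partial_right_checksum(C, cols), chunks))
    # Reduction: add the private vectors element by element.
    return [sum(vals) for vals in zip(*partials)]

C = [[float(i * 4 + j) for j in range(4)] for i in range(3)]   # 3 x 4 test matrix
serial = [sum(row) for row in C]
parallel = threaded_right_checksum(C, n_threads=2)
print(serial, parallel)
```

Because no thread writes to another thread's vector, no locking is needed until the final reduction.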
39. Load Imbalance
Solutions:
Dynamic parallelism: waiting threads can steal work from slow threads.
Lazy recomputation: mark corrupted elements of C, then have all threads cooperatively perform recovery. Easy to implement in BLIS, but the data is cold in cache.
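A minimal sketch of lazy recomputation: corrupted elements are only marked when they are found, and recovery is deferred to a phase in which all threads share the marked elements. It assumes the update C := C + AB and a checkpoint of C; the set `marked` and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def recompute_element(C, C_ckpt, A, B, i, j):
    """Redo the update of a single element starting from its checkpointed value."""
    C[i][j] = C_ckpt[i][j] + sum(A[i][p] * B[p][j] for p in range(len(B)))

def cooperative_recovery(C, C_ckpt, A, B, marked, n_threads):
    """All worker threads share the marked elements; each element is written by one task."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(recompute_element, C, C_ckpt, A, B, i, j)
                   for (i, j) in sorted(marked)]
        for f in futures:
            f.result()          # re-raise any exception from a worker

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C_ckpt = [[0.0, 0.0], [0.0, 0.0]]
expected = [[C_ckpt[i][j] + sum(A[i][p] * B[p][j] for p in range(2))
             for j in range(2)] for i in range(2)]

C = [row[:] for row in expected]
C[0][0] = -1.0                  # simulated soft errors, noticed during the main computation
C[1][1] = -1.0
marked = {(0, 0), (1, 1)}       # elements are only marked, not fixed immediately

cooperative_recovery(C, C_ckpt, A, B, marked, n_threads=2)
print(C == expected)
```

By the time the deferred recovery runs, the operands for the marked elements are unlikely to still be in cache, which matches the cold-cache remark on the slide.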
40. Final Implementation
41. Outline
Detecting and Correcting Errors
Integrating ABFT into BLIS
Performance Results
42. Performance Results
Cost of detecting errors. No errors introduced. Both 1 and 16 cores. K is set to 256.
43. Performance Results
Breakdown of the costs of detecting errors. No errors introduced. Single core. K is set to 256.
Both checksums and checkpointing exhibit similar costs.
44. Performance Results
Breakdown of the costs of detecting errors. No errors introduced. 16 cores. K is set to 256.
Both checksums and checkpointing exhibit similar costs.
45. Performance Results
Detecting and correcting errors in C. Single core. Square matrices.
Quantifying the cost of correcting for small matrices.
46. Performance Results
Detecting and correcting errors in A and B. Single core. K = 256.
1 error in A corrupts Nc elements of C; 1 error in B corrupts M elements of C.
47. Performance Results
Detecting and correcting errors in A and B. Multi-core. K = 256.
1 error in A corrupts Nc elements of C; 1 error in B corrupts M elements of C.
48. Thank You!
Questions? tms@cs.utexas.edu