Adding Algorithm Based Fault-Tolerance to BLIS


Uploaded On 2024-01-13


Tyler Smith, Robert van de Geijn, Mikhail Smelyanskiy, Enrique Quintana-Ortí. Introduction: soft errors are transient hardware failures caused by high-energy particle incidence; they cause crashes and numerical errors.




Presentation Transcript

1. Adding Algorithm Based Fault-Tolerance to BLIS. Tyler Smith, Robert van de Geijn, Mikhail Smelyanskiy, Enrique Quintana-Ortí

2. Introduction. Soft errors: transient hardware failures caused by high-energy particle incidence; they cause crashes and numerical errors. Present-day supercomputers: mean time between failures (MTBF) is already quite high.

3. Motivation. Future supercomputers: exascale systems will have 3 orders of magnitude more components; MTBF will deteriorate; resilience will be a fundamental problem [3].

4. Some solutions. Checkpoint and restart of the entire application: recovers from hard but not soft errors. Redundancy: double redundancy to detect, triple redundancy to correct. These solutions may cost too much in terms of power budget.

5. Algorithm-Based Fault Tolerance (ABFT). ABFT [11]: low overhead; needs to be integrated into the application. FLARE [10]. Fault-tolerant ITXGEMM [9]. Our work: fault-tolerant BLIS.

6. Outline. Detecting and Correcting Errors; Integrating ABFT into BLIS; Performance Results.

7. Detecting Errors. Our GEMM operation.

8. Detecting Errors. Right checksum.

9. Detecting Errors. Right checksum.

10. Detecting Errors. Right checksum.
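
The equations on these slides did not survive in the transcript. As a minimal sketch of what a right checksum test can look like, the Python below assumes the update C := C + AB and an all-ones checksum vector w; both are assumptions for illustration, as are the function names and the tolerance. It compares C_new w against C_old w + A(Bw) and reports the rows of C where they disagree.

```python
def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def gemm_update(C, A, B):
    """Return C + A @ B as a new list of rows."""
    k, n = len(B), len(B[0])
    return [[C[i][j] + sum(A[i][p] * B[p][j] for p in range(k))
             for j in range(n)] for i in range(len(A))]

def right_checksum_rows(C_old, C_new, A, B, w, tol=1e-9):
    """Rows i where (C_new w)_i differs from (C_old w)_i + (A (B w))_i."""
    expected = [c + ab for c, ab in
                zip(matvec(C_old, w), matvec(A, matvec(B, w)))]
    actual = matvec(C_new, w)
    return [i for i, (e, a) in enumerate(zip(expected, actual))
            if abs(e - a) > tol]

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
C_old = [[0.0] * 3 for _ in range(3)]
w = [1.0, 1.0, 1.0]  # assumed checksum vector

C_new = gemm_update(C_old, A, B)
clean_rows = right_checksum_rows(C_old, C_new, A, B, w)

C_new[1][2] += 0.5   # inject a simulated soft error into C
bad_rows = right_checksum_rows(C_old, C_new, A, B, w)
```

The check costs three matrix-vector products rather than a second matrix-matrix product, which is why checksum-based detection is cheap relative to redundant computation. In floating point the tolerance would have to scale with the problem; the fixed `tol` here only suits this small exact example.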

11. Detecting Errors. Left checksum.

12. Detecting Errors. Left checksum.

13. Detecting Errors. Left checksum.

14. Detecting Errors. Error location.

15. Detecting Errors. Multiple errors.
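
The slide content for error location is not preserved, so the following is only an illustrative sketch of how two checksums can be combined. It assumes C is meant to equal AB (C starts at zero) and that both checksum vectors are all ones; `locate_errors` is a hypothetical name. The right checksum flags rows, the left checksum flags columns, and a corrupted element lies at an intersection of the two.

```python
def matmul(X, Y):
    """Plain matrix product on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def locate_errors(C, A, B, tol=1e-9):
    """Return (rows, cols) flagged by the right and left checksums."""
    m, n = len(C), len(C[0])
    w = [[1.0] for _ in range(n)]   # n x 1 column vector (assumed all ones)
    vT = [[1.0] * m]                # 1 x m row vector (assumed all ones)
    # Right checksum: C w - A (B w), one entry per row of C.
    right = [c[0] - r[0]
             for c, r in zip(matmul(C, w), matmul(A, matmul(B, w)))]
    # Left checksum: v^T C - (v^T A) B, one entry per column of C.
    left = [c - r
            for c, r in zip(matmul(vT, C)[0], matmul(matmul(vT, A), B)[0])]
    rows = [i for i, d in enumerate(right) if abs(d) > tol]
    cols = [j for j, e in enumerate(left) if abs(e) > tol]
    return rows, cols

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

C = matmul(A, B)
C[2][0] += 1.0                      # single simulated error
single = locate_errors(C, A, B)

C[0][1] += 1.0                      # second simulated error
double = locate_errors(C, A, B)
```

With one error the flagged row and column pin down the element exactly. With two errors in different rows and columns the checksums flag two rows and two columns, so the candidate set is the 2 x 2 grid of intersections, a superset of the elements that are actually wrong.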

16. Errors in A and B. A single error in A or B can corrupt multiple elements of C: one corrupted element in A can corrupt a whole row of C, and one corrupted element in B can corrupt a whole column of C. Our approach handles this.
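
A small Python demonstration of this propagation, using made-up matrices: corrupting one entry of A changes every entry in one row of the product, and corrupting one entry of B changes every entry in one column.

```python
def matmul(X, Y):
    """Plain matrix product on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def changed(C_ref, C_bad):
    """Positions (i, j) where the two matrices differ."""
    return [(i, j) for i, row in enumerate(C_ref)
            for j, c in enumerate(row) if C_bad[i][j] != c]

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
C_ref = matmul(A, B)

A_bad = [row[:] for row in A]
A_bad[1][0] += 1.0                       # one corrupted element of A
hit_by_A = changed(C_ref, matmul(A_bad, B))

B_bad = [row[:] for row in B]
B_bad[0][2] += 1.0                       # one corrupted element of B
hit_by_B = changed(C_ref, matmul(A, B_bad))
```

Here `hit_by_A` covers all of row 1 and `hit_by_B` covers all of column 2. An entry stays unchanged only where the corrupted value is multiplied by zero.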

17. Correcting Errors. Traditional ABFT approach: calculate what the error is and subtract it away; this raises questions about numerical stability. We do checkpoint-and-rollback instead: checkpoint C to main memory and, if an error is detected, roll back and recompute. We roll back and recompute only the corrupted elements.
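
The checkpoint-and-rollback idea can be sketched as follows. This is an illustration under stated assumptions, not the BLIS implementation: the update is taken to be C := C + AB, the checkpoint is a plain copy of C, and `rollback_and_recompute` is a hypothetical helper that restores and recomputes only the flagged row/column intersections.

```python
import copy

def gemm_update(C, A, B):
    """In-place C += A @ B on lists of rows."""
    for i in range(len(A)):
        for j in range(len(B[0])):
            C[i][j] += sum(A[i][p] * B[p][j] for p in range(len(B)))

def rollback_and_recompute(C, checkpoint, A, B, rows, cols):
    """Restore flagged elements from the checkpoint and redo their update."""
    count = 0
    for i in rows:
        for j in cols:
            C[i][j] = checkpoint[i][j] + sum(
                A[i][p] * B[p][j] for p in range(len(B)))
            count += 1
    return count

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
C = [[1.0] * 3 for _ in range(3)]

checkpoint = copy.deepcopy(C)     # saved before the update
gemm_update(C, A, B)
expected = copy.deepcopy(C)       # kept only to check the sketch

C[0][1] = -999.0                  # simulated corruption of one element
n_fixed = rollback_and_recompute(C, checkpoint, A, B, rows=[0], cols=[1])
```

Recomputing from the checkpoint avoids subtracting an estimated error from the corrupted value, and the work is proportional to the number of flagged elements rather than to the size of C.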

18. Outline. Detecting and Correcting Errors; Integrating ABFT into BLIS; Performance Results.

19. Integrating ABFT into BLIS. Each loop here represents a different layer within BLIS, and ABFT can be implemented at your choice of layer. Tradeoff: at higher levels ABFT is cheaper, but errors are detected later; at lower levels ABFT is more expensive, but errors are caught quickly. We implement ABFT at the macro-kernel level.

20. Integrating ABFT into BLIS

21. Fault Tolerance at the Macro-kernel Level. Things to add to BLIS: right checksum; left checksum; checkpointing C; rollback and recovery.

22. Right Checksum. Must compute: Bw, A(Bw), Cw. Goal: reduce extra memory movements.

23. Right Checksum. Bw, A(Bw), Cw.

24. Right Checksum. Bw, A(Bw), Cw.

25. Right Checksum. Bw, A(Bw), Cw.

26. Right Checksum. Bw, A(Bw), Cw.

27. Right Checksum

28. Left Checksum. Must compute: v^T A, (v^T A)B, v^T C.

29. Left Checksum. v^T A, (v^T A)B, v^T C.

30. Left Checksum. v^T A, (v^T A)B, v^T C.

31. Left Checksum. v^T A, (v^T A)B, v^T C.

32. Left Checksum. v^T A, (v^T A)B, v^T C.

33. Left Checksum

34. Left Checksum. v^T A can be computed while packing. Problem: (v^T A)B must be performed once per macro-kernel, so the left checksum has a higher overhead than the right. Solution: perform the left checksum lazily, only if the right checksum detects an error.
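
A rough sketch of the lazy scheme in Python, under simplifying assumptions: C is meant to equal AB, both checksum vectors are all ones, and "packing" is modeled as a plain copy of A. The names `pack_with_column_sums` and `check_block` are hypothetical. The column sums v^T A are accumulated in the same pass that copies A, and the product (v^T A)B is formed only when the right checksum has already flagged a row.

```python
def pack_with_column_sums(A):
    """Copy A (a stand-in for packing) and accumulate v^T A in the same pass."""
    packed, vTA = [], [0.0] * len(A[0])
    for row in A:
        packed.append(list(row))
        for p, a in enumerate(row):
            vTA[p] += a
    return packed, vTA

def check_block(C, A_packed, vTA, B, tol=1e-9):
    """Right checksum always; left checksum only if the right one fails."""
    n = len(B[0])
    Bw = [sum(row) for row in B]                     # B w with w all ones
    rows = [i for i, (c_row, a_row) in enumerate(zip(C, A_packed))
            if abs(sum(c_row) - sum(a * bw for a, bw in zip(a_row, Bw))) > tol]
    if not rows:
        return rows, [], False                       # left checksum skipped
    vTAB = [sum(vTA[p] * B[p][j] for p in range(len(B))) for j in range(n)]
    cols = [j for j in range(n)
            if abs(sum(c_row[j] for c_row in C) - vTAB[j]) > tol]
    return rows, cols, True

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
A_packed, vTA = pack_with_column_sums(A)
C = [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

clean = check_block(C, A_packed, vTA, B)
C[0][2] += 2.0                                       # simulated soft error
faulty = check_block(C, A_packed, vTA, B)
```

In the error-free case only the cheaper right checksum is paid for; the third element of the result records whether the left checksum was evaluated.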

35. Lazy Left Checksum

36. Checkpointing

37. Checkpointing

38. Multithreading Issues. Fewer loops have independent iterations. Checksum vector computation: solved by giving each thread its own checksum vectors and then doing a reduction. Load imbalance: when one thread is busy doing recovery, the other threads wait.
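
The per-thread checksum vectors and the reduction can be illustrated with Python's standard thread pool. This is a sketch only: the column-cyclic work split, the function names, and the thread count are assumptions, and BLIS itself is written in C with its own threading. Each worker accumulates a private partial copy of Bw over its own columns of B, and the partial vectors are then summed element-wise.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_Bw(B, w, cols):
    """One thread's private checksum vector over its own columns of B."""
    return [sum(row[j] * w[j] for j in cols) for row in B]

def parallel_Bw(B, w, n_threads=2):
    """Compute B w from per-thread partial vectors followed by a reduction."""
    n = len(w)
    chunks = [range(t, n, n_threads) for t in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(lambda cols: partial_Bw(B, w, cols), chunks))
    # Reduction: element-wise sum of the per-thread vectors.
    return [sum(vals) for vals in zip(*partials)]

B = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
w = [1.0, 1.0, 1.0]
Bw = parallel_Bw(B, w, n_threads=2)
```

Because no two workers write to the same vector, no locking is needed during accumulation; the only synchronization point is the final reduction.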

39. Load Imbalance. Solutions: Dynamic parallelism: waiting threads can steal work from slow threads. Lazy recomputation: mark corrupted elements of C; all threads cooperatively perform recovery; easy to implement in BLIS; data is cold in cache.
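
Lazy recomputation might be sketched like this, assuming the update C := C + AB, a checkpoint copy of C, and a list of marked positions produced by the detection step. The helper names and the use of a Python thread pool are illustrative choices only. Detection merely records the corrupted positions; recovery is deferred and the marked elements are then shared out among all workers.

```python
from concurrent.futures import ThreadPoolExecutor

def recompute(C, checkpoint, A, B, ij):
    """Restore one element from the checkpoint and redo its update."""
    i, j = ij
    C[i][j] = checkpoint[i][j] + sum(
        A[i][p] * B[p][j] for p in range(len(B)))

def cooperative_recovery(C, checkpoint, A, B, marked, n_threads=4):
    """All workers share the list of marked elements; each element is distinct."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(lambda ij: recompute(C, checkpoint, A, B, ij), marked))

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
checkpoint = [[0.0] * 3 for _ in range(3)]
expected = [[9.0, 12.0, 15.0], [19.0, 26.0, 33.0], [29.0, 40.0, 51.0]]

C = [row[:] for row in expected]
marked = []
for i, j in [(0, 1), (2, 2)]:
    C[i][j] += 7.0            # simulated corruption
    marked.append((i, j))     # a detection step would only mark it

cooperative_recovery(C, checkpoint, A, B, marked)
```

Deferring recovery keeps any single thread from stalling the others mid-computation, at the price noted on the slide: by the time recovery runs, the needed data is no longer in cache.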

40. Final Implementation

41. Outline. Detecting and Correcting Errors; Integrating ABFT into BLIS; Performance Results.

42. Performance Results. Cost of detecting errors. No errors introduced. Both 1 and 16 cores. K is set to 256.

43. Performance Results. Breakdown of the costs of detecting errors. No errors introduced. Single core. K is set to 256. Both checksums and checkpointing exhibit similar costs.

44. Performance Results. Breakdown of the costs of detecting errors. No errors introduced. 16 cores. K is set to 256. Both checksums and checkpointing exhibit similar costs.

45. Performance Results. Detecting and correcting errors in C. Single core. Square matrices. Quantifying the cost of correcting for small matrices.

46. Performance Results. Detecting and correcting errors in A and B. Single core. K = 256. 1 error in A corrupts Nc elements of C; 1 error in B corrupts M elements of C.

47. Performance Results. Detecting and correcting errors in A and B. Multi-core. K = 256. 1 error in A corrupts Nc elements of C; 1 error in B corrupts M elements of C.

48. Thank You! Questions? tms@cs.utexas.edu