D3S: Debug Deployed Distributed Systems
Author : tawny-fly | Published Date : 2025-05-19
Description: D3S: Debug Deployed Distributed Systems. Xuezheng Liu, Zhenyu Guo, Xi Wang, Feibo Chen, Xiaochen Lian, Jian Tang, Ming Wu, M. Frans Kaashoek, Zheng Zhang. Microsoft Research Asia, Tsinghua University, Fudan University, Shanghai Jiaotong
Transcript:
D3S: Debug Deployed Distributed Systems
Xuezheng Liu, Zhenyu Guo, Xi Wang, Feibo Chen, Xiaochen Lian, Jian Tang, Ming Wu, M. Frans Kaashoek, Zheng Zhang
Microsoft Research Asia, Tsinghua University, Fudan University, Shanghai Jiaotong University, MIT CSAIL

Debugging distributed systems is difficult
  Bugs are difficult to reproduce
    Many machines executing concurrently
    Machines may fail
    Network may fail

Example: Distributed lock
  Distributed reader-writer locks
    Lock mode: exclusive, shared
    Invariant: only one client can hold a lock in the exclusive mode
  Debugging is difficult because the protocol is complex
    For performance, clients cache locks
    For failure tolerance, locks have a lease

How do people debug?
  Simulation
  Model-checking
  Runtime checking

State-of-the-art of runtime checking
  Step 1: Add logs
    void ClientNode::OnLockAcquired(…) {
      …
      print_log( m_NodeID, lock, mode);
    }
  Step 2: Collect logs, align them into a globally consistent sequence
    Keep partial order
  Step 3: Write checking scripts
    Scan the logs to retrieve lock states
    Check the consistency of locks

Problems for large/deployed systems
  Too much manual effort
  Difficult to anticipate what needs to be logged
    Too much information: slows systems down
    Too little information: misses a problem
  Checking large systems is challenging
    A central checker cannot keep up
    Snapshots must be consistent

Our focus: make runtime checking easier and feasible for deployed/large-scale systems

D3S approach
  Predicate: no conflicting locks
  [Diagram labels: state (one per node), Conflict!, Violation!]
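The manual, log-based workflow listed under "State-of-the-art of runtime checking" can be made concrete with a small checking script. The following is only a sketch under stated assumptions: the LogEntry record, the Event enum, and the CheckLockLog function are hypothetical names introduced for illustration (the slides show only the print_log call), and the log is assumed to have already been aligned into a globally consistent sequence as in Step 2.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical log format: the slides only show
// print_log(m_NodeID, lock, mode), so every field here is an assumption.
enum class LockMode { Shared, Exclusive };
enum class Event { Acquired, Released };

struct LogEntry {
    std::uint64_t seq;  // position in the globally consistent order (Step 2)
    int node;           // logging client, i.e. m_NodeID
    std::string lock;   // lock identifier
    LockMode mode;
    Event event;
};

struct Violation {
    std::uint64_t seq;  // log position at which the invariant broke
    std::string lock;
};

// Step 3 sketch: scan the aligned log, rebuild who holds each lock, and
// report every acquisition after which an exclusive holder coexists with
// any other holder of the same lock.
std::vector<Violation> CheckLockLog(const std::vector<LogEntry>& log) {
    std::map<std::string, std::map<int, LockMode>> holders;  // lock -> node -> mode
    std::vector<Violation> violations;
    std::uint64_t lastSeq = 0;
    for (const LogEntry& e : log) {
        assert(e.seq >= lastSeq && "log must already be globally ordered");
        lastSeq = e.seq;
        std::map<int, LockMode>& h = holders[e.lock];
        if (e.event == Event::Released) {
            h.erase(e.node);
            continue;
        }
        h[e.node] = e.mode;
        bool hasExclusive = false;
        for (const auto& holder : h) {
            if (holder.second == LockMode::Exclusive) hasExclusive = true;
        }
        if (hasExclusive && h.size() > 1) {
            violations.push_back({e.seq, e.lock});
        }
    }
    return violations;
}
```

Even in this small form the script depends on having logged exactly the right events and on a single process reading the whole log, which are the two problems the deck raises for large or deployed systems.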
Our contributions/outline
  A simple language for writing distributed predicates
  Programmers can change what is being checked on-the-fly
  Failure-tolerant consistent snapshot for predicate checking
  Evaluation with five real-world applications

Design goals
  Simplicity: a sequential style for writing predicates
  Parallelism: run in parallel on multiple checkers
  Correctness: check consistent states in spite of failures

Solution
  MapReduce model
  Failure-tolerant consistent snapshot

Developers write a D3S predicate
  Part 1: define the dataflow and types of states, and how states are retrieved
    V0: exposer { ( client: ClientID, lock: LockID, mode: LockMode ) }
    V1: V0 { ( conflict: LockID ) } as final
    after (ClientNode::OnLockAcquired) addtuple ($0->m_NodeID, $1, $2)
    after (ClientNode::OnLockReleased) deltuple ($0->m_NodeID, $1, $2)
  Part 2: define the logic and mapping function in each stage for predicates
    class MyChecker : vertex {
      virtual void Execute( const V0::Snapshot & snapshot ) {
        …. // Invariant logic, written in a sequential style
      }
      static int64 Mapping( const V0::tuple & t ) ; // guidance for partitioning
    };

D3S parallel predicate checker
  Lock clients
  Checkers
  Expose states individually
  Reconstruct: SN1, SN2, …
  Exposed
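To illustrate what the elided invariant logic in MyChecker::Execute and the Mapping function might look like, here is a standalone sketch. It does not use the real D3S types: LockTuple and Snapshot are hypothetical stand-ins for V0::tuple and V0::Snapshot, the free functions Execute and Mapping only mirror the member names on the slide, and returning the conflicting lock IDs stands in for emitting V1 tuples. It also assumes that each client appears at most once per lock in a snapshot.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

enum class LockMode { Shared, Exclusive };

// Hypothetical stand-in for a V0 tuple: (client, lock, mode).
struct LockTuple {
    std::string client;
    std::string lock;
    LockMode mode;
};

// Hypothetical stand-in for V0::Snapshot: all tuples exposed at one
// consistent point in time.
using Snapshot = std::vector<LockTuple>;

struct HolderCount {
    int shared = 0;
    int exclusive = 0;
};

// Partitioning guidance: tuples for the same lock get the same key, so a
// framework could route them to the same checker (assumed behavior).
std::int64_t Mapping(const LockTuple& t) {
    return static_cast<std::int64_t>(std::hash<std::string>{}(t.lock));
}

// Invariant logic in a sequential style: return every lock that has more
// than one exclusive holder, or an exclusive holder alongside shared ones.
std::vector<std::string> Execute(const Snapshot& snapshot) {
    std::map<std::string, HolderCount> perLock;
    std::set<std::pair<std::string, std::string>> seen;  // (lock, client)
    for (const LockTuple& t : snapshot) {
        const bool fresh = seen.emplace(t.lock, t.client).second;
        assert(fresh && "a client should appear at most once per lock");
        (void)fresh;  // silence unused-variable warnings when NDEBUG is set
        HolderCount& c = perLock[t.lock];
        if (t.mode == LockMode::Exclusive) {
            ++c.exclusive;
        } else {
            ++c.shared;
        }
    }
    std::vector<std::string> conflicts;
    for (const auto& entry : perLock) {
        const HolderCount& c = entry.second;
        if (c.exclusive > 1 || (c.exclusive == 1 && c.shared > 0)) {
            conflicts.push_back(entry.first);
        }
    }
    return conflicts;
}
```

Keying the partition on the lock ID means a conflict on one lock can be detected by a single checker without seeing tuples for any other lock, which is what lets the check run in parallel in a MapReduce-like fashion.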