Peng Sun (Princeton), Ratul Mahajan, Jennifer Rexford (Princeton), Lihua Yuan, Ming Zhang, Ahsan Arefin (Microsoft), in the Proc. of ACM SIGCOMM 2014
A Network-State Management Service
Peng Sun (Princeton), Ratul Mahajan, Jennifer Rexford (Princeton), Lihua Yuan, Ming Zhang, Ahsan Arefin (Microsoft) (in the Proc. of ACM SIGCOMM 2014)
Presented by: Suman Maroju, EECS, Northwestern University
Paper 1
Paper Overview
A network-state management service for DCNs, called Statesman, is presented. It allows multiple management applications to operate independently while preserving network-wide safety and performance invariants. View-based architecture.
Tested in a Microsoft Azure data center for 7 months. Three applications tested.
DCN
A modern data center is home to tens of thousands of hosts, each consisting of one or more processors, memory, network interface, and local high-speed I/O (disk or flash).
Compute resources are packaged into racks and allocated as clusters consisting of thousands of hosts that are tightly connected with a high-bandwidth network.
Problem 1: Application conflict
Whichever application acts first takes control.
Problem 2: Safety violation
The applications' joint actions disconnect the ToR (top-of-rack) switch.
(Corybantic)
Three views of the network state:
Statesman uses three views of the network state: Observed, Proposed, and Target.
The design is inspired by the version control system git. Each application corresponds to a different git user.
Observed: pulled; Proposed: pushed; Target: merged.
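The git analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `StateStore`, its method names, and the `is_safe` callback are hypothetical and are not Statesman's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class StateStore:
    """Hypothetical in-memory stand-in for the three views."""
    observed: dict = field(default_factory=dict)  # what monitoring last saw
    proposed: dict = field(default_factory=dict)  # app name -> proposed changes
    target: dict = field(default_factory=dict)    # merged, accepted state

    def pull(self):
        """An application reads a copy of the observed view (like git pull)."""
        return dict(self.observed)

    def push(self, app, changes):
        """An application records its own proposal (like git push)."""
        self.proposed.setdefault(app, {}).update(changes)

    def merge(self, is_safe):
        """Fold proposals into the target view, keeping only safe changes."""
        # Variables nobody has changed default to their observed values.
        for key, value in self.observed.items():
            self.target.setdefault(key, value)
        rejected = []
        for app, changes in self.proposed.items():
            for key, value in changes.items():
                if is_safe(key, value, self.target):
                    self.target[key] = value
                else:
                    rejected.append((app, key, value))
        self.proposed.clear()
        return rejected
```

An application would call `pull()`, compute its changes, and `push()` them; a checker would then call `merge()` with whatever safety test it enforces and report the rejected changes back.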
Dependency Model of State Variables
Prior work considered independent variable-value pairs.
That representation does not contain enough semantic knowledge about how the various state variables are related.
A dependency model captures the domain-specific, cross-variable dependencies among the state variables.
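One way to picture a dependency model is as a graph in which a variable can be changed only if everything it depends on is in a usable condition. The sketch below assumes that reading. The edges in `DEPENDS_ON` and the name `DevicePower` are hypothetical and chosen only for illustration; the other variable names appear elsewhere in these slides.

```python
# Hypothetical dependency edges, for illustration only: each variable maps to
# the variables that must be usable before it can be changed.
DEPENDS_ON = {
    "DeviceFirmwareVersion": ["DevicePower"],
    "DeviceConfig": ["DeviceFirmwareVersion"],
    "LinkAdminPower": ["DeviceConfig"],
}


def is_controllable(variable, healthy, depends_on=DEPENDS_ON):
    """Return True if every transitive dependency of `variable` is healthy.

    `healthy` maps a variable name to True/False; missing names count as
    unhealthy.
    """
    stack = list(depends_on.get(variable, []))
    seen = set()
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue
        seen.add(dep)
        if not healthy.get(dep, False):
            return False
        stack.extend(depends_on.get(dep, []))
    return True
```

With independent variable-value pairs, a write to `LinkAdminPower` would look valid even while the firmware on the same device is unusable; walking the graph makes that relationship explicit.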
Detailed Architecture:
Input and Output in Statesman
Network State Variables & Controllability
Firmware upgrade: DeviceFirmwareVersion, DeviceFirmwareVersionIsControllable
Switch configuration: DeviceConfigIsControllable, LinkAdminPowerIsControllable
Checking Network State
Resolving Conflicts
1. TS-OS (target state vs. observed state): OpenFlow agent, DeviceAgent-BootStatus=Down, so the TS cannot be applied.
2. PS-OS (proposed state vs. observed state): LinkEndAddress=Down, so the PS cannot be applied.
3. PS-TS (proposed state vs. target state): upgrading a switch.
Read controllability values from the OS, set uncontrollability values at the PS or TS, and use SkipUpdate to resolve TS-OS conflicts or partial rejection.
For PS-TS conflicts: last-write-wins or priority-based locking.
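The two PS-TS policies named above can be sketched as follows. This is an illustration, not Statesman's implementation: the `Proposal` record, the `PriorityLocks` class, and the rule that a higher-priority application may take over a lock held by a lower-priority one are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    app: str
    variable: str
    value: str
    timestamp: float


def last_write_wins(proposals):
    """For each variable, keep the proposal with the latest timestamp."""
    winners = {}
    for p in proposals:
        current = winners.get(p.variable)
        if current is None or p.timestamp >= current.timestamp:
            winners[p.variable] = p
    return winners


class PriorityLocks:
    """Assumed locking rule: a larger priority number beats a smaller one."""

    def __init__(self, priority):
        self.priority = priority  # app name -> int
        self.holder = {}          # locked entity (e.g. a switch) -> app name

    def acquire(self, app, entity):
        """Return True if `app` now holds the lock on `entity`."""
        holder = self.holder.get(entity)
        if (holder is None or holder == app
                or self.priority[app] > self.priority[holder]):
            self.holder[entity] = app
            return True
        return False
```

Last-write-wins needs no coordination but lets a later proposal silently replace an earlier one; locking makes ownership explicit at the cost of applications having to acquire and hold locks.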
Statesman System Design and Implementation
50,000 lines of C# and C++ code.
Storage: RESTful web service. Paxos rings: storage instances in multiple locations, with smaller rings. Proxy layer for uniform access.
Updater: command templates (OpenFlow, BGP, etc.).
Monitor: SNMP, OpenFlow.
Read-write APIs of Statesman
Implemented as an HTTP web service with RESTful APIs. Reads include a freshness parameter (a bound on staleness).
Link failure mitigation; DeviceFirmwareVersion.
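The effect of a freshness parameter on reads can be illustrated with a small cache. This sketch assumes the parameter is a maximum age in seconds and that a stale entry triggers a refetch; the class `ObservedCache` and its `fetch` callback are hypothetical and do not reproduce the actual REST API.

```python
import time


class ObservedCache:
    """Hypothetical read path that honors a per-read freshness bound."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch    # callable(variable) -> value from the backend
        self._clock = clock    # injectable clock, handy for testing
        self._entries = {}     # variable -> (value, time it was fetched)

    def read(self, variable, freshness):
        """Return a value at most `freshness` seconds old.

        A cached value is served if it is fresh enough; otherwise the
        backend is queried and the cache is updated.
        """
        now = self._clock()
        entry = self._entries.get(variable)
        if entry is not None and now - entry[1] <= freshness:
            return entry[0]
        value = self._fetch(variable)
        self._entries[variable] = (value, now)
        return value
```

An application that tolerates old data passes a large bound and is served from the cache; one that needs current data passes a small bound and pays for a backend read.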
Application experiences:
1. Switch upgrade: DeviceFirmwareVersion.
2. Failure mitigation: Frame-Check-Sequence (FCS) error rates; LinkAdminPower to shut down the link and generate a repair ticket.
3. Inter-DC TE: bandwidth demands from the bandwidth broker; tunnel status and flow-matching rules.
99% of the ToR pairs in the DC should have at least 50% of their baseline capacity.
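The capacity invariant stated above reduces to a simple check once per-pair capacities are known. The sketch below assumes those capacities are already computed and passed in as dictionaries keyed by ToR pair; the function name and parameters are hypothetical.

```python
def capacity_invariant_holds(baseline, current, min_ratio=0.5, min_fraction=0.99):
    """Check that enough ToR pairs retain enough of their baseline capacity.

    `baseline` and `current` map a (tor_a, tor_b) pair to a capacity value.
    A pair missing from `current` is treated as having zero capacity.
    """
    if not baseline:
        return True  # nothing to violate
    ok_pairs = sum(
        1 for pair, base in baseline.items()
        if current.get(pair, 0.0) >= min_ratio * base
    )
    return ok_pairs / len(baseline) >= min_fraction
```

A checker could evaluate this against the state that would result from a proposal and reject the proposal if the function returns False.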
Conflict resolution in Statesman
Handling Operational Failures
Switch-upgrade application on 250 switches.
A. Straggling switch: takes 4 hours to upgrade; cannot download the new firmware image.
B. Unstable switches.
C. Failure case (human intervention).
System Performance
Latency
Checker performance
Read-write performance
Related Work
Most previous work enables centralized control of traffic flow by directly manipulating the forwarding state of switches.
Similar to Statesman, Onix and Hercules provide a shared network-state platform for all applications, but they are not designed to resolve conflicts.
Pyretic, PANE, and Maple are recent proposals that deal with multiple applications but focus only on traffic management.
Corybantic uses explicit resolution, in which applications evaluate each other's proposals, which adds complexity.
Other approaches partition the network into multiple isolated virtual slices.
Thanks! Questions?