Failure Prediction Mechanism for Pluggable Optical Interconnect at Facebook Data Centers

Uploaded On 2019-11-06

Presentation Transcript

Failure Prediction Mechanism for Pluggable Optical Interconnect at Facebook Data Centers
Abhijit Chakravarty and Vincent Zeng

Problem Statements
Currently, no method has been developed to avoid optical transceiver failures ahead of time.
Network traffic loss is not predictable.

Real-time performance monitoring mechanism at FB data centers
Monitored parameters: Temperature, Tx Bias, Tx Power, Vcc, Rx Power
Breakdowns: by data center, by supplier, by part number, by switch port, over time
Coverage: all switch platforms (RSW = rack switch, FSW = fabric switch, ESW = edge switch, SSW = spine switch)
[Figure panels: Temperature Readout, Transmitter Power Readout, Tx Bias Current Readout]
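The breakdowns listed on this slide can be illustrated with a small grouping routine. This is only a sketch: the `Reading` record, its field names, and the `mean_by` helper are hypothetical and are not taken from the presentation, which does not describe the monitoring system's data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Reading:
    """One monitoring sample for one transceiver (hypothetical field names)."""
    data_center: str
    supplier: str
    part_number: str
    switch_port: str
    timestamp: int        # seconds since epoch
    temperature_c: float
    tx_bias_ma: float
    tx_power_dbm: float
    vcc_v: float
    rx_power_dbm: float


def mean_by(readings, dimension, metric):
    """Average one metric (e.g. 'tx_bias_ma') per value of one dimension (e.g. 'supplier')."""
    groups = defaultdict(list)
    for r in readings:
        groups[getattr(r, dimension)].append(getattr(r, metric))
    return {key: mean(values) for key, values in groups.items()}
```

Calling `mean_by(readings, "supplier", "tx_bias_ma")` would give the average bias current per supplier, and swapping the dimension for `"data_center"`, `"part_number"`, or `"switch_port"` gives the other views named on the slide.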

Real-time monitoring mechanism implementation at our DCs
The monitoring shows the relationship between Tx/Rx power, bias current, and temperature.
As a transceiver degrades or begins to fail, the bias current gradually increases to keep Tx power steady, until it reaches a plateau at 65 mA (the exact value depends on the supplier).
Beyond the plateau, recovery is impossible and the affected transmitter will likely fail in a few [...].
The case temperature also correlates positively with transmitter bias current and negatively with transmitter optical power.
This correlation can help us better predict failures and prevent a link failure in a data center before it actually occurs.
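The two signals described on this slide, the temperature correlations and the bias current climbing toward a plateau, can be sketched as follows. The 65 mA plateau comes from the slide and is supplier dependent. Everything else is an assumption made for illustration: the function names, the `rise_ma` threshold, and the three state labels do not appear in the presentation.

```python
from math import sqrt

PLATEAU_MA = 65.0  # plateau quoted on the slide; depends on the supplier


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 samples")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)


def bias_state(bias_ma, plateau_ma=PLATEAU_MA, rise_ma=5.0):
    """Classify a time-ordered bias-current series (thresholds are assumptions).

    'plateau': the latest sample is at or above the plateau level
    'rising' : bias climbed by at least rise_ma since the first sample
    'stable' : neither of the above
    """
    if not bias_ma:
        raise ValueError("empty series")
    if bias_ma[-1] >= plateau_ma:
        return "plateau"
    if bias_ma[-1] - bias_ma[0] >= rise_ma:
        return "rising"
    return "stable"
```

For a part that behaves as the slide describes, `pearson(case_temperature, tx_bias)` would come out positive and `pearson(case_temperature, tx_power)` negative. A series such as `[40.0, 48.0, 57.0, 65.0]` is classified as `"plateau"`, which per the slide is the point beyond which recovery is not expected.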

Failure Mode Observations
About 20 units were pulled.
All of them failed within two weeks under stress testing.

Some Basics of Laser Diode Failures
Symptoms: power reduction; wavelength shift; spectral linewidth widening; modulation speed change; sudden loss of lasing
Mechanisms: defect/dislocation propagation; metal diffusion/migration; defect propagation; grating area disorder/precipitation/facet melting; defect propagation/growth; bonding part/alloy reaction/thermal fatigue

Proposed Algorithm
Adjust the sensitivity of the top power monitoring (TPM) device.
The algorithm needs to detect saturation of the Tx power output.
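The slide does not define saturation or give the algorithm, so the following is only one possible reading, offered as a sketch. It assumes saturation means the bias current has stopped rising while Tx power has fallen below its early-life baseline by more than an adjustable sensitivity. The function name, all parameters, and their default values are hypothetical.

```python
def find_saturation(bias_ma, tx_power_dbm, baseline_samples=3,
                    sensitivity_db=0.5, flat_bias_ma=0.5):
    """Return the index of the first sample that looks saturated, or None.

    Assumed definition (not from the slides): bias current changed by less
    than flat_bias_ma since the previous sample, while Tx power sits more
    than sensitivity_db below the baseline. The baseline is the plain mean
    of the first baseline_samples dBm readings, which is adequate only
    while the early signal is steady.
    """
    if len(bias_ma) != len(tx_power_dbm):
        raise ValueError("series must have equal length")
    if baseline_samples < 1:
        raise ValueError("baseline_samples must be at least 1")
    if len(bias_ma) <= baseline_samples:
        return None  # not enough data beyond the baseline window
    baseline = sum(tx_power_dbm[:baseline_samples]) / baseline_samples
    for i in range(baseline_samples, len(bias_ma)):
        bias_flat = abs(bias_ma[i] - bias_ma[i - 1]) < flat_bias_ma
        power_low = baseline - tx_power_dbm[i] > sensitivity_db
        if bias_flat and power_low:
            return i
    return None
```

Lowering `sensitivity_db` flags a part earlier at the cost of more false alarms, which is one way to interpret the slide's "adjustment of the sensitivity". A suitable value would have to come from field data.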

Conclusions
We investigated the correlation among laser diode bias current, transmitter power degradation, and environmental changes.
We identified signatures of laser diode degradation.
We are developing a mechanism to predict the failure modes.