Slide1
SFQM and Doctor
Keeping My (Telco) Cloud Afloat
Emma Foley, Intel
Maryam Tahhan, Intel
Carlos Gonçalves, NEC
Ryota Mibu, NEC
Slide2
Outline
Introduction
The project formerly known as SFQM
Doctor
Demo
Summary
Slide3
“Data Centres are powering our everyday lives. Organizations lose an average of $138,000 for one hour of downtime.” [1]
Telco and enterprise operators alike are asking how they can obtain and provide Service Assurance and QoS, and offer SLAs on the platform and services, when deploying NFV.
It is vital to monitor systems for malfunctions or misbehaviours that could lead to service disruption and promptly react to these faults/events to minimize service disruption/downtime.
Slide4
Barometer Overview
The ability to monitor the Network Function Virtualization Infrastructure (NFVI) where VNFs are in operation will be a key part of Service Assurance within an NFV environment, in order to:
1. Enforce SLAs
2. Detect violations and faults
3. Detect degradation in the performance of NFVI resources
so that events and relevant metrics are reported to higher-level fault management systems.
The output of the project will provide interfaces to support monitoring of the NFVI.
Slide5
Collecting Statistics and Events with collectd
[Diagram: collectd at the centre. Barometer (formerly SFQM) input plugins (DPDK, OVS, RDT, RAS, BIOS and Legacy plugins) collect stats and events from platform and application features (DPDK extended stats, Open vSwitch, RDT, RAS, BIOS, Legacy). Output plugins (Ceilometer plugin, SNMP plugin) pass them to consumers: OpenStack, MANO and fault management applications.]
Slide6
Collectd + DPDK Statistics
dpdkstat
collectd plugin: merged!
DPDK secondary process
Monitor DPDK primary application
Read extended NIC statistics
Publish statistics to collectd
dpdkevents
Will be upstreamed soon
DPDK secondary process
Monitor DPDK primary application liveliness and link status
Publish liveliness and link status notifications when a failure occurs or all the time.
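The two publishing modes mentioned above (only when a failure occurs, or all the time) can be sketched as follows. This is a minimal illustration in plain Python, not the dpdkevents plugin itself; `LinkStatusNotifier`, `Notification` and the `notify_always` flag are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Notification:
    port: str
    link_up: bool


class LinkStatusNotifier:
    """Turns polled link states into notifications (hypothetical helper)."""

    def __init__(self, notify_always: bool = False) -> None:
        self.notify_always = notify_always
        self._last: Dict[str, Optional[bool]] = {}

    def poll(self, states: Dict[str, bool]) -> List[Notification]:
        notifications = []
        for port, link_up in sorted(states.items()):
            previous = self._last.get(port)
            self._last[port] = link_up
            # Failure-only mode: report when a link is seen down and was not
            # already known to be down at the previous poll.
            went_down = (not link_up) and previous is not False
            if self.notify_always or went_down:
                notifications.append(Notification(port, link_up))
        return notifications
```

In failure-only mode a port that stays down is reported once; with `notify_always=True` every poll produces a notification for every port.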
Slide7
Collectd + OVS Statistics and Events
OVS Stats plugin
Features:
Subscribe for DB table events relevant to stats per interface
DPDK agnostic
OVS Events plugin
Features:
Connect / disconnect
Subscribe for DB table events
Custom requests
DB echoes for liveliness
Upstreaming at https://github.com/collectd/collectd/pull/1971
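As a rough illustration of "subscribe for DB table events" and "DB echoes", the sketch below builds the two JSON-RPC messages defined by the OVSDB management protocol (RFC 7047). It assumes the plugins talk to the `Open_vSwitch` database and watch the `Interface` table; the monitor identifier and the chosen columns are assumptions made for this example, not taken from the plugin source.

```python
import json
from typing import List


def monitor_request(request_id: int, table: str, columns: List[str]) -> str:
    """Build an OVSDB 'monitor' request subscribing to changes in one table."""
    message = {
        "method": "monitor",
        "params": [
            "Open_vSwitch",                  # database name
            "sketch-monitor",                # monitor identifier (hypothetical)
            {table: {"columns": columns}},   # tables and columns to watch
        ],
        "id": request_id,
    }
    return json.dumps(message)


def echo_request(request_id: int) -> str:
    """Build an OVSDB 'echo' request, usable as a liveliness probe."""
    return json.dumps({"method": "echo", "params": [], "id": request_id})


if __name__ == "__main__":
    print(monitor_request(1, "Interface", ["name", "link_state", "statistics"]))
    print(echo_request(2))
```

A real client would send these strings over the OVSDB socket and then handle the `update` notifications the server pushes for the monitored table.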
Slide8
Collectd with RAS Events and Stats
Reliability, Availability and Serviceability features
Reporting Machine Check Exceptions (MCEs) from mcelog
Where possible report metrics relevant to MCEs
Machine Check Exceptions: Hardware errors that are corrected get reported by the HW to SW
Slide9
Collectd with RDT Statistics
Resource Director Technology
Per core group:
Last Level Cache (LLC) Occupancy
Local Memory Bandwidth
Remote Memory Bandwidth
Merged to collectd master
Slide10
Collectd + Legacy and BIOS plugins
Legacy:
Leverage the IPMI interface to retrieve platform thermals, voltage info, fan speeds…
BIOS:
Retrieve version, manufacturer, vendor and other info from the SMBIOS table.
Slide11
Collecting OVS Interface Events + Stats with collectd
[Diagram: example flow for OVS interface stats and events. The Barometer OVS stats and events plugins in collectd (1) read and (2) get stats from OVS, including the RX/TX counters of a VM interface, then (3) dispatch the values; collectd (4) passes the values to the collectd Ceilometer plugin, which (5) posts them to Ceilometer. The consumer/OpenStack layer sits northbound (NB) towards MANO/VNFM.]
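The five steps labelled in the diagram (read, get stats, dispatch values, pass values, post values) can be sketched in plain Python. This is a toy model only: `Daemon` stands in for collectd's dispatch path, `get_ovs_stats` for the query to OVS, and the list `posted` for Ceilometer; all of these names and the sample field names are assumptions for illustration.

```python
from typing import Callable, Dict, List

Sample = Dict[str, object]


def get_ovs_stats() -> Dict[str, Dict[str, int]]:
    """Hypothetical stand-in for querying OVS; returns per-interface counters."""
    return {"vm-port": {"rx_packets": 10, "tx_packets": 7}}


class Daemon:
    """Toy stand-in for collectd's path between input and output plugins."""

    def __init__(self) -> None:
        self._writers: List[Callable[[Sample], None]] = []

    def register_write(self, writer: Callable[[Sample], None]) -> None:
        self._writers.append(writer)

    def dispatch(self, sample: Sample) -> None:
        # Step 4: pass each dispatched value to every registered output plugin.
        for writer in self._writers:
            writer(sample)


def read_ovs(daemon: Daemon, source: Callable[[], Dict[str, Dict[str, int]]] = get_ovs_stats) -> None:
    # Steps 1-2: the read callback runs and fetches the current stats.
    for interface, counters in source().items():
        for name, value in sorted(counters.items()):
            # Step 3: dispatch one value per counter.
            daemon.dispatch({"interface": interface, "metric": name, "value": value})


posted: List[Sample] = []
daemon = Daemon()
# Step 5: the output plugin posts values to the consumer; here it only records them.
daemon.register_write(posted.append)
read_ovs(daemon)
```

Keeping the input side (`read_ovs`) unaware of who consumes the values mirrors the input/output plugin split shown on the slides.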
Slide12
Collecting OVS Interface Events + Stats with collectd
[Diagram: the same example flow with OVS with DPDK and a DPDK application. The OVS stats and events plugins in collectd (1) read and (2) get stats, including the RX/TX counters of a VM interface, then (3) dispatch the values; collectd (4) passes the values to the collectd Ceilometer plugin, which (5) posts them to Ceilometer. The consumer/OpenStack layer sits northbound (NB) towards MANO/VNFM.]
Slide13
Status Update
[Diagram: the collectd plugin architecture (input plugins: DPDK, OVS, RDT, RAS, BIOS, Legacy; output plugins: Ceilometer, SNMP; consumers: OpenStack, MANO, fault management application) with each plugin marked by status. Legend: Under Implementation, Being Upstreamed, Upstreamed. The per-plugin status is not recoverable from the extracted text.]
Slide14
Doctor
Project in OPNFV working on building an open-source NFVI fault management and maintenance framework to ensure the availability of Telco VNFs in fault and maintenance events
Identify requirements
Gap analysis
Implementation work in upstream
Integration and testing
Consistent Resource State Awareness
Immediate Notification
Fault Correlation
Extensible Monitoring
Slide15
Status Update II
Taking advantage of the notification plugin architecture in collectd to post an event (like link status failure or application thread failure) directly to the notification bus for immediate alarming in Aodh.
Performance, scalability and aggregation analysis.
Gnocchi integration
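To make "immediate alarming" on an event more concrete, here is a minimal sketch of an event-type-plus-query rule and how an incoming event is matched against it. The dictionary layout is modelled on the general shape of an Aodh event alarm (`type`, `event_rule`, `query`), but treat every field name, the event type `collectd.link_status` and the action URL as assumptions for illustration rather than the deployed configuration. The matcher is a local toy, not Aodh code.

```python
from typing import Any, Dict


def build_event_alarm(name: str, event_type: str, hostname: str, action_url: str) -> Dict[str, Any]:
    """Describe an alarm that fires when a matching event arrives (assumed layout)."""
    return {
        "name": name,
        "type": "event",
        "event_rule": {
            "event_type": event_type,
            "query": [{"field": "traits.hostname", "op": "eq", "value": hostname}],
        },
        "alarm_actions": [action_url],
    }


def event_matches(alarm: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Toy evaluator: supports only 'eq' conditions on event traits."""
    rule = alarm["event_rule"]
    if event.get("event_type") != rule["event_type"]:
        return False
    for condition in rule["query"]:
        trait = condition["field"].split(".", 1)[1]  # "traits.hostname" -> "hostname"
        if event.get("traits", {}).get(trait) != condition["value"]:
            return False
    return True


alarm = build_event_alarm(
    "link-down-compute-1",           # hypothetical alarm name
    "collectd.link_status",          # hypothetical event type
    "compute-1",
    "http://consumer.example/alarm"  # hypothetical action endpoint
)
```

Because the rule is evaluated per event rather than per polling interval, the alarm can fire as soon as the event reaches the bus.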
Slide16
Doctor: fault management use case
Slide17
Doctor: mapping to the OpenStack ecosystem
Slide18
Doctor: focus of initial contributions
Consistent Resource State Awareness
Immediate Notification
Slide19
Doctor: focus of initial contributions
Immediate Notification
Consistent Resource State Awareness
Slide20
Doctor: extending contribution focus
Consistent Resource State Awareness
Immediate Notification
Fault Correlation
Extensible Monitoring
Slide21
Doctor Inspector
The module has the ability to...
… receive various failure notifications regarding physical resource(s) from Monitor module(s)
… find the affected virtual resource(s) by querying the resource map in the Controller module
… update the state of the virtual resource (and physical resource)
It has drivers for different types of events and resources
Uses a failure policy database
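The three Inspector abilities listed above can be sketched as a small class. This is an illustration under assumptions, not the Doctor implementation: the resource map is a plain dictionary from host to virtual machines, and the state strings `"down"` and `"error"` are hypothetical.

```python
from typing import Dict, List


class Inspector:
    """Toy inspector: maps a physical failure to the affected virtual resources."""

    def __init__(self, resource_map: Dict[str, List[str]]) -> None:
        # Stand-in for the resource map held by the Controller module.
        self.resource_map = resource_map
        self.state: Dict[str, str] = {}

    def handle_failure(self, host: str) -> List[str]:
        # 1. Receive a failure notification about a physical resource.
        self.state[host] = "down"
        # 2. Find the affected virtual resources via the resource map.
        affected = list(self.resource_map.get(host, []))
        # 3. Update the state of each affected virtual resource.
        for vm in affected:
            self.state[vm] = "error"
        return affected


inspector = Inspector({"compute-1": ["vm-a", "vm-b"], "compute-2": ["vm-c"]})
affected = inspector.handle_failure("compute-1")
```

Virtual machines on unaffected hosts (`vm-c` here) keep whatever state they had.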
Slide22
Why a failure policy database?
“Failure” can be subjective. Depends on:
Applications (VNFs)
Back-end technologies used in the deployment
Redundancy of the equipment/components
Operator policy
Regulation
Topologies of network / power supply
So, “failure” has to be dynamically configurable case by case
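One way to picture a dynamically configurable notion of failure is a policy table that can be edited at run time. The sketch below is a deliberately simple assumption: the policy maps an event type to how many redundant units must fail before the event counts as a failure. The event names and thresholds are hypothetical.

```python
from typing import Dict

# Event type -> number of simultaneously failed units that counts as a failure.
FailurePolicy = Dict[str, int]


def is_failure(policy: FailurePolicy, event_type: str, failed_units: int) -> bool:
    """Decide whether an event is a failure under the current policy."""
    threshold = policy.get(event_type)
    if threshold is None:
        return False  # event types without a policy entry are not failures
    return failed_units >= threshold


# A deployment with redundant power supplies tolerates the loss of one unit.
policy: FailurePolicy = {"psu.down": 2, "link.down": 1}
```

Changing `policy["psu.down"]` to `1` makes the same single power-supply event a failure, without touching the code that evaluates it.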
Slide23
Doctor Inspector: OpenStack Congress
Governance as a Service
Define and enforce policy for Cloud Services
Dynamic data collection from OpenStack services
Flexible policy definition for correlation (Datalog)
Well integrated with other OpenStack projects
Policy example:
host_down(host) :-
    doctor:events(hostname=host, type="compute.host.down", status="down")
execute[nova:services.force_down(host, "nova-compute", "True")] :-
    host_down(host)
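For readers unfamiliar with Datalog, the policy above reads roughly as: a host is down if a matching event exists, and for every down host the force-down action is executed. The Python below mirrors that logic over an in-memory event list; it is an explanatory sketch, not how Congress evaluates rules, and the tuple used to represent an action is an assumption.

```python
from typing import Dict, List, Set, Tuple

Event = Dict[str, str]
Action = Tuple[str, str, str, str]


def host_down(events: List[Event]) -> Set[str]:
    """Hosts for which a 'compute.host.down' event with status 'down' exists."""
    return {
        event["hostname"]
        for event in events
        if "hostname" in event
        and event.get("type") == "compute.host.down"
        and event.get("status") == "down"
    }


def actions_to_execute(events: List[Event]) -> List[Action]:
    """One force_down action per down host, mirroring the execute[...] rule."""
    return [
        ("nova:services.force_down", host, "nova-compute", "True")
        for host in sorted(host_down(events))
    ]


events = [
    {"hostname": "compute-1", "type": "compute.host.down", "status": "down"},
    {"hostname": "compute-2", "type": "compute.host.down", "status": "up"},
]
```

With these events only `compute-1` satisfies the rule body, so a single action is derived.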
Slide24
Congress PushType DataSource Driver
Slide25
Congress Doctor Driver
Slide26
Doctor blueprints in OpenStack
Project | Blueprint | Spec Drafter | Developer | Status
Aodh | Event Alarm Evaluator | Ryota Mibu (NEC) | Ryota Mibu (NEC) | Completed (Liberty)
Nova | New nova API call to mark nova-compute down | Tomi Juvonen (Nokia) | Roman Dobosz (Intel) | Completed (Liberty)
Nova | Support forcing service down | Tomi Juvonen (Nokia) | Carlos Goncalves (NEC) | Completed (Liberty)
Nova | Get valid server state | Tomi Juvonen (Nokia) | Tomi Juvonen (Nokia) | Completed (Mitaka)
Nova | Add notification for service status change | Balazs Gibizer (Ericsson) | Balazs Gibizer (Ericsson) | Completed (Mitaka)
Nova | Maintenance Reason to Server | Tomi Juvonen (Nokia) | Tomi Juvonen (Nokia) | WIP (Ocata)
Congress | Push Type Datasource Driver | Masahito Muroi (NTT) | Masahito Muroi (NTT) | Completed (Mitaka)
Congress | Adds Doctor Driver | Masahito Muroi (NTT) | Masahito Muroi (NTT) | Completed (Mitaka)
Neutron | Port data plane status | Carlos Goncalves (NEC) | Carlos Goncalves (NEC) | WIP (Ocata)
Slide27
SFQM + Doctor
[Diagram: the collectd dpdkstat plugin (1) reads and (2) gets stats from OVS with DPDK, including the RX/TX counters of a VM interface, then (3) dispatches the values; collectd (4) passes the values to the collectd Ceilometer plugin, which (5) posts them to Ceilometer.]
Slide28
Demo
Slide29
Summary
“Trying to manage a complex cloud solution without a proper telemetry infrastructure in place is like trying to walk across a busy highway with blind eyes and deaf ears. You have little to no idea of where the issues can come from, and no chances to take any smart move without getting in trouble.” [2]
Doctor
Painting the pedestrian crossing
Slide30
References
[1] http://www.datacenterknowledge.com/archives/2016/02/11/curb-data-center-downtime-predictive-maintenance/
[2] https://azure.microsoft.com/en-us/blog/cloud-service-fundamentals-telemetry-basics-and-troubleshooting/
Slide31
Slide32
Legal notices and disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Intel, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
© 2016 Intel Corporation.
Slide33
Backup
Slide34
The Corner Stone
Telemetry is the cornerstone for:
Billing
Benchmarking
Intelligent orchestration
Fault management
Slide35
Use case example
Slide36
Use case example
Compute-node DPDK interface monitoring: monitor the host link status and switch from the active to the standby service when the link goes down.
[Diagram labels: Controller, Compute Node 1, Compute Node 2, collectd, OVS with DPDK, OVS, ceilometer, aodh, four VNFs. Legend: Active VNF, Standby VNF.]
Slide37
Use case example
Compute-node DPDK interface monitoring: monitor the host link status and switch from the active to the standby service when the link goes down.
[Diagram labels: Controller, Compute Node 1, Compute Node 2, collectd, OVS with DPDK, OVS, ceilometer, aodh, VNFs. Annotation: localhost-port.0-link_status != 0. Legend: Active VNF, Standby VNF.]
Slide38
Use case example
Compute-node DPDK interface monitoring: monitor the host link status and switch from the active to the standby service when the link goes down.
[Diagram labels: Controller, Compute Node 1, Compute Node 2, collectd, OVS with DPDK, OVS, ceilometer, aodh, VNFs. Annotation: localhost-port.0-link_status != 0. Legend: Active VNF, Standby VNF.]
Slide39
Use case example
Compute-node DPDK interface monitoring: monitor the host link status and switch from the active to the standby service when the link goes down.
[Diagram labels: Controller, Compute Node 1, Compute Node 2, collectd, OVS with DPDK, OVS, ceilometer, aodh, VNFs. Annotation: localhost-port.0-link_status != 0, with an X marking a failure. Legend: Active VNF, Standby VNF.]
Slide40
Use case example
Compute-node DPDK interface monitoring: monitor the host link status and switch from the active to the standby service when the link goes down.
[Diagram labels: Controller, Compute Node 1, Compute Node 2, collectd, OVS with DPDK, OVS, ceilometer, aodh, VNFs. Annotation: localhost-port.0-link_status == 0, with an X marking a failure. Legend: Active VNF, Standby VNF.]
Slide41
Use case example
Compute-node DPDK interface monitoring: monitor the host link status and switch from the active to the standby service when the link goes down.
[Diagram labels: Controller, Compute Node 1 (out of service), Compute Node 2, collectd, OVS with DPDK, OVS, ceilometer, aodh, VNFs, with an X marking a failure. Legend: Active VNF.]
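The use case walked through on these slides (link status drops to 0, Compute Node 1 goes out of service, the standby VNF becomes active) can be sketched as a single function. This is a simplified model under assumptions: VNFs are plain dictionaries, the role strings are hypothetical, and in the deck the decision is driven by the alarm raised through ceilometer and aodh rather than by a direct function call.

```python
from typing import Dict

Vnf = Dict[str, str]


def fail_over(vnfs: Dict[str, Vnf], node: str, link_status: int) -> Dict[str, Vnf]:
    """Return the VNF roles after evaluating the link status reported for `node`."""
    if link_status != 0:
        # Link is up: nothing changes.
        return {name: dict(vnf) for name, vnf in vnfs.items()}
    result: Dict[str, Vnf] = {}
    for name, vnf in vnfs.items():
        updated = dict(vnf)
        if vnf["node"] == node:
            # VNFs on the node with the failed link are taken out of service.
            updated["role"] = "out of service"
        elif vnf["role"] == "standby":
            # Standby VNFs on other nodes take over.
            updated["role"] = "active"
        result[name] = updated
    return result


vnfs = {
    "vnf-1": {"node": "compute-1", "role": "active"},
    "vnf-2": {"node": "compute-2", "role": "standby"},
}
after = fail_over(vnfs, "compute-1", link_status=0)
```

The function returns a new mapping and leaves its input untouched, which keeps the before and after states easy to compare.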
Slide42
Collectd ceilometer plugin
Slide43
Doctor blueprints in OpenStack Liberty
Project | Blueprint | Spec Drafter | Developer | Status
Aodh | Event Alarm Evaluator | Ryota Mibu (NEC) | Ryota Mibu (NEC) | Completed (Liberty)
Nova | New nova API call to mark nova-compute down | Tomi Juvonen (Nokia) | Roman Dobosz (Intel) | Completed (Liberty)
Nova | Support forcing service down | Tomi Juvonen (Nokia) | Carlos Goncalves (NEC) | Completed (Liberty)
Slide44
State correction
Slide45
From project creation to Brahmaputra release
Slide46
Immediate event alarming
Slide47
Doctor Inspector
The module has the ability to...
… receive various failure notifications regarding physical resource(s) from Monitor module(s)
… find the affected virtual resource(s) by querying the resource map in the Controller module
… update the state of the virtual resource (and physical resource)
It has drivers for different types of events and resources
Monitor: collectd, Zabbix, …
Resources: servers, networks, storage, ...
Uses a failure policy database
Decide on the failure selection and aggregation from raw events
Configured by the administrator (physical resources) and user (virtual resources)