SFQM and Doctor: Keeping My (Telco) Cloud Afloat

Uploaded On 2020-10-22



Presentation Transcript

Slide1

SFQM and Doctor

Keeping My (Telco) Cloud Afloat

Emma Foley, Intel

Maryam Tahhan, Intel

Carlos Gonçalves, NEC

Ryota Mibu, NEC

Slide2

Outline

Introduction

The project formerly known as SFQM

Doctor

Demo

Summary

Slide3

“Data Centres are powering our everyday lives. Organizations lose an average of $138,000 for one hour of downtime.” [1].

Telcos and enterprises alike are asking how to obtain and provide Service Assurance and QoS, and how to offer SLAs on the platform and services when deploying NFV.

It is vital to monitor systems for malfunctions or misbehaviours that could lead to service disruption and promptly react to these faults/events to minimize service disruption/downtime.

Slide4

Barometer Overview

The ability to monitor the Network Function Virtualization Infrastructure (NFVI) where VNFs are in operation will be a key part of Service Assurance within an NFV environment, in order to:

1. Enforce SLAs

2. Detect violations and faults

3. Detect degradation in the performance of NFVI resources so that events and relevant metrics are reported to higher level fault management systems.

The output of the project will provide interfaces to support monitoring of the NFVI. 

Slide5

Collecting Statistics and Events with collectd

[Diagram: collectd with the Barometer (formerly SFQM) plug-ins. Plugins shown: DPDK, OVS, RDT, RAS, BIOS, Legacy, SNMP and Ceilometer, grouped under Input and Output. Platform/application features shown: DPDK extended stats, Open vSwitch, RDT, RAS, BIOS, Legacy. Stats and events flow to the consumer: OpenStack / fault management application / MANO.]

Slide6

Collectd + DPDK Statistics

dpdkstat

collectd plugin: merged!

DPDK secondary process

Monitor the DPDK primary application

Read extended NIC statistics

Publish statistics to collectd

dpdkevents

Will be upstreamed soon

DPDK secondary process

Monitor DPDK primary application liveness and link status

Publish liveness and link status notifications when a failure occurs or all the time.
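The two publishing modes named above (only when a failure occurs, or all the time) can be sketched in a few lines. This is a minimal illustration and not the dpdkevents plugin itself: `LinkStatusMonitor`, `Notification`, the severity labels, and the rule that a "failure" is a transition from up to down are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Notification:
    port: str
    severity: str  # "OKAY" or "FAILURE" (labels chosen for this sketch)
    message: str


class LinkStatusMonitor:
    """Hypothetical poller; it does not use the real dpdkevents plugin API."""

    def __init__(self, notify_always: bool = False) -> None:
        self.notify_always = notify_always
        self._last_up: Dict[str, bool] = {}

    def poll(self, link_up: Dict[str, bool]) -> List[Notification]:
        """Return the notifications to publish for one polling interval."""
        notifications: List[Notification] = []
        for port, up in sorted(link_up.items()):
            # Assumption: a port not seen before is treated as previously up.
            previously_up = self._last_up.get(port, True)
            self._last_up[port] = up
            newly_failed = previously_up and not up
            if self.notify_always or newly_failed:
                severity = "OKAY" if up else "FAILURE"
                state = "up" if up else "down"
                notifications.append(Notification(port, severity, f"link {state}"))
        return notifications
```

In the failure-only mode a link that stays down is reported once, which keeps the notification stream quiet; the always-on mode trades that for a heartbeat-style stream that a consumer can use to detect a silent monitor.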

Slide7

Collectd + OVS Statistics and Events

OVS Stats plugin

Features:

Subscribe for DB table events relevant to stats per interface

DPDK agnostic

OVS Events plugin

Features

Connect / Disconnect, Subscribe for DB table events, Custom requests, DB echoes for liveness

Upstreaming at

https://github.com/collectd/collectd/pull/1971

Slide8

Collectd with RAS Events and Stats

Reliability, Availability and Serviceability features

Reporting Machine Check Exceptions (MCEs) from mcelog

Where possible report metrics relevant to MCEs

Machine Check Exceptions: Hardware errors that are corrected get reported by the HW to SW

Slide9

Collectd with RDT Statistics

Resource Director Technology

Per core group:

Last Level Cache (LLC) Occupancy

Local Memory Bandwidth

Remote Memory Bandwidth

Merged to collectd master

Slide10

Collectd + Legacy and BIOS plugins

Legacy:

Leverage the IPMI interface to retrieve platform thermals, voltage info, fan speeds…

BIOS:

Retrieve version, manufacturer, vendor and other information from the SMBIOS table.

Slide11

Collecting OVS Interface Events + Stats with collectd

[Diagram: the OVS stats + events plugins (Barometer, formerly SFQM, plug-ins) sit between OVS and collectd; the collectd Ceilometer plugin sits between collectd and Ceilometer, northbound to MANO/VNFM. RX and TX labels appear at the VM and at OVS. Numbered steps: 1. Read, 2. Get stats, 3. Dispatch Values, 4. Pass Values, 5. Post Values.]
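The five numbered steps in the diagram (Read, Get stats, Dispatch Values, Pass Values, Post Values) can be modelled as a small pipeline. The sketch below is illustrative only: `FakeSwitch`, `Daemon`, `make_stats_reader` and `TelemetrySink` are hypothetical stand-ins for OVS, collectd, the OVS stats plugin and the Ceilometer side, and none of them use the real collectd, OVS or Ceilometer APIs.

```python
from typing import Callable, Dict, List, Tuple

Value = Tuple[str, str, int]  # (interface, counter name, counter value)


class FakeSwitch:
    """Stand-in for OVS: holds per-interface RX/TX counters."""

    def __init__(self, counters: Dict[str, Dict[str, int]]) -> None:
        self._counters = counters

    def get_stats(self) -> Dict[str, Dict[str, int]]:
        return {iface: dict(c) for iface, c in self._counters.items()}


class Daemon:
    """Stand-in for collectd: runs read plugins and feeds write plugins."""

    def __init__(self) -> None:
        self.read_plugins: List[Callable[["Daemon"], None]] = []
        self.write_plugins: List[Callable[[Value], None]] = []

    def dispatch(self, value: Value) -> None:
        # Step 4 (Pass Values): hand each dispatched value to every write plugin.
        for write in self.write_plugins:
            write(value)

    def run_once(self) -> None:
        # Step 1 (Read): call every read plugin once per interval.
        for read in self.read_plugins:
            read(self)


def make_stats_reader(switch: FakeSwitch) -> Callable[[Daemon], None]:
    """Build a read plugin bound to one switch."""

    def read(daemon: Daemon) -> None:
        stats = switch.get_stats()  # Step 2 (Get stats)
        for iface in sorted(stats):
            for counter in ("rx", "tx"):
                # Step 3 (Dispatch Values)
                daemon.dispatch((iface, counter, stats[iface][counter]))

    return read


class TelemetrySink:
    """Stand-in for the telemetry service that receives posted values."""

    def __init__(self) -> None:
        self.posted: List[Value] = []

    def post(self, value: Value) -> None:
        # Step 5 (Post Values)
        self.posted.append(value)
```

Wiring it together means appending `make_stats_reader(switch)` to `daemon.read_plugins` and `sink.post` to `daemon.write_plugins`, then calling `daemon.run_once()`; the daemon never needs to know what either plugin talks to, which is the property the plugin architecture relies on.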

Slide12

Collecting OVS Interface Events + Stats with collectd

[Diagram: same flow as the previous slide, now with OVS with DPDK and a DPDK application. The OVS stats + events plugins sit between OVS with DPDK and collectd; the collectd Ceilometer plugin sits between collectd and Ceilometer, northbound to MANO/VNFM. RX and TX labels appear at the VM and at OVS with DPDK. Numbered steps: 1. Read, 2. Get stats, 3. Dispatch Values, 4. Pass Values, 5. Post Values.]

Slide13

Status Update

[Diagram: the collectd plugin overview from the earlier slide (DPDK, OVS, RDT, RAS, BIOS, Legacy, SNMP and Ceilometer plugins, grouped under Input and Output, with stats and events flowing to the consumer), with each plugin marked by status. Legend: Under Implementation, Being Upstreamed, Upstreamed. The per-plugin status is not recoverable from the transcript.]

Slide14

Doctor

Project in OPNFV working on building an open-source NFVI fault management and maintenance framework to ensure Telco VNF availability in fault and maintenance events

Identify requirements

Gap analysis

Implementation work in upstream

Integration and testing

Consistent Resource State Awareness

Immediate Notification

Fault Correlation

Extensible Monitoring

Slide15

Status Update II

Taking advantage of the notification plugin architecture in collectd to post an event (like link status failure or application thread failure) directly to the notification bus for immediate alarming in Aodh.

Performance, scalability and aggregation analysis.

Gnocchi integration
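The idea of alarming immediately on a posted event, rather than waiting for a polling cycle, can be sketched as a rule that is checked against each event as it arrives. This is a simplified illustration and not Aodh code: the `alarm_fires` helper, the event layout (`event_type` plus `traits`) and the `link.status` event name are assumptions chosen for the example.

```python
import operator
from typing import Any, Callable, Dict

# Comparison operators supported by this sketch.
OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "eq": operator.eq,
    "ne": operator.ne,
}


def alarm_fires(rule: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Return True if the event matches the rule's type and every query term."""
    if rule["event_type"] != event.get("event_type"):
        return False
    traits = event.get("traits", {})
    return all(
        term["field"] in traits
        and OPS[term["op"]](traits[term["field"]], term["value"])
        for term in rule.get("query", [])
    )


# Hypothetical rule: fire when a link status event reports "down".
link_down_rule = {
    "event_type": "link.status",
    "query": [{"field": "status", "op": "eq", "value": "down"}],
}
```

Because the rule is evaluated per event, the delay between the fault and the alarm is bounded by event delivery rather than by a sampling interval.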

Slide16

Doctor: fault management use case

Slide17

Doctor: mapping to the OpenStack ecosystem

Slide18

Doctor: focus of initial contributions

Consistent Resource State Awareness

Immediate Notification

Slide19

Doctor: focus of initial contributions

Immediate Notification

Consistent Resource State Awareness

Slide20

Doctor: extending contribution focus

Consistent Resource State Awareness

Immediate Notification

Fault Correlation

Extensible Monitoring

Slide21

Doctor Inspector

The module has the ability to...

… receive various failure notifications regarding physical resource(s) from Monitor module(s)

… find the affected virtual resource(s) by querying the resource map in the Controller module

… update the state of the virtual resource (and physical resource)

It has drivers for different types of events and resources

Uses a failure policy database
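The three Inspector abilities listed above (receive a failure notification, look up affected virtual resources, update state) can be sketched as follows. This is a toy model under stated assumptions: the `Inspector` class, the in-memory `resource_map` (which in the slides is queried from the Controller module) and the state strings `"down"` and `"error"` are all hypothetical.

```python
from typing import Dict, List


class Inspector:
    """Toy inspector: maps a failed physical host to its virtual resources."""

    def __init__(self, resource_map: Dict[str, List[str]]) -> None:
        # Assumption: host name -> names of the virtual resources it hosts.
        self.resource_map = resource_map
        self.state: Dict[str, str] = {}

    def handle_failure(self, host: str) -> List[str]:
        """Process one failure notification and return the affected resources."""
        # Find the affected virtual resources via the resource map.
        affected = list(self.resource_map.get(host, []))
        # Update the state of the physical resource...
        self.state[host] = "down"
        # ...and of every virtual resource that depends on it.
        for resource in affected:
            self.state[resource] = "error"
        return affected
```

A host with no entry in the map is still marked down but yields no affected virtual resources, so the caller can tell "nothing to notify" apart from "unknown host" by checking the map itself.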

Slide22

Why a failure policy database?

“Failure” can be subjective. Depends on

Applications (VNFs)

Back-end technologies used in the deployment

Redundancy of the equipment/components

Operator Policy

Regulation

Topologies of Network / Power-supply

So, “failure” has to be dynamically configurable case by case
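One way to picture a dynamically configurable notion of failure is a policy table that is consulted for every raw event. The sketch below is an assumption-laden illustration, not the Doctor implementation: the event names, the `is_failure` helper and the idea of expressing policy as "how many redundant components must remain" are invented for the example.

```python
from typing import Dict

# Hypothetical policy: event type -> redundant components that must remain
# healthy for the event NOT to count as a failure.
policy: Dict[str, int] = {
    "link.down": 1,
    "psu.down": 1,
}


def is_failure(policy: Dict[str, int], event_type: str, remaining: int) -> bool:
    """Decide whether an event is a failure under the current policy."""
    required = policy.get(event_type)
    if required is None:
        # Assumption: event types without a policy entry are treated as failures.
        return True
    return remaining < required
```

With the table above, losing one link while another remains is tolerated (`is_failure(policy, "link.down", 1)` is false); an operator who tightens the entry to `policy["link.down"] = 2` turns the same event into a failure without touching the monitoring code.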

Slide23

Doctor Inspector: OpenStack Congress

Governance as a Service

Define and enforce policy for Cloud Services

Dynamic data collection from OpenStack services

Flexible policy definition for correlation (Datalog)

Well integrated with other OpenStack projects

Policy example

host_down(host) :-
    doctor:events(hostname=host, type="compute.host.down", status="down")

execute[nova:services.force_down(host, "nova-compute", "True")] :-
    host_down(host)
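For readers less familiar with Datalog, the logic of the two rules can be restated in Python. This is only a paraphrase of the rule semantics: it does not call Congress or Nova, and the event dictionaries and the tuple used to represent the action are assumptions made for the illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

Action = Tuple[str, str, str, str]


def host_down(events: Iterable[Dict[str, str]]) -> Set[str]:
    """First rule: a host is down if a matching doctor event exists for it."""
    return {
        event["hostname"]
        for event in events
        if event.get("type") == "compute.host.down"
        and event.get("status") == "down"
    }


def actions_to_execute(events: Iterable[Dict[str, str]]) -> List[Action]:
    """Second rule: request force_down for every host derived as down."""
    return [
        ("nova:services.force_down", host, "nova-compute", "True")
        for host in sorted(host_down(events))
    ]
```

The set in `host_down` mirrors the fact that a Datalog relation has no duplicates: several events for the same host still yield a single `force_down` request.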

Slide24

Congress PushType DataSource Driver

Slide25

Congress Doctor Driver

Slide26

Doctor blueprints in OpenStack

Project | Blueprint | Spec Drafter | Developer | Status

Aodh | Event Alarm Evaluator | Ryota Mibu (NEC) | Ryota Mibu (NEC) | Completed (Liberty)

Nova | New nova API call to mark nova-compute down | Tomi Juvonen (Nokia) | Roman Dobosz (Intel) | Completed (Liberty)

Nova | Support forcing service down | Tomi Juvonen (Nokia) | Carlos Goncalves (NEC) | Completed (Liberty)

Nova | Get valid server state | Tomi Juvonen (Nokia) | Tomi Juvonen (Nokia) | Completed (Mitaka)

Nova | Add notification for service status change | Balazs Gibizer (Ericsson) | Balazs Gibizer (Ericsson) | Completed (Mitaka)

Nova | Maintenance Reason to Server | Tomi Juvonen (Nokia) | Tomi Juvonen (Nokia) | WIP (Ocata)

Congress | Push Type Datasource Driver | Masahito Muroi (NTT) | Masahito Muroi (NTT) | Completed (Mitaka)

Congress | Adds Doctor Driver | Masahito Muroi (NTT) | Masahito Muroi (NTT) | Completed (Mitaka)

Neutron | Port data plane status | Carlos Goncalves (NEC) | Carlos Goncalves (NEC) | WIP (Ocata)

Slide27

SFQM + Doctor

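[Diagram: the collectd dpdkstat plugin between OVS with DPDK and collectd, and the collectd Ceilometer plugin between collectd and Ceilometer. RX and TX labels appear at the VM and at OVS with DPDK. Numbered steps: 1. Read, 2. Get stats, 3. Dispatch Values, 4. Pass Values, 5. Post Values.]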

Slide28

Demo

Slide29

Summary

“Trying to manage a complex cloud solution without a proper telemetry infrastructure in place is like trying to walk across a busy highway with blind eyes and deaf ears. You have little to no idea of where the issues can come from, and no chances to take any smart move without getting in trouble.” [2]

Doctor

Painting the pedestrian crossing

Slide30

References

[1] http://www.datacenterknowledge.com/archives/2016/02/11/curb-data-center-downtime-predictive-maintenance/

[2] https://azure.microsoft.com/en-us/blog/cloud-service-fundamentals-telemetry-basics-and-troubleshooting/

Slide31

Slide32

Legal notices and disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.

No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Intel, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation.

Slide33

Backup

Slide34

The Corner Stone

Telemetry is the cornerstone for:

Billing

Benchmarking

Intelligent orchestration

Fault management

Slide35

Use case example

Slide36

Use case Example

The DPDK interface on the compute node is monitored for host link status, and service switches from the active to the standby VNF when the link goes down.

[Diagram labels: Controller, Compute Node 1, Compute Node 2, collectd, OVS with DPDK, OVS, ceilometer, aodh, two VNFs. Legend: VNF = Active, VNF = Standby.]

Slide37

Use case Example

The DPDK interface on the compute node is monitored for host link status, and service switches from the active to the standby VNF when the link goes down.

[Diagram labels: Controller, Compute Node 1, Compute Node 2, collectd, OVS with DPDK, OVS, ceilometer, aodh, two VNFs. Condition shown: localhost-port.0-link_status != 0. Legend: VNF = Active, VNF = Standby.]

Slide38

Use case Example

The DPDK interface on the compute node is monitored for host link status, and service switches from the active to the standby VNF when the link goes down.

[Diagram labels: Controller, Compute Node 1, Compute Node 2, collectd, OVS with DPDK, OVS, ceilometer, aodh, two VNFs. Condition shown: localhost-port.0-link_status != 0. Legend: VNF = Active, VNF = Standby.]

Slide39

Use case Example

The DPDK interface on the compute node is monitored for host link status, and service switches from the active to the standby VNF when the link goes down.

[Diagram labels: Controller, Compute Node 1, Compute Node 2, collectd, OVS with DPDK, OVS, ceilometer, aodh, two VNFs, and an X marker. Condition shown: localhost-port.0-link_status != 0. Legend: VNF = Active, VNF = Standby.]

Slide40

Use case Example

The DPDK interface on the compute node is monitored for host link status, and service switches from the active to the standby VNF when the link goes down.

[Diagram labels: Controller, Compute Node 1, Compute Node 2, collectd, OVS with DPDK, OVS, ceilometer, aodh, two VNFs, and an X marker. Condition shown: localhost-port.0-link_status == 0. Legend: VNF = Active, VNF = Standby.]

Slide41

Use case Example

The DPDK interface on the compute node is monitored for host link status, and service switches from the active to the standby VNF when the link goes down.

[Diagram labels: Controller, Compute Node 1 (out of service), Compute Node 2, collectd, OVS with DPDK, OVS, ceilometer, aodh, two VNFs, and an X marker. Legend: VNF = Active.]
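The sequence across these use case slides (link status non-zero while healthy, then equal to zero, then the standby VNF taking over) can be sketched as a small state change. The `ServicePair` class and the VNF names are hypothetical, and the sketch assumes, as the slide conditions suggest, that a `link_status` value of 0 means the link is down.

```python
from typing import Optional


class ServicePair:
    """Toy active/standby pair driven by a link status metric."""

    def __init__(self, active: str, standby: Optional[str]) -> None:
        self.active = active
        self.standby = standby

    def on_link_status(self, link_status: int) -> bool:
        """Promote the standby VNF when the link is down; return True on failover."""
        # Assumption: 0 means "link down", any other value means "link up".
        if link_status == 0 and self.standby is not None:
            self.active, self.standby = self.standby, None
            return True
        return False
```

After a failover there is no standby left, matching the final slide where only an active VNF remains; a second link-down reading therefore does nothing rather than flapping back.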

Slide42

collectd Ceilometer plugin

Slide43

Doctor blueprints in OpenStack Liberty

Project | Blueprint | Spec Drafter | Developer | Status

Aodh | Event Alarm Evaluator | Ryota Mibu (NEC) | Ryota Mibu (NEC) | Completed (Liberty)

Nova | New nova API call to mark nova-compute down | Tomi Juvonen (Nokia) | Roman Dobosz (Intel) | Completed (Liberty)

Nova | Support forcing service down | Tomi Juvonen (Nokia) | Carlos Goncalves (NEC) | Completed (Liberty)

Slide44

State correction

Slide45

From project creation to Brahmaputra release

Slide46

Immediate event alarming

Slide47

Doctor Inspector

The module has the ability to...

… receive various failure notifications regarding physical resource(s) from Monitor module(s)

… find the affected virtual resource(s) by querying the resource map in the Controller module

… update the state of the virtual resource (and physical resource)

It has drivers for different types of events and resources

Monitor: collectd, Zabbix, …

Resources: servers, networks, storage, ...

Uses a failure policy database

Decide on the failure selection and aggregation from raw events

Configured by the administrator (physical resources) and user (virtual resources)