Two Years Working With Big Data

Uploaded On 2018-12-04



Presentation Transcript

Slide1

Two Years Working With Big Data

Samarth Shah

Senior Software Engineer

Slide2

Introduction

10 years at Microsoft

5-6 years of embedded, client-side code

2-3 years of web services

9-month Data Science and Machine Learning certification at the University of Washington in Seattle

Schemas, data preparation & cleaning in R and Python

Machine learning: Classification and clustering, SVMs and random forests

A bit of cloud computing and Hadoop

Hypothesis testing

Bottom line: pure Data Science and Machine learning is too much math!

Last 2 years: “Data engineering” in Windows Platform Health

Windows power, performance, reliability, memory

Windows telemetry

Courtesy xkcd.com

Slide3

What’s telemetry?

Wikipedia: “Telemetry is an automated communications process by which measurements and other data are collected at remote or inaccessible points and transmitted to receiving equipment for monitoring.

The word is derived from Greek roots:

tele = remote, and metron = measure.”

19th century origins

In 2010:

“Number of browser favorites this user has”: 100s - 1000s of metrics

Upload once a month!

Powerful SQL database server machines

Storage and retention issues

Cumbersome data access, analysis

Sampling

Fast-forward to 2015:

Device on/off, screen on/off, charging/not, networked/not, program started/ended/crashed, settings changed, battery drained 3%, memory threshold violation, app launch latency violation …: millions of metrics

Upload once a day if user opted in, if charging, and if network not metered

Cloud computing revolution: massive cloud storage, cluster computing

Still sampling: only a random 5% - 10% of Windows 10 devices in the sample… merely tens of millions

17+ petabytes (17 * 10^15 bytes) of Windows 10 top-level telemetry currently in cloud storage

And beyond:

Sensors, Internet of things

A whole new dimension of software testing?

Slide4

What’s data engineering?

Data pipeline

Telemetry

Reports

Machine learning

Analytics and diagnostics

Slide5

My customers

Management – State of the Union

“What’s the median battery life of a Windows tablet? Better or worse than our last release?”

Data Scientists – Patterns

Hypothesis: 75% of users charge their laptops continuously between midnight and 6am

K-means clustering of software activity in standby

A/B testing

Software analysts – Invariants and anomalies

“Why is the screen so bright on this device model in low light environments?”

Slide6

Structured vs. unstructured data

Structured:

UniqueDeviceID | Timestamp | EventName | ToBattery | BatteryChargeLevel
1 | 6/30/2017 3.30pm UTC | “PowerSupplyChange” | TRUE | 78

UniqueDeviceID | Timestamp | EventName | Description (Freeze, Hang, Crash) | Screenshot
1 | 6/30/2017 11.10pm UTC | ReliabilityProblem | “Freeze” |

- Problem? I now want to change my BatteryChargeLevel to floating-point and I want to add a new column for “Timezone”

Slide7

Unstructured data

Key-value pairs

Flexibility vs. storage size

‘Data lake’

Accessing it:

Query all PowerSupplyChange data

SELECT CommonData, Data[“ToBattery”], Data[“BatteryChargeLevel”], Data[“Timezone”]
WHERE Data[“EventName”] == “PowerSupplyChange”

CommonData: unique in the entire lake per real-world event

Retention = duration of Windows 10 support
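
As a rough illustration of how such a query could work over key-value rows, here is a minimal Python sketch. It is not the actual query engine: the `ROWS` list is adapted from this slide's sample rows, and `select_events` is a hypothetical helper that groups rows by CommonData and then projects the requested keys.

```python
from collections import defaultdict

# Sample key-value rows adapted from the slide: (CommonData, Key, Value),
# where CommonData is the <DeviceID, Timestamp> pair.
ROWS = [
    ((1, "6/30/2017 3.30pm UTC"), "EventName", "PowerSupplyChange"),
    ((1, "6/30/2017 3.30pm UTC"), "ToBattery", True),
    ((1, "6/30/2017 3.30pm UTC"), "BatteryChargeLevel", 78),
    ((1, "6/30/2017 11.10pm UTC"), "EventName", "ReliabilityProblem"),
    ((1, "6/30/2017 11.10pm UTC"), "Description", "Freeze"),
    ((1, "7/1/2017 4.30am UTC"), "EventName", "PowerSupplyChange"),
    ((1, "7/1/2017 4.30am UTC"), "ToBattery", False),
    ((1, "7/1/2017 4.30am UTC"), "BatteryChargeLevel", 11.36),
    ((1, "7/1/2017 4.30am UTC"), "Timezone", "PST"),
]


def select_events(rows, event_name, keys):
    """Hypothetical helper: group rows by CommonData, keep events whose
    EventName matches, and project the requested keys (None if absent)."""
    events = defaultdict(dict)
    for common_data, key, value in rows:
        events[common_data][key] = value
    return [
        (common_data, *(data.get(k) for k in keys))
        for common_data, data in events.items()
        if data.get("EventName") == event_name
    ]


result = select_events(
    ROWS, "PowerSupplyChange", ["ToBattery", "BatteryChargeLevel", "Timezone"]
)
for row in result:
    print(row)
```

The older event simply has no Timezone value, and the newer one stores a floating-point charge level, without any schema migration. That is the flexibility side of the trade-off; the cost is repeating CommonData on every row.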

CommonData <DeviceID, Timestamp> | Key | Value
<1,6/30/2017 3.30pm UTC> | EventName | PowerSupplyChange
<1,6/30/2017 3.30pm UTC> | ToBattery | TRUE
<1,6/30/2017 3.30pm UTC> | BatteryChargeLevel | 78
<1,6/30/2017 11.10pm UTC> | EventName | ReliabilityProblem
<1,6/30/2017 11.10pm UTC> | Description | Freeze
<1,6/30/2017 11.10pm UTC> | EventName | ReliabilityProblem
<1,6/30/2017 11.10pm UTC> | Screenshot |

CommonData <DeviceID, Timestamp> | Key | Value
<1,7/1/2017 4.30am UTC> | EventName | PowerSupplyChange
<1,7/1/2017 4.30am UTC> | ToBattery | FALSE
<1,7/1/2017 4.30am UTC> | BatteryChargeLevel | 11.36
<1,7/1/2017 4.30am UTC> | Timezone | PST
…

Slide8

Data pipeline Rube Goldberg machine

Grab

Daily or hourly workflows extracting unstructured data, converting to structured data, basic syntactic sanity checking, dealing with multiple schemas

Transform or “cook”

1. Triggered by grab completion, or schedule, or availability of structured data

2. Cleaning (semantic checks), removing outliers, JOINing, slicing to share with different clusters (e.g. different OEMs), different clouds (e.g. PowerBI cloud, Splunk), aggregating

Index subsets for interactive queries and reports

Compute:

Custom programming language which blends SQL and C#

Workflow manager: jobs, scheduling, tokens

Language runtime uses MapReduce

TBs, 180 day retention

GBs-TBs, 30 day retention

Data on demand: The feedback loop

After 1st level analysis, finding a suspicious threshold or scenario (“too many crashes rendering a specific website” or “takes excessively long to open Outlook, but not other apps”) and instructing the device to send detailed telemetry only in that scenario when it occurs next

Slide9

Data pipeline

Stage | Retention | Size
Unstructured top-level telemetry | Windows 10 support | PBs
Structured, cooked telemetry | 180 days | 100s of TBs
Indexed data | 30 days | 10s of TBs

Slide10

The holy grail

You have one metric X you’re tracking

It depends on N different parameters (“features” in Data Science), each of which can have m different values

m^N different slices or states

X = F(a1, a2, a3, …, aN)
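
To get a feel for how quickly m^N grows, here is a small Python sketch. The parameter names in `features` are hypothetical and chosen only for illustration; `slice_count` just evaluates m^N.

```python
from itertools import product


def slice_count(m, n):
    """Number of distinct slices when each of n parameters takes m values."""
    return m ** n


# Hypothetical on/off parameters, chosen only for illustration.
features = {
    "screen": ["on", "off"],
    "charging": ["yes", "no"],
    "battery_saver": ["on", "off"],
}

# Enumerating every slice is feasible only while N stays tiny.
slices = list(product(*features.values()))
print(len(slices), slice_count(2, len(features)))

# The count explodes as parameters are added (here m = 5 values each).
for n in (5, 10, 20):
    print(n, slice_count(5, n))
```

With only 10 parameters of 5 values each there are already close to ten million slices, which is why precomputing every combination is rarely an option.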

Problems:

Too many input signals for real-time slicing

Combinatorics – JOINs galore for post-slicing

Extensibility

Does screen brightness, Bluetooth, microphone on/off, speaker volume impact the drain?

What about game vs. browsing vs. email vs. video chat?

Settings: battery saver on vs. off?

Data volume – aggregation vs. point data

Android battery history details app

Slide11

Reports and visualization

Numerous simple charts

Think airplane cockpit or car dashboard

Detailed – prefer to have actual numbers and metrics on it!

“Look at how big this bar is compared to the other one” is not good!

Leave less to interpretation

On-demand

Less pre-publishing the better

What’s wrong with the visualization on the right?

Is the dependent axis truly dependent on the independent axis?

How do I tell the exact number of Involuntary denials of boarding per 100,000?

How do I tell the exact number of boarders?

Slide12

Reports and visualization contd.

Is this clearer?

Just 3 days later…

Caveat: numbers are inferred; I could not actually tell the accurate numbers from the previous visualization

Slide13

More visualization dos and don’ts

Keep mobile screens and dearth of real estate in mind

Interactivity

Use hover, sort, zoom

Make individual chart elements links to more detailed charts

Images from firstpost.com

Image from Wikipedia

Slide14

Visualization take away

Visualization in tech is quite different from visualization in other domains

Making it eye-popping and colorful is extra credit

Minimal loss of precision/detail

High density

Windows Platform Health management looks every day or two at a single page of charts that includes separate tabs for:

App launch perf * # of device models * Charging/not * # of apps

Boot perf * # of device models * Charging/not

BatteryLife * Screen On/Off * # of device models

Crashes * Screen On/Off * Charging/not * # of apps

Memory consumed * Screen On/Off * Charging/not * # of scenarios

“Release Quality View” (RQV): 30-40 individual charts

Decides if this version is worth shipping or far from it

Image courtesy Wikimedia

Slide15

Slide16

Big data analytics’ dirty secret: smaller data!

Subset of data indexed for speed

Real-time query results

On-demand visualization

Windows “insiders” (https://insider.windows.com)

Tens of thousands, instead of tens of millions

Still terabytes of data!

Not always representative of the entire population

Be very aware of what’s not in your indexed subset

Splunk (https://www.splunk.com)

Very powerful: quick investigations, quick ballpark answers, driving priorities, finding a specimen of a problem nearby

Slide17

Machine learning

K-means clustering of activity in standby using Weka (http://www.cs.waikato.ac.nz/ml/weka)

One standby session is one row in the data set

Fraction of time N programs were active during session is in the columns

A distinct knee in the curve, meaning there is a very definite set of patterns

If the number of clusters increases, we have a new pattern of standby activity

Cluster size indicates the prevalence of pattern

Slide18
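
The standby clustering described on the Machine learning slide used Weka; the sketch below is a plain-Python stand-in rather than that setup. It assumes synthetic data (three made-up activity patterns, one row per standby session, one column per hypothetical program) and a simple farthest-point initialization, and it prints the within-cluster sum of squares for each k so the knee can be seen.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(rows, k):
    """Deterministic init: start at rows[0], then repeatedly add the row
    farthest from its nearest already-chosen centroid."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def assign(rows, centroids):
    """Label each row with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: sq_dist(row, centroids[c]))
        for row in rows
    ]


def kmeans(rows, k, iterations=20):
    """Lloyd's algorithm; returns (labels, within-cluster sum of squares)."""
    centroids = farthest_point_init(rows, k)
    for _ in range(iterations):
        labels = assign(rows, centroids)
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = assign(rows, centroids)
    inertia = sum(sq_dist(row, centroids[lab]) for row, lab in zip(rows, labels))
    return labels, inertia


# Synthetic sessions (an assumption, not real telemetry): each row holds the
# fraction of the session during which each of 3 programs was active.
rng = random.Random(42)
PATTERNS = [(0.9, 0.05, 0.05), (0.05, 0.9, 0.05), (0.05, 0.05, 0.9)]
sessions = [
    tuple(p + rng.uniform(-0.03, 0.03) for p in pattern)
    for pattern in PATTERNS
    for _ in range(30)
]

inertia_by_k = {}
for k in range(1, 7):
    _, inertia_by_k[k] = kmeans(sessions, k)
    print(k, round(inertia_by_k[k], 3))

labels3, _ = kmeans(sessions, 3)
cluster_sizes = Counter(labels3)
print(cluster_sizes)
```

On this synthetic data the sum of squares drops sharply up to k = 3 and then flattens, which is the knee; the cluster sizes at that k give the prevalence of each pattern.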

Waxing philosophical…