Slide1
Two Years Working With Big Data
Samarth Shah
Senior Software Engineer
Slide2
Introduction
10 years at Microsoft
5-6 years of embedded, client-side code
2-3 years of web services
9-month Data Science and Machine Learning certification at the University of Washington in Seattle
Schemas, data preparation & cleaning in R and Python
Machine learning: Classification and clustering, SVMs and random forests
A bit of cloud computing and Hadoop
Hypothesis testing
Bottom line: pure Data Science and Machine learning is too much math!
Last 2 years: “Data engineering” in Windows Platform Health
Windows power, performance, reliability, memory
Windows telemetry
Courtesy xkcd.com
Slide3
What’s telemetry?
Wikipedia: “Telemetry is an automated communications process by which measurements and other data are collected at remote or inaccessible points and transmitted to receiving equipment for monitoring. The word is derived from Greek roots: tele = remote, and metron = measure.”
19th century origins
In 2010:
“Number of browser favorites this user has”: 100s - 1000s of metrics
Upload once a month!
Powerful SQL database server machines
Storage and retention issues
Cumbersome data access, analysis
Sampling
Fast-forward to 2015:
Device on/off, screen on/off, charging/not, networked/not, program started/ended/crashed, settings changed, battery drained 3%, memory threshold violation, app launch latency violation …: millions of metrics
Upload once a day if user opted in, if charging, and if network not metered
Cloud computing revolution: massive cloud storage, cluster computing
Still sampling: only a random 5% - 10% of Windows 10 devices in the sample… merely tens of millions
17+ petabytes (17 * 10^15 bytes) of Windows 10 top-level telemetry currently in cloud storage
And beyond:
Sensors, Internet of things
A whole new dimension of software testing?
Slide4
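The 2015 upload conditions and the sampling described on the telemetry slide can be sketched roughly as follows. This is an illustration only, not the Windows implementation: `DeviceState`, `should_upload`, `in_sample`, and the hash-based sampling scheme are assumptions introduced here. The slides state only the three upload conditions and a random 5% - 10% sample.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class DeviceState:
    """Hypothetical snapshot of the three conditions named on the slide."""
    opted_in: bool
    charging: bool
    network_metered: bool


def should_upload(state: DeviceState) -> bool:
    # Upload only if the user opted in, the device is charging,
    # and the network is not metered.
    return state.opted_in and state.charging and not state.network_metered


def in_sample(device_id: str, sample_rate: float = 0.05) -> bool:
    # Assumption: sample membership is derived from a hash of the device ID,
    # so a given device is consistently in or out of the sample.
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64
```

Hashing the ID rather than drawing a fresh random number each day keeps a device's telemetry history contiguous, which matters for per-device analyses.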
What’s data engineering?
Data pipeline
Telemetry
Reports
Machine learning
Analytics and diagnostics
Slide5
My customers
Management – State of the Union
“What’s the median battery life of a Windows tablet? Better or worse than our last release?”
Data Scientists – Patterns
Hypothesis: 75% of users charge their laptops continuously between midnight and 6am
K-means clustering of software activity in standby
A/B testing
Software analysts – Invariants and anomalies
“Why is the screen so bright on this device model in low light environments?”
Slide6
Structured vs. unstructured data
Structured:
UniqueDeviceID | Timestamp | EventName | ToBattery | BatteryChargeLevel
1 | 6/30/2017 3.30pm UTC | “PowerSupplyChange” | TRUE | 78
UniqueDeviceID | Timestamp | EventName | Description (Freeze, Hang, Crash) | Screenshot
1 | 6/30/2017 11.10pm UTC | “ReliabilityProblem” | “Freeze” | (image)
Problem? I now want to change my BatteryChargeLevel to floating-point and I want to add a new column for “Timezone”
Slide7
Unstructured data
Key-value pairs
Flexibility vs. storage size
‘Data lake’
Accessing it: query all PowerSupplyChange data
SELECT CommonData, Data[“ToBattery”], Data[“BatteryChargeLevel”], Data[“Timezone”]
WHERE Data[“EventName”] == “PowerSupplyChange”
CommonData: unique in the entire lake per real-world event
Retention = duration of Windows 10 support
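As a rough illustration of the query above, the sketch below runs the same selection over key-value rows in plain Python. The in-memory `LAKE` list, the `query_event` helper, and the tuple layout are assumptions made for this example; the real data lake is queried with its own language, not Python. The row values are the ones tabulated on this slide.

```python
from collections import defaultdict

# Each row: (CommonData, Key, Value), with CommonData = (DeviceID, Timestamp).
LAKE = [
    ((1, "6/30/2017 3.30pm UTC"), "EventName", "PowerSupplyChange"),
    ((1, "6/30/2017 3.30pm UTC"), "ToBattery", "TRUE"),
    ((1, "6/30/2017 3.30pm UTC"), "BatteryChargeLevel", "78"),
    ((1, "6/30/2017 11.10pm UTC"), "EventName", "ReliabilityProblem"),
    ((1, "6/30/2017 11.10pm UTC"), "Description", "Freeze"),
    ((1, "7/1/2017 4.30am UTC"), "EventName", "PowerSupplyChange"),
    ((1, "7/1/2017 4.30am UTC"), "ToBattery", "FALSE"),
    ((1, "7/1/2017 4.30am UTC"), "BatteryChargeLevel", "11.36"),
    ((1, "7/1/2017 4.30am UTC"), "Timezone", "PST"),
]


def query_event(lake, event_name, keys):
    """Group rows by CommonData, keep events named event_name, project keys."""
    events = defaultdict(dict)
    for common_data, key, value in lake:
        events[common_data][key] = value
    return [
        (common_data, *[data.get(k) for k in keys])
        for common_data, data in events.items()
        if data.get("EventName") == event_name
    ]


rows = query_event(LAKE, "PowerSupplyChange",
                   ["ToBattery", "BatteryChargeLevel", "Timezone"])
```

The older event has no Timezone key, so it comes back as `None` for that column. No schema migration was needed to add Timezone or to store 11.36 as a charge level, which is the flexibility side of the flexibility vs. storage size trade-off.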
CommonData <DeviceID, Timestamp> | Key | Value
<1,6/30/2017 3.30pm UTC> | EventName | PowerSupplyChange
<1,6/30/2017 3.30pm UTC> | ToBattery | TRUE
<1,6/30/2017 3.30pm UTC> | BatteryChargeLevel | 78
<1,6/30/2017 11.10pm UTC> | EventName | ReliabilityProblem
<1,6/30/2017 11.10pm UTC> | Description | Freeze
<1,6/30/2017 11.10pm UTC> | Screenshot | (image)
CommonData <DeviceID, Timestamp> | Key | Value
<1,7/1/2017 4.30am UTC> | EventName | PowerSupplyChange
<1,7/1/2017 4.30am UTC> | ToBattery | FALSE
<1,7/1/2017 4.30am UTC> | BatteryChargeLevel | 11.36
<1,7/1/2017 4.30am UTC> | Timezone | PST
… | … | …
Slide8
Data pipeline Rube Goldberg machine
Grab
Daily or hourly workflows extracting unstructured data, converting to structured data, basic syntactic sanity checking, dealing with multiple schemas
Transform or “cook” (TBs, 180 day retention)
1. Triggered by grab completion, or schedule, or availability of structured data
2. Cleaning (semantic checks), removing outliers, JOINing, slicing to share with different clusters (e.g. different OEMs), different clouds (e.g. PowerBI cloud, Splunk), aggregating
Index subsets for interactive queries and reports (GBs-TBs, 30 day retention)
Compute:
Custom programming language which blends SQL and C#
Workflow manager: jobs, scheduling, tokens
Language runtime uses MapReduce
Data on demand: The feedback loop
After 1st level analysis, finding a suspicious threshold or scenario (“too many crashes rendering a specific website” or “takes excessively long to open Outlook, but not other apps”) and instructing the device to send detailed telemetry only in that scenario when it occurs next
Slide9
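A minimal sketch of one “cook” pass over already-structured rows: a semantic check, outlier removal, and aggregation. The row layout, the `DeviceModel` field, the 0-100 validity range, the median-absolute-deviation outlier rule, and all function names are assumptions chosen for illustration. The actual pipeline runs in the custom SQL/C# language described on the pipeline slide, not in Python.

```python
from statistics import median


def is_semantically_valid(row):
    # Semantic check (assumed rule): a charge level must be a percentage.
    return 0 <= row["BatteryChargeLevel"] <= 100


def remove_outliers(values, threshold=3.5):
    # Assumed rule: drop points far from the median, measured in units of
    # the median absolute deviation (MAD).
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)
    return [v for v in values if abs(v - med) / mad <= threshold]


def cook(rows):
    """Clean structured rows and aggregate to a median per device model."""
    by_model = {}
    for row in rows:
        if is_semantically_valid(row):
            by_model.setdefault(row["DeviceModel"], []).append(
                row["BatteryChargeLevel"])
    return {model: median(remove_outliers(levels))
            for model, levels in by_model.items()}
```

Grouping by a key and reducing each group independently is the shape that maps naturally onto a MapReduce runtime.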
Data pipeline
Data | Retention | Size
Unstructured top-level telemetry | Windows 10 support | PBs
Structured, cooked telemetry | 180 days | 100s of TBs
Indexed data | 30 days | 10s of TBs
Slide10
The holy grail
You have one metric X you’re tracking
It depends on N different parameters (“features” in Data Science), each of which can have m different values
m^N different slices or states
X = F(a1, a2, a3, …, aN)
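To make the growth concrete, the sketch below enumerates the slices for a toy feature set. The feature names and values are hypothetical, picked from the kinds of parameters mentioned in these slides; with N features of m values each, the count is m^N.

```python
from itertools import product

# Hypothetical features, each with m = 2 values (so N = 4 and m^N = 16).
FEATURES = {
    "screen": ["on", "off"],
    "charging": ["yes", "no"],
    "network": ["connected", "disconnected"],
    "battery_saver": ["on", "off"],
}


def all_slices(features):
    """Return every combination of feature values as a dict."""
    names = list(features)
    return [dict(zip(names, combo)) for combo in product(*features.values())]


def slice_count(m, n):
    # m values per feature, n features -> m^n distinct slices.
    return m ** n


slices = all_slices(FEATURES)
```

Four binary features give 16 slices. Twenty features with ten values each give 10^20, far more slices than there are devices to fill them.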
Problems:
Too many input signals for real-time slicing
Combinatorics – JOINs galore for post-slicing
Extensibility
Do screen brightness, Bluetooth, microphone on/off, speaker volume impact the drain?
What about game vs. browsing vs. email vs. video chat?
Settings: battery saver on vs. off?
Data volume – aggregation vs. point data
Android battery history details app
Slide11
Reports and visualization
Numerous simple charts
Think airplane cockpit or car dashboard
Detailed – prefer to have actual numbers and metrics on it!
“Look at how big this bar is compared to the other one” is not good!
Leave less to interpretation
On-demand
Less pre-publishing the better
What’s wrong with the visualization on the right?
Is the dependent axis truly dependent on the independent axis?
How do I tell the exact number of Involuntary denials of boarding per 100,000?
How do I tell the exact number of boarders?
Slide12
Reports and visualization contd.
Is this clearer?
Just 3 days later…
Caveat: numbers are inferred; I could not actually tell the accurate numbers from the previous visualization
Slide13
More visualization dos and don’ts
Keep mobile screens and dearth of real estate in mind
Interactivity
Use hover, sort, zoom
Make individual chart elements links to more detailed charts
Images from firstpost.com
Image from Wikipedia
Slide14
Visualization take away
Visualization in tech is quite different from visualization in other domains
Making it eye-popping and colorful is extra credit
Minimal loss of precision/detail
High density
Windows Platform Health management looks every day or two at a single page of charts that includes separate tabs for:
App launch perf * # of device models * Charging/not * # of apps
Boot perf * # of device models * Charging/not
BatteryLife * Screen On/Off * # of device models
Crashes * Screen On/Off * Charging/not * # of apps
Memory consumed * Screen On/Off * Charging/not * # of scenarios
“Release Quality View” (RQV): 30-40 individual charts
Decides if this version is worth shipping or far from it
Image courtesy Wikimedia
Slide15
Slide16
Big data analytics’ dirty secret: smaller data!
Subset of data indexed for speed
Real-time query results
On-demand visualization
Windows “insiders” (https://insider.windows.com)
Tens of thousands, instead of tens of millions
Still terabytes of data!
Not always representative of the entire population
Be very aware of what’s not in your indexed subset
Splunk (https://www.splunk.com)
Very powerful: quick investigations, quick ballpark answers, driving priorities, finding a specimen of a problem nearby
Slide17
Machine learning
K-means clustering of activity in standby using Weka (http://www.cs.waikato.ac.nz/ml/weka)
One standby session is one row in the data set
Fraction of time N programs were active during session is in the columns
A distinct knee in the curve, meaning there is a very definite set of patterns
If the number of clusters increases, we have a new pattern of standby activity
Cluster size indicates the prevalence of pattern
Slide18
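The standby clustering on the Machine learning slide was done in Weka; the sketch below is a plain-Python stand-in showing the same idea on toy data: run k-means for increasing k and watch the within-cluster sum of squares (WCSS) for a knee. The toy sessions, the deterministic farthest-point seeding, and all function names are assumptions for illustration, not the author's setup.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seed_centers(points, k):
    # Deterministic farthest-point seeding (an assumption made here so the
    # example is reproducible; it is not Weka's initialization).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist2(p, c) for c in centers)))
    return centers


def assign(points, centers):
    """Put each point in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, iterations=50):
    centers = seed_centers(points, k)
    for _ in range(iterations):
        clusters = assign(points, centers)
        new_centers = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


def wcss(centers, clusters):
    """Within-cluster sum of squared distances to each cluster's center."""
    return sum(dist2(p, c) for c, cluster in zip(centers, clusters)
               for p in cluster)


# Toy data: one row per standby session; columns are the fraction of the
# session during which each of two hypothetical programs was active.
SESSIONS = [
    (0.90, 0.05), (0.85, 0.10), (0.95, 0.00),   # mostly program 1
    (0.10, 0.80), (0.05, 0.90), (0.00, 0.85),   # mostly program 2
    (0.05, 0.05), (0.00, 0.10), (0.10, 0.00),   # mostly idle
]

curve = {k: wcss(*kmeans(SESSIONS, k)) for k in range(1, 7)}
```

On this toy data the WCSS falls steeply up to k = 3 and stays small afterwards, which is the knee: three definite patterns. The length of each returned cluster gives the prevalence of its pattern.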
Waxing philosophical…