NIST Big Data Public Working Group
Definition and Taxonomy Subgroup Presentation
September 30, 2013
Nancy Grady, SAIC
Natasha Balac, SDSC
Eugene Lister, R2AD
Overview
Objectives
Approach
Big Data Component Definitions
Data Science Component Definitions
Taxonomy
Roles
Activities
Components
Subcomponents
Templates
Next Steps
Objectives
Identify concepts
Focus on what is new and different
Clarify terminology
Attempt to avoid terms that have domain-specific meanings
Remain independent of specific implementations
Approach
Hold scope to what is different because of Big Data
Use additional concepts needed for completeness
Restrict terms to represent single concepts
Don't stray too far from common usage
In the report, go straight to Big Data and Data Science
This presentation will start from more elemental concepts
Related to cloud computing, but cloud is not required
Definitions
Big Data
Data Science
Concepts Relating to Data
Data Type (structured, semi-structured, unstructured)
Beyond our scope (and not new)
Data Lifecycle: Raw Data → Usable Information → Synthesized Knowledge → Implemented Benefit
Metadata: data about the data, the system, or the processing
Provenance: the data lifecycle history
Complexity: dependent relationships across data elements
Concepts Relating to Dataset at Rest
Volume: amount of data
Variety: many data types, and also variety across data domains
Persistence: storing in {flat files, RDBMS, NoSQL, markup, …}
NoSQL categories: Big Table, name-value, graph, document
Tiered storage: {in-memory, cache, SSD, hard disk, …}
Distributed: {local, multiple local, network-based}
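The four NoSQL categories on this slide can be made concrete by holding the same small record in each shape. This is a minimal sketch using plain Python structures; the field names and layouts are invented for illustration, and real stores (BigTable-style, key-value, graph, document) add distribution and persistence on top.

```python
# One record, modeled in the four NoSQL shapes named on the slide.
record = {"id": "u1", "name": "Ada", "follows": "u2"}

# Name-value (key-value): an opaque value behind a single key
kv = {"user:u1": record}

# Big Table style: row key -> column families -> columns
bigtable = {"u1": {"profile": {"name": "Ada"}, "edges": {"follows": "u2"}}}

# Document: nested, self-describing documents in a collection
documents = [{"_id": "u1", "name": "Ada", "follows": ["u2"]}]

# Graph: nodes plus labeled edges
nodes = {"u1": {"name": "Ada"}, "u2": {"name": "Grace"}}
edges = [("u1", "follows", "u2")]

print(bigtable["u1"]["profile"]["name"])  # "Ada"
```

The trade-off the taxonomy hints at: key-value stores are fastest for whole-record lookup, while graph and document shapes make relationship and nested-field queries cheaper.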
Concepts Related to Dataset in Motion
Velocity: rate of data flow
Variability: change in the rate of data flow, and also in structure and refresh rate
Accessibility: the new concept of Data-as-a-Service
Transport formats (not new)
Transport protocols (not new)
Big Data Analogy to Parallel Computing
Processor improvements slowed
Coordinate a loose collection of processors
Adds resource communication complexities:
System clocks
Message passing
Distribution of processing code
Distribution of data to the processing nodes
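The "distribution of data to the processing nodes" idea can be sketched as a tiny map/reduce word count. This is a simulation only, with worker "nodes" stood in by a thread pool; the function names and chunking scheme are assumptions for the example, not part of the NIST material.

```python
# Sketch of coordinating a loose collection of workers: chunk the data,
# map over each chunk locally, then reduce the partial results.
from concurrent.futures import ThreadPoolExecutor
from collections import Counter
from functools import reduce

def map_count(chunk):
    """Map step: count words within one locally held chunk of lines."""
    return Counter(word for line in chunk for word in line.split())

def merge_counts(a, b):
    """Reduce step: merge the partial counts returned by the workers."""
    return a + b

def distributed_word_count(lines, nodes=4):
    # Distribute the data to the "nodes"; each processes only its chunk,
    # so the processing moves to the data rather than the reverse.
    chunks = [lines[i::nodes] for i in range(nodes)]
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(map_count, chunks))
    return reduce(merge_counts, partials, Counter())

counts = distributed_word_count(["big data big", "data science", "big analytics"])
print(counts["big"])  # 3
```

A real framework adds exactly the complexities the slide lists: clock coordination, message passing between machines, and shipping the map code to where the data lives.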
Big Data - Jan 15-17 NIST Cloud/Big Data Workshop
Big Data refers to digital data volume, velocity, and/or variety that:
Enable novel approaches to frontier questions previously inaccessible or impractical using current or conventional methods; and/or
Exceed the storage capacity or analysis capability of current or conventional methods and systems.
Differentiated by storing and analyzing data from the full population rather than from samples
Refinements are Welcome
The heart of the change is the scaling
Data seek times increasing slower than Moore’s Law
Data volumes increasing faster than Moore's Law
Implies the addition of horizontal scaling to vertical scaling
Data handling analogous to the MPP processing changes
Difficult to define as:
An implication of engineering changes
Data Lifecycle process-order changes
An implication of a new type of analytics
Moving the processing to the data, not the data to the processing
Big Data Analytics Characteristics
Analytics Characteristics are not new
Value: produced when the analytics output is put into action
Veracity: measure of accuracy and timeliness
Quality: well-formed data, missing values, cleanliness
Latency: time between measurement and availability
Data types have differing pre-analytics needs
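The quality checks named above (well-formed data, missing values, cleanliness) can be sketched as a small pre-analytics pass. The record shape, field names, and rules here are assumptions invented for the example, not NIST-specified behavior.

```python
# Minimal pre-analytics quality pass: keep only well-formed records,
# reject missing required values, and clean up string fields.
def curate(records, required=("id", "value")):
    clean = []
    for rec in records:
        if not isinstance(rec, dict):                        # well-formed?
            continue
        if any(rec.get(f) in (None, "") for f in required):  # missing values?
            continue
        rec = {k: v.strip() if isinstance(v, str) else v     # cleanliness
               for k, v in rec.items()}
        clean.append(rec)
    return clean

raw = [{"id": "a1 ", "value": 3}, {"id": None, "value": 7}, "garbage"]
print(curate(raw))  # only the first record survives, with "a1" trimmed
```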
Data Science as a Science Progression
Called the "Fourth Paradigm" by the late Jim Gray
Experiment: Empirical measurement science
Theory: causal interpretation that explains experiments and calculates the measurements that would confirm the theoretical models
Simulation: performing theory (model)-driven experiments that are not empirically possible
Data Science: empirical analysis of data produced by processes
Data Science Analogy (simplistically)
Statistics: precise deterministic causal analysis over precisely collected data
Data Mining: deterministic causal analysis over re-purposed data that has been carefully sampled
Data Science: trending or correlation analysis over existing data, typically using the bulk of the population
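The "correlation analysis over the bulk of the population" contrast can be illustrated with a plain Pearson correlation computed over an entire (tiny, made-up) population rather than a drawn sample; the data values here are invented for the example.

```python
# Pearson correlation over full population data, no sampling step.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

population_x = [1, 2, 3, 4, 5]
population_y = [2, 4, 6, 8, 10]  # perfectly linear in x
print(round(pearson(population_x, population_y), 6))  # 1.0
```

The point of the slide stands either way: the method is classical; what changes in Data Science is running it over the whole population of repurposed data instead of a carefully designed sample.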
Data Science
Data Science is the extraction of actionable knowledge directly from data through a process of discovery, hypothesis formulation, and hypothesis testing.
A Data Scientist is a practitioner with sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and programming expertise to manage the end-to-end scientific-method process, through each stage of the big data lifecycle (through action), to deliver value.
Data Science Skillsets
Data Science Addendums
Is not just Analytics
The end-to-end data system is the equipment
The analytics over Big Data can be:
Exploratory or discovery-driven, for hypothesis generation
Focused hypothesis verification
Focused on operationalization
Taxonomy
Actors
Roles
Activities
Components
Subcomponents
Big Data Taxonomy
Actors
Roles
Activities
Components
Sub-components
Actors
Sensors
Applications
Software agents
Individuals
Organizations
Hardware resources
Service abstractions
System Roles
Data Provider: makes available data internal and/or external to the system
Data Consumer: uses the output of the system
System Orchestrator: governance, requirements, monitoring
Big Data Application Provider: instantiates the application
Big Data Framework Provider: provides resources
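The five roles above can be wired together in a toy orchestration sketch, purely to show how the responsibilities relate. The class names mirror the slide, but every method and value here is invented for illustration; the real reference architecture does not specify an API.

```python
# Toy wiring of the five system roles; bodies are placeholders.
class DataProvider:
    def supply(self):
        return [1, 2, 3]                 # data made available to the system

class BigDataFrameworkProvider:
    def resources(self):
        return {"nodes": 4}              # compute/storage resources

class BigDataApplicationProvider:
    def run(self, data, resources):
        return sum(data)                 # the instantiated application
                                         # (resources unused in this sketch)

class DataConsumer:
    def use(self, output):
        return f"consumed {output}"      # uses the output of the system

class SystemOrchestrator:
    """Governance, requirements, and monitoring over the other roles."""
    def orchestrate(self):
        data = DataProvider().supply()
        res = BigDataFrameworkProvider().resources()
        out = BigDataApplicationProvider().run(data, res)
        return DataConsumer().use(out)

print(SystemOrchestrator().orchestrate())  # "consumed 6"
```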
Roles and Actors
Data Provider
System Orchestrator
Big Data Application Provider
Big Data Framework Provider
Data Consumer
Big Data Security
Big Data Application Provider
Data Lifecycle Processes
(lifecycle diagram: Need → Collect → Curate → Analyze → Act & Monitor → Evaluate, progressing from Goal through Data, Information, and Knowledge to Benefit)
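The Collect → Curate → Analyze → Act lifecycle above can be sketched as a pipeline of stage functions. The stage bodies and the alerting threshold are placeholders invented for illustration, not NIST-specified behavior; only the stage ordering comes from the slide.

```python
# Lifecycle as a pipeline: each stage transforms the previous stage's output.
def collect():
    return ["  7 ", "3", None, "11"]        # raw data, as gathered

def curate(raw):
    return [int(x) for x in raw if x is not None]   # usable information

def analyze(info):
    return sum(info) / len(info)            # synthesized knowledge (a mean)

def act(knowledge, threshold=5):
    return "alert" if knowledge > threshold else "ok"  # implemented benefit

print(act(analyze(curate(collect()))))  # mean of 7, 3, 11 is 7.0 -> "alert"
```

The templates on the following slides vary only where persistent storage sits within this same pipeline.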
Data Warehouse Template: store after curate
(diagram: COLLECT into staging; CURATE via ETL (domain rules, cleanse, transform) into the warehouse; ANALYZE with algorithms and analytics over summarized data in an analytic mart; ACT on the results. ETL = extract, transform, load)
Volume Template: store raw data after collect
(diagram: COLLECT raw data into a cluster (volume, complexity); CURATE with Map/Reduce (domain rules, cleanse, transform, analyze); ANALYZE with model building and model analytics over model data in a mart; ACT on the data product)
Velocity Template: store after analytics
(diagram: COLLECT a high-velocity stream; CURATE in-flight (domain rules, cleanse, transform); ANALYZE with alerting; ACT, then store the enriched data to a cluster for volume)
Variety Template: Schema-on-Read
(diagram: COLLECT varied, complex data as-is; CURATE and ANALYZE at read time, with Map/Reduce queries fusing the data behind a common query interface; ACT on the results)
Analysis to Action Template
Seconds: streaming real-time analytics
Minutes: batch jobs of an operational model
Hours: ad-hoc analysis
Months: exploratory analysis
Possible Next Steps
Refinement of the Big Data definition
Wordsmithing of all definitions
Refinement of the taxonomy mindmap for completeness
Exploration of the templates for categorization
Data-distribution templates according to CAP compliance
Measures and metrics (how big is Big Data?)