NIST Big Data Public Working Group

NIST Big Data Public Working Group - PowerPoint Presentation

Uploaded On 2015-09-22




Presentation Transcript

Slide1

NIST Big Data Public Working Group

Definition and Taxonomy Subgroup Presentation

September 30, 2013

Nancy Grady, SAIC

Natasha Balac, SDSC

Eugene Lister, R2AD

Slide2

Overview

Objectives

Approach

Big Data Component Definitions

Data Science Component Definitions

Taxonomy

Roles

Activities

Components

Subcomponents

Templates

Next Steps

Slide3

Objectives

Identify concepts

Focus on what is new and different

Clarify terminology

Attempt to avoid terms that have domain-specific meanings

Remain independent of specific implementations

Slide4

Approach

Hold scope to what is different because of Big Data

Use additional concepts needed for completeness

Restrict terms to represent single concepts

Don’t stray too far from common usage

In the report go straight to Big Data and Data Science

This presentation will start from more elemental concepts

Relationship to cloud, but not required

Slide5

Definitions

Big Data

Data Science

Slide6

Concepts Relating to Data

Data Type (structured, semi-structured, unstructured)

Beyond our scope (and not new)

Data Lifecycle: Raw Data, Usable Information, Synthesized Knowledge, Implemented Benefit

Metadata: data about data or system or processing

Provenance: Data Lifecycle history

Complexity: dependent relationships across data elements
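As a rough illustration of the metadata and provenance concepts on this slide, the Python sketch below attaches a provenance history to a data item as it moves through the lifecycle stages named above. The `DataItem` class, its field names, and the sample sensor readings are hypothetical and are not part of the working group's definitions.

```python
from dataclasses import dataclass, field

# Lifecycle stages as named on the slide, in order.
LIFECYCLE = [
    "Raw Data",
    "Usable Information",
    "Synthesized Knowledge",
    "Implemented Benefit",
]


@dataclass
class DataItem:
    payload: object
    metadata: dict = field(default_factory=dict)    # data about the data
    provenance: list = field(default_factory=list)  # lifecycle history
    stage: str = LIFECYCLE[0]

    def advance(self, process: str, payload: object) -> None:
        """Move to the next stage and record which process produced it.

        Raises IndexError if the item is already at the final stage.
        """
        nxt = LIFECYCLE[LIFECYCLE.index(self.stage) + 1]
        self.provenance.append({"from": self.stage, "to": nxt, "process": process})
        self.stage = nxt
        self.payload = payload


# Made-up example: raw readings, one of them missing.
item = DataItem(
    payload=[21.5, 22.1, None, 23.0],
    metadata={"unit": "celsius", "source": "sensor-7"},
)
readings = [r for r in item.payload if r is not None]
item.advance("drop missing readings", readings)
item.advance("average", sum(readings) / len(readings))
```

After the two calls to `advance`, `item.provenance` holds one entry per transition, which is one simple way to read "Provenance: Data Lifecycle history".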

Slide7

Concepts Relating to Dataset at Rest

Volume: amount of data

Variety: many data types, and also across data domains

Persistence: storing in {flat files, RDBMS, NoSQL, markup, …}

NoSQL: Big Table, Name-value, Graph, Document

Tiered storage {in-memory, cache, SSD, hard disk, …}

Distributed {local, multiple local, network-based}
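To make the four NoSQL persistence models listed on this slide more concrete, here is a minimal sketch that shapes one made-up fact four ways using plain Python structures. It assumes nothing about any particular database product. The keys, names, and the `follows` helper are all hypothetical.

```python
import json

# One illustrative fact ("user 42 is named Ada and follows user 7"),
# shaped four ways. All keys and values here are made up.

# Name-value: the store sees only a key and an opaque value.
name_value = {"user:42": json.dumps({"name": "Ada", "follows": [7]})}

# Document: a nested, self-describing record addressed by id.
document = {42: {"name": "Ada", "follows": [7]}}

# Big Table style: row key -> column family -> column -> value.
big_table = {"42": {"profile": {"name": "Ada"}, "follows": {"7": True}}}

# Graph: nodes and explicit, typed edges between them.
graph = {
    "nodes": {42: {"name": "Ada"}, 7: {}},
    "edges": [(42, "FOLLOWS", 7)],
}


def follows(user_id):
    """Answer the same question against each representation."""
    return {
        "name_value": json.loads(name_value[f"user:{user_id}"])["follows"],
        "document": document[user_id]["follows"],
        "big_table": [int(col) for col in big_table[str(user_id)]["follows"]],
        "graph": [dst for src, rel, dst in graph["edges"]
                  if src == user_id and rel == "FOLLOWS"],
    }


answers = follows(42)
```

The same question is answered in each case. What differs is how much of the structure the store itself can see and query.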

Slide8

Concepts Related to Dataset in Motion

Velocity: rate of data flow

Variability: change in rate of data flow, also in structure and refresh rate

Accessibility: new concept of Data-as-a-Service

Transport formats (not new)

Transport protocols (not new)

Slide9

Big Data Analogy to Parallel computing

Processor improvements slowed

Coordinate a loose collection of processors

Adds resource communication complexities

System clocks

Message passing

Distribution of processing code

Distribution of data for processing nodes

Slide10

Big Data - Jan 15-17 NIST Cloud/Big Data Workshop

Big Data refers to digital data volume, velocity, and/or variety that:

Enable novel approaches to frontier questions previously inaccessible or impractical using current or conventional methods; and/or

Exceed the storage capacity or analysis capability of current or conventional methods and systems.

Differentiates by storing and analyzing population data and not sample sizes

Slide11

Refinements are Welcome

The heart of the change is the scaling

Data seek performance improving more slowly than Moore’s Law

Data volumes increasing faster than Moore’s Law

Implies the addition of horizontal scaling to vertical scaling

Data analogous to MPP processing changes

Difficult to define as:

An implication of engineering changes

Data Lifecycle process order changes

Implication of a new type of analytics

As moving the processing to the data, not the data to the processing
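The last point, moving the processing to the data, can be sketched in a few lines of Python. This is a toy model under stated assumptions: the "nodes" are just dictionary entries, the partition contents are made up, and "records moved" is a simple count used to stand in for network traffic. It is not a description of any specific framework.

```python
from collections import Counter

# Hypothetical cluster: each node holds one partition of the data locally.
partitions = {
    "node-a": ["error", "ok", "ok", "error"],
    "node-b": ["ok", "ok", "warn"],
    "node-c": ["error", "warn", "ok", "ok", "ok"],
}


def data_to_processing(parts):
    """Ship every record to one place, then count there."""
    shipped = [rec for recs in parts.values() for rec in recs]
    return Counter(shipped), len(shipped)  # (result, records moved)


def processing_to_data(parts):
    """Run the count where each partition lives; ship only the summaries."""
    partials = [Counter(recs) for recs in parts.values()]
    moved = sum(len(p) for p in partials)  # summary entries moved
    total = Counter()
    for p in partials:
        total.update(p)
    return total, moved


central, moved_central = data_to_processing(partitions)
local, moved_local = processing_to_data(partitions)
```

Both approaches produce the same counts, but the second moves only per-node summaries rather than every record. The gap widens as partitions grow while the number of distinct values stays small.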

Slide12

Big Data Analytics Characteristics

Analytics Characteristics are not new

Value: produced when the analytics output is put into action

Veracity: measure of accuracy and timeliness

Quality: well-formed data (missing values, cleanliness)

Latency: time between measurement and availability

Data types have differing pre-analytics needs
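Two of the characteristics on this slide lend themselves to simple measurement. The sketch below computes latency (time between measurement and availability) and one quality indicator (fraction of missing values) over a handful of made-up records. The record layout and the function names are assumptions for illustration only.

```python
from datetime import datetime, timedelta

# Hypothetical records: a value, when it was measured, and when it
# became available for analysis.
t0 = datetime(2013, 9, 30, 12, 0, 0)
records = [
    {"value": 10.0, "measured": t0, "available": t0 + timedelta(seconds=2)},
    {"value": None, "measured": t0 + timedelta(seconds=1), "available": t0 + timedelta(seconds=5)},
    {"value": 12.5, "measured": t0 + timedelta(seconds=2), "available": t0 + timedelta(seconds=8)},
    {"value": 11.0, "measured": t0 + timedelta(seconds=3), "available": t0 + timedelta(seconds=4)},
]


def latencies(recs):
    """Latency per record, in seconds: availability minus measurement."""
    return [(r["available"] - r["measured"]).total_seconds() for r in recs]


def missing_fraction(recs):
    """Share of records whose value is missing (one quality indicator)."""
    return sum(r["value"] is None for r in recs) / len(recs)


lat = latencies(records)
worst_latency = max(lat)
quality_gap = missing_fraction(records)
```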

Slide13

Data Science as a Science Progression

Coined the “Fourth Paradigm” by the late Jim Gray

Experiment: Empirical measurement science

Theory: Causal interpretation

Explains experiments

Calculates measurements that would confirm the theoretical models

Simulation: Performing theory (model)-driven experiments that are not empirically possible

Data Science: Empirical analysis of data produced by processes

Slide14

Data Science Analogy (simplistically)

Statistics: precise deterministic causal analysis over precisely collected data

Data Mining: deterministic causal analysis over re-purposed data that has been carefully sampled

Data Science: trending or correlation analysis over existing data that typically uses the bulk of the population

Slide15

Data Science

Data Science is the extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and analytical hypothesis analysis.

A Data Scientist is a practitioner who has sufficient knowledge of the overlapping regimes of expertise in business needs, domain knowledge, analytical skills and programming expertise to manage the end-to-end scientific method process through each stage in the big data lifecycle (through action) to deliver value.

Slide16

Data Science Skillsets

Slide17

Data Science Addendums

Is not just Analytics

The end-to-end data system is the equipment

The analytics over Big Data can be

Exploratory or discovery-driven for hypothesis generation

Focused hypothesis verification

Focused on operationalization

Slide18

Taxonomy

Actors

Roles

Activities

Components

Subcomponents

Slide19

Big Data Taxonomy

Actors

Roles

Activities

Components

Sub-components

Slide20

Actors

Sensors

Applications

Software agents

Individuals

Organizations

Hardware resources

Service abstractions

Slide21

System Roles

Data Provider – makes available data internal and/or external to the system

Data Consumer – uses the output of the system

System Orchestrator – governance, requirements, monitoring

Big Data Application Provider – instantiates application

Big Data Framework Provider – provides resources

Slide22

Roles and Actors

Slide23

Data Provider

Slide24

System Orchestrator

Slide25

Big Data Application Provider

Slide26

Big Data Framework Provider

Slide27

Data Consumer

Slide28

Big Data Security

Slide29

Big Data Application Provider

Slide30

Data Lifecycle Processes

Diagram labels: Collect, Analyze, Need, Curate, Act & Monitor, Data, Information, Knowledge, Benefit, Goal, Evaluate

Slide31

Data Warehouse Template – store after curate

Diagram stages: COLLECT, CURATE, ANALYZE, ACT

Diagram labels: Domain, Cleanse, Transform, ETL, Action, Warehouse, Summarized, Data, Algorithm, Analytic, Mart, Staging

ETL = extract, transform, load

Slide32

Volume template – store raw data after collect

Diagram stages: COLLECT, CURATE, ANALYZE, ACT

Diagram labels: Raw Data Cluster, Model Building, Model Analytics, Data Product, Map/Reduce, Mart, Model Data, Volume, Complexity, Domain, Cleanse, Transform, Analyze

Slide33

Velocity Template – store after analytics

Diagram stages: COLLECT, CURATE, ANALYZE, ACT

Diagram labels: Enriched, Data Cluster, Velocity, Volume, Alerting, Domain, Cleanse, Transform

Slide34

Variety Template – Schema-on-Read

Diagram stages: COLLECT, CURATE, ANALYZE, ACT

Diagram labels: Analyze, Common Query, Fused, Data, Variety, Complexity, Map/Reduce, Query

Slide35

Analysis to Action Template

Seconds – Streaming Real-time Analytics

Minutes – Batch jobs of operational model

Hours – Ad-hoc analysis

Months – Exploratory analysis

Slide36

Possible Next Steps

Refinement of Big Data Definition

Word-smithing of all definitions

Refinement of Taxonomy Mindmap for completeness

Exploration of Templates for categorization

Data distribution templates according to CAP compliance

Measures and Metrics (how big is Big Data)