Presentation Transcript

Slide1

Dan Han, Eleni Stroulia
University of Alberta

A 3-Dimensional Data Model for Large Time-Series Dataset Analysis in HBase

9/15/2012

MESOCA 2012

Slide2

Outline

Background and Motivation
Related Work
A 3-Dimensional Data Model in HBase
Case Study and Experiment Results
Discussion
Conclusions and Future Work

Slide3

Migrating Applications To the Cloud

The cloud is an attractive computing platform: elasticity, excellent scalability, high availability, low operating cost

Applications are moving to the cloud: social networking, online shopping, monitoring systems

Time-series data grows monotonically over time

Analysis of large-scale time-series data
  May lead to new knowledge
  May lead to improvements of existing services

Successful adoption of this migration paradigm requires a new model of storage

Slide4

Migrating RDBMS Content to NoSQL

From RDBMS to NoSQL storage systems
  Enable the storage of big data, sorted by row key
  Scale horizontally across storage nodes easily
  Not much data-organization support

Migration challenges
  Few experiences and principles to follow
  Steep learning curve for programming
  Much experimentation is required before deployment
  Much time is spent designing the data schema
  The “wrong” schema may lead to inefficient, high-latency queries

Slide5

We need Design Patterns for HBase Schemas

Our objective is to develop a systematic method for guiding data organization in NoSQL databases, given
  the types of data stored,
  the amount of data, and
  its usage patterns

We start our investigation with HBase
  A NoSQL database offering, built on top of Hadoop
  Parallel distributed computation
    MapReduce framework
    Coprocessor framework

Slide6

Related Work

Talks at HBaseCon 2012, held in May
  Data schema and Coprocessor were the two main topics
  Experience from 30 enterprises, such as Facebook, Yapmap, eBay, Adobe
  Organizing time-series data into period-specific “buckets”

OpenTSDB: a distributed, scalable time-series database written on top of HBase

A data model in Cassandra, another NoSQL database offering
  Applied in our case study

Slide7

Data Organization in HBase

Cell in HBase
(Row, Family:Column, Version) => (X, Y, Z) = value

2-D vs. 3-D

Schema/dimension   Row                     Family:Column        Version
2-D                unique id - timestamp   varying properties   current timestamp
3-D                unique id               varying properties   timestamps

[Figure: two diagrams, one with axes X, Y and Z, the other with axes X and Y]
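To make the two layouts concrete, here is a minimal sketch that models a table as nested Python dictionaries addressed by (row, column, version). It does not use the HBase client. The helper names `new_table` and `put`, the entity id `p42`, the column `props:mass` and all values are hypothetical, chosen only to illustrate where the timestamp ends up in each layout.

```python
from collections import defaultdict

def new_table():
    # table[row][column][version] -> value, mirroring (Row, Family:Column, Version)
    return defaultdict(lambda: defaultdict(dict))

def put(table, row, column, version, value):
    table[row][column][version] = value

# Hypothetical observations of one entity at three timestamps.
observations = [(1, 5.0), (2, 5.5), (3, 6.1)]

# 2-D layout: the timestamp is part of the row key. The version slot would
# only hold the write time (modeled here as a constant 0) and carries no meaning.
table_2d = new_table()
for ts, mass in observations:
    put(table_2d, f"p42-{ts}", "props:mass", 0, mass)

# 3-D layout: one row per entity; the timestamp moves into the version slot.
table_3d = new_table()
for ts, mass in observations:
    put(table_3d, "p42", "props:mass", ts, mass)

rows_2d = sorted(table_2d)                  # three rows, one per timestamp
history_3d = table_3d["p42"]["props:mass"]  # one row holding all versions
```

In the 2-D layout the history of an entity is spread over several rows, while in the 3-D layout it sits in a single cell address with one entry per version.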

Slide8

Case study: The Datasets

Cosmology dataset
  Product of an N-body simulation
  Three types of particles: dark matter, gas and star
  Particles evolve over a series of discrete timestamps
  Each snapshot records the properties of all particles at the time of the snapshot
  9 snapshots, consisting of 321,065,547 particles

Bixi dataset
  Data from a bicycle-renting service in the city of Montreal
  Every minute, statistical information about bike usage at a station is collected by the sensor
  96,842 data points involved

Slide9

Three Schemas for the Cosmology Dataset

Schema/dimension   Row                Family:Column         Version
Schema1            sid-type-pid       particle properties   No meaning
Schema2            type-pid           particle properties   Snapshot id
Schema3            type-reversedpid   particle properties   Snapshot id

[Figure: example row keys spread over three regions for each schema, drawn on axes X, Y and Z]
Schema1: 24-2-33446666, 64-2-33559999, 84-2-33550000
Schema2: 2-33446666, 2-33550000, 2-33559999
Schema3: 2-00005533, 2-66664433, 2-99995533
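The three row-key layouts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `sid` is read as the snapshot id and `pid` as the particle id, and "reversedpid" is taken to mean the digits of the particle id in reverse order, which is consistent with the example keys on the slide. The function names are hypothetical.

```python
def schema1_key(sid, ptype, pid):
    # Schema1: snapshot id, particle type and particle id all live in the row key.
    return f"{sid}-{ptype}-{pid}"

def schema2_key(ptype, pid):
    # Schema2: the snapshot id moves to the version dimension.
    return f"{ptype}-{pid}"

def schema3_key(ptype, pid):
    # Schema3: like Schema2, but with the particle id's digits reversed (assumed).
    return f"{ptype}-{str(pid)[::-1]}"

# Particle ids taken from the slide's example keys.
pids = [33446666, 33550000, 33559999]

# Row keys are compared as strings, so plain sorting mimics the stored order.
schema2_rows = sorted(schema2_key(2, p) for p in pids)
schema3_rows = sorted(schema3_key(2, p) for p in pids)
```

With these ids, `schema2_rows` keeps the two neighboring ids 33550000 and 33559999 next to each other, while `schema3_rows` places them at opposite ends of the sorted order, presumably so that particles with nearby ids land in different regions.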

Slide10

Three Schemas for the Bixi Dataset

Schema/dimension   Row        Family:Column        Version
Schema1            hour-sid   minutes [0,59]       no meaning
Schema2            hour-sid   monitoring metrics   minutes [0,59]
Schema3            day-sid    monitoring metrics   minutes [0,1439]

[Figure: one diagram per schema, with axes labeled X, Time and metrics]
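As a rough illustration of the table above, the sketch below maps one sensor reading to a (row, column, version) address under each schema. Several details are assumptions not given on the slide: `sid` is read as the station id, the hour and day parts of the key use the formats `YYYYMMDDHH` and `YYYYMMDD`, and the family names `m` and `metrics` and the metric name `bikes` are made up.

```python
from datetime import datetime

def hour_key(ts, sid):
    # Assumed key format: YYYYMMDDHH-<station id>
    return f"{ts:%Y%m%d%H}-{sid}"

def day_key(ts, sid):
    # Assumed key format: YYYYMMDD-<station id>
    return f"{ts:%Y%m%d}-{sid}"

def schema1_cell(ts, sid):
    # Schema1: minute of the hour is the column; the version is unused.
    return (hour_key(ts, sid), f"m:{ts.minute}", None)

def schema2_cell(ts, sid, metric):
    # Schema2: metric is the column; minute of the hour [0,59] is the version.
    return (hour_key(ts, sid), f"metrics:{metric}", ts.minute)

def schema3_cell(ts, sid, metric):
    # Schema3: metric is the column; minute of the day [0,1439] is the version.
    return (day_key(ts, sid), f"metrics:{metric}", ts.hour * 60 + ts.minute)

reading_time = datetime(2012, 9, 15, 13, 7)
cell1 = schema1_cell(reading_time, 12)
cell2 = schema2_cell(reading_time, 12, "bikes")
cell3 = schema3_cell(reading_time, 12, "bikes")
```

Moving from Schema2 to Schema3 widens the row from one hour to one day, so a day of readings for a station needs one row with up to 1440 versions per metric instead of 24 rows.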

Slide11

Experiment Results

Experiment environment
  Hadoop 0.20, HBase 0.93-snapshot (Coprocessor support)
  A four-node cluster on virtual machines

Queries for each dataset
  Three queries on the Cosmology dataset, from related research
  One query on the Bixi dataset, from a business requirement

Query processing implementation
  Native Java API
  User-level Coprocessor implementation

Slide12

Query1 of Cosmology Dataset

Get all the particles of this type in this snapshot whose property matches the expression


Slide13

Query2 of Cosmology Dataset

Get all the particles added/destroyed between snapshots s1 and s2

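One way to read this query, assuming a layout like Schema2 where the snapshot id is stored as the version, is as a set comparison: a particle is "added" if it has a cell in s2 but not in s1, and "destroyed" in the opposite case. The sketch below works on a plain in-memory dictionary rather than HBase, and the table contents and function names are hypothetical.

```python
# Hypothetical table: row key -> {snapshot id -> property value}
table = {
    "2-101": {1: 0.5, 2: 0.6},  # present in both snapshots
    "2-102": {1: 0.7},          # present only in snapshot 1
    "2-103": {2: 0.9},          # present only in snapshot 2
}

def particles_in(table, snapshot):
    # Row keys that have a value stored under the given snapshot id.
    return {row for row, versions in table.items() if snapshot in versions}

def added_and_destroyed(table, s1, s2):
    before = particles_in(table, s1)
    after = particles_in(table, s2)
    # Added: only in s2. Destroyed: only in s1.
    return after - before, before - after

added, destroyed = added_and_destroyed(table, 1, 2)
```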

Slide14

Query3 of Cosmology Dataset

Get the values of the property for the given set of particles across the selected snapshots.

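Under a schema that keeps the snapshot id in the version dimension, this query amounts to reading a few versions of one column for each requested row. The following sketch shows that access pattern on nested dictionaries. It is not HBase client code, and the function name, row keys, column name and values are hypothetical.

```python
def property_history(table, rows, column, snapshots):
    # table: row key -> column -> {snapshot id -> value}
    result = {}
    for row in rows:
        versions = table.get(row, {}).get(column, {})
        # Keep only the requested snapshots that actually exist for this row.
        result[row] = {s: versions[s] for s in snapshots if s in versions}
    return result

# Hypothetical data for two particles over three snapshots.
table = {
    "2-101": {"props:mass": {1: 5.0, 2: 5.5, 3: 6.1}},
    "2-102": {"props:mass": {1: 2.0, 3: 2.4}},
}

history = property_history(table, ["2-101", "2-102"], "props:mass", [1, 3])
```

Each particle is reached with a single row lookup here, whereas a layout with the snapshot id in the row key would need one lookup per particle and snapshot.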

Slide15

Bixi Query

For a given list of stations and a time, get their average bike usage for the last 1, 2, 4, 8 and 16 days

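The query logic itself can be sketched independently of the storage schema. The example below assumes each station's readings are available as (timestamp, usage) pairs and that "the last N days" means the window from N days before the given time up to and including it. Both the data layout and the name `average_usage` are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def average_usage(readings, stations, at, days_back=(1, 2, 4, 8, 16)):
    # readings: station id -> list of (timestamp, usage) pairs (hypothetical layout)
    result = {}
    for sid in stations:
        per_window = {}
        for days in days_back:
            start = at - timedelta(days=days)
            values = [u for ts, u in readings.get(sid, []) if start < ts <= at]
            # None marks a window with no readings at all.
            per_window[days] = sum(values) / len(values) if values else None
        result[sid] = per_window
    return result

at = datetime(2012, 9, 15, 12, 0)
readings = {
    7: [
        (at - timedelta(hours=6), 10),
        (at - timedelta(days=1, hours=6), 20),
        (at - timedelta(days=3), 30),
    ]
}
averages = average_usage(readings, [7], at)
```

Because the windows are nested, every reading in the 1-day window is read again for the 2-, 4-, 8- and 16-day windows, which is why the cost of fetching a day of readings per station matters for this query.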

Slide16

Discussion

“Qualitative” versus “Quantitative” Suggestions
Dynamic Data versus Static Data
Historical Datasets versus Real-Time Datasets
Supported versus Non-Supported Datasets

Slide17

Conclusion

A 3-dimensional data model
  Improved performance can be obtained from data schemas that use the version dimension of HBase
  Fits “write-once, read-many” systems
    Monitoring systems
    Sensor-based systems
    Version-based analysis

Slide18

Future Work

More evaluation of this data model: scalability, elasticity, and utilization
How to design data models for other datasets
  Spatial dataset
  Graphic dataset

Slide19

Questions?

Thank you