Slide1
Dan Han, Eleni Stroulia
University of Alberta
A 3-Dimensional Data Model for Large Time-Series Dataset Analysis in HBase
9/15/2012
1
MESOCA 2012
Slide2
Outline
Background and Motivation
Related Work
A 3-Dimensional Data Model in HBase
Case Study and Experiment Results
Discussion
Conclusions and Future Work
Slide3
Migrating Applications To the Cloud
Cloud is an attractive computing platform
Elasticity, excellent scalability, high availability, low operating cost
Applications are moving to the cloud
Social networking, online shopping, monitoring systems
Time-series data: grows monotonically over time
Analysis of large-scale time-series data
May lead to new knowledge
May lead to improvements of existing services
Successful adoption of this migration requires a new model of storage
Slide4
Migrating RDBMS Content To NoSQL
From RDBMS to NoSQL storage systems
Enable the storage of big data, in order of row key
Scale horizontally across storage nodes easily
Not much data-organization support
Migration challenges
Few experiences and principles to follow
Steep learning curve for programming
Much experimentation is required before deployment
Much time is spent in designing the data schema
The “wrong” schema may lead to inefficient, high-latency queries
Slide5
We need Design Patterns for HBase Schemas
Our objective is to develop a systematic method for guiding data organization in NoSQL databases, given
the types of data stored,
the amount of data, and
its usage patterns
We start our investigation with HBase
A NoSQL database offering, built on top of Hadoop
Parallel Distributed Computation
MapReduce Framework
Coprocessor Framework
Slide6
Related Work
Talks at HBaseCon 2012, held in May
Data schema and Coprocessor are two main topics
Experience from 30 enterprises, such as Facebook, Yapmap, eBay, Adobe
Organizing time-series data into period-specific “buckets”
OpenTSDB: a distributed, scalable time-series database, written on top of HBase
A data model in Cassandra, another NoSQL database offering
Applied in our case study
Slide7
Data Organization in HBase
Cell in HBase
(Row, Family: Column, Version) => (X, Y, Z) = value
2-D vs. 3-D schemas
Schema/dimension | Row | Family: Column | Version
2-D | unique id - timestamp | varying properties | current timestamp
3-D | unique id | varying properties | timestamps
[Figure: the 3-D schema drawn on X, Y and Z axes; the 2-D schema drawn on X and Y axes]
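The difference between the two layouts can be sketched with a toy in-memory table. This is only an illustration under stated assumptions: the nested dictionary stands in for an HBase table and is not the HBase client API, and the sensor id, column name and readings are made up.

```python
from collections import defaultdict

def new_table():
    # row key -> "family:column" -> {version: value}
    return defaultdict(lambda: defaultdict(dict))

def put(table, row, column, version, value):
    table[row][column][version] = value

# Hypothetical readings: (unique id, timestamp, column, value).
readings = [
    ("sensor7", 100, "d:temp", 20.5),
    ("sensor7", 200, "d:temp", 21.0),
    ("sensor7", 300, "d:temp", 21.5),
]

# 2-D layout: the timestamp is part of the row key. The version would
# normally default to the write time; a constant stands in for it here.
table_2d = new_table()
for uid, ts, column, value in readings:
    put(table_2d, f"{uid}-{ts}", column, 0, value)

# 3-D layout: one row per unique id, timestamps stored as versions.
table_3d = new_table()
for uid, ts, column, value in readings:
    put(table_3d, uid, column, ts, value)

print("2-D rows:", sorted(table_2d))
print("3-D rows:", sorted(table_3d))
print("3-D versions:", sorted(table_3d["sensor7"]["d:temp"]))
```

In the 2-D layout the history of one entity is spread over several rows, while in the 3-D layout it sits in a single row and is addressed through the version dimension.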
Slide8
Case study: The Datasets
Cosmology Dataset
Product of an N-body simulation
Three types of particles: dark matter, gas and star
Particles evolve over a series of discrete timestamps
Each snapshot records the properties of all particles at the time of the snapshot
9 snapshots, consisting of 321,065,547 particles
Bixi Dataset
Data from a bicycle-renting service in the city of Montreal
Every minute, statistics about bike usage at a station are collected by the sensor
96,842 data points involved
Slide9
Three Schemas for the Cosmology Dataset
Schema/dimension | Row | Family: Column | Version
Schema1 | sid-type-pid | particle properties | No meaning
Schema2 | type-pid | particle properties | Snapshot id
Schema3 | type-reversedpid | particle properties | Snapshot id
[Figure: example row keys for each schema, shown across three regions]
Schema1: 24-2-33446666, 64-2-33559999, 84-2-33550000
Schema2: 2-33446666, 2-33550000, 2-33559999
Schema3: 2-00005533, 2-66664433, 2-99995533
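A minimal sketch of how the three row keys could be built, using the example keys from the slide. The function names and the 8-digit zero padding of the particle id are assumptions made for illustration; the slide only shows the resulting keys.

```python
def schema1_key(sid, ptype, pid):
    # Schema1: snapshot id, particle type and particle id all in the row key.
    return f"{sid}-{ptype}-{pid:08d}"

def schema2_key(ptype, pid):
    # Schema2: the snapshot id moves to the version dimension.
    return f"{ptype}-{pid:08d}"

def schema3_key(ptype, pid):
    # Schema3: like Schema2, but the digits of the particle id are reversed.
    reversed_pid = f"{pid:08d}"[::-1]
    return f"{ptype}-{reversed_pid}"

pids = [33446666, 33550000, 33559999]
print(schema1_key(24, 2, pids[0]))
print(sorted(schema2_key(2, p) for p in pids))
print(sorted(schema3_key(2, p) for p in pids))
```

Because rows are stored in key order, reversing the id changes which keys end up next to each other: ids that are numerically close, such as 33550000 and 33559999, are no longer adjacent after reversal, which presumably spreads them across regions.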
Slide10
Three Schemas for the Bixi Dataset
Schema/dimension | Row | Family: Column | Version
Schema1 | hour-sid | minutes [0,59] | no meaning
Schema2 | hour-sid | monitoring metrics | minutes [0,59]
Schema3 | day-sid | monitoring metrics | minutes [0,1439]
[Figure: axis sketches of Schema1, Schema2 and Schema3, with axes labeled Time, X and metrics]
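The mapping from one sensor reading to a cell address under each Bixi schema can be sketched as follows. The date format inside the row key, the metric name and the station id are assumptions for illustration; the slide only states which component goes into which dimension.

```python
from datetime import datetime

def schema1_cell(sid, ts):
    # Row: hour-sid. Column: minute of the hour. Version: not used.
    return f"{ts:%Y%m%d%H}-{sid}", f"{ts.minute:02d}", None

def schema2_cell(sid, ts, metric):
    # Row: hour-sid. Column: metric. Version: minute of the hour, 0..59.
    return f"{ts:%Y%m%d%H}-{sid}", metric, ts.minute

def schema3_cell(sid, ts, metric):
    # Row: day-sid. Column: metric. Version: minute of the day, 0..1439.
    return f"{ts:%Y%m%d}-{sid}", metric, ts.hour * 60 + ts.minute

ts = datetime(2012, 9, 15, 13, 7)
print(schema1_cell(42, ts))
print(schema2_cell(42, ts, "usage"))
print(schema3_cell(42, ts, "usage"))
```

Under Schema3 a full day of readings for one station shares a single row, so a query over several days touches one row per station per day rather than 24.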
Slide11
Experiment Results
Experiment Environment
Hadoop 0.20, HBase 0.93-snapshot (Coprocessor support)
A four-node cluster on virtual machines
Queries for each dataset
Three queries on the Cosmology dataset, from related research
One query on the Bixi dataset, from a business requirement
Query processing implementation
Native Java API
User-level Coprocessor implementation
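The two implementation styles differ in where the work happens. The sketch below is a plain Python simulation of that idea, not HBase code: each dictionary plays the role of a region, and the row keys and values are made up. It assumes the common coprocessor usage in which every region computes a partial result that the client then combines.

```python
# Three hypothetical regions, each holding some rows of one table.
regions = [
    {"2-00005533": 1.0, "2-11112222": 2.0},
    {"2-66664433": 3.0},
    {"2-99995533": 4.0, "2-99996666": 5.0},
]

def client_side_sum(regions):
    # Client-API style: every value is shipped to the client, which aggregates.
    shipped = [value for region in regions for value in region.values()]
    return sum(shipped), len(shipped)

def coprocessor_style_sum(regions):
    # Coprocessor style: each region aggregates locally, one partial per region.
    partials = [sum(region.values()) for region in regions]
    return sum(partials), len(partials)

total_a, items_a = client_side_sum(regions)
total_b, items_b = coprocessor_style_sum(regions)
print(total_a, items_a)
print(total_b, items_b)
```

Both paths return the same total; the second moves one partial result per region to the client instead of one value per row.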
Slide12
Query1 of Cosmology Dataset
Get all the particles of this type in this snapshot whose property matches the expression
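As a rough sketch, Query1 can be expressed over a toy version of the 3-D layout (row "type-pid", properties as columns, snapshot id as version). The table contents, the property name "mass" and the comparison used as the expression are illustrative assumptions, and the dictionary is not the HBase API.

```python
import operator

# row "type-pid" -> property -> {snapshot id: value}
table = {
    "1-33559999": {"mass": {1: 2.0, 2: 2.1}},
    "2-33446666": {"mass": {1: 0.5, 2: 0.7}},
    "2-33550000": {"mass": {1: 1.5, 2: 1.6}},
}

def query1(table, ptype, snapshot, prop, compare, threshold):
    """Rows of the given type whose property value in the snapshot matches."""
    prefix = f"{ptype}-"
    hits = []
    for row in sorted(table):                 # rows are visited in key order
        if not row.startswith(prefix):
            continue
        value = table[row].get(prop, {}).get(snapshot)
        if value is not None and compare(value, threshold):
            hits.append(row)
    return hits

matches = query1(table, 2, 2, "mass", operator.gt, 1.0)
print(matches)
```

With the type as the key prefix, the particles of one type form a contiguous key range, and the snapshot is selected through the version rather than through the row key.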
Slide13
Query2 of Cosmology Dataset
Get all the particles added/destroyed between snapshots S1 and S2
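One way to read Query2 is as a set difference between the particles present in the two snapshots. The sketch below assumes the 3-D layout with the snapshot id as version and treats a particle as present in a snapshot if any of its properties has that version; the table contents are made up.

```python
# row "type-pid" -> property -> {snapshot id: value}
table = {
    "2-33446666": {"mass": {1: 0.5, 2: 0.7}},   # present in both snapshots
    "2-33550000": {"mass": {1: 1.5}},           # present only in snapshot 1
    "2-33559999": {"mass": {2: 2.1}},           # present only in snapshot 2
}

def present_in(table, snapshot):
    """Row keys that have at least one cell with the given version."""
    return {
        row
        for row, columns in table.items()
        if any(snapshot in versions for versions in columns.values())
    }

def query2(table, s1, s2):
    before, after = present_in(table, s1), present_in(table, s2)
    return after - before, before - after      # (added, destroyed)

added, destroyed = query2(table, 1, 2)
print(sorted(added), sorted(destroyed))
```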
Slide14
Query3 of Cosmology Dataset
Get the values of the property for the given set of particles across the selected snapshots.
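Query3 maps naturally onto the version dimension: for each requested row, read one column and keep only the selected versions. A minimal sketch, assuming the same toy 3-D layout as a nested dictionary with made-up values:

```python
# row "type-pid" -> property -> {snapshot id: value}
table = {
    "2-33446666": {"mass": {1: 0.5, 2: 0.7, 3: 0.9}},
    "2-33550000": {"mass": {1: 1.5, 3: 1.7}},
}

def query3(table, rows, prop, snapshots):
    """Return {row: {snapshot: value}} for the requested rows and snapshots."""
    result = {}
    for row in rows:
        versions = table.get(row, {}).get(prop, {})
        result[row] = {s: versions[s] for s in snapshots if s in versions}
    return result

history = query3(table, ["2-33446666", "2-33550000"], "mass", [1, 2])
print(history)
```

Each particle needs a single row lookup here, whereas a layout with the snapshot id in the row key would need one lookup per particle per snapshot.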
Slide15
Bixi Query
For a given list of stations and a time, get their average bike usage for the last 1, 2, 4, 8 and 16 days
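The Bixi query can be sketched over a toy table shaped like the day-sid schema: one entry per (day, station), holding usage values keyed by minute of the day. The sketch simplifies "a time" to a whole end day and uses made-up readings; both are assumptions for illustration.

```python
from datetime import date, timedelta

# (day, station id) -> {minute of day: bike usage}
table = {
    (date(2012, 9, 15), 1): {0: 4, 1: 6},
    (date(2012, 9, 14), 1): {0: 1, 1: 1},
    (date(2012, 9, 15), 2): {0: 8},
}

def average_usage(table, stations, end_day, windows=(1, 2, 4, 8, 16)):
    """Average usage per station over the last n days ending at end_day."""
    result = {}
    for sid in stations:
        result[sid] = {}
        for n in windows:
            values = []
            for back in range(n):
                day = end_day - timedelta(days=back)
                values.extend(table.get((day, sid), {}).values())
            # None marks a window with no readings at all.
            result[sid][n] = sum(values) / len(values) if values else None
    return result

averages = average_usage(table, [1, 2], date(2012, 9, 15))
print(averages[1])
print(averages[2])
```

Each window of n days reads n entries per station, which is the access pattern the day-sid row key is meant to serve.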
Slide16
Discussion
“Qualitative” versus “Quantitative” Suggestions
Dynamic Data versus Static Data
Historical Dataset versus Real-Time Datasets
Supported versus Non-Supported Datasets
Slide17
Conclusion
A 3-dimensional data model
Improved performance can be obtained from data schemas that use the version dimension of HBase
Fits “write-once, read-many” systems
Monitoring systems
Sensor-based systems
Version-based analysis
Slide18
Future Work
More evaluation of this data model
Scalability, elasticity, and utilization
How to design data models for other datasets
Spatial dataset
Graphic dataset
Slide19
Questions?
Thank you