Slide1
Big Data and NoSQL
BUS 782
Slide2
What is Big Data?
https://www.youtube.com/watch?v=c4BwefH5Ve8
Employee-generated data
User-generated data
Machine-generated data
Big Data Analytics: 11 Case Histories and Success Stories
https://www.youtube.com/watch?annotation_id=annotation_3535169775&feature=iv&src_vid=c4BwefH5Ve8&v=t4wtzIuoY0w
Slide3
Big Data
Data Size:
Gigabyte
Terabyte: e.g., a terabyte USB drive
Petabyte: Wal-Mart handles more than 1 million customer transactions every hour, feeding databases estimated at more than 2.5 petabytes
Exabyte: the traffic flowing over the internet amounts to about 700 exabytes annually
Zettabyte
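To make the scale of these units concrete, the following sketch converts the figures quoted above into bytes. It assumes decimal (SI) units, where each step is a factor of 1,000; the names UNITS and to_bytes are illustrative only.

```python
# Decimal (SI) byte units: each step up is a factor of 1,000.
# This is an assumption; binary units (factor 1,024) are also in common use.
UNITS = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}


def to_bytes(amount, unit):
    """Convert an amount expressed in the given unit to a number of bytes."""
    return amount * UNITS[unit]


# Figures quoted on the slide.
walmart_bytes = to_bytes(2.5, "PB")
internet_bytes_per_year = to_bytes(700, "EB")

# Average traffic per second implied by 700 EB per year.
seconds_per_year = 365 * 24 * 3600
internet_tb_per_second = internet_bytes_per_year / seconds_per_year / UNITS["TB"]

print(f"Wal-Mart databases: {walmart_bytes:.3e} bytes")
print(f"Internet traffic: about {internet_tb_per_second:.1f} TB per second")
```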
Slide4
Big Data: Some Facts
World’s information is doubling every two years
World generated 1.8 ZB of information in 2011
Cisco predicts that by 2016 global IP traffic will reach 1.3 zettabytes
There will be 19 billion networked devices by 2016
70% of this data is being generated by individuals as opposed to enterprises and organizations
Slide5
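The doubling claim under "Big Data: Some Facts" can be turned into a rough projection. The sketch below assumes the 1.8 ZB figure for 2011 as its starting point and a constant two-year doubling period; it extrapolates the slide's own figures and is not a forecast. The function name projected_zettabytes is hypothetical.

```python
def projected_zettabytes(year, base_year=2011, base_zb=1.8, doubling_years=2):
    """Volume in ZB if data keeps doubling every `doubling_years` from the base year."""
    return base_zb * 2 ** ((year - base_year) / doubling_years)


for year in (2011, 2013, 2015, 2021):
    print(year, round(projected_zettabytes(year), 1), "ZB")
```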
Big Data Sources
Web sites
Social media
Machine generated
RFID
Image, video, and audio
Etc.
Slide6
Big Data Challenges
Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.
"3Vs":
Volume: Size >= 30-50 TBs
Velocity: Processing speed
Variety:
Structured: able to fit in a database table
Unstructured: not able to fit in a database table (e.g., text, images, video)
Slide7
Do Companies care about Data?
Not really. What they care about are Key Performance Indicators (KPIs).
Some examples of KPIs are
Revenue
Profit
Revenue per customer/employee
Customer Attrition: the loss of clients or customers
Big Data is only useful if it helps drive KPIs
Slide8
Big Data to KPIs
Slide9
Applications
Text mining: deriving high-quality information from text.
text categorization, text clustering, concept/entity extraction, sentiment analysis, etc.
Web mining:
Web usage mining
Web content mining
Social media mining
Salesforce Radian6 Social Marketing Cloud
http://www.youtube.com/watch?v=EH1dcFh_-I4
Slide10
Advantages of Relational Databases
Well-defined database schema
Flexible query language
Maintain database consistency in business transactions:
Concurrent database processing with multiple users
Reading/updating
Locking
Slide11
Transaction ACID Properties
Atomic
Transaction cannot be subdivided
All or nothing
Consistent
Constraints that hold before the transaction still hold after the transaction
A transaction transforms a database from one consistent state to another consistent state.
Isolated
Transactions execute independently of one another.
Database changes not revealed to users until after transaction has completed
Durable
Database changes are permanent and must not be lost.
Slide12
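The "all or nothing" behavior listed under Transaction ACID Properties can be sketched with Python's built-in sqlite3 module. The account table, the names, and the amounts are hypothetical; the CHECK constraint stands in for a business rule that a balance may never go negative. If the second update violates the constraint, the first update is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between two accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a dict of name -> balance."""
    return dict(conn.execute("SELECT name, balance FROM account"))


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is rolled back
print(ok, failed, balances(conn))
```

In the failed transfer, the credit to bob is undone together with the rejected debit, so the database moves only from one consistent state to another.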
Problems with relational databases in managing Big Data
High overhead in maintaining database consistency
Do not support unstructured data search very well (e.g., Google-style searching)
Do not handle data in unexpected formats well
Don't scale well to very large databases:
Expensive "scale up": adding processors, storage
Slow query response time
Data must move to server
Server failure
Organizations such as Facebook, Yahoo, Google, and Amazon were among the first to decide that relational databases were not good solutions for the volumes and types of data that they were dealing with.
Slide13
What is needed in a new approach
Deal with data sizes never imagined before.
Hardware failure should be expected.
Data has gravity; compute has to move to the data.
Slide14
What is Hadoop?
Open source project of the Apache Software Foundation
Based on papers published by Google
Google File System (Oct 2003)
MapReduce (Dec 2004)
Consists of two core components
Hadoop Distributed File System (Storage)
MapReduce (Compute)
Slide15
How Hadoop fits in the new approach
Runs on a cluster of low-cost commodity servers, so it can accommodate petabytes of data cost-effectively.
Embraces partial failures
Data locality (computation on the local node where the data resides)
Scales horizontally
Scale out
Hadoop file is:
Distributed: a file is stored across many servers
Replicated: a file is replicated with many copies
Slide16
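The "distributed" and "replicated" properties of a Hadoop file can be illustrated with a toy placement function. This is a simplified sketch, not how HDFS actually chooses nodes (real placement is decided by the NameNode and takes racks into account). The 128 MB block size, the replication factor of 3, the node names, and the round-robin policy are all assumptions made for the example.

```python
def place_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Split a file into blocks and assign each block to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    return {
        # Round-robin starting point so that blocks spread across the cluster.
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(500, nodes)
for block, replicas in placement.items():
    print(block, replicas)
print(readable_after_failure(placement, "node1"))
```

Because every block lives on several servers, losing any single server leaves the whole file readable, which is what lets the cluster embrace partial failures.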
Hadoop HDFS: Hadoop Distributed File System
Based on GFS
Designed to store very large amounts of data (TBs and PBs) and much larger file sizes
Write-once, read-many-times access pattern
Designed to run on clusters of commodity hardware and does replication for reliability
Allows data to be read and processed locally
Supports limited operations on files (write, delete, append, and read) but no updates
Slide17
MapReduce: a programming model for distributed processing of data
Rather than take the conventional step of moving data over a network to be processed by software, MapReduce moves the processing software to the data.
Each node both stores and computes, and does its best to process local data.
MapReduce has two main phases:
Map
Reduce
Slide18
Example: Word Count
Slide19
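The word count example can be sketched in a few lines of plain Python that mimic the Map and Reduce phases, with a grouping step in between. This is a single-process illustration of the programming model, not Hadoop code; the function names and the three input splits are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map over every split, group by word, then reduce each group."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


splits = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
counts = word_count(splits)
print(counts)
```

In a real cluster each split would be mapped on the node that stores it, and only the small (word, count) pairs would travel over the network to the reducers.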
Hadoop Ecosystem
HBase – a column-oriented data store
Hive – provides a SQL-like query capability
Pig – a high-level language for creating MapReduce jobs
HCatalog – takes Hive's metadata and makes it available across the Hadoop ecosystem
Mahout – a library of algorithms for clustering, classification, and filtering
Sqoop – accelerates bulk loads of data between Hadoop and RDBMS
Flume – streams large volumes of log data from multiple sources into Hadoop
Slide20
NoSQL Database
NoSQL ("Not Only SQL") is a broad class of database management systems identified by non-adherence to the widely used relational database management system model.
They are useful when working with a huge quantity of data whose nature does not require a relational model.
Slide21
Types of NoSQL Databases
Column-oriented database
Example: Cassandra
Document-oriented database:
Example: MongoDB, CouchDB
Data stored in JSON (JavaScript Object Notation) format
Slide22
JSON, JavaScript Object Notation
http://www.w3schools.com/json/default.asp
JSON Example
{"employees":[
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"},
    {"firstName":"Peter", "lastName":"Jones"}
]}
Slide23
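As a small illustration of how an application consumes the employees JSON example, the sketch below parses it with Python's standard json module. The variable names are illustrative; the data is the employees example itself.

```python
import json

text = """{"employees":[
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"},
    {"firstName":"Peter", "lastName":"Jones"}
]}"""

# Parsing turns JSON objects into dicts and JSON arrays into lists.
data = json.loads(text)
full_names = [f'{e["firstName"]} {e["lastName"]}' for e in data["employees"]]
print(full_names)

# Serializing goes the other way, back to a JSON string.
round_trip = json.loads(json.dumps(data))
```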
Cassandra is essentially a key-value store.
This means that all data is stored in only one ‘table’, each row of which is uniquely identified by a key; the rows are shown below in a JSON representation.
https://blog.safaribooksonline.com/2012/12/11/modeling-data-in-cassandra/
{
  "user1": {
    "Bio": {
      "name": "Shaneeb Kamran",
      "age": 23
    }
  },
  "user2": {
    "Bio": {
      "name": "Salman ul Haq",
      "profession": "Developer"
    },
    "Education": {
      "bachelors": "NUST"
    }
  }
}
Slide24
Column Data Model
http://www.sinbadsoft.com/blog/cassandra-data-model-cheat-sheet/
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
A column is a key-value pair consisting of three elements:
1: Unique name: used to reference the column.
2: Value: the content of the column.
3: Timestamp: used to determine the valid content.
Column Family: A container for columns sorted by their names. Column Families are referenced and sorted by row keys.
Super Column: A sorted associative array of columns.
Example: Multi-value attribute
Super column family: A container for super columns sorted by their names. Super Column Families are referenced and sorted by row keys.
Keyspace: Top-level element. Container for column families.
Slide25
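The terms of the Column Data Model (column, column family, keyspace) can be mirrored in a toy in-memory structure. This is a conceptual sketch only and uses no Cassandra API; the class names, the "Users" column family, and the sample values are assumptions made for illustration. It shows columns kept sorted by name within a row and the timestamp deciding which value is the valid one.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Column:
    """A column: unique name, value, and the timestamp of the write."""
    name: str
    value: object
    timestamp: float = field(default_factory=time.time)


class ColumnFamily:
    """Rows referenced by row key; each row holds columns sorted by name."""

    def __init__(self):
        self.rows = {}

    def insert(self, row_key, column):
        row = self.rows.setdefault(row_key, {})
        current = row.get(column.name)
        # The newest timestamp wins; this is how the valid content is determined.
        if current is None or column.timestamp >= current.timestamp:
            row[column.name] = column

    def get_row(self, row_key):
        row = self.rows.get(row_key, {})
        return [row[name] for name in sorted(row)]


# A keyspace is the top-level container for column families.
keyspace = {"Users": ColumnFamily()}
users = keyspace["Users"]
users.insert("user1", Column("name", "Shaneeb Kamran", timestamp=1))
users.insert("user1", Column("age", 23, timestamp=1))
users.insert("user1", Column("age", 24, timestamp=2))  # newer write replaces the value
users.insert("user1", Column("age", 22, timestamp=0))  # stale write is ignored

for column in users.get_row("user1"):
    print(column.name, column.value, column.timestamp)
```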
Column Family
Super column family
Slide26
Slide27
Migrate a Relational Database Structure into a NoSQL Cassandra Structure
http://www.divconq.com/2010/migrate-a-relational-database-structure-into-a-nosql-cassandra-structure-part-i/
{
  "biologicalfeatures": {
    "forests": {
      "forest003": {
        "name": "Black Forest",
        "trees": "two million",
        "bushes": "three million"
      },
      "forest045": {
        "name": "100 Acre Woods",
        "trees": "four thousand",
        "bushes": "five thousand"
      },
      "forest127": {
        "name": "Lonely Grove",
        "trees": "none",
        "bushes": "one hundred"
      }
    },
    "famoustrees": {
      "tree12345": {
        "forestID": "forest003",
        "name": "Der Tree",
        "species": "Red Oak"
      },
      "tree12399": {
        "forestID": "forest045",
        "name": "Happy Hunny Tree",
        "species": "Willow"
      },
      "tree32345": {
        "forestID": "forest003",
        "name": "Das Ubertree",
        "species": "Blue Spruce"
      }
    }
  }
}
Slide28
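In the migrated forests and famoustrees structure, the forestID column plays the role a foreign key would play in a relational schema, but the store itself offers no JOIN, so the application has to follow the reference. The sketch below uses a plain Python dictionary as a stand-in for the keyspace (no Cassandra client is involved); the function name trees_in_forest is hypothetical.

```python
biologicalfeatures = {
    "forests": {
        "forest003": {"name": "Black Forest", "trees": "two million", "bushes": "three million"},
        "forest045": {"name": "100 Acre Woods", "trees": "four thousand", "bushes": "five thousand"},
        "forest127": {"name": "Lonely Grove", "trees": "none", "bushes": "one hundred"},
    },
    "famoustrees": {
        "tree12345": {"forestID": "forest003", "name": "Der Tree", "species": "Red Oak"},
        "tree12399": {"forestID": "forest045", "name": "Happy Hunny Tree", "species": "Willow"},
        "tree32345": {"forestID": "forest003", "name": "Das Ubertree", "species": "Blue Spruce"},
    },
}


def trees_in_forest(keyspace, forest_id):
    """Application-side 'join': find the famous trees that reference one forest."""
    forest_name = keyspace["forests"][forest_id]["name"]
    tree_names = sorted(
        tree["name"]
        for tree in keyspace["famoustrees"].values()
        if tree["forestID"] == forest_id
    )
    return forest_name, tree_names


print(trees_in_forest(biologicalfeatures, "forest003"))
```

Scanning every tree is only acceptable for a toy data set; a common alternative is to store the data again in a second structure keyed by forest, trading extra writes for fast reads.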
Document database: MongoDB
http://docs.mongodb.org/manual/core/data-modeling-introduction/
MongoDB stores business subjects in documents.
A document is the basic unit of data in MongoDB.
Documents are analogous to JSON objects but exist in the database in a more type-rich format known as BSON. BSON (Binary JSON) is a binary-encoded serialization of JSON-like documents.
The key design decision is the structure of MongoDB documents and how the application represents relationships between data: with references or with embedded documents.
Slide29
Example using reference
Slide30
Embedded Data Models
Slide31
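The two ways of representing relationships in MongoDB documents, references and embedded documents, can be contrasted with plain Python dictionaries standing in for BSON documents. The field names, the identifiers, and the e-mail address are hypothetical, and no MongoDB driver is used.

```python
# Referenced (normalized) model: related data lives in a separate document
# that points back to the user through its user_id field.
users = {"u1": {"_id": "u1", "username": "jdoe"}}
contacts = {"c1": {"_id": "c1", "user_id": "u1", "email": "jdoe@example.com"}}


def contact_for(user_id):
    """Resolve the reference with a second lookup, as the application must do."""
    return next(c for c in contacts.values() if c["user_id"] == user_id)


# Embedded (denormalized) model: the related data is nested in one document,
# so a single read returns everything.
user_embedded = {
    "_id": "u1",
    "username": "jdoe",
    "contact": {"email": "jdoe@example.com"},
}

email_via_reference = contact_for("u1")["email"]
email_via_embedding = user_embedded["contact"]["email"]
print(email_via_reference, email_via_embedding)
```

Embedding favors reads of data that is always used together; references avoid duplicating data that is shared by many documents.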
CouchDB
A CouchDB document is a JSON object that consists of named fields. Field values may be strings, numbers, dates, or even ordered lists and associative maps. An example of a document would be a blog post:
{
  "Subject": "I like Plankton",
  "Author": "Rusty",
  "PostedDate": "5/23/2006",
  "Tags": ["plankton", "baseball", "decisions"],
  "Body": "I decided today that I don't like baseball. I like plankton."
}
Slide32
Problems with NoSQL Databases
Do not support transaction consistency in the way relational database systems do.
There is no standard query language for NoSQL databases.
Slide33
NewSQL Databases
http://en.wikipedia.org/wiki/NewSQL
NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.
Slide34
Approaches of NewSQL Systems
1. Distributed cluster of shared-nothing nodes: each node owns a subset of the data. These databases include components such as distributed concurrency control and distributed query processing.
2. Transparent sharding: These systems provide a sharding middleware layer to automatically split databases across multiple nodes.
3. Highly optimized SQL engines
4. In-memory database
Slide35
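The idea behind transparent sharding (approach 2) is that a middleware layer decides which node holds each row, so the application reads and writes by key without knowing where the data lives. The sketch below is a minimal illustration under stated assumptions: the ShardRouter class, the shard names, and hash-modulo routing are all hypothetical, and real systems may use range-based or consistent-hashing schemes instead.

```python
import hashlib


class ShardRouter:
    """Toy middleware layer that routes each key to one of several shards."""

    def __init__(self, shard_names):
        self.names = list(shard_names)
        self.shards = {name: {} for name in self.names}  # each dict stands in for a node

    def shard_for(self, key):
        # A stable hash, so the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.names[int(digest, 16) % len(self.names)]

    def put(self, key, row):
        self.shards[self.shard_for(key)][key] = row

    def get(self, key):
        return self.shards[self.shard_for(key)].get(key)


router = ShardRouter(["shard0", "shard1", "shard2"])
for i in range(100):
    router.put(f"customer{i}", {"id": i})

print({name: len(rows) for name, rows in router.shards.items()})
print(router.get("customer42"))
```

The caller only ever sees put and get; which shard answered is hidden, which is what makes the sharding "transparent".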
In-Memory Database
An in-memory database is a database management system that primarily relies on main memory for computer data storage. It is contrasted with database management systems that employ a disk storage mechanism.
Main memory databases are faster than disk-optimized databases.
Good for Big Data analytics.
Can use non-volatile main memory modules that retain data even when electrical power is removed.
Slide36
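A quick way to try the in-memory database idea is SQLite's in-memory mode, available through Python's standard sqlite3 module. It is only a stand-in for the products discussed in these slides: the sales table and its rows are hypothetical, and this mode is volatile, so the data disappears when the connection is closed.

```python
import sqlite3

# ":memory:" asks SQLite to keep the whole database in RAM instead of in a file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 30.0)],
)

# An analytic query runs entirely against main memory; no disk I/O is involved.
totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
print(totals)
```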
SAP HANA, High-Performance Analytic Appliance
SAP HANA is an in-memory, column-oriented, relational database management system developed and marketed by SAP. HANA's architecture is designed to handle both high transaction rates and complex query processing on the same platform.
HANA's performance is reported to be up to 10,000 times faster than disk-based systems, which allows companies to analyze data in a matter of seconds instead of long hours.