Big Data and NoSQL BUS 782 - PowerPoint Presentation


Presentation Transcript

Slide1

Big Data and NoSQL

BUS 782

Slide2

What is Big Data?

https://www.youtube.com/watch?v=c4BwefH5Ve8

Employee-generated data

User-generated data

Machine-generated data

Big Data Analytics: 11 Case Histories and Success Stories

https://www.youtube.com/watch?annotation_id=annotation_3535169775&feature=iv&src_vid=c4BwefH5Ve8&v=t4wtzIuoY0w

Slide3

Big Data

Data Size:

Gigabyte

Terabyte: a terabyte USB drive

Petabyte: Wal-Mart handles more than 1 million customer transactions every hour, feeding databases estimated at more than 2.5 petabytes

Exabyte: the amount of traffic flowing over the internet is about 700 exabytes annually

Zettabyte
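To get a feel for these magnitudes, here is a quick back-of-the-envelope calculation in Python, a minimal sketch using the roughly 700 exabytes/year figure from this slide and decimal units:

    # Back-of-the-envelope: what does ~700 exabytes of internet traffic per year mean per day?
    EB = 10**18  # bytes in an exabyte (decimal units)
    TB = 10**12  # bytes in a terabyte

    annual_traffic = 700 * EB
    per_day_tb = annual_traffic / 365 / TB
    print(f"~{per_day_tb:,.0f} TB of internet traffic per day")  # ~1,917,808 TB/day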

Slide4

Big Data: Some Facts

World’s information is doubling every two years

World generated 1.8 ZB of information in 2011

Cisco predicts that by 2016 global IP traffic will reach 1.3 zettabytes

There will be 19 billion networked devices by 2016

70% of this data is being generated by individuals as opposed to enterprises & organizations

Slide5

Big Data Sources

Web sites

Social media

Machine generated

RFID

Image, video, and audio

Etc.

Slide6

Big Data Challenges

Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.

"3Vs":

Volume: Size >= 30-50 TBs

Velocity: Processing speed

Variety:

Structured data: able to fit in a database table

Unstructured data: e.g., text, images, video

Slide7

Do Companies care about Data?

Not really. What they care about are Key Performance Indicators (KPIs).

Some examples of KPIs are

Revenue

Profit

Revenue per customer/employee

Customer Attrition: the loss of clients or customers

Big Data is only useful if it helps drive KPIs.

Slide8

Big Data to KPIs

Slide9

Applications

Text mining: deriving high-quality information from text.

text categorization, text clustering, concept/entity extraction, sentiment analysis, etc.

Web mining:

Web usage mining

Web content mining

Social media mining

Salesforce Radian6 Social Marketing Cloud

http://www.youtube.com/watch?v=EH1dcFh_-I4

Slide10

Advantages of Relational Databases

Well-defined database schema

Flexible query language

Maintain database consistency in business transactions:

Concurrent database processing with multiple users

Reading/updating

Locking

Slide11

Transaction ACID Properties

Atomic

Transaction cannot be subdivided

All or nothing

Consistent

Constraints don’t change from before transaction to after transaction

A transaction transforms a database from one consistent state to another consistent state.

Isolated

Transactions execute independently of one another.

Database changes not revealed to users until after transaction has completed

Durable

Database changes are permanent and must not be lost.
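A minimal sketch of ACID behavior in practice, using Python's built-in sqlite3 module (the accounts table and the transfer amounts are invented for illustration):

    import sqlite3

    conn = sqlite3.connect("bank.db")  # hypothetical example database
    conn.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, balance REAL)")

    try:
        # Atomic: both updates belong to one transaction - all or nothing
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
        conn.commit()    # Durable: once committed, the changes are permanent
    except sqlite3.Error:
        conn.rollback()  # on failure, neither update survives; consistency is preserved
    finally:
        conn.close()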

Slide12

Problems with relational databases in managing Big Data

High overhead in maintaining database consistency

Do not support unstructured data search very well (i.e., Google-type searching)

Do not handle data in unexpected formats well

Don't scale well to very large databases:

Expensive "scale up": adding processors, storage

Slow query response time

Data must move to server

Server failure

Organizations such as Facebook, Yahoo, Google, and Amazon were among the first to decide that relational databases were not good solutions for the volumes and types of data that they were dealing with.

Slide13

What is needed in a new approach

Deal with data sizes never imagined before.

Hardware failure should be expected.

Data has gravity; compute has to move to the data.

Slide14

What is Hadoop?

Open source project by Apache Foundation

Based on papers published by Google

Google File System (Oct 2003)

MapReduce (Dec 2004)

Consists of two core components

Hadoop Distributed File System (Storage)

MapReduce (Compute)

Slide15

How Hadoop fits in the new approach

Runs on clusters of low-cost commodity servers, so it can accommodate petabytes of data cost-effectively.

Embraces partial failures

Data locality (computation on the local node where the data resides)

Horizontally scales ("scale out")

A Hadoop file is:

Distributed: a file is stored across many servers

Replicated: a file is replicated in many copies

Slide16

Hadoop HDFS: Hadoop Distributed File System

Based on GFS

Designed to store very large amounts of data (TBs and PBs) and much larger file sizes

Write-once, read-many-times access pattern

Designed to run on clusters of commodity hardware and does replication for reliability

Allows data to be read and processed locally

Supports limited operations on files - write, delete, append, and read, but no updates

Slide17

MapReduce: a programming model for distributed processing of data

Rather than take the conventional step of moving data over a network to be processed by software, MapReduce moves the processing software to the data.

Each node does both storage and compute, and does its best to process local data.

MapReduce has two main phases:

Map

Reduce

Slide18

Example: Word Count
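The original slide presents word count as a diagram. Below is a minimal single-machine sketch of the same two phases in Python; it only illustrates the programming model, not Hadoop's actual Java API:

    from itertools import groupby
    from operator import itemgetter

    def map_phase(line):
        # Map: emit a (word, 1) pair for every word in an input line
        return [(word.lower(), 1) for word in line.split()]

    def reduce_phase(word, counts):
        # Reduce: sum all counts emitted for the same word
        return (word, sum(counts))

    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    pairs = [pair for line in lines for pair in map_phase(line)]

    # Shuffle/sort: group the pairs by word before reducing (Hadoop does this between the phases)
    pairs.sort(key=itemgetter(0))
    counts = [reduce_phase(word, (c for _, c in group))
              for word, group in groupby(pairs, key=itemgetter(0))]
    print(counts)  # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]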

Slide19

Hadoop Ecosystem

HBase: a column-oriented data store

Hive: provides a SQL-like query capability

Pig: a high-level language for creating MapReduce jobs

HCatalog: takes Hive's metadata and makes it available across the Hadoop ecosystem

Mahout: a library of algorithms for clustering, classification, and filtering

Sqoop: accelerates bulk loads of data between Hadoop and RDBMSs

Flume: streams large volumes of log data from multiple sources into Hadoop

Slide20

NoSQL Database

NoSQL ("Not Only SQL") is a broad class of database management systems identified by non-adherence to the widely used relational database management system model.

They are useful when working with a huge quantity of data whose nature does not require a relational model.

Slide21

Types of NoSQL Databases

Column-oriented database

Example: Cassandra

Document-oriented database:

Example: MongoDB, CouchDB

Data stored in JSON (JavaScript Object Notation) format

Slide22

JSON: JavaScript Object Notation

http://www.w3schools.com/json/default.asp

JSON Example

{"employees":[

{"

firstName

":"John", "

lastName

":"Doe"},

{"

firstName

":"Anna", "

lastName

":"Smith"},

{"

firstName

":"Peter", "

lastName

":"Jones"}

]}Slide23
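JSON maps directly onto native data structures in most languages. As an illustration, Python's standard json module parses the document above into dictionaries and lists:

    import json

    text = ('{"employees": ['
            '{"firstName": "John", "lastName": "Doe"},'
            '{"firstName": "Anna", "lastName": "Smith"},'
            '{"firstName": "Peter", "lastName": "Jones"}]}')

    data = json.loads(text)                   # parse JSON text into Python objects
    print(data["employees"][1]["firstName"])  # Anna
    print(json.dumps(data, indent=2))         # serialize back to formatted JSON text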

Slide23

Cassandra is essentially a key-value store.

This means that all data is stored in a single 'table', each row of which is uniquely identified by a key; a row's contents can be represented in JSON.

https://blog.safaribooksonline.com/2012/12/11/modeling-data-in-cassandra/

{
    "user1": {
        "Bio": {
            "name": "Shaneeb Kamran",
            "age": 23
        }
    },
    "user2": {
        "Bio": {
            "name": "Salman ul Haq",
            "profession": "Developer"
        },
        "Education": {
            "bachelors": "NUST"
        }
    }
}

Slide24

Column Data Model

http://www.sinbadsoft.com/blog/cassandra-data-model-cheat-sheet/

http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/

A column is a key-value pair consisting of three elements:

1. Unique name: used to reference the column

2. Value: the content of the column

3. Timestamp: used to determine the valid content

Column Family: a container for columns sorted by their names. Column Families are referenced and sorted by row keys.

Super Column: a sorted associative array of columns. Example: a multi-value attribute.

Super Column Family: a container for super columns sorted by their names. Super Column Families are referenced and sorted by row keys.

Keyspace: the top-level element; a container for column families.
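One way to picture these elements is as nested maps: a keyspace maps column families to rows, and each row maps column names to (value, timestamp) pairs. A minimal Python sketch of that mental model (the names are illustrative, not the Cassandra API):

    import time

    # keyspace -> column family -> row key -> column name -> (value, timestamp)
    keyspace = {
        "Users": {                                         # column family
            "user1": {                                     # row key
                "name": ("Shaneeb Kamran", time.time()),   # column: name -> (value, timestamp)
                "age": ("23", time.time()),
            },
        },
    }

    def read_column(column_family, row_key, column_name):
        # The timestamp is what lets Cassandra pick the newest (valid) copy of a column
        value, timestamp = keyspace[column_family][row_key][column_name]
        return value

    print(read_column("Users", "user1", "name"))  # Shaneeb Kamran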

Slide25

Column Family

Super column family

Slide26
Slide27

Migrate a Relational Database Structure into a NoSQL Cassandra Structure

http://www.divconq.com/2010/migrate-a-relational-database-structure-into-a-nosql-cassandra-structure-part-i/

{
    "biologicalfeatures": {
        "forests": {
            "forest003": {
                "name": "Black Forest",
                "trees": "two million",
                "bushes": "three million"
            },
            "forest045": {
                "name": "100 Acre Woods",
                "trees": "four thousand",
                "bushes": "five thousand"
            },
            "forest127": {
                "name": "Lonely Grove",
                "trees": "none",
                "bushes": "one hundred"
            }
        },
        "famoustrees": {
            "tree12345": {
                "forestID": "forest003",
                "name": "Der Tree",
                "species": "Red Oak"
            },
            "tree12399": {
                "forestID": "forest045",
                "name": "Happy Hunny Tree",
                "species": "Willow"
            },
            "tree32345": {
                "forestID": "forest003",
                "name": "Das Ubertree",
                "species": "Blue Spruce"
            }
        }
    }
}
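As a sketch of the migration itself, the following Python snippet folds two relational-style tables into the nested key-value layout above (the rows are invented to match the example). Note that the forestID foreign key survives only as a plain value that the application must resolve itself:

    # Rows as they might come from two relational tables
    forest_rows = [
        ("forest003", "Black Forest", "two million", "three million"),
        ("forest045", "100 Acre Woods", "four thousand", "five thousand"),
    ]
    tree_rows = [
        ("tree12345", "forest003", "Der Tree", "Red Oak"),
        ("tree12399", "forest045", "Happy Hunny Tree", "Willow"),
    ]

    # Fold each table into a map keyed by its primary key
    biologicalfeatures = {
        "forests": {fid: {"name": n, "trees": t, "bushes": b}
                    for fid, n, t, b in forest_rows},
        "famoustrees": {tid: {"forestID": fid, "name": n, "species": s}
                        for tid, fid, n, s in tree_rows},
    }

    # The foreign key is now just a value; the application resolves it itself
    tree = biologicalfeatures["famoustrees"]["tree12345"]
    print(biologicalfeatures["forests"][tree["forestID"]]["name"])  # Black Forest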

Slide28

Document database: MongoDB

http://docs.mongodb.org/manual/core/data-modeling-introduction/

MongoDB stores business subjects in documents. A document is the basic unit of data in MongoDB.

Documents are analogous to JSON objects but exist in the database in a more type-rich format known as BSON (Binary JSON), a binary-encoded serialization of JSON-like documents.

The structure of MongoDB documents determines how the application represents relationships between data: references and embedded documents.

Slide29

Example using reference

Slide30

Embedded Data Models
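Slides 29 and 30 show these two patterns as diagrams. A minimal sketch of the same idea as plain documents (the field names are illustrative): embedding favors single-document reads and atomic updates, while references avoid duplicating data that is shared or unbounded.

    # Referencing: normalized, like a foreign key; reading the author takes a second lookup
    author_doc = {"_id": "user1", "name": "Shaneeb Kamran"}
    post_doc = {"_id": "post9", "author_id": "user1", "text": "Hello"}  # reference to user1

    # Embedding: denormalized; the related data comes back in a single document read
    author_embedded = {
        "_id": "user1",
        "name": "Shaneeb Kamran",
        "posts": [{"text": "Hello"}, {"text": "Second post"}],  # embedded documents
    }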

Slide31

CouchDB

A CouchDB document is a JSON object that consists of named fields. Field values may be strings, numbers, dates, or even ordered lists and associative maps. An example of a document would be a blog post:

{
    "Subject": "I like Plankton",
    "Author": "Rusty",
    "PostedDate": "5/23/2006",
    "Tags": ["plankton", "baseball", "decisions"],
    "Body": "I decided today that I don't like baseball. I like plankton."
}
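CouchDB serves documents over a plain HTTP API. A minimal sketch using the Python requests library, assuming a CouchDB server on its default port 5984 and a hypothetical database named "blog":

    import requests

    doc = {
        "Subject": "I like Plankton",
        "Author": "Rusty",
        "PostedDate": "5/23/2006",
        "Tags": ["plankton", "baseball", "decisions"],
        "Body": "I decided today that I don't like baseball. I like plankton.",
    }

    requests.put("http://localhost:5984/blog")  # create the database (one-time)
    resp = requests.put("http://localhost:5984/blog/post001", json=doc)  # store the document
    print(resp.json())  # e.g. {'ok': True, 'id': 'post001', 'rev': '1-...'}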

Slide32

Problems with NoSQL Databases

Do not support transaction consistency the way relational database systems do.

There is no standard query language for NoSQL databases.

Slide33

NewSQL Databases

http://en.wikipedia.org/wiki/NewSQL

NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance as NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.

Slide34

Approaches of NewSQL Systems

1. Distributed

cluster of shared-nothing

nodes:

node owns a subset of the data. These databases

include

components such as distributed concurrency

control and

distributed query processing.

2. Transparent

sharding

:

These systems provide a

sharding

middleware layer to automatically split databases across multiple nodes

.

3. Highly optimized SQL engines

4. In-memory databaseSlide35
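A minimal sketch of the core idea behind the sharding middleware in approach 2 (the node names and the modulo routing are illustrative):

    import hashlib

    NODES = ["db-node-0", "db-node-1", "db-node-2"]  # hypothetical shard servers

    def shard_for(key: str) -> str:
        # Hash the row key and map it onto one of the nodes
        digest = hashlib.md5(key.encode()).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    # The middleware routes each statement transparently; the client sees one logical database
    print(shard_for("customer:1001"))  # always the same node for the same key
    print(shard_for("customer:1002"))

Production systems typically use consistent hashing or range partitioning rather than a plain modulo, so that adding a node does not remap every key.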

Slide35

In-Memory Database

An in-memory database is a database management system that primarily relies on main memory for computer data storage. It is contrasted with database management systems that employ a disk storage mechanism.

Main memory databases are faster than disk-optimized databases.

Good for Big Data analytics.

Can use non-volatile main memory modules that retain data even when electrical power is removed.
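SQLite's special ":memory:" path gives a quick feel for the idea: the whole database lives in RAM and vanishes when the connection closes. A minimal sketch (real in-memory systems such as HANA add persistence, columnar storage, and scale):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # the database lives entirely in main memory
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("east", 120.0), ("west", 75.5), ("east", 42.0)])

    # The analytic query runs against RAM, with no disk I/O
    for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
        print(row)  # e.g. ('east', 162.0)

    conn.close()  # without non-volatile memory or snapshots, the data is now gone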

Slide36

SAP HANA: High-Performance Analytic Appliance

SAP HANA is an in-memory, column-oriented, relational database management system developed and marketed by SAP. HANA's architecture is designed to handle both high transaction rates and complex query processing on the same platform.

SAP claims HANA's in-memory processing is up to 10,000 times faster than standard disk-based systems, allowing companies to analyze data in a matter of seconds instead of long hours.