/
Data Warehousing COMP3017 Advanced Databases Data Warehousing COMP3017 Advanced Databases

Data Warehousing COMP3017 Advanced Databases - PowerPoint Presentation

yoshiko-marsland
yoshiko-marsland . @yoshiko-marsland
Follow
379 views
Uploaded On 2018-03-10

Data Warehousing COMP3017 Advanced Databases - PPT Presentation

Dr Nicholas Gibbins nmgecssotonacuk 20122013 Processing Styles OLTP 2 OnLine Transaction Processing Traditional workloads bread and butter processing Volumes of data ID: 646009

code data time product data code product time analysis warehouse multidimensional processing month line olap account performance decision brand

Share:

Link:

Embed:

Download Presentation from below link

Download Presentation The PPT/PDF document "Data Warehousing COMP3017 Advanced Datab..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

Slide1

Data Warehousing

COMP3017 Advanced Databases

Dr

Nicholas Gibbins –

nmg@ecs.soton.ac.uk

2012-2013Slide2

Processing Styles – OLTP

2

On-Line Transaction Processing

Traditional workloads

,

bread

and

butter

processing

Volumes of data

, transactions

grow

, networks

getting larger

.Slide3

Processing Styles – OLAP

3

On-Line Analytical Processing

includes

the use of data warehouses

multidimensional

databases

data analysisSlide4

Online Analytical Processing

4

OLAP is the name given to the dynamic enterprise analysis required

to create

,

manipulate, animate and synthesise

information

from exegetical,

contemplative

and formulaic

data analysis

modelsSlide5

Online Analytical Processing

5

OLAP is the name given to the dynamic enterprise analysis required

to create

,

manipulate, animate and synthesise

information

from

exegetical

,

contemplative

and formulaic

data analysis

models

Exegesis: critical explanation

How did we get to where we are?Slide6

Online Analytical Processing

6

OLAP is the name given to the dynamic enterprise analysis required

to create

,

manipulate, animate and synthesise

information

from exegetical,

contemplative

and formulaic

data analysis

models

Asking ‘what if?’ questions

How does the outcome change if

we vary the parameters?Slide7

Online Analytical Processing

7

OLAP is the name given to the dynamic enterprise analysis required

to create

,

manipulate, animate and synthesise

information

from exegetical,

contemplative

and

formulaic

data analysis

models

Which parameters must be varied

in order to achieve a given outcome?Slide8

Multidimensional conceptual view

Transparency

Accessibility

Consistent reporting performance

Client-server architecture

Generic dimensionality

Dynamic sparse matrix handling

Multi-user support

Unrestricted cross-dimensional operations

Intuitive data manipulation

Flexible reporting

Unlimited dimensions and aggregation levels

12 Rules for OLAP

8Slide9

Data Mining

9

Data mining

is the process of discovering hidden patterns and relations in large databases using a variety of advanced analytical techniques

Data mining attempts to use the computer to discover relationships that can be used to make predictions

Data mining tools often find unsuspected relationships in data that other techniques will overlookSlide10

Data Mining Approaches

10

Rule-based analysis

Neural networks

Fuzzy Logic

K-nearest-neighbour

Genetic algorithms

Advanced visualisation

Combination of any of the above Slide11

11

A

data

w

arehouse

is a subject-oriented, integrated,

time

-

variant, non-volatile collection of data that is used primarily in organisational decision making

The Data WarehouseSlide12

12

A

data

w

arehouse

is a

subject-oriented

, integrated,

time

-

variant, non-volatile collection of data that is used primarily in organisational decision making

The Data Warehouse

The data

is organised according to subject

instead of application and contains

only

the information necessary for ‘decision support’ processing.Slide13

13

A

data

w

arehouse

is a subject-oriented,

integrated

,

time

-

variant, non-volatile collection of data that is used primarily in organisational decision making

The Data Warehouse

Data

encoding is made uniform

(e.g.

sex = f or m

, 1 or 2, b or g - needs to be all the same in the warehouse).Data naming is made consistent.Slide14

14

A

data

w

arehouse

is a subject-oriented, integrated,

time

-

variant

, non-volatile collection of data that is used primarily in organisational decision making

The Data Warehouse

Data

is collected over

time and can

then

be used

for comparisons, trends and forecastingSlide15

15

A

data

w

arehouse

is a subject-oriented, integrated,

time

-

variant,

non-volatile

collection of data that is used primarily in organisational decision making

The Data Warehouse

The

data is not updated or changed once in

the data warehouse, but is

simply loaded, and then accessed.

The data warehouse is held quite separately from the operational database, which supports OLTP.Slide16

Why a Separate Data Warehouse?

16

Performance

Operational databases are optimised to support known transactions and workloads

Special

data organisation, access methods and implementation methods are needed

Complex

OLAP queries would degrade performance for operational transactionsSlide17

Why a Separate Data Warehouse?

17

Missing data

Decision support requires historical data, which operational databases do not typically maintain

Data consolidation

Decision support requires consolidation (aggregation, summarisation) of data from many heterogeneous sources, including operational databases and external sources

Data quality

Different

sources typically use inconsistent data representations, codes and formats, which have to be

reconciledSlide18

Extracting Data

18

Car

Policy

Claim

OPERATIONAL

DATA WAREHOUSE

Customers

replace

change

insert

change

replace

insert

Liability

Risk

EIS

DSS

Analysis

Etc

- Data is cleansed

- Data is restructuredSlide19

The Data Warehouse

19

A Data Warehouse may be realised:

via a front end to existing databases and files

in a fresh relational database

in a multidimensional database (MDDB)

in a proprietary database format

using a mixture of the aboveSlide20

The Data Warehouse

20

Data may be accessed in various ways:

Decision Support Systems (DSS)

Executive Information Systems (EIS)

Data Mining

On-Line Analytical ProcessingSlide21

Data Marts

21

A data mart focuses on

only one subject area, or

only one group of users

An organisation can have

one enterprise data warehouse

many data marts

Data marts do not contain operational data

Data marts are more easily understood and navigatedSlide22

Multidimensional Analysis

22

Need to examine data in various ways

P

roduce views of multidimensional data for users:

Slice

Dice

Pivot

Drill

down

Roll upSlide23

Multidimensional Analysis – Slice

23

Operator

Performance

Time

One

operator's

performance

over time

Overall performance

on a particular day

Overall performance

for one criterion

over time

Train Performance

- 3 dimensions

- Operators

- Performance

- TimeSlide24

Multidimensional Analysis – Dice

24

10

50

20

12

15

10

Juice

Cola

Milk

Cream

Toothpaste

Soap

1 2 3 4 5 6

Month

N

W

S

Region

ProductSlide25

Multidimensional Analysis – Pivot

25

10

50

20

12

15

10

Juice

Cola

Milk

Cream

Toothpaste

Soap

1 2 3 4 5 6

Month

N

W

S

Region

Product

Product

Month

RegionSlide26

Multidimensional Analysis – Drill Down

26

10

50

20

12

15

10

Juice

Cola

Milk

Cream

Toothpaste

Soap

1 2 3 4 5 6

Month

N

W

S

Region

Product

by brandSlide27

Multidimensional Analysis – Roll Up

27

80

37

Beverages

Toiletries

1 2 3 4 5 6

Month

N

W

S

Region

ProductSlide28

Internal Aspects

28

Schemas

Star schema

Snowflake schema

Fact constellation schema

Aggregated data

Specialised indexes

Bit map

indexes (see lecture on multidimensional indexes)

Join indexes

Specialised

join

methodsSlide29

Star Schema

29

Time Code

Quarter Code

Quarter Name

Date

Month Code

Month Name

Day Code

Day of Week

Season

Geography Code

Region Code

Region Manager

City Code

City Name

Post Code

Account Code

Key Account Code

Key Account Name

Account Name

Account Type

Account Market

Product Code

Product Name

Brand Manager

Brand Name

Prod Line Code

Prod Line Name

Prod Line Mgr

Product Name

Product Colour

Product Model No

Geography Code

Time Code

Account Code

Product Code

Sterling Amount

Units

Time

Sales

Account

ProductSlide30

Fact Tables

k

ey

columns joining fact table

to the dimension tables

n

umerical

m

easures

Prod_Code

Time_Code

Acct_Code

Sales

Qty

101

2045

501

100

1

102

2045

501

225

2

103

2046

501

200

20

104

2046

502

250

25

105

2046

502

20

1

30Slide31

Part of a Snowflake Schema

31

Time Code

Year Code

Quarter Code

Month Code

Day Code

Product Code

Prod Line Code

Brand Code

Geography Code

Time Code

Account Code

Product Code

Sterling Amount

Units

Time

Sales

Product

Quarter Code

Quarter Name

Quarter

Month Code

Month Name

Month

Day Code

Day of Week

Season

Day

Product Name

Product Colour

Product Model No

Product

Desc

Brand Code

Brand Manager

Brand Name

Brand

Prod Line Code

Prod Line Name

Prod Line Mgr

Product LineSlide32

Data Warehouse Databases

32

Relational and Specialised RDBMSs

Specialised indexing techniques, join and scan methods

Relational OLAP (ROLAP) servers

Explicitly developed to use a relational engine to support OLAP

Include aggregation navigation logic, the ability to generate multi-statement SQL, and other additional services

Multidimensional OLAP (MOLAP) servers

The storage model is an n-dimensional array

May use a 2-level approach, with 2-D dense arrays indexed by B-Trees

Time is often one of the dimensions