Presentation Transcript

Slide 1: DATA MINING: Introductory and Advanced Topics, Part III
© Prentice Hall
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Companion slides for the text by Dr. M. H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2002.

Slide 2: Data Mining Outline
PART I: Introduction, Related Concepts, Data Mining Techniques
PART II: Classification, Clustering, Association Rules
PART III: Web Mining, Spatial Mining, Temporal Mining

Slide 3: Web Mining Outline
Goal: Examine the use of data mining on the World Wide Web.
Introduction
Web Content Mining
Web Structure Mining
Web Usage Mining

Slide 4: Web Mining Issues
Size: more than 350 million pages (1999), growing at about 1 million pages a day; Google indexes 3 billion documents.
Diverse types of data.

Slide 5: Web Data
Web pages
Intra-page structures
Inter-page structures
Usage data
Supplemental data: profiles, registration information, cookies

Slide 6: Web Mining Taxonomy
Modified from [zai01].

Slide 7: Web Content Mining
Extends the work of basic search engines.
Search engines:
IR application
Keyword based
Similarity between query and document
Crawlers
Indexing
Profiles
Link analysis

Slide 8: Crawlers
A robot (spider) traverses the hypertext structure of the Web.
Collects information from visited pages.
Used to construct indexes for search engines.
Traditional crawler: visits the entire Web (?) and replaces the index.
Periodic crawler: visits portions of the Web and updates a subset of the index.
Incremental crawler: selectively searches the Web and incrementally modifies the index.
Focused crawler: visits pages related to a particular subject.

Slide 9: Focused Crawler
Only visits links from a page if that page is determined to be relevant.
The classifier is static after the learning phase.
Components:
Classifier: assigns a relevance score to each page based on the crawl topic.
Distiller: identifies hub pages, which contain links to many relevant pages.
Crawler: visits pages based on classifier and distiller scores.

Slide 10: Focused Crawler
The classifier relates documents to topics.
The classifier also determines how useful outgoing links are.
Hub pages contain links to many relevant pages and must be visited even if they do not have a high relevance score.

Slide 11: Focused Crawler

Slide 12: Context Focused Crawler
Context graph:
A context graph is created for each seed document.
The root is the seed document.
Nodes at each level show documents with links to documents at the next higher level.
Updated during the crawl itself.
Approach:
Construct the context graph and classifiers using seed documents as training data.
Perform crawling using the classifiers and the context graph created.

Slide 13: Context Graph

Slide 14: Harvest System
Based on the use of caching, indexing, and crawling.
Harvest is a set of tools that facilitate gathering of information from diverse sources.
Gatherers: obtain information for indexing from an Internet service provider.
Brokers: provide the index and query interface.
Gatherers use the Essence system to assist in collecting data.
Essence: classifies documents by creating a semantic index.

Slide 15: Virtual Web View
Multiple Layered DataBase (MLDB) built on top of the Web.
Each layer of the database is more generalized (and smaller) and more centralized than the one beneath it.
Upper layers of the MLDB are structured and can be accessed with SQL-type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to place in the first layer of the MLDB.
Higher levels contain more summarized data obtained through generalizations of the lower levels.

Slide 16: Personalization
Web access or contents tuned to better fit the desires of each user.
Manual techniques identify a user's preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content-based filtering retrieves pages based on similarity between pages and user profiles.

Slide 17: Web Structure Mining
Mine the structure (links, graph) of the Web.
Techniques: PageRank, CLEVER.
Create a model of the Web organization.
May be combined with content mining to more effectively retrieve important pages.

Slide 18: PageRank
Used by Google.
Prioritizes pages returned from a search by looking at Web structure.
The importance of a page is calculated based on the number of pages which point to it, called backlinks.
Weighting is used to give more importance to backlinks coming from important pages.

Slide 19: PageRank (cont'd)
PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)
PR(i): PageRank of page i, which points to target page p.
Ni: number of links coming out of page i.
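The formula above can be sketched as a short iterative computation. This is an illustrative sketch only: the link graph, the value c = 0.85, and the added (1 - c)/N damping term (which the slide's simplified formula omits) are assumptions, not part of the original slides.

```python
def pagerank(links, c=0.85, iterations=50):
    """Iteratively compute a simplified PageRank.
    links maps each page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            # sum PR(i)/Ni over all pages i with a backlink to p
            backlink_sum = sum(pr[i] / len(links[i])
                               for i in links if p in links[i])
            new[p] = (1 - c) / len(pages) + c * backlink_sum
        pr = new
    return pr
```

With links = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}, page c receives backlinks from both a and b and therefore ranks above b.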

Slide 20: CLEVER
Identifies authoritative and hub pages.
Authoritative pages: highly important pages; the best source for requested information.
Hub pages: contain links to highly important pages.

Slide 21: HITS
Hyperlink-Induced Topic Search.
Based on a set of keywords, find a set of relevant pages, R.
Identify hub and authority pages for these.
Expand R to a base set, B, of pages linked to or from R.
Calculate weights for authorities and hubs.
Pages with the highest ranks in R are returned.
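The weight-calculation step can be sketched as the usual mutually reinforcing update: a page's authority weight is the sum of the hub weights of pages pointing to it, and a page's hub weight is the sum of the authority weights of the pages it points to. The graph and iteration count here are hypothetical, and normalization details vary between presentations of the algorithm.

```python
import math

def hits(graph, iterations=20):
    """Compute hub and authority weights for a link graph.
    graph maps each page to the pages it links to (the base set B)."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority weight: sum of hub weights of pages linking to p
        auth = {p: sum(hub[i] for i in graph if p in graph[i]) for p in pages}
        # hub weight: sum of authority weights of pages p links to
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        # normalize so the weights stay bounded
        a_norm = math.sqrt(sum(v * v for v in auth.values()))
        h_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth
```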

Slide 22: HITS Algorithm

Slide 23: Web Usage Mining

Slide 24: Web Usage Mining Applications
Personalization
Improve the structure of a site's Web pages
Aid in caching and prediction of future page references
Improve the design of individual pages
Improve the effectiveness of e-commerce (sales and advertising)

Slide 25: Web Usage Mining Activities
Preprocessing the Web log:
Cleanse: remove extraneous information.
Sessionize. Session: sequence of pages referenced by one user at a sitting.
Pattern discovery:
Count patterns that occur in sessions.
A pattern is a sequence of page references in a session.
Similar to association rules:
Transaction: session
Itemset: pattern (or subset)
Order is important
Pattern analysis

Slide 26: ARs in Web Mining
Web mining: content, structure, usage.
Frequent patterns of sequential page references in Web searching.
Uses:
Caching
Clustering users
Developing user profiles
Identifying important pages

Slide 27: Web Usage Mining Issues
Identification of the exact user is not possible.
The exact sequence of pages referenced by a user is not obtainable, due to caching.
Sessions are not well defined.
Security, privacy, and legal issues.

Slide 28: Web Log Cleansing
Replace the source IP address with a unique but non-identifying ID.
Replace the exact URL of pages referenced with a unique but non-identifying ID.
Delete error records and records containing non-page data (such as figures and code).

Slide 29: Sessionizing
Divide the Web log into sessions.
Two common techniques:
Consecutive page references from a source IP address occurring within a predefined time interval (e.g., 25 minutes).
All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.
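The first technique can be sketched as below; the log format (ip, timestamp, page) and the 25-minute default are assumptions for illustration. The second technique would instead compare each reference with the previous one.

```python
def sessionize(log, timeout=25 * 60):
    """Split a Web log into sessions: consecutive references from the
    same source IP are grouped while they fall within `timeout` seconds
    of the session start (the first technique above).
    log is a list of (ip, timestamp, page) tuples sorted by timestamp."""
    sessions = {}  # ip -> list of that user's sessions, in order
    for ip, ts, page in log:
        user = sessions.setdefault(ip, [])
        # open a new session if none exists or the time window has expired
        if not user or ts - user[-1]['start'] > timeout:
            user.append({'start': ts, 'pages': []})
        user[-1]['pages'].append(page)
    return [s['pages'] for u in sessions.values() for s in u]
```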

Slide 30: Data Structures
Keep track of patterns identified during the Web usage mining process.
Common techniques:
Trie
Suffix tree
Generalized suffix tree
WAP tree

Slide 31: Trie vs. Suffix Tree
Trie:
Rooted tree.
Edges labeled with a character (page) from the pattern.
A path from the root to a leaf represents a pattern.
Suffix tree:
A single child is collapsed with its parent; the edge carries the labels of both prior edges.
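A minimal trie over page-reference patterns, as described above, might look like this. The nested-dictionary representation and the '$' end-of-pattern marker are implementation choices, not from the slides.

```python
def build_trie(patterns):
    """Build a trie over page-reference patterns: each edge is one
    page, and a path from the root spells out a pattern."""
    root = {}
    for pattern in patterns:
        node = root
        for page in pattern:
            node = node.setdefault(page, {})
        node['$'] = True  # mark the end of a complete pattern
    return root

def contains(trie, pattern):
    """True if `pattern` was inserted as a complete pattern."""
    node = trie
    for page in pattern:
        if page not in node:
            return False
        node = node[page]
    return '$' in node
```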

Slide 32: Trie and Suffix Tree

Slide 33: Generalized Suffix Tree
Suffix tree for multiple sessions; contains patterns from all sessions.
Maintains a count of the frequency of occurrence of a pattern in the node.
WAP tree: compressed version of the generalized suffix tree.

Slide 34: Types of Patterns
Algorithms have been developed to discover different types of patterns.
Properties:
Ordered: characters (pages) must occur in the exact order of the original session.
Duplicates: duplicate characters are allowed in the pattern.
Consecutive: all characters in the pattern must occur consecutively in the given session.
Maximal: not a subsequence of another pattern.

Slide 35: Pattern Types
Association rules: none of the properties hold.
Episodes: only ordering holds.
Sequential patterns: ordered and maximal.
Forward sequences: ordered, consecutive, and maximal.
Maximal frequent sequences: all properties hold.

Slide 36: Episodes
Partially ordered set of pages.
Serial episode: totally ordered with a time constraint.
Parallel episode: partially ordered with a time constraint.
General episode: partially ordered with no time constraint.

Slide 37: DAG for Episode

Slide 38: Spatial Mining Outline
Goal: Provide an introduction to some spatial mining techniques.
Introduction
Spatial Data Overview
Spatial Data Mining Primitives
Generalization/Specialization
Spatial Rules
Spatial Classification
Spatial Clustering

Slide 39: Spatial Object
Contains both spatial and nonspatial attributes.
Must have a location-type attribute: latitude/longitude, zip code, or street address.
An object may be retrieved using spatial attributes, nonspatial attributes, or both.

Slide 40: Spatial Data Mining Applications
Geology
GIS systems
Environmental science
Agriculture
Medicine
Robotics
May involve both spatial and temporal aspects.

Slide 41: Spatial Queries
Spatial selection may involve specialized selection comparison operations:
Near
North, south, east, west
Contained in
Overlap/intersect
Region (range) query: find objects that intersect a given region.
Nearest neighbor query: find objects close to an identified object.
Distance scan: find objects within a certain distance of an identified object, where the distance is made increasingly larger.

Slide 42: Spatial Data Structures
Data structures designed specifically to store or index spatial data.
Often based on the B-tree or binary search tree.
Cluster data on disk based on geographic location.
May represent a complex spatial structure by placing the spatial object in a containing structure of a specific geographic shape.
Techniques:
Quad tree
R-tree
k-D tree

Slide 43: MBR
Minimum bounding rectangle: the smallest rectangle that completely contains the object.
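The MBR definition can be illustrated for an object given by its vertices; representing the object as a list of (x, y) points is an assumption made for the sketch.

```python
def mbr(points):
    """Minimum bounding rectangle of a spatial object given as
    (x, y) vertices: the smallest axis-aligned rectangle that
    completely contains the object, returned as its lower-left
    and upper-right corners."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))
```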

Slide 44: MBR Examples

Slide 45: Quad Tree
Hierarchical decomposition of the space into quadrants (MBRs).
Each level in the tree represents the object as the set of quadrants which contain any portion of the object.
Each level is a more exact representation of the object.
The number of levels is determined by the degree of accuracy desired.

Slide 46: Quad Tree Example

Slide 47: R-Tree
As with the quad tree, the region is divided into successively smaller rectangles (MBRs).
Rectangles need not be of the same size or number at each level.
Rectangles may actually overlap.
The lowest-level cell has only one object.
Tree maintenance algorithms are similar to those for B-trees.

Slide 48: R-Tree Example

Slide 49: k-D Tree
Designed for multi-attribute data, not necessarily spatial.
Variation of the binary search tree.
Each level is used to index one of the dimensions of the spatial object.
The lowest-level cell has only one object.
Divisions are based not on MBRs but on successive divisions of the dimension range.

Slide 50: k-D Tree Example

Slide 51: Topological Relationships
Disjoint
Overlaps or intersects
Equals
Covered by, inside, or contained in
Covers or contains

Slide 52: Distance Between Objects
Euclidean
Manhattan
Extensions
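The two basic distance measures can be written directly from their definitions:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of the per-dimension differences."""
    return sum(abs(a - b) for a, b in zip(p, q))
```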

Slide 53: Progressive Refinement
Produce approximate answers prior to more accurate ones.
Filter out data that are not part of the answer.
Hierarchical view of data based on spatial relationships.
A coarse predicate is recursively refined.

Slide 54: Progressive Refinement

Slide 55: Spatial Data Dominant Algorithm

Slide 56: STING
STatistical Information Grid-based method.
Hierarchical technique to divide the area into rectangular cells.
A grid data structure contains summary information about each cell.
Hierarchical clustering.
Similar to a quad tree.

Slide 57: STING

Slide 58: STING Build Algorithm

Slide 59: STING Algorithm

Slide 60: Spatial Rules
Characteristic rule: The average family income in Dallas is $50,000.
Discriminant rule: The average family income in Dallas is $50,000, while in Plano the average income is $75,000.
Association rule: The average family income in Dallas for families living near White Rock Lake is $100,000.

Slide 61: Spatial Association Rules
Either the antecedent or the consequent must contain spatial predicates.
View the underlying database as a set of spatial objects.
May be created using a type of progressive refinement.

Slide 62: Spatial Association Rule Algorithm

Slide 63: Spatial Classification
Partition spatial objects.
May use nonspatial attributes and/or spatial attributes.
Generalization and progressive refinement may be used.

Slide 64: ID3 Extension
Neighborhood graph:
Nodes: objects.
Edges: connect neighbors.
The definition of neighborhood varies.
The ID3 extension considers the nonspatial attributes of all objects in a neighborhood (not just one) for classification.

Slide 65: Spatial Decision Tree
Approach similar to that used for spatial association rules.
Spatial objects can be described based on objects close to them (a buffer).
The description of a class is based on an aggregation of nearby objects.

Slide 66: Spatial Decision Tree Algorithm

Slide 67: Spatial Clustering
Detect clusters of irregular shapes.
Use of centroids and simple distance approaches may not work well.
Clusters should be independent of the order of input.

Slide 68: Spatial Clustering

Slide 69: CLARANS Extensions
Remove the main-memory assumption of CLARANS.
Use spatial index techniques.
Use sampling and the R*-tree to identify central objects.
Change cost calculations by reducing the number of objects examined.
Voronoi diagram.

Slide 70: Voronoi

Slide 71: SD(CLARANS)
Spatial dominant.
First clusters the spatial components using CLARANS.
Then iteratively replaces medoids, but limits the number of pairs to be searched.
Uses generalization.
Uses learning to derive a description of each cluster.

Slide 72: SD(CLARANS) Algorithm

Slide 73: DBCLASD
Extension of DBSCAN: Distribution Based Clustering of LArge Spatial Databases.
Assumes items in a cluster are uniformly distributed.
Identifies the distribution satisfied by the distances between nearest neighbors.
Objects are added if the distribution remains uniform.

Slide 74: DBCLASD Algorithm

Slide 75: Aggregate Proximity
Aggregate proximity: a measure of how close a cluster is to a feature.
The aggregate proximity relationship finds the k closest features to a cluster.
The CRH algorithm uses different shapes:
Encompassing circle
Isothetic rectangle
Convex hull

Slide 76: CRH

Slide 77: Temporal Mining Outline
Goal: Examine some temporal data mining issues and approaches.
Introduction
Modeling Temporal Events
Time Series
Pattern Detection
Sequences
Temporal Association Rules

Slide 78: Temporal Database
Snapshot: traditional database.
Temporal: multiple time points.

Slide 79: Temporal Queries
A query has a time range [t_s^q, t_e^q]; a tuple in the database is valid over [t_s^d, t_e^d].
Intersection query
Inclusion query
Containment query
Point query: the tuple retrieved is valid at a particular point in time.
(Figure: timelines relating the query range [t_s^q, t_e^q] to the data range [t_s^d, t_e^d] for each query type.)

Slide 80: Types of Databases
Snapshot: no temporal support.
Transaction time: supports the time when the transaction inserted the data; timestamp or range.
Valid time: supports the time range when the data values are valid.
Bitemporal: supports both transaction and valid time.

Slide 81: Modeling Temporal Events
Techniques to model temporal events, often based on earlier approaches.
Finite state recognizer (machine) (FSR):
Each event recognizes one character.
Temporal ordering is indicated by arcs.
May recognize a sequence.
Requires precisely defined transitions between states.
Approaches:
Markov model
Hidden Markov model
Recurrent neural network

Slide 82: FSR

Slide 83: Markov Model (MM)
Directed graph:
Vertices represent states.
Arcs show transitions between states.
Each arc has a probability of transition.
At any time, one state is designated as the current state.
Markov property: given the current state, the transition probability is independent of any previous states.
Applications: speech recognition, natural language processing.
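Under the Markov property, the probability of a state sequence is just the product of one-step transition probabilities. A minimal sketch; the dictionary representation of the transition arcs is an assumption:

```python
def sequence_probability(transitions, start, sequence):
    """Probability that a Markov model generates `sequence` of states
    starting from `start`; by the Markov property, each step depends
    only on the current state."""
    prob, state = 1.0, start
    for nxt in sequence:
        # multiply in the probability on the arc state -> nxt
        prob *= transitions[state].get(nxt, 0.0)
        state = nxt
    return prob
```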

Slide 84: Markov Model

Slide 85: Hidden Markov Model (HMM)
Like an MM, but states need not correspond to observable states.
An HMM models a process that produces as output a sequence of observable symbols; the HMM actually outputs these symbols.
Associated with each node is the probability of the observation of an event.
Train the HMM to recognize a sequence.
Transition and observation probabilities are learned from a training set.

Slide 86: Hidden Markov Model
Modified from [RJ86].

Slide 87: HMM Algorithm

Slide 88: HMM Applications
Given a sequence of events and an HMM, what is the probability that the HMM produced the sequence?
Given a sequence and an HMM, what is the most likely state sequence which produced this sequence?
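The first question (evaluation) is classically answered with the forward algorithm; a sketch follows, with the dictionary-based model representation chosen for illustration:

```python
def forward(states, start_p, trans_p, emit_p, observations):
    """Forward algorithm: probability that an HMM produced the given
    observation sequence (the first HMM question above)."""
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: emit_p[s][obs] *
                    sum(alpha[r] * trans_p[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())
```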

Slide 89: Recurrent Neural Network (RNN)
Extension of the basic NN.
A neuron can obtain input from any other neuron (including the output layer).
Can be used for both recognition and prediction applications.
The time to produce an output is unknown.
The temporal aspect is added by backlinks.

Slide 90: RNN

Slide 91: Time Series
Set of attribute values over time.
Time series analysis: finding patterns in the values.
Trends
Cycles
Seasonal patterns
Outliers

Slide 92: Analysis Techniques
Smoothing: moving average of attribute values.
Autocorrelation: relationships between different subseries (yearly, seasonal).
Lag: time difference between related items.
Correlation coefficient r.
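Smoothing and lagged autocorrelation can be sketched directly from these definitions; the window size and lag are free parameters, and any sample series is hypothetical:

```python
def moving_average(series, window=3):
    """Smooth a time series by averaging over a sliding window."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def lag_correlation(series, lag):
    """Correlation coefficient r between the series and a copy of
    itself shifted by `lag` time units (autocorrelation)."""
    x, y = series[:-lag], series[lag:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

A perfectly periodic series correlates fully with itself at a lag equal to its period, which matches the "Correlation with Lag of 3" figure's intent.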

Slide 93: Smoothing

Slide 94: Correlation with Lag of 3

Slide 95: Similarity
Determine the similarity between a target pattern, X, and a sequence, Y: sim(X, Y).
Similar to Web usage mining.
Similar to earlier word processing and spelling corrector applications.
Issues:
Length
Scale
Gaps
Outliers
Baseline

Slide 96: Longest Common Subseries
Find the longest subseries the two have in common.
Ex: X = <10,5,6,9,22,15,4,2>; Y = <6,9,10,5,6,22,15,4,2>
Output: <22,15,4,2>
Sim(X,Y) = l/n = 4/9
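Treating a subseries as a consecutive run of values, the example can be reproduced with the classic longest-common-substring dynamic program:

```python
def longest_common_subseries(x, y):
    """Longest run of values appearing consecutively in both series
    (longest-common-substring dynamic program)."""
    best_len, best_end = 0, 0
    # prev[j] = longest common run ending at x[i-1] and y[j-1]
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return x[best_end - best_len:best_end]
```

On the slide's example this recovers <22,15,4,2>, giving Sim(X,Y) = 4/9 with n the length of Y.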

Slide 97: Similarity Based on Linear Transformation
Linear transformation function f: converts a value from one series to a value in the second.
e_f: tolerated difference in results.
d: time value difference allowed.

Slide 98: Prediction
Predict future values of a time series.
Regression may not be sufficient.
Statistical techniques: ARMA, ARIMA, NN.

Slide 99: Pattern Detection
Identify patterns of behavior in time series.
Speech recognition, signal processing.
FSR, MM, HMM.

Slide 100: String Matching
Find a given pattern in a sequence.
Knuth-Morris-Pratt: constructs an FSM.
Boyer-Moore: constructs an FSM.

Slide 101: Distance Between Strings
Cost to convert one string to the other.
Transformations:
Match: current characters in both strings are the same.
Delete: delete the current character in the input string.
Insert: insert the current character of the target string into the input string.
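With only match, delete, and insert allowed (no substitution, following the slide's list of transformations), the distance can be computed with the standard dynamic program:

```python
def edit_distance(source, target):
    """Cost to convert `source` into `target` using the three
    transformations above: match (cost 0 when the characters agree),
    delete, and insert (cost 1 each)."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all remaining source characters
    for j in range(n + 1):
        d[0][j] = j          # insert all remaining target characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = d[i - 1][j - 1] if source[i - 1] == target[j - 1] else float('inf')
            d[i][j] = min(match, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[m][n]
```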

Slide 102: Distance Between Strings

Slide 103: Frequent Sequence

Slide 104: Frequent Sequence Example
Purchases made by customers:
s(<{A},{C}>) = 1/3
s(<{A},{D}>) = 2/3
s(<{B,C},{D}>) = 2/3
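A sketch of sequence support. The original purchase table is not included in the transcript, so the three customer sequences below are hypothetical data chosen to reproduce the three supports stated on the slide:

```python
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence`: each pattern itemset is
    a subset of some itemset of the sequence, in order."""
    i = 0
    for itemset in sequence:
        if i < len(pattern) and pattern[i] <= itemset:  # subset test
            i += 1
    return i == len(pattern)

def support(sequences, pattern):
    """Fraction of customer sequences containing the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

# hypothetical purchase sequences consistent with the slide's supports
customers = [
    [{'A', 'B', 'C'}, {'D'}],
    [{'A'}, {'C'}, {'D'}],
    [{'B', 'C'}, {'D'}],
]
```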

Slide 105: Frequent Sequence Lattice

Slide 106: SPADE
Sequential Pattern Discovery using Equivalence classes.
Identifies patterns by traversing the lattice in a top-down manner.
Divides the lattice into equivalence classes and searches each separately.
ID-list: associates customers and transactions with each item.

Slide 107: SPADE Example
ID-list for sequences of length 1:
Count for <{A}> is 3.
Count for <{A},{D}> is 2.

Slide 108: Q1 Equivalence Classes

Slide 109: SPADE Algorithm

Slide 110: Temporal Association Rules
A transaction has a time range: <TID, CID, I1, I2, …, Im, ts, te>.
[ts, te] is the range of time during which the transaction is active.
Types:
Inter-transaction rules
Episode rules
Trend dependencies
Sequence association rules
Calendric association rules

Slide 111: Inter-transaction Rules
Intra-transaction association rules: traditional association rules.
Inter-transaction association rules: rules across transactions.
Sliding window: how far apart (in time or number of transactions) to look for related itemsets.

Slide 112: Episode Rules
Association rules applied to sequences of events.
Episode: a set of event predicates and a partial ordering on them.

Slide 113: Sequence Association Rules
Association rules involving sequences.
Ex: <{A},{C}> → <{A},{D}>
Support = 1/3
Confidence = 1

Slide 114: Calendric Association Rules
Each transaction has a unique timestamp.
Group transactions based on the time interval within which they occur.
Identify large itemsets by looking only at transactions in this predefined interval.

Slide 115: UNIT V - CASE STUDY

Slide 116: DATA WAREHOUSING FOR THE TAMIL NADU GOVERNMENT
GISTNIC DATA WAREHOUSE
The General Information Service Terminal of National Informatics Centre (GISTNIC) data warehouse is an initiative taken by the National Informatics Centre to provide a comprehensive information database by the government on national issues ranging across diverse subjects, from food and agriculture to trends in the economy and the latest updates in science and technology. The GISTNIC data warehouse is a Web-enabled SAS software solution.
The data warehouse aims to provide online information to key decision makers in the government sector, enabling them to make better strategic decisions with regard to administrative policies and investments.
The government of Tamil Nadu is the first to perceive the need for and the importance of converting data into valuable information for better decision making.
The GISTNIC Web site has an online data warehouse which includes data marts on village amenities, rainfall, agricultural census data, essential commodity prices, etc.
OBJECTIVES OF THE WEB-ENABLED DATA WAREHOUSE
The objective is to provide powerful decision-making tools in the hands of the end users in order to facilitate prompt decision making.

Slide 117: DATA WAREHOUSING FOR THE MINISTRY OF COMMERCE
The Ministry of Commerce has set up the following seven export processing zones at various locations:
1. Kandla Free Trade Zone, Gandhidham
2. Santacruz Electronics Export Processing Zone, Bombay
3. Falta Export Processing Zone, West Bengal
4. Madras Export Processing Zone, Chennai
5. Cochin Export Processing Zone, Cochin
6. Noida Export Processing Zone, Noida
7. Visakhapatnam Export Processing Zone, Visakhapatnam
This case study report presents how a data warehouse can be effectively built for the Ministry of Commerce.
The objectives are:
1. Globalization of India's foreign trade
2. Attracting foreign investment
3. Scaling down tariff barriers
4. Encouraging high and internationally acceptable standards of quality
5. Simplification and streamlining of procedures governing imports and exports
DATA WAREHOUSE:
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of the management decision-making process.

Slide 118: OBJECTIVES OF THE DATA WAREHOUSE FOR THE MINISTRY OF COMMERCE
The Ministry of Commerce has been regularly reviewing the data warehouse in its board meetings. The ministry is equipped with all analysis variables in a reporting form, and the zone performance and the progress of exports are reviewed for each zone of the country.
The data warehouse takes all analysis variables into consideration from all the zones, with an option to drill down to the daily data.
DATA WAREHOUSE IMPLEMENTATION STRATEGY FOR EXPORT PROCESSING ZONES
INFRASTRUCTURE
The basic infrastructure required for building the warehouse for the Ministry of Commerce is based on the communication infrastructure, hardware/software/tools, and manpower.
The common infrastructure includes all the tasks necessary to provide the technical basis for the warehouse environment, including connectivity between the legacy environment and the new warehouse environment at both the network and database levels.
Manpower requirements:
Senior officials in the Ministry of Commerce sponsored the whole warehouse implementation and played active roles as EXIM policy and business architects for the data warehouse, as well as subject-area specialists.
KEY AREAS FOR ANALYSIS
The data topic deals with data access, mapping, derivation, transformation, and aggregation according to the requirements of the ministry. The key decision-making areas, which will be used for analytical processing and for data mining at the Ministry of Commerce, are as follows:
Unit-wise, sector-wise, and country-wise imports and exports
Direction of imports and exports
Sector-wise, country-wise, and zone-wise trends in imports and exports
Comparative country-wise exports and imports for each sector
DTA sales
Claims of reimbursement of central sales tax of zones
Deemed export benefits
Employment generation
Investments in the zone
Deployment of infrastructure
Growth of industrial units
Occupancy details

Slide 119: IMPLEMENTATION OF THE DATA WAREHOUSE
Operational data systems:
The operational data system keeps the organization running by supporting daily transactions such as import and export bills submitted to the customs department in each zone, daily transactions of permissions and allotments, etc.
ARCHITECTURE
The architecture of the OLAP implementation consists of five layers.
All seven zones have DBMS/RDBMS data for internal management of zone activities and have been forwarding MIS reports to the MoC, New Delhi.
The second layer is located at the MoC, New Delhi, with a large metadata repository and the data warehouse.
The metadata warehouse tools and the OLAP server handling and maintaining them are the focus of level 3 of the architecture.
(Architecture figure: data marts for the zones (MEPZ, SEPZ, CEPZ, KAEPZ, NEPZ, VEPZ, FEPZ) feed the data warehouse and metadata repository, which serve an OLAP server and MDDB, with front-end tools for the secretary.)

Slide 120: DESIGN OF ANALYSIS/CATEGORY VARIABLES
The data model is prepared after the entire data availability and the data requirements are analyzed. The analysis variables for building the multidimensional cube are listed as follows:
1. Employment generation: managerial/skilled/unskilled classification with zone/unit/industry break-up
2. Investments in the zone
3. Performance of units and close monitoring during production
4. Deployment of infrastructure, etc.
Related tables: ep_mast, ep_stmst, dist_mast, states, indu_mast; eo_mast, eo_stmst.
Analysis variables:
1. EPZ or EOU
2. Zone
3. Type of approval
4. Type of industry
5. State
6. District
7. Year of approval
8. Month of approval
9. Day of approval
10. Year of production commencement

Slide 121:
11. Month of production commencement
12. Day of production commencement
13. Current status
14. Date of current status
15. Net foreign exchange percentage
16. Stipulated NFE
17. Number of approvals
18. Number of units
Export and import values with zone, sector, port, year, month, and day break-ups deal with the following performances:
Zone-wise performance
Industry-wise performance
Country-wise performance
These will indicate the following directions of exports:
Country-wise performance, overall and with zone break-up
Port-wise performance, which will guide the examination of infrastructure in these ports
Export performance during different time periods and the analysis of the same
Trends over years/quarters/months for different countries, sectors, and zones
Comparative country-wise imports/exports for each sector
Related tables: shipbill, country, industry, currency, shipment_mode, export, bill entry.
Export analysis variables: zone, export type, year of shipping bill, month of shipping bill, country, currency, mode of shipment, destination port, value of export, etc.
Import analysis variables: auto/approval, zone, import type, import year, import month, import day, type of goods, import value, duty foregone, import country, and mode of shipment.

Slide 122: DEEMED EXPORT BENEFITS
Analysis variables: zone, sector, claims received, amount disbursed, year/quarter/month.
EMPLOYMENT GENERATION
Analysis variables: zone, male/female, managerial/skilled/unskilled, number of employees.
INVESTMENTS IN THE ZONE
Analysis variables: zone, unit, NRI foreign investment, Indian investment, remittances received, approved value.
CONCLUSION
The data warehouse for the EPZs provides the ability to scale to large volumes of data and a seamless presentation of historical, projected, and derived data.
It helps the planners in what-if analysis and planning without depending on the zones to supply the data.
The time lag between the zone and the ministry is eliminated, and the analysis can be carried out at the speed of thought.
The Ministry of Commerce can add more dimensions to the proposed warehouse with data collected from other offices, to evolve a data warehouse model for better analysis of the promotion of imports and exports in the country.
This will provide an excellent executive information system to the secretary and joint secretaries of the ministry.

DATA WAREHOUSE FOR FINANCE DEPARTMENT

Responsibilities of the finance department:The finance department of the government of

andhra pradesh has the following responsibilities.

Preparing a department-wise budget up to the sub-detail head and submitting it to the legislature for approval.

Monitoring government expenditure and revenue department-wise.

Looking after development activities under various plan schemes

Monitoring other administrative matters related to all heads of the department.

Treasuries in Andhra Pradesh:

Money standing in the government account is kept either in treasuries or in banks. Money deposited in banks is considered a general fund held in the books of the banks on behalf of the state.

Director of treasuries:

Treasuries are under the general control of the director of treasuries and accounts.

Sub-treasuries:

If the requirements of public business make it necessary, one or more sub-treasuries may be established under a district treasury.

The accounts of receipts and payments at a sub-treasury must be included monthly in the accounts of the district treasury.

Treasuries handle all the government receipts and payments.

Every transaction in the government is made through related departments.

DATA WAREHOUSING FOR THE GOVERNMENT OF ANDHRA PRADESH

LESS THAN 2000: RECEIPTS

2000 TO 4000: SERVICE MAJOR HEADS

4000 TO 6000: CAPITAL OUTLAY

6000 TO 8000: LOANS

MORE THAN 8000: DEPOSITS
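The range rules above map directly to a small classifier. A minimal sketch; the function name is an illustrative assumption:

```python
# Hypothetical sketch of the major-head ranges listed above.
# The function name is an illustrative assumption.
def classify_major_head(major: int) -> str:
    """Map a major head code to its budget category."""
    if major < 2000:
        return "RECEIPTS"
    elif major < 4000:
        return "SERVICE MAJOR HEADS"
    elif major < 6000:
        return "CAPITAL OUTLAY"
    elif major < 8000:
        return "LOANS"
    return "DEPOSITS"

print(classify_major_head(2059))  # falls in the 2000-4000 service band
```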

A project to build a data warehouse for OLAP was implemented by NIC for the department of treasuries. The concept of building a DW in the department of treasuries was established to provide easy access to integrated, up-to-date data related to various aspects of the department's functions.

DW technology is used to develop analytical tools designed to provide support for decision making at all levels of the department.

DESCRIPTION OF THE ACCOUNT HEAD: LENGTH OF THE CODE

MAJOR: 4

SUB-MAJOR: 2

MINOR: 3

GROUP SUB-HEAD: 1

SUB-HEAD: 2

DETAIL HEAD: 3

SUB-DETAIL HEAD: 3
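Given the segment lengths above (4+2+3+1+2+3+3 = 18 digits in all), a composite account-head code can be split mechanically. A minimal sketch; the segment names and the sample code are illustrative assumptions:

```python
# Hypothetical sketch: splitting a composite treasury account-head code
# into its segments using the lengths given in the table above.
# Segment names and the sample code are illustrative assumptions.
SEGMENTS = [
    ("major", 4),
    ("sub_major", 2),
    ("minor", 3),
    ("group_sub_head", 1),
    ("sub_head", 2),
    ("detail_head", 3),
    ("sub_detail_head", 3),
]

def parse_account_head(code: str) -> dict:
    """Split an 18-digit account-head code into named segments."""
    total = sum(width for _, width in SEGMENTS)
    if len(code) != total or not code.isdigit():
        raise ValueError(f"expected a {total}-digit numeric code")
    parts, pos = {}, 0
    for name, width in SEGMENTS:
        parts[name] = code[pos:pos + width]
        pos += width
    return parts

print(parse_account_head("201500800012001003"))
```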

Traditional information systems implemented in the department of treasuries are based on transactional databases, which are not designed to provide fast and efficient access to information critical for decision making. Data required for analysis are typically distributed among a number of isolated information systems meeting the needs of different sub-treasuries.

Data warehouse technology provided to the department of treasuries by the National Informatics Centre eliminates these problems by storing current and historical data from disparate information systems. The data warehouse provides efficient analysis and monitoring of the financial data of treasuries.

It also evaluates internal and external business factors related to the operational, economic, and financial conditions of treasuries' budget utilization.

DIMENSIONS COVERED UNDER FINANCE DATA WAREHOUSE

The different dimensions taken for the drill-down approach, against two measures (payments and receipts), are:

Department

District treasury office

Sub-treasury office

Drawing and disbursing officer

Time

Bank-wise

Based on different forms
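The drill-down idea can be sketched as a roll-up of the two measures along a chosen dimension path. A toy example in Python; the record layout, office names, and figures are made up for illustration:

```python
# Hypothetical sketch of drill-down: rolling up the two measures
# (payments and receipts) along a dimension path such as
# district treasury office -> sub-treasury office.
# Records, field names, and figures are made up for illustration.
from collections import defaultdict

records = [
    {"dto": "Hyderabad", "sto": "STO-1", "month": "2002-01", "payments": 120, "receipts": 90},
    {"dto": "Hyderabad", "sto": "STO-2", "month": "2002-01", "payments": 80,  "receipts": 60},
    {"dto": "Guntur",    "sto": "STO-3", "month": "2002-01", "payments": 50,  "receipts": 70},
]

def rollup(rows, dims):
    """Aggregate payments and receipts by the given dimension tuple."""
    out = defaultdict(lambda: {"payments": 0, "receipts": 0})
    for r in rows:
        key = tuple(r[d] for d in dims)
        out[key]["payments"] += r["payments"]
        out[key]["receipts"] += r["receipts"]
    return dict(out)

# Summary level: by district treasury office.
print(rollup(records, ["dto"]))
# Drill down one level: by district treasury office and sub-treasury office.
print(rollup(records, ["dto", "sto"]))
```

Each additional dimension in the key (DDO, time, bank, form) gives one more drill-down level, which is what the COGNOS PowerPlay cubes automate.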

COGNOS GRAPHIC USER INTERFACE FOR TREASURIES DATA WAREHOUSE

Impromptu:

It is used for generating various kinds of reports, such as simple crosstabs.

Transformer:

Transformer model objects contain definitions of queries, dimensions, measures, dimension views, user classes, and related authentication information, as well as objects for one or more cubes that Transformer creates for viewing and reporting in PowerPlay.

PowerPlay:

COGNOS PowerPlay is used to populate reports with a drill-down facility.

Scheduler:

Scheduler coordinates the execution of automated processes called tasks, on a set date and time or at recurring intervals.

Authenticator:

Authenticator is a user class management system. It provides COGNOS client applications with the ability to create and show data based on user-authenticated access.

HP can easily access and quickly analyze enormous volumes of sell-through data to help its reseller customers improve the efficiency and profitability of their businesses, with an OLAP system based on Microsoft SQL Server and Knosys ProClarity. Hewlett-Packard is a worldwide market leader in the $18 million inkjet industry. In the past, HP’s brand recognition and reputation for reliability were enough to ensure that a reseller would carry HP products.

ACCESS TO INFORMATION NEEDED USING DATA WAREHOUSING TECHNOLOGY

HP has captured and stored the information both from primary research and from third parties.

The business analysis group decided they needed a system that would provide market metric data to help field sales force managers or account teams make brand and channel management decisions.

HP required a system that was low cost, low maintenance, and as simple to administer as possible. So the group turned to Knosys, a Boise, Idaho-based software company that has developed a business analysis/online analytical processing (OLAP) package called ProClarity.

The solution enables the group to move the data quickly, and at such a low cost of maintenance and ownership, that it solves these problems.

Knosys helped the group build the data flow algorithms with the Microsoft Visual Basic development system and SQL Server 7.0 Data Transformation Services.

Due to HP’s enormous sell-through data volumes, it would have taken too long to build analytical models with a pure, multidimensional OLAP solution.

HP used SQL server 7.0 virtual cubes and cube partitioning capabilities.

Virtual cube capabilities allow decision makers to cross-analyze data from all these OLAP sources.

Cube partitioning allows HP to manage large numbers of OLAP cubes more effectively.

DATA WAREHOUSING IN HEWLETT-PACKARD

Knosys ProClarity provides HP decision makers with the key to analyzing masses of data. ProClarity is fully integrated with Microsoft products, and its PC-based client is modelled after Internet Explorer 4.0.

ProClarity’s powerful analytical features take full advantage of the robust capabilities found in SQL Server 7.0 OLAP Services.

CONCLUSION

HP’s ProClarity system and SQL Server 7.0 provide more accurate, detailed, and timely data, which makes the business more efficient and helps its resellers make their businesses more efficient.

In 1998, ArsDigita Corporation built a web service as a front-end to an experimental custom clothing factory operated by Levi Strauss. The whole purpose of the factory and web service was to test and analyze consumer reaction to this method of buying clothes. Therefore, a data warehouse was built into the project almost from the beginning. The public website was supported by a mid-range HP Unix server that had ample leftover capacity to run the data warehouse.

A new ‘dw’ Oracle user was created to support the data warehouse.

SELECT on the OLTP tables was GRANTed to the ‘dw’ user, and procedures were written to copy all the data from the OLTP system into a star schema of tables owned by the ‘dw’ user.

This kind of schema has been proven to scale to the world’s largest data warehouses.

In a star join schema, there is one fact table that references a number of dimension tables.

The following dimension tables were designed after discussions with Levi's executives:

1. TIME: for queries comparing sales by season, quarter, or holiday

2. PRODUCT: for queries comparing sales by color or style

3. SHIP TO: for queries comparing sales by region or state

4. PROMOTION: for queries aimed at determining the relationship between discounts and sales

5. USER EXPERIENCE: for queries looking at returned versus exchanged versus accepted items
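The star join described above can be sketched with two of the five dimensions; SHIP TO, PROMOTION, and USER EXPERIENCE would follow the same pattern. Table and column names are illustrative assumptions, not Levi's actual schema, and SQLite stands in for Oracle:

```python
# Sketch of the star join schema described above: one fact table
# referencing the TIME and PRODUCT dimensions. Column names and data
# are illustrative assumptions, not Levi's actual schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE time_dim    (time_id INTEGER PRIMARY KEY, quarter TEXT, holiday TEXT);
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, color TEXT, style TEXT);
    -- Fact table referencing the dimension tables.
    CREATE TABLE sales_fact (
        time_id    INTEGER REFERENCES time_dim(time_id),
        product_id INTEGER REFERENCES product_dim(product_id),
        amount     REAL
    );
""")
con.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                [(1, "Q1", "none"), (2, "Q4", "christmas")])
con.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                [(1, "indigo", "501"), (2, "black", "505")])
con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                [(1, 1, 100.0), (2, 1, 250.0), (2, 2, 80.0)])

# A star join: sales by quarter and color.
for row in con.execute("""
        SELECT t.quarter, p.color, SUM(s.amount)
        FROM sales_fact s
        JOIN time_dim t    ON s.time_id = t.time_id
        JOIN product_dim p ON s.product_id = p.product_id
        GROUP BY t.quarter, p.color
        ORDER BY t.quarter, p.color"""):
    print(row)
```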

DATA WAREHOUSING IN LEVI STRAUSS

The World Bank collects and maintains huge volumes of data on economic and developmental parameters for all the third-world countries across the globe.

The bank performed analysis on this data manually, and later with limited analysis tools. The bank collects and analyzes macroeconomic and financial statistics, as well as information on parameters such as poverty, health, education, environment, and the public sector.

THE LIVE DATABASE DATA WAREHOUSE

The OLAP cubes were defined for this database using the OLAP Server module of SQL Server 2000.

Universal access was provided for this data warehouse, which was called the Live Database (LDB).

BENEFITS OF THE SECOND-GENERATION LDB DATA WAREHOUSE

The first-generation LDB data warehouse had certain limitations. Therefore the second-generation LDB data warehouse was built using SQL Server 2000 Analysis Server along with ProClarity of Knosys Corporation.

The package offered direct user functionality which would otherwise require technical intervention by a programmer.

ProClarity also provides web enablement, thereby ensuring universal accessibility.

It results in significant cost savings by reducing the time and effort required to prepare a large variety of reports suited to the varying needs of economists and other governmental decision makers, aiding effective and better economic planning.

DATA WAREHOUSING IN WORLD BANK