© Prentice Hall

DATA MINING: Introductory and Advanced Topics, Part III
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Companion slides for the text by Dr. M. H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2002.
Data Mining Outline

PART I
Introduction
Related Concepts
Data Mining Techniques

PART II
Classification
Clustering
Association Rules

PART III
Web Mining
Spatial Mining
Temporal Mining
Web Mining Outline
Goal: Examine the use of data mining on the World Wide Web
Introduction
Web Content Mining
Web Structure Mining
Web Usage Mining

Web Mining Issues
Size: more than 350 million pages (1999), growing at about 1 million pages a day; Google indexes 3 billion documents
Diverse types of data

Web Data
Web pages
Intra-page structures
Inter-page structures
Usage data
Supplemental data: profiles, registration information, cookies
Web Mining Taxonomy
Modified from [Zai01]

Web Content Mining
Extends the work of basic search engines
Search engines are an IR application:
Keyword based
Similarity between query and document
Crawlers
Indexing
Profiles
Link analysis
Crawlers
A robot (spider) traverses the hypertext structure of the Web.
Collects information from visited pages
Used to construct indexes for search engines
Traditional crawler – visits the entire Web (?) and replaces the index
Periodic crawler – visits portions of the Web and updates a subset of the index
Incremental crawler – selectively searches the Web and incrementally modifies the index
Focused crawler – visits pages related to a particular subject
Focused Crawler
Only visits links from a page if that page is determined to be relevant.
The classifier is static after the learning phase.
Components:
Classifier – assigns a relevance score to each page based on the crawl topic.
Distiller – identifies hub pages, which contain links to many relevant pages.
Crawler – visits pages based on classifier and distiller scores.

Focused Crawler
The classifier relates documents to topics and also determines how useful outgoing links are.
Hub pages contain links to many relevant pages and must be visited even if they do not have a high relevance score.

Focused Crawler (figure)
Context Focused Crawler
Context graph:
A context graph is created for each seed document.
The root is the seed document.
Nodes at each level show documents with links to documents at the next higher level.
Updated during the crawl itself.
Approach:
Construct context graphs and classifiers using the seed documents as training data.
Perform crawling using the classifiers and context graphs created.

Context Graph (figure)
Harvest System
Based on the use of caching, indexing, and crawling.
Harvest is a set of tools that facilitates gathering of information from diverse sources.
Gatherers – obtain information for indexing from an Internet service provider.
Brokers – provide the index and query interface.
Gatherers use the Essence system to assist in collecting data.
Essence – classifies documents by creating a semantic index.
Virtual Web View
Multiple Layered DataBase (MLDB) built on top of the Web.
Each layer of the database is more generalized (and smaller) and centralized than the one beneath it.
Upper layers of the MLDB are structured and can be accessed with SQL-type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to place in the first layer of the MLDB.
Higher levels contain more summarized data obtained through generalizations of the lower levels.
Personalization
Web access or contents tuned to better fit the desires of each user.
Manual techniques identify a user's preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content-based filtering retrieves pages based on similarity between pages and user profiles.
Web Structure Mining
Mine the structure (links, graph) of the Web.
Techniques: PageRank, CLEVER
Create a model of the Web organization.
May be combined with content mining to more effectively retrieve important pages.
PageRank
Used by Google
Prioritizes pages returned from a search by looking at Web structure.
The importance of a page is calculated based on the number of pages which point to it – its backlinks.
Weighting is used to give more importance to backlinks coming from important pages.

PageRank (cont'd)
PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)
PR(i): PageRank for a page i which points to target page p.
Ni: number of links coming out of page i.
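The formula on this slide can be evaluated by simple iteration until the scores stabilize. A minimal Python sketch; the `(1 - c)/n` teleport term is an added assumption (the standard damping term used to guarantee convergence), and the link graph in the usage example is hypothetical:

```python
def pagerank(links, c=0.85, iters=50):
    """Iteratively compute a simplified PageRank.

    links: dict mapping each page to the list of pages it links to.
    c plays the role of the constant in PR(p) = c (PR(1)/N1 + ... + PR(n)/Nn);
    the (1 - c)/n teleport term is an added convergence assumption.
    """
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # Sum PR(i)/N_i over every page i that has a backlink to p.
            backlink_sum = sum(pr[i] / len(links[i])
                               for i in pages if p in links[i])
            new[p] = (1 - c) / n + c * backlink_sum
        pr = new
    return pr
```

For example, `pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})` ranks page `a` highest, since it has backlinks from both `b` and `c`.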
CLEVER
Identifies authoritative and hub pages.
Authoritative pages: highly important pages; the best source for requested information.
Hub pages: contain links to highly important pages.
HITS
Hyperlink-Induced Topic Search
Based on a set of keywords, find a set of relevant pages, R.
Identify hub and authority pages for these.
Expand R to a base set, B, of pages linked to or from R.
Calculate weights for authorities and hubs.
Pages with the highest ranks in R are returned.

HITS Algorithm (figure)
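The hub/authority weight calculation is a mutually reinforcing iteration: a page's authority score is the sum of the hub scores of pages pointing to it, and its hub score is the sum of the authority scores of the pages it points to. A minimal sketch, assuming `links` maps each page in the base set B to its out-links:

```python
def hits(links, iters=50):
    """Mutually reinforcing hub/authority iteration over a base set.

    links: dict page -> list of pages it links to.
    Returns (authority, hub) score dicts, each normalized to sum to 1.
    """
    pages = set(links) | {q for outs in links.values() for q in outs}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority: sum of hub scores of pages that link to p.
        auth = {p: sum(hub[i] for i in links if p in links[i]) for p in pages}
        # Hub: sum of authority scores of the pages p links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        a_norm = sum(auth.values()) or 1.0
        h_norm = sum(hub.values()) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub
```

In a graph where two pages both link to a third, the third page gets the top authority score and the first two get the top hub scores.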
Web Usage Mining
Mines Web usage data: logs of the pages users access.
Goal: discover patterns in how users browse a site.
Web Usage Mining Applications
Personalization
Improve structure of a site's Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and advertising)
Web Usage Mining Activities
Preprocessing the Web log:
Cleanse – remove extraneous information
Sessionize
Session: sequence of pages referenced by one user at a sitting.
Pattern discovery:
Count patterns that occur in sessions.
A pattern is a sequence of page references in a session.
Similar to association rules:
Transaction: session
Itemset: pattern (or subset)
Order is important
Pattern analysis
ARs in Web Mining
Web mining: content, structure, usage
Frequent patterns of sequential page references in Web searching.
Uses:
Caching
Clustering users
Developing user profiles
Identifying important pages
Web Usage Mining Issues
Identification of the exact user is not possible.
The exact sequence of pages referenced by a user cannot be determined, due to caching.
Sessions are not well defined.
Security, privacy, and legal issues.
Web Log Cleansing
Replace the source IP address with a unique but non-identifying ID.
Replace the exact URL of pages referenced with a unique but non-identifying ID.
Delete error records and records containing no page data (such as figures and code).
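A minimal sketch of these three cleansing steps over common-log-format lines; the regex, field layout, and the list of non-page extensions are illustrative assumptions, not the book's method:

```python
import re

def cleanse(log_lines):
    """Replace IPs and URLs with opaque IDs; drop error and non-page records.

    Assumes (hypothetically) lines shaped like a common log format:
        '1.2.3.4 - - [timestamp] "GET /page.html HTTP/1.0" 200'
    Returns a list of (user_id, page_id) pairs.
    """
    ip_ids, url_ids, out = {}, {}, []
    for line in log_lines:
        m = re.match(r'(\S+) .* "(?:GET|POST) (\S+).*" (\d{3})', line)
        if not m:
            continue                      # malformed record: drop it
        ip, url, status = m.groups()
        # Delete error records and figure/code requests; keep page requests.
        if not status.startswith("2") or url.endswith((".gif", ".jpg", ".js", ".css")):
            continue
        # Unique but non-identifying IDs for source IP and URL.
        ip_id = ip_ids.setdefault(ip, f"U{len(ip_ids)}")
        url_id = url_ids.setdefault(url, f"P{len(url_ids)}")
        out.append((ip_id, url_id))
    return out
```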
Sessionizing
Divide the Web log into sessions.
Two common techniques:
Number of consecutive page references from a source IP address occurring within a predefined time interval (e.g., 25 minutes).
All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.
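The interclick-time technique can be sketched as follows, assuming the log has already been cleansed and grouped by source IP:

```python
def sessionize(records, max_gap=25 * 60):
    """Split one user's clickstream into sessions.

    records: list of (timestamp_seconds, page) pairs sorted by time,
    all from one source IP. A new session starts whenever the
    interclick time exceeds max_gap (25 minutes here, matching the
    slide's example threshold).
    """
    sessions = []
    for ts, page in records:
        if sessions and ts - sessions[-1][-1][0] <= max_gap:
            sessions[-1].append((ts, page))   # continue current session
        else:
            sessions.append([(ts, page)])     # gap too large: new session
    return sessions
```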
Data Structures
Keep track of patterns identified during the Web usage mining process.
Common techniques:
Trie
Suffix tree
Generalized suffix tree
WAP tree
Trie vs. Suffix Tree
Trie:
Rooted tree
Edges labeled with a character (page) from the pattern
A path from root to leaf represents a pattern
Suffix tree:
A single child is collapsed with its parent; the edge carries the labels of both prior edges.

Trie and Suffix Tree (figure)
Generalized Suffix Tree
Suffix tree for multiple sessions; contains patterns from all sessions.
Maintains a count of the frequency of occurrence of a pattern in each node.
WAP tree: compressed version of the generalized suffix tree.
Types of Patterns
Algorithms have been developed to discover different types of patterns.
Properties:
Ordered – characters (pages) must occur in the exact order of the original session.
Duplicates – duplicate characters are allowed in the pattern.
Consecutive – all characters in the pattern must occur consecutively in the given session.
Maximal – not a subsequence of another pattern.
Pattern Types
Association rules – none of the properties hold
Episodes – only ordering holds
Sequential patterns – ordered and maximal
Forward sequences – ordered, consecutive, and maximal
Maximal frequent sequences – all properties hold
Episodes
Partially ordered set of pages
Serial episode – totally ordered with a time constraint
Parallel episode – partially ordered with a time constraint
General episode – partially ordered with no time constraint

DAG for Episode (figure)
Spatial Mining Outline
Goal: Provide an introduction to some spatial mining techniques.
Introduction
Spatial Data Overview
Spatial Data Mining Primitives
Generalization/Specialization
Spatial Rules
Spatial Classification
Spatial Clustering
Spatial Object
Contains both spatial and nonspatial attributes.
Must have a location-type attribute: latitude/longitude, zip code, street address.
May retrieve an object using either (or both) spatial or nonspatial attributes.
Spatial Data Mining Applications
Geology
GIS systems
Environmental science
Agriculture
Medicine
Robotics
May involve both spatial and temporal aspects
Spatial Queries
Spatial selection may involve specialized selection comparison operations:
Near
North, south, east, west
Contained in
Overlap/intersect
Region (range) query – find objects that intersect a given region.
Nearest neighbor query – find objects close to an identified object.
Distance scan – find objects within a certain distance of an identified object, where the distance is made increasingly larger.
Spatial Data Structures
Data structures designed specifically to store or index spatial data.
Often based on the B-tree or binary search tree.
Cluster data on disk based on geographic location.
May represent a complex spatial structure by placing the spatial object in a containing structure of a specific geographic shape.
Techniques:
Quad tree
R-tree
k-D tree
MBR
Minimum Bounding Rectangle
The smallest rectangle that completely contains the object.

MBR Examples (figure)
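For an object stored as a list of vertices, the MBR reduces to per-axis minima and maxima. A minimal sketch:

```python
def mbr(points):
    """Minimum bounding rectangle of a set of 2-D points.

    Returns ((min_x, min_y), (max_x, max_y)): the smallest axis-aligned
    rectangle that completely contains the object's points.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))
```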
Quad Tree
Hierarchical decomposition of the space into quadrants (MBRs).
Each level in the tree represents the object as the set of quadrants which contain any portion of the object.
Each level is a more exact representation of the object.
The number of levels is determined by the degree of accuracy desired.

Quad Tree Example (figure)
R-Tree
As with the quad tree, the region is divided into successively smaller rectangles (MBRs).
Rectangles need not be of the same size or number at each level.
Rectangles may actually overlap.
The lowest-level cell has only one object.
Tree maintenance algorithms are similar to those for B-trees.

R-Tree Example (figure)
k-D Tree
Designed for multi-attribute data, not necessarily spatial.
Variation of the binary search tree.
Each level is used to index one of the dimensions of the spatial object.
The lowest-level cell has only one object.
Divisions are not based on MBRs but on successive divisions of the dimension range.

k-D Tree Example (figure)
Topological Relationships
Disjoint
Overlaps or intersects
Equals
Covered by, inside, or contained in
Covers or contains
Distance Between Objects
Euclidean
Manhattan
Extensions:
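The two basic distance measures, sketched for points given as coordinate tuples of any dimension:

```python
import math

def euclidean(p, q):
    """Straight-line distance: sqrt of the sum of squared per-axis differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of absolute per-axis differences."""
    return sum(abs(a - b) for a, b in zip(p, q))
```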
Progressive Refinement
Make approximate answers prior to more accurate ones.
Filter out data not part of the answer.
Hierarchical view of data based on spatial relationships.
A coarse predicate is recursively refined.

Progressive Refinement (figure)
Spatial Data Dominant Algorithm (figure)

STING
STatistical INformation Grid-based
Hierarchical technique to divide the area into rectangular cells.
A grid data structure contains summary information about each cell.
Hierarchical clustering
Similar to a quad tree

STING (figure)

STING Build Algorithm (figure)

STING Algorithm (figure)
Spatial Rules
Characteristic rule: The average family income in Dallas is $50,000.
Discriminant rule: The average family income in Dallas is $50,000, while in Plano the average income is $75,000.
Association rule: The average family income in Dallas for families living near White Rock Lake is $100,000.
Spatial Association Rules
Either the antecedent or the consequent must contain spatial predicates.
View the underlying database as a set of spatial objects.
May be created using a type of progressive refinement.

Spatial Association Rule Algorithm (figure)
Spatial Classification
Partition spatial objects.
May use nonspatial attributes and/or spatial attributes.
Generalization and progressive refinement may be used.

ID3 Extension
Neighborhood graph:
Nodes – objects
Edges – connect neighbors
The definition of neighborhood varies.
The ID3 extension considers nonspatial attributes of all objects in a neighborhood (not just one) for classification.
Spatial Decision Tree
Approach similar to that used for spatial association rules.
Spatial objects can be described based on objects close to them – a buffer.
The description of a class is based on an aggregation of nearby objects.

Spatial Decision Tree Algorithm (figure)
Spatial Clustering
Detect clusters of irregular shapes.
Use of centroids and simple distance approaches may not work well.
Clusters should be independent of the order of input.

Spatial Clustering (figure)
CLARANS Extensions
Remove the main-memory assumption of CLARANS.
Use spatial index techniques.
Use sampling and an R*-tree to identify central objects.
Change cost calculations by reducing the number of objects examined.
Voronoi diagram

Voronoi (figure)
SD(CLARANS)
Spatial dominant.
First clusters the spatial components using CLARANS.
Then iteratively replaces medoids, but limits the number of pairs to be searched.
Uses generalization.
Uses a learning tool to derive a description of each cluster.

SD(CLARANS) Algorithm (figure)
DBCLASD
Extension of DBSCAN.
Distribution-Based Clustering of LArge Spatial Databases.
Assumes items in a cluster are uniformly distributed.
Identifies the distribution satisfied by the distances between nearest neighbors.
Objects are added if the distribution is uniform.

DBCLASD Algorithm (figure)
Aggregate Proximity
Aggregate proximity – a measure of how close a cluster is to a feature.
An aggregate proximity relationship finds the k closest features to a cluster.
CRH algorithm – uses different shapes:
Encompassing circle
Isothetic rectangle
Convex hull

CRH (figure)
Temporal Mining Outline
Goal: Examine some temporal data mining issues and approaches.
Introduction
Modeling Temporal Events
Time Series
Pattern Detection
Sequences
Temporal Association Rules
Temporal Database
Snapshot – traditional database
Temporal – multiple time points
Ex:
Temporal Queries
A query has a time range [tsq, teq]; a database tuple is valid over a time range [tsd, ted].
Intersection query – the tuple's valid-time range intersects the query range.
Inclusion query – the tuple's valid-time range lies entirely within the query range.
Containment query – the tuple's valid-time range contains the query range.
Point query – the tuple retrieved is valid at a particular point in time.
Types of Databases
Snapshot – no temporal support
Transaction time – supports the time when the transaction inserted the data (timestamp or range)
Valid time – supports the time range when the data values are valid
Bitemporal – supports both transaction and valid time
Modeling Temporal Events
Techniques to model temporal events.
Often based on earlier approaches.
Finite State Recognizer (Machine) (FSR):
Each transition recognizes one character
Temporal ordering indicated by arcs
May recognize a sequence
Requires precisely defined transitions between states
Approaches:
Markov model
Hidden Markov model
Recurrent neural network

FSR (figure)
Markov Model (MM)
Directed graph:
Vertices represent states
Arcs show transitions between states
Each arc has a probability of transition
At any time, one state is designated as the current state.
Markov property – given the current state, the transition probability is independent of any previous states.
Applications: speech recognition, natural language processing

Markov Model (figure)
Hidden Markov Model (HMM)
Like an MM, but states need not correspond to observable states.
An HMM models a process that produces as output a sequence of observable symbols.
The HMM will actually output these symbols.
Associated with each node is the probability of the observation of an event.
Train the HMM to recognize a sequence.
Transition and observation probabilities are learned from a training set.

Hidden Markov Model (figure)
Modified from [RJ86]

HMM Algorithm (figure)
HMM Applications
Given a sequence of events and an HMM, what is the probability that the HMM produced the sequence?
Given a sequence and an HMM, what is the most likely state sequence which produced this sequence?
Recurrent Neural Network (RNN)
Extension to the basic NN.
A neuron can obtain input from any other neuron (including those in the output layer).
Can be used for both recognition and prediction applications.
Time to produce output is unknown.
The temporal aspect is added by backlinks.

RNN (figure)
Time Series
Set of attribute values over time.
Time series analysis – finding patterns in the values:
Trends
Cycles
Seasonal patterns
Outliers
Analysis Techniques
Smoothing – moving average of attribute values.
Autocorrelation – relationships between different subseries (yearly, seasonal).
Lag – time difference between related items.
Correlation coefficient r.

Smoothing (figure)

Correlation with Lag of 3 (figure)
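The two techniques can be sketched directly: a moving average for smoothing, and the Pearson correlation coefficient r between a series and a lagged copy of itself for autocorrelation at a given lag. Window and lag values in the usage example are arbitrary:

```python
def moving_average(series, w):
    """Smooth a time series with a simple moving average of window w."""
    return [sum(series[i:i + w]) / w for i in range(len(series) - w + 1)]

def lag_correlation(series, lag):
    """Correlation coefficient r between the series and itself shifted
    by `lag` time units (the autocorrelation at that lag)."""
    x, y = series[:-lag], series[lag:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

A series that repeats with period 3 has lag-3 autocorrelation 1, which is what a "correlation with lag of 3" plot would reveal.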
Similarity
Determine the similarity between a target pattern, X, and a sequence, Y: sim(X,Y).
Similar to Web usage mining.
Similar to earlier word processing and spelling corrector applications.
Issues:
Length
Scale
Gaps
Outliers
Baseline
Longest Common Subseries
Find the longest subseries the two have in common.
Ex: X = <10,5,6,9,22,15,4,2>, Y = <6,9,10,5,6,22,15,4,2>
Output: <22,15,4,2>
Sim(X,Y) = l/n = 4/9
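This is the classic longest-common-substring computation applied to series. A dynamic-programming sketch that reproduces the slide's example; here n is taken to be the length of the longer series, which matches the 4/9 above:

```python
def longest_common_subseries(x, y):
    """Longest run of values appearing contiguously in both series.

    Returns (subseries, similarity) where similarity = l / n,
    l = length of the common run, n = length of the longer series.
    """
    best, end = 0, 0
    # c[i][j] = length of the common run ending at x[i-1] and y[j-1]
    c = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                if c[i][j] > best:
                    best, end = c[i][j], i
    return x[end - best:end], best / max(len(x), len(y))
```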
Similarity Based on Linear Transformation
Linear transformation function f – converts a value from one series to a value in the second.
εf – tolerated difference in results.
δ – time value difference allowed.
Prediction
Predict future values of a time series.
Regression may not be sufficient.
Statistical techniques: ARMA, ARIMA
Neural networks

Pattern Detection
Identify patterns of behavior in a time series.
Speech recognition, signal processing
FSR, MM, HMM
String Matching
Find a given pattern in a sequence.
Knuth-Morris-Pratt: constructs an FSM from the pattern.
Boyer-Moore: compares the pattern against the text from right to left, skipping ahead on mismatches.
Distance Between Strings
Cost to convert one string to the other.
Transformations:
Match – current characters in both strings are the same.
Delete – delete the current character in the input string.
Insert – insert the current character of the target string into the input string.

Distance Between Strings (figure)
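With only the three transformations listed (match free, delete and insert at cost 1 each), this is the classic edit-distance DP without a substitute operation, so replacing a character costs a delete plus an insert. A sketch:

```python
def edit_distance(s, t):
    """Cost to convert string s into t using match (cost 0),
    delete, and insert (cost 1 each)."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i                          # delete all of s
    for j in range(len(t) + 1):
        d[0][j] = j                          # insert all of t
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:         # match: free
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j],    # delete from s
                                  d[i][j - 1])    # insert into s
    return d[len(s)][len(t)]
```

Note that `edit_distance("abc", "axc")` is 2, not 1, because a substitution must be expressed as a delete followed by an insert.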
Frequent Sequence (figure)

Frequent Sequence Example
Purchases made by customers:
s(<{A},{C}>) = 1/3
s(<{A},{D}>) = 2/3
s(<{B,C},{D}>) = 2/3

Frequent Sequence Lattice (figure)
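The support of a sequence pattern is the fraction of customer sequences that contain it: the pattern's itemsets must appear in order, each contained in some transaction's itemset. A sketch on a small hypothetical database (not the slide's actual data, which is not shown here):

```python
def contains(seq, pattern):
    """True if `pattern` (a list of itemsets) occurs in the customer
    sequence `seq` in order, each pattern itemset a subset of some
    transaction itemset occurring no earlier than the previous match."""
    i = 0
    for itemset in seq:
        if i < len(pattern) and pattern[i] <= itemset:
            i += 1
    return i == len(pattern)

def support(sequences, pattern):
    """Fraction of customer sequences containing the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)
```

For a hypothetical database of three customers, `[[{"A"}, {"C"}], [{"A"}, {"B"}], [{"C"}]]`, the pattern `<{A},{C}>` is contained only in the first sequence, so its support is 1/3.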
SPADE
Sequential Pattern Discovery using Equivalence classes.
Identifies patterns by traversing the lattice in a top-down manner.
Divides the lattice into equivalence classes and searches each separately.
ID-list: associates customers and transactions with each item.

SPADE Example
ID-lists for sequences of length 1:
Count for <{A}> is 3
Count for <{A},{D}> is 2

Q1 Equivalence Classes (figure)

SPADE Algorithm (figure)
Temporal Association Rules
A transaction has a time component: <TID, CID, I1, I2, …, Im, ts, te>
[ts, te] is the range of time during which the transaction is active.
Types:
Inter-transaction rules
Episode rules
Trend dependencies
Sequence association rules
Calendric association rules
Inter-transaction Rules
Intra-transaction association rules: traditional association rules.
Inter-transaction association rules: rules across transactions.
Sliding window – how far apart (in time or number of transactions) to look for related itemsets.

Episode Rules
Association rules applied to sequences of events.
Episode – a set of event predicates and a partial ordering on them.
Sequence Association Rules
Association rules involving sequences.
Ex: <{A},{C}> ⇒ <{A},{D}>
Support = 1/3
Confidence = 1

Calendric Association Rules
Each transaction has a unique timestamp.
Group transactions based on the time interval within which they occur.
Identify large itemsets by looking only at transactions in this predefined interval.
UNIT V – CASE STUDY

DATA WAREHOUSING FOR THE TAMIL NADU GOVERNMENT
GISTNIC DATA WAREHOUSE
The General Information Service Terminal of National Informatics Centre (GISTNIC) data warehouse is an initiative taken by the National Informatics Centre to provide a comprehensive information database by the government on national issues ranging across diverse subjects, from food and agriculture to trends in the economy and the latest updates in science and technology. The GISTNIC data warehouse is a web-enabled SAS software solution.
The data warehouse aims to provide online information to key decision makers in the government sector, enabling them to make better strategic decisions with regard to administrative policies and investments.
The government of Tamil Nadu is the first to perceive the need and importance of converting data into valuable information for better decision making.
The GISTNIC web site has an online data warehouse which includes data marts on village amenities, rainfall, agricultural census data, essential commodity prices, etc.

OBJECTIVES OF THE WEB-ENABLED DATA WAREHOUSE
The objective is to put powerful decision-making tools in the hands of end users in order to facilitate prompt decision making.
DATA WAREHOUSING FOR THE MINISTRY OF COMMERCE
The Ministry of Commerce has set up the following seven export processing zones at various locations:
1. Kandla Free Trade Zone, Gandhidham
2. Santacruz Electronics Export Processing Zone, Bombay
3. Falta Export Processing Zone, West Bengal
4. Madras Export Processing Zone, Chennai
5. Cochin Export Processing Zone, Cochin
6. Noida Export Processing Zone, Noida
7. Visakhapatnam Export Processing Zone, Visakhapatnam
This case study report presents how a data warehouse can be effectively built for the Ministry of Commerce.
The objectives are:
1. Globalization of India's foreign trade
2. Attracting foreign investment
3. Scaling down tariff barriers
4. Encouraging high and internationally acceptable standards of quality
5. Simplification and streamlining of procedures governing imports and exports

DATA WAREHOUSE:
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of the management decision-making process.
OBJECTIVES OF THE DATA WAREHOUSE FOR THE MINISTRY OF COMMERCE
The Ministry of Commerce has been regularly reviewing the data warehouse in its board meetings. The ministry is equipped with all analysis variables in a reporting form, and each zone's performance and the progress of exports are reviewed for every zone of the country.
The data warehouse takes all analysis variables into consideration from all the zones, with an option to drill down to the daily data.

DATA WAREHOUSE IMPLEMENTATION STRATEGY FOR EXPORT PROCESSING ZONES
INFRASTRUCTURE
The basic infrastructure required for building the warehouse for the Ministry of Commerce is based on the communication infrastructure, hardware/software/tools, and manpower.
The common infrastructure
This includes all the tasks necessary to provide the technical basis for the warehouse environment, including the connectivity between the legacy environment and the new warehouse environment at both the network and the database level.
Manpower requirements
The senior officials in the Ministry of Commerce sponsored the whole warehouse implementation and played active roles as EXIM policy and business architects for the data warehouse, and also as subject area specialists.
KEY AREAS FOR ANALYSIS
The data topic deals with data access, mapping, derivation, transformation, and aggregation according to the requirements of the ministry. The key areas for decision making, which will be used for analytical processing and for data mining at the Ministry of Commerce, are listed as follows:
Unit-wise, sector-wise, and country-wise imports and exports
Direction of imports and exports
Sector-wise, country-wise, and zone-wise trends in imports and exports
Comparative country-wise exports and imports for each sector
DTA sales
Claims of reimbursement of central sales tax of zones
Deemed export benefits
Employment generation
Investments in the zone
Deployment of infrastructure
Growth of industrial units
Occupancy details
IMPLEMENTATION OF THE DATA WAREHOUSE
Operational data systems
The operational data systems keep the organization running by supporting daily transactions such as import and export bills submitted to the customs department in each zone, daily transactions of permissions and allotments, etc.
ARCHITECTURE
The architecture of the OLAP implementation consists of five layers.
All seven zones have DBMS/RDBMS data for the internal management of zone activities and have been forwarding MIS reports to the MoC, New Delhi.
The second layer is located at the MoC, New Delhi, with a large metadata repository and the data warehouse.
The metadata warehouse tools and the OLAP server handling and maintaining them are the focus of level 3 of the architecture.
Architecture diagram: secretary, front-end tools, MDDB, OLAP server, metadata, data warehouse, and data marts (MEPZ, SEPZ, CEPZ, KAEPZ, NEPZ, VEPZ, FEPZ).
DESIGN OF ANALYSIS/CATEGORY VARIABLES
The data model is prepared after the entire data availability and the data requirements are analyzed. The analysis variables for building the multidimensional cube are listed as follows:
1. Employment generation: managerial/skilled/unskilled classification with zone/unit/industry break-up
2. Investments in the zone
3. Performance of units and close monitoring during production
4. Deployment of infrastructure, etc.
Related tables: Ep_mast, ep_stmst, dist_mast, states, indu_mast; eo_mast, eo_stmst.
Analysis variables:
1. EPZ or EOU
2. Zone
3. Type of approval
4. Type of industry
5. State
6. District
7. Year of approval
8. Month of approval
9. Day of approval
10. Year of production commencement
11. Month of production commencement
12. Day of production commencement
13. Current status
14. Date of current status
15. Net foreign exchange percentage
16. Stipulated NFE
17. Number of approvals
18. Number of units
Export and import values with zone, sector, port, year, month, and day break-ups cover the following performance measures:
Zone-wise performance
Industry-wise performance
Country-wise performance
These indicate the following directions of exports:
Country-wise performance overall and with zone break-up
Port-wise performance, which will guide the examination of infrastructure in these ports
Export performance during different time periods and the analysis of the same
Trends over years/quarters/months for different countries, sectors, and zones
Comparative country-wise imports/exports for each sector
Related tables: Shipbill, country, industry, currency, shipment_mode, export, bill entry.
Export analysis variables: zone, export type, year of shipping bill, month of shipping bill, country, currency, mode of shipment, destination port, value of export, etc.
Import analysis variables: auto/approval, zone, import type, import year, import month, import day, type of goods, import value, duty foregone, import country, and mode of shipment.
DEEMED EXPORT BENEFITS
Analysis variables: zone, sector, claims received, amount disbursed, year/quarter/month.

EMPLOYMENT GENERATION
Analysis variables: zone, male/female, managerial/skilled/unskilled, number of employees.

INVESTMENTS IN THE ZONE
Analysis variables: zone, unit, NRI foreign investment, Indian investment, remittances received, approved value.

CONCLUSION
The data warehouse for the EPZs provides the ability to scale to large volumes of data and a seamless presentation of historical, projected, and derived data.
It helps the planners in what-if analysis and planning without depending on the zones to supply the data.
The time lag between the zones and the ministry is eliminated, and the analysis can be carried out at the speed of thought.
The Ministry of Commerce can add more dimensions to the proposed warehouse, with data collected from other offices, to evolve a data warehouse model for better analysis of the promotion of imports and exports in the country.
This will provide an excellent executive information system to the secretary and joint secretaries of the ministry.
DATA WAREHOUSE FOR THE FINANCE DEPARTMENT
Responsibilities of the finance department:
The finance department of the government of Andhra Pradesh has the following responsibilities:
Preparing a department-wise budget up to the sub-detail head and submitting it to the legislature for approval.
Watching government expenditure and revenue department-wise.
Looking after development activities under various plan schemes.
Monitoring other administrative matters related to all heads of the department.
Treasuries in Andhra Pradesh:
Money standing in the government account is kept either in treasuries or in banks. Money deposited in the banks is considered a general fund held in the books of the banks on behalf of the state.
Director of treasuries: treasuries are under the general control of the director of treasuries and accounts.
Sub-treasuries: the requirements of public business may make necessary the establishment of one or more sub-treasuries under a district treasury.
The accounts of receipts and payments at a sub-treasury must be included monthly in the accounts of the district treasury.
Treasuries handle all government receipts and payments.
Every transaction in the government is made through the related departments.
DATA WAREHOUSING FOR THE GOVERNMENT OF ANDHRA PRADESH
LESS THAN 2000: RECEIPTS
2000 TO 3999: SERVICE MAJOR HEADS
4000 TO 5999: CAPITAL OUTLAY
6000 TO 7999: LOANS
8000 AND ABOVE: DEPOSITS
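The range rules above amount to a simple classifier over the major head code. The sketch below is only illustrative; the function name and structure are not part of the treasuries application:

```python
def classify_major_head(code: int) -> str:
    """Map a major head code to its account category
    using the range rules listed above."""
    if code < 2000:
        return "RECEIPTS"
    elif code < 4000:
        return "SERVICE MAJOR HEADS"
    elif code < 6000:
        return "CAPITAL OUTLAY"
    elif code < 8000:
        return "LOANS"
    else:
        return "DEPOSITS"
```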
A project for building a data warehouse for OLAP was implemented by NIC for the department of treasuries. The concept of building a DW in the department of treasuries was established to provide easy access to integrated, up-to-date data related to various aspects of the department's functions.
DW technology is used to develop analytical tools designed to provide decision-making support at all levels of the department.
DESCRIPTION OF THE ACCOUNT HEAD: LENGTH OF THE CODE
MAJOR: 4
SUB-MAJOR: 2
MINOR: 3
GROUP SUB-HEAD: 1
SUB-HEAD: 2
DETAIL HEAD: 3
SUB-DETAIL HEAD: 3Slide125
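Given the fixed field widths above (18 digits in all), a full account head code can be split into its named components. This splitter is an illustrative sketch, not code from the treasuries application:

```python
# Field widths taken from the account head table above (total 18 digits).
ACCOUNT_HEAD_FIELDS = [
    ("major", 4),
    ("sub_major", 2),
    ("minor", 3),
    ("group_sub_head", 1),
    ("sub_head", 2),
    ("detail_head", 3),
    ("sub_detail_head", 3),
]

def split_account_head(code: str) -> dict:
    """Split an 18-digit account head string into its named parts."""
    assert len(code) == sum(w for _, w in ACCOUNT_HEAD_FIELDS)
    parts, pos = {}, 0
    for name, width in ACCOUNT_HEAD_FIELDS:
        parts[name] = code[pos:pos + width]
        pos += width
    return parts
```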
Traditional information systems implemented in the department of treasuries are based on transactional databases, which are not designed to provide fast and efficient access to information critical for decision making. Data required for analysis are typically distributed among a number of isolated information systems meeting the needs of different sub-treasuries.
The data warehouse technology provided to the department of treasuries by the National Informatics Centre eliminates these problems by storing current and historical data from disparate information systems. The data warehouse provides efficient analysis and monitoring of the financial data of treasuries.
It also evaluates internal and external business factors related to the operational, economic, and financial conditions of treasuries and budget utilization.
DIMENSIONS COVERED UNDER FINANCE DATA WAREHOUSE
The different dimensions taken for the drill-down approach against two measures (payments and receipts) are:
Department
District treasury office
Sub-treasury office
Drawing and disbursing officer
Time
Bank-wise
Based on different forms
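A drill-down over such dimensions amounts to aggregating the two measures at successively finer grouping levels. The sketch below uses invented sample records and is only an illustration of the idea, not the treasuries implementation:

```python
from collections import defaultdict

# Invented sample fact rows:
# (district, sub_treasury, ddo, payments, receipts)
facts = [
    ("Guntur", "STO-1", "DDO-11", 500.0, 120.0),
    ("Guntur", "STO-1", "DDO-12", 300.0, 80.0),
    ("Guntur", "STO-2", "DDO-21", 200.0, 40.0),
]

def roll_up(rows, depth):
    """Aggregate payments and receipts grouped by the first `depth`
    dimension columns (1 = district, 2 = + sub-treasury, 3 = + DDO)."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for row in rows:
        key = row[:depth]
        totals[key][0] += row[3]   # payments
        totals[key][1] += row[4]   # receipts
    return dict(totals)

# Drill down: district level first, then sub-treasury level.
district_totals = roll_up(facts, 1)  # {("Guntur",): [1000.0, 240.0]}
sto_totals = roll_up(facts, 2)
```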
COGNOS GRAPHICAL USER INTERFACE FOR TREASURIES DATA WAREHOUSE
Impromptu:
It is used for generating various kinds of reports, such as simple and crosstab reports.
Transformer:
Transformer model objects contain definitions of queries, dimensions, measures, dimension views, user classes, and related authentication information, as well as objects for one or more cubes that Transformer creates for viewing and reporting in PowerPlay.Slide126
PowerPlay:
COGNOS PowerPlay is used to populate reports with a drill-down facility.
Scheduler:
Scheduler coordinates the execution of automated processes, called tasks, on a set date and time or at recurring intervals.
Authenticator:
Authenticator is a user-class management system. It provides COGNOS client applications with the ability to create and show data based on user-authenticated access.Slide127
HP can easily access and quickly analyze enormous volumes of sell-through data to help its reseller customers improve the efficiency and profitability of their businesses, using an OLAP system based on Microsoft SQL Server and Knosys ProClarity. Hewlett-Packard is a worldwide market leader in the $18 million inkjet industry. In the past, HP's brand recognition and reputation for reliability were enough to ensure that a reseller would carry HP products.
ACCESS TO INFORMATION NEEDED USING DATA WAREHOUSING TECHNOLOGY
HP has captured and stored information from both primary research and third parties.
The business analysis group decided they needed a system that would provide market metric data to help field sales force managers and account teams make brand and channel management decisions.
HP required a system with low cost and low maintenance that was as simple to administer as possible. So the group turned to Knosys, a Boise, Idaho-based software company that has developed a business analysis/online analytical processing (OLAP) package called ProClarity.
The solution enables the data to be moved so quickly, and at such a low cost of maintenance and ownership, that it solves these problems.
Knosys helped the group build the data flow algorithms with the Microsoft Visual Basic development system and SQL Server 7.0 Data Transformation Services.
Due to HP's enormous sell-through data volumes, it would have taken too long to build analytical models with a pure multidimensional OLAP solution.
HP used the virtual cube and cube partitioning capabilities of SQL Server 7.0.
Virtual cube capabilities allow decision makers to cross-analyze data from all these OLAP sources.
Cube partitioning allows HP to manage a large number of OLAP cubes more effectively.
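Partitioning a large cube, typically along a dimension such as time, lets each slice be rebuilt and aggregated independently. The sketch below is a generic illustration of the idea with invented sample rows, not HP's implementation:

```python
from collections import defaultdict

# Invented sample fact rows: (year_month, product, units_sold)
rows = [
    ("1999-01", "DeskJet", 10),
    ("1999-01", "LaserJet", 4),
    ("1999-02", "DeskJet", 7),
]

def partition_by_month(rows):
    """Route each fact row to the partition for its month, so each
    partition can be processed or rebuilt on its own."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)
    return dict(partitions)

parts = partition_by_month(rows)
jan_units = sum(r[2] for r in parts["1999-01"])  # 14
```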
DATA WAREHOUSING IN HEWLETT-PACKARDSlide128
Knosys ProClarity provides HP decision makers with the key to analyzing masses of data. ProClarity is fully integrated with Microsoft products, and its PC-based client is modelled after Internet Explorer 4.0.
ProClarity's powerful analytical features take full advantage of the robust capabilities found in SQL Server 7.0 OLAP Services.
CONCLUSION
HP's ProClarity system and SQL Server 7.0 provide more accurate, detailed, and timely data, which makes the business more efficient and helps its resellers make their businesses more efficient.Slide129
In 1998, ArsDigita Corporation built a web service as a front-end to an experimental custom clothing factory operated by Levi Strauss.
The whole purpose of the factory and web service was to test and analyze consumer reaction to this method of buying clothes. Therefore, a data warehouse was built into the project almost from the beginning. The public website was supported by a mid-range HP Unix server that had ample leftover capacity to run the data warehouse.
A new 'dw' Oracle user was created to support the data warehouse.
SELECT on the OLTP tables was GRANTed to the 'dw' user, and procedures were written to copy all the data from the OLTP system into a star schema of tables owned by the 'dw' user.
This kind of schema has been proven to scale to the world's largest data warehouses.
In a star join schema, there is one fact table that references a number of dimension tables.
The following dimension tables were designed after discussions with Levi's executives:
1. TIME: for queries comparing sales by season, quarter, or holiday
2. PRODUCT: for queries comparing sales by color or style
3. SHIP TO: for queries comparing sales by region or state
4. PROMOTION: for queries aimed at determining the relationship between discounts and sales
5. USER EXPERIENCE: for queries looking at returned versus exchanged versus accepted items
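A minimal star schema along these lines can be sketched with the Python standard library's sqlite3 module. The table and column names here are illustrative only, not the actual Levi's schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables (illustrative columns), plus one central fact
# table that references them -- the star join shape described above.
cur.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, quarter TEXT, holiday TEXT);
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, color TEXT, style TEXT);
CREATE TABLE ship_to_dim (ship_to_id INTEGER PRIMARY KEY, region TEXT, state TEXT);
CREATE TABLE sales_fact (
    time_id INTEGER REFERENCES time_dim,
    product_id INTEGER REFERENCES product_dim,
    ship_to_id INTEGER REFERENCES ship_to_dim,
    amount REAL
);
""")

cur.execute("INSERT INTO time_dim VALUES (1, 'Q4', 'Christmas')")
cur.execute("INSERT INTO product_dim VALUES (1, 'indigo', 'classic')")
cur.execute("INSERT INTO ship_to_dim VALUES (1, 'West', 'CA')")
cur.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 59.0)")
cur.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 41.0)")

# A typical star-join query: total sales by quarter.
cur.execute("""
SELECT t.quarter, SUM(f.amount)
FROM sales_fact f JOIN time_dim t ON f.time_id = t.time_id
GROUP BY t.quarter
""")
print(cur.fetchall())  # [('Q4', 100.0)]
```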
DATA WAREHOUSING IN LEVI STRAUSSSlide130
The World Bank collects and maintains huge volumes of data on economic and developmental parameters for all the third-world countries across the globe.
The bank performed analysis on this huge data manually, and later with limited analysis tools. The bank collects and analyzes macroeconomic and financial statistics, as well as information on parameters such as poverty, health, education, environment, and the public sector.
THE LIVE DATABASE DATA WAREHOUSE
The OLAP cubes were defined for this database using the OLAP server module of SQL Server 2000.
Universal access was provided for this data warehouse, which was called the live database (LDB).
BENEFITS OF THE SECOND-GENERATION LDB DATA WAREHOUSE
The first-generation LDB data warehouse had certain limitations. Therefore, the second-generation LDB data warehouse was built using SQL Server 2000 Analysis Server along with ProClarity from Knosys Corporation.
The package offered direct user functionality that would otherwise require technical intervention by a programmer.
ProClarity also provides web enablement, thereby ensuring universal accessibility.
It results in significant cost savings by reducing the time and effort required to prepare the large variety of reports needed to suit the varying needs of economists and other governmental decision makers, aiding effective and better economic planning.
DATA WAREHOUSING IN WORLD BANK