Data Mining Concepts


Presentation Transcript

Slide 1: Data Mining Concepts

Introduction to Directed Data Mining: Neural Networks

Prepared by David Douglas, University of Arkansas
Hosted by the University of Arkansas
Microsoft Enterprise Consortium

Slide 2: Neural Networks


Complex learning systems recognized in animal brains

Single neuron has simple structure

Interconnected sets of neurons perform complex learning tasks

Human brain has 10^15 synaptic connections

Artificial Neural Networks attempt to replicate the non-linear learning found in nature (the word "artificial" is usually dropped)

Adapted from Larose

Slide 3: Neural Networks (Cont) - Terms Used

Layers

Input, hidden, output

Feed forward

Fully connected

Back propagation

Learning rate

Momentum

Optimization / sub optimization

Slide 4: Neural Networks (Cont)

[Figure: Structure of a neural network]

Adapted from Barry & Linoff

Slide 5: Neural Networks (Cont)

Inputs use weights and a combination function to obtain a value for each neuron in the hidden layer.

Then a non-linear response (the activation function, usually a sigmoid transform) is generated from each neuron in the hidden layer to the output.

After the initial pass, accuracy is evaluated and back propagation through the network occurs, changing the weights for the next pass.

This is repeated until the differences (deltas) are small. Beware: this could be a sub-optimal solution.

[Figure: Input layer, hidden layer, and output layer, with labels for the combination function, activation function, and sigmoid transform]

Adapted from Larose
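The cycle described on this slide (combination function, sigmoid activation, evaluation, back propagation, repeat) can be sketched in a few lines of Python. This is a minimal illustration rather than the slides' own code; the network size, random data, learning rate, and stopping rule are all assumed.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed toy problem: 10 records, 3 predictors, 2 hidden nodes, 1 output node
rng = np.random.default_rng(0)
X = rng.random((10, 3))
y = rng.random(10)
W_hidden = rng.random((3, 2))    # input layer  -> hidden layer weights
W_output = rng.random(2)         # hidden layer -> output node weights
rate = 0.1                       # learning rate

for _ in range(1000):
    hidden = sigmoid(X @ W_hidden)       # combination function + sigmoid per hidden node
    output = sigmoid(hidden @ W_output)  # output node value
    error = y - output                   # evaluate accuracy against actual values
    # back propagation: adjust weights for the next pass
    delta_out = error * output * (1 - output)
    delta_hid = np.outer(delta_out, W_output) * hidden * (1 - hidden)
    W_output += rate * hidden.T @ delta_out
    W_hidden += rate * X.T @ delta_hid
    if np.max(np.abs(error)) < 0.01:     # stop once the deltas are small
        break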

Slide 6: Neural Networks (Cont)

Neural network algorithms require inputs to be within a small numeric range. This is easy to do for numeric variables using the min-max range approach, which produces values between 0 and 1:

X* = (X - min(X)) / (max(X) - min(X))

Other methods can also be applied.

Neural Networks, as with Logistic Regression, do not handle missing values, whereas Decision Trees do. Many data mining software packages automatically patch up missing values, but I recommend the modeler know how the software is handling those values.
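A small sketch of the min-max approach; the example age values are made up for illustration.

def min_max_scale(values):
    """Rescale a numeric column to the 0-1 range using min-max scaling."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical ages: 20 maps to 0.0, 60 maps to 1.0, the rest fall in between
print(min_max_scale([20, 35, 50, 60]))   # [0.0, 0.375, 0.75, 1.0]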

Adapted from Larose

Slide 7: Neural Networks (Cont)

Categorical

Indicator variables (sometimes referred to as 1 of n) are used when the number of category values is small.

A categorical variable with k classes is translated to k - 1 indicator variables, as in the example below (see the sketch after the example).

For example, Gender attribute values are “Male”, “Female”, and “Unknown”

Classes k = 3

Create k – 1 = 2 indicator variables named Male_I and Female_I

Male records have values Male_I = 1, Female_I = 0

Female records have values Male_I = 0, Female_I = 1

Unknown records have values Male_I = 0, Female_I = 0
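A minimal sketch of this k - 1 indicator encoding for the Gender example; the function name and dictionary output format are my own, not from the slides.

def indicator_encode(value, categories):
    """Encode one categorical value as k-1 indicator variables.

    The last category (here "Unknown") is the reference class and gets all zeros.
    """
    return {f"{c}_I": int(value == c) for c in categories[:-1]}

categories = ["Male", "Female", "Unknown"]       # k = 3 classes
print(indicator_encode("Male", categories))      # {'Male_I': 1, 'Female_I': 0}
print(indicator_encode("Female", categories))    # {'Male_I': 0, 'Female_I': 1}
print(indicator_encode("Unknown", categories))   # {'Male_I': 0, 'Female_I': 0}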

Adapted from Larose

Slide 8: Neural Networks (Cont)

Categorical

Be very careful when working with categorical variables in neural networks when mapping the variables to numbers. The mapping introduces an ordering of the variables, which the neural network takes into account. 1 of n solves this problem but is cumbersome for a large number of categories.

Codes for marital status ("single," "divorced," "married," "separated," "widowed," and "unknown") could be coded:

Single      0
Divorced    .2
Married     .4
Separated   .6
Widowed     .8
Unknown     1.0

Note the implied ordering.

Adapted from Barry & Linoff

Slide 9: Neural Networks (Cont)

Data Mining Software

Note that most modern data mining software takes care of these issues for you, but you need to be aware that it is happening and what default settings are being used.

For example, the following was taken from the PASW Modeler 13 Help topics describing binary set encoding (an advanced topic):

Use binary set encoding

If this option is selected, a compressed binary encoding scheme for set fields is used. This option allows you to easily build neural net models using set fields with large numbers of values as inputs. However, if you use this option, you may need to increase the complexity of the network architecture (by adding more hidden units or more hidden layers) to allow the network to properly use the compressed information in binary encoded set fields.

Note: The simplemax and softmax scoring methods, SQL generation, and export to PMML are not supported for models that use binary set encoding.

Slide 10: A Numeric Example

Feed forward restricts network flow to a single direction.

Fully connected.

Flow does not loop or cycle.

Network is composed of two or more layers.

[Figure: A fully connected feed-forward network. Input layer nodes 1, 2, and 3 carry inputs x1, x2, x3 (plus a constant input x0); hidden layer nodes A and B; output layer node Z; weights W0A, W1A, W2A, W3A, W0B, W1B, W2B, W3B, W0Z, WAZ, WBZ label the connections.]

Adapted from Larose

Slide 11: Numeric Example (Cont)

Most networks have input, hidden & output layers.

Network may contain more than one hidden layer.

Network is completely connected: each node in a given layer is connected to every node in the next layer.

Every connection has a weight (Wij) associated with it.

Weight values randomly assigned 0 to 1 by algorithm

Number of input nodes dependent on number of predictors

Number of hidden and output nodes configurable

How many nodes in hidden layer?

A large number of nodes increases the complexity of the model.

Detailed patterns are uncovered in the data, but this leads to overfitting at the expense of generalizability.

Reduce the number of hidden nodes when overfitting occurs.

Increase the number of hidden nodes when training accuracy is unacceptably low.

Slide 12: Numeric Example (Cont)

The combination function produces a linear combination of the node inputs and connection weights as a single scalar value.

Consider the following weights:

x0 = 1.0    W0A = 0.5    W0B = 0.7    W0Z = 0.5
x1 = 0.4    W1A = 0.6    W1B = 0.9    WAZ = 0.9
x2 = 0.2    W2A = 0.8    W2B = 0.8    WBZ = 0.9
x3 = 0.7    W3A = 0.6    W3B = 0.4

Combination function to get the hidden layer node values:

NetA = .5(1) + .6(.4) + .8(.2) + .6(.7) = 1.32
NetB = .7(1) + .9(.4) + .8(.2) + .4(.7) = 1.50

Adapted from Larose
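These two combination-function values can be checked with a short NumPy sketch; the array layout is my own.

import numpy as np

x = np.array([1.0, 0.4, 0.2, 0.7])      # x0 (constant input), x1, x2, x3
w_A = np.array([0.5, 0.6, 0.8, 0.6])    # W0A, W1A, W2A, W3A
w_B = np.array([0.7, 0.9, 0.8, 0.4])    # W0B, W1B, W2B, W3B

net_A = float(x @ w_A)                  # 0.5 + 0.24 + 0.16 + 0.42 = 1.32
net_B = float(x @ w_B)                  # 0.7 + 0.36 + 0.16 + 0.28 = 1.50
print(net_A, net_B)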

Slide 13: Numeric Example (Cont)

The transformation function is typically the sigmoid function, as shown below:

f(x) = 1 / (1 + e^(-x))

The transformed values for nodes A and B would then be:

f(NetA) = 1 / (1 + e^(-1.32)) = .7892
f(NetB) = 1 / (1 + e^(-1.50)) = .8176

Adapted from Larose

Prepared by David Douglas, University of Arkansas
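A quick check of these sigmoid values; the helper function is my own.

import math

def sigmoid(x):
    """Sigmoid transformation function: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

print(round(sigmoid(1.32), 4))   # 0.7892 -> output of node A
print(round(sigmoid(1.50), 4))   # 0.8176 -> output of node B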

Slide 14: Numeric Example (Cont)

Node Z combines the outputs of the two hidden nodes A and B as follows:

NetZ = .5(1) + .9(.7892) + .9(.8176) = 1.9461

The NetZ value is then put into the sigmoid function, giving an output of .8750.
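Extending the same sketch to the output node, with the weights taken from the table on Slide 12 (variable names are mine):

import math

w_0Z, w_AZ, w_BZ = 0.5, 0.9, 0.9       # constant, A-to-Z, and B-to-Z weights
out_A, out_B = 0.7892, 0.8176          # hidden node outputs from Slide 13

net_Z = w_0Z * 1.0 + w_AZ * out_A + w_BZ * out_B
out_Z = 1.0 / (1.0 + math.exp(-net_Z))     # sigmoid transform
print(round(net_Z, 4), round(out_Z, 4))    # 1.9461 0.875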

Adapted from Larose

Slide 15: Learning via Back Propagation

The output from each record that goes through the network can be compared to an actual value. Then the squared differences are summed over all the records (SSE).

The idea is then to find weights that minimize the sum of squared errors. The Gradient Descent method optimizes the weights to minimize the SSE.

This results in an equation for the output layer nodes and a different equation for the hidden layer nodes, and it utilizes the learning rate and momentum.
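For instance, a minimal SSE computation over a few records; the numbers here are invented purely for illustration.

actual    = [0.8, 0.3, 0.6]        # hypothetical target values
predicted = [0.875, 0.25, 0.55]    # hypothetical network outputs

sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
print(round(sse, 6))               # 0.010625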

Slide 16: Gradient Descent Method Equations

Output layer nodes:

R_j = output_j (1 - output_j) (actual - output_j)

where R_j is the responsibility for error at node j.

Hidden layer nodes:

R_j = output_j (1 - output_j) ∑ (w_jk R_k), summed over the downstream nodes k

where the sum is the weighted sum of the error responsibilities for the downstream nodes.

Prepared by David Douglas, University of Arkansas

Slide 17: Numeric Example (output node)

Assume that these values are used to calculate the network output of .8750, which is compared to the actual value of .8 for the record.

Then back propagation changes the weights, starting with the constant weight (initially .5) for node Z, the only output node.

The equation for the responsibility for error at output node Z:

R_j = output_j (1 - output_j) (actual - output_j)
R_Z = .8750 (1 - .8750) (.8 - .8750) = -.0082

Calculate the change for the weight transmitting a constant 1 unit, with a learning rate of .1:

delta_w_0Z = .1 (-.0082) (1) = -.00082

Calculate the new weight:

w_0Z,new = .5 - .00082 = .49918

which will now be used instead of .5.
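A small sketch of this output-node update; the variable names are mine, but the arithmetic matches the slide.

learning_rate = 0.1
out_Z, actual = 0.8750, 0.8
w_0Z = 0.5                                        # constant weight into node Z

R_Z = out_Z * (1 - out_Z) * (actual - out_Z)      # responsibility for error at Z
delta = learning_rate * R_Z * 1.0                 # input on this link is the constant 1
w_0Z_new = w_0Z + delta

print(round(R_Z, 4), round(delta, 5), round(w_0Z_new, 5))   # -0.0082 -0.00082 0.49918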

Adapted from Larose

Slide 18: Numeric Example (hidden layer node)

Now consider the hidden layer node A.

The equation is:

R_j = output_j (1 - output_j) ∑ (w_jk R_k), summed over the downstream nodes k

The only downstream node is Z; the original w_AZ = .9, its error responsibility is R_Z = -.0082, and the output of node A was .7892.

Thus:

R_A = .7892 (1 - .7892) (.9) (-.0082) = -.00123
delta_w_AZ = .1 (-.0082) (.7892) = -.000647
w_AZ,new = .9 - .000647 = .899353

This back propagation continues through the nodes, and the process is repeated until the weights change very little.
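And the matching sketch for the hidden-layer side (again, variable names are mine):

learning_rate = 0.1
out_A, w_AZ, R_Z = 0.7892, 0.9, -0.0082

R_A = out_A * (1 - out_A) * (w_AZ * R_Z)    # error responsibility at node A
delta = learning_rate * R_Z * out_A         # change for the A-to-Z weight
w_AZ_new = w_AZ + delta

print(round(R_A, 5), round(delta, 6), round(w_AZ_new, 6))   # -0.00123 -0.000647 0.899353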

Slide 19: Learning rate and Momentum

The learning rate, eta, determines the magnitude of changes to the weights.

Momentum, alpha, is analogous to the mass of a rolling object, as shown below. The mass of the smaller object may not have enough momentum to roll over the top to find the true optimum.

Adapted from Larose

[Figure: Two plots of SSE against weight w, each with points I, A, B, and C marked on the curve, comparing small momentum and large momentum]

Slide 20: Lessons Learnt


Versatile data mining tool

Proven

Based on biological models of how the brain works

Feed-forward is the most common type.

Back propagation for training sets has been replaced with other methods, notably conjugate gradient.

Drawbacks

Works best with only a few input variables, and it does not help in selecting the input variables.

No guarantee that weights are optimal—build several and take the best one

Biggest problem is that it does not explain what it is doing (no rules).