Presentation Transcript

1. Knowledge Data Discovery
TOPIC 9 - Classification: Advanced Methods (1)
Antoni Wibowo

2. Course Outline
- Bayes classifiers
- Naive Bayes classifiers
- Bayesian belief networks
- Artificial Neural Networks (ANN)

3. Note: These slides are based on the additional material provided with the textbooks that we use: J. Han, M. Kamber and J. Pei, "Data Mining: Concepts and Techniques", and P. Tan, M. Steinbach, and V. Kumar, "Introduction to Data Mining".

4. Bayes Classifiers

5. Bayesian Classification: Why?
- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
- Foundation: based on Bayes' theorem
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

6. Bayes' Theorem: Basics
- Total probability theorem and Bayes' theorem (stated below)
- Let X be a data sample ("evidence"): its class label is unknown
- Let H be a hypothesis that X belongs to class C
- Classification is to determine P(H|X), the posterior probability: the probability that the hypothesis holds given the observed data sample X
- P(H) (prior probability): the initial probability, e.g., that X will buy a computer, regardless of age, income, ...
- P(X): the probability that the sample data is observed
- P(X|H) (likelihood): the probability of observing the sample X given that the hypothesis holds, e.g., given that X will buy a computer, the probability that X is 31..40 with medium income
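The two formulas named on this slide were not captured in the transcript; in their standard form (with H_1, ..., H_m a partition of the sample space) they are:

```latex
\[
P(X) = \sum_{i=1}^{m} P(X \mid H_i)\, P(H_i)   % total probability theorem
\]
\[
P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}   % Bayes' theorem
\]
```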

7. Prediction Based on Bayes' Theorem
- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows from Bayes' theorem
- Informally, this can be viewed as: posterior = likelihood x prior / evidence
- Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all P(Ck|X) over the k classes
- Practical difficulty: it requires initial knowledge of many probabilities, involving significant computational cost

8. Classification Is to Derive the Maximum Posterior
- Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn)
- Suppose there are m classes C1, C2, ..., Cm
- Classification is to derive the maximum posterior, i.e., the maximal P(Ci|X)
- This can be derived from Bayes' theorem (see below)
- Since P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized
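The derivation referenced above, written out in the standard form (the slide's own formula was not captured in the transcript):

```latex
\[
P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}
\qquad\Rightarrow\qquad
\text{predict } C_i = \arg\max_{k}\; P(X \mid C_k)\, P(C_k)
\]
```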

9. Naive Bayes Classifiers

10. Naïve Bayes Classifier
- A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes)
- This greatly reduces the computation cost: only the class distribution needs to be counted
- If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D)
- If Ak is continuous-valued, P(xk|Ci) is usually estimated with a Gaussian distribution with mean μ and standard deviation σ (see the formulas below)
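The formulas referenced on this slide, in their standard form (the slide's own rendering was not captured in the transcript):

```latex
\[
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)   % conditional independence assumption
\]
\[
g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}},
\qquad
P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})   % continuous-valued attribute
\]
```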

11. Naïve Bayes Classifier: Training Dataset
- Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
- Training data: the 14-tuple buys_computer table (attributes age, income, student, credit_rating); the table itself appears as a figure on the slide
- Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

12. Naïve Bayes Classifier: An Example
- P(Ci): P(buys_computer = "yes") = 9/14 = 0.643; P(buys_computer = "no") = 5/14 = 0.357
- Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
- X = (age <= 30, income = medium, student = yes, credit_rating = fair)
- P(X|Ci): P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
           P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
- P(X|Ci) * P(Ci): P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
                   P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007
- Therefore, X belongs to the class "buys_computer = yes" (see the worked sketch below)
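A short Python sketch that reproduces the arithmetic above; the fractions come directly from the slide, and the dictionary keys are illustrative labels:

```python
# Naive Bayes scoring for X = (age <= 30, income = medium, student = yes, credit_rating = fair),
# using the class priors and conditional probabilities listed on the slide.
priors = {"yes": 9 / 14, "no": 5 / 14}

cond = {  # P(attribute value | class)
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9, "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5, "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

scores = {}
for label in priors:
    likelihood = 1.0
    for value in ("age<=30", "income=medium", "student=yes", "credit=fair"):
        likelihood *= cond[label][value]          # naive conditional-independence assumption
    scores[label] = likelihood * priors[label]    # P(X|Ci) * P(Ci)

print(scores)                        # roughly {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))   # 'yes'
```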

13. Avoiding the Zero-Probability Problem
- Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability will be zero
- Example: suppose a dataset with 1000 tuples has income = low (0 tuples), income = medium (990), and income = high (10)
- Use the Laplacian correction (Laplacian estimator): add 1 to each case
  Prob(income = low) = 1/1003
  Prob(income = medium) = 991/1003
  Prob(income = high) = 11/1003
- The "corrected" probability estimates are close to their "uncorrected" counterparts
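A minimal Python sketch of the Laplacian (add-one) correction, using the counts from the slide:

```python
# Add-one smoothing: each value's count is incremented by 1, and the
# denominator grows by the number of distinct values.
counts = {"low": 0, "medium": 990, "high": 10}   # 1000 tuples in total

n = sum(counts.values())
k = len(counts)                                  # 3 distinct income values

smoothed = {value: (c + 1) / (n + k) for value, c in counts.items()}
print(smoothed)  # {'low': 1/1003, 'medium': 991/1003, 'high': 11/1003}
```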

14. Naïve Bayes (Summary)
Advantages:
- Easy to implement
- Good results obtained in most cases
- Robust to isolated noise points
- Handles missing values by ignoring the instance during probability estimate calculations
- Robust to irrelevant attributes
Caveat: the independence assumption may not hold for some attributes; use other techniques such as Bayesian belief networks (BBN)

15. Naïve Bayes (Summary)
Disadvantages:
- The assumption of class-conditional independence causes a loss of accuracy
- In practice, dependencies exist among variables, e.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
- Dependencies among these cannot be modeled by the naïve Bayes classifier

16. Bayesian Belief Networks

17. Bayesian Belief Networks
- A Bayesian belief network (also known as a Bayesian network or probabilistic network) allows class-conditional independencies between subsets of variables
- Two components: (1) a directed acyclic graph (called the structure) and (2) a set of conditional probability tables (CPTs)
- A (directed acyclic) graphical model of causal influence relationships
- Represents dependencies among the variables and gives a specification of the joint probability distribution
- In the slide's example graph over the variables X, Y, Z, and P: nodes are random variables and links are dependencies; X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P; the graph has no loops or cycles

18. A Bayesian Network and Some of Its CPTs
- Variables in the example network: Fire (F), Smoke (S), Leaving (L), Tampering (T), Alarm (A), Report (R)
- CPT: Conditional Probability Table; a CPT shows the conditional probability for each possible combination of values of a node's parents
- The probability of a particular combination of values of X is derived from the CPTs (see the factorization below)

  Fire  | Smoke | Θ(S|F)
  True  | True  | 0.90
  False | True  | 0.01

  Fire  | Tampering | Alarm | Θ(A|F,T)
  True  | True      | True  | 0.5
  True  | False     | True  | 0.99
  False | True      | True  | 0.85
  False | False     | True  | 0.0001
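The derivation referenced on the slide is the standard Bayesian-network factorization (the slide's own formula was not captured in the transcript):

```latex
\[
P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{Parents}(x_i)\bigr)
\]
```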

19. How Are Bayesian Networks Constructed?
- Subjective construction: identification of the (direct) causal structure
  - People are quite good at identifying direct causes from a given set of variables and at judging whether the set contains all relevant direct causes
  - Markovian assumption: each variable becomes independent of its non-effects once its direct causes are known
    E.g., with S ← F → A ← T, the path S → A is blocked once we know F → A
  - HMM (Hidden Markov Model): often used to model dynamic systems whose states are not observable, yet whose outputs are
- Synthesis from other specifications
  - E.g., from a formal system design: block diagrams and information flow
- Learning from data
  - E.g., from medical records or student admission records
  - Learn the parameters given the structure, or learn both the structure and the parameters
  - Maximum likelihood principle: favors Bayesian networks that maximize the probability of observing the given data set

20. Training Bayesian Networks: Several Scenarios
- Scenario 1: Network structure given and all variables observable: compute only the CPT entries
- Scenario 2: Network structure known, some variables hidden: gradient descent (greedy hill-climbing), i.e., search for a solution along the steepest descent of a criterion function
  - Weights are initialized to random probability values
  - At each iteration, the search moves towards what appears to be the best solution at the moment, without backtracking
  - Weights are updated at each iteration and converge to a local optimum
- Scenario 3: Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
- Scenario 4: Unknown structure, all variables hidden: no good algorithms are known for this purpose
- Reference: D. Heckerman, "A Tutorial on Learning with Bayesian Networks," in Learning in Graphical Models, M. Jordan, ed., MIT Press, 1999

21. Artificial Neural Networks (ANN)

22. Classification by Backpropagation
- Backpropagation: a neural network learning algorithm
- Started by psychologists and neurobiologists to develop and test computational analogues of neurons
- A neural network: a set of connected input/output units where each connection has a weight associated with it
- During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
- Also referred to as connectionist learning due to the connections between units

23. Neuron: A Hidden/Output Layer Unit
- An n-dimensional input vector x is mapped to a variable y by means of a scalar product and a nonlinear function mapping
- The inputs to a unit are the outputs from the previous layer; they are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with the unit; a nonlinear activation function is then applied (see the sketch below)
- (The slide's figure shows an input vector x0..xn, a weight vector w0..wn, the weighted sum, the bias, the activation function f, and the output y)
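A minimal Python sketch of the unit described above; the sigmoid activation and the example numbers are illustrative assumptions, not values from the slide:

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through a sigmoid activation."""
    net = sum(x * w for x, w in zip(inputs, weights)) + bias  # net input to the unit
    return 1.0 / (1.0 + math.exp(-net))                       # logistic (sigmoid) activation

# Example: three inputs arriving from the previous layer (illustrative values only)
print(neuron_output([0.2, 0.7, 1.0], [0.4, -0.6, 0.1], bias=0.05))
```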

24. Artificial Neural Networks (ANN)
- Example (shown as a figure on the slide): the output Y is 1 if at least two of the three inputs are equal to 1

25. Artificial Neural Networks (ANN)

26. Artificial Neural Networks (ANN)
- The model is an assembly of interconnected nodes and weighted links
- The output node sums up each of its input values according to the weights of its links
- The output node's sum is compared against some threshold t
- Perceptron model (the formula appears as a figure on the slide; see the sketch below)
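A minimal Python sketch of the perceptron model described above. The weights (0.3 each) and threshold (0.4) are assumptions chosen so the output reproduces the "at least two of three inputs equal to 1" behavior from slide 24; they are not given in the transcript:

```python
def perceptron(inputs, weights, t):
    """Perceptron model: output 1 if the weighted sum of inputs exceeds the threshold t, else 0."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return 1 if s > t else 0

# With these assumed weights and threshold, Y is 1 exactly when
# at least two of the three binary inputs are 1.
weights, t = [0.3, 0.3, 0.3], 0.4
for x in ([1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1]):
    print(x, "->", perceptron(x, weights, t))
```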

27. How a Multi-Layer Neural Network Works
- The inputs to the network correspond to the attributes measured for each training tuple
- Inputs are fed simultaneously into the units making up the input layer
- They are then weighted and fed simultaneously to a hidden layer
- The number of hidden layers is arbitrary, although usually only one is used
- The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction
- The network is feed-forward: none of the weights cycles back to an input unit or to an output unit of a previous layer
- From a statistical point of view, such networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function

28. A Multi-Layer Feed-Forward Neural Network
- (The slide's figure shows an input vector X feeding an input layer, a hidden layer connected by weights wij, and an output layer producing the output vector)

29. Defining a Network Topology
- Decide the network topology: specify the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer
- Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
- For discrete attributes, use one input unit per domain value, each initialized to 0 (see the sketch below)
- For classification with more than two classes, use one output unit per class
- If a trained network's accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
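A small Python sketch of the two input-encoding steps mentioned above, assuming simple min-max scaling for numeric attributes and one-hot (one unit per domain value) encoding for discrete ones; the example values are illustrative:

```python
def min_max_normalize(values):
    """Scale a list of numeric attribute values into [0.0, 1.0]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def one_hot(value, domain):
    """One input unit per domain value: the matching unit is set to 1, the rest stay 0."""
    return [1.0 if value == d else 0.0 for d in domain]

print(min_max_normalize([30, 45, 60]))               # [0.0, 0.5, 1.0]
print(one_hot("medium", ["low", "medium", "high"]))  # [0.0, 1.0, 0.0]
```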

30. Learning Algorithm: Backpropagation
- Iteratively process a set of training tuples and compare the network's prediction with the actual known target value
- For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value
- Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer, hence "backpropagation"
- Steps (see the sketch below):
  1. Initialize the weights to small random numbers, along with the associated biases
  2. Propagate the inputs forward (by applying the activation function)
  3. Backpropagate the error (by updating the weights and biases)
  4. Check the terminating condition (e.g., when the error is very small)
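A compact Python sketch of these steps for a single-hidden-layer network with sigmoid units. The 2-2-1 topology, learning rate, epoch count, and toy data are illustrative assumptions, not values from the slides:

```python
import math, random

random.seed(0)
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# Step 1: initialize weights and biases to small random numbers (assumed 2-2-1 topology).
n_in, n_hid = 2, 2
w_hid = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
b_hid = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]
b_out = random.uniform(-0.5, 0.5)

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # toy training data (logical OR)
lr = 0.5  # learning rate

for epoch in range(5000):
    for x, target in data:
        # Step 2: propagate the inputs forward through hidden and output units.
        h = [sigmoid(sum(w * xi for w, xi in zip(w_hid[j], x)) + b_hid[j]) for j in range(n_hid)]
        y = sigmoid(sum(w * hj for w, hj in zip(w_out, h)) + b_out)

        # Step 3: backpropagate the error (squared-error gradient for sigmoid units).
        err_out = y * (1 - y) * (target - y)
        err_hid = [h[j] * (1 - h[j]) * err_out * w_out[j] for j in range(n_hid)]

        for j in range(n_hid):
            w_out[j] += lr * err_out * h[j]
            for i in range(n_in):
                w_hid[j][i] += lr * err_hid[j] * x[i]
            b_hid[j] += lr * err_hid[j]
        b_out += lr * err_out

# Step 4: in practice, stop when the error falls below a threshold rather than after a fixed epoch count.
```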

31. Learning Algorithm: Backpropagation
- Initialize the weights (w0, w1, ..., wk)
- Adjust the weights in such a way that the output of the ANN is consistent with the class labels of the training examples
- Objective function: the sum of squared errors (stated below)
- Find the weights wi that minimize this objective function, e.g., with the backpropagation algorithm
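The objective function named on the slide was not captured in the transcript; the standard sum-of-squared-errors form used in this setting, with Y_i the class label of training example X_i and f(w, X_i) the network output, is:

```latex
\[
E(w) = \sum_{i} \bigl[\, Y_i - f(w, X_i) \,\bigr]^2
\]
```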

32. Efficiency and Interpretability
- Efficiency of backpropagation: each epoch (one iteration through the training set) takes O(|D| x w) time, with |D| tuples and w weights, but in the worst case the number of epochs can be exponential in n, the number of inputs
- For easier comprehension: rule extraction by network pruning
  - Simplify the network structure by removing the weighted links that have the least effect on the trained network
  - Then perform link, unit, or activation value clustering
  - The sets of input and activation values are studied to derive rules describing the relationship between the input and hidden unit layers
- Sensitivity analysis: assess the impact that a given input variable has on a network output; the knowledge gained from this analysis can be represented in rules

33. Neural Network as a Classifier
Weaknesses:
- Long training time
- Requires a number of parameters that are typically best determined empirically, e.g., the network topology or "structure"
- Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network
Strengths:
- High tolerance to noisy data
- Ability to classify untrained patterns
- Well-suited for continuous-valued inputs and outputs
- Successful on an array of real-world data, e.g., hand-written letters
- Algorithms are inherently parallel
- Techniques have been developed for extracting rules from trained neural networks

34. Summary
- Evaluate Bayes classifiers
- Evaluate naive Bayes classifiers
- Evaluate Bayesian belief networks
- Artificial Neural Networks (ANN)

35. References
- Han, J., Kamber, M., & Pei, J. (2006). Data Mining: Concepts and Techniques. 3rd edition. Morgan Kaufmann, San Francisco.
- Tan, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Addison-Wesley, Michigan.
- Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd edition. Morgan Kaufmann, San Francisco.
