
Supervised Learning: Regression, Classification




Presentation Transcript

1. Supervised Learning: Regression, Classification
Linear regression, k-NN classification
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
August 11, 2014

2. An Example: Size of Engine vs Power
[Scatter plot: Engine displacement (cc) vs Power (bhp)]
An unknown car has an engine of size 1800 cc. What is likely to be the power of the engine?

3. An Example: Size of Engine vs Power
[Scatter plot: Engine displacement (cc) vs Power (bhp), the target variable]
Intuitively, the two variables have a relation
Learn the relation from the given data
Predict the target variable after learning

4. Exercise: on a simpler set of data points
Predict y for x = 2.5

x     y
1     1
2     3
3     7
4     10
2.5   ?

5. Linear Regression
[Scatter plot of the training set: Engine displacement (cc) vs Power (bhp)]
Assume: the relation is linear
Then for a given x (= 1800), predict the value of y

6. Linear Regression
Linear regression: assume y = a·x + b
Try to find suitable a and b
Optional exercise, on the training data below:

Engine (cc)   Power (bhp)
800           60
1000          90
1200          80
1200          100
1200          75
1400          90
1500          120
1800          160
2000          140
2000          170
2400          180
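As a companion to the optional exercise, here is a minimal Python sketch (not part of the slides) that fits a and b to the table above. It uses the standard closed-form least-squares solution rather than the gradient descent described later; the variable names are my own.

```python
# Engine displacement (cc) and power (bhp) from the table above
xs = [800, 1000, 1200, 1200, 1200, 1400, 1500, 1800, 2000, 2000, 2400]
ys = [60, 90, 80, 100, 75, 90, 120, 160, 140, 170, 180]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least squares: a = cov(x, y) / var(x),
# and b chosen so the line passes through the point of means
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

print(f"a = {a:.4f}, b = {b:.2f}")
print(f"predicted power for 1800 cc: {a * 1800 + b:.1f} bhp")
```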

7. Exercise: using Linear Regression
Define a regression line of your choice
Predict y for x = 2.5

x     y
1     1
2     3
3     7
4     10
2.5   ?

8. Choosing the parameters right
The data points: (x1, y1), (x2, y2), …, (xm, ym)
The regression line: f(x) = y = a·x + b
Least-square cost function: J = Σi (f(xi) – yi)²
Goal: minimize J over choices of a and b, i.e., minimize the deviation of the line from the actual data points
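The cost function translates directly into code. A minimal sketch (not part of the slides), evaluated on the exercise data from slide 7, where a lower J means a better-fitting line:

```python
# Least-square cost J(a, b) = sum_i (f(x_i) - y_i)^2 for f(x) = a*x + b
def cost(a, b, points):
    return sum((a * x + b - y) ** 2 for x, y in points)

points = [(1, 1), (2, 3), (3, 7), (4, 10)]  # exercise data from slide 7
print(cost(3, -2, points))  # J = 1
print(cost(2, 0, points))   # J = 7, so y = 3x - 2 fits better
```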

9. How to Minimize the Cost Function?
Goal: minimize J over all choices of a and b
Start from some a = a0 and b = b0
Compute J(a0, b0)
Simultaneously change a and b in the direction of the negative gradient, and hope to eventually arrive at an optimum
Question: Can there be more than one optimum?
[Figure: the surface of J plotted over the (a, b) plane]
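The procedure on this slide is gradient descent. A minimal Python sketch of the idea (not part of the slides); the starting point a0 = b0 = 0, the learning rate, and the number of steps are arbitrary choices:

```python
def gradient_descent(points, lr=0.01, steps=5000):
    a, b = 0.0, 0.0  # start from some a = a0 and b = b0
    for _ in range(steps):
        # Partial derivatives of J = sum_i (a*x_i + b - y_i)^2
        grad_a = sum(2 * (a * x + b - y) * x for x, y in points)
        grad_b = sum(2 * (a * x + b - y) for x, y in points)
        # Change a and b simultaneously towards the negative gradient
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

points = [(1, 1), (2, 3), (3, 7), (4, 10)]
print(gradient_descent(points))  # approaches the least-squares optimum
```

For this cost function J is convex, so there is a single global optimum; with more complex models the same procedure can get stuck in a local one.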

10. Another example
Given that a person's age is 24, predict if (s)he has high blood sugar
Discrete values of the target variable (Y / N)
Many ways of approaching this problem
[Figure: training set plotted along Age, with labels High blood sugar: Y / N]

11. Classification problem
One approach: what other data points are nearest to the new point?
Other approaches?
[Figure: the new point at Age = 24, label unknown]

12. Classification Algorithms
The k-nearest neighbor classification
Naïve Bayes classification
Decision Tree
Linear Discriminant Analysis
Logistic Regression
Support Vector Machine

13. Classification or Regression?
Given data about some cars: engine size, number of seats, petrol / diesel, has airbag or not, price
Problem 1: Given the engine size of a new car, what is likely to be the price?
Problem 2: Given the engine size of a new car, is it likely that the car runs on petrol?
Problem 3: Given the engine size, is it likely that the car has airbags?

14. Classification

15. Example: Age, Income and Owning a flat
[Scatter plot of the training set: Age vs Monthly income (thousand rupees); labels: owns a flat / does not own a flat]
Given a new person's age and income, predict – does (s)he own a flat?

16. Example: Age, Income and Owning a flat
Nearest neighbor approach: find the nearest neighbors among the known data points and check their labels
[Scatter plot of the training set, as before]

17. Example: Age, Income and Owning a flat
The 1-Nearest Neighbor (1-NN) Algorithm:
Find the closest point in the training set
Output the label of the nearest neighbor
[Scatter plot of the training set, as before]

18. The k-Nearest Neighbor Algorithm
The k-Nearest Neighbor (k-NN) Algorithm:
Find the k closest points in the training set
Take a majority vote among the labels of those k points
[Scatter plot of the training set, as before]
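A minimal Python sketch of the k-NN algorithm (not part of the slides), using Euclidean distance; the toy (age, income) training points are invented for illustration:

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_classify(training_set, query, k):
    """training_set: list of ((age, income), label) pairs."""
    # Find the k closest points in the training set
    neighbors = sorted(training_set, key=lambda p: dist(p[0], query))[:k]
    # Majority vote among the labels of those k points
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training_set = [((25, 20), "no flat"), ((30, 60), "owns flat"),
                ((45, 80), "owns flat"), ((22, 15), "no flat"),
                ((50, 90), "owns flat"), ((28, 25), "no flat")]
print(knn_classify(training_set, (35, 70), k=3))  # "owns flat"
```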

19. Distance measures
How to measure distance to find the closest points?
Euclidean distance between vectors x = (x1, …, xk) and y = (y1, …, yk):
  d(x, y) = √( Σi (xi – yi)² )
Manhattan distance:
  d(x, y) = Σi |xi – yi|
Generalized squared interpoint distance, where S is the covariance matrix:
  d²(x, y) = (x – y)ᵀ S⁻¹ (x – y)
This is the Mahalanobis distance (1936)
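A minimal sketch of the three distance measures in Python with numpy (not part of the slides); the vectors and the covariance matrix S here are invented for illustration:

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def mahalanobis(x, y, S):
    # Generalized squared interpoint distance, then square root
    d = x - y
    return np.sqrt(d @ np.linalg.inv(S) @ d)

x, y = np.array([25.0, 20.0]), np.array([30.0, 60.0])
S = np.array([[100.0, 40.0], [40.0, 900.0]])  # assumed covariance matrix
print(euclidean(x, y), manhattan(x, y), mahalanobis(x, y, S))
```

In practice S is estimated from the training data, so the Mahalanobis distance discounts directions in which the data naturally varies a lot.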

20. Classification setup
Training data / set: set of input data points and given answers for the data points
Labels: the list of possible answers
Test data / set: inputs to the classification algorithm for finding labels; used for evaluating the algorithm in case the answers are known (but not passed to the algorithm)
Classification task: determining labels of the data points for which the label is not known or not passed to the algorithm
Features: attributes that represent the data

21. Evaluation
Test set accuracy: the correct performance measure
Accuracy = # of correct answers / # of all answers
Need to know the true test labels
Option: use the training set itself
Parameter selection (e.g., k for k-NN) by accuracy on the training set
Overfitting: a classifier performs too well on the training set compared to new (unlabeled) test data
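The accuracy formula as a minimal Python sketch (not part of the slides):

```python
def accuracy(predicted, true_labels):
    # correct answers / all answers
    correct = sum(p == t for p, t in zip(predicted, true_labels))
    return correct / len(true_labels)

print(accuracy(["Y", "N", "Y", "Y"], ["Y", "N", "N", "Y"]))  # 0.75
```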

22. Better validation methods
Leave-one-out:
For each training data point x of training set D
  Construct training set D – x, test set {x}
  Train on D – x, test on x
Overall accuracy = average over all such cases
Expensive to compute
Hold-out set:
Randomly choose x% (say 25–30%) of the training data and set it aside as the test set
Train on the rest of the training data, test on the held-out set
Easy to compute, but tends to have higher variance
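A minimal Python sketch of both methods (not part of the slides); classify(training_set, point) stands for any classifier, e.g. the knn_classify sketch shown earlier with a fixed k:

```python
import random

def leave_one_out_accuracy(D, classify):
    correct = 0
    for i, (point, label) in enumerate(D):
        D_minus_x = D[:i] + D[i + 1:]            # training set D - x
        if classify(D_minus_x, point) == label:  # test on {x}
            correct += 1
    return correct / len(D)                      # average over all cases

def hold_out_accuracy(D, classify, fraction=0.25):
    D = D[:]                  # copy, so the caller's data stays unshuffled
    random.shuffle(D)         # randomly choose the held-out portion
    n_test = max(1, int(len(D) * fraction))
    test, train = D[:n_test], D[n_test:]
    correct = sum(classify(train, point) == label for point, label in test)
    return correct / n_test
```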

23. The k-fold Cross Validation Method
Randomly divide the training data into k partitions D1, …, Dk (a possibly equal division)
For each fold Di:
  Train a classifier with training data = D – Di
  Test and validate with Di
Overall accuracy: average accuracy over all cases
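A minimal Python sketch of k-fold cross validation (not part of the slides), with classify(training_set, point) as before:

```python
import random

def k_fold_accuracy(D, classify, k=5):
    D = D[:]
    random.shuffle(D)                    # random division into folds
    folds = [D[i::k] for i in range(k)]  # k roughly equal partitions D1..Dk
    accuracies = []
    for i, Di in enumerate(folds):
        # Training data = D - Di
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        correct = sum(classify(train, point) == label for point, label in Di)
        accuracies.append(correct / len(Di))
    return sum(accuracies) / k           # average accuracy over all folds
```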

24. References
Lecture videos by Prof. Andrew Ng, Stanford University, available on Coursera (Course: Machine Learning)
Data Mining Map: http://www.saedsayad.com/