1. MACHINE LEARNING
Tomer Ben Moshe, May 2017
2. GOALS
Define the problem.
Prove some basic bounds.
3. MACHINE LEARNING
What is machine learning?
4. CORE PROBLEM
S = the labeled training examples, each labeled + or -.
Main goal: a classification rule.
5. DEFINITIONS
Distribution D over X.
S is drawn independently at random from D.
Main goal: predict well on new points that are also drawn from D.
6. DEFINITIONS
c* - the target concept.
h - a hypothesis.
Main goal: produce a hypothesis h as close as possible to c* (with respect to D).
7. DEFINITIONS
err_D(h) = Pr_{x ~ D}[h(x) ≠ c*(x)] - the true error of h.
Main goal: produce h with true error as low as possible.
err_S(h) = the fraction of S on which h and c* disagree - the training error of h.
Overfitting: low training error but high true error.
8. FORMALIZING THE PROBLEM
H - a hypothesis class over X.
Main goal: given H and S, find the hypothesis in H that most closely agrees with c* over D.
Assume H is finite.
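To make the finite-H setting concrete, here is a toy brute-force sketch (my own illustration, not from the slides): encode each hypothesis over X = {0,1}^3 as a lookup table, enumerate the class H of all 256 boolean functions, and keep those consistent with a sample S.

```python
from itertools import product

def training_error(h, S):
    """Fraction of labeled examples in S that h misclassifies."""
    return sum(h[x] != y for x, y in S) / len(S)

d = 3
X = list(product([0, 1], repeat=d))  # all 2^3 = 8 inputs

# H = every boolean function on {0,1}^3, encoded as a dict x -> label.
# |H| = 2^(2^d) = 256 hypotheses.
H = [dict(zip(X, labels)) for labels in product([0, 1], repeat=len(X))]

# A small sample S of (x, c*(x)) pairs (illustrative labels).
S = [((0, 0, 0), 0), ((1, 0, 0), 1), ((1, 1, 0), 1)]

# Every hypothesis with zero training error on S.
consistent = [h for h in H if training_error(h, S) == 0]
# 3 of the 8 inputs are pinned down by S; the other 5 are free,
# so 2^5 = 32 hypotheses survive.
```

With only 3 examples, 32 very different hypotheses all look perfect on S, which is exactly why a sample-size guarantee is needed.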
9.
Emails (X)   Target concept c*   Hypothesis 1   Hypothesis 2
(0,0,0)      0                   1              0
(0,0,1)      0                   1              0
(0,1,0)      0                   0              1
(0,1,1)      0                   1              0
(1,0,0)      1                   0              1
(1,0,1)      1                   0              1
(1,1,0)      1                   1              0
(1,1,1)      1                   0              1
19. OVERFITTING
Good training error but bad true error.
How large must the sample be to guarantee good true error?
20. THEOREM 5.1
Let H be a hypothesis class and let ε, δ > 0.
If S of size |S| ≥ (1/ε)(ln|H| + ln(1/δ)) is drawn from distribution D,
then with probability ≥ 1 − δ, every h ∈ H with err_D(h) ≥ ε has err_S(h) > 0.
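As a quick numeric sketch of the bound (hypothetical numbers, not from the slides), note that the required sample size grows only logarithmically in |H|:

```python
import math

def pac_sample_size(h_size, eps, delta):
    """Theorem 5.1 sample size: |S| >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# e.g. |H| = 2^20 hypotheses, eps = 0.1, delta = 0.05
n20 = pac_sample_size(2**20, 0.1, 0.05)  # 169 examples suffice
# Squaring the class size (|H| = 2^40) roughly doubles only the ln|H| term:
n40 = pac_sample_size(2**40, 0.1, 0.05)  # 308 examples
```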
21.
Emails (X)   Target concept c*   Hypothesis
(0,0,0)      0                   1
(0,0,1)      0                   1
(0,1,0)      0                   0
(0,1,1)      0                   1
(1,0,0)      1                   0
(1,0,1)      1                   0
(1,1,0)      1                   1
(1,1,1)      1                   0
24. OVERFITTING
PAC-learning guarantee: it guarantees a hypothesis that is Probably Approximately Correct.
The theorem only addresses hypotheses with err_S(h) = 0. What about hypotheses with err_S(h) > 0?
25. OVERFITTING
If S is sufficiently large, then with high probability, good performance on S will translate to good performance on D.
26. THEOREM 5.3
Let H be a hypothesis class and let ε, δ > 0.
If S of size |S| ≥ (1/(2ε²))(ln|H| + ln(2/δ)) is drawn from distribution D,
then with probability ≥ 1 − δ, every h ∈ H satisfies |err_S(h) − err_D(h)| ≤ ε.
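Hypothetical numbers again: because this theorem controls |err_S − err_D| for every h, not just the consistent ones, the sample size scales with 1/ε² instead of 1/ε.

```python
import math

def uniform_convergence_sample_size(h_size, eps, delta):
    """Theorem 5.3 sample size: |S| >= (1/(2 eps^2)) * (ln|H| + ln(2/delta))."""
    return math.ceil((math.log(h_size) + math.log(2.0 / delta)) / (2 * eps**2))

# Same |H| = 2^20, eps = 0.1, delta = 0.05 as in the Theorem 5.1 example
n = uniform_convergence_sample_size(2**20, 0.1, 0.05)  # 878 examples (vs. 169)
```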
27. HOEFFDING BOUND
Let x1, ..., xn be independent {0,1}-valued random variables with probability p that xi = 1.
Let s = x1 + ... + xn.
For any 0 ≤ α ≤ 1:
Prob[s/n > p + α] ≤ e^(−2nα²) and Prob[s/n < p − α] ≤ e^(−2nα²).
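A small Monte Carlo sketch (my own check, with illustrative parameters) comparing the empirical frequency of a large deviation against the e^(−2nα²) bound:

```python
import math
import random

def hoeffding_violation_rate(n, p, alpha, trials=2000, seed=0):
    """Empirical frequency of s/n > p + alpha over repeated samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() < p for _ in range(n))
        if s / n > p + alpha:
            hits += 1
    return hits / trials

rate = hoeffding_violation_rate(n=200, p=0.5, alpha=0.1)
bound = math.exp(-2 * 200 * 0.1**2)  # e^(-4), about 0.018
# The empirical rate stays below the Hoeffding bound.
```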
28. LEARNING DISJUNCTIONS
X = {0,1}^d; let H be the class of all possible disjunctions over the d features.
But how can we efficiently build a consistent disjunction when one exists?
29. SIMPLE DISJUNCTION LEARNER
Given sample S, discard all features that are set to 1 in any negative example in S. Output the concept h that is the OR of all features that remain.
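The learner above is simple enough to sketch directly (assumed encoding, not from the slides: each example is a tuple of {0,1} features with a 0/1 label):

```python
def learn_disjunction(samples):
    """Simple disjunction learner: drop every feature that is 1 in some
    negative example; output the OR of the features that remain."""
    d = len(samples[0][0])
    kept = set(range(d))
    for x, label in samples:
        if label == 0:
            kept -= {i for i in range(d) if x[i] == 1}

    def h(x):
        return int(any(x[i] == 1 for i in kept))

    return kept, h

# Illustrative sample labeled by the target c* = x0 OR x2 on d = 3 features
S = [((0, 0, 0), 0), ((0, 1, 0), 0), ((1, 0, 0), 1), ((0, 0, 1), 1), ((1, 1, 0), 1)]
kept, h = learn_disjunction(S)  # kept == {0, 2}; h is consistent with S
```

The negative example (0,1,0) eliminates feature 1, leaving exactly the features of the target disjunction.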
30. SIMPLE DISJUNCTION LEARNER (LEMMA 5.4)
The simple disjunction learner produces a disjunction h that is consistent with the sample S (err_S(h) = 0) whenever the target concept is indeed a disjunction.
31. SIMPLE DISJUNCTION LEARNER
This algorithm therefore efficiently PAC-learns the class of disjunctions.
32. OCCAM'S RAZOR
One should prefer simpler explanations over more complicated ones.
But what does "simple" mean? Fewer bits.
33. OCCAM'S RAZOR
What is the simplest method? Put everything in h, or nothing in h. Just one bit, but bad performance on S.
What is the most complicated method? Specify each input in h individually. No training error, but bad true error.
34. OCCAM'S RAZOR
Let H be the set of all hypotheses h that can be described using at most b bits (so |H| ≤ 2^b).
Plug that into Theorem 5.1.
35. REMINDER - THEOREM 5.1
Let H be a hypothesis class and let ε, δ > 0.
If S of size |S| ≥ (1/ε)(ln|H| + ln(1/δ)) is drawn from distribution D,
then with probability ≥ 1 − δ, every h ∈ H with err_D(h) ≥ ε has err_S(h) > 0.
36. OCCAM'S RAZOR (THEOREM 5.5)
With probability at least 1 − δ, any rule h consistent with S that can be described in this language using fewer than b bits will have err_D(h) ≤ ε, for |S| = (1/ε)(b ln 2 + ln(1/δ)).
In other words, with probability at least 1 − δ, all rules consistent with S that can be described in fewer than b bits will have err_D(h) ≤ (b ln 2 + ln(1/δ)) / |S|.
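A numeric sketch of the second form of the bound (hypothetical b, |S|, and δ, chosen for illustration):

```python
import math

def occam_error_bound(b, n, delta):
    """Theorem 5.5: any rule consistent with S that is describable in
    fewer than b bits has err_D(h) <= (b*ln2 + ln(1/delta)) / |S|,
    with probability >= 1 - delta."""
    return (b * math.log(2) + math.log(1.0 / delta)) / n

# e.g. rules describable in under 100 bits, |S| = 10,000, delta = 0.05
eps = occam_error_bound(100, 10_000, 0.05)  # about 0.72% true error
```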
37. OCCAM'S RAZOR
The theorem applies no matter which description language is used.
A good approach for picking good, simple rules.
38. DECISION TREE
(figure: an example decision tree over binary features, with leaves labeled + and −)
39. LEARNING DECISION TREES
Finding the smallest decision tree is NP-hard, but there are heuristic methods.
Suppose we run such a method and obtain a tree with k nodes; O(k log d) bits suffice to describe such a tree.
By Theorem 5.5 - good true error if we can find a consistent tree with O(ε|S| / log d) nodes.
40. REGULARIZATION
Suppose we have a very simple rule with training error of 20%, or a complicated rule with training error of 10%. Which should we prefer?
We need something called "regularization", also called "complexity penalization".
In general, we need to penalize complexity.
41. REGULARIZATION
Let H_i denote the hypotheses that can be described in i bits.
Let δ_i = δ / 2^i.
42. REMINDER - THEOREM 5.3
Let H be a hypothesis class and let ε, δ > 0.
If S of size |S| ≥ (1/(2ε²))(ln|H| + ln(2/δ)) is drawn from distribution D,
then with probability ≥ 1 − δ, every h ∈ H satisfies |err_S(h) − err_D(h)| ≤ ε.
43. REGULARIZATION
Remember that Σ_{i≥1} 1/2^i = 1, so the δ_i sum to at most δ.
Combining this with Theorem 5.3 and the union bound gives the following corollary:
44. REGULARIZATION (COROLLARY 5.6)
With probability ≥ 1 − δ, all hypotheses h satisfy
err_D(h) ≤ err_S(h) + sqrt((size(h) ln 4 + ln(2/δ)) / (2|S|)),
where size(h) is the number of bits needed to describe h.
That gives us a good tradeoff between complexity and training error.
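The corollary turns the informal 20%-vs-10% question from slide 40 into a computation. A sketch with illustrative numbers (10 bits vs. 1000 bits, |S| = 5000, all hypothetical):

```python
import math

def regularized_bound(train_err, size_bits, n, delta):
    """Corollary 5.6: err_D(h) <= err_S(h)
    + sqrt((size(h)*ln4 + ln(2/delta)) / (2|S|))."""
    return train_err + math.sqrt(
        (size_bits * math.log(4) + math.log(2.0 / delta)) / (2 * n)
    )

# Simple rule: 20% training error, 10 bits.
# Complicated rule: 10% training error, 1000 bits.
simple = regularized_bound(0.20, 10, 5000, 0.05)        # ~0.242
complicated = regularized_bound(0.10, 1000, 5000, 0.05) # ~0.473
# Here the complexity penalty makes the simple rule the better bet.
```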
45. THE END
Any questions?