Practical Machine Learning Tools and Techniques

Presentation Transcript

1. Practical Machine Learning Tools and Techniques
Slides for Chapter 2, Input: concepts, instances, attributes

2. Input: concepts, instances, attributes
Components of the input for learning
What's a concept?
  Classification, association, clustering, numeric prediction
What's in an example?
  Relations, flat files, recursion
What's in an attribute?
  Nominal, ordinal, interval, ratio
Preparing the input
  ARFF, sparse data, attributes, missing and inaccurate values, unbalanced data, getting to know your data

3. Components of the input
Concepts: kinds of things that can be learned
  Aim: intelligible and operational concept description
Instances: the individual, independent examples of a concept to be learned
  More complicated forms of input with dependencies between examples are possible
Attributes: measuring aspects of an instance
  We will focus on nominal and numeric ones

4. What's a concept?
Concept: thing to be learned
Concept description: output of learning scheme
Styles of learning:
  Classification learning: predicting a discrete class
  Association learning: detecting associations between features
  Clustering: grouping similar instances into clusters
  Numeric prediction: predicting a numeric quantity

5. Classification learning
Example problems: weather data, contact lenses, irises, labor negotiations
Classification learning is supervised
  Scheme is provided with actual outcome
  Outcome is called the class of the example
Measure success on fresh data for which class labels are known (test data)
In practice success is often measured subjectively
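A minimal sketch of this workflow, assuming scikit-learn and its built-in iris data rather than the datasets named on the slide: the scheme is trained on labelled examples and success is measured on held-out test data whose class labels are known.

```python
# Supervised classification: train on labelled examples, then measure
# success on fresh test data (illustrative sketch, scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # attributes and class labels

# Hold out fresh data with known class labels for measuring success
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("accuracy on test data:", accuracy_score(y_test, model.predict(X_test)))
```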

6. Association learning
Can be applied if no class is specified and any kind of structure is considered “interesting”
Difference to classification learning:
  Can predict any attribute's value, not just the class, and more than one attribute's value at a time
  Hence: far more association rules than classification rules
  Thus: constraints are necessary, such as minimum coverage and minimum accuracy
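The coverage and accuracy constraints can be checked directly. The sketch below is illustrative only: a hand-coded sample of weather instances (an assumption, not the full dataset) and one candidate rule, keeping the rule only if both measures exceed chosen minimums.

```python
# Checking a candidate association rule against minimum coverage and
# minimum accuracy on a few hand-coded weather instances (illustrative).
weather = [
    # (outlook, temperature, humidity, windy, play)
    ("sunny",    "hot",  "high", False, "no"),
    ("sunny",    "hot",  "high", True,  "no"),
    ("overcast", "hot",  "high", False, "yes"),
    ("rainy",    "mild", "high", False, "yes"),
]

# Candidate rule: IF humidity = high AND windy = false THEN play = yes
matches = [r for r in weather if r[2] == "high" and not r[3]]
correct = [r for r in matches if r[4] == "yes"]

coverage = len(matches)                  # instances the rule applies to
accuracy = len(correct) / len(matches)   # fraction it predicts correctly

print(coverage, accuracy)                # keep the rule only if both exceed
                                         # the chosen minimum thresholds
```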

7. Clustering
Finding groups of items that are similar
Clustering is unsupervised
  The class of an example is not known
Success often measured subjectively

       Sepal length  Sepal width  Petal length  Petal width  Type
  1        5.1           3.5          1.4           0.2      Iris setosa
  2        4.9           3.0          1.4           0.2      Iris setosa
  ...
  51       7.0           3.2          4.7           1.4      Iris versicolor
  52       6.4           3.2          4.5           1.5      Iris versicolor
  ...
  101      6.3           3.3          6.0           2.5      Iris virginica
  102      5.8           2.7          5.1           1.9      Iris virginica
  ...
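A minimal clustering sketch on the same iris data, assuming scikit-learn: the learner sees only the measurements, not the Type column, and simply groups similar instances.

```python
# Unsupervised clustering: the class column is withheld; the algorithm
# only groups similar instances (illustrative sketch, scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)          # sepal/petal measurements only
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])                         # cluster assignment per instance
```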

8. Numeric prediction
Variant of classification learning where “class” is numeric (also called “regression”)
Learning is supervised
  Scheme is being provided with target value
Measure success on test data

  Outlook    Temperature  Humidity  Windy  Play-time
  Sunny      Hot          High      False  5
  Sunny      Hot          High      True   0
  Overcast   Hot          High      False  55
  Rainy      Mild         Normal    False  40
  ...
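A sketch of numeric prediction on the rows above, assuming scikit-learn and a hand-chosen integer coding of the nominal attributes (the coding is an assumption made only so the example runs):

```python
# Numeric prediction ("regression"): the target is the numeric play-time
# (illustrative sketch; attribute coding below is assumed, not from the slides).
from sklearn.tree import DecisionTreeRegressor

# outlook (0=sunny, 1=overcast, 2=rainy), temperature (0=hot, 1=mild, 2=cool),
# humidity (0=high, 1=normal), windy (0=false, 1=true)
X = [[0, 0, 0, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 0],
     [2, 1, 1, 0]]
y = [5, 0, 55, 40]                         # numeric "class": play-time

model = DecisionTreeRegressor().fit(X, y)
print(model.predict([[1, 1, 1, 0]]))       # predicted play-time for a new day
```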

9. What's in an example?
Instance: specific type of example
  Thing to be classified, associated, or clustered
  Individual, independent example of target concept
  Characterized by a predetermined set of attributes
Input to learning scheme: set of instances / dataset
  Represented as a single relation / flat file
Rather restricted form of input
  No relationships between objects
Most common form in practical data mining

10. A family tree
[Figure: family tree. Peter (M) = Peggy (F), with children Steven (M), Graham (M), and Pam (F); Grace (F) = Ray (M), with children Ian (M), Pippa (F), and Brian (M); Pam (F) = Ian (M), with children Anna (F) and Nikki (F)]

11. Family tree represented as a table

  Name    Gender  Parent1  Parent2
  Peter   Male    ?        ?
  Peggy   Female  ?        ?
  Steven  Male    Peter    Peggy
  Graham  Male    Peter    Peggy
  Pam     Female  Peter    Peggy
  Ian     Male    Grace    Ray
  Pippa   Female  Grace    Ray
  Brian   Male    Grace    Ray
  Anna    Female  Pam      Ian
  Nikki   Female  Pam      Ian

12. The “sister-of” relation

  First person  Second person  Sister of?
  Peter         Peggy          No
  Peter         Steven         No
  ...
  Steven        Peter          No
  Steven        Graham         No
  Steven        Pam            Yes
  ...
  Ian           Pippa          Yes
  ...
  Anna          Nikki          Yes
  ...
  Nikki         Anna           Yes

Closed-world assumption:

  First person  Second person  Sister of?
  Steven        Pam            Yes
  Graham        Pam            Yes
  Ian           Pippa          Yes
  Brian         Pippa          Yes
  Anna          Nikki          Yes
  Nikki         Anna           Yes
  All the rest                 No
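As a sketch (not part of the slides), the positive examples under the closed-world assumption can be derived mechanically from the family-tree table:

```python
# Deriving the "sister-of" relation from the family-tree table
# under the closed-world assumption (illustrative sketch).
people = {
    # name: (gender, parent1, parent2)
    "Peter": ("Male", None, None),        "Peggy": ("Female", None, None),
    "Steven": ("Male", "Peter", "Peggy"), "Graham": ("Male", "Peter", "Peggy"),
    "Pam": ("Female", "Peter", "Peggy"),  "Ian": ("Male", "Grace", "Ray"),
    "Pippa": ("Female", "Grace", "Ray"),  "Brian": ("Male", "Grace", "Ray"),
    "Anna": ("Female", "Pam", "Ian"),     "Nikki": ("Female", "Pam", "Ian"),
}

def sister_of(first, second):
    """second is a sister of first: female, same known parents, different person."""
    g2, p2a, p2b = people[second]
    _, p1a, p1b = people[first]
    return (first != second and g2 == "Female"
            and p1a is not None and (p1a, p1b) == (p2a, p2b))

positives = [(a, b) for a in people for b in people if sister_of(a, b)]
print(positives)   # ('Steven', 'Pam'), ('Graham', 'Pam'), ('Ian', 'Pippa'), ...
```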

13. A full representation in one table

  First person                      Second person                     Sister of?
  Name    Gender  Parent1  Parent2  Name   Gender  Parent1  Parent2
  Steven  Male    Peter    Peggy    Pam    Female  Peter    Peggy     Yes
  Graham  Male    Peter    Peggy    Pam    Female  Peter    Peggy     Yes
  Ian     Male    Grace    Ray      Pippa  Female  Grace    Ray       Yes
  Brian   Male    Grace    Ray      Pippa  Female  Grace    Ray       Yes
  Anna    Female  Pam      Ian      Nikki  Female  Pam      Ian       Yes
  Nikki   Female  Pam      Ian      Anna   Female  Pam      Ian       Yes
  All the rest                                                        No

If second person's gender = female
and first person's parent = second person's parent
then sister-of = yes

14. Generating a flat file
Process of flattening called “denormalization”
  Several relations are joined together to make one
  Possible with any finite set of finite relations
Problematic: relationships without a pre-specified number of objects
  Example: concept of nuclear-family
Note that denormalization may produce spurious regularities that reflect the structure of the database
  Example: “supplier” predicts “supplier address”
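A small denormalization sketch with pandas (the two relations and their column names are assumptions made up for illustration, not taken from the slides):

```python
# Denormalization: join two relations into one flat file (illustrative sketch).
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "supplier": ["Acme", "Acme", "Globex"],      # hypothetical example data
    "amount":   [120, 80, 200],
})
suppliers = pd.DataFrame({
    "supplier": ["Acme", "Globex"],
    "supplier_address": ["1 Main St", "9 Side Rd"],
})

flat = orders.merge(suppliers, on="supplier")    # one flattened row per order
print(flat)
# In the flat file, "supplier" now trivially predicts "supplier_address":
# a spurious regularity that merely reflects the database structure.
```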

15. The “ancestor-of” relation

  First person                     Second person                     Ancestor of?
  Name   Gender  Parent1  Parent2  Name    Gender  Parent1  Parent2
  Peter  Male    ?        ?        Steven  Male    Peter    Peggy    Yes
  Peter  Male    ?        ?        Pam     Female  Peter    Peggy    Yes
  Peter  Male    ?        ?        Anna    Female  Pam      Ian      Yes
  Peter  Male    ?        ?        Nikki   Female  Pam      Ian      Yes
  Pam    Female  Peter    Peggy    Nikki   Female  Pam      Ian      Yes
  Grace  Female  ?        ?        Ian     Male    Grace    Ray      Yes
  Other positive examples here                                       Yes
  All the rest                                                       No

16. Recursion
Infinite relations require recursion:
  If person1 is a parent of person2
  then person1 is an ancestor of person2
  If person1 is a parent of person2
  and person2 is an ancestor of person3
  then person1 is an ancestor of person3
Appropriate techniques are known as “inductive logic programming” (ILP) methods
  Example ILP method: Quinlan's FOIL rule learner
  Problems: (a) noise and (b) computational complexity
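The recursive definition can be written down directly. A sketch (not from the slides), using the parent relation from the family tree:

```python
# Recursive ancestor-of definition over a parent relation (illustrative sketch).
parents = {                     # child -> set of parents
    "Steven": {"Peter", "Peggy"}, "Graham": {"Peter", "Peggy"},
    "Pam": {"Peter", "Peggy"},    "Ian": {"Grace", "Ray"},
    "Pippa": {"Grace", "Ray"},    "Brian": {"Grace", "Ray"},
    "Anna": {"Pam", "Ian"},       "Nikki": {"Pam", "Ian"},
}

def is_ancestor(a, b):
    """a is an ancestor of b if a is a parent of b, or an ancestor of a parent of b."""
    direct = parents.get(b, set())
    return a in direct or any(is_ancestor(a, p) for p in direct)

print(is_ancestor("Peter", "Nikki"))   # True: Peter -> Pam -> Nikki
print(is_ancestor("Grace", "Steven"))  # False
```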

17. Multi-instance concepts
Each individual example comprises a bag (aka multi-set) of instances
All instances are described by the same attributes
One or more instances within an example may be responsible for the example's classification
Goal of learning is still to produce a concept description
Important real-world applications
  Prominent examples are drug activity prediction and image classification
  A drug can be viewed as a bag of different geometric arrangements of the drug molecule
  An image can be represented as a bag of image components

18. What's in an attribute?
Each instance is described by a fixed predefined set of features, its “attributes”
But: number of attributes may vary in practice
  Possible solution: “irrelevant value” flag
Related problem: existence of an attribute may depend on the value of another one
Possible attribute types (“levels of measurement”):
  Nominal, ordinal, interval, and ratio

19. Nominal levels of measurement
Values are distinct symbols
  Values themselves serve only as labels or names
  Nominal comes from the Latin word for name
Example: attribute “outlook” from weather data
  Values: “sunny”, “overcast”, and “rainy”
No relation is implied among nominal values (no ordering or distance measure)
Only equality tests can be performed

20. Ordinal levels of measurement
Impose order on values
But: no distance between values defined
Example: attribute “temperature” in weather data
  Values: “hot” > “mild” > “cool”
Note: addition and subtraction don't make sense
Example rule: temperature < hot ⇒ play = yes
Distinction between nominal and ordinal not always clear (e.g., attribute “outlook”)
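A sketch of how an ordinal attribute can be represented so that order comparisons (but not arithmetic) make sense, assuming pandas rather than any tool named in the slides:

```python
# Ordinal attribute: values are ordered, so rules like
# "temperature < hot => play = yes" can be tested (illustrative sketch).
import pandas as pd

temperature = pd.Categorical(
    ["hot", "mild", "cool", "mild"],
    categories=["cool", "mild", "hot"],   # cool < mild < hot
    ordered=True,
)

print(temperature < "hot")   # elementwise order test; addition/subtraction
                             # are still meaningless for this attribute
```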

21. Interval quantities
Interval quantities are not only ordered but measured in fixed and equal units
  Example 1: attribute “temperature” expressed in degrees Fahrenheit
  Example 2: attribute “year”
Difference of two values makes sense
Sum or product doesn't make sense
  Zero point is not defined!

22. Ratio quantities
Ratio quantities are ones for which the measurement scheme defines a zero point
  Example: attribute “distance”
  Distance between an object and itself is zero
Ratio quantities are treated as real numbers
  All mathematical operations are allowed
But: is there an “inherently” defined zero point?
  Answer depends on scientific knowledge (e.g., Fahrenheit knew no lower limit to temperature)

23. Attribute types used in practice
Many data mining schemes accommodate just two levels of measurement: nominal and ordinal
Others deal exclusively with ratio quantities
Nominal attributes are also called “categorical”, “enumerated”, or “discrete”
  But: “enumerated” and “discrete” imply order
  Special case: dichotomy (“boolean” attribute)
Ordinal attributes are sometimes coded as “numeric” or “continuous”
  But: “continuous” implies mathematical continuity

24. Metadata
Information about the data that encodes background knowledge
In theory this information can be used to restrict the search space of the learning algorithm
Examples:
  Dimensional considerations (i.e., expressions must be dimensionally correct)
  Circular orderings (e.g., degrees in compass)
  Partial orderings (e.g., generalization/specialization relations)

25. Preparing the input
Denormalization is not the only issue when data is prepared for learning
Problem: different data sources (e.g., sales department, customer billing department, ...)
  Differences: styles of record keeping, coding conventions, time periods, data aggregation, primary keys, types of errors
  Data must be assembled, integrated, cleaned up
  “Data warehouse”: consistent point of access
External data may be required (“overlay data”)
Critical: type and level of data aggregation

26. The ARFF data format

  %
  % ARFF file for weather data with some numeric features
  %
  @relation weather

  @attribute outlook {sunny, overcast, rainy}
  @attribute temperature numeric
  @attribute humidity numeric
  @attribute windy {true, false}
  @attribute play? {yes, no}

  @data
  sunny, 85, 85, false, no
  sunny, 80, 90, true, no
  overcast, 83, 86, false, yes
  ...
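A sketch of reading such a file in Python, assuming scipy and a local file named weather.arff containing the header above; scipy's reader handles plain numeric and nominal attributes, not the sparse or relational extensions discussed later:

```python
# Reading a basic ARFF file into a DataFrame (illustrative sketch;
# "weather.arff" is a hypothetical local file).
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("weather.arff")
df = pd.DataFrame(data)

print(meta)        # attribute names and types declared in the header
print(df.head())   # the instances from the @data section
```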

27. Additional attribute types
ARFF data format also supports string attributes:
  Similar to nominal attributes but list of values is not pre-specified
Additionally, it supports date attributes:
  Uses the ISO-8601 combined date and time format yyyy-MM-ddTHH:mm:ss

  @attribute description string
  @attribute today date

28. Relational attributes
Relational attributes allow multi-instance problems to be represented in ARFF format
Each value of a relational attribute is a separate bag of instances, but each bag has the same attributes
Nested attribute block gives the structure of the referenced instances

  @attribute bag relational
    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute windy {true, false}
  @end bag

29. Multi-instance ARFF

  %
  % Multiple instance ARFF file for the weather data
  %
  @relation weather

  @attribute bag_ID { 1, 2, 3, 4, 5, 6, 7 }
  @attribute bag relational
    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute windy {true, false}
  @end bag
  @attribute play? {yes, no}

  @data
  1, "sunny, 85, 85, false\nsunny, 80, 90, true", no
  2, "overcast, 83, 86, false\nrainy, 70, 96, false", yes
  ...

30. Sparse data
In some applications most attribute values are zero and storage requirements can be reduced
  E.g.: word counts in a text categorization problem
ARFF supports sparse data storage
This also works for nominal attributes (where the first value of the attribute corresponds to “zero”)
Some learning algorithms work very efficiently with sparse data

  Dense:
    0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"
    0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"
  Sparse:
    {1 26, 6 63, 10 "class A"}
    {3 42, 10 "class B"}
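The same idea outside ARFF, as a sketch with scipy (not from the slides): only the non-zero entries of a mostly-zero attribute matrix are stored.

```python
# Sparse storage: keep only (row, column, value) triples (illustrative sketch).
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 26, 0,  0, 0, 0, 63, 0, 0, 0],
                  [0,  0, 0, 42, 0, 0,  0, 0, 0, 0]])

sparse = csr_matrix(dense)
print(sparse)                                   # non-zero entries only
print(sparse.nnz, "non-zero values out of", dense.size)
```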

31. Attribute types
Interpretation of attribute types in an ARFF file depends on the learning scheme that is applied
Numeric attributes are interpreted as
  ordinal scales if less-than and greater-than are used
  ratio scales if distance calculations are performed (normalization/standardization may be required)
Note also that some instance-based schemes define a distance between nominal values (0 if values are equal, 1 otherwise)
Background knowledge may be required for correct interpretation of data
  E.g., consider integers in some given data file: nominal, ordinal, or ratio scale?

32. Nominal vs. ordinal
Attribute “age” nominal:

  If age = young and astigmatic = no
  and tear production rate = normal
  then recommendation = soft

  If age = pre-presbyopic and astigmatic = no
  and tear production rate = normal
  then recommendation = soft

Attribute “age” ordinal (e.g. “young” < “pre-presbyopic” < “presbyopic”):

  If age ≤ pre-presbyopic and astigmatic = no
  and tear production rate = normal
  then recommendation = soft

33. Missing values
Missing values are frequently indicated by out-of-range entries for an attribute
There are different types of missing values: unknown, unrecorded, irrelevant
Reasons:
  malfunctioning equipment
  changes in experimental design
  collation of different datasets
  measurement not possible
Missing value may have significance in itself (e.g., missing test in a medical examination)
  Most schemes assume that is not the case and “missing” may need to be coded as an additional, separate attribute value
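A sketch of making missingness visible to the learning scheme, assuming pandas and a made-up single-attribute table:

```python
# Coding "missing" as an explicit, separate attribute value/indicator
# (illustrative sketch; the data below is hypothetical).
import numpy as np
import pandas as pd

df = pd.DataFrame({"test_result": [5.1, np.nan, 4.8, np.nan]})  # NaN = missing

# If missingness may be significant (e.g., a test that was never ordered),
# record it explicitly rather than silently imputing it away.
df["test_result_missing"] = df["test_result"].isna()
df["test_result"] = df["test_result"].fillna(df["test_result"].mean())

print(df)
```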

34. Inaccurate values
Reason: data has not been collected for mining it
Result: errors and omissions that affect the accuracy of data mining
  These errors may not affect the original purpose of the data (e.g., age of customer)
Typographical errors in nominal attributes ⇒ values need to be checked for consistency
Typographical and measurement errors in numeric attributes ⇒ outliers need to be identified
Errors may be deliberate (e.g., wrong zip codes)
Other problems: duplicates, stale data
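One simple way to flag likely measurement or typing errors in a numeric attribute is a z-score test; a sketch with made-up values (the threshold of 2 is a judgment call, not from the slides):

```python
# Flagging outliers in a numeric attribute via z-scores (illustrative sketch).
import numpy as np

age = np.array([23, 31, 45, 29, 230, 38])    # 230 is probably a typo

z = (age - age.mean()) / age.std()
outliers = age[np.abs(z) > 2]                # candidates to check by hand
print(outliers)                              # -> [230]
```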

35. Unbalanced data
Unbalanced data is a well-known problem in classification problems
  One class is often far more prevalent than the rest
  Example: detecting a rare disease
Main problem: simply predicting the majority class yields high accuracy but is not useful
  Predicting that no patient has the rare disease gives high classification accuracy
Unbalanced data requires techniques that can deal with unequal misclassification costs
  Misclassifying an afflicted patient may be much more costly than misclassifying a healthy one
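One common way to express unequal misclassification costs is class weighting; a sketch assuming scikit-learn and a synthetic unbalanced problem (the weights are illustrative, not prescribed by the slides):

```python
# Unequal misclassification costs via class weights (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic unbalanced problem: ~5% positives (the "rare disease")
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# Errors on the rare class (label 1) cost ten times more during training
model = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)
print(model.score(X, y))
```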

36. Getting to know your data
Simple visualization tools are very useful
  Nominal attributes: histograms (Is the distribution consistent with background knowledge?)
  Numeric attributes: graphs (Any obvious outliers?)
2-D and 3-D plots show dependencies
May need to consult domain experts
Too much data to inspect manually? Take a sample!
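A quick-look sketch with pandas, assuming a hypothetical weather.csv file with an "outlook" column: inspect a random sample rather than every row, and summarize a nominal attribute.

```python
# Getting to know the data: sample a few rows and tabulate a nominal
# attribute (illustrative sketch; "weather.csv" and "outlook" are assumed).
import pandas as pd

df = pd.read_csv("weather.csv")

print(df.sample(n=5, random_state=0))   # a random sample, not the whole file
print(df["outlook"].value_counts())     # histogram-style summary of a nominal attribute
```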