Introduction to k Nearest Neighbour Classification and Condensed Nearest Neighbour Data Reduction


…know about flowers to help him. Instead, he gets people to measure various characteristics of the flowers, such as the stamen size, the number of petals, the height, the colour, the size of the flower head, etc, and put them into a computer. He then wants the computer to compare them to a pre-classified database of samples, and predict what variety each of the flowers is, based on its characteristics.

1.3 The General Case

In general, we start with a set of data, each data point of which is in a known class. Then, we want to be able to predict the classification of a new data point based on the known classifications of the observations in the database. For this reason, the database is known as our training set, since it trains us in what objects of the different classes look like. The process of choosing the classification of the new observation is known as the classification problem, and there are several ways to tackle it. Here, we consider choosing the classification of the new observation based on the classifications of the observations in the database which it is "most similar" to.

However, deciding whether two observations are similar or not is quite an open question. For instance, deciding whether two colours are similar is a completely different process to deciding whether two paragraphs of text are similar. Clearly, then, before we can decide whether two observations are similar, we need to find some way of comparing objects. The principal trouble with this is that our data could be of many different types - it could be a number, it could be a colour, it could be a geographical location, it could be a true/false (boolean) answer to a question, etc - which would all require different ways of measuring similarity.

It seems then that this first problem is one of preprocessing the data in the database in such a way as to ensure that we can compare observations. One common way of doing this is to try to convert all our characteristics into numerical values, such as converting colours to RGB values, converting locations to latitude and longitude, or converting boolean values into ones and zeros. Once we have everything as numbers, we can imagine a space in which each of our characteristics is represented by a different dimension, and the value of each observation for each characteristic is its coordinate in that dimension. Then, our observations become points in space and we can interpret the distance between them as their similarity (using some appropriate metric).

Even once we have decided on some way of determining how similar two observations are, we still have the problem of deciding which observations from the database are similar enough to our new observation for us to take their classification into account when classifying the new observation. This problem can be solved in several different ways: either by considering all the data points within a certain radius of the new sample point, or by taking only a certain number of the nearest points. This latter method is what we consider now in the k Nearest Neighbours Algorithm.

2 The k Nearest Neighbours Algorithm

As discussed earlier, we consider each of the characteristics in our training set as a different dimension in some space, and take the value an observation has for this characteristic to be its coordinate in that dimension, so getting a set of points in space. We can then consider the similarity of two points to be the distance between them in this space under some appropriate metric.
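To make the discussion concrete, here is a minimal Python sketch of a k nearest neighbours classifier of the kind described above. It is an illustration rather than this document's own implementation: the function names and toy data are ours, observations are assumed to have already been encoded as tuples of numbers (as in Section 1.3), and Euclidean distance stands in for the "appropriate metric".

    import math
    from collections import Counter

    def euclidean(a, b):
        # Distance between two observations, each encoded as a tuple of numbers.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_classify(training_set, new_point, k=3):
        # training_set is a list of (features, label) pairs, where features is
        # a tuple of numeric characteristics (colours as RGB, booleans as 0/1).
        # Sort the training points by distance to the new observation...
        neighbours = sorted(training_set,
                            key=lambda pair: euclidean(pair[0], new_point))
        # ...then take the labels of the k closest points and pick the most
        # common one as the predicted class.
        labels = [label for _, label in neighbours[:k]]
        return Counter(labels).most_common(1)[0][0]

    # Two small clusters in opposite corners, as in the applet's examples.
    training = [((0.10, 0.10), "blue"), ((0.20, 0.15), "blue"),
                ((0.15, 0.25), "blue"), ((0.90, 0.90), "pink"),
                ((0.85, 0.80), "pink"), ((0.80, 0.95), "pink")]

    print(knn_classify(training, (0.2, 0.2)))   # -> blue
    print(knn_classify(training, (0.9, 0.85)))  # -> pink

Sorting the whole training set keeps the sketch short, but it makes every query cost O(n log n) in the size of the training set; a practical implementation would track only the k smallest distances, or use a spatial index. This is one reason why the speed of kNN on large training sets becomes an issue below.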
However, if we create the map of a dataset without any noise (as shown in Figure 4), we can see that there is a much clearer division between where points should be blue and pink, and it follows the intuitive line much more closely.

Figure 4: The map generated by the dataset without any random noise points added

The reason for this is not too hard to see. The problem is that the algorithm considers all points equally, whether they are part of a main cluster or just a random noise point, so even having a few unusual points can make the results very different. However, in reality, we certainly expect to get some unusual points in our sample, so we would like to find a way to make the algorithm more robust.

With a view to finding a way to make the results look more like the map in Figure 4, we first add in 20 random points of each colour, going back to where we started. Plotting the map at this stage will produce something fairly unintuitive, like Figure 3, with seemingly random sections mapped to different colours. Instead, we try changing the value of k. This is achieved using the 'Parameters' tab at the top. Clearly, when choosing a new value for k, we do not want to pick an even number, since we could find ourselves in the awkward position of having an equal number of nearest neighbours of each colour. Instead, we pick an odd number, say 5, and plot the map. We find that as we increase the value of k, the map gets closer and closer to the map produced when we had no random points, as shown in Figure 5.

Figure 5: The maps generated as we increase the value of k, starting with k = 3 in the top left and ending with k = 17 in the bottom right

The reason why this happens is quite simple. If we assume that the points in the two clusters will be more dense than the random noise points, it makes sense that when we consider a large number of nearest neighbours, the influence of the noise points is rapidly reduced. However, the problem with this approach is that increasing the value of k means that the algorithm takes a long time to run, so if we consider a much bigger database with thousands of data points, and a lot more noise, we will need a higher value of k, and the process will take longer to run anyway.

An alternative method for dealing with noise in our results is to use cross-validation. This determines whether each data point would be classified into the class it is in if it were added to the dataset later. This is achieved by removing it from the training set and running the kNN algorithm to predict a class for it. If this class matches, then the point is correct; otherwise it is deemed to be a cross-validation error.
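This procedure is leave-one-out cross-validation, and it is straightforward to sketch in Python. As before, this is an illustrative sketch with our own names and toy data, not the applet's implementation; the classifier from the previous sketch is repeated so that the fragment is self-contained.

    import math
    from collections import Counter

    def knn_classify(training_set, new_point, k=3):
        # The same majority-vote classifier as in the previous sketch.
        dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        neighbours = sorted(training_set,
                            key=lambda pair: dist(pair[0], new_point))
        return Counter(lab for _, lab in neighbours[:k]).most_common(1)[0][0]

    def cross_validation_errors(training_set, k=3):
        # For each point: remove it from the training set, predict its class
        # from the remaining points, and count a cross-validation error
        # whenever the prediction disagrees with its actual class.
        errors = 0
        for i, (features, label) in enumerate(training_set):
            rest = training_set[:i] + training_set[i + 1:]
            if knn_classify(rest, features, k) != label:
                errors += 1
        return errors

    # The clusters from before, plus one "noise" point of the wrong colour.
    training = [((0.10, 0.10), "blue"), ((0.20, 0.15), "blue"),
                ((0.15, 0.25), "blue"), ((0.90, 0.90), "pink"),
                ((0.85, 0.80), "pink"), ((0.80, 0.95), "pink"),
                ((0.20, 0.20), "pink")]

    # Only the noise point inside the blue cluster fails the check.
    print(cross_validation_errors(training, k=3))  # -> 1

Note that k is odd here for the reason given above: with two classes, an odd k means the vote among the neighbours can never tie.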
To investigate the effects of adding random noise points to the training set, we can use the applet. Setting it up as before, with two clusters in opposite corners, we can use the "Cross Validation" feature to run a cross-validation on the data in our training set. This was done for 6 different datasets, and the number of noise points in each one was increased. These results were averaged, and the average percentage of points misclassified is shown in the table:

[Table of averaged misclassification percentages not preserved in this transcript]

For clarity, these data are also presented in the graph in Figure 6. As can be seen, the number of misclassifications increases as we increase the number of random noise points present in our training data. This is as we would expect, since the random noise points should show up as being unexpected. Indeed, if we extrapolate the data using the general trend of the graph, we can expect that the percentage of misclassifications would increase to approximately 50% as we continue increasing the number of random noise points. This agrees with how we would intuitively expect the cross-validation to behave on a completely random dataset.

Figure 6: A graph demonstrating the effect of increasing the number of random noise points in the dataset on the number of misclassifications

It is also found that increasing the size of the training set makes classification of a new point take much longer, due simply to the number of comparisons required. This speed problem is one of the main issues with the kNN algorithm, alongside the need to find some way of comparing data of strange types (for instance, trying to compare people's favourite quotations to decide their political preferences would require us to take all sorts of metrics about the actual quotation, such as information about the author, about the work it is part of, etc, so that it quickly becomes very difficult for a computer, while still remaining relatively easy for a person). To solve the first of these problems, we can use data reduction methods to reduce the number of comparisons necessary.

One such method is the Condensed Nearest Neighbour (CNN) algorithm, which assigns each training point one of three primitive types: prototypes, which are retained as the reduced training set; absorbed points, which are correctly classified by the prototypes and so can be discarded; and outliers, which would be misclassified by their neighbours. Repeating the noise experiment with the CNN algorithm, increasing the number of random noise points changed the results in three ways:

1. The percentage of points classed as outliers increased dramatically.
2. The percentage of points classed as absorbed decreased.
3. The percentage of points classed as prototypes increased slightly.

All of these are as we would expect. The percentage of outliers increases because there are increasingly many noise points of the other colour in each cluster, which will lead them to be misclassified. The percentage of points deemed to be prototypes increases because our dataset has a much more complex structure once we have included all these random noise points. The percentage of absorbed points must therefore decrease, since the other two types are increasing (and the three are mutually exclusive).

Figure 7: A graph showing the effects of increasing the number of random noise points in the training data on the percentage of points assigned each of the three primitive types by the CNN algorithm

References

[1] E. Mirkes, KNN and Potential Energy (Applet). University of Leicester. Available: http://www.math.le.ac.uk/people/ag153/homepage/KNN/KNN3.html, 2011.

[2] L. Kozma, k Nearest Neighbours Algorithm. Helsinki University of Technology. Available: http://www.lkozma.net/knn2.pdf, 2008.