Introduction to k Nearest Neighbour Classification

Contents

1.1 Example
1.2 Example
1.3 The General Case
2 The k Nearest Neighbours Algorithm
2.1 The Algorithm
know about flowers to help him. Instead, he gets people to measure various characteristics of the flowers, such as the stamen size, the number of petals, the height, the colour, the size of the flower head, etc., and put them into a computer. He then wants the computer to compare them to a pre-classified database of samples, and predict what variety each of the flowers is, based on its characteristics.

1.3 The General Case

In general, we start with a set of data, each data point of which is in a known class. We then want to be able to predict the classification of a new data point based on the known classifications of the observations in the database. For this reason, the database is known as our training set, since it trains us in what objects of the different classes look like. The process of choosing the classification of the new observation is known as the classification problem, and there are several ways to tackle it. Here, we consider choosing the classification of the new observation based on the classifications of the observations in the database to which it is "most similar".

However, deciding whether two observations are similar is quite an open question. For instance, deciding whether two colours are similar is a completely different process from deciding whether two paragraphs of text are similar. Clearly, then, before we can decide whether two observations are similar, we need to find some way of comparing objects. The principal trouble with this is that our data could be of many different types - a number, a colour, a geographical location, a true/false (boolean) answer to a question, etc. - which would all require different ways of measuring similarity. This first problem, then, is one of preprocessing the data in the database in such a way as to ensure that we can compare observations. One common way of doing this is to convert all our characteristics into numerical values, such as converting colours to RGB values, locations to latitude and longitude, or boolean values to ones and zeros. Once we have everything as numbers, we can imagine a space in which each of our characteristics is represented by a different dimension, and the value of each observation for each characteristic is its coordinate in that dimension. Our observations then become points in space, and we can interpret the distance between them as their similarity (using some appropriate metric).

Even once we have decided on some way of determining how similar two observations are, we still have the problem of deciding which observations from the database are similar enough to our new observation for us to take their classifications into account when classifying it. This problem can be solved in several different ways: either by considering all the data points within a certain radius of the new sample point, or by taking only a certain number of the nearest points. The latter method is what we consider now in the k Nearest Neighbours Algorithm.

2 The k Nearest Neighbours Algorithm

As discussed earlier, we consider each of the characteristics in our training set as a different dimension in some space, and take the value an observation has for this characteristic to be its coordinate in that dimension, giving a set of points in space. We can then consider the similarity of two points to be the distance between them in this space under some appropriate metric.

However, if we create the map of a dataset without any noise (as shown in Figure 4), we can see that there is a much clearer division between where points should be blue and pink, and it follows the intuitive line much more closely.
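The algorithm just described can be sketched in a few lines of Python. This is a minimal illustration only, not the applet's implementation: the Euclidean metric, the toy training data, and the name knn_classify are all assumptions made for the example.

```python
from collections import Counter
import math

def knn_classify(training, new_point, k=3):
    """Classify new_point by majority vote among its k nearest
    training points. `training` is a list of (coordinates, label) pairs."""
    # Sort the training observations by distance to the new observation
    # (Euclidean distance here; any appropriate metric would do).
    neighbours = sorted(training,
                        key=lambda obs: math.dist(obs[0], new_point))
    # Collect the class labels of the k nearest points and
    # return the most common label among them.
    votes = [label for _, label in neighbours[:k]]
    return Counter(votes).most_common(1)[0][0]

# Two clusters in opposite corners, as in the applet examples.
training = [((0.1, 0.1), "blue"), ((0.2, 0.15), "blue"), ((0.15, 0.3), "blue"),
            ((0.9, 0.85), "pink"), ((0.8, 0.9), "pink"), ((0.85, 0.7), "pink")]

print(knn_classify(training, (0.2, 0.2), k=3))   # near the blue cluster
print(knn_classify(training, (0.8, 0.8), k=3))   # near the pink cluster
```

A point near the blue corner is assigned "blue", and one near the pink corner "pink", since its three nearest neighbours all lie in that cluster.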
Figure 4: The map generated by the dataset without any random noise points added.

The reason for this is not too hard to see. The algorithm considers all points equally, whether they are part of a main cluster or just a random noise point, so even a few unusual points can make the results very different. In reality, however, we certainly expect to get some unusual points in our sample, so we would like to find a way to make the algorithm more robust.

With a view to making the results look more like the map in Figure 4, we first add in 20 random points of each colour, going back to where we started. Plotting the map at this stage will produce something fairly unintuitive, like Figure 3, with seemingly random sections mapped to different colours. Instead, we try changing the value of k. This is achieved using the 'Parameters' tab at the top. Clearly, when choosing a new value for k, we do not want to pick an even number, since we could find ourselves in the awkward position of having an equal number of nearest neighbours of each colour. Instead, we pick an odd number, say 5, and plot the map. We find that as we increase the value of k, the map gets closer and closer to the map produced when we had no random points, as shown in Figure 5.

Figure 5: The maps generated as we increase the value of k, starting with k = 3 in the top left and ending with k = 17 in the bottom right.

The reason why this happens is quite simple. If we assume that the points in the two clusters are denser than the random noise points, it makes sense that when we consider a larger number of nearest neighbours, the influence of the noise points is rapidly reduced. The problem with this approach, however, is that increasing the value of k means that the algorithm takes longer to run, so if we consider a much bigger database with thousands of data points, and a lot more noise, we will need an even higher value of k, and the process will take longer still.

An alternative method for dealing with noise in our results is to use cross-validation. This determines, for each data point, whether it would be classified into the class it is in if it were added to the dataset later. This is achieved by removing it from the training set and running the kNN algorithm to predict a class for it. If this class matches, then the point is correct; otherwise it is deemed to be a cross-validation error.

To investigate the effects of adding random noise points to the training set, we can use the applet. Setting it up as before, with two clusters in opposite corners, we can use the "Cross Validation" feature to run a cross-validation on the data in our training set. This was done for 6 different datasets, and the number of noise points in each one was increased. These results were averaged, and the average percentage of points mis-classified is shown in the table:

For clarity, these data are also presented in the graph in Figure 6. As can be seen, the number of mis-classifications increases as we increase the number of random noise points present in our training data. This is as we would expect, since the random noise points should show up as being unexpected. Indeed, if we extrapolate the data using the general trend of the graph, we can expect that the percentage of mis-classifications would increase to approximately 50% as we continue increasing the number of random noise points. This agrees with how we would intuitively expect the cross-validation to behave on a completely random dataset.
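The cross-validation procedure described above (leave each point out in turn, predict its class from the rest, and count mismatches) can be sketched as follows. This is a hedged illustration, not the applet's implementation: the helper knn_classify, the toy training set with one deliberate "noise" point, and the choice k = 3 are all assumptions for the example.

```python
from collections import Counter
import math

def knn_classify(training, new_point, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    neighbours = sorted(training, key=lambda obs: math.dist(obs[0], new_point))
    votes = [label for _, label in neighbours[:k]]
    return Counter(votes).most_common(1)[0][0]

def cross_validation_errors(training, k=3):
    """Leave-one-out cross-validation: remove each point in turn,
    predict its class from the remaining points, and return the
    percentage of points mis-classified."""
    errors = 0
    for i, (point, true_label) in enumerate(training):
        rest = training[:i] + training[i + 1:]   # training set without point i
        if knn_classify(rest, point, k) != true_label:
            errors += 1
    return 100.0 * errors / len(training)

# Two clusters in opposite corners, plus one noise point of the
# wrong colour sitting inside the blue cluster.
training = [((0.1, 0.1), "blue"), ((0.2, 0.15), "blue"), ((0.15, 0.3), "blue"),
            ((0.9, 0.85), "pink"), ((0.8, 0.9), "pink"), ((0.85, 0.7), "pink"),
            ((0.12, 0.2), "pink")]

# Only the noise point is mis-classified: with it removed, all of its
# nearest neighbours are blue, so one point in seven is a CV error.
print(cross_validation_errors(training, k=3))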
Figure 6: A graph demonstrating the effect of increasing the number of random noise points in the dataset on the number of mis-classifications.

It is also found that increasing the size of the training set makes classification of a new point take much longer, due simply to the number of comparisons required. This speed problem is one of the main issues with the kNN algorithm, alongside the need to find some way of comparing data of unusual types (for instance, trying to compare people's favourite quotations to decide their political preferences would require us to take all sorts of metrics about the actual quotation - information about the author, about the work it is part of, etc. - and it quickly becomes very difficult for a computer, while still relatively easy for a person). To solve the first of these problems, we can use data reduction methods to reduce the number of comparisons necessary.

Increasing the number of random noise points in the training data affected the CNN algorithm's results in three ways:

1. The percentage of points classed as outliers increased dramatically.
2. The percentage of points classed as absorbed decreased.
3. The percentage of points classed as prototypes increased slightly.

All of these are as would be expected. The percentage of outliers increases because there are increasingly many noise points of the other colour in each cluster, which will lead them to be mis-classified. The percentage of points deemed to be prototypes increases because our dataset has a much more complex structure once we have included all these random noise points. The percentage of absorbed points must therefore decrease, since the other two types are increasing (and the three are mutually exclusive).

Figure 7: A graph showing the effects of increasing the number of random noise points in the training data on the percentage of points assigned each of the three primitive types by the CNN algorithm.

References

[1] E. Mirkes, KNN and Potential Energy (Applet). University of Leicester. Available: http://www.math.le.ac.uk/people/ag153/homepage/KNN/KNN3.html, 2011.

[2] L. Kozma, k Nearest Neighbours Algorithm. Helsinki University of Technology. Available: http://www.lkozma.net/knn2.pdf, 2008.