Online edition (c) 2009 Cambridge UP
DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome.

17 Hierarchical clustering

Flat clustering is efficient and conceptually simple, but as we saw in Chapter 16 it has a number of drawbacks. The algorithms introduced in Chapter 16 return a flat unstructured set of clusters, require a prespecified number of clusters as input and are nondeterministic. Hierarchical clustering (or hierarchic clustering) outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by flat clustering.[1] Hierarchical clustering does not require us to prespecify the number of clusters and most hierarchical algorithms that have been used in IR are deterministic. These advantages of hierarchical clustering come at the cost of lower efficiency. The most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents compared to the linear complexity of K-means and EM (cf. Section 16.4, page 364).

This chapter first introduces agglomerative hierarchical clustering (Section 17.1) and presents four different agglomerative algorithms, in Sections 17.2–17.4, which differ in the similarity measures they employ: single-link, complete-link, group-average, and centroid similarity. We then discuss the optimality conditions of hierarchical clustering in Section 17.5. Section 17.6 introduces top-down (or divisive) hierarchical clustering. Section 17.7 looks at labeling clusters automatically, a problem that must be solved whenever humans interact with the output of clustering. We discuss implementation issues in Section 17.8. Section 17.9 provides pointers to further reading, including references to soft hierarchical clustering, which we do not cover in this book.

There are few differences between the applications of flat and hierarchical clustering in information retrieval. In particular, hierarchical clustering is appropriate for any of the applications shown in Table 16.1 (page 351; see also Section 16.6, page 372
). In fact, the example we gave for collection clustering is hierarchical. In general, we select flat clustering when efficiency is important and hierarchical clustering when one of the potential problems of flat clustering (not enough structure, predetermined number of clusters, non-determinism) is a concern. In addition, many researchers believe that hierarchical clustering produces better clusters than flat clustering. However, there is no consensus on this issue (see references in Section 17.9).

[1] In this chapter, we only consider hierarchies that are binary trees like the one shown in Figure 17.1, but hierarchical clustering can be easily extended to other types of trees.

17.1 Hierarchical agglomerative clustering

Hierarchical clustering algorithms are either top-down or bottom-up. Bottom-up algorithms treat each document as a singleton cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all documents. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering or HAC. Top-down clustering requires a method for splitting a cluster. It proceeds by splitting clusters recursively until individual documents are reached. See Section 17.6. HAC is more frequently used in IR than top-down clustering and is the main subject of this chapter.

Before looking at specific similarity measures used in HAC in Sections 17.2–17.4, we first introduce a method for depicting hierarchical clusterings graphically, discuss a few key properties of HACs and present a simple algorithm for computing an HAC.

An HAC clustering is typically visualized as a dendrogram as shown in Figure 17.1. Each merge is represented by a horizontal line. The y-coordinate of the horizontal line is the similarity of the two clusters that were merged, where documents are viewed as singleton clusters. We call this similarity the combination similarity of the merged cluster. For example, the combination similarity of the cluster consisting of Lloyd's CEO questioned and Lloyd's chief / U.S. grilling in Figure 17.1
is 0.56. We define the combination similarity of a singleton cluster as its document's self-similarity (which is 1.0 for cosine similarity).

By moving up from the bottom layer to the top node, a dendrogram allows us to reconstruct the history of merges that resulted in the depicted clustering. For example, we see that the two documents entitled War hero Colin Powell were merged first in Figure 17.1 and that the last merge added Ag trade reform to a cluster consisting of the other 29 documents.

A fundamental assumption in HAC is that the merge operation is monotonic. Monotonic means that if s1, s2, ..., s_{K−1} are the combination similarities of the successive merges of an HAC, then s1 ≥ s2 ≥ ... ≥ s_{K−1} holds. A non-monotonic hierarchical clustering contains at least one inversion s_i < s_{i+1} and contradicts the fundamental assumption that we chose the best merge available at each step. We will see an example of an inversion in Figure 17.12.

Figure 17.1  A dendrogram of a single-link clustering of 30 documents from Reuters-RCV1. Two possible cuts of the dendrogram are shown: at 0.4 into 24 clusters and at 0.1 into 12 clusters.

Hierarchical clustering does not require a prespecified number of clusters. However, in some applications we want a partition of disjoint clusters just as
in flat clustering. In those cases, the hierarchy needs to be cut at some point. A number of criteria can be used to determine the cutting point:

- Cut at a prespecified level of similarity. For example, we cut the dendrogram at 0.4 if we want clusters with a minimum combination similarity of 0.4. In Figure 17.1, cutting the diagram at y = 0.4 yields 24 clusters (grouping only documents with high similarity together) and cutting it at y = 0.1 yields 12 clusters (one large financial news cluster and 11 smaller clusters).

- Cut the dendrogram where the gap between two successive combination similarities is largest. Such large gaps arguably indicate natural clusterings. Adding one more cluster decreases the quality of the clustering significantly, so cutting before this steep decrease occurs is desirable. This strategy is analogous to looking for the knee in the K-means graph in Figure 16.8 (page 366).

- Apply Equation (16.11) (page 366): K = arg min_{K′} [RSS(K′) + λK′], where K′ refers to the cut of the hierarchy that results in K′ clusters, RSS is the residual sum of squares and λ is a penalty for each additional cluster. Instead of RSS, another measure of distortion can be used.

- As in flat clustering, we can also prespecify the number of clusters K and select the cutting point that produces K clusters.

A simple, naive HAC algorithm is shown in Figure 17.2. We first compute the N × N similarity matrix C. The algorithm then executes N − 1 steps of merging the currently most similar clusters. In each iteration, the two most similar clusters are merged and the rows and columns of the merged cluster i in C are updated.[2] The clustering is stored as a list of merges in A. I indicates which clusters are still available to be merged. The function SIM(i, m, j) computes the similarity of cluster j with the merge of clusters i and m. For some HAC algorithms, SIM(i, m, j) is simply a function of C[j][i] and C[j][m], for example, the maximum of these two values for single-link.

We will now refine this algorithm for the different similarity measures of single-link and complete-link clustering (Section 17.2) and group-average and centroid clustering (Sections 17.3 and 17.4). The merge criteria of these four variants of HAC are shown in Figure 17.3.
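The first two cutting criteria above can be sketched in a few lines of Python. This is my own sketch, not code from the book, and the function names are made up; `sims` holds the combination similarities in merge order, which is non-increasing for a monotonic hierarchy.

```python
def clusters_at_threshold(sims, n_docs, threshold):
    """Cut at a prespecified similarity level: keep every merge whose
    combination similarity is at least the threshold. Each kept merge
    reduces the number of clusters by one."""
    merges_kept = sum(1 for s in sims if s >= threshold)
    return n_docs - merges_kept

def merges_before_largest_gap(sims):
    """Cut where the gap between two successive combination similarities
    is largest; returns how many merges happen before that steep drop."""
    gaps = [sims[i] - sims[i + 1] for i in range(len(sims) - 1)]
    return gaps.index(max(gaps)) + 1

sims = [0.9, 0.8, 0.75, 0.3, 0.2]   # made-up merge similarities for 6 documents
print(clusters_at_threshold(sims, 6, 0.4))   # 3: three merges survive the cut
print(merges_before_largest_gap(sims))       # 3: biggest drop is 0.75 -> 0.3
```

Both criteria read off the answer from the merge similarities alone, without revisiting the documents.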
[2] We assume that we use a deterministic method for breaking ties, such as always choosing the merge that is the first cluster with respect to a total ordering of the subsets of the document set D.

SIMPLEHAC(d1, ..., dN)
 1  for n ← 1 to N
 2  do for i ← 1 to N
 3     do C[n][i] ← SIM(dn, di)
 4     I[n] ← 1  (keeps track of active clusters)
 5  A ← []  (assembles clustering as a sequence of merges)
 6  for k ← 1 to N − 1
 7  do ⟨i, m⟩ ← arg max_{⟨i,m⟩: i ≠ m ∧ I[i]=1 ∧ I[m]=1} C[i][m]
 8     A.APPEND(⟨i, m⟩)  (store merge)
 9     for j ← 1 to N
10     do C[i][j] ← SIM(i, m, j)
11        C[j][i] ← SIM(i, m, j)
12     I[m] ← 0  (deactivate cluster)
13  return A

Figure 17.2  A simple, but inefficient HAC algorithm.

Figure 17.3  The different notions of cluster similarity used by the four HAC algorithms: (a) single-link: maximum similarity; (b) complete-link: minimum similarity; (c) centroid: average inter-similarity; (d) group-average: average of all similarities. An inter-similarity is a similarity between two documents from different clusters.

Figure 17.4  A single-link (left) and complete-link (right) clustering of eight documents. The ellipses correspond to successive clustering stages. Left: The single-link similarity of the two upper two-point clusters is the similarity of d2 and d3 (solid line), which is greater than the single-link similarity of the two left two-point clusters (dashed line). Right: The complete-link similarity of the two upper two-point clusters is the similarity of d1 and d4 (dashed line), which is smaller than the complete-link similarity of the two left two-point clusters (solid line).
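The SIMPLEHAC pseudocode of Figure 17.2 can be rendered compactly in Python. This is my own sketch, not the book's code: `sim` is the precomputed N × N similarity matrix, and `linkage` plays the role of SIM(i, m, j), e.g. `max` for single-link and `min` for complete-link. Ties are broken by index order, a deterministic choice in the spirit of footnote 2.

```python
def simple_hac(sim, linkage=max):
    """Naive HAC: N-1 passes, each scanning the whole matrix (Theta(N^3))."""
    n = len(sim)
    C = [row[:] for row in sim]        # working copy of the similarity matrix
    active = [True] * n                # I[n] in the pseudocode
    merges = []                        # A: the sequence of merges
    for _ in range(n - 1):
        # find the currently most similar pair of active clusters
        i, m = max(((a, b) for a in range(n) for b in range(n)
                    if a != b and active[a] and active[b]),
                   key=lambda p: C[p[0]][p[1]])
        merges.append((i, m))
        for j in range(n):             # update row and column of cluster i
            C[i][j] = C[j][i] = linkage(C[j][i], C[j][m])
        active[m] = False              # deactivate cluster m
    return merges

sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.3, 0.1],
       [0.1, 0.3, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
print(simple_hac(sim))   # [(0, 1), (2, 3), (0, 2)]
```

On this small matrix the two tight pairs {d0, d1} and {d2, d3} are formed first, then joined in the final merge.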
17.2 Single-link and complete-link clustering

In single-link clustering or single-linkage clustering, the similarity of two clusters is the similarity of their most similar members (see Figure 17.3, (a)).[3] This single-link merge criterion is local. We pay attention solely to the area where the two clusters come closest to each other. Other, more distant parts of the cluster and the clusters' overall structure are not taken into account.

In complete-link clustering or complete-linkage clustering, the similarity of two clusters is the similarity of their most dissimilar members (see Figure 17.3, (b)). This is equivalent to choosing the cluster pair whose merge has the smallest diameter. This complete-link merge criterion is non-local; the entire structure of the clustering can influence merge decisions. This results in a preference for compact clusters with small diameters over long, straggly clusters, but also causes sensitivity to outliers. A single document far from the center can increase diameters of candidate merge clusters dramatically and completely change the final clustering.

Figure 17.4 depicts a single-link and a complete-link clustering of eight documents. The first four steps, each producing a cluster consisting of a pair of two documents, are identical. Then single-link clustering joins the upper two pairs (and after that the lower two pairs) because on the maximum-similarity definition of cluster similarity, those two clusters are closest. Complete-link clustering joins the left two pairs (and then the right two pairs) because those are the closest pairs according to the minimum-similarity definition of cluster similarity.[4]

[3] Throughout this chapter, we equate similarity with proximity in 2D depictions of clustering.

Figure 17.5  A dendrogram of a complete-link clustering. The same 30 documents were clustered with single-link clustering in Figure 17.1.

Figure 17.6  Chaining in single-link clustering. The local criterion in single-link clustering can cause undesirable elongated clusters.

Figure 17.1 is an example of a single-link clustering of a set of documents and Figure 17.5 is the complete-link clustering of the same set. When cutting the last merge in Figure 17.5, we obtain two clusters of similar size (documents 1–16, from NYSE closing averages to Lloyd's chief / U.S. grilling, and documents 17–30, from Ohio Blue Cross to Clinton signs law). There is no cut of the dendrogram in Figure 17.1
that would give us an equally balanced clustering.

Both single-link and complete-link clustering have graph-theoretic interpretations. Define s_k to be the combination similarity of the two clusters merged in step k, and G(s_k) the graph that links all data points with a similarity of at least s_k. Then the clusters after step k in single-link clustering are the connected components of G(s_k) and the clusters after step k in complete-link clustering are maximal cliques of G(s_k). A connected component is a maximal set of connected points such that there is a path connecting each pair. A clique is a set of points that are completely linked with each other.

These graph-theoretic interpretations motivate the terms single-link and complete-link clustering. Single-link clusters at step k are maximal sets of points that are linked via at least one link (a single link) of similarity s ≥ s_k; complete-link clusters at step k are maximal sets of points that are completely linked with each other via links of similarity s ≥ s_k.

Single-link and complete-link clustering reduce the assessment of cluster quality to a single similarity between a pair of documents: the two most similar documents in single-link clustering and the two most dissimilar documents in complete-link clustering. A measurement based on one pair cannot fully reflect the distribution of documents in a cluster. It is therefore not surprising that both algorithms often produce undesirable clusters.

[4] If you are bothered by the possibility of ties, assume that d1 has coordinates (1 + ε, 3 − ε) and that all other points have integer coordinates.

Figure 17.7  Outliers in complete-link clustering. The five documents have the x-coordinates 1 + 2ε, 4, 5 + 2ε, 6 and 7 − ε. Complete-link clustering creates the two clusters shown as ellipses. The most intuitive two-cluster clustering is {{d1}, {d2, d3, d4, d5}}, but in complete-link clustering, the outlier d1 splits {d2, d3, d4, d5} as shown.

Single-link clustering can produce straggling clusters as shown in Figure 17.6. Since the merge criterion is strictly local, a chain of points can be extended for long
distances without regard to the overall shape of the emerging cluster. This effect is called chaining.

The chaining effect is also apparent in Figure 17.1. The last eleven merges of the single-link clustering (those above the 0.1 line) add on single documents or pairs of documents, corresponding to a chain. The complete-link clustering in Figure 17.5 avoids this problem. Documents are split into two groups of roughly equal size when we cut the dendrogram at the last merge. In general, this is a more useful organization of the data than a clustering with chains.

However, complete-link clustering suffers from a different problem. It pays too much attention to outliers, points that do not fit well into the global structure of the cluster. In the example in Figure 17.7 the four documents d2, d3, d4, d5 are split because of the outlier d1 at the left edge (Exercise 17.1). Complete-link clustering does not find the most intuitive cluster structure in this example.

17.2.1 Time complexity of HAC

The complexity of the naive HAC algorithm in Figure 17.2 is Θ(N³) because we exhaustively scan the N × N matrix C for the largest similarity in each of N − 1 iterations.

For the four HAC methods discussed in this chapter a more efficient algorithm is the priority-queue algorithm shown in Figure 17.8. Its time complexity is Θ(N² log N). The rows C[k] of the N × N similarity matrix C are sorted in decreasing order of similarity in the priority queues P. P[k].MAX() then returns the cluster in P[k] that currently has the highest similarity with ω_k, where we use ω_k to denote the kth cluster as in Chapter 16. After creating the merged cluster of ω_{k1} and ω_{k2}, ω_{k1} is used as its representative. The function SIM computes the similarity function for potential merge pairs: largest similarity for single-link, smallest similarity for complete-link, average similarity for GAAC (Section 17.3), and centroid similarity for centroid clustering (Section 17.4). We give an example of how a row of C is processed (Figure 17.8, bottom panel). The loop in lines 1–7 is Θ(N²) and the loop in lines 9–21 is Θ(N² log N) for an implementation of priority queues that supports deletion and insertion in Θ(log N). The overall complexity of the algorithm is therefore Θ(N² log N). In the definition of the function SIM, v⃗_m and v⃗_i are the vector sums of ω_{k1} ∪ ω_{k2} and ω_i, respectively, and N_m and N_i are the number of documents in ω_{k1} ∪ ω_{k2} and ω_i, respectively.

The argument of EFFICIENTHAC in Figure 17.8 is a set of vectors (as opposed to a set of generic documents) because GAAC and centroid clustering (Sections 17.3 and 17.4) require vectors as input. The complete-link version of EFFICIENTHAC can also be applied to documents that are not represented as vectors.

For single-link, we can introduce a next-best-merge array (NBM) as a further optimization as shown in Figure 17.9. NBM keeps track of what the best merge is for each cluster. Each of the two top-level for-loops in Figure 17.9 is Θ(N²), thus the overall complexity of single-link clustering is Θ(N²).

EFFICIENTHAC(d⃗1, ..., d⃗N)
 1  for n ← 1 to N
 2  do for i ← 1 to N
 3     do C[n][i].sim ← d⃗n · d⃗i
 4        C[n][i].index ← i
 5     I[n] ← 1
 6     P[n] ← priority queue for C[n] sorted on sim
 7     P[n].DELETE(C[n][n])  (don't want self-similarities)
 8  A ← []
 9  for k ← 1 to N − 1
10  do k1 ← arg max_{k: I[k]=1} P[k].MAX().sim
11     k2 ← P[k1].MAX().index
12     A.APPEND(⟨k1, k2⟩)
13     I[k2] ← 0
14     P[k1] ← []
15     for each i with I[i] = 1 ∧ i ≠ k1
16     do P[i].DELETE(C[i][k1])
17        P[i].DELETE(C[i][k2])
18        C[i][k1].sim ← SIM(i, k1, k2)
19        P[i].INSERT(C[i][k1])
20        C[k1][i].sim ← SIM(i, k1, k2)
21        P[k1].INSERT(C[k1][i])
22  return A

clustering algorithm   SIM(i, k1, k2)
single-link            max(SIM(i, k1), SIM(i, k2))
complete-link          min(SIM(i, k1), SIM(i, k2))
centroid               ((1/N_m) v⃗_m) · ((1/N_i) v⃗_i)
group-average          (1/((N_m + N_i)(N_m + N_i − 1))) [(v⃗_m + v⃗_i)² − (N_m + N_i)]

compute C[5]:                                      1: 0.2  2: 0.8  3: 0.6  4: 0.4  5: 1.0
create P[5] (by sorting):                          2: 0.8  3: 0.6  4: 0.4  1: 0.2
merge 2 and 3, update similarity of 2, delete 3:   2: 0.3  4: 0.4  1: 0.2
delete and reinsert 2:                             4: 0.4  2: 0.3  1: 0.2

Figure 17.8  The priority-queue algorithm for HAC. Top: The algorithm. Center: Four different similarity measures. Bottom: An example for processing steps 6 and 16–19. This is a made-up example showing P[5] for a 5 × 5 matrix C.

SINGLELINKCLUSTERING(d1, ..., dN)
 1  for n ← 1 to N
 2  do for i ← 1 to N
 3     do C[n][i].sim ← SIM(dn, di)
 4        C[n][i].index ← i
 5     I[n] ← n
 6     NBM[n] ← arg max_{X ∈ {C[n][i]: n ≠ i}} X.sim
 7  A ← []
 8  for n ← 1 to N − 1
 9  do i1 ← arg max_{i: I[i]=i} NBM[i].sim
10     i2 ← I[NBM[i1].index]
11     A.APPEND(⟨i1, i2⟩)
12     for i ← 1 to N
13     do if I[i] = i ∧ i ≠ i1 ∧ i ≠ i2
14        then C[i1][i].sim ← C[i][i1].sim ← max(C[i1][i].sim, C[i2][i].sim)
15        if I[i] = i2
16        then I[i] ← i1
17     NBM[i1] ← arg max_{X ∈ {C[i1][i]: I[i]=i ∧ i ≠ i1}} X.sim
18  return A

Figure 17.9  Single-link clustering algorithm using an NBM array. After merging two clusters i1 and i2, the first one (i1) represents the merged cluster. If I[i] = i, then i is the representative of its current cluster. If I[i] ≠ i, then i has been merged into the cluster represented by I[i] and will therefore be ignored when updating NBM[i1].

Figure 17.10  Complete-link clustering is not best-merge persistent. At first, d2 is the best-merge cluster for d3. But after merging d1 and d2, d4 becomes d3's best-merge candidate. In a best-merge persistent algorithm like single-link, d3's best-merge cluster would be {d1, d2}.
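The single-link algorithm of Figure 17.9 can be sketched in Python as follows. This is my own rendering, not the book's code: `nbm` stores, for each cluster representative, the index of its recorded best merge, and the constant-time `max` update of line 14 is what best-merge persistence licenses (a stale `nbm` entry still carries the correct best similarity until its cluster is itself merged).

```python
def single_link(sim):
    """Theta(N^2) single-link HAC with a next-best-merge (NBM) array,
    along the lines of Figure 17.9 (a sketch, not the book's exact code)."""
    n = len(sim)
    C = [row[:] for row in sim]
    rep = list(range(n))            # I[i]: representative of i's cluster
    nbm = [max((j for j in range(n) if j != i), key=lambda j: C[i][j])
           for i in range(n)]       # best merge for each singleton
    merges = []
    for _ in range(n - 1):
        # pick the representative whose recorded best merge is most similar
        i1 = max((i for i in range(n) if rep[i] == i),
                 key=lambda i: C[i][nbm[i]])
        i2 = rep[nbm[i1]]           # resolve a possibly stale index
        merges.append((i1, i2))
        for i in range(n):
            if rep[i] == i and i not in (i1, i2):
                # the single-link update: a simple max of two matrix entries
                C[i1][i] = C[i][i1] = max(C[i][i1], C[i][i2])
            if rep[i] == i2:
                rep[i] = i1         # i2's documents now belong to i1
        # recompute the merged cluster's best merge among the survivors
        nbm[i1] = max((i for i in range(n) if rep[i] == i and i != i1),
                      key=lambda i: C[i1][i], default=i1)
    return merges

sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.3, 0.1],
       [0.1, 0.3, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
print(single_link(sim))   # [(0, 1), (2, 3), (0, 2)]
```

Only one NBM entry is recomputed per merge; all others remain valid, which is exactly where the Θ(N²) bound comes from.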
Can we also speed up the other three HAC algorithms with an NBM array? We cannot because only single-link clustering is best-merge persistent. Suppose that the best merge cluster for ω_k is ω_j in single-link clustering. Then after merging ω_j with a third cluster ω_i ≠ ω_k, the merge of ω_i and ω_j will be ω_k's best merge cluster (Exercise 17.6). In other words, the best-merge candidate for the merged cluster is one of the two best-merge candidates of its components in single-link clustering. This means that C can be updated in Θ(N) in each iteration, by taking a simple max of two values on line 14 in Figure 17.9 for each of the remaining N clusters.

Figure 17.10 demonstrates that best-merge persistence does not hold for complete-link clustering, which means that we cannot use an NBM array to speed up clustering. After merging d3's best merge candidate d2 with cluster d1, an unrelated cluster d4 becomes the best merge candidate for d3. This is because the complete-link merge criterion is non-local and can be affected by points at a great distance from the area where two merge candidates meet.

In practice, the efficiency penalty of the Θ(N² log N) algorithm is small compared with the Θ(N²) single-link algorithm since computing the similarity between two documents (e.g., as a dot product) is an order of magnitude slower than comparing two scalars in sorting. All four HAC algorithms in this chapter are Θ(N²) with respect to similarity computations. So the difference in complexity is rarely a concern in practice when choosing one of the algorithms.

Exercise 17.1  Show that complete-link clustering creates the two-cluster clustering depicted in Figure 17.7.

17.3 Group-average agglomerative clustering

Group-average agglomerative clustering or GAAC (see Figure 17.3, (d)) evaluates cluster quality based on all similarities between documents, thus avoiding the pitfalls of the single-link and complete-link criteria, which equate cluster
similarity with the similarity of a single pair of documents. GAAC is also called group-average clustering and average-link clustering. GAAC computes the average similarity SIM-GA of all pairs of documents, including pairs from the same cluster. But self-similarities are not included in the average:

SIM-GA(ω_i, ω_j) = (1/((N_i + N_j)(N_i + N_j − 1))) Σ_{d_m ∈ ω_i ∪ ω_j} Σ_{d_n ∈ ω_i ∪ ω_j, d_n ≠ d_m} d⃗_m · d⃗_n    (17.1)

where d⃗ is the length-normalized vector of document d, · denotes the dot product, and N_i and N_j are the number of documents in ω_i and ω_j, respectively.

The motivation for GAAC is that our goal in selecting two clusters ω_i and ω_j as the next merge in HAC is that the resulting merge cluster ω_k = ω_i ∪ ω_j should be coherent. To judge the coherence of ω_k, we need to look at all document–document similarities within ω_k, including those that occur within ω_i and those that occur within ω_j.

We can compute the measure SIM-GA efficiently because the sum of individual vector similarities is equal to the similarities of their sums:

Σ_{d_m ∈ ω_i} Σ_{d_n ∈ ω_j} (d⃗_m · d⃗_n) = (Σ_{d_m ∈ ω_i} d⃗_m) · (Σ_{d_n ∈ ω_j} d⃗_n)    (17.2)

With (17.2), we have:

SIM-GA(ω_i, ω_j) = (1/((N_i + N_j)(N_i + N_j − 1))) [(Σ_{d_m ∈ ω_i ∪ ω_j} d⃗_m)² − (N_i + N_j)]    (17.3)

The term (N_i + N_j) on the right is the sum of N_i + N_j self-similarities of value 1.0. With this trick we can compute cluster similarity in constant time (assuming we have available the two vector sums Σ_{d_m ∈ ω_i} d⃗_m and Σ_{d_m ∈ ω_j} d⃗_m) instead of in Θ(N_i N_j). This is important because we need to be able to compute the function SIM on lines 18 and 20 in EFFICIENTHAC (Figure 17.8) in constant time for efficient implementations of GAAC. Note that for two singleton clusters, Equation (17.3) is equivalent to the dot product.

Equation (17.2) relies on the distributivity of the dot product with respect to vector addition. Since this is crucial for the efficient computation of a GAAC clustering, the method cannot be easily applied to representations of documents that are not real-valued vectors. Also, Equation (17.2) only holds for the dot product. While many algorithms introduced in this book have near-equivalent descriptions in terms of dot product, cosine similarity and Euclidean distance (cf. Section 14.1, page 291), Equation (17.2
) can only be expressed using the dot product. This is a fundamental difference between single-link/complete-link clustering and GAAC. The first two only require a square matrix of similarities as input and do not care how these similarities were computed.

To summarize, GAAC requires (i) documents represented as vectors, (ii) length normalization of vectors, so that self-similarities are 1.0, and (iii) the dot product as the measure of similarity between vectors and sums of vectors.

The merge algorithms for GAAC and complete-link clustering are the same except that we use Equation (17.3) as similarity function in Figure 17.8. Therefore, the overall time complexity of GAAC is the same as for complete-link clustering: Θ(N² log N). Like complete-link clustering, GAAC is not best-merge persistent (Exercise 17.6). This means that there is no Θ(N²) algorithm for GAAC that would be analogous to the Θ(N²) algorithm for single-link in Figure 17.9.

We can also define group-average similarity as including self-similarities:

SIM-GA′(ω_i, ω_j) = (1/(N_i + N_j)²) (Σ_{d_m ∈ ω_i ∪ ω_j} d⃗_m)² = (1/(N_i + N_j)) Σ_{d_m ∈ ω_i ∪ ω_j} [d⃗_m · μ⃗(ω_i ∪ ω_j)]    (17.4)

where the centroid μ⃗(ω) is defined as in Equation (14.1) (page 292). This definition is equivalent to the intuitive definition of cluster quality as average similarity of documents d⃗_m to the cluster's centroid μ⃗.

Self-similarities are always equal to 1.0, the maximum possible value for length-normalized vectors. The proportion of self-similarities in Equation (17.4) is i/i² = 1/i for a cluster of size i. This gives an unfair advantage to small clusters since they will have proportionally more self-similarities. For two documents d1, d2 with a similarity s, we have SIM-GA′(d1, d2) = (1 + s)/2. In contrast, SIM-GA(d1, d2) = s ≤ (1 + s)/2. This similarity SIM-GA(d1, d2) of two documents is the same as in single-link, complete-link and centroid clustering. We prefer the definition in Equation (17.3), which excludes self-similarities from the average, because we do not want to penalize large clusters for their smaller proportion of self-similarities and because we want a consistent similarity value for document pairs in all four HAC algorithms.

Exercise 17.2  Apply group-average clustering to the points in Figures 17.6 and 17.7
. Map them onto the surface of the unit sphere in a three-dimensional space to get length-normalized vectors. Is the group-average clustering different from the single-link and complete-link clusterings?

17.4 Centroid clustering

Figure 17.11  Three iterations of centroid clustering. Each iteration merges the two clusters whose centroids are closest.

In centroid clustering, the similarity of two clusters is defined as the similarity of their centroids:

SIM-CENT(ω_i, ω_j) = μ⃗(ω_i) · μ⃗(ω_j)    (17.5)
  = ((1/N_i) Σ_{d_m ∈ ω_i} d⃗_m) · ((1/N_j) Σ_{d_n ∈ ω_j} d⃗_n)
  = (1/(N_i N_j)) Σ_{d_m ∈ ω_i} Σ_{d_n ∈ ω_j} d⃗_m · d⃗_n    (17.6)

Equation (17.5) is centroid similarity. Equation (17.6) shows that centroid similarity is equivalent to average similarity of all pairs of documents from different clusters. Thus, the difference between GAAC and centroid clustering is that GAAC considers all pairs of documents in computing average pairwise similarity (Figure 17.3, (d)) whereas centroid clustering excludes pairs from the same cluster (Figure 17.3, (c)).

Figure 17.11 shows the first three steps of a centroid clustering. The first two iterations form the clusters {d5, d6} with centroid μ1 and {d1, d2} with centroid μ2 because the pairs ⟨d5, d6⟩ and ⟨d1, d2⟩ have the highest centroid similarities. In the third iteration, the highest centroid similarity is between μ1 and d4, producing the cluster {d4, d5, d6} with centroid μ3.

Like GAAC, centroid clustering is not best-merge persistent and therefore Θ(N² log N) (Exercise 17.6).

Figure 17.12  Centroid clustering is not monotonic. The documents d1 at (1 + ε, 1), d2 at (5, 1), and d3 at (3, 1 + 2√3) are almost equidistant, with d1 and d2 closer to each other than to d3. The non-monotonic inversion in the hierarchical clustering of the three points appears as an intersecting merge line in the dendrogram. The intersection is circled.

In contrast to the other three HAC algorithms, centroid clustering is not monotonic. So-called inversions can occur: similarity can increase during
clustering as in the example in Figure 17.12, where we define similarity as negative distance. In the first merge, the similarity of d1 and d2 is −(4 − ε). In the second merge, the similarity of the centroid of d1 and d2 (the circle) and d3 is −cos(π/6) · 4 = −(√3/2) · 4 ≈ −3.46
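The two combination similarities of Figure 17.12 can be checked numerically. The following sketch (my own, with ε fixed at 0.1 for concreteness) computes them and confirms the inversion, i.e. a violation of the monotonicity condition s1 ≥ s2:

```python
from math import dist, sqrt

eps = 0.1
d1, d2, d3 = (1 + eps, 1), (5, 1), (3, 1 + 2 * sqrt(3))

# first merge: similarity (negative distance) of d1 and d2 is -(4 - eps)
s1 = -dist(d1, d2)
# second merge: similarity of the d1-d2 centroid (the circle) and d3
centroid = ((d1[0] + d2[0]) / 2, (d1[1] + d2[1]) / 2)
s2 = -dist(centroid, d3)

print(round(s1, 2), round(s2, 2))   # -3.9 -3.46
print(s2 > s1)                      # True: the second merge is MORE similar
```

Since s2 > s1, the second merge line in the dendrogram lies above the first, which is exactly the intersecting merge line circled in Figure 17.12.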