17 Hierarchical clustering

Flat clustering is efficient and conceptually simple, but as we saw in Chapter 16 it has a number of drawbacks. The algorithms introduced in Chapter 16 return a flat unstructured set of clusters, require a prespecified number of clusters as input and are nondeterministic. Hierarchical clustering (or hierarchic clustering) outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by flat clustering.[1] Hierarchical clustering does not require us to prespecify the number of clusters and most hierarchical algorithms that have been used in IR are deterministic. These advantages of hierarchical clustering come at the cost of lower efficiency. The most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents compared to the linear complexity of K-means and EM (cf. Section 16.4, page 364).

[1] In this chapter, we only consider hierarchies that are binary trees like the one shown in Figure 17.1, but hierarchical clustering can be easily extended to other types of trees.

This chapter first introduces agglomerative hierarchical clustering (Section 17.1) and presents four different agglomerative algorithms, in Sections 17.2–17.4, which differ in the similarity measures they employ: single-link, complete-link, group-average, and centroid similarity. We then discuss the optimality conditions of hierarchical clustering in Section 17.5. Section 17.6 introduces top-down (or divisive) hierarchical clustering. Section 17.7 looks at labeling clusters automatically, a problem that must be solved whenever humans interact with the output of clustering. We discuss implementation issues in Section 17.8. Section 17.9 provides pointers to further reading, including references to soft hierarchical clustering, which we do not cover in this book.

There are few differences between the applications of flat and hierarchical clustering in information retrieval. In particular, hierarchical clustering is appropriate for any of the applications shown in Table 16.1 (page 351; see also Section 16.6, page 372). In fact, the example we gave for collection clustering is hierarchical. In general, we select flat clustering when efficiency is important and hierarchical clustering when one of the potential problems of flat clustering (not enough structure, predetermined number of clusters, non-determinism) is a concern.
In addition, many researchers believe that hierarchical clustering produces better clusters than flat clustering. However, there is no consensus on this issue (see references in Section 17.9).

17.1 Hierarchical agglomerative clustering

Hierarchical clustering algorithms are either top-down or bottom-up. Bottom-up algorithms treat each document as a singleton cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all documents. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering or HAC. Top-down clustering requires a method for splitting a cluster. It proceeds by splitting clusters recursively until individual documents are reached. See Section 17.6. HAC is more frequently used in IR than top-down clustering and is the main subject of this chapter. Before looking at specific similarity measures used in HAC in Sections 17.2–17.4, we first introduce a method for depicting hierarchical clusterings graphically, discuss a few key properties of HACs and present a simple algorithm for computing an HAC.

An HAC clustering is typically visualized as a dendrogram as shown in Figure 17.1. Each merge is represented by a horizontal line. The y-coordinate of the horizontal line is the similarity of the two clusters that were merged, where documents are viewed as singleton clusters. We call this similarity the combination similarity of the merged cluster. For example, the combination similarity of the cluster consisting of Lloyd's CEO questioned and Lloyd's chief / U.S. grilling in Figure 17.1 is 0.56. We define the combination similarity of a singleton cluster as its document's self-similarity (which is 1.0 for cosine similarity).

[Figure 17.1: A dendrogram of a single-link clustering of 30 documents from Reuters-RCV1, with news titles such as Ag trade reform, War hero Colin Powell, Fed holds interest rates steady and Mexican markets as leaves. Two possible cuts of the dendrogram are shown: at 0.4 into 24 clusters and at 0.1 into 12 clusters.]

By moving up from the bottom layer to the top node, a dendrogram allows us to reconstruct the history of merges that resulted in the depicted clustering. For example, we see that the two documents entitled War hero Colin Powell were merged first in Figure 17.1 and that the last merge added Ag trade reform to a cluster consisting of the other 29 documents.

A fundamental assumption in HAC is that the merge operation is monotonic. Monotonic means that if s_1, s_2, ..., s_{K-1} are the combination similarities of the successive merges of an HAC, then s_1 ≥ s_2 ≥ ... ≥ s_{K-1} holds. A non-monotonic hierarchical clustering contains at least one inversion s_i < s_{i+1} and contradicts the fundamental assumption that we chose the best merge available at each step. We will see an example of an inversion in Figure 17.12.

Hierarchical clustering does not require a prespecified number of clusters. However, in some applications we want a partition of disjoint clusters just as in flat clustering.
In those cases, the hierarchy needs to be cut at some point. A number of criteria can be used to determine the cutting point:

• Cut at a prespecified level of similarity. For example, we cut the dendrogram at 0.4 if we want clusters with a minimum combination similarity of 0.4. In Figure 17.1, cutting the diagram at y = 0.4 yields 24 clusters (grouping only documents with high similarity together) and cutting it at y = 0.1 yields 12 clusters (one large financial news cluster and 11 smaller clusters).

• Cut the dendrogram where the gap between two successive combination similarities is largest. Such large gaps arguably indicate "natural" clusterings. Adding one more cluster decreases the quality of the clustering significantly, so cutting before this steep decrease occurs is desirable. This strategy is analogous to looking for the knee in the K-means graph in Figure 16.8 (page 366).

• Apply Equation (16.11) (page 366): K = \arg\min_{K'} [\text{RSS}(K') + \lambda K'], where K' refers to the cut of the hierarchy that results in K' clusters, RSS is the residual sum of squares and λ is a penalty for each additional cluster. Instead of RSS, another measure of distortion can be used.

• As in flat clustering, we can also prespecify the number of clusters K and select the cutting point that produces K clusters.

A simple, naive HAC algorithm is shown in Figure 17.2. We first compute the N × N similarity matrix C. The algorithm then executes N − 1 steps of merging the currently most similar clusters. In each iteration, the two most similar clusters are merged and the rows and columns of the merged cluster i in C are updated.[2] The clustering is stored as a list of merges in A. I indicates which clusters are still available to be merged. The function SIM(i, m, j) computes the similarity of cluster j with the merge of clusters i and m. For some HAC algorithms, SIM(i, m, j) is simply a function of C[j][i] and C[j][m], for example, the maximum of these two values for single-link.

Figure 17.2 (a simple, but inefficient HAC algorithm):

    SIMPLEHAC(d_1, ..., d_N)
     1  for n ← 1 to N
     2  do for i ← 1 to N
     3     do C[n][i] ← SIM(d_n, d_i)
     4     I[n] ← 1                (keeps track of active clusters)
     5  A ← []                     (assembles clustering as a sequence of merges)
     6  for k ← 1 to N − 1
     7  do ⟨i, m⟩ ← argmax_{⟨i,m⟩: i≠m, I[i]=1, I[m]=1} C[i][m]
     8     A.APPEND(⟨i, m⟩)        (store merge)
     9     for j ← 1 to N
    10     do C[i][j] ← SIM(i, m, j)
    11        C[j][i] ← SIM(i, m, j)
    12     I[m] ← 0                (deactivate cluster)
    13  return A

[2] We assume that we use a deterministic method for breaking ties, such as always choose the merge that is the first cluster with respect to a total ordering of the subsets of the document set D.

We will now refine this algorithm for the different similarity measures of single-link and complete-link clustering (Section 17.2) and group-average and centroid clustering (Sections 17.3 and 17.4). The merge criteria of these four variants of HAC are shown in Figure 17.3.
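To make Figure 17.2 concrete, here is a minimal runnable Python sketch of the naive algorithm. It is not the book's code: the dense numpy similarity matrix, the linkage parameter (which anticipates the single-link and complete-link refinements of Section 17.2) and the cut_at helper for the threshold criterion above are illustrative assumptions.

    import numpy as np

    def simple_hac(C, linkage="single"):
        """Naive Theta(N^3) HAC over a precomputed N x N similarity matrix C.

        Returns the merges as (i, m, sim) triples; cluster i represents
        the merged cluster, as in Figure 17.2.
        """
        C = C.astype(float).copy()
        N = len(C)
        np.fill_diagonal(C, -np.inf)          # never merge a cluster with itself
        active = np.ones(N, dtype=bool)
        combine = np.maximum if linkage == "single" else np.minimum
        merges = []
        for _ in range(N - 1):
            # exhaustively scan C for the most similar pair of active clusters
            mask = np.outer(active, active)
            np.fill_diagonal(mask, False)
            i, m = np.unravel_index(np.where(mask, C, -np.inf).argmax(), C.shape)
            merges.append((int(i), int(m), float(C[i, m])))
            C[i, :] = combine(C[i, :], C[m, :])   # row i now stands for i-union-m
            C[:, i] = C[i, :]
            C[i, i] = -np.inf
            active[m] = False                     # deactivate cluster m
        return merges

    def cut_at(merges, N, threshold):
        """First cutting criterion above: apply only merges with sim >= threshold."""
        parent = list(range(N))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for i, m, sim in merges:
            if sim >= threshold:
                parent[find(m)] = find(i)
        groups = {}
        for d in range(N):
            groups.setdefault(find(d), []).append(d)
        return list(groups.values())

For length-normalized document vectors docs, cut_at(simple_hac(docs @ docs.T), len(docs), 0.4) then corresponds to cutting a dendrogram like the one in Figure 17.1 at y = 0.4.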
[Figure 17.3: The different notions of cluster similarity used by the four HAC algorithms: (a) single-link: maximum similarity; (b) complete-link: minimum similarity; (c) centroid: average inter-similarity; (d) group-average: average of all similarities. An inter-similarity is a similarity between two documents from different clusters.]

[Figure 17.4: A single-link (left) and complete-link (right) clustering of eight documents d_1, ..., d_8. The ellipses correspond to successive clustering stages. Left: The single-link similarity of the two upper two-point clusters is the similarity of d_2 and d_3 (solid line), which is greater than the single-link similarity of the two left two-point clusters (dashed line). Right: The complete-link similarity of the two upper two-point clusters is the similarity of d_1 and d_4 (dashed line), which is smaller than the complete-link similarity of the two left two-point clusters (solid line).]

17.2 Single-link and complete-link clustering

In single-link clustering or single-linkage clustering, the similarity of two clusters is the similarity of their most similar members (see Figure 17.3, (a)).[3] This single-link merge criterion is local. We pay attention solely to the area where the two clusters come closest to each other. Other, more distant parts of the cluster and the clusters' overall structure are not taken into account.

In complete-link clustering or complete-linkage clustering, the similarity of two clusters is the similarity of their most dissimilar members (see Figure 17.3, (b)). This is equivalent to choosing the cluster pair whose merge has the smallest diameter. This complete-link merge criterion is non-local; the entire structure of the clustering can influence merge decisions. This results in a preference for compact clusters with small diameters over long, straggly clusters, but also causes sensitivity to outliers. A single document far from the center can increase diameters of candidate merge clusters dramatically and completely change the final clustering.

Figure 17.4 depicts a single-link and a complete-link clustering of eight documents. The first four steps, each producing a cluster consisting of a pair of two documents, are identical. Then single-link clustering joins the upper two pairs (and after that the lower two pairs) because on the maximum-similarity definition of cluster similarity, those two clusters are closest.

[3] Throughout this chapter, we equate similarity with proximity in 2D depictions of clustering.
Complete-link clustering joins the left two pairs (and then the right two pairs) because those are the closest pairs according to the minimum-similarity definition of cluster similarity.[4]

[4] If you are bothered by the possibility of ties, assume that d_1 has coordinates (1 + ε, 3 − ε) and that all other points have integer coordinates.

Figure 17.1 is an example of a single-link clustering of a set of documents and Figure 17.5 is the complete-link clustering of the same set. When cutting the last merge in Figure 17.5, we obtain two clusters of similar size (documents 1–16, from NYSE closing averages to Lloyd's chief / U.S. grilling, and documents 17–30, from Ohio Blue Cross to Clinton signs law). There is no cut of the dendrogram in Figure 17.1 that would give us an equally balanced clustering.

[Figure 17.5: A dendrogram of a complete-link clustering of the same 30 documents that were clustered with single-link clustering in Figure 17.1.]

Both single-link and complete-link clustering have graph-theoretic interpretations. Define s_k to be the combination similarity of the two clusters merged in step k, and G(s_k) the graph that links all data points with a similarity of at least s_k. Then the clusters after step k in single-link clustering are the connected components of G(s_k) and the clusters after step k in complete-link clustering are maximal cliques of G(s_k). A connected component is a maximal set of connected points such that there is a path connecting each pair. A clique is a set of points that are completely linked with each other.

These graph-theoretic interpretations motivate the terms single-link and complete-link clustering. Single-link clusters at step k are maximal sets of points that are linked via at least one link (a single link) of similarity s ≥ s_k; complete-link clusters at step k are maximal sets of points that are completely linked with each other via links of similarity s ≥ s_k.

Single-link and complete-link clustering reduce the assessment of cluster quality to a single similarity between a pair of documents: the two most similar documents in single-link clustering and the two most dissimilar documents in complete-link clustering. A measurement based on one pair cannot fully reflect the distribution of documents in a cluster. It is therefore not surprising that both algorithms often produce undesirable clusters. Single-link clustering can produce straggling clusters as shown in Figure 17.6. Since the merge criterion is strictly local, a chain of points can be extended for long distances without regard to the overall shape of the emerging cluster. This effect is called chaining.

[Figure 17.6: Chaining in single-link clustering. The local criterion in single-link clustering can cause undesirable elongated clusters.]
The chaining effect is also apparent in Figure 17.1. The last eleven merges of the single-link clustering (those above the 0.1 line) add on single documents or pairs of documents, corresponding to a chain. The complete-link clustering in Figure 17.5 avoids this problem. Documents are split into two groups of roughly equal size when we cut the dendrogram at the last merge. In general, this is a more useful organization of the data than a clustering with chains.

However, complete-link clustering suffers from a different problem. It pays too much attention to outliers, points that do not fit well into the global structure of the cluster. In the example in Figure 17.7 the four documents d_2, d_3, d_4, d_5 are split because of the outlier d_1 at the left edge (Exercise 17.1). Complete-link clustering does not find the most intuitive cluster structure in this example.

[Figure 17.7: Outliers in complete-link clustering. The five documents have the x-coordinates 1 + 2ε, 4, 5 + 2ε, 6 and 7 − ε. Complete-link clustering creates the two clusters shown as ellipses. The most intuitive two-cluster clustering is {{d_1}, {d_2, d_3, d_4, d_5}}, but in complete-link clustering, the outlier d_1 splits {d_2, d_3, d_4, d_5} as shown.]

17.2.1 Time complexity of HAC

The complexity of the naive HAC algorithm in Figure 17.2 is Θ(N³) because we exhaustively scan the N × N matrix C for the largest similarity in each of N − 1 iterations.

For the four HAC methods discussed in this chapter a more efficient algorithm is the priority-queue algorithm shown in Figure 17.8. Its time complexity is Θ(N² log N). The rows C[k] of the N × N similarity matrix C are sorted in decreasing order of similarity in the priority queues P. P[k].MAX() then returns the cluster in P[k] that currently has the highest similarity with ω_k, where we use ω_k to denote the kth cluster as in Chapter 16. After creating the merged cluster of ω_{k1} and ω_{k2}, ω_{k1} is used as its representative. The function SIM computes the similarity function for potential merge pairs: largest similarity for single-link, smallest similarity for complete-link, average similarity for GAAC (Section 17.3), and centroid similarity for centroid clustering (Section 17.4).

Figure 17.8 (the priority-queue algorithm for HAC). Top: the algorithm.

    EFFICIENTHAC(d_1, ..., d_N)
     1  for n ← 1 to N
     2  do for i ← 1 to N
     3     do C[n][i].sim ← d_n · d_i
     4        C[n][i].index ← i
     5     I[n] ← 1
     6     P[n] ← priority queue for C[n] sorted on sim
     7     P[n].DELETE(C[n][n])   (don't want self-similarities)
     8  A ← []
     9  for k ← 1 to N − 1
    10  do k1 ← argmax_{k: I[k]=1} P[k].MAX().sim
    11     k2 ← P[k1].MAX().index
    12     A.APPEND(⟨k1, k2⟩)
    13     I[k2] ← 0
    14     P[k1] ← []
    15     for each i with I[i] = 1 and i ≠ k1
    16     do P[i].DELETE(C[i][k1])
    17        P[i].DELETE(C[i][k2])
    18        C[i][k1].sim ← SIM(i, k1, k2)
    19        P[i].INSERT(C[i][k1])
    20        C[k1][i].sim ← SIM(i, k1, k2)
    21        P[k1].INSERT(C[k1][i])
    22  return A

Center: four different similarity measures (v_m and v_i are vector sums, defined below).

    clustering algorithm   SIM(i, k1, k2)
    single-link            max(SIM(i, k1), SIM(i, k2))
    complete-link          min(SIM(i, k1), SIM(i, k2))
    centroid               ((1/N_m) v_m) · ((1/N_i) v_i)
    group-average          [(v_m + v_i)² − (N_m + N_i)] / [(N_m + N_i)(N_m + N_i − 1)]

Bottom: a made-up example of the processing in steps 6 and 16–19, showing P[5] for a 5 × 5 matrix C.

    compute C[5]:                          1: 0.2   2: 0.8   3: 0.6   4: 0.4   5: 1.0
    create P[5] (by sorting):              2: 0.8   3: 0.6   4: 0.4   1: 0.2
    merge 2 and 3, update similarity
    of 2, delete 3:                        2: 0.3   4: 0.4   1: 0.2
    delete and reinsert 2:                 4: 0.4   2: 0.3   1: 0.2
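As one possible realization of Figure 17.8 in Python (a sketch, not the book's implementation): Python's heapq module provides no DELETE operation, so this version substitutes the standard lazy-deletion trick, in which stale entries are simply skipped when they surface; heapq is also a min-heap, so similarities are negated.

    import heapq
    import numpy as np

    def efficient_hac(C, linkage="single"):
        """Priority-queue HAC over an N x N similarity matrix C.

        Follows the outline of EFFICIENTHAC (Figure 17.8), but replaces
        P[i].DELETE with lazy deletion: each heap entry carries the
        similarity it was pushed with and is discarded if it is stale.
        """
        N = len(C)
        sim = {(i, j): float(C[i][j]) for i in range(N) for j in range(N) if i != j}
        # one min-heap per cluster; entries are (-similarity, index)
        P = [[(-sim[(n, i)], i) for i in range(N) if i != n] for n in range(N)]
        for heap in P:
            heapq.heapify(heap)
        active = set(range(N))
        combine = max if linkage == "single" else min
        A = []
        for _ in range(N - 1):
            # find the best merge among active clusters (line 10),
            # popping entries that are deactivated or outdated
            best = None
            for k in active:
                while P[k] and (P[k][0][1] not in active
                                or -P[k][0][0] != sim[(k, P[k][0][1])]):
                    heapq.heappop(P[k])
                if P[k] and (best is None or -P[k][0][0] > best[0]):
                    best = (-P[k][0][0], k, P[k][0][1])
            s, k1, k2 = best
            A.append((k1, k2, s))
            active.remove(k2)            # k1 represents the merged cluster
            for i in active - {k1}:
                s_new = combine(sim[(i, k1)], sim[(i, k2)])
                sim[(i, k1)] = sim[(k1, i)] = s_new
                heapq.heappush(P[i], (-s_new, k1))   # reinsert updated entries
                heapq.heappush(P[k1], (-s_new, i))
        return A

The linear scan over active clusters mirrors the argmax on line 10, and lazy deletion preserves the Θ(N² log N) bound because every heap entry is pushed and popped at most once.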
We give an example of how a row of C is processed (Figure 17.8, bottom panel). The loop in lines 1–7 is Θ(N²) and the loop in lines 9–21 is Θ(N² log N) for an implementation of priority queues that supports deletion and insertion in Θ(log N). The overall complexity of the algorithm is therefore Θ(N² log N). In the definition of the function SIM, v_m and v_i are the vector sums of ω_{k1} ∪ ω_{k2} and ω_i, respectively, and N_m and N_i are the number of documents in ω_{k1} ∪ ω_{k2} and ω_i, respectively.

The argument of EFFICIENTHAC in Figure 17.8 is a set of vectors (as opposed to a set of generic documents) because GAAC and centroid clustering (Sections 17.3 and 17.4) require vectors as input. The complete-link version of EFFICIENTHAC can also be applied to documents that are not represented as vectors.

For single-link, we can introduce a next-best-merge array (NBM) as a further optimization as shown in Figure 17.9. NBM keeps track of what the best merge is for each cluster. Each of the two top-level for-loops in Figure 17.9 is Θ(N²), thus the overall complexity of single-link clustering is Θ(N²).

Figure 17.9 (single-link clustering algorithm using an NBM array):

    SINGLELINKCLUSTERING(d_1, ..., d_N)
     1  for n ← 1 to N
     2  do for i ← 1 to N
     3     do C[n][i].sim ← SIM(d_n, d_i)
     4        C[n][i].index ← i
     5     I[n] ← n
     6     NBM[n] ← argmax_{X ∈ {C[n][i]: n ≠ i}} X.sim
     7  A ← []
     8  for n ← 1 to N − 1
     9  do i1 ← argmax_{i: I[i]=i} NBM[i].sim
    10     i2 ← I[NBM[i1].index]
    11     A.APPEND(⟨i1, i2⟩)
    12     for i ← 1 to N
    13     do if I[i] = i and i ≠ i1 and i ≠ i2
    14        then C[i1][i].sim ← C[i][i1].sim ← max(C[i1][i].sim, C[i2][i].sim)
    15        if I[i] = i2
    16        then I[i] ← i1
    17     NBM[i1] ← argmax_{X ∈ {C[i1][i]: I[i]=i and i ≠ i1}} X.sim
    18  return A

After merging two clusters i1 and i2, the first one (i1) represents the merged cluster. If I[i] = i, then i is the representative of its current cluster. If I[i] ≠ i, then i has been merged into the cluster represented by I[i] and will therefore be ignored when updating NBM[i1].

Can we also speed up the other three HAC algorithms with an NBM array? We cannot because only single-link clustering is best-merge persistent. Suppose that the best merge cluster for ω_k is ω_j in single-link clustering. Then after merging ω_j with a third cluster ω_i ≠ ω_k, the merge of ω_i and ω_j will be ω_k's best merge cluster (Exercise 17.6). In other words, the best-merge candidate for the merged cluster is one of the two best-merge candidates of its components in single-link clustering. This means that C can be updated in Θ(N) in each iteration, by taking a simple max of two values on line 14 in Figure 17.9 for each of the remaining N clusters.
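A minimal Python rendering of Figure 17.9, assuming a precomputed similarity matrix (illustrative, not the book's code); rep plays the role of the array I:

    import numpy as np

    def single_link_nbm(C):
        """Theta(N^2) single-link clustering with a next-best-merge array.

        C is an N x N similarity matrix. Returns merges as (i1, i2) pairs,
        where i1 represents the merged cluster, as in Figure 17.9.
        """
        C = C.astype(float).copy()
        N = len(C)
        np.fill_diagonal(C, -np.inf)
        rep = list(range(N))            # rep[i] == i iff i represents its cluster
        nbm = list(C.argmax(axis=1))    # next best merge for each singleton
        A = []
        for _ in range(N - 1):
            reps = [i for i in range(N) if rep[i] == i]
            i1 = max(reps, key=lambda i: C[i][nbm[i]])   # line 9
            i2 = rep[nbm[i1]]                            # line 10
            A.append((i1, i2))
            for i in reps:
                if i not in (i1, i2):
                    # single-link update: a simple max of two values (line 14)
                    C[i1][i] = C[i][i1] = max(C[i1][i], C[i2][i])
            for i in range(N):
                if rep[i] == i2:        # lines 15-16: absorb i2's documents
                    rep[i] = i1
            rest = [i for i in range(N) if rep[i] == i and i != i1]
            if rest:                    # line 17: recompute NBM for i1 only
                nbm[i1] = max(rest, key=lambda i: C[i1][i])
        return A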
[Figure 17.10: Complete-link clustering is not best-merge persistent. At first, d_2 is the best-merge cluster for d_3. But after merging d_1 and d_2, d_4 becomes d_3's best-merge candidate. In a best-merge persistent algorithm like single-link, d_3's best-merge cluster would be {d_1, d_2}.]

Figure 17.10 demonstrates that best-merge persistence does not hold for complete-link clustering, which means that we cannot use an NBM array to speed up clustering. After merging d_3's best merge candidate d_2 with cluster d_1, an unrelated cluster d_4 becomes the best merge candidate for d_3. This is because the complete-link merge criterion is non-local and can be affected by points at a great distance from the area where two merge candidates meet.

In practice, the efficiency penalty of the Θ(N² log N) algorithm is small compared with the Θ(N²) single-link algorithm since computing the similarity between two documents (e.g., as a dot product) is an order of magnitude slower than comparing two scalars in sorting. All four HAC algorithms in this chapter are Θ(N²) with respect to similarity computations. So the difference in complexity is rarely a concern in practice when choosing one of the algorithms.

Exercise 17.1: Show that complete-link clustering creates the two-cluster clustering depicted in Figure 17.7.

17.3 Group-average agglomerative clustering

Group-average agglomerative clustering or GAAC (see Figure 17.3, (d)) evaluates cluster quality based on all similarities between documents, thus avoiding the pitfalls of the single-link and complete-link criteria, which equate cluster similarity with the similarity of a single pair of documents. GAAC is also called group-average clustering and average-link clustering. GAAC computes the average similarity SIM-GA of all pairs of documents, including pairs from the same cluster. But self-similarities are not included in the average:

    \text{SIM-GA}(\omega_i, \omega_j) = \frac{1}{(N_i+N_j)(N_i+N_j-1)} \sum_{d_m \in \omega_i \cup \omega_j} \sum_{d_n \in \omega_i \cup \omega_j,\, d_n \neq d_m} \vec{d}_m \cdot \vec{d}_n        (17.1)

where \vec{d} is the length-normalized vector of document d, · denotes the dot product, and N_i and N_j are the number of documents in ω_i and ω_j, respectively.

The motivation for GAAC is that our goal in selecting two clusters ω_i and ω_j as the next merge in HAC is that the resulting merge cluster ω_k = ω_i ∪ ω_j should be coherent. To judge the coherence of ω_k, we need to look at all document-document similarities within ω_k, including those that occur within ω_i and those that occur within ω_j.

We can compute the measure SIM-GA efficiently because the sum of individual vector similarities is equal to the similarities of their sums:

    \sum_{d_m \in \omega_i} \sum_{d_n \in \omega_j} (\vec{d}_m \cdot \vec{d}_n) = \Big(\sum_{d_m \in \omega_i} \vec{d}_m\Big) \cdot \Big(\sum_{d_n \in \omega_j} \vec{d}_n\Big)        (17.2)

With (17.2), we have:

    \text{SIM-GA}(\omega_i, \omega_j) = \frac{1}{(N_i+N_j)(N_i+N_j-1)} \Big[\Big(\sum_{d_m \in \omega_i \cup \omega_j} \vec{d}_m\Big)^2 - (N_i+N_j)\Big]        (17.3)

The term (N_i + N_j) on the right is the sum of N_i + N_j self-similarities of value 1.0. With this trick we can compute cluster similarity in constant time (assuming we have available the two vector sums Σ_{d_m ∈ ω_i} d_m and Σ_{d_m ∈ ω_j} d_m) instead of in Θ(N_i N_j). This is important because we need to be able to compute the function SIM on lines 18 and 20 in EFFICIENTHAC (Figure 17.8) in constant time for efficient implementations of GAAC. Note that for two singleton clusters, Equation (17.3) is equivalent to the dot product.

Equation (17.2) relies on the distributivity of the dot product with respect to vector addition. Since this is crucial for the efficient computation of a GAAC clustering, the method cannot be easily applied to representations of documents that are not real-valued vectors. Also, Equation (17.2) only holds for the dot product. While many algorithms introduced in this book have near-equivalent descriptions in terms of dot product, cosine similarity and Euclidean distance (cf. Section 14.1, page 291), Equation (17.2) can only be expressed using the dot product. This is a fundamental difference between single-link/complete-link clustering and GAAC. The first two only require a square matrix of similarities as input and do not care how these similarities were computed.

To summarize, GAAC requires (i) documents represented as vectors, (ii) length normalization of vectors, so that self-similarities are 1.0, and (iii) the dot product as the measure of similarity between vectors and sums of vectors.

The merge algorithms for GAAC and complete-link clustering are the same except that we use Equation (17.3) as similarity function in Figure 17.8. Therefore, the overall time complexity of GAAC is the same as for complete-link clustering: Θ(N² log N). Like complete-link clustering, GAAC is not best-merge persistent (Exercise 17.6). This means that there is no Θ(N²) algorithm for GAAC that would be analogous to the Θ(N²) algorithm for single-link in Figure 17.9.

We can also define group-average similarity as including self-similarities:

    \text{SIM-GA}'(\omega_i, \omega_j) = \frac{1}{(N_i+N_j)^2} \Big(\sum_{d_m \in \omega_i \cup \omega_j} \vec{d}_m\Big)^2 = \frac{1}{N_i+N_j} \sum_{d_m \in \omega_i \cup \omega_j} \big[\vec{d}_m \cdot \vec{\mu}(\omega_i \cup \omega_j)\big]        (17.4)

where the centroid μ(ω) is defined as in Equation (14.1) (page 292). This definition is equivalent to the intuitive definition of cluster quality as average similarity of documents d_m to the cluster's centroid μ.

Self-similarities are always equal to 1.0, the maximum possible value for length-normalized vectors. The proportion of self-similarities in Equation (17.4) is i/i² = 1/i for a cluster of size i. This gives an unfair advantage to small clusters since they will have proportionally more self-similarities. For two documents d_1, d_2 with a similarity s, we have SIM-GA′(d_1, d_2) = (1 + s)/2. In contrast, SIM-GA(d_1, d_2) = s ≤ (1 + s)/2. This similarity SIM-GA(d_1, d_2) of two documents is the same as in single-link, complete-link and centroid clustering. We prefer the definition in Equation (17.3), which excludes self-similarities from the average, because we do not want to penalize large clusters for their smaller proportion of self-similarities and because we want consistent similarity values for document pairs in all four HAC algorithms.
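The vector-sum trick is easy to sanity-check numerically. This small numpy snippet (not from the book) computes SIM-GA both by the all-pairs definition of Equation (17.1) and by the constant-time form of Equation (17.3) and confirms they agree:

    import numpy as np

    rng = np.random.default_rng(42)

    def normalize(X):
        return X / np.linalg.norm(X, axis=1, keepdims=True)

    omega_i = normalize(rng.normal(size=(4, 10)))   # cluster of 4 documents
    omega_j = normalize(rng.normal(size=(3, 10)))   # cluster of 3 documents
    union = np.vstack([omega_i, omega_j])
    n = len(union)

    # Equation (17.1): average over all ordered pairs d_m != d_n in the union
    S = union @ union.T
    sim_pairs = (S.sum() - np.trace(S)) / (n * (n - 1))

    # Equation (17.3): the same value from the vector sum alone
    v = union.sum(axis=0)
    sim_sum = (v @ v - n) / (n * (n - 1))

    assert np.isclose(sim_pairs, sim_sum)
    print(sim_pairs, sim_sum)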
Exercise 17.2: Apply group-average clustering to the points in Figures 17.6 and 17.7. Map them onto the surface of the unit sphere in a three-dimensional space to get length-normalized vectors. Is the group-average clustering different from the single-link and complete-link clusterings?

17.4 Centroid clustering

In centroid clustering, the similarity of two clusters is defined as the similarity of their centroids:

    \text{SIM-CENT}(\omega_i, \omega_j) = \vec{\mu}(\omega_i) \cdot \vec{\mu}(\omega_j)        (17.5)
    = \Big(\frac{1}{N_i}\sum_{d_m \in \omega_i} \vec{d}_m\Big) \cdot \Big(\frac{1}{N_j}\sum_{d_n \in \omega_j} \vec{d}_n\Big) = \frac{1}{N_i N_j} \sum_{d_m \in \omega_i} \sum_{d_n \in \omega_j} \vec{d}_m \cdot \vec{d}_n        (17.6)

Equation (17.5) is centroid similarity. Equation (17.6) shows that centroid similarity is equivalent to average similarity of all pairs of documents from different clusters. Thus, the difference between GAAC and centroid clustering is that GAAC considers all pairs of documents in computing average pairwise similarity (Figure 17.3, (d)) whereas centroid clustering excludes pairs from the same cluster (Figure 17.3, (c)).

[Figure 17.11: Three iterations of centroid clustering of six documents d_1, ..., d_6. Each iteration merges the two clusters whose centroids (μ_1, μ_2, μ_3) are closest.]

Figure 17.11 shows the first three steps of a centroid clustering. The first two iterations form the clusters {d_5, d_6} with centroid μ_1 and {d_1, d_2} with centroid μ_2 because the pairs ⟨d_5, d_6⟩ and ⟨d_1, d_2⟩ have the highest centroid similarities. In the third iteration, the highest centroid similarity is between μ_1 and d_4, producing the cluster {d_4, d_5, d_6} with centroid μ_3.

Like GAAC, centroid clustering is not best-merge persistent and therefore Θ(N² log N) (Exercise 17.6).

In contrast to the other three HAC algorithms, centroid clustering is not monotonic. So-called inversions can occur: similarity can increase during clustering as in the example in Figure 17.12, where we define similarity as negative distance. In the first merge, the similarity of d_1 and d_2 is −(4 − ε). In the second merge, the similarity of the centroid of d_1 and d_2 (the circle) and d_3 is −cos(π/6) × 4 = −(√3/2) × 4 ≈ −3.46 > −(4 − ε). This is an example of an inversion: similarity increases in this sequence of two clustering steps. In a monotonic HAC algorithm, similarity is monotonically decreasing from iteration to iteration.

[Figure 17.12: Centroid clustering is not monotonic. The documents d_1 at (1 + ε, 1), d_2 at (5, 1), and d_3 at (3, 1 + 2√3) are almost equidistant, with d_1 and d_2 closer to each other than to d_3. The non-monotonic inversion in the hierarchical clustering of the three points appears as an intersecting merge line in the dendrogram. The intersection is circled.]

Increasing similarity in a series of HAC clustering steps contradicts the fundamental assumption that small clusters are more coherent than large clusters. An inversion in a dendrogram shows up as a horizontal merge line that is lower than the previous merge line. All merge lines in Figures 17.1 and 17.5 are higher than their predecessors because single-link and complete-link clustering are monotonic clustering algorithms.

Despite its non-monotonicity, centroid clustering is often used because its similarity measure, the similarity of two centroids, is conceptually simpler than the average of all pairwise similarities in GAAC. Figure 17.11 is all one needs to understand centroid clustering. There is no equally simple graph that would explain how GAAC works.

Exercise 17.3: For a fixed set of N documents there are up to N² distinct similarities between clusters in single-link and complete-link clustering. How many distinct cluster similarities are there in GAAC and centroid clustering?
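The inversion of Figure 17.12 can be reproduced in a few lines of numpy (an illustrative check, not from the book), with similarity defined as negative Euclidean distance:

    import numpy as np

    eps = 0.01
    d1 = np.array([1 + eps, 1.0])
    d2 = np.array([5.0, 1.0])
    d3 = np.array([3.0, 1 + 2 * np.sqrt(3)])

    sim = lambda x, y: -np.linalg.norm(x - y)

    s_first = sim(d1, d2)                 # first merge: d1 with d2
    centroid = (d1 + d2) / 2
    s_second = sim(centroid, d3)          # second merge: {d1, d2} with d3

    print(s_first, s_second)              # approx -3.99 and -3.46
    assert s_second > s_first             # an inversion: similarity increased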
17.5 Optimality of HAC

To state the optimality conditions of hierarchical clustering precisely, we first define the combination similarity COMB-SIM of a clustering Ω = {ω_1, ..., ω_K} as the smallest combination similarity of any of its K clusters:

    \text{COMB-SIM}(\{\omega_1, \ldots, \omega_K\}) = \min_k \text{COMB-SIM}(\omega_k)

Recall that the combination similarity of a cluster ω that was created as the merge of ω_1 and ω_2 is the similarity of ω_1 and ω_2 (page 378).

We then define Ω = {ω_1, ..., ω_K} to be optimal if all clusterings Ω′ with k clusters, k ≤ K, have lower combination similarities:

    |\Omega'| \le |\Omega| \Rightarrow \text{COMB-SIM}(\Omega') \le \text{COMB-SIM}(\Omega)

Figure 17.12 shows that centroid clustering is not optimal. The clustering {{d_1, d_2}, {d_3}} (for K = 2) has combination similarity −(4 − ε) and {{d_1, d_2, d_3}} (for K = 1) has combination similarity −3.46. So the clustering {{d_1, d_2}, {d_3}} produced in the first merge is not optimal since there is a clustering with fewer clusters ({{d_1, d_2, d_3}}) that has higher combination similarity. Centroid clustering is not optimal because inversions can occur.

The above definition of optimality would be of limited use if it was only applicable to a clustering together with its merge history. However, we can show (Exercise 17.4) that combination similarity for the three non-inversion algorithms can be read off from the cluster without knowing its history. These direct definitions of combination similarity are as follows.

single-link: The combination similarity of a cluster ω is the smallest similarity of any bipartition of the cluster, where the similarity of a bipartition is the largest similarity between any two documents from the two parts:

    \text{COMB-SIM}(\omega) = \min_{\{\omega': \omega' \subset \omega\}} \max_{d_i \in \omega'} \max_{d_j \in \omega - \omega'} \text{SIM}(d_i, d_j)

where each ⟨ω′, ω − ω′⟩ is a bipartition of ω.

complete-link: The combination similarity of a cluster ω is the smallest similarity of any two points in ω: min_{d_i ∈ ω} min_{d_j ∈ ω} SIM(d_i, d_j).

GAAC: The combination similarity of a cluster ω is the average of all pairwise similarities in ω (where self-similarities are not included in the average): Equation (17.3).

If we use these definitions of combination similarity, then optimality is a property of a set of clusters and not of a process that produces a set of clusters.
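These direct definitions translate almost verbatim into code. The following illustrative Python sketch (not from the book) assumes a pairwise similarity matrix S and length-normalized document vectors docs; note that the single-link form enumerates all bipartitions and is therefore exponential, suitable only for tiny clusters:

    import numpy as np
    from itertools import combinations

    def comb_sim_complete_link(S, cluster):
        """Smallest similarity of any two points in the cluster."""
        return min(S[i][j] for i, j in combinations(cluster, 2))

    def comb_sim_gaac(docs, cluster):
        """Average pairwise similarity, self-similarities excluded (Eq. 17.3).

        Assumes the rows of docs are length-normalized vectors.
        """
        n = len(cluster)
        v = docs[list(cluster)].sum(axis=0)
        return (v @ v - n) / (n * (n - 1))

    def comb_sim_single_link(S, cluster):
        """Smallest bipartition similarity; exponential in cluster size."""
        cluster = list(cluster)
        best = float("inf")
        for r in range(1, len(cluster)):
            for part in combinations(cluster, r):
                rest = [d for d in cluster if d not in part]
                best = min(best, max(S[i][j] for i in part for j in rest))
        return best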
We can now prove the optimality of single-link clustering by induction over the number of clusters K. We will give a proof for the case where no two pairs of documents have the same similarity, but it can easily be extended to the case with ties.

The inductive basis of the proof is that a clustering with K = N clusters has combination similarity 1.0, which is the largest value possible. The induction hypothesis is that a single-link clustering Ω_K with K clusters is optimal: COMB-SIM(Ω_K) ≥ COMB-SIM(Ω′_K) for all Ω′_K.

Assume for contradiction that the clustering Ω_{K−1} we obtain by merging the two most similar clusters in Ω_K is not optimal and that instead a different sequence of merges Ω′_K, Ω′_{K−1} leads to the optimal clustering with K − 1 clusters. We can write the assumption that Ω′_{K−1} is optimal and that Ω_{K−1} is not as COMB-SIM(Ω′_{K−1}) > COMB-SIM(Ω_{K−1}).

Case 1: The two documents linked by s = COMB-SIM(Ω′_{K−1}) are in the same cluster in Ω_K. They can only be in the same cluster if a merge with similarity smaller than s has occurred in the merge sequence producing Ω_K. This implies s > COMB-SIM(Ω_K). Thus, COMB-SIM(Ω′_{K−1}) = s > COMB-SIM(Ω_K) ≥ COMB-SIM(Ω′_K) ≥ COMB-SIM(Ω′_{K−1}). Contradiction.

Case 2: The two documents linked by s = COMB-SIM(Ω′_{K−1}) are not in the same cluster in Ω_K. But s = COMB-SIM(Ω′_{K−1}) > COMB-SIM(Ω_{K−1}), so the single-link merging rule should have merged these two clusters when processing Ω_K. Contradiction.

Thus, Ω_{K−1} is optimal.

In contrast to single-link clustering, complete-link clustering and GAAC are not optimal as this example shows:

    [Four points d_1, d_2, d_3, d_4 on a line; d_2 and d_3 are distance 1 apart, the two outer gaps d_1–d_2 and d_3–d_4 are distance 3.]

Both algorithms merge the two points with distance 1 (d_2 and d_3) first and thus cannot find the two-cluster clustering {{d_1, d_2}, {d_3, d_4}}. But {{d_1, d_2}, {d_3, d_4}} is optimal on the optimality criteria of complete-link clustering and GAAC.

However, the merge criteria of complete-link clustering and GAAC approximate the desideratum of approximate sphericity better than the merge criterion of single-link clustering. In many applications, we want spherical clusters. Thus, even though single-link clustering may seem preferable at first because of its optimality, it is optimal with respect to the wrong criterion in many document clustering applications.

Table 17.1 summarizes the properties of the four HAC algorithms introduced in this chapter.

Table 17.1: Comparison of HAC algorithms.

    method         | combination similarity               | time compl.  | optimal? | comment
    single-link    | max inter-similarity of any 2 docs   | Θ(N²)        | yes      | chaining effect
    complete-link  | min inter-similarity of any 2 docs   | Θ(N² log N)  | no       | sensitive to outliers
    group-average  | average of all sims                  | Θ(N² log N)  | no       | best choice for most applications
    centroid       | average inter-similarity             | Θ(N² log N)  | no       | inversions can occur
We recommend GAAC for document clustering because it is generally the method that produces the clustering with the best properties for applications. It does not suffer from chaining, from sensitivity to outliers and from inversions.

There are two exceptions to this recommendation. First, for non-vector representations, GAAC is not applicable and clustering should typically be performed with the complete-link method.

Second, in some applications the purpose of clustering is not to create a complete hierarchy or exhaustive partition of the entire document set. For instance, first story detection or novelty detection is the task of detecting the first occurrence of an event in a stream of news stories. One approach to this task is to find a tight cluster within the documents that were sent across the wire in a short period of time and are dissimilar from all previous documents. For example, the documents sent over the wire in the minutes after the World Trade Center attack on September 11, 2001 form such a cluster. Variations of single-link clustering can do well on this task since it is the structure of small parts of the vector space, and not global structure, that is important in this case.

Similarly, we will describe an approach to duplicate detection on the web in Section 19.6 (page 440) where single-link clustering is used in the guise of the union-find algorithm. Again, the decision whether a group of documents are duplicates of each other is not influenced by documents that are located far away and single-link clustering is a good choice for duplicate detection.

Exercise 17.4: Show the equivalence of the two definitions of combination similarity: the process definition on page 378 and the static definition on page 393.

17.6 Divisive clustering

So far we have only looked at agglomerative clustering, but a cluster hierarchy can also be generated top-down. This variant of hierarchical clustering is called top-down clustering or divisive clustering. We start at the top with all documents in one cluster. The cluster is split using a flat clustering algorithm. This procedure is applied recursively until each document is in its own singleton cluster.

Top-down clustering is conceptually more complex than bottom-up clustering since we need a second, flat clustering algorithm as a "subroutine". It has the advantage of being more efficient if we do not generate a complete hierarchy all the way down to individual document leaves. For a fixed number of top levels, using an efficient flat algorithm like K-means, top-down algorithms are linear in the number of documents and clusters. So they run much faster than HAC algorithms, which are at least quadratic.

There is evidence that divisive algorithms produce more accurate hierarchies than bottom-up algorithms in some circumstances. See the references on bisecting K-means in Section 17.9. Bottom-up methods make clustering decisions based on local patterns without initially taking into account the global distribution. These early decisions cannot be undone. Top-down clustering benefits from complete information about the global distribution when making top-level partitioning decisions.
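As an illustration of the divisive scheme, the following sketch recursively bisects the largest remaining cluster with 2-means. It is not the book's algorithm: the split-the-largest-cluster heuristic and the use of scikit-learn's KMeans as the flat "subroutine" are assumptions of the sketch.

    import numpy as np
    from sklearn.cluster import KMeans   # assumed available; any flat
                                         # 2-way clusterer would do

    def bisecting_clustering(X, num_clusters):
        """Top-down clustering: repeatedly split the largest cluster in two."""
        clusters = [np.arange(len(X))]    # start with all docs in one cluster
        while len(clusters) < num_clusters:
            clusters.sort(key=len)
            biggest = clusters.pop()      # split the largest remaining cluster
            labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[biggest])
            clusters.append(biggest[labels == 0])
            clusters.append(biggest[labels == 1])
        return clusters

    if __name__ == "__main__":
        rng = np.random.default_rng(2)
        X = np.vstack([rng.normal(loc=c, size=(20, 5)) for c in (-3, 0, 3)])
        for cluster in bisecting_clustering(X, 3):
            print(len(cluster), cluster[:5])

Stopping the recursion after a fixed number of splits, rather than descending to singleton leaves, is what gives top-down clustering its efficiency advantage.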
17.7 Cluster labeling

In many applications of flat clustering and hierarchical clustering, particularly in analysis tasks and in user interfaces (see applications in Table 16.1, page 351), human users interact with clusters. In such settings, we must label clusters, so that users can see what a cluster is about.

Differential cluster labeling selects cluster labels by comparing the distribution of terms in one cluster with that of other clusters. The feature selection methods we introduced in Section 13.5 (page 271) can all be used for differential cluster labeling.[5] In particular, mutual information (MI) (Section 13.5.1, page 272) or, equivalently, information gain and the χ²-test (Section 13.5.2, page 275) will identify cluster labels that characterize one cluster in contrast to other clusters. A combination of a differential test with a penalty for rare terms often gives the best labeling results because rare terms are not necessarily representative of the cluster as a whole.

[5] Selecting the most frequent terms is a non-differential feature selection technique we discussed in Section 13.5. It can also be used for labeling clusters.

We apply three labeling methods to a K-means clustering in Table 17.2. In this example, there is almost no difference between MI and χ². We therefore omit the latter.

Table 17.2: Automatically computed cluster labels for three of ten clusters (4, 9, and 10) in a K-means clustering of the first 10,000 documents in Reuters-RCV1. The three labeling methods are: most highly weighted terms in the centroid (centroid), mutual information, and the title of the document closest to the centroid of the cluster (title). In the original figure, terms selected by only one of the first two methods are set in bold.

    cluster 4 (622 docs)
      centroid:            oil plant mexico production crude power 000 refinery gas bpd
      mutual information:  plant oil production barrels crude bpd mexico dolly capacity petroleum
      title:               MEXICO: Hurricane Dolly heads for Mexico coast
    cluster 9 (1017 docs)
      centroid:            police security russian people military peace killed told grozny court
      mutual information:  police killed military security peace told troops forces rebels people
      title:               RUSSIA: Russia's Lebed meets rebel chief in Chechnya
    cluster 10 (1259 docs)
      centroid:            00 000 tonnes traders futures wheat prices cents september tonne
      mutual information:  delivery traders futures tonne tonnes desk wheat prices 00 000
      title:               USA: Export Business - Grain/oilseeds complex

Cluster-internal labeling computes a label that solely depends on the cluster itself, not on other clusters. Labeling a cluster with the title of the document closest to the centroid is one cluster-internal method. Titles are easier to read than a list of terms. A full title can also contain important context that didn't make it into the top 10 terms selected by MI. On the web, anchor text can play a role similar to a title since the anchor text pointing to a page can serve as a concise summary of its contents.
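The centroid and title methods of Table 17.2 take only a few lines to sketch. This is illustrative code, not the book's; the tf-idf-style document-term matrix X, the vocabulary list vocab and the list of document titles are hypothetical names.

    import numpy as np

    def centroid_label(X, vocab, cluster, k=10):
        """Label a cluster with its k most highly weighted centroid terms."""
        centroid = X[cluster].mean(axis=0)
        top = np.argsort(centroid)[::-1][:k]
        return [vocab[t] for t in top]

    def title_label(X, titles, cluster):
        """Label a cluster with the title of the doc closest to its centroid."""
        centroid = X[cluster].mean(axis=0)
        sims = X[cluster] @ centroid       # dot-product similarity to centroid
        return titles[cluster[int(np.argmax(sims))]]

A differential labeler would instead score terms by MI or χ² against the other clusters, as in Section 13.5.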
In Table 17.2, the title for cluster 9 suggests that many of its documents are about the Chechnya conflict, a fact the MI terms do not reveal. However, a single document is unlikely to be representative of all documents in a cluster. An example is cluster 4, whose selected title is misleading. The main topic of the cluster is oil. Articles about hurricane Dolly only ended up in this cluster because of its effect on oil prices.

We can also use a list of terms with high weights in the centroid of the cluster as a label. Such highly weighted terms (or, even better, phrases, especially noun phrases) are often more representative of the cluster than a few titles can be, even if they are not filtered for distinctiveness as in the differential methods. However, a list of phrases takes more time to digest for users than a well crafted title.

Cluster-internal methods are efficient, but they fail to distinguish terms that are frequent in the collection as a whole from those that are frequent only in the cluster. Terms like year or Tuesday may be among the most frequent in a cluster, but they are not helpful in understanding the contents of a cluster with a specific topic like oil.

In Table 17.2, the centroid method selects a few more uninformative terms (000, court, cents, september) than MI (forces, desk), but most of the terms selected by either method are good descriptors. We get a good sense of the documents in a cluster from scanning the selected terms.

For hierarchical clustering, additional complications arise in cluster labeling. Not only do we need to distinguish an internal node in the tree from its siblings, but also from its parent and its children. Documents in child nodes are by definition also members of their parent node, so we cannot use a naive differential method to find labels that distinguish the parent from its children. However, more complex criteria, based on a combination of overall collection frequency and prevalence in a given cluster, can determine whether a term is a more informative label for a child node or a parent node (see Section 17.9).

17.8 Implementation notes

Most problems that require the computation of a large number of dot products benefit from an inverted index. This is also the case for HAC clustering. Computational savings due to the inverted index are large if there are many zero similarities, either because many documents do not share any terms or because an aggressive stop list is used.

In low dimensions, more aggressive optimizations are possible that make the computation of most pairwise similarities unnecessary (Exercise 17.10). However, no such algorithms are known in higher dimensions. We encountered the same problem in kNN classification (see Section 14.7, page 314).

When using GAAC on a large document set in high dimensions, we have to take care to avoid dense centroids. For dense centroids, clustering can take time Θ(MN² log N) where M is the size of the vocabulary, whereas complete-link clustering is Θ(M_ave N² log N) where M_ave is the average size of the vocabulary of a document. So for large vocabularies complete-link clustering can be more efficient than an unoptimized implementation of GAAC. We discussed this problem in the context of K-means clustering in Chapter 16 (page 365) and suggested two solutions: truncating centroids (keeping only highly weighted terms) and representing clusters by means of sparse medoids instead of dense centroids. These optimizations can also be applied to GAAC and centroid clustering.

Even with these optimizations, HAC algorithms are all Θ(N²) or Θ(N² log N) and therefore infeasible for large sets of 1,000,000 or more documents. For such large sets, HAC can only be used in combination with a flat clustering algorithm like K-means. Recall that K-means requires a set of seeds as initialization (Figure 16.5, page 361). If these seeds are badly chosen, then the resulting clustering will be of poor quality. We can employ an HAC algorithm to compute seeds of high quality. If the HAC algorithm is applied to a document subset of size √N, then the overall runtime of K-means cum HAC seed generation is Θ(N). This is because the application of a quadratic algorithm to a sample of size √N has an overall complexity of Θ(N). An appropriate adjustment can be made for a Θ(N² log N) algorithm to guarantee linearity. This algorithm is referred to as the Buckshot algorithm. It combines the determinism and higher reliability of HAC with the efficiency of K-means.
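A sketch of the Buckshot idea in Python (illustrative; replacing the HAC step with scikit-learn's agglomerative clusterer and averaging each sample cluster into a seed are assumptions of the sketch, not details of the original algorithm):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans

    def buckshot(X, K, seed=0):
        """Buckshot: HAC on a sqrt(N)-sized sample yields K seeds for K-means."""
        rng = np.random.default_rng(seed)
        N = len(X)
        # sample at least K points so that K seeds can be formed
        sample = rng.choice(N, size=max(K, int(np.sqrt(N))), replace=False)
        # quadratic HAC is affordable on the sqrt(N)-sized sample
        hac = AgglomerativeClustering(n_clusters=K, linkage="average")
        labels = hac.fit_predict(X[sample])
        seeds = np.vstack([X[sample][labels == k].mean(axis=0)
                           for k in range(K)])
        # linear K-means over the full set, initialized with the HAC seeds
        return KMeans(n_clusters=K, init=seeds, n_init=1).fit_predict(X)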
17.9 References and further reading

An excellent general review of clustering is (Jain et al. 1999). Early references for specific HAC algorithms are (King 1967) (single-link), (Sneath and Sokal 1973) (complete-link, GAAC) and (Lance and Williams 1967) (discussing a large variety of hierarchical clustering algorithms). The single-link algorithm in Figure 17.9 is similar to Kruskal's algorithm for constructing a minimum spanning tree. A graph-theoretical proof of the correctness of Kruskal's algorithm (which is analogous to the proof in Section 17.5) is provided by Cormen et al. (1990, Theorem 23.1). See Exercise 17.5 for the connection between minimum spanning trees and single-link clusterings.

It is often claimed that hierarchical clustering algorithms produce better clusterings than flat algorithms (Jain and Dubes (1988, p. 140), Cutting et al. (1992), Larsen and Aone (1999)) although more recently there have been experimental results suggesting the opposite (Zhao and Karypis 2002). Even without a consensus on average behavior, there is no doubt that results of EM and K-means are highly variable since they will often converge to a local optimum of poor quality. The HAC algorithms we have presented here are deterministic and thus more predictable.

The complexity of complete-link, group-average and centroid clustering is sometimes given as Θ(N²) (Day and Edelsbrunner 1984, Voorhees 1985b, Murtagh 1983) because a document similarity computation is an order of magnitude more expensive than a simple comparison, the main operation executed in the merging steps after the N × N similarity matrix has been computed.

The centroid algorithm described here is due to Voorhees (1985b). Voorhees recommends complete-link and centroid clustering over single-link for a retrieval application. The Buckshot algorithm was originally published by Cutting et al. (1993). Allan et al. (1998) apply single-link clustering to first story detection.

An important HAC technique not discussed here is Ward's method (Ward Jr. 1963, El-Hamdouchi and Willett 1986), also called minimum variance clustering. In each step, it selects the merge with the smallest RSS (Chapter 16, page 360). The merge criterion in Ward's method (a function of all individual distances from the centroid) is closely related to the merge criterion in GAAC (a function of all individual similarities to the centroid).

Despite its importance for making the results of clustering useful, comparatively little work has been done on labeling clusters. Popescul and Ungar (2000) obtain good results with a combination of χ² and collection frequency of a term. Glover et al. (2002b) use information gain for labeling clusters of web pages. Stein and zu Eissen's approach is ontology-based (2004). The more complex problem of labeling nodes in a hierarchy (which requires distinguishing more general labels for parents from more specific labels for children) is tackled by Glover et al. (2002a) and Treeratpituk and Callan (2006).
Some clustering algorithms attempt to find a set of labels first and then build (often overlapping) clusters around the labels, thereby avoiding the problem of labeling altogether (Zamir and Etzioni 1999, Käki 2005, Osiński and Weiss 2005). We know of no comprehensive study that compares the quality of such "label-based" clustering to the clustering algorithms discussed in this chapter and in Chapter 16. In principle, work on multi-document summarization (McKeown and Radev 1995) is also applicable to cluster labeling, but multi-document summaries are usually longer than the short text fragments needed when labeling clusters (cf. Section 8.7, page 170). Presenting clusters in a way that users can understand is a UI problem. We recommend reading (Baeza-Yates and Ribeiro-Neto 1999, ch. 10) for an introduction to user interfaces in IR.

An example of an efficient divisive algorithm is bisecting K-means (Steinbach et al. 2000). Spectral clustering algorithms (Kannan et al. 2000, Dhillon 2001, Zha et al. 2001, Ng et al. 2001a), including principal direction divisive partitioning (PDDP) (whose bisecting decisions are based on SVD, see Chapter 18) (Boley 1998, Savaresi and Boley 2004), are computationally more expensive than bisecting K-means, but have the advantage of being deterministic.

Unlike K-means and EM, most hierarchical clustering algorithms do not have a probabilistic interpretation. Model-based hierarchical clustering (Vaithyanathan and Dom 2000, Kamvar et al. 2002, Castro et al. 2004) is an exception.

The evaluation methodology described in Section 16.3 (page 356) is also applicable to hierarchical clustering. Specialized evaluation measures for hierarchies are discussed by Fowlkes and Mallows (1983), Larsen and Aone (1999) and Sahoo et al. (2006).

The R environment (R Development Core Team 2005) offers good support for hierarchical clustering. The R function hclust implements single-link, complete-link, group-average, and centroid clustering; and Ward's method. Another option provided is median clustering, which represents each cluster by its medoid (cf. k-medoids in Chapter 16, page 365). Support for clustering vectors in high-dimensional spaces is provided by the software package CLUTO (http://glaros.dtc.umn.edu/gkhome/views/cluto).
17.10 Exercises

Exercise 17.5: A single-link clustering can also be computed from the minimum spanning tree of a graph. The minimum spanning tree connects the vertices of a graph at the smallest possible cost, where cost is defined as the sum over all edges of the graph. In our case the cost of an edge is the distance between two documents. Show that if d_{k−1} > d_{k−2} > ... > d_1 are the costs of the edges of a minimum spanning tree, then these edges correspond to the k − 1 merges in constructing a single-link clustering.

Exercise 17.6: Show that single-link clustering is best-merge persistent and that GAAC and centroid clustering are not best-merge persistent.

Exercise 17.7:
a. Consider running 2-means clustering on a collection with documents from two different languages. What result would you expect?
b. Would you expect the same result when running an HAC algorithm?

Exercise 17.8: Download Reuters-21578. Keep only documents that are in the classes crude, interest, and grain. Discard documents that are members of more than one of these three classes. Compute a (i) single-link, (ii) complete-link, (iii) GAAC, (iv) centroid clustering of the documents. (v) Cut each dendrogram at the second branch from the top to obtain K = 3 clusters. Compute the Rand index for each of the 4 clusterings. Which clustering method performs best?

Exercise 17.9: Suppose a run of HAC finds the clustering with K = 7 to have the highest value on some prechosen goodness measure of clustering. Have we found the highest-value clustering among all clusterings with K = 7?

Exercise 17.10: Consider the task of producing a single-link clustering of N points on a line. Show that we only need to compute a total of about N similarities. What is the overall complexity of single-link clustering for a set of points on a line?

Exercise 17.11: Prove that single-link, complete-link, and group-average clustering are monotonic in the sense defined on page 378.

Exercise 17.12: For N points, there are up to N^K different flat clusterings into K clusters (Section 16.2, page 356). What is the number of different hierarchical clusterings (or dendrograms) of N documents? Are there more flat clusterings or more hierarchical clusterings for given K and N?