Principal Direction Divisive Partitioning

Daniel Boley†
Department of Computer Science and Engineering, University of Minnesota, 200 Union Street S.E., Rm 4-192, Minneapolis, 55455, USA

This research was partially supported by NSF grants CCR-9405380 and CCR-9628786.
† tel: 1-612-625-3887, fax: 1-612-625-0572, e-mail: boley@cs.umn.edu

Abstract

We propose a new algorithm capable of partitioning a set of documents [...] embedding space (i.e., in which every document [...] numbers). The method [...] opposed [...] operates repeatedly [...] documents [...] which is very sparse. It [...] permits [...] performance [...] method [...] documents [...] possible [...] proposed [...]

Introduction

Unsupervised clustering of documents is a [...] component [...] exploration [...] documents [...]. As [...] project, the WebACE Project [16, 11], we [...] developed [...] unsupervised [...] features include scalability [...] competitive performance [...] quality of the clusters generated, and capability of working [...] project [...] by carrying out the following steps:

1. Starting [...] documents [...]
2. Generate new WWW searches using classes discovered in step 1 or 3.
3. [...] documents [...] update [...] structure without user intervention.
4. Rep[...]

[...] method [...] documents [...] Unsupervised [...] important because [...] depending [...] documents [...] paper [...] competitive performance [...] speed, [...] Direction Divisive Partitioning." [...] because [...] component [...] operate [...] documents [...] repeatedly [...] p298] (originally from [1]):

1. hierarchical agglomerative clustering
2. hierarchical divisive clustering
3. iterative partitioning
4. density search clustering
5. [...]

[...] majority [...] documents [...] point [...] "bottom [...] documents [...] proceeds [...] task of [7]. This is discussed further in Section 5.2

Related Work

[...] methods, [...] Distance-based methods [...] and nearest-neighbor [...] use a selected set of words (features) appearing [...] documents [...] point [...] method [...] but the algorithm's [...] is a method [...] modeling [...]. Given a data set it finds maximum parameter values for a specific [...] number [...] methods. [...] documents [...] occurrence [...] documents [...] occur [...] have been proposed [...] number [...] documents [...] especially poor. [...] good [...] methods [...] do not perform [...] independent [...] documents [...] possible [...] method [...] number [...] methods [...] Latent [...] was proposed [...] method [...] documents [...] projection.
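[...] $\min\{m, n\}$ [...] number [...] documents, respectively [...] documents [...] decomposition [...] = (M [...] w e^T [...] respectively [...] preprocess [...] expensive [...] been [...]) typically used with a training set of samples with known "correct" classifications. As such it is usually used as a tool [...] "supervised [...] project [...] cut-point [...] unsupervised [...] appears [...] repeated [...] Loève [...] good [...] introduced [...] paper. [...] specific [...] been [...] projections [...] assemble a series of projections [...] purpose [...] explore the effects on the accuracy of clustering algorithms of various strategies for initial term selection, weighting, and projections. [...] explore the effect of different tfidf scalings and other normalization strategies in a study that is more systematic than that in Section 5 of this paper. [...] propose [...] method [...] intersection clustering based on a global quality function [...] paper. [...] reported [...] methods [...] also clusters results of web searches, but the specifics [...] method [...] reported [...] below [...] there is a study showing how great improvements in accuracy can be [...] methods.

Notation and Mathematical Preliminaries

[...] bold [...] upper [...] bold [...] $v = (v_1, v_2, \ldots, v_n)^T$ [...] transpose [...] $j$-th element is [...] paper we will use the Euclidean norm for vectors:

$\|v\|_2 = \sqrt{\sum_i v_i^2}$

[...] Frobenius [...]

$\|A\|_F = \sqrt{\sum_{i,j} |a_{ij}|^2}$

[...] paper [...] documents [...] $d = (d_1, \ldots, d_n)^T$ [...] document [...]

$d_i = \frac{TF_i}{\sqrt{\sum_j (TF_j)^2}}$

[...] number of occurrences [...] document [...]

$\tilde{d}_i = \frac{1}{2}\left(1 + \frac{TF_i}{\max_j(TF_j)}\right) \log\frac{m}{DF_i}$, then $d_i = \frac{\tilde{d}_i}{\sqrt{\sum_j \tilde{d}_j^2}}$

[...] number [...] documents [...] appears, [...] documents [...] number [...] produce better [...] experiments [...] speed [...] would be better [...] point [...] documents [...] $\frac{1}{m} M e$ [...] documents [...] we [...] positive [...] corresponding [...] principal components [...] specific [...] important [...]

Algorithm Description

[...] component [...] operates [...] documents [...] experiments [...] purposes [...] documents [...] proceeds [...] describe. [...] process [...] node [...] been [...] specify [...] method [...]

Splitting a Partition

A partition of $p$ documents [...] matrix $M_p$ consisting of some selection of $p$ columns of $M$, not necessarily the first $p$ in the set, but we omit the extra subscripts for simplicity. The principal directions of the matrix $M_p$ are the eigenvectors of its sample covariance matrix $C$. Let $w = M_p e / p$ be [...] documents [...] we [...] Karhunen-Loève [...] consists of projecting [...] corresponding [...] beneficial [...] temporarily projecting [...] documents [...] component [...] projection [...] purpose. [...] because [...] documents [...] projection [...] positive [...] specific [...] documents [...] project [...] corresponding [...] documents [...] projection [...] neighboring [...]

4.2 Computational Considerations

[...] expensive [...] expense. [...] methods [...] special [...] methods [...]. In addition, the covariance matrix is actually the product [...] transpose. [...] Decomposition [...]. The SVD of an $n \times m$ matrix $A$ is the decomposition

$A = U \Sigma V^T$,

[...] respectively, [...] $\Sigma = \mathrm{diag}\{\sigma_1, \ldots, \sigma_{\min\{m,n\}}\}$, where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min\{m,n\}} \ge 0$. The $\sigma$'s are called the singular values, the columns of $U$ are called the left singular vectors, and the columns of $V$ are called the right singular vectors. An orthogonal matrix $U$ is a square matrix satisfying $U^T U = I$. [...] we [...] $V$ [...] respectively [...] projections [...] method [...] purposes, [...] proceeds [...] repeatedly [...] property [...] product [...] $M v$ [...] product [...] about [...] because [...] introduced [...] spectrum [...] proceeds [...] method [...] corresponding [...] properties [...]

Overall Algorithm

[...] node [...] documents [...] associated [...] node, [...] documents [...] pointers [...] nodes, [...] specific [...] node [...] documents [...] associated [...] node [...]

field        contents
indices      [...] documents [...] node's [...] documents [...]
[...]        leading singular value for this cluster (a scalar)
leftvec      [...] corresponding [...]
[...]        [...] corresponding [...]
[...]        total scatter value for this cluster (a scalar)
leftchild    pointer [...] node
[...]        pointer [...] node

[...] node [...] documents [...] root node [...] documents [...] associated [...] pointers [...] nodes, [...] described [...] methods [...] node, [...] nodes [...] root) before proceeding [...] paper, [...] being processed. [...] method [...] experiments [...] specific [...] scatter value [...] number [...] Frobenius [...] corresponding [...] we [...] Frobenius [...] $(a_{ij})$ is given by

$\|A\|_F^2 = \sum_{i,j} |a_{ij}|^2$

[...] Frobenius [...]

$\|A\|_F^2 = [...] = \sum_i \sigma_i^2$.

In our algorithm, the total scatter value is used to select the next cluster to split. We choose [...] documents [...] choose [...] numbers [...] documents [...] component [...] loop, [...] node [...] documents [...] associated [...] node, [...] documents [...] nodes. [...] documents [...] hyperplane [...] max [...] we stop when this maximum cluster scatter falls below [...] and [...] when using the data from [14].

The space required for the PDDP algorithm arises from three sources: the memory to store the initial raw data matrix $M$, the memory needed to store the PDDP tree, and temporary [...] node, [...] nodes [...] node, [...] nodes, [...] process [...]

0. Start [...] documents [...] number [...] RootNode [...]
[...] node [...] nodes [...]
[...] := leftchild(K) and R [...]
[...] non-positive [...] positive [...] nodes [...]
9. Result: [...] nodes [...]

[...] temporary [...] depends [...] number [...] method [...] specific [...] beyond [...] scope [...] paper. [...] number [...] depend [...] experiments [...]

[...] occupied
original matrix M    $n \, m \, s_{nz}$
PDDP tree            [...]
temporary            [...] $\min\{n, m\} \, k_{svd}$        (10)

[...] number [...] $m$, so $\min\{n, m\} = m$. In addition, typical values for $s_{nz}$ in our examples range from .04 = 4% [...] products [...] upper bound [...] documents [...] products [...] bounded above [...] expected [...] number [...] documents [...] modulo [...] number [...] unmodified [...] The "Buckshot" heuristic used by [7] is also an O(m) method, [...] depends [...]

Exp[erimental Results]

[...] experiments [...] documents [...] processed [...] documents [...] occurrences.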
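For reference, the split-and-select loop described in the Algorithm Description section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: documents are plain lists of floats, the leading principal direction is found by simple power iteration (a stand-in chosen for brevity; the solver used in the paper is not recoverable from the text here), and the names `centered`, `scatter`, `principal_direction`, and `pddp` are hypothetical.

```python
import math
import random

def centered(docs):
    """Return the centroid w and the rows d - w for a list of document vectors."""
    p = len(docs)
    w = [sum(col) / p for col in zip(*docs)]            # w = M_p e / p
    return w, [[x - wi for x, wi in zip(d, w)] for d in docs]

def scatter(docs):
    """Total scatter: squared Frobenius norm of the centered documents."""
    _, rows = centered(docs)
    return sum(x * x for r in rows for x in r)

def principal_direction(docs, iters=100, seed=0):
    """Centroid and leading eigenvector of the covariance (power iteration, assumed solver)."""
    w, rows = centered(docs)
    rng = random.Random(seed)
    u = [rng.random() + 0.1 for _ in w]                 # arbitrary start vector
    for _ in range(iters):
        proj = [sum(a * b for a, b in zip(r, u)) for r in rows]
        z = [sum(pk * r[i] for pk, r in zip(proj, rows)) for i in range(len(w))]
        norm = math.sqrt(sum(x * x for x in z))
        if norm == 0.0:                                 # zero scatter: nothing to split on
            break
        u = [x / norm for x in z]
    return w, u

def pddp(docs, c_max):
    """Split the cluster with the largest scatter until c_max clusters exist."""
    clusters = [list(range(len(docs)))]
    while len(clusters) < c_max:
        k = max(range(len(clusters)),
                key=lambda j: scatter([docs[i] for i in clusters[j]]))
        members = clusters[k]
        w, u = principal_direction([docs[i] for i in members])
        side = lambda i: sum(ui * (x - wi) for ui, x, wi in zip(u, docs[i], w))
        left = [i for i in members if side(i) <= 0]     # non-positive projection
        right = [i for i in members if side(i) > 0]     # positive projection
        if not left or not right:                       # cannot split further
            break
        clusters[k:k + 1] = [left, right]
    return clusters

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
two_clusters = pddp(docs, 2)
```

Which half ends up as the "left" child depends on the sign of the computed direction, so only the grouping of documents is meaningful, not the order of the two children.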
[...] corresponding [...] documents [...] corresponding [...]

experiment   words    selection criterion
J1           10536    all words
J6           5106     words with TF > 1
J4           2951     Top 5+ words + HTML-tagged [...]
[...]

[...] documents [...] appearing [...] hypergraph [...] Experiment [...] number [...] occurrences [...]

We applied two algorithms to this data set, the divisive PDDP algorithm (Table 2), and an agglomerative algorithm [8]. The agglomerative algorithm is shown only for comparative purposes, [...] documents [...] repeatedly [...] Those two clusters are merged into a single cluster, and the process repeated. [...] both [...] or the norm scaling (3). For the agglomerative algorithm, the times were independent [...] performance reported. [...] purpose [...] unlabeled documents [...] SuperSparc took [...] documents [...] using dictionary of 21190 words, the method took [...] number [...] proposed method [...] labels. [...] method [...] labels [...] documents [...] labels [...]

$e_j = -\sum_i \left(\frac{c(i,j)}{\sum_i c(i,j)}\right) \log \frac{c(i,j)}{\sum_i c(i,j)}$

[...] $c(i,j)$ [...] number [...] label [...] occurs [...] labels [...] documents [...] positive [...] number [...] documents [...] better [...] competitive [...] best [...] membership [...] multi-document [...] preprocessing [...] hypergraph [...] been reported [...], 11]. We illustrate in Table 7 a sample clustering from one of the cases of Table 5. Table 7 shows 16 clusters in which for each cluster we show the hand-assigned labels [...] documents [...]

            divisive algorithm                        agglomerative algorithm
data set    8 clusters   16 clusters   32 clusters    16 clusters   32 clusters
J1          1:40         1:54          2:10           –             –
J6          0:39         0:44          0:50           53:15         52:57
J4          0:24         0:27          0:30           29:23         29:14
J3          0:15         0:17          0:20           16:54         16:48
J7          0:13         0:14          0:17           11:35         11:31
J8          0:21         0:23          0:25           11:31         11:23
J2          0:11         0:12          0:14           7:49          7:45
J9          0:10         0:11          0:14           6:19          6:15
J10         0:08         0:09          0:11           3:51          3:49
J5          0:11         0:13          0:16           3:38          3:35
J11         0:07         0:08          0:11           1:45          1:43

[...] processor.
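The entropy measure used to score a clustering against the hand-assigned labels can be sketched as follows. This is an illustrative sketch, not code from the paper: the function names are hypothetical, the natural logarithm is an assumption (the base is not recoverable from the text), and the overall score is assumed to be the average of the per-cluster entropies weighted by cluster size.

```python
import math
from collections import Counter

def cluster_entropy(labels):
    """Entropy of one cluster j, given the hand-assigned labels of its documents."""
    if not labels:
        return 0.0
    counts = Counter(labels)              # c(i, j) for every label i present in cluster j
    total = sum(counts.values())          # sum_i c(i, j)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def total_entropy(clusters):
    """Size-weighted average of the cluster entropies (assumed weighting)."""
    m = sum(len(labels) for labels in clusters)
    return sum(len(labels) * cluster_entropy(labels) for labels in clusters) / m

pure = ["B"] * 10                         # every document carries the same label
mixed = ["A", "B"]                        # two labels, equally represented
score = total_entropy([pure, mixed])
```

A cluster whose documents all share one label has entropy zero, and the value grows as the labels within a cluster become more evenly mixed, so lower totals indicate clusters that agree better with the hand labelling.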
                     divisive algorithm                          agglomerative algorithm
                     norm scaling         tfidf scaling          norm scaling    tfidf scaling
number of clusters   8      16*    32     8      16*    32       16*    32       16*    32
J1                   1.24   0.69   0.51   1.46   1.06   0.71     0.68   0.47     2.34   1.56
J6                   1.33   0.83   0.56   1.17   0.77   0.65     0.73   0.55     2.14   1.32
J4                   1.53   1.10   0.71   1.65   1.18   0.93     1.00   0.72     2.05   1.28
J3                   1.33   0.85   0.61   1.57   1.11   0.93     0.81   0.62     2.02   1.20
J7                   1.36   0.90   0.61   1.28   0.91   0.72     0.78   0.58     2.18   1.38
J8                   1.47   0.96   0.69   1.32   0.91   0.77     0.89   0.65     2.17   1.47
J2                   1.70   1.12   0.76   1.56   1.14   0.81     0.89   0.62     2.07   1.24
J9                   1.65   1.07   0.76   1.42   1.02   0.76     1.01   0.73     1.97   1.15
J10                  1.69   1.17   0.85   1.90   1.24   0.99     0.97   0.75     2.13   1.29
J5                   1.31   0.74   0.51   1.07   0.61   0.46     0.84   0.55     1.35   0.84
J11                  1.47   1.05   0.67   1.68   1.19   0.92     0.97   0.73     1.48   1.02

[...]

[Figure: entropies of the 16-cluster J experiments. Horizontal axis: experiment number; vertical axis: entropy; series: D norm, D tfidf, A norm, A tfidf.]

[...] experiments [...]

label   topic                      label   topic
A       affirmative action         L       employee rights
B       business capital           M       [...] processing
[...]   information systems        P       personnel [...]
[...]   electronic commerce        S       manufacturing systems [...]
I       [...] property             Z       [...]

[...] labels [...] documents [...] operator [...] described [...] occur [...] perhaps [...] aspect, [...] process [...] deserve further investigation. It is interesting to compare the performance [...] method, [...] projections, [...] methods [...] projection. [...] experiments [...] indicate that preprocessing [...] method. [...]

Extensions and F[...]

[...] experiments [...] The scatter/gather task consists of a "scatter" of a large collection of documents [...] number [...] process repeated [...] the scatter task was carried out by a modified [...] expected [...], they proposed [...]

cluster   labels                 words
1         A [...]                people employ action applic minor affirm
[...]
3         B BBBBBBBBBZ           busi capit develop fund loan
4         BBBBBBBBBM P           busi capit financ fund financi
5         C CCCCCCCCCCS          inform system manag servic technologi
6         CCCMSSZZZZZZ           industri research inform system engin
7         CCMSSSSSSSSSSS         system manufactur inform integr develop
8         CE EEEEEEEEEEEEEE      electron internet commerc busi web
9         CEEEI [...]            personnel properti process product personnel
[...]

Labels [...] documents [...] proposed [...] operator [...] appear [...]

Classification of New Do[cuments]

[...] documents [...] been [...] hyperplanes, [...] occupy [...] bounded [...] hyperplanes (polytope [...] documents [...] polytope [...] associated [...] documents [...] occupy [...] polytope [...] proceeds [...] root node [...] corresponding [...] document [...] hyperplane [...] root [...] depending [...] hyperplane. Specifically, [...] document [...] positive [...] root's [...] process repeated [...] percolates [...] node. [...] document [...] root node [...] node [...] documents [...] associated [...] method, [...] experiments [...] documents [...] both [...] label. [...] method [...] nodes, [...] documents [...] processed [...] purpose [...] process [...] especially [...] experiments [...] documents [...] processed [...]

6.2 Classification Up[dating]

[...] updating [...] documents [...] nodes [...] updating [...] p201 or p225] or [7]). The drawback of such an approach is that clusters created by another [...] updating method [...] possible [...] documents [...] does [...] updating, [...] documents [...] depends [...] experiments [...] support [...] documents [...] being [...] modifying [...] best [...] documents [...] both [...] aspect, aspects [...] about [...] update [...]

Conclusions

[...] proposed [...] unsupervised [...] documents [...] component [...] embedding [...] documents [...] embedding [...] performance [...] documents [...] embeddable [...] possible [...] point [...]

[...] Robert [...] experiments [...] reported [...] paper, [...] members [...] Moore, [...] documents [...] experiments [...] paper.

References

[1] M. R. Anderberg. [...] Analysis for Applications. Academic Press, 1973.
[2] M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review [...]
[3] C. Bishop and M. Tipping. A hierarchical latent variable model [...] Trans. Patt. Anal. Mach. Intell. [...]
[4] D. Boley. Experiment[...] boley/PDDP [...]
[5] D. Boley, M. Gini, R. Gross, E.-H. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Document [...] appear.
[6] [...]ledge Disc[...]
[7] D. Cutting, D. Karger, J. Pedersen, and J. Tukey. Scatter/gather: a cluster-based approach to browsing large document [...] Ann. Int. ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'92), pages 318–329, 1992.
[8] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, 1973.
[9] W. B. Frakes and R. Baeza-Y[...]
[10] [...]
[11] [...] S. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. [...] document [...] report [...]
[12] D. Hull, J. Pederson, and H. Sch[...]. Method [...] document [...], pages 279–287, 1996.
[13] A. Jain and R. C. Dubes. [...]
[14] D. Lewis. Reuters-21578. http://www.research.att.com/lewis, 1997.
[15] S. Lu and K. Fu. A sentence-to-sentence clustering procedure [...]
[16] J. Moore, [...] association [...] component [...] Workshop [...]
[17] M. Nadler and E. P. Smith. Pattern Recognition Engineering. Wiley, 1993.
[18] Northern Light, 1998. http://www.nlsearch.com.
[19] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.
[20] H. Sch[...]. Projections [...] document [...]
[21] A. Singhal, C. Buckley, and M. Mitra. Pivoted document [...]
[22] O. Zamir, O. Etzioni, O. Madani, and R. Karp. Fast and intuitive clustering of web documen[...]