
ePartitioning - PDF document

briana-ranney
Uploaded On 2016-03-16


Principal Direction Divisive Partitioning. Daniel Boley, Department of Computer Science and Engineering, University of Minnesota, 200 Union Street S.E., Rm 4-192, Minneapolis, MN 55455, USA. Abstract: We propose a new al[...] ID: 258189




Presentation Transcript

Principal Direction Divisive Partitioning*

Daniel Boley†
Department of Computer Science and Engineering
University of Minnesota
200 Union Street S.E., Rm 4-192
Minneapolis, MN 55455, USA

* This research was partially supported by NSF grants CCR-9405380 and CCR-9628786.
† tel: 1-612-625-3887, fax: 1-612-625-0572, e-mail: boley@cs.umn.edu

Abstract

We propose a new algorithm capable of partitioning a set of documents [...] embedding space (i.e., in which every document [...] numbers). The method [...] operates repeatedly [...] which is very sparse. It permits [...] method [...] documents [...] possible [...] proposed [...]

1 Introduction

Unsupervised clustering of documents is a [...] component [...] exploration [...] documents [...]. As [...] project, the WebACE Project [16, 11], we [...] developed [...] unsupervised [...] features include scalability, [...] competitive performance, [...] quality of the clusters generated, and capability of working [...]

[...] project [...] by carrying out the following steps:

1. Starting [...] documents [...]
2. Generate new WWW searches using classes discovered in step 1 or 3.
3. [...] documents [...] update [...] structure without user intervention.
4. Rep[...]

[...] method [...] documents [...] Unsupervised [...] important because [...] depending [...] documents [...] paper [...] competitive performance [...] speed, [...] Direction Divisive Partitioning." [...] because [...] component [...] operate [...] documents [...] repeatedly [...]

[...] p298] (originally from [1]):

1. hierarchical agglomerative clustering
2. hierarchical divisive clustering
3. iterative partitioning
4. density search clustering
5. [...]

[...] majority [...] documents [...] point [...] "bottom [...] proceeds [...] task of [7]. This is discussed further in Section 5.

2 Related Work

[...] documents [...] methods [...] Distance-based methods [...] and nearest-neighbor [...] use a selected set of words (features) appearing [...] documents [...] point [...] method [...] but the algorithm's [...] is a method [...] modeling [...]. Given a data set it finds maximum parameter values for a specific [...] number [...] methods.

[...] document [...] occurrence [...] occur [...] have been proposed [...] number [...] documents [...] especially poor. [...] good [...] methods [...] do not perform [...] independen[...] [...] possible [...] methods [...]

Latent [...] was proposed [...] method [...] document [...] projection.
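[...] $\min\{m, n\}$ [...] number [...] documents [...] respectively [...] decomposition [...] $= (M - w e^T)$ [...] respectively [...] preprocess [...] expensive [...] been [...] typically used with a training set of samples with known "correct" classifications. As such it is usually used as a tool [...] "supervised [...] project [...] cut-point [...] unsupervised [...] appears [...] repeated [...] Loève [...] good [...] introduced [...] paper.

[...] specific [...] been [...] projections [...] assemble a series of projections [...] purpose [...] explore the effects on the accuracy of clustering algorithms of various strategies for initial term selection, weighting, and projections. [...] explore the effect of different tfidf scalings and other normalization strategies in a study that is more systematic than that in Section 5 of this paper.

[...] propose [...] method [...] word intersection clustering based on a global quality function [...] paper. [...] reported [...] methods [...] also clusters results of web searches, but the specifics [...] method [...] reported below [...] there is a study showing how great improvements in accuracy can be [...] methods.

3 Notation and Mathematical Preliminaries

[...] bold upper[...] bold [...] $v = (v_1, v_2, \ldots, v_n)^T$ [...] transpose [...] $j$-th element is [...]. [...] paper we will use the Euclidean norm for vectors:

$$\|v\|_2 = \sqrt{\sum_i v_i^2}$$

[...] Frobenius [...]

$$\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$$

[...] paper [...] documents [...] $d = (d_1, \ldots, d_n)^T$ [...] document [...]

$$d_i = \frac{TF_i}{\sqrt{\sum_j TF_j^2}} \qquad (3)$$

[...] number [...] occurrences [...] document [...]

$$\tilde{d}_i = \frac{1}{2}\left(1 + \frac{TF_i}{\max_j(TF_j)}\right)\log\frac{m}{DF_i}, \qquad \text{then} \qquad d_i = \frac{\tilde{d}_i}{\sqrt{\sum_j \tilde{d}_j^2}}$$

[...] number [...] documents [...] appears, [...] documents [...] number [...] produce better [...] experiments [...] speed [...] would be better [...] point [...] documents [...] we [...] positive [...] corresponding [...] principal components [...] specific [...] important [...]

4 Algorithm Description

[...] Component [...] operates [...] documents [...] experiments [...] purposes [...] proceeds [...] describe. [...] process [...] node [...] been [...] specify [...] method [...]

4.1 Splitting a Partition

A partition of $p$ documents [...] matrix $M_p$ consisting of some selection of $p$ columns of $M$, not necessarily the first $p$ in the set, but we omit the extra subscripts for simplicity. The principal directions of the matrix $M_p$ are the eigenvectors of its sample covariance matrix $C$. Let $w = M_p e / p$ be [...] documents [...] we [...] Karhunen-Loève [...] consists of projecting [...] corresponding [...] beneficial [...] temporarily projecting [...] document [...] component [...] projection [...] purpose. [...] because [...] document [...] projection [...] positive [...] specific [...] document [...] project [...] corresponding [...] documents [...] projection [...] neighboring [...]

4.2 Computational Considerations

[...] expensive [...] expense. [...] methods [...] special [...] methods [...]. In addition, the covariance matrix is actually the product [...] transpose. [...] Decomposition [...]. The SVD of an $n \times m$ matrix $A$ is the decomposition $A = U \Sigma V^T$ [...] respectively [...] $\Sigma = \mathrm{diag}\{\sigma_1, \ldots, \sigma_{\min\{m,n\}}\}$, where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min\{m,n\}} \ge 0$. The $\sigma$'s are called the singular values, the columns of $U$ are called the left singular vectors, and the columns of $V$ are called the right singular vectors. An orthogonal matrix $U$ is a square matrix satisfying $U^T U = I$.

[...] we [...] $V$ [...] respectively [...] projections [...] method [...] purposes, [...] proceeds [...] repeatedly [...] property [...] product [...] $Mv$ [...] product [...] because [...] introduced [...] spectrum [...] proceeds [...] method [...] corresponding [...] properties [...]

4.3 Overall Algorithm

[...] node [...] documents [...] associated [...] node, [...] documents [...] pointers [...] nodes, [...] specific [...] node [...]

field       contents
indices     [...] documents [...] node's [...]
[...]       [...] documents [...]
[...]       leading singular value for this cluster (a scalar)
leftvec     [...] corresponding [...]
[...]       [...] corresponding [...]
[...]       total scatter value for this cluster (a scalar)
leftchild   pointer [...] node
[...]       pointer [...] node

[...] node [...] documents [...] root node [...] documents [...] associated [...] pointers [...] nodes, [...] described [...] methods [...] node, [...] nodes [...] root) [...] before proceeding [...] paper, [...] being processed. [...] method [...] experiments [...] specific [...]

[...] scatter value [...] Frobenius [...] corresponding [...] we [...] Frobenius [...]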
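As a reading aid, the split of Section 4.1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `split_cluster` is hypothetical, documents are dense lists of floats rather than sparse columns, the leading left singular vector of the centered matrix $M_p - w e^T$ is approximated by plain power iteration (swapped in here in place of whatever sparse SVD solver the paper uses), and the rule "non-positive projection goes left, positive goes right" is inferred from the surviving fragments of the algorithm listing.

```python
import math


def split_cluster(docs, iters=100):
    """Split one cluster along its principal direction.

    docs: list of p document vectors, each a list of n floats (hypothetical layout).
    Returns (w, u, left, right): centroid w, unit principal direction u
    (None if all documents coincide), and index lists of the two halves.
    """
    p, n = len(docs), len(docs[0])
    # Centroid w = M_p e / p.
    w = [sum(d[i] for d in docs) / p for i in range(n)]
    # Columns of A = M_p - w e^T, stored as rows for convenience.
    A = [[d[i] - w[i] for i in range(n)] for d in docs]

    # Assumption: start power iteration from the longest centered document.
    u = max(A, key=lambda a: sum(x * x for x in a))
    norm = math.sqrt(sum(x * x for x in u))
    if norm == 0.0:
        return w, None, list(range(p)), []  # nothing to split
    u = [x / norm for x in u]

    # Power iteration on C = A A^T, never forming C explicitly.
    for _ in range(iters):
        s = [sum(a[i] * u[i] for i in range(n)) for a in A]            # A^T u
        v = [sum(s[k] * A[k][i] for k in range(p)) for i in range(n)]  # A (A^T u)
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break
        u = [x / norm for x in v]

    # Project each centered document onto u and split by sign.
    proj = [sum(a[i] * u[i] for i in range(n)) for a in A]
    left = [k for k in range(p) if proj[k] <= 0.0]
    right = [k for k in range(p) if proj[k] > 0.0]
    return w, u, left, right
```

The sign of `u` is arbitrary, so which half is called "left" can flip between runs with different starting vectors; only the partition itself is meaningful. Working with the products `A^T u` and `A (A^T u)` instead of the covariance matrix mirrors the remark in Section 4.2 that the covariance matrix is a product of a matrix and its transpose.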
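[...] $A = (a_{ij})$ is given by

$$\|A\|_F^2 = \sum_{i,j} |a_{ij}|^2$$

[...] Frobenius [...] $\|A\|_F^2 =$ [...] $= \sum_i \sigma_i^2$.

In our algorithm, the total scatter value is used to select the next cluster to split. We choose [...] documents [...] numbers [...] documents [...] component [...] loop, [...] node [...] documents [...] associated [...] node, [...] nodes. [...] documents [...] hyperplane [...] max [...] we stop when this maximum cluster scatter falls below [...] and [...] when using the data from [14].

The space required for the PDDP algorithm arises from three sources: the memory to store the initial raw data matrix $M$, the memory needed to store the PDDP tree, and temporary [...] node, [...] nodes [...] process [...]

0. Start [...] documents [...] number [...] Root Node [...]
   [...] node [...] nodes [...] := leftchild(K) and R [...] non-positive [...] positive [...] nodes [...]
9. Result: [...] nodes [...]

[...] temporary [...] depends [...] number [...] method [...] specific [...] beyond [...] scope [...] paper. [...] number [...] depend [...] experiments [...]

                     [...] occupied
original matrix M    $n \, m \, s_{nz}$
PDDP tree            [...]
temporary [...]      $\min\{n, m\} \, k_{svd}$      (10)

[...] number [...] $m$, so $\min\{n, m\} = m$. In addition, typical values for $s_{nz}$ in our examples range from .04 = 4% [...] products [...] product [...] upper bound [...] documents [...] bounded above [...] expected [...] number [...] documents [...] modulo [...] number [...] unmodified [...]. The "Buckshot" heuristic used by [7] is also an O(m) method, [...] depends [...]

5 Experimental Results

[...] experiments [...] documents [...] processed [...] documents [...] occurrences.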
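The experiments in this section exercise the full divisive loop of Section 4.3: pick the cluster with the largest total scatter, split it along its principal direction, and repeat. The following Python sketch illustrates that loop under several assumptions that are not taken from the paper: all function names are hypothetical, clusters are kept as a flat list of index lists instead of the binary tree with `leftchild` pointers, documents are dense lists, power iteration stands in for a sparse SVD solver, and the stopping rule is a fixed number of clusters `cmax`.

```python
import math


def centered(docs, idx):
    """Columns of A = M_p - w e^T for the cluster given by index list idx."""
    n = len(docs[0])
    w = [sum(docs[k][i] for k in idx) / len(idx) for i in range(n)]
    return [[x - wi for x, wi in zip(docs[k], w)] for k in idx]


def scatter(docs, idx):
    """Total scatter of a cluster: squared Frobenius norm of the centered matrix."""
    return sum(x * x for a in centered(docs, idx) for x in a)


def split(docs, idx, iters=100):
    """Split idx by the sign of the projection onto the principal direction.

    Assumes scatter(docs, idx) > 0, so the start vector below is nonzero.
    """
    A = centered(docs, idx)
    u = max(A, key=lambda a: sum(x * x for x in a))  # assumed start vector
    for _ in range(iters):
        s = [sum(x * y for x, y in zip(a, u)) for a in A]                   # A^T u
        v = [sum(sk * a[i] for sk, a in zip(s, A)) for i in range(len(u))]  # A (A^T u)
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    proj = [sum(x * y for x, y in zip(a, u)) for a in A]
    left = [k for k, pr in zip(idx, proj) if pr <= 0.0]
    right = [k for k, pr in zip(idx, proj) if pr > 0.0]
    return left, right


def pddp(docs, cmax):
    """Divisive loop: repeatedly split the leaf with the largest total scatter."""
    leaves = [list(range(len(docs)))]
    while len(leaves) < cmax:
        worst = max(leaves, key=lambda idx: scatter(docs, idx))
        if scatter(docs, worst) == 0.0:
            break  # every remaining cluster is a single point (or duplicates)
        leaves.remove(worst)
        leaves.extend(split(docs, worst))
    return leaves
```

For example, `pddp(docs, 16)` would correspond to the 16-cluster runs reported in the tables, assuming `docs` holds the scaled document vectors. Because the centered projections of a cluster sum to zero, a cluster with positive scatter always yields two non-empty halves in this sketch.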
[...] corresponding [...] documents [...] corresponding [...]

expe-[...]   [...]     [...]
J1           10536     all words
J6           5106      words with TF > 1
J4           2951      Top 5+ words + HTML-tagged [...]
[...]

[...] documents [...] appearing [...] hypergraph [...] Experiments [...] number [...] occurrences [...]

We applied two algorithms to this data set, the divisive PDDP algorithm (Table 2), and an agglomerative algorithm [8]. The agglomerative algorithm is shown only for comparative purposes, [...] documents [...] repeatedly [...] Those two clusters are merged into a single cluster, and the process repeated. [...] both [...] or the norm scaling (3). For the agglomerative algorithm, the times were independen[...] [...] performance reported. [...] purpose [...] unlabeled documents [...] SuperSparc took [...] documents [...] using dictionary of 21190 words, the method took [...] number [...] proposed method [...] labels. [...] method [...] labels [...] documents [...] labels [...]

$$e_j = -\sum_i \frac{c(i,j)}{\sum_i c(i,j)} \log \frac{c(i,j)}{\sum_i c(i,j)}$$

[...] number [...] label [...] occurs [...] labels [...] documents [...] positive [...] $= \frac{1}{[...]}$ [...] number [...] documents [...] better [...] competitive [...] best [...] been [...] membership [...] documents [...] preprocessing [...] hypergraph [...] been reported [..., 11].

We illustrate in Table 7 a sample clustering from one of the cases of Table 5. Table 7 shows 16 clusters in which for each cluster we show the hand-assigned labels [...] documents [...]

        divisive algorithm                       agglomerative algorithm
data    8          16         32                 16         32
set     clusters   clusters   clusters           clusters   clusters
J1      1:40       1:54       2:10               -          -
J6      0:39       0:44       0:50               53:15      52:57
J4      0:24       0:27       0:30               29:23      29:14
J3      0:15       0:17       0:20               16:54      16:48
J7      0:13       0:14       0:17               11:35      11:31
J8      0:21       0:23       0:25               11:31      11:23
J2      0:11       0:12       0:14               7:49       7:45
J9      0:10       0:11       0:14               6:19       6:15
J10     0:08       0:09       0:11               3:51       3:49
J5      0:11       0:13       0:16               3:38       3:35
J11     0:07       0:08       0:11               1:45       1:43

[...] processor.
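The entropy measure used to score a clustering against hand-assigned labels can be illustrated with a short Python sketch. Several details are assumptions rather than facts from the text: the function names are hypothetical, each cluster is represented simply as a list of label characters, the total entropy is taken as the cluster entropies weighted by cluster size and divided by the number of documents (only partly legible in the text), and the logarithm base is exposed as a parameter because the base used in the paper cannot be recovered from the transcript.

```python
import math
from collections import Counter


def cluster_entropy(labels, base=math.e):
    """Entropy e_j of one cluster, given the labels of its documents.

    c(i, j) is the number of times label i occurs in this cluster.
    """
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())


def total_entropy(clusters, base=math.e):
    """Size-weighted average of the cluster entropies (assumed weighting)."""
    m = sum(len(c) for c in clusters)
    return sum(cluster_entropy(c, base) * len(c) for c in clusters) / m
```

A cluster whose documents all share one label has entropy zero, and lower total entropy indicates clusters that agree better with the labels. The functions assume every cluster is non-empty.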
        divisive algorithm                          agglomerative algorithm
data    norm scaling          tfidf scaling         norm scaling    tfidf scaling
set     8      16*    32      8      16*    32      16*    32       16*    32
J1      1.24   0.69   0.51    1.46   1.06   0.71    0.68   0.47     2.34   1.56
J6      1.33   0.83   0.56    1.17   0.77   0.65    0.73   0.55     2.14   1.32
J4      1.53   1.10   0.71    1.65   1.18   0.93    1.00   0.72     2.05   1.28
J3      1.33   0.85   0.61    1.57   1.11   0.93    0.81   0.62     2.02   1.20
J7      1.36   0.90   0.61    1.28   0.91   0.72    0.78   0.58     2.18   1.38
J8      1.47   0.96   0.69    1.32   0.91   0.77    0.89   0.65     2.17   1.47
J2      1.70   1.12   0.76    1.56   1.14   0.81    0.89   0.62     2.07   1.24
J9      1.65   1.07   0.76    1.42   1.02   0.76    1.01   0.73     1.97   1.15
J10     1.69   1.17   0.85    1.90   1.24   0.99    0.97   0.75     2.13   1.29
J5      1.31   0.74   0.51    1.07   0.61   0.46    0.84   0.55     1.35   0.84
J11     1.47   1.05   0.67    1.68   1.19   0.92    0.97   0.73     1.48   1.02

(Column headings under each scaling give the number of clusters.)

[Figure: entropies of the 16-cluster J experiments. Horizontal axis: experiment number; vertical axis: entropy; series: D norm, D tfidf, A norm, A tfidf.]

label  topic                    label  topic
A      affirmative action       L      employee rights
B      business capital         M      [...] processing
C      information systems      P      personnel [...]
E      electronic commerce      S      manufacturing systems [...]
I      [...] propert[...]       Z      [...]

[...] labels [...] documents [...] operator [...] described [...] occur [...] perhaps [...] aspect, [...] process [...] deserve further investigation. It is interesting to compare the performance [...] method, [...] projections, [...] methods [...] projection. [...] experiments [...] indicate that preprocessing [...] method. [...]

6 Extensions and F[...]

[...] experiments [...] The scatter/gather task consists of a "scatter" of a large collection of documents [...] number [...] process repeated [...] the scatter task was carried out by a modified [...] expected [...], they proposed [...]

1:  A [...]              people employ action applic minor affirm
[...]
3:  B BBBBBBBBBZ         busi capit develop fund loan
4:  BBBBBBBBBM P         busi capit financ fund financi
5:  C CCCCCCCCCCS        inform system manag servic technologi
6:  CCCMSSZZZZZZ         industri research inform system engin
7:  CCMSSSSSSSSSSS       system manufactur inform integr develop
8:  CE EEEEEEEEEEEEEE    electron internet commerc busi web
9:  CEEEI [...]          [...] personnel properti process product personnel [...]
[...]

Labels [...] documents [...] proposed [...] operator [...] appear [...]

6.1 Classification of New Documents

[...] documents [...] been [...] hyperplanes, [...] occup[...] [...] bounded [...] hyperplanes (polytope [...] documents [...] been [...] polytope [...] associated [...] documents [...] polytope [...] proceeds [...] root node [...] corresponding [...] document [...] hyperplane [...] root [...] depending [...] hyperplane. Specifically, [...] document [...] $v$ [...] positive [...] root's [...] process repeated [...] percolates [...] node.

[...] document [...] root node [...] node [...] documents [...] associated [...] method, [...] experiments [...] documents [...] both [...] label. [...] method [...] nodes, [...] documents [...] processed [...] purpose [...] process [...] especially [...] been [...] experiments [...] processed [...]

6.2 Classification Updating

[...] updating [...] documents [...] nodes [...] updating [...] p201 or p225] or [7]). The drawback of such an approach is that clusters created by another updating method [...] possible [...] documents [...] does [...] updating, [...] documents [...] depends [...] experiments [...] support [...] being [...] modifying [...] best [...] documents [...] both [...] aspect, [...] aspects [...] about [...] update [...]

7 Conclusions

[...] proposed [...] unsupervised [...] document [...] component [...] embedding [...] performance [...] documents [...] embeddable [...] possible [...] point [...]

[...] Robert [...] experiments reported [...] paper, [...] members [...] Moore, [...] documents [...] experiments [...] paper.

References

[1] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
[2] M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, [...]
[3] C. Bishop and M. Tipping. A hierarchical latent variable model [...] Trans. Patt. Anal. Mach. Intell., [...]
[4] D. Boley. Experimen[...] boley/PDDP [...]
[5] D. Boley, M. Gini, R. Gross, E.-H. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Documen[...] appear. [...]ledge Disc[...]
[7] D. Cutting, D. Karger, J. Pedersen, and J. Tukey. Scatter/gather: a cluster-based approach to browsing large documen[...] Ann. Int. ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'92), pages 318-329, 1992.
[8] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, 1973.
[9] W. B. Frakes and R. Baeza-Y[...]
[11] [...] S. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. [...] documen[...] report [...]
[12] D. Hull, J. Pederson, and H. Sch[...] Method [...] documen[...], pages 279-287, 1996.
[13] A. Jain and R. C. Dubes. [...]
[14] D. Lewis. Reuters-21578. http://www.research.att.com/lewis, 1997.
[15] S. Lu and K. Fu. A sentence-to-sentence clustering procedure [...]
[16] J. Moore, [...] association [...] componen[...] Workshop [...]
[17] M. Nadler and E. P. Smith. Pattern Recognition Engineering. Wiley, 1993.
[18] Northern Light, 1998. http://www.nlsearch.com.
[19] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
[20] H. Sch[...] Projections [...] documen[...]
[21] A. Singhal, C. Buckley, and M. Mitra. Pivoted documen[...]
[22] O. Zamir, O. Etzioni, O. Madani, and R. Karp. Fast and intuitive clustering of web documen[...]
