Figure 1: The WEKA Explorer user interface.

Figure 2: The WEKA Knowledge Flow user interface.

[...] Figure 1. Data can be loaded from various sources, including files, URLs and databases. Supported file formats include WEKA's own ARFF format, CSV, LibSVM's format, and C4.5's format. It is also possible to generate data using an artificial data source and edit data manually using a dataset editor.

The second panel in the Explorer gives access to WEKA's classification and regression algorithms. The corresponding panel is called "Classify" because regression techniques are viewed as predictors of "continuous classes". By default, the panel runs a cross-validation for a selected learning algorithm on the dataset that has been prepared in the Preprocess panel to estimate predictive performance. It also shows a textual representation of the model built from the full dataset. However, other modes of evaluation, e.g. based on a separate test set, are also supported. If applicable, the panel also provides access to graphical representations of models, e.g. decision trees. Moreover, it can visualize prediction errors in scatter plots, and also allows evaluation via ROC curves and other "threshold curves". Models can also be saved and loaded in this panel.

Along with supervised algorithms, WEKA also supports application of unsupervised algorithms, namely clustering algorithms and methods for association rule mining. These are accessible in the Explorer via the third and fourth panel respectively. The "Cluster" panel enables users to run a clustering algorithm on the data loaded in the Preprocess panel. It provides simple statistics for evaluation of clustering performance: likelihood-based performance for statistical clustering algorithms and comparison to "true" cluster membership if this is specified in one of the attributes in the data. If applicable, visualization of the clustering structure is also possible, and models can be stored persistently if necessary.

WEKA's support for clustering tasks is not as extensive as its support for classification and regression, but it has more techniques for clustering than for association rule mining, which has up to this point been somewhat neglected. Nevertheless, it does contain an implementation of the most well-known algorithm in this area, as well as a few other ones. These methods can be accessed via the "Associate" panel in the Explorer.

Perhaps one of the most important tasks in practical data mining is the task of identifying which attributes in the data are the most predictive ones. To this end, WEKA's Explorer has a dedicated panel for attribute selection, "Select attributes", which gives access to a wide variety of algorithms and evaluation criteria for identifying the most important attributes in a dataset. Because it is possible to combine different search methods with different evaluation criteria, a wide range of candidate techniques can be configured. Robustness of the selected attribute set can be validated via a cross-validation-based approach.

Note that the attribute selection panel is primarily designed for exploratory data analysis. WEKA's "FilteredClassifier" (accessible via the Classify panel) should be used to apply attribute selection techniques in conjunction with an underlying classification or regression algorithm to avoid introducing optimistic bias in the performance estimates obtained. This caveat also applies to some of the preprocessing tools (more specifically, the supervised ones) that are available from the Preprocess panel.

In many practical applications, data visualization provides important insights. These may even make it possible to avoid further analysis using machine learning and data mining algorithms. But even if this is not the case, they may inform the process of selecting an appropriate algorithm for the problem at hand. The last panel in the Explorer, called "Visualize", provides a color-coded scatter plot matrix, along with the option of drilling down by selecting individual plots in this matrix and selecting portions of the data to visualize. It is also possible to obtain information regarding individual data points, and to randomly perturb data by a chosen amount to uncover obscured data.

The Explorer is designed for batch-based data processing: training data is loaded into memory in its entirety and then processed. This may not be suitable for problems involving large datasets. However, WEKA does have implementations of some algorithms that allow incremental model building, which can be applied in incremental mode from a command-line interface. The incremental nature of these algorithms is ignored in the Explorer, but can be exploited using a more recent addition to WEKA's set of graphical user interfaces, namely the so-called "Knowledge Flow", shown in Figure 2.

Most tasks that can be tackled with the Explorer can also be handled by the Knowledge Flow. However, in addition to batch-based training, its data flow model enables incremental updates with processing nodes that can load and preprocess individual instances before feeding them into appropriate incremental learning algorithms. It also provides nodes for visualization and evaluation. Once a set-up of interconnected processing nodes has been configured, it can be saved for later re-use.

Figure 3: The WEKA Experimenter user interface.

The third main graphical user interface in WEKA is the "Experimenter" (see Figure 3). This interface is designed to facilitate experimental comparison of the predictive performance of algorithms based on the many different evaluation criteria that are available in WEKA. Experiments can involve multiple algorithms that are run across multiple datasets; for example, using repeated cross-validation. Experiments can also be distributed across different compute nodes in a network to reduce the computational load for individual nodes. Once an experiment has been set up, it can be saved in either XML or binary form, so that it can be re-visited if necessary. Configured and saved experiments can also be run from the command line.

Compared to WEKA's other user interfaces, the Experimenter is perhaps used less frequently by data mining practitioners. However, once preliminary experimentation has been performed in the Explorer, it is often much easier to identify a suitable algorithm for a particular dataset, or collection of datasets, using this alternative interface.

We would like to conclude this brief exposition of WEKA's main graphical user interfaces by pointing out that, regardless of which user interface is desired, it is important to provide the Java virtual machine that is used to run WEKA with a sufficient amount of heap space. The need to pre-specify the amount of memory required, which should be set lower than the amount of physical
memory of the machine that is used, to avoid swapping, is perhaps the biggest stumbling block to the successful application of WEKA in practice. On the other hand, considering running time, there is no longer a significant disadvantage compared to programs written in C, a commonly-heard argument against Java for data-intensive processing tasks, due to the sophistication of just-in-time compilers in modern Java virtual machines.

3. HISTORY OF THE WEKA PROJECT

The WEKA project was funded by the New Zealand government from 1993 up until recently. The original funding application was lodged in late 1992 and stated the project's goals as:

"The programme aims to build a state-of-the-art facility for developing techniques of machine learning and investigating their application in key areas of the New Zealand economy. Specifically we will create a workbench for machine learning, determine the factors that contribute towards its successful application in the agricultural industries, and develop new methods of machine learning and ways of assessing their effectiveness."

The first few years of the project focused on the development of the interface and infrastructure of the workbench. Most of the implementation was done in C, with some evaluation routines written in Prolog, and the user interface produced using TCL/TK. During this time the WEKA¹ acronym was coined and the Attribute Relation File Format (ARFF) used by the system was created.

Figure 4: Back then: the WEKA 2.1 workbench user interface.

The first release of WEKA was internal and occurred in 1994. The software was very much at beta stage. The first public release (at version 2.1) was made in October 1996. Figure 4 shows the main user interface for WEKA 2.1. In July 1997, WEKA 2.2 was released. It included eight learning algorithms (implementations of which were provided by their original authors) that were integrated into WEKA using wrappers based on shell scripts and data pre-processing tools written in C. WEKA 2.2 also sported a facility, based on Unix Makefiles, for configuring and running large-scale experiments based on these algorithms.

By now it was becoming increasingly difficult to maintain the software. Factors such as changes to supporting libraries, management of dependencies and complexity of configuration made the job difficult for the developers and the installation experience frustrating for users. At about this time it was decided to rewrite the system entirely in Java, including implementations of the learning algorithms. This was a somewhat radical decision given that Java was less than two years old at the time. Furthermore, the runtime performance of Java made it a questionable choice for implementing computationally intensive machine learning algorithms. Nevertheless, it was decided that advantages such as "Write Once, Run Anywhere" and simple packaging and distribution outweighed these shortcomings and would facilitate wider acceptance of the software.

May 1998 saw the final release of the TCL/TK-based system (WEKA 2.3) and, at the middle of 1999, the 100% Java WEKA 3.0 was released. This non-graphical version of WEKA accompanied the first edition of the data mining book by Witten and Frank [34]. In November 2003, a stable version of WEKA (3.4) was released in anticipation of the publication of the second edition of the book [35]. In the time between 3.0 and 3.4, the three main graphical user interfaces were developed.

In 2005, the WEKA development team received the SIGKDD Data Mining and Discovery Service Award [22]. The award [...]

¹ The Weka is also an indigenous bird of New Zealand. Like the well-known Kiwi, it is flightless.

Figure 8: The Bayesian network editor.

Figure 9: The Explorer with an "Experiment" tab added from a plugin.

[...] using WEKA's data generator tools. Artificial data suitable for classification can be generated from decision lists, radial-basis function networks and Bayesian networks as well as the classic LED24 domain. Artificial regression data can be generated according to mathematical expressions. There are also several generators for producing artificial data for clustering purposes.

The Knowledge Flow interface has also been improved: it now includes a new status area that can provide feedback on the operation of multiple components in a data mining process simultaneously. Other improvements to the Knowledge Flow include support for association rule mining, improved support for visualizing multiple ROC curves and a plugin mechanism.

4.5 Extensibility

A number of plugin mechanisms have been added to WEKA since version 3.4. These allow WEKA to be extended in various ways without having to modify the classes that make up the WEKA distribution. New tabs in the Explorer are easily added by writing a class that extends javax.swing.JPanel and implements the interface weka.gui.explorer.Explorer.ExplorerPanel. Figure 9 shows the Explorer with a new tab, provided by a plugin, for running simple experiments. Similar mechanisms allow new visualizations for classifier errors, predictions, trees and graphs to be added to the pop-up menu available in the history list of the Explorer's "Classify" panel. The Knowledge Flow has a plugin mechanism that allows new components to be incorporated by simply adding their jar file (and any necessary supporting jar files) to the .knowledgeFlow/plugins directory in the user's home directory. These jar files are loaded automatically when the Knowledge Flow is started and the plugins are made available from a "Plugins" tab.

Figure 10: A PMML radial basis function network loaded into the Explorer.

4.6 Standards and Interoperability

WEKA 3.6 includes support for importing PMML models (Predictive Model Markup Language). PMML is a vendor-agnostic, XML-based standard for expressing statistical and data mining models that has gained widespread support from both proprietary and open-source data mining vendors. WEKA 3.6 supports import of PMML regression, general regression and neural network model types. Import of further model types, along with support for exporting PMML, will be added in future releases of WEKA. Figure 10 shows a PMML radial basis function network, created by the Clementine system, loaded into the Explorer.

Another new feature in WEKA 3.6 is the ability to read and write data in the format used by the well known LibSVM and SVM-Light support vector machine implementations [5]. This complements the new LibSVM and LibLINEAR wrapper classifiers.

5. PROJECTS BASED ON WEKA

There are many projects that extend or wrap WEKA in some fashion. At the time of this writing, there are 46 such projects listed on the Related Projects web page of the WEKA site³. Some of these include:

Systems for natural language processing. There are a number of tools that use WEKA for natural language processing: GATE is an
NLP workbench [11]; Balie performs language identification, tokenization, sentence boundary detection and named-entity recognition [21]; Senseval-2 is a system for word sense disambiguation; Kea is a system for automatic keyphrase extraction [36].

Knowledge discovery in biology. Several tools using or based on WEKA have been developed to aid data analysis in biological applications: BioWEKA is an extension to WEKA for tasks in biology, bioinformatics, and biochemistry [14]; the Epitopes Toolkit (EpiT) is a platform based on WEKA for developing epitope prediction tools; maxdView and Mayday [7] provide visualization and analysis of microarray data.

Distributed and parallel data mining. There are a number of projects that have extended WEKA for distributed data mining: Weka-Parallel provides a distributed cross-validation facility [4]; GridWeka provides distributed scoring and testing as well as cross-validation [18]; FAEHIM and Weka4WS [31] make WEKA available as a web service.

Open-source data mining systems. Several well known open-source data mining systems provide plugins to allow access to WEKA's algorithms. The Konstanz Information Miner (KNIME) and RapidMiner [20] are two such systems. The R [23] statistical computing environment also provides an interface to WEKA through the RWeka [16] package.

Scientific workflow environment. The Kepler Weka project integrates all the functionality of WEKA into the Kepler [1] open-source scientific workflow platform.

³ http://www.cs.waikato.ac.nz/ml/weka/index_related.html

6. INTEGRATION WITH THE PENTAHO BI SUITE

Pentaho Corporation is a provider of commercial open source business intelligence software. The Pentaho BI suite consists of reporting, interactive analysis, dashboards, ETL/data integration and data mining. Each of these is a separate open source project; they are tied together by an enterprise-class open source BI platform. In late 2006, WEKA was adopted as the data mining component of the suite and since then has been integrated into the platform.

The main point of integration between WEKA and the Pentaho platform is with Pentaho Data Integration (PDI), also known as the Kettle project⁴. PDI is a streaming, engine-driven ETL tool. Its rich set of extract and transform operations, combined with support for a large variety of databases, is a natural complement to WEKA's data filters. PDI can easily export datasets in WEKA's native ARFF format to be used immediately for model creation.

Several WEKA-specific transformation steps have been created so that PDI can access WEKA algorithms and be used as both a scoring platform and a tool to automate model creation. The first of these, shown in Figure 11, is called "Weka Scoring." It enables the user to import a serialized WEKA model (classification, regression or clustering) or a supported PMML model and use it to score data as part of an ETL transformation. In an operational scenario the predictive performance of a model may decrease over time. This can be caused by changes in the underlying distribution of the data and is sometimes referred to as "concept drift." The second WEKA-specific step for PDI, shown in Figure 12, allows the user to execute an entire Knowledge Flow process as part of a transformation. This enables automated periodic recreation or refreshing of a model. Since PDI transformations can be executed and used as a source of data by the Pentaho BI server, the results of data mining can be incorporated into an overall BI process and used in reports, dashboards and analysis views.

⁴ http://kettle.pentaho.org/

Figure 11: Scoring open sales opportunities as part of an ETL transformation.

Figure 12: Refreshing a predictive model using the Knowledge Flow PDI component.

7. CONCLUSIONS

The WEKA project has come a long way in the 16 years that have elapsed since its inception in 1992. The success it has enjoyed is testament to the passion of its community and many contributors. Releasing WEKA as open source software and implementing it in Java has played no small part in its success. These two factors ensure that it remains maintainable and modifiable irrespective of the commitment or health of any particular institution or company.

8. ACKNOWLEDGMENTS

Many thanks to past and present members of the Waikato machine learning group and the external contributors for all the work they have put into WEKA.

[29] N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. In Proceedings of the 25th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 129-136, 2002.
[30] J. Su, H. Zhang, C. X. Ling, and S. Matwin. Discriminative parameter learning for Bayesian networks. In ICML 2008, 2008.
[31] D. Talia, P. Trunfio, and O. Verta. Weka4WS: a WSRF-enabled Weka toolkit for distributed data mining on grids. In Proc. of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2005), pages 309-320. Springer-Verlag, 2005.
[32] K. M. Ting and I. H. Witten. Stacking bagged and dagged models. In D. H. Fisher, editor, Fourteenth International Conference on Machine Learning, pages 367-375, San Francisco, CA, 1997. Morgan Kaufmann Publishers.
[33] J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37-57, 1985.
[34] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 2000.
[35] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2005.
[36] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning. Kea: Practical automatic keyphrase extraction. In Y.-L. Theng and S. Foo, editors, Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, pages 129-152. Information Science Publishing, London, 2005.
[37] X. Xu. Statistical learning in multiple instance problems. Master's thesis, Department of Computer Science, University of Waikato, 2003.
[38] Y. Yang, X. Guan, and J. You. CLOPE: a fast and effective clustering algorithm for transactional data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 682-687. ACM New York, NY, USA, 2002.
[39] F. Zheng and G. I. Webb. Efficient lazy elimination for averaged-one dependence estimators. In Proceedings of the Twenty-third International Conference on Machine Learning (ICML 2006), pages 1113-1120. ACM Press, 2006.

SIGKDD Explorations, Volume 11, Issue 1