Figure 1: The WEKA Explorer user interface.

Figure 2: The WEKA Knowledge Flow user interface.

Figure 1. Data can be loaded from various sources, including files, URLs and databases. Supported file formats include WEKA's own ARFF format, CSV, LibSVM's format, and C4.5's format. It is also possible to generate data using an artificial data source and edit data manually using a dataset editor.

The second panel in the Explorer gives access to WEKA's classification and regression algorithms. The corresponding panel is called "Classify" because regression techniques are viewed as predictors of "continuous classes". By default, the panel runs a cross-validation for a selected learning algorithm on the dataset that has been prepared in the Preprocess panel to estimate predictive performance. It also shows a textual representation of the model built from the full dataset. However, other modes of evaluation, e.g. based on a separate test set, are also supported. If applicable, the panel also provides access to graphical representations of models, e.g. decision trees. Moreover, it can visualize prediction errors in scatter plots, and also allows evaluation via ROC curves and other "threshold curves". Models can also be saved and loaded in this panel.

Along with supervised algorithms, WEKA also supports application of unsupervised algorithms, namely clustering algorithms and methods for association rule mining. These are accessible in the Explorer via the third and fourth panel respectively. The "Cluster" panel enables users to run a clustering algorithm on the data loaded in the Preprocess panel. It provides simple statistics for evaluation of clustering performance: likelihood-based performance for statistical clustering algorithms and comparison to "true" cluster membership if this is specified in one of the attributes in the data. If applicable, visualization of the clustering structure is also possible, and models can be stored persistently if necessary.

WEKA's support for clustering tasks is not as extensive as its support for classification and regression, but it has more techniques for clustering than for association rule mining, which has up to this point been somewhat neglected. Nevertheless, it does contain an implementation of the most well-known algorithm in this area, as well as a few other ones. These methods can be accessed via the "Associate" panel in the Explorer.

Perhaps one of the most important tasks in practical data mining is the task of identifying which attributes in the data are the most predictive ones. To this end, WEKA's Explorer has a dedicated panel for attribute selection, "Select attributes", which gives access to a wide variety of algorithms and evaluation criteria for identifying the most important attributes in a dataset. Because different search methods can be combined with different evaluation criteria, a wide range of candidate techniques can be configured. Robustness of the selected attribute set can be validated via a cross-validation-based approach.

Note that the attribute selection panel is primarily designed for exploratory data analysis. WEKA's "FilteredClassifier" (accessible via the Classify panel) should be used to apply attribute selection techniques in conjunction with an underlying classification or regression algorithm to avoid introducing optimistic bias in the performance estimates obtained. This caveat also applies to some of the preprocessing tools (more specifically, the supervised ones) that are available from the Preprocess panel.

In many practical applications, data visualization provides important insights. These may even make it possible to avoid further analysis using machine learning and data mining algorithms. But even if this is not the case, they may inform the process of selecting an appropriate algorithm for the problem at hand. The last panel in the Explorer, called "Visualize", provides a color-coded scatter plot matrix, along with the option of drilling down by selecting individual plots in this matrix and selecting portions of the data to visualize. It is also possible to obtain information regarding individual data points, and to randomly perturb data by a chosen amount to uncover obscured data.

The Explorer is designed for batch-based data processing: training data is loaded into memory in its entirety and then processed. This may not be suitable for problems involving large datasets. However, WEKA does have implementations of some algorithms that allow incremental model building, which can be applied in incremental mode from a command-line interface. The incremental nature of these algorithms is ignored in the Explorer, but can be exploited using a more recent addition to WEKA's set of graphical user interfaces, namely the so-called "Knowledge Flow", shown in Figure 2. Most tasks that can be tackled with the Explorer can also be handled by the Knowledge Flow. However, in addition to batch-based training, its data flow model enables incremental updates with processing nodes that can load and preprocess individual instances before feeding them into appropriate incremental learning algorithms. It also provides nodes for visualization and evaluation. Once a set-up of interconnected processing nodes has been configured, it can be saved for later re-use.

The third main graphical user interface in WEKA is the "Experimenter" (see Figure 3). This interface is designed to facilitate experimental comparison of the predictive performance of algorithms based on the many different evaluation criteria that are available in WEKA. Experiments can involve multiple algorithms that are run across multiple datasets; for example, using repeated cross-validation. Experiments can also be distributed across different compute nodes in a network to reduce the computational load for individual nodes. Once an experiment has been set up, it can be saved in either XML or binary form, so that it can be re-visited if necessary. Configured and saved experiments can also be run from the command-line.

Figure 3: The WEKA Experimenter user interface.

Compared to WEKA's other user interfaces, the Experimenter is perhaps used less frequently by data mining practitioners. However, once preliminary experimentation has been performed in the Explorer, it is often much easier to identify a suitable algorithm for a particular dataset, or collection of datasets, using this alternative interface.

We would like to conclude this brief exposition of WEKA's main graphical user interfaces by pointing out that, regardless of which user interface is desired, it is important to provide the Java virtual machine that is used to run WEKA with a sufficient amount of heap space. The need to pre-specify the amount of memory required, which should be set lower than the amount of physical
memory of the machine that is used, to avoid swapping, is perhaps the biggest stumbling block to the successful application of WEKA in practice. On the other hand, considering running time, there is no longer a significant disadvantage compared to programs written in C, a commonly-heard argument against Java for data-intensive processing tasks, due to the sophistication of just-in-time compilers in modern Java virtual machines.

3. HISTORY OF THE WEKA PROJECT

The WEKA project was funded by the New Zealand government from 1993 up until recently. The original funding application was lodged in late 1992 and stated the project's goals as:

"The programme aims to build a state-of-the-art facility for developing techniques of machine learning and investigating their application in key areas of the New Zealand economy. Specifically we will create a workbench for machine learning, determine the factors that contribute towards its successful application in the agricultural industries, and develop new methods of machine learning and ways of assessing their effectiveness."

The first few years of the project focused on the development of the interface and infrastructure of the workbench. Most of the implementation was done in C, with some evaluation routines written in Prolog, and the user interface produced using TCL/TK. During this time the WEKA¹ acronym was coined and the Attribute Relation File Format (ARFF) used by the system was created.

Figure 4: Back then: the WEKA 2.1 workbench user interface.

The first release of WEKA was internal and occurred in 1994. The software was very much at beta stage. The first public release (at version 2.1) was made in October 1996. Figure 4 shows the main user interface for WEKA 2.1. In July 1997, WEKA 2.2 was released. It included eight learning algorithms (implementations of which were provided by their original authors) that were integrated into WEKA using wrappers based on shell scripts and data pre-processing tools written in C. WEKA 2.2 also sported a facility, based on Unix Makefiles, for configuring and running large-scale experiments based on these algorithms.

By now it was becoming increasingly difficult to maintain the software. Factors such as changes to supporting libraries, management of dependencies and complexity of configuration made the job difficult for the developers and the installation experience frustrating for users. At about this time it was decided to rewrite the system entirely in Java, including implementations of the learning algorithms. This was a somewhat radical decision given that Java was less than two years old at the time. Furthermore, the runtime performance of Java made it a questionable choice for implementing computationally intensive machine learning algorithms. Nevertheless, it was decided that advantages such as "Write Once, Run Anywhere" and simple packaging and distribution outweighed these shortcomings and would facilitate wider acceptance of the software.

May 1998 saw the final release of the TCL/TK-based system (WEKA 2.3) and, at the middle of 1999, the 100% Java WEKA 3.0 was released. This non-graphical version of WEKA accompanied the first edition of the data mining book by Witten and Frank [34]. In November 2003, a stable version of WEKA (3.4) was released in anticipation of the publication of the second edition of the book [35]. In the time between 3.0 and 3.4, the three main graphical user interfaces were developed.

In 2005, the WEKA development team received the SIGKDD Data Mining and Discovery Service Award [22]. The award

¹ The Weka is also an indigenous bird of New Zealand. Like the well-known Kiwi, it is flightless.

Figure 8: The Bayesian network editor.

Figure 9: The Explorer with an "Experiment" tab added from a plugin.

using WEKA's data generator tools. Artificial data suitable for classification can be generated from decision lists, radial-basis function networks and Bayesian networks as well as the classic LED24 domain. Artificial regression data can be generated according to mathematical expressions. There are also several generators for producing artificial data for clustering purposes.

The Knowledge Flow interface has also been improved: it now includes a new status area that can provide feedback on the operation of multiple components in a data mining process simultaneously. Other improvements to the Knowledge Flow include support for association rule mining, improved support for visualizing multiple ROC curves and a plugin mechanism.

4.5 Extensibility

A number of plugin mechanisms have been added to WEKA since version 3.4. These allow WEKA to be extended in various ways without having to modify the classes that make up the WEKA distribution. New tabs in the Explorer are easily added by writing a class that extends javax.swing.JPanel and implements the interface weka.gui.explorer.Explorer.ExplorerPanel. Figure 9 shows the Explorer with a new tab, provided by a plugin, for running simple experiments. Similar mechanisms allow new visualizations for classifier errors, predictions, trees and graphs to be added to the pop-up menu available in the history list of the Explorer's "Classify" panel. The Knowledge Flow has a plugin mechanism that allows new components to be incorporated by simply adding their jar file (and any necessary supporting jar files) to the .knowledgeFlow/plugins directory in the user's home directory. These jar files are loaded automatically when the Knowledge Flow is started and the plugins are made available from a "Plugins" tab.

Figure 10: A PMML radial basis function network loaded into the Explorer.

4.6 Standards and Interoperability

WEKA 3.6 includes support for importing PMML (Predictive Modeling Markup Language) models. PMML is a vendor-agnostic, XML-based standard for expressing statistical and data mining models that has gained widespread support from both proprietary and open-source data mining vendors. WEKA 3.6 supports import of PMML regression, general regression and neural network model types. Import of further model types, along with support for exporting PMML, will be added in future releases of WEKA. Figure 10 shows a PMML radial basis function network, created by the Clementine system, loaded into the Explorer.

Another new feature in WEKA 3.6 is the ability to read and write data in the format used by the well known LibSVM and SVM-Light support vector machine implementations [5]. This complements the new LibSVM and LibLINEAR wrapper classifiers.

5. PROJECTS BASED ON WEKA

There are many projects that extend or wrap WEKA in some fashion. At the time of this writing, there are 46 such projects listed on the Related Projects web page of the WEKA site³. Some of these include:

Systems for natural language processing. There are a number of tools that use WEKA for natural language processing: GATE is an
NLP workbench [11]; Balie performs language identification, tokenization, sentence boundary detection and named-entity recognition [21]; Senseval-2 is a system for word sense disambiguation; Kea is a system for automatic keyphrase extraction [36].

Knowledge discovery in biology. Several tools using or based on WEKA have been developed to aid data analysis in biological applications: BioWEKA is an extension to WEKA for tasks in biology, bioinformatics, and biochemistry [14]; the Epitopes Toolkit (EpiT) is a platform based on WEKA for developing epitope prediction tools; maxdView and Mayday [7] provide visualization and analysis of microarray data.

Distributed and parallel data mining. There are a number of projects that have extended WEKA for distributed data mining: Weka-Parallel provides a distributed cross-validation facility [4]; GridWeka provides distributed scoring and testing as well as cross-validation [18]; FAEHIM and Weka4WS [31] make WEKA available as a web service.

Open-source data mining systems. Several well known open-source data mining systems provide plugins to allow access to WEKA's algorithms. The Konstanz Information Miner (KNIME) and RapidMiner [20] are two such systems. The R [23] statistical computing environment also provides an interface to WEKA through the RWeka [16] package.

Scientific workflow environment. The Kepler Weka project integrates all the functionality of WEKA into the Kepler [1] open-source scientific workflow platform.

³ http://www.cs.waikato.ac.nz/ml/weka/index_related.html

6. INTEGRATION WITH THE PENTAHO BI SUITE

Pentaho Corporation is a provider of commercial open source business intelligence software. The Pentaho BI suite consists of reporting, interactive analysis, dashboards, ETL/data integration and data mining. Each of these is a separate open source project, and they are tied together by an enterprise-class open source BI platform. In late 2006, WEKA was adopted as the data mining component of the suite and since then has been integrated into the platform.

The main point of integration between WEKA and the Pentaho platform is with Pentaho Data Integration (PDI), also known as the Kettle project⁴. PDI is a streaming, engine-driven ETL tool. Its rich set of extract and transform operations, combined with support for a large variety of databases, is a natural complement to WEKA's data filters. PDI can easily export datasets in WEKA's native ARFF format to be used immediately for model creation.

Several WEKA-specific transformation steps have been created so that PDI can access WEKA algorithms and be used as both a scoring platform and a tool to automate model creation. The first of these, shown in Figure 11, is called "Weka Scoring." It enables the user to import a serialized WEKA model (classification, regression or clustering) or a supported PMML model and use it to score data as part of an ETL transformation. In an operational scenario the predictive performance of a model may decrease over time. This can be caused by changes in the underlying distribution of the data and is sometimes referred to as "concept drift." The second WEKA-specific step for PDI, shown in Figure 12, allows the user to execute an entire Knowledge Flow process as part of a transformation. This enables automated periodic recreation or refreshing of a model. Since PDI transformations can be executed and used as a source of data by the Pentaho BI server, the results of data mining can be incorporated into an overall BI process and used in reports, dashboards and analysis views.

⁴ http://kettle.pentaho.org/

Figure 11: Scoring open sales opportunities as part of an ETL transformation.

Figure 12: Refreshing a predictive model using the Knowledge Flow PDI component.

7. CONCLUSIONS

The WEKA project has come a long way in the 16 years that have elapsed since its inception in 1992. The success it has enjoyed is testament to the passion of its community and many contributors. Releasing WEKA as open source software and implementing it in Java has played no small part in its success. These two factors ensure that it remains maintainable and modifiable irrespective of the commitment or health of any particular institution or company.

8. ACKNOWLEDGMENTS

Many thanks to past and present members of the Waikato machine learning group and the external contributors for all the work they have put into WEKA.

[29] N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. In Proceedings of the 25th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 129-136, 2002.

[30] J. Su, H. Zhang, C. X. Ling, and S. Matwin. Discriminative parameter learning for Bayesian networks. In ICML 2008, 2008.

[31] D. Talia, P. Trunfio, and O. Verta. Weka4WS: a WSRF-enabled Weka toolkit for distributed data mining on grids. In Proc. of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2005), pages 309-320. Springer-Verlag, 2005.

[32] K. M. Ting and I. H. Witten. Stacking bagged and dagged models. In D. H. Fisher, editor, Fourteenth International Conference on Machine Learning, pages 367-375, San Francisco, CA, 1997. Morgan Kaufmann Publishers.

[33] J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37-57, 1985.

[34] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 2000.

[35] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2005.

[36] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning. Kea: Practical automatic keyphrase extraction. In Y.-L. Theng and S. Foo, editors, Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, pages 129-152. Information Science Publishing, London, 2005.

[37] X. Xu. Statistical learning in multiple instance problems. Master's thesis, Department of Computer Science, University of Waikato, 2003.

[38] Y. Yang, X. Guan, and J. You. CLOPE: a fast and effective clustering algorithm for transactional data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 682-687. ACM New York, NY, USA, 2002.

[39] F. Zheng and G. I. Webb. Efficient lazy elimination for averaged-one dependence estimators. In Proceedings of the Twenty-third International Conference on Machine Learning (ICML 2006), pages 1113-1120. ACM Press, 2006.

SIGKDD Explorations, Volume 11, Issue 1