{bittorf, arun, leonn, chrisre, czhang}@cs.wisc.edu
University of Michigan, Ann Arbor; University of Wisconsin-Madison

ABSTRACT
A new generation of data processing systems, including web search, Google's Knowledge Graph, IBM's Watson, and several different recomm...
Teams that had it basically wrong, but for a few good ideas, made the difference when combined with teams which had it basically right... The top two teams beat the challenge by combining teams and their algorithms into more complex algorithms incorporating everybody's work. The more people joined, the more the resulting team's score would increase.

Unfortunately, writing features can be extremely painful. On the surface, building a feature may seem to be just another software engineering task, albeit one that is run over very large datasets. However, in our experience engineering features requires dramatically more iteration and adjustment. A completed feature is often a small piece of code that is relatively easy to reproduce and does not reflect the large amount of code that was written, tested, and thrown away. We note three ways in which feature engineering is uniquely excruciating:

1. Grunt-Work Statistics. First, there is a lot of statistical "grunt work." A feature processes a large dataset (e.g., the set of all web page titles) whose characteristics are often wholly novel to the developer. Even simple features are hard to write without some commodity metadata, such as frequency distributions of unique values, lists of outlier values, and some simple visualizations. Today the feature developer performs this grunt work for every new feature, by hand.

2. Unknowable Specs. Most interesting datasets are large and noisy, making the actual feature code "spec" nearly unknowable without repeated testing against the data itself. For example, it is easy to informally describe a feature such as, "A social media user's name is somewhat indicative of the user's age." That is, a user with the name Brittany is relatively unlikely to be elderly in 2013. But implementing this feature entails implementing a first version, then learning that usernames often have numbers and adjectives appended to a human first name, then writing code to strip off these suffixes, then testing the code again.

3. Unexpected Failure. Even a perfectly implemented feature can be an unexpected failure, either because it does not capture any useful information or because its information is already captured by a previously implemented feature. For example, it may be that the name-based method above is helpful when considered alone, but does not add any predictive power beyond text-based methods (say, counting the number of "lol"s in a social media user's status updates). If the text-based method had been implemented first, then all the work to implement the name feature is wasted effort.

These three burdens of feature engineering (grunt-work statistics, unknowable specs, and unexpected failure) turn the developer's life into an endless cycle of small iterative code changes and tests.

Moreover, most feature code runs over huge datasets that require scalable "big data" systems. The few systems that support the user-defined code necessary for feature engineering (e.g., MapReduce [2]) have generally emphasized throughput over latency. Cluster software thus forces developers to endure high-latency waits (perhaps hours or even days) in the "inner loop" of the feature development cycle. As a result, feature engineering is a time-consuming and draining experience.

We envision a system, brainwash,¹ to dramatically improve the productivity of feature engineers. It provides pervasive programmer hints to address the constant iteration associated with feature development. In contrast to the hints provided by today's IDEs (which are derived from code snippets or header files), hints in brainwash are derived from the data under inspection as well as code written by other brainwash developers. It is a multiuser system that has three phases:

1. In Explore, brainwash speeds the feature engineer through grunt-work statistics by automatically providing commodity information like frequency distributions and automatically-chosen illustrative samples from the dataset.

2. In Extract, brainwash attempts to recommend a feature that is both likely to yield benefits and roughly compatible with the developer's code so far. brainwash accomplishes this by repurposing semi- and unsupervised methods for feature induction as recommendation methods.

3. In Evaluate, brainwash enables the engineer to evaluate how well or poorly their new features perform in the context of the entire trained system. The challenge for brainwash is to run and evaluate features as quickly as possible. It does so by speculatively executing code it thinks the user will want to run in the future; while features can be arbitrary pieces of code, it is possible to make educated guesses about what the code actually does even without directly understanding the source. For example, two functions that were written by the same user just a few minutes apart are likely to be small modifications of one another.

By increasing the velocity of the developer in each phase of the Explore-Extract-Evaluate loop, brainwash aims to improve developer effectiveness during feature engineering.

In the next section we review two case studies from our own work, illustrating problems with feature engineering today and showing how it could be improved. In Section 3, we propose brainwash and describe its design.

2. CASE STUDIES

We now describe two feature engineering case studies from trained systems being built by the authors of this paper. GeoDeepDive integrates scientific data for geoscientists. Automan aims to reproduce national economic statistics using social media activity.

2.1 GeoDeepDive

Today, an individual geologist has a microview of geoscience: she has access to measurements from at most a handful of the approximately 30,000 geographical units in North America, usually data collected in her own and partner labs. Other labs' data is buried in the text, figures, and

¹ brainwash is a reference to the artist Mr. Brainwash, or MBW, who mass-produces art in the mold of famous street artist Banksy. Similarly, we propose to mass-produce the artistry required by today's trained systems.
We treat feature development as a workflow of developer-written functions udf_0, udf_1, ..., udf_N. A "run" i consists of applying udf_i to each tuple in the input dataset (e.g., each web page, each academic paper, or each Tweet). Like the map() from MapReduce, a udf function invocation takes a single tuple from the input and yields zero or more tuples that are placed into the run's output.⁴ Explicit schemas allow brainwash to guess what the udf is doing, even if the function consists entirely of opaque compiled code.

We envision brainwash as a centralized system that supports many developers and projects simultaneously. Even if developers do not explicitly collaborate, the system stores and sometimes repurposes developers' work. brainwash can thus exploit knowledge about the udf workflow, the dataset, and developer activity to address each of the three problems encountered in feature engineering.

1. Explore: Grunt-Work Statistics. brainwash can assist with statistics collection by leveraging the formal schema and previous users' udfs.

Exploiting schema information for the raw input file should be straightforward: the system can automatically read the input, counting tuples and fields as it goes. The result should be a rough overview similar to what is offered in data integration tools such as Google Refine. However, this schema information is not sufficient when the developer wants to see statistics on "derived objects" that come from running udfs on a subfield of the raw input, such as the named entities embedded in a web page.

We can automate statistics on derived objects by examining the output schemas and datasets from previous udfs that share an input schema with the current one under development. For example, consider a developer who has written one udf that transforms a raw input file of web page tuples into a series of paragraph instances and then writes a series of different udfs that transform paragraph into a large range of outputs with different schema formats. The body of udfs not only provides evidence that paragraph is an interesting and broadly useful object; it also provides brainwash with the code necessary to produce paragraph. Further, we do not need to limit these automated statistics to the user who actually wrote the paragraph-producing code: it can be used to help any programmer who needs to process web pages.

2. Extract: Unknowable Specs. brainwash should preemptively suggest code that will address unknowable specification problems before the developer even encounters them. For example, any developer who needs to process first names in social media is likely to want to remove all the strange username suffixes described in Section 1.

We can do this by pairing schema matching with code suggestion. Imagine a user A who has written udf^A_i, which extracts usernames from Tweets, as well as udf^A_{i+1}, which normalizes the usernames. Now user B writes a function udf^B_j that has an input schema similar to udf^A_i and generates output that is similar to that produced by udf^A_i. brainwash should then suggest that B run udf^A_{i+1} and can present sample input/output pairs to explain why.

There may be a large number of candidate functions that B could apply; ranking these will be a core technical challenge. Like other ranking systems, udf suggestion should improve as developers use the system, providing additional evidence about which udfs are appropriate in which cases.

⁴ brainwash also allows aggregation, similar to reduce(), but we omit our description of this mechanism to simplify our exposition.

3. Evaluate: Unexpected Failure. Improving the overall performance of the trained system is, in the end, the feature developer's real goal. But measuring a feature's real contribution to the end-to-end system can entail training and testing many different permutations of features, each step of which may need to process a huge dataset. As a result, developers sidestep system-wide evaluation for far too long in the development process. brainwash encourages frequent and full evaluation by saving data on previous feature permutations, and speculatively training statistical systems using features under development.

The system automatically formulates code according to the above three features, then sends the results to the user's IDE. In many of the above cases, the system both formulates code and speculatively executes it on the user's behalf.

This speculative formulate-and-execute cycle is what allows brainwash to not just give programmer hints, but reduce execution latency as well. In some cases, such as the explore component above, brainwash reduces user-apparent latency simply by removing the work from the user's to-do list: we do the work when there is slack time in the back-end processing system. In other cases, as with extract, the user's coding target is unclear; the system tries to guess it, executes the resulting code, and hopes that the effort was not wasted. Even if the user's goal is not predicted, the system can use previous iterations of the current udf to determine which input tuples are likely to yield useful outputs and prioritize these during processing.

4. CURRENT STATUS AND NEXT STEPS

We are building an initial prototype of brainwash based on the code of GeoDeepDive, Automan, and an earlier web-scale prototype called DeepDive (http://hazy.cs.wisc.edu/deepdive). Our plan is to refine brainwash and these active research projects in parallel.

brainwash raises several research questions. One that we have examined is the efficacy of different feature-engineering methods. Earlier this year, we performed the first study that compares two popular methods used in the explore-extract-evaluate loop, distant supervision and crowdsourcing, to extract relationships at web scale [7]. This study sparked our interest in using distant supervision to recommend features in the explore and extract phases. We plan to use brainwash as a vehicle to continue such studies.

5. REFERENCES

[1] E. V. Buskirk. How the Netflix Prize Was Won. Wired, 2009.
[2] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137-150, 2004.
[3] D. Ferrucci. An Overview of the DeepQA Project. AI Magazine, 2012.
[4] A. Y. Halevy, P. Norvig, and F. Pereira. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 2009.
[5] S. Levy. How Google's Algorithm Rules the Web. Wired, 2010.
[6] https://developers.google.com/protocol-buffers/.
[7] C. Zhang, F. Niu, C. Ré, and J. Shavlik. Big Data versus the Crowd: Looking for Relationships in All the Right Places. In ACL, 2012.
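As an illustration of the run model described in Section 3, the following minimal Python sketch shows a map()-like run that applies a udf to each input tuple and collects zero or more output tuples, followed by the kind of frequency distribution the Explore phase would provide. It is not taken from the brainwash prototype: the function names (run, extract_username, normalize_username, field_frequencies), the representation of tuples as dictionaries, and the sample data are all assumptions made for illustration. The normalization strips only trailing digits and underscores, a simplification of the suffix problem described in Section 1.

```python
import re
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator

# Assumption: a tuple is modeled as a dict of named string fields.
Row = Dict[str, str]
Udf = Callable[[Row], Iterable[Row]]


def run(udf: Udf, dataset: Iterable[Row]) -> Iterator[Row]:
    """Apply udf to every input tuple; each call yields zero or more tuples."""
    for row in dataset:
        yield from udf(row)


def extract_username(tweet: Row) -> Iterable[Row]:
    """Hypothetical udf: emit the tweet's username, or nothing if it is empty."""
    name = tweet.get("user", "")
    if name:
        yield {"username": name}


def normalize_username(row: Row) -> Iterable[Row]:
    """Hypothetical udf: strip trailing digits/underscores and lowercase."""
    base = re.sub(r"[\d_]+$", "", row["username"]).lower()
    if base:
        yield {"first_name": base}


def field_frequencies(rows: Iterable[Row], field: str) -> Counter:
    """Grunt-work statistic: frequency distribution of one output field."""
    return Counter(row[field] for row in rows if field in row)


# Made-up sample input, for illustration only.
tweets = [
    {"user": "Brittany1994", "text": "lol"},
    {"user": "brittany_", "text": "hello"},
    {"user": "", "text": "no user"},
    {"user": "Walter42", "text": "good morning"},
]

usernames = list(run(extract_username, tweets))
names = list(run(normalize_username, usernames))
freq = field_frequencies(names, "first_name")
print(freq.most_common())  # [('brittany', 2), ('walter', 1)]
```

Chaining the two runs mirrors the udf^A_i and udf^A_{i+1} example: the second udf consumes tuples whose schema matches the first udf's output, which is the kind of compatibility a suggestion mechanism could look for.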