Brainwash: A Data System for Feature Engineering
Michael Anderson, Dolan Antenucci, Victor


University of Michigan, Ann Arbor; University of Wisconsin-Madison

ABSTRACT
A new generation of data processing systems, including web search, Google's Knowledge Graph, IBM's Watson, and several different recomm...


Teams that had it basically wrong, but for a few good ideas, made the difference when combined with teams which had it basically right... The top two teams beat the challenge by combining teams and their algorithms into more complex algorithms incorporating everybody's work. The more people joined, the more the resulting team's score would increase.

Unfortunately, writing features can be extremely painful. On the surface, building a feature may seem to be just another software engineering task, albeit one that is run over very large datasets. However, in our experience, engineering features requires dramatically more iteration and adjustment. A completed feature is often a small piece of code that is relatively easy to reproduce and does not reflect the large amount of code that was written, tested, and thrown away. We note three ways in which feature engineering is uniquely excruciating:

1. Grunt-Work Statistics. First, there is a lot of statistical "grunt work." A feature processes a large dataset (e.g., the set of all web page titles) whose characteristics are often wholly novel to the developer. Even simple features are hard to write without some commodity metadata, such as frequency distributions of unique values, lists of outlier values, and some simple visualizations. Today the feature developer performs this grunt work for every new feature, by hand.

2. Unknowable Specs. Most interesting datasets are large and noisy, making the actual feature code "spec" nearly unknowable without repeated testing against the data itself. For example, it is easy to informally describe a feature such as, "A social media user's name is somewhat indicative of the user's age." That is, a user with the name Brittany is relatively unlikely to be elderly in 2013. But implementing this feature entails implementing a first version, then learning that usernames often have numbers and adjectives appended to a human first name, then writing code to strip off these suffixes, then testing the code again.

3. Unexpected Failure. Even a perfectly implemented feature can be an unexpected failure, either because it does not capture any useful information or because its information is already captured by a previously implemented feature. For example, it may be that the name-based method above is helpful when considered alone, but does not add any predictive power beyond text-based methods (say, counting the number of "lol"s in a social media user's status updates). If the text-based method had been implemented first, then all the work to implement the name feature is wasted effort.

These three burdens of feature engineering (grunt-work statistics, unknowable specs, and unexpected failure) turn the developer's life into an endless cycle of small iterative code changes and tests. Moreover, most feature code runs over huge datasets that require scalable "big data" systems. The few systems that support the user-defined code necessary for feature engineering (e.g., MapReduce [2]) have generally emphasized throughput over latency. Cluster software thus forces developers to endure high-latency waits (perhaps hours or even days) in the "inner loop" of the feature development cycle. As a result, feature engineering is a time-consuming and draining experience.

We envision a system, brainwash (see footnote 1), to dramatically improve the productivity of feature engineers. It provides pervasive programmer hints to address the constant iteration associated with feature development. In contrast to the hints provided by today's IDEs (which are derived from code snippets or header files), hints in brainwash are derived from the data under inspection as well as code written by other brainwash developers. It is a multiuser system that has three phases:

1. In Explore, brainwash speeds the feature engineer through grunt-work statistics by automatically providing commodity information like frequency distributions and automatically chosen illustrative samples from the dataset.

2. In Extract, brainwash attempts to recommend a feature that is both likely to yield benefits and roughly compatible with the developer's code so far. brainwash accomplishes this by repurposing semi- and unsupervised methods for feature induction as recommendation methods.

3. In Evaluate, brainwash enables the engineer to evaluate how well or poorly their new features perform in the context of the entire trained system. The challenge for brainwash is to run and evaluate features as quickly as possible. It does so by speculatively executing code it thinks the user will want to run in the future; while features can be arbitrary pieces of code, it is possible to make educated guesses about what the code actually does even without directly understanding the source. For example, two functions that were written by the same user just a few minutes apart are likely to be small modifications of one another.

By increasing the velocity of the developer in each phase of the Explore-Extract-Evaluate loop, brainwash aims to improve developer effectiveness during feature engineering.

In the next section we review two case studies from our own work, illustrating problems with feature engineering today and showing how it could be improved. In Section 3, we propose brainwash and describe its design.

2. CASE STUDIES

We now describe two feature engineering case studies from trained systems being built by the authors of this paper. GeoDeepDive integrates scientific data for geoscientists. Automan aims to reproduce national economic statistics using social media activity.

2.1 GeoDeepDive

Today, an individual geologist has a microview of geoscience: she has access to measurements from at most a handful of the approximately 30,000 geographical units in North America, usually data collected in her own and partner labs. Other labs' data is buried in the text, figures, and

Footnote 1: brainwash is a reference to the artist Mr. Brainwash, or MBW, who mass-produces art in the mold of famous street artist Banksy. Similarly, we propose to mass-produce the artistry required by today's trained systems.
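To make the "Unknowable Specs" and "Grunt-Work Statistics" burdens from Section 1 concrete, the following is a minimal sketch of the username feature described there, together with the kind of frequency distribution a developer would otherwise compute by hand. It is not code from brainwash: the function names and the `SUFFIX_WORDS` list are assumptions made for illustration, and a realistic word list would have to be derived from the data itself.

```python
import re
from collections import Counter

# Hypothetical list of adjectives that users append to a first name.
# In practice this list is exactly the "unknowable spec": it is only
# discovered by repeatedly testing against the real data.
SUFFIX_WORDS = {"cute", "cool", "xoxo", "official"}


def normalize_username(username: str) -> str:
    """Return a lowercase first-name guess with trailing digits,
    separators, and known suffix words removed."""
    name = username.lower()
    changed = True
    while changed:
        changed = False
        # Strip trailing digits and separator characters.
        stripped = re.sub(r"[\d_.\-]+$", "", name)
        if stripped != name:
            name, changed = stripped, True
        # Strip one known suffix word, but never the whole name.
        for word in SUFFIX_WORDS:
            if name.endswith(word) and len(name) > len(word):
                name, changed = name[: -len(word)], True
                break
    return name


def frequency_distribution(values):
    """Return (value, count) pairs, most frequent first."""
    return Counter(values).most_common()


usernames = ["Brittany1995", "brittany_cute22", "Mike_07"]
names = [normalize_username(u) for u in usernames]
print(frequency_distribution(names))
```

Each rule in `normalize_username` corresponds to one round of the write, test, and revise cycle the text describes; the sketch makes no claim that these rules are sufficient for real social media data.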
We treat feature development as a workflow of developer-written functions udf_0, udf_1, ..., udf_N. A "run" i consists of applying udf_i to each tuple in the input dataset (e.g., each web page, each academic paper, or each Tweet). Like the map() from MapReduce, a udf function invocation takes a single tuple from the input and yields zero or more tuples that are placed into the run's output (see footnote 4). Explicit schemas allow brainwash to guess what the udf is doing, even if the function consists entirely of opaque compiled code.

Footnote 4: brainwash also allows aggregation, similar to reduce(), but we omit our description of this mechanism to simplify our exposition.

We envision brainwash as a centralized system that supports many developers and projects simultaneously. Even if developers do not explicitly collaborate, the system stores and sometimes repurposes developers' work. brainwash can thus exploit knowledge about the udf workflow, the dataset, and developer activity to address each of the three problems encountered in feature engineering.

1. Explore: Grunt-Work Statistics. brainwash can assist with statistics collection by leveraging the formal schema and previous users' udfs. Exploiting schema information for the raw input file should be straightforward: the system can automatically read the input, counting tuples and fields as it goes. The result should be a rough overview similar to what is offered in data integration tools such as Google Refine. However, this schema information is not sufficient when the developer wants to see statistics on "derived objects" that come from running udfs on a subfield of the raw input, such as the named entities embedded in a web page.

We can automate statistics on derived objects by examining the output schemas and datasets from previous udfs that share an input schema with the current one under development. For example, consider a developer who has written one udf that transforms a raw input file of webpage tuples into a series of paragraph instances and then writes a series of different udfs that transform paragraph into a large range of outputs with different schema formats. The body of udfs not only provides evidence that paragraph is an interesting and broadly useful object, it also provides brainwash with the code necessary to produce paragraph. Further, we do not need to limit these automated statistics to the user who actually wrote the paragraph-producing code: that code can be used to help any programmer who needs to process web pages.

2. Extract: Unknowable Specs. brainwash should preemptively suggest code that will address unknowable specification problems before the developer even encounters them. For example, any developer who needs to process first names in social media is likely to want to remove all the strange username suffixes described in Section 1.

We can do this by pairing schema matching with code suggestion. Imagine a user A who has written udf^A_i, which extracts usernames from Tweets, as well as udf^A_{i+1}, which normalizes the usernames. Now user B writes a function udf^B_j that has an input schema similar to that of udf^A_i and generates output that is similar to that produced by udf^A_i. brainwash should then suggest that B run udf^A_{i+1} and can present sample input/output pairs to explain why.

There may be a large number of candidate functions that B could apply; ranking these will be a core technical challenge. Like other ranking systems, udf suggestion should improve as developers use the system, providing additional evidence about which udfs are appropriate in which cases.

3. Evaluate: Unexpected Failure. Improving the overall performance of the trained system is, in the end, the feature developer's real goal. But measuring a feature's real contribution to the end-to-end system can entail training and testing many different permutations of features, each step of which may need to process a huge dataset. As a result, developers sidestep system-wide evaluation for far too long in the development process. brainwash encourages frequent and full evaluation by saving data on previous feature permutations, and by speculatively training statistical systems using features under development.

The system automatically formulates code according to the above three features, then sends the results to the user's IDE. In many of the above cases, the system both formulates code and speculatively executes it on the user's behalf. This speculative formulate-and-execute cycle is what allows brainwash not just to give programmer hints, but to reduce execution latency as well. In some cases, such as the explore component above, brainwash reduces user-apparent latency simply by removing the work from the user's to-do list: we do the work when there is slack time in the back-end processing system. In other cases, as with extract, the user's coding target is unclear; the system tries to guess it, executes the resulting code, and hopes that the effort was not wasted. Even if the user's goal is not predicted, the system can use previous iterations of the current udf to determine which input tuples are likely to yield useful outputs and prioritize these during processing.

4. CURRENT STATUS AND NEXT STEPS

We are building an initial prototype of brainwash based on the code of GeoDeepDive, Automan, and an earlier web-scale prototype called DeepDive (http://hazy.cs.wisc.edu/deepdive). Our plan is to refine brainwash and these active research projects in parallel.

brainwash raises several research questions. One that we have examined is the efficacy of different feature-engineering methods. Earlier this year, we performed the first study that compares two popular methods used in the explore-extract-evaluate loop, distant supervision and crowdsourcing, to extract relationships at web scale [7]. This study sparked our interest in using distant supervision to recommend features in the explore and extract phases. We plan to use brainwash as a vehicle to continue such studies.

5. REFERENCES

[1] E. V. Buskirk. How the Netflix Prize Was Won. Wired, 2009.
[2] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137-150, 2004.
[3] D. Ferrucci. An Overview of the DeepQA Project. AI Magazine, 2012.
[4] A. Y. Halevy, P. Norvig, and F. Pereira. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 2009.
[5] S. Levy. How Google's Algorithm Rules the Web. Wired, 2010.
[6] https://developers.google.com/protocol-buffers/.
[7] C. Zhang, F. Niu, C. Re, and J. Shavlik. Big Data versus the Crowd: Looking for Relationships in All the Right Places. In ACL, 2012.
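As an illustration of the workflow model and the schema-based suggestion described in the design discussion (a udf maps one input tuple to zero or more output tuples, and a udf that followed a schema-compatible udf in another developer's workflow becomes a candidate suggestion), here is a minimal sketch. Everything in it is an assumption made for illustration rather than brainwash's actual design: the `Udf` class, the `run` and `suggest_next` functions, and in particular the use of exact schema equality where the text only requires schemas to be "similar". Ranking of candidates, which the text identifies as the core challenge, is not attempted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]
Schema = Tuple[str, ...]  # field names only; a simplifying assumption


@dataclass(frozen=True)
class Udf:
    """A developer-written function with explicit input/output schemas."""
    name: str
    owner: str
    input_schema: Schema
    output_schema: Schema
    fn: Callable[[Row], Iterable[Row]]


def run(udf: Udf, dataset: Iterable[Row]) -> Iterator[Row]:
    """Apply the udf to each input tuple; each call yields zero or more tuples."""
    for row in dataset:
        yield from udf.fn(row)


def suggest_next(new_udf: Udf, workflows: List[List[Udf]]) -> List[Udf]:
    """Suggest udfs that followed a schema-compatible udf in a stored workflow.

    Compatibility here is exact equality of input and output schemas,
    a stand-in for the similarity matching the text calls for.
    """
    suggestions: List[Udf] = []
    for workflow in workflows:
        for prev, nxt in zip(workflow, workflow[1:]):
            same_shape = (prev.input_schema == new_udf.input_schema
                          and prev.output_schema == new_udf.output_schema)
            if same_shape and nxt.owner != new_udf.owner and nxt not in suggestions:
                suggestions.append(nxt)
    return suggestions


# Hypothetical udfs mirroring the user A / user B example in the text.
extract_a = Udf(
    "extract_usernames", "A", ("tweet_text",), ("username",),
    lambda r: [{"username": w[1:]}
               for w in str(r["tweet_text"]).split() if w.startswith("@")],
)
normalize_a = Udf(
    "normalize_usernames", "A", ("username",), ("first_name",),
    lambda r: [{"first_name": str(r["username"]).rstrip("0123456789").lower()}],
)
extract_b = Udf("my_extract", "B", ("tweet_text",), ("username",), lambda r: [])

tweets = [{"tweet_text": "hi @Brittany95 and @bob"}, {"tweet_text": "no mentions"}]
print(list(run(extract_a, tweets)))
print([u.name for u in suggest_next(extract_b, [[extract_a, normalize_a]])])
```

Because the suggestion step looks only at the declared schemas and the stored workflow order, it never inspects the body of `fn`, which matches the text's point that explicit schemas let the system reason about opaque code.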