Brainwash: A Data System for Feature Engineering Michae

Uploaded On 2015-04-19


{bittorf, arun, leonn, chrisre, czhang}@cs.wisc.edu
University of Michigan, Ann Arbor; University of Wisconsin, Madison

ABSTRACT
A new generation of data processing systems, including web search, Google's Knowledge Graph, IBM's Watson, and several different recommendat


Teams that had it basically wrong, but for a few good ideas, made the difference when combined with teams which had it basically right... The top two teams beat the challenge by combining teams and their algorithms into more complex algorithms incorporating everybody's work. The more people joined, the more the resulting team's score would increase.

Unfortunately, writing features can be extremely painful. On the surface, building a feature may seem to be just another software engineering task, albeit one that is run over very large datasets. However, in our experience engineering features requires dramatically more iteration and adjustment. A completed feature is often a small piece of code that is relatively easy to reproduce and does not reflect the large amount of code that was written, tested, and thrown away. We note three ways in which feature engineering is uniquely excruciating:

1. Grunt-Work Statistics. First, there is a lot of statistical "grunt work." A feature processes a large dataset (e.g., the set of all web page titles) whose characteristics are often wholly novel to the developer. Even simple features are hard to write without some commodity metadata, such as frequency distributions of unique values, lists of outlier values, and some simple visualizations. Today the feature developer performs this grunt work for every new feature, by hand.

2. Unknowable Specs. Most interesting datasets are large and noisy, making the actual feature code "spec" nearly unknowable without repeated testing against the data itself. For example, it is easy to informally describe a feature such as, "A social media user's name is somewhat indicative of the user's age." That is, a user with the name Brittany is relatively unlikely to be elderly in 2013. But implementing this feature entails implementing a first version, then learning that usernames often have numbers and adjectives appended to a human first name, then writing code to strip off these suffixes, then testing the code again.

3. Unexpected Failure. Even a perfectly implemented feature can be an unexpected failure, either because it does not capture any useful information or because its information is already captured by a previously implemented feature. For example, it may be that the name-based method above is helpful when considered alone, but does not add any predictive power beyond text-based methods (say, counting the number of "lol"s in a social media user's status updates). If the text-based method had been implemented first, then all the work to implement the name feature is wasted effort.

These three burdens of feature engineering (grunt-work statistics, unknowable specs, and unexpected failure) turn the developer's life into an endless cycle of small iterative code changes and tests. Moreover, most feature code runs over huge datasets that require scalable "big data" systems. The few systems that support the user-defined code necessary for feature engineering (e.g., MapReduce [2]) have generally emphasized throughput over latency. Cluster software thus forces developers to endure high-latency waits (perhaps hours or even days) in the "inner loop" of the feature development cycle. As a result, feature engineering is a time-consuming and draining experience.

We envision a system, Brainwash [footnote 1], to dramatically improve the productivity of feature engineers. It provides pervasive programmer hints to address the constant iteration associated with feature development. In contrast to the hints provided by today's IDEs (which are derived from code snippets or header files), hints in Brainwash are derived from the data under inspection as well as code written by other Brainwash developers. It is a multiuser system that has three phases:

1. In Explore, Brainwash speeds the feature engineer through grunt-work statistics by automatically providing commodity information like frequency distributions and automatically-chosen illustrative samples from the dataset.

2. In Extract, Brainwash attempts to recommend a feature that is both likely to yield benefits and roughly compatible with the developer's code so far. Brainwash accomplishes this by repurposing semi- and unsupervised methods for feature induction as recommendation methods.

3. In Evaluate, Brainwash enables the engineer to evaluate how well or poorly their new features perform in the context of the entire trained system. The challenge for Brainwash is to run and evaluate features as quickly as possible. It does so by speculatively executing code it thinks the user will want to run in the future; while features can be arbitrary pieces of code, it is possible to make educated guesses about what the code actually does even without directly understanding the source. For example, two functions that were written by the same user just a few minutes apart are likely to be small modifications of one another.

By increasing the velocity of the developer in each phase of the Explore-Extract-Evaluate loop, Brainwash aims to improve developer effectiveness during feature engineering.

In the next section we review two case studies from our own work, illustrating problems with feature engineering today and showing how it could be improved. In Section 3, we propose Brainwash and describe its design.

2. CASE STUDIES

We now describe two feature engineering case studies from trained systems being built by the authors of this paper. GeoDeepDive integrates scientific data for geoscientists. Automan aims to reproduce national economic statistics using social media activity.

2.1 GeoDeepDive

Today, an individual geologist has a microview of geoscience: she has access to measurements from at most a handful of the approximately 30,000 geographical units in North America, usually data collected in her own and partner labs. Other labs' data is buried in the text, figures, and

Footnote 1: Brainwash is a reference to the artist Mr. Brainwash, or MBW, who mass-produces art in the mold of famous street artist Banksy. Similarly, we propose to mass-produce the artistry required by today's trained systems.
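To make the Section 1 username example concrete, here is a minimal Python sketch of the two chores it describes: stripping the numbers and adjectives appended to a first name, and computing a frequency distribution over the cleaned values. The adjective list, the separator characters, and the sample usernames are illustrative assumptions, not rules taken from the paper or from Brainwash.

```python
import re
from collections import Counter

# Hypothetical adjective list; in practice it would be discovered by
# repeatedly testing against the data, as the text describes.
ADJECTIVES = ("cool", "cute", "crazy", "sweet")


def strip_suffixes(username: str) -> str:
    """Remove trailing digits, separators, and known adjectives from a username."""
    name = username.lower()
    changed = True
    while changed:
        changed = False
        # Drop any run of trailing digits or separator characters.
        stripped = re.sub(r"[\d_.\-]+$", "", name)
        if stripped != name:
            name, changed = stripped, True
        # Drop a trailing adjective, but never reduce the name to nothing.
        for adj in ADJECTIVES:
            if name.endswith(adj) and len(name) > len(adj):
                name, changed = name[: -len(adj)], True
    return name


def frequency_distribution(values):
    """Commodity statistic: (value, count) pairs, most common first."""
    return Counter(values).most_common()


# Illustrative usernames (made up for this sketch).
usernames = ["Brittany92", "brittany_cool", "ethel7", "BRITTANY"]
distribution = frequency_distribution(strip_suffixes(u) for u in usernames)
```

Each rule here was added only after imagining a new kind of messy input, which is the iteration loop the text calls an unknowable spec.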
We treat feature development as a workflow of developer-written functions udf_0, udf_1, ..., udf_N. A "run" i consists of applying udf_i to each tuple in the input dataset (e.g., each web page, each academic paper, or each Tweet). Like the map() from MapReduce, a udf function invocation takes a single tuple from the input and yields zero or more tuples that are placed into the run's output [footnote 4]. Explicit schemas allow Brainwash to guess what the udf is doing, even if the function consists entirely of opaque compiled code.

Footnote 4: Brainwash also allows aggregation, similar to reduce(), but we omit our description of this mechanism to simplify our exposition.

We envision Brainwash as a centralized system that supports many developers and projects simultaneously. Even if developers do not explicitly collaborate, the system stores and sometimes repurposes developers' work. Brainwash can thus exploit knowledge about the udf workflow, the dataset, and developer activity to address each of the three problems encountered in feature engineering.

1. Explore: Grunt-Work Statistics. Brainwash can assist with statistics collection by leveraging the formal schema and previous users' udfs. Exploiting schema information for the raw input file should be straightforward: the system can automatically read the input, counting tuples and fields as it goes. The result should be a rough overview similar to what is offered in data integration tools such as Google Refine. However, this schema information is not sufficient when the developer wants to see statistics on "derived objects" that come from running udfs on a subfield of the raw input, such as the named entities embedded in a web page.

We can automate statistics on derived objects by examining the output schemas and datasets from previous udfs that share an input schema with the current one under development. For example, consider a developer who has written one udf that transforms a raw input file of webpage tuples into a series of paragraph instances and then writes a series of different udfs that transform paragraph into a large range of outputs with different schema formats. The body of udfs not only provides evidence that paragraph is an interesting and broadly useful object, it also provides Brainwash with the code necessary to produce paragraph. Further, we do not need to limit these automated statistics to the user who actually wrote paragraph-producing code: it can be used to help any programmer who needs to process web pages.

2. Extract: Unknowable Specs. Brainwash should preemptively suggest code that will address unknowable specification problems before the developer even encounters them. For example, any developer who needs to process first names in social media is likely to want to remove all the strange username suffixes described in Section 1.

We can do this by pairing schema matching with code suggestion. Imagine a user A who has written udf^A_i, which extracts usernames from Tweets, as well as udf^A_{i+1}, which normalizes the usernames. Now user B writes a function udf^B_j that has an input schema similar to udf^A_i and generates output that is similar to that produced by udf^A_i. Brainwash should then suggest that B run udf^A_{i+1} and can present sample input/output pairs to explain why.

There may be a large number of candidate functions that B could apply; ranking these will be a core technical challenge. Like other ranking systems, udf suggestion should improve as developers use the system, providing additional evidence about which udfs are appropriate in which cases.

3. Evaluate: Unexpected Failure. Improving the overall performance of the trained system is, in the end, the feature developer's real goal. But measuring a feature's real contribution to the end-to-end system can entail training and testing many different permutations of features, each step of which may need to process a huge dataset. As a result, developers sidestep system-wide evaluation for far too long in the development process. Brainwash encourages frequent and full evaluation by saving data on previous feature permutations, and speculatively training statistical systems using features under development.

The system automatically formulates code according to the above three features, then sends the results to the user's IDE. In many of the above cases, the system both formulates code and speculatively executes it on the user's behalf. This speculative formulate-and-execute cycle is what allows Brainwash to not just give programmer hints, but reduce execution latency as well. In some cases, such as the explore component above, Brainwash reduces user-apparent latency simply by removing the work from the user's to-do list: we do the work when there's slack time in the back-end processing system. In other cases, as with extract, the user's coding target is unclear; the system tries to guess it, executes the resulting code, and hopes that the effort was not wasted. Even if the user's goal is not predicted, the system can use previous iterations of the current udf to determine which input tuples are likely to yield useful outputs and prioritize these during processing.

4. CURRENT STATUS AND NEXT STEPS

We are building an initial prototype of Brainwash based on the code of GeoDeepDive, Automan, and an earlier web-scale prototype called DeepDive (http://hazy.cs.wisc.edu/deepdive). Our plan is to refine Brainwash and these active research projects in parallel.

Brainwash raises several research questions. One that we have examined is the efficacy of different feature-engineering methods. Earlier this year, we performed the first study that compares two popular methods used in the explore-extract-evaluate loop, distant supervision and crowdsourcing, to extract relationships at web scale [7]. This study sparked our interest in using distant supervision to recommend features in the explore and extract phases. We plan to use Brainwash as a vehicle to continue such studies.

5. REFERENCES

[1] E. V. Buskirk. How the Netflix Prize Was Won. Wired, 2009.
[2] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137-150, 2004.
[3] D. Ferrucci. An Overview of the DeepQA Project. AI Magazine, 2012.
[4] A. Y. Halevy, P. Norvig, and F. Pereira. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 2009.
[5] S. Levy. How Google's Algorithm Rules the Web. Wired, 2010.
[6] https://developers.google.com/protocol-buffers/.
[7] C. Zhang, F. Niu, C. Re, and J. Shavlik. Big Data versus the Crowd: Looking for Relationships in All the Right Places. In ACL, 2012.
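As an illustration of the run model and the schema-based suggestion described in the design discussion above, the following Python sketch applies a udf to every input tuple (each invocation yielding zero or more output tuples) and suggests stored udfs whose input schema matches a new udf's output schema. Everything here is hypothetical: the `Udf` class, `run`, `suggest_next`, the dictionary tuple representation, and the use of exact schema equality in place of the schema similarity and ranking the paper leaves as open challenges.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Udf:
    """Hypothetical udf record: a function plus explicit input/output schemas."""
    name: str
    input_schema: frozenset
    output_schema: frozenset
    fn: Callable[[dict], Iterable[dict]]


def run(udf: Udf, dataset: Iterable[dict]) -> List[dict]:
    """One run: apply the udf to each input tuple and collect every yielded tuple."""
    output = []
    for record in dataset:
        output.extend(udf.fn(record))
    return output


def suggest_next(new_udf: Udf, registry: Iterable[Udf]) -> List[str]:
    """Suggest stored udfs that can consume new_udf's output (exact schema match)."""
    return [
        u.name
        for u in registry
        if u.input_schema == new_udf.output_schema and u.name != new_udf.name
    ]


def extract_username(tweet: dict):
    # Yields zero tuples when the tweet carries no user field.
    if tweet.get("user"):
        yield {"username": tweet["user"]}


def normalize_username(record: dict):
    # Simplified normalization: lowercase and drop trailing digits/underscores.
    yield {"username": record["username"].lower().rstrip("0123456789_")}


TWEET = frozenset({"user", "text"})
NAME = frozenset({"username"})

# User A's two stored udfs, and user B's new udf with matching schemas.
udf_a_i = Udf("extract_username", TWEET, NAME, extract_username)
udf_a_next = Udf("normalize_username", NAME, NAME, normalize_username)
udf_b_j = Udf("b_extract", TWEET, NAME, extract_username)

tweets = [{"user": "Brittany92", "text": "lol"}, {"user": "", "text": "hi"}]
extracted = run(udf_a_i, tweets)
normalized = run(udf_a_next, extracted)
suggestions = suggest_next(udf_b_j, [udf_a_i, udf_a_next])
```

Because the suggestion relies only on declared schemas, it would work even when `fn` is opaque compiled code, which is the property the text attributes to explicit schemas.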