/
Potter'sWheel:AnInteractiveDataCleaningSystemVijayshankarRamanandJosep Potter'sWheel:AnInteractiveDataCleaningSystemVijayshankarRamanandJosep

Potter'sWheel:AnInteractiveDataCleaningSystemVijayshankarRamanandJosep - PDF document

min-jolicoeur
min-jolicoeur . @min-jolicoeur
Follow
355 views
Uploaded On 2015-09-28

Potter'sWheel:AnInteractiveDataCleaningSystemVijayshankarRamanandJosep - PPT Presentation

PermissiontocopywithoutfeeallorpartofthismaterialisgrantedprovidedthatthecopiesarenotmadeordistributedfordirectcommercialadvantagetheVLDBcopyrightnoticeandthetitleofthepublicationanditsdateappeara ID: 143737

Permissiontocopywithoutfeeallorpartofthismaterialisgrantedpro-videdthatthecopiesarenotmadeordistributedfordirectcommercialad-vantage theVLDBcopyrightnoticeandthetitleofthepublicationanditsdateappear

Share:

Link:

Embed:

Download Presentation from below link

Download Pdf The PPT/PDF document "Potter'sWheel:AnInteractiveDataCleaningS..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

Potter'sWheel:AnInteractiveDataCleaningSystemVijayshankarRamanandJosephM.HellersteinUniversityofCaliforniaatBerkeleyrshankar,jmh@cs.berkeley.eduCleaningdataoferrorsinstructureandcontentisim-portantfordatawarehousingandintegration.Currentsolutionsfordatacleaninginvolvemanyiterationsofdata“auditing”tofinderrors,andlong-runningtrans-formationstofixthem.Usersneedtoendurelongwaits,andoftenwritecomplextransformationscripts.WepresentPotter'sWheel,aninteractivedataclean-ingsystemthattightlyintegratestransformationanddiscrepancydetection.Usersgraduallybuildtrans-formationstocleanthedatabyaddingorundoingtransformsonaspreadsheet-likeinterface;theeffectofatransformisshownatonceonrecordsvisibleonscreen.Thesetransformsarespeciedeitherthroughsimplegraphicaloperations,orbyshowingthede-siredeffectsonexampledatavalues.Intheback-ground,Potter'sWheelautomaticallyinfersstructuresfordatavaluesintermsofuser-defineddomains,andaccordinglychecksforconstraintviolations.Thususerscangraduallybuildatransformationasdiscrep-anciesarefound,andcleanthedatawithoutwritingcomplexprogramsorenduringlongdelays.1IntroductionOrganizationsaccumulatemuchdatathattheywanttoaccessandanalyzeasaconsolidatedwhole.Howeverthedataof-tenhasinconsistenciesinschema,formats,andadherencetoconstraints,duetomanyfactorsincludingdataentryerrorsandmergingfrommultiplesources[6,13].Thedatamustbepurgedofsuchdiscrepanciesandtransformedintoauniformformatbeforeitcanbeused.Suchdatacleaningisakeychallengeindatawarehousing[6].Datatransformationisalsoneededforextractingdatafromlegacydataformats,andforBusiness-to-BusinessEnterpriseDataIntegration[26].1.1CurrentApproachestoDataCleaningDatacleaninghasthreecomponents:auditingdatatofinddiscrepancies,choosingtransformationstofixthese,andap-plyingthetransformationsonthedataset.Therearecurrentlymanycommercialsolutionsfordatacleaning(e.g.,see[9]foranoverview).Theycomeintwoforms:auditingtoolsand Permissiontocopywithoutfeeallorpartofthismaterialisgrantedpro-videdthatthecopiesarenotmadeordistributedfordirectcommercialad-vantage,theVLDBcopyrightnoticeandthetitleofthepublicationanditsdateappear,andnoticeisgiventhatcopyingisbypermissionoftheVeryLargeDataBaseEndowment.Tocopyotherwise,ortorepublish,requiresafeeand/orspecialpermissionfromtheEndowment.Proceedingsofthe27thVLDBConference,Roma,Italy,2001transformationtools.TheuserfirstauditsthedatatodetectdiscrepanciesusinganauditingtoollikeUnitechSystems'ACR/DataorEvokeSoftware'sMigrationArchitect.Thensheeitherwritesacustomscriptorusesantion/Transformation/Loading)toollikeDataJunctionorAs-centialSoftware'stotransformthedata,fixingerrorsandconvertingittotheformatneededforanalysis.Thedataoftenhasmanyhard-to-findspecialcases,sothisprocessofauditingandtransformationmustberepeatedun-tilthe“dataquality”isgoodenough.Thisapproachhastwoproblems.Lackofinteractivity:Transformationistypicallydoneasabatchprocess,operatingonthewholedatasetwithoutanyfeedback.Thisleadstolong,frustratingdelaysduringwhichusershavenoideaifatransformationiseffective.Suchdelaysarecompoundedbyadecouplingoftransfor-mationanddiscrepancydetection–theseareoftendoneasseparatesteps,withseparatesoftware.Thisforcesuserstowaitforatransformationtofinishbeforetheycancheckifithasfixedallanomalies.Moreimportantly,somenesteddiscrepanciesariseonlyafterothershavebeenxed.E.g.,atypoinayeareldsuchas“19997”canbefound(byrunningasuitablealgorithmontheyearvalues)onlyaf-teralldateshavebeenconvertedtoauniformdatetype–untilthen,theyearvaluescannotbeisolatedfromthedatestrings.Thusthedecouplingmakesithardtofindmulti-plediscrepanciesinonepass,leadingtomanyunnecessaryiterations.Needformuchusereffort:Bothtransformationanddis-crepancydetectionneedsignicantusereffort,makingeachstepofthecleaningprocesspainfulanderror-prone.CommercialETLtoolstypicallysupportonlysomere-strictedtransformsbetweenasmallsetofformatsviaaGUI,andprovideadhocprogramminginterfacesforgeneraltransforms(theseareessentiallylibrariesofcon-versionsbetweenstandardformats:e.g.DataJunctionDJXL).Evensystem-supportedtransformsoftenneedtobespeciedinsophisticatedwaysthatinvolveregularex-pressionsorgrammars(Section4.3).Thediscrepancydetectiontechniquemustmatchthedatadomain–itmaybeastandardmethodlikespell-checking,oraspecializedonelikespottingnon-standardnamesforautomobileparts.Unfortunatelydatavaluesareoftencompositestructuresofvaluesfromdifferentdomains,like“RebeccabyDaphneduMaurier,et.alHardcover(April8,1948)(fromAmazon.comsearchresultsfor Weusetransformasanountodenoteasingleoperation,andtransfor-mationtodenoteasequenceofoperations. DaphneDuMaurier).Henceusersmusteitherwritecus-tomprogramsforeachsuchstructure,ordesigntransformstoparsedatavaluesintoatomiccomponentsforanomalydetection,andthenbackintouniedwholesforoutput.1.2Potter'sWheelApproachDatacleaningisintrinsicallyacomplex,knottytask,withmanyinterrelatedproblems.Anysolutionmustsupporttransformationanddiscrepancydetectioninanintegratedfashion.Ononehand,thetransformsprovidedmustbegen-eralandpowerfulenoughtodomosttaskswithoutexplicitprogramming,andthesystemmustextensiblysupportthevarietyofdiscrepancydetectionalgorithmsapplicableindif-ferentdomains.Ontheotherhand,sincethecleaningpro-cessinvolvesuserinteraction,thesystemmustsupporttrans-formationanddiscrepancydetectionthroughsimplespeci-cationinterfacesandwithminimaldelays.Potter'sWheelisaninteractivedatacleaningsystemthatintegratestransformationanddiscrepancydetectioninasin-gleinterface.ThesoftwareispubliclyavailablefromBerke-ley[22],andsomeofthemainideasarealsousedinCohera'sContentWorkBench[8].UsersgraduallybuildtransformationsinPotter'sWheelbycomposinganddebuggingtransforms,onestepatatime,onaspreadsheet-likeinterface(seeFigure1;thedetailswillbeexplainedinlatersections).Transformsarespecifiedgraphically,theireffectisshownimmediatelyonrecordsvis-ibleonscreen,andtheycanbeundoneeasilyiftheireffectsareundesirable.Discrepancydetectionisdoneautomaticallyinthebackground,onthelatesttransformedviewofthedataandanomaliesareflaggedastheyarefound.Thispipelin-ingoftransformationanddiscrepancydetectionmakesdatacleaningatight,closedloopwhereuserscangraduallyde-velopandrefinetransformationsasdiscrepanciesarefound.1.2.1InteractiveTransformationFromtheliteratureontransformationlanguages(e.g.,[1,7,16])wehaveadaptedasmallsetoftransformsthatsup-portcommontransformationswithoutexplicitprogramming.Mostofthesearesimpleandeasytospecifygraphically.Howeversometransformsusedtoparseandsplitvaluesintoatomiccomponentsarequitecomplex.Theirspecicationrequiresuserstoenterregularexpressionsorgrammars,andinsomecaseswritecustomprograms(Section4.3).InsteadPotter'sWheelletsusersspecifythedesiredresultsonexam-plevalues,andautomaticallyinfersasuitabletransform,us-ingthestructureextractiontechniquesdescribedbelow.Wedescribesuchgraphicalspecification,andtheincrementalapplicationofthesetransforms,inSection4.Potter'sWheelcompilesasequenceoftransformsintoaprogramaftertheuserissatisfied,insteadofapplyingthempiecemealovermanyiterations.Usersspecifyorundothesetransformsinorderstheyndnatural,oftenonlywhendis-crepanciesarefound,andthisexploratorybehaviorcouldre-sultinredundantorsub-optimaltransforms.Inaddition,themaincostinthetransformationisthatofmemoryallocationandcopying.Inthefullversionofthepaper[22],wediscusshowthefinalsequenceoftransformscanbeconvertedtoamoreoptimalform,includingwaysofpipeliningtransformstominimizememoryallocationsandcopies. Figure1:AsnapshotofthePotter'sWheeluserinterfaceonightdelaydatafromFEDSTATS(www.fedstats.gov).1.2.2ExtensibleDiscrepancyDetectionPotter'sWheelallowsuserstodefinecustomdomains,andcorrespondingalgorithmstoenforcedomainconstraints.Howeversincethedatavaluesareoftencompositestruc-tures,thesystemneedstoautomaticallyparseastringvalueintoastructurecomposedofuser-defineddomains,andthenapplysuitablediscrepancydetectionalgorithms.Thisissimilartotheproblemofinferringregularexpres-sionstructuresfromexamples,thathasbeenaddressedinthemachinelearningliterature(e.g.,[20,5]).Wearehow-evernotinterestedinabstractstructureslikeregularexpres-sions,butratherinstructuresintermsofuser-deneddo-mains.Forexample,parsingightrecordslike“Taylor,Jane,JFKtoORDonApril23,2000Coach”as“[A-Za-z,]]Z]3to[A-Z]on[A-Za-z]*[0-9]*,[0-9]*[A-Za-z]*”doesnothelpmuchwithdetectinganomaliesthatsatisfytheba-sicpattern.Whereasparsingitas“[A-Za-z,]*irport꜀irport꜀ -24;atel -65;ass”wouldallowustodetectlogicalerrorslikefalseairportcodesordates.Webelievethatapplicationdeveloperswillspecifyuse-fulapplication-specicdomainsandcorrespondingdomainconstraints(likedate,airportcode,constructionpartname),providedPotter'sWheelcanautomaticallyinferpatternsintermsofthesedomainsandapplysuitablealgorithms.Thedifcultyisthatthesedomainsaretypicallynotspeciedasexplicitpatternsbutratherasencapsulatedset-membershipfunctionsthatPotter'sWheelcannotunderstand.Aseconduniquefeatureofpatternlearninginthedatacleaningcon-textisthatthevalueswillhavediscrepanciesintheirstruc-tureitself;hencePotter'sWheelcanonlydetectapproximatestructures.Thereisatradeoffherebetweenchoosingstruc-turesthatmatchmostofthevaluesinacolumnandchoosingstructuresthatdonotovertthedatavalues.Section3de-scribeshowtheMinimumDescriptionLengthprinciple[24]canbeusedtoextractapproximatestructuresforvaluesinawaythatbalancesthistradeoff.2Potter'sWheelArchitectureThemaincomponentsofthePotter'sWheelarchitecture(Figure2)areaDataSourceTransformationEnginethatappliestransformsalong2paths,anOnlineReorderersupportinteractivescrollingandsortingattheuserinter-face[23,21],andanAutomaticDiscrepancyDetector 2.1DataSourcePotter'sWheelacceptsinputdataasasingle,pre-mergedstream,thatcancomefromanODBCsourceoranyASCIIfiledescriptor(orpipe).TheODBCsourcecanbeusedtoquerydatafromDBMSs,orevenfromdistributedsourcesviamiddleware.Inpractice,schematicdifferencesbetweensourceswillrestrictthetightnessoftheintegrationviaaquery(evenFigure1showspoormappingintheSourceDestinationcolumns).Potter'sWheelwillagareasofpoorintegrationaserrors,andtheusercantransformthedata,movingvaluesacrosscolumnstounifythedataformat.WhenreadingfromASCIIfiles,eachrecordisviewedasasinglewidecolumn.Theusercanidentifycol-umndelimitersgraphicallyandsplittherecordintocon-stituentcolumns.Suchparsingismorecomplexandtime-consumingonpoorlystructureddata(suchasfromwebpages).Potter'sWheelhelpsthisprocessthroughatransformthatcanbespeciedbyexample(Section4).Col-umntypesanddelimiterscanalsobespecifiedinametadatafile.Onceadatasethasbeenparsed,thetransformationcanbestoredasamacroforeasyapplicationonsimilardatasets.2.2InterfaceusedforDisplayingDataDatareadfromtheinputisdisplayedonaScalableSpread-sheetinterface[21]thatallowsuserstointeractivelyre-sortonanycolumn,andscrollinarepresentativesampleofthedata,evenoverlargedatasets.WhentheuserstartsPotter'sWheelonadataset,thespreadsheetinterfaceappearsimme-diately,withoutwaitinguntiltheinputhasbeencompletelyread.Thisisimportantwhentransforminglargedatasetsornever-endingdatastreams.TheinterfacesupportsthisbehaviorusinganOnlineRe-orderer[23]thatcontinuallyfetchestuplesfromthesourceanddividesthemintobucketsbasedona(dynamicallycom-puted)histogramonthesortcolumn,spoolingthemtodiskifneeded.Whentheuserscrollstoanewregion,thereordererpicksasampleoftuplesfromthebucketcorrespondingtothescrollbarpositionanddisplaysthemonscreen.Thususerscanexplorelargeamountsofdataalonganydimension.Ex-plorationhelpsusersspotsimplediscrepanciesbyobservingthestructureofdataasvaluesinthesort-columnchange.2.3TransformationEngineTransformsspeciedbytheuserneedtobeappliedintwoscenarios.First,theyneedtobeappliedwhenrecordsarerenderedonthescreen.Withthespreadsheetuserinterfacethisisdonewhentheuserscrollsorjumpstoanewscrollbarposition.Sincethenumberofrowsthatcanbedisplayedonscreenatatimeissmall,usersperceivetransformationsasbeinginstantaneous(thisclearlydependsonthenatureofthetransforms;wereturntothisissueinSection4.2).Second,transformsneedtobeappliedtorecordsusedfordiscrepancydetectionbecause,asarguedearlier,wewanttocheckfordiscrepanciesontransformedversionsofdata.2.4AutomaticDiscrepancyDetectorWhiletheuserisspecifyingtransformsandexploringthedata,thediscrepancydetectorrunsinthebackground,apply-ingappropriatealgorithmstonderrorsinthedata.Hence reorderer Optimized Program datasource specify/undotransforms scroll checkfor errors get page compilescrollbar posn. Figure2:Potter'sWheelArchitecturetuplesfetchedfromthesourcearetransformedandsenttothediscrepancydetector,inadditiontobeingsenttotheOnlineReorderer.Thediscrepancydetectorrstparsesvaluesineacheldintosub-componentsaccordingtothestructureinferredforthecolumn.Thestructureofacol-umnisasequenceofuser-deneddomains,andisinferredassoonasitisformed(i.e.,eitherwhentheinputstreamisstartedorwhenanewcolumnisformedbyatrans-form),aswedescribeinSection3.2.Thensuitablealgo-rthimsareappliedforeachsub-component,dependingonitsdomain.Forexample,ifthestructureofacolumnis&#xnumb;r-1;倀ord&#xw-15;&#xtime;andavalueis19January,thediscrepancydetectorndsJanuary,andassub-componentsbelongingtothe&#xnumb;r-1;倀wor&#x-800;ddomains,andappliesthedetectionalgorithmsspeciedforthosedomains.2.5CompilingaSequenceofTransformsAftertheuserissatisfiedwiththesequenceoftransforms,Potter'sWheelcancompileitintoatransformation,andex-portitaseitheraCorPerlprogram,oraPotter'sWheelmacro–thelattercanbeinvokedonotherdatasetstoreap-plythetransformationwithoutrespecifyingeachtransform.Infuturewewanttosupportcompilationintodeclarativelan-guageslikeSQLorXSLT,sothatadatabasesystemcouldperformfurtheroptimizations.3ExtensibleDiscrepancyDetectionPotter'sWheelallowsuserstodefinearbitrarydomainsalongwithcorrespondingdiscrepancydetectionalgorithms.WedescribetheAPIforspecifyingdomainsinSection3.1.Basedonthesedomains,thesystemautomaticallyinfersap-propriatestructuresforvaluesineachcolumn(Section3.2).Someofthedomainsinthisstructurearethenparameterizedforthespecificcolumnvalues,asdiscussedinSection3.3.Oncethisdetailedstructureisinferred,thesystemparsesval-uesandsendsindividualcomponentstosuitablediscrepancydetectionalgorithms(Section2.4).3.1DomainsinPotter'sWheelDomainsinPotter'sWheelaredefinedthroughtheinterfaceshowninFigure3.Theonlyfunctionrequiredtobeim-plementedisaninclusionfunctionmatchtoidentifyvaluesinthedomain.TheoptionalcardinalityfunctionishelpfulinstructureextractionaswedescribeinSection3.2. publicabstractclassDomain/**RequiredInclusionFunction—Checksifvaluesatisfiesdomainconstraints.(Sections3.1)*/publicabstractbooleanmatch(char*value);/**Optionalfunction–findsthenumberofvaluesinthisdomainwithgivenlength.Thiscouldvarybasedonparameterization.(Sections3.2and3.3)*/publicintcardinality(intlength);/**Optionalfunction–updatesanystateforthisdomainusingthegivenvalue.(Sections3.1and3.3)*/publicvoidupdateStats(char*value);/**Optionalfunction–checksifagivenvalueisadiscrepancy,withacertainprobability.Typicallyneedstoknowthetotalnumberoftuplesinthedataset(e.g.see[14]).(Section3.1)*/publicfloatmatchWithConfidence(char*value,intdataSize);/**Optionalfunction–checksifonepatternisredundantafteranother.(Section3.2)*/publicbooleanisRedundantAfter(Domaind);Figure3:APIforuser-defineddomains.Thesefunctionsareexplainedinthesectionsindicated.ismainlyusedtoparameterizethedomains(Sec-tion3.3).Itcanalsobeusedbyadiscrepancydetectionalgo-rithmtoaccumulatestateaboutthedata.Thisaccumulatedstatecanalsobeusedtocatchmulti-rowanomalieswhereasetofvaluesareindividuallycorrect,buttogetherviolatesomeconstraint.Forexample,aduplicateeliminationalgo-rithmcouldusetobuildanapproximatehashta-bleorBloomlterofthevaluesseensofar.ThematchWith-Confidencemethodishelpfulforprobabilisticandincremen-taldiscrepancydetectionalgorithms,suchassamplingbasedalgorithms(e.g.[14]).TheisRedundantAftermethodisusedwhileenumeratingstructures,asdescribedinSection3.2.Potter'sWheelprovidesthefollowingdefaultdomains:arbitraryASCIIstrings(henceforthcalled),characterstrings(calledWords;likewiseAllCapsWordsCapWordsrefertowordswithallcapitalsandcapitalizedwordsrespec-tively),Integers,sequencesofPunctuation,C-styletifiers,floatingpointvalues(henceforthcalledDoublesEnglishwordscheckedaccordingtoispellIspellWordsNames(checkedbyreferringtotheonline1990censusresults),Money,andagenericregular-expressiondo-mainthatchecksvaluesusingthePCRElibrary.3.2StructureExtractionAgivenvaluewilltypicallybeparseableintermsofthedefaultanduser-deneddomainsinmultipleways.Forexample,March17,2000canbeparsedas,assA-Za-z][0-9];[0-9],orassachrMM17];[20],tonameafewpossiblestructures.Structureextractioninvolveschoosingthebeststructureforvaluesinacolumn.Formally,givenasetofcolumnvalues;:::;vandasetofdo-mains,wewanttoextractasuitablestructure,whereAllthatweknowaboutthesedomainsisfromthefunc-tionsdenedinFigure3–evenamongthese,onlytheset-membershipfunction(match)maybeavailable.Ingeneraltheinferredstructuremustbeapproximate,sincethedatacouldhaveerrorsinthestructureitself.Wefirstdescribehowtoevaluatetheappropriatenessofastructureforasetofvaluesandthenlookatwaysofenumeratingallstructuressoastochoosethebestone.3.2.1EvaluatingtheSuitabilityofaStructureTherearethreecharacteristicsthatwewantinastructureforthecolumnvalues.Thestructureshouldmatchasmanyofthecolumnvaluesaspossible.Precision:Thestructureshouldmatchasfewothervaluesaspossible.Conciseness:Thestructureshouldhaveminimumlength.ThefirsttwocriteriaarestandardIRmetricsforevaluat-ingtheeffectivenessofapattern[27].Weneedtoconsiderrecallbecausethevaluesmightbeerroneouseveninstruc-ture;allunmatchedvaluesareconsideredasdiscrepancies.Consideringprecisionhelpsusavoidoverlybroadstructuresthatdonotuniquelymatchthiscolumn.Thelastcriterionofconcisenessisusedtoavoidover-fittingthestructuretotheexamplevalues.Forinstance,wewanttoparseMarch17,2000[A-Za-z][0-9][0-9]ratherthanasMarch17,2000Thislastexamplehighlightstheimportanceofallowinguser-defineddomainsinthealphabetfromwhichwecreatethestructure.ForinstanceifwedidnothaveWordtegerasdomainsinthealphabet,March17,2000wouldbethebetterstructurethan[A-Za-z][0-9][0-9]sinceithasthesamerecall(100%),betterprecision(sinceitavoidsmatchinganyotherdate),andsmallerpatternlengththan[A-Za-z][0-9][0-9].Intuitively,thelatterisamoreconcisepattern,butthisisonlybecausewethinkoftheeasthedomainWord,ratherthanastheKleeneclosureofasetof56charactersThesethreecriteriaaretypicallyconflicting,withbroadpatternslikehavinghighrecallandconcisenessbutlowprecision,andspecificpatternshavinghighprecisionbutlowconciseness.Aneffectivewaytomakethetradeoffbetweenover-fittingandunder-fittingisthroughtheMinimumDe-scriptionLength(MDL)principle[24],thatminimizesthetotallengthrequiredtoencodethedatausingastructure.DescriptionLength:AmetricforstructurequalityWenowderivethedescriptionlength(DL)forencodingasetofvalueswithastructure,asameasureoftheappropriate-nessofthestructure;betterstructuresresultinsmallerDLs.AccordingtotheMDLprinciple,theDLforusingastructure todescribeasetofcolumnvaluesisdefinedas:thelengthofthetheory(thestructuredenition)plusthelengthrequiredtoencodethevaluesgiventhestructure.WeneedaDLthatcanencapsulatethegoalsofRecall,Precision,andConciseness(aspenalties).Concisenessisdirectlycapturedbythelengthoftheoryforthestructure.Forvaluesthatmatchthestructure,thelengthrequiredforencodingthedatavaluescapturesthePrecision.Wetackleerroneousdatavaluesbypositingthatvaluesnotmatchingthestructureareencodedexplicitlybywritingthemout,i.e.usingthestructure.Thelatterencodingistypicallymorespace-intensivesinceitassumesnostructure,forcingvaluestobewrittenoutexplicitly.TherebywecaptureRecall.Example:ConsiderastructureofWordIntegerIntegeravalueofMay172025.Thenumberofbitsneededtoencodethestructureis3log(numberofdomains.Thenweencodethevaluebyrstspecifyingthelengthofeachsub-componentandthen,foreachsub-component,specifyingtheactualvaluefromallvaluesofthesamelength.Inthiscase,thesub-componentlengthsare3,2,and4.Thedomainsarestringsoveralphabetsofsize52,10,and10([a-zA-Z]and[0-9]).Thusthedescriptionlengthis:3log(numberofdomains3log(maximumlengthofvaluesineachsub-componentlog52+log10+log10Intheaboveexample,weareabletocalculatethelengthsofthevalueencodingsforintegersandwordsbecauseweknowthepropertiesofthedomains.Wenowlookatencod-ingsforstructuresofarbitrarydomains.Considerastructuredomains,.Letdenotethecardinalityofanyset.ThedescriptionlengthofastringoflengthlenusinglengthoftheoryforlengthtoencodegivenGivendomains,wecanrepresenteachdomainwithlogbits.Letbetheprobabilitythatmatchesthestructuredoesnotmatch,weencodeitexplicitly.Thus,log+(1)(loglenspacetoexpresswiththethreepartsoftherighthandsiderepresentingpenaltiesforConciseness,Recall,andPrecisionrespec-tively.Letbethevaluesinthecolumn,andAvgValLenlenbetheaveragelengthofthevaluesinthecolumn.Itiseasytoseethattheaveragespaceneededtoencodethevaluesislog+(1)(logAvgValLenavg.spacetoexpress:::vusingJustasintheexample,weexpressthevaluesusingbyrstencodingthelengthsoftheircomponentsineachdo-mainandthenencodingtheactualvaluesofthecomponents.Foranystringthatfallsinadomain,letlenbeitslength,andletbethespacerequiredtouniquelyamongallthelen-lengthstringsinSupposethatvaluematchesthestructurethroughtheconcatenationofsub-componentsi;p,withi;jparsingofisitselfnoteasy;wediscussefcientparsinginSection4.3.2).LetMaxLenbethemaximumlengthofthevaluesinthecolumn.Thentheaveragespacerequiredtoencodethevaluesinthecolumnis,logf=nlogMaxLeni;j+(1)(logAvgValLenAftersometransformationthisbecomeslogAvgValLenloglogMaxLen logvaluesoflengthleni;jthatsatisfy valuesoflengthleni:jThebestwaytocomputethecardinalityintheaboveex-pressionisusingtheintcardinality(intlength)functionforthedomain,ifithasbeendened.Forotherdomainsweap-proximatethefractiondirectlybyrepeatedlychoosingran-domstringsofappropriatelengthandcheckingiftheysatisfy.Sincethesefractionsareindependentoftheactualval-ues,theycanbepre-computedandcached.Ifthelengthistoohighwemayneedtocheckmanyval-uesbeforewecanestimatethisfraction.Henceintheab-senceofauser-denedcardinalityfunction,wecomputethefractionofmatchesforafewsmalllengths,andextrapolateittolargerlengthsassumingthatthenumberofmatchesisastrictexponentialfunctionofthestringlength.3.2.2ChoosingthebeststructureWehaveseenhowtocomputeadescriptionlengththatmea-suresthesuitabilityofastructureforagivensetofvalues.Wenowwanttoenumerateallstructuresthatcanmatchthevaluesinacolumnandchoosethemostsuitableone.Thisenumerationneedstobedonecarefullysincesincethestruc-turesarearbitrarystringsfromthealphabetofdomains,anditwillbetooexpensivetoenumerateallsuchstrings.WeapplythealgorithmofFigure4onasetofsampleval-uesfromthecolumn,andtaketheunionofallstructuresenu-meratedthus.Weuse100valuesasadefault;thishasprovenadequateinallthecaseswehaveencountered.Duringthisenumeration,weprunetheextentofrecursionbynothan-dlingstructureswithmeaninglesscombinationsofdomainssuchaswordwordinteger.Theseareunnecessarilycomplicatedversionsofsimplerstructuresword,andwillresultinstructureswithidenticalprecisionandrecallbutlesserconciseness.WeidentifysuchunnecessarysequencesusingtheisRe-dundantAfter(Domaind)methodofDomainthatdetermineswhetherthisdomainisredundantimmediatelyafterthegivendomain.Eventhoughthisisanexponentialalgorithm,pruningre-ducesthenumberofstructuresweenumerateforacolumnconsiderably.AsshowninFigure5,thenumberofenumer-atedstructuresistypicallylessthan10.3.3StructureswithParameterizedDomainsSofarthestructuresthatwehaveextractedaresimplystringsofdomains.Butthecolumnvaluesareoftenmuchmorere-stricted,consistingonlyofcertainparameterizationsofthedomains.Forexample,allthesub-componentsfromado-mainmighthaveaconstantvalue,ormightbeofconstantlength,asshownintheexamplesofFigure5.Potter'sWheelcurrentlydetectstwoparameterizationsautomatically:domainswithconstantvaluesanddomains /**Enumerateallstructuresofdomains:::dthatcanbeusedtomatchavalue.*/voidenumerate;:::dbeastringofcharactersforalldomainsmatchingprexdoenumerate(;:::d–avoidstructuresbeginningwithdomainsthatsatisfyisRedundantAftertoallstructuresenumeratedaboveFigure4:Enumeratingvariousstructuresforasetofvalues ExampleColumnValue #Structures FinalStructureChosen (Exampleerroneousvalues) Enumerated (Punc=Punctuation) -60 Integer UNITED,DELTA,AMERICANetc. IspellWord SFO,LAXetc.(JFKtoOAK) AllCapsWord 1998/01/12 IntPunc(/)IntPunc(/)Int M,Tu,Thuetc. CapitalizedWord 06:22 Int(len2)Punc(:)Int(len2) 12.8.15.147(ferret03.webtop.com) DoublePunc('.')Double ”GET Punc(”)IspellWordPunc( /postmodern/lecs/xia/sld013.htm  AllCapsWord(HTTP) /1.0 Punc(/)Double(1.0) Figure5:Structuresextractedfordifferentkindsofcolumns,usingthedefaultdomainslistedinSection3.1.Structureparameterizationsaregiveninparenthesis.withvaluesofconstantlength.Suchparameterizedstruc-turesareespeciallyusefulforautomaticallyparsingtheval-uesinacolumn,wheninferringSplittransformsbyexample(Section4.3).Inaddition,userscandenedomainsthatinfercustomparameterizations,usingthemethod.Thesedo-mainscouldusespecializedalgorithmstofurtherrenethestructureofthesub-componentsthatfallwithintheirdomain.Forexample,thedefaultIntegerdomaininPotter'sWheelcomputesthemeanandstandarddeviationofitsvaluesandusestheseasparameters,toagvaluesthataremorethan2standarddeviationsawayaspotentialanomalies.Likewiseadomaincanacceptallstringsbydefault,butparameterizeitselfbyinferringaregularexpressionthatmatchesthesub-componentvalues.Thedescriptionlengthforvaluesusingastructureoftenreduceswhenthestructureisparameterized.Forthedefaultparameterizationsofconstantvaluesandconstantlengthsitiseasytoadjusttheformulasgivenintheprevioussection.Forcustomparameterizationsliketheregularexpressionin-ferencediscussedabove,theusermustdenethecardinalityfunctionbasedontheparameterization.3.4ExampleStructuresExtractedConsiderthesnapshotshowninFigure1containingightdelaystatistics.Figure5showsthestructuresextractedforsomeofitscolumnvalues,andalsoforsomecolumnsfromawebaccesslog.Weseethatthedominantstructureischoseneveninthefaceofinconsistencies;therebythesystemcanagthesestructuralinconsistenciesaserrorstotheuser,andparseandapplysuitabledetectionalgorithmsforothervaluesthatmatchthestructure.Usingthesethesystemagsseveraldiscrepanciesthatwehadearlieraddedtothedata.Forexample,thesystemagsdatessuchasinthedatecolumnofFigure1asanomaliesbecausetheIntegerdomainfortheyearcolumnparameterizeswithameanof2043.5andastandarddevia-tionof909.2.ItndsthepoormappingintheSourceandDestinationcolumnsofFigure1asstructuralanomalies.Figure5alsoshowsthatacolumnofIPaddresseswithvalueslikehasitsstructureinferredasble.Double,ratherthanInteger.Integer.Integer.Integer.ThisarisesbecauseDoubleisamoreconcisestructurethanInteger.Integer.Thiscouldbeavoidedeitherbyden-ingaShortdomainforvalueslessthan255(toformShort.Short.Short.Short),orevenbyallowingaparameter-izationoftheformInteger(lenAninterestingexampleofover-ttingisthechoiceofIspellWordforightcarriers.Althoughmostightcarriernamesoccurintheispelldictionary,somelikeTWAdonot.IspellWordischosenbecauseitischeapertoencodeTWAexplicitlywithastructurethantoencodeallcarri-erswiththenextbeststructure,AllCapsWord.ThesystemagsTWAasananomaly–theusercouldchoosetoignorethis,orspecifyaminimumRecallthresholdtoavoidover-tting.Inanycase,thisexamplehighlightstheimportanceofinvolvingtheuserinthedatacleaningprocess.Figure10givesmoreexamplesofinferredstructures.4InteractiveTransformationHavingseenhowPotter'sWheelinfersstructuresandiden-tiesdiscrepancies,weturnourattentiontoitssupportforinteractivetransformation.Wewantuserstoconstructtrans-formationsgradually,adjustingthembasedoncontinualfeedback.Thisbreaksdownintothefollowingsub-goals:Easeofspecification:Transformsmustbespeciablethroughgraphicaloperationsratherthancustomprogram-ming.Moreover,intheseoperations,wewanttoavoiduseofregular-expressionsorgrammarsandinsteadallowuserstospecifytransformsbyexampleasfaraspossible.Easeofinteractiveapplication:Oncetheuserhasspecifiedatransform,theymustbegivenimmediatefeedbackontheresultsofitsapplicationsothattheycancorrectit.UndosandDataLineage:Usersmustbeabletoeasilyundotransformsafterseeingtheireffect.Inaddition,thelineageoferrorsmustbeclear–i.e.,errorsintrinsictothedatamustbedifferentiablefromthoseresultingfromothertransforms.4.1TransformssupportedinPotter'sWheelThetransformsusedinPotter'sWheelareadaptedfromex-istingliteratureontransformationlanguages(e.g.[16,7]).Wedescribethembrieflyherebeforeproceedingtodiscusstheirinteractiveapplicationandgraphicalspecification.Ta-ble1givesformaldenitionsforthesetransforms.Addi-tionalillustrativeexamplesandproofsofexpressivepoweraregiveninthefullversionofthepaper[22]. Transform Definition Format R;i;f;:::;a;:::;a R;x;:::;a R;i;:::;a Copy ;:::;a;:::;a Merge ;:::;a;i;j;glue;:::;aglue;:::;a ;i;splitter;:::;asplitterrightsplitter Divide ;:::;a;i;pred;:::;anull;:::;apred ;:::;anullpred Fold R;i;:::i;:::;a;:::;a (a1)2R^1lkg Select pred;:::;apred Notation:isarelationwithi;jarecolumnindicesandrepresentsthevalueofacolumninarow.andgluearevalues.isafunctionmappingvaluestovalues..splitterisapositioninastringoraregularexpression,splitteristheleftpartofaftersplittingbysplitter.predisafunctionreturningaboolean.Table1:Definitionsofthevarioustransforms.Unfoldisdenedinthefullpaper[22]. '(.*), (.*)' to '\2 \1' Stewart,Bob Dole,Jerry Davis Marsh Anna Joan Anna Davis Dole Joan Marsh Jerry Bob Bob Jerry Stewart Dole Anna Joan Davis Marsh Split at ' ' Anna Joan Davis Marsh Bob Stewart Jerry Dole Figure6:UsingFormatMergetocleannamefor-matdifferencesValueTranslationFormattransformappliesafunctiontoeveryvalueinacolumn.Weprovidebuilt-infunctionsforcommonoper-ationslikeregular-expressionbasedsubstitutionsandarith-meticoperations,butalsoallowuserdefinedfunctions.Col-umnandtablenamescanbedemotedintocolumnvaluesus-ingspecialcharactersinregularexpressions;theseareusefulinconjunctionwiththeFoldtransformdescribedbelow.One-to-oneMappingsofRowsOne-to-onetransformsarecolumnoperationsthattransformindividualrows.AsillustratedinFigures6and7,theycanbeusedtounifydatacollectedfromdifferentsources.Mergetransformconcatenatesvaluesintwocolumns,optionallyinterposingaconstant(thedelimiter)inthemid-dle,toformasinglenewcolumn.splitsacolumnintotwoormoreparts,andisusedtypicallytoparseavalueintoitsconstituentparts.Thesplitpositionsareoftendifficulttospecifyifthedataisnotwellstructured.Weallowsplit-tingbyspecifyingcharacterpositions,regularexpressions,orbyinteractivelyperformingsplitsonexamplevalues(Sec-tion4.3).Copy,andallowuserstodroporcopyacol-umn,oraddanewcolumn.Occasionally,logicallydifferentvalues(maybefrommultiplesources)arebunchedintothesamecolumn,andwewanttotransformonlysomeofthem.Divideconditionallydividesacolumn,sendingvaluesintooneoftwonewcolumnsbasedonapredicate.Many-to-ManyMappingsofRowsMany-to-Manytransformshelptotacklehigher-orderschematicheterogeneities[18]whereinformationisstoredpartlyindatavalues,andpartlyintheschema,asshowninFigure8.Fold”attens”tablesbyconvertingonerowintomultiplerows,foldingasetofcolumnstogetherintoonecol-umnandreplicatingtherest.ConverselyUnfold”unattens”tables;ittakescolumns,collectsrowsthathavethesamevaluesforalltheothercolumns,andunfoldsthetwochosencolumns.Valuesinonecolumnareusedascolumnnamestoalignthevaluesintheothercolumn.Figures8and9showanexamplewithstudentgradeswherethesubjectnamesaredemotedintotherowviaFormat,gradesareFoldedtogether,andthentoseparatethesubjectfromthegrade.FoldUnFoldareadaptedfromtherestructuringoperatorsofSchemaSQL[16],andarediscussedinmoredetailinthefullpaper[22].PowerofTransforms:Asweproveinthefullpaper[22],thesetransformscanbeusedtoperformallone-to-manyrowmappingsofrows.FoldUnfoldcanalsobeusedtoflattentables,convertingthemtoaformwherecolumnandtablenamesareallliteralsanddonothavedatavalues.Forafor-maldefinitionof(un)flatteningandananalysisofthepowerFoldUnfold,see[16].4.2InteractiveApplicationofTransformsWewanttoapplythetransformsontuplesincrementally,astheystreamin,sothattheeffectsoftransformscanbeimme-diatelyshownontuplesvisibleonthescreenoftheUI.Italsoletsthesystempipelinediscrepancydetectionontheresultsofthetransforms,therebygivingtheinteractivityadvantagesdescribedintheintroduction.Amongthetransformsdiscussedabove,alltheone-to-onetransformsaswellastheFoldtransformarefunctionsonasinglerow.Hencetheyareeasytoapplyincrementally.HoweverUnfoldoperatesonasetofrowswithmatch-ingvalues.Sincethiscouldpotentiallyinvolvescanningtheentiredata,wedonotallowUnfoldtobespecifiedgraphi-cally.Fordisplayingrecordsonthescreenwecanavoidthisproblembynotshowingacompleterowbutinsteadshow-ingmoreandmorecolumnsasdistinctvaluesarefound,andfillingdatavaluesinthesecolumnsasthecorrespondingin-putrowsareread.Suchprogressivecolumnadditioninthespreadsheetinterfacecouldconfusetheuser;henceweplantoimplementanabstractioninterfacewhereallnewlycre-atedcolumnsareshownasonerolledupcolumn.When Divide (like ', ') Davis Song Such,Bob Dole,Jerry Such,Bob Ann Dole,Jerry Joan Davis Song Ann Joan Figure7:Divide-ingtosepa-ratevariousnameformats Fold 2 Formats(demotes) Ann Bob Math:43 Math:96 Bio:78 Bio:54 Name Math 43 96 Ann Bob 78 54 Bio Ann Bob Bob Ann Name Math:96 Bio:54 Math:43 Bio:78 Math Bio Math Bio Ann Ann Bob Bob Name 43 78 96 54 Name Figure8:Fold-ingtoxhigher-ordervariations Math Bio Math Bio Name Anna Anna Bob Bob 43 78 96 54 Joan Sci 79 79 Sci 78 54 Bio 43 96 Math Name Anna Bob Figure9:Unfold-ingintothreecolumnstheuserclickstounrollthecolumnitexpandsintoasetofcolumnscorrespondingtothedistinctvaluesfoundsofar.4.3GraphicalSpecificationofTransformsTransformslikeCopyFold,andMergearesim-pletospecifygraphically.Userscanhighlightthedesiredcolumnsandpicktheappropriatetransform.Howeverisoftenhardtospecifyprecisely.areneededtoparsevaluesinacolumnintoconsituentparts,asillustratedinFigure10.Thisisanimportantproblemforcommercialintegrationproducts.ToolslikeMicrosoftSQLServerAccesshavewizardsforparsingASCIIdataleswithconstantdelimiters.Therearemanyresearchandcommercial“wrapper-generation”tools(e.g.Araneus[12],CoheraNetQuery(CNQ)[8],Nodose[2])thattacklethisproblemfor“screen-scraping”unstructureddatafoundonwebpages.Howeverthesetoolsoftenrequiresophisticatedspecificationofthesplit,rangingfromregularexpressionsplitdelimiterstocontextfreegrammars.Buteventhesecannotalwaysbeusedtosplitunambiguously.ForinstanceintherstentryofFigure10,commasoccurbothindelim-itersandinthedatavalues.Asasresult,usersoftenhavetowritecustomscriptstoparsethedata.4.3.1SplitbyExampleInPotter'sWheelwewantuserstobeabletoparseandsplitvalueswithoutspecifyingcomplexregularexpressionsorwritingprograms.Insteadwewanttoallowuserstospec-ifymostsplitsbyperformingthemonexamplevalues.Theuserselectsafewexamplevaluesandinagraphical,direct-manipulation[25]wayshowsthepositionsatwhichthesevaluesaretobesplit,intosub-components:::xn;mrespectively.Asisdoneduringdiscrepancydetection,thesysteminfersastructureforeachofthenewcolumnsusingMDL,andusesthesestructurestosplittherestofthevalues.Thesestructuresaregeneral,rangingfromsimpleoneslikeconstantdelimitersorconstantlengthde-limiters,tostructuresinvolvingparameterizeduser-defineddomainslike“Word(len3)Word(len3)Integer(len2)timeInteger(len4)”(fortheoutputoftheUNIXThereforetheyarebetterthansimpleregularexpressionsatidentifyingsplitpositions.Figure10containssomesamplestructuresthatPotter'sWheelextractsfromexamplesplitsondifferentdatasets.Weseethatevenfortheambiguous-delimitercasedescribedear-lier,itcanextractgoodstructuresthatcanbeusedtosplitunambiguously./**Splitastring(ofcharacters)usingstructuresvoidLeftRight(;:::S==0,checkifisemptyforallprexessatisfying:::wvoidDecSpecicity(;:::S;:::vbetheexamplevaluesusedtoinferthestructuresforthesplit.i;kbetheuser-speciedsplitforeachAsinSection3.2,computeforallstructuresspaceneededtoexpressn;jusingChoosethestructurewiththeleastvalueofforallsubstringssatisfyingDecSpecicity(:::wDecSpecicity(:::w:::SFigure11:Twomethodsofsplittingavalue.Somevaluesmaystillnotbeunambiguouslyparseableusingtheinferredstructures.Othervaluesmaybeanoma-lousandnotmatchtheinferredstructureatall.Weflagallsuchvaluesaserrorstotheuser,whocanapplyothertrans-formstosplitthem,orcleanthedatafurther.4.3.2SplittingbasedoninferredstructuresSincethestructuresinferredinvolvedomainswithTuring-completematchfunctions,splittingavaluebasedontheseisnoteasy.TherstalgorithmofFigure11,,isasim-plerecursivealgorithmforparsingavaluethatconsiderstheinferredstructuresfromlefttoright,andtriestomatchthemagainstallprefixesoftheunparsedvalue.Thisalgorithmispotentiallyveryexpensivefor“imprecise”structuresthatmatchmanyprexes.Quickparsingisparticularlyneededwhenthesplitistobeappliedonalargedatasetaftertheuserhaschosentheneededsequenceoftransforms.ThereforeweuseanalternativealgorithmcalledDec-Specicity(thesecondalgorithmofFigure11)thatmatchestheinferredstructuresindecreasingorderofspecificity.Itfirsttriestofindamatchforthemostspecificstructure,andthenrecursivelytriestomatchtheremainingpartofthedatavalueagainsttheotherstructures.Themotivationisthatintheinitialstagesthemostspecificstructures(typicallycon-stantdelimiters)willmatchonlyafewsubstrings,andsothevaluewillbequicklybrokendownintosmallerpiecesthatcanbeparsedlater.Thespecificityofastructureiscom-putedasthesumofthedescriptionlengthsofthe(appropri- ExampleValuesSplitByUserInferredStructureComments isuserspeciedsplitposition) Taylor,Jane,$52,072 Blair,John,$73,238 TonySmith,$1,00,533 ','MoneyParsingisdoabledespitenogoodde-limiter.Aregularexpressioncaninferastructureof$[0-9,]*forlastcomponent. JFK –j OAK len3identier�len3identierParsingispossibledespitemultipledelimiters. 321Blake#7,Berkeley,CA94720 719MLKRoad,Fremont,CA95743 number','word','(2letterword)(5letterinteger)Parsingiseasybecauseofconsistentdelimiter. Figure10:Parsestructuresinferredfromvarioussplit-by-examplesatesubstrings)oftheexamplevaluesusingthestructure.Thelessspecicstructuresneedtobeusedonlyafterthevaluehasbeendecomposedintomuchsmallersubstrings,andthesplittingisnottooexpensiveonthese.TostudytheeffectofparsingaccordingtospecicityweranDecSpecicity,andIncSpecicityonafewstructures.IncSpecicityistheexactoppositeofDecSpeci-cityandconsidersstructuresstartingwiththeleastspecicone;itillustrateshowcrucialthechoiceofstartingstruc-tureis.Figure12comparesthethroughputatwhichonecansplitvaluesusingthesemethods.WeseethatDecSpecicityperformsmuchbetterthantheothers,withtheimprovementbeingdramaticatsplitsinvolvingmanystructures.4.4UndoingTransformsandTrackingDataLineageTheabilitytoundoincorrecttransformsisanimportantre-quirementforinteractivetransformation.However,ifthespeciedtransformsaredirectlyappliedontheinputdata,manytransforms(suchasregular-expression-basedsubsti-tutionsandsomearithmeticexpressions)cannotbeundoneunambiguously–thereexistno“compensating”transforms.Undoingtheserequires“physicalundo”,i.e.,thesystemhastomaintainmultipleversionsofthe(potentiallylarge)dataset.InsteadPotter'sWheelneverchangestheactualdatarecords.Itmerelycollectstransformsastheuseraddsthem,andappliesthemonlyontherecordsdisplayedonthescreen,inessenceshowingaviewusingthetransformsspecifiedsofar.Undosaredone“logically,”byremovingtheconcernedtransformfromthesequenceand“redoing”therestbeforerepaintingthescreen.Thisapproachalsosolvestheambiguousdatalineageproblemofwhetheradiscrepancyisduetoanerrorinthedataorbecauseofapoortransform.Iftheuserwishestoknowthelineageofaparticulardiscrepancy,thesystemonlyneedstoapplythetransformsoneafteranother,checkingfordiscrepanciesaftereachtransform.5RelatedWorkThecommercialdatacleaningprocessisbasedonETLtoolsandauditingtools,asdescribedintheintroduction.[6,9]givegooddescriptionsoftheprocessandsomepopulartools.Thereismuchliteratureontransformationlanguages,es-peciallyforperforminghigher-orderoperationsonrelationaldata[1,7,16,18].OurhorizontaltransformsareverysimilartotherestructuringoperatorsofSchemaSQL[16].Howeverourfocusisontheeaseofspecicationandincrementalap-plication,andnotmerelyonexpressivepower.Theresearchliteratureonndingdiscrepanciesindatahasfocusedontwomainthings:general-purposealgorithmsforndingoutliersindata(e.g.[3]),andalgorithmsfornd-ingapproximateduplicatesindata[13,17,10].Therehasalsobeensomeworkonfindinghiddendependenciesindataandcorrespondinglytheirviolations[14].Suchgeneralpur-posealgorithmsareusefulasdefaultalgorithmsforPotter'sWheel'sdiscrepancydetector.Howeverwebelievethatinmanycasesthediscrepancieswillbedomain-specic,andthatdatacleaningtoolsmusthandlethesedomainsextensi-bly.Acompanionproblemtodatacleaningistheintegrationofschemasfromvariousdatasources.WeintendtoextendPotter'sWheelwithasystemthathandlesinteractivespeci-cationofschemamappings(suchasClio[19]).Extractingstructurefrompoorlystructureddataisin-creasinglyimportantfor“wrapping”datafromwebpages,andmanytoolsexistinboththeresearchandcommercialworld(e.g.[2,12,8]).AsdiscussedinSection4.3,thesetoolstypicallyrequireuserstospecifyregularexpressionsorgrammars;eventheseareoftennotsufcienttounambigu-ouslyparsethedata,sousershavetowritecustomscripts.Therehavealsobeensomelearning-basedapproachesforautomatictextwrappingandsegmentation[15,4].Webe-lieve,however,thatasemi-automatic,interactiveapproachusingacombinationofgraphicaloperationsandstatisticalmethodsismorepowerful.Therehasbeensomeworkinthemachinelearninglitera-ture[20,5]andthedatabaseliterature[11]oninferringreg-ularexpressionsfromasetofvalues.Howeverasarguedbe-fore,fordetectingdiscrepanciesitisimportanttoinferstruc-turesintermsofgenericuser-deneddomains,inawaythatisrobusttostructuraldataerrors.6ConclusionsandFutureWorkDatacleaningandtransformationareimportanttasksinmanycontextssuchasdatawarehousinganddataintegra-tion.Thecurrentapproachestodatacleaningaretime-consumingandfrustratingduetolong-runningnoninterac-tiveoperations,poorcouplingbetweenanalysisandtrans- ExampleValues StructuretoSplitby SplitThroughput(usecs/value) (Int=Integer,Dbl=Double) DecSpecLeftRightIncSpec 8:45 IntInt 5.969.189.18 1997/10/23 IntIntInt 11.5217.9557.89 12.8.15.14--[01/May/2000...”GET...404306 Dbl'.'Dbl'--['� 144.8539.827670 12.8.15.14--[01/May/2000...”GET...404306 Dbl'.'Dbl'--['�Int''Int 219.38943.81525590 12.8.15.14--[01/May/2000...”GET...404306 Dbl'.'Dbl--[Int/Word/Int”GET�Int''Int 233.551960.951036090 Figure12:Comparisonofsplitthroughputsusingthreemethods.formation,andcomplextransformationinterfacesthatoftenrequireuserprogramming.WehavedescribedPotter'sWheel,aninteractivesystemfordatatransformationandcleaning.Byintegratingdiscrep-ancydetectionandtransformation,Potter'sWheelallowsuserstograduallybuildatransformationtocleanthedatabyaddingtransformsasdiscrepanciesaredetected.Userscanspecifytransformsthroughgraphicaloperationsorthroughexamples,andseetheeffectinstantaneously,therebyallow-ingeasyexperimentationwithdifferenttransforms.Wehaveseenthatparsingstringsusingstructuresofuser-deneddomainsresultsinageneralandextensiblediscrep-ancydetectionmechanismforPotter'sWheel.Suchdomainsalsoprovideapowerfulbasisforspecifyingtransfor-mationsthroughexamplevalues.InfuturewewouldliketoinvestigatespecicationofothercomplextransformssuchastheFormattransform,throughexamples.OurfocuswithPotter'sWheelhassofarbeenonat,tabulardata.However,nesteddataformatslikeXMLarebe-comingincreasinglycommon.Whilemuchtherearemanyresearcheffortsontransformationandquerylanguagesforsuchdata,itwouldbeinterestingtoinvestigategraphicalandexample-basedapproachesforspecifyingthese.WhilewehavesofarlookedatPotter'sWheelasadatacleaningtool,wewouldliketoinvestigateitseffectivenessasaclientinterfacetoainteractivequeryprocessingsys-tem.Thetransformationsappliedattheclientinterfacecanbeviewedasrenementstotheongoingquery,andcanbefedbackintothequeryprocessor,therebycombiningqueryspecication,execution,andresultbrowsing.AcknowledgmentsThescalablespreadsheetinterfacethatweusedwasde-velopedalongwithAndyChou.ReneeMillerandSubbuSubramaniangaveuspointerstorelatedworkonhandlingschematicheterogeneities.ThisworkwassupportedbyagrantfromIBMCorporation,aCaliforniaMICROgrant,NSFgrantsIIS-9802051andRICDA-9401156,aMicrosoftFellowship,andaSloanFoundationFellowship.References[1]S.Abitebouletal.Toolsfordatatranslationandintegration.DataEngg.Bulletin,22(1),1999.[2]B.Adelberg.NoDoSE—Atoolforsemi-automaticallyex-tractingstructuredandsemistructureddatafromtextdocu-ments.InSIGMOD,1998.[3]A.Arningetal.Alinearmethodfordeviationdetectioninlargedatabases.InProc.KDD,1996.[4]V.Borkar,K.Deshmukh,andS.Sarawagi.Automaticseg-mentationoftextintostructuredrecords.InSIGMOD,2001.[5]A.Brazma.Efcientalgorithmforlearningsimpleregularexpressionsfromnoisyexamples.InIntl.Wksp.onAlgorith-micLearningTheory,1994.[6]S.ChaudhuriandU.Dayal.AnoverviewofdatawarehousingandOLAPtechnology.InSIGMODRecord,1997.[7]W.Chen,M.Kifer,andD.S.Warren.HiLog:Afounda-tionforhigher-orderlogicprogramming.InJournalofLogicProgramming,volume15,pages187–230,1993.[8]CoheraCorp.http://www.cohera.com.[9]Dataextraction,transformation,andloadingtools(ETL).www.dwinfocenter.org/clean.html.[10]H.Galhardas,D.Florescu,D.Shasha,andE.Simon.AJAX:Anextensibledatacleaningtool.InSIGMOD,2000.[11]M.N.Garofalakisetal.AsystemforextractingdocumenttypedescriptorsfromXMLdocuments.InSIGMOD,2000.[12]S.GrumbachandG.Mecca.Insearchofthelostschema.In,1999.[13]M.HernandezandS.Stolfo.Real-worlddataisdirty:Datacleansingandthemerge/purgeproblem.DataMiningandKnowledgeDiscovery,2(1),1997.[14]J.Kivinenetal.Approximatedependencyinferencefromrelations.TheoreticalComputerScience,149(1),1995.[15]N.Kushmerick.Wrapperinduction:efciencyandexpres-siveness.ArticialIntelligence,118,2000.[16]V.S.Lakshmananetal.SchemaSQL:Alanguageforintere-operabilityinrelationalmulti-databasesystems.In[17]M.Leeetal.Cleansingdataforminingandwarehousing.In,1999.[18]R.J.Miller.Usingschematicallyheterogeneousstructures.SIGMOD,1998.[19]R.J.Miller,L.Haas,andM.A.Hernandez.Schemamappingasquerydiscovery.,2000.[20]L.Pitt.Inductiveinference,DFAs,andcomputationalcom-plexity.AnalogicalandInductiveInference,1989.[21]V.Ramanetal.Scalablespreadsheetsforinteractivedataanalysis.InDMKDWorkshop,1999.[22]V.RamanandJ.M.Hellerstein.Potter'sWheelA-B-C.Atcontrol.cs.berkeley.edu/abc,alsoUCBCSD-00-1110,2000.[23]V.Raman,B.Raman,andJ.Hellerstein.Onlinedynamicreorderingforinteractivedataprocessing.In,1999.[24]J.Rissanen.Modelingbyshortestdatadescription.Automat-,14:465–471,1978.[25]B.Shneiderman.Thefutureofinteractivesystemsandtheemergenceofdirectmanipulation.BehaviorandInformationTechnology,1(3):237–256,1982.[26]M.StonebrakerandJ.M.Hellerstein.ContentintegrationforE-Commerce.InSIGMOD,2001.[27]C.vanRijsbergen.InformationRetrieval.Butterworths,