Uploaded On 2016-06-14


Presentation Transcript

Incognito: Efficient Full-Domain K-Anonymity

Kristen LeFevre, David J. DeWitt, Raghu Ramakrishnan
University of Wisconsin-Madison
1210 West Dayton St.
Madison, WI 53706

ABSTRACT

A number of organizations publish microdata for purposes such as public health and demographic research. Although attributes that clearly identify individuals, such as Name and Social Security Number, are generally removed, these databases can sometimes be joined with other public databases on attributes such as Zipcode, Sex, and Birthdate to re-identify individuals who were supposed to remain anonymous. "Joining" attacks are made easier by the availability of other, complementary, databases over the Internet.

K-anonymization is a technique that prevents joining attacks by generalizing and/or suppressing portions of the released microdata so that no individual can be uniquely distinguished from a group of size k. In this paper, we provide a practical framework for implementing one model of k-anonymization, called full-domain generalization. We introduce a set of algorithms for producing minimal full-domain generalizations, and show that these algorithms perform up to an order of magnitude faster than previous algorithms on [...]

[Figure 1: example tables; recoverable rows include (Carol, 10/1/44, Female, 90210), (Dan, 2/21/84, Male, 02174), (Ellen, 4/19/72, Female, 02237), and patient records with Zipcode 53703 or 53706 and conditions Bronchitis, Broken Arm, Sprained Ankle]
Figure 2: Domain and value generalization hierarchies for Zipcode (a, b), Birth Date (c, d), and Sex (e, f)

In SQL, the frequency set is obtained from T with respect to a set of attributes Q by issuing a COUNT(*) query, with Q as the attribute list in the GROUP BY clause. For example, in order to check whether the Patients table in Figure 1 is 2-anonymous with respect to ⟨Sex, Zipcode⟩, we issue a query SELECT COUNT(*) FROM Patients GROUP BY Sex, Zipcode. Since the result includes groups with count fewer than 2, Patients is not 2-anonymous with respect to ⟨Sex, Zipcode⟩.

K-Anonymization: A view V of a relation T is said to be a k-anonymization of T if the view modifies, distorts, or suppresses the data of T according to some mechanism such that V satisfies the k-anonymity property with respect to the set of quasi-identifier attributes. T and V are assumed to be multisets of tuples.

Throughout this paper, we consider the problem of generating a single k-anonymization of the microdata in a single table T. Other problems, including inference, arise when multiple different anonymizations of the same microdata are made available [18], but we assume that only a single anonymization is released.

1.2 Paper Organization and Contributions

In Section 2 we give an overview of the generalization and suppression framework for k-anonymization, in particular a model called full-domain generalization, and we describe previous algorithms implementing minimal full-domain generalization.

Our first main contribution addresses implementation of full-domain k-anonymization, and is presented in Sections 3 and 4. In Section 3 we introduce an implementation framework for full-domain generalization using a multi-dimensional data model, together with a suite of algorithms, which we call Incognito. Incognito takes advantage of two key variations of dynamic programming [4] that have been used previously in the query processing literature for other purposes: bottom-up aggregation along dimensional hierarchies [6] and a priori aggregate computation [2]. In Section 4 we present the results of the largest-scale performance experiments that we are aware of for minimal k-anonymization, and we show that the Incognito algorithms outperform previous algorithms by up to an order of magnitude. The results demonstrate the feasibility of performing minimal k-anonymization on large databases.

Though our
algorithms and framework focus primarily on the full-domain generalization model, there have been a number of other k-anonymization models proposed, but the differences among these techniques have not been clearly articulated. Our second contribution, presented in Section 5, is a unifying taxonomy of several alternative approaches to k-anonymization. Previous proposals can be understood as instances of this taxonomy, differing primarily in the granularity at which anonymization is applied. Further, the taxonomy exposes some interesting new alternatives that offer the promise of more flexible anonymization. Extending the algorithmic framework presented in this paper to some of these novel alternatives presents a broad class of problems for future work.

We discuss related work in Section 6 and present our conclusions in Section 7.

2. GENERALIZATION AND SUPPRESSION

Samarati and Sweeney [14, 15, 17, 18] formulated mechanisms for k-anonymization using the ideas of generalization and suppression. In a relational database, there is a domain (e.g., integer, date) associated with each attribute of a relation. Given this domain, it is possible to construct a more "general" domain in a variety of ways. For example, the Zipcode domain can be generalized by dropping the least significant digit, and continuous attribute domains can be generalized into ranges.

We denote this domain generalization relationship by <_D, and we use the notation D_i ≤_D D_j to denote that domain D_j is either identical to or a domain generalization of D_i. For two domains D_i and D_j, the relationship D_i <_D D_j indicates that the values in domain D_j are the generalizations of the values in domain D_i. More precisely, a many-to-one value generalization function γ : D_i → D_j is associated with each domain generalization D_i <_D D_j.

A domain generalization hierarchy is defined to be a set of domains that is totally ordered by the relationship <_D. We can think of the hierarchy as a chain of nodes, and if there is an edge from D_i to D_j, we call D_j the direct generalization of D_i. Note that the generalization relationship is transitive, and thus, if D_i <_D D_j and D_j <_D D_k, then D_i <_D D_k. In this case, we call domain D_k an implied generalization of D_i. Paths in a domain hierarchy chain correspond to implied generalizations, and edges correspond to direct generalizations. Figure 2 (a, c, e) shows possible domain generalization hierarchies for the Zipcode, Birthdate and Sex attributes.

We use the notation γ+ as shorthand for the composition of one or more value generalization functions, producing the direct and implied value generalizations. The value- [...]

Nodes
ID   dim1   index1   dim2      index2
1    Sex    0        Zipcode   0
2    Sex    1        Zipcode   0
3    Sex    0        Zipcode   1
4    Sex    1        Zipcode   1
5    Sex    0        Zipcode   2
6    Sex    1        Zipcode   2

Edges
Start   End
1       2
1       3
2       4
3       4
3       5
4       6
5       6

Figure 6: Representation of sample generalization lattice (Figure 3(a)) as nodes and edges relations

Figure 7: (a) The 3-attribute graph generated from 2-attribute results in Figure 5 and (b) The 3-attribute lattice that would have been searched without a priori pruning

[...] lattice depicted in Figure 3(a). Notice that each node is assigned a unique identifier (ID).

The graph generation component consists of three main phases. First, we have a join phase and a prune phase for generating the set of candidate nodes C_i with respect to which T could potentially be k-anonymous given previous iterations; these phases are similar to those described in [2]. The final phase is edge generation, through which the direct multi-attribute generalization relationships among candidate nodes are constructed.

The join phase creates a superset of C_i based on S_{i-1}. The join query is as follows, and assumes some arbitrary ordering assigned to the dimensions. As in [2], this ordering is intended purely to avoid generating duplicates.

    INSERT INTO C_i (dim_1, index_1, ..., dim_i, index_i, parent_1, parent_2)
    SELECT p.dim_1, p.index_1, ..., p.dim_{i-1}, p.index_{i-1},
           q.dim_{i-1}, q.index_{i-1}, p.ID, q.ID
    FROM S_{i-1} p, S_{i-1} q
    WHERE p.dim_1 = q.dim_1 AND p.index_1 = q.index_1 AND ...
      AND p.dim_{i-2} = q.dim_{i-2} AND p.index_{i-2} = q.index_{i-2}
      AND p.dim_{i-1} < q.dim_{i-1}

The result of the join phase may include some nodes with subsets not in S_{i-1}, and during the prune phase, we use a hash tree structure similar to that described in [2] to remove these nodes from C_i.

Once C_i has been determined, it is necessary to construct the set of edges connecting the nodes (E_i). Notice that during the join phase we tracked the unique IDs of the two nodes in S_{i-1} that were combined to produce each node in C_i (parent_1 and parent_2). E_i is constructed using C_i and E_{i-1} based on some simple observations. Consider two nodes p and q in C_i. We observe that if there exists a generalization relationship between the first parent of p and the first parent of q, and the second parent of q is either equal to or a generalization of the second parent of p, then q is a generalization of p. In some cases, the resulting generalization relationships may be implied, but they may only be separated by a single node. We remove these implied generalization relationships explicitly from the set of edges. The edge generation process can be expressed in SQL as follows:

    INSERT INTO E_i (start, end)
    WITH CandidateEdges (start, end) AS (
      SELECT p.ID, q.ID
      FROM C_i p, C_i q, E_{i-1} e, E_{i-1} f
      WHERE (e.start = p.parent_1 AND e.end = q.parent_1
             AND f.start = p.parent_2 AND f.end = q.parent_2)
         OR (e.start = p.parent_1 AND e.end = q.parent_1
             AND p.parent_2 = q.parent_2)
         OR (e.start = p.parent_2 AND e.end = q.parent_2
             AND p.parent_1 = q.parent_1)
    )
    SELECT D.start, D.end
    FROM CandidateEdges D
    EXCEPT
    SELECT D_1.start, D_2.end
    FROM CandidateEdges D_1, CandidateEdges D_2
    WHERE D_1.end = D_2.start

Example 3.2 Consider again the Patients table in Figure 1 with quasi-identifier ⟨Birthdate, Sex, Zipcode⟩. Suppose the results of the second-iteration graph search are those shown in the final steps of Figure 5 (a, b, c). Figure 7(a) shows the 3-attribute graph resulting from the join, prune, and edge generation procedures applied to the 2-attribute graphs. In many cases, the resulting graph is smaller than the lattice that would have been produced without a priori pruning. For example, see Figure 7(b).

3.2 Soundness and Completeness

As mentioned previously, Incognito generates the set of all possible k-anonymous full-domain generalizations. For example, consider the generalization lattice in Figure 7(a). If relation T is k-anonymous with respect to some ⟨B, S, Z⟩ node in this lattice, then this generalization will be among those produced as the result of Incognito. If T is not k-anonymous with respect to this generalization, then it will not be in the result set. In this section, we sketch out a proof of soundness and completeness. The full proof is omitted due to space constraints.

Basic Incognito is sound and complete for producing k-anonymous full-domain generalizations.

Proof Sketch: Consider a table T and its quasi-identifier attribute set Q. Let [...] denote the set of multi-attribute domain generalizations of Q. Incognito determines the k-anonymity status of each generalization in one of three ways:

The k-anonymity of T with respect to some subset of [...] is checked explicitly and found not to be k-anonymous. In this case, we know by the subset property that T is not k-anonymous with respect to [...].

The k-anonymity of T with respect to some quasi-identifier subset [...] is checked, and found not to be k-anonymous, and some subset of [...] is a (multi-attribute) generalization of [...]. In this case, we know based on the [...]
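The join and prune phases of graph generation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: nodes are represented as tuples of (dimension, index) pairs sorted by dimension name, the function name `generate_candidates` is hypothetical, the set `S2` is invented sample data, and a plain set lookup stands in for the hash tree mentioned in the text.

```python
from itertools import combinations


def generate_candidates(survivors):
    """Join and prune phases of a priori candidate generation.

    `survivors` is a set of (i-1)-attribute nodes. Each node is a tuple of
    (dimension, index) pairs sorted by dimension name (an assumed encoding).
    """
    survivors = set(survivors)
    # Join: combine two nodes that agree on all but their last pair. The
    # ordering test on the last dimension avoids generating duplicates.
    joined = {
        p + (q[-1],)
        for p in survivors
        for q in survivors
        if p[:-1] == q[:-1] and p[-1][0] < q[-1][0]
    }
    # Prune: keep a candidate only if every (i-1)-attribute subset survived.
    size = len(next(iter(survivors))) if survivors else 0
    return {
        node
        for node in joined
        if all(sub in survivors for sub in combinations(node, size))
    }


# Hypothetical 2-attribute survivors over Birthdate (B), Sex (S), Zipcode (Z).
S2 = {
    (("B", 1), ("S", 0)), (("B", 1), ("S", 1)),
    (("B", 1), ("Z", 2)),
    (("S", 0), ("Z", 2)), (("S", 1), ("Z", 2)),
}
C3 = generate_candidates(S2)

# Dropping <S0, Z2> from the survivors prunes <B1, S0, Z2>.
C3_pruned = generate_candidates(S2 - {(("S", 0), ("Z", 2))})
```

The join mirrors the WHERE clause of the join query: equal prefixes and a strict ordering on the last dimension. The prune step relies on the subset property, so a 3-attribute node is kept only when all of its 2-attribute projections were found k-anonymous.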
Adults
     Attribute        Distinct Values   Generalizations
1    Age              74                5-, 10-, 20-year ranges (4)
2    Gender           2                 Suppression (1)
3    Race             5                 Suppression (1)
4    Marital Status   7                 Taxonomy tree (2)
5    Education        16                Taxonomy tree (3)
6    Native country   41                Taxonomy tree (2)
7    Work Class       7                 Taxonomy tree (2)
8    Occupation       14                Taxonomy tree (2)
9    Salary class     2                 Suppression (1)

Lands End
     Attribute        Distinct Values   Generalizations
1    Zipcode          31953             Round each digit (5)
2    Order date       320               Taxonomy tree (3)
3    Gender           2                 Suppression (1)
4    Style            1509              Suppression (1)
5    Price            346               Round each digit (4)
6    Quantity         1                 Suppression (1)
7    Cost             1412              Round each digit (4)
8    Shipment         2                 Suppression (1)

Figure 9: Descriptions of the Adults and Lands End Databases used for performance experiments

[...] against previous minimal full-domain generalization algorithms, including Samarati's Binary search [14], Bottom-up search (without rollup), and the Bottom-up search (with rollup) described in Section 2.2. Both bottom-up variations were run exhaustively to produce all k-anonymous generalizations. Throughout our experiments, we found that the Incognito algorithms uniformly outperformed the others.

The databases used in our performance experiments represent the largest-scale evaluation that we are aware of for minimal full-domain k-anonymization. Previously, full-domain k-anonymity techniques were demonstrated using only a toy database of 265 records [15]. No experimental evaluation was provided for binary search [14]. The genetic algorithm in [11] was evaluated using a somewhat larger database, but this algorithm does not guarantee minimality.

4.1 Experimental Data and Setup

We evaluated the algorithms using two databases. The first was based on the Adults database from the UC Irvine Machine Learning Repository [5], which comprises data from the US Census. We used a configuration similar to that in [11], using nine of the attributes, all of which were considered as part of the quasi-identifier, and eliminating records with unknown values. The resulting table contained 45,222 records (5.5 MB). The second database was much larger, and contained point-of-sale information from Lands End Corporation. The database schema included eight quasi-identifier attributes, and the database contained 4,591,581 records (268 MB).

The experimental databases are described in Figure 9, which lists the number of unique values for each attribute, and gives a brief description of the associated generalizations. In some cases, these were based on a categorical taxonomy tree, and in other cases they were based on rounding numeric values or simple suppression. The height of each domain generalization hierarchy is listed in parentheses. We implemented the generalization dimensions as a relational star-schema, materializing the value generalizations in the dimension tables.

We implemented the three Incognito variations, Samarati's binary search, and the two variations of bottom-up breadth-first search using Java and IBM DB2. All frequency sets were implemented as un-logged temporary tables. All experiments were run on a dual-processor AMD Athlon 1.5 GHz machine with 2 GB physical memory. The software included Microsoft Windows Server 2003 and DB2 Enterprise Server Edition Version 8.1.2. The buffer pool size was set to 256 MB. Because of the computational intensity of the algorithms, each experiment was run 2-3 times, flushing the buffer pool and system memory between runs. We report the average cold performance numbers, and the numbers were quite consistent across runs.

Footnote: We implemented the k-anonymity check as a group-by query over the star schema. Samarati suggests an alternative approach whereby a matrix of distance vectors is constructed between unique tuples [14]. However, we found constructing this matrix prohibitively expensive for large databases.

4.2 Experimental Results

The complexity of each of the algorithms, including Incognito, is ultimately exponential in the size of the quasi-identifier. However, we found that in practice the rollup and a priori optimizations go a long way in speeding up performance.

Figure 10 shows the execution time of Incognito and previous algorithms on the experimental databases for varied quasi-identifier size (k = 2, 10). We began with the first three quasi-identifier attributes from each schema (Figure 9), and added additional attributes in the order they appear in these lists.

We found that Incognito substantially outperformed Binary Search on both databases, despite the fact that Incognito generates all possible k-anonymous generalizations, while Binary Search finds only one. Incognito also outperformed Bottom-up search (with and without rollup). The performance of Binary Search varied based on the pattern of the search performed. Bottom-up search was more consistent, but overall both were slower than Incognito.

4.2.1 Effects of Rollup and the A Priori Optimization

As mentioned previously, we observed that the bottom-up breadth-first search can be improved by using rollup aggregation, an idea incorporated into the Incognito algorithm. To gauge the effectiveness of this optimization, we compared the version of the bottom-up algorithm with rollup to the version without rollup. Figure 10 shows that bottom-up performs substantially better on the Adults database when it takes advantage of rollup.

We also found that the a priori optimization, the other key component of Incognito, went a long way in helping to prune the space of nodes searched, in turn improving performance. In particular, the number of nodes searched by Incognito was much smaller than the number searched by bottom-up, and the size of the frequency sets computed for each of these nodes is generally smaller. For the Adults database, k = 2, and varied quasi-identifier size (QID), the number of nodes searched is shown below.

QID size   Bottom-Up   Incognito
3          14          14
4          47          35
5          206         103
6          680         246
7          2088        664
8          6366        1778
9          12818       4307

[Figure 11: Performance of algorithms for fixed quasi-identifier size and varied values of k. Panels: Adults database (quasi-identifier size 8); Lands End database (staggered quasi-identifier size)]

[Panels: Adults database (k = 2); Lands End database (k = 2)]
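The rollup optimization discussed in Section 4.2.1 can be illustrated with a small Python sketch: the frequency set of a more general node is computed from the frequency set of a less general node, rather than by rescanning the table. The rows, the helper names (`frequency_set`, `rollup`, `is_k_anonymous`, `generalize_zip`), and the in-memory representation are all assumptions made for illustration; the paper's implementation used group-by queries over a star schema in DB2.

```python
from collections import Counter


def frequency_set(rows):
    """Count of each distinct quasi-identifier value combination (GROUP BY + COUNT)."""
    return Counter(rows)


def rollup(freq, generalizers):
    """Derive a generalized frequency set from an existing one by summing counts."""
    out = Counter()
    for values, count in freq.items():
        key = tuple(g(v) for g, v in zip(generalizers, values))
        out[key] += count
    return out


def is_k_anonymous(freq, k):
    """A relation is k-anonymous if every count in its frequency set is >= k."""
    return all(count >= k for count in freq.values())


def identity(value):
    return value


def generalize_zip(zipcode):
    """Replace the least significant remaining digit with '*' (53715 -> 5371*)."""
    digits = zipcode.rstrip("*")
    return digits[:-1] + "*" * (len(zipcode) - len(digits) + 1)


# Hypothetical (Sex, Zipcode) projection of a small Patients-like table.
rows = [
    ("Male", "53715"), ("Female", "53715"),
    ("Male", "53703"), ("Male", "53703"),
    ("Female", "53706"), ("Female", "53706"),
]
base = frequency_set(rows)                   # ungeneralized Sex, Zipcode
z1 = rollup(base, (identity, generalize_zip))  # Zipcode generalized one level
z2 = rollup(z1, (identity, generalize_zip))    # Zipcode generalized two levels
```

In this toy data the base frequency set contains groups of size 1, so it is not 2-anonymous; rolling Zipcode up twice yields two groups of size 3. Each rollup touches only the (usually much smaller) frequency set of the previous node.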
Figure 12: Combined cost of building zero-generalization cube and anonymization

These models provide varying amounts of flexibility in choosing what data is released, and at what level of generality. Some of the models encompass a space of anonymizations that others do not. For this reason, what is the optimal anonymization due to one scheme may be "better" than the optimal anonymization due to another scheme. In the following sections, we provide an overview of these different approaches.

5.1 Global Recoding Models

Full-domain generalization (Section 2.1) is one example of global recoding. However, more flexible models have been proposed, and there are undoubtedly a number of other possibilities. Most of these models have focused on recoding the domains of the quasi-identifier attributes individually, which we term single-dimension recoding. A single-dimension recoding defines some function φ_i : D_{Q_i} → D' for each attribute Q_i of the quasi-identifier. A generalization is obtained by applying each φ_i to the values of Q_i in each tuple of T. Models of this variety have been used in a number of anonymization schemes [3, 7, 11, 19].

Alternatively, it is possible to construct an anonymization model that recodes the multi-attribute domain of the quasi-identifier, rather than recoding the domain of each attribute independently. We call this class of models multi-dimension recoding. A multi-dimension recoding is defined by a single function φ : ⟨D_{Q_1} × ... × D_{Q_n}⟩ → D', which is used to recode the domain of value vectors associated with the set of quasi-identifier attributes. Generalization is obtained by applying φ to the vector of quasi-identifier values in each tuple of T. Recent results suggest that multi-dimension models might produce better anonymizations than their single-dimension counterparts [12].

5.1.1 Hierarchy-Based Single-Dimension Recoding

A single-dimension recoding defines some function φ_i for each attribute Q_i of the quasi-identifier. The single-dimension recoding models differ in how this function is defined. We first consider several hierarchy-based models.

The Full-domain Generalization model defines φ_i to map every value in D_{Q_i} to a generalized value at the same level of the value generalization hierarchy. There is also a special case of full-domain generalization, which we call Attribute Suppression [13]. If * denotes a suppressed value, φ_i must either map every value in D_{Q_i} to its unmodified value, or it must map every value in D_{Q_i} to *.

More flexible hierarchy-based models relax the requirements of full-domain generalization. One such model, which we will call Single-Dimension Full-Subtree Recoding, was used for categorical data in [11]. Recall from Section 2 the (single-dimension) many-to-one value generalization function (γ), which mapped values from domain D_i into the more general domain D_j. Under the full-subtree recoding model, each φ_i is defined such that for each value q in D_{Q_i}, φ_i(q) ∈ {q} ∪ γ+(q). If φ_i maps any q to some value g ∈ γ+(q), then it must map to g all values in the subtree of the value generalization hierarchy rooted at g.

For example, consider the value generalization hierarchy for Zipcode in Figure 2(b). If a single-dimension full-subtree recoding maps 53715 to 5371*, then it must also map 53710 to 5371*. If it maps any value to 537**, then it must map the entire subtree rooted at 537** to 537**.

We might consider further relaxing the requirements to obtain an even more flexible model that we call Unrestricted Single-Dimension Recoding. Under this model, the only restriction on recoding function φ_i is that for each value q in the domain of Q_i, φ_i(q) ∈ {q} ∪ γ+(q).

Footnote: There are situations where the application of this model may lead to certain types of inference, for example mapping the value "Male" to "Person" while leaving "Female" ungeneralized. Nonetheless, we include it as a possibility.
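The three hierarchy-based single-dimension models above differ only in which recoding functions they admit, and the difference can be made concrete with a small checker. The sketch below is illustrative only: the Zipcode hierarchy is an assumed stand-in for the one in Figure 2(b), a recoding is represented as a dictionary from leaf values to recoded values, and all function names are hypothetical.

```python
# Assumed value generalization hierarchy for Zipcode (child -> direct generalization).
PARENT = {
    "53715": "5371*", "53710": "5371*",
    "53706": "5370*", "53703": "5370*",
    "5371*": "537**", "5370*": "537**",
}
LEAVES = sorted(v for v in PARENT if v not in PARENT.values())


def ancestors(value):
    """Direct and implied generalizations of a value, nearest first."""
    out = []
    while value in PARENT:
        value = PARENT[value]
        out.append(value)
    return out


def leaves_under(node):
    """Leaf values in the subtree rooted at `node`."""
    return [leaf for leaf in LEAVES if leaf == node or node in ancestors(leaf)]


def level(value):
    """Height of a value above the leaves (assumes a balanced hierarchy)."""
    return len(ancestors(LEAVES[0])) - len(ancestors(value))


def is_unrestricted(recoding):
    """Each value maps to itself or to one of its generalizations."""
    return all(recoding[q] == q or recoding[q] in ancestors(q) for q in LEAVES)


def is_full_subtree(recoding):
    """If any value maps to g, the whole subtree rooted at g maps to g."""
    return is_unrestricted(recoding) and all(
        recoding[leaf] == recoding[q]
        for q in LEAVES
        for leaf in leaves_under(recoding[q])
    )


def is_full_domain(recoding):
    """Every value maps to a generalized value at the same hierarchy level."""
    return is_unrestricted(recoding) and len({level(recoding[q]) for q in LEAVES}) == 1


subtree_only = {"53715": "5371*", "53710": "5371*", "53706": "53706", "53703": "53703"}
unrestricted_only = {"53715": "5371*", "53710": "53710", "53706": "53706", "53703": "53703"}
full_domain = {leaf: PARENT[leaf] for leaf in LEAVES}
```

Here `subtree_only` matches the example in the text (53715 and 53710 both map to 5371*), `unrestricted_only` violates the full-subtree rule by leaving 53710 ungeneralized, and `full_domain` lifts every value exactly one level.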
Figure 13: Multi-dimensional value generalization lattice for the Sex and Zipcode attributes.

5.1.2 Partition-Based Single-Dimension Recoding

Single-dimension recoding models have also been proposed that use an ordered-set partitioning approach [3, 11]. Under the Single-Dimension Ordered-Set Partitioning model, we assume that the domain of each attribute can be represented as a totally-ordered set. φ_i maps the set of values in D_{Q_i} into a set of disjoint intervals that cover D_{Q_i}. For example, consider Zipcode = ⟨53703, 53706, 53710, 53715⟩ as an ordered set, and a possible set of intervals: 53703, 53705, 53710. Under this model, φ_zipcode(53705) = [53703-53710], and φ_zipcode(53710) = [53710].

5.1.3 Hierarchy-Based Multi-Dimension Recoding

A multi-dimension recoding is defined by a single function φ : ⟨D_{Q_1} × ... × D_{Q_n}⟩ → D', which is used to recode the domain of n-vectors corresponding to the set of quasi-identifier attributes. The hierarchy-based models for single-dimension recoding are readily extended to multiple dimensions.

In order to define these models, we first extend the idea of a value generalization function to multiple dimensions. Let the multi-attribute value generalization function γ : ⟨D_{A_1} × ... × D_{A_n}⟩ → ⟨D_{B_1} × ... × D_{B_n}⟩ be a many-to-one function, and let ⟨D_{A_1}, ..., D_{A_n}⟩ <_D ⟨D_{B_1}, ..., D_{B_n}⟩ denote the domain generalization relationship. (Again, we use the notation γ+ to denote the composition of one or more multi-attribute value generalization functions.) As in the single-dimension case, the multi-attribute value generalization relationships associated with a multi-attribute domain induce a multi-attribute value generalization lattice. The directed edges in this lattice represent direct value generalizations (based on γ), while indirect paths represent implied generalizations (based on γ+). The sub-graph rooted at a node in the lattice is the set of all nodes and edges encountered by recursively traversing all edges backwards from that node.

For example, Figure 13 depicts the multi-attribute value generalization lattice for Sex and Zipcode. (For clarity, not all value generalization relationships are shown. The dotted arrows indicate the direct and implied multi-attribute value generalizations of ⟨Male, 53715⟩.) In this example, the subgraph rooted at ⟨Person, 5371*⟩ contains nodes ⟨Person, 5371*⟩, ⟨Person, 53715⟩, ⟨Person, 53710⟩, ⟨Male, 5371*⟩, ⟨Female, 5371*⟩, ⟨Male, 53715⟩, ⟨Female, 53715⟩, ⟨Male, 53710⟩, and ⟨Female, 53710⟩.

Several new recoding models can be defined in terms of the multi-dimensional value generalization lattice. One such model, Multi-Dimension Full-Subgraph Recoding, is an extension of single-dimension full-subtree recoding. Under this model, the multi-dimension recoding function φ is defined such that for each tuple ⟨q_1, ..., q_n⟩ in the multi-attribute domain of the quasi-identifier attribute set, φ(⟨q_1, ..., q_n⟩) ∈ {⟨q_1, ..., q_n⟩} ∪ γ+(⟨q_1, ..., q_n⟩). If φ maps any ⟨q_1, ..., q_n⟩ to some ⟨g_1, ..., g_n⟩ ∈ γ+(⟨q_1, ..., q_n⟩), then it must map to ⟨g_1, ..., g_n⟩ all values in the subgraph of the (multi-dimensional) value generalization lattice rooted at ⟨g_1, ..., g_n⟩.

For example, suppose a multi-dimension full-subgraph value generalization mapped the pair ⟨Male, 53715⟩ to ⟨Person, 5371*⟩. In this case, it must also map pairs ⟨Female, 53715⟩, ⟨Male, 53710⟩, and ⟨Female, 53710⟩ to the value ⟨Person, 5371*⟩.

As with the single-dimension model, we can relax the requirements to obtain a more flexible model, which we call Unrestricted Multi-Dimension Recoding. Such a recoding is defined by a single function φ such that, for each tuple ⟨q_1, ..., q_n⟩ in the multi-attribute domain of the quasi-identifier attribute set, φ(⟨q_1, ..., q_n⟩) ∈ {⟨q_1, ..., q_n⟩} ∪ γ+(⟨q_1, ..., q_n⟩).

5.1.4 Partition-Based Multi-Dimension Recoding

The ordered-set partitioning model can also be extended to multiple dimensions. Again, consider the domain of each attribute Q_i to be an ordered set of values, and let a multidimensional interval be defined by a pair of points (p_1, ..., p_n), (v_1, ..., v_n) in the multi-dimensional space such that for all i, p_i ≤ v_i.

The Multi-Dimension Ordered-Set Partitioning model defines a set of disjoint multi-dimensional intervals that cover the domain D_{Q_1} × ... × D_{Q_n}. Recoding function φ maps each tuple (q_1, ..., q_n) in D_{Q_1} × ... × D_{Q_n} to a multi-dimensional interval in the cover such that for all i, p_i ≤ q_i ≤ v_i.

5.2 Local Recoding Models

Consider a relation T with quasi-identifier attribute set Q. The local recoding models produce a k-anonymization V by defining a bijective function φ from each tuple ⟨a_1, ..., a_n⟩ in the projection of Q on T to some new tuple ⟨a'_1, ..., a'_n⟩. V is defined by replacing each tuple t in the projection of Q on T with φ(t). Two main local recoding models have been proposed in the literature. The first (Cell Suppression) produces V by suppressing individual cells of T [1, 13, 20]. The second (Cell Generalization) maps individual cells to their value generalizations using a hierarchy-based generalization model [17]. It should be noted that local recoding models are likely to be more powerful than global recoding.

Footnote: Recall that T and V are multisets.

6. RELATED WORK

Protecting anonymity when publishing microdata has long been recognized as a problem [20], and there has been much recent work on computing k-anonymizations for this purpose. The μ-Argus system was also implemented to anonymize microdata [10], but considered attribute combinations of only a limited size, so the results were not always guaranteed to be k-anonymous.

The generalization and suppression framework employed by Incognito was originally defined by Samarati and Sweeney [15], and Samarati proposed a binary search algorithm for discovering a single minimal full-domain generalization [14]. A greedy heuristic algorithm for full-domain generalization ("Datafly") was described in [17], and although the resulting generalization is guaranteed to be k-anonymous, there are no minimality guarantees.

Cost metrics intended to quantify loss of information due to generalization, both for general data use and in the context of data mining applications, were described in [11]. Given such a cost metric, genetic algorithms [11] and simulated annealing [21] have been considered for finding locally minimal anonymizations, using the single-dimension full-subtree recoding model for categorical attributes and the single-dimension ordered-set partitioning model for numeric data. Recently, top-down [7] and bottom-up [19] greedy heuristic algorithms were proposed for producing anonymous data that remains useful for building decision-tree classifiers.

In [3], Bayardo and Agrawal propose a top-down set-enumeration approach for finding an anonymization that is optimal according to a given cost metric, given the single-dimension ordered-set partitioning model. Subsequent work shows that optimal anonymizations under this model may not be as good as anonymizations produced with a multi-dimension variation [12].

Finally, the minimal cell- and attribute-suppression varieties of k-anonymization were proven to be NP-hard (with hardness proofs constructed based on the number of cells and number of attributes, respectively), and O(k log k) [13] and O(k) [1] approximation algorithms were proposed.

7. CONCLUSIONS AND FUTURE WORK

In this paper, we showed that the multi-dimensional data model is a simple and clear way to describe full-domain generalization, and we introduced a class of algorithms that are sound and complete for producing k-anonymous full-domain generalizations using the two key ideas of bottom-up aggregation along generalization dimensions and a priori computation. Although our algorithms (like the previous algorithms) are ultimately exponential in the size of the quasi-identifier, we are able to improve performance substantially, and in some cases perform up to an order of magnitude faster. Through our experiments, we showed that it is feasible to perform minimal full-domain generalization on large databases.

In the future, we believe that the performance of Incognito can be enhanced even more by strategically materializing portions of the data cube, including count aggregates at various points in the dimension hierarchies, much like what was done in [9]. It is also important to perform a more extensive evaluation of the scalability of Incognito and previous algorithms in the case where the original database or the intermediate frequency tables do not fit in main memory.

The second main contribution of this paper is a taxonomy categorizing a number of the possible anonymization models. Building on the ideas of Incognito, we are looking at possible algorithms for the more flexible k-anonymization models described in Section 5.

8. ACKNOWLEDGEMENTS

This work was partially supported by a Microsoft Research graduate fellowship and National Science Foundation Grants IIS-0326328 and IIS-0086002.

Our thanks also to Roberto Bayardo, Chris Kaiserlian, Asher Langton, and three anonymous reviewers for their thoughtful comments on various drafts of this paper.

9. REFERENCES

[1] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu. Anonymizing tables. In Proc. of the 10th Int'l Conference on Database Theory, January 2005.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th Int'l Conference on Very Large Databases, August 1994.
[3] R. Bayardo and R. Agrawal. Data privacy through optimal k-anonymity. In Proc. of the 21st Int'l Conference on Data Engineering, April 2005.
[4] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.
[5] C. Blake and C. Merz. UCI repository of machine learning databases, 1998.
[6] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26.
[7] B. Fung, K. Wang, and P. Yu. Top-down specialization for information and privacy preservation. In Proc. of the 21st Int'l Conference on Data Engineering, April 2005.
[8] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1(1), November 1996.
[9] V. Harinarayan, A. Rajaraman, and J. Ullman. Implementing data cubes efficiently. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data, June.
[10] A. Hundepool and L. Willenborg. μ- and τ-ARGUS: Software for statistical disclosure control. In Proc. of the Third Int'l Seminar on Statistical Confidentiality, 1996.
[11] V. Iyengar. Transforming data to satisfy privacy constraints. In Proc. of the 8th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, August 2002.
[12] K. LeFevre, D. DeWitt, and R. Ramakrishnan. Multidimensional k-anonymity. Technical Report 1521, University of Wisconsin, 2005.
[13] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In Proc. of the 23rd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 2004.
[14] P. Samarati. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13(6), November/December 2001.
[15] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical Report SRI-CSL-98-04, SRI Computer Science Laboratory, 1998.
[16] R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of the 21st Int'l Conference on Very Large Databases, August 1995.
[17] L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness, and Knowledge-based Systems, 10(5):571-588, 2002.
[18] L. Sweeney. K-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness, and Knowledge-based Systems, 10(5):557-570, 2002.
[19] K. Wang, P. Yu, and S. Chakraborty. Bottom-up generalization: A data mining solution to privacy protection. In Proc. of the 4th IEEE International Conference on Data Mining, November 2004.
[20] L. Willenborg and T. deWaal. Elements of Statistical Disclosure Control. Springer Verlag Lecture Notes in Statistics, 2000.
[21] W. Winkler. Using simulated annealing for k-anonymity. Research Report 2002-07, US Census Bureau Statistical Research Division, November 2002.