Differentially Private M-Estimators

Jing Lei
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213
jinglei@andrew.cmu.edu

Abstract

This paper studies privacy-preserving M-estimators using perturbed histograms. The proposed approach allows the release of a wide class of M-estimators with both differential privacy and statistical utility without knowing a priori the particular inference procedure. The performance of the proposed method is demonstrated through a careful study of the convergence rates. A practical algorithm is given and applied to a real-world data set containing both continuous and categorical variables.

1 Introduction

Privacy-preserving data analysis has received increasing attention in recent years. Among various notions of privacy, differential privacy [1, 2] provides a mathematically rigorous privacy guarantee and protects against essentially all kinds of identity attacks regardless of the auxiliary information that may be available to the attackers. Differential privacy requires that the presence or absence of any individual data record can never greatly change the outcome, and hence the user can hardly learn much about any individual data record from the output. However, designing differentially private statistical inference procedures has been a challenging problem. Differential privacy protects individual data by introducing uncertainty in the outcome, which generally requires the output of any inference procedure to be random even for a fixed input data set. This makes differentially private statistical analysis different from most traditional statistical inference procedures, which are deterministic once the data set is given. Most existing works [3, 4, 5] focus on interactive data release, where a particular statistical inference problem is chosen a priori and the randomized output for that particular inference is released to the users. In reality, a data release that allows multiple inference procedures is often desired, because real-world statistical analyses usually consist of a series of inferences such as exploratory analysis, model fitting, and model selection, where the exact inference problem in a later stage is determined by the results of previous steps and cannot be determined in advance.

In this work we study M-estimators under a differentially private framework. The proposed method uses perturbed histograms to provide a systematic way of releasing a class of M-estimators in a non-interactive fashion. Such a non-interactive method uses randomization independent of any particular inference procedure; therefore it allows the users to apply different inference procedures on the same synthetic data set without additional privacy compromise. The accuracy of these privacy-preserving estimates has also been studied, and we prove that, under mild conditions on the contrast functions of the M-estimators, the proposed differentially private M-estimators are consistent. As a special case, this approach gives $1/\sqrt{n}$-consistent estimates for quantiles, providing a simple and efficient alternative solution to similar problems considered in [4, 5]. Our main condition requires convexity and bounded partial derivatives of the contrast function. The convexity is used to ensure the existence and stability of the M-estimator, whereas the bounded derivative controls the bias caused by the perturbed histogram. In the classical theory of M-estimators, a contrast function with bounded derivative implies robustness of the corresponding M-estimator. This is further evidence of the natural connection between robustness and differential privacy [4].

We also describe an algorithm that is conceptually simple and computationally feasible. It is flexible enough to accommodate continuous, ordinal, and categorical variables at the same time, as demonstrated by its application to a Bay Area housing data set.

1.1 Related Work

The perturbed histogram was first described in the context of differential privacy in [1]. The problem of non-interactive release has also been studied by [6], which targets releasing the differentially private distribution function or density function in a non-parametric setting. Theoretically, M-estimators could be obtained indirectly from the released density function. However, the more direct perspective taken in this paper leads to an improved rate of convergence as well as an efficient algorithm.

Several aspects of parameter estimation have been studied with differential privacy under the interactive framework. In particular, [4] shows that many robust estimators can be made differentially private and that general private estimators can be obtained from the composition of robust location and scale estimators. [5] shows that statistical estimators with generic asymptotic normality can be made differentially private with the same asymptotic variance. Both works involve estimating the inter-quartile range in a differentially private manner, where the algorithm may output "No Response" [4], or the data is assumed to have known upper and lower bounds [5]. In a slightly different context, [3] considers penalized logistic regression as a special case of empirical risk minimization, where the penalized logistic regression coefficients are estimated with differential privacy by minimizing a perturbed objective function. Their method uses a different form of perturbation and is still interactive. It connects with the present paper in the sense that the perturbation is ultimately expressed in the objective function. Both papers assume convexity, which ensures that the shift in the minimizer is small when the deviation in the objective function is small. We also note that the method in [3] depends on a strictly convex penalty term, which is typically used in high-dimensional problems, while our method works for problems where no penalization is used.

2 Preliminaries

2.1 Definition of Privacy

A database is modeled as a set of data points $D = \{x_1, \ldots, x_n\} \in \mathcal{X}^n$, where $\mathcal{X}$ is the data universe. In most cases each data entry $x_i$ represents the microdata of an individual. We use the Hamming distance to measure the proximity between two databases of the same size. If $|D| = |D'|$, the Hamming distance is $H(D, D') = |D \setminus D'| = |D' \setminus D|$. The objective of data privacy is to release useful information from the data set while protecting information about any individual data entry.

Definition 1 (Differential Privacy [1]). A randomized function $T(D)$ gives $\alpha$-differential privacy if for all pairs of databases $(D, D')$ with $H(D, D') = 1$ and all measurable subsets $E$ of the image of $T$:

$$\left| \log \frac{P(T \in E \mid D)}{P(T \in E \mid D')} \right| \le \alpha. \tag{1}$$

In the rest of this paper we assume that $n$, the size of the database, is public.

2.2 The Perturbed Histogram

In most statistical problems, a database $D$ consists of $n$ independent copies of a random variable $X$ with density $f(x)$. For simplicity, we assume $\mathcal{X} = [0,1]^d$. As we will see in Section 3.2, our method can be extended to non-compact $\mathcal{X}$ for some important examples. Suppose $[0,1]^d$ is partitioned into cubic cells with equal bandwidth $h_n$ such that $k_n = h_n^{-1}$ is an integer. Denote each cell as $B_r = \bigotimes_{j=1}^{d} [(r_j - 1) h_n, \, r_j h_n)$ for all $r = (r_1, \ldots, r_d) \in \{1, \ldots, k_n\}^d$. (Footnote 1: To make sure that the $B_r$'s form a partition of $[0,1]^d$, the interval should be $[(k_n - 1) h_n, 1]$ when $r_j = k_n$.) The histogram density estimator is then

$$\hat f_{hist}(x) = h_n^{-d} \sum_r \frac{n_r}{n} \mathbf{1}(x \in B_r), \tag{2}$$

where $n_r := \sum_{i=1}^{n} \mathbf{1}(X_i \in B_r)$ is the number of data points in $B_r$.

Clearly the density estimator described above depends on the data only through the histogram counts $(n_r, \, r \in \{1, \ldots, k_n\}^d)$. If we can find a differentially private version of $(n_r, \, r \in \{1, \ldots, k_n\}^d)$, then the corresponding density estimator $\hat f$ will also be differentially private by a simple change-of-measure argument. We consider the following perturbed histogram as described in [1]:

$$\hat n_r = n_r + z_r, \quad \forall \, r \in \{1, \ldots, k_n\}^d, \tag{3}$$

where the $z_r$'s are independent with density $\alpha \exp(-\alpha |z| / 2) / 4$, that is, Laplace noise with scale $2/\alpha$. We have

Lemma 2 ([1]). $(\hat n_r, \, r \in \{1, \ldots, k_n\}^d)$ satisfies $\alpha$-differential privacy.

We call $(\hat n_r, \, r \in \{1, \ldots, k_n\}^d)$ the Perturbed Histogram. Substituting $n_r$ by $\hat n_r$ in (2), we obtain a differentially private version of $\hat f_{hist}$:

$$\hat f_{PH}(x) = h_n^{-d} \sum_r \frac{\hat n_r}{n} \mathbf{1}(x \in B_r). \tag{4}$$

In general $\hat f_{PH}$ given by (4) is not a valid density function, since it can take negative values and may not integrate to 1. To avoid these undesirable properties, [6] uses $\tilde n_r = (\hat n_r \vee 0)$ instead of $\hat n_r$ and $\tilde n = \sum_r \tilde n_r$ instead of $n$, so that the resulting density estimator is non-negative and integrates to 1. Here and below, $a \vee b$ denotes $\max(a, b)$.

2.3 M-estimators

Given a random variable $X$ with density $f(x)$, the parameter of interest is defined as $\theta^* = \arg\min_{\theta \in \Theta} M(\theta)$, where $M(\theta) = \int m(x, \theta) f(x) \, dx$, $\Theta \subseteq \mathbb{R}^p$, and $m(x, \theta)$ is the contrast function. Assuming $X_i \overset{iid}{\sim} f$, the corresponding M-estimator is usually obtained by minimizing the empirical average of the contrast function:

$$\hat\theta = \arg\min_{\theta \in \Theta} M_n(\theta), \quad \text{where } M_n(\theta) = n^{-1} \sum_{i=1}^{n} m(X_i, \theta). \tag{5}$$

M-estimators cover many important statistical inference procedures such as sample quantiles, maximum likelihood estimators (MLE), and least squares estimators. Most M-estimators are $1/\sqrt{n}$-consistent and asymptotically normal. For more details about the theory and application of M-estimators, see [7].

3 Differentially private M-estimators

Combining equations (4) and (5) gives a differentially private objective function:

$$M_{n,PH}(\theta) = \int_{[0,1]^d} \hat f_{PH}(x) \, m(x, \theta) \, dx. \tag{6}$$

We wish to use the minimizer of $M_{n,PH}$ as a differentially private estimate of $\theta^*$. Consider the following set of conditions on the contrast function $m(x, \theta)$.

(A1) $g(x, \theta) := \frac{\partial}{\partial \theta} m(x, \theta)$ exists and $|g(x, \theta)| \le C_1$ on $[0,1]^d \times \Theta$.

(A2) $g(x, \theta)$ is Lipschitz in $x$ and $\theta$: $\|g(x_1, \theta) - g(x_2, \theta)\|_2 \le C_2 \|x_1 - x_2\|_2$ for all $\theta$, and $\|g(x, \theta_1) - g(x, \theta_2)\|_2 \le C_2 \|\theta_1 - \theta_2\|_2$ for all $x$.

(A3) $m(x, \theta)$ is convex in $\theta$ for all $x$, and $M(\theta)$ is twice continuously differentiable with $M''(\theta^*) := \int f(x) \frac{\partial}{\partial \theta} g(x, \theta^*) \, dx$ positive definite.

Condition (A1) requires a bounded derivative of the contrast function, which is closely related to the robustness of the corresponding M-estimator [8]. It indicates that small changes in the underlying distribution cannot change the outcome by too much, which is also required implicitly by the definition of differential privacy. Condition (A2) has two parts. The Lipschitz condition on $x$ is used to bound the bias caused by the histogram approximation, while the Lipschitz condition on $\theta$ is used to establish a uniform upper bound on the sampling error in $M_n'(\theta) = n^{-1} \sum_i g(x_i, \theta)$ as well as a uniform upper bound on the error caused by the additive Laplacian noise. Condition (A3) requires some curvature in the objective function in a neighborhood of the true parameter, which ensures that the minimizer is stable under small perturbations.

The following theorem is our first main result:

Theorem 3. Under conditions (A1)-(A3), if $h_n \asymp (\sqrt{\log n} / n)^{2/(2+d)}$, then there exists a local minimizer, $\hat\theta_{PH}$, of $M_{n,PH}$, such that

$$|\hat\theta_{PH} - \theta^*| = O_P\left( n^{-1/2} \vee (\sqrt{\log n} / n)^{2/(2+d)} \right). \tag{7}$$

A proof of Theorem 3 is given in the supplementary material. At a high level, by assumption (A3) it suffices to show (Lemma 9) that $\sup_{\theta \in \Theta_0} |M_{n,PH}'(\theta) - M'(\theta)| = O_P(1/\sqrt{n} \vee (\sqrt{\log n} / n)^{2/(2+d)})$ for some compact neighborhood $\Theta_0$ of $\theta^*$. The approximation error of $M_{n,PH}'(\theta)$ can be decomposed into three parts:

$$\int (\hat f_{PH}(x) - f(x)) \, g(x, \theta) \, dx = n^{-1} \sum_r z_r h_n^{-d} \int_{B_r} g(x, \theta) \, dx + n^{-1} \sum_r \Big( n_r h_n^{-d} \int_{B_r} g(x, \theta) \, dx - \sum_{i: X_i \in B_r} g(X_i, \theta) \Big) + \Big( n^{-1} \sum_i g(X_i, \theta) - E \, g(X, \theta) \Big). \tag{8}$$

The three terms on the right-hand side of (8) correspond to the effect of the Laplace noise added for privacy, the bias caused by using the histogram, and the sampling error, respectively. As in the general theory of histogram estimators, the approximation error depends on the choice of bandwidth $h_n$. Generally speaking, if the bandwidth is small, then the histogram bias term will be small. However, a smaller bandwidth leads to a larger number of cells and hence more Laplacian noise. As a result, there is a trade-off between the histogram bias and the Laplacian noise in the choice of bandwidth. The bandwidth given in Theorem 3 balances these two parts. We also comment on practical choices of $h_n$ in Section 4.

We prove Theorem 3 by investigating the convergence rate of each term on the right-hand side of (8). First (Lemma 10), by empirical process theory [9, 10], under conditions (A1) and (A2) the sampling error term in (8) is of order $O_P(1/\sqrt{n})$, uniformly on $\Theta_0$. Second, using the Lipschitz property of $g$, the histogram bias term in (8) is of order $O(h_n)$. Therefore it suffices to show that

$$\sup_{\theta \in \Theta_0} \Big| \sum_r n^{-1} z_r h_n^{-d} \int_{B_r} g(x, \theta) \, dx \Big| = O_P\left( (\sqrt{\log n} / n) \, h_n^{-d/2} \right),$$

which can be established using a concentration inequality due to Talagrand [11] (see also [12, Equation 1.3]), together with a $\delta$-net argument (Lemma 11) enabled by the Lipschitz property of $g$ in $\theta$. Equating this noise rate with the bias rate $O(h_n)$ gives $h_n^{1+d/2} \asymp \sqrt{\log n}/n$, which is the bandwidth stated in Theorem 3.

3.1 Algorithm based on perturbed histogram

In practice, exact integration of $\hat f_{PH}(x) \, m(x, \theta)$ over each cell $B_r$ may be computationally expensive, and approximations must be adopted to make the implementation feasible. Note that $\hat f_{PH}(x)$ is piecewise constant. The integration can be simplified by using a piecewise constant approximation of $m(x, \theta)$. Formally, we introduce the following algorithm:

Algorithm 1 (M-estimator using perturbed histogram)
Input: $D = \{X_1, \ldots, X_n\}$, $m(\cdot, \cdot)$, $\alpha$, $h_n$.
1. Construct the perturbed histogram with bandwidth $h_n$ and privacy parameter $\alpha$ as in (3).
2. Let $M_{n,PH}(\theta) = n^{-1} \sum_r \hat n_r \, m(a_r, \theta)$, where $a_r \in [0,1]^d$ is the center of $B_r$, with $a_r(j) = (r_j - 0.5) h_n$ for all $1 \le j \le d$.
3. Output $\hat\theta_{PH} = \arg\min_{\theta \in \Theta} M_{n,PH}(\theta)$.

Compared with the estimator obtained by minimizing the exact integral (6), the only term in (8) affected by using $g(a_r, \theta)$ instead of $h_n^{-d} \int_{B_r} g(x, \theta) \, dx$ is the histogram bias term. However, note that

$$\Big| g(a_r, \theta) - h_n^{-d} \int_{B_r} g(x, \theta) \, dx \Big| = O(h_n).$$

As a result, the convergence rate of $\hat\theta_{PH}$ remains the same:

Theorem 4 (Statistical Utility of Algorithm 1). Under Assumptions (A1)-(A3), if $M_{n,PH}(\theta)$ is given by Algorithm 1 with $h_n \asymp (\sqrt{\log n} / n)^{2/(2+d)}$, then there exists a local minimizer, $\hat\theta_{PH}$, of $M_{n,PH}(\theta)$, such that

$$|\hat\theta_{PH} - \theta^*| = O_P\left( 1/\sqrt{n} \vee (\sqrt{\log n} / n)^{2/(2+d)} \right). \tag{9}$$

Example (Logistic regression). We give a concrete example that satisfies (A1)-(A3). Let $D = \{(X_i, Y_i) \in [0,1] \times \{0,1\} : 1 \le i \le n\}$, where the conditional distribution of $Y_i$ given $X_i$ is Bernoulli with parameter $\exp(\beta X_i) / [1 + \exp(\beta X_i)]$. The maximum likelihood estimator for $\beta$ is $\hat\beta_{MLE} = \arg\min_\beta \sum_i [-\beta Y_i X_i + \log(1 + \exp(\beta X_i))]$. Here the contrast function is $m(x, y; \beta) = -\beta x y + \log(1 + \exp(\beta x))$, and it is easy to check that (A1)-(A3) hold. In this example $X$ is continuous and $Y$ is binary, so it is only necessary to discretize $X$ when constructing the histogram. To be specific, suppose $[0,1]$ is partitioned into equal-sized cells $(B_r, \, 1 \le r \le k_n)$ as in the ordinary univariate histogram. The joint histogram for $(X, Y)$ is constructed by counting the number of data points in each of the product cells $B_{r,j} := B_r \times \{j\}$ for $j = 0, 1$. See Subsection 4.1 for more details on constructing histograms when there are categorical variables.

Note that Theorems 3 and 4 do not guarantee the uniqueness or even the existence of a global minimizer for the perturbed objective function $M_{n,PH}(\theta)$. This is because, with small probability, some perturbed histogram count $\hat n_r$ can be negative, and hence the corresponding objective function $M_{n,PH}$ may not be convex. In our simulation and real-data experience this is usually not a real problem, since an argument similar to that of Theorem 3 shows that, with high probability, the second derivative $M_{n,PH}''$ is uniformly close to $M''$ on any compact subset of $\Theta$. To avoid this issue completely, one can threshold after perturbation, as described in the following algorithm.

Algorithm 1' (Perturbed histogram with nonnegative counts)
Input: $D = \{X_1, \ldots, X_n\}$, $m(\cdot, \cdot)$, $\alpha$, $h_n$.
1. Construct the perturbed histogram with bandwidth $h_n$ and privacy parameter $\alpha$ as in (3).
2. Let $\tilde M_{n,PH}(\theta) = n^{-1} \sum_r \tilde n_r \, m(a_r, \theta)$, where $\tilde n_r = \max(\hat n_r, 0)$.
3. Output $\tilde\theta_{PH} = \arg\min_{\theta \in \Theta} \tilde M_{n,PH}(\theta)$.

Although the thresholding guarantees, by convexity of $\tilde M_{n,PH}(\theta)$, that any zero of $\tilde M_{n,PH}'(\theta)$ is indeed a global minimizer, it increases the approximation error introduced by the Laplacian noise, because the noise terms no longer cancel each other in the first term on the right-hand side of equation (8). We have the following utility result for Algorithm 1':

Theorem 5. Under Assumptions (A1)-(A3) and $h_n \asymp (\log n / n)^{1/(1+d)}$, the estimator given by Algorithm 1' satisfies

$$|\tilde\theta_{PH} - \theta^*| = O_P\left( (\log n / n)^{1/(1+d)} \right).$$

Proof. The proof follows essentially from that of Theorem 3, with a different choice of bandwidth $h_n$. The concentration inequality result no longer holds for $\sum_r \tilde z_r \, g(a_r, \theta)$, where $\tilde z_r = \max(z_r, -n_r)$, because the $\tilde z_r$'s are not independent. Instead, we consider a direct union bound: $\sup_r |\tilde z_r| \le \sup_r |z_r| = O_P(\log h_n^{-d}) = O_P(\log n)$. Therefore the Laplacian noise term on the right-hand side of (8) is bounded uniformly for all $\theta$ by $O_P(n^{-1} h_n^{-d} \log n)$. The histogram bias is still $O(h_n)$, as mentioned in the discussion of Algorithm 1. Setting $h_n \asymp n^{-1} h_n^{-d} \log n$ shows that the convergence rate is optimized by choosing $h_n \asymp (\log n / n)^{1/(1+d)}$. $\square$
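The steps of Algorithm 1' can be sketched in a few lines of code. The following is a minimal illustration rather than the paper's implementation: it assumes one-dimensional data in $[0,1]$, takes a Huber location contrast as the M-estimator (chosen here only because its score function is bounded and Lipschitz, in the spirit of (A1)-(A2)), and minimizes the thresholded objective by bisection on its derivative. The function names `laplace_noise`, `perturbed_histogram`, `huber_score`, and `private_location`, and the parameter `delta`, are hypothetical and do not come from the paper.

```python
import random


def laplace_noise(alpha, rng):
    """Draw z with density alpha * exp(-alpha * |z| / 2) / 4, as in (3)."""
    scale = 2.0 / alpha
    # The difference of two independent Exp(1) draws is standard Laplace.
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def perturbed_histogram(data, k, alpha, rng):
    """Return cell centers a_r and perturbed counts for data in [0, 1]."""
    counts = [0] * k
    for x in data:
        counts[min(int(x * k), k - 1)] += 1  # the last cell is closed at 1
    centers = [(r + 0.5) / k for r in range(k)]
    noisy = [c + laplace_noise(alpha, rng) for c in counts]
    return centers, noisy


def huber_score(x, theta, delta=0.1):
    """g(x, theta): derivative in theta of the Huber contrast (assumed)."""
    return -max(-delta, min(delta, x - theta))


def private_location(data, k, alpha, rng, tol=1e-8):
    """Sketch of Algorithm 1' for a Huber location parameter on [0, 1]."""
    centers, noisy = perturbed_histogram(data, k, alpha, rng)
    weights = [max(w, 0.0) for w in noisy]  # thresholding step of Algorithm 1'

    def gradient(theta):
        return sum(w * huber_score(a, theta) for a, w in zip(centers, weights))

    # With nonnegative weights the gradient is nondecreasing in theta,
    # so bisection on [0, 1] locates a zero of the derivative.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gradient(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.betavariate(2.0, 2.0) for _ in range(20000)]
    print(private_location(sample, k=20, alpha=1.0, rng=rng))
```

Because only the perturbed counts enter the objective, the same `centers` and `weights` could be reused for other contrast functions without spending additional privacy budget, which is the point of the non-interactive release.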
3.2 Non-differentiable contrast functions

Now we consider the possibility of relaxing condition (A2). Allowing discontinuity in $g(x, \theta)$ is motivated by a class of M-estimators whose contrast functions $m(x, \theta)$ are non-differentiable on a set of measure zero. An important example is the quantile. For a random variable $X \in \mathbb{R}^1$ with cumulative distribution function $F(\cdot)$ and any given $\tau \in (0, 1)$, the $\tau$-th quantile of $X$ is $q(\tau) := F^{-1}(\tau)$, which corresponds to an M-estimator with $m(x, \theta) = (1 - \tau)(x - \theta)^- + \tau (x - \theta)^+$ (see [13]). Quantiles provide important information about the distribution, including both location (median) and scale (inter-quartile range). The robustness of sample quantiles also makes them good candidates for differentially private data release. Differentially private quantile estimators are indeed major building blocks for some existing privacy-preserving statistical estimators [4, 5]. Our result in this subsection shows that perturbed histograms can give simple, consistent, and differentially private quantile estimators. The following set of conditions suffices for this purpose, and the argument is largely the same as for Theorem 4:

(B1) $m(x, \theta)$ is convex and Lipschitz in both $x$ and $\theta$.

(B2) $M(\theta)$ is twice differentiable at $\theta^*$ with $M''(\theta^*) > 0$.

(B3) $\Theta$ is compact and convex.

Corollary 6 (Statistical utility of Algorithm 1). Under conditions (B1)-(B3) and $h_n \asymp (\sqrt{\log n} / n)^{2/(2+d)}$, any minimizer $\hat\theta_{PH}$ of $M_{n,PH}$ given by Algorithm 1 satisfies (9).

Proof. The argument is largely the same as the proof of Theorem 3. Here we consider the original objective functions $M_{n,PH}$ and $M$ instead of their derivatives. By a decomposition similar to eq. (8), using the compactness of $\Theta$, we have $\sup_{\theta \in \Theta} |M_{n,PH}(\theta) - M(\theta)| = O_P(1/\sqrt{n} \vee (\sqrt{\log n} / n)^{2/(2+d)})$. Then the convergence of $\hat\theta_{PH}$ follows from the convexity of $M$. $\square$
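For the quantile contrast function, Algorithm 1 can be sketched directly. This is an illustrative sketch under stated assumptions, not the paper's code: the data are one-dimensional in $[0,1]$, the parameter set is taken to be $\Theta = [0,1]$, and the function names `laplace_noise`, `check_loss`, and `private_quantile` are hypothetical. Since the objective $\sum_r \hat n_r \, m(a_r, \theta)$ is piecewise linear in $\theta$ with kinks only at the cell centers, its minimum over $[0,1]$ is attained at a cell center or at an endpoint, so the sketch simply evaluates the objective at those candidates.

```python
import random


def laplace_noise(alpha, rng):
    """Draw z with density alpha * exp(-alpha * |z| / 2) / 4, as in (3)."""
    scale = 2.0 / alpha
    # The difference of two independent Exp(1) draws is standard Laplace.
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def check_loss(x, theta, tau):
    """Quantile contrast m(x, theta) = (1 - tau)(x - theta)^- + tau (x - theta)^+."""
    u = x - theta
    return tau * max(u, 0.0) + (1.0 - tau) * max(-u, 0.0)


def private_quantile(data, tau, k, alpha, rng):
    """Sketch of Algorithm 1 for the tau-th quantile of data in [0, 1]."""
    counts = [0] * k
    for x in data:
        counts[min(int(x * k), k - 1)] += 1  # the last cell is closed at 1
    centers = [(r + 0.5) / k for r in range(k)]
    noisy = [c + laplace_noise(alpha, rng) for c in counts]

    def objective(theta):
        # Proportional to M_{n,PH}(theta); the factor 1/n does not change the argmin.
        return sum(w * check_loss(a, theta, tau) for a, w in zip(centers, noisy))

    # The objective is piecewise linear with kinks at the cell centers, so a
    # minimizer over [0, 1] lies at a center or at one of the two endpoints.
    candidates = [0.0] + centers + [1.0]
    return min(candidates, key=objective)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.random() for _ in range(20000)]
    print(private_quantile(sample, tau=0.25, k=50, alpha=1.0, rng=rng))
```

The returned value is always a cell center or an endpoint, so its resolution is limited by the bandwidth $1/k$. The exhaustive search costs $O(k^2)$ operations, which is acceptable for the small $k$ used in this sketch.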
Remark 7. Condition (B3) is the most restrictive one. It requires $\Theta$ to be bounded. This is because the proof uses the fact that $M_n(\theta)$ and $M(\theta)$ are uniformly close for large $n$, which is usually true for a bounded set of $\theta$.

Remark 8. For quantiles the contrast function is piecewise linear, so for most cells in the histogram there would be no approximation error if the data points are approximated by the cell center. The M-estimators for quantiles actually enjoy faster convergence rates.

Extension to distributions supported on $(-\infty, \infty)$. Recall that we assume $X \in [0,1]^d$. For quantiles we have $d = 1$, and the quantile estimators described above can be extended to any continuous random variable whose density function is supported on $(-\infty, \infty)$. Let $\{Z_i, \, i = 1, \ldots, n\}$ be an independent sample from density $f_Z$ with $f_Z(z) > 0$ for all $z \in \mathbb{R}^1$. Let $\tau \in (0, 1)$ and suppose we want to estimate $q_Z(\tau)$, the $\tau$-th quantile of $Z$. To apply our method, define $X = \exp(Z) / (1 + \exp(Z))$. Clearly the quantiles are preserved under this monotone transformation. Applying the perturbed histogram quantile estimator to $\{X_i, \, i = 1, \ldots, n\}$ we obtain $\hat q_{X,PH}(\tau)$, the differentially private $\tau$-th quantile of $X$, which is $1/\sqrt{n}$-consistent by Corollary 6. As a result, the estimate $\hat q_{Z,PH}(\tau) := \log[\hat q_{X,PH}(\tau) / (1 - \hat q_{X,PH}(\tau))]$ is a $1/\sqrt{n}$-consistent estimator for $q_Z(\tau)$.

4 Practical Aspects

4.1 Complexity and Flexibility

From now on we drop the logarithmic terms to simplify the presentation. Suppose $h_n \asymp n^{-2/(2+d)}$. Then the perturbed histogram $(\hat n_r : r \in \{1, \ldots, h_n^{-1}\}^d)$ can be constructed in $O(n^{2d/(2+d)})$ time by specifying the corresponding cell for each data point. Once the histogram is constructed, following Algorithm 1, we can view it as a set of $h_n^{-d} = O(n^{2d/(2+d)})$ weighted data points $\{a_r, \, r \in \{1, \ldots, h_n^{-1}\}^d\}$ associated with weights $\{\hat n_r\}$, where each data point $a_r$ is the center of cell $B_r$ as defined in Step 2 of Algorithm 1. For M-estimators that allow a closed-form solution in terms of the minimal sufficient statistics, such as least squares regression, $M_{n,PH}(\theta)$ (and hence $\hat\theta_{PH}$) can be calculated in $O(n^{2d/(2+d)})$ time. For general M-estimators that require iterative optimization, such as logistic regression, the Hessian and gradients can be calculated in $O(n^{2d/(2+d)})$ time in each iteration. Such a weighted sample representation can be easily implemented using standard data structures in common statistical programming packages such as R and Matlab.

Another attractive property of the proposed approach is its flexibility to accommodate different data types. As seen in the logistic regression example in Subsection 3.1, it is straightforward to construct multivariate histograms when some variables are categorical and some are continuous. In such cases it suffices to discretize the continuous variables. To be specific, let $(X_1, \ldots, X_{d_1}) \in [0,1]^{d_1}$ be a $d_1$-dimensional continuous variable and $(Y_1, \ldots, Y_{d_2}) \in \bigotimes_{j=1}^{d_2} \{1, \ldots, k_j\}$ be a set of $d_2$ discrete variables, where $Y_j$ takes values in $\{1, \ldots, k_j\}$. For any bandwidth $h$, let $\{B_r, \, r \in \{1, \ldots, h^{-1}\}^{d_1}\}$ be the corresponding set of histogram cells in $[0,1]^{d_1}$. Then the joint histogram for $(X, Y)$ is constructed with cells

$$\Big\{ B_{r,y} : r \in \{1, \ldots, h^{-1}\}^{d_1}, \; y \in \bigotimes_{j=1}^{d_2} \{1, \ldots, k_j\} \Big\}.$$

Because only the continuous variables have histogram approximation error, the theoretical results developed in Section 3 are applicable with sample size $n$ and dimensionality $d_1$.

4.2 Improvement by enhanced thresholding

In applications such as regression, the multivariate distribution often concentrates on a subset (usually a lower-dimensional manifold) of $[0,1]^d$. Therefore many non-zero cells are artificially created by the additive noise. To alleviate this problem, we threshold the histogram with an enhanced cut-off value: $\tilde n_r = \hat n_r \mathbf{1}(\hat n_r \ge A \log n / \alpha)$, where $A > 0$ is a tuning parameter. This is based on the intuition that the maximal noise will be $O(\log n / \alpha)$. As shown in the following data example, such a simple thresholding step markedly improves the accuracy.

4.3 Application to housing price data

As an illustration, we apply our method to a housing price data set consisting of 348,189 houses sold in the San Francisco Bay Area between 2003 and 2006. For each house, the data contains the price, size, year of transaction, and county in which the house is located. The inference problem of interest is to study the relationship between housing price and the other variables [14]. In our case, we want to build a simple linear regression model to predict the housing price using the other variables while protecting each individual transaction record with differential privacy.

The data set has two continuous variables (price and size), one ordinal variable (year of sale) with 4 levels, and one categorical variable (county) with 9 levels. The preprocessing filters out data points with price outside the range of $\$10^5$ to $\$9 \times 10^5$ or with size larger than 3000 sqft. We also combine small counties that are geographically close and have similar housing prices. After the preprocessing there are 250,070 data points, and the county variable has 6 levels after the combination. For each (year, county) combination, a perturbed histogram is constructed over the two continuous variables with privacy parameter $\alpha$ and $K$ levels in each continuous dimension. Then there are $4 \times 6 \times K^2$ cells, each having a perturbed histogram count. Using the weighted sample representation described in Subsection 4.1, the perturbed data can be viewed as a data set with $24 K^2$ data points weighted by the perturbed histogram counts. A differentially private regression coefficient is obtained by applying a weighted least squares regression to this data set. To assess the performance, the privacy-preserving regression coefficients are compared with those given by the non-private ordinary least squares (OLS) estimates. In particular, we look at the coordinate-wise relative deviance from the OLS coefficients: $\varepsilon = |\hat\theta_{priv} / \hat\theta_{OLS} - 1|$. To account for the randomness of the additive noise, we repeat 100 times and report the root mean square error: $\bar\varepsilon = (\sum_{i=1}^{100} \varepsilon_i^2 / 100)^{1/2}$, where $\varepsilon_i$ is the relative error obtained in the $i$-th repetition.

The results are summarized in Table 1. We test 2 values of $\alpha$, the privacy parameter. Recall that a smaller value of $\alpha$ indicates a stronger privacy guarantee. For each value of $\alpha$ we apply both the original Algorithm 1 and the enhanced thresholding described in Subsection 4.2, with tuning parameter $A = 1/2$. For $\alpha = 1$ the coefficients given by the perturbed histogram are close to those given by OLS, with most relative deviances below 5%. When $\alpha = 0.1$, which is a conservative choice because $\exp(0.1) \approx 1.1$, the perturbed histogram still gives reasonably close estimates, with average deviance of roughly 10% or less for all parameters except the county dummy variable "County4". This variable has the smallest OLS coefficient in absolute value among all county dummy variables, so weight fluctuation in the histogram has a relatively larger impact on the relative deviance. Even so, the perturbed histogram still gives an at least qualitatively correct estimate. We also observe that the thresholded histogram gives a more accurate estimate for all coefficients except County4 when $\alpha = 0.1$.

Table 1: Linear regression coefficients using the Bay Area housing data. The second column gives the regression coefficients from the ordinary least squares method without any perturbation. We compare the estimates given by (1) the perturbed histogram (PH, Algorithm 1) and (2) the perturbed histogram with enhanced thresholding (THLD) as described in Subsection 4.2. The reported number is the root mean square relative error (in percent) over 100 perturbations as described above. The histograms use K = 10 segments in each continuous dimension.

                           alpha = 0.1        alpha = 1
Variable        OLS        PH      THLD       PH      THLD
Intercept    135141      10.6       7.7      7.2       4.4
Size            209       4.7       3.5      3.6       2.3
Year          56375       4.6       2.8      1.0       0.4
County2      -53765       8.0       7.8      1.5       0.7
County3      146593       4.2       2.5      0.8       0.3
County4      -27546      29.8      37.1      2.8       2.1
County5       45828       9.8       7.9      1.4       1.3
County6     -140738       7.1       3.3      1.0       0.4

The choice of $K$ should depend on the sample size and dimensionality. Our theory suggests $K = O(n^{2/(2+d)})$, where $d$ is the dimensionality of the histogram and hence equals the number of continuous variables. In this data set $n = 250{,}070$ and $d = 2$, which suggests $K \approx 500$. This is not a good choice since it produces $24 \times 500^2 = 6 \times 10^6$ cells, far more cells than data points. Let the number of cells be $c(K)$. In practice, it makes sense to choose $K$ such that the average data count in a cell, $n / c(K)$, is much larger than the maximum additive noise $\max_r |z_r|$, which is $O_P(\log c(K))$. For this data set, when $K = 10$ we have $c(K) = 2400$, so $n / c(K) \approx 100$ and $\log(c(K)) \approx 7.78$.

5 Further Discussions

We demonstrate how histograms can be used as a basic tool for statistical parameter estimation under strong privacy constraints. The perturbed histogram adds to each histogram count a double-exponential noise with a constant parameter depending only on the privacy budget $\alpha$. The histogram approximation bias and the additive noise on the cell counts result in a bias-variance trade-off, as usually seen for histogram-based methods. Such an algorithm should work well for low-dimensional problems. Solutions to higher-dimensional problems are yet to be developed. One possibility is to perturb the minimal sufficient statistics, because the dimensionality of the minimal sufficient statistics is usually much smaller than the number of histogram cells. For example, in linear regression analysis, it suffices to obtain the first and second moments of all variables in a privacy-preserving way. However, perturbing minimal sufficient statistics would only work for a single estimator and is only possible for interactive release. We are seeing another type of privacy-utility trade-off, where the utility is not only about the rate of convergence, but also about the range of possible analyses allowed by the data releasing mechanism.

The perturbed histogram is also related to "error in variable" inference problems. If the original data is just the histogram, then the perturbed version can be thought of as the true histogram counts contaminated by measurement errors. In this paper we provide consistency results for a class of inference problems in the presence of such measurement errors. However, plugging in the perturbed values does not necessarily give the best inference procedure, and better alternatives may be possible; see [15] for a hypothesis testing example in contingency tables. An important and challenging question is how to find the optimal inference procedure in the presence of such measurement errors. A positive answer to this question will help establish a lower bound on the approximation error and clarify the power and limits of perturbed histograms.

Acknowledgements

Jing Lei was partially supported by NSF Grant BCS-0941518.

References

[1] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Theory of Cryptography Conference, pages 265–284, 2006.
[2] C. Dwork. Differential privacy. In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP) (2), pages 1–12, 2006.
[3] K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems, 2008.
[4] C. Dwork and J. Lei. Differential privacy and robust statistics. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, 2009.
[5] A. Smith. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, 2011.
[6] L. Wasserman and S. Zhou. A statistical framework for differential privacy. Journal of the American Statistical Association, 105:375–389, 2010.
[7] P. J. Huber and E. M. Ronchetti. Robust Statistics. John Wiley & Sons, Inc., 2nd edition, 2009.
[8] F. Hampel, E. Ronchetti, P. Rousseeuw, and W. Stahel. Robust Statistics: The Approach Based on Influence Functions. John Wiley, New York, 1986.
[9] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[10] M. Talagrand. Sharper bounds for Gaussian and empirical processes. The Annals of Probability, 22:28–76, 1994.
[11] M. Talagrand. A new isoperimetric inequality and the concentration of measure phenomenon. Lecture Notes in Mathematics, 1469/1991:94–124, 1991.
[12] S. Bobkov and M. Ledoux. Poincaré's inequalities and Talagrand's concentration phenomenon for the exponential distribution. Probability Theory and Related Fields, 107:383–400, 1997.
[13] R. Koenker and K. F. Hallock. Quantile regression. Journal of Economic Perspectives, 15:143–156, 2001.
[14] R. K. Pace and R. Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33:291–297, 1997.
[15] D. Vu and A. Slavkovic. Differential privacy for clinical trial data: Preliminary evaluations. In Proceedings of the 2009 IEEE International Conference on Data Mining Workshops, 2009.