On the Privacy Preserving Properties of Random Data Perturbation Techniques

Hillol Kargupta and Souptik Datta
Computer Science and Electrical Engineering Department
University of Maryland Baltimore County
Baltimore, Maryland 21250, USA
{hillol, souptik1}@cs.umbc.edu

Qi Wang and Krishnamoorthy Sivakumar
School of Electrical Engineering and Computer Science
Washington State University
Pullman, Washington 99164-2752, USA
{qwang, siva}@eecs.wsu.edu

Abstract

Privacy is becoming an increasingly important issue in many data mining applications. This has triggered the development of many privacy-preserving data mining techniques. A large fraction of them use randomized data distortion techniques to mask the data for preserving the privacy of sensitive data. This methodology attempts to hide the sensitive data by randomly modifying the data values, often using additive noise. This paper questions the utility of the random value distortion technique in privacy preservation. The paper notes that random objects (particularly random matrices) have "predictable" structures in the spectral domain and it develops a random matrix-based spectral filtering technique to retrieve original data from the dataset distorted by adding random values. The paper presents the theoretical foundation of this filtering method and extensive experimental results to demonstrate that in many cases random data distortion preserves very little data privacy.

1. Introduction

Many data mining applications deal with privacy-sensitive data. Financial transactions, health-care records, and network communication traffic are some examples. Data mining in such privacy-sensitive domains is facing growing concerns. Therefore, we need to develop data mining techniques that are sensitive to the privacy issue. This has fostered the development of a class of data mining algorithms [2, 9] that try to extract the data patterns without directly accessing the original data and guarantee that the mining process does not get sufficient information to reconstruct the original data.

This paper considers a class of techniques for privacy-preserving data mining by randomly perturbing the data while preserving the underlying probabilistic properties. It explores the random value perturbation-based approach [2], a well-known technique for masking the data using random noise. This approach tries to preserve data privacy by adding random noise, while making sure that the random noise still preserves the "signal" from the data so that the patterns can still be accurately estimated. This paper questions the privacy-preserving capability of the random value perturbation-based approach. It shows that in many cases, the original data (sometimes called "signal" in this paper) can be accurately estimated from the perturbed data using a spectral filter that exploits some theoretical properties of random matrices. It presents the theoretical foundation and provides experimental results to support this claim.

Section 2 offers an overview of the related literature on privacy preserving data mining. Section 3 presents the motivation behind the framework presented in this paper. Section 4 describes the random data perturbation method proposed in [2]. Section 5 presents a discussion on the eigenvalues of random matrices. Section 6 presents the intuition behind the theory for separating the random component from a mixture of random and non-random components. Section 7 describes the proposed random matrix-based filtering technique. Section 8 applies the proposed technique and reports its performance for various data sets. Finally, Section 9 concludes this paper.

2. Related Work

There exists a growing body of literature on privacy-sensitive data mining. These algorithms can be divided into several different groups. One approach adopts a distributed framework. This approach supports computation of data mining models and extraction of "patterns" at a given node by exchanging only the minimal necessary information among the participating nodes without transmitting the raw data. Privacy preserving association rule mining from homogeneous [9] and heterogeneous [19] distributed data sets are a few examples. The second approach is based on
data-swapping, which works by swapping data values within the same feature [3].

There is also an approach which works by adding random noise to the data in such a way that the individual data values are distorted while the underlying distribution properties are preserved at a macroscopic level. The algorithms belonging to this group work by first perturbing the data using randomized techniques. The perturbed data is then used to extract the patterns and models. The randomized value distortion technique for learning decision trees [2] and association rule learning [6] are examples of this approach. Additional work on randomized masking of data can be found elsewhere [18].

This paper explores the third approach [2]. It points out that in many cases the noise can be separated from the perturbed data by studying the spectral properties of the data and, as a result, its privacy can be seriously compromised. Agrawal and Aggarwal [1] have also considered the approach in [2] and have provided an expectation-maximization (EM) algorithm for reconstructing the distribution of the original data from perturbed observations. They also provide information theoretic measures (mutual information) to quantify the amount of privacy provided by a randomization approach. Agrawal and Aggarwal [1] remark that the method suggested in [2] does not take into account the distribution of the original data (which could be used to guess the data value to a higher level of accuracy). However, [1] provides no explicit procedure to reconstruct the original data values. Evfimievski et al. [5, 4] and Rizvi [15] have also considered the approach in [2] in the context of association rule mining and suggest techniques for limiting privacy breaches. Our primary contribution is to provide an explicit filtering procedure, based on random matrix theory, that can be used to estimate the original data values.

3. Motivation

As noted in the previous section, a growing body of privacy preserving data mining techniques is adopting randomization as a primary tool to "hide" information. While randomization is an important tool, it must be used very carefully in a privacy-preserving application.

Randomness may not necessarily imply uncertainty. Random events can often be analyzed and their properties can be explained using probabilistic frameworks. Statistics, randomized computation, and many other related fields are full of theorems, laws, and algorithms that rely on probabilistic characterization of random processes and that often work quite accurately. The signal processing literature [12] offers many filters to remove white noise from data and they often work reasonably well. Randomly generated structures like graphs demonstrate interesting properties [7]. In short, randomness does seem to have "structure" and this structure may be used to compromise privacy unless we pay careful attention. The rest of this paper illustrates this challenge in the context of a well-known privacy preserving technique that works using random additive noise.

4. Random Value Perturbation Technique: A Brief Review

For the sake of completeness, we now briefly review the random data perturbation method suggested in [2] for hiding the data (i.e. guaranteeing protection against the reconstruction of the data) while still being able to estimate the underlying distribution.

4.1. Perturbing the Data

The random value perturbation method attempts to preserve privacy of the data by modifying values of the sensitive attributes using a randomized process [2]. The authors explore two possible approaches, Value-Class Membership and Value Distortion, and emphasize the Value Distortion approach. In this approach, the owner of a dataset returns a value $u_i + v$, where $u_i$ is the original data, and $v$ is a random value drawn from a certain distribution. The most commonly used distributions are the uniform distribution over an interval $[-\alpha, \alpha]$ and the Gaussian distribution with mean $\mu = 0$ and standard deviation $\sigma$. The $n$ original data values $u_1, u_2, \ldots, u_n$ are viewed as realizations of $n$ independent and identically distributed (i.i.d.) random variables $U_i$, $i = 1, 2, \ldots, n$, each with the same distribution as that of a random variable $U$. In order to perturb the data, $n$ independent samples $v_1, v_2, \ldots, v_n$ are drawn from a distribution $V$. The owner of the data provides the perturbed values $u_1 + v_1, u_2 + v_2, \ldots, u_n + v_n$ and the cumulative distribution function $F_V(v)$ of $V$. The reconstruction problem is to estimate the distribution $F_U(x)$ of the original data from the perturbed data.

4.2. Estimation of Distribution Function from the Perturbed Dataset

The authors [2] suggest the following method to estimate the distribution $F_U(x)$ of $U$, given $n$ independent samples $w_i = u_i + v_i$, $i = 1, 2, \ldots, n$, and $F_V(v)$. Using Bayes' rule, the posterior distribution function $F'_U(x)$ of $U$, given that $U + V = w$, can be written as

$$F'_U(x) = \frac{\int_{-\infty}^{x} f_V(w - z)\, f_U(z)\, dz}{\int_{-\infty}^{\infty} f_V(w - z)\, f_U(z)\, dz},$$

which upon differentiation with respect to $x$ yields the density function

$$f'_U(x) = \frac{f_V(w - x)\, f_U(x)}{\int_{-\infty}^{\infty} f_V(w - z)\, f_U(z)\, dz},$$

where $f_U(\cdot)$ and $f_V(\cdot)$ denote the probability density functions of $U$ and $V$ respectively. If we have $n$ independent samples $u_i + v_i = w_i$, $i = 1, 2, \ldots, n$, the corresponding posterior distribution can be obtained by averaging:

$$f'_U(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_V(w_i - x)\, f_U(x)}{\int_{-\infty}^{\infty} f_V(w_i - z)\, f_U(z)\, dz} \qquad (1)$$

For a sufficiently large number of samples $n$, we expect the above density function to be close to the real density function $f_U(x)$. In practice, since the true density $f_U(x)$ is unknown, we need to modify the right-hand side of equation 1. The authors suggest an iterative procedure where at each step $j = 1, 2, \ldots$ the posterior density $f^{j-1}_U(x)$ estimated at step $j - 1$ is used in the right-hand side of equation 1. The uniform density is used to initialize the iterations. The iterations are carried out until the difference between successive estimates becomes small. In order to speed up computations, the authors also discuss approximations to the above procedure using partitioning of the domain of data values.

5. Randomness and Patterns

The random perturbation technique "apparently" distorts the sensitive attribute values and still allows estimation of the underlying distribution information. However, does this apparent distortion fundamentally prohibit us from extracting the hidden information? This section presents a discussion on the properties of random matrices and presents some results that will be used later in this paper.

Random matrices [13] exhibit many interesting properties that are often exploited in high energy physics [13], signal processing [16], and even data mining [10]. The random noise added to the data can be viewed as a random matrix and therefore its properties can be understood by studying the properties of random matrices. In this paper we shall develop a spectral filter, designed on the basis of random matrix theory, for extracting the hidden data from the data perturbed by random noise.

For our approach, we are mainly concerned with the distribution of eigenvalues of the sample covariance matrix obtained from a random matrix. Let $X$ be a random $m \times n$ matrix whose entries $X_{ij}$, $i = 1, \ldots, m$, $j = 1, \ldots, n$, are i.i.d. random variables with zero mean and variance $\sigma^2$. The covariance matrix of $X$ is given by $Y = \frac{1}{m} X'X$. Clearly, $Y$ is an $n \times n$ matrix. Let $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ be the eigenvalues of $Y$. Let

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} U(x - \lambda_i)$$

be the empirical cumulative distribution function (c.d.f.) of the eigenvalues $\lambda_i$ $(1 \le i \le n)$, where $U(x)$ is the unit step function, equal to 1 for $x \ge 0$ and 0 for $x < 0$. In order to consider the asymptotic properties of the c.d.f. $F_n(x)$, we will consider the dimensions $m = m(N)$ and $n = n(N)$ of matrix $X$ to be functions of a variable $N$. We will consider asymptotics such that in the limit as $N \to \infty$, we have $m(N) \to \infty$, $n(N) \to \infty$, and $m(N)/n(N) \to Q$, where $Q \ge 1$. Under these assumptions, it can be shown that [8] the empirical c.d.f. $F_n(x)$ converges in probability to a continuous distribution function $F_Q(x)$ for every $x$, whose probability density function (p.d.f.) is given by

$$f_Q(x) = \begin{cases} \dfrac{Q \sqrt{(x - \lambda_{\min})(\lambda_{\max} - x)}}{2 \pi \sigma^2 x}, & \lambda_{\min} < x < \lambda_{\max} \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $\lambda_{\min}$ and $\lambda_{\max}$ are as follows:

$$\lambda_{\min} = \sigma^2 \left(1 - 1/\sqrt{Q}\right)^2, \qquad \lambda_{\max} = \sigma^2 \left(1 + 1/\sqrt{Q}\right)^2 \qquad (3)$$

Further refinements of this result and other discussions can be found in [16].

6. Separating the Data from the Noise

Consider an $m \times n$ data matrix $U$ and a noise matrix $V$ with the same dimensions. The random value perturbation technique generates a modified data matrix $U_p = U + V$. Our objective is to extract $U$ from $U_p$. Although the noise matrix $V$ may introduce a seemingly significant difference between $U$ and $U_p$, it may not be successful in hiding the data.

Consider the covariance matrix of $U_p$:

$$U_p^T U_p = (U + V)^T (U + V) = U^T U + V^T U + U^T V + V^T V \qquad (4)$$

Now note that when the signal random vector (rows of $U$) and noise random vector (rows of $V$) are uncorrelated, we have $E[U^T V] = E[V^T U] = 0$. The uncorrelated assumption is valid in practice since the noise $V$ that is added to the data $U$ is generated by a statistically independent process. Recall that the random value perturbation technique discussed in the previous section introduces uncorrelated noise to hide the signal or the data. If the number of observations is sufficiently large, we have that $U^T V \approx 0$ and $V^T U \approx 0$. Equation 4 can now be simplified as follows:

$$U_p^T U_p = U^T U + V^T V \qquad (5)$$

Since the correlation matrices $U^T U$, $U_p^T U_p$, and $V^T V$ are symmetric and positive semi-definite, let

$$U^T U = Q_u \Lambda_u Q_u^T, \qquad U_p^T U_p = Q_p \Lambda_p Q_p^T, \qquad V^T V = Q_v \Lambda_v Q_v^T \qquad (6)$$

where $Q_u$, $Q_p$, $Q_v$ are orthogonal matrices whose column vectors are eigenvectors of $U^T U$, $U_p^T U_p$, $V^T V$, respectively, and $\Lambda_u$, $\Lambda_p$, $\Lambda_v$ are diagonal matrices with the corresponding eigenvalues on their diagonals.

The following result from matrix perturbation theory [20] gives a relationship between $\Lambda_u$, $\Lambda_p$, and $\Lambda_v$.

Theorem 1 [20] Suppose $\lambda_1^{(a)} \ge \lambda_2^{(a)} \ge \cdots \ge \lambda_n^{(a)}$, $a \in \{u, p, v\}$, are the eigenvalues of $U^T U$, $U_p^T U_p$, and $V^T V$, respectively. Then, for $i = 1, \ldots, n$,

$$\lambda_i^{(p)} \in \left[\lambda_i^{(u)} + \lambda_n^{(v)},\ \lambda_i^{(u)} + \lambda_1^{(v)}\right]$$

This theorem provides a bound on the change in the eigenvalues of the data correlation matrix $U^T U$ in terms of the minimum and maximum eigenvalues of the noise correlation matrix $V^T V$. Now let us take a step further and explore the properties of the eigenvalues of the perturbed data matrix $U_p$ for large values of $m$.

Lemma 1 Let data matrix $U$ and noise matrix $V$ be of size $m \times n$ and $U_p = U + V$. Let $Q_u$, $Q_p$, $Q_v$ be orthogonal matrices and $\Lambda_u$, $\Lambda_p$, $\Lambda_v$ be diagonal matrices as defined in 6. If $m/n \to \infty$ then $\Lambda_p = R \Lambda_u R^T + \Lambda_v$, where $R = Q_p^T Q_u$.

Proof: Using Equations 5 and 6 we can write

$$Q_p \Lambda_p Q_p^T = Q_u \Lambda_u Q_u^T + Q_v \Lambda_v Q_v^T$$

$$\Lambda_p = Q_p^T Q_u \Lambda_u Q_u^T Q_p + Q_p^T Q_v \Lambda_v Q_v^T Q_p = R \Lambda_u R^T + Q_p^T Q_v \Lambda_v Q_v^T Q_p \qquad (7)$$

Let the minimum and maximum eigenvalues of $V$ be $\lambda_{\min}^{(v)}$ and $\lambda_{\max}^{(v)}$ respectively. It follows from equations 2 and 3 that as $m/n \to \infty$, all the eigenvalues in $\Lambda_v$ become identical, since $\lim_{m/n \to \infty} \lambda_{\max}^{(v)} = \lim_{m/n \to \infty} \lambda_{\min}^{(v)} = \sigma^2$ (say). This implies that, as $m/n \to \infty$, $\Lambda_v \to \sigma^2 I$, where $I$ is the $n \times n$ identity matrix. Therefore, if the number of observations $m$ is large enough (note that, in practice, the number of features $n$ is fixed), $V^T V = Q_v \Lambda_v Q_v^T = \sigma^2 Q_v Q_v^T = \sigma^2 I$. Therefore Equation 7 becomes

$$\Lambda_p = R \Lambda_u R^T + \sigma^2 Q_p^T Q_p = R \Lambda_u R^T + \Lambda_v \qquad (8)$$

This completes the proof.

If the norm of the perturbation matrix $V$ is small, the eigenvectors $Q_p$ of $U_p^T U_p$ would be close to the eigenvectors $Q_u$ of $U^T U$. Indeed, matrix perturbation theory provides precise bounds on the angle between eigenvectors (and invariant subspaces) of a matrix $U$ and that of its perturbation $U_p = U + V$, in terms of the norms of the perturbation matrix $V$. For example, let $(x_u, \lambda_u)$ be an eigenvector-eigenvalue pair for matrix $U^T U$ and let $\epsilon = \|V^T V\|_2 = \sigma_{\max}(V^T V)$ be the two-norm of the perturbation, where $\sigma_{\max}(V^T V)$ is the largest singular value of $V^T V$. Then there exists an eigenvalue-eigenvector pair $(x_p, \lambda_p)$ of $U_p^T U_p$ satisfying [20, 17]

$$\tan \angle(x_u, x_p) \le \frac{\epsilon}{\delta - \epsilon}$$

where $\delta$ is the distance between $\lambda_u$ and the closest eigenvalue of $U^T U$, provided $\epsilon < \delta$. This shows that the eigenvalues of $U^T U$ and $U_p^T U_p$ are in general close, for small perturbations. Moreover,

$$\left|\lambda_u - x_u^{*} U_p x_u\right| \le \frac{\epsilon}{\delta - \epsilon}$$

where $x^{*}$ is the conjugate-transpose of $x$. Consequently, the product $Q_p^T Q_u$, which is the matrix of inner products between the eigenvectors of $U^T U$ and $U_p^T U_p$, would be close to an identity matrix; i.e., $R = Q_p^T Q_u \approx I$. Thus equation 8 becomes

$$\Lambda_p \approx \Lambda_u + \Lambda_v \qquad (9)$$

Suppose the signal covariance matrix has only a few dominant eigenvalues, say $\lambda_1^{(u)} \ge \cdots \ge \lambda_k^{(u)}$, with $\lambda_i^{(u)} \le \epsilon$ for some small value $\epsilon$ and $i = k + 1, \ldots, n$. This condition is true for many real-world signals. Suppose $\lambda_k^{(u)} > \lambda_1^{(v)}$, the largest eigenvalue of the noise covariance matrix. It is then clear that we can separate the signal and noise eigenvalues $\Lambda_u$, $\Lambda_v$ from the eigenvalues $\Lambda_p$ of the observed data by a simple thresholding at $\lambda_1^{(v)}$.

Note that equation 9 is only an approximation. However, in practice, one can design a filter based on this approximation to filter out the perturbation from the data. Experimental results presented in the following sections indicate that this provides a good recovery of the data.

7. Random Matrix-Based Data Filtering

This section describes the proposed filter for extracting the original data from the noisy perturbed data. Suppose actual data $U$ is perturbed by a randomly generated noise matrix $V$ in order to produce $U_p = U + V$. Let $u_{p,i}$, $i = 1, 2, \ldots, m$, be $m$ (perturbed) data points, each being a vector of $n$ features.

[Figure 1. Estimation of original sinusoidal data with known random noise variance. Plot title: "Spectral Filtering: Plot of Estimated Data vs Actual Data with SNR = 1.5326"; x-axis: Number of Instances; y-axis: Value of Feature.]

When the noise distribution $F_V(v)$ of $V$ is completely known (as required by the random value perturbation technique [2]), the noise variance $\sigma^2$ is first calculated from the given distribution. Equations 2 and 3 are then used to calculate $\lambda_{\max}$ and $\lambda_{\min}$, which provide the theoretical bounds of the eigenvalues corresponding to noise matrix $V$. From the perturbed data, we compute the eigenvalues of its covariance matrix $Y$, say $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. Then we identify the noisy eigenstates $\lambda_i \le \cdots \le \lambda_j$ such that $\lambda_i \ge \lambda_{\min}$ and $\lambda_j \le \lambda_{\max}$. The remaining eigenstates are the eigenstates corresponding to actual data. Let $\Lambda_v = \mathrm{diag}(\lambda_i, \ldots, \lambda_j)$ be the diagonal matrix with all noise-related eigenvalues, and $A_v$ be the matrix whose columns are the corresponding eigenvectors. Similarly, let $\Lambda_u$ be the eigenvalue matrix for the actual data part and $A_u$ be the corresponding eigenvector matrix, which is an $n \times k$ matrix ($k \le n$). Based on these matrices, we decompose the covariance matrix $Y$ into two parts, $Y_s$ and $Y_r$, with $Y = Y_s + Y_r$, where $Y_r = A_v \Lambda_v A_v^T$ is the covariance matrix corresponding to the random noise part, and $Y_s = A_u \Lambda_u A_u^T$ is the covariance matrix corresponding to the actual data part. An estimate $\hat{U}$ of the actual data $U$ is obtained by projecting the data $U_p$ onto the subspace spanned by the columns of $A_u$. In other words, $\hat{U} = U_p A_u A_u^T$.

8. Experimental Results

In this section, we present results of our experiments with the proposed spectral filtering technique. This section also includes a discussion of the effect of noise variance on the performance of the spectral filtering method.

[Figure: Eigenvalue Distribution]
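The filtering procedure of Section 7 can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names `noise_eigen_bounds` and `spectral_filter`, the synthetic rank-2 data, and the choice to keep only eigenvalues above the upper noise bound (the simple thresholding variant described in Section 6) are assumptions made for this example. It also assumes zero-mean Gaussian noise with known variance and uses the uncentred sample covariance of the perturbed data.

```python
import numpy as np


def noise_eigen_bounds(sigma2, m, n):
    """Theoretical eigenvalue bounds of a pure-noise covariance matrix (Equation 3)."""
    q = m / n  # Q = m/n, assumed to be >= 1
    lam_min = sigma2 * (1.0 - 1.0 / np.sqrt(q)) ** 2
    lam_max = sigma2 * (1.0 + 1.0 / np.sqrt(q)) ** 2
    return lam_min, lam_max


def spectral_filter(u_p, sigma2):
    """Estimate the original data from the perturbed m x n matrix u_p."""
    m, n = u_p.shape
    cov = u_p.T @ u_p / m                    # sample covariance of the perturbed data
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric input, eigenvalues ascending
    _, lam_max = noise_eigen_bounds(sigma2, m, n)
    a_u = eigvecs[:, eigvals > lam_max]      # eigenvectors attributed to the signal
    return u_p @ a_u @ a_u.T                 # project onto the signal subspace


# Hypothetical demo data: a rank-2 signal in 20 features plus unit-variance noise.
rng = np.random.default_rng(0)
m, n, k, sigma = 5000, 20, 2, 1.0
basis, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal n x k basis
u = 3.0 * rng.standard_normal((m, k)) @ basis.T         # original data U
u_p = u + sigma * rng.standard_normal((m, n))           # perturbed data U_p = U + V

u_hat = spectral_filter(u_p, sigma ** 2)
mse_perturbed = np.mean((u_p - u) ** 2)   # error of the perturbed data
mse_filtered = np.mean((u_hat - u) ** 2)  # error of the filtered estimate
```

In a setup like this one, where the signal eigenvalues lie well above the noise bound, the filtered estimate would be expected to show a clearly lower mean squared error than the perturbed data, because the projection discards the noise energy outside the retained subspace.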