On the Privacy Preserving Properties of Random Data Perturbation Techniques

Hillol Kargupta and Souptik Datta
Computer Science and Electrical Engineering Department
University of Maryland Baltimore County
Baltimore, Maryland 21250, USA
hillol, souptik1@cs.umbc.edu

Qi Wang and Krishnamoorthy Sivakumar
School of Electrical Engineering and Computer Science
Washington State University
Pullman, Washington 99164-2752, USA
qwang, siva@eecs.wsu.edu

Abstract

Privacy is becoming an increasingly important issue in many data mining applications. This has triggered the development of many privacy-preserving data mining techniques. A large fraction of them use randomized data distortion techniques to mask the data for preserving the privacy of sensitive data. This methodology attempts to hide the sensitive data by randomly modifying the data values, often using additive noise. This paper questions the utility of the random value distortion technique in privacy preservation. The paper notes that random objects (particularly random matrices) have "predictable" structures in the spectral domain, and it develops a random matrix-based spectral filtering technique to retrieve original data from the dataset distorted by adding random values. The paper presents the theoretical foundation of this filtering method and extensive experimental results to demonstrate that in many cases random data distortion preserves very little data privacy.

1. Introduction

Many data mining applications deal with privacy-sensitive data. Financial transactions, health-care records, and network communication traffic are some examples. Data mining in such privacy-sensitive domains is facing growing concerns. Therefore, we need to develop data mining techniques that are sensitive to the privacy issue. This has fostered the development of a class of data mining algorithms [2, 9] that try to extract the data patterns without directly accessing the original data and guarantee that the mining process does not get sufficient information to reconstruct the original data.

This paper considers a class of techniques for privacy-preserving data mining by randomly perturbing the data while preserving the underlying probabilistic properties. It explores the random value perturbation-based approach [2], a well-known technique for masking the data using random noise. This approach tries to preserve data privacy by adding random noise, while making sure that the random noise still preserves the "signal" from the data so that the patterns can still be accurately estimated. This paper questions the privacy-preserving capability of the random value perturbation-based approach. It shows that in many cases, the original data (sometimes called "signal" in this paper) can be accurately estimated from the perturbed data using a spectral filter that exploits some theoretical properties of random matrices. It presents the theoretical foundation and provides experimental results to support this claim.

Section 2 offers an overview of the related literature on privacy preserving data mining. Section 3 presents the motivation behind the framework presented in this paper. Section 4 describes the random data perturbation method proposed in [2]. Section 5 presents a discussion on the eigenvalues of random matrices. Section 6 presents the intuition behind the theory to separate out the random component from a mixture of non-random and random components. Section 7 describes the proposed random matrix-based filtering technique. Section 8 applies the proposed technique and reports its performance for various datasets. Finally, Section 9 concludes this paper.

2. Related Work

There exists a growing body of literature on privacy-sensitive data mining. These algorithms can be divided into several different groups. One approach adopts a distributed framework. This approach supports computation of data mining models and extraction of "patterns" at a given node by exchanging only the minimal necessary information among the participating nodes without transmitting the raw data. Privacy preserving association rule mining from homogeneous [9] and heterogeneous [19] distributed datasets are a few examples. The second approach is based on
data-swapping, which works by swapping data values within the same feature [3].

There is also an approach which works by adding random noise to the data in such a way that the individual data values are distorted while the underlying distribution properties are preserved at a macroscopic level. The algorithms belonging to this group work by first perturbing the data using randomized techniques. The perturbed data is then used to extract the patterns and models. The randomized value distortion technique for learning decision trees [2] and association rule learning [6] are examples of this approach. Additional work on randomized masking of data can be found elsewhere [18].

This paper explores the third approach [2]. It points out that in many cases the noise can be separated from the perturbed data by studying the spectral properties of the data, and as a result its privacy can be seriously compromised. Agrawal and Aggarwal [1] have also considered the approach in [2] and have provided an expectation-maximization (EM) algorithm for reconstructing the distribution of the original data from perturbed observations. They also provide information theoretic measures (mutual information) to quantify the amount of privacy provided by a randomization approach. Agrawal and Aggarwal [1] remark that the method suggested in [2] does not take into account the distribution of the original data (which could be used to guess the data value to a higher level of accuracy). However, [1] provides no explicit procedure to reconstruct the original data values. Evfimievski et al. [5, 4] and Rizvi [15] have also considered the approach in [2] in the context of association rule mining and suggest techniques for limiting privacy breaches. Our primary contribution is to provide an explicit filtering procedure, based on random matrix theory, that can be used to estimate the original data values.

3. Motivation

As noted in the previous section, a growing body of privacy preserving data mining techniques is adopting randomization as a primary tool to "hide" information. While randomization is an important tool, it must be used very carefully in a privacy-preserving application.

Randomness may not necessarily imply uncertainty. Random events can often be analyzed and their properties can be explained using probabilistic frameworks. Statistics, randomized computation, and many other related fields are full of theorems, laws, and algorithms that rely on probabilistic characterization of random processes and that often work quite accurately. The signal processing literature [12] offers many filters to remove white noise from data, and they often work reasonably well. Randomly generated structures like graphs demonstrate interesting properties [7]. In short, randomness does seem to have "structure", and this structure may be used to compromise privacy unless we pay careful attention. The rest of this paper illustrates this challenge in the context of a well-known privacy preserving technique that works using random additive noise.

4. Random Value Perturbation Technique: A Brief Review

For the sake of completeness, we now briefly review the random data perturbation method suggested in [2] for hiding the data (i.e. guaranteeing protection against the reconstruction of the data) while still being able to estimate the underlying distribution.

4.1. Perturbing the Data

The random value perturbation method attempts to preserve privacy of the data by modifying values of the sensitive attributes using a randomized process [2]. The authors explore two possible approaches, Value-Class Membership and Value Distortion, and emphasize the Value Distortion approach. In this approach, the owner of a dataset returns a value $u_i + v$, where $u_i$ is the original data and $v$ is a random value drawn from a certain distribution. The most commonly used distributions are the uniform distribution over an interval $[-\alpha, \alpha]$ and the Gaussian distribution with mean $\mu = 0$ and standard deviation $\sigma$. The $n$ original data values $u_1, u_2, \ldots, u_n$ are viewed as realizations of $n$ independent and identically distributed (i.i.d.) random variables $U_i$, $i = 1, 2, \ldots, n$, each with the same distribution as that of a random variable $U$. In order to perturb the data, $n$ independent samples $v_1, v_2, \ldots, v_n$ are drawn from a distribution $V$. The owner of the data provides the perturbed values $u_1 + v_1, u_2 + v_2, \ldots, u_n + v_n$ and the cumulative distribution function $F_V(v)$ of $V$. The reconstruction problem is to estimate the distribution $F_U(u)$ of the original data from the perturbed data.

4.2. Estimation of Distribution Function from the Perturbed Dataset

The authors [2] suggest the following method to estimate the distribution $F_U(u)$ of $U$, given $n$ independent samples $w_i = u_i + v_i$, $i = 1, 2, \ldots, n$, and $F_V(v)$. Using Bayes' rule, the posterior distribution function $F'_U(u)$ of $U$, given that $U + V = w$, can be written as

$$F'_U(u) = \frac{\int_{-\infty}^{u} f_V(w - z)\, f_U(z)\, dz}{\int_{-\infty}^{\infty} f_V(w - z)\, f_U(z)\, dz},$$

which upon differentiation with respect to $u$ yields the density function

$$f'_U(u) = \frac{f_V(w - u)\, f_U(u)}{\int_{-\infty}^{\infty} f_V(w - z)\, f_U(z)\, dz},$$

where $f_U(\cdot)$ and $f_V(\cdot)$ denote the probability density functions of $U$ and $V$ respectively. If we have $n$ independent samples $u_i + v_i = w_i$, $i = 1, 2, \ldots, n$, the corresponding posterior distribution can be obtained by averaging:

$$f'_U(u) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_V(w_i - u)\, f_U(u)}{\int_{-\infty}^{\infty} f_V(w_i - z)\, f_U(z)\, dz} \tag{1}$$

For a sufficiently large number of samples $n$, we expect the above density function to be close to the real density function $f_U(u)$. In practice, since the true density $f_U(u)$ is unknown, we need to modify the right-hand side of equation 1. The authors suggest an iterative procedure where at each step $j = 1, 2, \ldots$, the posterior density $f_U^{j-1}(u)$ estimated at step $j - 1$ is used in the right-hand side of equation 1. The uniform density is used to initialize the iterations. The iterations are carried out until the difference between successive estimates becomes small. In order to speed up computations, the authors also discuss approximations to the above procedure using partitioning of the domain of data values.

5. Randomness and Patterns

The random perturbation technique "apparently" distorts the sensitive attribute values and still allows estimation of the underlying distribution information. However, does this apparent distortion fundamentally prohibit us from extracting the hidden information? This section presents a discussion on the properties of random matrices and presents some results that will be used later in this paper.

Random matrices [13] exhibit many interesting properties that are often exploited in high energy physics [13], signal processing [16], and even data mining [10]. The random noise added to the data can be viewed as a random matrix, and therefore its properties can be understood by studying the properties of random matrices. In this paper we shall develop a spectral filter, designed using random matrix theory, for extracting the hidden data from the data perturbed by random noise.

For our approach, we are mainly concerned with the distribution of eigenvalues of the sample covariance matrix obtained from a random matrix. Let $V$ be a random $m \times n$ matrix whose entries $v_{ij}$, $i = 1, \ldots, m$, $j = 1, \ldots, n$, are i.i.d. random variables with zero mean and variance $\sigma^2$. The covariance matrix of $V$ is given by $Y = \frac{1}{m} V^T V$. Clearly, $Y$ is an $n \times n$ matrix. Let $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ be the eigenvalues of $Y$. Let

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} U(x - \lambda_i)$$

be the empirical cumulative distribution function (c.d.f.) of the eigenvalues $\lambda_i$ $(1 \le i \le n)$, where $U(x)$ is the unit step function, equal to 1 for $x \ge 0$ and 0 otherwise. In order to consider the asymptotic properties of the c.d.f. $F_n(x)$, we will consider the dimensions $m = m(N)$ and $n = n(N)$ of matrix $V$ to be functions of a variable $N$. We will consider asymptotics such that in the limit as $N \to \infty$
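, we have $m(N) \to \infty$, $n(N) \to \infty$, and $m(N)/n(N) \to Q$, where $Q \ge 1$. Under these assumptions, it can be shown [8] that the empirical c.d.f. $F_n(x)$ converges in probability to a continuous distribution function $F_Q(x)$ for every $x$, whose probability density function (p.d.f.) is given by

$$f_Q(x) = \begin{cases} \dfrac{Q \sqrt{(x - \lambda_{\min})(\lambda_{\max} - x)}}{2 \pi \sigma^2 x} & \lambda_{\min} \le x \le \lambda_{\max} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

where $\lambda_{\min}$ and $\lambda_{\max}$ are as follows:

$$\lambda_{\min} = \sigma^2 \left(1 - 1/\sqrt{Q}\right)^2, \qquad \lambda_{\max} = \sigma^2 \left(1 + 1/\sqrt{Q}\right)^2 \tag{3}$$

Further refinements of this result and other discussions can be found in [16].

6. Separating the Data from the Noise

Consider an $m \times n$ data matrix $U$ and a noise matrix $V$ with the same dimensions. The random value perturbation technique generates a modified data matrix $U_p = U + V$. Our objective is to extract $U$ from $U_p$. Although the noise matrix $V$ may introduce a seemingly significant difference between $U$ and $U_p$, it may not be successful in hiding the data.

Consider the covariance matrix of $U_p$:

$$U_p^T U_p = (U + V)^T (U + V) = U^T U + V^T U + U^T V + V^T V \tag{4}$$

Now note that when the signal random vector (rows of $U$) and the noise random vector (rows of $V$) are uncorrelated, we have $E[U^T V] = E[V^T U] = 0$. The uncorrelated assumption is valid in practice since the noise $V$ that is added to the data $U$ is generated by a statistically independent process. Recall that the random value perturbation technique discussed in the previous section introduces uncorrelated noise to hide the signal or the data. If the number of observations is sufficiently large, we have that $U^T V \approx 0$ and $V^T U \approx 0$. Equation 4 can now be simplified as follows:

$$U_p^T U_p = U^T U + V^T V \tag{5}$$

Since the correlation matrices $U^T U$, $U_p^T U_p$, and $V^T V$ are symmetric and positive semi-definite, let

$$U^T U = Q_u \Lambda_u Q_u^T, \qquad U_p^T U_p = Q_p \Lambda_p Q_p^T, \qquad V^T V = Q_v \Lambda_v Q_v^T \tag{6}$$

where $Q_u$, $Q_p$, $Q_v$ are orthogonal matrices whose column vectors are eigenvectors of $U^T U$, $U_p^T U_p$, $V^T V$, respectively, and $\Lambda_u$, $\Lambda_p$, $\Lambda_v$ are diagonal matrices with the corresponding eigenvalues on their diagonals.

The following result from matrix perturbation theory [20] gives a relationship between $\Lambda_u$, $\Lambda_p$, and $\Lambda_v$.

Theorem 1 [20] Suppose $\lambda_u^{(i)}$, $\lambda_p^{(i)}$, $\lambda_v^{(i)}$, $i = 1, \ldots, n$, are the eigenvalues (sorted in decreasing order) of $U^T U$, $U_p^T U_p$, and $V^T V$, respectively. Then, for $i = 1, \ldots, n$,

$$\lambda_p^{(i)} \in \left[\lambda_u^{(i)} + \lambda_v^{(n)},\; \lambda_u^{(i)} + \lambda_v^{(1)}\right].$$

This theorem bounds the change in the eigenvalues of the data correlation matrix $U^T U$ in terms of the minimum and maximum eigenvalues of the noise correlation matrix $V^T V$. Now let us take a step further and explore the properties of the eigenvalues of the perturbed data matrix $U_p$ for large values of $m$.

Lemma 1 Let data matrix $U$ and noise matrix $V$ be of size $m \times n$ and $U_p = U + V$. Let $Q_u$, $Q_p$, $Q_v$ be orthogonal matrices and $\Lambda_u$, $\Lambda_p$, $\Lambda_v$ be diagonal matrices as defined in equation 6. If $m/n \to \infty$ then $\Lambda_p = A \Lambda_u A^T + \Lambda_v$, where $A = Q_p^T Q_u$.

Proof: Using Equations 5 and 6 we can write

$$Q_p \Lambda_p Q_p^T = Q_u \Lambda_u Q_u^T + Q_v \Lambda_v Q_v^T,$$

$$\Lambda_p = Q_p^T Q_u \Lambda_u Q_u^T Q_p + Q_p^T Q_v \Lambda_v Q_v^T Q_p = A \Lambda_u A^T + Q_p^T Q_v \Lambda_v Q_v^T Q_p \tag{7}$$

Let the minimum and maximum eigenvalues of $V$ be $\lambda_{\min}^{(v)}$ and $\lambda_{\max}^{(v)}$ respectively. It follows from equation 2 that as $m/n \to \infty$ all the eigenvalues in $\Lambda_v$ become identical, since $\lim_{Q \to \infty} \lambda_{\max}^{(v)} = \lim_{Q \to \infty} \lambda_{\min}^{(v)}$; call this common value $\lambda_v$ (by equation 3 it equals $\sigma^2$ for the normalized covariance matrix). This implies that, as $m/n \to \infty$, $\Lambda_v \to \lambda_v I$, where $I$ is the identity matrix. Therefore, if the number of observations $m$ is large enough (note that, in practice, the number of features $n$ is fixed), $V^T V = Q_v \Lambda_v Q_v^T = \lambda_v Q_v Q_v^T = \lambda_v I$. Therefore Equation 7 becomes

$$\Lambda_p = A \Lambda_u A^T + \lambda_v Q_p^T Q_p = A \Lambda_u A^T + \Lambda_v \tag{8}$$

This completes the proof.

If the norm of the perturbation matrix $V$ is small, the eigenvectors $Q_p$ of $U_p^T U_p$ would be close to the eigenvectors $Q_u$ of $U^T U$. Indeed, matrix perturbation theory provides precise bounds on the angle between eigenvectors (and invariant subspaces) of a matrix $U$ and that of its perturbation $U_p = U + V$, in terms of the norms of the perturbation matrix $V$. For example, let $(x_u, \lambda_u)$ be an eigenvector-eigenvalue pair for matrix $U^T U$ and let $\varepsilon = \|V^T V\|_2 = \sigma_{\max}(V^T V)$ be the two-norm of the perturbation, where $\sigma_{\max}(V^T V)$ is the largest singular value of $V^T V$. Then there exists an eigenvalue-eigenvector pair $(x_p, \lambda_p)$ of $U_p^T U_p$ satisfying [20, 17]

$$\tan \angle (x_u, x_p) \le \frac{2 \varepsilon}{\delta - \varepsilon}$$

where $\delta$ is the distance between $\lambda_u$ and the closest eigenvalue of $U^T U$, provided $\varepsilon < \delta$. This shows that the eigenvectors of $U^T U$ and $U_p^T U_p$ are in general close for small perturbations. Moreover,

$$\left| \lambda_u - x_u^{*} U_p^T U_p x_u \right| \le \frac{2 \varepsilon}{\delta - \varepsilon}$$

where $x^{*}$ is the conjugate-transpose of $x$. Consequently, the product $A = Q_p^T Q_u$, which is the matrix of inner products between the eigenvectors of $U^T U$ and $U_p^T U_p$, would be close to an identity matrix; i.e., $Q_p^T Q_u \approx I$. Thus equation 8 becomes

$$\Lambda_p \approx \Lambda_u + \Lambda_v \tag{9}$$

Suppose the signal covariance matrix has only a few dominant eigenvalues, say $\lambda_u^{(1)} \ge \cdots \ge \lambda_u^{(k)}$, with $\lambda_u^{(i)} \le \varepsilon$ for some small value $\varepsilon$ and $i = k + 1, \ldots, n$. This condition is true for many real-world signals. Suppose $\lambda_u^{(k)} > \lambda_v^{(1)}$, the largest eigenvalue of the noise covariance matrix. It is then clear that we can separate the signal and noise eigenvalues $\Lambda_u$, $\Lambda_v$ from the eigenvalues $\Lambda_p$ of the observed data by a simple thresholding at $\lambda_v^{(1)}$.

Note that equation 9 is only an approximation. However, in practice, one can design a filter based on this approximation to filter out the perturbation from the data. Experimental results presented in the following sections indicate that this provides a good recovery of the data.

7. Random Matrix-Based Data Filtering

This section describes the proposed filter for extracting the original data from the noisy perturbed data. Suppose actual data $U$ is perturbed by a randomly generated noise matrix $V$ in order to produce $U_p = U + V$. Let $u_{p,i}$, $i = 1, 2, \ldots, m$, be $m$ (perturbed) data points, each being a vector of $n$ features.

Figure 1. Estimation of original sinusoidal data with known random noise variance. (Plot title: "Spectral Filtering: Plot of Estimated Data vs Actual Data with SNR = 1.5326"; x-axis: number of instances; y-axis: value of feature.)

When the noise distribution $F_V(v)$ of $V$ is completely known (as required by the random value perturbation technique [2]), the noise variance $\sigma^2$ is first calculated from the given distribution. Equation 2 is then used to calculate $\lambda_{\max}$ and $\lambda_{\min}$, which provide the theoretical bounds of the eigenvalues corresponding to noise matrix $V$. From the perturbed data, we compute the eigenvalues of its covariance matrix $Y$, say $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. Then we identify the noisy eigenstates $\lambda_i \le \lambda_{i+1} \le \cdots \le \lambda_j$ such that $\lambda_i \ge \lambda_{\min}$ and $\lambda_j \le \lambda_{\max}$. The remaining eigenstates are the eigenstates corresponding to actual data. Let $\Lambda_v = \mathrm{diag}(\lambda_i, \lambda_{i+1}, \ldots, \lambda_j)$ be the diagonal matrix with all noise-related eigenvalues, and $A_v$ be the matrix whose columns are the corresponding eigenvectors. Similarly, let $\Lambda_u$ be the eigenvalue matrix for the actual data part and $A_u$ be the corresponding eigenvector matrix, which is an $n \times k$ matrix ($k \le n$). Based on these matrices, we decompose the covariance matrix $Y$ into two parts, $Y_s$ and $Y_r$, with $Y = Y_s + Y_r$, where $Y_r = A_v \Lambda_v A_v^T$ is the covariance matrix corresponding to the random noise part, and $Y_s = A_u \Lambda_u A_u^T$ is the covariance matrix corresponding to the actual data part. An estimate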
$\hat{U}$ of the actual data $U$ is obtained by projecting the data $U_p$ onto the subspace spanned by the columns of $A_u$. In other words, $\hat{U} = U_p A_u A_u^T$.

8. Experimental Results

In this section, we present results of our experiments with the proposed spectral filtering technique. This section also includes discussion on the effect of noise variance on the performance of the spectral filtering method.

Figure 2. Distribution of eigenvalues of actual data, and estimated eigenvalues of random noise and actual data. (Plot title: "Eigenvalue Distribution"; x-axis: number of features; y-axis: eigenvalues, log scale.)

Figure 3. Spectral filtering used to estimate real world audio data. Waveform of an audio signal is closely estimated from its perturbed version. (Top panel: "Plot of a Fraction of Dataset, Estimated vs Actual Signal (Mean SNR = 1.3)"; bottom panel: "Estimation Error for the Dataset Shown"; x-axis: number of instances.)

8.1. Estimation with Known Perturbing Distribution

We tested our privacy breaching technique using several datasets of different sizes. We considered both artificially generated and real datasets. Towards that end, we generated a dataset with 35 features and 300 instances. Each feature has a specific trend, such as a sinusoidal, square, or triangular shape; however, there is no dependency between any two features. The actual dataset is perturbed by adding Gaussian noise (with zero mean and known variance), and our proposed technique is applied to recover the actual data from the perturbed data. Figure 1 shows the result of our spectral filtering for one such feature where the actual data has a sinusoidal trend. The filtering technique appears to provide an accurate estimate of the individual values of the actual data. Figure 2 shows the distribution of eigenvalues of the actual and perturbed data. It also identifies the estimated noise eigenvalues and the theoretical bounds $\lambda_{\max}$ and $\lambda_{\min}$. As we see, the filtering method accurately distinguishes between noisy eigenvalues and eigenvalues corresponding to actual data. Note that the estimated eigenvalues of actual data are very close to the eigenvalues of actual data and almost overlap with them above $\lambda_{\max}$. The eigenvalues of actual data below $\lambda_{\min}$ are practically negligible. Thus, the estimated eigenvalues of the actual data capture most of the information and discard the additive noise.

Figure 4. Plot of the individual values of a fraction of the dataset with 'Triangular' distribution. Spectral filtering gives close estimation of individual values. (Plot title: "Plot of a Fraction of Dataset, Estimated vs Actual Signal"; x-axis: number of instances; y-axis: values.)

The random matrix-based filtering technique can also be extended to datasets with a single feature, i.e. when the dataset is a single column vector. The data vector is perturbed with a noise vector of the same dimension. The perturbed data vector is then split into a fixed number of vectors of equal length, and all of these vectors are appended to form a matrix. The spectral filtering technique is then applied to this matrix to estimate the original data. After the data matrix is estimated, its columns are concatenated to form a single vector.

We used a real world single-feature dataset to verify the performance of the spectral filtering. The dataset used is the scaled amplitude of the waveform of an audio tune recorded using a fixed sampling frequency. The tune recorded is fairly noise free with […] sample points. We perturbed this data with additive Gaussian noise. We define the term Signal-to-Noise Ratio (SNR) to quantify the relative amount of noise added to actual data to perturb it:

$$\mathrm{SNR} = \frac{\text{Variance of Actual Data}}{\text{Noise Variance}} \tag{10}$$

Figure 5. Reconstruction of the 'Triangular' distribution. Perturbed data distribution does not look like a triangular distribution, but reconstructed distribution using spectral filtering resembles the original distribution closely. (x-axis: attribute value; y-axis: number of records.)

In this experiment, the noise variance was chosen to yield a signal-to-noise ratio of 1.3. We split this vector of perturbed data into […] columns, each containing […] points, and applied the spectral filtering technique to recover the actual data. The result is shown in Figure 3. For the sake of clarity, only a fraction of the dataset is shown, and the estimation error is plotted for that fraction. As shown in Figure 3, the perturbed data is very different from the actual data, whereas the estimated data is a close approximation of the actual data. The estimation performance is similar to that for multi-featured data (see Figure 1).

8.2. Comparison With Results in [2]

The proposed spectral filtering technique can estimate values of individual data-points from the perturbed dataset. This point-wise estimation can then be used to reconstruct the distribution of actual data as well. The methods suggested by [2, 1] can only reconstruct the distribution of the original data from the data perturbed by random value distortion; they do not consider estimation of the individual values of the data-points. The spectral filtering technique, on the other hand, is explicitly designed to reconstruct the individual data-points and hence also the distribution of the actual dataset.

We tried to replicate the experiment reported in [2] using our method to recover the triangular distribution. We used a vector of […] values having a triangular distribution as shown in Figure 2 in [2]. The individual values of actual
data are within 0 and 1 and are independent of each other. We added Gaussian random noise with mean […] and standard deviation […] to this data and split the data vector into […] columns, each having […] values. We then applied our spectral filter to recover the actual data from the perturbed data. Figure 4 shows a portion of the actual data, their values after distortion, and their estimated values. Note that the estimated values are very close to the actual values, compared to the perturbed values. Using the estimate of individual data-points, we reconstruct the distribution of the actual data. Figure 5 shows estimation of the distribution from the estimated value of individual data-points. The distribution of the perturbed data is very different from the actual triangular distribution, but the estimated distribution looks very similar to the original distribution. This shows that our method recovers the original distribution along with individual data-points, similar to the result reported in [2]. The estimation accuracy is greater than […] for all data points. Since spectral filtering can filter out the individual values of actual data and its distribution from a perturbed representation, it breaches the privacy preserving protection of the randomized data perturbation technique [2].

8.3. Effect of Perturbation Variance and the Inherent Random Component of the Actual Data

Quality of the data recovery depends upon the relative noise content in the perturbed data. We use the SNR (see equation 10) to quantify the relative amount of noise added to actual data to perturb it. As the noise added to the actual value increases, the SNR decreases. Our experiments show that the proposed filtering method predicts the actual data reasonably well up to an SNR value of 1.0 (i.e. 100% noise). The results shown in Figure 1 correspond to an SNR value of nearly 2, i.e. noise content is about 50%. Figure 4 shows a data-block where the SNR is […]. As the SNR goes below 1, the estimation becomes too erroneous. Figure 6 shows the difference in estimation accuracy as the SNR increases from 1. The dataset used here has a sinusoidal trend in its values. The top graph corresponds to […]% noise (SNR = 4.3), whereas the bottom graph corresponds to 100% noise (SNR = 1.0).

Another important factor that affects the quality of recovery of the actual data is the inherent noise in the actual dataset (apart from the perturbation noise added intentionally). If the actual dataset has a random component in it, and random noise is added to perturb it, the spectral filtering method does not filter the actual data accurately. Our experiments with some inherently noisy real-life datasets show that the eigenvalues of signal and noise no longer remain clearly separable, since their eigenvalues may no longer be distributed over two non-overlapping regimes.

Figure 6. A higher noise content (low SNR) leads to less accurate estimation. SNR in the upper figure is 1, while that for the lower figure is 4.3. (Panel titles: "Variation of Estimation Accuracy with SNR" and "Plot of Sinusoidal Feature, Estimated vs Actual Signal with SNRs = 1.0033, 4.2423"; x-axis: number of instances; y-axis: value of feature.)

We have performed experiments with an artificial dataset with a specific trend in its values as well as a real world dataset containing a random component. Figure 1 in fact shows that our method gives a close estimation of actual data when the dataset has some specific trend (sinusoid). We also applied our method to the "Ionosphere data" available from [14], which is inherently noisy. We perturbed the original data with random noise such that the mean SNR is the same as for the artificial dataset, i.e. 1.1. Figure 7 shows that recovery quality is poor compared to datasets having a definite trend.

However, this opens up a different question: is the random component of the original dataset really important as far as data mining is concerned? One may argue that most data mining techniques exploit only the non-random structured patterns of the data. Therefore, losing the inherent random component of the original data may not be important in a privacy preserving data mining application.

9. Conclusion and Future Work

Preserving privacy in data mining activities is a very important issue in many applications. Randomization-based techniques are likely to play an important role in this domain. However, this paper illustrates some of the challenges that these techniques face in preserving data privacy. It showed that under certain conditions it is relatively easy to breach the privacy protection offered by the random perturbation based techniques. It provided extensive experimental results with different types of data and showed that this is really a concern that we must address. In addition to raising this concern, the paper offers a random-matrix based data filtering technique that may find wider application in developing a new perspective toward better privacy-preserving data mining algorithms.

Figure 7. Spectral filtering performs poorly on a dataset with a random component in its actual value. However, it is not clear if losing the random component of the data is a concern for data mining applications. (Plot title: "Plot of One Feature, Estimated vs Actual Signal with SNRs = 1.11"; y-axis: value of data.)

Since the problem mainly originates from the usage of additive, independent "white" noise for privacy preservation, we should explore "colored" noise for this application. We have already started exploring multiplicative noise matrices in this context. If $U$ is the data matrix and $V$ is an appropriately sized random noise matrix, then we are interested in the properties of the perturbed data $U_p = UV$ for privacy-preserving data mining applications. If $V$ is a square matrix then we may be able to extract the signal using techniques like independent component analysis. However, projection matrices that satisfy certain conditions may be more appealing for such applications. More details about this possibility can be found elsewhere [11].

Acknowledgments

The authors acknowledge support from the United States National Science Foundation CAREER award IIS-0093353, NASA (NRA) NAS2-37143, and TEDCO, Maryland Technology Development Center.

References

[1] D. Agrawal and C. C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the 20th ACM SIGMOD Symposium on Principles of Database Systems, pages 247–255, Santa Barbara, May 2001.
[2] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 439–450, Dallas, Texas, May 2000. ACM Press.
[3] V. Estivill-Castro and L. Brankovic. Data swapping: Balancing privacy against precision in mining for logic rules. In Proceedings of the First Conference on Data Warehousing and Knowledge Discovery (DaWaK-99), pages 389–398, Florence, Italy, 1999. Springer Verlag.
[4] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the ACM SIGMOD/PODS Conference, San Diego, CA, June 2003.
[5] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In Proceedings of the ACM SIGKDD Conference, Edmonton, Canada, 2002.
[6] S. Evfimievski. Randomization techniques for privacy preserving association rule mining. In SIGKDD Explorations, volume 4(2), Dec 2002.
[7] S. Janson, T. Łuczak, and A. Rucinski. Random Graphs. Wiley Publishers, 1st edition, 2000.
[8] D. Jonsson. Some limit theorems for the eigenvalues of a sample covariance matrix. Journal of Multivariate Analysis, 12:1–38, 1982.
[9] M. Kantarcioglu and C. Clifton. Privacy-preserving distributed mining of association rules on horizontally partitioned data. In SIGMOD Workshop on DMKD, Madison, WI, June 2002.
[10] H. Kargupta, K. Sivakumar, and S. Ghosh. Dependency detection in MobiMine and random matrices. In Proceedings of the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 250–262. Springer, 2002.
[11] K. Liu, H. Kargupta, and J. Ryan. Random projection and privacy preserving correlation computation from distributed data. Technical report, University of Maryland Baltimore County, Computer Science and Electrical Engineering Department, Technical Report TR-CS-03-24, 2003.
[12] D. G. Manolakis, V. K. Ingle, and S. M. Kogon. Statistical and Adaptive Signal Processing. McGraw Hill, 2000.
[13] M. L. Mehta. Random Matrices. Academic Press, London, 2nd edition, 1991.
[14] U. M. L. Repository. http://www.ics.uci.edu/mlearn/mlsummary.html.
[15] S. J. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.
[16] J. W. Silverstein and P. L. Combettes. Signal detection via spectral theory of large dimensional random matrices. IEEE Transactions on Signal Processing, 40(8):2100–2105, 1992.
[17] G. W. Stewart. Error and perturbation bounds for subspaces associated with certain eigenvalue problems. SIAM Review, 15(4):727–764, October 1973.
[18] J. F. Traub, Y. Yemini, and H. Woźniakowski. The statistical security of a statistical database. ACM Transactions on Database Systems (TODS), 9(4):672–679, 1984.
[19] J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, CA, July 2002.
[20] H. Weyl. Inequalities between the two kinds of eigenvalues of a linear transformation. In Proceedings of the National Academy of Sciences, volume 35, pages 408–411, 1949.
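
The filtering procedure of Section 7 can be sketched in a few lines of NumPy. What follows is a minimal illustration under stated assumptions, not the authors' implementation. The function names (`noise_eigenvalue_bounds`, `spectral_filter`, `demo`), the mean-centring step, and the synthetic rank-4 sinusoidal data are all choices made here for illustration. The sketch also keeps only the eigenvectors whose eigenvalues exceed the upper noise bound, which is the simple thresholding described in Section 6 rather than the two-sided band test of Section 7.

```python
import numpy as np


def noise_eigenvalue_bounds(sigma2, m, n):
    """Theoretical bounds (equation 3) on the eigenvalues of (1/m) V^T V
    for an m x n matrix V of i.i.d. zero-mean entries with variance sigma2."""
    q = m / n  # ratio of observations to features, assumed >= 1
    lam_min = sigma2 * (1.0 - 1.0 / np.sqrt(q)) ** 2
    lam_max = sigma2 * (1.0 + 1.0 / np.sqrt(q)) ** 2
    return lam_min, lam_max


def spectral_filter(U_p, sigma2):
    """Estimate the original data from perturbed data U_p (m x n),
    given the known noise variance sigma2."""
    m, n = U_p.shape
    mean = U_p.mean(axis=0)            # assumption: centre the columns first
    centred = U_p - mean
    Y = centred.T @ centred / m        # n x n sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Y)   # eigenvalues in ascending order
    _, lam_max = noise_eigenvalue_bounds(sigma2, m, n)
    # Eigenvectors above the noise bound are attributed to the signal.
    A_u = eigvecs[:, eigvals > lam_max]
    # Project onto the signal subspace: U_hat = U_p A_u A_u^T.
    return centred @ A_u @ A_u.T + mean


def demo(seed=0):
    """Perturb hypothetical low-rank sinusoidal data and filter it.
    Returns (mean squared error before filtering, after filtering)."""
    rng = np.random.default_rng(seed)
    m, n, sigma2 = 300, 35, 1.0
    t = np.arange(m) / m
    basis = np.column_stack([
        np.sin(2 * np.pi * 3 * t), np.cos(2 * np.pi * 3 * t),
        np.sin(2 * np.pi * 7 * t), np.cos(2 * np.pi * 7 * t),
    ])
    U = basis @ rng.normal(size=(4, n))                  # actual data, rank 4
    V = rng.normal(scale=np.sqrt(sigma2), size=(m, n))   # additive noise
    U_p = U + V
    U_hat = spectral_filter(U_p, sigma2)
    return np.mean((U_p - U) ** 2), np.mean((U_hat - U) ** 2)
```

Calling `demo()` returns the mean squared error of the perturbed data and of the filtered estimate, both measured against the actual data. On low-rank data of this kind the second value should be noticeably smaller, because the noise retained after projection is roughly the fraction of dimensions kept. Data with an inherent random component, as discussed in Section 8.3, would not separate this cleanly.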