On the Privacy Preserving Properties of Random Data Perturbation Techniques

Hillol Kargupta and Souptik Datta
Computer Science and Electrical Engineering Department
University of Maryland Baltimore County
Baltimore, Maryland 21250, USA
{hillol, souptik1}@cs.umbc.edu

Qi Wang and Krishnamoorthy Sivakumar
School of Electrical Engineering and Computer Science
Washington State University
Pullman, Washington 99164-2752, USA
{qwang, siva}@eecs.wsu.edu

Abstract

Privacy is becoming an increasingly important issue in many data mining applications. This has triggered the development of many privacy-preserving data mining techniques. A large fraction of them use randomized data distortion techniques to mask the data for preserving the privacy of sensitive data. This methodology attempts to hide the sensitive data by randomly modifying the data values, often using additive noise. This paper questions the utility of the random value distortion technique in privacy preservation. The paper notes that random objects (particularly random matrices) have "predictable" structures in the spectral domain, and it develops a random matrix-based spectral filtering technique to retrieve original data from the dataset distorted by adding random values. The paper presents the theoretical foundation of this filtering method and extensive experimental results to demonstrate that in many cases random data distortion preserves very little data privacy.

1. Introduction

Many data mining applications deal with privacy-sensitive data. Financial transactions, health-care records, and network communication traffic are some examples. Data mining in such privacy-sensitive domains is facing growing concerns. Therefore, we need to develop data mining
techniques that are sensitive to the privacy issue. This has fostered the development of a class of data mining algorithms [2, 9] that try to extract the data patterns without directly accessing the original data and guarantee that the mining process does not get sufficient information to reconstruct the original data.

This paper considers a class of techniques for privacy-preserving data mining by randomly perturbing the data while preserving the underlying probabilistic properties. It explores the random value perturbation-based approach [2], a well-known technique for masking the data using random noise. This approach tries to preserve data privacy by adding random noise, while making sure that the random noise still preserves the "signal" from the data so that the patterns can still be accurately estimated. This paper questions the privacy-preserving capability of the random value perturbation-based approach. It shows that in many cases, the original data (sometimes called "signal" in this paper) can be accurately estimated from the perturbed data using a spectral filter that exploits some theoretical properties of random matrices. It presents the theoretical foundation and provides experimental results to support this claim.

Section 2 offers an overview of the related literature on privacy preserving data mining. Section 3 presents the motivation behind the framework presented in this paper. Section 4 describes the random data perturbation method proposed in [2]. Section 5 presents a discussion on the eigenvalues of random matrices. Section 6 presents the intuition behind the theory to separate out the random component from a mixture of non-random and random components. Section 7 describes the proposed random matrix-based filtering technique. Section 8
applies the proposed technique and reports its performance for various data sets. Finally, Section 9 concludes this paper.

2. Related Work

There exists a growing body of literature on privacy-sensitive data mining. These algorithms can be divided into several different groups. One approach adopts a distributed framework. This approach supports computation of data mining models and extraction of "patterns" at a given node by exchanging only the minimal necessary information among the participating nodes without transmitting the raw data. Privacy preserving association rule mining from homogeneous [9] and heterogeneous [19] distributed data sets are a few examples. The second approach is based on
data-swapping, which works by swapping data values within the same feature [3].

There is also an approach which works by adding random noise to the data in such a way that the individual data values are distorted while the underlying distribution properties are preserved at a macroscopic level. The algorithms belonging to this group work by first perturbing the data using randomized techniques. The perturbed data is then used to extract the patterns and models. The randomized value distortion technique for learning decision trees [2] and association rule learning [6] are examples of this approach. Additional work on randomized masking of data can be found elsewhere [18].

This paper explores the third approach [2]. It points out that in many cases the noise can be separated from the perturbed data by studying the spectral properties of the data, and as a result its privacy can be seriously compromised. Agrawal and Aggarwal [1] have also considered the approach in [2] and have provided an expectation-maximization (EM) algorithm for reconstructing the distribution of the original data from perturbed observations. They also provide information theoretic measures (mutual information) to quantify the amount of privacy provided by a randomization approach. Agrawal and Aggarwal [1] remark that the method suggested in [2] does not take into account the distribution of the original data (which could be used to guess the data value to a higher level of accuracy). However, [1] provides no explicit procedure to reconstruct the original data values. Evfimievski et al. [5] and Rizvi [15] have also considered the approach in [2] in the context of association rule mining and suggest
techniques for limiting privacy breaches. Our primary contribution is to provide an explicit filtering procedure, based on random matrix theory, that can be used to estimate the original data values.

3. Motivation

As noted in the previous section, a growing body of privacy preserving data mining techniques are adopting randomization as a primary tool to "hide" information. While randomization is an important tool, it must be used very carefully in a privacy-preserving application.

Randomness may not necessarily imply uncertainty. Random events can often be analyzed and their properties can be explained using probabilistic frameworks. Statistics, randomized computation, and many other related fields are full of theorems, laws, and algorithms that rely on probabilistic characterization of random processes that often work quite accurately. The signal processing literature [12] offers many filters to remove white noise from data and they often work reasonably well. Randomly generated structures like graphs demonstrate interesting properties [7]. In short, randomness does seem to have "structure" and this structure may be used to compromise privacy issues unless we pay
careful attention. The rest of this paper illustrates this challenge in the context of a well-known privacy preserving technique that works using random additive noise.

4. Random Value Perturbation Technique: A Brief Review

For the sake of completeness, we now briefly review the random data perturbation method suggested in [2] for hiding the data (i.e. guaranteeing protection against the reconstruction of the data) while still being able to estimate the underlying distribution.

4.1. Perturbing the Data

The random value perturbation method attempts to preserve privacy of the data by modifying values of the sensitive attributes using a randomized process [2]. The authors explore two possible approaches, Value-Class Membership and Value Distortion, and emphasize the Value Distortion approach. In this approach, the owner of a dataset returns a value $u_i + v$, where $u_i$ is the original data and $v$ is a random value drawn from a certain distribution. The most commonly used distributions are the uniform distribution over an interval $[-\alpha, \alpha]$ and the Gaussian distribution with mean $\mu = 0$ and standard deviation $\sigma$. The $n$ original data values $u_1, u_2, \ldots, u_n$ are viewed as realizations of $n$ independent and identically distributed (i.i.d.) random variables $U_i$, $i = 1, 2, \ldots, n$, each with the same distribution as that of a random variable $U$. In order to perturb the data, $n$ independent samples $v_1, v_2, \ldots, v_n$ are drawn from a distribution $V$. The owner of the data provides the perturbed values $u_1 + v_1, u_2 + v_2, \ldots, u_n + v_n$ and the cumulative distribution function $F_V(v)$ of $V$. The reconstruction problem is to estimate the distribution $F_U(u)$ of the original data from the perturbed data.

4.2. Estimation of Distribution Function from the Perturbed Dataset

The authors [2] suggest the following method to estimate the distribution $F_U(u)$ of $U$, given $n$ independent samples $w_i = u_i + v_i$, $i = 1, 2, \ldots, n$, and $F_V(v)$. Using Bayes' rule, the posterior distribution function $F'_U(u)$ of $U$, given that $U + V = w$, can be written as

$$F'_U(u) = \frac{\int_{-\infty}^{u} f_V(w - z)\, f_U(z)\, dz}{\int_{-\infty}^{\infty} f_V(w - z)\, f_U(z)\, dz},$$
which upon differentiation with respect to $u$ yields the density function

$$f'_U(u) = \frac{f_V(w - u)\, f_U(u)}{\int_{-\infty}^{\infty} f_V(w - z)\, f_U(z)\, dz},$$

where $f_U(\cdot)$ and $f_V(\cdot)$ denote the probability density functions of $U$ and $V$, respectively. If we have $n$ independent samples $u_i + v_i = w_i$, $i = 1, 2, \ldots, n$, the corresponding posterior distribution can be obtained by averaging:

$$f'_U(u) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_V(w_i - u)\, f_U(u)}{\int_{-\infty}^{\infty} f_V(w_i - z)\, f_U(z)\, dz}. \qquad (1)$$

For a sufficiently large number of samples $n$, we expect the above density function to be close to the real density function $f_U(u)$. In practice, since the true density $f_U(u)$ is unknown, we need to modify the right-hand side of Equation 1. The authors suggest an iterative procedure where at each step $j = 1, 2, \ldots$, the posterior density $f_U^{j-1}(u)$ estimated at step $j - 1$ is used in the right-hand side of Equation 1. The uniform density is used to initialize the iterations. The iterations are carried out until the difference between successive estimates becomes
small. In order to speed up computations, the authors also discuss approximations to the above procedure using partitioning of the domain of data values.

5. Randomness and Patterns

The random perturbation technique "apparently" distorts the sensitive attribute values and still allows estimation of the underlying distribution information. However, does this apparent distortion fundamentally prohibit us from extracting the hidden information? This section presents a discussion on the properties of random matrices and presents some results that will be used later in this paper.

Random matrices [13] exhibit many interesting properties that are often exploited in high energy physics [13], signal processing [16], and even data mining [10]. The random noise added to the data can be viewed as a random matrix, and therefore its properties can be understood by studying the properties of random matrices. In this paper we shall develop a spectral filter, designed based on random matrix theory, for extracting the hidden data from the data perturbed by random noise.

For our approach, we are mainly concerned about the distribution of eigenvalues of the sample covariance matrix obtained from a random matrix. Let $X$ be a random $m \times n$ matrix whose entries $X_{ij}$, $i = 1, \ldots, m$, $j = 1, \ldots, n$, are i.i.d. random variables with zero mean and variance $\sigma^2$. The covariance matrix of $X$ is given by $Y = \frac{1}{m} X^T X$. Clearly, $Y$ is an $n \times n$ matrix. Let $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ be the eigenvalues of $Y$. Let

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} H(x - \lambda_i)$$

be the empirical cumulative distribution function (c.d.f.) of the eigenvalues $\{\lambda_i\}$, where $H(x)$ is the unit step function, equal to 1 for $x \ge 0$ and 0 otherwise. In order to consider the asymptotic properties of the c.d.f. $F_n(x)$, we will consider the dimensions $m = m(N)$ and $n = n(N)$ of matrix $X$ to be functions of a variable $N$. We will consider asymptotics such that in the limit as $N \to \infty$, we have $m(N) \to \infty$, $n(N) \to \infty$, and $m(N)/n(N) \to Q$, where $Q \ge 1$. Under these assumptions, it can be shown that [8] the empirical c.d.f. $F_n(x)$ converges in probability to a continuous distribution function $F_Q(x)$ for every $x$, whose probability density function (p.d.f.) is given by

$$f_Q(x) = \begin{cases} \dfrac{Q \sqrt{(x - \lambda_{\min})(\lambda_{\max} - x)}}{2 \pi \sigma^2 x} & \lambda_{\min} \le x \le \lambda_{\max} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $\lambda_{\min}$ and $\lambda_{\max}$ are as follows:

$$\lambda_{\min} = \sigma^2 \left(1 - 1/\sqrt{Q}\right)^2, \qquad \lambda_{\max} = \sigma^2 \left(1 + 1/\sqrt{Q}\right)^2. \qquad (3)$$

Further refinements of this result and other discussions can be found in [16].

6. Separating the Data from the Noise

Consider an $m \times n$ data matrix $U$ and a noise
matrix $V$ with the same dimensions. The random value perturbation technique generates a modified data matrix $U_p = U + V$. Our objective is to extract $U$ from $U_p$. Although the noise matrix may introduce a seemingly significant difference between $U$ and $U_p$, it may not be successful in hiding the data.

Consider the covariance matrix of $U_p$:

$$U_p^T U_p = (U + V)^T (U + V) = U^T U + V^T U + U^T V + V^T V. \qquad (4)$$

Now note that when the signal random vector (rows of $U$) and the noise random vector (rows of $V$) are uncorrelated, we have $E[U^T V] = E[V^T U] = 0$. The uncorrelated assumption is valid in practice since the noise that is added to the data is generated by a statistically independent process. Recall that the random value perturbation technique discussed in the previous section introduces uncorrelated
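noise to hide the signal or the data. If the number of observations is sufficiently large, we have that $U^T V \approx 0$ and $V^T U \approx 0$. Equation 4 can now be simplified as follows:

$$U_p^T U_p = U^T U + V^T V. \qquad (5)$$

Since the correlation matrices $U^T U$, $U_p^T U_p$, and $V^T V$ are symmetric and positive semi-definite, let

$$U^T U = Q_u \Lambda_u Q_u^T, \qquad U_p^T U_p = Q_p \Lambda_p Q_p^T, \qquad \text{and} \qquad V^T V = Q_v \Lambda_v Q_v^T, \qquad (6)$$

where $Q_u$, $Q_p$, $Q_v$ are orthogonal matrices whose column vectors are eigenvectors of $U^T U$, $U_p^T U_p$, $V^T V$, respectively, and $\Lambda_u$, $\Lambda_p$, $\Lambda_v$ are diagonal matrices with the corresponding eigenvalues on their diagonals.

The following result from matrix perturbation theory [20] gives a relationship between $\Lambda_u$, $\Lambda_p$, and $\Lambda_v$.

Theorem 1 [20] Suppose $\lambda_{i,u}$, $\lambda_{i,p}$, and $\lambda_{i,v}$, $i = 1, \ldots, n$, are the eigenvalues of $U^T U$, $U_p^T U_p$, and $V^T V$, respectively, each sorted in non-increasing order. Then, for $i = 1, \ldots, n$,

$$\lambda_{i,u} + \lambda_{n,v} \le \lambda_{i,p} \le \lambda_{i,u} + \lambda_{1,v}.$$

This theorem provides us a bound on the change in the eigenvalues of the data correlation matrix in terms of the minimum and maximum eigenvalues of the noise correlation matrix. Now let us take a step further and explore the properties of the eigenvalues of the perturbed data matrix for large values of $m$.

Lemma 1 Let data matrix $U$ and noise matrix $V$ be of size $m \times n$ and $U_p = U + V$. Let $Q_u$, $Q_p$, $Q_v$ be orthogonal matrices and $\Lambda_u$, $\Lambda_p$, $\Lambda_v$ be diagonal matrices as defined in Equation 6. If $m/n \to \infty$, then $\Lambda_p = Q_p^T Q_u \Lambda_u Q_u^T Q_p + \sigma^2 I$, where $\sigma^2$ is the variance of the noise.

Proof: Using Equations 5 and 6 we can write $Q_p \Lambda_p Q_p^T = Q_u \Lambda_u Q_u^T + Q_v \Lambda_v Q_v^T$, and hence

$$\Lambda_p = Q_p^T Q_u \Lambda_u Q_u^T Q_p + Q_p^T Q_v \Lambda_v Q_v^T Q_p. \qquad (7)$$

Let the minimum and maximum eigenvalues of the noise covariance matrix (normalized by $1/m$ as in Section 5) be $\lambda_{\min}$ and $\lambda_{\max}$, respectively. It follows from Equation 3 that all the eigenvalues in $\Lambda_v$ become identical, since $\lim_{Q \to \infty} \lambda_{\max} = \lim_{Q \to \infty} \lambda_{\min} = \sigma^2$ (say). This implies that, as $Q \to \infty$, $\Lambda_v \to \sigma^2 I$, where $I$ is the $n \times n$ identity matrix. Therefore, if the number of observations $m$ is large enough (note that, in practice, the number of features $n$ is fixed),

$$Q_p^T Q_v \Lambda_v Q_v^T Q_p \approx \sigma^2 Q_p^T Q_v Q_v^T Q_p = \sigma^2 I.$$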
Therefore, Equation 7 becomes

$$\Lambda_p = Q_p^T Q_u \Lambda_u Q_u^T Q_p + \sigma^2 I. \qquad (8)$$

If the norm of the perturbation matrix $V$ is small, the eigenvectors $Q_p$ of $U_p^T U_p$ would be close to the eigenvectors $Q_u$ of $U^T U$. Indeed, matrix perturbation theory provides precise bounds on the angle between eigenvectors (and invariant subspaces) of a matrix $U$ and that of its perturbation $U_p = U + V$ in terms of the norms of the perturbation matrix $V$. For example, let $(x, \lambda)$ be an eigenvector-eigenvalue pair for matrix $U$ and $\epsilon = \|V\|_2 = \sigma_{\max}(V)$ be the two-norm of the perturbation, where $\sigma_{\max}(V)$ is the largest singular value of $V$. Then there exists an eigenvalue-eigenvector pair $(\hat{x}, \hat{\lambda})$ of $U_p$ satisfying [20, 17]

$$\tan \angle(x, \hat{x}) \le \frac{2 \epsilon}{\delta - \epsilon},$$

where $\delta$ is the distance between $\lambda$ and the closest eigenvalue of $U$, provided $\delta > \epsilon$. This shows that the eigenvectors of $U$ and $U_p$ are in general close for small perturbations. Moreover, $Q_p^{-1} = Q_p^T$, where $Q_p^T$ is the conjugate-transpose of $Q_p$. Consequently, the product $Q_p^T Q_u$, which is the matrix of inner products between the eigenvectors of $U^T U$ and $U_p^T U_p$, would be close to an identity matrix; i.e., $Q_p^T Q_u \approx I$. Thus Equation 8 becomes

$$\Lambda_p \approx \Lambda_u + \sigma^2 I. \qquad (9)$$

Suppose the signal covariance matrix has only a few dominant eigenvalues, say $\lambda_{1,u} \ge \cdots \ge \lambda_{k,u}$, with $\lambda_{i,u} \le \gamma$ for some small value $\gamma$ and $i = k + 1, \ldots, n$. This condition is true for many real-world signals.
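
As a rough numerical illustration of Equations 3 and 9 (a sketch, not part of the original experiments), the following Python/NumPy snippet builds a synthetic low-rank signal, adds i.i.d. Gaussian noise, and compares the eigenvalues of the perturbed covariance matrix with the signal eigenvalues shifted by $\sigma^2$. The matrix sizes, the noise variance, the rank-3 signal, the helper name `mp_bounds`, and the $1/m$ normalization of all three covariance matrices are assumptions chosen for this example.

```python
import numpy as np

def mp_bounds(sigma2, m, n):
    """Noise eigenvalue bounds from Equation 3, using Q = m / n."""
    q = m / n
    lam_min = sigma2 * (1.0 - 1.0 / np.sqrt(q)) ** 2
    lam_max = sigma2 * (1.0 + 1.0 / np.sqrt(q)) ** 2
    return lam_min, lam_max

rng = np.random.default_rng(0)
m, n, k = 20000, 10, 3      # many observations, few features, rank-3 signal (assumed)
sigma2 = 0.25               # hypothetical noise variance

# Low-rank "signal": k latent factors mixed into n features.
latent = rng.normal(size=(m, k)) * np.array([3.0, 2.0, 1.5])
mixing = np.linalg.qr(rng.normal(size=(n, k)))[0]    # n x k, orthonormal columns
U = latent @ mixing.T
V = rng.normal(scale=np.sqrt(sigma2), size=(m, n))   # additive i.i.d. noise
Up = U + V

# Eigenvalues of the (1/m)-normalized covariance matrices, in descending order.
eig_u = np.linalg.eigvalsh(U.T @ U / m)[::-1]
eig_v = np.linalg.eigvalsh(V.T @ V / m)[::-1]
eig_p = np.linalg.eigvalsh(Up.T @ Up / m)[::-1]

lam_min, lam_max = mp_bounds(sigma2, m, n)

# Equation 9: perturbed eigenvalues are approximately signal eigenvalues + sigma^2.
shift_error = np.max(np.abs(eig_p - (eig_u + sigma2)))

# Thresholding at the upper noise bound should keep roughly the k dominant eigenvalues.
n_signal = int(np.sum(eig_p > lam_max))

print("noise band:", lam_min, lam_max)
print("noise eigenvalue range:", eig_v.min(), eig_v.max())
print("max |eig_p - (eig_u + sigma2)|:", shift_error)
print("eigenvalues above lam_max:", n_signal)
```

With these settings the shift error is expected to be small relative to the dominant signal eigenvalues, and the noise eigenvalues should cluster near the band given by Equation 3; the exact numbers depend on the random seed and on finite-sample fluctuations at the band edges.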

Suppose $\lambda_{k,u} \gg \lambda_{1,v}$, the largest eigenvalue of the noise covariance matrix. It is then clear that we can separate the signal and noise eigenvalues from the eigenvalues $\Lambda_p$ of the observed data by a simple thresholding at $\lambda_{1,v}$.

Note that Equation 9 is only an approximation. However, in practice, one can design a filter based on this approximation to filter out the perturbation from the data. Experimental results presented in the following sections indicate that this provides a good recovery of the data.

7. Random Matrix-Based Data Filtering

This section describes the proposed filter for extracting the original data from the noisy perturbed data. Suppose actual data $U$ is perturbed by a randomly generated noise matrix $V$ in order to produce $U_p = U + V$. Let $u_{p,i}$, $i = 1, 2, \ldots, m$, be $m$ (perturbed) data points, each being a vector of $n$ features.

Figure 1. Estimation of original sinusoidal data with known random noise variance. (Spectral filtering: plot of estimated, actual, and perturbed data, value of feature versus number of instances, SNR = 1.5326.)

When the noise distribution $F_V(v)$ of $V$ is completely known (as required by the random value perturbation technique [2]), the noise variance $\sigma^2$ is first calculated from the given distribution. Equation 3 is then used to calculate $\lambda_{\max}$ and $\lambda_{\min}$, which provide the theoretical bounds of the eigenvalues corresponding to the noise matrix $V$. From the perturbed data, we compute the eigenvalues of its covariance matrix $Y$, say $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. Then we identify the noisy eigenstates $\lambda_i \le \lambda_{i+1} \le \cdots \le \lambda_j$ such that $\lambda_i \ge \lambda_{\min}$ and $\lambda_j \le \lambda_{\max}$. The remaining eigenstates are the eigenstates corresponding to actual data. Let $\Lambda_v = \mathrm{diag}(\lambda_i, \lambda_{i+1}, \ldots, \lambda_j)$ be the diagonal matrix with all noise-related eigenvalues, and $A_v$ be the matrix whose columns are the corresponding eigenvectors. Similarly, let $\Lambda_u$ be the eigenvalue matrix for the actual data part and $A_u$ be the corresponding eigenvector matrix, which is an $n \times k$ matrix ($k = n - j + i - 1$). Based on these matrices, we decompose the covariance matrix $Y$ into two parts, $Y_s$ and $Y_r$, with $Y = Y_s + Y_r$, where $Y_r = A_v \Lambda_v A_v^T$ is the covariance matrix corresponding to the random noise part, and $Y_s = A_u \Lambda_u A_u^T$ is the covariance matrix corresponding to the actual data part. An estimate $\hat{U}$ of the actual data $U$ is obtained by
projecting the data $U_p$ onto the subspace spanned by the columns of $A_u$. In other words, $\hat{U} = U_p A_u A_u^T$.

8. Experimental Results

In this section, we present results of our experiments with the proposed spectral filtering technique. This section also includes a discussion on the effect of noise variance on the performance of the spectral filtering method.

Figure 2. Distribution of eigenvalues of actual data, and estimated eigenvalues of random noise and actual data. (Eigenvalues on a log scale versus number of features; legend: estimated data eigenvalue, actual data eigenvalue, estimated noise eigenvalue, Lambda (Max), Lambda (Min).)

Figure 3. Spectral filtering used to estimate real world audio data. Waveform of audio signal is closely estimated from its perturbed version. (Top panel: plot of a fraction of the dataset, estimated vs. actual signal, mean SNR = 1.3; bottom panel: estimation error for the dataset shown.)

8.1. Estimation with Known Perturbing Distribution

We tested our privacy breaching technique using several datasets of different sizes. We considered both artificially generated and real data sets. Towards that end, we generated a dataset with 35 features and 300 instances. Each feature has a specific trend like sinusoidal, square, and triangular shape; however,
there is no dependency between any two features. The actual dataset is perturbed by adding Gaussian noise (with zero mean and known variance), and our proposed technique is applied to recover the actual data from the perturbed data. Figure 1 shows the result of our spectral filtering for one such feature where the actual data has a sinusoidal trend. The filtering technique appears to provide an accurate estimate of the individual values of the actual data. Figure 2 shows the distribution of eigenvalues of the actual and perturbed data. It also identifies the estimated noise eigenvalues and the theoretical bounds $\lambda_{\max}$ and $\lambda_{\min}$. As we see, the filtering method accurately distinguishes between noisy eigenvalues and eigenvalues corresponding to actual data. Note that the estimated eigenvalues of actual data are very close to the eigenvalues of actual data and almost overlap with them above $\lambda_{\max}$. The eigenvalues of actual data below $\lambda_{\min}$ are practically negligible. Thus, the estimated eigenvalues of the actual data capture most of the information and discard the additive noise.

Figure 4. Plot of the individual values of a fraction of the dataset with 'Triangular' distribution. Spectral filtering gives close estimation of individual values. (Estimated, actual, and perturbed data; values versus number of instances.)

The random matrix-based filtering technique can also be extended to datasets with a single feature, i.e. when the dataset is a single column vector. The data vector is perturbed with a noise vector with the same dimension. The
perturbed data vector is then split into a fixed number of vectors with equal length, and all of these vectors are appended to form a matrix. The spectral filtering technique is then applied to this matrix to estimate the original data. After the data matrix is estimated, its columns are concatenated to form a single vector.

We used a real world single feature data set to verify the performance of the spectral filtering. The dataset used is the scaled amplitude of the waveform of an audio tune recorded using a fixed sampling frequency. The recorded tune is fairly noise free. We perturbed this data with additive Gaussian noise. We define the term Signal-to-Noise Ratio (SNR) to quantify the relative amount of noise added to actual data to perturb it:

$$\mathrm{SNR} = \frac{\text{Variance of Actual Data}}{\text{Noise Variance}}. \qquad (10)$$

Figure 5. Reconstruction of the 'Triangular' distribution. Perturbed data distribution does not look like a triangular distribution, but the reconstructed distribution using spectral filtering resembles the original distribution closely. (Number of records versus attribute value; legend: estimated, perturbed, and actual distribution.)

In this experiment, the noise variance was chosen to yield a signal-to-noise ratio of 1.3. We split this vector of perturbed data into columns of equal length and applied the spectral filtering technique to recover the actual data. The result is shown in Figure 3. For the sake of clarity, only a fraction of the dataset is shown, and the estimation error is plotted for that fraction. As shown in Figure 3, the perturbed data is very different from the actual
data, whereas the estimated data is a close approximation of the actual data. The estimation performance is similar to that for multi-featured data (see Figure 1).

8.2. Comparison With Results in [2]

The proposed spectral filtering technique can estimate values of individual data-points from the perturbed dataset. This point-wise estimation can then be used to reconstruct the distribution of actual data as well. The methods suggested by [2] can only reconstruct the distribution of the original data from the data perturbed by random value distortion; they do not consider estimation of the individual values of the data-points. The spectral filtering technique, on the other hand, is explicitly designed to reconstruct the individual data-points and hence also the distribution of the actual dataset.

We tried to replicate the experiment reported in [2] using our method to recover the triangular distribution. We used a data vector whose values have a triangular distribution, as shown in [2]. The individual values of actual data are within 0 and 1 and are independent of each other. We added zero-mean Gaussian random noise to this data and split the data vector into columns of equal length. We then applied our spectral filter to recover the actual data from the perturbed data. Figure 4 shows a portion of the actual data, their values after distortion, and their estimated values. Note that the estimated values are very close to the actual values, compared to the perturbed values. Using the estimate of individual data-points, we reconstruct the distribution of the actual data. Figure 5 shows estimation of the distribution from the estimated value of individual data-points. The distribution of the perturbed data is very different from the actual triangular distribution, but the estimated distribution looks very similar to the original distribution. This shows that our method recovers the original distribution along with individual data-points, similar to the result reported in [2]. The estimation accuracy is high for all datapoints. Since spectral filtering can filter out the individual values of actual data and its distribution from the perturbed representation, it breaches the privacy preserving protection of the randomized data perturbation technique [2].
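
The filtering procedure of Section 7 and the single-feature variant of Section 8.1 can be sketched in a few lines of Python/NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`noise_band`, `spectral_filter`, `filter_single_feature`), the mean-centering step, the synthetic data (sinusoids sharing one frequency, so that the signal is low-rank), and the chosen noise variance are all hypothetical choices made for illustration.

```python
import numpy as np

def noise_band(sigma2, m, n):
    """Theoretical noise eigenvalue bounds (Equation 3) with Q = m / n."""
    r = np.sqrt(n / m)                        # equals 1 / sqrt(Q)
    return sigma2 * (1.0 - r) ** 2, sigma2 * (1.0 + r) ** 2

def spectral_filter(Up, sigma2):
    """Estimate U from Up = U + V when the noise variance sigma2 is known."""
    m, n = Up.shape
    mean = Up.mean(axis=0)                    # centering: an assumption of this sketch
    Z = Up - mean
    Y = Z.T @ Z / m                           # n x n sample covariance matrix
    lam, A = np.linalg.eigh(Y)                # eigenvalues in ascending order
    lam_min, lam_max = noise_band(sigma2, m, n)
    is_noise = np.logical_and(lam >= lam_min, lam_max >= lam)
    A_u = A[:, ~is_noise]                     # eigenvectors treated as "actual data"
    return Z @ A_u @ A_u.T + mean             # projection onto the columns of A_u

def filter_single_feature(w, n_cols, sigma2):
    """Single-column case: split w into n_cols equal pieces, filter, re-join."""
    W = w.reshape(n_cols, -1).T               # each contiguous piece becomes a column
    return spectral_filter(W, sigma2).T.reshape(-1)

# Synthetic data with 35 features and 300 instances, as in Section 8.1.
rng = np.random.default_rng(1)
m, n, sigma2 = 300, 35, 0.25                  # sigma2 is a hypothetical choice
t = np.arange(m) / m
amp = rng.uniform(0.8, 1.2, size=n)
phase = rng.uniform(0.0, 2.0 * np.pi, size=n)
U = amp * np.sin(2.0 * np.pi * 3.0 * t[:, None] + phase)   # 300 x 35 sinusoidal trends
V = rng.normal(scale=np.sqrt(sigma2), size=(m, n))          # zero-mean Gaussian noise
Up = U + V
U_hat = spectral_filter(Up, sigma2)

snr = U.var() / sigma2                        # Equation 10
mse_perturbed = np.mean((Up - U) ** 2)
mse_filtered = np.mean((U_hat - U) ** 2)
print(f"SNR={snr:.2f}  MSE perturbed={mse_perturbed:.3f}  MSE filtered={mse_filtered:.3f}")
```

In this sketch every eigenvalue outside the band of Equation 3 is treated as signal, following the description in Section 7. A noise eigenvalue that lands slightly outside the band because of finite-sample fluctuations is then kept as signal, which adds a small amount of residual noise to the estimate.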

8.3. Effect of Perturbation Variance and the Inherent Random Component of the Actual Data

The quality of the data recovery depends upon the relative noise content in the perturbed data. We use the SNR (see Equation 10) to quantify the relative amount of noise added to actual data to perturb it. As the noise added to the actual value increases, the SNR decreases. Our experiments show that the proposed filtering method predicts the actual data reasonably well up to an SNR value of 1.0 (i.e. 100% noise). The results shown in Figure 1 correspond to an SNR value of nearly 2, i.e. noise content is about 50%. Figure 3 shows a data-block where the SNR is 1.3. As the SNR goes below 1, the estimation becomes too erroneous. Figure 6 shows the difference in estimation accuracy as the SNR increases from 1. The dataset used here has a sinusoidal trend in its values. The top graph corresponds to 23% noise (SNR 4.3), whereas the bottom graph corresponds to 100% noise (SNR 1.0).

Another important factor that affects the quality of recovery of the actual data is the inherent noise in the actual dataset (apart from the perturbation noise added intentionally). If the actual dataset has a random component in it, and random noise is added to perturb it, the spectral filtering method does not filter the actual data accurately. Our experiments with some inherently noisy real life datasets show that the eigenvalues of signal and noise no longer remain clearly separable, since their eigenvalues may no longer be distributed over two non-overlapping regimes.

Figure 6. Higher noise content (low SNR) leads to less accurate estimation. SNR in the upper figure is 1, while that for the lower figure is 4.3. (Variation of estimation accuracy with SNR: plots of a sinusoidal feature, estimated vs. actual signal, with SNRs = 1.0033 and 4.2423; value of feature versus number of instances.)

We have performed experiments with an artificial dataset with a specific trend in its values as well as a real world dataset containing a random component. Figure 1 in fact shows that our method gives a close estimation of actual data when the dataset has some specific
trend (sinusoid). We also applied our method to the "Ionosphere" data available from [14], which is inherently noisy. We perturbed the original data with random noise such that the mean SNR is the same as for the artificial dataset. Figure 7 shows that the recovery quality is poor compared to datasets having a definite trend.

However, this opens up a different question: Is the random component of the original data set really important as far as data mining is concerned? One may argue that most data mining techniques exploit only the non-random structured patterns of the data. Therefore, losing the inherent random component of the original data may not be important in a privacy preserving data mining application.

9. Conclusion and Future Work

Preserving privacy in data mining activities is a very important issue in many applications. Randomization-based techniques are likely to play an important role in this domain. However, this paper illustrates some of the challenges that these techniques face in preserving the data privacy. It showed that under certain conditions it is relatively easy to breach the privacy protection offered by the random perturbation based techniques. It provided extensive experimental results with different types of data and showed that this is really a concern that we must address. In addition to raising this concern, the paper offers a random-matrix based data filtering technique that may find wider application in developing a new perspective toward developing better privacy-preserving data mining algorithms.

Figure 7. Spectral filtering performs poorly on a dataset with a random component in its actual value. However, it is not clear if losing the random component of the data is a concern for data mining applications. (Plot of one feature, estimated vs. actual signal, SNR = 1.11; value of data.)

Since the problem mainly originates from the usage of additive, independent "white" noise for privacy preservation, we should explore "colored" noise for this application. We have already started exploring multiplicative noise matrices in this context. If $U$ is the data matrix and $V$ is an appropriately sized random noise matrix, then we are interested in the properties of the perturbed data $U_p = UV$ for
privacy-preserving data mining applications. If $V$ is a square matrix, then we may be able to extract the signal using techniques like independent component analysis. However, projection matrices that satisfy certain conditions may be more appealing for such applications. More details about this possibility can be found elsewhere [11].

Acknowledgments

The authors acknowledge supports from the United States National Science Foundation CAREER award IIS-0093353, NASA (NRA) NAS2-37143, and TEDCO, Maryland Technology Development Center.

References

[1] D. Agrawal and C. C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the 20th ACM SIGMOD Symposium on Principles of Database Systems, pages 247–255, Santa Barbara, May 2001.
[2] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 439–450, Dallas, Texas, May 2000. ACM Press.
[3] V. Estivill-Castro and L. Brankovic. Data swapping: Balancing privacy against precision in mining for logic rules. In Proceedings of the First Conference on Data Warehousing and Knowledge Discovery (DaWaK-99), pages 389–398, Florence, Italy, 1999. Springer Verlag.
[4] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the ACM SIGMOD/PODS Conference, San Diego, CA, June 2003.
[5] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In Proceedings of the ACM SIGKDD Conference, Edmonton, Canada, 2002.
[6] S. Evfimievski. Randomization techniques for privacy preserving association rule mining. In SIGKDD Explorations, volume 4(2), Dec 2002.
[7] S. Janson, T. Łuczak, and A. Rucinski. Random Graphs. Wiley Publishers, 1 edition, 2000.
[8] D. Jonsson. Some limit theorems for the eigenvalues of a sample covariance matrix. Journal of Multivariate Analysis, 12:1–38, 1982.
[9] M. Kantarcioglu and C. Clifton. Privacy-preserving distributed mining of association rules on horizontally partitioned data. In SIGMOD Workshop on DMKD, Madison, WI, June 2002.
[10] H. Kargupta, K. Sivakumar, and S. Ghosh. Dependency detection in MobiMine and random matrices. In Proceedings of the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 250–262. Springer, 2002.
[11] K. Liu, H. Kargupta, and J. Ryan. Random projection and privacy preserving correlation computation from distributed data. Technical report, University of Maryland Baltimore County, Computer Science and Electrical Engineering Department, Technical Report TR-CS-03-24, 2003.
[12] D. G. Manolakis, V. K. Ingle, and S. M. Kogon. Statistical and Adaptive Signal Processing. McGraw Hill, 2000.
[13] M. L. Mehta. Random Matrices. Academic Press, London, 2 edition, 1991.
[14] U. M. L. Repository. http://www.ics.uci.edu/~mlearn/mlsummary.html.
[15] S. J. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.
[16] J. W. Silverstein and P. L. Combettes. Signal detection via spectral theory of large dimensional random matrices. IEEE Transactions on Signal Processing, 40(8):2100–2105, 1992.
[17] G. W. Stewart. Error and perturbation bounds for subspaces associated with certain eigenvalue problems. SIAM Review, 15(4):727–764, October 1973.
[18] J. F. Traub, Y. Yemini, and H. Woźniakowski. The statistical security of a statistical database. ACM Transactions on Database Systems (TODS), 9(4):672–679, 1984.
[19] J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, CA, July 2002.
[20] H. Weyl. Inequalities between the two kinds of eigenvalues of a linear transformation. In Proceedings of the National Academy of Sciences, volume 35, pages 408–411, 1949.