# Data Mining And Privacy Protection

Prepared by: Eng. Hiba Ramadan

Supervised by: Dr. Rakan Razouk

## Slide 2: Outline

- Introduction
- Key directions in the field of privacy-preserving data mining
  - Privacy-Preserving Data Publishing
    - The randomization method
    - The k-anonymity model
    - l-diversity
  - Changing the results of data mining applications to preserve privacy
    - Association rule hiding
  - Privacy-Preserving Distributed Data Mining
    - Vertical partitioning of data
    - Horizontal partitioning

## Slide 3: Privacy-Preserving Data Mining

**What is data mining?**
The non-trivial extraction of implicit, previously unknown, and potentially useful information from large data sets or databases [W. Frawley, G. Piatetsky-Shapiro, and C. Matheus, 1992].

**What is privacy-preserving data mining?**
The study of achieving data mining goals without sacrificing the privacy of individuals.

## Slide 4: Scenario (Information Sharing)

A data owner wants to release a person-specific data table to another party (or the public) for the purpose of classification analysis, without sacrificing the privacy of the individuals in the released data.

(Diagram: the data owner releases person-specific data to the data recipients.)


## Slide 6: Key Directions in Privacy-Preserving Data Mining

- **Privacy-Preserving Data Publishing:** techniques that study different transformation methods associated with privacy.
- **Changing the results of data mining applications to preserve privacy:** in many cases, the results of data mining applications, such as association rule or classification rule mining, can themselves compromise the privacy of the data.
- **Privacy-Preserving Distributed Data Mining:** in many cases, the data is distributed across multiple sites, and the owners of the data at these different sites may wish to compute a common function.


## Slide 8: Randomization Approach Overview

(Diagram: original records such as "50 | 40K | ..." and "30 | 25 | ..." pass through a randomizer, producing distorted records such as "65 | 20 | ..." and "25 | 60K | ...". From the distorted records, the distributions of Age and Salary are reconstructed and fed to data mining algorithms to build a model.)

## Slide 9: Randomization

The method of randomization can be described as follows. Let $x = \{x_1, \ldots, x_N\}$ be the set of records. To each record $x_i \in X$ we add a noise component, where the noise values $y_1, \ldots, y_N$ are drawn from a probability distribution $f_Y(y)$. The new set of distorted records is $x_1 + y_1, \ldots, x_N + y_N$.

In general, it is assumed that the variance of the added noise is large enough that the original record values cannot be easily guessed from the distorted data. Thus, the original records cannot be recovered, but the *distribution* of the original records can be recovered.
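As a quick illustration (not from the slides; the ages, noise scale, and sample size are made up), adding independent zero-mean noise hides individual values while keeping aggregate statistics predictable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "original" ages, never released as-is.
ages = rng.normal(loc=40, scale=10, size=10_000)

# Zero-mean Gaussian noise with a large, publicly known variance.
noise = rng.normal(loc=0, scale=20, size=ages.shape)
distorted = ages + noise

# An individual distorted value says little about the original value,
# but for the whole data set E[X+Y] = E[X] and Var(X+Y) = Var(X) + Var(Y):
print(distorted.mean())  # close to 40
print(distorted.std())   # close to sqrt(10^2 + 20^2) ~ 22.4
```

Because the noise distribution is known, the analyst can subtract its known mean and variance to recover the aggregate behavior of the original attribute.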

## Slide 10: Randomization

(Figure only.)

## Slide 11: Reconstruction Problem

- Original values $x_1, x_2, \ldots, x_n$ are drawn from a probability distribution $X$ (unknown).
- To hide these values, we use $y_1, y_2, \ldots, y_n$ drawn from a probability distribution $Y$ (known).
- Given $x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n$ and the probability distribution of $Y$, estimate the probability distribution of $X$.

## Slides 12–13: Intuition (Reconstruct Single Point)

(Figures: using the known noise distribution, estimate where a single distorted point originally came from.)

## Slide 14: Reconstructing the Distribution

Combine the estimates of where each point came from, over all the points: this gives an estimate of the original distribution.

## Slide 15: Reconstruction

    f_X^0 := uniform distribution
    j := 0                                  // iteration number
    repeat
        f_X^{j+1}(a) := (Bayes'-rule update below)
        j := j + 1
    until (stopping criterion met)

The Bayes'-rule update (the standard reconstruction step from the randomization literature) is

$$ f_X^{j+1}(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y\big((x_i + y_i) - a\big)\, f_X^{j}(a)}{\int f_Y\big((x_i + y_i) - z\big)\, f_X^{j}(z)\, dz} $$

where $x_i + y_i$ are the observed distorted values.
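A discretized sketch of this iterative reconstruction (the grid, the noise model, and the `reconstruct` helper are illustrative assumptions, not from the slides):

```python
import numpy as np

def reconstruct(w, noise_pdf, grid, iters=30):
    """Discretized Bayes'-rule reconstruction: estimate f_X on `grid`
    from distorted observations w_i = x_i + y_i and the known noise
    density f_Y (known up to a constant factor)."""
    fx = np.full(grid.shape, 1.0 / len(grid))       # f_X^0: uniform
    for _ in range(iters):
        fy = noise_pdf(w[:, None] - grid[None, :])  # f_Y(w_i - a), shape (n, |grid|)
        post = fy * fx                              # Bayes'-rule numerator, per i
        post /= post.sum(axis=1, keepdims=True)     # normalize over grid points a
        fx = post.mean(axis=0)                      # average the posteriors over i
    return fx

rng = np.random.default_rng(1)
x = rng.normal(5.0, 1.0, size=2000)                 # hidden originals
y = rng.normal(0.0, 3.0, size=2000)                 # noise from the known f_Y
w = x + y                                           # the released, distorted data

grid = np.linspace(-10.0, 20.0, 121)
gauss = lambda t: np.exp(-t ** 2 / (2 * 3.0 ** 2))  # f_Y up to a constant
fx = reconstruct(w, gauss, grid)
print(grid[fx.argmax()])                            # peaks near the true mean (5)
```

Note that the constant factor of `f_Y` cancels in the ratio, so an unnormalized density suffices.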

## Slide 16: Seems to Work Well!

(Figure: the reconstructed distribution closely tracks the original one.)

## Slide 17: Pros & Cons

One key advantage of the randomization method is that it is relatively simple and does not require knowledge of the distribution of other records in the data.

## Slide 18: Pros & Cons (continued)

- We only have a distribution describing the behavior of X; individual records are not available.
- The distributions are available only along individual dimensions. While the approach can certainly be extended to multi-dimensional distributions, density estimation becomes inherently more challenging as dimensionality increases. Even for modest dimensionalities such as 7 to 10, the process of density estimation becomes increasingly inaccurate and falls prey to the curse of dimensionality.


## Slide 20: k-Anonymity

The role of attributes in the data:

- **Explicit identifiers** are removed.
- **Quasi-identifiers** can be used to re-identify individuals.
- **Sensitive attributes** (may not exist!) carry sensitive information.

| Name (identifier) | Birthdate (quasi-identifier) | Sex (quasi-identifier) | Zipcode (quasi-identifier) | Disease (sensitive) |
|---|---|---|---|---|
| Andre | 21/1/79 | male | 53715 | Flu |
| Beth | 10/1/81 | female | 55410 | Hepatitis |
| Carol | 1/10/44 | female | 90210 | Bronchitis |
| Dan | 21/2/84 | male | 02174 | Sprained Ankle |
| Ellen | 19/4/72 | female | 02237 | AIDS |

## Slide 21: k-Anonymity

Privacy is preserved via k-anonymity, proposed by Sweeney and Samarati.

- **k-anonymity:** intuitively, hide each individual among k-1 others; each combination of quasi-identifier (QI) values should appear at least k times in the released data.
- Sensitive attributes are not considered (we will revisit this...).
- How is this achieved? By generalization and suppression; value perturbation is not considered, since we should remain truthful to the original values.
- Privacy vs. utility tradeoff: do not anonymize more than necessary.
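The "each QI combination appears at least k times" condition is easy to check mechanically; a minimal sketch (the helper and column names are illustrative, not from the slides):

```python
from collections import Counter

def is_k_anonymous(rows, qi, k):
    """True if every combination of quasi-identifier values in `rows`
    appears at least k times.  `rows` are dicts, `qi` the QI columns."""
    counts = Counter(tuple(row[c] for c in qi) for row in rows)
    return all(n >= k for n in counts.values())

# Toy generalized table: every QI combination occurs exactly twice.
table = [
    {"Age": ">21", "Sex": "M", "Zipcode": "1100*", "Disease": "pneumonia"},
    {"Age": ">21", "Sex": "M", "Zipcode": "1100*", "Disease": "dyspepsia"},
    {"Age": ">61", "Sex": "F", "Zipcode": "1100*", "Disease": "flu"},
    {"Age": ">61", "Sex": "F", "Zipcode": "1100*", "Disease": "gastritis"},
]
print(is_k_anonymous(table, ["Age", "Sex", "Zipcode"], 2))  # True
print(is_k_anonymous(table, ["Age", "Sex", "Zipcode"], 3))  # False
```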

## Slide 22: k-Anonymity

Transform each QI value into a less specific form. A generalized table:

| Age | Sex | Zipcode | Disease |
|---|---|---|---|
| >21 | M | 1100* | pneumonia |
| >21 | M | 1100* | dyspepsia |
| >21 | M | 1100* | dyspepsia |
| >21 | M | 1100* | pneumonia |
| >61 | F | 1100* | flu |
| >61 | F | 1100* | gastritis |
| >61 | F | 1100* | flu |
| >61 | F | 1100* | bronchitis |

The adversary knows Bob's quasi-identifiers:

| Name | Age | Sex | Zipcode |
|---|---|---|---|
| Bob | 23 | M | 11000 |

## Slide 23: k-Anonymity Example

Tools for anonymization:

- **Generalization:** publish more general values, i.e., given a domain hierarchy, roll up.
- **Suppression:** remove tuples, i.e., do not publish outliers; often the number of suppressed tuples is bounded.

Original data:

| Birthdate | Sex | Zipcode |
|---|---|---|
| 21/1/79 | male | 53715 |
| 10/1/79 | female | 55410 |
| 1/10/44 | female | 90210 |
| 21/2/83 | male | 02274 |
| 19/4/82 | male | 02237 |

2-anonymous data:

| | Birthdate | Sex | Zipcode |
|---|---|---|---|
| group 1 | */1/79 | person | 5**** |
| group 1 | */1/79 | person | 5**** |
| suppressed | 1/10/44 | female | 90210 |
| group 2 | */*/8* | male | 022** |
| group 2 | */*/8* | male | 022** |

## Slide 24: Generalization Lattice

Assume domain hierarchies exist for all QI attributes (e.g., zipcode and sex). Construct the generalization lattice for the entire QI set; nodes lower in the lattice are less generalized, nodes higher are more generalized.

Objective: find the minimum generalization that satisfies k-anonymity, i.e., maximize utility by finding the minimum-distance vector that achieves k-anonymity.

## Slide 25: Incognito

The Incognito algorithm generates the set of all possible k-anonymous full-domain generalizations of T, with an optional tuple-suppression threshold. The algorithm begins by checking single-attribute subsets of the quasi-identifier, and then iterates, checking k-anonymity with respect to increasingly large subsets.

## Slide 26: Incognito

- **(I) Generalization property:** if k-anonymity holds at some node, it also holds for any ancestor node. E.g., <S1, Z0> is k-anonymous and, thus, so are <S1, Z1> and <S1, Z2>.
- **(II) Subset property:** if k-anonymity does not hold for a set of QI attributes, it does not hold for any of its supersets. E.g., <S0, Z0> is not k-anonymous and, thus, <S0, Z0, B0> and <S0, Z0, B1> cannot be k-anonymous.

Incognito considers sets of QI attributes of increasing cardinality and prunes nodes in the lattice using the two properties above. (Note: the entire lattice, which includes three dimensions <S, Z, B>, is too complex to show.)
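The subset property alone already yields a simple pruning scheme. A sketch (hypothetical helpers; a deliberate simplification, since real Incognito additionally searches each subset's generalization lattice rather than a single fixed table):

```python
from collections import Counter
from itertools import combinations

def k_anonymous(rows, cols, k):
    """True if every value combination of `cols` occurs at least k times."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return all(n >= k for n in counts.values())

def safe_subsets(rows, qi, k):
    """Enumerate QI subsets of increasing size, pruning every superset of
    a subset that already failed k-anonymity (Incognito's subset property)."""
    failed, safe = [], []
    for size in range(1, len(qi) + 1):
        for cols in combinations(qi, size):
            if any(set(f) <= set(cols) for f in failed):
                continue                     # pruned without checking
            if k_anonymous(rows, cols, k):
                safe.append(cols)
            else:
                failed.append(cols)
    return safe
```

For example, if the Zipcode column alone is not 2-anonymous, every subset containing Zipcode is skipped without being checked.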

## Slides 27–28: Incognito

(Figure: the candidate lattice of QI subsets of increasing size, e.g. <z0>, <b0>, <s0>; <z0,b0>, <z0,b1>, <z0,s0>, <z0,s1>, ...; <z0,b0,s0>, <z0,b0,s1>, ...)

## Slide 29: Seen in the Domain Space

Consider the multi-dimensional domain space: QI attributes are the dimensions, tuples are points in this space, and attribute hierarchies (e.g., the zipcode hierarchy and the sex hierarchy) partition the dimensions.

## Slide 30: Seen in the Domain Space

Incognito example: 2 QI attributes (sex, zipcode), 7 tuples, hierarchies shown with bold lines. (Figure: one partitioning of the space is not 2-anonymous; a coarser one is 2-anonymous.)


## Slide 32: k-Anonymity Problems

k-anonymity example:

- **Homogeneity attack:** in the last group, everyone has cancer.
- **Background knowledge:** in the first group, Japanese individuals have a low chance of heart disease.

We need to consider the sensitive values.

Original data:

| id | Zipcode | Age | Nationality | Disease |
|---|---|---|---|---|
| 1 | 13053 | 28 | Russian | Heart Disease |
| 2 | 13068 | 29 | American | Heart Disease |
| 3 | 13068 | 21 | Japanese | Viral Infection |
| 4 | 13053 | 23 | American | Viral Infection |
| 5 | 14853 | 50 | Indian | Cancer |
| 6 | 14853 | 55 | Russian | Heart Disease |
| 7 | 14850 | 47 | American | Viral Infection |
| 8 | 14850 | 49 | American | Viral Infection |
| 9 | 13053 | 31 | American | Cancer |
| 10 | 13053 | 37 | Indian | Cancer |
| 11 | 13068 | 36 | Japanese | Cancer |
| 12 | 13068 | 35 | American | Cancer |

4-anonymous data:

| id | Zipcode | Age | Nationality | Disease |
|---|---|---|---|---|
| 1 | 130** | <30 | ∗ | Heart Disease |
| 2 | 130** | <30 | ∗ | Heart Disease |
| 3 | 130** | <30 | ∗ | Viral Infection |
| 4 | 130** | <30 | ∗ | Viral Infection |
| 5 | 1485* | ≥40 | ∗ | Cancer |
| 6 | 1485* | ≥40 | ∗ | Heart Disease |
| 7 | 1485* | ≥40 | ∗ | Viral Infection |
| 8 | 1485* | ≥40 | ∗ | Viral Infection |
| 9 | 130** | 3∗ | ∗ | Cancer |
| 10 | 130** | 3∗ | ∗ | Cancer |
| 11 | 130** | 3∗ | ∗ | Cancer |
| 12 | 130** | 3∗ | ∗ | Cancer |

## Slide 33: l-Diversity

Make sure each group contains well-represented sensitive values:

- protects from homogeneity attacks
- protects from background knowledge

**l-diversity (simplified definition):** a group is l-diverse if the most frequent sensitive value accounts for at most a 1/l fraction of the group.

A 2-diverse generalized table:

| Age | Sex | Zipcode | Disease |
|---|---|---|---|
| [21, 60] | M | [10001, 60000] | pneumonia |
| [21, 60] | M | [10001, 60000] | dyspepsia |
| [21, 60] | M | [10001, 60000] | flu |
| [21, 60] | M | [10001, 60000] | pneumonia |
| [61, 70] | F | [10001, 60000] | flu |
| [61, 70] | F | [10001, 60000] | gastritis |
| [61, 70] | F | [10001, 60000] | flu |
| [61, 70] | F | [10001, 60000] | bronchitis |

The adversary knows Bob's quasi-identifiers:

| Name | Age | Sex | Zipcode |
|---|---|---|---|
| Bob | 23 | M | 11000 |
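The simplified definition can be checked per QI group; a small sketch (the helper and column names are illustrative, not from the slides):

```python
from collections import Counter, defaultdict

def is_l_diverse(rows, qi, sensitive, l):
    """Simplified l-diversity: in every QI group, the most frequent
    sensitive value may account for at most a 1/l fraction of the group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in qi)].append(row[sensitive])
    return all(
        max(Counter(vals).values()) / len(vals) <= 1 / l
        for vals in groups.values()
    )

# One group where 'flu' covers 2 of 4 tuples: 2-diverse but not 3-diverse.
rows = [{"Age": "[61,70]", "Disease": d}
        for d in ["flu", "gastritis", "flu", "bronchitis"]]
print(is_l_diverse(rows, ["Age"], "Disease", 2))  # True
print(is_l_diverse(rows, ["Age"], "Disease", 3))  # False
```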

## Slide 34: Anatomy

Anatomy is a fast l-diversity algorithm. It is not generalization: it separates sensitive values from tuples and shuffles sensitive values among groups. For a given data table, Anatomy releases a quasi-identifier table (QIT) and a sensitive table (ST).

Original data:

| Age | Sex | Zipcode | Disease |
|---|---|---|---|
| 23 | M | 11000 | pneumonia |
| 27 | M | 13000 | flu |
| 35 | M | 59000 | dyspepsia |
| 59 | M | 12000 | gastritis |
| 61 | F | 54000 | dyspepsia |
| 65 | F | 25000 | gastritis |
| 65 | F | 25000 | flu |
| 70 | F | 30000 | bronchitis |

Quasi-identifier Table (QIT):

| Age | Sex | Zipcode | Group-ID |
|---|---|---|---|
| 23 | M | 11000 | 1 |
| 27 | M | 13000 | 1 |
| 35 | M | 59000 | 1 |
| 59 | M | 12000 | 1 |
| 61 | F | 54000 | 2 |
| 65 | F | 25000 | 2 |
| 65 | F | 25000 | 2 |
| 70 | F | 30000 | 2 |

Sensitive Table (ST):

| Group-ID | Disease |
|---|---|
| 1 | dyspepsia |
| 1 | pneumonia |
| 1 | flu |
| 1 | gastritis |
| 2 | bronchitis |
| 2 | flu |
| 2 | gastritis |
| 2 | dyspepsia |

## Slide 35: Anatomy Algorithm

1. Assign sensitive values to buckets (one bucket per distinct sensitive value).
2. Create groups by drawing one tuple from each of the l largest buckets.
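A rough sketch of this bucketization (the helper is illustrative; leftover tuples, which the full Anatomy algorithm assigns to existing groups, are simply dropped here):

```python
from collections import defaultdict

def anatomize(rows, sensitive, l):
    """Bucketize tuples by sensitive value, then repeatedly form a group
    by drawing one tuple from each of the l currently largest buckets.
    Returns (QIT, ST) with the sensitive column separated out."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[sensitive]].append(row)
    qit, st, gid = [], [], 0
    while sum(1 for b in buckets.values() if b) >= l:
        gid += 1
        # The l buckets with the most remaining tuples.
        largest = sorted(buckets, key=lambda v: len(buckets[v]),
                         reverse=True)[:l]
        for value in largest:
            row = buckets[value].pop()
            qit.append({k: v for k, v in row.items() if k != sensitive}
                       | {"Group-ID": gid})
            st.append({"Group-ID": gid, sensitive: value})
    return qit, st
```

Because each group draws its l tuples from l *different* buckets, every group's sensitive values are distinct, which gives l-diversity by construction.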

## Slide 36: Privacy Preservation

From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l.

The adversary knows Bob's quasi-identifiers:

| Name | Age | Sex | Zipcode |
|---|---|---|---|
| Bob | 23 | M | 11000 |

Quasi-identifier Table (QIT):

| Age | Sex | Zipcode | Group-ID |
|---|---|---|---|
| 23 | M | 11000 | 1 |
| 27 | M | 13000 | 1 |
| 35 | M | 59000 | 1 |
| 59 | M | 12000 | 1 |
| 61 | F | 54000 | 2 |
| 65 | F | 25000 | 2 |
| 65 | F | 25000 | 2 |
| 70 | F | 30000 | 2 |

Sensitive Table (ST):

| Group-ID | Disease | Count |
|---|---|---|
| 1 | dyspepsia | 2 |
| 1 | pneumonia | 2 |
| 2 | bronchitis | 1 |
| 2 | flu | 2 |
| 2 | gastritis | 1 |


## Slide 38: Association Rule Hiding

Recent years have seen tremendous advances in the ability to perform association rule mining effectively. Such rules often encode important target-marketing information about a business.

(Diagram: the user runs data mining on a changed database; sensitive rules are hidden from the resulting association rules.)

## Slide 39: Association Rule Hiding

There are various algorithms for hiding a group of association rules characterized as sensitive. A rule is characterized as sensitive if its disclosure risk is above a certain privacy threshold. Sometimes, sensitive rules should not be disclosed to the public since, among other things, they may be used for inferring sensitive data, or they may provide business competitors with an advantage.

Association rule hiding techniques:

- **Distortion-based:** modify entries from 1s to 0s.
- **Blocking-based:** the entry is not modified, but is left incomplete. Thus, unknown entry values are used to prevent discovery of association rules.

## Slide 40: Distortion-Based Techniques

Sample database:

| A | B | C | D |
|---|---|---|---|
| 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 |
| 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 |

Rule A → C has: Support(A → C) = 80%, Confidence(A → C) = 100%.

After applying the distortion algorithm:

| A | B | C | D |
|---|---|---|---|
| 1 | 1 | 1 | 0 |
| 1 | 0 | 0 | 1 |
| 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 |
| 1 | 0 | 0 | 1 |

Rule A → C now has: Support(A → C) = 40%, Confidence(A → C) = 50%.
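The support and confidence numbers above can be verified with a few lines (illustrative helpers over 0/1 transaction rows, not from the slides):

```python
def support(db, items):
    """Fraction of transactions (rows of 0/1 flags) containing all `items`."""
    return sum(all(t[i] for i in items) for t in db) / len(db)

def confidence(db, lhs, rhs):
    """Confidence of the rule lhs -> rhs: support(lhs + rhs) / support(lhs)."""
    return support(db, lhs + rhs) / support(db, lhs)

A, B, C, D = 0, 1, 2, 3
sample    = [[1, 1, 1, 0], [1, 0, 1, 1], [0, 0, 0, 1], [1, 1, 1, 0], [1, 0, 1, 1]]
distorted = [[1, 1, 1, 0], [1, 0, 0, 1], [0, 0, 0, 1], [1, 1, 1, 0], [1, 0, 0, 1]]

print(support(sample, [A, C]), confidence(sample, [A], [C]))        # 0.8 1.0
print(support(distorted, [A, C]), confidence(distorted, [A], [C]))  # 0.4 0.5
```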

## Slide 41: Association Rule Hiding Strategies

| TID | Items |
|---|---|
| T1 | ABC |
| T2 | ABC |
| T3 | ABC |
| T4 | AB |
| T5 | A |
| T6 | AC |

| TID | Items (bitmap ABC) |
|---|---|
| T1 | 111 |
| T2 | 111 |
| T3 | 111 |
| T4 | 110 |
| T5 | 100 |
| T6 | 101 |

## Slide 42: Association Rule Hiding Strategies

To hide a rule X → Y, we want to lower the value of the ratio

$$ \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)} $$

i.e., the confidence of the rule.

## Slide 43: Association Rule Hiding Strategies

$$ \mathrm{Support}(X \rightarrow Y) = \frac{|\{T \in D : X \cup Y \subseteq T\}|}{N} $$

where N is the number of transactions in D. Since N is constant, we can only change the numerator.

## Slide 44: Association Rule Hiding Strategies

Original database:

| TID | Items (bitmap ABC) |
|---|---|
| T1 | 111 |
| T2 | 111 |
| T3 | 111 |
| T4 | 110 |
| T5 | 100 |
| T6 | 101 |

Rule AC → B: Support = 50%, Confidence = 75%.

Setting C to 0 in T1:

| TID | Items (bitmap ABC) |
|---|---|
| T1 | 110 |
| T2 | 111 |
| T3 | 111 |
| T4 | 110 |
| T5 | 100 |
| T6 | 101 |

Rule AC → B now: Support = 33%, Confidence = 66%. With min_supp = 35% and min_conf = 70%, the rule is hidden because its support drops below min_supp.

## Slide 45: Association Rule Hiding Strategies

$$ \mathrm{Confidence}(X \rightarrow Y) = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)} $$

To lower the confidence:

- decrease the support of the rule, making sure we hide items from the right-hand side (Y) of the rule, or
- increase the support of the left-hand side (X).

## Slide 46: Association Rule Hiding Strategies

Original database:

| TID | Items (bitmap ABC) |
|---|---|
| T1 | 111 |
| T2 | 111 |
| T3 | 111 |
| T4 | 110 |
| T5 | 100 |
| T6 | 101 |

Rule AC → B: Support = 50%, Confidence = 75%.

Setting B to 0 in T1 (hiding an item from the right-hand side):

| TID | Items (bitmap ABC) |
|---|---|
| T1 | 101 |
| T2 | 111 |
| T3 | 111 |
| T4 | 110 |
| T5 | 100 |
| T6 | 101 |

Rule AC → B now: Support = 33% (2 of 6), Confidence = 50% (2 of the 4 AC-transactions). With min_supp = 33% and min_conf = 70%, the rule is hidden because its confidence drops below min_conf.

## Slide 47: Association Rule Hiding Strategies

Original database:

| TID | Items (bitmap ABC) |
|---|---|
| T1 | 111 |
| T2 | 111 |
| T3 | 111 |
| T4 | 110 |
| T5 | 100 |
| T6 | 101 |

Rule AC → B: Support = 50%, Confidence = 75%.

Setting C to 1 in T5 (increasing the support of the left-hand side AC):

| TID | Items (bitmap ABC) |
|---|---|
| T1 | 111 |
| T2 | 111 |
| T3 | 111 |
| T4 | 110 |
| T5 | 101 |
| T6 | 101 |

Rule AC → B now: Support = 50%, Confidence = 60%. With min_supp = 33% and min_conf = 70%, the rule is hidden because its confidence drops below min_conf.

## Slide 48: Quality of Data

Sometimes it is dangerous to delete items from the database (e.g., in medical databases), because the false data may create undesirable effects. So we have to hide the rules by adding uncertainty to the database, without distorting it.

## Slide 49: Blocking-Based Techniques

Initial database:

| A | B | C | D |
|---|---|---|---|
| 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 |
| 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 |

After the blocking algorithm, selected entries are replaced with "?" (unknown):

| A | B | C | D |
|---|---|---|---|
| 1 | 1 | 1 | 0 |
| 1 | 0 | ? | 1 |
| ? | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 |


## Slide 51: Motivation

**Setting:** data is distributed at different sites. These sites may be third parties (e.g., hospitals, government bodies) or individuals.

**Aim:** compute the data mining algorithm on the data so that nothing but the output is learned; that is, carry out a secure computation.

## Slide 52: Vertical Partitioning of Data

Medical records (site 1):

| TID | Brain Tumor? | Diabetes? |
|---|---|---|
| RPJ | Yes | Diabetic |
| CAC | No Tumor | No |
| PTR | No Tumor | Diabetic |

Cell phone data (site 2):

| TID | Model | Battery |
|---|---|---|
| RPJ | 5210 | Li/Ion |
| CAC | none | none |
| PTR | 3650 | NiCd |

Global database view:

| TID | Brain Tumor? | Diabetes? | Model | Battery |
|---|---|---|---|---|
| RPJ | Yes | Diabetic | 5210 | Li/Ion |
| CAC | No Tumor | No | none | none |
| PTR | No Tumor | Diabetic | 3650 | NiCd |

## Slide 53: Horizontal Partitioning

Two banks hold very similar information about their credit card accounts:

- Is the account active?
- Is the account delinquent?
- Is the account new?
- Account balance

There is no public sharing of this data.

## Slide 54: Privacy-Preserving Distributed Data Mining

Based on secure multiparty computation and cryptography.
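A classic warm-up example of secure multiparty computation is the secure-sum protocol (a standard textbook sketch, not from the slides): the first site masks its private value with a random number, each site adds its own value to the running total as it passes around the ring, and the first site removes the mask at the end, so the sum is learned but no individual value is revealed.

```python
import random

def secure_sum(values, modulus=1 << 32):
    """Toy secure-sum ring protocol.  Intermediate running totals look
    uniformly random, so no site learns another site's private value."""
    mask = random.randrange(modulus)
    running = (mask + values[0]) % modulus   # site 1 starts the ring
    for v in values[1:]:                     # each remaining site adds its value
        running = (running + v) % modulus
    return (running - mask) % modulus        # site 1 removes the mask

print(secure_sum([10, 20, 30]))  # 60
```

This protocol assumes the sites follow it honestly and do not collude; collusion between the neighbors of a site reveals that site's value.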

## Slide 55: Questions

## Slide 56: Thank You
