
Presentation Transcript

Slide1

De-anonymizing Data
CompSci 590.03, Fall 2012, Lecture 2
Instructor: Ashwin Machanavajjhala

Source: http://xkcd.org/834/

Slide2

Announcements
Project ideas will be posted on the site by Friday.
You are welcome to send me (or talk to me about) your own ideas.

Slide3

Outline
Recap & Intro to Anonymization
Algorithmically De-anonymizing Netflix Data
Algorithmically De-anonymizing Social Networks
Passive Attacks
Active Attacks

Slide4

Outline
Recap & Intro to Anonymization
Algorithmically De-anonymizing Netflix Data
Algorithmically De-anonymizing Social Networks
Passive Attacks
Active Attacks

Slide5

Personal Big-Data
[Figure: Individuals 1 through N contribute records r1, ..., rN to databases such as Google DB, Census DB, and Hospital DB; the data are used by doctors, medical researchers, economists, information retrieval researchers, and recommendation algorithms.]

Slide6

The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]
[Figure: Medical Data (Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge) linked with the Voter List (Name, Address, Date Registered, Party affiliation, Date last voted) through the shared attributes Zip, Birth date, and Sex.]
The Governor of MA was uniquely identified using Zip Code, Birth Date, and Sex.
Name linked to Diagnosis.

Slide7

The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]
[Figure: the same linkage diagram as the previous slide.]
The Governor of MA was uniquely identified using Zip Code, Birth Date, and Sex.
{Zip Code, Birth Date, Sex} is a quasi-identifier: it is unique for 87% of the US population.
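To make the linkage concrete, here is a minimal sketch in Python (pandas; the file and column names are hypothetical) of joining a de-identified medical table with a public voter list on the quasi-identifier {Zip, Birth Date, Sex}:

    import pandas as pd

    # Hypothetical inputs: a "de-identified" medical table (no names) and a public voter list.
    medical = pd.read_csv("medical_deidentified.csv")   # zip, birth_date, sex, diagnosis, ...
    voters = pd.read_csv("voter_list.csv")              # name, address, zip, birth_date, sex, ...

    QUASI_ID = ["zip", "birth_date", "sex"]

    # Keep only quasi-identifier combinations that are unique in BOTH tables;
    # each such medical record links to exactly one named voter.
    unique_medical = medical[~medical.duplicated(QUASI_ID, keep=False)]
    unique_voters = voters[~voters.duplicated(QUASI_ID, keep=False)]

    reidentified = unique_medical.merge(unique_voters, on=QUASI_ID, how="inner")
    print(reidentified[["name", "zip", "birth_date", "sex", "diagnosis"]])

Any record whose quasi-identifier combination is unique on both sides is matched to a single named voter, which is how Name gets linked to Diagnosis.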

Slide8

Statistical Privacy (Trusted Collector) Problem
[Figure: Individuals 1 through N send their records r1, ..., rN to a trusted server, which stores them in a database DB.]
Utility:
Privacy: No breach about any individual

Slide9

Statistical Privacy (Untrusted Collector) Problem
[Figure: Individuals 1 through N, their records r1, ..., rN, a randomizing function f( ), and the untrusted server's database DB.]

Slide10

Randomized Response
Flip a coin: heads with probability p, tails with probability 1-p (p > ½).
Answer the question according to the following table:

             True Answer = Yes   True Answer = No
    Heads    Yes                 No
    Tails    No                  Yes
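A small sketch of the mechanism and of how an analyst can still estimate the population-level answer (illustrative code; the parameters and sample size are made up). If t is the true fraction of "Yes" and q the expected fraction of reported "Yes", then q = p·t + (1-p)·(1-t), so t can be estimated as (q - (1-p)) / (2p - 1):

    import random

    def randomized_response(true_answer: bool, p: float = 0.75) -> bool:
        """Report the true answer on heads (probability p), the opposite answer on tails."""
        heads = random.random() < p
        return true_answer if heads else not true_answer

    def estimate_true_fraction(responses, p: float = 0.75) -> float:
        """Invert q = p*t + (1-p)*(1-t) to recover an estimate of the true fraction t."""
        q = sum(responses) / len(responses)
        return (q - (1 - p)) / (2 * p - 1)

    # Illustrative simulation: 30% of 10,000 respondents truly answer "Yes".
    truth = [random.random() < 0.3 for _ in range(10_000)]
    reports = [randomized_response(t) for t in truth]
    print(estimate_true_fraction(reports))   # close to 0.3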

Slide11

Statistical Privacy (Trusted Collector) Problem
[Figure: Individuals 1 through N send their records r1, ..., rN to a trusted server's database DB.]

Slide12

Query Answering
[Figure: Individuals 1 through N send their records r1, ..., rN to a hospital database DB; analysts ask questions such as "How many allergy patients?" and "Correlate genome to disease."]

Slide13

Query Answering
Need to know the list of questions up front.
Each answer will leak some information about individuals. After answering a few questions, the server will run out of privacy budget and not be able to answer any more questions. We will see this in detail later in the course.

Slide14

Anonymous/Sanitized Data Publishing
[Figure: Individuals 1 through N send their records r1, ..., rN to a hospital database DB.]
"I won't tell you what questions I am interested in!"
(Image: writingcenterunderground.wordpress.com)

Slide15

Anonymous/Sanitized Data Publishing
[Figure: the hospital publishes a sanitized database DB' derived from DB.]
Answer any # of questions directly on DB', without any modifications.

Slide16

Today's Class
Identifying individual records and their sensitive values from data publishing (with insufficient sanitization).

Slide17

Outline
Recap & Intro to Anonymization
Algorithmically De-anonymizing Netflix Data
Algorithmically De-anonymizing Social Networks
Passive Attacks
Active Attacks

Slide18

Terms
Coin tosses of an algorithm
Union bound
Heavy-tailed distribution

Slide19

Terms (contd.)
Heavy-Tailed Distribution
[Figure: Normal distribution. Not heavy tailed.]

Slide20

Terms (contd.)
Heavy-Tailed Distribution
[Figure: Laplace distribution. Heavy tailed.]

Slide21

Terms (contd.)
Heavy-Tailed Distribution
[Figure: Zipf distribution. Heavy tailed.]

Slide22

Terms (contd.)
Cosine Similarity
[Figure: two vectors separated by angle θ.]
Collaborative filtering: the problem of recommending new items to a user based on their ratings on previously seen items.
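For reference, the cosine similarity of two vectors x and y is cos θ = (x · y) / (||x|| ||y||). A minimal sketch over sparse rating vectors (hypothetical code and toy ratings):

    import math

    def cosine_similarity(x: dict, y: dict) -> float:
        """Cosine of the angle between two sparse rating vectors (movie_id -> rating)."""
        dot = sum(x[m] * y[m] for m in x.keys() & y.keys())
        norm_x = math.sqrt(sum(v * v for v in x.values()))
        norm_y = math.sqrt(sum(v * v for v in y.values()))
        return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

    # Toy ratings for two users over a handful of movies.
    user_a = {"movie_1": 5, "movie_2": 3, "movie_4": 1}
    user_b = {"movie_1": 4, "movie_2": 3, "movie_3": 5}
    print(cosine_similarity(user_a, user_b))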

Slide23

Netflix Dataset
[Figure: a sparse Users × Movies matrix; each non-null cell holds a rating plus a timestamp. A row is a record (r); a column is an attribute.]

Slide24

Definitions
Support: set (or number) of non-null attributes in a record or column.
Similarity
Sparsity
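A toy sketch of these notions (hypothetical code; the per-attribute similarity is a simplification, not the paper's exact definition). A record is a sparse map from movie to (rating, timestamp), its support is the set of non-null attributes, and record similarity averages attribute similarity over the union of the supports:

    # Toy records: movie_id -> (rating, timestamp); null attributes are simply absent.
    r1 = {"m1": (5, 100), "m2": (3, 120), "m7": (1, 300)}
    r2 = {"m1": (4, 110), "m7": (1, 290), "m9": (2, 500)}

    def support(record: dict) -> set:
        """Set of non-null attributes of a record."""
        return set(record)

    def attr_sim(a, b, max_rating_gap=4, max_time_gap=14) -> float:
        """Illustrative per-attribute similarity in [0, 1] from rating and timestamp closeness."""
        if a is None or b is None:
            return 0.0
        rating_part = 1 - abs(a[0] - b[0]) / max_rating_gap
        time_part = 1.0 if abs(a[1] - b[1]) <= max_time_gap else 0.0
        return (rating_part + time_part) / 2

    def record_sim(r: dict, s: dict) -> float:
        """Average attribute similarity over the union of the two supports."""
        union = support(r) | support(s)
        return sum(attr_sim(r.get(m), s.get(m)) for m in union) / len(union)

    print(support(r1), record_sim(r1, r2))

Informally, the dataset is sparse if for almost every record no other record is even approximately similar to it; the de-anonymization results below rely on this.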

Slide25

Adversary Model
Aux(r): some subset of attributes from r, known to the adversary.

Slide26

Privacy Breach
Definition 1: An algorithm A outputs an r' such that ...
Definition 2: (When only a sample of the dataset is input) ...

Slide27

Algorithm: Scoreboard
For each record r', compute Score(r', Aux) as the minimum similarity of an attribute in Aux to the same attribute in r'.
Pick the r' with the maximum score, OR return all records with Score > α.
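A minimal sketch of Scoreboard on toy data (hypothetical code; the attribute similarity, the records, and the threshold are illustrative):

    # Toy records: movie_id -> rating (timestamps omitted for brevity).
    dataset = {
        "record_1": {"m1": 5, "m2": 3, "m7": 1},
        "record_2": {"m1": 4, "m7": 1, "m9": 2},
    }

    def attr_sim(a, b) -> float:
        """Illustrative per-attribute similarity in [0, 1]; 0 if either value is missing."""
        if a is None or b is None:
            return 0.0
        return 1 - abs(a - b) / 4          # ratings range over 1..5

    def score(aux: dict, candidate: dict) -> float:
        """Scoreboard score: minimum similarity over the adversary's known attributes."""
        return min(attr_sim(v, candidate.get(m)) for m, v in aux.items())

    def scoreboard(aux: dict, dataset: dict, alpha: float = None):
        """Return the best-scoring record id, or all ids with score > alpha if alpha is given."""
        scores = {rid: score(aux, rec) for rid, rec in dataset.items()}
        if alpha is not None:
            return [rid for rid, s in scores.items() if s > alpha]
        return max(scores, key=scores.get)

    aux = {"m1": 5, "m7": 1}                      # adversary's partial knowledge of record_1
    print(scoreboard(aux, dataset))               # -> "record_1"
    print(scoreboard(aux, dataset, alpha=0.9))    # -> ["record_1"]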

Slide28

Analysis
Theorem 1: Suppose we use Scoreboard with α = 1 - ε, and suppose Aux contains m randomly chosen attributes with m > log(N/ε) / log(1/(1 - δ)). Then Scoreboard returns a record r' such that
    Pr[ Sim(Aux, r') > 1 - ε - δ ] > 1 - ε.

Slide29

Proof of Theorem 1
Call r' a false match if Sim(Aux, r') < 1 - ε - δ.
For any false match and any single attribute i, Pr[ Sim(Aux_i, r_i') > 1 - ε ] < 1 - δ.
Since Sim(Aux, r') = min_i Sim(Aux_i, r_i'), over the m attributes of Aux:
    Pr[ Sim(Aux, r') > 1 - ε ] < (1 - δ)^m.
By the union bound over all N records:
    Pr[ some false match has similarity > 1 - ε ] < N(1 - δ)^m,
and N(1 - δ)^m < ε when m > log(N/ε) / log(1/(1 - δ)).
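For intuition about the bound m > log(N/ε) / log(1/(1 - δ)), a tiny calculation with illustrative numbers (not from the slides):

    import math

    def required_aux_attributes(N: int, eps: float, delta: float) -> int:
        """Smallest m with N * (1 - delta)^m < eps, i.e. m > log(N/eps) / log(1/(1 - delta))."""
        return math.ceil(math.log(N / eps) / math.log(1 / (1 - delta)))

    # Illustrative parameters: 500,000 records, failure probability 5%, delta = 0.4.
    print(required_aux_attributes(N=500_000, eps=0.05, delta=0.40))   # roughly a few dozen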

Slide30

Other Results
If dataset D is (1 - ε - δ, ε)-sparse, then D can be (1, 1 - ε)-deanonymized.
Analogous results hold when a list of candidate records is returned.

Slide31

Netflix Dataset
Slightly different algorithm.

Slide32

Summary of the Netflix Paper
An adversary can use a subset of ratings made by a user to uniquely identify that user's record in the "anonymized" dataset with high probability.
The simple Scoreboard algorithm provably guarantees identification of records.
A variant of Scoreboard can de-anonymize the Netflix dataset.
The algorithms are robust to noise in the adversary's background knowledge.

Slide33

Outline
Recap & Intro to Anonymization
Algorithmically De-anonymizing Netflix Data
Algorithmically De-anonymizing Social Networks
Passive Attacks
Active Attacks

Slide34

Social Network Data
Social networks: graphs where each node represents a social entity and each edge represents a certain relationship between two entities.
Examples: email communication graphs, social interactions as in Facebook, Yahoo! Messenger, etc.

Slide35

Anonymizing Social Networks
Naïve anonymization removes the label of each node and publishes only the structure of the network.
Information leaks: nodes may still be re-identified based on network structure.
[Figure: a seven-node email graph with nodes Alice, Bob, Cathy, Diane, Ed, Fred, and Grace, shown with labels removed.]

Slide36

Passive Attacks on an Anonymized Network
Consider the above email communication graph.
Each node represents an individual.
Each edge between two individuals indicates that they have exchanged emails.

Slide37

Passive Attacks on an Anonymized Network
Alice has sent emails to three individuals only.

Slide38

Passive Attacks on an Anonymized Network
Alice has sent emails to three individuals only.
Only one node in the anonymized network has degree three.
Hence, Alice can re-identify herself.

Slide39

Passive Attacks on an Anonymized Network
Cathy has sent emails to five individuals.

Slide40

Passive Attacks on an Anonymized Network
Cathy has sent emails to five individuals.
Only one node has degree five.
Hence, Cathy can re-identify herself.

Slide41

Passive Attacks on an Anonymized Network
Now consider that Alice and Cathy share their knowledge about the anonymized network.
What can they learn about the other individuals?

Slide42

Passive Attacks on an Anonymized Network
First, Alice and Cathy know that only Bob has sent emails to both of them.

Slide43

Passive Attacks on an Anonymized Network
First, Alice and Cathy know that only Bob has sent emails to both of them.
Bob can be identified.

Slide44

Passive Attacks on an Anonymized Network
Alice has sent emails to Bob, Cathy, and Ed only.

Slide45

Passive Attacks on an Anonymized Network
Alice has sent emails to Bob, Cathy, and Ed only.
Ed can be identified.

Slide46

Passive Attacks on an Anonymized Network
Alice and Cathy can learn that Bob and Ed are connected.

Slide47

Passive Attacks on an Anonymized Network
The above attack is based on knowledge of node degrees. [Liu and Terzi, SIGMOD 2008]
More sophisticated attacks can be launched given additional knowledge about the network structure, e.g., a subgraph of the network. [Zhou and Pei, ICDE 2008; Hay et al., VLDB 2008]
Protecting privacy becomes even more challenging when the nodes in the anonymized network are labeled. [Pang et al., SIGCOMM CCR 2006]
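A minimal sketch of the degree-based re-identification from the preceding slides (hypothetical code using networkx; the graph is illustrative): a participant who knows her own degree can locate her node in the naively anonymized graph whenever that degree is unique.

    import networkx as nx

    # Illustrative anonymized email graph: labels removed, only the structure is published.
    G = nx.Graph()
    G.add_edges_from([(0, 1), (0, 2), (0, 4), (1, 2), (2, 3), (2, 5), (2, 6), (3, 6)])

    def reidentify_by_degree(graph: nx.Graph, known_degree: int):
        """Return the unique node with the given degree, or None if that degree is not unique."""
        matches = [n for n, d in graph.degree() if d == known_degree]
        return matches[0] if len(matches) == 1 else None

    # "Alice" knows she emailed exactly 3 people; "Cathy" knows she emailed exactly 5.
    print(reidentify_by_degree(G, 3))   # Alice finds her own anonymized node id
    print(reidentify_by_degree(G, 5))   # Cathy finds her own anonymized node id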

Slide48

Inferring Sensitive Values on a Network
Each individual has a single sensitive attribute.
Some individuals share the sensitive attribute publicly, while others keep it private.
GOAL: Infer the private sensitive attributes using
links in the social network, and
groups that the individuals belong to.
Approach: Learn a predictive model (think classifier) using public profiles as training data. [Zheleva and Getoor, WWW 2009]

Slide49

Inferring Sensitive Values on a Network
Baseline: the most commonly appearing sensitive value amongst all public profiles.

Slide50

Inferring Sensitive Values on a Network
LINK: Each node x has a list of binary features Lx, one for every node in the social network.
Feature value Lx[y] = 1 if and only if (x, y) is an edge.
Train a model on all pairs (Lx, sensitive value(x)) for the x's with public sensitive values.
Use the learnt model to predict the private sensitive values.

Slide51

Inferring Sensitive Values on a Network
GROUP: Each node x has a list of binary features Gx, one for every group in the social network.
Feature value Gx[y] = 1 if and only if x belongs to group y.
Train a model on all pairs (Gx, sensitive value(x)) where x's sensitive value is public.
Use the model to predict the private sensitive values.
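A minimal sketch of the LINK predictor (hypothetical code using networkx and scikit-learn; the tiny graph, the labels, and the choice of logistic regression are illustrative, not necessarily what the paper used). GROUP is analogous, with one binary feature per group membership instead of one per potential neighbor.

    import networkx as nx
    from sklearn.linear_model import LogisticRegression

    # Illustrative social network and sensitive labels (None = kept private).
    G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5), (3, 5)])
    sensitive = {0: "A", 1: "A", 2: "A", 3: None, 4: "B", 5: "B"}
    nodes = sorted(G.nodes())

    def link_features(x):
        """LINK feature vector for node x: one binary entry per node y, 1 iff (x, y) is an edge."""
        return [1 if G.has_edge(x, y) else 0 for y in nodes]

    public = [x for x in nodes if sensitive[x] is not None]
    private = [x for x in nodes if sensitive[x] is None]

    # Train on the nodes whose sensitive value is public, then predict the private ones.
    model = LogisticRegression().fit(
        [link_features(x) for x in public],
        [sensitive[x] for x in public],
    )
    for x in private:
        print(x, model.predict([link_features(x)])[0])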

Slide52

Inferring Sensitive Values on a Network

                Flickr        Facebook    Facebook            Dogster
                (Location)    (Gender)    (Political View)    (Dog Breed)
    Baseline    27.7%         50%         56.5%               28.6%
    LINK        56.5%         68.6%       58.1%               60.2%
    GROUP       83.6%         77.2%       46.6%               82.0%

[Zheleva and Getoor, WWW 2009]

Slide53

Active Attacks on Social Networks [Backstrom et al., WWW 2007]
The attacker may create a few nodes in the graph: create a few 'fake' Facebook user accounts.
The attacker may add edges from the new nodes: create friendships using the 'fake' accounts.
Goal: discover an edge between two legitimate users.

Slide54

High Level View of Attack
Step 1: Create a graph structure among the 'fake' nodes such that it can be identified in the anonymized data.

Slide55

High Level View of Attack
Step 2: Add edges from the 'fake' nodes to real nodes.

Slide56

High Level View of Attack
Step 3: In the anonymized data, identify the fake subgraph by its special graph structure.

Slide57

High Level View of Attack
Step 4: Deduce edges by following links from the identified fake nodes.

Slide58

Details of the Attack
Choose k real users W = {w1, ..., wk}.
Create k fake users X = {x1, ..., xk}.
Create edges (xi, wi).
Create edges (xi, xi+1).
Create every other edge within X independently with probability 0.5.
[Figure: the fake subgraph attached to the large graph.]
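A minimal sketch of the fake-subgraph construction on this slide (hypothetical code using networkx; the host graph, the chosen targets, and k are illustrative):

    import random
    import networkx as nx

    def build_attack_subgraph(G: nx.Graph, targets, seed=None):
        """Add k fake nodes x1..xk: an edge (xi, wi) to each targeted real user, a path
        x1-x2-...-xk, and every remaining pair (xi, xj) independently with probability 0.5."""
        rng = random.Random(seed)
        k = len(targets)
        fakes = [f"fake_{i}" for i in range(k)]
        for x, w in zip(fakes, targets):
            G.add_edge(x, w)                     # edge (xi, wi)
        for i in range(k - 1):
            G.add_edge(fakes[i], fakes[i + 1])   # path edges (xi, xi+1)
        for i in range(k):
            for j in range(i + 2, k):            # remaining internal pairs, chosen at random
                if rng.random() < 0.5:
                    G.add_edge(fakes[i], fakes[j])
        return fakes

    # Illustrative use on a random host graph with 7 targeted users.
    G = nx.gnm_random_graph(1000, 5000, seed=7)
    fakes = build_attack_subgraph(G, targets=random.sample(list(G.nodes()), 7), seed=42)
    print(G.subgraph(fakes))

The random internal edges are what make the planted subgraph G[X] unique in G and free of non-trivial automorphisms with high probability, which the next slides rely on.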

Slide59

Why does it work?
Given a graph G and a set of nodes S, G[S] is the subgraph induced by the nodes in S.
There is an isomorphism between two sets of nodes S and S' if there is a bijection f mapping each node in S to a node in S' such that (u, v) is an edge in G[S] if and only if (f(u), f(v)) is an edge in G[S'].
An isomorphism from S to S itself is called an automorphism (think: permuting the nodes).
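A small illustration of these notions (hypothetical code; networkx's VF2 matcher is used only for illustration, not as the paper's recovery algorithm): checking that a small planted subgraph has only the trivial automorphism, and locating it in a larger graph by induced-subgraph isomorphism.

    import networkx as nx
    from networkx.algorithms import isomorphism

    # Illustrative "fake" subgraph H: a path 0-1-2-3-4-5 plus two extra edges.
    H = nx.Graph([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (1, 3), (1, 4)])

    # H is automorphism-free if the only isomorphism from H to itself is the identity.
    print(len(list(isomorphism.GraphMatcher(H, H).isomorphisms_iter())))       # -> 1

    # Locate a planted copy of H inside a larger graph G (induced-subgraph isomorphism).
    G = nx.disjoint_union(nx.gnm_random_graph(200, 400, seed=1), H)
    print(isomorphism.GraphMatcher(G, H).subgraph_is_isomorphic())             # -> True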

Slide60

Why does it work?
With high probability, there is no other set of nodes S ≠ X such that G[S] is isomorphic to G[X] (call it H).
H can be found efficiently in G.
H has no non-trivial automorphisms.
[Figure: the fake subgraph attached to the large graph of size N.]

Slide61

Recovery
Subgraph isomorphism is NP-hard, i.e., finding X could be hard in general.
But since X contains a path and random edges, a simple brute-force search with pruning works.
Run time: O(N · 2^O(log log N)).

Slide62

Works in Real Life!
LiveJournal: 4.4 million nodes, 77 million edges.
Success is all but guaranteed by adding 10 nodes.
Recovery typically takes a second.
[Figure: probability of a successful attack. Backstrom et al., WWW 2007]

Slide63

Summary of Social Networks
Nodes in a graph can be re-identified using background knowledge of the structure of the graph.
Link and group structure provide valuable information for accurately inferring private sensitive values.
Active attacks that add nodes and edges are shown to be very successful.
Guarding against these attacks is an open area for research!

Slide64

Next Class
k-Anonymity + algorithms: how to limit de-anonymization?

Slide65

References
L. Sweeney, "k-Anonymity: A Model for Protecting Privacy", IJUFKS 2002.
A. Narayanan & V. Shmatikov, "Robust De-anonymization of Large Sparse Datasets", IEEE S&P 2008.
L. Backstrom, C. Dwork & J. Kleinberg, "Wherefore Art Thou r3579x?: Anonymized Social Networks, Hidden Patterns, and Structural Steganography", WWW 2007.
E. Zheleva & L. Getoor, "To Join or Not to Join: The Illusion of Privacy in Social Networks with Mixed Public and Private User Profiles", WWW 2009.