Slide 1
De-anonymizing Data
CompSci 590.03
Instructor: Ashwin Machanavajjhala
Lecture 2 : 590.03 Fall 12
Source: http://xkcd.org/834/
Slide 2
Announcements
- Project ideas will be posted on the site by Friday.
- You are welcome to send me (or talk to me about) your own ideas.
Slide 3
Outline
- Recap & Intro to Anonymization
- Algorithmically De-anonymizing Netflix Data
- Algorithmically De-anonymizing Social Networks
  - Passive Attacks
  - Active Attacks
Slide 4
Outline
- Recap & Intro to Anonymization
- Algorithmically De-anonymizing Netflix Data
- Algorithmically De-anonymizing Social Networks
  - Passive Attacks
  - Active Attacks
Slide 5
Personal Big-Data
[Diagram: Individuals (Person 1 through Person N) contribute records (r1 through rN) to databases held by Google, the Census, and hospitals. These databases are used by doctors, medical researchers, economists, information retrieval researchers, and recommendation algorithms.]
Slide 6
The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]
Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
Voter List: Name, Address, Date Registered, Party Affiliation, Date Last Voted
Attributes appearing in both: Zip, Birth Date, Sex
The Governor of MA was uniquely identified using ZipCode, Birth Date, and Sex, linking his name to his diagnosis.
Slide 7
The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]
The same linkage of Medical Data and Voter List on Zip, Birth Date, and Sex uniquely identified the Governor of MA.
{ZipCode, Birth Date, Sex} is called a quasi-identifier: it uniquely identifies 87% of the US population.
Slide 8
Statistical Privacy (Trusted Collector) Problem
[Diagram: Individuals 1 through N send records r1 through rN to a trusted server holding a database DB.]
Utility: answer analysts' questions about the data.
Privacy: no breach about any individual.
Slide 9
Statistical Privacy (Untrusted Collector) Problem
[Diagram: Each individual i perturbs their record locally and sends f(ri) to the untrusted server, which stores the perturbed database DB.]
Slide 10
Randomized Response
Flip a coin: heads with probability p, tails with probability 1-p (p > 1/2).
Answer the question according to the following table:

          True Answer = Yes   True Answer = No
  Heads   Yes                 No
  Tails   No                  Yes
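The table above can be sketched in code. This is a minimal simulation of randomized response: each respondent reports truthfully on heads and lies on tails, and the collector inverts the expected report rate E[yes] = p*pi + (1-p)*(1-pi) to recover the true fraction pi. The function names and the 30% example population are illustrative, not from the slides.

```python
import random

def randomized_response(truth: bool, p: float) -> bool:
    """Report the true answer on heads (probability p), the flipped answer on tails."""
    heads = random.random() < p
    return truth if heads else not truth

def estimate_true_fraction(reports, p: float) -> float:
    """Invert E[yes] = p*pi + (1-p)*(1-pi) to estimate the true yes-fraction pi."""
    yes_rate = sum(reports) / len(reports)
    return (yes_rate - (1 - p)) / (2 * p - 1)

# Example: 30% of 100,000 respondents truly answer "Yes"; coin bias p = 0.75.
random.seed(0)
truths = [random.random() < 0.30 for _ in range(100_000)]
reports = [randomized_response(t, 0.75) for t in truths]
print(round(estimate_true_fraction(reports, 0.75), 2))
```

Note that no individual report reveals the respondent's true answer with certainty, yet the aggregate estimate converges to the true fraction.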
Slide 11
Statistical Privacy (Trusted Collector) Problem
[Diagram: Individuals 1 through N send records r1 through rN to a trusted server holding DB.]
Slide 12
Query Answering
[Diagram: Individuals 1 through N send records r1 through rN to a hospital's DB. Analysts pose queries such as "Correlate genome to disease" and "How many allergy patients?" and receive answers from the server.]
Slide 13
Query Answering
- Need to know the list of questions up front.
- Each answer leaks some information about individuals. After answering a few questions, the server will exhaust its privacy budget and be unable to answer any more.
- We will see this in detail later in the course.
Slide 14
Anonymous/Sanitized Data Publishing
[Diagram: Individuals 1 through N send records r1 through rN to a hospital's DB. An analyst declares: "I won't tell you what questions I am interested in!" (image: writingcenterunderground.wordpress.com)]
Slide 15
Anonymous/Sanitized Data Publishing
[Diagram: The hospital publishes a sanitized version DB' of DB. Analysts can answer any number of questions directly on DB' without any modifications.]
Slide 16
Today's Class
Identifying individual records and their sensitive values from data publishing (with insufficient sanitization).
Slide 17
Outline
- Recap & Intro to Anonymization
- Algorithmically De-anonymizing Netflix Data
- Algorithmically De-anonymizing Social Networks
  - Passive Attacks
  - Active Attacks
Slide 18
Terms
- Coin tosses of an algorithm
- Union bound: Pr[A1 or A2 or ... or An] <= Pr[A1] + Pr[A2] + ... + Pr[An]
- Heavy-tailed distribution
Slide 19
Terms (contd.): Heavy-Tailed Distribution
[Plot: Normal distribution. Not heavy tailed.]
Slide 20
Terms (contd.): Heavy-Tailed Distribution
[Plot: Laplace distribution. Heavy tailed.]
Slide 21
Terms (contd.): Heavy-Tailed Distribution
[Plot: Zipf distribution. Heavy tailed.]
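The contrast in the plots above can be checked numerically. This sketch compares the tail probability P(X > t) of a standard normal against a Laplace distribution scaled to the same variance; for moderately large t the Laplace tail dominates, which is what "heavy tailed" means here. The closed-form tail expressions are standard facts, not from the slides.

```python
import math

def normal_tail(t: float) -> float:
    """P(X > t) for a standard normal (mean 0, variance 1)."""
    return 0.5 * math.erfc(t / math.sqrt(2))

def laplace_tail(t: float, b: float = 1 / math.sqrt(2)) -> float:
    """P(X > t) for Laplace(0, b); b chosen so the variance is also 1."""
    return 0.5 * math.exp(-t / b)

# The Laplace tail decays like exp(-t), the normal tail like exp(-t^2/2).
for t in (2, 4, 6):
    print(f"t={t}: normal {normal_tail(t):.2e}, laplace {laplace_tail(t):.2e}")
```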
Slide 22
Terms (contd.)
- Cosine similarity: similarity of two vectors measured by the cosine of the angle θ between them.
- Collaborative filtering: the problem of recommending new items to a user based on their ratings on previously seen items.
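Cosine similarity over rating vectors can be sketched in a few lines; the example vectors below are made-up five-movie rating profiles (0 = not rated), not data from the slides.

```python
import math

def cosine_similarity(u, v):
    """cos(theta) between two rating vectors; unrated items contribute 0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two users with overlapping tastes score near 1; disjoint ratings score 0.
print(cosine_similarity([5, 3, 0, 1, 0], [4, 2, 0, 1, 0]))
print(cosine_similarity([5, 0, 0, 0, 0], [0, 0, 4, 0, 0]))  # 0.0: no overlap
```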
Slide 23
Netflix Dataset
[Diagram: A sparse users × movies matrix. Each cell holds a rating plus a timestamp; each row is a user's record (r), and each movie is a column/attribute.]
Slide 24
Definitions
- Support: the set (or number) of non-null attributes in a record or column.
- Similarity
- Sparsity
Slide 25
Adversary Model
Aux(r): some subset of attributes from r, representing the adversary's background knowledge about the record r.
Slide 26
Privacy Breach
Definition 1: An algorithm A outputs an r' such that r' is similar to the target record.
Definition 2: (when only a sample of the dataset is input)
Slide 27
Algorithm: Scoreboard
- For each record r', compute Score(r', aux) as the minimum similarity of an attribute in aux to the same attribute in r'.
- Pick the r' with the maximum score, OR return all records with Score > α.
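The two steps above can be sketched as follows. This is a minimal version of Scoreboard with an assumed exact-match attribute similarity (1.0 if equal, else 0.0); the record layout and the tiny three-user dataset are illustrative, not the paper's.

```python
def score(record, aux):
    """Score(r', aux) = min over aux attributes of similarity to r'."""
    return min(1.0 if record.get(attr) == value else 0.0
               for attr, value in aux.items())

def scoreboard(dataset, aux, alpha=None):
    """Return the best-scoring record, or all records with Score > alpha."""
    scored = [(score(r, aux), r) for r in dataset]
    if alpha is not None:
        return [r for s, r in scored if s > alpha]
    return max(scored, key=lambda sr: sr[0])[1]

dataset = [
    {"id": "u1", "m1": 5, "m2": 3, "m3": 1},
    {"id": "u2", "m1": 5, "m2": 2, "m3": 4},
    {"id": "u3", "m1": 1, "m2": 3, "m3": 4},
]
aux = {"m1": 5, "m2": 2}  # the adversary knows two of the target's ratings
print(scoreboard(dataset, aux)["id"])  # u2 matches both aux attributes
```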
Slide 28
Analysis
Theorem 1: Suppose we use Scoreboard with α = 1 - ε. If Aux contains m randomly chosen attributes such that m > log(N/ε) / log(1/(1-δ)), then Scoreboard returns a record r' such that
Pr[Sim(Aux, r') > 1 - ε - δ] > 1 - ε.
Slide 29
Proof of Theorem 1
- Call r' a false match if Sim(Aux, r') < 1 - ε - δ.
- For any false match, Pr[Sim(Aux_i, r_i') > 1 - ε] < 1 - δ.
- Since Sim(Aux, r') = min_i Sim(Aux_i, r_i'), we have Pr[Sim(Aux, r') > 1 - ε] < (1 - δ)^m.
- By the union bound, Pr[some false match has similarity > 1 - ε] < N(1-δ)^m, and N(1-δ)^m < ε when m > log(N/ε) / log(1/(1-δ)).
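The final inequality gives a concrete recipe for how much auxiliary information the adversary needs. The sketch below computes the smallest m satisfying m > log(N/ε) / log(1/(1-δ)); the Netflix-scale numbers plugged in are illustrative, not figures from the paper.

```python
import math

def required_aux_size(n_records: int, eps: float, delta: float) -> int:
    """Smallest integer m with N(1-delta)^m < eps,
    i.e. m > log(N/eps) / log(1/(1-delta))."""
    m = math.log(n_records / eps) / math.log(1 / (1 - delta))
    return math.floor(m) + 1

# Example: N = 500,000 records, eps = delta = 0.05.
m = required_aux_size(500_000, 0.05, 0.05)
print(m)  # a few hundred aux attributes suffice at this scale
```

The takeaway is that m grows only logarithmically in N, which is why a handful of known ratings can single out one record among hundreds of thousands.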
Slide 30
Other Results
- If dataset D is (1-ε-δ, ε)-sparse, then D can be (1, 1-ε)-deanonymized.
- Analogous results hold when a list of candidate records is returned.
Slide 31
Netflix Dataset
A slightly different algorithm is used for the actual Netflix data.
Slide 32
Summary of Netflix Paper
- An adversary can use a subset of ratings made by a user to uniquely identify the user's record in the "anonymized" dataset with high probability.
- The simple Scoreboard algorithm provably guarantees identification of records.
- A variant of Scoreboard can de-anonymize the Netflix dataset.
- The algorithms are robust to noise in the adversary's background knowledge.
Slide 33
Outline
- Recap & Intro to Anonymization
- Algorithmically De-anonymizing Netflix Data
- Algorithmically De-anonymizing Social Networks
  - Passive Attacks
  - Active Attacks
Slide 34
Social Network Data
Social networks: graphs where each node represents a social entity, and each edge represents a certain relationship between two entities.
Examples: email communication graphs, social interactions as in Facebook, Yahoo! Messenger, etc.
Slide 35
Anonymizing Social Networks
- Naïve anonymization removes the label of each node and publishes only the structure of the network.
- Information leaks: nodes may still be re-identified based on network structure.
[Diagram: a seven-person email network with nodes Alice, Bob, Cathy, Diane, Ed, Fred, and Grace.]
Slide 36
Passive Attacks on an Anonymized Network
Consider the above email communication graph: each node represents an individual, and each edge between two individuals indicates that they have exchanged emails.
[Diagram: the anonymized (unlabeled) version of the Alice-Grace email graph.]
Slide 37
Passive Attacks on an Anonymized Network
Alice has sent emails to three individuals only.
[Diagram: the anonymized graph, with the degree-3 node highlighted.]
Slide 38
Passive Attacks on an Anonymized Network
Alice has sent emails to three individuals only, and only one node in the anonymized network has degree three. Hence, Alice can re-identify herself.
[Diagram: the anonymized graph, with Alice's node identified.]
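The degree-based re-identification step above can be sketched directly: any node whose degree is shared by no other node in the anonymized graph is immediately re-identifiable. The edge list below is a made-up stand-in for the slides' email graph, not the actual figure.

```python
from collections import Counter

def unique_degree_nodes(edges):
    """Return {node: degree} for nodes whose degree no other node shares."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    degree_counts = Counter(degree.values())
    return {n: d for n, d in degree.items() if degree_counts[d] == 1}

# Hypothetical anonymized email graph: node 2 plays the "Cathy" role
# (unique degree 5) and is re-identifiable from degree alone.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3), (2, 4), (2, 5), (3, 4)]
print(unique_degree_nodes(edges))
```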
Slide 39
Passive Attacks on an Anonymized Network
Cathy has sent emails to five individuals.
[Diagram: the anonymized graph, with the degree-5 node highlighted.]
Slide 40
Passive Attacks on an Anonymized Network
Cathy has sent emails to five individuals, and only one node has degree five. Hence, Cathy can re-identify herself.
Slide 41
Passive Attacks on an Anonymized Network
Now suppose Alice and Cathy share their knowledge about the anonymized network. What can they learn about the other individuals?
Slide 42
Passive Attacks on an Anonymized Network
First, Alice and Cathy know that only Bob has sent emails to both of them.
Slide 43
Passive Attacks on an Anonymized Network
Since only one node is adjacent to both of their re-identified nodes, Bob can be identified.
Slide 44
Passive Attacks on an Anonymized Network
Alice has sent emails to Bob, Cathy, and Ed only.
Slide 45
Passive Attacks on an Anonymized Network
Alice has sent emails to Bob, Cathy, and Ed only, so the remaining neighbor of Alice's node must be Ed. Ed can be identified.
Slide 46
Passive Attacks on an Anonymized Network
Alice and Cathy can now learn that Bob and Ed are connected.
Slide 47
Passive Attacks on an Anonymized Network
- The above attack is based on knowledge about degrees of nodes [Liu and Terzi, SIGMOD 2008].
- More sophisticated attacks can be launched given additional knowledge about the network structure, e.g., a subgraph of the network [Zhou and Pei, ICDE 2008; Hay et al., VLDB 2008].
- Protecting privacy becomes even more challenging when the nodes in the anonymized network are labeled [Pang et al., SIGCOMM CCR 2006].
Slide 48
Inferring Sensitive Values on a Network
- Each individual has a single sensitive attribute. Some individuals share the sensitive attribute, while others keep it private.
- Goal: infer the private sensitive attributes using (a) links in the social network, and (b) groups that the individuals belong to.
- Approach: learn a predictive model (think: classifier) using public profiles as training data [Zheleva and Getoor, WWW 2009].
Slide 49
Inferring Sensitive Values on a Network
Baseline: predict the most commonly appearing sensitive value amongst all public profiles.
Slide 50
Inferring Sensitive Values on a Network
LINK: Each node x has a list of binary features Lx, one for every node in the social network. Feature value Lx[y] = 1 if and only if (x, y) is an edge. Train a model on all pairs (Lx, sensitive value(x)) for x's with public sensitive values, then use the learnt model to predict private sensitive values.
Slide 51
Inferring Sensitive Values on a Network
GROUP: Each node x has a list of binary features Gx, one for every group in the social network. Feature value Gx[y] = 1 if and only if x belongs to group y. Train a model on all pairs (Gx, sensitive value(x)) where x's sensitive value is public, then use the model to predict private sensitive values.
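The LINK feature construction can be sketched concretely. Here a deliberately simple predictor (1-nearest-neighbor under Hamming distance on the binary feature vectors) stands in for the paper's trained classifier, and the five-node graph with "red"/"blue" sensitive labels is entirely made up for illustration.

```python
def link_features(nodes, edges):
    """L_x[y] = 1 iff (x, y) is an edge; one binary feature per node."""
    adj = {x: set() for x in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return {x: [1 if y in adj[x] else 0 for y in nodes] for x in nodes}

def predict(features, public_labels, target):
    """Label the target like its nearest public profile (Hamming distance)."""
    def hamming(a, b):
        return sum(ai != bi for ai, bi in zip(a, b))
    nearest = min(public_labels, key=lambda x: hamming(features[x], features[target]))
    return public_labels[nearest]

nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "e")]
feats = link_features(nodes, edges)
public = {"a": "red", "b": "red", "d": "blue"}  # publicly shared sensitive values
print(predict(feats, public, "c"))  # c links to a and b, so: red
```

GROUP features are built the same way, with one binary column per group membership instead of per neighbor.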
Slide 52
Inferring Sensitive Values on a Network
Prediction accuracy [Zheleva and Getoor, WWW 2009]:

            Flickr       Facebook   Facebook          Dogster
            (Location)   (Gender)   (Political View)  (Dog Breed)
  Baseline  27.7%        50%        56.5%             28.6%
  LINK      56.5%        68.6%      58.1%             60.2%
  GROUP     83.6%        77.2%      46.6%             82.0%
Slide 53
Active Attacks on Social Networks [Backstrom et al., WWW 2007]
- The attacker may create a few nodes in the graph, e.g., a few 'fake' Facebook user accounts.
- The attacker may add edges from the new nodes, e.g., by creating friendships using the 'fake' accounts.
- Goal: discover an edge between two legitimate users.
Slide 54
High-Level View of the Attack
Step 1: Create a graph structure with the 'fake' nodes such that it can be identified in the anonymized data.
Slide 55
High-Level View of the Attack
Step 2: Add edges from the 'fake' nodes to real nodes.
Slide 56
High-Level View of the Attack
Step 3: In the anonymized data, identify the fake subgraph by its special graph structure.
Slide 57
High-Level View of the Attack
Step 4: Deduce edges between real nodes by following links from the recovered fake nodes.
Slide 58
Details of the Attack
- Choose k real users W = {w1, ..., wk}.
- Create k fake users X = {x1, ..., xk}.
- Create edges (xi, wi).
- Create edges (xi, xi+1), forming a path through X.
- Create every other edge within X independently with probability 0.5.
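The construction above can be sketched as follows. This is an illustrative implementation of the fake-subgraph step only (building X, its path, the anchors to W, and the coin-flipped internal edges); node names and the choice of RNG seed are assumptions, not details from the paper.

```python
import random

def build_attack_subgraph(k, targets, rng=None):
    """Return (fake_nodes, edges) for the attack: k fake nodes chained in a
    path, each anchored to one targeted real user, with every remaining
    fake-fake edge included independently with probability 0.5."""
    rng = rng or random.Random(0)
    fakes = [f"x{i}" for i in range(1, k + 1)]
    edges = []
    for xi, wi in zip(fakes, targets):       # anchor edges (x_i, w_i)
        edges.append((xi, wi))
    for i in range(k - 1):                   # path edges (x_i, x_{i+1})
        edges.append((fakes[i], fakes[i + 1]))
    for i in range(k):                       # remaining pairs, probability 0.5
        for j in range(i + 2, k):
            if rng.random() < 0.5:
                edges.append((fakes[i], fakes[j]))
    return fakes, edges

fakes, edges = build_attack_subgraph(5, ["w1", "w2", "w3", "w4", "w5"])
print(len(fakes), len(edges))
```

The random internal edges are what make G[X] almost surely unique and automorphism-free, which is exactly the property the recovery step on the next slides relies on.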
Slide 59
Why Does It Work?
- Given a graph G and a set of nodes S, G[S] is the subgraph induced by the nodes in S.
- An isomorphism between two sets of nodes S and S' is a function f mapping each node in S to a node in S' such that (u, v) is an edge in G[S] if and only if (f(u), f(v)) is an edge in G[S'].
- An isomorphism from S to S itself is called an automorphism (think: permuting the nodes).
Slide 60
Why Does It Work?
With high probability, in the large graph (size N):
- There is no other S such that G[S] is isomorphic to G[X] (call G[X] H).
- H can be efficiently found in G.
- H has no non-trivial automorphisms.
Slide 61
Recovery
- Subgraph isomorphism is NP-hard in general, i.e., finding X could be hard.
- But since X contains a path with random internal edges, a simple brute-force search with pruning works.
- Run time: O(N * 2^O((log log N)^2)).
[Diagram: the large graph of size N.]
Slide 62
Works in Real Life!
- LiveJournal: 4.4 million nodes, 77 million edges.
- Success all but guaranteed by adding 10 nodes.
- Recovery typically takes a second.
[Plot: probability of successful attack; Backstrom et al., WWW 2007.]
Slide 63
Summary of Social Networks
- Nodes in a graph can be re-identified using background knowledge of the structure of the graph.
- Link and group structure provide valuable information for accurately inferring private sensitive values.
- Active attacks that add nodes and edges are shown to be very successful.
- Guarding against these attacks is an open area for research!
Slide 64
Next Class
K-Anonymity + algorithms: how to limit de-anonymization?
Slide 65
References
- L. Sweeney, "k-Anonymity: A Model for Protecting Privacy", IJUFKS 2002.
- A. Narayanan & V. Shmatikov, "Robust De-anonymization of Large Sparse Datasets", SSP 2008.
- L. Backstrom, C. Dwork & J. Kleinberg, "Wherefore Art Thou r3579x?: Anonymized Social Networks, Hidden Patterns, and Structural Steganography", WWW 2007.
- E. Zheleva & L. Getoor, "To Join or Not to Join: The Illusion of Privacy in Social Networks with Mixed Public and Private User Profiles", WWW 2009.