Slide 1
Data Privacy
Slide 2
SADET Module D5: Data Privacy

Dr. Balaji Palanisamy
Associate Professor
School of Computing and Information, University of Pittsburgh
bpalan@pitt.edu

Slides courtesy: Prof. James Joshi (University of Pittsburgh). Many slides in this lecture are adapted from the SIGMOD 2009 tutorial "Anonymized Data: Generation, Models, Usage" by Cormode & Srivastava, and from Indrajit Roy et al.'s NSDI 2010 paper.

IS-2150/TEL2810: Info. Security and Privacy
Slide 3
- Introduction to Privacy
- Data Privacy
- Anonymization techniques
- Differential Privacy
Slide 4
What is privacy?

Privacy is hard to define. One classic definition:
"Privacy is the claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent information about them is communicated to others."
- Alan Westin, Privacy and Freedom, 1967
Slide 5
OECD Guidelines on the Protection of Privacy (1980)
- Collection limitation
- Data quality
- Purpose specification
- Use limitation
- Security safeguards
- Openness
- Individual participation
- Accountability

http://www.oecd.org/document/18/0,3343,en_2649_34255_1815186_1_1_1_1,00.html#part2
Slide 6
Privacy Laws
- EU: comprehensive
  - European Directive on Data Protection
- US: sector-specific
  - HIPAA (Health Insurance Portability and Accountability Act of 1996): protects individually identifiable health information
  - COPPA (Children's Online Privacy Protection Act of 1998): addresses the collection of personal information from children under 13, including how to seek verifiable parental consent from their parents, etc.
  - GLB (Gramm-Leach-Bliley Act of 1999): requires financial institutions to provide consumers with a privacy policy notice, including what information is collected, where it is shared (affiliates and nonaffiliated third parties), how it is used, how it is protected, opt-out options, etc.
  - Fair Credit Reporting Act
Slide 7
Anonymized Data: Generation, Models, Usage – Cormode & Srivastava
Why anonymize and how?
- For data sharing: give real(istic) data to others to study without compromising the privacy of individuals in the data.
- For data retention and usage: various requirements prevent companies from retaining customer information indefinitely.

Anonymization methods:
- k-anonymity
- l-diversity
- differential privacy
Slide 8
Tabular Data Example

Course record data recording scores and demographics.
Releasing the (Student ID, Score) association violates an individual's privacy: Student ID is an identifier, Score is a sensitive attribute (SA).

Student ID | DOB     | Sex | ZIP   | Score
75835      | 9/28/96 | M   | 15213 | 70
14792      | 9/29/96 | F   | 15213 | 70
87593      | 1/21/95 | F   | 15212 | 80
87950      | 9/28/96 | M   | 15212 | 80
38833      | 5/25/92 | M   | 15206 | 90
68054      | 1/13/92 | F   | 15206 | 70
99316      | 7/28/92 | M   | 15207 | 80
51589      | 1/13/92 | F   | 15207 | 80
14941      | 1/13/98 | F   | 15232 | 90
22563      | 7/28/99 | M   | 15232 | 90
90652      | 1/22/99 | M   | 15231 | 90
12386      | 2/23/98 | F   | 15231 | 90
Slide 9
Tabular Data Example: De-Identification

Course record: remove Student ID to create a de-identified table.
Does the de-identified table preserve an individual's privacy? That depends on what other information an attacker knows.

DOB     | Sex | ZIP   | Score
9/28/96 | M   | 15213 | 70
9/29/96 | F   | 15213 | 70
1/21/95 | F   | 15212 | 80
9/28/96 | M   | 15212 | 80
5/25/92 | M   | 15206 | 90
1/13/92 | F   | 15206 | 70
7/28/92 | M   | 15207 | 80
1/13/92 | F   | 15207 | 80
1/13/98 | F   | 15232 | 90
7/28/99 | M   | 15232 | 90
1/22/99 | M   | 15231 | 90
2/23/98 | F   | 15231 | 90
Slide 10
Tabular Data Example: Linking Attack

De-identified private data + publicly available data (the de-identified table from Slide 9, joined against the public records below).
With DOB alone as a quasi-identifier (QI), the attacker cannot uniquely identify either individual's score: each DOB below matches two rows.

Public data:
Student ID | DOB     | Sex
75835      | 9/28/96 | M
51589      | 1/13/92 | M
Slide 11
Tabular Data Example: Linking Attack

De-identified private data + publicly available data (the de-identified table from Slide 9, joined against the public records below).
With DOB and Sex as quasi-identifiers (QI), the attacker uniquely identifies one individual's score, but not the other's.

Public data:
Student ID | DOB     | Sex
75835      | 9/28/96 | M
22563      | 7/28/99 | M
Slide 12
Tabular Data Example: Linking Attack

De-identified private data + publicly available data (the de-identified table from Slide 9, joined against the public records below).
With DOB, Sex, and ZIP as quasi-identifiers, the attacker uniquely identifies both individuals' scores.
The combination [DOB, Sex, ZIP] is unique for a large fraction of US residents [Sweeney 02].

Public data:
Student ID | DOB     | Sex | ZIP
75835      | 9/28/96 | M   | 15213
22563      | 7/28/99 | M   | 15232
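The linking attack can be sketched in code. This illustrative Python snippet (the helper name and data layout are my own, not from the slides) joins the de-identified table with public records on the quasi-identifiers [DOB, Sex, ZIP] and recovers a score whenever the QI match is unique:

```python
# De-identified course records from the slides: (DOB, Sex, ZIP, Score).
deidentified = [
    ("9/28/96", "M", "15213", 70), ("9/29/96", "F", "15213", 70),
    ("1/21/95", "F", "15212", 80), ("9/28/96", "M", "15212", 80),
    ("5/25/92", "M", "15206", 90), ("1/13/92", "F", "15206", 70),
    ("7/28/92", "M", "15207", 80), ("1/13/92", "F", "15207", 80),
    ("1/13/98", "F", "15232", 90), ("7/28/99", "M", "15232", 90),
    ("1/22/99", "M", "15231", 90), ("2/23/98", "F", "15231", 90),
]

# Publicly available records: (StudentID, DOB, Sex, ZIP).
public = [
    ("75835", "9/28/96", "M", "15213"),
    ("22563", "7/28/99", "M", "15232"),
]

def link(public_row):
    """Return the victim's score if their QI combination is unique."""
    _, dob, sex, zipcode = public_row
    matches = [score for (d, s, z, score) in deidentified
               if (d, s, z) == (dob, sex, zipcode)]
    return matches[0] if len(matches) == 1 else None

for row in public:
    print(row[0], "->", link(row))  # both QI combinations are unique here
```

Because each public QI triple matches exactly one de-identified row, both scores are re-identified, which is the Slide 12 outcome.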
Slide 13
k-Anonymization [Samarati, Sweeney 98]

k-anonymity: a table T satisfies k-anonymity with quasi-identifier QI iff each tuple in (the multiset) T[QI] appears at least k times. This protects against the "linking attack".
k-anonymization: a table T' is a k-anonymization of T if T' is a generalization/suppression of T, and T' satisfies k-anonymity.

The de-identified table from Slide 9 generalizes to the following 4-anonymization (DOB generalized to a year range, Sex suppressed, ZIP truncated to a prefix):

DOB   | Sex | ZIP   | Score
95-96 | *   | 1521* | 70
95-96 | *   | 1521* | 70
95-96 | *   | 1521* | 80
95-96 | *   | 1521* | 80
92    | *   | 1520* | 90
92    | *   | 1520* | 70
92    | *   | 1520* | 80
92    | *   | 1520* | 80
98-99 | *   | 1523* | 90
98-99 | *   | 1523* | 90
98-99 | *   | 1523* | 90
98-99 | *   | 1523* | 90
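The k-anonymity condition is mechanical to verify. Here is a minimal Python sketch (function and variable names are my own) that counts QI combinations and checks each appears at least k times, using the 4-anonymized table as input:

```python
from collections import Counter

def is_k_anonymous(table, qi_indices, k):
    """True iff every QI combination occurs at least k times in the table."""
    counts = Counter(tuple(row[i] for i in qi_indices) for row in table)
    return all(c >= k for c in counts.values())

# The 4-anonymization from the slide: (DOB range, Sex, ZIP prefix, Score).
anonymized = [
    ("95-96", "*", "1521*", 70), ("95-96", "*", "1521*", 70),
    ("95-96", "*", "1521*", 80), ("95-96", "*", "1521*", 80),
    ("92",    "*", "1520*", 90), ("92",    "*", "1520*", 70),
    ("92",    "*", "1520*", 80), ("92",    "*", "1520*", 80),
    ("98-99", "*", "1523*", 90), ("98-99", "*", "1523*", 90),
    ("98-99", "*", "1523*", 90), ("98-99", "*", "1523*", 90),
]

print(is_k_anonymous(anonymized, qi_indices=(0, 1, 2), k=4))  # True
```

Each of the three QI groups has exactly four rows, so the table is 4-anonymous but not 5-anonymous.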
Slide 14
k-Anonymization and Uncertainty

Intuition: a k-anonymized table T' represents the set of all "possible world" tables Ti such that T' is a k-anonymization of Ti. The table T from which T' was originally derived is one of the possible worlds.

One possibility: the de-identified table from Slide 9, which generalizes to the 4-anonymization shown on Slide 13.
Slide 15
k-Anonymization and Uncertainty

Intuition: a k-anonymized table T' represents the set of all "possible world" tables Ti such that T' is a k-anonymization of Ti. (Many) other tables are also possible.

Another possibility (note the changed ZIP in row 1, Sex in rows 4 and 9, and ZIP in row 11):

DOB     | Sex | ZIP   | Score
9/28/96 | M   | 15217 | 70
9/29/96 | F   | 15213 | 70
1/21/95 | F   | 15212 | 80
9/28/96 | F   | 15212 | 80
5/25/92 | M   | 15206 | 90
1/13/92 | F   | 15206 | 70
7/28/92 | M   | 15207 | 80
1/13/92 | F   | 15207 | 80
1/13/98 | M   | 15232 | 90
7/28/99 | M   | 15232 | 90
1/22/99 | M   | 15232 | 90
2/23/98 | F   | 15231 | 90

This table generalizes to the same 4-anonymization as on Slide 13.
Slide 16
Homogeneity Attack [Machanavajjhala+ 06]

Issue: k-anonymity requires each tuple in (the multiset) T[QI] to appear ≥ k times, but it says nothing about the SA values. If (almost) all SA values in a QI group are equal, privacy is lost! The problem is with the choice of grouping, not the data.

In the 4-anonymization from Slide 13, this is not OK: all scores in the 98-99/1523* QI group are 90, so an attacker who can place an individual in that group learns their score exactly.
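A homogeneity check like the one above is easy to automate. This illustrative Python sketch (helper names are my own) flags any QI group whose sensitive attribute takes only a single value, and it flags exactly the 98-99/1523* group from the slide:

```python
from collections import defaultdict

def homogeneous_groups(table, qi_indices, sa_index):
    """Return the QI groups whose sensitive attribute has a single value."""
    groups = defaultdict(set)
    for row in table:
        qi = tuple(row[i] for i in qi_indices)
        groups[qi].add(row[sa_index])
    return [qi for qi, values in groups.items() if len(values) == 1]

# The 4-anonymization from Slide 13: (DOB range, ZIP prefix, Score).
anonymized = [
    ("95-96", "1521*", 70), ("95-96", "1521*", 70),
    ("95-96", "1521*", 80), ("95-96", "1521*", 80),
    ("92",    "1520*", 90), ("92",    "1520*", 70),
    ("92",    "1520*", 80), ("92",    "1520*", 80),
    ("98-99", "1523*", 90), ("98-99", "1523*", 90),
    ("98-99", "1523*", 90), ("98-99", "1523*", 90),
]

print(homogeneous_groups(anonymized, qi_indices=(0, 1), sa_index=2))
# -> [('98-99', '1523*')]
```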
Slide 17
Homogeneity Attack [Machanavajjhala+ 06]

Issue: k-anonymity requires each tuple in (the multiset) T[QI] to appear ≥ k times, but it says nothing about the SA values. If (almost) all SA values in a QI group are equal, privacy is lost! The problem is with the choice of grouping, not the data. For some groupings, there is no loss of privacy.

This grouping of the Slide 9 table is OK: at least 3 distinct values appear in each QI group (3-diversity with 3 distinct values for each QI group).

DOB   | Sex | ZIP   | Score
95-99 | *   | 152** | 70
95-99 | *   | 152** | 70
95-99 | *   | 152** | 80
95-99 | *   | 152** | 80
92    | *   | 1520* | 90
92    | *   | 1520* | 70
92    | *   | 1520* | 80
92    | *   | 1520* | 80
95-99 | *   | 152** | 90
95-99 | *   | 152** | 90
95-99 | *   | 152** | 90
95-99 | *   | 152** | 90
Slide 18
Homogeneity and Uncertainty

Intuition: a k-anonymized table T' represents the set of all "possible world" tables Ti such that T' is a k-anonymization of Ti. Lack of diversity of SA values implies that some fact is true in a large fraction of possible worlds, which can violate privacy.

Public data:
Student ID | DOB     | Sex | ZIP
90652      | 1/22/99 | M   | 15231

In the 4-anonymization from Slide 13, student 90652 (DOB 1/22/99, ZIP 15231) falls in the 98-99/1523* QI group, where every score is 90, so his score is revealed in every possible world.
Slide 19
l-Diversity [Machanavajjhala+ 06]

Intuition: the most frequent SA value should not appear too often compared to the less frequent values in a QI group.
l-Diversity Principle: a table is l-diverse if each of its QI groups contains at least l "well-represented" values for the SA.
"Well-represented" extensions:
- Distinct l-diversity: the simplest definition; at least l distinct SA values are represented in each QI group.
- Entropy l-diversity: for each QI group g, entropy(g) ≥ log(l).
- Recursive (c, l)-diversity: for each QI group g with m SA values, where r_i is the i-th highest frequency, r_1 < c (r_l + r_{l+1} + ... + r_m).

(Illustrated on the 4-anonymization from Slide 13.)
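The distinct and entropy variants above can be checked mechanically. A minimal Python sketch (function names are my own; the example grouping is the Slide 17 anonymization with columns DOB range, ZIP prefix, Score):

```python
import math
from collections import Counter, defaultdict

def sa_counts_per_group(table, qi_indices, sa_index):
    """Map each QI combination to a Counter of its SA values."""
    groups = defaultdict(Counter)
    for row in table:
        groups[tuple(row[i] for i in qi_indices)][row[sa_index]] += 1
    return groups

def distinct_l_diverse(table, qi_indices, sa_index, l):
    """Distinct l-diversity: >= l distinct SA values in each QI group."""
    return all(len(c) >= l
               for c in sa_counts_per_group(table, qi_indices, sa_index).values())

def entropy_l_diverse(table, qi_indices, sa_index, l):
    """Entropy l-diversity: entropy(g) >= log(l) for every QI group g."""
    for counts in sa_counts_per_group(table, qi_indices, sa_index).values():
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
        if entropy < math.log(l):
            return False
    return True

diverse = [
    ("95-99", "152**", 70), ("95-99", "152**", 70),
    ("95-99", "152**", 80), ("95-99", "152**", 80),
    ("92",    "1520*", 90), ("92",    "1520*", 70),
    ("92",    "1520*", 80), ("92",    "1520*", 80),
    ("95-99", "152**", 90), ("95-99", "152**", 90),
    ("95-99", "152**", 90), ("95-99", "152**", 90),
]
print(distinct_l_diverse(diverse, (0, 1), 2, l=3))  # True
print(entropy_l_diverse(diverse, (0, 1), 2, l=2))   # True
```

Note that the same grouping is 3-diverse in the distinct sense but not in the entropy sense, since the 95-99 group is dominated by the value 90; the entropy definition is strictly stronger here.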
Slide 20
Background: Differential Privacy

A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not.

Cynthia Dwork. Differential Privacy. ICALP 2006.
Slide 21
Differential Privacy (Intuition)

A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not.

[Figure: inputs A, B, C feed into a mechanism F(x), which induces an output distribution.]

Cynthia Dwork. Differential Privacy. ICALP 2006.
Slide 22
Differential Privacy (Intuition)

A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not.

[Figure: F(x) run on inputs {A, B, C} and on inputs {A, B, C, D} yields similar output distributions, so the risk for D is bounded if she includes her data.]

Cynthia Dwork. Differential Privacy. ICALP 2006.
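The "similar probability" intuition has a standard formal statement in Dwork's paper: a randomized mechanism K is ε-differentially private if, for all pairs of datasets D1 and D2 differing in at most one element, and all sets S of outputs,

```latex
\Pr[K(D_1) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[K(D_2) \in S]
```

Smaller ε means the two distributions are closer and the privacy guarantee is stronger.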
Slide 23
Achieving Differential Privacy

A simple differentially private mechanism: the analyst asks "Tell me f(x)" over inputs x1, ..., xn, and the curator answers f(x) + noise.
How much noise should one add?
Slide 24
Achieving Differential Privacy

Function sensitivity (intuition): the maximum effect of any single input on the output.
Aim: conceal this effect to preserve privacy.
Example: computing the average height of the people in this room has low sensitivity, since any single person's height does not affect the final average by too much. Calculating the maximum height has high sensitivity.
Slide 25
Achieving Differential Privacy

Function sensitivity (intuition): the maximum effect of any single input on the output.
Aim: conceal this effect to preserve privacy.
Example: SUM over input elements X1, X2, X3, X4 drawn from [0, M].
Sensitivity = M, since the maximum effect of any single input element on the sum is M.
Slide 26
Achieving Differential Privacy

A simple differentially private mechanism: over inputs x1, ..., xn, answer the query "Tell me f(x)" with f(x) + Lap(Δ(f)/ε).
Intuition: this is the noise needed to mask the effect of any single input.
Lap = Laplace distribution; Δ(f) = sensitivity of f.
Slide 27
Sensitivity of a Function f

How much can f(DB + Me) exceed f(DB - Me)?
Recall: K(f, DB) = f(DB) + noise.
The question asks: what difference must the noise obscure?

Δf = max over (DB, Me) of |f(DB + Me) - f(DB - Me)|

e.g., ΔCount = 1
Slide 28
Calibrate Noise to Sensitivity

Δf = max over (DB, Me) of |f(DB + Me) - f(DB - Me)|

[Figure: two overlapping Laplace densities, centered at f(DB - Me) and f(DB + Me), drawn over an axis marked -4R, ..., -R, 0, R, 2R, ..., 5R.]

Pr[K(f, DB - Me) = t] / Pr[K(f, DB + Me) = t]
  = exp(-(|t - f_-| - |t - f_+|) / R)
  ≤ exp(Δf / R)

Theorem: To achieve ε-differential privacy, use scaled symmetric noise with density proportional to exp(-|x|/R), with R = Δf/ε.
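The Laplace mechanism of Slides 26-28 can be sketched as follows. This is an illustrative Python implementation (names are my own, not from the slides): it samples Laplace noise of scale R = Δf/ε, here generated as the difference of two unit exponentials, which is a standard way to draw Laplace variates, and adds it to the true query answer.

```python
import random

def laplace_mechanism(data, f, sensitivity, epsilon):
    """Release f(data) + Lap(R) with R = sensitivity / epsilon."""
    scale = sensitivity / epsilon
    # Difference of two iid Exp(1) variates, scaled, is Laplace(0, scale).
    noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
    return f(data) + noise

# SUM over scores drawn from [0, 100]: sensitivity = M = 100 (Slide 25).
scores = [70, 70, 80, 80, 90, 70, 80, 80, 90, 90, 90, 90]
print(laplace_mechanism(scores, sum, sensitivity=100.0, epsilon=0.5))
```

With ε = 0.5 and Δf = 100 the noise scale is 200, so individual contributions are well hidden; larger ε trades privacy for accuracy.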