Slide 1: Differential Privacy: What, Why and When
Moni Naor, Weizmann Institute of Science
The Brussels Privacy Symposium, November 8th, 2016
Slide 2: What is Differential Privacy?
Differential Privacy is a concept:
- Motivation
- A rigorous mathematical definition
- Properties
- A measurable quantity
- A set of algorithmic techniques for achieving it
First defined in: Dwork, McSherry, Nissim, and Smith, "Calibrating Noise to Sensitivity in Private Data Analysis," Third Theory of Cryptography Conference (TCC), 2006.
Earlier roots: Warner, Randomized Response, 1965.
Slide 3: Why Differential Privacy?
- DP is a strong, quantifiable, composable mathematical privacy guarantee.
- Provably resilient to known and unknown attack modes!
- Theoretically, DP enables many computations with personal data while preserving personal privacy.
- Its practicality is in the first stages of validation.
- No snake oil, but not a panacea either.
Slide 4: Outline
Basics:
- Lots of data and privacy
- Definition of differential privacy
- Properties of DP
Counting queries:
- The online case via the Multiplicative Weights algorithm
Privacy and security:
- Connection to Tracing Traitors
- Deduplication in cloud storage
- Collaborative security
Implementation
Disclaimer: non-comprehensive and very biased.
Slide 5: Lots of Data
In recent years a lot of data has become available to companies and government agencies:
- Census data
- Huge databases collected by companies: a data deluge
- Public surveillance information: CCTV, RFIDs
- Social networks
The data contains personal and confidential information.
Slide 6: Social Benefits from Analyzing Large Collections of Data
[Figure: John Snow's map of cholera cases in the London epidemic of 1854, marking the water pump on Broad Street and the cholera cases clustered around it.]
Slide 7: More Utility: Word Completion
Slide 8: What about Privacy?
Better privacy vs. better data: social benefits come from analyzing large collections of data, yet almost any usage of the data that is not carefully crafted will leak something about it.
Slide 9: Glorious Failures of Traditional Approaches to Data Privacy
- Re-identification [Sweeney '00, …]
- Auditors [Kenthapadi, Mishra, Nissim '05]
- Genome-wide association studies (GWAS) [Homer et al. '08]
- The Netflix Prize [Narayanan, Shmatikov '08]
- Social networks [Backstrom, Dwork, Kleinberg '11]
- Attacks on statistical aggregates [Dwork, Smith, Steinke, Vadhan '15]

Slide 10: [image]
Slide 11: The Netflix Prize
- Netflix recommends movies to its subscribers and sought an improved recommendation system.
- It offered $1,000,000 for a "10% improvement" and published training data.
- The prize was won in September 2009 by the "BellKor's Pragmatic Chaos" team.
- A very influential competition in machine learning.
Slide 12: From the Netflix Prize Rules Page…
"The training data set consists of more than 100 million ratings from over 480 thousand randomly-chosen, anonymous customers on nearly 18 thousand movie titles."
"The ratings are on a scale from 1 to 5 (integral) stars. To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly-assigned ids. The date of each rating and the title and year of release for each movie are provided."
Slide 13: Netflix Data Release [Narayanan-Shmatikov 2008]
[Figure: a ratings matrix, User 1…N by Item 1…M, holding ratings for a subset of movies and users. Usernames are replaced with random IDs, with some additional perturbation.]
Credit: Arvind Narayanan via Adam Smith
Slide 14: A Source of Auxiliary Information
The Internet Movie Database (IMDb):
- Individuals may register for an account and rate movies.
- They need not be anonymous, and probably want to create some web presence.
- Visible material includes ratings, dates, and comments.
Slide 15: Use Public Reviews from IMDb.com
[Figure: anonymized Netflix data + public, incomplete IMDb data = identified Netflix data, with the anonymous rows re-linked to named users (Alice, Bob, Charlie, Danielle, Erica, Frank).]
Credit: Arvind Narayanan via Adam Smith
Slide 16: De-anonymizing the Netflix Dataset
Results:
- "With 8 movie ratings (of which 2 may be completely wrong) and dates that may have a 3-day error, 96% of Netflix subscribers whose records have been released can be uniquely identified in the dataset."
- "For 89%, 2 ratings and dates are enough to reduce the set of plausible records to 8 out of almost 500,000, which can then be inspected by a human for further deanonymization."
Consequences? One can learn about movies that IMDb users didn't want to tell the world about: sexual orientation, religious beliefs, and so on. The release became the subject of lawsuits under the Video Privacy Protection Act (1988), settled in March 2010.
Credit: Arvind Narayanan via Adam Smith
Slide 17: Do People Care About Privacy?
- Technology changes.
- Attitudes differ.
- Attitudes change over time.
Slide 18: Do people care about privacy?
Idea: Arvind Narayanan
Slide 19: Do people care about privacy? What about proctologists?
Slide 20: Voting Confidentiality
[Image: George Caleb Bingham, "The County Election".]
Source: http://www.cs.uiowa.edu/~jones/voting/pictures/
Slide 21: Voting Confidentiality
[Images: the judge swears the voter; tallying the vote.]
Slide 22: Privacy of Public Data Analysis
The holy grail: get the utility of statistical analysis while protecting the privacy of every individual participant.
Ideally, a privacy-preserving sanitization should allow reasonably accurate answers to meaningful questions.
Is it possible to phrase this goal in a meaningful and achievable manner?
Slide 23: Setting
[Diagram: Database → Curator/Sanitizer → Released data.]
Global vs. local.
Slide 24: Setting: the Interactive Case
[Diagram: multiple queries, chosen adaptively (query 1, query 2, …), are sent to the Curator/Sanitizer, which holds the data and returns answers.]
Goal: give guidelines/tools to the curator.
Slide 25: Difficult Even if the Curator is an Angel and the Data are in a Vault
[Diagram: an analyst queries a curator C holding the data in a vault.]
This is the "pure" privacy problem; nevertheless, it has a tight connection to problems in cryptography.
You can run but you can't hide.
Credit: Cynthia Dwork
Slide 26: Databases that Teach
A database teaches that smoking causes cancer:
- Smoker S's insurance premiums rise. This is true even if S is not in the database!
- Learning that smoking causes cancer is the whole point; ideally, smoker S enrolls in a smoking cessation program…
Differential privacy: limit the harm to the teachings, not to participation.
The outcome of any analysis is essentially equally likely, independent of whether any individual joins, or refrains from joining, the dataset.
Slide 27: The Statistics Masquerade: a Differencing Attack
Suppose Kobbi is unique in the community in speaking 5 languages. Ask:
- How many people in the community have sickle cell trait?
- How many in the community speak at most 4 languages and have sickle cell trait?
"Blatant non-privacy" [Dinur and Nissim 2003, et sequelae]: overly accurate answers to too many questions destroy privacy.
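
To make the differencing attack concrete, here is a minimal Python sketch; the names and records are hypothetical, invented purely for illustration:

```python
# Hypothetical community data illustrating the differencing attack above.
community = [
    {"name": "Kobbi", "languages": 5, "sickle_cell_trait": True},
    {"name": "Ann",   "languages": 3, "sickle_cell_trait": False},
    {"name": "Bob",   "languages": 4, "sickle_cell_trait": True},
]

def count(db, pred):
    """Exact counting query: how many records satisfy the predicate?"""
    return sum(1 for person in db if pred(person))

# Query 1: how many people have sickle cell trait?
q1 = count(community, lambda p: p["sickle_cell_trait"])
# Query 2: how many speak at most 4 languages AND have the trait?
q2 = count(community, lambda p: p["languages"] <= 4 and p["sickle_cell_trait"])

# If Kobbi is the only 5-language speaker, the difference of the two exact
# answers is precisely his private bit:
kobbi_has_trait = (q1 - q2) == 1
print(kobbi_has_trait)  # True
```

Adding appropriately calibrated noise to every answer, as in the mechanisms that follow, is exactly what blocks this subtraction.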
Slide 28: Differential Privacy [Dwork, McSherry, Nissim & Smith 2006]
[Diagram: a database b = (b_1, b_2, b_3, …, b_{n-1}, b_n) and a neighboring database b' = (b_1, b_2', b_3, …, b_{n-1}, b_n), in which one entry has been modified. The mechanism M is applied to both; the output distributions M(b) and M(b') are at "distance" < ε.]
Slide credit: Kobbi Nissim
Slide 29: Differential Privacy
Sanitizer M gives ε-differential privacy if, for all adjacent databases D_1 and D_2 (differing in one user) and all subsets A ⊆ range(M):

  Pr[M(D_1) ∈ A] ≤ e^ε · Pr[M(D_2) ∈ A]

The probability is taken over the randomness of M; both responses lie in a common range Z, and the ratio of the response probabilities is bounded.
- Participation in the data set poses no additional risk.
- The adversary cannot distinguish whether I supplied real or fake input.
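
For a minimal concrete instance of the definition, here is a sketch of Warner-style randomized response (the 1965 precursor cited earlier); the 3/4-1/4 coin parameters are the textbook fair-coin variant, an assumption rather than anything specified on the slide:

```python
import math
import random

def randomized_response(true_bit: int) -> int:
    """Report the true bit with probability 3/4, its negation otherwise."""
    return true_bit if random.random() < 0.75 else 1 - true_bit

# Checking the definition by hand: the two adjacent "databases" are the two
# possible values of one person's bit, and
#   Pr[output = 1 | bit = 1] = 0.75,  Pr[output = 1 | bit = 0] = 0.25,
# so every output event changes in probability by a factor of at most 3.
# The mechanism is therefore eps-DP with eps = ln(3):
eps = math.log(0.75 / 0.25)
print(eps)  # ~1.0986
```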
Slide 30: Differential Privacy is a Success
There are algorithms in many settings and for many tasks. Important properties:
- Group privacy: kε privacy for a group of size k.
- Composability: applying the sanitization several times yields a graceful degradation, proportional to the number of applications, and even proportional to the square root of the number of applications.
- Robustness to side information: no need to specify exactly what the adversary knows, which is hard to quantify.
- Programmable!
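
The composability bullet can be made quantitative. The sketch below uses the standard basic and advanced composition bounds from the differential privacy literature; the exact advanced-composition expression is not from these slides:

```python
import math

def basic_composition(eps: float, t: int) -> float:
    """t adaptively chosen eps-DP computations compose to (t * eps)-DP."""
    return t * eps

def advanced_composition(eps: float, t: int, delta_prime: float) -> float:
    """Composing t eps-DP computations yields (eps', delta')-DP with
    eps' = sqrt(2 t ln(1/delta')) * eps + t * eps * (e^eps - 1),
    which for small eps grows roughly like sqrt(t) * eps -- the
    'square root of the number of applications' on the slide."""
    return (math.sqrt(2 * t * math.log(1 / delta_prime)) * eps
            + t * eps * (math.exp(eps) - 1))

print(basic_composition(0.1, 100))             # 10.0
print(advanced_composition(0.1, 100, 1e-6))    # ~6.3
```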
Slide 31: Counting Queries
Database x of size n: n individuals, each contributing a single point in a universe U.
Counting queries: Q is a set of predicates q: U → {0,1}. A query asks: how many participants in x satisfy q? (Sometimes stated as a fraction.)
Relaxed accuracy: answer each query to within α additive error, w.h.p.
Not so bad: some error is anyway inherent in statistical analysis.
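
As a toy illustration, here is a hypothetical counting query in Python; the universe and the records are invented:

```python
# Universe U: (age, smoker) records; database x of size n = 4.
x = [(34, True), (29, False), (61, True), (45, True)]

def q(row) -> int:
    """A predicate q: U -> {0, 1}; here, 'is this person a smoker?'"""
    return 1 if row[1] else 0

exact_count = sum(q(row) for row in x)    # how many satisfy q? -> 3
exact_fraction = exact_count / len(x)     # the same query, as a fraction
```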
Slide 32: Laplacian Mechanism for Counting Queries
Given query q: compute the true answer x(q) and output x(q) + Lap(1/ε), where Lap(b) is Laplace noise of scale b.
[Figure: the density of the Laplace distribution, peaked at 0.]
Handle t online queries by adding independently drawn noise Lap(t/ε) to each answer; the total privacy loss is then ε. Can do better with (ε, δ)-DP, where noise on the order of √t/ε suffices.
Question: can we handle a number of queries >> n?
These mechanisms need a bound on t which is o(n) (for δ = 0) or o(n²) (for (ε, δ)-DP).
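
A minimal sketch of the mechanism, reusing the counting-query setup above; the t/ε scale for t online queries is the basic-composition calibration described on the slide:

```python
import numpy as np

def laplace_counting_query(x, q, eps: float) -> float:
    """Answer one counting query (sensitivity 1) with eps-DP:
    the true answer plus Laplace noise of scale 1/eps."""
    true_answer = sum(q(row) for row in x)
    return true_answer + np.random.laplace(loc=0.0, scale=1.0 / eps)

def answer_online(x, queries, eps: float):
    """Answer t online queries with independent noise of scale t/eps each,
    so that the total privacy loss is eps."""
    t = len(queries)
    return [sum(q(row) for row in x) + np.random.laplace(scale=t / eps)
            for q in queries]
```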
Slide 33: Key Insight to Increase the Number of Queries: Use Coordinated Noise
Starting from Blum-Ligett-Roth 2008: if noise is added with careful coordination, rather than independently, one can answer hugely many queries (#queries >> DB size).
A wave of results followed, showing differential privacy for every set Q of counting queries, with error Õ(n^{1/2} log|Q|).
This holds even in the interactive case, via the Private Multiplicative Weights algorithm.
Slide 34: Maintaining State
[Diagram: queries q arrive one by one; the curator maintains a state, which is a distribution D.]
Slide 35: The PMW Algorithm [Hardt & Rothblum 2010]
Maintain a distribution D on the universe U. This is the state, and it is completely public!

Initialize D to be uniform on U.
Repeat up to k times:
  Set T̂ ← T + Lap(σ), for a suitable noise scale σ.
  Repeat while no update occurs:
    Receive a query q ∈ Q.
    Let â = x(q) + Lap(σ), where x(q) is the true value.
    Test: if |q(D) − â| ≤ T̂, output q(D).
    Else (update): output â, update D[i] ← D[i] · e^{±(T/4)·q[i]}, and re-weight; the plus or minus is chosen according to the sign of the error.
The algorithm fails if more than k updates occur.

Multiplicative Weights is a powerful tool in algorithm design for learning a probability distribution iteratively. In each round, either the current distribution is good, or we get a lot of information on the distribution and update it.
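
The loop above translates almost directly into code. Below is a sketch that represents the database as a normalized histogram over U; the update rule e^{±(T/4)·q[i]} follows the slide, while the calibration of T, σ, and k is left to the caller rather than taken from the paper:

```python
import numpy as np

def pmw(x_hist, queries, T: float, sigma: float, k: int):
    """Sketch of Private Multiplicative Weights (Hardt & Rothblum 2010).
    x_hist: the true database as a normalized histogram over the universe U.
    queries: 0/1 numpy vectors q over U, so that q(D) = <q, D>.
    T: accuracy threshold; sigma: Laplace noise scale; k: update budget."""
    D = np.full(len(x_hist), 1.0 / len(x_hist))   # public state: uniform on U
    T_hat = T + np.random.laplace(scale=sigma)    # noisy threshold
    answers, updates = [], 0
    for q in queries:
        a_hat = float(q @ x_hist) + np.random.laplace(scale=sigma)
        if abs(float(q @ D) - a_hat) <= T_hat:
            answers.append(float(q @ D))          # lazy round: answer from D
        else:                                     # update round
            updates += 1
            if updates > k:
                raise RuntimeError("PMW failed: more than k updates")
            answers.append(a_hat)
            sign = np.sign(a_hat - float(q @ D))  # sign of the error
            D = D * np.exp(sign * (T / 4.0) * q)  # multiplicative-weights step
            D = D / D.sum()                       # re-weight
            T_hat = T + np.random.laplace(scale=sigma)
    return answers
```

Note that the state D is derived only from released, noise-protected values, which is why it can be completely public.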
Slide 36: Privacy and Security
- Connection to Tracing Traitors
- Deduplication in cloud storage
- Collaborative security
Slide 37: Issues
How big can epsilon be? 0.01? 0.1? 10? It may add up over a lifetime…
Possible answer: design your system so that epsilon is bounded at all times; this already removes some pernicious attacks.
What level are we protecting? User level or event level? The gene or the individual?
Slide 38: Applications/Implementations of Differential Privacy
- Census Bureau OnTheMap: gives researchers access to agency data.
- Google's RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response): open source; enabled the collection of data that Chrome had avoided before.
- Tackling urban mobility with aggregate traffic data.
- Apple: big news coverage and a commitment to privacy.
- Applications to multiple hypothesis testing.
- How to hack Kaggle competitions.
Global vs. local.
Slide 39: Public Policy
- California Public Utilities Commission: smart meters.
- Interpreting and implementing FERPA (the Family Educational Rights and Privacy Act) via differential privacy [Nissim and Wood]: understand how DP fits within the existing regulatory framework. Problem: the regulatory framework is not mathematically precise, and the idea of de-identification is hard-wired into it.
Slide 40: Challenges
- Small datasets.
- Massive composition and a global epsilon: event level vs. user level.
- Working in conjunction with Secure Function Evaluation.
- Winning the hearts and minds of policy makers…
Slide 41: Winning the Hearts and Minds of Policy Makers…
- Widen the scope of implementation and use; identify the next good use cases for DP.
- Construct DP tools that best match the practices and education of users; explain the shortcomings of other methods and the benefits of DP.
- Figure out how DP works as one of the layers in a suite of privacy protections; it is less straightforward and intuitive than anonymity/de-identification and its variants.