Differential Privacy: What, Why and When


Presentation Transcript

Slide 1

Differential Privacy: What, Why and When

Moni Naor, Weizmann Institute of Science
The Brussels Privacy Symposium
November 8th, 2016

Slide 2

What is Differential Privacy?

Differential Privacy is a concept:
- Motivation
- Rigorous mathematical definition
- Properties
- A measurable quantity
- A set of algorithmic techniques for achieving it

First defined in: Dwork, McSherry, Nissim, and Smith, "Calibrating Noise to Sensitivity in Private Data Analysis," Third Theory of Cryptography Conference, TCC 2006.
Earlier roots: Warner, Randomized Response, 1965.
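As a concrete illustration of the 1965 randomized-response idea (not part of the original slides; the function names and the 3/4 truth probability are my choices), here is a minimal Python sketch: each respondent answers truthfully with probability 3/4 and flips the answer otherwise, so every individual report is deniable while the population fraction can still be estimated.

```python
import random

def randomized_response(true_answer: bool, p_truth: float = 0.75) -> bool:
    """Report the true answer with probability p_truth, else the opposite."""
    return true_answer if random.random() < p_truth else not true_answer

def estimate_fraction(reports, p_truth: float = 0.75) -> float:
    """Unbiased estimate of the true 'yes' fraction from the noisy reports."""
    observed = sum(reports) / len(reports)
    # E[observed] = p_truth * f + (1 - p_truth) * (1 - f); solve for f.
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)

# Example: 10,000 respondents, 30% of whom truly answer "yes".
truth = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomized_response(t) for t in truth]
print(estimate_fraction(reports))  # close to 0.3
```

With p_truth = 3/4, each individual report satisfies differential privacy with epsilon = ln 3, since flipping the true answer changes the probability of any report by at most a factor of 0.75/0.25 = 3.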

Slide 3

Why Differential Privacy?

DP is a strong, quantifiable, composable mathematical privacy guarantee:
- Provably resilient to known and unknown attack modes!
- Theoretically, DP enables many computations with personal data while preserving personal privacy
- Practicality is in the first stages of validation

No snake oil. Not a panacea.

Slide 4

Outline
- Basics: lots of data and privacy
- Definition of Differential Privacy
- Properties of DP
- Counting queries
- Online case via the Multiplicative Weights algorithm
- Privacy and security: connection to Tracing Traitors, deduplication in cloud storage, collaborative security
- Implementation

Disclaimer: non-comprehensive and very biased.

Slide 5

Lots of Data

Recent years: a lot of data is available to companies and government agencies
- Census data
- Huge databases collected by companies
- Data deluge
- Public surveillance information: CCTV, RFIDs, social networks

The data contains personal and confidential information.

Slide 6

Social benefits from analyzing large collections of data

John Snow's map of cholera cases in the London epidemic of 1854: the cases cluster around the water pump on Broad Street.

Slide 7

More Utility: Word Completion

Slide 8

What about Privacy?

Better privacy, better data: almost any usage of the data that is not carefully crafted will leak something about it.

Social benefits from analyzing large collections of data

Slide 9

Glorious Failures of Traditional Approaches to Data Privacy
- Re-identification [Sweeney '00, ...]
- Auditors [Kenthapadi, Mishra, Nissim '05]
- Genome-wide association studies (GWAS) [Homer et al. '08]
- Netflix Prize [Narayanan, Shmatikov '08]
- Social networks [Backstrom, Dwork, Kleinberg '11]
- Attack on statistical aggregates [Dwork, Smith, Steinke, Vadhan '15]
Slide 10

Slide 11

The Netflix Prize
- Netflix recommends movies to its subscribers
- Sought an improved recommendation system
- Offered $1,000,000 for a "10% improvement"
- Published training data
- Prize won in September 2009 by the "BellKor's Pragmatic Chaos" team
- A very influential competition in machine learning

Slide 12

From the Netflix Prize rules page...

"The training data set consists of more than 100 million ratings from over 480 thousand randomly-chosen, anonymous customers on nearly 18 thousand movie titles."

"The ratings are on a scale from 1 to 5 (integral) stars. To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly-assigned ids. The date of each rating and the title and year of release for each movie are provided."

Slide 13

Netflix Data Release [Narayanan-Shmatikov 2008]

A matrix of users (User 1 ... User N) by items (Item 1 ... Item M), containing ratings for a subset of movies and users. Usernames replaced with random IDs; some additional perturbation.

Credit: Arvind Narayanan via Adam Smith

Slide 14

A Source of Auxiliary Information

The Internet Movie Database (IMDb):
- Individuals may register for an account and rate movies
- Need not be anonymous
- Probably want to create some web presence
- Visible material includes ratings, dates, comments

Slide 15

Use Public Reviews from IMDb.com

Anonymized Netflix data + public, incomplete IMDb data = identified Netflix data (Alice, Bob, Charlie, Danielle, Erica, Frank).

Credit: Arvind Narayanan via Adam Smith

Slide 16

De-anonymizing the Netflix Dataset

Results:
"With 8 movie ratings and dates that may have a 3-day error, 96% of Netflix subscribers whose records have been released can be uniquely identified in the dataset." (Of the 8 ratings, 2 may be completely wrong.)
"For 89%, 2 ratings and dates are enough to reduce the set of plausible records to 8 out of almost 500,000, which can then be inspected by a human for further deanonymization."

Consequences? Learn about movies that IMDb users didn't want to tell the world about: sexual orientation, religious beliefs. Subject of lawsuits under the Video Privacy Protection Act (1988); settled, March 2010.

Credit: Arvind Narayanan via Adam Smith

Slide 17

Do People Care About Privacy?
- Technology changes
- Attitudes differ
- Attitudes change over time

Slide 18

Do people care about privacy?

Idea: Arvind Narayanan

Slide 19

Do people care about privacy? What about proctologists?

Slide 20

Voting Confidentiality

George Caleb Bingham, "The County Election"
http://www.cs.uiowa.edu/~jones/voting/pictures/

Slide 21

Voting Confidentiality

Judge swears the voter; tallying the vote.

Slide 22

Privacy of Public Data Analysis

The holy grail: get the utility of statistical analysis while protecting the privacy of every individual participant.

Ideally: privacy-preserving sanitization should allow reasonably accurate answers to meaningful information.

Is it possible to phrase the goal in a meaningful and achievable manner?

Slide 23

Setting

Database -> Curator/Sanitizer -> Released data

Global vs. local

Slide 24

Setting: Interactive Case

The curator/sanitizer sits between the data and the analyst, answering multiple queries (query 1, query 2, ...) chosen adaptively.

Give guidelines/tools to the curator.

Slide 25

Difficult even if the curator is an angel and the data are in a vault: the "pure" privacy problem.

Nevertheless: a tight connection to problems in cryptography. You can run but you can't hide.

Credit: Cynthia Dwork

Slide 26

Databases that Teach

The database teaches that smoking causes cancer. Smoker S's insurance premiums rise. This is true even if S is not in the database!
- Learning that smoking causes cancer is the whole point.
- Smoker S enrolls in a smoking cessation program...

Differential privacy: limit the harm to the teachings, not the participation. The outcome of any analysis is essentially equally likely, independent of whether any individual joins, or refrains from joining, the dataset.

Slide 27

The Statistics Masquerade: Differencing Attack

Suppose Kobbi is unique in the community in speaking 5 languages.
- How many people in the community have sickle cell trait?
- How many in the community speak at most 4 languages and have sickle cell trait?

Blatant non-privacy [Dinur and Nissim 2003, et sequelae]: overly accurate answers to too many questions destroys privacy.
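To make the differencing attack concrete, here is a small illustrative Python sketch (the community records are hypothetical, not from the slides): each counting query looks innocuous on its own, but the difference of the two exact answers reveals the unique 5-language speaker's trait bit.

```python
# Hypothetical community records: (languages spoken, has sickle cell trait).
# Only one person ("Kobbi") speaks 5 languages.
records = [(2, False), (3, True), (1, False), (5, True), (4, False), (3, False)]

# Query 1: how many people have the trait?
q1 = sum(1 for langs, trait in records if trait)

# Query 2: how many people who speak at most 4 languages have the trait?
q2 = sum(1 for langs, trait in records if trait and langs <= 4)

# Exact answers to both queries expose one individual.
print(q1 - q2)  # 1 -> Kobbi has the trait; 0 -> he does not
```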

Slide 28

Differential Privacy

Two neighboring databases b = (b_1, ..., b_n) and b' (one entry modified) are fed to the mechanism M. The output distributions M(b) and M(b') are at "distance" less than ε.

[Dwork, McSherry, Nissim & Smith 2006]

Slide credit: Kobbi Nissim

Slide 29

Differential Privacy

Sanitizer M gives ε-differential privacy if, for all adjacent D_1 and D_2 (differing in one user) and for all subsets A ⊆ range(M):

Pr[M(D_1) ∈ A] ≤ e^ε · Pr[M(D_2) ∈ A]

The probability is taken over the randomness of M; the ratio of Pr[response] under the two databases is bounded.

Participation in the data set poses no additional risk: the adversary does not distinguish whether I supplied real or fake input.
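For readability, here is the definition from this slide restated in standard notation, together with the (ε, δ) relaxation that a later slide refers to (the relaxation is the standard one from the literature, not spelled out here):

```latex
% Pure epsilon-differential privacy (Dwork, McSherry, Nissim & Smith 2006):
\[
  \Pr[M(D_1)\in A] \;\le\; e^{\varepsilon}\,\Pr[M(D_2)\in A]
  \quad \text{for all adjacent } D_1, D_2 \text{ and all } A \subseteq \mathrm{range}(M).
\]
% The (epsilon, delta) relaxation mentioned later in the talk:
\[
  \Pr[M(D_1)\in A] \;\le\; e^{\varepsilon}\,\Pr[M(D_2)\in A] + \delta .
\]
```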

Slide 30

Differential Privacy is a Success

Algorithms in many settings and for many tasks. Important properties:
- Group privacy: kε privacy for a group of size k.
- Composability: applying the sanitization several times yields a graceful degradation, proportional to the number of applications, and even proportional to the square root of the number of applications.
- Robustness to side information: no need to specify exactly what the adversary knows (which is hard to quantify).
- Programmable!
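The group-privacy and composition claims on this slide correspond to the following standard bounds (the exact constants below are the textbook statements, added here for reference rather than taken from the slides). Running k mechanisms that are each ε-differentially private yields:

```latex
% Group privacy: an eps-DP mechanism gives k*eps-DP for a group of size k.
% Basic composition: the privacy loss grows linearly in k.
\[
  \varepsilon_{\text{total}} \;=\; k\,\varepsilon .
\]
% Advanced composition (Dwork, Rothblum, Vadhan 2010): for any delta' > 0,
% the k-fold composition is (eps_total, k*delta + delta')-DP with
\[
  \varepsilon_{\text{total}} \;=\; \sqrt{2k\ln(1/\delta')}\;\varepsilon \;+\; k\,\varepsilon\,(e^{\varepsilon}-1)
  \;\approx\; O\!\left(\sqrt{k}\,\varepsilon\right) \quad \text{for small } \varepsilon .
\]
```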

Slide 31

Counting Queries

Database x of size n: n individuals, each contributing a single point in the universe U.

Counting queries: Q is a set of predicates q: U → {0,1}. A query asks: how many participants x_i satisfy q? (Sometimes we talk about the fraction.)

Relaxed accuracy: answer each query within α additive error, with high probability. Not so bad: some error is anyway inherent in statistical analysis.
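In symbols (a restatement of this slide's definitions, added for clarity):

```latex
% A counting query q over the database x = (x_1, ..., x_n), with x_i in U:
\[
  q(x) \;=\; \bigl|\{\, i : q(x_i) = 1 \,\}\bigr|
  \qquad \text{(or the fraction } q(x)/n \text{)} .
\]
% alpha-accuracy: the released answer a satisfies, with high probability,
\[
  |\,a - q(x)\,| \;\le\; \alpha .
\]
```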

Slide 32

Laplacian Mechanism for Counting Queries

Given query q: compute the true answer x(q) and output x(q) + Lap(1/ε), the true count with Laplace noise added.

Handle t online queries by adding the noise independently; the privacy loss accumulates over the t queries. Can do better, with loss growing only like the square root of t, using (ε,δ)-DP.

Question: can we handle a number of queries much larger than n? With independent noise we need a bound on t which is o(n) (for δ = 0), and o(n²) with (ε,δ)-DP.

(Figure: the Laplace noise distribution.)
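A minimal Python sketch of the Laplace mechanism for a single counting query (an illustration of the standard formulation; the synthetic database, predicate, and parameter values are my choices, not the speaker's):

```python
import numpy as np

def laplace_count(database, predicate, epsilon: float) -> float:
    """Answer the counting query 'how many records satisfy predicate?'
    with epsilon-differential privacy.

    A counting query has sensitivity 1 (one person changes the count by
    at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_answer = sum(1 for record in database if predicate(record))
    return true_answer + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: ages of n = 1,000 synthetic individuals.
rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=1_000)

# Query: how many individuals are over 65?  Expected additive error ~ 1/epsilon.
print(laplace_count(ages, lambda age: age > 65, epsilon=0.1))
```

Answering t such queries independently costs roughly t·ε of privacy budget under basic composition, which is why the slide needs t = o(n); the square-root improvement comes from the (ε,δ) accounting mentioned above.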

Slide 33

Key Insight to Increase the Number of Queries: Use Coordinated Noise

Starting from Blum-Ligett-Roth 2008: if the noise is added with careful coordination, rather than independently, one can answer hugely many queries (number of queries far exceeding the database size).

A wave of results showing: differential privacy for every set Q of counting queries, with error Õ(n^{1/2} · log|Q|). This holds even in the interactive case, via the Private Multiplicative Weights algorithm.

Slide 34

Maintaining State

Queries q arrive one by one; the state maintained by the curator is a distribution D.

Slide 35

The PMW Algorithm [Hardt & Rothblum 2010]

Maintain a distribution D on the universe U. This is the state, and it is completely public!

- Initialize D to be uniform on U.
- Repeat up to k times:
  - Set a noisy threshold T̂ ← T + Lap(·).
  - Repeat while no update occurs:
    - Receive query q ∈ Q.
    - Let â ← x(q) + Lap(·), where x(q) is the true value.
    - Test: if |q(D) − â| ≤ T̂, output q(D).
    - Else (update): output â, update D[i] ∝ D[i] · e^{±(T/4)·q[i]}, and re-weight. The plus or minus is according to the sign of the error.
- The algorithm fails if more than k updates occur.

Multiplicative Weights: a powerful tool in algorithm design. Learn a probability distribution iteratively; in each round, either the current distribution is good, or we get a lot of information about the distribution and update it.
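A simplified, self-contained Python sketch of the private multiplicative weights idea described on this slide (the threshold, noise scale, learning rate, and toy data below are placeholder choices of mine; the real algorithm calibrates them to n, ε, δ, and the number of allowed updates, and also adds noise to the threshold itself):

```python
import numpy as np

rng = np.random.default_rng(0)

def pmw(x_hist, queries, T=0.05, noise_scale=0.01, max_updates=20, eta=0.5):
    """Answer a stream of fractional counting queries over a histogram
    with the private-multiplicative-weights approach.

    x_hist      : true data distribution over the universe (sums to 1)
    queries     : list of 0/1 vectors over the universe
    T           : accuracy threshold for the public estimate
    noise_scale : scale of the Laplace noise added to true answers
    max_updates : the algorithm fails after this many updates (the 'k' on the slide)
    eta         : multiplicative-weights learning rate (the slide uses ~T/4)
    """
    U = len(x_hist)
    D = np.full(U, 1.0 / U)          # public state: uniform distribution on U
    answers, updates = [], 0

    for q in queries:
        true_ans = float(q @ x_hist)                   # the true value x(q)
        noisy_ans = true_ans + rng.laplace(scale=noise_scale)
        est = float(q @ D)                             # answer from the public state
        if abs(est - noisy_ans) <= T:                  # test: public state is good enough
            answers.append(est)
        else:                                          # update: learn from the error
            answers.append(noisy_ans)
            sign = 1.0 if noisy_ans > est else -1.0
            D *= np.exp(sign * eta * q)                # multiplicative-weights step
            D /= D.sum()                               # re-weight (normalize)
            updates += 1
            if updates > max_updates:
                raise RuntimeError("too many updates; algorithm fails")
    return answers

# Toy example: universe of 8 items, a true distribution, and a few queries.
x = np.array([0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05])
qs = [np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float),
      np.array([0, 0, 0, 1, 1, 1, 1, 1], dtype=float),
      np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=float)]
print(pmw(x, qs))
```

Note that privacy is only consumed on updates (and the slide caps those at k), which is what lets the mechanism answer far more queries than the database size.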

Slide 36

Privacy and Security
- Connection to Tracing Traitors
- Deduplication in cloud storage
- Collaborative security

Slide 37

Issues
- How big can epsilon be? 0.01? 0.1? 10? It may add up over a lifetime...
  Possible answer: design your system so that epsilon is bounded at all. This already removes some pernicious attacks.
- What level are we protecting? User level or event level? Gene or individual?

Slide 38

Applications/Implementations of Differential Privacy
- Census Bureau OnTheMap: gives researchers access to agency data
- Google's RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response (open source); enabled collection of data Chrome avoided before
- Tackling urban mobility with aggregate traffic data
- Apple: big news coverage, commitment to privacy
- Applications to multiple hypothesis testing
- How to hack Kaggle competitions

Global vs. local

Slide 39

Public Policy
- California Public Utilities Commission: smart meters
- Interpreting and implementing FERPA (the Family Educational Rights and Privacy Act) via Differential Privacy [Nissim and Wood]
- Understand how DP fits with the existing regulatory framework. Problem: the regulatory framework is not mathematically precise, and the idea of de-identification is hard-wired into it.

Slide 40

Challenges
- Small datasets
- Massive composition; global epsilon: event level vs. user level
- Working in conjunction with Secure Function Evaluation
- Winning the hearts and minds of policy makers...

Slide 41

Winning the hearts and minds of policy makers...
- Widen the scope of implementation and use; identify the next good use cases for DP.
- Construct DP tools that best match the practices and education of users; explain the shortcomings of other methods and the benefits of DP.
- Figure out how DP works as one of the layers in a suite of privacy protections.
- DP is less straightforward and intuitive than anonymity/de-identification and its variants.