The Promise of Differential Privacy
Cynthia Dwork, Microsoft Research
Presentation Transcript

The Promise of Differential Privacy. Cynthia Dwork, Microsoft Research.

On the Primacy of Definitions: Learning from History

Pre-Modern Cryptography: propose a scheme, break the scheme, propose another, break that one, ...

Modern Cryptography: propose a definition; break the definition; propose a STRONGER definition; break that definition; ... until algorithms satisfying the definition are found.


No Algorithm? Propose a definition, and no algorithm satisfying it is found. Why?

Provably No Algorithm? Then it was a bad definition. Propose a WEAKER or DIFFERENT definition; perhaps an algorithm exists for it, perhaps provably not.

Getting Started: Model, Motivation, Definition

The Model. The database is a collection of rows, one per person in the database. The adversary/user and the curator are computationally unbounded. All users are part of one giant adversary: the "curator against the world."

Difficult even if the curator is an angel and the data are in a vault: the "pure" privacy problem.

Typical Suggestions
- "Large set" queries only: How many MSFT employees have the Sickle Cell Trait (SCT)? How many MSFT employees who are not female Distinguished Scientists with very curly hair have SCT?
- Add random noise to the true answer: the average of responses to repeated queries converges to the true answer, and repetition cannot simply be detected (undecidable).
- Detect when answering is unsafe: refusal can itself be disclosive.

A Litany

William Weld's Medical Record [S02]
Voter registration data: ZIP, birth date, sex, name, address, date registered, party affiliation, last voted.
HMO data: ZIP, birth date, sex, ethnicity, visit date, diagnosis, procedure, medication, total charge.
The shared fields (ZIP, birth date, sex) link the two datasets.

AOL Search History Release (2006). Name: Thelma Arnold. Age: 62, widow. Residence: Lilburn, GA. Heads rolled.

Subsequent challenge abandoned

GWAS Membership [Homer et al. '08]. SNP: single-nucleotide polymorphism (A, C, G, T). Reference population: major allele (C) 94%, minor allele (T) 6%. A genome-wide association study reports allele frequencies for many thousands of SNPs. NIH-funded studies pulled such data from public view.

Definitional Failures. Failure to cope with auxiliary information: existing and future databases, newspaper reports, Flickr, literature, etc. The definitions are syntactic. Dalenius's ad omnia guarantee (1977): anything that can be learned about a respondent from the statistical database can be learned without access to the database.

Dalenius's ad omnia guarantee (1977), that anything learnable about a respondent from the statistical database is learnable without access to the database, is unachievable in useful databases [D., Naor '06]. Example: I'm from Mars; my (incorrect) prior is that everyone has two left feet. The database teaches that almost everyone has one left and one right foot. Provably no algorithm!

Databases that Teach. The database teaches that smoking causes cancer; smoker S's insurance premiums rise. This is true even if S is not in the database! Learning that smoking causes cancer is the whole point: smoker S enrolls in a smoking-cessation program. Differential privacy: limit harms to the teachings, not to participation. The outcome of any analysis is essentially equally likely, independent of whether any individual joins, or refrains from joining, the dataset.

Differential Privacy [D., McSherry, Nissim, Smith '06]. M gives (ε, 0)-differential privacy if for all adjacent x and x', and all C ⊆ Range(M): Pr[M(x) ∈ C] ≤ e^ε · Pr[M(x') ∈ C]. The ratio of the response probabilities is bounded. Neutralizes all linkage attacks. Composes unconditionally and automatically: Σᵢ εᵢ.

(ε, δ)-Differential Privacy. M gives (ε, δ)-differential privacy if for all adjacent x and x', and all C ⊆ Range(M): Pr[M(x) ∈ C] ≤ e^ε · Pr[M(x') ∈ C] + δ. Neutralizes all linkage attacks. Composes unconditionally and automatically: (Σᵢ εᵢ, Σᵢ δᵢ). This talk: δ negligible.

Equivalently, for every t ∈ Range(M), the "privacy loss" ln( Pr[M(x) = t] / Pr[M(x') = t] ) is bounded in absolute value by ε.

Privacy by Process: Randomized Response [Warner '65]

Did You Have Sex Last Night? Flip a coin. Heads: flip again and respond "Yes" if heads, "No" if tails. Tails: answer honestly. Analysis: Pr[say "Y" | truth = Y] / Pr[say "Y" | truth = N] = 3, and Pr[say "N" | truth = N] / Pr[say "N" | truth = Y] = 3. Privacy is by process: "plausible deniability."

Did You Have Sex Last Night? Randomized response is (ln 3)-differentially private: the log of the ratio of the probabilities of seeing any answer, as the truth varies, is at most ln 3, since Pr[say "Y" | truth = Y] / Pr[say "Y" | truth = N] = 3 and Pr[say "N" | truth = N] / Pr[say "N" | truth = Y] = 3.
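The coin-flip protocol above can be sketched in a few lines of Python (the function names and the debiasing helper are mine, not from the talk):

```python
import random

def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Flip a coin. Heads: flip again and report that second flip.
    Tails: report the truth."""
    if rng.random() < 0.5:              # heads: answer is a fresh coin flip
        return rng.random() < 0.5
    return truth                        # tails: answer honestly

def estimate_fraction(answers) -> float:
    """Debias: E[observed "yes" rate] = 1/4 + p/2, so p_hat = 2*observed - 1/2."""
    observed = sum(answers) / len(answers)
    return 2.0 * observed - 0.5

rng = random.Random(0)
trials = 100_000
p_yes_given_yes = sum(randomized_response(True, rng) for _ in range(trials)) / trials
p_yes_given_no = sum(randomized_response(False, rng) for _ in range(trials)) / trials
ratio = p_yes_given_yes / p_yes_given_no   # empirically close to 3 = e^(ln 3)
```

The empirical ratio approximates the e^ε = 3 bound from the slide's analysis.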

Different Bang, Same Buck: Finding a "Stable" Algorithm Can Be Hard

DP: A Definition, Not an Algorithm. Many randomized algorithms for the same task provide ε-DP. The discovery that one method works poorly for your problem is only that: others may work better.

Randomized Response in Our Setting. Q = what fraction had sex? The curator C randomizes the response in each record and releases the fraction of 1's. Call this Algorithm 1.

Sensitivity of a Function. Adjacent databases differ in at most one row. Counting queries have sensitivity 1. Sensitivity captures how much one person's data can affect the output: Δf = max over adjacent x, x' of |f(x) − f(x')|.

The Laplace Distribution Lap(b): p(z) = exp(−|z|/b)/(2b); variance = 2b²; standard deviation σ = √2·b. Increasing b flattens the curve.

Calibrate Noise to Sensitivity. Δf = max over adjacent x, x' of |f(x) − f(x')|. Theorem [DMNS06]: On query f, to achieve ε-differential privacy, it suffices to add symmetric noise drawn from Lap(Δf/ε). The noise depends on Δf and ε, not on the database. Smaller sensitivity Δf means less distortion; better privacy (smaller ε) means more distortion.
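A minimal sketch of the Laplace mechanism (the inverse-CDF sampler and names are mine):

```python
import math
import random

def sample_laplace(b: float, rng: random.Random) -> float:
    """Draw one sample from Lap(b) by inverting the CDF."""
    u = rng.random() - 0.5
    return -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(true_answer: float, sensitivity: float,
                      epsilon: float, rng: random.Random) -> float:
    """Release f(x) + Lap(sensitivity/epsilon) -- epsilon-DP by the theorem."""
    return true_answer + sample_laplace(sensitivity / epsilon, rng)

# A counting query (sensitivity 1) answered with epsilon = 0.5:
rng = random.Random(1)
noisy = laplace_mechanism(true_answer=1234.0, sensitivity=1.0, epsilon=0.5, rng=rng)
```

Note that nothing in the noise calibration looks at the database itself, only at Δf and ε, exactly as the theorem states.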

Example: Counting Queries. How many people in the database had sex? Sensitivity = 1, so it suffices to add Lap(1/ε) noise. Fractional version: the fraction has sensitivity 1/n, so it suffices to add Lap(1/(nε)) noise. Call this Algorithm 2.

Two ε-DP Algorithms. Algorithm 1 (randomized response): error on the order of 1/√n. Algorithm 2 (Laplace mechanism): error on the order of 1/(εn). Algorithm 2 is better than Algorithm 1.

Vector-Valued Queries. Δf = max over adjacent x, x' of ||f(x) − f(x')||₁. Theorem [DMNS06]: On query f, to achieve ε-differential privacy, it suffices to add symmetric noise [Lap(Δf/ε)]^d, i.e., independent Lap(Δf/ε) noise in each of the d coordinates. The noise depends on Δf and ε, not on the database. Smaller sensitivity Δf means less distortion; better privacy (smaller ε) means more distortion.

Example: Histograms. Δf = max over adjacent x, x' of ||f(x) − f(x')||₁. Theorem: to achieve ε-differential privacy, it suffices to add symmetric noise [Lap(Δf/ε)]^d.
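The histogram case can be sketched as follows (a sketch under the assumption of add/remove adjacency, where one person changes one cell by 1 and the L1 sensitivity is 1; with replace adjacency it would be 2):

```python
import math
import random
from collections import Counter

def sample_laplace(b: float, rng: random.Random) -> float:
    u = rng.random() - 0.5
    return -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_histogram(rows, cells, epsilon, rng):
    """Add independent Lap(1/epsilon) noise to every cell. Empty cells are
    noised too: releasing exact zeros would itself leak information."""
    counts = Counter(rows)
    return {c: counts.get(c, 0) + sample_laplace(1.0 / epsilon, rng)
            for c in cells}

rng = random.Random(7)
rows = ["A"] * 500 + ["B"] * 300 + ["C"] * 200
hist = dp_histogram(rows, cells=["A", "B", "C", "D"], epsilon=1.0, rng=rng)
```

The per-cell noise scale is independent of d: all d counting queries are answered at once for the price of one.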

Why Does It Work? Δf = max over x, Me of ||f(x + Me) − f(x − Me)||₁. Writing f⁻ = f(x − Me) and f⁺ = f(x + Me):
Pr[M(f, x − Me) = t] / Pr[M(f, x + Me) = t] = exp( −( ||t − f⁻||₁ − ||t − f⁺||₁ ) / b ) ≤ exp(Δf/b),
which is at most e^ε when b = Δf/ε. Theorem: to achieve ε-differential privacy, it suffices to add symmetric noise [Lap(Δf/ε)]^d.

Composition. "Simple": the k-fold composition of (ε, δ)-differentially private mechanisms is (kε, kδ)-differentially private. Advanced: rather than kε, the k-fold composition of ε-dp mechanisms is roughly (√(2k ln(1/δ'))·ε + kε(e^ε − 1), δ')-dp. What is Bob's lifetime exposure risk? E.g., 10,000 ε-dp or (ε, δ)-dp databases, for a lifetime cost of (1, δ')-dp. What should the value of ε be? 1/801. OMG, that is small! Can we do better?
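The arithmetic behind the slide's ε = 1/801 can be checked directly. The choice δ' = e⁻³² below is my assumption (the slide does not state it); it makes √(2k ln(1/δ')) = 800, so the advanced bound lands near a lifetime cost of 1:

```python
import math

def simple_composition(epsilon: float, delta: float, k: int):
    """k-fold composition is (k*eps, k*delta)-dp."""
    return k * epsilon, k * delta

def advanced_composition(epsilon: float, k: int, delta_prime: float):
    """Advanced composition: k-fold composition of eps-dp mechanisms is
    (sqrt(2k ln(1/delta')) * eps + k * eps * (e^eps - 1), delta')-dp."""
    eps_total = (math.sqrt(2.0 * k * math.log(1.0 / delta_prime)) * epsilon
                 + k * epsilon * math.expm1(epsilon))
    return eps_total, delta_prime

# Lifetime exposure over k = 10,000 databases, each eps = 1/801.
k, eps = 10_000, 1 / 801
simple_eps, _ = simple_composition(eps, 0.0, k)            # about 12.5
adv_eps, _ = advanced_composition(eps, k, math.exp(-32.0)) # about 1
```

So the advanced bound buys roughly a factor of √k over naive summation for the same privacy budget.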

Hugely Many Queries, Single Database

Counting queries vs. arbitrary low-sensitivity queries, offline vs. online (omitting polylog terms in various things, some of them big):
- Counting queries, offline: error [Blum-Ligett-Roth '08] with runtime exponential in |U| (ε-dp); error [D.-Rothblum-Vadhan '10] with runtime exp(|U|).
- Counting queries, online: error [Hardt-Rothblum '10] with runtime polynomial in |U|.
- Arbitrary low-sensitivity queries: error [Hardt-Rothblum] with runtime exp(|U|).

Discrete-Valued Functions. Outputs are strings, experts, small databases, .... Each output y has a utility for the database x, denoted u(x, y). Exponential Mechanism [McSherry-Talwar '07]: output y with probability proportional to exp( ε·u(x, y) / (2Δu) ).
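A sketch of the exponential mechanism, with a toy utility of my own choosing (the letter's count in the database, which has sensitivity 1):

```python
import math
import random

def exponential_mechanism(x, outputs, utility, sensitivity, epsilon, rng):
    """Sample y with probability proportional to
    exp(epsilon * u(x, y) / (2 * sensitivity))."""
    weights = [math.exp(epsilon * utility(x, y) / (2.0 * sensitivity))
               for y in outputs]
    return rng.choices(list(outputs), weights=weights, k=1)[0]

# Toy use: privately pick the most common letter in the database.
rng = random.Random(3)
x = list("aaaaabbc")
winner = exponential_mechanism(x, outputs="abcd",
                               utility=lambda db, y: db.count(y),
                               sensitivity=1.0, epsilon=2.0, rng=rng)
```

High-utility outputs are exponentially favored, yet any output (even "d", which never appears) retains nonzero probability; that is where the privacy comes from.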

Exponential Mechanism Applied. Many (fractional) counting queries [Blum, Ligett, Roth '08]: given an n-row database x and a set Q of properties, produce a synthetic database y giving good approximations to "What fraction of rows of x satisfy property P?" for every P in Q. The output space is the set of all "small" databases (size given by sampling-error bounds).

Non-Trivial Accuracy with ε-DP. Stateless mechanism vs. stateful mechanism. Barrier at [D., Naor, Vadhan].

To handle hugely many databases, one must introduce coordination. Non-trivial accuracy with ε-DP: independent mechanisms vs. a stateful mechanism.

Two Additional Techniques + An application that combines them

Functions "Expected" to Behave Well: Propose-Test-Release [D.-Lei '09]. A privacy-preserving test for "goodness" of the data set, e.g., low local sensitivity [Nissim-Raskhodnikova-Smith '07]. Robust statistics theory: lack of density at the median is the only thing that can go wrong. PTR for the median: dp test for low sensitivity (equivalently, for high density); if good, release the median with low noise; else output ⊥ (or use a more sophisticated dp median algorithm).

High/Unknown-Sensitivity Functions: Subsample-and-Aggregate [Nissim, Raskhodnikova, Smith '07]
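A sketch of subsample-and-aggregate with a clamped-mean dp aggregator. The aggregator choice and all parameters here are mine; the applications on the following slides use a different aggregator (a vote for the most common value):

```python
import math
import random

def sample_laplace(b: float, rng: random.Random) -> float:
    u = rng.random() - 0.5
    return -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def subsample_and_aggregate(data, f, num_blocks, epsilon, lo, hi, rng):
    """Split the data into disjoint blocks, run f (of arbitrary, even
    unknown, sensitivity) on each block, then combine the block answers
    with a dp aggregator. One person lands in exactly one block, so the
    clamped mean of block answers has sensitivity (hi - lo)/num_blocks."""
    rng.shuffle(data)
    blocks = [data[i::num_blocks] for i in range(num_blocks)]
    answers = [min(max(f(block), lo), hi) for block in blocks]
    mean = sum(answers) / num_blocks
    return mean + sample_laplace((hi - lo) / (num_blocks * epsilon), rng)

# Toy use: a dp estimate of a mean via nonprivate per-block means.
rng = random.Random(5)
data = [rng.gauss(10.0, 2.0) for _ in range(4000)]
est = subsample_and_aggregate(data, f=lambda b: sum(b) / len(b),
                              num_blocks=100, epsilon=1.0,
                              lo=0.0, hi=20.0, rng=rng)
```

The point is that f is treated as a black box: privacy comes entirely from the disjointness of the blocks and the dp aggregation step.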

Application: Feature Selection. Run the selection procedure on each subsample. If the collection of answers is "far" from any collection with no large majority value, then output the most common value. Else quit.

Application: Model Selection. Same scheme: if the collection of subsample answers is "far" from any collection with no large majority value, then output the most common value. Else quit.

A Few of Many Future Directions
- Efficiency for handling hugely many queries; time complexity (counting queries): connection to Tracing Traitors; sample complexity / database size
- Differentially private analysis of social networks?
- Is DP the right definition? At what granularity? What do we want to compute? Is there an alternative to dp? An axiomatic approach?
- Focus on a specific application (data mining!): a collaborative effort with domain experts
- What can be proved about S&A for feature/model selection?

Thank You!