Using Data Privacy for Better Adaptive Predictions

Uploaded On 2017-01-17


Vitaly Feldman IBM Research Almaden Foundations of Learning Theory 2014 Cynthia Dwork Moritz Hardt Omer Reingold Aaron Roth MSR SVC IBM Almaden ID: 510734




Presentation Transcript

Slide1

Using Data Privacy for Better Adaptive Predictions

Vitaly Feldman IBM Research – Almaden

Foundations of Learning Theory, 2014

Cynthia Dwork (MSR SVC), Moritz Hardt (IBM Almaden), Omer Reingold (MSR SVC), Aaron Roth (UPenn, CS)

Slide2

Statistical inference

Genome Wide Association Studies

Given: DNA sequences with medical records

Discover: Find SNPs associated with diseases

Predict chances of developing some condition

Predict drug effectiveness

Hypothesis testing

Given: samples […] drawn i.i.d. from unknown distribution […] over […]

Output solution […] of value […] with a guarantee that […] w.p. […] (over […])

 Slide3

Existing approaches

Theoretical ML

Uniform convergence bounds for the solution class: for every […] of complexity […], […] w.p. […]

Output stability-based bounds

But often too loose in practice (do not exploit additional structure) and complicated to derive

Practical ML

Cross-validation

Statistics

Model-specific fit and significance tests

Bootstrapping etc.

 Slide4

Real world is interactive

Outcomes of analyses inform future manipulations on the same data

Exploratory data analysis

Model selection

Feature selection

Hyper-parameter tuning

Public data - findings inform others

Samples are no longer i.i.d.!

Slide5

Is the issue real?

Freedman’s paradox (1983):

Data: […]

Throw away uncorrelated variables: […]

Perform least squares regression over: […]
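The effect can be reproduced in a small simulation. The sketch below is only an illustration under assumed settings: the sizes `n=100` and `d=50`, the correlation cutoff `keep_threshold`, and the function name `freedman_demo` are hypothetical choices, not values taken from the slide. All predictors are pure noise, yet screening and fitting on the same data typically yields a fit that looks meaningful in-sample and vanishes on fresh data.

```python
import numpy as np


def freedman_demo(n=100, d=50, keep_threshold=0.1, seed=0):
    """Screen pure-noise predictors on the data, then regress on the survivors."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))  # predictors: pure noise
    y = rng.standard_normal(n)       # response: independent of X

    # Step 1: keep only variables whose sample correlation with y exceeds the cutoff.
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    kept = np.flatnonzero(np.abs(corr) > keep_threshold)

    # Step 2: least squares on the kept variables, using the same data.
    beta, *_ = np.linalg.lstsq(Xc[:, kept], yc, rcond=None)
    r2_same = 1.0 - np.sum((yc - Xc[:, kept] @ beta) ** 2) / np.sum(yc ** 2)

    # Reference point: apply the fitted coefficients to fresh data.
    X_new = rng.standard_normal((n, d))
    y_new = rng.standard_normal(n)
    resid_new = y_new - X_new[:, kept] @ beta
    r2_fresh = 1.0 - np.sum(resid_new ** 2) / np.sum((y_new - y_new.mean()) ** 2)
    return len(kept), r2_same, r2_fresh


if __name__ == "__main__":
    k, r2_same, r2_fresh = freedman_demo()
    print(f"kept {k} of 50 noise variables")
    print(f"R^2 on the data used for selection: {r2_same:.2f}")
    print(f"R^2 on fresh data: {r2_fresh:.2f}")
```

Comparing the two printed R² values shows how much of the apparent fit comes from reusing the data for both selection and estimation.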

 

“Such practices can distort the significance levels of conventional statistical tests. The existence of this effect is well known, but its magnitude may come as a surprise, even to a hardened statistician.”

Slide6

competitions

[Figure: data split into a public part and a private part, producing a public score and a private score]

http://www.rouli.net/2013/02/five-lessons-from-kaggles-event.html

“If you based your model solely on the data which gave you constant feedback, you run the danger of a model that overfits to the specific noise in that data.” – Kaggle FAQ.

Slide7

Adaptive statistical queries

[…] is the tolerance of the query […]

With probability […] for all […]

[Diagram: learning algorithm(s) interacting with an SQ oracle]

[K93, FGRVX13]
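A statistical query oracle in the sense of [K93] can be sketched as follows. This is a minimal illustration: the class name `EmpiricalSQOracle`, the helper `is_valid`, and the choice of queries with values in [0, 1] are assumptions for the example. The oracle answers with the exact empirical mean, which is the baseline the talk argues is risky under adaptivity.

```python
import numpy as np


class EmpiricalSQOracle:
    """Answers each statistical query with the empirical mean over a fixed sample."""

    def __init__(self, samples):
        self.samples = list(samples)

    def answer(self, phi):
        # phi maps a single sample point to a number in [0, 1].
        return float(np.mean([phi(x) for x in self.samples]))


def is_valid(answer, true_expectation, tau):
    """An answer is tau-valid if it is within tau of the true expectation of the query."""
    return abs(answer - true_expectation) <= tau


if __name__ == "__main__":
    oracle = EmpiricalSQOracle([0, 1, 1, 0])
    v = oracle.answer(lambda x: x)
    print(v, is_valid(v, true_expectation=0.5, tau=0.1))
```

A learning algorithm that only calls `answer` never touches the samples directly, which is what allows the oracle to be replaced by a noisier, privacy-preserving one.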

 

Can measure error/performance and test hypotheses

Can be used in place of samples in most algorithms!

Slide8

SQ algorithms

PAC learning algorithms (except parities)

Convex optimization (Ellipsoid, iterative methods)

Expectation maximization (EM)

SVM (with kernel)

PCA

ICA

ID3

k-means

Method of moments

MCMC

Naïve Bayes

Neural Networks (backprop)

Perceptron

Nearest neighbors

Boosting

[K 93, BDMN 05, CKLYBNO 06, FPV 14]

Slide9

For a query […] respond with […]

How many samples are needed to answer […] queries?

If […] are fixed then […]

With 1 round of adaptivity (constant […] and […])?
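For the case of queries fixed in advance, the standard calculation combines Hoeffding's inequality with a union bound. The notation below is assumed for illustration and is not taken from the slide: $m$ queries $\phi_i$ with values in $[0,1]$, tolerance $\tau$, failure probability $\beta$, and $n$ i.i.d. samples $S$ from distribution $D$.

```latex
\Pr_{S \sim D^n}\Big[\exists\, i \le m:\ \big|\mathbb{E}_S[\phi_i] - \mathbb{E}_D[\phi_i]\big| > \tau\Big]
\;\le\; \sum_{i=1}^{m} 2e^{-2n\tau^2}
\;=\; 2m\,e^{-2n\tau^2}
\;\le\; \beta
\quad\text{whenever}\quad
n \;\ge\; \frac{\ln(2m/\beta)}{2\tau^2}.
```

The argument relies on each $\phi_i$ being independent of $S$, which is exactly what fails once later queries depend on earlier answers.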

 

Naïve answering: Chernoff bound + union bound

Slide10

Our result

There exists an algorithm that can answer […] adaptively chosen SQs such that with probability […] the answers are […]-valid using […]

The algorithm runs in time […]

Also: […]

Cannot be achieved efficiently: […] lower bound for poly-time algorithms under crypto assumptions [HU14]

 Slide11

Fresh samples

Data set analyzed differentially privately

Slide12

Privacy-preserving data analysis

How to get utility from data while preserving privacy of individuals

DATA

Slide13

Differential Privacy

Each sample point is created from personal data of an individual (GTTCACG…TC, “YES”)

Differential Privacy [DMNS06]

(Randomized) algorithm A is […]-differentially private if for any two data sets […] such that […]:

If […] then […]
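For reference, the standard definition from [DMNS06] can be written as below. The symbols $S$, $S'$, $Z$ and $\varepsilon$ are assumed notation and may differ from the slide's.

```latex
A \text{ is } \varepsilon\text{-differentially private if for all data sets } S, S'
\text{ that differ in a single element and all } Z \subseteq \mathrm{range}(A):
\qquad
\Pr[A(S) \in Z] \;\le\; e^{\varepsilon} \cdot \Pr[A(S') \in Z].
```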

 Slide14

Properties of DP

Privacy has a price

Minimum data set size usually scales as […]

Composable adaptively:

If […] is […]-DP and […] is […]-DP then […] is […]-DP

Or better [DRV 10]: For every […] and […], composition of […] […]-DP algorithms is […]
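The two composition rules can be compared numerically. The sketch below uses the standard textbook forms: privacy parameters add up under basic composition, and the advanced composition theorem of [DRV 10] gives $\varepsilon' = \varepsilon\sqrt{2k\ln(1/\delta')} + k\varepsilon(e^{\varepsilon}-1)$ at the cost of an additional failure probability $\delta'$. The function names and the parameter names `k` and `delta_prime` are hypothetical, and the slide may state the bound in a different (for example asymptotic) form.

```python
import math


def basic_composition(epsilon, k):
    """k adaptively composed epsilon-DP algorithms are (k * epsilon)-DP."""
    return k * epsilon


def advanced_composition(epsilon, k, delta_prime):
    """Epsilon' such that the k-fold composition is (epsilon', delta_prime)-DP."""
    return (math.sqrt(2 * k * math.log(1 / delta_prime)) * epsilon
            + k * epsilon * (math.exp(epsilon) - 1))


if __name__ == "__main__":
    eps, k, delta_prime = 0.1, 100, 1e-6
    print("basic:   ", basic_composition(eps, k))
    print("advanced:", advanced_composition(eps, k, delta_prime))
```

For small `epsilon` and large `k` the advanced bound grows roughly like the square root of `k` rather than linearly, which is what makes many adaptive queries affordable.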

 Slide15

DP implies generalization

[…] is a loss function

[…] an […]-DP algorithm such that […]

For all […] over […]: […]

DP composition implies that DP-preserving algorithms can reuse data adaptively

Slide16

 

Proof

For […] and […] let […] be […] with […]-th element replaced by […]. By […]-DP: […]

Taking expectation over […]: […]

 Slide17

Counting queries

Counting query on a data set […]

For function […], value […]

DP algorithms for approximately answering counting queries have been actively studied for ~10 years

[Diagram: data analyst(s) sending queries to, and receiving answers from, a query release algorithm]
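A counting query and the simplest differentially private way to answer it, the Laplace mechanism, can be sketched as follows. This is an illustration under assumptions: the query is taken to be the average of a function with values in [0, 1], the names `counting_query` and `laplace_answer` are hypothetical, and this is the basic single-query mechanism rather than the algorithm analyzed in the talk.

```python
import numpy as np


def counting_query(data, psi):
    """Average of psi over the data set; psi maps a data point to a value in [0, 1]."""
    return float(np.mean([psi(x) for x in data]))


def laplace_answer(data, psi, epsilon, rng):
    """Answer one counting query with epsilon-DP via the Laplace mechanism."""
    n = len(data)
    # Replacing one data point changes the average by at most 1/n,
    # so Laplace noise of scale 1/(n * epsilon) suffices for epsilon-DP.
    noise = rng.laplace(scale=1.0 / (n * epsilon))
    return counting_query(data, psi) + float(noise)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = [1, 2, 3, 4]
    print(counting_query(data, lambda x: x > 2))
    print(laplace_answer(data, lambda x: x > 2, epsilon=1.0, rng=rng))
```

The noise scale shrinks as the data set grows, so the price of privacy for a single query becomes small for large `n`.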

 

 Slide18

From private counting to SQs

Let […] be an (adaptive) query asking strategy

Let […] be an algorithm that answers counting queries of […] s.t.

[…] is […]-DP

For any data set […], w.p. […] answers are […]-accurate

Then for any […] over […], […] applied to […], w.p. […] outputs […]-valid responses to SQs of […] provided that […]

Can be extended to […]-DP with […]

 Slide19

Proof I

For […] let […] denote the […]-th query asked by […]

Depends on […], randomness of […] (and nothing else)

Let […] + […]

Union bound and accuracy of […]

For […] let […] denote the event […]

 

 Slide20

Proof II

Concentration of […]

Markov’s inequality

For […]

Slide21

Proof: moment bound

where […].

Suffices to prove for all […]

Consider […] where […].

[…]-DP approximately preserves conditional expectations

 

 Slide22

Corollaries

There exists an algorithm that can answer […] adaptively chosen SQs such that with probability […] the answers are […]-valid using […] random samples.

The algorithm runs in time […]

There exists an […]-DP algorithm that can answer […] (adaptive) counting queries such that with probability […] the answers are […]-accurate provided that […]. The algorithm runs in time […]

Also […]-DP for […]

[HR10]

Slide23

MWU + Sparse Vector

Initialize […];

For each query […]:

if […] answer with […]

else answer with […] (Laplace noise)

At most […] MWU updates

Sparse Vector Technique [DNRRV09]: privacy loss only when approximate comparison with a threshold fails
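The loop above can be sketched in code in the spirit of the private multiplicative weights approach of [HR10]. Everything specific here is an assumption for illustration: the class name `PrivateMWSketch`, the parameters `noise_scale`, `threshold` and `eta`, and the encoding of data as indices into a finite universe. The sketch shows the control flow only and does not calibrate the parameters to any privacy or accuracy guarantee.

```python
import numpy as np


class PrivateMWSketch:
    """Answer counting queries from a synthetic distribution, updating it only when it is off."""

    def __init__(self, data, universe_size, noise_scale, threshold, eta, seed=0):
        self.rng = np.random.default_rng(seed)
        counts = np.bincount(data, minlength=universe_size)
        self.true_hist = counts / counts.sum()                    # empirical distribution of the data
        self.synth = np.full(universe_size, 1.0 / universe_size)  # start from the uniform distribution
        self.noise_scale = noise_scale
        self.threshold = threshold
        self.eta = eta
        self.updates = 0

    def answer(self, query):
        # query[u] in [0, 1] is the query's value on universe element u.
        query = np.asarray(query, dtype=float)
        estimate = float(query @ self.synth)
        noisy = float(query @ self.true_hist) + float(self.rng.laplace(scale=self.noise_scale))

        # Approximate comparison with a threshold: if the synthetic estimate is
        # close enough to the noisy answer, respond without using the noisy value.
        if abs(noisy - estimate) <= self.threshold:
            return estimate

        # Otherwise respond with the noisy answer and do one multiplicative
        # weights update that moves the synthetic distribution toward it.
        sign = 1.0 if noisy > estimate else -1.0
        self.synth = self.synth * np.exp(sign * self.eta * query)
        self.synth /= self.synth.sum()
        self.updates += 1
        return noisy


if __name__ == "__main__":
    pmw = PrivateMWSketch(data=[0, 0, 0, 0], universe_size=4,
                          noise_scale=0.01, threshold=0.2, eta=0.5)
    print(pmw.answer([1, 0, 0, 0]), pmw.updates)
```

The counter `updates` tracks how often the comparison fails, which is the quantity the slide bounds and the only point at which, per the sparse vector technique, privacy loss is incurred.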

 Slide24

Threshold validation queries

Threshold SQ: […]

[…]-valid response: YES, […]

There exists an algorithm that can answer […] adaptively chosen threshold SQs such that with probability […] the answers are […]-valid as long as at most […] comparisons failed, using […] random samples.

The algorithm runs in time […]

 Slide25

Applications

 

 

 

 

 

 

[Diagram: DATA split into a working set and a validation set; learning algorithm(s); SQ oracle]

 Slide26

Conclusions

Adaptive data manipulations can cause overfitting/false discovery

Theoretical model of the problem based on SQs

Using exact empirical means is risky

DP provably preserves “freshness” of samples: adding noise can provably prevent overfitting

In applications not all data must be used with DP

Slide27

Future work

Better (direct) bounds on […]?

Is […] necessary?

Better dependence on […]?

Efficient algorithms for special cases?

Practical implementations?

 

THANKS!