Constantinos Daskalakis - PowerPoint Presentation
CSAIL and EECS, MIT. Greek Stochastics. "Statistics vs Big Data"

Presentation Transcript

Slide 1

Constantinos Daskalakis, CSAIL and EECS, MIT
Greek Stochastics

Statistics vs Big Data

Slide 2

YOU WANT BIG DATA? I'LL GIVE YOU BIG DATA!

Slide 3

Human genome: 40 exabytes of storage by 2025
SKA Telescope: 1 exabyte daily
Facebook: 20 petabytes of images daily
BIG Data, small samples

Slide 4

High-dimensional, expensive data: DNA microarrays, computer vision, financial records, experimental drugs

Slide 5

What properties do your BIG distributions have?

Slide 6

e.g. 1: Play the lottery? Is it uniform?

Slide 7

Is the lottery unfair? From Hitlotto.com: "Lottery experts agree, past number histories can be the key to predicting future winners."

Slide 8

True story! The Polish lottery Multilotek chooses 20 distinct numbers "uniformly" at random out of 1 to 80. The initial machine was biased: e.g., the probability of numbers in 50–59 was too small.

Slide 9

Thanks to Krzysztof Onak (pointer) and Eric Price (graph)

Slide 10

New Jersey Pick k (k = 3, 4) Lottery: pick k random digits in order; 10^k possible values.
Data:
Pick 3: 8,522 results from 5/22/75 to 10/15/00. A χ²-test (on Excel) answers "42% confidence".
Pick 4: 6,544 results from 9/1/77 to 10/15/00. With fewer results than possible values, it is not a good idea to run a χ² test: convergence of the χ² statistic to the χ² distribution won't kick in for small sample sizes. The (textbook) rule of thumb: an expected number of at least 5 observations of each element in the domain under the null hypothesis is needed to run a χ² test.

Slide 11
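The χ² computation on the slide can be sketched as follows, on synthetic draws (the actual NJ Pick-3 counts are not reproduced here; the seed and sizes are illustrative):

```python
# Sketch: Pearson's chi-squared statistic for lottery draws against the
# uniform null, on synthetic data.
import numpy as np

def pearson_chi2_uniform(counts):
    """Pearson chi-squared statistic of observed counts vs the uniform null.

    Returns (statistic, degrees_of_freedom); compare the statistic against
    the chi-squared distribution with that many degrees of freedom.
    """
    counts = np.asarray(counts, dtype=float)
    m = counts.sum()          # number of draws
    n = counts.size           # domain size (10**3 for Pick 3)
    expected = m / n          # expected count per value under uniformity
    stat = ((counts - expected) ** 2 / expected).sum()
    return stat, n - 1

rng = np.random.default_rng(0)
draws = rng.integers(0, 1000, size=8522)      # 8,522 simulated Pick-3 draws
counts = np.bincount(draws, minlength=1000)
stat, dof = pearson_chi2_uniform(counts)
# Expected count per cell is 8522/1000 ~ 8.5, so the ">= 5 per cell" rule of
# thumb is satisfied here, unlike Pick 4, where it would be 6544/10000 < 1.
```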

e.g. 2: Independence Testing. Shopping patterns: independent of zip code?

Slide 12

e.g. 2: Linkage Disequilibrium (locus 1, locus 2, …).
Genome: single nucleotide polymorphisms. Are they independent?
Should we expect the genomes from the 1000 Genomes Project to be sufficient? Up to how many loci?
Suppose n loci with k possible states each; then the state of one's genome is an element of a domain of size kⁿ, and a population of humans yields samples from some distribution p over this domain. Question: is p a product dist'n OR far from all product dist'ns?

Slide 13

e.g. 3: Outbreak of diseases. Similar patterns in different years? More prevalent near large airports? (Flu 2005 vs Flu 2006 maps.)

Slide 14

Distributions on BIG domains. Given samples of a distribution, we need to know, e.g.:
- entropy
- number of distinct elements
- "shape" (monotone, unimodal, etc.)
- closeness to uniform, Gaussian, Zipfian, …
No assumptions on the shape of the distribution (i.e., no parametric, smoothness, monotonicity, or normality assumptions). Considered in statistics, information theory, machine learning, databases, algorithms, physics, biology, …

Slide 15

Classical Setting vs Modern Setting: old questions, new challenges.
Classical: small domain (comparatively); many samples (tosses) relative to the domain; asymptotic analysis; computation not crucial.
Modern: large domain (e.g., the domain of one human genome); new challenges: samples, computation, communication, storage.

Slide 16

A Key Question: How many samples do you need in terms of the domain size? Do you need to estimate the probabilities of each domain item? OR can the sample complexity be sublinear in the size of the domain? The latter rules out standard statistical techniques.

Slide 17

Aim: algorithms with sublinear sample complexity, at the intersection of INFORMATION THEORY, MACHINE LEARNING, STATISTICS, ALGORITHMS.

Slide 18

The Menu: Motivation · Uniformity Testing, Goodness of Fit · Problem Formulation · Testing Properties of Distributions · Discussion/Road Ahead · Testing in High Dimensions

Slides 19–20

Problem formulation.
Model: P, a family of distributions over a discrete set Σ; P may be non-parametric, e.g. unimodal, product, log-concave.
Problem: given samples from an unknown p, with probability ≥ 0.9 distinguish p ∈ P vs d(p, P) > ε.
Objective: minimize samples (sublinear in |Σ|?); computational efficiency.

Slide 21

Well-studied problem: (composite) hypothesis testing. Neyman–Pearson test, Kolmogorov–Smirnov test, Pearson's chi-squared test, generalized likelihood ratio test, …
Quantities of interest / focus: consistency; error exponents as the number of samples grows.
Asymptotic regime: results kick in when the sample size is large relative to the domain.
New questions: sublinear in the domain size? Strong control for false positives?

Slide 22

The Menu: Motivation · Uniformity Testing, Goodness of Fit · Problem Formulation · Testing Properties of Distributions · Discussion/Road Ahead · Testing in High Dimensions

Slides 23–24

Testing Fairness of a Coin. p: unknown probability of heads. Question: is p = 1/2 OR |p − 1/2| ≥ ε? Goal: toss the coin several times, deduce the correct answer w/ high constant probability. Number of samples required?
Can estimate p by tossing the coin O(1/ε²) times, then taking the empirical frequency of heads: by concentration bounds, the estimate is within ε/2 of p w/ high probability.
Are Ω(1/ε²) many samples necessary? Suppose there is a tester using m samples. Then it can distinguish one sample from the product distribution where each toss is Bernoulli(1/2) from one sample from the product distribution where each toss is Bernoulli(1/2 + ε), w/ high probability.
Claim: any tester has error probability at least (1 − d_TV(Bernoulli(1/2)^⊗m, Bernoulli(1/2 + ε)^⊗m))/2, which is bounded away from 0 unless m = Ω(1/ε²).

Slide 25
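A minimal sketch of the empirical-mean tester above; the sample bound comes from Hoeffding's inequality, and the failure probability δ = 0.05 is an illustrative choice, not from the slides:

```python
# Coin-fairness tester sketch: estimate p by the empirical frequency of
# heads; accept "fair" iff the estimate is within eps/2 of 1/2.
import math

def samples_needed(eps, delta=0.05):
    """Tosses so the empirical mean is (eps/2)-accurate w/ prob >= 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * (eps / 2.0) ** 2))

def is_fair(tosses, eps):
    """tosses: sequence of 0/1 outcomes; True iff |mean - 1/2| <= eps/2."""
    p_hat = sum(tosses) / len(tosses)
    return abs(p_hat - 0.5) <= eps / 2.0

m = samples_needed(eps=0.1)   # 738 tosses for eps = 0.1, delta = 0.05
```

Note the 1/ε² dependence in `samples_needed`, matching the lower bound on the slide.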

Testing Uniformity. p: unknown distribution over [n]; sample access to p. Question: is p = U_n or d_TV(p, U_n) > ε? [Paninski'03]: Θ(√n/ε²) samples and time.

"Intuition:"
(Lower bound) Suppose the null hypothesis is the uniform distribution over [n], and p is either uniform on [n] or uniform on a random size-n/2 subset of [n]; unless Ω(√n) samples are observed, there are (w.h.p.) no collisions, hence the two cases cannot be distinguished.
(Upper bound) Collision statistics suffice to distinguish the two cases.

Slide 26
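The collision-based upper bound can be sketched as follows. The facts used are that the expected collision rate equals ||p||₂², which is 1/n iff p is uniform and at least (1 + 4ε²)/n if p is ε-far from uniform in total variation (by Cauchy–Schwarz); the sample size needed for the statistic to concentrate is not handled here, so the threshold is only a midpoint choice:

```python
# Collision tester sketch: the expected fraction of colliding sample pairs
# equals ||p||_2^2; threshold at the midpoint (1 + 2*eps^2)/n.
from collections import Counter

def collision_statistic(samples):
    """Fraction of unordered sample pairs that take the same value."""
    counts = Counter(samples)
    m = len(samples)
    pairs = m * (m - 1) // 2
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    return collisions / pairs

def looks_uniform(samples, n, eps):
    """Accept uniformity iff the collision rate is below the threshold."""
    return collision_statistic(samples) <= (1 + 2 * eps ** 2) / n
```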

Proving Lower Bounds. [Le Cam'73]: consider two disjoint sets of distributions P₁, P₂. Suppose an algorithm is given k samples from some unknown p and claims to distinguish p ∈ P₁ vs p ∈ P₂. Then its worst-case error probability is at least (1 − d_TV(Q₁, Q₂))/2, where each Qᵢ is any distribution generating samples as follows: choose a random distribution p from Pᵢ (according to some distribution over Pᵢ), then generate k samples from p.
To prove an Ω(√n/ε²) lower bound for uniformity testing, take P₁ = {uniform on [n]} and P₂ the family from the lower-bound intuition above (e.g., distributions uniform on a random size-n/2 subset of [n]).

Slides 27–29
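The Le Cam bound invoked on these slides is standardly stated as follows (a reconstruction; the mixture notation is mine):

```latex
% Le Cam's two-point method: for any priors over P_1 and P_2, let q_i be
% the induced mixture over k-sample sequences,
%   q_i = E_{p ~ prior_i}[ p^{\otimes k} ].
% Then any test T distinguishing P_1 from P_2 errs with probability
\[
  \max_{i \in \{1,2\}} \;
  \Pr\big[\,T \text{ errs} \;\big|\; p \in \mathcal{P}_i\,\big]
  \;\ge\; \frac{1 - d_{\mathrm{TV}}(q_1, q_2)}{2}.
\]
```

Choosing the priors to make the two mixtures statistically close yields the sample-complexity lower bound.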

The Menu: Motivation · Uniformity Testing, Goodness of Fit · Problem Formulation · Testing Properties of Distributions · Discussion/Road Ahead · Testing in High Dimensions

Slide 30

Identity Testing ("goodness of fit"). p, q: distributions over [n]; q is given explicitly, and we have sample access to the unknown p. Question: is p = q or d_TV(p, q) > ε?
[Batu-Fisher-Fortnow-Kumar-Rubinfeld-White'01]… [Paninski'08, Valiant-Valiant'14]: Θ(√n/ε²) samples and time.
[w/ Acharya-Kamath'15]: a tolerant goodness of fit test with the same sample size can distinguish χ²(p, q) ≤ ε²/2 vs d_TV(p, q) > ε. (By Cauchy–Schwarz, the χ² distance dominates total variation, so this is a stronger guarantee.)

Slide 31

A new χ²-style Goodness of Fit Test. Goal: given q and sample access to p, distinguish Case 1: χ²(p, q) ≤ ε²/2 vs Case 2: d_TV(p, q) > ε.
Approach: draw Poisson(m) many samples from p; let N_i be the # of appearances of symbol i (Poissonization makes the N_i independent random variables).
Statistic: Z = Σ_i ((N_i − m q_i)² − N_i)/(m q_i).
Case 1: E[Z] is small; Case 2: E[Z] is large. Chug, chug, chug… bound the variance of Z ⇒ O(√n/ε²) samples suffice to distinguish.
Side-note: Pearson's χ² test uses the statistic Σ_i (N_i − m q_i)²/(m q_i). Subtracting N_i in the numerator gives an unbiased estimator and, importantly, may hugely decrease the variance.

Slide 32
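Under Poisson sampling the statistic above can be implemented directly; a sketch (the acceptance threshold is omitted, since the slides only discuss the statistic's mean and variance):

```python
# Modified Pearson statistic under Poissonized sampling: an unbiased
# estimator of m * chi^2(p, q), so it concentrates near 0 when p = q and
# grows when p is far from q. Subtracting N_i in the numerator removes
# the bias of Pearson's classical statistic.
import numpy as np

def modified_chi2_statistic(counts, q, m):
    """counts[i]: # appearances of symbol i among ~Poisson(m) samples from p."""
    counts = np.asarray(counts, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(((counts - m * q) ** 2 - counts) / (m * q)))
```

For counts that match q exactly the statistic is small (slightly negative, because of the subtracted N_i); a heavy deviation in any cell drives it up.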

The Menu: Motivation · Uniformity Testing, Goodness of Fit · Problem Formulation · Testing Properties of Distributions · Discussion/Road Ahead · Testing in High Dimensions

Slides 33–34

Testing Properties of Distributions. So far P = {single distribution}: restrictive, as we rarely know the hypothesis distribution exactly. Natural extension: test structural properties.
- monotonicity: "the PDF is monotone," e.g. cancer vs radiation
- unimodality: "the PDF is single-peaked," e.g. single source of disease
- log-concavity: "log of the PDF is concave"
- monotone hazard rate: "log of the survival function is concave"
- product distribution, e.g. testing linkage disequilibrium
Example question: sample access to p; is p unimodal OR is p ε-far from all unimodal distributions?

Slide 35

Testing Properties of Distributions [w/ Acharya and Kamath 2015]:
1. Testing identity, monotonicity, log-concavity, monotone hazard rate, unimodality for distributions over an ordered set of size n is doable w/ O(√n/ε²) samples and time.
2. Testing monotonicity/independence of a distribution over a d-dimensional domain is doable w/ sublinear samples and time.
Previous best for monotonicity testing: [Bhattacharya-Fisher-Rubinfeld-Valiant'11]. Previous best for independence: d = 2, worse bounds [Batu et al.'01].
All bounds above are optimal: matching lower bounds for 1 and 2 via Le Cam. Unified approach, computationally efficient tests.
N.B. Contemporaneous work of [Canonne et al.'2015] provides a different unified approach for testing structure, but their results are suboptimal.

Slide 36

A Natural Approach. Goal: given P and sample access to p, distinguish p ∈ P vs d_TV(p, P) > ε.
Step 1: choose a hypothesis h ∈ P. Step 2: test the hypothesis (how well does h fit p?).

Slide 37

A Natural Approach (cont'd). Goal: given P and sample access to p, distinguish p ∈ P vs d_TV(p, P) > ε.
A Learning-Followed-By-Testing Algorithm:
1. Learn a hypothesis h ∈ P s.t. d_TV(p, h) is small whenever p ∈ P (needs a cheap "proper learner"); if p is ε-far from P then d_TV(p, h) > ε (automatic since h ∈ P).
2. Reduce to "tolerant goodness of fit": given ε, sample access to p, and an explicit description of h, distinguish d_TV(p, h) small vs d_TV(p, h) > ε.
Problem: the tolerant tester requires an almost linear number of samples in the support size, namely Θ(n/log n) samples [Valiant-Valiant'10].
Could try investing more samples for more accurate learning, but the proper-learning complexity vs tolerant-testing complexity tradeoff does not work out to give the optimal testing complexity.

Slide 38

A Modified Approach. Goal: given P and sample access to p, distinguish p ∈ P vs d_TV(p, P) > ε.
A Learning-Followed-By-Testing Algorithm:
1. Learn a hypothesis h ∈ P s.t. χ²(p, h) ≤ ε²/2 whenever p ∈ P (needs a cheap "proper learner"); if p is ε-far from P then d_TV(p, h) > ε (automatic since h ∈ P).
2. Reduce to "tolerant goodness of fit" in χ²: given ε, sample access to p, and an explicit description of h, distinguish χ²(p, h) ≤ ε²/2 vs d_TV(p, h) > ε.
Now tolerant testing has the right complexity of O(√n/ε²).
Pertinent question: are there sublinear proper learners in χ²? We show that the χ²-learning complexity is dominated by the testing complexity for all properties of distributions we consider.

Slide 39

Tutorial: part 2

Slide 40

Summary so far: hypothesis testing in the small sample regime.
p: unknown distribution over some discrete set Σ; P: set of distributions over Σ. Given: sample access to p (i.i.d. samples → Test → Pass/Fail?). Goal: w/ prob ≥ 0.9 tell p ∈ P vs d_TV(p, P) > ε.
Properties of interest: is p uniform? unimodal? log-concave? MHR? a product measure?
All the above properties can be tested w/ O(√n/ε²) samples and time.
Unified approach based on a modified Pearson goodness of fit test: statistic Σ_i ((N_i − m q_i)² − N_i)/(m q_i).
- tight control for false positives: want to be able to both assert and reject the null hypothesis
- accommodates sublinear sample size

Slide 41

The Menu: Motivation · Uniformity Testing, Goodness of Fit · Problem Formulation · Testing Properties of Distributions · Discussion/Road Ahead · Testing in High Dimensions

Slides 42–43

Other Distances (beyond ℓ₁). So far focused on ℓ₁ (a.k.a. total variation) distance: given sample access to p, w/ prob ≥ 0.9 distinguish p ∈ P vs d_TV(p, P) > ε.
Stronger distances? [Acharya-D-Kamath]: results are actually shown for the χ² distance; should also extend to related distances.
Weaker distances? ℓ₂ is easy to test for [Goldreich-Ron], but makes less sense, e.g.:
p = (2/m, 2/m, 2/m, …, 2/m, 0, 0, 0, …, 0)
q = (0, 0, 0, …, 0, 2/m, 2/m, 2/m, …, 2/m)
Here the ℓ₁ distance is 2 (maximal), while the ℓ₂ distance is only 2/√m: disjointly supported distributions can be ℓ₂-close.

Slide 44

Tolerance. So far, focused on the non-tolerant version: given a set of distributions P and sample access to p, distinguish p ∈ P vs d_TV(p, P) > ε.
Tolerant version: distinguish d_TV(p, P) ≤ ε/2 vs d_TV(p, P) > ε. [Valiant-Valiant'10]: Θ(n/log n) samples are needed.
Tolerant version in χ²: distinguish χ²(p, P) ≤ ε²/2 vs d_TV(p, P) > ε. [w/ Acharya, Kamath'15]: O(√n/ε²) samples suffice.
Different ratios between the two thresholds change the constants in the O(·) notation.

Slide 45

Goodness of Fit. Our goodness of fit test was given an explicit distribution q and sample access to a distribution p, and was asked to test p = q vs d_TV(p, q) > ε. Sometimes both distributions are unknown, e.g.: transactions of 20–30 yr olds vs transactions of 30–40 yr olds. Same or different?

Slide 46

Goodness of Fit w/ two unknowns. Given sample access to two unknown distributions p and q (i.i.d. samples from each → Test → Pass/Fail?), distinguish p = q vs d_TV(p, q) > ε.

Slide 47

Goodness of Fit w/ two unknowns. [Batu Fortnow Rubinfeld Smith White], [P. Valiant], … [Chan Diakonikolas Valiant Valiant]: tight upper and lower bound of Θ(max(n^{2/3}/ε^{4/3}, n^{1/2}/ε²)).
Why different from the one-unknown case? Collision statistics are all that matter. Collisions on "heavy" elements can hide the collision statistics of the rest of the domain. Construct pairs of distributions where the heavy elements are identical, but the "light" elements are either identical or very different.

Slide 48

Continuous Distributions. What can we say about continuous distributions? Without extra assumptions such as smoothness of the PDF or parametric modeling, we cannot stick to hard distances (ℓ₁, χ²). Instead of restricting the distributions, let us switch distances. Can extend results if we switch to Kolmogorov distance; recall: d_K(p, q) = sup_x |F_p(x) − F_q(x)|, a comparison of CDFs, in contrast to ℓ₁'s comparison of densities. Now want to distinguish: p ∈ P vs d_K(p, P) > ε.
Claim: tolerant testing in Kolmogorov distance of any distribution property (continuous or discrete) of d-dimensional distributions can be performed from O(d/ε²) samples.
Importantly: Kolmogorov distance allows graceful scaling with the dimensionality of the data.

Slide 49

Dvoretzky–Kiefer–Wolfowitz inequality. Suppose m i.i.d. samples from a (single-dimensional) p, and let F̂_m be the resulting empirical CDF. Then: Pr[sup_x |F̂_m(x) − F_p(x)| > ε] ≤ 2e^{−2mε²}, for all ε > 0.
i.e. O(1/ε²) samples suffice to learn any single-dimensional dist'n to within ε in Kolmogorov distance.
VC Inequality: the same is true for d-dimensional distributions when #samples is at least Ω(d/ε²).
After learning in Kolmogorov, can tolerant test any property. Runtime under investigation.
Trouble: computing/approximating the Kolmogorov distance of two high-dimensional distributions is generally a hard computational question.

Slide 50
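A sketch of the empirical CDF and the resulting DKW confidence band (using Massart's tight constant 2 in front of the exponential; δ = 0.05 is an illustrative choice):

```python
# Empirical CDF plus the DKW confidence band: with m samples,
# sup_x |F_hat(x) - F(x)| <= sqrt(ln(2/delta) / (2m)) w/ prob >= 1 - delta.
import bisect
import math

def empirical_cdf(samples):
    """Return F_hat, the empirical CDF of the given 1-d samples."""
    xs = sorted(samples)
    m = len(xs)
    return lambda x: bisect.bisect_right(xs, x) / m  # fraction of samples <= x

def dkw_band(m, delta=0.05):
    """Half-width of the uniform confidence band around F_hat."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))
```

Inverting the band width gives the O(1/ε²) sample bound quoted on the slide.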

The Menu: Motivation · Uniformity Testing, Goodness of Fit · Problem Formulation · Testing Properties of Distributions · Discussion/Road Ahead · Testing in High Dimensions

Slides 51–52

Testing in High Dimensions. Already talked about testing high-dimensional distributions in Kolmogorov distance; the sample complexity is O(d/ε²). Next focus: discrete distributions, stronger distances.

Slide 53

High-Dimensional Discrete Distn's. Consider a source generating n-bit strings (e.g. 400-bit images):
0011010101 (sample 1)
0101001110 (sample 2)
0011110100 (sample 3)
…
Are the bits/pixels independent? Without further assumptions, our algorithms require a number of samples exponential in n (the domain has size 2ⁿ).
Is a source generating graphs over n nodes Erdős–Rényi? Again, exponentially many samples.
The exponential dependence on n is unsettling, but necessary: the lower bound exploits the high possible correlation among the bits. But nature is not adversarial: often high-dimensional systems have structure, e.g. Markov random fields (MRFs), graphical models (Bayes nets), etc. Testing high-dimensional distributions with combinatorial structure?
[w/ Dikkala, Kamath'16]: if the unknown p is known to be an Ising model, then polynomially many samples suffice to test independence and goodness of fit. (Extends to MRFs.)

Slides 54–55

Ising Model. Statistical physics, computer vision, neuroscience, social science. Ising model: a probability distribution defined in terms of a graph G = (V, E), with edge potentials θ_{u,v} and node potentials θ_u; state space {−1, +1}^V, with Pr[x] ∝ exp(Σ_{(u,v)∈E} θ_{u,v} x_u x_v + Σ_u θ_u x_u). High |θ|'s ⇒ strongly (anti-)correlated spins.

Slide 56
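A sketch of one step of the Glauber dynamics for the Ising model above (no node potentials; the two-node example and all parameters are illustrative):

```python
# One Glauber-dynamics step: pick a uniformly random node and resample its
# +/-1 spin from the Ising conditional given all the other spins. With no
# node potentials, Pr[x_v = +1 | rest] = 1 / (1 + exp(-2 * local_field)).
import math
import random

def glauber_step(state, neighbors, theta, rng):
    v = rng.randrange(len(state))
    # local field at v: sum of theta[{v,u}] * x_u over neighbors u of v
    field = sum(theta[frozenset((v, u))] * state[u] for u in neighbors[v])
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))
    state[v] = 1 if rng.random() < p_plus else -1
    return state

# Tiny example: one strong edge between two spins.
rng = random.Random(0)
neighbors = {0: [1], 1: [0]}
theta = {frozenset((0, 1)): 2.0}   # high |theta| -> strongly correlated spins
state = [1, -1]
for _ in range(1000):
    glauber_step(state, neighbors, theta, rng)
```

Running the chain long enough produces (approximate) samples from the model; the same dynamics reappears later in the exchangeable-pairs argument.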

Ising Model: strong vs weak ties. High |θ|'s (strong ties): "low temperature regime". Low |θ|'s (weak ties): "high temperature regime".

Slide 57

Testing Ising Models. Given: sample access to an Ising model p over {−1, +1}^V w/ n nodes, edge potentials unknown, graph G unknown. Goal: distinguish p = uniform vs d(p, uniform) > ε. [w/ Dikkala, Kamath'16]: polynomially many samples suffice.
Warmup: under the uniform distribution, E[X_u X_v] = 0 for every pair of nodes u, v, while a departure shows up in some pairwise correlation. Hence, can distinguish which is the case by estimating all pairwise correlations from polynomially many samples. This localizes the departure from uniformity (independence) on some edge. Focus: cheaper non-localizing test?

Slides 58–60

Testing Ising Models (cont'd). Goal: distinguish p = uniform vs d(p, uniform) > ε.
Claim: by expending some samples, can identify a distinguishing statistic of the form Z = Σ_{u,v} c_{uv} X_u X_v, where |c_{uv}| ≤ 1 for all u, v.
Issue: can't bound Var(Z) intelligently, as the variables aren't pairwise independent. In the high-temperature regime the variance can be controlled; otherwise the best one can say is the trivial bound, and it is, in fact, tight: consider two disjoint cliques with super-strong ties and suppose c_{uv} = 1 for all u, v; then Z dances around its mean by order n².
That settles low temperature. How about high temperature?

Slide 61

Ising Model: strong vs weak ties (revisited). "Low temperature regime" (strong ties): exponential mixing of the Glauber dynamics. "High temperature regime" (weak ties): rapid mixing of the Glauber dynamics.

Slide 62

Testing Ising Models (cont'd). Recall the statistic Z = Σ_{u,v} c_{uv} X_u X_v with |c_{uv}| ≤ 1, whose variance can't be bounded via pairwise independence.
Low temperature: the trivial variance bound can be tight.
High temperature: a much better variance bound holds for dense graphs; proof via exchangeable pairs [Stein, …, Chatterjee 2006].

Slide 63

Exchangeable Pairs. Goal: given f and a random vector X, want to bound moments of f(X), or prove concentration of f(X) about its mean.
Approach: define a pair of random vectors (X, X′) such that:
- (X, X′) has the same distribution as (X′, X) (exchangeability)
- the marginal distributions of X and X′ are the distribution of interest (faithfulness)
Find an anti-symmetric function F (i.e. F(x, x′) = −F(x′, x)) such that E[F(X, X′) | X] = f(X) − E[f(X)].
Claims: let v(X) := ½ E[|(f(X) − f(X′)) F(X, X′)| | X]. If v(X) ≤ C a.s., then f(X) concentrates about its mean with sub-Gaussian tails.

Slide 64
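The concentration claim matches Chatterjee's exchangeable-pairs bound, which can be stated as follows (a reconstruction following the standard formulation):

```latex
% Chatterjee's exchangeable-pairs concentration: with (X, X') exchangeable,
% F anti-symmetric, and E[F(X, X') | X] = f(X) - E f(X), define
\[
  v(X) \;=\; \tfrac{1}{2}\,
  \mathbb{E}\big[\,\big|\big(f(X)-f(X')\big)\,F(X,X')\big| \,\big|\, X \big].
\]
% If v(X) <= C almost surely, then for all t > 0,
\[
  \Pr\big[\,|f(X)-\mathbb{E}f(X)| \ge t\,\big] \;\le\; 2\,e^{-t^2/(2C)}.
\]
```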

Silly Example: f(X) = Σ_i X_i, where X has independent coordinates. Suppose for all i: E[X_i] = 0 and |X_i| ≤ 1 a.s. Goal: prove concentration of f(X) about its mean.
Define the exchangeable pair (X, X′) as follows: sample X; pick a coordinate i u.a.r.; sample a fresh copy X_i′ of X_i and set X′ = (X_1, …, X_i′, …, X_n).
Choose the anti-symmetric function F(X, X′) = n (f(X) − f(X′)).
Bounding v(X)?

Slide 65

Interesting Example: Ising. How to define the exchangeable pair?
Natural approach: sample X from the Ising model p; do one step of the Glauber dynamics from X to find X′, i.e. pick a random node v and resample X_v from the marginal of p at v, conditioning on the state of all the other nodes.
Harder question: find an anti-symmetric F s.t. E[F(X, X′) | X] = f(X) − E[f(X)]. Approach that works for any f: F(X, X′) = Σ_{t ≥ 0} E[f(X_t) − f(X_t′)], where (X_t), (X_t′) are two (potentially coupled) executions of the Glauber dynamics starting from states X and X′ respectively.
Challenging question: bound v(X). Need to show the function contracts as the Glauber dynamics unravels. Requires a good coupling.

Slide 66

Showing Contraction (Lemma 1, Lemma 2). Generous coupling: choose the same node, but update independently. Different from the "greedy coupling" typically used, where the same node is chosen and the update is coordinated to maximize the probability of the same update.

Slide 67

The Menu: Motivation · Uniformity Testing, Goodness of Fit · Problem Formulation · Testing Properties of Distributions · Discussion/Road Ahead · Testing in High Dimensions · Future Directions

Slides 68–69

Markov Chain Testing. n = 52 cards; 7 riffle shuffles* needed [Diaconis, Bayer'92]. Empirical fact: 6 vs. 7 shuffles give different Markov chains! [Diaconis'03]
Question: how close is a real shuffle to the GSR distribution? Given: sample access to shuffles; Goal: distinguish the shuffle is GSR vs far from it.
*riffle shuffle = Gilbert-Shannon-Reeds (GSR) model for a distribution on card permutations. [Ongoing work with Dikkala, Gravin]

Slide 70
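The GSR model referenced above is easy to simulate; a sketch (cut at a Binomial(n, 1/2) position, then drop cards with probability proportional to packet sizes):

```python
# One Gilbert-Shannon-Reeds riffle shuffle: cut the deck at a
# Binomial(n, 1/2) position, then interleave the two packets, dropping the
# next card from a packet with probability proportional to its current size.
import random

def gsr_shuffle(deck, rng):
    n = len(deck)
    cut = sum(rng.random() < 0.5 for _ in range(n))   # Binomial(n, 1/2) cut
    left, right = deck[:cut], deck[cut:]
    out, i, j = [], 0, 0
    while i < len(left) or j < len(right):
        a, b = len(left) - i, len(right) - j          # cards left per packet
        if rng.random() < a / (a + b):
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out

rng = random.Random(42)
deck = list(range(52))
for _ in range(7):          # 7 riffle shuffles, per [Diaconis, Bayer'92]
    deck = gsr_shuffle(deck, rng)
```

Repeated GSR shuffles give samples from the Markov chain whose closeness to a real shuffle the slide asks about.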

Markov Chain Testing: how to quantify distance between Markov chains? Let M₁, M₂ be the transition matrices of the two chains. Object of interest: the distance between the distributions of length-t trajectories. Pertinent question: its asymptotics as t → ∞? Easier to quantify a single-step notion that controls it for all t, so the proposed notion of distance is based on a spectral radius of a matrix built from M₁ and M₂.
Question: test M = M₀ vs Dist(M, M₀) > ε.
Results: testing symmetric Markov chains with near-optimal sample size. [Ongoing work with Dikkala, Gravin]

Slide 71

Testing Combinatorial Structure. Is the phylogenetic tree assumption true? Sapiens–Neanderthal early interbreeding [Slatkin et al.'13]. Is a graphical model a tree? [Ongoing work with Bresler, Acharya.] Efficiently testing combinatorial structure.

Slide 72

Testing from a Single Sample. Given one social network, one brain, etc., how can we test the validity of a certain generative model? Get many samples from one sample?

Slide 73

INFORMATION THEORY · MACHINE LEARNING · STATISTICS · ALGORITHMS

Thank You!