Slide 1
Constantinos Daskalakis
CSAIL and EECS, MIT
Greek Stochastics

Statistics vs Big Data

Slide 2
YOU WANT BIG DATA? I'LL GIVE YOU BIG DATA!

Slide 3
BIG Data:
- Human genome: 40 exabytes of storage by 2025
- SKA Telescope: 1 exabyte daily
- Facebook: 20 petabytes of images daily
(BIG Data vs. small)

Slide 4
High-dimensional, expensive data: DNA microarrays, computer vision, financial records, experimental drugs.

Slide 5
What properties do your BIG distributions have?

Slide 6
e.g. 1: Play the lottery?
Is it uniform?

Slide 7
Is the lottery unfair? From Hitlotto.com: "Lottery experts agree, past number histories can be the key to predicting future winners."

Slide 8
True story! Polish lottery Multilotek: choose 20 distinct numbers "uniformly" at random out of 1 to 80. The initial machine was biased: e.g., the probability of drawing numbers in 50-59 was too small.

Slide 9
Thanks to Krzysztof Onak (pointer) and Eric Price (graph).

Slide 10
New Jersey Pick k (k = 3, 4) Lottery: pick k random digits in order; 10^k possible values.
Data:
- Pick 3: 8522 results from 5/22/75 to 10/15/00; a χ²-test (on Excel) answers "42% confidence".
- Pick 4: 6544 results from 9/1/77 to 10/15/00: fewer results than possible values, so it is not a good idea to run a χ² test; the convergence of the statistic to the χ² distribution won't kick in for so small a sample size. (Textbook rule of thumb: to run a χ² test, the expected number of observations of each element of the domain under the null hypothesis should be at least 5.)

Slide 11
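The χ² computation above can be sketched in code. The sketch below runs Pearson's χ² test on simulated Pick-3 digits (simulated draws, not the actual New Jersey data; the draw count matches the Pick 3 data set, the bias model is a hypothetical choice, and 27.88 is the standard critical value of the χ² distribution with 9 degrees of freedom at significance 0.001):

```python
import random

def chi_squared_stat(counts, expected):
    """Pearson's chi-squared statistic: sum over the domain of
    (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(counts, expected))

random.seed(0)
n_draws, domain = 8522, 10            # as many draws as the Pick 3 data set
expected = [n_draws / domain] * domain

fair = [0] * domain                   # a fair machine: uniform digits
for _ in range(n_draws):
    fair[random.randrange(domain)] += 1

biased = [0] * domain                 # hypothetical bias: digit 0 twice as likely
for _ in range(n_draws):
    d = 0 if random.random() < 0.2 else random.randrange(1, domain)
    biased[d] += 1

# Standard table value: chi-squared, 9 degrees of freedom, level 0.001.
CRITICAL = 27.88
stat_fair = chi_squared_stat(fair, expected)
stat_biased = chi_squared_stat(biased, expected)
print(stat_fair < CRITICAL < stat_biased)
```

With this many draws over only 10 outcomes, the rule of thumb (expected count ≥ 5 per element) is comfortably satisfied, which is exactly what fails for Pick 4.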
e.g. 2: Independence Testing
Shopping patterns: are they independent of zip code?

Slide 12
e.g. 2: Linkage Disequilibrium
Genome: locus 1, locus 2, …, locus n. Single nucleotide polymorphisms (SNPs): are they independent?
Should we expect the genomes from the 1000 Human Genomes Project to be sufficient? Up to how many loci?
Suppose there are n loci with k possible states each; then the state of one's genome is an element of a domain of size k^n, and m humans give m samples from some distribution p over this domain.
Question: is p a product distribution, OR far from all product distributions?

Slide 13
e.g. 3: Outbreak of Diseases
Similar patterns in different years? More prevalent near large airports? (Flu 2005 vs. Flu 2006.)

Slide 14
Distributions on BIG Domains
Given samples of a distribution, we need to know, e.g.: its entropy, its number of distinct elements, its "shape" (monotone, unimodal, etc.), its closeness to uniform, Gaussian, Zipfian, … with no assumptions on the shape of the distribution (i.e., no parametric, smoothness, monotonicity, or normality assumptions). Such questions are considered in statistics, information theory, machine learning, databases, algorithms, physics, biology, …

Slide 15
Classical Setting vs. Modern Setting
Classical: a (comparatively) small domain, e.g., coin tosses over the domain {H, T}; asymptotic analysis; computation not crucial.
Modern: a large domain; old questions, new challenges: samples, computation, communication, storage. E.g., one human genome is a single sample from a domain of size about 4^(3·10^9).

Slide 16
A Key Question
How many samples do you need, in terms of the domain size? Do you need to estimate the probability of each domain item -- OR -- can the sample complexity be sublinear in the size of the domain? Sublinearity rules out standard statistical techniques.

Slide 17
Aim
Algorithms with sublinear sample complexity, at the intersection of information theory, machine learning, statistics, and algorithms.

Slide 18
The Menu
- Motivation
- Problem Formulation
- Uniformity Testing, Goodness of Fit
- Testing Properties of Distributions
- Testing in High Dimensions
- Discussion/Road Ahead
Slide 20
Problem Formulation
Model: P is a family of distributions over [n]; P may be non-parametric, e.g., unimodal, product, or log-concave distributions. p: an unknown discrete distribution over [n].
Problem: given ε > 0 and samples from the unknown p, with probability 0.9 distinguish p ∈ P vs. d_TV(p, P) > ε.
Objective: minimize the number of samples (sublinear in n?) and the computation time.

Slide 22
(Composite) Hypothesis Testing
A well-studied problem: Neyman-Pearson test, Kolmogorov-Smirnov test, Pearson's chi-squared test, generalized likelihood ratio test, …
Quantities of interest: consistency; error exponents. Focus on the asymptotic regime: the error probability goes to 0 as the number of samples m goes to infinity, and the results kick in when m is much larger than n.
Here instead: Can the sample complexity be sublinear in n? Is there strong control for false positives?

Slide 23
The Menu
- Motivation
- Problem Formulation
- Uniformity Testing, Goodness of Fit
- Testing Properties of Distributions
- Testing in High Dimensions
- Discussion/Road Ahead
Slide 24
Testing Fairness of a Coin
p: unknown probability of heads. Question: is p = 1/2, OR is |p − 1/2| ≥ ε? Goal: toss the coin several times and deduce the correct answer with probability ≥ 0.9. How many samples are required?
We can estimate p to within ±ε/2 by tossing the coin O(1/ε²) times and taking the empirical frequency of heads: by concentration bounds, if m = O(1/ε²), the empirical frequency is within ε/2 of p with probability ≥ 0.9.
Are Ω(1/ε²) many samples necessary? Suppose there were a tester using m = o(1/ε²) samples. Then it could distinguish one sample of (X₁, …, X_m), where each Xᵢ ~ Bernoulli(1/2), from one sample of (Y₁, …, Y_m), where each Yᵢ ~ Bernoulli(1/2 + ε), with probability ≥ 0.9. Claim: any tester has error probability at least (1 − d_TV(Bernoulli(1/2)^m, Bernoulli(1/2 + ε)^m))/2, and this total variation distance is o(1) when m = o(1/ε²).

Slide 25
Testing Uniformity
p: an unknown distribution over [n], with sample access. Question: is p = U_n, or is d_TV(p, U_n) > ε?
[Paninski'03]: Θ(√n/ε²) samples and time.
"Intuition":
(Lower bound) Suppose p is either the uniform distribution over [n], or the uniform distribution over a random size-n/2 subset of [n]. Unless Ω(√n) samples are observed, there are no collisions, hence the two cases cannot be distinguished.
(Upper bound) Collision statistics suffice to distinguish the two cases.

Slide 26
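The collision-based upper bound can be sketched as follows (domain size, sample size, and threshold are illustrative choices):

```python
import random
from collections import Counter

def collision_rate(samples):
    """Fraction of colliding pairs, sum_i C(c_i, 2) / C(m, 2).  Its
    expectation is ||p||_2^2, which is >= 1/n with equality iff p is
    uniform on a domain of size n."""
    m = len(samples)
    counts = Counter(samples)
    colliding = sum(c * (c - 1) // 2 for c in counts.values())
    return colliding / (m * (m - 1) // 2)

random.seed(2)
n, m = 100, 5000
uniform_samples = [random.randrange(n) for _ in range(m)]
# Far from uniform: uniform on a random half of the domain, so that
# ||p||_2^2 = 2/n instead of 1/n.
half = random.sample(range(n), n // 2)
far_samples = [random.choice(half) for _ in range(m)]

threshold = 1.5 / n    # midway between 1/n and 2/n
r_uniform = collision_rate(uniform_samples)
r_far = collision_rate(far_samples)
print(r_uniform < threshold < r_far)
```

This is exactly the lower-bound instance from the slide: uniform on all of [n] vs. uniform on a random half, which doubles the collision probability.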
Proving Lower Bounds
[Le Cam'73]: Consider two disjoint sets of distributions P₁ and P₂, and suppose an algorithm is given m samples from some unknown p ∈ P₁ ∪ P₂ and claims to distinguish p ∈ P₁ vs. p ∈ P₂. Then its error probability is at least (1 − d_TV(Q₁, Q₂))/2, where each Qᵢ ranges over all distributions generating samples as follows: choose a random distribution p from Pᵢ (according to some distribution over Pᵢ), then generate m samples from p.
To prove the lower bound for uniformity testing, take P₁ = {U_n} and P₂ = {uniform distributions over random size-n/2 subsets of [n]}.

Slide 29
The Menu
- Motivation
- Problem Formulation
- Uniformity Testing, Goodness of Fit
- Testing Properties of Distributions
- Testing in High Dimensions
- Discussion/Road Ahead

Slide 30
Identity Testing ("goodness of fit")
q: a distribution over [n], given explicitly; p: unknown, with sample access. Question: is p = q, or is d_TV(p, q) > ε?
[Batu-Fischer-Fortnow-Kumar-Rubinfeld-White'01] … [Paninski'08, Valiant-Valiant'14]: Θ(√n/ε²) samples and time.
[w/ Acharya-Kamath'15]: a tolerant goodness-of-fit test with the same sample size can distinguish χ²(p, q) ≤ ε²/2 vs. d_TV(p, q) > ε. (By Cauchy-Schwarz, 4·d_TV(p, q)² ≤ χ²(p, q), so the two cases are disjoint.)

Slide 31
A New χ²-style Goodness-of-Fit Test
Goal: given q and sample access to p, distinguish Case 1: χ²(p, q) ≤ ε²/2, vs. Case 2: d_TV(p, q) > ε.
Approach: draw Poisson(m) many samples from p, and let Nᵢ be the number of appearances of symbol i; the Nᵢ's are then independent random variables, with Nᵢ ~ Poisson(m·pᵢ).
Statistic: Z = Σᵢ ((Nᵢ − m·qᵢ)² − Nᵢ)/(m·qᵢ).
In Case 1, E[Z] is small; in Case 2, it is large. Chug, chug, chug … bound the variance of Z: O(√n/ε²) samples suffice to distinguish the two cases.
Side note: Pearson's χ² test uses the statistic Σᵢ (Nᵢ − m·qᵢ)²/(m·qᵢ). Subtracting Nᵢ in the numerator makes Z an unbiased estimator of m·χ²(p, q) and, importantly, may hugely decrease the variance.

Slide 32
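A minimal sketch of the bias-corrected statistic follows. For simplicity it uses plain multinomial sampling rather than the Poissonization described above, which shifts the null expectation of Z from 0 to −1; domain size, sample size, and the alternative p are illustrative choices:

```python
import random
from collections import Counter

def corrected_chi_squared(samples, q):
    """Z = sum_i ((N_i - m q_i)^2 - N_i) / (m q_i).  With Poissonized
    sampling, E[Z] = 0 when p = q, and E[Z] = m * chi^2(p, q) in general."""
    m = len(samples)
    counts = Counter(samples)
    return sum(((counts[i] - m * qi) ** 2 - counts[i]) / (m * qi)
               for i, qi in enumerate(q))

random.seed(3)
n, m = 50, 2000
q = [1.0 / n] * n                                    # hypothesis: uniform
p = [1.5 / n] * (n // 2) + [0.5 / n] * (n - n // 2)  # chi^2(p, q) = 0.25

z_null = corrected_chi_squared(random.choices(range(n), weights=q, k=m), q)
z_far = corrected_chi_squared(random.choices(range(n), weights=p, k=m), q)
# E[z_null] is near 0, while E[z_far] is near m * chi^2(p, q) = 500.
print(z_null < 100 < z_far)
```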
The Menu
- Motivation
- Problem Formulation
- Uniformity Testing, Goodness of Fit
- Testing Properties of Distributions
- Testing in High Dimensions
- Discussion/Road Ahead
Slide 34
Testing Properties of Distributions
So far P = {a single distribution q}. This is restrictive, as we rarely know the hypothesis distribution exactly; a natural extension is to test structural properties:
- monotonicity: "the PMF is monotone", e.g., cancer incidence vs. radiation exposure
- unimodality: "the PMF is single-peaked", e.g., a single source of disease
- log-concavity: "log p is concave"
- monotone hazard rate: "log(1 − CDF) is concave"
- product distribution, e.g., testing linkage disequilibrium
Example question: given sample access to p, is p unimodal, OR is p ε-far from all unimodal distributions?

Slide 35
Testing Properties of Distributions
[w/ Acharya and Kamath 2015]:
1. Testing identity, monotonicity, log-concavity, monotone hazard rate, and unimodality of a distribution over an ordered set of size n is doable with O(√n/ε²) samples and time.
2. Testing monotonicity/independence of a distribution over [n]^d is doable with O(n^(d/2)/ε²) samples and time.
Previous best for monotonicity testing: [Bhattacharyya-Fischer-Rubinfeld-Valiant'11]; previous best for independence: d = 2 only, with worse bounds [Batu et al.'01].
All bounds above are optimal: matching lower bounds for 1 and 2 via Le Cam. A unified approach, with computationally efficient tests.
N.B. The contemporaneous work of [Canonne et al.'2015] provides a different unified approach for testing structure, but their results are suboptimal.

Slide 36
A Natural Approach
Goal: given a property P and sample access to p, distinguish p ∈ P vs. d_TV(p, P) > ε.
Step 1: choose a hypothesis q ∈ P. Step 2: test the hypothesis (how well does q fit the samples from p?).

Slide 37
A Natural Approach (cont'd)
Goal: given P and sample access to p, distinguish p ∈ P vs. d_TV(p, P) > ε.
A learning-followed-by-testing algorithm:
1. Learn a hypothesis q ∈ P such that d_TV(p, q) ≤ ε' (for some ε' < ε) whenever p ∈ P. This needs a cheap "proper learner"; q ∈ P is automatic since the learner is proper.
2. Reduce to "tolerant goodness of fit": given sample access to p and an explicit description of q, distinguish d_TV(p, q) ≤ ε' vs. d_TV(p, q) > ε.
But a tolerant tester in total variation requires an almost linear number of samples in the support size, namely Θ(n/log n) samples [Valiant-Valiant'10]. One could try investing more samples for more accurate learning, but the proper-learning complexity vs. tolerant-testing complexity tradeoff does not work out to give the optimal testing complexity.

Slide 38
A Modified Approach
Goal: given P and sample access to p, distinguish p ∈ P vs. d_TV(p, P) > ε.
A learning-followed-by-testing algorithm:
1. Learn a hypothesis q ∈ P such that χ²(p, q) ≤ ε²/2 whenever p ∈ P. This needs a cheap "proper learner"; q ∈ P is automatic since the learner is proper.
2. Reduce to "tolerant goodness of fit" in χ²: given sample access to p and an explicit description of q, distinguish χ²(p, q) ≤ ε²/2 vs. d_TV(p, q) > ε.
Now tolerant testing has the right complexity of O(√n/ε²). Pertinent question: are there sublinear proper learners in χ²? We show that the χ²-learning complexity is dominated by the testing complexity for all the properties of distributions we consider.

Slide 39
Tutorial: part 2

Slide 40
Summary so far
Hypothesis testing in the small-sample regime: p is an unknown distribution over some discrete set, and P is a set of distributions over that set. Given sample access to p (i.i.d. samples in, Pass/Fail out), the goal is to tell, with probability ≥ 0.9, p ∈ P vs. d_TV(p, P) > ε.
Properties of interest: is p uniform? unimodal? log-concave? MHR? a product measure?
All of the above properties can be tested with O(√n/ε²) samples and time, via a unified approach based on a modified Pearson's goodness-of-fit statistic, Σᵢ ((Nᵢ − m·qᵢ)² − Nᵢ)/(m·qᵢ), which:
- gives tight control for false positives: we want to be able to both assert and reject the null hypothesis;
- accommodates sublinear sample size.

Slide 41
The Menu
- Motivation
- Problem Formulation
- Uniformity Testing, Goodness of Fit
- Testing Properties of Distributions
- Testing in High Dimensions
- Discussion/Road Ahead
Slide 43
Other Distances (beyond total variation)
So far we focused on the ℓ₁ (a.k.a. total variation) distance: given sample access to p, with probability ≥ 0.9 distinguish p = q vs. d_TV(p, q) > ε.
Stronger distances? [Acharya-D-Kamath]: the results are actually shown for the χ² distance, and should also extend to distances between χ² and ℓ₁.
Weaker distances? The ℓ₂ distance is easy to test [Goldreich-Ron], but makes less sense on big domains, e.g.:
p = (2/n, 2/n, …, 2/n, 0, 0, …, 0) (mass on the first n/2 elements)
q = (0, 0, …, 0, 2/n, 2/n, …, 2/n) (mass on the last n/2 elements)
ℓ₁ distance = 2 (the maximum possible), while the ℓ₂ distance = 2/√n, which vanishes as n grows.

Slide 44
Tolerance
So far we focused on the non-tolerant version: given a set of distributions P and sample access to p, distinguish p ∈ P vs. d_TV(p, P) > ε.
Tolerant version: distinguish d_TV(p, P) ≤ ε/2 vs. d_TV(p, P) > ε. [Valiant-Valiant'10]: Θ(n/log n) samples are needed.
Tolerant version in χ²: distinguish χ²(p, P) ≤ ε²/2 vs. d_TV(p, P) > ε. [w/ Acharya, Kamath'15]: O(√n/ε²) samples suffice. Different ratios between the two thresholds only change the constants in the O(·) notation.

Slide 45
Goodness of Fit with Two Unknowns
Our goodness-of-fit test was given an explicit distribution q and sample access to a distribution p, and was asked to test p = q vs. d_TV(p, q) > ε. Sometimes both distributions are unknown, e.g.: are the transactions of 20-30 yr olds and the transactions of 30-40 yr olds the same or different?

Slide 46
Goodness of Fit with Two Unknowns (cont'd)
Given sample access to two unknown distributions p and q (i.i.d. samples from each; Pass/Fail output), distinguish p = q vs. d_TV(p, q) > ε.

Slide 47
[Batu-Fortnow-Rubinfeld-Smith-White], [P. Valiant], … [Chan-Diakonikolas-Valiant-Valiant]: a tight upper and lower bound of Θ(max{n^(2/3)/ε^(4/3), √n/ε²}) samples.
Why is this different from the known-q case? Collision statistics are all that matter, and collisions on "heavy" elements can hide the collision statistics of the rest of the domain. The lower bound constructs pairs of distributions where the heavy elements are identical, but the "light" elements are either identical or very different.

Slide 48
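A two-sample statistic in the spirit of [Chan-Diakonikolas-Valiant-Valiant] can be sketched as follows (the normalization follows their ((Xᵢ − Yᵢ)² − Xᵢ − Yᵢ)/(Xᵢ + Yᵢ) form; domain size, sample size, and the far distribution are illustrative choices):

```python
import random

def closeness_stat(x_counts, y_counts):
    """Z = sum_i ((X_i - Y_i)^2 - X_i - Y_i) / (X_i + Y_i), skipping
    empty bins.  Each numerator estimates (m p_i - m q_i)^2, so Z stays
    near 0 when both samples come from the same distribution."""
    return sum(((x - y) ** 2 - x - y) / (x + y)
               for x, y in zip(x_counts, y_counts) if x + y > 0)

def histogram(samples, n):
    c = [0] * n
    for s in samples:
        c[s] += 1
    return c

random.seed(4)
n, m = 100, 5000
p_samples = [random.randrange(n) for _ in range(m)]        # uniform on [n]
q_samples = [random.randrange(n) for _ in range(m)]        # same distribution
r_samples = [random.randrange(n // 2) for _ in range(m)]   # uniform on half

stat_same = closeness_stat(histogram(p_samples, n), histogram(q_samples, n))
stat_far = closeness_stat(histogram(p_samples, n), histogram(r_samples, n))
print(stat_same < 100 < stat_far)
```

The per-bin normalization by Xᵢ + Yᵢ is what keeps "heavy" elements from drowning out the signal on the light ones.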
Continuous Distributions
What can we say about continuous distributions? Without extra assumptions, such as smoothness of the PDF or parametric modeling, we cannot stick to hard distances (ℓ₁/total variation). Instead of restricting the distributions, let us switch distances. Our results extend if we switch to the Kolmogorov distance: recall d_K(p, q) = sup_x |F_p(x) − F_q(x)|, the sup-distance between CDFs; in contrast, d_TV(p, q) is the sup of |p(S) − q(S)| over all measurable sets S. Now we want to distinguish: p ∈ P vs. d_K(p, P) > ε.
Claim: tolerant testing in Kolmogorov distance of any property of d-dimensional distributions (continuous or discrete) can be performed from O(d/ε²) samples.
Importantly: the Kolmogorov distance allows graceful scaling with the dimensionality of the data.

Slide 49
Dvoretzky-Kiefer-Wolfowitz inequality
Suppose X₁, …, X_m are i.i.d. samples from a (single-dimensional) distribution with CDF F, and let F̂_m be the resulting empirical CDF, namely F̂_m(x) = (1/m)·#{i : Xᵢ ≤ x}. Then: Pr[sup_x |F̂_m(x) − F(x)| > ε] ≤ 2e^(−2mε²), i.e., O(log(1/δ)/ε²) samples suffice to learn any single-dimensional distribution to within ε in Kolmogorov distance, with probability ≥ 1 − δ.
VC inequality: the same is true for d-dimensional distributions when the number of samples is at least Ω(d/ε²).
After learning in Kolmogorov distance, we can tolerantly test any property. The runtime is under investigation; the trouble is that computing/approximating the Kolmogorov distance between two high-dimensional distributions is generally a hard computational problem.

Slide 50
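The DKW bound translates directly into code; a sketch for Uniform[0,1] samples (ε and δ are illustrative choices):

```python
import math
import random

def ks_distance_to_uniform(samples):
    """sup_x |F_hat(x) - x| for the empirical CDF of samples from
    Uniform[0,1], evaluated at the jump points of F_hat."""
    xs = sorted(samples)
    m = len(xs)
    return max(max((i + 1) / m - x, x - i / m) for i, x in enumerate(xs))

random.seed(5)
eps, delta = 0.05, 0.01
# DKW: ln(2/delta) / (2 eps^2) samples learn any one-dimensional CDF to
# within eps in Kolmogorov distance, with probability >= 1 - delta.
m = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
dist = ks_distance_to_uniform([random.random() for _ in range(m)])
print(m)    # 1060
print(dist <= eps)
```

Note the sample size depends only on ε and δ, not on any domain size: this is the "graceful scaling" the slide refers to.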
The Menu
- Motivation
- Problem Formulation
- Uniformity Testing, Goodness of Fit
- Testing Properties of Distributions
- Testing in High Dimensions
- Discussion/Road Ahead
Slide 52
Testing in High Dimensions
We already talked about testing high-dimensional distributions in Kolmogorov distance, where the sample complexity is O(d/ε²). Next focus: discrete distributions, stronger distances.

Slide 53
High-Dimensional Discrete Distributions
Consider a source generating n-bit strings:
0011010101 (sample 1)
0101001110 (sample 2)
0011110100 (sample 3)
…
Are the bits/pixels independent (e.g., of 400-bit images)? Our algorithms require roughly 2^(n/2)/ε² samples. Is a source generating graphs over k nodes Erdős-Rényi? Our algorithms require a number of samples exponential in k².
The exponential dependence on n is unsettling, but necessary: the lower bound exploits the high possible correlation among the bits. But nature is not adversarial. Often, high-dimensional systems have structure, e.g., Markov random fields (MRFs), graphical models (Bayes nets), etc. Can we test high-dimensional distributions with combinatorial structure?

Slide 54
[w/ Dikkala, Kamath'16]: if the unknown p is promised to be an Ising model, then polynomially many (in the dimension and 1/ε) samples suffice to test independence and goodness of fit. (Extends to MRFs.)

Slide 55
Ising Model
Used in statistical physics, computer vision, neuroscience, social science. The Ising model is a probability distribution defined in terms of a graph G = (V, E), with edge potentials θ_{u,v} and node potentials θ_u, over the state space {−1, +1}^V: p(x) is proportional to exp(Σ_{(u,v) ∈ E} θ_{u,v}·x_u·x_v + Σ_{u ∈ V} θ_u·x_u). High |θ_{u,v}|'s mean strongly (anti-)correlated spins.

Slide 56
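Sampling from an Ising model via Glauber dynamics can be sketched as follows (complete graph, uniform edge potential, no node potentials: all illustrative choices, and the chain length is a heuristic rather than a certified mixing time):

```python
import math
import random

def glauber_sample(n, theta, sweeps, rng):
    """One sample from an Ising model on the complete graph K_n with a
    uniform edge potential theta and no node potentials, generated by
    running single-site Glauber dynamics for `sweeps` passes."""
    x = [rng.choice((-1, 1)) for _ in range(n)]
    for _ in range(sweeps * n):
        i = rng.randrange(n)
        # Conditional law of x_i given the rest, derived from
        # p(x) proportional to exp(theta * sum_{u<v} x_u x_v):
        field = theta * (sum(x) - x[i])
        x[i] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-2.0 * field)) else -1
    return x

rng = random.Random(6)
n = 10
avg_abs_magnetization = lambda xs: sum(abs(sum(x)) for x in xs) / (len(xs) * n)

strong = [glauber_sample(n, 1.0, 50, rng) for _ in range(50)]  # strong ties
indep = [glauber_sample(n, 0.0, 50, rng) for _ in range(50)]   # theta = 0: uniform
m_strong = avg_abs_magnetization(strong)
m_indep = avg_abs_magnetization(indep)
print(m_strong > 0.8, m_indep < 0.5)
```

With θ = 0 the model is exactly the uniform distribution on {−1, +1}^n; with a strong positive θ the spins align, which is visible in the average absolute magnetization.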
Ising Model: Strong vs. Weak Ties
Strong ties: the "low temperature regime". Weak ties: the "high temperature regime".

Slide 57
Testing Ising Models
Given: sample access to an Ising model p over n nodes, supported on {−1, +1}^n, with unknown potentials θ and unknown graph G.
Goal: distinguish p = U (the uniform distribution, i.e., all potentials zero) vs. d_TV(p, U) > ε.
[w/ Dikkala, Kamath'16]: polynomially many samples suffice.
Warmup: under U, E[X_u·X_v] = 0 for every pair of nodes u, v.

Slide 58
Testing Ising Models (cont'd)
Warmup: under U, E[X_u·X_v] = 0 for every pair of nodes u, v, while a p that is far from U exhibits a non-trivial correlation on some pair. Hence, by estimating all pairwise correlations, one can distinguish which is the case from polynomially many samples, and even localize the departure from uniformity (independence) on some pair of nodes.
Focus: a cheaper, non-localizing test?

Slide 60
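The localizing warmup test, estimating every pairwise correlation, can be sketched as follows (for brevity the correlated source is a simple hypothetical noisy-consensus distribution rather than an actual Ising sampler; the 0.3 threshold is an illustrative choice):

```python
import random

def max_pairwise_correlation(samples, n):
    """Max over pairs (u, v) of the empirical |E[X_u X_v]|; under the
    uniform distribution on {-1,+1}^n every pairwise correlation is 0."""
    m = len(samples)
    return max(abs(sum(x[u] * x[v] for x in samples)) / m
               for u in range(n) for v in range(u + 1, n))

random.seed(7)
n, m = 8, 2000
uniform = [[random.choice((-1, 1)) for _ in range(n)] for _ in range(m)]

def noisy_consensus():
    # Hypothetical correlated source: a shared sign, flipped with prob 0.1.
    s = random.choice((-1, 1))
    return [s if random.random() < 0.9 else -s for _ in range(n)]

corr = [noisy_consensus() for _ in range(m)]
r_uniform = max_pairwise_correlation(uniform, n)
r_corr = max_pairwise_correlation(corr, n)
print(r_uniform < 0.3 < r_corr)
```

This scans all Θ(n²) pairs, which is what the cheaper, non-localizing test discussed next tries to avoid.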
Testing Ising Models (cont'd)
Claim: by expending some samples, one can identify a distinguishing statistic of the form Z = Σ_{u,v} c_{u,v}·X_u·X_v, where |c_{u,v}| ≤ 1 for all u, v.
Issue: we can't bound Var(Z) intelligently, as the variables X_u·X_v aren't pairwise independent. If they were, then Var(Z) = O(n²). Otherwise, the best one can say is the trivial bound Var(Z) = O(n⁴); and this is, in fact, tight: consider two disjoint cliques with super-strong ties (suppose all intra-clique potentials are essentially infinite); then Z dances around its mean by Θ(n²).
Focus: a cheaper, non-localizing test? The above is the low-temperature picture. How about high temperature?

Slide 61
Ising Model: Strong vs. Weak Ties
Strong ties (the "low temperature regime"): exponentially slow mixing of the Glauber dynamics. Weak ties (the "high temperature regime"): fast mixing of the Glauber dynamics.

Slide 62
Testing Ising Models (cont'd)
Claim: by expending some samples, one can identify a distinguishing statistic of the form Z = Σ_{u,v} c_{u,v}·X_u·X_v, where |c_{u,v}| ≤ 1 for all u, v.
Issue: we can't bound Var(Z) intelligently, as the variables X_u·X_v aren't pairwise independent; if they were, then Var(Z) = O(n²).
Low temperature: Var(Z) can be as large as Θ(n⁴).
High temperature: Var(Z) = Õ(n²) for dense graphs; the proof is via exchangeable pairs [Stein, …, Chatterjee 2006].

Slide 63
Exchangeable Pairs
Goal: given f and a random vector X, bound the moments of f(X), or prove concentration of f(X) about its mean.
Approach: define a pair of random vectors (X, X') such that:
- (X', X) has the same distribution as (X, X') (exchangeability);
- the marginal distributions of X and X' are both the distribution of interest (faithfulness).
Then find an anti-symmetric function F (i.e., F(x, y) = −F(y, x)) such that E[F(X, X') | X] = f(X) − E[f(X)].
Claims: let v(X) = (1/2)·E[|(f(X) − f(X'))·F(X, X')| | X]. Then Var(f(X)) = (1/2)·E[(f(X) − f(X'))·F(X, X')], and if v(X) ≤ c almost surely, then Pr[|f(X) − E[f(X)]| > t] ≤ 2e^(−t²/(2c)).

Slide 64
Silly Example
Suppose X = (X₁, …, Xₙ), where the Xᵢ's are independent with E[Xᵢ] = 0 and |Xᵢ| ≤ 1 almost surely, and f(X) = Σᵢ Xᵢ. Goal: prove concentration of f(X) about its mean.
Define an exchangeable pair (X, X') as follows: sample X; pick i ∈ [n] uniformly at random; sample Xᵢ' independently from the distribution of Xᵢ; and set X' = (X₁, …, X_{i−1}, Xᵢ', X_{i+1}, …, Xₙ).
Choose the anti-symmetric function F(X, X') = n·(f(X) − f(X')); indeed E[F(X, X') | X] = n·E[Xᵢ − Xᵢ' | X] = Σⱼ Xⱼ − 0 = f(X) − E[f(X)].
Bounding v(X): v(X) = (n/2)·E[(Xᵢ − Xᵢ')² | X] ≤ 2n, so Pr[|f(X) − E[f(X)]| > t] ≤ 2e^(−t²/(4n)), a Hoeffding-style bound.

Slide 65
Interesting Example: Ising
How do we define an exchangeable pair? Natural approach: sample X from the Ising model p; then do one step of the Glauber dynamics from X to find X', i.e., pick a random node i and resample Xᵢ from the marginal of p at i, conditioned on the state of all other nodes.
Harder question: find an anti-symmetric F such that E[F(X, X') | X] = f(X) − E[f(X)]. An approach that works for any f: F(x, y) = Σ_{t ≥ 0} (E[f(X_t)] − E[f(Y_t)]), where (X_t), (Y_t) are two (potentially coupled) executions of the Glauber dynamics starting from states x and y, respectively.
Challenging question: bound v(X). We need to show that the function contracts as the Glauber dynamics unravels, which requires a good coupling.

Slide 66
Showing Contraction
Lemma 1: … Lemma 2: …
Generous coupling: choose the same node in both chains, but update it independently. This differs from the "greedy coupling" typically used, where the same node is chosen and the update is coordinated to maximize the probability of making the same update.

Slide 67
The Menu
- Motivation
- Problem Formulation
- Uniformity Testing, Goodness of Fit
- Testing Properties of Distributions
- Testing in High Dimensions
- Discussion/Road Ahead
- Future Directions
Slide 69
Markov Chain Testing [ongoing work with Dikkala, Gravin]
Example: for n = 52 cards, 7 riffle shuffles* are needed [Diaconis, Bayer'92]. Empirical fact: 6 vs. 7 shuffles give very different Markov chains! [Diaconis'03]
Question: how close is a real shuffle to the GSR distribution? Given: sample access to shuffles. Goal: distinguish "is GSR" vs. "far from GSR".
*riffle shuffle = Gilbert-Shannon-Reeds (GSR) model for a distribution on card permutations.

Slide 70
Markov Chain Testing (cont'd): how to quantify the distance between two Markov chains?
Let M₁, M₂ be the transition matrices of the two chains. The object of interest is the distance between the distributions of length-t trajectories; the pertinent question is its asymptotic behavior as t grows. It is easier to quantify a spectral notion of distance between M₁ and M₂ (defined via a spectral radius) that controls the trajectory distance for all t; this is the proposed notion of distance. Question: test M₁ = M₂ vs. distance > ε. Results: tests for symmetric M₁, M₂ from few samples. [Ongoing work with Dikkala, Gravin]

Slide 71
Testing Combinatorial Structure
Is the phylogenetic tree assumption true? Sapiens-Neanderthal early interbreeding [Slatkin et al.'13]. Is a graphical model a tree? [ongoing work with Bresler, Acharya] Efficiently testing combinatorial structure.

Slide 72
Testing from a Single Sample
Given one social network, one brain, etc., how can we test the validity of a certain generative model? Can we get many samples from one sample?

Slide 73
At the intersection of information theory, machine learning, statistics, and algorithms.
Thank You!