Estimating the Unseen: - PowerPoint Presentation

trish-goza · 404 views · Uploaded On 2017-01-26




Presentation Transcript

Slide1

Estimating the Unseen: Sublinear Statistics

Paul Valiant

Slide2

Fisher’s Butterflies; Turing’s Enigma Codewords

Fisher: how many new butterfly species would I see if I observed for another period? Turing: what is the probability mass of the unseen Enigma codewords?

Write F_j for the number of distinct elements observed exactly j times; the vector (F_1, F_2, …) is the “fingerprint” of the sample. Fisher’s answer is the alternating series F_1 - F_2 + F_3 - F_4 + F_5 - …; Turing’s is F_1/(number of samples).

Slide3

Characteristic Functions

For element i with probability p_i, let λ_i = k·p_i be its expected count in k samples, so under the Poisson approximation Pr[not seen] = e^{-λ_i}. Then:

Pr[not seen in first period, but seen in second period] = e^{-λ_i}(1 - e^{-λ_i})

Pr[not seen]·p_i = p_i·e^{-λ_i}

Summing over i gives exactly the expectations of the two estimators: E[F_1 - F_2 + F_3 - F_4 + F_5 - …] = Σ_i e^{-λ_i}(1 - e^{-λ_i}), and E[F_1/(number of samples)] = Σ_i p_i·e^{-λ_i}.

Slide4
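Both estimators act only on the fingerprint, so they are short to implement; a minimal Python sketch (the sample data is illustrative):

```python
from collections import Counter

def fingerprint(samples):
    """F[j] = number of distinct elements observed exactly j times."""
    counts = Counter(samples)          # element -> multiplicity
    return Counter(counts.values())    # multiplicity j -> F_j

def turing_unseen_mass(samples):
    """Turing's estimate of the probability mass of unseen elements: F_1/k."""
    return fingerprint(samples)[1] / len(samples)

def fisher_new_species(samples):
    """Fisher's estimate of the number of new species seen in a second
    period of equal length: F_1 - F_2 + F_3 - F_4 + F_5 - ..."""
    F = fingerprint(samples)
    return sum((-1) ** (j + 1) * F[j] for j in sorted(F))

samples = ["a", "a", "b", "c", "c", "c", "d"]   # F_1 = 2, F_2 = 1, F_3 = 1
print(turing_unseen_mass(samples))   # 2/7
print(fisher_new_species(samples))   # 2 - 1 + 1 = 2
```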

Other Properties?

Entropy: Σ_i p_i·log p_i. Support size: Σ_i p_i·(1/p_i), where 1/p_i is replaced by the step function that is 0 at p_i = 0.

Approximate log p_i and the step function by smoother functions: accurate to O(1) for p_i = Ω(1/k), i.e. with linearly many samples, but exponentially hard to approximate below 1/k.

Easier case? The L2 norm, Σ_i p_i².

Slide5
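For contrast with what follows, the naive plug-in estimators for these properties (reliable only in the linear-sample regime described above) can be sketched as:

```python
import math
from collections import Counter

def plugin_entropy(samples):
    """Naive plug-in entropy estimate -sum_i p_i log p_i using empirical
    frequencies; badly biased once many probabilities fall below 1/k."""
    k = len(samples)
    return -sum((c / k) * math.log(c / k) for c in Counter(samples).values())

def plugin_support_size(samples):
    """Naive support-size estimate: the number of distinct elements seen;
    it misses every element whose probability is well below 1/k."""
    return len(set(samples))

samples = ["a", "b", "a", "b"]        # empirically uniform on {a, b}
print(plugin_entropy(samples))        # log 2 = 0.693...
print(plugin_support_size(samples))   # 2
```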

L2 Approximation

An unbiased estimator of Σ_i p_i² from k samples: the fraction of the C(k,2) sample pairs that collide, i.e. Σ_j F_j·C(j,2) / C(k,2). Works very well if we have a bound on the j’s encountered.

L2 distance is related to L1 distance: ||p - q||_2 ≤ ||p - q||_1 ≤ √n·||p - q||_2.

Yields 1-sided testers for L1 distance; also for L1 distance to uniform, and for L1 distance to an arbitrary known distribution [Batu, Fortnow, Rubinfeld, Smith, White ’00].

Slide6
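A sketch of this collision estimator, computed from the fingerprint (sample data illustrative):

```python
from collections import Counter

def l2_norm_squared(samples):
    """Unbiased estimate of sum_i p_i^2: the fraction of the C(k,2)
    sample pairs that collide, i.e. sum_j F_j * C(j,2) / C(k,2)."""
    k = len(samples)
    F = Counter(Counter(samples).values())   # fingerprint: j -> F_j
    collisions = sum(f * j * (j - 1) // 2 for j, f in F.items())
    return collisions / (k * (k - 1) / 2)

print(l2_norm_squared(["a"] * 4))             # 1.0 for a point mass
print(l2_norm_squared(["a", "b", "c", "d"]))  # 0.0, no collisions
```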

Are good testers computationally trivial?

Slide7

Maximum Likelihood Distributions

[Orlitsky et al., Science, etc.]

Slide8

Relaxing the Problem

Given {F_j}, find a distribution p such that the expected fingerprint of k samples from p approximates {F_j}.

By concentration bounds, the “right” distribution should also satisfy this, i.e. lie in the feasible region of the linear program.

Yields n/log n-sample estimators for entropy, support size, L1 distance, and anything similar.

Does the extra computational power help?

Slide9
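The feasibility condition of this relaxed problem can be sketched directly. Here the histogram representation (probability value → number of elements) and the `slack` tolerance are my own illustrative choices, and the Poisson approximation E[F_j] ≈ Σ_x h(x)·Pr[Poisson(kx) = j] stands in for the exact fingerprint expectations:

```python
import math

def poisson_pmf(lam, j):
    """Pr[Poisson(lam) = j]."""
    return math.exp(-lam) * lam ** j / math.factorial(j)

def expected_fingerprint(hist, k, max_j):
    """hist maps a probability value x to the number of domain elements
    with that probability; returns [E[F_1], ..., E[F_max_j]] for k samples."""
    return [sum(h * poisson_pmf(k * x, j) for x, h in hist.items())
            for j in range(1, max_j + 1)]

def is_feasible(hist, F, k, slack):
    """Is hist's expected fingerprint within slack of the observed
    fingerprint F (where F[0] holds F_1) in every coordinate?"""
    return all(abs(e - f) <= slack
               for e, f in zip(expected_fingerprint(hist, k, len(F)), F))

# Uniform over 10 elements, k = 50 samples: the true histogram {0.1: 10}
# matches its own expected fingerprint exactly.
hist = {0.1: 10}
F = expected_fingerprint(hist, 50, 5)
print(is_feasible(hist, F, 50, slack=1e-9))            # True
print(is_feasible(hist, [f + 1 for f in F], 50, 0.5))  # False
```

The actual estimator searches the whole feasible region with a linear program; this sketch only shows what membership in that region means.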

Lower Bounds

Find not-large coefficients {c_i} that minimize the worst-case error of the resulting linear estimator.

DUAL: Find distributions y+, y- that maximize the gap in property value while the distance between their expected fingerprints is small.

“Find distributions with very different property values, but almost identical fingerprint expectations.”

NEEDS Theorem: close expected fingerprints ⇒ indistinguishable [Raskhodnikova, Ron, Shpilka, Smith ’07]

Slide10

“Roos’s Theorem”: Generalized Multinomial Distributions

Definition: a distribution expressible as Σ_i Z_i, where each Z_i ∈ {0, (1,0,0,0,…), (0,1,0,0,…), (0,0,1,0,…), …}.

Includes fingerprint distributions. Also: binomial distributions, multinomial distributions, and any sums of such distributions.

“Generalized multinomial distributions” appear all over CS, and characterizing them is central to many papers (for example, Daskalakis and Papadimitriou, “Discretized multinomial distributions and Nash equilibria in anonymous games,” FOCS 2008).

Comment. Thm: if the parameters satisfy suitable bounds, then Σ_i Z_i is multivariate Poisson to within a correspondingly small error.

Slide11

Distributions of Rare Elements

The distribution of fingerprints is approximately multivariate Poisson, provided every element is rare, even in k samples.

Yields the best known lower bounds for non-trivial 1-sided testing problems: Ω(n^{2/3}) for L1 distance, Ω(n^{2/3}·m^{1/3}) for “independence”.

Note: it is impossible to confuse frequencies above log n with o(1). Can we cut off above log n? This suggests these lower bounds are tight to within log n.

Can we do better?

Slide12

A Better Central Limit Theorem (?)

Roos’s Theorem: fingerprints are like Poissons (provided…). But Poissons are a 1-parameter family; Gaussians are a 2-parameter family.

New CLT: fingerprints are like Gaussians (provided the variance is high enough in every direction).

How to ensure high variance? “Fatten” distributions by adding elements at many different probabilities. ⇒ can’t use for 1-sided bounds.

Slide13

Results

Additive estimates of entropy, support size, and L1 distance, using O(n/log n) samples.

A 2-approximation of the L1 distance to the uniform distribution U_m.

All testers are linear expressions in the fingerprint.

Slide14
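Since each of these testers is linear in the fingerprint, one generic evaluator covers them all; a sketch (the example coefficient choices are mine):

```python
from collections import Counter

def linear_estimator(coeffs, samples):
    """Evaluate a linear fingerprint statistic sum_j coeffs[j] * F_j, where
    F_j is the number of distinct elements seen exactly j times."""
    F = Counter(Counter(samples).values())
    return sum(c * F.get(j, 0) for j, c in coeffs.items())

samples = ["a", "a", "b", "c", "c", "c", "d"]
k = len(samples)
# Turing's unseen-mass estimate F_1/k is linear, with c_1 = 1/k:
print(linear_estimator({1: 1 / k}, samples))                       # 2/7
# The naive distinct-elements count is linear, with c_j = 1 for all j:
print(linear_estimator({j: 1 for j in range(1, k + 1)}, samples))  # 4
```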

Duality

Find not-large coefficients {c_i} that minimize the worst-case error of the linear estimator. DUAL: find distributions y+, y- that maximize the gap in property value while the distance between their expected fingerprints is small.

Yields an estimator when d < ½; yields a lower bound when d > ½.

“When …, the optimum is log-convex.”

Theorem: For a linear symmetric property π, and ε > 0, c > ½: if all pairs p+, p- of support ≤ n whose property values differ by at least ε are distinguishable w.p. > c via k samples, then there exists a linear estimator with error O(ε) using (1+o(1))·k samples, succeeding w.p. 1-o(1/poly(k)).

Slide15

Open Problems

Dependence on ε (resolved for entropy)

Beyond additive estimates: “case-by-case optimal”?

We suspect linear programming is better than linear estimators

Leveraging these results for non-symmetric properties

Monotonicity, with respect to different posets

Practical applications!