Slide1
CIS 700: “Algorithms for Big Data”
Grigory Yaroslavtsev
http://grigory.us
Lecture 13: $L_p$-testing and isotonic regression
Slides at http://grigory.us/big-data-class.html
Slide2
Testing Big Data
Q: How to make sense of big data?
Q: How to understand properties looking only at a small sample?
Q: How to ignore noise and outliers?
Q: How to minimize assumptions about the sample generation process?
Q: How to optimize running time?
Slide3
Which stocks were growing steadily?
Data from http://finance.google.com
Slide4
Property Testing [Goldreich, Goldwasser, Ron; Rubinfeld, Sudan]
Randomized Algorithm:
YES: accept with probability $\ge 2/3$
NO: reject with probability $\ge 2/3$
Property Tester:
YES: accept with probability $\ge 2/3$
$\varepsilon$-close: don’t care
NO (not $\varepsilon$-close): reject with probability $\ge 2/3$
$\varepsilon$-close: $\le \varepsilon$ fraction has to be changed to become YES
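As a concrete instance of "$\varepsilon$-close", take the property "the array is sorted (nondecreasing)". The sketch below computes the smallest fraction of entries that has to be changed, using the standard fact that the entries left unchanged form a longest nondecreasing subsequence. The function name `fraction_to_change` is illustrative and does not come from the slides.

```python
from bisect import bisect_right


def fraction_to_change(values):
    """Smallest fraction of entries to modify so that `values` is nondecreasing."""
    if not values:
        return 0.0
    # tails[k] = smallest possible last element of a nondecreasing
    # subsequence of length k + 1 found so far (patience sorting).
    tails = []
    for v in values:
        pos = bisect_right(tails, v)
        if pos == len(tails):
            tails.append(v)
        else:
            tails[pos] = v
    # Entries outside a longest nondecreasing subsequence must change.
    return (len(values) - len(tails)) / len(values)
```

An array is $\varepsilon$-close to sorted exactly when this value is at most $\varepsilon$; a property tester has to decide between "sorted" and "not $\varepsilon$-close" without reading the whole array.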
Slide5
Which stocks were growing steadily?
Data from http://finance.google.com
Slide6
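Tolerant Property Testing [Parnas, Ron, Rubinfeld]
$\varepsilon$-close: $\le \varepsilon$ fraction has to be changed to become YES
Property Tester:
YES: accept with probability $\ge 2/3$
$\varepsilon$-close: don’t care
NO (not $\varepsilon$-close): reject with probability $\ge 2/3$
Tolerant Property Tester:
YES or $\varepsilon_1$-close: accept with probability $\ge 2/3$
Between $\varepsilon_1$-close and $\varepsilon_2$-close: don’t care
NO (not $\varepsilon_2$-close): reject with probability $\ge 2/3$
Slide7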
Which stocks were growing steadily?
Data from http://finance.google.com
Slide8
$L_p$-Isotonic Regression
Running time
[Ahuja, Orlin]
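For readers who have not seen isotonic regression, here is a minimal sketch of the classical pool-adjacent-violators procedure for the least-squares ($L_2$) case on a line. It is chosen for brevity and is not the [Ahuja, Orlin] algorithm cited on the slide; the function name is illustrative.

```python
def isotonic_regression_l2(y):
    """Least-squares nondecreasing fit to y via pool adjacent violators."""
    blocks = []  # each block is [sum of values, number of values]
    for v in y:
        blocks.append([float(v), 1])
        # Merge while the previous block's mean exceeds the last block's mean
        # (means are compared by cross-multiplying the positive counts).
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)  # every point in a block gets the block mean
    return fit
```

For example, `isotonic_regression_l2([1, 3, 2, 4])` pools the violating pair (3, 2) into its mean and leaves the other points alone.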
Slide9
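M = class of monotone functions
$\varepsilon$-close:
Tolerant “$L_p$ Property Testing”
Tolerant “$L_p$ Property Tester”:
YES or $\varepsilon_1$-close: accept with probability $\ge 2/3$
Between $\varepsilon_1$-close and $\varepsilon_2$-close: don’t care
NO (not $\varepsilon_2$-close): reject with probability $\ge 2/3$
Slide10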
New $L_p$-Testing Model for Real-Valued Data
Generalizes standard Hamming testing
For still has a probabilistic interpretation:
Compatible with existing PAC-style learning models (preprocessing for model selection)
For Boolean functions, .
Slide11
Our Contributions
Relationships between $L_p$-testing models
Algorithms
$L_p$-testers for monotonicity, Lipschitz, convexity
Tolerant $L_p$-tester for monotonicity in 1D (sublinear algorithm for isotonic regression)
Our $L_p$-testers beat lower bounds for Hamming testers
Simple algorithms backed up by involved analysis
Uniformly sampled (or easy to sample) data suffices
Nearly tight lower bounds
Slide12
Implications for Hamming Testing
Some techniques/results carry over to Hamming testing
Improvement on Levin’s work investment strategy
Connectivity of bounded-degree graphs [Goldreich, Ron ‘02]
Properties of images [Raskhodnikova ‘03]
Multiple-input problems [Goldreich ‘13]
First example of monotonicity testing problem where adaptivity helps
Improvements to Hamming testers for Boolean functions
Slide13
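Definitions
$f\colon D \to [0,1]$ (D = finite domain/poset)
$\|f\|_p = \left(\sum_{x \in D} |f(x)|^p\right)^{1/p}$, for $p \ge 1$
$\|f\|_0$ = Hamming weight (# of non-zero values)
Property $\mathcal{P}$ = class of functions (monotone, convex, linear, Lipschitz, …)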
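A small sketch of these quantities for functions stored as Python lists, assuming the usual conventions: the $L_p$ norm for $p \ge 1$ and the Hamming weight for $p = 0$. The normalization by the domain size in `normalized_lp_distance` is an assumption made here so that distances between $[0,1]$-valued functions lie in $[0,1]$; both function names are illustrative.

```python
def lp_norm(values, p):
    """L_p norm for p >= 1; Hamming weight (number of non-zeros) for p == 0."""
    if p == 0:
        return sum(1 for v in values if v != 0)
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def normalized_lp_distance(f, g, p):
    """Distance between two functions on the same n-point domain, scaled to [0, 1].

    Assumed normalization: ||f - g||_p / n^(1/p) for p >= 1,
    and the fraction of points where f and g differ for p == 0.
    """
    if len(f) != len(g) or not f:
        raise ValueError("f and g must be non-empty and of equal length")
    n = len(f)
    diff = [a - b for a, b in zip(f, g)]
    if p == 0:
        return lp_norm(diff, 0) / n
    return lp_norm(diff, p) / n ** (1.0 / p)
```

Under this normalization, for 0/1-valued inputs the $p$-th power of the $L_p$ distance coincides with the Hamming distance, since $|f(x) - g(x)|^p$ is then either 0 or 1.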
Slide14
Relationships: $L_p$-Testing
$Q_p(\mathcal{P}, \varepsilon)$ = query complexity of $L_p$-testing property $\mathcal{P}$ at distance $\varepsilon$
(Cauchy-Schwarz)
Boolean functions
Slide15
Relationships: Tolerant $L_p$-Testing
$Q_p(\mathcal{P}, \varepsilon_1, \varepsilon_2)$ = query complexity of tolerant $L_p$-testing property $\mathcal{P}$ with distance parameters $\varepsilon_1, \varepsilon_2$
No general relationship between tolerant $L_p$-testing and tolerant Hamming testing
$L_p$-testing for $p > 1$ is close in complexity to $L_1$-testing
For Boolean functions =
Slide16
Testing Monotonicity
Line ()
Upper bound [Ergün, Kannan, Kumar, Rubinfeld, Viswanathan ’00]
Lower bound [Fischer ’04]
Upper bound
Lower bound
Slide17
Monotonicity
Domain $D = [n]^d$ (vertices of $d$-dim hypercube)
A function $f$ is monotone if increasing a coordinate of $x$ does not decrease $f(x)$
Special case $d = 1$: $f$ is monotone if and only if $f(1), \dots, f(n)$ is sorted.
One of the most studied properties in property testing
[Ergün Kannan Kumar Rubinfeld Viswanathan, Goldreich Goldwasser Lehman Ron, Dodis Goldreich Lehman Raskhodnikova Ron Samorodnitsky, Batu Rubinfeld White, Fischer Lehman Newman Raskhodnikova Rubinfeld Samorodnitsky, Fischer, Halevy Kushilevitz, Bhattacharyya Grigorescu Jung Raskhodnikova Woodruff, ..., Chakrabarty Seshadhri, Blais, Raskhodnikova Yaroslavtsev, Chakrabarty Dixit Jha Seshadhri]
Slide18
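Monotonicity: Key Lemma
M = class of monotone functions
Boolean slicing operator $f_{(t)}$: $f_{(t)}(x) = 1$ if $f(x) \ge t$, and $0$ otherwise.
Theorem: $\mathrm{dist}_1(f, M) = \int_0^1 \mathrm{dist}_0(f_{(t)}, M)\, dt$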
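A numeric sanity check of the slicing idea on a line, assuming the key lemma is the identity that the $L_1$ distance to monotonicity equals the integral, over thresholds $t$, of the Hamming distance to monotonicity of the Boolean slice $[f(x) \ge t]$. Distances here are unnormalized (dividing both sides by $n$ gives the normalized form), and all function names are illustrative.

```python
from itertools import accumulate


def l1_distance_to_monotone(f):
    """Unnormalized L1 distance from f to the nearest nondecreasing sequence.

    Dynamic program over candidate levels; an optimal L1 fit can be chosen
    with all fitted values taken from the values of f.
    """
    if not f:
        return 0.0
    levels = sorted(set(f))
    # best[j] = cheapest fit of the prefix whose last fitted value is <= levels[j]
    best = [0.0] * len(levels)
    for x in f:
        cost = [b + abs(x - v) for b, v in zip(best, levels)]
        best = list(accumulate(cost, min))  # running prefix minimum
    return best[-1]


def hamming_distance_to_monotone(bits):
    """Fewest flips turning a 0/1 sequence into 0...01...1."""
    zeros_right = sum(1 for b in bits if not b)
    ones_left = 0
    best = zeros_right  # split before the first position: everything becomes 1
    for b in bits:
        if b:
            ones_left += 1
        else:
            zeros_right -= 1
        best = min(best, ones_left + zeros_right)
    return best


def sliced_distance(f):
    """Integral over t of the Hamming distance of the slice [f(x) >= t]."""
    levels = sorted(set(f))
    total = 0.0
    # The slice is constant for t in (lo, hi], and trivially monotone outside.
    for lo, hi in zip(levels, levels[1:]):
        bits = [1 if x >= hi else 0 for x in f]
        total += (hi - lo) * hamming_distance_to_monotone(bits)
    return total
```

On `[0.5, 0.2, 0.8]` both functions return 0.3 up to floating-point rounding: only the slice for thresholds in (0.2, 0.5] is non-monotone, and it needs one flip.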
Slide19
Proof sketch: slice and conquer
Closest monotone function with minimal -norm is unique (can be denoted as an operator ).
and commute: =
2)
3)
1)
Slide20
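$L_p$-Testers from Boolean Testers
Thm: A nonadaptive, 1-sided error $L_0$-test for monotonicity of $f\colon D \to \{0,1\}$ is also an $L_1$-test for monotonicity of $f\colon D \to [0,1]$.
Proof:
A violation $(x, y)$: $x \le y$ and $f(x) > f(y)$.
A nonadaptive, 1-sided error test queries a random set $Q$ and rejects iff $Q$ contains a violation.
If $f$ is monotone, $Q$ will not contain a violation.
If $\mathrm{dist}_1(f, M) \ge \varepsilon$ then $\exists t\colon \mathrm{dist}_0(f_{(t)}, M) \ge \varepsilon$.
W.p. $\ge 2/3$, set $Q$ contains a violation $(x, y)$ for $f_{(t)}$, which is also a violation for $f$.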
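The tester described in the proof can be sketched for a function on a line as follows. How the random query set is drawn is not specified in the text, so uniform independent queries are an assumption here, and the function names are illustrative.

```python
import random


def contains_violation(f, queries):
    """True if some queried pair i < j has f[i] > f[j]."""
    running_max = float("-inf")
    for i in sorted(set(queries)):
        if f[i] < running_max:
            return True  # an earlier queried point has a larger value
        running_max = max(running_max, f[i])
    return False


def nonadaptive_monotonicity_test(f, num_queries, rng=None):
    """Return True (accept) unless the queried points contain a violation.

    Nonadaptive: the query set is fixed before any value is read.
    1-sided error: a monotone f is always accepted.
    """
    rng = rng or random.Random()
    queries = [rng.randrange(len(f)) for _ in range(num_queries)]
    return not contains_violation(f, queries)
```

The same code runs unchanged on Boolean and real-valued inputs, which is the point of the theorem: a violation found in a Boolean slice is also a violation of the original function.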
Slide21
Our Results: Testing Monotonicity
Hypergrid ()
adaptive tester for Boolean functions
Upper bound [Dodis et al. ’99, …, Chakrabarty, Seshadhri ’13]
Lower bound [Dodis et al. ’99, …, Chakrabarty, Seshadhri ’13]
Non-adaptive 1-sided error
Upper bound
Lower bound
Slide22
Testing Monotonicity of
$e_i$ = $i$-th unit vector.
For where an axis-parallel line along dimension $i$:
= set of all axis-parallel lines
Dimension reduction for [Dodis et al.’99]:
If => -sample detects a violation
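To make "axis-parallel line" concrete, the sketch below draws a uniformly random axis-parallel line in a hypergrid and restricts a function to it, which reduces the problem to the one-dimensional (sortedness) case on that line. The 0-indexed grid $\{0, \dots, n-1\}^d$ and the function names are assumptions for illustration.

```python
import random


def random_axis_parallel_line(n, d, rng):
    """Return (axis, points) for a uniformly random axis-parallel line.

    The line consists of the n grid points that agree with a random base
    point on every coordinate except `axis`.
    """
    axis = rng.randrange(d)
    base = [rng.randrange(n) for _ in range(d)]
    line = []
    for value in range(n):
        point = list(base)
        point[axis] = value  # move along the chosen unit vector
        line.append(tuple(point))
    return axis, line


def restrict_to_line(f, line):
    """Values of f along the line, in increasing order of the free coordinate."""
    return [f(point) for point in line]
```

A monotone function restricted to any such line is a sorted sequence, so a violation found on a sampled line certifies that the function is not monotone.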
Slide23
Testing Monotonicity on
Dimension reduction for [Dodis et al.’99]:
If => -sample can detect a violation
“Inverse Markov”: For with and
-test [Dodis et al.] via “Levin’s economical work investment strategy” (used in other papers for testing connectedness of a graph, properties of images, etc.)
Slide24
Testing Monotonicity on
“Discretized Inverse Markov”
For with and
For each pick samples of size => complexity
For the good bucket j the test rejects with constant probability => -test
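The bucketing idea can be illustrated with one standard form of a discretized inverse Markov inequality, stated here as an assumption since its constants may differ from those on the slide: if $X \in [0,1]$ has mean $\mu > 0$ and $L = \lceil \log_2(2/\mu) \rceil$, then some bucket $j \in \{1, \dots, L\}$ satisfies $\Pr[X \ge 2^{-j}] \ge 2^j \mu / (4L)$. The sketch lists such buckets for an empirical distribution; `good_buckets` is an illustrative name.

```python
import math


def good_buckets(values):
    """Buckets j with Pr[X >= 2^-j] >= 2^j * mu / (4 * L).

    `values` are numbers in [0, 1], read as a uniform distribution over the
    list; mu is their mean and L = ceil(log2(2 / mu)). The assumed
    inequality guarantees that the result is non-empty whenever mu > 0.
    """
    mu = sum(values) / len(values)
    if mu <= 0:
        return []
    num_buckets = math.ceil(math.log2(2 / mu))
    good = []
    for j in range(1, num_buckets + 1):
        tail = sum(1 for v in values if v >= 2.0 ** -j) / len(values)
        if tail >= (2 ** j) * mu / (4 * num_buckets):
            good.append(j)
    return good
```

A tester that does not know which bucket is good can try every $j$, sampling more often for buckets whose threshold is small; only logarithmically many buckets need to be tried.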
Slide25
Distance Approximation and Tolerant Testing
[Saks, Seshadhri ’10]
Approximating $L_p$-distance to monotonicity
Sublinear algorithm for isotonic regression
Time complexity of tolerant $L_p$-testing for monotonicity is
Better dependence than what follows from distance approximation for
Improves adaptive distance approximation of [Fattal, Ron ’10] for Boolean functions
Slide26
Distance Approximation
Theorem: with constant probability over the choice of a random sample S of size
Implies an tolerant tester by setting
Suffices:
Improves previous algorithm [Fattal, Ron ’10]
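One natural way to turn a random sample into a distance estimate, sketched here under the assumption that the estimator is the normalized $L_1$ distance to monotonicity of the function restricted to the sample (the exact statement and sample size are not recoverable from the text). Sampling without replacement and the function names are also assumptions.

```python
import random
from itertools import accumulate


def l1_distance_to_monotone(f):
    """Unnormalized L1 distance from f to the nearest nondecreasing sequence."""
    if not f:
        return 0.0
    levels = sorted(set(f))
    # best[j] = cheapest fit of the prefix whose last fitted value is <= levels[j]
    best = [0.0] * len(levels)
    for x in f:
        cost = [b + abs(x - v) for b, v in zip(best, levels)]
        best = list(accumulate(cost, min))
    return best[-1]


def estimate_distance(f, sample_size, rng=None):
    """Estimate the normalized L1 distance to monotonicity from a sample.

    Draws `sample_size` distinct positions, keeps them in domain order, and
    returns the per-point L1 distance to monotonicity of the restriction.
    """
    rng = rng or random.Random()
    indices = sorted(rng.sample(range(len(f)), sample_size))
    restricted = [f[i] for i in indices]
    return l1_distance_to_monotone(restricted) / sample_size
```

A tolerant tester can then be obtained by comparing the estimate with a threshold between $\varepsilon_1$ and $\varepsilon_2$, provided the estimate is accurate enough to separate the two cases.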
Slide27
Distance Approximation
For $f$, violation graph $G$: edge $(x, y)$ if $x \le y$ and $f(x) > f(y)$
MM(G) = maximum matching
VC(G) = minimum vertex cover
[Fischer et al.’02]
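The objects on this slide can be computed directly for a function on a line. The sketch builds the violation graph, finds a greedy maximal matching (a stand-in for a maximum matching, used only for brevity), and computes the minimum vertex cover via the longest nondecreasing subsequence, since the points outside a cover form a violation-free, hence nondecreasing, subsequence. Function names are illustrative.

```python
from bisect import bisect_right


def violation_edges(f):
    """Edges (i, j) with i < j and f[i] > f[j]."""
    n = len(f)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if f[i] > f[j]]


def greedy_maximal_matching(edges):
    """A maximal (not necessarily maximum) matching, built greedily."""
    matched = set()
    matching = []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching


def min_vertex_cover_size(f):
    """Minimum vertex cover of the violation graph of a sequence.

    Equals len(f) minus the length of a longest nondecreasing subsequence.
    """
    tails = []
    for v in f:
        pos = bisect_right(tails, v)
        if pos == len(tails):
            tails.append(v)
        else:
            tails[pos] = v
    return len(f) - len(tails)
```

For any maximal matching of size $m$, the cover size lies between $m$ and $2m$: every matching edge needs its own cover vertex, and the endpoints of a maximal matching already cover all edges.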
Slide28
Define:
has hypergeometric distribution:
Slide29
Experiments
Data: Apple stock price data (2005-2015) from Google Finance
Left: $L_p$-isotonic regression
Right: error vs. sample size
Slide30
$L_p$-Testers for Other Properties
Via combinatorial characterization of $L_p$-distance to the property
Lipschitz property:
Via (implicit) proper learning: approximate in up to error , test approximation on a random -sample
Convexity: (tight for )
Submodularity [Feldman, Vondrak ’13]
Slide31
Open Problems
All our algorithms for $L_p$-testing for $p \ge 1$ were obtained directly from $L_1$-testers.
Can one design better algorithms by working directly with $L_p$-distances?
Our complexity for $L_p$-testing convexity grows exponentially with d.
Is there an $L_p$-testing algorithm for convexity with subexponential dependence on the dimension?
Our $L_p$-tester for monotonicity is nonadaptive, but we show that adaptivity helps for Boolean range.
Is there a better adaptive tester?
We designed tolerant testers only for monotonicity (d=1,2).
Tolerant testers for higher dimensions?
Other properties?