Basics of information theory and information complexity
June 1, 2013
Mark Braverman, Princeton University
A tutorial
Part I: Information theory
Information theory, in its modern form, was introduced in the 1940s to study the problem of transmitting data over physical channels.
[Diagram: Alice sends data to Bob over a communication channel.]
Quantifying “information”
Information is measured in bits. The basic notion is Shannon's entropy.
The entropy of a random variable is the (typical) number of bits needed to remove the uncertainty of the variable. For a discrete variable X:
H(X) := Σₓ Pr[X = x] · log₂(1 / Pr[X = x]).
Shannon’s entropy
Important examples and properties:
- If X is a constant, then H(X) = 0.
- If X is uniform on a finite set S of possible values, then H(X) = log₂|S|.
- If X is supported on at most s values, then H(X) ≤ log₂ s.
- If Y is a random variable determined by X, then H(Y) ≤ H(X).
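A short, self-contained sketch (not from the original slides) that checks these properties numerically; the function name `entropy` and the example distributions are illustrative choices of mine.

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a distribution given as {value: probability}."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

# A constant has zero entropy.
print(entropy({"a": 1.0}))                                   # 0.0
# A uniform distribution on 8 values has entropy log2(8) = 3 bits.
print(entropy({i: 1 / 8 for i in range(8)}))                 # 3.0
# Any distribution supported on 8 values has entropy at most 3 bits.
print(entropy({0: 0.9, **{i: 0.1 / 7 for i in range(1, 8)}}))  # < 3
```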
Conditional entropy
For two (potentially correlated) variables X and Y, the conditional entropy of X given Y is the amount of uncertainty left in X once Y is known:
H(X|Y) := E_{y~Y} [ H(X | Y = y) ].
One can show that H(XY) = H(Y) + H(X|Y).
This important fact is known as the chain rule.
If X and Y are independent, then H(X|Y) = H(X), and hence H(XY) = H(X) + H(Y).
Example
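The specific example on this slide did not survive the extraction; the following is an illustrative stand-in of the same flavor, using only the definitions above (the choice of B₁, B₂, X, Y is mine, not the original slide's).

```latex
% Illustrative example (not necessarily the one from the original slide):
% B_1, B_2 are independent fair bits, X = (B_1, B_2), Y = B_1 \oplus B_2.
\begin{align*}
H(X) &= 2, & H(Y) &= 1, & H(Y \mid X) &= 0 \ \text{($Y$ is determined by $X$)},\\
H(X \mid Y) &= 1, & H(XY) &= H(Y) + H(X \mid Y) = 1 + 1 = 2.
\end{align*}
```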
Mutual information
The mutual information is defined as
I(X;Y) := H(X) − H(X|Y) = H(Y) − H(Y|X).
"By how much does knowing Y reduce the entropy of X?"
It is always non-negative: I(X;Y) ≥ 0.
Conditional mutual information: I(X;Y|Z) := H(X|Z) − H(X|YZ).
Chain rule for mutual information: I(XY;Z) = I(X;Z) + I(Y;Z|X).
Simple intuitive interpretation.
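A small sketch (not from the slides) verifying the equivalent identity I(X;Y) = H(X) + H(Y) − H(XY) on a toy joint distribution; the distribution and helper name are illustrative.

```python
import math

def H(dist):
    """Shannon entropy (bits) of {outcome: probability}."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

# A small correlated joint distribution over (X, Y).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

# I(X;Y) = H(X) + H(Y) - H(XY); positive here because X and Y are correlated.
print(H(px) + H(py) - H(joint))
```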
Example – a biased coin
A coin that is biased by ε towards either Heads or Tails is tossed several times.
Let B denote the direction of the bias, and suppose that a priori both options are equally likely: H(B) = 1.
How many tosses are needed to determine B?
Let T₁, T₂, …, Tₙ be the sequence of tosses.
Start by asking how much a single toss reveals about B.
What do we learn about B?
I(B; T₁) = H(T₁) − H(T₁ | B) = 1 − H(½ + ε) = O(ε²).
Similarly, I(B; T₁T₂…Tₙ) = O(n·ε²).
To determine B with constant accuracy, we therefore need n = Ω(1/ε²) tosses.
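A small numerical check (mine, not from the slides) of the O(ε²) behaviour of a single toss; the helper name `h2` for binary entropy is an assumption.

```python
import math

def h2(p):
    """Binary entropy (bits)."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# One toss of a coin whose Heads-probability is 1/2 + eps or 1/2 - eps (each a priori
# equally likely) reveals I(B; T1) = 1 - H(1/2 + eps) bits about the bias direction B.
for eps in (0.1, 0.05, 0.025):
    info = 1 - h2(0.5 + eps)
    print(f"eps={eps}: I(B;T1)={info:.6f}  (info/eps^2={info / eps**2:.3f})")
```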
Kullback–Leibler
(KL) divergence
A (non-symmetric) measure of distance between two distributions on the same space:
D(P ‖ Q) := Σₓ P(x) · log( P(x) / Q(x) ).
Plays a key role in information theory.
D(P ‖ Q) ≥ 0, with equality if and only if P = Q.
Caution: in general D(P ‖ Q) ≠ D(Q ‖ P)!
Properties of KL-divergence
Connection to mutual information:
I(X;Y) = E_{y~Y} [ D( (X | Y = y) ‖ X ) ].
If X and Y are independent, then the distribution of X | Y = y equals the distribution of X, and both sides are 0.
Pinsker's inequality relates KL divergence to statistical (ℓ₁) distance:
‖P − Q‖₁ ≤ O(√(D(P ‖ Q))).
This is tight, e.g. for a pair of coins with biases ½ and ½ + ε.
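A short sketch (mine, not from the slides) computing D(P‖Q) for two biased coins, illustrating the asymmetry and one standard form of Pinsker's inequality; the helper names are illustrative.

```python
import math

def kl(p, q):
    """KL divergence D(P || Q) in bits, for distributions over the same finite set."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

def l1(p, q):
    """l1 (statistical) distance between the two distributions."""
    return sum(abs(p[x] - q[x]) for x in p)

# Two slightly different biased coins.
P = {"H": 0.55, "T": 0.45}
Q = {"H": 0.50, "T": 0.50}

print(kl(P, Q), kl(Q, P))    # not symmetric
# One standard form of Pinsker's inequality: l1 <= sqrt(2 ln2 * D), with D in bits.
print(l1(P, Q), math.sqrt(2 * math.log(2) * kl(P, Q)))
```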
Back to the coin example
"Follow the information learned from the coin tosses."
This can be done using combinatorics, but the information-theoretic language is more natural for expressing what's going on.
Back to communication
The reason information theory is so important for communication is that information-theoretic quantities readily operationalize.
We can attach an operational meaning to Shannon's entropy: H(X) ≈ "the cost of transmitting X".
Let C(X) be the (expected) cost of transmitting a sample of X.
Is H(X) = C(X)?
Not quite.
Let X be a uniformly random trit: X ∈ {1, 2, 3}, so H(X) = log₂ 3 ≈ 1.585.
Encoding 1 → 0, 2 → 10, 3 → 11 gives C(X) ≤ (1 + 2 + 2)/3 = 5/3 ≈ 1.67.
It is always the case that C(X) ≥ H(X).
But H(X) and C(X) are close.
Huffman coding: C(X) ≤ H(X) + 1.
This is a compression result: "an uninformative message turned into a short one".
Therefore: H(X) ≤ C(X) ≤ H(X) + 1.
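A minimal sketch (mine, not from the slides) that builds a Huffman code for the trit example above and compares its expected length with the entropy; `huffman_lengths` is an illustrative helper, not a named routine from the tutorial.

```python
import heapq
import math

def huffman_lengths(dist):
    """Codeword lengths of a Huffman code for {symbol: probability}."""
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(dist.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}  # one level deeper
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

trit = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
lengths = huffman_lengths(trit)
expected_cost = sum(trit[s] * lengths[s] for s in trit)
entropy = sum(p * math.log2(1 / p) for p in trit.values())
print(lengths, expected_cost, entropy)   # cost 5/3 ~ 1.67 vs H = log2(3) ~ 1.585
```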
Shannon’s noiseless coding
The cost of communicating many copies of X scales as H(X).
Shannon's source coding theorem: let C(Xⁿ) be the cost of transmitting n independent copies of X. Then the amortized transmission cost satisfies
lim_{n→∞} C(Xⁿ)/n = H(X).
This equation gives H(X) its operational meaning.
[Diagram: the communication channel; ≈ H(X) bits per copy to transmit the X's.]
H is nicer than C
H is additive for independent variables: H(X₁X₂…Xₙ) = H(X₁) + … + H(Xₙ).
For example, let T₁, T₂, T₃ be independent trits: H(T₁T₂T₃) = 3 log₂ 3 ≈ 4.75, while encoding the triple jointly costs about 4.81 bits, strictly less than 3·C(T₁) = 5, so C is not additive.
H also works well with concepts such as channel capacity.
“Proof” of Shannon’s noiseless coding
By additivity of entropy, H(Xⁿ) = n·H(X).
By compression (Huffman), H(Xⁿ) ≤ C(Xⁿ) ≤ H(Xⁿ) + 1.
Therefore n·H(X) ≤ C(Xⁿ) ≤ n·H(X) + 1, and lim_{n→∞} C(Xⁿ)/n = H(X).
Operationalizing other quantities
Conditional entropy (cf. the Slepian–Wolf theorem):
[Diagram: when Bob already knows Y, transmitting the X's over the communication channel costs ≈ H(X|Y) per copy.]
Operationalizing other quantities
Mutual information:
[Diagram: ≈ I(X;Y) bits per copy to sample Y's correlated with the X's over the communication channel.]
Information theory and entropy
Allows us to formalize intuitive notions. Operationalized in the context of one-way transmission and related problems. Has nice properties (additivity, chain rule, …).
Next, we discuss extensions to more interesting communication scenarios.
Communication complexity
Focus on the two-party randomized setting.
[Diagram: Alice (A) holds X, Bob (B) holds Y; they share randomness R.]
A & B implement a functionality F(X,Y), e.g. F(X,Y) = "X = Y?".
Communication complexity
[Diagram: Alice (A) holds X, Bob (B) holds Y; they share randomness R.]
Goal: implement a functionality F(X,Y).
A protocol π computing F(X,Y) is a sequence of messages
m₁(X, R), m₂(Y, m₁, R), m₃(X, m₁, m₂, R), …
Communication cost = # of bits exchanged.
Communication complexity
Numerous applications and potential applications (some will be discussed later today). It is considerably more difficult to obtain lower bounds here than for transmission (though still much easier than in other models of computation!).
Communication complexity
(Distributional) communication complexity with input distribution μ and error ε: CC(F, μ, ε), the cost of the cheapest protocol computing F with error ≤ ε w.r.t. inputs drawn from μ.
(Randomized / worst-case) communication complexity: CC(F, ε), where error ≤ ε is required on all inputs.
Yao's minimax: CC(F, ε) = max_μ CC(F, μ, ε).
Examples
Equality: EQ(X, Y) := 1 if X = Y, and 0 otherwise.
With shared randomness, comparing O(log 1/ε) random parities of X and Y gives CC(EQ, ε) = O(log 1/ε).
Equality
F is "X = Y?".
μ is a distribution where w.p. ½ X = Y and w.p. ½ (X, Y) are independent random strings.
Protocol: Alice sends MD5(X) [128 bits]; Bob replies with "X = Y?" [1 bit]. Error?
This shows that CC(F, μ, ε) ≤ 129 for a tiny ε (the probability of a hash collision).
Examples
In fact, using information complexity, sharper bounds can be obtained.
Information complexity
Information complexity :: communication complexity
as
Shannon's entropy :: transmission cost
Information complexity
The smallest amount of information Alice and Bob need to exchange to solve F.
How is information measured?
Communication cost of a protocol: the number of bits exchanged.
Information cost of a protocol: the amount of information revealed.
Basic definition 1: The information cost of a protocol
Prior distribution: (X, Y) ~ μ.
[Diagram: Alice (A) holds X, Bob (B) holds Y; they run the protocol π, producing the protocol transcript Π.]
IC_μ(π) := I(Π; Y | X) + I(Π; X | Y)
= what Alice learns about Y + what Bob learns about X.
Example
F is "X = Y?".
μ is a distribution where w.p. ½ X = Y and w.p. ½ (X, Y) are independent random strings.
Protocol: Alice sends MD5(X) [128 bits]; Bob replies with "X = Y?" [1 bit].
IC_μ(π) = what Alice learns about Y + what Bob learns about X = 1 + 64.5 = 65.5 bits.
The prior μ matters a lot for information cost!
If μ is a singleton distribution (Alice and Bob already know each other's inputs), then IC_μ(π) = 0 for every protocol π.
Example
F is "X = Y?".
μ is a distribution where (X, Y) are just uniformly random (and independent).
Protocol: Alice sends MD5(X) [128 bits]; Bob replies with "X = Y?" [1 bit].
IC_μ(π) = what Alice learns about Y + what Bob learns about X = 0 + 128 = 128 bits.
Basic definition 2: Information complexity
Communication complexity: CC(F, μ, ε) = min over protocols π computing F (with error ≤ ε w.r.t. μ) of the communication cost of π.
Analogously: IC(F, μ, ε) = inf over protocols π computing F (with error ≤ ε w.r.t. μ) of IC_μ(π).
(The infimum, rather than a minimum, is needed here: it is not always attained.)
Prior-free information complexity
Using a minimax argument one can get rid of the prior.
For communication, we had: CC(F, ε) = max_μ CC(F, μ, ε).
For information: IC(F, ε) = max_μ IC(F, μ, ε).
Ex: The information complexity of Equality
What is IC(EQ, μ, 0)?
Consider the following protocol.
Alice holds X ∈ {0,1}ⁿ, Bob holds Y ∈ {0,1}ⁿ.
Using the shared randomness, pick a uniformly random non-singular matrix A over GF(2), with rows A₁, …, Aₙ.
In round i, Alice sends A_i·X and Bob replies with A_i·Y.
Continue for n steps, or until a disagreement is discovered.
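A runnable sketch (mine, not from the slides) of this protocol. For simplicity it draws independent shared random linear forms instead of the rows of a random non-singular matrix, so for X ≠ Y there is a 2⁻ⁿ chance of never seeing a disagreement; the information-cost behaviour is the same in spirit.

```python
import random

def equality_protocol(x, y, n, rng=random):
    """In round i the players compare <A_i, x> and <A_i, y> mod 2 for a shared random
    vector A_i, stopping at the first disagreement."""
    bits_exchanged = 0
    for _ in range(n):
        a = [rng.randint(0, 1) for _ in range(n)]          # shared random linear form
        ax = sum(ai * xi for ai, xi in zip(a, x)) % 2       # Alice's bit
        ay = sum(ai * yi for ai, yi in zip(a, y)) % 2       # Bob's bit
        bits_exchanged += 2
        if ax != ay:
            return False, bits_exchanged                    # disagreement: x != y for sure
    return True, bits_exchanged                             # declare "equal"

x = [1, 0, 1, 1, 0, 0, 1, 0]
y = [1, 0, 1, 1, 0, 1, 1, 0]
print(equality_protocol(x, y, n=8))   # typically stops after O(1) rounds when x != y
```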
Analysis (sketch)
If X≠Y, the protocol will terminate in O(1)
rounds on average, and thus reveal O(1) information. If X=Y… the players only learn the fact that
X=Y (≤1 bit of information). Thus the protocol has O(1) information complexity for any prior
.
40Slide41
Operationalizing IC: Information equals amortized communication
Recall [Shannon]: lim_{n→∞} C(Xⁿ)/n = H(X).
Turns out: lim_{n→∞} CC(Fⁿ, μⁿ, ε)/n = IC(F, μ, ε), for each ε > 0.
[Error ε allowed on each copy.]
For ε = 0: the behaviour of lim_{n→∞} CC(Fⁿ, μⁿ, 0)/n is an interesting open problem.
Information = amortized communication
lim_{n→∞} CC(Fⁿ, μⁿ, ε)/n = IC(F, μ, ε).
Two directions: "≤" and "≥", mirroring the proof of the noiseless coding theorem:
the "≥" direction corresponds to additivity (of entropy there, of information complexity here), and the "≤" direction to compression (Huffman there).
The "≤" direction: lim_{n→∞} CC(Fⁿ, μⁿ, ε)/n ≤ IC(F, μ, ε).
Start with a protocol π solving F whose information cost IC_μ(π) is close to IC(F, μ, ε).
Show how to compress many copies of π into a protocol whose communication cost is close to its information cost.
More on compression later.
The "≥" direction: lim_{n→∞} CC(Fⁿ, μⁿ, ε)/n ≥ IC(F, μ, ε).
Use the fact that IC(Fⁿ, μⁿ, ε) ≤ CC(Fⁿ, μⁿ, ε), together with the additivity of information complexity:
IC(Fⁿ, μⁿ, ε) = n · IC(F, μ, ε).
Proof: Additivity of information complexity
Let T₁ and T₂ be two two-party tasks.
E.g. "solve F with error ≤ ε w.r.t. μ".
Then IC(T₁ × T₂, μ₁ × μ₂) = IC(T₁, μ₁) + IC(T₂, μ₂).
The "≤" direction is easy (run the two optimal protocols independently).
The "≥" direction is the interesting one.
Start from a protocol π for T₁ × T₂ with prior μ₁ × μ₂, whose information cost is I.
Show how to construct a protocol π₁ for T₁ with prior μ₁ and a protocol π₂ for T₂ with prior μ₂, with information costs I₁ and I₂ respectively, such that I₁ + I₂ ≤ I.
Protocol π₁ on inputs (X₁, Y₁): publicly sample X₂ ~ μ₂; Bob privately samples Y₂ ~ μ₂ | X₂; run π on ((X₁, X₂), (Y₁, Y₂)).
Protocol π₂ on inputs (X₂, Y₂): publicly sample Y₁ ~ μ₁; Alice privately samples X₁ ~ μ₁ | Y₁; run π on ((X₁, X₂), (Y₁, Y₂)).
Analysis of π₁ (publicly sample X₂; Bob privately samples Y₂; run π):
What Alice learns about Y₁: I(Π; Y₁ | X₁X₂).
What Bob learns about X₁: I(Π; X₁ | Y₁Y₂X₂).
So I₁ = I(Π; Y₁ | X₁X₂) + I(Π; X₁ | Y₁Y₂X₂).
Analysis of π₂ (publicly sample Y₁; Alice privately samples X₁; run π):
What Alice learns about Y₂: I(Π; Y₂ | X₁X₂Y₁).
What Bob learns about X₂: I(Π; X₂ | Y₁Y₂).
So I₂ = I(Π; Y₂ | X₁X₂Y₁) + I(Π; X₂ | Y₁Y₂).
Adding I₁ and I₂ and applying the chain rule:
I₁ + I₂ = I(Π; Y₁Y₂ | X₁X₂) + I(Π; X₁X₂ | Y₁Y₂) = IC_{μ₁×μ₂}(π) = I.
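A sketch of the chain-rule step behind "adding I₁ and I₂", under the sampling conventions reconstructed above (the exact conditioning follows my reading of the two sub-protocols, not a verbatim formula from the slides).

```latex
\begin{align*}
I_1 + I_2
 &= \underbrace{I(\Pi;Y_1\mid X_1X_2) + I(\Pi;Y_2\mid X_1X_2Y_1)}_{=\,I(\Pi;\,Y_1Y_2\mid X_1X_2)}
  + \underbrace{I(\Pi;X_2\mid Y_1Y_2) + I(\Pi;X_1\mid Y_1Y_2X_2)}_{=\,I(\Pi;\,X_1X_2\mid Y_1Y_2)}\\
 &= I(\Pi;Y_1Y_2\mid X_1X_2) + I(\Pi;X_1X_2\mid Y_1Y_2)
  \;=\; \mathrm{IC}_{\mu_1\times\mu_2}(\pi).
\end{align*}
```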
Summary
Information complexity is additive.
Operationalized via "information = amortized communication": lim_{n→∞} CC(Fⁿ, μⁿ, ε)/n = IC(F, μ, ε).
It seems to be the "right" analogue of entropy for interactive computation.
Entropy vs. Information Complexity
                  Entropy                      IC
Additive?         Yes                          Yes
Operationalized   lim C(Xⁿ)/n = H(X)           lim CC(Fⁿ,μⁿ,ε)/n = IC(F,μ,ε)
Compression?      Huffman: C(X) ≤ H(X) + 1     ???!
Can interactive communication be compressed?
Is it true that CC(F, μ, ε) = O(IC(F, μ, ε))?
Less ambitiously: can one at least bound CC(F, μ, ε) by some fixed polynomial function of IC(F, μ, ε)?
(Almost) equivalently: given a protocol π with IC_μ(π) = I, can Alice and Bob simulate π using ≈ I bits of communication?
Not known in general…
Direct sum theorems
Let F be any functionality.
Let C(F) be the cost of implementing F.
Let Fⁿ be the functionality of implementing n independent copies of F.
The direct sum problem: "Does C(Fⁿ) ≈ n · C(F)?"
In most cases it is obvious that C(Fⁿ) ≤ n · C(F).
Direct sum – randomized communication complexity
Is it true that CC(Fⁿ, ε) = Ω(n · CC(F, ε))?
Is it true that CC(Fⁿ, μⁿ, ε) = Ω(n · CC(F, μ, ε))?
Direct product – randomized communication complexity
Direct sum: does solving n independent copies, each with error ε, require ≈ n times the communication?
Direct product: does any protocol that uses much less than n times the communication succeed in solving all n copies only with exponentially small probability?
Direct sum for randomized CC and interactive compression
Direct sum: CC(Fⁿ, μⁿ, ε) = Ω(n · CC(F, μ, ε))?
In the limit (using information = amortized communication): IC(F, μ, ε) = Ω(CC(F, μ, ε))?
Interactive compression: can a protocol with information cost I be simulated using ≈ I bits of communication?
Same question!
The big picture
- additivity (= direct sum) for information
- information = amortized communication
- direct sum for communication? ⇔ interactive compression?
Current results for compression
A protocol π that has C bits of communication, conveys I bits of information over the prior μ, and works in r rounds can be simulated (up to polylogarithmic factors):
- using O(√(I·C)) bits of communication;
- using 2^{O(I)} bits of communication;
- using roughly I + O(√(I·r)) bits of communication when the number of rounds r is small;
- if μ is a product distribution, then using O(I · polylog C) bits of communication.
Their direct sum counterparts
For product distributions μ: CC(Fⁿ, μⁿ, ε) = Ω(n · CC(F, μ, ε)), up to polylogarithmic factors.
When the number of rounds is bounded by r, a direct sum theorem holds.
Direct product
The best one can hope for is a statement of the type: any protocol that uses much less than n · CC(F, μ, 2/3) communication solves all n copies correctly with probability at most 2^{−Ω(n)}.
Statements of this type can be proved with somewhat weaker parameters (roughly, with the communication threshold weakened by a √n factor).
Proof 2: Compressing a one-round protocol
Say Alice speaks: her message M is distributed according to M_X (a distribution depending on her input X), while from Bob's point of view it is distributed according to M_Y.
Recall the KL divergence: D(P ‖ Q) = Σₓ P(x) log(P(x)/Q(x)); the information the message conveys is, in expectation, D(M_X ‖ M_Y).
Bottom line: Alice has a distribution M_X; Bob has a distribution M_Y.
Goal: sample from M_X using ≈ D(M_X ‖ M_Y) bits of communication.
The dart board
[Figure: a "dart board": the strip U × [0,1], with random darts (u₁,q₁), (u₂,q₂), … and the histogram of M_X drawn over it.]
Interpret the public randomness as a sequence of random points (uᵢ, qᵢ) in U × [0,1], where U is the universe of all possible messages.
The first point that falls under the histogram of M_X is distributed according to M_X.
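A small simulation (mine, not from the slides) of the dart-board claim: rejection-sampling public darts against the histogram of M_X yields a sample distributed according to M_X. The distribution MX and the function name are illustrative.

```python
import random
from collections import Counter

def first_dart_under(p, rng):
    """Return the message u of the first public 'dart' (u, q) with q < p[u]."""
    universe = list(p)
    while True:
        u = rng.choice(universe)   # uniform message from the universe U
        q = rng.random()           # uniform height in [0, 1]
        if q < p[u]:
            return u

MX = {"a": 0.5, "b": 0.3, "c": 0.2}
rng = random.Random(0)
counts = Counter(first_dart_under(MX, rng) for _ in range(100_000))
print({u: counts[u] / 100_000 for u in MX})   # empirically close to MX
```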
Proof idea
Goal: sample from M_X using ≈ D(M_X ‖ M_Y) bits of communication, with statistical error ε.
[Figure: the same dart board, with both histograms M_X (Alice's) and M_Y (Bob's) drawn over ≈ |U| public samples u₁, u₂, …]
Proof idea (continued)
Alice finds the first dart u that falls under the histogram of M_X and sends Bob a couple of hash values h₁(u), h₂(u).
Bob looks for a dart under his own histogram M_Y that is consistent with these hashes.
Proof idea (continued)
If no consistent candidate is found, Bob doubles his histogram (2·M_Y, 4·M_Y, …) and Alice sends a few more hash values (h₃(u), h₄(u), …, up to ≈ log(1/ε) extra hashes in total), until Bob identifies the dart u.
The resulting sample has statistical error ε.
Analysis
If M_X(u) ≈ 2ᵏ · M_Y(u), then the protocol will reach round k of doubling.
At that point there are about 2ᵏ candidate darts under Bob's scaled histogram, so about k additional hashes are needed to narrow them down to one.
The contribution of u to the communication cost is therefore ≈ M_X(u) · ( log(M_X(u)/M_Y(u)) + O(1) ), which sums to ≈ D(M_X ‖ M_Y) + O(1).
Done!
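A sketch of the cost accounting suggested by the analysis above, under the (reconstructed) assumption that a message u reached at doubling round k ≈ log(M_X(u)/M_Y(u)) costs O(k) hash bits:

```latex
\[
\mathbb{E}_{u \sim M_X}\!\left[\,\log\frac{M_X(u)}{M_Y(u)} + O(1)\right]
  \;=\; \sum_{u} M_X(u)\log\frac{M_X(u)}{M_Y(u)} \;+\; O(1)
  \;=\; D(M_X \,\|\, M_Y) + O(1).
\]
```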
External information cost
Prior distribution: (X, Y) ~ μ.
[Diagram: Alice (A) holds X, Bob (B) holds Y; an external observer Charlie (C) watches the protocol π and its transcript Π.]
IC^ext_μ(π) := I(Π; XY) = what Charlie learns about (X, Y).
Example
F is "X = Y?".
μ is a distribution where w.p. ½ X = Y and w.p. ½ (X, Y) are random.
Protocol: Alice sends MD5(X) [128 bits]; Bob replies with "X = Y?" [1 bit].
External information cost: what Charlie learns about (X, Y).
External information cost
It is always the case that IC^ext_μ(π) ≥ IC_μ(π): an external observer learns at least as much as the players do.
If μ is a product distribution, then IC^ext_μ(π) = IC_μ(π).
External information complexity
IC^ext(F, μ, ε) := inf over protocols π computing F (with error ≤ ε w.r.t. μ) of IC^ext_μ(π).
Can it be operationalized?
Operational meaning of IC^ext?
Conjecture: zero-error communication scales like external information:
lim_{n→∞} CC(Fⁿ, μⁿ, 0)/n = IC^ext(F, μ, 0)?
Recall: for ε > 0, lim_{n→∞} CC(Fⁿ, μⁿ, ε)/n = IC(F, μ, ε).
Example – transmission with a strong prior
μ is such that X is a uniformly random bit, and Y = X with very high probability (say 1 − ε, for a tiny ε).
F is just the "transmit X" function: Bob must learn X.
Clearly, the best protocol should just have Alice send X to Bob.
Its external information cost is 1 bit, while its internal information cost is only H(ε) ≪ 1.
Example – transmission with a strong prior
For this example, the zero-error amortized cost lim_{n→∞} CC(Fⁿ, μⁿ, 0)/n matches the external information cost (1 bit per copy), not the much smaller internal one.
Other examples, e.g. the two-bit AND function, fit into this picture.
Additional directions
- Information complexity
- Interactive coding
- Information theory in TCS
Interactive coding theory
So far we have focused the discussion on noiseless coding. What if the channel has noise?
[What kind of noise?] In the non-interactive case, each channel has a capacity C.
Channel capacity
The amortized number of channel uses needed to send X over a noisy channel of capacity C is H(X)/C.
This decouples the task from the channel!
Example: Binary Symmetric Channel
Each bit gets independently flipped with probability ε < ½.
One-way capacity: C = 1 − H(ε).
[Diagram: BSC — 0 → 0 and 1 → 1 with probability 1 − ε; 0 → 1 and 1 → 0 with probability ε.]
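A quick numerical check of the capacity formula (a sketch of mine; the helper names are illustrative):

```python
import math

def h2(p):
    """Binary entropy (bits)."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def bsc_capacity(eps):
    """One-way capacity of the binary symmetric channel with crossover probability eps."""
    return 1 - h2(eps)

for eps in (0.0, 0.01, 0.11, 0.5):
    print(eps, bsc_capacity(eps))   # 1.0, ~0.92, ~0.5, 0.0
```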
Interactive channel capacity
It is not clear that one can decouple the channel from the task in such a clean way; interactive capacity is much harder to calculate and reason about.
Example: the binary symmetric channel. One-way capacity: 1 − H(ε).
Interactive capacity (for simple pointer jumping, [Kol-Raz'13]): 1 − Θ(√(H(ε))).
Information theory in communication complexity and beyond
A natural extension would be to multi-party communication complexity. There has been some success in the number-in-hand case.
What about the number-on-forehead model? Explicit bounds for log n or more players would imply explicit ACC⁰ circuit lower bounds.
Naïve multi-party information cost
[Diagram: number-on-forehead with three players — A sees (Y, Z), B sees (X, Z), C sees (X, Y).]
Naïve information cost: I(Π; X | YZ) + I(Π; Y | XZ) + I(Π; Z | XY), i.e., what each player learns about the input on her own forehead.
Naïve multi-party information cost
This doesn't seem to work.
Secure multi-party computation [Ben-Or, Goldwasser, Wigderson] means that anything can be computed at near-zero information cost.
(Although these constructions require the players to share private channels/randomness.)
Communication and beyond…
The rest of today:
- Data structures
- Streaming
- Distributed computing
- Privacy
- Exact communication complexity bounds
- Extended formulations lower bounds
- Parallel repetition?
- …
Thank You!
Open problem: Computability of IC
Given the truth table of F, the error ε, and the prior μ, compute IC(F, μ, ε).
Via CC(Fⁿ, μⁿ, ε)/n one can compute a sequence of upper bounds converging to IC(F, μ, ε).
But the rate of convergence as a function of n is unknown.
Open problem: Computability of IC
One can compute the r-round information complexity of F, IC_r(F, μ, ε), for each fixed r.
But the rate of convergence of IC_r to IC as a function of r is unknown.
Conjecture: IC_r(F, μ, ε) − IC(F, μ, ε) behaves like O(1/r²).
This is the relationship for the two-bit AND function.