
THE SINGLE BIGGEST PROBLEM IN COMMUNICATION IS THE ILLUSION THAT IT HAS TAKEN PLACE.

GEORGE BERNARD SHAW

WORDS EMPTY AS THE WIND ARE BEST LEFT UNSAID.

HOMER

EVERYTHING BECOMES A LITTLE DIFFERENT AS SOON AS IT IS SPOKEN OUT LOUD.

HERMANN HESSE

LANGUAGE IS A VIRUS FROM OUTER SPACE.

WILLIAM S. BURROUGHS
ANUP RAO, AMIR YEHUDAYOFF

COMMUNICATION COMPLEXITY
(EARLY DRAFT)
Contents

I Fundamentals 15

1 Deterministic Protocols 17
Some Examples of Communication Problems and Protocols 17
Defining 2 party protocols 19
Rectangles 20
Balancing Protocols 21
From Rectangles to Protocols 22
Some lower bounds 23
Rectangle Covers 29
Counterexample to deterministic direct sum of relations 31

2 Rank 33
Basic Properties of Rank 33
Lower bounds using Rank 35
Towards the Log-Rank Conjecture 37
Non-negative Rank and Covers 42

3 Randomized Protocols 45
Variants of Randomized Protocols 47
Public Coins vs Private Coins 49
Nearly Monochromatic Rectangles 50

4 Numbers On Foreheads 53
Cylinder Intersections 56
Lower bounds from Ramsey Theory 57

5 Discrepancy 63
Some Examples Using Convexity in Combinatorics 64
Lower bounds for Inner-Product 65
Lower bounds for Disjointness in the Number-on-Forehead model 68

6 Information 73
Entropy, Divergence and Mutual Information 75
Some Examples from Combinatorics 80
Lower bound for Indexing 84
Randomized Communication of Disjointness 85
Lower bounds on Non-Negative Rank 88
Lower bound for Number of Rounds 90

7 Compressing Communication 97
Correlated Sampling 99
Compressing a Single Round of Communication 101
Compressing Entire Protocols with Low Internal Information 106
Lower bounds from Compression Theorems 110

II Applications 115

8 Circuits, Branching Programs 117


Boolean Circuits 117
Karchmer-Wigderson Games 118
Karchmer-Wigderson Games in few Rounds 119
Lowerbounds on the Depth of Monotone Circuits 120

Lowerbounds on Circuits with few Alternations 122


Monotone Circuit Depth Hierarchy 125
Branching Programs 127
Boolean Formulas 128
Formula Lowerbounds for Parity 129
Boolean Depth Conjecture 129

9 Proof Systems 131


Resolution Refutations 131
Cutting Planes 134

10 Data Structures 141


Maintaining a Set of Numbers 141
Lower Bounds on Static Data Structures 143
Lower bounds on Dynamic Data Structures 148
Graph Connectivity 151
Dictionaries 156
2d range counting Chazelle 157

11 Extension Complexity of Convex Polytopes 159


Extension Complexity of a Polytope 163
Slack Matrices 168
Lower bounds on Extension Complexity 170

12 Distributed Computing 179


Coloring Problem 179
On computing the Diameter 180
Detecting Triangles 182
Verifying a Spanning Tree 183

Bibliography 185
Introduction

This is a textbook about interactive communication, a concept first
made mathematically rigorous by Yao1. The study of communication
complexity was born out of the struggle to overcome some of the
biggest philosophical barriers to understanding computation—our
inability to prove lower bounds on computational processes. The
last few decades have seen amazing progress in the development of
fast algorithms for all kinds of natural and basic algorithmic tasks,
like multiplying numbers and matrices, computing shortest paths,
and finding minimum cuts in graphs. Yet we really have no idea whether
the algorithms we have found are the fastest possible. The best we
can say is that any algorithm for these problems must read all of the
input.

1 Yao, 1979
In light of this difficulty, the pursuit of lower bounds has taken
two paths. The first is to consider limited computational models,
where the algorithm is restricted to a small set of operations, and
one tries to prove lower bounds on the resources needed to operate
within these restrictions. The hope is that one can gradually relax the
limitations until one is able to prove the right lower bounds on all
computational models. The second is to prove conditional results—
one argues that if one could find an efficient solution to problem A,
then one could also solve problem B efficiently. If B is a well-known
problem that no one has been able to solve efficiently, this is taken
as evidence that A must be hard. In this way, even if one cannot
conclusively establish that there are no efficient algorithms for these
tasks, one can reduce the problem to proving lower bounds on a few
hard tasks.
The study of communication protocols began to take a central role
in the first approach. This is largely due to two important features:
1. the concept is general enough that it captures something impor-
tant about many other models of computation. Efficient streaming
algorithms, data structures, linear programs, proofs, boolean
circuits, and distributed algorithms all give rise to efficient commu-
nication protocols for solving related tasks.

2. the concept is simple and natural enough that methods from


combinatorics, analysis, probability and information theory can
be leveraged to understand the complexity of communication
problems.

In this book, we explain some of the central results in the area of


communication complexity, and show how they can be used to prove
surprising results about several other models. This book is a living
document: comments about the content are always appreciated, and
the authors intend to keep the book up to date for the foreseeable
future.
Each page of the book has a large margin, where you can find
references to the relevant literature, diagrams, and additional expla-
nations of arguments in the main text (like this one). We encourage the
reader to switch back and forth between the margin and the main text,
as they find appropriate.

Acknowledgements

Thanks to Morgan Dixon, Abe Friesen, Mika Göös, Jeff Heer, Pavel
Hrubeš, Guy Kindler, Vincent Liew, Venkatesh Medabalimi, Shay
Moran, Rotem Oshman, Sebastian Pokutta, Kayur Patel, Sivaramakr-
ishnan Natarajan Ramamoorthy, Cyrus Rashtchian, Thomas Rothvoß,
and Makrand Sinha for many contributions to this book.
Conventions and Preliminaries

In this chapter, we set up notation and explain some basic facts that
are used throughout the book.

Sets, Numbers and Functions

For a positive integer h, we use [h] to denote the set {1, 2, . . . , h}.
2[h] denotes the power set, namely the family of all subsets of [h].
All logarithms are computed base 2 unless otherwise specified. A
boolean function is a function whose values are in the set {0, 1}.
Random variables are denoted by capital letters (e.g. A) and
values they attain are denoted by lower-case letters (e.g. a). Events
in a probability space will be denoted by calligraphic letters (e.g.
E ). Given a = a1 , a2 , . . . , an , we write a≤i to denote a1 , . . . , ai . We
define a<i similarly. We write aS to denote the projection of a to
the coordinates specified in the set S ⊆ [n]. [k] denotes the set
{1, 2, . . . , k}, and [k]<n denotes the set of all strings of length less than
n over the alphabet [k ], including the empty string. |z| denotes the
length of the string z.

Graphs

A graph on the set [n] (called the vertices) is a collection of sets of


size 2 (called edges). A clique C ⊆ [n] in the graph is a subset in which
every pair of distinct vertices forms an edge of the graph. An independent set I ⊆ [n] in the
graph is a set that does not contain any edges. A path in the graph
is a sequence of vertices v1 , . . . , vn such that {vi , vi+1 } is an edge for
each i. A cycle is a path whose first and last vertices are the same.
The graph is said to be connected if there is a path between every
two distinct vertices in the graph. The graph is called a tree if it is
connected and has no cycles.
One can prove by induction on n that every tree of size n has
exactly n − 1 edges.

Probability
Throughout this book, we consider only finite probability spaces. We
use the notation p(a) to denote both the distribution on the variable a,
and the number Pr_p[A = a]. The meaning will be clear from context.
We write p( a|b) to denote either the distribution of A conditioned on
the event B = b, or the number Pr[ A = a| B = b]. Given a distribution
p( a, b, c, d), we write p( a, b, c) to denote the marginal distribution on
the variables a, b, c (or the corresponding probability). We often write
p( ab) instead of p( a, b) for conciseness of notation. If E is an event,
we write p(E ) to denote its probability according to p. We denote by
E_{p(a)}[g(a)] the expected value of g(a) in p. We write A − M − B to
assert that p(amb) = p(m) · p(a|m) · p(b|m), namely that A and B are
independent once M is fixed.
The statistical distance (also known as total variation distance)
between p(x) and q(x) is defined to be

$$|p - q| = \frac{1}{2} \sum_x |p(x) - q(x)| = \max_T \big(p(T) - q(T)\big),$$

where the maximum is taken over all subsets T of the universe. (The
proof of this equality is an easy exercise.) We sometimes write
$p(x) \overset{\epsilon}{\approx} q(x)$ to indicate that $|p(x) - q(x)| \leq \epsilon$.

Suppose a, b are two variables in a probability space p. For ease of
notation, we write $p(a|b) \overset{\epsilon}{\approx} p(a)$ for average b, to mean that

$$\mathbf{E}_{p(b)}\big[\, |p(a|b) - p(a)| \,\big] \leq \epsilon.$$
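The two expressions for statistical distance can be checked numerically. The following is a minimal Python sketch (not from the text; the two distributions are arbitrary examples) comparing half the ℓ1 distance with the maximum advantage over all events on a small universe.

```python
# A minimal sketch (not from the text) checking the two expressions for
# statistical distance: half the L1 distance equals the maximum advantage
# p(T) - q(T) over all events T.
from itertools import chain, combinations

p = {0: 0.5, 1: 0.3, 2: 0.2}    # two distributions on a 3-element universe
q = {0: 0.25, 1: 0.25, 2: 0.5}

half_l1 = 0.5 * sum(abs(p[x] - q[x]) for x in p)

def subsets(universe):
    xs = list(universe)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

max_advantage = max(sum(p[x] - q[x] for x in T) for T in subsets(p))

print(half_l1, max_advantage)   # both equal 0.3; the maximizer is T = {x : p(x) > q(x)}
```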

Some Useful Inequalities


Markov’s Inequality

Suppose X is a non-negative random variable, and γ is a positive number.


Then Markov’s inequality bounds the probability that X exceeds γ in
terms of the expected value of X. We have

$$\mathbf{E}[X] \geq p(X > \gamma) \cdot \gamma, \quad\text{and so}\quad p(X > \gamma) \leq \mathbf{E}[X]/\gamma.$$

Approximating linear functions with exponentials

We will often need to approximate linear functions with exponentials.
For this it is often useful to note that e^{−x} ≥ 1 − x when x ≥ 0, and
1 − x ≥ 2^{−2x} when 0 ≤ x ≤ 1/2.

[Margin figure: the curves e^{−x}, 1 − x and 2^{−2x} plotted over the interval [0, 1/2].]
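These two elementary bounds are easy to verify numerically; here is a quick, illustrative sanity check in Python (the grid and floating-point tolerance are arbitrary choices, not from the text).

```python
# A quick numerical sanity check (illustrative) of the two bounds:
# e^{-x} >= 1 - x for x >= 0, and 1 - x >= 2^{-2x} for 0 <= x <= 1/2.
import math

for i in range(501):
    x = i / 1000.0                            # a grid over [0, 1/2]
    assert math.exp(-x) >= 1 - x
    assert 1 - x >= 2 ** (-2 * x) - 1e-12     # small slack for floating-point error

print("both bounds hold on the sampled grid")
```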

Cauchy-Schwarz Inequality
The Cauchy-Schwarz inequality says that for two vectors x, y ∈ R^n,
their inner product is at most the product of their lengths:

$$\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i \leq \|x\| \cdot \|y\| = \sqrt{\sum_{i=1}^{n} x_i^2} \cdot \sqrt{\sum_{i=1}^{n} y_i^2}.$$

Convexity
 
A function f : R → R is said to be convex if

$$\frac{f(x) + f(y)}{2} \geq f\left(\frac{x + y}{2}\right)$$

for all x, y in the domain. It is said to be concave if

$$\frac{f(x) + f(y)}{2} \leq f\left(\frac{x + y}{2}\right).$$

Some convex functions: x^2, e^x, x log x. Some concave functions:
log x, √x.

Jensen's inequality says that if a function f is convex, then E[f(X)] ≥
f(E[X]), for any real-valued random variable X. Similarly, if f is
concave, then E[f(X)] ≤ f(E[X]).

A consequence of Jensen's inequality is the Arithmetic-Mean
Geometric-Mean inequality: for non-negative numbers a_1, . . . , a_n,

$$\frac{\sum_{i=1}^{n} a_i}{n} \geq \left(\prod_{i=1}^{n} a_i\right)^{1/n},$$

which can be proved using the concavity of the log function:

$$\log\left(\frac{\sum_{i=1}^{n} a_i}{n}\right) \geq \frac{\sum_{i=1}^{n} \log a_i}{n} = \log\left(\left(\prod_{i=1}^{n} a_i\right)^{1/n}\right).$$

[Margin figure: the graph of x log x.]
Part I

Fundamentals
1
Deterministic Protocols

A protocol specifies a way for k parties, who each have access


to different inputs, to communicate in order to learn about some
property of all the inputs. Each of the k parties may have access to
different bits of information. We begin by giving some interesting
examples of communication problems.

Some Examples of Communication Problems and Protocols


Many of these examples will be discussed in much more detail in
future chapters of the book.

Equality Alice and Bob are given two n-bit strings x, y ∈ {0, 1}n and
want to know if x = y. There is a trivial solution: Alice can send
her input to Bob, and Bob can let her know if x = y. This is a
deterministic1 protocol that takes n + 1 bits of communication, and
we shall prove that no deterministic protocol is more efficient. On
the other hand, for every number k, there is a randomized1 protocol
that uses only k + 1 bits of communication and errs with probability
at most 2^{−k}: the parties can hash their inputs and check that the
hashes are the same. There is a non-deterministic1 protocol that
uses O(log n) bits of communication: if Alice guessed an index i
where x_i ≠ y_i, she could send it to Bob and they could confirm
that their inputs are not the same.

1 These terms will be made clear in due course.
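To make the randomized protocol concrete, here is a minimal Python sketch (an illustration, not the book's formal model) in which the public randomness consists of k random parities of the strings; if x ≠ y, each parity distinguishes the inputs with probability 1/2, so the error is at most 2^{−k}.

```python
# A minimal sketch (not the book's formal definition) of the randomized
# Equality protocol with public randomness: Alice sends k random parities of
# her string and Bob compares them with his own.
import random

def equality_protocol(x, y, k, rng=random):
    n = len(x)
    # public randomness: k random subsets of the coordinates, known to both parties
    rows = [[rng.randint(0, 1) for _ in range(n)] for _ in range(k)]
    alice_bits = [sum(r[i] * x[i] for i in range(n)) % 2 for r in rows]   # k bits sent
    bob_bits   = [sum(r[i] * y[i] for i in range(n)) % 2 for r in rows]
    return alice_bits == bob_bits    # Bob's one-bit answer

x = [1, 0, 1, 1, 0, 0, 1, 0]
y = [1, 0, 1, 1, 0, 1, 1, 0]
print(equality_protocol(x, x, k=10))   # always True
print(equality_protocol(x, y, k=10))   # False, except with probability 2^{-10}
```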
Cliques and Independent Sets Alice and Bob are given a graph G
on n vertices, and two subsets A, B ⊆ [n] of the vertices. Alice
knows A and G and Bob knows B and G. In addition, A is always
a clique in the graph, and B is always an independent set. They
want to know whether A intersects B or not. There is no one-way
protocol that solves this problem with less than n bits
of communication. However, there is an interactive protocol that
uses O(log^2 n) bits of communication to solve the problem. If
A contains a vertex v of degree less than n/2, Alice announces
v. Either v ∈ B, or Alice and Bob can safely discard all the non-
neighbors of v, since these cannot be a part of A. This reduces the
size of the graph by a factor of 2. Similarly, if B contains a vertex v
of degree at least n/2, Bob announces the name of v. Again, either
v ∈ A, or Alice and Bob can safely delete all the neighbors of v
from the graph, which reduces the size of the graph by a factor
of 2. If neither party has such a vertex to announce, then every vertex
of A has degree at least n/2 and every vertex of B has degree less
than n/2, so A and B cannot intersect. Otherwise, in each round of
communication, the number of vertices that Alice and Bob are working
with is reduced by a factor of 2. So, after k rounds, the number of
vertices is at most n/2^k. If k exceeds log n, the number of vertices left
will be less than 1, and Alice and Bob will know whether A and B
intersect or not. Thus the protocol can run for at most log n rounds,
proving that at most O(log^2 n) bits will be exchanged.

A clique is a set of vertices that are all connected to each other. An independent set is a set of vertices that contains no edges.
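The following Python sketch illustrates the interactive protocol just described. It is only a simulation under simplifying assumptions (ours, not the book's): the graph is a dictionary mapping each vertex to its set of neighbors, and "announcing" a vertex is a local computation rather than a transmitted message.

```python
# An illustrative simulation of the interactive Cliques and Independent Sets
# protocol. Assumptions: neighbors[v] is the set of neighbours of v, A is a
# clique, and B is an independent set.
def cis_protocol(neighbors, A, B):
    live = set(neighbors)                       # vertices still under consideration
    A, B = set(A), set(B)
    while live:
        half = len(live) / 2
        v = next((u for u in A & live if len(neighbors[u] & live) < half), None)
        if v is not None:                        # Alice announces a low-degree clique vertex
            if v in B:
                return True
            live &= neighbors[v]                 # the rest of A lies among v's neighbours
        else:
            v = next((u for u in B & live if len(neighbors[u] & live) >= half), None)
            if v is None:                        # no announcement possible: A, B cannot meet
                return False
            if v in A:                           # Bob announces a high-degree independent vertex
                return True
            live -= neighbors[v] | {v}           # the rest of B avoids v and its neighbours
    return False

# Example: the path 1-2-3-4; A = {2, 3} is a clique, B is an independent set.
g = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(cis_protocol(g, {2, 3}, {1, 4}))   # False: A and B are disjoint
print(cis_protocol(g, {2, 3}, {1, 3}))   # True: they share vertex 3
```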
k-Disjointness Alice and Bob are given two sets A, B ⊆ [n], each
of size k, and want to know if the sets share a common element.
Alice can send her set over, which takes k log n bits of communi-
cation. There is a randomized protocol that uses only O(k ) bits of
communication. Alice and Bob sample a random sequence of sets
in the universe. Alice announces the name of the first set that con-
tains A. If A and B are disjoint, this eliminates half of B. Repeating
this procedure gives a protocol with O(k) bits of communication.
There is a non-deterministic protocol that uses O(log n) bits of
communication.
k-party Disjointness The input is k sets A1 , . . . , Ak ⊆ [n], and there are
k parties. The i’th party knows all the sets except for the i’th one.
The parties want to know if there is a common element in all sets.
There is a deterministic protocol with O(n/2k ) bits of communi-
cation, and this is known to be essentially the best protocol. We
know that no randomized protocol can have communication less

than n/2k , but it is not known whether this bound is tight.
3-Sum The input is three numbers x, y, z ∈ [n]. Alice knows ( x, y),
Bob knows (y, z) and Charlie knows ( x, z). The parties want to
know whether or not x + y + z = n. Alice can tell Bob x, which
would allow Bob to announce the answer. This takes O(log n) bits
of communication. There is a deterministic protocol that com-
municates o (log n) bits, but one can show that any deterministic
protocol must communicate ω (1) bits. There is a randomized
protocol that communicates O(1) bits.
Pointer Chasing The input consists of two functions f , g : [n] → [n],
where Alice knows f and Bob knows g. Let a0 , a1 , . . . , ak ∈ [n] be
defined by setting a0 = 1, and ai = f ( g( ai−1 )). The goal is to com-
pute ak . There is a simple k round protocol with communication

O(k log n) that solves this problem, but any protocol with fewer
than k rounds requires Ω(n) bits of communication.
Graph Connectivity The input is an undirected graph on the vertices
[n]. There are k parties, and the j’th party knows all of the edges
except those that touch the vertices of [( j − 1)n/k, jn/k]. The parties
want to know whether 1 is connected to n in the graph. The trivial
deterministic protocol takes O(n2 /k) bits of communication. One
can show that there is no randomized protocol with less than n/2k
bits of communication.

Defining 2 party protocols

Here we define exactly what we mean by a 2 party deterministic
protocol. The definition is meant to capture a conversation between
the two players. Suppose Alice's input comes from a set X and
Bob's input comes from Y. A protocol π is specified by a rooted
tree, where every internal vertex v has 2 children. Every such vertex
v is associated with either the first or second party, and a function
f_v : X → {0, 1} (or f_v : Y → {0, 1}) mapping an input of that party to
one of the 2 children of the vertex v.

The setup is analogous for k party protocols. Let X_1, X_2, . . . , X_k be k sets. A k-party communication protocol defines a way for k parties to communicate information about their inputs, where the i'th party gets an input from the set X_i. Every vertex v is associated with a party i and a function f_v : X_i → {0, 1}.

Given inputs (x, y) ∈ X × Y, the outcome of the protocol π(x, y)
is a leaf in the protocol tree, computed as follows. The parties begin
by setting the current vertex to be the root of the tree. If the first
party (resp. second party) is associated with the current vertex v, she
announces the value of f_v(x) (resp. f_v(y)). Both parties set the new
current vertex to be the child of v indicated by the value of f_v. This
process is repeated until the current vertex is a leaf, and this leaf is
the outcome of the protocol. In other words, inputs (x, y) induce a
path from the root of the protocol tree to the leaf π(x, y). This path
corresponds to the conversation between the parties. The length of the
protocol π, denoted ‖π‖, is the depth of the protocol tree2.

2 The length of the longest path from root to leaf.

In the special case that X = Y = {0, 1}^n, and each of the functions f_v is restricted to being equal to a bit of the input, the resulting protocol is called a decision tree, a model worthy of study in its own right.

Given a boolean function g : X × Y → {0, 1}, we say that π
computes g if π(x, y) determines g(x, y) for every input in X × Y.
The communication complexity of a function is c if there is a protocol that
computes the function with c bits of communication, but no protocol
can compute the function with less than c bits of communication.
The number of rounds of the protocol is the maximum number of
alternations that occur between messages of the first party and
messages of the second party on any root to leaf path in the tree.

One can easily generalize the definitions to handle functions that are not boolean. We restrict our attention to boolean functions for simplicity.

Some basic observations:

Example: Alice sends 2 bits, then Bob sends 3 bits, and then Alice sends 1 bit. The length is 6 and the number of rounds is 3.

Fact 1.1. For any protocol π, the number of rounds in π is always at most
‖π‖ − 1.

Fact 1.2. The number of leaves in the protocol tree of π is at most 2^{‖π‖}.
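The definition of a protocol tree can also be made concrete with a small simulation. The sketch below is illustrative (the class and the example AND protocol are ours, not the book's): each internal node names the speaking party and a boolean function of that party's input, and running the tree yields a leaf together with the transcript of announced bits.

```python
# A minimal sketch of a protocol tree and of running it on an input pair (x, y).
class Node:
    def __init__(self, party=None, f=None, children=None, outcome=None):
        self.party, self.f, self.children, self.outcome = party, f, children, outcome

def run(node, x, y):
    transcript = []
    while node.outcome is None:                       # walk down until a leaf is reached
        bit = node.f(x) if node.party == "Alice" else node.f(y)
        transcript.append(bit)
        node = node.children[bit]
    return node.outcome, transcript

# A protocol computing AND of two single-bit inputs: Alice speaks, then Bob.
leaf = lambda v: Node(outcome=v)
bob = Node("Bob", lambda y: y, [leaf(0), leaf(1)])
tree = Node("Alice", lambda x: x, [leaf(0), bob])
print(run(tree, 1, 1))   # (1, [1, 1]) -- the leaf determines AND(x, y)
print(run(tree, 0, 1))   # (0, [0])
```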

Rectangles

A very useful concept for understanding communication protocols


is the concept of a rectangle in the inputs. A rectangle is a subset of
the form R = A × B ⊆ X × Y .

Lemma 1.3. A set R ⊆ X × Y is a rectangle if and only if whenever
(x, y), (x′, y′) ∈ R, we have (x′, y), (x, y′) ∈ R.

Proof. If R = A × B is a rectangle, then (x, y), (x′, y′) ∈ R means that
x, x′ ∈ A and y, y′ ∈ B. Thus (x, y′), (x′, y) ∈ A × B. On the other
hand, suppose R is an arbitrary set with the given property. If R is empty, it
is a rectangle. If R is not empty, let (x, y) ∈ R be an element. Define
A = {x′ : (x′, y) ∈ R} and B = {y′ : (x, y′) ∈ R}. Then by the
property of R, we have A × B ⊆ R, and for every element (x′, y′) ∈ R,
x′ ∈ A, y′ ∈ B, so R ⊆ A × B. Thus R = A × B.

Figure 1.1: A rectangle.

For k party protocols, a rectangle is a cartesian product of k sets.

If a function is defined by a rectangle A × B, then it certainly has a
very simple communication protocol. If Alice and Bob want to know
if their inputs belong to the rectangle or not, Alice can send a bit
indicating if x ∈ A, and Bob can send a bit indicating if y ∈ B. These
two bits determine whether or not ( x, y) ∈ A × B.
The importance of rectangles stems from the fact that every protocol can be
described using rectangles in the following sense. For every vertex
v in a protocol π, let R_v ⊆ X × Y denote the set of inputs (x, y)
that would lead the protocol to pass through the vertex v during the
execution, and let

Xv = { x ∈ X : ∃y ∈ Y ( x, y) ∈ Rv },
Yv = {y ∈ Y : ∃ x ∈ X ( x, y) ∈ Rv }.

Lemma 1.4. For every vertex v in the protocol tree, Rv is a rectangle with
Rv = Xv × Yv . Moreover, the rectangles given by all the leaves of the
protocol tree form a partition of the inputs to the protocol.

The lemma follows by induction. For the root vertex r, we see that
Rr = X × Y , so indeed the lemma holds. Now consider an arbitrary
vertex v such that Rv = Xv × Yv . Let u, w be the children of v in the
protocol tree. Suppose the first party is associated with v, and u is the
vertex that the players move to when f v ( x ) = 0. Thus
$$X_u = \{x \in X_v : f_v(x) = 0\}, \qquad X_w = \{x \in X_v : f_v(x) = 1\}.$$

X_u and X_w form a partition of X_v, and R_u = X_u × Y_u and R_w =
X_w × Y_w form a partition of R_v.

Figure 1.2: A partition of the space into rectangles.
If we are interested in computing a boolean function g : X × Y →
{0, 1}, we say that a rectangle R ⊆ X × Y is monochromatic if g is
constant on R. We say that the rectangle is 1-monochromatic if g only
takes the value 1 on the rectangle, and 0-monochromatic if g only takes
the value 0 on R. We use Fact 1.2 and Lemma 1.4 to show that every
function with small communication complexity induces a partition of
the inputs into a few monochromatic rectangles.

If a protocol π computes a function g : X × Y → {0, 1}, and v is
a leaf in π, then there cannot be two inputs (x, y), (x′, y′) ∈ R_v such
that g(x, y) ≠ g(x′, y′). So every leaf v of the protocol corresponds
to a monochromatic rectangle R_v under g. Combining this fact with
Fact 1.2 and Lemma 1.4 gives:

Figure 1.3: A 0-monochromatic rectangle.

Theorem 1.5. If the communication complexity of g : X × Y → {0, 1} is c,
then X × Y can be partitioned into at most 2^c monochromatic rectangles.

Balancing Protocols

Fact 1.2 gives the best bound when the protocol tree is a
full binary tree; then the number of leaves in the protocol tree is
exactly 2^c. Does it ever make sense to have a protocol tree that is not
balanced? It turns out that one can always balance an unbalanced
tree.
Theorem 1.6. If π is a protocol with ℓ leaves, then there is a protocol that
computes the outcome π(x, y) with length at most ⌈2 log_{3/2} ℓ⌉.

To prove the theorem, we need a simple lemma about trees.

Lemma 1.7. In every protocol tree that has ℓ > 1 leaves, there is a vertex v
such that the subtree rooted at v contains r leaves, and ℓ/3 ≤ r < 2ℓ/3.

Proof. Consider the sequence of vertices v_1, v_2, . . . defined as follows.
The vertex v_1 is the root of the tree, which is not a leaf by the assump-
tion on ℓ. For each i > 0, the vertex v_{i+1} is the child of v_i that has
the most leaves under it, breaking ties arbitrarily. Let ℓ_i denote the
number of leaves in the subtree rooted at v_i. Then, ℓ_{i+1} ≥ ℓ_i/2, and
ℓ_{i+1} < ℓ_i. Since ℓ_1 = ℓ, and the sequence is decreasing until it hits 1,
there must be some i for which ℓ/3 ≤ ℓ_i < 2ℓ/3.

Figure 1.4: Balancing Protocols.
Input: Alice knows x ∈ X, Bob knows y ∈ Y, both know a protocol π that has ℓ leaves.
Output: The outcome of the protocol π.
while π has more than 1 leaf do
    Find a vertex v as promised by Lemma 1.7;
    Alice and Bob exchange two bits indicating if their inputs are consistent with the path in the protocol tree to v;
    if both inputs are consistent with v then
        Replace π with the subtree rooted at v;
    else
        Remove v from the protocol tree, and replace v's parent with v's sibling;
    end
end
Output the unique leaf in π;

In each step of the balanced protocol (see Figure 1.4), the parties
pick a vertex v as promised by Lemma 1.7, and decide whether
(x, y) ∈ R_v using two bits of communication. That is, Alice sends
a bit indicating if x ∈ X_v and Bob sends a bit indicating if y ∈ Y_v.

If x ∈ Xv and y ∈ Yv , then the parties repeat the procedure at the


subtree rooted at v. Otherwise, the parties delete the vertex v and its
subtree from the protocol tree and continue the simulation. In each
step, the number of leaves of the protocol tree is reduced to at most 2/3
of its previous value, so there can be at most log_{3/2} ℓ such steps.
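A small sketch of the walk used in the proof of Lemma 1.7, assuming (for illustration only) that a protocol tree is encoded as nested Python lists whose leaves are non-list values; the walk requires the tree to have more than one leaf.

```python
# Step to the child with the most leaves until the current subtree holds
# between one third and two thirds of all the leaves (Lemma 1.7).
def num_leaves(t):
    return 1 if not isinstance(t, list) else sum(num_leaves(c) for c in t)

def balanced_vertex(tree):                 # assumes the tree has more than one leaf
    total, node = num_leaves(tree), tree
    while not (total / 3 <= num_leaves(node) < 2 * total / 3):
        node = max(node, key=num_leaves)   # the child with the most leaves
    return node

# A very unbalanced ("caterpillar") tree with 5 leaves.
tree = ["a", ["b", ["c", ["d", "e"]]]]
v = balanced_vertex(tree)
print(num_leaves(v), "of", num_leaves(tree), "leaves under the chosen vertex")   # 3 of 5
```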

From Rectangles to Protocols

Given Theorem 1.5, one might wonder whether every partition


of the inputs can be realized by a protocol. While this is not true in
general (see Figure 1.5), we can show that a small partition of the
inputs into monochromatic rectangles can be used to give an efficient
protocol3:

Theorem 1.8. If X × Y can be covered by 2^c monochromatic rectangles,
then there is a protocol that computes g with O(c^2) bits of communication.

A key concept we will need to understand this protocol is the
notion of two rectangles intersecting horizontally and vertically. We say
that two rectangles R = A × B and R′ = A′ × B′ intersect horizontally
if A intersects A′, and intersect vertically if B intersects B′. If x ∈
A ∩ A′ and y ∈ B ∩ B′, then (x, y) ∈ A × B and (x, y) ∈ A′ × B′,
proving:

Fact 1.9. If R, R′ are disjoint rectangles, they cannot intersect both horizontally and vertically.

Figure 1.5: A partition into rectangles that does not correspond to a protocol.

3 Yannakakis, 1991; and Aho et al., 1983

An efficient partition of the 1's of the input space into rectangles also leads to an efficient protocol (Exercise ??).
The parties are given inputs (x, y) and know a collection of
monochromatic rectangles R that contain all inputs. The aim of
the protocol is to find a rectangle R_{x,y} ∈ R such that (x, y) ∈ R_{x,y}. In
each step, one of the parties will announce the name of a rectangle
R = A × B that is consistent with their input (e.g. x ∈ A). If Alice an-
nounces such a rectangle, then it must be that x ∈ A, so both parties
can safely discard all rectangles in R that do not horizontally intersect R.
Any rectangle that contains (x, y) will not be discarded. Similarly, if
Bob announces R, then both parties can safely discard all rectangles
that do not vertically intersect R. We shall show that there will
always be a rectangle R that one of the parties can announce so that
many other rectangles are discarded.

Figure 1.6: Rectangles 1 and 2 intersect vertically, while rectangles 1 and 3 intersect horizontally.

Let R0 = {R ∈ R : g(R) = 0} be the set of rectangles that have
value 0, and R1 = {R ∈ R : g(R) = 1} be the set of rectangles with
value 1.
Definition 1.10. Say that a rectangle R = ( A × B) ∈ R0 is
• horizontally good if x ∈ A, and R horizontally intersects at most half of
the rectangles in R1 , and

• vertically good if y ∈ B, and R vertically intersects at most half of the


rectangles in R1 .

Observe that the set of horizontally good rectangles is determined
by x, and the set of vertically good rectangles is determined by y.
Suppose g(x, y) = 0. Then there must be a rectangle R_{x,y} ∈ R0 that
contains (x, y). Since the rectangles of R1 are disjoint from R_{x,y}, Fact
1.9 implies that every rectangle in R1 does not intersect R_{x,y} both
horizontally and vertically. Thus either at most half of the rectangles
in R1 intersect R_{x,y} horizontally, or at most half of them intersect
R_{x,y} vertically. Moreover, any such rectangle is consistent with both
players' input. So we have shown:

Claim 1.11. Any rectangle of R0 that contains (x, y) is either horizontally
good, or vertically good.

In each step of the protocol, one of the parties announces the name
of a rectangle that is either horizontally good or vertically good, if
such a rectangle exists. This leads to half of the rectangles in R1
being discarded. If no such rectangle exists, then it must mean that
no rectangle of R0 covers (x, y), and so g(x, y) = 1. Since R1 can
survive at most c + 1 such discards, and a rectangle in the family
can be described with c bits of communication, the communication
complexity of the protocol is at most O(c^2).

Figure 1.7: A Protocol from Monochromatic Rectangle Covers.
Input: Alice knows x ∈ X, Bob knows y ∈ Y, both know a set of monochromatic rectangles R whose union contains (x, y).
Output: g(x, y).
while R1 is not empty do
    if ∃ R ∈ R0 that is horizontally good then
        Alice sends Bob the name of R;
        Both parties discard all rectangles from R1 that do not horizontally intersect R;
    else if ∃ R ∈ R0 that is vertically good then
        Bob sends Alice the name of R;
        Both parties discard all rectangles from R1 that do not vertically intersect R;
    else
        The parties output 1;
    end
end
The parties output 0;

Recent work4 has shown that there is a function g under which the
inputs can be partitioned into 2^c monochromatic rectangles, yet no
protocol can compute g using o(c^2) bits of communication, showing
that Theorem 1.8 is tight.

4 Göös et al., 2015; and Kothari, 2015
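The protocol of Figure 1.7 can also be simulated directly. Below is an illustrative Python sketch (the rectangles of the cover are listed explicitly as pairs of sets with a colour, which is only feasible for tiny examples, and announcements are local calls rather than transmitted names).

```python
# An illustrative simulation of the protocol behind Theorem 1.8.
def cover_protocol(rects, x, y):
    """rects: list of (A, B, value) with A, B sets and value in {0, 1}."""
    R0 = [(A, B) for (A, B, v) in rects if v == 0]
    R1 = [(A, B) for (A, B, v) in rects if v == 1]
    while R1:
        # Alice looks for a horizontally good 0-rectangle: x is in A and it
        # horizontally intersects at most half of the remaining 1-rectangles.
        R = next(((A, B) for (A, B) in R0 if x in A and
                  sum(1 for (A1, _) in R1 if A & A1) <= len(R1) / 2), None)
        if R is not None:
            R1 = [(A1, B1) for (A1, B1) in R1 if R[0] & A1]
            continue
        # Otherwise Bob looks for a vertically good 0-rectangle.
        R = next(((A, B) for (A, B) in R0 if y in B and
                  sum(1 for (_, B1) in R1 if B & B1) <= len(R1) / 2), None)
        if R is None:
            return 1                       # no 0-rectangle covers (x, y)
        R1 = [(A1, B1) for (A1, B1) in R1 if R[1] & B1]
    return 0                               # a 0-rectangle covering (x, y) always survives

# Equality on one bit, covered by four monochromatic rectangles:
eq_cover = [({0}, {0}, 1), ({1}, {1}, 1), ({0}, {1}, 0), ({1}, {0}, 0)]
print(cover_protocol(eq_cover, 0, 0), cover_protocol(eq_cover, 0, 1))   # 1 0
```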

Some lower bounds

We turn to proving that some problems do not have efficient


protocols. The easiest way to prove a lower bound is to use the
characterization provided by Theorem 1.5. If we can show that the
inputs cannot be partitioned into 2c monochromatic rectangles, or do
not have large monochromatic rectangles, then that proves that there
is no protocol computing the function with c bits of communication.

Size of Monochromatic Rectangles


Equality Consider the equality function EQ : {0, 1}n × {0, 1}n → {0, 1}
defined as:

$$\mathsf{EQ}(x, y) = \begin{cases} 1 & \text{if } x = y, \\ 0 & \text{otherwise.} \end{cases} \tag{1.1}$$

Alice can send Bob her input, and Bob can respond with the value
of a function, giving a protocol with complexity n + 1. Is there
a protocol with complexity n? Since any such protocol induces
a partition into 2^n monochromatic rectangles, a first attempt at
proving a lower bound might be to try to show that there is no
large monochromatic rectangle. If we could prove that, then we
could argue that many monochromatic rectangles are needed
to cover the whole input. However, the equality function does
have large monochromatic rectangles. For example, the rectangle
R = {(x, y) : x_1 = 0, y_1 = 1}. This is a rectangle that has
density 1/4, and it is monochromatic, since EQ(x, y) = 0 for every
( x, y) ∈ R. We will try to show that equality does not have a large
1-monochromatic rectangle, and argue that this is good enough to
prove a lower bound.
Observe that if x ≠ x′, then the points (x, x) and (x′, x′) cannot be
in the same monochromatic rectangle. Otherwise, by Lemma 1.3,
(x, x′) would also have to be included in this rectangle. Since the
rectangle is monochromatic, we would have EQ(x, x′) = EQ(x, x),
which is a contradiction. We have shown:

Claim 1.12. If R is a 1-monochromatic rectangle, then | R| = 1.

Since there are 2^n inputs x with EQ(x, x) = 1, this means 2^n
rectangles are needed to cover such inputs. There is also at least
one more 0-monochromatic rectangle. We conclude:

Theorem 1.13. The deterministic communication complexity of EQ is


n + 1.

Disjointness Next, consider the disjointness function Disj : 2[n] × 2[n] →


{0, 1} defined by:

$$\mathsf{Disj}(X, Y) = \begin{cases} 1 & \text{if } X \cap Y = \emptyset, \\ 0 & \text{otherwise.} \end{cases} \tag{1.2}$$

Alice can send her whole set X to Bob, which gives a protocol
with communication n + 1. Can we prove that this is optimal? Once
again, this function does have large monochromatic rectangles,
for example the rectangle R = {( X, Y ) : 1 ∈ X, 1 ∈ Y }, but we
shall show that there are no large monochromatic 1-rectangles.
Indeed, suppose R = A × B is a 1-monochromatic rectangle. Let
X′ = ∪_{X∈A} X and Y′ = ∪_{Y∈B} Y. Then X′ and Y′ must be disjoint,
so |X′| + |Y′| ≤ n. On the other hand, |A| ≤ 2^{|X′|}, |B| ≤ 2^{|Y′|}, so
|R| = |A||B| ≤ 2^n. We have shown:

Claim 1.14. Every 1-monochromatic rectangle of Disj has size at most 2^n.



On the other hand, the number of disjoint pairs (X, Y) is exactly
3^n. That's because for every element of the universe, there are
3 possibilities: to be in X, be in Y, or be in neither. Thus, at least
3^n/2^n = 2^{(log 3 − 1)n} monochromatic rectangles are needed to cover
the 1's of Disj, and so:

We shall soon prove an optimal lower bound for disjointness.

Theorem 1.15. The deterministic communication complexity of Disj is at


least (log 3 − 1)n.

Richness
Sometimes we need to understand asymmetric communication proto-
cols, where we need separate bounds on the communication complex-
ity of Alice and Bob. The concept of richness5 is useful here:

5 Miltersen et al., 1998

Definition 1.16. A function g : X × Y → {0, 1} is said to be (u, v) rich


if there is a set V ⊆ Y of size |V | = v such that for all y ∈ V, there is a set
U_y ⊆ X of size |U_y| = u so that for all x ∈ U_y we have g(x, y) = 1.

A rich function with an efficient protocol must contain a large 1-monochromatic rectangle:

Lemma 1.17. If g : X × Y → {0, 1} is (u, v) rich with u, v > 0, and if


there is a protocol for computing g where Alice sends at most a bits and Bob
sends at most b bits, then g admits a u/2^a × v/2^{a+b} 1-monochromatic rectangle.

Proof. The statement is proved inductively. For the base case, if


the protocol does not communicate at all, then g( x, y) = 1 for all
x ∈ X , y ∈ Y , and the statement holds.
If Bob sends the first bit of the protocol, then Bob partitions Y =
Y_0 ∪ Y_1. One of these two sets must have at least v/2 of the inputs
y that show that g is (u, v) rich. By induction, this set contains a
u/2^a × (v/2)/2^{a+b−1} 1-monochromatic rectangle, as required. On the other
hand, if Alice sends the first bit, then this bit partitions X into two
sets X_0, X_1. Every input y ∈ Y that has u ones must have at least u/2 ones
in either X_0 or X_1. Thus there must be at least v/2 choices of inputs
y ∈ Y that have u/2 ones for g restricted to X_0 × Y or for g restricted
to X_1 × Y. By induction, we get that there is a 1-monochromatic
rectangle with dimensions (u/2)/2^{a−1} × (v/2)/2^{a−1+b}, as required.

Now let us see some examples where richness can be used to


prove lower bounds.

Lopsided Disjointness Suppose Alice is given a set X ⊆ [n] of size


k < n, and Bob is given a set Y ⊆ [n], and they want to compute
whether the sets are disjoint or not. Now the obvious protocol is
for Alice to send her input to Bob, which takes log $\binom{n}{k}$ bits6. How-
ever, what can we say about the communication of this problem if
Alice is forced to send much less than log $\binom{n}{k}$ bits?

6 In Chapter 2, we show that the communication complexity of this problem is at least log $\binom{n}{k}$.

To prove a lower bound, we need to analyze rectangles of a


certain shape. We restrict our attention to a special family of sets for
Alice and Bob. Let n = 2kt, and suppose Y contains exactly one
element of {2i − 1, 2i} for each i ∈ [kt], and that X contains exactly one
element from {2t(i − 1) + 1, . . . , 2ti} for each i ∈ [k].
Claim 1.18. If A × B is a 1-monochromatic rectangle, then |B| ≤
2^{kt − k|A|^{1/k}}.

Proof. We claim that |∪_{X∈A} X| ≥ k|A|^{1/k}. Indeed, if the union
∪_{X∈A} X has a_i elements in {2t(i − 1) + 1, . . . , 2ti}, then

$$\left|\bigcup_{X \in A} X\right| = \sum_{i=1}^{k} a_i \geq k \left(\prod_{i=1}^{k} a_i\right)^{1/k} \geq k |A|^{1/k},$$

by the arithmetic-mean geometric-mean inequality.
∪_{X∈A} X cannot contain both 2i − 1, 2i for any i, since one of these
two elements belongs to a set in B. Thus, the number of possible
choices for sets in B is at most 2^{kt − k|A|^{1/k}}.

Figure 1.8: An input with n = 12, k = 3, t = 2.

The disjointness matrix here is at least (t^k, 2^{kt})-rich, since every
choice of Y allows for t^k possible choices for X that are disjoint. By
Lemma 1.17, any protocol where Alice sends a bits and Bob sends
b bits induces a 1-monochromatic rectangle with dimensions
t^k/2^a × 2^{kt − a − b}, so Claim 1.18 gives:

$$2^{kt - a - b} \leq 2^{kt - kt/2^{a/k}} \quad\Rightarrow\quad a + b \geq \frac{n}{2^{a/k + 1}}.$$

We conclude:

Theorem 1.19. If X, Y ⊆ [n], |X| = k and Alice sends at most a bits
and Bob sends at most b bits in a protocol computing Disj(X, Y), then
a + b ≥ n/2^{a/k+1}.
For example, for k = 2, if Alice sends at most log n bits to Bob,
then Bob must send at least Ω(√n) bits to Alice in order to solve
lopsided disjointness.
Span Suppose Alice is given a vector x ∈ {0, 1}^n, and Bob is given
an n/2 dimensional subspace V ⊆ {0, 1}^n. Their goal is to figure out
whether or not x ∈ V. As in the case of disjointness, we start by
claiming that the inputs do not have 1-monochromatic rectangles
of a certain shape:

Claim 1.20. If A × B is a 1-monochromatic rectangle, then |B| ≤
2^{n^2/2 − n log |A|}.

Proof. The set of x's in the rectangle spans a subspace of di-
mension at least log |A|. The number of n/2 dimensional sub-
spaces that contain this span is thus at most (2^n)^{n/2 − log |A|} ≤
2^{n^2/2 − n log |A|}.

The problem we are working with is at least (2^{n/2}, 2^{n^2/4}/n!)-
rich, since there are at least 2^{n^2/4}/n! subspaces, and each contains
2^{n/2} vectors. Applying Lemma 1.17 and Claim 1.20, we get that if
there is a protocol where Alice sends a bits and Bob sends b bits,
then

$$2^{n^2/4 - a - b}/n! \leq 2^{n^2/2 - n \log 2^{n/2 - a}} \quad\Rightarrow\quad n^2/4 - a(n + 1) - n \log n \leq b.$$

A subspace of dimension n/2 can be specified by picking the n/2 basis vectors. For each such vector, there are at least 2^{n/2} available choices. However, we have over counted by a factor of n!, since every permutation of the basis vectors gives the same subspace. This gives that there are at least 2^{n^2/4}/n! subspaces.

Theorem 1.21. If Alice sends a bits and Bob sends b bits to solve the span
problem, then b ≥ n^2/4 − a(n + 1) − n log n.

For example, if Alice sends at most n/8 bits, then Bob must
send at least Ω(n^2) bits in order to solve the span problem. One of
the players must send a linear number of the bits in their input.

Fooling Sets
A set S ⊂ X × Y is called a fooling set if every monochromatic
rectangle can share at most 1 element with S. Fooling sets can be
used to prove several basic lower bounds on communication.

Greater-than Our first example using fooling sets is the greater-than


function, GT : [n] × [n] → {0, 1}, defined as:

$$\mathsf{GT}(x, y) = \begin{cases} 1 & \text{if } x > y, \\ 0 & \text{otherwise.} \end{cases} \tag{1.3}$$

The trivial protocol computing greater-than has complexity 1 +
⌈log n⌉ bits, and we shall show that this is essentially tight. The
methods we used for the last two examples will surely not work
here, because GT has large 0-monochromatic rectangles (like
R = {(x, y) : x < n/2, y > n/2}) and large 1-monochromatic
rectangles (like R = {(x, y) : x > n/2, y < n/2}). Instead we shall
use a fooling set to prove the bound. Consider the set of n points
S = {(x, x) : x ∈ [n]}. We claim:

Claim 1.22. Two points of S cannot lie in the same monochromatic


rectangle.

Indeed, if R is monochromatic, and x < x′, but (x, x), (x′, x′) ∈
R, then since R is a rectangle, (x′, x) ∈ R. This contradicts the
fact that R is monochromatic, since GT(x′, x) ≠ GT(x′, x′). So
once again, we have shown that the number of monochromatic
rectangles must be at least n, proving:

Theorem 1.23. The deterministic communication complexity of GT is at


least log n.

Disjointness Fooling sets also allow us to prove tighter lower bounds


on the communication complexity of disjointness. Consider the
set S = {(X, [n] \ X) : X ⊆ [n]}, namely the set of pairs of sets and
their complements. No monochromatic rectangle can contain two
such pairs, because if such a rectangle contained (X, [n] \ X), (Y, [n] \ Y)
for X ≠ Y, then it would also contain both (X, [n] \ Y), (Y, [n] \ X), but
at least one of the last pair of sets must intersect, while the first
two pairs are disjoint. Since |S| = 2^n, and at least one more 0-
monochromatic rectangle is required, this proves:

Theorem 1.24. The deterministic communication complexity of disjoint-


ness is n + 1.

Krapchenko’s Method
We end the lower bounds part of this chapter with a method for non-
boolean relations. Let X = {x ∈ {0, 1}^n : ∑_{i=1}^n x_i = 0 mod 2} and
Y = {y ∈ {0, 1}^n : ∑_{i=1}^n y_i = 1 mod 2}. Since X and Y are disjoint,
for every x ∈ X, y ∈ Y, there is an index i such that x_i ≠ y_i. Suppose
Alice is given x and Bob is given y, and they want to find such an
index i. How much communication is required?
Perhaps the most trivial protocol is for Alice to send Bob her entire
string, but we can use binary search to do better. Notice that

$$\sum_{i \leq n/2} x_i + \sum_{i > n/2} x_i \neq \sum_{i \leq n/2} y_i + \sum_{i > n/2} y_i \pmod 2.$$

Alice and Bob can thus exchange ∑i≤n/2 xi mod 2 and ∑i≤n/2 yi
mod 2. If these values are not the same, they can safely restrict their
attention to the strings x≤n/2 , y≤n/2 and continue. On the other hand,
if the values are the same, they can continue the protocol on the
strings x>n/2 , y>n/2 . In this way, in every step they communicate
2 bits and eliminate half of their input string, giving a protocol of
communication complexity 2 log n.
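Here is a small Python sketch of the binary search protocol just described (illustrative; the strings are an arbitrary example). Each round costs two bits and halves the window, so the total communication is 2 log n.

```python
# Alice and Bob hold strings of different parity; they find a coordinate where
# they differ by comparing parities of halves, two bits per round.
def find_difference(x, y):
    lo, hi, bits = 0, len(x), 0            # work on the window [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        a = sum(x[lo:mid]) % 2             # Alice's bit
        b = sum(y[lo:mid]) % 2             # Bob's bit
        bits += 2
        if a != b:
            hi = mid                        # the left halves differ in parity
        else:
            lo = mid                        # so the right halves must differ
    return lo, bits

x = [0, 1, 1, 0, 1, 1, 0, 0]               # even parity
y = [0, 1, 1, 1, 1, 1, 0, 0]               # odd parity
i, bits = find_difference(x, y)
print(i, x[i], y[i], bits)                  # 3 0 1 6
```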
It is easy to see that log n bits of communication are necessary,
because that's how many bits it takes to write down the answer.
(We need at least n monochromatic rectangles to cover pairs of the type
(0, e_i), where e_i is the i'th unit vector.)
Now we shall prove that 2 log n bits are necessary, using a variant of
fooling sets. Consider the set of inputs

S = {( x, y) ∈ X × Y : x, y differ in only 1 coordinate}.

S contains n · 2^{n−1} inputs, since one can pick an input of S by picking


x ∈ X and flipping any of the n coordinates. We will not be able to
argue that every monochromatic rectangle must contain only one
element of S or bound the number of elements in any way. Instead,
we will prove that if such a rectangle does contain many elements of
S, then it is big:

Claim 1.25. Suppose R is a monochromatic rectangle that contains r
elements of S. Then |R| ≥ r^2.

The key observation here is that two distinct elements (x, y), (x, y′) ∈
S cannot be in the same monochromatic rectangle. For if the rectan-
gle was labeled i, then x must disagree with each of y, y′ in the i'th coordi-
nate, but since they both belong to S we must have y = y′. Similarly
we cannot have two distinct elements (x, y), (x′, y) ∈ S that belong to
the same monochromatic rectangle. Thus, if R = A × B has r elements
of S, we must have |A| ≥ r, |B| ≥ r, proving that |R| ≥ r^2.
Now suppose there are t monochromatic rectangles that partition
the set S, and the i'th rectangle covers r_i elements of S. Then |S| =
∑_{i=1}^t r_i, but since the rectangles are disjoint and the i'th has size at
least r_i^2 by Claim 1.25, we get 2^{2n−2} ≥ ∑_{i=1}^t r_i^2. Using
these facts and the Cauchy-Schwarz inequality:

$$2^{2n-2} \geq \sum_{i=1}^{t} r_i^2 \geq \left(\sum_{i=1}^{t} r_i\right)^2 / t = n^2 2^{2n-2}/t,$$

proving that t ≥ n^2. This shows that the binary search protocol is the
best one can do.

Rectangle Covers

Given that rectangles play such a crucial role in the communica-


tion complexity of protocols, it is worth studying alternative ways to
measure the complexity of functions. Here we investigate what one
can say if we count the number of monochromatic rectangles needed
to cover all of the inputs.

Definition 1.26. We say that a boolean function has a 1-cover of size C if


there are C monochromatic rectangles whose union is all of the inputs that
evaluate to 1. We say that the function has a 0-cover of size C if there are C
monochromatic rectangles whose union is all of the inputs that evaluate to 0.

By Theorem 1.5, every function that admits a protocol with com-
munication c also admits a 1-cover of size at most 2^c and a 0-cover of
size at most 2^c. Conversely, Theorem 1.8 shows that small covers can
be used to give small communication.

Rectangle covers have an interesting interpretation in terms of non-deterministic communication complexity. If a function has a 1-cover of size C, then given any input that evaluates to 1, Alice and Bob can non-deterministically guess the name of a rectangle that covers their input, and then check that their inputs are consistent with the guessed rectangle. On the other hand, if their inputs correspond to a 0, no guess will convince them that their input is a 1. One can show that any non-deterministic protocol for a function corresponds to a 1-rectangle cover!

Can the logarithm of the cover number be significantly different
from the communication complexity? Consider the disjointness
function, defined in (1.2). For i = 1, 2, . . . , n, define the rectangle
R_i = {(X, Y) : i ∈ X, i ∈ Y}. Then we see that R_1, R_2, . . . , R_n
form a 0-cover for disjointness. So there is a 0-cover of size n, yet
(Theorem 1.24) the communication complexity of disjointness is n + 1.

In fact, by the proof of Theorem 1.8, this must mean that any 1-cover
for disjointness must have at least 2^{Ω(√n)} rectangles. We shall see
in a later chapter that any 1-cover of disjointness must have at least
2^{Ω(n)} rectangles.
Another interesting example is the k-disjointness function. Here
Alice and Bob are given sets X, Y ⊆ [n], each of size k. We shall see
in Chapter 2 that the communication complexity of k-disjointness
is at least log $\binom{n}{k}$ ≈ k log(n/k). As above, there is a 0-cover of k-
disjointness using n rectangles.

Claim 1.27. k-disjointness has a 1-cover of size $2^{2k} \ln\left(\binom{n}{k}^2\right)$.

We prove Claim 1.27 using the probabilistic method. Sample a
random 1-rectangle by picking a set S ⊆ [n] uniformly at random,
and using the rectangle R = {(X, Y) : X ⊆ S, Y ⊆ [n] \ S}. Namely, the
set of all inputs X, Y where X is contained in S, and Y is contained in
the complement of S. Now sample $t = 2^{2k} \ln\left(\binom{n}{k}^2\right)$ such rectangles
independently. The probability that a particular disjoint pair (X, Y)
is included in any single rectangle is 2^{−2k}. So the probability that the
pair is excluded from all the rectangles is

$$(1 - 2^{-2k})^t \leq e^{-2^{-2k} t} < \binom{n}{k}^{-2},$$

by the choice of t, using the fact that 1 − x ≤ e^{−x} for x ≥ 0. Since the
number of disjoint pairs (X, Y) is at most $\binom{n}{k}^2$, this means that the
probability that any disjoint pair is excluded by the t rectangles is
less than 1. So there must be t rectangles that cover all the 1 inputs.
Setting k = log n, we have found a 1-cover with t = O(n^2 log^2 n)
rectangles. This example shows that Theorem 1.8 is tight, at least
when it comes to rectangle covers.
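The probabilistic argument can be tried out numerically. The sketch below is illustrative (tiny parameters; the number of samples is doubled relative to the bound in Claim 1.27 so that the random experiment succeeds essentially every time): it samples random sets S and checks that the resulting rectangles cover every disjoint pair.

```python
# Sampling random rectangles {(X, Y) : X ⊆ S, Y ⊆ [n] \ S} and checking coverage.
import math, random
from itertools import combinations

n, k = 10, 2
t_claim = math.ceil(2 ** (2 * k) * 2 * math.log(math.comb(n, k)))  # 2^{2k} ln(C(n,k)^2)
t = 2 * t_claim        # doubled so this random demo almost always succeeds

samples = [frozenset(i for i in range(n) if random.random() < 0.5) for _ in range(t)]

def covered(X, Y):
    # (X, Y) lies in the rectangle of S when X is inside S and Y avoids S
    return any(X <= S and not (Y & S) for S in samples)

pairs = [(set(X), set(Y)) for X in combinations(range(n), k)
         for Y in combinations(range(n), k) if not set(X) & set(Y)]
uncovered = sum(1 for X, Y in pairs if not covered(X, Y))
print(t_claim, "rectangles suffice in expectation; uncovered pairs:", uncovered)
```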

Direct-sums in Communication Complexity

The direct-sum question in computational complexity


theory is about the complexity of solving several copies of a given
problem; if a function requires c bits of communication, how much
communication is required to compute k copies of the function?
Given a function g : {0, 1}^n × {0, 1}^n → {0, 1}, we define g^k :
({0, 1}^n)^k × ({0, 1}^n)^k → {0, 1}^k by

$$g^k((x_1, \ldots, x_k), (y_1, \ldots, y_k)) = (g(x_1, y_1), g(x_2, y_2), \ldots, g(x_k, y_k)).$$

We shall use many of the ideas we have developed so far to prove that7:

7 Feder et al., 1995

Theorem 1.28. If g requires c bits of communication, then g^k requires at
least k(√c − log n − 1) bits of communication.

In fact, one can show that even computing the two bits ∧_{i=1}^k g(x_i, y_i)
and ∨_{i=1}^k g(x_i, y_i) requires k(√c − log n − 1) bits of communication8.

8 See Exercise ??.

The main technical lemma we show is:

Lemma 1.29. If g^k can be computed with ℓ bits of communication, then the
inputs to g can be covered by ⌈2n · 2^{ℓ/k}⌉ monochromatic rectangles.

Theorem 1.8 and Lemma 1.29 imply that g has a protocol with
communication (ℓ/k + log n + 1)^2. Thus,

$$c \leq (\ell/k + \log n + 1)^2 \quad\Rightarrow\quad \ell \geq k(\sqrt{c} - \log n - 1),$$

as required.
Now we turn to proving Lemma 1.29. We find rectangles that
cover the inputs to g iteratively. Let S ⊆ {0, 1}n × {0, 1}n denote
the set of inputs to g that have not yet been covered by one of the
monochromatic rectangles we have already found. Initially, S is the
set of all inputs. We claim:

Claim 1.30. There is a rectangle that is monochromatic under g and covers


at least 2^{−ℓ/k}|S| of the inputs from S.

Proof. Since g^k can be computed with ℓ bits of communication, by
Theorem 1.5, the set S^k can be covered by 2^ℓ monochromatic rectan-
gles. So there must be some monochromatic rectangle R that covers
at least 2^{−ℓ}|S|^k of these inputs. For each i, define

$$R_i = \{(x, y) \in \{0, 1\}^n \times \{0, 1\}^n : \exists (a, b) \in R, \; a_i = x, b_i = y\},$$

which is a rectangle, since R is a rectangle. R_i is simply the projection
of the rectangle R to the i'th coordinate. Moreover, since this rectan-
gle is monochromatic under g^k, it must be monochromatic under g.
Since

$$\prod_{i=1}^{k} |R_i \cap S| \geq |R \cap S^k| \geq 2^{-\ell} |S|^k,$$

there must be some i for which |R_i ∩ S| ≥ 2^{−ℓ/k}|S|.

We repeatedly pick rectangles using Claim 1.30 until all of the in-
puts to g are covered. After ⌈2n · 2^{ℓ/k}⌉ steps, the number of uncovered
inputs is at most

$$2^{2n} \cdot (1 - 2^{-\ell/k})^{2n \cdot 2^{\ell/k}} \leq 2^{2n} e^{-2^{-\ell/k} \cdot 2n \cdot 2^{\ell/k}} = 2^{2n} \cdot e^{-2n} < 1,$$

using 1 − x ≤ e^{−x} for all x.

Counterexample to deterministic direct sum of relations


2
Rank

Matrices give a powerful way to represent functions that depend


on two inputs. We can represent g : X × Y → {0, 1} by an m × n
matrix M, where m = |X| is the number of rows and n = |Y| is the
number of columns, and the (i, j)'th entry is M_{ij} = g(i, j). Given this
interpretation of g, one can think of the inputs to the parties as unit
column vectors e_i, e_j. The parties are trying to compute e_i^T M e_j. This
view allows us to bring the many tools of linear algebra to bear on
understanding communication complexity.

Sometimes it is convenient to use M_{ij} = (−1)^{g(i,j)} instead.

If the function depends on the inputs of k parties, the natural representation is by a k-tensor.

Abusing notation, we shall sometimes refer to the communication complexity of M when we really mean to refer to the communication complexity of the associated boolean function.

Basic Properties of Rank

The most basic quantity associated with a matrix is its rank. The
rank of a matrix is the maximum size of a set of linearly independent
rows in the matrix. Its versatility stems from the fact that it has many
interpretations:

Fact 2.1. For an m × n matrix M, rank( M ) = r if and only if:

• r is the smallest number such that M can be expressed as M = AB,


where A is an m × r matrix, and B is an r × n matrix.

• r is the smallest number such that M can be expressed as the sum of r


matrices of rank 1.

• r is the largest number such that M has r linearly independent columns


(or rows).

A useful property of rank that follows immediately from the


definitions:

Fact 2.2. If M′ is a submatrix of M, then rank(M′) ≤ rank(M).



Another nice feature of the rank of matrices is that it behaves


nicely under basic matrix operations. Since the rank of M is the
minimum number of rank 1 matrices that add up to M, we get:

Fact 2.3. |rank( A) − rank( B)| ≤ rank( A + B) ≤ rank( A) + rank( B).

One consequence of Fact 2.3 is that many different representations


of a matrix are more or less equivalent, when it comes to their rank.
For example, if M is a boolean matrix, one can define a matrix M′ of
the same dimensions, with M′_{i,j} = (−1)^{M_{i,j}}, replacing 1's with −1 and
0's with 1. Then we see that M′ = J − 2M, where J is the all 1's matrix,
and so

Fact 2.4. |rank(M′) − rank(M)| ≤ rank(J) = 1.

Since taking linear combinations of the rows or columns cannot


increase the dimension of their span, we get:

Fact 2.5. rank( AB) ≤ min{rank( A), rank( B)}.

The tensor product of an m × n matrix M and an m′ × n′ matrix M′ is
the mm′ × nn′ matrix T = M ⊗ M′ whose entries are indexed by tuples
(i, i′), (j, j′), with T_{(i,i′),(j,j′)} = M_{i,j} · M′_{i′,j′}. The tensor product multiplies
the rank, a fact that is very useful for proving lower bounds:

Fact 2.6. rank(M ⊗ M′) = rank(M) · rank(M′).

The matrices we are working with are boolean, so one can view
the entries of the matrix as real numbers, or rationals, or coming
from the field of integers modulo 2: F2 . This potentially leads to 3
different notions of rank, but we have:

Lemma 2.7. The real rank of a boolean matrix is the same as its rational
rank. The real rank is always at least as large as the rank over F2 .

The proof of the first fact follows from Gaussian elimination. If the
rank over the rationals is r, we can always apply a linear transforma-
tion to the rows using rational coefficients to bring the matrix into
this form:
 
$$\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 & M_{1,r+1} & \cdots & M_{1,n} \\
0 & 1 & 0 & \cdots & 0 & M_{2,r+1} & \cdots & M_{2,n} \\
0 & 0 & 1 & \cdots & 0 & M_{3,r+1} & \cdots & M_{3,n} \\
\vdots & \vdots & \vdots & \ddots &  & \vdots &  & \vdots \\
0 & 0 & 0 & \cdots & 1 & M_{r,r+1} & \cdots & M_{r,n} \\
0 & 0 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots &  & \vdots & \vdots &  & \vdots
\end{pmatrix}$$

This transformation does not affect the rank over the reals, and now
it is clear that the rank is exactly r. Now if any set of rows is linearly
dependent over the rationals, then we can find an integer linear
dependence between them in which not all of the coefficients are even,
and so get a non-trivial linear dependence over F2.
This proves that the rank over F2 is at most the rank over the reals.
Throughout the rest of the book, unless we explicitly state other-
wise, we shall always consider the rank over the reals. One conse-
quence of Lemma 2.7 is:
Lemma 2.8. A boolean matrix of rank r has at most 2^r distinct rows, and at
most 2^r distinct columns.

Proof. Since the rank over F2 is also at most r, every row must be
expressible as a linear combination of some fixed set of r rows over F2. There
are only 2^r such linear combinations possible, so there can be at most
2^r distinct rows.

Lower bounds using Rank

Lemma 2.8 immediately gives some bound on the communication


in terms of the rank of the matrix. If the matrix has rank r, it has at
most 2^r distinct rows. Alice only needs to communicate which one of
these rows her row corresponds to. This takes r bits of communica-
tion. Bob can then respond with the value of the function. We have
shown:
Theorem 2.9. If a matrix has rank r then its communication complexity is
at most r + 1.
The main reason that rank is useful in this context is that it can
be used to prove lower bounds on communication, via the following
theorem:

Theorem 2.9 is far from the last word on the subject. By the end of this chapter, we will prove that the communication is bounded by a quantity closer to √r.

Lemma 2.10. If a boolean matrix can be partitioned into 2^c monochromatic
rectangles, then its rank is at most 2^c.

Lemma 2.10 follows easily from Fact 2.1. For every rectangle
R = A × B, define the matrix where R_{i,j} = 1 if (i, j) ∈ R, and R_{i,j} = 0
otherwise. Then we see that R is a matrix of rank at most 1. Moreover, M
can be expressed as the sum of at most 2^c such matrices, those that
correspond to 1-rectangles.

Lemma 2.10 applies even if the matrix has +1, −1 entries.
Since every function with low communication gives rise to a par-
tition into monochromatic rectangles (Theorem 1.5), we immediately
get:
Theorem 2.11. If a matrix has rank r, then its communication complexity is
at least log r.
Theorem 2.11 allows us to prove lower bounds on many of the
examples we have already considered. So let us revisit some of them.

Equality We start with the equality function, defined in (1.1). The


matrix of the equality function is just the identity matrix. Since
the rows of this matrix are all linearly independent, the rank of
the matrix is 2n , proving that the communication complexity of
equality is at least n bits.

Greater-than Consider the greater than function, defined in (1.3). The


matrix of this function is the upper-triangular matrix which is 1
above the diagonal and 0 on all other points. Once again we see
that rows are linearly independent, and so the matrix has full rank.
This proves that the communication complexity is at least log n.

Disjointness Consider the disjointness function, defined in (1.2). Let


Dn be the boolean matrix that represents disjointness. Let us order
the rows of the matrix in lexicographic order, so that the sets that
contain the element n correspond to the last row and last column.
If we partition the rows into two parts based on whether the row
corresponds to a set that contains n or not, and do the same for the
columns, we get that if the rows and columns come from the part
where n is included in both, then the matrix is 0. However, if n is
included in only the rows, only the columns, or in neither, we get a copy of
the matrix D_{n−1}. So D_n can be expressed as:
$$D_n = \begin{pmatrix} D_{n-1} & D_{n-1} \\ D_{n-1} & 0 \end{pmatrix}.$$
In other words D_n = D_1 ⊗ D_{n−1}, and so rank(D_n) = 2 · rank(D_{n−1})
by Fact 2.6. We conclude that rank(D_n) = 2^n, proving that the
communication complexity of disjointness is at least n.
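The recursion can be checked numerically. The following short snippet (an illustration of ours, not part of the text) builds D_n by repeated Kronecker products and confirms that its rank is 2^n for small n:

```python
import numpy as np

# D_1 is indexed by the subsets {} and {1}; D_n = D_1 ⊗ D_{n-1}.
D1 = np.array([[1, 1],
               [1, 0]])
D = D1
for n in range(2, 6):
    D = np.kron(D1, D)
    print(n, D.shape, np.linalg.matrix_rank(D), 2 ** n)
```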

k-disjointness Consider the disjointness function restricted to sets


of size at most k. In this case, the matrix is an $\sum_{i=0}^{k}\binom{n}{i} \times \sum_{i=0}^{k}\binom{n}{i}$
matrix. Let us write Dn,k to represent the matrix for this problem.
For two sets X, Y ⊆ [n], define the monomial x = ∏i∈X yi , and the
string y ∈ {0, 1}n such that yi = 0 if and only if i ∈ Y. Then we see
that Disj( X, Y ) = x (y). Now any non-zero linear combination of
the rows corresponds to a linear combination of the monomials we
have defined, and so gives a non-zero polynomial f . To show that
the matrix has full rank, we need to prove that there is a set Y that
Alexander Razborov, 1987.
gives rise to an input y with f (y) 6= 0.
To show this, let X be a set that corresponds to a monomial of
maximum degree in f . Let us restrict the values of all variables
outside X to be equal to 1. After doing this, f becomes a non-zero
polynomial that depends only on the variables of X. Since such
polynomials are in one to one correspondence with the boolean
functions on these variables, we get that there must be some

setting of the variables of X giving an assignment y with f (y) = 1.


Moreover, this gives an assignment to y with at most k entries that
are 0.
Inner-product Our final example is the inner product function IP :
{0, 1}n × {0, 1}n → {0, 1}, defined by

IP( x, y) = h x, yi mod 2. (2.1)

The trivial protocol takes n bits, and one can use bounds on the
size of the largest rectangle to show that the communication is at
least Ω(n). Here it will be helpful to use Fact 2.4.   See Exercise ??
If P_n represents the matrix whose entries are (−1)^{IP(x,y)}, then, sorting the rows and columns
lexicographically, we see that
$$P_n = \begin{pmatrix} P_{n-1} & P_{n-1} \\ P_{n-1} & -P_{n-1} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \otimes P_{n-1},$$
and so by Fact 2.6, rank(P_n) = 2 · rank(P_{n−1}). This proves that
rank(P_n) = 2^n, and so the communication complexity of IP is at
least n.
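The same kind of numerical check applies here. The snippet below (ours) builds the ±1 matrix P_n by tensoring with H = [[1, 1], [1, −1]] and confirms that it has full rank:

```python
import numpy as np

H = np.array([[1, 1],
              [1, -1]])
P = H
for n in range(2, 6):
    P = np.kron(H, P)          # P_n = H ⊗ P_{n-1}
    print(n, np.linalg.matrix_rank(P), 2 ** n)
```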

Towards the Log-Rank Conjecture

Lovász and Saks conjectured¹ that Theorem 2.11 is closer to the
truth than Theorem 2.9:
¹ Lovász and Saks, 1988

Conjecture 2.12. There is a constant α such that the communication
complexity of a matrix M is at most log^α rank(M).

The separation between partition number and deterministic com-


munication of Theorem ?? implies that α must be at least 2 for the
conjecture to hold, so we cannot expect the communication complexity
of a matrix to be exactly equal to the logarithm of its rank. Our main goal in this
section is to prove the following theorem2 : 2
Lovett, 2014

Theorem 2.13. If the rank of a matrix is r, its communication complexity is
at most O(√r log² r).
Lovett actually proves that the communication is bounded by O(√r log r), but we prove the weaker bound here for ease of presentation.

The proof of Theorem 2.13 relies³ on a powerful theorem from
convex geometry called John's theorem⁴. We use it to show:
³ Rothvoß, 2014     ⁴ John, 1948

Lemma 2.14. Any m × n boolean matrix of rank r > 1 must have a
monochromatic rectangle of size at least mn · 2^{−20√r log r}.

Let us see how to use Lemma 2.14 to get a protocol. Let R be the
rectangle promised by the lemma. Then, rearranging the rows and
columns, we can write the matrix as:
$$M = \begin{pmatrix} R & A \\ B & C \end{pmatrix}.$$
Now we claim that
$$\mathrm{rank}\begin{pmatrix} R \\ B \end{pmatrix} + \mathrm{rank}\begin{pmatrix} R & A \end{pmatrix} \leq \mathrm{rank}\begin{pmatrix} R & A \\ B & C \end{pmatrix} + 3.$$
Indeed, one can write
$$\begin{pmatrix} R & A \\ B & C \end{pmatrix} = \begin{pmatrix} 0 & A \\ B & C \end{pmatrix} + \begin{pmatrix} R & 0 \\ 0 & 0 \end{pmatrix},$$
$$\begin{pmatrix} R & A \end{pmatrix} = \begin{pmatrix} 0 & A \end{pmatrix} + \begin{pmatrix} R & 0 \end{pmatrix},$$
$$\begin{pmatrix} R \\ B \end{pmatrix} = \begin{pmatrix} 0 \\ B \end{pmatrix} + \begin{pmatrix} R \\ 0 \end{pmatrix}.$$

So by Fact 2.3,
$$\mathrm{rank}\begin{pmatrix} R \\ B \end{pmatrix} + \mathrm{rank}\begin{pmatrix} R & A \end{pmatrix} \leq \mathrm{rank}(A) + \mathrm{rank}(B) + 2 \leq \mathrm{rank}\begin{pmatrix} 0 & A \\ B & C \end{pmatrix} + 2 \leq \mathrm{rank}\begin{pmatrix} R & A \\ B & C \end{pmatrix} + 3. \qquad (2.2)$$

Now suppose $\begin{pmatrix} R \\ B \end{pmatrix}$ has the smaller rank. Then Bob sends the bit
1 if his input is consistent with R and 0 otherwise. If it is consistent,
then if rank(M) > 9, the players have reduced the rank of the matrix to at
most a 2/3 fraction of what it was.
(t + 3)/2 ≤ 2t/3, when t ≥ 9.
If it is not consistent, the players have reduced the size of the matrix by
a factor of 1 − 2^{−20√r log r}.

By Lemma 2.8, we can assume that any matrix of rank r has at
most 2^r rows and columns. The number of 0 transmissions in this
protocol is at most 2r ln 2 · 2^{20√r log r}, since after that many transmissions,
the number of entries in the matrix has been reduced to
$$2^{2r}\left(1 - 2^{-20\sqrt{r}\log r}\right)^{2r\ln 2\cdot 2^{20\sqrt{r}\log r}} < 2^{2r}\, e^{-2^{-20\sqrt{r}\log r}\cdot 2r\ln 2\cdot 2^{20\sqrt{r}\log r}} = 2^{2r} e^{-2r\ln 2} = 1.$$
Fact: 1 − x ≤ e^{−x}, for x ≥ 0.
The number of 1 transmissions is at most O(log_{3/2} r), since after
that many transmissions, the rank of the matrix is reduced to at most 6.
Thus, the number of leaves in this protocol is at most
(2r ln 2 · 2^{20√r log r})^{O(log_{3/2} r)} ≤ 2^{O(√r log² r)}. By Theorem 1.6, we can balance the
protocol tree to obtain a protocol with communication O(√r log² r)
that computes the same function.

Input: Alice knows i, Bob knows j.
Output: M_{i,j}.
while rank(M) > 9 do
    Find a monochromatic rectangle R as promised by Lemma 2.14;
    Write M = (R A; B C);
    if rank(R; B) > rank(R A) then
        if i is consistent with R then both parties replace M with (R A);
        else both parties replace M with (B C);
    else
        if j is consistent with R then both parties replace M with (R; B);
        else both parties replace M with (A; C);
    end
end
The parties exchange at most 9 bits to compute M_{i,j}, using Theorem 2.9;
Figure 2.1: Protocol for Low Rank Matrices with 2^{O(√r log² r)} leaves.

It only remains to prove Lemma 2.14. To prove it, we need to
understand John's theorem. A set K ⊆ R^r is called convex if whenever
x, y ∈ K, then all the points on the line from x to y are also in K. The

set is called symmetric if whenever x ∈ K, then − x ∈ K. An ellipsoid


centered at 0 is a set of the form:
$$E = \left\{ x \in \mathbb{R}^r : \sum_{i=1}^{r} \langle x, u_i \rangle^2 / \alpha_i^2 \leq 1 \right\},$$
where u_1, . . . , u_r are a basis for R^r. John's theorem shows⁸:
⁸ John, 1948

Theorem 2.15 (John's Theorem). Let K ⊆ R^r be a symmetric convex body
such that the unit ball is the most voluminous of all ellipsoids contained in
K. Then every element of K is of length at most √r.

The most voluminous ellipsoid contained in K behaves nicely


when K is changed. Suppose the ellipsoid E above is the largest
ellipsoid in K. Suppose that for some i, we multiply every element
of K by a number e in the direction of ui : namely we consider the
convex body:
$$K' = \left\{ x' : \exists x \in K,\ \langle x', u_j \rangle = \begin{cases} e \cdot \langle x, u_i \rangle & \text{if } j = i, \\ \langle x, u_j \rangle & \text{otherwise.} \end{cases} \right\}$$
Since scaling the space by β in any direction changes the volume of
all objects by exactly β, the largest ellipsoid in K' is the scaled version
of the largest ellipsoid in K:

Fact 2.16. The largest ellipsoid in K' is
$$E' = \left\{ x \in \mathbb{R}^r : \sum_{j=1}^{r} \langle x, u_j \rangle^2 / \beta_j^2 \leq 1 \right\},$$
where β_j = α_j if j ≠ i, and β_i = e · α_i.

Lemma 2.14 is proved in two steps. In the first step, we use


John’s theorem to show that the matrix must contain a large nearly
monochromatic rectangle. In the second step, we will show how
any such rectangle of low rank must contain a large monochromatic
rectangle itself.
Since the matrix has rank r, we know that M can be expressed as
M = AB, where A is an m × r matrix, and B is an r × n matrix. We
start by showing:

Lemma 2.17. Any boolean matrix M of rank r can be expressed as M = AB,



where A is an m × r matrix whose rows are vectors of length at most √r, and
B is an r × n matrix whose columns are vectors of length at most 1.
The "√r" and "1" in the statement of Lemma 2.17 can be replaced by any numbers whose product is √r.
Proof. Start with M = AB for A, B not necessarily satisfying the
length constraints. Let v_1, . . . , v_m be the rows of A, and let w_1, . . . , w_n
be the columns of B. Let K be the convex hull of {±v1 , . . . , ±vm }.

An ellipsoid centered at the origin in the space is specified by a


basis u1 , . . . , ur for the space, and numbers α1 , . . . , αr . The ellipsoid
determined by these parameters is the set
$$E = \left\{ x \in \mathbb{R}^r : \sum_{i=1}^{r} \langle x, u_i \rangle^2 / \alpha_i^2 \leq 1 \right\}.$$

Our first goal is to ensure that the ellipsoid of maximum volume


in K is the unit ball. This is the same as ensuring that α1 = α2 = . . . =
αr = 1. Suppose αi is not 1 for some i. Then we can scale every vector
v_j by a factor of α_i in the direction⁹ of u_i, and scale every vector w_j by a
factor of 1/α_i in the direction of u_i. This preserves the inner products
of all pairs of vectors. By Fact 2.16, repeating this for each coordinate
i ensures that the ellipsoid of maximum volume in K is the unit ball.
⁹ Formally, we write v_j = ∑_{i'=1}^{r} γ_{i'} u_{i'} and w_k = ∑_{i'=1}^{r} β_{i'} u_{i'}, and replace γ_i with α_i γ_i and β_i with β_i/α_i. This preserves the inner product ⟨v_j, w_k⟩.

Now, by John's theorem, every vector v_i must have length at most √r,
since every vector in K has length at most √r.
It only remains to argue that vectors w1 , . . . , wn are of length at
most 1. This is where we use the fact that the matrix is boolean.
Consider any wi , and the unit vector in the same direction: ei =
wi /kwi k. The length of wi can be expressed as hwi , ei i, but since ei is
in the unit ball, and so is contained in K, ei = ∑ j µ j v j + ∑ j κ j (−v j ) is a
convex combination of the v j ’s. Thus



$$\langle w_i, e_i \rangle = \sum_j \mu_j \langle w_i, v_j \rangle + \sum_j \kappa_j \langle w_i, -v_j \rangle \leq \sum_j \mu_j + \sum_j \kappa_j = 1,$$

where the inequality follows from the fact that M is boolean.

For the rest of the proof, we assume that M has at least mn/2 0’s.
We can do this, because if M has more 1’s than 0’s, we can replace M
with J − M, where J is the all 1’s matrix. This can increase the rank by
at most 1, but now the role of 0’s and 1’s has been reversed.
Lemma 2.17 says something about the angles between the vectors
we have found. Define $\theta_{i,j} = \arccos\left(\frac{\langle v_i, w_j\rangle}{\|v_i\|\,\|w_j\|}\right)$. Then observe that
when v_i, w_j are orthogonal, the angle is π/2. But when the inner
product is 1, the angle is at most $\arccos\left(\frac{1}{\sqrt{r}}\right) \leq \frac{\pi}{2} - \frac{2\pi}{7\sqrt{r}}$.
So we get:
$$\theta_{i,j} \begin{cases} = \frac{\pi}{2} & \text{if } M_{i,j} = 0, \\[4pt] \leq \frac{\pi}{2} - \frac{2\pi}{7\sqrt{r}} & \text{if } M_{i,j} = 1. \end{cases}$$

Consider the following random experiment. Sample t vectors of


length 1 uniformly at random z1 , . . . , zt , and define the rectangle R
by:


$$R = \{(i, j) : \forall k,\ \langle v_i, z_k \rangle > 0 \text{ and } \langle w_j, z_k \rangle < 0\}.$$

Figure 2.2: arccos(α) ≤ π/2 − 2πα/7 for 0 ≤ α ≤ 1.
Figure 2.3: The region where all z_k's must fall to ensure that (i, j) ∈ R, when M_{i,j} = 0.
For a fixed (i, j) and k, the probability that ⟨v_i, z_k⟩ > 0 and
⟨w_j, z_k⟩ < 0 is exactly $\frac{1}{4} - \frac{\pi/2 - \theta_{i,j}}{2\pi}$. So we get
$$\Pr_R[(i, j) \in R] \begin{cases} = \left(\tfrac{1}{4}\right)^t & \text{if } M_{i,j} = 0, \\[4pt] \leq \left(\tfrac{1}{4} - \tfrac{1}{7\sqrt{r}}\right)^t & \text{if } M_{i,j} = 1. \end{cases}$$
Figure 2.4: The region where all z_k's must fall to ensure that (i, j) ∈ R, when M_{i,j} = 1.

Let R_1 denote the number of 1's in R and R_0 denote the number of
0's. Set t = 7√r log r. By what we have just argued,
$$E[R_0] \geq \frac{mn/2}{4^{7\sqrt{r}\log r}} = \frac{mn}{2}\cdot 2^{-14\sqrt{r}\log r},$$
$$E[R_1] \leq \frac{mn/2}{4^{7\sqrt{r}\log r}}\left(1 - \frac{4}{7\sqrt{r}}\right)^{7\sqrt{r}\log r} \leq \frac{mn}{2}\cdot 2^{-14\sqrt{r}\log r}\cdot e^{-\frac{4}{7\sqrt{r}}\cdot 7\sqrt{r}\log r} = \frac{mn}{2}\cdot 2^{-14\sqrt{r}\log r}\cdot r^{-4\log e}.$$
Fact: 1 − x ≤ e^{−x} for x ≥ 0.
Now let Q = R_0 − r^4 R_1. By linearity of expectation, we have
$$E[Q] \geq \frac{mn}{2}\cdot 2^{-14\sqrt{r}\log r}\cdot(1 - 1/r) \geq mn\cdot 2^{-16\sqrt{r}\log r}, \qquad \text{since } r > 1.$$

Thus there must be some rectangle R realizing this value of Q. Only
a 1/r³ fraction of such a rectangle can correspond to 1 entries of the
matrix. We have shown:

Claim 2.18. If at least half of the matrix is 0's, then there is a submatrix T of
size at least mn · 2^{−16√r log r} such that the fraction of 1's in T is at most 1/r³.

Call a row of T good if it contains at most a 2/r³ fraction of 1's. At
least half the rows of T must be good, or else T would have more
than a 1/r³ fraction of 1's overall. Let T' be the submatrix obtained by
restricting T to the good rows. Since rank(T') ≤ r, there are r' ≤ r rows
A_1, . . . , A_{r'} that span all the rows of T'. Since each row A_i can have
only a 2/r³ fraction of 1's, at most a 2/r² ≤ 1/2 fraction of the columns
can contain a 1 in one of these r' rows.
Let T'' be the matrix obtained by restricting T' to the columns that
do not have a 1 in any of the rows A_1, . . . , A_{r'}. Since every row of T''
must be a linear combination of rows that only have 0's in them, T'' is
identically 0, and we have found a monochromatic rectangle of size at
least mn · 2^{−16√r log r}/4 ≥ mn · 2^{−18√r log r}. This concludes the proof of
Lemma 2.14.

Open Problem 2.19. It would be very nice to find a more direct geometric
argument to prove Lemma 2.14.

Figure 2.5: Going from a nearly monochromatic rectangle to a monochromatic rectangle (an example 0/1 matrix with the submatrices T, T', the spanning rows A_1, . . . , A_{r'}, and T'' marked).

Non-negative Rank and Covers

Another way to measure the complexity of a matrix is by measuring
its non-negative rank. The non-negative rank of an m × n boolean
matrix M is the smallest number r such that M = AB, where A, B are
matrices with non-negative entries, such that A is an m × r matrix
and B is an r × n matrix. Equivalently, it is the smallest number of
non-negative rank 1 matrices that sum to M. Clearly, we have

Fact 2.20. rank( M) ≤ rank+ ( M).

In general, rank( M ) and rank+ ( M ) may be far apart. For example,


given a set of numbers X = { x1 , . . . , xn } of size n, consider the
n × n matrix where M_{i,j} = (x_i − x_j)² = x_i² + x_j² − 2x_i x_j. Since M is
the sum of three rank 1 matrices, rank(M) ≤ 3. On the other hand,
we can show by induction on n that rank_+(M) ≥ log n. Indeed, if
rank_+(M) = k, then there must be non-negative rank 1 matrices
R_1, . . . , R_k such that M = R_1 + . . . + R_k. Let the support of R_1 be
the rectangle A × B. Then we must have that either |A| ≤ |X|/2 or
|B| ≤ |X|/2, or else there will be an element x ∈ A ∩ B, but M_{x,x} = 0.
Suppose |A| ≤ |X|/2, and let M' be the submatrix that corresponds
to the numbers of X \ A. Since R_1 vanishes on these rows and columns,
rank_+(M') ≤ k − 1, while by induction rank_+(M') ≥ log(n/2) =
log n − 1, and so k − 1 ≥ log n − 1, proving that k ≥ log n.
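For a quick sanity check of the first half of this example, the snippet below (ours) verifies numerically that the matrix M_{i,j} = (x_i − x_j)² has rank 3 no matter how many distinct numbers are used:

```python
import numpy as np

x = np.array([1.0, 2.0, 5.0, 7.0, 11.0, 20.0])   # any distinct numbers
M = (x[:, None] - x[None, :]) ** 2               # M[i, j] = (x_i - x_j)^2
print(np.linalg.matrix_rank(M))                  # prints 3
```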

If M = R1 + . . . + Rr , where R1 , . . . , Rr are rank 1 non-negative ma-


trices, then the support of each matrix Ri must be a monochromatic
rectangle in M with value 1. Thus, we get a 1-cover of the matrix:

Fact 2.21. M always has a 1-cover with rank+ ( M ) rectangles.

Moreover, one can prove that if a matrix has both small rank and a
small cover, then there is a small communication protocol¹⁰:
¹⁰ Lovász, 1990

Theorem 2.22. If M has a 1-cover of size r, then there is a protocol comput-


ing M with O(log r · log rank( M )) bits of communication.

Proof. The protocol is similar to the one used to prove Theorem 1.8.
For every rectangle R in the cover, we can write
$$M = \begin{pmatrix} R & A \\ B & C \end{pmatrix},$$
and by (2.2), either
$$\mathrm{rank}\begin{pmatrix} R & A \end{pmatrix} \leq (\mathrm{rank}(M) + 3)/2, \qquad (2.3)$$
or
$$\mathrm{rank}\begin{pmatrix} R \\ B \end{pmatrix} \leq (\mathrm{rank}(M) + 3)/2. \qquad (2.4)$$
So in each step of the protocol, if Alice sees an R that is consistent
with her input satisfying (2.3), she announces its name, or if Bob sees
a rectangle R in the cover consistent with his input and satisfying
(2.4), he announces its name. Both parties then restrict their attention
to the appropriate submatrix, which reduces the rank of M by roughly
a factor of 2.
This can continue for at most O(log rank(M)) steps before the rank
of the matrix becomes constant, at which point the value can be computed
with O(1) additional bits using Theorem 2.9. On the other hand, if neither
party finds such an R, then, since every R in the cover satisfies (2.3) or (2.4),
no rectangle of the cover contains their input, so they can safely output a 0.

Putting together Fact 2.21 and Theorem 2.22, we get

Corollary 2.23. The communication of M is at most

O(log(rank+ M) · log rank( M )) ≤ O(log2 rank+ ( M )).

Exercise 2.1
Fix a function f : X × Y → {0, 1} with the property that in
every row and column of the communication matrix M f there are
exactly t ones. Cover the zeros of M f using O(t(log| X | + log|Y |))
monochromatic rectangles.

Exercise 2.2
Show that the Nisan-Wigderson protocol (i.e., the proof of Theorem
2.13) goes through even if we weaken Lemma 2.14 to only guarantee
a rectangle with rank at most r/8 (instead of rank at most one, or
monochromatic).

Exercise 2.3
Recall that for a simple, undirected graph G the chromatic number
χ( G ) is the minimum number of colors needed to color the vertices
of G so that no two adjacent vertices have the same color. Show that
log χ( G ) is at most the deterministic communication complexity of
G’s adjacency matrix.

Exercise 2.4
For any symmetric matrix M ∈ {0, 1}n×n with ones in all diagonal
entries, show that
$$2^c \geq \frac{n^2}{|M|},$$
where c is the deterministic communication complexity of M, and
| M| is the number of ones in M.

Exercise 2.5
For any boolean matrix M, define rank2 ( M) to be the rank of M
over F2 , the field with two elements. Exhibit an explicit family of ma-
trices M ∈ {0, 1}n×n with the property that c ≥ rank2 ( M )/10, where c
is the deterministic communication complexity of M. Conclude that
this falsifies the analogue of log-rank conjecture for rank2 .

Exercise 2.6

Show that if f has a fooling set of size s then rk(M_f) ≥ √s. Hint:
tensor product.
3
Randomized Protocols

Access to randomness is an enabling feature in many compu-


tational processes, and it is useful in communication protocols as
well. We start with some examples of protocols where the use of
randomness gives an advantage that cannot be matched by determin-
istic protocols, before defining randomized protocols formally and
proving some basic facts about them. We do not discuss any lower
bounds on randomized communication in this chapter. The lower
Input: Alice knows x ∈ {0, 1}n ,
bounds are proved in Chapter 5 and Chapter 6. Bob knows y ∈ {0, 1}n .
Output: Whether or not x = y.
Equality Suppose Alice and Bob are each given access to n bit strings
Alice and Bob sample a random
x, y, and want to know if these strings are the same or not (1.1).
function h : {0, 1}n → {0, 1}k ;
We have shown that at least n + 1 bits of communication are Alice sends Bob h( x );
required if the communication is deterministic. Bob announces whether
h ( x ) = h ( y );
However, there is a simple randomized protocol. Alice and Bob
Figure 3.1: Public-coin Protocol for
sample a random function h : {0, 1}n → {0, 1}k and Alice sends equality.
h( x ) to Bob. If x = y, we will have that h( x ) = h(y). On the other
hand, if x 6= y, the probability that h( x ) = h(y) is at most 2−k . Input: Alice knows x ∈ {0, 1}n ,
So the protocol can compute equality with a small probability Bob knows y ∈ {0, 1}n .
Output: Whether or not x = y.
of failure, even if the communication is a constant number of
bits. This protocol may seem dissatisfying, because although Alice and Bob agree on a good
code C : {0, 1}n → {0, 1}m ;
the communication is small, the number of shared random bits Alice picks k coordinates
required is very large (2nk ). However, there is a slightly more i1 , . . . , ik ∈ [m] at random;
Alice sends Bob
complicated protocol that uses very few random bits. ( i 1 , C ( x ) i1 ), . . . , ( i k , C ( x ) i k );
Alice and Bob agree on an error correcting code1 C : {0, 1}n → Bob announces whether this is
equal to
{0, 1}m . It can be shown that a random function is a code with ( i 1 , C ( y ) i1 ), . . . , ( i k , C ( y ) i k );
high probability, but even explicit constructions of good codes
Figure 3.2: Private-coin Protocol for
are known. Given the code, Alice can pick k random coordinates equality.
of the code and send them to Bob. Bob will check whether these 1
This is a function that maps n bits
coordinates are consistent with his input. This takes k log n bits of to m = O(n) bits, such that if x 6= y,
communication, and now the probability of making an error is at then C ( x ) and C (y) differ in Ω(m)
coordinates.
most 2−Ω(k) .
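The public-coin protocol is easy to simulate. The sketch below is ours: the shared random function h is only approximated by hashing together with a shared seed, so the 2^{-k} collision bound is heuristic here, but the one-message structure of the protocol is exactly as described above.

```python
import random

def equality_protocol(x, y, k=20, seed=None):
    """Public-coin equality sketch: Alice sends h(x); Bob checks h(x) == h(y)."""
    seed = random.randrange(2 ** 64) if seed is None else seed   # the public coins
    h = lambda s: hash((seed, s)) % (2 ** k)                     # stand-in for a random h
    return h(x) == h(y)                                          # k bits of communication

print(equality_protocol("0" * 1000, "0" * 1000))        # True
print(equality_protocol("0" * 1000, "0" * 999 + "1"))   # almost surely False
```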

Greater-than Suppose Alice and Bob are given numbers x, y ∈ [n]


and want to know which one is greater (1.3). We have seen that
any deterministic protocol for this problem requires log n + 1 bits
of communication. However, there is a randomized protocol that
The protocol requiring O(log log n)
requires only O(log log n) bits of communication. bits of communication is described in
Here we describe a protocol that requires only O(log log n · Exercise 3.1.
log log log n) communication. The inputs x, y can be encoded by
`-bit binary strings, where ` = log n. To determine whether x ≥ y,
it is enough to find the most significant bits in x, y where x, y are
not the same. We use the randomized protocol for equality, and
binary search, to achieve this. In the first step, Alice and Bob will
use the protocol for equality to exchange k bits that determine
whether the `/2 most significant bits of x and y are the same. If
they are the same, the parties continue with the remaining bits. If
not, the parties discard the second half of their strings. In this way,
after log ` steps, they find the first bit of difference in their inputs.
We need to set k = log log ` for this process to work.
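Here is a small Python sketch (ours) of the binary search just described. The shared random hash functions are simulated by hashing prefixes together with a shared seed, and we do not truncate the hashes to O(log log ℓ) bits, so the accounting is only heuristic in this toy version.

```python
import random

def first_difference(x, y):
    """Binary search for the most significant index where x and y differ."""
    seed = random.randrange(2 ** 64)            # shared randomness
    h = lambda s: hash((seed, s))               # a real protocol would keep O(log log n) bits
    lo, hi = 0, len(x)                          # invariant: x[:lo] == y[:lo] (w.h.p.)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if h(x[:mid]) == h(y[:mid]):            # prefixes (probably) agree
            lo = mid
        else:
            hi = mid
    return lo if x[lo] != y[lo] else None

print(first_difference("10110111", "10100111"))   # 3
```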
Input: Alice knows x ∈ {0, 1}` ,
Bob knows y ∈ {0, 1}` .
k-Disjointness Suppose Alice and Bob are given 2 sets X, Y ⊆ [n] of Output: Largest i such that xi 6= yi ,
size at most k, and want to know if these sets intersect or not. We if such an i exists.

used the rank method to argue that at least log (nk) ≈ k log(n/k ) Let J = [n];
bits of communication are required. Here we give a randomized while | J | > 1 do
Let J 0 be the first | J |/2
protocol2 that requires only O(k) bits of communication3 , which is elements of J;
more efficient when k  n. Both parties use shared
randomness to sample a
Alice and Bob sample a sequence of sets R1 , R2 , . . . ⊆ [n], un- random function
formly at random. They exchange 2 bits to announce whether or 0
h : {0, 1}| J | → {0, 1}2 log log ` ;
not their sets are empty. If neither set is empty, Alice announces Alice sends h evaluated on the
bits in J 0 , h( x J 0 );
the index of the first set Ri that contains her set, and Bob an- Bob announces whether or not
nounces the index of the first set R j that contains his set. Now h ( x J 0 ) = h ( y J 0 );
if h( x J 0 ) = h(y J 0 ) then
Alice can safely replace her set with X ∩ R j , and Bob can replace Alice and Bob replace
his set with Y ∩ Ri . If at any point one of the parties is left with an J = J \ J0;
empty set, they can safely conclude that the inputs were disjoint. else
Alice and Bob replace
We will argue that if the sets are disjoint, this process terminates J = J0;
after O(k) bits of communication. end
Both parties announce x J , y J ;
Assume that X, Y are disjoint. Let us start by analyzing the
end
expected number of bits that will be communicated in the first
Figure 3.3: Public-coin protocol for
step. We claim: greater than.
2
Håstad and Wigderson, 2007
Claim 3.1. E [i ] = 2| X | , E [ j] = 2 |Y | .
3
Later, we show that Ω(k ) bits are
required.
Proof. The probability that the first set of the sequence contains
X is exactly 2−|X | . In the event that it does not contain X, we are
picking the first set that contains X from the rest of the sequence.

Thus: Input: Alice knows X ⊆ [n], Bob


knows Y ⊆ [n].
−| X |
E [i ] = 2 · 1 + (1 − 2−|X | ) · (E [i ] + 1) Output: Whether or not X ∩ Y = ∅

⇒ E [ i ] = 2| X | . while | X | > 1 and |Y | > 1 and at


most 120k + 20 bits have been
The bound on the expected value of j is the same. communicated so far do
Alice and Bob use shared
randomness to sample
Since a number of size i can be communicated with at most
random subsets
2 log i bits, the number of bits communicated to transmit i, is at R1 , R2 , . . . ⊆ [ n ];
most4 Alice sends Bob the smallest i
such that X ⊆ Ri ;
E [2 log i ] ≤ 2 log E [i ] = 2| X | Bob sends Alice the smallest j
such that Y ⊆ R j ;
E [2 log j] ≤ 2 log E [ j] = 2|Y | (3.1) Alice replaces X = X ∩ R j ;
Bob replaces Y = Y ∩ Ri ;
Next we argue that when X ∩ Y = ∅, the above communication end
process must terminate quickly. if X = ∅ or Y = ∅ then
Alice and Bob conclude that
Claim 3.2. If X ∩ Y = ∅, the expected number of bits communicated by the sets were disjoint;
the protocol is at most 6| X | + 6|Y | + 2. else
Alice and Bob conclude that
Proof. Notice that as the protocol continues, the sets X, Y can the sets were intersecting;
end
only get smaller. So we can prove the bound by induction on
Figure 3.4: Public-coin protocol for
the size of the sets X, Y. For the base, case, if X or Y is empty, at
k-disjointness.
most 2 ≤ 6(| X | + |Y |) + 2 bits are communicated. If both X and
Y are non-empty, (3.1) shows that the expected number of bits
4
since log is concave

communicated in the first step is 2 + 2| X | + 2|Y |. By induction, the


expected number of bits communicated in the rest of the protocol
 
is E 6(| X ∩ R j | + |Y ∩ Ri |) + 2. But observe that since X, Y are
 
assumed to be disjoint, E | X ∩ R j | = | X |/2, and E [|Y ∩ Ri |] =
|Y |/2. Thus the total number of bits communicated is
2 + 2| X | + 2|Y | + (6/2)| X | + (6/2)|Y | + 2
= 6| X | + 6|Y | + 2 − (| X | + |Y | − 2)
≤ 6| X | + 6|Y | + 2,
as required.

Claim 3.2 means that if X, Y are disjoint, the expected number


of bits communicated by the process above is at most 6| X | + 6|Y | + 2. By Markov’s
inequality, the probability that the protocol communicates more
than 10 · (6| X | + 6|Y | + 2) bits is at most 1/10. Thus if we run
this process until 120k + 20 bits have been communicated, the
probability of making an error is at most 1/10.
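The whole protocol is short enough to simulate. The following sketch is ours; the function name and the exact accounting of bits (roughly 2 log i + 2 log j + 2 per round) are illustrative, but the logic follows the protocol described above and in Figure 3.4.

```python
import random

def hw_disjointness(X, Y, n, bit_budget=None):
    """Simulate the public-coin k-disjointness protocol on subsets of [n]."""
    k = max(len(X), len(Y), 1)
    budget = bit_budget if bit_budget is not None else 120 * k + 20
    X, Y = set(X), set(Y)
    bits = 0
    while X and Y:
        def first_index(S):
            # index of the first shared random subset of [n] that contains S
            idx = 1
            while True:
                R = {u for u in range(n) if random.random() < 0.5}
                if S <= R:
                    return idx, R
                idx += 1
        i, Ri = first_index(X)
        j, Rj = first_index(Y)
        bits += 2 + 2 * i.bit_length() + 2 * j.bit_length()
        if bits > budget:
            return "intersecting (budget exceeded)"
        X, Y = X & Rj, Y & Ri       # intersect with the other party's witness
    return "disjoint"

random.seed(0)
print(hw_disjointness({1, 2, 3}, {4, 5, 6}, n=100))   # typically: disjoint
print(hw_disjointness({1, 2, 3}, {3, 7, 8}, n=100))   # typically: intersecting
```

A common element can never be removed (it lies in every witness set that contains the other party's set), so intersecting inputs always exhaust the budget, while disjoint inputs halve in expected size each round.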

Variants of Randomized Protocols

A randomized protocol is a deterministic protocol where each


party has access to a random string, in addition to the inputs to

the protocol. The random string is sampled independently from


the inputs, but may have an arbitrary distribution. We say that the
protocol uses public coins if all parties have access to a common
shared random string. We say that the protocol uses private coins if
each party samples an independent random string. Every private
We shall soon see a partial converse:
coin protocol can be simulated by a public coin protocol. There are at every public coin protocol can be
least two ways to quantify the errors made by a protocol: simulated with private coins, with a
small increase in the communication
Worst-case We say that a randomized protocol has error e in the
If a randomized protocol never makes
worst-case if the probability that the protocol makes an error is at
an error, we can fix the randomness to
most e on every input. obtain a deterministic protocol that is
always correct.
Average-case Given a distribution on inputs µ, we say that the proto-
col has error e with respect to µ if the probability that the protocol The worst-case error is e if and only if
makes an error is at most e when the inputs are sampled from µ. the error is e under every distribution
on inputs.
When a protocol has error e < 1/2 in the worst case, we can run it
several times and take the majority output to reduce the error. If we If a randomized protocol makes no
errors, we can fix the randomness to
repeat the protocol k times, and output the most frequent output in obtain a deterministic protocol that
all of the runs, there will be an error in the output only if at least k/2 comptes the function.
of the runs computed the wrong answer. By the Chernoff bound, the
2
probability of error is at most 2−Ω(k(1/2−e) ) .
Worst-case and average-case errors are related by via Yao’s mini-
max principle:

Theorem 3.3. The communication complexity of computing a function


g in the worst-case with error at most e is equal to the maximum, over all
distributions µ, of the communication complexity of computing g with error
at most e with respect to µ.

Theorem 3.3 can be proved by appealing to von Neumann’s mini-


5
von Neumann, 1928
max principle5 :

Theorem 3.4. Let M be an m × n matrix. Then The minimax principle can also be seen
as a consequence of linear program-
ming duality.
min max xMy = max min xMy,
x ≥0 y ≥0 y ≥0 x ≥0

where x is a 1 × m row vector with ∑i xi = 1, and y is a n × 1 column


vector with ∑ j y j = 1.

Let us see how to prove Theorem 3.3. One direction is easy: if


there is a protocol that computes g with error e in the worst case,
then the same protocol must compute g with error e in the average
case, no matter what the input distribution is.
Now suppose we know that for every distribution µ, there is
a c-bit protocol that computes g with error e in the average case.
Consider the boolean matrix M where every row corresponds to a de-
terministic communication protocol, and every column corresponds

to an input to the protocol, such that



1 if protocol i computes g correctly on input j,
Mi,j =
0 otherwise.

A distribution on the inputs corresponds to a choice of y ≥ 0 such


that ∑ j y j = 1. Since a randomized protocol can be thought of as a
distribution on deterministic protocols, a randomized protocol corre-
sponds to a choice of x ≥ 0 such that ∑i xi = 1. The probability that
a fixed randomized protocol x makes an error when the inputs come
from the distribution y is exactly xMy. Thus maxy≥0 minx≥0 xMy ≤ e.
Theorem 3.4 implies that minx≥0 maxy≥0 xMy ≤ e as well, which is
exactly what we want to prove. There is a fixed randomized protocol
that has error at most e under every distribution on inputs.
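Theorem 3.4 can be checked on any explicit payoff matrix by solving the two linear programs it refers to. The snippet below is ours: it uses scipy.optimize.linprog on a small arbitrary matrix and shows that the two optimal values coincide up to numerical precision.

```python
import numpy as np
from scipy.optimize import linprog

M = np.array([[1.0, 0.0, 0.3],
              [0.2, 1.0, 0.5]])
m, n = M.shape

# min over x of max over y of xMy: minimize v subject to (M^T x)_j <= v
res_x = linprog(c=np.r_[np.zeros(m), 1.0],
                A_ub=np.c_[M.T, -np.ones(n)], b_ub=np.zeros(n),
                A_eq=[np.r_[np.ones(m), 0.0]], b_eq=[1.0],
                bounds=[(0, None)] * m + [(None, None)])

# max over y of min over x of xMy: maximize w subject to (My)_i >= w
res_y = linprog(c=np.r_[np.zeros(n), -1.0],
                A_ub=np.c_[-M, np.ones(m)], b_ub=np.zeros(m),
                A_eq=[np.r_[np.ones(n), 0.0]], b_eq=[1.0],
                bounds=[(0, None)] * n + [(None, None)])

print(res_x.fun, -res_y.fun)   # the two values agree up to solver tolerance
```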

Public Coins vs Private Coins

While every private coin protocol can be simulated by a public


coin protocol, can every private coin protocol be simulated by a 6
Newman, 1991
public coin protocol? Such a simulation is possible6 , up to a small
additive loss in communication:

Theorem 3.5. If g : {0, 1}n × {0, 1}n → {0, 1} can be computed with c bits
of communication, and error e in the worst case, then it can be computed by
a private coin protocol with c + log(n/e2 ) + O(1) bits of communication, It is known that computing whether or
and error 2e in the worst case. not two n-bit strings are equal requires
Ω(log n) bits of communication if only
private coins are used. This shows that
Proof. We use the probabilistic method to find the required private
Theorem 3.5 is tight.
coin protocol. Let us pick t independent random strings, each of
which can be used as the randomness for the given public-coin
protocol.
For any fixed input, some of these t random strings lead to the
public coin protocol computing the right answer, and some of them
lead to the protocol computing the wrong answer. By the Chernoff
bound, the probability that more than a 2e fraction of the t strings lead to the
wrong answer is at most 2^{−Ω(e²t)}. We set t = O(2n/e²) to be large
enough so that this probability is less than 2^{−2n}. Then by the union
bound, the probability that there is some input on which more than a 2e
fraction of these strings give the wrong answer is less than 1. Thus there
must be some fixed choice of the t strings for which, on every input, at
most a 2e fraction of them give the wrong answer.
The private coin protocol is now simple. Alice samples one of the
t strings and sends its index to Bob, which takes at most log(n/e2 ) +
O(1) bits. Alice and Bob then run the original public coin protocol.

Nearly Monochromatic Rectangles

Monochromatic rectangles proved to be a very useful concept


to understand deterministic protocols. A similar role is played by
nearly monochromatic rectangles when trying to understand random-
ized protocols.

Definition 3.6. Given a distribution µ on inputs, we say that a rectangle R


has bias (1 − e) under a function g if there is a constant b so that

Pr[ g( x, y) = b|( x, y) ∈ R] ≥ 1 − e.
µ

Such a rectangle is called (1 − e)-monochromatic.

We would like to claim that a protocol with small error e induces


a partition of the space into nearly monochromatic rectangles. That
is not quite true, but we can claim that the average rectangle must be
very close to being monochromatic:

Theorem 3.7. If there is a c-bit protocol that computes g with error e under
a distribution µ, then you can partition the inputs into 2c rectangles, such
that the average bias of a random rectangle from the partition is at least 1 − e.

Applying Markov’s inequality to this average gives that there must


be many large nearly monochromatic rectangles:

Theorem 3.8. If there is a c-bit protocol that computes g with error e under
µ, then for every `, there are disjoint (1 − `e)-monochromatic rectangles
R1 , R2 , . . . , R2c such that Prµ [( x, y) ∈ ∪i Ri ] ≥ 1 − 1/`. Theorem 3.8 will be instrumental to
prove lower bounds on randomized
protocols

Proof. Since we can always fix the randomness of the protocol in


the best way, we can assume that the protocol is deterministic. By
Theorem 1.5, we know that the protocol induces a partition of the
space into 2c rectangles. Consider all the rectangles that are not
(1 − `e)-monochromatic. If the probability that the input lands in one
of these rectangles is bigger than 1/`, the error of the protocol will be
bigger than e. Thus the inputs must land in a (1 − `e)-monochromatic
rectangle with probability at least 1 − 1/`.

As a corollary, we get:

Corollary 3.9. If there is a c-bit protocol that computes g with error e under
µ, then for every `, there is a (1 − `e)-monochromatic rectangle of density at
least 2−c (1 − 1/`).

Exercise 3.1
In this exercise we will develop a randomized protocol for greater-
than that requires only O(log log n) bits of communication. Let
x, y ∈ {0, 1}` be two strings. Alice and Bob want to find the smallest i
such that xi 6= yi .

Exercise 3.2
In this exercise, we design a randomized protocol for finding the
first difference between two n-bit strings. Alice and Bob are given n
bit strings x 6= y and want to find the smallest i such that xi 6= yi . In
class we saw how to accomplish this using O(log n log log n) bits of
communication. Here we do it with O(log n) bits of communication.
Define a rooted tree as follows. Every vertex will correspond to an
interval of coordinates from [n]. The root corresponds to the interval
I = [n]. Every internal vertex corresponding to the interval I will
have two children, the left child corresponding to the first half of I
and the right child corresponding to the right half of I. This defines
a tree of depth log n, where the leaves correspond to intervals of
size 1 (i.e. coordinates) of the input. At each leaf, attach a path of
length 3 log n. Every vertex of this path represents the same interval
of size 1. The depth of the tree is now 4 log n.

1. Fill in the details of the following protocol. Prove an upper bound


on the expected number of bits communicated and a lower bound
on the success probability.
The players use their inputs and hashing to start at the root of the
tree and try to navigate to the smallest interval that contains the
index i that they seek. In each step, the players will either move to
a parent or a child of the node that they are at. When the players
are at a vertex that corresponds to the interval I, they should first
exchange O(1) hash bits to confirm that the first difference does
lie in I. If this hash shows that the first difference does not lie in
I, they should move to the parent of the current node. Otherwise,
they exchange O(1) hash bits and use this to decide on which
child of the current node to move to. Once the players reach the
nodes of the tree that correspond to intervals of size 1, they use
their hashes to either move to a parent or child.

2. Argue that as long as the number of nodes where the protocol


made the right choice exceeds the number of nodes where the
players made the wrong choice by log n, the protocol you defined
does succeed in computing i.

3. Use the Chernoff bound to argue that the number of hashes that
gives the right answer is high enough to ensure that the protocol

succeeds with high probability on any input.

Exercise 3.3
Show that if the inputs to greater-than are sampled uniformly
and independently, then there is a protocol that communicates only
O(log(1/e)) bits and has error at most e under this distribution.
4
Numbers On Foreheads

1
Chandra et al., 1983
The number-on-forehead model1 of communication is one way
to generalize the case of two party communication to the multiparty
setting. There are k parties communicating, and the i’th party has an
input drawn from the set Xi written on their forehead. Each party can
When there are only k = 2 parties, this
see all of the inputs except the one that is written on their forehead. model is identical to the model of 2
The fact that each party can see most of the inputs means that parties party communication.
do not need to communicate as much. It also means that proving
lower bounds against this model is particularly hard. Indeed, we
do not yet known how to prove optimal lower bounds in the model
of computation, in stark contrast to models where the inputs are
Moreover, optimal lower bounds in
completely private. this model would have very interesting
We start with some examples of clever number-on-forehead proto- consequences to the study of circuit
complexity.
cols.

Equality We have seen that the best protocol for deterministically


computing equality in the 2 party setting involves one of the
parties revealing their entire input. Suppose 3 parties each have
an n bit string written on their foreheads. Then there is a trivial
protocol for computing whether all three strings are the same:
Alice announces whether or not Bob and Charlie’s strings are
the same, and Bob announces whether or not Alice and Charlie’s
strings are the same.

Intersection size Suppose there are k parties, and the i’th party has a
subset Xi ⊆ [n] on their forehead. The parties want to compute A protocol solving this problem would
the size of the intersection ∩i Xi . We shall describe a protocol2 that compute both the disjointness function
and the inner product function.
requires only O(k4 n/2k ) bits of communication.
We start by describing a protocol that requires only k2 log n bits 2
Grolmusz, 1998; and Babai et al., 2003
k
of communication, as long as n < (k/2 ). It is helpful to think of the
input as a k × n boolean matrix. Each of the parties knows all but In Chapter 5, we prove that at least
n/4k bits of communication are re-
one row of this matrix, and they wish to compute the number of
quired.
all 1’s columns. Let Ci,j denote the number of columns containing

j 1’s that are visible to the i’th party. The parties compute and
announce the values of Ci,j , for each i, j. The communication of the
protocol is at most k2 log n bits. Let A j denote the actual number of
columns with j ones in them.

Claim 4.1. If there are two valid solutions A_k, . . . , A_0 and A'_k, . . . , A'_0
that are both consistent with the values C_{i,j}, then either A'_k = A_k, or for
each j, |A_j − A'_j| ≥ $\binom{k}{j}$.

Proof. Suppose A'_k ≠ A_k. Then |A_k − A'_k| ≥ 1 = $\binom{k}{k}$. We proceed by


induction. Since a column of weight j is observed as having weight
j − 1 by j parties, and having weight j by k − j parties, we have:
$$(k-j)A_j + (j+1)A_{j+1} = \sum_{i=1}^{k} C_{i,j} = (k-j)A'_j + (j+1)A'_{j+1}$$
$$\Rightarrow\ |A_j - A'_j| \geq \frac{j+1}{k-j}\,|A_{j+1} - A'_{j+1}| \geq \frac{j+1}{k-j}\binom{k}{j+1} = \binom{k}{j},$$

as required.
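The counting identity at the heart of this proof is easy to test. The snippet below (ours) generates a random 0/1 matrix and checks that ∑_i C_{i,j} = (k − j)A_j + (j + 1)A_{j+1} for every j:

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 5, 40
Xmat = rng.integers(0, 2, size=(k, n))        # party i has row i on their forehead

col_weight = Xmat.sum(axis=0)
A = np.bincount(col_weight, minlength=k + 2)  # A[j] = number of columns with j ones
C = np.zeros((k, k + 1), dtype=int)
for i in range(k):
    visible = col_weight - Xmat[i]            # weight of each column without row i
    for j in range(k + 1):
        C[i, j] = np.count_nonzero(visible == j)

for j in range(k):
    assert C[:, j].sum() == (k - j) * A[j] + (j + 1) * A[j + 1]
print("identity verified for all j")
```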
k

Claim 4.1 implies that if n < (k/2 ) ≈ 2k / k, then there can
only be one solution for Ak , since | Ak/2 − A0k/2 | cannot exceed
n. To get the final protocol, the parties divide the columns of
k
the matrix into blocks of size at most (k/2 ), and compute Ak for
each such block separately. The total communication is then
k
n·k2 log (k/2 )
k = O(k4 n/2k ).
(k/2 )

Exactly n Suppose 3 parties each have a number from [n] written on


their forehead, and want to know whether these numbers sum to
The randomized communication of
n or not.A trivial protocol is for one of the parties to announce one
exactly n is only a constant, since Alice
of the numbers she sees, which takes log n bits. Here we use ideas can use the randomized protocol for
p
of Behrend3 to show that one can do it with just O( log n) bits of equality to check that her forehead
is equal to what it needs to be for all
communication. Behrend’s ideas lead to a coloring of the integers numbers to sum to n.
that avoids monochromatic three-term arithmetic progressions:
√ 3
Behrend, 1946
Theorem 4.2. One can color the set [n] with 2O( log n) colors, such
that for any a, b ∈ [n], if the numbers a, a + b, a + 2b are all in [n], they
cannot have the same color.

Proof. For parameters d, r, with dr = n, we write each number


−1
x ∈ [n] in base d, using at most r digits: we express x = ∑ri= i
0 xi d ,
where xi ∈ [d − 1]. One can interpret x ∈ [n] as a vector v( x ) ∈ Rr
whose i’th coordinate is the i’th digit of x. We approximate each
of these vectors using a vector where every coordinate is off by at
most d/4: let w( x ) be the vector where the i’th coordinate is the

largest number of the form jd/4 such that jd/4 ≤ xi and j is an


integer.
Color each number x ∈ [n] by the vector w( x ) and the integer
kv( x )k2 = ∑ri=0 xi2 . The number of choices for w( x ) is at most 2O(r) ,
and the number of possible values for kv( x )k2 is at most O(rd2 ),
so the total number O(r +log d) . Setting
p √ of possible colors is at most 2
r = log n, d = 2 log n gives the required bound.
It only remains to check that the coloring avoids arithmetic
progressions. Suppose a, b ∈ [n] are such that a, a + b, a + 2b all
get the same color. Then we must have kv( a)k = kv( a + b)k =
kv( a + 2b)k, so the three vectors v( a), v( a + b), v( a + 2b) all lie on
the surface of a sphere. We will get a contradiction by proving that
v( a)+v( a+2b)
v( a + b) = 2 , so these three vectors are collinear, and no
three vectors on the sphere can be collinear.
Let W ( x ) = ∑ri=0 w( x )i di . Note that by assumption, W ( a) =
W ( a + b) = W ( a + 2b). Then we have

a + 2b + a = 2( a + b)
⇒ a + 2b − W ( a + 2b) + a − W ( a) = 2( a + b − W ( a + b)).

Now observe that the base d representation of x − W ( x ) is


exactly v( x ) − w( x ), and in this vector, every coordinate is at most
d/4. This means that the base d representation of a + 2b − W ( a +
2b) + a − W ( a) is exactly v( a + 2b) − w( a + 2b) + v( a) − w( a). Thus,
we conclude

v( a + 2b) − w( a + 2b) + v( a) − w( a) = 2(v( a + b) − w( a + b))


⇒ v( a + 2b) + v( a) = 2v( a + b),

as required.

Now we show how the coloring of Theorem 4.2 can be used


to get a protocol for the exactly n problem with communication
p
O( log n). Suppose the three inputs are x, y, z. Alice computes
the number x 0 = n − y − z, and Bob computes y0 = n − x − z.
Define `( x, y) = x + 2y. Then we see that if x + y + z = n, then
x 0 = x and y0 = y. On the other hand, if x + y + z 6= n, then
x − x 0 = y − y0 6= 0, and `( x, y), `( x 0 , y), `( x, y0 ) form an arithmetic
progression. So Alice simply announces the color of `( x 0 , y) in the
coloring promised by Theorem 4.2. Bob and Charlie just check
that this color is the same as the color of `( x, y) and `( x, y0 ). The
p
communication required is O( log n) since there are at most

2O( log n) colors in the coloring.
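The coloring from Theorem 4.2 can be implemented directly and tested by brute force on a small range. In the sketch below (ours; the parameters d and r are chosen only for illustration), no 3-term arithmetic progression receives a single color:

```python
def behrend_color(x, d, r):
    """Color x by its digits rounded down to multiples of d//4, plus the
    squared length of the digit vector, as in the proof of Theorem 4.2."""
    digits = [(x // d ** i) % d for i in range(r)]
    rounded = tuple((v // (d // 4)) * (d // 4) for v in digits)
    return rounded, sum(v * v for v in digits)

d, r = 8, 3
n = d ** r
colors = [behrend_color(x, d, r) for x in range(n)]
bad = [(a, b) for a in range(n) for b in range(1, (n - a - 1) // 2 + 1)
       if colors[a] == colors[a + b] == colors[a + 2 * b]]
print("monochromatic 3-APs found:", len(bad))   # 0
```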

Cylinder Intersections

The basic building blocks of protocols in the number-on-forehead


model are cylinder intersections. They play the same role that rect-
angles play in the case that the number of parties is 2. Any set
S ⊆ X1 × · · · × Xk can be described using its characteristic func-
tion:

1 if ( x , . . . , x ) ∈ S,
1 k
χ S ( x1 , . . . , x k ) =
0 otherwise.

We can then define cylinder intersections as:

Definition 4.3. S ⊆ X1 × · · · × Xk is called a cylinder if χS does not


depend on one of its inputs. S is called a cylinder intersection if it can be
expressed as an intersection of cylinders.

Figure 4.1: A cylinder intersection.


Watch an animation.

Figure 4.2: Figure 4.1 viewed from


above.

If S is a cylinder intersection, we can always express When k = 2, cylinder intersections are


the same as rectangles. However, when
k
k > 2, they are much more complicated
χ S ( x1 , . . . , x k ) = ∏ χ i ( x1 , . . . , x k ), to understand than rectangles.
i =1

where χi is a boolean function that does not depend on the i’th input.

Figure 4.3: A cylinder intersection.


Watch an animation.

Figure 4.4: Figure 4.3 viewed from


above.

Fact 4.4. The intersection of two cylinder intersections is also a cylinder


intersection.

Just as for rectangles, we say that a cylinder intersection is monochro-


matic with respect to a function g, if g( x ) = g(y) for every two inputs Figure 4.5: Figure 4.3 viewed from the
x, y in the cylinder intersection. In analogy with the 2 party case, we left.
have the following theorem:

Theorem 4.5. If the communication complexity of g : X1 × · · · × Xk →


{0, 1} is c, then X1 × · · · × Xk can be partitioned into at most 2c monochro-
matic cylinder intersections.

Indeed, for every outcome of the protocol m, it is easy to verify


that the set of inputs that are consistent with that outcome form a
cylinder intersection. Moreover, in analogy with Theorem 1.8, one
can show4 a small cover by monochromatic cylinder intersections Figure 4.6: Figure 4.3 viewed from the
leads to a protocol computing the same function. right.
4
See Exercise ??.
Lower bounds from Ramsey Theory

One way to prove lower bounds on protocols in the number-on-


forehead model is by appealing to arguments from Ramsey Theory.
In Chapter 5 we discuss the discrepancy
Let us consider the Exactly n problem: Alice, Bob and Charlie are method, which leads to the strongest
each given a number from [n], written on their foreheads, and want known lower bounds in the number-on-
forehead model.
to know if their numbers sum to n. We have shown that there is a

p
protocol that computes this function using O( log n) bits of commu-
nication. Here we show that Ω(log log log n) bits of communication 5
Chandra et al., 1983
are required5 .
Let c_n be the communication of the exactly n problem. Three points
of [n] × [n] form a corner if they are of the form ( x, y), ( x + d, y), ( x, y +
d). A coloring of [n] × [n] with C colors is a function g : [n] × [n] → [C].
We say that the coloring avoids monochromatic corners if there is
no corner with g( x, y) = g( x + d, y) = g( x, y + d). Let Cn be the
minimum number of colors required to avoid monochromatic corners
in any coloring of [n] × [n]. We claim that Cn essentially captures the
value of cn :

Claim 4.6. cn ≤ 2 + log Cn , and cn ≥ log Cn/3 .

Proof. For the first inequality, suppose there is a coloring with C


colors that avoids monochromatic corners. As in the protocol we saw,
Alice can compute x 0 = n − y − z, and Bob can compute y0 = n − x − z.
Alice will then announce the color of ( x 0 , y), and Bob and Charlie
will say whether this color is consistent with their inputs. These three
points form a corner, since x' − x = n − x − y − z = y' − y. So if they
all have the same color, they must all be the same point.
To prove the second inequality, suppose there is a protocol com- Figure 4.7: A monochromatic corner.

puting the exactly n/2 problem with c bits of communication. Then


by Theorem 4.5, every input can be colored by one of 2c colors that is
the name of the corresponding cylinder intersection. This induces a
coloring of [n/3] × [n/3]: color ( x, y) by the name of the cylinder inter-
section containing the point ( x, y, n − x − y). We claim that this color-
ing avoid monochromatic corners. Indeed, if ( x, y), ( x + d, y), ( x, y + d)
is a monochromatic corner, then ( x, y, n − x − y), ( x + d, y, n − x −
y − d), ( x, y + d, n − x − y − d) must all belong to the same cylinder
intersection. But then ( x, y, n − x − y − d) must also be in the same
cylinder intersection, since it agrees with each of the three points
in two coordinates. That contradicts the correctness of the proto- 6
Graham, 1980; and Graham et al., 1980
col, since the sum of the points ( x, y, n − x − y) is n, and the sum of
( x, y, n − x − y − d) is n − d.

Next we prove6 :
 
log log n
Theorem 4.7. Cn ≥ Ω log log log n .

Proof. The proof will proceed by induction on the number of colors,


but using a stronger structure than monochromatic corners.
A rainbow-corner with r colors and center ( x, y) is specified by a
set of r colors, and numbers d1 , . . . , dr−1 , such that ( x + di , y) and
( x, y + di ) are both colored using the i’th color, and ( x, y) is colored by
the r’th color. Figure 4.8: A rainbow-corner.

2r
We shall prove by induction that as long as C > 3, if n ≥ 2C ,
then any coloring of [n] × [n] with C colors must contain either a
monochromatic corner, or a rainbow corner with r colors. When
2( C +1)
r = C + 1, this means that if n ≥ 2C , [n] × [n] must
 contain a
log log n 2(Cn + 1) log Cn ≥ log log n,
monochromatic corner, proving that Cn ≥ Ω log log log n .
which cannot happen if Cn =
For the base case, when r = 2, n = 4, two of the points of the type o (log log n/ log log log n).
( x, n − x ) must have the same color. If ( x, n − x ) and ( x 0 , n − x 0 ) have
the same color, with x > x 0 , then ( x 0 , n − x ), ( x, n − x ), ( x 0 , n − x 0 ) are
either a monochromatic corner, or a rainbow corner with 2 colors.
2r 2r 2(r −1)
For the inductive step, if n = 2C , n contains m = 2C −C
consecutive disjoint intervals: [n] = I1 ∪ I2 ∪ . . . ∪ Im , each of size
2(r −1)
exactly 2C . By induction, each of the sets Ij × Ij must have either
a monochromatic corner, or a rainbow-corner with r − 1 colors. If
one of them has a monochromatic corner, we are done, so suppose
they all have rainbow-corners with r − 1 colors. Since a rainbow
corner is specified by choosing the center, choosing the colors and
2(r −1) 2
choosing the offsets for each color, there are at most (2C ) · 2C ·
C 2(r −1) C
(2 ) rainbow-corners in each interval. This number is at most
2C 2(r −1) +C +C2r −1 2r 2(r −1)
2 < 2C − C = m, so there must be j < j0 that Figure 4.9: A rainbow-corner induced
by two smaller rainbow corners.
have exactly the same rainbow corner with the same coloring. Then
we see (Figure 4.9) that these two rainbow corners must induce a
monochromatic corner centered in the box Ij × Ij0 , or a rainbow corner
with r colors.

Exercise 4.1
Define the generalized inner product function GIP as follows. Here
each of the k players is given a binary string xi ∈ {0, 1}n . They want
to compute GIP( x ) = ∑nj=1 ∏ik=1 xi,j (mod 2).
Each vector xi can be interpreted as
This exercise outlines a number-on-forehead GIP protocol using a subset of [n]. Our set intersection
O(n/2k + k) bits. It will be convenient to think about the input X as a protocol computes GIP with O(k4 n/2k )
bits. This improved protocol, by A.
k × n matrix with rows corresponding to x1 , . . . , xk .
Chattopadhyay, slightly improves a
famous protocol of V. Grolmusz.
• Fix z ∈ {0, 1}n . Assume the first t coordinates of z are ones and
the rest are zeros. For ` ∈ {0, 1, . . . , k − 1} define c` as the number
of columns in X with ` ones, followed by either a one or zero,
followed by k − ` − 1 zeros. Note that GIP( x ) = ck (mod 2). Find
a protocol to compute GIP( x ) using O(k ) bits assuming the players
know ct (mod 2).
One can extend this protocol to com-
• Exhibit an overall protocol for GIP by showing that the players can pute any function of the number of all
agree upon a vector z and communicate to determine ct (mod 2) ones rows using O(n/2k + k log n) bits.
using O(n/2k + k) bits.

Exercise 4.2

Given a function g : X × Y → {0, 1}, recall that we define gr to


be the function that computes r copies of g. This exercise explores, in
the number-on-forehead model, what we can say about the commu-
nication required by gr , knowing the communication complexity of
g. The approach taken in the proof of Theorem 1.28 does not work
because cylinder intersections do not tensorize nicely like rectangles
do. Fortunately, we can appeal to a powerful result from Ramsey the- 7
Hales and Jewett, 1963
ory called the Hales-Jewett Theorem7 to prove that the communication
complexity of gr must increase as r increases.
For an arbitrary set S, the Hales-Jewett Theorem gives insight
into the structure of the cartesian product Sn = S × S × · · · S as
n grows large. For the precise statement we need the notion of a
combinatorial line. The combinatorial line specified by a nonempty
set of indices I ⊆ [n] and a vector v ∈ Sn is the set { x ∈ Sn : xi =
vi , if i ∈
/ I, and for every i, j ∈ I, xi = x j }. For example, when S = [3]
and n = 4 then the set {1132, 2232, 3332} is a combinatorial line with
I = {1, 2} and v3 = 3, v4 = 2.
Given a set S and a number t, the Hales-Jewett theorem says that
as long as n is large enough, then any coloring of Sn with t colors
must contain a monochromatic combinatorial line.

• Set S = g−1 (0), so S ⊆ X × Y , a subset of the domain of g. Use the


Hales-Jewett theorem to argue that there exists an n large enough
such that the following holds. Any protocol correctly computing
∧in=1 g( xi ) induces a coloring of the domain of gr that can be used
to get a single monochromatic cylinder intersection containing all
the points of S.

• Assume that the communication complexity of g is strictly greater


than the number of players (that is, c > k). Define cn to be commu-
nication required to compute ∧in=1 g( xi ). Prove that

lim cn = ∞.
n→∞

Exercise 4.3
A three player NOF puzzle demonstrates that unexpected effi-
ciency is sometimes possible.
Inputs: Alice has a number i ∈ [n] on her forehead, Bob has a
number j ∈ [n] on his forehead, and Charlie has a string x ∈ {0, 1}n
on his forehead.
Output: On input (i, j, x ) the goal is for Charlie to output the bit xk
where k = i + j (mod n).
Question: Find a deterministic protocol such that Bob sends one bit
to Charlie, and Alice sends b n2 c bits to Charlie. Alice and Bob must

each send Charlie their message simultaneously; then Charlie should


be able to output the correct answer.

Exercise 4.4
Show that any degree d polynomial over F2 over the variables
x1 , . . . , xn can be computed by d + 1 players with O(d) bits of Number-
On-Forehead communication, for any partition of the inputs where
each party has n/(d + 1) bits on their forehead. (You may assume
d + 1 divides n exactly).
5
Discrepancy

The discrepancy method is a powerful way to prove lower bounds


on communication complexity. We will use it here to prove optimal
lower bounds on randomized protocols, and tight lower bounds in
the number-on-forehead model.
One reason why our previous approaches were insufficient to
prove lower bounds on randomized protocols is that the existence of
a randomized protocol only guarantees a partition of the space into
nearly monochromatic rectangles, rather than completely monochro-
matic rectangles. In order to work with nearly monochromatic rectan-
gles, we need to work with a quantity that is sensitive to the bias of a
rectangle. Let g be a boolean function, and let χS be the characteristic
function of the set S. Then we define the discrepancy of S with respect
to g to be
$$\left|\,\mathop{E}\left[\chi_S(x)\cdot(-1)^{g(x)}\right]\right|,$$

where the expectation is taken over a random input x. A large, nearly


monochromatic rectangle (or cylinder intersection) must have high
discrepancy:

Fact 5.1. If R is a (1 − e)-monochromatic rectangle (or cylinder intersection)


of density δ, then the discrepancy of R must be at least (1 − 2e)δ.

Proof. Only points inside R contribute to its discrepancy. Since (1 − e)


fraction of these points have the same value under g, the discrepancy
is at least δ(1 − e − e) = δ(1 − 2e).

Let π ( x, y) denote the output of a protocol π with c bits of commu-


nication and error e. Let R1 , . . . , Rt be the rectangles induced by the

protocol. Then we have


h i
$$1 - 2e \leq \mathop{E}_{x,y}\left[(-1)^{\pi(x,y)+g(x,y)}\right] = \mathop{E}_{x,y}\left[(-1)^{\pi(x,y)}\cdot(-1)^{g(x,y)}\right] = \mathop{E}_{x,y}\left[\left(\sum_{i=1}^{t}\chi_{R_i}(x,y)\cdot o(R_i)\right)\cdot(-1)^{g(x,y)}\right],$$
where here o(R_i) is −1 if the protocol outputs 1 in R_i, and it is 1 if
the protocol outputs 0. We can continue to bound:
$$1 - 2e \leq \sum_{i=1}^{t}\left|\mathop{E}_{x,y}\left[\chi_{R_i}(x,y)\cdot(-1)^{g(x,y)}\right]\right| \leq 2^c\cdot\max_{R}\left|\mathop{E}_{x,y}\left[\chi_{R}(x,y)\cdot(-1)^{g(x,y)}\right]\right|,$$
where the maximum is taken over all choices of rectangles. Rearranging, we get
$$2^c \geq \frac{1-2e}{\max_R\left|\mathop{E}_{x,y}\left[\chi_R(x,y)\cdot(-1)^{g(x,y)}\right]\right|}.$$

The same calculation also works in the case of cylinder intersections.


We have shown:

Theorem 5.2. If the maximum discrepancy of every rectangle (or cylinder


intersection) is at most γ, then every protocol with
 error
 e computing the
1−2e
function must have communication at least log γ .

Some Examples Using Convexity in Combinatorics

To bound the discrepancy of communication protocols, we shall use


Jensen’s inequality. Before applying these ideas to bounding the
discrepancy of rectangles and cylinder intersections, we show how to
use them to prove some interesting results in combinatorics.
While there are dense graphs that have no 3-cycles (for example
the complete bipartite graph), there are no dense graphs that avoid
4-cycles:

Lemma 5.3. Every n-vertex graph with e(n2 ) edges has at least (en − 1)4 /4
4-cycles.

Proof. Let 1x,y be 1 when there is an edge between the vertices x and Figure 5.1: A dense graph with no
y, and 0 otherwise. Then if x, x 0 , y, y0 are chosen uniformly at random, 3-cycles.
discrepancy 65

we can count the number of 4-cycles by computing:


h i
E 1x,y · 1x0 ,y · 1x,y0 · 1x0 ,y0
 h i2 
= E E 1x,y · 1x0 ,y
x,x 0 y
 h i 2
≥ E E 1x,y · 1x0 ,y
x,x 0 y
 2
 2
= E E 1x,y
y x
 4
≥ E 1x,y .
x,y

This last quantity is at least (e − 1/n)4 , since we are picking a random


edge as long as x and y are distinct. This gives (en − 1)4 /4 cycles,
since each cycle is counted 4 times.

We can use similar ideas to prove that every dense bipartite graph
must contain a reasonably large bipartite clique. Next we show a
slightly different way to prove this:

Lemma 5.4. If G is a bipartite graph of edge density e, and bipartition


A, B, with | B| = n, then there exists subsets Q ⊆ A, R ⊆ B with
log n √
| Q| ≥ 2 log(e/e) , | R| ≥ n, such that every pair of vertices q ∈ Q, r ∈ R is
connected by an edge.
log n
Proof. Pick a random subset Q ⊆ A of size 2 log(e/e) , and let R be all
the common neighbors of Q. Given any vertex b ∈ B that has degree
d, the probability that b is included in R is exactly
d
( log n )   log n
2 log(e/e) d 2 log(e/e)
n ≥ . Fact: ( nk )k ≤ (nk) ≤ ( en k
k ) .
( log n ) en
2 log(e/e)

So if di is the degree of the i’th vertex, the expected size of the set R
is at least
! log n
n   log n  e  log n
di 2 log(e/e) 1 n di 2 log(e/e) √
∑ en ∑
2 log(e/e)
≥ n · ≥ n · = n. By convexity.
i =1
n i=1 en e

So there must be some choice of Q, R that proves the Lemma.

Lower bounds for Inner-Product

Say Alice and Bob are given x, y ∈ {0, 1}n and want to compute
h x, yi mod 2. We have seen that this requires n + 1 bits of communi-
cation using a deterministic protocol. Here we show that it requires
≈ n/2 bits of communication even using a randomized protocol.
66 communication complexity

Lemma 5.5. For any rectangle R, the discrepancy of R with respect to the
inner product is at most 2−n/2 .

Proof. Since R is a rectangle, we can write its characteristic function


as the product of two functions A and B. Thus we can write:
h i2 h i2
x,y x,y
E χ R ( x, y) · (−1)h i = E A( x ) · B(y) · (−1)h i
x,y x,y
 h i 2
h x,yi
= E A( x ) E B(y) · (−1)
x y
 h i2 
2 h x,yi
≤ E A( x ) E B(y) · (−1) ,
x y
 
where the inequality follows from the fact that E [ Z ]2 ≤ E Z2 for
any real valued random variable Z. Now we can drop A( x ) from this
expression to get:
h i2  h i2 
x,y x,y
E χ R ( x, y) · (−1)h i ≤ E E B(y) · (−1)h i
x,y x y
h 0
i
= E B(y) B(y0 ) · (−1)hx,yi+hx,y i
x,y,y0
h 0
i
= E B(y) B(y0 ) · (−1)hx,y+y i
x,y,y0

In this way, we have completely eliminated the set A! Moreover, we


can eliminate the set B too and write:
h i2 h i
x,y 0 x,y+y0 i
E χ R ( x, y) · (−1)h i ≤ E B(y) B(y ) · (−1)h
x,y x,y,y0
 h
i 
h x,y+y0 i
≤ E E (−1) (5.1)
y,y0 x

Now, whenever y + y0 is not 0 modulo 2, the expectation is 0. On the


other hand, the probability that y + y0 is 0 modulo 2 is exactly 2−n . So
we can bound (5.1) by 2−n .

Lemma 5.5 and Theorem 5.2 together imply:

Theorem 5.6. Any 2-party protocol that computes the inner-product with
error at most e over the uniform distribution must have communication at
least n/2 − log(1/(1 − 2e)).

Similar ideas can be used to show that the communication com-


1
Babai et al., 1989
plexity of the generalized inner product must be large in the number-on-
forehead model1 . Here each of the k players is given a binary string Each vector xi can be interpretted as
xi ∈ {0, 1}n . They want to compute GIP( x ) = ∑nj=1 ∏ik=1 xi,j mod 2 a subset of [n]. Then our protocol for
We can show: computing the set intersection size
gives a protocol for computing the
inner product with communication
Lemma 5.7. For any cylinder intersection S, the discrepancy of S with
k −1 O(k4 n/2k ).
respect to the inner product is at most e−n/4 .
discrepancy 67

Proof. Since S is a cylinder intersection, its characteristic function


can be expressed as the product of k boolean functions χS = ∏ik=1 χi ,
where χi does not depend on the i’th input. Thus we can write:
" #2
h i2 k
GIP( x )
E χS ( x ) · (−1)
x
=E
x
∏ χi (x) · (−1)GIP(x)
i =1
" " ##2
k −1
= E
x1 ,...,xk−1
χk ( x ) E
xk
∏ χi (x) · (−1)GIP(x)
i =1
 " #2 
k −1
≤ E χk ( x )2 E ∏ χi ( x ) · (−1)GIP( x)  ,
x1 ,...,xk−1 xk
i =1

 
where the inequality follows from the fact that E [ Z ]2 ≤ E Z2 for
any real valued random variable Z. Now we can drop χk ( x ) from
this expression to get:
h i2
GIP( x )
E χS ( x ) · (−1)
x
 " #2 
k −1
GIP( x ) 
≤ E E ∏ χi ( x ) · (−1)
x1 ,...,xk−1 xk
i =1
" #
k −1 −1
∑nj=1 ( xk + xk0 ) ∏ik= 1 xi,j
= E
x1 ,...,xk ,xk0
∏ χi (x)χi (x ) · (−1) 0
i =1

In this way, we have completely eliminated the function χk ! Repeat-


ing this trick k − 1 times gives the bound
h i 2k −1    
GIP( x )

∑nj=1 x1 ∏ik=2 ( xi + xi0 )
1 1
E S
x
χ ( x ) · (− ) ≤ E E (− )
x1 .
x2 ,x20 ,...,xk ,xk0

Now, whenever ∏ik=−21 ( xi,j + xi0 ,j ) is not 0 modulo 2, at any coordinate


j, the expectation is 0. On the other hand, the probability that this
expression is 0 modulo 2 is exactly (1 − 2−k+1 )n . So we get
h i 2k −1 k −1
E χS ( x ) · (−1)
GIP( x )
≤ (1 − 2−k+1 )n < e−n/2 . Fact: 1 − x < e− x for x > 0.
x

This proves that


h i k −1
GIP( x )
E χS ( x ) · (−1) < e−n/4 .
x

By Lemma 5.7 and Theorem 5.2:

Theorem 5.8. Any randomized protocol for computing the generalized inner
product in the number-on-forehead model with error e requires n/4k−1 −
log(1/(1 − 2e)) bits of communication.
68 communication complexity

Lower bounds for Disjointness in the Number-on-Forehead model

At first it may seem that the discrepancy method is not very useful
for proving lower bounds against functions like disjointness, which
do have large monochromatic rectangles.
Suppose Alice and Bob are given two sets X, Y ⊆ [n] and want to
compute disjointness. If we use a distribution on inputs that gives
intersecting sets with probability at most e, then there is a trivial
protocol with error at most e. On the other hand, if the probability of
intersection is at least e, then then there must be some fixed coordi-
nate i such that an intersection occurs in coordinate i with probability
at least e/n. Setting R = {( X, Y ) : i ∈ X, i ∈ Y }, we get
h i

E χ R ( X, Y ) · (−1)Disj(X,Y ) ≥ e/n,

so we cannot hope to prove a lower bound better than Ω(log n) this


way. Nevertheless, we show that one can use discrepancy to give
a lower bound on the communication complexity of disjointness 2 , 2
Sherstov, 2012; and Rao and Yehuday-
off, 2015
even when the protocol is allowed to be randomized, by studying a
different expression. In fact, this is the only known method to prove
lower bounds on the communication complexity of disjointness in the
number-on-forehead model.
Consider the following distribution on sets. Let the universe
consist of disjoint sets I1 , . . . , Im . Alice gets m independently sampled
sets X1 , . . . , Xm , where Xi is a random subset of Ii , and Bob gets m
random sets of size 1, Y1 , . . . , Ym , where the i’th set is again drawn
from Ii . Let X = ∪im=1 Xi , and Y = ∪im=1 Yi . We prove:
Lemma 5.9. For any rectangle R,
s
h i 1
∑im=1 Disj( Xi ,Yi )
E χ R ( X, Y ) · (−1) ≤ .
∏m
j=1 | Ii |

Proof. As usual, we express χ R ( X, B) = A( X ) · B( X ) and carry out a


convexity argument. We get:

h m
i2
Disj( Xi ,Yi )
E χ R ( X, Y ) · (−1)∑i=1
h m
i2
= E A( X ) · B(Y ) · (−1)∑i=1 Disj(Xi ,Yi )
 h i2 
m
≤ E A( X )2 E B(Y ) · (−1)∑i=1 Disj(Xi ,Yi )
h m m 0
i
≤ E B(Y ) B(Y 0 ) · (−1)∑i=1 Disj(Xi ,Yi )+∑i=1 Disj(Xi ,Yi )
X,Y,Y 0
 h
m
i 
∑ Disj( Xi ,Yi )+∑im=1 Disj( Xi ,Yi0 )
≤ E E (−1) i = 1

Y,Y 0 X
discrepancy 69

For any fixing of Y, Y 0 , the inner expectation is 0 as long as Y 6= Y 0 .


The probability that Y = Y 0 is exactly 1/ ∏m j=1 | Ii |. Thus we get
h m
i2
that E χ R ( X, Y ) · (−1)∑i=1 Disj(Xi ,Yi ) ≤ 1/ ∏mj=1 | I j |, proving the
bound.

Lemma 5.9 may not seem useful at first, because under the given
distribution, the probability that X, Y are disjoint is 2−m . However,
we can actually use it to give a linear lower bound on the communi-
cation of deterministic protocols. Suppose a deterministic protocol
for disjointness has communication c. Then there must be at most 2c
monochromatic 1-rectangles R1 , . . . , Rt that cover all the 1’s. When-
ever X, Y are disjoint, we have that ∑m j=1 Disj( Xi , Yi ) = m. On the
other hand, the probability that X, Y are disjoint is exactly 2−m . Thus,
we get
" #
t
−m ∑m
j=1 Disj( Xi ,Yi )
2 ≤E ∑ χRi (X, Y ) · (−1)
i =1
t h m i
Disj( Xi ,Yi )
≤ ∑ E χ Ri ( X, Y ) · (−1)∑ j=1
i =1
m q
≤ 2c · (1/ ∏ | Ij |).
j =1

Setting | Ii | = 4, and rearranging gives c ≥ m, a linear lower bound


on the communication complexity of disjointness. While we have
already seen several approaches to proving linear lower bounds on
disjointness, this approach has a unique advantage: it works even
in the number-on-forehead model. Consider the distribution where
for each j = 1, 2, . . . , m, X1,j ⊆ Ii is picked uniformly at random,
and X2,j , . . . , Xk,j ⊆ Ii are picked uniformly at random, subject to
the constraint that their intersection contains exactly 1 element. Set
Xi = ∪ m j=1 Xi,j . Suppose the i’th player has set Xi written on his
forehead. Then we prove:

Lemma 5.10. For any cylinder intersection S,

h i m
2k −1 − 1
∑m Disj( X1,j ,...,Xk,j )
E χS ( X ) · (−1) j=1 ≤ ∏ q .
j =1 | Ij |

Proof. We prove the lemma by induction on k. When k = 2, the


statement was already proved in Lemma 5.9.
For ease of notation, we write Tj to denote the input in the j’th
interval, X1,j , . . . , Xk,j . Suppose χS ( X ) = ∏ik=1 χi ( X ), where χi is
the indicator of the i’th cylinder. Then, as usual, we can apply a
convexity argument to bound:
70 communication complexity

h i2
∑m Disj( X1,j ,...,Xk,j )
E χS ( X ) · (−1) j=1
 " #2 
k −1
 χ k ( X )2 · E ∑m
j=1 Disj( Tj )
≤ E
X1 ,...,Xk−1 Xk
∏ χi (X ) · (−1) 
i =1
 " #2 
k −1
∑m
j=1 Disj( Tj )
≤ E
X1 ,...,Xk−1
E
Xk
∏ χi (X ) · (−1) 
i =1
" #
k −1
∑m 0
j=1 Disj( Tj )+Disj( Tj )
= E
X1 ,...,Xk−1 ,Xk ,Xk0
∏ χi (X )χi (X ) · (−1) 0
, (5.2)
i =1

where here X = X1 , . . . , Xk , X 0 = X1 , . . . , Xk−1 , Xk0 , and Tj0 =


0 . Let v, v0 denote the two common intersection
X1,j , . . . , Xk−1,j , Xk,j
points of the last k − 1 sets.
Now whenever v = v0 , we have Disj( Tj ) = Disj( Tj0 ), and so the
j term of the sum is 0 modulo 2. On the other hand, when v 6= v0 ,
then any intersection in Tj must take place in the set Xk,j \ Xk,j 0 , and
0 0
any intersection in Tj must take place in Xk,j \ X j,k , so we can fix
the intersections of all the sets to Xk ∩ Xk0 and the Xkc ∩ Xk0c and use
induction to bound the discrepancy.
Let Zj be the random variable defined as:


1 if v = v0 ,
Z j = q (2 −1) k − 2 2
 | Xk,j \ X 0 || X 0 \ Xk,j | otherwise.

k,j k,j

Then we get:
" #
m m  
(5.2) ≤ E ∏ Zj ≤ ∏ E Zj ,
j =1 j =1

since the Zi ’s are independent of each other. We need a technical


claim next:
Claim 5.11. Suppose a set Q ⊆ Ij is sampled by including a random element
v ∈ Ij and adding every other element to Q independently with probability
h i
γ 6= 0. Then E |Q1 | ≤ 1/(γ| Ij |).

Proof.
 
1 (1/| Ij |) · γ|Q|−1 (1 − γ)| Ij |−|Q|
E
| Q|
= ∑ | Q|
Q,v
1 1 1
= ∑ γ|Q| (1 − γ)| Ij |−|Q| ≤ (1 − γ + γ ) | I j | = .
γ| Ij | Q6=∅
γ| Ij | γ| Ij |
discrepancy 71

If X2,j ∩ . . . ∩ Xk−1,j is of size t, then the probability that v = v0 is


exactly 1/t. Thus, the probability of this event is exactly the expected
size of 1/| Q|, where Q is the intersection of the first k − 1 sets. After
picking the common intersection point, every other element of Ij is
included in Q independently with probability 2k−11 −1 . So by Claim
2k −1 −1
5.11, Pr[v = v0 ] = | Ij |
. When v 6= v0 , we can bound

(2k −2 − 1 )2
Zj = q
0 | · |X0 \ X |
| Xk,j \ Xk,j k,j k,j
!
(2k −2 − 1 )2 1 1
≤ · 0 | + |X0 \ X | , By the Arithmetic√mean - geometric
2 | Xk,j \ Xk,j k,j k,j mean inequality: ab ≤ a+2 b .

Let Q = Xk \ Xk0 . Once again we see that Q is sampled by picking


the value of V uniformly, and then every other element is included
k −2
in Q independently with probability 2(22k−1−−11) . So by Claim 5.11,
 
2 (2k −1 −1 )
E | X \1 X 0 | = (2k−2 −1)| I | . Combining these bounds, we get
k,j k,j j

" #
  (2k −2 − 1 )2 0
E Zj ≤ Pr[v = v ] + E 0 |
| Xk,j \ Xk,j
2k−1 − 1 2(2k−1 − 1)(2k−2 − 1)2
≤ +
| Ij | (2k−2 − 1)| Ij |
(2k −1 − 1 )2
= ,
| Ij |
as required.

Lemma 5.10 can be used to prove a linear lower bound on the


communication of deterministic protocols. Suppose a deterministic
protocol for disjointness has communication c. Then there must be
at most 2c monochromatic 1-cylinder intersections S1 , . . . , St that
cover all the 1’s. Whenever X1 , . . . , Xk are disjoint, we have that
∑mj=1 Disj( X1,j , X2,j , . . . , Xk,j ) = m. On the other hand, the probability
that X1 , . . . , Xk are disjoint is exactly 2−m . Thus, we get
" #
t m
Disj( X1,i ,...,Xk,j )
2− m ≤ E ∑ χSi (X1 , . . . , Xk ) · (−1)∑j=1
i =1
h
t m i Setting | Ii | = `, for all i, and rear-
Disj( X1,i ,...,Xk,j )  n
≤ ∑ E χSi ( X1 , . . . , Xk ) · (−1)∑ j=1

ranging gives c ≥ 2·(2k−`1 −1)
`
=
i =1  n/2
  (`/a) 1/ ` , where a = (2 · (2 1 −
k −
m
2k −1 − 1 2
1)) . The derivative of (`/a)
1/` is
≤ 2c ·  ∏ q . 
(`/a)1/` · 1−ln`(` /a)
, which is 0 when
j =1 | Ij | 2

` = e · a. In this way, one can set ` to get


n log e
a slightly better bound: c ≥ 8e·(2k−1 −1)2 .
n
Setting | Ij | = 16 · (2k−1 − 1)2 , we get that c ≥ m = 16(2k−1 −1)2
.
72 communication complexity

Theorem 5.12. Any deterministic protocol for computing disjointness in the


number-on-forehead model requires 16(2k−n1 −1)2 bits of communication.
6
Information

1
Shannon, 1948
Shannon’s seminal work on information theory1
has had a big im-
pact on communication complexity. Shannon wanted to measure the
amount of information (or entropy) contained in a random variable
X. Shannon’s definition was motivated by the observation that the
amount of information contained in a message is not the same as the
length of the message. Suppose we are working in the distributional
setting, where the inputs are sampled from some distribution µ.

• Consider a protocol where Alice’s first message to Bob is a c-bit The entropy of the message is 0.

string that is always 0c , no matter what her input is. This message
does not convey any information to Bob. We might as well run the
protocol imagining that this first message has already been sent,
and so reduce the communication of the first step to 0.
The entropy of the message is log |S|.
• Consider a protocol where Alice’s first message to Bob is a random
string from a set S ⊆ {0, 1}c , with |S|  2c . In this case, the
parties should use log |S| bits to index the elements of the set,
reducing the communication from c to log |S|.

• Consider a protocol where Alice’s first message to Bob is the


The entropy of the message is ≈ en.
string 0n with probability 1 − e, and is a uniformly random n bit
string with the remaining probability. In this case one cannot
encode every message using fewer than n bits. However, Alice
can send the bit 0 to encode the string 0n , and the the string 1x
to encode the n bit string x. Although the first message is still
quite long in the worst case, the expected length of the message is
1 − e + e(n + 1) = 1 + en.

Shannon’s definition of entropy gives a general way to compute


the length of the smallest encoding of a message. Given a random
variable X with probability distribution p( x ), define the entropy of X
to be
74 communication complexity

1
H (e)

0.5

0
0 0.5 1
e
Figure 6.1: The entropy of a bit with
  p(1) = e.
1
H (X) = ∑ p(x) log(1/p(x)) = pE(x) log
p( x )
.
x

The entropy of X characterizes the expected number of bits that need


to be transmitted to encode X. Intuitively, if there is an encoding The definition ensures that the entropy
is always non-negative.
of X that has expected length k, then X can be encoded by a string
of length 10k most of the time, so one would expect that X takes
on one of 2O(k) values most of the time, and so the expected value
of log(1/p( x )) should be O(k). Conversely, if the entropy of X is k,
then one can encode X using the positive integers, in such a way that
p(1) ≥ p(2) ≥ . . . . The probability that the sample for X is the i’th
integer is at most 1/i (since otherwise ∑ij=1 p(i ) > 1), so the expected
length of transmitting the number that encodes X should be bounded
by ∑i p(i ) log i, which is at most the entropy. Formally, we can prove:
Theorem 6.1. X can be encoded using a message whose expected length is at
most H ( X ) + 1. Conversely, every encoding of X has expected length at least
H ( X ).
Proof. Without loss of generality, suppose that X is an integer from Can you think of an example that
shows that the expected length needs to
[n], with p(i ) ≥ p(i + 1). Let `i = dlog(1/p(i ))e. We shall encode i be at least H ( X ) + 1?
with a leaf at depth `i . Then the expected length of the message will
be ∑i p(i )`i ≤ ∑i p(i )(log(1/p(i )) + 1) = H ( X ) + 1. The encoding
is done greedily. In the first step, we pick the first vertex of the
complete binary tree at depth `1 and let that vertex represent 1. We
delete all of its descendants so that the vertex becomes a leaf. Next
we find the first vertex at depth `2 that has not been deleted, and use
it to represent 2. We continue in this way until every element of [n]
has been encoded. For i < j, the number of vertices at depth ` j that
are deleted in the i’th step is exactly 2` j −`i , so the number of vertices
at depth j that are deleted before the j’th step is
!
j −1 j −1 j −1
∑ 2`j −`i = 2`j ∑ 2−`i ≤ 2` j ∑ p ( i ) < 2` j ,
i =1 i =1 i =1
information 75

so some vertex will be available at the j’th step. This ensures that
every step of this process succeeds.
Conversely, suppose X can be encoded in such a way that i is
encoded using `i bits. Then the expected length of the encoding is:
h i
−`
E [`i ] = E [log(1/p(i ))] − E log(2 i /p(i ))
p (i ) p (i ) p (i )
!
h i
−`i
≥ H ( X ) − log E 2 /p(i ) By convexity of the log function,
p (i ) E [log Y ] ≤ log (E [Y ]).
!
= H ( X ) − log ∑2 −`i
.
i

If you pick a random path starting from the root in the protocol tree,
you hit the leaf encoding i with probability 2−`i . The probability
that you hit one of theleaves encoding
 a number from [n] is thus
∑i 2−`i ≤ 1. Thus log ∑i 2−`i is at most 0, and the entropy is at
most the expected length of the encoding.

Entropy, Divergence and Mutual Information

The concepts of divergence and mutual information are


closely related to the concept of entropy. They provide a toolbox that
helps to understand the flow of information in different situations.
The divergence between two distributions p( x ) and q( x ) is defined to
be
 
p( x ) p( x ) p( x )
= ∑ p( x ) log = E log .
q( x ) x q( x ) p( x ) q( x )

The divergence is a measure of distance the two distributions.


p( x )
Clearly, = 0.
p( x )

p( x )
Fact 6.2. ≥ 0.
q( x )

Proof.
 
p( x ) p( x )
= E log
q( x ) p( x ) q( x )
q( x ) q( x )
= − ∑ p( x ) log ≥ − log ∑ p( x ) = log 1 = 0. The inequality follows from the convex-
x p( x ) x p( x ) ity of the log function.
76 communication complexity

e log γe + (1 − e) log 11−


−e
γ 5

4.5
0.8

3.5

3
0.5
γ

2.5

1.5

0.2 1

0.5

0
0.1 0.5 0.9
e

Figure 6.2: The divergence between two


bits.
p( x ) q( x )
However, the divergence is not symmetric: in 6=
q( x ) p( x )
general. Moreover, the divergence can be infinite, for example if p is
supported on a point that has 0 probability under q. If X is an `-bit
string, we see that:
 
p( x ) p( x )
H ( X ) = E [log(1/p( x ))] = ` − E log −` = ` − ,
p( x ) p( x ) 2 q( x )

where q( x ) is the uniform distribution on `-bit strings. So we see


that the entropy of a string is just a way to measure the divergence
from uniform. In particular, since the divergence is non-negative
(Fact 6.2), the uniform distribution has maximum entropy of all the
distributions on a set.
In our context, the divergence is most often measured between two
distributions that arise from the same probability space. For example,
if E is an event in a probability space containing x, we have Proof.
 
p( x |E ) p( x |E )
p( x |E ) 1 = E log
Fact 6.3. ≤ log p(E )
. p( x ) p( x |E ) p( x )
p( x ) 
p(E | x )

= E log
p( x |E ) p(E )
We can use divergence to quantify the dependence between two
1
random variables. If p( a, b) is a joint distribution, we define the ≤ log .
p(E )
mutual information between a and b to be
    " #
p( a, b) p(b| a) p(b| a)
I ( A : B) = E log = E log = E . The entropy, mutual information and
p( a,b) p( a) p(b) p( a,b) p(b) p( a) p(b)
divergence are all expectations over the
universe of various log-ratios.
We have that I ( A : B) = H ( A) + H ( B) − H ( AB). The mutual
information 77

information of any random variable with itself is the same as its en-
tropy I ( A : A) = H ( A). On the other hand, if A, B are independent,
I ( A : B) = 0. In general, the mutual information is always a num-
ber between these two quantities: 0 ≤ I ( A : B) ≤ H ( A). The first
inequality follows from Fact 6.2, and the second by observing:
 
1 p( a, b)
H ( A) − I ( A : B) = E log − log
p( a,b) p( a) p( a) p(b)
 
p(b)
= E log ≥ 0.
p( a,b) p( a, b)

Chain Rules
Chain rules allow one to relate bounds on the information of a
collection of random variables to the information associated with
each variable. Suppose p( a, b) and q( a, b) are two distributions. Then
we have
 
p( a, b) p( a) · p(b| a)
= E log
q( a, b) p( a,b) q( a) · q(b| a)
   
p( a) p(b| a)
= E log + E log
p( a,b) q( a) p( a,b) q(b| a)
" #
p( a) p(b| a)
= + E .
q( a) p( a) q(b| a)

In words, the total divergence is the sum of the divergence from the
first variable, plus the expected divergence from the second variable.
Similar chain rules hold for the entropy and mutual informa-
tion. Suppose A, B are two random bits that are always equal. Then
H ( AB) = 1 6= H ( A) + H ( B), so the entropy does not add in general.
Nevertheless, a chain rule does exist for entropy. Denote
 
1
H ( B | A) = E log .
p( a,b) p(b| a)
h i
2
H ( AB) = E p(a,b) log p(a) p1(b|a) =
Then we have the chain rule2 :
H ( AB) = H ( A) + H ( B | A). h i
Suppose A, B, C are three random bits that are all equal to each E p(a,b) log p(1a) + log p(b1|a) = H ( A) +
H ( B | A) .
other. Then I ( AB : C ) = 1 < 2 = I ( A : C ) + I ( B : C ). On the other
hand, if A, B, C are three random bits satisfying A + B + C = 0 mod 2,
we have I ( AB : C ) = 1 > 0 = I ( A : C ) + I ( B : C ). Nevertheless, a
chain rule does hold for mutual information, after we use the right 3
I ( AB :hC ) = i
definition. Denote: p( a,c)· p(b| a,c)
E p(a,b,c) log p(a) p(b|a)· p(c) =
  h i
p(b, c| a) p(b| a,c)
I ( A : C ) + E p(a,b,c) log p(b|a) =
I ( B : C | A) = E log . h i
p( a,b,c) p(b| a) p(c| a) p(b,c| a)
I ( A : C ) + E p(a,b,c) log p(b|a)· p(c|a =
I ( A : C ) + I ( B : C | A ).
Then we have the chain rule3 : I ( AB : C ) = I ( A : C ) + I ( B : C | A).
78 communication complexity

Subadditivity
Each of the definitions we have seen so far satisfies the property that
conditioning on variables can either only increase the quantity or
only decrease the quantity, a property that we loosely refer to as
subadditivity. We start with the divergence. Suppose p( a, b), q( a, b) are
two distributions. Then:
" #  
p( a|b) p( a) p( a|b)
E = E log + log
p(b) q( a) p( a,b) q( a) p( a)
p( a) p( a)
= + I ( A : B) ≥ .
q( a) q( a)
One consequence of this last inequality is:
Fact 6.4. If q( x1 , . . . , xn ) is a product distribution, then for any p,
p ( x1 , . . . , x n ) n p ( xi )
q ( x1 , . . . , x n
≥ ∑ q ( xi )
.
i =1

When it comes to entropy, we have: Proof of Fact 6.4.

H ( AB) = H ( A) + H ( B) − I ( A : B) ≤ H ( A) + H ( B) . p ( x1 , . . . , x n )
q ( x1 , . . . , x n
This last inequality also implies that n
" #
p ( xi | x <i )
= ∑ p(Ex
H ( A) ≥ H ( AB) − H ( B) = H ( A | B) . i =1 <i q ( xi | x <i )
" #
n p ( xi | x <i )
We have already seen that conditioning on a random variable can = ∑ p(Ex
i =1 <i q ( xi )
both decrease, or increase the mutual information. Nevertheless, n p ( xi )
when A, B are independent, we can prove4 : ≥ ∑ q ( xi )
.
i =1
I ( AB : C ) ≥ I ( A : C ) + I ( B : C ) .

Shearer’s Inequality 4
I ( AB : C ) − I ( A : C ) − I ( B : C ) =
I ( B : C | A) − I ( B : C ) = H ( B | A) −
A useful consequence of subadditivity is Shearer’s inequality: H ( B | AC ) − H ( B) + H ( B | C ) ≥ 0,
since H ( B | AC ) ≤ H ( B | C ), and
Lemma 6.5. Suppose X = X1 , . . . , Xn is a random variable and S ⊆ [n] is a H ( B | A ) = H ( B ).
set sampled independently of X. Then if p(i ∈ S) ≥ e for every i ∈ [n], we
have H ( XS | S) ≥ e · H ( X ).
Proof. Suppose S = { a, b, c}, with a < b < c. Then we can express
H ( XS ) = H ( X a ) + H ( Xb | X a ) + H ( Xc | X a , Xb )
≥ H ( X a | X< a ) + H ( Xb | X< b ) + H ( Xc | X< c ) ,
by subadditivity. In general, we get that
" #
H ( XS | S ) ≥ E
S
∑ H ( Xi | X < i )
i ∈S
n
= ∑ p ( i ∈ S ) H ( Xi | X < i ) ≥ e · H ( X ) .
i =1
information 79

Pinsker’s Inequality
Pinsker’s inequality bounds the statistical distance between two
distributions in terms of the divergence between them.

e log γe + (1 − e) log 11−


−e −
γ
2
ln 2 ( e − γ )
2

0.8 0.6

0.5

0.4

0.5
γ

0.3

0.2

0.1
0.2

0
0.1 0.5 0.9
e

Figure 6.3: Pinsker’s Inequality

p( x ) 2
Lemma 6.6. ≥ ln 2 · | p − q |2 .
q( x )
See the notational remarks in Prob-
ability section of the Conventions
Proof. Let T be the set that maximizes p( T ) − q( T ), and define Chapter.

1 if x ∈ T,
xT =
0 otherwise.

Then, | p − q| = p( T ) − q( T ) = p( x T = 1) − q( x T = 1). We shall prove:

p( x ) p( xT )

q( x ) q( xT ) f
2 2 g
≥ · ( p( x T = 1) − q( x T = 1))2 = · | p − q |2 .
ln 2 ln 2
The first inequality follows from the chain rule for divergence. It
only remains to prove the second inequality. Suppose p( x T = 1) =
e ≥ q( x T = 1) = γ. Then we shall show that
0 0.67 1
e
e 1−e 2 Figure 6.4: f = e log 2/3 e
+ (1 −
e log + (1 − e) log − · ( e − γ )2 (6.1) 1− e 2
γ 1 − γ ln 2 e) log 1/3 , g = ln 2 (e − 2/3)2 .
80 communication complexity

is always non-negative. (6.1) is 0 when e = γ, and its derivative with


respect to γ is

−e 1−e 4( γ − e )
+ −
γ ln 2 (1 − γ) ln 2 ln 2
γ − eγ − e + eγ 4(γ − e)
= −
γ(1 − γ) ln 2 ln 2
 
(γ − e) 1
= −4 .
ln 2 γ (1 − γ )

Since γ(11−γ) is always at most 4, the derivative is non-positive when


γ < e, and non-negative when γ > e. This proves that (6.1) is always
non-negative, as required.

Pinsker’s inequality implies that two variables that have low


information with each other cannot affect each other’s distributions
by much:

Corollary 6.7. If A, B are q


random variables then on average over b,
e ln 2·I( A:B)
p( a|b) ≈ p( a), where e = 2 .

Another useful corollary is that conditioning on a low entropy


random variable cannot change the distribution of many other inde-
pendent random variables:

Corollary 6.8. Let A1 , . . . , An be independent random variables, and B be


jointly distributed. Let i ∈ [n] be uniformly random and independent of all
e
otherq
variables. Then on average over b, i, a<i , p( ai |b, a<i ) ≈ p( ai ), where
H( B) ln 2
e≤ 2n .

Proof. By subadditivity, we have:

H ( B) /n ≥ I ( A1 , . . . , An : B) /n
n 
≥ (1/n) ∑ I A j : BA< j .
j =1

Thus we get that for a uniformly random coordinate i,

E [I ( Ai : BA<i )] ≤ H ( B) /n.

The bound then follows from Corollary 6.7.

Some Examples from Combinatorics

The entropy function has found many applications in combina-


torics, where it can be used to give simple proofs. Here we give a few
examples that illustrate its power.
information 81

On the Size of Projections

Let S be a set of n3 points in R3 , and let Sxy , Syz , Sxz denote the pro-
jections of S onto the xy, yz, xz planes.

Figure 6.5: A set in R3 projected to the


three planes.

Claim 6.9. One of the three projections must have size at least n2 .

Proof. Let X, Y, Z be the coordinates of a uniformly random point


from S. By Shearer’s inequality,

H ( XY ) + H (YZ ) + H ( XZ ) 2
≥ · H ( XYZ ) = 2 log n,
3 3

so one of the first three terms must be at least 2 log n, proving that
the projection must be of size at least n2 .

On the number of Paths and Cycles in a Dense Graph

Suppose we have a graph on n vertices with m edges. The average


degree of a vertex in the graph is d = 2m/n. In Lemma 5.3 we proved
5
Szegedy, 2014

a lower bound on the number of 4 cycles in the graph. Here we prove


6
Babu and Radhakrishnan, 2010; and
lower bounds on the number of paths and cycles 5 and an upper Alon et al., 2002
bound on the girth of the graph 6 .
Let X, Y, Z be a random path of length 2 sampled as follows. First
we sample a uniformly random edge X, Y, and then a uniformly
random neighbor Z of Y. Observe that p( x |y) = p(z|y) under this
distribution, and so p( xy) = p(yz). If dv denotes the number of
neighbors of the vertex v, we have:

H ( XYZ ) = H ( XY ) + H ( Z | Y ) .

H ( XY ) = m by the choice of our distribution. To bound H ( Z | Y ),


82 communication complexity

we use convexity:

d
H (Z | Y) = ∑ 2mi · log di
i
n di
2m ∑
= · · log di
i
n
n
≥ · d · log d = log d, since the function x log x is concave
2m

which is the logarithm of the average degree in the graph.


Thus we get H ( XYZ ) ≥ log(2m2 /n). Some of the points in the
support of XYZ do not correspond to paths, since we could have
x = z. After correcting for these counts we are left with at least
2m2 /n − m paths of length 2.
Next we turn to proving a lower bound on the number of 4-cycles.
Sample X, Y, Z as before, and then sample W using the distribution
p(y| xz). Then we have:

H ( XYZW ) = H ( XYZ ) + H (W | XZ )
≥ H ( XYZ ) + H ( XWZ ) − H ( XZ ) using subadditivity

≥ 2 · H ( XYZ ) − 2 log n. since XYZ is identically distirbuted to


XWZ and the entropy of XZ is at most
2 log n

Combining this with our bound for H ( XYZ ), we get that H ( XYZW ) ≥
log(4m4 /n4 ). There are some redundant cycles where two of the ver-
tices are the same. After accounting for these, we are left with at least 7
very similar reasoning can be used to
4m4 /n4 − 4n3 distinct cycles. give a bound in the case that the girth is
Finally we turn to bounding the girth of a graph. The girth is the even

length of the shortest cycle in the graph. Suppose we have a graph 8


meaning that every vertex has the
whose girth is an odd7 number g. same number of neighbors
g −1
If the graph is regular8 , the vertices at distance 2 from any fixed
vertex must form a tree, or else the graph would have a cycle of
g −1
length < g. This proves that (d − 1) 2 ≤ n. We shall prove a very
similar bound without assuming that the graph is regular.
Let X = X0 , X1 , . . . , X g−1 be a random path in the graph, sampled
2
as follows. Let X0 , X1 be a random edge, and for i > 1, let Xi be a
random neighbor of Xi−1 that is not the same as Xi−2 . As before, we
see that X2 has the same distribution as X0 after fixing X1 , and so
each edge of this path is identically distributed. We have

g −1
2
H ( X | X0 ) = ∑ H ( Xi | Xi − 1 ) .
i =1
information 83

We lower bound each term H ( Xi | Xi−1 ) using convexity:

dv
H ( Xi | Xi − 1 ) = ∑ 2m log(dv − 1)
v
2m dv
=
n ∑ n
log(dv − 1)
v
≥ log(d − 1).

Putting the bounds together, we get:

g−1
H ( X | X0 ) ≥ log(d − 1).
2
On the other hand, since the girth of the graph is g, the entire path X
g −1
is determined by X0 , Xt . So we have log n ≥ H ( X | X0 ) ≥ 2 log(d −
1), as required.

An Isoperimetric Inequality in the Hypercube


The hypercube is the graph who vertex set is {0, 1}n and the edges
connect two vertices that disagree in exactly one coordinate. The
hypercube contains 2n vertices and 2n n/2 edges. Here we give a tight
9
Samorodnitsky. Ref?
bound on the number of edges in any subset of the vertices9 :

Theorem 6.10. If S ⊆ {0, 1}n , the number of edges contained in S is at


|S| log |S|
most 2 .

Proof. Let e(S) denote the number of edges in S. Let X be a uni-


formly random element of S. Then for any vertex x ∈ S and y such
that x, y is an edge of the hypercube where x, y disagree in the i’th
Here X−i denotes
coordinate, we have X1 , X2 , . . . , X i − 1 , X i + 1 , . . . , X n .

1 if ( x, y) is an edge that is contained in S,
H ( Xi | X − i = x − i ) =
0 otherwise.

So ∑ x∈S,i∈[n] H ( Xi | X−i = x−i ) = 2e(S), since each edge is counted


twice. By subadditivity,
n n
2e(S)
log |S| = H ( X ) = ∑ H ( Xi | X < i ) ≥ ∑ H ( Xi | X − i ) = |S|
,
i =1 i =1

|S| log |S|


proving that e(S) ≤ 2 .

On the Size of Triangle Intersecting Graphs


Suppose F is a family of subsets of [n] such that any two sets from F
intersect. Then we claim:

Claim 6.11. |F | ≤ 2n−1 .


84 communication complexity

Proof. For any set T ∈ F , its complement cannot be in F . So only


half of all the sets can be in F .

Let G be a family of graphs on n vertices such that every two 10


A cycle of length 3
graphs intersect in a triangle10 . Such a family can be obtained by
n
including a fixed triangle, which gives 2( 2 ) /8 graphs. This bound 11
Ellis et al., 2010
is known to be tight11 , but here we give a simple argument that
provides a partial converse12 : 12
Chung et al., 1986
n
Theorem 6.12. |G| ≤ 2( 2 ) /4.

Proof. Let G be a uniformly random graph from the family. G can


be described by a binary vector of length (n2 ), where each bit indi-
cates whether a particular edge is present or not. Let S be a random
subset of n/2 of the vertices, and let GS denote the graph obtained
by deleting all edges that go from S to the complement of S. Since
the probability that any particular edge is retained is exactly 1/2,
Shearer’s inequality gives ES [H ( GS | S)] ≥ H ( G ) /2.
Now any two graphs G, G 0 in the family intersect in a triangle,
so we must have that GS , GS0 must share an edge in common, no
matter what S is, because at least one of the edges of the triangle will
not be thrown away in the above process. But this means that the
number of such projections is at most half of all possible projections,
by Claim 6.11. Writing e(S) = (|S2 |) + (n−|2 S|) for the total number of
edges possible in the graph GS , this means that H ( GS ) + 1 ≤ e(S). In
expectation exactly half of the edges contribute to e(S), so we get:
 
1 n 1
· = E [e(S)] ≥ E [H ( GS | S)] + 1 ≥ · H ( G ) + 1,
2 2 S S 2 Figure 6.6: Two intersecting families of
n n
sets on a universe of size 3.
and so H ( G ) ≤ (n2 ) − 2, which implies that |G| ≤ 2( 2 )−2 = 2( 2 ) /4.
Very similar ideas can be used to show
that any family of graphs that intersects
in an r-clique can be of size at most
Lower bound for Indexing n
2( 2 ) /2r−1 . See Exercise 6.4.

We now have enough information theory tools to prove some lower


bounds in communication complexity. Suppose Alice has a random n
bit string x, and Bob is given a random index i ∈ [n]. The goal of the
players is to compute the i’th bit, xi , but the protocol must start with If Bob could tell Alice i in the first step,
that would give a log n bit protocol.
a message from Alice to Bob, and then Bob must output the answer.
We prove that Ω(n) bits of communication are necessary, even if the Proving a deterministic lower bound
parties are only looking for an average-case protocol. for this problem is easy: after Alice’s
message, Bob must know the entire
Suppose there is a protocol for this problem where Alice sends
n-bit string. So Alice must send n bits.
a message M that is ` bits long. Then by Corollary 6.8, on average
e
over the choice of m and a random coordinate i, p( xi |m) ≈ p( xi ),
information 85

q
with e = ` ln 2 . Since p ( x ) is uniform for each i, the probability
2n i
that Bob makes an error in the i’th coordinate must be at least 1/2 −
) − p( xi )|. So the probability that Bob makes an error is at least
| p( xi |mq
ln 2
1/2 − ` 2n , proving that at least Ω(n) bits must be transmitted if the
Note that if Alice has a random set
protocol has a small probability of error. from a family of sets of size 2Ω(n) , the
lower bound for indexing would still
hold. The lower bound even extends to
the case that Bob knows x1 , . . . , xi−1 .
Randomized Communication of Disjointness

13
Kalyanasundaram and Schnitger,
One of the triumphs of information theory is its ability to prove
1992; Razborov, 1992; Bar-Yossef et al.,
optimal lower bounds on the randomized communication complexity 2004; and Braverman and Moitra, 2013
of functions like disjointness13 , which we do not know how to prove
any other way. This result is especially impactful
because many other lower bounds
in other models (more in Part II) are
Theorem 6.13. Any randomized protocol that computes the disjointness consequences of Theorem 6.13.
function with error 1/2 − e must have communication Ω(e2 n).

Obstacles to Proving Theorem 6.13


By Theorem 3.3, Theorem 6.13 is
equivalent to the existence of such a
The natural way to prove lower bounds on randomized protocols is hard distribution.
to find a hard distribution on the inputs, such that any protocol with
low communication must make an error a significant fraction of the
time. This is the approach we took when we proved lower bounds
on the inner-product function (Theorem 5.6), and the same distri-
bution works to understand the pointer-chasing problem (Theorem
6.18). In those cases, the uniform distribution on inputs is a hard
distribution. But the uniform distribution is not a hard distribution
for disjointness: two uniformly random sets X, Y will intersect with Intuitively this is because if for typical
i, if the enropy H ( Xi Yi | X<i Y<i ) << 1
very high probability, so the protocol can output 0 without communi- then Alice can encode the relevant coor-
cating and still have very low error. In fact, it can be shown that any dinates of her set (those where there is
distribution where X and Y are independent cannot be used to prove a good chance of an intersection) with
much less than n bits and send it to Bob.
a strong lower bound. So we must use a hard distribution where On the other hand, if this entropy is
X, Y are correlated. typically close to 1, then the sets will
intersect with high probability.
A natural distribution to use, given these constraints, is a convex
combination of two uniformly random disjoint sets, and two sets that
intersect in exactly one element. Once we restrict our attention to
such a distribution, we have a second challenge: the events i ∈ X ∩ Y
and j ∈ X ∩ Y are not independent for i 6= j. This makes arguments
involving subadditivity much harder to carry out. The subtleties in
the proof arise from coming up with technical ideas that allow us to
circumvent these obstacles.
86 communication complexity

Proving Theorem 6.13


Given a randomized protocol with error 1/2 − e, one can make the
error an arbitrarily small constant by repeating the protocol O(1/e2 )
times and outputting the majority outcome. So to prove the lower
1
bound, it suffices to show that any protocol with error < 100 must
have communication Ω(n).
We start by defining a hard distribution on inputs. View the
sets X, Y as n-bit strings, by setting Xi = 1 if and only if i ∈ X.
Pick an index T ∈ [n] uniformly at random, and let XT , YT to be
random and independent bits. For i 6= T, sample ( Xi , Yi ) to be one
of (0, 0), (0, 1), (1, 0) with equal probability, and independent of all
other pairs ( X j , Yj ). X and Y intersect in at most 1 element, and they
Note that the n coordinates
intersect with probability 14 . ( X1 , Y1 ), ( X2 , Y2 ), . . . , ( Xn , Yn ) are
Let M denote the messages of a deterministic protocol of com- not independent, a complication that
makes the proof subtle.
munication ` whose error is at most 1/32. Let H denote the random
variable T, X<T , Y>T . Observe that X − MH − Y, after you fix MH the
After fixing H, X, Y become indepenent:
distribution of XY become independent. Moreover, after H is fixed, for every q, p( ab|q) = p( a|q) · p(b|q).
the n tuples ( X1 , Y1 ), . . . , ( Xn , Yn ) become independent. Since fixing M restricts the inputs
to a rectangle, X, Y remain indepen-
We shall prove that the protocol learns a significant amount of
dent after fixing M: for every q, m,
information about xt (or yt ), when the sets are disjoint. Let D denote p( ab|qm) = p( a|qm) · p(b|qm).
the event that X, Y are disjoint.
Claim 6.14. I ( XT : M | H, YT , D) + I (YT : M | H, XT , D) ≥ Ω(1).
Intuitively, if this sum of informations
Before proving Claim 6.14, we use the subadditivity of information was 0, then XT , YT are both uniform
to show it implies that a linear number of bits were communicated. and independent conditioned on M, H.
We start by proving: This means that the probability that
the sets intersect is far from both 0
Lemma 6.15. Let X = X1 , . . . , Xn and Y = Y1 , . . . , Yn be random variables and 1, which should not happen if M
determines disjointness.
such that the n tuples ( X1 , Y1 ), . . . , ( Xn , Yn ) are mutually independent. Let
M be an arbitrary random variable. Then
n
∑ I (Xi : M | X<i Y≥i ) ≤ I (X : M | Y ) ,
i =1
n
∑ I (Yi : M | X≤i Y>i ) ≤ I (Y : M | X ) .
i =1

Proof. Using the chain rule repeatedly:


n n
∑ I (Xi : M | X<i Y≥i ) ≤ ∑ I (Xi : MY<i | X<i Y≥i )
i =1 i =1
n
= ∑ I (Xi : Y<i | X<i Y≥i ) + I (Xi : M | X<i Y )
i =1
n
= ∑ I ( Xi : M | X < i Y ) = I ( X : M | Y ) . since I ( Xi : Y<i | X<i Y≥i ) = 0.
i =1
information 87

The second bound is proved similarly.

We see that X, Y, M|D satisfy the assumptions of Lemma 6.15.


Moreover T is independent of X, Y, M, conditioned on this event, so
Lemma 6.15 gives:
2`
≥ I ( X : M | Y D) + I (Y : M | X D) since M has at most ` bits.
n
≥ I ( XT : M | TX<T Y≥T D) + I (YT : M | TX≤T Y>T D) by Lemma 6.15

= I ( XT : M | HYT D) + I ( XT : M | HYT D) ≥ Ω(1), by Claim 6.14

which proves that ` ≥ Ω(n).

Proof of Claim 6.14. For any h, m, let αhm be the statistical distance of
p( xt |hm) from uniform, and let β hm denote the distance of p(yt |hm)
from uniform. Let

I ( XT : M | HYT D) + I (YT : M | HXT D) = γ4 .

Observe that p( xt |mh) = p( xt |mh, yt = 0) = p( xt |mh, yt = 0, D). So


Intuitively, if γ is small, then xt , yt must
by Pinsker’s inequality (Corollary 6.8), be close to uniform for most fixings of
r  2  q m. This leads to a high probability of
E [αmh ] ≤ E αmh ≤ γ4 = γ2 . error in the protocol.
p(m,h|yt =0) p(m,h|yt =0)

In particular,

γ ≥ p(αhm > γ|yt = 0)


≥ p( xt = 0|yt = 0) · p(αhm > γ|yt = 0 = xt )
p(αhm > γ| xt = 0 = yt )
= .
2
A symmetric argument proves that p( β hm > γ| xt = 0 = yt ) ≤ 2γ.
Now if E denotes the event that the protocol makes an error, we have
1
p(E |hm) ≥ − αhm − β hm .
4
Expressing
 
1
p(E ) ≥ p(αhm ≤ γ ∧ β hm ≤ γ) − 2γ ,
4
and since

p(αhm ≤ γ ∧ β hm ≤ γ)
≥ p( xt = 0 = yt ) · p(αhm ≤ γ ∧ β hm ≤ γ| xt = 0 = yt )
1 − 4γ
≥ ,
4
we get
   
1 − 4γ 1 1 3γ
p(E ) ≥ · − 2γ = − + 2γ2 ,
4 4 16 4
proving that γ ≥ Ω(1), as required.
88 communication complexity

Lower bounds on Non-Negative Rank

The ideas used to prove the lower bound on the communication


complexity of disjointness are powerful. In fact, they can be used to
prove lower bounds on the non-negative rank of a large family of
Recall that a non-negative matrix D
matrices. As we shall see in Chapter 11, the non-negative rank is also has non-negative rank r if and only if
related to algorithmic questions and to properties of polytopes. D can be written as the sum of r (and
no fewer) non-negative rank 1 matrices.
For a parameter 0 ≤ δ ≤ 1, suppose we are given a 2n × 2n non-
This is also equivalent to saying that
negative matrix A whose rows and columns are indexed by sets D = AB, where A has r columns, and B
x, y ⊆ [n], such that has r rows.


= 1 if x, y are disjoint,
A x,y
≤ 1 − δ if x, y are not disjoint.

When δ = 1, the matrix A is the disjointness matrix, which has full


rank and hence full non-negative rank. When δ = 0, the matrix may
If the entries corresponding to intersect-
have non-negative rank 1. Here we prove that the rank of the matrix ing sets are allowed to be larger than
is at least exponential in δ2 n. In fact, we shall prove the following the entries corresponding to disjoint
sets, that matrix may have exponen-
stronger theorem:
tially smaller rank. For example, if
A x,y = | x ∩ y| + 1, the matrix has non-
Theorem 6.16. If A x,y = 1 when x, y are disjoint and A x,y ≤ 1 − δ when negative rank n + 1. This shows that for
4 n)
| x ∩ y| = 1, then rank+ ( A) ≥ 2Ω(δ . δ ≤ 0, the non-negative rank of A can
be quite small.
The proof of the theorem is an appropriate adaptation of the lower
bound we proved for the randomized communication complexity of A slightly more involved argument
2
proves that rank+ ( A) ≥ 2Ω(δ ) .
disjointness.

Proof. Consider the distribution on x, y given by

A x,y
q( xy) = .
∑ a,b A a,b

If A has non-negative rank r, then A can be expressed as A =


∑rm=1 A(m), where A(m) is a non-negative rank 1 matrix. In other
words, q( xy) can be expressed as a convex combination of r product
distributions, by setting

A(m) x,y
q( xy|m) = ,
∑ a,b A(m) a,b
q( xy|m) is a product distribution,
and since A(m) has non-negative rank 1: if
∑ a,b A(m) a,b the rank of A(m) is 1, there must be
q(m) = .
∑ a,b A a,b v x , vy ∈ R such that q( xy|m) = v x · vy .

Let D denote the event that the sets X, Y sampled in this distribu-
tion are disjoint. Since A x,y = 1 for disjoint sets x, y, q( xy|D) is the
uniform distribution on all pairs of disjoint sets. Let Hi = X<i Y>i .
information 89

Claim 6.17. For every i ∈ [n],

I ( Xi : M | Hi , Yi , D) + I (Yi : M | Hi , Xi , D) ≥ Ω(δ4 ).

Before proving Claim 6.17, we show how to use it. By Lemma 6.15,
we get that

2 log r
n
≥ ∑ I (Xi : M | Hi , Yi , D) + I (Yi : M | Hi , Xi , D)
i =1
≥ Ω ( δ4 n ),
4 n)
proving that r ≥ 2Ω(δ as required. Next we turn to proving Claim
6.17.

Proof of Claim 6.17. We shall prove that for every setting of hi ,

I ( Xi : M | hi , Yi , D) + I (Yi : M | hi , Xi , D)
2
≥ (I ( Xi : M | hi , Yi = 0, D) + I (Yi : M | hi , Xi = 0, D))
3
≥ Ω ( δ4 ).

Let U denote the event that X ∩ Y ⊆ {i }. Let p( xym) = q( xym|hU ).


Note that p( xy|m) is a product distribution, and p( xym|yi = 0) =
q( xym|yi = 0, hi , D).
For fixed m, let αm denote the statistical distance of p( xi |m) from
uniform, and β m denote the distance of p(yi |m) from uniform. If

I ( Xi : M | hi , Yi = 0, D) = γ4 ,

then by Pinsker’s inequality (Corollary 6.8),


r q
E [αm ] ≤ E [α2m ] ≤ γ4 = γ2 .
p ( m | y i =0) p ( m | y i =0)

In particular, p(αm > γ|yi = 0) ≤ γ. A symmetric argument proves


that p( β m > γ| xi = 0) ≤ γ. Now the assumptions on A imply that
1 1 + δ/4 1 δ
p ( xi = 0 = yi ) ≥ ≥ = + . since 1
1− e ≥ 1 + e.
4−δ 4 4 16
On the other hand,
r
p ( xi = 0 = yi ) = ∑ p(m, xi = 0 = yi ).
m =1

In this sum, the contribution of the terms for which αm > γ is at most

∑ p(m, xi = 0 = yi ) ≤ ∑ p ( m | y i = 0)
m:αm >γ m:αm >γ

= p(αm > γ|yi = 0) ≤ γ.


90 communication complexity

Similarly, the contribution of the terms for which β m > γ is at most


γ. All the remaining contribution comes from terms where both
αm , β m ≤ γ. These terms contribute at most
  
1 1
∑ p ( x i = 0 = y i | m ) =
2
+ γ
2
+ γ 14
Yao, 1983; Duris et al., 1987; Halsten-
m:α ,β ≤γ
m m berg and Reischuk, 1993; and Nisan and
1 Wigderson, 1993
≤ + 2γ.
4
Since the information about the number
So we must have that 4γ ≥ δ/16, as required. of rounds is lost once we move to
viewing a protocol as a partition into
rectangles, it seems hard to prove a
separation between a few rounds and
many rounds using the techniques
we have seen before. A protocol with
low communication will have a large
Lower bound for Number of Rounds rectangle, so we cannot bound the
size of rectangles to get a separation
between interactive protocols and
non-interactiveprotocols.
Are interactive protocols more powerful than protocols
that do not have much interaction?14 Here we show that a protocol 15
Actually we will prove that it is hard
with more rounds can have significantly less communication than a to compute any information about zk in
protocol with fewer rounds. at most k − 1 rounds of communication

Randomized Pointer-Chasing
In the k step pointer-chasing problem, the input is a directed graph
z0 z1
on the vertex set [2n], where every vertex has exactly one edge com-
ing out of it. Let 1 = z0 , z1 , z2 , . . . , zk be the path of length k starting
at the first vertex. However, Alice only knows all of the edges that
originate in the vertices A = {1, . . . , n}, and Bob knows all of the
edges that originate at the vertices B = {n + 1, . . . , 2n}. The goal of the z2 z7
parties is to output whether or not zk is even15 .
There is an obvious deterministic protocol that takes k rounds
and k log n bits of communication: in each step one of the players
announces z1 , z2 , . . . , zk . There is a randomized protocol with k − 1
rounds and O((k + n/k ) log n) bits of communication. In the first z6 z3
step, Alice and Bob use shared randomness to pick 10n/k vertices
in the graph and announce the edges that originate at these vertices.
z4
Alice and Bob then continue to use the deterministic protocol, but
do not communicate if one of the edges they need has already been
announced. In expectation, this protocol will have k + 1 − 10 rounds16 .
We shall prove that any randomized or deterministic protocol with
k − 1 rounds must have much more communication.
Let the graph be sampled uniformly, subject to the constraint that
z5
every edge from from either A → B or from B → A. Let m≤k−1
denote the first k − 1 messages of a protocol whose communication Figure 6.7: An example of an input to
complexity is `. The key idea here is quite similar to the lower bound pointer chasing, with n = 8, k = 7.
information 91

for the indexing problem. We will argue by induction that zk remains 16


It can be shown that this randomized
random even after conditioning on m<k . Suppose k is even. Then protocol will have < k rounds with
high probability. Indeed, the probability
intuitively, if Alice sends the message mk−1 , we will have shown by that none of the announced values
induction that zk−1 is close to random conditioned on m<k−1 , but help to save a round of communication
is exponentially small in k, as long
now mk−1 is independent of zk after fixing m<k−1 . So p(zk |m<k ) is as Ω(k ) of the values zi are distinct.
distributed like a random coordinate of y|m<k−1 , which is likely to For a uniformly random input, most
be close to uniform. On the other hand, if Bob sends mk−1 , then zk−1 of the zi ’s will be distinct with high
probability.
is again close to uniform conditioned on m<k−1 by induction, and
now zk−1 is independent of y, mk−1 , so p(zk |m<k ) is distributed like a
random coordinate of y|m<k , which is again close to uniform. Theorem 6.18 actually proves that
the communication is at least
Theorem 6.18. Any randomized k − 1 round protocol for the k-step Ω(n/k2 ) in the randomized set-
pointer chasing problem that is correct with probability 1/2 + e requires ting with k − 1 rounds.
p This is
e2 n because when k < 3 n/ log n,
( k −1)2
− k log n bits of communication. e2 n
− k log n = Ω(n/k2 ), and
4( k −1)2
p
Proof. The proof will proceed by induction. We shall show that when k ≥ 3 n/ log n, the communica-
zk remains close to uniformly random, even conditioned on the tion must be at least k which is again
Ω(n/k2 ).
messages that have been sent in the first k − 1 rounds. Initially, z1 is
uniform (when we do not condition on any of the messages).
Let rk denote the random variable m1 , . . . , mk , z1 , . . . , zk . We shall
prove by induction on k that on average
q over rk−1 , p(zk |rk−1 ) is e-
`+log n
close to uniform, with e ≤ (k − 1) n . Rearranging, this would
e2 n
imply that ` ≥ − k log n.
( k −1)2
The case when k = 1 is trivial. Suppose k ≥ 2 and k is even17 . Let 17
The proof is exactly the same when k
is odd.
yi denote the name of the vertex that Bob’s i’th edge points to. Since
rk−2 , mk−1 contains at most ` + k log n bits of information, Corollary
6.8 implies that if i is a uniformly random coordinate independent of
all other variables, then on average over i, rk−2 , mk−1 ,
e0 e0
p ( y i | r k −2 ) ≈ p ( y i ) ≈ p ( y i | m k −1 , r k −2 ),
q
`+k log n
where e0 = n . There are two cases to consider:
Bob sends the message mk−1 In this case, after fixing rk−2 , zk−1 is inde-
pendent of yi for every i. q
By induction, p(zk−1 |rk−2 ) is e-close to
`+k log n
uniform, with e = (k − 2) n . So on average over rk−1 , i,:
Fact: If i, j, a are independent, and
e e0 γ γ
p(i ) ≈ p( j), we have p( ai ) ≈ p( a j ). See
p ( z k | r k −1 ) = p ( y z k −1 | m k −1 , r k −2 ) ≈ p ( y i | m k −1 , r k −2 ) ≈ p ( y i ) .
the Conventions chapter of the book for
Alice sends the message mk−1 In this case, p(yi |rk−1 ) = p(yi |rk−2 ), since a proof.

after fixing rk−2 , yi is independent of mk−1 , zk−1 . So on average


over rk−1 , i:
e e0
p ( z k | r k −1 ) = p ( y z k −1 | r k −2 ) ≈ p ( y i | r k −2 ) ≈ p ( y i ) .
q
`+k log n
Both of these bounds imply that p(zk |rk−1 ) is (k − 1) n -close to
uniform, as required.
92 communication complexity

Stronger Bounds for Deterministic Protocols


Similar intuitions can be used to show that the deterministic com-
munication of the pointer-chasing problem is Ω(n) if fewer than k
rounds of communication are used.

Theorem 6.19. Any k − 1 round deterministic protocol that computes the


n
k-step pointer-chasing problem requires 16 − k bits of communication.

Proof. Consider any k − 1 round deterministic protocol with com-


n
munication complexity ` ≤ 16 − k, and let m1 , . . . , mk−1 denote the
messages of the protocol. Let ri denote z0 , z1 , . . . , zi , m1 , . . . , mi . Let p
denote the uniform distribution on inputs to the protocol. We shall
show by induction on i that there is a fixed value of ri such that

• z0 , z1 , . . . , zi are all distinct.


q
• p(zi+1 |ri ) is e-close to uniform, with e = 2 `+k ≤ 1/4.
n

• p(m≤i |z≤i ) ≥ 2−|m≤i |−i .

The first property applied to i = k − 1 shows that the protocol cannot


be correct, since rk−1 contains all the messages in the first k rounds,
and so p(zk |rk−1 ) cannot be close to uniform.
When i = 0, the claims are trivially satisfied. Now suppose i > 0 is 18
The proof is symmetric when i is odd.
even18 , so zi+1 = xzi . By induction, there exists a setting of ri−1 that
satisfies the given conditions. We only need to show that there exists
a setting of values for zi , mi to append to ri−1 to obtain the setting of
ri that we want. There are two cases:

Alice sends the i + 1’st message In this case, fixing ri−1 leaves mi and zi
independent. Pick mi by greedily setting each bit of mi in such a
way that the probability of that bit is maximized conditioned on
ri−1 and all previous bits. This ensures that

p(mi |ri−1 ) ≥ 2−|mi | .

To choose zi , define

B1 = {z0 , z1 , . . . , zi−1 }
( )
p ( x j | m i , r i −1 ) `+k
B2 = j : > 4·
p( x j ) n
 
p( Zi = j|ri−1 )
B3 = j : < 1/2
p( Zi = j|z≤i−1 )

We shall prove:

Claim 6.20. | B1 ∪ B2 ∪ B3 | < n.


information 93

Proof. Obviously, | B1 | ≤ k − 1 < n/16 − ` ≤ n/16.


We have | B3 | ≤ 2en ≤ n/2, or else we would have

p( Zi ∈ B3 |z≤i−1 ) − p( Zi ∈ B3 |ri−1 ) > 2e − e = e,

contradicting the fact that p(zi |ri−1 ) is e-close to uniform.


We shall prove that | B2 − B1 | ≤ n/4. Observe that:

`+k p ( x j | m i , r i −1 )
| B2 − B1 | · 4 ·
n
≤ ∑ p( x j )
j∈ B2 − B1

p ( x j | m ≤ i , z ≤ i −1 )
= ∑ p ( x j | z ≤ i −1 )
Since x j is independent of z≤i−1 for all
j∈ B2 − B1 j∈/ B1 .

p( x[n]− B1 |m≤i , z≤i−1 )


≤ . By Fact 6.4. Here x[n]− B1 denotes x
p( x[n]− B1 |z≤i−1 ) projected to the coordinates that are not
in B1 .
By the choice of mi , we have

p ( m ≤ i | z ≤ i −1 ) = p ( m i | r i −1 ) · p ( m ≤ i −1 | z ≤ i −1 )
≥ 2−|mi | · 2−|m≤i−1 |−i+1 ≥ 2−`−k ,

So we can apply Fact 6.3 to conclude that

p( x[n]− B1 |m≤i , z≤i−1 )


≤ ` + k,
p( x[n]− B1 |z≤i−1 )

giving that | B2 − B1 | ≤ n/4. Thus | B1 ∪ B2 ∪ B3 | < n/16 + n/2 +


n/4 = n.

Set zi to be an arbitrary element outside of B1 ∪ B2 ∪ B3 . This


completes the description of ri . Since zi ∈ / B1 , z0 , . . . , zi are distinct.
Since after fixing mi , ri , x is independent of y, the distribution of
( xzi |ri ) is the same as the distribution of p( xzi |mi ri−1 ). Thus it is
pq
k
2 · `+n -close to uniform by Pinsker’s inequality and the fact that
/ B3 .
zi ∈
Finally, we have:

p ( m ≤ i | z ≤ i ) = p ( m i | r i −1 ) · p ( m ≤ i −1 | z ≤ i )
p ( z i | m ≤ i −1 , z ≤ i −1 )
≥ 2−|mi | · · p ( m ≤ i −1 | z ≤ i −1 )
p ( z i | z ≤ i −1 )
p ( m ≤ i −1 | z ≤ i )
≥ 2−|m≤i |−i · (1/2) = 2−|m≤i |−(i+1) p ( m ≤ i −1 , z i | z ≤ i −1 )
=
p ( z i | z ≤ i −1 )
by the choice of mi , and the fact that zi ∈
/ B2 . p ( z i | m ≤ i −1 , z ≤ i −1 )
= · p ( m ≤ i −1 | z ≤ i −1 )
p ( z i | z ≤ i −1 )
94 communication complexity

Bob sends the i + 1’st message In this case, we pick zi first. Define the
sets:

B1 = {z0 , z1 , . . . , zi−1 }
( )
p ( x j | r i −1 ) `+k
B2 = j : > 4·
p( x j ) n
 
p( Zi = j|ri−1 )
B3 = j : < 1/2
p( Zi = j|z≤i−1 )

Analogous to Claim 6.20, we have

Claim 6.21. | B1 ∪ B2 ∪ B3 | < n.

Proof. | B1 | ≤ n/16 and | B3 | ≤ n/2, as proved in Claim 6.20.


We shall prove that | B2 − B1 | ≤ n/4. Observe that:

`+k p ( x j | r i −1 )
| B2 − B1 | · 4 ·
n
≤ ∑ p( x j )
j∈ B2 − B1

p ( x j | m ≤ i −1 , z ≤ i −1 )
= ∑ p ( x j | z ≤ i −1 )
Since x j is independent of z≤i−1 for all
j∈ B2 − B1 j∈/ B1 .

p( x[n]− B1 |m≤i−1 , z≤i−1 )


≤ By Fact 6.4.
p( x[n]− B1 |z≤i−1 )
≤ ` + k, Using p(m≤i−1 |z≤i−1 ) ≥ 2−`−k , and
Fact 6.3.
giving that | B2 − B1 | ≤ n/4. Thus | B1 ∪ B2 ∪ B3 | < n/16 + n/2 +
n/4 = n.

We let zi be an element that is not in B1 ∪ B2 ∪ B3 , and pick mi by


greedily setting each bit of mi in such a way that the probability of
that bit is maximized conditioned on ri−1 , zi and all previous bits.
Clearly, z0 , . . . , zi are all distinct.
q p( xzi |ri ) has the same distribution
k
as p( xzi |ri−1 ), which is 2 `+
n -close to uniform by Pinsker’s
inequality and the fact that zi ∈ / B2 .
Finally, we have

p ( m ≤ i | z ≤ i ) ≥ p ( m i | r i −1 , z i ) · p ( m ≤ i −1 | z ≤ i )
p ( z i | r ≤ i −1 )
≥ 2−|mi | · · p ( m ≤ i −1 | z ≤ i −1 )
p ( z i | z ≤ i −1 )
≥ 2−|m≤i |−(i+1) ,

as required.
information 95

Exercise 6.1
Show that for anys two joint distributions p( x, y), q( x, y) with same
support, we have
" # " #
p( x |y) p( x |y)
E ≤ E .
p(y) p( x ) p(y) q( x )

Exercise 6.2
Suppose n is odd, and x ∈ {0, 1}n is sampled uniformly at random
from the set of strings that have more 1’s than 0’s. Use Pinsker’s
inequality to show that the expected number of 1’s in x is at most

n/2 + O( n).

Exercise 6.3
Let X be a random variable supported on [n] and g : [n] → [n] be a Use the fact that α log α ≥
− log e
≥ −1,
e
function. Prove that for α > 0.

H ( X | g( X )) − 1
Pr[ X 6= g( X )] ≥ .
log n

Use this bound to show that if Alice has a uniformly random


vector y ∈ [n]n , and Bob has uniformly random input i ∈ [n],
and Alice sends Bob a message M with that contains ` bits, the
probability that Bob guesses yi is at most 1log
+`/n
n .

Exercise 6.4
Let G be a family of graphs on n vertices, such that every two
vertices in the graph share a clique on r vertices. Show that the
n Hint: Partition the graph into r parts
number of graphs in the family is at most 2( 2 ) /2r−1 . uniformly at random and throw away
all edges that do not stay within a part.
Analyse the entropy of the resulting
distribution on graphs from the family.

Anup Says: exercise using hellinger and


common information to prove lower
bounds on non-negative rank
7
Compressing Communication

Is there a way to define the information of a protocol


in analogy with Shannon’s definition of the entropy of a single
message? Extending Shannon’s ideas, we would like to measure
the information contained in all the messages of the protocol. In
this chapter, we explore how to do this for 2-party communication
protocols. Somewhat surprisingly, the definitions lead to several
results in communication complexity that do not concern information
at all, though these definitions were motivated by understanding
It remains open to define a similarly
combinatorial problems. useful notion of information for multi-
Suppose we are working in the distributional setting, where party protocols in the number on the
the inputs X, Y to Alice and Bob are sampled from some known forehead model.

distribution µ, and the protocol is randomized.

• Consider a protocol where all the messages of the protocol are 0,


not matter what the inputs are. Then messages of the protocol are
known ahead of time, and Alice and Bob may as well not send
them. These messages do not convey any information. The information is 0.

• Suppose X, Y are independent uniformly random n bit strings.


Alice sends X as her first message and Bob sends Y in response. In
this case, the protocol cannot be simulated with less than 2n bits. The information is 2n.

• Suppose X, Y are independent uniformly random n bit strings. In


the first message, Alice privately samples k uniformly random bits
and sends them to Bob, and follows this message by sending X.
This protocol can be simulated by a randomized communication
protocol with communication n: Alice and Bob can use shared
randomness to sample the k bit string, so that they do not have to
The information is n. Note that the
send it. entropy of the first message is k + n > n.
• Suppose X, Y are independent uniformly random n bit strings.
Alice uses private randomness to sample a uniformly random
subset T ⊆ [n] of size k. Alice sends n bits to Bob, where the i’th
bit is Xi if i ∈
/ T, and 1 − Xi otherwise. One can simulate this
98 communication complexity

protocol with less than n bits, at least in expectation. Alice and


Bob can use public randomness to sample a set S ⊆ [n] of size
k, as well a uniformly random n-bit string R. Alice computes the
number of coordinates i ∈ S such that Ri 6= Xi . If there are t such
coordinates, she samples a uniformly random subset T 0 ⊆ [n] − S
of size k − t. Setting Mi = Ri for all i ∈ S, Mi = 1 − Xi for all
i ∈ T 0 and Mi = Xi for all remaining i, Alice sends the n − k bits
of Mi that Bob does not already know. For any fixed value of X,
the string M is identically distributed to how it was in the original
The information is ≈ n − k. The entropy
protocol, so the simulation succeeds with communication n − k. of the message is n.
• Suppose X, Y are uniformly random strings that are always equal.
Suppose Alice sends X to Bob in the first message. Then this mes-
sage can be simulated with 0 communication, since Bob already The information is 0. The entropy of the
knows X. message is n.

These examples illustrate the difficulties with defining the in-


formation of communication protocols. Indeed, motivated by the
applications to communication complexity, there are two natural
There are many other definitions of
ways to define the information of a protocol. Let R denote the public information that allow for similar
randomness of the protocol, and let M denote the messages that intuitions. However, the definitions
used here are the most useful, because
result from executing the protocol. The external information1 of the
they allow us to reason about questions
protocol is defined to be in communication complexity that have
nothing to do with information.
I ( XY : M | R) .
1
Chakrabarti et al., 2001
The external information measures the amount of information about
the inputs that an external observer may learn about X, Y from mes- The measure is called external infor-
mation to contrast it with internal
sages and public randomness of the protocol. The second definition information, a concept that was defined
is called the internal information2 of the protocol. It is defined to be much later.

I ( X : M | YR) + I (Y : M | XR) . Use the chain rule for informa-


tion to prove that I ( XY : M | R) =
information learnt by Bob information learnt by Alice
I ( XY : MR).
The internal information measures the amount of information that
2
Barak et al., 2010
Alice and Bob learn about each other’s inputs from the messages and
public randomness of the protocol. Use the chain rule for informa-
The external information is never larger than the internal informa- tion to prove that I ( X : M | YR) +
I (Y : M | XR) = I ( X : MR | Y ) +
tion, and the two quantities are equal when X, Y are independent. To
I (Y : MR | X ).
see this, let us apply the chain rule to express the internal informa-
tion as: M<i denotes the first i − 1 bits of M.

I ( X : M | YR) + I (Y : M | XR)
= ∑ I ( X : Mi | YRM<i ) + I (Y : Mi | XRM<i ) ,
i

where here M1 , M2 , . . . are the bits of M. The definition of communi-


cation protocols ensures that the first i − 1 bits m<i determine whether
compressing communication 99

Alice or Bob sends the next bit of the protocol. If Alice sends the next
bit, then
I (Y : Mi | XRm<i ) = 0,
because Mi is determined by the variables XRm<i . Similarly, if Bob
sends the next bit in the protocol, then

I ( X : Mi | YRm<i ) = 0.

Moreover, if Alice sends the next bit, then by the chain rule, we have

I ( X : Mi | YRm<i )
≤ I ( X : Mi | YRm<i ) + I (Y : Mi | Rm<i )
= I ( XY : Mi | Rm<i ) ,
where the inequality is an equality when X, Y are independent of
each other, because in this case Y is independent of Mi after fixing
R, m<i . Similarly, if Bob sends the next bit, we have

I (Y : Mi | XRm<i ) ≤ I ( XY : Mi | Rm<i ) ,

and the inequality is an equality when X, Y are independent. Putting


all of these observations together, we get that the internal information
can be bounded

I ( X : M | YR) + I (Y : M | XR)
= ∑ I ( X : Mi | YRM<i ) + I (Y : Mi | XRM<i )
i
≤ ∑ I ( XY : Mi | RM<i )
i
= I ( XY : M | R) ,
so the internal information never exceeds the external information.
The two quantities are equal when X, Y are independent.
What we are really after is an analogy to Theorem 6.1—we want
to show that information characterizes communication. Such a state- The same argument proves that
I ( X : M | YR) is at most the expected
ment would be immensely useful, because the quantities defining number of bits sent by Alice in the
information are much easier to work with than communication protocol, and I (Y : M | XR) is at most
the expected number of bits sent by Bob
complexity. in the protocol.

Correlated Sampling

Suppose we are given a protocol whose communication complexity is


very large, but its internal information is very close to 0. In this case,
the protocol teaches Alice and Bob almost nothing about each others
inputs, so they should be able to simulate its execution without
communicating, and this is what we show here. We use the technique
of correlated sampling3 to achieve this:
100 communication complexity

3
9
p 8
7
1
5

r q 6
4

m
Figure 7.1: An example of the sampling
procedure. (m4 , ρ4 ) is selected in this
Lemma 7.1. There is a protocol for Alice and Bob, who are each given case.
distributions p(m), q(m) to use public randomness and no communication
in such a way that Alice samples M A distributed according to p(m), Bob
samples M B distributed according to q(m) and the probability that M A 6=
M B is at most 2| p − q|.
Proof. Alice and Bob will use public randomness to sample a se-
quence (m1 , ρ1 ), (m2 , ρ2 ), . . . , where mi is a uniformly random ele-
ment from the support of m, and ρ1 is uniformly random from [0, 1].
Alice will set m = mi , where i is the minimum number for which
ρi ≤ p(mi ). Similarly, Bob will set m0 = m j where j is the minimum
number such that ρ j < p(m j ).
The expected values of i and j are
Let r (m A , m B ) denote the joint distribution of the outputs of Alice proportional to the size of the universe,
and Bob. Let E denote the event that Alice sets i = 1. Then we claim so the time required to carry out this
procedure is also proportional to the
that r ( M A = m A | E) = p( M = m A ). Indeed, by the definition of the size of the universe.
process, we have r ( M A = m a |¬ E) = r ( M A = m A ). Since
Anup Says: make exercise
r ( M A = m A ) = r ( E)r ( M A = m A | E) + (1 − r ( E))r ( M A = m A |¬ E),

we have r ( M A = m A ) = r ( M A = m A | E).
compressing communication 101

This implies that r ( M A = m A ) = p( M = m A ), and r ( M B = m B ) =


q( M = m B ). Let B denote the event that either q(mi ) < ρi < p(mi )
or p(m j ) < ρ j < q(m j ). The event B must happen if M 6= M0 . Let F
denote the event that either i = 1 or j = 1. Then exactly as before, we
have r ( B|¬ F ) = r ( B), and so r ( B) = r ( B| F ). But we have

∑m | p(m) − q(m)|
r ( B| F ) =
∑m max{ p(m), q(m)}
∑m | p(m) − q(m)|

∑m p(m) + | p(m) − q(m)|
≤ ∑ | p(m) − q(m)|, since ∑m p(m) = 1
m

as required.

Compressing a Single Round of Communication

External Information
Suppose we would just like to compress the first message in a pro-
tocol down to its external information. If the message M is sent
by Alice, who has the input X, and Bob has the input Y, then the
external information can be expressed as

I ( XY : M) = I ( X : M) + I (Y : M | X )
= I (X : M) . since after fixing X, Y and M are
independent.
In analogy with Theorem 6.1, we prove that there is a way to simu-
late4 the sending of the message M using I ( X : M) + O(log I ( X : M )) 4
Harsha et al., 2007; and Braverman
bits of communication in expectation. The theorem follows from the and Garg, 2014

following stronger fact:

Theorem 7.2. Suppose Alice knows two distributions p, q, and Bob knows
q. There is a protocol for Alice and Bob to sample an element according to p
using !
p(m) p(m)
+ 2 log + O (1)
q(m) q(m)
bits of communication in expectation.

As a corollary, we get

Corollary 7.3. Alice and Bob can use public randomness to simulate sending
M with expected communication I ( X : M) + 2 log I ( X : M ) + O(1).
To prove the corollary, if r (m, x ) denotes
The protocol we use is inspired by the correlated sampling idea. the joint distribution of X, M, let
p(m) = r (m| x ) and q(m) = r (m).
The public random tape will consist of a sequence of samples Then Jensen’s inequality proves that
(m1 , ρ1 ), (m2 , ρ2 ), . . . , where each mi is a uniformly random element the expected communication of the
resulting protocol is at most I ( X : M ) +
from the support of m, and ρi is a uniformly random number from
2 log I ( X : M ) + O(1).
[0, 1].
102 communication complexity

5
2
p ST 7

9
1 6

r q 8
4

m
Figure 7.2: The sampling procedure
of Theorem 7.2. Here T is 3 and the
Given this public randomness, Alice finds the minimum index sampled point is the 3’rd point of ST .
r such that p( M = mr ) ≥ ρr . The value mr has exactly the right
distribution. Unfortunately, communicating r can be too expensive,
What is the expected communication
so Alice cannot simply send r to Bob. Instead, Alice computes the complexity of sending r?
positive integer  
ρr
T= ,
q ( M = mr )
and sends T to Bob. Given T, Alice and Bob both compute the set
( & ')
ρj
ST = j : T = .
q( M = m j )

Alice sends Bob the number K for which r is the K’th smallest ele-
ment of ST .
We have already shown in Section 7 that the sample mr has the
right distribution. To analyze the expected communication of the
protocol, we need two basic claims. The first claim, whose proof
we sketch in the margin, is used to encode the integers sent in the
protocol.
compressing communication 103

Claim 7.4. One can encode all positive integers in such a way that at most
log z + 2 log log z + O(1) bits are used to encode the integer z.
Proof of Claim 7.4: A naive encoding
To argue that the expected length of T is small, we need the follow- would have Alice send a bit to indicate
ing claim: whether there is another bit left to send
in the encoding, and then send the bit
Claim 7.5. For any two distribution p(m), q(m), the contribution of the of data. This would take 2dlog ze + O(1)
bits. To get a better bound, first send
terms with p(m) < q(m) to the divergence is at least −1: the integer dlog ze using the naive
encoding, and then send dlog ze more
p(m)
∑ p(m) log
q(m)
> −1. bits to encode z.
m:p(m)<q(m)

Now, to bound the expected number of bits required to transmit T, Proof of Claim 7.5: Let E denote the
observe that by Claim 7.4, this is at most subset of m’s for which p(m) < q(m).
Then we have

E [log T + 2 log log T + O(1)] ≤ E [log T ] + 2 log E [log T ] + O(1), ∑ p(m) log
p(m)
m:p(m)<q(m)
q(m)
where the inequality follows from Jensen’s inequality. By Claim 7.5, p(m)
we can bound = ∑ p(m) log
q(m)
m∈ E
 
p(m) q(m)
E [log T ] ≤ ∑ p(m) log ≥ − p( E) · ∑ p(m| E) log
p(m)
m q(m) m∈ E
  q(m)
p(m) ≥ − p( E) · log ∑ p(m| E)
≤ ∑ p(m) log q(m) + 1 m∈ E
p(m)
m:p(m)>q(m) q( E)
= − p( E) · log
p( E)
p(m) p(m)

q(m)
− ∑ p(m) log
q(m)
+1 ≥ p( E) · log p( E).
m:p(m)<q(m)
For 0 ≤ x ≤ 1, x log x is maximized
p(m) when its derivative is 0: log e + log x = 0.
≤ + 2, So the maximum is attained at x = 1/e,
q(m) − log e
proving that p( E) log p( E) ≥ e > 1.

and so the expected number of bits used to transmit T is at most


!
p(m) p(m)
+ 2 log + O (1).
q(m) q(m)

It only remains to bound the number of bits required to transmit


K. Since Jensen’s inequality proves that E [log K ] ≤ log E [K ], we
shall start by bounding E [K ]. Consider the event A, defined to be
p( M = m1 ) ≥ ρ1 . When A happens, K = 1. Conditioned on the
event that A does not happen, T is independent of (m1 , ρ1 ). Define
the random variable

1 if 1 ∈ S ,
T
Z=
0 otherwise.
Pr[ A] + Pr[¬ A] · E [ Z |¬ A]
E [K ] =
1 − Pr[¬ A]
Then we have
Pr[ A] + Pr[¬ A] · E [ Z |¬ A]
=
Pr[ A]
E [K ] = Pr[ A] + Pr[¬ A](E [K ] + E [ Z |¬ A])
Pr[¬ A] · E [ Z |¬ A]
Pr[¬ A] · E [ Z |¬ A] = 1+ .
Pr[ A]
⇒ E [K ] = 1 + .
Pr[ A]
104 communication complexity

Suppose the space of all m’s is of size u. Then we can compute

Pr[ A] = (1/u) ∑ p(m) = 1/u,


m

and

E [ Z |¬ A] = Pr[i ∈ ST |¬ A]
(1/u) ∑m T p(m) − ( T − 1) p(m)

(1/u) ∑m (1 − p(m))
(1/u) ∑m p(m)

(1/u) ∑m (1 − p(m))
1/u 1
= = .
(1/u)(u − 1) u−1

Thus we get

(1 − 1/u)/(u − 1)
E [K ] ≤ 1 + = 2.
1/u
So the expected number of bits required to transmit K is a constant.

Internal Information
Now suppose we wish to compress a single message sent from Alice
to Bob down to its internal information. This is strictly harder than
the problem for external information—when Y is a constant, the two
problems are the same.

Theorem 7.6. Suppose Alice knows two distributions p, and Bob knows q.
For every e, there is a protocol for Alice to sample an element according to
the distribution p while communicating an expected
v
u
p(m) u p(m)
+t + log(1/e)
q(m) q(m)

bits, such that Bob also computes the same sample, except with probability e.

As a corollary, we get

Corollary 7.7. Alice and Bob can use public randomness to simulate
p
sending M with expected communication I ( X : M | Y ) + I ( X : M | Y ) +
log(1/e).
To prove the corollary, if r ( x, y, m) de-
notes the joint distribution of X, Y, M,
We shall use very similar ideas to obtain a protocol as in the previ- let p(m) = r (m| x ) = r (m| xy) and
ous section. However, our simulating protocol will be interactive, and q(m) = r (m|y). Then Jensen’s inequality
proves that the expected communica-
there will be a small possibility of committing an error. tion of the resulting
p protocol is at most
As in the previous section, Alice and Bob will use public random- I ( X : M | Y ) + I ( X : M | Y ).
ness to sample a sequence of points (m1 , ρ1 ), (m2 , ρ2 ), . . . , where each
compressing communication 105

5
2
8

p 9
7

Q4 4
Q3
r q Q2
1
6

m
Figure 7.3: Sampling from p when the
sender knows only one distribution.
mi is a uniformly random element of the support, and ρi is a uni-
formly random number in [0, 1]. As before, Alice picks the smallest
index r such that p( M = mr ) > ρr . In analogy with thel idea form
ρ
external compression, we would really like to compute q( M=r m )
r
with small communication. Unfortunately, Alice does not know q, so
she cannot compute this ratio without interacting with Bob. Instead,
Alice and Bob will try to guess the ratio. To do so, they will graduate
increase a threshhold T until it is larger than this ratio. They will
then use hashing to find r.
For each index i, let h(i ) = h(i )1 , h(i )2 , . . . be an infinite sequence
of uniformly random bits, sampled publicly. h(i ) is a hash function
that Alice and Bob will try to use to quickly agree on the value of r.
The protocol will proceed in rounds. In round k, Alice and Bob set
2
T = 2k , and Bob computes the set
( & ')
ρj
QT = j : T ≥ .
q( M = m j )

For a parameters αk , β k , Alice will send Bob all the bits of h(r )≤αk
106 communication complexity

that she has not already sent him. For each i = 1, 2, . . . , T, and
j = 1, 2, . . . , αk , Bob will compute the value

g(i, j) = min{` ∈ Qi : h(`)≤ j = h(r )≤ j }.

g(i, j) is Bob’s best guess for the index of Qi that is consistent with
the first j bits of h(r ) that he sees. If there is any index s ≤ k such that
2 2
g(2s , αk ) = g(2s , αk − β k ), then Bob stops the protocol and outputs
2
g(2s , αk ) for the smallest such index s. If there is no such index, Bob
sends Alice a bit to indicate that the protocol should continue, and
the parties begin the next round.
Intuitively, if k is large enough so that Q T contains r, then all
indices of Q T that are less than r will eventually become inconsistent
with h(r ). If T is smaller, then the probability that any index will
remain consistent with the hashes for β k steps is small.
First, let us analyze the probability that the protocol makes
an error. The protocol outputs g2s2 ,α − β 6= r only if we have
k k
g2s2 ,α − β = g2s2 ,α . The probability of this event is at most 2− β k .
k k k
Thus the probability of an error is at most

∞ k ∞
∑ ∑ 2− β k = ∑ k · 2− β k
k =1 i =1 k =1

To analyze the expected communication of the lprotocol,mlet kr be


2
the smallest non-negative integer such that 2kr ≥ q( M=r m ) . kr is the
ρ
r
round during which Alice’s point is included in Q T . Let γ` denote
2
the probability that g(2kr , `) 6= r. Let E denote the event that r = 1,
and E0 denote the event that 1 ∈ Q k2r
2

γ` = Pr[ E] + Pr[¬ E∧]

Compressing Entire Protocols with Low Internal Information


5
Barak et al., 2010
In this section, we describe how to compress any protocol with low
internal information5 . Suppose we are given inputs X, Y sampled
according to some known distribution, and a protocol π with pub-
lic randomness R and messages M. Suppose the communication
complexity of the protocol is C and its internal information is

I = I ( X : M | YR) + I (Y : M | XR) .

We shall prove:

Theorem 7.8. One can simuate any such protocol π with communication

complexity O( I · C · log C ).
compressing communication 107

The idea for the proof is quite striaghtforward. Alice and Bob use
correlated sampling to repeatedly guess the bits of the messages in
the protocol, without communicating. Then, they communicate a few
bits to fix the errors in the transmissions.
First observe that without loss of generality, we can assume that
there is no public randomness in the protocol we are simulating. This
is because for each fixing of the public randomness R = r, if the
internal information cost is Ir , and we obtain a simulating protocol

with communication Ir · C log C, then the expected number of bits
communicated for average r is
hp i r √
E Ir · C · log C ≤ E [ Ir ] · C · log C = I · C · log C. by convexity
p (r ) p (r )

To carry out the simulation, we use correlated sampling, but since


we will be sampling bits and not elements of a large universe, the
sampling procedure is particularly simple. Let ρ1 , . . . , ρC be random
numbers from the interval [0, 1]. For each prefix m<i of messages,
define the number

γ(m<i ) = p( Mi = 1| xym<i ).
Input: Alice knows x ∈ X , Bob
These numbers define the correct m that our simulation protocol will knows y ∈ Y .
attempt to compute. To define the correct m, for each i, set mi = 1 if Output: m distributed according
to p(m| xy).
ρi < γ(m<i ), and set mi = 0 otherwise. The correct m has exactly the
right distribution—the probability that m is correct is Alice and Bob use public
randomness to sample
C C ρ1 , . . . , ρC ∈ [0, 1] uniformly and
independently;
∏ γ(m<i )mi (1 − γ(m<i ))1−mi = ∏ p(mi |xym<i ) = p(m|xy). Set j = 0;
i =1 i =1
while j ≤ C do
for i = j + 1, . . . , C do
Although Alice and Bob cannot compute γ(m<i ) without commu-
If ρi < γ A (m<i ), set
nicating, they can compute the numbers: miA = 1, otherwise set
miA = 0;
γ A (m<i ) = p( Mi = 1| xm<i ) and γ B (m<i ) = p( Mi = 1|ym<i ). If ρi < γ B (m<i ), set
miB = 1, otherwise set
miB = 0;
Moreover, if it is Alice’s turn to speak, then γ A (m<i ) = γ(m<i ), and
end
if it is Bob’s turn to speak, then γ B (m<i ) = γ(m<i ), so: if m A = m B then
Set i = C + 1;
Claim 7.9. Either γ(m<i ) = γ A (m<i ), or γ(m<i ) = γ B (m<i ). else
Set i to be the smallest
So, Alice and Bob use these numbers to try and guess the correct number such that
miA 6= miB ;
m. Alice computes m A by setting miA = 1 if and only if ρi < γ A (m<i ), If Alice was to send the
and Bob computes m B by setting miB = 1 if and only if ρi < γ B (m<i ). A , set
i’th bit after m< i
B A
Of course, m A and m B are likely to be quite different. However, mi = mi , otherwise set
miA = miB ;
by Claim 7.9, if they are the same, then they must both be equal end
to m. To compute m, Alice and Bob communicate to find the first end
index j where m jA 6= m Bj . Using the results of Exercise 3.1, this takes Figure 7.4: Compressing protocols to
O(log C/e) communication, if the probability of making an error is e. their internal information.
108 communication complexity

A dictates that Alice was supposed to send the j’th bit, then Bob
If m< j
sets m Bj = m jA , otherwise Alice sets m jA = m Bj . The two parties then
use ρ j+1 , . . . , ρC to recompute m A , m B . They repeat this procedure
until m A = m B = m.
To analyze the correctness of the simulation, we need to argue that
Alice and Bob can find m with small communication. To prove that
the communication complexity of the protocol is small, we appeal
to Pinsker’s inequality. We say that the protocol made a mistake at
i if during its execution, miA was found to be not equal to miB . This
happens exactly when ρi lies in between the numbers γ A (m<i ) and
γ B (m<i , so given that m is sampled by the protocol, the probability
that there is a mistake at i is at most
h i
E |γ A (m<i ) − γ B (m<i )|
p( xym)

= E [| p(mi = 1| xm<i ) − p(mi = 1|ym<i )|] .


p( xym)

Now for each fixing of m<i , if the i’th message is supposed to be sent
by Alice, we have

E [| p(mi = 1| xm<i ) − p(mi = 1|ym<i |]


p( xy|m)

= E [| p(mi = 1| xym<i ) − p(mi = 1|ym<i |]


p( xy|m)
q
≤ I ( X : Mi | Ym<i ), by Corollary 6.7

and if the i’the bit was to be send by Bob, then we have

E [| p(mi = 1| xm<i ) − p(mi = 1|ym<i |]


p( xy|m)

= E [| p(mi = 1| xm<i ) − p(mi = 1| xym<i |]


p( xy|m)
q
≤ I (Y : Mi | Xm<i ).

In either case, we get that expected number of mistakes is at most


C q
∑ I ( X : Mi | YM<i ) + I (Y : Mi | XM<i )
i =1
v
u C
√ u
≤ C·t ∑ I (X : Mi | YM<i ) + I (Y : Mi | XM<i ) by the Cauchy-Schwartz inequality
i =1
√ q
= C· I ( X : M | Y ) + I (Y : M | X ) . by the chain rule

Setting e = 1/C2 , the communication of the protocol is O( C · I log C )
in expectation. By Markov’s inequality, the probability that the com-
munication exceeds 10 times this number is at most 1/10, so we
obtain a protocol with small communication overall.
compressing communication 109

m mA mB

4
Figure 7.5: Finding the correct path. In
this case the correct path is obtained
after 3 mistakes have been fixed.
110 communication complexity

Lower bounds from Compression Theorems

Round Lower bounds for Tree Pointer-Chasing


Even though correlated sampling deals with the regime where the
information is extremely small, it is still quite useful to prove lower
bounds on communication. We give an important example in this
section. The tree pointer-chasing problem is a variant of the pointer- 6
Nisan and Wigderson, 1993; and
chasing problem6 that is useful to study for applications. In the Klauck et al., 2007
version we study here, Alice and Bob are allowed to have a lot of
common information, which makes proving the lower bound both For example, it will be used to prove
harder and more fruitful. lower bounds for data structures
Let T a,b denote a rooted tree of depth k, where every vertex at even solving the predecessor search problem.

depth has a children, and every vertex at odd depth has b children.
Let E denote a subset of the edges of the tree, such that every vertex
is connected to exactly one of its children in E. The goal of Alice and
Bob is to compute the unique leaf in T a,b that remains connected to
the root of the tree. However, Alice only knows a subset E A ⊆ E of
Including the edges in the left subtrees
the edges, and Bob only knows a subset EB ⊆ E. E A is promised to corresponds to knowing x1 , . . . , xi−1 in
contain all the edges of E at even depth, and moreover, if a vertex v the indexing problem.
at odd depth is to the left of a sibling that is picked by its parent in
E A , then all edges of E in the subtree rooted at v are included in E A .
Similarly, EB contains all edges of E at odd depth, and in addition,
if a vertex v at even depth is to the left of a sibling that is picked by
its parent in EB , then all edges of E in the subtree rooted at v are
included in EB . A natural hard distribution for this problem is when
The inputs to Alice and Bob are corre-
the edges are sampled uniformly and independently. See Figure 7.6 lated, even though the edges sampled
for an example. are independent.

Theorem 7.10. Let M denote the messages in a deterministic k − 1 round


protocol where Alice sends the first message and Alice sends a0 bits in each
round, Bob sends b0 bits in each round. Let X denotes Alice’s input, Y
denotes Bob’s input and L1 , . . . , Lk denote the vertices on the unique path
from the root to a leaf in the tree pointer chasing problem on T a,b . Then we
must have
r r !
b0 ln 2 a0 ln 2
E [| p(lk |l<k ) − p(lk |l<k m)|] ≤ (k − 1) + .
p(m,l<k ) 2a 2b

Theorem 7.10 shows that any protocol that has few rounds of By symmetry, the same result holds
when Bob sends the first message of the
communication must make an error:
protocol but Alice knows the first edge
of the tree.
Corollary 7.11. Any randomized k − 1 round protocol where Alice sends the
first message and Alice sends a0 bits in each round, Bob sends b0 bits in each
round must make an error with probability
r r !
1 b0 ln 2 a0 ln 2
+ ( k − 1) + ,
2 2a 2b
compressing communication 111

2 EB
2 EA
2 E A \ EB

Figure 7.6: An example of a tree pointer


chasing input with k = 4, a = 2, b = 3.
when solving tree pointer chasing on a tree T a,b of depth k.

Proof of Theorem 7.10. Suppose the edges E are sampled uniformly at


random, and the edges E A , EB are given to Alice and Bob. We shall
prove the theorem by induction on k. When k = 1, the theorem is
trivially true, since there is no communication.
Suppose k > 1. Let Xi denote the edges of the tree at even depth,
in the subtree rooted at the i’th child of the root, and let Yi denote
the edges at odd depth in the same subtree. Let M1 denote the first
message of the protocol, sent by Alice. We shall prove
 
E | p( x`1 y`1 ) − p( x`1 y`1 |m1 `1 x<`1 y>`1 )| ≤ e, (7.1)
p( x<` y>` m1 `1 )
1 1
q
a ln 2 0
with e = 2b . After the first message has been transmitted, and
we have fixed `1 , x<`1 , one can think of the rest of the protocol as a
k − 2 round protocol where Bob speaks first. If x`1 , y`1 were truly
uniform after this conditioning, induction would give
" #
E E [| p(`k |`<k ) − p(`k |m`<k )|]
p( x<` y>` m1 `1 ) p(`<k m)
1 1
r r !
b0 ln 2 a0 ln 2
< ( k − 2) + .
2a 2b

However, since the distribution on x`1 , y`1 is only e-close to uniform,


the error has an additional term of e. This proves the final statement.
112 communication complexity

It only remains to prove (7.1). By Lemma 6.15, if X = X1 , . . . , Xb


and Y = Y1 , . . . , Yb , we have
b
a0
(1/b) ∑ I ( Xi : M1 | X<i Y≥i ) ≤ I ( X : M1 | Y ) ≤ .
i =1
b

By Pinsker’s inequality (Corollary 6.7), we get that on average over


the choice of i, x<i , y≥i , m1 ,
e
p ( x i | x <i y ≥i m1 ) ≈ p ( x i | x <i y ≥i ), (7.2)
q
a0 ln 2
with e ≤ 2b . Since m1 is computed by Alice, and Alice’s inputs
are independent of `1 , we have p(y`1 | x<`1 y>`1 `1 m1 ) = p(y`1 ). Thus
we get that
 
E | p( x`1 y`1 ) − p( x`1 y`1 |m1 `1 x<`1 y>`1 )|
p( x<` y>` m1 `1 )
1 1
 
≤ E | p( x`1 ) − p( x`1 |m1 `1 x<`1 y≥`1 )| ≤ e,
p( x<` y≥` m1 `1 )
1 1

as required.

Round Lower Bound for Greater Than


Suppose Alice and Bob each have n-bit numbers, and want to know
which of their numbers is greater. We have already shown that the
deterministic communication complexity of this problem is n + 1,
and argued that the randomized communication complexity is
Θ(log n). Here we study the randomized communication complexity
of bounded round protocols for this problem.

Theorem 7.12. Any k round protocol for computing the greater than
function on x, y ∈ [2n ] must transmit at least Ω(n1/k /k2 ) bits in some
round.

Proof. We prove the theorem by appealing to the lower bound for the
tree pointer chasing problem (Theorem 7.10). Say we have an input
to the tree pointer chasing problem on a tree T a,a of depth k. We set
n = ak .
We show how Alice and Bob can transform the tree into inputs
x ∈ {0, 1, . . . , 2n − 1} and y ∈ {0, 1, . . . , 2n − 1} for the greater-than
problem, without communicating. The numbers x, y are best thought
of as ak−1 -digit numbers written in base a. The transformation will
guarantee that the greater than function will reveal undue informa-
tion about the identity of the last leaf in the tree.
We describe how to carry out the reduction. If the tree is of depth
1, with the edge coming out of the root in Alice’s input, we set x ∈
{0, 1, . . . , a − 1}, and y = d a/2e. If the edge from the root is in Bob’s
compressing communication 113

x= y0 x1 x2

y= y0 y1 y2

x, y

x0 , y0 x1 , y1 x2 , y2

Figure 7.7: Using the subtrees to


combine several greater-than instances
114 communication complexity

input, we set y based on the edge, and set x = d a/2e. Viewing x, y as


numbers written in base a, these are single digit numbers.
If the tree is of depth k > 1, with Alice knowing the edge coming
out of the root, we first compute x0 , . . . , x a−1 , y0 , . . . , y a−1 , which are
the inputs to predecessor search determined by the a subtrees of
depth k − 1 that lie just below the root. By induction, these correspond
to numbers with ak−2 digits. Suppose i ∈ {0, 1, . . . , a − 1} corresponds
to the edge coming out of the root. In words, we shall view x, y as
a-digit numbers with ak−1 digits. The digits of x, y can be thought of
as broken up into a consecutive blocks, each with ak−2 digits. For all
j, we set the j’th block of y to be y j . We set the first i − 1 blocks of x
to be the same as y. For j ≥ i, we set the j’th block of x to be x j . Since
Alice knows all the edges in the first i − 1 subtrees, Alice can compute
x, and Bob can compute y.
This construction has the property that if `k is the k’th leaf, and the
last edge on the root to leaf path is visible only to Alice, then x > y
if and only if this last leaf is the r’th child of its parent, with r > a/2.
Similarly, if this last edge is visible to Bob, then y > xi f and ony if
this last edge is greater than a/2. By Theorem 7.10, the protocol can
succeed only if the number of bits communicated in each round is
Ω( a/k2 ). Since a = n1/k , the proves the required bound.
Part II

Applications
8
Circuits, Branching Programs

Although communication complexity studies the amount


of communication needed between two parties that are far apart,
it has had quite an impact in understanding many other concrete
computational models and discrete systems. In this chapter, we
illustrate some of the salient work in this direction by focussing on
It also makes sense to consider circuits
two computational models: boolean circuits and branching programs. where every gate has fan-in 2 and
We will also give some applications in the study of proof complexity. computes an arbitrary function of its
inputs. This only changes the size and
depth of the circuit by a constant factor,
Boolean Circuits since any function on 2 bits can be
computed by a small circuit using only
AND and OR gates.

A boolean circuit is a directed acyclic graph whose vertices


(called gates) are associated with boolean operators or input variables. x1 x2 x3
Every gate with in-degree 0 corresponds to an input variable or ^
its negation, and all other gates compute either the logical AND
(denoted ∧) or the OR (denoted ∨) of the inputs that feed into them. ^ ^

Usually the fan-in of the gates is restricted to being at most 2, and


in the rest of the discussion we adopt the convention that the fan-in _ _

of each gate is at most 2 unless we explicitly state otherwise. The


circuit computes a function f : {0, 1}n → {0, 1} if some gate in the ^ ^ ^ ^
circuit evaluates to f . A formula is a circuit whose underlying graph
is a tree. The size of a circuit is the number of gates, and the depth x1 ¬ x1 x2 ¬ x2 x3 ¬ x3
of the circuit is the length of the longest path in the graph. When
the circuit does not use any negated variables, it is called a monotone Figure 8.1: A circuit computing the
circuit. parity x1 ⊕ x2 ⊕ x3 .

It is well known that every (monotone) function f : {0, 1}n → The size of the circuit captures the total
{0, 1} can be computed by a (monotone) circuit of depth n and size at number of basic operations needed
to evaluate it. The depth captures the
most O(2n /n). number of parallel steps needed to
The importance of understanding boolean circuits stems from the evaluate it: if a circuit has size s and
depth d, then it can be evauated by s
fact that they are a universal model of computation. Any function
processors in d time steps.
that can be computed by an algorithm in T (n) steps can also be com-
118 communication complexity

puted by circuits of size Õ( T (n)). Thus to prove lower bounds on the
time complexity of algorithms, it is enough to prove that there are no
A super-polynomial lower bound on
small circuits that can carry out the computation. However, we know the circuit size of an NP problem would
of no explicit function (even outside NP) for which we can prove a imply that P 6= NP, resolving the most
famous open problem in computer
super-linear lower bound, highlighting the difficulty in proving lower
science.
bounds on algorithms. In contrast, counting arguments imply that
almost every function requires circuits of exponential size. The number of circuits of size s can by
bounded by 2O(s log s) , while the number
n
of functions f is 22 , so if s  2n /n, one
cannot hope to compute every function
Karchmer-Wigderson Games with a circuit of size s.

Every boolean function defines a communication problem via 1


Karchmer and Wigderson, 1990
its Karchmer-Wigderson game1 . The game defined by f : {0, 1}n →
{0, 1} is the communication problem where Alice gets x ∈ f −1 (0),
and Bob gets y ∈ f −1 (1). The goal of Alice and Bob is to compute an
index i such that xi 6= yi . When f is monotone (namely f (y) ≥ f ( x )
whenever x ≥ y coordinate by coordinate), one can define the
monotone Karchmer-Wigderson game to be the problem where the
inputs are x, y as before, but now the players are required to output i
such that xi < yi .
If there is a circuit computing f of depth d, then the cost of the
associated game is at most d. If f is computed as f = g ∧ h, then
either g( x ) = 0 or h( x ) = 0, while g(y) = h(y) = 1. Alice can
announce whether g( x ) or h( x ) is 0, and the parties can continue the
protocol using g or h. Similarly if f = g ∨ h, then either g(y) = 1 or
h(y) = 1, while g( x ) = h( x ) = 0. Bob can announce whether g( x ) = 1
or h( x ) = 1. The parties then continue with either g or h. After at
most d steps, the parties will have identified an index i for which
xi 6= yi . When the circuit is monotone (has no negations), the above
simulation finds an index i such that xi = 0, yi = 1.
Indeed every AND gate corresponds to a node in the protocol tree
where Alice speaks, and every OR gate corresponds to a node where
Bob speaks. Moreover, a circuit of size s gives a protocol which has at
most s vertices. Lemma 8.1 proves that any protocol
that solves the game on the input sets
Conversely, we have:
A, B can be viewed as solving the game
on sets A0 , B0 such that A ⊆ A0 , B ⊆ B0 ,
Lemma 8.1. If A, B ⊆ {0, 1}n are disjoint non-empty sets and π is a and A0 , B0 partition {0, 1}n .
protocol that solves the (monotone) Karchmer-Wigderson game on A, B,
such that every node of π is reachable by some input in A × B. Then
there is a (monotone) boolean function f : {0, 1}n → {0, 1} such that
f ( A) = 0, f ( B) = 1, and a circuit whose underlying graph can be obtained
by replacing every node of π where Alice speaks with an AND gate, every
node where Bob speaks with an OR gate and every output node with the
corresponding variable or its negation.
circuits, branching programs 119

Proof. We prove the lemma by induction on the number of nodes in


the protocol.
When the protocol has only one node with output i, we must
have that xi 6= yi (or xi = 0, yi = 1 in the monotone case) for every
x ∈ A, y ∈ B. Thus setting f to be the i’th variable or its negation
works.
For protocols with more nodes, suppose without loss of generality
that Alice speaks first. Then her message partitions the set A into
two disjoint sets A = A0 ∪ A1 . Both sets must be non-empty, since
every node in the protocol is reachable by assumption. By induction
the two children of the root correspond to boolean functions f 0 and
f 1 . Consider the circuit that takes the AND of the two gates obtained
inductively, and denote the function it computes by f . Then for all
y ∈ B, f (y) = f 0 (y) ∧ f 1 (y) = 1 ∧ 1 = 1. For all x ∈ A, either x ∈ A0
or x ∈ A1 . In either case f ( x ) = f 0 ( x ) ∧ f 1 ( x ) = 0.

One immediate consequence of Lemma 8.1 is regarding the circuit


depth required to compute the majority and parity functions. In
Section 1 and Exercise ??, we proved that solving the Karchmer-
Wigderson games for these functions requires at least 2 log n − O(1)
bits of communication. This shows that if the fan-in is at most 2, both
of these functions requires circuits of depth 2 log n − O(1).

Karchmer-Wigderson Games in few Rounds

Protocols that solve Karchmer-Wigderson games in a few rounds


have a particularly nice structure. Suppose we have a game with
input sets X , Y ⊆ {0, 1}n .
If the game can be solved without communication, then the output
i must have the feature that xi 6= yi . The only way this can happen is
if there is a b ∈ {0, 1} such that X = { x : xi = b}, Y = {y : yi 6= b}.
A restriction of the input is a string ρ ∈ {0, 1, ∗}. Its size |ρ| = |{i :
ρi 6= ∗}| is the number of coordinates which it restricts, namely the
number of coordinates that are 0 or 1. An input x ∈ {0, 1}n can be
thought of as a restriction of size n. Given two restrictions ρ, α, we Sr X
write ρ k α if the two are consistent: for every i ∈ [n], either αi = ρi ,
or αi = ∗, or ρi = ∗. We denote
Y
Sρ = { x : ρ k x }

If the game can be solved with 1 round of communication, we can


show:

Lemma 8.2. If the Karchmer-Wigderson game can be solved with 1 round


of communication, where Alice speaks first, then there is ρ ∈ {0, 1, ∗} such
that Y ⊆ Sρ , and X is disjoint from Sρ . Figure 8.2: Lemma 8.2
120 communication complexity

Proof. Let a ∈ Y be arbitrary, and define the restriction ρ by



 a if i can be output by the protocol,
i
ρi =
∗ otherwise.

We claim that Y = {y : ρ k y}. Indeed, if y k ρ, then it cannot be


in X , otherwise the protocol must make an error on inputs (y, a). So
it must be in Y . Conversely, if y ∦ ρ, because say yi 6= ai and ρi 6= ∗,
then when the protocol outputs i it must make an error either when
Bob has a as input or Bob has y as input.

If the game can be solved with a 2-round protocol where Alice


speaks first we can show:

Lemma 8.3. If the Karchmer-Wigderson game can be solved with 2 rounds Y


of communication that begins with Alice sending a k-bit message, then there
is a set of restrictions R with | R| ≤ 2k , such that X

X ⊆ ∪ ρ ∈ R Sρ , Y is disjoint ∪ρ∈ R Sρ .

Proof. For each message m that Alice sends, let Xm denote the set of
inputs that are consistent with that message. Then by Lemma 8.2, we
get that there is a restriction ρm such that Xm ⊆ Sρm but Y is disjoint
from Sρm . Since every x ∈ X is consistent with some message m, the Sr

sets Sρ obtained in this way must cover X . Figure 8.3: Lemma 8.3

Lowerbounds on the Depth of Monotone Circuits

One of topics we do not yet understand in circuit complexity is


whether polynomial sized circuits of small depth are strictly weaker
We do know, via counting arguments,
than polynomial circuits of larger depth. that there is a constant e such that the
set of functions computable by size
Open Problem 8.4. Can every function that is computable using circuits of s log s circuits is strictly larger than the
size polynomial in n be computed by circuits of depth O(log n)? set of functions computable by size
es circuits. Similarly, we know that
However, we do know how to prove interesting results when the circuits of depth d compute a bigger set
of functions than those computable in
underlying circuits are monotone. depth ed.

Matching
One of the most well studied combinatorial problems is the problem
of finding the largest matching in a graph. A matching is a set of dis-
joint edges. Today, we know of several polynomial time algorithms
2
Kleinberg and Tardos, 2006
that can find the matching of largest size in a given graph2 . This
translates to polynomial sized circuits for computing whether or not
a graph has a matching of any given size.
circuits, branching programs 121

Given a graph G on n vertices, define



1 if G has a matching of size at least n/3 + 1,
Match( G ) =
0 otherwise.

Since there are polynomial time algorithms for finding matchings,


one can obtain polynomial sized circuits that compute Match. How-
ever, we do not know of any logarithmic depth circuits that compute
Match, and here we show that there are no such circuits that are also 3
Raz and Wigderson, 1992
monotone3 .
It is enough to prove a lower bound on the communication com- Recall Lemma 8.1.
plexity of corresponding monotone Karchmer-Wigderson game. In
the game, Alice gets a graph G which has a matching of size n/3 + 1
and Bob gets a graph H that does not have a matching of size n/3 + 1.
Their goal is to compute an edge which is in G, but not in H. We
shall prove:

Theorem 8.5. Any randomized protocol solving this game must communi-
cate Ω(n) bits.

As a corollary, we get:

Corollary 8.6. Every monotone circuit computing Match has depth Ω(n).

Set m = n/3. We shall show that if the parties can solve the
monotone Karchmer-Wigderson game using c bits of communication,
then they can get a randomized protocol for computing disjointness
on a universe of size m. Since any such randomized protocol requires 4
By Theorem 6.13.
linear communication complexity4 , this gives a communication lower
bound of Ω(n). Suppose Alice and Bob get inputs X ⊆ [m] and
Y ⊆ [m]. Alice constructs the graph GX on the vertex set [3m + 2]
such that for each i, GX contains the edge {3i, 3i − 1} if i ∈ X, and
has the edge {3i, 3i − 2} if i ∈
/ X. In addition, GX contains the edge
{3m + 1, 3m + 2}. Alice’s graph consists of m + 1 disjoint edges. Bob
uses Y to build a graph HY on the same vertex set as follows. For
each i ∈ [m], Bob connects 3i − 2 to all the other 3m + 1 vertices of the
graph if i ∈ Y. If i ∈
/ Y, Bob connects 3i to all the other vertices. Since
every edge of HY contains exactly one of {3i, 3i + 1}.
Alice and Bob permute the vertices of the graph randomly and run
the protocol promised by the monotone Karchmer-Wigderson game
on GX and HY . If X and Y are disjoint, the outcome of the protocol
must be the edge corresponding to {3m + 1, 3m + 2}. On the other
hand, if X and Y intersect in k elements, then the outcome of the
protocol is equally likely to be one of the edges that corresponds
to these k elements, so the probability that it is {3m + 1, 3m + 2}’th
edge is at most 1/2. If Bob sees that the i’th edge is output, then
122 communication complexity

Figure 8.4: The graphs GX and HY

GX HY all edges touching

2X 2Y

Bob knows that i ∈ X ∩ Y. Thus, repeating this experiment a few


times, Bob will know whether the sets are disjoint or not with high
probability.
This proves that the communication for the Karchmer-Wigderson
game must be at least Ω(n), as required.

Lowerbounds on Circuits with few Alternations

Another special kind of boolean circuit is a circuit where the


number of alternations in the circuit, namely the number of switches
between AND and OR gates on input-output paths is small. Let us
say that the circuit has d alternations if there are d such switches.
The number of alternations has a very natural interpretation in
terms of communication protocols: it corresponds to the number of
rounds of communication generated in the corresponding protocol
for the Karchmer-Wigderson game. One can prove that if the number
of alternations is small, then the circuit must have exponential size 5
Håstad, 1987
even to compute simple functions like majority and parity5 :

Theorem 8.7. If a circuit with d alternations computes ∑in=1 xi mod 2,


1
where x ∈ {0, 1}n , then it must have 2Ω(n d−1 ) gates.

In Section 6, we saw that protocols with few rounds are in general


much weaker than protocols with many rounds. To prove Theorem
8.7, we prove an exception to this moral: we show that if we are
circuits, branching programs 123

allowed to restrict the input, then one can simulate any protocol for a
Karchmer-Wigderson game with fewer rounds. This is the technical
heart of the proof of Theorem 8.7.
Suppose Alice is given an input from a set X ⊆ {0, 1}n , and Bob
is given an input from the set Y ⊆ {0, 1}n . Let α ∈ {0, 1, ∗} be a
Note that the sets Xα , Yρ may become
restriction. Applying the restriction gives us new sets of inputs: empty.

X α = { x ∈ X : x k α }, Y α = { y ∈ Y : y k α }.

In Lemma 8.3, we showed that every 2 round protocol is associated


with a set of restrictions R such that ∪α∈ R Sα covers the inputs. Given
a restriction α for which Xα , Yα are not empty, we write πα to denote
the protocol operating on the inputs Xα , Yα . It is the protocol ob-
tained by deleting all nodes in π that are not reachable via the inputs
Xα × Yα . Let α be a uniformly random restriction of size ` = (1 − e)n.

Claim 8.8. If t ≤ n/2, and e ≤ 1/8, if γ is a restriction with |γ| ≥ t,


Prα [γ k α] ≤ (7/8)t .

Proof. The probability that the first coordinate set in γ is consistent


` + 1 − ` . Given any fixing of < t coordinates of
with α is at most 2n n
α, the probability that a new coordinate i that is set in γ is consistent
with α is at most
1 `−t `−t `−t
· +1− = 1−
2
|{z} n}
| {z | {z n } 2n
Pr[αi =γi |αi 6=∗] Pr[αi 6=∗] Pr[αi =∗]
2n − (1 − e)n + n/2
≤ since t < n/2
2n
3 e 7
= + ≤ . since e ≤ 1/8
4 2 8

The heart of the proof of Theorem 8.7 is the following lemma. Say
that a two round protocol where Alice speaks first has restrictions of
size t, if every restriction sent by Alice has size at most t.

Lemma 8.9. If a two-round protocol π has restrictions of size t, and


e ≤ 1/2, then πα can be simulated by a two-round protocol which has
restrictions of size t, where Bob speaks first, except with probability (8et)t .

Before we give the proof of Lemma 8.9, let us use it to prove Theo-
1
1
rem 8.7. Set t = n d−1 /16. Suppose the circuit has s < (8/7)n d−1 /16 /d =
(8/7)t /d gates, and is of depth d. Now consider the corresponding
Karchmer-Wigderson protocol. There are at most s possible two
round protocols that are executed as the last 2 round protocol in the
Karchmer-Wigderson game. We claim:
124 communication complexity

1
Claim 8.10. There is a k-restriction β, with k < n − n d−1 /2, such that π β
can be simulated by a two-round protocol with restrictions of size t.

Proof. First apply a random n/2-restriction, and then d − 2 restrictions


−1
that set (1 − e) fraction of the variables, with e = n d−1 . After d − 2
such rounds of applying restrictions, the number of variables left
1
alive is ed−2 n/2 = n d−1 /2. There are at most s two round protocols
that are executed in π, and by Claim 8.8, the probability that any of
them remains of size > t after the first restriction is at most s(7/8)t <
1/d.
The probability that any of the subsequent restrictions generates a
two-round protocol that has restrictions of size > t is at most

(d − 2) · s · (8et)t < (4/7)t . by Lemma 8.9.

Thus the probability that we are left with a two-round protocol with
restrictions of size t is at least 1 − 1/d − (4/7)t > 0. This proves that
some choice of restriction gives the claim.

Any two-round protocol computing solving the Karchmer-


Wigderson game after the restriction must use restrictions of size
> t, since any restriction that the first player sends must completely
determine the input in order to determine the parity of his bits. So
the original protocol cannot have computed the Karchmer-Wigderson
game. This completes the proof of Theorem 8.7.
Input: Alice knows x ∈ Xρ , Bob
Proof of Lemma 8.9. Let Rα ⊆ R be the subset of restrictions that are knows y ∈ Yρ .
consistent with α. On input y ∈ Yα , define ρy to be the smallest Output: i such that xi 6= yi .
restriction that distinguishes y from all of the restrictions in R: Bob sends Alice the name of ρy ;
Alice outputs an index i ∈ [n] such
ρy = arg min | ρ |. that xi 6= ρy 6= ∗;
{ρ:ρky,∀γ∈ Rα ,ρ∦γ}
Figure 8.5: Generic 2-round protocol τ
By definition, it is enough for Bob to send Alice ρy in order to for Karchmer-Wigderson games.

solve the Karchmer-Wigderson game. Once Alice has ρy , she can find
γ ∈ Rα that is consistent with her input x, which allows her to solve
the game, since x k γ ∦ ρy . We only need to show that ρy is of size at
most t with high probability.
y
Claim 8.11. If ρi 6= ∗, there is a β ∈ Rα such that β i 6= ∗.
y Ra
Proof. If not, we could set ρi = ∗ to obtain a smaller restriction that set in

ry and r
is inconsistent with every γ ∈ Rα , contradicting the minimality of set in

ρy .
Figure 8.6: The restrictions from R must
cover ρy .
Given ρy , α, if |ρy |
> t, we shall compute an ` + t-restriction ρ and
y
blame it. ρ will have the property that if ρi 6= ∗, then ρi 6= ∗ or αi 6= ∗. Recall that by Lemmas 8.2 and 8.3,
Sρy must be disjoint from S β , for every
To compute ρ, let ρ = α to begin with, so |ρ| = `. We need to set
y β ∈ Rα .
t more coordinates of ρ to bit values. Whenever ρi 6= ∗, we must
circuits, branching programs 125

have αi = ∗, or else ρy can be made smaller. We will set t coordinates


of ρ, and they will all be coordinates that are set to bits in ρy , using
the algorithm shown in Figure 8.7. We then blame the resulting
restriction ρ.
We shall prove that the probability that a specific restriction ρ is Input: α, ρy ∈ {0, 1, ∗}n , with
|α| = `, |ρy | > t.
blamed is very small. Output: ρ ∈ {0, 1, ∗} of size t + `.

Claim 8.12. For any restriction ρ, the number of restrictions α that could Set ρ = α;
lead to ρ being blamed is at most (2t)t . while |ρ| < ` + t do
Let β ∈ Rα be the
lexicographically first
Proof. Given ρ, one can immediately identify the lexicographically restriction such that ∃i ∈ [n]
y
first restriction β ∈ Rα : it is the first restriction in R that is consistent with β i 6= ∗ 6= ρi , yet ρi = ∗;
Set ρi = β i .
with ρ. There are then at most t options for the first coordinate that end
was set in ρ. Given the first coordinate that was set in ρ, there at most Output ρ;
2t options for the next coordinate of ρ that was set: it is either one of Figure 8.7: Algorithm for computing ρ
the coordinates of β, or a coordinate in the next restriction in R that is from α, ρy .
consistent with ρ after setting the first coordinate to ∗. In this way, we
see that there are at most (2t)t choices for the set of coordinates that
were set by the algorithm for computing ρ from α.

Claims 8.12 implies that the probability that any ρ is blamed when
Rα contains no restriction of size bigger than t is at most
   
n `+t (2t)t n−` t t
·2 · n ≤ · 2 · (2t)t Since the number of ` + t -restrictions
`+t ( ` ) · 2` `+t n
is (`+ t) · 2`+t , and the number of
  `-restrictions is (n` ) · 2` .
en t t
≤ · 2 · (2t)t since e ≤ 1/2.
n/2
= (8te)t .

Monotone Circuit Depth Hierarchy

We can use the connection to communication to show that
monotone circuits of larger depth are strictly more powerful than
monotone circuits of smaller depth. Throughout this section, we work with
circuits of arbitrarily large fan-in.

We do not know how to prove a similar result for general circuits.
Let k be an even number, and consider the formula F of depth
k, where every gate at odd depth is an OR gate, every gate at even
depth is an AND gate, and every gate has fan-in exactly n. Every
input gate is labeled by a distinct unnegated variable. The formula
has size O(nk ). We prove:

Theorem 8.13. Any circuit of depth k − 1 that computes F must have size
at least 2^(n/(16k) − 1).

Proof. To prove Theorem 8.13, it is enough to show that any
protocol computing the associated Karchmer-Wigderson game in k − 1
rounds has communication at least n/16 − k. The lower bound then follows,
since if the size of the circuit is at most s, the communication of a k − 1
round protocol for the Karchmer-Wigderson game can be at most
(k − 1) log s, so we get that log s ≥ n/(16k) − 1, as required.
We prove that the Karchmer-Wigderson game has large communi-
cation by reducing the problem to the pointer-chasing problem. Here
Alice and Bob are given x, y ∈ [n]^n and want to compute zk, where
1 = z0, z1, z2, . . . are defined using the rule

    zi = x_{z_{i−1}} if i is odd,   and   zi = y_{z_{i−1}} if i is even.
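The rule above is easy to trace by hand; the following small sketch (indices 1-based, as in the text) just follows the pointers, alternating between Alice's array x and Bob's array y.

```python
# Compute z_k for the pointer-chasing problem: start at z_0 = 1, follow x on
# odd steps and y on even steps.

def pointer_chase(x, y, k):
    """x, y are lists of length n with entries in {1, ..., n}."""
    z = 1
    for i in range(1, k + 1):
        z = x[z - 1] if i % 2 == 1 else y[z - 1]
    return z

# Example with n = 4: pointer_chase([2, 3, 4, 1], [3, 1, 2, 4], 3)
# follows the path 1 -> 2 -> 1 -> 2 and returns 2.
```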

Note that every variable in the formula can be described by a
string v ∈ [n]^k. We say that v is consistent with x if

    vi = x1 when i = 1,   and   vi = x_{v_{i−1}} when i is odd and i > 1.

We say that v is consistent with y if vi = y_{v_{i−1}} when i is even. Then


note that there is a unique v that is consistent with both x and y, and
that’s when v = z, the intended path in the pointer-chasing problem.
Alice sets all the coordinates of x′ ∈ {0, 1}^([n]^k) that are consistent
with her input to be 0, and all other coordinates to be 1. Bob sets all
the coordinates of y′ ∈ {0, 1}^([n]^k) that are consistent with his input
to be 1, and all other coordinates to be 0. Clearly, there is only one
coordinate where x′ is 0 and y′ is 1, and that is the coordinate that is
consistent with both x, y.
Now we claim that every gate in the formula that corresponds to a
path that is consistent with Alice's input evaluates to 0. This is clearly
true for the gates at depth k, since that is how we set the variables
in x′. For a gate at depth d < k, if the gate is an AND gate, then it
evaluates to 0 because one of its inputs corresponds to a path that is
consistent with x, and so evaluates to 0. On the other hand, if the gate
is an OR gate, then all of its inputs correspond to paths that must be
consistent with Alice's input, so they must all evaluate to 0. Thus
F(x′) = 0, since the gate at the root is certainly consistent with Alice's
input. Similar arguments prove that F(y′) = 1.
Thus any protocol for the monotone Karchmer-Wigderson game
gives a protocol solving the pointer-chasing problem, and so by
Theorem 6.19, we get that the communication of the game must be at
least n/16 − k, as required.

Branching Programs

Branching programs model computations that use very little


memory. A branching program of length ` and width w is a layered
directed graph whose vertices are a subset of [` + 1] × [w]. All the
vertices in the u’th layer (u, ·) are associated with a variable from
x1 , . . . , x n .
Every vertex (u, v), with u < ` + 1 has exactly two edges coming
out of it, and both go to a vertex (u + 1, ·) in the next layer. Every
vertex in the last layer (` + 1, ·) is labeled with an output of the
program. On input x ∈ {0, 1}^n, the program is executed by starting
at the vertex (1, 1) and reading the variables associated with each
layer in turn. These variables define a path through the program. The
program outputs the label of the last vertex on this path.
Every function f : {0, 1}^n → {0, 1} can be computed by a
branching program of width 2^n, and most functions require expo-
nential width. Here we define an explicit function that requires that
` · log w ≥ Ω(n log² n), for any branching program that computes it.
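The following small sketch evaluates a branching program in the oblivious form just described. The representation here (lists var, edges, output) is only an illustration of the model, not notation from the text.

```python
# Evaluate a width-w, length-l oblivious branching program: layer u reads the
# fixed variable var[u], and edges[u][v] = (next vertex if the bit is 0,
# next vertex if the bit is 1). Layers and vertices are 0-indexed here.

def evaluate(var, edges, output, x):
    v = 0                                  # start at the first vertex of layer 0
    for u in range(len(var)):
        v = edges[u][v][x[var[u]]]
    return output[v]

# The AND of x1, x2, x3 (Figure 8.8) as a width-2 program: vertex 0 means
# "still true so far", vertex 1 means "already false".
var = [0, 1, 2]
edges = [[(1, 0), (1, 1)], [(1, 0), (1, 1)], [(1, 0), (1, 1)]]
output = [1, 0]
assert evaluate(var, edges, output, [1, 1, 1]) == 1
assert evaluate(var, edges, output, [1, 0, 1]) == 0
```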
To prove the lower bound, we shall first show that any branching
program can be simulated (at least, in a sense) by an efficient com-
munication protocol in the number-on-forehead model (Chapter 4).
Let g : ({0, 1}^r)^k → {0, 1} be an arbitrary function that k parties
wish to compute in the number-on-forehead model. Then define
g′ : {0, 1}^(r log t + t) → {0, 1} to be g′(S, x) = g(xS), where here the first
part of g′’s input is a set S ⊆ [t] of size r, and the second part of g′’s
input is a string x ∈ {0, 1}^t. g(xS) is the output obtained when x is
projected to the coordinates in S.

Figure 8.8: A branching program that computes the logical AND, x1 ∧ x2 ∧ x3.
We claim6:

6 Babai et al., 1989; and Hrubes and Rao, 2015

Theorem 8.14. If g′ can be computed by a length ` ≪ n log² n, width
w branching program, then g can be computed by ≪ log n players with
communication log w · log² n in the number-on-forehead model.

In the literature, the programs we define here are referred to as oblivious
branching programs. In general branching programs, every vertex of the
program can be associated with a different variable; vertices in a particular
layer need not read the same variable.

Setting g to be the generalized inner-product function, Theorem
8.14 and Theorem 5.8 imply that any program with length ` ≪
n log² n that computes g′ must have width at least 2^(n^Ω(1)).
The proof of Theorem 8.14 relies on Lemma 5.4. Given a branching
program of length ` and width w, partition the layers of the program
into O(`/n) sets of size n/2 each. Consider the bipartite graph where
every vertex on the left corresponds to an element of the partition,
and every vertex on the right corresponds to a variable. We connect
two vertices if the variable does not occur in the corresponding
partition of the program.
Repeatedly applying Lemma 5.4, we find disjoint sets Q1, . . . , Q_(1/2) log n
on the left and R1, . . . , R_(1/2) log n on the right, such that |Qi| = log n/(2 log 3e),
|Ri| = n, and Qi, Ri form a bipartite clique. The parties simply pick
the set S in such a way that the i’th player’s input can be mapped to
Ri. This proves the theorem.

Boolean Formulas

A formula is a circuit whose underlying graph is a tree. Although
we do not know how to prove super-linear circuit lower bounds
for arbitrary circuits, we do know how to prove super-linear lower
bounds for formulas. Once again, the primary technique here is a
reduction to a communication argument.

Since the graph is a tree, and every gate takes at least 2 inputs, the size of
the graph is within a factor of two of the number of leaves in the tree. This
is because the number of edges in a tree with s vertices is exactly s − 1.
Every leaf is connected to a distinct edge, so there can be at most s − 1
leaves. On the other hand, every non-leaf has at least 3 edges touching it. So
if there are k leaves, counting the number of edges gives (k + 3(s − k))/2 ≤ s − 1,
and so 2k ≥ s + 2.

Consider the function Distinct : [2n]^(n+1) → {0, 1}, defined as:

    Distinct(x1, . . . , xn+1) = 1 if x1, . . . , xn+1 are distinct, and 0 otherwise.

Distinct is a boolean function that depends on (n + 1) log(2n) bits. We
shall prove7:

7 Neciporuk, 1966

Theorem 8.15. Any formula computing Distinct must use Ω(n2 ) gates.

To prove the theorem, we start by proving a simple communica-


tion lower bound. Suppose Alice is given n numbers y1 , . . . , yn ∈ [2n],
and Bob is given z ∈ [2n]. They want to compute Distinct(y1 , . . . , yn , z).

Lemma 8.16. If there is a 1-round protocol where Alice sends Bob t bits and
Bob outputs Distinct(y1, . . . , yn, z), then t ≥ log (2n choose n) = 2n − O(log n).

Proof. To prove the lower bound, it is enough to consider the case


when S = {y1 , . . . , yn } is a set of n distinct elements. In this case,
Alice’s message must determine S, or else Bob will not be able to
compute Distinct(y1 , . . . , yn , z). This is because if S 6= S0 are two sets
of size n that are consistent with Alice’s message, then there must be
an element z ∈ S such that z ∈ / S0 . Then z is distinct from S0 , but not
from S. Thus the number of bits transmitted by Alice must be at least
log (2n choose n), as required.

Armed with Lemma 8.16, we are ready to prove the formula lower
bound:

Proof of Theorem 8.15. Suppose there is a formula F computing


Distinct using s gates. Each input gate in the formula corresponds
to one of the numbers xi in the input to Distinct( x1 , . . . , xn+1 ). For
each i ∈ [n + 1] we define the tree Ti as follows. Every vertex of Ti
corresponds to a gate in F. Start by discarding all the gates in F that
Figure 8.9: The tree Ti that corresponds to the input gates of xi. Here
shaded input gates correspond to xi.


do not depend on xi . In what remains, replace every gate that has


only one input feeding into it with an edge connecting its input to its
output (Figure 8.9).
Suppose Alice knows all of the input numbers except xi , and Bob
knows xi, and Alice and Bob want to compute Distinct(x1, . . . , xn+1).
They can use the tree Ti to carry out the computation efficiently. Bob
already knows the values at the leaves of the tree. Every other gate in
Ti depends on at most two values that Bob does not know. However,
given Alice’s input, each of these values is a boolean function that
depends on only one bit that Bob does know. There are exactly 4
boolean functions that depend on one bit. So Alice can send 2 bits to
describe this function for every edge of the tree Ti . Alice also sends
2 bits to describe how to compute the output of F given the values
of the gates in Ti . If Ti contains k edges, Alice needs to send at most
2(k + 1) bits for Bob to be able to compute Distinct(x1, . . . , xn+1).
On the other hand, the number of leaves in F is at most s, so by
averaging, some input xi corresponds to at most s/n input gates of
F. The corresponding tree Ti then contains at most O(s/n) edges. By
Lemma 8.16, we get that s/n ≥ Ω(n), so s ≥ Ω(n2 ).

Formula Lowerbounds for Parity

Boolean Depth Conjecture


9
Proof Systems

Proof systems give a way to measure the difficulty of proving the-


orems. A proof system is a specific language for expressing a proof. It
consists of a set of rules that allow one to derive a theorem statement
from axioms. The study of proof systems has led to many interest-
ing philosophical results, including Gödel’s famous incompleteness
theorem1.

Resolution Refutations

Perhaps the simplest example of a proof system is refutation.


A refutation system can be used to prove that a boolean formula
expressed in conjunctive normal form cannot possibly be satisfied. For
example, consider the formula

F =( x2 ∨ x1 ) ∧ (¬ x2 ∨ x1 ) ∧ (¬ x1 ∨ x3 ∨ ¬ x4 )
∧ (¬ x1 ∨ x3 ∨ x4 ) ∧ (¬ x1 ∨ ¬ x3 )

We claim that F cannot be satisfied by any boolean assignment. To


reason about this formula, we repeatedly use the resolution rule

( a ∨ b) ∧ (¬ a ∨ c) ⇒ (b ∨ c).

The resolution refutation for F shown in Figure 9.1 uses this rule to
give a proof that F cannot be satisfied. In general, a resolution
refutation is a sequence of clauses where each clause is derived by
combining two previously derived clauses using the resolution rule.
The proof ends when the empty clause (namely a contradiction) is
derived. The proof is said to be tree-like if every derived clause is
used only once.
Finding solutions to boolean formulas is a central problem because
of its connection2 to the complexity classes NP and coNP. The best

2 Wikipedia, 2016b

Figure 9.1: A refutation of F. Resolving (x2 ∨ x1) with (¬x2 ∨ x1) gives (x1);
resolving (¬x1 ∨ x3 ∨ x4) with (¬x1 ∨ x3 ∨ ¬x4) gives (¬x1 ∨ x3); resolving
that with (¬x1 ∨ ¬x3) gives (¬x1); and resolving (x1) with (¬x1) gives the
empty clause.
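The refutation of F in Figure 9.1 is short enough to check mechanically. The sketch below is only an illustration: literals are encoded as nonzero integers (i for xi, −i for its negation), a clause is a frozenset of literals, and the empty clause signals a contradiction.

```python
# The resolution rule and the refutation of F from Figure 9.1.

def resolve(c1, c2, var):
    """Resolve two clauses on the variable var (var in c1, -var in c2)."""
    assert var in c1 and -var in c2
    return (c1 - {var}) | (c2 - {-var})

c1 = frozenset({2, 1})        # (x2 v x1)
c2 = frozenset({-2, 1})       # (-x2 v x1)
c3 = frozenset({-1, 3, -4})   # (-x1 v x3 v -x4)
c4 = frozenset({-1, 3, 4})    # (-x1 v x3 v x4)
c5 = frozenset({-1, -3})      # (-x1 v -x3)

d1 = resolve(c1, c2, 2)       # (x1)
d2 = resolve(c4, c3, 4)       # (-x1 v x3)
d3 = resolve(d2, c5, 3)       # (-x1)
d4 = resolve(d1, d3, 1)       # the empty clause: F is unsatisfiable
assert d4 == frozenset()
```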

solvers known today try to find a satisfying solution while
simultaneously trying to prove that formulas obtained after partial
assignments cannot be satisfied, using resolution refutations. Thus it
is important to understand what kinds of formulas can be proved to
be unsatisfiable using small refutations.

If NP ≠ coNP, then it must be true that for every proof system, there is
an unsatisfiable formula that cannot be proved to be unsatisfiable in a
polynomial number of steps. On the other hand, if P = NP, then there
must be a proof system in which every unsatisfiable boolean formula
can be proved to be unsatisfiable in a polynomial number of proof steps.

The Pigeonhole Principle

The pigeonhole principle states that if n pigeons are placed into
n − 1 holes, then some hole must contain at least 2 pigeons. One can
express the principle as a boolean formula. For i ∈ [n], j ∈ [n − 1],
we have the variable xi,j which is set to true when the i’th pigeon is in
the j’th hole. Define:

    Pi = (xi,1 ∨ xi,2 ∨ . . . ∨ xi,n−1)        (pigeon i must be in some hole)

    H = ⋀_{j=1}^{n−1} ⋀_{i<i′ ∈ [n]} (¬xi,j ∨ ¬xi′,j)        (each hole contains at most one pigeon)

Then the pigeonhole principle implies that

    P = ⋀_i Pi ∧ H

cannot be satisfied by any assignment to the variables xi,j.
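For concreteness, the sketch below generates the clauses of P, using the same integer-literal encoding as in the earlier resolution sketch (the variable xi,j is a positive integer and its negation the corresponding negative integer); the encoding itself is just an illustrative choice.

```python
# Build the clauses of the pigeonhole formula P for n pigeons and n-1 holes.

def pigeonhole_cnf(n):
    def var(i, j):  # pigeon i in {1..n}, hole j in {1..n-1}
        return (i - 1) * (n - 1) + j

    clauses = []
    # P_i: pigeon i must be in some hole.
    for i in range(1, n + 1):
        clauses.append([var(i, j) for j in range(1, n)])
    # H: each hole contains at most one pigeon.
    for j in range(1, n):
        for i in range(1, n + 1):
            for i2 in range(i + 1, n + 1):
                clauses.append([-var(i, j), -var(i2, j)])
    return clauses

# pigeonhole_cnf(3) has 3 "pigeon" clauses and 6 "hole" clauses.
```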


Here we prove3 that proving that P is unsatisfiable requires an
exponential number of resolution steps:

3 Haken, 1985

Theorem 9.1. Any resolution refutation of the pigeonhole principle must
involve 2^Ω(n) derivation steps.

In fact, the proof will show that an exponential number of steps is needed
in any proof system where each line derives a disjunction, using any
derivation rule and not just resolution.

Consider any refutation of P that derives s clauses. The key idea of


the proof is to give the proof even more power. We allow the proof to
assume the following axiom for free:

Axiom 9.2. Each hole contains exactly one pigeon, and the n − 1 pigeons
that are in the holes are distinct.

This can only make it easier to derive a contradiction.

Indeed, H is implied by Axiom 9.2.

Axiom 9.2 gives

    ¬xi,j ⇔ ⋁_{i′ ≠ i} xi′,j,

so under the Axiom, we can replace every negated variable in the
proof with a disjunction of unnegated variables.
Let C be one of the clauses derived in the proof. We say that the
clause C is big if there is a set S ⊂ [n], |S| = n/4 such that for each
i ∈ S, C contains n/4 variables xi,j (after replacing the negated
variables).
Pick n/4 of the pigeons uniformly at random, and randomly
assign them to n/4 different holes. If i0 is assigned to hole j0 , in
this process, then by Axiom 9.2, we set xi,j0 to be false for all i 6= i0 ,
and set xi0 ,j to be false for all j 6= j0 . After this assignment to the
variables, n/4 of the pigeon clauses become true. Moreover, the
variables corresponding to these n/4 pigeons and holes disappear
from the rest of the clauses by appealing to the axiom. The rest of the
formula P becomes equivalent to the corresponding formula for 3n/4
pigeons and 3n/4 − 1 holes, and the resolution proof must still derive
a contradiction. We claim:

Claim 9.3. One of the big clauses must survive the assignment.

Proof. Say that a clause C has pigeon complexity w if there is a set
S ⊂ [n] of size w such that

    ⋀_{i∈S} Pi ⇒ C      (here the implication is allowed to use Axiom 9.2),

yet no smaller set has this property.


The contradiction can only be derived from all 3n/4 pigeon
clauses that remain (even if Axiom 9.2 is used), so the final clause
of the proof has pigeon complexity 3n/4. This means that one of the
clauses used to derive the contradiction has pigeon complexity at
least 3n/8. Continuing in this way, we obtain a sequence of clauses
in the proof, where each clause requires at least half as many pigeon
clauses as the previous one. Since the clauses of P have pigeon com-
plexity at most 1 < n/4, there must be a clause C in this sequence
that has pigeon complexity in between n/4 and n/2.
Let S ⊂ [n] be the set realizing the pigeon complexity of C. For
each i′ ∈ S, since ⋀_{i∈S\{i′}} Pi does not imply C, there must be an
assignment to all the variables where ⋀_{i∈S\{i′}} Pi is true, yet C is
false. Suppose i″ ∉ S and xi″,j is set to true in this assignment. Then
consider what happens when we set xi″,j to be false and xi′,j to be
true and leave the rest of the variables as they are. Doing so must
make C true, since ⋀_{i∈S} Pi is now true in the assignment. Since C is a
disjunction of unnegated variables, this can only happen if C contains
xi′,j. Thus for each i′ ∈ S, there must be at least 3n/4 − n/2 = n/4
values of j for which xi′,j is in the clause C. So C is big.

Claim 9.4. If C is big, then the probability that C survives the random
assignment is at most (63/64)^(n/8).

Proof. Since C is big, in each of the first n/8 assignments, there are at
least n/4 − n/8 = n/8 pigeons which, if assigned to one of n/4 − n/8 = n/8
holes, would lead to the clause vanishing from the proof. Thus the
probability that the clause survives each of the first n/8 assignments of
pigeons to holes is at most

    1 − (n/4 − n/8)(n/4 − n/8)/n² = 1 − 1/64 = 63/64.

Now suppose the proof has less than (64/63)^(n/8) clauses. Then by
Claim 9.4, there is an assignment of the pigeons to holes such that
every big clause does not survive. On the other hand, by Claim 9.3,
at least one big clause must survive. So the proof must have at least
(64/63)^(n/8) clauses.

Cutting Planes

A stronger proof system can be obtained by reasoning about lin-


ear inequalities instead of clauses. We can view a clause of the type
( a ∨ ¬b ∨ c) as asserting the linear inequality a + 1 − b + c ≥ 1 of the
boolean variables a, b and c. In the cutting planes proof system, we are
allowed to convert each of these clauses into inequalities, take posi-
tive linear combination of two inequalities, and round the inequalities
when one side of the inequality is sure to be an integer. One proves
that the clauses are unsatisfiable by arriving at the contradiction
1 ≤ 0.
The resolution rule tells us that ( a ∨ b) and (¬ a ∨ c) imply the
clause (b ∨ c). The analogous fact can easily be derived using cutting
planes:

( a ∨ b) ⇒ a + b ≥ 1
(¬ a ∨ c) ⇒ 1 − a + c ≥ 1.

Adding these inequalities gives

1+b+c ≥ 2
⇒ b + c ≥ 1,
which corresponds to the clause (b ∨ c). So cutting planes are at least
as expressive as resolution refutations.
Lemma 9.5. If a formula can be refuted in s steps using resolution, then it
can be refuted in O(s) steps using cutting planes.
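The two cutting-planes rules are easy to mechanize. The sketch below uses an illustrative representation (an inequality sum of coeffs[v]·v ≥ bound is a pair (coeffs, bound)) to convert clauses to inequalities, add inequalities, and apply the rounding step; it reproduces the derivation of (b ∨ c) above.

```python
from math import ceil

def clause_to_ineq(clause):
    """Clause as integer literals (negative = negated); returns (coeffs, bound)."""
    coeffs, bound = {}, 1
    for lit in clause:
        v = abs(lit)
        if lit > 0:
            coeffs[v] = coeffs.get(v, 0) + 1
        else:               # the term 1 - v contributes -v and shifts the bound
            coeffs[v] = coeffs.get(v, 0) - 1
            bound -= 1
    return coeffs, bound

def add(ineq1, ineq2):
    coeffs = dict(ineq1[0])
    for v, c in ineq2[0].items():
        coeffs[v] = coeffs.get(v, 0) + c
    return {v: c for v, c in coeffs.items() if c != 0}, ineq1[1] + ineq2[1]

def divide_and_round(ineq, d):
    coeffs, bound = ineq
    assert all(c % d == 0 for c in coeffs.values())
    return {v: c // d for v, c in coeffs.items()}, ceil(bound / d)

# Deriving (b v c) from (a v b) and (-a v c), with a = 1, b = 2, c = 3:
step = add(clause_to_ineq([1, 2]), clause_to_ineq([-1, 3]))
assert step == ({2: 1, 3: 1}, 1)                       # b + c >= 1
assert divide_and_round(({2: 2, 3: 2}, 1), 2) == ({2: 1, 3: 1}, 1)  # rounding step
```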
In fact, cutting planes gives a strictly stronger proof system. For
example, one can give a cutting planes proof of the pigeonhole
principle using just O(n2 ) proof steps. Rewriting the clauses of the
pigeonhole principle as linear inequalities, we get:

Pi ≡ xi,1 + xi,2 + . . . + xi,n−1 ≥ 1 pigeon i must be in some hole

Hi,i0 ,j ≡ 1 − xi,j + 1 − xi0 ,j ≥ 1 hole j cannot contain both i, i0 .

⇒ xi,j + xi0 ,j ≤ 1.
We shall use these inequalities to derive the inequality:

    Lk,j ≡ x1,j + x2,j + . . . + xk,j ≤ 1

in O(k) steps. This would provide the required contradiction, since

    ∑_{j=1}^{n−1} Ln,j ≡ ∑_{j=1}^{n−1} ∑_{i=1}^{n} xi,j ≤ n − 1,

while

    ∑_{i=1}^{n} Pi ≡ ∑_{i=1}^{n} ∑_{j=1}^{n−1} xi,j ≥ n.

It only remains to show how to derive Lk,j. L1,j and L2,j follow
immediately from H1,2,j. To derive Lk,j from Lk−1,j, for every r < q ≤
k, we derive the inequality

    Er,q,j ≡ ∑_{i=1}^{r} xi,j + xq,j ≤ 1.

We have Lk,j ≡ Ek−1,k,j, and E1,q,j ≡ H1,q,j. Moreover we can derive
Er,q,j by:

    Er−1,q,j + Lr,j + Hr,q,j ≡ ∑_{i=1}^{r} 2xi,j + 2xq,j ≤ 3
    ⇒ ∑_{i=1}^{r} xi,j + xq,j ≤ 3/2
    ⇒ ∑_{i=1}^{r} xi,j + xq,j ≤ 1
    ≡ Er,q,j.

Figure 9.2: Adding Er−1,q,j, Lr,j, Hr,q,j gives the terms of Er,q,j.

Lower bounds on Cutting Planes


Here we give an example of a formula that is unsatisfiable, but
requires an exponential number of steps to prove unsatisfiable using
the cutting planes proof system.
Recall that a clique of size k in a graph is a set of k vertices that are
all connected to each other. A k − 1 coloring of the graph is a coloring
of the vertices with k − 1 colors, such that every edge gets exactly 2
distinct colors. The formula we consider encodes the statement that
a graph cannot simultaneously contain a clique of size k, and be k-
colorable. Indeed, some two vertices of the clique must get the same
In fact, when k is equal to the number
color by the pigeonhole principle, which is a contradiction. Let us of vertices of the graph, this state-
encode this fact using a boolean formula. For each edge {i, , j} ⊂ [n], ment is equivalent to the pigeonhole
principle.
we have the variable x{i,j} which encodes whether or not the edge is
present in the graph. For each i ∈ [k], j ∈ [n] we have the variable yi,j ,
which is set to 1 if and only if the i’th vertex of the clique is j. Finally,
for each i ∈ [n], j ∈ [k − 1] we have the variable zi,j which is set to 1 if
and only if the i’th vertex of the graph is colored with j.
Define the following formulas:
    Ci = (yi,1 ∨ yi,2 ∨ . . . ∨ yi,n) ∧ ⋀_{j≠j′} (¬yi,j ∨ ¬yi,j′)        (exactly one vertex of the graph is the i’th vertex of the clique)

    Ki = (zi,1 ∨ zi,2 ∨ . . . ∨ zi,k−1) ∧ ⋀_{j≠j′} (¬zi,j ∨ ¬zi,j′)      (the i’th vertex is colored with exactly one color)

    C = ⋀_{i≠i′, j≠j′} (¬yi,j ∨ ¬yi′,j′ ∨ x{j,j′})                       (every pair of vertices in the clique are connected by an edge)

    K = ⋀_{i≠i′, j} (¬zi,j ∨ ¬zi′,j ∨ ¬x{i,i′})                          (every pair of vertices that have the same color are not connected by an edge)

Finally define the unsatisfiable formula:

    F = ⋀_{i=1}^{k} Ci ∧ C ∧ ⋀_{i=1}^{n} Ki ∧ K.

We shall prove:

Theorem 9.6. When k = Ω(n), any tree-like cutting planes proof of the
unsatisfiability of F must derive 2Ω(n/ log n) inequalities.

To prove the theorem, we shall reduce the problem to a communi-


cation problem called the clique-coloring problem. In this problem,
Alice and Bob are each given a graph on 3k + 2 vertices. Alice’s graph
is promised to have a clique on k vertices, and Bob’s graph is k − 1
colorable. Their goal is to find a vertex in Alice’s graph which is not
present in Bob’s graph. We will show:

Theorem 9.7. Any randomized protocol solving the clique-coloring problem


must communicate Ω(k) bits.

Given any tree-like cutting plane proof of size s that F is not


satisfiable, we show how to obtain a randomized protocol for the
clique-coloring problem with communication O(log s · log log s).
Combining this fact with Theorem 9.7 proves Theorem 9.6.
Suppose Alice is given a graph that has a clique of size k, and Bob
is given a graph that is k − 1 colorable. Alice will set the variables yi,j
to be consistent with her clique, and Bob will set the variable zi,j and
x{i,j} to be consistent with his graph. Under this setting of variables,
all of the clauses in Ci , Ki , K are true, but one of the clauses in C must
be false. Such a false clause specifies an edge that is in Alice’s graph
but not in Bob’s graph. So our goal will be to find this clause with an
efficient communication protocol.

Claim 9.8. There must be an inequality L in the tree-like proof such that at
least s/3, but no more than 2s/3 of the inequalities in the proof are used to
derive L.

Proof. All s inequalities are involved in deriving the final contradic-


tion, so one of the inequalities used to prove the final contradiction
must use at least s/2 of the inequalities. Continuing in this way, we
must eventually find an inequality that uses at most 2s/3 inequalities,
but at least s/3 of the inequalities derived in the proof.

Our aim will be to check whether this inequality L is satisfied or


not under the assignment to the variables held by Alice and Bob. L
can always be written as

κ + ∑ αi,j · yi,j ≤ ∑ βi,j · zi,j + ∑ γi,j · x{i,j} ,


i,j i,j i,j

where here all of the variables on the left hand side are known to
Alice, and all the variables on the right hand side are known to
Bob. Since the variables are boolean, there are at most 2^(3k²) possible
values for the left hand side, and at most 2^(3k(k−1) + (3k choose 2)) possible values
that can be taken by the right hand side. Thus Alice and Bob can
use the randomized protocol for solving the greater-than problem
on a set of size 2^(O(k²)) to compute whether or not this inequality
is satisfied by their variables. They expend O(log(k) + log log(s))
bits of communication in order to make sure that output of their
computation is correct with error ≪ 1/ log(s).
If the inequality L is not satisfied, Alice and Bob can safely discard
the rest of the proof, and continue to find a clause used to derive
L that evaluates to false. Otherwise, all of the inequalities used
to derive L can safely be discarded, and Alice and Bob can start
their search from the beginning of the proof after discarding all
the inequalities used to derive L. In either case, they discard at
least s/3 inequalities. Thus this process can repeat at most O(log s)

times. The probability of making an error is still small, since the


probability of error in each step is much less than 1/ log(s). The total
communication of the protocol is O(log s · log log s), as promised.
It only remains to prove Theorem 9.7:

Proof of Theorem 9.7. We prove the theorem by reduction to the ran-


domized communication complexity of set disjointness. Suppose
there is a protocol for solving the clique-coloring problem using t bits
of communication.
Given sets X ⊆ [k − 2], Y ⊆ [k − 2], Alice uses her set to generate
a k-clique on 3(k − 2) + 2 vertices and Bob uses his set to generate a
k − 1 coloring of the same 3(k − 2) + 2 vertices as follows. For each

Figure 9.3: A clique and a coloring generated from the sets X, Y, when k = 6.

i ∈ [k − 2], if i ∈ Y, Bob colors 3(i − 1) and 3(i − 1) + 1 with the color


i. Otherwise Bob colors 3(i − 1) and 3(i − 1) + 2 with the color i. All
the remaining vertices are colored with k − 1. For each i ∈ [k − 2], if
i ∈ X, Alice includes 3(i − 1) + 2 in her clique. Otherwise, she includes
3(i − 1) in her clique. Alice also adds 3(k − 2) + 1, 3(k − 2) + 2 to her
clique to obtain k vertices. Finally, Bob connects every pair of vertices
that he colored differently, to obtain a k − 1 colorable graph.
If X, Y are disjoint, exactly one edge of Alice’s graph is missing
from Bob’s graph, namely the edge {3(k − 2) + 1, 3(k − 2) + 2}. On
the other hand, if X and Y intersect in ` ≥ 1 elements, there will be
(` + 2 choose 2) ≥ 3 edges in Alice’s graph that are not in Bob’s graph. Finding
any of these edges will reveal that X and Y do intersect. Alice and

Bob use shared randomness to randomly permute their graphs and


run the protocol. In the case that the sets intersect and the protocol
succeeds, the probability that they obtain an edge that does not
correspond to the edge {3(k − 2) + 1, 3(k − 2) + 2} is at least 2/3. By
Theorem 6.13, t ≥ Ω(k).

Exercise 9.1
Show that the formula that asserts that there cannot be a graph
which both has a k-matching and a set of size k − 1 that covers every
edge requires an exponential number of inequalities to prove in the
cutting planes proof system.

Exercise 9.2
Show that the formula that asserts that there cannot be a graph
on [n] which both has a path from 1 to n and a set S ⊂ [n] with
1 ∈ S, n ∉ S such that no edge of the graph goes between S and [n] \ S,
requires an exponential number of inequalities to prove in the cutting
planes proof system.
10
Data Structures

A data structure is a way to efficiently maintain access to data.
Many of the best known algorithms1 rely on efficient data structures.
Lower bounds on data structures are often proved by appealing to
arguments about communication complexity.

1 Kleinberg and Tardos, 2006

For example, interesting data structures are used in Dijkstra’s algorithm for
finding the shortest path in directed graphs, and in Kruskal’s algorithm for
computing the minimum spanning tree of undirected graphs.

Efficient algorithms for sorting numbers are key primitives


in algorithm design, with countless applications. In many of these
applications, we do not actually need to sort the numbers. It is
enough to be able to query some information about the sorted list
that can be computed very quickly.

Sort Statistics
Suppose we want to maintain a set S ⊆ [n] of k numbers, so that you
can quickly add and delete numbers from the set, as well as compute These are operations need to be carried
the minimum of the set. A trivial solution is to store the k numbers in out efficiently in the execution the
fastest algorithms for computing the
a list. Then adding a number is fast, but finding the minimum might shortest path connecting two vertices of
take as long as k steps. A better solution is to maintain the numbers a graph.
in a heap. The numbers are stored in a balanced binary tree, with
the property that every node is at most as large as the value of its
children. One can add a number to the heap by adding it at a leaf,
and bubbling it up the tree. One can delete the minimum by deleting
the number at the root, inserting one of the numbers at a leaf into the
root, and bubbling down the number. This takes only O(log k) time
for each operation. See Figure 10.1 for an example.
Another solution is to maintain the numbers in a balanced binary
tree (Figure 10.2). Each memory location corresponds to a node in
the binary tree. Each leaf corresponds to an element of [k]. Each node

Figure 10.1: Deleting the minimum (3) from a heap, and adding a new number (2).

above it maintains the number of elements of S in the corresponding


subtree, the minimum of S in that subtree, and the maximum of S
in that subtree. An element can be added to S or deleted from S in
O(log n) steps, by visiting all the memory cells that correspond to the
ancestors of the element in the tree. One can also compute the i’th
smallest element of S in O(log n) steps, by starting at the root and
moving to the appropriate subtree.

The tree can be improved to give update time O(log n / log log n), at the
expense of increasing the query time to O(log² n / log log n), by using a
balanced tree of arity log n instead of 2.
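A minimal sketch of this counting tree is below, assuming (for simplicity of the illustration) that n is a power of two and storing only the subtree counts; the minimum and maximum of each subtree can be maintained in exactly the same way.

```python
# A complete binary tree over the universe [n]: each node stores how many
# elements of S lie in its subtree. Add, delete and select all walk one
# root-to-leaf path, so each takes O(log n) steps.

class OrderStatisticTree:
    def __init__(self, n):
        self.n = n                       # assumed to be a power of two
        self.count = [0] * (2 * n)       # leaves live at indices n .. 2n-1

    def _update(self, x, delta):         # x in {0, ..., n-1}
        v = self.n + x
        while v >= 1:
            self.count[v] += delta
            v //= 2

    def add(self, x):
        self._update(x, +1)

    def delete(self, x):
        self._update(x, -1)

    def select(self, i):
        """Return the i'th smallest element of S (1-indexed)."""
        v = 1
        while v < self.n:                # descend into the correct subtree
            if self.count[2 * v] >= i:
                v = 2 * v
            else:
                i -= self.count[2 * v]
                v = 2 * v + 1
        return v - self.n
```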

Figure 10.2: Maintaining numbers in a binary search tree.

Predecessor Search
Suppose we want to maintain a set of numbers S ⊆ [n] and be able to
quickly determine the predecessor of x, defined as

    P(x) = arg max_{y ∈ S, y ≤ x} y,

namely the largest element of S that is at most x. If we maintain the


numbers using a binary search tree as in Figure 10.2, we can handle
updates in O(log u) time, and answer queries in time O(log u). In fact
the queries can be computed in time O(log log u).

If we wish to compute P(i) and a1, a2, . . . , ad are the memory cells
associated with the path from the root of the tree to i, let j be the maximum
coordinate such that the maximum of the subtree rooted at aj is not i. Then
the maximum stored at the subtree rooted at aj is the predecessor of i. It can
be computed either by querying all the memory cells associated with
a1, a2, . . . , ad, or more quickly by querying just ad/2 and recursing on the
first half or the second half.

One can improve the update time using Van Emde Boas trees2.
Let I1 = [1, √n], I2 = [√n + 1, 2√n], . . . be √n consecutive intervals,
each of size √n. For each i, we store the set S ∩ Ii using a smaller
Van Emde Boas tree. We also store the maximum element of S, the
minimum element of S and the set TS = {i : S ∩ Ii ≠ ∅} recursively.
Each of these smaller sets is stored using another Van Emde Boas
data structure. See Figure 10.3.

2 van Emde Boas, 1975

TS stores the identities of the non-empty intervals.

Now to compute P(x) from the data structure, one can do it by
first checking whether the corresponding element of TS is empty
or not. If it is empty, then we find the predecessor of the relevant
interval in TS, and output its maximum element. If it is not empty,
we compute the predecessor in the relevant interval. In either case,
we only need to make one recursive call to a smaller data structure.
After the j’th recursive call, we would be working with at most n^(1/2^j)
numbers. Thus there can be at most O(log log n) recursive calls
before the query is computed. Similarly, one can add and delete
numbers to the set S using at most O(log log n) operations.

It is not known whether Van Emde Boas data structures give the best
possible performance for the predecessor search problem.
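Below is a simplified sketch of this recursion, assuming the universe size is a power of two and showing only insertion and predecessor queries. Unlike the classical structure, every element is also stored inside its cluster here, which keeps the sketch short but does not preserve the fast update time; the predecessor query still makes one recursive call per level, matching the O(log log u) query bound described above.

```python
# A Van Emde Boas-style predecessor structure: clusters over the low-order
# bits, and a recursive summary of the non-empty clusters (the set T_S above).

class VEB:
    def __init__(self, u):
        self.u, self.min, self.max = u, None, None
        if u > 2:
            self.lo_bits = (u.bit_length() - 1) // 2
            self.clusters = {}           # cluster index -> VEB over the low bits
            self.summary = None          # VEB over the cluster indices

    def _split(self, x):
        return x >> self.lo_bits, x & ((1 << self.lo_bits) - 1)

    def insert(self, x):
        if self.min is None:
            self.min = self.max = x
        else:
            self.min, self.max = min(self.min, x), max(self.max, x)
        if self.u > 2:
            hi, lo = self._split(x)
            if hi not in self.clusters:
                self.clusters[hi] = VEB(1 << self.lo_bits)
                if self.summary is None:
                    self.summary = VEB(self.u >> self.lo_bits)
                self.summary.insert(hi)
            self.clusters[hi].insert(lo)

    def predecessor(self, x):
        """Largest stored element <= x, or None; one recursive call per level."""
        if self.min is None or x < self.min:
            return None
        if x >= self.max:
            return self.max
        if self.u == 2:
            return self.min
        hi, lo = self._split(x)
        c = self.clusters.get(hi)
        if c is not None and c.min is not None and c.min <= lo:
            return (hi << self.lo_bits) | c.predecessor(lo)
        h = self.summary.predecessor(hi - 1)   # predecessor among non-empty clusters
        return (h << self.lo_bits) | self.clusters[h].max
```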

Figure 10.3: An example of a Van Emde Boas tree.

Lower Bounds on Static Data Structures

A static data structure is an algorithm that can be used to store


data in a collection of memory cells, and efficiently answer queries
about the data. The data structure has three main parameters that we
seek to optimize:

Number of cells s This is the total number of memory cells used to



store the data.

Number of bits in each cell w This is the word-size of the data structure.

Query time t This is the number of cells that need to be accessed to


answer a query on the data.
Ideally, we would like to minimize all three of these parameters. The
primary method known for proving lower bounds on the parameters
of static data structures is via lower bounds on two party commu-
nication. This is because efficient data structures lead to efficient
communication protocols.
Say we are given a data structure for a particular problem. We
define the corresponding data structure communication game as
follows: Alice is given a query to the data structure, and Bob is given
the data that must be stored in the data structure. Using the data
structure, the communication problem can be solved by a 2t-round
protocol, where in each round, Alice sends log s bits to indicate the
name of the memory cell she wishes to read, and Bob responds with
w bits that are the contents of the appropriate cell. Thus we get:
Lemma 10.1. If there is a data structure of size s, word size w, and query
time t for solving a particular problem, then there is a 2t round protocol
where in each round Alice sends log s bits and Bob responds with w bits to
solve the corresponding communication game.
Appealing to lower bounds in communication complexity gives us
lower bounds on the parameters of the data structure.
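The simulation in Lemma 10.1 is mechanical; the sketch below illustrates it, with a hypothetical query_algorithm object (with next_cell and output methods) standing in for Alice's adaptive probing strategy.

```python
# Each probe of the data structure becomes one round of communication:
# Alice names a cell (log s bits) and Bob returns its contents (w bits).

def simulate_as_protocol(query_algorithm, memory, t):
    """memory: Bob's list of s cells; t: the query time of the data structure."""
    transcript, answers = [], []
    for _ in range(t):
        request = query_algorithm.next_cell(answers)   # Alice -> Bob: log s bits
        if request is None:
            break
        contents = memory[request]                     # Bob -> Alice: w bits
        transcript.append((request, contents))
        answers.append(contents)
    return query_algorithm.output(answers), transcript
```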

Set Intersection
Suppose we wish to store an arbitrary subset Y ⊆ [n], so that on
input X ⊆ [n], one can quickly compute whether or not X ∩ Y is
empty3. There are several solutions one could come up with:

3 Miltersen et al., 1998

• We could store Y as a string of n bits, broken up into words of size
w. This would give the parameters s = ⌈n/w⌉, t = ⌈n/w⌉.

• We could store whether or not Y intersects every potential set X.


This would give s = 2n , w = 1, t = 1.

• For every subset V ⊆ [n] of size at most p, we could store whether


or not Y intersects V. Since X is always the union of at most ⌈n/p⌉
such sets V, this would give s = ∑_{i=0}^{p} (n choose i), w = 1, t = ⌈n/p⌉.
On the other hand, since every data structure leads to a com-
munication protocol for computing set disjointness, for which the
communication must be at least n + 1, we have:
Theorem 10.2. Any data structure that solves the set intersection problem
must have t · (dlog se + w) ≥ n + 1.

Lopsided Set Intersection


In practice queries are much smaller than the data being stored. In
the lopsided set intersection problem, the data structure is required
to store a set Y ⊆ [n]. A query to the problem is a set X ⊆ [n] of size
k < n. The data structure must compute whether or not X, Y have a
common element.
When k = 1, we can store Y as a string. This achieves s = n/w, t =
1, and no better parameters are possible. The problem becomes more
interesting when k > 1. We can get a solution with s = (nk)/w, t = 1
by storing whether or not Y intersects each set of size k.
Theorem 1.19 proves a lower bound. Applying the theorem gives:
    t(log s + w) ≥ n / (2^((t log s)/k) + 1) = n / (s^(t/k) + 1).
As a consequence, we get the following theorem:

Theorem 10.3. In any data structure solving the lopsided set intersection
problem, either t ≥ √n / (2(log s + w)), or s ≥ n^(k/(2t)).

The Span Problem


In the span problem4, the data structure is to store n/2 vectors
y1, . . . , yn/2 ∈ F_2^n. A query is a vector x ∈ F_2^n. The data structure
must quickly compute whether or not x is a linear combination of
y1, . . . , yn/2.

4 Miltersen et al., 1998
Theorem 1.21 proves that any data structure solving this problem
must satisfy:

tw ≥ n2 /4 − t log s · (n + 1) − n log n.

As a consequence, we get:

Theorem 10.4. In any static data structure solving the span problem, if
s < 2^(n/(8t)), then tw = Ω(n²).

Predecessor Search
In the predecessor search problem, the data structure is required to
encode a subset S ⊆ [u] of size n. The data structure should also be
able to compute the predecessor P( x ) of any element x ∈ [u]. This is
the largest element of S that is at most x. We have seen that there is a
data structure that can handle all of these operations in time log log u.
Here we show that this bound is essentially tight5.

5 Ajtai, 1988; Beame and Fich, 2002; Pǎtraşcu and Thorup, 2006; and Sen and Venkatesh, 2008

Theorem 10.5. Any data structure solving the predecessor search problem
with s = poly(n) must either have time larger than Ω(log n / log(w log n)) or
must work only when log n · log log n ≥ Ω(log log u − log log w).
work only when log n · log log n ≥ Ω (log log u − log log w).

Proof. We prove the theorem by appealing to a lower bound for the


tree pointer chasing problem (Theorem 7.11). Say we have an input
to the tree pointer chasing problem on a tree T a,b of depth k. We
show how Alice and Bob can transform the tree into inputs x ∈
{0, 1, . . . , u − 1} and S ⊆ {0, 1, . . . , u − 1} for the predecessor search
problem, without communicating. We will ensure that u ≤ (a + b)^(b^k),
and |S| ≤ a^k. The transformation will guarantee that the predecessor
of x in S determines the correct output of the tree pointer chasing
problem.
Given the reduction to predecessor search, any data structure
with < k/2 queries for the predecessor search problem gives a
communication protocol for the tree pointer chasing problem where
Alice sends log s bits in each round, and Bob responds with w bits
in each round, and the total number of rounds is less than k. We set
a = (25w log n)², b = (25 log s log n)², and k = log n / log a. The size of the
set produced by the reduction is at most a^k = n.
By Theorem 7.11, the protocol can succeed with probability at
most
    1/2 + 2(k − 1) · (√(log s / b) + √(w / a))
        ≤ 1/2 + 2 log n · (1/(5 log n) + 1/(5 log n)) < 1,

So the protocol for tree pointer chasing must make an error. Since the
protocol for predecessor search is correct, it must be the case that the
protocol works only when u < (a + b)^(b^k). So we must have

    log log u ≤ k log b + log log(a + b)
              ≤ log n · log((25 log s · log n)²) + log log((25 log n(log s + w))²)
              ≤ O(log n log log n) + log log w,        (by assumption log s = O(log n))

as claimed.
as claimed.
Next we describe how to carry out the reduction. If the tree is of
depth 1, with the edge coming out of the root in Alice’s input, we set
S = {0, 1, . . . , a − 1}, and x ∈ {0, 1, . . . , a − 1} to be the name of the
child that is connected to the root. If Bob knows the edge coming out
of the root, we set S = {i } ⊂ {0, 1, . . . , b − 1}, where i corresponds to
the leaf of the tree that is connected to the root, and we set x = b − 1.
In either case, S is a set of size at most a, defined on a universe
of size at most (a + b), and the predecessor of x in S determines the output
of the tree pointer chasing problem.
If the tree is of depth k > 1, with Alice knowing the edge coming
out of the root, we first compute x0, . . . , xa−1, S0, . . . , Sa−1, which
are the inputs to predecessor search determined by the a subtrees
of depth k − 1 that lie just below the root. If i ∈ {0, 1, . . . , a − 1}
corresponds to the edge coming out of the root, we set

    x = i · t + xi,

and

    S = ∪_{i=0}^{a−1} {i · t + y : y ∈ Si},

where t = (a + b)^(b^(k−1)). The new universe is of size at most
a · (a + b)^(b^(k−1)) ≤ (a + b)^(b^k), and |S| is at most a · a^(k−1) = a^k.

Figure 10.4: Using the subtrees to combine several predecessor search instances.

If the edge touching the root belongs to Bob, we compute x0, . . . , xb−1
and S0, . . . , Sb−1 using the b subtrees of depth k − 1. If i ∈ {0, 1, . . . , b − 1}
corresponds to the edge touching the root, we set

    x = ∑_{j=0}^{b−1} xj · t^(b−j−1),

and

    S = { ∑_{j=0}^{i−1} xj · t^(b−j−1) + y · t^(b−i−1) : y ∈ Si },

where t = (a + b)^(b^(k−1)). Since Bob knows x0, . . . , xi−1, S can be
computed by Bob. The size of the universe in this case is at most
((a + b)^(b^(k−1)))^b ≤ (a + b)^(b^k).

This corresponds to writing x in base t with the digits x0, x1, x2, . . . , xb−1,
and setting the y’th element of S to be x0, x1, . . . , xi−1, y, 0, . . . , 0, where y ∈ Si.
In both cases, the predecessor of x in S determines the relevant
predecessor of xi in Si , and hence determines the output of the tree
pointer chasing input.

Lower bounds on Dynamic Data Structures

A dynamic data structure is one that allows for both efficient


updates on the data and queries on the data. The union-find data
structure, the Van Emde Boas tree, heaps and binary search trees are
all examples of dynamic data structures. In this section, we develop
methods to prove lower bounds on such data structures. A dynamic
data structure has four main parameters:

Unlike for static data structures, not all of the methods used to prove lower
bounds on dynamic data structures involve reductions to communication
complexity. Nevertheless, intuitions from understanding the role of
information will play a role here as well.

Number of cells s This is the total number of memory cells used to
store the data.

Word size w This is the total number of bits of memory available in
each cell.

Update time tu This is the number of cells that need to be accessed to
update the data.

Query time tq This is the number of cells that need to be accessed to
query the data structure for information.

Some practical data structures do not give good bounds on the worst case
update and query times, but do give good bounds on amortized update and
query times. For example, a common scheme is to use hashing to maintain a
small subset S ⊆ [u] of a large universe. After many operations, the size of
the set S may exceed the capacity of the hash function to effectively avoid
collisions. In this case the data structure rehashes the entire space using a
less efficient hash function that avoids collisions for larger sets. The rehashing
operation can be very expensive, but it needs to be performed infrequently,
and the time complexity of the data structure per update/query remains
small. The techniques developed in this section actually do prove lower
bounds even on the amortized time complexity of data structures, though we
do not discuss this in the text.

We allow data structures to make errors. The data structure al-
gorithm can be randomized, making decisions using random coin
tosses. We say that the error of the data structure is e if for every
sequence of updates followed by a single query, the probability that
the query is computed correctly is at least 1 − e.

If the size of a data structure is very large, one can always use
hashing to reduce the size. We use randomness to pick a random
hash function h : [s] → [n²(tu + tq)²/e], and simulate access to the
cell i with the cell h(i). If there are at most n operations during the
execution of the data structure, the chance that two distinct cells are
hashed to the same cell is at most e, so this adds an error of at most
e.

Lemma 10.6. n operations of a dynamic data structure with parameters
s, tu, tq, w and error e can always be simulated by another data structure
with space n²(tu + tq)²/e′, update time tu, query time tq, word size w and
error e + e′.

Prefix Sum and Maintaining a Sorted List of Numbers


Suppose we want to maintain a set S ⊆ [u], of size at most n, and 6
Fredman and Saks, 1989; and Patrascu
want to be able to add and delete elements from the set, as well as and Thorup, 2014

compute the i’th element of the set in sorted order6 . We prove:



Theorem 10.7. Any data structure maintaining a sorted list of numbers


with error at most e for n operations must satisfy:
 
tu (w + log s)
tq · log ≥ Ω((1 − h(e)) log n).
1 − h(e)
Here h(e) = e log(1/e) + (1 −
We first prove a lower bound for an even easier task called prefix e) log(1/1 − e) is the entropy of a
sum. Suppose we want to maintain a binary string x ∈ {0, 1}n , which bit with probability e.
is initially set to 0, and allowing for updates that replace the value of
In particular, if tu , and w are polyloga-
xi with 1 − xi for any i, and queries that compute p( j) = ∑ij=1 x j . We rithmic in n, and e < 1/2 is a constant,
prove7 : then Lemma 10.6 asserts that s can be
assumed to be polynomial in n. This
Theorem 10.8. Any data structure correctly computing prefix sum of an n gives tq ≥ Ω(log n/ log log n).
bit string with error e satisfies
7
Fredman and Saks, 1989
 
tu (w + log s)
tq · log ≥ Ω((1 − h(e)) log n).
1 − h(e)
This result applies even if the data
Before proving Theorem 10.8, we show how to use it to prove a structure is only required to compute
lower bound for maintaining a sorted list of numbers. We can use ∑ij=1 xi mod 2.
any data structure for maintaining a sorted list of numbers to solve
the prefix sum problem as follows. We initialize our set of numbers
to be S = { j ∈ [n2 ] : j 6= 0 mod n}. We also maintain a dictionary
Recall that a dictionary allows one to
with S. Whenever we wish to flip the value of xi , we check if the add and delete elements to S and check
number i · n ∈ S. If it is, we delete it from the set. If i · n ∈
/ S, we whether i ∈ S in constant time, and
j space poly(n).
add it to S. Then ∑i=1 xi = jn − 1 − S j(n−1) , where Sr denotes the
r’th number in the set S. Thus Theorem 10.8 can be used to show
Theorem 10.7.

Proof of Theorem 10.8. To prove the lower bound, we use a particular


distribution on sequences of updates. For parameters k, r, we set
n = kr . For j = 0, 1, 2, . . . , r, define

S j = { a · k j + 1 : a ∈ {0, 1, . . . , kr− j − 1}}.

S j consists of kr− j uniformly spaced numbers in [n]. Now consider

Figure 10.5: An example of the sets S j


when k = 2, r = 5.

2 S0 2 S0 2 S0 2 S0 2 S0 2 S0
2 S1 2 S1 2 S1 2 S1 2 S1
2 S2 2 S2 2 S2 2 S2
2 S3 2 S3 2 S3
2 S4 2 S4
2 S5

the sequence of updates that consists of r + 1 rounds. In the j’th round



we pick a uniformly random subset Tj ⊆ S j . For every i ∈ Tj , we


flip the value of xi using the data structure. At the end of these r + 1
rounds of updates, we pick a uniformly random coordinate L ∈ [n]
and compute ∑iL=1 xi using the data structure. We shall compute
the expected number of queries the data structure needs to make to
correctly compute the prefix sum. By fixing the randomness used by
the data structure, we can assume that it is deterministic, and makes
an error on at most e fraction of the sequences of updates and queries
in the distribution we have defined.
Say that a cell of the data structure belongs to round j if it was
last touched when Tj was added. The number of cells belonging to
rounds i > j is at most

r r − j −1
kr − j − 1
tu · ∑ kr −s = tu · ∑ ks = tu · ≤ 2tu kr− j−1 .
s = j +1 s =0
k−1

r− j Cells that belong to rounds i < j


Let A ∈ {0, 1}k be the indicator vector for the set Tj ⊆ S j : namely have no information about Tj , and
Ai is 1 if and only if the i’th element of S j is included in Tj . A is a the number of cells that belong to
uniformly random binary string. Let B = T1 , T2 , . . . , Tj−1 , Tj+1 , . . . , Tr , rounds i > j are so few that do
not have information about most of
and let C denote the locations and contents of all the cells that belong the coordinates in Tj . We shall use
to rounds i > j. Since C can be described using at most 2kr− j−1 · tu · this intuition to argue that the data
structure must query at least one cell
(w + log s) bits, we have: that belongs to round j, for every j.

H ( A | BC ) ≥ H ( A | B) − H (C | B) by subadditivity of entropy
r− j r − j −1
≥k − 2k tu (w + log s) since the number of bits
  needed to describe Z is at most
r− j 2tu (w + log s) 2kr− j−1 · tu · (w + log s)
≥k 1− .
k

By the chain rule, we get:

kr − j  
r− j 2tu (w + log s)
∑ H ( A i | A <i BC ) ≥ k 1 −
k
. (10.1)
i =1

Let Q be the random variable that is 1 if the data structure queries


a cell that belongs to round j, and 0 otherwise. For each fixing of
B = b, C = c, L = l, let eb,c,l denote the random variable whose value
is the probability that the data structure makes an error given these
values of B, C, L. Finally, let I be such that I ≤ L, yet I + kr− j > L.
Let h(e) = e log(1/e) + (1 − e) log(1/(1 − e)) be the binary entropy
function.
Then for any fixing of B, C, L, either Q = 1, or H ( A I | A< I ) ≤
h(eB,C,L ). Thus we get:

H ( A I | A< I LBC ) ≤ E [ Q + h(eB,C,L )]


LBC
≤ E [ Q] + h(E [eB,C,L ]) ≤ E [ Q] + h(e). by convexity of the entropy function.

4t (w+log s)
Combining this bound with (10.1), and setting k = u1−h(e) we get
that the probability that a cell belonging to round j is queried is at
least
2tu (w + log s) 1 − h(e)
E [ Q] ≥ 1 − − h(e) =
k 2
The expected number of cells queried is then at least:

1 − h(e) log n 1 − h(e) log n


r· ≥ · , since r = logk n = log k
2 4tu (w+log s)
log( 1−h(e) ) 2

as required.

Graph Connectivity

Efficient algorithms that maintain graphs are widely used


in computer science. These provide another source of basic data
structure questions.
Suppose we want to store a graph on n vertices using a small
amount of space such that we can later quickly answer whether
two given vertices are connected or not. A trivial solution is to store
the adjacency matrix of the graph, and then perform a breadth first
search using this matrix. A better solution is to store a vector in [n]^n
which stores the name of the connected component that each vertex
belongs to. This can be stored with n words of size log n, and now
connectivity for two vertices can be computed by two queries to the
data structure.

If we want to maintain a graph while allowing edges to be added,
we can do this using the union-find data structure (Figure 10.6). Each
vertex is associated with a constant number of memory locations.
These locations store a pointer to another element that is in the same
connected component, as well as the length of the longest path that
ends at that vertex. The pointers are maintained so that the induced
trees are always balanced, and so their depth is at most O(log n).
Initially, when the graph has no edges, each connected component
has a single vertex.

Figure 10.6: Maintaining a partition of the universe into 3 sets using the
union-find data structure. Each cell contains the length of the longest
sequence of pointers ending at that cell, the name of an element of the
universe, and (potentially) a pointer to another cell representing a different
element of the same set.

The union-find data structure plays a key role in the fastest algorithms for
computing the minimum spanning tree of a graph.

To check if two vertices are in the same component, we only need
to check if the roots of the trees are the same, which takes at most
O(log n) steps if the trees are balanced. When an edge is added, if
the two vertices of the edge are contained in the same connected
component, nothing needs to be done. Otherwise the two connected
components are merged by adding a pointer from the root of the
shallow component to the root of the deeper component, which
ensures that the trees of the data structure remain balanced. This
gives a simple data structure of size O(n log n) where each operation
takes O(log n) time.
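A minimal sketch of this union-find structure is below; it is only an illustration of the balancing rule described above (hang the shallower tree under the deeper one), without the path-compression optimizations used in practice.

```python
# Union-find with union by depth: connectivity queries and edge insertions
# each follow at most O(log n) parent pointers.

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.depth = [0] * n             # length of the longest path ending here

    def root(self, v):
        while self.parent[v] != v:
            v = self.parent[v]
        return v

    def connected(self, u, v):
        return self.root(u) == self.root(v)

    def add_edge(self, u, v):
        ru, rv = self.root(u), self.root(v)
        if ru == rv:
            return                        # already in the same component
        if self.depth[ru] < self.depth[rv]:
            ru, rv = rv, ru
        self.parent[rv] = ru              # hang the shallower tree under the deeper
        if self.depth[ru] == self.depth[rv]:
            self.depth[ru] += 1
```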

Connectivity in Sparse Directed Graphs


We have seen that one can solve the static graph connectivity problem
on n vertices with t = 2, s = n, w = log n. Here we consider the same
problem in directed graphs.
A trivial solution is to store whether or u is connected to v for
every pair (u, v). This gives t = 1, s = n2 , w = 1. In general one
cannot do much better than this. Indeed, set A, B be two disjoint
sets of vertices, each of size n/2, and consider the graphs where all
2
edges go from A to B. There are 2n /4 such graphs, and the data
structure must distinguish all of them, since the queries to the data
structure can reconstruct all edges in such graphs. Thus we must
have sw ≥ n2 /4.
The problem becomes more interesting when we consider sparse
graphs. What if we are guaranteed that every vertex only has A B
O(log n) edges coming out of it? Is there a data structure solving
graph connectivity with s = O(n log n) and t, w = O(1)? Here we
show that such a data structure does not exist8 :
Theorem 10.9. In any static data structure solving the directed graph
connectivity problem on graphs with at most O(n log n) edges with word
size w = log n, we must have
 
log n
t ≥ Ω .
s log2 n
log( n )
To prove Theorem 10.9, we consider a special family of sparse
directed graphs: subgraphs of the butterfly graph.
A (d, B) butterfly graph is a layered graph where each vertex
corresponds to a tuple (i, u) ∈ [d + 1] × [B]^d. There is an edge from
(i, u) to (j, v) if and only if j = i + 1, and u_t = v_t for all coordinates
t ≠ i. Every vertex in the graph has at most B edges coming out of
it. We shall set B ≈ log n, and d ≈ log n/ log log n to prove the lower
bound.

8 Pătraşcu, 2011

Theorem 10.9 implies that if t is a constant, then s = n^(1+Ω(1)).
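The edge structure is easiest to see by generating it; the sketch below is just an illustration of the definition, listing the edges of a (d, B) butterfly graph.

```python
# Vertices are pairs (i, u) with i in {1, ..., d+1} and u in [B]^d; layer i is
# connected to layer i+1 between tuples that agree outside coordinate i.

from itertools import product

def butterfly_edges(d, B):
    edges = []
    for i in range(1, d + 1):
        for u in product(range(1, B + 1), repeat=d):
            for b in range(1, B + 1):
                v = u[:i - 1] + (b,) + u[i:]   # change only coordinate i
                edges.append(((i, u), (i + 1, v)))
    return edges

# With d = 3 and B = 2 (as in Figure 10.7), every vertex in the first three
# layers has exactly B = 2 outgoing edges.
```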
For any u ∈ [B]^d, let u−i denote u after deleting the i’th coordinate:
u−i = (u1, u2, . . . , ui−1, ui+1, . . . , ud), and for every i, u−i, let Si,u−i be a
set of size B, identified with [B].

Figure 10.7: The butterfly graph with d = 3, B = 2.

Suppose Alice and Bob are given subsets X, Y ⊆ ∪_{i,u−i} Si,u−i, such
that for every i, u−i, |X ∩ Si,u−i| = 1. So k = |X| = dB^(d−1), while the
size of Y could be as large as dB^d. We shall show how Alice and Bob
can use the data structure to solve the lopsided disjointness problem
with inputs X, Y.

Alice uses her set X to construct the graph GX , which will be a


subgraph of the butterfly graph. (i, u) and (i + 1, v) are connected
in GX if and only if there is w ∈ X ∩ Si,u−i such that u−i = v−i , and
vi + ui = w mod B. The same edge is included in GY if and only if
for every w ∈ Y ∩ Si,u−i , vi + ui 6= w mod B.
Since X intersects Si,u−i in exactly one element, the graph GX
consists of Bd vertex disjoint paths from the first layer of the butterfly
graph to the last. Moreover, X intersects Y if and only if one of the
edges of GX is not present in GY . Thus Alice can determine if this is
true by executing Bd−1 queries on the data structure, by picking Bd−1
representative paths and querying whether their end points remain
connected in GY .
This simulation gives a protocol where Alice sends log ( Bds−1 ) bits
in each round, and Bob responds with wBd−1 bits in each round.
Theorem 1.19 implies that
   
s d −1 n
t log d 1
+ wB ≥ s . (10.2)
B −
2 t log ( B d −1
) / (dBd−1 )
+1
log n log n
Setting B = log n, we get that ≥ d ≥log log n 2 log log n , and so
 
n log log n log n
B d −1 = n/(dB) ≥ 2 . Now if t = o and
log n log(s log2 n/n)
w = log n, then the left hand side of (10.2) is at most
    !
s es n 
t log + log n · Bd−1 ≤ t Bd−1 log + since (ba) ≤ ea b
B d −1 B d −1 2
log n
b

!
n log log n es log2 n n
≤t log +
log2 n n log log n log2 n
= o (n log log n/ log n) .

On the other hand,


 
s s
t log d 1
/(dBd−1 ) ≤ (t/d) log d−1 since (ba) ≤ ab
B − B
t log log n s log2 n
≤ · log = o (log log n).
log n n
n
So the right hand side of (10.2) is at least , which is larger than
log0.5 n
n log log n/ log n for large enough n, a contradiction.

Lower bound for Dynamic Graph Connectivity


As we saw at the beginning of the
In the graph connectivity problem, the data structure is required to chapter, the union-find data structure
maintain a graph on the vertex set [n], supporting the addition of can be used to solve this problem with
parameters s = O(n), w = log n, tu =
edges, as well as queries that compute whether or not two vertices O(log n), tq = O(log n).
u, v are connected or not in the graph.
Here we prove9 :

Theorem 10.10. Any data structure solving the graph connectivity problem
with error e must satisfy:
$$t_q \cdot \log\frac{t_u(w + \log s)}{1 - h(e)} \;\geq\; \Omega\big((1 - h(e)) \log n\big).$$
Setting w = log n, s = poly(n), e < 1/2 to be constant, and t_u = polylog(n), we
get that t_q ≥ Ω(log n/ log log n).
Proof. For parameters k, r, we sample a uniformly random graph that
consists of two disconnected k-ary trees of depth r. The number of
vertices in such a graph is n = 2(kr+1 − 1)/(k − 1). We add the edges
of this graph to the data structure in r rounds. In the first round,
we add all the edges at depth r. In the j’th round, we add the edges
at depth r − j + 1 to using the data structure. Finally, we pick two
random leaves from the graph and query whether or not they are
connected.
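To make the hard distribution concrete, here is a small Python sketch (not from the book) that generates such an instance for illustrative parameters k and r; the node names and the exact encoding of the insertion rounds are arbitrary choices made for this sketch.

    import random

    # A toy sketch of the hard distribution in the proof of Theorem 10.10:
    # two disjoint k-ary trees of depth r, with edges revealed bottom-up in
    # r rounds, followed by a connectivity query on two random leaves.
    k, r = 3, 4

    def build_tree(root):
        nodes, edges_by_depth = [[root]], []
        for depth in range(1, r + 1):
            layer, round_edges = [], []
            for parent in nodes[-1]:
                for _ in range(k):
                    child = (root, depth, len(layer))
                    layer.append(child)
                    round_edges.append((parent, child))
            nodes.append(layer)
            edges_by_depth.append(round_edges)
        return nodes, edges_by_depth

    nodes0, by_depth0 = build_tree("root0")
    nodes1, by_depth1 = build_tree("root1")

    # Round j = 1, ..., r inserts the edges at depth r - j + 1 (deepest first).
    rounds = [by_depth0[r - j] + by_depth1[r - j] for j in range(1, r + 1)]
    leaves = nodes0[-1] + nodes1[-1]
    x, y = random.sample(leaves, 2)
    answer = (x[0] == y[0])      # connected iff the two leaves share a root
    print(sum(len(rd) for rd in rounds), "edge insertions; query answer:", answer)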
Say that a cell of the data structure belongs to round j if it was
last touched in round j of the updates. Let B denote all edges not
added to the graph in the j’th round. Let C denote the the locations
and contents of all cells that belong to rounds i > j. After fixing
B, the roots of the two trees have been fixed, and the identities of
r − j +1
the leaves have also been determined. Let A ∈ {0, 1}2k be the
random variable which has a bit for each vertex v in the graph at
depth r − j + 1, such that Av = 0 if v is connected to the first tree in
the graph, and Av = 1 if v is connected to the second tree.
Then we see that
$$H(A|B) = \log\binom{2k^{r-j+1}}{k^{r-j+1}} \geq 2k^{r-j+1} - 1 - \log\sqrt{k^{r-j+1}}, \qquad \text{using the fact that } \binom{2a}{a}\geq 2^{2a-1}/\sqrt{a}$$

since after fixing B, the vertices at depth r − j + 1 are in 2kr− j+1


distinct connected components, and exactly half of these connected
components will be connected to the first tree, and the rest will be
connected to the second tree.
The number of edges added after the j’th round is
$$\sum_{i=j+1}^{r} 2k^{r-i+1} = 2\cdot\sum_{i=0}^{r-j} k^i = 2\cdot\frac{k^{r-j+1}-1}{k-1} \leq 4k^{r-j}. \qquad \text{since } k-1\geq k/2$$

By subadditivity, we get
$$H(A|BC) \geq H(A|B) - H(C|B)$$
$$\geq 2k^{r-j+1} - \log\sqrt{k^{r-j+1}} - 1 - 4k^{r-j}\,t_u(w+\log s)$$
$$\geq 2k^{r-j+1} - 6k^{r-j}\,t_u(w+\log s) \qquad \text{since } \log k^{(r-j+1)/2} \leq k^{(r-j+1)/2} \leq k^{r-j}$$

Figure 10.8: The edges of B when k = 3, r = 5, j = 2.
Let U, V be two random vertices at depth r − j + 1 in the graph. For
any fixed vertex v, we have that
$$\Pr[v \in \{U, V\}] = 1 - \left(1 - \frac{1}{2k^{r-j+1}}\right)^2 = \frac{1}{k^{r-j+1}} - \left(\frac{1}{2k^{r-j+1}}\right)^2.$$

Applying Shearer’s lemma (Lemma 6.5), we conclude that
$$H(A_U, A_V \mid UVBC) \geq \left(\frac{1}{k^{r-j+1}} - \left(\frac{1}{2k^{r-j+1}}\right)^2\right)\cdot\left(2k^{r-j+1} - 6k^{r-j}\,t_u(w+\log s)\right)$$
$$\geq 2 - \frac{1}{2k^{r-j+1}} - \frac{6t_u(w+\log s)}{k} \geq 2 - \frac{7t_u(w+\log s)}{k}. \qquad \text{since } k^{r-j+1}\geq k$$
k
Let Q be 1 if the data structure queries a cell that belongs to round
j, and 0 otherwise. Let X, Y be two random leaves that queried
for connectivity. For each fixing of B, these leaves correspond to
two vertices U, V at depth r − j + 1 that we wish to compute the
connectivity of. For each fixing of X, Y, B, C, let eX,Y,B,C denote the
probability that the data structure makes an error in answering
the query. If the data structure makes a query to a cell belong-
ing to round j, then H ( AU , AV ) ≤ 1 + Q, and if it does not then
H ( AU , AV ) ≤ H ( AU ⊕ AV ) + H ( AU ) ≤ 1 + h(eX,Y,B,C ), since the
parity of A_U, A_V can have at most h(e_{X,Y,B,C}) bits of entropy. Thus we
get:
$$H(A_U, A_V \mid XYBC) \leq \mathop{\mathbb{E}}_{X,Y,B,C}\left[1 + Q + h(e_{X,Y,B,C})\right]$$
$$\leq \mathop{\mathbb{E}}[1 + Q] + h\left(\mathop{\mathbb{E}}_{X,Y,B,C}[e_{X,Y,B,C}]\right) \qquad \text{by concavity of } h$$
$$= 1 + \mathbb{E}[Q] + h(e).$$

Thus,
$$\mathbb{E}[Q] \geq 1 - h(e) - \frac{7t_u(w+\log s)}{k}.$$
We set $k = \frac{14t_u(w+\log s)}{1-h(e)}$, to get E[Q] ≥ (1 − h(e))/2. Thus the expected
number of queries made must be at least
$$r\cdot(1-h(e))/2 \;\geq\; \frac{\log n}{\log\frac{14t_u(w+\log s)}{1-h(e)}}\cdot(1-h(e))/2,$$

as required.

Dictionaries

Suppose we want to maintain a subset S ⊆ [u] with a data
structure, allowing for updates of the form add(i), delete(i), which
add and delete the element i from the set S. We would also like to
ask queries of the form member(i) which should compute whether or
not i ∈ S.
We could maintain a string x_S ∈ {0, 1}^u that is the indicator vector
of S: (x_S)_i = 1 if and only if i ∈ S. This allows us to maintain the set
and ask membership queries in time O(1), but requires u memory cells.
A more efficient alternative is to use hashing: pick a random function
h : [u] → [n^2/e]. To add i, set x_{h(i)} = 1, and to delete i, set x_{h(i)} = 0.
This data structure uses only n^2/e cells of memory, and if at most n
operations are involved, the probability that this data structure makes
an error is at most e.
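A minimal sketch (not from the book) of this hashing scheme in Python. The function H below lazily samples a uniformly random h : [u] → [n²/e]; in a real data structure h would be fixed in advance, and the parameter values are hypothetical.

    import random

    u, n, eps = 10**6, 100, 0.1
    m = int(n * n / eps)                  # number of memory cells
    h = {}                                # lazily sampled random function h : [u] -> [m]
    def H(i):
        if i not in h:
            h[i] = random.randrange(m)
        return h[i]

    x = [0] * m                           # the memory cells
    def add(i):    x[H(i)] = 1
    def delete(i): x[H(i)] = 0
    def member(i): return x[H(i)] == 1

    add(17); add(42)
    print(member(17), member(99))         # True False, unless 99 happens to collide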
As far as we know, it is possible that there is a deterministic data
structure solving the dictionary problem using word size log u, space
O(n), and carrying out updates and queries in O(1) time. It is a
tantalizing open problem to prove that such a data structure does not
exist!

Open Problem 10.11. Prove that there is no deterministic data structure


that can maintain a dictionary S ⊂ [n] using space O(|S|), word size
O(log n), and time O(1).

2d range counting Chazelle

Exercise 10.1
Modify the Van Emde Boas tree data structure so that it can maintain
the median of n numbers, with time O(log log n) for adding, deleting
and querying the median.
11
Extension Complexity of Convex Polytopes

A convex polytope is a subset of Euclidean space that can be A set S is convex if whenever x, y ∈ S,
defined by a finite number of linear inequalities. Any n × d matrix A the line between x, y is also in S. We
see from the definition that P is always
and an n × 1 vector b define the polytope
convex.

P = { x ∈ Rd : Ax ≤ b}.

Polytopes are fundamental geometric constructs that have been stud-


ied by mathematicians for centuries. In this chapter, we explore some
questions about the complexity of representing polytopes. When can a
complex polytope be expressed as the shadow of a simple polytope?
If the polytope has dimension d0 , we can always project it down to
an affine space of dimension d0 . So for the rest of the discussion, we
shall assume that the polytope has the same dimension as the space
that it lives in. Doing so will make it easier to define the geometric
components of a polytope.
A halfspace is a polytope defined by a single inequality: it’s a Figure 11.1: A cube in R3 can be
polytope H = { x ∈ Rd : Ax ≤ b}, where A is a 1 × d matrix. A face of defined by the 6 inequalities 0 ≤ x ≤
the polytope P is a set of the type F = P ∩ H, where H is a halfspace 1, 0 ≤ y ≤ 1, 0 ≤ z ≤ 1.

that only intersects P on its boundary. A face must have dimension For example, if the polytope is defined
less than the polytope, since the polytope has dimension d. When the by the 3 × d matrix A by Ax ≤ b, and
A1 + A2 = A3 , b1 + b2 = b3 , then the
dimension of the face is exactly one less than the dimension of the third inequality is implied by the first
polytope itself, we call the face a facet. If the polytope is defined by two, and the polytope can have at most
2 facets.
n inequalities, it can have at most n facets, though it may have fewer
facets than inequalities.

Fact 11.1. Every face of P can be expressed as the intersection of some subset
of the facets of P.

A vertex of the polytope is a face that is a set of size 1.

Fact 11.2. v ∈ P is a vertex of P if and only if there are no distinct u, w ∈ P
such that v = (u + w)/2.
Figure 11.2: A polytope in the plane with 5 facets and 5 vertices.

Relationships between Polytopes


Often it is easier to describe a polytope as the projection of a poly-
tope in higher dimensions. For example, given a set of points
v1 , . . . , vk ∈ Rd , the convex hull of these points is the set of points
x ∈ Rd satisfying

x = V·µ
µi ≥ 0 for i = 1, 2, . . . , k
k
∑ µi = 1 ,
i =1

where here V is the d × k matrix whose columns are v1 , . . . , vk , and


µ ∈ Rk . These inequalities describe a polytope Q, which when
projected onto x gives the convex hull P of the points v1 , . . . , vk .
Similarly, we define conical hull of these points to be the set of x ∈ Rd
satisfying

x = V·µ
µi ≥ 0 for i = 1, 2, . . . , k

where here V is the d × k matrix whose columns are v1 , . . . , vk , and


µ ∈ Rk .
In general, we claim that if Q ⊆ Rk is a polytope and π : Rk → Rd
This process is called Fourier-Motzkin
is a linear map, then π ( Q) is also a polytope. Suppose Q = {( x, z) :
elimination.
Ax + zc ≤ b} ⊆ Rk , where x ∈ Rk−1 , z ∈ R, and P is the projection of
Q onto the first k − 1 variables. There are three types of inequalities in
Q. When the coefficient of z is 0 the inequality can be expressed as:

E( x ) ≤ 0,

when the coefficient of z is positive, we get an inequality of the type

U ( x ) ≥ z, ⇔ z − U ( x ) ≤ 0.

and when the coefficient is negative, we get an inequality of the type

L( x ) ≤ z, ⇔ L( x ) − z ≤ 0.

where here E, U, L are affine maps. Every inequality E( x ) ≤ 0 in- An affine map is a function of the type
F ( x ) = a0 + ∑ik=1 ai xi .
duces the same inequality for P. Every pair of inequalities L, U
induces the inequality

L ( x ) ≤ U ( x ). ⇔ L( x ) − U ( x ) ≤ 0.

We claim that the inequalities obtained in this way actually define


P. Every point of P satisfies all the inequalities by definition. On the
other hand, if x is a point satisfying all of these inequalities, we must

have max L L( x ) ≤ minU U ( x ), so setting z = max L L( x ) gives a point


( x, z) ∈ Q, proving that x ∈ P.
This reasoning gives a characterization of the inequalities that
One can prove that there are only 2n
apply to a projection: such inequalities by observing that each
derived inequality is determined by the
Fact 11.3. If a polytope P is the projection of a polytope Q = { x : Ax ≤ b} set of original inequalities involved in
which has n inequalities, then P can be described by at most 2n inequalities deriving it.

obtained by taking non-negative linear combinations of the inequalities of Q.

One consequence of Fact 11.3 is that if P ⊆ Q are two convex


polytopes, then the facets of Q are implied by the facets of P via
non-negative linear combinations:

Lemma 11.4. If P = { x : Ax ≤ b}, and a · x ≤ c is a valid inequality for P,


then there are numbers µi ≥ 0 such that a = ∑i µi Ai and c ≥ ∑i µi bi .

Proof. Consider the set R = { a · x : x ∈ P}. R is a projection of P and


so is a convex polytope, and so R = {y : β ≤ y ≤ α}, where here
β, α may be −∞, ∞. Since y ≤ c for every y ∈ R, there must be some
facet y ≤ α of R, with α ≤ c. By Fact 11.3, the inequality y ≤ α can be
derived as a positive linear combination of inequalities of P, giving
the linear combination we seek.

Algorithms from Polytopes


Besides being fundamental geometric objects, polytopes are useful
from the perspective of algorithm design, because many interesting
computational problems can be reduced to the problem of optimiz-
Figure 11.3: A projection of the cube.
ing a linear function over a specific polytope1 . We illustrate this
with some examples of how to reduce combinatorial problems to
1
Wikipedia, 2016a

optimizing linear functions over polytopes.

Shortest Paths Say we want to compute the shortest path between


any two vertices of an n-vertex directed graph. For every pair of
The variable x{u,v} measures the flow
distinct vertices u, v ∈ [n], define the variables x{u,v} , and define from u to v.
the path polytope P using the equations:

$$x_{\{u,v\}} \geq 0 \quad \text{for every } u \neq v \qquad \text{the flow is always nonnegative}$$
$$\sum_{w} x_{\{u,w\}} = \sum_{w} x_{\{w,u\}} \quad \text{for every } u \neq 1, n \qquad \text{the flow into an intermediate vertex is equal to the flow coming out}$$
$$\sum_{w} x_{\{1,w\}} = 1 \qquad \text{the flow out of 1 is 1}$$
$$\sum_{w} x_{\{w,n\}} = 1 \qquad \text{the flow into n is 1}$$
These equations define a polytope P ⊆ R^d, with $d = \binom{n}{2}$. P has at
most $\binom{n}{2}$ facets, since there are $\binom{n}{2}$ inequalities needed to define it.

Now given a graph with the edge set E, consider the problem of
finding the point in the path polytope that minimizes the linear
function
$$L(x) = \sum_{\{u,v\}\notin E} n\cdot x_{\{u,v\}} + \sum_{\{u,v\}\in E} x_{\{u,v\}}.$$

Claim 11.5. minx∈ P L( x ) is achieved by picking the shortest path from 1


to n in the graph, and assigning the edges corresponding to this path the
value 1, and all other edges the value 0.

Proof. Given any x ∈ P, if x puts non-zero weight on an edge that
is not on the shortest path, then either x puts non-zero weight on

is not on the shortest path, then either x puts non-zero weight on
all the edges of a cycle, or it puts non-zero weight on some path
from 1 to n that is not a shortest path. If it puts weight on a cycle,
we can simply reduce the weight of the cycle to lower L( x ). In the
second case we can reduce the weight of all of the edges of the
path by a tiny amount, and add the reduced amount to the edges
on the shortest path of the graph. This transformation keeps us
inside the polytope P, and cannot increase the value of L( x ).

So minx∈ P L( x ) is the length of the shortest path in the graph.
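As an illustration (not from the book), the following sketch builds this linear program for a small hypothetical graph and solves it with scipy; it uses one variable per directed pair (u, v), a harmless variant of the formulation above, and the edge set E is just an example.

    import itertools
    from scipy.optimize import linprog

    n = 4
    pairs = list(itertools.permutations(range(1, n + 1), 2))
    idx = {p: i for i, p in enumerate(pairs)}
    E = {(1, 2), (2, 4), (1, 3), (3, 2)}     # shortest 1 -> 4 path uses 2 edges

    # Objective L(x): edges of the graph cost 1, non-edges cost n.
    c = [1.0 if p in E else float(n) for p in pairs]

    A_eq, b_eq = [], []
    for u in range(2, n):                    # flow conservation at intermediate vertices
        row = [0.0] * len(pairs)
        for w in range(1, n + 1):
            if w != u:
                row[idx[(u, w)]] += 1.0      # flow out of u
                row[idx[(w, u)]] -= 1.0      # flow into u
        A_eq.append(row); b_eq.append(0.0)
    row = [0.0] * len(pairs)                 # one unit of flow leaves vertex 1
    for w in range(2, n + 1):
        row[idx[(1, w)]] = 1.0
    A_eq.append(row); b_eq.append(1.0)
    row = [0.0] * len(pairs)                 # one unit of flow enters vertex n
    for w in range(1, n):
        row[idx[(w, n)]] = 1.0
    A_eq.append(row); b_eq.append(1.0)

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    print(res.fun)                           # 2.0: the length of the shortest path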


Matchings Suppose we wish to find the size of the maximum match-
ing in a graph. For every pair u, v ∈ [n] we have the variable x{u,v} .
Each matching in the complete graph corresponds to the points
where x{u,v} = 1 if u, v are matched and x{u,v} = 0 if they are not.
The convex hull of these matching is called the matching polytope,
M. One can prove that the facets of the polytope correspond to the 2
Edmonds, 1965
inequalities2 :

$$x_{\{u,v\}} \geq 0 \quad \text{for all distinct } u, v$$
$$\sum_{v} x_{\{u,v\}} \leq 1 \quad \text{for all } u \qquad \text{the number of edges touching u is at most 1}$$
$$\sum_{u\neq v\in A} x_{\{u,v\}} \leq \frac{|A|-1}{2} \quad \text{for all } A\subseteq[n] \text{ with } |A| \text{ odd} \qquad \text{an odd set can contain at most } \lfloor |A|/2\rfloor \text{ edges}$$

Given a graph defined by a set of edges E, define the function
$$L(x) = \sum_{\{u,v\}\in E} x_{\{u,v\}} - \sum_{\{u,v\}\notin E} x_{\{u,v\}}.$$

We claim that the size of the largest matching in the graph is the
same as maxx∈ M L( x ). Indeed, the largest matching in the graph
itself has value equal to its size. On the other hand, since every
point of the polytope is a convex combination of matchings, if
there is a point in the polytope that achieves a larger value, then
there is a matching that achieves a larger value. Such a matching
cannot have x_{\{u,v\}} > 0 for any {u, v} ∉ E, since setting x_{\{u,v\}} = 0
would give a larger value for L(x). Unfortunately, the matching
polytope has an exponential number of facets, so we cannot use
linear programming to find the size of the largest matching in
polynomial time this way.
We shall see in Exercise 11.4 that one can come up with O(n^2) equations to work with the bipartite matching polytope.
over the polytope scales with the number of facets that the polytope
has. So we are interested in solving such problems using polytopes
that have a small number of facets. Our measure of the complexity of
a polytope is the number of facets that it has.

Extension Complexity of a Polytope

Given that the complexity of solving optimization problems


on polytopes is related to the number of facets of the polytope, it
is important to find polytopes that have a small number of facets
that encode computational problems. A generic way to do this is via
extensions of the polytope. A polytope Q ⊆ R^{d'} is an extension of a
polytope P ⊆ R^d if there is a linear map π : R^{d'} → R^d such that
π(Q) = P.
The extension complexity of P is the minimum number of facets
achieved by any extension of P. The definition makes sense because
there are polytopes P whose extension complexity is less than the
number of facets of P. See Figure 11.4 for an example.

Fact 11.6. If a polytope P has n faces, its extension complexity must be at
least log n.

Proof. Suppose P is the projection of a polytope Q. Then every face
of P corresponds to the projection of a face of Q, and every face of Q
can be derived by intersecting some subset of the facets of Q. So if Q
has k facets, P can have at most 2^k faces, as required.
Figure 11.4: A polytope with 6 facets that projects down to a polytope with 8 facets.
Next, we explore some examples of natural polytopes that have
small extension complexity.
By applying a suitable linear transformation to Q if necessary, we can always ensure that the projection map π projects Q onto the first d coordinates: namely that π(x_1, . . . , x_{d'}) = (x_1, . . . , x_d).

Regular Polygons Suppose we are given a polytope P ⊆ R^2 in the
plane, with n = 2^k facets. One can show3 that there are such
n-gons whose extension complexity is at least Ω(√n). Here we
show that any such shape that has sufficient symmetries has low
extension complexity4 .
3 Fiorini et al., 2012
4 Kaibel and Pashkovich, 2010
Consider any polytope P = π(Q) ⊆ R^2 which is the projection
of a polytope Q = {Ax ≤ b} ⊆ R^d. We can assume that P
is obtained from Q by restricting the points in Q to the first 2
coordinates. Suppose x_1 ≥ 0 is an inequality defining a facet
The result can also be proved when n is not a power of 2, though this requires more work.

of the polytope P. Then we define a new polytope Q0 ⊆ Rd+1


by reflecting Q about the hyperplane x1 = 0. We replace each
inequality
d
c 1 x 1 + ∑ c i x i ≤ bi
i =2

of Q with the inequality

d
c 1 x d + 1 + ∑ c i x i ≤ bi ,
i =2

in Q0 , and add in the inequalities − xd+1 ≤ x1 ≤ xd+1 to Q0 .


Let π 0 be the projection map defined by π 0 ( x1 , x2 , . . . , xd+1 ) =
( x1 , x2 ), and set P0 = π 0 ( Q0 ). The number of facets of Q0 is only 2
more than the number of facets of Q, but the number of facets of
P0 is potentially twice as many as in P! Using this method, we can
construct a regular n-gon using at most k reflections, when n = 2k .
So every such n-gon has extension complexity at most O(log n). By
Fact 11.6, this is the best one can hope for.
Permutahedron For every permutation π : [n] → [n], define the point Figure 11.5: An octagon can be built
(π (1), π (2), . . . , π (n)) ∈ Rn . The permutahedron is the convex with 3 reflections.

hull of these n! points. The dimension of this polytope is n − 1.


To see this, observe that if a_1, . . . , a_n, b are such that ∑_{i=1}^n a_i x_i = b
for every point x in the polytope, then we must have a_i = a_j for
each i, j. Otherwise if the permutation π satisfies the constraint,
then the permutation π′ obtained from π by swapping π(i), π(j)
would not satisfy the constraint. So the only linear constraint that
is satisfied by all permutations is
$$\sum_{i=1}^n x_i = \sum_{i=1}^n i.$$

Claim 11.7. The permutahedron is the set of points satisfying the
conditions:
$$\sum_{i=1}^n x_i = \sum_{i=1}^n i$$
$$\sum_{i\in S} x_i \geq \sum_{i=1}^{|S|} i \quad \text{for all sets } S \subseteq [n] \text{ with } 0 < |S| < n,$$
and moreover the facets of the permutahedron correspond to these 2^n − 2
inequalities.

Proof. Let Q denote the polytope defined by the constraints.


Clearly every permutation is in Q, so the permutahedron is con-
tained in Q.

Suppose x ∈ Q is a vertex. We shall show that it must be one of


the n! permutations. Suppose x satisfies
$$\sum_{i\in S} x_i = \sum_{i=1}^{|S|} i \quad\text{and}\quad \sum_{i\in T} x_i = \sum_{i=1}^{|T|} i,$$
for two distinct sets S, T. Then we claim that we must have S ⊆ T
or T ⊆ S. Otherwise we would have
$$\sum_{i\in S\cup T} x_i = \sum_{i\in S} x_i + \sum_{i\in T} x_i - \sum_{i\in T\cap S} x_i$$
$$\leq \sum_{i=1}^{|S|} i + \sum_{i=1}^{|T|} i - \sum_{i=1}^{|T\cap S|} i \qquad \text{using the inequality applied to the set } T\cap S$$
$$= \sum_{i=1}^{|S|} i + \sum_{i=|T\cap S|+1}^{|T|} i < \sum_{i=1}^{|S\cup T|} i,$$

contradicting the constraint for S ∪ T. Thus all the sets that give
constraints that x satisfies with equality correspond to sets S1 ⊆
S2 ⊆ . . . ⊆ Sk that form a chain. Let π be a permutation with
π (Si ) = [|Si |] for i = 1, 2, . . . , k. This permutation also satisfies the
same equations with equality. If x ≠ π, for small enough e, we get
that x + e(x − π) ∈ Q and x − e(x − π) ∈ Q are two distinct points and
x = ((x + e(x − π)) + (x − e(x − π)))/2 lies on the line between them. This contradicts
Recall Fact 11.2.
the fact that x is a vertex of Q.
To see that each of these inequalities gives a facet, observe that
every permutation satisfying π (S) = [|S|] gives a point in the
halfspace corresponding to the inequality for S. Moreover if a, b
are such that h a, π i = b, then it must be the case that ai = a j
whenever i, j ∈ S and whenever i, j ∈ / S. Otherwise we could
violate this constraint by swapping the values of π (i ), π ( j). This
proves that the dimension of all such permutations is at least n − 2,
and so the inequality defines a facet of the permutahedron. 5
Goemans, 2015

It is known5 that the extension complexity of the permu-


tahedron is O(n log n). This bound is tight: the polytope has
n! = 2Ω(n log n) vertices, so its extension complexity must be at least
6
Rado, 1952
Ω(n log n) by Fact 11.6.
Here we prove6 that its extension complexity is at most n2 . Each
permutation π corresponds to a boolean permutation matrix Y By Fact 11.6, the extension complexity
where Yij = 1 if and only if π (i ) = j. We shall use this correspon- of the permutahedron must be at least
n.
dence to find an extension of P with only n2 facets. Define the

polytope Q using the inequalities:

Yij ≥ 0 for i, j ∈ [n]


n
∑ Yij = 1 for all j ∈ [n]
i =1
n
∑ Yij = 1 for all i ∈ [n]
j =1

Figure 11.6: The permutahedron with n = 4.
Viewing the n^2 variables as a matrix Y, and setting v to be the

column vector (1, 2, . . . , n) T , we claim that Yv is an element of the


permutahedron exactly when Y is an element of Q. Indeed, each
vertex of the permutahedron is the image of the corresponding
permutation matrix under this linear map, so every element of the
permutahedron can be realized as Yv for some Y ∈ Q. Moreover,

for any Y ∈ Q,
$$\sum_{i=1}^n (Yv)_i = \sum_{i=1}^n\sum_{j=1}^n Y_{ij}\cdot j = \sum_{j=1}^n j.$$

It only remains to check that Yv satisfies the inequalities of the
permutahedron. For any set S ⊆ [n] with 0 < |S| < n,
$$\sum_{i\in S} (Yv)_i = \sum_{i\in S}\sum_{j=1}^n Y_{ij}\cdot j. \qquad (11.1)$$
Now if there is some i_0 ∉ S and j_0 ≤ |S| such that Y_{i_0 j_0} > 0, then since
$$n - |S| = \sum_{i\in[n],\, j>|S|} Y_{ij} = \sum_{i\in S,\, j>|S|} Y_{ij} + \sum_{i\notin S,\, j>|S|} Y_{ij},$$
and the rows indexed by i ∉ S also have total sum n − |S|, of which Y_{i_0 j_0} accounts for a positive amount,
we must have ∑_{i∉S, j>|S|} Y_{ij} < n − |S|. This means that there must be
some i ∈ S, j > |S| such that Y_{ij} > 0. Then by decreasing Y_{ij} and
Y_{i_0 j_0}, and increasing Y_{i j_0} and Y_{i_0 j}, we get a new point of Q with
a smaller value in (11.1). Thus we can assume that the minimum
value of (11.1) is achieved with Y_{i_0 j_0} = 0 for all i_0 ∉ S, j_0 ≤ |S|. For
such Y, we have
$$\sum_{i\in S} (Yv)_i \geq \sum_{i=1}^{n}\sum_{j=1}^{|S|} Y_{ij}\cdot j = \sum_{j=1}^{|S|} j,$$
proving that Yv lies inside the permutahedron. This proves that
the extension complexity of the permutahedron is at most n^2.
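The following sketch (not from the book) spot-checks this argument numerically: it builds a point Y of Q as a convex combination of permutation matrices (by the Birkhoff-von Neumann theorem every point of Q arises this way) and verifies that Yv satisfies the inequalities of Claim 11.7. The parameter n and the number of sampled permutations are arbitrary.

    import itertools, random
    import numpy as np

    n = 4
    v = np.arange(1, n + 1)

    def perm_matrix(p):            # Y[i][j] = 1 iff the permutation maps i to j
        Y = np.zeros((n, n))
        for i, j in enumerate(p):
            Y[i, j] = 1.0
        return Y

    # A point of Q: a convex combination of a few permutation matrices.
    perms = random.sample(list(itertools.permutations(range(n))), 5)
    weights = np.random.dirichlet(np.ones(len(perms)))
    Y = sum(w * perm_matrix(p) for w, p in zip(weights, perms))

    x = Y @ v
    assert np.isclose(x.sum(), v.sum())
    for k in range(1, n):
        for S in itertools.combinations(range(n), k):
            assert x[list(S)].sum() >= sum(range(1, k + 1)) - 1e-9
    print("Yv satisfies all permutahedron inequalities")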

Polytopes from Boolean Circuits The connection between polytopes
and algorithms also goes the other way: efficient algorithms lead
to efficient polytopes. Given any boolean function f : {0, 1}^n →
{0, 1}, we say that the polytope P ⊆ R^n is a separating polytope for
f if f(x) is determined by whether or not x ∈ P.
Figure 11.7: A separating polytope for the AND function.
If f can be computed by a circuit with s gates, there is a poly-
tope separating f that has extension complexity at most 3s. Indeed,
consider the polytope obtained by replacing every gate g with a
variable v_g. If g = ¬h, we add the constraint

v_g = 1 − v_h .

If g = h ∧ r, we add the constraints

v_g ≤ v_h ,
v_g ≤ v_r ,
v_g ≥ v_h + v_r − 1.
Figure 11.8: A separating polytope for the parity function: x_1 + x_2 + x_3 mod 2.

and if g = h ∨ r, we add the constraints

v g ≥ vh ,
v g ≥ vr ,
v g ≤ v h + vr .

Finally, we add the constraint v f = 1, where f denotes the gate


computing the output of the circuit. When this polytope is pro-
jected onto the inputs, we obtain a polytope P that we claim sepa-
rates f . Indeed, we see that if f ( x ) = 1, then x ∈ P, since we can
assign all of the gate variables the values computed in the circuit,
and these values will satisfy the inequalities we defined. On the
other hand, if f(x) = 0, then the constraints cannot be satisfied by
any assignment. So we obtain a separating polytope defined by at
most 3s inequalities, as promised.
This can be seen by assigning the value 0 to gate g if v_g = 0 and 1 if
v_g ≠ 0. Since f(x) = 0, one of the gates must be assigned an incorrect
value. If g = h ∨ r, and v_g = 0, then we see that the inequalities for g must
be violated, since either v_h > 0 or v_r > 0. If v_g > 0, then v_h = v_r = 0,
and again this violates v_g ≤ v_h + v_r. Similarly if g = h ∧ r and v_g = 0,
then we must have v_h, v_r > 0 and this violates v_g ≥ v_h + v_r − 1. If v_g > 0,
then we must have either v_h = 0 or v_r = 0, which again causes a violation
of the inequalities for g.
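Here is a small sketch (not from the book) that writes down these inequalities for the hypothetical circuit f = (x1 ∧ x2) ∨ ¬x3; projecting the resulting polytope onto the input variables (for example by Fourier-Motzkin elimination) would then give the separating polytope.

    # Each inequality is (coeffs, rhs), meaning  sum_g coeffs[g] * v_g <= rhs.
    def not_gate(g, h):                       # v_g = 1 - v_h, as two inequalities
        return [({g: 1, h: 1}, 1), ({g: -1, h: -1}, -1)]

    def and_gate(g, h, r):                    # v_g <= v_h, v_g <= v_r, v_g >= v_h + v_r - 1
        return [({g: 1, h: -1}, 0), ({g: 1, r: -1}, 0), ({g: -1, h: 1, r: 1}, 1)]

    def or_gate(g, h, r):                     # v_g >= v_h, v_g >= v_r, v_g <= v_h + v_r
        return [({g: -1, h: 1}, 0), ({g: -1, r: 1}, 0), ({g: 1, h: -1, r: -1}, 0)]

    inequalities = []
    inequalities += and_gate("g1", "x1", "x2")      # g1 = x1 AND x2
    inequalities += not_gate("g2", "x3")            # g2 = NOT x3
    inequalities += or_gate("f", "g1", "g2")        # f  = g1 OR g2
    inequalities += [({"f": 1}, 1), ({"f": -1}, -1)]  # v_f = 1

    for coeffs, rhs in inequalities:
        terms = " + ".join(f"{c}*v_{g}" for g, c in coeffs.items())
        print(f"{terms} <= {rhs}")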
Slack Matrices

Definition 11.8. The slack matrix of a polytope P is the matrix S where
S_{ij} is the distance of the i’th facet of P from the j’th vertex of P.
Thus if a · x ≤ b is a facet of the polytope, and v is a vertex of the
polytope, the corresponding entry of the slack matrix of the polytope
is b − a · v, which is always non-negative.
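To make the definition concrete, here is a quick sketch (not from the book) that computes the slack matrix of the unit square from its facets Ax ≤ b and its vertices; the matrix of a higher-dimensional example like the cube in the margin can be computed the same way.

    import numpy as np

    A = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]], dtype=float)  # -x<=0, x<=1, -y<=0, y<=1
    b = np.array([0, 1, 0, 1], dtype=float)
    V = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 1]], dtype=float)                      # columns are the vertices

    S = b[:, None] - A @ V        # S[i, j] = slack of vertex j in facet i
    print(S)                      # every entry is 0 or 1, and never negative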
The extension complexity of any polytope P is determined by its
slack matrix7 : it is equal to the non-negative rank of the slack matrix
of P.
7 Yannakakis, 1991
Figure 11.9: A numbering of the facets and vertices of the 1 × 1 × 1 cube. The
corresponding slack matrix:

        a b c d e f g h
    1 [ 0 0 0 0 1 1 1 1 ]
    2 [ 0 1 1 0 0 1 1 0 ]
    3 [ 1 1 1 1 0 0 0 0 ]
    4 [ 1 0 0 1 1 0 0 1 ]
    5 [ 0 0 1 1 0 0 1 1 ]
    6 [ 1 1 0 0 1 1 0 0 ]

Lemma 11.9. For any polytope P, the extension complexity of P is equal to
the non-negative rank of its slack matrix.

Proof. Suppose the extension complexity of the polytope P ⊆ R^d
is r. Then there is a polytope Q with r facets, such that P is a linear
projection of Q. By applying a suitable linear transformation, we can
assume that
$$Q = \{(x, y) : Ax + Cy \leq b\},$$
and P is the projection of Q onto x. Here A is an r × d matrix, C is an
r × k matrix, and y is a column vector with k entries.
By Fact 11.3, every facet of P is given by a positive linear combi-
nation of the inequalities defining Q, so for every facet i, we have
a non-negative row vector u_i such that the facet corresponds to the
inequality u_i Ax + u_i Cy ≤ u_i b, and u_i C = 0. Moreover, if v_j is the j’th

vertex of P, there must be a point (v j , y j ) ∈ Q, which gives a non-


negative vector b − Av j − Cy j . The (i, j)’th entry of the slack matrix of
P can then be computed as ui · (b − Av j − Cy j ) = ui b − (ui A)v j . Thus
the slack matrix S can be expressed as S = UV, where U is the matrix
whose rows are ui and V is the matrix whose columns are v j .
Conversely, suppose the polytope P = { x : Ax ≤ b}, where
each of these inequalities corresponds to a facet of P. Suppose P has
an f × k slack matrix S that can be expressed as S = UV, where
U is a f × r matrix and V is a r × k matrix, both with non-negative
entries. Then we claim that P is the projection of the polytope Q =
{( x, y) : Ax + Uy = b, y ≥ 0}, which has at most r facets. The
j’th vertex x j of P corresponds to the column vector v j of V, and we
have Uv j = b − Ax j by the definition of the slack matrix, so we get
Ax j + Uv j = b and v j ≥ 0, proving that every vertex of P is the
projection of a point in Q. Moreover, every point x for which there
exists y ≥ 0 with Ax + Uy = b must satisfy Ax ≤ b, so the projection
of Q is contained in P. We conclude that the extension complexity of
P is at most r.

Lemma 11.9 gives a powerful way to prove both upper and lower
bound on the extension complexity of polytopes. The lower bounds
are usually proved using ideas strongly inspired by lower bounds in
communication complexity.

Spanning Tree Polytope


The spanning tree polytope is the convex hull of all trees of size n.
A tree is a connected acyclic graph.
We have a variable x{u,v} for every distinct u, v ∈ [n]. Every tree gives
a point x by setting x{u,v} = 1 if {u, v} is an edge of the tree, and 0
otherwise. The spanning tree polytope is the convex hull of all of
these spanning trees. 8
Edmonds, 1971
The facets of the spanning tree polytope correspond8 to the in-
equalities:

∑ x{u,v} = n − 1
u,v

∑ x{u,v} ≤ |S| − 1 for every S ⊆ [n] the number of edges in any subset is at
most the size of the subset
u,v∈S

Each facet of the polytope corresponds to a subset S ⊆ [n], and each


vertex corresponds to a spanning tree T. The slack of the pair S, T
is exactly |S| − 1 − k, where k is the number of edges of T that are
contained in the set S. If T is rooted at a vertex a ∈ S, the slack is the
number of children in S whose parents are not in S.
Motivated by this observation, we show how to encode the slack
matrix using a small non-negative factorization. For every tree T,

define the vector
$$v_{a,b,c} = \begin{cases} 1 & \text{if } b \text{ is the parent of } c \text{ when } T \text{ is rooted at } a,\\ 0 & \text{otherwise.}\end{cases}$$
For every set S, define the vector
$$u_{a,b,c} = \begin{cases} 1 & \text{if } a, b, c \text{ are distinct, } a = \min\{S\},\ b \notin S \text{ and } c \in S,\\ 0 & \text{otherwise.}\end{cases}$$

Then we see that ∑_{a,b,c} u_{a,b,c} v_{a,b,c} is exactly the slack of T from
the facet of the set S. So setting U to be the matrix whose rows
correspond to the vectors u for each set S, and V to be the matrix
whose columns correspond to the vectors v for each tree T, we can
express the slack matrix as UV. This proves that the non-negative
rank of the slack matrix is at most n^3. By Lemma 11.9, the spanning
tree polytope has an extended formulation with at most n^3 facets.
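A small sketch (not from the book) that checks this factorization on a random tree and an arbitrary set S: the inner product ∑_{a,b,c} u_{a,b,c} v_{a,b,c} agrees with the slack |S| − 1 minus the number of edges of T inside S. The tree-generation scheme below is just one convenient way to sample a tree.

    import random

    n = 8
    # A random tree on {0, ..., n-1}: vertex i > 0 gets a random earlier parent.
    edges = [(i, random.randrange(i)) for i in range(1, n)]
    adj = {i: set() for i in range(n)}
    for x, y in edges:
        adj[x].add(y); adj[y].add(x)

    S = {1, 3, 4, 6}
    a = min(S)

    # parent[] of every vertex when the tree is rooted at a.
    parent, stack, seen = {a: None}, [a], {a}
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y); parent[y] = x; stack.append(y)

    slack = (len(S) - 1) - sum(1 for x, y in edges if x in S and y in S)
    # sum_{a,b,c} u_{a,b,c} v_{a,b,c} counts vertices c in S (c != a) whose parent lies outside S.
    inner_product = sum(1 for c in S if c != a and parent[c] not in S)
    print(slack, inner_product)   # the two numbers agree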
Figure 11.10: The slack of the pair S, T is the number of edges leaving S. In this case, the slack is 4.

Lower bounds on Extension Complexity

Our main weapons for proving lower bounds on the extension
complexity of polytopes are the entropy based techniques discussed
in Chapter 6.
If we want to prove lower bounds on the extension complexity
using Lemma 11.9, it is enough to find any inequalities valid for P
and any set of points in P for which the corresponding matrix has
large non-negative rank.
Lemma 11.10. Suppose v1 , . . . , vk ∈ P ⊆ { x : Ax ≤ b}. Let T be
the matrix whose i, j’th entry is the slack bi − Ai v j . Then the extension
complexity of P is at least rank+ ( T ) − 1.
Proof. Let S be the slack matrix of P, and let W be the matrix whose
i, j’th entry is the slack of v j from the i’th facet of P. Since every
v j is a convex combination of the vertices of P, the columns of W
are convex combinations of S. So we can write W = SB for a non-
negative matrix B. This proves that rank+ (W ) ≤ rank+ (S).
By Lemma 11.4, every inequality Ai · x ≤ bi can be expressed as
non-negative combination of the facets of P and the inequality 0 ≤ 1
(which corresponds to adding a row with Wij = 1 to the matrix W).
So if W 0 denotes the matrix obtained from W by adding a row where
every entry is 1, then T = CW 0 , where C is some non-negative matrix.
Thus rank+ ( T ) ≤ rank+ (W ) + 1.
This gives rank+ ( T ) − 1 ≤ rank+ (S), and so by Lemma 11.9, the
extension complexity of P is at least rank+ ( T ) − 1.

Separating Polytopes
Consider any separating polytope for the function f : {0, 1}n →
{0, 1}. We can use Lemma 11.10 to get a lower bound on the exten-
Since we showed that the minimal
sion complexity. extension complexity of a separating
Let polytope is at most linear in the size
n of the smallest circuit computing f ,
∆( x, y) = ∑ y i (1 − x i ) + (1 − y i ) x i the bound we prove here gives a lower
i =1 bound on the circuit size of f .

be the Hamming distance between x, y. Consider the | f −1 (0)| ×


| f −1 (1)| matrix M f ,e whose entries are indexed by inputs x ∈ f −1 (0), 9
Hrubeš, 2016
y ∈ f −1 (1), and whose ( x, y)’th entry is ∆( x, y) − e. We claim9 :

Theorem 11.11. Suppose for every e > 0, rank+ ( M f ,e ) > k. Then the
extension complexity of every separating polytope for f is at at least k.

Proof. Let P be a separating polytope for f , and suppose f −1 (0) ⊆ P.


For each y ∈ f −1 (1), define the linear inequality

∆( x, y) ≥ e.

We claim that for y, there is a value e > 0 such that this inequality is
satisfied by all the points in P. Indeed, if this is not the case, then one
can find a sequence of points x_1, x_2, . . . ∈ P such that ∆(x_j, y) ≤ 1/j.
Since we must have $\|x_j - y\| \leq \sqrt{\sum_{i=1}^n (1/j)^2} = \sqrt{n}/j$, this sequence
A compact set is a set that is both
must converge to y. Since P is compact, this implies that y ∈ P, which
closed and bounded. The limit of every
contradicts the fact that P is a separating polytope for f . convergent sequence contained in a
Thus, by taking the smallest e valid for all y’s, we get that ∆( x, y) ≥ compact set must itself lie in the set.

e for every y ∈ f −1 (1), x ∈ P. Lemma 11.10 completes the proof of


the theorem.

As a Corollary, we get that inf{rank+ ( M f ,e )} is a lower bound on


the circuit complexity of computing the function f .

Correlation and Cut Polytopes


The correlation polytope C is the convex hull of cliques. More pre-
cisely, for every A ⊆ [n], define the vertex
$$x^A_{\{i,j\}} = \begin{cases} 1 & \text{if } i, j \in A,\\ 0 & \text{otherwise,}\end{cases} \quad \text{for all } i, j \in [n].$$
Here there is one variable for every i ∈ [n], and one for every set
{i, j} ⊆ [n] of size 2, for a total of $\binom{n}{2} + n = \binom{n+1}{2}$ variables. The
correlation polytope is the convex hull of all such vertices. There are
2^n such vertices.
A vertex of the correlation polytope: (example shown in the margin)

The cut polytope K is the convex hull of all cuts in a graph. For-
mally, for every set A ⊆ [n + 1] define the vertex
$$y^A_{\{i,j\}} = \begin{cases} 1 & \text{if } i \in A,\ j \notin A,\\ 0 & \text{otherwise,}\end{cases} \quad \text{for all } i \neq j \in [n + 1].$$
The cut polytope is the convex hull of all such vertices. Observe that
y^A = y^{A^c}, so there are 2^n vertices.
A vertex of the cut polytope: (example shown in the margin)
In fact, C and K are isomorphic. Consider the linear map π : C →
K, where for i ≤ j ∈ [n + 1] we set
$$\pi(x)_{\{i,j\}} = \begin{cases} x_{\{i\}} + x_{\{j\}} - 2x_{\{i,j\}} & \text{if } i, j \leq n,\\ x_{\{i\}} & \text{if } j = n + 1.\end{cases}$$
Then for every set A ⊆ [n], we see that π(x^A) = y^A, so π is an
injective linear map with π(C) = K. So C and K are geometrically
the same.
The inverse map sends the vertex corresponding to the set A ⊆ [n + 1] to either A or A^c, depending on whether or not n + 1 ∈ A.

Claim 11.12. C has full dimension $\binom{n+1}{2}$.

Proof. For every set of size 1, {i }, C contains the unit vector x {i} in
the corresponding direction. For every set {i, j} of size 2, C contains
the unit vector x{i,j} − x{i} − x{ j} in that direction. So the dimension of
C is full.

The facets of C are hard to determine, and no explicit expression is


known to capture them. However, we claim that for every non-empty
set B ⊆ [n], the inequality

$$\sum_{i\in B} x_{\{i\}} \leq 1 + \sum_{S\in\binom{B}{2}} x_S$$
holds. Indeed, we see that for any vertex x^A, the left hand side is
exactly |A ∩ B|, and the right hand side is exactly $1 + \binom{|A\cap B|}{2}$, so the
inequality holds. Moreover, the equation is satisfied with equality
exactly when | A ∩ B| = 1. In all other cases, the slack of the inequality
is at least 1. So if we let Q be the polytope defined by the inequalities
given by every set B, and the inequalities 1 ≥ x{i,j} ≥ 0 then Q is a
bounded polytope that contains C .

Theorem 11.13. The extension complexity of C is 2Ω(n) .

Proof. Consider the matrix T whose entries correspond to sets A, B,


such that TA,B is the slack of x A from the inequality given by B. Then
we see that T satisfies the assumptions of Theorem 6.16, and so
the non-negative rank of T is at least 2Ω(n) . By Lemma 11.10, the
extension complexity of C is also 2Ω(n) .

Matching Polytope
We defined the matching polytope as the convex hull of all matchings 10
Rothvoß, 2014
in a graph on n vertices. Here we prove10 :

Theorem 11.14. The extension complexity of the matching polytope is at


least 2Ω(n) .

Like the lower bound for the correlation polytope, the proof will
rely crucially on entropy based inequalities. Recall that the facets of
this polytope are given by:

$$x_{\{u,v\}} \geq 0 \quad \text{for all distinct } u, v$$
$$\sum_{v} x_{\{u,v\}} \leq 1 \quad \text{for all } u \qquad \text{the number of edges touching u is at most 1}$$
$$\sum_{u\neq v\in X} x_{\{u,v\}} \leq \frac{|X|-1}{2} \quad \text{for all } X\subseteq[n] \text{ with } |X| \text{ odd} \qquad \text{an odd set can contain at most } \lfloor |X|/2\rfloor \text{ edges}$$

Given a set X and a edge, we say that the set cuts the edge if the
edge goes from inside the set to outside the set. A matching is called
perfect if every vertex of the graph is contained in some edge of the
matching. When the matching is a perfect matching, the inequality of
each of these facets corresponds to asserting that every odd set must
The proof will also show that the
cut at least one edge of the matching. convex hull of perfect matchings has
It will be convenient to work with 4n + 6 vertices. Let S be the slack exponential extension complexity.
matrix of the matching polytope. We shall prove that rank+ (S) ≥
2Ω ( n ) . Sx,y is one less than the number of
edges of y cut by x.
Consider the distribution on cuts and matchings given by:
$$q(xy) = \frac{S_{x,y}}{\sum_{i,j} S_{i,j}}.$$

If S has non-negative rank r, then S can be expressed as S = ∑_{t=1}^r S(t),
where S(t) is a non-negative rank 1 matrix. In other words, q(xy) can
be expressed as a convex combination of r product distributions, by
setting
$$q(xy|t) = \frac{S(t)_{x,y}}{\sum_{i,j} S(t)_{i,j}}, \quad\text{and}\quad q(t) = \frac{\sum_{i,j} S(t)_{i,j}}{\sum_{i,j} S_{i,j}}.$$
q(xy|t) is a product distribution, since S(t) has non-negative rank 1: if the rank of S(t) is 1, there must be vectors v, w such that q(xy|t) = v_x · w_y.

Fix an arbitrary perfect matching A and an arbitrary cut W that


cuts all of the edges of A. Let P = (C, B1 , . . . , Bn ) be a uniformly
random partition of the set W , such that |C | = 3, and | Bi | = 2.
Let Ai denote the set of edges of A that touch Bi . We say that the
cut X ⊆ [n] is consistent with P if C ⊆ X ⊆ W , and for every i,
Xi = X ∩ Bi ∈ { Bi , ∅}. We say that a matching Y is consistent with

P if Y is a perfect matching containing the edges of A cut by C, and


for each i, either Ai ⊆ Y, or Y matches Bi to itself and matches the
neighbors of Bi under Ai to themselves. We write Yi to denote the
edges of i contained in the vertices of Ai . See Figure 11.11 for an
example.

Figure 11.11: An example of A, W, P when n = 6, and X, Y conditioned on D.

Let D denote the event that X, Y are consistent with P, and for
every i, Xi does not cut the edges of Yi . We will show:

Lemma 11.15. For every i,

I ( Xi : T | X<i Y≥i PD) + I (Yi : T | X≤i Y>i PD) ≥ Ω(1).

Before proving Lemma 11.15, we show how to use it. For each
fixing of P, conditioned on the event D , every pair X, Y has exactly
the same probability, and the cut and matching are sampled inde-
pendently in each of the cycles. So the pairs ( X1 , Y1 ), . . . , ( Xn , Yn ) are
mutually independent. Thus by Lemma 6.15, we get
n
2 log r ≥ ∑ I (Xi : T | X<i Y≥i PD) + I (Yi : T | X≤i Y>i PD)
i =1
≥ Ω ( n ), by Lemma 11.15

proving that r ≥ 2Ω(n) as required.


It only remains to prove Lemma 11.15. Fix i. Let E denote the
event that X, Y are consistent with P, and for each j 6= i, the edges of
Yj are not cut by X j . So D ⊂ E , but under E the edges of Yi may be
cut by Xi . For a particular value of i, we write Z = X<i , Y>i , B1 , . . . , Bn .

In fact, we shall prove the stronger fact that for any fixed z, if
$$I(X_i : T \mid Y_i CzD) + I(Y_i : T \mid X_i CzD) = \eta^4/2,$$
then η ≥ Ω(1). Define
$$p(xyct) = q(xyct \mid z, E).$$
A direct computation shows that the weights in S look like

                Y_i ≠ A_i    Y_i = A_i
    X_i = ∅        2            2
    X_i ≠ ∅        2            4

When x_i = ∅, there are 2 potential values for y_i, each with relative weight
2. When x_i ≠ ∅, there is one value of y_i with a weight of 4, and one value with
a weight of 2.

So
$$p(X_i = \emptyset, Y_i \neq A_i) = \frac{p(X_i \neq \emptyset, Y_i = A_i)}{2}.$$
We will prove:
$$p(X_i = \emptyset, Y_i \neq A_i) \leq p(X_i \neq \emptyset, Y_i = A_i)\cdot\frac{2}{5}\cdot\frac{1+4\eta}{1-4\eta} + 4\eta,$$
which implies that η ≥ Ω(1). To prove the bound, we start by
expressing:
$$p(X_i = \emptyset, Y_i \neq A_i) = \sum_{t=1}^r p(X_i = \emptyset, Y_i \neq A_i, t).$$

Now p( xy|ct) is a product distribution, since after fixing c, t, the


conditions of E can be checked separately for x, y. Define

αct = | p( xi |ct, Yi 6= Ai ) − p( xi |c, Yi 6= Ai )|, note: p( xi |ct, Yi 6= Ai ) = p( xi |ct)

β ct = | p(yi |ct, Xi = ∅) − p(yi |c, Xi = ∅)|. note: p(yi |ct, Xi = ∅) = p(yi |ct)

Let G denote the set of pairs (c, t) satisfying αct ≤ η and β ct ≤ η. We


will be able to argue that good pairs (c, t) account for almost all of
the event Xi = ∅, Yi 6= Ai :

Claim 11.16. p((c, t) ∈ G| Xi = ∅, Yi 6= Ai ) ≥ 1 − 4η.

So we can bound

p( Xi = ∅, Yi 6= Ai ) ≤ 4η + ∑ p( Xi = ∅, Yi 6= Ai , c, t).
(c,t)∈G

The relative contribution to the probability of these two events can


be controlled when (c, t) ∈ G . By symmetry, c is independent of the
events Xi = ∅, Yi = Ai , so
$$p(X_i = \emptyset \mid Y_i \neq A_i, c) = \frac{1}{2} = p(Y_i = A_i \mid X_i = \emptyset, c).$$

When (c, t) ∈ G,
$$\frac{p(X_i = \emptyset, Y_i \neq A_i \mid ct)}{p(X_i \neq \emptyset, Y_i = A_i \mid ct)} \leq \frac{\frac14 + \eta}{\frac14 - \eta} = \frac{1+4\eta}{1-4\eta}.$$
Which implies:
$$p(X_i = \emptyset, Y_i \neq A_i) \leq 4\eta + \frac{1+4\eta}{1-4\eta}\sum_{(c,t)\in G} p(X_i \neq \emptyset, Y_i = A_i, c, t).$$

Note that Xi 6= ∅ determines Xi and Yi = Ai determines the


matching. Once the cut and matching are determined, c is uniformly
random subset independent of t. On the other hand, we will prove:

Claim 11.17. For any value of t, |{c|(c, t) ∈ G}| ≤ 4.


Claim 11.17 is the technical heart of the proof, and is the final piece we need to complete the proof. It is proved by reasoning about the combinatorial structure of matchings.
So at most 4 values of c carry the weight of (X_i = ∅) ∧ (Y_i ≠ A_i),
while all $\binom{5}{3}$ values of c carry an equal amount of weight for (X_i ≠ ∅) ∧ (Y_i = A_i). So we get:
$$p(X_i = \emptyset, Y_i \neq A_i) \leq 4\eta + \frac{1}{\binom{5}{3}}\cdot\frac{1+4\eta}{1-4\eta}\sum_{(c,t)\in G} p(X_i \neq \emptyset, Y_i = A_i, t)$$
$$\leq 4\eta + \frac{4}{\binom{5}{3}}\cdot\frac{1+4\eta}{1-4\eta}\sum_{t=1}^{r} p(X_i \neq \emptyset, Y_i = A_i, t) \qquad \text{by Claim 11.17}$$
$$= 4\eta + \frac{2}{5}\cdot\frac{1+4\eta}{1-4\eta}\cdot p(X_i \neq \emptyset, Y_i = A_i),$$

as promised.
Now we turn to proving each of the claims:

Proof of Claim 11.17. We claim that if c and c′ are two sets with |c ∩
c′| ≤ 1, then we cannot have (c, t) ∈ G and (c′, t) ∈ G. Indeed,
suppose (c, t), (c′, t) ∈ G are as in Figure 11.12. Then since
α_{ct} < η < 1/2, p(x_i = ∅ | ct) has positive probability, and the
cut shown in Figure 11.12 has positive probability conditioned on
t. Similarly, since β_{c′t} < η < 1/2, the two edges shown in Figure
11.12 have positive probability conditioned on t. However, since
the matching and cut are independent conditioned on t, both have
positive probability conditioned on t. But this cannot happen, since
such a matching and cut have 0 probability in q.
This means that all of the sets c, c′ that are in G must intersect in
at least 2 elements. We claim that a family of subsets in $\binom{[5]}{3}$ can have
at most 4 sets, if all pairs of sets are to have pairwise intersections
of size 2. ($\binom{[n]}{k}$ denotes the collection of subsets of [n] of size k.)
We can obtain $4 = \binom{4}{3}$ such sets by taking the collection
$\binom{[4]}{3}$, and we claim that there is no family of 5 sets whose pairwise

intersection is of size 2. Indeed, take any two sets c_1, c_2 in such a
family. We must have |c_1 ∪ c_2| = 4. Suppose there is another set c_3
such that |c_1 ∪ c_2 ∪ c_3| = 5, then we must have c_1 ∩ c_2 ⊆ c_3. But now
we cannot have any other set in the family, for if the set has only one
element of c_1 ∩ c_2, it will intersect one of the sets c_1, c_2, c_3 in just one
element. This proves that if the sets cover all 5 elements then there
can be at most 3 sets.
Figure 11.12: Two values of c that intersect in only one element. The cut X is consistent with c, and the matching Y is consistent with c′.

Since the events X_i = ∅ and E together imply D, and Y_i ≠ A_i and
E together imply D, we have:
$$p(xyct \mid X_i = \emptyset) = q(xyct \mid z, X_i = \emptyset, D),$$
and
$$p(xyct \mid Y_i \neq A_i) = q(xyct \mid z, Y_i \neq A_i, D).$$
Observe that p(x_i | c, Y_i ≠ A_i) is equally likely to be the empty set or
b_i. p(y_i | b_i, X_i = ∅) is uniformly distributed on 2 matchings. Suppose
$$I(X_i : T \mid Y_i CzD) + I(Y_i : T \mid X_i CzD) = \eta^4/2,$$
then since
$$q(x_i = \emptyset \mid zD) \geq \frac12 \quad\text{and}\quad q(y_i \neq a_i \mid zD) \geq \frac12,$$
we get
$$I(X_i : T \mid Y_i C, Y_i \neq A_i, zD) + I(Y_i : T \mid C, X_i = \emptyset, zD) \leq \eta^4. \qquad (11.2)$$
Figure 11.13: If c_1, c_2, c_3 are three sets whose union is of size 5, and pairwise intersections are of size 2, there is no other set that can be added to the collection while keeping the pairwise intersection of size 2.

We claim:

Proof of Claim 11.16. By (11.2) and Pinsker’s inequality (Corollary
6.8):
$$\mathop{\mathbb{E}}_{p(ct|X_i=\emptyset)}\left[\beta_{b_it}\right] \leq \sqrt{\mathop{\mathbb{E}}_{p(b_it|X_i=\emptyset)}\left[\beta^2_{b_it}\right]} \leq \sqrt{\eta^4} = \eta^2.$$
This means that p(β_{b_it} > η | X_i = ∅) ≤ η, and so
$$p(\beta_{b_it} > \eta \mid X_i = \emptyset, Y_i \neq A_i) = \frac{p(\beta_{b_it} > \eta \mid X_i = \emptyset)}{p(Y_i \neq A_i \mid X_i = \emptyset)} \leq 2\eta.$$

A symmetric argument, using p(X_i = ∅ | Y_i ≠ A_i) = 1/2, proves that
p(α_{b_it} > η | X_i = ∅, Y_i ≠ A_i) ≤ 2η. The claim then follows using the
union bound.

Exercise 11.1
Give a factorization of the slack matrix of the permutahedron
S = UV, where U is a non-negative 2n − 2 × r matrix, V is a non-
negative r × n! matrix, and r = O(n2 ).

Exercise 11.2
The cube in n dimensions is the convex hull of the set {0, 1}n .
Identify the facets of the cube. Is it possible that the extension com-
plexity of the cube is O(√n)?

Exercise 11.3
Show that the non-negative rank of the slack matrix of a regular 2k
sided polygon in the plane is at most O(k ) by giving a factorization
of the matrix into non-negative matrices.

Exercise 11.4
Given two disjoint sets A, B, each of size n, define the bipartite
matching polytope to be the convex hull of all bipartite matchings:
matchings where every edge goes from A to B. Using what we know
about the permutahedron, show that the extension complexity of the
bipartite matching polytope is at most O(n2 ).

Exercise 11.5
Show that there is a k for which the convex hull of cliques of size k
has extension complexity 2Ω(n) /n.
12
Distributed Computing

The study of distributed computing has become increasingly


relevant with the rise of the internet. In a distributed computing
environment, there are n parties that are connected together by a
communication network, yet no single party knows exactly what the
network is.
More formally, the network is defined by an undirected graph on
n vertices, where each of the vertices represents one of the parties.
Each protocol begins with each party knowing their own name. In
each round, each of the parties can send a message to all of their
In general, networks may be asyn-
neighbors. chronous, and one can either charge or
not charge for the length of each mes-
sage. Moreover, one can also study the
Coloring Problem model in which some of the parties do
not follow the protocol. In this chapter
we stick to the model of synchronous
networks, where we count both the
Suppose the parties in a distributed environment want number of rounds of communication
to properly color themselves so that no two neighboring parties and the total communication. We also
assume that all parties execute the
have the same color. Let us start with the example that n parties are protocol correctly.
connected together so that every party has at most 2 neighbors.
Here is a simple protocol1 that finds a coloring in log∗ n rounds 1
Cole and Vishkin, 1986
of communication. Initially, the parties color themselves with their
names. This takes n colors, and the coloring is proper. In each round,
the parties send all of their neighbors their current color. If a ∈ {0, 1}t
denotes the color of one of the parties in a round, and b, c ∈ {0, 1}t
denote the colors assigned to her neighbors, the party sets i to be the
smallest number such that ai 6= bi , and j to be the smallest number
such that a j 6= c j . Her new color is then i, j, ai , c j . In this way, the
number of colors has been reduced from t to 22 log t+2 . After log∗ n
rounds, the number of colors will be a constant.
The coloring protocol can be generalized to handle arbitrary
graphs of degree d. Any graph of degree d can be colored using d + 1 2
Linial, 1992
colors. Here we give a protocol2 that uses log∗ n rounds to find a

coloring using d2 colors.


The protocol relies on the following combinatorial lemma:

Lemma 12.1. There is a family of t subsets of [5d2 log t], such that for any
d + 1 sets S1 , . . . , Sd+1 in the family, S1 is not contained in the union of
S2 , . . . , S d + 1 .

Proof. Pick the t sets at random from [5d^2 log t], where each element
is included in each set independently with probability 1/d. Then for
a particular choice of S_1, . . . , S_{d+1}, the probability that a given element
j is in S_1 but not in any of the other sets is
$$(1 - 1/d)^d/d \geq 2^{-2}/d = \frac{1}{4d}.$$
The probability that there is no such j is at most $(1 - \frac{1}{4d})^{5d^2\log t} \leq e^{-d\log t}$.
The number of choices for d + 1 such sets from the family is $\binom{t}{d+1} \leq t^{d+1} \leq 2^{2d\log t}$. Thus, by the union bound, the probability that the
family does not have the property we need is at most $e^{-d\log t}\,2^{2d\log t} <$
1.
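A quick sketch (not from the book) that samples the random family from this proof and spot-checks the covering property on random (d + 1)-tuples; the parameter values are arbitrary small examples, and a failed check is possible in principle but extremely unlikely at these sizes.

    import random, math

    d, t = 4, 64
    m = 5 * d * d * int(math.log2(t))              # ground set size
    family = [{x for x in range(m) if random.random() < 1 / d} for _ in range(t)]

    for _ in range(1000):                          # random (d+1)-tuples of sets
        S = random.sample(family, d + 1)
        union_rest = set().union(*S[1:])
        assert not S[0] <= union_rest, "S1 is contained in the union of the others"
    print("no violations found in 1000 random trials")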

In each round of the protocol, all parties send their current color
to all the other parties. If there are t colors in a particular round,
each party looks at the d colors she received and associates each
with a set from the family promised by Lemma 12.1. She picks a
color by picking an element that belongs to her own set but not to
any of the others. Thus, the next round will have at most 5d2 log t
colors. Continuing in this way, the number of colors will be reduced
to O(d2 log d) in log∗ n rounds.

On computing the Diameter

Suppose the parties in a network want to compute the diameter


of the graph: namely the largest distance between two vertices. Here we show
that in any distributed algorithm for computing the diameter of an
n vertex graph, there must be n links that transmit at least Ω(n^2) bits in
total3, even if it is just trying to distinguish whether the diameter of
the graph is at most 2 or bigger than 2. We do this by a reduction to
3 Frischknecht et al., 2012; and Holzer and Wattenhofer, 2012


the 2 party communication complexity of disjointness.
Let X, Y ⊆ [n] × [n] be two subsets. For every such pair of sets, we
shall define a graph GX,Y , and then show that if there is an efficient
distributed algorithm for computing the diameter of the graphs
GX,Y , then there must be an efficient communication protocol for
computing whether or not X, Y are disjoint.

Let A, B, C, D be disjoint cliques of size n. Let v be a vertex that


is connected to all the vertices of A, B and w be a vertex that is con-
nected to all the vertices of C, D. We connect v and w with an edge as
well. For each i, we connect the i’th vertex of A to the i’th vertex of C,
and the i’th vertex of B to the i’th vertex of D.
Finally we connect the i’th vertex of A to the j’th vertex of B if and
only if (i, j) ∈
/ X. Similarly, we connect the i’th vertex of C to the j’th
vertex of D if and only if (i, j) ∈
/ Y.

Figure 12.1: GX,Y for n = 4.

GX,Y is clearly connected. Moreover, the distance of v from every


vertex is at most 2, and the distance of w from every vertex is at most
2. Similarly the distance of all the vertices of A, B, C, D from the rest

of the graph is always at most 3. When (i, j) ∈ / X ∩ Y, we also have


that the distance of the i’th vertex in A from the j’th vertex in D is at
most 2. This is because, if say (i, j) ∈
/ X, we can take the path from
the i’th vertex of A to the j’th vertex of B to the j’th vertex of D. On
the other hand, if (i, j) ∈ X ∩ Y, the distance of the i’th vertex of A
from the j’th vertex of D is at least 3. To summarize:

Claim 12.2. The diameter of GX,Y is 2 if X, Y are disjoint, and 3 if X, Y are


not disjoint.

Consider the protocol obtained when Alice simulates all the ver-
tices close to A, B, and Bob simulates all the nodes close to C, D. This
protocol must solve the disjointness problem, and so has communi-
cation at least Ω(n2 ). This proves that the O(n) links that cross from
the left to the right in the above network must carry at least Ω(n2 )
bits of communication to compute the diameter of the graph.
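To illustrate the reduction (this sketch is not from the book), the following code builds GX,Y for small hypothetical sets X, Y and computes the diameter by breadth-first search, confirming Claim 12.2.

    import itertools
    from collections import deque

    n = 4
    A = [("A", i) for i in range(n)]; B = [("B", i) for i in range(n)]
    C = [("C", i) for i in range(n)]; D = [("D", i) for i in range(n)]
    X = {(0, 1), (2, 2)}                   # Alice's set (hypothetical)
    Y = {(1, 3), (2, 2)}                   # Bob's set; X and Y intersect in (2, 2)

    adj = {u: set() for u in A + B + C + D + ["v", "w"]}
    def add(u, x): adj[u].add(x); adj[x].add(u)

    for block in (A, B, C, D):             # each block is a clique
        for u, x in itertools.combinations(block, 2): add(u, x)
    for u in A + B: add("v", u)
    for u in C + D: add("w", u)
    add("v", "w")
    for i in range(n):
        add(A[i], C[i]); add(B[i], D[i])
    for i, j in itertools.product(range(n), repeat=2):
        if (i, j) not in X: add(A[i], B[j])
        if (i, j) not in Y: add(C[i], D[j])

    def ecc(s):                            # eccentricity of s, by BFS
        dist = {s: 0}; q = deque([s])
        while q:
            u = q.popleft()
            for x in adj[u]:
                if x not in dist:
                    dist[x] = dist[u] + 1; q.append(x)
        return max(dist.values())

    print(max(ecc(s) for s in adj))        # 3, since X and Y intersect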

Detecting Triangles

4
Drucker et al., 2014
Another basic measure associated with a graph is its girth,
which is the length of the shortest cycle in the graph. Here we show
that any distributed protocol for computing the girth of an n vertex
graph must involve at least $\Omega(n^2 2^{-O(\sqrt{\log n})})$ bits of communication.
We prove4 this by showing that any such protocol can be used
to compute disjointness in the number-on-forehead model with 3
parties, and a universe of size $\Omega(n^2 2^{-O(\sqrt{\log n})})$. Applying Theorem
5.12 gives the lower bound.
Suppose Alice, Bob and Charlie have 3 sets X, Y, Z ⊆ U written
on their foreheads, where U is a set that we shall soon specify. Let
A, B, C be 3 disjoint sets of size 2n. We shall define a graph GX,Y,Z on
the vertex set A ∪ B ∪ C, that will have a triangle (namely a cycle of
length 3) if and only if X ∩ Y ∩ Z is non-empty.
To construct GX,Y,Z we need the coloring promised by Theorem
4.2. This is a coloring of [n] with $2^{O(\sqrt{\log n})}$ colors such that there are
no monochromatic arithmetic progressions. Since such a coloring
exists, there must be a subset Q ⊆ [n] of size $n2^{-O(\sqrt{\log n})}$ that does
not contain any non-trivial arithmetic progressions.
Now define a graph G on the vertex set A ∪ B ∪ C, where for each
a ∈ A, b ∈ B, c ∈ C,
Figure 12.2: The graph G.

( a, b) ∈ G ⇔ b − a ∈ Q,
(b, c) ∈ G ⇔ c − b ∈ Q,
c−a
( a, c) ∈ G ⇔ ∈ Q.
2

log n)
Claim 12.3. The graph G has at least n| Q| = Ω(n2 2−O( ) triangles,
and no two triangles in G share an edge.

Proof. For each element q ∈ Q, the vertices a ∈ A, a + q ∈ B, a + 2q ∈ C
certainly form a triangle, as long as a + 2q ≤ 2n. No two of these
triangles share an edge, since any edge determines a, q. We claim
that there are no other triangles. Indeed, if a, b, c was a triangle in the
graph, then b − a = q_1 ∈ Q, c − b = q_2 ∈ Q, and $\frac{q_1+q_2}{2} = \frac{c-b+b-a}{2} = \frac{c-a}{2} = q_3 \in Q$,
so we have q_1, q_3, q_2 ∈ Q form an arithmetic progression. The
only way this can happen is if q_1 = q_2 = q_3, so that the progression is
trivial. But in this case we recover one of the triangles above.
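A toy sketch (not from the book) of this construction: it builds G from a small progression-free set Q (chosen by hand here, rather than via the Behrend-style coloring) and checks that every triangle has the form (a, a + q, a + 2q).

    n = 15
    Q = {1, 2, 4, 8, 13}                    # a hand-picked progression-free set
    V = range(2 * n)                        # A, B, C are three disjoint copies of V

    def edge_AB(a, b): return (b - a) in Q
    def edge_BC(b, c): return (c - b) in Q
    def edge_AC(a, c): return (c - a) % 2 == 0 and (c - a) // 2 in Q

    triangles = [(a, b, c) for a in V for b in V for c in V
                 if edge_AB(a, b) and edge_BC(b, c) and edge_AC(a, c)]
    # Every triangle has the form (a, a + q, a + 2q) for some q in Q, so no two
    # triangles share an edge.
    assert all((b - a) == (c - b) and (b - a) in Q for a, b, c in triangles)
    print(len(triangles), "edge-disjoint triangles")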

Let U denote the set of triangles in G. The graph GX,Y,Z will be
defined as follows:
(a, b) ∈ GX,Y,Z ⇔ the triangle of U containing (a, b) is in Z,
(b, c) ∈ GX,Y,Z ⇔ the triangle of U containing (b, c) is in X,
(a, c) ∈ GX,Y,Z ⇔ the triangle of U containing (a, c) is in Y.
Figure 12.3: The graph GX,Y,Z.
Given sets X, Y, Z as input, Alice, Bob and Charlie build the network
GX,Y,Z and execute the protocol for detecting triangles, with Alice,
Bob, Charlie simulating the behavior of the nodes in A, B, C of the
network. Each of the players knows enough information to simulate
the behavior of these nodes. By Theorem 5.12, the total communi-
cation of triangle detection must be at least $\Omega(n^2 2^{-O(\sqrt{\log n})})$, as
required.
required.

Verifying a Spanning Tree


Bibliography

A. Aho, J. Ullman, and M. Yannakakis. On notations of information


transfer in VLSI circuits. In Proc. 15th Ann. ACM Symp. on Theory of
Computing, pages 133–139, 1983.

Miklós Ajtai. A lower bound for finding predecessors in Yao’s cell
probe model. Combinatorica, 8(3):235–247, 1988.

probe model. Combinatorica, 8(3):235–247, 1988.

Noga Alon, Shlomo Hoory, and Nathan Linial. The moore bound for
irregular graphs. Graphs and Combinatorics, 18(1):53–57, 2002. URL
http://dx.doi.org/10.1007/s003730200002.

László Babai, Noam Nisan, and Mario Szegedy. Multiparty protocols


and logspace-hard pseudorandom sequences. In Proceedings of the
21st Annual Symposium on Theory of Computing (STOC), pages 1–11,
1989.

László Babai, Anna Gál, Peter G. Kimmel, and Satyanarayana V.


Lokam. Communication complexity of simultaneous messages. SIAM
J. Comput, 33(1):137–166, 2003.

Ajesh Babu and Jaikumar Radhakrishnan. An entropy based proof


of the moore bound for irregular graphs. CoRR, abs/1011.1058, 2010.
URL http://arxiv.org/abs/1011.1058.

Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An


information statistics approach to data stream and communication
complexity. Journal of Computer and System Sciences, 68(4):702–732,
2004. URL http://dx.doi.org/10.1016/j.jcss.2003.11.006.

Boaz Barak, Mark Braverman, Xi Chen, and Anup Rao. How to


compress interactive communication. In Proceedings of the 2010 ACM
International Symposium on Theory of Computing, pages 67–76, 2010.

Paul Beame and Faith E. Fich. Optimal bounds for the predecessor
problem and related problems. J. Comput. Syst. Sci., 65(1):38–72, 2002.

Felix A. Behrend. On sets of integers which contain no three terms in


arithmetical progression. Proc. Nat. Acad. Sci., 1946.

Mark Braverman and Ankit Garg. Public vs private coin in bounded-


round information. In Javier Esparza, Pierre Fraigniaud, Thore
Husfeldt, and Elias Koutsoupias, editors, ICALP (1), volume 8572 of
Lecture Notes in Computer Science, pages 502–513. Springer, 2014. ISBN
978-3-662-43947-0.

Mark Braverman and Ankur Moitra. An information complexity


approach to extended formulations. In Dan Boneh, Tim Rough-
garden, and Joan Feigenbaum, editors, Symposium on Theory
of Computing Conference, STOC’13, Palo Alto, CA, USA, June 1-4,
2013, pages 161–170. ACM, 2013. ISBN 978-1-4503-2029-0. URL
http://dl.acm.org/citation.cfm?id=2488608.

Amit Chakrabarti, Yaoyun Shi, Anthony Wirth, and Andrew Yao.


Informational complexity and the direct sum problem for simulta-
neous message complexity. In Proceedings of the 42nd Annual IEEE
Symposium on Foundations of Computer Science, pages 270–278, 2001.

Ashok K. Chandra, Merrick L. Furst, and Richard J. Lipton. Multi-


party protocols. In Proceedings of the fifteenth annual ACM Symposium
on Theory of Computing, pages 94–99. ACM Press, 1983.

Fan R. K. Chung, Ronald L. Graham, Peter Frankl, and James B.


Shearer. Some intersection theorems for ordered sets and graphs.
Journal of Combinatorial Theory, Ser. A, 43(1):23–37, 1986.

Richard Cole and Uzi Vishkin. Deterministic coin tossing and


accelerating cascades: micro and macro techniques for designing
parallel algorithms. In Proceedings of the Eighteenth Annual ACM
Symposium on Theory of Computing, pages 206–219, 28–30 May 1986.

Andrew Drucker, Fabian Kuhn, and Rotem Oshman. On the power


of the congested clique model. In Magnús M. Halldórsson and
Shlomi Dolev, editors, ACM Symposium on Principles of Distributed
Computing, PODC ’14, Paris, France, July 15-18, 2014, pages 367–376.
ACM, 2014. ISBN 978-1-4503-2944-6. URL http://dl.acm.org/
citation.cfm?id=2611462.

Pavol Duris, Zvi Galil, and Georg Schnitger. Lower bounds on


communication complexity. Information and Computation, 73(1):1–22,
1987.

Jack Edmonds. Paths, trees, and flowers. Canadian Journal of Mathe-


matics, 17:449–467, 1965.

Jack Edmonds. Matroids and the greedy algorithm. Math. Program, 1


(1):127–136, 1971. URL http://dx.doi.org/10.1007/BF01584082.

David Ellis, Yuval Filmus, and Ehud Friedgut. Triangle-intersecting


families of graphs, 2010. URL http://arxiv.org/abs/1010.4909.

Tomàs Feder, Eyal Kushilevitz, Moni Naor, and Noam Nisan. Amor-
tized communication complexity. SIAM Journal on Computing, 24(4):
736–750, 1995. Prelim version by Feder, Kushilevitz, Naor FOCS 1991.

Samuel Fiorini, Thomas Rothvoß, and Hans Raj Tiwary. Extended for-
mulations for polygons. Discrete & Computational Geometry, 48(3):658–
668, 2012. URL http://dx.doi.org/10.1007/s00454-012-9421-9.

Michael L. Fredman and Michael E. Saks. The cell probe complexity


of dynamic data structures. In David S. Johnson, editor, Proceedings of
the 21st Annual ACM Symposium on Theory of Computing, May 14-17,
1989, Seattle, Washington, USA, pages 345–354. ACM, 1989. ISBN
0-89791-307-8. URL http://doi.acm.org/10.1145/73007.73040.

Silvio Frischknecht, Stephan Holzer, and Roger Wattenhofer. Net-


works cannot compute their diameter in sublinear time. In Proceed-
ings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete
Algorithms, SODA 2012, Kyoto, Japan, January 17-19, 2012, pages
1150–1162. SIAM, 2012.

Michel X. Goemans. Smallest compact formulation for the


permutahedron. Math. Program, 153(1):5–11, 2015. URL
http://dx.doi.org/10.1007/s10107-014-0757-1.

Mika Göös, Toniann Pitassi, and Thomas Watson. Deterministic


communication vs. partition number. In Venkatesan Guruswami,
editor, IEEE 56th Annual Symposium on Foundations of Computer
Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages
1077–1088. IEEE Computer Society, 2015. ISBN 978-1-4673-8191-8.

Ronald L. Graham. Rudiments of Ramsey theory. Number 45 in


Regional Conference series in mathematics. American Mathematical
Society, 1980.

Ronald L. Graham, Bruce L. Rothschild, and Joel H. Spencer. Ramsey


theory. Wiley-Interscience Series in Discrete Mathematics. John Wiley
& Sons, Chichester-New York-Brisbane-Toronto-Singapore, 1980.

Vince Grolmusz. Circuits and multi-party protocols. Computational


Complexity, 7(1):1–18, 1998.

Armin Haken. The intractability of resolution. Theoretical Computer


Science, 39(2–3):297–308, August 1985.

Alfred W. Hales and Robert I. Jewett. On regularity and positional


games. Trans. Amer. Math. Soc., 106:222–229, 1963.

Bernd Halstenberg and Rüdiger Reischuk. Different modes of


communication. SIAM Journal on Computing, 22(5):913–934, 1993.

Prahladh Harsha, Rahul Jain, David A. McAllester, and Jaikumar


Radhakrishnan. The communication complexity of correlation.
In IEEE Conference on Computational Complexity, pages 10–23. IEEE
Computer Society, 2007. URL http://doi.ieeecomputersociety.
org/10.1109/CCC.2007.32.

Johan Håstad. Computational limitations of small-depth circuits. The MIT


Press, Cambridge(MA)-London, 1987.

Johan Håstad and Avi Wigderson. The randomized communication


complexity of set disjointness. Theory of Computing, 3(1):211–219, 2007.
URL http://dx.doi.org/10.4086/toc.2007.v003a011.

Thomas Holenstein. Parallel repetition: Simplifications and the


no-signaling case. In Proceedings of the 39th Annual ACM Symposium
on Theory of Computing, 2007.

Stephan Holzer and Roger Wattenhofer. Optimal distributed all


pairs shortest paths and applications. In Darek Kowalski and
Alessandro Panconesi, editors, ACM Symposium on Principles of
Distributed Computing, PODC ’12, Funchal, Madeira, Portugal, July
16-18, 2012, pages 355–364. ACM, 2012. ISBN 978-1-4503-1450-3. URL
http://dl.acm.org/citation.cfm?id=2332432.

Pavel Hrubeš and Anup Rao. Circuits with medium fan-in. In 30th
Conference on Computational Complexity, CCC 2015, June 17-19, 2015,
Portland, Oregon, USA, volume 33, pages 381–391, 2015.

Pavel Hrubeš. Personal communication, 2016.

Fritz John. Extremum problems with inequalities as subsidiary


conditions. pages 187–204, 1948.

Volker Kaibel and Kanstantsin Pashkovich. Constructing extended


formulations from reflection relations, November 16 2010. URL
http://arxiv.org/abs/1011.3597.

Bala Kalyanasundaram and Georg Schnitger. The probabilistic


communication complexity of set intersection. SIAM Journal on
Discrete Mathematics, 5(4):545–557, November 1992. ISSN 0895-4801
(print), 1095-7146 (electronic).

Mauricio Karchmer and Avi Wigderson. Monotone circuits for


connectivity require super-logarithmic depth. SIAM Journal on
Discrete Mathematics, 3(2):255–265, May 1990. ISSN 0895-4801 (print),
1095-7146 (electronic).

Hartmut Klauck, Ashwin Nayak, Amnon Ta-Shma, and David


Zuckerman. Interaction in quantum communication. IEEE Trans.
Information Theory, 53(6):1970–1982, 2007. URL http://dx.doi.org/
10.1109/TIT.2007.896888.

Jon M. Kleinberg and Éva Tardos. Algorithm design. Addison-Wesley,


2006. ISBN 978-0-321-37291-8.

Robin Kothari. Nearly optimal separations between communication


(or query) complexity and partitions. CoRR, abs/1512.01210, 2015.

Nathan Linial. Locality in distributed graph algorithms. SIAM Journal


on Computing, 21(1):193–201, 1992.

László Lovász. Communication complexity: A survey. Technical


report, 1990.

László Lovász and Michael E. Saks. Lattices, Möbius functions and


communication complexity. In FOCS, pages 81–90. IEEE Computer
Society, 1988.

Shachar Lovett. Communication is bounded by root of rank. In


David B. Shmoys, editor, Symposium on Theory of Computing, STOC
2014, New York, NY, USA, May 31 - June 03, 2014, pages 842–846.
ACM, 2014. ISBN 978-1-4503-2710-7.

Peter Miltersen, Noam Nisan, Shmuel Safra, and Avi Wigderson. On


data structures and asymmetric communication complexity. Journal
of Computer and System Sciences, 57:37–49, 1 1998.

Neciporuk. A Boolean function. DOKLADY: Russian Academy of


Sciences Doklady. Mathematics (formerly Soviet Mathematics–Doklady), 7,
1966.

Ilan Newman. Private vs. common random bits in communication


complexity. Information Processing Letters, 39(2):67–71, 31 July 1991.

Noam Nisan and Avi Wigderson. Rounds in communication


complexity revisited. SIAM Journal on Computing, 22(1):211–219, 1993.

Noam Nisan and Avi Wigderson. On rank vs. communication


complexity. Combinatorica, 15(4):557–565, 1995.

Mihai Pătraşcu. Unifying the landscape of cell-probe lower


bounds. SIAM Journal on Computing, 40(3):827–847,
2011. ISSN 0097-5397 (print), 1095-7111 (electronic). doi:
http://dx.doi.org/10.1137/09075336X.

Mihai Pătraşcu and Mikkel Thorup. Time-space trade-offs for


predecessor search. In Proc. 38th ACM Symposium on Theory of
Computing (STOC), pages 232–240, 2006. See also arXiv:cs/0603043.

Mihai Pătraşcu and Mikkel Thorup. Dynamic integer sets with


optimal rank, select, and predecessor search. In 55th IEEE Annual
Symposium on Foundations of Computer Science, FOCS 2014, Philadel-
phia, PA, USA, October 18-21, 2014, pages 166–175, 2014.

Richard Rado. An inequality. Journal of the London Mathematical Society, 27:


1–6, 1952.

Anup Rao and Amir Yehudayoff. Simplified lower bounds on the


multiparty communication complexity of disjointness. In Conference
on Computational Complexity, volume 33, pages 88–101, 2015.

Ran Raz and Avi Wigderson. Monotone circuits for matching


require linear depth. J. ACM, 39(3):736–744, 1992. URL http:
//doi.acm.org/10.1145/146637.146684.

Alexander A. Razborov. On the distributed complexity of disjointness. TCS:


Theoretical Computer Science, 106, 1992.

Thomas Rothvoß. The matching polytope has exponential extension


complexity. In David B. Shmoys, editor, Symposium on Theory
of Computing, STOC 2014, New York, NY, USA, May 31 - June 03,
2014, pages 263–272. ACM, 2014. ISBN 978-1-4503-2710-7. URL
http://dl.acm.org/citation.cfm?id=2591796.

Pranab Sen and Srinivasan Venkatesh. Lower bounds for predecessor


searching in the cell probe model. J. Comput. Syst. Sci., 74(3):364–385,
2008.

Claude E. Shannon. A mathematical theory of communication. Bell


System Technical Journal, 27, 1948. Monograph B-1598.

Alexander A. Sherstov. The multiparty communication complexity


of set disjointness. In Proceedings of the 44th Symposium on Theory of
Computing Conference, STOC 2012, New York, NY, USA, May 19 - 22,
2012, pages 525–548, 2012.

Balázs Szegedy. An information theoretic approach to Sidorenko's


conjecture, January 26 2014. URL http://arxiv.org/abs/1406.6738.

P. van Emde Boas. Preserving order in a forest in less than logarith-


mic time. In Proceedings of the 16th Annual Symposium on Foundations
of Computer Science, SFCS ’75, pages 75–84, Washington, DC, USA,
1975. IEEE Computer Society. doi: 10.1109/SFCS.1975.26. URL
http://dx.doi.org/10.1109/SFCS.1975.26.

John von Neumann. Zur Theorie der Gesellschaftsspiele. (German)


[On the theory of games of strategy]. Mathematische Annalen, 100:
295–320, 1928. ISSN 0025-5831.

Wikipedia. Linear programming — Wikipedia, the free encyclopedia,


2016a. URL https://en.wikipedia.org/wiki/Linear_programming.
[Online; accessed 30-August-2016].

Wikipedia. Boolean satisfiability problem — Wikipedia, the free en-


cyclopedia, 2016b. URL https://en.wikipedia.org/wiki/Boolean_
satisfiability_problem. [Online; accessed 30-August-2016].

Mihalis Yannakakis. Expressing combinatorial optimization problems


by linear programs. Journal of Computer and System Sciences, 43(3):
441–466, December 1991.

Andrew Chi-Chih Yao. Some complexity questions related to


distributive computing. In STOC, pages 209–213, 1979.

Andrew Chi-Chih Yao. Lower bounds by probabilistic arguments


(extended abstract). In FOCS, pages 420–428. IEEE Computer Society,
1983.
