nature, as can be approximately implemented by tree-like or recurrent deep neural networks. This
model class captures the essence of probabilistic context-free grammars as well as recursive self-
reproduction in physical phenomena such as turbulence and cosmological inflation. We derive an
analytic formula for the asymptotic power law and elucidate our results in a statistical physics
context: 1-dimensional “shallow” models (such as Markov models or regular grammars) will fail to
model natural language, because they cannot exhibit criticality, whereas “deep” models with one or
more “hidden” dimensions representing levels of abstraction or scale can potentially succeed.
[Figure 1: log–log plot of mutual information I(X,Y) in bits versus distance between symbols d(X,Y), for a critical 2D Ising model, English Wikipedia, Bach, English text, French text, the human genome, and a Markov process.]
FIG. 1: Decay of mutual information with separation. Here the mutual information in bits per symbol is shown as a function of separation d(X,Y) = |i − j|, where the symbols X and Y are located at positions i and j in the sequence in question, and shaded bands correspond to 1σ error bars. All measured curves are seen to decay roughly as power laws, explaining why they cannot be accurately modeled as Markov processes — for which the mutual information instead plummets exponentially (the example shown has I ∝ e^{−d/6}). The measured curves are seen to be qualitatively similar to that of a famous critical system in physics: a 1D slice through a critical 2D Ising model, where the slope is −1/2. The human genome data consists of 177,696,512 base pairs {A, C, T, G} from chromosome 5 from the National Center for Biotechnology Information [10], with unknown base pairs omitted. The Bach data consists of 5727 notes from Partita No. 2 [11], with all notes mapped into a 12-symbol alphabet consisting of the 12 half-tones {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}, with all timing, volume and octave information discarded. The three text corpuses are 100 MB from Wikipedia [12] (206 symbols), the first 114 MB of a French corpus [13] (185 symbols) and 27 MB of English articles from slate.com (143 symbols). The large long-range information appears to be dominated by poems in the French sample and by html-like syntax in the Wikipedia sample.
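For readers who wish to reproduce curves of this kind, the following minimal Python sketch estimates I(X,Y) as a function of separation with a naive plug-in estimator; the corpus file name is a placeholder, and the measurements reported in this paper instead use the bias-corrected Grassberger estimator described in Appendix D.

    import numpy as np
    from collections import Counter

    def entropy_bits(counts):
        # Plug-in Shannon entropy (in bits) from a Counter of observed symbols.
        n = sum(counts.values())
        p = np.array(list(counts.values())) / n
        return float(-np.sum(p * np.log2(p)))

    def mutual_information(seq, d):
        # I(X,Y) in bits between symbols located d positions apart in seq.
        joint = Counter(zip(seq[:-d], seq[d:]))
        x, y = Counter(seq[:-d]), Counter(seq[d:])
        return entropy_bits(x) + entropy_bits(y) - entropy_bits(joint)

    text = open("corpus.txt", encoding="utf-8").read()   # placeholder corpus
    for d in (1, 2, 4, 8, 16, 32, 64, 128, 256):
        print(d, mutual_information(text, d))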
II. MARKOV IMPLIES EXPONENTIAL DECAY

For two random variables X and Y, the following definitions of mutual information are all equivalent:

  I(X,Y) ≡ S(X) + S(Y) − S(X,Y)
         = D(p(XY) || p(X)p(Y))                                            (1)
         = Σ_ab P(a,b) log_B [P(a,b) / (P(a)P(b))],

where S ≡ ⟨−log_B P⟩ is the Shannon entropy [35] and D(p(XY)||p(X)p(Y)) is the Kullback-Leibler divergence [36] between the joint probability distribution and the product of the individual marginals. If the base of the logarithm is taken to be B = 2, then I(X,Y) is measured in bits. The mutual information can be interpreted as how much one variable knows about the other: I(X,Y) is the reduction in the number of bits needed to specify X once Y is specified. Equivalently, it is the number of encoding bits saved by using the true joint probability P(X,Y) instead of approximating X and Y as independent. It is thus a measure of statistical dependencies between X and Y. Although it is more conventional to measure quantities such as the correlation coefficient ρ in statistics and statistical physics, the mutual information is more suitable for generic data, since it does not require that the variables X and Y are numbers or have any algebraic structure, whereas ρ requires that we are able to multiply X · Y and average. Whereas it makes sense to multiply numbers, it is meaningless to multiply or average two characters such as “!” and “?”.

The rest of this paper is largely a study of the mutual information between two random variables that are realizations of a discrete stochastic process, with some separation τ in time. More concretely, we can think of sequences {X_1, X_2, X_3, ···} of random variables, where each one might take values from some finite alphabet. For example, if we model English as a discrete stochastic process and take τ = 2, X could represent the first character (“F”) in this sentence, whereas Y could represent the third character (“r”) in this sentence.

In particular, we start by studying the mutual information function of a Markov process, which is analytically tractable. Let us briefly recapitulate some basic facts about Markov processes (see, e.g., [37] for a pedagogical review). A Markov process is defined by a matrix M of conditional probabilities M_ab = P(X_{t+1} = a | X_t = b). Such Markov matrices (also known as stochastic matrices) thus have the properties M_ab ≥ 0 and Σ_a M_ab = 1. They fully specify the dynamics of the model:

  p_{t+1} = M p_t,                                                         (2)

where p_t is a vector with components P(X_t = a) that specifies the probability distribution at time t. Let λ_i denote the eigenvalues of M, sorted by decreasing magnitude: |λ_1| ≥ |λ_2| ≥ |λ_3| ≥ ···. All Markov matrices have |λ_i| ≤ 1, which is why blowup is avoided when equation (2) is iterated, and λ_1 = 1, with the corresponding eigenvector giving a stationary probability distribution µ satisfying Mµ = µ.

In addition, two mild conditions are usually imposed on Markov matrices. First, M is irreducible, meaning that every state is accessible from every other state (otherwise, we could decompose the Markov process into separate Markov processes). Second, to avoid processes like 1 → 2 → 1 → 2 ··· that will never converge, we take the Markov process to be aperiodic. It is easy to show using the Perron-Frobenius theorem that being irreducible and aperiodic implies |λ_2| < 1.

This section is devoted to the intuition behind the following theorem, whose full proof is given in Appendices A and B. The theorem states roughly that for a Markov process, the mutual information between two points in time t_1 and t_2 decays exponentially for large separation |t_2 − t_1|:

Theorem 1: Let M be a Markov matrix that generates a Markov process. If M is irreducible and aperiodic, then the asymptotic behavior of the mutual information I(t_1, t_2) is exponential decay toward zero for |t_2 − t_1| ≫ 1 with decay timescale log(1/|λ_2|), where λ_2 is the second largest eigenvalue of M. If M is reducible or periodic, I can instead decay to a constant; no Markov process whatsoever can produce power-law decay.

Suppose M is irreducible and aperiodic so that p_t → µ as t → ∞ as mentioned above. This convergence of one-point statistics, e.g., p_t, has been well-studied [37]. However, one can also study higher order statistics such as the joint probability distribution for two points in time. For succinctness, let us write P(a,b) ≡ P(X = a, Y = b), where X = X_{t_1}, Y = X_{t_2} and τ ≡ |t_2 − t_1|. We are interested in the asymptotic situation where the Markov process has converged to its steady state, so the marginal distribution P(a) ≡ Σ_b P(a,b) = µ_a, independently of time.

If the joint probability distribution approximately factorizes as P(a,b) ≈ µ_a µ_b for sufficiently large and well-separated times t_1 and t_2 (as we will soon prove), the mutual information will be small. We can therefore Taylor expand the logarithm from equation (1) around the point P(a,b) = P(a)P(b), giving

  I(X,Y) = Σ_ab P(a,b) log_B [P(a,b) / (P(a)P(b))]
         = Σ_ab P(a,b) log_B ( 1 + [P(a,b)/(P(a)P(b)) − 1] )               (3)
         ≈ Σ_ab P(a,b) [P(a,b)/(P(a)P(b)) − 1] · (1/ln B) = I_R(X,Y)/ln B,
…decay toward zero for |t_2 − t_1| ≫ 1 with decay timescale log(1/|λ_2|), where λ_2 is the second largest eigenvalue of M.

This theorem is a strict generalization of Theorem 1, since given any Markov process M with corresponding matrix M, we can construct an HMM that reproduces the exact statistics of M by using M as the transition matrix between the Y’s and generating X_i from Y_i by simply setting x_i = y_i with probability 1.

The proof is very similar in spirit to the proof of Theorem 1, so we will just present a sketch here, leaving a full proof to Appendix B. Let G be the Markov matrix that governs X_i → Y_i. To compute the joint probability between two random variables X_{t_1} and X_{t_2}, we simply compute the joint probability distribution between Y_{t_1} and Y_{t_2}, which again involves a factor of M^τ, and then use two factors of G to convert the joint probability on Y_{t_1}, Y_{t_2} to a joint probability on X_{t_1}, X_{t_2}. These additional two factors of G will not change the fact that there is an exponential decay given by M^τ.

A simple, intuitive bound from information theory (namely the data processing inequality [37]) gives I(Y_{t_1}, Y_{t_2}) ≥ I(Y_{t_1}, X_{t_2}) ≥ I(X_{t_1}, X_{t_2}). However, Theorem 1 implies that I(Y_{t_1}, Y_{t_2}) decays exponentially. Hence I(X_{t_1}, X_{t_2}) must also decay at least as fast as exponentially.

There is a well-known correspondence between so-called probabilistic regular grammars [39] (sometimes referred to as stochastic regular grammars) and HMMs. Given a probabilistic regular grammar, one can generate an HMM that reproduces all statistics and vice versa. Hence, we can also state Theorem 2 as follows:

Corollary: No probabilistic regular grammar exhibits criticality.

In the next section, we will show that this statement is not true for context-free grammars.

III. POWER LAWS FROM GENERATIVE GRAMMAR

If computationally feasible Markov processes cannot produce genomes, melodies, texts or other sequences with roughly critical behavior, then how do such sequences arise? What sort of alternative processes can generate them? This question is not only interesting theoretically, but also important practically: models which can approximate such critical sequences could explain how some machine learning algorithms perform better than others in tasks such as language processing, and might suggest ways to improve existing algorithms. In the best case scenario, theoretical models may even shed light on how human brains can efficiently generate English sentences without storing googols of parameters.

One answer advanced by Chomsky [40] and others which will loosely inspire our work is the idea of deep structure. Roughly speaking, the idea is that language is generated in a hierarchical rather than linear fashion. When we write an essay, we do so by thinking about some big idea, and then breaking it into sub-ideas, and sub-sub-ideas, etc. Similarly, we generate a sentence by first choosing its basic structure and then fleshing out each part of the sentence with more modifiers, etc. For example, the two sentences “Bob loves Alice” and “Alice is loved by Bob” are “close” in meaning but there could be nothing similar about them if English were Markov. On the other hand, if English is generated hierarchically, these sentences might be close in the sense that they diverged close to the leaves of the generative tree.

A. A simple recursive grammar model

We can formalize the above considerations by giving production rules for a toy language L over an alphabet A. In the parlance of theoretical linguistics, our language is generated by a stochastic or probabilistic context-free grammar (PCFG) [41–44]. We will discuss the relationship between our model and a generic PCFG in Section C. The language is defined by how a native speaker of L produces sentences: first, she draws one of the |A| characters from some probability distribution µ on A. She then takes this character x_0 and replaces it with q new symbols, drawn from a probability distribution P(b|a), where a ∈ A is the first symbol and b ∈ A is any of the second symbols. This is repeated over and over. After u steps, she has a sentence of length q^u [footnote 2].

One can ask for the character statistics of the sentence at production step u given the statistics of the sentence at production step u − 1. The character distribution is simply

  P_u(b) = Σ_a P(b|a) P_{u−1}(a).                                          (9)

Of course this equation does not imply that the process is a Markov process when the sentences are read left to right. To characterize the statistics as read from left to right, we really want to compute the statistical dependencies within a given sequence, e.g., at fixed u.

To see that the mutual information decays like a power law rather than exponentially with separation, consider two random variables X and Y separated by τ. One can ask how many generations took place between X and the nearest common ancestor of X and Y. Typically, this will be about log_q τ generations. Hence in the tree graph shown

Footnote 2: This exponential blow-up is reminiscent of de Sitter space in cosmic inflation. There is actually a much deeper mathematical analogy involving conformal symmetry and p-adic numbers that has been discussed by Harlow et al. [45].
…except for a perturbation near the top of the tree. We can think of the generalization as equivalent to the old model except for a different initial condition. We thus expect on intuitive grounds that the model will still exhibit power law decay. This intuition is correct, as we will prove rigorously in Appendix C.

C. Further Generalization: Bayesian networks and context-free grammars

Just how generic is the scaling behavior of our model? What if the length of the words is not constant? What about more complex dependencies between layers? If we retrace the derivation in the above arguments, it becomes clear that the only key feature of all of our models considered so far is that the rational mutual information decays exponentially with the causal distance ∆:

  I_R ∝∼ e^{−γ∆}.                                                          (14)

This is true for (hidden) Markov processes and the hierarchical grammar models that we have considered above. So far we have defined ∆ in terms of quantities specific to these models; for a Markov process, ∆ is simply the time separation. Can we define ∆ more generically? In order to do so, let us make a brief aside about Bayesian networks. Formally, a Bayesian net is a directed acyclic graph (DAG), where the vertices are random variables and conditional dependencies are represented by the arrows. Now instead of thinking of X and Y as living at certain times (t_1, t_2), we can think of them as living at vertices (i, j) of the graph.

We define ∆(i, j) as follows. Since the Bayesian net is a DAG, it is equipped with a partial order ≤ on vertices. We write k ≤ l iff there is a path from k to l, in which case we say that k is an ancestor of l. We define L(k, l) to be the number of edges on the shortest directed path from k to l. Finally, we define the causal distance ∆(i, j) to be

  ∆(i, j) ≡ min_{x≤i, x≤j} [L(x, i) + L(x, j)].                            (15)

It is easy to see that this reduces to our previous definition of ∆ for Markov processes and recursive generative trees (see Figure 2).

Is it true that our exponential decay result from equation (14) holds even for a generic Bayesian net? The answer is yes, under a suitable approximation. The approximation is to ignore long paths in the network when computing the mutual information. In other words, the mutual information tends to be dominated by the shortest paths via a common ancestor, whose length is ∆. This is generally a reasonable approximation, because these longer paths will give exponentially weaker correlations, so unless the number of paths increases exponentially (or faster) with length, the overall scaling will not change.

With this approximation, we can state a key finding of our theoretical work. Deep models are important because without the extra “dimension” of depth/abstraction, there is no way to construct “shortcuts” between random variables that are separated by large amounts of time with short-range interactions; 1D models will be doomed to exponential decay. Hence the ubiquity of power laws explains the success of deep learning. In fact, this can be seen as the Bayesian net version of the important result in statistical physics that there are no phase transitions in 1D [46, 47].

There are close analogies between our deep recursive grammar and more conventional physical systems. For example, according to the emerging standard model of cosmology, there was an early period of cosmological inflation when density fluctuations kept getting added on a fixed scale as space itself underwent repeated doublings, combining to produce an excellent approximation to a power-law correlation function. This inflationary process is simply a special case of our deep recursive model (generalized from 1 to 3 dimensions). In this case, the hidden “depth” dimension in our model corresponds to cosmic time, and the time parameter which labels the place in the sequence of interest corresponds to space. A similar physical analogy is turbulence in a fluid, where energy in the form of vortices cascades from large scales to ever smaller scales through a recursive process where larger vortices create smaller ones, leading to a scale-invariant power spectrum. There is also a close analogy to quantum mechanics: equation (13) expresses the exponential decay of the mutual information with geodesic distance through the Bayesian network; in quantum mechanics, the correlation function of a many-body system decays exponentially with the geodesic distance defined by the tensor network which represents the wavefunction [48].

It is also worth examining our model using techniques from linguistics. A generic PCFG G consists of three ingredients:

1. An alphabet Ā = A ∪ T which consists of non-terminal symbols A and terminal symbols T.

2. A set of production rules of the form a → B, where the left hand side a ∈ A is always a single non-terminal character and B is a string consisting of symbols in Ā.

3. Probabilities associated with each production rule P(a → B), such that for each a ∈ A, Σ_B P(a → B) = 1.

It is a remarkable fact that any stochastic context-free grammar can be put in Chomsky normal form [43, 49]. This means that given G, there exists some other grammar Ḡ such that all the production rules are either of
the form a → bc or a → α, where a, b, c ∈ A and α ∈ T, and the corresponding languages satisfy L(G) = L(Ḡ). In other words, given some complicated grammar G, we can always find a grammar Ḡ such that the corresponding statistics of the languages are identical and all the production rules replace a symbol by at most two symbols (at the cost of increasing the number of production rules in Ḡ).

This formalism allows us to strengthen our claims. Our model with a branching factor q = 2 is precisely the class of all context-free grammars that are generated by production rules of the form a → bc. While this might naively seem like a very small subset of all possible context-free grammars, the fact that any context-free grammar can be converted into Chomsky normal form shows that our theory deals with a generic context-free grammar, except for the additional step of producing terminal symbols from non-terminal symbols. Starting from a single symbol, the deep dynamics of the PCFG in normal form are given by a strongly-correlated branching process with q = 2 which proceeds for a characteristic number of productions before terminal symbols are produced. Before most symbols have been converted to terminal symbols, our theory applies, and power-law correlations will exist amongst the non-terminal symbols. To the extent that the terminal symbols that are then produced from non-terminal symbols reflect the correlations of the non-terminal symbols, we expect context-free grammars to be able to produce power law correlations. From our corollary to Theorem 2, we know that regular grammars cannot exhibit power-law decays in mutual information. Hence context-free grammars are the simplest grammars which support criticality, i.e., they are the lowest in the Chomsky hierarchy that support criticality. Note that our corollary to Theorem 2 also implies that not all context-free grammars exhibit criticality, since regular grammars are a strict subset of context-free grammars. Whether one can formulate an even sharper criterion should be the subject of future work.

IV. DISCUSSION

We have shown that many data sequences generated for the purposes of communications — from English and French text to Bach and the human genome — exhibit critical behavior, where the mutual information between symbols decays roughly like a power law with separation. By introducing a quantity we term rational mutual information, we have proved that (hidden) Markov processes generically exhibit exponential decay, whereas deep generative grammars exhibit power law decays. This explains why natural languages are poorly approximated by Markov processes, but better approximated by the deep recurrent neural networks now widely used in machine learning for natural language processing, as we will discuss in detail below. Furthermore, we have identified a crucial ingredient of any successful natural language model: it must have one or more “hidden” dimensions, which can be used to provide shortcuts between distant parts of the network, hence leading to longer-range correlations that are crucial for critical behavior.

Let us now explore some useful implications of these results, both for understanding the success of certain neural network architectures and for using the mutual information function as a tool for validating machine learning algorithms.

A. Connection to Recurrent Neural Networks

While the generative grammar model is appealing from a linguistic perspective, it may superficially appear to have little to do with machine learning algorithms that are implemented in practice. However, as we will now see, this model can in fact be viewed as an idealized version of a long short-term memory (LSTM) recurrent neural network (RNN) that is generating (“hallucinating”) a sequence.

First of all, Figure 4 shows that an LSTM RNN can in fact reproduce critical behavior. In this example, we trained an RNN (consisting of three hidden LSTM layers of size 256, as described in [20]) to predict the next character in the 100 MB Wikipedia sample known as enwik8 [12]. We then used the LSTM to hallucinate 1 MB of text and measured the mutual information as a function of distance. Figure 4 shows that not only is the resulting mutual information function a rough power law, but it also has a slope that is relatively similar to the original.

We can understand this success by considering a simplified model that is less powerful and complex than a full LSTM, but retains some of its core features — such an approach to studying deep neural nets has proved fruitful in the past (e.g., [27]).

The usual implementation of LSTMs consists of multiple cells stacked one on top of each other. Each cell of the LSTM (depicted as a yellow circle in Fig. 3) has a state that is characterized by a matrix of numbers C_t and is updated according to the following rule

  C_t = f_t ◦ C_{t−1} + i_t ◦ D_t,                                         (16)

where ◦ denotes element-wise multiplication, and D_t = D_t(C_{t−1}, x_t) is some function of the input x_t from the cell in the layer above (denoted by downward arrows in Figure 3), the details of which do not concern us. Generically, a graph of this picture would look like a rectangular lattice, with each node having an arrow to its right (corresponding to the first term in the above equation), and an arrow from above (corresponding to the second term in the equation). However, if the forget weights f decay rapidly with depth (e.g., as we go from the bottom
cell towards the top) so that the timescales for forgetting grow exponentially, we will show that a reasonable approximation to the dynamics is given by Figure 3.

If we neglect the dependency of D_t on C_{t−1}, the forget gate f_t leads to exponential decay of C_{t−1}, e.g., C_t = f^t ◦ C_0; this is how LSTMs forget their past. Note that all operations including exponentiation are performed element-wise in this section only.

In general, a cell will smoothly forget its past over a timescale of ∼ log(1/f) ≡ τ_f. On timescales ≳ τ_f, the cells are weakly correlated; on timescales ≲ τ_f, the cells are strongly correlated. Hence a discrete approximation to the above equation is the following:

  C_t = C_{t−1},    for τ_f timesteps,
      = D_t(x_t),   on every (τ_f + 1)-th timestep.                        (17)

This simple approximation leads us right back to the hierarchical grammar! The first line of the above equation is labeled “remember” in Figure 2 and the second line is what we refer to as “Markov,” since the next state depends only on the previous. Since each cell perfectly remembers its previous state for τ_f time-steps, the tree can be reorganized so that it is exactly of the form shown in Figure 3, by omitting nodes which simply copy the previous state. Now supposing that τ_f grows exponentially with depth, τ_f(layer i) ∝ q · τ_f(layer i + 1), we see that the successive layers become exponentially sparse, which is exactly what happens in our deep grammar model, identifying the parameter q, governing the growth of the forget timescale, with the branching parameter in the deep grammar model. (Compare Figure 2 and Figure 3.)

[Figure 3: a lattice of cells unfolded in time (horizontal axis) and depth, with arrows labeled “Remember”, “Forget” and “Markov”.]

FIG. 3: Our deep generative grammar model can be viewed as an idealization of a long short-term memory (LSTM) recurrent neural net, where the “forget weights” drop with depth so that the forget timescales grow exponentially with depth. The graph drawn here is clearly isomorphic to the graph drawn in Figure 2. For each cell, we approximate the usual incremental updating rule by either perfectly remembering the previous state (horizontal arrows) or by ignoring the previous state and determining the cell state by a random rule depending on the node above (vertical arrows).

B. A new diagnostic for machine learning

How can one tell whether a neural network can be further improved? For example, an LSTM RNN similar to the one we used in Figure 4 can predict Wikipedia text with a residual entropy ∼ 1.4 bits/character [20], which is very close to the performance of current state of the art custom compression software — which achieves ∼ 1.3 bits/character [50]. Is that essentially the best compression possible, or can significant improvements be made?

Our results provide a powerful diagnostic for shedding further light on this question: measuring the mutual information as a function of separation between symbols is a computationally efficient way of extracting much more meaningful information about the performance of a model than simply evaluating the loss function, usually given by the conditional entropy H(X_t | X_{t−1}, X_{t−2}, ...).

Figure 4 shows that even with just three layers, the LSTM-RNN is able to learn long-range correlations; the slope of the mutual information of hallucinated text is comparable to that of the training set. However, the figure also shows that the predictions of our LSTM-RNN are far from optimal. Interestingly, the hallucinated text shows about the same mutual information for distances ∼ O(1), but significantly less mutual information at large separation. Without requiring any knowledge about the true entropy of the input text (which is famously NP-hard to compute), this figure immediately shows that the LSTM-RNN we trained is performing sub-optimally; it is not able to capture all the long-term dependencies found in the training data.

As a comparison, we also calculated the bigram transition matrix P(X_3 X_4 | X_1 X_2) from the data and used it to hallucinate 1 MB of text. Despite the fact that this higher order Markov model needs ∼ 10^3 more parameters than our LSTM-RNN, it captures less than a fifth of the mutual information captured by the LSTM-RNN even at modest separations ≳ 5.

In summary, Figure 4 shows both the successes and shortcomings of machine learning. On the one hand, LSTM-RNNs can capture long-range correlations much more efficiently than Markovian models; on the other hand, they cannot match the two point functions of training data, never mind higher order statistics!

One might wonder how the lack of mutual information at large scales for the bigram Markov model is manifested in the hallucinated text. Below we give a line from the
Markov hallucinations:

  [[computhourgist, Flagesernmenserved whirequotes
  or thand dy excommentaligmaktophy as
  its:Fran at ||<If ISBN 088;&ategor
  and on of to [[Prefung]]’ and at them rector>

This can be compared with an example from the LSTM RNN:

  Proudknow pop groups at Oxford
  - [http://ccw.com/faqsisdaler/cardiffstwander
  --helgar.jpg] and Cape Normans’s first
  attacks Cup rigid (AM).

Despite using many fewer parameters, the LSTM manages to produce a realistic looking URL and is able to close brackets correctly [51], something that the Markov model struggles with.

[Figure 4: log–log plot of mutual information versus distance between symbols d(X,Y) for English Wikipedia, LSTM-RNN-hallucinated Wikipedia, and Markov-hallucinated Wikipedia.]

FIG. 4: Diagnosing different models by hallucinating text and then measuring the mutual information as a function of separation. The red line is the mutual information of enwik8, a 100 MB sample of English Wikipedia. In shaded blue is the mutual information of hallucinated Wikipedia from a trained LSTM with 3 layers of size 256. We plot in solid black the mutual information of a Markov process on single characters, which we compute exactly. (This would correspond to the mutual information of hallucinations in the limit where the length of the hallucinations goes to infinity.) This curve shows a sharp exponential decay after a distance of ∼ 10, in agreement with our theoretical predictions. We also measured the mutual information for hallucinated text on a Markov process for bigrams, which still underperforms the LSTMs in long-ranged correlations, despite having ∼ 10^3 more parameters than the LSTM.

C. Outlook

Although great challenges remain to accurately model natural languages, our results at least allow us to improve on some earlier answers to key questions we sought to address:

1. […] answer is that at least part of the difficulty is that natural language is a critical system, with long-ranged correlations that are difficult for machines to learn.

2. Why are machines bad at natural languages, and why are they good? The old answer is that Markov models are simply not brain/human-like, whereas neural nets are more brain-like and hence better. The new answer is that Markov models or other 1-dimensional models cannot exhibit critical behavior, whereas neural nets and other deep models (where an extra hidden dimension is formed by the layers of the network) are able to exhibit critical behavior.

3. How can we know when machines are bad or good? The old answer is to compute the loss function. The new answer is to also compute the mutual information as a function of separation, which can immediately show how well the model is doing at capturing correlations on different scales.

Future studies could include more comprehensive measurements from multiple language/music corpuses and other one dimensional data. In addition, more theoretical work is required to explain why natural languages have slopes that are all comparable ∼ 1/2. In statistical physics, “coincidences” of this sort are usually signs of universality: many seemingly unrelated systems have the same long-wavelength effective field theory at the critical point and hence display the same power law slopes. Perhaps something similar is happening here, though this is unexpected from the deep generative grammar model, where any power law slope is possible.

Acknowledgments: This work was supported by the Foundational Questions Institute http://fqxi.org. The authors wish to thank Noam Chomsky and Greg Lessard for valuable comments on the linguistic aspects of this work, Taiga Abe, Meia Chita-Tegmark, Hanna Field, Esther Goldberg, Emily Mu, John Peurifoi, Tomaso Poggio, Luis Seoane, Leon Shen and David Theurel for helpful discussions and encouragement, Michelle Xu for help acquiring genome data, and the Center for Brains Minds and Machines (CBMM) for hospitality.

Author contributions: HL proposed the project idea in collaboration with MT. HL and MT collaboratively formulated the proofs, performed the numerical experiments, analyzed the data, and wrote the manuscript.
Appendix A: Properties of rational mutual information

In this appendix, we prove the following elementary properties of rational mutual information:

1. Symmetry: for any two random variables X and Y, I_R(X,Y) = I_R(Y,X). The proof is straightforward:

  I_R(X,Y) = Σ_ab P(X = a, Y = b)² / [P(X = a) P(Y = b)] − 1
           = Σ_ba P(Y = b, X = a)² / [P(Y = b) P(X = a)] − 1 = I_R(Y,X).   (A1)

…where the expectation value is taken with respect to the “true” probability distribution p. Note that as it is written, p could be any probability measure on either a discrete or continuous space. The above results can be trivially modified to show that D_R(p||q) ≥ D_KL(p||q) and hence D_R(p||q) ≥ 0, with equality iff p = q.

Appendix B: General proof for Markov processes

In this appendix, we drop the assumptions of non-degeneracy, irreducibility and non-periodicity made in the main body of the paper, where we proved that Markov processes lead to exponential decay.
Using this Jordan decomposition, we can replicate equation (7) and write

2. The reducible case

Now let us generalize to the case where the Markov process is reducible, i.e., decomposable into a set of m non-interacting Markov processes for some integer m > 1. This means that the state space can be decomposed as

  S = ∪_{i=1}^m S_i,                                                       (B7)

where restricting the Markov process to each S_i results in a well-defined Markov process on S_i that does not interact with any other S_i. If the system starts off in S_1, it can never transition to S_2, and so forth. Now if we restrict our model to the subset of interest, the mutual information will exponentially decay; however, for a generic initial state, the total probability within each set S_i remains constant, so the mutual information will asymptotically approach the entropy of the probability distribution across sets, which can be at most I = log_2 m. In the language of statistical physics, this is an example of topological order which leads to constant terms in the correlation functions; here, the Markov graph of M is disconnected, so there are m degenerate equilibrium states.

3. The periodic case

This asymptotic relation can be strengthened for a subclass of Markov processes which obey a condition known as detailed balance. This subclass arises naturally in the study of statistical physics [52]. For our purposes, this simply means that there exist some real numbers K_m and a symmetric matrix S_ab = S_ba such that

  M_ab = e^{K_a/2} S_ab e^{−K_b/2}.                                        (B9)

Let us note the following facts. (1) The matrix power is simply (M^τ)_ab = e^{K_a/2} (S^τ)_ab e^{−K_b/2}. (2) By the spectral theorem, we can diagonalize S into an orthonormal basis of eigenvectors, which we label as v (or sometimes w), e.g., Sv = λ_i v and v · w = δ_vw. Notice that

  Σ_n M_mn e^{K_n/2} v_n = Σ_n e^{K_m/2} S_mn v_n = λ_i e^{K_m/2} v_m.

Hence we have found an eigenvector of M for every eigenvector of S. Conversely, the set of eigenvectors of S forms a basis, so there cannot be any more eigenvectors of M. This implies that all the eigenvectors of M are given by P^v, with components P^v_m = e^{K_m/2} v_m, and the corresponding eigenvalues are the λ_i. In other words, M and S share the same eigenvalues.
(3) µ_a = (1/Z) e^{K_a} is an eigenvector with eigenvalue 1, and hence is the stationary state:

  Σ_b M_ab µ_b = (1/Z) Σ_b e^{(K_a + K_b)/2} S_ab
              = (1/Z) e^{K_a} Σ_b e^{K_b/2} S_ba e^{−K_a/2} = µ_a Σ_b M_ba = µ_a.      (B10)

The previous facts then let us finish the calculation:

  Σ_ab P(a,b)² / [P(a)P(b)] = Σ_ab e^{K_a} [(S^τ)_ab]² e^{−K_b} e^{K_b − K_a}
                            = Σ_ab [(S^τ)_ab]² = ||S^τ||².                             (B11)

Now using the fact that ||A||² = tr AᵀA and is therefore invariant under an orthogonal change of basis, we find that

  Σ_ab P(a,b)² / [P(a)P(b)] = Σ_i |λ_i|^{2τ}.                                          (B12)

Since the λ_i’s are both the eigenvalues of M and S, and since M is irreducible and aperiodic, there is exactly one eigenvalue λ_1 = 1, and all other eigenvalues are less than one. Altogether,

  I_R(t_1, t_2) = Σ_ab P(a,b)² / [P(a)P(b)] − 1 = Σ_{i≥2} |λ_i|^{2τ}.                  (B13)

…are the result of the generalization. Note that as before, µ is the stationary state corresponding to M. We will only consider the typical case where M is aperiodic, irreducible, and non-degenerate; once we have this case, the other cases can be easily treated by mimicking our above proof for ordinary Markov processes. Using equation (7) and defining g ≡ Gµ gives

  P(a,b) = Σ_cd G_bd [(M^τ)_dc µ_c] G_ac
         = g_a g_b + λ_2^τ Σ_cd G_bd A_dc µ_c G_ac.                                    (B16)

Plugging this in to our definition of rational mutual information gives

  I_R + 1 = Σ_ab P(a,b)² / (g_a g_b)
          = Σ_ab ( g_a g_b + λ_2^τ Σ_cd G_bd A_dc µ_c G_ac ) + λ_2^{2τ} C
          = 1 + λ_2^τ Σ_cd A_dc µ_c + λ_2^{2τ} C
          = 1 + λ_2^{2τ} C,                                                            (B17)

where we have used the facts that Σ_i G_ij = 1, Σ_i A_ij = 0, and as before C is asymptotically constant. This shows that I_R ∝ λ_2^{2τ} decays exponentially.
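As a quick numerical sanity check of this exponential decay, the following Python sketch computes I_R(τ) directly from the two-point distribution P(a,b) = (M^τ)_{ab} µ_b for an illustrative 3-state Markov matrix (the entries are placeholders) and compares it with the λ_2^{2τ} scaling.

    import numpy as np

    # Illustrative column-stochastic Markov matrix: M[a, b] = P(X_{t+1} = a | X_t = b).
    M = np.array([[0.80, 0.10, 0.05],
                  [0.15, 0.85, 0.10],
                  [0.05, 0.05, 0.85]])

    # Stationary distribution mu (eigenvector of M with eigenvalue 1).
    vals, vecs = np.linalg.eig(M)
    mu = np.real(vecs[:, np.argmax(np.real(vals))])
    mu /= mu.sum()
    lam2 = sorted(np.abs(vals))[-2]          # second largest |eigenvalue|

    def rational_mi(tau):
        # I_R(t1, t2) = sum_ab P(a,b)^2 / (P(a) P(b)) - 1 for separation tau.
        P = np.linalg.matrix_power(M, tau) * mu        # P(a,b) = (M^tau)_{ab} mu_b
        return np.sum(P**2 / np.outer(mu, mu)) - 1.0

    for tau in (1, 2, 4, 8, 16, 32):
        print(tau, rational_mi(tau), lam2**(2 * tau))

The two printed columns should differ only by an asymptotically constant factor, as in (B17).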
Appendix C: Power laws for generative grammars

  P(a,b) = µ_a µ_b + λ_2^∆ Σ_r µ_r A_ar A_br.                                          (C4)

From the definition of rational mutual information, and employing the fact that Σ_i A_ij = 0, this gives

  I_R + 1 ≈ Σ_ab [ µ_a µ_b + λ_2^∆ Σ_r µ_r A_ar A_br ]² / (µ_a µ_b)
          = Σ_ab ( µ_a µ_b + λ_2^{2∆} N_ab² )                                          (C5)
          = 1 + λ_2^{2∆} ||N||²,

where N_ab ≡ (µ_a µ_b)^{−1/2} Σ_r µ_r A_ar A_br is a symmetric matrix and ||·|| denotes the Frobenius norm. Hence

  I_R = λ_2^{2∆} ||N||².                                                               (C6)

Let us now generalize to the strongly correlated case. As discussed in the text, the joint probability is modified to

  P(a,b) = Σ_rs Q_rs (G^{∆/2−1})_{ar} (G^{∆/2−1})_{bs},                                (C7)

where Q is some symmetric matrix which satisfies Σ_r Q_rs = µ_s. We now employ our favorite trick of diagonalizing G and then writing

  P(a,b) = Σ_rs Q_rs (µ_a + A_ar)(µ_b + A_bs)
         = µ_a µ_b + Σ_rs Q_rs ( µ_a A_bs + µ_b A_ar + A_ar A_bs )
         = µ_a µ_b + Σ_s µ_a A_bs µ_s + Σ_r µ_b A_ar µ_r + Σ_rs Q_rs A_ar A_bs
         = µ_a µ_b + Σ_rs Q_rs A_ar A_bs.                                              (C9)

  I_R = λ_2^{2∆−4} ||N||².                                                             (C11)

In either the strongly or the weakly correlated case, note that N is asymptotically constant. We can write the second largest eigenvalue as |λ_2|² = q^{−k_2/2}, where q is the branching factor,

  I_R ∝ q^{−∆ k_2/2} ∝∼ q^{−k_2 log_q |i−j|} = C |i − j|^{−k_2}.                       (C12)

Behold the glorious power law! We note that the normalization C must be a function of the form C = m_2 f(λ_2, q), where m_2 is the multiplicity of the eigenvalue λ_2. We evaluate this normalization in the next section.

As before, this result can be sharpened if we assume that G satisfies detailed balance, G_mn = e^{K_m/2} S_mn e^{−K_n/2}, where S is a symmetric matrix and the K_n are just numbers. Let us only consider the weakly correlated case. By the spectral theorem, we diagonalize S into an orthonormal basis of eigenvectors v. As before, G and S share the same eigenvalues. Proceeding,

  P(a,b) = (1/Z) Σ_v λ_v^∆ v_a v_b e^{(K_a + K_b)/2},                                  (C13)

where Z is a constant that ensures that P is properly normalized. Let us move full steam ahead to compute the rational mutual information:

  Σ_ab P(a,b)² / [P(a)P(b)] = Σ_ab ( Σ_v λ_v^∆ v_a v_b )².                             (C14)

This is just the Frobenius norm of the symmetric matrix H ≡ Σ_v λ_v^∆ v_a v_b! The eigenvalues of this matrix can be read off, so we have

  I_R = Σ_{i≥2} |λ_i|^{2∆}.                                                            (C15)
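To make the predicted scaling concrete, the following Python sketch simulates the weakly correlated binary branching model with q = 2 and p(0|0) = p(1|1) = 0.9 (children drawn independently given the parent), computes the predicted exponent k_2 from |λ_2|² = q^{−k_2/2}, and compares the empirical rational mutual information — estimated via the binary formula of Eq. (D4) below — with the d^{−k_2} scaling; the normalization constant C of (C18) is omitted, so only the slope is expected to match.

    import numpy as np

    rng = np.random.default_rng(1)

    G = np.array([[0.9, 0.1],
                  [0.1, 0.9]])                    # G[b, a] = P(child = b | parent = a)
    lam2 = np.sort(np.linalg.eigvals(G).real)[0]  # second eigenvalue (= 0.8 here)
    k2 = -4 * np.log(abs(lam2)) / np.log(2)       # predicted exponent: I_R ~ d**(-k2)

    def double(seq):
        # Replace every symbol by q = 2 children drawn from P(child | parent).
        p1 = G[1, seq]                                       # P(child = 1 | parent)
        kids = (rng.random((seq.size, 2)) < p1[:, None]).astype(np.int64)
        return kids.reshape(-1)

    seq = np.array([0])
    for _ in range(20):                                      # final length 2**20 symbols
        seq = double(seq)

    def rational_mi(seq, d):
        # Binary-sequence estimate of I_R via Eq. (D4): rho^2 / (P(0) P(1)).
        x, y = seq[:-d], seq[d:]
        rho = np.mean(x * y) - np.mean(seq) ** 2             # P(1,1) - P(1)^2
        p1 = np.mean(seq)
        return rho ** 2 / (p1 * (1 - p1))

    for d in (1, 2, 4, 8, 16, 32, 64, 128, 256):
        print(d, rational_mi(seq, d), d ** (-k2))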
…of the normalization constant C in the equation as a way to regulate the probability distribution p(δ) so that it is normalizable; at the end of the calculation we formally take δ_max → ∞ without obstruction. Second, if we are interested in empirically sampling the mutual information, we cannot generate an infinite string, so setting δ_max to a finite value accounts for the fact that our generated string may be finite.

We now assume d ≫ 1 so that we can swap discrete sums with integrals. We can then compute the conditional expectation value of 2^{−k_2 δ}. This yields

  I_R ≈ ∫_0^∞ 2^{−k_2 δ} P(d|δ) dδ = (1 − 2^{−k_2}) d^{−k_2} / [k_2 (k_2 + 1) log 2],   (C17)

or equivalently,

  C_{q=2} = (1 − |λ_2|^4) / [k_2 (k_2 + 1) log 2].                                      (C18)

It turns out it is also possible to compute the answer without making any approximations with integrals:

  I_R = 2^{−(k_2+1)⌈log_2 d⌉} [ (2^{k_2+1} − 1) 2^{⌈log_2 d⌉} − 2d (2^{k_2} − 1) ] / (2^{k_2+1} − 1).   (C19)

The resulting predictions are compared in Figure 5.

[Figure 5: log–log plot of I_R^{1/2}(X,Y) versus separation for the simulated binary sequence, compared with the analytic prediction.]

FIG. 5: Decay of rational mutual information with separation for a binary sequence from a numerical simulation with probabilities p(0|0) = p(1|1) = 0.9 and a branching factor q = 2. The blue curve is not a fit to the simulated data but rather an analytic calculation. The smooth power law displayed on the left is what is predicted by our “continuum” approximation. The very small discrepancies (right) are not random but are fully accounted for by more involved exact calculations with discrete sums.

Appendix D: Estimating (rational) mutual information from empirical data

Estimating mutual information or rational mutual information from empirical data is fraught with subtleties. It is well known that the naive estimate of the Shannon entropy, Ŝ = −Σ_{i=1}^K (N_i/N) log(N_i/N), is biased, generally underestimating the true entropy from finite samples. We use the estimator advocated by Grassberger [53]:

  Ŝ = log N − (1/N) Σ_{i=1}^K N_i ψ(N_i),                                              (D1)

where ψ(x) is the digamma function, N = Σ_i N_i, and K is the number of characters in the alphabet. The mutual information can then be estimated by Î(X,Y) = Ŝ(X) + Ŝ(Y) − Ŝ(X,Y). The variance of this estimator is then the sum of the variances

  var(Î) = varEnt(X) + varEnt(Y) + varEnt(X,Y),                                        (D2)

where the varEntropy is defined as

  varEnt(X) = var(−log p(X)),                                                          (D3)

where we can again replace logarithms with the digamma function ψ. The uncertainty after N measurements is then ≈ √(var(Î)/N).

To compare our theoretical results with experiment in Fig. 5, we must measure the rational mutual information
for a binary sequence from (simulated) data. For a binary sequence with covariance coefficient ρ(X,Y) = P(1,1) − P(1)², the rational mutual information is

  I_R(X,Y) = ρ(X,Y)² / [P(0) P(1)].                                                    (D4)

…unbiased estimator for a data sequence {x_1, x_2, ···, x_n}:

  ρ̂(d) = [1/(n − d − 1)] Σ_{i=1}^{n−d} (x_i − x̄)(x_{i+d} − x̄).                         (D5)
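A minimal Python sketch of these estimators is given below, assuming SciPy’s digamma function; it implements the Grassberger entropy estimate (D1) in nats (divide by log 2 for bits), the corresponding mutual information estimate, and the unbiased autocovariance (D5).

    import numpy as np
    from collections import Counter
    from scipy.special import digamma

    def grassberger_entropy(samples):
        # Grassberger estimate of the Shannon entropy (in nats), Eq. (D1).
        counts = np.array(list(Counter(samples).values()), dtype=float)
        N = counts.sum()
        return np.log(N) - np.sum(counts * digamma(counts)) / N

    def mi_grassberger(seq, d):
        # I(X,Y) in nats between symbols at separation d, using Eq. (D1) for each term.
        x, y = seq[:-d], seq[d:]
        return (grassberger_entropy(x) + grassberger_entropy(y)
                - grassberger_entropy(list(zip(x, y))))

    def rho_hat(x, d):
        # Unbiased autocovariance estimator of Eq. (D5) for a numeric sequence x.
        x = np.asarray(x, dtype=float)
        n, xbar = len(x), x.mean()
        return np.sum((x[:n - d] - xbar) * (x[d:] - xbar)) / (n - d - 1)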
[1] P. Bak, Physical Review Letters 59, 381 (1987).
[2] P. Bak, C. Tang, and K. Wiesenfeld, Physical Review A 38, 364 (1988).
[3] K. Linkenkaer-Hansen, V. V. Nikouline, J. M. Palva, and R. J. Ilmoniemi, The Journal of Neuroscience 21, 1370 (2001), URL http://www.jneurosci.org/content/21/4/1370.abstract.
[4] D. J. Levitin, P. Chordia, and V. Menon, Proceedings of the National Academy of Sciences 109, 3716 (2012).
[5] M. Tegmark, ArXiv e-prints (2014), 1401.1219.
[6] G. K. Zipf, Human Behavior and the Principle of Least Effort (Addison-Wesley Press, 1949).
[7] H. W. Lin and A. Loeb, Physical Review E 93, 032306 (2016).
[8] L. Pietronero, E. Tosatti, V. Tosatti, and A. Vespignani, Physica A: Statistical Mechanics and its Applications 293, 297 (2001), ISSN 0378-4371, URL http://www.sciencedirect.com/science/article/pii/S0378437100006336.
[9] M. Kardar, Statistical Physics of Fields (Cambridge University Press, 2007).
[10] URL ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/.
[11] URL http://www.jsbach.net/midi/midi_solo_violin.html.
[12] URL http://prize.hutter1.net/.
[13] URL http://www.lexique.org/public/lisezmoi.corpatext.htm.
[14] A. M. Turing, Mind 59, 433 (1950).
[15] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al., AI Magazine 31, 59 (2010).
[16] M. Campbell, A. J. Hoane, and F.-h. Hsu, Artificial Intelligence 134, 57 (2002).
[17] V. Mnih, Nature 518, 529 (2015).
[18] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Nature 529, 484 (2016), URL http://dx.doi.org/10.1038/nature16961.
[19] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush (2015), 1508.06615, URL https://arxiv.org/abs/1508.06615.
[20] A. Graves, ArXiv e-prints (2013), 1308.0850.
[21] A. Graves, A.-r. Mohamed, and G. Hinton, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE, 2013), pp. 6645–6649.
[22] R. Collobert and J. Weston, in Proceedings of the 25th International Conference on Machine Learning (ACM, 2008), pp. 160–167.
[23] J. Schmidhuber, Neural Networks 61, 85 (2015).
[24] Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 (2015).
[25] R. Meir and E. Domany, Phys. Rev. Lett. 59, 359 (1987), URL http://link.aps.org/doi/10.1103/PhysRevLett.59.359.
[26] K. Hornik, M. Stinchcombe, and H. White, Neural Networks 2, 359 (1989).
[27] A. M. Saxe, J. L. McClelland, and S. Ganguli, arXiv preprint arXiv:1312.6120 (2013).
[28] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, in Advances in Neural Information Processing Systems (2014), pp. 2924–2932.
[29] H. Mhaskar, Q. Liao, and T. Poggio, ArXiv e-prints (2016), 1603.00988.
[30] M. Bianchini and F. Scarselli, IEEE Transactions on Neural Networks and Learning Systems 25, 1553 (2014).
[31] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, ArXiv e-prints (2016), 1606.05340.
[32] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein, ArXiv e-prints (2016), 1606.05336.
[33] P. Mehta and D. J. Schwab, ArXiv e-prints (2014), 1410.3831.
[34] S. Hochreiter and J. Schmidhuber, Neural Computation 9, 1735 (1997).
[35] C. E. Shannon, ACM SIGMOBILE Mobile Computing and Communications Review 5, 3 (1948).
[36] S. Kullback and R. A. Leibler, Ann. Math. Statist. 22, 79 (1951), URL http://dx.doi.org/10.1214/aoms/1177729694.
[37] T. M. Cover and J. A. Thomas, Elements of Information Theory (John Wiley & Sons, 2012).
[38] L. R. Rabiner, Proceedings of the IEEE 77, 257 (1989).
[39] R. C. Carrasco and J. Oncina, in International Colloquium on Grammatical Inference (Springer, 1994), pp. 139–152.
[40] N. Chomsky, Aspects of the Theory of Syntax, vol. 11 (MIT Press, 1965).
[41] S. Ginsburg, The Mathematical Theory of Context-Free Languages (McGraw-Hill Book Company, 1966).
[42] T. L. Booth, in Switching and Automata Theory, 1969, IEEE Conference Record of 10th Annual Symposium on (IEEE, 1969), pp. 74–81.
[43] T. Huang and K. Fu, Information Sciences 3, 201 (1971), ISSN 0020-0255, URL http://www.sciencedirect.com/science/article/pii/S0020025571800075.
[44] K. Lari and S. J. Young, Computer Speech & Language 4, 35 (1990).
[45] D. Harlow, S. H. Shenker, D. Stanford, and L. Susskind, Physical Review D 85, 063516 (2012).
[46] L. Van Hove, Physica 16, 137 (1950).
[47] J. A. Cuesta and A. Sánchez, Journal of Statistical Physics 115, 869 (2004), cond-mat/0306354.
[48] G. Evenbly and G. Vidal, Journal of Statistical Physics 145, 891 (2011).
[49] N. Chomsky, Information and Control 2, 137 (1959).
[50] M. Mahoney, Large text compression benchmark.
[51] A. Karpathy, J. Johnson, and L. Fei-Fei, ArXiv e-prints (2015), 1506.02078.
[52] C. W. Gardiner et al., Handbook of Stochastic Methods, vol. 3 (Springer Berlin, 1985).
[53] P. Grassberger, ArXiv Physics e-prints (2003), physics/0307138.
[54] W. Li, Journal of Statistical Physics 60, 823 (1990), ISSN 1572-9613, URL http://dx.doi.org/10.1007/BF01025996.