Abstract—An exponential family or mixture family of probability distributions has a natural hierarchical structure. This paper gives an "orthogonal" decomposition of such a system based on information geometry. A typical example is the decomposition of stochastic dependency among a number of random variables. In general, they have a complex structure of dependencies. Pairwise dependency is easily represented by correlation, but it is more difficult to measure effects of pure triplewise or higher order interactions (dependencies) among these variables. Stochastic dependency is decomposed quantitatively into an "orthogonal" sum of pairwise, triplewise, and further higher order dependencies. This gives a new invariant decomposition of joint entropy. This problem is important for extracting intrinsic interactions in firing patterns of an ensemble of neurons and for estimating its functional connections. The orthogonal decomposition is given in a wide class of hierarchical structures including both exponential and mixture families. As an example, we decompose the dependency in a higher order Markov chain into a sum of those in various lower order Markov chains.

Index Terms—Decomposition of entropy, e- and m-projections, extended Pythagoras theorem, higher order interactions, higher order Markov chain, information geometry, Kullback divergence.

I. INTRODUCTION

… degrees or amounts of interactions into pairwise, triplewise, and higher order interactions. To this end, we study a family of joint probability distributions of n variables which have at most k-way interactions but no higher interactions. Two dual types of projections, namely, the e-projection and m-projection, to such subspaces play a fundamental role. The present paper studies such a hierarchical structure and the related invariant "quasi-orthogonal" quantitative decomposition by using information geometry [3], [8], [12], [14], [28], [30]. Information geometry studies the intrinsic geometrical structure to be introduced in the manifold of a family of probability distributions. Its Riemannian structure was introduced by Rao [37]. Csiszár [21], [22], [23] studied the geometry of f-divergence in detail and applied it to information theory. It was Chentsov [19] who developed Rao's idea further and introduced new invariant affine connections in the manifolds of probability distributions. Nagaoka and Amari [31] developed a theory of dual structures and unified all of these theories in the dual differential-geometrical framework (see also [3], [14], [31]). Information geometry has been used so far not only for mathematical foundations of statistical inferences ([3], [12], [28] and many others) but also applied to information theory [5], [11], [25], [18], neural networks …
… touch upon the cases of multivalued random variables and continuous variables. We finally explain the hierarchical structures of higher order Markov chains.

II. PRELIMINARIES FROM INFORMATION GEOMETRY

A. Manifold, Curve, and Orthogonality

Let us consider a parameterized family of probability distributions S = {p(x, θ)}, where x is a random variable and θ = (θ_1, …, θ_n) is a real vector parameter to specify a distribution. The family is regarded as an n-dimensional manifold having θ as a coordinate system. When the Fisher information matrix

  g_ij(θ) = E[∂_i log p(x, θ) ∂_j log p(x, θ)],  ∂_i = ∂/∂θ_i   (1)

where E denotes expectation with respect to p(x, θ), is nondegenerate, S is a Riemannian manifold, and g_ij plays the role of a Riemannian metric tensor.

The squared distance ds² between two nearby distributions p(x, θ) and p(x, θ + dθ) is given by the quadratic form of dθ,

  ds² = Σ_ij g_ij(θ) dθ_i dθ_j.   (2)

It is known that this is twice the Kullback–Leibler divergence,

  ds² = 2 D[p(x, θ) : p(x, θ + dθ)]   (3)

where

  D[p : q] = E_p[log (p(x)/q(x))].   (4)

Let us consider a curve θ(t) parameterized by t in S, that is, a one-parameter family of distributions p(x, θ(t)) in S. It is convenient to represent the tangent vector dθ(t)/dt of the curve at t by the random variable called the score

  s(x, t) = (d/dt) log p(x, θ(t))   (5)

which shows how the log probability changes as t increases. Given two curves θ_1(t) and θ_2(t) intersecting at t = 0, the inner product of the two tangent vectors is given by

  ⟨s_1, s_2⟩ = E[s_1(x, 0) s_2(x, 0)].   (6)

The two curves intersect at t = 0 orthogonally when

  ⟨s_1, s_2⟩ = 0.   (7)

B. Dually Flat Manifolds

A manifold is said to be e-flat (exponential-flat) when there exists a coordinate system (parameterization) θ such that, for all θ,

  ∂² log p(x, θ) / ∂θ_i ∂θ_j = −g_ij(θ)   (8)

identically. Such θ is called e-affine coordinates. When a curve is given by a linear function θ(t) = ta + b in the θ-coordinates, where a and b are constant vectors, it is called an e-geodesic. Any coordinate curve itself is an e-geodesic. (It is possible to define an e-geodesic in any manifold, but it is no more linear and we need the concept of the affine connection.)

A typical example of an e-flat manifold is the well-known exponential family written as

  p(x, θ) = exp{ Σ_i θ_i k_i(x) − ψ(θ) }   (9)

where k_i(x) are given functions and ψ(θ) is the normalizing factor. The e-affine coordinates are the canonical parameters θ, and (8) holds because

  ∂² log p(x, θ) / ∂θ_i ∂θ_j = −∂²ψ(θ) / ∂θ_i ∂θ_j   (10)

does not depend on x.

Dually to the above, a manifold is said to be m-flat (mixture-flat) when there exists a coordinate system η such that

  ∂² p(x, η) / ∂η_i ∂η_j = 0   (11)

identically. Here, η is called m-affine coordinates. A curve is called an m-geodesic when it is represented by a linear function in the m-affine coordinates. Any coordinate curve of η is an m-geodesic.

A typical example of an m-flat manifold is the mixture family

  p(x, η) = Σ_i η_i q_i(x) + (1 − Σ_i η_i) q_0(x)   (12)

where q_i(x) are given probability distributions and η_i > 0, Σ_i η_i < 1. The following theorem is known in information geometry.

Theorem 1: A manifold is e-flat when and only when it is m-flat, and vice versa.

This shows that an exponential family is automatically m-flat although it is not necessarily a mixture family. A mixture family is e-flat, although it is not in general an exponential family. The m-affine coordinates (η-coordinates) of an exponential family are given by

  η_i = E[k_i(x)] = ∂ψ(θ) / ∂θ_i   (13)

which is known as the expectation parameters. The coordinate transformation between θ and η is given by the Legendre transformation, and the inverse transformation is

  θ_i = ∂φ(η) / ∂η_i   (15)

where

  φ(η) = Σ_i θ_i η_i − ψ(θ)   (16)

is the dual potential function. The divergence (4) between two distributions of the family is then written in terms of the two potentials as

  D[p(x, θ) : p(x, θ′)] = ψ(θ′) + φ(η) − Σ_i θ′_i η_i   (17)

where η is the expectation parameter of p(x, θ). When the m-geodesic connecting p and q is orthogonal at q to the e-geodesic connecting q and r, the extended Pythagoras theorem

  D[p : r] = D[p : q] + D[q : r]   (20)

holds.

Fig. 2. Information projection (m-projection).
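To make the relation between the metric (2) and the divergence (3) concrete, here is a small numerical check (an illustration added to this text, not part of the original paper) using the Bernoulli family p(x, θ) = θ^x (1 − θ)^{1−x}, whose Fisher information is g(θ) = 1/(θ(1 − θ)):

```python
import math

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence D[Bern(p) : Bern(q)], as in (4)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def fisher_bernoulli(theta):
    """Fisher information g(theta) = E[(d/dtheta log p)^2] for Bernoulli."""
    return 1.0 / (theta * (1.0 - theta))

theta, dtheta = 0.3, 1e-3
ds2 = fisher_bernoulli(theta) * dtheta ** 2            # quadratic form (2)
twice_kl = 2.0 * kl_bernoulli(theta, theta + dtheta)   # right side of (3)
print(ds2, twice_kl)
```

For dθ = 10⁻³ the two quantities agree up to the O(dθ³) remainder, which is the content of (3).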
  p(x, θ) = exp{ Σ_k θ_k · x_k − ψ(θ) }   (28)

where θ = (θ_1, θ_2, …, θ_m) is a partition of the entire parameter θ, θ_k being the kth subvector, and x_k is a random vector variable corresponding to it.

The expectation parameter η of (28) is partitioned correspondingly as

  η = (η_1, η_2, …, η_m)   (29)

where E is expectation with respect to p(x, θ). Thus, η_k = E[x_k]. Here, η_k is a function of the entire θ, and not of θ_k only.

B. Orthogonal Foliations

Let us consider a new coordinate system called the k-cut mixed ones

  ξ_k = (η_{k−}; θ_{k+}) = (η_1, …, η_k; θ_{k+1}, …, θ_m).   (30)

It consists of a pair of complementary parts, namely, η_{k−} and θ_{k+}. They define e-flat and m-flat foliations. Because of (21), the two foliations are orthogonal in the sense that the corresponding submanifolds are complementary and orthogonal at any point.

C. Projections and Pythagoras Theorem

Given a distribution p(x) with partition θ = (θ_1, …, θ_m), it belongs to E_k when θ_{k+1} = ⋯ = θ_m = 0. In general, θ_k is considered to represent the effect emerging from the kth subvector but not from the lower ones. We call it the "kth effect" of p. In later examples, it represents the kth-order effect of mutual interactions of random variables. The submanifold E_k includes only the probability distributions that do not have effects higher than k. Consider the problem of evaluating the effects higher than k. To this end, we define the information projection p̄_k of p to E_k by

  p̄_k = argmin_{q ∈ E_k} D[p : q].   (37)

The coordinates of p̄_k are not simply (θ_1, …, θ_k; 0, …, 0), since the manifold is not Euclidean but is Riemannian. In order to obtain p̄_k, the k-cut mixed coordinates are convenient.

Theorem 6: Let the θ- and η-coordinates of p be θ and η. Let the θ- and η-coordinates of the projection p̄_k be θ̄ and η̄, respectively. Then, the k-cut mixed coordinates of p̄_k is

  ξ̄_k = (η_1, …, η_k; 0, …, 0)   (41)

that is, η̄_{k−} = η_{k−} and θ̄_{k+} = 0.
Proof: The point given by (41) is included in E_k. Since the mixed coordinates of p and p̄_k differ only in θ_{k+}, the m-geodesic connecting p and p̄_k is included in the m-flat submanifold on which η_{k−} is fixed. Since this submanifold is orthogonal to E_k, this is the m-projection of p to E_k.

In order to obtain the full θ- or η-coordinates of p̄_k, θ̄ and η̄, we need to solve the set of equations

  (42)

  (43)

  (44)

  (45)

Theorem 7: The projection p̄_k of p to E_k is the maximizer of entropy among the distributions having the same k-marginals as p:

  (48)

This relation is useful for calculating p̄_k.

E. Orthogonal Decomposition

The next problem is to single out the amount of the kth-order effects in p, by separating it from the others. This is not given by θ_k itself. The amount of the effect of order k is given by the divergence

  D[p̄_k : p̄_{k−1}]   (49)

which is certified by the following orthogonal decomposition.

Theorem 8:

  D[p : p̄_1] = Σ_{k=2}^{m} D[p̄_k : p̄_{k−1}],  p̄_m = p.   (50)

Theorem 8 shows that the kth-order effects are "orthogonally" decomposed in (50). The theorem might be thought of as a trivial result from the Pythagoras theorem. This is so when the manifold is a Euclidean space. However, this is not trivial in the case of a dually flat manifold. To show this, let us consider the "theorem of three squares" in a Euclidean space: Let P_1, P_2, P_3, P_4 be four corners in a rectangular parallelepiped. Then

  |P_1 P_4|² = |P_1 P_2|² + |P_2 P_3|² + |P_3 P_4|².   (51)
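The content of Theorems 6 and 8 can be checked numerically in the smallest nontrivial case of two binary variables, where E_1 is the set of independent distributions and the m-projection p̄_1 is the product of the marginals. The following sketch (an added illustration with hypothetical variable names, not taken from the paper) verifies that the projection keeps the η-part of the 1-cut mixed coordinates while setting θ_12 = 0, and that the Pythagoras relation D[p : q] = D[p : p̄_1] + D[p̄_1 : q] holds exactly for any independent q:

```python
import math
import itertools

states = list(itertools.product((0, 1), repeat=2))

def kl(a, b):
    """Kullback-Leibler divergence D[a : b] over the four joint states."""
    return sum(a[x] * math.log(a[x] / b[x]) for x in states)

def mixed_coords(p):
    """1-cut mixed coordinates (eta1, eta2; theta12) of a 2x2 distribution."""
    eta1 = p[(1, 0)] + p[(1, 1)]          # E[x1]
    eta2 = p[(0, 1)] + p[(1, 1)]          # E[x2]
    theta12 = math.log(p[(1, 1)] * p[(0, 0)] / (p[(1, 0)] * p[(0, 1)]))
    return eta1, eta2, theta12

p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# m-projection of p to E_1 (independent distributions): product of marginals.
e1, e2, _ = mixed_coords(p)
p_bar = {(a, b): (e1 if a else 1 - e1) * (e2 if b else 1 - e2) for a, b in states}

# Theorem 6: the eta-part is preserved while theta12 becomes 0.
print(mixed_coords(p_bar))

# Pythagoras relation behind Theorem 8, for an arbitrary independent q.
q = {(a, b): (0.7 if a else 0.3) * (0.4 if b else 0.6) for a, b in states}
print(kl(p, q), kl(p, p_bar) + kl(p_bar, q))
```

The identity is exact here because log(p̄_1/q) is a sum of single-variable functions and p shares its marginals with p̄_1.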
where H and H_i are the total and marginal entropies of x. Hence,

  (76)

One can define mutual information among x_1, x_2, and x_3 by

  (77)

Then, (73) gives a new invariant positive decomposition of

  (78)

The manifold S_n is an exponential family and is also a mixture family at the same time. Let us expand log p(x, θ) to obtain the log-linear model [2], [17]

  log p(x, θ) = Σ_i θ_i x_i + Σ_{i<j} θ_ij x_i x_j + Σ_{i<j<k} θ_ijk x_i x_j x_k + ⋯ + θ_{12⋯n} x_1 x_2 ⋯ x_n − ψ   (85)

where indexes of θ_ij, etc., satisfy i < j, etc. Then

  θ = (θ_i; θ_ij; …; θ_{12⋯n})   (86)

has 2^n − 1 components. They define the e-flat coordinate system of S_n.

In order to simplify index notation, we introduce the following two hyper-index notations. Indexes I, J, etc., run over n-tuples of binary numbers

  (82)
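For binary variables the θ-coordinates of the log-linear model (85) can be computed from p by Möbius inversion over the subset lattice, θ_A = Σ_{B⊆A} (−1)^{|A|−|B|} log p(1_B), where 1_B is the 0–1 pattern supported on B. The following sketch (an added illustration; the distribution and helper names are arbitrary) recovers the 2³ − 1 interaction coordinates for n = 3 and checks that they reproduce log p:

```python
import math
import itertools

n = 3
states = list(itertools.product((0, 1), repeat=n))

# An arbitrary strictly positive joint distribution of three binary variables.
raw = {x: 1.0 + 0.5 * x[0] + 0.3 * x[1] * x[2] + 0.2 * x[0] * x[1] * x[2]
       for x in states}
Z = sum(raw.values())
p = {x: v / Z for x, v in raw.items()}

def subsets(x):
    """All binary patterns y with y_i <= x_i (sub-patterns of x)."""
    idx = [i for i in range(n) if x[i]]
    for r in range(len(idx) + 1):
        for comb in itertools.combinations(idx, r):
            y = [0] * n
            for i in comb:
                y[i] = 1
            yield tuple(y)

# Moebius inversion: theta_A = sum_{B subset A} (-1)^{|A|-|B|} log p(1_B).
theta = {x: sum((-1) ** (sum(x) - sum(y)) * math.log(p[y]) for y in subsets(x))
         for x in states}

# theta[(0,0,0)] = log p(0) = -psi; the remaining 2^n - 1 components are the
# e-coordinates (theta_i, theta_ij, theta_123) of the log-linear model (85).
def log_p_from_theta(x):
    return sum(theta[y] for y in subsets(x))

print([abs(log_p_from_theta(x) - math.log(p[x])) for x in states])
```

The inversion and the expansion are mutually inverse on the Boolean lattice, so the reconstruction errors are at floating-point level.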
1708 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 47, NO. 5, JULY 2001
  (95)

when … . The matrix is a nonsingular matrix. We have

  (96)

The relations among the coordinate systems are given by the following theorem.

Theorem 9:

  (101)

  (102)

  (103)

  (104)

  (105)

Proof: Given a probability distribution p(x), we have

  (106)

where …

  (109)

B. Hierarchical Structure of S_n

Let us naturally decompose the coordinates into …

Given p, p̄_k is the point closest to p among those that do not have intrinsic interactions of more than k variables. The amount of interactions higher than k is defined by D[p : p̄_k]. Moreover, D[p̄_k : p̄_{k−1}] may be interpreted as the degree of purely order-k interactions among the n variables. We then have the following decomposition. From the orthogonal decomposition, we have a new invariant decomposition of entropy and information, which is different from those studied by many researchers as the decomposition of entropy of a number of random variables.

Theorem 10:

  (114)

  (115)

  (116)
finite alphabet set X. The argument is extended easily to the case where the x_i take on different alphabet sets X_i. Let us define the indicator functions for x_i by

  (117)

  (118)

We now expand log p(x) as

  (119)

where the indexes run over the nonzero elements of the alphabet sets. The coefficients θ's form the e-coordinate system. Here, the terms representing interactions of k variables have a correspondingly larger number of components. The corresponding coordinates consist of

  (120)

which represent the marginal distributions of k variables. We can then define the k-cut, in which θ_{k+} and η_{k−} are orthogonal. Given p, we can define …

The corresponding θ-coordinates are given by

  (122)

  (123)

  (124)

When θ_123 = 0 holds identically, there are no intrinsic interactions among the three, so that all interactions are pairwise in this case. When θ_12 = θ_13 = θ_23 = 0 and θ_123 = 0, there are no interactions so that the three random variables are independent. Given p(x_1, x_2, x_3), the independent distribution p_1 = p(x_1) p(x_2) p(x_3) is easily obtained by

  (125)

However, it is difficult to obtain the analytical expression of p̄_2. In general, E_2, which consists of distributions having pairwise correlations but no intrinsic triplewise interactions, is written in the form

  (126)

The p̄_2 is the one that satisfies

  (127)

  (128)
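Although p̄_2 has no closed form, it can be computed numerically. Iterative proportional fitting (a standard algorithm for this m-projection, not spelled out in the excerpt above) rescales a trial distribution toward each pairwise marginal of p in turn; its fixed point lies in E_2 (the triplewise e-coordinate vanishes) while keeping the pairwise marginals of p. The distribution below is an arbitrary illustrative choice:

```python
import math
import itertools

states = list(itertools.product((0, 1), repeat=3))

def kl(a, b):
    """Kullback-Leibler divergence D[a : b] over the eight joint states."""
    return sum(a[x] * math.log(a[x] / b[x]) for x in states)

def pair_marginal(dist, i, j):
    out = {}
    for x in states:
        out[(x[i], x[j])] = out.get((x[i], x[j]), 0.0) + dist[x]
    return out

# An arbitrary positive joint distribution with a genuine triplewise interaction.
raw = {x: math.exp(0.4 * x[0] + 0.2 * x[1] * x[2] + 0.8 * x[0] * x[1] * x[2])
       for x in states}
Z = sum(raw.values())
p = {x: v / Z for x, v in raw.items()}

# Iterative proportional fitting toward the three pairwise marginals of p.
q = {x: 1.0 / 8 for x in states}
for _ in range(300):
    for (i, j) in [(0, 1), (0, 2), (1, 2)]:
        target, current = pair_marginal(p, i, j), pair_marginal(q, i, j)
        q = {x: q[x] * target[(x[i], x[j])] / current[(x[i], x[j])]
             for x in states}
p2 = q  # numerical m-projection of p to E_2

# The triplewise e-coordinate theta_123 (alternating sum of log p2) vanishes
# at the fixed point.
theta_123 = sum((-1) ** (3 - sum(x)) * math.log(p2[x]) for x in states)

# Orthogonal decomposition for S_3: D[p : p1] = D[p : p2] + D[p2 : p1],
# with p1 the product of the three singleton marginals of p.
m = [sum(p[x] for x in states if x[i] == 1) for i in range(3)]
p1 = {x: math.prod(m[i] if x[i] else 1 - m[i] for i in range(3)) for x in states}
print(theta_123, kl(p, p1), kl(p, p2) + kl(p2, p1))
```

The final print exhibits both facts numerically: the fitted p̄_2 carries no triplewise interaction, and the divergence from full dependency splits orthogonally into a purely triplewise part and a pairwise part.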
[12] S. Amari and M. Kawanabe, "Information geometry of estimating functions in semiparametric statistical models," Bernoulli, vol. 3, pp. 29–54, 1997.
[13] S. Amari, K. Kurata, and H. Nagaoka, "Information geometry of Boltzmann machines," IEEE Trans. Neural Networks, vol. 3, pp. 260–271, Mar. 1992.
[14] S. Amari and H. Nagaoka, Methods of Information Geometry. New York: AMS and Oxford Univ. Press, 2000.
[15] O. E. Barndorff-Nielsen, Information and Exponential Families in Statistical Theory. New York: Wiley, 1978.
[16] C. Bhattacharyya and S. S. Keerthi, "Mean-field methods for stochastic connections networks," Phys. Rev., to be published.
[17] Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland, Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press, 1975.
[18] L. L. Campbell, "The relation between information theory and the differential geometric approach to statistics," Inform. Sci., vol. 35, pp. 199–210, 1985.
[19] N. N. Chentsov, Statistical Decision Rules and Optimal Inference (in Russian). Moscow, U.S.S.R.: Nauka, 1972. English translation: Providence, RI: AMS, 1982.
[20] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[21] I. Csiszár, "On topological properties of f-divergence," Studia Sci. Math. Hungar., vol. 2, pp. 329–339, 1967.
[22] ——, "I-divergence geometry of probability distributions and minimization problems," Ann. Probab., vol. 3, pp. 146–158, 1975.
[23] I. Csiszár, T. M. Cover, and B. S. Choi, "Conditional limit theorems under Markov conditioning," IEEE Trans. Inform. Theory, vol. IT-33, pp. 788–801, Nov. 1987.
[24] T. S. Han, "Nonnegative entropy measures of multivariate symmetric correlations," Inform. Contr., vol. 36, pp. 133–156, 1978.
[25] T. S. Han and S. Amari, "Statistical inference under multiterminal data compression," IEEE Trans. Inform. Theory, vol. 44, pp. 2300–2324, Oct. 1998.
[26] H. Ito and S. Tsuji, "Model dependence in quantification of spike interdependence by joint peri-stimulus time histogram," Neural Comput., vol. 12, pp. 195–217, 2000.
[27] E. T. Jaynes, "On the rationale of maximum entropy methods," Proc. IEEE, vol. 70, pp. 939–952, 1982.
[28] R. E. Kass and P. W. Vos, Geometrical Foundations of Asymptotic Inference. New York: Wiley, 1997.
[29] S. L. Lauritzen, Graphical Models. Oxford, U.K.: Oxford Science, 1996.
[30] M. K. Murray and J. W. Rice, Differential Geometry and Statistics. London, U.K.: Chapman & Hall, 1993.
[31] H. Nagaoka and S. Amari, "Differential geometry of smooth families of probability distributions," Univ. Tokyo, Tokyo, Japan, METR 82-7, 1982.
[32] A. Ohara, N. Suda, and S. Amari, "Dualistic differential geometry of positive definite matrices and its applications to related problems," Linear Alg. Its Applic., vol. 247, pp. 31–53, 1996.
[33] A. Ohara, "Information geometric analysis of an interior point method for semidefinite programming," in Geometry in Present Day Science, O. E. Barndorff-Nielsen and E. B. V. Jensen, Eds. Singapore: World Scientific, 1999, pp. 49–74.
[34] G. Palm, A. M. H. J. Aertsen, and G. L. Gerstein, "On the significance of correlations among neuronal spike trains," Biol. Cybern., vol. 59, pp. 1–11, 1988.
[35] G. Pistone and C. Sempi, "An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one," Ann. Statist., vol. 23, pp. 1543–1561, 1995.
[36] G. Pistone and M.-P. Rogantin, "The exponential statistical manifold: Mean parameters, orthogonality, and space transformation," Bernoulli, vol. 5, pp. 721–760, 1999.
[37] C. R. Rao, "Information and accuracy attainable in the estimation of statistical parameters," Bull. Calcutta Math. Soc., vol. 37, pp. 81–91, 1945.
[38] T. Tanaka, "Information geometry of mean field approximation," Neural Comput., vol. 12, pp. 1951–1968, 2000.