
Entropy statistics and information theory

Koen Frenken, 8 July 2003

To appear as: Koen Frenken, 2004, Entropy and information theory, in Horst Hanusch

and Andreas Pyka (eds.), The Elgar Companion to Neo-Schumpeterian Economics

(Cheltenham: Edward Elgar)

Entropy measures provide important tools to indicate variety in distributions at particular moments in time (e.g., market shares) and to analyse evolutionary processes over time (e.g., technical change). Importantly, entropy statistics lend themselves to decomposition analysis, which renders the measure preferable to alternatives such as the Herfindahl index whenever a decomposition is required. There are several applications of entropy in the realms of industrial organisation and innovation studies. The chapter contains two sections, one on statistics and one on applications. In the first section, we discuss, in this order:

1. an introduction to the entropy concept and information theory

2. the entropy decomposition theorem

3. prior and posterior probabilities

4. multidimensional extensions

In the second section, we discuss a number of applications of entropy statistics

including:

1. industrial concentration

2. corporate diversification

3. regional industrial diversification

4. income inequality

5. organisation theory

1. Entropy statistics

The entropy concept goes back to Ludwig Boltzmann (1877) and was given a probabilistic interpretation in information theory by Claude Shannon (1948). In the 1960s, Henri Theil developed several applications of information theory in economics, collected in Economics and Information Theory (1967) and Statistical Decomposition Analysis (1972).

The entropy formula

The entropy formula expresses the expected information content or uncertainty of a probability distribution. Let Ei stand for an event (e.g., the adoption of technology i) and pi for the probability that event Ei occurs. Let there be n events E1, …, En with probabilities p1, …, pn adding up to 1. Since the occurrence of events with smaller probability yields more information (as these are least expected), a measure of information h should be a decreasing function of pi. Shannon (1948) proposed a logarithmic function to express information h(pi):



1 
h ( pi ) = log 2 
p 
 (1)
 i

which decreases from infinity to 0 for pi ranging from 0 to 1. The function

reflects the idea that the lower the probability that an event occurs, the higher the amount of information conveyed by a message stating that the event occurred. Information is here expressed in bits, using 2 as the base of the logarithm, while others express information in

‘nits’ using the natural logarithm.

From the n information values h(pi), the expected information content of a probability distribution, called entropy, is derived by weighting the information values h(pi) by their respective probabilities:

$$H = \sum_{i=1}^{n} p_i \log_2\!\left(\frac{1}{p_i}\right) \qquad (2)$$

where H stands for entropy in bits.

It is customary to define (Theil 1972: 5):

 1 
pi log 2   = 0 if pi = 0 (3)
 pi 

which is in accordance with the limit value of the left-hand term as pi approaches zero (Theil 1972: 5).

The entropy value H is non-negative. The minimum possible entropy value is

zero corresponding to the case in which one event has unit probability:

1 
H min =1 ⋅ log 2   = 0 (4)
1 

When all states are equally probable ($p_i = 1/n$), the entropy value is maximum:

$$H_{\max} = \sum_{i=1}^{n} \frac{1}{n} \log_2(n) = n \cdot \frac{1}{n} \log_2(n) = \log_2(n) \qquad (5)$$

(proof is given by Theil 1972: 8-10). Maximum entropy thus increases with n, but

decreasingly so.1
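
To make the definitions concrete, here is a minimal Python sketch (not part of the original chapter; the function name and the example distributions are mine) that computes entropy in bits following (2), applies convention (3) for zero probabilities, and checks the bounds (4) and (5).

```python
import math

def entropy_bits(p):
    """Entropy H of a probability distribution p, in bits, following (2).
    Terms with p_i = 0 contribute zero, as in convention (3)."""
    assert abs(sum(p) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(p_i * math.log2(1.0 / p_i) for p_i in p if p_i > 0)

# Minimum entropy: one event has unit probability, cf. (4)
print(entropy_bits([1.0, 0.0, 0.0]))               # 0.0

# Maximum entropy: n equally probable events, cf. (5)
n = 8
print(entropy_bits([1.0 / n] * n), math.log2(n))   # 3.0 3.0
```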

Entropy can be considered as a measure of uncertainty. The more uncertainty

prior to the message that an event occurred, the larger the amount of information

conveyed by the message on average. Theil (1972: 7) remarks that the entropy concept

in this regard is similar to the variance of a random variable whose values are real

numbers. The main difference is that entropy applies to qualitative rather than

quantitative values, and, as such, depends exclusively on the probabilities of possible

events.

When a message is received that transforms prior probabilities pi into posterior probabilities qi, the expected information content of the message is (Theil 1972: 59):

1. In physics, maximum entropy characterises distributions of randomly moving particles that all have an equal probability to be present in any state (like a perfect gas). When particles behave in a non-random way, for example, when they move towards already crowded regions, the resulting distribution is skewed and entropy is lower than its maximum value (Prigogine and Stengers 1984). In the biological context, maximum entropy refers to a population of genotypes where all possible genotypes have an equal frequency. Minimum entropy reflects the total dominance of one genotype in the population (which would result if selection were instantaneous; cf. Fisher 1930: 39-40).

$$I(q \mid p) = \sum_{i=1}^{n} q_i \log_2\!\left(\frac{q_i}{p_i}\right) \qquad (6)$$

which equals zero when posterior probabilities equal prior probabilities (no

information) and which is positive otherwise.
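
As a further illustration (my own, with hypothetical prior and posterior distributions), the expected information content (6) of a message can be computed as follows; it equals zero when the posterior coincides with the prior and is positive otherwise.

```python
import math

def expected_information(q, p):
    """Expected information I(q|p) of a message transforming prior p into
    posterior q, in bits, following (6). Terms with q_i = 0 contribute zero."""
    return sum(q_i * math.log2(q_i / p_i) for q_i, p_i in zip(q, p) if q_i > 0)

prior = [0.5, 0.3, 0.2]
print(expected_information(prior, prior))             # 0.0: q equals p, no information
print(expected_information([0.8, 0.1, 0.1], prior))   # positive otherwise
```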

The entropy decomposition theorem

One of the most powerful and attractive properties of entropy statistics is the way in

which problems of aggregation and disaggregation are handled (Theil 1972:

20-22; Zajdenweber 1972). This is due to the property of additivity of the

entropy formula.

Let Ei stand again for an event, and let there be n events E1 , …, En with

probabilities p1 ,…, pn . Assume that all events can be aggregated into a smaller number

of sets of events S1 , …, SG in such a way that each event exclusively falls under one set

Sg, where g=1,…,G. The probability that an event falling under Sg occurs is obtained by summation:

$$P_g = \sum_{i \in S_g} p_i \qquad (7)$$

The entropy at the level of sets of events is:



$$H_0 = \sum_{g=1}^{G} P_g \log_2\!\left(\frac{1}{P_g}\right) \qquad (8)$$

H0 is called the between-group entropy. The entropy decomposition theorem specifies

the relationship between the between-group entropy H0 at the level of sets and the

entropy H at the level of events as defined in (2). Write entropy H as:

$$
\begin{aligned}
H &= \sum_{i=1}^{n} p_i \log_2\!\left(\frac{1}{p_i}\right) = \sum_{g=1}^{G} \sum_{i \in S_g} p_i \log_2\!\left(\frac{1}{p_i}\right) \\
&= \sum_{g=1}^{G} P_g \sum_{i \in S_g} \frac{p_i}{P_g} \left[ \log_2\!\left(\frac{1}{P_g}\right) + \log_2\!\left(\frac{P_g}{p_i}\right) \right] \\
&= \sum_{g=1}^{G} P_g \left( \sum_{i \in S_g} \frac{p_i}{P_g} \right) \log_2\!\left(\frac{1}{P_g}\right) + \sum_{g=1}^{G} P_g \sum_{i \in S_g} \frac{p_i}{P_g} \log_2\!\left(\frac{P_g}{p_i}\right) \\
&= \sum_{g=1}^{G} P_g \log_2\!\left(\frac{1}{P_g}\right) + \sum_{g=1}^{G} P_g \sum_{i \in S_g} \frac{p_i}{P_g} \log_2\!\left(\frac{1}{p_i / P_g}\right)
\end{aligned}
$$

The first right-hand term in the last line is H0 . Hence:

$$H = H_0 + \sum_{g=1}^{G} P_g H_g \qquad (9)$$

where:

$$H_g = \sum_{i \in S_g} \frac{p_i}{P_g} \log_2\!\left(\frac{1}{p_i / P_g}\right), \qquad g = 1, \ldots, G \qquad (10)$$

The probability pi /Pg , i ∈ Sg is the conditional probability of Ei given knowledge that

one of the events falling under Sg is bound to occur. Hg thus stands for the entropy

within the set Sg and the term ∑ Pg Hg in (9) is the average within-group entropy.

Entropy thus equals the between-group entropy plus the average within-group entropy.

Two properties of this relationship follow (Theil 1972: 22):

(i) H ≥ H0 because both Pg and Hg are nonnegative. It means that after grouping

there cannot be more entropy (uncertainty) than there was before grouping.

(ii) H = H0 if and only if the term ∑ Pg Hg = 0 and ∑ Pg Hg = 0 if and only if

Hg = 0 for each set Sg . It means that entropy equals between-group entropy if

and only if the grouping is such that, within each set Sg, there is at most one event with nonzero probability.

In informational terms, the decomposition theorem has the following interpretation.

Consider the first message that one of the sets of events occurred. Its expected

information content is H0 . Consider the subsequent message that one of the events

falling under this set occurred. Its expected information content is Hg . The total

information content becomes H0 + ∑ Pg Hg . Applications of the decomposition theorem

will be discussed in the second section.
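
A small numerical check of the decomposition (7)-(10) may help; the grouping, the probabilities and the function names below are hypothetical, not taken from the chapter.

```python
import math

def entropy_bits(p):
    """Entropy in bits, following (2)-(3)."""
    return sum(x * math.log2(1.0 / x) for x in p if x > 0)

# Hypothetical probabilities of 5 events, grouped into 2 sets
groups = {"S1": [0.10, 0.30], "S2": [0.20, 0.25, 0.15]}

p_all = [p for ps in groups.values() for p in ps]
H = entropy_bits(p_all)                                   # total entropy, eq. (2)

P = {g: sum(ps) for g, ps in groups.items()}              # group probabilities, eq. (7)
H0 = entropy_bits(list(P.values()))                       # between-group entropy, eq. (8)
Hg = {g: entropy_bits([p / P[g] for p in ps])             # within-group entropies, eq. (10)
      for g, ps in groups.items()}

within = sum(P[g] * Hg[g] for g in groups)                # average within-group entropy
print(math.isclose(H, H0 + within))                       # True: decomposition (9)
```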

Multidimensional extensions

Consider a pair of events (Xi, Yj) and let pij stand for the probability that both events co-occur. The two marginal probability distributions are:

$$p_{i\cdot} = \sum_{j=1}^{n} p_{ij} \qquad (i = 1, \ldots, m) \qquad (11)$$

$$p_{\cdot j} = \sum_{i=1}^{m} p_{ij} \qquad (j = 1, \ldots, n) \qquad (12)$$

Marginal entropy values are given by:

$$H(X) = \sum_{i=1}^{m} p_{i\cdot} \log_2\!\left(\frac{1}{p_{i\cdot}}\right) \qquad (13)$$

$$H(Y) = \sum_{j=1}^{n} p_{\cdot j} \log_2\!\left(\frac{1}{p_{\cdot j}}\right) \qquad (14)$$

And two-dimensional entropy is given by:

$$H(X, Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \log_2\!\left(\frac{1}{p_{ij}}\right) \qquad (15)$$

The conditional entropy value measures the uncertainty in one dimension (e.g., X) that remains when we know that event Yj has occurred. It is given by (Theil 1972: 116-

117):

$$H_{Y_j}(X) = \sum_{i=1}^{m} \frac{p_{ij}}{p_{\cdot j}} \log_2\!\left(\frac{p_{\cdot j}}{p_{ij}}\right) \qquad (16)$$

$$H_{X_i}(Y) = \sum_{j=1}^{n} \frac{p_{ij}}{p_{i\cdot}} \log_2\!\left(\frac{p_{i\cdot}}{p_{ij}}\right) \qquad (17)$$

The average conditional entropy is derived as the weighted average of conditional

entropies:

$$H_Y(X) = \sum_{j=1}^{n} p_{\cdot j} H_{Y_j}(X) = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \log_2\!\left(\frac{p_{\cdot j}}{p_{ij}}\right) \qquad (18)$$

$$H_X(Y) = \sum_{i=1}^{m} p_{i\cdot} H_{X_i}(Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \log_2\!\left(\frac{p_{i\cdot}}{p_{ij}}\right) \qquad (19)$$

It can be shown that the average conditional entropy never exceeds the unconditional entropy, i.e., HX(Y) ≤ H(Y) and HY(X) ≤ H(X), and that the average conditional entropy and the unconditional entropy are equal if and only if the two dimensions are stochastically independent (Theil 1972: 118-119).

The expected mutual information is a measure of dependence between two

dimensions, i.e., to what extent events tend to co-occur in particular combinations. In

this respect it is comparable with the product-moment correlation coefficient in the way

entropy is comparable to the variance. Mutual information is given by:



$$J(X, Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \log_2\!\left(\frac{p_{ij}}{p_{i\cdot} \, p_{\cdot j}}\right) \qquad (20)$$

sometimes also denoted by M(X,Y) or T(X,Y). It can be shown that J(X,Y) ≥ 0 and that J(X,Y) = H(Y) − HX(Y) and J(X,Y) = H(X) − HY(X) (Theil 1972: 125-131). It can further be derived that the multi-dimensional entropy equals the sum of the marginal entropies minus the mutual information (Theil 1972: 126):

$$H(X, Y) = H(X) + H(Y) - J(X, Y) \qquad (21)$$

The interpretation is that when mutual information is absent, marginal distributions are

independent and their entropies add up to the total entropy. When mutual information is

positive, marginal distributions are dependent as some combinations occur relatively more

often than others do, and the sum of the marginal entropies exceeds the total entropy by an amount equal to the mutual information.
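
The two-dimensional measures can be illustrated with a short sketch (mine, using a made-up joint distribution): it computes the marginal entropies (13)-(14), the two-dimensional entropy (15) and the mutual information (20), and verifies identity (21).

```python
import math

def H(probs):
    """Entropy in bits of a list of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Hypothetical joint distribution p_ij over dimensions X (rows) and Y (columns)
p = [[0.30, 0.10],
     [0.05, 0.25],
     [0.10, 0.20]]

p_i = [sum(row) for row in p]                 # marginal p_i., eq. (11)
p_j = [sum(col) for col in zip(*p)]           # marginal p_.j, eq. (12)

H_X, H_Y = H(p_i), H(p_j)                     # (13), (14)
H_XY = H([pij for row in p for pij in row])   # two-dimensional entropy (15)

# Expected mutual information, eq. (20)
J = sum(pij * math.log2(pij / (p_i[i] * p_j[j]))
        for i, row in enumerate(p)
        for j, pij in enumerate(row) if pij > 0)

print(math.isclose(H_XY, H_X + H_Y - J))      # True: identity (21)
print(math.isclose(H_Y - J, H_XY - H_X))      # True: average conditional entropy H_X(Y)
```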

2. Applications

Applications of entropy statistics were developed mainly during the late 1960s and the

1970s. Tools of entropy statistics are applied in empirical research in industrial

organisation, regional science, economics of innovation, economics of inequality and

organisation theory.

Industrial concentration

A popular application of the entropy formula in industrial organisation is in empirical

studies of industrial concentration (Hildenbrand and Paschen 1964; Finkelstein and

Friedberg 1967; Theil 1967: 290-291). Applied to a distribution of market shares,

entropy is an inverse measure of concentration ranging from 0 (monopoly) to infinity

(perfect competition). The measure fulfils the seven axioms that are commonly listed as

desirable properties of any concentration index (Curry and George 1983: 205):

(i) An increase in the cumulative share of the ith firm, for all i, ranking firms 1,

2, … i … n in descending order of size, implies an increase in concentration.

(ii) The ‘principle of transfers’ should hold, i.e. concentration should increase

(decrease) if the share of any firm is increased at the expense of a smaller

(larger) firm.

(iii) The entry of new firms below some arbitrary significant size should reduce

concentration.

(iv) Mergers should increase concentration.

(v) Random brand switching by consumers should reduce concentration.

(vi) If sj is the share of a new firm, then as sj becomes progressively smaller so

should its effects on a concentration index.

(vii) Random factors in the growth of firms should increase concentration.



Horowitz and Horowitz (1968) proposed an index of relative entropy by dividing the

entropy by its maximum value log2 (n). In this way, one obtains a concentration index,

which lies between 0 and 1. An important disadvantage of the relative entropy measure

is that axiom (iv) no longer holds. Mergers reduce the value of H, but also reduce the

value of log2 (n). Since there may be a proportionally greater fall in log2 (n) than in H,

concentration may decrease after a merger.
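
A brief sketch (with hypothetical market shares of my own choosing) illustrates entropy as an inverse concentration measure and the relative entropy of Horowitz and Horowitz (1968); in this example a merger of the two smallest firms lowers H yet raises relative entropy, so measured concentration appears to fall.

```python
import math

def entropy_bits(shares):
    """Entropy of market shares in bits: an inverse concentration measure."""
    return sum(s * math.log2(1.0 / s) for s in shares if s > 0)

def relative_entropy(shares):
    """Entropy divided by its maximum log2(n), lying between 0 and 1."""
    return entropy_bits(shares) / math.log2(len(shares))

before = [0.4, 0.3, 0.2, 0.1]
after = [0.4, 0.3, 0.3]        # the two smallest firms merge

print(entropy_bits(before) > entropy_bits(after))          # True: H falls, concentration rises
print(relative_entropy(before), relative_entropy(after))   # relative entropy rises instead
```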

Though the list of axioms is also met by the more popular Herfindahl index,

which is equal to the sum of squares of market shares, the entropy formula is sometimes

preferred because of the entropy decomposition theorem. An early application is Jacquemin and Kumps (1971), who analysed (changes in) the industrial concentration of European firms and of sets of European firms (a group of British firms and a group of European firms belonging to the then EEC).

Corporate diversification and profitability

The decomposition property of the entropy formula has also been exploited to analyse

corporate diversification and its effect on corporate growth (Jacquemin and Berry 1979;

Palepu 1985; Hoskisson et al. 1993). Let pi stand for the proportion of a firm’s total

sales or production in industry i. Entropy is computed again following (2) and now

indicates corporate diversification. Zero entropy implies perfect specialisation, while

maximum entropy indicates maximum diversification.

The central question is whether diversification is rewarding for firms’

profitability, and whether related diversification within an industry group or unrelated



diversification across industry groups is most rewarding for corporate growth. The

hypothesis holds that related diversification is more rewarding as a firm’s core

competencies can be better exploited in related industries. This hypothesis is in

accordance to the resource-based view and the evolutionary theory of the firm that both

explain growth through diversification as being motivated by utilising excess capacity

of resources (including knowledge specific to the firm) and by exploiting economies of

scope (Montgomery 1994).

Jacquemin and Berry (1979), for example, considered firms active in n 4-digit

industries, which can be aggregated to G sets of 2-digit industry groups. Pg stands for

the proportion of a firm’s total sales or production in the 2-digit industry group g and pi

stands for the proportion of a firm’s total sales or production in the 4-digit industry i.

Application of (9) means that a firm’s degree of diversification at the 4-digit level H can

be decomposed into between-group diversification at the 2-digit level and the average

within-group diversification at the 4-digit level. In this way, the entropy measure solves

the problem of possible collinearity between 2-digit and 4-digit measures that arises for the Herfindahl and other indices in regression analysis (Jacquemin and Berry 1979: 366). Collinearity is avoided with the entropy measure, as it can be perfectly decomposed into a between-group component and a within-group component. From the 1970s onwards, evidence seems to support the thesis that, whereas diversification generally does not increase profitability, related diversification typically has beneficial effects (Montgomery 1994).
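
A sketch of the Jacquemin-Berry type of decomposition (the sales shares and 4-digit industry codes below are made up for illustration): total diversification at the 4-digit level splits into unrelated diversification between 2-digit groups and the weighted related diversification within them.

```python
import math
from collections import defaultdict

def entropy_bits(shares):
    return sum(s * math.log2(1.0 / s) for s in shares if s > 0)

# Hypothetical shares of a firm's sales by 4-digit industry code
sales_shares = {"2811": 0.35, "2819": 0.15, "2911": 0.30, "3711": 0.20}

# Group 4-digit industries into 2-digit industry groups
groups = defaultdict(dict)
for code, share in sales_shares.items():
    groups[code[:2]][code] = share

total = entropy_bits(list(sales_shares.values()))                  # H, eq. (2)
P = {g: sum(v.values()) for g, v in groups.items()}                # 2-digit group shares
unrelated = entropy_bits(list(P.values()))                         # H0: between 2-digit groups
related = sum(P[g] * entropy_bits([s / P[g] for s in v.values()])  # sum of P_g * H_g within groups
              for g, v in groups.items())

print(math.isclose(total, unrelated + related))                    # True, eq. (9)
```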

The entropy measure of diversification has also been applied to patent data and

bibliometric data to analyse the variety in research and innovative efforts at different

disciplinary, organizational or geographical units of analysis. In a patent study on

environmentally friendly car technology, Frenken et al. (2003) used the entropy

measure in two dimensions to assess whether corporate portfolios have become, on

average, more varied (18), and, vice versa, whether technologies have become, on

average, patented by a larger variety of firms (19). The first measure indicates the

variety of technologies at each corporate level and the second measure indicates the

strength of competition between firms at the level of each technology. Earlier studies

applied entropy measures to patents at the level of firms and countries (Grupp 1990) and to bibliometric data, including publication and citation distributions (Leydesdorff 1995).

The validity of results, however, depends crucially on the construction of the classification used to measure the degree of relatedness between a firm's activities. Given the limitations of the standard classifications of statistical bureaus,

future research may benefit from new classifications based on more in-depth

information on the nature of activities and their demand.

Regional industrial diversification

Industrial diversification has been measured by entropy at the regional level in the same way as is done at the corporate level (Hackbart and Anderson 1975; Attaran 1985). In most cases, industry employment data are used to compute the shares of industries in a region. Using the entropy decomposition theorem as in (9), entropy values can be decomposed at several digit levels, for example, first at the level of manufacturing and non-manufacturing and then at the level of specific manufacturing and non-manufacturing industries (Attaran 1985).



The main interest of this regional indicator is to test whether industrial diversity

reduces unemployment and promotes growth. Diversity is said to protect a region from

unemployment and below average growth rates caused by business cycles operating on

supra-regional levels and by external shocks (e.g., oil prices). Empirical evidence

suggests that diversity indeed reduces unemployment, while evidence on a positive impact of diversity on per capita income is more often absent (Attaran 1985; see also a more recent study by Izraeli and Murphy 2003 using the Herfindahl index). The entropy measure employed in this way, however, does not capture other aspects commonly thought to affect regional employment and growth, including the stage of the product life-cycle of the sectors present in a region.

Technological evolution

In the context of innovation studies, Saviotti (1988) proposed to use entropy as a

measure of technological variety. In this context, Ei stands for the event that a firm (or consumer) adopts a particular technology i. When there are n possible events E1, …, En with probabilities p1, …, pn, the entropy of the frequency distribution of

technologies indicates the technological variety. Entropy can be used to indicate the

emergence of a dominant design during a product life-cycle in an industry (Abernathy and

Utterback 1978). A fall in entropy towards zero would indicate the emergence of such a

dominant design (Frenken et al. 1999).
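
A stylised illustration (hypothetical adoption shares, not data from the studies cited) of how entropy traces the emergence of a dominant design: as one design's share grows, the entropy of the distribution falls towards zero.

```python
import math

def entropy_bits(shares):
    return sum(s * math.log2(1.0 / s) for s in shares if s > 0)

# Hypothetical market shares of three competing designs over four periods
periods = [
    [0.34, 0.33, 0.33],   # early phase: high variety, entropy near log2(3)
    [0.50, 0.30, 0.20],
    [0.80, 0.15, 0.05],
    [0.95, 0.04, 0.01],   # one design dominates: entropy approaches zero
]

for t, shares in enumerate(periods):
    print(t, round(entropy_bits(shares), 3))
```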

Technologies can often be described in multiple dimensions, i.e. as strings of

product characteristics analogous to genetic strings. For example, a vehicle design with

steam engine, spring suspension, and block brakes may be coded as string 000 while a

vehicle design with a gasoline engine, spring suspension, and block brakes by string

100, et cetera. A product population of designs then makes up a frequency distribution

that can be analysed using the multi-dimensional extensions of the entropy.

Multi-dimensional entropy captures the technological variety in all dimensions in one

comprehensive variety measure. The mutual information indicates the extent to which

product characteristics co-occur in the product population. The mutual information

value equals zero when there is no dependence between product characteristics. The

higher the value of the mutual information, the more product characteristics co-occur in

‘design families’.

The relationship between variety (multidimensional entropy) and dependence

(mutual information) has been analysed using (21), which can be rewritten, for K

dimensions (k = 1, …, K) labelled X1, …, XK, as:

$$\sum_{k=1}^{K} H(X_k) = J(X_1, \ldots, X_K) + H(X_1, \ldots, X_K)$$

From this formula, it can be readily understood that, given a value for the sum of

marginal entropies ΣHk , mutual information J can increase only at the expense of

multidimensional entropy H, and vice versa. When analysing a distribution of

technologies in consecutive years, the value of ΣHk may increase, allowing both entropy and mutual information to increase (though not necessarily so). When entropy and

mutual information rise simultaneously, a product population develops progressively

more varieties through a growing number of design families, a process akin to



‘speciation’ in biology. This pattern has been found in data on the product characteristics of early British steam engines in the eighteenth century (Frenken and Nuvolari 2003).
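
The K-dimensional identity above can be checked with a short sketch (my own construction, using a made-up population of three-character design strings): the mutual information follows as the sum of marginal entropies minus the multidimensional entropy.

```python
import math
from collections import Counter

def entropy_bits(counts):
    """Entropy in bits of a frequency distribution given as counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

# Hypothetical population of designs coded as strings of K=3 characteristics
population = ["000", "000", "100", "100", "100", "110", "111", "111"]

# Sum of marginal entropies over the K dimensions
sum_marginals = sum(entropy_bits(list(Counter(d[k] for d in population).values()))
                    for k in range(3))

# Multidimensional entropy over complete strings
H_multi = entropy_bits(list(Counter(population).values()))

# Mutual information as the difference between the two
J = sum_marginals - H_multi
print(round(sum_marginals, 3), round(H_multi, 3), round(J, 3))
```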

Income inequality

Another application of the entropy formula as in (2) in economics concerns the

construction of measures of income inequality (Theil 1967: 91-134 and 1972: 99-109). Let pi stand for the income share of individual i. When all individuals earn the same

income, we have complete equality and maximum entropy log2 (n), and when one

individual earns all income, we have complete inequality and zero entropy. To obtain a

measure of income inequality, entropy H can be subtracted from maximum entropy to

obtain:

$$\log_2(n) - H = \sum_{i=1}^{n} p_i \log_2(n \, p_i) \qquad (22)$$

also known as ‘redundancy’ in communication theory (Theil 1967: 91-92).
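
A quick sketch (hypothetical income shares of my own) of the inequality measure (22): maximum entropy minus the entropy of income shares, which is zero under complete equality and positive otherwise.

```python
import math

def inequality(income_shares):
    """Income inequality as redundancy: log2(n) minus entropy of shares, eq. (22)."""
    n = len(income_shares)
    H = sum(s * math.log2(1.0 / s) for s in income_shares if s > 0)
    return math.log2(n) - H

print(inequality([0.25, 0.25, 0.25, 0.25]))   # 0.0: complete equality
print(inequality([0.70, 0.15, 0.10, 0.05]))   # positive: unequal distribution
```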

Organisation theory

An approach to organisation theory based on the entropy concept has been developed by

Saviotti (1988), who proposed to use entropy to indicate the variety of possible

organizational configurations of employees with a particular degree of specialisation.



When n individuals have the same knowledge, and this knowledge enables each

individual to carry out any task in the organization, there is a maximum variety of

possible organizational configurations equal to log2 (n) (maximum job rotation). By

contrast, when all individuals have unique and specialised knowledge, and each task

requires a different type of knowledge, there is only one possible organization

characterised by the highest degree of division of labour. The variety of possible organizational structures then equals log2(1) = 0 (no job rotation). The introduction of departmental boundaries, which restricts job rotation to within departments and rules it out across them, implies that, depending on the size of the departments, the entropy lies somewhere between the minimum and maximum value.

References

Abernathy, W.J., Utterback, J. (1978) Patterns of industrial innovation, Technology

Review 50, pp. 41-47

Attaran, M. (1985) Industrial diversity and economic performance in U.S. areas, The

Annals of Regional Science 20, pp. 44-54

Boltzmann, L. (1877) Über die Beziehung eines allgemeinen mechanischen Satzes zum zweiten Hauptsatze der Wärmetheorie, Sitzungsber. Akad. Wiss. Wien, Math.-Naturwiss. Kl. 75, pp. 67-73



Curry, B., George, K.D. (1983) Industrial concentration: a survey. Journal of Industrial

Economics 31(3), pp. 203-255

Finkelstein, M.O., Friedberg, R.M. (1967) The application of an entropy theory of

concentration to the Clayton Act, The Yale Law Journal 76, pp. 677-717

Fisher, R.A. (1930) The Genetical Theory of Natural Selection (Oxford: Clarendon

Press)

Frenken, K., Hekkert, M., Godfroij, P. (2003) R&D portfolios in environmentally

friendly automotive propulsion: Variety, competition and policy implications,

Technological Forecasting and Social Change, in press

Frenken, K., Nuvolari, A. (2003) The early development of the steam engine: An

evolutionary interpretation using complexity theory, Manuscript, 27 June.

Frenken, K., Saviotti, P.P., Trommetter, M. (1999) Variety and niche creation in

aircraft, helicopters, motorcycles and microcomputers, Research Policy 28(5), pp.

469-488

Hackbart, M.W., Anderson, D.A. (1975) On measuring economic diversification, Land

Economics 51, pp. 374-378

Hildenbrand, W., Paschen, H. (1964) Ein axiomatisch begründetes Konzentrationsmaß, Statistical Information 3 (published by the statistical office of

the European Communities), pp. 53-61

Horowitz, A., Horowitz, I. (1968) Entropy, Markov processes and competition in the

brewing industry, Journal of Industrial Economics 16(3), pp. 196-211



Hoskisson, R.E., Hitt, M.A., Johnson, R.A., Moesel, D.D. (1993) Construct-validity of

an objective (entropy) categorical measure of diversification, Strategic Management

Journal 14(3), pp. 215-235

Izraeli, O., Murphy, K.J. (2003) The effect of industrial diversity on state

unemployment rate and per capita income, The Annals of Regional Science 37, pp.

1-14

Jacquemin, A.P., Berry, C.H. (1979) Entropy measure of diversification and corporate

growth, Journal of Industrial Economics 27(4), pp. 359-369

Jacquemin, A.P., Kumps, A.-M. (1971) Changes in the size structure of the largest

European firms: an entropy measure, Journal of Industrial Economics 20(1), pp. 59-

70

Grupp, H. (1990) The concept of entropy in scientometrics and innovation research. An

indicator for institutional involvement in scientific and technological developments,

Scientometrics 18, pp. 219-239

Kodama, F. (1990) ‘Technological entropy dynamics: towards a taxonomy of national

R&D efforts’, pp. 146-167 in: Sigurdson, J. (ed.) Measuring the Dynamics of

Technological Change (London & Irvington: Pinter)

Leydesdorff, L. (1995) The Challenge of Scientometrics. The Development,

Measurement, and Self-Organization of Scientific Communications (Leiden: DSWO

Press, Leiden University)

Montgomery, C.A. (1994) Corporate diversification, Journal of Economic Perspectives

8(3), pp. 163-178



Palepu, K. (1985) Diversification strategy, profit performance and the entropy measure,

Strategic Management Journal 6, pp. 239-255.

Prigogine, I., Stengers, I. (1984) Order out of Chaos (New York: Bantam).

Saviotti, P.P. (1988) Information, variety and entropy in technoeconomic development,

Research Policy 17(2), pp. 89-103

Shannon, C.E. (1948) A mathematical theory of communication, Bell System Technical

Journal 27, pp. 379-423 and pp. 623-656

Theil, H. (1967) Economics and Information Theory (Amsterdam: North-Holland)

Theil, H. (1972) Statistical Decomposition Analysis (Amsterdam: North-Holland)

Zajdenweber, D. (1972) Une application de la théorie de l’information à l’économie: la

mesure de la concentration, Revue d’Economie Politique 82, pp. 486-510
