Journal of the Royal Statistical Society. Series A (General), Vol. 137, No. 1 (1974), pp. 25-34. Published by: Wiley for the Royal Statistical Society. Stable URL: http://www.jstor.org/stable/2345142. Accessed: 05/11/2013 07:44
Wiley and Royal Statistical Society are collaborating with JSTOR to digitize, preserve and extend access to Journal of the Royal Statistical Society. Series A (General).
This content downloaded from 210.32.162.238 on Tue, 5 Nov 2013 07:44:42 AM All use subject to JSTOR Terms and Conditions
J. R. Statist. Soc. A (1974), 137, Part 1, p. 25
A new model for representing sentence-length distributions is suggested in equation (8), which is a special case of equation (2) with parameter $\gamma = -\tfrac{1}{2}$ known a priori. Eight known sentence-length frequency counts taken from English, Greek and Latin prose were all satisfactorily described by distribution (8). For these eight fits, the average probability $P(\chi^2)$ was 0.50. A ninth observed distribution, taken from a Latin text of unknown authorship, failed the $\chi^2$ test applied to the fit of the data to the model in equation (8). This corroborates Yule's (1939) conclusion that it is highly unlikely that de Gerson could have written De Imitatione Christi. It is further conjectured that the last-mentioned observed frequency distribution could be well represented by the more general model in equation (2), with a parameter $\gamma$ much smaller than $-\tfrac{1}{2}$.

Keywords: SENTENCE-LENGTH; COMPOUND POISSON DISTRIBUTION; CLASSICAL PROSE
1. INTRODUCTION
THE first substantial investigation on sentence-length as a statistical tool to be used in deciding disputed authorship was published by Yule in 1939. Simple statistical indices such as the average number of words per sentence and the standard deviation of sentence-lengths were employed. Yule did not suggest a particular mathematical distribution model. Later (Yule, 1944) he explored the word-frequency of an author in addition to sentence-length. Although Yule mentions the negative binomial in that book, he discards this distribution model as totally inadequate for representation of word frequencies and sentence-lengths.

Williams (1940, 1970) suggests and uses the lognormal distribution as a model for sentence-length. To verify lognormality, Williams plots the observed cumulative percentage frequencies of sentence-lengths on log-probability paper in the hope that these plots will approach a straight line. No $\chi^2$ tests are given for any of Williams's examples. Wake (1957), who discusses sentence-lengths in works of Greek authors, also makes use of the lognormal distribution by superimposing the observed histograms of the logarithms of sentence-lengths over the "expected" normal distributions. No $\chi^2$ tests are given. The authorship of Greek prose is again investigated by Morton (1965), who works with distribution-free statistics such as the mean, the median, the quartiles and the deciles.

Mosteller and Wallace (1963), in their study of the authorship of the Federalist papers, came to the conclusion that the mean and standard deviation of sentence-length were of no help in solving disputed authorship. In their particular research Mosteller and Wallace found the means and standard deviations of sentence-length to be virtually identical for Madison and Hamilton. It can be shown, however, that two discrete distribution models with the same first two moments may have entirely different shapes. For example, the negative binomial may be J-shaped whereas the new distribution discussed in this paper may be unimodal with a mode far away from zero, although the same mean and standard deviation are common to both models. Furthermore, the other investigators mentioned previously have shown that some authors differ decisively in mean sentence-lengths.

It would be of great help to have a reasonable mathematical distribution model for sentence-length in order to sharpen our statistical tools, not only with respect to the enhanced power in significance testing but also to investigate the shape of the sentence-length distribution. In addition, a few pertinent statistical indices could be used to express sentence-lengths instead of showing massive tables of frequencies of the number of words in sentences.

The lognormal model suggested by Williams and used by Wake must be rejected on several grounds. In the first place, the number of words in a sentence constitutes a discrete variable whereas the lognormal distribution is continuous. Wake (1957) has pointed out that most observed log-sentence-length distributions display upper tails which tend towards zero much faster than the corresponding normal distribution. This is also evident in most of the cumulative percentage frequency distributions of sentence-lengths plotted on log-probability paper by Williams (1970). The sweep of the curves drawn through the plotted observations is concave upwards, which means that we deal with sub-lognormal populations. In other words, most of the observed sentence-length distributions, after logarithmic transformation, are negatively skew. Finally, a mathematical distribution model which cannot fit real data, as shown up by the conventional $\chi^2$ test, cannot claim serious attention.
2. THE MODEL
It has been pointed out by some of the writers mentioned previously that sentence-lengths are not randomly distributed throughout a given text written by a certain author. A tendency of some serial correlation between the lengths of successive sentences has been observed. This points to "clustering" and one immediately thinks of some compound Poisson process, seeing that the underlying distributions must be discrete. Recently, Sichel (1971) proposed a family of discrete distributions which arises from mixing Poisson distributions with parameter $\lambda$. The mixing distribution is given by

$$f(\lambda) = \frac{\{2\sqrt{1-\theta}/(\alpha\theta)\}^{\gamma}}{2K_{\gamma}\{\alpha\sqrt{1-\theta}\}}\,\lambda^{\gamma-1}\exp\left\{-\left(\frac{1}{\theta}-1\right)\lambda-\frac{\alpha^{2}\theta}{4\lambda}\right\}. \tag{1}$$
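Here $-\infty < \gamma < \infty$, $0 < \theta < 1$ and $\alpha > 0$ are the three parameters and $K_{\gamma}(\cdot)$ is the modified Bessel function of the second kind of order $\gamma$. The resulting compound Poisson distribution is

$$\phi(r) = \frac{(1-\theta)^{\gamma/2}}{K_{\gamma}\{\alpha\sqrt{1-\theta}\}}\,\frac{(\alpha\theta/2)^{r}}{r!}\,K_{r+\gamma}(\alpha), \tag{2}$$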
where $r = 0, 1, 2, \ldots, \infty$. A number of known discrete distribution functions such as the Poisson, negative binomial, geometric, Fisher's logarithmic series in its original and modified forms, Yule, Good, Waring and Riemann distributions are special or limiting forms of (2).
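If parameter $\gamma$ is made negative in (2), an entirely new set of discrete distributions is generated. Mean and variance of the d.f. in (2) are, respectively,

$$E(r) = \frac{\alpha\theta}{2\sqrt{1-\theta}}\,\frac{K_{\gamma+1}\{\alpha\sqrt{1-\theta}\}}{K_{\gamma}\{\alpha\sqrt{1-\theta}\}} \tag{3}$$

and

$$\operatorname{var}(r) = \frac{\alpha^{2}\theta^{2}}{4(1-\theta)}\,\frac{K_{\gamma+2}\{\alpha\sqrt{1-\theta}\}}{K_{\gamma}\{\alpha\sqrt{1-\theta}\}} + E(r)\{1 - E(r)\}. \tag{4}$$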
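In general, all moments exist as long as $\theta < 1$. If $\gamma$ is known a priori, maximum likelihood estimators for parameters $\alpha$ and $\theta$ are available (Sichel, 1971). They are not required for the purpose of this investigation as sentence-length distributions are not excessively skew.

The first two probabilities (for $r = 0$ and $r = 1$) are derived from equation (2) as

$$\phi(0) = (1-\theta)^{\gamma/2}\,\frac{K_{\gamma}(\alpha)}{K_{\gamma}\{\alpha\sqrt{1-\theta}\}} \tag{5}$$

and

$$\phi(1) = (1-\theta)^{\gamma/2}\,\frac{\alpha\theta}{2}\,\frac{K_{\gamma+1}(\alpha)}{K_{\gamma}\{\alpha\sqrt{1-\theta}\}}. \tag{6}$$

All other probabilities are easily calculated from the recurrence formula

$$\phi(r) = \theta\,\frac{r+\gamma-1}{r}\,\phi(r-1) + \frac{(\alpha\theta/2)^{2}}{r(r-1)}\,\phi(r-2). \tag{7}$$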
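The recurrence makes the general model (2) straightforward to evaluate numerically. The following is a minimal Python sketch and not part of the original paper: the names `bessel_k` and `sichel_pmf` are illustrative, and the Bessel function is approximated by a trapezoidal rule on its integral representation, an assumption made only to keep the example free of external libraries (it is suitable for the small orders used here, not for large ones).

```python
import math


def bessel_k(order, x, step=0.01, upper=20.0):
    """Approximate K_order(x) for x > 0 and small |order|.

    Trapezoidal rule on K_v(x) = int_0^inf exp(-x cosh t) cosh(v t) dt.
    """
    n = int(upper / step)
    total = 0.5 * math.exp(-x)          # t = 0 endpoint carries weight 1/2
    for i in range(1, n + 1):
        t = i * step
        total += math.exp(-x * math.cosh(t)) * math.cosh(order * t)
    return total * step


def sichel_pmf(gamma, alpha, theta, r_max):
    """Probabilities phi(0), ..., phi(r_max) of model (2)."""
    root = math.sqrt(1.0 - theta)
    norm = (1.0 - theta) ** (gamma / 2.0) / bessel_k(gamma, alpha * root)
    phi = [
        norm * bessel_k(gamma, alpha),                                # (5)
        norm * (alpha * theta / 2.0) * bessel_k(gamma + 1.0, alpha),  # (6)
    ]
    c = (alpha * theta / 2.0) ** 2
    for r in range(2, r_max + 1):                                     # (7)
        phi.append(theta * (r + gamma - 1.0) / r * phi[r - 1]
                   + c / (r * (r - 1)) * phi[r - 2])
    return phi[: r_max + 1]


# Hypothetical parameter values, chosen only for illustration.
gamma, alpha, theta = 1.0, 4.0, 0.8
phi = sichel_pmf(gamma, alpha, theta, 400)

# Compare the mean of the computed probabilities with equation (3).
x = alpha * math.sqrt(1.0 - theta)
mean_from_3 = (alpha * theta / (2.0 * math.sqrt(1.0 - theta))
               * bessel_k(gamma + 1.0, x) / bessel_k(gamma, x))
mean_from_pmf = sum(r * p for r, p in enumerate(phi))
print(sum(phi), mean_from_pmf, mean_from_3)
```

The probabilities should sum to one and the two means should agree up to rounding, which gives a quick consistency check on (3) and (5) to (7).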
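A particularly interesting and simple case arises if we make $\gamma = -\tfrac{1}{2}$ in (2). This is the distribution which will be used to represent sentence-lengths. We have, from (2),

$$\phi(r) = \sqrt{\frac{2\alpha}{\pi}}\,\exp\{\alpha\sqrt{1-\theta}\}\,\frac{(\alpha\theta/2)^{r}}{r!}\,K_{r-\frac{1}{2}}(\alpha) \tag{8}$$

with a mean of

$$E(r) = \frac{\alpha\theta}{2\sqrt{1-\theta}} \tag{9}$$

and a variance of

$$\operatorname{var}(r) = \frac{\alpha\theta(2-\theta)}{4(1-\theta)^{3/2}}. \tag{10}$$

The population index of dispersion is defined as

$$\omega = \operatorname{var}(r)/E(r) = \frac{2-\theta}{2(1-\theta)}, \tag{11}$$

whence

$$\hat{\theta} = 1 - \frac{1}{2\hat{\omega} - 1}. \tag{12}$$

From (9) we obtain

$$\hat{\alpha} = 2\bar{r}\sqrt{1-\hat{\theta}}\,/\,\hat{\theta}. \tag{13}$$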
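As a small worked example of (12) and (13), the sketch below computes the two moment estimators from a sample mean and variance. The function name `moment_estimates` is illustrative; the inputs 24.08 and 199.38 are the sample mean and variance reported for H. G. Wells in Table 2.

```python
import math


def moment_estimates(mean, variance):
    """Moment estimators (alpha_hat, theta_hat) for the gamma = -1/2 model."""
    omega = variance / mean                  # sample index of dispersion
    if omega <= 1.0:
        raise ValueError("the model requires variance > mean")
    theta = 1.0 - 1.0 / (2.0 * omega - 1.0)               # equation (12)
    alpha = 2.0 * mean * math.sqrt(1.0 - theta) / theta   # equation (13)
    return alpha, theta


alpha_hat, theta_hat = moment_estimates(24.08, 199.38)
print(alpha_hat, theta_hat)
```

The results land close to the estimates 13.04312 and 0.93575 listed for Wells in Table 2. The small remaining differences are consistent with the published mean and variance having been rounded to two decimals.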
SICHEL - Sentence-length in Written Prose
In (12) and (13) $\hat{\theta}$ and $\hat{\alpha}$ are moment estimators of population parameters $\theta$ and $\alpha$, $\bar{r}$ is the average sentence-length in the sample and $\hat{\omega}$ is the index of dispersion in the sample. For the type of skewness encountered in sentence-length distributions, estimators $\hat{\theta}$ and $\hat{\alpha}$ are reasonably efficient.

Strictly speaking, the sentence-length model in equation (8) should be truncated at zero as the minimum number of words per sentence is one. However, it has been found that the expected frequencies for $r = 0$, in the case where distribution (8) is fitted to real data, are very small. For ease of calculation zero-truncation does not appear necessary.

The first two proportionate frequencies are obtained from (5) and (6), with $\gamma = -\tfrac{1}{2}$:

$$\phi(0) = \exp[-\alpha\{1 - \sqrt{1-\theta}\}] \tag{14}$$

and

$$\phi(1) = (\alpha\theta/2)\,\phi(0). \tag{15}$$
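With $\gamma = -\tfrac{1}{2}$ the starting values (14) and (15) need no Bessel functions, so the whole distribution (8) follows from the recurrence (7) alone. The sketch below illustrates this under that reading of the formulas; `sentence_length_pmf` is an illustrative name, and the parameter values are the estimates reported for H. G. Wells in Table 2.

```python
import math


def sentence_length_pmf(alpha, theta, r_max):
    """phi(0), ..., phi(r_max) for model (8), via (14), (15) and (7)."""
    gamma = -0.5
    phi0 = math.exp(-alpha * (1.0 - math.sqrt(1.0 - theta)))   # (14)
    phi = [phi0, 0.5 * alpha * theta * phi0]                   # (15)
    c = (0.5 * alpha * theta) ** 2
    for r in range(2, r_max + 1):                              # (7)
        phi.append(theta * (r + gamma - 1.0) / r * phi[r - 1]
                   + c / (r * (r - 1)) * phi[r - 2])
    return phi[: r_max + 1]


alpha, theta = 13.04312, 0.93575
phi = sentence_length_pmf(alpha, theta, 600)

mean = sum(r * p for r, p in enumerate(phi))
var = sum(r * r * p for r, p in enumerate(phi)) - mean ** 2

# Closed forms (9) and (10) for comparison.
mean_9 = alpha * theta / (2.0 * math.sqrt(1.0 - theta))
var_10 = alpha * theta * (2.0 - theta) / (4.0 * (1.0 - theta) ** 1.5)
print(sum(phi), mean, mean_9, var, var_10)
```

The mean and variance computed from the probabilities should match (9) and (10), and both should sit near the sample values 24.08 and 199.38 used to obtain the estimates.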
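The characteristic function of distribution (8) is

$$g(t) = \exp[\alpha\{\sqrt{1-\theta} - \sqrt{1-\theta\exp(it)}\}] \tag{16}$$

and hence we obtain the characteristic function of the arithmetic mean of samples of $n$, drawn from a population (8), as

$$[g(t/n)]^{n} = \exp[n\alpha\{\sqrt{1-\theta} - \sqrt{1-\theta\exp(it/n)}\}]. \tag{17}$$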
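Equation (17) implies that the sum of $n$ independent observations from (8) again follows (8), with $\alpha$ replaced by $n\alpha$ and $\theta$ unchanged. A minimal numerical check for $n = 2$ is sketched below; the function names and the parameter values are illustrative assumptions, not taken from the paper.

```python
import math


def pmf(alpha, theta, r_max):
    """phi(0), ..., phi(r_max) for model (8) with gamma = -1/2."""
    phi0 = math.exp(-alpha * (1.0 - math.sqrt(1.0 - theta)))
    phi = [phi0, 0.5 * alpha * theta * phi0]
    c = (0.5 * alpha * theta) ** 2
    for r in range(2, r_max + 1):
        phi.append(theta * (r - 1.5) / r * phi[r - 1]
                   + c / (r * (r - 1)) * phi[r - 2])
    return phi[: r_max + 1]


def convolve(p, q, r_max):
    """Distribution of the sum of two independent counts, for 0..r_max."""
    out = [0.0] * (r_max + 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            if i + j > r_max:
                break
            out[i + j] += p_i * q_j
    return out


alpha, theta, r_max = 5.0, 0.8, 60
single = pmf(alpha, theta, r_max)
sum_of_two = convolve(single, single, r_max)   # distribution of r1 + r2
direct = pmf(2.0 * alpha, theta, r_max)        # model (8) with alpha -> 2 alpha
max_gap = max(abs(a - b) for a, b in zip(sum_of_two, direct))
print(max_gap)
```

The largest absolute difference between the two sets of probabilities should be at the level of floating-point rounding. Dividing the sum by $n$ gives the mean, which therefore takes values in steps of $1/n$.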
It follows that the sampling distribution of the mean has the same form as the original population (8) but with parameter $\alpha$ replaced by $n\alpha$ and the arithmetic mean $\bar{r}$ advancing in steps of $1/n$. This property of population (8) is of considerable help in hypothesis testing concerning the population mean.

3. APPLICATION

Several observed sentence-length distributions reported in the literature and taken from Greek, Latin and English texts were fitted to the distribution shown in equation (8). As mentioned before, zero-truncation was unnecessary as the expected frequencies at $r = 0$ were small. For the purpose of the $\chi^2$ test, the expected frequencies at $r = 0$ were included in the first cell, i.e. the class containing 1-5 words.

Starting with examples from English authors, the sentence-lengths from Macaulay's writings (Yule, 1939) are fitted to distribution (8) in Table 1. The fit is satisfactory. In contrast, the negative binomial does not represent these data, as indicated in the last column of Table 1. The total $\chi^2$ is 79.927 as compared to 16.846 for the new distribution model. The same tail-end grouping was used for both models. The deviations of the negative binomial from the data (and from distribution (8)) follow a systematic pattern although both distributions were fitted with the identical sample means and variances. At the start of the curve the negative binomial yields much larger frequencies. The position is reversed for the occurrences at $6 \le r \le 25$. Once again the negative binomial frequencies exceed those of the new distribution in the range $26 \le r \le 65$. Finally, in the upper tail for $r \ge 66$, the negative binomial tends more rapidly to zero, that is, the new distribution has the longer tail.
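The comparison between the two models can be sketched numerically from the Macaulay summary statistics alone (1,251 sentences, mean 22.07, variance 230.22). The code below is an illustration under stated assumptions, not the author's computation: both models are fitted by moments, the negative binomial is taken in the form $P(r) \propto \Gamma(\gamma + r)\,\theta^{r}/r!$ with mean $\gamma\theta/(1-\theta)$, and all function names are hypothetical.

```python
import math


def new_model_pmf(alpha, theta, r_max):
    """Model (8): start from (14), (15), then apply recurrence (7)."""
    phi0 = math.exp(-alpha * (1.0 - math.sqrt(1.0 - theta)))
    phi = [phi0, 0.5 * alpha * theta * phi0]
    c = (0.5 * alpha * theta) ** 2
    for r in range(2, r_max + 1):
        phi.append(theta * (r - 1.5) / r * phi[r - 1]
                   + c / (r * (r - 1)) * phi[r - 2])
    return phi


def negative_binomial_pmf(gamma, theta, r_max):
    """Negative binomial with parameters gamma and theta (assumed form)."""
    p = [(1.0 - theta) ** gamma]
    for r in range(1, r_max + 1):
        p.append(p[-1] * theta * (gamma + r - 1.0) / r)
    return p


def class_frequencies(pmf, n, width=5, n_classes=18):
    """Expected counts for classes 1-5, 6-10, ...; r = 0 joins the first."""
    freqs = [n * sum(pmf[k * width + 1:(k + 1) * width + 1])
             for k in range(n_classes)]
    freqs[0] += n * pmf[0]
    return freqs


n, mean, variance = 1251, 22.07, 230.22
omega = variance / mean

theta_new = 1.0 - 1.0 / (2.0 * omega - 1.0)                       # (12)
alpha_new = 2.0 * mean * math.sqrt(1.0 - theta_new) / theta_new   # (13)

theta_nb = 1.0 - 1.0 / omega              # moment fit, negative binomial
gamma_nb = mean * (1.0 - theta_nb) / theta_nb

new = class_frequencies(new_model_pmf(alpha_new, theta_new, 90), n)
nb = class_frequencies(negative_binomial_pmf(gamma_nb, theta_nb, 90), n)
for k, (a, b) in enumerate(zip(new, nb)):
    print(f"{5 * k + 1:3d}-{5 * k + 5:<3d} {a:8.1f} {b:8.1f}")
```

Under these assumptions the first class comes out near 58 sentences for the new model and near 110 for the negative binomial, and the ordering of the two columns changes as described in the text.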
TABLE 1
Sentence-length distribution from Macaulay, fitted to the new model and also to the negative binomial (data from Yule, 1939)

[The column of observed frequencies is not legible in this copy; n.l. = not legible.]

No. of words     Expected, new distribution   Expected, negative binomial
1-5                      57.8                        110.3
6-10                    201.0                        185.7
11-15                   244.6                        202.6
16-20                   209.1                        184.2
21-25                   157.5                        152.3
26-30                   113.0                        118.6
31-35                    79.5                         88.6
36-40                    55.6                         64.4
41-45                    38.9                         45.7
46-50                    27.3                         31.9
51-55                    19.2                         22.0
56-60                    13.6                         14.9
61-65                     9.6                         10.1
66-80 (grouped)          15.2                         14.1
81-90 (grouped)           4.3                          3.2
91 and over               4.8                         n.l.

Total (observed)      1,251
Mean                  22.07
Variance             230.22

                      New distribution   Negative binomial
$\chi^2$                   16.846             79.927
d.f.                        n.l.                13
$P(\chi^2)$                 n.l.               0.00
$\hat{\gamma}$               -               2.34068†
$\hat{\theta}$             0.94965            0.90412

† The general distribution in equation (2) becomes the negative binomial with parameters $\gamma$ and $\theta$ as $\alpha \to 0$.
In Table 2 sentence-length distributions from works of Wells and Chesterton, as given by Williams (1940), are excellently represented by the new model. The large differences in the parameter estimates $\hat{\alpha}$ and $\hat{\theta}$ for these two authors are of interest.

Morton (1965) shows sentence-length distributions taken from eight works of Thucydides and from nine works of Herodotus. The examples from ancient Greek texts in Table 3 once again indicate the success of distribution (8) as a model for sentence-length distributions. Negative binomials were also fitted to the two observed frequency counts in Table 3, making use of the same means and variances as derived from the samples. The respective total $\chi^2$ values were 82.216 for Thucydides and 42.823 for Herodotus.
TABLE 2
Sentence-length distributions from H. G. Wells and G. K. Chesterton fitted to the new model (data from Williams, 1940)

[Most cell-by-cell frequencies are not legible in this copy. The observed frequencies for Wells, for the classes 1-5, 6-10, ..., 61-65 and 66 and over, read: 11, 66, 107, 121, 75, 61, 52, 27, 29, 17, 12, 8, 5, 9.]

                      H. G. Wells    G. K. Chesterton
Total sentences           600              600
Mean                    24.08            25.91
Variance               199.38           131.05
$\chi^2$                9.132            5.788
d.f.                       11                8
$P(\chi^2)$              0.61             0.67
$\hat{\alpha}$        13.04312         19.27635
$\hat{\theta}$         0.93575          0.89030
The systematic deviations of the negative binomials from the data and the new distribution were very similar to those described in the discussion on Table 1. In short, the negative binomial distribution cannot take on the shape of observed sentence-length frequency counts.

The data discussed so far display a concave upward curvature if plotted as a c.d.f. on log-probability paper. Some sentence-length distributions do approach a straight line on log-probability paper. To check whether such cases can be represented satisfactorily by distribution (8), two frequency counts given by Wake (1957) were fitted to (8) and they are shown in Table 4. The first example refers to sentence-lengths from Timaeus by Plato and the second comes from the Hippocratic Corpus, Regimen in Acute Diseases. As shown in Table 4, both observed frequency counts are well fitted by the new model.
TABLE 3
Sentence-length distributions from Thucydides and Herodotus, fitted to the new model (data from Morton, 1965)

[Most of this table is not legible in this copy, including the expected frequencies, the Herodotus columns, and the $\chi^2$, d.f. and $P(\chi^2)$ entries. The observed frequencies for Thucydides (Works 1-8), for the classes 1-5, 6-10, ..., 51-55, read: 48, 201, 274, 257, 232, 144, 135, 73, 70, 45, 32.]

                      Thucydides     Herodotus
Total sentences          1,600          1,800
Mean                     24.98           n.l.
Variance                293.73           n.l.

n.l. = not legible.
Yule (1939) discussed the authorship of the Latin essay De Imitatione Christi, whose author is unknown. He came to the conclusion that Jean Charlier de Gerson is unlikely to have written this work. In Table 5 the sentence-length distribution for the combined two samples from de Gerson's works, as quoted by Yule (1939), is fitted to the distribution (8). Bearing in mind that the sample size is $n = 2{,}417$, the fit is fair [$P(\chi^2) = 0.11$ for 17 degrees of freedom]. Most of the contributions to total $\chi^2$ come from two cells only, that is, 1-5 words and 46-50 words per sentence. But for these two deviations from theory, amounting to a $\chi^2$ contribution of 13.888, the fit would have been excellent.
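The probabilities $P(\chi^2)$ quoted in the text and tables can be reproduced from the $\chi^2$ totals and degrees of freedom. The sketch below uses the finite-series form of the upper-tail probability for integer degrees of freedom; the function name `chi2_sf` is illustrative, and the test statistics are the values given in Tables 2, 4 and 5.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df d.f. >= x), integer df >= 1."""
    half = 0.5 * x
    if df % 2 == 0:
        # exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: start from the df = 1 case and add one term per extra 2 d.f.
    total = math.erfc(math.sqrt(half))
    for j in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (j - 0.5) / math.gamma(j + 0.5)
    return total


fits = [
    ("Wells", 9.132, 11),
    ("Chesterton", 5.788, 8),
    ("Plato", 5.821, 14),
    ("Hippocratic Corpus", 8.909, 8),
    ("de Gerson", 24.389, 17),
    ("De Imitatione Christi", 66.788, 9),
]
for name, chi2, df in fits:
    print(f"{name:22s} chi2 = {chi2:7.3f}  d.f. = {df:2d}  P = {chi2_sf(chi2, df):.2f}")
```

For the de Gerson sample this gives a probability close to 0.11, whereas for De Imitatione Christi the probability is effectively zero.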
TABLE 4
Sentence-length distributions from Plato and the Hippocratic Corpus, fitted to the new model (data from Wake, 1957)

[The upper-tail classes for Plato and the frequency columns for the Hippocratic Corpus are not legible in this copy.]

Plato (Timaeus)
No. of words     Observed no. of sentences   Expected
1-5                       15                   15.4
6-10                      70                   68.9
11-15                    104                  100.7
16-20                    102                   98.4
21-25                     73                   82.2
26-30                     68                   64.2
31-35                     45                   48.7
36-40                     41                   36.5
41-45                     23                   27.2
46-50                     18                   20.2
51-55                     13                   15.0
56-60                     14                   11.2

                      Plato (Timaeus)   Hippocratic Corpus
$\chi^2$                   5.821              8.909
d.f.                         14                  8
$P(\chi^2)$                 0.97               0.35
$\hat{\alpha}$           11.35126            9.47232
$\hat{\theta}$            0.95868            0.93221
In contrast, the sentence-length distribution of De Imitatione Christi cannot be represented by the distribution model of equation (8), as shown in the last two columns of Table 5. The $\chi^2$ is 66.788 for nine degrees of freedom and the deviations of the data from the model are systematic, suggesting that in the general model of equation (2) $\gamma$ should be much smaller than $-\tfrac{1}{2}$. Consequently $\alpha$ and $\theta$ should be larger. This is a very good example illustrating that, in addition to differences in means and variances, the shape of the distribution, as measured at least in part by parameter $\gamma$, is most useful in detecting statistically significant differences of sentence-length distributions.
TABLE 5
Sentence-length distributions from de Gerson and from De Imitatione Christi fitted to the new model (data from Yule, 1939)

[The cell-by-cell observed and expected frequencies are not reliably legible in this copy.]

                      de Gerson    De Imitatione Christi
Total sentences         2,417             1,221
Mean                    23.07             16.24
Variance               244.98             96.36
$\chi^2$               24.389            66.788
d.f.                       17                 9
$P(\chi^2)$              0.11              0.00
$\hat{\alpha}$        10.78827          10.85495
$\hat{\theta}$         0.95059           0.90796
REFERENCES

MORTON, A. Q. (1965). The authorship of Greek prose. J. R. Statist. Soc. A, 128, 169-224.
MOSTELLER, F. and WALLACE, D. L. (1963). Inference in an authorship problem. J. Amer. Statist. Ass., 58, 275-309.
SICHEL, H. S. (1971). On a family of discrete distributions particularly suited to represent long-tailed frequency data. In Proceedings of the Third Symposium on Mathematical Statistics (N. F. Laubscher, ed.), S.A. C.S.I.R., Pretoria, pp. 51-97.
WAKE, W. C. (1957). Sentence-length distributions of Greek authors. J. R. Statist. Soc. A, 120, 331-346.
WILLIAMS, C. B. (1940). A note on the statistical analysis of sentence-length as a criterion of literary style. Biometrika, 31, 356-361.
WILLIAMS, C. B. (1970). Style and Vocabulary: Numerical Studies. London: Griffin.
YULE, G. U. (1939). On sentence-length as a statistical characteristic of style in prose: with applications to two cases of disputed authorship. Biometrika, 30, 363-390.
YULE, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge: University Press.