
Applying Generalized Linear Mixed Models to Word Counts to Analyze the Literary Style of Detective Short Stories by A. Conan Doyle, G. K. Chesterton, and E. W. Hornung

Joint Statistical Meetings, Miami, Florida, August 3, 2011

Work by Roger Bilisoly and Krishna Saha
Department of Mathematical Sciences, Central Connecticut State University, New Britain, Connecticut

Stylometry
Studying literary style goes back to the ancient Greeks:
Aristotle's Rhetoric dates to the 4th century B.C.

The idea of using quantitative measures to study style is ~150 years old.
Augustus de Morgan suggests in a letter that studying word length may be useful in comparing authors (Holmes and Kardos (2003)).
Mendenhall (1887) gives the results of his analysis of word length by author.
He compares 200,000 words of Francis Bacon and 400,000 words of William Shakespeare, with little success according to Williams (1956).

First big stylometry success by statisticians: Mosteller and Wallace's work on the Federalist Papers in the early 1960s. See Mosteller and Wallace (1984).
Currently an active area of research in several disciplines: natural language processing, statistics, text mining, information retrieval, and physics.

The Early Detective Short Story Corpus


This talk considers the following ~1,000,000 word corpus:
The 56 canonical Sherlock Holmes short stories by Arthur Conan Doyle.
The 50 Father Brown short stories by G. K. Chesterton.
The 26 A. J. Raffles short stories by E. W. Hornung.

These stories were published between 1891 and 1935 and are (mostly) out of copyright.
All are available at Project Gutenberg (www.gutenberg.org)

Doyle wrote four longer Sherlock Holmes stories, and Hornung wrote one A. J. Raffles novel, but these are not considered here.
A. J. Raffles is an amateur cracksman (a burglar, not a detective), but Hornung acknowledges Doyle's influence on his writing.
In 1893, Hornung married Constance Doyle, Arthur Conan Doyle's sister. The dedication of his first Raffles book was "To A. C. D. This Form of Flattery."

All three authors wrote other genres.

Function vs. Content Words


The idea of a word seems obvious, but is hard to pin down.
Is "Holmes" a word? Is "1888" a word? Is "knee-caps" one or two words? Etc.

Function words are used for syntactic purposes and their classes are closed: prepositions, pronouns, particles, some adverbs.
For example, "up" is a particle in the following phrasal verbs:
mix up, throw up, sit up, screw up, make up, hit up, ...

Content words denote one or more specific things or ideas.


Verbs, nouns, adjectives, and adverbs are usually content words.

Use of function words is believed to be an unconscious habit, whereas content words are used intentionally.
Mosteller and Wallace (1984) studied preposition use to analyze The Federalist Papers because they thought the three potential authors (Hamilton, Madison, Jay) would have distinctive rates of use.
Detective story writers, such as Doyle, Chesterton, and Hornung, intentionally use words related to crime and violence.

Example: PCA of Personal Pronouns (Function words)


Eigenvectors
Pronoun          Prin1       Prin2       Prin3       Prin4
I             0.428270    0.120691   -0.030862    0.292536
me            0.438173    0.083333   -0.064091    0.228027
you           0.336905    0.192433   -0.200798    0.257253
he           -0.261557    0.340602   -0.270750   -0.342274
him           0.073556    0.374285   -0.323832   -0.037269
she           0.105492   -0.581133   -0.169591   -0.040532
her           0.127503   -0.579041   -0.157171   -0.037176
it            0.279381    0.070345    0.112264    0.217366
we            0.324365    0.053332    0.376130   -0.447410
us            0.323786    0.057149    0.375081   -0.461859
they         -0.322364    0.026483    0.302032    0.365480
them         -0.133618    0.024416    0.580514    0.283493
% Explained      32.4%       19.3%       13.4%       10.8%

Prin1 has three negative weights (he, they, them).
Prin2 contrasts 3rd person male vs. 3rd person female.
Prin3 contrasts plural forms with 2nd and 3rd person singular (except for "it").
Hence some grammatical patterns are detectable with the bag-of-words approach.
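
A minimal sketch of how loadings like these can be computed from a stories-by-pronouns matrix of rates. The numbers below are made-up toy data, not the corpus, and the use of the correlation matrix is an assumption: the slides do not say whether correlations or covariances were analyzed. The function name pca_loadings is hypothetical.

```python
import numpy as np

def pca_loadings(rates):
    """Return (explained fractions, eigenvectors) of the correlation matrix.

    rates: 2-D array, one row per story, one column per pronoun.
    Eigenvectors are the columns of the second result, largest eigenvalue first.
    """
    corr = np.corrcoef(rates, rowvar=False)    # pronoun-by-pronoun correlations
    eigvals, eigvecs = np.linalg.eigh(corr)    # symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals / eigvals.sum(), eigvecs

# Hypothetical toy data: 6 stories x 4 pronouns (rates per 1000 words).
pronouns = ["I", "he", "she", "they"]
toy_rates = np.array([
    [30.1, 12.0,  2.1, 4.0],
    [28.4, 14.2,  1.0, 3.2],
    [10.3, 20.5,  9.8, 6.1],
    [12.0, 18.9, 11.2, 5.5],
    [15.7, 16.0,  5.0, 8.9],
    [22.2, 13.1,  3.3, 7.4],
])

explained, vectors = pca_loadings(toy_rates)
for name, weight in zip(pronouns, vectors[:, 0]):
    print(f"{name:>5s} {weight: .4f}")
print("fraction explained by Prin1:", round(float(explained[0]), 3))
```

The sign of each eigenvector is arbitrary, so only contrasts within a column (which weights share a sign) are meaningful.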

Example: Names of Characters (Content words)


For fiction it is not hard to distinguish series of stories based on recurring character(s): just use names, which are content words.
StoryID    D1  D2  D3  D4  D5  D6  D7  D8  D9 D10
Sherlock   11  10   7  10  10  10  10   9   5   7
Holmes     48  53  46  47  25  29  38  56  14  34
John        4   6   0   3   9   1   3   0   0   0
Watson      6  10   4   4   4  16   9  12   0   7
Arthur      0   0   0   0   0   0   0   0   0   0
Raffles     0   0   0   0   0   0   0   0   0   0
Harry       0   0   0   0   0   0   0   0   0   0
Bunny       0   0   0   0   0   0   0   0   0   0
Manders     0   0   0   0   0   0   0   0   0   0
Father      0   0  18  33  13   3   1   0   1   8
Brown       2   2   2   1   0   2   2   3   0   0
Hercule     0   0   0   0   0   0   0   0   0   0
Flambeau    0   0   0   0   0   0   0   0   0   0

StoryID    H1  H2  H3  H4  H5  H6  H7  H8  H9 H10
Sherlock    0   0   0   0   0   0   0   0   0   0
Holmes      0   0   0   0   0   0   0   0   0   0
John        0   2   0   0   0   1   0   0   0   0
Watson      0   0   0   0   0   0   0   0   0   0
Arthur      0   0   0   0   0   0   0   0   0   1
Raffles    47  35  47  22  41  50  48  74  42  51
Harry       0   0   0   0   0   0   0   0   0   0
Bunny      19  15  20  20  11  10  16  27  20  25
Manders     0   0   0   0   0   0   0   0   0   0
Father      0   0   2   1   0   1   0   0   0   0
Brown       0   0   0   0   2   0   0   3   0   1
Hercule     0   0   0   0   0   0   0   0   0   0
Flambeau    0   0   0   0   0   0   0   0   0   0

StoryID    C1  C2  C3  C4  C5  C6  C7  C8  C9 C10
Sherlock    0   0   0   0   0   0   0   0   0   0
Holmes      0   0   0   0   0   0   0   0   0   0
John        0   0   0   2   2   0   0   0   0   0
Watson      0   0   0   0   0   0   0   0   0   0
Arthur      0   0   0   0   0   0   0   0   0   0
Raffles     0   0   0   0   0   0   0   0   0   0
Harry       0   0   0   1   0   0   0   0   0   0
Bunny       0   0   0   0   0   0   0   0   0   0
Manders     0   0   0   0   0   0   0   0   0   0
Father     12  17  22  15  11  20  43  30  16  27
Brown      17  26  26  18  14  29  49  33  17  27
Hercule     0   0   0   0   0   0   0   0   0   1
Flambeau   27   0   1   7  22  32  34  34   0  43

Bunny's last name is not used at all in any of the Raffles stories (it appears in the novel Mr. Justice Raffles). His first name appears in exactly one story.
Watson's name does not appear in D9, although he narrates it.

The Statistical Model


Response variable = word proportions
Explanatory variables = author, word, word group, date (nuisance parameter?)
plot indicator variables (future work)

Non-normal error distribution.


Data consists of word counts.

Errors dependent due to repeated measures.
Random effects plausible:


For example, both Doyle and Chesterton grew tired of writing Sherlock Holmes and Father Brown stories, but did so later in their careers to make money.

The above suggests using Generalized Linear Mixed Models (GLMMs)


Both SAS and R fit GLMMs to data.

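SAS's PROC GLIMMIX for GLMMs


Let Y = response vector (proportions), X = matrix of fixed effects, and Z = matrix of random effects.
The basic model is $g(E[Y \mid \gamma]) = X\beta + Z\gamma$, where $g$ is the link function, $\beta$ is the vector of fixed-effect parameters, and $\gamma$ is the vector of random effects.
SAS has G-side and R-side random effects.
We assume that $\gamma \sim \mathrm{Normal}(0, G)$ and $\varepsilon \sim \mathrm{Normal}(0, R)$ are independent, where $\varepsilon$ is the residual vector.
G models random effects, e.g., date of publication.
R models correlated residuals, e.g., word correlations.
Both G and R are modeled using the RANDOM statement (unlike PROC GENMOD's and PROC MIXED's REPEATED statement).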

R also has functions that fit GLMMs:


glmmPQL (PQL = penalized quasi-likelihood)
The above PROC GLIMMIX info comes from SAS documentation at http://support.sas.com/rnd/app/papers/glimmix.pdf.
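
A minimal numerical sketch of the conditional mean in a GLMM, g(E[Y | gamma]) = X beta + Z gamma, with a G-side random intercept per story. Everything in it is an assumption for illustration: the logit link (the slides do not state which link or distribution was used), the design matrices, the parameter values, and the variance in G.

```python
import numpy as np

def inverse_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a proportion."""
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical design: 4 observations, columns = intercept, author indicator.
X = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [1.0, 0.0],
              [1.0, 0.0]])
beta = np.array([-9.0, 1.2])          # made-up fixed effects

# Hypothetical random intercepts: observations 1-2 share story A, 3-4 story B.
Z = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])
G = 0.25 * np.eye(2)                  # made-up G-side covariance matrix

rng = np.random.default_rng(0)
gamma = rng.multivariate_normal(np.zeros(2), G)   # gamma ~ Normal(0, G)

eta = X @ beta + Z @ gamma            # linear predictor X beta + Z gamma
mu = inverse_logit(eta)               # conditional mean E[Y | gamma]
print(mu)
```

Observations from the same story share one draw of gamma, which is how the random effect induces dependence among repeated measures.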

Example: Color Word Analysis Data

Each row represents one story. For example, row 1 is Doyle's first Sherlock Holmes short story.
The columns are as follows:
A = word class
B = word
C = # times word appears in story
D = # words in story
E = Story ID
F = Author
G = Year of book publication

Color Word Rates


BWG combines the counts of the words black, white, gray, and grey.
RBGY combines the counts of the words red, blue, green, and yellow.
Chesterton clearly uses more color words than the other two.
Note that Chesterton went to the Slade School of Art (1893-96).
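
A minimal sketch of how the grouped color-word rates could be computed for one story. The tokenizer (lowercase runs of letters) and the sample sentence are assumptions for illustration, not the authors' preprocessing or data.

```python
import re
from collections import Counter

# Word groups as defined on the slide.
COLOR_GROUPS = {
    "BWG": {"black", "white", "gray", "grey"},
    "RBGY": {"red", "blue", "green", "yellow"},
}

def color_rates(text):
    """Return the rate per 1000 words of each color group in text."""
    tokens = re.findall(r"[a-z]+", text.lower())   # assumed tokenization
    if not tokens:
        return {group: 0.0 for group in COLOR_GROUPS}
    counts = Counter(tokens)
    return {
        group: 1000.0 * sum(counts[word] for word in words) / len(tokens)
        for group, words in COLOR_GROUPS.items()
    }

# Made-up sentence, used only to exercise the function.
sample = "The grey cat sat on a red mat under a black and white sky."
print(color_rates(sample))
```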

PROC GLIMMIX Output


Parameter Estimates
Effect      author       Estimate   Standard Error   DF     t Value   Pr > |t|
Intercept                -8.9077    0.1260           1053   -70.71    <.0001
author      Chesterton    1.2417    0.1346           1053     9.23    <.0001
author      Doyle         0.1768    0.1445           1053     1.22    0.2213
author      Hornung       0         .                .        .       .
Residual                  2.9352    .                .        .       .

As suggested by the scatterplot, Chesterton is significantly different from Hornung, while Doyle and Hornung have similar word rates.
Residual estimates the dispersion parameter. Here it is 2.9352, which indicates over-dispersion. This is not surprising because of presumed clustering.
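
A small worked example of reading the estimates on the original scale. It assumes a log link, or a logit link with very small proportions, where the two nearly coincide. The slides do not state the link, so the exponentiated values below are an interpretation, not part of the PROC GLIMMIX output.

```python
import math

# Estimates copied from the Parameter Estimates table; Hornung is the reference level.
estimates = {"Intercept": -8.9077, "Chesterton": 1.2417, "Doyle": 0.1768, "Hornung": 0.0}

# Under the assumed link, exp(Intercept) approximates Hornung's proportion.
baseline = math.exp(estimates["Intercept"])

for author in ("Chesterton", "Doyle", "Hornung"):
    ratio = math.exp(estimates[author])        # multiplicative effect vs. Hornung
    print(f"{author:<10s} ratio vs. Hornung = {ratio:.2f}, "
          f"rate per 1000 words = {1000 * baseline * ratio:.3f}")
```

On this reading, Chesterton's color-word rate is roughly three and a half times Hornung's, while Doyle's ratio is close to one, in line with the non-significant Doyle estimate.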

Example: Rates (per 1000 words) of "are" and "is".


There is good separation among Hornung, Chesterton, and Doyle. Concordancing can give insight into how these differences arise: see next slide.

Concordancing to Analyze "are" and "is" Usage (function words)

Given a text pattern, all instances of this pattern are found and the results are displayed so that the matches are lined up.
Regular expressions (regexes) are a popular tool for specifying text patterns.

The resulting matches can be sorted in different ways.

Some of the matches of "is" found in Doyle's stories. The lines are sorted by the first word on the left. Regex used:
/\b(is)\b/
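
A minimal concordance sketch in Python using the same regex. It is not the software from Bilisoly (2008). The function names, the context width, and the sample sentence are made up for illustration, and "first word on the left" is read here as the word immediately before the match.

```python
import re

def concordance(text, pattern, width=30):
    """Return (left context, match, right context) for every match of pattern."""
    flat = " ".join(text.split())              # collapse line breaks and extra spaces
    rows = []
    for match in re.finditer(pattern, flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        rows.append((left, match.group(0), right))
    return rows

def sort_by_left_word(rows):
    """Sort rows by the word immediately to the left of the match."""
    def left_word(row):
        words = row[0].split()
        return words[-1].lower() if words else ""
    return sorted(rows, key=left_word)

# Made-up sample text; \b keeps "This" and "island" from matching.
sample = "It is a fine day. There is no time. This island is small, and he is late."
rows = sort_by_left_word(concordance(sample, r"\b(is)\b"))
for left, hit, right in rows:
    print(f"{left:>30}{hit}{right}")           # right-justify so matches line up
```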

Some Results of Concordancing


Author       Phrase      RawCount   AdjCount
Doyle        there are      182       68.5
Doyle        they are       115       43.3
Doyle        we are         169       63.6
Doyle        you are        377      141.8
Chesterton   there are       93       45.6
Chesterton   they are        72       35.3
Chesterton   we are          57       27.9
Chesterton   you are        136       66.6
Hornung      there are       23       23.0
Hornung      they are        22       22.0
Hornung      we are          27       27.0
Hornung      you are         65       65.0

Author       Phrase      RawCount   AdjCount
Doyle        he is          330      124.1
Doyle        it is         1209      454.7
Doyle        that is        308      115.8
Doyle        there is       457      171.9
Chesterton   he is          194       95.0
Chesterton   it is          394      193.0
Chesterton   that is        192       94.0
Chesterton   there is       189       92.6
Hornung      he is           38       38.0
Hornung      it is          100      100.0
Hornung      that is         17       17.0
Hornung      there is        18       18.0

AdjCount takes into account that Doyle's stories have 454340 words, Chesterton's have 348879 words, and Hornung's have only 170878 words in total.
Stylistic differences are strong: e.g., Doyle vs. Hornung for "there is".
Software used is from Sections 3.7 and 6.5 of Bilisoly (2008).
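
A short worked example of the adjustment. The tabulated AdjCount values are consistent with rescaling each raw count to Hornung's corpus size (raw count times 170878 divided by the author's total). That formula is inferred from the numbers rather than stated on the slide, and the function name adj_count is hypothetical.

```python
# Total words per author, as given on the slide.
TOTAL_WORDS = {"Doyle": 454340, "Chesterton": 348879, "Hornung": 170878}

def adj_count(raw, author, reference="Hornung"):
    """Rescale a raw count to the size of the reference author's corpus."""
    return raw * TOTAL_WORDS[reference] / TOTAL_WORDS[author]

# Raw counts of "there is" from the table.
there_is = {"Doyle": 457, "Chesterton": 189, "Hornung": 18}
for author, raw in there_is.items():
    print(f"{author:<10s} raw = {raw:4d}  adjusted = {adj_count(raw, author):.1f}")
```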

Ongoing Work
It is easy to have models where PROC GLIMMIX fails to converge or has memory problems.
Find out which models work best with this data.

Many other word classes are ready to analyze.


For example, words related in meaning from Roget's Thesaurus.

Create a data set of plot-device indicators.


For example, has a dead body turned up? Does the story involve a theft?

References
Fahrmeir, Ludwig and Gerhard Tutz (1994). Multivariate Statistical Modelling Based on Generalized Linear Models, Springer.
Holmes, David and Judit Kardos (2003). Who Was the Author? An Introduction to Stylometry. Chance, 16, 5-8.
Littell, Ramon, George Milliken, Walter Stroup, and Russell Wolfinger (1996). SAS System for Mixed Models, SAS Institute.
McCulloch, Charles, Shayle Searle, and John Neuhaus (2008). Generalized, Linear, and Mixed Models, Wiley.
Mosteller, Frederick and David Wallace (1984). Applied Bayesian and Classical Inference: The Case of the Federalist Papers, Springer.
Pinheiro, Jose and Douglas Bates (2000). Mixed-Effects Models in S and S-PLUS, Springer.
Raudenbush, Stephen and Anthony Bryk (2002). Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd Edition, Sage.
SAS Institute (2006). The GLIMMIX Procedure. Available at http://support.sas.com/rnd/app/papers/glimmix.pdf.
Shoukri, Mohamed and Mohammad Chaudhary (2007). Analysis of Correlated Data with SAS and R, 3rd Edition, Chapman & Hall.
Williams, C. B. (1956). Studies in the History of Probability and Statistics: IV. A Note on an Early Statistical Study of Literary Style. Biometrika, 43, 248-256.

Example of Language Complication: Many word measures are functions of sample size, even rates.

"We have seen that lexical measures such as mean frequency and vocabulary size, as well as [model parameter estimates] all depend on the sample size."
p. 24 of Word Frequency Distributions by R. Harald Baayen
One solution used in corpus linguistics: take equal-sized samples from texts. However, stories are whole entities.
This dependence on sample size is the reason the novellas are left out.
An approximate solution is to consider sets of stories close in size. Unfortunately, the detective stories range from 2208 to 13593 total words, which is about a factor of 6.
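
A minimal sketch of the sample-size dependence described above, tracking how vocabulary size (number of distinct word types) changes as more tokens are read. The token list is a made-up toy sentence and the function name is hypothetical; real stories would be tokenized first.

```python
def vocabulary_growth(tokens, step):
    """Return (sample size, number of distinct types) at multiples of step."""
    return [(n, len(set(tokens[:n]))) for n in range(step, len(tokens) + 1, step)]

# Made-up toy text of 16 tokens.
tokens = "the cat saw the dog and the dog saw the cat and then the cat ran".split()
for size, types in vocabulary_growth(tokens, 4):
    print(f"first {size:2d} tokens: {types} types, type/token ratio = {types / size:.3f}")

# Ratio of the longest to the shortest story length quoted on the slide.
print(round(13593 / 2208, 2))
```

Because the number of types depends on how many tokens are read, measures built on it are not directly comparable between a 2208-word story and a 13593-word story.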
