Work by Roger Bilisoly and Krishna Saha, Department of Mathematical Sciences, Central Connecticut State University, New Britain, Connecticut
Stylometry
Studying literary style goes back to the ancient Greeks:
Aristotle's Rhetoric dates to the 4th century B.C.
The idea of using quantitative measures to study style is ~150 years old.
Augustus de Morgan suggests in a letter that studying word length may be useful in comparing authors (Holmes and Kardos, 2003).
Mendenhall (1887) gives the results of his analysis of word length by author.
He compares 200,000 words of Francis Bacon and 400,000 words of William Shakespeare, with little success according to Williams (1956).
First big stylometry success by statisticians: Mosteller and Wallace's work on the Federalist Papers in the early 1960s. See Mosteller and Wallace (1984).
Currently an active area of research in several disciplines: natural language processing, statistics, text mining, information retrieval, and physics.
These stories were published between 1891 and 1935 and are (mostly) out of copyright.
All are available at Project Gutenberg (www.gutenberg.org).
Doyle wrote four longer Sherlock Holmes stories, and Hornung wrote one A. J. Raffles novel, but these are not considered here.
A. J. Raffles is an amateur cracksman rather than a detective, but Hornung acknowledges Doyle's influence on his writing.
In 1893, Hornung married Constance Doyle, Arthur Conan Doyle's sister. The dedication of his first Raffles book was "To A. C. D. This Form of Flattery."
Function words are used for syntactic purposes and their classes are closed: prepositions, pronouns, particles, some adverbs.
For example, "up" is a particle in the following phrasal verbs:
Mix up, throw up, sit up, screw up, make up, hit up,
Use of function words is believed to be an unconscious habit, whereas content words are used intentionally.
Mosteller and Wallace (1984) studied preposition use to analyze The Federalist Papers because they thought the three potential authors (Hamilton, Madison, Jay) would have distinctive rates of use. Detective story writers, such as Doyle, Chesterton, and Hornung, intentionally use words related to crime and violence.
Prin1 does have 3 negative weights.
Prin2 contrasts 3rd person male vs. 3rd person female.
Prin3 contrasts plural forms with 2nd and 3rd person singular (except for "it").
Hence some grammatical patterns are detectable with the bag-of-words approach.
Bunny's last name is not used at all in any of the Raffles stories (it appears in the novel Mr. Justice Raffles). His first name appears in exactly one story. Watson's name does not appear in D9, although he narrates it.
Each row represents one story. For example, row 1 is Doyle's first Sherlock Holmes short story. The columns are as follows:
A = word class
B = word
C = # times word appears in story
D = # words in story
E = Story ID
F = Author
G = Year of book publication
As suggested by the scatterplot, Chesterton is significantly different from Hornung, while Doyle and Hornung have similar word rates. The "Residual" entry estimates the dispersion parameter. Here it is 2.9352, which is greater than 1 and so indicates over-dispersion. This is not surprising because of presumed clustering.
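To illustrate why a dispersion estimate above 1 signals over-dispersion, here is a minimal sketch that computes the Pearson chi-square divided by its residual degrees of freedom for a Poisson model with one common word rate and story length as exposure. This is a deliberately simpler model than whatever was fit in PROC GLIMMIX, so it will not reproduce 2.9352; the function name and the counts below are hypothetical and chosen only for illustration.

```python
def pearson_dispersion(counts, exposures, n_params=1):
    """Pearson chi-square / residual df for a common-rate Poisson model.

    counts[i]    : times the word appears in story i
    exposures[i] : total number of words in story i
    Values well above 1 suggest over-dispersion relative to Poisson.
    """
    rate = sum(counts) / sum(exposures)   # pooled rate (the Poisson MLE)
    chi2 = 0.0
    for y, n in zip(counts, exposures):
        mu = rate * n                     # expected count under the model
        chi2 += (y - mu) ** 2 / mu
    return chi2 / (len(counts) - n_params)

# Hypothetical counts and story lengths, NOT taken from the slides.
counts = [12, 30, 7, 25, 40]
exposures = [5000, 6000, 4000, 7000, 8000]
print(round(pearson_dispersion(counts, exposures), 2))
```

If every story used the word at exactly the pooled rate, the statistic would be 0; Poisson variation alone gives values near 1, and clustering (for example, stories by the same author resembling each other) tends to push it higher.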
Given a text pattern, all instances of this pattern are found and the results are displayed so that the matches are lined up.
Regular expressions (regexes) are a popular tool for specifying text patterns.
Part of the matches of "is" found in Doyle's stories. The lines are sorted by the first word to the left of the match. Regex used:
/\b(is)\b/
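The concordance on these slides was produced with the Perl software cited later in the deck; the following is only a rough Python sketch of the same idea, assuming a plain-text string as input. The function name `concordance`, the `width` parameter, and the sample sentence are all hypothetical.

```python
import re

def concordance(text, pattern=r"\b(is)\b", width=30):
    """Return keyword-in-context lines with every match in the same column."""
    text = " ".join(text.split())  # collapse newlines and repeated spaces
    rows = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-justify the left context so each match starts at column `width`.
        rows.append((left.rjust(width), m.group(0), right))

    def left_word(row):
        # Sort key: the word immediately to the left of the match.
        words = row[0].split()
        return words[-1].lower() if words else ""

    rows.sort(key=left_word)
    return [left + key + right for left, key, right in rows]

# Hypothetical sample text, not from the stories.
sample = "What is the matter? There is no mystery. It is simple."
for line in concordance(sample, width=12):
    print(line)
```

The `\b` word boundaries keep the pattern from matching "is" inside longer words such as "this" or "mist".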
93 72 57 136
23 22 27 65
38 100 17 18
AdjCount takes into account that Doyle's stories have 454,340 words in total, Chesterton's have 348,879, and Hornung's have only 170,878. Stylistic differences are strong: e.g., Doyle vs. Hornung for "there is". Software used is from Sections 3.7 and 6.5 of Bilisoly (2008).
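The slides do not give the formula behind AdjCount. One plausible reading, assumed here, is a proportional rescaling of each raw count to the size of the smallest corpus (Hornung's). The sketch below uses the corpus sizes quoted above; the function name and the raw count of 100 are hypothetical.

```python
# Total words per author, as quoted on the slide.
CORPUS_WORDS = {"Doyle": 454340, "Chesterton": 348879, "Hornung": 170878}

def adjusted_count(raw_count, author, reference="Hornung"):
    """Rescale a raw count to a corpus the size of `reference`.

    Assumption: proportional rescaling is only one possible definition of
    "AdjCount"; the slides do not state the formula actually used.
    """
    return raw_count * CORPUS_WORDS[reference] / CORPUS_WORDS[author]

# Hypothetical raw count of 100 matches for each author.
for author in CORPUS_WORDS:
    print(author, round(adjusted_count(100, author), 1))
```

Under this assumption, 100 matches in Doyle's stories count for fewer than 40 once his much larger word total is taken into account, which is why raw counts alone can mislead.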
Ongoing Work
It is easy to specify models for which PROC GLIMMIX fails to converge or has memory problems.
Find out which models work best with this data.
References
Fahrmeir, Ludwig and Gerhard Tutz (1994). Multivariate Statistical Modelling Based on Generalized Linear Models, Springer.
Holmes, David and Judit Kardos (2003). Who Was the Author? An Introduction to Stylometry. Chance, 16, 5-8.
Littell, Ramon, George Milliken, Walter Stroup, and Russell Wolfinger (1996). SAS System for Mixed Models, SAS Institute.
McCulloch, Charles, Shayle Searle, and John Neuhaus (2008). Generalized, Linear, and Mixed Models, Wiley.
Mosteller, Frederick and David Wallace (1984). Applied Bayesian and Classical Inference: The Case of the Federalist Papers, Springer.
Pinheiro, Jose and Douglas Bates (2000). Mixed-Effects Models in S and S-PLUS, Springer.
Raudenbush, Stephen and Anthony Bryk (2002). Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd Edition, Sage.
SAS Institute (2006). The GLIMMIX Procedure. Available at http://support.sas.com/rnd/app/papers/glimmix.pdf.
Shoukri, Mohamed and Mohammad Chaudhary (2007). Analysis of Correlated Data with SAS and R, 3rd Edition, Chapman & Hall.
Williams, C. B. (1956). Studies in the History of Probability and Statistics: IV. A Note on an Early Statistical Study of Literary Style. Biometrika, 43, 248-256.
Example of Language Complication: Many word measures are functions of sample size, even rates.
"We have seen that lexical measures such as mean frequency and vocabulary size, as well as [model parameter estimates] all depend on the sample size." (p. 24 of Word Frequency Distributions by R. Harald Baayen)
One solution used in corpus linguistics: take equal-sized samples from texts. However, stories are whole entities. This dependence on sample size is the reason the novellas are left out.
An approximate solution is to consider sets of stories close in size. Unfortunately, the detective stories range from 2208 to 13593 total words, which is about a factor of 6.
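The dependence on sample size can be seen by tracking how many distinct words (types) have appeared after each additional block of running words (tokens). The sketch below is a minimal illustration on a hypothetical toy sentence, not on the detective stories; the function name is also hypothetical.

```python
def vocabulary_growth(tokens, step):
    """Return (sample size, number of distinct words) at each multiple of `step`."""
    seen = set()
    growth = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok.lower())
        # Record at every `step` tokens and once more at the end of the text.
        if i % step == 0 or i == len(tokens):
            growth.append((i, len(seen)))
    return growth

# Hypothetical toy text, for illustration only.
toy = "the man saw the dog and the dog saw the man and the cat".split()
for n, v in vocabulary_growth(toy, 4):
    print(n, v, round(v / n, 2))  # sample size, vocabulary size, types per token
```

Vocabulary size keeps rising with sample size while the types-per-token ratio tends to fall, so neither is directly comparable between a 2,208-word story and a 13,593-word one.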