TYPE OF PROPOSAL: Poster presentation
TITLE: Experiments in Multivariate Analysis and Authorship Attribution
KEYWORDS: statistics, authorship, multivariate analysis
AUTHOR: David L. Hoover
AFFILIATION: New York University
E-MAIL: david.hoover@nyu.edu

CONTACT ADDRESS:
David L. Hoover
English Department, NYU
726 Broadway, 7th Floor
New York, NY 10003

FAX NUMBER: (212) 995-4019
PHONE NUMBER: (718) 438-8608 (home); (212) 998-8832 (office)

Although statistical stylistics has never been a very popular area of
study, it is attractive because its powerful techniques, widely used
in the sciences and social sciences, seem especially appropriate for
the large amounts of information that texts represent. I am working on
a project that reexamines the statistical techniques used by the most
careful and respected of the practitioners of the method:
specifically, methods inspired by or derived from the work
of John F. Burrows (1987, 1989, 1992; Burrows and Hassall, 1988;
Burrows and Craig, 1994; Craig, 1999).  In particular, I am working
with cluster analysis of the frequencies of high-frequency words. The
high frequency and low semantic load of the most frequent function
words have led researchers to assume that their use is likely to
escape the conscious control of authors. If so, their frequencies may
reflect deeply ingrained linguistic habits and provide what might be
called "wordprints" for authors. If such wordprints exist, they
may provide the kind of objective measure of style that has been
sought since the 18th century.

The techniques pioneered by Burrows have been quite well received
because they are careful, reasonable, and compelling, and they have
been extended to examinations of the authorship of the Book of Mormon
(Holmes, 1992), have been tested on the Federalist Papers (Holmes and
Forsyth, 1995), and have been applied to the question of Milton's
authorship of De Doctrina Christiana (Tweedie, Holmes, and Corns,
1998); see also Holmes (1994), Baayen, Van Halteren, and Tweedie
(1996), and Tweedie and Baayen (1998). This work tends to confirm the
accuracy and effectiveness of multivariate analysis in authorship
attribution, but in each case the field of claimants and the range of
texts are relatively limited. No one has taken up Burrows's suggestion
"to match a natural desire to work on celebrated cases like Henry
VIII and The Revenger's Tragedy with a more sober, though less
immediately rewarding, concern for testing our methods thoroughly on
cases where the true answers are not in any doubt" (Burrows, 1992,
174).

I am interested mainly in the possible application of statistical
techniques to stylistic analysis, especially in the areas of character
development, genre definition, and stylistic variability within works
or authors, but I would like to demonstrate some of the work required
for the task suggested by Burrows. After all, only those statistical
techniques that can effectively and reliably distinguish known authors
and known texts from each other seem likely to be useful in
characterizing and comparing the styles of those authors.

My first experiment analyzed the first 3,000 words of opening chapters
of a group of 50 current novels by 27 authors, downloaded from
WWW.CONTENTVILLE.COM. My second experiment analyzed the first 30,000
words of 46 novels by 31 authors, mainly taken from Hoover
(1999). Another experiment analyzed the 4,000-word sections of 25
pieces of current literary criticism by 14 authors, downloaded from
Project Muse (http://muse.jhu.edu/journals/elh/). Unfortunately, none
of these experiments showed the kind of results that one would
hope for. In fact, the best result was less than 90% accurate in
attributing texts to the correct authors. This was true even when
first- and third-person narration were separated and when function word
homographs were distinguished.  Using the results of the analyses, I
then selected some of the most problematic texts in the second and
third groups for further analysis.  When I analyzed the texts of only
2-4 authors, cluster analysis still failed to group texts by the same
author or distinguish texts by different authors accurately. This lack
of accuracy suggests that further work is necessary before such
techniques can be accepted as important tools in authorship
attribution or stylistic studies.
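
The kind of test described above can be illustrated with a small
sketch. It is not the procedure used in the experiments (those were
run in Minitab, and the distance measure and linkage method are not
stated here); it assumes Euclidean distance and average linkage, and
the pass/fail criterion in clusters_match_authors is a hypothetical
one chosen for illustration: cut the tree at as many clusters as there
are authors and ask whether every cluster holds texts by one author.
The labels and frequency values are invented.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(vectors, n_clusters):
    """Agglomerative clustering with average linkage (an assumed setting).

    vectors: dict mapping a text label to its list of word frequencies.
    Returns a list of sets of labels, one set per remaining cluster.
    """
    clusters = [{label} for label in sorted(vectors)]

    def distance(c1, c2):
        # Mean distance over all cross-cluster pairs of texts.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(euclidean(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] | clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters


def clusters_match_authors(vectors, author_of):
    """Hypothetical criterion: with one cluster per author, is every cluster pure?"""
    n_authors = len({author_of[label] for label in vectors})
    clusters = average_linkage(vectors, n_authors)
    return all(len({author_of[label] for label in c}) == 1 for c in clusters)


# Invented frequencies (per 100 words) of two function words in four texts.
author_of = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
good = {"A1": [6.0, 3.0], "A2": [6.2, 3.1], "B1": [4.0, 4.5], "B2": [4.1, 4.4]}
bad = {"A1": [6.0, 3.0], "A2": [4.05, 4.45], "B1": [4.0, 4.5], "B2": [4.1, 4.4]}
print(clusters_match_authors(good, author_of))
print(clusters_match_authors(bad, author_of))
```

In the second data set one text by author A sits among the texts by
author B, so the two-cluster solution mixes authors. That is the sort
of failure reported above for groups of only 2-4 authors.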

In my presentation, I would like to show how the process of
multivariate analysis works, from the point when the texts have been
collected to the production of cluster graphs or PCA plots. Based on
conversations with other humanities computing people, I believe that
there is a need for this kind of fairly explicit and basic
introduction to statistical analysis. At the same time, the results of
my experiments seem very interesting and significant in their own
right: although a proof of the accuracy of the techniques on large
groups of varied texts would have been more welcome, a demonstration
of their inaccuracy may, in the long run, be just as useful.

The main components of my own technique are TACT, used to analyze the
word frequencies of the individual texts and of a text that combines
all of the texts; FoxPro, a programmable database, used to import word
frequency data, tag it with author and text information, cull the data
so that it includes only the desired number of most frequent words
(generally the 50-500 most frequent words), and create zero-frequency
records of frequent words that do not appear in one or more of the
individual texts (note: I am currently looking into the feasibility of
moving the techniques to Microsoft Access, with Visual Basic as the
programming language); and Minitab, a statistical analysis program,
used to perform the actual PCA and cluster analysis. The techniques I
use allow for quick and relatively painless analysis of many different
groups of texts of many different kinds, and so have the potential to
provide a wide range of tests of the techniques on extremely varied
and extensive groups of texts, something that has not, to my
knowledge, been done before.
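
The data-preparation steps described above (counting words in each
text and in the combined text, keeping only the most frequent words,
and recording a zero for any frequent word missing from a text) can be
sketched in a few lines of Python. This is only an illustration under
stated assumptions, not the TACT and FoxPro procedure itself: the
tokenizer is deliberately crude, the function names are hypothetical,
and expressing frequencies per 100 words is an assumption, since the
normalization actually used is not stated here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_table(texts, n_words):
    """Build a table of frequencies for the most frequent words.

    texts: dict mapping a text label to its raw text.
    n_words: how many of the combined corpus's most frequent words to keep.
    Returns (words, table), where table maps each label to a list of
    frequencies per 100 words, in the same order as words. A word that
    never occurs in a text gets 0.0 (the zero-frequency record).
    """
    counts = {label: Counter(tokenize(text)) for label, text in texts.items()}

    # Word counts for the text that combines all of the texts.
    combined = Counter()
    for c in counts.values():
        combined.update(c)

    # Most frequent first; ties broken alphabetically so the result is stable.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    words = [word for word, _ in ranked[:n_words]]

    table = {}
    for label, c in counts.items():
        total = sum(c.values()) or 1  # avoid dividing by zero for an empty text
        table[label] = [100.0 * c[word] / total for word in words]
    return words, table


# Invented miniature texts, for illustration only.
sample = {"A1": "the cat and the dog", "B1": "the bird of the air"}
words, table = frequency_table(sample, 3)
print(words)
print(table)
```

Each row of the resulting table is the kind of input that a
statistical package would then take for PCA or cluster analysis.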

For my poster presentation, I would need access to a PC running
Windows 98 (NOT 2000 or ME; I need access to DOS, for running batch
files), on which I could install FoxPro (Access might be needed, if I
get the conversion done in time), Minitab, TACT, and a set of texts
for possible analysis.


Works Cited:

Baayen, R. Harald. 1993. "Statistical Models for Word Frequency
     Distributions: A Linguistic Evaluation." Computers and the
     Humanities 26 (1993), 347-363.
Baayen, R. Harald. 1996. "The Effect of Lexical Specialization on the
     Growth Curve of the Vocabulary." Computational Linguistics 22
     (1996), 455-480.
Baayen, R. Harald, Hans Van Halteren, and Fiona J. Tweedie. 1996.
     "Outside the Cave of Shadows: Using Syntactic Annotation to
     Enhance Authorship Attribution." Literary and Linguistic
     Computing 11(3), 121-31.
Burrows, J. F. 1987. Computation into Criticism. Oxford: Clarendon
     Press.
Burrows, J. F. 1989. "'A Vision' as a Revision." Eighteenth Century
     Studies 22 (1989), 551-65.
Burrows, J. F. 1992. Computers and the Study of Literature. In Butler,
     1992, 167-204.
Burrows, J. F. and A. J. Hassall. 1988. "Anna Boleyn and the Authenticity
     of Fielding's Feminine Narratives." Eighteenth Century Studies
     21 (1988), 427-453.
Burrows, J. F. and D. H. Craig. 1994. "Lyrical Drama and the 'Turbid
     Mountebanks': Styles of Dialogue in Romantic and Renaissance
     Tragedy." Computers and the Humanities 28 (1994), 63-86.
Craig, Hugh. 1999. "Contrast and Change in the Idiolects of Ben Jonson
     Characters." Computers and the Humanities 33 (1999), 221-40.
Holmes, D. I. 1992. "A Stylometric Analysis of Mormon Scripture and
     Related Texts." Journal of the Royal Statistical Society (A), 155,
     1 (1992), 91-120.
Holmes, D. I. 1994. "Authorship Attribution." Computers and the
     Humanities 28(2) (1994), 87-106.
Holmes, D. I. and R. S. Forsyth. 1995. "The Federalist Revisited: New
     Directions in Authorship Attribution." Literary and Linguistic
     Computing 10(2), 111-127.
Hoover, David L. 1999. Language and Style in The Inheritors. Lanham,
     MD: University Press of America.
Tweedie, F. J. and R. H. Baayen. 1998. "How Variable May a Constant
     Be? Measures of Lexical Richness in Perspective." Computers
     and the Humanities 32 (1998), 323-352.
Tweedie, F. J., D. I. Holmes, and Thomas N. Corns. 1998. "The
     Provenance of De Doctrina Christiana, Attributed to John
     Milton: A Statistical Investigation." Literary and Linguistic
     Computing 13(2), 77-87.