TYPE OF PROPOSAL: paper
TITLE: Reconstructing the stemma of a textual tradition from the order of sections in manuscripts
KEYWORDS: computer-assisted stemmatology, sequence rearrangements, Canterbury Tales

AUTHOR: Matthew Spencer
AFFILIATION: Department of Biochemistry, University of Cambridge, England
E-MAIL: ms379@cam.ac.uk

AUTHOR: Barbara Bordalejo
AFFILIATION: Centre for Technology and the Arts, De Montfort University, England
E-MAIL: bb268@is8.nyu.edu

AUTHOR: Adrian C. Barbrook
AFFILIATION: Department of Biochemistry, University of Cambridge, England
E-MAIL: acb18@mole.bio.cam.ac.uk

AUTHOR: Linne R. Mooney
AFFILIATION: Department of English, University of Maine
E-MAIL: mooney@maine.edu

AUTHOR: Christopher J. Howe
AFFILIATION: Department of Biochemistry, University of Cambridge, England
E-MAIL: ch26@mole.bio.cam.ac.uk

AUTHOR: Peter Robinson
AFFILIATION: Centre for Technology and the Arts, De Montfort University, England
E-MAIL: peter.robinson@dmu.ac.uk

CONTACT ADDRESS: Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QW, England.
FAX NUMBER: +44 (1223) 333 345
PHONE NUMBER: +44 (1223) 333 687

Geoffrey Chaucer's Canterbury Tales exists in over 50 complete manuscripts and early printed editions. It consists of a series of tales told by pilgrims travelling from London to Canterbury, more or less loosely connected by linking passages. Unlike other contemporary works of literature having a similar form, the surviving manuscripts show many different orders of the tales and links [Manly, J. M. & Rickert, E., pages 475-494 in Volume II of The text of the Canterbury Tales: studied on the basis of all known manuscripts (eds. Manly, J. M. & Rickert, E.), University of Chicago Press, Chicago, 1940]. Tales and links have been inserted, deleted or moved from one position to another.

Previous studies of the order of these sections have focussed on two main questions: the order intended by Chaucer, and the possibility that differences in order among manuscripts can reveal the genealogy of the manuscripts (the stemma). These studies have used verbal arguments about the internal consistency of different orders, geographical references within the links, and the plausibility of hypothetical rearrangements [e.g. Benson, L. D. The order of The Canterbury Tales. Studies in the Age of Chaucer 3, 77-120 (1981)].

An analogous problem in evolutionary biology is the reconstruction of the phylogeny (family tree) of a set of species from the order of genes on a genome [Sankoff, D. Edit distance for genome comparison based on non-local operations. Lecture Notes in Computer Science 644, 121-135 (1992)]. Genes may be inserted, deleted, transposed (moved from one position to another) or inverted (flipped from forward to reverse order). One could measure the edit distance between two genome orders as the minimum number of these operations needed to convert one order into the other. Similarly, the distance between a pair of manuscripts could be measured as the minimum number of insertions, deletions and transpositions needed to convert one order into the other (inversions are not possible in manuscripts). One could then use well-known methods from evolutionary biology to reconstruct a stemma from the matrix of pairwise distances among manuscripts.

Unfortunately, calculating edit distance is a very difficult computational problem. We describe two solutions. First, the edit distance can be separated into the number of transpositions and the number of insertions/deletions. The breakpoint distance between a pair of sequences (manuscripts or genomes) is the proportion of items (tales or genes) common to both sequences that have different right-hand neighbours in the two sequences.
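
The definition of breakpoint distance just given can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each section occurs at most once in a manuscript, that both orders are first reduced to the sections they share, and that the last shared section has an "end of sequence" marker as its right-hand neighbour. The function name and the section labels A to E are hypothetical.

```python
def breakpoint_distance(order_a, order_b):
    """Proportion of shared items whose right-hand neighbour differs.

    Assumes every item appears at most once in each order.
    """
    common = set(order_a) & set(order_b)
    if not common:
        return 0.0

    # Reduce each order to the shared items, keeping their relative order.
    a = [x for x in order_a if x in common]
    b = [x for x in order_b if x in common]

    # Map each item to its right-hand neighbour; the final item gets a
    # sentinel standing for "end of sequence" (an assumption of this sketch).
    end = object()
    right_a = dict(zip(a, a[1:] + [end]))
    right_b = dict(zip(b, b[1:] + [end]))

    breakpoints = sum(1 for x in common if right_a[x] != right_b[x])
    return breakpoints / len(common)


# Hypothetical section labels: one section (C) has been moved.
first = ["A", "B", "C", "D", "E"]
second = ["A", "C", "B", "D", "E"]
print(breakpoint_distance(first, second))
```

In this example the sections A, B and C each have a different right-hand neighbour in the two orders, so three of the five shared sections count as breakpoints.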
Breakpoint distance is simple to calculate, and provides an approximate measure of the number of transpositions as long as few transpositions have occurred [Blanchette, M., Kunisawa, T. & Sankoff, D. Gene order breakpoint evidence in animal mitochondrial phylogeny. Journal of Molecular Evolution 49, 193-203 (1999)]. For damaged manuscripts from which sections may have been lost after writing, we cannot calculate the exact breakpoint distance but we can set bounds on its possible value. The deletion distance is the proportion of items present in one but not both of the pair of sequences, and is an estimate of the number of insertions/deletions. However, we cannot calculate the deletion distance for damaged manuscripts. We therefore use breakpoint distances alone to estimate the number of transpositions between each pair of extant manuscripts in the Canterbury Tales tradition, and produce stemmata based on these distances.

Second, we present new maximum likelihood methods for estimating edit distance. These methods are computationally intensive, but are more reliable than breakpoint distance when the number of transpositions is large. We compare stemmata produced using these more sophisticated methods with those from breakpoint distance.

For both methods, we use biological software [Swofford, D. L. PAUP*: Phylogenetic Analysis Using Parsimony (*and other methods), Sinauer Associates, Sunderland, MA, 1999] to search for the stemma requiring the smallest sum of edge lengths necessary to reproduce the observed pattern of distances among manuscripts, where distance between two manuscripts on the stemma is measured as the sum of the lengths of the edges (branches) linking the two manuscripts. Although we only allow topologically binary stemmata, edge lengths may be arbitrarily close to zero, so relationships in which a single manuscript has many descendants can be represented.
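
The way distance is measured on a stemma can be illustrated with a small Python sketch. It shows only the definition (the sum of edge lengths along the path linking two manuscripts), not the search performed by PAUP*. The stemma, its edge lengths, the manuscript labels A to D and the hypothetical ancestors x and y are all invented for illustration.

```python
def path_length(edges, start, goal):
    """Sum of edge lengths on the path between two nodes of a tree.

    `edges` maps (node, node) pairs to non-negative edge lengths.
    """
    # Build an undirected adjacency map from the edge dictionary.
    adjacency = {}
    for (u, v), length in edges.items():
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))

    # Depth-first walk; in a tree there is only one path to each node.
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbour, length in adjacency.get(node, []):
            if neighbour != parent:
                stack.append((neighbour, node, dist + length))
    raise ValueError("no path between the two nodes")


# Invented binary stemma: x and y are hypothetical ancestors.
# The zero-length edge x-y makes x and y behave like a single
# node with four neighbours, although the topology is binary.
stemma = {
    ("A", "x"): 1.0,
    ("B", "x"): 2.0,
    ("x", "y"): 0.0,
    ("C", "y"): 1.5,
    ("D", "y"): 0.5,
}
print(path_length(stemma, "A", "C"))
```

Because the internal edge has length zero, the distance from A to C is simply the sum of the two edges that attach A and C to the collapsed node.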
We do not discuss the methods used by scholars such as Quentin, Dearing and Zarri, as we are working with a distance matrix rather than a set of variants.

We suggest two methods for comparing different stemmata produced from the same data set. Again, these methods were developed for analogous problems in evolutionary biology. First, one can define a partition distance between two stemmata [Penny, D. & Hendy, M. D. The use of tree comparison metrics. Systematic Zoology 34, 75-82 (1985)]. Removing any edge linking two manuscripts (whether extant or hypothetical) divides a stemma into two sets of manuscripts. The order of manuscripts within the sets is not important. If we can divide two stemmata into the same two sets of manuscripts in this way, we say that the stemmata have an edge in common. If there is no edge in the second stemma whose removal produces the same two sets of manuscripts as the removal of an edge in the first stemma, we say that the edge occurs in only one of the two stemmata. The partition distance is then the proportion of edges that occur in only one of the two stemmata. Partition distance is a simple summary of the number of differences between two stemmata, and can be used to decide whether two stemmata are more similar than one would expect by chance alone. Second, a consensus stemma [Page, R. D. M. & Holmes, E. C. Molecular evolution: a phylogenetic approach (Blackwell Science, Oxford, 1998)] includes only those edges that occur in all (or a specified proportion of) the stemmata to be compared. Parts of the consensus stemma for which the stemmata to be compared contradict each other are left as unresolved, star-like groups. A consensus stemma provides a good visual representation of the areas of uncertainty in the relationships among manuscripts.

We also consider methods for dealing with contamination.
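
The partition distance described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure of Penny and Hendy or of the authors: edges are compared only through the sets of extant manuscripts they separate (hypothetical ancestors carry arbitrary labels), and the proportion is taken over all distinct edges found in either stemma. The function names, the manuscript labels A to D and the ancestors x and y are hypothetical.

```python
def splits(edges, manuscripts):
    """Set of two-way divisions of `manuscripts`, one per removed edge."""
    manuscripts = frozenset(manuscripts)
    result = set()
    for i, (u, _v) in enumerate(edges):
        # Adjacency map of the stemma with edge i removed.
        adjacency = {}
        for j, (a, b) in enumerate(edges):
            if j != i:
                adjacency.setdefault(a, set()).add(b)
                adjacency.setdefault(b, set()).add(a)
        # Collect every node still reachable from one end of the edge.
        seen, stack = {u}, [u]
        while stack:
            node = stack.pop()
            for nxt in adjacency.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        side = manuscripts & seen
        # Order within and between the two sets does not matter.
        result.add(frozenset({side, manuscripts - side}))
    return result


def partition_distance(edges_1, edges_2, manuscripts):
    """Proportion of distinct edges occurring in only one stemma.

    The denominator (all distinct edges in either stemma) is an
    assumption of this sketch.
    """
    s1 = splits(edges_1, manuscripts)
    s2 = splits(edges_2, manuscripts)
    total = len(s1 | s2)
    return len(s1 ^ s2) / total if total else 0.0


# Two invented stemmata on the same four manuscripts.
stemma_1 = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]
stemma_2 = [("A", "x"), ("C", "x"), ("x", "y"), ("B", "y"), ("D", "y")]
print(partition_distance(stemma_1, stemma_2, {"A", "B", "C", "D"}))
```

Here the four edges leading to single manuscripts are shared, while the internal edge of each stemma (separating A, B from C, D in one and A, C from B, D in the other) occurs in only one of them.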
For the distance measure we discuss here, our preferred technique is to add edges to the tree so as to minimize the sum of squared differences between observed distances and shortest distances on the resulting network [Makarenkov, V. & Legendre, P. in Data analysis, classification and related methods (eds. Kiers, H. A. L., Rasson, J. P., Groenen, P. J. F. & Schader, M.) 35-40 (Springer, New York, 2000)].

The computer-based methods we describe here will not automatically tell us why the manuscripts of the Canterbury Tales show so many different orders, or decide which, if any, order best represents Chaucer's original intention. However, the combination of traditional scholarship and modern technology may allow great progress towards answering these long-standing questions. In a companion paper [Bordalejo, B. & Spencer, M. The Order of the Canterbury Tales: Praxis of Computer Analysis] we discuss the implications of our analyses for Chaucer studies.