
 

27   Thesaurus Musicarum Latinarum (TML)

 

HATII interviewed Thomas Mathiesen, the David H. Jacobs Distinguished Professor of Music and Director of the Center for the History of Music Theory and Literature at Indiana University, on January 10, 2001. The Center initiated the Thesaurus Musicarum Latinarum project, an evolving database that aims to collect all Latin music theory dating from the Middle Ages to the Renaissance. The intended audience for the project’s resources is principally the university community. Users of the TML can locate and retrieve source texts for educational or research purposes while the original artifacts are preserved.

 

27.1    Organizational Digitization Program and Policy

The Thesaurus Musicarum Latinarum (TML) is a full-text database of Latin music theory, extending from Censorinus’s De die natali through treatises of the seventeenth century. The database contains more than 5,000,000 words in 741 separate texts accompanied by more than 4,000 graphics. The TML originated in an idea in the spring of 1989 amongst twelve scholars active in textual criticism, codicology, early music, cataloging manuscripts and the history of music theory. These scholars had used the Thesaurus Linguae Graecae (TLG) and it was the TLG that provided the model for the development of the TML. A discussion amongst a larger group of scholars took place at the annual meeting of the American Musicological Society in Austin, Texas in October 1989. This meeting made a general commitment to the project and preliminary editorial and technical committees were established. Indiana University (IU) provided substantial funding to establish the principal TML center. A planning meeting held at Princeton University in 1990 established the basic parameters of the project, and the first texts became available later that year. The texts, which were not originally intended to be viewed online, were first distributed by request on LISTSERV, then by File Transfer Protocol (FTP). With the advent of Gopher and eventually the World Wide Web, the texts and graphics were restructured for online viewing. In 1992, the project received a $230,000 NEH grant in support of work on the creation of texts at the IU TML center and the TML sub-centers that had been established at the University of Colorado in Boulder, Ohio State University, Louisiana State University, the University of Nebraska in Lincoln, and Princeton University (an additional sub-center was eventually established at the Moscow Conservatory). 
A second NEH grant from 1994 to 1996 enabled the project to reach its goal of including 97-98% of all printed texts, to create the accompanying graphic files, and to begin work on manuscript sources. When the NEH grant ended in 1996 the database was receiving more than 1,000 connections per month. In 1998 the Center for the History of Music Theory and Literature (CHMTL) was established as an umbrella for various music projects developed at Indiana University, including the TML, and is currently in its third year of start-up funding.

There have been relatively few major obstacles to planning the development of a strategic policy for the TML. It was advantageous that the principles behind the project were established before the first text was created. The first obstacle to be cleared was obtaining copyright permission. Fundraising has also proved time consuming but has been successful. One main obstacle to building the database was the constant danger of being shut down as system administrators decided to terminate servers, change operating systems, and otherwise alter services on which the TML relied, quite often without any advance consultation. Therefore, a huge amount of time has been invested in moving data from a VM mainframe to a desktop machine emulating VM, an NT server (for the LISTSERV component of the project), a UNIX server, and finally a Linux server, and in resolving the associated political issues.

The principal criterion that guided the TML in its selection of material for digitization was that the most widely used printed texts would be digitized first. Therefore, provided intellectual property rights were secured, material that had the highest research significance had the highest priority. A feature of this selection criterion was that every edition of a text would be included, as distinguished from the TLG, which includes only a “best text” for each work. A second selection criterion has been to digitize related manuscript sources to facilitate the production of new editions. These selection criteria have not changed over time, although the priorities have evolved. There is increased interest in extending the chronological limits backward to the second century and forward to the late seventeenth or early eighteenth century, as well as in including material that is not exclusively devoted to music.

The TML has co-operated with other academic institutions on an international basis, although the project was originally envisaged as a US undertaking. The TML was fortunate in this co-operation in that musicology is a small field whose scholars are well known to each other and able to collaborate effectively. The project director was also well established and financing was in place. The TML would therefore advise other institutions that an administrative board that collaborates effectively is an essential factor for success.

The TML has no anticipated end date, as the number of potential manuscripts to digitize is huge and raises very different encoding issues as compared to printed texts. Although the pace of work has slowed, more than 100 new texts have been added in the last two years.

The primary purpose for the TML in creating the digital deliverables is research. The TML has produced a general description of the project that includes the TML’s format, rationale, scope, significance and level of faithfulness to the original. The level of faithfulness to the original is embodied in the TML’s “Principles of Orthography”, published in the catalog of files called the TML Canon of Data Files (Lincoln: University of Nebraska Press, 1999).

The type of source material digitized includes:

The printed material is all on paper and the manuscript material on paper or parchment. The digitized deliverables are intended to represent the entire body of material. It is not the intention of the project to re-purpose the digital deliverables.

Low ASCII is used for representing content. The 128-character low ASCII set (codes 0-127) was chosen to achieve the maximum level of interoperability; because the TML was intended to be as editorially unobtrusive as possible, there was no need for any complex symbols or typography. For describing content, the text files contain simply a bibliographic header, a database reference and the names of the file creators. Page or folio breaks are indicated, and non-Roman characters, in this case musical notes, are encoded: for example, a semibreve is S. The TML’s encoding system, which is published in the TML Canon of Data Files, has proved applicable to all texts.
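The low-ASCII constraint described above lends itself to a mechanical check. The following is a minimal sketch, not the TML’s own tooling: the function name find_non_low_ascii and the sample strings are hypothetical, chosen only to illustrate how a text could be screened for characters outside the 7-bit range before it is added to a database.

```python
def find_non_low_ascii(text: str) -> list[tuple[int, int, str]]:
    """Return (line_number, column, character) for each character outside 7-bit ASCII."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # low ASCII covers code points 0-127 only
                problems.append((line_no, col, ch))
    return problems


# Illustrative sample strings (assumptions, not quoted from a TML data file).
clean = "Musica est scientia bene modulandi"
dirty = clean + "\u2019"  # a typographic apostrophe slipped in at the end

print(find_non_low_ascii(clean))  # []
print(find_non_low_ascii(dirty))  # [(1, 35, '’')]
```

A check of this kind reports the position of each offending character, so a proofreader can decide whether to transliterate it or encode it with one of the project’s published codes.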

In the HTML versions of the texts, there is inevitably more markup, such as paragraph breaks <P> and graphics flags anchored to the graphic files; the graphics themselves are not embedded, so that the text files can be viewed in non-graphic browsers.

The TML did not look at guidelines for digitizing particular document types because many did not exist when the project started. The Unicode system was embryonic and the TEI just beginning. The only model available was that of the TLG.

The intended audiences for the digital deliverables are academics, scholars and graduate school students. Other user groups could use the texts but there is unlikely to be any significant demand from the general public or K-12 because of the specialized nature of the material. However, within the academic community the profile of users has been broader than expected, with medievalists, classicists and antiquarians making use of the material.

There are no limitations on the use of the deliverables because universal rights were cleared. In this process the project insisted that they were not replacing the original but were strictly concerned with the critical text.

 

27.2    Project Management and Planning

Internal advice was available on managing the digitization program but external advice was sought on serving the material (specifically in connection with searching capabilities and pre- and post-processing of searches). Indiana University’s School of Music, as a professional school, has not shown much interest in the CHMTL; since 1998, however, it has contributed approximately one-third of the CHMTL’s annual budget. The remaining funding comes from the Office of Research and the University Graduate School and from the endowed chair held by the CHMTL’s director. The formal project management procedures in place consist of an external project committee and an editorial advisory committee.

The TML does not carry out any feasibility and pilot studies but holds regular planning sessions. The TML has made changes to its delivery system following testing.

The graphic files were created entirely in-house, and the creation of the texts was divided amongst the various centers established by the project, with some texts being sent in from outside the project. All the project’s equipment has been purchased; it includes Apple and UMAX flatbed scanners, 21” monitors, CD burners, and DAT tape backup systems running on five CPUs with fast clock speeds.

An elaborate system is in place for creating the graphic files. The image is first captured in TIFF format, then edited and converted to PICT format and then to GIF. The GIF image is stored in UU encoded form on the NT server (for retrieval through LISTSERV and FTP) and in GIF form on the Linux server (for retrieval through the WWW). The preliminary TIFF and PICT files are not saved.
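Storing a binary GIF in UU encoded form turns it into printable ASCII lines that can travel through text-only channels such as LISTSERV. The sketch below illustrates the general uuencode layout using Python’s standard binascii module; it is not the project’s own procedure, and the file name figure.gif, the mode string and the sample bytes are assumptions made for illustration.

```python
import binascii


def uu_encode(data: bytes, name: str = "figure.gif", mode: str = "644") -> str:
    """Wrap bytes in classic uuencode form: header, data lines, terminator, 'end'."""
    lines = [f"begin {mode} {name}"]
    for start in range(0, len(data), 45):  # b2a_uu accepts at most 45 bytes per call
        chunk = data[start:start + 45]
        lines.append(binascii.b2a_uu(chunk).decode("ascii").rstrip("\n"))
    lines.append("`")  # zero-length line marking the end of the data
    lines.append("end")
    return "\n".join(lines) + "\n"


def uu_decode(text: str) -> bytes:
    """Recover the original bytes from text produced by uu_encode."""
    body = text.splitlines()[1:-2]  # drop the header, the terminator and 'end'
    return b"".join(binascii.a2b_uu(line) for line in body)


# Hypothetical payload: a GIF signature followed by filler bytes, not a real image.
payload = b"GIF89a" + bytes(range(100))
encoded = uu_encode(payload)
print(encoded.splitlines()[0])        # begin 644 figure.gif
print(uu_decode(encoded) == payload)  # True
```

The round trip shows why the same image can be kept in two forms: the encoded copy is plain text suitable for mail and FTP in ASCII mode, while the decoded bytes are identical to the GIF served over the web.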

 

27.3    Human Resources and Training

The CHMTL employs a full-time director, who works a minimum of 20 hours per week on the project, although this level fluctuates. There is now also a full-time Associate Director of the CHMTL, whereas prior to 1998 there had been a half-time project assistant for the TML alone. During the period of the NEH grant, the project employed 2-3 student digitizers; the CHMTL now employs 3-4 students at 10 hours per week, although these hours are distributed amongst the TML and the CHMTL’s other projects. The project staff have a mixture of music graduate and medieval studies backgrounds. Ability in Latin is an obvious requirement, as is meticulous attention to detail.

Technical advice on digitization was available in-house. Training needs were informally identified in the areas of:

New project assistants, associate directors and student digitizers have all received training from the project staff and through learning on the job. The training has met the project’s requirements and the TML has been able to attract excellent student employees because it has paid quite generously.

 

27.4    Project Life-Cycle Processes and Procedures

The original holder retains copyright in the original materials. This copyright was cleared with the owner’s agreement and in only one instance was a small fee paid. The TML claims copyright on the text compilations, table of codes etc., and defends the right of the TML to determine the method of data delivery. Users are allowed to make printouts of the digital deliverables on paper or film and download to a PC, LAN or WAN. No electronic management systems are in place to control copying.

For textual material users can view and download ASCII text files.

For digital image material users can view and download GIF files.

The TML does not have a conservation procedure for the original materials as this does not fall within the scope of the project.

Information for creating the TML Canon of Data Files (the catalog of the digital versions) is not derived from a catalog for the originals but the original material itself. As such, the record for the digital deliverable and that for the original object are independent of each other. Metadata recorded includes: the author and the title of the treatise (as given in the source); the incipit; the source and type (i.e., print or manuscript) of the data file; the filename, filetype, filelist, and size of the file in kilobytes; annotations; and the names of the persons responsible for entering, checking and approving the data. The digitizer and/or project director creates this metadata from the originals and with the exception of the incipit, the data is created to be Z39.50 compliant. The TML database is referenced in the University’s library catalog but not at the text level. The TML Canon of Data Files is available on the internet and in published form (as noted above).
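The metadata fields listed above can be pictured as a single catalog record. The following sketch is an assumption about how such a record might be modeled, not the structure of the TML Canon itself: the class name CanonRecord, the field names and all sample values are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class CanonRecord:
    """Hypothetical record mirroring the metadata fields named in the text."""
    author: str        # as given in the source
    title: str         # as given in the source
    incipit: str
    source: str
    source_type: str   # "print" or "manuscript"
    filename: str
    filetype: str
    filelist: str
    size_kb: int
    annotations: str = ""
    entered_by: str = ""
    checked_by: str = ""
    approved_by: str = ""

    def __post_init__(self) -> None:
        # The text names exactly two source types, so reject anything else.
        if self.source_type not in ("print", "manuscript"):
            raise ValueError(f"unknown source type: {self.source_type!r}")


# Placeholder values only; none of these describe an actual TML file.
record = CanonRecord(
    author="Anonymous", title="Example treatise", incipit="Example incipit",
    source="Example printed edition", source_type="print",
    filename="EXAMPLE", filetype="TEXT", filelist="EXAMPLE LIST", size_kb=42,
    entered_by="A. Digitizer", checked_by="B. Checker", approved_by="C. Director",
)
print(record.filename, record.source_type, record.size_kb)  # EXAMPLE print 42
```

Keeping the names of the people who entered, checked and approved the data in the record itself ties each file to the three-stage checking process the project describes.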

The project has not rejected any material for digitization. Photocopy intermediaries have been used for digitization, although the material did not exist only in this form.

 

27.5    Format, Resolution and Compression of Digitized Materials

The formats for retroconverted text-based digital deliverables are:

The texts contained both Latin and non-Latin scripts. OCR was used as a conversion method for printed textual materials. The OCR software used was TypeReader and latterly also OmniPage. With clean text, the project achieved 98-99% accuracy, although the accuracy for texts derived from early printed books was well below this level. From this experience, the TML would recommend leaving the OCR decisions to the digitizer; nevertheless, even with poor OCR there are the advantages of avoiding such typical scribal mistakes as skipping lines or jumping from one word to a similar one elsewhere on a page. Keying-in conversion has been employed for manuscript material. Text file sizes range from 6 to 300KB.

For image material, the TIFF file format is used for capture and GIF for delivery. The capture resolution is 300dpi and the delivery resolution 72dpi; sometimes the images are enlarged to 150% for legibility. Capture is at 8 bits and delivery at 1 bit for line art (bi-tonal). The TML recommends being extremely careful with alignment and with threshold settings to achieve maximum clarity in the final image. Images are delivered in the compressed GIF format to improve access and enhance usability. The average image file size is less than 64KB. The project does not retain the original scans in uncompressed form. The project carries out post-processing on images using Ofoto (bundled with Apple scanners) and Photoshop for clean-up, de-skewing and labeling. Ofoto still carries out some of the operations better than Photoshop.
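The resolution and bit-depth figures above imply some simple arithmetic, sketched here in plain Python. This is an illustration under stated assumptions, not the project’s workflow: the function names, the 6-inch example width and the threshold value of 128 are hypothetical (the source does not state which threshold settings were used).

```python
def delivery_pixels(capture_px: int, capture_dpi: int = 300,
                    delivery_dpi: int = 72, enlargement: float = 1.0) -> int:
    """Pixel dimension after resampling from capture to delivery resolution."""
    return round(capture_px * delivery_dpi / capture_dpi * enlargement)


def to_bitonal(gray_rows: list[list[int]], threshold: int = 128) -> list[list[int]]:
    """Reduce 8-bit gray values (0-255) to 1 bit: 1 for light pixels, 0 for dark."""
    return [[1 if value >= threshold else 0 for value in row] for row in gray_rows]


# A hypothetical 6-inch-wide figure captured at 300dpi is 1800 pixels wide.
capture_width = 6 * 300
print(delivery_pixels(capture_width))                   # 432
print(delivery_pixels(capture_width, enlargement=1.5))  # 648

# A tiny made-up patch of gray values, thresholded to line art.
patch = [[0, 100, 200], [127, 128, 255]]
print(to_bitonal(patch))  # [[0, 0, 1], [0, 1, 1]]
```

The second function shows why the threshold matters: values of 127 and 128 fall on opposite sides of the cut, so a poorly chosen setting can erase thin strokes or fill in small counters in musical notation.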

For others starting work in digital imaging, the TML recommends careful consideration of user needs and facilities, especially in terms of available equipment and bandwidth.

The quality control procedures in place for the digital deliverables are at least three sets of total checks on each text. The individual entering the data is expected to check and correct it before printing out a text to be passed along to the person responsible for proofreading and checking. This second person identifies and marks errors on the printed texts, after which each mark is counter-marked as the correction is entered into the electronic text. Then, both the marked printed text and the electronic text are passed to a third person for review prior to approval and addition to the database. Where there is a high error rate at any stage in the process, the text is printed once again and subjected to a second proofreading, as outlined just above. The final check and approval by the project director facilitates consistency in the quality control process. Lessons from this experience are that it has been difficult to get people to proofread character-by-character (rather than word-by-word) and to refrain from global search-and-replace editing. In general, the TML has discovered that no more than four double-spaced pages of 12-point text can be proofread per hour with an acceptable rate of accuracy.
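The difficulty of proofreading character by character can be illustrated with a mechanical comparison of two transcriptions. The sketch below uses Python’s standard difflib module; it is not part of the TML’s procedure, which relies on human proofreaders working from printouts, and the function name and sample strings are hypothetical.

```python
import difflib


def char_differences(reference: str, keyed: str) -> list[tuple[str, str, str]]:
    """List (operation, reference_fragment, keyed_fragment) for each character-level change."""
    matcher = difflib.SequenceMatcher(None, reference, keyed, autojunk=False)
    return [
        (op, reference[i1:i2], keyed[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Made-up example: the keyed text has 'u' where the reference has 'n',
# a slip that is easy to miss when reading word by word.
reference = "musica mundana"
keyed = "musica mundaua"
print(char_differences(reference, keyed))  # [('replace', 'n', 'u')]
```

A comparison of this kind only helps when two independent transcriptions exist; it does not replace checking the electronic text against the original source.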

Users have open access to the catalog and the materials. When the TML resides on an individual’s personal computer, it has mainly been searched using the GOfer search engine, which the project is licensed to distribute at cost. GOfer enables brute-force searches of as many as eight separate words or strings (controlled by various Boolean operators) throughout any defined set of texts, including such advanced features as soundex and proximity definitions. However, because the files are in ASCII format, they can be used with virtually any search engine. For internal use, the project has developed Eureka!, a Windows application that has the same functionality as GOfer but currently without the soundex facility. Eureka! caches its searches; it is therefore faster than GOfer, and it can run multiple sub-searches and retrieve previous searches. For searches on the TML website, CNIDR’s Isearch has been implemented with a three-field form controlled by standard Boolean operators; because Isearch operates on indexed text, proximity searches are not currently available on the website. Users can, however, select a century context (or the entire database) and set the number of results to be displayed per screen. Versions of the TML distributed on CD emulate the website (including CANTUS, a database of Gregorian chant manuscripts), and the CD can be searched using GOfer or Eureka! The TML averages more than 1,600 portal entries per month and processes on average more than 600 searches per month (search statistics are based on the number of times the script is invoked – i.e., each time a search is actually processed – not on the number of times the search form is retrieved).
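Because the texts are plain ASCII, Boolean and proximity searching of the kind described above can be approximated with very little code. The following is a minimal sketch of the general technique, not a reconstruction of GOfer, Eureka! or Isearch; the function names, file names and sample texts are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into lower-case words made of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def matches_all(text: str, terms: list[str]) -> bool:
    """Boolean AND: every term occurs somewhere in the text."""
    words = set(tokenize(text))
    return all(term.lower() in words for term in terms)


def within_proximity(text: str, first: str, second: str, max_gap: int) -> bool:
    """True if the two terms occur within max_gap words of each other."""
    words = tokenize(text)
    pos_first = [i for i, w in enumerate(words) if w == first.lower()]
    pos_second = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_gap for i in pos_first for j in pos_second)


def search(texts: dict[str, str], terms: list[str]) -> list[str]:
    """Scan every text in turn and return the names of those containing all terms."""
    return sorted(name for name, body in texts.items() if matches_all(body, terms))


# Illustrative sample texts with made-up file names.
texts = {
    "SAMPLE1.TXT": "De tonis et semitoniis. Tonus est intervallum.",
    "SAMPLE2.TXT": "De semibrevi et minima in mensura.",
}
print(search(texts, ["tonus", "intervallum"]))                             # ['SAMPLE1.TXT']
print(within_proximity(texts["SAMPLE1.TXT"], "tonis", "semitoniis", 2))    # True
print(within_proximity(texts["SAMPLE1.TXT"], "de", "intervallum", 3))      # False
```

Scanning every file on each query is what makes proximity tests straightforward here, since word positions are available at search time; a search that consults only a prebuilt index needs those positions stored in the index to offer the same feature.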

Access to the entire database is free of charge. Users pay only for optional programs, such as GOfer ($20), or special media, such as the TML/CANTUS CD-ROM ($65). In theory, these are intended for single users, but this is not monitored or enforced. The charges were calculated to recoup the direct costs of the medium and distribution. Potential users of the database and its optional programs and alternative media are informed about their availability through a project brochure, website announcements, articles in print media, conferences, meetings and an email distribution list (TML-L). Live demonstrations of the TML have proved effective, as was the original trifold brochure; email distributions have been least effective. Use of the TML increased exponentially when it was made available on the web.

 

27.6    Evaluation, Funding and Long-term Sustainability

The TML has not carried out any evaluations of the project (with the exception of the semi-annual and final reports required by the NEH) and has relied on informal feedback, of which it receives a considerable amount. If the project received complaints, it would carry out a formal review.

The TML itself has received $330,000 in NEH funding, and as part of the CHMTL, it has also benefited from the CHMTL’s $90,000 start-up funds; networking and server costs have been absorbed by Indiana University. It is estimated that total project funding has been in the region of $500,000. The TML considers that it has received sufficient funding but the source of continuing funding remains unclear. To maintain the CHMTL (which includes the TML) in perpetuity would require an endowment of $3,500,000 (returning $175,000 per annum), which would provide sufficient operating funds and funding to bring fellows to the CHMTL to make use of its materials, engage in conferences, and offer instruction to graduate students. The TML’s view is that the use of the standards it developed more than ten years ago has enabled it to grow and migrate data with very little need for revision or correction of its data.
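The funding figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the amounts stated in the text; the variable and function names are illustrative, and the source does not itemize the difference between the two named sums and the estimated total.

```python
def implied_return_rate(endowment: float, annual_return: float) -> float:
    """Annual return as a fraction of the endowment."""
    return annual_return / endowment


# Figures as stated in the text.
endowment = 3_500_000
annual_return = 175_000
neh_funding = 330_000
startup_funding = 90_000
estimated_total = 500_000

rate = implied_return_rate(endowment, annual_return)
itemized = neh_funding + startup_funding
remainder = estimated_total - itemized

print(f"{rate:.1%}")  # 5.0%
print(itemized)       # 420000
print(remainder)      # 80000
```

The endowment figure therefore assumes a payout of five percent per annum, and the two named funding sources account for $420,000 of the roughly $500,000 estimated total.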

New materials are digitized and added at the rate of 50 files per annum and this may increase. The user interface has changed very little over the years, and this fact is one of the features that users like. The TML does not have a formal strategy for preservation to ensure long-term access. At present the TML uses tape backup and stores CDs on and off site in controlled conditions for archiving purposes. In the early days of the project, backups were sent to Colorado; now, all board members have a copy of the data. The project is proud that in over ten years no data has ever been lost. In addition, all paper documentation is kept, including copies of the original sources and printouts of the electronic texts, on which are recorded all the alterations and checks. A longer-term preservation strategy is likely to rely on the (forced) migration of data. It is intended to keep the digital deliverables available indefinitely.

For long-term sustainability, the project is dependent on self-generating funds, but the TML has not secured the resource for this. Should resources prove insufficient, the exit strategy would be to break up the CHMTL’s projects and move them elsewhere. Another possibility would be to provide the TML on a private server. The TML suggests that projects need to recognize that even institutional support is tenuous, that digital data is easily destroyed, and that some projects will not survive.

 

27.7    Conclusion

The TML is an excellent example of an elegantly simple project. It demonstrates the value of establishing clearly defined orthographical principles prior to digitization and of formatting the deliverables to a recognized standard. In this case the standard may be the simple low ASCII character set, but it has proved flexible enough both to encode a range of material and to migrate across different hardware and software systems for over ten years. Furthermore, the TML’s encoded texts can be accessed by very sophisticated search and retrieval software, which shows that simple encoding is no barrier to greatly enhanced functionality. The TML did benefit from not having to digitize a wide variety of material or serve a diverse group of users. Nevertheless, one can extrapolate from its experience to those adopting emergent text standards today, such as TEI and/or XML: one does not necessarily have to implement the full complexity of these standards to provide long-term access to digital material.




abp~04/02