Gregory Crane, Director of the Perseus Project at the History Department of Tufts University, was interviewed by HATII on September 21, 2000. The Perseus Project is a developing digital library that aims to make available a wide range of resources in the field of the Classics. The project reflects Crane’s interest in the relationship between the humanities and developments in digital technology. By digitizing classical texts, illustrations and artifacts, the project intends to increase the potential of the material for education and research, while also enabling broader public access.
Digitization began thirteen years ago with a long-term projection in mind. At first the project was unable to distribute the material, which was therefore collected on videodisc and 35mm film; now digitization work is done from slides. SGML tagging began ten years ago and later came into its own as a very robust structural format.
The project did not conduct a collection survey but knew the corpus of classics texts; the sites were obvious, and the key consideration for museum objects was their accessibility or, failing that, the locating of substitutes. The institution as a whole was made aware of this process and curators were consulted. The lessons learned from this were to get as many staff involved as possible and not to be too concerned about the details and documentation. If the project were to do this again it would not worry about people misunderstanding the nature of the project.
The process of identifying material was used to establish priorities for digitizing holdings. A critical mass of heterogeneous material was built with most early interest focusing on images as there were already Greek texts digitized elsewhere. These priorities were established by an editor in chief in conjunction with a subject specific editor (for example, in archaeology). These priorities were not formalized in a strategic policy statement but were internalized amongst the people on the project.
The objective this policy sought to achieve was the creation of a digital library to change the questions people asked and to reach a range of users and uses. The project feels it has been successful in these aims even though people projected expectations onto the project, which they would not have had for print material. Obstacles to achieving these objectives were the lack of guidelines, technology limitations and the lack of an inexpensive and accurate GPS system.
The greatest obstacle to planning and building the digital deliverables has been designing pieces that interact so that the whole is greater than the sum of the parts.
In the prioritization of materials for digitization, intellectual property rights have acted as a filter, but the main priorities have been the material’s teaching and learning potential, its research significance, its potential to reach disadvantaged groups, and social inclusion. Secondary priorities have been research into digitization strategies (considered synonymous with the project) and infrastructure cost reductions (as a means to an end). The selection and prioritization criteria have not changed over time.
The project has co-operated with archives, libraries, museums, academic institutions, corporations, foundations, charities, and government agencies. It has done so at institutional, local, regional, national and international levels. One lesson from this is to design projects with collaboration in mind and determine who will work with whom (e.g. LC rather than the Scottish Cultural Resources Access Network model).
The current status of the project is ongoing; it started in 1987 and has no anticipated end date.
The foremost purposes in creating the digital deliverables are teaching and learning, wider access and public access. Secondary purposes are research and experiment. The project has made an explicit statement of intent (through flyers, etc.) about its rationale.
The type of source material digitized includes:
For images, the nature and format of the materials include 35mm slides and for texts, published books. The digital deliverables represent neither a sample nor an entire body of material; the digitized books represent two-thirds of surviving texts and the images are a representative sample. The project has re-purposed the digital deliverables in two versions of a CD-ROM, which has worked well. One obvious change the project would make if it did this again would be to make it possible to link to every object.
The following standards, guidelines or tools are used for representing content:
The project has not adopted any standards, guidelines or tools for describing content or controlling data values.
The project looked at other existing guidelines for digitizing particular document types, but not much was available at the start. It looked at beta code transcription for Greek and observed cataloging practice. The project rejected the beta code of the Thesaurus Linguae Graecae for page formatting and did not use any standard, such as HTML, for control and display of page layout.
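To illustrate what beta code transcription involves, the following is a minimal Python sketch that converts a small subset of Beta Code (basic letters, capitals marked with an asterisk, and the common diacritic marks) to Unicode Greek. It is not the project's own tooling; the function name `beta_to_unicode` and the tables are assumptions made for this example, and the full Thesaurus Linguae Graecae scheme covers far more than is handled here.

```python
import unicodedata

# Basic Beta Code letters mapped to lowercase Greek (a subset of the full scheme).
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta Code diacritic marks mapped to Unicode combining characters.
MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute accent
    "\\": "\u0300",  # grave accent
    "=": "\u0342",   # circumflex (perispomeni)
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript
}


def beta_to_unicode(beta: str) -> str:
    """Convert a simple Beta Code string to precomposed Unicode Greek."""
    text = beta.lower()
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "*":
            # Capital letter: diacritics may sit between '*' and the letter.
            i += 1
            marks = []
            while i < len(text) and text[i] in MARKS:
                marks.append(MARKS[text[i]])
                i += 1
            if i < len(text) and text[i] in LETTERS:
                out.append(LETTERS[text[i]].upper())
                i += 1
            out.extend(marks)
        elif ch == "s":
            # Use final sigma when no letter follows.
            nxt = text[i + 1] if i + 1 < len(text) else ""
            out.append("σ" if nxt in LETTERS else "ς")
            i += 1
        elif ch in LETTERS:
            out.append(LETTERS[ch])
            i += 1
        elif ch in MARKS:
            out.append(MARKS[ch])
            i += 1
        else:
            out.append(ch)  # spaces, punctuation and anything unrecognized
            i += 1
    # NFC folds letter + combining marks into precomposed characters.
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    print(beta_to_unicode("mh=nin a)/eide qea/"))
```

Building the output from combining marks and then normalizing keeps the lookup tables small, at the cost of relying on Unicode normalization to produce the precomposed forms.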
The following standards, guidelines or tools are used for representing structure:
In relation to standards in general and suggestions for navigating between the ideal and the realistic, the recommendation is that XML, SGML and TEI are the least problematic. Also, only marginal benefit is derived from developing your own DTD; it is advisable to adhere to standards, get close to generic best practices and avoid excessive formatting.
The primary intended audiences for the digital deliverables are four-year college, graduate school, lifelong learning, distance learning and subject specialists such as classics, ancient history, or history of science. Secondary audiences include K-12 and community colleges.
Gary Marchionini has undertaken an evaluation of the target audience. See: http://ils.unc.edu/~march/perseus/lib-trends-final.pdf
The deliverables can be used by those outwith the primary intended target audience. The project has acknowledged the needs of those with disabilities via the W3C’s “Guidelines for Web Site Accessibility”. The project is anxious to minimize any limitations to use, but not all museum images are on the web due to the rights restrictions of museums.
External project management advice has come from a Technical and Academic Advisory Board (although it has not been active recently). As a funded project, Perseus can operate separately from the structure of Tufts. The project has helped inform how Tufts thinks about IT, e.g. the development of its digital library, which will be the eventual home of the data. Formal project management procedures consist of a Steering Group of scholars from several institutions. The biggest project management challenge has been determining what people are good at and how they can contribute. Quality assurance procedures include spell checks, metadata captioning and review of front ends, which have all been successful.
A pilot study was carried out as part of a $500,000 contract to experiment with the creation of a digital library. This study covered scheduling, training needs, technical feasibility, user needs, workflow analysis, workflow piloting and technology forecasting. As a result of this study a scheduled collaboration between Harvard and Boston was shelved because of organizational over-centralization. No benchmarking or time and motion studies have been undertaken. The Editor in Chief delegated the work with the aim of optimizing the skills within the project team (for whom there are job descriptions).
Text digitization was outsourced for cost reasons, but is now moving in-house with the availability of high end OCR. The equipment for the project has been bought in. The adoption of the OCR digitization process was aided by looking at what the University of Michigan had chosen. Fujitsu flatbed scanners and Canon film scanners are used by the project. Guidelines have been established for calibration, resolution, etc. and vary according to the media. Color chart benchmarks are also used.
The project employs one director (50%), one metadata specialist/curator (100%), one digitizer (100%), one photographer (100% for 12 years), one or two technical support staff (100%), three technical development staff (100%) and one 0.5 evaluation specialist (100%). Virtually all of the project staff have a classics background. None of the staff was redeployed from other areas. Advice on the technical aspects of digitization is available in-house.
Training needs were identified in project management, application of technical standards, preparation and handling of materials for digitization, technical operation of digitization equipment, post-digitization processes, metadata creation and digital preservation. All members of the project team have been engaged in training and this has been organized in-house (using own consultants), through independent study and learning on the job. The training has met the needs of the project.
The project is aware of the copyright position of the digital deliverables and owns some of the copyright on the original materials. The copyright or rights status of the final digital deliverable is declared. All methods of organizing digitization of material in copyright have been used (legal provision for libraries, owners’ agreement, payment of a fee, under license and without formalities).
Users are allowed to make “fair use” copies of the digital deliverables. They can view text (converted from XML to HTML) and download thumbnails and lower quality images, but can only view, not download, the highest quality images. No electronic rights management systems, such as watermarking, are in use.
The project does not have a conservation or preservation procedure for the original materials. Investigations into the condition of the original materials are undertaken by curators, and in some cases originals may, for example, be cleaned by them as part of the project. Some material is modified, degraded or compromised to carry out digitization. Risk assessment has relied on the work and advice of curators. Some special equipment is used to minimize stress on objects. Some materials are prepared by curatorial or preservation staff prior to digitization and are sometimes monitored by them during digitization, especially in museums. No restrictions are placed on the originals post digitization.
Cataloging or reference systems in place prior to digitization range from published records to no catalog. Appropriate information is used from these sources during the digitization process. Both project staff and curators locate core reference material. Some early modern books have been rejected because it was not efficient to OCR or transcribe the handwriting. Reproductions or intermediaries (slides, 35mm, 4x5 transparencies, photographic prints, microfilm) have been used where necessary, but the material did not exist solely in reproduced form.
The project does not catalog the original material.
The digital deliverables are cataloged in an in-house relational database (FileMaker to Postgres to RDF). Dublin Core and RDF are the guidelines used in cataloging and the project is considering the use of MARC and USMARC.
Tools for controlling data values include thesauri (e.g. of geographic names), museum catalog numbers, Thesaurus Linguae Graecae numbers, subject languages and the Getty Art and Architecture Thesaurus. The metadata details recorded include the original object, the digital object, digitization process, technical details and staffing details. The records are created by the digitizer and overseen by one person. This metadata record for the digital deliverables is then held in a separate catalog in electronic form on an intranet server and is available on the internet.
The relation of the records for the digital deliverables to those for the original digitized materials is a mixture of independent and identical. A text string links the catalog and the object.
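As a rough illustration of a Dublin Core style catalog record joined to its digital object by a text string, here is a minimal Python sketch. The field values, the identifier `vase-0001`, the file path and the helper names are all hypothetical; the project's actual schema is not described in the interview. Only the Dublin Core element namespace is a real, published value.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_record(fields: dict) -> ET.Element:
    """Build a simple record whose children are Dublin Core elements."""
    record = ET.Element("record")
    for name, value in fields.items():
        element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        element.text = value
    return record


def find_object(record: ET.Element, objects: dict):
    """Resolve the digital object using the record's identifier string."""
    identifier = record.findtext(f"{{{DC_NS}}}identifier")
    return objects.get(identifier)


# Hypothetical object store: identifier string -> file holding the digital object.
objects = {"vase-0001": "images/vase-0001.tif"}

# Hypothetical catalog record for one digitized image.
record = build_record({
    "title": "Example vase, side A",
    "type": "Image",
    "format": "image/tiff",
    "identifier": "vase-0001",
})

if __name__ == "__main__":
    print(ET.tostring(record, encoding="unicode"))
    print(find_object(record, objects))
```

Because the only link between the catalog and the object is the identifier string, the catalog and the object store can be kept and migrated separately, provided the identifiers remain stable.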
The format of retroconverted text is SGML and XML. This text contains non-Latin scripts, and OCR (Prime Recognition) is used to convert the digital images. The accuracy of this process is generally high but variable, at best 99.95%. Pages receive one specific treatment prior to OCR: the page images are cleaned. The aim of using OCR was to allow automatic indexing, enhanced searching and computer-based analysis of the materials. The project’s recommendations on OCR are to use high-end software and, if money is available, to outsource it. No keying in is used in-house.
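To put the best-case accuracy figure in context, the short Python sketch below estimates how many character errors a given accuracy leaves behind. The page length of 2,000 characters and the book length of 300 pages are assumptions chosen for illustration, not figures from the project.

```python
def expected_errors(char_count: int, accuracy: float) -> float:
    """Expected number of wrong characters at a given character accuracy."""
    return char_count * (1.0 - accuracy)


# Assumed values for illustration only.
CHARS_PER_PAGE = 2_000
PAGES_PER_BOOK = 300
BEST_ACCURACY = 0.9995  # the best case reported in the interview

if __name__ == "__main__":
    per_page = expected_errors(CHARS_PER_PAGE, BEST_ACCURACY)
    per_book = expected_errors(CHARS_PER_PAGE * PAGES_PER_BOOK, BEST_ACCURACY)
    print(f"about {per_page:.1f} errors per page, {per_book:.0f} per book")
```

Under these assumptions even the best case leaves roughly one error per page, which is consistent with the project also running spot checks on converted texts.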
The TIFF file format is used for capturing and preserving. Some capture is done in JPEG, but this is mainly the delivery format. The typical file size is 20-40 MB per image, at a bit depth of between 8 and 24. Lossless compression is used at the capture and delivery stages when low-resolution cameras are in use. The project undertakes many kinds of post-processing using Photoshop. The recommendation from the project’s experience of digitizing images is to produce several versions at high resolutions. Moving image digitization used videodisc format.
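The relationship between pixel dimensions, bit depth and file size can be sketched in a few lines of Python. The pixel dimensions below are assumptions picked to land in the reported 20-40 MB range; the interview does not state the capture resolutions actually used, and real TIFF files also carry a small amount of header and tag overhead that is ignored here.

```python
def uncompressed_size_mb(width_px: int, height_px: int, bit_depth: int) -> float:
    """Size of raw pixel data in megabytes (1 MB = 1,000,000 bytes)."""
    total_bits = width_px * height_px * bit_depth
    return total_bits / 8 / 1_000_000


if __name__ == "__main__":
    # Assumed scan of 4000 x 3000 pixels in 24-bit color.
    print(uncompressed_size_mb(4000, 3000, 24))
    # The same frame as 8-bit grayscale is one third of the size.
    print(uncompressed_size_mb(4000, 3000, 8))
```

The calculation shows why bit depth matters as much as resolution when budgeting storage: moving from 8-bit to 24-bit capture triples the size of every master image.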
The quality control procedures are spot tests on texts, and spot and total checks on books. Metadata are double-checked. The project’s quality control underpins the whole workflow process and highlights the need to train staff.
Users do not have to pay to use the digital deliverables; rather, they have access to them through an open access catalog (although some material is not available). Glimpse is used for full text searching/browsing, and SQL queries against Postgres are used on the metadata. Users are also able to zoom in on images.
Potential users of the digital deliverables are informed about their availability through website announcements, articles in print media, conferences and meetings and email shots. The project does not know which medium has been the most effective but it is probably building links to the site. Usage levels are 300,000 page impressions in 24 hours, with 30,000 users, as monitored by automatic data capture.
The project has carried out front-end, formative and summative evaluations that included questionnaires on paper and online, focus group discussions and observing users’ interactions. See: http://ils.unc.edu/~march/perseus/lib-trends-final.pdf
An estimate of the project funding cost is $10 million; $2.5 million was for the Classics Computer Project (CCP) and $6 million came from Federal Agencies, mainly the National Endowment for the Humanities (NEH). With regard to an ideal cost, the project would estimate $6-7 million if the project were now to be repeated. The project would not have done anything differently. The project believes that the use of standards has saved money and that investing in these in the short term will have long-term benefits. The funding agencies monitor the project through Annual Reports, but the most significant sanction is continued funding.
New material (and its associated metadata) is updated on an ongoing basis. The user interface is changed intermittently and file formats, especially the back-end data, are changed incrementally.
The project has a preservation strategy (the university is creating a digital depository), together with a data migration strategy that covers the move from SGML to XML and between different storage media.
The digital deliverables will be available indefinitely and the project’s longer-term sustainability will rely on self-generating funds in future. The project has resources for the next four years. The exit strategy for the project is through the university depository. Loss of the digital deliverable would be a matter of concern.