
 

13   Cornell University

 

HATII interviewed Peter Hirtle and Anne Kenney, from the Institute for Digital Collections and the Department of Preservation and Conservation at Cornell University, on October 18, 2000. The Institute for Digital Collections explores the application of emerging technologies, with a view to enabling greater access to cultural and scientific resources. As a result of its digitization projects, these resources are now available on the World Wide Web. In forming international partnerships with similar institutions, the Institute aims to amass a catalogue of digital collections from around the world. The Department of Preservation and Conservation was established in 1985, with the intention of providing resources for the research and teaching faculties within the academic community.

 

13.1    Organizational Digitization Program and Policy

There are three major areas within Cornell University Library currently engaged in digitization projects: the Cornell Institute for Digital Collections (CIDC), the Department of Preservation and Conservation, and the Albert R. Mann Library. The first two groups work closely together and are the focus of this report. CIDC is a component part of the Digital Library and Information Technologies (DLIT) division in the Cornell University Library. DLIT has responsibility for planning, implementing, and providing technical support for information technology programs and network linkages throughout the library system. The Department of Preservation and Conservation is also a component of the entire library system.

While both groups are involved in the digitization of both text and images, the primary focus of work in the Department of Preservation and Conservation has been on the digitization of textual materials (including illustrations in printed volumes), whereas CIDC has concentrated more on the digitization of visual objects. CIDC supports the middleware used for the delivery of materials it has digitized, relying on both homegrown database systems and software from Luna Imaging. Delivery of materials digitized by the Department of Preservation and Conservation is handled by DLIT.

13.1.1   Selection

Collection surveys have not played a significant role in the selection of material for digitization, although the Department of Preservation and Conservation did carry out a collection survey in the early 1990s. Instead, its approach to selecting collections for digitization has been to identify areas of known strength in the holdings. For example, one strong area at Cornell is its holdings of 19th century materials, which formed the basis for the Making of America (MOA) project. The Department will also scan on demand for faculty-related projects, such as distributed learning. The selection criteria include conservation concerns, teaching and learning potential, historical and cultural value, research significance, and the potential for increased access. The materials must have rich content to be selected.

Funding is the main factor in CIDC’s prioritization and selection of material for digitization. All the projects within CIDC are funded on “soft” money. As a result, material from subject areas likely to attract funding has a greater chance of being selected. CIDC supports projects initiated by faculty members that will create resources for teaching. Other factors that influence selection are the teaching and learning potential of materials, the possibilities for enhanced access that digitization may create, and research into digitization strategies.

The Mann Library has taken a radically different approach to selection. In both the Core Historical Literature of Agriculture project and the new Core Historical Literature of Home Economics project, the library has utilized panels of experts to identify the most important materials in each subject area. Those selections determine what is digitized.

Both CIDC and the Department of Preservation and Conservation have co-operated with libraries, museums, and archives, at all levels from local to international. While collaboration can be beneficial, both units feel that co-operation is not without problems and that firm rules for the collaboration have to be developed in order to produce a successful project.

13.1.2   Organization and Mission

Both programs are ongoing. CIDC was established in 1997 with lead gifts of $2 million from several benefactors. Other gifts have been received, as well as funding from the Provost and several Deans. CIDC, like the Department of Preservation and Conservation, also uses consulting income to support additional staff. While the initial gifts to create CIDC were generous, they were not enough to create an endowment. Regular funding would ensure the ongoing maintenance of CIDC projects. Consultation work by both sections pays for more staff on a longer-term basis.

The Department of Preservation and Conservation has had digital projects in place since the early 1990s. The Associate Director of the Department also serves as Co-Director of CIDC, ensuring close cooperation between the two units.

The main purpose of both areas’ projects is to create teaching and learning resources. In addition, both units hope to provide access to the materials to individuals and groups outside the library. The Department of Preservation and Conservation is also motivated by conservation concerns. Neither unit has revenue generation as a priority. Both have web pages with explicit statements of intent, mission statements, and further project information.

Between them, they have dealt with all types of material, with the exception of film, video recordings, radio broadcasts and TV broadcasts. The Department of Preservation and Conservation has had some dealings with sound recordings through the Music Library.

Both areas use standards, in particular TEI, JPEG, SGML and XML, MARC, EAD, AAT, and Library of Congress subject headings. Cornell feels strongly that using standards will save money in the long term. Where standards are not used, it is generally because they do not exist or are not appropriate to the project. Both programs started very early, when many standards did not yet exist as such. This has led to Cornell establishing many of the accepted guidelines in the community.

The intended audience is similar for both programs. Both identify the Cornell academic community as their priority audience, with other educational areas (such as K-12 education) meriting lower priority. The other main target audiences are museum, archive, and library users.

 

13.2    Project Management and Planning

CIDC does not carry out pilot studies as such, since it considers all of its projects to be experimental in some way. It applies the lessons learned from earlier projects to each new project, as well as scaling smaller projects to larger ones. The Department of Preservation and Conservation has conducted a number of research projects to resolve technical questions about digitization that must be answered before large-scale projects are implemented. It has also conducted usability studies on the user interface for MOA. The interview with the University of Michigan gives more detail on this study.

Time and motion, planning, and scheduling tools are not used in either area.

Both programs digitize in-house and also outsource digitization. In all cases, the nature of the material determines the digitization process; fragile or very valuable material is digitized in-house. Both programs believe that outsourcing can reduce costs. Specialized vendors, for example, can achieve economies of scale. Pre-set cost figures for digitization can also insulate libraries from any increase in the cost of digitization, since the vendor has to absorb the increase. In-house digitization utilizes flatbed scanners (HP, Epson, Microtek, and Xerox), film scanners (Nikon Cool Scan), microfilm scanners, and high-end digital cameras (Phase One). Both programs use grayscale and color targets when scanning, and CIDC will often use tone balls when scanning three-dimensional objects.

Cornell University Library has established guidelines for digitization and data capture that are to be followed by any project that hopes to deposit digital materials with the proposed central depository.

 

13.3    Human Resources and Training

CIDC employs seven regular staff, although additional staff can be hired to work on discrete projects. Smaller projects might be run by two to three staff, while a larger project may have as many as seven or eight. The staff members tend to be humanists with an interest in technology. Projects often draw on staff from other areas of the library, such as catalogers and curators, to supplement the actual work of digitization.

The Department of Preservation and Conservation has 26 staff employed in areas such as administration and book conservation as well as digitization. These staff members tend to be librarians or conservators.

While some training is given to technical staff, the preference is to employ staff who already have good technical knowledge of digital processes. Staff are encouraged to attend the week-long Digital Imaging Institute course offered by the two units. They are also encouraged to take external courses when appropriate courses can be found. However, learning on the job remains an integral part of the training process.

 

13.4    Project Life-cycle Processes and Procedures

13.4.1   Intellectual Property

Both programs are aware of the copyright status of their materials. CIDC tends not to own the copyright on the materials it digitizes. When there is a question about the copyright status of a work, only on-campus or Cornell-affiliated users are permitted to print or download digital images, and only in direct support of teaching and research.

The Department of Preservation and Conservation generally digitizes only materials that are out of copyright in the U.S., such as those printed before 1923. It asserts a copyright claim on behalf of the Library in the image headers. Neither unit uses watermarking to try to control further use of the digital files, relying instead on a general “terms and conditions of use” statement found on the web sites.

13.4.2   Preparation/Conservation

The Cornell University Library has a preservation policy that covers the handling of original materials from disbinding to scanning and reassembling. Material is not usually rejected for digitization because of its physical condition, but on occasion incomplete works may be rejected. As part of the digitization process, conservation staff will often replace missing pages, disbind, deacidify, wash, and mend the material to be digitized. Conservators provide training in handling to the scanning technicians. When disbinding cannot take place, the fragility of the original paper and binding presents the greatest challenge. Conservators assess the risk to original materials and help to identify the appropriate conversion methodology. Some scanning has been done using a face-up Minolta book scanner. Film intermediates are also used when the risk of damaging the original outweighs the lower image quality derived from scanning an intermediate. In some cases, such as with the conversion of Art History slides, the only source material that exists is a film intermediate, which is then used for digitization.

Special equipment, such as cool lights, filters, and book cradles, is used to minimize risks. The processing staff and the curatorial staff work together to prepare the material before digitization.

13.4.3   Access

In general, once a digital surrogate exists, users are encouraged to use the surrogate. However, access to the original is not restricted.

13.4.4   Metadata

The starting point for the metadata in all digital projects is the catalog data. The nature, amount, and quality of the catalog data varies with the collection. Very often, the existing catalog data is enhanced with additional data created as part of the project. Links are created from the main catalog to the digital object through the use of an 856 link and a PURL server. For books, direct links from the catalog record to the book are created. In other cases, links lead to a finding aid encoded in EAD or to a collection-level record for a particular digitization project. Records for individual items are then stored in local databases developed in a variety of packages, including Filemaker Pro, Access, Informix, and SQL Server. Controlled vocabularies are used.
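The linking arrangement described above can be illustrated with a minimal Python sketch. It is not Cornell's implementation: the PURL host, the identifiers, the resolver table, and the function names are all hypothetical, and the field is rendered in a simple human-readable form rather than as a real MARC record. The only MARC details assumed are that field 856 carries the URI in subfield $u, that a first indicator of 4 denotes HTTP access, and that a second indicator of 1 marks an electronic version of the item described in the record.

```python
# Hypothetical PURL table: persistent identifier -> current location.
# A real PURL server performs this lookup and issues an HTTP redirect.
PURL_TABLE = {
    "http://purl.example.org/moa/vol-0001": "http://server-a.example.edu/moa/vol-0001/",
}


def make_856_field(purl, note=None, version_of_resource=True):
    """Render a MARC 856 field as display text (illustrative format only)."""
    # Second indicator: 1 = version of resource, 0 = the resource itself.
    second_indicator = "1" if version_of_resource else "0"
    subfields = [("u", purl)]
    if note:
        subfields.append(("z", note))  # $z holds a public note
    body = "".join(f"${code}{value}" for code, value in subfields)
    return f"856 4{second_indicator} {body}"


def resolve(purl, table=PURL_TABLE):
    """Return the current location for a PURL, or None if it is unknown."""
    return table.get(purl)


if __name__ == "__main__":
    purl = "http://purl.example.org/moa/vol-0001"
    print(make_856_field(purl, note="Digitized version"))
    print(resolve(purl))
```

The point of the indirection is that the catalog record stores only the persistent identifier. If the digitized volume moves to another server, only the entry in the resolver table changes and the 856 field in the catalog stays as it is.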

Information about the original object, the digital object, technical details, administrative information, and preservation-related information is captured at various points during and after the digitization process. The information is usually stored in the image file header, in local systems, or in both.

 

13.5    Formats and Compression of Digitized Materials

13.5.1   Formats

The Department of Preservation and Conservation has prepared several studies on the appropriate resolution and formats of digitized materials. TIFF is used in almost all cases as the file format for the master archival image. The format found on Photo CDs has also been used at the capture stage, but Cornell is in the process of making TIFF copies of many of its PhotoCD masters as a preservation measure. Many different kinds of formats are used when delivering images to the public, including GIF, JPEG, PDF, and MrSID file formats. Usually several sizes of images are available, from thumbnails through high quality versions. Finding aids are encoded in XML using the EAD DTD, and delivered to the user in HTML format.

13.5.2   Compression

Compression methods used include ITU Group 4, LZW, PhotoCD, JPEG and MrSID. Only lossless compression is used on master archival images, and not all master images are compressed. The aim of the compression of the master images is to decrease the amount of storage needed, while compression of access derivatives can improve the speed of delivery.

13.5.3   Quality Control

Some post-processing procedures are usually implemented. These can include color correction on calibrated monitors, the application of a sharpening mask, and contrast stretching, usually carried out in Photoshop. CIDC and the Department of Preservation and Conservation both calibrate their scanners on a daily or weekly basis.

Quality inspection is carried out on 100% of objects. For visual images, this is usually done immediately after capture. For textual materials, inspection is usually done while adding metadata about the structure of each volume after capture. Checksums for the archival images are created so that any future error in transmission or file copying can be identified and rectified. There is no formal quality control for metadata. It tends to be done on an ad hoc basis, with errors corrected as they are found, but often the software used to enter metadata has functions to limit mistakes.
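The checksum practice mentioned above can be sketched in Python as follows. The interview does not say which checksum algorithm or tooling Cornell uses, so the choice of SHA-256, the in-memory manifest, and the function names here are assumptions made purely for illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths, algorithm="sha256"):
    """Record a checksum for each master file at the time it is archived."""
    return {str(path): file_checksum(path, algorithm) for path in paths}


def verify_manifest(manifest, algorithm="sha256"):
    """Return the paths that are missing or whose checksum no longer matches."""
    failures = []
    for path, expected in manifest.items():
        if not Path(path).is_file() or file_checksum(path, algorithm) != expected:
            failures.append(path)
    return failures
```

A manifest built when the master images are first stored can be re-verified after every copy or transfer. Any path returned by `verify_manifest` points to a file that should be restored from another copy.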

13.5.4   OCR

TextBridge is the software most often used to OCR text. Cornell feels that the level of accuracy achieved with TextBridge in the MOA project is comparable to that achieved by the University of Michigan. OCR is used to enable full text searching, not to create textual equivalents of the original. Hence, little of the OCR text has been corrected, and none has been rekeyed.

13.5.5   Access

Depending on the project, user access to the digital files can vary from restricted use by in-house staff, to worldwide access to the scanned images and metadata. However, the user is never allowed to have access to the original master TIFF image. There is no charge to use the resources.

Users can search on catalog and index data or, when available, on full text. They can often also browse by fields, and a thumbnail browse option is usually available. No special client is needed to access most collections; they are web accessible. Users of the visual collections accessible via the Insight software from Luna Imaging do gain extra functionality if they choose to access the images using Insight’s Java client rather than the web browser. The Java client enables users to create groups and slide shows, export images and data as web pages, measure features on images, and annotate them as well. It offers a high level of interactivity and lets users easily create their own resource from the larger body of work.

Information about the availability of digitized resources is disseminated through websites, press releases, articles, print and broadcast media coverage, conferences and meetings, flyers, and bookmarks. Locally, seminars for faculty are given to raise awareness. Web logs are used to monitor usage.

 

13.6    Evaluation, Funding and Long-term Sustainability

Both CIDC and the Department of Preservation and Conservation have conducted user evaluations of the different projects. CIDC investigated the use of its Utopia Project, a database of digitized versions of slides of the Italian Renaissance, while the Department of Preservation and Conservation, in conjunction with Cornell’s Human Computer Interaction Laboratory, investigated use of the MOA collection. Based on its evaluation, CIDC selected Insight software from Luna Imaging to serve as the front-end for the dissemination of many of its projects. The Insight client addressed many of the issues raised in the Utopia project. The Department of Preservation and Conservation changed the selection criteria used in the MOA project after discussions with focus groups. D-Lib Magazine and RLG DigiNews have published reports on evaluation strategies and user needs methods that are based in part on the Cornell evaluations. Front-end evaluation has become an iterative process and there are regular updates to project software.

Both sections would like to run more thorough evaluations in order to monitor user needs more closely. They would especially like to understand better how users work with finding aids in particular and primary sources in general, so that they can develop better interfaces to those sources.

 




abp~04/02