On October 16, 2000, HATII interviewed John Price Wilkin, Head of the Digital Library Production Service at the University of Michigan; Christie Stephenson, Assistant Head; and Maria Bonn, Head of the Scholarly Publishing Office. The Digital Library Production Service was initiated in 1996 with the intention of creating an infrastructure for the digital library collections at the University. One of a number of units within the University Library’s Digital Library Initiatives division, the DLPS is responsible for the operation and maintenance of current collections, while also preserving these resources for future use.
The Digital Library Production Service (DLPS) has not carried out systematic collection surveys to identify materials for digitization. Units within the library with collection responsibility drive the digitization process. Systematic surveys may be done for some projects which are targeting a segment of a collection. Cataloging is performed by staff in the Cataloging department. Preservation staff participate in projects like the Making of America (MOA) by ensuring that the materials are not “last copies”, that they are processed as preservation materials and that they are appropriate for digital capture. Preservation staff conducted a collection survey before including materials in MOA 4. The overall focus is on the project, which can have different motivations, for example to clear shelf space or to digitize a particular subject.
The priorities are driven by the material — e.g. shelving, cataloging, condition, demand — and collection management is undertaken by collection development staff. The priorities are not formalized, and they can change through the digital process; for example, MOA revived interest in materials that had not attracted attention for many years. DLPS plans to develop selection criteria for conversion projects; this will not be prescriptive but will take the form of a framework.
The program recommends standardizing workflows for the digitization process and for quality assurance. They also recommend creating strategies with collection managers to ensure the matching of materials, methods and goals.
The obstacles identified by DLPS in development are mainly in connection with the data and methodology. In some projects it has been difficult to obtain the data. Some projects have difficulty with the methodology in working with the data, e.g. working with newspapers, and the judgements required in order to create usable resources.
The complexity of the workflow was identified as a main obstacle in the process. DLPS would ideally like to bring systematic thinking to the process to adapt to problems of scale. For example, MOA had to overcome issues associated with co-ordinating 27 staff spread over the library to achieve the goals of the project. The workflow also had to deal with vendors, outsourcing and complex project management systems.
There is a tension between costs and methods, and a further tension in scaling projects, for example in how to fit small-scale projects into a large-scale workflow.
The main criteria for the selection of material for DLPS are research significance and enhanced access. Other drivers are: intellectual property rights; teaching and learning potential; conservation; historic and cultural value; improved functionality; space rationalization; and preservation. By-products of the process (but not full criteria) are publicity, potential commercial exploitation and support of infrastructure costs. Other factors include provision of user services, labor cost reduction and social inclusion.
The conversion strategy varies from project to project and has changed over time. MOA 4 has different drivers from the first MOA project — moving from hand selection to automated selection.
In developing the digitization program, co-operation has taken place with libraries, museums, corporations, foundations, government agencies, and academic institutions. The co-operation was local, regional, national and international. DLPS feels that co-operation is a valuable experience but must be thought out carefully beforehand. It would encourage any institution to speak to other similar organizations before starting any digital program, to establish lines of communication and shared goals, and to give one person the authority to enforce milestones and timelines. There should be a process of accountability and an appropriate level of communication.
The current status of the program is ongoing with no anticipated end date. Digital projects have been active in the University of Michigan (UMICH) since the late 1980s, digital library initiatives since 1993, Humanities Text Initiative (HTI) since 1994 and MOA since 1995. DLPS was established in 1996.
The main purposes in creating the digital deliverables for DLPS are preservation, public access, research, and wider access. Other purposes are teaching and learning, and revenue generation.
Specific projects have created statements of intent, and the content will vary from project to project.
The type of source material digitized includes:
The nature and format of the materials digitized vary considerably from project to project and include all sizes of photographs, original artworks and various sizes of documents. Depending on the specific project, they may represent the whole collection or a sample.
DLPS intends its digital objects to be reused in a variety of projects, including printing, and to be available for others to use.
The following standards, guidelines or tools are used for representing content:
The following standards, guidelines or tools are used for describing content:
The following standards, guidelines or tools are used for controlling data values:
The following standards, guidelines or tools are used for representing structure:
DLPS looks at appropriate standards and guidelines for each project. Many of the standards available were not necessarily suitable for its needs, or in some cases the time and expertise were not available. Anne Kenney’s guidelines on page imaging were extremely valuable (Kenney, Anne R. Digital Imaging for Libraries and Archives. Cornell University Library, June 1996).
Many guidelines were adapted for projects, such as the papyri project.
Use of controlled vocabularies and striking a healthy balance between standards and reality are both recommended: standards should not be used for their own sake, and pragmatic solutions are often the most effective.
The intended audience for digital deliverables, with their priorities, are:
DLPS identified the academic sector as its target audience. It conducted evaluations on some projects’ target audiences, for example the papyri project and MOA.
Other users, whose profile will vary depending on the material, can use all the objects.
W3C’s “Guidelines for Web Site Accessibility” were considered for use in the systems. The project stated on the web page reasons for limitations of use, which were mainly due to costs and the nature of the material.
It is difficult to gauge if the target audience differs from the actual audience. Feedback is provided via web logs and via email.
The DLPS uses in-house expertise for project management. The project management is tied to the cataloging and preservation department within the larger structure of the library. The work around digital library issues is changing the structure of the organization. For example, the preservation department has changed some of its staffing as a result of the collaborative work with DLPS on projects such as MOA.
Formal project management procedures are in place. The DLPS administration group provides online guidelines and procedures, and oversees project management. Procedures have had to be adapted to strike a balance between overly centralized procedures and insufficient project management. They have tried to be inclusive of individuals.
There is no formal quality assurance for project management.
No pilot study or feasibility study is carried out for the projects. There is a form of “ramp up” in projects when test scans are carried out. This is an evolving practice and they continue to learn.
Microsoft Project and flowcharts have been used to aid planning. Work allocation depends on previous responsibility. DLPS has produced procedural documents, which cover OCR, bitonal and continuous tone scanning, quality assurance and preparation.
The digitization is carried out both in-house and externally. The method used depends on the materials, with digitization of some fragile material being kept in-house. Cost and scale are also factors in this decision. OCR tends to be done in-house, with encoding and bi-tonal scanning typically (but not always) outsourced. In-house methods enable further training of staff and control over the material and process.
The institution bought equipment for the digitization process. The capture process is driven by the material and project goals. Flatbed scanners, film scanners and high-end digital cameras are used. A transparency film scanner (Imacon Flextight Precision II Film Scanner) scans 4x5 transparencies; it bends the transparency during scanning, so although it is not a drum scanner, it offers more than a conventional slide scanner.
They are constantly working on the guidelines for data capture procedures and use gray scales, color bars and reproduction charts.
The image capture process consists of flatbed scanning, slide scanning and direct digital capture (Kontron), as well as photographing the object, producing 4x5 transparencies and digitizing these. The digital objects produced from these film intermediaries are considered to be superior to the ones produced by the high-end digital cameras.
The organization is divided into three functional groups: Digitization, Information Retrieval and Architecture. There is currently one director, an assistant director, five FTE digitization staff (supplemented by students), two collection co-ordinators for information retrieval, six programmers and a programming manager, a data loader/media manager, an interface specialist, a technical support person, and an office support person. The staff have a variety of backgrounds and general education. Staff are employed to work in DLPS and are not redeployed from other jobs within the Library. They use both external and in-house advice on the technical aspects of digitization. Specialist technical and equipment operators in bitonal scanning, OCR, text encoding and continuous tone image capture are trained in the following areas:
The training is provided in-house, through external courses, contact with peers in other institutions and by learning on the job.
DLPS is aware of the copyright status of its materials, but the institutions for which it manages content are ultimately responsible. The copyright declaration is in a text statement and sometimes on the image. The project digitizes copyrighted material under library provision and with the owner’s agreement. The users can make local copies of content for subsequent reuse (print, copy into another program, etc.). Downloading to a PC, LAN or WAN is permitted. Users can download and view ASCII text, encoded text or PDF. For images, users can download and view thumbnails, lower quality images, highest quality images and associated descriptive metadata and documents.
Watermarking is not used as this is considered to be ineffective and costly and as yet there is no system that actually works well. There is no recognized standard format and only proprietary systems are currently available.
UMICH has a preservation/conservation procedure for original material. The library preservation department conducted the brittleness survey for MOA; they also examine materials at high risk. Conservation may be carried out as part of the overall workflow, prior to actual digitization.
The risks identified to the material are handling, heat, breakage and dis-binding. Workflows and methods have been adjusted to lessen the risks, and protective measures such as cold lights, cradles and exposure limits are used. Where necessary, curatorial staff prepare the material and also monitor it at the start of the process until they are happy with procedures. Availability of digital files does not preclude access to the original materials, but such access will be discouraged and users encouraged to use the digital objects instead.
DLPS has always had access to all the relevant material although it sometimes has had to locate some additional material for projects. It has rejected material that has been too brittle. DLPS digitizes both originals and intermediaries.
For text conversion projects, DLPS has digitized original printed material, microfilm and occasionally photocopies (for replacement pages). For image conversion, DLPS captures originals including photographic prints, negatives and 35 mm slides, and creates 4x5 film intermediaries for some capture.
UMICH libraries use the full MARC record catalog and descriptive information is always taken from this to create headers for titles being digitized. UMICH libraries’ original objects are cataloged by the appropriate curatorial unit (e.g. special collections or by the cataloging department). A document is forthcoming that explains the policy for cataloging digital surrogates.
MARC, EAD and USMARC are used to describe text and finding aids and to create collection level records for image collections. Images cataloged within the library employ a locally developed system called VRO (Visual Resources Online). Tools used for controlling data values are LC Subject Headings and Cataloger’s Desktop, as well as picklists and TGM values. The system they use has controls to guard against user error in input.
The projects record metadata about the original object, the digital object, the digitization process, technical details and administrative data. They are currently working on recording staffing details. They also have information inserted in the TIFF header by the vendors. The digitizer and the information professional create the metadata. The final metadata are included in the main catalog, except for visual resources that are kept in a separate catalog. The main catalog was in paper form until 1989 and is now available on the local intranet and the internet. URL and USMARC 856 link the digital and original objects.
DLPS uses SGML and TEI, XML, GIF and PDF for Page Image delivery, and TXT for delivery from SGML. Many of the documents have non-Latin characters, e.g. Greek. They have used Beta Code with character entity references in order to render them for display. They have used various OCR packages over the years but now use PrimeRecognition software to convert page images to text. For OCR without correction, they achieve approximately 99.8% accuracy or better. For OCR with correction, the accuracy is 99.995% (i.e., 1 error in 20,000), as determined by sampling; if a sample falls short, the text is subjected to another round of proofreading. For keyboarding, they pay for (and verify that they have received) 99.995% accuracy. The documents do not have any special treatment prior to OCR. The aims are to support automatic indexing, enhanced searching, computer based analysis and document retrieval.
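The accuracy figures above translate into concrete error budgets, and a small sketch can make the arithmetic explicit. The following Python is illustrative only and is not DLPS's actual verification procedure: it assumes accuracy is measured per character, and the function names and the 2,500-character page are hypothetical.

```python
# Error budgets expressed as errors per 100,000 characters, so that all
# arithmetic stays in exact integers (no floating-point rounding).
UNCORRECTED_OCR = 200  # 99.8% accuracy   -> 0.2% errors
CORRECTED_TEXT = 5     # 99.995% accuracy -> 0.005% errors (1 in 20,000)


def allowed_errors(sample_chars: int, errors_per_100k: int) -> int:
    """Largest error count in a sample that still meets the accuracy target."""
    return sample_chars * errors_per_100k // 100_000


def sample_passes(sample_chars: int, errors_found: int,
                  errors_per_100k: int = CORRECTED_TEXT) -> bool:
    """True if the proofread sample meets the target; otherwise the text
    would go back for another round of proofreading."""
    return errors_found <= allowed_errors(sample_chars, errors_per_100k)


# A hypothetical 2,500-character page:
page_chars = 2_500
print(allowed_errors(page_chars, UNCORRECTED_OCR))  # 5 errors tolerated
print(allowed_errors(page_chars, CORRECTED_TEXT))   # 0 errors tolerated

# A 40,000-character sample at the 99.995% target tolerates 2 errors.
print(sample_passes(40_000, errors_found=2))  # True
print(sample_passes(40_000, errors_found=3))  # False
```

At the corrected-text target, a single page must in effect be error-free, which is one reason such accuracy is checked by sampling across a larger body of text rather than page by page.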
Experience has shown that converting a large volume of texts brings a drop in costs. They find co-operation very useful, both for sharing facilities and for increasing the volume being processed. They also recommend buying and testing OCR packages to find the one most suitable for the type of text to be digitized.
DLPS also used keying in as a method to retroconvert text, which gives greater accuracy, but the specifications must be clear. Companies have responded favorably to this and this method is used where it would be absurd to try to OCR the documents.
TIFF file format is used for capturing and preserving scanned images. GIF and JPEG are used for delivery, along with PDF, Wavelet Compression and MrSID. They are seriously considering PNG.
Capture resolutions vary according to source and method. In general, page images are captured at 600 ppi and saved as G4 bi-tonal TIFFs. Resolutions for grayscale and continuous tone images vary depending on the size and nature of the original object and the capabilities of the capture device, but are saved as 8 and 24-bit TIFFs respectively.
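To give a sense of how resolution and bit depth drive storage, the sketch below estimates the uncompressed pixel data for a scan. It is an illustration under stated assumptions rather than a description of DLPS practice: the page and print dimensions and the 300 ppi figure are hypothetical, and the estimate ignores TIFF headers, row padding and any compression such as G4.

```python
def uncompressed_bytes(width_in: float, height_in: float,
                       ppi: int, bits_per_pixel: int) -> int:
    """Approximate size of the raw pixel data for a scan, in bytes."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Hypothetical 6 x 9 inch page, bi-tonal (1 bit per pixel) at 600 ppi.
print(uncompressed_bytes(6, 9, 600, 1))    # 2430000 bytes before G4 compression

# Hypothetical 8 x 10 inch print at an assumed 300 ppi.
print(uncompressed_bytes(8, 10, 300, 8))   # 7200000 bytes as 8-bit grayscale
print(uncompressed_bytes(8, 10, 300, 24))  # 21600000 bytes as 24-bit color
```

The same original is three times larger in 24-bit color than in 8-bit grayscale, and doubling the resolution quadruples the size.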
For continuous tone images, DLPS always retains the original scans in uncompressed form as TIFFs. Images are stored as MrSID files (wavelet compression) and JPEGs are delivered to end-users. JPEG compression is used for the delivered files, aiming to improve access, enhance usability and decrease storage requirements.
Color correction, cropping and de-speckling are carried out using a variety of tools post-process. ImageMagick is used to create thumbnails. The file sizes vary hugely, from approximately 120K to 124 MB.
DLPS feels strongly that digital cameras are not the only viable alternative for digital image capture, and recommends that projects are careful to start with the materials and not the equipment. They have had great success with film intermediary as a method of image capture, with subsequent digitization done with a high-end transparency scanner. They argue for the better use of vendors and the establishment of community based, non-prescriptive open dialogue. Investment must be made to ensure success.
DLPS has not yet digitized a significant amount of sound or moving image material, but has begun serious exploration of support for both.
Quality control spot checks happen naturally through the process. They also have checks on random samples. Total checks are done in-house on continuous tone objects. On the papyrus project, the papyrologists organized the quality control. Outsourced material is quality controlled by the vendors as well as internally. For bi-tonal scanning, 5% of the images are checked for legibility and acceptable skew. ImageTag, the program used to assign page-level metadata values to the image files, provides a more thorough check, as an operator scans the images to link sequence to pagination information. The more formal quality control has helped the process; former projects had no such quality control and suffered accordingly.
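A 5% check of the kind described for bi-tonal scanning can be sketched as a random draw over a batch of image files. The code below is a minimal illustration, not the procedure DLPS uses: the function name, the file-naming pattern and the use of a fixed seed for repeatability are all assumptions.

```python
import random
from typing import List, Optional, Sequence


def qc_sample(filenames: Sequence[str], percent: int = 5,
              seed: Optional[int] = None) -> List[str]:
    """Pick a random sample of image files for manual legibility and skew checks.

    The sample size is `percent` of the batch, rounded up, using integer
    arithmetic so that the count is exact.
    """
    count = (len(filenames) * percent + 99) // 100  # ceiling division
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return sorted(rng.sample(list(filenames), count))


# Hypothetical batch of 200 page images named 00000001.tif ... 00000200.tif
batch = [f"{n:08d}.tif" for n in range(1, 201)]
to_check = qc_sample(batch, percent=5, seed=2000)
print(len(to_check))  # 10 images to inspect
```

Rounding the sample size up means that even a very small batch has at least one image inspected.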
Access to the digital objects varies with the copyright or license-status of the material or the preference of the image owners. In some cases, external users have only limited access to lower quality images. In-house users have access to the entire catalog and objects. Users can run a full text and fielded search on the associated image metadata and browse the image data. Users do not need any other software or hardware in order to access the materials. They can create image class sets, zoom and pan the images. The usage varies (see http://stats.umdl.umich.edu for more accurate information). The usage data is captured using formal methods and informal observation.
Users have to pay for the use of some of the digital deliverables. Some of the materials have a license to which users have to subscribe for access; this also includes UMICH. The publishers set the costs, and there is different pricing for the size of institution. If the subscriptions reach an optimum level then a price drop is considered (although this scenario has not actually happened).
Costs are not a driver in the projects; some are calculated on the usage and some are projected.
Potential users of the digital deliverables are informed about their availability through website announcements, press releases, articles in print media, print and broadcast media coverage, conferences, meetings and electronic and conventional mail shots. The project believes that listservs have been very successful in disseminating the information on MOA, while the academic articles are useful in communicating the overall programmatic methods and goals.
DLPS has run front-end evaluations using formal methods, online questionnaires, email, focus groups, and user observation. These evaluations have provided input into the development cycle, producing an iterative process.
Formative evaluation takes the form of questionnaires online, email, focus groups and recording interaction logs as well as observations of users’ interaction, which is videoed for later reference. The resulting changes have been part of the iterative process.
Only one summative evaluation has been carried out, since all other projects are ongoing. The project was Pricing Electronic Access to Knowledge (PEAK), which included the summative evaluation component as part of the project when it was established. The project used online questionnaires, focus groups and computer interaction logging as evaluation methods.
DLPS has decided to make the evaluation process more formal and invite users to come in for one day a year to look at a variety of projects and areas. It is hoped that this will help steer the development in a more formal manner. It is difficult to get people to help with evaluations, and so they will offer incentives for users to attend this one-day event. The results from this event will be made available to personnel involved in information retrieval design.
The funding for DLPS comes from general funds from UMICH with a small amount of government and federal contribution. They feel adherence to standards has saved money in the long-term. Any documentation required will vary from project to project. Most documentation is part of the general process.
DLPS will continue to add new materials, update metadata, add metadata and amend the user interface. This is an ongoing process. User interface changes are made periodically, with some review done annually. File formats are changed only when the technology dictates. They have formal plans to document the digitization process for long-term preservation. They emphasize that any strategy is based on the materials and not the systems, so migration of data is important, but retention of hardware/software is not.
They intend to keep the digital deliverables available for as long as possible and the loss of this material would matter greatly.
The material, and not the equipment, should lead the process; where possible, scalable, extensible systems should be built to allow flexibility. This is happening through the Digital Library eXtension Service (DLXS, www.dlxs.org). DLXS is the underlying system for delivery of DLPS digital library materials and is also offered to other non-profits as a supported combination of search and retrieval software and middleware for delivering digital library information, including bibliographic, text, and image data: “The University of Michigan Digital Library eXtension Service (DLXS) provides the foundation and the framework for educational and non-profit institutions to fully develop their digital library collections.” (see http://www.dlxs.org/aboutdlxs.html for a fuller overview).