
 

16   Center for Retrospective Digitization, Göttingen State and University Library

 

HATII interviewed Dr. Norbert Lossau, Head of the Center for Retrospective Digitization at the Göttingen State and University Library, on February 22, 2001. In 1997, the German Research Foundation launched the funding program that led to the creation of the Center. As a digital research library, the Center holds collections of early printed materials, which are available on the WWW. The Center hopes to support ongoing activities within the German Research Foundation while contributing to national standardization efforts in the field of digitization.

 

16.1    Organizational Digitization Program and Policy

The Center for Retrospective Digitization in Göttingen (Göttinger DigitalisierungsZentrum, GDZ) is funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) and was established in May 1997 at the Lower Saxony State and University Library (SUB) in Göttingen, Germany. The GDZ is part of a funding initiative for the retrospective digitization of library materials, with the aim of building a German digital research library. As the Digitization Center of the Lower Saxony State and University Library Göttingen, it also receives financial support from the State Ministry for Science and Culture.

The GDZ collections are built around early printed book and journal collections such as North American travel literature (approx. 400,000 images), mathematical documents from the reference journal Jahrbuch über die Fortschritte der Mathematik (1868-1943) (250,000 images), and the Gutenberg Bible. These collections, along with their accompanying structural and bibliographic metadata, are available over the world wide web. The main purpose of the GDZ is to acquire up-to-date know-how and experience in scanning early printed material, image processing, OCR and document management, including access to digital libraries and books via the world wide web. The GDZ in Göttingen is also engaged in the evaluation of tools and techniques for image capture and text conversion, bibliographic description, document management and the provision of remote access.

The GDZ is engaged in partnership projects like DIEPER (Digitized European Periodicals), in which ten European countries have joined to build a virtual network and a central access point for periodicals that have been retrospectively digitized in Europe and around the world. It has recently started a related project, Digizeit, which is intended to be similar in principle and in aspects of functionality to the US-based JSTOR. The SUB Göttingen has received a further DFG grant to digitize, in co-operation with nine major research libraries, 58 journals from 14 different academic fields.

Another important aim is to transfer knowledge into related library digitization projects and to coordinate national efforts towards standardization in various fields (e.g. digital conversion, online access, bibliographic description). Under the Retrospective Digitization of Library Holdings funding program, the DFG also supports web-accessible resources held by other German research libraries and academic institutions (see http://gdz.sub.uni-goettingen.de/en/vdf-e/).

The priorities for digitization are set in the context of the overall initiatives in the historical research collections of the SUB, such as the building renovation projects to establish new reading rooms for 18th and 19th century physical materials with side-by-side access to the materials online. The collections policy centers on the Library’s holdings of rare early printed materials, complete collections (e.g. the North American collections) and materials of wide appeal (e.g. the Gutenberg Bible). High priorities and selection criteria for digitization are research significance, enhanced access, use for teaching and learning, provision of user services, historical and cultural value, conservation and preservation.

Access to materials digitized as a result of projects aided by DFG grants is provided free of charge. Although income generation is not a high priority, the new Digizeit project, whilst remaining a not-for-profit project, will consider ways of generating sufficient income for it to continue past its first funded phase.

The primary audiences are, first, users of the University Library, mainly researchers, and, second, a wide international audience, particularly from the university sector in the USA. The GDZ regularly fulfills requests from US universities for documents from its collections, which, in addition to online delivery, are delivered via FTP or on CD. The other main audience is the general public worldwide, as some of the materials, such as the Gutenberg Bible, are of wide interest.

 

16.2    Project and Asset Management and Planning

The conversion process adopted by the GDZ follows the guidelines and recommendations of the DFG-funded technical working group, as outlined in its 1997 final report. Although there has been a technology shift since then, with scanning technology having moved on rapidly (for example, in 1997 it was not possible to use a bound volume scanner to scan at 600 dpi), the basics of the strategy have remained sound. This strategy centers on the use of standards, both for metadata and for file formats.

Text material is scanned from microfilm (with new production date) and from the original. Scanning from microfilm (35mm) is contracted to a vendor; scanning from the original is undertaken in-house using the Library’s own equipment. Images are created at 600 dpi.

The Zeutschel Omniscan 3000 scanner and the Minolta PS 7000 run under SRZ ProScan Book scanning software, developed for the GDZ by the Satz-Rechen-Zentrum Company in Berlin to meet the special requirements of production scanning of older books (e.g. TIFF-header editing, a production control window with a tree view of scanned pages, and masking and cropping of pages during the scanning process).

Image capture of older books often requires some enhancement after scanning. For economic reasons this post-processing is done in batch mode wherever possible, and black and white text material is compatible with this semiautomatic method. The GDZ uses a professional Norwegian program called PixEdit, originally designed for the CAD sector.

In 1999 a new scanning device for capturing color images was added to the equipment of the GDZ. The digital Camera-Back Picture Gate 8000 (max. resolution 8000 x 9700 pixels), manufactured by Anagramm GmbH, Germany, is used for face-up scanning of valuable Library resources (e.g. the Gutenberg Bible). Greyscale scanning of illustrations (e.g. from the travel account books) is also performed with this device. A special moveable cradle, first designed for Graz University Library by Manfred Mayer, is used to ensure contact-free scanning of rare books. It utilizes a sensitive low-pressure system to hold pages down.

Quality control for mass-produced text scanning, covering criteria such as legibility and completeness, is carried out with the software ACDC. For a complex and valuable project like the scanning of the Gutenberg Bible the procedures were much more rigorous, because color calibration had to be monitored closely and thoroughly. Photoshop was used for color management.

Core bibliographic information is added to the TIFF header of the digital master to reference each image to a bibliographic record in the online library catalog. The digital masters are stored offline on CD-R, an ISO-standard storage medium.
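The idea of carrying a catalog reference inside the TIFF header can be illustrated with a minimal sketch. It writes a tiny uncompressed black and white TIFF whose ImageDescription field (tag 270) holds a reference string, then reads that string back. The choice of the ImageDescription tag, the `catalog_id=...` string layout and the function names are assumptions made for illustration; the source does not say which header fields or identifier scheme the GDZ uses.

```python
import struct

ASCII, SHORT, LONG, RATIONAL = 2, 3, 4, 5  # TIFF field type codes


def build_tiff(width, height, pixels, description, dpi=600):
    """Return the bytes of a minimal uncompressed bilevel TIFF (little-endian)."""
    if len(pixels) != ((width + 7) // 8) * height:
        raise ValueError("pixel data does not match the image dimensions")
    desc = description.encode("ascii") + b"\x00"   # ASCII fields end with NUL
    if len(desc) <= 4:
        raise ValueError("this sketch only stores descriptions by offset")
    padded = desc + (b"\x00" if len(desc) % 2 else b"")  # keep offsets even
    n_tags = 11
    desc_offset = 8 + 2 + n_tags * 12 + 4   # header + IFD come first
    res_offset = desc_offset + len(padded)
    strip_offset = res_offset + 8
    # Entries must be sorted by tag. Packing SHORT values as a 32-bit
    # little-endian integer yields the same bytes as a left-justified SHORT.
    entries = [
        (256, LONG, 1, width),             # ImageWidth
        (257, LONG, 1, height),            # ImageLength
        (259, SHORT, 1, 1),                # Compression: none
        (262, SHORT, 1, 0),                # Photometric: WhiteIsZero
        (270, ASCII, len(desc), desc_offset),  # ImageDescription
        (273, LONG, 1, strip_offset),      # StripOffsets
        (278, LONG, 1, height),            # RowsPerStrip
        (279, LONG, 1, len(pixels)),       # StripByteCounts
        (282, RATIONAL, 1, res_offset),    # XResolution
        (283, RATIONAL, 1, res_offset),    # YResolution (same value)
        (296, SHORT, 1, 2),                # ResolutionUnit: inch
    ]
    header = struct.pack("<2sHI", b"II", 42, 8)
    ifd = struct.pack("<H", n_tags)
    ifd += b"".join(struct.pack("<HHII", *entry) for entry in entries)
    ifd += struct.pack("<I", 0)            # no further IFDs
    return header + ifd + padded + struct.pack("<II", dpi, 1) + pixels


def read_description(data):
    """Return the ImageDescription string of a little-endian TIFF, or None."""
    if data[:4] != b"II*\x00":
        raise ValueError("not a little-endian TIFF")
    (ifd_offset,) = struct.unpack_from("<I", data, 4)
    (count,) = struct.unpack_from("<H", data, ifd_offset)
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i
        tag, ftype, n, value = struct.unpack_from("<HHII", data, entry)
        if tag == 270 and ftype == ASCII:
            start = entry + 8 if n <= 4 else value  # short values sit inline
            return data[start:start + n].rstrip(b"\x00").decode("ascii")
    return None


# Hypothetical reference string linking the image to a catalog record.
master = build_tiff(8, 2, b"\xff\x00", "catalog_id=EXAMPLE-0001;image=00000001")
print(read_description(master))
```

Keeping the reference in the header means the link to the catalog record travels with the master file itself, independently of any database.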

 

16.3    Human Resources and Training

Human resources at the GDZ comprise a director, a technical director, a systems administration and database manager, specialists in OCR and digital imaging, and a librarian. The GDZ functions both as a separate team carrying out funded projects (for example funded by the DFG, the European Commission or the National Science Foundation) and as a department of the Library which can carry out services for other departments. For example, there have been close collaborations between the Department of Rare Books and the GDZ on the scanning of early printed materials and color digitization. The policy is to retain the GDZ personnel in the IT department of SUB if they are not working on GDZ projects, as the skills and knowledge of this highly specialist team are valuable to the organization.

16.3.1   The DMS – Asset Management

The overarching objective of the GDZ is to build up an open distributed digital research library, accessed via its new RDB-driven DMS, AGORA, which has become established as the central administration tool for the digital library of the GDZ. The software has been developed by a company in Berlin, Satz-Rechen-Zentrum (SRZ). It can be implemented on different RDB platforms and is based on an extensible document model developed at GDZ. The DMS maps bibliographic and structural data (based on printed document structures) to a set of images, but it is open to handle any new document structures. The administrative functions of the AGORA system are controlled by an administrative tool which includes the following features:

The largest part of the data model is used for documenting and retrieving digital resources and their parts. It is made up from a heterogeneous hierarchy of structural objects with common and distinctive features (e.g. monograph, title page, chapter), but special objects (e.g. figures, tables, indexes) can be referenced as well and it supplies document access via printed page numbers.
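A data model of this kind can be sketched as a small tree of structural objects, each covering a range of page images, plus a lookup from printed page numbers to images. This is an illustrative sketch only: the class names, field names and sample book are hypothetical and are not taken from AGORA.

```python
from dataclasses import dataclass, field


@dataclass
class StructuralObject:
    """One node in the document hierarchy (monograph, chapter, figure, ...)."""
    kind: str
    label: str
    first_image: int   # index of the first page image the object covers
    last_image: int    # index of the last page image the object covers
    children: list = field(default_factory=list)

    def walk(self):
        """Yield this object and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class DigitizedDocument:
    root: StructuralObject
    printed_pages: dict = field(default_factory=dict)  # printed label -> image

    def image_for_page(self, printed_label):
        """Map a printed page number (e.g. 'iv' or '2') to an image index."""
        return self.printed_pages.get(printed_label)

    def objects_of_kind(self, kind):
        """Return all structural objects of one kind, e.g. every figure."""
        return [obj for obj in self.root.walk() if obj.kind == kind]

    def innermost_at(self, image_index):
        """Return the most deeply nested object covering an image.

        Assumes sibling objects do not overlap, so the last match in
        depth-first order is the deepest one.
        """
        best = None
        for obj in self.root.walk():
            if obj.first_image <= image_index <= obj.last_image:
                best = obj
        return best


# Hypothetical sample document.
book = DigitizedDocument(
    root=StructuralObject("monograph", "Example travel account", 0, 9, [
        StructuralObject("title page", "Title page", 0, 0),
        StructuralObject("chapter", "Chapter I", 3, 9, [
            StructuralObject("figure", "Map of the route", 5, 5),
        ]),
    ]),
    printed_pages={"iii": 1, "iv": 2, "1": 3, "2": 4, "3": 5},
)
print(book.image_for_page("iv"), book.innermost_at(5).label)
```

Separating the structural tree from the printed page lookup lets front matter with roman numerals and a main text with arabic numerals coexist in one document.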

From the users’ viewpoint the advantage of the DMS is that digitized documents are accessed exactly as a printed book might be accessed, e.g. via tables of contents, indexes, chapter headings, special features like figures or maps. The interface also preserves the printed page numbering (including searching in the printed page numbering schemes) and provides simple and advanced searching and a zooming facility.

The GDZ has chosen XML/RDF as the data interface of AGORA. The reasoning is that XML/RDF provides good support for interoperability, allows efficient, targeted retrieval by XML-compliant search engines, is open to new document semantics, and is compatible with minimal Dublin Core. It is also expected to become widespread and to be supported by many software tools, such as editors and browsers.
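As a rough illustration of what a Dublin Core description expressed in XML/RDF looks like, the sketch below serializes a few fields for one resource. The namespace URIs are the standard RDF and Dublin Core 1.1 ones; the function name, the example URL and the field values are hypothetical, and the actual schema exported by AGORA is not documented in the source.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_rdf(resource_url, fields):
    """Serialize Dublin Core fields for one resource as an RDF/XML string."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": resource_url}
    )
    for name, value in fields.items():
        # Each Dublin Core element becomes one child, e.g. <dc:title>.
        ET.SubElement(description, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record; URL and values are placeholders.
record = dublin_core_rdf(
    "http://example.org/documents/0001",
    {"title": "Example travel account", "date": "1850", "format": "image/tiff"},
)
print(record)
```

Because the output is plain namespaced XML, any XML-aware tool can parse it back without knowledge of the system that produced it.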

In the future the GDZ will integrate full text retrieval (Verity) and user accounting, and will support compatibility with other related metadata semantics and RDF schemas or XML DTDs.

 

16.4    Project Life-Cycle Processes and Procedures

16.4.1   Access to Digital Materials

Access is provided to both bibliographic records and digitized documents via two methods. Bibliographic records for the digitized documents are created in the PICA Online Library Network Catalog and serve as one method for providing access to the documents. Following the guidelines of the funding body, the DFG, digitized documents in Germany have to be recorded in the online library network catalogs (Verbundkataloge) to ensure direct access to a document entity via the online library catalog. The overall goal is the integration of the digital collections in an online library catalog (locally as well as in a networked catalog) to allow global subject searches over all library holdings independent of their physical representation (printed books, microforms, electronic publications, digitized books, etc.).

Göttingen treats digitized documents as reproductions (as microform – Sekundärform in German) and creates separate records in the online catalog. Special categories for online resources are added (e.g. Date of Digitization, Creator, Copyright, URL). The availability of the PICA-GBV Online Library network catalog on the world wide web allows users to start a search in the Online Catalog and go directly from the hitlist to the electronic version. The bibliographic description in the GBV catalog conforms with the German exchange format for libraries (MAB-2).

Another method of access is directly via the Document Server located at the Library’s homepage. Simple and advanced searching as well as title/author browsing for collections provides easy access to digitized documents. Users have the option to view the document on-screen via a standard web browser (as GIF, JPEG, PNG images), download a PDF file for printing, or order a high quality (600dpi) printout from the Library. Additionally, a PDF version, together with the free Acrobat Reader, will be offered on CD-R. These versions for output will be generated as derivatives from the digital master.

16.4.2   Rights

The copyright on materials is owned by the Library, and material is made freely available with few restrictions on use, apart from use in broadcasting.

16.4.3   Handling

The GDZ, in its collaborations with the Department of Rare Books in the SUB, has built up a significant body of experience in the handling of early printed materials during the digitization process. For further details see: Norbert Lossau & Martin Liebetruth, “Preservation Issues in Digital Imaging Technology” in Microform & Imaging Review, K.G. Saur, Vol. 29, No. 4, 2000 (originally published in Spectra, Museum Computer Network, Fall 2000, Volume 25, Issue 2, pp. 30-36).

 

16.5    Format, Resolution and Compression of Digitized Materials

16.5.1   Text

Text-based materials are scanned as TIFF images. A well-known problem in OCR is the recognition of texts printed in fraktur (gothic). As the GDZ's broad evaluation of OCR programs showed, neither standard programs nor sophisticated trainable programs (Prime Recognition, ProLector, Optopus, FineReader) provide an economical, automatic solution for recognizing this kind of text. This is a serious problem for many digitization projects in Germany. The GDZ in Göttingen uses the Russian program FineReader (vers. 4.0). Non-gothic texts, even from older books and with difficult image quality, can be recognized with a low failure rate, allowing the creation of a text index for background search.

The GDZ is looking for a technical solution for the economical recognition of gothic texts by co-operating with a company spun off from an academic institute of mathematics (in Potsdam, Germany). Other institutions in Germany also see this problem as an opportunity to fill a special segment of the market for text recognition programs. For an interim period, until the issue of gothic text processing can be resolved, the minimum for accessing digitized text will be the navigational tools given by the original itself, such as tables of contents, indexes and lists of illustrations, hyperlinked to the referenced portions (e.g. chapters, image pages) of the documents. Materials using gothic text are sometimes sent to a service vendor for re-keying, but this is mainly done for tables of contents and rarely for full texts.

A new project funded by the European Commission and led by the University of Innsbruck in Austria is researching the development of a modular addition to the OCR program FineReader to recognize gothic characters. The GDZ has shared experience with the research team and is working in partnership with them.

16.5.2   Images

Images are created in 600dpi. The scanning process produces a high quality digital master in TIFF-format. Derivatives (GIF, JPEG, PNG) are created on the fly for online delivery. A PDF version can be obtained for downloading or for offline delivery on CD-R, burned by the GDZ.

 

16.6    Evaluation, Funding and Long-term Sustainability

No formal evaluation has been carried out with users, but the unsolicited user feedback has been good. There has been an especially good level of response from users in the USA, with many commenting positively on the free availability of materials.

The longer term funding position is to continue to pursue funded projects for the GDZ such as Digizeit, and to retain personnel. The GDZ is now at the end of its second two-year funding tranche from the DFG. The GDZ will continue even if further external funding is not forthcoming and, as noted above, absorb staff into the IT department of the SUB in order to retain the skills and experience that have been built up.

However, the outlook for funding is promising. The technical infrastructure and the knowledge base (particularly in color digitization and handling early printed books) at the GDZ are so strong that it is felt that there will always be an opportunity to generate income from digitization consultancy and services to clients such as other libraries and archives with valuable or rare/early printed materials to digitize.

The strategy for longer term sustainability, data preservation and archiving is to follow standards in all fields and to keep preservation copies of all data. For example, master TIFFs are always kept for graphic formats. These can amount to considerable volumes of data: for the Gutenberg Bible there are more than 700 CDs of data, as one illuminated page might take up 400 MB of storage as an uncompressed TIFF file. CD-R, an ISO standard storage medium, is used. Similarly, metadata is stored as ASCII structured data.
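The storage figures can be checked with some simple arithmetic. The sketch below estimates the size of an uncompressed color image at the camera back's stated maximum resolution of 8000 x 9700 pixels and works out how many 400 MB pages fit on one disc. The bit depths and the nominal 650 MB CD-R capacity are assumptions; the source does not state either.

```python
def uncompressed_bytes(width_px, height_px, channels, bits_per_channel):
    """Size of the raw pixel data of an uncompressed image, in bytes."""
    return width_px * height_px * channels * bits_per_channel // 8


def pages_per_disc(page_bytes, disc_capacity_bytes):
    """Number of whole page files that fit on one disc."""
    return disc_capacity_bytes // page_bytes


MB = 10**6                      # decimal megabytes, for rough estimates
CD_R_CAPACITY = 650 * MB        # assumed nominal capacity of one CD-R

# Full-frame capture at the stated maximum resolution, RGB.
rgb_8bit = uncompressed_bytes(8000, 9700, 3, 8)     # assumed 8 bits per channel
rgb_16bit = uncompressed_bytes(8000, 9700, 3, 16)   # assumed 16 bits per channel

print(rgb_8bit / MB, rgb_16bit / MB)
print(pages_per_disc(400 * MB, CD_R_CAPACITY))
```

Under these assumptions a 400 MB page falls between the 8-bit and 16-bit full-frame estimates, and only one such page fits on a disc, which is consistent with a disc count running into the hundreds for a single book.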

The thinking is that there will have to be migration of data in the future to other formats, but the use of standards now will facilitate that migration. The GDZ strategy started out with adherence to standards and it has proved to be a sound foundation for the Center.




abp~04/02