Colleen Cahill, the Digital Conversion Co-ordinator at the Geography and Map Division of the Library of Congress, was interviewed by HATII on September 19, 2000. The Geography and Map Division holds the largest cartographic collection in the world, with 4.5 million maps dating from the 14th century. The Library of Congress aims to make its resources available to Congress and to the American public, and to preserve the material for future reference. To this end, the Library established the National Digital Library, whose initiatives seek to improve accessibility and encourage lifelong learning, while creating standards for a professional service in the field of archival digitization. One example is the American Memory project, through which map collections dating from 1500-1999 were converted to digital form.
A collection survey was not reported as central to the activities of the initiative. The project did not establish local digitization priorities and, while the Geography and Map Division (G&M) undertakes collection surveys, it does not establish the priorities for digitization. The policy of Special Collections of the Library of Congress (LC) has some bearing in this area, but the Geography and Map Division has the final say on what is scanned and in what order.
Obstacles to planning and building the digital deliverables included the size of the original materials, which can require up to twelve scans; moving files and stitching them together; and the use of MrSID for compression. However, the Division recently discovered a method to put the pieces back together using MrSID, so that they are now limited only by the hard drive space available. With the right setup, they suggest they could put terabytes of data together.
The selection criteria and their rank for digitization are that:
The project did feel that, within this framework, any of the prioritization criteria applied. These criteria have changed over time; in particular demand has increased for “current news” maps, which are driven by news headlines. In addition, about 12.5% of digitization is “on-demand”.
Developing the digitization program involved co-operation across archives, libraries, foundations and charities (Center for Geographic Studies), and commercial organizations (Microsoft). Based on this experience, where organizations have very different structures, objectives and needs, the program recommends that partners are flexible, so that no organization has to surrender too much.
The project is ongoing, with no anticipated end date; it started in 1994 and has had National Digital Library (NDL) status since 1996. The main purposes behind the creation of the digital deliverables were enabling research, facilitating public access, broadening access and assisting in the preservation of the original materials. The program produced an informative statement of intent, which was explicit about its rationale and level of faithfulness to the original, through standard publicity and Cable News Network (CNN) for maps online.
The type of source material digitized includes:
The nature and format of the materials digitized include vellum, paper, fiber and books (some dis-bound). The largest items were 24” x 36” in size (but through stitching the original can be much larger), required twelve scans and produced a file of some 2.1 gigabytes. The project attempted to be as inclusive as possible and included Civil War maps, and state and railroad maps. The project does not intend to re-purpose the digital deliverables.
The following standards, guidelines or tools are used for representing content:
The following standards, guidelines or tools are used for describing content:
The following standards, guidelines or tools are used for controlling data values:
The program did consult existing guidelines for digitizing particular document types when planning its digitization strategy, including the NDL guidelines and others through its Geographic Information Systems (GIS) partners’ research in such areas as resolution.
Currently HTML is used to represent structure, but the project intends to move to XML during the next two years. In relation to standards in general the project recommends that those setting out to undertake such work should:
The intended audience for digital deliverables includes:
The project did not feel that any one of these groups had been made a special priority; rather, it tried to serve them all. Some groups may, however, have found the material difficult to use because there was little interpretative information provided beyond the catalog record. The program did not undertake an evaluation of the target audience. Groups other than the target audience could use the deliverables, but whether they had special needs is unclear. The project has taken account of the World Wide Web Consortium’s (W3C) “Guidelines for Web Site Accessibility”. The profile of actual users has been slightly broader than expected (e.g. K-12 access). The project did not restrict how the digital deliverables could be used.
Advice on managing the program was available in-house. The project is considered an integral part of the Division and is supervised by the Geography and Map Division, not the NDL. The project has two co-ordinators reporting to one chief, which has resulted in tighter control. One management procedure that did not work was the original digitization workflow. The original workflow took 4 hours to scan, process, catalog and have a map ready for the internet. Through technological and other adjustments the Division have reduced this to one hour. The managerial quality assurance procedures in place are two “eyeballings” of the deliverables and two catalog reviews; these were developed from in-house experience and the project team believes that they have been successful.
A pilot study for scanning was undertaken in the Geography and Map Division, not as part of NDL. It aimed to establish training needs, technical feasibility, and workflow analysis piloting. This study led to a change in the design of the project as equipment choice changed from Unix to NT. In addition benchmarking studies were undertaken for technology, with a base of 4-hour scans. The project formerly allocated work on the basis of one person per project, but now projects are split into sections, cutting down on confusion and lessening the burden on the person responsible for scanning. Human Resources use both job descriptions and performance indicators for all positions.
Digitization is carried out in-house rather than outsourced because the material is so fragile. Equipment was already available in-house, but NDL added post-processing equipment and one scanner. The process of flat bed scanning (using Tangent Artisan 2000) was the only option available because of the fragility of the original material. The use of a high-end camera similar to the Phase One is not viable because of cost, not technological reasons. Guidelines for data capture procedures include calibrating the Tangent scanner and handling advice from preservation staff. Color chart benchmarks are also used when a problem arises.
People working on the project (full time equivalent (FTE)) and their capacities are:
The people working on the project have a variety of backgrounds: two have MLS degrees, and two others have library science and technical backgrounds as well as bachelor's and master's degrees. Two staff were redeployed from other areas and two were hired especially for the project. Advice on the technical aspects of digitization came from one internal and one external source.
The training needs of the project team were assessed on the job and were identified as:
All staff members were engaged in training. The training was organized through in-house services where possible. It took advantage of organizations’ own consultants for cataloging, involved attendance at external courses for PhotoShop and XML, and learning on the job. This organization of training has met the needs of the project but a specialized class was required for PhotoShop.
The project is aware of the copyright position of the digital deliverables. Because the LC is a US government agency, the Division cannot hold copyright on anything. All the Division’s scanned materials are in the public domain, with the exception of one map that the publisher asked the Division to scan; a note in the bibliographic metadata records this. The copyright or rights status of the final digital deliverable is not declared. Users of the digital deliverables are allowed to make printouts on paper and film, burn to CD, DVD etc., and download to a PC, LAN or WAN. Users can download and view thumbnails and lower quality images. MrSID files are 24-bit color and 300dpi when viewed in the stand-alone viewer. The TIFF file can be retrieved from the MrSID file through the stand-alone viewer. The Geography and Map Division do not permit the TIFF file to be accessed through the web, but will burn CD-ROMs via the Photoduplication Division. The highest quality TIFF images are archived. At this time no electronic rights management system, such as watermarking, is in use.
The project has a conservation procedure for the original materials. If materials are too fragile they are not scanned but sent to preservation. Any risk to the material identified in preparation for digitization is assessed by specialists rather than by the project. The project tries to minimize risk to the sources. Scanning is done through Mylar casings and uses cradles and supports. Some material is prepared by curatorial or preservation staff before digitization but not monitored by them during digitization (they recognize the limits of their expertise). As yet no access restrictions have been placed on material after its digitization.
The cataloging system in place prior to digitization is MARC to which the project has full access. Index and bibliographic data are used from these records in the digitization process. No further core reference material has to be located for the digital deliverables.
Sometimes the materials have to be altered from their original format for the digitization process (e.g. dis-bound) and some material is rejected before digitization if it is too fragile. Occasionally reproductions or intermediaries are used. These take the form of paper facsimiles; the project does not like to print from microfilm/fiche.
The original material is cataloged in MARC records, which include fields for the digital images. An in-house Microsoft Access database, which is used for tracking, houses the structural metadata; the bibliographic metadata is in the MARC record. Tools for controlling data values are LC subject headings. The metadata details recorded include information about the original object, the digital object, the digitization process, technical details, staffing details and administrative information. Metadata records are created primarily by the digitizer, but all other staff have some degree of input. This metadata record is then included in the main catalog, which is held in electronic form on an intranet server. The relation of the records for the digital deliverables to those for the original digitized materials is a mixture of independent and identical. The catalog and the objects are linked through a hook to the library record.
The project did not use special standards for dealing with textual materials. For images it used the TIFF file format for capturing and preserving, and GIF and MrSID formats for delivery. Capture and preservation resolution is 300dpi and delivery resolution 72dpi. 24-bit depth is used for capture and preservation and 8-bit for delivery. The compression used is MrSID, and items are compressed up to 22:1 for delivery.
The project retains images in uncompressed form and carries out post-processing (stitching, rotating, cropping, color balancing, unmasking and sometimes adjusting hue and saturation) using PhotoShop. The average capture and preservation file size is 210 megabytes. Delivery files are approximately 10 megabytes.
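The resolution, bit depth, compression ratio and file sizes reported here can be related with simple arithmetic. The following is a minimal sketch, not the project's own tooling: the function names are hypothetical, megabytes are assumed to be decimal (10^6 bytes), and the uncompressed size ignores any file-format overhead.

```python
MB = 1_000_000  # assumption: decimal megabytes


def uncompressed_bytes(width_in, height_in, dpi, bit_depth):
    """Raw pixel data size for a scan of the given physical size (hypothetical helper)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bit_depth // 8


def compressed_bytes(size_bytes, ratio):
    """Size after compression at ratio:1 (hypothetical helper)."""
    return size_bytes / ratio


# A 24" x 36" sheet captured at 300dpi and 24-bit depth.
sheet = uncompressed_bytes(24, 36, 300, 24)
print(f"full sheet: {sheet / MB:.1f} MB")

# The reported 210 MB average file compressed at the maximum 22:1 ratio.
delivery = compressed_bytes(210 * MB, 22)
print(f"delivery file: {delivery / MB:.1f} MB")
```

Under these assumptions a full 24” x 36” sheet comes to about 233 megabytes, the same order as the reported 210-megabyte average, and 22:1 compression of an average file gives roughly 9.5 megabytes, consistent with delivery files of approximately 10 megabytes.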
The dynamic range of the equipment is not checked. From its experience of digitizing images the project would recommend achieving as much capture as possible at one shot.
Two people implement quality control procedures by checking each image; metadata recording is also reviewed (although the Access database is not as heavily reviewed because some material which has been scanned is not cataloged).
Users access the digital deliverables through open access to the catalog plus the materials. Users can see a prior graphical or record entry before viewing; metadata searching is handled through a link. Apart from browsing the deliverables, users are able to manipulate images by zooming in and out. Image Alchemy was used for creating thumbnails and Remedy to request server space; MrSID, Cute FTP (for bit transfer rate), Internaweb file browser and Voyager for the catalog are the specialist software tools used by the project.
The project has one of the highest, if not the highest, level of usage in the NDL program. This is monitored by information technology services to which the NDL has access. Users do not have to pay for the use of the digital deliverables. Potential users of the digital deliverables are informed about their availability through website announcements, press releases, articles in print media, print and broadcast media coverage, conferences, meetings and electronic and conventional mail shots. The project believes that electronic media (web and email) are the most effective.
No front-end evaluation has been carried out, although the NDL may have done some work here. Formative evaluation took the form of a great deal of input from staff and chief, and this led to a number of web page changes. No summative evaluations have been undertaken. The NDL is planning such evaluation, but not at the level of this project.
The project is at the high end of NDL funding. LC feels that standards have saved money by increasing productivity. The project was required to provide cost models by one of its funding organizations.
New material is digitized and added monthly (100-500 images) and metadata updated at the same time. The user interface is updated yearly and this year has new navigation tools. File formats may be changed in the future.
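As a rough illustration of what this monthly intake implies for archive storage, the sketch below multiplies the stated range of 100-500 images by the average preservation file size of 210 megabytes reported earlier. The function name and the use of decimal units are assumptions, and real volumes would vary with the size of the maps scanned.

```python
MB = 1_000_000      # assumption: decimal megabytes
GB = 1_000_000_000  # assumption: decimal gigabytes


def monthly_archive_growth_gb(images_per_month, avg_file_mb=210):
    """Estimated new preservation data per month, in gigabytes (hypothetical helper)."""
    return images_per_month * avg_file_mb * MB / GB


low = monthly_archive_growth_gb(100)
high = monthly_archive_growth_gb(500)
print(f"{low:.0f} to {high:.0f} GB per month")
```

On these figures the preservation archive would grow by somewhere between 21 and 105 gigabytes a month.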
The project has a preservation strategy to ensure long-term access, but this is largely determined by the LC’s IT Service which provides storage on the LAN. The strategy for image preservation embraces file formats and updating metadata (the project’s responsibility), and storage media and storage conditions (IT Service’s responsibility). Quality control procedures in the life-cycle management are producing schedules for material to archive. The digital deliverables will be available indefinitely and the project does not need to rely on self-generating funds to sustain the resource, as it believes NDL funding will become permanent. The project has an exit strategy based on the G&M Division adopting the project, but in an altered state.
CDP’s success lies in the dedicated staff working for the project, not only in the small management office, but also in the practical areas of the projects. There is an understanding of true collaboration, and the standards and guidelines devised have ensured that all projects play an equal role. The CDP has used the project management structure to achieve usable minimum standards which projects can apply to their imaging while remaining viable in the union catalog. The DPL imaging lab, which was visited in order to see an actual project in operation, is both large and impressive with superb equipment and staff. Interestingly, they all agree that the best digitizers are photographers and not IT specialists. This is logical, for while the processes may be on computer, the knowledge of color and tone is that of a trained photographer.
The committee structure is ideally placed to represent the views of the partners and to develop the guidelines for the CDP as a whole. Training for all is an essential part of the collaborative as well as the practical side of the Project. Flexibility in the union catalog and the format of finding aids have ensured that all the projects’ digital objects are available to all. There are some important lessons to be learned in collaboration from this project, perhaps the main one being that it takes a separate project management team to consolidate and co-ordinate that collaboration.