David Seaman, Director of the Electronic Text Center at the University of Virginia, was interviewed by HATII on September 22, 2000. The Center began work in 1992 with two intentions: to construct and maintain a collection of SGML texts and images that are accessible on the Internet, and to create user communities capable of navigating the collection and handling the materials. Ideally the project would reach audiences outside the University community, thereby allowing the digitized material to fulfil its potential to extend the user demographic.
The Electronic Text Center did not conduct a collection survey of its own. However, firm collection management policies were already in place within the library and the project bought in what material they could at the start in 1992. Since then there has been a growing preference from vendors and publishers for titles in electronic form with standard markup. This matches the Center’s original aims of providing an online archive of standards-based texts and images and building a user community to create and use them.
Although the Center did not have its own collection survey, its early priorities for digitization were the high-use materials requested by students from the library, such as Twain and, latterly, Japanese texts. Inaccessible items, such as those within Special Collections, were also a priority. Those with an interest in a project, mainly the collection specialist, E-Text Center staff and academic users, establish the priorities for digitization. These priorities have not been formalized in a policy statement, but the Center is beginning to look at digitization for preservation and to undertake digitization for reserve readings in the library.
The primary objective of the Center’s digitization policy is to serve multiple audiences and to build collections for them. Among these audiences, Virginia academic users and other partners have first priority, followed by all other users. The Center feels it has been successful in achieving these objectives as usage continues to grow and the site statistics are watched closely to see which texts people are using.
The Center’s recommendation for other organizations that are attempting to formalize selection criteria in strategic policy statements is not to take on too much: one or two projects with a well-chosen sub-set focus is most appropriate. These must show relevance, especially to the core institution. Organizations must also play to their home strengths and emphasize content and context.
The largest overall obstacle for planning the development of digital deliverables was securing funding. Co-ordinating the project partners was also a challenge, as was preserving constituents’ sense of ownership.
Research, pedagogy and access drive the whole selection and prioritization of materials to digitize. However, there is an increasing movement towards digitizing material for preservation. Consequently, the decision to digitize material is dictated by physical condition and is less dependent on content. The biggest change over time has been in scale, for example the use of vendors.
The Electronic Text Center’s co-operation has involved archives, libraries, museums, academic institutions, corporations, foundations and charities, and government agencies, as well as high schools and individuals. This co-operation has been across all levels — institutional, local, regional, national and international.
The Center began in 1992 and the current status of the program is ongoing with no anticipated end date.
The primary purposes for which the digital deliverables are created are as a teaching and learning resource, as research material, and to provide public and wider access. Secondary purposes are preservation, experiment and revenue generation.
The project produced an explicit statement of intent that covered its rationale, scope, significance and primary audience in a brochure and on the web.
The type of source material digitized includes:
There are no very large format materials that are being digitized by the Center and the material is exclusively paper-based.
Inevitably the material that is digitized is a sample but, where possible, the Center tries to digitize all the material within a collection. The material is intended to be re-purposed for things such as course packs, e-books and print-on-demand (which is just being introduced by organizations such as Barnes and Noble booksellers). The aim is to generate revenue in a non-invasive way. No one expects to get print books for free, therefore some form of micro-charging, for example 10 cents for an e-book, generates revenue while the value remains greater than the charge.
Standards, guidelines and tools used for representing content are:
Standards, guidelines and tools used for describing content are:
Standards, guidelines and tools used for controlling data values are:
As guidelines, the Center looked at TEI and EAD for digitizing particular document types.
Standards, guidelines and tools used for representing structure are:
The Electronic Text Center perceives a nested series of audiences. Of these UVA staff and students have first priority (along with four-year college and graduate school), but the general public also assume a high priority by their sheer bulk, alongside community college and K-12 audiences. A third priority is the growing home schooling movement. These audiences have not been the ones anticipated due to the speed of web development and its rapid accessibility outside the academic community. Furthermore, the Center believes that projects need to be more explicit about what they can do, for example, reaching high school audiences. Evaluation of the target audience occurs through two annual library surveys. The project has not taken account of the W3C’s “Guidelines for Web Accessibility”.
There are limitations on the use of the digital deliverables because of copyright restrictions. For example, the state-wide Viva consortium for purchasing materials does not allow the E-Text Center to re-sell them, claim ownership of them or mirror them elsewhere.
Because of the early start of the Center, it has tended towards involvement of consultants regarding the management of digitization. The E-Text Center’s management is firmly part of the University of Virginia’s library structure.
As with many University organizations the UVA library is a rigidly hierarchical organization and the Center exists because one senior librarian was prepared to take the risks. There is now an increasing willingness to assign staff and faculty positions (which is how the E-Text Center got underway).
There are no formal management structures in place and the greatest luxury for the Center has been to be able to act like a start-up in the critical early years and work at the speed of a business. For quality assurance, the Center’s Director meets weekly for half an hour with his immediate supervisor. In addition, there are several issue-based committees within the library.
Neither pilot nor feasibility studies were carried out at the start because the project was small-scale and “ramping up”. The Center has not carried out time and motion studies or benchmarking except where a grant required it, but it recognizes that this is not ideal and not necessarily an exemplary model.
Job descriptions are highly formalized and semi-annual performance evaluations are undertaken. The Center’s work relies significantly on graduate students; however, it also has a growing full-time staff base. Each person has responsibility within the Center; for example, a graduate student is appointed as the point person for a project.
Bulk digitization is outsourced for reasons of cost-effectiveness; however, smaller volumes and conversion are carried out in-house. Initially there was no in-house equipment except for cameras, and no significant funding was available for equipment, certainly not enough to go large-scale and ensure value.
High-end professional cameras (two Phase One cameras, 10,000 x 12,000 pixels) are used for color digitization of rare books.
Data capture procedures were established for the above-mentioned cameras and associated equipment (database, CD burning, etc.; see EAF Scanning Procedures Document for further details), which have quality assurance as the primary objective.
The benchmark used for image digitization is a Kodak color strip for each book.
The Center employs one full-time director, an associate director and an assistant director, a project manager and a programmer. In addition, there are approximately twelve graduate students during semester at 10 hours per week, rising to 30 hours during the summer. With the exception of the programmer, all have a humanities background and have risen through the ranks of the project.
Advice on the technical aspects of digitization was available in-house.
Training has largely been undertaken in-house and expertise developed at the same rate as the Center itself. Most of the Center staff has had some bibliographical training prior to this job, which has been invaluable. At the beginning, all staff had to learn new computing methods and programs, but their previous use of computers in teaching and research provided a strong foundation for learning new systems. The staff members' strong research and pedagogical backgrounds have also given the Center a good understanding of what a humanities researcher and teacher may want from such a system. Close contacts with UVA departments also help to lure in new graduate student and faculty users.
Also worthy of note are the Center’s efforts in training its users. This training (for example in HTML or TEI) has helped build the user community to the extent that users can create their own texts and contribute them to the Center’s collection.
The Center is aware of the copyright position of the digital deliverables. It does not own the copyright in all the original materials. The copyright status is declared.
Depending on copyright and access restrictions, users are able to download TEI and XML texts.
No electronic management systems are in use, other than domain restrictions.
The Center does not have a conservation procedure for the original material, except insofar as it is beginning to digitize material for preservation.
The original materials are already cataloged by their providers. Some digital surrogates are cataloged on the library’s OPAC system, which is also used for the originals.
Standards or guidelines used for cataloging the digital deliverables are:
Tools used for controlling data values are:
Metadata details recorded are:
Metadata records are created by an archivist/information professional, while a library cataloger fixes errors. The metadata records are included in the main (library) catalog, which is in electronic form and available on the internet. The relationship between the records for the digital deliverable and the original digitized materials is the same. The catalog record and the object are linked by a URL in the MARC 856 field.
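The 856 link described above can be illustrated with a minimal sketch. It assumes a catalog record reduced to a list of (tag, subfields) pairs, with the URL held in subfield u of the 856 field as in standard MARC practice. The record, its title, the URL and the function name electronic_locations are all hypothetical and do not come from the Center's catalog.

```python
# A hypothetical catalog record, reduced to the fields needed here.
# Each field is (tag, {subfield code: value}); the URL is a placeholder.
record = [
    ("245", {"a": "Example title"}),
    ("856", {"u": "http://example.org/etext/example.html"}),
]


def electronic_locations(fields):
    """Return every URL held in subfield u of an 856 field."""
    return [subfields["u"] for tag, subfields in fields
            if tag == "856" and "u" in subfields]


print(electronic_locations(record))
```

A record with no 856 field yields an empty list, which is one simple way to spot catalog entries that have no linked digital surrogate.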
The format for retroconverted text material is:
Some texts contained non-Latin scripts (e.g. Japanese).
OCR is used for smaller jobs and to clean modern typefaces. However, for older materials and when working with large-scale materials, it is more appropriate to use keying-in, as it is the best, fastest and least expensive method.
For images the capture and preservation format is TIFF and the delivery format is JPEG.
Capture and preservation resolution is between 400 and 600 dpi and delivery resolution is 100 dpi. Bit depth throughout is 24-bit color, and JPEG compression is used for delivery to improve access. The program retains the uncompressed scans. The program carries out processing on the JPEG files using Photoshop but does not alter the original TIFF files. The average 100 dpi JPEG file size is 100K, while the 24-bit 400 dpi TIFF original is around 30-40 megabytes. The dynamic range of the equipment is calibrated and recorded in the metadata.
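The relationship between resolution, bit depth and file size quoted above can be checked with some simple arithmetic. The sketch below assumes a 6 x 9 inch page, which is an illustrative figure and not one given in the interview, and it counts only raw pixel data, ignoring file headers.

```python
def uncompressed_bytes(width_in: float, height_in: float, dpi: int,
                       bit_depth: int = 24) -> int:
    """Size of the raw pixel data for a scan, ignoring file headers."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bit_depth // 8


# Hypothetical 6 x 9 inch book page (not a figure from the interview).
master = uncompressed_bytes(6, 9, 400)    # capture/preservation scan
delivery = uncompressed_bytes(6, 9, 100)  # pixel data behind the delivery image

print(f"400 dpi master:   {master / 1_000_000:.1f} MB")
print(f"100 dpi delivery: {delivery / 1_000_000:.2f} MB before JPEG compression")
print(f"pixel ratio:      {master // delivery}x")
```

Under these assumptions the 400 dpi master holds sixteen times as many pixels as the 100 dpi delivery image, and a slightly larger page would bring the master into the 30-40 megabyte range reported. If 100K is read as roughly 100,000 bytes, the delivery JPEG would be compressed by a factor of about sixteen relative to its raw pixel data.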
A recommendation from this area would be that when an object can be scanned as well on a flatbed scanner as on a digital camera, then it would be preferable to use the flatbed scanner (for example in the case of a smaller manuscript page).
The Center provides a high level of public access to its browsable and searchable texts. Thousands of these texts and images are publicly accessible, although these etexts are not necessarily in the public domain (they cannot be re-published without the Center’s permission).
Because of contractual obligations with the vendors who supply the texts and the search software, access to parts of the on-line text service is restricted to the parties covered by the various license agreements (usually University of Virginia users only).
The Center has not conducted any formal evaluations, but the close contact and involvement of users means the Center enjoys increased feedback on all aspects of its work: collection development, text encoding, image type and quality, and the design of the search and browse pages. A central lesson has been the validation of some of the Center’s primary choices: the 24-bit TIFF color images and the TEI texts that it has produced for years (somewhat on the promise of future software and hardware developments) are providing the flexibility the Center needs to respond to a changing web environment and a developing set of user demands.
Users are able to browse and search by keyword according to collection, subject, publisher, place of publication and further by author’s name or date range. Compound searches can also be performed.
Users do not have to pay to use the digital deliverables, but non-UVA users cannot access material that has been licensed just to UVA.
Potential users are informed about the digital deliverables by website announcements, press releases, articles in print media, print media coverage, conferences and meetings and email shots. The most effective dissemination strategy, in terms of building a sustainable resource, is based on the Center’s local focus, and the grounding effect this produces.
New works are added on a continuing basis, and the Center is working its way through the public ASCII texts on the Internet whose provenance can be determined.
Each day the Center receives around 38,000 individual user hits from 19,000 unique internet host machines, accessing over 130,000 items. This is monitored by automatic data capture.
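The interview does not say how the automatic data capture works. As one possible illustration, the sketch below counts requests, distinct hosts and distinct items from web server log lines. It assumes the logs are in Common Log Format; the sample lines, host names, paths and the function name summarize are invented for the example, and the way the Center's own terms (hits, hosts, items) map onto log fields is also an assumption.

```python
import re

# Hypothetical log lines in Common Log Format; hosts and paths are invented.
SAMPLE_LOG = """\
host1.example.edu - - [22/Sep/2000:10:00:00 -0400] "GET /texts/a.html HTTP/1.0" 200 1024
host2.example.org - - [22/Sep/2000:10:00:05 -0400] "GET /texts/b.html HTTP/1.0" 200 2048
host1.example.edu - - [22/Sep/2000:10:01:00 -0400] "GET /texts/b.html HTTP/1.0" 200 2048
"""

# host, two ignored fields, timestamp, request line, status code, size
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)[^"]*" \d{3} \S+$')


def summarize(log_text):
    """Count requests, distinct hosts and distinct items in a log."""
    hosts, items, requests = set(), set(), 0
    for line in log_text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # skip lines that do not parse
        requests += 1
        hosts.add(match.group(1))
        items.add(match.group(2))
    return {"requests": requests, "hosts": len(hosts), "items": len(items)}


print(summarize(SAMPLE_LOG))
```

Run daily over the real logs, a summary of this kind would yield figures comparable to those quoted, and it would also show which texts are requested most often.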