On September 19, 2000, HATII interviewed Julia Flanders, Director of the Women Writers Project at Brown University. The project has as its long-term goal the symbiosis of early women’s writing and electronic text encoding, thereby making texts written by pre-Victorian women accessible to teachers, students and general readers alike. By the end of the 1980s, researchers at Brown recognized two emerging communities: the growing sphere of early women’s studies and the developing field of electronic text encoding. By reconciling the two through digitization, the Women Writers Project hopes to improve access to the material, while also ensuring its long-term preservation.
No processes or activities led to a formal digitization strategy being developed, nor was a collection survey undertaken. However, the original project bid did identify the corpus to be digitized (women’s writing in English from 1350 to 1850). The corpus was estimated at roughly 10,000 texts. The digitized collection was initially estimated at about 1000 texts, but this has been revised and the target collection size is now open-ended. At this time (1986) the project was centered around a group of scholars from a range of institutions rather than locally within the institution.
Texts were originally selected for inclusion in an electronic anthology, since the first priority was to make texts available for teaching use, to remedy the lack of early women’s writing in print. Subsequent collection development has emphasized broadening the collection in particular areas: the Renaissance period, non-literary texts, religious debate, etc. The original priorities were established by the project scholars; expansion of the collection is now guided by project staff and scholars. The priorities were not formalized in any strategic policy statement but the project guidelines for selecting texts are published on the web.
The objectives that this policy sought to achieve were, firstly, adequate chronological coverage and generic diversity and, secondly, responsiveness to scholarly needs. The project believes that it has achieved these objectives, although there were obstacles: many banal factors affect digitization. For example, some genres (such as drama) take a long time to digitize, which ends up skewing the digitized body, and the health of working relationships with libraries affects the speed of delivery of texts.
Obstacles to planning the development of digital deliverables included a lack of relevant standards in the beginning and a lack of depth in the organization, for example in the ability to train people consistently. In terms of the process of building digital deliverables, obstacles included the length of time it took scholars to understand TEI and how it represented text. There were also other problems of being an early adopter (TEI P1 to P3) and of handling legacy data from before 1993. In the absence of other projects to use as a model, it took some time to evolve an efficient structure; initially there were very few project staff and hence problems communicating between the scholarly board and the encoders.
The primary selection criteria for digitization were the materials’ historical and cultural value, followed by their research significance and teaching and learning potential. Other selection criteria, none of which the project felt could be given a priority ranking, included improved functionality, enhanced access, the potential to reach disadvantaged groups and social inclusion. It was also felt that these selection criteria had changed over time, but only in the specifics as the collection developed, for example through an increased interest in non-literary materials including science and medicine. Related to this, there were also recent changes in prioritization: texts by women of color were given higher priority, while manuscripts, long novels, original cookbooks and material already in print were given lower priority, with the exception of considering novels as a test for OCR and manuscripts as a possible future experiment.
Co-operation has been largely institutional: within Brown University, including the University Library, and with other projects. Recommendations from this experience include pooling resources, avoiding duplication of labor, thinking ahead about publication and ensuring that co-operating bodies have compatible exit strategies.
The Women Writers Project started in 1986, with funding from the NEH starting in 1988 and continuing until June 2000, and is ongoing with no anticipated end date.
The digital deliverables were created primarily for wider access, with research, use as a teaching and learning resource, and experimentation also significant purposes. A further purpose was public access, while revenue generation became a goal of the project. The project had originally produced a statement of intent, but the details of this were not known.
The types of source materials digitized include printed books, printed documents, unbound printed documents and handwritten documents. In all cases the project worked from photocopies or prints from microfilm. The digitized deliverables from these sources were intended to represent a sample of the corpus, rather than the entire body of material. The deliverables were intended to be re-purposed, but in an open-ended way, for example as customized anthologies or Braille versions. However, no specific plans exist, and in practice this has meant only two or three versions of the material: printouts, online and published books. There have been some difficulties in managing these multiple output formats, largely due to a lack of appropriate conversion tools. As a result, a disproportionate amount of time has been taken up with writing and maintaining tools for printing and data conversion.
The Women Writers Project used a number of standards, guidelines or tools for representing content. These included TEI, SGML and XML. For describing content the project used TEI headers and Brown Library has created MARC records to describe the project’s materials, which are distributed publicly. For controlling data values the project developed its own name authority files and did not use any other method because of a lack of resources. The project also looked to guidelines for digitizing particular document types and used TEI as soon as it became available, although extensive modifications were made to it. COCOA and MECS were rejected as guidelines because they did not have an adequate level of international support and standardization. For representing structure the project used TEI, SGML and XML.
In relation to standards in general the project’s view is that there is no ideal in encoding standards and that a project cannot be, for example, “pure TEI” but must customize to some extent. At this point in time all research is trying to establish an ideal.
The primary target audience is
The secondary audience is:
Distant priority audience (ranked 10) includes:
Those outside the intended target audience could use the digital deliverables and the most likely group was those beyond the four-year college level. The profile of the actual users has been the one anticipated.
It was not felt necessary to evaluate the needs of the core group of the target audience, as they had provided suggestions for texts. However, a survey of attitudes to the use of electronic texts was undertaken for the Renaissance women section of the project (69 responses from 300, a response rate of 23%).
The project has acknowledged the needs of those with disabilities through the use of SGML, which means that the website can be used by those with text based browsers. The only limitation to the use of the digital deliverables is minimum system requirements (which are not high) and these are clearly stated.
Advice on project management was available in-house. Last year the project was absorbed into the Scholarly Technology Group (STG), but only for administrative purposes. Formal project management procedures include an advisory group to the executive committee, which in turn advises the project director, who reports to the director of STG. One less effective management procedure was an organizational structure that was insufficiently hierarchical. This had the result of much undifferentiated work revolving around one person (although this could also be a strength).
No feasibility, pilot, time and motion or benchmarking studies were carried out. Planning and scheduling tools include a database to track workflow and paper-based charts. The use of student workers was based partly on pedagogical motives and partly on reasons of quality, as they have no preconceptions, can be trained effectively, and are intelligent and well motivated. Both job descriptions and performance indicators are used for staff, the performance indicators coming from Human Resources.
Digitization is carried out in-house as the project does not believe there is anyone else who could do it with the amount of content encoding involved, and it is therefore easier to transcribe and encode at the same time. For this purpose the project bought in equipment (Mac computers). The process of typing in texts was chosen because the source texts are heterogeneous and OCR cannot be used efficiently.
The number of people working on the project and their capacities are based on 1998/9 figures when staffing was at its capacity. There was one FT project director, at 100% on project, one FT textbase editor who was also a metadata specialist at 100%, one FT trainer at 75%, one FT electronic publications editor at 100%, one FT licensing manager at 51% and 20 (10 FT equivalent) encoders (digitizers). The project also shares one FT administrative manager with STG at 50%. All the staff had humanities backgrounds with the exception of some of the encoders, who were all Brown graduate students and learned humanities computing on the job. There were no content specialists on the project staff. None of the staff was redeployed from other areas and advice on the technical aspects of digitization was available in-house.
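The staffing shares above can be added up as a rough check. The following is a minimal sketch, assuming that each percentage is the share of a full-time post spent on the project and that the shares can simply be summed; the interview itself gives no combined figure, and the function and variable names are illustrative only.

```python
# Staffing shares as reported for 1998/9.
# Assumption: each share is the fraction of a full-time post spent on the project.
STAFF = [
    # (role, full-time-equivalent posts, share of time on project)
    ("project director", 1, 1.00),
    ("textbase editor / metadata specialist", 1, 1.00),
    ("trainer", 1, 0.75),
    ("electronic publications editor", 1, 1.00),
    ("licensing manager", 1, 0.51),
    ("encoders (20 people, 10 FT equivalent)", 10, 1.00),
    ("administrative manager (shared with STG)", 1, 0.50),
]


def total_fte(staff):
    """Sum posts multiplied by share of time, rounded to two decimal places."""
    return round(sum(posts * share for _, posts, share in staff), 2)


if __name__ == "__main__":
    print(f"Estimated full-time-equivalent staffing: {total_fte(STAFF)}")
```

On these assumptions the encoders account for roughly two thirds of the project's staff time.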
Training needs were assessed on an informal/incremental basis. Latterly a training package was put together for the trainer on the application of TEI technical standards. Other training areas identified included the technical operation of digitizing equipment (Macs), metadata creation and documentation system. This training was undertaken by the project’s trainer and was organized in-house using internal consultants as well as independent study and learning on the job. The training has met the needs of the project.
The project is aware of the copyright position of the digital deliverables it has created and this is declared on the license to use the materials but not on the website. It does not own the copyright of the original materials, which are all public domain. No primary material in copyright was digitized, but secondary material was digitized with the owners’ agreement and without formality. Users of the digital deliverables are allowed to make printouts on paper, and under a site license to download to a PC, LAN or WAN. Users are able to download ASCII text files, TEI DTD marked-up text, XML marked-up text and PDF formats. No electronic management system is used to control copying.
Preservation and conservation policies and procedures are not applicable as all material is digitized from photocopies or prints from microfilm.
The cataloging or reference system used before digitization is unknown (materials come from a variety of libraries), but the information that is used in the digitization process (the TEI Header) includes the title, author, size, format, library and anomalies. The project does not have access to all the relevant cataloging information before digitization, and early on the project was ignorant of much of this information at the encoding level. As a result the project has had to locate core reference material for the digital deliverables. An additional problem has been the poor documentation of microfilms.
No materials were altered from the originals because all digitization was done from intermediaries (photocopies, prints from microfilm and, in the past, single microfiche), although the material did not exist only in reproduced form. Some material was rejected because of illegibility.
The project does not catalog the original materials and uses an in-house FileMaker database to catalog the digital surrogates. Data values are organized by controlled vocabulary (the data values in the catalog are author, translator, text format and anomalies) and by a genre and name relationship database.
The metadata records details about the original object, digital object, digitization process, technical details, staffing details and administrative details. The metadata record is created by the encoder (digitizer) from information provided by the textbase editor. The catalog for the digital deliverables is held in electronic form on an intranet and is also available on the internet (website). The original and digital records shared some information and were independent in other respects. The catalog and record are linked through a unique ID for each object, which becomes the ID number for the transcript and TEI header.
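The ID-based link between catalog and transcript can be illustrated with a short sketch. It is not the project's FileMaker schema: the class names, field names and the `TEXT-0001` style identifiers are hypothetical, chosen only to show how one shared ID lets a catalog entry and its transcript (with TEI header) be matched and how unmatched transcripts can be spotted.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogRecord:
    """Hypothetical catalog entry; fields loosely follow those named in the text."""
    object_id: str
    title: str
    author: str
    library: str
    anomalies: list = field(default_factory=list)


@dataclass
class Transcript:
    """Hypothetical transcript; object_id is the same ID used in the catalog."""
    object_id: str
    header_title: str


def link_records(catalog, transcripts):
    """Pair each transcript with its catalog record via the shared ID.

    Returns (linked_pairs, orphan_ids), where orphan_ids are transcript IDs
    that have no matching catalog record.
    """
    by_id = {record.object_id: record for record in catalog}
    linked, orphans = [], []
    for transcript in transcripts:
        record = by_id.get(transcript.object_id)
        if record is None:
            orphans.append(transcript.object_id)
        else:
            linked.append((record, transcript))
    return linked, orphans


if __name__ == "__main__":
    # Placeholder data, not real project records.
    catalog = [CatalogRecord("TEXT-0001", "Example Title", "Example Author", "Example Library")]
    transcripts = [Transcript("TEXT-0001", "Example Title"), Transcript("TEXT-0002", "Untracked")]
    linked, orphans = link_records(catalog, transcripts)
    print(len(linked), "linked;", "orphans:", orphans)
```

Keying both sides on one identifier means a check like this can run whenever new transcripts are added, which is one plausible reason for reusing the catalog ID in the header.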
The formats for retroconverted digital text are:
Some of the text contained non-Latin scripts. Keying-in was used in preference to OCR. The lesson from this experience was that keying-in led the students to become fully involved in the texts, which was valuable for intensive encoding, but that this approach needed a responsive coding scheme.
Quality control procedures for the deliverables are proof-reading, in-house SGML validation tools for checking encoding consistency and values, and a review stage. Quality assurance issues are also reported by users. These quality assurance methods evolved in response to staff input, periodic review and refined methods of proof-reading. Metadata are also proof-read, but not online. One effect on workflow was that placing the encoding review stage after the first proof meant that the proof-readers had to deal with many errors, even if the encoding errors were not repeated.
Users have open access to the catalog, and access to the digitized materials is through an annual license. Users can search on word, phrase and proximity, while Boolean searching can be done on the metadata. Searching is context sensitive at the collection, subset and text levels. This searching is handled by the TOC and main text frames, and there is also a summary of metadata above the title page. The project had to design many DynaWeb, IP and password authentication systems as well as a customized interface for each text. Users can save file sets, and the project hopes to be able to allow annotation shortly. Usage is monitored by automatic data capture, and the project will use formal data collection in the future.
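To make the distinction between phrase and proximity searching concrete, here is a minimal sketch in plain Python. It is not the project's search implementation, which is not described in the interview; the tokenizer, the function names and the sample sentence are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def phrase_match(text, phrase):
    """True if the words of `phrase` occur consecutively, in order, in `text`."""
    tokens = tokenize(text)
    wanted = tokenize(phrase)
    n = len(wanted)
    if n == 0:
        return False
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))


def proximity_match(text, first, second, max_distance):
    """True if `first` and `second` occur within `max_distance` words of each other."""
    tokens = tokenize(text)
    first_positions = [i for i, tok in enumerate(tokens) if tok == first.lower()]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second.lower()]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


if __name__ == "__main__":
    # Invented sample sentence, not taken from the collection.
    sample = "The lady wrote a long letter to her sister"
    print(phrase_match(sample, "long letter"))
    print(proximity_match(sample, "lady", "letter", 4))
```

A phrase search requires the words to be adjacent and in order, whereas a proximity search only requires them to fall within a chosen window in either order.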
Users have to pay for the digital deliverables if they are outside Brown University or are members of the general public. Users are charged annually, pro-rata to the size of the institution, and this charge is collected by invoice. The revenues were estimated in advance; in the first year receipts totalled $139,000 (including advance payment for multi-year subscriptions), and this year receipts were $24,000 plus $58,000 for next year. It is not known whether the revenues generated met those estimated.
Potential users were informed by website, articles and advertisements in print media, conferences, meetings, email mailings, and library links and portals. The most effective methods have been conferences and conventional mail, with an email follow-up.
No front-end or summative evaluation has taken place. Formative evaluation took place with a beta test group through online questionnaires, focus group discussions and observations of user interactions with the system. As a result the interface was adjusted, especially the navigation labels and what information was displayed at what stage, e.g. the TOC and the behavior of frames.
The project has cost $2,801,113 so far and the major sources for this have been Brown University, NEH, Mellon, private donors and industry. In terms of how much it should have cost, the project estimates that 10% of the funding was “wasted”, but of this 10%, 25% was “wasted” in a good way, in learning about the processes. The project felt that grant proposals were under-funded by 25-50%; if it had this money there would be better delivery and an expanded collection through more investment in programming, training, documentation and bibliographic research. It was felt that the use of standards perhaps saved money in the long-term. The NEH were sent reports every six months for their monitoring of the project, and the Mellon Foundation received several reports, but generally funding organizations have not been demanding. The documentation asked of the project by funding organizations consisted of interim and final reports.
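The "wasted" estimate above is a percentage of a percentage, so it can help to work it through. This is a small sketch using the figures stated in the interview; the function name and dictionary keys are illustrative, and the result is only as precise as the project's own rough estimates.

```python
from decimal import Decimal

# Figures as stated in the interview.
TOTAL_COST = Decimal("2801113")          # total project cost so far, in dollars
WASTED_SHARE = Decimal("0.10")           # share of funding estimated as "wasted"
USEFUL_SHARE_OF_WASTE = Decimal("0.25")  # share of that waste spent on useful learning


def waste_breakdown(total, wasted_share, useful_share_of_waste):
    """Split the 'wasted' funding into useful learning and unproductive spend."""
    wasted = total * wasted_share
    useful = wasted * useful_share_of_waste
    return {
        "wasted": wasted,
        "useful_learning": useful,
        "unproductive": wasted - useful,
    }


if __name__ == "__main__":
    for label, amount in waste_breakdown(
        TOTAL_COST, WASTED_SHARE, USEFUL_SHARE_OF_WASTE
    ).items():
        share = amount / TOTAL_COST * 100
        print(f"{label}: ${amount:,.2f} ({share:.1f}% of total)")
```

In other words, the "good" waste is 2.5% of the total spend and the unproductive remainder is 7.5%.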
The project elements being updated and their frequency include: new digitized material (quarterly); new metadata (added annually); changing metadata (tailing off); and the user interface (updated approximately once a year). The project’s preservation strategy for long-term access to the digitized objects is the use of SGML. There are no quality control procedures for life-cycle management.
The project intends to keep the materials available indefinitely. The longer-term sustainability is dependent upon self-generating funds and the resource will generate sufficient income for this, although not all of these funds are secured. The project has an exit strategy based on STG administering the data to subscribers. The loss of the digital deliverable would be a matter of concern.
The survey instrument was particularly well suited to libraries, but some areas needed to be addressed with more nuance for some projects.