Archiving Issues Connected with Electronic Publishing: A Case Study within a Digital Library Project.

William H. Mischo

Grainger Engineering Library Information Center, University of Illinois at Urbana-Champaign, USA

Paper presented to the ICSU Press Workshop, Keble College, Oxford, UK, 31 March to 2 April 1998

INTRODUCTION

The University of Illinois at Urbana-Champaign (UIUC) was one of six sites awarded a four-year federally funded grant in 1994 under the first phase of the Digital Library Initiative (DLI). The DLI Program was jointly funded by the United States National Science Foundation (NSF), the Defense Advanced Research Projects Agency (DARPA), and the National Aeronautics and Space Administration (NASA). The other five grant recipients were Stanford University, the University of California at Berkeley, Carnegie Mellon University, the University of California at Santa Barbara, and the University of Michigan. A detailed description of the Illinois DLI project, along with links to the other five projects, can be found at http://dli.grainger.uiuc.edu/; the project is also described in (1, 2).

Work on the Illinois DLI grant is being carried out by a multi-departmental research team comprised of individuals from the university's Graduate School of Library and Information Science, the University Library, the National Center for Supercomputing Applications (NCSA), and the Department of Computer Science. The project also includes important contributions in-kind from a number of publishers and software companies and equipment grants from several hardware companies. The Illinois DLI project includes research, testbed, and evaluation components.

The overarching focus of the Illinois DLI project has been on the design, development, and evaluation of mechanisms to provide enhanced access to full-text scientific and engineering journal articles within an Internet environment. The development of the technology to provide access to the full-text journal literature has been a major emphasis of the publishing industry, information providers, and libraries. A principal focus of the Illinois DLI project has been the creation and implementation of an operational Testbed of full-text journal articles in SGML format. This Testbed includes an associated Web-based retrieval and rendering system, called DeLIver (Desktop Link to Virtual Engineering Resources), which provides broad access to the full-text material.

The Illinois Testbed is maintained and administered within the Grainger Engineering Library Information Center, a $22 million facility that opened in 1994 and is dedicated to the exploration of emerging information technologies. The Illinois DLI Testbed is presently comprised of the article full-text and the associated metadata and bit-mapped images of figures for 54 journals containing 49,000 articles supplied by six scholarly professional societies in physics and engineering. The full-text articles for the Testbed have been contributed by: the American Institute of Physics (AIP), the American Physical Society (APS), the American Society of Civil Engineers (ASCE), the Institute of Electrical and Electronics Engineers Computer Society (IEEE CS), and the Institution of Electrical Engineers (IEE).

The Testbed has been operational since September 1996. A detailed analysis of Testbed transaction logs comprising over 4,500 individual search sessions is underway. In the next phase of the Testbed, the breadth of the Testbed will be expanded by adding other journals from the current publishers and by adding several other publishers to the mix.

This paper will outline the Illinois DLI Testbed goals and briefly describe the primary issues addressed in the project, the technologies employed, and the Testbed team's principal accomplishments and findings. The primary focus of the paper will be on issues connected with the archiving of electronic materials, such as full-text, that may be exclusively available online. Archiving of electronic resources is an issue being addressed both in the Illinois DLI project and by other DLI grant recipients.
 

TESTBED GOALS AND OBJECTIVES

The DLI Testbed team is examining processing and access issues connected with the local mounting of full-text journal articles and is also exploring the potential role of full-text in library services. The primary goals of the DLI Testbed team are:

(1): the construction and testing of a multi-publisher SGML-based full-text Testbed employing flexible search and rendering capabilities and offering rich links to internal and external resources;

(2): the integration of the Testbed (and full-text in general) as a resource for users into the continuum of information resources offered to end-users by the Library system;

(3): determining the efficacy of full-text article searching vis-à-vis document surrogate searching, and exploring end-user full-text searching behavior in an attempt to identify user-searching needs; and

(4): identifying models for effective publishing and retrieval of full-text articles within an Internet environment and employing these models in the Testbed design and development.

SGML (Standard Generalized Markup Language) is the open standard for document transmission and display (see http://www.sil.org/sgml/sgml.html). SGML is not in itself a markup language, but rather a template or model for marking up the content and structure of a document. The Document Type Definition (DTD) describing an individual publisher's SGML is the instrument that specifies the semantics and syntax of the tags to be used in the document markup. The DTD also specifies the rules that describe the manner in which the SGML tags may be applied to the documents. One of the major roadblocks in the successful deployment of the DLI Testbed has been the need to write software that efficiently processes the heterogeneous DTDs of the publishers. In the process of creating a viable Testbed, the Illinois Testbed team developed a number of techniques to address identified problems with SGML processing, normalizing, indexing, storage, retrieval, and rendering.
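The normalization problem described above can be illustrated with a minimal sketch. It is not the Testbed's actual processing software: the publisher names, tag names, and sample articles are hypothetical, and the input is assumed to be well-formed, XML-like markup so that Python's standard library parser can read it (real SGML may omit end tags and would need a true SGML parser). The idea shown is a per-publisher table that maps each DTD's element paths onto one shared set of metadata fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag maps: normalized field name -> element path in that
# publisher's markup. Real publisher DTDs use different names and structures.
TAG_MAPS = {
    "publisher_a": {"title": "front/atl", "author": "front/aug/au", "abstract": "front/abs"},
    "publisher_b": {"title": "header/title", "author": "header/authors/name", "abstract": "header/summary"},
}


def normalize(markup: str, publisher: str) -> dict:
    """Extract a publisher-neutral metadata record from one article."""
    root = ET.fromstring(markup)
    record = {}
    for field, path in TAG_MAPS[publisher].items():
        # Collapse runs of whitespace in the text of each matching element.
        values = [" ".join((el.text or "").split()) for el in root.findall(path)]
        if field == "author":
            record[field] = values  # an article may have several authors
        else:
            record[field] = values[0] if values else ""
    return record


# Two invented articles marked up under two different (invented) DTDs.
ARTICLE_A = (
    "<art><front><atl>Thin films</atl>"
    "<aug><au>A. Smith</au><au>B. Jones</au></aug>"
    "<abs>We study films.</abs></front></art>"
)
ARTICLE_B = (
    "<article><header><title>Bridge loads</title>"
    "<authors><name>C. Lee</name></authors>"
    "<summary>Load tests.</summary></header></article>"
)

if __name__ == "__main__":
    print(normalize(ARTICLE_A, "publisher_a"))
    print(normalize(ARTICLE_B, "publisher_b"))
```

Both calls return records with the same three keys, so downstream indexing and short-entry display code would not need to know which DTD an article came from.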

To support effective retrieval in the Testbed, the Illinois DLI Testbed and Evaluation teams have also carried out studies of end-user searching behavior in an attempt to identify user-searching needs. One requirement specified by the Testbed team from the onset of the project has been that the Testbed as a resource for users must be integrated into the continuum of information resources offered by the Library system. This has been primarily accomplished in two ways: by making the Testbed a search option within the Library public terminal top-level menu; and by linking Testbed full-text records from the short entry displays within the Ovid Compendex and INSPEC periodical index databases. Additional simultaneous search mechanisms are being explored.

An important concern of the Testbed group has been exploring effective retrieval models for a Web-based electronic journal publishing system. The retrieval and display of full-text journal literature in an Internet environment poses a number of issues for both publishers and libraries. It has now become commonplace for both major and small-scale publishers to provide Internet (Web-based) access to their publications, particularly journal issues and articles. For libraries and information providers, support for the online journal environment necessitates changes in collection policy, user access mechanisms, equipment provision, etc.

The Testbed team has been examining the issues involved in the switch from a print-based journal environment to the Internet-based model, with a special eye toward providing retrieval mechanisms to optimize user access to full-text journals. To support this, the Testbed team has proposed a distributed repository model that 'federates' or connects the individual publisher repositories of full-text documents. In the DLI Testbed model, these distributed repositories are federated by the extraction of normalized metadata, index, and link data from the heterogeneous full-text of the different publishers. This model addresses the challenge of providing standardized and consistent search capabilities across these distributed and disparate repositories.

The Testbed team has succeeded in demonstrating the efficacy of the distributed repository model by producing cross-DTD metadata, providing parallel database querying and distributed retrieval techniques across a distinguished subset of the full-text repositories, and by setting up and employing an off-site repository at the site of a publisher.
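A minimal sketch of the federated retrieval idea follows, under stated assumptions. The repositories here are small in-memory lists standing in for per-publisher metadata indexes; the repository names, records, and function names are hypothetical and do not describe the DeLIver implementation. The sketch sends one query to every repository in parallel and merges the hits into a single result list tagged with its source.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for normalized metadata indexes held at
# separate publisher repositories.
REPOSITORIES = {
    "publisher_a": [
        {"id": "a1", "title": "Thin film deposition"},
        {"id": "a2", "title": "Laser cooling"},
    ],
    "publisher_b": [
        {"id": "b1", "title": "Bridge load testing"},
        {"id": "b2", "title": "Thin shell structures"},
    ],
}


def search_repository(name, term):
    """Search one repository's metadata; tag each hit with its source."""
    needle = term.lower()
    return [
        dict(record, repository=name)
        for record in REPOSITORIES[name]
        if needle in record["title"].lower()
    ]


def federated_search(term, names=None):
    """Query the chosen repositories in parallel and merge the results."""
    names = sorted(names if names is not None else REPOSITORIES)
    if not names:
        return []
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        result_lists = list(pool.map(lambda n: search_repository(n, term), names))
    merged = [hit for hits in result_lists for hit in hits]
    # A deterministic order keeps the merged display stable between runs.
    return sorted(merged, key=lambda hit: (hit["repository"], hit["id"]))


if __name__ == "__main__":
    for hit in federated_search("thin"):
        print(hit["repository"], hit["id"], hit["title"])
```

Passing a list of names to `federated_search` restricts the query to a subset of the repositories, which mirrors the subset-based retrieval described above. In a real deployment each `search_repository` call would be a network request to a remote index rather than a list scan.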

Particularly relevant in any electronic journal publishing and retrieval model is the prominent role currently being played by the professional societies and commercial publishers. It appears that the electronic publishing of scientific articles is coalescing around the same professional society and commercial publisher model that dominates today's print-centric world. The Testbed team's distributed repository retrieval model provides a mechanism for retrieval across subsets of the full-text publisher repositories without the requirement that users go to each publisher site to perform individual searches.
 

ISSUES

In the course of the DLI Project, the Testbed team identified and examined a number of key issues in full-text retrieval and rendering. From the user studies, it was clear that users desired a workbench approach to information, in which full-text is one of many tools used to meet specific information needs. In the course of the project, the Testbed team specified a system with dynamic linking capability centered around simultaneous searching of multiple databases. From this arose the concept of extended full-text or 'Smart Documents', which function as containers with links to multimedia objects that are related to, or expand upon, the full-text.

A number of other issues were addressed, including: the efficacy of searching full-text as opposed to a document surrogate of an article containing title, abstract, and controlled vocabulary; the role of controlled vocabulary; the difficulty of accurately rendering display mathematics; the problems with performing effective information retrieval within the constraints of the stateless HTTP protocol of the World Wide Web; and the problems posed by inadequate depth and breadth of collection (i.e., too few titles, not going back far enough in years).
 

FEATURES AND ACCOMPLISHMENTS

In the course of the project, the Testbed team has developed or explored the following:

The cornerstones of the Testbed, in terms of retrieval capabilities, lie in the effective utilization of the article content and structure, as revealed by SGML, and the associated article-level metadata, which serves to normalize the heterogeneous SGML and provide short-entry display capability. The metadata also contains links to internal and external data, such as forward and backward links to other Testbed articles and links to A & I Service databases and other repositories. The metadata and index files, which contain pointers to the full-text data, can be stored independently of and separately from the full-text.

It is clear that a rich markup format such as XML (eXtensible Markup Language), which is a simplified subset of SGML, will become the language of open document systems. SGML permits documents to be treated as objects to be viewed, manipulated, and output. The major strength of SGML, in terms of its retrieval capabilities, lies in its ability to reveal the deep content and structure of a document. While SGML is becoming ubiquitous in the publishing world, it is still, for the most part, being generated by publishers as a byproduct, rather than serving as an integral part of their production process.
 

DIGITAL ARCHIVING ISSUES

One of the areas of investigation for the Illinois DLI Testbed team has been the issues surrounding the archiving of digital journals, including the storage requirements and technologies. Traditionally, libraries have assumed primary responsibility for the archiving of print journals. The question of who can and should assume primary responsibility for the archiving of electronic journals is important to publishers, professional societies, librarians, archivists, and government agencies, particularly the national libraries. The rapid growth in electronic versions of print format journals and the introduction of new journals published only in electronic format have intensified the concerns over archiving issues.

It is important to note that the term 'archiving' denotes not only the storage of materials but the systematic organization and exhaustive provision of access to the same materials. In the case of electronic materials, one of the major stumbling blocks to providing access has been the wide variety of formats used. Interestingly, electronic archiving issues are somewhat analogous to the nuclear waste problem: 'technology created this problem, but technology does not have a consensus tool to solve the problem'. In the case of electronic files, this has been expressed as: 'I can read a printed book published 300 years ago, but it is impossible for me to read a Microsoft Word II document written in 1988.'

The responsibility for storing and archiving electronic journals has, for the most part, fallen to the publishers who are mounting their own full-text repositories. Archiving back volumes of electronic journals is normally a part of the publisher's business plan. From the library and archival standpoint, this presents potential problems because publishers go out of business, or merge with or are absorbed by other publishers. In several countries, the national libraries have been assigned the responsibility for archiving copies of electronic journals produced by their domestic publishers. In these cases, the incompatible document formats in which publishers submit their titles have prevented the establishment of viable archives.
 

DIGITAL ARCHIVING COSTS

Issues connected with the archiving of electronic materials have been examined in the DLI project at the University of California at Berkeley with the Terabyte Store and in the Illinois DLI Testbed. Both of these projects have focused on questions connected with the format of the materials and with their storage needs and requirements. Also examined have been technical issues associated with computer operating system limitations (such as the 2 gigabyte file size limit in UNIX systems) and disk file partitions that expedite retrieval speed.

There is a long history of experience with the online storage of bibliographic data in machine-readable format by the abstracting and indexing service vendors such as Dialog and BRS. These organizations have changed their storage philosophies over the last twenty years as storage technologies have evolved. Early systems emphasized a hybrid approach with current data stored online and older data offline. With the introduction of high-density disk storage technologies in the mid-1970s, such as the IBM 3350 disk, it became possible to move to an all-online approach.

When the Illinois DLI Testbed was begun in 1995, the average cost of 1 gigabyte of DASD (Direct Access Storage Device) disk storage was $1,000 ($1K) in United States dollars. In 1998, the Testbed is utilizing off-the-shelf 23-gigabyte DASD devices from the Seagate Corporation selling for $1,600. Also available are Seagate 47-gigabyte drives selling for $3,500. Clearly, the cost of DASD devices has been dramatically reduced as the storage capabilities have increased.

The Illinois DLI Testbed's 54 full-text journals, covering the four years from 1995 to the present (June 1998), contain approximately 50,000 articles. These articles comprise SGML text, bit-mapped images and figures, index data and metadata, and, for most of the articles, an associated PDF file. The DASD storage requirement for the present DLI Testbed, as defined above, is 60 gigabytes, at a cost of $5,700 in 1998 prices.

Extrapolating the storage requirements for this four-year digital collection of 54 full-text journals to 80 years of archival storage yields a storage requirement of 1,200 gigabytes (1.2 terabytes). This 1,200 gigabytes, at 1998 prices, would cost $90,000. For a specific publisher with a catalog of some 50 journals, this would be a very reasonable investment. Interestingly, at 1994 prices, the 1,200 gigabytes would have cost $1.2 million.

However, a typical academic engineering/science library may subscribe to some 1,600 journals. Mounting and archiving 1,600 full-text journals comprising 1.5 million articles for four years (1995-1998) would require 1,800 gigabytes at a 1998 cost of $150,000. Extending the archive of 1,600 full-text journals to 80 years would require 36,000 gigabytes (36 terabytes) of DASD. The 36,000 gigabytes at 1998 prices would cost $2,800,000. If disk drive costs fall by a factor of 10 over the 80 years (a conservative estimate given the 25-fold reduction from 1994 to 1998), the storage costs would drop to $280,000.
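The extrapolations in this section are simple linear scalings, and the short sketch below reproduces them. It assumes, as the text implicitly does, that the volume of material added per year stays constant; the function names are illustrative only, and all input figures are the ones quoted above.

```python
def scale_storage(gb_measured, years_measured, years_target):
    """Linear extrapolation, assuming constant growth in gigabytes per year."""
    return gb_measured * years_target / years_measured


def cost_per_gb(total_cost, total_gb):
    """Average price per gigabyte implied by a quoted total."""
    return total_cost / total_gb


# Figures quoted in the text.
publisher_gb = scale_storage(60, 4, 80)      # 54-journal Testbed over 80 years
library_gb = scale_storage(1_800, 4, 80)     # 1,600-journal collection over 80 years

publisher_rate = cost_per_gb(90_000, publisher_gb)   # implied 1998 dollars per GB
library_rate = cost_per_gb(2_800_000, library_gb)    # implied 1998 dollars per GB

cost_at_1994_price = publisher_gb * 1_000    # at $1,000 per gigabyte
library_cost_after_10x_drop = 2_800_000 / 10

if __name__ == "__main__":
    print(f"Publisher archive: {publisher_gb:,.0f} GB at about ${publisher_rate:,.0f}/GB")
    print(f"Library archive:   {library_gb:,.0f} GB at about ${library_rate:,.0f}/GB")
    print(f"Publisher archive at $1,000/GB: ${cost_at_1994_price:,.0f}")
    print(f"Library archive after a 10x price drop: ${library_cost_after_10x_drop:,.0f}")
```

The two quoted totals imply per-gigabyte prices of roughly $75 and $78, in line with the drive prices cited earlier for 23-gigabyte and 47-gigabyte devices.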
 

ARCHIVING CONCLUSIONS

Clearly, advances in disk storage devices, in terms of increases in storage capacity and concomitant reductions in cost, have resulted in increased flexibility for text archiving and online access. Direct access devices offer backward compatibility, scalability, and effective access capabilities. The increases in storage density and reductions in cost also allow both publishers and information providers, such as libraries, to store and provide access to full-text data at reasonable storage prices. However, problems remain with operating system and database management software limitations.

In addition, libraries and publishers will need to bear the increased costs of high-end database and Web servers, the software necessary to provide effective retrieval capabilities, and the computer programming and hardware and network maintenance staff to provide these services.

The major issues connected with archiving electronic journals are not technology-related. The technologies needed for archiving of large stores of digital materials are within reach for both publishers and libraries. Rather, the real issues have to do with inconsistent and incompatible data storage formats being used by publishers and concerns over actually delegating to the publishers the primary responsibility for archiving.

Clearly, individual academic libraries will not be in a position to, or be given license to, locally mount a complete corpus of scientific and engineering full-text journals. The electronic journal environment is, and will be, publisher driven, and the primary role of the academic library may well be to provide gateway and navigation functions for users lost in this world of distributed full-text repositories. It is incumbent on libraries and library agencies to work closely with publishers on defining standards for full-text archiving and to help to shape legislation that will establish standardized archives for electronic journals. Only in these ways will effective archiving of electronic materials be realized.
 

NOTES

The Softquad Panorama Viewer is described at 'http://www.softquad.com/products/pc-pview.htm'.  Panorama is a Web-enabled SGML viewer that is available as a plug-in for the Netscape and Microsoft Internet Explorer Web browsers.

Ovid Technologies (http://www.ovid.com/) markets stand-alone bibliographic retrieval software which provides access to locally mounted bibliographic and full-text databases. The DLI project uses Ovid as a platform for providing access to the INSPEC and Compendex databases.

The proceedings of the mathematics workshop held at the Grainger Library on May 1, 1996 are available by contacting the Grainger Library. (It is NOT on the DLI Web site).

BIBLIOGRAPHY

1. Bruce Schatz, William H. Mischo, Timothy W. Cole, Joseph Hardin, Ann Bishop, and Hsinchun Chen, "Federating Diverse Collections of Scientific Literature," IEEE Computer 29, no. 5, 1996, pp. 28-37. http://www.computer.org/pubs/computer/dli/r50028/r50028.htm.

2. Bruce Schatz, William H. Mischo, Timothy W. Cole, Ann Bishop, Susan Harum, Eric Johnson, Laura Neumann, and Hsinchun Chen, "Federated Search of Scientific Literature: A Retrospective on the Illinois Digital Library Project," IEEE Computer, February 1999, in press.

Last updated : 25 August 1998
Copyright 1998 ICSU Press and individual authors. All rights reserved