The Internet Library of Early Journals

Presented at the Aslib Electronics and Multimedia Groups Annual Conference, 14-16 May 1997.
Bill Jupp, Leeds University Library / Internet Library of Early Journals


Contact details:
Bill Jupp
Edw. Boyle Library,
University of Leeds,
Leeds LS2 9JT
Tel. 0113 233 5565
Fax. 0113 233 5539
Email: w.p.jupp@leeds.ac.uk

Abstract

The author begins by providing a brief overview of the different approaches available for the digitisation of library materials and indicates the benefits and shortcomings associated with each. It is from this perspective that the reader is introduced to the rationale behind the Internet Library of Early Journals, which is mounting material drawn from six 18th and 19th century journal titles. A large corpus of material is required to attract scholars to an electronic archive of this nature and the project aims to mount 120,000 page images. Pages are scanned from both paper and microfilm originals: the paper volumes using a Minolta PS3000P overhead scanner and the microfilm using a Meckel MX500XL-G. The requirement for high volume, high throughput, low cost production of images excludes labour intensive operations, and the images are simply OCRed without manual correction. The Excalibur EFS database is used to provide full text fuzzy searching of the uncorrected OCRed text. The project has just started full production and the author discusses the findings so far.

Introduction

The Internet Library of Early Journals is a two-year collaborative project by the universities of Birmingham, Leeds, Manchester and Oxford, funded within the Electronic Libraries Programme (eLib). The project (ILEJ) is due for completion in mid-1998, when a full report will be made available. The intention is to create a full text resource of 120,000 digitised pages, with indexes, of substantial runs of three 18th and three 19th century journals, to make these available as widely as possible to the academic community, and to evaluate both the technology and user perception. The journals selected are Notes and Queries, Blackwood's Edinburgh Magazine and The Builder from the 19th century and The Gentleman's Magazine, The Annual Register and Philosophical Transactions of the Royal Society from the 18th century. To date, the first title, Notes and Queries, is available and can be accessed at http://www.bodley.ox.ac.uk/ilej. This paper aims to give an introductory overview of the possible approaches to digitising library materials in order to show the complex and often conflicting decisions that have to be made. From this perspective the decisions made within the ILEJ project are introduced and located within the digitisation spectrum.

Full text resources

Traditional library computational aids (eg catalogues, databases) have provided the researcher with a list of bibliographic references. She must then take this list to the shelves or to inter-library loans to gain access to the physical material. Of course both of these options may involve delays, waiting for another user to return a book or awaiting delivery from another library. This model, in which the electronic data is solely bibliographic and simply references the physical paper material, is now changing with the availability of full text resources stored electronically. The decrease in the price of computing power means that it is now cost effective to store complete texts electronically, providing both new benefits and new challenges to the user. The advantages of full text relate primarily to access. Immediate and universal access to the material can be provided without delays. But also of major benefit is the enhanced access to the text itself: the ability to search a complete electronic corpus almost instantly; the opportunity for automatic textual analysis; and the facility for electronic documents to be linked directly to other relevant resources (for example, the British Library's Electronic-Beowulf, which provides linkage between three versions of the text (Electronic-Beowulf, 1997)). The benefits of electronic full text are such that these resources are proliferating rapidly throughout the library world.

Approaches to creating a full text resource

The benefits to the researcher of course bring difficulties for the librarian, or at least difficult decisions. There are many different approaches to providing a full text resource and it is not clear at this stage which techniques are best for which types of material: it is a time of experimentation. Indeed the Internet Library of Early Journals is just such an experiment. The following paragraphs give a brief introduction to the approaches that have been taken by various projects, and highlight the complexities of the whole digitisation question.

Text Files

By far the simplest method of providing an electronic full text resource is simply to make available a text file. Where the original document was created electronically, it is a trivial and low cost process to make it available for download and searching. This method also has the advantage that the user is seeing the original text as delivered by the author without the involvement of any conversion process; what you see is what the author wrote (WYSIWAW?). There are still issues of presentation with this scenario (are you viewing the document with the correct font, etc) but the integrity of the actual text is assured.

Difficulties with digitisation begin to arise when making available material that was not created electronically. The ILEJ project, for example, faces this difficulty in its focus on 18th and 19th century texts. Again, the most straightforward solution would seem to involve simply typing the text into a computer. This, however, can be an expensive process and, perhaps more importantly, is prone to typing errors: the researcher is no longer guaranteed the integrity of the text. What you see may NOT be what the author wrote. It is possible to increase accuracy by having the text entered by two or even three typists and comparing the results electronically, correcting any mismatches between the different versions. This will assure an almost, but not guaranteed, perfect rendition of the original. But imagine an etymologist finding two different spellings of the same word in an electronic text. Can she be assured that it accurately represents the original text as written by the author? And imagine the cost of three typists entering all your data! Retyping the original document is an expensive process that is open to errors; nevertheless, very large resources of older material are becoming available to the user community, on CD-ROM or via the Internet, using exactly this technique, among them Chadwyck-Healey's very valuable English Poetry and Patrologia Latina (O'Rourke, 1996).
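
A minimal sketch of the double-keying comparison just described (the file names and layout are hypothetical, and a real workflow would also handle differing line counts):

    # Compare two independently keyed transcriptions line by line and
    # flag mismatches for an editor to resolve against the original.
    # Note: errors made identically by both typists slip through, which
    # is exactly the residual risk described above.
    def compare_keyings(keying_a, keying_b):
        mismatches = []
        for line_no, (a, b) in enumerate(zip(keying_a, keying_b), start=1):
            if a != b:
                mismatches.append((line_no, a.rstrip(), b.rstrip()))
        return mismatches

    with open('typist_a.txt') as fa, open('typist_b.txt') as fb:
        for line_no, a, b in compare_keyings(fa.readlines(), fb.readlines()):
            print('line %d: %r vs %r' % (line_no, a, b))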

Digital Images

The plummeting price of computing power also means that the processing of large computer images is now cost effective, providing an alternative method of delivering full text resources. Instead of typing in the text, the original page is scanned and the user is presented with a computer image rather than a text file. This is much quicker and cheaper than having each page hand typed, and the integrity of the text is assured because the user views an image of the original page. Our etymologist can now sleep peacefully at night.

The fundamental problem with this method is that at present it almost always involves dismembering and trimming the originals so that they can be fed through a sheet feeder into a flatbed scanner. If there is a sufficient margin they can of course be rebound, but this adds significant cost to the operation. In most cases the original volume is simply discarded, the electronic surrogate becoming the primary resource. Discarding the original raises further problems and expense. The original is lost, so its value over and above the electronic version must be accounted for. No matter how high a quality of scan is achieved, information available in the original will be lost: imagine an individual researching the history of paper manufacture; there is no information for her in an electronic image.

Not only is there an inevitable loss of information but there are also the costs of maintaining the now primary electronic version. The fact that the electronic version is the primary source itself increases its cost, because the electronic copy must be of an archival quality that can fulfil all potential future uses of the material. This requirement increases storage costs: higher quality images involve larger file sizes and therefore larger disks, longer tapes or more CDs on which to store them. Further, the costs of maintaining an electronic archive for long periods into the future will be considerable. Simply backing up the material onto magnetic tape and storing it in a cupboard will not suffice. Will you have the correct format of tape drive available in ten years' time? Will you have the correct software to read the images? Will you have the correct computer hardware and operating system to run the software? Is the integrity of the media dependable for that period, or will you need to "refresh" the data onto new media after a fixed number of years? The costs of maintaining electronic data are considerable and must be costed when considering the destruction of the original. In large digitisation programmes this cost can of course be offset against the resources freed by the destruction of the original, eg shelf space in buildings with expensive reinforced flooring.

Having said all this, destroying the originals and storing the data in electronic form is still a viable method of creating a full text resource, particularly where the original is already deteriorating; Cornell University have pioneered the methodology for this approach (Kenney and Chapman, 1996). A variation of this method is to store the primary copy on microfilm rather than electronically. This has the benefit of not being dependent on the highly dynamic changes of computer technology: a microfilm can simply be stored in the cupboard and retrieved again in ten years' time, and equipment to read (or scan) microfilm will still be around. This hybrid approach, providing an electronic copy for access and a microfilm copy for archive, removes much of the expense and concern of maintaining the digital images. Yale University is active in developing methodologies for converting microfilm to digital images (Conway, 1996).
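
To give a rough, hypothetical sense of the storage arithmetic (the per-page figure is an assumption for illustration, not a project measurement): a 400 dpi bitonal page compressed with ITU Group 4 typically occupies on the order of 100 KB, so a collection on the scale of ILEJ requires roughly

\[ 120{,}000\ \text{pages} \times 100\ \text{KB} \approx 12\ \text{GB}, \]

before any allowance for higher-resolution archival masters, backup copies or media refreshment.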

Indexing Digital Images

Assuming that all these difficulties have been successfully accommodated, there are still problems associated with a full text resource of digital images. Firstly, image files are significantly greater in size than typed text files. Not only does this increase storage costs but it also increases download times, which can be significant in a networked or internet environment and may leave a reader waiting long periods for delivery, with the associated frustration. Secondly, having page images is all very well, but haven't we lost the second of our access benefits which initially prompted the urge to digitise: the ability for full text searching and electronic analysis? In some situations, availability of the originals is so restricted that access is worthwhile in itself. This is particularly true with very valuable material such as medieval manuscripts, but can also be the case if the material is less valuable but physically remote. The Research Libraries Group Studies in Scarlet collection (Research Libraries Group, 1996), for example, is mounting some unindexed images. But generally a collection of pages that have simply been scanned is of little use to a researcher; the additional value provided by electronic indexes and full text is needed to make those images useful. These indexes may simply be hand typed copies of the original indexes and contents pages, but this fulfils only the first of our access requirements: immediate availability. In terms of opening up the text for further searching and analysis, this method offers no benefit over the original volumes. Deeper indexes can be produced manually, but this of course is an expensive process. However, in some cases such indexes have already been created by third parties and may be available.
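
To make the distinction concrete, here is a minimal, hypothetical sketch of a hand-typed index linked to page images (headings and file names are invented for illustration). It delivers immediate availability, but searching is limited to whatever headings the original indexer chose:

    # A hand-typed index maps each heading to the page images it covers.
    index = {
        'Shakespeare, William': ['nq_v01_p042.gif', 'nq_v01_p043.gif'],
        'paper manufacture': ['nq_v02_p117.gif'],
    }

    def lookup(heading):
        # Only exact headings can be found; the text on the pages
        # themselves remains unsearchable.
        return index.get(heading, [])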

OCR

The method of obtaining full text searching from digitised page images is to run the images through an Optical Character Recognition (OCR) program, which creates text file versions of the images that can then be searched. This is an automated process and OCR software is relatively cheap, so the cost of converting large quantities of data is very low. The pitfall with OCR, however, is that even with near perfect modern page images there will always be errors: the converted text will never be 100% accurate, and accuracy drops considerably for older texts. One solution is to proof-read and correct the OCR text, and this approach has been taken by the eLib funded project ACORN (Project Acorn, 1997). However, this final step adds an enormous workload to what has so far been an almost totally automated process. It is important not to underestimate the time consumed in proof reading and interactive editing. Unless the text is of very high accuracy it can often be cheaper to revert to the initial strategy of copy-typing the originals; interactive editing is much, much slower than direct copy-typing. But even after the most careful proof reading and editing, will our etymologist still be able to sleep peacefully at night?
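
The project itself used a commercial OCR package (discussed later), but the shape of the automated pipeline can be sketched with the open-source Tesseract engine via pytesseract, standing in for whichever OCR software is chosen; the directory layout is illustrative:

    import glob
    import pytesseract
    from PIL import Image

    # OCR every scanned page image, writing the raw, uncorrected text
    # alongside each image; no proof-reading or editing step follows.
    for image_path in glob.glob('pages/*.tif'):
        text = pytesseract.image_to_string(Image.open(image_path))
        with open(image_path + '.txt', 'w') as out:
            out.write(text)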

The Internet Library of Early Journals

It is upon this difficult terrain that the Internet Library of Early Journals (ILEJ) set out to digitise 120,000 pages of 18th and 19th century material and make them available to scholars. Journal titles were chosen not only for their interest to scholars but also for the nature of their content. The Gentleman's Magazine, The Annual Register and Notes and Queries all consist of short pieces on a wide variety of subjects, with lists of births, marriages and deaths, book reviews and similar material. This eclectic content is impossible to index adequately, and it was felt that full text searching would open up valuable information that had previously remained inaccessible: electronic free text searching would maximise the research value of this type of journal. The nature of these titles means that their attraction to scholars could only result from a large 'critical mass' of material, a mass that would make consulting the resource worthwhile. Twenty-year runs of each were seen as the minimum quantity capable of attracting scholars to the facility. For The Gentleman's Magazine, the intention is to digitise up to 100 years of material, since the availability of this title on microfilm offers the prospect of faster processing. This necessity for a large quantity of data was the primary fulcrum upon which project decisions rested; it characterises our whole strategy. Our fundamental requirement was for high volume, high throughput and low cost conversion of large quantities of data. Such a requirement necessarily precludes any high cost text conversion, whether copy typing or proofing and correcting OCR. Images would therefore be scanned from original volumes and from microfilm, and it is these images which would be displayed to the user. Hypertext indexes would be created for the images, both from existing electronic indexes supplied by third parties and by hand typing existing paper indexes and contents pages. But by far the greatest value would be added by OCRing the text to allow full text searching of material that would otherwise remain opaque.

OCR accuracy

Because the project ethos of 'high volume, low cost' excluded proof-reading and correcting the OCR, the OCRed text retains errors. This means that the OCRed text files cannot be supplied as the primary data (as they are, for example, in Project ACORN) but are used instead to add value to the page images by allowing searching. The electronic library therefore consists of linked pairs of image and text files: a search is made of the text files, but the original page image is supplied for consultation when a text match has been found. The advantage of storing image/text pairs is that, though the OCRed text may not be correct, the user actually consults the original page image, and so the integrity of the original is guaranteed. This model is in fact used by commercial office document management systems, in which letters are scanned on arrival and the images retrieved later by a search through the OCRed text. Because there was no intention of correcting the OCR, it was clear from the start that exhaustive searching of the material would not be possible; there would always be badly OCRed terms that a search would not find. It is this shortcoming that gives the Early Journals full text resource a different slant from comparable resources. The usefulness of the type of material digitised required a large corpus, and within fixed spending constraints this criterion has caused an associated (hopefully insignificant) reduction in retrieval. It is our belief that our position in the cost/retrieval matrix is appropriate for this type of journal; that is, that scholars would prefer high retrieval rates from a very large corpus of data to 100% retrieval from a much smaller corpus. One of the aims of the project will be to evaluate the truth of this conviction.
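
A minimal sketch of the image/text pair model (the file layout and names are assumptions): the search runs over the uncorrected OCR text, but what is returned for consultation is always the page image.

    # Each page is a linked pair: the image shown to the reader and the
    # uncorrected OCR text, which is used only for searching.
    pages = [
        {'image': 'nq_p001.gif', 'text': open('nq_p001.txt').read()},
        {'image': 'nq_p002.gif', 'text': open('nq_p002.txt').read()},
    ]

    def search(term):
        # Match against the OCR text but return the images, so the
        # reader always consults a faithful copy of the original page.
        return [p['image'] for p in pages
                if term.lower() in p['text'].lower()]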

Fuzzy Searching

In order to achieve as high a retrieval rate as possible, ILEJ is using the Excalibur EFS fuzzy searching software already in use at the British Library (British Library, 1997) and in Project ELINOR at De Montfort University (Ramsden et al., 1993). EFS is a search engine that allows the user to increase the degree of 'fuzziness' associated with a search. Increasing fuzziness increases the number of hits found by returning not only words that exactly match the search term but also those with a similar spelling. This technique allows true matches to be found from terms that have been OCRed incorrectly. Fuzziness can be gradually increased, returning more and more hits but with less and less correspondence to the original search term.

This technique allows the user greater penetration into the corpus, but at a cost: researchers will have to be more thoughtful in choosing their search terms and will be presented with a much greater quantity of negative research, ie dismissing matches that are not relevant. Indeed, this whole paradigm will involve the researcher in developing new skills to utilise this type of resource successfully. In the good old days the reader would search through bibliographic data entered by trained librarians using controlled and standard language. Anybody who has used Alta Vista to search the World Wide Web, however, will be aware of the disadvantages of full text searching and the care needed in choosing search terms. A carelessly chosen term can return gargantuan quantities of irrelevant material. Although this problem of noise applies to all uncontrolled full text searching, it is further exacerbated by fuzzy searching. Not only will fuzzy searching return badly OCRed words that are relevant to the reader, it will also return similarly spelt words which are not relevant, increasing the amount of necessary negative research. This factor adds another axis to our matrix: cost/retrieval/user-convenience. Again, it will be part of the project to evaluate our position against these axes and make recommendations for further projects.

Because of the factors outlined above, successful techniques for negative research are paramount to the success of this methodology. As previously mentioned, one of the disadvantages of mounting page images as opposed to text is their size. A page image file, even when compressed, can easily be 20 times the size of the corresponding text, and consequently delivery over the internet can be expected to be 20 times as slow. If a reader has to sustain a handicap of this order purely in order to dismiss irrelevant data, there will be significant difficulties and frustrations in using the service. Initially the project had considered using the OCRed text solely as a hidden search tool, but it is now coming to be seen as an important resource for negative research. It is anticipated that the OCRed text will provide enough context, and be of sufficient quality, to allow the reader to dismiss a page as not relevant; only when a relevant page is identified is the image downloaded for consultation.
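
EFS's adaptive pattern recognition is proprietary, but the effect of the fuzziness dial can be imitated with simple similarity matching from Python's standard library (an analogy only, not EFS's algorithm; the vocabulary is invented). Raising the fuzziness lowers the similarity cutoff, admitting progressively less similar spellings:

    import difflib

    # Words as they actually appear in the uncorrected OCR output.
    ocr_vocabulary = ['parliament', 'parliamcnt', 'parhament',
                      'pailiament', 'pavement']

    def fuzzy_match(term, fuzziness):
        # fuzziness in [0, 1): higher values admit more hits, with less
        # and less correspondence to the original search term.
        return difflib.get_close_matches(term, ocr_vocabulary,
                                         n=10, cutoff=1.0 - fuzziness)

    print(fuzzy_match('parliament', 0.05))  # essentially exact matches only
    print(fuzzy_match('parliament', 0.4))   # catches OCR-mangled variants,
                                            # but also noise like 'pavement'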

Scanning

The model so far describes high speed scanning and automatic OCRing to provide a high volume, low cost resource; the only significant manual input is the hand entering of the paper indexes and contents pages. Indeed, this is exactly the model for those titles scanned from microfilm: the microfilm rolls are fed into a Meckel MX500XL-G scanner which automatically moves through the whole film. However, to date, most of the scanning experience of the project has come from scanning bound volumes using the Minolta PS3000P overhead scanner. The age and value of the original volumes meant that dismembering was never considered. This decision limited the choice of scanner for paper originals to the Minolta PS3000P, which remains the only hardware available for high throughput scanning of bound volumes. The scanner consists of a 'cradle', upon which the volume is placed open and face up, and the scanning optics, which stand about eighteen inches over the cradle. The cradle is made up of moving plates which yield under the book's weight, allowing it to be opened flat without stressing the binding. The operator simply holds the book open and presses a foot pedal to initiate the scan.

The use of an overhead scanner, although it removes any costs associated with dismembering volumes, significantly increases the actual scanning costs by requiring the full time attendance of an operator. In practice, the older, tightly bound volumes used in the project required careful manipulation by the operator, and throughput was considerably slower than the scanner's specification: a rate of about 80 pages per hour was typical. Operator costs are a significant factor with this type of equipment. The PS3000P is currently limited to bitonal scanning at 400 dpi, which is not sufficient resolution for archival standards. Benchmark techniques developed at Cornell were applied to the ILEJ journals to determine the dpi required for full informational capture from bitonal scanning (Kenney and Chapman, 1995). Examination showed that Notes and Queries and The Gentleman's Magazine have characters slightly under 1.0 mm in size, which would require 615 dpi for full capture. Current technology, therefore, will not allow the scanning of bound volumes to archival standards.
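
The Cornell benchmark can be made concrete. For bitonal scanning, Kenney and Chapman (1995) define a quality index \( QI = (dpi \times 0.039h)/3 \), where \( h \) is the height in millimetres of the smallest significant character and 0.039 converts millimetres to inches; full informational capture corresponds to \( QI = 8 \). Solving for 1.0 mm characters gives

\[ \mathrm{dpi} = \frac{3 \times QI}{0.039 \times h} = \frac{3 \times 8}{0.039 \times 1.0} \approx 615, \]

well beyond the PS3000P's 400 dpi ceiling.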

OCR in ILEJ

OmniPage Pro version 6.0 was chosen for OCR as it provided good recognition accuracy and a strong feature set, including grey-scale and batch processing capabilities. Assessing OCR quality is itself problematic: it is impossible to obtain accurate quantitative data without manually counting the words in the original. With older journals this problem is exacerbated, as OCR is particularly sensitive to the variations in type-weight, foxing, show-through etc that occur in this type of material. Because the quality of older material is so variable, any meaningful statistics on OCR performance would have to be aggregated over impracticably large quantities of data. It is therefore not feasible to give a single percentage accuracy for OCR across the large volume of material scanned. However, it has become clear that achieving high quality OCR is highly problematic in these circumstances, and the problems are rooted in three factors.

Firstly, OCR packages are designed to recognise modern office documents using modern characters in modern fonts. Eighteenth century material contains long s's (ie an 's' that looks like an 'f'), ligatures and diphthongs that are not recognised. Although OCR can, in a limited fashion, be trained to recognise such alien characters, the recognition capabilities never match those achieved with standard text.

Secondly, and by far the greatest source of difficulty, is the inconsistent quality of the original material. A page with dark type face and show-through will be followed by one with faint type and broken characters; some pages will be dark but with areas of light and broken characters; pages will be skewed and columns can wander. Such variability does not lend itself to high volume processing. Ideally the scanning parameters would be adjusted to suit each page, but this is of course impractical.

The final problem area lies with the overhead book scanner itself. As previously mentioned, the limitation of 400 dpi prevents full capture, which has ramifications for OCR quality. The scanner also currently supports only bitonal scanning (grey-scale is expected later this year); OCRing grey-scale helps enormously with variable material as the OCR is able to 'ride' over the variation. Finally, the fact that the scanner optics sit eighteen inches over an open book means that the image quality is never as good as that from a flatbed scanner, where the page is pressed flat against the glass. Problems from uneven page surfaces, focusing and shadowing are inherent in the use of an overhead scanner.
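
As a hypothetical illustration of the long-s problem (not a project tool): since the long 's' is most often misrecognised as 'f', a search term can be expanded into its plausible OCR confusions before matching. The long s was not used in word-final position, so the final character is left alone:

    import itertools

    def long_s_variants(term):
        # Generate candidate OCR misreadings of a term: each non-final
        # 's' may have been printed as a long s and recognised as 'f'.
        positions = [i for i, c in enumerate(term[:-1]) if c == 's']
        variants = {term}
        for r in range(1, len(positions) + 1):
            for combo in itertools.combinations(positions, r):
                chars = list(term)
                for i in combo:
                    chars[i] = 'f'
                variants.add(''.join(chars))
        return variants

    print(sorted(long_s_variants('possession')))
    # ['poffeffion', ..., 'possession'] - 16 variants in all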

Conclusion

This paper has briefly presented the various approaches available when considering the development of a full text scholarly resource. It has then shown the approach taken by the Internet Library of Early Journals and discussed some of the early technical findings of the project. The project can be understood, therefore, as feeding back information into the digital community at two levels: short term technical data and long term methodological data. The project is gaining experience with state-of-the-art scanning equipment, OCR and fuzzy searching software. Recommendations will be made about the benefits and limitations of this equipment and the techniques needed to reap the greatest benefits. This type of information is in immediate demand: people are making choices today about which type of scanner to purchase, which OCR package, which fuzzy search engine, and our expertise will be made available to help with these decisions. In a year or two, however, this data will be redundant as technology improves. The methodological implications gleaned from the ILEJ project will be more far reaching. Even with improved technology, the decision as to where to locate a full text resource in the cost/retrieval/user-convenience matrix will still be a pertinent one; in fact it may become more pertinent. As technology improves, the difference in cost between manual production (eg typing, proof reading) and automatic production (eg scanning, OCR) will increase. It will become cheaper and cheaper to produce resources in the ILEJ mould, that is, high volume but without full retrieval. The success of such resources will depend on the as yet unknown axis in our paradigm: how easily and how well will users be able to locate the information relevant to their research? It is this component that may well prove to be the most interesting return from the ILEJ project.

References

British Library, 1997, Excalibur PixTex/EFS. URL: http://portico.bl.uk/access/excalibur.html

Conway, P., 1996, Conversion of microfilm to digital imagery: a demonstration project: performance report on the production conversion phase of Project Open Book. New Haven: Yale University Library.

Electronic-Beowulf, 1997, URL: http://portico.bl.uk/access/beowulf/electronic-beowulf.html

Kenney, A.R. and Chapman, S., 1995, Tutorial: digital resolution requirements for replacing text-based material: methods for benchmarking image quality. Washington, D.C.: Commission on Preservation and Access.

Kenney, A.R. and Chapman, S., 1996, Digital Imaging for Libraries and Archives. Department of Preservation and Conservation, Cornell University Library.

O'Rourke, T., 1996, Chadwyck-Healey - electronic resources for the virtual library: a publisher's perspective of preservation and access. In: A.H. Helal and J.W. Weiss (eds.), Electronic documents and information: from preservation to access (Essen Universitätsbibliothek, Essen), p.160-169.

Project Acorn, 1997, URL: http://acorn.lboro.ac.uk/

Ramsden, A., Wu, Z., Zhao, D., 1993, Selection criteria for a document image processing system for the ELINOR Electronic Library project. Program 27:4, p.371-387.

Research Libraries Group, 1996, Studies in Scarlet. Research Libraries Group News 40, p.3-11.

Acknowledgements

The Internet Library of Early Journals is a project funded by the JISC within the eLib programme. The author also wishes to acknowledge the contribution of other members of the project team to the work described.