INTERNET LIBRARY OF EARLY JOURNALS
Annual Report (August 1997)
1. As reported in July 1996 the project has been subject to major delays due to the unavailability of equipment with full grey scale capability for digitising from both paper and microfilm. Access to grey scale scanning had been assumed in our original project plan, and pilot studies with flatbed scanners suggested that, though grey scale did not significantly affect image appearance as presented to the user (on a PC screen at 100dpi), it did affect the accuracy of OCR output, especially for the poorer quality typography of the 18th Century material.
2. Following discussions with Minolta it was decided to accept the Minolta PS3000 equipment with bi-tonal capability and proceed with the scanning of Notes and Queries and Blackwoods where absence of grey scale was expected to have the least effect. We were informed by Minolta that an upgrade of the IMAX interface card to provide full grey scale capability was under development and would be installed free-of-charge as soon as it was available.
3. Scanning of the first title, Notes and Queries, is now complete with 26,000 images mounted on both Leeds and Oxford servers. Procedures for scanning, OCRing and image conversion have been extensively investigated and defined. Scanning of Blackwoods Magazine is in progress (18 Volumes complete) and scanning of Philosophical Transactions has just started at Manchester.
4. There has also been a major delay in the supply of the Mekel MX500XL-G equipment for digitisation from microfilm with grey-scale provision. This equipment was installed in Oxford in April and the scanning of the Gentleman's Magazine microfilm has started.
5. The server strategy has been revised as a result of the availability of a Web interface for EFS and concerns about the inflexibility of the EFS interface. The Oxford Web interface has been re-designed to offer greater user-friendliness and will be the initial entry point for all users, with a transparent interface to the EFS fuzzy matching software on the Leeds server.
6. Indexes to Notes and Queries have been keyboarded by Digital Imaging Technologies (not in-house as originally planned) and are being mounted on the Oxford server. Notes and Queries will therefore provide the test bed for the use of both OCRable text and keyboarded printed indexes.
7. In response to project publicity we have received 280 enquiries about the proposed service. A pilot service based on Notes and Queries is now being provided, an initial user population is being recruited at all four sites, and the first stage of evaluation feedback is being collected.
8. Discussions on possible exit strategies have been initiated with Chadwyck-Healey, JISC and MIDAS.
B. Activities and Progress
Scanning with the Minolta PS3000 face-up book scanner
9. Minolta scanners are now installed and configured at both Birmingham and Manchester. Operators have also been appointed on both sites. Trouble-shooting and optimisation of scanning and other procedures were undertaken at Manchester before work started at Birmingham.
10. Much effort has been devoted to defining scanning procedures in order to maximise image quality as defined by the legibility and aesthetic appearance to the user, and the OCR output quality. Initial pilots of 200 page samples were used to define the parameters for a basic level of legibility and acceptability, but a high level of variability in quality remained. During the production runs of the first 20 volumes, the image quality has steadily increased as operator skill has developed, with new "tricks" learned, factors influencing quality identified and a series of procedural refinements introduced. Factors identified as responsible for inconsistent quality are: page buckling, variations in type density, see-through, varying margins, tightness of binding and the differential focusing required at the beginning, middle and end of a large volume. Tightness of binding and the lack of flatness of the page appear to be critical factors in relation to OCR quality.
11. A major change in procedure was adopted for the last 7 volumes (34-40): single pages were scanned instead of a double-page spread. The procedure was to scan all "right-hand" pages to the end of a volume, then turn the book round and scan all "left-hand" pages. This gives a significantly more consistent quality of image and OCR output.
12. Though all the images so far scanned are judged legible and of acceptable quality, the overall quality is undoubtedly higher in the later than in the earlier volumes as a result of the refinements introduced. The scanner operator has produced two reports on the evolution of operating procedures, and a draft procedural manual has been prepared. After revision this manual will be an additional deliverable. Appendix I provides more details of the evolution of the present procedures and the remaining problems.
13. Inconsistencies in image quality still remain and to some extent can be considered inherent in the use of an open book scanner with this type of material (pre-20th century). With the Minolta scanner, variable quality could only be eliminated, if at all, by setting parameters for each page, at an unacceptable cost in reduced throughput. Direct Data Capture (DDC), a Manila-based scanning bureau used by Leeds to supply images for the RLG Studies in Scarlet project, have had comparable experiences with the Minolta PS3000. However, much cheaper labour costs allowed them to spend more time on post-scanning image clean-up.
14. The scanning of 20 years of Notes and Queries (40 volumes, 26,000 images) has now been completed. Though a maximum of 250 pages an hour was achieved under optimum conditions, the average production throughput over 40 volumes was 97 pages per hour and the hourly rate varied with the volume from a maximum of 159 (vol. 23) to a minimum of 55 (vol. 10) reflecting variations in the characteristics of the volume, including the tightness of binding. The time taken to input metadata is not included in these data.
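The throughput figures in para. 14 imply a rough total of operator hours for the title. The following is a minimal sketch of that arithmetic, not a project costing: it assumes one image per page and a constant hourly rate, and, like the figures above, it excludes metadata input. The function name is illustrative only. At the average rate of 97 pages per hour, 26,000 images correspond to roughly 268 hours of scanning.

```python
def scanning_hours(images: int, pages_per_hour: float) -> float:
    """Hours needed to scan `images` pages at a constant hourly rate."""
    if pages_per_hour <= 0:
        raise ValueError("pages_per_hour must be positive")
    return images / pages_per_hour


TOTAL_IMAGES = 26_000  # Notes and Queries, 40 volumes (para. 14)

# Rates reported in para. 14; applying one rate to the whole run is an assumption.
for label, rate in [
    ("fastest volume (vol. 23)", 159),
    ("average over 40 volumes", 97),
    ("slowest volume (vol. 10)", 55),
]:
    print(f"{label}: {scanning_hours(TOTAL_IMAGES, rate):.0f} hours")
```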
15. 18 Volumes (9 years) of Blackwoods (out of 40 volumes in total) have been scanned at Birmingham. Scanning of Philosophical Transactions has now started at Manchester.
Scanning of Microfilm with the Mekel MX500XL-G
16. The Mekel scanner, with grey-scale capability, was installed in Oxford at the beginning of April and a scanner operator (20 hours per week) appointed. A one-year sample (365 frames with two pages a frame) of Gentleman's Magazine has been used to pilot the Mekel equipment and assess parameter settings for particular microfilm characteristics.
17. Throughput is potentially high: a theoretical scanning rate of 600 frames per hour is possible at lower resolutions with bi-tonal settings. The actual scanning rate is much lower because of the need to use higher resolutions and grey-scales, to exercise visual quality control and to reset parameters for "bad" frames. The effective scanning speed is nearer to 100 frames (200 images) per hour. Though this is twice as fast as that achieved with the open book scanner, the time required for the input of metadata remains unchanged and so represents a higher proportion of the staff time required.
18. The microfilm, obtained from bound volumes, exhibits the same variations in quality associated with tight bindings, page buckle and other factors as were observed when digitising directly from bound volumes. These variations are transferred to the images obtained from the microfilm. Though most images are legible, the quality is variable and in some cases poor, irrespective of parameter settings. Other factors which may affect quality are the state of the paper original and the microfilming and subsequent copying processes (this is a second generation microfilm). Before proceeding to a full production run of Gentleman's Magazine we are currently checking whether the quality of the microfilm (as judged by digitised images generated) varies between volumes in a manner similar to that observed in digitising Notes and Queries directly from the paper bound volumes. Work is about to start on the piloting of a microfilm reel of The Builder.
19. TIFF images have been transferred between sites (Manchester to Leeds; Leeds to Oxford; Birmingham to Oxford) by FTP with no problems. These images are now subject to processing in two respects:
OCRing at 400 dpi
conversion (TIFF to GIF) and resizing for display purposes
Most of our experience with these processes, which is outlined below and in greater detail in Appendix I, has been gained at Leeds with Notes and Queries. However, the necessary processing software is also now operational at Manchester and at Oxford, and has been used with Blackwood's Magazine and Gentleman's Magazine at Oxford.
20. OCR is viewed as a low-cost method of producing high value indexes which will be especially valuable for "newsy" publications such as Notes and Queries which are item-based and rich in anecdotal information. The resized images are the primary means of providing a screen display to users. The primary purpose of OCR is not display but indexing, though the OCR'd text may be made accessible to the user, linked to the image, in order to highlight matched terms in the text. The requirements for OCR are therefore less rigorous than if it were used as the sole display format (though how much less rigorous is acceptable has yet to be quantified). Even a low quality OCR text would provide far greater indexing depth than a printed index, though the high error rate would result in inconsistency of retrieval. The fuzzy matching software offered by EFS is intended to reduce retrieval failures due to OCR error. Nevertheless, it is recognised that OCRing of pre-20th century text, especially in bound volumes, does give rise to serious problems and there may be a point at which quality is too low to be acceptable.
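The way fuzzy matching can recover pages that an exact-match index would miss may be illustrated as follows. This is only a sketch under stated assumptions: the EFS matching method is not described in this report, so Python's standard difflib similarity ratio is used here as a stand-in, and the function name, the 0.75 threshold and the sample words are all hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_lookup(query: str, ocr_words: list[str], threshold: float = 0.75) -> list[str]:
    """Return the OCR'd words whose similarity to `query` meets the threshold."""
    q = query.lower()
    return [
        word
        for word in ocr_words
        if SequenceMatcher(None, q, word.lower()).ratio() >= threshold
    ]


# Hypothetical index terms, two of them with typical single-character OCR errors.
ocr_words = ["antiquarian", "antiqnarian", "anliquarian", "aquarium", "query"]

# An exact match finds only the first term; the fuzzy lookup also finds the
# two misrecognised variants while rejecting the unrelated words.
print(fuzzy_lookup("antiquarian", ocr_words))
```

A lower threshold retrieves more damaged variants but also more false matches, which is the same trade-off between recall and precision that any error-tolerant index has to manage.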
21. Omnipage has been used exclusively for OCRing in the project. Problems have been encountered in relation to both accuracy and throughput (though in this case there is no trade-off between the two). Great efforts have been devoted to refining the imaging process in order to give consistent OCR quality. As a result the majority of OCR pages are satisfactory but difficulties were encountered in three areas:
the small typeface of adverts, in particular, and of indexes may be difficult to OCR using 400 dpi images.
pages which displayed extremes of typeface density: very light with parts of characters missing or very dark with show-through.
pages with complex structure including mixed fonts.
In extreme cases some pages could not be OCR'd: the software crashes, owing either to over-complex page structure or extreme variations in typeface or, occasionally, for no discernible reason. It remains to be seen whether grey scales, considered essential for the 18th century titles, could also improve OCR accuracy for the bad pages in the 19th century titles.
22. OCR takes longer than the initial scanning process and is currently the bottleneck in the production process, though not yet critically so. A complete volume (650 pages) takes 12.5 hours to OCR if there are no software crashes, which require manual intervention and hence prevent unattended overnight running. The rate of crashes depends on the particular volume and has varied between 1 page in 40 and 1 in 200. The establishment of processing production lines at both Manchester and Oxford is intended primarily to speed up the OCR process. We have also supplied image samples to commercial suppliers in order to evaluate alternative OCR software: the RAF software available through the Zuma Corporation and software used by Digital Imaging Corporation who are involved in the JSTOR project.
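The figures in para. 22 can be put on the same footing as the scanning rates with a short calculation. The sketch below assumes a uniform 650-page volume and a constant crash rate within a volume; the function names are illustrative. It gives an OCR rate of 52 pages per hour (against an average of 97 pages per hour for scanning) and between about 3 and 16 crashes per volume, which is why unattended overnight running is rarely uninterrupted.

```python
def ocr_pages_per_hour(pages: int, hours: float) -> float:
    """Average OCR rate for a volume processed without interruption."""
    return pages / hours


def expected_crashes(pages: int, pages_per_crash: int) -> float:
    """Expected number of crashes in a volume, given one crash per N pages."""
    return pages / pages_per_crash


VOLUME_PAGES = 650  # a complete volume (para. 22)

print(ocr_pages_per_hour(VOLUME_PAGES, 12.5))  # 52.0
print(expected_crashes(VOLUME_PAGES, 40))      # 16.25 (worst volumes)
print(expected_crashes(VOLUME_PAGES, 200))     # 3.25 (best volumes)
```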
OCRing of other titles
23. It appears almost certain that the images created from the Gentleman's Magazine microfilm cannot provide an acceptable level of OCR quality even for indexing purposes. A similar conclusion was reached by Offshore Keyboarding Corporation (a division of Digital Imaging Corporation who are involved in the JSTOR project) using their own OCR software.
24. Software is required to convert TIFF to GIF images, to resize them to fit the width of an 800 x 600 pixel screen at 100dpi for display purposes, and for cropping, deskewing and other refinements designed to improve both OCR and presentation. Three software packages have been evaluated: ImageMagick, Image Alchemy and Sequoia ScanFix. ImageMagick, our initial choice, proved far too slow (24 hours per volume) and has been replaced with Image Alchemy (4 hours per volume). The image conversion process is not, therefore, a rate determining step in the production process. ScanFix is also being used for its deskewing and image clean-up facilities.
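The resizing step amounts to scaling each 400 dpi master down to the display width while keeping its proportions. The sketch below computes only the target dimensions; it does not perform the conversion itself, which in the project is done by Image Alchemy. The function name and the sample page size (8 x 12 inches, hence 3200 x 4800 pixels at 400 dpi) are assumptions made for illustration.

```python
def display_size(width_px: int, height_px: int, max_width: int = 800) -> tuple[int, int]:
    """Scale image dimensions down to fit `max_width`, preserving aspect ratio."""
    if width_px <= 0 or height_px <= 0:
        raise ValueError("image dimensions must be positive")
    if width_px <= max_width:
        return width_px, height_px  # already fits; never enlarge
    scale = max_width / width_px
    return max_width, round(height_px * scale)


# Hypothetical 8 x 12 inch page scanned at 400 dpi.
print(display_size(3200, 4800))
```

Because only the width is constrained, a page taller than the 600 pixel screen height is scrolled vertically rather than shrunk further, which keeps the text legible.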
The Oxford and Leeds Servers
25. The delays experienced in the installation of scanning equipment have enabled the project to explore server options in greater depth and take advantage of changes in the available technology. As a result, our server strategy has been substantially modified. Our original proposal envisaged a Web server at Oxford, using a standard search engine such as PAT, and an X Windows Excalibur EFS server, with fuzzy matching capability, at Leeds. Images and OCR'd text were to be mounted on both servers independently. The appearance of a Web interface for EFS, and the increasing universality of access to the Web, have led us to abandon the original X Windows interface. It has also been discovered that the relative inflexibility of the EFS software restricts attempts to make it more user friendly, though a new and possibly more flexible version is about to be launched. However, the fuzzy matching capability provided with EFS continues to be a key element in the project as a means of compensating for the errors in the full-text OCR.
26. Our revised strategy is to use the Oxford server, with a WWW interface and PAT as its search engine, as the initial point of entry for all users, but with a transparent link to the EFS server in Leeds for those users wishing to use the fuzzy search option with the OCR indexes. This is the type of application for which SuperJanet was intended. The Oxford interface has been re-designed to increase user friendliness and accommodate this strategy. The server uses SGML records conforming to the Text Encoding Initiative (TEI) standard, which integrate bibliographic information, full OCR'd text and images into a single file for each physical volume. A unique SGML identifier links each page to the associated metadata, which includes OCR text, GIF image and bibliographic information. The bibliographic information provides descriptions of the pages, including those with no given page numbers, and distinguishes between the different categories of pages (title page, text, indexes, advertisements). The SGML database will be used to generate Dublin Core records.
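The record structure described in para. 26 can be pictured as one volume-level record holding a list of page-level records, each reachable by its unique identifier. The sketch below models that shape in Python rather than in SGML; it is not the project's TEI markup, and every class name, field name, identifier format and file path in it is hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class PageCategory(Enum):
    """The page categories distinguished in the bibliographic information."""
    TITLE_PAGE = "title page"
    TEXT = "text"
    INDEX = "index"
    ADVERTISEMENT = "advertisement"


@dataclass
class PageRecord:
    page_id: str                   # unique identifier linking the page to its metadata
    category: PageCategory
    printed_number: Optional[str]  # None for pages with no given page number
    gif_path: str                  # display image
    ocr_text: str                  # full OCR'd text used for indexing


@dataclass
class VolumeRecord:
    """One record per physical volume, integrating all of its pages."""
    title: str
    volume: int
    pages: list[PageRecord] = field(default_factory=list)

    def find(self, page_id: str) -> Optional[PageRecord]:
        """Look up a page by its unique identifier."""
        return next((p for p in self.pages if p.page_id == page_id), None)


# Hypothetical example: an unnumbered title page followed by the first text page.
vol = VolumeRecord("Notes and Queries", 1, [
    PageRecord("nq-v01-p0001", PageCategory.TITLE_PAGE, None, "v01/0001.gif", ""),
    PageRecord("nq-v01-p0002", PageCategory.TEXT, "1", "v01/0002.gif", "NOTES AND QUERIES ..."),
])
print(vol.find("nq-v01-p0002").printed_number)
```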
27. The images and OCR'd text files of all 40 volumes of Notes and Queries are mounted on the Leeds server only, and are accessed as required via the WWW interface.
The User Interface
28. The interface allows the user either to browse by issue and page by page through a journal or to search OCR or other indexes for words or phrases, which can be linked by the Boolean operators AND and OR. Simple and Fuzzy search options are offered. Boolean operators are not available with fuzzy searching. When a page image is identified and displayed in either search or browse mode, the user has the option to click on previous page or next page. HTML frames are used to ease navigation of the screens and images are displayed to fit an SVGA screen (800 x 600). Appendix II provides screen dumps of the present experimental interface, which is currently being revised (the interface can be accessed at http://www.bodley.ox.ac.uk/ilej/).
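The simple search option with Boolean operators can be sketched with a small inverted index that maps each word to the set of pages containing it. This is an illustration of the general technique only, not of the PAT or EFS implementations, neither of which is described in this report; the function names, page identifiers and sample texts are hypothetical.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'()"


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each lower-cased word to the set of page identifiers containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for raw in text.lower().split():
            word = raw.strip(PUNCTUATION)
            if word:
                index[word].add(page_id)
    return index


def search(index: dict[str, set[str]], terms: list[str], operator: str = "and") -> set[str]:
    """Combine the page sets for `terms` with Boolean 'and' or 'or'."""
    hits = [index.get(term.lower(), set()) for term in terms]
    if not hits:
        return set()
    if operator == "and":
        return set.intersection(*hits)
    if operator == "or":
        return set.union(*hits)
    raise ValueError("operator must be 'and' or 'or'")


# Hypothetical OCR'd page texts.
pages = {
    "p1": "Folk lore of Cornwall.",
    "p2": "Cornwall and Devon",
    "p3": "Devon folk songs",
}
index = build_index(pages)
print(sorted(search(index, ["folk", "cornwall"], "and")))
print(sorted(search(index, ["folk", "cornwall"], "or")))
```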
29. Scanning of Notes and Queries is now complete and will provide the basis for initial user testing. Scanning of Philosophical Transactions has now started. In an earlier pilot, the operator at Manchester reported that scanning a few trial pages of Philosophical Transactions was "a pleasure" (compared to Notes and Queries). The paper is of a superior quality to that of Notes and Queries, has a cloth-like texture and is thicker and heavier. The print is bold and clear and the superior alignment and quality are consistent in contrast to Notes and Queries. Unfortunately the colour of the paper used for Philosophical Transactions is darker (greyish) and many of the volumes have extensive foxing. The initial conclusion is that images produced from Philosophical Transactions will be clearer and the scanning process easier, though grey scales would be required for OCR.
30. 18 volumes of Blackwoods Magazine (out of 40) have been scanned and transferred by FTP to Oxford for OCRing. Scanning of Blackwoods is expected to be complete by the end of October.
31. The pilot scan of one year of the Gentleman's Magazine is complete and production is about to start. A pilot scan of The Builder is also about to start.
32. Discussions have been held with Keesing's Worldwide, publisher of Keesing's Record of World Events, on the possibility that volumes of The Annual Register could be scanned as part of a joint venture. However, our present expectation is that Annual Register will be scanned jointly at Manchester and Birmingham when work on other paper titles is complete.
33. Our experience in OCRing Notes and Queries images has already been discussed above and also in Appendix I. The feasibility of using OCR indexes, with or without fuzzy matching, is being assessed with each of the other titles in turn as images become available. Present indications are that OCRing will be feasible with the three 19th Century titles (Notes and Queries, Blackwood's Magazine and The Builder), but not with Gentleman's Magazine. Of the three 18th Century titles, Philosophical Transactions is most likely to give acceptable OCR, but the added value is much lower in view of the article-based structure. The use of OCR with Annual Register, the last of the titles to be digitised, will be evaluated in September.
34. In the previous Annual Report we reported that a pro-forma has been developed for input of printed indexes and pilot key-punching carried out in Oxford. It was subsequently decided to contract the keyboarding to commercial operators. Indexes to Notes and Queries (1100 pages) have been successfully keyboarded and are about to be mounted on the Oxford server. Quotes are to be sought for keyboarding either contents pages or indexes for Gentleman's Magazine, Philosophical Transactions and Annual Register. A sample of the electronic PCI records of contents pages from Blackwoods has been provided by Chadwyck Healey and will shortly be mounted and linked to the images.
35. At present we are expecting to rely entirely on the OCR'd full-text for The Builder (though the possibility of keyboarding indexes or contents pages has not been ruled out).
36. A financial report specifying recurrent and non-recurrent expenditure up to 31.7.97 is given in Appendix IV (not included on the Web).
37. The ILEJ home page is: http://www.bodley.ox.ac.uk/ilej/ . In addition to the papers and presentations listed in the 1996 Annual Report (July 1996), information has been disseminated by the following:
Emly, M. and Jupp, W. Internet Library of Early Journals. Paper presented at the ELVIRA Conference at de Montfort University, April 1997. In: Electronic Library and Visual Information Research - ELVIRA 4 : Proceedings of the 4th UK/International Conference on Electronic Library and Visual Information Research (eds. C. Davies and A. Ramsden) London. Aslib, 1997, pp. 167-176.
Field, C. The Internet Library of Early Journals : preserving and disseminating the British journal heritage. Serials (in press). (A revised version of a paper presented at the 3rd European Serials Conference, Dublin. 25-27th September 1996).
Jupp, W. The Internet Library of Early Journals. Paper to be presented at the Aslib Electronics and Multimedia Groups Annual Conference, Chelmsford, Essex. 14-16th May 1997.
Romanticism On the Net. An Electronic Journal devoted to Romantic Studies.
http://users.ox.ac.uk/~scat0385/ provides information on ILEJ.
C. Learning from the Process of Implementation
38. The project timetable has been subject to major changes due to initial delays in the provision of the Minolta open-book scanners and subsequent delays in the availability of both the Mekel equipment with grey-scale facilities (delivered April 1997) and of the grey-scale board for the Minoltas (still not available). This is a further illustration that the pace of technology development can lag behind both expectations and optimistic supplier claims. As a consequence we successfully applied to JISC for an extension of the period of the project for 9 months (until August 1998) within the existing budget.
39. In order to make the best use of staff resources, we delayed recruiting scanning staff until equipment was actually installed, with consequential further delays before operational scanning started. The need for frequent changes in the project timetable emphasises the managerial issues for a multi-site project which were referred to in the 1996 Annual Report (para. 12), viz: co-ordination of activity across four institutions can slow down progress, and heavy demands fall on senior management and technical staff who are involved in, but not funded by, the project. The distribution of scanning across three sites gave less flexibility in staff recruitment than concentration on a single site and, therefore, tended to multiply recruitment delays. However, it still remained true that there were substantial gains from the active involvement of four institutions, which provided a wider range of expertise, contacts and perceptions than any one single site. Effective collaborations on particular aspects of the project developed, notably between Manchester and Leeds in the initial scanning and OCRing operations. The transfer of scanning expertise from Manchester to Birmingham was also achieved effectively.
40. Contacts with related projects include:
Studies in Scarlet, a US/UK project with Leeds involvement.
The JSTOR project.
Cornell University who are apparently considering the use of the Minolta scanner for work similar to that in ILEJ in the expectation that grey-scales will be provided.
Yale University who have issued a report on a project to digitise 2,000 books from microfilm copies using the bi-tonal Mekel equipment.
The Australian Co-operative Digitisation project which is microfilming as an intermediate stage before digitising.
The SuperJournal project (eLib). The different interfaces used in this project have provided useful input to the design of our own interface.
The Higher Education Digitisation Centre (HEDC) at Hatfield (eLib) which is undertaking large scale digitisation from bound volumes, microfilm and single sheets. The Centre has purchased bound volume scanners from a German competitor to Minolta which has recently (August 1997) provided grey-scales.
41. The ILEJ project contains several elements: digitisation, the creation of OCR'd full-text and other indexes, interface design and end-user feedback. During the course of the project our awareness has increased that we are attempting several major tasks at the same time and of the interaction between these tasks. For example, OCR quality provides a test of image quality and feeds back into the scanning methodology; it is assumed that end-user feedback over the remainder of the project will influence interface design, imaging methodology, indexing and the criteria of acceptable OCR quality. We are also aware that some of these tasks might be better performed by centres specialising in only one. This has already been recognised in the project's decision that keyboarding of indexes by commercial suppliers is more cost-effective than doing it ourselves. HEDC, which concentrates exclusively on the initial digitising process, may be the more efficient approach to large-scale digitisation, but at the cost of isolation from end use of the product. Similarly, the SuperJournal project has devoted substantial resources to the design of alternative interfaces (though this is not an exclusive concern). We would expect to address these issues again in our final report.
D. Interim Evaluation Results
42. Delays in the scanning programme have resulted in consequential delays in user recruitment. As well as the ILEJ Home Page, publicity has included articles in Ariadne and in Romanticism on the Net and emails to selected UK mailing lists. This has so far resulted in 280 enquiries about the proposed service, 60% from outside the UK. As originally planned, we will be recruiting two populations categorised as local and remote, which will be treated somewhat differently as regards recruitment procedures and evaluation feedback. Appendix V contains copies of the questionnaire and invitation to participate forms which are being used for both user groups and are available in both paper and electronic forms.
43. Local populations are being recruited by distribution of paper copies of the two documents to targeted individuals or groups in each local area and through email lists where appropriate. Users have the choice of returning paper versions of the questionnaire in pre-addressed envelopes or electronic versions on the project Web site. Local respondents to this circularisation will provide the basis for the recruitment of focus groups for assessment of need, assistance with refinement of the pilot service and subsequent evaluation. Local experience in other contexts and in the SuperJournal project provides an awareness of the difficulties of recruiting focus groups among the academic community.
44. The remote user population is being recruited by broadcast publicity on the ILEJ Website and elsewhere rather than by targeted mail or email.
45. All users will be given the opportunity to comment on the service via Web pro-formas and email and asked to complete an email questionnaire survey. These will be the only forms of feedback for our remote user population.
E. Future Development
Work to be done
46. The main tasks to be completed in the next 12 months of the project are:
Continued scanning at Manchester (Philosophical Transactions, Annual Register), Birmingham (Blackwoods Magazine, Annual Register) and Oxford (Gentleman's Magazine, The Builder) in the expectation of mounting images of at least 20 years of each of the six titles by January 1998.
OCRing of Blackwoods, The Builder and possibly either the Annual Register or Philosophical Transactions (September 1997-February 1998).
Obtain quotes and commission keyboarding of either indexes or contents pages of Gentleman's Magazine, Philosophical Transactions and the Annual Register and possibly The Builder if OCR'd text is considered inefficient. The amount of keyboarding may be restricted by the funding remaining in the grant. (September - December 1997).
Continue to assess procedures for digitisation from bound volumes and from microfilm and to consult with other organisations undertaking similar work, including the HEDC at Hatfield. On the basis of our experience, produce a report on digitisation methodology.
Modify the user interface, in the first place in the light of evaluation by project staff on all four sites (September 1997), and in the second place on the basis of user feedback (December 1997).
Continue the recruitment of users, the formation of focus groups and the collection of feedback.
Detailed costing of the digitisation, image processing and index creation (OCR'd full-text or keyboarded indexes) processes.
Preparation of an Exit Plan and Business Strategy for further development (January 1998; see para. 47 below)
Preparation of final report for submission in August 1998.
Business plan and exit strategy
47. Because of delays experienced by the project and referred to in Section B of this report, the operational experience and user feedback required for a firm exit strategy are not yet available. However, we are already exploring options for developments beyond the period of the present project (ending in August 1998) in three key areas:
The provision of continuing access by the Higher Education Community to material which has been digitised in the project. As an interim measure the Oxford server will continue to provide access for a further 12 months beyond the end of the project, but we are seeking a permanent host preferably associated with other material with related content.
Linked to the provision of continued access, a major expansion of the Internet Library of Early Journals based on the extensive holdings of 18th and 19th Century journal literature (and related material) in the four participating institutions. The expansion envisages a collection of long runs of (say) 25 to 30 journals published in the UK and providing a UK corpus comparable to that of JSTOR. We would not intend to use the existing open-book scanning equipment purchased for the project for this larger scale digitisation programme, but would approach either HEDC or an appropriate commercial operator to undertake this work. Facilities in the commercial or HE sectors would be required for index creation, by either OCR or keyboarding, and for dissemination. The costing data on the present experiment would provide the starting point for identifying the substantial resources that will be required for this expansion from either commercial sources or JISC funding, and for negotiation with potential suppliers. Decisions will be required on the selection of material by subject and/or period and for the choice of indexes.
Continuing use of equipment purchased for the project. The open-book scanning equipment at Birmingham and Manchester will not be used for the large-scale expansion of the project, but could be used for:
assessment of candidate titles for the large scale project;
digitisation of items of particular significance to the Institution and its users;
(Birmingham) aspects of the Hybrid Library proposal if successful;
exploration of the possibilities for on-demand digitisation.
Our experience with the Mekel equipment for digitising from microfilm is still limited. However, we will expect this equipment to continue to be used for digitisation from microfilm collections held in Oxford and in the other three Institutions. It may also be made available for digitisation projects undertaken by other Institutions with significant microfilm collections.
29th August, 1997.