INTERNET LIBRARY OF EARLY JOURNALS
The following is a more detailed description of the scanning, OCR and image conversion processes, based on experience gained at Manchester and Leeds. Part of this Appendix is taken from the paper submitted to the ELVIRA Conference at De Montfort University, April 1997.
It was decided that the digitisation would not be carried out to full archival standards: Cornell and Yale were actively working in this area, and it was felt that the present project should concentrate on the user perspective. It was equally clear that no participating library would wish to dispose of a complete set of such an important early title: dismembering volumes, or otherwise compromising the paper originals was therefore not an option.
The decision to keep the original volumes intact limited the choice of scanner for paper originals to the Minolta PS3000P, which today remains the only hardware available for high-throughput scanning of bound volumes. The scanner consists of a cradle, upon which the volume is placed open and face up, and the scanning optics, which stand about eighteen inches above the cradle. The cradle is made up of moving plates which yield under the book's weight to allow it to be opened flat without stressing the binding. The operator simply holds the book open and presses a foot pedal to initiate the scan. During the scan the distance between the original and the CCD array is continually monitored so that focus can be dynamically maintained. The scanner detects the curvature of the spread pages so that it can compensate optically and provide a flat image from a curved page. Currently the Minolta PS3000P supports only bitonal scanning, at a maximum resolution of 400 dpi. The images taken from the Minolta scanner are used for OCR to provide searchable text, and also for conversion to GIF files for display on the Web. The scanning criteria therefore needed to be optimised for both purposes. Benchmark techniques developed at Cornell were applied to the journals to determine the dpi required for full informational capture (Kenney and Chapman, 1995). Examination showed that both Notes and Queries and The Gentleman's Magazine had significant characters slightly under 1.0 mm in size. This would require a dpi of 615 for full capture (QI 8.0, ie rendering of serifs and font detail) and 385 dpi for medium capture (QI 5.0, ie easily readable but with some loss of character detail). However, the Cornell methodology only applies to human legibility and could not provide the dpi or colour depth information for optimum OCR accuracy.
But it was clear at this early stage that the 400 dpi limitation of the Minolta would involve some loss of information with obvious implications for OCR quality. The bitonal file sizes at 400 dpi would be perfectly manageable; Group 4 compressed CCITT TIFF images averaged approximately 160Kb.
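The dpi figures quoted above are consistent with the bitonal Quality Index relation used in the Cornell benchmarking method, QI = (dpi x 0.039 x h) / 3, where h is the height in millimetres of the smallest significant character. That relation is not quoted in this report, so the following Python sketch should be read as an assumption, as should the character height of exactly 1.0 mm; the function names are illustrative only.

```python
# Sketch of the bitonal Quality Index (QI) arithmetic.
# ASSUMPTION: QI = (dpi * 0.039 * h) / 3, with h in millimetres.
# 0.039 converts millimetres to inches; the divisor 3 is the
# allowance the benchmarking method makes for bitonal scanning.

MM_TO_INCH = 0.039
BITONAL_DIVISOR = 3


def required_dpi(quality_index, char_height_mm):
    """Resolution needed to reach a given QI for a given character height."""
    return (BITONAL_DIVISOR * quality_index) / (MM_TO_INCH * char_height_mm)


def quality_index(dpi, char_height_mm):
    """QI achieved by a bitonal scan at the given resolution."""
    return (dpi * MM_TO_INCH * char_height_mm) / BITONAL_DIVISOR


full_capture = round(required_dpi(8.0, 1.0))    # QI 8.0, 1.0 mm characters
medium_capture = round(required_dpi(5.0, 1.0))  # QI 5.0, 1.0 mm characters
qi_at_400 = quality_index(400, 1.0)             # the Minolta's ceiling
```

Under these assumptions the sketch reproduces the 615 dpi and 385 dpi figures, and puts a 400 dpi scan of 1.0 mm characters at a QI of about 5.2. Since the smallest characters in these journals are slightly under 1.0 mm, the QI actually achieved would be a little lower, which is in line with the loss of information anticipated above.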
In practice it has proved very difficult to get good results with the Minolta technology from the older material. The tensioning of the cradle was far too stiff for it to be effective with the weight of our volumes. As a result the first and last thirds of the volume (ie where the volume thickness is most uneven) could take up to three times as long to scan as the middle third. Neither the dynamic focusing nor the curvature correction was particularly effective and the onus lay on the operator to spend time positioning and adjusting the volume to gain the best results. After experiments with plate glass and perspex, the best operator tool proved to be a three sided, white hardboard collar cut specifically to match the volume size. This could be placed to cover the side and lower edges allowing the operator to apply even pressure to spread and flatten the volume, as well as getting thumbs and fingers out of the picture. This technique, on a good volume, could expect to deliver an average of about one hundred pages per hour. However, by far the biggest difficulty in dealing with older material is the variability of the original. One page may have heavy, dark type and showthrough whereas the facing page may have very light typeface with incomplete characters. The OCR software was particularly sensitive to variations of this nature. The eventual solution to this, though not satisfactory, was to set up three different exposure settings, allowing the operator to select the setting most appropriate for a particular page spread. This additional operator activity decreased the scanning throughput from an average of 100 pages per hour to 80 pages per hour, a significant loss in throughput.
Omnipage Pro version 6.0 was the chosen software for OCR since it provided good recognition accuracy and a strong feature set including grey-scale and batch processing capabilities. The difficulties with assessing OCR quality (and this is a problem that will remain throughout the project) mean that it is impossible to get quantitative data without manually counting the words in the original. Because the quality of OCR must be assessed over widely varying original material, impracticably large quantities of data would be required. Initial tests were carried out on both eighteenth and nineteenth century material using an Epson flat-bed scanner. Results showed that the variability of the originals, a particular problem with the eighteenth century journals, caused problems for bitonal scanning which has a fixed threshold. On dark pages the showthrough would come above the bitonal threshold and appear as black shading; on the lighter pages the faint centres of letters would be below the threshold and the characters would appear broken in the image: both of these problems caused OCR to fail. The nineteenth century material has modern type-faces which OCR well but the problems with eighteenth century material are exacerbated further by archaic type-faces. It became apparent at this early stage that OCRing eighteenth century material would be problematic. Notes and Queries, with good quality nineteenth century originals, was chosen as the first production title because it gave the best pilot OCR results. Unfortunately, there have been significant OCR difficulties even with this high quality material. 400 dpi, the maximum available resolution, proved adequate for processing the main body text but was too low for the smaller text in adverts, footnotes, bibliographies etc. However, it is the variability of the material that has caused the greatest problems, even with the compensatory effect introduced with the three different exposure settings mentioned earlier. 
The scanner captures a double page spread, so variation between the left and right pages cannot be accommodated. Even scanning single pages (which would halve throughput) cannot completely solve this problem as the type-face weight varies over a single page. Also, the tight binding of the pages means that the open volume has an undulating page surface. The light source for the scanner hits the page at an angle of about sixty degrees and these undulations cause shadowing. There is even darker shadowing at the page gutter. Further, the curvature correction did not prove adequate to prevent OCR loss on tightly bound volumes with deep gutters. Post-scanning clean-up software did not alleviate these difficulties. None of the despeckle algorithms tested could discern between showthrough and ragged characters; despeckle caused improved recognition of some words but concomitant degradation of others, with no discernible overall benefit. Even after much effort, there are substantial areas of text within Notes & Queries which cannot be successfully OCRed. Scanning in grey-scale suggests itself as the obvious solution to these difficulties. Omnipage OCRs grey-scale images well. The advantage of grey-scale is that the OCR software is able to ride over the variation in the material, detecting the differential between background and text. This capability goes a long way to overcoming the problems of variable type-face, showthrough and shadowing.
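The difference between a fixed bitonal threshold and a threshold that follows the material can be illustrated with a small Python sketch. The grey levels below are invented for illustration, and the midpoint rule is only a stand-in: how Omnipage actually separates text from background in grey-scale images is not known to us.

```python
# Illustration only: hypothetical grey levels, 0 = black, 255 = white.
# The midpoint rule is an assumed stand-in, not Omnipage's algorithm.

FIXED_THRESHOLD = 128


def binarise(pixels, threshold):
    """Return 1 (black) for pixels darker than the threshold, else 0 (white)."""
    return [1 if p < threshold else 0 for p in pixels]


def midpoint_threshold(pixels):
    """Threshold halfway between the darkest and lightest pixel on the page."""
    return (min(pixels) + max(pixels)) / 2


# Dark page: paper, showthrough, ink, paper.
dark_page = [200, 120, 30, 200]
# Light page: paper, faint ink, paper, faint ink.
light_page = [245, 150, 245, 150]

# Fixed threshold: showthrough turns black, faint ink disappears.
fixed_dark = binarise(dark_page, FIXED_THRESHOLD)
fixed_light = binarise(light_page, FIXED_THRESHOLD)

# Per-page threshold: only the ink is kept on both pages.
adaptive_dark = binarise(dark_page, midpoint_threshold(dark_page))
adaptive_light = binarise(light_page, midpoint_threshold(light_page))
```

With the fixed threshold the showthrough pixel on the dark page is rendered black and the faint ink on the light page is lost. A threshold derived from each page keeps the ink and discards the showthrough in this toy case, but it can only be computed if the grey levels have been captured in the first place, which a bitonal scanner does not do.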
It should be noted that all of the OCR problems described are limitations of the Minolta cradle scanner and are therefore a direct consequence of the decision to keep the original volumes intact. A much higher level of OCR recognition could have been achieved if the volumes had been disbound and fed through a flatbed scanner. Curvature and gutter problems would not be present, page undulation due to binding stress would be removed and shadowing caused by the indirect light source would be eliminated. Both higher resolutions and grey-scale would be available. Finally, a further problem with cradle scanners is that the images cannot be accurately batch cropped. On a flatbed scanner the page is registered against the scanner guides and can be accurately cropped by specifying an offset from the known registration point. With a cradle scanner the page moves around within the image frame as the weight of the book shifts from beginning to end; consequently there is no known registration point from which cropping offsets can be calculated. The cropping can therefore never be as neat as that available from flatbed images.
Although Omnipage Pro provided an excellent recognition engine, the product was not robust enough for large batch recognition. Over-complex pages (eg very small type with heavy showthrough) would simply cause Omnipage to stop and sometimes completely crash. This hindered unattended overnight running, essential for achieving the necessary throughput.
With the purchase of the EFS WebFile product it was decided to mount the image files in a format that could be displayed in common web browsers without the need for additional plug-ins, access being one of the tenets of the project. The JPEG format is designed for images with high tonal variation and is unsatisfactory for bitonal images, so GIF was the format chosen. This choice precluded access to the images via the other EFS X-Windows and MS-Windows clients, which will only display TIFF images. The scanned TIFF images are converted to GIF and scaled to fit widthways on a standard 800 pixel wide monitor. ImageMagick was initially used for this conversion but the processing time was phenomenal, over 24 hours per volume. ImageMagick was replaced with Image Alchemy, which cut the time down to about 4 hours per volume and produces significantly better images. The images could be improved further by applying software filters but this is at the cost of image size and therefore download time. The current images are therefore a compromise between quality and file size.
Experiments were made with the Sequoia ScanFix product, which is designed to prepare bitonal images for OCR by using despeckle, deskew, sand, fill, grow and erode filters. Surprisingly, after much investigation, these techniques could not be made to improve OCR significantly: improvements to some letters resulted in degradation to others. Even deskewing did not improve OCR. ScanFix is being used on the images, but simply to deskew for presentation purposes, not for improvements to OCR.
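Why a clean-up filter can help some letters while harming others can be seen from a deliberately naive despeckle rule. The Python sketch below is our own illustration and is not the ScanFix algorithm, whose internals are not described here: it removes every black pixel that has no black neighbour, and the tiny test image is invented.

```python
# Naive despeckle: illustration only, NOT the ScanFix implementation.
# Pixels are 1 (black) or 0 (white).


def despeckle(image):
    """Remove black pixels that have no black pixel among their 8 neighbours."""
    height, width = len(image), len(image[0])

    def has_black_neighbour(y, x):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dy == 0 and dx == 0:
                    continue
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width and image[ny][nx]:
                    return True
        return False

    return [
        [1 if image[y][x] and has_black_neighbour(y, x) else 0
         for x in range(width)]
        for y in range(height)
    ]


# A lower-case "i" (dot at row 1, stem at rows 3-4) plus a stray
# speck of showthrough in the top right corner.
letter_i = [
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]

cleaned = despeckle(letter_i)
```

The speck is removed, but so is the dot of the "i": at the pixel level the two are indistinguishable to this rule. Real filters are more sophisticated, yet the same trade-off between removing noise and preserving small character features appears to underlie the results reported above.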
Cropping of the images has proved to be a problematic area. On a flatbed scanner the image is registered against the scanner guides and cropping simply involves adding an offset to this known point. However, with the PS3000P, the book moves and distorts in the cradle as the operator moves through the pages, so there is no guarantee that the page will be at a known position in the image. Image Alchemy does not provide a cropping feature, so ImageMagick is used as a secondary process after image conversion. ImageMagick operates reasonably efficiently when manipulating the smaller GIF images and a volume can be cropped in 2 or 3 hours. In order to accommodate the uncertain position of the image, the ImageMagick option to crop away the background colour was used. This feature crops away all extraneous white from around the image; in the later volumes this means hard up to the text. A white border is then added to give the sense of a page margin. This worked well for some volumes but could be upset by the slightest speck of dust or mark on the scanning collar or page margin. As the quality of scanning improved, and with the use of ScanFix to deskew the images, it seemed sensible to replace this with a traditional offset method, allowing a wide margin of error for possible image misalignment. However, with this method there is the risk that cropped text will go undetected. Also, because the surrounding image border differs between left and right images, the method depends on the images being in a known order, as different parameters need to be applied to left and right pages. This method is currently being investigated.
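The two cropping strategies, and the way each can fail, can be sketched in a few lines of Python. This is an illustration on a toy bitonal image held as a list of rows and does not reproduce the ImageMagick implementation; all function names and dimensions are hypothetical.

```python
# Toy comparison of "trim the background" and "fixed offset" cropping.
# Illustration only; names and sizes are hypothetical.

WHITE, BLACK = 0, 1


def make_page(height, width, text_top, text_left, text_height, text_width):
    """White image with a solid black rectangle standing in for the text."""
    page = [[WHITE] * width for _ in range(height)]
    for y in range(text_top, text_top + text_height):
        for x in range(text_left, text_left + text_width):
            page[y][x] = BLACK
    return page


def bounding_box(image):
    """(top, left, bottom, right) enclosing every non-white pixel, or None."""
    rows = [y for y, row in enumerate(image) if any(row)]
    cols = [x for x in range(len(image[0])) if any(row[x] for row in image)]
    if not rows:
        return None
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def trim_and_border(image, border):
    """Crop hard up to the content, then add a white margin all round."""
    box = bounding_box(image)
    if box is None:
        return image
    top, left, bottom, right = box
    blank = [WHITE] * ((right - left) + 2 * border)
    body = [[WHITE] * border + row[left:right] + [WHITE] * border
            for row in image[top:bottom]]
    return ([blank[:] for _ in range(border)] + body
            + [blank[:] for _ in range(border)])


def crop_offset(image, top, left, height, width):
    """Crop a fixed window, as one would from a known registration point."""
    return [row[left:left + width] for row in image[top:top + height]]


def count_black(image):
    return sum(sum(row) for row in image)


page = make_page(6, 8, text_top=2, text_left=3, text_height=2, text_width=3)
trimmed = trim_and_border(page, border=1)

# A single speck in the corner drags the bounding box out to the edge.
specked = [row[:] for row in page]
specked[0][0] = BLACK
speck_box = bounding_box(specked)

# Fixed window: fine while the page stays put, silently lossy when it moves.
fixed = crop_offset(page, top=1, left=2, height=4, width=5)
shifted_page = make_page(6, 8, text_top=2, text_left=5, text_height=2, text_width=3)
fixed_shifted = crop_offset(shifted_page, top=1, left=2, height=4, width=5)
```

In this toy case the trim method yields a neat 4 by 5 image from the clean page, but the speck enlarges the bounding box to (0, 0, 4, 6). The fixed window keeps all six text pixels while the page is in its expected position and only four once the page has shifted, with nothing in the output to signal the loss.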
Excalibur EFS document retrieval software is used on the Leeds server to store the page images and to provide full searching of OCRed text. The strength of EFS lies in its powerful indexing algorithms and fuzzy searching capabilities, which go some way to overcoming inaccurate OCR. If the required term cannot be found by an exact match, the degree of fuzziness can be increased so that the search will find partial matches. Although no real testing has been done with a genuine user community (focus groups are currently being recruited), it is obvious at this early stage that users will need to be canny to get the best from fuzzy searching software. The problems of free text searching over large quantities of unstructured data are well documented, and noise is a major issue in searching early journal titles deliberately selected for their eclecticism. Fuzzy searching exacerbates the noise problem. Increasing the degree of fuzziness will not only find badly OCRed instances of the desired search term but will also yield unwanted matches of words which are similar in spelling. The development of successful fuzzy search strategies will be one of the more interesting and challenging areas of the project, and the comparison with conventional indexes mounted at Oxford particularly illuminating.
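The trade-off between recall and noise can be illustrated with a simple stand-in for fuzzy matching. The Python sketch below uses Levenshtein edit distance, which is a substitute technique chosen for clarity and is not the matching method used by EFS; the word list, including the imagined OCR misreadings "querics" and "quenes", is invented.

```python
# Fuzzy search illustrated with Levenshtein edit distance.
# ASSUMPTION: edit distance is a stand-in, not the EFS algorithm.


def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def fuzzy_search(term, words, fuzziness):
    """Return the words within the given edit distance of the term."""
    return [w for w in words if edit_distance(term, w) <= fuzziness]


# Hypothetical index: one clean instance, two imagined OCR errors,
# two genuine but unwanted words, and one unrelated word.
WORDS = ["queries", "querics", "quenes", "queues", "quarries", "answers"]

exact = fuzzy_search("queries", WORDS, fuzziness=0)
loose = fuzzy_search("queries", WORDS, fuzziness=1)
looser = fuzzy_search("queries", WORDS, fuzziness=2)
```

An exact search finds only the clean instance. A fuzziness of 1 recovers "querics", but reaching "quenes" requires a fuzziness of 2, at which point "queues" and "quarries" are returned as well.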
To provide Web delivery of the images, the original 400 dpi bitonal page images were converted to GIF format (JPEG causes bad ringing around characters in bitonal images) and scaled for optimum viewing on an 800 pixel wide screen. Four bits of grey were added to mask the jaggedness which results from bitonal image reduction. Initially the ImageMagick shareware product was used for this conversion but it was unacceptably slow and the image quality was poor. A switch was made to Image Alchemy, which provided much better image quality and a fivefold increase in throughput. The quality of the display images has improved as the project has progressed; latterly they are very good.
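The effect of adding four bits of grey during reduction can be sketched as follows. This Python illustration averages each block of bitonal pixels into one of 16 grey levels; it is an assumed, simplified box filter and not the method used by either ImageMagick or Image Alchemy, and the reduction factor of 4 is hypothetical.

```python
# Box-filter reduction of a bitonal image to 4-bit grey.
# Illustration only; the real converters' filters are not described here.


def downsample_to_grey(bitonal, factor, levels=16):
    """Reduce a bitonal image (1 = black, 0 = white) by an integer factor.

    Each factor x factor block becomes one pixel whose grey level runs
    from 0 (all black) to levels - 1 (all white).
    """
    out_height = len(bitonal) // factor
    out_width = len(bitonal[0]) // factor
    result = []
    for block_y in range(out_height):
        row = []
        for block_x in range(out_width):
            black = sum(
                bitonal[block_y * factor + dy][block_x * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            white_fraction = 1 - black / (factor * factor)
            row.append(round(white_fraction * (levels - 1)))
        result.append(row)
    return result


# Two 4 x 4 blocks side by side: the left is solid black, the right
# has a single black column, such as the edge of a character stroke.
sample = [
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
]

reduced = downsample_to_grey(sample, factor=4)
```

The solid block maps to level 0, while the block containing only the edge of the stroke maps to an intermediate grey rather than being forced to black or white. It is these intermediate levels along character edges that soften the jaggedness of a reduced bitonal image.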
The EFS product does not include a fully adequate mechanism for browsing forwards and backwards within or between journal volumes. SGML would provide the obvious solution, but is not supported within EFS. The EFS search engine will therefore be complemented by a browse mechanism mounted on the Oxford Web server and based on SGML document definitions.