The author begins by providing a brief overview of the
different approaches available for the digitisation of library
materials and indicates the benefits and shortcomings associated
with these techniques. It is from this perspective that the
reader is introduced to the relatively new rationale behind the
Internet Library of Early Journals which is mounting material
drawn from six 18th and 19th century journal titles. A large
corpus of material is required to attract scholars to an
electronic archive of this nature and the project aims to mount
120,000 page images. Pages are scanned from both paper and
microfilm originals: the paper volumes with a Minolta PS3000P
overhead scanner and the microfilm with a Meckel MX500XL-G. The
requirement for high volume, high throughput, low cost production
of images excludes labour intensive operations and the images are
simply OCRed without manual correction. The Excalibur EFS
database is used to provide full text fuzzy searching of the
uncorrected OCRed text. The project has just started full
production and the author discusses the findings so far.
Introduction
The Internet Library of Early Journals is a two year
collaborative project by the universities of Birmingham, Leeds,
Manchester and Oxford and funded within the Electronic Libraries
Programme (eLib). The project (ILEJ) is due for completion in mid
1998 when a full report will be made available. The intention is
to create a full text resource of 120,000 digitised pages, with
indexes, of substantial runs of three 18th and three 19th century
journals, to make these available as widely as possible to the
academic community, and to evaluate both the technology and user
perception. The journals selected are Notes and Queries, Blackwood's
Edinburgh Magazine and The Builder from the 19th
century and The Gentleman's Magazine, The Annual
Register and Philosophical Transactions of the Royal
Society from the 18th century. To date, the first title, Notes
and Queries, is available and can be accessed online
(URL: http://www.bodley.ox.ac.uk/ilej). This paper aims to give an
introductory overview to the possible approaches to digitising
library materials in order to show the complex and often
conflicting decisions that have to be made. From this perspective
the decisions made within the ILEJ project are introduced and
located within the digitisation spectrum.
Full text resources
Traditional library computational aids (eg catalogues,
databases) have provided the researcher with a list of
bibliographic references. She must then take this list to the
shelves or to inter-library loans to gain access to the physical
material. Of course both of these options may involve delays,
waiting for another user to return a book or awaiting delivery
from another library. This model, in which the electronic data is
solely bibliographic and simply references the physical paper
material, is now changing with the availability of full text
resources stored electronically. The decrease in the price of
computing power means that it is now cost effective to store
complete texts electronically, providing both new benefits and
new challenges to the user. The advantages of full text relate
primarily to access. Immediate and universal access to the
material can be provided without delays. But also of major
benefit is the enhanced access to the text itself: the ability to
search a complete electronic corpus almost instantly; the
opportunity for automatic textual analysis and the facility for
electronic documents to be linked directly to other relevant
resources (for example, the British Library Electronic-Beowulf
which provides linkage between three versions of the text
(Electronic-Beowulf, 1997)). The benefits of electronic full text
are such that these resources are proliferating exponentially
through the library world.
Approaches to creating a full text resource
The benefits to the researcher of course bring difficulties
for the librarian, or at least difficult decisions. There are
many different approaches to providing a full text resource and
it is not clear at this stage which techniques are the best for
which types of material: it is a time of experimentation. Indeed
the Internet Library of Early Journals is just such an
experiment. The following paragraphs give a brief introduction to
the different approaches that have been taken by different
projects, and hopefully highlight the complexities of the whole
digitisation question.
Text Files
By far the simplest method of providing an electronic full
text resource is simply to make available a text file. In the
instance where the original document was created electronically,
it is a trivial and low cost process to make it available for
download and searching etc. This method also has the advantage
that the user is seeing the original text as delivered by the
author without the involvement of any conversion process; what
you see is what the author wrote (WYSIWAW?). There are still
issues of presentation with this scenario (are you viewing the
document with the correct font etc) but the integrity of the
actual text is assured. Difficulties with digitisation begin to
arrive when making available material that was not created
electronically. The ILEJ project, for example, faces this
difficulty in its focus on 18th and 19th century texts. Again,
the most straightforward solution would seem to involve simply
typing the text into a computer. This, however, can be an
expensive process and, perhaps more importantly, is prone to
include typing errors: the researcher is no longer guaranteed the
integrity of the text. What you see may NOT be what the author
wrote. It is possible to increase the typing accuracy by having
the text entered by two or even three typists and to compare the
results electronically, correcting any mismatches between the
different texts. This yields an almost perfect, though still not
guaranteed, rendition of the original. But imagine an etymologist
finding an example of two different spellings for the same word
in an electronic text. Can she be assured that it accurately
represents the original text as written by the author? But
further, imagine the cost of three typists entering all your
data! Retyping the original document is an expensive process that
is open to errors; nevertheless, very large resources of older
material are becoming available to the user community, on CD-ROM
or via the Internet, using exactly this technique; examples are
Chadwyck-Healey's very valuable English Poetry and Patrologia
Latina (O'Rourke, 1996).
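The electronic comparison of double-keyed text described above can be
sketched as follows. This is a minimal illustration, not the procedure
used by any of the projects cited: it assumes each typist's output is
available as a plain string, and the function name and sample sentences
are invented for the example. It uses Python's standard difflib module
to report the word positions where two keyings disagree.

```python
import difflib


def keying_mismatches(first, second):
    """Return (word position, first reading, second reading) per disagreement."""
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record where the two keyings diverge and what each typist entered.
            mismatches.append((i1, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return mismatches


# Invented sample text: the two typists disagree on one spelling.
typist_one = "the colour of the old waggon was a dull grey"
typist_two = "the color of the old waggon was a dull grey"

for position, left, right in keying_mismatches(typist_one, typist_two):
    print(f"word {position}: {left!r} vs {right!r}")
```

Each reported mismatch still has to be resolved against the original
page, and an error made identically by both typists is never flagged,
which is why the result is almost, but not guaranteed, perfect.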
Digital Images
However, a further consequence of the plummeting price of
computing power is that the processing of large computer
images is now cost effective and can provide an alternative
method of delivering full text resources. Instead of typing in
the text, the original page is scanned and the user is presented
with a computer image rather than a text file. The advantage is
that it is much quicker and cheaper than having each page hand
typed. Also, the integrity of the text is assured because the
user views an image of the original page. Our etymologist can now
sleep peacefully at night. The fundamental problem with this
method is that at present it almost always involves dismembering
and trimming the originals so that they can be fed through a
sheet feeder into a flatbed scanner. If there is a sufficient
margin they can of course be rebound, but this adds significant
cost to the operation. In most cases the original volume is
simply discarded with the electronic surrogate becoming the
primary resource. Discarding the original volume raises many
further problems and expense. The original is lost so the value
of the original over and above the electronic version must be
accounted for. No matter how high a quality of scan is achieved,
information available in the original will be lost. Imagine an
individual researching the history of paper manufacture: no
information for her in an electronic image. Not only is there an
inevitable loss of information but there are also the costs of
maintaining the now primary electronic version. The fact that
the electronic version is the primary source itself increases
its cost, as the electronic copy must be of an archival quality
that can fulfil all potential future uses of the material. This
requirement increases storage costs because higher quality images
involve larger file sizes and therefore larger disks or longer
tapes or more CDs on which to store them. Further, the costs of
maintaining an electronic archive for long periods into the
future will be considerable. Simply backing up the material onto
magnetic tape and storing it in a cupboard will not suffice. Will
you have the correct format tape drive available in 10 years'
time? Will you have the correct software to read the images? Will
you have the correct computer hardware/operating system to run
the software? Is the integrity of the media dependable for that
period or will you need to "refresh" the data onto new
media after a fixed number of years? The costs of maintaining
electronic data are considerable and must be costed when
considering the destruction of the original. In large
digitisation programmes this cost can of course be offset against
the resources freed by the destruction of the original, eg shelf
space in buildings with expensive reinforced flooring etc. Having
said all this, destroying the originals and storing the data in
electronic form is still a viable method of creating a full text
resource, particularly where the original is already
deteriorating; Cornell University have pioneered the methodology
for this approach (Kenney and Chapman, 1996). A variation of this
method is to store the primary copy on microfilm rather than
electronically. This has the benefit of not being dependent on
the highly dynamic changes of computer technology; a microfilm
can simply be stored in the cupboard and retrieved again in 10
years' time. Equipment to read (or scan) microfilm will still be
around. This hybrid approach of providing an electronic copy for
access and a microfilm copy for archive removes a lot of the
expense and concerns of maintaining the digital images. Yale
University is active in developing methodologies for converting
microfilm to digital images (Conway, 1996).
Indexing Digital Images
Assuming that all these difficulties have been successfully
accommodated, there are still problems associated with a full
text resource of digital images. Firstly, image files are
significantly greater in size than text files that have been
typed. Not only does this increase storage costs but it also
increases download times which can be significant in a networked
or internet environment. This may involve a reader spending long
periods waiting for delivery and the associated frustration.
Secondly, having page images is all very well but haven't we lost
the second of our access benefits which initially prompted the
urge to digitise: the ability for full text searching and
electronic analysis etc? In some situations, availability of the
originals is so restricted that access is worthwhile in itself.
This is particularly true with very valuable material such as
medieval manuscripts, but can also be useful if the material is
less valuable but physically remote. The Research Libraries Group
Studies in Scarlet collection (Research Libraries Group, 1996),
for example, is mounting some unindexed images. But generally a
collection of pages that have simply been scanned is of little
use to a researcher; it needs the additional value provided by
electronic indexes and full text to make those images useful.
These indexes may simply be hand typed copies of the original
indexes and contents pages but this fulfils only the first of our
access requirements: immediate availability. In terms of opening
up the text for further searching and analysis, this method
offers no benefit over the original volumes. Deeper indexes can
be produced manually but this of course is an expensive process.
However, in some cases, such indexes have already been created by
third parties and may be available.
OCR
The method of obtaining full text searching from digitised
page images is to run the images through an Optical Character
Recognition (OCR) program, which creates text file versions of
the images that can then be searched. This is an
automated process and OCR software is relatively cheap so the
cost of converting large quantities of data is very low. However,
the pitfall with OCR is that even with near perfect modern page
images there will always be errors: the converted text will never
be 100% accurate, and accuracy drops considerably for older
texts. One solution is to proof-read and correct the OCR text and
this approach has been taken by the eLib funded project ACORN
(1997). However this final step adds an enormous workload to what
has so far been an almost totally automated process. It is
important not to underestimate the time consumed in proof reading
and interactive editing. Unless the text is of very high accuracy
it can often be cheaper to revert to the initial strategy of
copy-typing the originals; interactive editing is much, much
slower than direct copy-typing. But even after the most careful
proof reading and editing, will our etymologist still be able to
sleep peacefully at night?
The Internet Library of Early Journals
It is upon this difficult terrain that the Internet Library
of Early Journals (ILEJ) set out to digitise 120,000 pages of
18th and 19th century material and make them available to
scholars. Journal titles were chosen not only for their interest
to scholars but also for the nature of their content. The
Gentleman's Magazine, The Annual Register, and Notes and Queries
all consist of short pieces on a wide variety of subjects, with
lists of births, marriages and deaths, book reviews and similar
material. This eclectic content is impossible to index adequately
and it was felt that full text searching would open up valuable
information that had previously proved recalcitrant. Electronic
free text searching would maximise the research value of this
type of journal. The nature of these titles means that their
attraction to scholars could only result from a large 'critical
mass' of material, a mass that would make consulting the resource
worthwhile. Twenty year runs of each were seen as providing the
minimum quantity capable of attracting scholars to the facility.
For The Gentleman's Magazine, the intention is to digitise up to
100 years of material, since the availability of this title on
microfilm offers the prospect of faster processing. This
necessity for a large quantity of data was the primary fulcrum
upon which project decisions rested; it characterises our whole
strategy. Our fundamental requirement was for high volume, high
throughput and low cost conversion of large quantities of data.
Such a requirement necessarily precludes any high cost text
conversion, whether it be copy typing or proofing and correcting
OCR. Images would therefore be scanned from original volumes and
from microfilm and it is these images which would be displayed to
the user. Hypertext indexes would be created for the images,
both from existing electronic indexes supplied via third parties
and by hand typing existing paper indexes and contents pages. But
by far the greatest value would be added by OCRing the text to
allow full text searching of material that would otherwise remain
opaque.
OCR accuracy
Because the project ethos of 'high volume, low cost' excluded
proof-reading and correcting OCR, the OCR would be left with
errors. This means that the OCRed text files cannot be supplied
as the primary data (as they are for example in Project ACORN)
but would be used to provide additional value to the page images
by allowing searching etc. The electronic library would therefore
consist of linked pairs of image/text files in which a search can
be made of the text files, but the original page image is
supplied for consultation when a text match has been found. The
advantage of storing image/text pairs is that, though the OCRed
text may not be correct, the user actually consults the original
page image and so integrity of the original is guaranteed. This
model is in fact used by commercial office document management
systems in which letters etc are scanned on their arrival and the
images retrieved later by a search through the OCRed text.
Because there was no intention of correcting the OCR it meant
that, from the start, exhaustive searching of the material would
not be possible; there would always be badly OCRed terms that a
search would not find. It is this shortcoming that gives the
Early Journals full text resource a different slant to comparable
resources. The usefulness of the type of material digitised
required a large corpus, and within fixed spending constraints
this criterion has caused an associated (hopefully insignificant)
reduction in retrieval. It is our belief that our positioning in the
cost/retrieval matrix is appropriate for this type of journal;
that is, that scholars would prefer high retrieval rates from a
very large corpus of data as opposed to 100% retrieval rates from
a much smaller corpus. One of the aims of the project will be to
evaluate the truth of this conviction.
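The linked image/text pair model described above can be sketched as a
small data structure. This is an illustrative sketch only: the class
name, file names and sample text are hypothetical and do not describe
the project's actual storage layout. The search runs over the
uncorrected OCR text, but what is returned is the page image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PagePair:
    image_file: str  # page image shown to the reader (hypothetical file name)
    ocr_text: str    # uncorrected OCR text, used only for searching


def search_pages(pages, term):
    """Return the image files of pages whose OCR text contains the term."""
    needle = term.lower()
    return [page.image_file for page in pages if needle in page.ocr_text.lower()]


# Invented sample pages; the second simulates a long s misread as 'f'.
pages = [
    PagePair("page_0001.tif", "A note on parish registers"),
    PagePair("page_0002.tif", "The fteeple of the old church"),
    PagePair("page_0003.tif", "A query on the church bells"),
]

print(search_pages(pages, "church"))   # pages 2 and 3 are found
print(search_pages(pages, "steeple"))  # nothing found: the OCR error hides page 2
```

The second search shows the shortcoming discussed here: with exact
matching, a badly OCRed term is simply never retrieved, even though the
page image itself is intact.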
Fuzzy Searching
In order to achieve as high a retrieval rate as possible,
ILEJ is using the Excalibur EFS fuzzy searching software already
in use by the British Library (British Library, 1997) and by
Project Elinor at De Montfort University (Ramsden, 1993). EFS is
a search engine that allows the user to increase the degree of
'fuzziness' associated with their search. Increasing fuzziness
will increase the number of hits found by returning not only
words that exactly match the search term but also those with a
similar spelling. This technique allows true matches to be found
from terms that have been OCRed incorrectly. Fuzziness can be
gradually increased returning more and more hits but with less
and less correspondence to the original search term. This
technique will allow the user greater penetration into the corpus
but at a cost; researchers will have to be more thoughtful in
choosing their search terms and will be presented with a much
greater quantity of negative research, ie dismissing matches that
are not relevant. Indeed, this whole paradigm will involve the
researcher in developing new skills to successfully utilise this
type of resource. In the good old days the reader would search
through bibliographic data entered by trained librarians who had
used controlled and standard language. However, anybody who has
used Alta Vista to search the World Wide Web will be aware of the
disadvantages of full text searching and the care needed in
choosing search terms. A carelessly chosen term can involve the
return of gargantuan quantities of irrelevant material. Although
this problem of noise applies to all uncontrolled full text
searching it is further exacerbated by fuzzy searching. Not only
will fuzzy searching return badly OCRed words that are relevant
to the reader, it will also return similarly spelt words which
are not relevant, increasing the amount of necessary negative
research. This factor adds another axis to our matrix:
cost/retrieval/user-convenience. Again it will be part of the
project to evaluate our position against these axes and make
recommendations for further projects. Because of the factors
outlined above, successful techniques for negative research are
paramount to the success of this methodology. As previously
mentioned, one of the disadvantages of mounting page images as
opposed to text is their size. A page image file, even when
compressed, can easily be 20 times the size of the corresponding
text and consequently delivery over the internet can be expected
to be 20 times as slow. If a reader is having to sustain a
handicap of this order purely in order to dismiss data that is
irrelevant then there will be significant difficulties and
frustrations in using the service. Initially the project had
considered using the OCRed text solely as a hidden search tool
but it is now coming to be seen as an important resource for
negative research. It is perceived that the OCRed text will
provide enough context and be of sufficient quality to allow the
reader to dismiss a page as not relevant. It is only when a
relevant page is identified that the image is downloaded for
consultation.
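The trade-off between fuzziness and noise can be illustrated with a
small sketch. It does not reproduce the matching method of Excalibur
EFS, which is not described in this paper; as a stand-in it uses the
string similarity ratio from Python's standard difflib module, and the
fuzziness scale (0.0 to 1.0), function name and sample text are
assumptions made for the example.

```python
import difflib
import re


def fuzzy_hits(text, term, fuzziness):
    """Return words in text whose similarity to term is at least 1 - fuzziness."""
    threshold = 1.0 - fuzziness
    hits = []
    for word in re.findall(r"[a-z]+", text.lower()):
        score = difflib.SequenceMatcher(None, term.lower(), word).ratio()
        if score >= threshold:
            hits.append(word)
    return hits


# Invented OCR output containing one misread word ('fteeple').
ocr_text = "The fteeple of the church; a steeple-jack; the people of the parish"

for fuzziness in (0.0, 0.2, 0.4):
    print(fuzziness, fuzzy_hits(ocr_text, "steeple", fuzziness))
```

With no fuzziness only the exact word is found; a little fuzziness
recovers the badly OCRed 'fteeple'; more fuzziness also returns
'people', a similarly spelt but irrelevant word that the reader must
then dismiss.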
Scanning
The model so far describes high speed scanning and automatic
OCRing to provide a high volume low cost resource. The only
significant manual input has been the hand entering of the paper
indexes and contents pages. Indeed, this is exactly the model for
those titles scanned from microfilm. The microfilm rolls are fed
into a Meckel MX500XL-G scanner which automatically moves through
the whole film. However, to date, most of the scanning experience
of the project has come from scanning bound volumes using the
Minolta PS3000P overhead scanner. The age and value of the
original volumes meant that dismembering was never considered.
This decision limited the choice of scanner for paper originals
to the Minolta PS3000P which today still remains the only
hardware available for high throughput scanning of bound volumes.
The scanner consists of a 'cradle', upon which the volume is
placed open and face up, and the scanning optics which stand
about eighteen inches over the cradle. The cradle is made up of
moving plates which yield under the book's weight to allow it to
be opened flat without stressing the binding. The operator simply
holds the book open and presses a foot pedal to initiate the
scan. The use of an overhead scanner, although it removes any
costs associated with dismembering volumes, significantly
increases the actual scanning costs by requiring the full time
attendance of an operator. In practice, the older tightly bound
volumes used in the project required careful manipulation by the
operator and throughput was considerably slower than the scanner
specification. An average of about 80 pages per hour was typical;
scanner operator costs are a significant factor with this type of
equipment. The PS3000P is currently limited to bitonal 400 dpi
which is not sufficient resolution for archival standards.
Benchmark techniques developed at Cornell were applied to the
ILEJ journals to determine the dpi required for full
informational capture from bitonal scanning (Kenney and Chapman,
1995). Examination showed that Notes and Queries and The
Gentleman's Magazine had characters slightly under 1.0 mm in size
which would require a dpi of 615 for full capture. Current
technology, therefore, will not allow for the scanning of bound
volumes to archival standards.
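The 615 dpi figure can be reproduced with a short calculation. The
formula and the target quality index below are not stated in this
paper; they are assumptions based on the bitonal benchmarking formula
as commonly given in the Cornell tutorial, QI = (dpi x 0.039 x h) / 3,
with h the character height in millimetres and QI = 8 taken as the
target for full capture. The function names are hypothetical.

```python
def required_bitonal_dpi(char_height_mm, quality_index=8.0):
    """Resolution needed to reach the given quality index (assumed formula)."""
    return 3.0 * quality_index / (0.039 * char_height_mm)


def achieved_quality_index(dpi, char_height_mm):
    """Quality index reached at a given bitonal resolution (assumed formula)."""
    return dpi * 0.039 * char_height_mm / 3.0


# Characters of 1.0 mm need about 615 dpi under these assumptions.
print(round(required_bitonal_dpi(1.0)))

# At the scanner's 400 dpi the same characters fall well short of the target.
print(round(achieved_quality_index(400, 1.0), 1))
```

Since the characters in these journals are slightly under 1.0 mm, the
resolution actually required would be a little above 615 dpi.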
OCR in ILEJ
Omnipage Pro version 6.0 was the chosen software for OCR as
it provided good recognition accuracy and a strong feature set
including grey-scale and batch processing capabilities.
Assessing OCR quality presents an insoluble difficulty: it is
impossible to get accurate quantitative data without manually
counting the words in the original. With older journals this
problem is exacerbated as OCR
is particularly sensitive to variations in type-weight, foxing,
show-through etc that occur in this type of material. Because the
quality of older material is so variable, any meaningful
statistics on OCR performance would have to be aggregated over
impracticably large quantities of data. It is therefore not
feasible to come up with a single percentage accuracy of OCR
across the large volume of material scanned. However, it has
become clear that achieving high quality OCR is highly
problematic in these circumstances and these problems are rooted
in three different factors. Firstly, OCR packages are designed to
OCR modern office documents using modern characters in modern
fonts. Eighteenth century material contains long s's (ie an 's'
that looks like an 'f'), ligatures and diphthongs that are not
recognised by OCR. Although OCR can, in a limited fashion, be
trained to recognise alien characters, the recognition
capabilities never match those achieved with standard text.
Secondly, by far the greatest difficulties have arisen from the
inconsistent quality of the original material. A page with dark
type face and show through will be followed by one with faint
type and broken characters. Some pages will be dark but with
areas of light and broken characters. Pages will be skewed and
columns can wander. Such variability does not lend itself to high
volume processing. Ideally the scanning parameters would be
adjusted to suit each page, but this is of course impractical.
The final problem area lies with the overhead book scanner
itself. As previously mentioned, the limitation of 400 dpi
prevents full capture, which has ramifications for OCR quality.
Also, the scanner currently supports only bitonal scanning
(grey-scale is expected later this year). OCRing grey-scale
helps enormously
with variable material as the OCR is able to 'ride' over the
variation. But finally, the fact that the scanner optics sit 18
inches over an open book means that the image quality is never as
good as that available from a flatbed scanner where the page is
pressed flat against the scanner glass. Problems from uneven page
surfaces, focusing and shadowing are inherent in the use of an
overhead scanner.
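To make concrete why quantitative assessment depends on manual work, the
following sketch measures word accuracy for a single page against a
hand-made transcription. It is an illustration under stated assumptions,
not a measurement procedure used by the project: the function name and
the sample sentences are invented, and word alignment is done with
Python's standard difflib module.

```python
import difflib


def word_accuracy(reference, ocr_output):
    """Fraction of reference words that the OCR output reproduces in sequence."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        raise ValueError("reference transcription is empty")
    matcher = difflib.SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    # Sum the lengths of all runs of words that agree between the two texts.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


# Invented example: a manual transcription and OCR output with two long-s errors.
reference = "the steeple of the old church was struck by lightning"
ocr_output = "the fteeple of the old church was ftruck by lightning"

print(f"{word_accuracy(reference, ocr_output):.0%}")
```

Every page measured this way needs its own manual transcription, and
because page quality varies so widely a figure from one page says
little about the next.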
Conclusion
This paper has, briefly, presented the various approaches
available when considering the development of a full text
scholarly resource. Subsequently it has shown the approach taken
by the Internet Library of Early Journals and discussed some of
the early technical findings of the project. The project can
therefore be understood as feeding back information into the
digital community at two levels: short term technical data and
long term methodological data. The project is gaining experience
with state of the art scanning equipment, OCR and fuzzy searching
software. Recommendations will be made about the benefits and
limitations of this equipment and the techniques needed to reap
the greatest benefits. This type of information is in immediate
demand; people are making choices today about which type of
scanner to purchase, which OCR package, which fuzzy search
engine. Our expertise will be made available to help with these
decisions. But in a year or two this data will be redundant as
technology improves. However, the methodological implications
gleaned from the ILEJ project will be more far reaching. Even
with improved technology the decision as to where to locate a
full text resource in the 'cost/retrieval/user-convenience'
matrix will still be a pertinent one; in fact it may become more
pertinent. As technology improves the difference in cost between
manual production (eg typing, proof reading) and automatic
production (eg scanning, OCR) will increase. It will become
cheaper and cheaper to produce resources in the ILEJ mould, that
is high volume but without full retrieval. The success of such
resources will depend on the as yet unknown axis in our paradigm:
how easily and how well will users be able to locate the
information relevant to their research? It is this component
that may well
prove to be the most interesting return from the ILEJ project.
References
British Library, 1997, Excalibur PixTex/EFS. URL: http://portico.bl.uk/access/excalibur.html
Conway, P., 1996, Conversion of microfilm to digital imagery:
a demonstration project: performance report on the production
conversion phase of Project Open Book. New Haven: Yale University
Library.
Electronic-Beowulf, 1997, URL: http://portico.bl.uk/access/beowulf/electronic-beowulf.html
Kenney, A.R. and Chapman, S., 1995, Tutorial: digital
resolution requirements for replacing text-based material:
methods for benchmarking image quality. Washington, D.C.:
Commission on Preservation and Access.
Kenney, A.R. and Chapman, S., 1996, Digital Imaging for
Libraries and Archives. Department of Preservation and
Conservation, Cornell University Library.
O'Rourke, T., 1996, Chadwyck-Healey - electronic resources for the
virtual library: a publisher's perspective of preservation and
access. In: A.H. Helal and J.W. Weiss (eds.), Electronic
documents and information: from preservation to access (Essen
Universitätsbibliothek, Essen), p.160-169.
Project Acorn, 1997, URL: http://acorn.lboro.ac.uk/
Ramsden, A., Wu, Z., Zhao, D., 1993, Selection criteria for a
document image processing system for the ELINOR Electronic
Library project. Program 27:4, p.371-387.
Research Libraries Group, 1996, Studies in Scarlet.
Research Libraries Group News 40, p.3-11.
The Internet Library of Early Journals is a project funded by the JISC within the eLib programme. The authors also wish to acknowledge the contribution of other members of the project team to the work described.