Appendix E: Digitization Methods

Stuart D. Lee, March 1999

Introduction

This short paper seeks to outline some of the main issues surrounding digitization at the moment. It does not attempt to provide complete solutions or recommendations re hardware, software, resolutions, etc. Instead it provides recommendations applicable to the strategic planning level, with particular reference to the University of Oxford.

Some Starting Topics

The Purpose of the Digitization Project

Many of the relevant issues surrounding a digitization project have already been raised in the previous paper on Assessing Collections for Digitization, but they are worth stressing again. The questions that need to be asked (and answered!) all turn on the purpose of the project; in other words, it is vitally important to be clear about the reasons for embarking on a digitization project from the outset.

The Nature of the Source Material

Having established some of the ground-rules noted above, the nature of the source documents (the material that is to be digitized) should be considered next. Kenney and Chapman (June, 1996, pp. 3-4) suggest that the source documents can be viewed in terms of the original medium they are stored on (i.e. the production process), and their noted attributes.

The medium on which the analogue material is presented to the digitizer needs to be considered, focussing on the ‘physical’ (as opposed to ‘content’) attributes of the material. The analysis of this material has to be done by the digitizer in collaboration with the curators of the material and with experts in conservation methods. The most common physical attributes that need to be accounted for include:

Physical constituency: Paper (matt and gloss), Vellum, Papyri, Microform and other Transparencies (e.g. 35mm slides), Glass, Three-Dimensional Objects (e.g. artefacts such as pottery, statues, book-bindings), Glass plates, Vinyl Records, Audio Cassettes, Audio CDs, Audio Tape Spools, Film, Video (NTSC/PAL/SECAM), etc.

Physical dimensions: With non-time-based media the actual dimensions of the object are extremely important: for example, it is difficult to digitize large maps or posters using conventional scanning equipment, and this may require creating a surrogate (e.g. a photograph) and scanning from that. With time-based media you need to consider the length of the clip, the frame size, and the frames-per-second rate.

Physical robustness: Can the document be disbound, for example? Or is it so valuable or delicate that it needs to be digitized under certain conditions? The Refugee Studies Programme Project at Oxford (akin to Yale’s Open Book Project), for example, was in a position whereby it could disbind all its material and thus greatly increase the throughput achieved by the digitizers, whereas the ILEJ project at Oxford could not disbind any of its material (in that case 18th and 19th century journals), and its digitization process had to account for curvature of the page resulting from tight binding. At the other end of the scale are the requirements the Celtic and Medieval Manuscripts Project (Oxford) had to work under, which demanded the design and construction (from scratch) of special cradles to hold the manuscripts, and the buying-in of new lighting equipment.

In addition to the physical attributes the ‘content’ attributes of the document need to be analysed (to feed directly into ‘Benchmarking’ below). Expanding Kenney and Chapman’s original list to include time-based media, content attributes fall into the following categories:

Text/line art: monochrome documents, i.e. with no tonal variation. Examples might be texts (such as the ILEJ journals), woodcuts, and black and white microforms. N.B. with relation to text, all references in this article are to scanned images of the text (which may subsequently be OCRed), not keyboarded text.

Continuous tone: varying gradation in tones, either monochrome (i.e. grey gradations between black and white) or colour. This would cover photographs, works of art, manuscripts, etc.

Half-tones: spaced patterns of dots (either monochrome or colour). Used in line engravings and etchings.

Mixed: contains two or more of the above.

Artefacts: three-dimensional objects. Texture, shadows, etc., all need to be taken into account.(1)

Audio: spoken word, music, sound effects (or a combination of all three). Either mono or stereo.

Film: in most cases continuous tone (black and white or colour), but occasionally line-art for animations. Can include an audio soundtrack.

Some Difficult Issues

Before moving on to the actual processes and standards used in digitization, it is perhaps wise to consider a few issues which are currently posing problems in the digital arena.

Digitization as Preservation

The question of digitization for preservation is one that is often central to the focus of initiatives. It has been suggested that it is possible to establish a standard of archival quality for digital objects, notably scanned images, which will meet preservation needs, acting as a substitute for any other surrogate (although no accepted standard exists at the moment(2)). Because the master is digital, surrogates and derivative files can be produced from it without any damage to the original digital master. In theory the digital object would retain all significant information contained in the original document(s), and, under appropriately stringent conditions related to migration, refreshing, and backing-up of the original file, should survive indefinitely.

However, there is considerable unease within the library sector at the prospect of relying on a digital copy as a substitute for other preservation formats(3); a particular problem is the long-term institutional commitment to the maintenance of digital files. It is very rare to find any institution that has a fully comprehensive policy in place to guarantee the active migration and refreshment of digital objects to ensure longevity of access. Many of the variables involved in such a process are as yet unknown, and where they are known it is clear that such maintenance involves considerable cost for the host archive(4).

Where digitization can help in ‘preservation’ is, of course, in the deflection of demand to view the original document. Most curators of rare or valuable material are acutely aware of the damage that repeated handling can do to the original document, and are constantly seeking to limit this access. Where the document is in a particularly bad state of repair (or is classed as a security risk, e.g. due to loose leaves), the material may have to be withdrawn from general use and made available only to satisfy the most pressing of research needs, or not at all. The availability of a high resolution digital surrogate can only help the curator, offering the researcher a further option to consult before handling the original. As P. Noerr (1998) notes:

Physical handling is one of the most destructive things that can happen to a fragile object. One of the best ways to preserve it is to limit physical access to it. This is a very strong case for creating a digital library

Yet this should not detract from the unavoidable truth that any copy, be it digital or microform, can only serve as a surrogate, not as a replacement for the original. Even with microfilm no surrogate has ever been regarded as a perpetual preservation copy of the original item and this rule should equally be applied to digitization. Above all, there should be no detraction from the continued efforts to preserve the original.

Use of Film

The use of film in digitization is well-established. In many cases this relates to the scanning of existing surrogates (e.g. the proposal re the Bodleian Select 50 Western Manuscripts). In short, it is considerably cheaper to scan from microfilm stock (either in-house if the equipment exists, or out-sourced) than directly from the primary source. The only questions that arise are whether the microfilm is of a sufficient standard, and whether the image derived from the microfilm will satisfy users.

However, when it comes to scanning material for which no surrogate exists, particularly material that is in need of preservation, things become more difficult. It has already been observed that, at present, microfilm provides the best certainty of preservation for graphical and textual material (if stored under ideal conditions). It is clear, for example, that a microfilm held under standard preservation conditions can last centuries, whereas the longevity of a digital file, bearing in mind the costs of maintaining its currency, the problems of migration, etc., is uncertain at best. Alan Howell, in his survey of newspaper digitization projects (1997, http://www.thames.rlg.org/preserv/diginews/diginews2.html#film-scanning) noted:

The most effective means to preserve the intellectual contents of newspapers is preservation microfilming. Ideally, reformatting should be undertaken when newspapers are acquired. It must be done before they become too brittle to handle - somewhere between 25 and 100 years depending on their initial strength, use, and the environment in which they are stored. If preservation reformatting is on 35 mm polyester-based silver-halide microfilm, and the film is processed to recognised international standards for chemical stability, housed in inert containers, and then stored under controlled environmental conditions, the microfilm is expected to last several hundred years. The microfilm can serve as a preservation master which can be scanned to provide access copies in digital image form.
He concluded from his survey of existing projects that microfilming is ‘perhaps the most important’ preservation reformatting strategy, and its importance could increase as optics improve and as standards for microfilming with a view to later digitization are refined. As Yale’s Project Open Book ‘Organizational Phase’ noted:
The first working hypothesis--that microfilm is satisfactory as a long-term medium for preserving content--builds on the features of microfilm as a long-lasting, inexpensive technology that is well understood in libraries. However, the linear nature of microfilm does not provide easy access. It is cumbersome to browse and read, it requires special equipment at a single location, it does not facilitate use of an item's internal structure, and it does not produce high quality paper copies. (http://www.clir.org/cpa/reports/openbook/openbook.html)
In the case of black and white/greyscale material, microfilming seems relatively clear-cut. Colour images, however, present an extra dimension to the problem, as colour images held on film degrade much more quickly and new copies have to be taken every couple of decades. Yet it has been observed that there have been considerable technical improvements in colour microfilms recently which will serve to increase their longevity. More importantly, in common with all microfilms, they do not have to be refreshed for at least twenty years (i.e. their maintenance is relatively low), and it is just possible that they could miss one cycle of refreshment without causing too much concern. Digital images, however, probably need to be reviewed with a view to refreshing/migration every three to five years, and if one of these cycles is missed it could be disastrous (sometimes termed the ‘fast fires’ of digital obsolescence, as opposed to the ‘slow fires’ seen in acid-based printing methods).

However, in the digitization arena there are two approaches to the use of film. As Columbia University states in its ‘Technical Recommendations for Digital Imaging Projects’:

Scanning can be done directly from the item or a film intermediary can be made and scanned. Film intermediaries include most commonly 35 mm slides, 4 x 5 transparencies, microfilm, and single-frame microfiche. If properly made and stored, the film intermediary can act as a preservation copy of the item.
The quality of the intermediary will have a direct impact on the quality of the digital image. If the intermediary is poorly made, scratched, faded, or out of focus, the scanned image will be inferior. If the intermediary is of high quality, the scanned image will normally also be high quality. It is best to use camera negatives whenever possible. Every time a slide or other type of film is duplicated, it loses detail and resolution, and the resulting scan is poorer quality.
(http://www.columbia.edu/acis/dl/imagespec2.html)
The question that needs to be asked (for new projects with a preservation element built in) is whether microfilming should be performed first and digital images taken from the film, or whether digitization should be the first action, with the digital files then used to output to microfilm; hence the term ‘Computer Output Microfilm’ or ‘COM’. (In the case of retrospective conversion of old microfilm stock to digital format, this decision is not applicable unless the masters are of sufficiently low standard to warrant rephotographing.)

The most comprehensive review of the potential of COM was performed by the Cornell Digital Microfilm Conversion Project (see Kenney, 1997, http://www.thames.rlg.org/preserv/diginews/diginews2.html#com). This was a sister project to Yale’s Open Book initiative, as both were involved in creating 600dpi bi-tonal images of 19th century brittle books. The COM project investigated the quality and cost effectiveness of the scan-first, then output-to-microfilm approach, as opposed to filming first and then scanning, and concluded that the scan-first approach could produce results of significantly better quality.

However, this should not be read as a strong recommendation to adopt COM as the solution to the hybrid approach of using microfilm for preservation purposes and digital images for access. As the project points out: ‘The decision to go with one approach over the other [film first or scan first] will depend on a range of variables associated with the attributes of the original, institutional capabilities, and the availability of appropriate imaging products and services’. For example, in Australia’s Ferguson Project (Webb, http://www.nla.gov.au/nla/staffpaper/cwebb1.html) the COM process outlined by Cornell was weighed against the film-first, scan-second approach of projects such as Yale’s Open Book initiative, and the latter was adopted, though there were considerable difficulties.

Chapman, Conway, and Kenney (1999) have conducted the most recent study of these issues. Comparing the work done at Cornell (COM) and Yale (film first then scan) the study focuses on ‘The Future of the Hybrid Approach for the Preservation of Brittle Books’ (hybrid meaning digital and microfilm). It notes that ‘despite predictions that microfilm could be replaced by digital imaging, many have come to appreciate that digitization may increase access to materials but it does not guarantee their continued preservation’. The study rests on the assumption that ‘until digital preservation capabilities can be broadly implemented and shown to be cost-effective, microfilm remains the primary reformatting strategy’ with reformatting being the only viable strategy for the preservation of brittle paper, and that ‘although digital imaging can be used to enhance access, preservation goals will not be considered met until a microfilm copy or computer output microfilm recording of digital image files has been produced that satisfies national standards for quality and permanence’.

In short, digitizing first and then outputting to microfilm can produce significantly better quality, as noted above. From a workflow perspective it also involves only one capture step from the original in order to produce access-level images. In addition, microfilming first (with the knowledge that the microfilm is to be digitized) can be a troublesome process, as experienced by the Bodleian Broadside Ballads Project, which noted a considerable drop in throughput by the microfilming unit when attempting to meet the needs of future digitization. However, the equipment needed to produce COM in-house is costly and not readily available at Oxford (though out-sourcing of COM is clearly an option to be considered). Additional resources would therefore have to be found to make this a viable option for any digitization unit.

For a comprehensive review of agencies performing Microfilm Scanning (including the Zuma Corporation used by the Bodleian Photographic Studio), and COM vendors, see ‘Technical Review’ RLG Diginews 1.2 (August, 1997 - http://www.rlg.org/preserv/diginews/diginews2.html#hardware&software).

Digitization for Access

Demand for access to original materials (often termed ‘primary source materials’) is increasing. This is partly due to the widening education market, with the notable growth of the postgraduate sector and of ‘amateur’ research amongst the retired (i.e. the much-vaunted phenomenon of lifelong learning). Despite repeated reservations about the suitability of digital surrogates as a preservation format, there is almost unanimous acceptance that digital files are extremely well suited to facilitating access. Not only can they be transmitted easily via the Web or FTP, they can also be viewed on relatively cheap equipment.

This should not imply, however, that digitization should be regarded as ephemeral, or short-term. Chapman and Kenney’s observation that ‘digital conversion efforts will be economically viable only if they focus on selecting and creating electronic resources for long-term use’ still applies (http://www.dlib.org/dlib/october96/cornell/10chapman.html).

Digital access can also enhance the potential for analysis: a digital object can be edited, spliced, filtered, etc. without any damage to the original master, and researchers can subject the file to all manner of analyses (e.g. image analysis software) without causing any damage. Increased access is also, unfortunately, a double-edged sword. The widespread availability of digital surrogates (e.g. via the web) can ultimately lead to increased demand for access to the original (as borne out by previous experiences with microfilms). It is therefore essential that high quality surrogates be available at the institution housing the original document to deflect this demand (though it must be recognised that even the highest quality surrogates will almost certainly never reduce demand for access to the original to zero).

Having accepted the advantages digitization presents for facilitating access, and the disadvantages digitization has in acting as a substitute for standard preservation methods, it is important not to be misled into digitizing only to a standard which meets current user needs. It is clear from previous projects that it is most cost-effective to digitize at a master level quality to allow for multiple output (e.g. print, microfilm, access images, thumbnails, etc.). This, however, needs to be balanced with the constraints of time and money the project or service is working under.

The Digitization Process in Full

The above discussions have set the scene for a more detailed look at the process of digitization, especially the types of work-flows and decision matrices involved. The digitization chain illustrates the basic steps needed for successful completion of a digitization project. An expanded digitization chain should now read:

Assessment and Selection of Source Material (see previous section)

Digitization assessment

Benchmarking

Full Digitization

Quality Assessment

Post-Editing

Application of Metadata

Delivery

In the abstract this is satisfactory as it covers all the stages one must go through to successfully complete the digitization part of a project. However, in terms of actual practices in the working environment, this is clearly too generalized to be of much use. At the University of Oxford it has been recognised that the priority for the institution is to establish an on-demand (reactive) digitization service that replicates the functionality of a reprographics unit. In addition, bearing in mind the number of unique/rare collections the University currently holds, it should also work towards a more proactive digitization unit that could target collections on a project scale (i.e. not simply reacting to reader requests) and could also offer a cost-effective service to projects throughout the University, operating on a semi-commercial basis. The digitization chain for both of these would be much more elaborate, as can be seen from the two suggested workflows drawn up by this study (see http://www.bodley.ox.ac.uk/scoping/matrix.htm).

Digitization Assessment

Once items have been selected for digitization the next step is to perform an initial assessment of the digitization requirements needed to complete the project. It is clear that prior to any attempt at conversion (even on a sampling scale) there needs to be an assessment of the feasibility of the project in terms of the issues raised in ‘Some Starting Topics’ (above), conservation requirements, and how/where the digitization should take place. It is recommended that curatorial and conservation expertise should be brought in at the earlier stage of assessment and selection, and ultimately the decision as to whether the project can go ahead must always rest with the keepers of the source items.

Understandably, when looking at a collection under the ‘Assessment and Selection’ stage some digitization assessment will take place. However, it should be noted that there may be a considerable time delay between the assessment/selection stage and the actual digitization of the collection, and changes in technology may bring into question some decisions made earlier on. In the case of this study, for example, all the collections analysed will have been studied during the first half of 1999; however, it is highly unlikely that funding will be available before the end of the year at the earliest. In those intervening months there may have been significant advances in the hardware and software available for capture, and a second digitization assessment must be performed in the light of this before full digitization can go ahead.

It is at this stage that one should confirm previous decisions made as to whether the digital surrogates are meant to act as a preservation copy, an access copy, a print copy, or all three, and whether the digitization is part of a hybrid solution in which other surrogates are to be used (e.g. microfilms) as part of the conservation process.

More formally, ‘digitization assessment’ should consider:

  1. the original project proposal and decisions reached in the ‘Assessment and Selection of Source Material’ stage, notably the aim of the project and how it will satisfy user needs, and decisions on ‘cherry-picking’, etc.
  2. curatorial and conservation concerns related to:
     - the ‘robustness’ of the source material (does it need special treatment when digitizing, or alternatively can it suffer such things as disbinding)
     - the security implications of out-sourcing the digitization
  3. the other ‘physical’ and ‘content’ attributes of the source document
  4. the costs of completing the project, in relation to in-house resources and out-sourcing (if allowed)
The aim of the digitization assessment stage should be to confirm that the project remains feasible and to establish how and where the digitization will be carried out. [N.B. If the institution already holds a substantial collection of digital and non-digital surrogates which could satisfy the needs of the project, this would have been picked up at the ‘Assessment and Selection’ stage.]

To give a simple example, when dealing with a graphic original there might be four scanning options to choose from:

Under ‘digitization assessment’, having looked at the types of material selected for copying, one should be in a position to decide which of the above scanning techniques is most suitable for the collection, and to assess the technical requirements and standards.

For time-based media, things are considerably more complex. Audio digitization cannot be categorised using the above techniques as such, but is instead differentiated by such parameters as the sampling standard (e.g. mono or stereo) and the sampling rate (e.g. 11kHz, 22kHz, 44kHz). Film involves the issues of graphical scanning (reproduction of the picture in each frame), audio (if a soundtrack is included), and also fluidity of motion, governed by the frames-per-second (fps) rate.
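
To make these parameters concrete, the short Python sketch below shows how such sampling choices translate into uncompressed storage requirements for a clip; the durations, rates, and channel counts used are purely illustrative assumptions, not recommendations from this study.

```python
# Illustrative sketch: how sampling choices for time-based media translate
# into uncompressed storage requirements. All parameter values are examples
# only, not recommendations from this report.

def audio_size_bytes(duration_s, sample_rate_hz, bits_per_sample=16, channels=1):
    """Uncompressed (PCM) audio size: duration x rate x sample width x channels."""
    return int(duration_s * sample_rate_hz * (bits_per_sample // 8) * channels)

def video_frames(duration_s, fps):
    """Number of still frames implied by a clip length and frame rate."""
    return int(duration_s * fps)

if __name__ == "__main__":
    # A three-minute clip at various sampling rates and channel counts.
    for rate, channels in [(44_000, 2), (22_000, 1), (11_000, 1)]:
        mb = audio_size_bytes(180, rate, 16, channels) / 1_000_000
        print(f"{rate // 1000} kHz, {channels} channel(s): ~{mb:.1f} MB for 3 minutes")
    # The same clip as video at 25 fps (PAL) implies this many frames to capture.
    print("Frames in a 3-minute 25 fps clip:", video_frames(180, 25))
```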

Benchmarking

The next stage in the workflow is that of ‘Benchmarking’. This can be defined as the process undertaken at the beginning of a digitization project that attempts to set the levels used in the capture process so that the most significant information is captured, e.g. setting the resolution or bit depth correctly, or, in audio, the kHz sampling rate, and so on. More formally it requires a small set of rules to be followed; by adhering to these the assessor will be able to set capture levels appropriate to the material. Kenney and Chapman state that ‘objectives should be clearly stated and the consequences of digital applications assessed prior to implementing an imaging initiative’ and they advocate a managerial approach to benchmarking that is ‘designed to serve as a systematic procedure for forecasting a likely outcome’(5). Before attempting to provide a benchmark the project must be clear about its aims and restrictions (i.e. the now familiar questions of ‘digitizing for preservation or access’ and the competing requirements of conversion and access). As Chapman and Kenney note:
It is risky, however, to proceed with any step before fully considering the relationship between conversion -- where quality, throughput, and cost are primary considerations -- and access, where processibility, speed, and usability are desirable. Informed project management recognizes the interrelationships among each of the various processes, and appreciates that decisions made at the beginning affect all subsequent steps. An excessive concern with user needs, current technological capabilities, image quality, or project costs alone may compromise the ultimate utility of digital collections. At the outset, therefore, those involved in planning a conversion project should ask, "How good do the digital images need to be to meet the full range of purposes they are intended to serve?" (http://www.dlib.org/dlib/october96/cornell/10chapman.html)
The most obvious problem with benchmarking is ascertaining what level of capture is satisfactory, i.e. for present and future needs. Kenney and Chapman (June, 1996, p.7) advocate a ‘full informational capture’ policy, i.e. ‘ensuring that all significant information contained in the source document is fully represented’. Elsewhere they elaborate on this by stating that:

The ‘full informational capture’ approach to digital conversion is designed to ensure high quality and functionality while minimizing costs. The objective is not to scan at the highest resolution and bit depth possible, but to match the conversion process to the informational content of the original -- no more, no less. At some point, for instance, continuing to increase resolution will not result in any appreciable gain in image quality, only a larger file size(6)

Yet what is the smallest level of significant detail? For text it might be the smallest letter or symbol that the reader needs to be able to see. In printed books this is often to be found in the footnotes, but in maps and line drawings the object might be an individual house or cartographic symbol. In manuscripts it could come down to distinguishing between the textures (e.g. hair and flesh) of the vellum. In photographs or pictures it could be a number of things depending upon the user. To paraphrase: ‘the significant detail is in the eye of the beholder’. James Reilly, Director of the Image Permanence Institute, describes a strategy for scanning photographs, applicable more generally, of ‘knowing and loving your documents’. He advocates choosing a representative sample of photographs and, in consultation with those with curatorial responsibility, identifying key features that are critical to the documents' meaning. It is assumed that those with curatorial responsibility will be aware of two important features:

Thus they are in a prime position to identify the main attributes of the document needed for ‘full informational capture’. As stated elsewhere in this report, successful digitization requires a combination of curatorial and technical expertise in order to match ‘subjective attributes of the source to the objective specifications that govern digital conversion (e.g., resolution, bit depth, enhancements, and compression)’. The benchmarking process, therefore, might be linked to existing expertise in conservation, preservation, and reader requirements (i.e. within the library/curator sector). Equally important, such expertise would be on hand to deal with the differing demands of each document. It is noticeable that many non-digital source materials, e.g. manuscripts, microfilms, etc., contain items collected together that are not homogeneous. For example, one microfilm reel may contain straightforward images of a printed text alongside a collection of images from an illuminated manuscript. By having expertise to hand, the questions of benchmarking that arise can be quickly answered.
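
As an illustration of the arithmetic involved (and not of Kenney and Chapman’s formal methodology), the sketch below relates the size of the smallest significant detail to the capture resolution needed to render it legibly; the detail size and pixel count chosen are hypothetical.

```python
# Illustrative arithmetic only (not Kenney and Chapman's formal Quality Index):
# if the finest significant detail is h mm across, and it is judged to need at
# least n pixels to remain legible, the required capture resolution follows.

MM_PER_INCH = 25.4

def required_dpi(detail_mm, pixels_across_detail):
    """dpi needed so that a detail of the given size spans the chosen pixel count."""
    return pixels_across_detail / (detail_mm / MM_PER_INCH)

if __name__ == "__main__":
    # Hypothetical example: 1 mm footnote characters, judged to need ~12 pixels.
    print(round(required_dpi(detail_mm=1.0, pixels_across_detail=12)), "dpi")  # ~305 dpi
```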

Chapman and Kenney (October, 1996) list the selected attributes of source documents which can help in assessing significance as:

Bound and unbound printed materials

size/dimensions of document (w x h, in inches)

size of details (in mm)

text characteristics (holograph, printed)

medium and support (e.g., pencil on paper)

illustrations (content and process used)

tones, including colour

dynamic range, density, and contrast

Photographs

format (35 mm, 4" x 5", etc.)

detail and edge reproduction

noise

dynamic range

tone reproduction

colour reproduction(7)

Furthermore, in their Digital Imaging for Libraries and Archives (June 1996, pp. 7-34) they provide a comprehensive system for checking most of the above categories based on a Quality Index system, and using target examples such as the RIT Alphanumeric Test Object, the IEEE Std 167A.1-1995 Facsimile Test Chart, the AIIM Scanner Test Chart 2, the Kodak Q13 Greyscale Control Bar, and the Kodak Q60 Colour Target, noting particular success with the RIT and AIIM tests for resolution(8).

Similarly, the NARA’s EAP guidelines (http://www.nara.gov/nara/vision/eap/eapspec.html) have extensive guidelines for benchmarking and calibration assessing:

In addition Yale’s Open Book Project lists the full workflow of benchmarking as:
Choose a sample of hard-copy originals, along with print negative counterparts.

Digitize portions of the original volume at 600 dpi (title page, table of contents, selected illustrations, indexes) using a calibrated Xerox WG-40 flat-bed scanner with as many of the enhancement features invoked as possible and practical, and following the operational guidelines developed by Cornell University.

Produce laser prints at 600 dpi on the Xerox DocuTech.

Digitize the identical pages from the microfilm print negative version.

Produce laser prints at 600 dpi on the Xerox DocuTech.

Compare matching prints under an eye-loupe (10X magnification), paying particular attention to letter fill-in or drop-out, highlights and shadows in line drawings, etc.

Choose one combination of filter settings for the microfilm scanner that achieves most closely the appearance of the digitized original.

Note the characteristics of the film source, once "maximum" quality has been obtained.

Scan a volume with similar basic characteristics without benchmarking from the original.

Compare prints of "benchmarked" volume with unbenchmarked one; adjust settings accordingly; note sources of discrepancies for future reference.

(http://www.clir.org/cpa/reports/conway/conway.html)

The above suggestions for benchmarking are entirely valid and form an extremely useful base. However, the most important rule is the rule of eye/experience. Regardless of what the more scientific approaches to benchmarking indicate, one has to produce a pilot scan and judge the results according to what one can see, what one can print, etc. Furthermore, it is widely recognised that no benchmarking can be truly accurate as nearly every collection encountered will have considerable variation within it.

Abstracting from all of this, however, the over-riding message is that benchmarking must be viewed as central to the digitization process. The original source document’s dimensions, condition, and attributes, and, above all, the finest level of detail you need to capture (bearing in mind user requirements), must all be considered. Having established these you will need to perform the benchmarking tests themselves. Current standards of digitization and the nature of the source documents themselves will have a direct effect on how successful you are. For a real-life example of this process in action readers should consult D’Amato and Klopfenstein (1996), especially section 6 on the benchmarking of the illustrations (http://www.nmnh.si.edu/cris/techrpts/imagopts/section6.html#RTFToC31). In this the ‘characteristics’ (the ‘full informational capture’ approach detailed above) of the illustrations were noted, benchmarking was performed at various levels of detail, and the curators were then consulted as to what would be the best standard.

The discussion above, of course, relates to the digitization of manuscripts, texts, graphics, etc., but not to time-based media. For the latter (i.e. film and audio) extra benchmarking standards will need to be brought in, looking at pitch, tone, volume, and smoothness of motion and transmission, and bringing into play extra measures such as fps (frames per second) and kHz sampling rates.

Cost of Digitization

The overall cost of the digitization project is, of course, extremely important and will have a direct influence on whether a project can succeed(9). Understandably there is considerable pressure to come up with figures prior to any digitization project to get at least some forecast of the resulting costs. However, in most instances costs are viewed simply as the ‘unit cost’ of digitizing a single document multiplied by the total number of documents in the collection, or as the throughput per day multiplied by a weighted salary. This is both inaccurate and potentially disastrous. It is essential that, when considering the costs incurred by a project, one takes into account the overall costs (many of which are hidden), looking at such things as administration, consultancy, IT support and maintenance, software/hardware for access and cataloguing, preservation, storage, etc.
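
The sketch below illustrates this point in the simplest terms: a naive ‘unit cost times volume’ figure set against the same figure with hidden overheads added. All of the rates and overhead categories are hypothetical placeholders, not costings from this study.

```python
# Sketch of the costing point made above: a naive "unit cost x volume" figure
# can seriously understate the true cost of a project once hidden overheads
# are added. All figures below are hypothetical placeholders.

def naive_cost(items, unit_cost):
    """The commonly quoted figure: cost per item multiplied by number of items."""
    return items * unit_cost

def fuller_cost(items, unit_cost, overheads):
    """Add the 'hidden' costs (administration, IT support, storage, etc.)."""
    return naive_cost(items, unit_cost) + sum(overheads.values())

if __name__ == "__main__":
    items, unit = 10_000, 0.20            # hypothetical: 10,000 pages at 20p each
    hidden = {                            # hypothetical overheads, in pounds
        "administration": 1_500.00,
        "IT support and maintenance": 2_000.00,
        "cataloguing software": 800.00,
        "storage and preservation": 1_200.00,
    }
    print(f"Naive estimate:  £{naive_cost(items, unit):,.2f}")
    print(f"With overheads:  £{fuller_cost(items, unit, hidden):,.2f}")
```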

However, before looking at how these additional costs may increase the funding required, it is worthwhile getting some idea of the unit costs of digitizing material. There is no simple check-list that will provide accurate and comprehensive figures for digitizing a single item, and one can only look at figures presented by other projects. In all cases the figures presented are simply guidelines and should not be regarded as formulaic, as there are numerous variables related to the condition of the original source document that could lead to marked increases in the unit costs. For example, in the feasibility study for the JIDI project, the Higher Education Digitization Service drew up the following matrix:
[JIDI costing matrix not reproduced here.]

All prices are exclusive of VAT, in pounds sterling, and are for outputting uncompressed TIFFs.

The majority of the projects (5 out of 7) were delivered within the costs identified in the JIDI Feasibility Study report, which is even more impressive given the time lapse (over a year) between the report and the actual commencement of the projects. Of the remaining two contracts, HEDS priced one well over the amount in the JIDI report because of the nature of the originals and the low volume (costs had been estimated on a figure of 5,000 items, though in reality most projects only brought in between 500 and 1,500 items, which pushes up prices), whilst the other contract was costed at £2.20 per item when the JIDI report allowed for an upper limit of £2.00.

A different example, but equally instructive, can be found in the case study of costs performed by the BUILDER project for two of their collections (the University of Birmingham Exam papers, and the Midland Collection). The digitization of each was analysed using various methods and scanners with the costs equating to:
 
The two collections were:

- University of Birmingham Exam Papers: 1,587 exam papers (4,539 pages), all A4, loose-leaf, typed/printed text, captured as bi-tonals at 300dpi and converted to PDF.
- Midland History (1971-1997): 3,987 images, 6” x 9”, strippable journal, captured as 600dpi bi-tonal TIFFs, plus 50 photographs captured as 600dpi 8-bit greyscale TIFFs.

Out-sourcing to HEDS, delivered back on CD
- Exam Papers: unit cost 11p per page; overall cost (including set-up and production) £775.00 (i.e. 17p per page).
- Midland History: not calculated.

Minolta PS3000P scanner (scanner already available)
- Exam Papers: hourly production rate 90 pages; overall cost (including weighted annual salary but not hardware) £724.06 (i.e. 16p per page).
- Midland History: hourly production rate 80 pages; overall cost (including weighted annual salary but not hardware) £715.50 (i.e. 18p per page); overall cost plus Adobe Capture software (for the photographs) £1,203.13 (30p per page).

Flat-bed scanner with sheet-feeder (Fujitsu ScanPartner 600C for the Exam Papers, Fujitsu M3093DE/DG for Midland History)
- Exam Papers: hourly production rate 180 pages; overall cost (including weighted annual salary) £362.03 (i.e. 8p per page); overall cost if hiring the scanner £2,007.17 (44p per page); overall cost if buying the scanner £2,124.53 (47p per page).
- Midland History: hourly production rate 180 pages; overall cost (including weighted annual salary) £362.03 (i.e. 8p per page); overall cost if hiring the scanner, plus Adobe Capture software, £4,449.66 (£1.12 per page); overall cost if buying the scanner, plus Adobe Capture software, £5,092.75 (£1.28 per page).

Optical Character Recognition
- Exam Papers: not calculated.
- Midland History: OCR processing time 133 hours (@ 2 mins per page); proof-reading time 665 hours (@ 10 mins per page); total cost £11,456.64 (or £2.94 per page).

BUILDER’s study comes down heavily on the side of out-sourcing material for digitization. Its experience shows that the costs of out-sourcing material via HEDS are considerably less than attempting to perform the task in-house, even taking into account postage, preparation, and quality assurance. In addition, with the external vendor approach no staff time was involved in the scanning, and negotiations were smooth. On the other hand, although costs increase considerably with in-house scanning (note that the Minolta scanner was made available through a previous project), this must be balanced against the benefits to the host institution of the experience gained, proximity of digitization to the material, internal management of source documents and files, and easier quality control. One additional point worth noting relates to buying equipment outright as opposed to hiring it in: BUILDER discovered that the increase in costs brought about by purchasing a flat-bed scanner as opposed to hiring one was negligible, making hiring questionable for projects that run over two years or more (the conditions of hire demanded a two-year lease).

Costs experienced by Oxford projects are broadly in line with the figures above, but illustrate the importance of taking into account the hidden costs of a digitization project. The Bodleian Broadside Ballads project, combining in-house microfilming with out-sourced scanning of the surrogate, noted a cost of around 61-65p per image. The Celtic and Medieval Manuscripts project, digitizing at high level using either a Kontron or Dicomed camera directly from source, noted a cost per image ranging from £2.50 to £4.50 based on weighted annual salaries and throughput. Yet, as with the other projects, the additional costs need to be stressed. It was noted, for example, with the Celtic and Medieval MSS that updates to hardware should be allowed for at a rate of a new PC every 3 years (c.£1,000) and a new camera every 5 years (c.£20,000). ILEJ, in its final report, noted that each page cost c. 18p to scan, but on top of this each image cost 29p to index and 25p to process, bringing the true cost up to 75p per image. Furthermore the RSP noted that out-sourcing to Xerox (as recommended by HEDS) resulted in a complete costing breakdown of:

Unit cost: 12p per page

OCR: 16p per page (but does not include cost of proof-reading)

Medium costs: £40.00 per CD

The Wilfred Owen project (see Lee and Groves, 1999) produced a reasonably detailed costing sheet looking at digitization costs of graphics (ranging from 50p to £10.00 per image), keying in (£1.50 per 1,000 characters) and audio/video capture (£5.00 per 3 minutes). In total the full project cost £62,000, delivering around 2,000 digital objects (pages of manuscripts and from a journal, photographs, video/audio clips, still shots), averaging £31 per digital object. Yet, outside of staffing, consultancy, hardware/software, and copyright, only around £3,000 was actually spent on digitizing (i.e. £1.50 per digital object).

Kenney and Rieger (1998) include a much more developed costing sheet than the one produced in the Wilfred Owen report. As well as indicating hourly production rates for in-house scanning, they include a full range of costs for scanning images (of various sizes and bindings) at differing resolutions (ranging from $0.25 to $12.00 per image). Once again, they stress the hidden costs of a digital project, which are not always apparent when looking at the cost-per-page level.

In terms of in-house throughput this varied considerably, depending upon the methods used. At its peak, for example, the Mekel MX500 XL-G used in the ILEJ project could achieve 1,200 images per hour (i.e. two microfilms), but in reality this dropped to as low as 300 a day. The Celtic and Medieval Manuscripts project's high-level cameras peaked at a maximum of 200 scans a day (small pages, easy handling), but more realistically settled at between 40 and 70 images a day for the larger files. These figures were for consecutive pages of a given volume, or for comparatively easy single-sheet material; throughput rates would be much slower for odd pages from many volumes, just as the initial set-up times can be substantial. Using similar equipment, the JIDI John Johnson collection is only managing to achieve around 40 scans a day (c.800 a month). An example of a project external to Oxford is the NDLP Digital Conversion Team, which noted:

A current plan to scan 60,000 pages from early congressional journals in bound volumes calls for three people to prepare materials to keep five scanners (with two people per scanner) busy for twelve weeks. Another three full-time people are expected to take twenty weeks to review scanned page-images and derived text versions marked up with SGML after delivery by the contractor. In some cases, preparation and quality review are performed by members of the NDLP Digital Conversion Team. In others, the NDLP has supported the hiring of staff to be based within the divisions responsible for different types of material (such as Music, Prints & Photographs, or Geography & Maps). (Arms, April 1996).

Related to this is the concern expressed by many librarians and curators for the need to turn around high-demand material quickly. Items which have been sent for digitization will in many cases have to be returned to the library or department as quickly as possible to satisfy reader requests.

Bearing in mind the problems associated with drawing up accurate costs for a project, the policy adopted by this scoping study has been, in full consultation with HEDS, as follows:

Digitizing Standards

As with costs, any attempt to establish a set of standards for digitization (i.e. resolutions, bit-depth, kHz, fps, etc.) is fraught with difficulties. Not only is there no recognised standard to which everyone adheres, even for the most basic scanning (the variance in source documents means that nothing could be derived to suit all purposes), but the technology itself is short-lived. Any recommendations made now would most probably be redundant in a year’s time, given the changing capabilities of capture devices and software, and falling prices. Instead, as with the rest of this section, all one can do at the moment is outline the general issues and indicate present practices.

Specifications

The specifications suggested for any digitization project will be variable for all the reasons outlined throughout this document. At the outset, however, projects should derive a set of standards which will allow full informational capture of the originals, following accepted practice in benchmarking. Having done this, there may need to be accommodations made to take account of available resources. The standards should not just specify requirements for the archival master, but should also state the parameters for the derived products:

High resolution (a.k.a. archiving resolution): used for the highest quality digitization (e.g. Oxford’s Celtic and Medieval Manuscripts project), for the archival file format with ‘full informational capture’, and for outputting to high-quality print and film surrogates (in the case of graphics/text documents). The distinction between preservation copies and preservation-quality images should be noted (see http://memory.loc.gov/ammem/pictel/index.html).

Medium resolution: used for screen display and low quality printing.

Low resolution: often used for thumbnails.

Depending on the project, all of the above could in theory be treated as access level images, though in most cases the term refers to medium resolution images for screen display and/or low resolution thumbnails that allow quicker browsing of the collection. The NARA EAP Project has detailed guidelines for its digitization specifications, including a matrix which summarizes the specifications derived for the project (http://www.nara.gov/nara/vision/eap/eapspec.html).
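
As a minimal sketch of how such derived products might be generated from an archival master, the following fragment (using the Pillow imaging library) produces a medium-resolution access JPEG and a low-resolution thumbnail; the file names, pixel sizes, and quality settings are illustrative assumptions only, not standards recommended here.

```python
# Minimal sketch (using the Pillow library) of deriving access-level images
# from an archival master: a medium-resolution JPEG for screen display and a
# small thumbnail for browsing. File names and pixel sizes are hypothetical.

from PIL import Image

def derive_access_images(master_path):
    with Image.open(master_path) as master:      # e.g. an uncompressed TIFF master
        screen = master.convert("RGB")           # ensure a JPEG-compatible mode
        screen.thumbnail((1024, 1024))           # medium resolution, screen display
        screen.save("access_screen.jpg", "JPEG", quality=85)

        thumb = master.convert("RGB")
        thumb.thumbnail((250, 250))              # low resolution thumbnail
        thumb.save("access_thumb.jpg", "JPEG", quality=70)

if __name__ == "__main__":
    derive_access_images("master_0001.tif")      # hypothetical master file name
```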

With graphics one of the most important factors is resolution. Different resolutions are required for different purposes, with high resolution often equating to an increase in unit costs. Furthermore, the digitization industry at present is awash with projects digitizing at different resolution levels, and there are no archival or access standards to which everyone adheres. Resolution usually refers to the number of horizontal and vertical pixels, e.g. a 640 x 480 image means 640 pixels along the horizontal axis and 480 pixels along the vertical. Dots per inch, or DPI, refers both to the number of dots/pixels captured per inch from the source document and to the number of pixels per inch on computer monitors and available through printers; as with most studies of this nature, however, when referring to DPI this report is simply looking at the dots per inch used in the scanning or capturing process. It is also worth considering the problem of ‘effective dpi’. Take, for example, a picture of a map. The digitizer scans the picture in at 600 dpi (e.g. a 5” x 4” print, with the image of the map filling the whole photograph exactly). However, if the original map was in fact 25 inches across (i.e. five times the width of the picture) the ‘effective dpi’ would be 600/5, i.e. 120dpi.
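
The ‘effective dpi’ calculation in the map example can be expressed as a small helper; the figures used below are simply those from the example above.

```python
# The 'effective dpi' calculation described above: scanning a 5-inch-wide
# photograph of a 25-inch-wide map at 600 dpi yields only 120 dpi relative
# to the original object.

def effective_dpi(scan_dpi, surrogate_width_in, original_width_in):
    """Resolution of the scan expressed relative to the original, not the surrogate."""
    return scan_dpi * (surrogate_width_in / original_width_in)

if __name__ == "__main__":
    print(effective_dpi(scan_dpi=600, surrogate_width_in=5, original_width_in=25))  # 120.0
```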

Franziska Frey notes that ‘a growing consensus within the preservation community is that a number of image files must be created for every photograph to meet a range of uses’ (1997). The article goes on to outline a set of standards for three example access files:
 
The digital image is used only as a visual reference image in an electronic data base
  • low quality 
  • thumbnails less than 250 pixels 
  • on-screen viewing set to 480 x 640 
  • colour reproduction not critical 
  • compression allowed 
The digital image is used for reproduction
  • desired reproduction needs to be clearly defined; for example, an 8x10” hard copy at 300dpi needs only a 2,400 x 3,000 pixel file (see the sketch after this list) 
  • colour mapping essential 
The digital image represents a "replacement" of the original in terms of spatial and tonal information content
  • pixel levels vary from original to original 
  • 8-bit per colour not adequate for future needs 
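
As a worked illustration of the reproduction example in the list above, the sketch below derives the 2,400 x 3,000 pixel figure from the print size and dpi, and shows how the chosen bit depth then determines the uncompressed file size; the bit depths shown are examples only.

```python
# Arithmetic behind the reproduction example above: an 8 x 10 inch print at
# 300 dpi implies a 2,400 x 3,000 pixel file, and the chosen bit depth then
# determines the uncompressed file size. Purely illustrative figures.

def pixel_dimensions(width_in, height_in, dpi):
    """Pixel dimensions needed to print at the given size and resolution."""
    return int(width_in * dpi), int(height_in * dpi)

def uncompressed_megabytes(width_px, height_px, bits_per_pixel):
    """Uncompressed size of an image at the given bit depth."""
    return width_px * height_px * bits_per_pixel / 8 / 1_000_000

if __name__ == "__main__":
    w, h = pixel_dimensions(8, 10, 300)
    print(f"{w} x {h} pixels")                                             # 2400 x 3000
    print(f"{uncompressed_megabytes(w, h, 8):.1f} MB at 8-bit greyscale")  # 7.2 MB
    print(f"{uncompressed_megabytes(w, h, 24):.1f} MB at 24-bit colour")   # 21.6 MB
```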

These should be looked on as simple ‘generic’ guidelines and cannot be viewed as an accurate forecast of the digitizing standards for all projects of a similar nature. As noted above, there are too many variables which may come into play to differentiate between seemingly similar types of source documents. For example, the Library of Congress’s Manuscript Digitization Demonstration Project (http://memory.loc.gov/ammem/pictel/index.html) outlined the types of issues they considered when looking at their source documents:

The specifics here are interesting but only directly pertinent to the LoC. Instead one should focus on the issues arising, such as preservation/access, the type of material being looked at, user requirements, and the balancing of costs with time.

In short any attempt to define a set of digitizing standards is fraught with difficulties. Instead, this study has collected together notes on the varying standards used by numerous Oxford, national, and international digitization projects. These have been presented in a tabular form for people to consult with ease and reflect the decisions made to date (i.e. by March 1999 [N.B. This table has not been made available in this report as it is meant for internal study only]).

These should not be used as definitive requirements for any other projects, but a few overall points can be made:

More specifically with relation to the type of document one is digitizing:

* If a high quality film intermediary already exists, it is cheaper and safer to scan from film rather than from the original item. The quality of the intermediary will have a direct impact on the quality of the digital image. If, as with older film (i.e. captured prior to the British standards established in the 1970s), the intermediary is of poor quality, the scanned image will be inferior. It is recognised that it is best to use camera negatives whenever possible, as subsequent duplication leads to loss of detail and resolution:

‘In general, it is better to work from a negative than from a positive not only because of generational loss but because the negative provides a smoother curve in the dynamic range, so that highlights and shadows are handled better’ (Ester, 1996).

With relation to the types of scanning methods one can use:

This study recommends that any digitization unit should:

Digitization Equipment

The variety in scanning equipment available reflects not only the multitudinous nature of the types of source documents dealt with by digitization services, but also the increasing commercial competition in this area. For a good review of the different types of scanners available see Besser and Trant’s overview (1995 - http://www.gii.getty.edu/intro_imaging/11-Scan.html). In general, the types of digitization equipment used are:

The University of Oxford is perhaps representative of many institutions in that it has numerous pieces of digitization equipment, but these are dispersed around various libraries, departments, and institutions. Most notably this applies to the number of flat-bed scanners, which remain uncounted, often being the personal property of an individual academic.

However, within the libraries sector, Oxford is fortunate in the equipment it has for high-level scanning. At present it has:

In addition, the Educational Technology Resource Centre in Wellington Square is extremely well-equipped to digitize video. It lists amongst its assets:

Combining the two, this gives Oxford the capability of capturing high-level digital images from microfilm, paper/manuscripts/photographs (maximum A3), and video, and of direct filming or photographing of artefacts. The most obvious omission is the ability to scan directly from an open book. Here the options are two-fold: a low cost but low performance scanner such as the Minolta (c. £15-20,000), or a high end scanner such as that manufactured by Zeutschel (£50-80,000).

The disparate locations of this equipment, however, pinpoint the need to address where digitizing equipment should be housed. Even with the most stringent of security systems and safety precautions, some digitization would have to be done at the point of collection: when dealing with rare or unique items (some of which are priceless), one does not wish to move material around too much, particularly through public areas. As Arms (1996) notes, it is usual to ‘capture from original materials … on site under curatorial supervision’, something which is particularly relevant for the needs of Oxford given the intellectual and financial value of many of the holdings which might be digitized. The solution to this problem is two-fold:

Quality Assurance

It is imperative that all digitization undergoes a series of quality control analyses at various stages, regardless of whether the material has been produced in-house or via an outside vendor. This is an accepted method of verifying that all reproduction is up to standard (it has been performed on microfilming for years(12)). Bearing in mind limits on time and finances, some form of sampling may be necessary to reduce the costs of this process, as with NARA, which states that at a minimum 10 images or 10% of images (whichever number is higher) need to undergo quality control (selected randomly from the entire collection). Ideally Quality Assurance (or QA) must be performed on all master images and their derivatives, with each step being fully documented. The types of things one should look for are:

The overall return should be checked for file name integrity, completeness of the job, and overall meeting of the project scope. NARA recommends that if more than 1% of the images looked at fail the above quality control checks then the job needs to be redone.
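
A minimal sketch of the sampling rule described above is given below; the function names, batch size, and file-naming pattern are illustrative assumptions rather than NARA specifications.

```python
# Small sketch of the sampling rule described above: inspect at least 10
# images, or 10% of the batch, whichever is greater (chosen at random); if
# more than 1% of the images inspected fail, the job is redone. Names and
# batch sizes are illustrative, not NARA's own.

import math
import random

def qa_sample(image_ids):
    """Randomly select max(10, 10% of the batch) images for quality control."""
    sample_size = min(len(image_ids), max(10, math.ceil(0.10 * len(image_ids))))
    return random.sample(image_ids, sample_size)

def job_needs_redoing(inspected, failed):
    """True if more than 1% of the inspected images failed the QA checks."""
    return inspected > 0 and failed / inspected > 0.01

if __name__ == "__main__":
    batch = [f"img_{n:05d}.tif" for n in range(2500)]         # hypothetical batch
    sample = qa_sample(batch)
    print(len(sample), "images selected for inspection")       # 250
    print(job_needs_redoing(inspected=len(sample), failed=3))  # True: 3/250 = 1.2%
```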

The Library of Congress concurs with many of these findings. For the American Memory Project, running under the NDLP (National Digital Library Program), it established a fifteen-point workflow practice for the Quality Review stage covering such areas as:

(http://lcweb2.loc.gov/ammem/award/docs/stepsdig.html or see the full ‘Step-by-Step Instructions’ at http://lcweb2.loc.gov/ammem/award/docs/docimqr.html).

Aids and more specific guidelines to help in the testing of such attributes as contrast and noise are available in the form of accepted test cards and procedures. A good place to start looking for these is the Photographic and Imaging Manufacturers Association/IT10 Still Picture Imaging page (http://www.pima.net/it10a.htm), which has a growing list of standards. (See also Reilly and Frey, 1996.)

Further Reading

[Please note that for the Oxford projects highlighted in bold, further details have been collected via interviews with the project participants.] In addition to this list there is an extremely comprehensive set of links and annotations developed by the Colorado Digitization Project (http://coloradodigital.coalliance.org/toolbox.html).

Arms, C. ‘Historical Collections for the National Digital Library: Lessons and Challenges at the Library of Congress’ Part I D-Lib Magazine (April, 1996, http://www.dlib.org/dlib/april96/loc/04c-arms.html); Part II in D-Lib Magazine (May, 1996, http://www.dlib.org/dlib/may96/loc/05c-arms.html). A useful preliminary review of the National Digital Library Program’s (NDLP) digitization of Americana at the Library of Congress.

Arnamaganean Institutes in Copenhagen and Reykjavik (http://www.hum.ku.dk/ami/aminst.html; http://www.hum.ku.dk/ami/amproject.html). Aims to produce a catalogue with links to digital images of the complete collection. Access to these images will be: low-quality (75 dpi) watermarked images available freely over the Web; higher quality (300 dpi) images available to subscribers; and high-quality images (600 dpi) available for sale.

Arts and Humanities Data Service (http://www.ahds.ac.uk/). The main web site of the AHDS with links to all the service providers. In particular the Managing Digital Collections section (http://ahds.ac.uk/manage/manintro.html) with its series of reports.

Ashmolean Museum (http://www.ashmol.ox.ac.uk/). Various projects have been going on at the Ashmolean Museum. These include a collaboration with the Bridgeman Art Library to build up an image library from transparencies, and a forthcoming project looking at the Allen Photography Archive (c. 1,500 black and white photographs).

Australian Co-Operative Digitization Project (http://www.nla.gov.au/ferg/). Collaborative scanning project of Australian newspapers from 1840-45. Produced (via out-sourcing) 400dpi bi-tonal TIFFs (CCITT Group 4 compression). Noted that only 15-20% of digital images could be produced from existing microfilm stock, so had to cost in new microfilming. Reviewed in Howell (1997).

Beazley Archive (http://www.beazley.ox.ac.uk). Four projects currently underway: Database of Athenian Pottery; Beazley’s Drawings; Cast Collection; and Ancient Gems and Finger-Rings. Working with a system to automatically watermark images.

Besser, H., and Trant, J. ‘Introduction to Imaging’ (1995 - http://www.gii.getty.edu/intro_imaging/0-Cover.html). A good overview including a very approachable description (with images) of the types of equipment used in scanning (http://www.gii.getty.edu/intro_imaging/11-Scan.html). It also includes an extremely useful Glossary of technical terms (http://www.gii.getty.edu/intro_imaging/Gloss.html).

Besser, H., and Yamashita, R. ‘The Cost of Digital Image Distribution’ (http://sunsite.berkeley.edu/Imaging/Databases/1998mellon). An extensive Mellon-funded report of the Museum Education Site License Project (MESL). Provides comprehensive figures on costings and processes.

Bodleian Broadside Ballads Project (http://www.bodley.ox.ac.uk/mh/ballads/). C. 30,000 images of Broadside Ballads held at the Bodleian Library, Oxford. Images are photographed to microfilm, which is then sent to an outside vendor for scanning to bi-tonal TIFFs at 400 dpi. On their return the images are batch-processed to GIFs and made available via the Allegro database system.

BUILDER - Birmingham University’s Integrated Library Development and Electronic Resource (http://builder.bham.ac.uk). A major Hybrid Library project funded under the ELIB programme, aimed at developing ‘a working model of the hybrid library within both a teaching and research context, seamlessly [integrating] information sources, local and remote, using a Web-based interface, and in a way which will be universally applicable.’

Burney Collection at the British Library (http://minos.bl.uk/diglib/access/microfilm-digitisation.html). 1,500 reels of early English newspapers from the Civil War onwards. Used a Mekel 400 XL scanner, in-house, to produce 400 dpi bi-tonal TIFFs (using CCITT Group 4 compression). Reviewed and outlined in Howell (1997).

California Heritage Collection ‘Digitizing the Collection: Image Capture’ (http://sunsite.berkeley.edu/CalHeritage/image.html). Discusses the 1996 project which involved adding images (thumbnails) to finding aids for collections held at the Bancroft Library. Used 35mm slides captured to PhotoCD, converted to JPEGs and then to GIFs for thumbnails, but noted the problems in this multi-staged workflow.

Caribbean Newspapers Imaging Project (http://www.karamelik.uflib.ufl.edu/projects/mellon/). Based at the University of Florida, and funded by the Mellon Foundation. Used collections at the George A. Smathers Libraries, digitizing 265,000 pages of Caribbean newspapers from microfilm. Produced 400dpi bi-tonal scans (TIFF, CCITT Group 4 compression) but experimented with 400dpi greyscales. Reviewed in Howell (1997).

Celtic and Medieval Manuscripts (http://image.ox.ac.uk/). High-resolution digitization of a series of medieval manuscripts held at the Bodleian Library and college libraries at the University of Oxford, using one Kontron and two Dicomed cameras. All three cameras employ special cradles, though a traditional copy stand can also be used. The project found that the maximum size of document it could deal with was A3, and it had to use special high-frequency fluorescent cold lighting (based on the model of the National Library of Scotland), as traditional lamps produced too much heat (one hour of exposure could induce a 0.1% shrinkage of vellum). Images are scanned as 24-bit uncompressed TIFFs aiming at 600dpi (but achieving at best 570dpi). They are browsable using HTML only, giving access to JPEGs and GIFs (although the project has experimented with FlashPix).

Centre for the Study of Ancient Documents (University of Oxford; http://www.csad.ox.ac.uk/CSAD/Images.html). This unit, part of the faculty of Literae Humaniores, is conducting a series of imaging projects on rare and unique material, including ‘squeezes’ (filter paper impressions of inscriptions), stylus tablets (using 180 degree imaging in collaboration with the University’s Department of Engineering), ink tablets, and papyrology.

Chapman, S., Conway, P., and Kenney, A. R. ‘Digital Imaging and Preservation Microfilm: The Future of the Hybrid Approach for the Preservation of Brittle Books’ RLG DigiNews 3.1 (February 15, 1999; http://www.thames.rlg.org/preserv/diginews/diginews3-1.html). A full report (and a decision matrix re film-first/scan or COM approaches) will appear on the CLIR Web Site.

Colorado Digitization Project (http://coloradodigital.coalliance.org/toolbox.html). An extremely useful site of links to the major topics surrounding digitization.

Columbia University’s ‘Technical Recommendations for Digital Imaging Projects’ (http://www.columbia.edu/acis/dl/imagespec.html). A concise set of guidelines for digitization projects prepared by the Image Quality Working Group of ArchivesCom, a joint Libraries/AcIS committee.

Council on Library and Information Resources (http://www.clir.org/) - CLIR. Runs four programmes (Commission on Preservation and Access, Digital Libraries, The Economics of Information, and Leadership) and commissions numerous publications. Their Commission on Preservation and Access state that: ‘some information is created digitally and exists only that way, but historical materials are also being digitized as a means of providing access to special collections that have been locked in libraries and archives to prevent their deterioration. All digital files pose serious preservation problems, and finding ways to assure the safekeeping and accessibility of knowledge in this new format is among CLIR's highest priorities’ (http://www.clir.org/programs/cpa/cpa.html).

D’Amato, D., and Klopfenstein, R. C., ‘Requirements and Options for the Digitization of the Illustration Collections of the National Museum of Natural History’ (March 1996, http://www.nmnh.si.edu/cris/techrpts/imagopts/index.html). A comprehensive study of the digitization of fish illustrations for the Museum. Takes the project through its various stages of selection, benchmarking and digitization.

DEBORA Project (http://www2.echo.lu/libraries/en/projects/debora.html). Although just starting, this EC project aims to ‘develop tools for accessing collections of rare 16th century documents via networks. This includes the setting up of a production chain for digitizing old books. Digitisation will yield sets of images to be stored and indexed in an Image Base Management System (IBMS), accessible via the World-wide Web. The tools will also incorporate image recognition and features supporting co-operative work.’

Digital Heritage and Cultural Content (http://www.echo.lu/digicult/en/backgrd.html). EC-funded site looking at libraries and technology. Includes the full copy of ‘Digitisation of Library Materials’, report of the concertation meeting and workshop held in Luxembourg on 14 December 1998.

Donovan, K. ‘The Promise of FlashPix Image File Format’ RLG DigiNews 2.2 (April 15, 1998 -http://www.rlg.org/preserv/diginews/diginews22.html#FlashPix). A useful overview and analysis of the FlashPix image file format, which may provide a useful solution for access level images. This format has been successfully used by the Celtic and Medieval Manuscripts project at the University of Oxford (http://image.ox.ac.uk/) for a stand-alone exhibition in Ireland.

Elkington, N. ‘Joint RLG and NPO Conference on Guidelines for Digital Imaging’ RLG DigiNews 2.5 (October, 1998; http://www.thames.rlg.org/preserv/diginews/diginews2-5.html#feature1). A good overview, with links, of the workshop held in Warwick in 1998.

Frey, F. ‘Digital Imaging for Photographic Collections: Foundations for Technical Standards’ RLG DigiNews 1.3 (December 15, 1997 - http://www.rlg.org/preserv/diginews/diginews3.html#com). A comprehensive discussion of the standards used for digitizing photographs and many of the issues involved.

Gertz, J. ‘Oversize Color Images Project’ Phase I (http://www.columbia.edu/dlc/nysmb/reports/phase1.html), and Phase II (http://www.columbia.edu/dlc/nysmb/reports/phase2.html)

Global Inventory Project (http://www.gip.int). An EC and G8 funded project that allows one to search an inventory of digital initiatives. Described as a ‘one stop facility’ linking distributed national and international inventories of projects, studies and other activities relevant to the promotion and the further development of knowledge and understanding of the Information Society.

Hawaiian Newspaper Project (http://hypatia.slis.hawaii.edu/~hnp/welcome.html). This project seeks to make available selected, heavily used Hawaiian language newspapers (1834-1948) to students throughout the state of Hawaii who have access to the World Wide Web (WWW). Uses a Minolta Microdax 3000 digital microfilm workstation, scanning approx. 3,800 images to TIFF and GIF formats.

Howell, A. ‘Film Scanning of Newspaper Collections: International Initiatives’ RLG DigiNews 1.2 (August, 1997, http://www.thames.rlg.org/preserv/diginews/diginews2.html#film-scanning). A useful review of three initiatives: the Burney Collection at the BL, the Caribbean Newspaper Imaging Project at the University of Florida, and the Australian Co-Operative Digitization Project. Outlines the problems and opportunities of scanning newspapers (all from microfilm). Mentions that the ideal resolution is 600dpi bi-tonal TIFF, but that only the Yale Open Book project has achieved this, and then only after modifying its Mekel scanner. All three projects resorted to 400dpi bi-tonal scanning, though the University of Florida experimented with conversion from 400dpi greyscale scanning.

Internet Library of Early Journals (ILEJ, http://www.bodley.ox.ac.uk/ilej/). A collaborative project between Oxford, Leeds, Birmingham and Manchester. Six journals were chosen: The Builder, Notes & Queries, and Blackwood's (19th century), and PTRS, the Gentleman's Magazine, and the Annual Register (18th century), covering a 10- or 20-year run from each, for a total of 108,000 images. Oxford, Manchester, and Birmingham provided the main scanning locations (using scanning assistants), with Oxford also doing microfilm scanning. For paper-based documents two Minolta PS3000P scanners were used (one in Manchester, one in Birmingham). Pages were scanned as 400dpi bi-tonal TIFFs and converted to 100dpi GIFs (usually measured in pixels, c. 1,000 across), with conversion performed using ImageAlchemy. Microfilm scanning used a Mekel MX500XL-G based in the Bodleian Photographic Studio. These pages were scanned as greyscales and bi-tonals, with double images (i.e. two pages) then split into single-page images (21,000 in total, from The Gentleman's Magazine and The Builder). The Builder was scanned as 200dpi TIFFs (c. 10MB an image) converted to JPEGs (better compression than GIF); the Gentleman's Magazine was 70% scanned as 300dpi bi-tonal TIFFs converted to GIFs, and 30% as 100dpi greyscale TIFFs converted to JPEGs.

Kenney, A. R. ‘The Cornell Digital Microfilm Conversion Report: Final Project to NEH’ RLG DigiNews 1.2 (August, 1997, http://www.thames.rlg.org/preserv/diginews/diginews2.html#com). A summary report of the Computer Output Microfilm project involving 177 reels of 19th and 20th century agricultural history documents.

Kenney, A. R. and Chapman, S. ‘Digital Conversion of Research Library Materials: A Case for Full Informational Capture’ D-Lib Magazine (October, 1996; http://www.dlib.org/dlib/october96/cornell/10chapman.html). This also provides a useful example of benchmarking with an analysis of a real life example using a 1914 ‘brittle book’ entitled Farm Management.

Kenney, A. R. and Chapman, S. Digital Imaging for Libraries and Archives (New York, 1996 ISBN 1 85604 207 3). This book accompanied a series of workshops conducted in the US. It is an invaluable book full of extremely useful formulae, reading lists, etc. Reviewed in Ariadne (http://www.ariadne.ac.uk/checkout/digital-imaging/intro.html) by Brian Kelly, 14th March, 1997.

Kenney, A. R., and Rieger, O. Y. Managing Digital Imaging Projects: An RLG Workshop (RLG: May, 1998). Another book to accompany a workshop on digital imaging projects, which is once again extremely useful. It tackles the area of costing and managing projects, but also has an overview of the basic technologies.

Lee, S. D., and Groves, P. ‘On-Line Tutorials and Digital Archives or ‘Digitising Wilfred’’ (Jan 1999, http://www.jtap.ac.uk). Full report on the Wilfred Owen Multimedia Digital Archive including digitization costs.

Library of Congress American Memory Project and National Digital Library Program (http://lcweb2.loc.gov/). It is strongly recommended that interested parties look at their ‘Quality Review of Document Images’ internal training guide, which provides a comprehensive discussion of the problems and recommended solutions adopted in the project (http://lcweb2.loc.gov/ammem/award/docs/docimqr.html). In addition, in association with Ameritech, the LOC have run a National Digital Library Competition (http://memory.loc.gov/ammem/award/lessons.html), and the resulting ‘lessons learned’ briefings cover a range of projects.

See also their Manuscript Digitization Demonstration Project (http://memory.loc.gov/ammem/pictel/index.html). This contains an extensive overview of differing formats and resolutions for the capture of b&w, greyscale, and colour images from manuscripts (with samples).

National Archives and Records Administration’s Electronic Access Project (http://www.nara.gov/nara/), especially their Guidelines for Digitizing Archival Materials for Electronic Access (http://www.nara.gov/nara/vision/eap/eapspec.html).

Noerr, P. ‘The Digital Library Toolkit’ (April 1998, http://www.sun.com/edu/libraries/digitaltoolkit.html). A good overview of the questions and processes involved in setting up a digital library.

Photographic and Imaging Manufacturer’s Association (http://www.pima.net/it10a.htm). This has numerous ‘standards’ and downloadable test cards for quality assurance tests of digital images.

Refugee Studies Programme Digital Library Project (Oxford). An extensive collection (c. 25,000 items) of grey literature focusing on Refugee Studies, drawn from the collections at Oxford (currently in the pilot stage). All material had to be disbound as appropriate and was then scanned off-site by Xerox as 300dpi bi-tonal TIFFs with Group 4 compression (some colour and some greyscale also). A copy is sent to RAMOT Digital for batch processing into IOTA. Xerox also provides uncorrected OCR for use in an access system employing a TEI catalogue and OpenText 5.0. The original Feasibility Study performed by the Higher Education Digitization Service (http://heds.herts.ac.uk) is now publicly available at: http://heds.herts.ac.uk/Guidance/RSP_fs.html.

Reilly, J. and Frey, F. ‘Recommendations for the Evaluation of Digital Images Produced from Photographic, Micrographic, and Various Paper Formats’ (http://lcweb2.loc.gov/ammem/ipirpt.html). A detailed evaluation of the performance of scanners, commissioned by the NDLP.

‘Scanners and Digital Cameras’ RLG DigiNews 1.1 (April 15, 1997 - http://www.thames.rlg.org/preserv/diginews/diginews1.html#hardware&software). Although a bit dated, this still provides many valid links to sites evaluating digital cameras.

Sharpe, L. ‘Preservation-Quality Scanning of Bound Volumes: Integration of the Picture Elements ISE Board with the Minolta PS-3000 Book Scanner’, RLG DigiNews 1.1 (April 15, 1997 - http://www.thames.rlg.org/preserv/diginews/diginews1.html#feature). Outlines some of the problems with bound volume scanning, and notably use of the Minolta PS3000 scanner.

Smith, A. 'Why Digitize?' and 'The Future of the Past: Preservation in American Research Libraries' (1999 - http://www.clir.org/pubs/reports/reports.html). New reports at the CLIR site. Both studies come down heavily on the side of digitization for access, as opposed to preservation.

Süsstrunk, S. ‘Imaging Production Systems at CORBIS Corporation’ RLG DigiNews 2.4 (August 15, 1998; http://www.rlg.org/preserv/diginews/diginews2-4.html#technical). Describes the large digital archives created by the CORBIS Corporation, which uses high-quality drum scanners (costing around $30,000-$60,000 each).

Technical Advisory Service for Images (TASI - http://www.tasi.ac.uk/). See especially their guidelines and summaries for creating a digital archive (http://www.tasi.ac.uk/building/building2.html).

‘Technical Review: Outsourcing Film Scanning and Computer Output Microfilm (COM) Recording’ RLG DigiNews 1.2 (August, 1997; http://www.thames.rlg.org/preserv/diginews/diginews2.html#hardware&software). A comprehensive review of vendors, including hardware, software, contact details, output formats, etc.

Toyota City Imaging Project (http://www.bodley.ox.ac.uk/toyota/openpage.html). Drawn from the material held in the John Johnson collection at Oxford. Items were photographed onto 35mm slides by the Bodleian Photographic Studio and then outsourced for conversion to PhotoCD (the project began in 1993, when PhotoCDs were widely available but high-resolution digitization equipment was not). Access images were derived from the PhotoCDs using ImageAlchemy: Base/16 images were converted to GIFs, Base x 4 images to JPEGs, and thumbnails were also created.

UMI’s Early English Books project. Digitization of Early English Books I and II (following Pollard and Redgrave) from microfilm collections: approximately 22 million pages (or 11 million images). Scanning to 400 dpi TIFFs using 13 SunRise SR50s, occupying 3.5TB of storage. 100% QA is performed on all images, with indexing to page level delivered by a Fulcrum database. Compressed images are delivered on the fly by AT&T’s DjVu software, with Digimarc watermarking. Back-ups are held on CD-ROM (c. 5,000), and images are currently delivered by a jukebox system (with the first 24 images of each item stored on a hard drive, using a Sun Ultra 450 web server).

Webb, C. ‘The Ferguson Project: A Hybrid Approach to Reformatting Rare Australiana’ (http://www.nla.gov.au/nla/staffpaper/cwebb1.html). A National Library of Australia project based on John Alexander Ferguson’s Bibliography of Australia. Outlines the benefits of the hybrid approach (microfilm and digitization).

Wilfred Owen Multimedia Digital Archive (http://info.ox.ac.uk/jtap). Manuscripts, photographs, audio, and video digitization project centred around the poet Wilfred Owen and the Great War. Used various methods, including outsourcing to the high-resolution digitization unit at Oxford (using the Kontron camera), but also employed flat-beds, etc. Audio is delivered as RealAudio files, and video as MPEG II and QuickTime.

Yale’s Open Book Project (http://www.library.yale.edu/preservation/pobweb.htm). A major Yale University Library project to convert 10,000 books from microfilm to digital form, using Xerox Corp. for the out-sourced scanning (with a Mekel M400 microfilm scanner). Reports are available at the CLIR’s site (http://www.clir.org/cpa/reports/openbook/openbook.html and http://www.clir.org/cpa/reports/conway/conway.html).


(1) For example, the Centre for the Study of Ancient Documents (Oxford) is working in collaboration with the University’s Department of Engineering to investigate ways of scanning stylus inscriptions via a 180 degree arch.

(2) Here we are talking about the suggestion that a digital preservation copy can be created, as opposed to using standard preservation surrogates. This is not the same as the (highly valid) attempts to study how to preserve digital objects themselves, such as those being conducted by the CEDARS project.

(3) As Arms (April, 1996) notes: ‘One issue that can not be adequately addressed here is an ongoing topic of discussion at the Library: the potential for digital versions to serve as preservation copies. Traditionally, preservation of content has focussed on creating a facsimile, as faithful a copy of the original as feasible, on a long-lasting medium. The most widely accepted method for preserving the information in textual materials is microfilming and for pictorial materials is photographic reproduction.’

(4) The overall problems of preserving digital information are being addressed by the United Kingdom’s CEDARS project (http://www.curl.ac.uk/cedarsinfo.shtml). For a concise overview of the problem (with historical perspective) see the University of Iowa’s on-line exhibition on preserving information (http://www.lib.uiowa.edu/ref/exhibit/contents.htm).

(5) Kenney, A. R. and Chapman, S. Digital Imaging for Libraries and Archives (New York, 1996 ISBN 1 85604 207 3), p. iv. Reviewed in Ariadne (http://www.ariadne.ac.uk/checkout/digital-imaging/intro.html) by Brian Kelly, 14th March, 1997.

(6) Kenney, A. R. and Chapman, S. ‘Digital Conversion of Research Library Materials: A Case for Full Informational Capture’ D-Lib Magazine (October, 1996; http://www.dlib.org/dlib/october96/cornell/10chapman.html). This also provides a useful example of benchmarking with an analysis of a real life example using a 1914 ‘brittle book’ entitled Farm Management.

(7) For accurate reproduction of colour, one should look to the work of the International Color Consortium (ICC - http://color.org/) in particular their ‘ICC Profile Format’. See the discussion of Color Management Systems in the ‘Technical Review’ of RLG DigiNews 1.3 (December 15, 1997 - http://www.rlg.org/preserv/diginews/diginews3.html#hardware&software).

(8) See also the MTF Target: Sine Patterns M-13-60, discussed in Williams, D. ‘What is an MTF and Why Should You Care?’ RLG DigiNews 2.1 (February 15, 1998 - http://www.rlg.org/preserv/diginews/).

(9) See RLG worksheet (http://www.rlg.org/preserv/RLGWorksheet.pdf) and Chapman and Kenney on Costs and Benefits (http://www.dlib.org/dlib/october96/cornell/10chapman.html).

(10) For a quick overview of the types of some of the common file formats available (notably via the Internet) see Perlman, E. and Kallen, I. ‘Common Internet File Formats’ (1995 - http://www.matisse.net/files/formats.html).

(11) It should be noted that the Centre for the Study of Ancient Documents also has a Phase One Fuji camera, but this is clearly owned by the institution.

(12) The Photographic Studio at the Bodleian employs 1 FTE for quality control on all microfilms.