UK Union Catalogue of Chinese Books

DATA CONVERSION


As explained in previous pages, the database is being constructed by taking existing data in a variety of apparently incompatible formats and standards, and converting them to a common standard so that they can be searched through a single interface and displayed consistently. This is achieved principally through the use of an allegro program called import, but for various reasons, a certain amount of pre-processing is necessary. Up to four different processes may be needed, ranging from only one in the case of China MARC, to all four in the case of US MARC format, which is the most complicated and the least amenable to manipulation by non-dedicated utilities. Because of this, US MARC (now MARC 21) has been chosen to illustrate the conversion process.

MARC to text

This means unravelling the data from the ISO directory structure so that the record is expressed as a simple text file with the various fields separated by line feeds. It is quite a normal procedure, usually done so that MARC records can be studied or handled conveniently with a text editor or some other program. Although import can address the directory structure of MARC records directly, it is not geared to the peculiar structure of non-roman US MARC data with its linked 880 fields. For this reason, such data is first unravelled so that the original script data in the 880 fields can be united with the equivalent romanised fields (see below) before further processing takes place. With China MARC data (as with Japanese and Korean MARC) this step is not necessary, as such data is primarily in original script, with romanised access points appearing in appropriate subfields of the fields to which they apply.

Sample 1: MARC record with ISO directory structure
Sample 1a: Key to the structure of a MARC record

To achieve the unravelling, we use the program marc.pl, a freeware Perl script by Stephen Thomas, Senior Systems Analyst in Adelaide University Library. For the purposes of this project, Thaddeus Lipinski has written an interface to this and his own script to catenate US MARC original script records. We offer this combined utility for downloading (see below) by anyone who needs to manipulate this type of data. Note that the utility can also be used to create a MARC record with ISO directory from an appropriately structured text file with line feeds.

Sample 2: the same record, unravelled with marc.pl
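
For readers who want to see what the unravelling involves, the following is a minimal Python sketch of reading the ISO directory structure. It is not marc.pl and makes no claim about how that script works; the function name unravel, the "LDR" label and the use of "$" to display the subfield delimiter are assumptions made for illustration. The leader layout (24 bytes, base address of data in positions 12 to 16) and the 12-byte directory entries (3-digit tag, 4-digit length, 5-digit start position) follow the standard record structure.

```python
FIELD_TERMINATOR = b"\x1e"
SUBFIELD_DELIMITER = b"\x1f"


def unravel(record: bytes) -> list[str]:
    """Turn one directory-structured record into a list of text lines."""
    leader = record[:24]
    base = int(leader[12:17].decode("ascii"))  # base address of data
    # The directory runs from byte 24 up to the field terminator at base - 1.
    directory = record[24:base - 1]
    lines = ["LDR " + leader.decode("ascii")]
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7].decode("ascii"))
        start = int(entry[7:12].decode("ascii"))
        field = record[base + start:base + start + length]
        field = field.rstrip(FIELD_TERMINATOR)
        # latin-1 keeps every byte as one character, so non-roman bytes
        # survive untouched for a later code-conversion step.
        text = field.replace(SUBFIELD_DELIMITER, b"$").decode("latin-1")
        lines.append(f"{tag} {text}")
    return lines


# A tiny hand-built record with two fields (illustrative data only).
sample = (
    b"00066nam  2200049   4500"            # leader: length 66, base address 49
    b"001000600000" b"245001000006" b"\x1e"  # directory: two entries
    b"12345\x1e" b"10\x1faTitle\x1e"         # field data
    b"\x1d"                                  # record terminator
)

for line in unravel(sample):
    print(line)
```

Each field comes out on its own line, which is the form in which it can be inspected with a text editor or passed to further scripts.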

Code conversion

To achieve the conversion from EACC and Big-5 to GB2312-80, a number of existing utilities were examined but found to be unsatisfactory. It was therefore necessary to create a new utility which, like allegro itself, would be configurable by the user. The first step in this process was the compilation of mapping tables, EACC to GB and Big-5 to GB. These were constructed from the file CJKXREF.TXT published by the Unicode Consortium. The file was imported into an allegro database. The structure of the EACC coding system was then exploited by Thaddeus Lipinski to map all full-forms and variants of a character to its GB2312-80 equivalent wherever possible, and these were fed back into the database, which thus consists of the official Unicode mappings with automatically created enhancements. The database was then manually updated as a small number of errors, omissions and duplicate values were discovered in both the Unicode and EACC standards. This is an ongoing process, and at any time the database can be edited and made to output the mapping tables e2g.tbl (EACC to GB) and b2g.tbl (Big-5 to GB). These tables are used by gb.pl, a Perl script by Thaddeus Lipinski, to convert files encoded in EACC or Big-5 into GB2312-80. The database may be downloaded and viewed by anyone. Those with a licence to use the allegro software can edit the database and output new tables automatically. Others can edit the tables with a Unix text editor.

Sample 3: the same record, re-coded with gb.pl
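
The idea of table-driven conversion can be sketched in a few lines of Python. This is not gb.pl, and it does not read the real b2g.tbl, whose file format is not described here. The function name convert_big5_to_gb and the "?" placeholder for unmapped characters are assumptions; the small table is built on the fly from Python's own big5 and gb2312 codecs purely so that the example is self-contained. One hand-added entry maps a full-form character to its GB2312-80 equivalent, in the spirit of the enhancements described above.

```python
def convert_big5_to_gb(data: bytes, table: dict[bytes, bytes],
                       unmapped: bytes = b"?") -> bytes:
    """Re-code Big-5 bytes as GB2312-80 using a lookup table."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # ASCII bytes are identical in both codes and pass through.
            out.append(data[i])
            i += 1
        else:
            # A byte of 0x80 or above starts a two-byte Big-5 character.
            out += table.get(data[i:i + 2], unmapped)
            i += 2
    return bytes(out)


# Illustrative table: identity mappings for characters present in both codes...
table = {ch.encode("big5"): ch.encode("gb2312") for ch in "中文"}
# ...plus one full-form character mapped to its GB2312-80 counterpart.
table["國".encode("big5")] = "国".encode("gb2312")

converted = convert_big5_to_gb("中國 abc".encode("big5"), table)
print(converted.decode("gb2312"))
```

Because the table is plain data, correcting a mapping means editing the table rather than the program, which is the property the project's own tables are designed to provide.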

Note: US MARC data is encoded in EACC, a 3-byte code whose strings are enclosed with start and stop markers, whereas GB is a 2-byte code with no end-of-string markers. The ISO directory structure of a MARC record identifies each piece of data by the total length of the record and the position and length of the data within it. Converting EACC to GB changes the byte length of every field that contains Chinese characters, so a MARC record must be unravelled (with marc.pl) before conversion to GB. If the conversion were done before the unravelling, the lengths and positions recorded in the directory would no longer match the data, rendering the directory meaningless and the data inaccessible.
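
The effect can be demonstrated with a small Python sketch. The helpers build and read_field are hypothetical and written only for this illustration, and ASCII letters stand in for multi-byte characters: "ABCDEF" plays the part of two 3-byte characters and "abcd" the part of their 2-byte replacements. No real EACC or GB values are used.

```python
def build(fields: list[tuple[str, bytes]]) -> bytes:
    """Assemble a directory-structured record from (tag, data) pairs."""
    directory = b""
    body = b""
    for tag, data in fields:
        data += b"\x1e"  # field terminator
        # Directory entry: tag, 4-digit length, 5-digit start position.
        directory += f"{tag}{len(data):04d}{len(body):05d}".encode("ascii")
        body += data
    directory += b"\x1e"
    base = 24 + len(directory)
    total = base + len(body) + 1  # +1 for the record terminator
    leader = f"{total:05d}nam  22{base:05d}   4500".encode("ascii")
    return leader + directory + body + b"\x1d"


def read_field(record: bytes, index: int) -> bytes:
    """Fetch the field at a directory index, trusting the directory."""
    base = int(record[12:17].decode("ascii"))
    entry = record[24 + 12 * index:36 + 12 * index]
    length = int(entry[3:7].decode("ascii"))
    start = int(entry[7:12].decode("ascii"))
    return record[base + start:base + start + length - 1]


record = build([("245", b"ABCDEF"), ("260", b"London")])

# Wrong order: re-code the bytes inside the record, leaving the directory as is.
naive = record.replace(b"ABCDEF", b"abcd")
# Right order: take the fields out, re-code them, then rebuild the directory.
rebuilt = build([("245", b"abcd"), ("260", b"London")])

print(read_field(record, 1))   # intact original
print(read_field(naive, 1))    # directory now points at the wrong bytes
print(read_field(rebuilt, 1))  # correct again
```

In the naive case the first field has shrunk by two bytes, so the directory entry for the second field points two bytes too far into the data.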

Catenating US MARC

In a US MARC record, the original script data is held in a series of 880 fields which may be linked to their corresponding romanised fields by an elaborate mechanism which allows for up to 99 linked pairs. Thaddeus Lipinski's program catmarc.pl takes each linked 880 field in turn and catenates it with its corresponding romanised field, separating the two with the ASCII character 13 (hex), so that the combined field can then be processed by import.

Sample 4: the same record, with 880 fields catenated
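
The pairing logic can be sketched in Python as follows. This is not catmarc.pl. It assumes that the unravelled fields are available as (tag, text) pairs with "$" displayed as the subfield delimiter, that the link is carried in subfield 6 in the usual MARC form (for example "$6880-01" in the romanised field and "$6245-01" in the 880 field), that the separator is the single byte 0x13, and that the romanised text comes first. The names catenate, link and SEPARATOR are hypothetical.

```python
import re

LINK = re.compile(r"\$6(\d{3})-(\d{2})")  # subfield 6: linked tag and occurrence
SEPARATOR = "\x13"  # assumption: "ASCII 13 hex" read as the byte 0x13


def link(text: str):
    """Return (linked tag, occurrence number) from subfield 6, or None."""
    m = LINK.search(text)
    return (m.group(1), m.group(2)) if m else None


def catenate(fields: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Join each linked 880 field onto its romanised partner."""
    # Index the 880 fields by the tag and occurrence they point back to.
    # Occurrence "00" means the 880 field has no partner.
    originals = {link(text): text for tag, text in fields
                 if tag == "880" and link(text) and link(text)[1] != "00"}
    used = set()
    out = []
    for tag, text in fields:
        lk = link(text)
        if tag != "880" and lk and lk[0] == "880" and (tag, lk[1]) in originals:
            key = (tag, lk[1])
            text = text + SEPARATOR + originals[key]
            used.add(key)
        out.append((tag, text))
    # Drop the 880 fields that have been merged; keep unlinked ones.
    return [(t, x) for t, x in out if not (t == "880" and link(x) in used)]


fields = [
    ("245", "10$6880-01$aZhongguo shi"),
    ("260", "  $aBeijing"),
    ("880", "10$6245-01/$1$a中國史"),
    ("880", "  $6500-00/$1$a附註"),
]
for tag, text in catenate(fields):
    print(tag, repr(text))
```

The two-digit occurrence number in subfield 6 is what limits a record to 99 linked pairs.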

allegro

Once all these processes have taken place, the data is then passed to the allegro program import. Like other allegro programs, the functioning of import is defined with a parameter file which can be written by the user in a program-specific language defined in the Systemhandbuch. The language not only enables field codes to be mapped from one format to another, but also allows the data itself to be manipulated in many ways. The complex and variable nature of bibliographical data calls for a correspondingly sophisticated language, in which data can be examined and checked and then handled accordingly. Furthermore, in the course of conversion it is possible to identify and correct common mistakes (to which the extremely complicated MARC formats are particularly prone), so that the converted data is often sounder than its source. In the RSLP exercise, it has been necessary to write several different parameter files to convert incoming data into the project standard as set out under Standards.

Sample 5: the same record imported into the project-specific format

Downloads

We will shortly be making available free of charge all the utilities used in the data conversion process, with the exception of the allegro software, which is the property of the University of Braunschweig. Please proceed through the menu on the left as soon as it is activated.
