UK Union Catalogue of Chinese Books

STANDARDS

Data entry | Format | Conspectus | Samples

As the purpose of this project is not to prescribe how the participants should catalogue Chinese books in their respective libraries, but to take what they have done and make it accessible through a single interface, there is very little to decide in the area of standards: these are inevitably determined by what the data permit. We are obliged to create, from the three data sources (China MARC, US/UK MARC, and the Bodleian/BL allegro format), a standard into which all can be converted automatically and accurately.

Data entry

It is worth repeating, for the sake of non-CJK librarians who may be following the course of this project (the point has already been touched on in Background), that the biggest source of controversy in CJK cataloguing is not bibliographical format but how data should be entered into that format, especially in the areas of encoding and romanisation. In both, the standards must be those of China MARC, as they are the simplest. For encoding, the published standard prescribes GB 2312-80 (the smallest character set, with 6,763 Chinese characters), but recently produced records have made use of the extended GBK set, which offers a further 14,240 characters. For romanisation, pinyin is used in its most basic form, with no syllable aggregation whatsoever, even for proper names.

All data incorporated into the UK union catalogue will therefore be converted into GB coding, and any romanised elements will be converted into non-aggregated pinyin. Furthermore, the data will be stored and displayed principally in original script (if present), with pinyin romanisation used only to produce access points in title and author fields. The user interface will aim to enable search terms to be entered intuitively.
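
As an illustration of the encoding question, the following Python sketch reports whether a string fits within GB 2312-80 or needs the extended GBK set. It is only a sketch: the function name gb_coverage is hypothetical, and it relies on Python's built-in gb2312 and gbk codecs rather than on any software used by the project.

```python
def gb_coverage(text: str) -> str:
    """Return the smallest GB character set able to encode `text`.

    Tries GB 2312-80 first, then the extended GBK set; returns "none"
    if neither codec can represent every character.
    """
    for codec in ("gb2312", "gbk"):
        try:
            text.encode(codec)
            return codec
        except UnicodeEncodeError:
            continue
    return "none"


if __name__ == "__main__":
    # Simplified characters that lie within GB 2312-80.
    print(gb_coverage("中国图书"))
    # The character 镕 is absent from GB 2312-80 but present in GBK.
    print(gb_coverage("朱镕基"))
```

A check of this kind could be used to flag incoming records whose original script cannot be stored in the basic set.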

In the areas of author authority and subject headings, there will be particular difficulties that cannot be resolved automatically, as the participants are either using radically different sources of authority (the National Library of China and the Library of Congress), or none at all. However, conflicts of author authority occur only rarely, and the design of the user interface, which will enable multiple cross-index searching to be done extremely efficiently, should obviate all but the worst problems.

Format

The three different formats of the data sources will be mapped to a common project-specific format which, as stated in the project's bid for funding, will be compatible with the data structure of the UK Japanese Union Catalogue, so as to provide a consistent corpus of both Chinese and Japanese data.

As far as possible, the format will adhere to the CURL minimum standard, bearing in mind that the national MARCs of both China and Japan, as well as the locally developed allegro format, are more modern and do not use headings, and that Japan MARC and NACSIS format make very little use of subfields and indicators.

The format is summarised in the following table. Except in the area of subject headings and classification, it is expected that all participants will make use of all fields; or put differently, that all participants will regard this format as representing their minimum cataloguing level. The database will neither make use of nor store information over and above this standard, which experience has demonstrated to be fully adequate to the limited search and retrieval functions that are expected of it. A view of how the format works in practice is provided by the existing UK Japanese Union Catalogue, which consists entirely of data automatically imported from NACSIS B-format (see example below under Samples).

FIELD (* repeatable, ¤ indexed)
DESCRIPTION

#000 ¤ Record number
This is the locally assigned unique record identifier, and ensures that editorial control of the content is in the hands of the contributing library. The software will use this matchkey to determine whether the incoming record will be added to the database as a new record, or overwrite or delete an existing record. Every record submitted to the database must contain this matchkey, and it will be accepted in exactly the form in which it is supplied, with the addition of a local identifier. The field contains first the local identifier, followed by the record number in subfield $n.
#0rs Source file name of incoming data
The filename incorporates the date when the data was submitted.
#0rx Record link
If record linking is used, the link will go into this field. Currently, only the allegro group uses record linking, to catalogue the contents of congshu and other collectaneous works.
#0rz Part of main record in which analytic is contained
Used only by the allegro users in conjunction with #0rx. If a collectaneous work is divided into sections, each with its own section name, the section name is entered into this field.
#0ca Coded date information
This field is of fixed length, and contains date type (position 1), date 1 (2-5), date 2 (6-9), all as in conventional MARCs.
#0cc Coded country/language information
The first two positions contain the country code, and are followed by repeatable subfields containing language of text $t, abstract $a and original $o.
#100*¤ ISBN
Hyphens are not used. Two subfields are possible, for binding $v and price $p.
#110*¤ ISSN
#120*¤ Tong yi shu kan hao
#260 ¤ Title / statement of responsibility
MARC subfield designations are stripped out. Elements are distinguished with ISBD punctuation. The romanised access point for the first title only goes into subfield $r. Second and subsequent titles are indexed from #265. Authors are indexed from #360 or #361.
#264*¤ Uniform title
#265*¤ Added/variant title
#360*¤ Author (personal)
#361*¤ Author (corporate)
In both personal and corporate author fields, if the incoming record contains original script, the romanised form will go into subfield $r. Dates and dynasties, where present, will be stored in subfield $d.
#400 Edition statement
Subfields are replaced with ISBD punctuation.
#405 Material specific details
#407 Imprint
Subfields are replaced with ISBD punctuation.
#470 Physical description
Subfields are replaced with ISBD punctuation.
#570* Series
Subfields are replaced with ISBD punctuation. ISSN relating to series is not used.
#572*¤ Series access point
#600* Notes
The extremely complicated analysis of note content favoured by conventional MARCs (but not Japan MARC) will be simplified, and with one or two exceptions, all notes will go into the same field, repeated as necessary, irrespective of content.
#617* Content Notes
#706*¤ NLC classification
There will be discussion as to whether other classification systems (e.g. Dewey, LC) should also be imported.
#770*¤ NLC subject headings
To be distinguished with indicators.
#772*¤ LC subject headings
To be distinguished with indicators.
#8..*¤ Shelfmark
Local copy information will be stored in the #8.. fields. Appropriate subfields will be decided at a later stage, but will include at least $v volume number, $a accession number, $c copy notes, $h periodical holdings.
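
The two coded fields in the table (#0ca and #0cc) have a fixed layout that lends itself to mechanical parsing. The following Python sketch shows one way the positions and subfields described there might be split apart. The function names, the literal "$" used as subfield delimiter, and the sample values are assumptions made for illustration, not part of the project's specification.

```python
from dataclasses import dataclass


@dataclass
class CodedDate:
    date_type: str  # position 1
    date1: str      # positions 2-5
    date2: str      # positions 6-9


def parse_coded_date(value: str) -> CodedDate:
    """Split a fixed-length #0ca value into date type, date 1 and date 2."""
    if len(value) != 9:
        raise ValueError(f"#0ca must be 9 characters long, got {len(value)}")
    return CodedDate(value[0], value[1:5], value[5:9])


def parse_country_language(value: str) -> dict:
    """Split a #0cc value into a country code and repeatable subfields.

    $t = language of text, $a = language of abstract, $o = language of original.
    """
    head, *subfields = value.split("$")
    result = {"country": head[:2], "t": [], "a": [], "o": []}
    for sub in subfields:
        if not sub:
            continue  # ignore an empty fragment such as a trailing "$"
        code, content = sub[0], sub[1:]
        if code not in ("t", "a", "o"):
            raise ValueError(f"unexpected subfield ${code} in #0cc")
        result[code].append(content)
    return result


if __name__ == "__main__":
    # Illustrative values only: a single date, and a Chinese text
    # translated from an English original.
    print(parse_coded_date("s1998    "))
    print(parse_country_language("cc$tchi$oeng"))
```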

Data in all indexed title and author fields (#260, #264, #265, #360, #361, #572) is followed by the indicator ¤r if it is in romanisation only.
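
The add, overwrite or delete decision described for the #000 matchkey in the table above can be pictured with a short Python sketch. It assumes an in-memory dictionary standing in for the database, a hypothetical local identifier "OX", and a delete flag supplied with the incoming record; how deletion is actually signalled is not stated here, so that part in particular is an assumption.

```python
def match_key(local_id: str, record_number: str) -> str:
    """Build the #000 content: local identifier, then record number in $n."""
    return f"{local_id}$n{record_number}"


def apply_incoming(database: dict, local_id: str, record_number: str,
                   record: dict, delete: bool = False) -> str:
    """Add, overwrite or delete a record according to its matchkey.

    The record number is used exactly as supplied by the contributing
    library; only the local identifier is added in front of it.
    """
    key = match_key(local_id, record_number)
    if delete:
        if key in database:
            del database[key]
            return "deleted"
        return "absent"
    action = "overwritten" if key in database else "added"
    database[key] = record
    return action


if __name__ == "__main__":
    db = {}
    print(apply_incoming(db, "OX", "12345", {"260": "first version"}))
    print(apply_incoming(db, "OX", "12345", {"260": "corrected version"}))
    print(apply_incoming(db, "OX", "12345", {}, delete=True))
```

Because the local identifier is part of the key, two libraries can supply the same record number without overwriting each other's records.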

Conspectus

A mapping table from the formats of the three data sources (China MARC, US/UK MARC, and the Bodleian/BL allegro format) to the project-specific format summarised in the above table is being constructed, and will appear here in the near future.
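
Pending that table, one kind of rule it would need to express is the replacement of subfield designations by ISBD punctuation, as described for #260 above. The Python sketch below is purely illustrative: it assumes UNIMARC-style subfield codes for a title field ($a title proper, $e other title information, $f first statement of responsibility, $g subsequent statement of responsibility), and the names ISBD_PREFIX and to_isbd are hypothetical. It does not represent the project's actual mapping.

```python
# Assumed ISBD punctuation placed before each subfield's content.
ISBD_PREFIX = {
    "a": " ; ",  # a further title (the first title takes no prefix)
    "e": " : ",  # other title information
    "f": " / ",  # first statement of responsibility
    "g": " ; ",  # subsequent statement of responsibility
}


def to_isbd(subfields: list) -> str:
    """Join (code, text) pairs into one string using ISBD punctuation.

    Subfields with codes not listed in ISBD_PREFIX are skipped.
    """
    parts = []
    for code, text in subfields:
        if code not in ISBD_PREFIX:
            continue
        # The first element kept is emitted without leading punctuation.
        parts.append(ISBD_PREFIX[code] + text if parts else text)
    return "".join(parts)


if __name__ == "__main__":
    print(to_isbd([("a", "红楼梦"), ("f", "曹雪芹著")]))
```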

Samples

1 China MARC
2 Bodleian/BL allegro format
3 US MARC (RLIN)
4 NACSIS B-format (record group)
5 NACSIS download format
6 Japan MARC
7 KORMARC
