UK Union Catalogue of Chinese Books

INDEXING

Title strings | Exact titles | Authors | Subjects | Shelfmarks | ISBN/ISSN

The following information applies to the current state of the database. It is likely that improvements and enhancements will be made from time to time.

Title strings

The title field looks like this:

e08.gif (1640 bytes)

First the whole title is taken as a string. Then two bytes (the length of a Chinese character) are chopped off the beginning, and the remaining string is taken. This process continues until the whole string has been dealt with, as follows:

e09.gif (1570 bytes)

The same is then done for the romanisation, the cut-off point being in this case the space, and then all spaces are eliminated:

e10.gif (1656 bytes)

The resulting title strings are then all filed in the same index.
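The string-generation routine described above can be sketched as follows. This is only an illustration, not the catalogue's own code: the function names are hypothetical, the sample title is invented for the example, and the Chinese routine works on Python characters rather than on pairs of GB2312 bytes, which amounts to the same thing for a title made up purely of Chinese characters.

```python
def chinese_title_strings(title: str) -> list[str]:
    """Return the whole title, then the title with one more leading
    character removed each time, until nothing is left."""
    return [title[i:] for i in range(len(title))]


def romanised_title_strings(romanised: str) -> list[str]:
    """Do the same for the romanisation: cut at each space, then
    eliminate all spaces from every resulting string."""
    syllables = romanised.split()
    return ["".join(syllables[i:]) for i in range(len(syllables))]


# Illustrative title only; not taken from a catalogue record.
print(chinese_title_strings("中国文献"))
print(romanised_title_strings("zhong guo wen xian"))
```

Both lists would then be filed together, so that a truncated search beginning at any character or syllable of the title can succeed.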

The first section of the search interface makes truncated searches in this index. It follows from this logic that the most economical way of searching for a known title is by the most unusual sequence of characters within it, in this case, simply:

e11.gif (1925 bytes)

Titles sometimes contain not only Chinese, but also alpha-numeric characters. If the indexing logic were applied uncritically, nonsense would be produced in cases where the alpha-numeric strings contained an odd number of bytes, because every later two-byte cut would then fall in the middle of a Chinese character. It is therefore necessary to strip out the alpha-numeric characters before indexing the Chinese character strings. This is easily done by removing any byte lower than 161 decimal (all the bytes in GB2312-80 fall in the range 161-254). It follows that a single search box in this part of the interface must not contain a mixture of Chinese and alpha-numeric characters. Thus

e01.gif (1923 bytes)

and

e02.gif (1933 bytes)

will both find

e05.gif (2889 bytes)

But

e03.gif (1923 bytes)

will find nothing, and

e04.gif (1944 bytes)

will find only

e06.gif (1625 bytes)

as the term "1925 nian" is only found as a title string in this record.
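The byte filter described above (removal of any byte lower than 161) might look like the following sketch. The function names and the sample title are assumptions made for illustration; Python's built-in "gb2312" codec stands in for whatever encoding layer the catalogue actually uses.

```python
def strip_low_bytes(raw: bytes) -> bytes:
    """Drop every byte below 161, leaving only GB2312-80 character bytes."""
    return bytes(b for b in raw if b >= 161)


def chinese_strings_from_mixed_title(title: str) -> list[str]:
    """Index only the Chinese part of a title that may also contain
    alpha-numeric characters."""
    raw = strip_low_bytes(title.encode("gb2312"))
    # After stripping, the byte string has an even length, so chopping
    # two bytes at a time always lands on a character boundary.
    return [raw[i:].decode("gb2312") for i in range(0, len(raw), 2)]


# Illustrative title only; not taken from a catalogue record.
print(chinese_strings_from_mixed_title("1925年大事"))
```

Because the alpha-numeric bytes never reach the Chinese title strings, a search term mixing the two kinds of character has nothing to match in this sketch either.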

Although the Union Catalogue is designed primarily to be searched in original script, romanised title and author access points have also been provided. The China MARC standard has been taken as orthodox, so that all such access points (subfield $r in the fields where they occur) are romanised syllable by syllable in lower case, for example:

tian an men guang chang li shi dang an

and the data is then indexed by the routine described above.

However, as readers often enter romanised search terms in an endless variety of forms, such as:

Tian-an-men Guang-chang li-shi dang-an
Tian'an Men Guangchang li shi dang'an
Tian'anmen Guangchang lishi dang'an

instead of requiring the China MARC norm, the search interface converts all upper case letters to lower case and strips out all punctuation and spaces before searching the title string index. So the above permutations, and many more, will all be reduced to the search term

tiananmenguangchanglishidangan

which will currently locate the following records in the database:

e07.gif (1886 bytes)
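The reduction of romanised search terms described above can be sketched in a few lines. The function name is hypothetical, and the set of characters treated as punctuation (Python's string.punctuation) is an assumption; the interface may strip a different set.

```python
import string

# Characters to delete: ASCII punctuation and all whitespace (an assumption).
_STRIP_TABLE = str.maketrans("", "", string.punctuation + string.whitespace)


def normalise_romanised_query(query: str) -> str:
    """Convert to lower case, then strip out punctuation and spaces."""
    return query.lower().translate(_STRIP_TABLE)


variants = [
    "tian an men guang chang li shi dang an",
    "Tian-an-men Guang-chang li-shi dang-an",
    "Tian'an Men Guangchang li shi dang'an",
    "Tian'anmen Guangchang lishi dang'an",
]
for variant in variants:
    print(normalise_romanised_query(variant))
```

Each variant is reduced to the same string, which is then used for a truncated search in the title string index.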

Exact titles

If the complete title of a book is a word such as zhong guo, wen xian, or some other term which is commonly found in other titles, truncated searching is of little use as the result set will be too large. A separate section of the search interface has therefore been provided for such cases, and this section makes an exact search in a separate title index. For example:

e12.gif (2066 bytes)

will currently find

e13.gif (2317 bytes)

The first of the four titles has been found because guo wen occurs as an added title entry in the record. If the same search term is entered as a title-string search, over 2,000 records are located.
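The difference between the two kinds of search can be illustrated with a small sketch. It assumes, purely for the example, that each index is a sorted list of strings and that the three sample titles are invented; the real indexes link each entry to catalogue records and may be organised quite differently.

```python
from bisect import bisect_left


def title_strings(romanised: str) -> list[str]:
    """All strings obtained by cutting at each space, spaces removed."""
    syllables = romanised.split()
    return ["".join(syllables[i:]) for i in range(len(syllables))]


def truncated_search(index: list[str], term: str) -> list[str]:
    """Return every index entry that begins with the search term."""
    hits = []
    for entry in index[bisect_left(index, term):]:
        if not entry.startswith(term):
            break  # the index is sorted, so no later entry can match
        hits.append(entry)
    return hits


def exact_search(index: list[str], term: str) -> list[str]:
    """Return the entry only if it matches the search term completely."""
    pos = bisect_left(index, term)
    return index[pos:pos + 1] if index[pos:pos + 1] == [term] else []


# Invented sample titles, for illustration only.
titles = ["guo wen", "zhong guo wen xue shi", "guo wen jiao xue fa"]
string_index = sorted({s for t in titles for s in title_strings(t)})
exact_index = sorted("".join(t.split()) for t in titles)

print(truncated_search(string_index, "guowen"))
print(exact_search(exact_index, "guowen"))
```

In this toy data the truncated search matches all three titles, while the exact search matches only the title that consists of guo wen and nothing else.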

Authors

As with most catalogues, searches in the author section of the interface will often yield very large result sets. For example, the search

e16.gif (2665 bytes)

will currently locate over 450 records.

However, as the principal purpose of the interface is to enable readers to gain rapid access to known titles through the entry of minimal search terms, author searches are truncated, not exact, so that very precise results can be obtained with surprisingly little effort. For example, the simple, romanised search

e14.gif (2630 bytes)

will find

e15.gif (2674 bytes)

In future, access to a browsable author index may be provided. Note that if authors are entered in romanisation, the procedure is the same as for titles, as described above, so that

luo zhen yu
Luo, Zhenyu
Luo Zhen-yu
luo-zhen-yu
luozhenyu
&c

will all find the same author, the standard romanised form being, of course, the first.

Subjects

Not all the records in the database have subject headings (the allegro users are not currently producing them), and those that do use different systems: Cambridge uses National Library of China subject headings, and the libraries that use US systems (SOAS, Durham and Edinburgh) use Library of Congress Subject Headings. Both have been loaded into the subject index, and access to this index is provided in the Further Options part of the search interface. Search terms must in this case be entered exactly in their standard format: Chinese characters for the NLC system, and romanisation (American spellings, and observing capitalisation) for LCSH.

Example of an NLC subject search:

e17.gif (4376 bytes)

Example of an LCSH subject search:

e18.gif (5481 bytes)

The figures to the left of the index entry show how many records are currently attached to that entry. The result will obviously provide only a partial picture of the national book stock, owing to the different systems - if any - in use. But having obtained a result for one library, the title may then be sought in another. In future, the interface may be enhanced to enable this to be done by a simple click rather than manually entering the title as a search term.

Shelfmarks

The shelfmark index works in exactly the same way as the subject index. Search terms must be entered in their orthodox form to achieve the desired result.

ISBN/ISSN

The final part of the search interface makes a truncated search in the ISBN/ISSN index. Hyphens may be input, but are ignored. A full ISBN will obviously locate a unique title, but a truncated ISBN can be used to locate all titles with a particular publisher code, for example:

e19.gif (1856 bytes)

or

e20.gif (1768 bytes)
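A truncated ISBN search that ignores hyphens could be sketched as below. The function names are hypothetical, the index is assumed to hold numbers without hyphens, and the sample numbers are invented for the example rather than taken from the catalogue.

```python
def normalise_number(term: str) -> str:
    """Hyphens may be input, but are ignored."""
    return term.replace("-", "")


def isbn_search(index: list[str], term: str) -> list[str]:
    """Return every indexed number that begins with the search term."""
    prefix = normalise_number(term)
    return [entry for entry in index if entry.startswith(prefix)]


# Invented sample numbers, stored without hyphens (an assumption).
sample_index = ["7101003045", "7101012345", "7532505678"]

print(isbn_search(sample_index, "7-101"))          # truncated: publisher code
print(isbn_search(sample_index, "7-5325-0567-8"))  # full number
```

Entering only the group and publisher part of the number retrieves every entry sharing that prefix, while a full number retrieves a single entry.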
