root/trunk/gsdl/packages/yaz/doc/profiles.sgml @ 1343

Revision 1343, 24.7 KB (checked in by johnmcp, 20 years ago)

Added the YAZ toolkit source to the packages directory (for z39.50 stuff)

1<!doctype linuxdoc system>
2<article>
3<title>Specifying and Using Application (Database) Profiles
4<author>Index Data, <tt/info@indexdata.dk/
5<date>$Revision$
6<abstract>
7YAZ includes a subsystem to manage complex database records, driven
8by a set of configuration tables that reflect a given profile.
9Multiple database profiles can coexist in the same server, or even
10the same database. The record management system is responsible for
11associating a given record with a specific profile, and processing it
12accordingly. This document describes the various file formats for data
13and configuration files which are used by the module.
14</abstract>
15
16<toc>
17
18<sect>Warnings
19
20<p>
21<itemize>
22<item>The subsystem described herein is under development. Not
23everything may work exactly as described, and details of the interface
24may change as the module matures.
25
26<item>The exact workings of the subsystem may depend on the
27application in which it is used. This document focuses on the use of
28the module in the <bf/Zebra/ information server which is distributed by Index
29Data as an independent package.
30</itemize>
31
32<sect>Introduction
33
34<p>
35The retrieval facilities of Z39.50 are extremely flexible and powerful.
36They allow any level of structuring of database records. They allow
37controlled re-use of attribute sets (for searching) and tag sets (for
38retrieval) between application profiles; they allow precise selection
39of the desired sub-elements of a database record; they allow different
40variants of a given data element to be represented and selected in a
41structured way; and finally they allow the exchange of any type and
42amount of data to be represented in a single database record.
43
44These powerful retrieval facilities are a recent addition to the
45protocol, and along with the flexible searching facilities, they make
46the protocol an extremely capable tool for precise, structured
47access to information systems. The retrieval facilities add new
48levels of flexibility and control to the protocol, which add to its
49value outside of its traditional domain of the library systems world.
50
51The new facilities, however, also add new complexity to the protocol,
52which is already troubled by a too-steep learning curve. We have seen
53many good projects severely hindered or even thwarted by the sheer
54complexity of implementing the Z39.50 protocol.
55
56At the same time, we feel that the most complex and powerful
57facilities of the protocol (Explain, structured retrieval, etc.), are
58also what the protocol needs to become more widespread, and to fulfill
59what we perceive to be its most noble potential: To provide
60everybody with standardised, well-structured access to the
61information resources of the world.
62
63The purpose of <bf/YAZ/, then, and of this module as well, is to
64<it/simplify/ the use of the protocol for programmers and
65administrators, by providing simple APIs and configuration systems to
66access the functionality of the protocol. The <bf/Retrieval/ module
67deals specifically with the advanced retrieval functions which were
68added to the protocol with version 3, or Z39.50-1994.
69
70<sect>Overview
71
72<sect1>External Data (record) Representation
73
74<p>
75The <bf/Retrieval/ module will eventually support a wide range of
76input formats, ranging from MARC data to USENET news archives. This
77section introduces what we think of as the <it/canonical/ format - the
78one that gives the most general access to the various elements of the
79retrieval functionality.
80
81The basic model presented by the Z39.50 retrieval system is that of a
82recursively defined tree structure, containing a list of tagged elements,
83which may in turn contain either data or more lists of tagged elements, and
84so forth.
85
86We elect to represent this structuring externally by using an
87&dquot;SGML-like&dquot; syntax. The <it/internal/ representation will
88be discussed later.
89
90Consider a record describing an information resource (such a record is
91sometimes known as a <it/locator record/). It might contain a field
92describing the distributor of the information resource, which might in
93turn be partitioned into various fields providing details about the
94distributor, like this:
95
96<tscreen><verb>
97<Distributor>
98    <Name> USGS/WRD &etago;Name>
99    <Organization> USGS/WRD &etago;Organization>
100    <Street-Address>
101    U.S. GEOLOGICAL SURVEY, 505 MARQUETTE, NW
102    &etago;Street-Address>
103    <City> ALBUQUERQUE &etago;City>
104    <State> NM &etago;State>
105    <Zip-Code> 87102 &etago;Zip-Code>
106    <Country> USA &etago;Country>
107    <Telephone> (505) 766-5560 &etago;Telephone>
108&etago;Distributor>
109</verb></tscreen>
110
111This is how data that the retrieval module reads from an input file
112might look.
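
The recursive tree model described above can be illustrated with a short sketch. The Python below is not part of YAZ; it is a minimal illustration, assuming that tags carry no attributes and that every tag is explicitly closed. The function name <tt/parse_record/ is hypothetical. Note that <tt>&amp;etago;</tt> in the source of this document is simply the SGML entity for the two characters that open a close tag, so an input file contains ordinary close tags.

```python
import re

# One token is an open tag, a close tag, or a run of character data.
TOKEN = re.compile(r"<(/?)([^<>]+)>|([^<>]+)")


def parse_record(text):
    """Return the record as a list of (tag, children) tuples.

    Children are either nested (tag, children) tuples or plain strings.
    """
    root = ("#root", [])
    stack = [root]  # the innermost open element is stack[-1]
    for closing, name, data in TOKEN.findall(text):
        name = name.strip()
        if data:
            if data.strip():  # ignore whitespace-only runs between tags
                stack[-1][1].append(data.strip())
        elif closing:
            if len(stack) == 1 or stack[-1][0] != name:
                raise ValueError(f"unexpected close tag: {name}")
            stack.pop()
        else:
            node = (name, [])
            stack[-1][1].append(node)
            stack.append(node)
    if len(stack) != 1:
        raise ValueError(f"unclosed tag: {stack[-1][0]}")
    return root[1]


record = """
<Distributor>
    <Name> USGS/WRD </Name>
    <City> ALBUQUERQUE </City>
</Distributor>
"""
tree = parse_record(record)
```

Each element becomes a pair of tag name and child list, which mirrors the "list of tagged elements" structure; a real implementation would go on to map the tag names to numeric tags using the tag set.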
113
114Depending on the database profile that is being used, it is likely
115that the data won't look like this when it's transmitted from the
116server to the client, however. Typically, the client will prefer to
117receive the data in a more rigid syntax, such as USMARC or GRS-1. To
118save transmission time and avoid ambiguities of language, the
119individual tags or field names, above, might be translated into
120numbers which are known by both the client and the server (by
121referring to a tag set).
122
123The retrieval module supports various types of conversions that might
124be carried out by the server based on requests from the client. To do
125this, it needs a set of configuration files to describe the
126application profile that the given record adheres to.
127
128<it>
129CAUTION: Because the tables described below serve the dual purpose of
130representing an external application profile and an internal database
131profile, the terminology and structuring used will sometimes be
132somewhat different from those suggested in Z39.50-1995.
133</it>
134
135<sect1>The Abstract Syntax
136
137<p>
138The abstract syntax definition (also known as an Abstract Record Structure, or ARS) is the focal point of the
139application profile description. For a given profile, it may state any
140or all of the following:
141
142<itemize>
143<item>The object identifier of the database schema associated with the
144profile, so that it can be referred to by the client.
145
146<item>The attribute set (which can possibly be a compound of multiple
147sets) which applies in the profile. This is used when indexing and
148searching the records belonging to the given profile.
149
150<item>The Tag set (again, this can consist of several different sets).
151This is used when reading the records from a file, to recognize the
152different tags, and when transmitting the record to the client -
153mapping the tags to their numerical representation, if they are
154known.
155
156<item>The variant set which is used in the profile. This provides a
157vocabulary for specifying the <it/forms/ of data that appear inside
158the records.
159
160<item>Element set names, which are a shorthand way for the client to
161ask for a subset of the data elements contained in a record. Element
162set names, in the retrieval module, are mapped to <it/element
163specifications/, which contain information equivalent to the
164<it/Espec-1/ syntax of Z39.50.
165
166<item>Map tables, which may specify mappings to <it/other/ database
167profiles, if desired.
168
169<item>Possibly, a set of rules describing the mapping of elements to a
170MARC representation.
171
172<item>A list of element descriptions (this is the actual ARS of the
173profile), which lists the ways in which the various tags can be used
174and organized hierarchically.
175</itemize>
176
177Several of the entries above simply refer to other files, which describe the
178given objects.
179
180<sect>The Configuration Files
181
182<p>
183This section describes the syntax and use of the various tables which
184are used by the retrieval module.
185
186The number of different file types may appear daunting at first, but
187each type corresponds fairly clearly to a single aspect of the Z39.50
188retrieval facilities. Further, the average database administrator
189who's simply reusing an existing profile for which tables already
190exist shouldn't have to worry too much about these tables.
191
192<sect1>The Abstract Syntax (.abs) Files
193
194<p>
195The name of this file type is slightly misleading, since, apart from
196the actual abstract syntax of the profile, it also includes most of
197the other definitions that go into a database profile.
198
199When a record in the canonical, SGML-like format is read from a file
200or from the database, the first tag of the file should reference the
201profile that governs the layout of the record. If the first tag of the
202record is <tt>&lt;gils&gt;</tt>, the system will look for the profile
203definition in the file <tt/gils.abs/. Profile definitions are cached,
204so they only have to be read once during the lifespan of the current
205process.
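
The lookup rule just described (first tag of the record selects the <tt/.abs/ file, and definitions are cached) might be sketched as follows. This is an illustration only: the function names are hypothetical, the search path is assumed to be the current directory, and the cached value is just the raw file text rather than a parsed profile.

```python
import re
from functools import lru_cache


def profile_filename(record_text):
    """Derive the .abs filename from the first tag of a record."""
    match = re.match(r"\s*<([^<>/\s]+)", record_text)
    if match is None:
        raise ValueError("record does not start with a tag")
    return match.group(1) + ".abs"


@lru_cache(maxsize=None)
def load_profile(filename):
    """Read a profile definition once per process (assumption: raw text)."""
    with open(filename) as handle:
        return handle.read()


name = profile_filename("<gils>\n<Title> Example </Title>\n</gils>")
```

Here <tt/name/ is <tt/gils.abs/; repeated calls to <tt/load_profile/ with the same filename return the cached text without touching the file again.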
206
207The file may contain the following directives:
208
209<descrip>
210<tag>name <it/symbolic-name/</tag> This provides a shorthand name or
211description for the profile. Mostly useful for diagnostic purposes.
212
213<tag>reference <it/OID-name/</tag> The reference name of the OID for
214the profile. The reference names can be found in the <bf/util/
215module of <bf/YAZ/.
216
217<tag>attset <it/filename/</tag> The attribute set that is used for
218indexing and searching records belonging to this profile.
219
220<tag>tagset <it/filename/</tag> The tag set (if any) that describes
221the fields of the records.
222
223<tag>varset <it/filename/</tag> The variant set used in the profile.
224
225<tag>maptab <it/filename/</tag> (repeatable) This points to a
226conversion table that might be used if the client asks for the record
227in a different schema from the native one.
228
229<tag>marc <it/filename/</tag> Points to a file containing parameters
230for representing the record contents in the ISO2709 syntax. Read the
231description of the MARC representation facility below.
232
233<tag>esetname <it/name filename/</tag> (repeatable) Associates the
234given element set name with an element selection file. If an at sign (@) is
235given in place of the filename, this corresponds to a null mapping for
236the given element set name.
237
238<tag>elm <it/path name attribute/</tag> (repeatable) Adds an element
239to the abstract record syntax of the schema. The <it/path/ follows the
240syntax which is suggested by the Z39.50 document - that is, a sequence
241of tags separated by slashes (/). Each tag is given as a
242comma-separated pair of tag type and tag value, surrounded by parentheses.
243The <it/name/ is the name of the element, and the <it/attribute/
244specifies what attribute to use when indexing the element. A ! in
245place of the attribute name is equivalent to specifying an attribute
246name identical to the element name. A - in place of the attribute name
247specifies that no indexing is to take place for the given element.
248</descrip>
249
250<it>
251NOTE: The mechanism for controlling indexing is inadequate for
252complex databases, and will probably be moved into a separate
253configuration table eventually.
254</it>
255
256The following is an excerpt from the abstract syntax file for the GILS
257profile.
258
259<tscreen><verb>
260name gils
261reference GILS-schema
262attset gils.att
263tagset gils.tag
264varset var1.var
265
266maptab gils-usmarc.map
267
268# Element set names
269
270esetname VARIANT gils-variant.est  # for WAIS-compliance
271esetname B gils-b.est
272esetname G gils-g.est
273esetname W gils-b.est
274esetname F @
275
276elm (1,10)              rank                        -
277elm (1,12)              url                         -
278elm (1,14)              localControlNumber     Local-number
279elm (1,16)              dateOfLastModification Date/time-last-modified
280elm (2,1)               Title                       !
281elm (4,1)               controlIdentifier      Identifier-standard
282elm (2,6)               abstract               Abstract
283elm (4,51)              purpose                     !
284elm (4,52)              originator                  -
285elm (4,53)              accessConstraints           !
286elm (4,54)              useConstraints              !
287elm (4,70)              availability                -
288elm (4,70)/(4,90)       distributor                 -
289elm (4,70)/(4,90)/(2,7) distributorName             !
290elm (4,70)/(4,90)/(2,10) distributorOrganization    !
291elm (4,70)/(4,90)/(4,2) distributorStreetAddress    !
292elm (4,70)/(4,90)/(4,3) distributorCity             !
293</verb></tscreen>
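
To make the <bf/elm/ syntax concrete, here is a small sketch that splits one directive into its path, element name, and index attribute, applying the <tt/!/ and <tt/-/ conventions described above. It is not the YAZ parser; the function names are hypothetical, and it assumes that fields are separated by whitespace and that a path contains no blanks.

```python
import re

# One path step: "(type,value)", for example "(4,70)".
STEP = re.compile(r"\((\d+),(\w+)\)")


def parse_path(path):
    """Turn "(4,70)/(4,90)" into [(4, 70), (4, 90)]."""
    steps = []
    for part in path.split("/"):
        match = STEP.fullmatch(part)
        if match is None:
            raise ValueError(f"malformed tag in path: {part!r}")
        tag_type, value = match.groups()
        steps.append((int(tag_type), int(value) if value.isdigit() else value))
    return steps


def parse_elm(line):
    """Return (path, name, attribute); attribute is None when not indexed."""
    keyword, path, name, attribute = line.split()
    if keyword != "elm":
        raise ValueError(f"not an elm directive: {line!r}")
    if attribute == "!":
        attribute = name  # index under an attribute named like the element
    elif attribute == "-":
        attribute = None  # no indexing for this element
    return parse_path(path), name, attribute


elm = parse_elm("elm (4,70)/(4,90)/(2,7) distributorName !")
```

A path step with an unbalanced parenthesis is rejected with a <tt/ValueError/, which is a cheap way to catch typing mistakes in a table.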
294
295<sect1>The Attribute Set (.att) Files
296
297<p>
298This file type describes the <bf/Use/ elements of an attribute set.
299It contains the following directives.
300
301<descrip>
302
303<tag>name <it/symbolic-name/</tag> This provides a shorthand name or
304description for the attribute set. Mostly useful for diagnostic purposes.
305
306<tag>reference <it/OID-name/</tag> The reference name of the OID for
307the attribute set. The reference names can be found in the <bf/util/
308module of <bf/YAZ/.
309
310<tag>ordinal <it/integer/</tag> This value will be used to represent the
311attribute set in the index. Care should be taken that each attribute
312set has a unique ordinal value.
313
314<tag>include <it/filename/</tag> This directive, which can be
315repeated, is used to include another attribute set as a part of the
316current one. This is used when a new attribute set is defined as an
317extension to another set. For instance, many new attribute sets are
318defined as extensions to the <bf/bib-1/ set. This is an important
319feature of the retrieval system of Z39.50, as it ensures the highest
320possible level of interoperability, as those access points of your
321database which are derived from the external set (say, bib-1) can be used
322even by clients who are unaware of the new set.
323
324<tag>att <it/att-value att-name &lsqb;local-value&rsqb;/</tag> This
325repeatable directive
326introduces a new attribute to the set. The attribute value is stored
327in the index (unless a <it/local-value/ is given, in which case this
328is stored). The name is used to refer to the attribute from the
329<it/abstract syntax/.
330</descrip>
331
332This is an excerpt from the GILS attribute set definition. Notice how
333the file describing the <it/bib-1/ attribute set is referenced.
334
335<tscreen><verb>
336name gils
337reference GILS-attset
338include bib1.att
339ordinal 2
340
341att 2001        distributorName
342att 2002        indexTermsControlled
343att 2003        purpose
344att 2004        accessConstraints
345att 2005        useConstraints
346</verb></tscreen>
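
The directives above are simple enough to read with a few lines of code. The following sketch is an illustration rather than the YAZ implementation: the function name and the dictionary layout are hypothetical, <tt/#/ is assumed to start a comment (as in the abstract syntax excerpt), and the optional <it/local-value/ is assumed to be an integer.

```python
def parse_attset(text):
    """Collect the directives of an attribute set file into a dict."""
    attset = {"includes": [], "atts": {}}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        directive, *args = line.split()
        if directive in ("name", "reference"):
            attset[directive] = args[0]
        elif directive == "ordinal":
            attset["ordinal"] = int(args[0])
        elif directive == "include":
            attset["includes"].append(args[0])
        elif directive == "att":
            value, name = int(args[0]), args[1]
            # The local value, when given, is what gets stored in the index.
            stored = int(args[2]) if len(args) > 2 else value
            attset["atts"][name] = {"value": value, "stored": stored}
        else:
            raise ValueError(f"unknown directive: {directive}")
    return attset


gils = parse_attset("""
name gils
reference GILS-attset
include bib1.att
ordinal 2

att 2001        distributorName
att 2003        purpose
""")
```

A fuller version would also load each included file (here <tt/bib1.att/) and merge its attributes, so that names from the base set resolve as well.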
347
348<sect1>The Tag Set (.tag) Files
349
350<p>
351This file type defines the tagset of the profile, possibly by
352referencing other tag sets (most tag sets, for instance, will include
353tagsetG and tagsetM from the Z39.50 specification). The file may
354contain the following directives.
355
356<descrip>
357<tag>name <it/symbolic-name/</tag> This provides a shorthand name or
358description for the tag set. Mostly useful for diagnostic purposes.
359
360<tag>reference <it/OID-name/</tag> The reference name of the OID for
361the tag set. The reference names can be found in the <bf/util/
362module of <bf/YAZ/.
363
364<tag>type <it/integer/</tag> The type number of the tag within the schema
365profile.
366
367<tag>include <it/filename/</tag> (repeatable) This directive is used
368to include the definitions of other tag sets into the current one.
369
370<tag>tag <it/number names type/</tag> (repeatable) Introduces a new
371tag to the set. The <it/number/ is the tag number as used in the protocol
372(there is currently no mechanism for specifying string tags,
373but this would be quick work to add). The <it/names/ parameter
374is a list of names by which the tag should be recognized in the input
375file format. The names should be separated by slashes (/). The
376<it/type/ is the recommended datatype of the tag. It should be one of
377the following:
378<itemize>
379<item>structured
380<item>string
381<item>numeric
382<item>bool
383<item>oid
384<item>generalizedtime
385<item>intunit
386<item>int
387<item>octetstring
388<item>null
389</itemize>
390</descrip>
391
392The following is an excerpt from the TagsetG definition file.
393
394<tscreen><verb>
395name tagsetg
396reference TagsetG
397type 2
398
399tag 1   title       string
400tag 2   author      string
401tag 3   publicationPlace string
402tag 4   publicationDate string
403tag 5   documentId  string
404tag 6   abstract    string
405tag 7   name        string
406tag 8   date        generalizedtime
407tag 9   bodyOfDisplay   string
408tag 10  organization    string
409</verb></tscreen>
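
As an illustration of how a tag set table can be used to translate tag names into the numeric (type, value) form, consider the sketch below. It is not taken from YAZ: the function name and result layout are hypothetical, and matching names case-insensitively is an assumption made here for convenience.

```python
DATATYPES = {
    "structured", "string", "numeric", "bool", "oid",
    "generalizedtime", "intunit", "int", "octetstring", "null",
}


def parse_tagset(text):
    """Read a tag set file and index the tags by each of their names."""
    tagset = {"includes": [], "by_name": {}}
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        directive, args = fields[0], fields[1:]
        if directive in ("name", "reference"):
            tagset[directive] = args[0]
        elif directive == "type":
            tagset["type"] = int(args[0])
        elif directive == "include":
            tagset["includes"].append(args[0])
        elif directive == "tag":
            number, names, datatype = args
            if datatype not in DATATYPES:
                raise ValueError(f"unknown datatype: {datatype}")
            for name in names.split("/"):  # alternative names share one tag
                tagset["by_name"][name.lower()] = (int(number), datatype)
        else:
            raise ValueError(f"unknown directive: {directive}")
    return tagset


tagset_g = parse_tagset("""
name tagsetg
reference TagsetG
type 2

tag 7   name        string
tag 8   date        generalizedtime
tag 10  organization    string
""")
number, datatype = tagset_g["by_name"]["organization"]
path_step = (tagset_g["type"], number)
```

With the excerpt above, <tt/path_step/ is the pair (2,10), which is the form used in the paths of the abstract syntax file.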
410
411<sect1>The Variant Set (.var) Files
412
413<p>
414The variant set file is a straightforward representation of the
415variant set definitions associated with the protocol. At present, only
416the <it/Variant-1/ set is known.
417
418These are the directives allowed in the file.
419
420<descrip>
421<tag>name <it/symbolic-name/</tag> This provides a shorthand name or
422description for the variant set. Mostly useful for diagnostic purposes.
423
424<tag>reference <it/OID-name/</tag> The reference name of the OID for
425the variant set, if one is required. The reference names can be found
426in the <bf/util/ module of <bf/YAZ/.
427
428<tag>class <it/integer class-name/</tag> (repeatable) Introduces a new
429class to the variant set.
430
431<tag>type <it/integer type-name datatype/</tag> (repeatable) Adds a
432new type to the current class (the one introduced by the most recent
433<bf/class/ directive). The type names belong to the same name space as
434the one used in the tag set definition file.
435</descrip>
436
437The following is an excerpt from the file describing the variant set
438<it/Variant-1/.
439
440<tscreen><verb>
441name variant-1
442reference Variant-1
443
444class 1 variantId
445
446  type  1   variantId       octetstring
447
448class 2 body
449
450  type  1   iana            string
451  type  2   z39.50          string
452  type  3   other           string
453</verb></tscreen>
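
The nesting rule (each <bf/type/ belongs to the most recent <bf/class/) can be sketched as below. This is an illustrative reader only, with a hypothetical function name and result layout; it is not the code used by YAZ.

```python
def parse_varset(text):
    """Return {class_number: {"name": ..., "types": {number: (name, datatype)}}}."""
    classes = {}
    current = None  # the class introduced by the most recent "class" line
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        directive = fields[0]
        if directive in ("name", "reference"):
            continue  # identification only; not needed for this sketch
        if directive == "class":
            current = {"name": fields[2], "types": {}}
            classes[int(fields[1])] = current
        elif directive == "type":
            if current is None:
                raise ValueError("type directive before any class")
            current["types"][int(fields[1])] = (fields[2], fields[3])
        else:
            raise ValueError(f"unknown directive: {directive}")
    return classes


variant1 = parse_varset("""
name variant-1
reference Variant-1

class 1 variantId
  type  1   variantId       octetstring

class 2 body
  type  1   iana            string
  type  2   z39.50          string
""")
```

Looking up class 2, type 2 in the result yields the name <tt/z39.50/ with datatype <tt/string/.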
454
455<sect1>The Element Set (.est) Files
456
457<p>
458The element set specification files describe a selection of a subset
459of the elements of a database record. The element selection mechanism
460is equivalent to the one supplied by the <it/Espec-1/ syntax of the
461Z39.50 specification. In fact, the internal representation of an
462element set specification is identical to the <it/Espec-1/ structure,
463and we'll refer you to the description of that structure for most of
464the detailed semantics of the directives below.
465
466<it>
467NOTE: Not all of the Espec-1 functionality has been implemented yet.
468The fields that are mentioned below all work as expected, unless
469otherwise noted.
470</it>
471
472The directives available in the element set file are as follows:
473
474<descrip>
475<tag>defaultVariantSetId <it/OID-name/</tag> If variants are used in
476the following, this should provide the name of the variant set used
477(it's not currently possible to specify a different set in the
478individual variant request). In almost all cases (certainly all
479profiles known to us), the name <tt/Variant-1/ should be given here.
480
481<tag>defaultVariantRequest <it/variant-request/</tag> This directive
482provides a default variant request for
483use when the individual element requests (see below) do not contain a
484variant request. Variant requests consist of a blank-separated list of
485variant components. A variant component is a comma-separated,
486parenthesized triple of variant class, type, and value (the two former
487values being represented as integers). The value can currently only be
488entered as a string (this will change to depend on the definition of
489the variant in question). The special value (@) is interpreted as a
490null value, however.
491
492<tag>simpleElement <it/path &lsqb;'variant' variant-request&rsqb;/</tag>
493This corresponds to a simple element request in <it/Espec-1/. The
494path consists of a sequence of tag-selectors, where each of these can
495consist of either:
496
497<itemize>
498<item>A simple tag, consisting of a comma-separated type-value pair in
499parentheses, possibly followed by a colon (:) followed by an
500occurrences-specification (see below). The tag-value can be a number
501or a string. If the first character is an apostrophe ('), this forces
502the value to be interpreted as a string, even if it appears to be numerical.
503
504<item>A WildThing, represented as a question mark (?), possibly
505followed by a colon (:) followed by an occurrences specification (see
506below).
507
508<item>A WildPath, represented as an asterisk (*). Note that the last
509element of the path should not be a WildPath (WildPaths don't work in
510this version).
511</itemize>
512
513The occurrences-specification can be either the string <tt/all/, the
514string <tt/last/, or an explicit value-range. The value-range is
515represented as an integer (the starting point), possibly followed by a
516plus (+) and a second integer (the number of elements, default being
517one).
518
519The variant-request has the same syntax as the defaultVariantRequest
520above. Note that it may sometimes be useful to give an empty variant
521request, simply to disable the default for a specific set of fields
522(we aren't certain if this is proper <it/Espec-1/, but it works in
523this implementation).
524</descrip>
525
526The following is an example of an element specification belonging to
527the GILS profile.
528
529<tscreen><verb>
530simpleelement (1,10)
531simpleelement (1,12)
532simpleelement (2,1)
533simpleelement (1,14)
534simpleelement (4,1)
535simpleelement (4,52)
536</verb></tscreen>
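
The occurrences-specification and variant-request syntax described above can be made concrete with a short sketch. The code is illustrative and not part of YAZ: the function names are hypothetical, and variant values are assumed to contain neither blanks nor parentheses.

```python
import re

# One variant component: "(class,type,value)".
TRIPLE = re.compile(r"\((\d+),(\d+),([^()]*)\)")


def parse_occurrences(spec):
    """Return "all", "last", or a (start, count) pair; count defaults to 1."""
    if spec in ("all", "last"):
        return spec
    start, plus, count = spec.partition("+")
    return int(start), int(count) if plus else 1


def parse_variant_request(text):
    """Return a list of (class, type, value); '@' becomes a null value."""
    components = []
    for token in text.split():
        match = TRIPLE.fullmatch(token)
        if match is None:
            raise ValueError(f"malformed variant component: {token!r}")
        var_class, var_type, value = match.groups()
        components.append(
            (int(var_class), int(var_type), None if value == "@" else value)
        )
    return components


window = parse_occurrences("3+2")
request = parse_variant_request("(1,1,@) (2,1,text)")
```

An empty string yields an empty request, which corresponds to the case mentioned above of disabling the default variant request for particular fields.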
537
538<sect1>The Schema Mapping (.map) Files
539
540<p>
541Sometimes, the client might want to receive a database record in
542a schema that differs from the native schema of the record. For
543instance, a client might only know how to process WAIS records, while
544the database record is represented in a more specific schema, such as
545GILS. In this module, a mapping of data to one of the MARC formats is
546also thought of as a schema mapping (mapping the elements of the
547record into fields consistent with the given MARC specification, prior
548to actually converting the data to the ISO2709 format). This use of the
549object identifier for USMARC as a schema identifier represents an
550overloading of the OID which might not be entirely proper. However,
551it represents the dual role of schema and record syntax which
552is assumed by the MARC family in Z39.50.
553
554<it>
555NOTE: The schema-mapping functions are so far limited to a
556straightforward mapping of elements. This should be extended with
557mechanisms for conversions of the element contents, and conditional
558mappings of elements based on the record contents.
559</it>
560
561These are the directives of the schema mapping file format:
562
563<descrip>
564<tag>targetName <it/name/</tag> A symbolic name for the target schema
565of the table. Useful mostly for diagnostic purposes.
566
567<tag>targetRef <it/OID-name/</tag> An OID name for the target schema.
568This is used, for instance, by a server receiving a request to present
569a record in a different schema from the native one. The name, again,
570is found in the <bf/oid/ module of <bf/YAZ/.
571
572<tag>map <it/element-name target-path/</tag> (repeatable) Adds
573an element mapping rule to the table.
574</descrip>
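
A reader for the three directives above might look like the sketch below. It is illustrative only: the function name is hypothetical, the sample table is invented for the example (it is not a real mapping file), and because the syntax of <it/target-path/ is not spelled out here, the path is kept as an opaque string.

```python
def parse_maptab(text):
    """Collect targetName, targetRef, and the list of map rules."""
    table = {"rules": []}
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        directive = fields[0]
        if directive == "targetName":
            table["target_name"] = fields[1]
        elif directive == "targetRef":
            table["target_ref"] = fields[1]
        elif directive == "map":
            # (element-name, target-path); the path is not interpreted here.
            table["rules"].append((fields[1], fields[2]))
        else:
            raise ValueError(f"unknown directive: {directive}")
    return table


# Hypothetical table, for illustration only.
sample = parse_maptab("""
targetName example
targetRef Example-schema
map Title (2,1)
map abstract (2,6)
""")
```

Keeping the rules in file order leaves room for the extensions mentioned in the note above, such as conditional mappings, without changing the basic layout.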
575
576<sect1>The MARC (ISO2709) Representation (.mar) Files
577
578<p>
579This file provides rules for representing a record in the ISO2709
580format. The rules pertain mostly to the values of the constant-length
581header of the record.
582
583<it>NOTE: This will be described better.</it>
584
585<sect>The Input (Data) File Format
586
587<p>
588The retrieval module is designed to manage data derived from a
589variety of different input sources. When used on the client side, the
590source format may be GRS-1 or ISO2709. On the server side, the source may
591be a structured ASCII file, augmented by a set of patterns that
592describe the structure of the document.
593
594What we think of as the native source format - the one that is
595guaranteed to provide complete access to the facilities of the module -
596is an &dquot;SGML-like&dquot; syntax, based on an inferred DTD, which
597is in turn based on the profile information from the various files
598mentioned in this document.
599
600Like SGML, an input record consists of tags and data. The tags are
601enclosed by brackets (&lt;...&gt;). As a general rule, each tag should
602be matched by a corresponding close tag, identified by the same tag
603name preceded by a slash (/).
604
605<sect>License
606
607<p>
608Copyright &copy; 1995-2000, Index Data.
609
610This is the Index Data &dquot;P&dquot; license - it applies exclusively to
611the record management module of the YAZ system, and to this
612document.
613
614Permission to use, copy, modify, distribute, and sell this software and
615its documentation, in whole or in part, for any purpose, is hereby granted,
616provided that:
617
6181. This copyright and permission notice appear in all copies of the
619software and its documentation. Notices of copyright or attribution
620which appear at the beginning of any file must remain unchanged.
621
6222. The names of Index Data or the individual authors may not be used to
623endorse or promote products derived from this software without specific
624prior written permission.
625
626THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT WARRANTY OF ANY KIND,
627EXPRESS, IMPLIED, OR OTHERWISE, INCLUDING WITHOUT LIMITATION, ANY
628WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
629IN NO EVENT SHALL INDEX DATA BE LIABLE FOR ANY SPECIAL, INCIDENTAL,
630INDIRECT OR CONSEQUENTIAL DAMAGES OF ANY KIND, OR ANY DAMAGES
631WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER OR
632NOT ADVISED OF THE POSSIBILITY OF DAMAGE, AND ON ANY THEORY OF
633LIABILITY, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE
634OF THIS SOFTWARE.
635
636<sect>About Index Data
637
638<p>
639Index Data is a consulting and software-development enterprise that
640specialises in library and information management systems. Our
641interests and expertise span a broad range of related fields, and one
642of our primary, long-term objectives is the development of a powerful
643information management
644system with open network interfaces and hypermedia capabilities.
645
646We make this software available free of charge, under a fairly unrestrictive
647license, as a service to the networking community and to further the
648development of quality software for open network communication.
649
650We'll be happy to answer questions about the software, and about ourselves
651in general.
652
653<tscreen>
654Index Data
655Ryesgade 3
656DK-2200 Copenhagen N
657</tscreen>
658
659<p>
660<tscreen><verb>
661Phone: +45 3536 3672
662Fax  : +45 3536 0449
663Email: info@indexdata.dk
664</verb></tscreen>
665
666</article>