source: trunk/gsdl/packages/yaz/doc/profiles.txt@ 1343

Last change on this file since 1343 was 1343, checked in by johnmcp, 24 years ago

Added the YAZ toolkit source to the packages directory (for z39.50 stuff)

1 Specifying and Using Application (Database) Profiles
2 Index Data, [email protected]
3 $Revision: 1343 $
4
5 YAZ includes a subsystem to manage complex database records, driven by
6 a set of configuration tables that reflect a given profile. Multiple
7 database profiles can coexist in the same server, or even the same
8 database. The record management system is responsible for associating
9 a given record with a specific profile, and processing it accordingly.
10 This document describes the various file formats for data and configu-
11 ration files which are used by the module.
12 ______________________________________________________________________
13
14 Table of Contents
15
16
17 1. Warnings
18
19 2. Introduction
20
21 3. Overview
22
23 3.1 External Data (record) Representation
24 3.2 The Abstract Syntax
25
26 4. The Configuration Files
27
28 4.1 The Abstract Syntax (.abs) Files
29 4.2 The Attribute Set (.att) Files
30 4.3 The Tag Set (.tag) Files
31 4.4 The Variant Set (.var) Files
32 4.5 The Element Set (.est) Files
33 4.6 The Schema Mapping (.map) Files
34 4.7 The MARC (ISO2709) Representation (.mar) Files
35
36 5. The Input (Data) File Format
37
38 6. License
39
40 7. About Index Data
41
42
43
44 ______________________________________________________________________
45
46 1. Warnings
47
48
49 o The subsystem described herein is under development. Not everything
50 may work exactly as described, and details of the interface may
51 change as the module matures.
52
53 o The exact workings of the subsystem may depend on the application
54 in which it is used. This document focuses on the use of the module
55 in the Zebra information server which is distributed by Index Data
56 as an independent package.
57
58
59 2. Introduction
60
61 The retrieval facilities of Z39.50 are extremely flexible and
62 powerful. They allow any level of structuring of database records.
63 They allow controlled re-use of attribute sets (for searching) and tag
64 sets (for retrieval) between application profiles; they allow precise
65 selection of the desired sub-elements of a database record; they allow
66 different variants of a given data element to be represented and
67 selected in a structured way; and finally they allow the exchange of
68 any type and amount of data to be represented in a single database
69 record.
70
71 These powerful retrieval facilities are a recent addition to the
72 protocol, and along with the flexible searching facilities, they make
73 the protocol an extremely capable tool for precise, structured access
74 to information systems. The retrieval facilities add new levels of
75 flexibility and control to the protocol, which add to its value
76 outside of its traditional domain of the library systems world.
77
78 The new facilities, however, also add new complexity to the protocol,
79 which is already troubled by a too-steep learning curve. We have seen
80 many good projects severely hindered or even thwarted by the sheer
81 complexity of implementing the Z39.50 protocol.
82
83 At the same time, we feel that the most complex and powerful
84 facilities of the protocol (Explain, structured retrieval, etc.) are
85 also what the protocol needs to become more widespread, and to fulfill
86 what we perceive to be its most noble potential: To provide everybody
87 with standardised, well-structured access to the information resources
88 of the world.
89
90 The purpose of YAZ, then, and of this module as well, is to simplify
91 the use of the protocol for programmers and administrators, by
92 providing simple APIs and configuration systems to access the
93 functionality of the protocol. The Retrieval module deals specifically
94 with the advanced retrieval functions which were added to the protocol
95 with version 3, or Z39.50-1994.
96
97
98 3. Overview
99
100 3.1. External Data (record) Representation
101
102 The Retrieval module will eventually support a wide range of input
103 formats, ranging from MARC data to USENET news archives. This section
104 introduces what we think of as the canonical format - the one that
105 gives the most general access to the various elements of the retrieval
106 functionality.
107
108 The basic model presented by the Z39.50 retrieval system is that of a
109 recursively defined tree structure, containing a list of tagged
110 elements, which may in turn contain either data or more lists of
111 tagged elements, and so forth.
112
113 We elect to represent this structuring externally by using an "SGML-
114 like" syntax. The internal representation will be discussed later.
115
116 Consider a record describing an information resource (such a record is
117 sometimes known as a locator record). It might contain a field
118 describing the distributor of the information resource, which might in
119 turn be partitioned into various fields providing details about the
120 distributor, like this:
121
122
133 <Distributor>
134 <Name> USGS/WRD </Name>
135 <Organization> USGS/WRD </Organization>
136 <Street-Address>
137 U.S. GEOLOGICAL SURVEY, 505 MARQUETTE, NW
138 </Street-Address>
139 <City> ALBUQUERQUE </City>
140 <State> NM </State>
141 <Zip-Code> 87102 </Zip-Code>
142 <Country> USA </Country>
143 <Telephone> (505) 766-5560 </Telephone>
144 </Distributor>
145
146
147
148
149 This is how data that the retrieval module reads from an input file
150 might look.
151
152 Depending on the database profile that is being used, it is likely
153 that the data won't look like this when it's transmitted from the
154 server to the client, however. Typically, the client will prefer to
155 receive the data in a more rigid syntax, such as USMARC or GRS-1. To
156 save transmission time and avoid ambiguities of language, the
157 individual tags or field names, above, might be translated into
158 numbers which are known by both the client and the server (by
159 referring to a tag set).
160
161 The retrieval module supports various types of conversions that might
162 be carried out by the server based on requests from the client. To do
163 this, it needs a set of configuration files to describe the
164 application profile that the given record adheres to.
165
166 CAUTION: Because the tables described below serve the dual purpose of
167 representing an external application profile and an internal database
168 profile, the terminology and structuring used will sometimes be
169 somewhat different from those suggested in Z39.50-1995.
170
171
172 3.2. The Abstract Syntax
173
174 The abstract syntax definition (also known as an Abstract Record
175 Structure, or ARS) is the focal point of the application profile
176 description. For a given profile, it may state any or all of the following:
177
178
179 o The object identifier of the database schema associated with the
180 profile, so that it can be referred to by the client.
181
182 o The attribute set (which can possibly be a compound of multiple
183 sets) which applies in the profile. This is used when indexing and
184 searching the records belonging to the given profile.
185
186 o The tag set (again, this can consist of several different sets).
187 This is used when reading the records from a file, to recognize the
188 different tags, and when transmitting the record to the client -
189 mapping the tags to their numerical representation, if they are
190 known.
191
192 o The variant set which is used in the profile. This provides a
193 vocabulary for specifying the forms of data that appear inside the
194 records.
195
196 o Element set names, which are a shorthand way for the client to ask
197 for a subset of the data elements contained in a record. Element
198 set names, in the retrieval module, are mapped to element
199 specifications, which contain information equivalent to the Espec-1
200 syntax of Z39.50.
201
202 o Map tables, which may specify mappings to other database profiles,
203 if desired.
204
205 o Possibly, a set of rules describing the mapping of elements to a
206 MARC representation.
207
208 o A list of element descriptions (this is the actual ARS of the
209 profile), which lists the ways in which the various tags can be
210 used and organized hierarchically.
211
212 Several of the entries above simply refer to other files, which
213 describe the given objects.
214
215
216 4. The Configuration Files
217
218 This section describes the syntax and use of the various tables which
219 are used by the retrieval module.
220
221 The number of different file types may appear daunting at first, but
222 each type corresponds fairly clearly to a single aspect of the Z39.50
223 retrieval facilities. Further, the average database administrator
224 who is simply reusing an existing profile for which tables already
225 exist shouldn't have to worry too much about these tables.
226
227
228 4.1. The Abstract Syntax (.abs) Files
229
230 The name of this file type is slightly misleading, since, apart from
231 the actual abstract syntax of the profile, it also includes most of
232 the other definitions that go into a database profile.
233
234 When a record in the canonical, SGML-like format is read from a file
235 or from the database, the first tag of the file should reference the
236 profile that governs the layout of the record. If the first tag of the
237 record is <gils>, the system will look for the profile definition in
238 the file gils.abs. Profile definitions are cached, so they only have
239 to be read once during the lifespan of the current process.
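The lookup described above can be sketched in a few lines of Python. This is an illustration only, not the module's actual code: the function names, the dictionary cache, and the decision to lower-case the tag name are all assumptions made for the example, and the caller supplies the function that reads the file.

```python
import re

# Hypothetical per-process cache: profile name -> profile text.
_profile_cache = {}


def profile_name(record_text):
    """Return the name of the first tag in a record (lower-cased by assumption)."""
    match = re.search(r"<\s*([A-Za-z][\w.-]*)", record_text)
    if match is None:
        raise ValueError("record contains no opening tag")
    return match.group(1).lower()


def get_profile(record_text, read_file):
    """Fetch the profile for a record, reading NAME.abs at most once."""
    name = profile_name(record_text)
    if name not in _profile_cache:
        _profile_cache[name] = read_file(name + ".abs")
    return _profile_cache[name]
```

A record that starts with <gils> therefore leads to a single read of gils.abs, no matter how many such records are processed afterwards.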
240
241 The file may contain the following directives:
242
243
244 name symbolic-name
245 This provides a shorthand name or description for the profile.
246 Mostly useful for diagnostic purposes.
247
248
249 reference OID-name
250 The reference name of the OID for the profile. The reference
251 names can be found in the util module of YAZ.
252
253
254 attset filename
255 The attribute set that is used for indexing and searching
256 records belonging to this profile.
257
258
259 tagset filename
260 The tag set (if any) that describes the fields of the records.
261
262
263 varset filename
264 The variant set used in the profile.
265 maptab filename
266 (repeatable) This points to a conversion table that might be
267 used if the client asks for the record in a different schema
268 from the native one.
269
270
271 marc filename
272 Points to a file containing parameters for representing the
273 record contents in the ISO2709 syntax. Read the description of
274 the MARC representation facility below.
275
276
277 esetname name filename
278 (repeatable) Associates the given element set name with an
279 element selection file. If an at sign (@) is given in place of the
280 filename, this corresponds to a null mapping for the given
281 element set name.
282
283
284 elm path name attribute
285 (repeatable) Adds an element to the abstract record syntax of
286 the schema. The path follows the syntax which is suggested by
287 the Z39.50 document - that is, a sequence of tags separated by
288 slashes (/). Each tag is given as a comma-separated pair of tag
289 type and tag value surrounded by parentheses. The name is the name
290 of the element, and the attribute specifies what attribute to
291 use when indexing the element. A ! in place of the attribute
292 name is equivalent to specifying an attribute name identical to
293 the element name. A - in place of the attribute name specifies
294 that no indexing is to take place for the given element.
295
296 NOTE: The mechanism for controlling indexing is inadequate for complex
297 databases, and will probably be moved into a separate configuration
298 table eventually.
299
300 The following is an excerpt from the abstract syntax file for the GILS
301 profile.
302
303
331 name gils
332 reference GILS-schema
333 attset gils.att
334 tagset gils.tag
335 varset var1.var
336
337 maptab gils-usmarc.map
338
339 # Element set names
340
341 esetname VARIANT gils-variant.est # for WAIS-compliance
342 esetname B gils-b.est
343 esetname G gils-g.est
344 esetname W gils-b.est
345 esetname F @
346
347 elm (1,10) rank -
348 elm (1,12) url -
349 elm (1,14) localControlNumber Local-number
350 elm (1,16) dateOfLastModification Date/time-last-modified
351 elm (2,1) Title !
352 elm (4,1) controlIdentifier Identifier-standard
353 elm (2,6) abstract Abstract
354 elm (4,51) purpose !
355 elm (4,52) originator -
356 elm (4,53) accessConstraints !
357 elm (4,54) useConstraints !
358 elm (4,70) availability -
359 elm (4,70)/(4,90) distributor -
360 elm (4,70)/(4,90)/(2,7) distributorName !
361 elm (4,70)/(4,90)/(2,10) distributorOrganization !
362 elm (4,70)/(4,90)/(4,2) distributorStreetAddress !
363 elm (4,70)/(4,90)/(4,3) distributorCity !
364
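To make the elm syntax concrete, here is a minimal Python sketch of how one such line could be read. It is not the parser used by the retrieval module: the function names are invented for the example, and treating everything after a # as a comment is an assumption based on the excerpt above.

```python
import re

# One path step: "(type,value)" with a numeric tag type.
TAG = re.compile(r"\((\d+),([^(),]+)\)")


def parse_path(path):
    """Split "(4,70)/(4,90)" into [(4, 70), (4, 90)]."""
    steps = []
    for step in path.split("/"):
        match = TAG.fullmatch(step)
        if match is None:
            raise ValueError(f"malformed tag {step!r} in path {path!r}")
        value = match.group(2)
        steps.append((int(match.group(1)), int(value) if value.isdigit() else value))
    return steps


def parse_elm(line):
    """Return (path, element name, attribute name or None) for one elm line."""
    fields = line.split("#", 1)[0].split()
    if len(fields) != 4 or fields[0] != "elm":
        raise ValueError(f"not an elm directive: {line!r}")
    _, path, name, attribute = fields
    if attribute == "!":      # index under an attribute named like the element
        attribute = name
    elif attribute == "-":    # do not index this element
        attribute = None
    return parse_path(path), name, attribute
```

Because each step has to match the full "(type,value)" pattern, a missing parenthesis in a path is reported as an error instead of being silently accepted.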
365
366
367
368
369 4.2. The Attribute Set (.att) Files
370
371 This file type describes the Use elements of an attribute set. It
372 contains the following directives.
373
374
375
376 name symbolic-name
377 This provides a shorthand name or description for the attribute
378 set. Mostly useful for diagnostic purposes.
379
380
381 reference OID-name
382 The reference name of the OID for the attribute set. The
383 reference names can be found in the util module of YAZ.
384
385
386 ordinal integer
387 This value will be used to represent the attribute set in the
388 index. Care should be taken that each attribute set has a unique
389 ordinal value.
390
391
392 include filename
393 This directive, which can be repeated, is used to include
394 another attribute set as a part of the current one. This is used
395 when a new attribute set is defined as an extension to another
396 set. For instance, many new attribute sets are defined as
397 extensions to the bib-1 set. This is an important feature of the
398 retrieval system of Z39.50, as it ensures the highest possible
399 level of interoperability, as those access points of your
400 database which are derived from the external set (say, bib-1)
401 can be used even by clients who are unaware of the new set.
402
403
404 att att-value att-name [local-value]
405 This repeatable directive introduces a new attribute to the set.
406 The attribute value is stored in the index (unless a local-value
407 is given, in which case this is stored). The name is used to
408 refer to the attribute from the abstract syntax.
409
410 This is an excerpt from the GILS attribute set definition. Notice how
411 the file describing the bib-1 attribute set is referenced.
412
413
414
415 name gils
416 reference GILS-attset
417 include bib1.att
418 ordinal 2
419
420 att 2001 distributorName
421 att 2002 indexTermsControlled
422 att 2003 purpose
423 att 2004 accessConstraints
424 att 2005 useConstraints
425
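The way include builds one attribute set on top of another can be sketched as follows. This is a Python illustration under stated assumptions, not the module's implementation: files are held in a dictionary instead of being read from disk, the bib1.att text is a two-line hypothetical stub, attribute values are treated as integers, and a locally defined attribute is assumed to take precedence over an included one with the same name.

```python
# In-memory stand-in for the files on disk. "gils.att" is copied from the
# excerpt above; "bib1.att" is a hypothetical stub, not the real file.
FILES = {
    "gils.att": """\
name gils
reference GILS-attset
include bib1.att
ordinal 2

att 2001 distributorName
att 2002 indexTermsControlled
att 2003 purpose
att 2004 accessConstraints
att 2005 useConstraints
""",
    "bib1.att": """\
name bib1
reference Bib-1
ordinal 1

att 4 Title
att 62 Abstract
""",
}


def load_attset(filename, files, chain=()):
    """Parse one .att file, merging in any attribute sets it includes."""
    if filename in chain:
        raise ValueError(f"circular include of {filename}")
    attset = {"name": None, "reference": None, "ordinal": None, "attributes": {}}
    for raw in files[filename].splitlines():
        fields = raw.split("#", 1)[0].split()
        if not fields:
            continue
        directive, args = fields[0], fields[1:]
        if directive in ("name", "reference"):
            attset[directive] = args[0]
        elif directive == "ordinal":
            attset["ordinal"] = int(args[0])
        elif directive == "include":
            included = load_attset(args[0], files, chain + (filename,))
            for att_name, entry in included["attributes"].items():
                # Assumption: never overwrite an attribute defined locally.
                attset["attributes"].setdefault(att_name, entry)
        elif directive == "att":
            value = int(args[0])
            # The optional local-value replaces the value stored in the index.
            stored = int(args[2]) if len(args) > 2 else value
            attset["attributes"][args[1]] = {"value": value, "stored": stored}
        else:
            raise ValueError(f"unknown directive {directive!r}")
    return attset
```

After loading gils.att, the result holds the five GILS attributes together with the two attributes from the stub, which is what lets a client that only knows the included set still search on those access points.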
426
427
428
429
430 4.3. The Tag Set (.tag) Files
431
432 This file type defines the tagset of the profile, possibly by
433 referencing other tag sets (most tag sets, for instance, will include
434 tagsetG and tagsetM from the Z39.50 specification). The file may
435 contain the following directives.
436
437
438 name symbolic-name
439 This provides a shorthand name or description for the tag set.
440 Mostly useful for diagnostic purposes.
441
442
443 reference OID-name
444 The reference name of the OID for the tag set. The reference
445 names can be found in the util module of YAZ.
446
447
448 type integer
449 The type number of the tag within the schema profile.
450
451
452 include filename
453 (repeatable) This directive is used to include the definitions
454 of other tag sets into the current one.
455
456
457 tag number names type
458 (repeatable) Introduces a new tag to the set. The number is the
459 tag number as used in the protocol (there is currently no
460 mechanism for specifying string tags, but this would be quick
461 work to add). The names parameter is a list of
462 names by which the tag should be recognized in the input file
463 format. The names should be separated by slashes (/). The type
464 is the recommended datatype of the tag. It should be one of the
465 following:
466
467 o structured
468
469 o string
470
471 o numeric
472
473 o bool
474
475 o oid
476
477 o generalizedtime
478
479 o intunit
480
481 o int
482
483 o octetstring
484
485 o null
486
487 The following is an excerpt from the TagsetG definition file.
488
489
490
491 name tagsetg
492 reference TagsetG
493 type 2
494
495 tag 1 title string
496 tag 2 author string
497 tag 3 publicationPlace string
498 tag 4 publicationDate string
499 tag 5 documentId string
500 tag 6 abstract string
501 tag 7 name string
502 tag 8 date generalizedtime
503 tag 9 bodyOfDisplay string
504 tag 10 organization string
505
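As an illustration of how a tag set turns field names into the numeric (type, value) pairs used in element paths, here is a short Python sketch. It is not the module's code: the function names are made up, include directives are only recorded (not resolved), and matching names without regard to case is an assumption.

```python
DATATYPES = {
    "structured", "string", "numeric", "bool", "oid",
    "generalizedtime", "intunit", "int", "octetstring", "null",
}

# A shortened copy of the TagsetG excerpt above.
TAGSETG = """\
name tagsetg
reference TagsetG
type 2

tag 1 title string
tag 7 name string
tag 8 date generalizedtime
tag 10 organization string
"""


def parse_tagset(text):
    """Parse the directives of a .tag file into a dictionary."""
    tagset = {"name": None, "reference": None, "type": None,
              "includes": [], "tags": {}}
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        directive = fields[0]
        if directive in ("name", "reference"):
            tagset[directive] = fields[1]
        elif directive == "type":
            tagset["type"] = int(fields[1])
        elif directive == "include":
            tagset["includes"].append(fields[1])  # not resolved in this sketch
        elif directive == "tag":
            number, names, datatype = int(fields[1]), fields[2], fields[3]
            if datatype not in DATATYPES:
                raise ValueError(f"unknown datatype {datatype!r}")
            for alias in names.split("/"):        # names are slash-separated
                tagset["tags"][alias.lower()] = (number, datatype)
        else:
            raise ValueError(f"unknown directive {directive!r}")
    return tagset


def tag_path_step(tagset, field_name):
    """Map a field name from the input file to its (tag type, tag value) pair."""
    number, _datatype = tagset["tags"][field_name.lower()]
    return (tagset["type"], number)
```

With this table, the field name organization resolves to (2, 10), the same pair that appears in element paths written against TagsetG.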
506
507
508
509
510 4.4. The Variant Set (.var) Files
511
512 The variant set file is a straightforward representation of the
513 variant set definitions associated with the protocol. At present, only
514 the Variant-1 set is known.
515
516 These are the directives allowed in the file.
517
518
519 name symbolic-name
520 This provides a shorthand name or description for the variant
521 set. Mostly useful for diagnostic purposes.
522
523
524 reference OID-name
525 The reference name of the OID for the variant set, if one is
526 required. The reference names can be found in the util module of
527 YAZ.
528
529 class integer class-name
530 (repeatable) Introduces a new class to the variant set.
531
532
533 type integer type-name datatype
534 (repeatable) Adds a new type to the current class (the one
535 introduced by the most recent class directive). The type names
536 belong to the same name space as the one used in the tag set
537 definition file.
538
539 The following is an excerpt from the file describing the variant set
540 Variant-1.
541
542
543
544 name variant-1
545 reference Variant-1
546
547 class 1 variantId
548
549 type 1 variantId octetstring
550
551 class 2 body
552
553 type 1 iana string
554 type 2 z39.50 string
555 type 3 other string
556
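The rule that each type belongs to the most recent class can be sketched like this in Python. The function name and the shape of the returned dictionary are assumptions chosen for the example, not part of YAZ.

```python
# The Variant-1 excerpt above, reproduced as a string.
VARIANT1 = """\
name variant-1
reference Variant-1

class 1 variantId

type 1 variantId octetstring

class 2 body

type 1 iana string
type 2 z39.50 string
type 3 other string
"""


def parse_varset(text):
    """Parse a .var file; every type is attached to the most recent class."""
    varset = {"name": None, "reference": None, "classes": {}}
    current = None
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        directive = fields[0]
        if directive in ("name", "reference"):
            varset[directive] = fields[1]
        elif directive == "class":
            current = {"name": fields[2], "types": {}}
            varset["classes"][int(fields[1])] = current
        elif directive == "type":
            if current is None:
                raise ValueError("type directive before any class directive")
            current["types"][int(fields[1])] = (fields[2], fields[3])
        else:
            raise ValueError(f"unknown directive {directive!r}")
    return varset
```

Parsing the excerpt yields two classes, where class 2 (body) carries the three types iana, z39.50 and other.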
557
558
559
560
561 4.5. The Element Set (.est) Files
562
563 The element set specification files describe a selection of a subset
564 of the elements of a database record. The element selection mechanism
565 is equivalent to the one supplied by the Espec-1 syntax of the Z39.50
566 specification. In fact, the internal representation of an element set
567 specification is identical to the Espec-1 structure, and we'll refer
568 you to the description of that structure for most of the detailed
569 semantics of the directives below.
570
571 NOTE: Not all of the Espec-1 functionality has been implemented yet.
572 The fields that are mentioned below all work as expected, unless
573 otherwise noted.
574
575 The directives available in the element set file are as follows:
576
577
578 defaultVariantSetId OID-name
579 If variants are used in the following, this should provide the
580 name of the variant set used (it's not currently possible to
581 specify a different set in the individual variant request). In
582 almost all cases (certainly all profiles known to us), the name
583 Variant-1 should be given here.
584
585
586 defaultVariantRequest variant-request
587 This directive provides a default variant request for use when
588 the individual element requests (see below) do not contain a
589 variant request. Variant requests consist of a blank-separated
590 list of variant components. A variant component is a
591 comma-separated, parenthesized triple of variant class, type, and
592 value (the two former values being represented as integers). The
593 value can currently only be entered as a string (this will
594 change to depend on the definition of the variant in question).
595 The special value (@) is interpreted as a null value, however.
596
597
598 simpleElement path ['variant' variant-request]
599 This corresponds to a simple element request in Espec-1. The
600 path consists of a sequence of tag-selectors, where each of
601 these can consist of either:
602
603
604 o A simple tag, consisting of a comma-separated type-value pair in
605 parentheses, possibly followed by a colon (:) followed by an
606 occurrences-specification (see below). The tag-value can be a
607 number or a string. If the first character is an apostrophe ('),
608 this forces the value to be interpreted as a string, even if it
609 appears to be numerical.
610
611 o A WildThing, represented as a question mark (?), possibly
612 followed by a colon (:) followed by an occurrences specification
613 (see below).
614
615 o A WildPath, represented as an asterisk (*). Note that the last
616 element of the path should not be a WildPath (WildPaths don't
617 work in this version).
618
619 The occurrences-specification can be either the string all, the
620 string last, or an explicit value-range. The value-range is
621 represented as an integer (the starting point), possibly
622 followed by a plus (+) and a second integer (the number of
623 elements, default being one).
624
625 The variant-request has the same syntax as the
626 defaultVariantRequest above. Note that it may sometimes be
627 useful to give an empty variant request, simply to disable the
628 default for a specific set of fields (we aren't certain if this
629 is proper Espec-1, but it works in this implementation).
630
631 The following is an example of an element specification belonging to
632 the GILS profile.
633
634
635
636 simpleelement (1,10)
637 simpleelement (1,12)
638 simpleelement (2,1)
639 simpleelement (1,14)
640 simpleelement (4,1)
641 simpleelement (4,52)
642
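The simpleElement grammar described above is compact enough to sketch as a small Python parser. This is an illustration, not the module's parser. Several points are assumptions: the tag-selectors of a path are taken to be separated by slashes (as in the elm paths of the .abs file), the directive name is matched without regard to case (the syntax line and the example spell it differently), variant values may not contain blanks, and the tuple shapes returned are invented for the example.

```python
import re

# One tag-selector: "(type,value)" or "?", optionally followed by ":occurrences".
SELECTOR = re.compile(
    r"(?:\((\d+),([^()]+)\)|(\?))(?::(all|last|\d+(?:\+\d+)?))?"
)
# One variant component: "(class,type,value)".
COMPONENT = re.compile(r"\((\d+),(\d+),([^()]*)\)")


def parse_occurrences(spec):
    """Return None, "all", "last", or a (start, count) value range."""
    if spec is None or spec in ("all", "last"):
        return spec
    start, plus, count = spec.partition("+")
    return (int(start), int(count) if plus else 1)  # count defaults to one


def parse_selector(token):
    """Parse one tag-selector of a path."""
    if token == "*":
        return ("wildpath",)
    match = SELECTOR.fullmatch(token)
    if match is None:
        raise ValueError(f"malformed tag-selector {token!r}")
    tag_type, value, wild, occ = match.groups()
    occurrences = parse_occurrences(occ)
    if wild:
        return ("wildthing", occurrences)
    if value.startswith("'"):   # leading apostrophe forces a string value
        value = value[1:]
    elif value.isdigit():
        value = int(value)
    return ("tag", int(tag_type), value, occurrences)


def parse_variant_request(tokens):
    """Parse blank-separated (class,type,value) triples; "@" is a null value."""
    request = []
    for token in tokens:
        match = COMPONENT.fullmatch(token)
        if match is None:
            raise ValueError(f"malformed variant component {token!r}")
        var_class, var_type, value = match.groups()
        request.append((int(var_class), int(var_type),
                        None if value == "@" else value))
    return request


def parse_simple_element(line):
    """Return (path, variant request); the request is None if not given."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0].lower() != "simpleelement":
        raise ValueError(f"not a simpleElement directive: {line!r}")
    path = [parse_selector(token) for token in tokens[1].split("/")]
    if len(tokens) == 2:
        return path, None
    if tokens[2] != "variant":
        raise ValueError(f"expected 'variant', found {tokens[2]!r}")
    return path, parse_variant_request(tokens[3:])
```

The sketch returns None when no variant request is present, so that a caller can fall back to the defaultVariantRequest, and an empty list when the keyword variant is given without components, which corresponds to the empty request that disables the default.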
643
644
645
646
647 4.6. The Schema Mapping (.map) Files
648
649 Sometimes, the client might want to receive a database record in a
650 schema that differs from the native schema of the record. For
651 instance, a client might only know how to process WAIS records, while
652 the database record is represented in a more specific schema, such as
653 GILS. In this module, a mapping of data to one of the MARC formats is
654 also thought of as a schema mapping (mapping the elements of the
655 record into fields consistent with the given MARC specification, prior
656 to actually converting the data to the ISO2709 format). This use of the
657 object identifier for USMARC as a schema identifier represents an
658 overloading of the OID which might not be entirely proper. However, it
659 represents the dual role of schema and record syntax which is assumed
660 by the MARC family in Z39.50.
661 NOTE: The schema-mapping functions are so far limited to a
662 straightforward mapping of elements. This should be extended with
663 mechanisms for conversions of the element contents, and conditional
664 mappings of elements based on the record contents.
665
666 These are the directives of the schema mapping file format:
667
668
669 targetName name
670 A symbolic name for the target schema of the table. Useful
671 mostly for diagnostic purposes.
672
673
674 targetRef OID-name
675 An OID name for the target schema. This is used, for instance,
676 by a server receiving a request to present a record in a
677 different schema from the native one. The name, again, is found
678 in the oid module of YAZ.
679
680
681 map element-name target-path
682 (repeatable) Adds an element mapping rule to the table.
683
684
685 4.7. The MARC (ISO2709) Representation (.mar) Files
686
687 This file provides rules for representing a record in the ISO2709
688 format. The rules pertain mostly to the values of the constant-length
689 header of the record.
690
691 NOTE: This will be described better.
692
693
694 5. The Input (Data) File Format
695
696 The retrieval module is designed to manage data derived from a variety
697 of different input sources. When used on the client side, the source
698 format may be GRS-1 or ISO2709. On the server side, the source may be a
699 structured ASCII file, augmented by a set of patterns that describe
700 the structure of the document.
701
702 What we think of as the native source format - the one that is
703 guaranteed to provide complete access to the facilities of the module,
704 is an "SGML-like" syntax, based on an inferred DTD, which is in turn
705 based on the profile information from the various files mentioned in
706 this document.
707
708 Like SGML, an input record consists of tags and data. The tags are
709 enclosed by brackets (<...>). As a general rule, each tag should be
710 matched by a corresponding close tag, identified by the same tag name
711 preceded by a slash (/).
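A minimal Python sketch of reading such a record into a tree is given below. It is not the module's reader: it assumes that every tag has a matching close tag (the text above only states this as a general rule), it ignores attributes and entities, and the node layout (a dictionary with "tag" and "children") is chosen purely for the example.

```python
import re

# Either a tag (<name> or </name>) or a run of character data.
TOKEN = re.compile(r"<(/?)([^<>]+)>|([^<]+)")

SAMPLE = """\
<Distributor>
  <Name> USGS/WRD </Name>
  <City> ALBUQUERQUE </City>
</Distributor>
"""


def parse_record(text):
    """Turn an SGML-like record into a list of nested nodes."""
    root = {"tag": None, "children": []}
    stack = [root]
    for match in TOKEN.finditer(text):
        closing, name, data = match.groups()
        if data is not None:
            data = " ".join(data.split())      # normalize whitespace
            if data:
                stack[-1]["children"].append(data)
        elif closing:
            if len(stack) == 1 or stack[-1]["tag"] != name.strip():
                raise ValueError(f"unexpected close tag </{name}>")
            stack.pop()
        else:
            node = {"tag": name.strip(), "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
    if len(stack) != 1:
        raise ValueError(f"missing close tag for <{stack[-1]['tag']}>")
    return root["children"]
```

Calling parse_record(SAMPLE) gives a single Distributor node whose children are the Name and City nodes, each holding its data as a string.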
712
713
714 6. License
715
716 Copyright (C) 1995-2000, Index Data.
717
718 This is the Index Data "P" license - it applies exclusively to the
719 record management module of the YAZ system, and to this document.
720
721 Permission to use, copy, modify, distribute, and sell this software
722 and its documentation, in whole or in part, for any purpose, is hereby
723 granted, provided that:
724
725 1. This copyright and permission notice appear in all copies of the
726 software and its documentation. Notices of copyright or attribution
727 which appear at the beginning of any file must remain unchanged.
728
729 2. The names of Index Data or the individual authors may not be used
730 to endorse or promote products derived from this software without
731 specific prior written permission.
732
733 THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT WARRANTY OF ANY KIND,
734 EXPRESS, IMPLIED, OR OTHERWISE, INCLUDING WITHOUT LIMITATION, ANY
735 WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. IN
736 NO EVENT SHALL INDEX DATA BE LIABLE FOR ANY SPECIAL, INCIDENTAL,
737 INDIRECT OR CONSEQUENTIAL DAMAGES OF ANY KIND, OR ANY DAMAGES
738 WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER OR NOT
739 ADVISED OF THE POSSIBILITY OF DAMAGE, AND ON ANY THEORY OF LIABILITY,
740 ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
741 SOFTWARE.
742
743
744 7. About Index Data
745
746 Index Data is a consulting and software-development enterprise that
747 specialises in library and information management systems. Our
748 interests and expertise span a broad range of related fields, and one
749 of our primary, long-term objectives is the development of a powerful
750 information management system with open network interfaces and
751 hypermedia capabilities.
752
753 We make this software available free of charge, on a fairly
754 unrestrictive license, as a service to the networking community, and
755 to further the development of quality software for open network
756 communication.
757
758 We'll be happy to answer questions about the software, and about
759 ourselves in general.
760
761
762 Index Data, Ryesgade 3, DK-2200 Copenhagen N
763
764
765
766
767
768 Phone: +45 3536 3672
769 Fax : +45 3536 0449
770 Email: [email protected]
771
772