Changeset 6511 for trunk/gsdl3/docs/manual
- Timestamp:
- 2004-01-15T16:02:55+13:00 (20 years ago)
- File:
-
- 1 edited
trunk/gsdl3/docs/manual/manual.tex
r6499 r6511
\subsubsection{Creating a collection from scratch}

Greenstone 3 collections are built using the \gst{gs3build} script, while the files that control how the building is done are found inside the \gst{etc} subdirectory of \gst{gsdl3/web/sites/localsite/collect/[collectionname]}. There are a number of considerations in building a collection, including which documents appear in the collection, how they are indexed for searching, and which classifications are used for browsing. All these aspects are controlled by files within the collection's directory.

Firstly, the documents that comprise the collection should be placed in the import subdirectory. At present, only documents in this directory will appear in the collection.

The basic means of finding documents in Greenstone is search. The etc/collectionConfig.xml file controls which indexes are created to support search. By default, a collection will simply index the text of each document in the collection using the MG search engine. Alternative choices include selecting other search engines, indexing individual fields of documents (e.g. the document title) and indexing documents by section.

Search indexes appear as individual \gst{<index>} elements within the \gst{<search>} element of the \gst{collectionConfig.xml} file, and classifications as individual \gst{<classifier>} elements within the \gst{<browse>} element. In each case, some choices are made using attributes of the element itself, and some through child elements.

Each index can specify which search engine to use, the level at which the index should be built (e.g. document, section or paragraph) and the text over which it should be built (e.g. the document text, titles alone, author names, etc.). Section-level indexes allow a reader to retrieve part of a document (for instance, a chapter) rather than the entire document.
However, Greenstone 3 must be able to identify the internal structure of the document to achieve this. The degree to which structure can be found varies from file format to file format.

Each index must also have a unique name, which is used to identify it within Greenstone. The name is given as the ``name'' attribute of the \gst{<index>} element. The ``type'' attribute indicates which search engine to use for the index, and can contain either ``mg'' or ``mgpp''. If the ``type'' attribute is not given, the default indexer is mg.

The other choices are described using child elements of \gst{<index>}. The \gst{<level>} tag indicates the index level and the \gst{<field>} tag the text to be used. The \gst{<level>} tag can contain one of document, section or paragraph, while the \gst{<field>} tag can contain ``text'' or the name of a metadata field. If the \gst{<level>} tag is omitted, the default setting is to index by document, and if the \gst{<field>} tag is omitted, the default setting is to index the document text.

Example index tags include:

To index only the title of each separate document in the collection:
\begin{gsc}\begin{verbatim}
<index name="dtt">
  <level>document</level>
  <field>dc:title</field>
  <displayItem name='name' lang="en">entire documents</displayItem>
  <displayItem name='name' lang="fr">documents entiers</displayItem>
  <displayItem name='name' lang="es">documentos enteros</displayItem>
</index>
\end{verbatim}\end{gsc}
...in this case the \gst{<field>} tag refers to the ``title'' metadata item, found in the Dublin Core namespace. As no ``type'' attribute is given, the mg search engine would be used on this index.
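
As a sketch of how the defaults described above interact, the following \gst{<search>} element holds two indexes. The first spells out every choice; going by the stated defaults, omitting its ``type'' attribute and its \gst{<level>} and \gst{<field>} children should yield the same index. The second names its level and field but leaves the search engine to default to mg. The index names and the display text are invented for illustration, and the fragment assumes that ordinary XML comments are accepted in the file.

\begin{gsc}\begin{verbatim}
<search>
  <!-- explicit form of the default: mg, document level, full text -->
  <index name="dtx" type="mg">
    <level>document</level>
    <field>text</field>
    <displayItem name='name' lang="en">entire documents</displayItem>
  </index>
  <!-- no type attribute, so the default engine (mg) applies -->
  <index name="stt">
    <level>section</level>
    <field>dc:title</field>
    <displayItem name='name' lang="en">section titles</displayItem>
  </index>
</search>
\end{verbatim}\end{gsc}

Because each index needs a unique name, the two entries use different names and can sit side by side in one file.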

Alternatively, to index the full document texts by section:
\begin{gsc}\begin{verbatim}
<index name="stx" type="mgpp">
  <level>section</level>
  <displayItem name='name' lang="en">entire documents</displayItem>
  <displayItem name='name' lang="fr">documents entiers</displayItem>
  <displayItem name='name' lang="es">documentos enteros</displayItem>
</index>
\end{verbatim}\end{gsc}
...or...
\begin{gsc}\begin{verbatim}
<index name="stx" type="mg">
  <level>section</level>
  <field>text</field>
  <displayItem name='name' lang="en">entire documents</displayItem>
  <displayItem name='name' lang="fr">documents entiers</displayItem>
  <displayItem name='name' lang="es">documentos enteros</displayItem>
</index>
\end{verbatim}\end{gsc}
...in the first example, the \gst{<field>} tag is not explicitly defined, and would default to ``text'', whereas it is explicitly set to ``text'' in the second example. Note the different indexer selected for these two indexes. As they have the same name, they should not appear in the same \gst{collectionConfig.xml} file.

Moving on to \gst{<classifier>} items, the format is broadly similar to that of \gst{<index>} items, but with a couple of different choices. Firstly, each classifier should have ``name'' and ``type'' attributes, as with \gst{<index>} tags. For \gst{<classifier>} items, the ``type'' attribute identifies the kind of classifier. At present, this should be either ``Hierarchy'' or ``AZList''.

The remaining choices for the classifier follow as child elements of the \gst{<classifier>} element. The \gst{<file>} element should contain the name of the file that describes the classifier as its ``URL'' attribute. The format of this file will be described later; it varies from classifier type to classifier type. The \gst{<field>} element identifies the name of the field to index.
More than one \gst{<field>} element may appear if two or more metadata fields are to be used with the classifier. Finally, the \gst{<sort>} item identifies another metadata field by which the items within one classifier node are to be ordered. Unlike the \gst{<index>} element, the \gst{<classifier>} element does not have default, assumed values for its children.

Metadata for documents can be added using \gst{metadata.xml} files. These files were already used in Greenstone 2, and the format is the same in Greenstone 3. A \gst{metadata.xml} file has a root element of \gst{<DirectoryMetadata>}. This encloses a series of \gst{<FileSet>} items. Neither of these tags has any attributes. Each \gst{<FileSet>} item includes two parts. The first part is one or more \gst{<FileName>} tags, each of which encloses a regular expression to identify the files which are to be assigned the metadata. Only files in the same directory as the \gst{metadata.xml} file, or in one of its child directories, will be selected. The \gst{<FileName>} tag encloses the regular expression as text, e.g.:

\begin{gsc}\begin{verbatim}
<FileName>example</FileName>
\end{verbatim}\end{gsc}

This would match any file containing the text ``example'' in its name. The second part of the \gst{<FileSet>} item is a \gst{<Description>} item. The \gst{<Description>} tag has no attributes, but encloses one or more \gst{<Metadata>} tags. Each \gst{<Metadata>} tag contains one metadata item, i.e. a label to describe the metadata and a corresponding value. The \gst{<Metadata>} tag has one compulsory attribute: ``name''. This attribute gives the metadata label to add to the document. Each \gst{<Metadata>} tag also has an optional attribute: ``mode''. If this attribute is set to ``accumulate'', the value is added to the document and any existing values for that metadata item are retained. If the attribute is set to ``set'' or is omitted, any existing value of the metadata item is deleted and replaced by the new value.
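
Putting the pieces described above together, a complete \gst{metadata.xml} file might look like the following minimal sketch. The file patterns, metadata labels and values are invented for illustration. The second pattern assumes that the usual regular expression anchors are supported, and the comment assumes that a file is selected when any one of the patterns matches. Any XML declaration or document type line that a real file may need is left out.

\begin{gsc}\begin{verbatim}
<DirectoryMetadata>
  <FileSet>
    <!-- hypothetical patterns; assumed to select a file if either matches -->
    <FileName>report</FileName>
    <FileName>\.txt$</FileName>
    <Description>
      <!-- no mode given: behaves like "set", replacing any earlier Title -->
      <Metadata name="Title">Annual report</Metadata>
      <!-- accumulate: both values are added, earlier Subjects are kept -->
      <Metadata mode="accumulate" name="Subject">Finance</Metadata>
      <Metadata mode="accumulate" name="Subject">Planning</Metadata>
    </Description>
  </FileSet>
</DirectoryMetadata>
\end{verbatim}\end{gsc}

Under the rules given above, a matching file would end up with a single Title value, and with both Subject values in addition to any Subject values it already had.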

A sample \gst{metadata.xml} file can be found in the gs3test collection. Here is an example fragment of that \gst{metadata.xml} file:

\begin{gsc}\begin{verbatim}
<FileSet>
  <FileName>ec160e</FileName>
  <Description>
    <Metadata name="Title">The Courier - No.160 - Nov - Dec 1996 - Dossier Habitat - Country reports: Fiji , Tonga (ec160e)</Metadata>
    <Metadata mode="accumulate" name="Language">English</Metadata>
    <Metadata mode="accumulate" name="Subject">Settlements and housing: general works incl. low- cost housing, planning techniques, surveying, etc.</Metadata>
    <Metadata mode="accumulate" name="Subject">The Courier ACP 1990 - 1996 Africa-Caribbean-Pacific - European Union</Metadata>
    <Metadata mode="accumulate" name="Organization">EC Courier</Metadata>
    <Metadata mode="accumulate" name="AZList">T.1</Metadata>
  </Description>
</FileSet>
\end{verbatim}\end{gsc}

Here, only one file pattern is found in the file set. However, the \gst{Description} tag contains a number of separate metadata items. Note that the \gst{Title} metadata does not have the ``accumulate'' mode. This means that when the title is assigned to a document, its existing \gst{Title} information will be lost.

Wherever possible, Greenstone 3 will import and use options from a Greenstone 2 \gst{collect.cfg} file. However, using a proper \gst{collectionConfig.xml} file is strongly recommended.

To build a collection, execute \gst{gs3build.sh -collect collectionname}. The process will run, placing the new indexes in the \gst{building} subdirectory of the collection's directory.

[TODO: need to describe namespaces somewhere?]

how to build a collection, but none of the mechanisms of building.