Timestamp:
2004-01-15T16:02:55+13:00
Author:
cs025
Message:

First cut of adding build documentation to manual.

File:
1 edited

  • trunk/gsdl3/docs/manual/manual.tex

    r6499 r6511  
    378378\subsubsection{Creating a collection from scratch}
    379379
    380 ****GEORGE****
     380Building Greenstone 3 collections is done using the \gst{gs3build} script, whilst the files that control how the building is done are found inside the \gst{etc} subdirectory of \gst{gsdl3/web/sites/localsite/collect/[collectionname]}.  There are a number of considerations in building a collection, including which documents appear in the collection, how they are indexed for searching, and which classifications are used for browsing.  All these aspects are controlled by files within the collection's directory.
     381
     382First, the documents that make up the collection should be placed in the \gst{import} subdirectory.  At present, only documents in this directory will appear in the collection.
     383
     384The basic means of finding documents in Greenstone is search.  The \gst{etc/collectionConfig.xml} file controls which indexes are created to support search.  By default, a collection will simply index the text of each document in the collection using the MG search engine.  Alternative choices include selecting other search engines, indexing individual fields of documents (e.g. the document title) and indexing documents by section.
     385
     386Search indexes appear as individual \gst{<index>} elements within the \gst{<search>} element of the \gst{collectionConfig.xml} file, and classifications as individual \gst{<classifier>} elements within the \gst{<browse>} element.  In each case, some choices are made using attributes of the element itself, and some through child elements.
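
As a rough sketch of this nesting, the two containers might be laid out as follows.  The enclosing root element of \gst{collectionConfig.xml} is deliberately omitted, and the names ``idx1'' and ``cls1'' are placeholders rather than values taken from a real collection:
\begin{gsc}\begin{verbatim}
    <search>
      <!-- one <index> element per search index; idx1 is a placeholder -->
      <index name="idx1">
        <!-- child elements describing the index go here -->
      </index>
    </search>
    <browse>
      <!-- one <classifier> element per classification; cls1 is a placeholder -->
      <classifier name="cls1" type="AZList">
        <!-- child elements describing the classifier go here -->
      </classifier>
    </browse>
\end{verbatim}\end{gsc}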
     387
     388Each index can specify which search engine to use, the level at which the index should be built (e.g. document, section or paragraph) and the text over which it should be built (e.g. the document text, titles alone, author names, etc.).  Section-level indexes allow a reader to retrieve part of a document (for instance, a chapter) rather than the entire document.  However, Greenstone 3 must be able to identify the internal structure of the document to achieve this.  The degree to which structure can be found varies from file format to file format.
     389
     390Each index must also have a unique name, which is used to identify it within Greenstone.  The name is given as the ``name'' attribute of the \gst{<index>} element.  The ``type'' attribute indicates which search engine to use for the index, and can contain either ``mg'' or ``mgpp''.  If the ``type'' attribute is not given, the default indexer is mg.
     391
     392The other choices are described using child elements of \gst{<index>}.  The \gst{<level>} tag indicates the index level and the \gst{<field>} tag the text to be used.  The \gst{<level>} tag can contain one of document, section or paragraph, while the \gst{<field>} tag can contain ``text'' or the name of a metadata field.  If the \gst{<level>} tag is omitted, the default setting is to index by document, and if the \gst{<field>} tag is omitted, the default setting is to index the document text.
     393
     394Example index tags include:
     395
     396To index only the title of each separate document in the collection:
     397\begin{gsc}\begin{verbatim}
     398    <index name="dtt">
     399      <level>document</level>
     400      <field>dc:title</field>
     401      <displayItem name='name' lang="en">entire documents</displayItem>
     402      <displayItem name='name' lang="fr">documents entiers</displayItem>
     403      <displayItem name='name' lang="es">documentos enteros</displayItem>
     404    </index>
     405\end{verbatim}\end{gsc}
     406...in this case the \gst{<field>} tag refers to the ``title'' metadata item, found in the Dublin Core namespace.  The mg search engine would be used on this index.
     407
     408Alternatively, to index the full document texts by section:
     409\begin{gsc}\begin{verbatim}
     410    <index name="stx" type="mgpp">
     411      <level>section</level>
     412      <displayItem name='name' lang="en">entire documents</displayItem>
     413      <displayItem name='name' lang="fr">documents entiers</displayItem>
     414      <displayItem name='name' lang="es">documentos enteros</displayItem>     
     415    </index>
     416\end{verbatim}\end{gsc}
     417...or...
     418\begin{gsc}\begin{verbatim}
     419    <index name="stx" type="mg">
     420      <level>section</level>
     421      <field>text</field>
     422      <displayItem name='name' lang="en">entire documents</displayItem>
     423      <displayItem name='name' lang="fr">documents entiers</displayItem>
     424      <displayItem name='name' lang="es">documentos enteros</displayItem>
     425    </index>
     426\end{verbatim}\end{gsc}
     427...in the first example, the \gst{<field>} tag is not explicitly defined, and would default to ``text'', whereas it is explicitly set to ``text'' in the second example.  Note the different indexer selected for these two indexes.  As they share the same name, they should not appear in the same \gst{collectionConfig.xml} file.
     428
     429Moving on to \gst{<classifier>} items, the format is broadly similar to \gst{<index>} items, but with a couple of different choices.  Firstly, each classifier should have ``name'' and ``type'' attributes, as with \gst{<index>} tags.  In the case of \gst{<classifier>} items, the ``type'' attribute identifies the type of classifier.  At present, this should be either ``Hierarchy'' or ``AZList''.
     430
     431The remaining choices for the classifier follow as child elements of the \gst{<classifier>} element.  The \gst{<file>} element should give the name of the file that describes the classifier as its ``URL'' attribute.  The format of this file will be described later; it varies from classifier type to classifier type.  The \gst{<field>} element identifies the name of the field to index.  More than one \gst{<field>} element may appear if two or more metadata fields are to be used with the classifier.  Finally, the \gst{<sort>} item identifies another metadata field by which the items within one classifier node are to be ordered.  Unlike the \gst{<index>} element, the \gst{<classifier>} element does not have default, assumed values for its children.
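
To make this concrete, a classifier entry might look something like the following sketch.  The classifier name, the file name \gst{titles.xml} and the choice of \gst{dc:title} for both the field and the sort order are illustrative assumptions only; they are not taken from an existing collection:
\begin{gsc}\begin{verbatim}
    <classifier name="titles" type="AZList">
      <!-- hypothetical file describing the classifier -->
      <file URL="titles.xml"/>
      <!-- metadata field used by the classifier -->
      <field>dc:title</field>
      <!-- metadata field used to order items within a node -->
      <sort>dc:title</sort>
    </classifier>
\end{verbatim}\end{gsc}
All three child elements are given explicitly here, since \gst{<classifier>} has no default values for its children.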
     432
     433Metadata for documents can be added using \gst{metadata.xml} files.  These files are already used in Greenstone 2, and the format is the same in Greenstone 3.  A \gst{metadata.xml} file has a root element of \gst{<DirectoryMetadata>}.  This encloses a series of \gst{<FileSet>} items.  Neither of these tags has any attributes.  Each \gst{<FileSet>} item includes two parts.  The first is one or more \gst{<FileName>} tags, each of which encloses a regular expression to identify the files which are to be assigned the metadata.  Only files in the same directory as the \gst{metadata.xml} file, or in one of its child directories, will be selected.  The \gst{<FileName>} tag encloses the regular expression as text, e.g.:
     434
     435\begin{gsc}\begin{verbatim}
     436<FileName>example</FileName>
     437\end{verbatim}\end{gsc}
     438
     439This would match any file containing the text ``example'' in its name.  The second part of the \gst{<FileSet>} item is a \gst{<Description>} item.  The \gst{<Description>} tag has no attributes, but encloses one or more \gst{<Metadata>} tags.  Each \gst{<Metadata>} tag contains one metadata item, i.e. a label to describe the metadata and a corresponding value.  The \gst{<Metadata>} tag has one compulsory attribute: ``name''.  This attribute gives the metadata label to add to the document.  Each \gst{<Metadata>} tag also has an optional attribute: ``mode''.  If this attribute is set to ``accumulate'' then the value is added to the document, and any existing values for that metadata item are retained.  If the attribute is set to ``set'' or is omitted, then any existing value of the metadata item is discarded and replaced by the new value.
     440
     441A sample \gst{metadata.xml} file can be found in the gs3test collection.  Here is an example fragment of that file:
     442
     443\begin{gsc}\begin{verbatim}
     444    <FileSet>
     445        <FileName>ec160e</FileName>
     446        <Description>
     447            <Metadata name="Title">The Courier - No.160 - Nov - Dec 1996 - Dossier Habitat - Country reports: Fiji , Tonga (ec160e)</Metadata>
     448            <Metadata mode="accumulate" name="Language">English</Metadata>
     449            <Metadata mode="accumulate" name="Subject">Settlements and housing: general works incl. low- cost housing, planning techniques, surveying, etc.</Metadata>
     450            <Metadata mode="accumulate" name="Subject">The Courier ACP 1990 - 1996 Africa-Caribbean-Pacific - European Union</Metadata>
     451            <Metadata mode="accumulate" name="Organization">EC Courier</Metadata>
     452            <Metadata mode="accumulate" name="AZList">T.1</Metadata>
     453        </Description>
     454    </FileSet>
     455\end{verbatim}\end{gsc}
     456
     457Here, only one file pattern is found in the file set.  However, the \gst{<Description>} tag contains a number of separate metadata items.  Note that the \gst{Title} metadata does not have the ``mode'' attribute set to ``accumulate''.  This means that when the title is assigned to a document, its existing \gst{Title} information will be lost.
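
For comparison, a complete but minimal \gst{metadata.xml} file, including the \gst{<DirectoryMetadata>} root element and a file set with more than one file pattern, might be sketched as follows.  The file patterns and metadata values are invented for illustration, and any XML declaration or document type declaration that a real file may carry is left out:
\begin{gsc}\begin{verbatim}
    <DirectoryMetadata>
      <FileSet>
        <!-- hypothetical patterns: a file matching either one is selected -->
        <FileName>report</FileName>
        <FileName>summary</FileName>
        <Description>
          <!-- no mode attribute: replaces any existing Title -->
          <Metadata name="Title">Annual report</Metadata>
          <!-- accumulate: both Subject values are kept -->
          <Metadata mode="accumulate" name="Subject">Housing</Metadata>
          <Metadata mode="accumulate" name="Subject">Planning</Metadata>
        </Description>
      </FileSet>
    </DirectoryMetadata>
\end{verbatim}\end{gsc}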
     458
     459Wherever possible, Greenstone 3 will import and use options from a Greenstone 2 \gst{collect.cfg} file.  However, it is strongly recommended that a proper \gst{collectionConfig.xml} file is used.
     460
     461To build a collection, execute \gst{gs3build.sh -collect collectionname}.  The process will run, placing the new indexes in the \gst{building} subdirectory of the collection's directory.
     462
     463[TODO: need to describe namespaces somewhere?]
    381464
    382465how to build a collection, but none of the mechanisms of building.