Changeset 6904


Timestamp:
2004-03-03T14:32:22+13:00
Author:
kjdon
Message:

still working on this, when will I ever be finished???

File:
1 edited

  • trunk/gsdl3/docs/manual/manual.tex

    r6520 r6904  
    11\documentclass[a4paper,11pt]{article}
    2 \usepackage{times,epsfig}
     2\usepackage{isolatin1,times,epsfig}
    33\hyphenation{Message-Router Text-Query}
    44
     
    5454A description of the general design and architecture of Greenstone3 is covered by the document {\em The design of Greenstone3: An agent based dynamic digital library} (design-2002.ps, in the gsdl3/docs/manual directory).
    5555
    56 This documentation consists of several parts. Section~\ref{sec:install} covers greenstone installation, how to access the library, and some administration issues. Section~\ref{sec:user} looks at using the sample collections, creating new collections, and how to make small customisations to the interface. The remaining sections are aimed towards  the Greenstone developer. Section~\ref{sec:develop-runtime} describes the run-time system, including the structure of the software, and the message format, while Section~\ref{sec:develop-build} describes the collection building process. Section~\ref{sec:new-features} describes how to add new features to Greenstone, such as how to add new services, new page types, new plugins for different document formats.  Section~\ref{sec:distributed} describes how to make Greenstone run in a distributed fashion, using SOAP as an example communications protocol. Finally, there are several appendices, including how to install Greenstone from CVS, and a comparison of greenstone 2 and greenstone 3 format statements.
     56This documentation consists of several parts. Section~\ref{sec:install} covers greenstone installation, how to access the library, and some administration issues. Section~\ref{sec:user} looks at using the sample collections, creating new collections, and how to make small customisations to the interface. The remaining sections are aimed towards  the Greenstone developer. Section~\ref{sec:develop-runtime} describes the run-time system, including the structure of the software, and the message format, while Section~\ref{sec:develop-build} describes the collection building process. Section~\ref{sec:new-features} describes how to add new features to Greenstone, such as how to add new services, new page types, new plugins for different document formats.  Section~\ref{sec:distributed} describes how to make Greenstone run in a distributed fashion, using SOAP as an example communications protocol. Finally, there are several appendices, including how to install Greenstone from CVS, and a comparison of Greenstone2 and Greenstone3 format statements.
    5757\newpage
    5858\section{Greenstone installation and administration}\label{sec:install}
     
    6969\subsubsection{Linux}
    7070
    71 Download the latest version of the self-installing tar file, gsdl3-x.xx-unix.sh, and run it in a shell (./gsdl3-x.xx-unix.sh). Greenstone will be installed into a directory called gsdl3 inside the current directory. The install script will prompt you for  the name of your computer and what port to run tomcat on (the defaults being localhost and 8080).  Once Greenstone has been installed, you can start the library  by running ./gsdl3/gs3-launch.sh, and opening up a browser pointing to localhost:8080/gsdl3 (or different computer name and port).
     71Download the latest version of the self-installing tar file, gsdl3-x.xx-unix.sh, and run it in a shell (./gsdl3-x.xx-unix.sh). Greenstone will be installed into a directory called gsdl3 inside the current directory. The install script will prompt you for  the name of your computer and what port to run Tomcat on (the defaults being localhost and 8080).  Once Greenstone has been installed, you can start the library  by running ./gsdl3/gs3-launch.sh, and opening up a browser pointing to localhost:8080/gsdl3 (or different computer name and port).
    7272
    7373\subsubsection{Windows}
     
    8787\subsubsection{Restarting the library}
    8888
    89 The library program (actually tomcat) can be restarted by ... (** put a mechanism in each install program **).
     89The library program (actually Tomcat) can be restarted by ... (** put a mechanism in each install program **).
    9090
    9191
     
    106106Table~\ref{tab:dirs} shows the file hierarchy for Greenstone3.
    107107The first part  shows the common stuff which can be shared between
    108 Greenstone users---the source, libraries etc. Under Linux, these will eventually be installed into appropriate system directories. The second part shows
     108Greenstone users---the source, libraries etc. Under Linux, these can be installed into appropriate system directories. The second part shows
    109109stuff used by one person/group---their sites and interface setup (see Section~\ref{sec:sites-and-ints}).
    110110etc. There can be several sites/interfaces per installation.
     
    149149  & windows executables for e.g. MGPP\\
    150150gsdl3/comms
    151   & Put some stuff here for want of a better place---things to do with servers and communication. e.g. soap stuff, and tomcat servlet container\\
     151  & Put some stuff here for want of a better place---things to do with servers and communication. e.g. soap stuff, and Tomcat servlet container\\
    152152gsdl3/docs
    153153  & Documentation :-)\\
     
    156156  & This is where the web site is defined. Any static html files can go here. This directory is the Tomcat root directory.\\
    157157gsdl3/web/WEB-INF
    158   & The web.xml file lives here (servlet configuration information for tomcat)\\
     158  & The web.xml file lives here (servlet configuration information for Tomcat)\\
    159159gsdl3/web/WEB-INF/classes
    160160  & Servlet classes go in here\\
     
    187187where they live, whats the difference, what each contains.\\
    188188
    189 A site is comprised of a set of collections and possibly services. An interface is a set of images along with a set of xslt files used for translating xml output from the library into an appropriate form---html for the servlet case.
    190 One greenstone installation can have many sites and interfaces. One instantiation of a servlet uses one site and one interface. Sites and interfaces can be matched up in different ways. For example, a single site might be served with two different interfaces. This provides different modes of access to the same content. eg HTML vs WML, or perhaps providing completely different look and feel for different audiences. A standard interface may be used with many different sites---provides a consistent mode of access to a lot of different content.
    191 
    192 Collections live in the collect directory of a site. Any collections that are found in this directory when the servlet is initialised will be loaded up and presented to the user. Collections require valid configuration files, but apart from this, nothing needs to be done to the site to use new collections. Collection is added while tomcat is running will not be picked up: you can either restart the server, or send a configuration request to the servlet: these are described in Section~\ref{sec:runtime-config}.
     189A site is comprised of a set of collections and possibly some site-wide services. An interface (in this web-based servlet context) is a set of images along with a set of xslt files used for translating xml output from the library into an appropriate form---html in general.
     190
     191One greenstone installation can have many sites and interfaces. One instantiation of a servlet uses one site and one interface. Sites and interfaces can be matched up in different ways. For example, a single site might be served with two different interfaces. This provides different modes of access to the same content. eg HTML vs WML, or perhaps providing a completely different look and feel for different audiences. A standard interface may be used with many different sites---providing a consistent mode of access to a lot of different content.
     192
     193Collections live in the collect directory of a site. Any collections that are found in this directory when the servlet is initialised will be loaded up and presented to the user. Collections require valid configuration files, but apart from this, nothing needs to be done to the site to use new collections. Collections added while Tomcat is running will not be noticed automatically. Either the server needs to be restarted, or a configuration request may be sent to the library, triggering a (re)load of the collection (this is described in Section~\ref{sec:runtime-config}).
    193194
    194195There are two Greenstone sites that come with the distribution: localsite, and soapsite. localsite has several demo  collections, while soapsite has none. soapsite specifies that a soap connection should be made to localsite. Getting this to work involves setting up a soap server for localsite: see Section~\ref{sec:distributed} for details.
     
    224225
    225226Initial Greenstone3 system configuration is determined by a set of configuration files, all expressed in XML. Each site has a configuration file that binds parameters for the site, \gst{siteConfig.xml}. Each interface has a configuration file, \gst{interfaceConfig.xml}, that specifies Actions for the interface. Collections also have several configuration files; these are discussed in Section~\ref{sec:collconfig}.
    226 The configuration files are read in when the system is initialised, and their contents are cached in memory. This means that changes made to these files once the system is running will not take immediate effect. Tomcat needs to be restarted for changes to the interface configuration file to take effect. However, changes to the site configuration file can be incorporated sending a CGI-type command to the library.  There are a series of CGI-type commands that can be sent to the library to induce reconfiguration of different modules, including reloading the whole site. This removes the need to shutdown and restart the system to reflect these changes. These commands are described in Section~\ref{sec:runtime-config}.
     227The configuration files are read in when the system is initialised, and their contents are cached in memory. This means that changes made to these files once the system is running will not take immediate effect. Tomcat needs to be restarted for changes to the interface configuration file to take effect. However, changes to the site configuration file can be incorporated sending a CGI-type command to the library.  There are a series of commands that can be sent to the library to induce reconfiguration of different modules, including reloading the whole site. This removes the need to shutdown and restart the system to reflect these changes. These commands are described in Section~\ref{sec:runtime-config}.
    227228
    228229\subsubsection{Site configuration file}\label{sec:siteconfig}
     
    280281\subsubsection{Interface configuration file}\label{sec:interfaceconfig}
    281282
    282 The interface configuration file \gst{interfaceConfig.xml} lists all the actions that the interface knows about at the start (but other ones can be loaded dynamically). If the interface uses servlets, it specifies what short name each action should use for the action CGI parameter e.g. QueryAction should use a=q. If the interface uses XSLT, it specifies what XSLT file should be used for each action and subaction.
     283The interface configuration file \gst{interfaceConfig.xml} lists all the actions that the interface knows about at the start (other ones can be loaded dynamically). It specifies what short name each action maps to (this is used in library urls for the a (action) parameter) e.g. QueryAction should use a=q. If the interface uses XSLT, it specifies what XSLT file should be used for each action and possibly each subaction. This makes it easy for developers to implement and use different actions and/or XSLT files without recompilation. The server must be restarted, however.
     284
     285It also lists all the languages that the interface text files have been translated into. These have a name attribute, which is the ISO code for the language, and a displayElement which gives the language name in that language (note the non-English characters have been specified in UTF-8 codes). This language list is used on the Preferences page to allow the user to change the interface language. Details on how to add a new language to a Greenstone library are shown in Section~\ref{sec:interface-customise}.
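As an illustration of how the action and language lists might be consumed, here is a minimal Python sketch that parses a cut-down \gst{interfaceConfig.xml}. The element and attribute names (\gst{action}, \gst{language}, \gst{displayItem}) follow the sample file shown in this changeset; the helper names \gst{read\_actions} and \gst{read\_languages} are hypothetical and are not part of Greenstone.

```python
import xml.etree.ElementTree as ET

# Cut-down sample modelled on the interfaceConfig.xml figure; a real file
# lists many more actions.
SAMPLE = """\
<interfaceConfig>
  <actionList>
    <action name='s' class='SystemAction' xslt='system.xsl'/>
  </actionList>
  <languageList>
    <language name="en">
      <displayItem name='name'>English</displayItem>
    </language>
    <language name="fr">
      <displayItem name='name'>Français</displayItem>
    </language>
    <language name='es'>
      <displayItem name='name'>Español</displayItem>
    </language>
  </languageList>
</interfaceConfig>
"""


def read_actions(xml_text):
    """Map each action's short name to its (class, xslt) pair."""
    root = ET.fromstring(xml_text)
    return {
        action.get("name"): (action.get("class"), action.get("xslt"))
        for action in root.findall("./actionList/action")
    }


def read_languages(xml_text):
    """Map each ISO language code to the name shown on the Preferences page."""
    root = ET.fromstring(xml_text)
    languages = {}
    for language in root.findall("./languageList/language"):
        item = language.find("displayItem[@name='name']")
        # Fall back to the ISO code if no display name is given (an assumption).
        code = language.get("name")
        languages[code] = item.text if item is not None else code
    return languages


print(read_actions(SAMPLE))
print(read_languages(SAMPLE))
```

The sketch only reads the file; as the text notes, the server itself still has to be restarted before changes to this file take effect.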
    283286
    284287\begin{figure}
     
    303306    <action name='s' class='SystemAction' xslt='system.xsl'/>
    304307  </actionList>
     308  <languageList>
     309    <language name="en">
     310      <displayItem name='name'>English</displayItem>
     311    </language>
     312    <language name="fr">
     313      <displayItem name='name'>Français</displayItem>
     314    </language>
     315    <language name='es'>
     316      <displayItem name='name'>Español</displayItem>
     317    </language>
     318  </languageList>
    305319</interfaceConfig>
    306320\end{verbatim}\end{gsc}
     
    309323\end{figure}
    310324
    311 This makes it easy for developers to implement and use different actions and/or XSLT files without recompilation. The server must be restarted, however.
    312325
    313326\subsection{Run-time re-initialisation}\label{sec:runtime-config}
     
    315328should this section go in here, cos its kind of adminy, or go into the user stuff, cos you need to do it after building a collection???
    316329
    317 When tomcat is started up, the site and interface configuration files are read in, and actions/services/collections loaded as necessary. The configuration is then static unless tomcat is restarted, or re-configuration commands issued.
    318 
    319 There are several CGI-type commands that can be issued to tomcat to avoid having to restart the server. These can reload the entire site, or just individual collections. Unfortunately at present there are no commands to reconfigure the interface, so if the interface configuration file has changed, tomcat must be restarted for those changes to take effect. Similarly, if the java classes are modified, tomcat must be restarted then too.
     330When Tomcat is started up, the site and interface configuration files are read in, and actions/services/collections loaded as necessary. The configuration is then static unless Tomcat is restarted, or re-configuration commands issued.
     331
     332There are several CGI-type commands that can be issued to Tomcat to avoid having to restart the server. These can reload the entire site, or just individual collections. Unfortunately at present there are no commands to reconfigure the interface, so if the interface configuration file has changed, Tomcat must be restarted for those changes to take effect. Similarly, if the java classes are modified, Tomcat must be restarted then too.
    320333
    321334Currently, the runtime configuration commands can only be accessed by typing in CGI-arguments into the URL, there is no nice web form yet to do this.
    322335
    323 The CGI arguments are entered after the \gst{library?} part of the URL. There are three types of commands: configure, activate, deactivate\footnote{There is no security for these commands yet in Greenstone, so the deactivate/delete command is disabled}. These are specified by \gst{a=s\&sa=c}, \gst{a=s\&sa=a}, and \gst{a=s\&sa=d}, respectively (\gst{a} is action, \gst{sa} is subaction). By default, the requests are sent to the MessageRouter, but they can be sent to a collection/cluster by the addition of \gst{sc=xxx}, where \gst{xxx} is the name of the collection or cluster. Table~\ref{tab:run-time config} describes the arguments in a bit more detail.
     336The CGI arguments are entered after the \gst{library?} part of the URL. There are three types of commands: configure, activate, deactivate\footnote{There is no security for these commands yet in Greenstone, so the deactivate/delete command is disabled}. These are specified by \gst{a=s\&sa=c}, \gst{a=s\&sa=a}, and \gst{a=s\&sa=d}, respectively (\gst{a} is action, \gst{sa} is subaction). By default, the requests are sent to the MessageRouter, but they can be sent to a collection/cluster by the addition of \gst{sc=xxx}, where \gst{xxx} is the name of the collection or cluster. Table~\ref{tab:run-time config} describes the commands and arguments in a bit more detail.
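The argument combinations described above can be assembled mechanically. The following Python sketch builds such URLs; the base URL is an assumption (the default host and port from the installation section plus a servlet named \gst{library}), and \gst{system\_url} is a hypothetical helper, not part of Greenstone.

```python
from urllib.parse import urlencode

# Assumption: default host/port from the install section and a servlet
# called "library"; adjust both to match the actual installation.
BASE_URL = "http://localhost:8080/gsdl3/library"

# Subaction codes for the three command types named in the text.
SUBACTIONS = {"configure": "c", "activate": "a", "deactivate": "d"}


def system_url(command, base_url=BASE_URL, **extra):
    """Build a run-time configuration URL: a=s, the sa subaction, then extras."""
    params = {"a": "s", "sa": SUBACTIONS[command]}
    # Optional arguments such as sc, ss, st and sn keep the order given.
    params.update(extra)
    return base_url + "?" + urlencode(params)


# Reconfigure only the collection list of the whole site.
print(system_url("configure", ss="collectionList"))
# (Re)activate a collection; "collname" is a placeholder collection name.
print(system_url("activate", st="collection", sn="collname"))
```

Typing the resulting URL into a browser is equivalent to entering the arguments by hand after \gst{library?}.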
    324337
    325338\begin{table}
     
    329342\begin{tabular}{lp{8cm}}
    330343\hline
    331 \gst{a=s\&sa=c} & reconfigures the whole site, reads in siteConfig.xml, reloads all the collections. Just part of this can be specified with another argument \gst{ss} (system subset). The valid values are \gst{collectionList}, \gst{siteList}, \gst{serviceList}, \gst{clusterList}. \\
     344\gst{a=s\&sa=c} & reconfigures the whole site. Reads in siteConfig.xml, reloads all the collections. Just part of this can be specified with another argument \gst{ss} (system subset). The valid values are \gst{collectionList}, \gst{siteList}, \gst{serviceList}, \gst{clusterList}. \\
    332345\gst{a=s\&sa=c\&sc=XXX} & reconfigures the XXX collection or cluster. \gst{ss} can also be used here, valid values are \gst{metadataList} and \gst{serviceList}. \\
    333346\gst{a=s\&sa=a} & (re)activate a specific module. Modules are specified using two arguments, \gst{st} (system module type) and \gst{sn} (system module name). Valid types are \gst{collection}, \gst{cluster} \gst{site}.\\
     
    344357\subsection{Using a collection}\label{sec:usecolls}
    345358
    346 A collection typically consists of a set of documents, which could be text, html, word, PDF, images, bibliographic records etc, along with some access methods, or services. Typical access methods include searching or browsing for document identifiers, and retrieval of content or metadata for those identifiers.
    347 Searching involves entering words or phrases and getting back lists of documents that contain those words. The search terms may be restricted to particular fields of the document. Browsing ...
     359A collection typically consists of a set of documents, which could be text, html, word, PDF, images, bibliographic records etc, along with some access methods, or ``services''. Typical access methods include searching or browsing for document identifiers, and retrieval of content or metadata for those identifiers.
     360Searching involves entering words or phrases and getting back lists of documents that contain those words. The search terms may be restricted to particular fields of the document.
     361
     362Browsing involves navigating pre-defined hierarchies of documents, following links of interest to find documents. The hierarchies may be constructed on different metadata fields, for example, alphabetical lists of Titles, or a hierarchy of Subject classifications. Clicking on a bookshelf icon takes you to a lower level in the hierarchy, while clicking on a book or page icon takes you to a document.
    348363
    349364In the standard interface that comes with Greenstone3\footnote{of course, this is all customisable}, collections in a digital library are presented in the following manner. The 'home' page of the library shows a list of all the public collections in that library. Clicking on a collection link takes you to the home page for the collection, which we call the 'about' page. The standard page banner looks something like that shown in Figure~\ref{fig:page-banner}.
     
    356371\end{figure}
    357372
    358 The image at the top left is a link to the collection's about page. The top right has buttons to link to the library home page, help pages and preference pages. All the available services are arrayed along a navigation bar, along the bottom of the banner. Click on a name to access that service.
    359 Once you are looking at a document, clicking the open book icon at the top of the document, underneath the navigation bar, will take you back to the search or browse page where you accessed the document from.
     373The image at the top left is a link to the collection's home page. The top right has buttons to link to the library home page, help pages and preference pages. All the available services are arrayed along a navigation bar, along the bottom of the banner. Clicking on a name accesses that service. Search type services generally provide a form to fill in, with parameters including what field or granularity to index, and the query itself. Clicking the
     374The results of a search
     375Once you are looking at a document, clicking the open book icon at the top of the document, underneath the navigation bar, will take you back to the service page that you accessed the document from.
    360376
    361377describe the colls that the sample installation comes with\\
     
    371387Collections live in the collect directory of a site. As described in Section~\ref{sec:sites-and-ints}, there can be several sites per greenstone installation. The collect directory is at \$GSDL3HOME/web/sites/site-name/collect, where site-name is the name of the site you want your new collection to belong to.
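As a rough sketch of the directory layout just described, the Python helpers below compute a site's collect directory and list its subdirectories. The function names are hypothetical. A subdirectory found this way is only a candidate collection, since the text notes that collections also need valid configuration files.

```python
import os
from pathlib import Path


def collect_dir(site_name, gsdl3home=None):
    """Return $GSDL3HOME/web/sites/<site_name>/collect as a Path."""
    # Fall back to the GSDL3HOME environment variable when no path is given.
    home = gsdl3home or os.environ["GSDL3HOME"]
    return Path(home) / "web" / "sites" / site_name / "collect"


def list_collections(site_name, gsdl3home=None):
    """List the names of the subdirectories of a site's collect directory."""
    directory = collect_dir(site_name, gsdl3home)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.iterdir() if p.is_dir())


# "/opt/gsdl3" is a placeholder install location.
print(collect_dir("localsite", "/opt/gsdl3"))
```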
    372388
    373 The following two sections describe how to create a collection from scratch, and how to import a greenstone 2 collection. Once a collection has been built, the library server needs to be notified that there is a new collection. This can be accomplished in two ways\footnote{eventually there will also probably be automatic polling for new collections}. If you are the library administrator, you can restart tomcat. The library servlet will then be created afresh, and will discover the new collection when it scans the collect directory for the collection list. Alternatively, there is a CGI command to reload a collection which can also load a new one. Use the CGI arguments \gst{a=s\&sa=a\&st=collection\&sn=collname}---this tells the library program to reload the collname collection.
     389The following two sections describe how to create a collection from scratch, and how to import a greenstone 2 collection. Once a collection has been built, the library server needs to be notified that there is a new collection. This can be accomplished in two ways\footnote{eventually there will also probably be automatic polling for new collections}. If you are the library administrator, you can restart Tomcat. The library servlet will then be created afresh, and will discover the new collection when it scans the collect directory for the collection list. Alternatively, there is a CGI command to reload a collection which can also load a new one. Use the CGI arguments \gst{a=s\&sa=a\&st=collection\&sn=collname}---this tells the library program to reload the collname collection.
    374390
    375391
    376392\subsubsection{Creating a collection from scratch}
    377393
    378 Building Greenstone 3 collections is done using the \gst{gs3build} script, whilst the files that control how the building is done are found inside the \gst{etc} subdirectory of \gst{gsdl3/web/sites/localsite/collect/[collectionname]}.  There are a number of considerations in building a collection: including what documents appear in the collection, how they are indexed for searching, which classifications are used for browsing, etc.  All these aspects are controlled by files within the collection's directory.
     394Building Greenstone 3 collections is done using the \gst{gs3-build.sh} script, with the \gst{collectionConfig.xml} file controlling how the building is done.  There are a number of considerations in building a collection: including what documents appear in the collection, how they are indexed for searching, which classifications are used for browsing, etc.
    379395
    380396Firstly, the documents that comprise the collection should be placed in the import subdirectory.  At present, only documents in this directory will appear in the collection.
    381 
    382 The basic means of finding documents in Greenstone is search.  The etc/collectionConfig.xml file controls which indexes are created to support search.  By default, a collection will simply index the text of each document in the collection using the MG search engine.  Alternative choices include selecting other search engines, indexing individual fields of documents (e.g. the document title) and indexing documents by section.
    383 
    384 Search indexes appear as individual \gst{<index>} elements within the \gst{<search>}element of the \gst{collectionConfig.xml} file, and classifications as individual \gst{<classifier>} elements within the \gst{<browse>} element.  In each case, some choices are made using attributes of the element itself, and some through child elements. 
    385 
    386 Indexes can alter which search engine to use for that index, the level at which the index should be built (e.g. document, section or paragraph) and the text over which it should be built (e.g. the document text, titles alone, author names, etc.).  Section-level indexes allow a reader to recall part of a document (for instance, a chapter) rather than the entire document.  However, Greenstone 3 must be able to identify the internal structure of the document to achieve this.  The degree to which structure can be found varies from file format to file format.
    387 
    388 Each index also must have a unique name, which is used to identify it within Greenstone  The name is given as an attribute of the \gst{<index>} element.  The ``type'' indicates which search engine to use for the index.  This attribute can contain either 'mg' or 'mgpp'.  If the ``type'' attribute is not given, the default indexer is mg.
    389 
    390 The other choices are described using child elements of \gst{<index>}.  The \gst{<level>} tag indicates the index level and the \gst{<field>} tag the text to be used.  The \gst{<level>} tag can contain one of document, section or paragraph, while the \gst{<field>} tag can contain ``text'' or the name of a metadata field.  If the \gst{<level>} tag is omitted, the default setting is to index by document, and if the \gst{<field>} tag is omitted, the default setting is to index the document text.
    391 
    392 Example index tags include:
    393 
    394 To index only the title of each separate document in the collection:
    395 \begin{gsc}\begin{verbatim}
    396 <index name="dtt">
    397   <level>document</level>
    398   <field>dc:title</field>
    399   <displayItem name='name' lang="en">entire documents</displayItem>
    400   <displayItem name='name' lang="fr">documents entiers</displayItem>
    401   <displayItem name='name' lang="es">documentos enteros</displayItem>
    402 </index>
    403 \end{verbatim}\end{gsc}
    404 ...in this case the \gst{<field>} tag refers to the ``title'' metadata item, found in the Dublin Core namespace.  The mg search engine would be used on this index.
    405 
    406 Alternatively, to index the full document texts by section:
    407 \begin{gsc}\begin{verbatim}
    408 <index name="stx" type=''mgpp''>
    409   <level>section</level>
    410   <displayItem name='name' lang="en">entire documents</displayItem>
    411   <displayItem name='name' lang="fr">documents entiers</displayItem>
    412   <displayItem name='name' lang="es">documentos enteros</displayItem>     
    413 </index>
    414 \end{verbatim}\end{gsc}
    415 ...or...
    416 \begin{gsc}\begin{verbatim}
    417 <index name="stx" type=''mg''>
    418   <level>section</level>
    419   <field>text</field>
    420   <displayItem name='name' lang="en">entire documents</displayItem>
    421   <displayItem name='name' lang="fr">documents entiers</displayItem>
    422   <displayItem name='name' lang="es">documentos enteros</displayItem>
    423 </index>
    424 \end{verbatim}\end{gsc}
    425 ...in the first example, the \gst{<field>} tag is not explicitly defined, and would default to 'text', whereas it is explicitly set to 'text' in the second example.  Note the different indexer selected for these two indexes.  As they are of the same name, they should not appear in the same \gst{collectionConfig.xml} file.
    426 
    427 Moving onto \gst{<classifier>} items, the format is broadly similar to \gst{<index>} items, but with a couple of different choices.  Firstly, each classifier should have a ``name'' and ``type'' attribute as with \gst{<index>} tags.  In the case of \gst{<classifier>} items the ``type'' attribute identifies the type of classifier it is.  At present, this should either be ``Hierarchy'' or ``AZList''. 
    428 
    429 The remaining choices for the classifier should follow as child elements of the \gst{<classifier>} element.  The \gst{<file>} element should contain the name of the file that describes the classifier as its ``URL'' attribute.  The format of this file will be described later - it will vary from classifier type to classifier type.  The \gst{<field>} element identifies the name of the field to index.  More than one \gst{<field>} element may appear if two or more metadata fields are to be used with the classifier.  Finally, the \gst{<sort>} item identifies another metadata field which the items within one classifier node are to be ordered.  Unlike the \gst{<index>} element, the \gst{<classifier>} element does not have default, assumed values for its children.
     397[TODO: describe the kinds of documents that can be added, something about METS files?]
    430398
    431399Metadata for documents can be added using metadata.xml files.  These files have already been used in Greenstone 2, and the format is the same in Greenstone 3.  A metadata.xml file has a root element of \gst{<DirectoryMetadata>}.  This encloses a series of \gst{<FileSet>} items.  Neither of these tags has any attributes.  Each \gst{<FileSet>} item includes two parts: firstly, one or more \gst{<FileName>} tags, each of which encloses a regular expression to identify the files which are to be assigned the metadata.  Only files in the same directory as the metadata.xml, or in one of its child directories, file will be selected.  The filename tag encloses the regular expression as text, eg:
     
    459427Here, only one file pattern is found in the file set.  However, the \gst{Description} tag contains a number of separate metadata items.  Note that the \gst{Title} metadata does not have the accumulate metadata.  This means that when the title is assigned to a document, its existing \gst{Title} information will be lost.
    460428
    461 Whereever possible, the Greenstone 3 will import and use options from a Greenstone 2 \gst{collect.cfg} file.  However, it is strongly recommended that a proper \gst{collectionConfig.xml} file is used wherever possible.
    462 
    463 To build a collection execute \gst{gs3build.sh -collect collectionname}.  The process will run, placing the new indexes in the \gst{building} subdirectory of the collection's directory.
    464 
    465 The building directory should be renamed to index, and a buildConfig.xml file added to it. See Section~\ref{sec:buildconfig} and look at the other collections' buildConfig files for examples.
    466 
    467 [TODO: need to describe namespaces somewhere? need to generate the buildConfig file automatically.]
     429The basic means of finding documents in Greenstone is search. Options for building the search indexes include which indexer to use, what granularity to use for the indexes (e.g. whether to index documents as a whole, or sections of documents), what content the index should have (the whole text of the document or one or many metadata fields).
     430
     431Indexes can alter which search engine to use for that index, the level at which the index should be built (e.g. document, section or paragraph) and the text over which it should be built (e.g. the document text, titles alone, author names, etc.).  Section-level indexes allow a reader to recall part of a document (for instance, a chapter) rather than the entire document.  However, Greenstone 3 must be able to identify the internal structure of the document to achieve this.  The degree to which structure can be found varies from file format to file format.
     432
     433The \gst{collectionConfig.xml} file controls all of these options for collection building; its format is described in Section~\ref{sec:collconfig}.
     434
     435Wherever possible, Greenstone 3 will import and use options from a Greenstone 2 \gst{collect.cfg} file.  However, it is strongly recommended that a proper \gst{collectionConfig.xml} file is used.
     436
     437To build a collection, execute \gst{gs3build.sh sitename collectionname}.  The process will run, placing the new indexes in the \gst{building} subdirectory of the collection's directory. You must have mysql running before you start building---running \gst{gs3-launch.sh} will start up the mysql server as well as tomcat.
     438
     439Once the build process is complete, the building directory should be renamed to index (after deleting the existing index directory, if any), and Tomcat prompted to reload the collection---either by restarting the server, or by sending an activate collection command to the library servlet.
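
The build-and-install steps above might be run as follows. This is only a sketch: \gst{localsite} and \gst{mycoll} are placeholder site and collection names, and the collection path is assumed to follow the \gst{\$GSDL3HOME/web/sites/sitename/collect} layout used elsewhere in this manual.

\begin{gsc}\begin{verbatim}
# start mysql and Tomcat first
gs3-launch.sh

# build the collection; output goes into the 'building' directory
gs3build.sh localsite mycoll

# replace the old index with the newly built one
cd $GSDL3HOME/web/sites/localsite/collect/mycoll
rm -rf index
mv building index
\end{verbatim}\end{gsc}

Tomcat then needs to reload the collection before the new index is used.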
     440
     441[TODO: need to describe namespaces somewhere? ]
    468442
    469443\subsubsection{Importing a greenstone 2 collection}
     
    472446The Greenstone 3 run time system requires different configuration files for a collection, so you need to run a conversion script. All this does is create the new collectionConfig.xml and buildConfig.xml from the old collect.cfg and build.cfg files. It does not change the collection in any way, so it can still be used by Greenstone 2 software.
    473447
    474 The conversion script is \gst{convert\_coll\_from\_gs2.pl}. To run it, you need to specify the path to the collect directory, and the collection name. For example,
     448The conversion script is \gst{convert\_coll\_from\_gs2.pl}. To run it, first make sure you have sourced setup.bash (or run setup in Windows) in the top-level gsdl directory of your Greenstone 2 installation. Then specify the path to the collect directory and the collection name as parameters to the conversion script. For example,
    475449
    476450\gst{convert\_coll\_from\_gs2.pl -collectdir \$GSDL3HOME/web/\-sites/\-localsite/\-collect demo}
    477451
    478 The script attempts to create gs3 format statements from the old greenstone 2 ones. The conversion may not always work properly, so if the collection looks a bit strange under greenstone 3, you should check the format statements. Format statements are described in Section~\ref{sec:formatstmt}.
    479 
    480 Once again, to have the collection recognised by the library servlet, you can either restart tomcat, or load it manually by sending the arguments \gst{a=s\&sa=c\&c=collname} to the library servlet.
     452The script attempts to create gs3 format statements from the old greenstone 2 ones. The conversion may not always work properly, so if the collection looks a bit strange under Greenstone 3, you should check the format statements. Format statements are described in Section~\ref{sec:formatstmt}.
     453
     454Once again, to have the collection recognised by the library servlet, you can either restart Tomcat, or load it dynamically.
    481455
    482456\subsection{Collection configuration files}\label{sec:collconfig}
    483457
    484458Each collection has two, or possibly three, configuration files, \gst{collectionConfig.xml} and \gst{buildConfig.xml}, and optionally \gst{collectionInit.xml} that give metadata, display and other information for the
    485 collection.\footnote{\gst{siteConfig.xml} and \gst{interfaceConfig.xml} is new for Greenstone3, while \gst{collectionConfig.xml} and \gst{buildConfig.xml} replace \gst{collect.cfg} and \gst{build.cfg} in
     459collection.\footnote{\gst{collectionConfig.xml} and \gst{buildConfig.xml} replace \gst{collect.cfg} and \gst{build.cfg} in
    486460Greenstone2.}  The first includes user-defined presentation metadata for the collection,
    487461such as its name and the {\em About this collection} text; gives formatting information for the collection display; and also gives
     
    502476\subsubsection{collectionConfig.xml}
    503477
    504 The collection configuration file is where the collection designer (e.g. a librarian) decides what form the collection should take. This includes the collection metadata such as title and description, and also includes what indexes and browsing structures should be built. The format of \gst{collectionConfig.xml} is still under consideration. However, Figure~\ref{fig:collconfig} shows the parts of it that have been defined so far. (Since collection building at this stage is still done using Greenstone2 Perl scripts and the old \gst{collect.cfg} file, we have only defined the format for the parts of \gst{collectionConfig.xml} that are used by the runtime-system.)
     478The collection configuration file is where the collection designer (e.g. a librarian) decides what form the collection should take. This includes the collection metadata such as title and description, and also includes what indexes and browsing structures should be built. The format of \gst{collectionConfig.xml} is still under consideration. However, Figure~\ref{fig:collconfig} shows the parts of it that have been defined so far.
    505479
    506480Display elements for a collection or metadata for a document can be entered in any language---add lang='en' attributes to metadata elements to specify which language they are in.
     
    528502    <displayItem name="icon" lang="en">mgppdemo.gif</displayItem>
    529503  </displayItemList>
    530   <search>
     504  <search type='mgpp'>
    531505    <index name="idx"/>
    532506    <format>
     
    557531\label{fig:collconfig}
    558532\end{figure}
    559 
    560 The \gst{<metadataList>} element specifies some collection metadata, such as creator. The \gst{<displayItemList>} specifies some language dependent information that is used for collection display, such as collection name and short description. These displayItem elements can be specified in different languages. If languages other than English are used, the configuration file should be encoded in utf-8.
     533[TODO: add in building instructions for the config file]
     534
     535The \gst{<metadataList>} element specifies some collection metadata, such as creator. The \gst{<displayItemList>} specifies some language dependent information that is used for collection display, such as collection name and short description. These displayItem elements can be specified in different languages. If languages other than English are used, the configuration file should be encoded in UTF-8.
     536 
     537The \gst{<search>} element specifies what indexes should be built, and provides some display and formatting information for each one. Search has an attribute, type, which specifies which indexer to be used for indexing. Currently, mg and mgpp are available. If type is not specified, mg is used. Multiple search elements may be specified, if more than one indexer is to be used.
     538
     539Search indexes appear as individual \gst{<index>} elements within the \gst{<search>} element. Some choices for the index are made using attributes of the element itself, and some through child elements. 
     540
     541Each index must have a unique name, which is used to identify it within Greenstone.  The name is given as an attribute of the \gst{<index>} element. 
     542
     543The other choices are described using child elements of \gst{<index>}.  The \gst{<level>} tag indicates the index level and the \gst{<field>} tag the text to be used.  The \gst{<level>} tag can contain one of document, section or paragraph, while the \gst{<field>} tag can contain ``text'' or the name of a metadata field.  If the \gst{<level>} tag is omitted, the default setting is to index by document, and if the \gst{<field>} tag is omitted, the default setting is to index the document text.
     544
     545Example index specifications include:
     546
     547To index only the title of each separate document in the collection:
     548\begin{gsc}\begin{verbatim}
     549<index name="dtt">
     550  <level>document</level>
     551  <field>dc:title</field>
     552  <displayItem name='name' lang="en">entire documents</displayItem>
     553  <displayItem name='name' lang="fr">documents entiers</displayItem>
     554  <displayItem name='name' lang="es">documentos enteros</displayItem>
     555</index>
     556\end{verbatim}\end{gsc}
     557...in this case the \gst{<field>} tag refers to the ``title'' metadata item, found in the Dublin Core namespace.  The mg search engine would be used on this index.
     558
     559Alternatively, to index the full document texts by section:
     560\begin{gsc}\begin{verbatim}
     561<index name="stx" type="mgpp">
     562  <level>section</level>
     563  <displayItem name='name' lang="en">entire documents</displayItem>
     564  <displayItem name='name' lang="fr">documents entiers</displayItem>
     565  <displayItem name='name' lang="es">documentos enteros</displayItem>     
     566</index>
     567\end{verbatim}\end{gsc}
     568...or...
     569\begin{gsc}\begin{verbatim}
     570<index name="stx" type="mg">
     571  <level>section</level>
     572  <field>text</field>
     573  <displayItem name='name' lang="en">entire documents</displayItem>
     574  <displayItem name='name' lang="fr">documents entiers</displayItem>
     575  <displayItem name='name' lang="es">documentos enteros</displayItem>
     576</index>
     577\end{verbatim}\end{gsc}
     578...in the first example, the \gst{<field>} tag is not explicitly defined, and would default to 'text', whereas it is explicitly set to 'text' in the second example.  Note the different indexer selected for these two indexes.  As they have the same name, they should not appear in the same \gst{collectionConfig.xml} file.
     579
    561580The \gst{<search>} and \gst{<browse>} elements give some formatting information about the indexes and classifiers. \gst{<displayItem>} elements are used to provide titles for the indexes or classifiers, while \gst{<format>} elements provide formatting instructions, typically for a document or classifier node in a list of results. 
    562581
     582Classifications appear as individual \gst{<classifier>} elements within the \gst{<browse>} element of the \gst{collectionConfig.xml} file.  As with indexes, some choices are made using attributes of the element itself, and some through child elements. 
     583Moving onto \gst{<classifier>} items, the format is broadly similar to \gst{<index>} items, but with a couple of different choices.  Firstly, each classifier should have a ``name'' and ``type'' attribute as with \gst{<index>} tags.  In the case of \gst{<classifier>} items the ``type'' attribute identifies the type of classifier it is.  At present, this should either be ``Hierarchy'' or ``AZList''. 
     584
     585The remaining choices for the classifier should follow as child elements of the \gst{<classifier>} element.  The \gst{<file>} element should contain the name of the file that describes the classifier as its ``URL'' attribute.  The format of this file will be described later - it will vary from classifier type to classifier type.  The \gst{<field>} element identifies the name of the field to index.  More than one \gst{<field>} element may appear if two or more metadata fields are to be used with the classifier.  Finally, the \gst{<sort>} item identifies another metadata field by which the items within one classifier node are to be ordered.  Unlike the \gst{<index>} element, the \gst{<classifier>} element does not have default, assumed values for its children.
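
As a rough sketch of the description above, a classifier entry might look like the following. The name, file name and metadata fields are invented for illustration, and the exact syntax may change since the format is not yet final.

\begin{gsc}\begin{verbatim}
<browse>
  <!-- hypothetical AZList classifier over document titles -->
  <classifier name="CL1" type="AZList">
    <file URL="CL1.xml"/>
    <field>dc:title</field>
    <sort>dc:title</sort>
  </classifier>
</browse>
\end{verbatim}\end{gsc}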
     586
     587Inside the \gst{<search>} and \gst{<browse>} elements, \gst{<displayItem>} elements are used to provide titles for the indexes or classifiers, while \gst{<format>} elements provide formatting instructions, typically for a document or classifier node in a list of results. Placing the \gst{<format>} instructions at the top level in the search or browse element will apply the format to all the indexes or classifiers, while placing it inside an individual index or classifier element will restrict that formatting instruction to that item.
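
For instance, the two placements might be laid out as below. This is an illustrative skeleton only: the index names are placeholders, the template bodies are left out, and it assumes that an index-specific format overrides the shared one in the same way as for classifiers.

\begin{gsc}\begin{verbatim}
<search type='mgpp'>
  <index name="idx">
    <!-- applies to results from this index only -->
    <format>...</format>
  </index>
  <index name="ttl"/>
  <!-- applies to every index in this search element,
       unless an index supplies its own format -->
  <format>...</format>
</search>
\end{verbatim}\end{gsc}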
     588
    563589The \gst{<display>} element contains optional formatting information for the display of documents. Templates that can be specified here include \gst{documentHeading} and \gst{DocumentContent}. Other information that could be specified here (in a yet to be decided format) includes whether or not to display the cover image, the table of contents, etc.
    564590
     591Format elements are described in more detail in Section~\ref{sec:formatstmt}.
     592
    565593\subsection{buildConfig.xml}\label{sec:buildconfig}
    566594
    567 The file \gst{buildConfig.xml} is produced by the collection building process, and contains  metadata and other information about the collection that can
     595The file \gst{buildConfig.xml} is produced by the collection building process. Generally it is not necessary to look at this file, but it can be useful in determining what went wrong if the collection doesn't appear quite the way it was planned.
     596
     597It contains  metadata and other information about the collection that can
    568598be determined automatically,  such as the number of
    569599documents it contains.  It also includes a list of ServiceRack classes that are
     
    790820    </classifier>
    791821    <classifier>...</classifier>
     822    <format><!-- formatting for all the classifiers. these will
     823      be overridden by any classifier specific formatting
     824      instructions --></format>
    792825  </browse>
    793826  <display>
     
    807840The user specifies a \gst{<gsf:template>} for what they want to format---these can match \gst{documentNode} or \gst{classifierNode} (for node in a classification hierarchy).
    808841 
    809 The template above is now represented as:
     842The template at the start of this section is now represented as:
    810843 
    811844\begin{gsc}\begin{verbatim}
     
    843876changing the look and feel for an interface vs a site vs a collection\\
    844877
    845 what needs a tomcat restart?
     878what needs a Tomcat restart?
    846879
    847880\subsubsection{Changing the interface language}
     
    849882The interface language can be changed by going to the preferences page, and choosing a language from the list. The list lists (:-)) all languages in which the interface has been defined  so far.
    850883
    851 It is easy to add a new interface language to greenstone.  Language specific text strings are separated out from the rest of the system to allow for easy incorporation of new languages. These text strings are contained in Java resource bundle properties files. These are plain text files consisting of key-value pairs, located in resources/java. Each interface has one named interface\_name.properties (where name is the interface name). Each service class has one with the same name as the class (e.g. GS2Search.properties). To add another language all of the base .properties  files must be translated. The translated files keep the same names, but with a language extension added. For example, a French version of interface\_default.properties would be named interface\_default\_fr.properties.
     884It is easy to add a new interface language to Greenstone.  Language specific text strings are separated out from the rest of the system to allow for easy incorporation of new languages. These text strings are contained in Java resource bundle properties files. These are plain text files consisting of key-value pairs, located in resources/java. Each interface has one named interface\_name.properties (where `name' is the interface name). Each service class has one with the same name as the class (e.g. GS2Search.properties). To add another language, all of the base .properties  files must be translated. The translated files keep the same names, but with a language extension added. For example, a French version of interface\_default.properties would be named interface\_default\_fr.properties.
    852885
    853886Keys will be looked up in the properties file closest to the specified language. For example, if language fr\_CA was specified (French language, country Canada), and the default locale was en\_GB,  Java would look at properties files in the following order, until it found the key: XXX\_fr\_CA.properties, XXX\_fr.properties,  XXX\_en\_GB.properties, then XXX\_en.properties, and finally the default XXX.properties.
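
To illustrate the lookup, suppose the default interface has the following two files. The keys and strings here are made up for the example and are not taken from the real Greenstone properties files.

\begin{gsc}\begin{verbatim}
# interface_default.properties (base file)
home.title=Home
home.help=Help

# interface_default_fr.properties (French translation)
home.title=Accueil
\end{verbatim}\end{gsc}

With language \gst{fr} selected, \gst{home.title} would come from the French file, while \gst{home.help}, which has no French entry, would eventually be found in the base file.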
    854887
    855 You can tell Greenstone about a new language by ... currently in interfaceConfig.
     888You can tell Greenstone about a new language by adding it in to the languageList in the interfaceConfig.xml file. This will add it in to the list of languages on the preferences page. Modification of this file requires a restart of the Tomcat server for the changes to be recognised.
    856889
    857890
     
    868901A new interface needs a directory in \$GSDL3HOME/web/interfaces, the name of this directory becomes the interface name. Inside, it needs images and transform directories,  and an interfaceConfig.xml file. Any XSLT may be overridden for a new interface by putting the replacement in the new transform directory. If the appropriate XSLT file is not there, the  one from the default interface will be used - this enables just overriding a few XSLT files as needed.
    869902
    870 To use a new interface, the tomcat web.xml must be edited: either change the interface that a current version of the servlet is using, or add another servlet instantiation to the file (see Section~\ref{sec:sites-and-ints} or Appendix~\ref{app:tomcat}). The Tomcat server must be restarted for this to take effect.
     903To use a new interface, the Tomcat web.xml must be edited: either change the interface that a current version of the servlet is using, or add another servlet instantiation to the file (see Section~\ref{sec:sites-and-ints} or Appendix~\ref{app:tomcat}). The Tomcat server must be restarted for this to take effect.
    871904
    872905\newpage
     
    21362169\end{gsc}\end{quote}
    21372170
    2138 We have set up tomcat to disallow directory listings for everything in the docBase directory.  To turn this back on, you need to edit Tomcat's default web.xml file (\$GSDL3HOME/comms/jakarta/tomcat/conf/web.xml):
     2171We have set up Tomcat to disallow directory listings for everything in the docBase directory.  To turn this back on, you need to edit Tomcat's default web.xml file (\$GSDL3HOME/comms/jakarta/tomcat/conf/web.xml):
    21392172
    21402173In the default servlet definition, change the 'listings' parameter to true.
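
In Tomcat's \gst{web.xml} the relevant part of the default servlet definition typically looks like the following once edited; the surrounding elements are abbreviated here and may differ between Tomcat versions.

\begin{gsc}\begin{verbatim}
<servlet>
  <servlet-name>default</servlet-name>
  ...
  <init-param>
    <param-name>listings</param-name>
    <param-value>true</param-value>
  </init-param>
  ...
</servlet>
\end{verbatim}\end{gsc}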
     
    21422175Tomcat uses a Manager to handle HTTP session information. This may be stored between restarts if possible. To use a persistent session handling manager, uncomment the \gst{<Manager>} element in \gst{\$GSDL3HOME/comms/jakarta/tomcat/conf/server.xml}. For the default manager, session information is stored in the work directory: \gst{\$GSDL3HOME/comms/jakarta/tomcat/work/Standalone/localhost/gsdl3/SESSIONS.ser}. Delete this file to clear the cached session info.
    21432176
    2144 \subsection{Proxying tomcat with apache}
    2145 
    2146 Instead of incorporating servlet support into your existing web server, an easy alternative is to proxy tomcat. The \gst{http://www.greenstone.org/greenstone3} site uses apache to proxy Tomcat. ProxyPass and ProxyPassReverse directives need to be added to the Virtualhost description for the www.greenstone.org server.
     2177\subsection{Proxying Tomcat with apache}
     2178
     2179Instead of incorporating servlet support into your existing web server, an easy alternative is to proxy Tomcat. The \gst{http://www.greenstone.org/greenstone3} site uses apache to proxy Tomcat. ProxyPass and ProxyPassReverse directives need to be added to the Virtualhost description for the www.greenstone.org server.
    21472180
    21482181\begin{quote}\begin{gsc}
     
    21572190In our example, the Greenstone 3 servlet can be accessed at \gst{http://www.greenstone.org/greenstone3/library}, instead of at \gst{http://puka.cs.waikato.ac.nz:8080/gsdl3/library}, which is not publicly accessible.
    21582191
    2159 \subsection{Running tomcat behind a proxy}
    2160 
    2161 Almost everything works fine when tomcat is running behind a proxy. The only time this causes trouble is if the servlet itself needs to make external http connections. We do this in the infomine demo collection for example. One of the service classes sends http requests to the infomine database at riverside. Since this is going through the proxy, a username and password is needed. It is not sufficient to prompt the user for a password because they are unlikely to have a password for the particular proxy that tomcat is using. What we have done at present is to put a proxy element in the siteConfig.xml file. Here you have to enter a suitable username and password for the proxy server. Unfortunately these are entered in plain text. And the file is viewable via the servlet. So we need a better solution.
     2192\subsection{Running Tomcat behind a proxy}
     2193
     2194Almost everything works fine when Tomcat is running behind a proxy. The only time this causes trouble is if the servlet itself needs to make external http connections. We do this in the infomine demo collection for example. One of the service classes sends http requests to the infomine database at riverside. Since this is going through the proxy, a username and password are needed. It is not sufficient to prompt the user for a password because they are unlikely to have a password for the particular proxy that Tomcat is using. What we have done at present is to put a proxy element in the siteConfig.xml file. Here you have to enter a suitable username and password for the proxy server. Unfortunately these are entered in plain text, and the file is viewable via the servlet, so we need a better solution.
    21622195
    21632196\newpage