\documentclass[a4paper,11pt]{article}
\usepackage{times}
\usepackage{graphicx}

\hyphenation{Message-Router Text-Query}

\newenvironment{gsc}% Greenstone text bits
{\begin{footnotesize}\begin{tt}}%
{\end{tt}\end{footnotesize}}

\newcommand{\gst}[1]{{\footnotesize \tt #1}}

\newcommand{\gsii}{Greenstone2}
\newcommand{\gsiii}{Greenstone3}
\newcommand{\gs}{Greenstone}

\begin{document}

\title{\gsiii\ : A modular digital library.}

% if you work on this manual, add your name here
\author{Katherine Don \\[1ex]
  Department of Computer Science \\
  University of Waikato \\
  Hamilton, New Zealand \\
}

\date{}

\maketitle

\newenvironment{bulletedlist}%
{\begin{list}{$\bullet$}{\setlength{\itemsep}{0pt}\setlength{\parsep}{0pt}}}%
{\end{list}}

\noindent
Greenstone Digital Library Version 3 is a complete redesign and reimplementation of the \gs\ digital library software.
The current version (\gsii) enjoys considerable success and is being widely used.
\gsiii\ will capitalize on this success, and in addition it will
\begin{bulletedlist}
\item improve flexibility, modularity, and extensibility
\item lower the bar for ``getting into'' the \gs\ code with a view to understanding and extending it
\item use XML where possible internally to improve the amount of self-documentation
\item make full use of existing XML-related standards and software
\item provide improved internationalization, particularly in terms of sort order, information browsing, etc.
\item include new features that facilitate additional ``content management'' operations
\item operate on a scale ranging from personal desktop to corporate library
\item easily permit the incorporation of text mining operations
\item use Java, to encourage multilinguality, X-compatibility, and to permit easier inclusion of existing Java code (such as for text mining).
\end{bulletedlist}
Parts of \gs\ will remain in other languages (e.g. MG, MGPP); JNI (Java Native Interface) will be used to communicate with these.
The general design and architecture of \gsiii\ are described in the document {\em The design of Greenstone3: An agent based dynamic digital library} (design-2002.ps, in the docs/manual directory).

This documentation consists of several parts.
Section~\ref{sec:install} is for administrators, and covers \gsiii\ installation, how to access the library, and some administration issues.
Section~\ref{sec:user} is for users of the software, and looks at using the sample collections, creating new collections, and how to make small customizations to the interface.
The remaining sections are aimed at the \gs\ developer.
Section~\ref{sec:develop-runtime} describes the run-time system, including the structure of the software and the message format.
Section~\ref{sec:new-features} describes how to add new features to \gs, such as new services, new page types, and new plugins for different document formats.
Section~\ref{sec:distributed} describes how to make \gs\ run in a distributed fashion, using SOAP as an example communications protocol.
Finally, there are several appendices, including how to install \gs\ from CVS, some notes on Tomcat and SOAP, and a comparison of \gsii\ and \gsiii\ format statements.

\newpage
\tableofcontents
\newpage

\section{\gs\ installation and administration}\label{sec:install}

This section covers where to get \gsiii\ from, how to install it and how to run it.

The standard method of running \gsiii\ is as a Java servlet.
We provide the Tomcat servlet container to run the servlet.
It may be possible to configure a standard web server to provide servlet support, which removes the need to use Tomcat; please see your web server documentation for this.
This documentation assumes that you are using Tomcat.
To access \gsiii, Tomcat must be started up, and the library can then be accessed via a web browser.

Ant (a Java-based build tool configured through XML files) is used for compiling, installing and running \gs.
The \gst{build.xml} file is the configuration file for the Greenstone project, and \gst{build.properties} contains parameters that can be altered by the user.

\subsection{Get and install \gs\ }\label{sec:getandinstall}

\gsiii\ is available for download from SourceForge:\\
\gst{https://sourceforge.net/projects/greenstone3}.

There are Windows, Linux, and source releases.
The binary releases are self-installing executables: download and run the file to install.
A series of prompts will guide you through the installation process.
The source release is a gzipped tar file.
Unzip and untar this, check \gst{build.properties}, then run \gst{ant install} to configure and compile the code.

The \gsiii\ library can be launched by running the server program.
This is accessible from the Start menu on Windows, or by running the \gst{gs3-server.sh} (or \gst{gs3-server.bat}) script in the top-level \gst{greenstone3} directory.
This program will start up the Tomcat web server and launch a browser.
Alternatively, you can start it up using Ant: run \gst{ant start}, which starts up Tomcat, then in a browser go to \gst{http://localhost:8080/greenstone3}\\
(or \gst{http://your-computer-name:your-chosen-port/greenstone3}). \\
This gets you to a welcome page containing links to four servlets:
the \gst{test} servlet, which allows you to check that Tomcat is running properly;
the standard \gst{library} servlet, which serves the \gst{localsite} site with the \gst{gs2} interface;
the \gst{gs3library} servlet, which serves \gst{localsite} using the \gst{default} \gsiii-style interface;
and the \gst{gateway} servlet, which serves the \gst{gateway} site with the \gst{default} interface.
The \gst{gateway} site uses a SOAP connection to communicate with \gst{localsite}, and demonstrates the library working in a distributed fashion.
The SOAP connection is not enabled by default; to enable it, run \gst{ant deploy-localsite}.

\gsiii\ is also available through CVS (Concurrent Versions System).
This provides the latest development version, and is not guaranteed to be stable.
Appendix~\ref{app:cvs} describes how to download and install \gsiii\ from CVS.

\subsection{How the library works}

The standard library program is a Java servlet.
We use the Tomcat servlet container to present the servlets over the web.
Tomcat takes CGI-style URLs and passes the arguments to the servlet, which processes these and returns a page of HTML.
As far as an end-user is concerned, a servlet is a Java version of a CGI program.
The interaction is similar: access is via a web browser, using arguments in a URL.

Other types of interfaces can be used, such as Java GUI programs.
See Section~\ref{sec:new-interfaces} for details about how to make these.

\subsubsection{Restarting the library}

You can restart Tomcat by clicking `Restart Server' in the server program.

You should restart the server whenever you change any of the following, so that the changes take effect:\\
\begin{bulletedlist}
\begin{gsc}
\item \$GSDL3HOME/WEB-INF/web.xml
\item \$GSDL3SRCHOME/packages/tomcat/conf/server.xml
\end{gsc}
\item any classes or jar files used by the servlets
\end{bulletedlist}

\subsection{Directory structure}

Table~\ref{tab:dirs} shows the file hierarchy for \gsiii.
The first part shows the common files which can be shared between \gs\ users---the source, libraries etc.
The second part shows the file hierarchy for the web directory, which comprises the greenstone3 context for Tomcat, and is accessible via Tomcat.
The main directories are for sites and interfaces: there can be several sites and interfaces per installation, and they are described in the following section.

Two environment variables used by \gsiii\ are often mentioned in this manual: \gst{\$GSDL3SRCHOME} and \gst{\$GSDL3HOME}.
\gst{\$GSDL3SRCHOME} refers to the top-level \gst{greenstone3} directory, while \gst{\$GSDL3HOME} refers to the \gst{web} directory.
The web directory contains everything needed to serve the \gsiii\ library using Tomcat, and doesn't necessarily need to live with the rest of the \gsiii\ source. \begin{table} \caption{The \gs\ directory structure} \label{tab:dirs} {\footnotesize \begin{tabular}{l p{8cm}} \hline \bf directory & \bf description \\ \hline greenstone3 & The main installation directory---\$GSDL3SRCHOME is set to this directory \\ greenstone3/src & Source code lives here \\ greenstone3/src/java/ & main \gsiii\ java source code \\ greenstone3/src/packages & Imported source packages from other systems e.g. indexing packages may go here \\ greenstone3/lib & Shared library files\\ greenstone3/lib/java & Java jar files not needed in the \gsiii\ runtime\\ greenstone3/lib/jni & Jar files and shared library files (.so, .jnilib, .dll) needed for JNI components \\ greenstone3/resources & any resources that may be needed\\ greenstone3/resources/soap & soap service description files \\ greenstone3/bin & executable stuff lives here\\ greenstone3/bin/script & some Perl and/or shell scripts\\ greenstone3/packages & External packages that may be installed as part of greenstone, e.g. Tomcat \\ greenstone3/docs & Documentation\\ greenstone3/gli & \gs\ Librarian Interface code \\ greenstone3/gs2build & collection building code\\ \hline greenstone3/web & This is where the web site is defined. Any static HTML files can go here. This directory is the root directory used by Tomcat when serving \gsiii. \$GSDL3HOME is set to this directory. \\ greenstone3/web/WEB-INF & The web.xml file lives here (servlet configuration information for Tomcat)\\ greenstone3/web/WEB-INF/classes & Individual class files needed by the servlet go in here, also properties files for java resource bundles - used to handle all the language specific text. 
This directory is on the servlet classpath\\
greenstone3/web/WEB-INF/lib & jar files needed by the servlets go here \\
greenstone3/web/sites & Contains directories for different sites---a site is a set of collections and services served by a single MessageRouter (MR). The MR may have connections (e.g. SOAP) to other sites\\
greenstone3/web/sites/localsite & An example site; the site configuration file lives here\\
greenstone3/web/sites/localsite/collect & The collections directory \\
greenstone3/web/sites/localsite/images & Site specific images \\
greenstone3/web/sites/localsite/transforms & Site specific transforms \\
greenstone3/web/interfaces & Contains directories for different interfaces; an interface is defined by its images and XSLT files \\
greenstone3/web/interfaces/default & The default interface\\
greenstone3/web/interfaces/default/images & The images for the default interface\\
greenstone3/web/interfaces/default/js & The JavaScript libraries for the default interface\\
greenstone3/web/interfaces/default/style & The CSS stylesheets for the default interface\\
greenstone3/web/interfaces/default/transforms & The XSLT files for the default interface\\
greenstone3/web/applet & jar files needed by applets can go here \\
\hline
\end{tabular}}
\end{table}

\subsection{Sites and interfaces}\label{sec:sites-and-ints}

Sites and interfaces contain the content and presentation information, respectively, for the digital library.
A site comprises a set of collections and possibly some site-wide services.
An interface (in this web-based servlet context) is a set of images along with a set of XSLT files used for translating XML output from the library into an appropriate form---HTML in general.

One \gsiii\ installation can have many sites and interfaces, and these can be paired in different combinations.
One instantiation of a servlet uses one site and one interface, so every specified pairing results in a new servlet instance.
For example, a single site might be served with two different interfaces.
This provides different modes of access to the same content, e.g. HTML vs WML, or a completely different look and feel for different audiences.
Alternatively, a standard interface may be used with many different sites---providing a consistent mode of access to a lot of different content.

Collections live in the \gst{collect} directory of a site.
Any collections that are found in this directory when the servlet is initialized will be loaded up.
Public collections will appear on the library home page, while private collections will be hidden; private collections can still be accessed by typing in CGI arguments.
Collections require valid configuration files, but apart from this, nothing needs to be done to the site to use new collections.
Collections added while Tomcat is running will not be noticed automatically.
Either the server needs to be restarted, or a configuration request may be sent to the library, triggering a (re)load of the collection (this is described in Section~\ref{sec:runtime-config}).

There are two sites that come with the distribution: \gst{localsite} and \gst{gateway}.
\gst{localsite} has several demo collections, while \gst{gateway} has none.
\gst{gateway} specifies that a SOAP connection should be made to \gst{localsite}.
Getting this to work involves setting up a SOAP server for \gst{localsite}: see Section~\ref{sec:distributed} for details.
There are also two interfaces provided in the distribution: \gst{default} and \gst{gs2}.
The \gst{default} interface is a generic \gsiii\ interface, while the \gst{gs2} interface aims to look like the old \gsii\ interface.
Each site and interface has a configuration file which specifies parameters for the site or interface---these are described in Section~\ref{sec:config}.

\subsection{Configuring Tomcat}\label{sec:tomcat-config}

The file \gst{\$GSDL3HOME/WEB-INF/web.xml} contains the configuration information for Tomcat.
It tells Tomcat what servlets to load, what initial parameters to pass them, and what web names map to the servlets. There are four servlets specified in web.xml (these correspond to the four servlet links in the welcome page for \gsiii): one is a test servlet that just prints ``hello greenstone'' to a web page. This is useful if you are having trouble getting Tomcat set up. The other three are the \gs\ library servlets described in Section~\ref{sec:getandinstall}, \gst{library}, \gst{gs3library} and \gst{gateway}. Each servlet must specify which site and which interface to use. Having multiple servlets provides a way of serving different sites, or the same site with a different style of presentation. \gst{site\_name} and \gst{interface\_name} are just two examples of initialization parameters used by the library servlets. The full list is shown in Table~\ref{tab:serv-init}. For more details about Tomcat see Appendix~\ref{app:tomcat}. \begin{table} \caption{\gs\ servlet initialization parameters} \label{tab:serv-init} {\footnotesize \begin{tabular}{lp{3.5cm}p{6cm}} \hline \bf name & \bf sample value & \bf description \\ \hline library\_name & library & the web name of the servlet \\ interface\_name & default & the name of the interface to use\\ site\_name & localsite & the name of the local site to use (use either site\_name or the three remote\_site parameters)\\ remote\_site\_name & org.greenstone.site1 & the name of a remote site (can be anything??) 
\\
remote\_site\_type & soap & the type of server running on the site \\
remote\_site\_address & http://www.greenstone.org/ greenstone3/services/ localsite & The address of the server \\
default\_lang & en & the default language for the interface\\
receptionist\_class & MyReceptionist & (optional) specifies an alternative Receptionist to use (default is DefaultReceptionist)\\
messagerouter\_class & NewMessageRouter & (optional) specifies an alternative MessageRouter to use (default is MessageRouter)\\
params\_class & GS2Params & (optional) specifies an alternative GSParams class to use \\
\hline
\end{tabular}}
\end{table}

\subsection{Configuring a \gs\ library}\label{sec:config}

Initial \gsiii\ system configuration is determined by a set of XML configuration files.
Each site has a configuration file, \gst{siteConfig.xml}, that binds parameters for the site.
Each interface has a configuration file, \gst{interfaceConfig.xml}, that specifies parameters for the interface.
Collections also have several configuration files; these are discussed in Section~\ref{sec:collconfig}.

The configuration files are read in when the system is initialized, and their contents are cached in memory.
This means that changes made to these files once the system is running will not take immediate effect.
Tomcat needs to be restarted for changes to the interface configuration file to take effect.
However, changes to the site configuration file can be incorporated by sending a system command to the library.
There is a series of system commands that can be sent to the library to induce reconfiguration of different modules, including reloading the whole site.
This removes the need to restart the system to reflect these changes.
These commands are described in Section~\ref{sec:runtime-config}.
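As a rough illustration of such a system command, the following Python sketch builds the URL for a configure request. The argument names (\gst{a}, \gst{sa}, \gst{ss}, \gst{sc}) are the ones listed in Section~\ref{sec:runtime-config}. The base address \gst{http://localhost:8080/greenstone3} and the \gst{library} servlet name are the defaults from Section~\ref{sec:getandinstall} and are assumptions about your installation. The helper name \gst{reconfigure\_url} is hypothetical and is not part of \gs.

```python
from urllib.parse import urlencode

# Assumption: a local Tomcat on port 8080 serving the standard "library" servlet.
DEFAULT_SERVLET_URL = "http://localhost:8080/greenstone3/library"


def reconfigure_url(subset=None, target=None, servlet_url=DEFAULT_SERVLET_URL):
    """Build the URL for a configure command (a=s, sa=c).

    subset -- optional value for ss (for example "collectionList"),
              to reconfigure only part of the site or collection
    target -- optional collection or cluster name for sc; when omitted,
              the request goes to the MessageRouter
    """
    args = [("a", "s"), ("sa", "c")]
    if target is not None:
        args.append(("sc", target))
    if subset is not None:
        args.append(("ss", subset))
    return servlet_url + "?" + urlencode(args)


if __name__ == "__main__":
    # Reconfigure the whole site, then only its collection list.
    print(reconfigure_url())
    print(reconfigure_url(subset="collectionList"))
```

The resulting address can be pasted into a browser; the sketch only assembles the string and does not contact the server.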
\subsubsection{Site configuration file}\label{sec:siteconfig} The file \gst{siteConfig.xml} specifies the URI for the site (\gst{localSiteName}), the HTTP address for site resources (\gst{httpAddress}), any \gst{ServiceClusters} that the site provides (for example, collection building), any \gst{ServiceRacks} that do not belong to a cluster or collection, and a list of known external sites to connect to. Collections are not specified in the site configuration file, but are determined by the contents of the site's collect directory. The HTTP address is used for retrieving resources from a site outside the XML protocol. Because a site is HTTP accessible through Tomcat, any files (e.g. images) belonging to that site or to its collections can be specified in the HTML of a page by a URL. This avoids having to retrieve these files from a remote site via the XML protocol\footnote{Currently, sites live inside the Tomcat greenstone3 root context, and therefore all their content is accessible over HTTP via the Tomcat address. We need to see if parts can be restricted. Also, if we use a different protocol, then resources from remote sites may need to come through the XML. Also, if we are running locally without using Tomcat, we may want to get them via file:// rather than http://.}. Figure~\ref{fig:siteconfig} shows two example site configuration files. The first example is for a rudimentary site with no site-wide services, which does not connect to any external sites. The second example is for a site with one site-wide service cluster - a collection building cluster. It also connects to the first site using SOAP. These two sites happen to be running on the same machine, which is why they can use \gst{localhost} in the address. For site \gst{gsdl1} to talk to site \gst{localsite}, a SOAP server must be run for \gst{localsite}. The address of the SOAP server, in this case, is \gst{http://localhost:8080/greenstone3/services/localsite}. 
\begin{figure} \begin{gsc}\begin{verbatim} \end{verbatim}\end{gsc} \begin{gsc}\begin{verbatim} Collection builder Builds collections in a gsdl2-style manner \end{verbatim}\end{gsc} \caption{Two sample site configuration files} \label{fig:siteconfig} \end{figure} Another element that can appear in a site configuration file is \gst{replaceList}. This must have an \gst{id} attribute, and may contain one or more \gst{replace} elements. Replace elements are discussed in Section \ref{sec:collconfig}. The list found in a \gst{siteConfig.xml} file can be applied to any collection by adding a \gst{replaceListRef} element (with the appropriate \gst{id} attribute) to its \gst{collectionConfig.xml} file. \subsubsection{Interface configuration file}\label{sec:interfaceconfig} The interface configuration file \gst{interfaceConfig.xml} lists all the actions that the interface knows about at the start (other ones can be loaded dynamically). Actions create the web pages for the library: there is generally one Action per type of page. For example, a query action produces the pages for searching, while a document action displays the documents. The configuration file specifies what short name each action maps to (this is used in library URLs for the a (action) parameter) e.g. QueryAction should use \gst{a=q}. If the interface uses XSLT, it specifies what XSLT file should be used for each action and possibly each subaction. This makes it easy for developers to implement and use different actions and/or XSLT files without recompilation. The server must be restarted, however. It also lists all the languages that the interface text files have been translated into. These have a \gst{name} attribute, which is the ISO code for the language, and a \gst{displayElement} which gives the language name in that language (note that this file should be encoded in UTF-8). This language list is used on the Preferences page to allow the user to change the interface language. 
Details on how to add a new language to a \gsiii\ library are shown in Section~\ref{sec:interface-language}. An \gst{optionList} element can be used to disable or enable some optional functionality for the interface. Currently there are three options that can be enabled: \begin{tabular}{lp{7cm}} highlightQueryTerms & Whether search term highlighting is available or not\\ berryBaskets & Whether berry basket functionality is available or not\\ displayAnnotationService & Whether any annotation services (specified in the site config file) should be displayed with a document or not. \\ \end{tabular} An interface may be based on an existing one, for example, the gs2 interface is based on the default interface. This means that it will use any images or templates from the base one unless overridden in the current one. The \gst{baseInterface} attribute of the \gst{} element is used to specify the base interface. \begin{figure} \begin{gsc}\begin{verbatim} English Français Español \end{verbatim}\end{gsc} \caption{Default interface configuration file} \label{fig:ifaceconfig} \end{figure} \subsection{Run-time re-initialization}\label{sec:runtime-config} When Tomcat is started up, the site and interface configuration files are read in, and actions/services/collections loaded as necessary. The configuration is then static unless Tomcat is restarted, or re-configuration commands issued. There are several commands that can be issued to Tomcat to avoid having to restart the server. These can reload the entire site, or just individual collections. Unfortunately at present there are no commands to reconfigure the interface, so if the interface configuration file has changed, Tomcat must be restarted for those changes to take effect. Similarly, if the Java classes are modified, Tomcat must be restarted then too. Currently, the runtime configuration commands can only be accessed by typing arguments into the URL; there is no nice web form yet to do this. 
The arguments are entered after the \gst{library?} part of the URL.

There are three types of commands: configure, activate, and deactivate.
These are specified by \gst{a=s\&sa=c}, \gst{a=s\&sa=a}, and \gst{a=s\&sa=d}, respectively (\gst{a} is action, \gst{sa} is subaction).
By default, the requests are sent to the MessageRouter, but they can be sent to a collection or cluster by adding \gst{sc=xxx}, where \gst{xxx} is the name of the collection or cluster.
Table~\ref{tab:run-time config} describes the commands and arguments in more detail.

\begin{table}
\caption{Example run-time configuration arguments.}
\label{tab:run-time config}
{\footnotesize
\begin{tabular}{lp{9cm}}
\hline
\gst{a=s\&sa=c} & reconfigures the whole site. Reads in siteConfig.xml, reloads all the collections. Just part of this can be specified with another argument \gst{ss} (system subset). The valid values are \gst{collectionList}, \gst{siteList}, \gst{serviceList}, \gst{clusterList}. \\
\gst{a=s\&sa=c\&sc=XXX} & reconfigures the XXX collection or cluster. \gst{ss} can also be used here; valid values are \gst{metadataList} and \gst{serviceList}. \\
\gst{a=s\&sa=a} & (re)activates a specific module. Modules are specified using two arguments, \gst{st} (system module type) and \gst{sn} (system module name). Valid types are \gst{collection}, \gst{cluster}, \gst{site}.\\
\gst{a=s\&sa=d} & deactivates a module. \gst{st} and \gst{sn} can be used here too. Valid types are \gst{collection}, \gst{cluster}, \gst{site}, \gst{service}. Modules are removed from the current configuration, but will reappear if Tomcat is restarted.\\
\gst{a=s\&sa=d\&sc=XXX} & deactivates a module belonging to the XXX collection or cluster. \gst{st} and \gst{sn} can be used here too. The only valid type is \gst{service}. \\
\hline
\end{tabular}}
\end{table}

\newpage
\section{Using \gsiii\ }\label{sec:user}

Once \gsiii\ is installed, the sample collections can be accessed.
The installation comes with several example collections, and Section~\ref{sec:usecolls} describes these collections and how to use them.
Section~\ref{sec:buildcol} describes how to build new collections.

\subsection{Using a collection}\label{sec:usecolls}

A collection typically consists of a set of documents, which could be text, HTML, Word, PDF, images, bibliographic records etc., along with some access methods, or ``services''.
Typical access methods include searching or browsing for document identifiers, and retrieval of content or metadata for those identifiers.
Searching involves entering words or phrases and getting back lists of documents that contain those words.
The search terms may be restricted to particular fields of the document.
Browsing involves navigating pre-defined hierarchies of documents, following links of interest to find documents.
The hierarchies may be constructed on different metadata fields, for example, alphabetical lists of Titles, or a hierarchy of Subject classifications.
Clicking on a bookshelf icon takes you to a lower level in the hierarchy, while clicking on a book or page icon takes you to a document.

In the standard interface that comes with \gsiii\footnote{Of course, this is all customizable.}, collections in a digital library are presented in the following manner.
The ``home'' page of the library shows a list of all the public collections in that library.
Clicking on a collection link takes you to the home page for the collection, which we call the collection's ``about'' page.
The standard page banner for a collection looks something like that shown in Figure~\ref{fig:page-banner}.

\begin{figure}[h]
\centering
\includegraphics[width=4in]{pagebanner} %5.8
\caption{A sample collection page banner}
\label{fig:page-banner}
\end{figure}

The image at the top left is a link to the collection's home page.
The top right has buttons to link to the library home page, help and preferences pages.
All the available services are arrayed along a navigation bar, along the bottom of the banner. Clicking on a name accesses that service. Search type services generally provide a form to fill in, with parameters including what field or granularity to search, and the query itself. Clicking the search button carries out the search, and a list of matching documents will be displayed. Clicking on the icons in the result list takes you to the document itself. Once you are looking at a document, clicking the open book icon at the top of the document, underneath the navigation bar, will take you back to the service page that you accessed the document from. \subsection{Building a collection}\label{sec:buildcol} There are three ways to get a new collection into \gsiii. The most common way is to use the Greenstone Librarian Interface to create a collection. If you have existing collections in a \gsii\ installation, these can be imported into \gsiii. Thirdly, you can use the Perl command line building scripts directly. Collections live in the \gst{collect} directory of a site. As described in Section~\ref{sec:sites-and-ints}, there can be several sites per \gsiii\ installation. The collect directory is at \gst{\$GSDL3HOME/sites/site-name/collect}, where site-name is the name of the site you want your new collection to belong to. The following three sections briefly describe how to create a collection using GLI, how to import a collection from \gsii, and how to use command line building. Once a collection has been built (and is located in the collect directory), the library server needs to be notified that there is a new collection. This can be accomplished in two ways\footnote{and eventually there will also probably be automatic polling for new collections}. If you are the library administrator, you can restart Tomcat. The library servlet will then be created afresh, and will discover the new collection when it scans the collect directory for the collection list. 
Alternatively, an activate collection command can be issued to the servlet, using the arguments \gst{a=s\&sa=a\&st=collection\&sn=collname}, where \gst{collname} should be replaced with the collection name---this tells the library program to (re)load the \gst{collname} collection. \subsubsection{Using the Librarian Interface} The Greenstone Librarian Interface (GLI) can be used to create collections. The procedure is the same as for \gsii, but it works in a \gsiii\ context. It can be started under Windows by selecting Greenstone Librarian Interface from the Greenstone 3 Digital Library menu in the Program Files section of the Start menu. On Linux, run \gst{ant gli} from the \gst{greenstone3} directory, or run \gst{./gli4gs3.sh} from the \gst{\$GSDL3SRCHOME/gli} directory. Currently, the GLI works almost exactly the same as for \gsii\footnote{Eventually the GLI will be modified to use \gsiii\ XML configuration files.}. Collection configuration is done in a \gsii\ manner. The main difference is that \gsiii\ has different sites and interfaces and servlets, whereas \gsii\ has a single collect directory, and a single runtime cgi program. The GLI for \gsiii\ has a couple of new configuration parameters: site and servlet. It operates within a single site---you can edit, delete, and create new collections within this site. A servlet is also specified for that site---this is used when previewing a collection. While you are working in one site, you cannot edit collections from another site. However, you can base a collection on one from another site. To change the working site and/or servlet, go to Preferences-$>$Connection in the File menu. By default, the GLI will use site \gst{localsite}, and servlet \gst{library}. Collection building using the GLI will use the \gsii\ Perl scripts and plugins. At the conclusion of the \gsii\ build process, a conversion script will be run to create the \gsiii\ configuration files. 
This means that format statements are no longer ``live''---changing these will require changes to the \gsiii\ configuration files.
Clicking the Preview Collection button will re-run the configuration file conversion script.
If you change anything on the Format panel, you will need to click Preview Collection.
Just reloading the collection via a browser will not be enough.

Detailed instructions about using the GLI can be found in Sections 3.1 and 3.2 of the \gsii\ User's Guide (\gst{GS2-User-en.pdf}).
This can be found in your \gsii\ installation, or in the \gst{\$GSDL3SRCHOME/docs/manual} directory if you have installed \gsiii\ from a distribution.

\subsubsection{Importing from \gsii}

Pre-built \gsii\ collections can also be used in \gsiii.
The collection folder should be copied to the collect directory of the site it is to appear in (or a symbolic link may be used if possible).
The \gsiii\ run-time system requires different configuration files for a collection, so you need to run a conversion script.
All this does is create the new \gst{collectionConfig.xml} and \gst{buildConfig.xml} from the old \gst{collect.cfg} and \gst{build.cfg} files.
It does not change the collection in any way, so it can still be used by \gsii\ software.

The conversion script is \gst{convert\_coll\_from\_gs2.pl}.
To run it, make sure you have run \gst{source setup.bash} (or \gst{setup} in Windows) in the \gst{\$GSDL3SRCHOME/gs2build} directory (as well as running the standard \gst{gs3-setup} command).
Then you need to specify the path to the collect directory and the collection name as parameters to the conversion script.
For example,
\begin{gsc}
\begin{verbatim}
convert_coll_from_gs2.pl -collectdir $GSDL3HOME/sites/localsite/collect gs2mgdemo
\end{verbatim}
\end{gsc}
%$
The script attempts to create \gsiii\ format statements from the old \gsii\ ones.
The conversion may not always work properly, so if the collection looks a bit strange under \gsiii, you should check the format statements.
Format statements are described in Section~\ref{sec:formatstmt}. Once again, to have the collection recognized by the library servlet, you can either restart Tomcat, or load it dynamically.

\subsubsection{Using command line building}

This is the same procedure as for \gsii\ command line building, with the addition of a final step to create the \gsiii\ configuration files. The basic steps are (for a new collection called testcol):

Linux:
\begin{gsc}
\begin{verbatim}
cd greenstone3
source gs3-setup.sh
cd gs2build
source setup.bash
cd ../
mkcol.pl -collectdir $GSDL3HOME/sites/localsite/collect testcol
put source documents and metadata into
  $GSDL3HOME/sites/localsite/collect/testcol/import
edit $GSDL3HOME/sites/localsite/collect/testcol/etc/collect.cfg
  as appropriate
import.pl -collectdir $GSDL3HOME/sites/localsite/collect testcol
buildcol.pl -collectdir $GSDL3HOME/sites/localsite/collect testcol
rename the $GSDL3HOME/sites/localsite/collect/testcol/building
  directory to index
convert_coll_from_gs2.pl -collectdir $GSDL3HOME/sites/localsite/collect testcol
\end{verbatim}
\end{gsc}
%$

Windows:
\begin{gsc}
\begin{verbatim}
cd greenstone3
gs3-setup
cd gs2build
setup
cd ..
perl -S mkcol.pl -collectdir %GSDL3HOME%\sites\localsite\collect testcol
put source documents and metadata into
  %GSDL3HOME%\sites\localsite\collect\testcol\import
edit %GSDL3HOME%\sites\localsite\collect\testcol\etc\collect.cfg
  as appropriate
perl -S import.pl -collectdir %GSDL3HOME%\sites\localsite\collect testcol
perl -S buildcol.pl -collectdir %GSDL3HOME%\sites\localsite\collect testcol
rename the %GSDL3HOME%\sites\localsite\collect\testcol\building
  directory to index
perl -S convert_coll_from_gs2.pl -collectdir %GSDL3HOME%\sites\localsite\collect testcol
\end{verbatim}
\end{gsc}

Once the build process is complete, Tomcat should be prompted to reload the collection, either by restarting the server, or by sending an activate collection command to the library servlet.
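
As an illustration, the activate request for the \gst{testcol} collection could be sent by visiting a URL of the following form in a browser. The host, port and servlet path shown here (\gst{localhost:8080} and \gst{/greenstone3/library}) are assumptions about a default local installation; substitute the address at which your library servlet is actually served.

\begin{gsc}
\begin{verbatim}
http://localhost:8080/greenstone3/library?a=s&sa=a&st=collection&sn=testcol
\end{verbatim}
\end{gsc}
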
Metadata for documents can be added using \gst{metadata.xml} files. A \gst{metadata.xml} file has a root element of \gst{<DirectoryMetadata>}. This encloses a series of \gst{<FileSet>} items. Neither of these tags has any attributes. Each \gst{<FileSet>} item includes two parts. The first part is one or more \gst{<FileName>} tags, each of which encloses a regular expression identifying the files that are to be assigned the metadata. Only files in the same directory as the \gst{metadata.xml} file, or in one of its child directories, will be selected. The \gst{<FileName>} tag encloses the regular expression as text, e.g.:
\begin{gsc}\begin{verbatim}
<FileName>example</FileName>
\end{verbatim}\end{gsc}
This would match any file containing the text ``example'' in its name.

The second part of the \gst{<FileSet>} item is a \gst{<Description>} item. The \gst{<Description>} tag has no attributes, but encloses one or more \gst{<Metadata>} tags. Each \gst{<Metadata>} tag contains one metadata item, i.e. a label to describe the metadata and a corresponding value. The \gst{<Metadata>} tag has one compulsory attribute: \gst{'name'}. This attribute gives the metadata label to add to the document. Each \gst{<Metadata>} tag also has an optional attribute: \gst{'mode'}. If this attribute is set to \gst{'accumulate'} then the value is added to the document, and any existing values for that metadata item are retained. If the attribute is set to \gst{'set'} or is omitted, then any existing value of the metadata item will be deleted.

\begin{figure}
\begin{gsc}\begin{verbatim}
ec160e The Courier - No.160 - Nov - Dec 1996 - Dossier Habitat - Country reports: Fiji , Tonga (ec160e) English Settlements and housing: general works incl. low- cost housing, planning techniques, surveying, etc. The Courier ACP 1990 - 1996 Africa-Caribbean-Pacific - European Union EC Courier T.1 b22bue Butterfly Farming in Papua New Guinea (b22bue) English Other animals (micro- livestock, little known animals, silkworms, reptiles, frogs, snails, game, etc.)
BOSTID T.1 start a butterfly farm
\end{verbatim}\end{gsc}
\caption{Sample metadata.xml file}
\label{fig:metadatafile}
\end{figure}

Figure~\ref{fig:metadatafile} shows an example \gst{metadata.xml} file. Here, only one file pattern is found in each file set. However, the \gst{Description} tag contains a number of separate metadata items. Note that the \gst{Title} metadata does not have the \gst{mode=accumulate} attribute. This means that when this title is assigned to a document, any existing \gst{Title} information will be lost.

\subsection{Collection configuration files}\label{sec:collconfig}

Each collection has two, or possibly three, \gsiii\ configuration files, \\ \gst{collectionConfig.xml}, \gst{buildConfig.xml}, and optionally \gst{collectionInit.xml}, that give metadata, display and other information for the collection. Currently, \gst{collectionConfig.xml} and \gst{buildConfig.xml} are generated from \gst{collect.cfg} and \gst{build.cfg}. At some stage, the collection building process and the Librarian Interface will be modified to use the XML files directly.

\gst{collect.cfg} and/or \gst{collectionConfig.xml} include user-defined presentation metadata for the collection, such as its name and the {\em About this collection} text; give formatting information for the collection display; and also give instructions on how the collection is to be built. \gst{build.cfg} and/or \gst{buildConfig.xml} are produced by the build-time process and include any metadata that can be determined automatically. They also include configuration information for any ServiceRacks needed by the collection.

All the configuration files should be encoded using UTF-8. The formats of \gst{collect.cfg} and \gst{build.cfg} are not discussed here. Please see the \gsii\ manuals for more information regarding these files.

\subsubsection{collectionInit.xml}

This optional file is only used for non-standard, customized collections. It specifies the class name of the non-standard collection class.
The only syntax so far is the class name: \begin{gsc}\begin{verbatim} \end{verbatim}\end{gsc} Section~\ref{sec:new-coll-types} describes an example collection where this file is used. Depending on the type of collection that this is used for, one or both of the other configuration files may not be needed. \subsubsection{collectionConfig.xml} The collection configuration file is where the collection designer (e.g. a librarian) decides what form the collection should take. So far this file only includes the presentation aspects needed by the run-time system. Instructions for collection building have yet to be defined. Presentation aspects include collection metadata such as title and description, display text for indexes, and format statements for search results, classifiers etc. The format of \gst{collectionConfig.xml} is still under consideration. However, Figure~\ref{fig:collconfig} shows the parts of it that have been defined so far. Display elements for a collection can be entered in any language---use \gst{lang='en'} attributes to specify which language they are in. \begin{figure} \begin{gsc}\begin{verbatim} greenstone@cs.waikato.ac.nz true Greenstone3 MG demo collection This is a demonstration collection for the Greenstone3 digital library software. gs3mgdemo.gif gs3mgdemo_sm.gif chapters chapitres capítulos [ ... more indexes ...] Titles [... more classifiers ...] HowTo
\end{verbatim}\end{gsc}
\caption{Sample collectionConfig.xml file}
\label{fig:collconfig}
\end{figure}

The \gst{<metadataList>} element specifies some collection metadata, such as creator. The \gst{<displayItemList>} element specifies some language dependent information that is used for collection display, such as collection name and short description. These \gst{displayItem} elements can be specified in different languages.

The \gst{<search>} element provides some display and formatting information for the search indexes, while the \gst{<browse>} element concerns classifiers, and the \gst{<display>} element looks at document display. Inside the \gst{<search>} and \gst{<browse>} elements, \gst{<displayItem>} elements are used to provide titles for the indexes or classifiers, while \gst{<format>} elements provide formatting instructions, typically for a document or classifier node in a list of results. Placing the \gst{<format>} instructions at the top level in the \gst{search} or \gst{browse} element will apply the format to all the indexes or classifiers, while placing them inside an individual \gst{index} or \gst{classifier} element will restrict that formatting instruction to that item.

The \gst{<display>} element contains optional formatting information for the display of documents. Templates that can be specified here include \gst{documentHeading} and \gst{DocumentContent}. Other formatting options may also be specified here, such as whether to display a table of contents and/or cover image for the documents. Format elements are described in Section~\ref{sec:formatstmt}.

An optional \gst{<replaceList>} element can be included at the top level. This contains a list of strings and their replacements. This is particularly useful for \gsii\ collections that use macros. The format is like the following:
\begin{gsc}\begin{verbatim}
\end{verbatim}\end{gsc}
Scope determines on what text the replacements are carried out: \gst{text}, \gst{metadata}, and \gst{all} (both text and metadata). An empty scope attribute is equivalent to scope=all. Each replace type can be used with all scope values.
Replacing uses Java's \gst{String.replaceAll} method, so the macro is interpreted as a regular expression, and dollar signs and backslashes have a special meaning in the replacement text. The first example is a straight textual replacement. The second example uses dictionary lookups: xxx will be replaced with the (language-dependent) value for key zzz in resource bundle yyy. The third example uses metadata: xxx will be replaced by the value of the yyy metadata for that document. Appendix~\ref{app:gs2replace} gives some examples that have been used for \gsii\ collections.

\subsubsection{buildConfig.xml}\label{sec:buildconfig}

The file \gst{buildConfig.xml} is produced by the collection building process. Generally it is not necessary to look at this file, but it can be useful in determining what went wrong if the collection does not appear quite the way it was planned.

It contains metadata and other information about the collection that can be determined automatically, such as the number of documents in the collection. It also includes a list of \gst{ServiceRack} classes that are required to provide the services that have been built into the collection. The serviceRack names refer to Java classes that are loaded dynamically at runtime. Any information inside the serviceRack element is specific to that service; there is no set format.

Figure~\ref{fig:buildconfig} shows an example. This configuration file specifies that the collection should load up three ServiceRacks: \gst{GS2Browse}, \gst{GS2MGPPRetrieve} and \gst{GS2MGPPSearch}. The contents of each \gst{<serviceRack>} element are passed to the appropriate ServiceRack object for configuration. The \gst{collectionConfig.xml} file content is also passed to the ServiceRack objects at configure time: the \gst{format} and \gst{displayItem} information is used directly from the \gst{collectionConfig.xml} file rather than added into \gst{buildConfig.xml} during building.
This enables formatting and metadata changes in \gst{collectionConfig.xml} to take effect in the collection without rebuilding being necessary. However, as these files are cached, the collection needs to be reloaded for the changes to appear in the library. \begin{figure} \begin{gsc}\begin{verbatim} 11 mgpp \end{verbatim}\end{gsc} \caption{Sample buildConfig.xml file (gs2mgppdemo collection)} \label{fig:buildconfig} \end{figure} \subsection{Formatting the collection}\label{sec:formatstmt} Part of collection design involves deciding how the collection should look. \gsiii\ has a default 'look' for a collection, so this is optional. However, the default may not suit the purposes of some collections, so many parts to the look of a collection can be determined by the collection designer. In standard \gsiii, the library is served to a web browser by a servlet, and the HTML is generated using XSLT. XSLT templates are used to format all the parts of the pages. These templates can be overridden by including them in the \gst{collectionConfig.xml} file. Some commonly overridden templates are those for formatting lists: search results list, classifier browsing hierarchies, and for parts of the document display. Real XSLT templates for formatting search results or classifier lists are quite complicated, and not at all easy for a new user to write. For example, the following is a sample template for formatting a classifier list, to show Keyword metadata as a link to the document. \begin{gsc}\begin{verbatim} \end{verbatim}\end{gsc} To write this, the user would need to know that: \begin{bulletedlist} \item the variable \gst{\$library\_name} exists, \item the collection name is passed in as a parameter called \gst{collName} \item metadata for a document is found in a \gst{} and that its form is \gst{the value} \item the arguments needed for the link to the document are \gst{a, sa, c, d, a, dt}. \end{bulletedlist} We can use XSLT to transform XML into XSLT. 
\gsiii\ provides a simplified set of formatting commands, written in XML, which will be transformed into proper XSLT. The user specifies a \gst{} for what they want to format---these typically match \gst{documentNode} or \gst{classifierNode} (for a node in a classification hierarchy). The template above can be represented as: \begin{gsc}\begin{verbatim} \end{verbatim}\end{gsc} Table~\ref{tab:gsf-format} shows the set of \gst{'gsf'} (Greenstone Format) elements. If you have come from a \gsii\ background, Appendix~\ref{app:gs2format} shows \gsii\ format elements and their equivalents in \gsiii\ . \begin{table} \caption{Format elements for GSF format language} \label{tab:gsf-format} {\footnotesize \begin{tabular}{p{6.5cm}p{6.5cm}} \hline \bf Element & \bf Description \\ \hline \gst{} & The document's text\\ \hline \gst{...} & The HTML link to the document itself \\ \gst{... } & Same as above\\ \gst{... } & A link to a classification node (use in classifierNode templates)\\ \gst{... } & The HTML link to the original file---set for documents that have been converted from e.g. Word, PDF, PS \\ \hline \gst{} & An appropriate icon\\ \gst{} & same as above\\ \gst{} & bookshelf icon for classification nodes\\ \gst{} & An appropriate icon for the original file e.g. Word, PDF icon\\ \hline \gst{} & All the values of a metadata element for the current document or section, in this case, Title\\ \gst{} & A more extended selection of metadata values. The select field can be one of those shown in Table~\ref{tab:gsf-select-types}. There are two optional attributes: separator gives a String that will be used to separate the fields, default is ``, ``, and pos can be set to return either the first, last or nth value for that metadata at each section.\\ \gst{} & The value of a metadata element for the current document, formatted in some way. Current formatting options available are listed in Table~\ref{tab:gsf-process-types}. \\ \hline \gst{ } & A choice of metadata. 
Will select the first existing one. The metadata elements can have the select, separator and pos attributes as usual.\\
\hline
\gst{ ... ... ... } & Switch on the value of a particular metadata element. The metadata is specified in gsf:metadata, which has the same attributes as usual.\\
\hline
\end{tabular}}
\end{table}

The \gst{<gsf:metadata>} elements are used to output metadata values. The simplest case is \gst{<gsf:metadata name='Title'/>}, which outputs all the Title metadata values for the current document or section. Namespaces are important here: if the Title metadata is in the Dublin Core (dc) namespace, then the element should look like \gst{<gsf:metadata name='dc.Title'/>}. There are three other attributes for this element: \gst{select}, \gst{separator} and \gst{pos}.

By default, all values for the selected metadata are returned where multiple values exist. The attribute \gst{pos} is used when a particular value for the selected metadata is requested (which can be the first, last or nth value). For instance, one document may fall into several classification categories, and therefore may have multiple Subject metadata values. When all are returned, the multiple values are separated by commas by default. The \gst{separator} attribute is used to change the separating string. For example, adding \gst{separator=':~'} to the element will separate all values by a colon and a space. Instead of retrieving all values for a piece of metadata, adding \gst{pos='first'} to the \gst{<gsf:metadata>} element will retrieve only the first value.

Sometimes you may want to display metadata values for sections other than the current one. For example, in the mgppdemo collection, in a search list we display the Titles of all the enclosing sections, followed by the Title of the current section, all separated by semicolons. The display ends up looking something like: \emph{Farming snails 2; Starting out; Selecting your snails} where \emph{Selecting your snails} is the Title of the section in the results list, and \emph{Farming snails 2} and \emph{Starting out} are the Titles of the enclosing sections.
The \gst{select} attribute is used to display metadata for sections other than the current one. Table~\ref{tab:gsf-select-types} shows the options available for this attribute. The \gst{separator} attribute is used here also, to specify the separating text. To produce the display described above, the format statement would have the following in it:

\begin{gsc}
\begin{verbatim}
;
\end{verbatim}
\end{gsc}

\begin{table}
\caption{Select types for metadata format elements}
\label{tab:gsf-select-types}
{\footnotesize
\begin{tabular}{ll}
\hline
\bf Select Type & \bf Description\\
\hline
parent & The immediate parent section\\
ancestors & All the parents back to the root (topmost) section\\
root & The root or topmost section \\
%siblings & All the sibling sections\\
%children & The immediate child sections of the current section\\
%descendants & All the descendent sections\\
\hline
\end{tabular}}
\end{table}

\begin{table}
\caption{String processing options, for preprocess in gsf:switch, and format in gsf:metadata}
\label{tab:gsf-process-types}
{\footnotesize
\begin{tabular}{ll}
\hline
\bf Process Type & \bf Description\\
\hline
toUpper & Make the value upper case \\
toLower & Make the value lower case \\
tidyWhitespace & Replace multiple whitespace characters with a single space \\
stripWhitespace & Remove all whitespace characters \\
cgiSafe & Make the value safe to be a CGI argument \\
formatDate & Turn '20040201' into '01 February 2004' in a language dependent manner \\
formatLanguage & Turn 'en' into 'English' in a language dependent manner\\
formatBigNumber & \\
\hline
\end{tabular}}
\end{table}

The \gst{<gsf:choice-metadata>} element selects the first available metadata value from the list of options.

\begin{gsc}
\begin{verbatim}
<gsf:choice-metadata>
  <gsf:metadata name='dc.Title'/>
  <gsf:metadata name='dls.Title'/>
  <gsf:metadata name='Title'/>
</gsf:choice-metadata>
\end{verbatim}
\end{gsc}

This will display dc.Title if available, otherwise it will use dls.Title if available, otherwise it will use the Title metadata. If there are no values for any of these metadata elements, then nothing will be displayed.
The \gst{<gsf:switch>} element allows different formatting depending on the value of a specified metadata element. For example, the following switch statement could be used to display a different icon for each document in a list depending on which organization it came from.

\begin{gsc}
\begin{verbatim}
\end{verbatim}
\end{gsc}

Preprocessing of the metadata value is optional. The preprocess types are listed in Table~\ref{tab:gsf-process-types}. These operations are carried out on the value of the selected metadata before the test is carried out. Multiple processing types can be specified, separated by semicolons, and they will be applied in the order specified (from left to right).

Each option specifies a test and a test value. Test values are just text. Tests include \gst{startsWith}, \gst{contains}, \gst{exists}, \gst{equals}, \gst{endsWith}. The \gst{exists} test does not need a test value. Having an otherwise option ensures that something will be displayed even when none of the tests match.

If none of the gsf elements meets your needs for formatting, XSLT can be entered directly into the format element, giving the collection designer full flexibility over how the collection appears.

The collection specific templates are added into the configuration file \gst{collectionConfig.xml}. Any templates found in the XSLT files can be overridden. The important part of adding templates into the configuration file is determining where to put them. Formatting templates cannot go just anywhere; there are standard places for them. Figure~\ref{fig:format-places} shows the positions where templates can occur.

\begin{figure}
\begin{gsc}\begin{verbatim}
... ... ... ... ... ...
\end{verbatim}\end{gsc}
\caption{Places for format statements}
\label{fig:format-places}
\end{figure}

There are also formatting instructions that are not templates but are options. These are described in Table~\ref{tab:format_options}.
They are entered into the configuration file like \gst{} \begin{table} \caption{Formatting options} \label{tab:format_options} {\footnotesize \begin{tabular}{llp{5cm}} \hline \bf option name & \bf values & \bf description \\ \hline coverImages & true, false & whether or not to display cover images for documents \\ documentTOC & true, false & whether or not to display the table of contents for the document\\ \hline \end{tabular}} \end{table} Note, format templates are added into the XSLT files before transforming, while the options are added into the page source, and used in tests in the XSLT. \subsubsection{Changing the service text strings} Each collection has a set of services which are the access points for the information in the collection. Each service has a set of text strings which are used to display it. These include name, description, the text on the submit button, and names and descriptions of all the parameters to the service. These text strings are found in \gst{.properties} files, in \gst{\$GSDL3HOME/WEB-INF/classes}. The names of the files are based on class names. Subclasses can define their own properties, or can use their parent class ones. For example, \gst{AbstractSearch} defines strings for the \gst{TextQuery} service, in \gst{AbstractSearch.properties}. \gst{GS2MGSearch} just uses these default ones, so doesn't need its own properties file. A particular collection can override the properties for any service. For example, if a collection uses the \gst{GS2MGSearch} service rack (look in the \gst{buildConfig.xml} file for a list of service racks used), and the collection builder wants to change the text associated with this service, they can put a \gst{GS2MGSearch.properties} file in the resources directory of the collection. After a reconfigure of the collection, this will be used in preference to the one in the default resources directory. 
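
As a sketch of such an override, a collection called \gst{testcol} in site \gst{localsite} might carry a file like the one below. The collection name, the path and the key names are illustrative assumptions rather than a definitive list; the keys that are actually recognized should be copied from \gst{AbstractSearch.properties}.

\begin{gsc}
\begin{verbatim}
# Assumed location:
# $GSDL3HOME/sites/localsite/collect/testcol/resources/GS2MGSearch.properties
# Key names are hypothetical; copy the real ones from AbstractSearch.properties.
TextQuery.name=Search the archive
TextQuery.description=Full text search over the archive documents
TextQuery.submit=Find
\end{verbatim}
\end{gsc}
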
\subsection{Customizing the interface}\label{sec:interface-customise}

Format statements in the collection configuration files provide a way to change small parts of the collection display. For large scale customizations to a collection, or ones that apply to a site as a whole, a second mechanism is available. The interface is defined by a set of XSLT files that transform the page data into HTML. Any of these files can be overridden to provide specialized display, on a site or collection basis.

The first subsection below looks at customizing the existing interface, the second at defining a whole new interface, and the last at adding a new language translation of an interface.

\subsubsection{Modifying an existing interface}

Most of an interface is defined by XSLT files, which are stored in \gst{\$GSDL3HOME/\-interfaces/\-interface-name/\-transform}. These can be changed and the changes will take effect straight away. If changes should apply only to certain collections or sites, rather than to everything that uses the interface, you can override some of the files by putting new ones in a different place.

XSLT files are looked for in the following order: collection, site, interface, default interface. (This currently only applies to sites, and therefore collections, that reside in the same \gs\ installation as the interface.) Sites and collections can have a \gst{transform} directory, which is where customized XSLT files should go. Any XSLT files in this directory will be used in preference to the interface files when that site or collection is being used.

For example, if you want to have a completely different layout for the about page of a collection, you can put a new \gst{about.xsl} file into the collection's \gst{transform} directory, and this will be used instead. This is what we do for the Gutenberg sample collection.

This also applies to files that are included from other XSLT files. For example, the \gst{query.xsl} for the query pages includes a file called \gst{querytools.xsl}.
To have a particular site show a different query interface, either of these files may need to be modified. Creating a new version of either of them and putting it in the site \gst{transform} directory will work. Either the new \gst{query.xsl} will include the default \gst{querytools.xsl}, or the default \gst{query.xsl} will include the new \gst{querytools.xsl}. The \gst{xsl:include} directives are preprocessed by the Java code and full paths are added based on the availability of the files, so that the correct one is used.

Note that you cannot include a file with the same name as the including file. For example, \gst{query.xsl} cannot include \gst{query.xsl}. It is tempting to do this if you just want to change one template in a particular file and then include the default, but it is not possible.

You can add the argument \gst{o=xml} to any URL and you will be returned the XML before it is transformed by a stylesheet. This shows you the XML page source, which can be useful when you are trying to write new XSLT statements.

\subsubsection{Defining a new interface}

A new interface may be needed if different instantiations of the library require different interfaces, or different developers want their own look and feel. Creating a new interface will allow modifications to be made while leaving the original one intact.

A new interface needs a directory in \gst{\$GSDL3HOME/interfaces}; the name of this directory becomes the interface name. Inside, it needs \gst{images} and \gst{transform} directories, and an \gst{interfaceConfig.xml} file. The \gst{interfaceConfig.xml} file may specify a base interface, in which case the new interface only needs to define XSLT for the parts that are different. Otherwise, it will need a full set of XSLT files.
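
For instance, a new interface named \gst{myinterface} (a hypothetical name used only for illustration) would be laid out as follows:

\begin{gsc}
\begin{verbatim}
$GSDL3HOME/interfaces/
    myinterface/
        images/
        transform/
        interfaceConfig.xml
\end{verbatim}
\end{gsc}
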
To use a new interface, the \gst{\$GSDL3HOME/WEB-INF/web.xml} file must be edited: either change the interface that a current servlet instance is using, or add another servlet instantiation to the file (see Section~\ref{sec:sites-and-ints} or Appendix~\ref{app:tomcat}). The Tomcat server must be restarted for this to take effect.

\subsubsection{Changing the interface language}\label{sec:interface-language}

The interface language can be changed by going to the preferences page and choosing a language from the list, which includes all languages into which the interface has been translated.

It is easy to add a new interface language to \gs. Language-specific text strings are separated out from the rest of the system to allow for easy incorporation of new languages. These text strings are contained in Java resource bundle properties files. These are plain text files consisting of key-value pairs, located in \gst{\$GSDL3HOME/WEB-INF/classes}. Each interface has one named \gst{interface\_name.properties} (where \gst{'name'} is the interface name, for example, \gst{interface\_default.properties}, or \gst{interface\_gs2.properties}). Each service class has one with the same name as the class (e.g. \gst{GS2Search.properties}).

To add another language, all of the base \gst{.properties} files must be translated. The translated files keep the same names, but with a language extension added. For example, a French version of \gst{interface\_default.properties} would be named \gst{interface\_default\_fr.properties}.

Keys will be looked up in the properties file closest to the specified language. For example, if language \gst{fr\_CA} was specified (French language, country Canada), and the default locale was \gst{en\_GB}, Java would look at properties files in the following order, until it found the key: \gst{XXX\_fr\_CA.properties}, \gst{XXX\_fr.properties}, \gst{XXX\_en\_GB.properties}, then \gst{XXX\_en.properties}, and finally the default \gst{XXX.properties}.
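
To make the naming scheme concrete, the sketch below shows one key in a base file and in its French translation. The key name \gst{pref.language} and its values are hypothetical and serve only to show the file layout.

\begin{gsc}
\begin{verbatim}
# interface_default.properties (base file)
pref.language=Language

# interface_default_fr.properties (French translation)
pref.language=Langue
\end{verbatim}
\end{gsc}
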
These new files are available straight away: to use the new language, add e.g. \gst{l=fr} to the arguments in the URL. To get \gs\ to add it to the list of languages on the preferences page, an entry needs to be added into the languages list in the \gst{interfaceConfig.xml} file (see Section~\ref{sec:interfaceconfig}). Modification of this file requires a restart of the Tomcat server for the changes to be recognized.

\newpage
\section{Developing \gsiii : Run-time system}\label{sec:develop-runtime}

[TODO: rewrite this section\\
runtime object structure diagram. describe the modules.\\
class hierarchy,\\
directory structure and where everything lives\\
message format.\\
overall description of message passing sequence.\\
configuration process - start up and runtime\\
page generation\\
]

\subsection{Overview of modules}

A \gsiii\ ``library'' system consists of many components: MessageRouter, Receptionist, Actions, Collections, ServiceRacks etc. Figure~\ref{fig:local} shows how they fit together in a stand-alone system. The top left part is concerned with displaying the data, while the bottom right part is the collection data serving part. The two sides communicate through the MessageRouter. There is a one-to-one correspondence between modules and Java classes, with the exception of services: for coding and/or run-time efficiency reasons, several Service modules may be grouped together into one ServiceRack class.

\begin{figure}[t]
\centering
\includegraphics[width=4in]{local} %5.8
\caption{A simple stand-alone site.}
\label{fig:local}
\end{figure}

{\em MessageRouter}: this is the central module for a site. It controls the site, loading up all the collections, clusters and communicators needed. All messages pass through the MessageRouter. Communication between remote sites is always done between MessageRouters, one for each site.

{\em Collection and ServiceCluster}: these are very similar, and group a set of services into a conceptual group.
They both provide some metadata about the collection/cluster, and a list of services. The services are provided by ServiceRack objects that the collection/cluster loads up. A Collection is a specific type of ServiceCluster. A ServiceCluster groups services that are related conceptually, e.g. all the building services may be part of a cluster. What is part of a cluster is specified by the site configuration file. A Collection's services are grouped by the fact that they all operate on some common data---the documents in the collection. Functionally Collection and ServiceCluster are very similar, but conceptually, and to the user, they are quite different. {\em Service}: these provide the core functionality of the system e.g. searching, retrieving documents, building collections etc. One or more may be grouped into a single Java class (ServiceRack) for code reuse, or to avoid instantiating the same objects several times. For example, MGPP searching services all need to have the index loaded into memory. {\em Communicator/Server}: these facilitate communication between remote modules. For example, if you want MR1 to talk to MR2, you need a Communicator-Server pair. The Server sits on top of MR2, and MR1 talks to the Communicator. Each communication type needs a new pair. So far we have only been using SOAP, so we have a SOAPCommunicator and a SOAPServer. {\em Receptionist}: this is the point of contact for the 'front end'. Its core functionality involves routing requests to the Actions, but it may do more than that. For example, a Receptionist may: modify the request in some way before sending it to the appropriate Action; add some data to the page responses that is common to all pages; transform the response into another form using XSLT. There is a hierarchy of different Receptionist types, which is described in Section~\ref{sec:recepts}. {\em Actions}: these do the job of creating the 'pages'. 
There is a different action for each type of page: for example, PageAction handles semi-static pages, QueryAction handles queries, and DocumentAction displays documents. Actions know a little about specific service types. Based on the `CGI' arguments passed in to them, they construct requests for the system, and put together the responses into data for the page. This data is returned to the Receptionist, which may transform it to HTML. The various actions are described in more detail in Section~\ref{sec:pagegen}.

\subsection{Start up configuration}\label{sec:startup-config}

We use the Tomcat web server, which operates either stand-alone in a test mode or in conjunction with the Apache web server. The \gs\ LibraryServlet class is loaded by Tomcat and the servlet's \gst{init()} method is called. Each time a \gst{get/put/post} (etc.) request is received, a new thread is started and \gst{doGet()/doPut()/doPost()} (etc.) is called.

The \gst{init()} method creates a new Receptionist and a new MessageRouter. Default classes (DefaultReceptionist, MessageRouter) are used unless subclasses have been specified in the servlet initialization parameters (see Section~\ref{sec:sites-and-ints}). The appropriate system variables are set for each object (interface name, site name, etc.) and then \gst{configure()} is called on both. The MessageRouter handle is passed to the Receptionist. The servlet then communicates only with the Receptionist, not with the MessageRouter.

The Receptionist reads in the \gst{interfaceConfig.xml} file (see Section~\ref{sec:interfaceconfig}), and loads up all the different Action classes. Other Actions may be loaded on the fly as needed. Actions are added to a map, with shortnames for keys. For example, the QueryAction is added with key `q'. The Actions are passed the MessageRouter reference too. If the Receptionist is a TransformingReceptionist, a mapping between shortnames and XSLT file names is also created.
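
The shortname map kept by the Receptionist can be pictured with a small Java sketch. It is an illustration only and is not taken from the \gs\ source: the \gst{Action} interface shown here, the \gst{ActionMapSketch} class, the \gst{process()} method and the \gst{p} key are assumptions made for the example. Only the \gst{q} key for QueryAction comes from the description above.

\begin{quote}\begin{gsc}\begin{verbatim}
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the real Action classes.
interface Action {
    String process(String request);
}

class QueryAction implements Action {
    public String process(String request) {
        return "query:" + request;
    }
}

class PageAction implements Action {
    public String process(String request) {
        return "page:" + request;
    }
}

// Sketch of the shortname-to-Action map held by a Receptionist.
class ActionMapSketch {
    private final Map<String, Action> actions = new HashMap<>();

    void register(String shortName, Action action) {
        actions.put(shortName, action);
    }

    // Pass the request to the Action registered under shortName.
    String route(String shortName, String request) {
        Action action = actions.get(shortName);
        if (action == null) {
            throw new IllegalArgumentException(
                "no action for '" + shortName + "'");
        }
        return action.process(request);
    }

    public static void main(String[] args) {
        ActionMapSketch r = new ActionMapSketch();
        r.register("q", new QueryAction()); // key from the text
        r.register("p", new PageAction());  // key is an assumption
        System.out.println(r.route("q", "snail farming"));
    }
}
\end{verbatim}\end{gsc}\end{quote}

In this sketch an unknown shortname raises an exception. A real Receptionist could instead try to load the missing Action on the fly, as noted above.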
The MessageRouter reads in its site configuration file \gst{siteConfig.xml} (see Section~\ref{sec:siteconfig}). It creates a module map that maps names to objects. This is used for routing the messages. It also keeps small chunks of XML: serviceList, collectionList, serviceClusterList and siteList. These are part of what gets returned in response to a describe request (see Section~\ref{sec:describe}).

Each ServiceRack specified in the configuration file is created, then queried for its list of services. Each service name is added to the map, pointing to the ServiceRack object. Each service is also added to the serviceList. After this stage, ServiceRacks are transparent to the system, and each service is treated as a separate module.

ServiceClusters are created and passed the \gst{} element for configuration. They are added to the map as is, with the cluster name as a key. A ServiceCluster is also added to the serviceClusterList.

For each site specified, the MessageRouter creates an appropriate type of Communicator object. Then it tries to get the site description. If the server for the remote site is up and running, this should be successful. The site will be added to the mapping with its site name as a key. The site's collections, services and clusters will also be added into the static XML lists. If the server for the remote site is not running, the site will not be included in the siteList or module map. To try again to access the site, either Tomcat must be restarted, or a run-time reconfigure-site command must be sent (see Section~\ref{sec:runtime-config}).

The MessageRouter also looks inside the site's \gst{collect} directory, and loads up a Collection object for each valid collection found. If a \gst{collectionInit.xml} file is present, a subclass of Collection may be used.
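
The way ServiceRacks disappear behind their services in the module map can be sketched in Java as follows. This is a simplified illustration, not the \gs\ implementation: the classes \gst{RackSketch} and \gst{ModuleMapSketch}, the rack name \gst{ExampleRack} and the service names \gst{ServiceA} and \gst{ServiceB} are all hypothetical.

\begin{quote}\begin{gsc}\begin{verbatim}
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a ServiceRack: one object, many services.
class RackSketch {
    private final String name;
    private final List<String> services;

    RackSketch(String name, String... services) {
        this.name = name;
        this.services = Arrays.asList(services);
    }

    String getName() { return name; }
    List<String> getServices() { return services; }
}

// Sketch of a module map from names to the objects that answer them.
class ModuleMapSketch {
    private final Map<String, Object> moduleMap = new HashMap<>();

    // Each service name points at the rack object that provides it.
    void addRack(RackSketch rack) {
        for (String service : rack.getServices()) {
            moduleMap.put(service, rack);
        }
    }

    // Clusters, sites and collections are stored under their own name.
    void addModule(String name, Object module) {
        moduleMap.put(name, module);
    }

    Object lookup(String name) {
        return moduleMap.get(name);
    }

    public static void main(String[] args) {
        ModuleMapSketch map = new ModuleMapSketch();
        RackSketch rack =
            new RackSketch("ExampleRack", "ServiceA", "ServiceB");
        map.addRack(rack);
        // Both service names resolve to the same rack object.
        Object a = map.lookup("ServiceA");
        Object b = map.lookup("ServiceB");
        System.out.println(a == b);
    }
}
\end{verbatim}\end{gsc}\end{quote}

Because the map is keyed by service name rather than by rack name, a caller routing a message never needs to know which rack class provides a given service.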
The Collection object reads its \gst{buildConfig.xml} and \gst{collectionConfig.xml} files, determines the metadata, and loads ServiceRack classes based on the names specified in \gst{buildConfig.xml\/}. The \gst{} XML element is passed to the object to be used in configuration. The \gst{collectionConfig.xml} contents are also passed in to the ServiceRacks. Any format or display information that the services need must be extracted from the collection configuration file. Collection objects are added to the module map with their name as a key, and a collection element is also added into the collectionList XML.

\subsection{Message passing}

There are two types of messages used by the system: external and internal messages. All messages have an enclosing \gst{<message>} element, which contains either one or more requests, or one or more responses. In the following descriptions, the message element is not shown, but is assumed to be present.

Activity in \gsiii\ originates with a request coming in from the outside. In the standard web-based \gs, this comes from a servlet and is passed into the Receptionist. This ``external'' type request is a request for a page of data, and contains a representation of the CGI style arguments. A page of XML is returned, which may be HTML or another format, depending on the output parameter of the request.

Messages inside the system (``internal'' messages) all follow the same basic format: message elements contain one or more request elements, or one or more response elements. All messaging is synchronous. The same number of responses as requests will be returned. Currently all requests are independent, so any requests can be combined into the same message, and they will be answered separately, with their responses being sent back in a single message.

When a page request (external request) comes in to the Receptionist, it looks at the action attribute and passes the request to the appropriate Action module.
The Action will fire one or more internal requests to the MessageRouter, based on the arguments. The data is gathered into a response, which is returned to the Receptionist. The page that the Receptionist returns contains the original request, the response from the action and other information as needed (depending on the type of Receptionist). The data may be transformed in some way; for the \gs\ servlet we transform using XSLT to generate HTML pages.

Actions send internal style messages to the MessageRouter. Some can be answered by it, others are passed on to collections, and maybe on to services. Internal requests are for simple actions, such as search, retrieve metadata, or retrieve document text.

There are several internal request types: describe, process, system, format, status. Process requests do the actual work of the system, while the other types get auxiliary information. The formats of the requests and responses for each internal request type are described in the following sections. External style requests, and their page responses, are described in the section on page generation (Section~\ref{sec:pagegen}).

\subsection{`describe'-type messages}\label{sec:describe}

The most basic of the internal standard requests is ``describe-yourself'', which can be sent to any module in the system. The module responds with a semi-predefined piece of XML, making these requests very efficient. The response is predefined apart from any language-specific text strings, which are put together as each request comes in, based on the language attribute of the request.

\begin{quote}\begin{gsc}\begin{verbatim} \end{verbatim}\end{gsc}\end{quote}

If the \gst{to} field is empty, a request is answered by the MessageRouter.
An example response from a MessageRouter might look like this:

\begin{quote}\begin{gsc}\begin{verbatim} \end{verbatim}\end{gsc}\end{quote}

This MessageRouter has no individual site-wide services (an empty \gst{<serviceList>}), but has a service cluster called build (which provides collection importing and building functionality). It communicates with one site, \gst{org.greenstone.gsdl1}. It is aware of four collections. One of these, \gst{myfiles}, belongs to it; the other three are available through the external site. One of those collections is actually from a further external site.

It is possible to ask just for a specific part of the information provided by a describe request, rather than the whole thing. For example, these two messages get the \gst{collectionList} and the \gst{siteList} respectively:

\begin{quote}\begin{gsc}\begin{verbatim} \end{verbatim}\end{gsc}\end{quote}

Subset options for the MessageRouter include \gst{collectionList}, \gst{serviceClusterList}, \gst{serviceList}, \gst{siteList}.

When a collection or service cluster is asked to describe itself, what is returned is a list of metadata, some display elements, and a list of services. For example, here is such a message, along with a sample response.

\begin{quote}\begin{gsc}\begin{verbatim}
greenstone mgpp demo
This is a demonstration collection for the Greenstone digital
library software. It contains a small subset (11 books) of the
Humanity Development Library. It is built with mgpp.
mgppdemo.gif
greenstone@cs.waikato.ac.nz
11
mgpp
http://kanuka:8090/greenstone3/sites/
localsite/collect/mgppdemo
\end{verbatim}\end{gsc}\end{quote}

Subset options for a Collection or ServiceCluster include \gst{metadataList}, \gst{serviceList}, and \gst{displayItemList}.

This collection provides many typical services. Notice how this response lists the services available, while the collection configuration file for this collection (Figure~\ref{fig:collconfig}) describes ServiceRacks.
Once the ServiceRacks have been configured, they become transparent in the system, and only services are referred to. There are three document retrieval services, for structural information, metadata, and content. The Classifier services retrieve classification structure and metadata. These five services were all provided by the GS2MGPPRetrieve ServiceRack. The three query services were provided by the GS2MGPPSearch ServiceRack, and provide different kinds of query interface. The last service, PhindApplet, is provided by the PhindPhraseBrowse ServiceRack and is an applet service.

A \gst{describe} request sent to a service returns a list of parameters that the service accepts and some display information (and in future may describe the content type for the request and response). Subset options for the request include \gst{paramList} and \gst{displayItemList}.

Parameters can be in the following formats:

\begin{quote}\begin{gsc}\begin{verbatim}