\documentclass[a4paper,11pt]{article}
\usepackage{times,epsfig}
\hyphenation{Message-Router Text-Query}
\begin{document}
\title{A modular digital library:\\
Architecture and implementation of Greenstone3}
% if you work on this manual, add your name here
\author{Katherine Don and Ian H. Witten \\[1ex]
Department of Computer Science \\
University of Waikato \\ Hamilton, New Zealand \\
\{kjdon, ihw\}@cs.waikato.ac.nz}
\date{}
\maketitle
\newenvironment{bulletedlist}%
{\begin{list}{$\bullet$}{\setlength{\itemsep}{0pt}\setlength{\parsep}{0pt}}}%
{\end{list}}
\noindent
Greenstone Digital Library Version 3 is a complete redesign and
reimplementation of the Greenstone digital library software. The current
version (Greenstone2) enjoys considerable success and is being widely used.
Greenstone3 will capitalize on this success, and in addition it will
\begin{bulletedlist}
\item improve flexibility, modularity, and extensibility
\item lower the bar for ``getting into'' the Greenstone code with a view to
understanding and extending it
\item use XML where possible internally to improve the amount of
self-documentation
\item make full use of existing XML-related standards and software
\item provide improved internationalization, particularly in terms of sort order,
information browsing, etc.
\item include new features that facilitate additional ``content management''
operations
\item operate on a scale ranging from personal desktop to corporate library
\item easily permit the incorporation of text mining operations
\item use Java, to encourage multilinguality, X-compatibility, and to permit
easier inclusion of existing Java code (such as for text mining).
\end{bulletedlist}
Parts of Greenstone will remain in other languages (e.g. MG, MGPP); JNI (Java
Native Interface) will be used to communicate with these.
\section{Architecture}
This section is covered by the paper {\em An agent based architecture for dynamic digital library construction and configuration\/}. The text could be pasted in here, linked to, or kept as a separate document; we do not want to maintain two separate versions of the same material.
\section{Greenstone Implementation}
\label{sec:impl}
\subsection{Configuring Greenstone}
\label{subsec:config}
Greenstone3 involves several different kinds of configuration files, all
expressed in XML. Each site has a configuration file that binds parameters for
the site, {\em siteConfig.xml}. Each collection has two configuration files, {\em collectionConfig.xml} and {\em buildConfig.xml\/}, that give metadata for the
collection.\footnote{These replace {\em collect.cfg} and {\em build.cfg} in
Greenstone2.} The first includes user-defined metadata for the collection,
such as its name and the {\em About this collection} text; and also gives
instructions on how the collection is to be built. The second is produced by
the build-time process and includes any metadata that can be determined
automatically.\footnote{Currently only {\em buildConfig.xml} is used: collections are built using Greenstone2-style building and therefore still use the old {\em collect.cfg}.}
\subsubsection{Site configuration file}
The file {\em siteConfig.xml} specifies the URI for the site ({\em
localSiteName\/}), any services or service clusters provided by the site that are not connected
with a particular collection (for example, translation services, or collection building), and a list of
known external sites to connect to. Collections are not specified in the site
configuration file; instead, they are determined by the contents of the site's
collections directory.
Here is a configuration file for a rudimentary site with no site-wide services,
which does not connect to any external sites.\footnote{Should the code be tolerant of missing elements, or do we require empty elements?}
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
The following configuration file is for a site with one site-wide service cluster, a collection building cluster. It also connects to the previous site using SOAP.
\begin{quote}\begin{footnotesize}\begin{verbatim}
Collection builderBuilds collections in a
gsdl2-style manner
\end{verbatim}\end{footnotesize}\end{quote}
These two sites run on the same machine. For {\em site1} to talk to {\em localsite\/}, a SOAP server must be running for {\em localsite\/}. In this case, the address of the SOAP server is {\footnotesize \verb#http://localhost:8080/soap/servlet/rpcrouter#}.
\subsubsection{Building configuration file}
The file {\em buildConfig.xml} contains all metadata and other information about the collection that can
be determined automatically when building the collection, such as the number of
documents it contains. It also includes a list of serviceRack classes that are
required at runtime to provide the services that have been built into the
collection. The serviceRack names are Java classes that are loaded
dynamically at runtime. Any information inside the serviceRack element is
specific to that service---there is no set format. Here is an example:
\begin{quote}\begin{footnotesize}\begin{verbatim}
11mgppdemo.gifGreenstone demo collectionThis is a demonstration
collection for the Greenstone digital library software. It
contains a small subset of the Humanitarian and Development
Libraries.SubjectTitleOrganizationKeyword
\end{verbatim}\end{footnotesize}\end{quote}
Note: because {\em collectionConfig.xml} is not used yet, the {\em colIcon}, {\em colDescription}
and {\em colName} metadata elements have been specified here.
\subsubsection{Collection configuration file}
The format of {\em collectionConfig.xml} has not yet been defined.
\subsubsection{Starting up}
We use the Tomcat web server, which operates either stand-alone in a test mode
or in conjunction with the Apache web server. The Greenstone LibraryServlet
class is loaded by Tomcat and the servlet's {\em init()} method is called. Each time a
{\em get\/}/{\em put\/}/{\em post} (etc.) request arrives, a new thread is started and
{\em doGet()\/}/{\em doPut()\/}/{\em doPost()} (etc.) is called.
The {\em init()} method creates a new Receptionist and a new instance of the
MessageRouter. The appropriate system variables are set in each (interface
name, site name, etc.) and then {\em configure()} is called. A MessageRouter
reference is given to the Receptionist. The servlet then communicates only with
the Receptionist, not with the MessageRouter.
The Receptionist loads up all the different Action classes. A
static list is used initially, and other Actions may be loaded on the fly as needed.
The MessageRouter reads in its site configuration file {\em siteConfig.xml}. This
lists the ServiceRack classes that need to be loaded, and lists any sites that need
to be connected to. It looks inside the {\em collect} directory which contains
all the site's collections and loads up a Collection object for each valid
collection found.
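
The collection discovery step can be sketched as follows. This is a minimal Python illustration rather than Greenstone3's Java code: the function name {\footnotesize \verb#find_collections#} is hypothetical, and so is the rule that a collection counts as valid when its directory contains {\em etc/buildConfig.xml\/} (the text above does not say where the file lives).

```python
import os


def find_collections(collect_dir, config_name="buildConfig.xml"):
    """Return the sorted names of subdirectories that look like collections.

    Assumption: a collection is valid when
    <collect_dir>/<name>/etc/<config_name> exists as a regular file.
    """
    found = []
    for name in sorted(os.listdir(collect_dir)):
        config_path = os.path.join(collect_dir, name, "etc", config_name)
        if os.path.isfile(config_path):
            found.append(name)
    return found
```

Because the list is rebuilt from the directory contents, adding or removing a collection needs no change to the site configuration file, which matches the behaviour described above.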
The Collection object reads its {\em buildConfig.xml} and {\em collectionConfig.xml}
files, determines the metadata, and loads ServiceRack classes based on the
names specified in {\em buildConfig.xml\/}. The {\footnotesize \verb#<serviceRack>#} XML element is passed to the object to be used in configuration.
\section{System messages}
Once the configuration process described in Section~\ref{subsec:config} has been carried out, the system is up and running,
and all modules communicate by passing messages back and forth.
First, we examine the basic message
formats, then how the system creates and responds to the messages.
All messages are enclosed in
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
Messages contain either {\em request\/} or {\em response\/} elements---a single message may contain multiple requests. Each {\em request\/} (and {\em response\/}?) has a language attribute, of the form ``lang='xx'''.
The language attribute is used by the XSLT to determine the language currently
being used by the user interface. Virtually all messages contain text strings,
and services use this attribute to return strings in the appropriate language.
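
As a rough illustration of this envelope, the following Python sketch wraps several requests in one message and stamps each with the language attribute. The element names {\em message} and {\em request} are assumptions made for the example, and the helper {\footnotesize \verb#make_message#} is hypothetical; only the {\em lang} and {\em to} attributes come from the text.

```python
import xml.etree.ElementTree as ET


def make_message(requests, lang="en"):
    """Wrap request attribute dicts in one message element.

    Every request receives the same lang attribute so that a service
    can return its text strings in the user's language.
    """
    message = ET.Element("message")  # assumed name of the enclosing element
    for attrs in requests:
        request = ET.SubElement(message, "request", dict(attrs))
        request.set("lang", lang)
    return message


# Two requests in a single message, both asking for French strings.
msg = make_message([{"to": ""}, {"to": "demo"}], lang="fr")
print(ET.tostring(msg, encoding="unicode"))
```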
There are two different styles of messaging, explained in the two subsections
below. The first is the communication between the servlet (or other external agent) and the Greenstone system (via the Receptionist). The request contains a simple representation of the arguments in a Greenstone URL, and has the same format as any request in the system. The response is a page of data, typically in HTML. The second style of messaging is the internal Greenstone communication. Requests and responses follow a basic format, and both are in XML.\footnote{We format names in lower case with the first letter of internal words capitalized, like {\em matchDocs\/}.} They typically request one service or one action, and the response contains either the data requested, or a status message.
This section describes the two message formats. The following section looks at how the front-end (Receptionist plus Actions) responds to the URL-type messages, and creates internal xxx-type\footnote{are there good names to distinguish the two types of messages?} messages to pass into the system.
\subsection{Servlet to Receptionist messages}\label{subsec:url-type}
Servlet to Receptionist messages are requests for a `page' of data---for example, the home page for a site, the query page for a collection, or the text of a document. They contain, in XML, a representation of the arguments in a
Greenstone URL. The two main arguments are {\em a} (action) and {\em sa}
(subaction).\footnote{The {\em sa} replaces Greenstone's old {\em p} arg for
the page action, and is new for other actions. For example, a text query could
be encoded as {\em a=q \& sa=text\/}.} All other arguments are treated as
parameters.
Here is the XML representation of the arguments:
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
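
The split between the two main arguments and the remaining parameters can be illustrated with a short Python sketch. It works on the query string of a URL rather than on the XML form, and the function name {\footnotesize \verb#split_url_args#} is hypothetical.

```python
from urllib.parse import parse_qsl


def split_url_args(query_string):
    """Separate the action (a) and subaction (sa) from the other arguments."""
    args = dict(parse_qsl(query_string))
    action = args.pop("a", None)
    subaction = args.pop("sa", None)
    # Whatever is left over is treated as an ordinary parameter.
    return action, subaction, args
```

For the text query mentioned in the footnote, the string {\footnotesize \verb#a=q&sa=text&c=demo#} would yield the action {\em q\/}, the subaction {\em text\/}, and a single remaining parameter {\em c\/}.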
The Receptionist routes the message to the appropriate action. The output
field is used to indicate what type of output to return. The actions do not
return responses in the normal format; instead they return a page of
information, expressed by default in HTML. Alternative formats could be XML or WML.
The LibraryServlet class communicates with the Receptionist, which is the entry
point into the system. Future GUIs could communicate either with the
Receptionist or directly with the MessageRouter. If they communicate with the Receptionist they must use the cgi-args type of request, asking for predefined pages of information. If they communicate with the MessageRouter directly, they must use the internal message format described in the next section---this is more powerful, but involves more work by the client. Individual services are requested---the results need to be put together by the client.
The cgi arguments used currently are shown in Table~\ref{tab:args}.
Other arguments can be specified by particular actions. For example, when the query action receives a list of parameters from the TextQuery service, it creates short names for them and adds them to the global list of cgi-args.
\begin{table}
\centering
{\footnotesize
\begin{tabular}{lll}
\hline
\bf Argument & \bf Meaning & \bf Typical values \\
\hline
a & action & a (applet), q (query), b (browse), p (page), pr (process) \\
sa & subaction & home, about (page action) \\
c & collection or & demo, build \\
 & service cluster & \\
s & service name & TextQuery, ImportCollection \\
rt & request type & d (display), r (request), s (status) \\
ro & request only & 0 or 1: if set to 1, the request is carried out \\
 & & but no processing of the results is done \\
o & output type & xml, html, wml \\
l & language & en, fr, zh \\
d & document id & HASHxxx \\
r & resource id & ??? \\
id & process handle & an integer identifying a particular process request \\
\hline
\end{tabular}}
\caption{Generic arguments that can appear in a Greenstone URL}
\label{tab:args}
\end{table}
Here is an example message that retrieves the home page in French:
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
This message represents a text query:
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
\subsection{Module to module messages}
In Greenstone3's modular architecture, messages are used extensively to pass
information from one module to another, for example from an Action to the
MessageRouter module, and from that module to a service module. Requests have
a {\em to} attribute and responses have {\em from\/}. These are addresses used
by routing modules. For example, {\em to='site1/site2/demo/TextQuery'} routes a
message to a MessageRouter ({\em site1\/}), from there to another MessageRouter
({\em site2\/}), from there to a collection ({\em demo\/}), and from there to a
particular service ({\em TextQuery\/}).
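
The hop-by-hop use of the {\em to} address can be sketched in a few lines of Python. The function names are hypothetical, and the idea that each routing module strips the leading component before passing the message on is an assumption about how the address is consumed, not a description of the Java implementation.

```python
def next_hop(to_address):
    """Split a 'to' address into the next module name and the remainder.

    An empty address yields (None, ""), meaning there is nowhere further
    to route the request.
    """
    if not to_address:
        return None, ""
    head, _, rest = to_address.partition("/")
    return head, rest


def trace_route(to_address):
    """List, in order, the modules a request would pass through."""
    hops = []
    while to_address:
        head, to_address = next_hop(to_address)
        hops.append(head)
    return hops
```

Applied to the address in the example above, {\footnotesize \verb#trace_route#} lists {\em site1\/}, {\em site2\/}, {\em demo} and {\em TextQuery} in that order.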
Each request asks for a description of a single module, or requests a particular service. Unlike the first type of message, which requests predefined types of pages, these internal requests can ask for any functionality available in the system.
The most basic message is ``describe-yourself'', which can be sent to any module in the system. The module responds with a predefined piece of XML, making these requests very efficient.
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
If the {\em to} field is empty, the request is answered by the first module that it is passed to.
An example response from a MessageRouter might look like this:
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
This MessageRouter has one site-wide service, a cross-collection searching service. It
communicates with one site, {\em org.greenstone.gsdl1\/}. It is aware of four
collections. One of these, {\em myfiles\/}, belongs to it; the other three are
available through the external site. One of those collections is actually from
a further external site.
It is possible to ask just for a specific part of the information provided by a
describe request, rather than the whole message. For example, these two
messages get the {\em collectionList} and the {\em siteList} respectively:
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
When a collection is asked to describe itself, what is returned is all of the
collection-specific metadata and a list of services. For example, here is such
a message, along with a sample response.
\begin{quote}\begin{footnotesize}\begin{verbatim}
3215532The demo collectionThis is a demo collection.
\end{verbatim}\end{footnotesize}\end{quote}
A {\em describe} request sent to a service returns a list of parameters that
the service accepts, and describes the content type for the request and
response.
Parameters have the following format:
\begin{quote}\begin{footnotesize}\begin{verbatim}
...
\end{verbatim}\end{footnotesize}\end{quote}
If no default is specified, the parameter is assumed to be mandatory.
Here are some examples of parameters:
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
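
The rule that a parameter without a default is mandatory can be made concrete with a small Python sketch. The dictionary representation and the function name {\footnotesize \verb#resolve_params#} are assumptions made for illustration; in the system itself the parameter descriptions arrive as XML.

```python
def resolve_params(descriptions, supplied):
    """Fill in defaults and reject a request that omits a mandatory parameter.

    descriptions maps each parameter name to its default value, or to None
    when the service description gives no default (so it is mandatory).
    """
    resolved = {}
    for name, default in descriptions.items():
        if name in supplied:
            resolved[name] = supplied[name]
        elif default is not None:
            resolved[name] = default
        else:
            raise ValueError(f"missing mandatory parameter: {name}")
    return resolved
```

A caller that supplies only the mandatory values therefore gets a complete parameter list back, while a caller that omits one of them is told which parameter is missing.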
Here is a message, along with a sample response.
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
So far, we have only looked at ``describe'' requests. These can be asked of any module. Other requests are ``configure'' requests, and requests for services.
``Configure'' requests are used to tell the MessageRouter to update its cached information and activate or deactivate other modules. For example, the MessageRouter has a set of Collection modules that it can talk to. It also holds some XML information about those collections---this is returned when a request for a collection list comes in. If a collection is deleted or modified, or a new one created, this information may need to change, and the list of available modules may also change.
So far, we have {\em activate} and {\em deactivate} configure requests.
Some examples are as follows.
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
The first request is used to remove a collection from the running system once it has been physically deleted. The Collection module is removed from the module list, and information about the collection is removed from the collection list XML. The second request is used when the demo collection has either been modified, or has been newly created. The MessageRouter first checks whether a Collection module of that name already exists, and if so deactivates it, as described above. Then a new Collection module is created and configured, and information added into the XML tree. The final request (re)activates the services provided by the serviceRack class TranslationServices. The site config file is re-read, and the appropriate element used for configuration of the new serviceRack object. As for collections, if one already exists, it is deactivated first.
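
The deactivate-then-activate behaviour described above can be sketched as a toy registry in Python. The class {\footnotesize \verb#ModuleRegistry#} and its log are hypothetical stand-ins for the MessageRouter's module list; they show only the ordering of the steps.

```python
class ModuleRegistry:
    """Toy stand-in for the MessageRouter's table of active modules."""

    def __init__(self):
        self.modules = {}  # module name -> module object
        self.log = []      # record of what happened, for illustration only

    def deactivate(self, name):
        # Removing an unknown module is a harmless no-op.
        if self.modules.pop(name, None) is not None:
            self.log.append(f"deactivated {name}")

    def activate(self, name, factory):
        # An existing module of the same name is deactivated first,
        # then a fresh one is created and registered.
        self.deactivate(name)
        self.modules[name] = factory(name)
        self.log.append(f"activated {name}")
```

Treating activation as ``replace if present'' means the same request serves both a newly created collection and one that has been rebuilt.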
The response to a configure request is a status or an error message. No data is sent back, just success or error. An example is:
\begin{quote}\begin{footnotesize}\begin{verbatim}
demo collection activated
\end{verbatim}\end{footnotesize}\end{quote}
\footnote{this format not properly defined yet}
Configure requests are only answered by the MessageRouter at this stage. It is possible that other modules may need to respond to these requests also.
The main requests in the system are for services. There are different types of services: query, browse, retrieve, process, applet. Query services do some kind of search and return a list of documents. Retrieve services can return those documents, metadata about the documents, or other resources. Browse services are for browsing lists or hierarchies of documents. Process services are those where the request is for a command to be run: a status code is returned immediately and, if the command has not finished, an update of the status can be requested later. Applet services are those that run an applet.
Other possibilities include transform, enrich, extract, accrete. These types of service generally enhance the functionality of the first set. They may be used during collection formation: `accrete' documents by adding them to a collection, `transform' the documents into a different format, `extract' information or acronyms from the documents, `enrich' those documents with the information extracted or by adding new information. They may also be used during querying: `transform' a query before using it to query a collection, or `transform' the documents you get back into an appropriate form.
The basic structure of a service request is as follows:
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
The parameters are name--value pairs corresponding to parameters that were specified in the service description sent in response to a describe request.
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
Some requests have a content---for document retrieval, the content is the list of documents to retrieve. For metadata retrieval, the content is the list of documents, and a list of metadata to retrieve for each document.
Responses vary depending on the type of request.
Responses to query requests contain a content, which is the actual result, along with some metadata about the query.\footnote{Is this called metadata or something else?} For instance, a text query on `snail farming' with the parameter `maxDocs=10' might return the first 10 documents, and one of the query metadata items would be the total number of documents that matched the query.\footnote{No metadata about the query result is returned yet.}
The following shows some example query requests and their responses.
Find at most 10 Sections containing the word snail (stemmed), returning the results in unsorted order:
\begin{quote}\begin{footnotesize}\begin{verbatim}
snail
\end{verbatim}\end{footnotesize}\end{quote}
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
Give me the Title metadata for these documents:
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
\begin{quote}\begin{footnotesize}\begin{verbatim}
Farming snails 1:
Learning about snails; Building a pen; Food and shelter plants
Learning about snails
Farming snails 2:
Choosing snails; Care and harvesting; Further improvement
\end{verbatim}\end{footnotesize}\end{quote}
Give me the text for this document:
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
\begin{quote}\begin{footnotesize}\begin{verbatim}
</B><P ALIGN="JUSTIFY"></P>
<P ALIGN="JUSTIFY">11. To farm snails is not hard; however,
it is quite different from keeping chickens or ducks or from growing crops
such as maize, rice, cassava or groundnuts.</P>
<P ALIGN="JUSTIFY"></P>
<P ALIGN="JUSTIFY">12. Since farming snails is so different
from other kinds of farming, you will have to learn a lot of new things.
</P>....
\end{verbatim}\end{footnotesize}\end{quote}
Build requests are not a request for data---they are a request for some action to be carried out, for example, create or import or build or activate a collection. The response is a status or an error message. The import and build commands may take a long time to complete, so a message is sent back after a successful start of the command. The status may be polled by the requester to see how the process is going.
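
The polling pattern for long-running commands can be sketched as follows. This Python fragment is illustrative only: the status strings {\em running\/}, {\em success} and {\em error} are assumed names, and {\footnotesize \verb#poll_status#} stands for whatever call sends a status request for the process handle.

```python
import time


def wait_for_completion(poll_status, interval=0.0, max_polls=100):
    """Poll a status callback until it stops reporting 'running'.

    poll_status is any zero-argument callable that returns a status string.
    Raises TimeoutError if the process is still running after max_polls.
    """
    for _ in range(max_polls):
        status = poll_status()
        if status != "running":
            return status
        time.sleep(interval)
    raise TimeoutError("process did not finish within the polling budget")
```

Bounding the number of polls keeps a requester from waiting forever on an import or build that never completes.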
Build requests generally do not need a content; they just have a parameter list.\footnote{Or is the collection the content?} Like any service, the parameters used by the service can be obtained by a describe request to that service.
Some example requests (note that the build services are grouped into a service cluster called `build', hence the addresses all begin with `build/'):
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
\subsection{Generating the pages}
URL-style requests are received by the Receptionist. Based on the arguments, a page of data must be returned to the servlet. As described in Section~\ref{subsec:url-type}, the requests are XML representations of Greenstone URLs. One of the arguments is action (a). This tells the Receptionist which Action module to pass the request to. Action modules decode the rest of the cgi-arguments to determine what requests need to be made to the system.
System requests are received by the MessageRouter, which answers them one by one, either itself or by passing them on to the appropriate module.
Once the data needed from the system has been accumulated, it is put into a `page' of XML. The page is transformed to its output form, currently HTML, via XSLT transformations, and returned to the user.
The basic page format is:
\begin{quote}\begin{footnotesize}\begin{verbatim}
\end{verbatim}\end{footnotesize}\end{quote}
There are four main elements in the page: config, display, request, response. The request is the original request that came into the Receptionist---this is included so that any parameters can be preset to their previous values, for example, the query options on the query form.\footnote{This should instead be saved in some sort of state saving: if you leave a page and go back, you want your parameters to be the same as well.} The response contains all the data that has been gathered from the system by the action. The other two elements contain extra information needed by XSLT. Config contains run-time variables such as the location of the gsdl home directory, the current site name, and the name of the executable that is running (e.g.\ library)---these are needed to allow the XSLT to generate correct HTML URLs. Display contains some of the text strings needed in the interface---these are kept separate from the XSLT to allow for internationalization.
The following subsections outline, for each action, what data is needed and what requests are generated to send to the system. Following that, Section~\ref{subsec:xslt} describes the config and display information, and the XSLT files.
\subsubsection{Page action}
Depending on the subaction argument, different pages can be generated. For the `home' page, a {\em describe} request is sent to the MessageRouter---this returns a list of all the collections, services, serviceClusters and sites known about. For each collection, its metadata is retrieved via a {\em describe} request. This metadata is added into the previous result, which is then added into the page. The page is
transformed using {\em home.xsl\/}. For the `about' page, a {\em
describe} request is sent to the module that the about page is about: this may be a collection or a service cluster. This returns a list of metadata
and a list of services, and the result is transformed using {\em about.xsl\/}.
\subsubsection{Query action}
Three query services have been implemented: TextQuery, SimpleFieldQuery, and AdvancedFieldQuery. These are all handled in the same way by the query action.
For each page, the service description is requested from the service of the current collection (via a describe request). This is done every time the query page is
displayed.\footnote{This information should be cached.} The description includes a list of the parameters available for the query, such as case/stem, max num docs to return, etc. If the request type (rt) parameter is set to d for display, the action only needs to display the form, and this is the only request to the service. Otherwise, the submit button has been pressed, and a query request to the TextQuery service is sent. This has all the parameters from the URL put into the parameter list. A list of document identifiers
is returned. A followup query is sent to the MetadataRetrieve service of the collection: the content includes the list of
documents, with a request for their {\em Title} metadata. The service description and query result are combined into a page of XML, which is
transformed using {\em basicquery.xsl\/} to produce the HTML page.
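
The sequence of requests made by the query action can be summarised in a Python sketch. Requests are shown as plain function calls rather than XML, and {\footnotesize \verb#send#}, the request kind strings and the dictionary keys are all hypothetical; only the order of the three requests and the role of the {\em rt} argument are taken from the description above.

```python
def query_action(send, collection, service, url_params):
    """Gather the data needed for a query page.

    send(to, kind, payload) stands in for passing one request to the
    MessageRouter and getting its response back as a Python object.
    """
    service_address = f"{collection}/{service}"
    page = {"description": send(service_address, "describe", None)}
    if url_params.get("rt") == "d":
        return page  # display only: the form needs just the description
    # The form was submitted: run the query, then fetch titles for the hits.
    doc_ids = send(service_address, "query", url_params)
    page["results"] = send(f"{collection}/MetadataRetrieve", "retrieve",
                           {"docs": doc_ids, "metadata": ["Title"]})
    return page
```

The returned dictionary plays the part of the page of XML that is handed to the XSLT transformation.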
\subsubsection{Applet action}
There are two types of request to the applet action: {\em a=a \& sa=d\/} and
{\em a=a \& sa=r\/}. The value {\em sa=d\/} means ``display the applet.'' A
{\em describe} request is sent to the service, which returns the {\footnotesize \verb#