source: trunk/gsdl3/docs/manual/manual.tex@ 5435

Last change on this file since 5435 was 5435, checked in by kjdon, 21 years ago

* empty log message *

  • Property svn:keywords set to Author Date Id Revision
File size: 121.2 KB
Line 
1\documentclass[a4paper,11pt]{article}
2\usepackage{times,epsfig}
3\hyphenation{Message-Router Text-Query}
4
5\newenvironment{gsc}% Greenstone text bits
6{\begin{footnotesize}\begin{tt}}%
7{\end{tt}\end{footnotesize}}
8
9\newcommand{\gst}[1]{{\footnotesize \tt #1}}
10\begin{document}
11
12\title{A modular digital library:\\
13 Architecture and implementation of Greenstone3}
14
15% if you work on this manual, add your name here
16\author{Katherine Don and Ian H. Witten \\[1ex]
17 Department of Computer Science \\
18 University of Waikato \\ Hamilton, New Zealand \\
19 \{kjdon, ihw\}@cs.waikato.ac.nz}
20
21\date{}
22
23\maketitle
24
25\newenvironment{bulletedlist}%
26{\begin{list}{$\bullet$}{\setlength{\itemsep}{0pt}\setlength{\parsep}{0pt}}}%
27{\end{list}}
28
29
30\noindent
31Greenstone Digital Library Version 3 is a complete redesign and
32reimplementation of the Greenstone digital library software. The current
33version (Greenstone2) enjoys considerable success and is being widely used.
34Greenstone3 will capitalize on this success, and in addition it will
35\begin{bulletedlist}
36\item improve flexibility, modularity, and extensibility
37\item lower the bar for ``getting into'' the Greenstone code with a view to
38 understanding and extending it
39\item use XML where possible internally to improve the amount of
40 self-documentation
41\item make full use of existing XML-related standards and software
42\item provide improved internationalization, particularly in terms of sort order,
43 information browsing, etc.
44\item include new features that facilitate additional ``content management''
45 operations
46\item operate on a scale ranging from personal desktop to corporate library
47\item easily permit the incorporation of text mining operations
48\item use Java, to encourage multilinguality, X-compatibility, and to permit
49 easier inclusion of existing Java code (such as for text mining).
50\end{bulletedlist}
51Parts of Greenstone will remain in other languages (e.g. MG, MGPP); JNI (Java
52Native Interface) will be used to communicate with these.
53
54A description of the general design and architecture of Greenstone3 is covered by the document {\em The design of Greenstone3: An agent based dynamic digital library} (design-2002.ps, in the gsdl3/docs/manual directory).
55
56NOTES: structure: make the classes and messages separate. have a class hierarchy and a module hierarchy/picture - keep the two separate. schemas/subschemas??
57user vs developer - make a clearer distinction
58are we going to publish an API. what is it? what do we want to provide?
59\section{System modules}\label{sec:modules}
60
61A Greenstone3 'library' system consists of many components: MessageRouter, Receptionist, Actions, Collections, ServiceRacks etc. Figure~\ref{fig:local} shows how they fit together in a stand-alone system.
62
63\begin{figure}[t]
64 \centering
65 \includegraphics[width=4in]{local} %5.8
66 \caption{A simple stand-alone site.}
67 \label{fig:local}
68\end{figure}
69
70
71{\em MessageRouter}: this is the central module for a site. It controls the site, loading up all the collections, clusters, communicators needed. All messages pass through the MessageRouter. Communication between remote sites is always done between MessageRouters, one for each site.
72
73{\em Collection and ServiceCluster}: these are very similar. They both provide some metadata about the collection/cluster, and a list of services. The services are provided by ServiceRack objects that the collection/cluster loads up. A Collection is a specific type of ServiceCluster. A ServiceCluster groups services that are related conceptually, eg all the building services may be part of a cluster. What is part of a cluster is specified by the site config file. A Collection's services are grouped by the fact that they all operate on some common data---the documents in the collection.
74Functionally Collection and ServiceCluster are very similar, but conceptually, and to the user, they are quite different.
75
76{\em ServiceRack}: these provide one or more services - they are grouped into a single class purely for code reuse, or to avoid instantiating the same objects several times. For example, MGPP searching services all need to have the index loaded into memory. Services provide the core functionality for the system, eg searching, retrieving documents, building collections etc.
77
78{\em Communicator/Server}: these facilitate communication between remote modules. For example, if you want MR1 to talk to MR2, you need a Communicator-Server pair. The Server sits on top of MR2, and MR1 talks to the Communicator. Each communication type needs a new pair. So far we have only been using SOAP, so we have a SOAPCommunicator and a SOAPServer.
79
80{\em Receptionist}: this is the point of contact for the 'front end'. Its core functionality involves routing requests to the Actions, but it may do more than that. For example, a Receptionist may: modify the request in some way before sending it to teh appropriate Action; add some data to the page responses that is common to all pages; transform the response into another form using XSLT for example. There is a hierarchy of different REceptionist types, which is described in Section~\ref{sec:recepts}.
81
82{\em Actions}: these do the job of creating the 'pages'. There is a different action for each type of page, for example PageAction handles semi-static pages, QueryAction handles queries, DocumentAction displays documents. They know a little bit about specific service types. Based on the 'cgi' arguments passed in to them, they construct requests for the system, and put together the responses into data for the page. This data is returned to the Receptionist, which may transform it to HTML. The various actions are described in more detail in Section~\ref{sec:pagegen}.
83
84
85\section{Configuration}\label{sec:config}
86
87Initial Greenstone3 system configuration is determined by a set of configuration files, all expressed in XML. Each site has a configuration file that binds parameters for the site, \gst{siteConfig.xml}. Each interface has a config file, \gst{interfaceConfig.xml}, that specifies Actions for the interface. Each collection has two configuration files, \gst{collectionConfig.xml} and \gst{buildConfig.xml}, that give metadata, display and other information for the
88collection.\footnote{\gst{siteConfig.xml} and \gst{interfaceConfig.xml} is new for Greenstone3, while \gst{collectionConfig.xml} and \gst{buildConfig.xml} replace \gst{collect.cfg} and \gst{build.cfg} in
89Greenstone2.} The first includes user-defined presentation metadata for the collection,
90such as its name and the {\em About this collection} text; gives formatting information for the collection display; and also gives
91instructions on how the collection is to be built. The second is produced by
92the build-time process and includes any metadata that can be determined
93automatically. It also includes configuration information for any ServiceRacks needed by the collection.
94
95The configuration files are read in when the system is initialised, and their contents are cached in memory. This means that changes made to these files once the system is running will have no effect. There are a series of cgi-type commands that can be sent to the library to induce reconfiguration of different modules, including reloading the whole site. This removes the need to shutdown and restart the system to reflect these changes. These commands are described in Section~\ref{sec:runtime-config}.
96
97\subsection{Site configuration file}\label{sec:siteconfig}
98
99The file \gst{siteConfig.xml} specifies the URI for the site (\gst{localSiteName}), the HTTP address for site resources (\gst{httpAddress}), any ServiceClusters that the site provides (for example, collection building), any ServiceRacks that do not belong to a cluster or collection, and a list of
100known external sites to connect to. Collections are not specified in the site
101configuration file, instead they are determined by the contents of the site's
102collections directory.
103
104The HTTP address is used for retrieving resources from a site outside the XML protocol. Because a site is HTTP accessible, any files (e.g. images) belonging to that site or to its collections can be specified in the HTML of a page by a URL. This avoids having to retrieve these files from a remote site via the XML protocol\footnote{Currently, sites live inside the Tomcat gsdl3 root context, and therefore all their content is accessible over HTTP via the Tomcat address. We need to see if parts can be restricted. Also, if we use a different protocol, then resources from remote sites may need to come through the XML. Also, if we are running locally without using Tomcat, we may want to get them via file:// rather than http://.}.
105
106Figure~\ref{fig:siteconfig} shows two example site configuration files. The first example is for a rudimentary site with no site-wide services,
107which does not connect to any external sites. The second example is for a site with one site-wide service cluster - a collection building cluster. It also connects to the first site using SOAP.
108These two sites are running on the same machine. For site \gst{gsdl1} to talk to site \gst{localsite}, a SOAP server must be run for \gst{localsite}. The address of the SOAP server, in this case, is \gst{http://localhost:8090/soap/servlet/rpcrouter}.
109
110
111\begin{figure}
112\begin{gsc}\begin{verbatim}
113<siteConfig>
114 <localSiteName value="org.greenstone.localsite"/>
115 <httpAddress value="http://localhost:8090/gsdl3/sites/localsite"/>
116 <serviceClusterList/>
117 <serviceRackList/>
118 <siteList/>
119</siteConfig>
120\end{verbatim}\end{gsc}
121
122\begin{gsc}\begin{verbatim}
123<siteConfig>
124 <localSiteName value="org.greenstone.gsdl1"/>
125 <httpAddress value="http://localhost:8090/gsdl3/sites/gsdl1"/>
126 <serviceClusterList>
127 <serviceCluster name="build">
128 <metadataList>
129 <metadata name="Title">Collection builder</metadata>
130 <metadata name="Description">Builds collections in a
131 gsdl2-style manner</metadata>
132 </metadataList>
133 <serviceRackList>
134 <serviceRack name="GS2Construct"/>
135 </serviceRackList>
136 </serviceCluster>
137 </serviceClusterList>
138 <siteList>
139 <site name="org.greenstone.localsite"
140 address="http://localhost:8090/soap/servlet/rpcrouter"
141 type="soap"/>
142 </siteList>
143</siteConfig>
144\end{verbatim}\end{gsc}
145\caption{Two sample site configuration files}
146\label{fig:siteconfig}
147\end{figure}
148
149\subsection{Interface configuration file}\label{sec:interfaceconfig}
150
151The interface config file \gst{interfaceConfig.xml} lists all the actions that the interface knows about at the start (but other ones can be loaded dynamically). If the interface uses servlets, it specifies what short name each action should use for the action cgi parameter eg QueryAction should use a=q. If the interface uses xslt, it specifies what xslt file should be used for each action and subaction.
152
153\begin{figure}
154\begin{gsc}\begin{verbatim}
155<interfaceConfig>
156 <actionList>
157 <action name='p' class='PageAction'>
158 <subaction name='home' xslt='home.xsl'/>
159 <subaction name='about' xslt='about.xsl'/>
160 </action>
161 <action name='q' class='QueryAction' xslt='basicquery.xsl'/>
162 <action name='b' class='BrowseAction' xslt='classifier.xsl'/>
163 <action name='a' class='AppletAction' xslt='applet.xsl'/>
164 <action name='d' class='DocumentAction' xslt='document.xsl'/>
165 <action name='pr' class='ProcessAction' xslt='process.xsl'/>
166 <action name='s' class='SystemAction' xslt='system.xsl'/>
167 </actionList>
168</interfaceConfig>
169\end{verbatim}\end{gsc}
170\caption{A sample interface config file}
171\label{fig:ifaceconfig}
172\end{figure}
173
174This makes it easy for developers to implement and use different actions and/or xslt files without recompilation. The server must be restarted, however.
175
176\subsection{Collection configuration file}\label{sec:collconfig}
177
178The collection configuration file is where the collection designer (eg a librarian) decides what form the collection should take. This includes the collection metadata such as title and description, and also includes what indexes and browsing structures should be built. The format of \gst{collectionConfig.xml} is still under consideration. However, Figure~\ref{fig:collconfig} shows the parts of it that have been defined so far. (Since collection building at this stage is still done using Greenstone2 perl scripts and the old \gst{collect.cfg} file, we have only defined the format for the parts of \gst{collectionConfig.xml} that are used by the runtime-system.)
179
180
181\begin{figure}
182\begin{gsc}\begin{verbatim}
183<collectionConfig xmlns:gsf="http://www.greenstone.org/
184 configformat">
185 <metadataList>
186 <metadata name="colName" lang="en">greenstone mgpp demo
187 </metadata>
188 <metadata name="colDescription" lang="en">This is a
189 demonstration collection for the Greenstone digital
190 library software. It contains a small subset (11 books)
191 of the Humanity Development Library.</metadata>
192 <metadata name="colDescription" lang="fr">C'est une
193 collection pour demonstration du logiciel Greenstone.
194 Elle contient une petite partie du projet de bibliotheques
195 humanitaires et de developpement (11 livres).</metadata>
196 <metadata name="colIcon">mgppdemo.gif</metadata>
197 </metadataList>
198 <search type='mgpp'>
199 <index name="tt" content="text,metadata"
200 level="Document,Section">
201 <displayName lang="en">books</displayName>
202 </index>
203 <format>
204 <gsf:template match="documentNode">
205 <td><gsf:link><gsf:metadata name="Title"/>(<gsf:metadata
206 name="Source"/>)</gsf:link></td>
207 </gsf:template>
208 </format>
209 </search>
210 <browse>
211 <classifier name="CL1" type="Hierarchy" content="Subject"
212 level="Document">
213 <option name="hfile" value="sub.txt"/>
214 <option name="sort" value="Title"/>
215 </classifier>
216 <classifier name="CL2" type="AZList" content="Title"
217 level="Document">
218 <displayName lang='en'>all titles</displayName>
219 <format>
220 <gsf:template match="classifierNode">
221 <td><gsf:link type="classifier"><gsf:metadata name="Title"/>
222 </gsf:link></td>
223 </gsf:template>
224 </format>
225 </classifier>
226 <classifier name="CL3" type="List" content="Keyword"
227 level="Document">
228 <format>
229 <gsf:template match="documentNode"><td><gsf:link>
230 <gsf:metadata name="Keyword"/></gsf:link></td></gsf:template>
231 </format>
232 </classifier>
233 <classifier type="Phind" content="text" level="Section"/>
234 </browse>
235</collectionConfig>
236\end{verbatim}\end{gsc}
237\caption{Sample collectionConfig.xml file}
238\label{fig:collconfig}
239***** REDO *****
240\end{figure}
241
242****REDO****
243The \gst{<metadataList>} element specifies some collection metadata, such as name and description. These metadata elements can be specified in different languages. The configuration file should be encoded in utf-8.
244The \gst{<search>} element specifies what type of indexer to use, and what indexes to build. A \gst{<format>} element is used to customize what each document entry in a results list should look like.
245The \gst{<browse>} element specifies what browsing structures should be created over the documents. Again, \gst{<format>} elements are used to customize items in the hierarchy, both classifier nodes, and document entries. Section~\ref{sec:colldesign} looks at the collection configuration file in more detail.
246
247The \gst{<display>} element contains optional formatting information for the display of documents. Templates that can be specified here include \gst{documentHeading}, \gst{DocumentContent}, and other information that could be specified (in a yet to be decided format) are things such as whether or not to display the cover image, table of contents etc.
248
249\subsection{Building configuration file}\label{sec:buildconfig}
250
251The file \gst{buildConfig.xml} is produced by the collection building process, and contains metadata and other information about the collection that can
252be determined automatically, such as the number of
253documents it contains. It also includes a list of ServiceRack classes that are
254required at runtime to provide the services that have been built into the
255collection. The serviceRack names are Java classes that are loaded
256dynamically at runtime. Any information inside the serviceRack element is
257specific to that service---there is no set format. Figure~\ref{fig:buildconfig} shows an example. This config file specifies that the collection should load up 3 ServiceRacks: GS2MGPPRetrieve, GS2MGPPSearch, and PhindPhraseBrowse. The contents of each \gst{<serviceRack>} element are passed to the appropriate ServiceRack objects for configuration. The collectionConfig.xml file is also passed ot the ServiceRack objects at configure time---the \gst{format} and \gst{displayItem} information is used directly from the \gst{collectionConfig.xml} file rather than added into \gst{buildConfig.xml} during building. This enables changes in \gst{collectionConfig.xml} to take effect in the collection without rebuilding being necessary.
258
259
260\begin{figure}
261\begin{gsc}\begin{verbatim}
262<buildConfig xmlns:gsf="www.greenstone.org/format" >
263 <metadataList>
264 <metadata name="numDocs">11</metadata>
265 <metadata name="documentMetadata"><element name="Title"/>
266 <element name="Subject"/><element name="Organization"/>
267 <element name="URL"/></metadata>
268 </metadataList>
269 <serviceRackList>
270 <serviceRack name="GS2MGPPRetrieve">
271 <defaultLevel name="Section"/>
272 <levelList>
273 <level name="Document"/>
274 <level name="Section"/>
275 </levelList>
276 <classifierList>
277 <classifier name="CL1" content="Subject"
278 documentInterleave="true" orientation='vertical'/>
279 <classifier name="CL2" content="Title"
280 documentInterleave="false" orientation='horizontal'/>
281 <classifier name="CL4" content="Organisation"
282 documentInterleave="true" orientation='vertical'/>
283 <classifier name="CL5" content="Keyword"
284 documentInterleave="true" orientation='vertical'/>
285 </classifierList>
286 </serviceRack>
287 <serviceRack name="GS2MGPPSearch">
288 <defaultIndex name="tt"/>
289 <defaultLevel name="Section"/>
290 <levelList>
291 <level name="Document"/>
292 <level name="Section"/>
293 </levelList>
294 <indexList>
295 <index name="tt"/>
296 <index name="t0"/>
297 </indexList>
298 <fieldList>
299 <field shortname="TX" name="TextOnly"/>
300 <field shortname="SU" name="Subject"/>
301 <field shortname="TI" name="Title"/>
302 </fieldList>
303 </serviceRack>
304 <serviceRack name="PhindPhraseBrowse"/>
305 </serviceRackList>
306</buildConfig>
307\end{verbatim}\end{gsc}
308\caption{Sample buildConfig.xml file}
309\label{fig:buildconfig}
310\end{figure}
311
312
313\subsection{Start up configuration}\label{sec:startup-config}
314
315We use the Tomcat web server, which operates either stand-alone in a test mode
316or in conjunction with the Apache web server. The Greenstone LibraryServlet
317class is loaded by Tomcat and the servlet's \gst{init()} method is called. Each time a
318\gst{get/put/post} (etc.) is used, a new thread is started and
319\gst{doGet()/doPut()/doPost()} (etc.) is called.
320
321The \gst{init()} method creates a new Receptionist and a new
322MessageRouter. Default classes (DefaultReceptionist, MessageRouter) are used unless subclasses have been specified in the servlet initiation parameters (see Section~\ref{sec:tomcat}). The appropriate system variables are set for each object (interface
323name, site name, etc.) and then \gst{configure()} is called on both. The MessageRouter
324is passed to the Receptionist. The servlet then communicates only with
325the Receptionist, not with the MessageRouter.
326
327The Receptionist reads in the \gst{interfaceConfig.xml} file, and loads up all the different Action classes. Other Actions may be loaded on the fly as needed. Actions are added to a map, with shortnames for keys. Eg the QueryAction is added with key 'q'. The Actions are passed the MessageRouter reference too.
328If the Receptionist is a Transforming receptionist, a mapping between shortnames and xslt files is also created.
329
330The MessageRouter reads in its site configuration file \gst{siteConfig.xml}. It creates a module map that maps names to objects. This is used for routing the messages. It also keeps small chunks of XML---serviceList, collectionList, clusterList and siteList. These are what get returned in response to a describe request (see Section~\ref{sec:describe}.).
331Each ServiceRack specified in the config file is created, then queried for its list of services. Each service name is added to the map, pointing to the ServiceRack object. Each service is also added to the serviceList. After this stage, ServiceRacks are transparent to the system, and each service is treated as a separate module.
332ServiceClusters are created and passed the \gst{<serviceCluster>} element for configuration. They are added to the map as is, with the cluster name as a key. A serviceCluster is also added to the serviceClusterList.
333For each site specified, the MessageRouter creates an appropriate type Communicator object. Then it tries to get the site description. If the server for the remote site is up and running, this should be successful. The site will be added to the map with its site name as a key. The sites collections, services and clusters will also be added into the static xml lists. If the server for the remote site is not running, the site will not be included in the siteList or module map. To try again to access the site, either Tomcat must be restarted, or a run-time reconfigure sites commands must be sent (see next section).
334
335The MessageRouter also looks inside the site's \gst{collect} directory, and loads up a Collection object for each valid collection found.
336
337The Collection object reads its \gst{buildConfig.xml} and \gst{collectionConfig.xml}
338files, determines the metadata, and loads ServiceRack classes based on the
339names specified in \gst{buildConfig.xml\/}. The \gst{<serviceRack>} XML element is passed to the object to be used in configuration. The \gst{collectionConfig.xml} contents are also passed in to the ServiceRacks. Any format or display information that the services need must be extracted from the collection config file.
340Collection objects are added to the module map with their name as a key, and also a collection element is added into the collectionList xml.
341
342\subsection{Run-time (re)configuration}\label{sec:runtime-config}
343
344The startup configuration reads in the various config files and loads up quite a lot of XML into memory. This avoids having to read in files all the time. However, this means that any changes to these files will have no effect in the system. So some run-time reconfiguration options are provided. Currently, these can only be accessed by typing in cgi-arguments into the URL, there is no nice web form yet to do this. SystemAction converts these arguments into system requests, which are described in Section~\ref{sec:system}.
345
346The cgi arguments are entered after the \gst{library?} part of the URL. There are three types of commands: configure, activate, deactivate. These are specified by \gst{a=s\&sa=c}, \gst{a=s\&sa=a}, and \gst{a=s\&sa=d}, respectively (\gst{a} is action, \gst{sa} is subaction). By default, the requests are sent to the MessageRouter, but they can be sent to a collection/cluster by the addition of \gst{sc=xxx}, where \gst{xxx} is the name of the collection or cluster. Table~\ref{tab:run-time config} describes the arguments in abit more detail.
347
348\begin{table}
349\caption{Example run-time configuration arguments.}
350\label{tab:run-time config}
351\begin{tabular}{lp{8cm}}
352\gst{a=s\&sa=c} & reconfigures the whole site, reads in siteConfig.xml, reloads all the collections. Just part of this can be specified with another argument \gst{ss} (system subset). The valid values are \gst{collectionList}, \gst{siteList}, \gst{serviceList}, \gst{clusterList}. \\
353\gst{a=s\&sa=c\&sc=XXX} & reconfigures the XXX collection or cluster. \gst{ss} can also be used here, valid values are \gst{metadataList} and \gst{serviceList}. \\
354\gst{a=s\&sa=a} & activate a specific module. Modules are specified using two arguments, \gst{st} (system module type) and \gst{sn} (system module name). Valid types are \gst{collection}, \gst{cluster} \gst{site}.\\
355\gst{a=s\&sa=d} & deactivate a module. \gst{st} and \gst{sn} can be used here too. Valid types are \gst{collection}, \gst{cluster}, \gst{site}, \gst{service}. \\
356\gst{a=s\&sa=d\&sc=XXX} & deactivate a module belonging to the XXX collection or cluster. \gst{st} and \gst{sn} can be used here too. Valid types are \gst{service}. \\
357\end{tabular}
358\end{table}
359
360\section{System messages}\label{sec:messages}
361
362
363Once the system is up and running (the configuration
364process described in Section~\ref{sec:startup-config} has been carried out), it is passing messages back and forth. All modules communicate via message passing.
365
366There are two different styles of messaging. The first style of messaging is the internal Greenstone communication. Requests and responses follow a basic format, and both are in XML. Each individual communication is contained in a \gst{<message>} element\footnote{all sample requests and responses shown are assumed to have \gst{<message>} elements}.
367They contain either \gst{<request>} or \gst{<response>} elements--- a single message may contain multiple requests/responses. Each \gst{<request>} (and \gst{<response>}?) has a language attribute, of the form \gst{lang='...'}. Virtually all responses contain text strings, and this attribute specifies the preferred language for these strings. Element and attribute names are formated in lower case with the first letter of internal words capitalized, like 'matchDocs'. Each request typically specifies one service or one action, and the response contains either the data requested, or a status message.
368
369
370Requests have a \gst{to} attribute and responses have \gst{from}. These are addresses used
371by routing modules. For example \gst{to='site1/demo/TextQuery'} routes a
372message to modules named \gst{site1}, \gst{demo} then \gst{TextQuery}. These modules happen to be a MessageRouter for a remote site (\gst{site1}), a Collection (\gst{demo}), and a Service (\gst{TextQuery}).
373
374There are several types of request, specified by the \gst{type} attribute: \gst{describe}, \gst{system}, \gst{process}, \gst{status}, \gst{format}. These requests can ask for any functionality available in the system. They are described in more detail in Sections~\ref{sec:describe}, \ref{sec:system}, \ref{sec:process}, \ref{sec:status}, and \ref{sec:format}, respectively.
375
376The second messaging style is the communication between the servlet (or other external agent) and the Greenstone system (via the Receptionist). The request contains a simple representation of the arguments in a Greenstone URL, and has a request type of 'page', as it is a request for a page of data. It has the same format as any other request in the system. The response, however, does not follow the same format as other responses, and may given in different formats, such as XML, HTML etc.
377
378These page-type messages come into the Receptionist and are passed to the appropriate action. The actions generate appropriate internal messages which are sent to the MessageRouter. The responses are put together into a single page of XML. This may be returned as XML, or transformed into some other form, eg HTML using XSLT. This type of message is described in Section~\ref{sec:page}.
379
380\subsection{page-type messages}\label{sec:page}
381
382These are the special 'external'-style messages. Requests originate from outside Greenstone, for example from a servlet, or java application. They are requests for a 'page' of data---for example, the home page for a site; the query page for a collection; the text of a document. They contain, in XML, a list of arguments specifiying what type of page is required. If the external context is a servlet, the arguments represent the 'cgi' arguments in a Greenstone URL. The two main arguments are \gst{a} (action) and \gst{sa}
383(subaction).\footnote{The \gst{sa} replaces Greenstone's old \gst{p} arg for
384the page action, and is new for other actions. For example, a text query could
385be encoded as \gst{a=q \& sa=text\/}.} All other arguments are encoded as
386parameters.
387
388Here is some examples of requests\footnote{In a servlet context, these correspond to the URLs \gst{a=p\&sa=about\&c=demo\&l=fr}, and \gst{a=q\&l=en\&s=TextQuery\&c=demo\&rt=r\&ca=0\&st=1\&m=10\&q=snail}.}:
389
390\begin{quote}\begin{gsc}\begin{verbatim}
391<request type='page' action='p' subaction='about'
392 lang='fr' output='html'>
393 <paramList>
394 <param name='c' value='demo'/>
395 </paramList>
396</request>
397\end{verbatim}\end{gsc}\end{quote}
398
399\begin{quote}\begin{gsc}\begin{verbatim}
400<request lang='en' type='page' action='q' output='html'>
401 <paramList>
402 <param name='s' value='TextQuery'/>
403 <param name='c' value='demo'/>
404 <param name='rt' value='r'/>
405 <!-- the rest are the service specific params -->
406 <param name='ca' value='0'/> <!-- casefold -->
407 <param name='st' value='1'/> <!-- stem -->
408 <param name='m' value='10'/> <!-- maxdocs -->
409 <param name='q' value='snail'/> <!-- query string -->
410 </paramList>
411</request>
412\end{verbatim}\end{gsc}\end{quote}
413
414The Receptionist routes the message to the appropriate Action (determined by looking up its shortname$->$Action object map). The actions determine what information is needed from the server and retrieves it, making one or more internal requests to the MessageRouter. This information is gathered together into a single response, and returned to the Receptionist. The Receptionist may process the result further, depending on what type of Receptionist is it, and returns the page to the external entity. Section~\ref{sec:pagegen} describes the different types of Receptionist, and details the structure of the 'pages' they produce.
415
416The LibraryServlet class communicates with the Receptionist, which is the entry
417point into the system. Future GUIs could communicate either with the
418Receptionist or directly with the MessageRouter. If they communicate with the Receptionist they may use the either page type requests, asking for predefined pages of information, or they can use any of the other internal type requests--- these requests will be passed directly to the MessageRouter. If they communicate with the MessageRouter directly, they must use the internal message format described in the next sections---this is more powerful, but involves more work by the client. Individual services are requested---the results need to be put together by the client.
419
420The main arguments/parameters used currently are shown in Table~\ref{tab:args}.
421Other arguments can be specified by particular actions. These include any parameters needed to access services. For example, the TextQuery service has a set of parameters including stem and case etc, that are only used by the query action.
422
423\begin{table}
424\center{\footnotesize
425\begin{tabular}{lll}
426\hline
427\bf Argument & \bf Meaning &\bf Typical values \\
428\hline
429a & action & a (applet), q (query), b (browse), p (page), pr (process) \\
430& & s (system)\\
431sa & subaction & home, about (page action)\\
432c & collection or & demo, build \\
433& service cluster \\
434s & service name & TextQuery, ImportCollection \\
435rt & request type & d (display), r (request), s (status) \\
436ro & response only & 0 or 1 - if set to one, the request is carried out \\
437& & but no processing of the results is done \\
438& & currently only used in process actions \\
439o & output type & xml, html, wml \\
440l & language & en, fr, zh ...\\
441d & document id & HASHxxx \\
442r & resource id & ???\\
443pid & process handle & an integer identifying a particular process request \\
444\hline
445\end{tabular}}
446\caption{Generic arguments that can appear in a Greenstone URL}
447\label{tab:args}
448\end{table}
449
450
451\subsection{'describe'-type messages}\label{sec:describe}
452
453The most basic of the internal standard requests is ``describe-yourself'', which can be sent to any module in the system. The module responds with a semi-predefined piece of XML, making these requests very efficient. The response is predefined apart from any language-specific text strings, which are put together as each request comes in, based on the language attribute of the request.
454\begin{quote}\begin{gsc}\begin{verbatim}
455<request lang='en' type='describe' to=''/>
456\end{verbatim}\end{gsc}\end{quote}
457If the \gst{to} field is empty, a request is answered by the MessageRouter.
458An example response from a MessageRouter might look like this:
459\begin{quote}\begin{gsc}\begin{verbatim}
460<response lang='en' type='describe'>
461 <serviceList/>
462 <siteList>
463 <site name='org.greenstone.gsdl1'
464 address='http://localhost:8080/soap/servlet/rpcrouter'
465 type='soap' />
466 </siteList>
467 <serviceClusterList>
468 <serviceCluster name="build" />
469 </serviceClusterList>
470 <collectionList>
471 <collection name='org.greenstone.gsdl1/
472 org.greenstone.gsdl2/fao' />
473 <collection name='org.greenstone.gsdl1/demo' />
474 <collection name='org.greenstone.gsdl1/fao' />
475 <collection name='myfiles' />
476 </collectionList>
477</response>
478\end{verbatim}\end{gsc}\end{quote}
479This MessageRouter has no individual site-wide services (an empty \gst{<serviceList>}), but has a service cluster called build (which provides collection importing and building functionality). It
480communicates with one site, \gst{org.greenstone.gsdl1}. It is aware of four
481collections. One of these, \gst{myfiles}, belongs to it; the other three are
482available through the external site. One of those collections is actually from
483a further external site.
484
485It is possible to ask just for a specific part of the information provided by a
486describe request, rather than the whole thing. For example, these two
487messages get the \gst{collectionList} and the \gst{siteList} respectively:
488\begin{quote}\begin{gsc}\begin{verbatim}
489<request lang='en' type='describe' to=''>
490 <paramList>
491 <param name='subset' value='collectionList'/>
492 </paramList>
493</request>
494
495<request lang='en' type='describe' to=''>
496 <paramList>
497 <param name='subset' value='siteList'/>
498 </paramList>
499</request>
500\end{verbatim}\end{gsc}\end{quote}
501
502When a collection or service cluster is asked to describe itself, what is returned is a list of metadata, some display elements, and a list of services. For example, here is such
503a message, along with a sample response.
504
505\begin{quote}\begin{gsc}\begin{verbatim}
506<request lang='en' type='describe' to='mgppdemo'/>
507
508<response from="mgppdemo" type="describe">
509 <collection name="mgppdemo">
510 <displayItem lang="en" name="name">greenstone mgpp demo
511 </displayItem>
512 <displayItem lang="en" name="description">This is a
513 demonstration collection for the Greenstone digital
514 library software. It contains a small subset (11 books)
515 of the Humanity Development Library. It is built with
516 mgpp.</displayItem>
517 <displayItem lang="en" name="icon">mgppdemo.gif</displayItem>
518 <serviceList>
519 <service name="DocumentStructureRetrieve" type="retrieve" />
520 <service name="DocumentMetadataRetrieve" type="retrieve" />
521 <service name="DocumentContentRetrieve" type="retrieve" />
522 <service name="ClassifierBrowse" type="browse" />
523 <service name="ClassifierBrowseMetadataRetrieve"
524 type="retrieve" />
525 <service name="TextQuery" type="query" />
526 <service name="FieldQuery" type="query" />
527 <service name="AdvancedFieldQuery" type="query" />
528 <service name="PhindApplet" type="applet" />
529 </serviceList>
530 <metadataList>
531 <metadata name="creator">[email protected]</metadata>
532 <metadata name="maintainer">[email protected]</metadata>
533 <metadata name="numDocs">11</metadata>
534 <metadata name="buildType">mgpp</metadata>
535 <metadata name="httpPath">http://kanuka:8090/gsdl3/sites/
536 localsite/collect/mgppdemo</metadata>
537 </metadataList>
538 </collection>
539</response>
540\end{verbatim}\end{gsc}\end{quote}
541
542This collection provides many typical services...
543
544The subset parameter can also be used in a describe request to a collection, to retrieve just the \gst{metadataList} or \gst{serviceList}.
545
546A \gst{describe} request sent to a service returns a list of parameters that
547the service accepts, some display information, (and in future may describe the content type for the request and response).
548
549Parameters have the following format:
550\begin{quote}\begin{gsc}\begin{verbatim}
551<param name='xxx' type='integer|boolean|string' default='yyy'/>
552<param name='xxx' type='enum_single|enum_multi' default='aa'/>
553 <option name='aa'/><option name='bb'/>...
554</param>
555<param name='xxx' type='multi' occurs='4'>
556 <param .../>
557 <param .../>
558</param>
559\end{verbatim}\end{gsc}\end{quote}
560
561If no default is specified, the parameter is assumed to be mandatory.
562Here are some examples of parameters:
563\begin{quote}\begin{gsc}\begin{verbatim}
564<param name='case' type='boolean' default='0'/>
565
566<param name='maxDocs' type='integer' default='50'/>
567
568<param name='index' type='enum' default='dtx'>
569 <option name='dtx'/>
570 <option name='stt'/>
571 <option name='stx'/>
572<param>
573
574<!-- this one is for the text box and field list for the
575simple field query-->
576<param name='simpleField' type='multi' occurs='4'>
577 <param name='fqv' type='string'/>
578 <param name='fqf' type='enum_single'>
579 <option name='TI'/><option name='AU'/><option name='OR'/>
580 </param>
581</param>
582
583\end{verbatim}\end{gsc}\end{quote}
584The type attribute is used to determine how to display the parameters on a web page or interface. For example, a string parameter may result in a text entry box, a boolean an on/off button, enum\_single/enum\_multi a drop-down menu, where one or many items, respectively, can be selected.
585A multi-type parameter indicates that two or more parameters are associated, and should be displayed appropriately. For example, in a field query, the text box and field list should be associated. The occurs attribute specifies how many times the parameter should be displayed on the page.
586Parameters also come with display information: all the text strings needed to present them to the user. These include the name of the parameter and the display values for any options. These are included in the above parameter descriptions in the form of \gst{<displayItem>} elements.
587
588A service description also contains some display information---this includes the name of the service, and the text for the submit button.
589
590Here is a sample describe request to the FieldQuery service of collection mgppdemo, along with its response. The parameters in this example include their display information. Figure~\ref{fig:query-display} gives an example html search form that may be generated from this describe response.
591
592\begin{quote}\begin{gsc}\begin{verbatim}
593<request lang="en" to="mgppdemo/FieldQuery" type="describe" />
594
595<response from="mgppdemo/FieldQuery" type="describe">
596 <service name="FieldQuery" type="query">
597 <displayItem name="name">Form Query</displayItem>
598 <displayItem name="submit">Search</displayItem>
599 <paramList>
600 <param default="Document" name="level" type="enum_single">
601 <displayItem name="name">Granularity to search at</displayItem>
602 <option name="Document">
603 <displayItem name="name">Document</displayItem>
604 </option>
605 <option name="Section">
606 <displayItem name="name">Section</displayItem>
607 </option>
608 </param>
609 <param default="1" name="case" type="boolean">
610 <displayItem name="name">Turn casefolding </displayItem>
611 <option name="0">
612 <displayItem name="name">off</displayItem>
613 </option>
614 <option name="1">
615 <displayItem name="name">on</displayItem>
616 </option>
617 </param>
618 <param default="1" name="stem" type="boolean">
619 <displayItem name="name">Turn stemming </displayItem>
620 <option name="0">
621 <displayItem name="name">off</displayItem>
622 </option>
623 <option name="1">
624 <displayItem name="name">on</displayItem>
625 </option>
626 </param>
627 <param default="10" name="maxDocs" type="integer">
628 <displayItem name="name">Maximum documents to return
629 </displayItem>
630 </param>
631 <param name="simpleField" occurs="4" type="multi">
632 <displayItem name="name"></displayItem>
633 <param name="fqv" type="string">
634 <displayItem name="name">Word or phrase </displayItem>
635 </param>
636 <param default="ZZ" name="fqf" type="enum_single">
637 <displayItem name="name">in field</displayItem>
638 <option name="ZZ">
639 <displayItem name="name">All fields</displayItem>
640 </option>
641 <option name="TX">
642 <displayItem name="name">TextOnly</displayItem>
643 </option>
644 <option name="SU">
645 <displayItem name="name">Subject</displayItem>
646 </option>
647 <option name="TI">
648 <displayItem name="name">Title</displayItem>
649 </option>
650 </param>
651 </param>
652 </paramList>
653 </service>
654</response>
655\end{verbatim}\end{gsc}\end{quote}
656
657\begin{figure}[t]
658 \centering
659 \includegraphics[width=3.5in]{query2.ps}
660 \caption{The previous query service describe response as displayed on the search page.}
661 \label{fig:query-display}
662\end{figure}
663
664A describe request to an applet type service returns the applet html element: this will be embedded into a web page to run the applet.
665\begin{quote}\begin{gsc}\begin{verbatim}
666<request type='describe' to='mgppdemo/PhindApplet'/>
667
668<response type='describe'>
669 <service name='PhindApplet' type='query'>
670 <applet ARCHIVE='phind.jar, xercesImpl.jar, gsdl3.jar,
671 jaxp.jar, xml-apis.jar'
672 CODE='org.greenstone.applet.phind.Phind.class'
673 CODEBASE='lib/java'
674 HEIGHT='400' WIDTH='500'>
675 <PARAM NAME='library' VALUE=''/>
676 <PARAM NAME='phindcgi' VALUE='?a=a&amp;sa=r&amp;sn=Phind'/>
677 <PARAM NAME='collection' VALUE='mgppdemo' />
678 <PARAM NAME='classifier' VALUE='1' />
679 <PARAM NAME='orientation' VALUE='vertical' />
680 <PARAM NAME='depth' VALUE='2' />
681 <PARAM NAME='resultorder' VALUE='L,l,E,e,D,d' />
682 <PARAM NAME='backdrop' VALUE='interfaces/default/>
683 images/phindbg1.jpg'/>
684 <PARAM NAME='fontsize' VALUE='10' />
685 <PARAM NAME='blocksize' VALUE='10' />
686 The Phind java applet.
687 </applet>
688 <displayItem name="name">Browse phrase hierarchies</displayItem>
689 </service>
690</response>
691\end{verbatim}\end{gsc}\end{quote}
692
693Note that the library parameter has been left blank. This is because library refers to the current servlet that is running and the name is not necessarily known in advance. So either the applet action or the Receptionist must fill in this parameter before displaying the html.
694
695\subsection{'system'-type messages}\label{sec:system}
696
697``System'' requests are used to tell a MessageRouter, Collection or ServiceCluster to update its cached information and activate or deactivate other modules. For example, the MessageRouter has a set of Collection modules that it can talk to. It also holds some XML information about those collections---this is returned when a request for a collection list comes in. If a collection is deleted or modified, or a new one created, this information may need to change, and the list of available modules may also change. Currently they are initiated by particular cgi parameters (see Section~\ref{sec:runtime-config}).
698
699The basic format of a system request is as follows:
700
701\begin{quote}\begin{gsc}\begin{verbatim}
702<request type='system' to=''>
703 <system .../>
704</request>
705\end{verbatim}\end{gsc}\end{quote}
706
707One or more actual requests are specified in system elements. The following are examples:
708\begin{quote}\begin{gsc}\begin{verbatim}
709<system type='configure' subset=''/>
710<system type='configure' subset='collectionList'/>
711<system type='activate' moduleType='collection' moduleName='demo'/>
712<system type='deactivate' moduleType='site' moduleName='site1'/>
713\end{verbatim}\end{gsc}\end{quote}
714
715The first request reconfigures the whole site---the MessageRouter goes through its whole configure process again. The second request just reconfigures the collectionList---the MessageRouter will delete all its collection information, and re-look through the collect directory and reload all the collections again.
716The third request is to activate collection demo. This could be a new collection, or a reactivation of an old one. If a collection module already exists, it will be deleted, and a new one loaded. The final request deactivates the site site1---this removes the site from the siteList and module map, and also removes any of that sites collections/services from the static lists.
717
718
719A response just contains a status message, for example:
720\begin{quote}\begin{gsc}\begin{verbatim}
721<response from="">
722 <status>collectionList reconfigured successfully</status>
723</response>
724\end{verbatim}\end{gsc}\end{quote}
725
726At some stage, an error or status code should be included.
727
728System requests are mainly answered by the MessageRouter. However, Collections and ServiceClusters will respond to a subset of these requests.
729
730\subsection{'process'-type messages}
731
732The main type of requests in the system are for services. There are different types of services, currently: \gst{query}, \gst{browse}, \gst{retrieve}, \gst{process}, \gst{applet}, \gst{enrich}. Query services do some kind of search and return a list of document identifiers. Retrieve services can return the content of those documents, metadata about the documents, or other resources. Browse is for browsing lists or hierarchies of documents. Process type services are those where the request is for a command to be run. A status code will be returned immediately, and then if the command has not finished, an update of the status can be requested. Applet services are those that run an applet. Enrich services take a document and return the document with some extra markup added.
733
734 Other possibilities include transform, extract, accrete. These types of service generally enhance the functionality of the first set. They may be used during collection formation: 'accrete' documents by adding them to a collection, 'transform' the documents into a different format, 'extract' information or acronyms from the documents, 'enrich' those documents with the information extracted or by adding new information. They may also be used during querying: 'transform' a query before using it to query a collection, or 'transform' the documents you get back into an appropriate form.
735
736The basic structure of a service 'process' request is as follows:
737\begin{quote}\begin{gsc}\begin{verbatim}
738
739<request lang='en' type='process' to='demo/TextQuery'>
740 <paramList/>
741 other elements...
742</request>
743
744\end{verbatim}\end{gsc}\end{quote}
745
746The parameters are name-value pairs corresponding to parameters that were specified in the service description sent in response to a describe request.
747
748\begin{quote}\begin{gsc}\begin{verbatim}
749<param name='case' value='1'/>
750<param name='maxDocs' value='34'/>
751<param name='index' value='dtx'/>
752\end{verbatim}\end{gsc}\end{quote}
753
754Some requests have other content---for document retrieval, this would be a list of document identifiers to retrieve. For metadata retrieval, the content is the list of documents to retrieve metadata for.
755
756Responses vary depending on the type of request. The following sections look at hte process type requests and responses for each type of service.
757
758\subsubsection{'query'-type services}
759Responses to query requests contain a list of document identifiers, along with some other information, dependent on the query type. For a text query, this includes term frequency information, and some metadata about the result. For instance, a text query on 'snail farming', with the parameter 'maxDocs=10' might return the first 10 documents, and one of the query metadata items would be the total number of documents that matched the query.\footnote{no metadata about the query result is returned yet.}
760
761The following shows an example query request and its response.
762
763Find at most 10 Sections in the mgppdemo collection, containing the word snail (stemmed), returning the results in ranked order:
764\begin{quote}\begin{gsc}\begin{verbatim}
765<request lang='en' to="mgppdemo/TextQuery" type="process">
766 <paramList>
767 <param name="maxDocs" value="10"/>
768 <param name="queryLevel" value="Section"/>
769 <param name="stem" value="1"/>
770 <param name="matchMode" value="some"/>
771 <param name="sortBy" value="1"/>
772 <param name="index" value="t0"/>
773 <param name="case" value="0"/>
774 <param name="query" value="snail"/>
775 </paramList>
776</request>
777
778<response from="mgppdemo/TextQuery" type="process">
779 <metadataList>
780 <metadata name="numDocsMatched" value="59" />
781 </metadataList>
782 <documentNodeList>
783 <documentNode nodeID="HASHac0a04dd14571c60d7fbfd.4.2"
784 docType='hierarchy' nodeType="leaf" />
785 <documentNode nodeID="HASH010f073f22033181e206d3b7.2.12"
786 docType='hierarchy' nodeType="leaf" />
787 <documentNode nodeID="HASH010f073f22033181e206d3b7.1"
788 docType='hierarchy' nodeType="interior" />
789 <documentNode nodeID="HASHac0a04dd14571c60d7fbfd.2.2"
790 docType='hierarchy' nodeType="leaf" />
791 ...
792 </documentNodeList>
793 <termList>
794 <term field="" freq="454" name="snail" numDocsMatch="58" stem="3">
795 <equivTermList>
796 <term freq="" name="Snail" numDocsMatch="" />
797 <term freq="" name="snail" numDocsMatch="" />
798 <term freq="" name="Snails" numDocsMatch="" />
799 <term freq="" name="snails" numDocsMatch="" />
800 </equivTermList>
801 </term>
802 </termList>
803</response>
804\end{verbatim}\end{gsc}\end{quote}
805
806The list of document identifiers includes some information about document type and node type. Currently, document types include \gst{simple}, \gst{paged} and \gst{hierarchy}. \gst{simple} is for single section documents, i.e. ones with no sub-structure. \gst{paged} is documents that have a single list of sections, while \gst{hierarchy} type documents have a hierarchy of nested sections. For \gst{paged} and \gst{hierarchy} type documents, the node type identifies whather a section is the root of the document, an internal section, or a leaf.
807
808The term list identifies, for each term in teh query, what its frequency in the collection is, how many documents contained that term, and a list of its equivalent terms (if stemming or casefolding was used).
809
810\subsubsection{'browse'-type services}
811
812Browse type services are used for classification browsing. The request consists of a list of classifier identifiers, and some structure parameters listing what structure to retrieve.
813
814\begin{quote}\begin{gsc}\begin{verbatim}
815<request lang="en" to="mgppdemo/ClassifierBrowse" type="process">
816 <paramList>
817 <param name="structure" value="ancestors" />
818 <param name="structure" value="children" />
819 </paramList>
820 <classifierNodeList>
821 <classifierNode nodeID="CL1.2" />
822 </classifierNodeList>
823</request>
824
825<response from="mgppdemo/ClassifierBrowse" type="process">
826 <classifierNodeList>
827 <classifierNode nodeID="CL1">
828 <nodeStructure>
829 <classifierNode nodeID="CL1">
830 <classifierNode nodeID="CL1.2">
831 <classifierNode nodeID="CL1.2.1" />
832 <classifierNode nodeID="CL1.2.2" />
833 <classifierNode nodeID="CL1.2.3" />
834 <classifierNode nodeID="CL1.2.4" />
835 <classifierNode nodeID="CL1.2.5" />
836 </classifierNode>
837 </classifierNode>
838 </nodeStructure>
839 </classifierNode>
840 </classifierNodeList>
841</response>
842\end{verbatim}\end{gsc}\end{quote}
843
844Possible values for structure parameters are \gst{ancestors}, \gst{parent}, \gst{siblings}, \gst{children}, \gst{descendents}. The response gives, for each identifier in the request, a \gst{<nodeStructure>} element with all the requested structure put together into a hierarchy. The structure may include classifier and document nodes.
845
846
847\subsubsection{'retrieve'-type services}
848
849Retrieval services are special in that requests are not explicilty initiated by a user from a form on a web page, but are called from actions in response to other things. This means that their names are hard-coded into the Actions. DocumentContentRetrieve, DocumentStructureRetrieve and DocumentMetadataRetrieve are the standard names for retrieval services for content, structure, and metadata of documents. Requests to each of these include a list of document identifiers. Because these generally refer to parts of documents, the elements are called \gst{<documentNode>}. For the content, that is all that is required. For the metadata retrieval service, the request also needs parameters specifying what metadata is required. For structure retrieval services, requests need parameters specifying what structure or structural info is required.
850
851Some example requests and responses follow.
852
853Give me the Title metadata for these documents:
854\begin{quote}\begin{gsc}\begin{verbatim}
855
856<request lang="en" to="mgppdemo/DocumentMetadataRetrieve" type="process">
857 <paramList>
858 <param name="metadata" value="Title" />
859 </paramList>
860 <documentNodeList>
861 <documentNode nodeID="HASHac0a04dd14571c60d7fbfd.4.2"/>
862 <documentNode nodeID="HASH010f073f22033181e206d3b7.2.12"/>
863 <documentNode nodeID="HASH010f073f22033181e206d3b7.1"/>
864 ...
865 </documentNodeList>
866</request>
867
868<response from="mgppdemo/DocumentMetadataRetrieve" type="process">
869 <documentNodeList>
870 <documentNode nodeID="HASHac0a04dd14571c60d7fbfd.4.2">
871 <metadataList>
872 <metadata name="Title">Putting snails in your second pen</metadata>
873 </metadataList>
874 </documentNode>
875 <documentNode nodeID="HASH010f073f22033181e206d3b7.2.12">
876 <metadataList>
877 <metadata name="Title">Now you must decide</metadata>
878 </metadataList>
879 </documentNode>
880 <documentNode nodeID="HASH010f073f22033181e206d3b7.1">
881 <metadataList>
882 <metadata name="Title">Introduction</metadata>
883 </metadataList>
884 </documentNode>
885 </documentNodeList>
886</response>
887\end{verbatim}\end{gsc}\end{quote}
888
889One or more parameters specifying metadata may be included in a request. Also, a value of \gst{all} will retrieve all the metadata for each document.
890
891Any browse-type service must also implement a metadata retrieval service to provide metadata for the nodes in the classification hierarchy. The name of it is the browse service name plus \gst{MetadataRetrieve}. For example, the ClassifierBrowse service described in the previous section should also have a ClassifierBrowseMetadataRetrieve service. The request and response format is exactly the same as for the DocumentMetadataRetrieve service, except that \gst{<documentNode>} elements are replaced by \gst{<classifierNode>} elements (and the corresponding list element is also changed).
892
893Give me the text (content) of this document:
894\begin{quote}\begin{gsc}\begin{verbatim}
895<request lang="en" to="mgppdemo/DocumentContentRetrieve" type="process">
896 <paramList />
897 <documentNodeList>
898 <documentNode nodeID="HASHac0a04dd14571c60d7fbfd.4.2" />
899 </documentNodeList>
900</request>
901
902<response from="mgppdemo/DocumentContentRetrieve" type="process">
903 <documentNodeList>
904 <documentNode nodeID="HASHac0a04dd14571c60d7fbfd.4.2">
905 <nodeContent>&lt;Section&gt;
906 &lt;/B&gt;&lt;P ALIGN=&quot;JUSTIFY&quot;&gt;&lt;/P&gt;
907 &lt;P ALIGN=&quot;JUSTIFY&quot;&gt;190. When the plants in
908 your second pen have grown big enough to provide food and
909 shelter, you can put in the snails.&lt;/P&gt;
910 </nodeContent>
911 </documentNode>
912 </documentNodeList>
913</response>
914\end{verbatim}\end{gsc}\end{quote}
915
916The content of a node is returned in a \gst{<nodeContent>} element. In this case it is escaped HTML.
917
918Give me the ancestors and children of the specified node, along with the number of siblings it has:
919\begin{quote}\begin{gsc}\begin{verbatim}
920<request lang="en" to="mgppdemo/DocumentStructureRetrieve" type="process">
921 <paramList>
922 <param name="structure" value="ancestors" />
923 <param name="structure" value="children" />
924 <param name="info" value="numSiblings" />
925 </paramList>
926 <documentNodeList>
927 <documentNode nodeID="HASHac0a04dd14571c60d7fbfd.4.2" />
928 </documentNodeList>
929</request>
930
931<response from="mgppdemo/DocumentStructureRetrieve" type="process">
932 <documentNodeList>
933 <documentNode nodeID="HASHac0a04dd14571c60d7fbfd.4.2">
934 <nodeStructureInfo>
935 <info name="numSiblings" value="2" />
936 </nodeStructureInfo>
937 <nodeStructure>
938 <documentNode nodeID="HASHac0a04dd14571c60d7fbfd"
939 docType='hierarchy' nodeType="root">
940 <documentNode nodeID="HASHac0a04dd14571c60d7fbfd.4"
941 docType='hierarchy' nodeType="interior">
942 <documentNode nodeID="HASHac0a04dd14571c60d7fbfd.4.2"
943 docType='hierarchy' nodeType="leaf" />
944 </documentNode>
945 </documentNode>
946 </nodeStructure>
947 </documentNode>
948 </documentNodeList>
949</response>
950\end{verbatim}\end{gsc}\end{quote}
951
952Structure is returned inside a \gst{<nodeStructure>} element, while structural info is returned in a \gst{<nodeStructureInfo>} element. Possible values for strcuture parameters are as for browse services: \gst{ancestors}, \gst{parent}, \gst{siblings}, \gst{children}, \gst{descendents}. Possible values for info parameters are \gst{numSiblings}, \gst{siblingPosition}, \gst{numChildren}.
953
954\subsubsection{'process'-type services}\label{sec:process}
955Requests to process-type services are not requests for data---they request some action to be carried out, for example, create a new collection, or import a collection. The response is a status or an error message. The import and build commands may take a long time to complete, so a response is sent back after a successful start to the command. The status may be polled by the requester to see how the process is going.
956
957Process requests generally contain just a parameter list. Like for any service, the parameters used by a process-type service can be obtained by a describe request to that service.
958
959Here are two example requests for process-services that are part of the build service cluster (hence the addresses all begin with 'build/'), followed by an example response:
960
961\begin{quote}\begin{gsc}\begin{verbatim}
962<request lang='en' type='process' to='build/NewCollection'>
963 <paramList>
964 <param name='creator' value='[email protected]'/>
965 <param name='collName' value='the demo collection'/>
966 <param name='collShortName' value='demo'/>
967 </paramlist>
968</request>
969
970<request lang='en' type='process' to='build/ImportCollection'>
971 <paramList>
972 <param name='collection' value='demo'/>
973 </paramlist>
974</request>
975
976<response from="build/ImportCollection">
977 <status code="2" pid="2">Starting process...</status>
978</response>
979\end{verbatim}\end{gsc}\end{quote}
980
981The \gst{code} attribute in the response specifies whether the command has been successfully stated, whether its still going, etc (see Table~\ref{tab:status codes} for a list of currently used codes). The pid attribute specifies a process id number that can be used when querying the status of this process. The content of teh status element is (currenlty) just the output from the process so far. Status messages, which are described in Section~\ref{sec:status}, are used to find out how the process is going, and whether it has finished or not.
982
983\subsubsection{'applet'-type services}
984
985Applet-type services are those that process the data for an applet. A request consists only of a list of parameters, and the response contains an \gst{<appletData>} element that contains the XML data to be returned to tehe applet. The format of this is entirely specific to the applet---there is no set format to the applet data.
986
987Here is an example request and response, used by the Phind applet:
988\begin{quote}\begin{gsc}\begin{verbatim}
989 <request type='query' to='mgppdemo/PhindApplet'>
990 <paramList>
991 <param name='pc' value='1'/>
992 <param name='pptext' value='health'/>
993 <param name='pfe' value='0'/>
994 <param name='ple' value='10'/>
995 <param name='pfd' value='0'/>
996 <param name='pld' value='10'/>
997 <param name='pfl' value='0'/>
998 <param name='pll' value='10'/>
999 </paramList>
1000 </request>
1001
1002 <response type='query' from='mgppdemo/PhindApplet'>
1003 <appletData>
1004 <phindData df='9' ef='46' id='933' lf='15' tf='296'>
1005 <expansionList end='10' length='46' start='0'>
1006 <expansion df='4' id='8880' num='0' tf='59'>
1007 <suffix> CARE</suffix>
1008 </expansion>
1009 ...
1010 </expansionList>
1011 <documentList end='10' length='9' start='0'>
1012 <document freq='78' hash='HASH4632a8a51d33c47a75c559' num='0'>
1013 <title>The Courier - N??159 - Sept- Oct 1996 Dossier Investing
1014 in People Country Reports: Mali ; Western Samoa
1015 </title>
1016 </document>
1017 ...
1018 </documentList>
1019 <thesaurusList end='10' length='15' start='0'>
1020 <thesaurus df='7' id='12387' tf='15' type='RT'>
1021 <phrase>PUBLIC HEALTH</phrase>
1022 </thesaurus>...
1023 </thesaurusList>
1024 </phindData>
1025 </appletData>
1026 </response>
1027
1028\end{verbatim}\end{gsc}\end{quote}
1029
1030\subsubsection{'enrich'-type services}
1031
1032Enrich services typically take some text of documents (inside \gst{<nodeContent>} tags) and returns the text marked up in some way. One example of this is the GatePOSTag service: this identifies Dates, Locations, People and Organizations in the text, and annotates the text with the labels. In the following example, the request is for Location and Dates to be identified.
1033*** TODO ****
1034\begin{quote}\begin{gsc}\begin{verbatim}
1035<request lang="en" to="GatePOSTag" type="process">
1036 <paramList>
1037 <param name="annotationType" value="Date,Location" />
1038 </paramList>
1039 <documentNodeList>
1040 <documentNode nodeID="HASHac0a04dd14571c60d7fbfd">
1041 <nodeContent>
1042 FOOD AND AGRICULTURE ORGANIZATION OF THE UNITED NATIONS
1043 Rome 1986
1044 P-69
1045 ISBN 92-5-102397-2
1046 FAO 1986
1047 </nodeContent>
1048 </documentNode>
1049 </documentNodeList>
1050</request>
1051
1052<response from="GatePOSTag" type="process">
1053 <documentNodeList>
1054 <documentNode nodeID="HASHac0a04dd14571c60d7fbfd">
1055 <nodeContent>
1056 FOOD AND AGRICULTURE ORGANIZATION OF THE UNITED NATIONS
1057 <annotation type="Location">Rome</annotation>
1058 <annotation type="Date">1986</annotation>
1059 P-69
1060 ISBN 92-5-102397-2
1061 FAO <annotation type="Date">1986</annotation>
1062 </nodeContent>
1063 </documentNode>
1064 </documentNodeList>
1065</response>
1066\end{verbatim}\end{gsc}\end{quote}
1067
1068\subsection{'status'-type messages}\label{sec:status}
1069
1070These are only used with process-type services, which are those where a request is sent to start some type of process (see Section~\ref{sec:process}). The initial response states whether the process had successfully started, and whether its still continuing. If the process is not finished, status requests can be sent repeatedly to the service to poll the status, using the pid to identify the process. Status codes are used to identify the state of a process. The values used at the moment are listed in Table~\ref{tab:status codes}\footnote{A more standard set of codes should probably be used, for example, the HTTP codes}.
1071
1072\begin{table}
1073\caption{Status codes currently used in Greenstone 3}
1074\label{tab:status codes}
1075\begin{tabular}{llp{8cm}}
1076\bf code name & \bf code & \bf meaning \\
1077& \bf value & \\
1078SUCCESS & 1 & the request was accepted, and the process was completed \\
1079ACCEPTED & 2 & the request was accepted, and the process has been started, but it is not completed yet \\
1080ERROR & 3 & there was an error and the process was stopped \\
1081CONTINUING & 10 & the process is still continuing \\
1082COMPLETED & 11 & the process has finished \\
1083HALTED & 12 & the process has stopped \\
1084INFO & 20 & just an info message that doesnt imply anything \\
1085\end{tabular}
1086\end{table}
1087
1088 The following shows an example status request, along with two responses, the first a 'ok but continuing' response, and the second a 'successfully completed' response. The content of the status elements in the two responses is the output from the process since the last status update was sent back.
1089
1090\begin{quote}\begin{gsc}\begin{verbatim}
1091<request lang="en" to="build/ImportCollection" type="status">
1092 <paramList>
1093 <param name="pid" value="2" />
1094 </paramList>
1095</request>
1096
1097<response from="build/ImportCollection">
1098 <status code="2" pid="2">Collection construction: import collection.
1099command = import.pl -collectdir /research/kjdon/home/gsdl3/web/sites/
1100 localsite/collect test1
1101starting
1102 </status>
1103</response>
1104
1105<response from="build/ImportCollection">
1106 <status code="11" pid="2">RecPlug: getting directory
1107/research/kjdon/home/gsdl3/web/sites/localsite/collect/test1/import
1108WARNING - no plugin could process /.keepme
1109
1110*********************************************
1111Import Complete
1112*********************************************
1113* 1 document was considered for processing
1114* 0 were processed and included in the collection
1115* 1 was rejected. See /research/kjdon/home/gsdl3/web/sites/
1116 localsite/collect/test1/etc/fail.log for a list of rejected documents
1117Success
1118 </status>
1119</response>
1120\end{verbatim}\end{gsc}\end{quote}
1121
1122\subsection{'format'-type messages}\label{sec:format}
1123
1124Collection designers are able to specify how their collection looks to a certain degree. They can specify format statements for display that will apply to the results of a search, the display of a document, entries in a classification hierarchy, for example. This info is generally service specific. All services respond to a format request, where they return any service specific formatting information. A typical request and response looks like this:
1125\begin{quote}\begin{gsc}\begin{verbatim}
1126<request lang="en" to="mgppdemo/FieldQuery" type="format" />
1127
1128<response from="mgppdemo/FieldQuery" type="format">
1129 <format>
1130 <gsf:template match="documentNode"><td><gsf:link>
1131 <gsf:metadata name="Title" />(<gsf:metadata name="Source" />)
1132 </gsf:link></td>
1133 </gsf:template>
1134 </format>
1135</response>
1136\end{verbatim}\end{gsc}\end{quote}
1137
1138The actual format statements are described further in Section~\ref{sec:colldesign}. They are templates written directly in XSLT, or in GSF, which stands for Greenstone Format, and is a simple XML representation of the more complicated XSLT templates.
1139GSF style format statements need to be converted to proper XSLT. This is currently done by the Receptionist (but may be moved to an ActionHelper): the format xml is transformed to xslt using xslt with the config\_format.xsl stylesheet.
1140
1141\section{Page generation}\label{sec:pagegen} **** REDO ********
1142
1143* talk general first: get data, get format info, transform gsf->xsl. transfrom xml->html
1144
1145URL-style requests are received by the Receptionist. Based on the arguments, a page of data must be returned to the servlet. As described in Section~\ref{sec:page}, the requests are XML representations of Greenstone URLs. One of the arguments is action (a). This tells the Receptionist which Action module to pass the request to. Action modules decode the rest of the cgi-arguments to determine what requests need to be made to the system.
1146System requests are received by the MessageRouter, which answers them one by one, either itself or by passing them on to the appropriate module.
1147
1148Once the data needed from the system has been accumulated, it is put into a 'page' of XML. The page is transformed to its output form, currently HTML, via XSLT transformations, and returned to the user.
1149
1150The basic page format is:
1151\begin{quote}\begin{gsc}\begin{verbatim}
1152<page>
1153 <pageExtra>
1154 <config/>
1155 <display/>
1156 </pageExtra>
1157 <pageRequest/>
1158 <pageResponse/>
1159</page>
1160\end{verbatim}\end{gsc}\end{quote}
1161
1162* show config and describe whats its used for
1163
1164There are four main elements in the page: config, translate, request, response. The request is the original request that came into the Receptionist---this is included so that any parameters can be preset to their previous values, for example, the query options on the query form.\footnote{this should be saved instead in some sort of state saving - if you leave a page and go back you want your parameters to be the same as well}. The response contains all the data that has been gathered from the system by the action. The other two elements contain extra information needed by XSLT. Config contains run-time variables such as the location of the gsdl home directory, the current site name, the name of the executable that is running (eg library)---these are needed to allow the XSLT to generate correct HTML URLs. Display contains some of the text strings needed in the interface---these are separate from the XSLT to allow for internationalization.
1165
1166The following subsections outline, for each action, what data is needed and what requests are generated to send to the system.
1167
1168
1169Once the xml page has been put together, the page to return to the user is created by transforming the XML using XSLT. The output is HTML at this stage, but it will be possible to generate alternative outputs, such as XML, WML etc. A set of XSLT files defines an 'interface'. Different users can change the look of their web pages by creating new XSLT files for a new 'interface'. Just as we have a sites directory where different sites 'live' (ie where their configuration file and collections are located), we have an interfaces directory where the different interfaces 'live' (ie their transforms and images are located there). The default XSLT files are
1170located in interfaces/default/transforms. Collections, sites and other interfaces
1171can override these files by having their own copy of the appropriate
1172files. New interfaces have their own directory inside interfaces/. Sites and collections can have a transform directory containing XSLT files. The order in which the XSLT files are looked for is collection, site, current
1173interface, default interface.\footnote{this currently breaks down for remote sites - need to rethink it a bit.}
1174***TODO*** describe a bit more?? currently only can get this locally
1175
1176\subsection{Receptionists}\label{sec:recepts}
1177
1178The receptionist is the controlling module for the page generation part of greenstone. It has the job of loading up all the actions, and it knows about the message router it and the actions are supposed to talk to. It routes messages received to the appropriate action (page-type messages) or directly to the message router (all other types). Receptionists also do other things, for example, adding to the page received back from the action any information that is common to all pages.
1179
1180There are different ways of providing an interface to greenstone, from web based cgi style (using servlets) to Java GUI applications. These different interfaces require slightly different responses from a receptionist, so we provide several standard types of receptionist.
1181
1182Receptionist: This is the most basic receptionist. The page it returns consists of the original request, and the response from the action it was sent to. Methods preProcessRequest, and postProcessPage are called on the request and page, respectively, but in this basic receptionist, they dont do anything.
1183
1184TransformingReceptionist: This extends Receptionist, and overwrites postProcessPage to transform the page using xslt. An xslt is listed for each action in the receptionists config file, and this is used to transform the page. First, some display information, and config information is added to the page. Then it is transformed using the specified xslt for the action, and returned.
1185
1186WebReceptionist: The WebReceptionist extends TransformingREceptionist. It doesn't do much else except some argument conversion. To keep the url's short, parameters from the services are given shortnames, and these are used in the web pages.
1187
1188DefaultReceptionist: This extends WebReceptionist, and is the default one for greenstone 3 servlets. Due to the page design, some extra information is needed for each page: some metadata about the current collection. THe receptionist sends a describe request to teh collection to get this, and appends it to teh page before transformation using xslt.
1189
1190NZDLReceptionist: (do we want to talk about this?) This is an example of a custom receptionist. For a look-alike nzdl.org system, even more information is needed for each page, namely the list of classifiers available from teh ClassifierBrowse service.
1191
1192By default, the LibraryServlet uses DefaultReceptionist. However, there is an init-param called receptionist which can be set to make the servlet use a different one.
1193
1194\subsection{cgi args}
1195
1196THe args used by the page come from several sources. Receptionist uses a couple, actions use some and services. the receptionist and actions are treated as a whole so must not have conflicting args. GSParams class specifies all teh general basic args, and whether they should be saved or not. servlet has an init parameter params\_class, that specifies which params class to use - if subclass it. actions or receptionist may specify some new ones
1197
1198services may be created by different people, may be on a different site. cant garantee no conflict with action params, or even with other services.
1199so service params are namespaced when they are put on the page. interface (recept and action) params wil have no namespace) the default namespace is s1 (service1) - any params that are for the service will be prefixed by this. eg the case param for a search will be put in the page as s1.case.
1200THe actions must now look for all the s1 params to send to teh service.
1201
1202if there are two or more services combined on a page with a single submit button, they will use s1, s2, s3 etc as needed. the s param (service) will end up with a list eg s=TextQuery,MusicQuery, and the order of these determines the mapping order of teh namespaces, ie s1 will be TExtQuery, s2 MusicQuery.
1203
1204also talk abotu saving args - save ones that GSParams says to save, and any service ones should always save.
1205\subsection{Internationalization}
1206
1207Internationalization is a big part of Greenstone3. Language specific text strings are separated out from the rest of the system to allow for easy incorporation of new languages.
1208
1209Language specific text strings are specified in resource bundle property files. These live in resources/java.
1210
1211There is a properties file per class, and one per interface. At the moment, we have
1212
1213GS2MGPPSearch.properties
1214GS2MGPPRetrieve.properties etc - the service classes
1215
1216interface\_default.properties. - for the default interface
1217
1218To add other languages, create eg GS2MGPPSearch\_fr.properties.
1219
1220The interface ones are treated differently from the other ones. The action doesn't know which text strings are needed by a particular transform, so it gets them all out of the properties file, and puts them into an xml \gst{<display>} element - the xslt can get the ones it needs from there.
1221xslt could perhaps get the stuff from the properties bundle on the fly using java extension elements - would this be better? but we dont want to re-load teh properties file every time a new text string is needed.
1222
1223All other class specific text strings are just retrieved one by one as they are needed and added into the xml - for example, the names for query params are retrieved when the service description is created.
1224
1225* for each page type, show a typical request (cgi or xml??) and a sample response
1226
1227\subsection{Page action}
1228* kind of info pages. other actions are associated with specific services.
1229* uses describe requests to modules
1230Depending on the subaction argument, different pages can be generated. For the 'home' page, a 'describe' request is sent to the MessageRouter---this returns a list of all the collections, services, serviceClusters and sites known about. For each collection, its metadata is retrieved via a 'describe' request. This metadata is added into the previous result, which is then added into the page. The page is
1231transformed using \gst{home.xsl}. For the 'about' page, a \gst{describe} request is sent to the module that the about page is about: this may be a collection or a service cluster. This returns a list of metadata
1232and a list of services, and the result is transformed using \gst{about.xsl}.
1233
1234
1235\subsection{Query action}
1236
1237THe basic url is \gst{a=q\&s=TextQuery\&c=demo\&rt=d/r}.
1238There are three query services which have been implemented: TextQuery, FieldQuery, and AdvancedFieldQuery. These are all handled in the same way by query action.
1239For each page, the service description is requested from the service of the current collection (via a describe request). This is currently done every time the query page is
1240displayed, but should be cached. The description includes a list of the parameters available for the query, such as case/stem, max num docs to return, etc. If the request type (rt) parameter is set to d for display, the action only needs to display the form, and this is the only request to the service. Otherwise, the submit button has been pressed, and a query request to the TextQuery service is sent. This has all the parameters from the URL put into the parameter list. A list of document identifiers
1241is returned. A followup query is sent to the MetadataRetrieve service of the collection: the content includes the list of
1242documents, with a request for some of their metadata. Which metadata to retrieve is determined by looking through the xslt that wil be used to transform the page (Formatter object??). The service description and query result are combined into a page of xml, which is
1243transformed using \gst{basicquery.xsl} to produce the html page.
1244
1245\subsection{Applet action}
1246
1247There are two types of request to the applet action: \gst{a=a \& rt=d\/} and
1248\gst{a=a \& rt=r\/}. The value \gst{rt=d\/} means ``display the applet.'' A
1249\gst{describe} request is sent to the service, which returns the \gst{<applet>} HTML element. The transformation file \gst{applet.xsl} embeds this
1250into the page, and the servlet returns the HTML.
1251
1252The value \gst{rt=r} signals a request from the applet. The result is returned
1253directly to the applet code, in XML. The other parameters are sent to the
1254service untransformed, and the result is passed directly back to the applet.
1255Applet action can therefore work with any applet whose service understands the
1256messages.
1257
1258Here are two examples of requests generated by the Applet action, along with their corresponding responses.
1259
1260The first request corresponds to the URL arguments \gst{a=a \&
1261rt=d \& sn=Phind \& c=mgppdemo\/}, which translate to ``display the Phind
1262applet for the mgppdemo collection''.
1263
1264
1265The second request corresponds to the arguments \gst{a=a \& rt=r \& sn=Phind \& c=mgppdemo \& pc=1 \& pptext=health \& pfe=0 \& ple=10 \& pfd=0 \& pld=10 \& pfl=0 \& pll=10}---this
1266indicates a request to the service itself. The extra arguments (not a, sa, sn, c) are simply copied into the
1267request as parameters. The response is in a form suitable for the applet, placed inside
1268\gst{<appletData>} in a standard Greenstone message. AppletAction returns the
1269contents of appletData to the browser, i.e. to the applet itself.
1270
1271
1272Note that the applet HTML may need to know the name of the \gst{library}
1273program. However, that name is chosen by the person who installed the software
1274and will not necessarily be ``library''. To get around this, the applet can
1275put a parameter called ``library'' into the applet data with a null value:
1276\begin{quote}\begin{gsc}\begin{verbatim}
1277<PARAM NAME='library' VALUE=''/>
1278\end{verbatim}\end{gsc}\end{quote}
1279When the Applet action encounters this parameter it inserts the name of the
1280current library servlet as its value.
1281
1282\subsection{Document action}
1283
1284DocumentAction sends a query to the DocumentRetrieve service of the collection requesting the text of the specified document. At this stage no additional information is obtained, but in future stuff like Title and
1285table of contents would be needed to make the display nicer.
1286
1287
1288\subsection{System action}\label{sec:system-action}
1289
1290SystemAction allows for manual reconfiguration of various components at run-time. There is no interactive web-page displaying the options, it merely turns a set of cgi arguments into an xml system request. The response from a system request is a message which is displayed to the user.
1291
1292\begin{table}
1293\caption{Configure cgi arguments}
1294\label{tab:system-cgi}
1295\begin{tabular}{ll}
1296\hline
1297\bf arg & \bf description\\
1298a=s & system action\\
1299sa=c$|$a$|$d & type of system request: c (configure), a (add/activate), \\
1300& d (delete/deactivate) \\
1301c=demo & the request will go to this collection/servicecluster \\
1302& instead of the message router\\
1303ss=collectionList & subset for configure: only reconfigure this part.\\
1304& For the MessageRouter, can be serviceClusterList, serviceList, \\
1305& collectionList, siteList.\\
1306& For a collection/cluster, can be metadataList or serviceList.\\
1307sn=demo & \\
1308st=collection& \\
1309\hline
1310\end{tabular}
1311\end{table}
1312
1313
1314\section{Collection formation}
1315
1316So far, only Greenstone2 style building is available. This uses the import.pl and buildcol.pl perl scripts from Greenstone2. These scripts and their needed perl modules have not been added to the Greenstone3 system, so to do building, you need to have Greenstone2 installed, and GSDLHOME, and GSDLOS set. (can do this by running 'source setup.bash' in the top level directory of gsdl.
1317
1318There are three ways of getting collections into Greenstone3.
1319
1320\subsection{Importing gs2 collections}
1321
1322Collections built in a Greenstone2 system can be used in Greenstone3. Just copy across the collection's directory into the appropriate collect directory, and run \gst{convert\_coll\_from\_gs2.pl}. You need to specify the collect directory and the collection name. Eg.
1323
1324\gst{convert\_coll\_from\_gs2.pl -collectdir /research/kjdon/gsdl3/web/\-sites/\-localsite/collect demo}
1325
1326This creates the appropriate Greenstone3 XML configuration files. If you restart Tomcat, or give an add command (\gst{a=s\&sa=a\&st=collection\&sn=demo}), you should be able to see your new collection. You may need to edit some of the format stuff by hand.
1327
1328
1329\subsection{Building new collections through the web interface}
1330
1331Collection construction can be done through the web, using the build ServiceCluster in localsite. Just sequence through the steps needed. There is no automatic sequence taking you to the next page, you have to go back to the build 'about' page, and select the next service manually. So far, AddDocument does not work, so documents need to be manually added to the import directory. And there is no ConfigureCollection service yet, so if you want anything other than the default configuration, you need to edit the collect.cfg config file by hand.
1332
1333You need to carry out the following steps:
1334
1335\begin{quote}
1336NewCollection\\
1337- add docs to import directory\\
1338- optionally edit collect.cfg
1339ImportCollection\\
1340BuildCollection\\
1341ActivateCollection\\
1342\end{quote}
1343
1344Note, activate uses \gst{activate\_gs2\_style\_coll.pl} which is similar to \gst{convert\_coll\_from\_gs2.pl} but assumes that collectionConfig.xml already exists.
1345
1346\subsection{Command line building}
1347
1348
1349Collection building can also be done on the command line:
1350
1351\begin{gsc}\begin{verbatim}
1352ConstructCollection -site <site-path>
1353 -mode new|import|build|activate
1354 [options] <coll-name>
1355\end{verbatim}\end{gsc}
1356
1357eg
1358
1359\begin{gsc}\begin{verbatim}
1360ConstructCollection -site /research/kjdon/gsdl3/web/sites/localsite
1361 -mode new
1362 -creator [email protected] testcol
1363\end{verbatim}\end{gsc}
1364
1365The options get passed to the underlying script, - there is no good help message yet.
1366import and build use gs2 import.pl and buildcol.pl so you can specify any of their options if you like.
1367The sequence of steps is the same as for building via the web interface: new, manually add documents to the import directory, and edit collect.cfg if needed, import, build, activate.
1368
1369Building stuff is in src/java/org/greenstone/gsdl3/build.
1370CollectionConstructor is the base class for building control. GS2PerlConstructor is the implementation that uses Greenstone 2 Perl scripts. The building process sends events (ConstructionEvent) to any listeners (ConstructionListener) as important stages happen. You can add one or more listeners to the constructor which will get notified of events. The perl stuff just passes any messages on---should be more informative in future.
1371
1372\subsection{Collection design}\label{sec:colldesign}
1373
1374Part of collection design involves deciding how the collection should look. Greenstone has a default 'look' for a collection, so this is optional. However, the default may not suit the purposes of some collections, so many parts to the look of a collection can be determined by the collection designer.
1375
1376In standard greenstone, the library is served to a web browser by a servlet, and the html is generated using XSLT. XSLT templates are used to format all the parts of the pages. Some commonly overwritten templates are those for formatting lists: search results list, classifier browsing hierarchies, and for parts of the document display.
1377
1378Real XSL templates for formatting search results or classifier lists are quite complicated, and not at all easy for a new user to write. For example, the following is a sample template for formatting a classifier list, to show Keyword metadata as a link to the document.
1379
1380\begin{gsc}\begin{verbatim}
1381<xsl:template match="documentNode" priority="2"
1382 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
1383 <xsl:param name="collName"/>
1384 <td><a href="{\$library_name}?a=d&amp;c={\$collName}&amp;
1385 d={@nodeID}&amp;dt={@docType}"><xsl:value-of
1386 select="metadataList/metadata[@name='Keyword']"/></a>
1387 </td>
1388</xsl:template>
1389 \end{verbatim}\end{gsc}
1390
1391To write this, the user would need to know that:
1392\begin{bulletedlist}
1393\item the variable \$library\_name exists,
1394\item the collection name is passed in as a parameter called collName
1395\item metadata for a document is found in a metadataList and that its form is \gst{<metadata name="Keyword">the value</metadata>}
1396\item the arguments needed for the link to the document are a, sa, c, d and dt.
1397\end{bulletedlist}
1398
1399Since XSLT is written in XML, we can use XSLT to transform XML into XSLT. GSF uses a simple set of XML elements to represent the old (Greenstone2) format statement elements, and we use XSLT to transform it into a proper XSLT template.
1400
1401\begin{tabular}{ll}
1402\bf Greenstone 2 & \bf Greenstone 3 \\
1403\gst{[Text]} & \gst{<gsf:text/>} \\
1404\gst{[num]} & \gst{<gsf:num/>}\\
1405\gst{[link][/link]} & \gst{<gsf:link></gsf:link>} or \\
1406& \gst{<gsf:link type='document'></gsf:link>}\\
1407\gst{[srclink][/srclink]} & \gst{<gsf:link type='source'></gsf:link>}\\
1408\gst{[icon]} & \gst{<gsf:icon/>} or \\
1409& \gst{<gsf:icon type='document'/>}\\
1410\gst{[srcicon]} & \gst{<gsf:icon type='source'/>}\\
1411\gst{[Title]} (metadata) & \gst{<gsf:metadata name='Title'/>} or \\
1412& \gst{<gsf:metadata name='Title' select='current'/>}\\
1413\gst{[parent:Title]} & \gst{<gsf:metadata name='Title' select='parent' />}\\
1414\gst{[parent(All):Title]} & \gst{<gsf:metadata name='Title' select='ancestors'/>}\\
1415\gst{[parent(Top):Title]} & \gst{<gsf:metadata name='Title' select='root' />}\\
1416\gst{[parent(All': '):Title]} & \gst{<gsf:metadata name='Title' select='ancestors'}\\
1417& \gst{separator=': ' />}\\
1418\end{tabular}
1419
1420 Other select values for gsf:metadata are \gst{children} and \gst{descendents}. How you would actually use these is unclear.
1421
1422The user specifies a \gst{<gsf:template>} for what they want to format---these can match \gst{documentNode} or \gst{classifierNode} (for node in a classification hierarchy).
1423
1424The template above is now represented as:
1425
1426\begin{gsc}\begin{verbatim}
1427<gsf:template match='documentNode'>
1428 <td><gsf:link><gsf:metadata name='Keyword'/></gsf:link></td>
1429</gsf:template>
1430\end{verbatim}\end{gsc}
1431
1432I am not sure how the \{If\} and \{Or\} stuff will go yet. Any ideas????
1433\section{Greenstone Installation}
1434
1435This section describes the directory structure of the Greenstone source, and provides an installation guide to installing Greenstone from CVS.
1436
1437\subsection{Directory structure}
1438Table~\ref{tab:dirs} shows the file hierarchy for Greenstone3.
1439The first part shows the common stuff which can be shared between
1440Greenstone users---the src, libraries etc. Under linux, these will eventually be installed into appropriate system directories. The second part shows
1441stuff used by one person/group---their sites and interface setup
1442etc. There can be several sites/interfaces per installation.
1443
1444\begin{table}
1445\caption{The Greenstone directory structure}
1446\label{tab:dirs}
1447\center{\footnotesize
1448\begin{tabular}{l p{8cm}}
1449\hline
1450gsdl3
1451 & The main installation directory---gsdl3home can be changed to something more standard\\
1452gsdl3/src
1453 & Source code lives here \\
1454gsdl3/src/java/org/greenstone/gsdl3
1455 & Contains the top level classes that either have main programs, or are server/servlet classes\\
1456gsdl3/src/java/org/greenstone/gsdl3/core
1457 & ModuleInterface, MessageRouter, Receptionist---the central classes that the others hang off\\
1458gsdl3/src/java/org/greenstone/gsdl3/service
1459 & The various service modules---these things do the work\\
1460gsdl3/src/java/org/greenstone/gsdl3/util
1461 & Utility classes \\
1462gsdl3/src/java/org/greenstone/gsdl3/collection
1463 & ServiceCluster and Collection classes\\
1464gsdl3/src/java/org/greenstone/gsdl3/comms
1465 & Communicator classes, eg SOAP\\
1466gsdl3/src/java/org/greenstone/gsdl3/build
1467 & stuff for collection building \\
1468gsdl3/src/java/org/greenstone/gsdl3/action
1469 & Action classes used by the Receptionist---do the work of displaying the pages\\
1470gsdl3/src/java/org/greenstone/gdbm
1471 & Java wrapper for gdbm---uses j-gdbm, a jni gdbm wrapper\\
1472gsdl3/src/java/org/greenstone/testing
1473 & Junit scaffolding for unit testing.\\
1474gsdl3/src/java/org/greenstone/applet
1475 & where the code for applets goes \\
1476gsdl3/src/java/org/greenstone/applet/phind
1477 & the phind applet (phrase browsing) \\
1478gsdl3/src/cpp/
1479 & Place for any cpp source code---none yet \\
1480gsdl3/packages
1481 & Imported packages from other systems eg mg, mgpp \\
1482gsdl3/lib
1483 & Shared library files\\
1484gsdl3/lib/java
1485 & Java jar files\\
1486gsdl3/resources
1487 & any resources that may be needed\\
1488gsdl3/resources/java
1489 & properties files for java resource bundles - used to handle all the language specific text This directory is on the classpath, so any other Java resources can be placed here \\
1490gsdl3/resources/soap
1491 & soap service description files \\
1492gsdl3/bin
1493 & executable stuff lives here\\
1494gsdl3/bin/script
1495 & some perl building scripts\\
1496gsdl3/bin/linux
1497 & linux executables for eg mgpp\\
1498gsdl3/comms
1499 & Put some stuff here for want of a better place---things to do with servers and communication. eg soap stuff, and tomcat servlet container\\
1500gsdl3/docs
1501 & Documentation :-)\\
1502\hline
1503gsdl3/web
1504 & This is where the web site is defined. Any static html files can go here. This directory is the Tomcat root directory.\\
1505gsdl3/web/WEB-INF
1506 & The web.xml file lives here (servlet configuration information for tomcat)\\
1507gsdl3/web/WEB-INF/classes
1508 & Servlet classes go in here\\
1509gsdl3/web/sites
1510 & Contains directories for different sites---a site is a set of collections and services served by a single MessageRouter (MR). The MR may have connections (eg soap) to other sites\\
1511gsdl3/web/sites/localsite
1512 & One site - the site configuration file lives here\\
1513gsdl3/web/sites/localsite/collect
1514 & The collections directory \\
1515gsdl3/web/sites/localsite/images
1516 & Site specific images \\
1517gsdl3/web/sites/localsite/transforms
1518 & Site specific transforms \\
1519gsdl3/web/interfaces
1520 & Contains directories for different interfaces - an interface is defined by its images and xslt files \\
1521gsdl3/web/interfaces/default
1522 & The default interface\\
1523gsdl3/web/interfaces/default/images
1524 & The images for the default interface\\
1525gsdl3/web/interfaces/default/transforms
1526 & The XSLT files for the default interface\\
1527\hline
1528\end{tabular}}
1529\end{table}
1530
1531\subsection{Installation guide}
1532
1533\newcommand{\gsdlhome}{\$GSDL3HOME}
1534\newcommand{\gshome}{\$GSDLHOME}
1535
1536Currently, Greenstone3 is only available through CVS. The installation procedure has been semi-automated. Note, these instructions are for installation on linux. If you want to use Greenstone3 on Windows, download it using CVS, then follow the instructions in \gst{http://www.cs.waikato.ac.nz/\~mdewsnip/GSDL3Windows.html}.
1537
1538\subsubsection{Get the source}
1539
1540If you have a greenstone\_cvs account, you can use the following:
1541
1542\begin{quote}\begin{gsc}\begin{verbatim}
1543export CVS_RSH=ssh
1544cvs -d :ext:@cvs.scms.waikato.ac.nz:/usr/local/global-cvs/
1545 gsdl-src co gsdl3
1546\end{verbatim}\end{gsc}\end{quote}
1547
1548Otherwise, you can get it through anonymous access:
1549
1550\begin{quote}\begin{gsc}\begin{verbatim}
1551cvs -d :pserver:cvs\[email protected]:2402/usr/local/
1552 global-cvs/gsdl-src co gsdl3
1553\end{verbatim}\end{gsc}\end{quote}
1554
1555If you need it, the password for anonymous CVS access is \gst{anonymous}. Note that some versions of CVS have trouble accessing this repository. We are using version 1.11.1p1.
1556
1557\subsubsection{Compile and install Greenstone}\label{subsec:compile}
1558
1559An install.sh script has been constructed to compile and install Greenstone3. What you need to do is:
1560
1561\begin{quote}\begin{gsc}
1562cd gsdl3\\
1563source setup.bash\\
1564install.bash\\
1565source setup.bash\\
1566\end{gsc}\end{quote}
1567
1568If you want to do Greenstone2 compatible building (currently the only type) you need to have Greenstone2 installed, \gst{source setup.bash} in the top level Greenstone2 directory, then re-\gst{source setup.bash} for Greenstone3. This is to set \gst{\gshome} for Tomcat.
1569
1570\noindent Note: \gst{source setup.bash} needs to be done once in any xterm window before doing a make or running Tomcat. setup.bash sets the environment variables \gst{CLASSPATH, PATH, JAVA\_HOME} etc.
1571
1572If you want to use SOAP to talk to remote sites, you also need to do the following:
1573
1574\begin{quote}\begin{gsc}
1575install-soap.bash
1576\end{gsc}\end{quote}
1577
1578There is one java command that sometimes doesn't work under bash, so you may need to cut and paste it into the terminal to get it to work. See the output from the bash-script for details.
1579
1580To shutdown or startup Tomcat, the commands are:
1581\begin{quote}\begin{gsc}
1582\gsdlhome/comms/jakarta/tomcat/bin/shutdown.sh\\
1583\gsdlhome/comms/jakarta/tomcat/bin/startup.sh\\
1584\end{gsc}\end{quote}
1585
1586You don't want to run install.bash twice - it adds stuff into files.
1587To update your installation, you can run update.bash - this updates your code form CVS, and remakes all the java stuff.
1588
1589
1590\subsubsection{The sample sites}
1591
1592\noindent There are two Greenstone {\em sites} that come with the checkout: localsite, and soapsite. localsite has three collections, while soapsite has none. Each site has a configuration file which specifies the site name, site-wide services if any, and a list of remote sites to connect to.
1593localsite does not connect to any other sites. soapsite specifies a SOAP connection to localsite.
1594
1595\subsubsection{Tomcat}\label{sec:tomcat}
1596
1597\noindent Tomcat is a servlet container. It is used to serve a Greenstone site using a servlet.
1598
1599The file \gst{\gsdlhome/web/WEB-INF/web.xml} contains the setup information for Tomcat---tells it what servlets to load, what initial parameters to pass them, and what web names map to the servlets.
1600There are three servlets specified in web.xml: one is a test servlet that just prints ``hello greenstone'' to a web page. This is useful if you are having trouble getting Tomcat set up. The other two are Greenstone library servlets, {\em library}, which serves localsite, and {\em library1} which serves soapsite.
1601
1602The initialisation parameters used by the library servlets are as follows:
1603
1604\begin{tabular}{llp{5cm}}
1605\bf name & \bf sample value & \bf description \\
1606\hline
1607gsdl3home & /research/kjdon/gsdl3 & the base directory of the gsdl3 installation \\
1608sitename & localsite & the site to use \\
1609interfacename & default & the interface to use\\
1610libraryname & library & the name of the library program \\
1611defaultlang & en & the default language for the interface\\
1612receptionist & NZDLReceptionist & (optional) specifies an alternative Receptionist to use\\
1613messagerouter & NewMessageRouter & (optional) specifies an alternative MessageRouter to use\\
1614\hline
1615\end{tabular}
1616
1617It is possible to run several servlets at once, with different combinations of sites and/or interfaces.
1618
1619The file \gst{\gsdlhome/comms/jakarta/tomcat/conf/server.xml} is the Tomcat configuration file. The installation process adds a context for Greenstone3 servlets (\gst{\gsdlhome/web})---this tells Tomcat where to find the web.xml file, and what URL (\gst{/gsdl3}) to give it. Anything inside the context directory is accessible via Tomcat\footnote{can we use .htaccess files to restrict access??}. For example, the index.html file that lives in \gst{\gsdlhome/web} can be accessed through the URL \gst{localhost:8080/gsdl3/index.html}. The demo collection's images can be accessed through \\
1620\gst{localhost:8080/gsdl3/sites/localsite/collect/demo/images/}~.
1621
1622
1623Tomcat runs by default on port 8080---this can be changed in server.xml. The siteConfig files also need changing if Tomcat's port is changed: \gst{<httpAddress>} for the site, and \gst{<address>} for a remote site both use this.
1624
1625
1626
1627\subsubsection{Serving your site using Tomcat}\label{subsec:runtomcat}
1628
1629\noindent To run Tomcat, you need to have sourced {\footnotesize \verb#setup.bash#} in \gsdlhome\ to set up {\footnotesize \$CLASSPATH} (see \ref{subsec:compile}). Then,
1630
1631\begin{gsc}\begin{tt}
1632\noindent cd \gsdlhome/comms/jakarta/tomcat/bin\\
1633./startup.sh
1634\end{tt}\end{gsc}
1635
1636\noindent ({\footnotesize \verb#./shutdown.sh#} shuts down Tomcat)
1637\\
1638\\
1639\noindent The Tomcat server can be accessed on the web at \gst{http://localhost:8080}---this gets you to a welcome page.
1640The Greenstone stuff is at \gst{http://localhost:8080/gsdl3}---this displays \gst{\gsdlhome/web/index.html}. You should be able to run the test servlet and both library servlets from this page.
1641
1642\noindent Note: Tomcat must be shutdown and restarted any time you make changes in the following for those changes to take effect:\\
1643\begin{bulletedlist}
1644\begin{gsc}
1645\item \gsdlhome/web/WEB-INF/web.xml
1646\item \gsdlhome/comms/jakarta/tomcat-tomcat-4.0.1/conf/server.xml
1647\end{gsc}
1648\item any classes or jar files used by the servlets
1649\end{bulletedlist}
1650\noindent Note: stdin and stdout for the servlets both go to\\
1651\gst{\gsdlhome/comms/jakarta/tomcat/logs/catalina.out}
1652
1653On startup, the servlet loads in its collections and services. If the site or collection configuration files are changed, these changes will not take effect until the site/collection is reloaded. This can be done through the reconfiguration messages (see Section~\ref{sec:runtime-config}, or by restarting Tomcat.
1654
1655Symlinks:
1656
1657Tomcat by default doesn't follow symlinks (although the symlink to lib seems to work). To make it follow symlinks, eg to have the collect directory of a site somewhere else, you need to add the following to tomcats server.xml \\
1658(\$GSDL3HOME/comms/jakarta/tomcat/conf/server.xml):
1659\gst{<Resources allowLinking='true'/>}
1660This needs to go inside the gsdl3 context, i.e.
1661
1662\begin{quote}\begin{gsc}
1663<Context path="/gsdl3" docBase="\$GSDL3HOME/web" debug="1" \\
1664reloadable="true">\\
1665 <Resources allowLinking='true'/>\\
1666</Context>\\
1667\end{gsc}\end{quote}
1668By default, tomcat allows directory listings for everything in the docBase directory. For example, you can enter localhost:8080/gsdl3/sites and it will give you a list of all the sites. To turn this off, you need to edit Tomcat's default web.xml file (\$GSDL3HOME/comms/jakarta/tomcat/conf/web.xml):
1669
1670In the default servlet definition, change the 'listings' param to false.
1671
1672
1673Running tomcat with apache.
1674apache can be easily set up to proxy tomcat eg
1675
1676in the www.mysite.org
1677\begin{quote}\begin{gsc}
1678<VirtualHost a.b.c.d>\\
1679ServerName www.mysite.org\\
1680...\\
1681ProxyPass /greenstone3 http://puka.cs.waikato.ac.nz:8080/gsdl3\\
1682ProxyPassReverse /greenstone3 http://puka.cs.waikato.ac.nz:8080/gsdl3\\
1683</VirtualHost>\\
1684\end{gsc}\end{quote}
1685can now access tomcat, instead of at puka.cs.waikato.ac.nz:8080/gsdl3, but at www.mysite.org/greenstone3
1686
1687if tomcat is running behind a proxy, and you want to access stuff like the infomine database where you need to make external connections, you need to fill in the proxy element in the siteConfig.xml file - unfortunately the password is added in plain text. but can make it so that only the server admin can see it.
1688\subsubsection{Using SOAP to talk to a remote site}
1689
1690\noindent The previous installation stuff is fine if you only want to talk to local sites. However, if you want to connect using SOAP to a remote site, some more stuff needs to be done. soapsite specifies a SOAP connection to localsite. If you run soapsite without connecting to localsite, you don't get any collections. However, if you connect to localsite, you can see all of {\em its} collections.
1691\\
1692\\
1693\noindent The SOAP server we use is actually run as a servlet in Tomcat. You need to set up SOAP, set up the SOAP server class which will be your SOAP web service, and then deploy that service.
1694This is done by install-soap.bash.
1695You can also deploy a service through the website. If Tomcat is not running, start it up (see \ref{subsec:runtomcat}).
1696
1697\noindent The SOAP servlet can be accessed at \begin{gsc}{\tt http://localhost:8080/soap}\end{gsc}. You should see a welcome page. Click on ``Run the admin client''. This enables you to list, deploy and undeploy SOAP services.
1698
1699\noindent To deploy the SOAPServer for localsite:
1700
1701\noindent Click on ``deploy'' and edit the following fields in the deploy form:
1702
1703\begin{tabular}{ll}
1704ID: & org.greenstone.localsite\\
1705Scope: (any will do) & Request---new instantiation for each request\\
1706 & Session---same instantiation across a session\\
1707 & Application---only uses one instantiation\\
1708Methods: &process\\
1709Java Provider / Provider Class: & org.greenstone.gsdl3.SOAPServer\\
1710\end{tabular}
1711
1712\noindent Now click the ``deploy'' button at the bottom of the page. If the service has been deployed, it should appear when you click on the left hand ``List'' button.
1713
1714\noindent Information about deployed services is maintained between Tomcat sessions---you only need to deploy it once. To get the library1 servlet talking to the SOAP server, you need to shutdown and restart Tomcat (see \ref{subsec:runtomcat}). You should see more collections when you run the library1 servlet.
1715
1716\subsubsection{Debugging SOAP}
1717
1718If you need to debug the SOAP stuff for some reason, or just want to look at the SOAP messages that are being passed back and forth, use a program called TcpTunnelGui. This intercepts messages coming in to one port, displays them, and passes them to another port.
1719To run it, type:
1720
1721\begin{quote}\gst{java org.apache.soap.util.net.TcpTunnelGui 8070 localhost 8080}
1722\end{quote}
1723
17248070 is the port that TcpTunnelGui listens on, and 8080 is the port that it sends the messages onto---the port that Tomcat is using. You need to modify Greenstone to talk to port 8070 when it wants to talk to Tomcat, so that the messages go through TcpTunnelGui. This is specified in the \gst{<site>} element of the soapsite site configuration file (\gst{\gsdlhome/web/sites/soapsite/siteConfig.xml}).
1725\begin{quote}\begin{gsc}\begin{verbatim}
1726<site name="org.greenstone.localsite"
1727 address="http://localhost:8080/soap/servlet/rpcrouter"
1728 type="soap"/>
1729\end{verbatim}\end{gsc}\end{quote}
1730
1731Note that \gst{http://localhost:8080/soap/servlet/rpcrouter} is the
1732address for talking to the Tomcat SOAP servlet services.
1733
1734\section{Greenstone Customization}
1735
1736this is the dynamic stuff, immediate or through tomcat restart
1737\subsection{How to define a new interface}
1738
1739Most of an interface is defined by XSLT files, which are stored in web/interfaces/interface-name/transform. A new interface needs a directory in web/interfaces. inside, needs images and transform directories. and interfaceConfig.xml file. Any xslt may be overridden for a new interface by putting the replacement in the new interface transform directory. If the appropriate xslt file is not there, the default one will be used - this enables just overriding a few xslt files as needed.
1740xslt are looked for in order: collection, site, interface, default interface. This also applies to included xslts. (this doesn't work for colls/sites on remote computers. ). the xsl:include directives are preprocessed by the java code and full paths added based on availability of teh files, so that the correct one is used.
1741you cannot include a template with teh same name as teh includer.
1742\subsection{Adding a new language}
1743
1744Adding a new interface language to Greenstone 3 is easy. All of the language-dependent text strings are contained in Java resource bundle properties files. These are plain text files consisting of key-value pairs, located in resources/java. Each interface has one named interface\_name.properties (where name is the interface name). Each service class has one with the same name as the class (eg GS2Search.properties). To add another language these files must be translated. The translated files keep the same names, but with a language extension added. For example, a French version of interface\_default.properties would be named interface\_default\_fr.properties.
1745
1746Keys will be looked up in the properties file closest to the specified language. For example, if language fr\_CA was specified (french language, country Canada), and the default locale was en\_GB, java would look at properties files in the following order, until it found the key: XXX\_fr\_CA.properties, XXX\_fr.properties, XXX\_en\_GB.properties, then XXX\_en.properties, and finally the default XXX.properties.
1747\section{Greenstone Development}
1748
1749this is the customization that requires recompilation.
1750Here are some random notes for developers who want to modify the source code.
1751\subsection{Greenstone utility classes}
1752
1753These are found in \gst{gsdl3/src/java/org/greenstone/gsdl3/util} and provide a variety of useful functions. Table~\ref{tab:utils} gives a brief description of the various classes.
1754
1755\begin{table}
1756\caption{The utility classes in org.greenstone.gsdl3.util}
1757\label{tab:utils}
1758\center{\footnotesize
1759\begin{tabular}{lp{3.75in}}
1760\hline
1761\bf Utility class & \bf Description\\
1762ConfigVars & holds the servlet startup variables, including library name, site name, interface name, default language\\
1763Dictionary & wrapper around a ResourceBundle, providing strings with parameter\\
1764GSCGI & class to map between short name cgi args and long name request parameters \\
1765GSFile & class to create all Greenstone file paths eg used to locate configuration files, xslt files and collection data. \\
1766GSHTML & provides convenience methods for dealing with HTML, eg making strings HTML safe\\
1767GSPath & used to create, examine and modify message address paths\\
1768GSStatus & some static codes for status messages\\
1769GSXML & lots of methods for extracting information out of Greenstone XML, and creating some common types of elements. Also has static Strings for element and attribute names used by Greenstone.\\
1770GSXSLT & some manipulation functions for Greenstone XSLT\\
1771Misc & miscellaneous functions\\
1772OID & class to handle Greenstone (2) OIDs\\
1773XMLConverter & provides methods to create new Documents, parse Strings or Files into Documents, and convert Nodes to Strings\\
1774XMLTransformer & methods to transform XML using XSLT \\
1775XSLTUtil & contains static methods to be called from within XSLT \\
1776\hline
1777\end{tabular}
1778}
1779\end{table}
1780
1781\subsection{Creating new services}
1782
1783*inherit from ServiceRack - abstract base class. this handles the main process method, determines hte service name and request type. if request type is describe, and to is empty, it returns a list of services (short\_service\_info) which is initialised in the configure method. a describe request to a particular service results in getServiceDescription being called, which must be supplied by the subclass.
1784other request types (process) get sent to processXXX methods, where XXX is the service name.
1785
1786* what methods are expected
1787
1788*service type responses expected
1789
1790*a browse type service must also implement servicenameMetadataRetrieve service.
1791
1792* should a metadata retrieval service advertise what metadata is available??
1793\subsection{creating new actions/pages}
1794
1795\subsection{Working with XML}
1796
1797We use the DOM model for handling XML. This involves Documents, Nodes, Elements etc. Node is the basic thing in the tree, all others inherit from this. A Document represents a whole document, and is a kind of container for all the nodes. Elements and Nodes are not supposed to exist outside of the context of a document, so you have to have a document to create them. The document is not the top level node in the tree, to get this, use Document.getDocumentElement(). If you create nodes etc but don't append them to something already in the document tree, they will be separate - but they still know who their owner document is.
1798
1799To create new Documents, and convert Strings or Files to Documents, use XMLConverter.
1800eg:
1801\begin{quote}\begin{gsc}
1802XMLConverter converter = new XMLConverter();\\
1803Document doc = converter.newDOM();\\
1804
1805File stylesheet = new File(``query.xsl'');\\
1806Document style = converter.getDOM(stylesheet);\\
1807
1808String message = ``<message><request type='page'/></message>'';\\
1809Document m = converter.getDOM(message);\\
1810\end{gsc}\end{quote}
1811
1812To output a document as a String, use \gst{converter.getString(doc);}
1813
1814To add nodes and stuff to an empty document - create them, then append to the tree:
1815\begin{quote}\begin{gsc}
1816Document doc = converter.newDOM();\\
1817Element e = doc.createElement(``message'');\\
1818doc.appendChild(e);\\
1819\end{gsc}\end{quote}
1820
1821Note that you can only append one node to a document---this will become the top level node. After that, you can append nodes to child nodes as you like, but a document is only allowed one top level node.
1822
1823Nodes can only be created by a Document. Document has creation methods for all types of Nodes, for example \gst{createElement(element\_name)}, \gst{createAttribute(attr\_name)}, \gst{createTextNode(text\_data)} etc.
1824
1825DOM006 Hierarchy request error: happens if you have more than one root node in your document
1826
1827\subsection{Greenstone XML}
1828
1829Greenstone format namespace: (at the moment)
1830xmlns:gsf="http://www.greenstone.org/configformat"
1831
1832(xslt namespace: xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
1833no DTDs or Schema defined yet. Until there are, try and keep to the following rules:
1834
1835\begin{bulletedlist}
1836
1837\item always return expected elements even if empty, eg \gst{<paramList/>}.
1838
1839\item If you get the whole document it is called \gst{<document>}. However if you are returned a list of pointers to parts of the documents, they are \gst{<documentNode>}s.
1840
1841\item inside a list you can only have elements of the same name as the list. For example, a \gst{<paramList>} should only have \gst{<param>} elements inside it.
1842
1843\end{bulletedlist}
1844\subsection{Working with XSLT}
1845
1846\begin{bulletedlist}
1847\item {\em adding html to an xml doc:}
1848
1849eg I have a text node with html inside it inside a resource element
1850to add that to a new XML doc, I use
1851\gst{<xsl:value-of select='resource'>}
1852
1853if the output mode is xml or html, this will escape any special characters
1854ie $<$ and $>$ etc
1855
1856use
1857\gst{<xsl:value-of disable-output-escaping="yes" select='resource'>}
1858instead.
1859
1860\item {\em including an xml doc into a stylesheet:}
1861
1862\gst{<xsl:variable name='import' select='document(``newdoc.xml'')'/>}
1863
1864then can use the info:
1865
1866\gst{<xsl:value-of select='\$import/element'/>}
1867
1868\item {\em selecting an ancestor:}
1869
1870 the ancestor axis contains the parent of the context node, and its
1871 parent and so on. to pick one node among these:
1872 ancestor::elem-name. I dont know how this works if there are two
1873 nodes with the same name in the axis.
1874
1875\item {\em basic XSLT elements:}
1876\begin{quote}\begin{footnotesize}\begin{verbatim}
1877<xsl:template match='xxx' name='yyy'/>
1878
1879<xsl:apply-templates select='xxx'/>
1880<xsl:call-templates name='yyy'/>
1881
1882<xsl:variable name='doc' select='document("layout.xml")'/>
1883
1884<xsl:value-of select='$doc/chapter1'/> $
1885\end{verbatim}\end{footnotesize}\end{quote}
1886
1887\item {\em using namespaces:}
1888If you are using the same namespace in more than one file, eg in the source xml and in the stylesheet, make sure that the URI for the xmlns:xxx thingy is the same in both cases---otherwise the names don't match. This includes http:// on the front.
1889
1890\item I don't think \gst{<xsl:with-param name='xxx' select='true'/>} is
1891the same as \gst{<xsl:with-param name='xxx'>true</xsl:with-param>}.
1892Use the second one.
1893
1894\item to select a node from a list based on an attribute value: for example
1895\begin{quote}\begin{footnotesize}\begin{verbatim}
1896<xsl:variable name='name'>CL1</xsl:variable>
1897
1898<xsl:value-of select="classifier[@name=\$name]/@content"/>
1899\end{verbatim}\end{footnotesize}\end{quote}
1900
1901\item{\em using Java extension elements:}
1902
1903Declare the namespace for your java extensions using one of the following
1904three formats.
1905
1906class format: \gst{xmlns:my-class="xalan://FQCN"} where FQCN is the fully qualified class name. Examples: \gst{xmlns:my-class="xalan://java.util.Hashtable"}, \gst{xmlns:my-class="xalan://mypackage.myclass"}
1907
1908package format: \gst{xmlns:my-class="xalan://PJPN"} where PJPN is a partial java package name. That is, it is the beginning of or the complete name of a java package. Examples: \gst{xmlns:my-package="xalan://java.util"}, \gst{xmlns:my-package="xalan://mypackage"}
1909
1910Java format: \gst{xmlns:java="http://xml.apache.org/xalan/java"}
1911
1912Then, how you use the java classes and methods depends on which format you declared you namespace.
1913
1914class format:
1915
1916To create an instance of an object: \gst{prefix:new (args)}. Example: \gst{<xsl:variable name="myType" select="my-class:new()">}
1917
1918To invoke an instance method on a specified object: \gst{prefix:methodName (object, args)} where methodName is the name of the method to invoke on object with the args arguments. object must be an object of the class indicated by the namespace declaration. Example: \gst{<xsl:variable name="new-pop" select="my-class:valueOf(\$myType, string(@population))">}
1919
1920To invoke an instance method on a default object: \gst{prefix:methodName (args)} where methodName is the name of the method to invoke with the args arguments. If a matching method is found, a default instance of the class will be created if it does not already exist. Example: \gst{<xsl:variable name="new-pop" select="my-class:valueOf(string(@population))">}
1921
1922To invoke a static method: \gst{prefix:methodName (args)} where methodName is the name of the method to invoke with the args arguments. Example: \gst{<xsl:variable name="new-pop" select="my-class:printit(string(@population))">}
1923
1924package format:
1925
1926o create an instance of an object:
1927 prefix:subpackage.class.new (args)
1928
1929 where prefix is the extension namespace prefix, subpackage is the rest of the package name (the
1930 beginning of the package name was in the namespace declaration), and class is the name of the class.
1931 A new instance is to be created with the args constructor arguments (if any). All constructor methods
1932 are qualified for method selection.
1933 Example: \gst{<xsl:variable name="myType"
1934 select="my-package:extclass.new()">}
1935
1936 To invoke an instance method on a specified instance:
1937 prefix:methodName (object, args)
1938
1939 where prefix is the extension namespace prefix and methodName is the name of the method to invoke
1940 on object with the args arguments. Only instance methods of the object with the name methodName
1941 are qualified methods. If a matching method is found, object will be used to identify the object instance
1942 and args will be passed to the invoked method.
1943 Example: \gst{<xsl:variable name="new-pop"
1944 select="my-package:valueOf(\$myType, string(@population))">}
1945
1946 To invoke a static method:
1947 prefix:subpackage.class.methodName (args)
1948
1949 where prefix is the extension namespace prefix, subpackage is the rest of the package name (the
1950 beginning of the package name was in the namespace declaration), class is the name of the class, and
1951 methodName is the name of the method to invoke with the args arguments. Only static methods with
1952 the name methodName are qualified methods. If a matching method is found, args will be passed to the
1953 invoked static method.
1954 Example: \gst{<xsl:variable name="new-pop"
1955 select="my-package:extclass.printit(string(@population))">}
1956
1957
1958 Unlike the class format namespace, there is no concept of a default object since the namespace
1959 declaration does not identify a unique class.
1960
1961java format:
1962
1963
1964
1965
1966 To create an instance of an object:
1967 prefix:FQCN.new (args)
1968
1969 where prefix is the extension namespace prefix for the Java namespace and FQCN is the fully qualified
1970 class name of the class whose constructor is to be called. A new instance is to be created with the
1971 args constructor arguments (if any). All constructor methods are qualified for method selection.
1972 Example: \gst{<xsl:variable name="myHash"
1973 select="java:java.util.Hashtable.new()">}
1974
1975 To invoke an instance method on a specified instance:
1976 prefix:methodName (object, args)
1977
1978 where prefix is the extension namespace prefix and methodName is the name of the method to invoke
1979 on object with the args arguments. Only instance methods of the object with the name methodName
1980 are qualified methods. If a matching method is found, object will be used to identify the object instance
1981 and args will be passed to the invoked method.
1982 Example: \gst{<xsl:variable name="new-pop"
1983 select="java:put(\$myHash, string(@region), \$newpop)">}
1984
1985 To invoke a static method:
1986 prefix:FQCN.methodName (args)
1987
1988 where prefix is the extension namespace prefix, FQCN is the fully qualified class name of the class
1989 whose static method is to be called, and methodName is the name of the method to invoke with the
1990 args arguments. Only static methods with the name methodName are qualified methods. If a matching
1991 method is found, args will be passed to the invoked static method.
1992 Example: \gst{<xsl:variable name="new-pop"
1993 select="java:java.lang.Integer.valueOf(string(@population))">}
1994
1995
1996 Unlike the class format namespace, there is no concept of a default object since the namespace
1997 declaration does not identify a unique class.
1998
1999
2000
2001
2002
2003
2004
2005\end{bulletedlist}
2006\subsubsection{What can I do to speed up XSL transformations?}
2007
2008This information taken from the Xalan FAQS page.
2009
2010\begin{bulletedlist}
2011
2012\item Use a Templates object (with a different Transformers for each
2013transformation) to perform multiple transformations with the same set
2014of stylesheet instructions.
2015
2016\item Set up your stylesheets to function efficiently.
2017
2018\item Don't use "//" (descendant axes) patterns near the root of a
2019large document.
2020
2021\item Use xsl:key elements and the key() function as an efficient way
2022to retrieve node sets.
2023
2024\item Where possible, use pattern matching rather than xsl:if or
2025xsl:when statements.
2026
2027\item xsl:for-each is fast because it does not require pattern matching.
2028
2029\item avoid recursion
2030
2031\item Keep in mind that xsl:sort prevents incremental processing.
2032
2033\item When you create variables,\\
2034\gst{<xsl:variable name="fooElem" select="foo"/>} is usually faster
2035than \\
2036\gst{<xsl:variable name="fooElem"><xsl:value-of-select="foo"/></xsl:variable>}.
2037
2038\item Be careful using the last() function.
2039
2040\item The use of index predicates within match patterns can be expensive.
2041
2042\item Decoding and encoding is expensive.
2043
2044\item For the ultimate in server-side scalability, perform transform
2045operations on the client.
2046
2047\end{bulletedlist}
2048
2049\subsection{Java gdbm}
2050
2051To talk to gdbm, a jni wrapper called java-gdbm is used. It was
2052obtained from:\\ \gst{http://aurora.rg.iupui.edu/~schadow/dbm-java/pip/gdbm/}
2053
2054It uses packing objects to convert to and from an array of bytes (in
2055gdbm file) from and to java objects. In my GDBMWrapper class I use
2056StringPacking - uses UTF-8 encoding. but some stuff came out funny. so
2057I had to changes the from\_bytes method in StringPacking.java to use
2058new String(raw, "UTF-8") instead of new String(raw). this seems to
2059work.
2060
2061Note---if we use this gdbm stuff to create the file too, may need to
2062alter the to-bytes method.
2063
2064The makefile in j-gdbm is crap---it tries to get stuff from its
2065original CVS tree. I have created a new Makefile---in my-j-gdbm
2066directory. this stuff needs to go into CVS probably.
2067
2068
2069
2070\subsection{Resources}
2071
2072This is a list of some useful resources that we have come across during development of gsdl3.
2073
2074Contents for 'The Java Native Interface Programmer's Guide and
2075Specification' on-line\\
2076\gst{http://java.sun.com/docs/books/jni/html/jniTOC.html}
2077
2078Java Native Interface Specification\\
2079\gst{http://java.sun.com/j2se/1.4/docs/guide/jni/spec/jniTOC.doc.html}
2080
2081JNI Documentation Contents\\
2082\gst{http://java.sun.com/j2se/1.4/docs/guide/jni/index.html}
2083
2084another JNI page\\
2085\gst{http://mindprod.com/jni.html}
2086
2087Java 1.4 API index\\
2088\gst{http://java.sun.com/j2se/1.4/docs/api/index.html}
2089
2090Java tutorial index\\
2091\gst{http://java.sun.com/docs/books/tutorial/index.html}
2092
2093Safari books online - has Java, XML, XSLT, etc books\\
2094\gst{http://proquest.safaribooksonline.com/mainhom.asp?home}
2095
2096Java 1.4 i18n FAQ\\
2097\gst{http://www.sun.com/developers/gadc/faq/java/java1.4.html}
2098
2099Java and XSLT page\\
2100\gst{http://www.javaolympus.com/java/Java\%20and\%20XSLT.html}
2101
2102Xalan-Java overview\\
2103\gst{http://xml.apache.org/xalan-j/overview.html}
2104
2105Tomcat documentation index\\
2106\gst{http://jakarta.apache.org/tomcat/tomcat-4.0-doc/index.html}
2107
2108Servlet and JSP tutorial\\
2109\gst{http://www.apl.jhu.edu/~hall/java/Servlet-Tutorial/}
2110
2111Core Servlets and JavaServer Pages, book by Marty Hall. download the
2112pdf from here (try before you buy link)\\
2113\gst{http://www.coreservlets.com/}
2114
2115J-gdbm page\\
2116\gst{http://aurora.rg.iupui.edu/~schadow/dbm-java/pip/gdbm/}
2117
2118Stuarts page of links\\
2119\gst{http://www.cs.waikato.ac.nz/~nzdl/gsdl3/}
2120
2121a good basic xslt tutorial\\
2122\gst{http://www.zvon.org/xxl/XSLTutorial/Books/Output/contents.html}
2123
2124JAXP (java api for xml processing) package overview\\
2125\gst{http://java.sun.com/xml/jaxp/dist/1.1/docs/api/overview-summary.html}
2126
2127DeveloperWorks, xml zone\\
2128\gst{http://www-106.ibm.com/developerworks/xml/}
2129
2130xslt.com\\
2131\gst{http://www.xslt.com/}
2132
2133jeni tennison's xslt pages\\
2134\gst{http://www.jenitennison.com/xslt/}
2135
2136apaches xml tools\\
2137\gst{http://xml.apache.org/}
2138
2139
2140%\clearpage
2141%\addcontentsline{toc}{chapter}{Bibliography}
2142%\bibliography{main}
2143
2144\end{document}
2145
2146
2147
Note: See TracBrowser for help on using the repository browser.