Changeset 6904
Timestamp: 2004-03-03T14:32:22+13:00
Files: 1 edited
trunk/gsdl3/docs/manual/manual.tex
r6520 r6904
  1 1 \documentclass[a4paper,11pt]{article}
- 2 \usepackage{times,epsfig}
+ 2 \usepackage{isolatin1,times,epsfig}
  3 3 \hyphenation{Message-Router Text-Query}
  4 4
… …
  54 54 A description of the general design and architecture of Greenstone3 is covered by the document {\em The design of Greenstone3: An agent based dynamic digital library} (design-2002.ps, in the gsdl3/docs/manual directory).
  55 55
- 56 This documentation consists of several parts. Section~\ref{sec:install} covers greenstone installation, how to access the library, and some administration issues. Section~\ref{sec:user} looks at using the sample collections, creating new collections, and how to make small customisations to the interface. The remaining sections are aimed towards the Greenstone developer. Section~\ref{sec:develop-runtime} describes the run-time system, including the structure of the software, and the message format, while Section~\ref{sec:develop-build} describes the collection building process. Section~\ref{sec:new-features} describes how to add new features to Greenstone, such as how to add new services, new page types, new plugins for different document formats. Section~\ref{sec:distributed} describes how to make Greenstone run in a distributed fashion, using SOAP as an example communications protocol. Finally, there are several appendices, including how to install Greenstone from CVS, and a comparison of greenstone 2 and greenstone3 format statements.
+ 56 This documentation consists of several parts. Section~\ref{sec:install} covers greenstone installation, how to access the library, and some administration issues. Section~\ref{sec:user} looks at using the sample collections, creating new collections, and how to make small customisations to the interface. The remaining sections are aimed towards the Greenstone developer. Section~\ref{sec:develop-runtime} describes the run-time system, including the structure of the software, and the message format, while Section~\ref{sec:develop-build} describes the collection building process. Section~\ref{sec:new-features} describes how to add new features to Greenstone, such as how to add new services, new page types, new plugins for different document formats. Section~\ref{sec:distributed} describes how to make Greenstone run in a distributed fashion, using SOAP as an example communications protocol. Finally, there are several appendices, including how to install Greenstone from CVS, and a comparison of Greenstone2 and Greenstone3 format statements.
  57 57 \newpage
  58 58 \section{Greenstone installation and administration}\label{sec:install}
… …
  69 69 \subsubsection{Linux}
  70 70
- 71 Download the latest version of the self-installing tar file, gsdl3-x.xx-unix.sh, and run it in a shell (./gsdl3-x.xx-unix.sh). Greenstone will be installed into a directory called gsdl3 inside the current directory. The install script will prompt you for the name of your computer and what port to run tomcat on (the defaults being localhost and 8080). Once Greenstone has been installed, you can start the library by running ./gsdl3/gs3-launch.sh, and opening up a browser pointing to localhost:8080/gsdl3 (or different computer name and port).
+ 71 Download the latest version of the self-installing tar file, gsdl3-x.xx-unix.sh, and run it in a shell (./gsdl3-x.xx-unix.sh). Greenstone will be installed into a directory called gsdl3 inside the current directory. The install script will prompt you for the name of your computer and what port to run Tomcat on (the defaults being localhost and 8080). Once Greenstone has been installed, you can start the library by running ./gsdl3/gs3-launch.sh, and opening up a browser pointing to localhost:8080/gsdl3 (or different computer name and port).
  72 72
  73 73 \subsubsection{Windows}
… …
  87 87 \subsubsection{Restarting the library}
  88 88
- 89 The library program (actually tomcat) can be restarted by ... (** put a mechanism in each install program **).
+ 89 The library program (actually Tomcat) can be restarted by ... (** put a mechanism in each install program **).
  90 90
  91 91
… …
  106 106 Table~\ref{tab:dirs} shows the file hierarchy for Greenstone3.
  107 107 The first part shows the common stuff which can be shared between
- 108 Greenstone users---the source, libraries etc. Under Linux, these will eventually be installed into appropriate system directories. The second part shows
+ 108 Greenstone users---the source, libraries etc. Under Linux, these can be installed into appropriate system directories. The second part shows
  109 109 stuff used by one person/group---their sites and interface setup (see Section~\ref{sec:sites-and-ints}).
  110 110 etc. There can be several sites/interfaces per installation.
… …
  149 149 & windows executables for e.g. MGPP\\
  150 150 gsdl3/comms
- 151 & Put some stuff here for want of a better place---things to do with servers and communication. e.g. soap stuff, and tomcat servlet container\\
+ 151 & Put some stuff here for want of a better place---things to do with servers and communication. e.g. soap stuff, and Tomcat servlet container\\
  152 152 gsdl3/docs
  153 153 & Documentation :-)\\
… …
  156 156 & This is where the web site is defined. Any static html files can go here. This directory is the Tomcat root directory.\\
  157 157 gsdl3/web/WEB-INF
- 158 & The web.xml file lives here (servlet configuration information for tomcat)\\
+ 158 & The web.xml file lives here (servlet configuration information for Tomcat)\\
  159 159 gsdl3/web/WEB-INF/classes
  160 160 & Servlet classes go in here\\
… …
  187 187 where they live, whats the difference, what each contains.\\
  188 188
- 189 A site is comprised of a set of collections and possibly services.
An interface is a set of images along with a set of xslt files used for translating xml output from the library into an appropriate form---html for the servlet case.
- 190 One greenstone installation can have many sites and interfaces. One instantiation of a servlet uses one site and one interface. Sites and interfaces can be matched up in different ways. For example, a single site might be served with two different interfaces. This provides different modes of access to the same content. eg HTML vs WML, or perhaps providing completely different look and feel for different audiences. A standard interface may be used with many different sites---provides a consistent mode of access to a lot of different content.
- 191
- 192 Collections live in the collect directory of a site. Any collections that are found in this directory when the servlet is initialised will be loaded up and presented to the user. Collections require valid configuration files, but apart from this, nothing needs to be done to the site to use new collections. Collection is added while tomcat is running will not be picked up: you can either restart the server, or send a configuration request to the servlet: these are described in Section~\ref{sec:runtime-config}.
+ 189 A site is comprised of a set of collections and possibly some site-wide services. An interface (in this web-based servlet context) is a set of images along with a set of xslt files used for translating xml output from the library into an appropriate form---html in general.
+ 190
+ 191 One greenstone installation can have many sites and interfaces. One instantiation of a servlet uses one site and one interface. Sites and interfaces can be matched up in different ways. For example, a single site might be served with two different interfaces. This provides different modes of access to the same content. eg HTML vs WML, or perhaps providing a completely different look and feel for different audiences. A standard interface may be used with many different sites---providing a consistent mode of access to a lot of different content.
+ 192
+ 193 Collections live in the collect directory of a site. Any collections that are found in this directory when the servlet is initialised will be loaded up and presented to the user. Collections require valid configuration files, but apart from this, nothing needs to be done to the site to use new collections. Collections added while Tomcat is running will not be noticed automatically. Either the server needs to be restarted, or a configuration request may be sent to the library, triggering a (re)load of the collection (this is described in Section~\ref{sec:runtime-config}).
  193 194
  194 195 There are two Greenstone sites that come with the distribution: localsite, and soapsite. localsite has several demo collections, while soapsite has none. soapsite specifies that a soap connection should be made to localsite. Getting this to work involves setting up a soap server for localsite: see Section~\ref{sec:distributed} for details.
… …
  224 225
  225 226 Initial Greenstone3 system configuration is determined by a set of configuration files, all expressed in XML. Each site has a configuration file that binds parameters for the site, \gst{siteConfig.xml}. Each interface has a configuration file, \gst{interfaceConfig.xml}, that specifies Actions for the interface. Collections also have several configuration files; these are discussed in Section~\ref{sec:collconfig}.
- 226 The configuration files are read in when the system is initialised, and their contents are cached in memory. This means that changes made to these files once the system is running will not take immediate effect. Tomcat needs to be restarted for changes to the interface configuration file to take effect. However, changes to the site configuration file can be incorporated sending a CGI-type command to the library. There are a series of CGI-type commands that can be sent to the library to induce reconfiguration of different modules, including reloading the whole site. This removes the need to shutdown and restart the system to reflect these changes. These commands are described in Section~\ref{sec:runtime-config}.
+ 227 The configuration files are read in when the system is initialised, and their contents are cached in memory. This means that changes made to these files once the system is running will not take immediate effect. Tomcat needs to be restarted for changes to the interface configuration file to take effect. However, changes to the site configuration file can be incorporated sending a CGI-type command to the library. There are a series of commands that can be sent to the library to induce reconfiguration of different modules, including reloading the whole site. This removes the need to shutdown and restart the system to reflect these changes. These commands are described in Section~\ref{sec:runtime-config}.
  227 228
  228 229 \subsubsection{Site configuration file}\label{sec:siteconfig}
… …
  280 281 \subsubsection{Interface configuration file}\label{sec:interfaceconfig}
  281 282
- 282 The interface configuration file \gst{interfaceConfig.xml} lists all the actions that the interface knows about at the start (but other ones can be loaded dynamically). If the interface uses servlets, it specifies what short name each action should use for the action CGI parameter e.g. QueryAction should use a=q. If the interface uses XSLT, it specifies what XSLT file should be used for each action and subaction.
+ 283 The interface configuration file \gst{interfaceConfig.xml} lists all the actions that the interface knows about at the start (other ones can be loaded dynamically). It specifies what short name each action maps to (this is used in library urls for the a (action) parameter) e.g. QueryAction should use a=q.
If the interface uses XSLT, it specifies what XSLT file should be used for each action and possibly each subaction. This makes it easy for developers to implement and use different actions and/or XSLT files without recompilation. The server must be restarted, however.
+ 284
+ 285 It also lists all the languages that the interface text files have been translated into. These have a name attribute, which is the ISO code for the language, and a displayElement which gives the language name in that language (note the non-English characters have been specified in UTF-8 codes). This language list is used on the Preferences page to allow the user to change the interface language. Details on how to add a new language to a Greenstone library are shown in Section~\ref{sec:interface-customise}.
  283 286
  284 287 \begin{figure}
… …
  303 306 <action name='s' class='SystemAction' xslt='system.xsl'/>
  304 307 </actionList>
+ 308 <languageList>
+ 309 <language name="en">
+ 310 <displayItem name='name'>English</displayItem>
+ 311 </language>
+ 312 <language name="fr">
+ 313 <displayItem name='name'>Français</displayItem>
+ 314 </language>
+ 315 <language name='es'>
+ 316 <displayItem name='name'>Español</displayItem>
+ 317 </language>
+ 318 </languageList>
  305 319 </interfaceConfig>
  306 320 \end{verbatim}\end{gsc}
… …
  309 323 \end{figure}
  310 324
- 311 This makes it easy for developers to implement and use different actions and/or XSLT files without recompilation. The server must be restarted, however.
  312 325
  313 326 \subsection{Run-time re-initialisation}\label{sec:runtime-config}
… …
  315 328 should this section go in here, cos its kind of adminy, or go into the user stuff, cos you need to do it after building a collection???
  316 329
- 317 When tomcat is started up, the site and interface configuration files are read in, and actions/services/collections loaded as necessary.
The configuration is then static unless tomcat is restarted, or re-configuration commands issued.318 319 There are several CGI-type commands that can be issued to tomcat to avoid having to restart the server. These can reload the entire site, or just individual collections. Unfortunately at present there are no commands to reconfigure the interface, so if the interface configuration file has changed, tomcat must be restarted for those changes to take effect. Similarly, if the java classes are modified, tomcat must be restarted then too.330 When Tomcat is started up, the site and interface configuration files are read in, and actions/services/collections loaded as necessary. The configuration is then static unless Tomcat is restarted, or re-configuration commands issued. 331 332 There are several CGI-type commands that can be issued to Tomcat to avoid having to restart the server. These can reload the entire site, or just individual collections. Unfortunately at present there are no commands to reconfigure the interface, so if the interface configuration file has changed, Tomcat must be restarted for those changes to take effect. Similarly, if the java classes are modified, Tomcat must be restarted then too. 320 333 321 334 Currently, the runtime configuration commands can only be accessed by typing in CGI-arguments into the URL, there is no nice web form yet to do this. 322 335 323 The CGI arguments are entered after the \gst{library?} part of the URL. There are three types of commands: configure, activate, deactivate\footnote{There is no security for these commands yet in Greenstone, so the deactivate/delete command is disabled}. These are specified by \gst{a=s\&sa=c}, \gst{a=s\&sa=a}, and \gst{a=s\&sa=d}, respectively (\gst{a} is action, \gst{sa} is subaction). By default, the requests are sent to the MessageRouter, but they can be sent to a collection/cluster by the addition of \gst{sc=xxx}, where \gst{xxx} is the name of the collection or cluster. 
Table~\ref{tab:run-time config} describes the arguments in a bit more detail.336 The CGI arguments are entered after the \gst{library?} part of the URL. There are three types of commands: configure, activate, deactivate\footnote{There is no security for these commands yet in Greenstone, so the deactivate/delete command is disabled}. These are specified by \gst{a=s\&sa=c}, \gst{a=s\&sa=a}, and \gst{a=s\&sa=d}, respectively (\gst{a} is action, \gst{sa} is subaction). By default, the requests are sent to the MessageRouter, but they can be sent to a collection/cluster by the addition of \gst{sc=xxx}, where \gst{xxx} is the name of the collection or cluster. Table~\ref{tab:run-time config} describes the commands and arguments in a bit more detail. 324 337 325 338 \begin{table} … … 329 342 \begin{tabular}{lp{8cm}} 330 343 \hline 331 \gst{a=s\&sa=c} & reconfigures the whole site , reads in siteConfig.xml, reloads all the collections. Just part of this can be specified with another argument \gst{ss} (system subset). The valid values are \gst{collectionList}, \gst{siteList}, \gst{serviceList}, \gst{clusterList}. \\344 \gst{a=s\&sa=c} & reconfigures the whole site. Reads in siteConfig.xml, reloads all the collections. Just part of this can be specified with another argument \gst{ss} (system subset). The valid values are \gst{collectionList}, \gst{siteList}, \gst{serviceList}, \gst{clusterList}. \\ 332 345 \gst{a=s\&sa=c\&sc=XXX} & reconfigures the XXX collection or cluster. \gst{ss} can also be used here, valid values are \gst{metadataList} and \gst{serviceList}. \\ 333 346 \gst{a=s\&sa=a} & (re)activate a specific module. Modules are specified using two arguments, \gst{st} (system module type) and \gst{sn} (system module name). 
Valid types are \gst{collection}, \gst{cluster} \gst{site}.\\ … … 344 357 \subsection{Using a collection}\label{sec:usecolls} 345 358 346 A collection typically consists of a set of documents, which could be text, html, word, PDF, images, bibliographic records etc, along with some access methods, or services. Typical access methods include searching or browsing for document identifiers, and retrieval of content or metadata for those identifiers. 347 Searching involves entering words or phrases and getting back lists of documents that contain those words. The search terms may be restricted to particular fields of the document. Browsing ... 359 A collection typically consists of a set of documents, which could be text, html, word, PDF, images, bibliographic records etc, along with some access methods, or ``services''. Typical access methods include searching or browsing for document identifiers, and retrieval of content or metadata for those identifiers. 360 Searching involves entering words or phrases and getting back lists of documents that contain those words. The search terms may be restricted to particular fields of the document. 361 362 Browsing involves navigating pre-defined hierarchies of documents, following links of interest to find documents. The hierarchies may be constructed on different metadata fields, for example, alphabetical lists of Titles, or a hierarchy of Subject classifications. Clicking on a bookshelf icon takes you to a lower level in the hierarchy, while clicking on a book or page icon takes you to a document. 348 363 349 364 In the standard interface that comes with Greenstone3\footnote{of course, this is all customisable}, collections in a digital library are presented in the following manner. The 'home' page of the library shows a list of all the public collections in that library. Clicking on a collection link takes you to the home page for the collection, which we call the 'about' page. 
The standard page banner looks something like that shown in Figure~\ref{fig:page-banner}. … … 356 371 \end{figure} 357 372 358 The image at the top left is a link to the collection's about page. The top right has buttons to link to the library home page, help pages and preference pages. All the available services are arrayed along a navigation bar, along the bottom of the banner. Click on a name to access that service. 359 Once you are looking at a document, clicking the open book icon at the top of the document, underneath the navigation bar, will take you back to the search or browse page where you accessed the document from. 373 The image at the top left is a link to the collection's home page. The top right has buttons to link to the library home page, help pages and preference pages. All the available services are arrayed along a navigation bar, along the bottom of the banner. Clicking on a name accesses that service. Search type services generally provide a form to fill in, with parameters including what field or granularity to index, and the query itself. Clicking the 374 The results of a search 375 Once you are looking at a document, clicking the open book icon at the top of the document, underneath the navigation bar, will take you back to the service page that you accessed the document from. 360 376 361 377 describe the colls that the sample installation comes with\\ … … 371 387 Collections live in the collect directory of a site. As described in Section~\ref{sec:sites-and-ints}, there can be several sites per greenstone installation. The collect directory is at \$GSDL3HOME/web/sites/site-name/collect, where site-name is the name of the site you want your new collection to belong to. 372 388 373 The following two sections describe how to create a collection from scratch, and how to import a greenstone 2 collection. Once a collection has been built, the library server needs to be notified that there is a new collection. 
This can be accomplished in two ways\footnote{eventually there will also probably be automatic polling for new collections}. If you are the library administrator, you can restart tomcat. The library servlet will then be created afresh, and will discover the new collection when it scans the collect directory for the collection list. Alternatively, there is a CGI command to reload a collection which can also load a new one. Use the CGI arguments \gst{a=s\&sa=a\&st=collection\&sn=collname}---this tells the library program to reload the collname collection.389 The following two sections describe how to create a collection from scratch, and how to import a greenstone 2 collection. Once a collection has been built, the library server needs to be notified that there is a new collection. This can be accomplished in two ways\footnote{eventually there will also probably be automatic polling for new collections}. If you are the library administrator, you can restart Tomcat. The library servlet will then be created afresh, and will discover the new collection when it scans the collect directory for the collection list. Alternatively, there is a CGI command to reload a collection which can also load a new one. Use the CGI arguments \gst{a=s\&sa=a\&st=collection\&sn=collname}---this tells the library program to reload the collname collection. 374 390 375 391 376 392 \subsubsection{Creating a collection from scratch} 377 393 378 Building Greenstone 3 collections is done using the \gst{gs3 build} script, whilst the files that control how the building is done are found inside the \gst{etc} subdirectory of \gst{gsdl3/web/sites/localsite/collect/[collectionname]}. There are a number of considerations in building a collection: including what documents appear in the collection, how they are indexed for searching, which classifications are used for browsing, etc. 
All these aspects are controlled by files within the collection's directory.394 Building Greenstone 3 collections is done using the \gst{gs3-build.sh} script, with the \gst{collectionConfig.xml} file controlling how the building is done. There are a number of considerations in building a collection: including what documents appear in the collection, how they are indexed for searching, which classifications are used for browsing, etc. 379 395 380 396 Firstly, the documents that comprise the collection should be placed in the import subdirectory. At present, only documents in this directory will appear in the collection. 381 382 The basic means of finding documents in Greenstone is search. The etc/collectionConfig.xml file controls which indexes are created to support search. By default, a collection will simply index the text of each document in the collection using the MG search engine. Alternative choices include selecting other search engines, indexing individual fields of documents (e.g. the document title) and indexing documents by section. 383 384 Search indexes appear as individual \gst{<index>} elements within the \gst{<search>}element of the \gst{collectionConfig.xml} file, and classifications as individual \gst{<classifier>} elements within the \gst{<browse>} element. In each case, some choices are made using attributes of the element itself, and some through child elements. 385 386 Indexes can alter which search engine to use for that index, the level at which the index should be built (e.g. document, section or paragraph) and the text over which it should be built (e.g. the document text, titles alone, author names, etc.). Section-level indexes allow a reader to recall part of a document (for instance, a chapter) rather than the entire document. However, Greenstone 3 must be able to identify the internal structure of the document to achieve this. The degree to which structure can be found varies from file format to file format. 
387 388 Each index also must have a unique name, which is used to identify it within Greenstone The name is given as an attribute of the \gst{<index>} element. The ``type'' indicates which search engine to use for the index. This attribute can contain either 'mg' or 'mgpp'. If the ``type'' attribute is not given, the default indexer is mg. 389 390 The other choices are described using child elements of \gst{<index>}. The \gst{<level>} tag indicates the index level and the \gst{<field>} tag the text to be used. The \gst{<level>} tag can contain one of document, section or paragraph, while the \gst{<field>} tag can contain ``text'' or the name of a metadata field. If the \gst{<level>} tag is omitted, the default setting is to index by document, and if the \gst{<field>} tag is omitted, the default setting is to index the document text. 391 392 Example index tags include: 393 394 To index only the title of each separate document in the collection: 395 \begin{gsc}\begin{verbatim} 396 <index name="dtt"> 397 <level>document</level> 398 <field>dc:title</field> 399 <displayItem name='name' lang="en">entire documents</displayItem> 400 <displayItem name='name' lang="fr">documents entiers</displayItem> 401 <displayItem name='name' lang="es">documentos enteros</displayItem> 402 </index> 403 \end{verbatim}\end{gsc} 404 ...in this case the \gst{<field>} tag refers to the ``title'' metadata item, found in the Dublin Core namespace. The mg search engine would be used on this index. 405 406 Alternatively, to index the full document texts by section: 407 \begin{gsc}\begin{verbatim} 408 <index name="stx" type=''mgpp''> 409 <level>section</level> 410 <displayItem name='name' lang="en">entire documents</displayItem> 411 <displayItem name='name' lang="fr">documents entiers</displayItem> 412 <displayItem name='name' lang="es">documentos enteros</displayItem> 413 </index> 414 \end{verbatim}\end{gsc} 415 ...or... 
416 \begin{gsc}\begin{verbatim} 417 <index name="stx" type=''mg''> 418 <level>section</level> 419 <field>text</field> 420 <displayItem name='name' lang="en">entire documents</displayItem> 421 <displayItem name='name' lang="fr">documents entiers</displayItem> 422 <displayItem name='name' lang="es">documentos enteros</displayItem> 423 </index> 424 \end{verbatim}\end{gsc} 425 ...in the first example, the \gst{<field>} tag is not explicitly defined, and would default to 'text', whereas it is explicitly set to 'text' in the second example. Note the different indexer selected for these two indexes. As they are of the same name, they should not appear in the same \gst{collectionConfig.xml} file. 426 427 Moving onto \gst{<classifier>} items, the format is broadly similar to \gst{<index>} items, but with a couple of different choices. Firstly, each classifier should have a ``name'' and ``type'' attribute as with \gst{<index>} tags. In the case of \gst{<classifier>} items the ``type'' attribute identifies the type of classifier it is. At present, this should either be ``Hierarchy'' or ``AZList''. 428 429 The remaining choices for the classifier should follow as child elements of the \gst{<classifier>} element. The \gst{<file>} element should contain the name of the file that describes the classifier as its ``URL'' attribute. The format of this file will be described later - it will vary from classifier type to classifier type. The \gst{<field>} element identifies the name of the field to index. More than one \gst{<field>} element may appear if two or more metadata fields are to be used with the classifier. Finally, the \gst{<sort>} item identifies another metadata field which the items within one classifier node are to be ordered. Unlike the \gst{<index>} element, the \gst{<classifier>} element does not have default, assumed values for its children. 397 [TODO: describe the kinds of documents that can be added, something about METS files?] 
430 398 431 399 Metadata for documents can be added using metadata.xml files. These files have already been used in Greenstone 2, and the format is the same in Greenstone 3. A metadata.xml file has a root element of \gst{<DirectoryMetadata>}. This encloses a series of \gst{<FileSet>} items. Neither of these tags has any attributes. Each \gst{<FileSet>} item includes two parts: firstly, one or more \gst{<FileName>} tags, each of which encloses a regular expression to identify the files which are to be assigned the metadata. Only files in the same directory as the metadata.xml, or in one of its child directories, file will be selected. The filename tag encloses the regular expression as text, eg: … … 459 427 Here, only one file pattern is found in the file set. However, the \gst{Description} tag contains a number of separate metadata items. Note that the \gst{Title} metadata does not have the accumulate metadata. This means that when the title is assigned to a document, its existing \gst{Title} information will be lost. 460 428 461 Whereever possible, the Greenstone 3 will import and use options from a Greenstone 2 \gst{collect.cfg} file. However, it is strongly recommended that a proper \gst{collectionConfig.xml} file is used wherever possible. 462 463 To build a collection execute \gst{gs3build.sh -collect collectionname}. The process will run, placing the new indexes in the \gst{building} subdirectory of the collection's directory. 464 465 The building directory should be renamed to index, and a buildConfig.xml file added to it. See Section~\ref{sec:buildconfig} and look at the other collections' buildConfig files for examples. 466 467 [TODO: need to describe namespaces somewhere? need to generate the buildConfig file automatically.] 429 The basic means of finding documents in Greenstone is search. Options for building the search indexes include which indexer to use, what granularity to use for the indexes (e.g. 
whether to index documents as a whole, or sections of documents), what content the index should have (the whole text of the document or one or many metadata fields). 430 431 Indexes can alter which search engine to use for that index, the level at which the index should be built (e.g. document, section or paragraph) and the text over which it should be built (e.g. the document text, titles alone, author names, etc.). Section-level indexes allow a reader to recall part of a document (for instance, a chapter) rather than the entire document. However, Greenstone 3 must be able to identify the internal structure of the document to achieve this. The degree to which structure can be found varies from file format to file format. 432 433 The collectionConfig.xml file controls the all of these options for collection building, and the format is described in Section~\ref{sec:collconfig}. 434 435 Wherever possible, the Greenstone 3 will import and use options from a Greenstone 2 \gst{collect.cfg} file. However, it is strongly recommended that a proper \gst{collectionConfig.xml} file is used wherever possible. 436 437 To build a collection, execute \gst{gs3build.sh sitename collectionname}. The process will run, placing the new indexes in the \gst{building} subdirectory of the collection's directory. You must have mysql running before you start building---running \gst{gs3-launch.sh} will start up the mysql server as well as tomcat. 438 439 Once the build process is complete, the building directory should be renamed to index (after deleting the existing index directory, if any), and Tomcat prompted to reload the collection---either by restarting the server, or by sending an activate collection command to the library servlet. 440 441 [TODO: need to describe namespaces somewhere? 
] 468 442 469 443 \subsubsection{Importing a greenstone 2 collection} … … 472 446 The Greenstone 3 run time system requires different configuration files for a collection, so you need to run a conversion script. All this does is create the new collectionConfig.xml and buildConfig.xml from the old collect.cfg and build.cfg files. It does not change the collection in any way, so it can still be used by Greenstone 2 software. 473 447 474 The conversion script is \gst{convert\_coll\_from\_gs2.pl}. To run it, you need to specify the path to the collect directory, and the collection name. For example,448 The conversion script is \gst{convert\_coll\_from\_gs2.pl}. To run it, make sure you have sourced setup.bash (or run setup in Windows) in your top-level gsdl directory of the greenstone 2 installation. Then you need to specify the path to the collect directory, and the collection name as parameters to the conversion script. For example, 475 449 476 450 \gst{convert\_coll\_from\_gs2.pl -collectdir \$GSDL3HOME/web/\-sites/\-localsite/\-collect demo} 477 451 478 The script attempts to create gs3 format statements from the old greenstone 2 ones. The conversion may not always work properly, so if the collection looks a bit strange under greenstone 3, you should check the format statements. Format statements are described in Section~\ref{sec:formatstmt}.479 480 Once again, to have the collection recognised by the library servlet, you can either restart tomcat, or load it manually by sending the arguments \gst{a=s\&sa=c\&c=collname} to the library servlet.452 The script attempts to create gs3 format statements from the old greenstone 2 ones. The conversion may not always work properly, so if the collection looks a bit strange under Greenstone 3, you should check the format statements. Format statements are described in Section~\ref{sec:formatstmt}. 
453 454 Once again, to have the collection recognised by the library servlet, you can either restart Tomcat, or load it dynamically. 481 455 482 456 \subsection{Collection configuration files}\label{sec:collconfig} 483 457 484 458 Each collection has two, or possibly three, configuration files, \gst{collectionConfig.xml} and \gst{buildConfig.xml}, and optionally \gst{collectionInit.xml} that give metadata, display and other information for the 485 collection.\footnote{\gst{ siteConfig.xml} and \gst{interfaceConfig.xml} is new for Greenstone3, while \gst{collectionConfig.xml} and \gst{buildConfig.xml} replace \gst{collect.cfg} and \gst{build.cfg} in459 collection.\footnote{\gst{collectionConfig.xml} and \gst{buildConfig.xml} replace \gst{collect.cfg} and \gst{build.cfg} in 486 460 Greenstone2.} The first includes user-defined presentation metadata for the collection, 487 461 such as its name and the {\em About this collection} text; gives formatting information for the collection display; and also gives … … 502 476 \subsubsection{collectionConfig.xml} 503 477 504 The collection configuration file is where the collection designer (e.g. a librarian) decides what form the collection should take. This includes the collection metadata such as title and description, and also includes what indexes and browsing structures should be built. The format of \gst{collectionConfig.xml} is still under consideration. However, Figure~\ref{fig:collconfig} shows the parts of it that have been defined so far. (Since collection building at this stage is still done using Greenstone2 Perl scripts and the old \gst{collect.cfg} file, we have only defined the format for the parts of \gst{collectionConfig.xml} that are used by the runtime-system.)478 The collection configuration file is where the collection designer (e.g. a librarian) decides what form the collection should take. 
This includes the collection metadata such as title and description, and also includes what indexes and browsing structures should be built. The format of \gst{collectionConfig.xml} is still under consideration. However, Figure~\ref{fig:collconfig} shows the parts of it that have been defined so far. 505 479 506 480 Display elements for a collection or metadata for a document can be entered in any language---use lang='en' attributes to metadata elements to specify which language they are in. … … 528 502 <displayItem name="icon" lang="en">mgppdemo.gif</displayItem> 529 503 </displayItemList> 530 <search > 504 <search type='mgpp'> 531 505 <index name="idx"/> 532 506 <format> … … 557 531 \label{fig:collconfig} 558 532 \end{figure} 559 560 The \gst{<metadataList>} element specifies some collection metadata, such as creator. The \gst{<displayItemList>} specifies some language dependent information that is used for collection display, such as collection name and short description. These displayItem elements can be specified in different languages. If languages other than English are used, the configuration file should be encoded in utf-8. 533 [TODO: add in building instructions for the config file] 534 535 The \gst{<metadataList>} element specifies some collection metadata, such as creator. The \gst{<displayItemList>} specifies some language dependent information that is used for collection display, such as collection name and short description. These displayItem elements can be specified in different languages. If languages other than English are used, the configuration file should be encoded in UTF-8. 536 537 The \gst{<search>} element specifies what indexes should be built, and provides some display and formatting information for each one. Search has an attribute, type, which specifies the indexer to be used for indexing. Currently, mg and mgpp are available. If type is not specified, mg is used. 
Multiple search elements may be specified, if more than one indexer is to be used. 538 539 Search indexes appear as individual \gst{<index>} elements within the \gst{<search>} element. Some choices for the index are made using attributes of the element itself, and some through child elements. 540 541 Each index must have a unique name, which is used to identify it within Greenstone. The name is given as an attribute of the \gst{<index>} element. 542 543 The other choices are described using child elements of \gst{<index>}. The \gst{<level>} tag indicates the index level and the \gst{<field>} tag the text to be used. The \gst{<level>} tag can contain one of document, section or paragraph, while the \gst{<field>} tag can contain ``text'' or the name of a metadata field. If the \gst{<level>} tag is omitted, the default setting is to index by document, and if the \gst{<field>} tag is omitted, the default setting is to index the document text. 544 545 Example index specifications include: 546 547 To index only the title of each separate document in the collection: 548 \begin{gsc}\begin{verbatim} 549 <index name="dtt"> 550 <level>document</level> 551 <field>dc:title</field> 552 <displayItem name='name' lang="en">entire documents</displayItem> 553 <displayItem name='name' lang="fr">documents entiers</displayItem> 554 <displayItem name='name' lang="es">documentos enteros</displayItem> 555 </index> 556 \end{verbatim}\end{gsc} 557 ...in this case the \gst{<field>} tag refers to the ``title'' metadata item, found in the Dublin Core namespace. As no indexer type is given, the default mg search engine would be used on this index. 
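The defaulting rules for \gst{<level>} and \gst{<field>} described above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in the text, not the code Greenstone itself uses; the function name \gst{describe\_index} and the two constants are made up for this example.

```python
import xml.etree.ElementTree as ET

# Defaults as stated in the text: index by document, over the document text.
DEFAULT_LEVEL = "document"
DEFAULT_FIELD = "text"


def describe_index(index_xml):
    """Return (name, level, field) for one <index> element, filling in defaults.

    Hypothetical helper for illustration only; it is not part of Greenstone.
    """
    index = ET.fromstring(index_xml)
    name = index.get("name")
    if not name:
        raise ValueError("every <index> needs a name attribute")
    # findtext() returns the supplied default when the child element is absent.
    level = index.findtext("level", default=DEFAULT_LEVEL).strip()
    field = index.findtext("field", default=DEFAULT_FIELD).strip()
    return name, level, field


if __name__ == "__main__":
    # <field> given explicitly, as in the "dtt" example.
    print(describe_index(
        '<index name="dtt"><level>document</level><field>dc:title</field></index>'))
    # <field> omitted, so the document text is assumed.
    print(describe_index('<index name="stx"><level>section</level></index>'))
```

An \gst{<index>} element with neither child would come back as a document-level index over the document text.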
558 559 Alternatively, to index the full document texts by section: 560 \begin{gsc}\begin{verbatim} 561 <index name="stx" type="mgpp"> 562 <level>section</level> 563 <displayItem name='name' lang="en">entire documents</displayItem> 564 <displayItem name='name' lang="fr">documents entiers</displayItem> 565 <displayItem name='name' lang="es">documentos enteros</displayItem> 566 </index> 567 \end{verbatim}\end{gsc} 568 ...or... 569 \begin{gsc}\begin{verbatim} 570 <index name="stx" type="mg"> 571 <level>section</level> 572 <field>text</field> 573 <displayItem name='name' lang="en">entire documents</displayItem> 574 <displayItem name='name' lang="fr">documents entiers</displayItem> 575 <displayItem name='name' lang="es">documentos enteros</displayItem> 576 </index> 577 \end{verbatim}\end{gsc} 578 ...in the first example, the \gst{<field>} tag is not explicitly defined, and would default to 'text', whereas it is explicitly set to 'text' in the second example. Note the different indexer selected for these two indexes. As they have the same name, they should not appear in the same \gst{collectionConfig.xml} file. 579 561 580 The \gst{<search>} and \gst{<browse>} elements give some formatting information about the indexes and classifiers. \gst{<displayItem>} elements are used to provide titles for the indexes or classifiers, while \gst{<format>} elements provide formatting instructions, typically for a document or classifier node in a list of results. 562 581 582 of the \gst{collectionConfig.xml} file, and classifications as individual \gst{<classifier>} elements within the \gst{<browse>} element. In each case, some choices are made using attributes of the element itself, and some through child elements. 583 Moving on to \gst{<classifier>} items, the format is broadly similar to \gst{<index>} items, but with a couple of different choices. Firstly, each classifier should have a ``name'' and ``type'' attribute as with \gst{<index>} tags. 
In the case of \gst{<classifier>} items the ``type'' attribute identifies the type of classifier it is. At present, this should either be ``Hierarchy'' or ``AZList''. 584 585 The remaining choices for the classifier should follow as child elements of the \gst{<classifier>} element. The \gst{<file>} element should contain the name of the file that describes the classifier as its ``URL'' attribute. The format of this file will be described later - it will vary from classifier type to classifier type. The \gst{<field>} element identifies the name of the field to index. More than one \gst{<field>} element may appear if two or more metadata fields are to be used with the classifier. Finally, the \gst{<sort>} item identifies another metadata field by which the items within one classifier node are to be ordered. Unlike the \gst{<index>} element, the \gst{<classifier>} element does not have default, assumed values for its children. 586 587 Inside the \gst{<search>} and \gst{<browse>} elements, \gst{<displayItem>} elements are used to provide titles for the indexes or classifiers, while \gst{<format>} elements provide formatting instructions, typically for a document or classifier node in a list of results. Placing the \gst{<format>} instructions at the top level in the search or browse element will apply the format to all the indexes or classifiers, while placing it inside an individual index or classifier element will restrict that formatting instruction to that item. 588 563 589 The \gst{<display>} element contains optional formatting information for the display of documents. Templates that can be specified here include \gst{documentHeading} and \gst{DocumentContent}. Other information that could be specified (in a yet to be decided format) includes whether or not to display the cover image, table of contents etc. 564 590 591 Format elements are described in more detail in Section~\ref{sec:formatstmt}. 
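The way a \gst{<format>} element placed at the top level of \gst{<search>} or \gst{<browse>} is shared by all items, while one placed inside an individual index or classifier applies only to that item, can be pictured with the following Python sketch. It assumes that an item's own \gst{<format>} takes precedence over the shared one; the function \gst{effective\_format} and the classifier names CL1 and CL2 are invented for illustration and are not part of Greenstone.

```python
import xml.etree.ElementTree as ET


def effective_format(container, item_name):
    """Return the <format> element that applies to the named index or classifier.

    Hypothetical helper: an item's own <format> is assumed to win over the
    one placed at the top level of the enclosing <search> or <browse>.
    """
    for item in container:
        if item.tag in ("index", "classifier") and item.get("name") == item_name:
            own = item.find("format")
            if own is not None:
                return own
            break
    else:
        # The loop finished without finding the named item.
        raise KeyError(item_name)
    # find() with a plain tag name only looks at direct children,
    # so this is the shared, top-level <format> (or None if there is none).
    return container.find("format")


if __name__ == "__main__":
    browse = ET.fromstring("""
    <browse>
      <classifier name="CL1" type="AZList">
        <format>own</format>
      </classifier>
      <classifier name="CL2" type="Hierarchy"/>
      <format>shared</format>
    </browse>
    """)
    print(effective_format(browse, "CL1").text)
    print(effective_format(browse, "CL2").text)
```

Here CL1 carries its own formatting, while CL2 falls back to the formatting given for the whole \gst{<browse>} element.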
592 565 593 \subsection{buildConfig.xml}\label{sec:buildconfig} 566 594 567 The file \gst{buildConfig.xml} is produced by the collection building process, and contains metadata and other information about the collection that can 595 The file \gst{buildConfig.xml} is produced by the collection building process. Generally it is not necessary to look at this file, but it can be useful in determining what went wrong if the collection doesn't appear quite the way it was planned. 596 597 It contains metadata and other information about the collection that can 568 598 be determined automatically, such as the number of 569 599 documents it contains. It also includes a list of ServiceRack classes that are … … 790 820 </classifier> 791 821 <classifier>...</classifier> 822 <format><!-- formatting for all the classifiers. these will 823 be overridden by any classifier specific formatting 824 instructions --></format> 792 825 </browse> 793 826 <display> … … 807 840 The user specifies a \gst{<gsf:template>} for what they want to format---these can match \gst{documentNode} or \gst{classifierNode} (for a node in a classification hierarchy). 808 841 809 The template above is now represented as: 842 The template at the start of this section is now represented as: 810 843 811 844 \begin{gsc}\begin{verbatim} … … 843 876 changing the look and feel for an interface vs a site vs a collection\\ 844 877 845 what needs a tomcat restart? 878 what needs a Tomcat restart? 846 879 847 880 \subsubsection{Changing the interface language} … … 849 882 The interface language can be changed by going to the preferences page, and choosing a language from the list. The list lists (:-)) all languages in which the interface has been defined so far. 850 883 851 It is easy to add a new interface language to greenstone. Language specific text strings are separated out from the rest of the system to allow for easy incorporation of new languages. 
These text strings are contained in Java resource bundle properties files. These are plain text files consisting of key-value pairs, located in resources/java. Each interface has one named interface\_name.properties (where nameis the interface name). Each service class has one with the same name as the class (e.g. GS2Search.properties). To add another language all of the base .properties files must be translated. The translated files keep the same names, but with a language extension added. For example, a French version of interface\_default.properties would be named interface\_default\_fr.properties.884 It is easy to add a new interface language to greenstone. Language specific text strings are separated out from the rest of the system to allow for easy incorporation of new languages. These text strings are contained in Java resource bundle properties files. These are plain text files consisting of key-value pairs, located in resources/java. Each interface has one named interface\_name.properties (where `name' is the interface name). Each service class has one with the same name as the class (e.g. GS2Search.properties). To add another language all of the base .properties files must be translated. The translated files keep the same names, but with a language extension added. For example, a French version of interface\_default.properties would be named interface\_default\_fr.properties. 852 885 853 886 Keys will be looked up in the properties file closest to the specified language. For example, if language fr\_CA was specified (french language, country Canada), and the default locale was en\_GB, java would look at properties files in the following order, until it found the key: XXX\_fr\_CA.properties, XXX\_fr.properties, XXX\_en\_GB.properties, then XXX\_en.properties, and finally the default XXX.properties. 854 887 855 You can tell Greenstone about a new language by ... 
currently in interfaceConfig.888 You can tell Greenstone about a new language by adding it in to the languageList in the interfaceConfig.xml file. This will add it in to the list of languages on the preferences page. Modification of this file requires a restart of the Tomcat server for the changes to be recognised. 856 889 857 890 … … 868 901 A new interface needs a directory in \$GSDL3HOME/web/interfaces, the name of this directory becomes the interface name. Inside, it needs images and transform directories, and an interfaceConfig.xml file. Any XSLT may be overridden for a new interface by putting the replacement in the new transform directory. If the appropriate XSLT file is not there, the one from the default interface will be used - this enables just overriding a few XSLT files as needed. 869 902 870 To use a new interface, the tomcat web.xml must be edited: either change the interface that a current version of the servlet is using, or add another servlet instantiation to the file (see Section~\ref{sec:sites-and-ints} or Appendix~\ref{app:tomcat}). The Tomcat server must be restarted for this to take effect.903 To use a new interface, the Tomcat web.xml must be edited: either change the interface that a current version of the servlet is using, or add another servlet instantiation to the file (see Section~\ref{sec:sites-and-ints} or Appendix~\ref{app:tomcat}). The Tomcat server must be restarted for this to take effect. 871 904 872 905 \newpage … … 2136 2169 \end{gsc}\end{quote} 2137 2170 2138 We have set up tomcat to disallow directory listings for everything in the docBase directory. To turn this back on, you need to edit Tomcat's default web.xml file (\$GSDL3HOME/comms/jakarta/tomcat/conf/web.xml):2171 We have set up Tomcat to disallow directory listings for everything in the docBase directory. 
To turn this back on, you need to edit Tomcat's default web.xml file (\$GSDL3HOME/comms/jakarta/tomcat/conf/web.xml): 2139 2172 2140 2173 In the default servlet definition, change the 'listings' parameter to true. … … 2142 2175 Tomcat uses a Manager to handle HTTP session information. This may be stored between restarts if possible. To use a persistent session handling manager, uncomment the \gst{<Manager>} element in \gst{\$GSDL3HOME/comms/jakarta/tomcat/conf/server.xml}. For the default manager, session information is stored in the work directory: \gst{\$GSDL3HOME/comms/jakarta/tomcat/work/Standalone/localhost/gsdl3/SESSIONS.ser}. Delete this file to clear the cached session info. 2143 2176 2144 \subsection{Proxying tomcat with apache}2145 2146 Instead of incorporating servlet support into your existing web server, an easy alternative is to proxy tomcat. The \gst{http://www.greenstone.org/greenstone3} site uses apache to proxy Tomcat. ProxyPass and ProxyPassReverse directives need to be added to the Virtualhost description for the www.greenstone.org server.2177 \subsection{Proxying Tomcat with apache} 2178 2179 Instead of incorporating servlet support into your existing web server, an easy alternative is to proxy Tomcat. The \gst{http://www.greenstone.org/greenstone3} site uses apache to proxy Tomcat. ProxyPass and ProxyPassReverse directives need to be added to the Virtualhost description for the www.greenstone.org server. 2147 2180 2148 2181 \begin{quote}\begin{gsc} … … 2157 2190 In our example, the greenstone 3 servlet can be accessed at \gst{http://www.greenstone.org/greenstone3/library}, instead of at \gst{http://puka.cs.waikato.ac.nz:8080/gsdl3/library}, which is not publically accessible. 2158 2191 2159 \subsection{Running tomcat behind a proxy}2160 2161 Almost everything works fine when tomcat is running behind a proxy. The only time this causes trouble is if the servlet itself needs to make external http connections. 
We do this in the infomine demo collection for example. One of the service classes sends http requests to the infomine database at riverside. Since this is going through the proxy, a username and password is needed. It is not sufficient to prompt the user for a password because they are unlikely to have a password for the particular proxy that tomcat is using. What we have done at present is to put a proxy element in the siteConfig.xml file. Here you have to enter a suitable username and password for the proxy server. Unfortunately these are entered in plain text. And the file is viewable via the servlet. So we need a better solution.2192 \subsection{Running Tomcat behind a proxy} 2193 2194 Almost everything works fine when Tomcat is running behind a proxy. The only time this causes trouble is if the servlet itself needs to make external http connections. We do this in the infomine demo collection for example. One of the service classes sends http requests to the infomine database at riverside. Since this is going through the proxy, a username and password is needed. It is not sufficient to prompt the user for a password because they are unlikely to have a password for the particular proxy that Tomcat is using. What we have done at present is to put a proxy element in the siteConfig.xml file. Here you have to enter a suitable username and password for the proxy server. Unfortunately these are entered in plain text. And the file is viewable via the servlet. So we need a better solution. 2162 2195 2163 2196 \newpage