<Text id="0001">Greenstone tutorial exercises (2012)</Text>

Copyright © 2005-2012 by the New Zealand Digital Library Project at the University of Waikato, New Zealand
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License.”

If you are working from a Greenstone CD-ROM, sample files for these exercises are in the folder sample_files; otherwise they can be downloaded from SourceForge. The text sometimes uses Windows terminology, but the exercises work equally well on other systems if you make appropriate changes to the pathnames.

<Text id="0082">Working with a pre-packaged collection (UNAIDS)</Text>

You will need the Greenstone UNAIDS CD-ROM.

Installing a pre-packaged Greenstone collection

On inserting the UNAIDS CD-ROM, installation begins automatically on many computers. If it does not, "auto-run" (a configurable setting under Windows) is disabled on your computer and you need to double-click Setup.exe on the CD-ROM: My Computer → UNAIDS20 → Setup.exe

The InstallShield Wizard begins to install the UNAIDS pre-packaged collection.
Select the English language and click <OK>.
On the welcome screen, click the <Next> button.
Choose Run from CD-ROM (standard) as the setup type. This is the default and is already selected. Then click <Next>.
Click <Next> again to install the UNAIDS collection in the default folder, which is C:\Program Files\UNAIDS Library 2.0 [CD-ROM].
The Installation Wizard copies the required files from CD-ROM to disk.
Click <OK> (twice) to confirm completion of the UNAIDS collection installation.
InstallShield quits; the UNAIDS Library is now installed.

CD-ROMs like this one that contain pre-packaged Greenstone collections do not include the full Greenstone software. Instead they embody a mini version of Greenstone that allows you to view the collection but not to build new ones.

Browsing around a Greenstone collection Launch the prebuilt library by clicking: Start → All Programs → UNAIDS Library 2.0 [CD-ROM] → UNAIDS Library 2.0 (Standard Version). To access Greenstone through the Local Library Server, it is sometimes necessary to turn off the proxy settings of the browser. Greenstone normally detects this and pops up a window alerting you to the problem. Click <Enter Library> in the dialog box and your browser (typically Internet Explorer by default) will display the Greenstone home page. Within the web browser, click titles a-z (in the centre of the navigation bar near the top of the page). Access the first book in the list of titles by clicking the book icon next to the title: About UNAIDS. Use the scroll bar to view the full length of the page. In the table of contents near the top, click the page icon next to the heading Guiding principles of UNAIDS to view this section. Click the page icon next to the heading Global and local impact to view the next section. This style of interaction can be continued to further expand and contract folders and switch to a different section. To fully expand the contents of this introduction chapter, click Expand Document or Chapter in the upper left portion of the page, under the picture of the document's front cover. You can return to the currently selected page of document titles by clicking the book icon next to the title of the book at the top of the table of contents (this signifies closing the book). You also get to the document titles using titles a-z in the navigation bar, in this case to the titles beginning with A-D. If the table of contents is open at the top level—showing all the chapters—then clicking Expand Document or Chapter expands the full document. For long documents, which take some time to load in, Greenstone seeks confirmation for this action: clicking 'continue' loads the full document. Browse around and peruse some other documents in the collection. 
Searching within a Greenstone collection Access the search page by clicking search in the navigation bar. In the query box under Search for chapters in any language which contain some of the words, enter the term gender then click <Begin Search>. After a short pause, the web browser loads a fresh page showing the results of the search. Click the page icon for the first matching document in the result set (Five Year Implementation Review of the Vienna Declaration and Programme of Action) to view the document. Because the search was at the chapter level, you are taken directly to the matching chapter within the document. Experiment further with searching, and with the interface in general. For example, there is a detailed Help page. It contains a Preferences section through which you can control some search settings. The Preferences options in the UNAIDS collection are intentionally minimalist. Most collections have a separate Preferences button that offers more features. The home page of the UNAIDS library collection cycles through a sequence of front cover images, updated every 5 seconds or so. Clicking a particular image takes you directly to that document. Leaving the Greenstone digital library There are two ways of leaving Greenstone: Exit from the Greenstone Software server. Click on the Greenstone Software in the task bar, then choose Exit from the Browser Selection and Settings menu (or click on the exit hotspot, the red cross at the top right). The Greenstone Software exits, but your web browser continues to run. Exit from your web browser. Leave your web browser in the usual way. The Greenstone server detects when you exit from the browser and generates a popup window that asks whether to close down the server as well. (The reason is that other people may be using Greenstone over the network, and should not be rudely terminated.) Exercise: Use the UNAIDS collection to answer these questions How many publications are there in the collection? 
900
How many documents are there that mention Australia in the title? 15
How many top-level subject categories are there? 21
What does AAVP stand for? African Aids Vaccination Programme
What does AIDS stand for? Acquired Immuno-Deficiency Syndrome (Search for "AIDS stands for")
Considering lower case variants only, how many times does the word "condom" appear in the collection? 6789 How many times for "condoms"? 5243
If case sensitivity does not matter, how many times does the word "condom" appear in the collection? 7905 How many times for "condoms"? 5571
If word endings are ignored, how many times does "condom" and variants such as "condoms" appear in the collection? 13477
How many chapters contain some variations of the word "condom"? 2413 chapters. Does this make it a useful search term?
No, since there are only 900 documents What year saw the first reported case of AIDS in New Zealand? 1983 <Text id="0145">Working with a pre-packaged collection (Digital Libraries in Education)</Text> You will need the Greenstone Digital Libraries in Education CD-ROM Installing a pre-packaged collection Insert your CD-ROM for the course Digital libraries in education into a Windows computer. If the installation process does not start up straightaway (because the AutoPlay feature is disabled on your computer), navigate to your CD-ROM/DVD drive (normally D:), open the folder prebuilt, and double click on Setup.exe. During installation you are offered a choice of folder to install in: we recommend the default, which is C:\GSDL. You are also presented with the option to run Greenstone from the CD-ROM or to copy the entire CD-ROM. We recommend the latter: please check the box that says Install all collection files. It will take at least a couple of minutes to copy the files across. Finally, the installer offers to install the Netscape browser for you. Do not request this except in the unlikely event that you do not already have a web browser on your computer. CD-ROMs like this one that contain pre-packaged Greenstone collections do not include the full Greenstone software. Instead they embody a mini version of Greenstone that allows you to view the collection but not to build new ones. Browsing around a Greenstone collection To run Greenstone, open the Windows Start menu, Programs, and select Greenstone, then the sub-menu item Digital Libraries in Education: then <Enter Library>. Click the Digital libraries in Education collection's icon. This takes you to the collection's home page, often called the "about" page. The home page contains an access bar with buttons called search, contents, authors a-z, modules, and acronyms. This access bar is the key to finding information in any Greenstone collection. Click <authors a-z>. A list of bookshelf icons appears. 
Click the one called Marchionini, G. to see the two course readings by Gary Marchionini. One of these items is a PDF file and the other is an HTML file. Click them both in turn to open up the documents. Click the <contents> button in the access bar. This shows two bookshelves, one for this Study Guide and the other for the Course Readings. Choose one and look at what it contains. Clicking a bookshelf that is open closes it. Close the bookshelf you have just opened and then choose the other one and examine its contents. Click <acronyms> in the access bar and find the meaning of the acronym "LOM". Click <search> and search for the word "LOM". Check out the difference between searching text and searching titles (use the pull-down box on the search page). Click the collection icon Digital Libraries in Education at the top left. This takes you back to the collection's about page. Beneath the access bar on the collection's about page is a search box (just the same as the one that appears on the search page), a description of the collection under the heading About this collection, and instructions on how to find information in this collection. Above the access bar is the collection's icon, saying Digital Libraries in Education. On the right is an icon saying about, above which are three buttons, home, help, and preferences. Click <home>. This returns you to the Greenstone home page. Return to the collection (by clicking its icon), and click <help>. This gives more information about how to access the collection. Click <preferences>. This takes you to a page where you can change some of the settings. Now explore the collection by navigating freely around it. Click liberally: all images that appear on the screen are clickable. If you hold the mouse stationary over an image, most browsers will soon pop up a brief "mouse-over" message that tells you what will happen if you click. Experiment! 
Choose common words like "the" or "and" to search for—that should evoke some response, and nothing will break. (Note: unlike many search systems, Greenstone indexes all words, including these ones.) Exercise: Read the Help page; then answer these questions What does this collection contain? Name five ways to navigate to a target document in this collection. How many documents in the collection are written by Erik Duval? Compare the number of times the words "he" and "she" appear in the collection. How many times does the word "metadata" appear in titles? In the text itself? What's the difference between a some and an all search? What does "MODS" stand for? How do you switch the interface from English to Russian? Does it stay in Russian when you go to the Greenstone home page? Find a search term that yields different results depending on whether you have ignore word endings or whole word must match set on the Preferences page. What's the difference between Graphical and Textual interface format (on the Preferences page)? Exercise: Use the How to build a digital library collection to answer these questions. How many sentences contain the word education? What story from the School Journal collection is featured in the book? How many acronyms used in the book begin with the word Standard? What does tapu mean? How many times does the word library appear? The word libraries? How many times does Library appear with an initial capital letter? How many times does some derivative of the word form appear? Name an English poem that was probably written in about 1000 A.D. Who is Alan Kay? On what page is the first mention of some aspect of Chinese culture? Most of these questions would be rather difficult to answer from the printed book. 
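The exercises above contrast three ways of matching a query term: exact (case-sensitive) matching, matching with case ignored, and matching with word endings ignored. The sketch below is a toy illustration of why those counts differ. It is not Greenstone's indexer; the function names, the sample sentence, and the crude "strip a trailing s" rule are all assumptions made for the example.

```python
import re


def tokenize(text):
    """Split text into alphabetic word tokens, preserving case."""
    return re.findall(r"[A-Za-z]+", text)


def crude_stem(word):
    """Hypothetical stand-in for real stemming: drop one trailing lowercase 's'."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def count_matches(text, term, ignore_case=False, ignore_endings=False):
    """Count the tokens in text that equal term under the chosen settings."""
    def normalise(word):
        if ignore_case:
            word = word.lower()
        if ignore_endings:
            word = crude_stem(word)
        return word

    target = normalise(term)
    return sum(1 for token in tokenize(text) if normalise(token) == target)


# Made-up sample text, not taken from any Greenstone collection.
sample = "Condom use rose. Condoms are free; a condom costs little. CONDOMS help."

print(count_matches(sample, "condom"))                                         # 1
print(count_matches(sample, "condom", ignore_case=True))                       # 2
print(count_matches(sample, "condom", ignore_case=True, ignore_endings=True))  # 4
```

Each relaxation can only add matches, which is why the counts in the exercises grow as case sensitivity is dropped and word endings are ignored.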
<Text id="0193">Installing Greenstone</Text> Installing Greenstone on a Windows system There are various ways of getting Greenstone: From a UNESCO CD-ROM (version 2.70) (or FAO IMARK CD-ROM, but this is an earlier version 2.51) These CD-ROMs contain the Greenstone software, plus documented example collections, four language interfaces (English French Spanish Russian), the Export to CD-ROM package, the ImageMagick graphics package, the Java runtime environment, and an installer that installs all of these. From the IITE Digital Libraries in Education CD-ROM, or a Greenstone workshop CD-ROM In addition to all the above software, these CD-ROMs contain the tutorial exercises and a set of sample files to be used for these exercises. CD-ROMs with Greenstone version 2.62 or earlier also include the Greenstone Language Pack, which gives reader's interfaces in many languages (currently about 40). This has its own installer which you have to invoke separately, after you have installed Greenstone. CD-ROMs with version 2.70 or later now come with reader's interfaces in all available languages. Textual images have been removed from the interface; they are now done using CSS (Cascading Style Sheets). The Greenstone Language Pack is no longer needed. Instead, these CD-ROMs come with the Classic Interface Pack, which contains the old text images for use with a backwards compatibility macro file. All these CD-ROMs contain the full Greenstone software, which allows you to view collections and build new ones. They are not the same as CD-ROMs that contain a pre-packaged Greenstone collection, which only allow you to view that collection. From http://www.greenstone.org/download Most people download the Windows distribution from http://www.greenstone.org/download, which contains the latest version of Greenstone. To avoid a single massive download the documented example collections can be downloaded separately. 
To reduce the download size these collections are distributed in unbuilt form and need to be built. There is also the set of sample files used in these exercises.

Most Greenstone CD-ROMs start the installation process as soon as they are inserted into the drive, assuming that the AutoPlay feature is enabled on your computer. If installation does not begin by itself, locate the file setup.exe on the CD and double-click it to start the installation process. (On the IMARK CD-ROM this file resides in the folder software_tools → Greenstone.) If you download Greenstone over the web, what you get is the installer: just double-click it.

If Greenstone has been installed on your computer before, you should completely remove the old version before installing a new one. (However, you need not remove any pre-packaged collections that you may have installed.) To do this, see the instructions under Removing Greenstone from a Windows system.

Here is what you need to do to install Greenstone. Older versions of the installer follow much the same sequence but use slightly different wording.
Select the language for this installation. We choose English.
Welcome to the Greenstone Digital Library Software Installer. It is recommended that you uninstall any previous installations of Greenstone2 before running this installer. Click <Next>.
License Agreement. Click <Accept>.
Choose location to install Greenstone. Leave at the default and click <Next>.
Components. Clicking the question mark button to the right of a component displays a description of that component in a popup window. Leave at the default (all components are selected) and click <Next>.
Enable administration pages. Read the description on this page. If you check the box to enable the pages, click <Next> to set the admin password. Choose a suitable password and click <Next>. (If your computer will not be serving collections online, the password doesn't matter.)
Click <Install> to start the installation. Click <Show Details> to show the details of this installation.
Files are copied across.
Installation is complete.

To invoke the Greenstone Reader's Interface, go to the Greenstone-2.85 item under All Programs on the Windows Start menu and select Greenstone Server; once the server window is displayed, click <Enter Library>. To invoke the Greenstone Librarian Interface, go to the same item and select Librarian Interface (GLI).

<Text id="0232">Updating a Greenstone installation</Text>

These tutorial exercises assume that you are using Greenstone 2.60 or above. Before updating to a new version of Greenstone, ensure that the computer is not running the Greenstone Librarian Interface or the Greenstone local library server. Normally, quitting your web browser, or quitting the Librarian Interface, also quits the server.

Removing Greenstone from a Windows system

Completely remove the existing version before you install a new version of Greenstone. Ensure that you are not running Greenstone. If the installed Greenstone version is 2.81 or above, remove the old version by going to the Greenstone home directory (e.g. C:\Users\<username>\Greenstone2 by default, where <username> is your user name) and clicking Uninstall.bat. Otherwise, if the version is lower than 2.81, remove the old version by going to the Windows Control Panel (from the Settings item on the Start menu). Click Add or Remove Programs, select Greenstone Digital Library Software, and Remove it. (To do this you may need Windows "Administrator" privileges.)

For version 2.81 and above, the uninstaller has an option for keeping all your Greenstone collections; leave it selected, which is the default. For versions lower than 2.81, at the end of the uninstallation procedure you will be asked whether you would like all your Greenstone collections to be removed: you should probably say No if you wish to preserve your work.

Occasionally, problems are encountered if older Greenstone installations are not fully removed. To clean up your system, move your Greenstone collect folder, which contains all your collections, to the desktop.
Then check for the folder where Greenstone is usually installed (C:\Program Files\gsdl or C:\Program Files\Greenstone, or C:\Users\<username>\Greenstone2 for version 2.81 and above) and remove it completely if it exists.

Reinstalling Greenstone on a Windows system

The reinstallation procedure is exactly the same as the original installation procedure, described under Installing Greenstone on a Windows system. If you already have ImageMagick, you do not need to install it again.

There have been some superficial changes to the installation procedure in moving to Greenstone Version 2.60, because it uses a different installer program. There is another important difference that you should be aware of: Versions 2.60 and above are installed in the folder Program Files\Greenstone, whereas prior versions were placed in the folder Program Files\gsdl. (These are both default locations that you could have changed during installation.) When upgrading to Version 2.60, if you want to save existing collections you must explicitly move the contents of your collect folder from the old place to the new one. Future Greenstone versions will be installed in the new place, Program Files\Greenstone, so this problem will not happen again.

Amalgamating different Greenstone collections

If you have previously installed the Greenstone Digital Library software in a non-standard place, you should amalgamate your collections by moving them from the collect folder in the old place into the folder Program Files\Greenstone\collect.

If you have installed collections from pre-packaged Greenstone CD-ROMs, they reside in a different place: C:\GSDL\collect. To amalgamate these with your main Greenstone installation, move them into the folder Program Files\Greenstone\collect. The mini version of Greenstone that is associated with the pre-packaged collections is then no longer necessary. To uninstall it, select Uninstall on the Greenstone menu of the Windows Start menu.

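The amalgamation step described above amounts to moving each collection folder from an old collect folder into the main one. The following is a minimal sketch of that operation, assuming each collection is a single sub-folder of collect. The function name is hypothetical and is not part of Greenstone, and skipping collections whose names already exist is a design choice made here for safety, not something the tutorial prescribes.

```python
import shutil
from pathlib import Path


def amalgamate_collections(old_collect, new_collect):
    """Move each collection folder from old_collect into new_collect.

    Returns the names of the folders that were moved. A folder whose name
    already exists in new_collect is skipped rather than overwritten.
    """
    old_collect, new_collect = Path(old_collect), Path(new_collect)
    new_collect.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() takes a snapshot of the listing before anything is moved.
    for entry in sorted(old_collect.iterdir()):
        if not entry.is_dir():
            continue  # assumption: collections are folders, so ignore stray files
        target = new_collect / entry.name
        if target.exists():
            continue  # never clobber a collection that is already there
        shutil.move(str(entry), str(target))
        moved.append(entry.name)
    return moved


if __name__ == "__main__":
    # Default locations quoted in the text; adjust them to your installation.
    src = Path(r"C:\GSDL\collect")
    dst = Path(r"C:\Program Files\Greenstone\collect")
    if src.is_dir():
        print(amalgamate_collections(src, dst))
```

Make sure Greenstone is not running before moving collections, and note that writing under Program Files may require administrator privileges.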
<Text id="0253">Building a small collection of HTML files</Text> You will need some HTML files, such as those in the simple_html folder in sample_files. Running the Greenstone Librarian Interface Start the Greenstone Librarian Interface: Start → All Programs → Greenstone-2.85 → Librarian Interface (GLI) Start → All Programs → Greenstone-3.05 → Greenstone Librarian Interface (GLI) If you are using Windows Vista or Windows 7 and have installed Greenstone into the default location (i.e. C:\Program Files\Greenstone) a User Account Control dialog may appear as you try to start the Greenstone Librarian Interface, click <Yes> to continue. After a short pause a startup screen appears, and then after a slightly longer pause the main Greenstone Librarian Interface appears. (A command prompt is also opened in the background.) Starting a new collection Start a new collection within the Librarian Interface: → You will create a collection based on a few HTML web pages from the Tudor collection. A window pops up. Fill it out with appropriate values—for example, Small HTML Collection
A small collection of HTML pages. Leave the setting for at its default: , and click . Next you must gather together the files that will constitute the collection. A suitable set has been prepared ahead of time in sample_files → simple_html → html_files. Using the left-hand side of the Librarian Interface's panel, interactively navigate to the sample_files → simple_html folder. Adding documents to the collection Now drag the html_files folder from the left-hand side and drop it on the right. The progress bar at the bottom shows some activity. Gradually, duplicates of all the files will appear in the collection panel. You can inspect the files that have been copied by double-clicking on the folder in the right-hand side. Since this is our first collection, we won't complicate matters by manually assigning metadata or altering the collection's design. Instead we rely on default behaviour. So pass directly to the panel by clicking its tab. Building the collection To start building the collection, click the button. Once the collection has built successfully, a window pops up to confirm this. Click . Click the button to look at the end result. This loads the relevant page into your web browser (starting it up if necessary). Viewing the extracted metadata Back in the Librarian Interface, click the tab to view the metadata associated with the documents in the collection. Presently there is no manually assigned metadata, but the act of building the collection has extracted metadata from the documents. Double click the html_files folder to expand its content. Then single-click aragon.html to display all its metadata in the right-hand side of the panel. The initial fields, starting , are empty. These are Dublin Core metadata fields for manually entered data. Use the scroll bar on the extreme right to view the bottom part of the list. 
There you will see fields starting that express the extracted metadata: for example , based on the text within the HTML Title tags, and , the document's language (represented using the ISO standard 2-letter mnemonic) which Greenstone determines by analyzing the document's text. Close the collection by clicking → . This automatically saves the collection to disk. Viewing the internal links and external links Hyperlinks in a Greenstone collection work like this: If the link is to a document that is also in the collection, clicking it takes you to that document in the collection. If the link is to a document that is not in the collection, clicking it takes you to that document on the web. Go back to the web browser and click the titles link near the top of the page. Open the file boleyn.html and look for the link to Katharine of Aragon (in the 5th paragraph of the Biography section). This links to a document inside the collection--aragon.html. View this document by clicking the link. For an external link, return to boleyn.html and click letters written by Anne (in the Primary Sources section). This takes you out on to the web. If you want a warning message to be displayed first, you can open Greenstone → etc → main.cfg file and uncomment the line cgiarg shortname=el argdefault=prompt (remove the # at the start of a line to uncomment it). Note, that if you are already browsing a collection, then you will need to go back to the home page and re-enter the collection to see this take effect (due to caching of the el argument). Setting up a shortcut in the Librarian interface To set up a shortcut to the source files, in the panel navigate to the folder in your local file space that contains the files you want to use—in our case, the sample_files folder. Select this folder and then right-click it, and choose from the menu. In the field, enter the name you want the shortcut to have, or accept the default . Click . 
Close all the folders in the file tree in the left-hand pane, and you will see the shortcut to your source files. <Text id="0337">A simple image collection</Text> In the Librarian Interface, start a new collection ( → ) called backdrop. Fill out the fields with appropriate information. For , select the item Simple image collection from the pull-down menu. This will only be available if the documented example collections are installed. If you don't have this collection, select . You can still build an image collection, but some of the tutorial will not match exactly. When you base a collection on an existing one, it inherits all the settings of the old one, including which metadata sets (if any) the collection uses. Copy the images (avoid the README.TXT file) provided in sample_files → images into your newly-formed collection. Change to the panel and build the collection. Preview the result. Click on in the navigation bar to view a list of the photos ordered by filename and presented as a thumbnail accompanied by some basic data about the image. The structure of this collection is the same as Simple image collection, but the content is different. Back in the Librarian Interface, change to the panel and view the extracted metadata for Bear.jpg. Adding Title and Description metadata We work with just the first three files (Bear.jpg, Cat.jpg and Cheetah.jpg) to get a flavour of what is possible. First, we need to add the Dublin Core metadata set which is not used in the Simple image collection collection. Click the button beneath the Collection file tree. A new window pops up showing the metadata sets used by current collection. Click the button to bring up another window showing the available metadata sets. Select the "Dublin Core Metadata Element Set" from the list and click . Click to return to the panel. First, set each file's field to be the same as its filename but without the filename extension. 
Click on Bear.jpg so its metadata fields are available, then click on its field on the right-hand side. Type in Bear. Repeat the process for Cat.jpg and Cheetah.jpg.

Add a description for each image as metadata. What description should you enter? To remind yourself of a file's content, the Librarian Interface lets you open files by double-clicking them. It launches the appropriate application based on the filename extension: Word for .doc files, Acrobat for .pdf files and so on. Double-click Bear.jpg: on Windows, the image will normally be displayed by Microsoft's Photo Editor (although this depends on how your computer has been set up).

Back in the pane, make sure that Bear.jpg is selected in the collection tree on the left-hand side. Enter the text Bear in the Rocky Mountains as the value for the field. Repeat this process for Cat.jpg and Cheetah.jpg, adding a suitable description for each.

Go to the panel and click . Once it has finished building, preview the collection. You will not notice anything new. That's because we haven't changed the design of the collection to take advantage of the new metadata.

Change Format Features to display new metadata

Now we customize the collection's appearance. Go to the panel and select from the left-hand list. Leave the feature selection controls at their default values, so that is selected for , and is selected as the .

In the , edit the text as follows: Click on the browse Format Feature. Find the section under documentNode where it says
<td valign="top">
<gsf:displayText name="ImageName"/>:<gsf:metadata name="Image"/><br/>
<gsf:displayText name="Width"/>:<gsf:metadata name="ImageWidth"/><br/>
<gsf:displayText name="Height"/>:<gsf:metadata name="ImageHeight"/><br/>
<gsf:displayText name="Size"/>:<gsf:metadata name="ImageSize"/>
</td>

Edit the text as follows. Each change is given twice: first for the macro and square-bracket format-string syntax, then for the gsf XML syntax shown above.
Change _ImageName_: to Title:
Change <gsf:displayText name="ImageName"/>: to Title:
Change [Image] to [dc.Title]
Change <gsf:metadata name="Image"/> to <gsf:metadata name="dc.Title"/>
After [dc.Title]<br> add Description: [dc.Description]<br>
After <gsf:metadata name="dc.Title"/><br/> add Description: <gsf:metadata name="dc.Description"/><br/>

Metadata names are case-sensitive in Greenstone: it is important that you capitalize "Title" and "Description" (and don't capitalize "dc"). The new format statement is displayed in the list of assigned format statements.

The first substitution alters the fragment of text that appears to the right of the thumbnail image; the second alters the item of metadata that follows it. The addition displays the description after the Title.

Preview the collection by clicking the button. When you click on in the navigation bar the presentation has changed to "Title: Bear" and so on. Each image's description should appear beside the thumbnail, following the title. After the first three items, the Title and Description become blank because we have only assigned Dublin Core metadata to these first three. To get a full listing, enter all the metadata.

Changes in the panel take place immediately and you can see the result straightaway by clicking the button. If you modify anything in the , or panels, you will need to rebuild the collection.

Changing the size of image thumbnails

Let's change the size of the thumbnail image and make it smaller. Thumbnail images are created by the plug-in, so we need to access its configuration settings. To do this, switch to the panel and select from the list on the left. Double-click to pop up a window that shows its settings. (Alternatively, select with a single click and then click further down the screen.) Currently most options are off, so standard defaults are used. Select , set it to , and click . Build and preview the collection.
Once you have seen the result of the change, return to the panel, select the configuration options for , and switch the option off so that the thumbnail reverts to its normal size when the collection is re-built.

Adding a browsing classifier based on Description metadata

Now we'll add a new browsing option based on the descriptions. In the panel, select from the left-hand list. Set the menu item for to , then click . A window pops up to control the classifier's options. Set the option to . Next, click the check box and choose from the drop-down list. Click . Build the collection, and preview it. Choose the new link that appears in the navigation bar. Only three items are shown, because only items with the relevant metadata ( in this case) appear in the list. The original browse list includes all photos in the collection because it is based on , extracted metadata that reflects an image's filename, which is set for all images in the collection.

Creating a searchable index based on Description metadata

Now we'll add an index so that the collection can be searched by descriptions. Switch to the panel and select from the left-hand list. Click the button. Select from the list of metadata to include in the index and click . Leave at its default, "document". Switch to the panel, build the collection, then preview it. There is now a button in the navigation bar. As an example, search for the term "bear" in the index (which is the only index at this point).

To change the text that is displayed for the index, go to the panel back in the Librarian Interface. Select from the left-hand list. This panel allows you to change the text that is displayed on the search form. Change the for the "dc.Description" index to "image descriptions" (or other suitable text).
Go back to the browser and reload the search page. Your new text will appear in the search form. Note that if you use text instead of macros in the search metadata display text, you will need to do any translations yourself. <Text id="0279">A collection of Word and PDF files</Text> You will need some source files like those in the sample_files → Word_and_PDF folder. Start a new collection called reports ( → ) and base it on . Copy all the .doc, .rtf, .pdf and .ps files from sample_files → Word_and_PDF → Documents into the collection. There are 9 files in all: you can select multiple files by clicking on the first one and shift-clicking on the last one, and drag them all across together. (This is the normal technique of multiple selection.) Switch to the panel, and build and preview the collection. Viewing the extracted metadata Again, this collection contains no manually assigned metadata. All the information that appears—title and filename—is extracted automatically from the documents themselves. Because of this the quality of some of the title metadata is suspect. Back in the Librarian Interface, click the tab to view the automatically extracted metadata. You will need to scroll down to see the extracted metadata, which begins with . Check whether the metadata is correct for some of the documents by opening them. You can open a document from the Librarian Interface by double clicking on it. The extracted Title metadata for some documents is incorrect. For example, the Titles for pdf01.pdf and word03.doc (the same document in different formats) have missed out the second line. The Title for pdf03.pdf has the wrong text altogether. Manually adding metadata to documents in a collection In the panel, manually add Dublin Core metadata to those documents which have incorrect metadata. Select word03.doc and double-click to open it. Copy the title of this document () and return to the Librarian Interface. Scroll up or down in the metadata table until you can see . 
Click in the value box and paste in the metadata. Now add information for the same document. You can add more than one value for the same field: when you press Enter in a metadata value field, a new empty field of the same type will be generated. Add each author separately as metadata. Close the document (in Microsoft Word) when you have finished copying metadata from it. External programs opened when viewing documents must be closed before building the collection, otherwise errors can occur. Next add and metadata for a few of the other documents. You will notice as you add more values, they appear in the box below the metadata table. If you are adding the same metadata value to more than one document, you can select it from this list. For example, pdf01.pdf and word03.doc share the same Title; and many documents have common authors. If you build and preview your collection at this point, you will see that the list now shows your new Titles. However, the metadata is not displayed. You need to alter the collection design to use this metadata. In the Librarian Interface, look at the section of the panel, by clicking on this in the list to the left. Here you can add, configure or remove plugins to be used in the collection. There is no need to remove any plugins, but it will speed up processing a little. In this case we have only Word, PDF, RTF, and PostScript documents, and can remove the , , , , , , , and plugins. To delete a plugin, select it and click . is required for any type of source collection and should not be removed. Search indexes The next step in the panel is . These specify what parts of the collection are searchable (e.g. searching by title and author). Delete the index, which is not particularly useful, by selecting it and clicking . By default the titles index (,,) includes , and . Searching this index will search , and metadata. 
If you wanted to restrict searching to just the manually added metadata, edit this index and deselect and from the list of metadata. You can add indexes based on any metadata. Add a new index based on by clicking . Select in the list of metadata, and click . Browsing classifiers The section adds "classifiers," which provide the collection with browsing functions. Go to this section and observe that Greenstone has provided two List classifiers, based on and metadata. These correspond to the and buttons on the collection's access bar. Remove the classifier by selecting it and clicking . Now add an classifier for . Select from the drop-down list and click . A popup window for appears. Select from the drop-down list and click . Switch to the panel, and build and preview the collection. Next, go to in the panel, and select the section. Set the display text value for Index: dc.Creator to Creators. Check that all the facilities work properly. There should be three full-text indexes, called , , and . The list should display all the document Titles. The list should show one bookshelf for each author you have assigned as , and clicking on that bookshelf should take you to all the documents they authored. The list shows all documents which have been assigned metadata, or have automatically extracted . For many documents, extracted Titles may be fine, and it is impractical to add the same metadata again as . Specifying a list of metadata names in the classifier allows us to use both. 
If you have already done the exercise, some of the documents will have extracted ex.Creator metadata, and some will have dc.Creator. To use both of these in the Creators classifier, make the field read . Build the collection again and preview it. Now extracted Creators should appear in the list. We will play around with the format statements and customize the outlook of this collection in the exercise. <Text id="fw-1">Formatting the Word and PDF collection</Text> In this exercise, we play around with the format statements in the Word and PDF collection. Open the reports collection in the Librarian Interface and go to the section of the panel. Tidying up the default format statement In this part of the exercise, we make the format statement simpler without changing the resulting display. Greenstone's default format statement is complex because it is designed to produce something reasonable under almost any conditions, and also because for practical reasons it needs to be backwards compatible with legacy collections. For this collection, we don't need all of the complexity. Make sure that the format statement is selected in the list of formats. The default format statement looks like the following: <td valign="top">[link][icon][/link]</td>
<td valign="top">[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>
<td valign="top">[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td> This format statement is the default used for any vertical list, such as search results, classifiers, and document tables of contents. {Or}{[ex.thumbicon],[ex.srcicon]} chooses ex.thumbicon metadata if it's there, otherwise chooses ex.srcicon metadata. If neither is present, nothing is displayed. For this collection there is no ex.thumbicon metadata, so the choice is not needed. Replace {Or}{[ex.thumbicon],[ex.srcicon]} (highlighted above) with [ex.srcicon]. There is no exp.Title metadata, so remove that element from {Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}. The resulting format statement looks like the following: <td valign=top>[link][icon][/link]</td>
<td valign=top>[ex.srclink][ex.srcicon][ex./srclink]</td>
<td valign=top>[highlight]
{Or}{[dc.Title],[ex.Title],Untitled} [/highlight] {If}{[ex.Source],<br><i>([ex.Source])</i>}</td>
Preview the collection to make sure the display hasn't changed. You shouldn't notice any difference when looking at search results, classifiers etc. An excerpt from the default format statement for documentNode looks like the following: <td valign="top">
<gsf:link type="source">
<gsf:choose-metadata>
<gsf:metadata name="thumbicon"/>
<gsf:metadata name="srcicon"/>
</gsf:choose-metadata>
</gsf:link>
</td>
This format statement is the default used for the documentNode vertical lists under classifiers. <gsf:choose-metadata>
<gsf:metadata name="thumbicon"/>
<gsf:metadata name="srcicon"/>
</gsf:choose-metadata>
chooses ex.thumbicon metadata if it's there, otherwise chooses ex.srcicon metadata. If neither is present, nothing is displayed. For this collection there is no ex.thumbicon metadata, so the choice is not needed. Replace the above with <td valign="top">
<gsf:link type="source">
<gsf:metadata name="srcicon"/>
</gsf:link>
</td>
Next edit the format features: there is no exp.Title metadata, so remove that element from the following <gsf:choose-metadata>
<gsf:metadata name="dc.Title"/>
<gsf:metadata name="exp.Title"/>
<gsf:metadata name="ex.dc.Title"/>
<gsf:metadata name="Title"/>
<gsf:default>Untitled</gsf:default>
</gsf:choose-metadata>
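The fallback behaviour of a choose-metadata block can be pictured with a small model. The sketch below is plain Python, not Greenstone code: the function name, the dictionaries standing in for documents, and the sample titles are all hypothetical, and it only illustrates the "first value that exists wins, otherwise use the default" rule described above.

```python
def choose_metadata(doc, names, default=None):
    """Return the first non-empty metadata value among `names`, else `default`.

    `doc` is a hypothetical dict mapping metadata names to values; it is a
    stand-in for a Greenstone document, not a real Greenstone structure.
    """
    for name in names:
        value = doc.get(name)
        if value:
            return value
    return default


# The candidate list after exp.Title has been removed.
names = ["dc.Title", "ex.dc.Title", "Title"]

doc_with_dc = {"dc.Title": "Annual Report", "Title": "annual_report.doc"}
doc_extracted_only = {"Title": "annual_report.doc"}
doc_without_title = {}

print(choose_metadata(doc_with_dc, names, "Untitled"))
print(choose_metadata(doc_extracted_only, names, "Untitled"))
print(choose_metadata(doc_without_title, names, "Untitled"))
```

Because no document in this collection carries exp.Title, dropping it from the candidate list cannot change which value is chosen, which is why the display stays the same.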
Preview the collection to make sure the display hasn't changed. You shouldn't notice any difference when looking at search results, classifiers etc. Linking to Greenstone version or original version of documents For collections with documents that undergo a conversion process during importing (e.g. Word, PDF, and PowerPoint documents, but not text or HTML documents), the original file is stored in the collection along with the converted version. The default format statement links to both versions: [link][icon][/link] links to the Greenstone HTML version, while [ex.srclink][ex.srcicon][ex./srclink] links to the original. Choose in by selecting from the drop-down list, and from the list. Click to add the format statement into the list of assigned formats. Experiment with removing either of the two links from the format statement. To see the results of your changes, preview the collection and do a search. You are making changes to , which means the changes will only apply to search results. Storing and displaying the original allows users to see the correct format, but requires the user to have the relevant program installed. It also increases the size of the collection. The Greenstone version can be viewed in a browser, but may not look as nice. The default format statement links to both versions, but the format statement for links only to the converted version of the original file: <gsf:link type="document">
<gsf:icon type="document"/>
</gsf:link> links to the Greenstone HTML version, while <gsf:link type="source">
<gsf:metadata name="srcicon"/>
</gsf:link>
links to the original. Choose in . Experiment with removing either of the two links from the format statement. To see the results of your changes, preview the collection and do a search. You are making changes to documentNodes under , which means the changes will only apply to search results. Storing and displaying the original allows users to see the correct format, but requires the user to have the relevant program installed. It also increases the size of the collection. The Greenstone version can be viewed in a browser, but may not look as nice. Making bookshelves show how many items they contain Next, we'll customize the format for the list. Classifier bookshelves have only a few pieces of metadata to display: [ex.Title] and [numleafdocs]. Whatever metadata the classifier has been built on, the bookshelf label is always stored as [ex.Title]. This is why a Creator is printed out for each bookshelf even though [dc.Creator] is not specified in the format statement. [numleafdocs] is only defined for bookshelves, so this metadata can be used in an {If} statement to make bookshelves and documents display differently in the list. Make each bookshelf in the Creator classifier show how many entries it contains. In the section of the panel, select the classifier (which is based on metadata) from the drop down list, and from the list. Click the button to add this format into the list of assigned formats. Note that it gets added as in this list: it is the format for the second () classifier. Append the following text to the bottom of the format statement: {If}{[numleafdocs],<td><i>([numleafdocs])</i></td>} Make each bookshelf in the Creator classifier show how many entries it contains. In the section of the panel, select the format statement. 
This consists of three parts: the first gsf:template is the format statement defining the display of a documentNode, the second one is the format statement that controls the appearance of VList classifierNodes (which appear as bookshelves here), while the final gsf:template block is the format statement defining the display of HList classifierNodes. Scroll down to the end of the second format statement, which is the one for the VList classifiers and appears just before the start of the format statement for HList classifiers. Then insert the line highlighted below, which will display the number of leaf documents inside a classifier bookshelf: <gsf:template match="classifierNode[@classifierStyle = 'VList']">
...
<td>(<gsf:metadata name="numleafdocs"/>)</td>
</gsf:template>
<gsf:template match="classifierNode[@classifierStyle = 'HList']">
<gsf:link type="classifier">
<gsf:metadata name="Title"/>
</gsf:link>
</gsf:template>
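As a rough illustration of why only bookshelves show a count, the following Python sketch models the rule that numleafdocs exists for bookshelves but not for documents. The function and the sample nodes are hypothetical stand-ins, not part of Greenstone.

```python
def render_row(node):
    """Return the label for one list row, adding a count only for bookshelves.

    `node` is a hypothetical dict; a bookshelf carries `numleafdocs`,
    while a document does not.
    """
    label = node["Title"]
    count = node.get("numleafdocs")
    if count is not None:
        return f"{label} ({count})"
    return label


bookshelf = {"Title": "Witten, Ian", "numleafdocs": 3}
document = {"Title": "How to build a digital library"}

print(render_row(bookshelf))
print(render_row(document))
```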
Preview the collection. Click on the list and notice that the bookshelves now display how many documents they contain. This revised format statement has the effect of specifying in brackets how many items are contained within a bookshelf. Since only bookshelves define [numleafdocs], only they will display this. By modifying instead of , the change will only apply to the second classifier (). Displaying multi-valued metadata Next we modify the document entries in the Creator classifier to display all authors. Back in , select the format in the list of assigned formats. After {If}{[ex.Source],<br> in the format statement, add [sibling:dc.Creator]. [ex.Source] is not defined for bookshelves, so can also be used to differentiate bookshelves and documents. The resulting format statement looks like: <td valign=top>[link][icon][/link]</td>
<td valign=top>[ex.srclink][ex.srcicon][ex./srclink]</td>
<td valign=top>[highlight]
{Or}{[dc.Title],[ex.Title],Untitled}[/highlight]
{If}{[ex.Source],<br>[sibling:dc.Creator]
<i>([ex.Source])</i>}</td>
{If}{[numleafdocs],<td><i>([numleafdocs])</i></td>} This will display the Greenstone link, the link to the original, then the Title. For bookshelves, it will also display how many documents the bookshelf contains. For documents, it will display all the Authors (Creators), and the source document. [sibling:dc.Creator] displays all the Creator metadata for the document, separated by a space (), while [dc.Creator] displays only the first author. Preview the list and make sure that all authors are displayed for documents. You can change the separator between the authors. Modify the format statement, and replace [sibling:dc.Creator] with [sibling(All'<br/>'):dc.Creator]. This will add a new line after each author (<br/> specifies a line break in HTML). Preview the list. If you have done exercise , the collection will have both dc.Creator and ex.Creator metadata. To display both, you can use [sibling:dc.Creator] [sibling:ex.Creator] To display dc.Creator if it is present, otherwise display ex.Creator, use {Or}{[sibling:dc.Creator],[sibling:ex.Creator]} Next we modify the document entries in the Creator classifier to display all authors. Back in , select the format in the list of assigned formats. Edit the format statement for after the part where it displays the Title metadata, so that it now additionally contains the new line highlighted below. This will display the dc.Creator metadata. <td valign="top">
<gsf:link type="document">
<xsl:call-template name="choose-title"/>
<gsf:switch>
<gsf:metadata name="Source"/>
<gsf:when test="exists">
<br/>
<i>(<gsf:metadata name="Source"/>)</i>
</gsf:when>
</gsf:switch>
</gsf:link>
<br/>
<gsf:metadata name="dc.Creator" />
</td>
The format statement as it is above will now display the Greenstone link, the link to the original, then the Title as before. Since it's defined for documentNodes, it will display all the Authors (Creators) and the source document for each document. Preview the list and make sure that all authors are displayed for documents. The additional line <gsf:metadata name="dc.Creator" /> displays all the Creator metadata for the document, separated by a comma. The same line could also have been written as <gsf:metadata name="dc.Creator" select="siblings"/>, but mentioning siblings explicitly is not necessary, as all the metadata values for dc.Creator will be returned by default. However, this longer way of requesting specific metadata is useful when parent, ancestors, or root values are required for a piece of metadata, such as when you want not just the current section's Title to be displayed, but wish to display the Title of the (parent) document containing the section as well. If you wish to retrieve only the first, last or nth value of a metadata item, you would use the pos attribute. For example, <gsf:metadata name="dc.Creator" pos="first"/> (or alternatively, <gsf:metadata name="dc.Creator" pos="1"/>) displays only the first author. You can change the separator between the authors. Modify the format statement, and replace <gsf:metadata name="dc.Creator" /> with <gsf:metadata name="dc.Creator" separator="&lt;br/&gt;" />. This will add a new line after each author (&lt;br/&gt; is the escaped version of <br/>, which specifies a line break in HTML and XML). Preview the list. Alternatively, modify the format statement and replace <gsf:metadata name="dc.Creator" /> with <gsf:metadata name="dc.Creator" separator=" "/>. This will add a space after each author. Preview the list. 
However, if you want a newline to separate each author, it requires a little more in order to escape the HTML newline (<br />) element: <gsf:metadata name="dc.Creator"><separator><br /></separator></gsf:metadata> If you have done exercise , the collection will have both dc.Creator and ex.Creator metadata. To display the metadata values for both, you can use <gsf:metadata name="dc.Creator" />, <gsf:metadata name="Creator" />
To display dc.Creator if it is present, otherwise display ex.Creator, use <gsf:choose-metadata>
<gsf:metadata name="dc.Creator" />
<gsf:metadata name="Creator" />
</gsf:choose-metadata>
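The effect of the separator and pos attributes can be modelled in a few lines of Python. This is only a sketch of the behaviour described above, not Greenstone's implementation: the function name is hypothetical, and the default separator of a comma followed by a space is an assumption.

```python
def metadata_values(values, separator=", ", pos=None):
    """Model how a multi-valued metadata item might be displayed.

    `values` is the list of values for one metadata name (hypothetical input).
    With no `pos`, all values are joined by `separator` (default is assumed).
    `pos` may be "first", "last", or a 1-based index given as a string.
    """
    if not values:
        return ""
    if pos is None:
        return separator.join(values)
    if pos == "first":
        return values[0]
    if pos == "last":
        return values[-1]
    index = int(pos) - 1  # pos="1" means the first value
    return values[index] if 0 <= index < len(values) else ""


creators = ["Author A", "Author B", "Author C"]

print(metadata_values(creators))                    # all values, comma separated
print(metadata_values(creators, separator="<br/>")) # one author per line in HTML
print(metadata_values(creators, pos="first"))       # only the first author
```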
Advanced multi-valued metadata You may notice that the classifier's configuration dialog has two options after the option: and . Manually added metadata can be used to replace or enhance automatically extracted metadata, and these options control exactly which pieces of metadata a document is classified by. For example, say we have two documents. Document 1 has four Creators specified (dc.Creator = dcA, dc.Creator = dcB, ex.Creator = exA, ex.Creator = exB), while document 2 has three (ex.Creator = exA, ex.Creator = exB, ex.Creator = exC). The following table shows which metadata values each document is classified by, for the different classifier options:

options	Document 1	Document 2
-metadata dc.Creator,ex.Creator	dcA, dcB	exA, exB, exC
-metadata dc.Creator,ex.Creator -firstvalueonly	dcA	exA
-metadata dc.Creator,ex.Creator -allvalues	dcA, dcB, exA, exB	exA, exB, exC
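
The rows of this table can be reproduced with a short Python model. It is a sketch that mirrors the table only: the function name and the dictionary layout are hypothetical, and it is not how the classifier is implemented inside Greenstone.

```python
def classification_values(doc, names, firstvalueonly=False, allvalues=False):
    """Return the metadata values a document would be classified by.

    `doc` maps a metadata name to its list of values (hypothetical layout).
    Default: all values of the first name in `names` that has any values.
    firstvalueonly: just the first value of that first name.
    allvalues: every value of every listed name.
    """
    if allvalues:
        return [value for name in names for value in doc.get(name, [])]
    for name in names:
        values = doc.get(name, [])
        if values:
            return values[:1] if firstvalueonly else list(values)
    return []


names = ["dc.Creator", "ex.Creator"]
document1 = {"dc.Creator": ["dcA", "dcB"], "ex.Creator": ["exA", "exB"]}
document2 = {"ex.Creator": ["exA", "exB", "exC"]}

for doc in (document1, document2):
    print(classification_values(doc, names))
    print(classification_values(doc, names, firstvalueonly=True))
    print(classification_values(doc, names, allvalues=True))
```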

We'll now set the option for the classifier. Switch to the section of the panel, select the for metadata in the box and click . Select the option. Rebuild and preview the collection. Now the list classifies documents based on the first author appearing in the metadata. If you set the field of to in the exercise, now the list will classify based on the first author appearing in either the metadata or the metadata. <Text id="pdfbox-ext-0">Processing newer versions of PDF with PDFBox</Text> By default the PDFPlugin can process PDF versions 1.4 and older. The PDFBox extension for Greenstone allows text from more recent PDF files to be extracted. The extension uses PDFBox, an open-source PDF conversion tool. This tutorial will cover how to install the PDFBox extension for Greenstone and how to switch on its functionality in the Greenstone Librarian Interface to process text from newer versions of PDF. Obtaining and installing the PDFBox extension for Greenstone The wiki release notes that go with the Greenstone binary you downloaded will contain the download link to the PDFBox extension that works with your binary. If you want to try the most up-to-date version of the extension, visit http://trac.greenstone.org/browser/gs2-extensions/pdf-box/trunk/pdf-box-java.zip and download the zip archive from there, if you're in Windows. If you are working on a *nix machine, you might instead prefer to download the compressed tar file of the same by visiting http://trac.greenstone.org/browser/gs2-extensions/pdf-box/trunk/pdf-box-java.tar.gz. Move the downloaded file into your Greenstone installation's ext folder. You will now need to decompress the file you downloaded in this location. To do so on Windows XP, rightclick on the file and choose Extract All... and go through the Extraction wizard. On Windows Vista and 7, double clicking on the zip file will open an Explorer window showing you its contents. Click on an empty part inside that window and choose Extract All... 
to extract its contents. On Linux, to decompress the tar.gz file, run the command: tar -xvzf <tar file name> All going well, you will have a folder called pdf-box inside your Greenstone's ext folder. Turning on the PDFBox extension functionality in GLI Before you can use the extension, make sure that all instances of GLI, the Greenstone Librarian interface, are closed. Note that if you were running GLI through a console, you will want to start up a fresh console, then run the setup script again to set up the Greenstone environment once more, which will this time take the presence of the PDFBox extension into account. To run the setup script, your console needs to be pointing to your Greenstone installation directory. From here, you would run setup.bat if you're on Windows, or source setup.bash if you're on Linux. Launch GLI once more, in the manner you're accustomed to. On Windows, the easiest way is the shortcut to GLI available through the Windows Start menu. Now that you've installed the PDFBox extension, this will be available as an option in the plugin's configuration dialog. To turn on the PDFBox extension for any collection you open in GLI, you would go to the panel, select from the left and on the right, double click the (alternatively, select this plugin and click the below) to open the dialog to configure this plugin. In the dialog, scroll down to the section and select the checkbox next to the option. Click to close the dialog, switch to the panel and rebuild your collection. This time, PDF files will be processed by PDFBox which will extract their text. Try this feature out on a collection of recent PDF files, by configuring its PDFPlugin with the option turned on. You can also experiment by configuring the PDFPlugin used in the Reports collection, although that one contains old PDF versions which the default settings of can already process successfully. 
If you do decide to test out the PDFBox extension with the Reports collection, then rebuild it and preview it. However, once you've inspected the results, you may wish to go back to the panel and turn off and rebuild the collection once more, so that it's back to its original state and ready for future tutorials. <Text id="ep-1">Enhanced PDF handling</Text> Greenstone converts PDF files to HTML using third-party software: . This lets users view these documents even if they don't have the PDF software installed. Unfortunately, sometimes the formatting of the resulting HTML files is not so good. This exercise explores some extra options to the PDF plugin which may produce a nicer version for display. Some of these options use the standard pdftohtml program, others use ImageMagick and Ghostscript to convert the file to a series of images. Ghostscript is a program that can convert Postscript and PDF files to other formats. You can download it from http://www.cs.wisc.edu/~ghost/ (follow the link to the current stable release). In the Librarian Interface, start a new collection called "PDF collection" and base it on . In the panel, drag just the PDF documents from sample_files → Word_and_PDF → Documents into the new collection. Also drag in the PDF documents from sample_files → Word_and_PDF → difficult_pdf. Go to the panel and build the collection. Examine the output from the build process. You will notice that one of the documents could not be processed. The following messages are shown: "The file pdf05-notext.pdf was recognised but could not be processed by any plugin.", and "3 documents were processed and included in the collection. 1 was rejected". Preview the collection and view the documents. pdf05-notext.pdf does not appear as it could not be processed. pdf06-weirdchars.pdf was processed but looks very strange. The other PDF documents appear as one long document, with no sections. 
Modes in the Librarian Interface The Librarian Interface can operate in different modes. The default mode is mode. We can use mode to work out why the pdf file could not be processed. Use the item on the menu, tab, to switch to mode and then build the collection again. The panel looks different in mode because it gives more options: locate the button, near the bottom of the window, and click it. Now a message appears saying that the file could not be processed, and why. Amongst all the output, we get the following message: "Error: PDF contains no extractable text. Could not convert pdf05-notext.pdf to HTML format". pdftohtml.pl cannot convert a PDF file to HTML if the PDF file has no extractable text. We recommend that you switch back to mode for subsequent exercises, to avoid confusion. Splitting PDFs into sections In the section of the panel, configure . Switch on the option. In the section, check the checkbox to build the indexes on section level as well as document level. Build and preview the collection. View the text versions of some of the PDF documents. Note that these are now split into a series of pages, and a "go to page" box is provided. Note that these are now split into a series of pages, and two means of jumping between various pages is provided: on the left, individual pages are listed vertically by page number and clicking the "plus" box next to a page will expand its contents, while on the right there's a box with a horizontal scroller which can be used to scroll to the page you wish to view. The format is still a bit ugly though, and pdf05-notext.pdf is still not processed. Using image format If conversion to HTML doesn't produce the result you'd like, PDF documents can be converted to a series of images, one per page. This requires ImageMagick and Ghostscript to be installed. In the section, configure . Set the option to one of the image types, e.g. . Switch off the option, as it is not used with image conversion. 
Build the collection and preview. All PDF documents (including pdf05-notext.pdf) have been processed and divided into sections. Images from the document are now displayed instead of the extracted text. Both pdf05-notext.pdf and pdf06-weirdchars.pdf display nicely now. All PDF documents (including pdf05-notext.pdf) have been processed and divided into sections, but each section displays . For the conversion to images for PDF documents, no text is extracted. In order to view the documents properly, you will need to modify the format statement. In the section on the panel, select the format statement. Replace [Text] with [srcicon] Preview the collection. Images from the document are now displayed instead of the extracted text. Both pdf05-notext.pdf and pdf06-weirdchars.pdf display nicely now. In this collection, we only have PDF documents and they have all been converted to images. If we had other document types in the collection, we should use a different format statement, such as: {If}{[parent:FileFormat] eq PDF,[srcicon],[Text]} is an extracted metadata item which shows the format of the source document. We can use this to test whether the documents are PDF or not: for PDF documents, display [srcicon], for other documents, display [Text]. Using to control document processing (advanced) Processing all of the PDF documents using an image type may not give the best result for your collection. The images will look nice, but as no text is extracted, searching the full text will not be available for these documents. The best solution would be to process most of the PDF files as HTML, and only use the image format where HTML doesn't work. We achieve this by putting the problem files into a separate folder, and adding another plugin with different options. Go to the panel. Make a new folder called : right click in the collection panel and select from the menu. Change the to , and click . 
Move the two pdf files that have problems with html (pdf05-notext.pdf and pdf06-weirdchars.pdf) into this folder by drag and drop. We will set up the plugins so that PDF files in this notext folder are processed differently to the other PDF files. Switch to the section of the panel. Add a second PDF plugin by selecting from the drop-down list, and clicking . This plugin will come after the first PDF plugin, so we configure it to process PDF documents as HTML. Set the option to , and switch on the option. Click . Configure the first PDF plugin, and set the option to . The two PDF plugins should have options like the following: plugin PDFPlugin -convert_to pagedimg_jpg -process_exp "notext.*\.pdf"
plugin PDFPlugin -convert_to html -use_sections The version must come earlier in the list than the version. The for the first will process any PDF files in the notext directory. The second will process any PDF files that are not processed by the first one. Note that all plugins have the option, and this can be used to customize which documents are processed by which plugin. Edit the format statement. PDF files processed as HTML will not have images to display, so we need to make sure they get text displayed instead. Change [srcicon] to {If}{[NoText] eq "1",[srcicon],[Text]}. Build and preview the collection. All PDF documents should look relatively nice. Try searching this collection. You will be able to search for the PDFs that were converted to HTML (try e.g. ), but not the ones that were converted to images (try searching for or ). Opening PDF files with query terms highlighted Next we'll customize the format statement to highlight the query terms in a PDF file when it is opened from the search result list. This requires Acrobat Reader 7.0 version or higher, and currently only works on a Microsoft Windows platform. The search terms are kept in the macro variable , and we append to the end of a PDF file link to pass the query terms to the PDF. saves each PDF file in a unique directory. You can use _httpcollection_/index/assoc/[archivedir]/[srclinkFile] to refer to these files. Add by selecting from the drop down list, and from the list. Click to add the format statement into the list of assigned formats. We need to test whether the file is a PDF file before linking to it, using {If}{[ex.FileFormat] eq 'PDF',,}. For PDF files, we use the above path format instead of the [ex.srclink] and [ex./srclink] variables to link to the file. The resulting format statement is: <td valign="top">[link][icon][/link]</td>
<td valign="top">{If}{[ex.FileFormat] eq 'PDF', <a href=\"_httpcollection_/index/assoc/[archivedir]/[srclinkFile]#search="_queryterms_"\">{Or}{[ex.thumbicon],[ex.srcicon]}</a>,
[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]}</td>
<td valign="top">[highlight]
{Or}{[dc.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td>
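As a rough illustration of what the format statement above produces, the following Python sketch assembles the same kind of link outside Greenstone. The function name and the sample values are hypothetical; only the path layout (index/assoc/[archivedir]/[srclinkFile]) and the #search="..." suffix are taken from the text above.

```python
def build_pdf_search_link(http_collection, archivedir, srclink_file, query_terms):
    """Return a link that passes query_terms to the PDF via #search="...".

    Mirrors the path used in the format statement:
    _httpcollection_/index/assoc/[archivedir]/[srclinkFile]#search="terms"
    """
    base = http_collection.rstrip("/")
    path = f"{base}/index/assoc/{archivedir}/{srclink_file}"
    return f'{path}#search="{query_terms}"'


# Hypothetical values, for illustration only.
link = build_pdf_search_link(
    "http://localhost/greenstone/collect/reports",
    "HASH0123.dir",
    "pdf01.pdf",
    "digital library",
)
print(link)
```

The sketch deliberately does no escaping of the query terms, since the format statement does none either.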
To highlight the query terms in a PDF document, we need to pass them into the PDF file by appending to the end of the document link. We need to create the link ourselves rather than using <gsf:link type="source"/> in the format statement. saves each PDF file in a unique directory for that document, and we can use <gsf:metadata name="httpPath" type="collection"/>/index/assoc/<gsf:metadata name="archivedir"/>/<gsf:metadata name="srclinkFile"/> to refer to the PDF source file. The search terms can be found in the "q" CGI parameter. You can access this using <gsf:cgi-param name="q"/>. Select in for editing. Before linking to a file we need to check that it is a PDF file, by testing whether the FileFormat metadata that Greenstone extracted is PDF. For PDF files, we now generate the link explicitly. The resulting format statement is: <td valign="top">
<gsf:link type="document">
<gsf:icon type="document"/>
</gsf:link>
</td>

<td valign="top">
<gsf:switch>
<gsf:metadata name="FileFormat"/>
<gsf:when test="equals" test-value="PDF">
<a><xsl:attribute name="href"><gsf:metadata name="httpPath" type="collection"/>/index/assoc/<gsf:metadata name="archivedir"/>/<gsf:metadata name="srclinkFile"/>#search=&quot;<gsf:cgi-param name="query"/>&quot;</xsl:attribute>
<gsf:choose-metadata>
<gsf:metadata name="thumbicon"/>
<gsf:metadata name="srcicon"/>
</gsf:choose-metadata>
</a>
</gsf:when>
<gsf:otherwise>
<gsf:link type="source">
<gsf:choose-metadata>
<gsf:metadata name="thumbicon"/>
<gsf:metadata name="srcicon"/>
</gsf:choose-metadata>
</gsf:link>
</gsf:otherwise>
</gsf:switch>
</td>

<td valign="top">
... When the PDF icons are clicked in the search results, Acrobat will open the file with the search window open with the query terms highlighted. <Text id="ew-a">Enhanced Word document handling</Text> The standard way Greenstone processes Word documents is to convert them to HTML format using a third-party program, wvWare. This sometimes doesn't do a very good job of conversion. If you are using Windows, and have Microsoft Word installed, you can take advantage of Windows native scripting to do a better job of conversion. If the original document was hierarchically structured using Word styles, these can be used to structure the resulting HTML. Word document properties can also be extracted as metadata. In your digital library, preview the reports collection. Look at the HTML versions of the Word documents and notice how they have no structure-they have been converted to flat documents. Using Windows native scripting In the Librarian Interface, open up the reports collection. Switch to the panel and select the section on the left-hand side. Double click the plugin and switch on the option. In the section, check the checkbox, if not already the case, to build the indexes on section level as well as document level. Build the collection. You will notice that the Microsoft Word program is started up for each Word document—the document is saved as HTML from Word itself, to get a better conversion. Preview the collection. In the list, notice that word03.doc and word06.doc now have a book icon, rather than a page icon. These now appear with hierarchical structure. The default behaviour for with is to section the document based on , , styles. If you open up the word03.doc or word06.doc documents in Word, you will see that the sections use these Heading styles. Note, to view style information in Word 2003, you can select Format → Styles and Formatting from the menu, and a side bar will appear on the right hand side. 
(In Word 2007 and later, find the Change Styles button on the far right of the menu ribbon. Click on the tiny Expand icon to its bottom right to display the styles side bar.) Click on a section heading and the formatting information will be displayed in this side bar. Some of the documents do not use styles (e.g. word01.doc) and no structure can be extracted from them. Some documents use user-defined styles. can be configured to use these styles instead of , etc. Next we will configure to use the styles found in word05.doc. Modes in the Librarian Interface The Librarian Interface operates in three modes. Go to → → and see the modes and what functionality they provide access to. is the default mode. Check that this is indeed the currently active mode. Defining styles Open up word05.doc in Word (by double-clicking on it in the pane), and examine the title and section heading styles. You will see that various user-defined header styles are set such as: : Title of the manual : Level 1 section heading : Level 2 section heading : Level 3 section heading : Appendix section title In the section of the panel, select and click . Four types of header can be set which are: level1_header (level1Header1|level1Header2|...) level2_header (level2Header1|level2Header2|...) level3_header (level3Header1|level3Header2|...) title_header (titleHeader1|titleHeader2|...) These header options define which styles should be considered as title, level 1, level 2 and level 3 styles. Ensure that the option is checked, and set the 4 header options to the values highlighted in the following (spaces in the Word styles are removed when converting to HTML styles, and these options must match the HTML styles): level1_header: (ChapterTitle|AppendixTitle)
level2_header: SectionHeading
level3_header: SubsectionHeading
title_header: ManualTitle Once these are set, click . Close any documents that are still open in Word, as open documents can prevent the build process from completing correctly. Build the collection and preview it. Look in particular at word05.doc. You will see that this document is now also hierarchically structured. If you have documents with different formatting styles, you can use (...|...) to specify all of the different styles. Removing pre-defined table of contents If you look at the HTML versions of word05.doc and word06.doc, you will see that each now has two tables of contents. One is generated by Greenstone based on the document's styles; the other was already defined in the Word document. can be configured to remove predefined tables of contents and tables of figures. The tables must be defined with Word styles in order for this to work. To remove the tables of contents and figures from word06.doc and the table of contents from word05.doc, switch on the option in . Set the option to (MsoToc1|MsoToc2|MsoToc3|MsoTof|TOA). In these documents, the table of contents and list of figures use these five style names. Click . Build and preview the collection. Both word05.doc and word06.doc should now have only one table of contents. Extracting document properties as metadata When the option is set, Word document properties can be extracted as metadata. By default, only the Title will be extracted. Other properties can be extracted using the option. In the panel, look at the metadata that has been extracted for word05.doc and word06.doc. Now open the documents in Word and look at what properties have been set (File → Properties in Word 2003; in Word 2007 and later, click the Word icon on the top left, then choose Prepare → Properties). They have Title, Author, Subject, and Keywords properties. can be configured to look for these properties and extract them. In the panel, under , configure once again. Switch on the configuration option .
Set the value to the following (but make sure not to enter any trailing spaces) Title,Author<Creator>,Subject,Keywords<Subject> This will make try to extract Title, Author, Subject and Keywords metadata. Title and Subject will be saved with the same name, while Author will be saved as Creator metadata, and Keywords as Subject metadata. Make sure you have closed all the documents that were opened, then rebuild the collection. Look at the metadata for the two documents again in the panel. You should now see ex.Creator and ex.Subject metadata items. This metadata can now be used in display or browsing classifiers etc. <Text id="assoc-files-0">Associated files: combining different versions of the same document together</Text> This tutorial demonstrates how to combine different versions of the same document together in Greenstone. As an example, two identical articles about Greenstone are used, one is in PDF format, the other in Word. The key to how this collection is set up is that the Word and PDF versions of the document deliberately have the same filename—only the file extension is different. This is something that is quite simple to achieve in practice, as it reflects common practice when a document is published in PDF form. This convention is then exploited by the associate_ext plugin option at build-time in Greenstone, an option that allows variants of a document to be grouped together and treated by Greenstone as a single document, based on similarity of filename. In the example collection of this tutorial, we set this option in the WordPlugin to be pdf. The result of this setting is that it makes the Word version of the document the dominant form in the collection that is built—the text that Greenstone extracts for indexing purposes comes from the Word document—and any PDF version of the document with the same filename is bound to it as an associated file. Start a new collection called Associated Files Example, by selecting File → New. 
Enter an appropriate description for your collection. Copy the files pdf03.pdf and word03.doc provided in sample_files → Word_and_PDF → Documents into your new collection. Do this by dragging these files across from the filesystem view on the left of the panel into the collection view on the right. In the collection view, rename the 2 files you just copied to greenstone1.pdf and greenstone1.doc, respectively. This sets the input documents up to be in line with the objective of this tutorial: to work with documents of different formats that are named similarly and have identical contents. Go to the panel. In , delete the index for ex.Source, and in , delete the Browsing Classifier for ex.Source too, since we will not be making use of them. In , select the and press the button. In the resulting popup, scroll down to find the associate_ext option, and set this option to . Note 1: as this is an option that is categorized under the heading, it is therefore an option that is available across all the plugins provided by Greenstone. In our example, we happen to be binding a PDF document to a Word document, however it could equally be used to bind MP3 versions of files to PNG artwork of album covers. Note 2: More than one filename extension can be provided as part of this option, separated by a comma. For example, setting the value of the associate_ext in to would allow both an AVI video file (say an oral history interview) and a PNG image (say a picture of the interviewee taken at the time of the recording) to bind to a text version of the document (say representing a transcript of the interview). Both AVI and PNG versions of the file can be present at the same time, or alternatively only one of the two file types need be present, or neither, and Greenstone will process the situation accordingly. Note 3: The option associate_ext is in fact a simplified version of a more general option associate_tail_re. 
Using regular expression syntax, the latter provides a more powerful way of manipulating filenames. Rather than focus on just the filename extension, with associate_tail_re, one is able to group files together that share a similar filename root, but might start to differ in characters before the filename extension. For instance, the Word version of the document might be my-article.doc but the PDF version might be my-article-ver13.pdf reflecting the fact that the PDF file is saved in version 1.3 of this format. Using associate_tail_re (and a little bit of regular expression know-how!), such differences can be surmounted, and the two files still processed automatically as different versions of the same document. If you're working with structured Word documents that contain formatted headings and you want better structured and formatted HTML versions of the documents to be generated by Greenstone from the Word format, optionally set the windows_scripting option for the if building on Windows. Alternatively, you can turn on the open_office_scripting option if this extension has been added to your Greenstone installation and if either OpenOffice or LibreOffice is available on your system. If you're using windows scripting, optionally set the to heading\s*1, or whatever is appropriate for your documents if they use style information for headings that deviate from the norm for Word. Repeat as is needed for and so forth. For more details on how to control sections within a Word document, see the tutorial. In GLI, or otherwise, assign appropriate dc.Title and dc.Creator metadata to both your documents. Since the contents are identical, you can select the 2 documents in the panel, then set dc.Title and dc.Creator simultaneously for both. 
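To make the grouping idea behind associate_ext and associate_tail_re more concrete, here is a minimal Python sketch. It is not Greenstone's implementation, and the exact matching semantics of associate_tail_re are an assumption here; the function names are hypothetical. The first function groups files that share a filename root under the primary (Word) document, and the second shows how a regular expression can strip a differing tail such as -ver13.pdf so that the roots match.

```python
import os
import re
from collections import defaultdict


def group_by_associate_ext(filenames, primary_ext, associate_exts):
    """Map each primary document to the associated files sharing its root.

    associate_exts is a comma-separated string, e.g. "pdf" or "avi,png".
    """
    wanted = {ext.strip().lower() for ext in associate_exts.split(",")}
    by_root = defaultdict(dict)
    for name in filenames:
        root, ext = os.path.splitext(name)
        by_root[root][ext.lstrip(".").lower()] = name

    groups = {}
    for versions in by_root.values():
        if primary_ext in versions:
            groups[versions[primary_ext]] = sorted(
                versions[ext] for ext in wanted if ext in versions
            )
    return groups


def root_after_tail(filename, tail_re):
    """Strip a trailing pattern, in the spirit of associate_tail_re."""
    return re.sub(tail_re, "", filename)


files = ["greenstone1.doc", "greenstone1.pdf", "other.pdf"]
groups = group_by_associate_ext(files, "doc", "pdf")
print(groups)

# Both names reduce to the same root, so they could be treated as one document.
print(root_after_tail("my-article.doc", r"\.doc$"))
print(root_after_tail("my-article-ver13.pdf", r"(-ver\d+)?\.pdf$"))
```

Note that other.pdf is left ungrouped because no Word document shares its filename root, which matches the behaviour described above: the primary format drives the grouping.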
If you build the collection at this point, Greenstone captures the relationship between the different file versions of each document internally; however, until we adjust the format statements, none of this is visible to the end user. The collection built with default settings lets a user search the text from the Word document, browse by title metadata and so on, but when it comes to viewing a document the only choices are the Word version or the HTML version that Greenstone generates automatically from the Word document. To go beyond this, the key change is to alter the part of the default VList format statement that chooses between thumbicon and srcicon, and replace it with a reference to equivDocIcon instead. That is, change: [ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink] to: [ex.equivDocLink][ex.equivDocIcon][ex./equivDocLink] Two things occur in this replacement. The main difference is the switch away from using and , which provide the link to the primary source document (the Word document), to a hyperlink around an icon for the document that Greenstone has associated as an equivalent document (the PDF version). The icon Greenstone chooses to show is based on the filename extension of the matching file it has found, in this case pdf. The second (more minor) change in this edit is to simplify the statement a bit. The original uses an {Or} statement to show a thumbnail version of the document, if Greenstone has one, in preference to the source icon. Since no thumbnails are generated in this collection, the statement has been simplified by eliminating the {Or} combination and going straight to the metadata item. To make the change, switch to the panel and edit the format statement for VList (All). Change: [link][icon][/link]
[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]
[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],
<i>([ex.Source])</i>} To: [link][icon][/link]
[ex.equivDocLink][ex.equivDocIcon][ex./equivDocLink]
[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]{If}{[dc.Creator],: [sibling(All'\, '):dc.Creator]}
The second (more minor) change in this edit is to simplify the statement a bit. The original uses a <gsf:choose-metadata/> statement to show a thumbnail version of the document, if Greenstone has one, in preference over the source icon. Since in this collection we have no thumbnails generated, it has been simplified by eliminating the <gsf:choose-metadata/> combination and going straight to the metadata item. To make the change then, switch to the panel and edit the template of the format statement

Change:
<td valign="top">
<gsf:link type="source">
<gsf:choose-metadata>
<gsf:metadata name="thumbicon"/>
<gsf:metadata name="srcicon"/>
</gsf:choose-metadata>
</gsf:link>
</td>
<td valign="top">
<gsf:link type="document">
<xsl:call-template name="choose-title"/>
</gsf:link>
<gsf:switch>
<gsf:metadata name="Source"/>
<gsf:when test="exists">
<br/><i>(<gsf:metadata name="Source"/>)</i>
</gsf:when>
</gsf:switch>
</td>

To:
<td valign="top">
<gsf:metadata name="equivDocLink"/>
<gsf:metadata name="equivDocIcon"/>
<gsf:metadata name="/equivDocLink"/>
</td>
<td valign="top">
<gsf:link type="document">
<xsl:call-template name="choose-title"/>
</gsf:link>
<gsf:switch>
<gsf:metadata name="dc.Creator"/>
<gsf:when test="exists">
<br/><gsf:metadata name="dc.Creator" separator=", "/>
</gsf:when>
</gsf:switch>
</td>

Note: When Greenstone encounters a file that matches the provided associate_ext value (pdf in our case), it sets the metadata value for that document to be the macro _iconXXX_, where XXX is whatever the filename extension is (so in our case). As long as there is an existing macro defined for that combination of the word icon and the filename extension, then a suitable icon will be displayed when the document appears in a VList. For pdf the displayed icon will be

. <Text id="0403">Exporting a collection to CD-ROM/DVD</Text> Greenstone collections can be published on a self-installing CD-ROM/DVD that works on Windows. Launch the Greenstone Librarian Interface if it is not already running. Choose → . In the resulting popup window, select the collection or collections that you wish to export by ticking their check boxes. You can optionally enter a name for the CD-ROM: this is the name that will appear in the menu when the CD-ROM is run. If a name is not entered, the default will be used. You can also specify whether the resulting CD-ROM will install files onto the host machine when used or not. Click to start the export process. The necessary files for export are written to: Greenstone → tmp → exported_xxx where xxx will be similar to the name you have entered. If you didn't specify a name for the CD-ROM, then the folder name will be exported_collections. You need to use your own computer's software to write these on to CD-ROM. On Windows XP this ability is built into the operating system: assuming you have a CD-ROM or DVD writer insert a blank disk into the drive and drag the contents of exported_xxx into the folder that represents the disk. The result will be a self-installing Windows Greenstone CD-ROM or DVD, which starts the installation process as soon as it is placed in the drive. <Text id="0387">A large collection of HTML files—Tudor</Text> You will need the files in the sample_files → tudor folder. Invoke the Greenstone Librarian Interface (from the Windows Start menu) and start a new collection called tudor (use the menu), based on the default . In the panel, open the tudor folder in sample_files. Drag englishhistory.net from the left-hand side to the right to include it in your tudor collection. (This material is from Marilee Hanson's Tudor England Collection at http://englishhistory.net/tudor.html, distributed with her permission.) Switch to the panel and click . When building has finished, preview the collection. 
Extracting more metadata from the HTML The browsing facilities in this collection ( and )( and ) are based entirely on extracted metadata. Switch to the panel in the Librarian Interface and examine the metadata that has been extracted for some of the files. Many HTML documents contain metadata in <meta> tags in the <head> of the page. Open up the englishhistory.net → tudor → monarchs → boleyn.html file by navigating to it in the tree on the left hand side, and double clicking it. This will open it in a web browser. View the HTML source of the page (View → Source in Internet Explorer, Tools → Web Developer → Page Source in Mozilla). You will notice that this page has and metadata. By default, only looks for Title metadata. Configure the plugin so that it looks for the other metadata too. Switch to the panel and select the section. Select the line and click . A popup window appears. Switch on the option, and set the value to Title,Author,Page_topic,Content Click . Switch to the panel and rebuild the collection. Go back to the panel and look at the extracted metadata for some of the HTML files in englishhistory.net → tudor → monarchs. The new metadata should now be visible. Looking at different views of the files in the and panels Switch to the panel and in the right-hand side open englishhistory.net → tudor. Change the menu for the right-hand side from to . Notice the files displayed above are filtered accordingly, to show only files of this type. Change the menu to . Again, the files shown above alter. Now return the setting back to , otherwise you may get confused later. Remember, if the or panels do not seem to be showing all your files, this could be the problem. <Text id="0434">Enhanced collection of HTML files—Tudor</Text> We return to the Tudor collection and add metadata that expresses a subject hierarchy. Then we build a classifier that exploits it by allowing readers to browse the documents about Monarchs, Relatives, Citizens, and Others separately. 
Adding hierarchically-structured metadata and a classifier Open up your tudor collection (the original version, not the webtudor version), switch to the panel and select the citizens folder (a subfolder of englishhistory.net → tudor). Set its metadata to Tudor period|Citizens. The vertical bar ("|") is a hierarchy marker. Selecting a folder and adding metadata has the effect of setting this metadata value for all files contained in this folder, its subfolders, and so on. A popup alerts you to this fact. Click to close the popup. Repeat for the monarchs and relative folders, setting their metadata to Tudor period|Monarchs and Tudor period|Relatives respectively. Note that the hierarchy appears in the area. If you don't want to see the popup each time you add folder level metadata, tick the checkbox; it won't be displayed again. Finally, select all remaining files—the ones that are not in the citizens, monarchs, or relative folders—by selecting the first and shift-clicking the last. Set their metadata to Tudor period|Others: this is done in a single operation (there is a short delay before it completes). When multiple files are selected in the left hand collection tree, all metadata values for all files are shown on the right hand side. Items that are common to all files are displayed in black—e.g. —while others that pertain to only one or some of the files are displayed in grey—e.g. any extracted metadata. Metadata inherited from a parent folder is indicated by a folder icon to the left of the metadata name. Select one of the files in the relative folder to see this. Switch to the panel and select from the left-hand list. Set the menu item for to ; then click . A window pops up to control the classifier's options. Change the to and then click . For tidiness' sake, remove the classifier for Source metadata (included by default) from the list of currently assigned classifiers, because this adds little to the collection. 
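The effect of the "|" hierarchy marker can be sketched in a few lines of Python. This is only an illustration of the idea, not how Greenstone's classifier is implemented; the function name and the data structure are hypothetical, and the two filenames are sample documents from this collection.

```python
def build_hierarchy(assignments):
    """Turn {filename: "A|B"} metadata values into a nested bookshelf tree."""
    root = {"children": {}, "documents": []}
    for filename, value in sorted(assignments.items()):
        node = root
        # Each "|"-separated component becomes one level of the hierarchy.
        for part in value.split("|"):
            node = node["children"].setdefault(
                part, {"children": {}, "documents": []}
            )
        node["documents"].append(filename)
    return root


tree = build_hierarchy(
    {
        "boleyn.html": "Tudor period|Monarchs",
        "quizstuff.html": "Tudor period|Others",
    }
)
shelves = tree["children"]["Tudor period"]["children"]
print(sorted(shelves))
print(shelves["Monarchs"]["documents"])
```

Every value shares the top-level component Tudor period, so all documents end up under one top-level bookshelf, with one sub-bookshelf per second component.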
Now switch to the panel, build the collection, and preview it. Choose the new link that appears in the navigation bar, and click the bookshelves to navigate around the four-entry hierarchy that you have created. Adding a hierarchical phrase browser (PHIND) Next we'll add an interactive hierarchical phrase browsing classifier to this collection. Switch to the panel and choose the item from the left-hand list. Choose from the menu. Click . A window pops up asking for configuration options: leave the values at their preset defaults (this will base the phrase index on the full text) and click . Build the collection again, preview it, and try out the new option in the navigation bar. An interesting PHIND search term for this collection is . Note that even though it is called a phrase browser, only single terms can be used as the starting point for browsing. Partitioning the full-text index based on metadata values Next we partition the full-text index into four separate pieces. To do this we first define four subcollections obtained by "filtering" the documents according to a criterion based on their metadata. Then an index is assigned to each subcollection. This will enable users to restrict a search to a subset of the documents. Switch to the panel, and click . Ensure that the tab is selected (the default). Define a subcollection filter named monarchs that matches against , and type Monarchs as the regular expression to match. Click . This filter includes any file whose metadata contains the word Monarchs. Define another filter, relatives, which matches against the word Relatives. Define a third and a fourth, citizens and others, which match against the words Citizens and Others respectively. Having defined the subcollection filters, we partition the index into corresponding parts. Click the tab. Select the citizens subcollection and click . Next select monarchs, and click .
Repeat for the other two subcollections, so that you end up with four partitions, one based on each subcollection filter. The order in which they appear in the list is the order in which they will appear in the drop down menu on the search page. You can change the order by using the and buttons. Build and preview the collection. The search page includes a pulldown menu that allows you to select one of these partitions for searching. For example, try searching the relatives partition for mary and then search the monarchs partition for the same thing. To allow users to search the collection as a whole as well as each subcollection individually, return to the section of the panel and select the tab. Select all four subcollections, either by checking their boxes or by pressing the button, and click . To ensure that the combined index appears first in the list on the reader's web page, use the button to get it to the top of the list here in the panel. Then build and preview the collection. Search for a common term (like the) in all five index partitions, and check that the word counts (not the document counts) for the four subcollection partitions add up to the count for the combined partition. The text in the drop down box on the search page is based on the filters each partition was built on. To change the text that is displayed, go to the section of the panel. The single filter partitions have sensible default text, but the combined partition does not. Set the for the combined partition to "all". Preview the collection. Controlling the building process Finally we look at how the building process can be controlled. Developing a new collection usually involves numerous cycles of building, previewing, adjusting some enrich and design features, and so on. While prototyping, it is best to temporarily reduce the number of documents in the collection. This can be accomplished through the parameter to the building process. Switch to the panel and view the options that are displayed in the top portion of the screen. Select and set its numeric counter to . Now build.
Preview the newly rebuilt collection's page. Previously this listed more than a dozen pages per letter of the alphabet, but now there are just three: the first three files encountered by the building process. Go back to the panel and turn off the option. Rebuild the collection so that all the documents are included. <Text id="0465">Formatting the HTML collection—Tudor</Text> Open up your tudor collection, go to the panel (by clicking on its tab) and select from the left-hand list. Leave the editing controls at their default value, so that displays and is selected as the . Select the format feature and inspect its long format statement. The text in the box reads as follows: <td valign=top>[link][icon][/link]</td>
<td valign=top>[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]} [ex./srclink]</td>
<td valign=top>[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td> This displays something that looks like this:

A discussion of question five from Tudor Quiz: Henry VIII
(quizstuff.html)

for a particular document whose Title metadata is and whose Source metadata is . This format appears in the search results list, in the list, and also when you get down to individual documents in the hierarchy. This is Greenstone's default format statement, used in the and format features. Greenstone's default format statement is complex because it is designed to produce something reasonable under almost any conditions, and also because for practical reasons it needs to be backwards compatible with legacy collections. Delete the contents of the box and replace it with this simpler version: <td>[link][icon][/link]</td>
<td>[ex.Title]<br>
<i>([ex.Source])</i>
</td> Replace the template of the format feature with: <gsf:template match="documentNode">
<td valign="top">
<gsf:link type="document">
<gsf:icon type="document"/>
</gsf:link>
</td>
<td valign="top">
<gsf:metadata name="Title"/>
<br/>
<i>(<gsf:metadata name="Source"/>)</i>
</td>
</gsf:template> Replace the format feature with the above format statement too. Preview the result (you don't need to build the collection, because changes to format statements take effect immediately). Look at some search results and at the list. They are just the same as before! Under most circumstances this far simpler format statement is entirely equivalent to Greenstone's more complex default. We can also reduce the template of the format feature further, also without changing the display. Replace it with: <gsf:template match="classifierNode[@classifierStyle = 'VList']">
<td valign="top">
<gsf:link type="classifier">
<gsf:icon type="classifier"/>
</gsf:link>
</td>
<td valign="top">
<gsf:metadata name="Title"/>
<br/>
<i>(<gsf:metadata name="Source"/>)</i>
</td>
</gsf:template> But there's a problem. Beside the bookshelves in the browser, beneath the subject appears a mysterious "()". What is printed for these bookshelves is governed by the same format statement. Bookshelf nodes of the hierarchy do have associated Title metadata (their title is the name of the metadata value associated with that bookshelf), but they do not have metadata, so it comes out blank. In the section of the panel, the menu (just above menu) displays . That implies that the same format is used for the search results, titles, and all nodes in the subject hierarchy, including internal nodes (that is, bookshelves). The menu can be used to restrict a format statement to a specific one of these lists. Since we edited the format feature, the same format statements are used for the titles and for all nodes in the subject hierarchy, including internal nodes (that is, bookshelves). The menu can be used to restrict a format statement to a specific classifier and its nodes. We will override this format statement for the hierarchical subject classifier. In the menu, scroll down to the item that says CL2: Hierarchy -metadata and select it. This is the format statement that affects the second classifier (i.e., "CL2"), which is a classifier based on metadata. Click to add this format statement to the collection. Edit the box below to contain the following format statements for the and templates. (The changes to the template are mostly the same as before, but without reference to the Source document. This time, the template is simplified further still.) <td>[link][icon][/link]</td>
<td>[ex.Title]</td> <gsf:template match="documentNode">
<td valign="top">
<gsf:link type="document">
<gsf:icon type="document"/>
</gsf:link>
</td>
<td valign="top">
<gsf:metadata name="Title"/>
</td>
</gsf:template>

<gsf:template match="classifierNode[@classifierStyle = 'VList']">
<td valign="top">
<gsf:link type="classifier">
<gsf:icon type="classifier"/>
</gsf:link>
</td>
<td valign="top">
<gsf:metadata name="Title"/>
</td>
</gsf:template>
Preview the list in the collection. First, the offending "()" has disappeared from the bookshelves. Second, when you get down to a list of documents in the subject hierarchy, the filename does not appear beside the title, because is not specified in the format statement and this format statement applies to all nodes in the subject classifier. Note that the search results and titles lists have not changed: they still display the filename underneath the title.
Let's change the search results format so that metadata is displayed here instead of the filename. In the menu (under on the panel), scroll down to the item and select it. Click to add this format statement to the collection. Change the box below to read
<td>[link][icon][/link]</td>
<td>[ex.Title]<br>
[dc.Subject]
</td>
Select the format feature once more for some further editing. Replace the line:
<i>(<gsf:metadata name="Source"/>)</i>
with
<gsf:metadata name="dc.Subject"/>
To insert the , position the cursor at the appropriate point and either type it in, or select it from the drop down menu. This menu shows many of the things that you can put in square brackets in the format statement. Preview the collection. Documents in the search results list will be displayed like this:

A discussion of question five from Tudor Quiz: Henry VIII
Tudor period|Others

(The vertical bar appears because this metadata is hierarchical metadata. Unfortunately there is no way to get at individual components of the hierarchy. For most metadata, such as title and author, this isn't a problem.)
Finally, let's return to the hierarchy and learn how to do different things to the bookshelves and to the documents themselves. In the menu, re-select the item CL2: Hierarchy -metadata (that is, reselect the format feature for CL2: Hierarchy -metadata). Edit the box below to read
<td>[link][icon][/link]</td>
<td>{If}{[numleafdocs],<b>Bookshelf title:</b> [ex.Title],
         <b>Title:</b> [ex.Title]}
</td>
Again, you can insert the items in square brackets by selecting them from the drop down box. The {If} statement tests the value of the variable [numleafdocs]. This variable is only set for internal nodes of the hierarchy, i.e. bookshelves, and gives the number of documents below that node. If it is set we take the first branch, otherwise we take the second. Commas are used to separate the branches. The curly brackets serve to indicate that the If is special; otherwise the word itself would be output.
For the documentNode template, adjust its display of the title to:
<td valign="top">
<b>Title:</b> <gsf:metadata name="Title"/>
</td>
Next adjust the title display of the classifierNode template too:
<td valign="top">
<b>Bookshelf title:</b> <gsf:metadata name="Title"/>
</td>
Preview the collection and examine the subject hierarchy again to see the effect of your changes. Bookshelves should say "Bookshelf title:" and then the title, while documents will display "Title:" and the title. Note that the number of documents in the bookshelf is not displayed: we are using [numleafdocs] to test what kind of item in the list we are at, but we are not displaying it.
<Text id="st-1">Section tagging for HTML documents</Text>
In a browser, take a look at the Greenstone demo collection. Browse to one of the documents. This collection is based on HTML files, but they appear structured in the collection. This is because these HTML files were tagged by hand into sections.
Using a text editor (e.g. WordPad), open up one of the HTML files from the demo collection: Greenstone → collect → demo → import → fb33fe → fb33fe.htm (or, in a Greenstone3 installation, Greenstone3 → web → sites → localsite → collect → lucene-jdbm-demo → import → fb33fe → fb33fe.htm). You will see some HTML comments which contain section information for Greenstone. They look like: 

When Greenstone encounters a <Section> tag in one of these comments, it will start a new subsection of the document. This will be closed when a </Section> tag is encountered. Metadata can also be added for each section; in this case, metadata has been added for each section.
In the browser, find the document in the demo collection (through the browser). Look at its table of contents and compare it to the <Section> tags in the HTML document.
Add a new Section into this document. For example, let's add a new subsection into the chapter. In the text editor, add the following just after the Section tag for the section:  Then just before the next section tag (), add the following:  The effect of these changes is to make a new subsection inside the chapter.
Open the Greenstone demo collection in the Librarian Interface. In the section of the panel, note that has the option set. This option is needed when <Section> tags are used in the source documents. Build and preview the collection. Look at the document again and check that your new section has been added.
<Text id="0411">Downloading files from the web</Text>
The Greenstone Librarian Interface's Download panel allows you to download individual files, parts of websites, and indeed whole websites, from the web.
Start a new collection called webtudor, and base it on . In a web browser, visit http://englishhistory.net and follow the link to Tudor England. You should be at the URL http://englishhistory.net/tudor.html
This is where we started the downloading process to obtain the files you have been using for the tudor collection. You could do the same thing by copying this URL from the web browser, pasting it into the panel, and clicking the button. However, several megabytes will be downloaded, which might strain your network resources, or your patience! For a faster exercise we focus on a smaller section of the site.
Go to the panel by clicking its tab. There are five download types listed on the left hand side. 
For this exercise, we only use the type. Make sure this is selected in the list. Enter this URL http://englishhistory.net/tudor/citizens/ into the box. There are several other options that govern how the download process proceeds. To see a description of an option, hover the mouse over it and a tooltip will appear. To copy just the citizens section of the website, switch on the option by checking its box and set the option to 1. If you don't do this (or if you miss out the terminating "/" in the URL), the downloading process will follow links to other areas of the englishhistory.net website and grab those as well. Also switch on the option to avoid downloading any items on the site pages that actually emanate from outside it (like google ads). If your computer is behind a firewall or proxy server, you will need to edit the proxy settings in the Librarian Interface. Click the button. Switch on the checkbox. Enter the proxy server address and port number in the and boxes. Click . Now click . If you have set proxy information in , a popup will ask for your user name and password. If you're on Windows Vista or later, Windows may show a popup message asking whether you wish to block or unblock the download. In such a case, choose to unblock. Once the download has started, a progress bar appears in the lower half of the panel that reports on how the downloading process is doing. More detailed information can be obtained by clicking . The process can be paused and restarted as needed, or stopped altogether by clicking . Downloading can be a lengthy process involving multiple sites, and so Greenstone allows additional downloads to be queued up. When new URLs are pasted into the box and clicked, a new progress bar is appended to those already present in the lower half of the panel. When the currently active download item completes, the next is started automatically. Downloaded files are stored in a top-level folder called that appears on the left-hand side of the panel. 
You may not need all the downloaded files, and you choose which you want by dragging selected files from this folder over into the collection area on the right-hand side, just like we have done before when selecting data from the sample_files folder. In this example we will include everything that has been downloaded. Select the englishhistory.net folder within and drag it across into the collection area. Switch to the panel to build and preview the collection. It is smaller than the previous collection because we included only the citizens files. However, these now represent the latest versions of the documents. <Text id="0423">Pointing to documents on the web</Text> Open up your tudor collection, and in the panel inspect the files you dragged into it. The first folder is englishhistory.net, which opens up to reveal tudor, and so on. The files represent a complete sweep of the pages (and supporting images) that constitute the Tudor citizens section of the englishhistory.net web site. They were downloaded from the web in a way that preserved the structure of the original site. This allows any page's original URL to be reconstructed from the folder hierarchy. In the panel, select the section, then select the line and click . A popup window appears. Locate the option (about halfway down the first block of items) and switch it on. Click . Setting this option to the means that Greenstone sets an additional piece of metadata for each document called , which gives its original URL. It is important that the files gathered in the collection start with the web domain name (englishhistory.net in this case). The conversion process will not work if you dragged over a subfolder, for example the tudor folder, because this will set metadata to something like http://tudor/citizens/... rather than http://englishhistory.net/tudor/citizens/... If you had copied over a subfolder previously, delete it and make a fresh copy. 
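The point about the top-level folder can be illustrated with a small Python sketch. It is not Greenstone's own conversion code, only an illustration of the idea described above: the first folder in the path is treated as the web domain, so a path that starts at a subfolder yields a wrong URL. The function name `url_from_download_path` and the file name `index.html` are hypothetical; the folder names are the ones used in this exercise.

```python
def url_from_download_path(relative_path: str, scheme: str = "http") -> str:
    """Rebuild a URL, treating the first folder name as the web domain."""
    # Accept Windows or POSIX separators and ignore empty components.
    parts = [p for p in relative_path.replace("\\", "/").split("/") if p]
    if not parts:
        raise ValueError("path has no components")
    return scheme + "://" + "/".join(parts)


# Hypothetical file name; the folder names come from the exercise.
good = url_from_download_path("englishhistory.net/tudor/citizens/index.html")
bad = url_from_download_path("tudor/citizens/index.html")
print(good)  # path starts at the domain folder
print(bad)   # path starts at a subfolder, so "tudor" is mistaken for the domain
```

The second call shows why dragging only the tudor subfolder into the collection gives URLs of the form http://tudor/citizens/... instead of the real address.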
Drag the folder in the right-hand side of the panel on to the trash can in the lower right corner. Then obtain a fresh copy of the files by dragging across the englishhistory.net folder from the sample_files → tudor folder (or the folder if you have done exercise ) on the left-hand side. To make use of the new URL metadata, the icon link must be changed to serve up the original URL rather than the copy stored in the digital library. Go to the panel, select the section and edit the template of the format statement by replacing [link][icon][/link] with [weblink][webicon][/weblink] <gsf:link type="document">
<gsf:icon type="document"/>
</gsf:link> with <gsf:link type="web">
<gsf:icon type="web"/>
</gsf:link>
Switch to the panel and build and preview the collection. Note that the document icons have changed. The collection behaves exactly as before, except that when you click a document icon your web browser retrieves the original document from the web (assuming it is still there by the time you do this exercise!). If you are working offline you will be unable to retrieve the document. <Text id="0520">Bibliographic collection</Text> This exercise looks at using fielded searching in a collection. Fielded searching is best used for metadata rich collections. Here we use bibliographic data in MARC format. Start a new collection called Papers Bibliography which will contain a collection of example MARC records of the working papers published at the Computer Science Department, Waikato University. Enter the requested information and base it on . In the panel, open the sample_files → marc folder, drag CMSwp-all.marc into the right-hand pane and drop it there. A popup window asks whether you want to add to the collection to process this file. Click , because this plugin will be needed to process the MARC records. Now select within the panel and remove the default classifier for Source metadata. In the section, remove the index. In this collection all records are from the same file, so metadata, which is set to the filename, is not particularly interesting or useful. Switch to the panel, build the collection, and preview it. Browse through the and view a record or two. Try searching—for example, find items that include . Back in the Librarian Interface, go to the section of the panel. Select from the drop down menu, and click . In the popup window, select as the metadata item. Click . is like , except that terms that appear multiple times in the hierarchy are automatically grouped together and a new node, shown as a bookshelf icon, is formed. Build the collection and preview the result. Using fielded searching Now let's look at fielded searching. In the browser, go to the page. 
You will notice that there is a option which enables you to switch between "normal" and "fielded" search. Change to fielded search now, press the , and click on the button to go back to the Search page. The search form has changed to a fielded form.
Now let's look at fielded searching. In the browser, press the button below the usual search form. This will present a fielded search form.
You can specify which search form types are available for a particular collection, and which one is the default, using the format statement. In the panel select from the left-hand list. Select the format statement from the list of assigned formats, and set the contents to just . This will make only fielded searching available for this collection.
Search type options include and . You can specify one or both separated by a comma. If both are specified, the first one is used as the default: this is the one that the user will see when they first enter the collection.
Search type options include , (for fielded searching) and (for fielded searching with boolean operations). You can specify any combination of these, separated by commas. If the search type is specified, it will be available in the search area at the top of each page of the collection.
Preview the collection again. Notice that the collection's home page no longer includes a query box. (This is because the search form is too big to fit here nicely.) To search, you have to click in the navigation bar. Note that the page has changed so that the "normal" query style is no longer offered.
Look at the search form in the collection. There are two fields that can be searched: text and titles. Add some more fields to search on by going back to the Librarian Interface. In the panel, go to the section. Add a new index based on by clicking , selecting in the list of metadata, and clicking . Rebuild the collection and preview the results. Notice the extra field in the drop-down menus in the search form. 
You can do quite complicated queries by searching for words in different fields at the same time. To change the text that is displayed in the drop-down menus of the search form, go to the section of the panel. Here you can change the display text for the indexes. Exploding the database Go to the panel and try to see the metadata. It doesn't appear! This is because the metadata is associated with records inside the file, not the file itself. Metadata file types, such as MARC, CDS/ISIS, BibTex etc. can be imported into Greenstone but their metadata cannot be viewed in the Librarian Interface. To edit any metadata you need to go back to the program that created the file. Greenstone provides a way of exploding a metadata database so that each record appears as an individual document, with viewable and editable metadata. This process is irreversible: once this step has been done, the database is deleted and can no longer be used in its original program. In the panel, you may notice that the MARC database has a different coloured icon to other files. A metadata database that can be exploded will be displayed with this green icon. Right-click on the file and choose from the menu. A new window opens, containing options for the exploding process. A description of each option can be obtained by hovering the mouse over the option. If it's not already on, turn on the option by checking its box. This option indicates which metadata set to explode the metadata into. The default set is the "Exploded Metadata Set"—a metadata set which initially has no elements in it, but will receive a new element for each metadata field retrieved from the database. Click to start the exploding process. This may take a short while, depending on the size of the database. Once exploding has finished, the MARC database file will have been deleted, and three folders created in its place. These folders contain an empty file for each record in the original database. 
The metadata for these records can be viewed and edited by switching to the panel. Because the MARC file is no longer present, and the collection contains empty (.nul) files, we need to change the list of plugins. In the section of the panel, remove .
Rebuild and preview the collection. You will notice that the classifier is empty, searching no longer returns any results, and the document display is useless. Although the classifier was built on , it still displays the correct titles, but in the panel you can see the metadata are actually the filenames rather than titles of the MARC records. This is because the default format uses the metadata. In the section of the panel, select in the list of assigned format statements. The format statement looks like:
<td valign="top">[link][icon][/link]</td>
<td valign="top">[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>
<td valign="top">[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td> <gsf:template name="choose-title">
<gsf:choose-metadata>
<gsf:metadata name="dc.Title"/>
<gsf:metadata name="exp.Title"/>
<gsf:metadata name="ex.dc.Title"/>
<gsf:metadata name="Title"/>
<gsf:default>Untitled</gsf:default>
</gsf:choose-metadata>
</gsf:template>
The above template, defined in the format features, is included by the format statements. Since there is no dc.Title metadata and because exp.Title comes before ex.Title, the exploded titles will be displayed.
Reformatting the collection to use the exploded metadata
The collection previously used extracted (ex.) metadata, but now it uses exploded (exp.) metadata. The classifier and search indexes were built on ex metadata, which is why they no longer work properly. There is also no longer any text in the documents. Previously, stored the raw record as the "text" of each record. Now that the metadata is in the Librarian Interface, there is no longer the concept of a raw record, and so there is no text. We need to modify the collection design to take note of these changes.
In the section, change the Title index to use : select the Title index in the list and click . Deselect and in the list of metadata, and select . Click . Remove the index by selecting it in the list and clicking . Add an index on : click , select in the metadata list, and click . The text index is no longer any use, so remove that index too.
To enable combined searching across all indexes at once, click , tick the checkbox, and click . Move this to the top of the list using the button, so that it appears first in the drop down list. Click on the right so that it becomes the default field for searching.
To explicitly use the metadata, in the section, change the to use metadata. Double click the in the list, and change the option to use . Click . Do the same thing for the Subject , changing to . Rebuild and preview the collection. The classifiers should be back to normal and searching should now work.
In the section of the panel, select in the list of assigned format statements. Switch to the section of the panel to make the following adjustments.
There is no dc metadata for this collection, so replace {Or}{[dc.Title],[exp.Title],[ex.dc.Title],[ex.Title],Untitled} with {Or}{[exp.Title],[ex.Title],Untitled}. 
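The fallback behaviour of the title chain can be sketched in Python. This is only an illustration of the "first value that exists wins, otherwise a default" logic described above, not Greenstone code; the function `choose_metadata`, the sample record and its values (including the .nul file name) are hypothetical.

```python
def choose_metadata(record, names, default="Untitled"):
    """Return the first non-empty metadata value among names, else default."""
    for name in names:
        value = record.get(name)
        if value:
            return value
    return default


# Hypothetical exploded record: no dc.Title, an exploded title, and a
# plain Title that only holds the generated file name.
record = {"exp.Title": "A working paper", "Title": "00000001.nul"}

full_chain = ["dc.Title", "exp.Title", "ex.dc.Title", "Title"]
short_chain = ["exp.Title", "Title"]

title_full = choose_metadata(record, full_chain)
title_short = choose_metadata(record, short_chain)
title_empty = choose_metadata({}, short_chain)
print(title_full, title_short, title_empty)
```

For a record like this one, both chains select the exploded title, which is why dropping the dc entries from the chain does not change what is displayed.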
There is no dc (or ex.dc) metadata for this collection, so in the format feature's template, replace the following <gsf:choose-metadata>
<gsf:metadata name="dc.Title"/>
<gsf:metadata name="exp.Title"/>
<gsf:metadata name="ex.dc.Title"/>
<gsf:metadata name="Title"/>
<gsf:default>Untitled</gsf:default>
</gsf:choose-metadata> with <gsf:choose-metadata>
<gsf:metadata name="exp.Title"/>
<gsf:metadata name="Title"/>
<gsf:default>Untitled</gsf:default>
</gsf:choose-metadata> There are no source or thumb icons, so remove the second line: <td valign="top">[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td> There are no source or thumb icons, so in the and templates of the format feature, remove the occurrences of the following section: <td valign="top">
<gsf:link type="source">
<gsf:choose-metadata>
<gsf:metadata name="thumbicon"/>
<gsf:metadata name="srcicon"/>
</gsf:choose-metadata>
</gsf:link>
</td> The ex.Source metadata is set to the nul filename, so remove that from the display. Remove: {If}{[ex.Source],<br><i>([ex.Source])</i>} <gsf:switch>
<gsf:metadata name="Source"/>
<gsf:when test="exists">
<br/>
<i>(<gsf:metadata name="Source"/>)</i>
</gsf:when>
</gsf:switch>
The resulting format statement looks like: <td valign="top">[link][icon][/link]</td>
<td valign="top">[highlight]
{Or}{[exp.Title],[ex.Title],Untitled}
[/highlight]</td> Clear the format statement by selecting it in the list of assigned format statements and deleting the contents in the . The record Title will be displayed as part of the format, so we don't need it here. In the list, click on one of the documents. This will take you to the document's display page. Exploding the database has left this document display useless. Only the record Title (in this case, the generated filename) is displayed. We will make two changes to improve the document display. First, we will remove the record Title, since it is not useful in this instance. To remove the record Title, we need to override the default format statement with one that does not do anything. Go to the format feature in the section of the panel and add the following just after <gsf:option name="TOC" value="true"/>: <gsf:template name="documentHeading"/> Next, edit the format statement. Delete the contents and replace it with the following (which can be copied from sample_files → marc → format_tweaks → document_text.txt). <table>
<tr><td>Title:</td><td>[exp.Title]</td></tr>
<tr><td>Subject:</td><td>[exp.Subject]</td></tr>
<tr><td>Publisher:</td><td>[exp.Publisher]</td></tr>
</table> Next, override the default behaviour by creating a format statement for this. Still in the format features, after the , add the following format statement (which can be copied from sample_files → marc → format_tweaks → document_content.txt): <gsf:template name="documentContent">
<table>
<tr>
<td>Title:</td>
<td><gsf:metadata name="exp.Title"/></td>
</tr>
<tr>
<td>Subject:</td>
<td><gsf:metadata name="exp.Subject"/></td>
</tr>
<tr>
<td>Publisher:</td>
<td><gsf:metadata name="exp.Publisher"/></td>
</tr>
</table>
</gsf:template>
The and buttons are not very useful for this collection, so let's get rid of them. Edit the format statement to make it empty. Press the button to preview the collection and see how the document display has improved.
<Text id="is-1">CDS/ISIS collection</Text>
This exercise is similar to the exercise, except that a CDS/ISIS database is used instead of a MARC database, and we do not explode the database.
Start a new collection called ISIS Collection (base it on New Collection). Drag the files from sample_files → isis (excluding the format_tweaks folder and the README.txt file) into the collection. Build and preview the collection.
The default indexes, classifiers and formats are not very useful for this data. There is no metadata searching, and the classifier is completely empty. The filenames classifier is useless because all records come from the same file.
In the section of the panel, remove the useless Source and Title indexes, and add new indexes for , and metadata. In the section of the panel, you can set the display text for these indexes to "photographer", "country" and "notes". CDS/ISIS metadata has subfields, and these are represented using ^.
In the section of the panel, remove the existing (useless) classifiers for and , and add a new for . Rebuild and preview the collection.
In the section of the panel, change the format statement to display and metadata. Change it to look like:
<td valign=top>[link][icon][/link]</td>
<td valign=top><b>[ex.Photographer^all]</b><br/>[ex.Notes^all]</td> In the section of the panel, change the format statement to display and metadata. Change its template to look like: <gsf:template match="documentNode">
<td valign="top">
<gsf:link type="document">
<gsf:icon type="document"/>
</gsf:link>
</td>
<td valign="top">
<b><gsf:metadata name="Photographer^all"/></b>
<br/><gsf:metadata name="Notes^all"/>
</td>
</gsf:template> The above format can be copied from sample_files → isis → format_tweaks → browse_tweak.txt. If you want search results to be displayed in a similar manner, make the same changes to the template of the format features too. Make fielded searching the default by changing the format statement to (instead of ). stores a nicely formatted version of the record as the document text, and this is what is displayed when we view a record. Let's tidy it up a little more. Remove the and buttons by setting the format statement to empty. Remove the at the top of the document by setting the format statement to empty. Finally, let's link to the raw record, which is stored as metadata. Edit the format statement to look like the following. (This format can be copied from sample_files → isis → format_tweaks → document_text.txt.) Add the following to the format statement (which can be copied from sample_files → isis → format_tweaks → document_content.txt), to adjust the . This now makes use of a predefined Greenstone javascript function to toggle between displaying and hiding the raw record. <p>[Text]</p>
{If}{_cgiargshowrecord_,
<p><b>CDS Record:</b><br/><tt>[ISISRawRecord]</tt></p>
<center><a href=\'_gwcgi_?e=_cgiarge_&a=d&d=_cgiargd_\'>Hide CDS Record</a></center>,
<center><a href=\'_gwcgi_?e=_cgiarge_&a=d&d=_cgiargd_&showrecord=1\'>Show CDS Record</a></center>
} <gsf:template name="documentContent">
<p>
<xsl:call-template name="wrappedSectionText"/>
</p>

<a href="javascript:;" id="cdsreclink">Show/Hide CDS Record</a>
<div id="cdsrecord">
<b>CDS Record:</b>
<br/>
<tt>
<gsf:metadata name="ISISRawRecord"/>
</tt>
</div>

<script type="text/javascript">
<xsl:text disable-output-escaping="yes">
var link=document.getElementById("cdsreclink");
var div=document.getElementById("cdsrecord");
gs.functions.makeToggle(link, div);
</xsl:text>
</script>
</gsf:template>
Preview the collection. <Text id="mf-1">Customization: macro files and stylesheets</Text> The appearance of all pages produced by Greenstone is governed by macro files, which reside in the folder Greenstone → macros, and images and CSS stylesheets reside in Greenstone → web → style. A macro takes the form _macroname_ {macro value}. Macro names start and end with underscores (_), and the macro value is enclosed in curly brackets ({}). Macro values can be text or HTML, and can include other macros. Macros are grouped into packages, and different packages control the appearance of different pages. For example, the , , , , packages control the home, help, preferences, query, and document pages, respectively. Some macro files contain macros for just one package, for example, home.dm, query.dm, document.dm, while others contain macros for many packages. base.dm contains macros used globally, style.dm controls the common style of each page, english.dm, french.dm and other language files contain the text fragments for the entire interface, in that language. The output of the library program is a page of HTML which is viewed in a web browser. CSS (Cascading Style Sheets) are often used alongside HTML pages to control the formatting, such as layout, colour, font etc. The default Greenstone stylesheet is Greenstone → web → style → style.css. In this exercise, we customize the macros, images and stylesheets to change the appearance of our library. Collection specific customisation Macros can be used to customize single collections by adding them to a file called extra.dm in the macros directory of a collection. We use the Word and PDF collection (from exercise ) as the example for this exercise, but it can be done with any collection. Open up this collection (reports) in the Librarian Interface. Go to the panel, and select from the left hand list. This section allows you to edit the collection's extra.dm macro file. First, we change the title of the section of the about page. 
Add the following text in the edit box (which can be copied from the file about_tweak.txt in the sample_files → custom folder): package about

_textabout_ {
<div class="section">
<h3>Very Interesting Reports Collection.</h3>
_Global:collectionextra_
</div>
} Preview the collection by pressing the button. The About page will have a new title underneath the search form. Next we add a footer to each page. Add the _footer_ macro to the end of the edit box (which can be copied from the file footer_tweak.txt in the sample_files → custom folder): package Style

_footer_ {
_pagefooterextra_ <center><small>Copyright 2010 My Awesome Digital Library</small></center> _endspacer__htmlfooter_
}
The <center> and <small> HTML tags center the text, and make it a smaller size than the rest of the page. Preview the changes in a web browser. Each page should now have the new text at the bottom.
Putting text in the main _footer_ macro adds it to all pages of this collection. To add a footer just to a particular page, use _pagefooterextra_ in the appropriate package. For example, let's add some more text to the footer, this time just on the About page. Add the following text immediately after the line package about :
_pagefooterextra_ {Collection generated by Me.}
Preview the About page in a web browser. The About page should now display the new text, while the other pages won't.
Next we'll do some style customisations. Add the following text below the _footer_ macro (which can be copied from the file red_tweak.txt in the sample_files → custom folder)
_collectionspecificstyle_ {
<style type="text/css">
/*clear the use of a background image */
body.bgimage \{ background-image: none; \}
/* set the background color to pink */
body.bgimage \{ background: pink; \}
/* clear the background image for the navigation bar, and set its color to red */
div.navbar \{ background-image: none; background-color: red; \}
/* clear the background image for the divider bars, and set their color to red */
div.divbar \{ background-image: none; background-color: red; \}
</style>
}
Text enclosed in /*...*/ is a comment: it is there for the reader and is ignored when the style is applied. Preview the collection. The reports collection will now have a pink background, and the navigation bar and divider bars will be red. These changes will only affect this collection.
Any macros from the general macro files can be copied into a collection's extra.dm file and modified. Remember to include the package declaration to make sure that the macros get applied to the correct page(s).
The style modifications made above were minor. The collection still uses the majority of the standard style file. The style declarations in the _collectionspecificstyle_ macro get appended to the default ones. To completely change the appearance of a collection, we can use a new style sheet altogether. Add the following text (which can be copied from the file css_tweak.txt in the sample_files → custom folder) after the last modifications:
_cssheader_ {
<link rel="stylesheet" href="_httpcstyle_/style-blue.css" type="text/css"
  title="Blue Style" charset="UTF-8">
} Outside of the Librarian Interface, locate the collection folder Greenstone → collect → reports. Create a style folder inside this (if not already present), and copy the file sample_files → custom → style-blue.css into this folder. Preview the collection; the about page should look radically different. Changing the colour of the page title and page text In the previous exercises we changed a single collection. Now we change all the pages in our Greenstone installation by modifying style and macro files outside the Librarian Interface. First, we format the page so that some other parts are blue. Preview any collection after each change to make sure that it has worked properly. On Windows, macro file changes require a restart of the Greenstone local library server. Stylesheet changes may require a forced reload in the web browser. Note, use any collection except the reports collection to preview the following changes. Because the reports collection has been modified to use its own custom stylesheet, changes to the main stylesheet won't have any effect on it. The majority of the style definitions reside in an external style file, Greenstone → web → style → style.css, and most style changes involve modifying that file. Open the style.css file in a text editor, e.g. WordPad (and save a .backup copy). Make the following modifications. You might want to preview after each one to see the effect. Change some of the colours: Find the body style instructions: body {
background-color: #ffffff;
} Add color: teal; For a.collectiontitle, set color to blue. For p.collectiontitle, add color: blue; Preview the collection. Text in the page body is now a blue-green colour (teal), and the colour of the collection title has changed from black to blue. (If a collection title image is used, you won't see the change on the collection title.) Let's switch the positions of the HOME, HELP and PREFERENCES buttons and the collection name or image. For div.pageinfo, set both float and text-align to left. For div.collectimage, set float and text-align to right. The look of your library should now be substantially different. The HOME, HELP and PREFERENCES buttons are in the upper left corner, whereas the collection title has moved to the right of the page. Now we will customize the default Greenstone header image and the background image. Two new images for this exercise can be found in sample_files → custom. Copy newbgimg.gif and newheadimg.gif from the custom folder into the Greenstone → web → images folder. Open the file Greenstone → macros → home.dm in a text editor. Find each occurrence of gsdlhead.gif in this file (there are two) and replace it with newheadimg.gif. (If you are using WordPad, you can use Edit → Find to search for the text.) Save home.dm and close the file. Open the file Greenstone → macros → style.dm with the text editor. Locate the following part of the file (this is part of the _cssheader_ macro): <style type="text/css">
body.bgimage \{ background-image: url("_httpimg_/chalk.gif"); scroll repeat-y left top; \}
Use copy and paste on the body.bgimage line to make it look like this: <style type="text/css">
/*body.bgimage \{ background-image: url("_httpimg_/chalk.gif"); scroll repeat-y left top; \}*/
body.bgimage \{ background-image: url("_httpimg_/newbgimg.gif"); scroll repeat-y left top; \}
Here we are changing the background image for the bgimage section of the body of the page to newbgimg.gif. Save style.dm and close the file. Preview the home page in a web browser. (On Windows, restart the Greenstone library server.) The header of the home page, and the background of every page of each collection (except reports which uses a custom _cssheader_ macro) should now use the new graphics. Make your own Greenstone home page You can make radical changes to a page by changing the macro file completely. For example, here we use an alternative to the home page which we have prepared for you in advance and included in your Greenstone installation. Open the file Greenstone → etc → main.cfg in a text editor. Locate the list: # The list of display macro files used by this receptionist
macrofiles tip.dm style.dm base.dm query.dm help.dm pref.dm about.dm \
           document.dm browse.dm status.dm authen.dm users.dm html.dm \
           extlink.dm gsdl.dm extra.dm home.dm collect.dm docs.dm \
           bsummary.dm gti.dm gli.dm nav_css.dm usability.dm \
           ...
Change the text home.dm to yourhome.dm. Save and close the file. Preview the newly structured home page in a web browser. (On Windows, restart the Greenstone library server.) Look at the file macros/yourhome.dm in a text editor to see how these changes are expressed. Reverse this last change by changing yourhome.dm back to home.dm in the file Greenstone → etc → main.cfg. You may also like to reverse the other changes you have made. The final part of this exercise looks at how we determined which images needed replacing, and which macro files should be edited. How to determine which images to replace (advanced) In step 10 of this exercise we replaced the default background (chalk.gif) and header (gsdlhead.gif) images with new ones. To do this we needed to change the image names in the macro files. How did we know which images we were replacing and which macro files to edit? This exercise shows you how to find out. To find out the names of the images to replace, go to the home page of your digital library in a browser. Right-click on the header image and select "Save picture as". A dialog will pop up and display the image name: gsdlhead.gif (or newheadimg.gif if you are using the new header). Click Cancel to close the dialog; you don't need to save the images. Do the same for the background image by right-clicking on the left-hand green (or blue) swirly bar. This time choose "Save background as" to find the name: chalk.gif (or newbgimg.gif), then click Cancel. These instructions apply to Internet Explorer. Other browsers may have other options in the right-click menu. For example, Mozilla provides "View Image" and "View Background Image" options. Using these options puts the path to the image in the browser address box, and the name can be read from it. Once you have identified the names of the images to be replaced, you need to find out where they occur in the macro files. To do this on Windows, you would search the macro files for the image names using the findstr program, which is run in a command prompt. 
Open a command prompt using Start → Programs → Accessories → Command Prompt, or Start → Run and enter cmd as the name of the program to run. You can type findstr /? to see a description of the program and its arguments. To search the macro files for gsdlhead.gif, type findstr /s /m /C:"gsdlhead.gif" "C:\Program Files\Greenstone\macros\*.dm" Here *.dm means all files ending in .dm (while /s tells it to search within subfolders and /m lists only the names of the files that matched). A list of the macro files that contain the image name will be displayed. You will see that home.dm and exported_home.dm both contain gsdlhead.gif. home.dm is the one you want to edit; exported_home.dm is used for the home page when you export a collection to CD-ROM. On Linux systems, the equivalent command to run in a terminal would be: fgrep -rl "gsdlhead.gif" /full/path/to/your/greenstone/macros/ Do the same thing for chalk.gif: findstr /s /m /C:"chalk.gif" "C:\Program Files\Greenstone\macros\*.dm" base.dm and style.dm are the only files that mention this image. Close the command prompt. <Text id="0540">Looking at a multimedia collection</Text> Copy the entire folder sample_files → beatles → advbeat_large (with all its contents) into your Greenstone collect folder. If you have installed Greenstone in the usual place, this is My Computer → Local Disk (C:) → Users → <Username> → Greenstone → collect or, for a Greenstone 3 installation, My Computer → Local Disk (C:) → Users → <Username> → Greenstone3 → web → sites → localsite → collect, where <Username> is the username under which Greenstone is installed. Put advbeat_large in there. Then go into the advbeat_large folder and delete its index subfolder. On Windows, if the Greenstone Digital Library Local Library Server is already running, restart it by clicking the world icon on the task bar and then pressing Restart Library. On Linux and Mac, just do a forced reload of the web browser (e.g. by holding Shift while pressing the refresh button in Firefox). 
If the Local Library Server hasn't been started yet, start it up first by selecting Greenstone Digital Library from the Start menu on Windows, or run ./gs-server.sh on Linux and Mac. Start up GLI and open the collection. Switch to the panel and rebuild the collection. Preview the result. Explore the Beatles collection. Note how the button divides the material into seven different types. Within each category, the documents have appropriate icons. Some documents have an audio icon: when you click these you hear the music (assuming your computer is set up with appropriate player software). Others have an image thumbnail: when you click these you see the images. Look at the browser. Each title has a bookshelf that may include several related items. For example, Hey Jude has a MIDI file, lyrics, and a discography item. Observe the low quality of the metadata. For example, the five items under (under in the browser) have different variants as their titles. The collection would have been easier to organize had the metadata been cleaned up manually first, but that would be a big job. Only a tiny amount of metadata was added by hand—fewer than ten items. The original metadata was left untouched and Greenstone facilities were used to clean it up automatically. (You will find in that this is possible but tricky.) In the file browser, take a look at the files that makes up the collection, in the sample_files → beatles → advbeat_large → import folder. What a mess! There are over 450 files under seven top-level sub-folders. Organization is minimal, reflecting the different times and ways the files were gathered. For example, html_lyrics and discography are excerpts of web sites, and images contains various images in JPEG format. For each type, drill down through the hierarchy and look at a sample document. <Text id="0550">Building a multimedia collection</Text> We will proceed to reconstruct from scratch the Beatles collection that you have just looked at. 
We develop the collection using a small subset of the material, purely to speed up the repeated rebuilding that is involved. Start a new collection ( → ) called small beatles, basing it on the default . (Basing it on the existing Advanced Beatles collection would make your life far easier, but we want you to learn how to build it from scratch!) Copy the files provided in sample_files → beatles → advbeat_small into your new collection. Do this by opening up advbeat_small, selecting the eight items within it (from discography to beatles_midi.zip), and dragging them across. Because some of these files are in MP3 and MARC formats you will be asked whether to include and in your collection. Click . A window may pop up explaining that the import documents contain css files, which none of Greenstone's plugins are expected to process directly. CSS files normally belong to a web page and we don't need to process them directly. Click button. Change to the panel and browse around the files. There is no metadata—yet. Recall that you can double-click files to view them. (There are no MIDI files in the collection: these require more advanced customisation because there is no MIDI plugin. We will deal with them later.) Change to the panel and build the collection. Preview the result. Manually correcting metadata You might want to correct some of the metadata—for example, the atrocious misspelling in the titles "MAGICAL MISTERY TOUR." These documents are in the discography section, with filenames that contain the same misspelling. Locate one of them in the panel. Notice that the extracted metadata element is now filled in, and misspelt. You cannot correct this element, for it is extracted from the file and will be re-extracted every time the collection is re-built. Instead, add metadata for these two files: "Magical Mystery Tour." In the panel, open the discography folder and drill down to the individual files. Set the value for the two offending items. 
Build the collection again, and preview it. Extracted metadata is unreliable. But it is very cheap! On the other hand, manually assigned metadata is reliable, but expensive. The previous section of this exercise has shown how to aim for the best of both worlds by using extracted metadata but correcting it when it is wrong. Browsing by media type First let's remove the classifier for filenames, which isn't very useful, and replace it with a browsing structure that groups documents by category (discography, lyrics, audio etc.). Categories are defined by manually assigned metadata. Change to the panel, select the folder discography and set its metadata value to "Discography". Setting this value at the folder level means that all files within the folder inherit it. Repeat the process. Assign "Lyrics" to the html_lyrics folder, "Images" to images, "MARC" to marc, "Audio" to mp3, "Tablature" to tablature_txt, and "Supplementary" to wordpdf. Switch to the panel and select the section. Delete the classifier (the second one). Add a classifier and select as the field. Click the and select in the drop-down list. Click the check box and choose from the drop-down list. Click the checkbox, and select in the drop-down list: this will make the classifier display documents in alphabetical order of title. Specify as the . Build the collection again and preview it. Note how we assigned metadata to all documents in the collection with a minimum of labour. We did this by capitalizing on the folder structure of the original information. Even though we complained earlier about how messy this folder structure is, you can still take advantage of it when assigning metadata. Suppressing dummy text Alongside the Audio files there is an MP3 icon, which plays the audio when you click it, and also a text document that contains some dummy text. Image files also have dummy documents. These dummy documents aren't supposed to be seen, but to suppress them you have to fiddle with a format statement. 
Change to the panel and select the section. Ensure that the format feature is selected, and make the changes that are highlighted below to its template. You need to insert five lines into the first line, and delete the second line. (Note, the changes are available in a text file, see below.) Change: <td valign=top>[link][icon][/link]</td>
<td valign=top>[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>
<td valign=top>[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td> <td valign="top">
<gsf:link type="document">
<gsf:icon type="document"/>
</gsf:link>
</td>
<td valign="top">
<gsf:link type="source">
<gsf:choose-metadata>
<gsf:metadata name="thumbicon"/>
<gsf:metadata name="srcicon"/>
</gsf:choose-metadata>
</gsf:link>
</td> to this: <td valign=top>
{If}{[dc.Format] eq 'Audio',
[srclink][srcicon][/srclink],
{If}{[dc.Format] eq 'Images',
[srclink][thumbicon][/srclink],
{If}{[dc.Format] eq 'Supplementary',
[srclink][srcicon][/srclink] [link][icon][/link], [link][icon][/link]}}}</td>
<td valign=top>[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td> <td valign="top">
<gsf:switch>
<gsf:metadata name="dc.Format"/>
<gsf:when test='equals' test-value='Audio'>
<gsf:link type="source"><gsf:metadata name="srcicon"/></gsf:link>
</gsf:when>
<gsf:when test='equals' test-value='Images'>
<gsf:link type="source"><gsf:metadata name="thumbicon"/></gsf:link>
</gsf:when>
<gsf:when test='equals' test-value='Supplementary'>
<gsf:link type="source"><gsf:metadata name="srcicon"/></gsf:link>
<gsf:link type="document"><gsf:icon type="document"/></gsf:link>
</gsf:when>
<gsf:otherwise>
<gsf:link type="document"><gsf:icon type="document"/></gsf:link>
</gsf:otherwise>
</gsf:switch>
</td> To make this easier for you we have prepared a plain text file that contains the new text. In WordPad open the following file: sample_files → beatles → format_tweaks → audio_tweak_3.txt (Be sure to use WordPad rather than Notepad, because Notepad does not display the line breaks correctly.) Place it in the copy buffer by highlighting the text in WordPad and selecting Edit → Copy. Now move back to the Librarian Interface, highlight the portion of the existing format statement that needs to be replaced (for the first form of the statement shown above, this is all of its text), and paste the copied text over it to transform the old statement into the new one. Preview the result. You may need to click the browser's <Reload> button to force it to reload the page. While we're at it, let's remove the source filename from where it appears after each document. In the format feature, delete the text that displays the source filename. In the first statement below this is the final {If}{[ex.Source],...} clause; in the template version that follows it, it is the whole gsf:switch block: <td valign=top>
{If}{[dc.Format] eq 'Audio',
[srclink][srcicon][/srclink],
{If}{[dc.Format] eq 'Images',
[srclink][thumbicon][/srclink],
{If}{[dc.Format] eq 'Supplementary',
[srclink][srcicon][/srclink] [link][icon][/link], [link][icon][/link]}}}</td>
<td valign=top>[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td> <gsf:switch>
<gsf:metadata name="Source"/>
<gsf:when test="exists">
<br/>
<i>(<gsf:metadata name="Source"/>)</i>
</gsf:when>
</gsf:switch> Preview the result (you don't need to rebuild the collection.) Using rather than There are sometimes several documents with the same title. For example, appears both as lyrics and tablature (under ). The browser might be improved by grouping these together under a bookshelf icon. This is a job for an . In the previous tutorial we showed how to use the option in classifier to group documents with the same metadata value ( in that case) in one bookshelf. Here we use instead. Change to the panel and select the section. Remove the classifier (at the top) Add an classifier, and enter , as its metadata. Finish by pressing . Move the new classifier to the top of the list ( button). Build the collection again and preview it. Both items for now appear under the same bookshelf. However, many entries haven't been amalgamated because of non-uniform titles: for example appears as several different variants. We will learn below how to amalgamate these. Making bookshelves show how many items they contain Make the bookshelves show how many documents they contain by inserting a line in the format statement in the section of the panel. The added line is shown highlighted below. The complete format statement can be copied from sample_files → beatles → format_tweaks → show_num_docs.txt. <td valign=top>
{If}{[dc.Format] eq 'Audio',
[srclink][srcicon][/srclink],
{If}{[dc.Format] eq 'Images',
[srclink][thumbicon][/srclink],
{If}{[dc.Format] eq 'Supplementary', [srclink][srcicon][/srclink] [link][icon][/link], [link][icon][/link]}}}</td>
<td>{If}{[numleafdocs],([numleafdocs])}</td>
<td valign=top>[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]</td> Make the bookshelves show how many documents they contain by modifying the template of the format feature in the section of the panel. Insert the highlighted statements: <gsf:template match="classifierNode[@classifierStyle = 'VList']">
...
<gsf:metadata name="Title"/>
</td>
<td>
(<gsf:metadata name="numleafdocs"/>)
</td>
</gsf:template>
The complete format statement for the template of the format feature can be copied from sample_files → beatles → format_tweaks → show_num_docs_3.txt. Preview the result (you don't need to build the collection.) Bookshelves in the titles and browse classifiers should show how many documents they contain. Adding a Phind phrase browser In the section on the panel, add a classifier. Leave the settings at their defaults: this generates a phrase browsing classifier that sources its phrases from Title and text. Build the collection again and preview it. Select the new option from the navigation bar. Enter a single word in the text box, such as . The phrase browser will present you with phrases found in the collection containing the search term. This can provide a useful way of browsing a very large collection. Note that even though it is called a phrase browser, only single terms can be used as the starting point for browsing. Branding the collection with an image To complete the collection, lets give it a new image for the top left corner of the page. Go to the section of the panel. Use the browse button of to select the following image: sample_files → beatles → advbeat_large → images → beatlesmm.png Preview the collection, and make sure the new image appears. Using In this section we incorporate the MIDI files. Greenstone has no MIDI plugin (yet). But that doesn't mean you can't use MIDI files! is a useful generic plugin. It knows nothing about any given format but can be tailored to process particular document types—like MIDI—based on their filename extension, and set basic metadata. In the section of the panel: add ; activate its field and set it to to make it recognize files with extension ; Set to and to . In this collection, all MIDI files are contained in the file beatles_midi.zip. (already in the list of default plugins) is used to unpack the files and pass them down the list of plugins until they reach . Build the collection and preview it. 
Unfortunately the MIDI files don't appear as Audio under the browse button. That's because they haven't been assigned metadata. Back in the panel, click on the file beatles_midi.zip and assign its value to "Audio"—do this by clicking on "Audio" in the list. All files extracted from the Zip file inherit its settings. Cleaning up a title browser using regular expressions We now clean up the browser. We are going to use the classifier option. The aim is to amalgamate variants of titles by stripping away extraneous text. For example, we would like to treat , and the same for grouping purposes. To achieve this: Go to the Title under on the panel; Click the button on it. Activate its option and set it to: (?i)(\\s+\\d+)|(\\s+[[:punct:]].*) Build the collection and preview the result. Observe how many more times similar titles have been amalgamated under the same bookshelf. Test your understanding of regular expressions by trying to rationalize the amalgamations. (Note: stands for any punctuation character.) The icons beside the Word and PDF documents are not the correct ones, but that will be fixed in the next format statement. One powerful use of regular expressions in the exercise was to clean up the browser. Perhaps the best way of doing this would be to have proper title metadata. The metadata extracted from HTML files is messy and inconsistent, and this was reflected in the original browser. Defining proper title metadata would be simple but rather laborious. Instead, we have opted to use regular expressions in the classifier to clean up the title metadata. This is difficult to understand, and a bit fiddly to do, but if you can cope with its idiosyncrasies it provides a quick way to clean up the extracted metadata and avoid having to enter a large amount of metadata. 
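To see roughly how a suffix-stripping expression of this shape amalgamates titles, here is a minimal Python sketch. It is an illustration only, not Greenstone's own code. It assumes that the first match of the expression is deleted from each title before grouping; how Greenstone applies the expression internally is not shown in this exercise. Python's re module has no [[:punct:]] class, so an ASCII punctuation class stands in for it, and the backslashes are written singly because a raw string is used. The helper names grouping_key and group_titles are hypothetical, and the sample titles are invented.

```python
import re
import string
from collections import defaultdict

# Python's re module has no POSIX [[:punct:]] class, so build an
# equivalent (ASCII-only) character class from string.punctuation.
PUNCT = "[" + re.escape(string.punctuation) + "]"

# Same shape as the tutorial's expression: match either " <digits>"
# or everything from " <punctuation character>" to the end of the title.
SUFFIX = re.compile(r"(\s+\d+)|(\s+" + PUNCT + r".*)", re.IGNORECASE)


def grouping_key(title):
    """Return the title with the first match of SUFFIX removed (assumed behaviour)."""
    return SUFFIX.sub("", title, count=1)


def group_titles(titles):
    """Map each cleaned-up key to the list of original titles that share it."""
    shelves = defaultdict(list)
    for title in titles:
        shelves[grouping_key(title)].append(title)
    return dict(shelves)


# Invented title variants, for illustration only.
sample = ["Hey Jude", "Hey Jude 2", "Hey Jude (lyrics)", "Let It Be - tab"]
shelves = group_titles(sample)

for key, members in shelves.items():
    print(key, "->", members)
```

With these invented titles, the three Hey Jude variants end up under one key and the Let It Be variant under another, which is the bookshelf grouping effect described above. Titles whose variants differ in some other way (for example a prefix, or a different spelling) would still be kept apart.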
Using non-standard macro files Using different icons for different media types To put the finishing touches to our collection, we add some decorative features. Close the collection in the Librarian Interface. Using your Windows file browser outside Greenstone, locate the folder sample_files → beatles → advbeat_large. Open up another file browser, and locate the small beatles collection in your Greenstone installation: Greenstone3 → web → sites → localsite → collect → smallbea. Here smallbea is the folder name generated by Greenstone for this collection. You can determine the folder name for a collection by looking at the title bar of the Librarian Interface: the folder name is displayed in brackets after the collection name. Using the file browser, copy the images and macros folders from the advbeat_large folder into the smallbea folder. (It's OK to overwrite the existing images folder: the image in it is included in the folder being copied.) The images folder includes some useful icons, and the macros folder defines some macro names that use these images. If you are working with the gsf: template form of the format statement, which does not use these macros, copy only the images folder. To see the macro definitions, open the collection in the Librarian Interface and view the section in the panel. Re-edit the previously edited format statement of the format feature (in on the panel) to be the following. You can copy this text from the file sample_files → beatles → format_tweaks → multi_icons_3.txt. Change: <td valign=top>
  {If}{[numleafdocs],[link][icon][/link]}
  {If}{[dc.Format] eq 'Lyrics',[link]_iconlyrics_[/link]}
  {If}{[dc.Format] eq 'Discography',[link]_icondisc_[/link]}
  {If}{[dc.Format] eq 'Tablature',[link]_icontab_[/link]}
  {If}{[dc.Format] eq 'MARC',[link]_iconmarc_[/link]}
  {If}{[dc.Format] eq 'Images',[srclink][thumbicon][/srclink]}
  {If}{[dc.Format] eq 'Supplementary',[srclink][srcicon][/srclink]}
  {If}{[dc.Format] eq 'Audio',[srclink]{If}{[FileFormat] eq 'MIDI',_iconmidi_,_iconmp3_}[/srclink]}
</td>
<td>
{If}{[numleafdocs],([numleafdocs])}
</td>
<td valign=top>
[highlight]
{Or}{[dc.Title],[Title],Untitled}
[/highlight]
</td> <td valign="top">
<gsf:switch>
<gsf:metadata name="dc.Format"/>
<gsf:when test='equals' test-value='Audio'>
<gsf:link type="source"><gsf:metadata name="srcicon"/></gsf:link>
</gsf:when>
<gsf:when test='equals' test-value='Images'>
<gsf:link type="source"><gsf:metadata name="thumbicon"/></gsf:link>
</gsf:when>
<gsf:when test='equals' test-value='Supplementary'>
<gsf:link type="source"><gsf:metadata name="srcicon"/></gsf:link> <gsf:link type="document"><gsf:icon type="document"/></gsf:link>
</gsf:when>
<gsf:otherwise>
<gsf:link type="document"><gsf:icon type="document"/></gsf:link>
</gsf:otherwise>
</gsf:switch>
</td> to this: <td valign="top">
<gsf:switch>
<gsf:metadata name="dc.Format"/>
<gsf:when test="equals" test-value="Lyrics">
<gsf:link type="document">
<gsf:icon file="lyrics.gif" select="collection" />
</gsf:link>
</gsf:when>
<gsf:when test="equals" test-value="Discography">
<gsf:link type="document">
<gsf:icon file="disc.gif" select="collection" />
</gsf:link>
</gsf:when>
<gsf:when test="equals" test-value="Tablature">
<gsf:link type="document">
<gsf:icon file="tab.gif" select="collection" />
</gsf:link>
</gsf:when>
<gsf:when test="equals" test-value="MARC">
<gsf:link type="document">
<gsf:icon file="marc.gif" select="collection" />
</gsf:link>
</gsf:when>
<gsf:when test="equals" test-value="Images">
<gsf:link type="source">
<gsf:metadata name="thumbicon"/>
</gsf:link>
</gsf:when>
<gsf:when test="equals" test-value="Supplementary">
<gsf:link type="source">
<gsf:metadata name="srcicon"/>
</gsf:link>
</gsf:when>
<gsf:when test="equals" test-value="Audio">
<gsf:link type="source">
<gsf:switch>
<gsf:metadata name="FileFormat"/>
<gsf:when test="equals" test-value="MIDI">
<gsf:icon file="midi.gif" select="collection" />
</gsf:when>
<gsf:otherwise>
<gsf:metadata name="srcicon"/>
</gsf:otherwise>
</gsf:switch>
</gsf:link>
</gsf:when>
</gsf:switch>
</td> Preview your collection as before. Now different icons are used for discography, lyrics, tablature, and MARC metadata. Even MP3 and MIDI audio file types are distinguished. If you let the mouse hover over one of these images a "tool tip" appears explaining what file type the icon represents in the current interface language (note: extra.dm only defines English and French). Changing the collection's background image Go to the section in the panel. The content is fairly brief, specifying only what needs to be overridden from the default behaviour for this collection. Near the top you should see: _collectionspecificstyle_ {
<style>
body.bgimage \{ background-image: url("_httpcimages_/beat_margin.gif"); \}
\#page \{ margin-left: 120px; \}
</style>
} Replace the text beat_margin.gif with tile.jpg. This line relates to the background image used. The new image tile.jpg was in the images folder that was copied across previously. Preview the collection's home page. The page background is now the new graphic. Other features can be altered by editing the macros, for example the headers and footers used on each page, and the highlighting style used for search terms (specify a different colour, use bold etc.). Building a full-size version of the collection To finish, let's now build a larger version of the collection. To do this: Close the current collection. Start a new collection called large beatles. Base this new collection on small beatles. Copy the content of sample_files → beatles → advbeat_large → import into this newly formed collection. Since there are considerably more files in this set of documents, the copy will take longer. Build the collection and preview the result. (If you want the collection to have an icon, you will have to add it from the panel.) Adding an image collage browser Switch to the panel and select the section. Pull down the classifier menu, select the collage classifier, and add it. There is no need to customize the options, so accept them using the button at the bottom of the resulting popup. Now change to the panel and build and preview the collection. Try out the new collage browsing classifier. You can click on any image during the collage display and the image will be opened up. <Text id="0674">Scanned image collection</Text> Here we build a small replica of Niupepa, the Maori Newspaper collection, using five newspapers taken from two newspaper series. It allows full text searching and browsing by title and date. When a newspaper is viewed, a preview image and its corresponding plain text are presented side by side, with a "go to page" navigation feature at the top of the page. The collection involves a mixture of plugins, classifiers, and format statements. 
The bulk of the work is done by , a plugin designed precisely for the kind of data we have in this example. For each document, an file is prepared that specifies a list of image files that constitute the document, tagged with their page number and (optionally) accompanied by a text file containing the machine-readable version of the image, which is used for full text searching. Three newspapers in our collection (all from the series ) have text representations, and two (from ) have images only. Item files can also specify metadata. In our example the newspaper series is recorded as and its date of publication as . Issue and metadata is also recorded, where appropriate. This metadata is extracted as part of the building process. Start a new collection called Paged Images and fill out the fields with appropriate information: it is a collection sourced from an excerpt of Niupepa documents. In the panel, open the sample_files → niupepa → sample_items folder and drag the two subfolders into your collection on the right-hand side. A popup window asks whether you want to add to the collection: click , because this plugin will be needed to process the item files. will process the item files, creating a document for each one with a separate section for each page listed. Thumbnail and screen-resolution sized images of each page image will be generated. Go to the panel, build the collection and preview the result. Search for and view one of the titles listed (all three appear as ). Browse by and view one of the newspapers. Note that only the newspapers have text; papers don't. This collection was built with Greenstone's default settings. You can locate items of interest, but the information is less clearly and attractively presented than in the full Niupepa collection. Grouping documents by series title and displaying dates within each group Under , documents from the same series are repeated without any distinguishing features such as date, volume or number. 
It would be better to group them by series title and display other information within each group. This can be accomplished using the option to the classifier, and tuning the classifier's format statement. In the panel, under the section, delete the classifier for . This classifier is not much use. Select the classifier for and click . Set to . This will create a bookshelf for each Title in the collection. Note, setting this option to will only create a bookshelf when more than one document shares a Title. Build the collection, and preview the list. Now we change the format statement for to display more information about the documents. In the section of the panel, select the classifier (CL1) in the list., and in the list. Click to add this format statement to your collection. Delete the contents of the box, and add the following text. (This format statement can be copied and pasted from the file sample_files → niupepa → formats → titles_tweak.txt.) Edit the contents of the classifier format statement by removing the following in the template: <td valign="top">[link][icon][/link]</td>
<td valign="top">
{If}{[numleafdocs],[ex.Title] ([numleafdocs]),
Volume [ex.Volume] Number [ex.Number] Date [format:ex.Date]}
</td> <td valign="top">
<gsf:link type="source">
<gsf:choose-metadata>
<gsf:metadata name="thumbicon"/>
<gsf:metadata name="srcicon"/>
</gsf:choose-metadata>
</gsf:link>
</td>
<td valign="top">
<gsf:link type="document">
<xsl:call-template name="choose-title"/>
</gsf:link>
<gsf:switch>
<gsf:metadata name="Source"/>
<gsf:when test="exists">
<br/>
<i>
(<gsf:metadata name="Source"/>)
</i>
</gsf:when>
</gsf:switch>
</td> In its place, insert the following (which can be copied from sample_files → niupepa → formats → titles_tweak_gs3.txt): <td valign="top">
Volume: <gsf:metadata name="Volume"/> Number: <gsf:metadata name="Number"/> Date: <gsf:metadata format="formatDate" name="Date"/>
</td>
Then, in template for s, replace the contents of the final <td> table cell element with the following which can also be copied from the file titles_tweak_gs3.txt: <td valign="top">
<xsl:call-template name="choose-title"/> (<gsf:metadata name="numleafdocs"/>)
</td>
Refresh in the web browser to view the new list. As a consequence of using the option of the classifier, bookshelf icons appear when titles are browsed. This revised format statement has the effect of specifying in brackets how many items are contained within a bookshelf, for classifier nodes. It works by exploiting the fact that only bookshelf icons define [numleafdocs] metadata. For document nodes, Title is not displayed. Instead, Volume, Number and Date information are displayed. Browsing documents by Date. Back in the panel, under the section, add a classifier, leaving its option set to . In the section of the panel, select in the list, and click to add this format statement to your collection. In the template of the new feature, replace: <gsf:switch>
<gsf:metadata name="Source"/>
<gsf:when test="exists">
<br/>
<i>(<gsf:metadata name="Source"/>)</i>
</gsf:when>
</gsf:switch> with this, which can also be copied from the file titles_tweak_gs3.txt: </td>
<td valign="top">
<xsl:call-template name="choose-date"/>
The above makes reference to the "choose-date" template which we're about to create: select the format statement in the and append the following definition for the "choose-date" template (which can be copied from sample_files → niupepa → formats → global_tweak_gs3.txt): <gsf:template name="choose-date">
<gsf:choose-metadata>
<gsf:metadata format="formatDate" name="dc.Date"/>
<gsf:metadata format="formatDate" name="exp.Date"/>
<gsf:metadata format="formatDate" name="ex.dc.Date"/>
<gsf:metadata format="formatDate" name="Date"/>
<gsf:default>undated</gsf:default>
</gsf:choose-metadata>
</gsf:template> Build the collection, and preview the list. The list groups documents by date. Greenstone's internal date format is YYYYMMDD, for example 18580601, and this is crucial for the classifier to correctly parse date metadata and generate an ordered date list. However, the date has been made to look nice by adding a macro"" attribute to Date metadata in the format statement. In the section of the panel, select in the list, and in the list. Click to add this format statement to your collection. Replace the last line <td>{Or}{[format:dc.Date],[format:exp.Date],[format:ex.Date]}</td> with <td>{Or}{[dc.Date],[exp.Date],[ex.Date]}</td> Back in the format statement, edit the display of the date metadata to remove the special date-formatting, so that it looks like: <gsf:template name="choose-date">
<gsf:choose-metadata>
<gsf:metadata name="dc.Date"/>
<gsf:metadata name="exp.Date"/>
<gsf:metadata name="ex.dc.Date"/>
<gsf:metadata name="Date"/>
<gsf:default>undated</gsf:default>
</gsf:choose-metadata>
</gsf:template> Refresh in the web browser to view the new list. The dates are now shown in internal format. Change the format statement back to reinstate the nicely formatted dates. This can be done by selecting in the assigned format statements panel and clicking <> (a few times, if necessary). Displaying scanned images and suppressing dummy text When you reach a newspaper, only its associated text is displayed. When either of the newspapers is accessed, the document view presents the message No scanned image information (screen-view resolution or otherwise) is shown, even though it has been computed and stored with the document. This can be fixed by a format statement that modifies the default behaviour for . In the section of the panel, select the format statement. The default format string displays the document's plain text, which, if there is none, is set to Change this to the following text. (This format statement can be copied and pasted from the file sample_files → niupepa → formats → doc_tweak.txt) <table><tr>
<td valign=top>[srclink][screenicon][/srclink]</td>
<td valign=top>[Text]</td>
</tr></table> Including [screenicon] has the effect of embedding the screen-sized image generated by switching the option on in . It is hyperlinked to the original image by the construct [srclink]...[/srclink]. This is a large image but it may be scaled by your browser. This modification will display screenview image, but does nothing about the dummy text , which will still be displayed. To get rid of this, edit the format statement again and replace <td valign=top>[Text]</td> with {If}{[NoText],,<td valign=top>[Text]</td>} Preview the collection and view one of the documents. The line should now be gone. Searching at page level The newspaper documents are split into sections, one per page. For large documents, it is useful to be able to search on sections rather than documents. This allows users to more easily locate the relevant information in the document. Go to the section of the panel. Remove the index and, if not already the case, check the checkbox to build the indexes on section level as well as document level. Make section level the default by selecting its radio button. Set the display text used for the level drop-down menu by going to the section on the panel. Set the document level text to "newspaper", and the section level text to "page". Build and preview the collection. Choose . Compare searching at "newspaper" level with searching at "page" level. A useful search term for this collection is . Tidying up search results You will notice that when searching for individual pages, a thumbnail of the newspaper image is displayed in the search results. For text pages like this, these are not very useful. Let's tell not to generate thumbnails. In the panel, under the section, select from the list and click . Switch on the option and set its value to . Rebuild and preview the collection, doing a search at page level. Search results at newspaper level display the original filename. Let's remove that also. 
Go to section of the panel in the Librarian Interface, choose in list, and select the format statement from the list of assigned format statements. Remove the following from the last line of the format string: {If}{[ex.Source],<br><i>([ex.Source])</i>} Preview the collection. You might notice that newspaper level search results only display the newspaper Title, and not any volume information, while page level search results only show a large scan of the newspaper page, the Title of the page (the page number), and not the Title of the newspaper. We'll modify the format statement to show Volume and Number information, and for page results, the newspaper title as well as the page number. In the section, select in to adjust how search results are displayed., and in . Click to add this format to the collection. The previous changes modified , so they will apply to all s that don't have specific format statements. These next changes are made to so will only apply to search results. The extracted Title for the current section is specified as [ex.Title]<gsf:metadata name="Title"/> while the Title for the parent section is [parent:ex.Title]<gsf:metadata name="Title" select="parent"/>. Since the same format statement is used when searching both whole newspapers and newspaper pages, we need to make sure it works in both cases. Set the format statement to the following text (it can be copied and pasted from the file sample_files → niupepa → formats → search_tweak.txt): Replace the lines comprising the final <td> table cell element with the following format statement (it can be copied and pasted from the file sample_files → niupepa → formats → search_tweak_gs3.txt): <td valign="top">[link][icon][/link]</td>
<td valign="top">
{If}{[parent:ex.Title],[parent:ex.Title] Volume [parent:ex.Volume] Number [parent:ex.Number]: Page [ex.Title],
[ex.Title] Volume [ex.Volume] Number [ex.Number]}
<br/><i>({Or}{[format:parent:ex.Date],[format:ex.Date],undated})</i>
</td> <td>
<gsf:switch>
<gsf:metadata name="Title" select="parent"/>
<gsf:when test="exists">
<gsf:metadata name="Title" select="parent"/> Volume:<gsf:metadata name="Volume" select="parent"/> Number:<gsf:metadata name="Number" select="parent"/> - Page:<gsf:metadata name="Title"/>
</gsf:when>
<gsf:otherwise>
<gsf:metadata name="Title"/> Volume:<gsf:metadata name="Volume"/> Number:<gsf:metadata name="Number"/>
</gsf:otherwise>
</gsf:switch>
<br/>
<i>
<gsf:choose-metadata>
<gsf:metadata name="Date" select="parent" format="formatDate" />
<gsf:metadata name="Date" format="formatDate" />
<gsf:default>undated</gsf:default>
</gsf:choose-metadata>
</i>
</td> Preview the search results. Items display newspaper title, Volume, Number and Date, and pages also display the page number. The collection you have just built involves a fairly complex document structure. There are two series of newspapers, and . In the series there are two actual newspapers, Numbers 1 and 2. Number 1 has 4 pages, numbered 1, 2, 3, 4; Number 2 has 4 pages, numbered 5, 6, 7, 8. The page numbers increase consecutively through each volume, despite the fact that the volume is divided into different Numbers. Each page in the Te Waka series is represented by a single file, a GIF image of the page. The series has three actual newspapers, Numbers 1, 2, and 3. Number 1 has 4 pages, numbered 1, 2, 3, 4; Number 2 has 5 pages, numbered 5, 6, 7, 8, 9; Number 3 has 5 pages, numbered 10, 11, 12, 13, 14. Again the page numbers increase consecutively through each volume. Each page in this series is represented by two files, a GIF image of the page and a text file containing the OCR’d text that appears on it. The key to this structure is in the respective .item files. Here is a synopsis of the information they contain: (9-1-1) Te Waka Volume 1 Number 1
    p.1 gif
    p.2 gif
    p.3 gif
    p.4 gif
(9-1-2) Te Waka Volume 1 Number 2
    p.5 gif
    p.6 gif
    p.7 gif
    p.8 gif
(10-1-1) Te Whetu Volume 1 Number 1
    p.1 gif text
    p.2 gif text
    p.3 gif text
    p.4 gif text
(10-1-2) Te Whetu Volume 1 Number 2
    p.5 gif text
    …
    p.9 gif text
(10-1-3) Te Whetu Volume 1 Number 3
    p.10 gif text
    …
    p.14 gif text
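The consecutive page numbering summarised above can be reproduced with a little arithmetic. The following Python sketch is purely illustrative: the function name `page_ranges` is hypothetical and is not part of Greenstone. Given the number of pages in each Number of a volume, it returns the first and last page of each Number.

```python
from itertools import accumulate


def page_ranges(pages_per_number):
    """Return (first_page, last_page) for each Number in a volume.

    Pages are numbered consecutively through the whole volume, so each
    Number starts one page after the previous Number ended.
    """
    ends = list(accumulate(pages_per_number))            # running total = last page
    starts = [1] + [end + 1 for end in ends[:-1]]        # next Number starts after it
    return list(zip(starts, ends))


# Page counts taken from the synopsis above.
te_waka = page_ranges([4, 4])       # Volume 1, Numbers 1 and 2
te_whetu = page_ranges([4, 5, 5])   # Volume 1, Numbers 1, 2 and 3

print(te_waka)
print(te_whetu)
```

For Te Whetu this gives pages 1 to 4, 5 to 9 and 10 to 14, matching the synopsis.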
<Text id="sc1">Advanced scanned image collection</Text> In this exercise we build upon the collection created in the exercise. We add a new newspaper by creating an item file for it, add a new newspaper using the extended XML item file format, and modify the formatting. Adding another newspaper to the collection Another newspaper has been scanned and OCRed, but has no item file. We will add this newspaper into the collection, and create an item file for it. In the Librarian Interface, open up the Paged Image collection that was created in exercise if it is not already open ( → ). In the panel, add the folder sample_files → niupepa → new_papers → 12 to your collection. Inside the folder you can see that there are 4 images and 4 text files. Create an item file for the collection. Have a look at an existing item file to see the format. Start up a text editor (e.g. WordPad) to open a new document. Add some metadata. The for this newspaper is . The is 3, is 6, and the is . (Greenstone's date format is .) Metadata must be added in the form: <Metadata name>Metadata value For this document, the metadata looks like: <Title>Te Haeata 1859-1862
<Date>18610902
<Volume>3
<Number>6 For each page, add a line in the file in the following format: pagenum:imagefile:textfile For example, the first page entry would look like 1:images/12_3_6_1.gif:text/12_3_6_1.txt Note that if there is no text file, you can leave that space blank. You need to add a line for each page in the document. Make sure you increment the page number as well as the image number for each line. (The full text for this file can be copied from sample_files → niupepa → formats → 12_3_6.item.) Save the file using Filename , and save as a plain text document. (If you are using Windows, make sure the file doesn't accidentally end up getting saved as .) Back in the panel of the Librarian Interface, locate the new file in the Workspace tree, and drag it into the collection, adding it to the folder. Build the collection and preview. Check that your new document has been added. XML based item file There are two styles of item files. The first, which was used in the previous section, uses a simple text based format, and consists of a list of metadata for the document, and a list of pages. This format allows specification of document level metadata, and a single list of pages. The second style is an extended format, and uses XML. It allows a hierarchy of pages, and metadata specification at the page level as well as at the document level. In this section, we add in two newspapers which use XML-based item files. In the panel, add the folder sample_files → niupepa → new_papers → xml (you need to add the folder, not the folder) to your collection. Open up the file xml → 23 → 23__2.item and have a look at the XML. This is of the newspaper titled . The contents of this document have been grouped into two sections: , which contains an , and , which contains the page images (and OCR text). Build and preview the collection. The xml style items have been included, but the document display for these items is not very nice. 
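The simple text-based item file format used earlier in this exercise (metadata lines of the form <Metadata name>Metadata value, followed by one pagenum:imagefile:textfile line per page) is regular enough to generate with a short script. The following Python sketch is an illustration rather than a Greenstone tool. The function name `build_item_file` is hypothetical, and the code assumes that the remaining pages follow the same file naming pattern as the first-page example (images/12_3_6_1.gif and text/12_3_6_1.txt) and that "leaving the space blank" means an empty field after the final colon.

```python
def build_item_file(metadata, stem, num_pages, has_text=True):
    """Return the contents of a simple text-based item file as a string.

    metadata  -- dict of document-level metadata (name -> value)
    stem      -- assumed common filename prefix of the page files, e.g. "12_3_6"
    num_pages -- number of scanned pages in the document
    has_text  -- False if there are no OCR text files for the pages
    """
    # Document-level metadata: one "<Name>value" line per entry.
    lines = [f"<{name}>{value}" for name, value in metadata.items()]
    # One "pagenum:imagefile:textfile" line per page.
    for page in range(1, num_pages + 1):
        image = f"images/{stem}_{page}.gif"
        text = f"text/{stem}_{page}.txt" if has_text else ""
        lines.append(f"{page}:{image}:{text}")
    return "\n".join(lines) + "\n"


item = build_item_file(
    {
        "Title": "Te Haeata 1859-1862",
        "Date": "18610902",
        "Volume": "3",
        "Number": "6",
    },
    stem="12_3_6",
    num_pages=4,
)
print(item, end="")
```

The resulting text could then be saved as a plain text file and compared against the sample file 12_3_6.item before being added to the collection.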
Using to control document processing Paged documents can be presented with a hierarchical table of contents, or with next and previous page arrows, and a "go to page" box (like we have done so far). The display type is specified by the option to . The next and previous arrows suit the linear sequence documents, while the table of contents suits the hierarchically organised document. Ordinarily, a Greenstone collection would have one plugin per document type, and all documents of that type get the same processing. In this case, we want to treat the XML-based item files differently from the text-based item files. We can achieve this by adding two plugins to the collection, and configuring them differently. Go to the section of the panel, and add a new plugin. Enable the option, set the option to and set the option to and click . Move this plugin above the original one in the list. The XML based newspapers have been grouped into a folder called xml. This enables us to process these files differently, by utilizing the option which all plugins support. The first in the list looks for item files underneath the xml folder. These documents will be processed as 'hierarchical' documents. Item files that don't match the process expression (i.e. aren't underneath the xml folder) will be passed onto the second , and these are treated as 'paged' documents. Rebuild and preview the collection. Compare the document display for a paged document e.g. with a hierarchical document, e.g. . Switching between images and text We can modify the document display to switch between the text version and the screenview and full size versions. We do this using a combination of format statements and macro files. First of all we will add a macro file to the collection. Close the collection in the Librarian Interface. In a file browser outside of Greenstone, locate the Paged Image collection in your Greenstone installation: Greenstone → collect → pagedima. 
Also in a file browser, locate the file sample_files → niupepa → macros → extra.dm. Copy this file and paste it into the macros folder inside the pagedima collection. Back in the Librarian Interface, open up the collection again, and go to the section of the panel. Select in the list, and click . Tick the checkbox. This gives us more control over the layout of the page: in this case, we want to replace the standard and buttons with buttons that switch between images and text. Select the format item and set it to the following text (which can be copied from sample_files → niupepa → formats → adv_doc_heading.txt). <div class="heading_title">{Or}{[parent(Top):ex.Title],[ex.Title]}</div>
<div class="buttons" id="toc_buttons">
{If}{[srcicon],_document:viewfullsize_}
{If}{[screenicon],_document:viewpreview_}
{If}{[NoText] ne '1',_document:viewtext_}
</div>
<div class="toc">[DocTOC]</div>
{Or}{[parent(Top):ex.Title],[ex.Title]} outputs the newspaper Title metadata. This is only stored at the top level of the document, so if we are at a subsection, we need to get it from the top ([parent(Top):ex.Title]). Note that we can't just use [parent:ex.Title] as this retrieves the Title from the immediate parent node, which may not be the top node of the document. _document:viewpreview_, _document:viewfullsize_, _document:viewtext_ are macros defined in extra.dm which output buttons for preview, fullsize and text versions, respectively. We choose which buttons to display based on what metadata and text the document has. Note you can view the macros by going to the section of the panel. [DocTOC] is the document table of contents or "go to page" navigation element. Since we are using extended options, we need to explicitly specify this for it to appear in the page. The different pieces are surrounded by <div> elements, so that the appropriate styling information can be used. Select the format statement and set it to the following text (which can be copied from sample_files → niupepa → formats → adv_doc_text.txt): {If}{_cgiargp_ eq 'fullsize',[srcicon],
{If}{_cgiargp_ eq 'preview',[screenicon],
{If}{[NoText] ne '1',[Text],[screenicon]}}} This format statement changes the display based on the argument (_cgiargp_). This argument is not normally used for document display, so we can use it here to switch between full size image ([srcicon]), preview size image ([screenicon]) and text ([Text]) versions of each page. Preview the collection. View some of the documents; once you have reached a newspaper page, you should get fullsize, preview and text options. <Text id="0702">Open Archives Initiative (OAI) collection</Text> This exercise explores service-level interoperability using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). So that you can do this on a stand-alone computer, we do not actually connect to the external server that is acting as the data provider. Instead we have provided an appropriate set of files that take the form of XML records produced by OAI-PMH. One of Greenstone's documented example collections is sourced over OAI. This exercise takes you through the steps necessary to reconstruct it. You may wish to take a look at the documented example collection OAI demo now to see what this exercise will build. Start a new collection called OAI Service Provider. Fill out the fields with appropriate information. In the panel, locate the folder sample_files → oai → sample_small → oai. Drag this folder into the collection and drop it there. During the copy operation, a popup window may appear asking whether to add to the list of plug-ins used in the collection, because the Librarian Interface has not found an existing plug-in that can handle this file type. Press the button to include it. The files for this collection consist of a set of images (in JCDLPICS → srcdocs) and a set of OAI records (in JCDLPICS) which contain metadata for the images. When files are copied across like this, the Librarian Interface studies each one and uses its filename extension to check whether the collection contains a corresponding plug-in.
No plug-in in the list is capable of processing the OAI file records that are copied across (they have the file extension .oai), so the Librarian Interface prompts you to add the appropriate plug-in. Sometimes there is more than one plug-in that could process a file—for example, the .xml extension is used for many different XML formats. The popup window, therefore, offers a choice of all possible plug-ins that matched. It is normally easy to determine the correct choice. If you wish, you can ignore the prompt (click ), because plug-ins can be added later, in the section of the panel. You will need to specify which document the OAI metadata values should be attached to. In the panel, select the section, then select the and click . Locate the option in the popup window and type (it may not be available in the drop-down list until after building). Click . Finally, you may want to remove the to speed up building (since it's not going to extract metadata relevant to this tutorial anyway). You also need to configure the image plug-in. Select the line in the section and click . In the resulting popup window locate the option, switch it on, and type the number in the box beside it to create a screen-view image of 300 pixels. Click . Now switch to the panel and build and preview the collection. will process the OAI records, and assign metadata to the images, which are processed by . Like other collections we have built by relying on Greenstone defaults, the end result is passable but can be improved. The next steps refine the collection using the metadata harvested by OAI-PMH into the .oai files. In the section of the panel, delete the two classifiers ( and ). Add an classifier based on metadata. Configure it with as the . Now add an classifier based on metadata. In its configuration panel set to , to , to and to . 
Setting to 2 will mean that two or more documents with the same description will be grouped into a bookshelf; the default of 1 means that every document will get a bookshelf. and control how many documents are grouped into each section of the horizontal A-Z list. In this case, each group can have as few as one document, and no more than ten. In the section of the panel, delete all indexes and add a new one based on metadata. Set the for the index by going to the section in the panel and changing its label to "_labelDescription_". Using a macro for an index name means that it will display in the correct language (assuming that the macro has been translated). You can check Greenstone → macros → english.dm to see which macros are available. (In Greenstone 3, set the label to "descriptions".) Build the collection and preview it. Tweaking the presentation with format statements In the panel, select . First replace the format statement with the following (which can be copied from the file vlist_tweak.txt in the sample_files → oai → format_tweaks folder). <td>
  {If}{[numleafdocs],[link][icon][/link],[link][thumbicon][/link]}
</td>
<td valign=middle>
  {If}{[numleafdocs],[Title],<i>[ex.dc.Description]</i>}
</td> In the panel, select . First, in the format statement, replace the templates for and for s with the following (which can be copied from sample_files → oai → format_tweaks → browse_tweak.txt). <gsf:template match="documentNode">
<td valign="top">
<gsf:link type="document">
<gsf:metadata name="thumbicon"/>
</gsf:link>
</td>
<td valign="middle">
<i>
<gsf:metadata name="ex.dc.Description"/>
</i>
</td>
</gsf:template> <gsf:template match="classifierNode[@classifierStyle = 'VList']">
<td valign="top">
<gsf:link type="classifier">
<gsf:icon type="classifier"/>
</gsf:link>
</td>
<td valign="top">
<xsl:call-template name="choose-title"/>
</td>
</gsf:template> This format statement customizes the appearance of vertical lists such as the search results and captions lists to show a thumbnail icon followed by Description metadata. Next, select from the list and change its format statement to: Next, select the format statement from the list and add the following to create a custom format statement: <h3>[ex.dc.Subject]</h3> <gsf:template name="documentHeading">
<h3>
<gsf:metadata name="ex.dc.Subject"/>
</h3>
</gsf:template> The document heading appears above the and buttons when you get to a document in the collection. By default displays the document's metadata. In this particular set of OAI exported records, titles are filenames of JPEG images, and the filenames are particularly uninformative (for example, 01dla14). You can see them in the panel if you select an image in oai → JCDLPICS → srcdocs and check its and metadata. The above format statement displays metadata instead. Finally, you will have noticed that where the document itself should appear, you see only . To rectify this, select in the pull-down list and use the following as its format statement (this text is in doctxt_tweak.txt in the format_tweaks folder mentioned earlier): <center><table width=_pagewidth_ border=1>
  <tr><td colspan=2 align=center>
    <a href=[ex.dc.OrigURL]>[screenicon]</a></td></tr>
  <tr><td>Caption:</td><td> <i>[ex.dc.Description]</i> <br>
    (<a href=[ex.dc.OrigURL]>original [ImageWidth]x[ImageHeight] [ImageType] available</a>)
    </td></tr>
  <tr><td>Subject:</td><td> [ex.dc.Subject]</td></tr>
  <tr><td>Publisher:</td><td> [ex.dc.Publisher]</td></tr>
  <tr><td>Rights:</td><td> [ex.dc.Rights]</td></tr>
</table></center> Still in the format in the list, add the following (which can be copied from sample_files → oai → format_tweaks → document_content.txt) to create a custom format statement: <gsf:template name="documentContent">
<table>
<tr>
<td colspan="2" align="center">
<a><xsl:attribute name="href">
<gsf:metadata name="ex.dc.OrigURL"/>
</xsl:attribute>
<gsf:metadata name="screenicon"/>
</a>
</td>
</tr>
<tr>
<td>Caption:</td>
<td><i><gsf:metadata name="ex.dc.Description"/></i><br/>
<a><xsl:attribute name="href"><gsf:metadata name="ex.dc.OrigURL"/></xsl:attribute>
original <gsf:metadata name="ImageWidth"/>x<gsf:metadata name="ImageHeight"/> <gsf:metadata name="ImageType"/> available
</a>
</td>
</tr>
<tr>
<td>Subject:</td>
<td><gsf:metadata name="ex.dc.Subject"/></td>
</tr>
<tr>
<td>Publisher:</td>
<td><gsf:metadata name="ex.dc.Publisher"/></td>
</tr>
<tr>
<td>Rights:</td>
<td><gsf:metadata name="ex.dc.Rights"/></td>
</tr>
</table>
</gsf:template> This format statement alters how the document view is presented. It includes a screen-sized version of the image that hyperlinks back to the original larger version available on the web. Factual information extracted from the image, such as width, height and type, is also displayed. It uses XSLT to construct the hyperlink which makes use of the greenstone metadata containing the link to the original image. Format statements are processed by the runtime system, so the collection does not need to be rebuilt for these changes to take effect. Click to see the changes. To expedite building, this collection contains fewer source documents than the pre-built version supplied with the Greenstone installation. However, after these modifications, its functionality is the same. <Text id="oaiserver-0">Setting up your Greenstone OAI Server</Text> Greenstone 2 collections are not enabled for OAI out of the box. To make a collection available for serving up over OAI, some minor adjustments need to be made first. Greenstone 3 collections are available over OAI by default. Their collectionConfig.xml files already specify that each collection is OAI enabled, through use of an OAIPMH element. If you want to disable a collection from being accessible over OAI, edit the OAIPMH element in that collection's collectionConfig.xml. This tutorial will look at how to make an existing collection available over OAI and testing its accessibility by getting it validated against the Open Archives validator. Use a text editor to open the file etc/oai.cfg located in your Greenstone installation folder. The oai.cfg configuration file contains properties that control the behaviour and features of your Greenstone OAI server. The basic properties to edit in order to get your collection served by the inbuilt OAI server are the repositoryName, repositoryID and oaicollection. Look up these properties in the file. 
For repositoryName and repositoryID, type in some values that make sense for your digital library. For example: repositoryName "Greenstone"
repositoryID "greenstone" For this tutorial, we'll make the backdrop collection created in the simple image tutorial available over OAI. Therefore, add this collection's name to the end of the property: oaicollection demo documented-examples/oai-e backdrop If you have a great many documents and do not want the OAI server to return all of them in one go, you could set the property to something lower than the default 250 value in the oai.cfg file. Like: resumeafter 50 If you're on Windows, it's best to be using the Apache web server. So if you're using the Local Library Server, stop the web server by exiting the little white dialog (the Greenstone Server Interface). Use a file browser to go into your Greenstone installation directory and rename the there to to disable it. Now re-launch the Greenstone Server from the menu, so that this time, the included Apache web server will be used instead, launching its own little white dialog. You are now ready to visit your oaiserver home page to check that it's all looking good. Start up the Greenstone Server by going to Windows Start → All Programs → Greenstone 2.85 → Greenstone Server. Start up the Greenstone 3 Server by going to Windows Start → All Programs → Greenstone-3 → Greenstone3 Server. Press the button and you will end up on your Digital Library home page as usual. Adjust the URL so that instead of the suffix, it says . The page that loads now will contain an error message () saying that you've provided an illegal OAI verb. This is because the OAI specification requires you to provide more instruction in the URL as to what you want. The specification defines verbs and possible arguments to them. A basic verb is , which requests the OAI server to return some information about the OAI repository that it's serving. 
Adjust the URL once more by suffixing , so that your URL now looks like: http://<domain>/greenstone/cgi-bin/oaiserver.cgi?verb=Identify http://<domain:port>/greenstone3/oaiserver?verb=Identify Visiting this page now gives some information about your Greenstone OAI repository. Although the data transmitted over OAI is in the form of XML, Greenstone uses a stylesheet to transform that XML response into a user-friendly, structured web page that you see when you perform the request (as happens when you visit the response page). This allows and other verbs in the OAI specification to be shown in the main Greenstone OAI Server pages as link buttons. You can see these verbs represented in the main Greenstone (or ) page as a row of links, starting with "Identify" at the top and in the lower end of the page. Clicking on the links will execute that verb as a request and return the response from your Greenstone OAI server as a structured web page. Try clicking on all the links. OAI defines a concept called a . In Greenstone, the OAI Set concept is mapped to the practical Greenstone collection. The link to the verb will therefore request the Greenstone OAI server to list all the collections that have been enabled for OAI. Click on the ListSets link and have a look. The response page for the verb will show you that your backdrop collection (created in the Simple image collection tutorial) is one of the collections available over OAI in your Greenstone repository. You will see a couple of buttons next to each collection (or ) listed here. The first is Identifiers and the second Records. Click on the Identifiers button for the backdrop Set. This will list all the IDs of the documents contained in your OAI collection. If you look at the IDs, they look similar enough to Greenstone's internal document IDs, but with an additional prefix (oai:<repositoryID>:<setname>, where repositoryID was set by you in the configuration file, and setname is the name of the collection). 
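The request URLs and identifier prefix described above follow a simple pattern, sketched here in Python. This is an illustration under stated assumptions, not part of Greenstone: the helper names `oai_request_url` and `oai_identifier_prefix` are hypothetical, `localhost` stands in for your own domain, and the `set` value assumes that the set name is simply the collection name, as described above. The verb names and the `metadataPrefix` and `set` arguments come from the OAI-PMH specification.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL: base URL, then verb, then any arguments."""
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


def oai_identifier_prefix(repository_id, set_name):
    """Prefix that precedes the Greenstone document ID in an OAI identifier."""
    return f"oai:{repository_id}:{set_name}"


# Hypothetical host; substitute the domain of your own Greenstone 2 server.
BASE = "http://localhost/greenstone/cgi-bin/oaiserver.cgi"

print(oai_request_url(BASE, "Identify"))
print(oai_request_url(BASE, "ListSets"))
print(oai_request_url(BASE, "ListIdentifiers",
                      metadataPrefix="oai_dc", set="backdrop"))
print(oai_identifier_prefix("greenstone", "backdrop"))
```

Omitting the verb altogether corresponds to the illegal-verb error page described earlier, which is why every request URL here carries a `verb` argument.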
Click the browser Back button to get back to the ListSets page and press the Records button located next to the backdrop collection. If you had specified some Dublin Core (dc) metadata for each of the images in the backdrop collection, then the page that loads will display this information for each document in the collection (Set). Greenstone's OAI at present supports 3 metadata formats, as is explained in the instructive comments in the oai.cfg file. Of these three, the OAI standard for Dublin Core, , is the one pertinent to this tutorial. If your collection specifies metadata for a different metadata set format, you can use the oai.cfg file to tell Greenstone how to map the metadata fields of your chosen metadata set format into the Dublin Core metadata set supported by the Greenstone OAI server (or one of the other metadata sets it supports). Look in the oai.cfg file again and scroll down to the section on oaimapping, which will explain and provide examples for how to specify such mappings from your metadata format to one that Greenstone's OAI server uses. For instance, the demo collection comes enabled for OAI upon installation, and specifies some mappings from its metadata format to . Its metadata is mapped to using the following line in the oai.cfg configuration file (note the use of case): oaimapping dls.Title oai_dc.title Because the backdrop collection uses DC metadata already, no mapping is required. Greenstone 3's OAI implementation uses the OAI standard for Dublin Core, , metadata format. By default, it maps all Dublin Core metadata you may have assigned to your collections into . This default mapping is specified in the web\WEB-INF\classes\OAIConfig.xml file. If all (or most) of your collections will be using a different metadata format, you can edit the OAIConfig.xml file's mappingList section to create mappings from the metadata fields you're using to those in . 
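To make the effect of an oaimapping line such as oaimapping dls.Title oai_dc.title concrete, here is a small sketch that reads lines of that shape and renames metadata fields accordingly. It is an illustration only, not Greenstone's own configuration parser: both function names are hypothetical, and it assumes each mapping is three whitespace-separated tokens and that comment lines start with #.

```python
def parse_oaimappings(config_text):
    """Collect 'oaimapping <source> <target>' lines into a dict (illustrative only)."""
    mappings = {}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Assumption: blank lines and lines starting with '#' carry no settings.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) == 3 and parts[0] == "oaimapping":
            mappings[parts[1]] = parts[2]
    return mappings


def apply_mappings(metadata, mappings):
    """Rename metadata fields that have a mapping; leave the others unchanged."""
    return {mappings.get(name, name): value for name, value in metadata.items()}


if __name__ == "__main__":
    sample_config = """
    # map the demo collection's title field to Dublin Core
    oaimapping dls.Title oai_dc.title
    resumeafter 50
    """
    mappings = parse_oaimappings(sample_config)
    print(mappings)
    print(apply_mappings({"dls.Title": "An example title"}, mappings))
```

Note that the field names are matched exactly, which is why the tutorial draws attention to the use of case in the mapping line.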
You can also specify mappings at a collection-level, overriding the mappings in OAIConfig.xml for that collection. So if a collection specifies metadata for a different metadata set format from the default mappings in OAIConfig.xml, adjust the collection's web\sites\localsite\collect\<collection-name>\etc\collectionConfig.xml file to tell Greenstone how to map the metadata fields of your chosen metadata set format into the Dublin Core metadata set supported by the Greenstone OAI server. For instance, look in the demo collection's collectionConfig.xml file (web\sites\localsite\collect\lucene-jdbm-demo\etc\collectionConfig.xml) and scroll down to the definition for the OAIPMH ServiceRack. Look in its mappingList which will explain and provide examples for how to specify such oai mappings from the metadata format that the demo collection uses, to the Dublin Core () metadata used by Greenstone's OAI server. Its metadata is mapped to using the following line in the collectionConfig.xml configuration file (note the use of case): <mapping>dc:title,dls.Title</mapping> Because the collection uses DC metadata, no mapping is required, as the default mappings from DC metadata to are already specified in OAIConfig.xml. Validating the Greenstone OAI server In this section, you'll be testing that you've set up your Greenstone OAI server correctly so that it's accessible over OAI. For this part of the exercise, you need to be on a networked computer and your host computer needs to be visible to the outside world. (That is, when you provide the full name of your computer, someone else in the world should be able to find that computer by typing its URL into their browser's address field.) We'll be using an external OAI client to access our up-and-running Greenstone OAI server. It's not just any OAI client either, but an OAI Server validator. You will want to be running the included Apache web server. 
So if you're on Windows and using the Local Library Server, quit it and rename the application in your Greenstone installation folder to server.not. Then use the menu shortcut to the Greenstone Server once more, to now launch the Apache web server. For this exercise, we will be visiting the Open Archives Validator, for which your OAIserver needs to provide a valid email address. In a text editor, open up your greenstone installation's etc/oai.cfg file and set the value of the maintainer field to your email address. Note that by default, your Greenstone installation will make the demo collection available over OAI. This collection has been set up with a dummy (and invalid) email address for the creator and maintainer fields in the collection's collect.cfg file. You will need to open up collect/demo/etc/collect.cfg and clear the email values for the creator and maintainer properties (or else set these to a valid email again). Otherwise the OpenArchives validator will resort to using the demo collection's default dummy email to send the initial validation results to. Alternatively, you can simply remove the demo collection from being listed in the oai.cfg file's oaicollection property, which will cease to make the demo collection available over OAI. Note also that, if you wish to specify contact emails at a collection level, you will need to edit your greenstone installation's collect/<collection-name>/etc/collect.cfg file for those collections and set the creator and maintainer fields to the desired email address. If your collection contains document items for which you have not assigned any (Dublin Core, dc) metadata, the OAI validation can fail because it is dependent on having Metadata Formats listed even on a per record (per document) basis. Therefore, if your document has no dc metadata assigned, Greenstone won't know what OAI-supported metadata format is used by that document in order to list it. 
In practice, this means that you must either assign at least one dc.* metadata value to each document in your OAI collection, or set up an oaimapping in the oai.cfg file that maps existing metadata of whichever format to dc.* metadata. For instance, if you created an image collection without assigning any metadata and are happy to use the Title or Source metadata that Greenstone extracted for each image as the image document's "title", you could map either of these to the Dublin Core title field in the file oai.cfg. To do so, open oai.cfg in an editor, go down to the section specifying the oaimapping properties and add a new line: oaimapping Title oai_dc.title (Or: oaimapping SourceFile oai_dc.title). This step is not necessary for the backdrop collection if you assigned dc.* metadata to each image in the collection. Note: If the demo collection that comes with a Greenstone installation is not built, you will either need to build it before submitting your OAI server for inspection by the Open Archives validator, or adjust the oai.cfg file once more by removing the mention of demo from the oaicollection property. This is because the oai.cfg file lists the demo collection as being set up for OAI. If this collection is unbuilt, it will not be accessible to the OAI validator, and your oaiserver may fail tests as a result. If you are working with legacy collections (built before Greenstone version 2.85) you may have to rebuild them if you plan to make them available over OAI and be compliant with the Open Archives validator. Rebuilding old collections will recalculate the value for the repository. This calculation is different from Greenstone 2.85 onwards. Next you will need to set up your Greenstone server to be accessible from outside, so that external OAI clients can access it.
Go to the File → Settings menu of your Greenstone server interface dialog and check the option and also check the option (or the option) as its address resolution method. Press the button in the Greenstone Server Interface dialog that says (or it may say ). Your Digital Library home page will open up in a browser tab. Adjust this URL to have a suffix of oaiserver.cgi in place of the terminating library.cgi, then copy the resulting URL and visit http://www.openarchives.org/Register/ValidateSite. For this exercise, we will be visiting the Open Archives Validator, for which your OAIserver needs to provide a valid email address. In a text editor, open up your Greenstone installation's file again and set the value of the adminEmail element to the email address where the validation results are to be sent. If testing the behaviour of the resumptionToken, set the resumeAfter element to a low value like 5. Restart the Greenstone 3 server if it was running. Otherwise, go to Start → Greenstone → Greenstone3 Server to start up the server. When the library home page opens in your browser, change the library suffix in the URL to oaiserver, which is the baseURL of your OAI Server. Copy this URL and visit http://www.openarchives.org/Register/ValidateSite. The Open Archives Validator page will request the URL to your Greenstone OAI server. Paste the URL you have in your copy buffer into the field provided for this, and press the Validate baseURL button to start running the tests. You will be told to check your email to continue the remaining tests and to get the validation report. If the validator does not recognise the URL, make sure you have given the full domain of your host machine rather than just the host name. If that URL is still not accepted, visit the page again and check this works. If it doesn't, it may be your machine is not set up to be accessible to outside networks. 
Check your proxy settings, make sure you've set up port forwarding and that your firewall is not interfering.
<Text id="0733">Downloading over OAI</Text>
GLI can serve as an OAI client application: it can connect to a remote OAI server and retrieve metadata, and even download documents. The tutorial did not obtain the data from an external OAI-PMH server. This missing step is accomplished either by running a command-line program or by using the panel in the Librarian Interface. This exercise explains how to do this using both methods. In the previous exercise, we set up the Greenstone server to serve the Simple image collection (backdrop) over OAI. In this tutorial, we will use GLI to connect to that OAI server and download OAI metadata for the Simple image collection, along with its documents. The principle is the same if you wish to connect to other OAI servers.
Downloading using the Librarian Interface
Quit any running Greenstone installations. Launch GLI. This should launch the Greenstone server as well, if it is not already running, so that the OAI server is also up and running. In GLI, go to the panel. To the left, choose as the . On the right, set the field to contain the URL to your Greenstone OAI server. It will be of the form http://<hostname:portnumber>/greenstone/cgi-bin/oaiserver.cgi (Greenstone 2) or http://<hostname:portnumber>/greenstone3/oaiserver (Greenstone 3). Make sure that you can access this URL from your browser. If your computer is behind a firewall or proxy server, you will need to edit the proxy settings in the Librarian Interface. Click the button. Switch on the checkbox. Enter the proxy server address and port number in the and boxes. Click to get back to the section of the panel. If at this stage you press the (in the central row of buttons), a dialog will pop up with basic details about the OAI server. At the end, it will display the names of the sets available via that OAI Server.
A setSpec and a setName property will be defined for each available set. In our example, (the Simple Image collection) would be listed as one of the setNames with its setSpec as localsite:backdrop. Press the to close the dialog. Tick the checkbox as well as the checkbox. For the latter, type for the name the setSpec value of . Then tick . Also tick and include in the list of comma separated values for it so that it becomes jpg,doc,pdf,ppt. Next, tick and set it to 10. There will be 9 images in the collection, so we don't really need to set the Max records value, but this is a helpful feature that you can use when downloading from an OAI server. Finally, click , located beside the button. If you have set proxy information in , a popup will ask for your user name and password. Once the download has started, a progress bar appears in the lower half of the panel and reports on how the download is progressing. GLI will download OAI metadata and, because we have ticked the checkbox, it will also retrieve the actual documents, but no more than 10, because of the limit we placed on the number of records to download. After a while, it will have finished downloading. Change to the panel, and on the left-hand side, open up the folder. This is where Greenstone stores files you downloaded using the panel. In this case, it will contain a folder in which the OAI metadata files and images that you've just downloaded from your own Greenstone OAI server are stored. These files can then be added to a collection, as will be covered later in this tutorial.
Downloading using the command line
For command line downloading to work, your computer must have a direct connection to the Internet; being behind a firewall may interfere with the ability to download the information. You will need to use the Librarian Interface for downloading if you are behind a firewall. Close the Librarian Interface. Start up the Greenstone server again.
Open a DOS window to access the command-line prompt. This facility should be located somewhere within your Start → Programs menu, but details vary between different Windows systems. If you cannot locate it, select Start → Run and enter cmd in the popup window that appears. Before you start, you must set up your Greenstone environment in the terminal. In the DOS window, move to the home directory where you installed Greenstone. This is accomplished by something like: cd C:\Program Files\Greenstone Then type setup.bat (Greenstone 2) or gs3-setup.bat (Greenstone 3) to set up the ability to run Greenstone command-line programs. On Linux/Mac, you would run source setup.bash (Greenstone 2) or source gs3-setup.sh (Greenstone 3). GLI uses a perl script, downloadfrom.pl, to do the downloading. This can be run on the command line, outside of GLI. The script can download using several different protocols. These are specified using the -download_mode option. To see the available options for download mode, run perl -S downloadfrom.pl -h. The help output lists the currently available modes. For OAI downloading, use -download_mode OAI. To see the options for downloading using the OAI mode, you can run perl -S downloadinfo.pl OAIDownload. The options are the same as those in the GLI OAI download panel. We'll use the set and max_records OAI Download options to download a maximum of 5 OAI records from the backdrop collection at your Greenstone's OAI server, which was made available over OAI as a set in the previous tutorial.
For Greenstone 2:
perl -S downloadfrom.pl -download_mode OAI -url http://<hostname:portnumber>/greenstone/cgi-bin/oaiserver.cgi -set backdrop -max_records 5
For Greenstone 3:
perl -S downloadfrom.pl -download_mode OAI -url http://<hostname:portnumber>/greenstone3/oaiserver -set localsite:backdrop -max_records 5
The records (and optionally documents, if you additionally pass the -get_doc flag to the above command) will be downloaded into the folder from which the downloadfrom.pl script is run.
To change this, use the -cache_dir full-path-to-folder option and set its value to the full path of the destination folder you choose. You can import the downloaded documents into a new Greenstone collection and build them in the usual manner. Building the downloaded documents in GLI If you used GLI to download documents over OAI, as seen in the first part of the tutorial, you can find the downloaded items in the folder in the filesystem view on the left side of the panel. If you used the command line to download documents, the downloaded files will be stored wherever you ran the script from. Open GLI, locate files you downloaded over OAI and drag and drop these into a new Greenstone collection called . Because there are *.oai files among those downloaded, GLI will offer to add the . You may wish to go to the panel and remove the from the list of to speed up building. Switch to the panel and press the build button. During this stage, the will extract the metadata in the files and attach them to the associated files of the downloaded backdrop collection. You can see this once the collection has been built by switching to the panel and clicking on an oai file, as no metadata is set for such files. However, if you then click on a jpg file and scroll down, there will be metadata names that start with ex.dc. This refers to Greenstone-extracted Dublin Core metadata. and will be set to the values you had assigned the images in the tutorial A Simple Image Collection. Greenstone will have added additional ex.dc metadata in the form of , which is the source URL for this image. If you wish, you can now set up this collection in a manner similar to how the backdrop collection was set up in . Don't forget to copy in any specific format statements, then rebuild and preview the collection. 
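The setSpec and setName values mentioned earlier in this exercise are the two properties an OAI server reports for each set in its ListSets response. As a rough sketch of what a client does with such a response, the following parses a hand-written sample. The sample XML is illustrative rather than captured from a real Greenstone server, the setName value is a placeholder, and list_sets is a hypothetical helper; only the namespace and element names come from the OAI-PMH 2.0 specification.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample; the setName is a placeholder, not real server output.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set>
      <setSpec>localsite:backdrop</setSpec>
      <setName>Example set name</setName>
    </set>
  </ListSets>
</OAI-PMH>
"""


def list_sets(response_xml):
    """Return (setSpec, setName) pairs found in a ListSets response."""
    root = ET.fromstring(response_xml.strip())
    pairs = []
    for set_element in root.findall("oai:ListSets/oai:set", OAI_NS):
        spec = set_element.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        name = set_element.findtext("oai:setName", default="", namespaces=OAI_NS)
        pairs.append((spec, name))
    return pairs


if __name__ == "__main__":
    for spec, name in list_sets(SAMPLE_RESPONSE):
        print(f"{spec}: {name}")
```

The setSpec is the value to pass when restricting a download to one set, as with the -set option of downloadfrom.pl above.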
<Text id="0750">Use METS as Greenstone's Internal Representation</Text> In the Greenstone Librarian Interface, open up one of your existing collections, for example the Small HTML Collection collection. To be able to substitute for you need to be in mode. Click → → and change to mode. Switch to the panel and select . Remove from the list of plug-ins and add , with the default configuration options. Move this plugin to where was (just below ). Now change to the panel, locate the options for the import process and set to . Import options are not available unless you are in mode. Rebuild the collection. In your Windows file browser, locate the archives folder for the collection you are working with (in Greenstone3 → web → sites → localsite → collect → <collname> → archives). For each document in the collection, Greenstone has generated two files: docmets.xml, the core METS description, and doctxt.xml, a supporting file. (Note: unless you are connected to the Internet you may be unable to view doctxt.xml in your web browser, because it refers to a remote resource.) Depending on the source documents there may be additional files, such as the images used within a web page. One of METS' many features is the ability to reference information in external XML files. Greenstone uses this to tie the content of the document, which is stored in the external XML file doctxt.xml, to its hierarchical structure, which is described in the core METS file docmets.xml. <Text id="0760">Moving a collection from DSpace to Greenstone</Text> Start a new collection called StoneD and fill out its fields appropriately. In the panel add . Leave the plugin options at their defaults and press . Using the up arrow, move the position of to the top of the list (above ). In the panel, locate the folder sample_files → dspace. It contains five example items exported from a DSpace institutional repository. Copy them into your collection by dragging them over to the right-hand side of the panel. 
Cancel out any dialog offering to add plugins. Build the collection and preview it to see the basic defaults exhibited by a DSpace collection. If you browse by , you will find 7 documents listed, though only 5 items were exported from DSpace. Two of the original items had alternative forms in their directory folder. The DSpace plug-in options control what happens in such situations: the default is to treat them as separate Greenstone documents. Below we use a plug-in option () to fuse the alternative forms together. This option has the effect of treating documents with the same filename but different extensions as a single entity within a collection. One of the files is viewed as the primary document—it is indexed, and metadata is extracted from it if possible—while the others are handled as "associated files." The option takes as its argument a list of file extensions (separated by commas): the first one in the list that matches becomes the primary document. Select and click . Switch on its configuration option . Set its value to . Build and preview the collection. There are now only 5 documents, because only one version of each document has been included—the primary version. Adding indexing and browsing capabilities to match DSpace's The DSpace exported files contain Dublin Core metadata for title and author (amongst other things). In the panel, select . Delete the index, and add one for . Rename the index by going to the section in the panel. Select this index and change its value to ."_labelAuthor_". Using a macro for an index name means that it will display in the correct language (assuming that the macro has been translated). You can check Greenstone → macros → english.dm to see which macros are available. Go back to the panel, select . Select the classifier and click . Change the option to . Activate the option and set its value to . If not already active, activate the option. Then set it to . Finally, activate and set this to . Click to close the dialog. 
Now select the section of the panel, and select the format statement in the list of assigned format statements. Add the following text before the final </td> of the template: {If}{[ex.equivlink],<br>Also available as:[ex.equivlink]} <gsf:switch>
<gsf:metadata name="equivDocLink"/>
<gsf:when test="exists">
<br/>Also available as:
<gsf:metadata name="equivDocLink"/>
<gsf:metadata name="equivDocIcon"/>
<gsf:metadata name="/equivDocLink"/>
</gsf:when>
</gsf:switch> Also, let's add a format statement for the classifier based on metadata. In the menu (under on the panel), select the item that starts with: CL2: List -metadata Leave as the and click . Adjust the template of this format statement to make reference to <gsf:equivDocLink/> too, exactly as in the previous step. Then edit the text in the box. In Greenstone 2, replace {Or}{[dc.Title],[exp.Title],[ex.Title],Untitled} with {If}{[numleafdocs],([numleafdocs]) [ex.Title],[ex.dc.Title]}. In Greenstone 3, replace <xsl:call-template name="choose-title"/> with <gsf:metadata name="ex.dc.Title"/>. Then scroll down to the template for s. Here, replace: <gsf:metadata name="Title"/> with <gsf:metadata name="Title"/> (<gsf:metadata name="numleafdocs"/>) This will display the number of documents for each bookshelf in the classifier. And for individual documents within each bookshelf, it will display the . Build the collection once again and preview it. There are still only 5 documents, but against some of the entries appears the line "Also available as:" followed by icons that link to the alternative representations.
<Text id="0788">Moving a collection from Greenstone to DSpace</Text>
In this exercise you export a Greenstone collection in a form suitable for DSpace. It is possible to do this from the Librarian Interface's menu, which contains an item called , that allows you to export collections in different forms. However, to gain a deeper understanding of Greenstone, we perform the work by invoking a program from the Windows command-line prompt. This requires some technical skill; if you are not used to working in the command-line environment we recommend that you skip this exercise.
Using Greenstone from the command line
Open a DOS window to access the command-line prompt. This facility should be located somewhere within your Start → Programs menu, but details vary between different Windows systems. If you cannot locate it and you are running , select Start → Run and enter cmd in the popup window that appears.
In either or , click the Start button and type cmd in the search box at the bottom of the Start menu. In the DOS window, move to the home directory where you installed Greenstone. This is accomplished by something like: cd C:\Program Files\Greenstone Then type setup (Greenstone 2) or gs3-setup (Greenstone 3) to set up the ability to run Greenstone command-line programs. On a Linux or Mac machine, you would similarly open a terminal, change directory into your Greenstone installation's top-level folder and type source setup.bash (Greenstone 2) or source gs3-setup.sh (Greenstone 3). Change directory into the folder containing the StoneD collection you built in the last exercise: cd collect\stoned (Greenstone 2) or cd web\sites\localsite\collect\stoned (Greenstone 3). Run the following command to export the collection using the DSpace import/export format.
For Greenstone 2:
perl -S export.pl -saveas DSpace -removeold stoned
For Greenstone 3:
perl -S export.pl -saveas DSpace -site localsite -removeold stoned
Exporting in Greenstone is an additive process. If you ran the export.pl command once again, the newly exported files would be added, with different folder names, to those already in the export folder. For the kind of explorations we are conducting we might re-run the command several times. The -removeold option deletes files that have previously been exported. This command has created a new subfolder: collect → stoned → export (Greenstone 2) or web → sites → localsite → collect → stoned → export (Greenstone 3). Use the file browser to explore it. In it are the files needed to ingest this set of documents into DSpace. You could equally well run the export.pl command on a different Greenstone collection and transfer the output to a DSpace installation by using DSpace's batch-import facility.
<Text id="gems-1">Editing metadata sets</Text>
GEMS (Greenstone Editor for Metadata Sets) can be used to modify existing metadata sets or create new ones. GEMS is launched from the Librarian Interface when you want to create a new metadata set, or edit an existing one. In this exercise, we run GEMS outside of the Librarian Interface.
Running GEMS
Start the Greenstone Editor for Metadata Sets (GEMS): Start → All Programs → Greenstone-2.85 → Metadata Set Editor (GEMS) for Greenstone 2, or Start → All Programs → Greenstone-3.05 → Greenstone Editor for Metadata Sets (GEMS) for Greenstone 3. (If you're on Linux, use a terminal to run the gli/gems.sh start-up script.) GEMS starts up with no metadata set loaded. You can start a new set, or open an existing one, from the menu.
Creating a new metadata set
In this exercise, we will create a new metadata set. In order to save time, we will base it on an existing one: Development Library Subset. From the menu, select → . A popup window appears: . Fill in the fields. Use for the , for the , and select "Development Library Subset Example Metadata" from the drop down list. Click . The new metadata set will be displayed. The left hand side lists the elements (and sub-elements, if any) for the set, and the right hand side displays the set or element attributes. Since the new set was based on the Development Library Subset metadata set, it already contains all the elements from that set.
Adding a new element to a metadata set
Right click on in the left hand tree (or in the blank space in the left hand side) and choose from the menu that appears. In the popup window, type for the new element name, and click . The new element will appear in the list. In the right hand side, the default attributes will appear for the new element. "Label" and "definition" are used in the Librarian Interface when displaying metadata elements and their descriptions (the "definition" is shown as additional text for the element). These attributes can be set in multiple languages. Save the new metadata set by → , then close GEMS by → .
<Text id="indexers-1">Building and searching with different indexers</Text>
Greenstone supports three indexers: MG, MGPP and Lucene. MG is the original indexer used by Greenstone, and is described in the book "Managing Gigabytes".
It does section level indexing and compression of the source documents. MG is implemented in C. MGPP is a re-implementation of MG that provides word-level indexes and enables proximity, phrase and field searching. MGPP is implemented in C++ and is the default indexer for new collections. Lucene (http://lucene.apache.org/) is a java-based, full-featured text indexing and searching system developed by Apache. It provides a similar range of search functionality to MGPP with the addition of single-character wildcards and range searching. It was added to Greenstone to facilitate incremental collection building, which MG and MGPP can't provide. Build with Lucene Start a new collection ( → ) called Demo Lucene and base it on the Greenstone demo (demo)Demo Collection (lucene-jdbm-demo) collection, fill out its fields appropriately. In the panel, click and click Greenstone demo (demo)localsite → Demo Collection (lucene-jdbm-demo), it will show the documents in the Greenstone demo collection. Drag all 11 folders in the demo folder into the new collection. If you haven't installed the Greenstone demo (demo)Demo Collection (lucene-jdbm-demo) collection yet, you can download the demo.zip file from the link above, unzip it and put it into the collect folder in your Greenstone installation. Go to the panel, look at the metadata that is associated with each directory. Go to the section in the panel. The MGPP indexer is in use because the Greenstone Demo collection, which this collection is based on, uses the MGPP indexer. Go to the panel, look at the metadata that is associated with each directory. Go to the section in the panel. The Lucene indexer is already in use because the Demo Collection (lucene-jdbm-demo) collection, which this collection is based on, uses the Lucene indexer. Click the button at the right top corner of the panel. A new window will pop up for selecting the Indexers. After selecting an indexer, a brief description will appear in the box below. 
Select Lucene and click . Please note that the section may have changed accordingly. Build and preview the collection.
Search with Lucene
Lucene provides single-letter and multiple-letter wildcards and range searching. The query syntax can be quite complicated (for more information please see http://lucene.apache.org/java/docs/queryparsersyntax.html). Here we will learn how to use the wildcards while constructing queries. * is a multiple-letter wildcard. To perform a multiple-letter wildcard search, append * to the end of the query term. For example, econom* will search for words like econometrics, economist, economical, economy, which have the common part econom but different word endings. To perform a single-letter wildcard search, use ? instead. For example, a search for economi?? will only match words that have exactly two letters after economi, such as economist, economics, and economies. Please note that stopwords are used by default with the Lucene indexer, so searching for words like the will match 0 documents. This is explained in a message on the search page, which states that such words are too common and were ignored.
Build with MGPP
Start a new collection called Greenstone Demo MGPP and also base it on the Greenstone demo (demo) collection (Greenstone 2) or the Demo Collection (lucene-jdbm-demo) (Greenstone 3). In the panel, drag all 11 folders from → Greenstone demo (demo) (Greenstone 2), or from → localsite → Demo Collection (lucene-jdbm-demo) (Greenstone 3), into the new collection. In Greenstone 2, in the section of the panel, you will notice that the active indexer is MGPP, since this is the default. (If not, you'd click the button, select MGPP and click , in which case the section and its options may change accordingly.) In Greenstone 3, in the section of the panel, you will notice that the active indexer is Lucene. Click the button at the top right corner of the panel. A new window will pop up for selecting the indexers. After selecting an indexer, a brief description will appear in the box below. Select MGPP and click .
Please note that the section may have changed accordingly. There are three options at the bottom of the panel — , and . Notice that all three are enabled. Once an option is enabled, it will also appear in the collection's page and can be turned on or off from there. In the section, also select , if it isn't already, but make document the default. Build and preview the collection. Search with MGPP MGPP supports stemming, casefolding and accentfolding. By default, searching in collections built with MGPP indexer is set to and . So searching econom will return 0 documents. Searching for fao and FAO return the same result — 85 word counts and 11 matched documents. Go to the page by clicking the button at the top right corner. You can see that the option is set to and the option is set to . Sometimes we may want to ignore word endings while searching so as to match different variations of the term. Go to the page and change the option from to . Click the button. Click . This time try searching for econom again, 9 documents are found. Please note that word endings are determined according to the third-party stemming tables incorporated in Greenstone, not by the user. Thus the searches may not do precisely what is expected, especially when cultural variations or dialects are concerned. Besides, not all languages support stemming, only English and French have stemming at the moment. Go to the page and change back to to avoid confusion later on. Click the button. Sometimes we may want to search for the exact term, that is, differentiate the upper cases from lower cases. Back in the , set the option from to . Click the button. Click . Now try searching for fao and FAO respectively this time, notice the difference in the results? Go back to the page and change the option back to to avoid confusion later on. Click button. MGPP supports stemming, casefolding and accentfolding. By default, searching in collections built with MGPP indexer is set to and . 
So searching for econom will return 0 documents. Searching for fao will return 0 documents, whereas searching for FAO will return 89 word counts and 11 matched documents. Go to the page by clicking the button at the top right corner. You can see that stem is off, which means the word endings option is set to . And case (folding) is off too, which means the case difference option is set to .

Sometimes we may want to ignore word endings while searching so as to match different variations of the term. Change the option from to . This will change the search settings from the default, which is that the , to . Now try searching for econom again: 9 documents are found. Please note that word endings are determined by the third-party stemming tables incorporated in Greenstone, not by the user. Thus the searches may not do precisely what is expected, especially when cultural variations or dialects are concerned. In addition, not all languages support stemming; only English and French have stemming at the moment. Change the option back to () to avoid confusion later on.

Sometimes we may want to search for the exact term, that is, to distinguish upper case from lower case. In the page, the default settings already insist that upper/lower case must match (case folding is off). If you want to ignore case when searching, switch folding to (). Now try searching for fao and FAO in turn. Notice that the search results are the same for both this time.

Use search mode hotkeys with query term

MGPP has several hotkeys for setting the search modes for a query term. These hotkeys explicitly set the option and the option for the query being constructed. Use them in the plain or . #s and #u are hotkeys for the option. Appending #s to a query term will specifically enable the function. For example, click on the button and try searching for econom#s. 9 documents are found, which is the same as in the previous section. Remember that we have set it back to .
This means using hotkeys will override the current preference settings. Appending #u to a query term will explicitly set the current search to . Note that using hotkeys will only affect that query term. That is, hotkeys are used per term. For example, if a query expression contains more than one term, some terms can have hotkeys and others not, and the hotkeys can be different for different terms. This provides fine-grained control of the query, whereas changing settings in the page will affect the query as a whole: changing the controls for a search field on the page will apply to all the query terms in that field.

Hotkeys #i and #c control the case sensitivity. Appending #i to a query term will explicitly set the search to (i.e. case insensitive). For example, searching for fao#i returns 11 documents. In contrast, appending #c will specifically turn off casefolding, making the search case sensitive. For example, searching for fao#c returns 0 documents.

Finally, the hotkeys can also be used in combination. For example, you can append #uc to a query term so as to match the whole term (without stemming) and in its exact form (distinguishing upper case from lower case).

A quick reference of the search mode hotkeys in MGPP

#s    stemming on: word endings are ignored
#u    stemming off: the whole word must match
#i    casefolding on: case is ignored
#c    casefolding off: upper/lower case must match

<Text id="depositor-1">Incremental building of a collection</Text>

Collections built with the Lucene indexer support incremental addition, update and deletion of documents and metadata. By default, the import and build processes delete old index files in the index directory and intermediate files in the archives directory. With incremental building, the import and build processes keep the old files and only process the new or modified ones. Incremental import can be done with any collection, but incremental modification of the indexes can only be done for collections that use the Lucene indexer.
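The difference between a full and an incremental import described above can be illustrated with a small Python sketch. It is not Greenstone's implementation: the function name files_to_process is hypothetical, and using file modification times to detect new or changed documents is an assumption made only for illustration.

```python
from pathlib import Path


def files_to_process(import_dir, last_build_time, incremental=True):
    """Return the files that one import pass would need to handle.

    A full pass returns every file under import_dir. An incremental pass
    returns only the files modified after last_build_time (seconds since
    the epoch), leaving everything else untouched.
    Assumption: modification time stands in for "new or modified".
    """
    all_files = sorted(p for p in Path(import_dir).rglob("*") if p.is_file())
    if not incremental:
        return all_files
    return [p for p in all_files if p.stat().st_mtime > last_build_time]
```

With a large collection and only a handful of new documents, the incremental list stays short, which is the practical motivation for incremental building.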
The first part of this tutorial looks at using the Depositor for incremental building. The Depositor only supports addition of new documents and associated metadata. If you want to delete or modify existing documents and their metadata, you will need to use GLI or command line building. The Depositor is Greenstone's runtime support for institutional repositories. It provides the collection building workflow through a web interface. The Depositor only works with the Web library server, not the local library server. Greenstone users belonging to the user group have access to the Depositor.

Enabling the Depositor

For Windows users, first make sure that you are using a Web Server (e.g. Apache) instead of the Local Library Server. The binary installation of Greenstone will install Apache, but by default the Local Library Server will be used. To switch to using Apache, rename the GSDLHOME → server.exe file to something else. Then re-run the Greenstone Server from the Start → Greenstone Server menu. Note: You might need to set permissions for the GSDLHOME → tmp and GSDLHOME → collect or GSDLHOME → collect → your_accessible_collection directory.

In Greenstone, the Depositor is disabled by default. To enable it, edit the file GSDLHOME → etc → main.cfg. Look for the "depositor" line, and change disabled to enabled.

Setting a user group

Use of the Depositor involves an authentication step. A user will need a Greenstone account which belongs to an appropriate user group. The all-collections-editor user group gives access to edit any collection, while the ***-collection-editor group gives a user access to edit the *** collection, where *** is the collection's short name (or directory name). By default, the admin account is a member of the all-collections-editor group. The Greenstone admin pages are used to add new users and modify their group settings. Admin pages may have been enabled when you installed Greenstone.
If not, they can be activated by finding the "status" line in the main.cfg file and changing disabled to enabled. To access the administration pages, go to your Greenstone home page and click the (below the list of collections). To see the list of users, click the link on the left under the section. You will need to sign in. You can use the admin account, or any other account which has been added to the group. If you didn't set up the admin pages when you installed Greenstone, then a default admin account will be created with password "admin". Please change this immediately.

Let's modify the groups for the demo user. This user was added for the authentication demonstration collection to allow restricted access to some of the documents. If this user doesn't exist for you, create a new user by clicking on the link under the section on the left. Give it the name "demo" and password "demo". Click . Back in the Administration Pages, click the link and the new user "demo" should now be listed there.

We'll give this user access to modify the Demo Lucene collection that we will be using for this tutorial. If you have given the collection the title "Demo Lucene", then its short name is likely to be demoluce. You can check this in GLI: open the Demo Lucene collection, go to Format → General, and look for the collection folder item. Here we assume demoluce.

In the page, at the end of each user entry there are two links: and . Click on the user account, and you will be shown more detailed information about the demo user. Add demoluce-collection-editor at the end of the line, using a comma to separate group entries, so that the field now contains: demo,demoluce-collection-editor. (Note: if your collection short name is not demoluce, then replace demoluce with the appropriate name in ***-collection-editor.) Click . Click the link on the left side and return to the Greenstone home page.

Use the Depositor to do incremental addition

On the Greenstone library home page, click the button.
You will see a drop-down selection list of all the available collections. Select Demo Lucene from the list and sign in with the account. The next page asks you to fill in the metadata fields: , , , and . These metadata fields are from the Development Library Subset (DLS) metadata set, which is the metadata set used in the Demo Lucene collection. To ensure the new document is displayed in the classifiers, we will next specify this metadata for the new document. The default metadata fields that would be displayed here for a new collection are the , and from the Dublin Core Metadata Set. You can customize which metadata fields are required for items added through the Depositor in the section on the panel in the Greenstone Librarian Interface.

We are going to deposit this file: sample_files → demo_NewFiles → r9006e.htm. Double-click r9006e.htm and have a look at its content.

Type the following in the field: Selected guidelines for the management of records and archives: a RAMP reader (r9006e)
(Note: you can copy this and the following metadata values from sample_files → demo_NewFiles → r9006e-metadata.txt.)
In the field, type: UNESCO
In the field, type: Communication, Information and Documentation|Records and Archives Management Programme (RAMP) of UNESCO, Archive Management
In the field, type: manage records and archives
Finally, in the field, type: English

Click the button. Click the Choose File button and select sample_files → demo_NewFiles → new → r9006e.htm, click the button and check that the document has been uploaded successfully. Click the button and wait for the process to finish. You will see the message if the collection has been built successfully, or error messages if something has gone wrong. Click to preview the newly built collection and check that the newly added document is displayed correctly. For example, in the organizations classifier you should find a new bookshelf named , which contains the new document.
Batch addition with the Depositor

The Depositor also supports batch addition of new documents. This is achieved by zipping up the new documents (together with their metadata files) and depositing the zip file. Please note that the collection must have in order to be able to process the uploaded zip file; otherwise you need to add the first in the Librarian Interface.

Go to the Greenstone home page and click the button. Select Demo Lucene from the list and log in if asked to do so again. Leave the metadata fields blank, because the zip file we are adding contains files which specify these metadata values. Click the button, select sample_files → demo_NewFiles → new_files.zip, which contains two new HTML documents along with their associated images and files. Click and then the button. After the building is finished, click to preview the collection. On the collection's home page, it says the collection now contains 14 documents. Check the titles classifier to see that the new documents Above and beyond and Utilization and construction of pit silos have been added successfully.

A major benefit of using the Depositor is that the user can upload documents and metadata remotely, without having Greenstone installed at the client end. The Depositor is a tool for remote data input, and it also lets you deposit items into collections built with the MG or MGPP indexers. The difference is that the MG and MGPP indexers need to rebuild the entire index after adding a new item, while the Lucene indexer incrementally adds the new document to the existing index.
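If you want to prepare your own archive for the batch deposit described above, a zip file can be produced with any zip tool. As one possible approach, here is a minimal Python sketch using the standard zipfile module. The function name make_deposit_zip and the folder and file names in the usage note are hypothetical, and the sketch simply zips whatever is in the folder; it does not check that the contents are in a form Greenstone will accept.

```python
import zipfile
from pathlib import Path


def make_deposit_zip(source_dir, zip_path):
    """Zip every file under source_dir, keeping paths relative to it.

    Returns the list of archive member names that were written.
    """
    source = Path(source_dir)
    target = Path(zip_path).resolve()
    written = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            # Skip directories, and skip the archive itself in case it
            # was created inside source_dir.
            if path.is_file() and path.resolve() != target:
                name = path.relative_to(source).as_posix()
                archive.write(path, arcname=name)
                written.append(name)
    return written
```

For example, make_deposit_zip("my_new_docs", "my_new_docs.zip") would bundle a folder of new documents and their metadata files into a single archive, which could then be selected in the upload step.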