source: other-projects/nightly-tasks/diffcol/trunk/gs3-model-collect/PDFBox/archives/HASH1a9c.dir/doc.xml

Last change on this file was 38016, checked in by anupama, 8 months ago

AUTOCOMMIT by gen-model-colls.sh script. Message: Regenerating GS3 model collections except the Word-PDF-Enhanced* collections

File size: 56.3 KB
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "https://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
 <Description>
 <Metadata name="gsdldoctype">indexed_doc</Metadata>
 <Metadata name="Language">en</Metadata>
 <Metadata name="Encoding">utf8</Metadata>
 <Metadata name="Title">Greenstone: A Comprehensive Open-Source Digital Library Software System Ian H....</Metadata>
 <Metadata name="URL">http://Scratch/ak19/gs3-svn-model-4Sep2023/gs2build/tmp/F111.html</Metadata>
 <Metadata name="UTF8URL">http://Scratch/ak19/gs3-svn-model-4Sep2023/gs2build/tmp/F111.html</Metadata>
 <Metadata name="gsdlsourcefilename">import/pdf01.pdf</Metadata>
 <Metadata name="gsdlsourcefilerenamemethod">url</Metadata>
 <Metadata name="gsdlconvertedfilename">/Scratch/ak19/gs3-svn-model-4Sep2023/gs2build/tmp/F111.html</Metadata>
 <Metadata name="OrigSource">F111.html</Metadata>
 <Metadata name="Source">pdf01.pdf</Metadata>
 <Metadata name="SourceFile">pdf01.pdf</Metadata>
 <Metadata name="Plugin">PDFPlugin</Metadata>
 <Metadata name="FileSize">269487</Metadata>
 <Metadata name="FilenameRoot">pdf01</Metadata>
 <Metadata name="FileFormat">PDF</Metadata>
 <Metadata name="srcicon">_iconpdf_</Metadata>
 <Metadata name="srclink_file">doc.pdf</Metadata>
 <Metadata name="srclinkFile">doc.pdf</Metadata>
 <Metadata name="NumPages">9</Metadata>
 <Metadata name="Identifier">HASH1a9cea0f239f754007681b</Metadata>
 <Metadata name="lastmodified">1693810437</Metadata>
 <Metadata name="lastmodifieddate">20230904</Metadata>
 <Metadata name="oailastmodified">1693810552</Metadata>
 <Metadata name="oailastmodifieddate">20230904</Metadata>
 <Metadata name="assocfilepath">HASH1a9c.dir</Metadata>
 <Metadata name="gsdlassocfile">doc.pdf:application/pdf:</Metadata>
 </Description>
 <Content>
&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;&lt;b&gt;Greenstone: A Comprehensive Open-Source&lt;br /&gt;Digital Library Software System&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;i&gt;Ian H. Witten,* Rodger J. McNab,† Stefan J. Boddie,* David Bainbridge*&lt;br /&gt;&lt;/i&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;* Dept of Computer Science&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;University of Waikato, New Zealand&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;E-mail: {ihw, sjboddie, davidb}@cs.waikato.ac.nz&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;† Digilib Systems&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Hamilton, New Zealand&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;E-mail: [email protected]&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;ABSTRACT&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;This paper describes the Greenstone digital library&lt;br /&gt;software, a comprehensive, open-source system for the&lt;br /&gt;construction and presentation of information collections.&lt;br /&gt;Collections built with Greenstone offer effective full-text&lt;br /&gt;searching and metadata-based browsing facilities that are&lt;br /&gt;attractive and easy to use. Moreover, they are easily&lt;br /&gt;maintainable and can be augmented and rebuilt entirely&lt;br /&gt;automatically. The system is extensible: software&lt;br /&gt;“plugins” accommodate different document and metadata&lt;br /&gt;types.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;INTRODUCTION&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Notwithstanding intense research activity in the digital&lt;br /&gt;library field during the second half of the 1990s,&lt;br /&gt;comprehensive software systems for creating digital&lt;br /&gt;libraries are not widely available. In fact, the usual solution&lt;br /&gt;when creating a digital library is also the most&lt;br /&gt;obvious: just put it on the Web. 
But consider how much&lt;br /&gt;effort is involved in constructing a Web site for a digital&lt;br /&gt;library. To be effective it needs to be visually attractive&lt;br /&gt;and ergonomically easy to use, incorporate convenient and&lt;br /&gt;powerful searching capabilities, and offer rich and natural&lt;br /&gt;browsing facilities. Above all it must be easy to maintain&lt;br /&gt;and augment, which presents a significant challenge if any&lt;br /&gt;manual organization is involved.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The alternative is to automate these activities through&lt;br /&gt;software tools. But the broad scope of digital library&lt;br /&gt;requirements makes this a daunting prospect. Ideally the&lt;br /&gt;software should incorporate facilities ranging from&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;multilingual information retrieval to distributed computing&lt;br /&gt;protocols, from interoperability to search engine&lt;br /&gt;technology, from metadata standards to multiformat&lt;br /&gt;document parsing, from multimedia to multiple operating&lt;br /&gt;systems, from Web browsers to plug-and-play DVDs.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The Greenstone Digital Library Software from the New&lt;br /&gt;Zealand Digital Library (NZDL) project tackles this issue&lt;br /&gt;by providing a new way of organizing information and&lt;br /&gt;making it available over the Internet. A &lt;i&gt;collection&lt;/i&gt; of&lt;br /&gt;information comprises several (typically several thousand,&lt;br /&gt;or several million) &lt;i&gt;documents&lt;/i&gt;, and a uniform interface is&lt;br /&gt;provided to all documents in a collection. 
A library may&lt;br /&gt;include many different collections, each organized&lt;br /&gt;differently—though there is a strong family resemblance in&lt;br /&gt;how collections are presented.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Making information available using this system is far more&lt;br /&gt;than “just putting it on the Web.” The collection becomes&lt;br /&gt;maintainable, searchable, and browsable. Each collection,&lt;br /&gt;prior to presentation, undergoes a “building” process that,&lt;br /&gt;once established, is completely automatic. This process&lt;br /&gt;creates all the structures that are used at run-time for&lt;br /&gt;accessing the collection. Searching is based on various&lt;br /&gt;indexes, while browsing is based on various metadata;&lt;br /&gt;support structures for both are created during the building&lt;br /&gt;operation. When new material appears it can be fully&lt;br /&gt;incorporated into the collection by rebuilding.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;To address the exceptionally broad demands of digital&lt;br /&gt;libraries, the system is public and extensible. It is issued&lt;br /&gt;under the Gnu public license and, in the spirit of open-&lt;br /&gt;source software, users are invited to contribute&lt;br /&gt;modifications and enhancements. Only through an&lt;br /&gt;international cooperative effort will digital library software&lt;br /&gt;become sufficiently comprehensive to meet the world’s&lt;br /&gt;needs. Currently the Greenstone software is used at sites in&lt;br /&gt;Canada, Germany, New Zealand, Romania, UK, and the&lt;br /&gt;US, and collections range from newspaper articles to&lt;br /&gt;technical documents, from educational journals to oral&lt;br /&gt;history, from visual art to folksongs. 
The software has&lt;br /&gt;been used for collections in many different languages, and&lt;br /&gt;for CD-ROMs that have been published by the United&lt;br /&gt;Nations and other humanitarian agencies in Belgium,&lt;br /&gt;France, Japan, and the US for distribution in developing&lt;br /&gt;countries (Humanity Libraries, 1998; PAHO, 1999;&lt;br /&gt;UNESCO, 1999; UNU, 1998). Further details can be&lt;br /&gt;obtained from &lt;i&gt;www.nzdl.org&lt;/i&gt;.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;This paper sets the scene with a brief discussion of what a&lt;br /&gt;digital library is. We then give an overview of the facilities&lt;br /&gt;offered by Greenstone and show how end users find&lt;br /&gt;information in collections. Next we describe the files and&lt;br /&gt;directories involved in a collection, and then discuss the&lt;br /&gt;processes of updating existing collections and creating new&lt;br /&gt;ones, including extending the software to provide new&lt;br /&gt;facilities. We conclude with an overview of related work.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;WHAT IS A DIGITAL LIBRARY?&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Ten definitions of the term “digital library” have been&lt;br /&gt;culled from the literature by Fox (1998), and their spirit is&lt;br /&gt;captured in the following brief characterization:&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; &lt;i&gt;A collection of digital objects, including text,&lt;br /&gt;video, and audio, along with methods for access&lt;br /&gt;and retrieval, and for selection, organization&lt;br /&gt;and maintenance of the collection&lt;br /&gt;&lt;/i&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; (Akscyn and Witten, 1998). 
Lesk (1998) views digital&lt;br /&gt;libraries as “organized collections of digital information,”&lt;br /&gt;and wisely recommends that they articulate the principles&lt;br /&gt;governing what is included and how the collection is&lt;br /&gt;organized.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Digital libraries are generally distinguished from the&lt;br /&gt;World-Wide Web, the essential difference being in&lt;br /&gt;selection and organization. But they are not generally&lt;br /&gt;distinguished from a web &lt;i&gt;site&lt;/i&gt;: indeed, virtually all extant&lt;br /&gt;digital libraries manifest themselves as a web site. Hence&lt;br /&gt;the obvious question: to make a digital library, why not&lt;br /&gt;just put the information on the Web?&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; But we make a distinction between a digital library and a&lt;br /&gt;web site that lies at the heart of our software design: one&lt;br /&gt;should easily be able to add new material to a library&lt;br /&gt;without having to integrate it manually or edit its content&lt;br /&gt;in any way. Once added, new material should immediately&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;become a first-class component of the library. And what&lt;br /&gt;permits it to be integrated into existing searching and&lt;br /&gt;browsing structures without any manual intervention is&lt;br /&gt;&lt;i&gt;metadata&lt;/i&gt;. This provides sufficient focus to the concept of&lt;br /&gt;“digital library” to support the development of a&lt;br /&gt;construction kit.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;OVERVIEW OF GREENSTONE&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Information collections built by Greenstone combine&lt;br /&gt;extensive full-text search facilities with browsing indexes&lt;br /&gt;based on different metadata types. 
There are several ways&lt;br /&gt;for users to find information, although they differ between&lt;br /&gt;collections depending on the metadata available and the&lt;br /&gt;collection design. Typically you can &lt;i&gt;search for particular&lt;br /&gt;words&lt;/i&gt; that appear in the text, or within a section of a&lt;br /&gt;document, or within a title or section heading. You can&lt;br /&gt;&lt;i&gt;browse documents by title&lt;/i&gt;: just click on the displayed book&lt;br /&gt;icon to read it. You can &lt;i&gt;browse documents by subject&lt;/i&gt;.&lt;br /&gt;Subjects are represented by bookshelves: just click on a&lt;br /&gt;shelf to see the books. Where appropriate, documents&lt;br /&gt;come complete with a table of contents (constructed&lt;br /&gt;automatically): you can click on a chapter or subsection to&lt;br /&gt;open it, expand the full table of contents, or expand the full&lt;br /&gt;document.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; An example of searching is shown in Figure 1 where&lt;br /&gt;documents in the Global Help Project’s Humanity&lt;br /&gt;Development Library (HDL) are being searched for&lt;br /&gt;chapters matching the word &lt;i&gt;butterfly&lt;/i&gt;. In Figure 2 the same&lt;br /&gt;collection is being browsed by subject: by clicking on the&lt;br /&gt;bookshelf icons the user has discovered an item under&lt;br /&gt;Section 16, Animal Husbandry. Pursuing an interest in&lt;br /&gt;butterfly farming, the user selects a book by clicking on its&lt;br /&gt;book icon. In Figure 3 the front cover of the book is&lt;br /&gt;displayed as a graphic on the left, and the automatically&lt;br /&gt;constructed table of contents appears at the start of the&lt;br /&gt;document. 
The current focus, &lt;i&gt;Introduction and Summary&lt;/i&gt;,&lt;br /&gt;is shown in bold in the table of contents with its text&lt;br /&gt;starting further down the page.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; In accordance with Lesk’s advice, a statement of purpose&lt;br /&gt;and coverage accompanies each collection, along with an&lt;br /&gt;explanation of how it is organized (Figure 1 shows the&lt;br /&gt;start of this). A distinction is made between &lt;i&gt;searching&lt;/i&gt; and&lt;br /&gt;&lt;i&gt;browsing&lt;/i&gt;. Searching is full-text, and—depending on the&lt;br /&gt;collection’s design—the user can choose between indexes&lt;br /&gt;built from different parts of the documents, or from&lt;br /&gt;different metadata. Some collections have an index of full&lt;br /&gt;documents, an index of sections, an index of paragraphs,&lt;br /&gt;an index of titles, and an index of section headings, each of&lt;br /&gt;which can be searched for particular words or phrases.&lt;br /&gt;Browsing involves data structures created from metadata&lt;br /&gt;that the user can examine: lists of authors, lists of titles,&lt;br /&gt;lists of dates, hierarchical classification structures, and so&lt;br /&gt;on. Data structures for both browsing and searching are&lt;br /&gt;built according to instructions in a configuration file,&lt;br /&gt;which controls both building and serving the collection.&lt;br /&gt;Sample configuration files are discussed below.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;Figure 1: Searching the HDL collection&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt; Rich browsing facilities can be provided by manually&lt;br /&gt;linking parts of documents together and building explicit&lt;br /&gt;indexes and tables of contents. 
However, manually-created&lt;br /&gt;linking becomes difficult to maintain, and often falls into&lt;br /&gt;disrepair when a collection expands. The Greenstone&lt;br /&gt;software takes a different tack: it facilitates &lt;i&gt;maintainability&lt;br /&gt;&lt;/i&gt;by creating all searching and browsing structures&lt;br /&gt;automatically from the documents themselves. No links&lt;br /&gt;are inserted by hand. This means that when new&lt;br /&gt;documents in the same format become available, they can&lt;br /&gt;be added automatically. Indeed, for some collections this is&lt;br /&gt;done by processes that wake up regularly, scout for new&lt;br /&gt;material, and rebuild the indexes—all without manual&lt;br /&gt;intervention.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Collections comprise many documents: thousands, tens of&lt;br /&gt;thousands, or even millions. Each document may be&lt;br /&gt;hierarchically organized into &lt;i&gt;sections&lt;/i&gt; (subsections, sub-&lt;br /&gt;subsections, and so on). Each section comprises one or&lt;br /&gt;more &lt;i&gt;paragraphs&lt;/i&gt;. Metadata such as author, title, date,&lt;br /&gt;keywords, and so on, may be associated with documents,&lt;br /&gt;or with individual sections of documents. This is the raw&lt;br /&gt;material for indexes. It must either be provided explicitly&lt;br /&gt;for each document and section (for example, in an&lt;br /&gt;accompanying spreadsheet) or be derivable automatically&lt;br /&gt;from the source documents. Metadata is converted to&lt;br /&gt;Dublin Core and stored with the document for internal use.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; In order to accommodate different kinds of source&lt;br /&gt;documents, the software is organized so that “plugins” can&lt;br /&gt;be written for new document types. Plugins exist for plain&lt;br /&gt;text documents, HTML documents, email documents, and&lt;br /&gt;bibliographic formats. 
Word documents are handled by&lt;br /&gt;saving them as HTML; PostScript ones by applying a&lt;br /&gt;preprocessor (Nevill-Manning &lt;i&gt;et al.&lt;/i&gt;, 1998). Specially&lt;br /&gt;written plugins also exist for proprietary formats such as&lt;br /&gt;that used by the BBC archives department. A collection&lt;br /&gt;may have source documents in different forms: it is just a&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;matter of specifying all the necessary plugins. In order to&lt;br /&gt;build browsing indexes from metadata, an analogous&lt;br /&gt;scheme of “classifiers” is used: classifiers create indexes&lt;br /&gt;of various kinds based on metadata. Source documents are&lt;br /&gt;brought into the Greenstone system through a process&lt;br /&gt;called &lt;i&gt;importing&lt;/i&gt;, which uses the plugins and classifiers&lt;br /&gt;specified in the collection configuration file.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; The international Unicode character set is used throughout,&lt;br /&gt;so documents (and interfaces) can be written in any&lt;br /&gt;language. Collections have so far been produced in&lt;br /&gt;English, French, Spanish, German, Maori, Chinese, and&lt;br /&gt;Arabic. The NZDL Web site provides numerous examples.&lt;br /&gt;Collections can contain text, pictures, and even audio and&lt;br /&gt;video clips; a text-only version of the interface is also&lt;br /&gt;provided to accommodate visually impaired users.&lt;br /&gt;Compression technology is used to ensure best use of&lt;br /&gt;storage (Witten &lt;i&gt;et al.&lt;/i&gt;, 1999). Most non-textual material is&lt;br /&gt;either linked to textual documents or accompanied by&lt;br /&gt;textual descriptions (such as photo captions) to allow full-&lt;br /&gt;text searching and browsing. 
However, the architecture&lt;br /&gt;permits the implementation of plugins and classifiers even&lt;br /&gt;for non-textual data.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; The system includes an “administrative” function whereby&lt;br /&gt;specified users can examine the composition of all&lt;br /&gt;collections, protect documents so that they can only be&lt;br /&gt;accessed by registered users on presentation of a password,&lt;br /&gt;and so on. Logs of user activity are kept that record all&lt;br /&gt;queries made to every Greenstone collection (though this&lt;br /&gt;facility can be disabled).&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Although primarily designed for Internet access over the&lt;br /&gt;World-Wide Web, collections can be made available, in&lt;br /&gt;precisely the same form, on CD-ROM. In either case they&lt;br /&gt;are accessed through any Web browser. Greenstone CD-&lt;br /&gt;ROMs operate on a standalone PC under Windows 3.X,&lt;br /&gt;95, 98, and NT, and the interaction is identical to accessing&lt;br /&gt;the collection on the Web—except that response is faster&lt;br /&gt;and more predictable. The requirement to operate on early&lt;br /&gt;Windows systems is one that plagues the software design,&lt;br /&gt;but is crucial for many users—particularly those in&lt;br /&gt;underdeveloped countries seeking access to humanitarian&lt;br /&gt;aid collections. If the PC is connected to a network&lt;br /&gt;(intranet or Internet), a custom-built Web server provided&lt;br /&gt;on each CD makes exactly the same information available&lt;br /&gt;to others through their standard Web browser. The use of&lt;br /&gt;compression ensures that the greatest possible volume of&lt;br /&gt;information can be packed on to a CD-ROM.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; The collection-serving software operates under Unix and&lt;br /&gt;Windows NT, and works with standard Web servers. 
A&lt;br /&gt;flexible process structure allows different collections to be&lt;br /&gt;served by different computers, yet be presented to the user&lt;br /&gt;in the same way, on the same Web page, as part of the&lt;br /&gt;same digital library, even as part of the same collection&lt;br /&gt;(McNab and Witten, 1998). Existing collections can be&lt;br /&gt;updated and new ones brought on-line at any time, without&lt;br /&gt;bringing the system down; the process responsible for the&lt;br /&gt;user interface will notice (through periodic polling) when&lt;br /&gt;new collections appear and add them to the list presented&lt;br /&gt;to the user.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;Figure 2: Browsing the HDL collection by subject&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;&lt;b&gt;FINDING INFORMATION&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Greenstone digital library systems generally include&lt;br /&gt;several separate collections. A home page allows you to&lt;br /&gt;select a collection; in addition, each collection has its own&lt;br /&gt;“about” page that gives you information about how the&lt;br /&gt;collection is organized and the principles governing what&lt;br /&gt;is included.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; All icons in the screenshots of Figures 1–4 are clickable.&lt;br /&gt;Those icons at the top of the page return to the home page,&lt;br /&gt;provide help text, and allow you to set user interface and&lt;br /&gt;searching preferences. The navigation bar underneath&lt;br /&gt;gives access to the searching and browsing facilities,&lt;br /&gt;which differ from one collection to another.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Each of the five buttons provides a different way to find&lt;br /&gt;information. 
You can &lt;i&gt;search for particular words&lt;/i&gt; that&lt;br /&gt;appear in the text from the “search” page (or from the&lt;br /&gt;“about” page of Figure 1). This collection contains indexes&lt;br /&gt;of chapters, section titles, and entire books. The default&lt;br /&gt;search interface is a simple one, suitable for casual users;&lt;br /&gt;advanced searching—which allows full Boolean&lt;br /&gt;expressions, phrase searching, case and stemming&lt;br /&gt;control—can be enabled from the &lt;i&gt;Preferences&lt;/i&gt; page.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; This collection has four browsable metadata indexes. You&lt;br /&gt;can &lt;i&gt;access publications by subject&lt;/i&gt; by clicking the &lt;i&gt;subjects&lt;br /&gt;&lt;/i&gt;button, which brings up a list of subjects, represented by&lt;br /&gt;bookshelves (Figure 2). You can &lt;i&gt;access publications by&lt;br /&gt;title&lt;/i&gt; by clicking &lt;i&gt;titles a-z&lt;/i&gt; (Figure 4), which brings up a list&lt;br /&gt;of books in alphabetic order. You can &lt;i&gt;access publications&lt;br /&gt;by organization&lt;/i&gt; (i.e. Dublin Core “publisher”), bringing up&lt;br /&gt;a list of organizations. You can &lt;i&gt;access publications by&lt;br /&gt;“how to” listing&lt;/i&gt;, yielding a list of hints defined by the&lt;br /&gt;collection’s editors. 
We use the Dublin Core as a base and&lt;br /&gt;extend it in an &lt;i&gt;ad hoc&lt;/i&gt; manner to accommodate the&lt;br /&gt;individual requirements of collection designers.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;FILES IN A COLLECTION&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; When a new collection is created or material is added to an&lt;br /&gt;existing one, the original source documents are first&lt;br /&gt;brought into the system through a process known as&lt;br /&gt;“importing.” This involves converting documents into a&lt;br /&gt;simple HTML-like format known as GML (for&lt;br /&gt;“Greenstone Markup Language”), which includes any&lt;br /&gt;metadata associated with the document. Documents are&lt;br /&gt;assumed to be in the Unicode UTF-8 code (of which the&lt;br /&gt;ASCII characters form a subset).&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; &lt;b&gt;Files and directories&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; There is a separate directory for each collection, which&lt;br /&gt;contains five subdirectories: the original raw material&lt;br /&gt;(&lt;i&gt;import&lt;/i&gt;), the GML files created from this (&lt;i&gt;archives&lt;/i&gt;), the&lt;br /&gt;final collection as it is served to users (&lt;i&gt;index&lt;/i&gt;), a directory&lt;br /&gt;for use during the building process (&lt;i&gt;building&lt;/i&gt;), and one for&lt;br /&gt;any supporting files (&lt;i&gt;etc&lt;/i&gt;)—including the configuration file&lt;br /&gt;that controls the collection creation procedure. 
Additional&lt;br /&gt;files might be required: for example, building a hierarchy&lt;br /&gt;of classifications requires a data file of sub-classifications.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; &lt;b&gt;The imported documents&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; In order to identify documents internally, a unique object&lt;br /&gt;identifier or OID is assigned to each original source&lt;br /&gt;document when it is imported (formed by hashing the&lt;br /&gt;content, to overcome file duplication effects caused by&lt;br /&gt;mirroring) and stored as metadata within that document. It&lt;br /&gt;is important that OIDs persist throughout the index-&lt;br /&gt;building process—so that a user’s search history is&lt;br /&gt;unaffected by rebuilding the collection. OIDs are assigned&lt;br /&gt;by hashing the contents of the original source document.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Once imported, each document is stored in its own&lt;br /&gt;subdirectory of &lt;i&gt;archives&lt;/i&gt;, along with any associated&lt;br /&gt;files—for example, images. To ensure compatibility with&lt;br /&gt;Windows 3.0, only eight characters are used in directory&lt;br /&gt;and file names, which causes annoying but essentially&lt;br /&gt;trivial complications.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; &lt;b&gt;Inside the documents&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; The GML format imposes a limited amount of structure on&lt;br /&gt;documents. Documents are divided into paragraphs. They&lt;br /&gt;can be split hierarchically into sections and subsections.&lt;br /&gt;OIDs are extended to identify these components by&lt;br /&gt;appending numbers, separated by periods, to a document’s&lt;br /&gt;OID. When a book is read, its section hierarchy is visible&lt;br /&gt;as the table of contents (Figure 3). Chapters, sections,&lt;br /&gt;subsections, and pages are all implemented simply as&lt;br /&gt;“sections” within the document. 
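&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The identifier scheme described above (an OID formed by hashing the content, extended with period-separated numbers for sections) can be sketched roughly in Python as follows. The choice of SHA-1, the 24-character truncation, the HASH prefix, and the function names make_oid and section_oid are all assumptions made for illustration; the text does not specify the hashing algorithm that Greenstone actually uses.&lt;br /&gt;&lt;/p&gt;

```python
import hashlib
from typing import Sequence


def make_oid(content: bytes) -> str:
    """Derive a document OID from its content (assumed scheme: truncated SHA-1)."""
    digest = hashlib.sha1(content).hexdigest()
    # Identical content always hashes to the same value, so mirrored
    # duplicates share one OID and the OID survives a rebuild.
    return "HASH" + digest[:24]


def section_oid(doc_oid: str, path: Sequence[int]) -> str:
    """Extend a document OID with period-separated section numbers."""
    return ".".join([doc_oid] + [str(n) for n in path])


doc = make_oid(b"example document text")
# OID of subsection 1 within section 2 of the document
print(section_oid(doc, [2, 1]))
```

&lt;p&gt;Because the identifier depends only on the content, re-importing an unchanged file should yield the same OID under this sketch, which is the persistence property the text relies on.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;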
In some collections&lt;br /&gt;documents do not have a hierarchical subsection structure,&lt;br /&gt;but are split into pages to permit browsing within a&lt;br /&gt;retrieved document.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; The document structure is used for searchable indexes.&lt;br /&gt;There are three levels of index: &lt;i&gt;documents&lt;/i&gt;, &lt;i&gt;sections&lt;/i&gt;, and&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;Figure 3: Reading a book in the HDL&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;&lt;i&gt;paragraphs&lt;/i&gt;, corresponding to the distinctions that GML&lt;br /&gt;makes—the hierarchical structure is flattened for the&lt;br /&gt;purposes of creating these indexes. Indexes can be of text,&lt;br /&gt;or metadata, or any combination. Thus you can create a&lt;br /&gt;searchable index of section titles, and/or authors, and/or&lt;br /&gt;document descriptions, as well as the document text.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;UPDATING EXISTING COLLECTIONS&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Updating an existing collection with new files in the same&lt;br /&gt;format is easy. For example, the raw material for the HDL&lt;br /&gt;is supplied in the form of HTML files marked up with&lt;br /&gt;&amp;lt;&amp;lt;TOC&amp;gt;&amp;gt; tags to split books into sections and&lt;br /&gt;subsections, and &amp;lt;&amp;lt;I&amp;gt;&amp;gt; tags to indicate where an image is&lt;br /&gt;to be inserted. 
For each book in the library there is a&lt;br /&gt;directory that contains a single HTML file representing the&lt;br /&gt;book, and separate files containing the associated images.&lt;br /&gt;An accompanying spreadsheet file contains the&lt;br /&gt;classification hierarchy; this is converted to a simple file&lt;br /&gt;format (using Excel’s &lt;i&gt;Save As&lt;/i&gt; command).&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Since the collection exists, its directory is already set up&lt;br /&gt;with subdirectories &lt;i&gt;import&lt;/i&gt;, &lt;i&gt;archives&lt;/i&gt;, &lt;i&gt;building&lt;/i&gt;, &lt;i&gt;index&lt;/i&gt;, and&lt;br /&gt;&lt;i&gt;etc&lt;/i&gt;, and the &lt;i&gt;etc&lt;/i&gt; directory will contain a suitable collection&lt;br /&gt;configuration file.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; &lt;b&gt;The updating procedure&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; To update a collection, the new raw material is placed in&lt;br /&gt;the &lt;i&gt;import&lt;/i&gt; directory, in whatever form it is available. Then&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;the &lt;i&gt;import&lt;/i&gt; process is invoked, which converts the files into&lt;br /&gt;GML using the specified plugins. Old material for which&lt;br /&gt;GML files have previously been created is not re-imported.&lt;br /&gt;Then the &lt;i&gt;build&lt;/i&gt; process is invoked to build the requisite&lt;br /&gt;indexes for the collection. Finally, the contents of the&lt;br /&gt;&lt;i&gt;building&lt;/i&gt; directory are moved into the &lt;i&gt;index&lt;/i&gt; directory, and&lt;br /&gt;the new version of the collection automatically becomes&lt;br /&gt;live.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; This procedure may seem cumbersome. 
But all the steps&lt;br /&gt;are necessary for efficient operation with large collections.&lt;br /&gt;The &lt;i&gt;import&lt;/i&gt; process could be performed on the fly during&lt;br /&gt;the building operation—but because building indexes is a&lt;br /&gt;multipass operation, the often lengthy importing would be&lt;br /&gt;repeated several times. The &lt;i&gt;build&lt;/i&gt; process can take&lt;br /&gt;considerable time—a day or two, for very large&lt;br /&gt;collections. Consequently, the results are placed in the&lt;br /&gt;&lt;i&gt;building&lt;/i&gt; directory so that, if the collection already exists, it&lt;br /&gt;will continue to be served to users in its old form&lt;br /&gt;throughout the building operation.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Active users of the collection will not be disturbed when&lt;br /&gt;the new version becomes live—they will probably not&lt;br /&gt;even notice. The persistent OIDs ensure that interactions&lt;br /&gt;remain coherent—users who are examining the results of a&lt;br /&gt;query or browse operation will still retrieve the expected&lt;br /&gt;documents—and if a search is actually in progress when&lt;br /&gt;the change takes place the program detects the resulting&lt;br /&gt;file-structure inconsistency and automatically and&lt;br /&gt;transparently re-executes the query, this time on the new&lt;br /&gt;version of the collection.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; &lt;b&gt;How it works&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; The original material in the &lt;i&gt;import&lt;/i&gt; directory may be in any&lt;br /&gt;format, and plugins are required to process each format&lt;br /&gt;type. The plugins that a collection uses must be specified&lt;br /&gt;in the collection configuration file. The &lt;i&gt;import&lt;/i&gt; program&lt;br /&gt;reads the list of plugins and passes each document to each&lt;br /&gt;plugin in order until it finds one that can process it. 
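&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The plugin dispatch just described (each document is offered to each plugin in turn until one accepts it) can be sketched in Python as follows. This is a minimal illustration, not Greenstone's implementation: the plugin signature, the names text_plugin, html_plugin and import_document, and the use of file extensions to recognise formats are all assumptions.&lt;br /&gt;&lt;/p&gt;

```python
from typing import Callable, List, Optional

# Hypothetical plugin interface: takes (filename, raw bytes) and returns the
# converted text, or None if the plugin does not recognise the document.
Plugin = Callable[[str, bytes], Optional[str]]


def text_plugin(name: str, data: bytes) -> Optional[str]:
    if name.endswith(".txt"):
        return data.decode("utf-8")
    return None


def html_plugin(name: str, data: bytes) -> Optional[str]:
    if name.endswith((".html", ".htm")):
        # A real plugin would convert to the internal format here.
        return data.decode("utf-8")
    return None


def import_document(name: str, data: bytes, plugins: List[Plugin]) -> Optional[str]:
    """Offer the document to each plugin in order; the first to accept it wins."""
    for plugin in plugins:
        result = plugin(name, data)
        if result is not None:
            return result
    return None  # no configured plugin could process this document


# Order matters: it stands in for the plugin list in the configuration file.
plugins = [text_plugin, html_plugin]
print(import_document("notes.txt", b"hello", plugins))
```

&lt;p&gt;In this sketch a document that no plugin recognises is simply skipped, which is why every plugin needed for new material has to be listed in advance.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;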
When&lt;br /&gt;updating an existing collection, all plugins necessary to&lt;br /&gt;process new material should already have been specified in&lt;br /&gt;the configuration file.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; The building step creates the indexes for both searching&lt;br /&gt;and browsing. The MG software is generally used to do the&lt;br /&gt;searching (Witten &lt;i&gt;et al.&lt;/i&gt;, 1999), and the &lt;i&gt;mgbuild&lt;/i&gt; module is&lt;br /&gt;automatically invoked to create each of the indexes that is&lt;br /&gt;required. For example, the Humanity Development Library&lt;br /&gt;has three indexes, one for entire books, one for chapters,&lt;br /&gt;and one for section titles. Subdirectories of the &lt;i&gt;index&lt;br /&gt;&lt;/i&gt;directory are created for each of these indexes.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;Figure 4: Browsing titles in the HDL&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt; MG also compresses the text of the collection; and the&lt;br /&gt;image files are linked into the &lt;i&gt;index&lt;/i&gt; subdirectory. Now&lt;br /&gt;none of the material in the &lt;i&gt;import&lt;/i&gt; and &lt;i&gt;archives&lt;/i&gt; directories&lt;br /&gt;is needed to run the collection and can be removed from&lt;br /&gt;the file system (though they would be needed if the&lt;br /&gt;collection were rebuilt).&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Associated with each collection is a database stored in&lt;br /&gt;GDBM (Gnu database manager) format. This contains an&lt;br /&gt;entry for each document, giving its OID, its internal MG&lt;br /&gt;document number, and metadata such as title. 
Information&lt;br /&gt;for each of the browsing indexes, which appear as buttons&lt;br /&gt;on the Greenstone search/browse bar, is also extracted&lt;br /&gt;during the building process and stored in the database. A&lt;br /&gt;“classifier” program is required for each browsing index to&lt;br /&gt;extract the appropriate information from GML documents.&lt;br /&gt;Like plugins, classifiers are written on an &lt;i&gt;ad hoc&lt;/i&gt; basis for&lt;br /&gt;the particular information required, and where possible&lt;br /&gt;reused from one collection to another.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; The building program creates the indexes based on&lt;br /&gt;whatever appears in the &lt;i&gt;archives&lt;/i&gt; directory. The first plugin&lt;br /&gt;specified by all collections is one that processes GML&lt;br /&gt;files, and so if &lt;i&gt;archives&lt;/i&gt; contains imported files they will be&lt;br /&gt;processed correctly. If it contains material in the original&lt;br /&gt;format, that will be converted using the appropriate plugin.&lt;br /&gt;Thus the import process is optional.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; GML is designed to be fast and easy to parse, an important&lt;br /&gt;requirement when millions of documents are to be&lt;br /&gt;processed. Something as simple as requiring tags to be&lt;br /&gt;lower-case, for example, yields a substantial speed-up. In&lt;br /&gt;certain circumstances, however, it might be preferable to&lt;br /&gt;use a standardized format such as XML. This is&lt;br /&gt;straightforward to implement (just write an XML&lt;br /&gt;plugin), although we have not done so ourselves. 
Given&lt;br /&gt;the transitory nature of the imported data, to date, we have&lt;br /&gt;found GML a satisfactory and beneficial format.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;CREATING NEW COLLECTIONS&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Building new collections from scratch is only slightly&lt;br /&gt;different from updating an existing collection. The key&lt;br /&gt;new requirement is creating a collection configuration file,&lt;br /&gt;and a software utility is provided to help. Two pieces of&lt;br /&gt;information are required for this: the name of the directory&lt;br /&gt;that the collection will use (into which the source data and&lt;br /&gt;other files will eventually be placed), and a contact e-mail&lt;br /&gt;address for use if any problems are encountered by the&lt;br /&gt;software once the collection is up and running. The utility&lt;br /&gt;creates files and directories within the newly-named&lt;br /&gt;directory to support a generic collection of plain text&lt;br /&gt;documents. With suitable data placed in the &lt;i&gt;import&lt;br /&gt;&lt;/i&gt;directory, building the collection at this point will yield a&lt;br /&gt;document-level searchable index of all the text and a&lt;br /&gt;browsable list of “titles” (defined in this case to be the&lt;br /&gt;document filenames).&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; To enhance the functionality and presentation— something&lt;br /&gt;anything but the most trivial collection will require—the&lt;br /&gt;configuration file must be edited. 
For a collection sourced&lt;br /&gt;from documents in an already supported data format,&lt;br /&gt;presented in a similar fashion to an existing collection, the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;creator [email protected] 1&lt;br /&gt;maintainer [email protected] 2&lt;br /&gt;public True 3&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;4&lt;br /&gt;indexes document:text 5&lt;br /&gt;defaultindex document:text 6&lt;br /&gt;plugins GMLPlug TEXTPlug ArcPlug RecPlug 7&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;8&lt;br /&gt;classify AZList metadata=Title 9&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;10&lt;br /&gt;collectionmeta collectionname &amp;quot;generic text collection&amp;quot; 11&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(a) collectionmeta .document:text &amp;quot;documents&amp;quot; 12&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;creator [email protected] 1&lt;br /&gt;maintainer [email protected] 2&lt;br /&gt;public True 3&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;4&lt;br /&gt;indexes document:text document:From 5&lt;br /&gt;defaultindex document:text 6&lt;br /&gt;plugins GMLPlug EMAILPlug ArcPlug RecPlug 7&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;8&lt;br /&gt;classify AZList metadata=Title 9&lt;br /&gt;classify DateList 10&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;11&lt;br /&gt;collectionmeta collectionname &amp;quot;Email messages&amp;quot; 12&lt;br /&gt;collectionmeta .document:text &amp;quot;documents&amp;quot; 13&lt;br /&gt;collectionmeta .document:From &amp;quot;email senders&amp;quot; 14&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;15&lt;br /&gt;format QueryResults \\ 16&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(b) &amp;lt;td&amp;gt;[link][icon][/link]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[Title]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[Author]&amp;lt;/td&amp;gt; 17&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;Figure 5: Collection configuration files (a) generic, (b) for an email collection&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;br 
/&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;amount of editing is minimal. Importing new data formats&lt;br /&gt;and browsing metadata in ways not currently supported are&lt;br /&gt;more complex activities that require programming skills.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; &lt;b&gt;Modifying the configuration file&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Figure 5b shows simple alterations to the generic&lt;br /&gt;configuration file in Figure 5a that was generated by the&lt;br /&gt;new-collection utility. &lt;i&gt;TEXTPlug&lt;/i&gt; is replaced with&lt;br /&gt;&lt;i&gt;EMAILPlug&lt;/i&gt; (line 7) which reads email files and extracts&lt;br /&gt;metadata (&lt;i&gt;From&lt;/i&gt;, &lt;i&gt;To&lt;/i&gt;, &lt;i&gt;Date&lt;/i&gt;, &lt;i&gt;Subject&lt;/i&gt;) from them. A classifier&lt;br /&gt;for dates is added (line 10) to make the collection&lt;br /&gt;browsable chronologically. The default presentation of&lt;br /&gt;search results is overridden (line 17) to display both the&lt;br /&gt;title of the message (i.e. Dublin Core &lt;i&gt;Title&lt;/i&gt;) and its sender&lt;br /&gt;(i.e. Dublin Core &lt;i&gt;Author&lt;/i&gt;). Elements in square brackets,&lt;br /&gt;such as &lt;i&gt;[Title]&lt;/i&gt;, are replaced by the metadata associated&lt;br /&gt;with a particular document. The built-in term &lt;i&gt;[icon]&lt;br /&gt;&lt;/i&gt;produces a suitable image that represents the document&lt;br /&gt;(such as a book icon or page icon), and the &lt;i&gt;[link]
[/link]&lt;br /&gt;&lt;/i&gt;construct forms a hyperlink to the complete document.&lt;br /&gt;Anything else in the format statement, which in this case is&lt;br /&gt;solely table-cell tags in HTML, is passed through to the&lt;br /&gt;page being displayed.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;As this example shows, creating a new collection that stays&lt;br /&gt;within the bounds of the library’s established capabilities&lt;br /&gt;falls within the capability of many computer users—for&lt;br /&gt;instance, computer-trained librarians. Extending&lt;br /&gt;Greenstone to handle new document formats and browse&lt;br /&gt;metadata in new ways is more challenging.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; &lt;b&gt;Writing new plugins and classifiers&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Extensibility is obtained through plugins and classifiers.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; These are modules of code that can be slotted into the&lt;br /&gt;system to enhance its capabilities. Plugins parse&lt;br /&gt;documents, extracting the text and metadata to be indexed.&lt;br /&gt;Classifiers control how metadata is brought together to&lt;br /&gt;form browsable data structures. Both are specified in an&lt;br /&gt;object-oriented framework using inheritance to minimize&lt;br /&gt;the amount of code written.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; A plugin must specify three things: what file formats it can&lt;br /&gt;handle, how they should be parsed, and whether the plugin&lt;br /&gt;is recursive. File formats are normally determined using&lt;br /&gt;regular expression matching on the filename. For example,&lt;br /&gt;the HTML plugin accepts all files that end in &lt;i&gt;.htm&lt;/i&gt;, .&lt;i&gt;html&lt;/i&gt;,&lt;br /&gt;&lt;i&gt;.HTM&lt;/i&gt;, or &lt;i&gt;.HTML&lt;/i&gt;. (It is quite possible, however, to write&lt;br /&gt;plugins that “look inside” the file as well.) 
For other files,&lt;br /&gt;the plugin returns &lt;i&gt;undefined&lt;/i&gt; and the file is passed to the&lt;br /&gt;next plugin in the collection’s configuration file (e.g.&lt;br /&gt;Figure 5 line 7). If it can, the plugin parses the file and&lt;br /&gt;returns the number of documents processed. This involves&lt;br /&gt;extracting text and metadata and adding it to the library’s&lt;br /&gt;content through calls to &lt;i&gt;add text&lt;/i&gt; and &lt;i&gt;add metadata&lt;/i&gt;.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Some plugins (“recursive” ones) add extra files into the&lt;br /&gt;stream of data processed during the building phase by&lt;br /&gt;artificially reactivating the list of plugins. This is how&lt;br /&gt;directory hierarchies are traversed.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Plugins are small modules of code that are easy to write.&lt;br /&gt;We monitored the time it took to develop a new one that&lt;br /&gt;was different to any we had produced so far. We chose to&lt;br /&gt;make as an example a collection of HTML bookmark files,&lt;br /&gt;the motivation being to produce a convenient way of&lt;br /&gt;searching and browsing one’s bookmarked Web pages.&lt;br /&gt;Figure 6 shows a user searching for bookmarked pages&lt;br /&gt;about &lt;i&gt;music&lt;/i&gt;. The new plugin took under an hour to write,&lt;br /&gt;and was 160 lines long (ignoring blank lines and&lt;br /&gt;comments)—about the average length of existing plugins.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Classifiers are more general than plugins because they&lt;br /&gt;work on GML-format data. For example, any plugin that&lt;br /&gt;generates date metadata in accordance with the Dublin&lt;br /&gt;core can request the collection to be browsable&lt;br /&gt;chronologically by specifying the &lt;i&gt;DateList&lt;/i&gt; classifier in the&lt;br /&gt;collection’s configuration file (Figure 7). 
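A chronological classifier of this kind might, in outline, group documents by the year of their date metadata. The Python sketch below is illustrative only: `group_by_year`, the `Date` metadata key and the `YYYYMMDD` string format are assumptions made for the example, not the `DateList` implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_year(documents: Iterable[Tuple[str, Dict[str, str]]]) -> Dict[str, List[str]]:
    """Group document OIDs by year, ordered by date within each year.

    Each document is an (oid, metadata) pair; the date is assumed to be a
    YYYYMMDD string stored under the key "Date". Undated documents are skipped.
    """
    buckets = defaultdict(list)
    for oid, metadata in documents:
        date = metadata.get("Date")
        if not date:
            continue  # nothing to classify on
        buckets[date[:4]].append((date, oid))
    # Sort years, then sort the documents inside each year by full date.
    return {
        year: [oid for _, oid in sorted(entries)]
        for year, entries in sorted(buckets.items())
    }


docs = [
    ("HASH02", {"Date": "19990315", "Title": "Spring issue"}),
    ("HASH01", {"Date": "19980101", "Title": "First issue"}),
    ("HASH03", {"Date": "19990101", "Title": "New year issue"}),
    ("HASH04", {"Title": "Undated flyer"}),
]
print(group_by_year(docs))
```

Because the sketch reads only the metadata and never the source file, the same grouping would apply to any collection whose plugins supply a date in the assumed format.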
Classifiers are&lt;br /&gt;more elaborate than most plugins, but new ones are seldom&lt;br /&gt;required. The average length of existing classifiers is 230&lt;br /&gt;lines.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Classifiers must specify three things: an initialization&lt;br /&gt;routine, how individual documents are classified, and the&lt;br /&gt;final browsable data structure. Initialization takes care of&lt;br /&gt;any options specified in the configuration file (such as&lt;br /&gt;&lt;i&gt;metadata=Title &lt;/i&gt;on line 9 of Figure 5b). Classifying&lt;br /&gt;individual documents is an iterative process: for each one,&lt;br /&gt;a call to &lt;i&gt;document-classify&lt;/i&gt; is made. On presentation of the&lt;br /&gt;document’s OID, the necessary metadata is located and&lt;br /&gt;used to control where the document is added to the&lt;br /&gt;browsable data structure being constructed.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; Once all documents have been added, a request is made for&lt;br /&gt;the completed data structure. Some classifiers return the&lt;br /&gt;data structure directly; others transform the data structure&lt;br /&gt;before it is returned. 
For example, the &lt;i&gt;AZList&lt;/i&gt; classifier&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;Figure 6: Searching bookmarked Web pages&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;divides the alphabetically sorted list of metadata into&lt;br /&gt;separate pages of about the same size and returns the&lt;br /&gt;alphabetic ranges for each one (Figure 4).&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;OVERVIEW OF RELATED WORK&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Two projects that provide substantial open source digital&lt;br /&gt;library software are Dienst (Lagoze and Fielding, 1998)&lt;br /&gt;and Harvest (Bowman &lt;i&gt;et al.&lt;/i&gt;, 1994). The origins of Dienst&lt;br /&gt;(&lt;i&gt;www.cs.cornell.edu/cdlrg&lt;/i&gt;) stretch back to 1992. The term&lt;br /&gt;has come to represent three entities: a conceptual&lt;br /&gt;architecture for distributed digital libraries; an open&lt;br /&gt;protocol for service communication; and a software&lt;br /&gt;system that implements the protocol. To date, five sample&lt;br /&gt;digital libraries have been built using this technology.&lt;br /&gt;They manifest themselves in two forms: technical reports&lt;br /&gt;and primary source documents.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Best known is NCSTRL, the Networked Computer&lt;br /&gt;Science Technical Reference Library project&lt;br /&gt;(&lt;i&gt;www.ncstrl.org&lt;/i&gt;). 
This collection facilitates searching by&lt;br /&gt;title, author and abstract, and browsing by year and author,&lt;br /&gt;across a distributed network of document repositories.&lt;br /&gt;Documents can (where supported) be delivered in various&lt;br /&gt;formats such as PostScript, a thumbnail overview of the&lt;br /&gt;pages, and a GIF image of a particular page.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The &lt;i&gt;Making of America&lt;/i&gt; resource is an example of a&lt;br /&gt;collection based around primary sources, in this case&lt;br /&gt;American social history, 1830−1900. It has a different&lt;br /&gt;“look and feel” to NCSTRL, being strongly oriented&lt;br /&gt;toward browsing rather than searching. A user navigates&lt;br /&gt;their way through a hierarchical structure of hyperlinks to&lt;br /&gt;reach a book of interest. The book itself is a series of&lt;br /&gt;scanned images: delivery options include going directly to&lt;br /&gt;a page number, next and previous page buttons, and&lt;br /&gt;displaying a particular page at different resolutions. A text&lt;br /&gt;version of the page is also available, and it can be&lt;br /&gt;searched.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Started in 1994, Harvest is also a long-running research&lt;br /&gt;project. It provides an efficient means of gathering source&lt;br /&gt;data from the Internet and distributing indexing&lt;br /&gt;information over the Internet. This is accomplished&lt;br /&gt;through five components: &lt;i&gt;gatherer&lt;/i&gt;, &lt;i&gt;broker&lt;/i&gt;, &lt;i&gt;indexer&lt;/i&gt;,&lt;br /&gt;&lt;i&gt;replicator&lt;/i&gt; and &lt;i&gt;cache&lt;/i&gt;. 
The first three are central to creating,&lt;br /&gt;updating and searching a collection; the last two help to&lt;br /&gt;improve performance over the Internet through transparent&lt;br /&gt;mirroring and caching techniques.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The system is configurable and customizable. While&lt;br /&gt;searching is most commonly implemented using Glimpse&lt;br /&gt;(&lt;i&gt;glimpse.cs.arizona.edu&lt;/i&gt;), in principle any search engine&lt;br /&gt;that supports incremental updates and Boolean&lt;br /&gt;combinations of attribute-based queries can be used. It is&lt;br /&gt;possible to control what type of documents are gathered&lt;br /&gt;during creation and updating, and how the query interface&lt;br /&gt;looks and is laid out.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Sample collections cited by the developers include 21,000&lt;br /&gt;computer science technical reports and 7,000 home pages.&lt;br /&gt;Other examples include a sizable collection of agriculture-&lt;br /&gt;related electronic journals and magazines called “tomato-&lt;br /&gt;juice” (accessed through &lt;i&gt;hegel.lib.ncsu.edu&lt;/i&gt;) and a full-text&lt;br /&gt;index of library-related electronic serials&lt;br /&gt;(&lt;i&gt;sunsite.berkeley.edu/IndexMorganagus&lt;/i&gt;). Harvest is also&lt;br /&gt;often used to index Web sites (for example&lt;br /&gt;&lt;i&gt;www.middlebury.edu&lt;/i&gt;).&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Comparing Greenstone with Dienst and Harvest, there are&lt;br /&gt;both similarities and differences. All provide substantial&lt;br /&gt;digital library systems, hence common themes recur, but&lt;br /&gt;they are driven by projects with different aims. Harvest,&lt;br /&gt;for instance, was not conceived as a digital library project&lt;br /&gt;at all, but by virtue of its selective document gathering&lt;br /&gt;process it can be classed (and is used) as one. 
While it&lt;br /&gt;provides sophisticated search options, it lacks the&lt;br /&gt;complementary service of browsing. Furthermore, it adds&lt;br /&gt;no structure or order to the documents collected, relying&lt;br /&gt;on whatever structures are present in the site that they&lt;br /&gt;were gathered from. A proven strength of the design is its&lt;br /&gt;flexibility through configuration and customization, an&lt;br /&gt;element also present in Greenstone.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Dienst, best exemplified through the NCSTRL&lt;br /&gt;work, supports searching and browsing, like Greenstone.&lt;br /&gt;Both use open protocols. Differences include a high&lt;br /&gt;reliance in Dienst on user-supplied information when a&lt;br /&gt;document is added, and a smaller range of document types&lt;br /&gt;supported—although Dienst does include a document&lt;br /&gt;model that should, over time, allow this to expand with&lt;br /&gt;relative ease.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;There are also commercial systems that provide similar&lt;br /&gt;digital library services to those described. However, since&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;Figure 7: Browsing a newspaper collection by date&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;corporate culture instills proprietary attitudes, there is little&lt;br /&gt;opportunity for advancement through a shared&lt;br /&gt;collaborative effort. Consequently they are not reviewed&lt;br /&gt;here.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;CONCLUSIONS&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Greenstone is a comprehensive software system for&lt;br /&gt;creating digital library collections. It builds data structures&lt;br /&gt;for searching and browsing from the material provided,&lt;br /&gt;rather than relying on any hand-crafting. 
The process is&lt;br /&gt;controlled by a configuration file, and once a collection&lt;br /&gt;exists new material can be added completely&lt;br /&gt;automatically. Browsing is based on Dublin Core&lt;br /&gt;metadata.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;New collections can be developed easily, particularly if&lt;br /&gt;they resemble existing ones. Extensibility is achieved&lt;br /&gt;through software “plugins” that can be written to&lt;br /&gt;accommodate documents, and metadata, in different&lt;br /&gt;formats. Standard plugins exist for many document types;&lt;br /&gt;new ones are easily written. Browsing is controlled by&lt;br /&gt;“classifiers” that process metadata into browsing structures&lt;br /&gt;(by date, alphabetical, hierarchical, etc).&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;However, the most powerful support for extensibility is&lt;br /&gt;achieved not by technical means but by making the source&lt;br /&gt;code freely available under the Gnu public license. Only&lt;br /&gt;through an international cooperative effort will digital&lt;br /&gt;library software become sufficiently comprehensive to&lt;br /&gt;meet the world’s needs with the richness and flexibility&lt;br /&gt;that users deserve.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;ACKNOWLEDGMENTS&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;We gratefully acknowledge all those who have worked on&lt;br /&gt;the Greenstone software, and all members of the New&lt;br /&gt;Zealand Digital Library project for their enthusiasm and&lt;br /&gt;ideas.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;REFERENCES&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;1. Akscyn, R.M. and Witten, I.H. (1998) “Report on First&lt;br /&gt;Summit on International Cooperation on Digital&lt;br /&gt;Libraries.” ks.com/idla-wp-oct98.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;2. Bowman, C.M., Danzig, P.B., Manber, U., and&lt;br /&gt;Schwartz, M.F. 
“Scalable Internet resource discovery:&lt;br /&gt;Research problems and approaches.” &lt;i&gt;Communications&lt;br /&gt;of the ACM,&lt;/i&gt; Vol. 37, No. 8, pp. 98−107, 1994.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;3. Fox, E. (1998) “Digital library definitions.”&lt;br /&gt;ei.cs.vt.edu/~fox/dlib/def.html.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;4. Humanity Libraries (1998) &lt;i&gt;Humanity Development&lt;br /&gt;Library&lt;/i&gt;. CD-ROM produced by the Global Help&lt;br /&gt;Project, Antwerp, Belgium.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;5. Lagoze, C. and Fielding, D. “Defining Collections in&lt;br /&gt;Distributed Digital Libraries.” &lt;i&gt;D-Lib Magazine&lt;/i&gt;, Nov.&lt;br /&gt;1998.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;6. PAHO (1999) &lt;i&gt;Virtual Disaster Library&lt;/i&gt;. CD-ROM&lt;br /&gt;produced by the Pan-American Health Organization,&lt;br /&gt;Washington DC, USA.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;7. McNab, R.J., Witten, I.H. and Boddie, S.J. (1998) “A&lt;br /&gt;distributed digital library architecture incorporating&lt;br /&gt;different index styles.” &lt;i&gt;Proc IEEE Advances in Digital&lt;br /&gt;Libraries&lt;/i&gt;, Santa Barbara, CA, pp. 36–45.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;8. Nevill-Manning, C.G., Reed, T., and Witten, I.H.&lt;br /&gt;(1998) “Extracting text from PostScript.”&lt;br /&gt;&lt;i&gt;Software—Practice and Experience&lt;/i&gt;, Vol. 28, No. 5, pp.&lt;br /&gt;481–491; April.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;9. UNESCO (1999) &lt;i&gt;SAHEL point DOC: Anthologie du&lt;br /&gt;développement au Sahel&lt;/i&gt;. CD-ROM produced by&lt;br /&gt;UNESCO, Paris, France.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;10. UNU (1998) &lt;i&gt;Collection on critical global issues.&lt;/i&gt; CD-&lt;br /&gt;ROM produced by the United Nations University&lt;br /&gt;Press, Tokyo, Japan.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;11. Witten, I.H., Moffat, A. and Bell, T. 
(1999) &lt;i&gt;Managing&lt;br /&gt;Gigabytes: compressing and indexing documents and&lt;br /&gt;images&lt;/i&gt;, Morgan Kaufmann, second edition.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;</Content>
</Section>
</Archive>