<A name=1></a><b>Greenstone: A Comprehensive Open-Source</b><br>
<b>Digital Library Software System</b><br>
<i>Ian H. Witten,* Rodger J. McNab,† Stefan J. Boddie,* David Bainbridge*</i><br>
* Dept of Computer Science, University of Waikato, New Zealand<br>
E-mail: {ihw, sjboddie, davidb}@cs.waikato.ac.nz<br>
† Digilib Systems, Hamilton, New Zealand<br>
E-mail: rodger@digilibs.com<br>
<br>
<b>ABSTRACT</b><br>
This paper describes the Greenstone digital library software, a comprehensive, open-source system for the construction and presentation of information collections. Collections built with Greenstone offer effective full-text searching and metadata-based browsing facilities that are attractive and easy to use. Moreover, they are easily maintainable and can be augmented and rebuilt entirely automatically. The system is extensible: software “plugins” accommodate different document and metadata types.<br>
<br>
<b>INTRODUCTION</b><br>
Notwithstanding intense research activity in the digital library field during the second half of the 1990s, comprehensive software systems for creating digital libraries are not widely available. In fact, the usual solution when creating a digital library is also the most obvious: just put it on the Web. But consider how much effort is involved in constructing a Web site for a digital library. To be effective it needs to be visually attractive and ergonomically easy to use, incorporate convenient and powerful searching capabilities, and offer rich and natural browsing facilities. Above all it must be easy to maintain and augment, which presents a significant challenge if any manual organization is involved.<br>
<br>
The alternative is to automate these activities through software tools. But the broad scope of digital library requirements makes this a daunting prospect. Ideally the software should incorporate facilities ranging from multilingual information retrieval to distributed computing protocols, from interoperability to search engine technology, from metadata standards to multiformat document parsing, from multimedia to multiple operating systems, from Web browsers to plug-and-play DVDs.<br>
<br>
The Greenstone Digital Library Software from the New Zealand Digital Library (NZDL) project tackles this issue by providing a new way of organizing information and making it available over the Internet. A <i>collection</i> of information comprises several (typically several thousand, or several million) <i>documents</i>, and a uniform interface is provided to all documents in a collection. A library may include many different collections, each organized differently, though there is a strong family resemblance in how collections are presented.<br>
<br>
Making information available using this system is far more than “just putting it on the Web.” The collection becomes maintainable, searchable, and browsable. Each collection, prior to presentation, undergoes a “building” process that, once established, is completely automatic. This process creates all the structures that are used at run-time for accessing the collection. Searching is based on various indexes, while browsing is based on various metadata; support structures for both are created during the building operation. When new material appears it can be fully incorporated into the collection by rebuilding.<br>
<br>
To address the exceptionally broad demands of digital libraries, the system is public and extensible. It is issued under the GNU General Public License and, in the spirit of open-source software, users are invited to contribute modifications and enhancements. Only through an international cooperative effort will digital library software become sufficiently comprehensive to meet the world’s needs.
Currently the Greenstone software is used at sites in Canada, Germany, New Zealand, Romania, UK, and the US, and collections range from newspaper articles to technical documents, from educational journals to oral history, from visual art to folksongs. The software has been used for collections in many different languages, and for CD-ROMs that have been published by the United Nations and other humanitarian agencies in Belgium, France, Japan, and the US for distribution in developing countries (Humanity Libraries, 1998; PAHO, 1999; UNESCO, 1999; UNU, 1998). Further details can be obtained from <i>www.nzdl.org</i>.<br>
<hr> <A name=2></a><IMG src="_httpdocimg_/pdf01-2_1.jpg"><br>
<b>Figure 1: Searching the HDL collection</b><br>
<br>
This paper sets the scene with a brief discussion of what a digital library is. We then give an overview of the facilities offered by Greenstone and show how end users find information in collections. Next we describe the files and directories involved in a collection, and then discuss the processes of updating existing collections and creating new ones, including extending the software to provide new facilities. We conclude with an overview of related work.<br>
<br>
<b>WHAT IS A DIGITAL LIBRARY?</b><br>
Ten definitions of the term “digital library” have been culled from the literature by Fox (1998), and their spirit is captured in the following brief characterization:<br>
<br>
<i>A collection of digital objects, including text, video, and audio, along with methods for access and retrieval, and for selection, organization and maintenance of the collection</i><br>
<br>
(Akscyn and Witten, 1998). Lesk (1998) views digital libraries as “organized collections of digital information,” and wisely recommends that they articulate the principles governing what is included and how the collection is organized.<br>
<br>
Digital libraries are generally distinguished from the World-Wide Web, the essential difference being in selection and organization. But they are not generally distinguished from a web <i>site</i>: indeed, virtually all extant digital libraries manifest themselves as a web site. Hence the obvious question: to make a digital library, why not just put the information on the Web?<br>
<br>
But we make a distinction between a digital library and a web site that lies at the heart of our software design: one should easily be able to add new material to a library without having to integrate it manually or edit its content in any way. Once added, new material should immediately become a first-class component of the library. And what permits it to be integrated into existing searching and browsing structures without any manual intervention is <i>metadata</i>. This provides sufficient focus to the concept of “digital library” to support the development of a construction kit.<br>
<br>
<b>OVERVIEW OF GREENSTONE</b><br>
Information collections built by Greenstone combine extensive full-text search facilities with browsing indexes based on different metadata types. There are several ways for users to find information, although they differ between collections depending on the metadata available and the collection design. Typically you can <i>search for particular words</i> that appear in the text, or within a section of a document, or within a title or section heading. You can <i>browse documents by title</i>: just click on the displayed book icon to read it. You can <i>browse documents by subject</i>. Subjects are represented by bookshelves: just click on a shelf to see the books. Where appropriate, documents come complete with a table of contents (constructed automatically): you can click on a chapter or subsection to open it, expand the full table of contents, or expand the full document.<br>
<br>
An example of searching is shown in Figure 1 where documents in the Global Help Project’s Humanity Development Library (HDL) are being searched for chapters matching the word <i>butterfly</i>. In Figure 2 the same collection is being browsed by subject: by clicking on the bookshelf icons the user has discovered an item under Section 16, Animal Husbandry. Pursuing an interest in butterfly farming, the user selects a book by clicking on its book icon. In Figure 3 the front cover of the book is displayed as a graphic on the left, and the automatically constructed table of contents appears at the start of the document. The current focus, <i>Introduction and Summary</i>, is shown in bold in the table of contents with its text starting further down the page.<br>
<br>
In accordance with Lesk’s advice, a statement of purpose and coverage accompanies each collection, along with an explanation of how it is organized (Figure 1 shows the start of this). A distinction is made between <i>searching</i> and <i>browsing</i>. Searching is full-text, and, depending on the collection’s design, the user can choose between indexes built from different parts of the documents, or from different metadata. Some collections have an index of full documents, an index of sections, an index of paragraphs, an index of titles, and an index of section headings, each of which can be searched for particular words or phrases. Browsing involves data structures created from metadata that the user can examine: lists of authors, lists of titles, lists of dates, hierarchical classification structures, and so on. Data structures for both browsing and searching are built according to instructions in a configuration file, which controls both building and serving the collection. Sample configuration files are discussed below.<br>
<hr> <A name=3></a><IMG src="_httpdocimg_/pdf01-3_1.jpg"><br>
<b>Figure 2: Browsing the HDL collection by subject</b><br>
<br>
Rich browsing facilities can be provided by manually linking parts of documents together and building explicit indexes and tables of contents. However, manually-created linking becomes difficult to maintain, and often falls into disrepair when a collection expands. The Greenstone software takes a different tack: it facilitates <i>maintainability</i> by creating all searching and browsing structures automatically from the documents themselves. No links are inserted by hand. This means that when new documents in the same format become available, they can be added automatically. Indeed, for some collections this is done by processes that wake up regularly, scout for new material, and rebuild the indexes, all without manual intervention.<br>
<br>
Collections comprise many documents: thousands, tens of thousands, or even millions. Each document may be hierarchically organized into <i>sections</i> (subsections, sub-subsections, and so on). Each section comprises one or more <i>paragraphs</i>. Metadata such as author, title, date, keywords, and so on, may be associated with documents, or with individual sections of documents. This is the raw material for indexes. It must either be provided explicitly for each document and section (for example, in an accompanying spreadsheet) or be derivable automatically from the source documents. Metadata is converted to Dublin Core and stored with the document for internal use.<br>
<br>
In order to accommodate different kinds of source documents, the software is organized so that “plugins” can be written for new document types. Plugins exist for plain text documents, HTML documents, email documents, and bibliographic formats. Word documents are handled by saving them as HTML; PostScript ones by applying a preprocessor (Nevill-Manning <i>et al.</i>, 1998). Specially written plugins also exist for proprietary formats such as that used by the BBC archives department. A collection may have source documents in different forms: it is just a matter of specifying all the necessary plugins. In order to build browsing indexes from metadata, an analogous scheme of “classifiers” is used: classifiers create indexes of various kinds based on metadata. Source documents are brought into the Greenstone system through a process called <i>importing</i>, which uses the plugins and classifiers specified in the collection configuration file.<br>
<br>
The international Unicode character set is used throughout, so documents (and interfaces) can be written in any language. Collections have so far been produced in English, French, Spanish, German, Maori, Chinese, and Arabic. The NZDL Web site provides numerous examples. Collections can contain text, pictures, and even audio and video clips; a text-only version of the interface is also provided to accommodate visually impaired users. Compression technology is used to ensure best use of storage (Witten <i>et al.</i>, 1999). Most non-textual material is either linked to textual documents or accompanied by textual descriptions (such as photo captions) to allow full-text searching and browsing. However, the architecture permits the implementation of plugins and classifiers even for non-textual data.<br>
<br>
The system includes an “administrative” function whereby specified users can examine the composition of all collections, protect documents so that they can only be accessed by registered users on presentation of a password, and so on. Logs of user activity are kept that record all queries made to every Greenstone collection (though this facility can be disabled).<br>
<br>
Although primarily designed for Internet access over the World-Wide Web, collections can be made available, in precisely the same form, on CD-ROM. In either case they are accessed through any Web browser. Greenstone CD-ROMs operate on a standalone PC under Windows 3.X, 95, 98, and NT, and the interaction is identical to accessing the collection on the Web, except that response is faster and more predictable. The requirement to operate on early Windows systems is one that plagues the software design, but is crucial for many users, particularly those in underdeveloped countries seeking access to humanitarian aid collections. If the PC is connected to a network (intranet or Internet), a custom-built Web server provided on each CD makes exactly the same information available to others through their standard Web browser. The use of compression ensures that the greatest possible volume of information can be packed on to a CD-ROM.<br>
<br>
The collection-serving software operates under Unix and Windows NT, and works with standard Web servers. A flexible process structure allows different collections to be served by different computers, yet be presented to the user in the same way, on the same Web page, as part of the same digital library, even as part of the same collection (McNab and Witten, 1998). Existing collections can be updated and new ones brought on-line at any time, without bringing the system down; the process responsible for the user interface will notice (through periodic polling) when new collections appear and add them to the list presented to the user.<br>
<hr> <A name=4></a><IMG src="_httpdocimg_/pdf01-4_1.jpg"><br>
<b>Figure 3: Reading a book in the HDL</b><br>
<br>
<b>FINDING INFORMATION</b><br>
Greenstone digital library systems generally include several separate collections. A home page allows you to select a collection; in addition, each collection has its own “about” page that gives you information about how the collection is organized and the principles governing what is included.<br>
<br>
All icons in the screenshots of Figures 1–4 are clickable. Those icons at the top of the page return to the home page, provide help text, and allow you to set user interface and searching preferences. The navigation bar underneath gives access to the searching and browsing facilities, which differ from one collection to another.<br>
<br>
Each of the five buttons provides a different way to find information. You can <i>search for particular words</i> that appear in the text from the “search” page (or from the “about” page of Figure 1). This collection contains indexes of chapters, section titles, and entire books. The default search interface is a simple one, suitable for casual users; advanced searching (which allows full Boolean expressions, phrase searching, and case and stemming control) can be enabled from the <i>Preferences</i> page.<br>
<br>
This collection has four browsable metadata indexes. You can <i>access publications by subject</i> by clicking the <i>subjects</i> button, which brings up a list of subjects, represented by bookshelves (Figure 2). You can <i>access publications by title</i> by clicking <i>titles a-z</i> (Figure 4), which brings up a list of books in alphabetic order. You can <i>access publications by organization</i> (i.e. Dublin Core “publisher”), bringing up a list of organizations. You can <i>access publications by “how to” listing</i>, yielding a list of hints defined by the collection’s editors. We use the Dublin Core as a base and extend it in an <i>ad hoc</i> manner to accommodate the individual requirements of collection designers.<br>
<br>
<b>FILES IN A COLLECTION</b><br>
When a new collection is created or material is added to an existing one, the original source documents are first brought into the system through a process known as “importing.” This involves converting documents into a simple HTML-like format known as GML (for “Greenstone Markup Language”), which includes any metadata associated with the document. Documents are assumed to be in the Unicode UTF-8 code (of which the ASCII characters form a subset).<br>
<br>
<b>Files and directories</b><br>
There is a separate directory for each collection, which contains five subdirectories: the original raw material (<i>import</i>), the GML files created from this (<i>archives</i>), the final collection as it is served to users (<i>index</i>), a directory for use during the building process (<i>building</i>), and one for any supporting files (<i>etc</i>), including the configuration file that controls the collection creation procedure. Additional files might be required: for example, building a hierarchy of classifications requires a data file of sub-classifications.<br>
<br>
<b>The imported documents</b><br>
In order to identify documents internally, a unique object identifier or OID is assigned to each original source document when it is imported (formed by hashing the content, to overcome file duplication effects caused by mirroring) and stored as metadata within that document. It is important that OIDs persist throughout the index-building process, so that a user’s search history is unaffected by rebuilding the collection. OIDs are assigned by hashing the contents of the original source document.<br>
<br>
Once imported, each document is stored in its own subdirectory of <i>archives</i>, along with any associated files (for example, images). To ensure compatibility with Windows 3.0, only eight characters are used in directory and file names, which causes annoying but essentially trivial complications.<br>
<br>
<b>Inside the documents</b><br>
The GML format imposes a limited amount of structure on documents. Documents are divided into paragraphs. They can be split hierarchically into sections and subsections. OIDs are extended to identify these components by appending numbers, separated by periods, to a document’s OID. When a book is read, its section hierarchy is visible as the table of contents (Figure 3). Chapters, sections, subsections, and pages are all implemented simply as “sections” within the document. In some collections documents do not have a hierarchical subsection structure, but are split into pages to permit browsing within a retrieved document.<br>
<br>
The document structure is used for searchable indexes. There are three levels of index: <i>documents</i>, <i>sections</i>, and <i>paragraphs</i>, corresponding to the distinctions that GML makes; the hierarchical structure is flattened for the purposes of creating these indexes. Indexes can be of text, or metadata, or any combination. Thus you can create a searchable index of section titles, and/or authors, and/or document descriptions, as well as the document text.<br>
<hr> <A name=5></a><IMG src="_httpdocimg_/pdf01-5_1.jpg"><br>
<b>Figure 4: Browsing titles in the HDL</b><br>
<br>
<b>UPDATING EXISTING COLLECTIONS</b><br>
Updating an existing collection with new files in the same format is easy. For example, the raw material for the HDL is supplied in the form of HTML files marked up with &lt;&lt;TOC&gt;&gt; tags to split books into sections and subsections, and &lt;&lt;I&gt;&gt; tags to indicate where an image is to be inserted. For each book in the library there is a directory that contains a single HTML file representing the book, and separate files containing the associated images. An accompanying spreadsheet file contains the classification hierarchy; this is converted to a simple file format (using Excel’s <i>Save As</i> command).<br>
<br>
Since the collection exists, its directory is already set up with subdirectories <i>import</i>, <i>archives</i>, <i>building</i>, <i>index</i>, and <i>etc</i>, and the <i>etc</i> directory will contain a suitable collection configuration file.<br>
<br>
<b>The updating procedure</b><br>
To update a collection, the new raw material is placed in the <i>import</i> directory, in whatever form it is available. The <i>import</i> process is then invoked, which converts the files into GML using the specified plugins. Old material for which GML files have previously been created is not re-imported. Then the <i>build</i> process is invoked to build the requisite indexes for the collection. Finally, the contents of the <i>building</i> directory are moved into the <i>index</i> directory, and the new version of the collection automatically becomes live.<br>
<br>
This procedure may seem cumbersome. But all the steps are necessary for efficient operation with large collections. The <i>import</i> process could be performed on the fly during the building operation, but because building indexes is a multipass operation, the often lengthy importing would be repeated several times. The <i>build</i> process can take considerable time (a day or two for very large collections). Consequently, the results are placed in the <i>building</i> directory so that, if the collection already exists, it will continue to be served to users in its old form throughout the building operation.<br>
<br>
Active users of the collection will not be disturbed when the new version becomes live; they will probably not even notice. The persistent OIDs ensure that interactions remain coherent: users who are examining the results of a query or browse operation will still retrieve the expected documents, and if a search is actually in progress when the change takes place, the program detects the resulting file-structure inconsistency and automatically and transparently re-executes the query, this time on the new version of the collection.<br>
<br>
<b>How it works</b><br>
The original material in the <i>import</i> directory may be in any format, and plugins are required to process each format type. The plugins that a collection uses must be specified in the collection configuration file. The <i>import</i> program reads the list of plugins and passes each document to each plugin in order until it finds one that can process it. When updating an existing collection, all plugins necessary to process new material should already have been specified in the configuration file.<br>
<br>
The building step creates the indexes for both searching and browsing. The MG software is generally used to do the searching (Witten <i>et al.</i>, 1999), and the <i>mgbuild</i> module is automatically invoked to create each of the indexes that is required. For example, the Humanity Development Library has three indexes, one for entire books, one for chapters, and one for section titles. Subdirectories of the <i>index</i> directory are created for each of these indexes.<br>
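<br>
The updating procedure and the plugin dispatch described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Greenstone's actual code: the names TextPlugin, HtmlPlugin, import_documents and build_and_install are hypothetical, identifiers are derived with SHA-1 purely for illustration, and plain-text archive files and a JSON word list stand in for GML files and the MG indexes.<br>

```python
import hashlib
import json
import re
import shutil
import tempfile
from pathlib import Path


class TextPlugin:
    """Hypothetical plugin that claims files by a filename pattern."""
    pattern = re.compile(r"\.txt$", re.IGNORECASE)

    def process(self, path):
        if not self.pattern.search(path.name):
            return None          # decline, so the next plugin is tried
        return path.read_text(encoding="utf-8")


class HtmlPlugin(TextPlugin):
    """Reuses the dispatch logic by inheritance and crudely strips tags."""
    pattern = re.compile(r"\.html?$", re.IGNORECASE)

    def process(self, path):
        text = super().process(path)
        return None if text is None else re.sub(r"<[^>]+>", " ", text)


def import_documents(collection, plugins):
    """Convert new source files into archive files; return how many were new."""
    archives = collection / "archives"
    archives.mkdir(exist_ok=True)
    imported = 0
    for source in sorted((collection / "import").iterdir()):
        if not source.is_file():
            continue
        # Content hash as identifier (SHA-1 is an assumption of this sketch).
        oid = "HASH" + hashlib.sha1(source.read_bytes()).hexdigest()[:16]
        target = archives / (oid + ".txt")
        if target.exists():
            continue             # old material is not re-imported
        for plugin in plugins:   # the first plugin that accepts the file wins
            text = plugin.process(source)
            if text is not None:
                target.write_text(text, encoding="utf-8")
                imported += 1
                break
    return imported


def build_and_install(collection):
    """Build a toy word index in 'building', then move it into 'index'."""
    building = collection / "building"
    shutil.rmtree(building, ignore_errors=True)
    building.mkdir()
    index = {}
    for doc in sorted((collection / "archives").iterdir()):
        words = set(re.findall(r"\w+", doc.read_text(encoding="utf-8").lower()))
        for word in sorted(words):
            index.setdefault(word, []).append(doc.stem)
    (building / "words.json").write_text(json.dumps(index), encoding="utf-8")
    live = collection / "index"
    shutil.rmtree(live, ignore_errors=True)   # the old index served until here
    building.rename(live)


# Usage: a throwaway collection with one text file and one HTML file.
collection = Path(tempfile.mkdtemp())
(collection / "import").mkdir()
(collection / "import" / "a.txt").write_text("butterfly farming", encoding="utf-8")
(collection / "import" / "b.html").write_text("<p>animal husbandry</p>", encoding="utf-8")
plugins = [TextPlugin(), HtmlPlugin()]
first = import_documents(collection, plugins)
build_and_install(collection)
second = import_documents(collection, plugins)   # nothing new this time
```

In this sketch a second import pass finds nothing to do, because archive files are keyed by content hash. The live index directory is replaced only after the build has finished, so the previous index stays in place throughout the build.<br>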
Then<br> <hr> <A name=6></a>creator<br> davidb@cs.waikato.ac.nz<br> 1<br> maintainer<br> davidb@cs.waikato.ac.nz<br> 2<br> public<br> True<br> 3<br>4<br> indexes<br> document:text<br> 5<br> defaultindex<br> document:text<br> 6<br> plugins<br> GMLPlug TEXTPlug ArcPlug RecPlug<br> 7<br>8<br> classify<br> AZList metadata=Title<br> 9<br>10<br> collectionmeta<br> collectionname &quot;generic text collection&quot;<br> 11<br> (a)<br> collectionmeta<br> .document:text &quot;documents&quot;<br> 12<br> creator<br> davidb@cs.waikato.ac.nz<br> 1<br> maintainer<br> davidb@cs.waikato.ac.nz<br> 2<br> public<br> True<br> 3<br>4<br> indexes<br> document:text document:From<br> 5<br> defaultindex<br> document:text<br> 6<br> plugins<br> GMLPlug EMAILPlug ArcPlug RecPlug<br> 7<br>8<br> classify<br> AZList metadata=Title<br> 9<br> classify<br> DateList<br> 10<br>11<br> collectionmeta<br> collectionname &quot;Email messages&quot;<br> 12<br> collectionmeta<br> .document:text &quot;documents&quot;<br> 13<br> collectionmeta<br> .document:From &quot;email senders&quot;<br> 14<br>15<br> format<br> QueryResults \\\\<br> 16<br> (b)<br> &lt;td&gt;[link][icon][/link]&lt;/td&gt;&lt;td&gt;[Title]&lt;/td&gt;&lt;td&gt;[Author]&lt;/td&gt;<br> 17<br> <b>Figure 5: Collection configuration files (a) generic, (b) for an email collection</b><br> <br>MG also compresses the text of the collection; and the<br> certain circumstances, however, it might be preferable to<br> image files are linked into the <i>index</i> subdirectory. Now<br> use a standardized format such as XML. This is<br> none of the material in the <i>import</i> and <i>archives</i> directories<br> straightforward to implementjust write an XML<br> is needed to run the collection and can be removed from<br> pluginalthough we have not done so ourselves. 
Given<br> the file system (though they would be needed if the<br> the transitory nature of the imported data, to date, we have<br> collection were rebuilt).<br> found GML a satisfactory and beneficial format.<br> <br>Associated with each collection is a database stored in<br> <b>CREATING NEW COLLECTIONS</b><br> GDBM (Gnu database manager) format. This contains an<br>entry for each document, giving its OID, its internal MG<br> <br>Building new collections from scratch is only slightly<br> document number, and metadata such as title. Information<br> different from updating an existing collection. The key<br> for each of the browsing indexes, which appear as buttons<br> new requirement is creating a collection configuration file,<br> on the Greenstone search/browse bar, is also extracted<br> and a software utility is provided to help. Two pieces of<br> during the building process and stored in the database. A<br> information are required for this: the name of the directory<br> “classifier” program is required for each browsing index to<br> that the collection will use (into which the source data and<br> extract the appropriate information from GML documents.<br> other files will eventually be placed), and a contact e-mail<br> Like plugins, classifiers are written on an <i>ad hoc</i> basis for<br> address for use if any problems are encountered by the<br> the particular information required, and where possible<br> software once the collection is up and running. The utility<br> reused from one collection to another.<br> creates files and directories within the newly-named<br> <br> directory to support a generic collection of plain text<br> The building program creates the indexes based on<br> documents. With suitable data placed in the <i>import</i><br> whatever appears in the <i>archives</i> directory. 
The first plugin<br> directory, building the collection at this point will yield a<br> specified by all collections is one that processes GML<br> document-level searchable index of all the text and a<br> files, and so if <i>archives</i> contains imported files they will be<br> browsable list of “titles” (defined in this case to be the<br> processed correctly. If it contains material in the original<br> document filenames).<br> format, that will be converted using the appropriate plugin.<br>Thus the import process is optional.<br> <br>To enhance the functionality and presentation— something<br> <br> anything but the most trivial collection will require—the<br> GML is designed to be fast and easy to parse, an important<br> configuration file must be edited. For a collection sourced<br> requirement when millions of documents are to be<br> from documents in an already supported data format,<br> processed. Something as simple as requiring tags to be<br> presented in a similar fashion to an existing collection, the<br> lower-case, for example, yields a substantial speed-up. In<br> <hr> <A name=7></a><IMG src="_httpdocimg_/pdf01-7_1.jpg"><br> <br>These are modules of code that can be slotted into the<br>system to enhance its capabilities. Plugins parse<br>documents, extracting the text and metadata to be indexed.<br>Classifiers control how metadata is brought together to<br>form browsable data structures. Both are specified in an<br>object-oriented framework using inheritance to minimize<br>the amount of code written.<br> <br>A plugin must specify three things: what file formats it can<br>handle, how they should be parsed, and whether the plugin<br>is recursive. File formats are normally determined using<br>regular expression matching on the filename. For example,<br>the HTML plugin accepts all files that end in <i>.htm</i>, . <i>html</i>,<br><i>.HTM</i>, or <i>.HTML</i>. (It is quite possible, however, to write<br>plugins that “look inside” the file as well.) 
For other files, the plugin returns <i>undefined</i> and the file is passed to the next plugin in the collection’s configuration file (e.g. Figure 5 line 7). If it can, the plugin parses the file and returns the number of documents processed. This involves extracting text and metadata and adding it to the library’s content through calls to <i>add text</i> and <i>add metadata</i>.<br>
 <br>Some plugins (“recursive” ones) add extra files into the stream of data processed during the building phase by artificially reactivating the list of plugins. This is how directory hierarchies are traversed.<br>
 <br>Plugins are small modules of code that are easy to write. We monitored the time it took to develop a new one that was different to any we had produced so far. We chose to make as an example a collection of HTML bookmark files, the motivation being to produce a convenient way of searching and browsing one’s bookmarked Web pages. Figure 6 shows a user searching for bookmarked pages about <i>music</i>. The new plugin took under an hour to write, and was 160 lines long (ignoring blank lines and comments), about the average length of existing plugins.<br>
 <br>Classifiers are more general than plugins because they work on GML-format data. For example, any plugin that generates date metadata in accordance with the Dublin Core can request the collection to be browsable chronologically by specifying the <i>DateList</i> classifier in the collection’s configuration file (Figure 7). Classifiers are more elaborate than most plugins, but new ones are seldom required. The average length of existing classifiers is 230 lines.<br>
 <br>Classifiers must specify three things: an initialization routine, how individual documents are classified, and the final browsable data structure. Initialization takes care of any options specified in the configuration file (such as <i>metadata=Title</i> on line 9 of Figure 5b). Classifying individual documents is an iterative process: for each one, a call to <i>document-classify</i> is made. On presentation of the<br>
 <br><b>Figure 6: Searching bookmarked Web pages</b><br>
 <br>amount of editing is minimal. Importing new data formats and browsing metadata in ways not currently supported are more complex activities that require programming skills.<br>
 <br><b>Modifying the configuration file</b><br>
 <br>Figure 5b shows simple alterations to the generic configuration file in Figure 5a that was generated by the new-collection utility. <i>TEXTPlug</i> is replaced with <i>EMAILPlug</i> (line 7), which reads email files and extracts metadata (<i>From</i>, <i>To</i>, <i>Date</i>, <i>Subject</i>) from them. A classifier for dates is added (line 10) to make the collection browsable chronologically. The default presentation of search results is overridden (line 17) to display both the title of the message (i.e. Dublin Core <i>Title</i>) and its sender (i.e. Dublin Core <i>Author</i>). Elements in square brackets, such as <i>[Title]</i>, are replaced by the metadata associated with a particular document. The built-in term <i>[icon]</i> produces a suitable image that represents the document (such as a book icon or page icon), and the <i>[link]…[/link]</i> construct forms a hyperlink to the complete document. Anything else in the format statement, which in this case is solely table-cell tags in HTML, is passed through to the page being displayed.<br>
 <br>As this example shows, creating a new collection that stays within the bounds of the library’s established capabilities falls within the capability of many computer users, for instance computer-trained librarians. 
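 <br>The substitution rules for format statements described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code Greenstone uses: the function <i>render_format</i>, the icon markup, the document URL and the sample metadata are all hypothetical.<br>

```python
import html
import re

# Hypothetical icon markup; the text does not say what [icon] expands to.
ICON_HTML = '<img src="icon.gif">'


def render_format(fmt, metadata, doc_url):
    """Expand the bracketed elements of a format string for one document."""

    def expand(match):
        name = match.group(1)
        if name == "link":
            return '<a href="%s">' % html.escape(doc_url, quote=True)
        if name == "/link":
            return "</a>"
        if name == "icon":
            return ICON_HTML
        # Any other bracketed element is looked up as a metadata name
        # (assumption: missing metadata expands to an empty string).
        return html.escape(metadata.get(name, ""))

    # Text outside square brackets is passed through unchanged.
    return re.sub(r"\[(/?\w+)\]", expand, fmt)


# Hypothetical format string and metadata for one email message.
fmt = "<td>[link][icon][/link]</td><td>[Title]</td><td>[Author]</td>"
meta = {"Title": "Meeting agenda", "Author": "ihw"}
row = render_format(fmt, meta, "doc1.html")
print(row)
```

 Here the table-cell tags pass straight through, while <i>[Title]</i> and <i>[Author]</i> are filled in from the document's metadata.<br>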
Extending Greenstone to handle new document formats and browse metadata in new ways is more challenging.<br>
 <br><b>Writing new plugins and classifiers</b><br>
 <br>Extensibility is obtained through plugins and classifiers.<br>
 <br>document’s OID, the necessary metadata is located and used to control where the document is added to the browsable data structure being constructed.<br>
 <br>Once all documents have been added, a request is made for the completed data structure. Some classifiers return the data structure directly; others transform the data structure before it is returned. For example, the <i>AZList</i> classifier divides the alphabetically sorted list of metadata into separate pages of about the same size and returns the alphabetic ranges for each one (Figure 4).<br>
 <hr> <A name=8></a><IMG src="_httpdocimg_/pdf01-8_1.jpg"><br>
 <b>Figure 7: Browsing a newspaper collection by date</b><br>
 <br><b>OVERVIEW OF RELATED WORK</b><br>
 Two projects that provide substantial open source digital library software are Dienst (Lagoze and Fielding, 1998) and Harvest (Bowman <i>et al.</i>, 1994). The origins of Dienst (<i>www.cs.cornell.edu/cdlrg</i>) stretch back to 1992. The term has come to represent three entities: a conceptual architecture for distributed digital libraries; an open protocol for service communication; and a software system that implements the protocol. To date, five sample digital libraries have been built using this technology. They manifest themselves in two forms: technical reports and primary source documents.<br>
 Best known is NCSTRL, the Networked Computer Science Technical Reference Library project (<i>www.ncstrl.org</i>). This collection facilitates searching by title, author and abstract, and browsing by year and author, across a distributed network of document repositories. Documents can (where supported) be delivered in various formats such as PostScript, a thumbnail overview of the pages, and a GIF image of a particular page.<br>
 The <i>Making of America</i> resource is an example of a collection based around primary sources, in this case American social history, 1830–1900. It has a different “look and feel” to NCSTRL, being strongly oriented toward browsing rather than searching. A user navigates their way through a hierarchical structure of hyperlinks to reach a book of interest. The book itself is a series of scanned images: delivery options include going directly to a page number, next and previous page buttons, and displaying a particular page at different resolutions. A text version of the page is also available, and it can be searched.<br>
 Started in 1994, Harvest is also a long-running research project. It provides an efficient means of gathering source data from the Internet and distributing indexing information over the Internet. This is accomplished through five components: <i>gatherer</i>, <i>broker</i>, <i>indexer</i>, <i>replicator</i> and <i>cache</i>. The first three are central to creating, updating and searching a collection; the last two help to improve performance over the Internet through transparent mirroring and caching techniques.<br>
 The system is configurable and customizable. While searching is most commonly implemented using Glimpse (<i>glimpse.cs.arizona.edu</i>), in principle any search engine that supports incremental updates and Boolean combinations of attribute-based queries can be used. It is possible to control what type of documents are gathered during creation and updating, and how the query interface looks and is laid out.<br>
 Sample collections cited by the developers include 21,000 computer science technical reports and 7,000 home pages. Other examples include a sizable collection of agriculture-related electronic journals and magazines called “tomato-juice” (accessed through <i>hegel.lib.ncsu.edu</i>) and a full-text index of library-related electronic serials (<i>sunsite.berkeley.edu/IndexMorganagus</i>). Harvest is also often used to index Web sites (for example <i>www.middlebury.edu</i>).<br>
 Comparing Greenstone with Dienst and Harvest, there are both similarities and differences. All provide substantial digital library systems, hence common themes recur, but they are driven by projects with different aims. Harvest, for instance, was not conceived as a digital library project at all, but by virtue of its selective document gathering process it can be classed (and is used) as one. While it provides sophisticated search options, it lacks the complementary service of browsing. Furthermore it adds no structure or order to the documents collected, relying on whatever structures are present in the site that they were gathered from. A proven strength of the design is its flexibility through configuration and customization, an element also present in Greenstone.<br>
 Dienst, best exemplified through the NCSTRL work, supports searching and browsing, like Greenstone. Both use open protocols. Differences include a high reliance in Dienst on user-supplied information when a document is added, and a smaller range of document types supported, although Dienst does include a document model that should, over time, allow this to expand with relative ease.<br>
 There are also commercial systems that provide similar digital library services to those described. 
<hr> <A name=9></a>However, since corporate culture instills proprietary attitudes there is little opportunity for advancement through a shared collaborative effort. Consequently they are not reviewed here.<br>
 <br><b>CONCLUSIONS</b><br>
 Greenstone is a comprehensive software system for creating digital library collections. It builds data structures for searching and browsing from the material provided, rather than relying on any hand-crafting. The process is controlled by a configuration file, and once a collection exists new material can be added completely automatically. Browsing is based on Dublin Core metadata.<br>
 New collections can be developed easily, particularly if they resemble existing ones. Extensibility is achieved through software “plugins” that can be written to accommodate documents, and metadata, in different formats. Standard plugins exist for many document types; new ones are easily written. Browsing is controlled by “classifiers” that process metadata into browsing structures (by date, alphabetical, hierarchical, etc).<br>
 However, the most powerful support for extensibility is achieved not by technical means but by making the source code freely available under the GNU General Public License. Only through an international cooperative effort will digital library software become sufficiently comprehensive to meet the world’s needs with the richness and flexibility that users deserve.<br>
 <br><b>ACKNOWLEDGMENTS</b><br>
 We gratefully acknowledge all those who have worked on the Greenstone software, and all members of the New Zealand Digital Library project for their enthusiasm and ideas.<br>
 <br><b>REFERENCES</b><br>
 1. Akscyn, R.M. and Witten, I.H. (1998) “Report on First Summit on International Cooperation on Digital Libraries.” ks.com/idla-wp-oct98.<br>
 2. Bowman, C.M., Danzig, P.B., Manber, U., and Schwartz, M.F. (1994) “Scalable Internet resource discovery: Research problems and approaches.” <i>Communications of the ACM</i>, Vol. 37, No. 8, pp. 98–107.<br>
 3. Fox, E. (1998) “Digital library definitions.” ei.cs.vt.edu/~fox/dlib/def.html.<br>
 4. Humanity Libraries (1998) <i>Humanity Development Library</i>. CD-ROM produced by the Global Help Project, Antwerp, Belgium.<br>
 5. Lagoze, C. and Fielding, D. (1998) “Defining Collections in Distributed Digital Libraries.” <i>D-Lib Magazine</i>, Nov.<br>
 6. PAHO (1999) <i>Virtual Disaster Library</i>. CD-ROM produced by the Pan-American Health Organization, Washington DC, USA.<br>
 7. McNab, R.J., Witten, I.H. and Boddie, S.J. (1998) “A distributed digital library architecture incorporating different index styles.” <i>Proc IEEE Advances in Digital Libraries</i>, Santa Barbara, CA, pp. 36–45.<br>
 8. Nevill-Manning, C.G., Reed, T., and Witten, I.H. (1998) “Extracting text from PostScript.” <i>Software: Practice and Experience</i>, Vol. 28, No. 5, pp. 481–491; April.<br>
 9. UNESCO (1999) <i>SAHEL point DOC: Anthologie du développement au Sahel</i>. CD-ROM produced by UNESCO, Paris, France.<br>
 10. UNU (1998) <i>Collection on critical global issues</i>. CD-ROM produced by the United Nations University Press, Tokyo, Japan.<br>
 11. Witten, I.H., Moffat, A. and Bell, T. (1999) <i>Managing Gigabytes: compressing and indexing documents and images</i>, Morgan Kaufmann, second edition.<br>
 <hr>