<A name=1></a>Greenstone: A Comprehensive Open-Source Digital Library Software System

Ian H. Witten,* Rodger J. McNab,† Stefan J. Boddie,* David Bainbridge*

* Dept of Computer Science, University of Waikato, New Zealand. E-mail: {ihw, sjboddie, davidb}@cs.waikato.ac.nz
† Digilib Systems, Hamilton, New Zealand. E-mail: rodger@digilibs.com

ABSTRACT

This paper describes the Greenstone digital library software, a comprehensive, open-source system for the construction and presentation of information collections. Collections built with Greenstone offer effective full-text searching and metadata-based browsing facilities that are attractive and easy to use. Moreover, they are easily maintainable and can be augmented and rebuilt entirely automatically. The system is extensible: software “plugins” accommodate different document and metadata types.

INTRODUCTION

Notwithstanding intense research activity in the digital library field during the second half of the 1990s, comprehensive software systems for creating digital libraries are not widely available. In fact, the usual solution when creating a digital library is also the most obvious: just put it on the Web. But consider how much effort is involved in constructing a Web site for a digital library. To be effective it needs to be visually attractive and ergonomically easy to use, incorporate convenient and powerful searching capabilities, and offer rich and natural browsing facilities. Above all it must be easy to maintain and augment, which presents a significant challenge if any manual organization is involved.

The alternative is to automate these activities through software tools. But the broad scope of digital library requirements makes this a daunting prospect. Ideally the software should incorporate facilities ranging from multilingual information retrieval to distributed computing protocols, from interoperability to search engine technology, from metadata standards to multiformat document parsing, from multimedia to multiple operating systems, from Web browsers to plug-and-play DVDs.

The Greenstone Digital Library Software from the New Zealand Digital Library (NZDL) project tackles this issue by providing a new way of organizing information and making it available over the Internet. A collection of information comprises several (typically several thousand, or several million) documents, and a uniform interface is provided to all documents in a collection. A library may include many different collections, each organized differently, though there is a strong family resemblance in how collections are presented.

Making information available using this system is far more than “just putting it on the Web.” The collection becomes maintainable, searchable, and browsable. Each collection, prior to presentation, undergoes a “building” process that, once established, is completely automatic. This process creates all the structures that are used at run-time for accessing the collection. Searching is based on various indexes, while browsing is based on various metadata; support structures for both are created during the building operation. When new material appears it can be fully incorporated into the collection by rebuilding.

To address the exceptionally broad demands of digital libraries, the system is public and extensible. It is issued under the Gnu public license and, in the spirit of open-source software, users are invited to contribute modifications and enhancements. Only through an international cooperative effort will digital library software become sufficiently comprehensive to meet the world’s needs. Currently the Greenstone software is used at sites in Canada, Germany, New Zealand, Romania, UK, and the US, and collections range from newspaper articles to technical documents, from educational journals to oral history, from visual art to folksongs. The software has been used for collections in many different languages, and for CD-ROMs that have been published by the United Nations and other humanitarian agencies in Belgium, France, Japan, and the US for distribution in developing countries (Humanity Libraries, 1998; PAHO, 1999; UNESCO, 1999; UNU, 1998). Further details can be obtained from www.nzdl.org.

<hr> <A name=2></a><IMG src="_httpdocimg_/pdf01-2_1.jpg">
Figure 1: Searching the HDL collection

This paper sets the scene with a brief discussion of what a digital library is. We then give an overview of the facilities offered by Greenstone and show how end users find information in collections. Next we describe the files and directories involved in a collection, and then discuss the processes of updating existing collections and creating new ones, including extending the software to provide new facilities. We conclude with an overview of related work.

WHAT IS A DIGITAL LIBRARY?

Ten definitions of the term “digital library” have been culled from the literature by Fox (1998), and their spirit is captured in the following brief characterization:

    A collection of digital objects, including text, video, and audio, along with methods for access and retrieval, and for selection, organization and maintenance of the collection

(Akscyn and Witten, 1998). Lesk (1998) views digital libraries as “organized collections of digital information,” and wisely recommends that they articulate the principles governing what is included and how the collection is organized.

Digital libraries are generally distinguished from the World-Wide Web, the essential difference being in selection and organization. But they are not generally distinguished from a web site: indeed, virtually all extant digital libraries manifest themselves as a web site. Hence the obvious question: to make a digital library, why not just put the information on the Web?

But we make a distinction between a digital library and a web site that lies at the heart of our software design: one should easily be able to add new material to a library without having to integrate it manually or edit its content in any way. Once added, new material should immediately become a first-class component of the library. And what permits it to be integrated into existing searching and browsing structures without any manual intervention is metadata. This provides sufficient focus to the concept of “digital library” to support the development of a construction kit.

OVERVIEW OF GREENSTONE

Information collections built by Greenstone combine extensive full-text search facilities with browsing indexes based on different metadata types. There are several ways for users to find information, although they differ between collections depending on the metadata available and the collection design. Typically you can search for particular words that appear in the text, or within a section of a document, or within a title or section heading. You can browse documents by title: just click on the displayed book icon to read it. You can browse documents by subject. Subjects are represented by bookshelves: just click on a shelf to see the books. Where appropriate, documents come complete with a table of contents (constructed automatically): you can click on a chapter or subsection to open it, expand the full table of contents, or expand the full document.

An example of searching is shown in Figure 1 where documents in the Global Help Project’s Humanity Development Library (HDL) are being searched for chapters matching the word butterfly. In Figure 2 the same collection is being browsed by subject: by clicking on the bookshelf icons the user has discovered an item under Section 16, Animal Husbandry. Pursuing an interest in butterfly farming, the user selects a book by clicking on its book icon. In Figure 3 the front cover of the book is displayed as a graphic on the left, and the automatically constructed table of contents appears at the start of the document. The current focus, Introduction and Summary, is shown in bold in the table of contents with its text starting further down the page.

In accordance with Lesk’s advice, a statement of purpose and coverage accompanies each collection, along with an explanation of how it is organized (Figure 1 shows the start of this). A distinction is made between searching and browsing. Searching is full-text, and, depending on the collection’s design, the user can choose between indexes built from different parts of the documents, or from different metadata. Some collections have an index of full documents, an index of sections, an index of paragraphs, an index of titles, and an index of section headings, each of which can be searched for particular words or phrases. Browsing involves data structures created from metadata that the user can examine: lists of authors, lists of titles, lists of dates, hierarchical classification structures, and so on. Data structures for both browsing and searching are built according to instructions in a configuration file, which controls both building and serving the collection. Sample configuration files are discussed below.

<hr> <A name=3></a><IMG src="_httpdocimg_/pdf01-3_1.jpg">
Figure 2: Browsing the HDL collection by subject

Rich browsing facilities can be provided by manually linking parts of documents together and building explicit indexes and tables of contents. However, manually-created linking becomes difficult to maintain, and often falls into disrepair when a collection expands. The Greenstone software takes a different tack: it facilitates maintainability by creating all searching and browsing structures automatically from the documents themselves. No links are inserted by hand. This means that when new documents in the same format become available, they can be added automatically. Indeed, for some collections this is done by processes that wake up regularly, scout for new material, and rebuild the indexes, all without manual intervention.

Collections comprise many documents: thousands, tens of thousands, or even millions. Each document may be hierarchically organized into sections (subsections, sub-subsections, and so on). Each section comprises one or more paragraphs. Metadata such as author, title, date, keywords, and so on, may be associated with documents, or with individual sections of documents. This is the raw material for indexes. It must either be provided explicitly for each document and section (for example, in an accompanying spreadsheet) or be derivable automatically from the source documents. Metadata is converted to Dublin Core and stored with the document for internal use.

In order to accommodate different kinds of source documents, the software is organized so that “plugins” can be written for new document types. Plugins exist for plain text documents, HTML documents, email documents, and bibliographic formats. Word documents are handled by saving them as HTML; PostScript ones by applying a preprocessor (Nevill-Manning et al., 1998). Specially written plugins also exist for proprietary formats such as that used by the BBC archives department. A collection may have source documents in different forms: it is just a matter of specifying all the necessary plugins. In order to build browsing indexes from metadata, an analogous scheme of “classifiers” is used: classifiers create indexes of various kinds based on metadata. Source documents are brought into the Greenstone system through a process called importing, which uses the plugins and classifiers specified in the collection configuration file.

The international Unicode character set is used throughout, so documents (and interfaces) can be written in any language. Collections have so far been produced in English, French, Spanish, German, Maori, Chinese, and Arabic. The NZDL Web site provides numerous examples. Collections can contain text, pictures, and even audio and video clips; a text-only version of the interface is also provided to accommodate visually impaired users. Compression technology is used to ensure best use of storage (Witten et al., 1999). Most non-textual material is either linked to textual documents or accompanied by textual descriptions (such as photo captions) to allow full-text searching and browsing. However, the architecture permits the implementation of plugins and classifiers even for non-textual data.

The system includes an “administrative” function whereby specified users can examine the composition of all collections, protect documents so that they can only be accessed by registered users on presentation of a password, and so on. Logs of user activity are kept that record all queries made to every Greenstone collection (though this facility can be disabled).

Although primarily designed for Internet access over the World-Wide Web, collections can be made available, in precisely the same form, on CD-ROM. In either case they are accessed through any Web browser. Greenstone CD-ROMs operate on a standalone PC under Windows 3.X, 95, 98, and NT, and the interaction is identical to accessing the collection on the Web, except that response is faster and more predictable. The requirement to operate on early Windows systems is one that plagues the software design, but is crucial for many users, particularly those in underdeveloped countries seeking access to humanitarian aid collections. If the PC is connected to a network (intranet or Internet), a custom-built Web server provided on each CD makes exactly the same information available to others through their standard Web browser. The use of compression ensures that the greatest possible volume of information can be packed on to a CD-ROM.

The collection-serving software operates under Unix and Windows NT, and works with standard Web servers. A flexible process structure allows different collections to be served by different computers, yet be presented to the user in the same way, on the same Web page, as part of the same digital library, even as part of the same collection (McNab and Witten, 1998). Existing collections can be updated and new ones brought on-line at any time, without bringing the system down; the process responsible for the user interface will notice (through periodic polling) when new collections appear and add them to the list presented to the user.

<hr> <A name=4></a><IMG src="_httpdocimg_/pdf01-4_1.jpg">
Figure 3: Reading a book in the HDL

FINDING INFORMATION

Greenstone digital library systems generally include several separate collections. A home page allows you to select a collection; in addition, each collection has its own “about” page that gives you information about how the collection is organized and the principles governing what is included.

All icons in the screenshots of Figures 1–4 are clickable. Those icons at the top of the page return to the home page, provide help text, and allow you to set user interface and searching preferences. The navigation bar underneath gives access to the searching and browsing facilities, which differ from one collection to another.

Each of the five buttons provides a different way to find information. You can search for particular words that appear in the text from the “search” page (or from the “about” page of Figure 1). This collection contains indexes of chapters, section titles, and entire books. The default search interface is a simple one, suitable for casual users; advanced searching (which allows full Boolean expressions, phrase searching, case and stemming control) can be enabled from the Preferences page.

This collection has four browsable metadata indexes. You can access publications by subject by clicking the subjects button, which brings up a list of subjects, represented by bookshelves (Figure 2). You can access publications by title by clicking titles a-z (Figure 4), which brings up a list of books in alphabetic order. You can access publications by organization (i.e. Dublin Core “publisher”), bringing up a list of organizations. You can access publications by “how to” listing, yielding a list of hints defined by the collection’s editors. We use the Dublin Core as a base and extend it in an ad hoc manner to accommodate the individual requirements of collection designers.

FILES IN A COLLECTION

When a new collection is created or material is added to an existing one, the original source documents are first brought into the system through a process known as “importing.” This involves converting documents into a simple HTML-like format known as GML (for “Greenstone Markup Language”), which includes any metadata associated with the document. Documents are assumed to be in the Unicode UTF-8 code (of which the ASCII characters form a subset).

Files and directories

There is a separate directory for each collection, which contains five subdirectories: the original raw material (import), the GML files created from this (archives), the final collection as it is served to users (index), a directory for use during the building process (building), and one for any supporting files (etc), including the configuration file that controls the collection creation procedure. Additional files might be required: for example, building a hierarchy of classifications requires a data file of sub-classifications.

The imported documents

In order to identify documents internally, a unique object identifier or OID is assigned to each original source document when it is imported (formed by hashing the content, to overcome file duplication effects caused by mirroring) and stored as metadata within that document. It is important that OIDs persist throughout the index-building process, so that a user’s search history is unaffected by rebuilding the collection. OIDs are assigned by hashing the contents of the original source document.

Once imported, each document is stored in its own subdirectory of archives, along with any associated files (for example, images). To ensure compatibility with Windows 3.0, only eight characters are used in directory and file names, which causes annoying but essentially trivial complications.

Inside the documents

The GML format imposes a limited amount of structure on documents. Documents are divided into paragraphs. They can be split hierarchically into sections and subsections. OIDs are extended to identify these components by appending numbers, separated by periods, to a document’s OID. When a book is read, its section hierarchy is visible as the table of contents (Figure 3). Chapters, sections, subsections, and pages are all implemented simply as “sections” within the document. In some collections documents do not have a hierarchical subsection structure, but are split into pages to permit browsing within a retrieved document.

The document structure is used for searchable indexes. There are three levels of index: documents, sections, and paragraphs, corresponding to the distinctions that GML makes; the hierarchical structure is flattened for the purposes of creating these indexes. Indexes can be of text, or metadata, or any combination. Thus you can create a searchable index of section titles, and/or authors, and/or document descriptions, as well as the document text.

<hr> <A name=5></a><IMG src="_httpdocimg_/pdf01-5_1.jpg">
Figure 4: Browsing titles in the HDL

UPDATING EXISTING COLLECTIONS

Updating an existing collection with new files in the same format is easy. For example, the raw material for the HDL is supplied in the form of HTML files marked up with <<TOC>> tags to split books into sections and subsections, and <> tags to indicate where an image is to be inserted. For each book in the library there is a directory that contains a single HTML file representing the book, and separate files containing the associated images. An accompanying spreadsheet file contains the classification hierarchy; this is converted to a simple file format (using Excel’s Save As command).

Since the collection exists, its directory is already set up with subdirectories import, archives, building, index, and etc, and the etc directory will contain a suitable collection configuration file.

The updating procedure

To update a collection, the new raw material is placed in the import directory, in whatever form it is available. Then the import process is invoked, which converts the files into GML using the specified plugins. Old material for which GML files have previously been created is not re-imported. Then the build process is invoked to build the requisite indexes for the collection. Finally, the contents of the building directory are moved into the index directory, and the new version of the collection automatically becomes live.

This procedure may seem cumbersome. But all the steps are necessary for efficient operation with large collections. The import process could be performed on the fly during the building operation, but because building indexes is a multipass operation, the often lengthy importing would be repeated several times. The build process can take considerable time (a day or two, for very large collections). Consequently, the results are placed in the building directory so that, if the collection already exists, it will continue to be served to users in its old form throughout the building operation.

Active users of the collection will not be disturbed when the new version becomes live; they will probably not even notice. The persistent OIDs ensure that interactions remain coherent: users who are examining the results of a query or browse operation will still retrieve the expected documents, and if a search is actually in progress when the change takes place the program detects the resulting file-structure inconsistency and automatically and transparently re-executes the query, this time on the new version of the collection.

How it works

The original material in the import directory may be in any format, and plugins are required to process each format type. The plugins that a collection uses must be specified in the collection configuration file. The import program reads the list of plugins and passes each document to each plugin in order until it finds one that can process it. When updating an existing collection, all plugins necessary to process new material should already have been specified in the configuration file.

The building step creates the indexes for both searching and browsing. The MG software is generally used to do the searching (Witten et al., 1999), and the mgbuild module is automatically invoked to create each of the indexes that is required. For example, the Humanity Development Library has three indexes, one for entire books, one for chapters, and one for section titles. Subdirectories of the index directory are created for each of these indexes. MG also compresses the text of the collection; and the image files are linked into the index subdirectory. Now none of the material in the import and archives directories is needed to run the collection and can be removed from the file system (though they would be needed if the collection were rebuilt).

<hr> <A name=6></a>
(a)
     1  creator         davidb@cs.waikato.ac.nz
     2  maintainer      davidb@cs.waikato.ac.nz
     3  public          True
     4
     5  indexes         document:text
     6  defaultindex    document:text
     7  plugins         GMLPlug TEXTPlug ArcPlug RecPlug
     8
     9  classify        AZList metadata=Title
    10
    11  collectionmeta  collectionname "generic text collection"
    12  collectionmeta  .document:text "documents"

(b)
     1  creator         davidb@cs.waikato.ac.nz
     2  maintainer      davidb@cs.waikato.ac.nz
     3  public          True
     4
     5  indexes         document:text document:From
     6  defaultindex    document:text
     7  plugins         GMLPlug EMAILPlug ArcPlug RecPlug
     8
     9  classify        AZList metadata=Title
    10  classify        DateList
    11
    12  collectionmeta  collectionname "Email messages"
    13  collectionmeta  .document:text "documents"
    14  collectionmeta  .document:From "email senders"
    15
    16  format QueryResults \
    17  <td>[link][icon][/link]</td><td>[Title]</td><td>[Author]</td>

Figure 5: Collection configuration files (a) generic, (b) for an email collection

Associated with each collection is a database stored in GDBM (Gnu database manager) format. This contains an entry for each document, giving its OID, its internal MG document number, and metadata such as title. Information for each of the browsing indexes, which appear as buttons on the Greenstone search/browse bar, is also extracted during the building process and stored in the database. A “classifier” program is required for each browsing index to extract the appropriate information from GML documents. Like plugins, classifiers are written on an ad hoc basis for the particular information required, and where possible reused from one collection to another.

The building program creates the indexes based on whatever appears in the archives directory. The first plugin specified by all collections is one that processes GML files, and so if archives contains imported files they will be processed correctly. If it contains material in the original format, that will be converted using the appropriate plugin. Thus the import process is optional.

GML is designed to be fast and easy to parse, an important requirement when millions of documents are to be processed. Something as simple as requiring tags to be lower-case, for example, yields a substantial speed-up. In certain circumstances, however, it might be preferable to use a standardized format such as XML. This is straightforward to implement (just write an XML plugin), although we have not done so ourselves. Given the transitory nature of the imported data, to date, we have found GML a satisfactory and beneficial format.

CREATING NEW COLLECTIONS

Building new collections from scratch is only slightly different from updating an existing collection. The key new requirement is creating a collection configuration file, and a software utility is provided to help. Two pieces of information are required for this: the name of the directory that the collection will use (into which the source data and other files will eventually be placed), and a contact e-mail address for use if any problems are encountered by the software once the collection is up and running. The utility creates files and directories within the newly-named directory to support a generic collection of plain text documents. With suitable data placed in the import directory, building the collection at this point will yield a document-level searchable index of all the text and a browsable list of “titles” (defined in this case to be the document filenames).

To enhance the functionality and presentation (something anything but the most trivial collection will require), the configuration file must be edited. For a collection sourced from documents in an already supported data format, presented in a similar fashion to an existing collection, the amount of editing is minimal. Importing new data formats and browsing metadata in ways not currently supported are more complex activities that require programming skills.

<hr> <A name=7></a><IMG src="_httpdocimg_/pdf01-7_1.jpg">
Figure 6: Searching bookmarked Web pages

Modifying the configuration file

Figure 5b shows simple alterations to the generic configuration file in Figure 5a that was generated by the new-collection utility. TEXTPlug is replaced with EMAILPlug (line 7) which reads email files and extracts metadata (From, To, Date, Subject) from them. A classifier for dates is added (line 10) to make the collection browsable chronologically. The default presentation of search results is overridden (line 17) to display both the title of the message (i.e. Dublin Core Title) and its sender (i.e. Dublin Core Author). Elements in square brackets, such as [Title], are replaced by the metadata associated with a particular document. The built-in term [icon] produces a suitable image that represents the document (such as a book icon or page icon), and the [link]…[/link] construct forms a hyperlink to the complete document. Anything else in the format statement, which in this case is solely table-cell tags in HTML, is passed through to the page being displayed.

As this example shows, creating a new collection that stays within the bounds of the library’s established capabilities falls within the capability of many computer users (for instance, computer-trained librarians). Extending Greenstone to handle new document formats and browse metadata in new ways is more challenging.

Writing new plugins and classifiers

Extensibility is obtained through plugins and classifiers. These are modules of code that can be slotted into the system to enhance its capabilities. Plugins parse documents, extracting the text and metadata to be indexed. Classifiers control how metadata is brought together to form browsable data structures. Both are specified in an object-oriented framework using inheritance to minimize the amount of code written.

A plugin must specify three things: what file formats it can handle, how they should be parsed, and whether the plugin is recursive. File formats are normally determined using regular expression matching on the filename. For example, the HTML plugin accepts all files that end in .htm, .html, .HTM, or .HTML. (It is quite possible, however, to write plugins that “look inside” the file as well.) For other files, the plugin returns undefined and the file is passed to the next plugin in the collection’s configuration file (e.g. Figure 5 line 7). If it can, the plugin parses the file and returns the number of documents processed. This involves extracting text and metadata and adding it to the library’s content through calls to add text and add metadata.

Some plugins (“recursive” ones) add extra files into the stream of data processed during the building phase by artificially reactivating the list of plugins. This is how directory hierarchies are traversed.

Plugins are small modules of code that are easy to write. We monitored the time it took to develop a new one that was different to any we had produced so far. We chose to make as an example a collection of HTML bookmark files, the motivation being to produce a convenient way of searching and browsing one’s bookmarked Web pages. Figure 6 shows a user searching for bookmarked pages about music. The new plugin took under an hour to write, and was 160 lines long (ignoring blank lines and comments), about the average length of existing plugins.

Classifiers are more general than plugins because they work on GML-format data. For example, any plugin that generates date metadata in accordance with the Dublin Core can request the collection to be browsable chronologically by specifying the DateList classifier in the collection’s configuration file (Figure 7). Classifiers are more elaborate than most plugins, but new ones are seldom required. The average length of existing classifiers is 230 lines.

Classifiers must specify three things: an initialization routine, how individual documents are classified, and the final browsable data structure. Initialization takes care of any options specified in the configuration file (such as metadata=Title on line 9 of Figure 5b). Classifying individual documents is an iterative process: for each one, a call to document-classify is made. On presentation of the document’s OID, the necessary metadata is located and used to control where the document is added to the browsable data structure being constructed. Once all documents have been added, a request is made for the completed data structure. Some classifiers return the data structure directly; others transform the data structure before it is returned.
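As an illustration of the plugin contract described above (declare which files you handle, parse them, optionally recurse), the following is a minimal Python sketch. It is not Greenstone's own code or API: the class names `Plugin`, `HtmlPlugin`, `DirectoryPlugin`, the `dispatch` function, and the use of a plain list as the "library" are all assumptions made for the example. Returning `None` stands in for the "undefined" result that passes a file on to the next plugin.

```python
import os
import re
import tempfile


class Plugin:
    """Hypothetical base class: subclasses set `pattern` and override `parse`."""
    pattern = None  # compiled regex matched against the filename

    def accepts(self, path):
        name = os.path.basename(path)
        return self.pattern is not None and self.pattern.search(name) is not None

    def parse(self, path, library, plugins):
        raise NotImplementedError

    def process(self, path, library, plugins):
        # None plays the role of "undefined": the caller tries the next plugin.
        if not self.accepts(path):
            return None
        return self.parse(path, library, plugins)


class HtmlPlugin(Plugin):
    # The four endings named in the text: .htm, .html, .HTM, .HTML
    pattern = re.compile(r"\.(htm|html|HTM|HTML)$")

    def parse(self, path, library, plugins):
        with open(path, encoding="utf-8") as f:
            text = f.read()
        library.append({"file": os.path.basename(path), "text": text})
        return 1  # number of documents processed


class DirectoryPlugin(Plugin):
    """A "recursive" plugin: feeds directory entries back through the plugin list."""

    def accepts(self, path):
        return os.path.isdir(path)

    def parse(self, path, library, plugins):
        count = 0
        for name in sorted(os.listdir(path)):
            count += dispatch(os.path.join(path, name), library, plugins) or 0
        return count


def dispatch(path, library, plugins):
    """Pass `path` to each plugin in order until one can process it."""
    for plugin in plugins:
        result = plugin.process(path, library, plugins)
        if result is not None:
            return result
    return None  # no plugin recognised the file


def build_demo():
    """Create a tiny import tree in a temporary directory and run the chain on it."""
    library = []
    plugins = [DirectoryPlugin(), HtmlPlugin()]
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "sub"))
        for rel in ("a.html", "notes.xyz", os.path.join("sub", "b.HTM")):
            with open(os.path.join(root, rel), "w", encoding="utf-8") as f:
                f.write("bookmarks about music")
        count = dispatch(root, library, plugins)
    return count, library
```

In this sketch the order of the plugin list matters in the same way as the order on the `plugins` line of the configuration file: the first plugin that accepts a file wins, and a file that no plugin accepts (here `notes.xyz`) is simply skipped.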
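The three responsibilities of a classifier (initialization from configuration options, classifying each document, returning the finished structure) can likewise be sketched in a few lines of Python. This is an illustrative assumption, not Greenstone's implementation: the class name `AZListSketch`, its method names, the `page_size` option, and the dictionary layout of the result are all invented for the example, and the paging rule is a simplification (fixed-size chunks, with a possibly shorter last page).

```python
class AZListSketch:
    """Hypothetical classifier in the spirit of an alphabetic title list."""

    def __init__(self, metadata="Title", page_size=3):
        # Initialization: record the options given on the configuration line,
        # e.g. "metadata=Title".
        self.metadata = metadata
        self.page_size = page_size
        self.entries = []  # (metadata value, OID) pairs

    def classify(self, oid, doc_metadata):
        # Called once per document; documents lacking the field are skipped.
        value = doc_metadata.get(self.metadata)
        if value:
            self.entries.append((value, oid))

    def result(self):
        # Final step: sort alphabetically, then cut the list into pages and
        # label each page with the range of initial letters it covers.
        ordered = sorted(self.entries, key=lambda entry: entry[0].lower())
        pages = []
        for start in range(0, len(ordered), self.page_size):
            chunk = ordered[start:start + self.page_size]
            first = chunk[0][0][0].upper()
            last = chunk[-1][0][0].upper()
            label = first if first == last else f"{first}-{last}"
            pages.append({"range": label, "oids": [oid for _, oid in chunk]})
        return pages


def classifier_demo():
    """Classify five made-up documents (one without a title) into pages of two."""
    classifier = AZListSketch(metadata="Title", page_size=2)
    documents = [
        ("HASH01", {"Title": "Butterfly Farming"}),
        ("HASH02", {"Title": "Animal Husbandry"}),
        ("HASH03", {"Title": "Zoology"}),
        ("HASH04", {"Title": "Crops"}),
        ("HASH05", {}),
    ]
    for oid, metadata in documents:
        classifier.classify(oid, metadata)
    return classifier.result()
```

Because the classifier only sees OIDs and metadata, the same sketch would work unchanged whichever plugin produced the documents, which is the sense in which classifiers are more general than plugins.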
The AZList classifier, for example, divides the alphabetically sorted list of metadata into separate pages of about the same size and returns the alphabetic ranges for each one (Figure 4).

<hr> <A name=8></a><IMG src="_httpdocimg_/pdf01-8_1.jpg">
Figure 7: Browsing a newspaper collection by date

OVERVIEW OF RELATED WORK

Two projects that provide substantial open source digital library software are Dienst (Lagoze and Fielding, 1998) and Harvest (Bowman et al., 1994).

The origins of Dienst (www.cs.cornell.edu/cdlrg) stretch back to 1992. The term has come to represent three entities: a conceptual architecture for distributed digital libraries; an open protocol for service communication; and a software system that implements the protocol. To date, five sample digital libraries have been built using this technology. They manifest themselves in two forms: technical reports and primary source documents.

Best known is NCSTRL, the Networked Computer Science Technical Reference Library project (www.ncstrl.org). This collection facilitates searching by title, author and abstract, and browsing by year and author, across a distributed network of document repositories. Documents can (where supported) be delivered in various formats such as PostScript, a thumbnail overview of the pages, and a GIF image of a particular page.

The Making of America resource is an example of a collection based around primary sources, in this case American social history, 1830–1900. It has a different “look and feel” to NCSTRL, being strongly oriented toward browsing rather than searching. A user navigates their way through a hierarchical structure of hyperlinks to reach a book of interest. The book itself is a series of scanned images: delivery options include going directly to a page number, next and previous page buttons, and displaying a particular page at different resolutions. A text version of the page is also available, and this can be searched.

Started in 1994, Harvest is also a long-running research project. It provides an efficient means of gathering source data from the Internet and distributing indexing information over the Internet. This is accomplished through five components: gatherer, broker, indexer, replicator and cache. The first three are central to creating, updating and searching a collection; the last two help to improve performance over the Internet through transparent mirroring and caching techniques.

The system is configurable and customizable. While searching is most commonly implemented using Glimpse (glimpse.cs.arizona.edu), in principle any search engine that supports incremental updates and Boolean combinations of attribute-based queries can be used. It is possible to control what types of documents are gathered during creation and updating, and how the query interface looks and is laid out.

Sample collections cited by the developers include 21,000 computer science technical reports and 7,000 home pages. Other examples include a sizable collection of agriculture-related electronic journals and magazines called “tomato-juice” (accessed through hegel.lib.ncsu.edu) and a full-text index of library-related electronic serials (sunsite.berkeley.edu/IndexMorganagus). Harvest is also often used to index Web sites (for example www.middlebury.edu).

Comparing Greenstone with Dienst and Harvest, there are both similarities and differences. All provide substantial digital library systems, hence common themes recur, but they are driven by projects with different aims. Harvest, for instance, was not conceived as a digital library project at all, but by virtue of its selective document gathering process it can be classed (and is used) as one. While it provides sophisticated search options, it lacks the complementary service of browsing. Furthermore it adds no structure or order to the documents collected, relying on whatever structures are present in the sites they were gathered from. A proven strength of the design is its flexibility through configuration and customization, an element also present in Greenstone.

Dienst, best exemplified through the NCSTRL work, supports searching and browsing, like Greenstone. Both use open protocols. Differences include a high reliance in Dienst on user-supplied information when a document is added, and a smaller range of document types supported, although Dienst does include a document model that should, over time, allow this to expand with relative ease.

There are also commercial systems that provide similar digital library services to those described. However, since corporate culture instills proprietary attitudes there is little opportunity for advancement through a shared collaborative effort. Consequently they are not reviewed here.

<hr> <A name=9></a>
CONCLUSIONS

Greenstone is a comprehensive software system for creating digital library collections. It builds data structures for searching and browsing from the material provided, rather than relying on any hand-crafting. The process is controlled by a configuration file, and once a collection exists new material can be added completely automatically. Browsing is based on Dublin Core metadata.

New collections can be developed easily, particularly if they resemble existing ones. Extensibility is achieved through software “plugins” that can be written to accommodate documents, and metadata, in different formats. Standard plugins exist for many document types; new ones are easily written. Browsing is controlled by “classifiers” that process metadata into browsing structures (by date, alphabetical, hierarchical, etc.).

However, the most powerful support for extensibility is achieved not by technical means but by making the source code freely available under the GNU General Public License. Only through an international cooperative effort will digital library software become sufficiently comprehensive to meet the world’s needs with the richness and flexibility that users deserve.

ACKNOWLEDGMENTS

We gratefully acknowledge all those who have worked on the Greenstone software, and all members of the New Zealand Digital Library project for their enthusiasm and ideas.

REFERENCES

1. Akscyn, R.M. and Witten, I.H. (1998) “Report on First Summit on International Cooperation on Digital Libraries.” ks.com/idla-wp-oct98.
2. Bowman, C.M., Danzig, P.B., Manber, U., and Schwartz, M.F. (1994) “Scalable Internet resource discovery: Research problems and approaches.” Communications of the ACM, Vol. 37, No. 8, pp. 98–107.
3. Fox, E. (1998) “Digital library definitions.” ei.cs.vt.edu/~fox/dlib/def.html.
4. Humanity Libraries (1998) Humanity Development Library. CD-ROM produced by the Global Help Project, Antwerp, Belgium.
5. Lagoze, C. and Fielding, D. (1998) “Defining Collections in Distributed Digital Libraries.” D-Lib Magazine, Nov.
6. PAHO (1999) Virtual Disaster Library. CD-ROM produced by the Pan-American Health Organization, Washington DC, USA.
7. McNab, R.J., Witten, I.H. and Boddie, S.J. (1998) “A distributed digital library architecture incorporating different index styles.” Proc IEEE Advances in Digital Libraries, Santa Barbara, CA, pp. 36–45.
8. Nevill-Manning, C.G., Reed, T., and Witten, I.H. (1998) “Extracting text from PostScript.” Software: Practice and Experience, Vol. 28, No. 5, pp. 481–491; April.
9. UNESCO (1999) SAHEL point DOC: Anthologie du développement au Sahel. CD-ROM produced by UNESCO, Paris, France.
10. UNU (1998) Collection on critical global issues. CD-ROM produced by the United Nations University Press, Tokyo, Japan.
11. Witten, I.H., Moffat, A. and Bell, T. (1999) Managing Gigabytes: compressing and indexing documents and images, Morgan Kaufmann, second edition.
<hr>