doc.xml

Last change on this file was 38016, checked in by anupama, 8 months ago
AUTOCOMMIT by gen-model-colls.sh script. Message: Regenerating GS3 model collections except the Word-PDF-Enhanced* collections
File size: 56.3 KB

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "https://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
<Description>
<Metadata name="gsdldoctype">indexed_doc</Metadata>
<Metadata name="Language">en</Metadata>
<Metadata name="Encoding">utf8</Metadata>
<Metadata name="Title">Greenstone: A Comprehensive Open-Source Digital Library Software System Ian H....</Metadata>
<Metadata name="URL">http://Scratch/ak19/gs3-svn-model-4Sep2023/gs2build/tmp/F111.html</Metadata>
<Metadata name="UTF8URL">http://Scratch/ak19/gs3-svn-model-4Sep2023/gs2build/tmp/F111.html</Metadata>
<Metadata name="gsdlsourcefilename">import/pdf01.pdf</Metadata>
<Metadata name="gsdlsourcefilerenamemethod">url</Metadata>
<Metadata name="gsdlconvertedfilename">/Scratch/ak19/gs3-svn-model-4Sep2023/gs2build/tmp/F111.html</Metadata>
<Metadata name="OrigSource">F111.html</Metadata>
<Metadata name="Source">pdf01.pdf</Metadata>
<Metadata name="SourceFile">pdf01.pdf</Metadata>
<Metadata name="Plugin">PDFPlugin</Metadata>
<Metadata name="FileSize">269487</Metadata>
<Metadata name="FilenameRoot">pdf01</Metadata>
<Metadata name="FileFormat">PDF</Metadata>
<Metadata name="srcicon">_iconpdf_</Metadata>
<Metadata name="srclink_file">doc.pdf</Metadata>
<Metadata name="srclinkFile">doc.pdf</Metadata>
<Metadata name="NumPages">9</Metadata>
<Metadata name="Identifier">HASH1a9cea0f239f754007681b</Metadata>
<Metadata name="lastmodified">1693810437</Metadata>
<Metadata name="lastmodifieddate">20230904</Metadata>
<Metadata name="oailastmodified">1693810552</Metadata>
<Metadata name="oailastmodifieddate">20230904</Metadata>
<Metadata name="assocfilepath">HASH1a9c.dir</Metadata>
<Metadata name="gsdlassocfile">doc.pdf:application/pdf:</Metadata>
</Description>
<Content>
<a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p><b>Greenstone: A Comprehensive Open-Source<br />Digital Library Software System<br /></b></p><br /><p><i>Ian H. Witten,* Rodger J. McNab,† Stefan J. Boddie,* David Bainbridge<br /></i></p><br /><p> Dept of Computer Science<br /></p><br /><p>University of Waikato, New Zealand<br /></p><br /><p>E-mail: {ihw, sjboddie, davidb}@cs.waikato.ac.nz<br /></p><br /><p>† Digilib Systems<br /></p><br /><p>Hamilton, New Zealand<br /></p><br /><p>E-mail: [email protected]<br /></p><br /><p><b>ABSTRACT<br /></b></p><br /><p>This paper describes the Greenstone digital library<br />software, a comprehensive, open-source system for the<br />construction and presentation of information collections.<br />Collections built with Greenstone offer effective full-text<br />searching and metadata-based browsing facilities that are<br />attractive and easy to use. Moreover, they are easily<br />maintainable and can be augmented and rebuilt entirely<br />automatically. The system is extensible: software<br />“plugins” accommodate different document and metadata<br />types.<br /></p><br /><p><b>INTRODUCTION<br /></b></p><br /><p>Notwithstanding intense research activity in the digital<br />library field during the second half of the 1990s,<br />comprehensive software systems for creating digital<br />libraries are not widely available. In fact, the usual solution<br />when creating a digital library is also the most<br />obvious: just put it on the Web. But consider how much<br />effort is involved in constructing a Web site for a digital<br />library. To be effective it needs to be visually attractive<br />and ergonomically easy to use, incorporate convenient and<br />powerful searching capabilities, and offer rich and natural<br />browsing facilities. 
Above all it must be easy to maintain<br />and augment, which presents a significant challenge if any<br />manual organization is involved.<br /></p><br /><p>The alternative is to automate these activities through<br />software tools. But the broad scope of digital library<br />requirements makes this a daunting prospect. Ideally the<br />software should incorporate facilities ranging from<br />multilingual information retrieval to distributed computing<br />protocols, from interoperability to search engine<br />technology, from metadata standards to multiformat<br />document parsing, from multimedia to multiple operating<br />systems, from Web browsers to plug-and-play DVDs.<br /></p><br /><p>The Greenstone Digital Library Software from the New<br />Zealand Digital Library (NZDL) project tackles this issue<br />by providing a new way of organizing information and<br />making it available over the Internet. A <i>collection</i> of<br />information comprises several (typically several thousand,<br />or several million) <i>documents</i>, and a uniform interface is<br />provided to all documents in a collection. A library may<br />include many different collections, each organized<br />differently, though there is a strong family resemblance in<br />how collections are presented.<br /></p><br /><p>Making information available using this system is far more<br />than “just putting it on the Web.” The collection becomes<br />maintainable, searchable, and browsable. Each collection,<br />prior to presentation, undergoes a “building” process that,<br />once established, is completely automatic. This process<br />creates all the structures that are used at run-time for<br />accessing the collection. Searching is based on various<br />indexes, while browsing is based on various metadata;<br />support structures for both are created during the building<br />operation. 
When new material appears it can be fully<br />incorporated into the collection by rebuilding.<br /></p><br /><p>To address the exceptionally broad demands of digital<br />libraries, the system is public and extensible. It is issued<br />under the GNU General Public License and, in the spirit of open-<br />source software, users are invited to contribute<br />modifications and enhancements. Only through an<br />international cooperative effort will digital library software<br />become sufficiently comprehensive to meet the world’s<br />needs. Currently the Greenstone software is used at sites in<br />Canada, Germany, New Zealand, Romania, UK, and the<br />US, and collections range from newspaper articles to<br />technical documents, from educational journals to oral<br />history, from visual art to folksongs. The software has<br />been used for collections in many different languages, and<br />for CD-ROMs that have been published by the United<br />Nations and other humanitarian agencies in Belgium,<br />France, Japan, and the US for distribution in developing<br />countries (Humanity Libraries, 1998; PAHO, 1999;<br />UNESCO, 1999; UNU, 1998). Further details can be<br />obtained from <i>www.nzdl.org</i>.</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>This paper sets the scene with a brief discussion of what a<br />digital library is. We then give an overview of the facilities<br />offered by Greenstone and show how end users find<br />information in collections. Next we describe the files and<br />directories involved in a collection, and then discuss the<br />processes of updating existing collections and creating new<br />ones, including extending the software to provide new<br />facilities. 
We conclude with an overview of related work.<br /></p><br /><p><b>WHAT IS A DIGITAL LIBRARY?<br /></b></p><br /><p> Ten definitions of the term “digital library” have been<br />culled from the literature by Fox (1998), and their spirit is<br />captured in the following brief characterization:<br /></p><br /><p> <i>A collection of digital objects, including text,<br />video, and audio, along with methods for access<br />and retrieval, and for selection, organization<br />and maintenance of the collection<br /></i></p><br /><p> (Akscyn and Witten, 1998). Lesk (1998) views digital<br />libraries as “organized collections of digital information,”<br />and wisely recommends that they articulate the principles<br />governing what is included and how the collection is<br />organized.<br /></p><br /><p> Digital libraries are generally distinguished from the<br />World-Wide Web, the essential difference being in<br />selection and organization. But they are not generally<br />distinguished from a web <i>site</i>: indeed, virtually all extant<br />digital libraries manifest themselves as a web site. Hence<br />the obvious question: to make a digital library, why not<br />just put the information on the Web?<br /></p><br /><p> But we make a distinction between a digital library and a<br />web site that lies at the heart of our software design: one<br />should easily be able to add new material to a library<br />without having to integrate it manually or edit its content<br />in any way. Once added, new material should immediately<br />become a first-class component of the library. And what<br />permits it to be integrated into existing searching and<br />browsing structures without any manual intervention is<br /><i>metadata</i>. 
This provides sufficient focus to the concept of<br />“digital library” to support the development of a<br />construction kit.<br /></p><br /><p><b>OVERVIEW OF GREENSTONE<br /></b></p><br /><p> Information collections built by Greenstone combine<br />extensive full-text search facilities with browsing indexes<br />based on different metadata types. There are several ways<br />for users to find information, although they differ between<br />collections depending on the metadata available and the<br />collection design. Typically you can <i>search for particular<br />words</i> that appear in the text, or within a section of a<br />document, or within a title or section heading. You can<br /><i>browse documents by title</i>: just click on the displayed book<br />icon to read it. You can <i>browse documents by subject</i>.<br />Subjects are represented by bookshelves: just click on a<br />shelf to see the books. Where appropriate, documents<br />come complete with a table of contents (constructed<br />automatically): you can click on a chapter or subsection to<br />open it, expand the full table of contents, or expand the full<br />document.<br /></p><br /><p> An example of searching is shown in Figure 1 where<br />documents in the Global Help Project’s Humanity<br />Development Library (HDL) are being searched for<br />chapters matching the word <i>butterfly</i>. In Figure 2 the same<br />collection is being browsed by subject: by clicking on the<br />bookshelf icons the user has discovered an item under<br />Section 16, Animal Husbandry. Pursuing an interest in<br />butterfly farming, the user selects a book by clicking on its<br />book icon. In Figure 3 the front cover of the book is<br />displayed as a graphic on the left, and the automatically<br />constructed table of contents appears at the start of the<br />document. 
The current focus, <i>Introduction and Summary</i>,<br />is shown in bold in the table of contents with its text<br />starting further down the page.<br /></p><br /><p> In accordance with Lesk’s advice, a statement of purpose<br />and coverage accompanies each collection, along with an<br />explanation of how it is organized (Figure 1 shows the<br />start of this). A distinction is made between <i>searching</i> and<br /><i>browsing</i>. Searching is full-text, and, depending on the<br />collection’s design, the user can choose between indexes<br />built from different parts of the documents, or from<br />different metadata. Some collections have an index of full<br />documents, an index of sections, an index of paragraphs,<br />an index of titles, and an index of section headings, each of<br />which can be searched for particular words or phrases.<br />Browsing involves data structures created from metadata<br />that the user can examine: lists of authors, lists of titles,<br />lists of dates, hierarchical classification structures, and so<br />on. Data structures for both browsing and searching are<br />built according to instructions in a configuration file,<br />which controls both building and serving the collection.<br />Sample configuration files are discussed below.<br /></p><br /><p><b>Figure 1: Searching the HDL collection</b></p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p> Rich browsing facilities can be provided by manually<br />linking parts of documents together and building explicit<br />indexes and tables of contents. However, manually-created<br />linking becomes difficult to maintain, and often falls into<br />disrepair when a collection expands. The Greenstone<br />software takes a different tack: it facilitates <i>maintainability<br /></i>by creating all searching and browsing structures<br />automatically from the documents themselves. No links<br />are inserted by hand. 
This means that when new<br />documents in the same format become available, they can<br />be added automatically. Indeed, for some collections this is<br />done by processes that wake up regularly, scout for new<br />material, and rebuild the indexes, all without manual<br />intervention.<br /></p><br /><p>Collections comprise many documents: thousands, tens of<br />thousands, or even millions. Each document may be<br />hierarchically organized into <i>sections</i> (subsections, sub-<br />subsections, and so on). Each section comprises one or<br />more <i>paragraphs</i>. Metadata such as author, title, date,<br />keywords, and so on, may be associated with documents,<br />or with individual sections of documents. This is the raw<br />material for indexes. It must either be provided explicitly<br />for each document and section (for example, in an<br />accompanying spreadsheet) or be derivable automatically<br />from the source documents. Metadata is converted to<br />Dublin Core and stored with the document for internal use.<br /></p><br /><p> In order to accommodate different kinds of source<br />documents, the software is organized so that “plugins” can<br />be written for new document types. Plugins exist for plain<br />text documents, HTML documents, email documents, and<br />bibliographic formats. Word documents are handled by<br />saving them as HTML; PostScript ones by applying a<br />preprocessor (Nevill-Manning <i>et al</i>., 1998). Specially<br />written plugins also exist for proprietary formats such as<br />that used by the BBC archives department. A collection<br />may have source documents in different forms: it is just a<br />matter of specifying all the necessary plugins. In order to<br />build browsing indexes from metadata, an analogous<br />scheme of “classifiers” is used: classifiers create indexes<br />of various kinds based on metadata. 
Source documents are<br />brought into the Greenstone system through a process<br />called <i>importing</i>, which uses the plugins and classifiers<br />specified in the collection configuration file.<br /></p><br /><p> The international Unicode character set is used throughout,<br />so documents (and interfaces) can be written in any<br />language. Collections have so far been produced in<br />English, French, Spanish, German, Maori, Chinese, and<br />Arabic. The NZDL Web site provides numerous examples.<br />Collections can contain text, pictures, and even audio and<br />video clips; a text-only version of the interface is also<br />provided to accommodate visually impaired users.<br />Compression technology is used to ensure best use of<br />storage (Witten <i>et al.</i>, 1999). Most non-textual material is<br />either linked to textual documents or accompanied by<br />textual descriptions (such as photo captions) to allow full-<br />text searching and browsing. However, the architecture<br />permits the implementation of plugins and classifiers even<br />for non-textual data.<br /></p><br /><p> The system includes an “administrative” function whereby<br />specified users can examine the composition of all<br />collections, protect documents so that they can only be<br />accessed by registered users on presentation of a password,<br />and so on. Logs of user activity are kept that record all<br />queries made to every Greenstone collection (though this<br />facility can be disabled).<br /></p><br /><p> Although primarily designed for Internet access over the<br />World-Wide Web, collections can be made available, in<br />precisely the same form, on CD-ROM. In either case they<br />are accessed through any Web browser. Greenstone CD-<br />ROMs operate on a standalone PC under Windows 3.X,<br />95, 98, and NT, and the interaction is identical to accessing<br />the collection on the Web, except that response is faster<br />and more predictable. 
The requirement to operate on early<br />Windows systems is one that plagues the software design,<br />but is crucial for many users, particularly those in<br />underdeveloped countries seeking access to humanitarian<br />aid collections. If the PC is connected to a network<br />(intranet or Internet), a custom-built Web server provided<br />on each CD makes exactly the same information available<br />to others through their standard Web browser. The use of<br />compression ensures that the greatest possible volume of<br />information can be packed on to a CD-ROM.<br /></p><br /><p> The collection-serving software operates under Unix and<br />Windows NT, and works with standard Web servers. A<br />flexible process structure allows different collections to be<br />served by different computers, yet be presented to the user<br />in the same way, on the same Web page, as part of the<br />same digital library, even as part of the same collection<br />(McNab and Witten, 1998). Existing collections can be<br />updated and new ones brought on-line at any time, without<br />bringing the system down; the process responsible for the<br />user interface will notice (through periodic polling) when<br />new collections appear and add them to the list presented<br />to the user.<br /></p><br /><p><b>Figure 2: Browsing the HDL collection by subject</b></p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p><b>FINDING INFORMATION<br /></b></p><br /><p> Greenstone digital library systems generally include<br />several separate collections. 
A home page allows you to<br />select a collection; in addition, each collection has its own<br />“about” page that gives you information about how the<br />collection is organized and the principles governing what<br />is included.<br /></p><br /><p> All icons in the screenshots of Figures 1–4 are clickable.<br />Those icons at the top of the page return to the home page,<br />provide help text, and allow you to set user interface and<br />searching preferences. The navigation bar underneath<br />gives access to the searching and browsing facilities,<br />which differ from one collection to another.<br /></p><br /><p> Each of the five buttons provides a different way to find<br />information. You can <i>search for particular words</i> that<br />appear in the text from the “search” page (or from the<br />“about” page of Figure 1). This collection contains indexes<br />of chapters, section titles, and entire books. The default<br />search interface is a simple one, suitable for casual users;<br />advanced searching (which allows full Boolean<br />expressions, phrase searching, case and stemming<br />control) can be enabled from the <i>Preferences</i> page.<br /></p><br /><p> This collection has four browsable metadata indexes. You<br />can <i>access publications by subject</i> by clicking the <i>subjects<br /></i>button, which brings up a list of subjects, represented by<br />bookshelves (Figure 2). You can <i>access publications by<br />title</i> by clicking <i>titles a-z</i> (Figure 4), which brings up a list<br />of books in alphabetic order. You can <i>access publications<br />by organization</i> (i.e. Dublin Core “publisher”), bringing up<br />a list of organizations. You can <i>access publications by<br />“how to” listing</i>, yielding a list of hints defined by the<br />collection’s editors. 
We use the Dublin Core as a base and<br />extend it in an <i>ad hoc</i> manner to accommodate the<br />individual requirements of collection designers.<br /></p><br /><p><b>FILES IN A COLLECTION<br /></b></p><br /><p> When a new collection is created or material is added to an<br />existing one, the original source documents are first<br />brought into the system through a process known as<br />“importing.” This involves converting documents into a<br />simple HTML-like format known as GML (for<br />“Greenstone Markup Language”), which includes any<br />metadata associated with the document. Documents are<br />assumed to be in the Unicode UTF-8 code (of which the<br />ASCII characters form a subset).<br /></p><br /><p> <b>Files and directories<br /></b></p><br /><p> There is a separate directory for each collection, which<br />contains five subdirectories: the original raw material<br />(<i>import</i>), the GML files created from this (<i>archives</i>), the<br />final collection as it is served to users (<i>index</i>), a directory<br />for use during the building process (<i>building</i>), and one for<br />any supporting files (<i>etc</i>), including the configuration file<br />that controls the collection creation procedure. Additional<br />files might be required: for example, building a hierarchy<br />of classifications requires a data file of sub-classifications.<br /></p><br /><p> <b>The imported documents<br /></b></p><br /><p> In order to identify documents internally, a unique object<br />identifier or OID is assigned to each original source<br />document when it is imported (formed by hashing the<br />content, to overcome file duplication effects caused by<br />mirroring) and stored as metadata within that document. It<br />is important that OIDs persist throughout the index-<br />building process, so that a user’s search history is<br />unaffected by rebuilding the collection. 
OIDs are assigned<br />by hashing the contents of the original source document.<br /></p><br /><p> Once imported, each document is stored in its own<br />subdirectory of <i>archives</i>, along with any associated<br />files (for example, images). To ensure compatibility with<br />Windows 3.0, only eight characters are used in directory<br />and file names, which causes annoying but essentially<br />trivial complications.<br /></p><br /><p> <b>Inside the documents<br /></b></p><br /><p> The GML format imposes a limited amount of structure on<br />documents. Documents are divided into paragraphs. They<br />can be split hierarchically into sections and subsections.<br />OIDs are extended to identify these components by<br />appending numbers, separated by periods, to a document’s<br />OID. When a book is read, its section hierarchy is visible<br />as the table of contents (Figure 3). Chapters, sections,<br />subsections, and pages are all implemented simply as<br />“sections” within the document. In some collections<br />documents do not have a hierarchical subsection structure,<br />but are split into pages to permit browsing within a<br />retrieved document.<br /></p><br /><p> The document structure is used for searchable indexes.<br />There are three levels of index: <i>documents</i>, <i>sections</i>, and<br /></p><br /><p><b>Figure 3: Reading a book in the HDL</b></p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p><i>paragraphs</i>, corresponding to the distinctions that GML<br />makes; the hierarchical structure is flattened for the<br />purposes of creating these indexes. Indexes can be of text,<br />or metadata, or any combination. 
Thus you can create a<br />searchable index of section titles, and/or authors, and/or<br />document descriptions, as well as the document text.<br /></p><br /><p><b>UPDATING EXISTING COLLECTIONS<br /></b></p><br /><p> Updating an existing collection with new files in the same<br />format is easy. For example, the raw material for the HDL<br />is supplied in the form of HTML files marked up with<br />&lt;&lt;TOC&gt;&gt; tags to split books into sections and<br />subsections, and &lt;&lt;I&gt;&gt; tags to indicate where an image is<br />to be inserted. For each book in the library there is a<br />directory that contains a single HTML file representing the<br />book, and separate files containing the associated images.<br />An accompanying spreadsheet file contains the<br />classification hierarchy; this is converted to a simple file<br />format (using Excel’s <i>Save As</i> command).<br /></p><br /><p> Since the collection exists, its directory is already set up<br />with subdirectories <i>import</i>, <i>archives</i>, <i>building</i>, <i>index</i>, and<br /><i>etc</i>, and the <i>etc</i> directory will contain a suitable collection<br />configuration file.<br /></p><br /><p> <b>The updating procedure<br /></b></p><br /><p> To update a collection, the new raw material is placed in<br />the <i>import</i> directory, in whatever form it is available. Then<br />the <i>import</i> process is invoked, which converts the files into<br />GML using the specified plugins. Old material for which<br />GML files have previously been created is not re-imported.<br />Then the <i>build</i> process is invoked to build the requisite<br />indexes for the collection. Finally, the contents of the<br /><i>building</i> directory are moved into the <i>index</i> directory, and<br />the new version of the collection automatically becomes<br />live.<br /></p><br /><p> This procedure may seem cumbersome. 
But all the steps<br />are necessary for efficient operation with large collections.<br />The <i>import</i> process could be performed on the fly during<br />the building operation, but because building indexes is a<br />multipass operation, the often lengthy importing would be<br />repeated several times. The <i>build</i> process can take<br />considerable time (a day or two for very large<br />collections). Consequently, the results are placed in the<br /><i>building</i> directory so that, if the collection already exists, it<br />will continue to be served to users in its old form<br />throughout the building operation.<br /></p><br /><p> Active users of the collection will not be disturbed when<br />the new version becomes live; they will probably not<br />even notice. The persistent OIDs ensure that interactions<br />remain coherent: users who are examining the results of a<br />query or browse operation will still retrieve the expected<br />documents, and if a search is actually in progress when<br />the change takes place the program detects the resulting<br />file-structure inconsistency and automatically and<br />transparently re-executes the query, this time on the new<br />version of the collection.<br /></p><br /><p> <b>How it works<br /></b></p><br /><p> The original material in the <i>import</i> directory may be in any<br />format, and plugins are required to process each format<br />type. The plugins that a collection uses must be specified<br />in the collection configuration file. The <i>import</i> program<br />reads the list of plugins and passes each document to each<br />plugin in order until it finds one that can process it. When<br />updating an existing collection, all plugins necessary to<br />process new material should already have been specified in<br />the configuration file.<br /></p><br /><p> The building step creates the indexes for both searching<br />and browsing. 
The MG software is generally used to do the<br />searching (Witten <i>et al.</i>, 1999), and the <i>mgbuild</i> module is<br />automatically invoked to create each of the indexes that is<br />required. For example, the Humanity Development Library<br />has three indexes, one for entire books, one for chapters,<br />and one for section titles. Subdirectories of the <i>index<br /></i>directory are created for each of these indexes.<br /></p><br /><p><b>Figure 4: Browsing titles in the HDL</b></p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p> MG also compresses the text of the collection, and the<br />image files are linked into the <i>index</i> subdirectory. At this point<br />none of the material in the <i>import</i> and <i>archives</i> directories<br />is needed to run the collection, and it can be removed from<br />the file system (though it would be needed if the<br />collection were rebuilt).<br /></p><br /><p> Associated with each collection is a database stored in<br />GDBM (GNU database manager) format. This contains an<br />entry for each document, giving its OID, its internal MG<br />document number, and metadata such as title. Information<br />for each of the browsing indexes, which appear as buttons<br />on the Greenstone search/browse bar, is also extracted<br />during the building process and stored in the database. A<br />“classifier” program is required for each browsing index to<br />extract the appropriate information from GML documents.<br />Like plugins, classifiers are written on an <i>ad hoc</i> basis for<br />the particular information required, and where possible<br />reused from one collection to another.<br /></p><br /><p> The building program creates the indexes based on<br />whatever appears in the <i>archives</i> directory. 
The first plugin<br />specified by all collections is one that processes GML<br />files, and so if <i>archives</i> contains imported files they will be<br />processed correctly. If it contains material in the original<br />format, that will be converted using the appropriate plugin.<br />Thus the import process is optional.<br /></p><br /><p> GML is designed to be fast and easy to parse, an important<br />requirement when millions of documents are to be<br />processed. Something as simple as requiring tags to be<br />lower-case, for example, yields a substantial speed-up. In<br />certain circumstances, however, it might be preferable to<br />use a standardized format such as XML. This is<br />straightforward to implement (just write an XML<br />plugin), although we have not done so ourselves. Given<br />the transitory nature of the imported data, to date, we have<br />found GML a satisfactory and beneficial format.<br /></p><br /><p><b>CREATING NEW COLLECTIONS<br /></b></p><br /><p> Building new collections from scratch is only slightly<br />different from updating an existing collection. The key<br />new requirement is creating a collection configuration file,<br />and a software utility is provided to help. Two pieces of<br />information are required for this: the name of the directory<br />that the collection will use (into which the source data and<br />other files will eventually be placed), and a contact e-mail<br />address for use if any problems are encountered by the<br />software once the collection is up and running. The utility<br />creates files and directories within the newly-named<br />directory to support a generic collection of plain text<br />documents. 
With suitable data placed in the <i>import<br /></i>directory, building the collection at this point will yield a<br />document-level searchable index of all the text and a<br />browsable list of “titles” (defined in this case to be the<br />document filenames).<br /></p><br /><p> To enhance the functionality and presentation (something<br />anything but the most trivial collection will require), the<br />configuration file must be edited. For a collection sourced<br />from documents in an already supported data format,<br />presented in a similar fashion to an existing collection, the<br /></p><br /><p>creator [email protected] 1<br />maintainer [email protected] 2<br />public True 3<br /></p><br /><p>4<br />indexes document:text 5<br />defaultindex document:text 6<br />plugins GMLPlug TEXTPlug ArcPlug RecPlug 7<br /></p><br /><p>8<br />classify AZList metadata=Title 9<br /></p><br /><p>10<br />collectionmeta collectionname &quot;generic text collection&quot; 11<br /></p><br /><p>(a) collectionmeta .document:text &quot;documents&quot; 12<br /></p><br /><p>creator [email protected] 1<br />maintainer [email protected] 2<br />public True 3<br /></p><br /><p>4<br />indexes document:text document:From 5<br />defaultindex document:text 6<br />plugins GMLPlug EMAILPlug ArcPlug RecPlug 7<br /></p><br /><p>8<br />classify AZList metadata=Title 9<br />classify DateList 10<br /></p><br /><p>11<br />collectionmeta collectionname &quot;Email messages&quot; 12<br />collectionmeta .document:text &quot;documents&quot; 13<br />collectionmeta .document:From &quot;email senders&quot; 14<br /></p><br /><p>15<br />format QueryResults \\ 16<br /></p><br /><p>(b) &lt;td&gt;[link][icon][/link]&lt;/td&gt;&lt;td&gt;[Title]&lt;/td&gt;&lt;td&gt;[Author]&lt;/td&gt; 17<br /></p><br /><p><b>Figure 5: Collection configuration files (a) generic, (b) for an email collection</b></p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; 
page-break-after:always"><div><p>amount of editing is minimal. Importing new data formats<br />and browsing metadata in ways not currently supported are<br />more complex activities that require programming skills.<br /></p><br /><p> <b>Modifying the configuration file<br /></b></p><br /><p> Figure 5b shows simple alterations to the generic<br />configuration file in Figure 5a that was generated by the<br />new-collection utility. <i>TEXTPlug</i> is replaced with<br /><i>EMAILPlug</i> (line 7), which reads email files and extracts<br />metadata (<i>From</i>, <i>To</i>, <i>Date</i>, <i>Subject</i>) from them. A classifier<br />for dates is added (line 10) to make the collection<br />browsable chronologically. The default presentation of<br />search results is overridden (line 17) to display both the<br />title of the message (i.e. Dublin Core <i>Title</i>) and its sender<br />(i.e. Dublin Core <i>Author</i>). Elements in square brackets,<br />such as <i>[Title]</i>, are replaced by the metadata associated<br />with a particular document. The built-in term <i>[icon]</i><br />produces a suitable image that represents the document<br />(such as a book icon or page icon), and the <i>[link]…[/link]</i><br />construct forms a hyperlink to the complete document.<br />Anything else in the format statement, which in this case is<br />solely HTML table-cell tags, is passed through to the<br />page being displayed.<br /></p><br /><p>As this example shows, creating a new collection that stays<br />within the bounds of the library’s established capabilities<br />is within the reach of many computer users, for<br />instance computer-trained librarians. 
Extending<br />Greenstone to handle new document formats and browse<br />metadata in new ways is more challenging.<br /></p><br /><p> <b>Writing new plugins and classifiers<br /></b></p><br /><p> Extensibility is obtained through plugins and classifiers.<br />These are modules of code that can be slotted into the<br />system to enhance its capabilities. Plugins parse<br />documents, extracting the text and metadata to be indexed.<br />Classifiers control how metadata is brought together to<br />form browsable data structures. Both are specified in an<br />object-oriented framework using inheritance to minimize<br />the amount of code written.<br /></p><br /><p> A plugin must specify three things: what file formats it can<br />handle, how they should be parsed, and whether the plugin<br />is recursive. File formats are normally determined using<br />regular expression matching on the filename. For example,<br />the HTML plugin accepts all files that end in <i>.htm</i>, <i>.html</i>,<br /><i>.HTM</i>, or <i>.HTML</i>. (It is quite possible, however, to write<br />plugins that “look inside” the file as well.) For other files,<br />the plugin returns <i>undefined</i> and the file is passed to the<br />next plugin in the collection’s configuration file (e.g.<br />Figure 5 line 7). If it can handle the file, the plugin parses it and<br />returns the number of documents processed. This involves<br />extracting text and metadata and adding them to the library’s<br />content through calls to <i>add_text</i> and <i>add_metadata</i>.<br /></p><br /><p> Some plugins (“recursive” ones) add extra files into the<br />stream of data processed during the building phase by<br />artificially reactivating the list of plugins. This is how<br />directory hierarchies are traversed.<br /></p><br /><p> Plugins are small modules of code that are easy to write.<br />We monitored the time it took to develop a new one that<br />was different to any we had produced so far. 
We chose as<br />our example a collection of HTML bookmark files,<br />the motivation being to produce a convenient way of<br />searching and browsing one’s bookmarked Web pages.<br />Figure 6 shows a user searching for bookmarked pages<br />about <i>music</i>. The new plugin took under an hour to write,<br />and was 160 lines long (ignoring blank lines and<br />comments), about the average length of existing plugins.<br /></p><br /><p> Classifiers are more general than plugins because they<br />work on GML-format data. For example, any collection whose<br />plugins generate date metadata in accordance with the Dublin<br />Core can be made browsable<br />chronologically by specifying the <i>DateList</i> classifier in the<br />collection’s configuration file (Figure 7). Classifiers are<br />more elaborate than most plugins, but new ones are seldom<br />required. The average length of existing classifiers is 230<br />lines.<br /></p><br /><p> Classifiers must specify three things: an initialization<br />routine, how individual documents are classified, and the<br />final browsable data structure. Initialization takes care of<br />any options specified in the configuration file (such as<br /><i>metadata=Title</i> on line 9 of Figure 5b). Classifying<br />individual documents is an iterative process: for each one,<br />a call to <i>document-classify</i> is made. On presentation of the<br />document’s OID, the necessary metadata is located and<br />used to control where the document is added to the<br />browsable data structure being constructed.<br /></p><br /><p> Once all documents have been added, a request is made for<br />the completed data structure. Some classifiers return the<br />data structure directly; others transform it<br />before returning it. 
For example, the <i>AZList</i> classifier<br /></p><br /><p><b>Figure 6: Searching bookmarked Web pages</b></p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>divides the alphabetically sorted list of metadata into<br />separate pages of about the same size and returns the<br />alphabetic ranges for each one (Figure 4).<br /></p><br /><p><b>OVERVIEW OF RELATED WORK<br /></b></p><br /><p>Two projects that provide substantial open-source digital<br />library software are Dienst (Lagoze and Fielding, 1998)<br />and Harvest (Bowman <i>et al.</i>, 1994). The origins of Dienst<br />(<i>www.cs.cornell.edu/cdlrg</i>) stretch back to 1992. The term<br />has come to represent three entities: a conceptual<br />architecture for distributed digital libraries; an open<br />protocol for service communication; and a software<br />system that implements the protocol. To date, five sample<br />digital libraries have been built using this technology.<br />They take two forms: technical reports<br />and primary source documents.<br /></p><br /><p>Best known is NCSTRL, the Networked Computer<br />Science Technical Reference Library project<br />(<i>www.ncstrl.org</i>). This collection facilitates searching by<br />title, author and abstract, and browsing by year and author,<br />across a distributed network of document repositories.<br />Documents can (where supported) be delivered in various<br />formats such as PostScript, a thumbnail overview of the<br />pages, and a GIF image of a particular page.<br /></p><br /><p>The <i>Making of America</i> resource is an example of a<br />collection based around primary sources, in this case<br />American social history, 1830–1900. It has a different<br />“look and feel” to NCSTRL, being strongly oriented<br />toward browsing rather than searching. A user navigates<br />their way through a hierarchical structure of hyperlinks to<br />reach a book of interest. 
The book itself is a series of<br />scanned images: delivery options include going directly to<br />a page number, next and previous page buttons, and<br />displaying a particular page at different resolutions. A text<br />version of the page is also available, and it can be<br />searched.<br /></p><br /><p>Started in 1994, Harvest is also a long-running research<br />project. It provides an efficient means of gathering source<br />data from the Internet and distributing indexing<br />information over the Internet. This is accomplished<br />through five components: <i>gatherer</i>, <i>broker</i>, <i>indexer</i>,<br /><i>replicator</i> and <i>cache</i>. The first three are central to creating,<br />updating and searching a collection; the last two help to<br />improve performance over the Internet through transparent<br />mirroring and caching techniques.<br /></p><br /><p>The system is configurable and customizable. While<br />searching is most commonly implemented using Glimpse<br />(<i>glimpse.cs.arizona.edu</i>), in principle any search engine<br />that supports incremental updates and Boolean<br />combinations of attribute-based queries can be used. It is<br />possible to control what types of documents are gathered<br />during creation and updating, and how the query interface<br />looks and is laid out.<br /></p><br /><p>Sample collections cited by the developers include 21,000<br />computer science technical reports and 7,000 home pages.<br />Other examples include a sizable collection of agriculture-<br />related electronic journals and magazines called “tomato-<br />juice” (accessed through <i>hegel.lib.ncsu.edu</i>) and a full-text<br />index of library-related electronic serials<br />(<i>sunsite.berkeley.edu/IndexMorganagus</i>). 
Harvest is also<br />often used to index Web sites (for example<br /><i>www.middlebury.edu</i>).<br /></p><br /><p>Greenstone, Dienst and Harvest show<br />both similarities and differences. All provide substantial<br />digital library systems, hence common themes recur, but<br />they are driven by projects with different aims. Harvest,<br />for instance, was not conceived as a digital library project<br />at all, but by virtue of its selective document gathering<br />process it can be classed (and is used) as one. While it<br />provides sophisticated search options, it lacks the<br />complementary service of browsing. Furthermore, it adds<br />no structure or order to the documents collected, relying<br />on whatever structures are present in the sites from which they<br />were gathered. A proven strength of the design is its<br />flexibility through configuration and customization, an<br />element also present in Greenstone.<br /></p><br /><p>Dienst (best exemplified through the NCSTRL<br />work) supports searching and browsing, like Greenstone.<br />Both use open protocols. Differences include a heavy<br />reliance in Dienst on user-supplied information when a<br />document is added, and a smaller range of document types<br />supported, although Dienst does include a document<br />model that should, over time, allow this to expand with<br />relative ease.<br /></p><br /><p>There are also commercial systems that provide similar<br />digital library services to those described. However, since<br /></p><br /><p><b>Figure 7: Browsing a newspaper collection by date</b></p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>corporate culture instills proprietary attitudes there is little<br />opportunity for advancement through a shared<br />collaborative effort. 
Consequently, they are not reviewed<br />here.<br /></p><br /><p><b>CONCLUSIONS<br /></b></p><br /><p>Greenstone is a comprehensive software system for<br />creating digital library collections. It builds data structures<br />for searching and browsing from the material provided,<br />rather than relying on any hand-crafting. The process is<br />controlled by a configuration file, and once a collection<br />exists new material can be added completely<br />automatically. Browsing is based on Dublin Core<br />metadata.<br /></p><br /><p>New collections can be developed easily, particularly if<br />they resemble existing ones. Extensibility is achieved<br />through software “plugins” that can be written to<br />accommodate documents, and metadata, in different<br />formats. Standard plugins exist for many document types;<br />new ones are easily written. Browsing is controlled by<br />“classifiers” that process metadata into browsing structures<br />(by date, alphabetical, hierarchical, etc.).<br /></p><br /><p>However, the most powerful support for extensibility is<br />achieved not by technical means but by making the source<br />code freely available under the GNU General Public License. Only<br />through an international cooperative effort will digital<br />library software become sufficiently comprehensive to<br />meet the world’s needs with the richness and flexibility<br />that users deserve.<br /></p><br /><p><b>ACKNOWLEDGMENTS<br /></b></p><br /><p>We gratefully acknowledge all those who have worked on<br />the Greenstone software, and all members of the New<br />Zealand Digital Library project for their enthusiasm and<br />ideas.<br /></p><br /><p><b>REFERENCES<br /></b></p><br /><p>1. Akscyn, R.M. and Witten, I.H. (1998) “Report on First<br />Summit on International Cooperation on Digital<br />Libraries.” ks.com/idla-wp-oct98.<br /></p><br /><p>2. Bowman, C.M., Danzig, P.B., Manber, U., and<br />Schwartz, M.F. 
“Scalable Internet resource discovery:<br />Research problems and approaches.” <i>Communications<br />of the ACM</i>, Vol. 37, No. 8, pp. 98–107, 1994.<br /></p><br /><p>3. Fox, E. (1998) “Digital library definitions.”<br />ei.cs.vt.edu/~fox/dlib/def.html.<br /></p><br /><p>4. Humanity Libraries (1998) <i>Humanity Development<br />Library</i>. CD-ROM produced by the Global Help<br />Project, Antwerp, Belgium.<br /></p><br /><p>5. Lagoze, C. and Fielding, D. (1998) “Defining Collections in<br />Distributed Digital Libraries.” <i>D-Lib Magazine</i>,<br />November.<br /></p><br /><p>6. PAHO (1999) <i>Virtual Disaster Library</i>. CD-ROM<br />produced by the Pan-American Health Organization,<br />Washington DC, USA.<br /></p><br /><p>7. McNab, R.J., Witten, I.H. and Boddie, S.J. (1998) “A<br />distributed digital library architecture incorporating<br />different index styles.” <i>Proc IEEE Advances in Digital<br />Libraries</i>, Santa Barbara, CA, pp. 36–45.<br /></p><br /><p>8. Nevill-Manning, C.G., Reed, T., and Witten, I.H.<br />(1998) “Extracting text from PostScript.”<br /><i>Software: Practice and Experience</i>, Vol. 28, No. 5, pp.<br />481–491; April.<br /></p><br /><p>9. UNESCO (1999) <i>SAHEL point DOC: Anthologie du<br />développement au Sahel</i>. CD-ROM produced by<br />UNESCO, Paris, France.<br /></p><br /><p>10. UNU (1998) <i>Collection on critical global issues.</i> CD-<br />ROM produced by the United Nations University<br />Press, Tokyo, Japan.<br /></p><br /><p>11. Witten, I.H., Moffat, A. and Bell, T. (1999) <i>Managing<br />Gigabytes: compressing and indexing documents and<br />images</i>, Morgan Kaufmann, second edition.</p><br /><br /></div></div><br /></Content>
</Section>
</Archive>
