source: other-projects/nightly-tasks/diffcol/trunk/gs3-model-collect/PDFBox/archives/HASH1a9c.dir/doc.xml@ 28241

Last change on this file since 28241 was 28241, checked in by ak19, 11 years ago
Rebuilt the GS3 model collection after the change over to using placeholders for standard GS path prefixes in the two archiveinf gdb files
File size: 52.4 KB

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
<Description>
<Metadata name="gsdldoctype">indexed_doc</Metadata>
<Metadata name="Language">en</Metadata>
<Metadata name="Encoding">utf8</Metadata>
<Metadata name="Title">Greenstone: A Comprehensive Open-Source Digital Library Software</Metadata>
<Metadata name="URL">http://research/ak19/gs3-svn-26Aug2013/gs2build/tmp/F204.html</Metadata>
<Metadata name="UTF8URL">http://research/ak19/gs3-svn-26Aug2013/gs2build/tmp/F204.html</Metadata>
<Metadata name="gsdlsourcefilename">import/pdf01.pdf</Metadata>
<Metadata name="gsdlconvertedfilename">/research/ak19/gs3-svn-26Aug2013/gs2build/tmp/F204.html</Metadata>
<Metadata name="OrigSource">F204.html</Metadata>
<Metadata name="Source">pdf01.pdf</Metadata>
<Metadata name="SourceFile">pdf01.pdf</Metadata>
<Metadata name="Plugin">PDFPlugin</Metadata>
<Metadata name="FileSize">269487</Metadata>
<Metadata name="FilenameRoot">pdf01</Metadata>
<Metadata name="FileFormat">PDF</Metadata>
<Metadata name="srcicon">_iconpdf_</Metadata>
<Metadata name="srclink_file">doc.pdf</Metadata>
<Metadata name="srclinkFile">doc.pdf</Metadata>
<Metadata name="NumPages">9</Metadata>
<Metadata name="Identifier">HASH1a9cea0f239f754007681b</Metadata>
<Metadata name="lastmodified">1378708192</Metadata>
<Metadata name="lastmodifieddate">20130909</Metadata>
<Metadata name="oailastmodified">1378708549</Metadata>
<Metadata name="oailastmodifieddate">20130909</Metadata>
<Metadata name="assocfilepath">HASH1a9c.dir</Metadata>
<Metadata name="gsdlassocfile">doc.pdf:application/pdf:</Metadata>
</Description>
<Content>
<a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>Greenstone: A Comprehensive Open-Source<br />Digital Library Software System<br /></p><br /><p>Ian H. Witten,* Rodger J. McNab,† Stefan J. Boddie,* David Bainbridge<br /><br /> Dept of Computer Science<br /></p><br /><p>University of Waikato, New Zealand<br />E-mail: {ihw, sjboddie, davidb}@cs.waikato.ac.nz<br /></p><br /><p>† Digilib Systems<br />Hamilton, New Zealand<br /></p><br /><p>E-mail: [email protected]<br /></p><br /><p>ABSTRACT<br /></p><br /><p>This paper describes the Greenstone digital library<br />software, a comprehensive, open-source system for the<br />construction and presentation of information collections.<br />Collections built with Greenstone offer effective full-text<br />searching and metadata-based browsing facilities that are<br />attractive and easy to use. Moreover, they are easily<br />maintainable and can be augmented and rebuilt entirely<br />automatically. The system is extensible: software<br />"plugins" accommodate different document and metadata<br />types.<br /></p><br /><p>INTRODUCTION<br /></p><br /><p>Notwithstanding intense research activity in the digital<br />library field during the second half of the 1990s,<br />comprehensive software systems for creating digital<br />libraries are not widely available. In fact, the usual solution<br />when creating a digital library is also the most<br />obvious: just put it on the Web. But consider how much<br />effort is involved in constructing a Web site for a digital<br />library. To be effective it needs to be visually attractive<br />and ergonomically easy to use, incorporate convenient and<br />powerful searching capabilities, and offer rich and natural<br />browsing facilities. 
Above all it must be easy to maintain<br />and augment, which presents a significant challenge if any<br />manual organization is involved.<br />The alternative is to automate these activities through<br />software tools. But the broad scope of digital library<br />requirements makes this a daunting prospect. Ideally the<br />software should incorporate facilities ranging from<br /></p><br /><p>multilingual information retrieval to distributed computing<br />protocols, from interoperability to search engine<br />technology, from metadata standards to multiformat<br />document parsing, from multimedia to multiple operating<br />systems, from Web browsers to plug-and-play DVDs.<br />The Greenstone Digital Library Software from the New<br />Zealand Digital Library (NZDL) project tackles this issue<br />by providing a new way of organizing information and<br />making it available over the Internet. A collection of<br />information comprises several (typically several thousand,<br />or several million) documents, and a uniform interface is<br />provided to all documents in a collection. A library may<br />include many different collections, each organized<br />differently, though there is a strong family resemblance in<br />how collections are presented.<br />Making information available using this system is far more<br />than "just putting it on the Web." The collection becomes<br />maintainable, searchable, and browsable. Each collection,<br />prior to presentation, undergoes a "building" process that,<br />once established, is completely automatic. This process<br />creates all the structures that are used at run-time for<br />accessing the collection. Searching is based on various<br />indexes, while browsing is based on various metadata;<br />support structures for both are created during the building<br />operation. 
When new material appears it can be fully<br />incorporated into the collection by rebuilding.<br />To address the exceptionally broad demands of digital<br />libraries, the system is public and extensible. It is issued<br />under the GNU General Public License and, in the spirit of open-<br />source software, users are invited to contribute<br />modifications and enhancements. Only through an<br />international cooperative effort will digital library software<br />become sufficiently comprehensive to meet the world's<br />needs. Currently the Greenstone software is used at sites in<br />Canada, Germany, New Zealand, Romania, UK, and the<br />US, and collections range from newspaper articles to<br />technical documents, from educational journals to oral<br />history, from visual art to folksongs. The software has<br />been used for collections in many different languages, and<br />for CD-ROMs that have been published by the United<br />Nations and other humanitarian agencies in Belgium,<br />France, Japan, and the US for distribution in developing<br />countries (Humanity Libraries, 1998; PAHO, 1999;<br />UNESCO, 1999; UNU, 1998). Further details can be<br />obtained from www.nzdl.org.</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>This paper sets the scene with a brief discussion of what a<br />digital library is. We then give an overview of the facilities<br />offered by Greenstone and show how end users find<br />information in collections. Next we describe the files and<br />directories involved in a collection, and then discuss the<br />processes of updating existing collections and creating new<br />ones, including extending the software to provide new<br />facilities. 
We conclude with an overview of related work.<br /></p><br /><p>WHAT IS A DIGITAL LIBRARY?<br /></p><br /><p> Ten definitions of the term "digital library" have been<br />culled from the literature by Fox (1998), and their spirit is<br />captured in the following brief characterization:<br /></p><br /><p> A collection of digital objects, including text,<br />video, and audio, along with methods for access<br />and retrieval, and for selection, organization<br />and maintenance of the collection<br /></p><br /><p> (Akscyn and Witten, 1998). Lesk (1998) views digital<br />libraries as "organized collections of digital information,"<br />and wisely recommends that they articulate the principles<br />governing what is included and how the collection is<br />organized.<br /> Digital libraries are generally distinguished from the<br />World-Wide Web, the essential difference being in<br />selection and organization. But they are not generally<br />distinguished from a web site: indeed, virtually all extant<br />digital libraries manifest themselves as a web site. Hence<br />the obvious question: to make a digital library, why not<br />just put the information on the Web?<br /> But we make a distinction between a digital library and a<br />web site that lies at the heart of our software design: one<br />should easily be able to add new material to a library<br />without having to integrate it manually or edit its content<br />in any way. Once added, new material should immediately<br /></p><br /><p>become a first-class component of the library. And what<br />permits it to be integrated into existing searching and<br />browsing structures without any manual intervention is<br />metadata. 
This provides sufficient focus to the concept of<br />"digital library" to support the development of a<br />construction kit.<br /></p><br /><p>OVERVIEW OF GREENSTONE<br /></p><br /><p> Information collections built by Greenstone combine<br />extensive full-text search facilities with browsing indexes<br />based on different metadata types. There are several ways<br />for users to find information, although they differ between<br />collections depending on the metadata available and the<br />collection design. Typically you can search for particular<br />words that appear in the text, or within a section of a<br />document, or within a title or section heading. You can<br />browse documents by title: just click on the displayed book<br />icon to read it. You can browse documents by subject.<br />Subjects are represented by bookshelves: just click on a<br />shelf to see the books. Where appropriate, documents<br />come complete with a table of contents (constructed<br />automatically): you can click on a chapter or subsection to<br />open it, expand the full table of contents, or expand the full<br />document.<br /> An example of searching is shown in Figure 1 where<br />documents in the Global Help Project's Humanity<br />Development Library (HDL) are being searched for<br />chapters matching the word butterfly. In Figure 2 the same<br />collection is being browsed by subject: by clicking on the<br />bookshelf icons the user has discovered an item under<br />Section 16, Animal Husbandry. Pursuing an interest in<br />butterfly farming, the user selects a book by clicking on its<br />book icon. In Figure 3 the front cover of the book is<br />displayed as a graphic on the left, and the automatically<br />constructed table of contents appears at the start of the<br />document. 
The current focus, Introduction and Summary,<br />is shown in bold in the table of contents with its text<br />starting further down the page.<br /> In accordance with Lesk's advice, a statement of purpose<br />and coverage accompanies each collection, along with an<br />explanation of how it is organized (Figure 1 shows the<br />start of this). A distinction is made between searching and<br />browsing. Searching is full-text, and, depending on the<br />collection's design, the user can choose between indexes<br />built from different parts of the documents, or from<br />different metadata. Some collections have an index of full<br />documents, an index of sections, an index of paragraphs,<br />an index of titles, and an index of section headings, each of<br />which can be searched for particular words or phrases.<br />Browsing involves data structures created from metadata<br />that the user can examine: lists of authors, lists of titles,<br />lists of dates, hierarchical classification structures, and so<br />on. Data structures for both browsing and searching are<br />built according to instructions in a configuration file,<br />which controls both building and serving the collection.<br />Sample configuration files are discussed below.<br /></p><br /><p>Figure 1: Searching the HDL collection</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p> Rich browsing facilities can be provided by manually<br />linking parts of documents together and building explicit<br />indexes and tables of contents. However, manually-created<br />linking becomes difficult to maintain, and often falls into<br />disrepair when a collection expands. The Greenstone<br />software takes a different tack: it facilitates maintainability<br />by creating all searching and browsing structures<br />automatically from the documents themselves. No links<br />are inserted by hand. 
This means that when new<br />documents in the same format become available, they can<br />be added automatically. Indeed, for some collections this is<br />done by processes that wake up regularly, scout for new<br />material, and rebuild the indexes, all without manual<br />intervention.<br />Collections comprise many documents: thousands, tens of<br />thousands, or even millions. Each document may be<br />hierarchically organized into sections (subsections, sub-<br />subsections, and so on). Each section comprises one or<br />more paragraphs. Metadata such as author, title, date,<br />keywords, and so on, may be associated with documents,<br />or with individual sections of documents. This is the raw<br />material for indexes. It must either be provided explicitly<br />for each document and section (for example, in an<br />accompanying spreadsheet) or be derivable automatically<br />from the source documents. Metadata is converted to<br />Dublin Core and stored with the document for internal use.<br /> In order to accommodate different kinds of source<br />documents, the software is organized so that "plugins" can<br />be written for new document types. Plugins exist for plain<br />text documents, HTML documents, email documents, and<br />bibliographic formats. Word documents are handled by<br />saving them as HTML; PostScript ones by applying a<br />preprocessor (Nevill-Manning et al., 1998). Specially<br />written plugins also exist for proprietary formats such as<br />that used by the BBC archives department. A collection<br />may have source documents in different forms: it is just a<br /></p><br /><p>matter of specifying all the necessary plugins. In order to<br />build browsing indexes from metadata, an analogous<br />scheme of "classifiers" is used: classifiers create indexes<br />of various kinds based on metadata. 
Source documents are<br />brought into the Greenstone system through a process<br />called importing, which uses the plugins and classifiers<br />specified in the collection configuration file.<br /> The international Unicode character set is used throughout,<br />so documents (and interfaces) can be written in any<br />language. Collections have so far been produced in<br />English, French, Spanish, German, Maori, Chinese, and<br />Arabic. The NZDL Web site provides numerous examples.<br />Collections can contain text, pictures, and even audio and<br />video clips; a text-only version of the interface is also<br />provided to accommodate visually impaired users.<br />Compression technology is used to ensure best use of<br />storage (Witten et al., 1999). Most non-textual material is<br />either linked to textual documents or accompanied by<br />textual descriptions (such as photo captions) to allow full-<br />text searching and browsing. However, the architecture<br />permits the implementation of plugins and classifiers even<br />for non-textual data.<br /> The system includes an "administrative" function whereby<br />specified users can examine the composition of all<br />collections, protect documents so that they can only be<br />accessed by registered users on presentation of a password,<br />and so on. Logs of user activity are kept that record all<br />queries made to every Greenstone collection (though this<br />facility can be disabled).<br /> Although primarily designed for Internet access over the<br />World-Wide Web, collections can be made available, in<br />precisely the same form, on CD-ROM. In either case they<br />are accessed through any Web browser. Greenstone CD-<br />ROMs operate on a standalone PC under Windows 3.X,<br />95, 98, and NT, and the interaction is identical to accessing<br />the collection on the Web, except that response is faster<br />and more predictable. 
The requirement to operate on early<br />Windows systems is one that plagues the software design,<br />but is crucial for many users, particularly those in<br />underdeveloped countries seeking access to humanitarian<br />aid collections. If the PC is connected to a network<br />(intranet or Internet), a custom-built Web server provided<br />on each CD makes exactly the same information available<br />to others through their standard Web browser. The use of<br />compression ensures that the greatest possible volume of<br />information can be packed on to a CD-ROM.<br /> The collection-serving software operates under Unix and<br />Windows NT, and works with standard Web servers. A<br />flexible process structure allows different collections to be<br />served by different computers, yet be presented to the user<br />in the same way, on the same Web page, as part of the<br />same digital library, even as part of the same collection<br />(McNab and Witten, 1998). Existing collections can be<br />updated and new ones brought on-line at any time, without<br />bringing the system down; the process responsible for the<br />user interface will notice (through periodic polling) when<br />new collections appear and add them to the list presented<br />to the user.<br /></p><br /><p>Figure 2: Browsing the HDL collection by subject</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>FINDING INFORMATION<br /></p><br /><p> Greenstone digital library systems generally include<br />several separate collections. 
A home page allows you to<br />select a collection; in addition, each collection has its own<br />"about" page that gives you information about how the<br />collection is organized and the principles governing what<br />is included.<br /> All icons in the screenshots of Figures 1-4 are clickable.<br />Those icons at the top of the page return to the home page,<br />provide help text, and allow you to set user interface and<br />searching preferences. The navigation bar underneath<br />gives access to the searching and browsing facilities,<br />which differ from one collection to another.<br /> Each of the five buttons provides a different way to find<br />information. You can search for particular words that<br />appear in the text from the "search" page (or from the<br />"about" page of Figure 1). This collection contains indexes<br />of chapters, section titles, and entire books. The default<br />search interface is a simple one, suitable for casual users;<br />advanced searching, which allows full Boolean<br />expressions, phrase searching, and case and stemming<br />control, can be enabled from the Preferences page.<br /> This collection has four browsable metadata indexes. You<br />can access publications by subject by clicking the subjects<br />button, which brings up a list of subjects, represented by<br />bookshelves (Figure 2). You can access publications by<br />title by clicking titles a-z (Figure 4), which brings up a list<br />of books in alphabetic order. You can access publications<br />by organization (i.e. Dublin Core "publisher"), bringing up<br />a list of organizations. You can access publications by<br />"how to" listing, yielding a list of hints defined by the<br />collection's editors. 
We use the Dublin Core as a base and<br />extend it in an ad hoc manner to accommodate the<br />individual requirements of collection designers.<br /></p><br /><p>FILES IN A COLLECTION<br /></p><br /><p> When a new collection is created or material is added to an<br />existing one, the original source documents are first<br />brought into the system through a process known as<br />"importing." This involves converting documents into a<br />simple HTML-like format known as GML (for<br />"Greenstone Markup Language"), which includes any<br />metadata associated with the document. Documents are<br />assumed to be in the Unicode UTF-8 code (of which the<br />ASCII characters form a subset).<br /></p><br /><p> Files and directories<br /></p><br /><p> There is a separate directory for each collection, which<br />contains five subdirectories: the original raw material<br />(import), the GML files created from this (archives), the<br />final collection as it is served to users (index), a directory<br />for use during the building process (building), and one for<br />any supporting files (etc), including the configuration file<br />that controls the collection creation procedure. Additional<br />files might be required: for example, building a hierarchy<br />of classifications requires a data file of sub-classifications.<br /></p><br /><p> The imported documents<br /></p><br /><p> In order to identify documents internally, a unique object<br />identifier or OID is assigned to each original source<br />document when it is imported (formed by hashing the<br />content, to overcome file duplication effects caused by<br />mirroring) and stored as metadata within that document. It<br />is important that OIDs persist throughout the index-<br />building process, so that a user's search history is<br />unaffected by rebuilding the collection. 
OIDs are assigned<br />by hashing the contents of the original source document.<br /> Once imported, each document is stored in its own<br />subdirectory of archives, along with any associated<br />files (for example, images). To ensure compatibility with<br />Windows 3.0, only eight characters are used in directory<br />and file names, which causes annoying but essentially<br />trivial complications.<br /></p><br /><p> Inside the documents<br /></p><br /><p> The GML format imposes a limited amount of structure on<br />documents. Documents are divided into paragraphs. They<br />can be split hierarchically into sections and subsections.<br />OIDs are extended to identify these components by<br />appending numbers, separated by periods, to a document's<br />OID. When a book is read, its section hierarchy is visible<br />as the table of contents (Figure 3). Chapters, sections,<br />subsections, and pages are all implemented simply as<br />"sections" within the document. In some collections<br />documents do not have a hierarchical subsection structure,<br />but are split into pages to permit browsing within a<br />retrieved document.<br /> The document structure is used for searchable indexes.<br />There are three levels of index: documents, sections, and<br /></p><br /><p>Figure 3: Reading a book in the HDL</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>paragraphs, corresponding to the distinctions that GML<br />makes; the hierarchical structure is flattened for the<br />purposes of creating these indexes. Indexes can be of text,<br />or metadata, or any combination. Thus you can create a<br />searchable index of section titles, and/or authors, and/or<br />document descriptions, as well as the document text.<br /></p><br /><p>UPDATING EXISTING COLLECTIONS<br /></p><br /><p> Updating an existing collection with new files in the same<br />format is easy. 
For example, the raw material for the HDL<br />is supplied in the form of HTML files marked up with<br />&lt;&lt;TOC&gt;&gt; tags to split books into sections and<br />subsections, and &lt;&lt;I&gt;&gt; tags to indicate where an image is<br />to be inserted. For each book in the library there is a<br />directory that contains a single HTML file representing the<br />book, and separate files containing the associated images.<br />An accompanying spreadsheet file contains the<br />classification hierarchy; this is converted to a simple file<br />format (using Excel's Save As command).<br /> Since the collection exists, its directory is already set up<br />with subdirectories import, archives, building, index, and<br />etc, and the etc directory will contain a suitable collection<br />configuration file.<br /></p><br /><p> The updating procedure<br /></p><br /><p> To update a collection, the new raw material is placed in<br />the import directory, in whatever form it is available. Then<br /></p><br /><p>the import process is invoked, which converts the files into<br />GML using the specified plugins. Old material for which<br />GML files have previously been created is not re-imported.<br />Then the build process is invoked to build the requisite<br />indexes for the collection. Finally, the contents of the<br />building directory are moved into the index directory, and<br />the new version of the collection automatically becomes<br />live.<br /> This procedure may seem cumbersome. But all the steps<br />are necessary for efficient operation with large collections.<br />The import process could be performed on the fly during<br />the building operation, but because building indexes is a<br />multipass operation, the often lengthy importing would be<br />repeated several times. The build process can take<br />considerable time: a day or two for very large<br />collections. 
Consequently, the results are placed in the<br />building directory so that, if the collection already exists, it<br />will continue to be served to users in its old form<br />throughout the building operation.<br /> Active users of the collection will not be disturbed when<br />the new version becomes live; they will probably not<br />even notice. The persistent OIDs ensure that interactions<br />remain coherent (users who are examining the results of a<br />query or browse operation will still retrieve the expected<br />documents), and if a search is actually in progress when<br />the change takes place the program detects the resulting<br />file-structure inconsistency and automatically and<br />transparently re-executes the query, this time on the new<br />version of the collection.<br /></p><br /><p> How it works<br /></p><br /><p> The original material in the import directory may be in any<br />format, and plugins are required to process each format<br />type. The plugins that a collection uses must be specified<br />in the collection configuration file. The import program<br />reads the list of plugins and passes each document to each<br />plugin in order until it finds one that can process it. When<br />updating an existing collection, all plugins necessary to<br />process new material should already have been specified in<br />the configuration file.<br /> The building step creates the indexes for both searching<br />and browsing. The MG software is generally used to do the<br />searching (Witten et al., 1999), and the mgbuild module is<br />automatically invoked to create each of the indexes that is<br />required. For example, the Humanity Development Library<br />has three indexes, one for entire books, one for chapters,<br />and one for section titles. 
Subdirectories of the index<br />directory are created for each of these indexes.<br /></p><br /><p>Figure 4: Browsing titles in the HDL</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p> MG also compresses the text of the collection, and the<br />image files are linked into the index subdirectory. Now<br />none of the material in the import and archives directories<br />is needed to run the collection and can be removed from<br />the file system (though they would be needed if the<br />collection were rebuilt).<br /> Associated with each collection is a database stored in<br />GDBM (GNU database manager) format. This contains an<br />entry for each document, giving its OID, its internal MG<br />document number, and metadata such as title. Information<br />for each of the browsing indexes, which appear as buttons<br />on the Greenstone search/browse bar, is also extracted<br />during the building process and stored in the database. A<br />"classifier" program is required for each browsing index to<br />extract the appropriate information from GML documents.<br />Like plugins, classifiers are written on an ad hoc basis for<br />the particular information required, and where possible<br />reused from one collection to another.<br /> The building program creates the indexes based on<br />whatever appears in the archives directory. The first plugin<br />specified by all collections is one that processes GML<br />files, and so if archives contains imported files they will be<br />processed correctly. If it contains material in the original<br />format, that will be converted using the appropriate plugin.<br />Thus the import process is optional.<br /> GML is designed to be fast and easy to parse, an important<br />requirement when millions of documents are to be<br />processed. Something as simple as requiring tags to be<br />lower-case, for example, yields a substantial speed-up. 
In<br /></p><br /><p>certain circumstances, however, it might be preferable to<br />use a standardized format such as XML. This is<br />straightforward to implement (just write an XML<br />plugin), although we have not done so ourselves. Given<br />the transitory nature of the imported data, to date, we have<br />found GML a satisfactory and beneficial format.<br /></p><br /><p>CREATING NEW COLLECTIONS<br /></p><br /><p> Building new collections from scratch is only slightly<br />different from updating an existing collection. The key<br />new requirement is creating a collection configuration file,<br />and a software utility is provided to help. Two pieces of<br />information are required for this: the name of the directory<br />that the collection will use (into which the source data and<br />other files will eventually be placed), and a contact e-mail<br />address for use if any problems are encountered by the<br />software once the collection is up and running. The utility<br />creates files and directories within the newly-named<br />directory to support a generic collection of plain text<br />documents. With suitable data placed in the import<br />directory, building the collection at this point will yield a<br />document-level searchable index of all the text and a<br />browsable list of "titles" (defined in this case to be the<br />document filenames).<br /> To enhance the functionality and presentation (something<br />anything but the most trivial collection will require), the<br />configuration file must be edited. 
For a collection sourced<br />from documents in an already supported data format,<br />presented in a similar fashion to an existing collection, the<br /></p><br /><p>(a)<br />1 creator [email protected]<br />2 maintainer [email protected]<br />3 public True<br />4<br />5 indexes document:text<br />6 defaultindex document:text<br />7 plugins GMLPlug TEXTPlug ArcPlug RecPlug<br />8<br />9 classify AZList metadata=Title<br />10<br />11 collectionmeta collectionname &quot;generic text collection&quot;<br />12 collectionmeta .document:text &quot;documents&quot;<br /></p><br /><p>(b)<br />1 creator [email protected]<br />2 maintainer [email protected]<br />3 public True<br />4<br />5 indexes document:text document:From<br />6 defaultindex document:text<br />7 plugins GMLPlug EMAILPlug ArcPlug RecPlug<br />8<br />9 classify AZList metadata=Title<br />10 classify DateList<br />11<br />12 collectionmeta collectionname &quot;Email messages&quot;<br />13 collectionmeta .document:text &quot;documents&quot;<br />14 collectionmeta .document:From &quot;email senders&quot;<br />15<br />16 format QueryResults \\<br />17 &lt;td&gt;[link][icon][/link]&lt;/td&gt;&lt;td&gt;[Title]&lt;/td&gt;&lt;td&gt;[Author]&lt;/td&gt;<br /></p><br /><p>Figure 5: Collection configuration files (a) generic, (b) for an email collection</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>amount of editing is minimal. Importing new data formats<br />and browsing metadata in ways not currently supported are<br />more complex activities that require programming skills.<br /></p><br /><p> Modifying the configuration file<br /></p><br /><p> Figure 5b shows simple alterations to the generic<br />configuration file in Figure 5a that was generated by the<br />new-collection utility. 
TEXTPlug is replaced with<br />EMAILPlug (line 7), which reads email files and extracts<br />metadata (From, To, Date, Subject) from them. A classifier<br />for dates is added (line 10) to make the collection<br />browsable chronologically. The default presentation of<br />search results is overridden (line 17) to display both the<br />title of the message (i.e. Dublin Core Title) and its sender<br />(i.e. Dublin Core Author). Elements in square brackets,<br />such as [Title], are replaced by the metadata associated<br />with a particular document. The built-in term [icon]<br />produces a suitable image that represents the document<br />(such as a book icon or page icon), and the [link]...[/link]<br />construct forms a hyperlink to the complete document.<br />Anything else in the format statement, which in this case is<br />solely table-cell tags in HTML, is passed through to the<br />page being displayed.<br />As this example shows, creating a new collection that stays<br />within the bounds of the library’s established capabilities<br />is within the reach of many computer users, for<br />instance computer-trained librarians. Extending<br />Greenstone to handle new document formats and browse<br />metadata in new ways is more challenging.<br /></p><br /><p> Writing new plugins and classifiers<br /></p><br /><p> Extensibility is obtained through plugins and classifiers.<br /></p><br /><p> These are modules of code that can be slotted into the<br />system to enhance its capabilities. Plugins parse<br />documents, extracting the text and metadata to be indexed.<br />Classifiers control how metadata is brought together to<br />form browsable data structures. Both are specified in an<br />object-oriented framework using inheritance to minimize<br />the amount of code written.<br /> A plugin must specify three things: what file formats it can<br />handle, how they should be parsed, and whether the plugin<br />is recursive. 
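</p><br /><p> These three responsibilities can be sketched as follows. The sketch is in Python purely for illustration: the class PluginSketch, its read method, the process helper and the name HTMLPlug are hypothetical and are not Greenstone’s actual plugin interface. It assumes, in line with the surrounding description, that a plugin declines a file by returning an undefined value (None here) and otherwise returns the number of documents it processed.<br /></p>

```python
import re


class PluginSketch:
    """Hypothetical plugin: a filename pattern, a parse step, a recursive flag."""

    def __init__(self, name, pattern, recursive=False):
        self.name = name
        self.pattern = re.compile(pattern)  # 1. which file formats it handles
        self.recursive = recursive          # 3. recorded only; traversal is not modelled here

    def read(self, filename, contents, library):
        """Return None to decline the file, else the number of documents added."""
        if not self.pattern.search(filename):
            return None
        # 2. "parsing" is reduced to storing the text plus a Title taken from the filename
        library.append({'plugin': self.name, 'Title': filename, 'text': contents})
        return 1


def process(filename, contents, plugins, library):
    """Offer the file to each plugin in configuration order; stop at the first taker."""
    for plugin in plugins:
        count = plugin.read(filename, contents, library)
        if count is not None:
            return count
    return 0  # no plugin accepted the file


plugins = [
    PluginSketch('HTMLPlug', r'\.(htm|html|HTM|HTML)$'),
    PluginSketch('TEXTPlug', r'\.txt$'),
]
library = []
html_count = process('index.html', 'some page text', plugins, library)
text_count = process('notes.txt', 'plain text', plugins, library)
skipped = process('photo.gif', '', plugins, library)
```

<p> The order of the plugins list plays the role of the plugins line in the configuration file: a file that the first plugin declines is simply offered to the next one.<br /></p><br /><p>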
File formats are normally determined using<br />regular expression matching on the filename. For example,<br />the HTML plugin accepts all files that end in .htm, .html,<br />.HTM, or .HTML. (It is quite possible, however, to write<br />plugins that “look inside” the file as well.) For other files,<br />the plugin returns undefined and the file is passed to the<br />next plugin in the collection’s configuration file (e.g.<br />Figure 5 line 7). If it can, the plugin parses the file and<br />returns the number of documents processed. This involves<br />extracting text and metadata and adding it to the library’s<br />content through calls to add text and add metadata.<br /> Some plugins (“recursive” ones) add extra files into the<br />stream of data processed during the building phase by<br />artificially reactivating the list of plugins. This is how<br />directory hierarchies are traversed.<br /> Plugins are small modules of code that are easy to write.<br />We monitored the time it took to develop a new one that<br />was different from any we had produced so far. As an<br />example, we chose a collection of HTML bookmark files,<br />the motivation being to produce a convenient way of<br />searching and browsing one’s bookmarked Web pages.<br />Figure 6 shows a user searching for bookmarked pages<br />about music. The new plugin took under an hour to write,<br />and was 160 lines long (ignoring blank lines and<br />comments), about the average length of existing plugins.<br /> Classifiers are more general than plugins because they<br />work on GML-format data. For example, any plugin that<br />generates date metadata in accordance with the Dublin<br />Core can request the collection to be browsable<br />chronologically by specifying the DateList classifier in the<br />collection’s configuration file (Figure 7). Classifiers are<br />more elaborate than most plugins, but new ones are seldom<br />required. 
The average length of existing classifiers is 230<br />lines.<br /> Classifiers must specify three things: an initialization<br />routine, how individual documents are classified, and the<br />final browsable data structure. Initialization takes care of<br />any options specified in the configuration file (such as<br />metadata=Title on line 9 of Figure 5b). Classifying<br />individual documents is an iterative process: for each one,<br />a call to document-classify is made. On presentation of the<br />document’s OID, the necessary metadata is located and<br />used to control where the document is added to the<br />browsable data structure being constructed.<br /> Once all documents have been added, a request is made for<br />the completed data structure. Some classifiers return the<br />data structure directly; others transform the data structure<br />before it is returned. For example, the AZList classifier<br /></p><br /><p>Figure 6: Searching bookmarked Web pages</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>divides the alphabetically sorted list of metadata into<br />separate pages of about the same size and returns the<br />alphabetic ranges for each one (Figure 4).<br /></p><br /><p>OVERVIEW OF RELATED WORK<br /></p><br /><p>Two projects that provide substantial open source digital<br />library software are Dienst (Lagoze and Fielding, 1998)<br />and Harvest (Bowman et al., 1994). The origins of Dienst<br />(www.cs.cornell.edu/cdlrg) stretch back to 1992. The term<br />has come to represent three entities: a conceptual<br />architecture for distributed digital libraries; an open<br />protocol for service communication; and a software<br />system that implements the protocol. 
To date, five sample<br />digital libraries have been built using this technology.<br />They manifest themselves in two forms: technical reports<br />and primary source documents.<br /></p><br /><p>Best known is NCSTRL, the Networked Computer<br />Science Technical Reference Library project<br />(www.ncstrl.org). This collection facilitates searching by<br />title, author and abstract, and browsing by year and author,<br />across a distributed network of document repositories.<br />Documents can (where supported) be delivered in various<br />formats such as PostScript, a thumbnail overview of the<br />pages, and a GIF image of a particular page.<br /></p><br /><p>The Making of America resource is an example of a<br />collection based around primary sources, in this case<br />American social history, 1830-1900. It has a different<br />“look and feel” to NCSTRL, being strongly oriented<br />toward browsing rather than searching. A user navigates<br />their way through a hierarchical structure of hyperlinks to<br />reach a book of interest. The book itself is a series of<br />scanned images: delivery options include going directly to<br /></p><br /><p>a page number, next and previous page buttons, and<br />displaying a particular page at different resolutions. A text<br />version of the page is also available, and a searching<br />option is provided on it.<br /></p><br /><p>Started in 1994, Harvest is also a long-running research<br />project. It provides an efficient means of gathering source<br />data from the Internet and distributing indexing<br />information over the Internet. This is accomplished<br />through five components: gatherer, broker, indexer,<br />replicator and cache. The first three are central to creating,<br />updating and searching a collection; the last two help to<br />improve performance over the Internet through transparent<br />mirroring and caching techniques.<br /></p><br /><p>The system is configurable and customizable. 
While<br />searching is most commonly implemented using Glimpse<br />(glimpse.cs.arizona.edu), in principle any search engine<br />that supports incremental updates and Boolean<br />combinations of attribute-based queries can be used. It is<br />possible to control what type of documents are gathered<br />during creation and updating, and how the query interface<br />looks and is laid out.<br /></p><br /><p>Sample collections cited by the developers include 21,000<br />computer science technical reports and 7,000 home pages.<br />Other examples include a sizable collection of agriculture-<br />related electronic journals and magazines called “tomato-<br />juice” (accessed through hegel.lib.ncsu.edu) and a full-text<br />index of library-related electronic serials<br />(sunsite.berkeley.edu/IndexMorganagus). Harvest is also<br />often used to index Web sites (for example<br />www.middlebury.edu).<br />Comparing Greenstone with Dienst and Harvest, there are<br />both similarities and differences. All provide substantial<br />digital library systems, hence common themes recur, but<br />they are driven by projects with different aims. Harvest,<br />for instance, was not conceived as a digital library project<br />at all, but by virtue of its selective document gathering<br />process it can be classed (and is used) as one. While it<br />provides sophisticated search options, it lacks the<br />complementary service of browsing. Furthermore, it adds<br />no structure or order to the documents collected, relying<br />on whatever structures are present in the site that they<br />were gathered from. A proven strength of the design is its<br />flexibility through configuration and customization, an<br />element also present in Greenstone.<br /></p><br /><p>Dienst, best exemplified through the NCSTRL<br />work, supports searching and browsing, like Greenstone.<br />Both use open protocols. 
Differences include a high<br />reliance in Dienst on user-supplied information when a<br />document is added, and a smaller range of document types<br />supported, although Dienst does include a document<br />model that should, over time, allow this to expand with<br />relative ease.<br /></p><br /><p>There are also commercial systems that provide similar<br />digital library services to those described. However, since<br /></p><br /><p>Figure 7: Browsing a newspaper collection by date</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>corporate culture instills proprietary attitudes, there is little<br />opportunity for advancement through a shared<br />collaborative effort. Consequently they are not reviewed<br />here.<br /></p><br /><p>CONCLUSIONS<br /></p><br /><p>Greenstone is a comprehensive software system for<br />creating digital library collections. It builds data structures<br />for searching and browsing from the material provided,<br />rather than relying on any hand-crafting. The process is<br />controlled by a configuration file, and once a collection<br />exists new material can be added completely<br />automatically. Browsing is based on Dublin Core<br />metadata.<br />New collections can be developed easily, particularly if<br />they resemble existing ones. Extensibility is achieved<br />through software “plugins” that can be written to<br />accommodate documents, and metadata, in different<br />formats. Standard plugins exist for many document types;<br />new ones are easily written. Browsing is controlled by<br />“classifiers” that process metadata into browsing structures<br />(by date, alphabetical, hierarchical, etc.).<br />However, the most powerful support for extensibility is<br />achieved not by technical means but by making the source<br />code freely available under the GNU General Public License. 
Only<br />through an international cooperative effort will digital<br />library software become sufficiently comprehensive to<br />meet the world’s needs with the richness and flexibility<br />that users deserve.<br /></p><br /><p>ACKNOWLEDGMENTS<br /></p><br /><p>We gratefully acknowledge all those who have worked on<br />the Greenstone software, and all members of the New<br />Zealand Digital Library project for their enthusiasm and<br />ideas.<br /></p><br /><p>REFERENCES<br />1. Akscyn, R.M. and Witten, I.H. (1998) “Report on First<br /></p><br /><p>Summit on International Cooperation on Digital<br />Libraries.” ks.com/idla-wp-oct98.<br /></p><br /><p>2. Bowman, C.M., Danzig, P.B., Manber, U., and<br />Schwartz, M.F. “Scalable Internet resource discovery:<br />Research problems and approaches.” Communications<br />of the ACM, Vol. 37, No. 8, pp. 98-107, 1994.<br /></p><br /><p>3. Fox, E. (1998) “Digital library definitions.”<br />ei.cs.vt.edu/~fox/dlib/def.html.<br /></p><br /><p>4. Humanity Libraries (1998) Humanity Development<br />Library. CD-ROM produced by the Global Help<br />Project, Antwerp, Belgium.<br /></p><br /><p>5. Lagoze, C. and Fielding, D. “Defining Collections in<br />Distributed Digital Libraries.” D-Lib Magazine, Nov.<br />1998.<br /></p><br /><p>6. PAHO (1999) Virtual Disaster Library. CD-ROM<br />produced by the Pan-American Health Organization,<br />Washington DC, USA.<br /></p><br /><p>7. McNab, R.J., Witten, I.H. and Boddie, S.J. (1998) “A<br />distributed digital library architecture incorporating<br />different index styles.” Proc IEEE Advances in Digital<br />Libraries, Santa Barbara, CA, pp. 36-45.<br /></p><br /><p>8. Nevill-Manning, C.G., Reed, T., and Witten, I.H.<br />(1998) “Extracting text from PostScript.”<br />Software: Practice and Experience, Vol. 28, No. 5, pp.<br />481-491; April.<br /></p><br /><p>9. UNESCO (1999) SAHEL point DOC: Anthologie du<br />développement au Sahel. 
CD-ROM produced by<br />UNESCO, Paris, France.<br /></p><br /><p>10. UNU (1998) Collection on critical global issues. CD-<br />ROM produced by the United Nations University<br />Press, Tokyo, Japan.<br /></p><br /><p>11. Witten, I.H., Moffat, A. and Bell, T. (1999) Managing<br />Gigabytes: compressing and indexing documents and<br />images, Morgan Kaufmann, second edition.</p><br /><br /></div></div><br /></Content>
35	</Section>
36	</Archive>
