[28047] | 1 | <?xml version="1.0" encoding="utf-8" standalone="no"?>
|
---|
| 2 | <!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
|
---|
| 3 | <Archive>
|
---|
| 4 | <Section>
|
---|
| 5 | <Description>
|
---|
| 6 | <Metadata name="gsdldoctype">indexed_doc</Metadata>
|
---|
| 7 | <Metadata name="Language">en</Metadata>
|
---|
| 8 | <Metadata name="Encoding">utf8</Metadata>
|
---|
| 9 | <Metadata name="Author">Bronwyn</Metadata>
|
---|
| 10 | <Metadata name="Title">Greenstone: A Comprehensive Open-Source Digital Library Software...</Metadata>
|
---|
[28812] | 11 | <Metadata name="URL">http://research/ak19/gs2-svn-22Aug2013/collect/Word-PDF-Basic/tmp/1391133804_1/pdf01.html</Metadata>
|
---|
| 12 | <Metadata name="UTF8URL">http://research/ak19/gs2-svn-22Aug2013/collect/Word-PDF-Basic/tmp/1391133804_1/pdf01.html</Metadata>
|
---|
[28047] | 13 | <Metadata name="gsdlsourcefilename">import/pdf01.pdf</Metadata>
|
---|
[28812] | 14 | <Metadata name="gsdlconvertedfilename">tmp/1391133804_1/pdf01.html</Metadata>
|
---|
[28047] | 15 | <Metadata name="OrigSource">pdf01.html</Metadata>
|
---|
| 16 | <Metadata name="Source">pdf01.pdf</Metadata>
|
---|
| 17 | <Metadata name="SourceFile">pdf01.pdf</Metadata>
|
---|
| 18 | <Metadata name="Plugin">PDFPlugin</Metadata>
|
---|
| 19 | <Metadata name="FileSize">269487</Metadata>
|
---|
| 20 | <Metadata name="FilenameRoot">pdf01</Metadata>
|
---|
| 21 | <Metadata name="FileFormat">PDF</Metadata>
|
---|
| 22 | <Metadata name="srcicon">_iconpdf_</Metadata>
|
---|
| 23 | <Metadata name="srclink_file">doc.pdf</Metadata>
|
---|
| 24 | <Metadata name="srclinkFile">doc.pdf</Metadata>
|
---|
| 25 | <Metadata name="NumPages">9</Metadata>
|
---|
| 26 | <Metadata name="dc.Creator">Ian H. Witten</Metadata>
|
---|
| 27 | <Metadata name="dc.Creator">Rodger J. McNab</Metadata>
|
---|
| 28 | <Metadata name="dc.Creator">Stefan J. Boddie</Metadata>
|
---|
| 29 | <Metadata name="dc.Creator">David Bainbridge</Metadata>
|
---|
| 30 | <Metadata name="dc.Title">Greenstone: A comprehensive open-source digital library software system</Metadata>
|
---|
| 31 | <Metadata name="ex.ExifTool.ExifToolVersion">8.57</Metadata>
|
---|
[28239] | 32 | <Metadata name="ex.File.Directory">/research/ak19/gs2-svn-22Aug2013/collect/Word-PDF-Basic/import</Metadata>
|
---|
[28812] | 33 | <Metadata name="ex.File.FileModifyDate">2014:01:31 14:56:44+13:00</Metadata>
|
---|
[28047] | 34 | <Metadata name="ex.File.FileName">pdf01.pdf</Metadata>
|
---|
| 35 | <Metadata name="ex.File.FilePermissions">644</Metadata>
|
---|
| 36 | <Metadata name="ex.File.FileSize">269487</Metadata>
|
---|
| 37 | <Metadata name="ex.File.FileType">PDF</Metadata>
|
---|
| 38 | <Metadata name="ex.File.MIMEType">application/pdf</Metadata>
|
---|
| 39 | <Metadata name="ex.PDF.Author">Bronwyn</Metadata>
|
---|
| 40 | <Metadata name="ex.PDF.CreateDate">2000:03:02 15:21:24</Metadata>
|
---|
| 41 | <Metadata name="ex.PDF.Creator">Microsoft Word</Metadata>
|
---|
| 42 | <Metadata name="ex.PDF.Linearized">false</Metadata>
|
---|
| 43 | <Metadata name="ex.PDF.PDFVersion">1.2</Metadata>
|
---|
| 44 | <Metadata name="ex.PDF.PageCount">9</Metadata>
|
---|
| 45 | <Metadata name="ex.PDF.Producer">Acrobat PDFWriter 4.0 for Power Macintosh</Metadata>
|
---|
| 46 | <Metadata name="Identifier">HASH1a9cea0f239f754007681b</Metadata>
|
---|
[28812] | 47 | <Metadata name="lastmodified">1391133404</Metadata>
|
---|
[28811] | 48 | <Metadata name="lastmodifieddate">20140131</Metadata>
|
---|
[28812] | 49 | <Metadata name="oailastmodified">1391133804</Metadata>
|
---|
[28811] | 50 | <Metadata name="oailastmodifieddate">20140131</Metadata>
|
---|
[28047] | 51 | <Metadata name="assocfilepath">HASH1a9c.dir</Metadata>
|
---|
| 52 | <Metadata name="gsdlassocfile">pdf01-2_1.jpg:image/jpeg:</Metadata>
|
---|
| 53 | <Metadata name="gsdlassocfile">pdf01-3_1.jpg:image/jpeg:</Metadata>
|
---|
| 54 | <Metadata name="gsdlassocfile">pdf01-4_1.jpg:image/jpeg:</Metadata>
|
---|
| 55 | <Metadata name="gsdlassocfile">pdf01-5_1.jpg:image/jpeg:</Metadata>
|
---|
| 56 | <Metadata name="gsdlassocfile">pdf01-7_1.jpg:image/jpeg:</Metadata>
|
---|
| 57 | <Metadata name="gsdlassocfile">pdf01-8_1.jpg:image/jpeg:</Metadata>
|
---|
| 58 | <Metadata name="gsdlassocfile">doc.pdf:application/pdf:</Metadata>
|
---|
| 59 | </Description>
|
---|
| 60 | <Content>
|
---|
| 61 | <A name=1></a><b>Greenstone: A Comprehensive Open-Source</b><br>
|
---|
| 62 | <b>Digital Library Software System</b><br>
|
---|
| 63 | <i>Ian H. Witten,* Rodger J. McNab,† Stefan J. Boddie,* David Bainbridge*</i><br>
|
---|
| 64 | * Dept of Computer Science<br>
|
---|
| 65 | † Digilib Systems<br>
|
---|
| 66 | University of Waikato, New Zealand<br>
|
---|
| 67 | Hamilton, New Zealand<br>
|
---|
| 68 | E-mail: {ihw, sjboddie, davidb}@cs.waikato.ac.nz<br>
|
---|
| 69 | E-mail: [email protected]<br>
|
---|
| 70 | <b>ABSTRACT</b><br>
|
---|
| 71 | multilingual information retrieval to distributed computing<br>protocols, from interoperability to search engine<br>
|
---|
| 72 | This paper describes the Greenstone digital library<br>
|
---|
| 73 | technology, from metadata standards to multiformat<br>
|
---|
| 74 | software, a comprehensive, open-source system for the<br>
|
---|
| 75 | document parsing, from multimedia to multiple operating<br>
|
---|
| 76 | construction and presentation of information collections.<br>
|
---|
| 77 | systems, from Web browsers to plug-and-play DVDs.<br>
|
---|
| 78 | Collections built with Greenstone offer effective full-text<br>searching and metadata-based browsing facilities that are<br>
|
---|
| 79 | The Greenstone Digital Library Software from the New<br>
|
---|
| 80 | attractive and easy to use. Moreover, they are easily<br>
|
---|
| 81 | Zealand Digital Library (NZDL) project tackles this issue<br>
|
---|
| 82 | maintainable and can be augmented and rebuilt entirely<br>
|
---|
| 83 | by providing a new way of organizing information and<br>
|
---|
| 84 | automatically. The system is extensible: software<br>
|
---|
| 85 | making it available over the Internet. A <i>collection</i> of<br>
|
---|
| 86 | “plugins” accommodate different document and metadata<br>
|
---|
| 87 | information comprises several (typically several thousand,<br>
|
---|
| 88 | types.<br>
|
---|
| 89 | or several million) <i>documents</i>, and a uniform interface is<br>provided to all documents in a collection. A library may<br>
|
---|
| 90 | <b>INTRODUCTION</b><br>
|
---|
| 91 | include many different collections, each organized<br>differently, though there is a strong family resemblance in<br>
|
---|
| 92 | Notwithstanding intense research activity in the digital<br>
|
---|
| 93 | how collections are presented.<br>
|
---|
| 94 | library field during the second half of the 1990s,<br>comprehensive software systems for creating digital<br>
|
---|
| 95 | Making information available using this system is far more<br>
|
---|
| 96 | libraries are not widely available. In fact, the usual solution<br>
|
---|
| 97 | than “just putting it on the Web.” The collection becomes<br>
|
---|
| 98 | when creating a digital library is also the most<br>
|
---|
| 99 | maintainable, searchable, and browsable. Each collection,<br>
|
---|
| 100 | obvious: just put it on the Web. But consider how much<br>
|
---|
| 101 | prior to presentation, undergoes a “building” process that,<br>
|
---|
| 102 | effort is involved in constructing a Web site for a digital<br>
|
---|
| 103 | once established, is completely automatic. This process<br>
|
---|
| 104 | library. To be effective it needs to be visually attractive<br>
|
---|
| 105 | creates all the structures that are used at run-time for<br>
|
---|
| 106 | and ergonomically easy to use, incorporate convenient and<br>
|
---|
| 107 | accessing the collection. Searching is based on various<br>
|
---|
| 108 | powerful searching capabilities, and offer rich and natural<br>
|
---|
| 109 | indexes, while browsing is based on various metadata;<br>
|
---|
| 110 | browsing facilities. Above all it must be easy to maintain<br>
|
---|
| 111 | support structures for both are created during the building<br>
|
---|
| 112 | and augment, which presents a significant challenge if any<br>
|
---|
| 113 | operation. When new material appears it can be fully<br>
|
---|
| 114 | manual organization is involved.<br>
|
---|
| 115 | incorporated into the collection by rebuilding.<br>
|
---|
| 116 | The alternative is to automate these activities through<br>
|
---|
| 117 | To address the exceptionally broad demands of digital<br>
|
---|
| 118 | software tools. But the broad scope of digital library<br>
|
---|
| 119 | libraries, the system is public and extensible. It is issued<br>
|
---|
| 120 | requirements makes this a daunting prospect. Ideally the<br>
|
---|
| 121 | under the GNU General Public License and, in the spirit of open-<br>
|
---|
| 122 | software should incorporate facilities ranging from<br>
|
---|
| 123 | source software, users are invited to contribute<br>modifications and enhancements. Only through an<br>international cooperative effort will digital library software<br>become sufficiently comprehensive to meet the world’s<br>needs. Currently the Greenstone software is used at sites in<br>Canada, Germany, New Zealand, Romania, UK, and the<br>US, and collections range from newspaper articles to<br>technical documents, from educational journals to oral<br>history, from visual art to folksongs. The software has<br>been used for collections in many different languages, and<br>for CD-ROMs that have been published by the United<br>Nations and other humanitarian agencies in Belgium,<br>France, Japan, and the US for distribution in developing<br>countries (Humanity Libraries, 1998; PAHO, 1999;<br>UNESCO, 1999; UNU, 1998). Further details can be<br>obtained from <i>www.nzdl.org</i>.<br>
|
---|
| 124 | <hr>
|
---|
| 125 | <A name=2></a><IMG src="_httpdocimg_/pdf01-2_1.jpg"><br>
|
---|
| 126 | become a first-class component of the library. And what<br>permits it to be integrated into existing searching and<br>browsing structures without any manual intervention is<br><i>metadata</i>. This provides sufficient focus to the concept of<br>“digital library” to support the development of a<br>construction kit.<br>
|
---|
| 127 | <b>OVERVIEW OF GREENSTONE</b><br>
|
---|
| 128 | <br>Information collections built by Greenstone combine<br>extensive full-text search facilities with browsing indexes<br>based on different metadata types. There are several ways<br>for users to find information, although they differ between<br>collections depending on the metadata available and the<br>collection design. Typically you can <i>search for particular<br>words</i> that appear in the text, or within a section of a<br>document, or within a title or section heading. You can<br><i>browse documents by title</i>: just click on the displayed book<br>icon to read it. You can <i>browse documents by subject</i>.<br>Subjects are represented by bookshelves: just click on a<br>shelf to see the books. Where appropriate, documents<br>
|
---|
| 129 | <b>Figure 1: Searching the HDL collection</b><br>
|
---|
| 130 | come complete with a table of contents (constructed<br>automatically): you can click on a chapter or subsection to<br>
|
---|
| 131 | This paper sets the scene with a brief discussion of what a<br>
|
---|
| 132 | open it, expand the full table of contents, or expand the full<br>
|
---|
| 133 | digital library is. We then give an overview of the facilities<br>
|
---|
| 134 | document.<br>
|
---|
| 135 | offered by Greenstone and show how end users find<br>information in collections. Next we describe the files and<br>
|
---|
| 136 | <br>An example of searching is shown in Figure 1 where<br>
|
---|
| 137 | directories involved in a collection, and then discuss the<br>
|
---|
| 138 | documents in the Global Help Project’s Humanity<br>
|
---|
| 139 | processes of updating existing collections and creating new<br>
|
---|
| 140 | Development Library (HDL) are being searched for<br>
|
---|
| 141 | ones, including extending the software to provide new<br>
|
---|
| 142 | chapters matching the word <i>butterfly</i>. In Figure 2 the same<br>
|
---|
| 143 | facilities. We conclude with an overview of related work.<br>
|
---|
| 144 | collection is being browsed by subject: by clicking on the<br>bookshelf icons the user has discovered an item under<br>
|
---|
| 145 | <b>WHAT IS A DIGITAL LIBRARY?</b><br>
|
---|
| 146 | Section 16, Animal Husbandry. Pursuing an interest in<br>butterfly farming, the user selects a book by clicking on its<br>
|
---|
| 147 | <br>Ten definitions of the term “digital library” have been<br>
|
---|
| 148 | book icon. In Figure 3 the front cover of the book is<br>
|
---|
| 149 | culled from the literature by Fox (1998), and their spirit is<br>
|
---|
| 150 | displayed as a graphic on the left, and the automatically<br>
|
---|
| 151 | captured in the following brief characterization:<br>
|
---|
| 152 | constructed table of contents appears at the start of the<br>
|
---|
| 153 | <br>
|
---|
| 154 | document. The current focus, <i>Introduction and Summary</i>,<br>
|
---|
| 155 | <i>A collection of digital objects, including text,</i><br>
|
---|
| 156 | is shown in bold in the table of contents with its text<br>
|
---|
| 157 | <i>video, and audio, along with methods for access</i><br>
|
---|
| 158 | starting further down the page.<br>
|
---|
| 159 | <i>and retrieval, and for selection, organization<br>and maintenance of the collection</i><br>
|
---|
| 160 | <br>In accordance with Lesk’s advice, a statement of purpose<br>
|
---|
| 161 | <br>
|
---|
| 162 | and coverage accompanies each collection, along with an<br>
|
---|
| 163 | (Akscyn and Witten, 1998). Lesk (1998) views digital<br>
|
---|
| 164 | explanation of how it is organized (Figure 1 shows the<br>
|
---|
| 165 | libraries as “organized collections of digital information,”<br>
|
---|
| 166 | start of this). A distinction is made between <i>searching</i> and<br>
|
---|
| 167 | and wisely recommends that they articulate the principles<br>
|
---|
| 168 | <i>browsing</i>. Searching is full-text, and, depending on the<br>
|
---|
| 169 | governing what is included and how the collection is<br>
|
---|
| 170 | collection’s design, the user can choose between indexes<br>
|
---|
| 171 | organized.<br>
|
---|
| 172 | built from different parts of the documents, or from<br>
|
---|
| 173 | <br>Digital libraries are generally distinguished from the<br>
|
---|
| 174 | different metadata. Some collections have an index of full<br>
|
---|
| 175 | World-Wide Web, the essential difference being in<br>
|
---|
| 176 | documents, an index of sections, an index of paragraphs,<br>
|
---|
| 177 | selection and organization. But they are not generally<br>
|
---|
| 178 | an index of titles, and an index of section headings, each of<br>
|
---|
| 179 | distinguished from a web <i>site</i>: indeed, virtually all extant<br>
|
---|
| 180 | which can be searched for particular words or phrases.<br>
|
---|
| 181 | digital libraries manifest themselves as a web site. Hence<br>
|
---|
| 182 | Browsing involves data structures created from metadata<br>
|
---|
| 183 | the obvious question: to make a digital library, why not<br>
|
---|
| 184 | that the user can examine: lists of authors, lists of titles,<br>
|
---|
| 185 | just put the information on the Web?<br>
|
---|
| 186 | lists of dates, hierarchical classification structures, and so<br>
|
---|
| 187 | <br>
|
---|
| 188 | on. Data structures for both browsing and searching are<br>
|
---|
| 189 | But we make a distinction between a digital library and a<br>
|
---|
| 190 | built according to instructions in a configuration file,<br>
|
---|
| 191 | web site that lies at the heart of our software design: one<br>
|
---|
| 192 | which controls both building and serving the collection.<br>
|
---|
| 193 | should easily be able to add new material to a library<br>
|
---|
| 194 | Sample configuration files are discussed below.<br>
|
---|
| 195 | without having to integrate it manually or edit its content<br>in any way. Once added, new material should immediately<br>
|
---|
| 196 | <hr>
|
---|
| 197 | <A name=3></a><IMG src="_httpdocimg_/pdf01-3_1.jpg"><br>
|
---|
| 198 | matter of specifying all the necessary plugins. In order to<br>build browsing indexes from metadata, an analogous<br>scheme of “classifiers” is used: classifiers create indexes<br>of various kinds based on metadata. Source documents are<br>brought into the Greenstone system through a process<br>called <i>importing</i>, which uses the plugins and classifiers<br>specified in the collection configuration file.<br>
|
---|
| 199 | <br>The international Unicode character set is used throughout,<br>so documents (and interfaces) can be written in any<br>language. Collections have so far been produced in<br>English, French, Spanish, German, Maori, Chinese, and<br>Arabic. The NZDL Web site provides numerous examples.<br>Collections can contain text, pictures, and even audio and<br>video clips; a text-only version of the interface is also<br>provided to accommodate visually impaired users.<br>Compression technology is used to ensure best use of<br>storage (Witten <i>et al.</i>, 1999). Most non-textual material is<br>either linked to textual documents or accompanied by<br>textual descriptions (such as photo captions) to allow full-<br>text searching and browsing. However, the architecture<br>
|
---|
| 200 | <b>Figure 2: Browsing the HDL collection by subject</b><br>
|
---|
| 201 | permits the implementation of plugins and classifiers even<br>for non-textual data.<br>
|
---|
| 202 | <br>Rich browsing facilities can be provided by manually<br>
|
---|
| 203 | <br>
|
---|
| 204 | linking parts of documents together and building explicit<br>
|
---|
| 205 | The system includes an “administrative” function whereby<br>
|
---|
| 206 | indexes and tables of contents. However, manually-created<br>
|
---|
| 207 | specified users can examine the composition of all<br>
|
---|
| 208 | linking becomes difficult to maintain, and often falls into<br>
|
---|
| 209 | collections, protect documents so that they can only be<br>
|
---|
| 210 | disrepair when a collection expands. The Greenstone<br>
|
---|
| 211 | accessed by registered users on presentation of a password,<br>
|
---|
| 212 | software takes a different tack: it facilitates <i>maintainability</i><br>
|
---|
| 213 | and so on. Logs of user activity are kept that record all<br>
|
---|
| 214 | by creating all searching and browsing structures<br>
|
---|
| 215 | queries made to every Greenstone collection (though this<br>
|
---|
| 216 | automatically from the documents themselves. No links<br>
|
---|
| 217 | facility can be disabled).<br>
|
---|
| 218 | are inserted by hand. This means that when new<br>
|
---|
| 219 | <br>Although primarily designed for Internet access over the<br>
|
---|
| 220 | documents in the same format become available, they can<br>
|
---|
| 221 | World-Wide Web, collections can be made available, in<br>
|
---|
| 222 | be added automatically. Indeed, for some collections this is<br>
|
---|
| 223 | precisely the same form, on CD-ROM. In either case they<br>
|
---|
| 224 | done by processes that wake up regularly, scout for new<br>
|
---|
| 225 | are accessed through any Web browser. Greenstone CD-<br>
|
---|
| 226 | material, and rebuild the indexes, all without manual<br>
|
---|
| 227 | ROMs operate on a standalone PC under Windows 3.X,<br>
|
---|
| 228 | intervention.<br>
|
---|
| 229 | 95, 98, and NT, and the interaction is identical to accessing<br>
|
---|
| 230 | Collections comprise many documents: thousands, tens of<br>
|
---|
| 231 | the collection on the Web, except that response is faster<br>
|
---|
| 232 | thousands, or even millions. Each document may be<br>
|
---|
| 233 | and more predictable. The requirement to operate on early<br>
|
---|
| 234 | hierarchically organized into <i>sections</i> (subsections, sub-<br>
|
---|
| 235 | Windows systems is one that plagues the software design,<br>
|
---|
| 236 | subsections, and so on). Each section comprises one or<br>
|
---|
| 237 | but is crucial for many users, particularly those in<br>
|
---|
| 238 | more <i>paragraphs</i>. Metadata such as author, title, date,<br>
|
---|
| 239 | underdeveloped countries seeking access to humanitarian<br>
|
---|
| 240 | keywords, and so on, may be associated with documents,<br>
|
---|
| 241 | aid collections. If the PC is connected to a network<br>
|
---|
| 242 | or with individual sections of documents. This is the raw<br>
|
---|
| 243 | (intranet or Internet), a custom-built Web server provided<br>
|
---|
| 244 | material for indexes. It must either be provided explicitly<br>
|
---|
| 245 | on each CD makes exactly the same information available<br>
|
---|
| 246 | for each document and section (for example, in an<br>
|
---|
| 247 | to others through their standard Web browser. The use of<br>
|
---|
| 248 | accompanying spreadsheet) or be derivable automatically<br>
|
---|
| 249 | compression ensures that the greatest possible volume of<br>
|
---|
| 250 | from the source documents. Metadata is converted to<br>
|
---|
| 251 | information can be packed on to a CD-ROM.<br>
|
---|
| 252 | Dublin Core and stored with the document for internal use.<br>
|
---|
| 253 | <br>The collection-serving software operates under Unix and<br>
|
---|
| 254 | <br>In order to accommodate different kinds of source<br>
|
---|
| 255 | Windows NT, and works with standard Web servers. A<br>
|
---|
| 256 | documents, the software is organized so that “plugins” can<br>
|
---|
| 257 | flexible process structure allows different collections to be<br>
|
---|
| 258 | be written for new document types. Plugins exist for plain<br>
|
---|
| 259 | served by different computers, yet be presented to the user<br>
|
---|
| 260 | text documents, HTML documents, email documents, and<br>
|
---|
| 261 | in the same way, on the same Web page, as part of the<br>
|
---|
| 262 | bibliographic formats. Word documents are handled by<br>
|
---|
| 263 | same digital library, even as part of the same collection<br>
|
---|
| 264 | saving them as HTML; PostScript ones by applying a<br>
|
---|
| 265 | (McNab and Witten, 1998). Existing collections can be<br>
|
---|
| 266 | preprocessor (Nevill-Manning <i>et al</i>., 1998). Specially<br>
|
---|
| 267 | updated and new ones brought on-line at any time, without<br>
|
---|
| 268 | written plugins also exist for proprietary formats such as<br>
|
---|
| 269 | bringing the system down; the process responsible for the<br>
|
---|
| 270 | that used by the BBC archives department. A collection<br>
|
---|
| 271 | user interface will notice (through periodic polling) when<br>
|
---|
| 272 | may have source documents in different forms: it is just a<br>
|
---|
| 273 | new collections appear and add them to the list presented<br>to the user.<br>
|
---|
| 274 | <hr>
|
---|
| 275 | <A name=4></a><IMG src="_httpdocimg_/pdf01-4_1.jpg"><br>
|
---|
| 276 | <b>FILES IN A COLLECTION</b><br>
|
---|
| 277 | <br>When a new collection is created or material is added to an<br>existing one, the original source documents are first<br>brought into the system through a process known as<br>“importing.” This involves converting documents into a<br>simple HTML-like format known as GML (for<br>“Greenstone Markup Language”), which includes any<br>metadata associated with the document. Documents are<br>assumed to be in the Unicode UTF-8 code (of which the<br>ASCII characters form a subset).<br>
|
---|
| 278 | <br><b>Files and directories</b><br>
|
---|
| 279 | <br>There is a separate directory for each collection, which<br>contains five subdirectories: the original raw material<br>(<i>import</i>), the GML files created from this (<i>archives</i>), the<br>final collection as it is served to users (<i>index</i>), a directory<br>for use during the building process (<i>building</i>), and one for<br>any supporting files (<i>etc</i>), including the configuration file<br>
|
---|
| 280 | <b>Figure 3: Reading a book in the HDL</b><br>
|
---|
| 281 | that controls the collection creation procedure. Additional<br>files might be required: for example, building a hierarchy<br>of classifications requires a data file of sub-classifications.<br>
|
---|
| 282 | <b>FINDING INFORMATION</b><br>
|
---|
| 283 | <br>Greenstone digital library systems generally include<br>
|
---|
| 284 | <br>
|
---|
| 285 | several separate collections. A home page allows you to<br>
|
---|
| 286 | <b>The imported documents</b><br>
|
---|
| 287 | select a collection; in addition, each collection has its own<br>
|
---|
| 288 | <br>In order to identify documents internally, a unique object<br>
|
---|
| 289 | “about” page that gives you information about how the<br>
|
---|
| 290 | identifier or OID is assigned to each original source<br>
|
---|
| 291 | collection is organized and the principles governing what<br>
|
---|
| 292 | document when it is imported (formed by hashing the<br>
|
---|
| 293 | is included.<br>
|
---|
| 294 | content, to overcome file duplication effects caused by<br>
|
---|
| 295 | <br>All icons in the screenshots of Figures 1–4 are clickable.<br>
|
---|
| 296 | mirroring) and stored as metadata within that document. It<br>
|
---|
| 297 | Those icons at the top of the page return to the home page,<br>
|
---|
| 298 | is important that OIDs persist throughout the index-<br>
|
---|
| 299 | provide help text, and allow you to set user interface and<br>
|
---|
| 300 | building process, so that a user’s search history is<br>
|
---|
| 301 | searching preferences. The navigation bar underneath<br>
|
---|
| 302 | unaffected by rebuilding the collection. OIDs are assigned<br>
|
---|
| 303 | gives access to the searching and browsing facilities,<br>
|
---|
| 304 | by hashing the contents of the original source document.<br>
|
---|
| 305 | which differ from one collection to another.<br>
|
---|
| 306 | <br>Once imported, each document is stored in its own<br>
|
---|
| 307 | <br>Each of the five buttons provides a different way to find<br>
|
---|
| 308 | subdirectory of <i>archives</i>, along with any associated<br>
|
---|
| 309 | information. You can <i>search for particular words</i> that<br>
|
---|
| 310 | files (for example, images). To ensure compatibility with<br>
|
---|
| 311 | appear in the text from the “search” page (or from the<br>
|
---|
| 312 | Windows 3.0, only eight characters are used in directory<br>
|
---|
| 313 | “about” page of Figure 1). This collection contains indexes<br>
|
---|
| 314 | and file names, which causes annoying but essentially<br>
|
---|
| 315 | of chapters, section titles, and entire books. The default<br>
|
---|
| 316 | trivial complications.<br>
|
---|
| 317 | search interface is a simple one, suitable for casual users;<br>advanced searching, which allows full Boolean<br>
|
---|
| 318 | <br><b>Inside the documents</b><br>
|
---|
| 319 | expressions, phrase searching, case and stemming<br>control, can be enabled from the <i>Preferences</i> page.<br>
|
---|
| 320 | <br>The GML format imposes a limited amount of structure on<br>
|
---|
| 321 | <br>
|
---|
| 322 | documents. Documents are divided into paragraphs. They<br>
|
---|
| 323 | This collection has four browsable metadata indexes. You<br>
|
---|
| 324 | can be split hierarchically into sections and subsections.<br>
|
---|
| 325 | can <i>access publications by subject</i> by clicking the <i>subjects</i><br>
|
---|
| 326 | OIDs are extended to identify these components by<br>
|
---|
| 327 | button, which brings up a list of subjects, represented by<br>
|
---|
| 328 | appending numbers, separated by periods, to a document’s<br>
|
---|
| 329 | bookshelves (Figure 2). You can <i>access publications by</i><br>
|
---|
| 330 | OID. When a book is read, its section hierarchy is visible<br>
|
---|
| 331 | <i>title</i> by clicking <i>titles a-z</i> (Figure 4), which brings up a list<br>
|
---|
| 332 | as the table of contents (Figure 3). Chapters, sections,<br>
|
---|
| 333 | of books in alphabetic order. You can <i>access publications</i><br>
|
---|
| 334 | subsections, and pages are all implemented simply as<br>
|
---|
| 335 | <i>by organization</i> (i.e. Dublin Core “publisher”), bringing up<br>
|
---|
| 336 | “sections” within the document. In some collections<br>
|
---|
| 337 | a list of organizations. You can <i>access publications by</i><br>
|
---|
| 338 | documents do not have a hierarchical subsection structure,<br>
|
---|
| 339 | <i>“how to” listing</i>, yielding a list of hints defined by the<br>
|
---|
| 340 | but are split into pages to permit browsing within a<br>
|
---|
| 341 | collection’s editors. We use the Dublin Core as a base and<br>
|
---|
| 342 | retrieved document.<br>
|
---|
| 343 | extend it in an <i>ad hoc</i> manner to accommodate the<br>individual requirements of collection designers.<br>
|
---|
| 344 | <br>The document structure is used for searchable indexes.<br>There are three levels of index: <i>documents</i>, <i>sections</i>, and<br>
|
---|
| 345 | <hr>
|
---|
| 346 | <A name=5></a><IMG src="_httpdocimg_/pdf01-5_1.jpg"><br>
|
---|
| 347 | the <i>import</i> process is invoked, which converts the files into<br>GML using the specified plugins. Old material for which<br>GML files have previously been created is not re-imported.<br>Then the <i>build</i> process is invoked to build the requisite<br>indexes for the collection. Finally, the contents of the<br><i>building</i> directory are moved into the <i>index</i> directory, and<br>the new version of the collection automatically becomes<br>live.<br>
|
---|
| 348 | <br>This procedure may seem cumbersome. But all the steps<br>are necessary for efficient operation with large collections.<br>The <i>import</i> process could be performed on the fly during<br>the building operation, but because building indexes is a<br>multipass operation, the often lengthy importing would be<br>repeated several times. The <i>build</i> process can take<br>considerable time (a day or two, for very large<br>collections). Consequently, the results are placed in the<br><i>building</i> directory so that, if the collection already exists, it<br>will continue to be served to users in its old form<br>throughout the building operation.<br>
|
---|
| 349 | <br>Active users of the collection will not be disturbed when<br>the new version becomes live; they will probably not<br>
|
---|
| 350 | <b>Figure 4: Browsing titles in the HDL</b><br>
|
---|
| 351 | even notice. The persistent OIDs ensure that interactions<br>remain coherent: users who are examining the results of a<br>query or browse operation will still retrieve the expected<br>
|
---|
| 352 | <i>paragraphs</i>, corresponding to the distinctions that GML<br>
|
---|
| 353 | documents, and if a search is actually in progress when<br>
|
---|
| 354 | makes; the hierarchical structure is flattened for the<br>
|
---|
| 355 | the change takes place the program detects the resulting<br>
|
---|
| 356 | purposes of creating these indexes. Indexes can be of text,<br>
|
---|
| 357 | file-structure inconsistency and automatically and<br>
|
---|
| 358 | or metadata, or any combination. Thus you can create a<br>
|
---|
| 359 | transparently re-executes the query, this time on the new<br>
|
---|
| 360 | searchable index of section titles, and/or authors, and/or<br>
|
---|
| 361 | version of the collection.<br>
|
---|
| 362 | document descriptions, as well as the document text.<br>
|
---|
| 363 | <b>UPDATING EXISTING COLLECTIONS</b><br>
|
---|
| 364 | <br><b>How it works</b><br>
|
---|
| 365 | <br>Updating an existing collection with new files in the same<br>
|
---|
| 366 | <br>The original material in the <i>import</i> directory may be in any<br>
|
---|
| 367 | format is easy. For example, the raw material for the HDL<br>
|
---|
| 368 | format, and plugins are required to process each format<br>
|
---|
| 369 | is supplied in the form of HTML files marked up with<br>
|
---|
| 370 | type. The plugins that a collection uses must be specified<br>
|
---|
| 371 | &lt;&lt;TOC&gt;&gt; tags to split books into sections and<br>
|
---|
| 372 | in the collection configuration file. The <i>import</i> program<br>
|
---|
| 373 | subsections, and &lt;&lt;I&gt;&gt; tags to indicate where an image is<br>
|
---|
| 374 | reads the list of plugins and passes each document to each<br>
|
---|
| 375 | to be inserted. For each book in the library there is a<br>
|
---|
| 376 | plugin in order until it finds one that can process it. When<br>
|
---|
| 377 | directory that contains a single HTML file representing the<br>
|
---|
| 378 | updating an existing collection, all plugins necessary to<br>
|
---|
| 379 | book, and separate files containing the associated images.<br>
|
---|
| 380 | process new material should already have been specified in<br>
|
---|
| 381 | An accompanying spreadsheet file contains the<br>
|
---|
| 382 | the configuration file.<br>
|
---|
| 383 | classification hierarchy; this is converted to a simple file<br>format (using Excelâs <i>Save As</i> command).<br>
|
---|
| 384 | <br>The building step creates the indexes for both searching<br>and browsing. The MG software is generally used to do the<br>
|
---|
| 385 | <br>Since the collection exists, its directory is already set up<br>
|
---|
| 386 | searching (Witten <i>et al.</i>, 1999), and the <i>mgbuild</i> module is<br>
|
---|
| 387 | with subdirectories <i>import</i>, <i>archives</i>, <i>building</i>, <i>index</i>, and<br>
|
---|
| 388 | automatically invoked to create each of the indexes that is<br>
|
---|
| 389 | <i>etc</i>, and the <i>etc</i> directory will contain a suitable collection<br>
|
---|
| 390 | required. For example, the Humanity Development Library<br>
|
---|
| 391 | configuration file.<br>
|
---|
| 392 | has three indexes, one for entire books, one for chapters,<br>and one for section titles. Subdirectories of the <i>index</i><br>
|
---|
| 393 | <br>
|
---|
| 394 | directory are created for each of these indexes.<br>
|
---|
| 395 | <b>The updating procedure</b><br>
|
---|
| 396 | <br>To update a collection, the new raw material is placed in<br>the <i>import</i> directory, in whatever form it is available. Then<br>
|
---|
| 397 | <hr>
|
---|
| 398 | <A name=6></a>(a)<br>1 creator [email protected]<br>2 maintainer [email protected]<br>3 public True<br>4<br>5 indexes document:text<br>6 defaultindex document:text<br>7 plugins GMLPlug TEXTPlug ArcPlug RecPlug<br>8<br>9 classify AZList metadata=Title<br>10<br>11 collectionmeta collectionname &quot;generic text collection&quot;<br>12 collectionmeta .document:text &quot;documents&quot;<br>
|
---|
| 426 | (b)<br>1 creator [email protected]<br>2 maintainer [email protected]<br>3 public True<br>4<br>5 indexes document:text document:From<br>6 defaultindex document:text<br>7 plugins GMLPlug EMAILPlug ArcPlug RecPlug<br>8<br>9 classify AZList metadata=Title<br>10 classify DateList<br>11<br>12 collectionmeta collectionname &quot;Email messages&quot;<br>13 collectionmeta .document:text &quot;documents&quot;<br>14 collectionmeta .document:From &quot;email senders&quot;<br>15<br>16 format QueryResults \\\\<br>17 &lt;td&gt;[link][icon][/link]&lt;/td&gt;&lt;td&gt;[Title]&lt;/td&gt;&lt;td&gt;[Author]&lt;/td&gt;<br>
|
---|
| 465 | <b>Figure 5: Collection configuration files (a) generic, (b) for an email collection</b><br>
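<br>To make the structure of such a configuration file concrete, here is a minimal Python sketch that reads keyword lines in the style of Figure 5 into a dictionary. It is an assumption-laden illustration rather than Greenstone's parser: the function name <i>parse_config</i> is hypothetical, the e-mail address is a placeholder, and line continuation (as on line 16 of Figure 5b) is deliberately not handled.<br>

```python
import shlex

# A cut-down configuration in the style of Figure 5b; the address is a placeholder.
SAMPLE = '''\
creator       someone@example.org
public        True
indexes       document:text document:From
plugins       GMLPlug EMAILPlug ArcPlug RecPlug
classify      AZList metadata=Title
classify      DateList
collectionmeta collectionname "Email messages"
'''


def parse_config(text):
    """Return {keyword: [argument lists]}; repeated keywords accumulate in order."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # blank separator lines carry no settings
        keyword, *args = shlex.split(line)  # shlex keeps quoted strings together
        config.setdefault(keyword, []).append(args)
    return config


config = parse_config(SAMPLE)
plugins = config["plugins"][0]
classifiers = [args[0] for args in config["classify"]]
```

Keeping every occurrence of a keyword matters here because <i>classify</i> and <i>collectionmeta</i> may appear several times, and the order of the plugin list is significant.<br>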
|
---|
| 466 | <br>MG also compresses the text of the collection; and the image files are linked into the <i>index</i> subdirectory. Now none of the material in the <i>import</i> and <i>archives</i> directories is needed to run the collection and can be removed from the file system (though they would be needed if the collection were rebuilt).<br>
|
---|
| 478 | <br>Associated with each collection is a database stored in GDBM (Gnu database manager) format. This contains an entry for each document, giving its OID, its internal MG document number, and metadata such as title. Information for each of the browsing indexes, which appear as buttons on the Greenstone search/browse bar, is also extracted during the building process and stored in the database. A “classifier” program is required for each browsing index to extract the appropriate information from GML documents. Like plugins, classifiers are written on an <i>ad hoc</i> basis for the particular information required, and where possible reused from one collection to another.<br>
|
---|
| 502 | <br>The building program creates the indexes based on whatever appears in the <i>archives</i> directory. The first plugin specified by all collections is one that processes GML files, and so if <i>archives</i> contains imported files they will be processed correctly. If it contains material in the original format, that will be converted using the appropriate plugin. Thus the import process is optional.<br>
|
---|
| 516 | <br>GML is designed to be fast and easy to parse, an important requirement when millions of documents are to be processed. Something as simple as requiring tags to be lower-case, for example, yields a substantial speed-up. In certain circumstances, however, it might be preferable to use a standardized format such as XML. This is straightforward to implement (just write an XML plugin), although we have not done so ourselves. Given the transitory nature of the imported data, to date, we have found GML a satisfactory and beneficial format.<br>
|
---|
| 479 | <b>CREATING NEW COLLECTIONS</b><br>
|
---|
| 481 | <br>Building new collections from scratch is only slightly different from updating an existing collection. The key new requirement is creating a collection configuration file, and a software utility is provided to help. Two pieces of information are required for this: the name of the directory that the collection will use (into which the source data and other files will eventually be placed), and a contact e-mail address for use if any problems are encountered by the software once the collection is up and running. The utility creates files and directories within the newly-named directory to support a generic collection of plain text documents. With suitable data placed in the <i>import</i> directory, building the collection at this point will yield a document-level searchable index of all the text and a browsable list of “titles” (defined in this case to be the document filenames).<br>
|
---|
| 513 | <br>To enhance the functionality and presentation, something anything but the most trivial collection will require, the configuration file must be edited. For a collection sourced from documents in an already supported data format, presented in a similar fashion to an existing collection, the<br>
|
---|
| 523 | <hr>
|
---|
| 524 | <A name=7></a><IMG src="_httpdocimg_/pdf01-7_1.jpg"><br>
|
---|
| 525 | <br>These are modules of code that can be slotted into the<br>system to enhance its capabilities. Plugins parse<br>documents, extracting the text and metadata to be indexed.<br>Classifiers control how metadata is brought together to<br>form browsable data structures. Both are specified in an<br>object-oriented framework using inheritance to minimize<br>the amount of code written.<br>
|
---|
| 526 | <br>A plugin must specify three things: what file formats it can<br>handle, how they should be parsed, and whether the plugin<br>is recursive. File formats are normally determined using<br>regular expression matching on the filename. For example,<br>the HTML plugin accepts all files that end in <i>.htm</i>, <i>.html</i>,<br><i>.HTM</i>, or <i>.HTML</i>. (It is quite possible, however, to write<br>plugins that “look inside” the file as well.) For other files,<br>the plugin returns <i>undefined</i> and the file is passed to the<br>next plugin in the collection’s configuration file (e.g.<br>Figure 5 line 7). If it can, the plugin parses the file and<br>returns the number of documents processed. This involves<br>extracting text and metadata and adding it to the library’s<br>content through calls to <i>add text</i> and <i>add metadata</i>.<br>
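<br>The matching rule described above might look roughly like the following Python sketch. It is not the real plugin interface: the class names, the <i>read</i> method and the list standing in for the library's content are all hypothetical, and Python's <i>None</i> plays the part of <i>undefined</i>. Note that the case-insensitive pattern also accepts mixed-case endings, which is slightly more permissive than the four endings listed in the text.<br>

```python
import re


class HTMLPlugin:
    """Illustrative plugin that accepts files by filename pattern (hypothetical API)."""
    pattern = re.compile(r"\.html?$", re.IGNORECASE)

    def read(self, filename, library):
        if not self.pattern.search(filename):
            return None            # "undefined": let the next plugin try
        library.append(filename)   # stands in for the add text / add metadata calls
        return 1                   # number of documents processed


class TextPlugin(HTMLPlugin):
    # Inheritance keeps the new plugin down to the one thing that differs.
    pattern = re.compile(r"\.txt$", re.IGNORECASE)


def import_file(filename, plugins, library):
    """Offer the file to each plugin in order until one handles it."""
    for plugin in plugins:
        count = plugin.read(filename, library)
        if count is not None:
            return type(plugin).__name__, count
    return None, 0


library = []
plugins = [HTMLPlugin(), TextPlugin()]
results = [import_file(name, plugins, library)
           for name in ("a.HTML", "notes.txt", "photo.jpg")]
```

The order of the plugin list decides which plugin sees a file first, which is why the configuration file lists plugins in a fixed sequence.<br>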
|
---|
| 528 | <b>Figure 6: Searching bookmarked Web pages</b><br>
|
---|
| 527 | <br>Some plugins (“recursive” ones) add extra files into the stream of data processed during the building phase by artificially reactivating the list of plugins. This is how directory hierarchies are traversed.<br>
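<br>A recursive plugin of this kind can be sketched as follows. The sketch assumes a simplified interface (a hypothetical <i>read</i> method that returns a document count or <i>None</i>) and is meant only to show how feeding directory entries back through the plugin list walks a hierarchy; it is not taken from the Greenstone source.<br>

```python
import os
import tempfile


class TextPlugin:
    """Handles plain text files; everything else is declined with None."""
    def read(self, path, plugins, seen):
        if os.path.isfile(path) and path.endswith(".txt"):
            seen.append(os.path.basename(path))
            return 1
        return None


class DirectoryPlugin:
    """Recursive plugin: feeds each directory entry back through the plugin list."""
    def read(self, path, plugins, seen):
        if not os.path.isdir(path):
            return None
        total = 0
        for entry in sorted(os.listdir(path)):
            total += process(os.path.join(path, entry), plugins, seen)
        return total


def process(path, plugins, seen):
    """Try each plugin in order; unhandled files count as zero documents."""
    for plugin in plugins:
        count = plugin.read(path, plugins, seen)
        if count is not None:
            return count
    return 0


def demo():
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "sub"))
        for rel in ("a.txt", os.path.join("sub", "b.txt"), "skip.bin"):
            with open(os.path.join(root, rel), "w") as fh:
                fh.write("x")
        seen = []
        return process(root, [TextPlugin(), DirectoryPlugin()], seen), seen
```

The directory plugin never parses a file itself; it only re-enters the plugin list, so nested directories are handled by the same mechanism at every level.<br>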
|
---|
| 531 | <br>Plugins are small modules of code that are easy to write. We monitored the time it took to develop a new one that was different to any we had produced so far. We chose to make as an example a collection of HTML bookmark files, the motivation being to produce a convenient way of searching and browsing one’s bookmarked Web pages. Figure 6 shows a user searching for bookmarked pages about <i>music</i>. The new plugin took under an hour to write, and was 160 lines long (ignoring blank lines and comments), about the average length of existing plugins.<br>
|
---|
| 547 | <br>Classifiers are more general than plugins because they work on GML-format data. For example, any plugin that generates date metadata in accordance with the Dublin Core can request the collection to be browsable chronologically by specifying the <i>DateList</i> classifier in the collection’s configuration file (Figure 7). Classifiers are more elaborate than most plugins, but new ones are seldom required. The average length of existing classifiers is 230 lines.<br>
|
---|
| 530 | amount of editing is minimal. Importing new data formats and browsing metadata in ways not currently supported are more complex activities that require programming skills.<br>
|
---|
| 534 | <br><b>Modifying the configuration file</b><br>
|
---|
| 538 | <br>Figure 5b shows simple alterations to the generic configuration file in Figure 5a that was generated by the new-collection utility. <i>TEXTPlug</i> is replaced with <i>EMAILPlug</i> (line 7) which reads email files and extracts metadata (<i>From</i>, <i>To</i>, <i>Date</i>, <i>Subject</i>) from them. A classifier for dates is added (line 10) to make the collection browsable chronologically. The default presentation of search results is overridden (line 17) to display both the title of the message (i.e. Dublin Core <i>Title</i>) and its sender (i.e. Dublin Core <i>Author</i>). Elements in square brackets, such as <i>[Title]</i>, are replaced by the metadata associated with a particular document. The built-in term <i>[icon]</i> produces a suitable image that represents the document (such as a book icon or page icon), and the <i>[link]…[/link]</i> construct forms a hyperlink to the complete document.<br>
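<br>As a rough illustration of how a format statement like the one on line 17 of Figure 5b could be expanded, the Python sketch below substitutes the bracketed terms in a single pass. The function name <i>render</i>, its parameters, the sample metadata and the decision to HTML-escape metadata values are all assumptions made for the example, not a description of Greenstone's implementation.<br>

```python
import re
from html import escape


def render(fmt, metadata, url, icon_html):
    """Expand [link], [/link], [icon] and [Name] metadata terms in a format string."""
    def replace(match):
        term = match.group(1)
        if term == "link":
            return '<a href="%s">' % escape(url, quote=True)
        if term == "/link":
            return "</a>"
        if term == "icon":
            return icon_html
        # Any other bracketed name is looked up in the document's metadata.
        return escape(metadata.get(term, ""))

    # Text outside square brackets is copied to the output unchanged.
    return re.sub(r"\[(/?\w+)\]", replace, fmt)


fmt = "<td>[link][icon][/link]</td><td>[Title]</td><td>[Author]</td>"
row = render(fmt,
             {"Title": "Lunch?", "Author": "A. Sender"},
             url="doc1.html",
             icon_html='<img src="page.gif">')
```

With these sample values the result is one table row whose first cell holds the linked icon, followed by the title and the sender.<br>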
|
---|
| 566 | Anything else in the format statement, which in this case is solely table-cell tags in HTML, is passed through to the page being displayed. As this example shows, creating a new collection that stays within the bounds of the library’s established capabilities falls within the capability of many computer users, for instance computer-trained librarians. Extending Greenstone to handle new document formats and browse metadata in new ways is more challenging.<br>
|
---|
| 567 | <br>Classifiers must specify three things: an initialization routine, how individual documents are classified, and the final browsable data structure. Initialization takes care of any options specified in the configuration file (such as <i>metadata=Title</i> on line 9 of Figure 5b). Classifying individual documents is an iterative process: for each one, a call to <i>document-classify</i> is made. On presentation of the document’s OID, the necessary metadata is located and used to control where the document is added to the browsable data structure being constructed.<br>
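<br>The three parts of a classifier can be illustrated with a small Python class. Everything here is a sketch under assumptions: the class and method names are hypothetical, the <i>pagesize</i> option is invented for the example, and cutting the sorted list into fixed-size pages is a simplification rather than the behaviour of any particular Greenstone classifier.<br>

```python
class TitleListClassifier:
    """Illustrative classifier (hypothetical names, not Greenstone's API)."""

    def __init__(self, options):
        # Initialization: read options such as "metadata=Title".
        settings = dict(opt.split("=", 1) for opt in options)
        self.field = settings.get("metadata", "Title")
        self.page_size = int(settings.get("pagesize", 2))  # assumed option
        self.entries = []

    def classify(self, oid, metadata):
        # Called once per document: locate the metadata and file the OID under it.
        value = metadata.get(self.field)
        if value is not None:
            self.entries.append((value.lower(), value, oid))

    def get_structure(self):
        # Final step: sort, then cut into pages labelled by alphabetic range.
        ordered = sorted(self.entries)
        pages = []
        for start in range(0, len(ordered), self.page_size):
            chunk = ordered[start:start + self.page_size]
            label = "%s-%s" % (chunk[0][1][0].upper(), chunk[-1][1][0].upper())
            pages.append((label, [oid for _, _, oid in chunk]))
        return pages


classifier = TitleListClassifier(["metadata=Title"])
docs = {
    "D1": {"Title": "Water supply"},
    "D2": {"Title": "agroforestry"},
    "D3": {"Title": "Beekeeping"},
    "D4": {"Author": "no title here"},
}
for oid, metadata in docs.items():
    classifier.classify(oid, metadata)
structure = classifier.get_structure()
```

Documents without the requested metadata (D4 above) are simply left out of the browsing structure in this sketch.<br>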
|
---|
| 585 | <br><b>Writing new plugins and classifiers</b><br>
|
---|
| 587 | <br>Extensibility is obtained through plugins and classifiers.<br>
|
---|
| 584 | <br>Once all documents have been added, a request is made for the completed data structure. Some classifiers return the data structure directly; others transform the data structure before it is returned. For example, the <i>AZList</i> classifier<br>
|
---|
| 589 | <hr>
|
---|
| 590 | <A name=8></a><IMG src="_httpdocimg_/pdf01-8_1.jpg"><br>
|
---|
| 594 | <b>Figure 7: Browsing a newspaper collection by date</b><br>
|
---|
| 597 | divides the alphabetically sorted list of metadata into separate pages of about the same size and returns the alphabetic ranges for each one (Figure 4).<br>
|
---|
| 603 | <b>OVERVIEW OF RELATED WORK</b><br>
|
---|
| 605 | Two projects that provide substantial open source digital library software are Dienst (Lagoze and Fielding, 1998) and Harvest (Bowman <i>et al.</i>, 1994). The origins of Dienst (<i>www.cs.cornell.edu/cdlrg</i>) stretch back to 1992. The term has come to represent three entities: a conceptual architecture for distributed digital libraries; an open protocol for service communication; and a software system that implements the protocol. To date, five sample digital libraries have been built using this technology. They manifest themselves in two forms: technical reports and primary source documents.<br>
|
---|
| 627 | Best known is NCSTRL, the Networked Computer Science Technical Reference Library project (<i>www.ncstrl.org</i>). This collection facilitates searching by title, author and abstract, and browsing by year and author, across a distributed network of document repositories. Documents can (where supported) be delivered in various formats such as PostScript, a thumbnail overview of the pages, and a GIF image of a particular page.<br>
|
---|
| 643 | The <i>Making of America</i> resource is an example of a collection based around primary sources, in this case American social history, 1830–1900. It has a different “look and feel” to NCSTRL, being strongly oriented toward browsing rather than searching. A user navigates their way through a hierarchical structure of hyperlinks to reach a book of interest. The book itself is a series of scanned images: delivery options include going directly to a page number, next and previous page buttons, and displaying a particular page at different resolutions. A text version of the page is also available upon which a searching option is also provided.<br>
|
---|
| 592 | Started in 1994, Harvest is also a long-running research project. It provides an efficient means of gathering source data from the Internet and distributing indexing information over the Internet. This is accomplished through five components: <i>gatherer</i>, <i>broker</i>, <i>indexer</i>, <i>replicator</i> and <i>cache</i>. The first three are central to creating, updating and searching a collection; the last two help to improve performance over the Internet through transparent mirroring and caching techniques.<br>
|
---|
| 593 | The system is configurable and customizable. While searching is most commonly implemented using Glimpse (<i>glimpse.cs.arizona.edu</i>), in principle any search engine that supports incremental updates and Boolean combinations of attribute-based queries can be used. It is possible to control what type of documents are gathered during creation and updating, and how the query interface looks and is laid out.<br>
|
---|
| 596 | Sample collections cited by the developers include 21,000 computer science technical reports and 7,000 home pages. Other examples include a sizable collection of agriculture-related electronic journals and magazines called “tomato-juice” (accessed through <i>hegel.lib.ncsu.edu</i>) and a full-text index of library-related electronic serials (<i>sunsite.berkeley.edu/IndexMorganagus</i>). Harvest is also often used to index Web sites (for example <i>www.middlebury.edu</i>).<br>
|
---|
| 612 | Comparing Greenstone with Dienst and Harvest, there are both similarities and differences. All provide substantial digital library systems, hence common themes recur, but they are driven by projects with different aims. Harvest, for instance, was not conceived as a digital library project at all, but by virtue of its selective document gathering process it can be classed (and is used) as one. While it provides sophisticated search options, it lacks the complementary service of browsing. Furthermore it adds no structure or order to the documents collected, relying on whatever structures are present in the site that they were gathered from. A proven strength of the design is its flexibility through configuration and customization, an element also present in Greenstone.<br>
|
---|
| 640 | Dienst, best exemplified through the NCSTRL work, supports searching and browsing, like Greenstone. Both use open protocols. Differences include a high reliance in Dienst on user-supplied information when a document is added, and a smaller range of document types supported, although Dienst does include a document model that should, over time, allow this to expand with relative ease.<br>
|
---|
| 656 | There are also commercial systems that provide similar digital library services to those described. However, since<br>
|
---|
| 659 | <hr>
|
---|
| 660 | <A name=9></a>corporate culture instills proprietary attitudes there is little opportunity for advancement through a shared collaborative effort. Consequently they are not reviewed here.<br>
|
---|
| 669 | <b>CONCLUSIONS</b><br>
|
---|
| 671 | Greenstone is a comprehensive software system for creating digital library collections. It builds data structures for searching and browsing from the material provided, rather than relying on any hand-crafting. The process is controlled by a configuration file, and once a collection exists new material can be added completely automatically. Browsing is based on Dublin Core metadata.<br>
|
---|
| 685 | New collections can be developed easily, particularly if they resemble existing ones. Extensibility is achieved through software “plugins” that can be written to accommodate documents, and metadata, in different formats. Standard plugins exist for many document types; new ones are easily written. Browsing is controlled by “classifiers” that process metadata into browsing structures (by date, alphabetical, hierarchical, etc).<br>
|
---|
| 701 | However, the most powerful support for extensibility is achieved not by technical means but by making the source code freely available under the Gnu public license. Only through an international cooperative effort will digital library software become sufficiently comprehensive to meet the world’s needs with the richness and flexibility that users deserve.<br>
|
---|
| 715 | <b>ACKNOWLEDGMENTS</b><br>
|
---|
| 717 | We gratefully acknowledge all those who have worked on the Greenstone software, and all members of the New Zealand Digital Library project for their enthusiasm and ideas.<br>
|
---|
| 661 | <b>REFERENCES</b><br>
|
---|
| 663 | 1. Akscyn, R.M. and Witten, I.H. (1998) “Report on First Summit on International Cooperation on Digital Libraries.” ks.com/idla-wp-oct98.<br>
|
---|
| 668 | 2. Bowman, C.M., Danzig, P.B., Manber, U., and Schwartz, M.F. (1994) “Scalable Internet resource discovery: Research problems and approaches.” <i>Communications of the ACM</i>, Vol. 37, No. 8, pp. 98–107.<br>
|
---|
| 676 | 3. Fox, E. (1998) “Digital library definitions.” ei.cs.vt.edu/~fox/dlib/def.html.<br>
|
---|
| 680 | 4. Humanity Libraries (1998) <i>Humanity Development Library</i>. CD-ROM produced by the Global Help Project, Antwerp, Belgium.<br>
|
---|
| 686 | 5. Lagoze, C. and Fielding, D. (1998) “Defining Collections in Distributed Digital Libraries.” <i>D-Lib Magazine</i>, Nov.<br>
|
---|
| 692 | 6. PAHO (1999) <i>Virtual Disaster Library</i>. CD-ROM produced by the Pan-American Health Organization, Washington DC, USA.<br>
|
---|
| 698 | 7. McNab, R.J., Witten, I.H. and Boddie, S.J. (1998) “A distributed digital library architecture incorporating different index styles.” <i>Proc IEEE Advances in Digital Libraries</i>, Santa Barbara, CA, pp. 36–45.<br>
|
---|
| 706 | 8. Nevill-Manning, C.G., Reed, T., and Witten, I.H. (1998) “Extracting text from PostScript.” <i>Software: Practice and Experience</i>, Vol. 28, No. 5, pp. 481–491; April.<br>
|
---|
| 714 | 9. UNESCO (1999) <i>SAHEL point DOC: Anthologie du développement au Sahel</i>. CD-ROM produced by UNESCO, Paris, France.<br>
|
---|
| 718 | 10. UNU (1998) <i>Collection on critical global issues</i>. CD-ROM produced by the United Nations University Press, Tokyo, Japan.<br>
|
---|
| 723 | 11. Witten, I.H., Moffat, A. and Bell, T. (1999) <i>Managing Gigabytes: compressing and indexing documents and images</i>, Morgan Kaufmann, second edition.<br>
|
---|
| 725 | <hr>
|
---|
| 726 |
|
---|
| 727 |
|
---|
| 728 | </Content>
|
---|
| 729 | </Section>
|
---|
| 730 | </Archive>
|
---|