<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
<Description>
<Metadata name="gsdldoctype">indexed_doc</Metadata>
<Metadata name="Language">en</Metadata>
<Metadata name="Encoding">utf8</Metadata>
<Metadata name="URL">http://Scratch/ak19/gs2-svn-12Dec2016/collect/Enhanced-PDF/tmp/1487129922/pdf01.html</Metadata>
<Metadata name="UTF8URL">http://Scratch/ak19/gs2-svn-12Dec2016/collect/Enhanced-PDF/tmp/1487129922/pdf01.html</Metadata>
<Metadata name="Title">Greenstone: A Comprehensive Open-Source Digital Library Software System Ian H. Witten,* Rodger J....</Metadata>
<Metadata name="gsdlsourcefilename">import/pdf01.pdf</Metadata>
<Metadata name="gsdlconvertedfilename">tmp/1487129922/pdf01.html</Metadata>
<Metadata name="OrigSource">pdf01.html</Metadata>
<Metadata name="Source">pdf01.pdf</Metadata>
<Metadata name="SourceFile">pdf01.pdf</Metadata>
<Metadata name="Plugin">PDFPlugin</Metadata>
<Metadata name="FileSize">269487</Metadata>
<Metadata name="FilenameRoot">pdf01</Metadata>
<Metadata name="FileFormat">PDF</Metadata>
<Metadata name="srcicon">_iconpdf_</Metadata>
<Metadata name="srclink_file">doc.pdf</Metadata>
<Metadata name="srclinkFile">doc.pdf</Metadata>
<Metadata name="NumPages">9</Metadata>
<Metadata name="gsdlthistype">Paged</Metadata>
<Metadata name="ex.ExifTool.ExifToolVersion">8.57</Metadata>
<Metadata name="ex.File.Directory">/Scratch/ak19/gs2-svn-12Dec2016/collect/Enhanced-PDF/import</Metadata>
<Metadata name="ex.File.FileModifyDate">2017:02:15 16:36:57+13:00</Metadata>
<Metadata name="ex.File.FileName">pdf01.pdf</Metadata>
<Metadata name="ex.File.FilePermissions">664</Metadata>
<Metadata name="ex.File.FileSize">269487</Metadata>
<Metadata name="ex.File.FileType">PDF</Metadata>
<Metadata name="ex.File.MIMEType">application/pdf</Metadata>
<Metadata name="ex.PDF.Author">Bronwyn</Metadata>
<Metadata name="ex.PDF.CreateDate">2000:03:02 15:21:24</Metadata>
<Metadata name="ex.PDF.Creator">Microsoft Word</Metadata>
<Metadata name="ex.PDF.Linearized">false</Metadata>
<Metadata name="ex.PDF.PDFVersion">1.2</Metadata>
<Metadata name="ex.PDF.PageCount">9</Metadata>
<Metadata name="ex.PDF.Producer">Acrobat PDFWriter 4.0 for Power Macintosh</Metadata>
<Metadata name="Identifier">HASH1a9cea0f239f754007681b</Metadata>
<Metadata name="lastmodified">1487129817</Metadata>
<Metadata name="lastmodifieddate">20170215</Metadata>
<Metadata name="oailastmodified">1487129922</Metadata>
<Metadata name="oailastmodifieddate">20170215</Metadata>
<Metadata name="assocfilepath">HASH1a9c.dir</Metadata>
<Metadata name="gsdlassocfile">pdf01-2_1.jpg:image/jpeg:</Metadata>
<Metadata name="gsdlassocfile">pdf01-3_1.jpg:image/jpeg:</Metadata>
<Metadata name="gsdlassocfile">pdf01-4_1.jpg:image/jpeg:</Metadata>
<Metadata name="gsdlassocfile">pdf01-5_1.jpg:image/jpeg:</Metadata>
<Metadata name="gsdlassocfile">pdf01-7_1.jpg:image/jpeg:</Metadata>
<Metadata name="gsdlassocfile">pdf01-8_1.jpg:image/jpeg:</Metadata>
<Metadata name="gsdlassocfile">doc.pdf:application/pdf:</Metadata>
</Description>
<Content>
</Content>
<Section>
<Description>
<Metadata name="Title">1</Metadata>
</Description>
<Content><br />
<b>Greenstone: A Comprehensive Open-Source</b><br>
<b>Digital Library Software System</b><br>
<i>Ian H. Witten,* Rodger J. McNab,† Stefan J. Boddie,* David Bainbridge*</i><br>
* Dept of Computer Science<br>
University of Waikato, New Zealand<br>
E-mail: {ihw, sjboddie, davidb}@cs.waikato.ac.nz<br>
† Digilib Systems<br>
Hamilton, New Zealand<br>
E-mail: [email protected]<br>
<b>ABSTRACT</b><br>
This paper describes the Greenstone digital library software, a comprehensive, open-source system for the construction and presentation of information collections. Collections built with Greenstone offer effective full-text searching and metadata-based browsing facilities that are attractive and easy to use. Moreover, they are easily maintainable and can be augmented and rebuilt entirely automatically. The system is extensible: software “plugins” accommodate different document and metadata types.<br>
<b>INTRODUCTION</b><br>
Notwithstanding intense research activity in the digital library field during the second half of the 1990s, comprehensive software systems for creating digital libraries are not widely available. In fact, the usual solution when creating a digital library is also the most obvious: just put it on the Web. But consider how much effort is involved in constructing a Web site for a digital library. To be effective it needs to be visually attractive and ergonomically easy to use, incorporate convenient and powerful searching capabilities, and offer rich and natural browsing facilities. Above all it must be easy to maintain and augment, which presents a significant challenge if any manual organization is involved.<br>
The alternative is to automate these activities through software tools. But the broad scope of digital library requirements makes this a daunting prospect. Ideally the software should incorporate facilities ranging from multilingual information retrieval to distributed computing protocols, from interoperability to search engine technology, from metadata standards to multiformat document parsing, from multimedia to multiple operating systems, from Web browsers to plug-and-play DVDs.<br>
The Greenstone Digital Library Software from the New Zealand Digital Library (NZDL) project tackles this issue by providing a new way of organizing information and making it available over the Internet. A <i>collection</i> of information comprises several (typically several thousand, or several million) <i>documents</i>, and a uniform interface is provided to all documents in a collection. A library may include many different collections, each organized differently, though there is a strong family resemblance in how collections are presented.<br>
Making information available using this system is far more than “just putting it on the Web.” The collection becomes maintainable, searchable, and browsable. Each collection, prior to presentation, undergoes a “building” process that, once established, is completely automatic. This process creates all the structures that are used at run-time for accessing the collection. Searching is based on various indexes, while browsing is based on various metadata; support structures for both are created during the building operation. When new material appears it can be fully incorporated into the collection by rebuilding.<br>
To address the exceptionally broad demands of digital libraries, the system is public and extensible. It is issued under the Gnu public license and, in the spirit of open-source software, users are invited to contribute modifications and enhancements. Only through an international cooperative effort will digital library software become sufficiently comprehensive to meet the world’s needs. Currently the Greenstone software is used at sites in Canada, Germany, New Zealand, Romania, UK, and the US, and collections range from newspaper articles to technical documents, from educational journals to oral history, from visual art to folksongs. The software has been used for collections in many different languages, and for CD-ROMs that have been published by the United Nations and other humanitarian agencies in Belgium, France, Japan, and the US for distribution in developing countries (Humanity Libraries, 1998; PAHO, 1999; UNESCO, 1999; UNU, 1998). Further details can be obtained from <i>www.nzdl.org</i>.<br>
<hr>
</Content>
</Section>
<Section>
<Description>
<Metadata name="Title">2</Metadata>
</Description>
<Content><br />
<IMG src="_httpdocimg_/pdf01-2_1.jpg"><br>
<b>Figure 1: Searching the HDL collection</b><br>
This paper sets the scene with a brief discussion of what a digital library is. We then give an overview of the facilities offered by Greenstone and show how end users find information in collections. Next we describe the files and directories involved in a collection, and then discuss the processes of updating existing collections and creating new ones, including extending the software to provide new facilities. We conclude with an overview of related work.<br>
<b>WHAT IS A DIGITAL LIBRARY?</b><br>
<br>Ten definitions of the term “digital library” have been culled from the literature by Fox (1998), and their spirit is captured in the following brief characterization:<br>
<br>
<i>A collection of digital objects, including text, video, and audio, along with methods for access and retrieval, and for selection, organization and maintenance of the collection</i><br>
<br>
(Akscyn and Witten, 1998). Lesk (1998) views digital libraries as “organized collections of digital information,” and wisely recommends that they articulate the principles governing what is included and how the collection is organized.<br>
<br>Digital libraries are generally distinguished from the World-Wide Web, the essential difference being in selection and organization. But they are not generally distinguished from a web <i>site</i>: indeed, virtually all extant digital libraries manifest themselves as a web site. Hence the obvious question: to make a digital library, why not just put the information on the Web?<br>
<br>
But we make a distinction between a digital library and a web site that lies at the heart of our software design: one should easily be able to add new material to a library without having to integrate it manually or edit its content in any way. Once added, new material should immediately become a first-class component of the library. And what permits it to be integrated into existing searching and browsing structures without any manual intervention is <i>metadata</i>. This provides sufficient focus to the concept of “digital library” to support the development of a construction kit.<br>
<b>OVERVIEW OF GREENSTONE</b><br>
<br>Information collections built by Greenstone combine extensive full-text search facilities with browsing indexes based on different metadata types. There are several ways for users to find information, although they differ between collections depending on the metadata available and the collection design. Typically you can <i>search for particular words</i> that appear in the text, or within a section of a document, or within a title or section heading. You can <i>browse documents by title</i>: just click on the displayed book icon to read it. You can <i>browse documents by subject</i>. Subjects are represented by bookshelves: just click on a shelf to see the books. Where appropriate, documents come complete with a table of contents (constructed automatically): you can click on a chapter or subsection to open it, expand the full table of contents, or expand the full document.<br>
<br>An example of searching is shown in Figure 1 where documents in the Global Help Project’s Humanity Development Library (HDL) are being searched for chapters matching the word <i>butterfly</i>. In Figure 2 the same collection is being browsed by subject: by clicking on the bookshelf icons the user has discovered an item under Section 16, Animal Husbandry. Pursuing an interest in butterfly farming, the user selects a book by clicking on its book icon. In Figure 3 the front cover of the book is displayed as a graphic on the left, and the automatically constructed table of contents appears at the start of the document. The current focus, <i>Introduction and Summary</i>, is shown in bold in the table of contents with its text starting further down the page.<br>
<br>In accordance with Lesk’s advice, a statement of purpose and coverage accompanies each collection, along with an explanation of how it is organized (Figure 1 shows the start of this). A distinction is made between <i>searching</i> and <i>browsing</i>. Searching is full-text, and (depending on the collection’s design) the user can choose between indexes built from different parts of the documents, or from different metadata. Some collections have an index of full documents, an index of sections, an index of paragraphs, an index of titles, and an index of section headings, each of which can be searched for particular words or phrases. Browsing involves data structures created from metadata that the user can examine: lists of authors, lists of titles, lists of dates, hierarchical classification structures, and so on. Data structures for both browsing and searching are built according to instructions in a configuration file, which controls both building and serving the collection. Sample configuration files are discussed below.<br>
<hr>
</Content>
</Section>
<Section>
<Description>
<Metadata name="Title">3</Metadata>
</Description>
<Content><br />
<IMG src="_httpdocimg_/pdf01-3_1.jpg"><br>
<b>Figure 2: Browsing the HDL collection by subject</b><br>
<br>Rich browsing facilities can be provided by manually linking parts of documents together and building explicit indexes and tables of contents. However, manually-created linking becomes difficult to maintain, and often falls into disrepair when a collection expands. The Greenstone software takes a different tack: it facilitates <i>maintainability</i> by creating all searching and browsing structures automatically from the documents themselves. No links are inserted by hand. This means that when new documents in the same format become available, they can be added automatically. Indeed, for some collections this is done by processes that wake up regularly, scout for new material, and rebuild the indexes, all without manual intervention.<br>
Collections comprise many documents: thousands, tens of thousands, or even millions. Each document may be hierarchically organized into <i>sections</i> (subsections, sub-subsections, and so on). Each section comprises one or more <i>paragraphs</i>. Metadata such as author, title, date, keywords, and so on, may be associated with documents, or with individual sections of documents. This is the raw material for indexes. It must either be provided explicitly for each document and section (for example, in an accompanying spreadsheet) or be derivable automatically from the source documents. Metadata is converted to Dublin Core and stored with the document for internal use.<br>
<br>In order to accommodate different kinds of source documents, the software is organized so that “plugins” can be written for new document types. Plugins exist for plain text documents, HTML documents, email documents, and bibliographic formats. Word documents are handled by saving them as HTML; PostScript ones by applying a preprocessor (Nevill-Manning <i>et al.</i>, 1998). Specially written plugins also exist for proprietary formats such as that used by the BBC archives department. A collection may have source documents in different forms: it is just a matter of specifying all the necessary plugins. In order to build browsing indexes from metadata, an analogous scheme of “classifiers” is used: classifiers create indexes of various kinds based on metadata. Source documents are brought into the Greenstone system through a process called <i>importing</i>, which uses the plugins and classifiers specified in the collection configuration file.<br>
<br>The international Unicode character set is used throughout, so documents (and interfaces) can be written in any language. Collections have so far been produced in English, French, Spanish, German, Maori, Chinese, and Arabic. The NZDL Web site provides numerous examples. Collections can contain text, pictures, and even audio and video clips; a text-only version of the interface is also provided to accommodate visually impaired users. Compression technology is used to ensure best use of storage (Witten <i>et al.</i>, 1999). Most non-textual material is either linked to textual documents or accompanied by textual descriptions (such as photo captions) to allow full-text searching and browsing. However, the architecture permits the implementation of plugins and classifiers even for non-textual data.<br>
<br>
The system includes an “administrative” function whereby specified users can examine the composition of all collections, protect documents so that they can only be accessed by registered users on presentation of a password, and so on. Logs of user activity are kept that record all queries made to every Greenstone collection (though this facility can be disabled).<br>
<br>Although primarily designed for Internet access over the World-Wide Web, collections can be made available, in precisely the same form, on CD-ROM. In either case they are accessed through any Web browser. Greenstone CD-ROMs operate on a standalone PC under Windows 3.X, 95, 98, and NT, and the interaction is identical to accessing the collection on the Web, except that response is faster and more predictable. The requirement to operate on early Windows systems is one that plagues the software design, but is crucial for many users, particularly those in underdeveloped countries seeking access to humanitarian aid collections. If the PC is connected to a network (intranet or Internet), a custom-built Web server provided on each CD makes exactly the same information available to others through their standard Web browser. The use of compression ensures that the greatest possible volume of information can be packed on to a CD-ROM.<br>
<br>The collection-serving software operates under Unix and Windows NT, and works with standard Web servers. A flexible process structure allows different collections to be served by different computers, yet be presented to the user in the same way, on the same Web page, as part of the same digital library, even as part of the same collection (McNab and Witten, 1998). Existing collections can be updated and new ones brought on-line at any time, without bringing the system down; the process responsible for the user interface will notice (through periodic polling) when new collections appear and add them to the list presented to the user.<br>
<hr>
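<br>The classifier idea described on this page, in which browsing indexes are derived purely from metadata, can be illustrated with a minimal Python sketch. It is not Greenstone code: the function name classify_by_initial, the dictionary layout, and the sample records are all hypothetical and chosen only for illustration.<br>

```python
from collections import defaultdict


def classify_by_initial(documents, field="Title"):
    """Group document identifiers by the first letter of a metadata field.

    `documents` maps a hypothetical document identifier to its metadata
    dictionary. Documents lacking the field are skipped, since there is
    nothing to classify them by.
    """
    shelves = defaultdict(list)
    for doc_id, metadata in documents.items():
        value = metadata.get(field, "").strip()
        if not value:
            continue
        shelves[value[0].upper()].append((value, doc_id))
    # Sort the shelves and the entries on each shelf so that the
    # resulting browse list is stable across rebuilds.
    return {letter: sorted(entries) for letter, entries in sorted(shelves.items())}


# Illustrative records only; these are not taken from a real collection.
sample = {
    "doc1": {"Title": "Butterfly Farming", "Publisher": "Example Org"},
    "doc2": {"Title": "animal husbandry"},
    "doc3": {"Publisher": "No Title Press"},
}
index = classify_by_initial(sample)
```

<br>Because the index is computed from metadata alone, rerunning the function after new documents arrive rebuilds the browse list with no manual linking, which is the maintainability property the text emphasizes.<br>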
</Content>
</Section>
<Section>
<Description>
<Metadata name="Title">4</Metadata>
</Description>
<Content><br />
<IMG src="_httpdocimg_/pdf01-4_1.jpg"><br>
<b>Figure 3: Reading a book in the HDL</b><br>
<b>FINDING INFORMATION</b><br>
<br>Greenstone digital library systems generally include several separate collections. A home page allows you to select a collection; in addition, each collection has its own “about” page that gives you information about how the collection is organized and the principles governing what is included.<br>
<br>All icons in the screenshots of Figures 1-4 are clickable. Those icons at the top of the page return to the home page, provide help text, and allow you to set user interface and searching preferences. The navigation bar underneath gives access to the searching and browsing facilities, which differ from one collection to another.<br>
<br>Each of the five buttons provides a different way to find information. You can <i>search for particular words</i> that appear in the text from the “search” page (or from the “about” page of Figure 1). This collection contains indexes of chapters, section titles, and entire books. The default search interface is a simple one, suitable for casual users; advanced searching (which allows full Boolean expressions, phrase searching, case and stemming control) can be enabled from the <i>Preferences</i> page.<br>
<br>
This collection has four browsable metadata indexes. You can <i>access publications by subject</i> by clicking the <i>subjects</i> button, which brings up a list of subjects, represented by bookshelves (Figure 2). You can <i>access publications by title</i> by clicking <i>titles a-z</i> (Figure 4), which brings up a list of books in alphabetic order. You can <i>access publications by organization</i> (i.e. Dublin Core “publisher”), bringing up a list of organizations. You can <i>access publications by “how to” listing</i>, yielding a list of hints defined by the collection’s editors. We use the Dublin Core as a base and extend it in an <i>ad hoc</i> manner to accommodate the individual requirements of collection designers.<br>
<b>FILES IN A COLLECTION</b><br>
<br>When a new collection is created or material is added to an existing one, the original source documents are first brought into the system through a process known as “importing.” This involves converting documents into a simple HTML-like format known as GML (for “Greenstone Markup Language”), which includes any metadata associated with the document. Documents are assumed to be in the Unicode UTF-8 code (of which the ASCII characters form a subset).<br>
<br><b>Files and directories</b><br>
<br>There is a separate directory for each collection, which contains five subdirectories: the original raw material (<i>import</i>), the GML files created from this (<i>archives</i>), the final collection as it is served to users (<i>index</i>), a directory for use during the building process (<i>building</i>), and one for any supporting files (<i>etc</i>), including the configuration file that controls the collection creation procedure. Additional files might be required: for example, building a hierarchy of classifications requires a data file of sub-classifications.<br>
<br>
<b>The imported documents</b><br>
<br>In order to identify documents internally, a unique object identifier or OID is assigned to each original source document when it is imported (formed by hashing the content, to overcome file duplication effects caused by mirroring) and stored as metadata within that document. It is important that OIDs persist throughout the index-building process, so that a user’s search history is unaffected by rebuilding the collection. OIDs are assigned by hashing the contents of the original source document.<br>
<br>Once imported, each document is stored in its own subdirectory of <i>archives</i>, along with any associated files (for example, images). To ensure compatibility with Windows 3.0, only eight characters are used in directory and file names, which causes annoying but essentially trivial complications.<br>
<br><b>Inside the documents</b><br>
<br>The GML format imposes a limited amount of structure on documents. Documents are divided into paragraphs. They can be split hierarchically into sections and subsections. OIDs are extended to identify these components by appending numbers, separated by periods, to a document’s OID. When a book is read, its section hierarchy is visible as the table of contents (Figure 3). Chapters, sections, subsections, and pages are all implemented simply as “sections” within the document. In some collections documents do not have a hierarchical subsection structure, but are split into pages to permit browsing within a retrieved document.<br>
<br>The document structure is used for searchable indexes. There are three levels of index: <i>documents</i>, <i>sections</i>, and<br>
<hr>
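<br>The OID scheme described on this page (an identifier obtained by hashing a document's content, then extended with period-separated section numbers) can be sketched in Python as follows. This is an illustration under stated assumptions, not Greenstone's implementation: the paper does not name the hash function, so MD5 and the truncation length are placeholders, and the function names are hypothetical. The HASH prefix simply mirrors the shape of the Identifier value in this file's metadata.<br>

```python
import hashlib


def document_oid(content, length=24):
    """Derive an identifier from the bytes of a source document.

    The choice of MD5 and the truncation length are assumptions made
    for this sketch, not Greenstone's documented scheme.
    """
    digest = hashlib.md5(content).hexdigest()
    return "HASH" + digest[:length]


def section_oid(doc_oid, path):
    """Extend a document OID with section numbers separated by periods."""
    return ".".join([doc_oid] + [str(number) for number in path])


original = b"Greenstone: A Comprehensive Open-Source Digital Library"
mirror_copy = bytes(original)  # same bytes, as if fetched from a mirror
oid = document_oid(original)
second_chapter_third_section = section_oid(oid, [2, 3])
```

<br>Since the identifier depends only on the content, a mirrored duplicate receives the same OID, and re-importing unchanged material yields the same identifier after a rebuild.<br>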
387 | </Content>
|
---|
388 | </Section>
|
---|
389 | <Section>
|
---|
390 | <Description>
|
---|
391 | <Metadata name="Title">5</Metadata>
|
---|
392 | </Description>
|
---|
393 | <Content><br />
|
---|
394 | <IMG src="_httpdocimg_/pdf01-5_1.jpg"><br>
|
---|
395 | the <i>import</i> process is invoked, which converts the files into<br>GML using the specified plugins. Old material for which<br>GML files have previously been created is not re-imported.<br>Then the <i>build</i> process is invoked to build the requisite<br>indexes for the collection. Finally, the contents of the<br><i>building</i> directory are moved into the <i>index</i> directory, and<br>the new version of the collection automatically becomes<br>live.<br>
|
---|
396 | <br>This procedure may seem cumbersome. But all the steps<br>are necessary for efficient operation with large collections.<br>The <i>import</i> process could be performed on the fly during<br>the building operationâbut because building indexes is a<br>multipass operation, the often lengthy importing would be<br>repeated several times. The <i>build</i> process can take<br>considerable timeâa day or two, for very large<br>collections. Consequently, the results are placed in the<br><i>building</i> directory so that, if the collection already exists, it<br>will continue to be served to users in its old form<br>throughout the building operation.<br>
|
---|
397 | <br>Active users of the collection will not be disturbed when<br>the new version becomes liveâthey will probably not<br>
|
---|
398 | <b>Figure 4: Browsing titles in the HDL</b><br>
|
---|
399 | even notice. The persistent OIDs ensure that interactions<br>remain coherentâusers who are examining the results of a<br>query or browse operation will still retrieve the expected<br>
|
---|
400 | <i>paragraphs</i>, corresponding to the distinctions that GML<br>
|
---|
401 | documentsâand if a search is actually in progress when<br>
|
---|
402 | makesâthe hierarchical structure is flattened for the<br>
|
---|
403 | the change takes place the program detects the resulting<br>
|
---|
404 | purposes of creating these indexes. Indexes can be of text,<br>
|
---|
405 | file-structure inconsistency and automatically and<br>
|
---|
406 | or metadata, or any combination. Thus you can create a<br>
|
---|
407 | transparently re-executes the query, this time on the new<br>
|
---|
408 | searchable index of section titles, and/or authors, and/or<br>
|
---|
409 | version of the collection.<br>
|
---|
410 | document descriptions, as well as the document text.<br>
|
---|
411 | <b>UPDATING EXISTING COLLECTIONS</b><br>
|
---|
412 | <br>Updating an existing collection with new files in the same format is easy. For example, the raw material for the HDL is supplied in the form of HTML files marked up with &lt;&lt;TOC&gt;&gt; tags to split books into sections and subsections, and &lt;&lt;I&gt;&gt; tags to indicate where an image is to be inserted. For each book in the library there is a directory that contains a single HTML file representing the book, and separate files containing the associated images. An accompanying spreadsheet file contains the classification hierarchy; this is converted to a simple file format (using Excel's <i>Save As</i> command).<br>
|
---|
413 | <br>Since the collection exists, its directory is already set up with subdirectories <i>import</i>, <i>archives</i>, <i>building</i>, <i>index</i>, and <i>etc</i>, and the <i>etc</i> directory will contain a suitable collection configuration file.<br>
|
---|
414 | <b>The updating procedure</b><br>
|
---|
415 | <br>To update a collection, the new raw material is placed in the <i>import</i> directory, in whatever form it is available. Then<br>
|
---|
416 | <br><b>How it works</b><br>
|
---|
417 | <br>The original material in the <i>import</i> directory may be in any format, and plugins are required to process each format type. The plugins that a collection uses must be specified in the collection configuration file. The <i>import</i> program reads the list of plugins and passes each document to each plugin in order until it finds one that can process it. When updating an existing collection, all plugins necessary to process new material should already have been specified in the configuration file.<br>
|
---|
418 | <br>The building step creates the indexes for both searching and browsing. The MG software is generally used to do the searching (Witten <i>et al.</i>, 1999), and the <i>mgbuild</i> module is automatically invoked to create each of the indexes that is required. For example, the Humanity Development Library has three indexes, one for entire books, one for chapters, and one for section titles. Subdirectories of the <i>index</i> directory are created for each of these indexes.<br>
|
---|
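The way the import program hands each document down the configured list of plugins can be sketched in a few lines of Python. This is an illustration only: the two toy plugins, the import_file function and the convention that a plugin returns None when it declines a file are assumptions made for the sketch, not Greenstone's actual code.

```python
from typing import Callable, List, Optional

# A plugin is modelled as a function that returns the number of documents it
# processed, or None when the file is not one it can handle (an assumption).
Plugin = Callable[[str], Optional[int]]


def text_plugin(path: str) -> Optional[int]:
    """Toy plugin that claims plain-text files."""
    return 1 if path.endswith(".txt") else None


def html_plugin(path: str) -> Optional[int]:
    """Toy plugin that claims HTML files."""
    return 1 if path.endswith((".htm", ".html")) else None


def import_file(path: str, plugins: List[Plugin]) -> Optional[str]:
    """Offer the file to each plugin in configuration order; stop at the first taker."""
    for plugin in plugins:
        if plugin(path) is not None:
            return plugin.__name__
    return None   # no configured plugin could process the file
```

Order matters in this scheme: a plugin listed earlier gets the first chance at every file, which is why the list in the configuration file is read as a sequence rather than a set.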
445 | <hr>
|
---|
446 | </Content>
|
---|
447 | </Section>
|
---|
448 | <Section>
|
---|
449 | <Description>
|
---|
450 | <Metadata name="Title">6</Metadata>
|
---|
451 | </Description>
|
---|
452 | <Content><br />
|
---|
453 | (a)<br>1 creator [email protected]<br>2 maintainer [email protected]<br>3 public True<br>4<br>5 indexes document:text<br>6 defaultindex document:text<br>7 plugins GMLPlug TEXTPlug ArcPlug RecPlug<br>8<br>9 classify AZList metadata=Title<br>10<br>11 collectionmeta collectionname &quot;generic text collection&quot;<br>12 collectionmeta .document:text &quot;documents&quot;<br>
|
---|
454 | (b)<br>1 creator [email protected]<br>2 maintainer [email protected]<br>3 public True<br>4<br>5 indexes document:text document:From<br>6 defaultindex document:text<br>7 plugins GMLPlug EMAILPlug ArcPlug RecPlug<br>8<br>9 classify AZList metadata=Title<br>10 classify DateList<br>11<br>12 collectionmeta collectionname &quot;Email messages&quot;<br>13 collectionmeta .document:text &quot;documents&quot;<br>14 collectionmeta .document:From &quot;email senders&quot;<br>15<br>16 format QueryResults \<br>17 &lt;td&gt;[link][icon][/link]&lt;/td&gt;&lt;td&gt;[Title]&lt;/td&gt;&lt;td&gt;[Author]&lt;/td&gt;<br>
|
---|
455 | <b>Figure 5: Collection configuration files (a) generic, (b) for an email collection</b><br>
|
---|
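To make the layout of these configuration files concrete, here is a minimal Python sketch that reads one into a dictionary. It assumes, from the figure alone, that each line is a keyword followed by whitespace-separated arguments, that a keyword such as classify or collectionmeta may occur more than once, and that a trailing backslash continues a line (as appears to happen on lines 16 and 17 of Figure 5b). The parse_config function is hypothetical, and the address in the sample text is a placeholder because the addresses in the figure are obscured.

```python
import shlex
from collections import defaultdict

# Sample modelled on Figure 5a; the e-mail address is a placeholder.
GENERIC_CFG = '''\
creator       someone@example.org
maintainer    someone@example.org
public        True

indexes       document:text
defaultindex  document:text

plugins       GMLPlug TEXTPlug ArcPlug RecPlug

classify      AZList metadata=Title

collectionmeta collectionname "generic text collection"
collectionmeta .document:text "documents"
'''


def parse_config(text):
    """Return {keyword: [argument list, ...]}; a keyword may repeat."""
    settings = defaultdict(list)
    pending = ""
    for raw in text.splitlines():
        line = pending + raw.strip()
        if line.endswith("\\"):      # trailing backslash continues the line
            pending = line[:-1] + " "
            continue
        pending = ""
        if not line:                 # blank lines only separate groups
            continue
        keyword, *args = shlex.split(line)
        settings[keyword].append(args)
    return dict(settings)
```

For example, parse_config(GENERIC_CFG)["plugins"] yields a single entry listing the four plugins in the order in which documents would be offered to them.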
521 | <br>MG also compresses the text of the collection; and the image files are linked into the <i>index</i> subdirectory. Now none of the material in the <i>import</i> and <i>archives</i> directories is needed to run the collection and can be removed from the file system (though they would be needed if the collection were rebuilt).<br>
|
---|
522 | <br>Associated with each collection is a database stored in GDBM (Gnu database manager) format. This contains an entry for each document, giving its OID, its internal MG document number, and metadata such as title. Information for each of the browsing indexes, which appear as buttons on the Greenstone search/browse bar, is also extracted during the building process and stored in the database. A &quot;classifier&quot; program is required for each browsing index to extract the appropriate information from GML documents. Like plugins, classifiers are written on an <i>ad hoc</i> basis for the particular information required, and where possible reused from one collection to another.<br>
|
---|
523 | <br>The building program creates the indexes based on whatever appears in the <i>archives</i> directory. The first plugin specified by all collections is one that processes GML files, and so if <i>archives</i> contains imported files they will be processed correctly. If it contains material in the original format, that will be converted using the appropriate plugin. Thus the import process is optional.<br>
|
---|
524 | <br>GML is designed to be fast and easy to parse, an important requirement when millions of documents are to be processed. Something as simple as requiring tags to be lower-case, for example, yields a substantial speed-up. In certain circumstances, however, it might be preferable to use a standardized format such as XML. This is straightforward to implement (just write an XML plugin), although we have not done so ourselves. Given the transitory nature of the imported data, to date, we have found GML a satisfactory and beneficial format.<br>
|
---|
525 | <b>CREATING NEW COLLECTIONS</b><br>
|
---|
526 | <br>Building new collections from scratch is only slightly different from updating an existing collection. The key new requirement is creating a collection configuration file, and a software utility is provided to help. Two pieces of information are required for this: the name of the directory that the collection will use (into which the source data and other files will eventually be placed), and a contact e-mail address for use if any problems are encountered by the software once the collection is up and running. The utility creates files and directories within the newly-named directory to support a generic collection of plain text documents. With suitable data placed in the <i>import</i> directory, building the collection at this point will yield a document-level searchable index of all the text and a browsable list of &quot;titles&quot; (defined in this case to be the document filenames).<br>
|
---|
527 | <br>To enhance the functionality and presentation (something anything but the most trivial collection will require), the configuration file must be edited. For a collection sourced from documents in an already supported data format, presented in a similar fashion to an existing collection, the<br>
|
---|
578 | <hr>
|
---|
579 | </Content>
|
---|
580 | </Section>
|
---|
581 | <Section>
|
---|
582 | <Description>
|
---|
583 | <Metadata name="Title">7</Metadata>
|
---|
584 | </Description>
|
---|
585 | <Content><br />
|
---|
586 | <IMG src="_httpdocimg_/pdf01-7_1.jpg"><br>
|
---|
587 | <br>These are modules of code that can be slotted into the<br>system to enhance its capabilities. Plugins parse<br>documents, extracting the text and metadata to be indexed.<br>Classifiers control how metadata is brought together to<br>form browsable data structures. Both are specified in an<br>object-oriented framework using inheritance to minimize<br>the amount of code written.<br>
|
---|
588 | <br>A plugin must specify three things: what file formats it can<br>handle, how they should be parsed, and whether the plugin<br>is recursive. File formats are normally determined using<br>regular expression matching on the filename. For example,<br>the HTML plugin accepts all files that end in <i>.htm</i>, <i>.html</i>,<br><i>.HTM</i>, or <i>.HTML</i>. (It is quite possible, however, to write<br>plugins that &quot;look inside&quot; the file as well.) For other files,<br>the plugin returns <i>undefined</i> and the file is passed to the<br>next plugin in the collection's configuration file (e.g.<br>Figure 5 line 7). If it can, the plugin parses the file and<br>returns the number of documents processed. This involves<br>extracting text and metadata and adding it to the library's<br>content through calls to <i>add text</i> and <i>add metadata</i>.<br>
|
---|
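The three-part plugin contract described above, combined with the inheritance-based framework, might look roughly like the following Python sketch. Everything named here (Library, BasePlugin, HTMLPlugin, process, parse) is hypothetical and chosen for illustration; the real Greenstone plugins are not written this way, and None simply stands in for the "undefined" return value.

```python
import re
from typing import List, Optional, Tuple


class Library:
    """Hypothetical stand-in for the collection content being built."""

    def __init__(self) -> None:
        self.text: List[str] = []
        self.metadata: List[Tuple[str, str]] = []

    def add_text(self, text: str) -> None:
        self.text.append(text)

    def add_metadata(self, name: str, value: str) -> None:
        self.metadata.append((name, value))


class BasePlugin:
    """Shared behaviour; a subclass supplies only a filename pattern and a parser."""

    pattern = re.compile(r"(?!)")   # never matches, so the base class accepts nothing
    recursive = False               # third item of the contract; unused in this sketch

    def process(self, filename: str, content: str, library: Library) -> Optional[int]:
        if not self.pattern.search(filename):
            return None             # "undefined": the next plugin gets its turn
        return self.parse(filename, content, library)

    def parse(self, filename: str, content: str, library: Library) -> int:
        raise NotImplementedError


class HTMLPlugin(BasePlugin):
    # Accepts the four endings named in the text: .htm, .html, .HTM, .HTML
    pattern = re.compile(r"\.(htm|html|HTM|HTML)$")

    def parse(self, filename, content, library):
        title = re.search(r"<title>(.*?)</title>", content, re.I | re.S)
        library.add_metadata("Title", title.group(1).strip() if title else filename)
        library.add_text(re.sub(r"<[^>]+>", " ", content))   # crude tag stripping
        return 1                    # number of documents processed
```

Keeping the accept-or-decline logic in the base class means a new format needs only a pattern and a parse method, which is one way inheritance can keep individual plugins short.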
589 | <br>Some plugins (&quot;recursive&quot; ones) add extra files into the stream of data processed during the building phase by artificially reactivating the list of plugins. This is how directory hierarchies are traversed.<br>
|
---|
590 | <b>Figure 6: Searching bookmarked Web pages</b><br>
|
---|
591 | amount of editing is minimal. Importing new data formats and browsing metadata in ways not currently supported are more complex activities that require programming skills.<br>
|
---|
592 | <br><b>Modifying the configuration file</b><br>
|
---|
593 | <br>Figure 5b shows simple alterations to the generic configuration file in Figure 5a that was generated by the new-collection utility. <i>TEXTPlug</i> is replaced with <i>EMAILPlug</i> (line 7) which reads email files and extracts metadata (<i>From</i>, <i>To</i>, <i>Date</i>, <i>Subject</i>) from them. A classifier for dates is added (line 10) to make the collection browsable chronologically. The default presentation of search results is overridden (line 17) to display both the title of the message (i.e. Dublin Core <i>Title</i>) and its sender (i.e. Dublin Core <i>Author</i>). Elements in square brackets, such as <i>[Title]</i>, are replaced by the metadata associated with a particular document. The built-in term <i>[icon]</i> produces a suitable image that represents the document (such as a book icon or page icon), and the <i>[link]...[/link]</i> construct forms a hyperlink to the complete document. Anything else in the format statement, which in this case is solely table-cell tags in HTML, is passed through to the page being displayed.<br>
|
---|
594 | <br>As this example shows, creating a new collection that stays within the bounds of the library's established capabilities falls within the capability of many computer users, for instance computer-trained librarians. Extending Greenstone to handle new document formats and browse metadata in new ways is more challenging.<br>
|
---|
595 | <br>Plugins are small modules of code that are easy to write. We monitored the time it took to develop a new one that was different to any we had produced so far. We chose to make as an example a collection of HTML bookmark files, the motivation being to produce a convenient way of searching and browsing one's bookmarked Web pages. Figure 6 shows a user searching for bookmarked pages about <i>music</i>. The new plugin took under an hour to write, and was 160 lines long (ignoring blank lines and comments), about the average length of existing plugins.<br>
|
---|
596 | <br>Classifiers are more general than plugins because they work on GML-format data. For example, any plugin that generates date metadata in accordance with the Dublin Core can request the collection to be browsable chronologically by specifying the <i>DateList</i> classifier in the collection's configuration file (Figure 7). Classifiers are more elaborate than most plugins, but new ones are seldom required. The average length of existing classifiers is 230 lines.<br>
|
---|
597 | <br>Classifiers must specify three things: an initialization routine, how individual documents are classified, and the final browsable data structure. Initialization takes care of any options specified in the configuration file (such as <i>metadata=Title</i> on line 9 of Figure 5b). Classifying individual documents is an iterative process: for each one, a call to <i>document-classify</i> is made. On presentation of the document's OID, the necessary metadata is located and used to control where the document is added to the browsable data structure being constructed.<br>
|
---|
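The three parts a classifier must supply (initialization, per-document classification, and the final browsable structure) can be illustrated with a toy Python class. This is a sketch under stated assumptions rather than Greenstone's implementation: the class name, the method names, the option syntax handling and the choice to group documents by the first letter of a metadata value are all invented for the example.

```python
from typing import Dict, List, Tuple


class AlphabeticalClassifier:
    """Toy classifier that files documents under the first letter of one metadata field."""

    def __init__(self, options: str = "metadata=Title") -> None:
        # 1. Initialization: read options such as "metadata=Title".
        parsed = dict(item.split("=", 1) for item in options.split())
        self.field = parsed.get("metadata", "Title")
        self.entries: List[Tuple[str, str]] = []   # (metadata value, OID)

    def classify(self, oid: str, metadata: Dict[str, str]) -> None:
        # 2. Called once per document: locate the field and remember the OID.
        value = metadata.get(self.field)
        if value:
            self.entries.append((value, oid))

    def get_structure(self) -> Dict[str, List[str]]:
        # 3. Final request: OIDs grouped by initial letter, ordered by value.
        shelves: Dict[str, List[str]] = {}
        for value, oid in sorted(self.entries, key=lambda entry: entry[0].lower()):
            shelves.setdefault(value[0].upper(), []).append(oid)
        return shelves
```

Documents that lack the chosen metadata field are simply left out of the structure in this sketch; a real classifier might instead file them under a separate heading.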
646 | <br><b>Writing new plugins and classifiers</b><br>
|
---|
647 | <br>Extensibility is obtained through plugins and classifiers.<br>
|
---|
648 | <br>Once all documents have been added, a request is made for the completed data structure. Some classifiers return the data structure directly; others transform the data structure before it is returned. For example, the <i>AZList</i> classifier<br>
|
---|
651 | <hr>
|
---|
652 | </Content>
|
---|
653 | </Section>
|
---|
654 | <Section>
|
---|
655 | <Description>
|
---|
656 | <Metadata name="Title">8</Metadata>
|
---|
657 | </Description>
|
---|
658 | <Content><br />
|
---|
659 | <IMG src="_httpdocimg_/pdf01-8_1.jpg"><br>
|
---|
660 | <b>Figure 7: Browsing a newspaper collection by date</b><br>
|
---|
661 | divides the alphabetically sorted list of metadata into separate pages of about the same size and returns the alphabetic ranges for each one (Figure 4).<br>
|
---|
662 | <b>OVERVIEW OF RELATED WORK</b><br>
|
---|
663 | Two projects that provide substantial open source digital library software are Dienst (Lagoze and Fielding, 1998) and Harvest (Bowman <i>et al.</i>, 1994). The origins of Dienst (<i>www.cs.cornell.edu/cdlrg</i>) stretch back to 1992. The term has come to represent three entities: a conceptual architecture for distributed digital libraries; an open protocol for service communication; and a software system that implements the protocol. To date, five sample digital libraries have been built using this technology. They manifest themselves in two forms: technical reports and primary source documents.<br>
|
---|
664 | Best known is NCSTRL, the Networked Computer Science Technical Reference Library project (<i>www.ncstrl.org</i>). This collection facilitates searching by title, author and abstract, and browsing by year and author, across a distributed network of document repositories. Documents can (where supported) be delivered in various formats such as PostScript, a thumbnail overview of the pages, and a GIF image of a particular page.<br>
|
---|
665 | The <i>Making of America</i> resource is an example of a collection based around primary sources, in this case American social history, 1830-1900. It has a different &quot;look and feel&quot; to NCSTRL, being strongly oriented toward browsing rather than searching. A user navigates their way through a hierarchical structure of hyperlinks to reach a book of interest. The book itself is a series of scanned images: delivery options include going directly to a page number, next and previous page buttons, and displaying a particular page at different resolutions. A text version of the page is also available upon which a searching option is also provided.<br>
|
---|
666 | Started in 1994, Harvest is also a long-running research project. It provides an efficient means of gathering source data from the Internet and distributing indexing information over the Internet. This is accomplished through five components: <i>gatherer</i>, <i>broker</i>, <i>indexer</i>, <i>replicator</i> and <i>cache</i>. The first three are central to creating, updating and searching a collection; the last two help to improve performance over the Internet through transparent mirroring and caching techniques.<br>
|
---|
667 | The system is configurable and customizable. While searching is most commonly implemented using Glimpse (<i>glimpse.cs.arizona.edu</i>), in principle any search engine that supports incremental updates and Boolean combinations of attribute-based queries can be used. It is possible to control what type of documents are gathered during creation and updating, and how the query interface looks and is laid out.<br>
|
---|
668 | Sample collections cited by the developers include 21,000 computer science technical reports and 7,000 home pages. Other examples include a sizable collection of agriculture-related electronic journals and magazines called &quot;tomato-juice&quot; (accessed through <i>hegel.lib.ncsu.edu</i>) and a full-text index of library-related electronic serials (<i>sunsite.berkeley.edu/IndexMorganagus</i>). Harvest is also often used to index Web sites (for example <i>www.middlebury.edu</i>).<br>
|
---|
669 | Comparing Greenstone with Dienst and Harvest, there are both similarities and differences. All provide substantial digital library systems, hence common themes recur, but they are driven by projects with different aims. Harvest, for instance, was not conceived as a digital library project at all, but by virtue of its selective document gathering process it can be classed (and is used) as one. While it provides sophisticated search options, it lacks the complementary service of browsing. Furthermore it adds no structure or order to the documents collected, relying on whatever structures are present in the site that they were gathered from. A proven strength of the design is its flexibility through configuration and customization, an element also present in Greenstone.<br>
|
---|
670 | Dienst, best exemplified through the NCSTRL work, supports searching and browsing, like Greenstone. Both use open protocols. Differences include a high reliance in Dienst on user-supplied information when a document is added, and a smaller range of document types supported, although Dienst does include a document model that should, over time, allow this to expand with relative ease.<br>
|
---|
671 | There are also commercial systems that provide similar digital library services to those described. However, since<br>
|
---|
728 | <hr>
|
---|
729 | </Content>
|
---|
730 | </Section>
|
---|
731 | <Section>
|
---|
732 | <Description>
|
---|
733 | <Metadata name="Title">9</Metadata>
|
---|
734 | </Description>
|
---|
735 | <Content><br />
|
---|
736 | corporate culture instills proprietary attitudes there is little opportunity for advancement through a shared collaborative effort. Consequently they are not reviewed here.<br>
|
---|
737 | <b>CONCLUSIONS</b><br>
|
---|
738 | Greenstone is a comprehensive software system for creating digital library collections. It builds data structures for searching and browsing from the material provided, rather than relying on any hand-crafting. The process is controlled by a configuration file, and once a collection exists new material can be added completely automatically. Browsing is based on Dublin Core metadata.<br>
|
---|
739 | New collections can be developed easily, particularly if they resemble existing ones. Extensibility is achieved through software &quot;plugins&quot; that can be written to accommodate documents, and metadata, in different formats. Standard plugins exist for many document types; new ones are easily written. Browsing is controlled by &quot;classifiers&quot; that process metadata into browsing structures (by date, alphabetical, hierarchical, etc).<br>
|
---|
740 | However, the most powerful support for extensibility is achieved not by technical means but by making the source code freely available under the Gnu public license. Only through an international cooperative effort will digital library software become sufficiently comprehensive to meet the world's needs with the richness and flexibility that users deserve.<br>
|
---|
741 | <b>ACKNOWLEDGMENTS</b><br>
|
---|
742 | We gratefully acknowledge all those who have worked on the Greenstone software, and all members of the New Zealand Digital Library project for their enthusiasm and ideas.<br>
|
---|
743 | <b>REFERENCES</b><br>
|
---|
744 | 1. Akscyn, R.M. and Witten, I.H. (1998) &quot;Report on First Summit on International Cooperation on Digital Libraries.&quot; ks.com/idla-wp-oct98.<br>
|
---|
745 | 2. Bowman, C.M., Danzig, P.B., Manber, U., and Schwartz, M.F. (1994) &quot;Scalable Internet resource discovery: Research problems and approaches.&quot; <i>Communications of the ACM</i>, Vol. 37, No. 8, pp. 98-107.<br>
|
---|
746 | 3. Fox, E. (1998) &quot;Digital library definitions.&quot; ei.cs.vt.edu/~fox/dlib/def.html.<br>
|
---|
747 | 4. Humanity Libraries (1998) <i>Humanity Development Library</i>. CD-ROM produced by the Global Help Project, Antwerp, Belgium.<br>
|
---|
748 | 5. Lagoze, C. and Fielding, D. (1998) &quot;Defining Collections in Distributed Digital Libraries.&quot; <i>D-Lib Magazine</i>, Nov.<br>
|
---|
749 | 6. PAHO (1999) <i>Virtual Disaster Library</i>. CD-ROM produced by the Pan-American Health Organization, Washington DC, USA.<br>
|
---|
750 | 7. McNab, R.J., Witten, I.H. and Boddie, S.J. (1998) &quot;A distributed digital library architecture incorporating different index styles.&quot; <i>Proc IEEE Advances in Digital Libraries</i>, Santa Barbara, CA, pp. 36-45.<br>
|
---|
751 | 8. Nevill-Manning, C.G., Reed, T., and Witten, I.H. (1998) &quot;Extracting text from PostScript.&quot; <i>Software: Practice and Experience</i>, Vol. 28, No. 5, pp. 481-491; April.<br>
|
---|
752 | 9. UNESCO (1999) <i>SAHEL point DOC: Anthologie du développement au Sahel</i>. CD-ROM produced by UNESCO, Paris, France.<br>
|
---|
753 | 10. UNU (1998) <i>Collection on critical global issues.</i> CD-ROM produced by the United Nations University Press, Tokyo, Japan.<br>
|
---|
754 | 11. Witten, I.H., Moffat, A. and Bell, T. (1999) <i>Managing Gigabytes: compressing and indexing documents and images</i>, Morgan Kaufmann, second edition.<br>
|
---|
801 | <hr>
|
---|
802 |
|
---|
803 |
|
---|
804 | </Content>
|
---|
805 | </Section>
|
---|
806 | </Section>
|
---|
807 | </Archive>
|
---|