<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
<Description>
<Metadata name="gsdldoctype">indexed_doc</Metadata>
<Metadata name="Language">en</Metadata>
<Metadata name="Encoding">utf8</Metadata>
<Metadata name="Author">Bronwyn</Metadata>
<Metadata name="Title">Greenstone: A Comprehensive Open-Source Digital Library Software...</Metadata>
<Metadata name="URL">http://research/ak19/gs2-svn-22Aug2013/collect/Word-PDF-Basic/tmp/1395215322/pdf01.html</Metadata>
<Metadata name="UTF8URL">http://research/ak19/gs2-svn-22Aug2013/collect/Word-PDF-Basic/tmp/1395215322/pdf01.html</Metadata>
<Metadata name="gsdlsourcefilename">import/pdf01.pdf</Metadata>
<Metadata name="gsdlconvertedfilename">tmp/1395215322/pdf01.html</Metadata>
<Metadata name="OrigSource">pdf01.html</Metadata>
<Metadata name="Source">pdf01.pdf</Metadata>
<Metadata name="SourceFile">pdf01.pdf</Metadata>
<Metadata name="Plugin">PDFPlugin</Metadata>
<Metadata name="FileSize">269487</Metadata>
<Metadata name="FilenameRoot">pdf01</Metadata>
<Metadata name="FileFormat">PDF</Metadata>
<Metadata name="srcicon">_iconpdf_</Metadata>
<Metadata name="srclink_file">doc.pdf</Metadata>
<Metadata name="srclinkFile">doc.pdf</Metadata>
<Metadata name="NumPages">9</Metadata>
<Metadata name="dc.Creator">Ian H. Witten</Metadata>
<Metadata name="dc.Creator">Rodger J. McNab</Metadata>
<Metadata name="dc.Creator">Stefan J. Boddie</Metadata>
<Metadata name="dc.Creator">David Bainbridge</Metadata>
<Metadata name="dc.Title">Greenstone: A comprehensive open-source digital library software system</Metadata>
<Metadata name="ex.ExifTool.ExifToolVersion">8.57</Metadata>
<Metadata name="ex.File.Directory">/research/ak19/gs2-svn-22Aug2013/collect/Word-PDF-Basic/import</Metadata>
<Metadata name="ex.File.FileModifyDate">2014:03:19 20:42:09+13:00</Metadata>
<Metadata name="ex.File.FileName">pdf01.pdf</Metadata>
<Metadata name="ex.File.FilePermissions">644</Metadata>
<Metadata name="ex.File.FileSize">269487</Metadata>
<Metadata name="ex.File.FileType">PDF</Metadata>
<Metadata name="ex.File.MIMEType">application/pdf</Metadata>
<Metadata name="ex.PDF.Author">Bronwyn</Metadata>
<Metadata name="ex.PDF.CreateDate">2000:03:02 15:21:24</Metadata>
<Metadata name="ex.PDF.Creator">Microsoft Word</Metadata>
<Metadata name="ex.PDF.Linearized">false</Metadata>
<Metadata name="ex.PDF.PDFVersion">1.2</Metadata>
<Metadata name="ex.PDF.PageCount">9</Metadata>
<Metadata name="ex.PDF.Producer">Acrobat PDFWriter 4.0 for Power Macintosh</Metadata>
<Metadata name="Identifier">HASH1a9cea0f239f754007681b</Metadata>
<Metadata name="lastmodified">1395214929</Metadata>
<Metadata name="lastmodifieddate">20140319</Metadata>
<Metadata name="oailastmodified">1395215322</Metadata>
<Metadata name="oailastmodifieddate">20140319</Metadata>
<Metadata name="assocfilepath">HASH1a9c.dir</Metadata>
<Metadata name="gsdlassocfile">pdf01-2_1.jpg:image/jpeg:</Metadata>
<Metadata name="gsdlassocfile">pdf01-3_1.jpg:image/jpeg:</Metadata>
<Metadata name="gsdlassocfile">pdf01-4_1.jpg:image/jpeg:</Metadata>
<Metadata name="gsdlassocfile">pdf01-5_1.jpg:image/jpeg:</Metadata>
<Metadata name="gsdlassocfile">pdf01-7_1.jpg:image/jpeg:</Metadata>
<Metadata name="gsdlassocfile">pdf01-8_1.jpg:image/jpeg:</Metadata>
<Metadata name="gsdlassocfile">doc.pdf:application/pdf:</Metadata>
</Description>
<Content>
<A name=1></a><b>Greenstone: A Comprehensive Open-Source</b><br>
<b>Digital Library Software System</b><br>
<i>Ian H. Witten,* Rodger J. McNab,† Stefan J. Boddie,* David Bainbridge*</i><br>
* Dept of Computer Science<br>
University of Waikato, New Zealand<br>
E-mail: {ihw, sjboddie, davidb}@cs.waikato.ac.nz<br>
† Digilib Systems<br>
Hamilton, New Zealand<br>
E-mail: [email protected]<br>
<b>ABSTRACT</b><br>
This paper describes the Greenstone digital library<br>
software, a comprehensive, open-source system for the<br>
construction and presentation of information collections.<br>
Collections built with Greenstone offer effective full-text<br>
searching and metadata-based browsing facilities that are<br>
attractive and easy to use. Moreover, they are easily<br>
maintainable and can be augmented and rebuilt entirely<br>
automatically. The system is extensible: software<br>
“plugins” accommodate different document and metadata<br>
types.<br>
<b>INTRODUCTION</b><br>
Notwithstanding intense research activity in the digital<br>
library field during the second half of the 1990s,<br>
comprehensive software systems for creating digital<br>
libraries are not widely available. In fact, the usual solution<br>
when creating a digital library is also the most<br>
obvious: just put it on the Web. But consider how much<br>
effort is involved in constructing a Web site for a digital<br>
library. To be effective it needs to be visually attractive<br>
and ergonomically easy to use, incorporate convenient and<br>
powerful searching capabilities, and offer rich and natural<br>
browsing facilities. Above all it must be easy to maintain<br>
and augment, which presents a significant challenge if any<br>
manual organization is involved.<br>
The alternative is to automate these activities through<br>
software tools. But the broad scope of digital library<br>
requirements makes this a daunting prospect. Ideally the<br>
software should incorporate facilities ranging from<br>
multilingual information retrieval to distributed computing<br>
protocols, from interoperability to search engine<br>
technology, from metadata standards to multiformat<br>
document parsing, from multimedia to multiple operating<br>
systems, from Web browsers to plug-and-play DVDs.<br>
The Greenstone Digital Library Software from the New<br>
Zealand Digital Library (NZDL) project tackles this issue<br>
by providing a new way of organizing information and<br>
making it available over the Internet. A <i>collection</i> of<br>
information comprises several (typically several thousand,<br>
or several million) <i>documents</i>, and a uniform interface is<br>
provided to all documents in a collection. A library may<br>
include many different collections, each organized<br>
differently, though there is a strong family resemblance in<br>
how collections are presented.<br>
Making information available using this system is far more<br>
than “just putting it on the Web.” The collection becomes<br>
maintainable, searchable, and browsable. Each collection,<br>
prior to presentation, undergoes a “building” process that,<br>
once established, is completely automatic. This process<br>
creates all the structures that are used at run-time for<br>
accessing the collection. Searching is based on various<br>
indexes, while browsing is based on various metadata;<br>
support structures for both are created during the building<br>
operation. When new material appears it can be fully<br>
incorporated into the collection by rebuilding.<br>
To address the exceptionally broad demands of digital<br>
libraries, the system is public and extensible. It is issued<br>
under the Gnu public license and, in the spirit of open-<br>
source software, users are invited to contribute<br>
modifications and enhancements. Only through an<br>
international cooperative effort will digital library software<br>
become sufficiently comprehensive to meet the world’s<br>
needs. Currently the Greenstone software is used at sites in<br>
Canada, Germany, New Zealand, Romania, UK, and the<br>
US, and collections range from newspaper articles to<br>
technical documents, from educational journals to oral<br>
history, from visual art to folksongs. The software has<br>
been used for collections in many different languages, and<br>
for CD-ROMs that have been published by the United<br>
Nations and other humanitarian agencies in Belgium,<br>
France, Japan, and the US for distribution in developing<br>
countries (Humanity Libraries, 1998; PAHO, 1999;<br>
UNESCO, 1999; UNU, 1998). Further details can be<br>
obtained from <i>www.nzdl.org</i>.<br>
<hr>
<A name=2></a><IMG src="_httpdocimg_/pdf01-2_1.jpg"><br>
<b>Figure 1: Searching the HDL collection</b><br>
This paper sets the scene with a brief discussion of what a<br>
digital library is. We then give an overview of the facilities<br>
offered by Greenstone and show how end users find<br>
information in collections. Next we describe the files and<br>
directories involved in a collection, and then discuss the<br>
processes of updating existing collections and creating new<br>
ones, including extending the software to provide new<br>
facilities. We conclude with an overview of related work.<br>
<b>WHAT IS A DIGITAL LIBRARY?</b><br>
<br>Ten definitions of the term “digital library” have been<br>
culled from the literature by Fox (1998), and their spirit is<br>
captured in the following brief characterization:<br>
<br>
<i>A collection of digital objects, including text,</i><br>
<i>video, and audio, along with methods for access</i><br>
<i>and retrieval, and for selection, organization<br>and maintenance of the collection</i><br>
<br>
(Akscyn and Witten, 1998). Lesk (1998) views digital<br>
libraries as “organized collections of digital information,”<br>
and wisely recommends that they articulate the principles<br>
governing what is included and how the collection is<br>
organized.<br>
<br>Digital libraries are generally distinguished from the<br>
World-Wide Web, the essential difference being in<br>
selection and organization. But they are not generally<br>
distinguished from a web <i>site</i>: indeed, virtually all extant<br>
digital libraries manifest themselves as a web site. Hence<br>
the obvious question: to make a digital library, why not<br>
just put the information on the Web?<br>
<br>
But we make a distinction between a digital library and a<br>
web site that lies at the heart of our software design: one<br>
should easily be able to add new material to a library<br>
without having to integrate it manually or edit its content<br>
in any way. Once added, new material should immediately<br>
become a first-class component of the library. And what<br>
permits it to be integrated into existing searching and<br>
browsing structures without any manual intervention is<br>
<i>metadata</i>. This provides sufficient focus to the concept of<br>
“digital library” to support the development of a<br>
construction kit.<br>
<b>OVERVIEW OF GREENSTONE</b><br>
<br>Information collections built by Greenstone combine<br>
extensive full-text search facilities with browsing indexes<br>
based on different metadata types. There are several ways<br>
for users to find information, although they differ between<br>
collections depending on the metadata available and the<br>
collection design. Typically you can <i>search for particular<br>words</i> that appear in the text, or within a section of a<br>
document, or within a title or section heading. You can<br>
<i>browse documents by title</i>: just click on the displayed book<br>
icon to read it. You can <i>browse documents by subject</i>.<br>
Subjects are represented by bookshelves: just click on a<br>
shelf to see the books. Where appropriate, documents<br>
come complete with a table of contents (constructed<br>
automatically): you can click on a chapter or subsection to<br>
open it, expand the full table of contents, or expand the full<br>
document.<br>
<br>An example of searching is shown in Figure 1 where<br>
documents in the Global Help Project’s Humanity<br>
Development Library (HDL) are being searched for<br>
chapters matching the word <i>butterfly</i>. In Figure 2 the same<br>
collection is being browsed by subject: by clicking on the<br>
bookshelf icons the user has discovered an item under<br>
Section 16, Animal Husbandry. Pursuing an interest in<br>
butterfly farming, the user selects a book by clicking on its<br>
book icon. In Figure 3 the front cover of the book is<br>
displayed as a graphic on the left, and the automatically<br>
constructed table of contents appears at the start of the<br>
document. The current focus, <i>Introduction and Summary</i>,<br>
is shown in bold in the table of contents with its text<br>
starting further down the page.<br>
<br>In accordance with Lesk’s advice, a statement of purpose<br>
and coverage accompanies each collection, along with an<br>
explanation of how it is organized (Figure 1 shows the<br>
start of this). A distinction is made between <i>searching</i> and<br>
<i>browsing</i>. Searching is full-text, and (depending on the<br>
collection’s design) the user can choose between indexes<br>
built from different parts of the documents, or from<br>
different metadata. Some collections have an index of full<br>
documents, an index of sections, an index of paragraphs,<br>
an index of titles, and an index of section headings, each of<br>
which can be searched for particular words or phrases.<br>
Browsing involves data structures created from metadata<br>
that the user can examine: lists of authors, lists of titles,<br>
lists of dates, hierarchical classification structures, and so<br>
on. Data structures for both browsing and searching are<br>
built according to instructions in a configuration file,<br>
which controls both building and serving the collection.<br>
Sample configuration files are discussed below.<br>
<hr>
<A name=3></a><IMG src="_httpdocimg_/pdf01-3_1.jpg"><br>
<b>Figure 2: Browsing the HDL collection by subject</b><br>
<br>Rich browsing facilities can be provided by manually<br>
linking parts of documents together and building explicit<br>
indexes and tables of contents. However, manually-created<br>
linking becomes difficult to maintain, and often falls into<br>
disrepair when a collection expands. The Greenstone<br>
software takes a different tack: it facilitates <i>maintainability</i><br>
by creating all searching and browsing structures<br>
automatically from the documents themselves. No links<br>
are inserted by hand. This means that when new<br>
documents in the same format become available, they can<br>
be added automatically. Indeed, for some collections this is<br>
done by processes that wake up regularly, scout for new<br>
material, and rebuild the indexes, all without manual<br>
intervention.<br>
Collections comprise many documents: thousands, tens of<br>
thousands, or even millions. Each document may be<br>
hierarchically organized into <i>sections</i> (subsections, sub-<br>
subsections, and so on). Each section comprises one or<br>
more <i>paragraphs</i>. Metadata such as author, title, date,<br>
keywords, and so on, may be associated with documents,<br>
or with individual sections of documents. This is the raw<br>
material for indexes. It must either be provided explicitly<br>
for each document and section (for example, in an<br>
accompanying spreadsheet) or be derivable automatically<br>
from the source documents. Metadata is converted to<br>
Dublin Core and stored with the document for internal use.<br>
<br>In order to accommodate different kinds of source<br>
documents, the software is organized so that “plugins” can<br>
be written for new document types. Plugins exist for plain<br>
text documents, HTML documents, email documents, and<br>
bibliographic formats. Word documents are handled by<br>
saving them as HTML; PostScript ones by applying a<br>
preprocessor (Nevill-Manning <i>et al</i>., 1998). Specially<br>
written plugins also exist for proprietary formats such as<br>
that used by the BBC archives department. A collection<br>
may have source documents in different forms: it is just a<br>
matter of specifying all the necessary plugins. In order to<br>
build browsing indexes from metadata, an analogous<br>
scheme of “classifiers” is used: classifiers create indexes<br>
of various kinds based on metadata. Source documents are<br>
brought into the Greenstone system through a process<br>
called <i>importing</i>, which uses the plugins and classifiers<br>
specified in the collection configuration file.<br>
<br>The international Unicode character set is used throughout,<br>
so documents (and interfaces) can be written in any<br>
language. Collections have so far been produced in<br>
English, French, Spanish, German, Maori, Chinese, and<br>
Arabic. The NZDL Web site provides numerous examples.<br>
Collections can contain text, pictures, and even audio and<br>
video clips; a text-only version of the interface is also<br>
provided to accommodate visually impaired users.<br>
Compression technology is used to ensure best use of<br>
storage (Witten <i>et al.</i>, 1999). Most non-textual material is<br>
either linked to textual documents or accompanied by<br>
textual descriptions (such as photo captions) to allow full-<br>
text searching and browsing. However, the architecture<br>
permits the implementation of plugins and classifiers even<br>
for non-textual data.<br>
<br>
The system includes an “administrative” function whereby<br>
specified users can examine the composition of all<br>
collections, protect documents so that they can only be<br>
accessed by registered users on presentation of a password,<br>
and so on. Logs of user activity are kept that record all<br>
queries made to every Greenstone collection (though this<br>
facility can be disabled).<br>
<br>Although primarily designed for Internet access over the<br>
World-Wide Web, collections can be made available, in<br>
precisely the same form, on CD-ROM. In either case they<br>
are accessed through any Web browser. Greenstone CD-<br>
ROMs operate on a standalone PC under Windows 3.X,<br>
95, 98, and NT, and the interaction is identical to accessing<br>
the collection on the Web, except that response is faster<br>
and more predictable. The requirement to operate on early<br>
Windows systems is one that plagues the software design,<br>
but is crucial for many users, particularly those in<br>
underdeveloped countries seeking access to humanitarian<br>
aid collections. If the PC is connected to a network<br>
(intranet or Internet), a custom-built Web server provided<br>
on each CD makes exactly the same information available<br>
to others through their standard Web browser. The use of<br>
compression ensures that the greatest possible volume of<br>
information can be packed on to a CD-ROM.<br>
<br>The collection-serving software operates under Unix and<br>
Windows NT, and works with standard Web servers. A<br>
flexible process structure allows different collections to be<br>
served by different computers, yet be presented to the user<br>
in the same way, on the same Web page, as part of the<br>
same digital library, even as part of the same collection<br>
(McNab and Witten, 1998). Existing collections can be<br>
updated and new ones brought on-line at any time, without<br>
bringing the system down; the process responsible for the<br>
user interface will notice (through periodic polling) when<br>
new collections appear and add them to the list presented<br>
to the user.<br>
<hr>
<A name=4></a><IMG src="_httpdocimg_/pdf01-4_1.jpg"><br>
<b>Figure 3: Reading a book in the HDL</b><br>
<b>FINDING INFORMATION</b><br>
<br>Greenstone digital library systems generally include<br>
several separate collections. A home page allows you to<br>
select a collection; in addition, each collection has its own<br>
“about” page that gives you information about how the<br>
collection is organized and the principles governing what<br>
is included.<br>
<br>All icons in the screenshots of Figures 1-4 are clickable.<br>
Those icons at the top of the page return to the home page,<br>
provide help text, and allow you to set user interface and<br>
searching preferences. The navigation bar underneath<br>
gives access to the searching and browsing facilities,<br>
which differ from one collection to another.<br>
<br>Each of the five buttons provides a different way to find<br>
information. You can <i>search for particular words</i> that<br>
appear in the text from the “search” page (or from the<br>
“about” page of Figure 1). This collection contains indexes<br>
of chapters, section titles, and entire books. The default<br>
search interface is a simple one, suitable for casual users;<br>
advanced searching (which allows full Boolean<br>
expressions, phrase searching, case and stemming<br>
control) can be enabled from the <i>Preferences</i> page.<br>
<br>
This collection has four browsable metadata indexes. You<br>
can <i>access publications by subject</i> by clicking the <i>subjects</i><br>
button, which brings up a list of subjects, represented by<br>
bookshelves (Figure 2). You can <i>access publications by</i><br>
<i>title</i> by clicking <i>titles a-z</i> (Figure 4), which brings up a list<br>
of books in alphabetic order. You can <i>access publications</i><br>
<i>by organization</i> (i.e. Dublin Core “publisher”), bringing up<br>
a list of organizations. You can <i>access publications by</i><br>
<i>“how to” listing</i>, yielding a list of hints defined by the<br>
collection’s editors. We use the Dublin Core as a base and<br>
extend it in an <i>ad hoc</i> manner to accommodate the<br>
individual requirements of collection designers.<br>
<b>FILES IN A COLLECTION</b><br>
<br>When a new collection is created or material is added to an<br>
existing one, the original source documents are first<br>
brought into the system through a process known as<br>
“importing.” This involves converting documents into a<br>
simple HTML-like format known as GML (for<br>
“Greenstone Markup Language”), which includes any<br>
metadata associated with the document. Documents are<br>
assumed to be in the Unicode UTF-8 code (of which the<br>
ASCII characters form a subset).<br>
<br><b>Files and directories</b><br>
<br>There is a separate directory for each collection, which<br>
contains five subdirectories: the original raw material<br>
(<i>import</i>), the GML files created from this (<i>archives</i>), the<br>
final collection as it is served to users (<i>index</i>), a directory<br>
for use during the building process (<i>building</i>), and one for<br>
any supporting files (<i>etc</i>), including the configuration file<br>
that controls the collection creation procedure. Additional<br>
files might be required: for example, building a hierarchy<br>
of classifications requires a data file of sub-classifications.<br>
<br>
<b>The imported documents</b><br>
<br>In order to identify documents internally, a unique object<br>
identifier or OID is assigned to each original source<br>
document when it is imported (formed by hashing the<br>
content, to overcome file duplication effects caused by<br>
mirroring) and stored as metadata within that document. It<br>
is important that OIDs persist throughout the index-<br>
building process, so that a user’s search history is<br>
unaffected by rebuilding the collection. OIDs are assigned<br>
by hashing the contents of the original source document.<br>
<br>Once imported, each document is stored in its own<br>
subdirectory of <i>archives</i>, along with any associated<br>
files (for example, images). To ensure compatibility with<br>
Windows 3.0, only eight characters are used in directory<br>
and file names, which causes annoying but essentially<br>
trivial complications.<br>
<br><b>Inside the documents</b><br>
<br>The GML format imposes a limited amount of structure on<br>
documents. Documents are divided into paragraphs. They<br>
can be split hierarchically into sections and subsections.<br>
OIDs are extended to identify these components by<br>
appending numbers, separated by periods, to a document’s<br>
OID. When a book is read, its section hierarchy is visible<br>
as the table of contents (Figure 3). Chapters, sections,<br>
subsections, and pages are all implemented simply as<br>
“sections” within the document. In some collections<br>
documents do not have a hierarchical subsection structure,<br>
but are split into pages to permit browsing within a<br>
retrieved document.<br>
<br>The document structure is used for searchable indexes.<br>
There are three levels of index: <i>documents</i>, <i>sections</i>, and<br>
345 | <hr>
|
---|
346 | <A name=5></a><IMG src="_httpdocimg_/pdf01-5_1.jpg"><br>
|
---|
347 | the <i>import</i> process is invoked, which converts the files into<br>GML using the specified plugins. Old material for which<br>GML files have previously been created is not re-imported.<br>Then the <i>build</i> process is invoked to build the requisite<br>indexes for the collection. Finally, the contents of the<br><i>building</i> directory are moved into the <i>index</i> directory, and<br>the new version of the collection automatically becomes<br>live.<br>
|
---|
348 | <br>This procedure may seem cumbersome. But all the steps<br>are necessary for efficient operation with large collections.<br>The <i>import</i> process could be performed on the fly during<br>the building operationâbut because building indexes is a<br>multipass operation, the often lengthy importing would be<br>repeated several times. The <i>build</i> process can take<br>considerable timeâa day or two, for very large<br>collections. Consequently, the results are placed in the<br><i>building</i> directory so that, if the collection already exists, it<br>will continue to be served to users in its old form<br>throughout the building operation.<br>
|
---|
349 | <br>Active users of the collection will not be disturbed when<br>the new version becomes liveâthey will probably not<br>
|
---|
350 | <b>Figure 4: Browsing titles in the HDL</b><br>
|
---|
351 | even notice. The persistent OIDs ensure that interactions<br>remain coherentâusers who are examining the results of a<br>query or browse operation will still retrieve the expected<br>
|
---|
352 | <i>paragraphs</i>, corresponding to the distinctions that GML<br>
|
---|
353 | documentsâand if a search is actually in progress when<br>
|
---|
354 | makesâthe hierarchical structure is flattened for the<br>
|
---|
355 | the change takes place the program detects the resulting<br>
|
---|
356 | purposes of creating these indexes. Indexes can be of text,<br>
|
---|
357 | file-structure inconsistency and automatically and<br>
|
---|
358 | or metadata, or any combination. Thus you can create a<br>
|
---|
359 | transparently re-executes the query, this time on the new<br>
|
---|
360 | searchable index of section titles, and/or authors, and/or<br>
|
---|
361 | version of the collection.<br>
|
---|
362 | document descriptions, as well as the document text.<br>
|
---|
363 | <b>UPDATING EXISTING COLLECTIONS</b><br>
|
---|
364 | <br>Updating an existing collection with new files in the same<br>format is easy. For example, the raw material for the HDL<br>is supplied in the form of HTML files marked up with<br>&lt;&lt;TOC&gt;&gt; tags to split books into sections and<br>subsections, and &lt;&lt;I&gt;&gt; tags to indicate where an image is<br>to be inserted. For each book in the library there is a<br>directory that contains a single HTML file representing the<br>book, and separate files containing the associated images.<br>An accompanying spreadsheet file contains the<br>classification hierarchy; this is converted to a simple file<br>format (using Excel’s <i>Save As</i> command).<br>
|
---|
365 | <br>Since the collection exists, its directory is already set up<br>with subdirectories <i>import</i>, <i>archives</i>, <i>building</i>, <i>index</i>, and<br><i>etc</i>, and the <i>etc</i> directory will contain a suitable collection<br>configuration file.<br>
|
---|
366 | <br><b>The updating procedure</b><br>
|
---|
367 | <br>To update a collection, the new raw material is placed in<br>the <i>import</i> directory, in whatever form it is available. Then<br>
|
---|
368 | <br><b>How it works</b><br>
|
---|
369 | <br>The original material in the <i>import</i> directory may be in any<br>format, and plugins are required to process each format<br>type. The plugins that a collection uses must be specified<br>in the collection configuration file. The <i>import</i> program<br>reads the list of plugins and passes each document to each<br>plugin in order until it finds one that can process it. When<br>updating an existing collection, all plugins necessary to<br>process new material should already have been specified in<br>the configuration file.<br>
|
---|
370 | <br>The building step creates the indexes for both searching<br>and browsing. The MG software is generally used to do the<br>searching (Witten <i>et al.</i>, 1999), and the <i>mgbuild</i> module is<br>automatically invoked to create each of the indexes that is<br>required. For example, the Humanity Development Library<br>has three indexes, one for entire books, one for chapters,<br>and one for section titles. Subdirectories of the <i>index</i><br>directory are created for each of these indexes.<br>
|
---|
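The hand-off described under How it works, where the import program offers each document to each plugin in order until one can process it, can be sketched in Python as below. This is a minimal illustration under stated assumptions: the class names, the process and parse methods and the import_file function are hypothetical and do not reproduce Greenstone's own plugin code. None stands in for a plugin declining a file.

```python
import re


class Plugin:
    """Hypothetical plugin base: a filename pattern plus a parse step."""
    pattern = re.compile(r"(?!)")  # never matches; subclasses override

    def process(self, filename, text):
        # Returning None means this plugin declines the file.
        if not self.pattern.search(filename):
            return None
        return self.parse(filename, text)

    def parse(self, filename, text):
        raise NotImplementedError


class HtmlPlugin(Plugin):
    pattern = re.compile(r"\.(htm|html|HTM|HTML)$")

    def parse(self, filename, text):
        return 1  # one document extracted from one HTML file


class TextPlugin(Plugin):
    pattern = re.compile(r"\.txt$")

    def parse(self, filename, text):
        return 1  # one document extracted from one text file


def import_file(plugins, filename, text):
    """Offer the file to each plugin in order; stop at the first taker."""
    for plugin in plugins:
        count = plugin.process(filename, text)
        if count is not None:
            return type(plugin).__name__, count
    return None, 0  # no plugin in the list could handle the file


PLUGINS = [HtmlPlugin(), TextPlugin()]
```

Because the list is tried in order, the position of a plugin in the configuration file decides which one handles a file that several plugins could accept.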
397 | <hr>
|
---|
398 | <A name=6></a>(a)<br>1 creator [email protected]<br>2 maintainer [email protected]<br>3 public True<br>4<br>5 indexes document:text<br>6 defaultindex document:text<br>7 plugins GMLPlug TEXTPlug ArcPlug RecPlug<br>8<br>9 classify AZList metadata=Title<br>10<br>11 collectionmeta collectionname &quot;generic text collection&quot;<br>12 collectionmeta .document:text &quot;documents&quot;<br>
|
---|
399 | (b)<br>1 creator [email protected]<br>2 maintainer [email protected]<br>3 public True<br>4<br>5 indexes document:text document:From<br>6 defaultindex document:text<br>7 plugins GMLPlug EMAILPlug ArcPlug RecPlug<br>8<br>9 classify AZList metadata=Title<br>10 classify DateList<br>11<br>12 collectionmeta collectionname &quot;Email messages&quot;<br>13 collectionmeta .document:text &quot;documents&quot;<br>14 collectionmeta .document:From &quot;email senders&quot;<br>15<br>16 format QueryResults \\\\<br>17 &lt;td&gt;[link][icon][/link]&lt;/td&gt;&lt;td&gt;[Title]&lt;/td&gt;&lt;td&gt;[Author]&lt;/td&gt;<br>
|
---|
465 | <b>Figure 5: Collection configuration files (a) generic, (b) for an email collection</b><br>
|
---|
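The statements in Figure 5 are keyword and argument lines, with a trailing backslash continuing the format statement onto the next line. The Python sketch below reads such a file and then expands a format string, replacing bracketed terms such as [Title] with a document's metadata. It is a sketch under assumptions: parse_config and render_format are hypothetical helper names, the e-mail address in the sample is a placeholder, missing metadata is rendered as an empty string, and Greenstone's own loader may treat quoting and continuation differently.

```python
import html
import re
import shlex
from collections import defaultdict


def parse_config(text):
    """Collect each statement's arguments under its keyword."""
    config = defaultdict(list)
    pending = ""
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("\\"):
            # A trailing backslash continues the statement on the next line.
            pending += line.rstrip("\\").rstrip() + " "
            continue
        line, pending = pending + line, ""
        if not line:
            continue
        keyword, *args = shlex.split(line)  # honours double-quoted strings
        config[keyword].append(args)        # keywords such as classify repeat
    return dict(config)


def render_format(fmt, metadata, doc_url, icon_url):
    """Expand [link], [/link], [icon] and [Name] terms in a format string."""
    out = fmt.replace("[link]", '<a href="%s">' % html.escape(doc_url, quote=True))
    out = out.replace("[/link]", "</a>")
    out = out.replace("[icon]", '<img src="%s">' % html.escape(icon_url, quote=True))
    # Any remaining [Name] is looked up in the document's metadata;
    # everything else in the string is passed through unchanged.
    return re.sub(r"\[(\w+)\]",
                  lambda m: html.escape(str(metadata.get(m.group(1), ""))),
                  out)


SAMPLE = r'''
creator        someone@example.org
public         True
indexes        document:text document:From
plugins        GMLPlug EMAILPlug ArcPlug RecPlug
classify       AZList metadata=Title
classify       DateList
collectionmeta collectionname "Email messages"
format QueryResults \
<td>[link][icon][/link]</td><td>[Title]</td><td>[Author]</td>
'''

config = parse_config(SAMPLE)
result_name, result_format = config["format"][0]
row = render_format(result_format,
                    {"Title": "Hello", "Author": "a@b"},
                    "doc/1", "icon.gif")
```

Keeping each keyword's occurrences in a list preserves the order of repeated statements, which matters for the plugin list and for multiple classify lines.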
466 | <br>MG also compresses the text of the collection; and the<br>image files are linked into the <i>index</i> subdirectory. Now<br>none of the material in the <i>import</i> and <i>archives</i> directories<br>is needed to run the collection and can be removed from<br>the file system (though they would be needed if the<br>collection were rebuilt).<br>
|
---|
467 | <br>Associated with each collection is a database stored in<br>GDBM (GNU database manager) format. This contains an<br>entry for each document, giving its OID, its internal MG<br>document number, and metadata such as title. Information<br>for each of the browsing indexes, which appear as buttons<br>on the Greenstone search/browse bar, is also extracted<br>during the building process and stored in the database. A<br>“classifier” program is required for each browsing index to<br>extract the appropriate information from GML documents.<br>Like plugins, classifiers are written on an <i>ad hoc</i> basis for<br>the particular information required, and where possible<br>reused from one collection to another.<br>
|
---|
468 | <br>The building program creates the indexes based on<br>whatever appears in the <i>archives</i> directory. The first plugin<br>specified by all collections is one that processes GML<br>files, and so if <i>archives</i> contains imported files they will be<br>processed correctly. If it contains material in the original<br>format, that will be converted using the appropriate plugin.<br>Thus the import process is optional.<br>
|
---|
469 | <br>GML is designed to be fast and easy to parse, an important<br>requirement when millions of documents are to be<br>processed. Something as simple as requiring tags to be<br>lower-case, for example, yields a substantial speed-up. In<br>certain circumstances, however, it might be preferable to<br>use a standardized format such as XML. This is<br>straightforward to implement (just write an XML<br>plugin), although we have not done so ourselves. Given<br>the transitory nature of the imported data, to date, we have<br>found GML a satisfactory and beneficial format.<br>
|
---|
470 | <b>CREATING NEW COLLECTIONS</b><br>
|
---|
471 | <br>Building new collections from scratch is only slightly<br>different from updating an existing collection. The key<br>new requirement is creating a collection configuration file,<br>and a software utility is provided to help. Two pieces of<br>information are required for this: the name of the directory<br>that the collection will use (into which the source data and<br>other files will eventually be placed), and a contact e-mail<br>address for use if any problems are encountered by the<br>software once the collection is up and running. The utility<br>creates files and directories within the newly-named<br>directory to support a generic collection of plain text<br>documents. With suitable data placed in the <i>import</i><br>directory, building the collection at this point will yield a<br>document-level searchable index of all the text and a<br>browsable list of “titles” (defined in this case to be the<br>document filenames).<br>
|
---|
472 | <br>To enhance the functionality and presentation (something<br>anything but the most trivial collection will require), the<br>configuration file must be edited. For a collection sourced<br>from documents in an already supported data format,<br>presented in a similar fashion to an existing collection, the<br>
|
---|
523 | <hr>
|
---|
524 | <A name=7></a><IMG src="_httpdocimg_/pdf01-7_1.jpg"><br>
|
---|
525 | <b>Figure 6: Searching bookmarked Web pages</b><br>
|
---|
526 | amount of editing is minimal. Importing new data formats<br>and browsing metadata in ways not currently supported are<br>more complex activities that require programming skills.<br>
|
---|
527 | <br><b>Modifying the configuration file</b><br>
|
---|
528 | <br>Figure 5b shows simple alterations to the generic<br>configuration file in Figure 5a that was generated by the<br>new-collection utility. <i>TEXTPlug</i> is replaced with<br><i>EMAILPlug</i> (line 7) which reads email files and extracts<br>metadata (<i>From</i>, <i>To</i>, <i>Date</i>, <i>Subject</i>) from them. A classifier<br>for dates is added (line 10) to make the collection<br>browsable chronologically. The default presentation of<br>search results is overridden (line 17) to display both the<br>title of the message (i.e. Dublin Core <i>Title</i>) and its sender<br>(i.e. Dublin Core <i>Author</i>). Elements in square brackets,<br>such as <i>[Title]</i>, are replaced by the metadata associated<br>with a particular document. The built-in term <i>[icon]</i><br>produces a suitable image that represents the document<br>(such as a book icon or page icon), and the <i>[link]…[/link]</i><br>construct forms a hyperlink to the complete document.<br>Anything else in the format statement, which in this case is<br>solely table-cell tags in HTML, is passed through to the<br>page being displayed.<br>
|
---|
529 | As this example shows, creating a new collection that stays<br>within the bounds of the library’s established capabilities<br>falls within the capability of many computer users (for<br>instance, computer-trained librarians). Extending<br>Greenstone to handle new document formats and browse<br>metadata in new ways is more challenging.<br>
|
---|
530 | <br><b>Writing new plugins and classifiers</b><br>
|
---|
531 | <br>Extensibility is obtained through plugins and classifiers.<br>These are modules of code that can be slotted into the<br>system to enhance its capabilities. Plugins parse<br>documents, extracting the text and metadata to be indexed.<br>Classifiers control how metadata is brought together to<br>form browsable data structures. Both are specified in an<br>object-oriented framework using inheritance to minimize<br>the amount of code written.<br>
|
---|
532 | <br>A plugin must specify three things: what file formats it can<br>handle, how they should be parsed, and whether the plugin<br>is recursive. File formats are normally determined using<br>regular expression matching on the filename. For example,<br>the HTML plugin accepts all files that end in <i>.htm</i>, <i>.html</i>,<br><i>.HTM</i>, or <i>.HTML</i>. (It is quite possible, however, to write<br>plugins that “look inside” the file as well.) For other files,<br>the plugin returns <i>undefined</i> and the file is passed to the<br>next plugin in the collection’s configuration file (e.g.<br>Figure 5 line 7). If it can handle the file, the plugin parses it and<br>returns the number of documents processed. This involves<br>extracting text and metadata and adding it to the library’s<br>content through calls to <i>add text</i> and <i>add metadata</i>.<br>
|
---|
533 | <br>Some plugins (“recursive” ones) add extra files into the<br>stream of data processed during the building phase by<br>artificially reactivating the list of plugins. This is how<br>directory hierarchies are traversed.<br>
|
---|
534 | <br>Plugins are small modules of code that are easy to write.<br>We monitored the time it took to develop a new one that<br>was different to any we had produced so far. We chose to<br>make as an example a collection of HTML bookmark files,<br>the motivation being to produce a convenient way of<br>searching and browsing one’s bookmarked Web pages.<br>Figure 6 shows a user searching for bookmarked pages<br>about <i>music</i>. The new plugin took under an hour to write,<br>and was 160 lines long (ignoring blank lines and<br>comments), about the average length of existing plugins.<br>
|
---|
535 | <br>Classifiers are more general than plugins because they<br>work on GML-format data. For example, any plugin that<br>generates date metadata in accordance with the Dublin<br>Core can request the collection to be browsable<br>chronologically by specifying the <i>DateList</i> classifier in the<br>collection’s configuration file (Figure 7). Classifiers are<br>more elaborate than most plugins, but new ones are seldom<br>required. The average length of existing classifiers is 230<br>lines.<br>
|
---|
536 | <br>Classifiers must specify three things: an initialization<br>routine, how individual documents are classified, and the<br>final browsable data structure. Initialization takes care of<br>any options specified in the configuration file (such as<br><i>metadata=Title</i> on line 9 of Figure 5b). Classifying<br>individual documents is an iterative process: for each one,<br>a call to <i>document-classify</i> is made. On presentation of the<br>document’s OID, the necessary metadata is located and<br>used to control where the document is added to the<br>browsable data structure being constructed.<br>
|
---|
537 | <br>Once all documents have been added, a request is made for<br>the completed data structure. Some classifiers return the<br>data structure directly; others transform the data structure<br>before it is returned. For example, the <i>AZList</i> classifier<br>
|
---|
589 | <hr>
|
---|
590 | <A name=8></a><IMG src="_httpdocimg_/pdf01-8_1.jpg"><br>
|
---|
591 | <b>Figure 7: Browsing a newspaper collection by date</b><br>
|
---|
592 | divides the alphabetically sorted list of metadata into<br>separate pages of about the same size and returns the<br>alphabetic ranges for each one (Figure 4).<br>
|
---|
593 | <b>OVERVIEW OF RELATED WORK</b><br>
|
---|
594 | Two projects that provide substantial open source digital<br>library software are Dienst (Lagoze and Fielding, 1998)<br>and Harvest (Bowman <i>et al.</i>, 1994). The origins of Dienst<br>(<i>www.cs.cornell.edu/cdlrg</i>) stretch back to 1992. The term<br>has come to represent three entities: a conceptual<br>architecture for distributed digital libraries; an open<br>protocol for service communication; and a software<br>system that implements the protocol. To date, five sample<br>digital libraries have been built using this technology.<br>They manifest themselves in two forms: technical reports<br>and primary source documents.<br>
|
---|
595 | Best known is NCSTRL, the Networked Computer<br>Science Technical Reference Library project<br>(<i>www.ncstrl.org</i>). This collection facilitates searching by<br>title, author and abstract, and browsing by year and author,<br>across a distributed network of document repositories.<br>Documents can (where supported) be delivered in various<br>formats such as PostScript, a thumbnail overview of the<br>pages, and a GIF image of a particular page.<br>
|
---|
596 | The <i>Making of America</i> resource is an example of a<br>collection based around primary sources, in this case<br>American social history, 1830–1900. It has a different<br>“look and feel” to NCSTRL, being strongly oriented<br>toward browsing rather than searching. A user navigates<br>their way through a hierarchical structure of hyperlinks to<br>reach a book of interest. The book itself is a series of<br>scanned images: delivery options include going directly to<br>a page number, next and previous page buttons, and<br>displaying a particular page at different resolutions. A text<br>version of the page is also available upon which a<br>searching option is also provided.<br>
|
---|
597 | Started in 1994, Harvest is also a long-running research<br>project. It provides an efficient means of gathering source<br>data from the Internet and distributing indexing<br>information over the Internet. This is accomplished<br>through five components: <i>gatherer</i>, <i>broker</i>, <i>indexer</i>,<br><i>replicator</i> and <i>cache</i>. The first three are central to creating,<br>updating and searching a collection; the last two help to<br>improve performance over the Internet through transparent<br>mirroring and caching techniques.<br>
|
---|
598 | The system is configurable and customizable. While<br>searching is most commonly implemented using Glimpse<br>(<i>glimpse.cs.arizona.edu</i>), in principle any search engine<br>that supports incremental updates and Boolean<br>combinations of attribute-based queries can be used. It is<br>possible to control what type of documents are gathered<br>during creation and updating, and how the query interface<br>looks and is laid out.<br>
|
---|
599 | Sample collections cited by the developers include 21,000<br>computer science technical reports and 7,000 home pages.<br>Other examples include a sizable collection of agriculture-<br>related electronic journals and magazines called “tomato-<br>juice” (accessed through <i>hegel.lib.ncsu.edu</i>) and a full-text<br>index of library-related electronic serials<br>(<i>sunsite.berkeley.edu/IndexMorganagus</i>). Harvest is also<br>often used to index Web sites (for example<br><i>www.middlebury.edu</i>).<br>
|
---|
600 | Comparing Greenstone with Dienst and Harvest, there are<br>both similarities and differences. All provide substantial<br>digital library systems, hence common themes recur, but<br>they are driven by projects with different aims. Harvest,<br>for instance, was not conceived as a digital library project<br>at all, but by virtue of its selective document gathering<br>process it can be classed (and is used) as one. While it<br>provides sophisticated search options, it lacks the<br>complementary service of browsing. Furthermore it adds<br>no structure or order to the documents collected, relying<br>on whatever structures are present in the site that they<br>were gathered from. A proven strength of the design is its<br>flexibility through configuration and customization, an<br>element also present in Greenstone.<br>
|
---|
601 | Dienst (best exemplified through the NCSTRL<br>work) supports searching and browsing, like Greenstone.<br>Both use open protocols. Differences include a high<br>reliance in Dienst on user-supplied information when a<br>document is added, and a smaller range of document types<br>supported, although Dienst does include a document<br>model that should, over time, allow this to expand with<br>relative ease.<br>
|
---|
602 | There are also commercial systems that provide similar<br>digital library services to those described. However, since<br>
|
---|
659 | <hr>
|
---|
660 | <A name=9></a>corporate culture instills proprietary attitudes there is little<br>opportunity for advancement through a shared<br>collaborative effort. Consequently they are not reviewed<br>here.<br>
|
---|
661 | <b>CONCLUSIONS</b><br>
|
---|
662 | Greenstone is a comprehensive software system for<br>creating digital library collections. It builds data structures<br>for searching and browsing from the material provided,<br>rather than relying on any hand-crafting. The process is<br>controlled by a configuration file, and once a collection<br>exists new material can be added completely<br>automatically. Browsing is based on Dublin Core<br>metadata.<br>
|
---|
663 | New collections can be developed easily, particularly if<br>they resemble existing ones. Extensibility is achieved<br>through software “plugins” that can be written to<br>accommodate documents, and metadata, in different<br>formats. Standard plugins exist for many document types;<br>new ones are easily written. Browsing is controlled by<br>“classifiers” that process metadata into browsing structures<br>(by date, alphabetical, hierarchical, etc).<br>
|
---|
664 | However, the most powerful support for extensibility is<br>achieved not by technical means but by making the source<br>code freely available under the GNU General Public License. Only<br>through an international cooperative effort will digital<br>library software become sufficiently comprehensive to<br>meet the world’s needs with the richness and flexibility<br>that users deserve.<br>
|
---|
665 | <b>ACKNOWLEDGMENTS</b><br>
|
---|
666 | We gratefully acknowledge all those who have worked on<br>the Greenstone software, and all members of the New<br>Zealand Digital Library project for their enthusiasm and<br>ideas.<br>
|
---|
667 | <b>REFERENCES</b><br>
|
---|
668 | 1. Akscyn, R.M. and Witten, I.H. (1998) “Report on First<br>Summit on International Cooperation on Digital<br>Libraries.” ks.com/idla-wp-oct98.<br>
|
---|
669 | 2. Bowman, C.M., Danzig, P.B., Manber, U., and<br>Schwartz, M.F. (1994) “Scalable Internet resource discovery:<br>Research problems and approaches.” <i>Communications</i><br><i>of the ACM</i>, Vol. 37, No. 8, pp. 98–107.<br>
|
---|
670 | 3. Fox, E. (1998) “Digital library definitions.”<br>ei.cs.vt.edu/~fox/dlib/def.html.<br>
|
---|
671 | 4. Humanity Libraries (1998) <i>Humanity Development</i><br><i>Library</i>. CD-ROM produced by the Global Help<br>Project, Antwerp, Belgium.<br>
|
---|
672 | 5. Lagoze, C. and Fielding, D. (1998) “Defining Collections in<br>Distributed Digital Libraries.” <i>D-Lib Magazine</i>, Nov.<br>
|
---|
673 | 6. PAHO (1999) <i>Virtual Disaster Library</i>. CD-ROM<br>produced by the Pan-American Health Organization,<br>Washington DC, USA.<br>
|
---|
674 | 7. McNab, R.J., Witten, I.H. and Boddie, S.J. (1998) “A<br>distributed digital library architecture incorporating<br>different index styles.” <i>Proc IEEE Advances in Digital</i><br><i>Libraries</i>, Santa Barbara, CA, pp. 36–45.<br>
|
---|
675 | 8. Nevill-Manning, C.G., Reed, T., and Witten, I.H.<br>(1998) “Extracting text from PostScript.”<br><i>Software: Practice and Experience</i>, Vol. 28, No. 5, pp.<br>481–491; April.<br>
|
---|
676 | 9. UNESCO (1999) <i>SAHEL point DOC: Anthologie du</i><br><i>développement au Sahel</i>. CD-ROM produced by<br>UNESCO, Paris, France.<br>
|
---|
677 | 10. UNU (1998) <i>Collection on critical global issues.</i> CD-ROM<br>produced by the United Nations University<br>Press, Tokyo, Japan.<br>
|
---|
678 | 11. Witten, I.H., Moffat, A. and Bell, T. (1999) <i>Managing</i><br><i>Gigabytes: compressing and indexing documents and<br>images</i>, Morgan Kaufmann, second edition.<br>
|
---|
725 | <hr>
|
---|
726 |
|
---|
727 |
|
---|
728 | </Content>
|
---|
729 | </Section>
|
---|
730 | </Archive>
|
---|