source: other-projects/nightly-tasks/diffcol/trunk/model-collect/Enhanced-PDF/archives/HASH1a9c.dir/doc.xml@ 34934

Last change on this file since 34934 was 34934, checked in by anupama, 3 years ago

AUTOCOMMIT by gen-model-colls.sh script. Message: Rebuilding model-collections after having committed the new EXIF that Kathy added and the mods we've made to the EmbeddedMetadataPlugin to fix the problem Diego found of incorrect or incorrectly extracted EXIF metadata values.

File size: 54.5 KB
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
 <Description>
 <Metadata name="gsdldoctype">indexed_doc</Metadata>
 <Metadata name="Language">en</Metadata>
 <Metadata name="Encoding">utf8</Metadata>
 <Metadata name="URL">http://Scratch/ak19/gs2-diffcol-26Apr2019/collect/Enhanced-PDF/tmp/1614479617/pdf01.html</Metadata>
 <Metadata name="UTF8URL">http://Scratch/ak19/gs2-diffcol-26Apr2019/collect/Enhanced-PDF/tmp/1614479617/pdf01.html</Metadata>
 <Metadata name="Title">Greenstone: A Comprehensive Open-Source Digital Library Software System Ian H. Witten,* Rodger J....</Metadata>
 <Metadata name="gsdlsourcefilename">import/pdf01.pdf</Metadata>
 <Metadata name="gsdlsourcefilerenamemethod">url</Metadata>
 <Metadata name="gsdlconvertedfilename">tmp/1614479617/pdf01.html</Metadata>
 <Metadata name="OrigSource">pdf01.html</Metadata>
 <Metadata name="Source">pdf01.pdf</Metadata>
 <Metadata name="SourceFile">pdf01.pdf</Metadata>
 <Metadata name="Plugin">PDFPlugin</Metadata>
 <Metadata name="FileSize">269487</Metadata>
 <Metadata name="FilenameRoot">pdf01</Metadata>
 <Metadata name="FileFormat">PDF</Metadata>
 <Metadata name="srcicon">_iconpdf_</Metadata>
 <Metadata name="srclink_file">doc.pdf</Metadata>
 <Metadata name="srclinkFile">doc.pdf</Metadata>
 <Metadata name="NumPages">9</Metadata>
 <Metadata name="gsdlthistype">Paged</Metadata>
 <Metadata name="ex.ExifTool.ExifToolVersion">12.19</Metadata>
 <Metadata name="ex.File.Directory">/Scratch/ak19/gs2-diffcol-26Apr2019/collect/Enhanced-PDF/import</Metadata>
 <Metadata name="ex.File.FileAccessDate">2021:02:28 15:33:26+13:00</Metadata>
 <Metadata name="ex.File.FileInodeChangeDate">2021:02:28 15:32:36+13:00</Metadata>
 <Metadata name="ex.File.FileModifyDate">2021:02:28 15:32:36+13:00</Metadata>
 <Metadata name="ex.File.FileName">pdf01.pdf</Metadata>
 <Metadata name="ex.File.FilePermissions">100664</Metadata>
 <Metadata name="ex.File.FileSize">269487</Metadata>
 <Metadata name="ex.File.FileType">PDF</Metadata>
 <Metadata name="ex.File.FileTypeExtension">PDF</Metadata>
 <Metadata name="ex.File.MIMEType">application/pdf</Metadata>
 <Metadata name="ex.PDF.Author">Bronwyn</Metadata>
 <Metadata name="ex.PDF.CreateDate">2000:03:02 15:21:24</Metadata>
 <Metadata name="ex.PDF.Creator">Microsoft Word</Metadata>
 <Metadata name="ex.PDF.Linearized">false</Metadata>
 <Metadata name="ex.PDF.PDFVersion">1.2</Metadata>
 <Metadata name="ex.PDF.PageCount">9</Metadata>
 <Metadata name="ex.PDF.Producer">Acrobat PDFWriter 4.0 for Power Macintosh</Metadata>
 <Metadata name="Identifier">HASH1a9cea0f239f754007681b</Metadata>
 <Metadata name="lastmodified">1614479556</Metadata>
 <Metadata name="lastmodifieddate">20210228</Metadata>
 <Metadata name="oailastmodified">1614479617</Metadata>
 <Metadata name="oailastmodifieddate">20210228</Metadata>
 <Metadata name="assocfilepath">HASH1a9c.dir</Metadata>
 <Metadata name="gsdlassocfile">pdf01-2_1.jpg:image/jpeg:</Metadata>
 <Metadata name="gsdlassocfile">pdf01-3_1.jpg:image/jpeg:</Metadata>
 <Metadata name="gsdlassocfile">pdf01-4_1.jpg:image/jpeg:</Metadata>
 <Metadata name="gsdlassocfile">pdf01-5_1.jpg:image/jpeg:</Metadata>
 <Metadata name="gsdlassocfile">pdf01-7_1.jpg:image/jpeg:</Metadata>
 <Metadata name="gsdlassocfile">pdf01-8_1.jpg:image/jpeg:</Metadata>
 <Metadata name="gsdlassocfile">doc.pdf:application/pdf:</Metadata>
 </Description>
 <Content>
</Content>
<Section>
 <Description>
 <Metadata name="Title">1</Metadata>
 </Description>
 <Content>&lt;br /&gt;
&lt;b&gt;Greenstone: A Comprehensive Open-Source&lt;/b&gt;&lt;br&gt;
&lt;b&gt;Digital Library Software System&lt;/b&gt;&lt;br&gt;
&lt;i&gt;Ian H. Witten,* Rodger J. McNab,† Stefan J. Boddie,* David Bainbridge*&lt;/i&gt;&lt;br&gt;
* Dept of Computer Science&lt;br&gt;
University of Waikato, New Zealand&lt;br&gt;
E-mail: {ihw, sjboddie, davidb}@cs.waikato.ac.nz&lt;br&gt;
† Digilib Systems&lt;br&gt;
Hamilton, New Zealand&lt;br&gt;
E-mail: [email protected]&lt;br&gt;
&lt;b&gt;ABSTRACT&lt;/b&gt;&lt;br&gt;
This paper describes the Greenstone digital library&lt;br&gt;
software, a comprehensive, open-source system for the&lt;br&gt;
construction and presentation of information collections.&lt;br&gt;
Collections built with Greenstone offer effective full-text&lt;br&gt;searching and metadata-based browsing facilities that are&lt;br&gt;
attractive and easy to use. Moreover, they are easily&lt;br&gt;
maintainable and can be augmented and rebuilt entirely&lt;br&gt;
automatically. The system is extensible: software&lt;br&gt;
“plugins” accommodate different document and metadata&lt;br&gt;
types.&lt;br&gt;
&lt;b&gt;INTRODUCTION&lt;/b&gt;&lt;br&gt;
Notwithstanding intense research activity in the digital&lt;br&gt;
library field during the second half of the 1990s,&lt;br&gt;comprehensive software systems for creating digital&lt;br&gt;
libraries are not widely available. In fact, the usual solution&lt;br&gt;
when creating a digital library is also the most&lt;br&gt;
obvious—just put it on the Web. But consider how much&lt;br&gt;
effort is involved in constructing a Web site for a digital&lt;br&gt;
library. To be effective it needs to be visually attractive&lt;br&gt;
and ergonomically easy to use, incorporate convenient and&lt;br&gt;
powerful searching capabilities, and offer rich and natural&lt;br&gt;
browsing facilities. Above all it must be easy to maintain&lt;br&gt;
and augment, which presents a significant challenge if any&lt;br&gt;
manual organization is involved.&lt;br&gt;
The alternative is to automate these activities through&lt;br&gt;
software tools. But the broad scope of digital library&lt;br&gt;
requirements makes this a daunting prospect. Ideally the&lt;br&gt;
software should incorporate facilities ranging from&lt;br&gt;
multilingual information retrieval to distributed computing&lt;br&gt;protocols, from interoperability to search engine&lt;br&gt;
technology, from metadata standards to multiformat&lt;br&gt;
document parsing, from multimedia to multiple operating&lt;br&gt;
systems, from Web browsers to plug-and-play DVDs.&lt;br&gt;
The Greenstone Digital Library Software from the New&lt;br&gt;
Zealand Digital Library (NZDL) project tackles this issue&lt;br&gt;
by providing a new way of organizing information and&lt;br&gt;
making it available over the Internet. A &lt;i&gt;collection&lt;/i&gt; of&lt;br&gt;
information comprises several (typically several thousand,&lt;br&gt;
or several million) &lt;i&gt;documents&lt;/i&gt;, and a uniform interface is&lt;br&gt;provided to all documents in a collection. A library may&lt;br&gt;
include many different collections, each organized&lt;br&gt;differently—though there is a strong family resemblance in&lt;br&gt;
how collections are presented.&lt;br&gt;
Making information available using this system is far more&lt;br&gt;
than “just putting it on the Web.” The collection becomes&lt;br&gt;
maintainable, searchable, and browsable. Each collection,&lt;br&gt;
prior to presentation, undergoes a “building” process that,&lt;br&gt;
once established, is completely automatic. This process&lt;br&gt;
creates all the structures that are used at run-time for&lt;br&gt;
accessing the collection. Searching is based on various&lt;br&gt;
indexes, while browsing is based on various metadata;&lt;br&gt;
support structures for both are created during the building&lt;br&gt;
operation. When new material appears it can be fully&lt;br&gt;
incorporated into the collection by rebuilding.&lt;br&gt;
To address the exceptionally broad demands of digital&lt;br&gt;
libraries, the system is public and extensible. It is issued&lt;br&gt;
under the Gnu public license and, in the spirit of open-&lt;br&gt;
source software, users are invited to contribute&lt;br&gt;modifications and enhancements. Only through an&lt;br&gt;international cooperative effort will digital library software&lt;br&gt;become sufficiently comprehensive to meet the world’s&lt;br&gt;needs. Currently the Greenstone software is used at sites in&lt;br&gt;Canada, Germany, New Zealand, Romania, UK, and the&lt;br&gt;US, and collections range from newspaper articles to&lt;br&gt;technical documents, from educational journals to oral&lt;br&gt;history, from visual art to folksongs. The software has&lt;br&gt;been used for collections in many different languages, and&lt;br&gt;for CD-ROMs that have been published by the United&lt;br&gt;Nations and other humanitarian agencies in Belgium,&lt;br&gt;France, Japan, and the US for distribution in developing&lt;br&gt;countries (Humanity Libraries, 1998; PAHO, 1999;&lt;br&gt;UNESCO, 1999; UNU, 1998). Further details can be&lt;br&gt;obtained from &lt;i&gt;www.nzdl.org&lt;/i&gt;.&lt;br&gt;
&lt;hr&gt;
</Content>
</Section>
<Section>
 <Description>
 <Metadata name="Title">2</Metadata>
 </Description>
 <Content>&lt;br /&gt;
&lt;IMG src=&quot;_httpdocimg_/pdf01-2_1.jpg&quot;&gt;&lt;br&gt;
&lt;b&gt;Figure 1: Searching the HDL collection&lt;/b&gt;&lt;br&gt;
This paper sets the scene with a brief discussion of what a&lt;br&gt;
digital library is. We then give an overview of the facilities&lt;br&gt;
offered by Greenstone and show how end users find&lt;br&gt;information in collections. Next we describe the files and&lt;br&gt;
directories involved in a collection, and then discuss the&lt;br&gt;
processes of updating existing collections and creating new&lt;br&gt;
ones, including extending the software to provide new&lt;br&gt;
facilities. We conclude with an overview of related work.&lt;br&gt;
&lt;b&gt;WHAT IS A DIGITAL LIBRARY?&lt;/b&gt;&lt;br&gt;
 &lt;br&gt;Ten definitions of the term “digital library” have been&lt;br&gt;
culled from the literature by Fox (1998), and their spirit is&lt;br&gt;
captured in the following brief characterization:&lt;br&gt;
 &lt;br&gt;
&lt;i&gt;A collection of digital objects, including text,&lt;/i&gt;&lt;br&gt;
&lt;i&gt;video, and audio, along with methods for access&lt;/i&gt;&lt;br&gt;
&lt;i&gt;and retrieval, and for selection, organization&lt;br&gt;and maintenance of the collection&lt;/i&gt;&lt;br&gt;
 &lt;br&gt;
(Akscyn and Witten, 1998). Lesk (1998) views digital&lt;br&gt;
libraries as “organized collections of digital information,”&lt;br&gt;
and wisely recommends that they articulate the principles&lt;br&gt;
governing what is included and how the collection is&lt;br&gt;
organized.&lt;br&gt;
 &lt;br&gt;Digital libraries are generally distinguished from the&lt;br&gt;
World-Wide Web, the essential difference being in&lt;br&gt;
selection and organization. But they are not generally&lt;br&gt;
distinguished from a web &lt;i&gt;site&lt;/i&gt;: indeed, virtually all extant&lt;br&gt;
digital libraries manifest themselves as a web site. Hence&lt;br&gt;
the obvious question: to make a digital library, why not&lt;br&gt;
just put the information on the Web?&lt;br&gt;
 &lt;br&gt;
But we make a distinction between a digital library and a&lt;br&gt;
web site that lies at the heart of our software design: one&lt;br&gt;
should easily be able to add new material to a library&lt;br&gt;
without having to integrate it manually or edit its content&lt;br&gt;in any way. Once added, new material should immediately&lt;br&gt;
become a first-class component of the library. And what&lt;br&gt;permits it to be integrated into existing searching and&lt;br&gt;browsing structures without any manual intervention is&lt;br&gt;&lt;i&gt;metadata&lt;/i&gt;. This provides sufficient focus to the concept of&lt;br&gt;“digital library” to support the development of a&lt;br&gt;construction kit.&lt;br&gt;
&lt;b&gt;OVERVIEW OF GREENSTONE&lt;/b&gt;&lt;br&gt;
 &lt;br&gt;Information collections built by Greenstone combine&lt;br&gt;extensive full-text search facilities with browsing indexes&lt;br&gt;based on different metadata types. There are several ways&lt;br&gt;for users to find information, although they differ between&lt;br&gt;collections depending on the metadata available and the&lt;br&gt;collection design. Typically you can &lt;i&gt;search for particular&lt;br&gt;words&lt;/i&gt; that appear in the text, or within a section of a&lt;br&gt;document, or within a title or section heading. You can&lt;br&gt;&lt;i&gt;browse documents by title&lt;/i&gt;: just click on the displayed book&lt;br&gt;icon to read it. You can &lt;i&gt;browse documents by subject&lt;/i&gt;.&lt;br&gt;Subjects are represented by bookshelves: just click on a&lt;br&gt;shelf to see the books. Where appropriate, documents&lt;br&gt;
come complete with a table of contents (constructed&lt;br&gt;automatically): you can click on a chapter or subsection to&lt;br&gt;
open it, expand the full table of contents, or expand the full&lt;br&gt;
document.&lt;br&gt;
 &lt;br&gt;An example of searching is shown in Figure 1 where&lt;br&gt;
documents in the Global Help Project’s Humanity&lt;br&gt;
Development Library (HDL) are being searched for&lt;br&gt;
chapters matching the word &lt;i&gt;butterfly&lt;/i&gt;. In Figure 2 the same&lt;br&gt;
collection is being browsed by subject: by clicking on the&lt;br&gt;bookshelf icons the user has discovered an item under&lt;br&gt;
Section 16, Animal Husbandry. Pursuing an interest in&lt;br&gt;butterfly farming, the user selects a book by clicking on its&lt;br&gt;
book icon. In Figure 3 the front cover of the book is&lt;br&gt;
displayed as a graphic on the left, and the automatically&lt;br&gt;
constructed table of contents appears at the start of the&lt;br&gt;
document. The current focus, &lt;i&gt;Introduction and Summary&lt;/i&gt;,&lt;br&gt;
is shown in bold in the table of contents with its text&lt;br&gt;
starting further down the page.&lt;br&gt;
 &lt;br&gt;In accordance with Lesk’s advice, a statement of purpose&lt;br&gt;
and coverage accompanies each collection, along with an&lt;br&gt;
explanation of how it is organized (Figure 1 shows the&lt;br&gt;
start of this). A distinction is made between &lt;i&gt;searching&lt;/i&gt; and&lt;br&gt;
&lt;i&gt;browsing&lt;/i&gt;. Searching is full-text, and—depending on the&lt;br&gt;
collection’s design—the user can choose between indexes&lt;br&gt;
built from different parts of the documents, or from&lt;br&gt;
different metadata. Some collections have an index of full&lt;br&gt;
documents, an index of sections, an index of paragraphs,&lt;br&gt;
an index of titles, and an index of section headings, each of&lt;br&gt;
which can be searched for particular words or phrases.&lt;br&gt;
Browsing involves data structures created from metadata&lt;br&gt;
that the user can examine: lists of authors, lists of titles,&lt;br&gt;
lists of dates, hierarchical classification structures, and so&lt;br&gt;
on. Data structures for both browsing and searching are&lt;br&gt;
built according to instructions in a configuration file,&lt;br&gt;
which controls both building and serving the collection.&lt;br&gt;
Sample configuration files are discussed below.&lt;br&gt;
&lt;hr&gt;
</Content>
</Section>
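<!--
The browsing structures described in this section (lists of authors, lists of titles, and so on) are derived from metadata. The following is a minimal Python sketch of how one such list might be built, assuming each document carries a mapping from metadata field names to lists of values. The function name classify and the sample records are hypothetical illustrations, not part of Greenstone's own interface.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical in-memory form: document id mapped to {field name: [values]}.
Documents = Dict[str, Dict[str, List[str]]]


def classify(documents: Documents, field: str) -> Dict[str, List[str]]:
    """Group document ids under each value of one metadata field.

    Values and ids are both sorted so the result can be shown as an
    alphabetical browse list.
    """
    buckets = defaultdict(list)
    for doc_id, metadata in documents.items():
        for value in metadata.get(field, []):
            buckets[value].append(doc_id)
    return {value: sorted(ids) for value, ids in sorted(buckets.items())}


if __name__ == "__main__":
    sample = {
        "d1": {"Title": ["Butterfly Farming"], "Author": ["X", "Y"]},
        "d2": {"Title": ["Animal Husbandry"], "Author": ["X"]},
    }
    for field in ("Title", "Author"):
        print(field, classify(sample, field))
```

A document with several values for a field appears under each of them, and a document lacking the field is simply left out of that list.
-->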
<Section>
 <Description>
 <Metadata name="Title">3</Metadata>
 </Description>
 <Content>&lt;br /&gt;
&lt;IMG src=&quot;_httpdocimg_/pdf01-3_1.jpg&quot;&gt;&lt;br&gt;
&lt;b&gt;Figure 2: Browsing the HDL collection by subject&lt;/b&gt;&lt;br&gt;
 &lt;br&gt;Rich browsing facilities can be provided by manually&lt;br&gt;
linking parts of documents together and building explicit&lt;br&gt;
indexes and tables of contents. However, manually-created&lt;br&gt;
linking becomes difficult to maintain, and often falls into&lt;br&gt;
disrepair when a collection expands. The Greenstone&lt;br&gt;
software takes a different tack: it facilitates &lt;i&gt;maintainability&lt;/i&gt;&lt;br&gt;
by creating all searching and browsing structures&lt;br&gt;
automatically from the documents themselves. No links&lt;br&gt;
are inserted by hand. This means that when new&lt;br&gt;
documents in the same format become available, they can&lt;br&gt;
be added automatically. Indeed, for some collections this is&lt;br&gt;
done by processes that wake up regularly, scout for new&lt;br&gt;
material, and rebuild the indexes—all without manual&lt;br&gt;
intervention.&lt;br&gt;
Collections comprise many documents: thousands, tens of&lt;br&gt;
thousands, or even millions. Each document may be&lt;br&gt;
hierarchically organized into &lt;i&gt;sections&lt;/i&gt; (subsections, sub-&lt;br&gt;
subsections, and so on). Each section comprises one or&lt;br&gt;
more &lt;i&gt;paragraphs&lt;/i&gt;. Metadata such as author, title, date,&lt;br&gt;
keywords, and so on, may be associated with documents,&lt;br&gt;
or with individual sections of documents. This is the raw&lt;br&gt;
material for indexes. It must either be provided explicitly&lt;br&gt;
for each document and section (for example, in an&lt;br&gt;
accompanying spreadsheet) or be derivable automatically&lt;br&gt;
from the source documents. Metadata is converted to&lt;br&gt;
Dublin Core and stored with the document for internal use.&lt;br&gt;
 &lt;br&gt;In order to accommodate different kinds of source&lt;br&gt;
documents, the software is organized so that “plugins” can&lt;br&gt;
be written for new document types. Plugins exist for plain&lt;br&gt;
text documents, HTML documents, email documents, and&lt;br&gt;
bibliographic formats. Word documents are handled by&lt;br&gt;
saving them as HTML; PostScript ones by applying a&lt;br&gt;
preprocessor (Nevill-Manning &lt;i&gt;et al&lt;/i&gt;., 1998). Specially&lt;br&gt;
written plugins also exist for proprietary formats such as&lt;br&gt;
that used by the BBC archives department. A collection&lt;br&gt;
may have source documents in different forms: it is just a&lt;br&gt;
matter of specifying all the necessary plugins. In order to&lt;br&gt;build browsing indexes from metadata, an analogous&lt;br&gt;scheme of “classifiers” is used: classifiers create indexes&lt;br&gt;of various kinds based on metadata. Source documents are&lt;br&gt;brought into the Greenstone system through a process&lt;br&gt;called &lt;i&gt;importing&lt;/i&gt;, which uses the plugins and classifiers&lt;br&gt;specified in the collection configuration file.&lt;br&gt;
 &lt;br&gt;The international Unicode character set is used throughout,&lt;br&gt;so documents—and interfaces—can be written in any&lt;br&gt;language. Collections have so far been produced in&lt;br&gt;English, French, Spanish, German, Maori, Chinese, and&lt;br&gt;Arabic. The NZDL Web site provides numerous examples.&lt;br&gt;Collections can contain text, pictures, and even audio and&lt;br&gt;video clips; a text-only version of the interface is also&lt;br&gt;provided to accommodate visually impaired users.&lt;br&gt;Compression technology is used to ensure best use of&lt;br&gt;storage (Witten &lt;i&gt;et al.&lt;/i&gt;, 1999). Most non-textual material is&lt;br&gt;either linked to textual documents or accompanied by&lt;br&gt;textual descriptions (such as photo captions) to allow full-&lt;br&gt;text searching and browsing. However, the architecture&lt;br&gt;
permits the implementation of plugins and classifiers even&lt;br&gt;for non-textual data.&lt;br&gt;
 &lt;br&gt;
The system includes an “administrative” function whereby&lt;br&gt;
specified users can examine the composition of all&lt;br&gt;
collections, protect documents so that they can only be&lt;br&gt;
accessed by registered users on presentation of a password,&lt;br&gt;
and so on. Logs of user activity are kept that record all&lt;br&gt;
queries made to every Greenstone collection (though this&lt;br&gt;
facility can be disabled).&lt;br&gt;
 &lt;br&gt;Although primarily designed for Internet access over the&lt;br&gt;
World-Wide Web, collections can be made available, in&lt;br&gt;
precisely the same form, on CD-ROM. In either case they&lt;br&gt;
are accessed through any Web browser. Greenstone CD-&lt;br&gt;
ROMs operate on a standalone PC under Windows 3.X,&lt;br&gt;
95, 98, and NT, and the interaction is identical to accessing&lt;br&gt;
the collection on the Web—except that response is faster&lt;br&gt;
and more predictable. The requirement to operate on early&lt;br&gt;
Windows systems is one that plagues the software design,&lt;br&gt;
but is crucial for many users—particularly those in&lt;br&gt;
underdeveloped countries seeking access to humanitarian&lt;br&gt;
aid collections. If the PC is connected to a network&lt;br&gt;
(intranet or Internet), a custom-built Web server provided&lt;br&gt;
on each CD makes exactly the same information available&lt;br&gt;
to others through their standard Web browser. The use of&lt;br&gt;
compression ensures that the greatest possible volume of&lt;br&gt;
information can be packed on to a CD-ROM.&lt;br&gt;
 &lt;br&gt;The collection-serving software operates under Unix and&lt;br&gt;
Windows NT, and works with standard Web servers. A&lt;br&gt;
flexible process structure allows different collections to be&lt;br&gt;
served by different computers, yet be presented to the user&lt;br&gt;
in the same way, on the same Web page, as part of the&lt;br&gt;
same digital library, even as part of the same collection&lt;br&gt;
(McNab and Witten, 1998). Existing collections can be&lt;br&gt;
updated and new ones brought on-line at any time, without&lt;br&gt;
bringing the system down; the process responsible for the&lt;br&gt;
user interface will notice (through periodic polling) when&lt;br&gt;
new collections appear and add them to the list presented&lt;br&gt;to the user.&lt;br&gt;
&lt;hr&gt;
</Content>
</Section>
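<!--
The plugin scheme described in this section can be pictured as a list of handlers, one per document type, configured for a collection. The following is a minimal Python sketch, assuming that plugins are consulted in the order listed and that the first one accepting a file handles it. The class Plugin, the function choose_plugin, the plugin names, and the suffix-based matching are all hypothetical and do not reproduce Greenstone's actual plugin interface.

```python
from typing import Optional, Sequence, Tuple


class Plugin:
    """A hypothetical document handler that recognises files by suffix."""

    def __init__(self, name: str, suffixes: Tuple[str, ...]) -> None:
        self.name = name
        self.suffixes = suffixes

    def can_process(self, filename: str) -> bool:
        # Suffix matching is an assumption made for this sketch only.
        return filename.lower().endswith(self.suffixes)


def choose_plugin(filename: str, plugins: Sequence[Plugin]) -> Optional[Plugin]:
    """Return the first configured plugin that accepts the file, or None."""
    for plugin in plugins:
        if plugin.can_process(filename):
            return plugin
    return None


if __name__ == "__main__":
    configured = [Plugin("text", (".txt",)), Plugin("html", (".html", ".htm"))]
    for name in ("notes.txt", "index.HTM", "paper.ps"):
        chosen = choose_plugin(name, configured)
        print(name, "handled by", chosen.name if chosen else "no plugin")
```

Supporting a new document type then amounts to adding one more entry to the configured list, which mirrors the point that a mixed collection is just a matter of specifying all the necessary plugins.
-->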
<Section>
 <Description>
 <Metadata name="Title">4</Metadata>
 </Description>
 <Content>&lt;br /&gt;
&lt;IMG src=&quot;_httpdocimg_/pdf01-4_1.jpg&quot;&gt;&lt;br&gt;
&lt;b&gt;Figure 3: Reading a book in the HDL&lt;/b&gt;&lt;br&gt;
&lt;b&gt;FINDING INFORMATION&lt;/b&gt;&lt;br&gt;
 &lt;br&gt;Greenstone digital library systems generally include&lt;br&gt;
several separate collections. A home page allows you to&lt;br&gt;
select a collection; in addition, each collection has its own&lt;br&gt;
“about” page that gives you information about how the&lt;br&gt;
collection is organized and the principles governing what&lt;br&gt;
is included.&lt;br&gt;
 &lt;br&gt;All icons in the screenshots of Figures 1–4 are clickable.&lt;br&gt;
Those icons at the top of the page return to the home page,&lt;br&gt;
provide help text, and allow you to set user interface and&lt;br&gt;
searching preferences. The navigation bar underneath&lt;br&gt;
gives access to the searching and browsing facilities,&lt;br&gt;
which differ from one collection to another.&lt;br&gt;
 &lt;br&gt;Each of the five buttons provides a different way to find&lt;br&gt;
information. You can &lt;i&gt;search for particular words&lt;/i&gt; that&lt;br&gt;
appear in the text from the “search” page (or from the&lt;br&gt;
“about” page of Figure 1). This collection contains indexes&lt;br&gt;
of chapters, section titles, and entire books. The default&lt;br&gt;
search interface is a simple one, suitable for casual users;&lt;br&gt;advanced searching—which allows full Boolean&lt;br&gt;
expressions, phrase searching, case and stemming&lt;br&gt;control—can be enabled from the &lt;i&gt;Preferences&lt;/i&gt; page.&lt;br&gt;
 &lt;br&gt;
This collection has four browsable metadata indexes. You&lt;br&gt;
can &lt;i&gt;access publications by subject&lt;/i&gt; by clicking the &lt;i&gt;subjects&lt;/i&gt;&lt;br&gt;
button, which brings up a list of subjects, represented by&lt;br&gt;
bookshelves (Figure 2). You can &lt;i&gt;access publications by&lt;/i&gt;&lt;br&gt;
&lt;i&gt;title&lt;/i&gt; by clicking &lt;i&gt;titles a-z&lt;/i&gt; (Figure 4), which brings up a list&lt;br&gt;
of books in alphabetic order. You can &lt;i&gt;access publications&lt;/i&gt;&lt;br&gt;
&lt;i&gt;by organization&lt;/i&gt; (i.e. Dublin Core “publisher”), bringing up&lt;br&gt;
a list of organizations. You can &lt;i&gt;access publications by&lt;/i&gt;&lt;br&gt;
&lt;i&gt;“how to” listing&lt;/i&gt;, yielding a list of hints defined by the&lt;br&gt;
collection’s editors. We use the Dublin Core as a base and&lt;br&gt;
extend it in an &lt;i&gt;ad hoc&lt;/i&gt; manner to accommodate the&lt;br&gt;individual requirements of collection designers.&lt;br&gt;
&lt;b&gt;FILES IN A COLLECTION&lt;/b&gt;&lt;br&gt;
 &lt;br&gt;When a new collection is created or material is added to an&lt;br&gt;existing one, the original source documents are first&lt;br&gt;brought into the system through a process known as&lt;br&gt;“importing.” This involves converting documents into a&lt;br&gt;simple HTML-like format known as GML (for&lt;br&gt;“Greenstone Markup Language”), which includes any&lt;br&gt;metadata associated with the document. Documents are&lt;br&gt;assumed to be in the Unicode UTF-8 code (of which the&lt;br&gt;ASCII characters form a subset).&lt;br&gt;
 &lt;br&gt;&lt;b&gt;Files and directories&lt;/b&gt;&lt;br&gt;
 &lt;br&gt;There is a separate directory for each collection, which&lt;br&gt;contains five subdirectories: the original raw material&lt;br&gt;(&lt;i&gt;import&lt;/i&gt;), the GML files created from this (&lt;i&gt;archives&lt;/i&gt;), the&lt;br&gt;final collection as it is served to users (&lt;i&gt;index&lt;/i&gt;), a directory&lt;br&gt;for use during the building process (&lt;i&gt;building&lt;/i&gt;), and one for&lt;br&gt;any supporting files (&lt;i&gt;etc&lt;/i&gt;)—including the configuration file&lt;br&gt;
that controls the collection creation procedure. Additional&lt;br&gt;files might be required: for example, building a hierarchy&lt;br&gt;of classifications requires a data file of sub-classifications.&lt;br&gt;
 &lt;br&gt;
&lt;b&gt;The imported documents&lt;/b&gt;&lt;br&gt;
 &lt;br&gt;In order to identify documents internally, a unique object&lt;br&gt;
identifier or OID is assigned to each original source&lt;br&gt;
document when it is imported (formed by hashing the&lt;br&gt;
content, to overcome file duplication effects caused by&lt;br&gt;
mirroring) and stored as metadata within that document. It&lt;br&gt;
is important that OIDs persist throughout the index-&lt;br&gt;
building process—so that a user’s search history is&lt;br&gt;
unaffected by rebuilding the collection. OIDs are assigned&lt;br&gt;
by hashing the contents of the original source document.&lt;br&gt;
 &lt;br&gt;Once imported, each document is stored in its own&lt;br&gt;
subdirectory of &lt;i&gt;archives&lt;/i&gt;, along with any associated&lt;br&gt;
files—for example, images. To ensure compatibility with&lt;br&gt;
Windows 3.0, only eight characters are used in directory&lt;br&gt;
and file names, which causes annoying but essentially&lt;br&gt;
trivial complications.&lt;br&gt;
 &lt;br&gt;&lt;b&gt;Inside the documents&lt;/b&gt;&lt;br&gt;
 &lt;br&gt;The GML format imposes a limited amount of structure on&lt;br&gt;
documents. Documents are divided into paragraphs. They&lt;br&gt;
can be split hierarchically into sections and subsections.&lt;br&gt;
OIDs are extended to identify these components by&lt;br&gt;
appending numbers, separated by periods, to a document’s&lt;br&gt;
OID. When a book is read, its section hierarchy is visible&lt;br&gt;
as the table of contents (Figure 3). Chapters, sections,&lt;br&gt;
subsections, and pages are all implemented simply as&lt;br&gt;
“sections” within the document. In some collections&lt;br&gt;
documents do not have a hierarchical subsection structure,&lt;br&gt;
but are split into pages to permit browsing within a&lt;br&gt;
retrieved document.&lt;br&gt;
 &lt;br&gt;The document structure is used for searchable indexes.&lt;br&gt;There are three levels of index: &lt;i&gt;documents&lt;/i&gt;, &lt;i&gt;sections&lt;/i&gt;, and&lt;br&gt;
&lt;hr&gt;
</Content>
</Section>
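<!--
The OID scheme described in this section has two parts: a document identifier derived by hashing the source content, and section identifiers formed by appending period-separated numbers. The following is a minimal Python sketch of both parts. The hash function (SHA-1, truncated), the "HASH" prefix layout, and the function names document_oid and section_oid are assumptions made for illustration; the hashing that Greenstone itself uses is not specified in this text and may well differ.

```python
import hashlib
from typing import Sequence


def document_oid(content: bytes, digits: int = 24) -> str:
    """Derive an identifier from content alone.

    Identical bytes always give the same identifier, so a mirrored copy
    of a file maps to the same OID and rebuilding leaves it unchanged.
    SHA-1 and the truncation length are illustrative choices only.
    """
    return "HASH" + hashlib.sha1(content).hexdigest()[:digits]


def section_oid(doc_oid: str, path: Sequence[int]) -> str:
    """Extend a document OID with period-separated section numbers.

    For example, path (2, 1) names the first subsection of section 2.
    An empty path returns the document OID itself.
    """
    return ".".join([doc_oid, *(str(number) for number in path)])


if __name__ == "__main__":
    oid = document_oid(b"example source document")
    print(oid)
    print(section_oid(oid, (2, 1)))
```

Because the identifier depends only on the content, renaming or moving a source file would not change its OID under this sketch, whereas any edit to the content would.
-->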
<Section>
 <Description>
 <Metadata name="Title">5</Metadata>
 </Description>
 <Content>&lt;br /&gt;
&lt;IMG src=&quot;_httpdocimg_/pdf01-5_1.jpg&quot;&gt;&lt;br&gt;
the &lt;i&gt;import&lt;/i&gt; process is invoked, which converts the files into&lt;br&gt;GML using the specified plugins. Old material for which&lt;br&gt;GML files have previously been created is not re-imported.&lt;br&gt;Then the &lt;i&gt;build&lt;/i&gt; process is invoked to build the requisite&lt;br&gt;indexes for the collection. Finally, the contents of the&lt;br&gt;&lt;i&gt;building&lt;/i&gt; directory are moved into the &lt;i&gt;index&lt;/i&gt; directory, and&lt;br&gt;the new version of the collection automatically becomes&lt;br&gt;live.&lt;br&gt;
 &lt;br&gt;This procedure may seem cumbersome. But all the steps&lt;br&gt;are necessary for efficient operation with large collections.&lt;br&gt;The &lt;i&gt;import&lt;/i&gt; process could be performed on the fly during&lt;br&gt;the building operation—but because building indexes is a&lt;br&gt;multipass operation, the often lengthy importing would be&lt;br&gt;repeated several times. The &lt;i&gt;build&lt;/i&gt; process can take&lt;br&gt;considerable time—a day or two, for very large&lt;br&gt;collections. Consequently, the results are placed in the&lt;br&gt;&lt;i&gt;building&lt;/i&gt; directory so that, if the collection already exists, it&lt;br&gt;will continue to be served to users in its old form&lt;br&gt;throughout the building operation.&lt;br&gt;
 &lt;br&gt;Active users of the collection will not be disturbed when&lt;br&gt;the new version becomes live—they will probably not&lt;br&gt;
&lt;b&gt;Figure 4: Browsing titles in the HDL&lt;/b&gt;&lt;br&gt;
even notice. The persistent OIDs ensure that interactions&lt;br&gt;remain coherent—users who are examining the results of a&lt;br&gt;query or browse operation will still retrieve the expected&lt;br&gt;
&lt;i&gt;paragraphs&lt;/i&gt;, corresponding to the distinctions that GML&lt;br&gt;
documents—and if a search is actually in progress when&lt;br&gt;
makes—the hierarchical structure is flattened for the&lt;br&gt;
the change takes place the program detects the resulting&lt;br&gt;
purposes of creating these indexes. Indexes can be of text,&lt;br&gt;
file-structure inconsistency and automatically and&lt;br&gt;
or metadata, or any combination. Thus you can create a&lt;br&gt;
transparently re-executes the query, this time on the new&lt;br&gt;
searchable index of section titles, and/or authors, and/or&lt;br&gt;
version of the collection.&lt;br&gt;
document descriptions, as well as the document text.&lt;br&gt;
&lt;b&gt;UPDATING EXISTING COLLECTIONS&lt;/b&gt;&lt;br&gt;
 &lt;br&gt;&lt;b&gt;How it works&lt;/b&gt;&lt;br&gt;
 &lt;br&gt;Updating an existing collection with new files in the same&lt;br&gt;
 &lt;br&gt;The original material in the &lt;i&gt;import&lt;/i&gt; directory may be in any&lt;br&gt;
format is easy. For example, the raw material for the HDL&lt;br&gt;
format, and plugins are required to process each format&lt;br&gt;
is supplied in the form of HTML files marked up with&lt;br&gt;
type. The plugins that a collection uses must be specified&lt;br&gt;
&amp;lt;&amp;lt;TOC&amp;gt;&amp;gt; tags to split books into sections and&lt;br&gt;
in the collection configuration file. The &lt;i&gt;import&lt;/i&gt; program&lt;br&gt;
subsections, and &amp;lt;&amp;lt;I&amp;gt;&amp;gt; tags to indicate where an image is&lt;br&gt;
reads the list of plugins and passes each document to each&lt;br&gt;
to be inserted. For each book in the library there is a&lt;br&gt;
plugin in order until it finds one that can process it. When&lt;br&gt;
directory that contains a single HTML file representing the&lt;br&gt;
updating an existing collection, all plugins necessary to&lt;br&gt;
book, and separate files containing the associated images.&lt;br&gt;
process new material should already have been specified in&lt;br&gt;
An accompanying spreadsheet file contains the&lt;br&gt;
the configuration file.&lt;br&gt;
classification hierarchy; this is converted to a simple file&lt;br&gt;format (using Excel’s &lt;i&gt;Save As&lt;/i&gt; command).&lt;br&gt;
 &lt;br&gt;The building step creates the indexes for both searching&lt;br&gt;and browsing. The MG software is generally used to do the&lt;br&gt;
 &lt;br&gt;Since the collection exists, its directory is already set up&lt;br&gt;
searching (Witten &lt;i&gt;et al.&lt;/i&gt;, 1999), and the &lt;i&gt;mgbuild&lt;/i&gt; module is&lt;br&gt;
with subdirectories &lt;i&gt;import&lt;/i&gt;, &lt;i&gt;archives&lt;/i&gt;, &lt;i&gt;building&lt;/i&gt;, &lt;i&gt;index&lt;/i&gt;, and&lt;br&gt;
automatically invoked to create each of the indexes that is&lt;br&gt;
&lt;i&gt;etc&lt;/i&gt;, and the &lt;i&gt;etc&lt;/i&gt; directory will contain a suitable collection&lt;br&gt;
442required. For example, the Humanity Development Library&lt;br&gt;
443configuration file.&lt;br&gt;
444has three indexes, one for entire books, one for chapters,&lt;br&gt;and one for section titles. Subdirectories of the &lt;i&gt;index&lt;/i&gt;&lt;br&gt;
445 &lt;br&gt;
446directory are created for each of these indexes.&lt;br&gt;
447&lt;b&gt;The updating procedure&lt;/b&gt;&lt;br&gt;
448 &lt;br&gt;To update a collection, the new raw material is placed in&lt;br&gt;the &lt;i&gt;import&lt;/i&gt; directory, in whatever form it is available. Then&lt;br&gt;
449&lt;hr&gt;
</Content>
</Section>
<Section>
 <Description>
 <Metadata name="Title">6</Metadata>
 </Description>
 <Content>&lt;br /&gt;
(a)&lt;br&gt;
1 creator [email protected]&lt;br&gt;
2 maintainer [email protected]&lt;br&gt;
3 public True&lt;br&gt;
4&lt;br&gt;
5 indexes document:text&lt;br&gt;
6 defaultindex document:text&lt;br&gt;
7 plugins GMLPlug TEXTPlug ArcPlug RecPlug&lt;br&gt;
8&lt;br&gt;
9 classify AZList metadata=Title&lt;br&gt;
10&lt;br&gt;
11 collectionmeta collectionname &amp;quot;generic text collection&amp;quot;&lt;br&gt;
12 collectionmeta .document:text &amp;quot;documents&amp;quot;&lt;br&gt;
&lt;br&gt;
(b)&lt;br&gt;
1 creator [email protected]&lt;br&gt;
2 maintainer [email protected]&lt;br&gt;
3 public True&lt;br&gt;
4&lt;br&gt;
5 indexes document:text document:From&lt;br&gt;
6 defaultindex document:text&lt;br&gt;
7 plugins GMLPlug EMAILPlug ArcPlug RecPlug&lt;br&gt;
8&lt;br&gt;
9 classify AZList metadata=Title&lt;br&gt;
10 classify DateList&lt;br&gt;
11&lt;br&gt;
12 collectionmeta collectionname &amp;quot;Email messages&amp;quot;&lt;br&gt;
13 collectionmeta .document:text &amp;quot;documents&amp;quot;&lt;br&gt;
14 collectionmeta .document:From &amp;quot;email senders&amp;quot;&lt;br&gt;
15&lt;br&gt;
16 format QueryResults \\\\&lt;br&gt;
17 &amp;lt;td&amp;gt;[link][icon][/link]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[Title]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[Author]&amp;lt;/td&amp;gt;&lt;br&gt;
&lt;b&gt;Figure 5: Collection configuration files (a) generic, (b) for an email collection&lt;/b&gt;&lt;br&gt;
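&lt;br&gt;As a rough illustration of how a configuration file like those in Figure 5 might be read, the following Python sketch collects keyword/value lines into a dictionary. It is not Greenstone's own parser: the function name parse_collect_cfg, the treatment of a trailing backslash as a line continuation, the handling of comment lines, and the simplified sample (placeholder address, format string without table-cell tags) are all assumptions made for this example.&lt;br&gt;

```python
import shlex


def parse_collect_cfg(text):
    """Parse 'keyword value...' lines into {keyword: [list of value lists]}."""
    config = {}
    pending = ""
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("\\"):
            # Assumed continuation marker: join this line with the next one.
            pending += line[:-1] + " "
            continue
        line = pending + line
        pending = ""
        if not line or line.startswith("#"):
            continue
        # shlex keeps quoted strings such as "Email messages" together.
        keyword, *values = shlex.split(line)
        # A keyword may repeat (classify, collectionmeta), so keep every entry.
        config.setdefault(keyword, []).append(values)
    return config


# Simplified, hypothetical sample modelled on Figure 5b.
SAMPLE = '''
creator someone@example.org
public True
indexes document:text document:From
defaultindex document:text
plugins GMLPlug EMAILPlug ArcPlug RecPlug
classify AZList metadata=Title
classify DateList
collectionmeta collectionname "Email messages"
format QueryResults \\
[link][icon][/link] [Title] [Author]
'''

cfg = parse_collect_cfg(SAMPLE)
print(cfg["plugins"][0])
print([entry[0] for entry in cfg["classify"]])
print(cfg["collectionmeta"][0])
```

Storing a list of entries per keyword, rather than a single value, reflects the fact that lines such as classify and collectionmeta appear more than once in the figure.&lt;br&gt;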
&lt;br&gt;MG also compresses the text of the collection; and the image files are linked into the &lt;i&gt;index&lt;/i&gt; subdirectory. Now none of the material in the &lt;i&gt;import&lt;/i&gt; and &lt;i&gt;archives&lt;/i&gt; directories is needed to run the collection and can be removed from the file system (though they would be needed if the collection were rebuilt).&lt;br&gt;
&lt;br&gt;Associated with each collection is a database stored in GDBM (Gnu database manager) format. This contains an entry for each document, giving its OID, its internal MG document number, and metadata such as title. Information for each of the browsing indexes, which appear as buttons on the Greenstone search/browse bar, is also extracted during the building process and stored in the database. A “classifier” program is required for each browsing index to extract the appropriate information from GML documents. Like plugins, classifiers are written on an &lt;i&gt;ad hoc&lt;/i&gt; basis for the particular information required, and where possible reused from one collection to another.&lt;br&gt;
&lt;br&gt;The building program creates the indexes based on whatever appears in the &lt;i&gt;archives&lt;/i&gt; directory. The first plugin specified by all collections is one that processes GML files, and so if &lt;i&gt;archives&lt;/i&gt; contains imported files they will be processed correctly. If it contains material in the original format, that will be converted using the appropriate plugin. Thus the import process is optional.&lt;br&gt;
&lt;br&gt;GML is designed to be fast and easy to parse, an important requirement when millions of documents are to be processed. Something as simple as requiring tags to be lower-case, for example, yields a substantial speed-up. In certain circumstances, however, it might be preferable to use a standardized format such as XML. This is straightforward to implement (just write an XML plugin), although we have not done so ourselves. Given the transitory nature of the imported data, to date, we have found GML a satisfactory and beneficial format.&lt;br&gt;
&lt;b&gt;CREATING NEW COLLECTIONS&lt;/b&gt;&lt;br&gt;
&lt;br&gt;Building new collections from scratch is only slightly different from updating an existing collection. The key new requirement is creating a collection configuration file, and a software utility is provided to help. Two pieces of information are required for this: the name of the directory that the collection will use (into which the source data and other files will eventually be placed), and a contact e-mail address for use if any problems are encountered by the software once the collection is up and running. The utility creates files and directories within the newly-named directory to support a generic collection of plain text documents. With suitable data placed in the &lt;i&gt;import&lt;/i&gt; directory, building the collection at this point will yield a document-level searchable index of all the text and a browsable list of “titles” (defined in this case to be the document filenames).&lt;br&gt;
&lt;br&gt;To enhance the functionality and presentation—something anything but the most trivial collection will require—the configuration file must be edited. For a collection sourced from documents in an already supported data format, presented in a similar fashion to an existing collection, the&lt;br&gt;
&lt;hr&gt;
</Content>
</Section>
<Section>
 <Description>
 <Metadata name="Title">7</Metadata>
 </Description>
 <Content>&lt;br /&gt;
&lt;IMG src=&quot;_httpdocimg_/pdf01-7_1.jpg&quot;&gt;&lt;br&gt;
&lt;b&gt;Figure 6: Searching bookmarked Web pages&lt;/b&gt;&lt;br&gt;
&lt;br&gt;amount of editing is minimal. Importing new data formats and browsing metadata in ways not currently supported are more complex activities that require programming skills.&lt;br&gt;
&lt;br&gt;&lt;b&gt;Modifying the configuration file&lt;/b&gt;&lt;br&gt;
&lt;br&gt;Figure 5b shows simple alterations to the generic configuration file in Figure 5a that was generated by the new-collection utility. &lt;i&gt;TEXTPlug&lt;/i&gt; is replaced with &lt;i&gt;EMAILPlug&lt;/i&gt; (line 7) which reads email files and extracts metadata (&lt;i&gt;From&lt;/i&gt;, &lt;i&gt;To&lt;/i&gt;, &lt;i&gt;Date&lt;/i&gt;, &lt;i&gt;Subject&lt;/i&gt;) from them. A classifier for dates is added (line 10) to make the collection browsable chronologically. The default presentation of search results is overridden (line 17) to display both the title of the message (i.e. Dublin Core &lt;i&gt;Title&lt;/i&gt;) and its sender (i.e. Dublin Core &lt;i&gt;Author&lt;/i&gt;). Elements in square brackets, such as &lt;i&gt;[Title]&lt;/i&gt;, are replaced by the metadata associated with a particular document. The built-in term &lt;i&gt;[icon]&lt;/i&gt; produces a suitable image that represents the document (such as a book icon or page icon), and the &lt;i&gt;[link][/link]&lt;/i&gt; construct forms a hyperlink to the complete document. Anything else in the format statement, which in this case is solely table-cell tags in HTML, is passed through to the page being displayed.&lt;br&gt;
&lt;br&gt;As this example shows, creating a new collection that stays within the bounds of the library’s established capabilities falls within the capability of many computer users—for instance, computer-trained librarians. Extending Greenstone to handle new document formats and browse metadata in new ways is more challenging.&lt;br&gt;
&lt;br&gt;&lt;b&gt;Writing new plugins and classifiers&lt;/b&gt;&lt;br&gt;
&lt;br&gt;Extensibility is obtained through plugins and classifiers. These are modules of code that can be slotted into the system to enhance its capabilities. Plugins parse documents, extracting the text and metadata to be indexed. Classifiers control how metadata is brought together to form browsable data structures. Both are specified in an object-oriented framework using inheritance to minimize the amount of code written.&lt;br&gt;
&lt;br&gt;A plugin must specify three things: what file formats it can handle, how they should be parsed, and whether the plugin is recursive. File formats are normally determined using regular expression matching on the filename. For example, the HTML plugin accepts all files that end in &lt;i&gt;.htm&lt;/i&gt;, &lt;i&gt;.html&lt;/i&gt;, &lt;i&gt;.HTM&lt;/i&gt;, or &lt;i&gt;.HTML&lt;/i&gt;. (It is quite possible, however, to write plugins that “look inside” the file as well.) For other files, the plugin returns &lt;i&gt;undefined&lt;/i&gt; and the file is passed to the next plugin in the collection’s configuration file (e.g. Figure 5 line 7). If it can, the plugin parses the file and returns the number of documents processed. This involves extracting text and metadata and adding it to the library’s content through calls to &lt;i&gt;add text&lt;/i&gt; and &lt;i&gt;add metadata&lt;/i&gt;.&lt;br&gt;
&lt;br&gt;Some plugins (“recursive” ones) add extra files into the stream of data processed during the building phase by artificially reactivating the list of plugins. This is how directory hierarchies are traversed.&lt;br&gt;
&lt;br&gt;Plugins are small modules of code that are easy to write. We monitored the time it took to develop a new one that was different to any we had produced so far. We chose to make as an example a collection of HTML bookmark files, the motivation being to produce a convenient way of searching and browsing one’s bookmarked Web pages. Figure 6 shows a user searching for bookmarked pages about &lt;i&gt;music&lt;/i&gt;. The new plugin took under an hour to write, and was 160 lines long (ignoring blank lines and comments)—about the average length of existing plugins.&lt;br&gt;
&lt;br&gt;Classifiers are more general than plugins because they work on GML-format data. For example, any plugin that generates date metadata in accordance with the Dublin core can request the collection to be browsable chronologically by specifying the &lt;i&gt;DateList&lt;/i&gt; classifier in the collection’s configuration file (Figure 7). Classifiers are more elaborate than most plugins, but new ones are seldom required. The average length of existing classifiers is 230 lines.&lt;br&gt;
&lt;br&gt;Classifiers must specify three things: an initialization routine, how individual documents are classified, and the final browsable data structure. Initialization takes care of any options specified in the configuration file (such as &lt;i&gt;metadata=Title&lt;/i&gt; on line 9 of Figure 5b). Classifying individual documents is an iterative process: for each one, a call to &lt;i&gt;document-classify&lt;/i&gt; is made. On presentation of the document’s OID, the necessary metadata is located and used to control where the document is added to the browsable data structure being constructed.&lt;br&gt;
&lt;br&gt;Once all documents have been added, a request is made for the completed data structure. Some classifiers return the data structure directly; others transform the data structure before it is returned. For example, the &lt;i&gt;AZList&lt;/i&gt; classifier&lt;br&gt;
&lt;hr&gt;
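&lt;br&gt;The way a file is offered to each plugin in turn, as described on this page, can be sketched in Python as follows. This is an illustrative model rather than Greenstone code: the class names, the process and parse methods, the list used as a stand-in for the library, and the use of None to represent the "undefined" return value are all assumptions made for this example.&lt;br&gt;

```python
import re


class Plugin:
    """Base class: a subclass supplies a filename pattern and a parser."""

    pattern = None  # regular expression matched against the filename

    def process(self, filename, text, library):
        # Return None (standing in for "undefined") if the file is not ours.
        if re.search(self.pattern, filename) is None:
            return None
        return self.parse(filename, text, library)

    def parse(self, filename, text, library):
        raise NotImplementedError


class HTMLPlugin(Plugin):
    # Same four endings as the HTML plugin described in the text.
    pattern = r"\.(htm|html|HTM|HTML)$"

    def parse(self, filename, text, library):
        library.append({"source": filename, "format": "html", "text": text})
        return 1  # number of documents processed


class TextPlugin(Plugin):
    pattern = r"\.txt$"

    def parse(self, filename, text, library):
        library.append({"source": filename, "format": "text", "text": text})
        return 1


def import_file(filename, text, plugins, library):
    """Offer the file to each plugin in configuration order."""
    for plugin in plugins:
        count = plugin.process(filename, text, library)
        if count is not None:
            return type(plugin).__name__, count
    return None, 0  # no plugin claimed the file


library = []
plugins = [HTMLPlugin(), TextPlugin()]
claimed_by = [
    import_file("index.HTML", "hello", plugins, library)[0],
    import_file("notes.txt", "plain", plugins, library)[0],
    import_file("photo.jpg", "", plugins, library)[0],
]
print(claimed_by)
```

Because the first plugin that accepts a file ends the search, the order of the plugins line in the configuration file decides which plugin handles a file that several could accept.&lt;br&gt;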
</Content>
</Section>
<Section>
 <Description>
 <Metadata name="Title">8</Metadata>
 </Description>
 <Content>&lt;br /&gt;
&lt;IMG src=&quot;_httpdocimg_/pdf01-8_1.jpg&quot;&gt;&lt;br&gt;
&lt;b&gt;Figure 7: Browsing a newspaper collection by date&lt;/b&gt;&lt;br&gt;
&lt;br&gt;divides the alphabetically sorted list of metadata into separate pages of about the same size and returns the alphabetic ranges for each one (Figure 4).&lt;br&gt;
&lt;b&gt;OVERVIEW OF RELATED WORK&lt;/b&gt;&lt;br&gt;
&lt;br&gt;Two projects that provide substantial open source digital library software are Dienst (Lagoze and Fielding, 1998) and Harvest (Bowman &lt;i&gt;et al.&lt;/i&gt;, 1994). The origins of Dienst (&lt;i&gt;www.cs.cornell.edu/cdlrg&lt;/i&gt;) stretch back to 1992. The term has come to represent three entities: a conceptual architecture for distributed digital libraries; an open protocol for service communication; and a software system that implements the protocol. To date, five sample digital libraries have been built using this technology. They manifest themselves in two forms: technical reports and primary source documents.&lt;br&gt;
&lt;br&gt;Best known is NCSTRL, the Networked Computer Science Technical Reference Library project (&lt;i&gt;www.ncstrl.org&lt;/i&gt;). This collection facilitates searching by title, author and abstract, and browsing by year and author, across a distributed network of document repositories. Documents can (where supported) be delivered in various formats such as PostScript, a thumbnail overview of the pages, and a GIF image of a particular page.&lt;br&gt;
&lt;br&gt;The &lt;i&gt;Making of America&lt;/i&gt; resource is an example of a collection based around primary sources, in this case American social history, 1830−1900. It has a different “look and feel” to NCSTRL, being strongly oriented toward browsing rather than searching. A user navigates their way through a hierarchical structure of hyperlinks to reach a book of interest. The book itself is a series of scanned images: delivery options include going directly to a page number, next and previous page buttons, and displaying a particular page at different resolutions. A text version of the page is also available upon which a searching option is also provided.&lt;br&gt;
&lt;br&gt;Started in 1994, Harvest is also a long-running research project. It provides an efficient means of gathering source data from the Internet and distributing indexing information over the Internet. This is accomplished through five components: &lt;i&gt;gatherer&lt;/i&gt;, &lt;i&gt;broker&lt;/i&gt;, &lt;i&gt;indexer&lt;/i&gt;, &lt;i&gt;replicator&lt;/i&gt; and &lt;i&gt;cache&lt;/i&gt;. The first three are central to creating, updating and searching a collection; the last two help to improve performance over the Internet through transparent mirroring and caching techniques.&lt;br&gt;
&lt;br&gt;The system is configurable and customizable. While searching is most commonly implemented using Glimpse (&lt;i&gt;glimpse.cs.arizona.edu&lt;/i&gt;), in principle any search engine that supports incremental updates and Boolean combinations of attribute-based queries can be used. It is possible to control what type of documents are gathered during creation and updating, and how the query interface looks and is laid out.&lt;br&gt;
&lt;br&gt;Sample collections cited by the developers include 21,000 computer science technical reports and 7,000 home pages. Other examples include a sizable collection of agriculture-related electronic journals and magazines called “tomato-juice” (accessed through &lt;i&gt;hegel.lib.ncsu.edu&lt;/i&gt;) and a full-text index of library-related electronic serials (&lt;i&gt;sunsite.berkeley.edu/IndexMorganagus&lt;/i&gt;). Harvest is also often used to index Web sites (for example &lt;i&gt;www.middlebury.edu&lt;/i&gt;).&lt;br&gt;
&lt;br&gt;Comparing Greenstone with Dienst and Harvest, there are both similarities and differences. All provide substantial digital library systems, hence common themes recur, but they are driven by projects with different aims. Harvest, for instance, was not conceived as a digital library project at all, but by virtue of its selective document gathering process it can be classed (and is used) as one. While it provides sophisticated search options, it lacks the complementary service of browsing. Furthermore it adds no structure or order to the documents collected, relying on whatever structures are present in the site that they were gathered from. A proven strength of the design is its flexibility through configuration and customization, an element also present in Greenstone.&lt;br&gt;
&lt;br&gt;Dienst, best exemplified through the NCSTRL work, supports searching and browsing, like Greenstone. Both use open protocols. Differences include a high reliance in Dienst on user-supplied information when a document is added, and a smaller range of document types supported—although Dienst does include a document model that should, over time, allow this to expand with relative ease.&lt;br&gt;
&lt;br&gt;There are also commercial systems that provide similar digital library services to those described. However, since&lt;br&gt;
&lt;hr&gt;
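&lt;br&gt;The AZList behaviour described earlier (collecting one metadata value per document, sorting alphabetically, splitting the list into pages of similar size, and reporting the alphabetic range of each page) can be sketched in Python as follows. It is a simplified model and not the actual classifier: the class and method names, the page_size option, and the in-memory metadata dictionary are all invented for this example, and a fixed page size is only one way of getting pages of about the same size.&lt;br&gt;

```python
class AZListSketch:
    """Toy classifier with the three parts named in the text."""

    def __init__(self, metadata="Title", page_size=3):
        # Initialization handles options such as metadata=Title.
        self.metadata = metadata
        self.page_size = page_size
        self.entries = []

    def classify(self, oid, doc_metadata):
        # Called once per document; documents lacking the field are skipped.
        value = doc_metadata.get(self.metadata)
        if value:
            self.entries.append((value, oid))

    def get_structure(self):
        # Sort case-insensitively, then cut the list into fixed-size pages.
        ordered = sorted(self.entries, key=lambda entry: entry[0].lower())
        pages = []
        for start in range(0, len(ordered), self.page_size):
            chunk = ordered[start:start + self.page_size]
            first = chunk[0][0][0].upper()
            last = chunk[-1][0][0].upper()
            label = first if first == last else first + "-" + last
            pages.append((label, [oid for _, oid in chunk]))
        return pages


# Hypothetical documents keyed by OID.
docs = {
    "D1": {"Title": "Water supply"},
    "D2": {"Title": "agroforestry"},
    "D3": {"Title": "Beekeeping"},
    "D4": {"Title": "Nutrition"},
    "D5": {"Author": "no title here"},
    "D6": {"Title": "Cassava"},
}

az = AZListSketch(metadata="Title", page_size=2)
for oid, meta in docs.items():
    az.classify(oid, meta)
pages = az.get_structure()
print(pages)
```

Keeping the sorting and paging in get_structure, rather than in classify, mirrors the description of classifiers that transform the data structure before returning it.&lt;br&gt;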
</Content>
</Section>
<Section>
 <Description>
 <Metadata name="Title">9</Metadata>
 </Description>
 <Content>&lt;br /&gt;
corporate culture instills proprietary attitudes there is little opportunity for advancement through a shared collaborative effort. Consequently they are not reviewed here.&lt;br&gt;
&lt;b&gt;CONCLUSIONS&lt;/b&gt;&lt;br&gt;
&lt;br&gt;Greenstone is a comprehensive software system for creating digital library collections. It builds data structures for searching and browsing from the material provided, rather than relying on any hand-crafting. The process is controlled by a configuration file, and once a collection exists new material can be added completely automatically. Browsing is based on Dublin Core metadata.&lt;br&gt;
&lt;br&gt;New collections can be developed easily, particularly if they resemble existing ones. Extensibility is achieved through software “plugins” that can be written to accommodate documents, and metadata, in different formats. Standard plugins exist for many document types; new ones are easily written. Browsing is controlled by “classifiers” that process metadata into browsing structures (by date, alphabetical, hierarchical, etc).&lt;br&gt;
&lt;br&gt;However, the most powerful support for extensibility is achieved not by technical means but by making the source code freely available under the Gnu public license. Only through an international cooperative effort will digital library software become sufficiently comprehensive to meet the world’s needs with the richness and flexibility that users deserve.&lt;br&gt;
&lt;b&gt;ACKNOWLEDGMENTS&lt;/b&gt;&lt;br&gt;
&lt;br&gt;We gratefully acknowledge all those who have worked on the Greenstone software, and all members of the New Zealand Digital Library project for their enthusiasm and ideas.&lt;br&gt;
&lt;b&gt;REFERENCES&lt;/b&gt;&lt;br&gt;
1. Akscyn, R.M. and Witten, I.H. (1998) “Report on First Summit on International Cooperation on Digital Libraries.” ks.com/idla-wp-oct98.&lt;br&gt;
2. Bowman, C.M., Danzig, P.B., Manber, U., and Schwartz, M.F. “Scalable Internet resource discovery: Research problems and approaches” &lt;i&gt;Communications of the ACM,&lt;/i&gt; Vol. 37, No. 8, pp. 98−107, 1994.&lt;br&gt;
3. Fox, E. (1998) “Digital library definitions.” ei.cs.vt.edu/~fox/dlib/def.html.&lt;br&gt;
4. Humanity Libraries (1998) &lt;i&gt;Humanity Development Library&lt;/i&gt;. CD-ROM produced by the Global Help Project, Antwerp, Belgium.&lt;br&gt;
5. Lagoze, C. and Fielding, D. “Defining Collections in Distributed Digital Libraries” &lt;i&gt;D-Lib Magazine&lt;/i&gt;, Nov. 1998.&lt;br&gt;
6. PAHO (1999) &lt;i&gt;Virtual Disaster Library&lt;/i&gt;. CD-ROM produced by the Pan-American Health Organization, Washington DC, USA.&lt;br&gt;
7. McNab, R.J., Witten, I.H. and Boddie, S.J. (1998) “A distributed digital library architecture incorporating different index styles.” &lt;i&gt;Proc IEEE Advances in Digital Libraries&lt;/i&gt;, Santa Barbara, CA, pp. 36–45.&lt;br&gt;
8. Nevill-Manning, C.G., Reed, T., and Witten, I.H. (1998) “Extracting text from PostScript” &lt;i&gt;Software—Practice and Experience&lt;/i&gt;, Vol. 28, No. 5, pp. 481–491; April.&lt;br&gt;
9. UNESCO (1999) &lt;i&gt;SAHEL point DOC: Anthologie du développement au Sahel&lt;/i&gt;. CD-ROM produced by UNESCO, Paris, France.&lt;br&gt;
10. UNU (1998) &lt;i&gt;Collection on critical global issues.&lt;/i&gt; CD-ROM produced by the United Nations University Press, Tokyo, Japan.&lt;br&gt;
11. Witten, I.H., Moffat, A. and Bell, T. (1999) &lt;i&gt;Managing Gigabytes: compressing and indexing documents and images&lt;/i&gt;, Morgan Kaufmann, second edition.&lt;br&gt;
&lt;hr&gt;
</Content>
</Section>
</Section>
</Archive>