<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
<Description>
<Metadata name="gsdldoctype">indexed_doc</Metadata>
<Metadata name="Language">en</Metadata>
<Metadata name="Encoding">utf8</Metadata>
<Metadata name="Author">Bronwyn</Metadata>
<Metadata name="Title">Distributing Digital Libraries on the Web, CD-ROMs, and Intranets: ...</Metadata>
<Metadata name="URL">http://Scratch/ak19/gs2-diffcol-26Mar2018/collect/DSpace-To-GS/tmp/1522032971/3.html</Metadata>
<Metadata name="UTF8URL">http://Scratch/ak19/gs2-diffcol-26Mar2018/collect/DSpace-To-GS/tmp/1522032971/3.html</Metadata>
<Metadata name="gsdlsourcefilename">import/3/3.pdf</Metadata>
<Metadata name="gsdlconvertedfilename">tmp/1522032971/3.html</Metadata>
<Metadata name="OrigSource">3.html</Metadata>
<Metadata name="Source">3.pdf</Metadata>
<Metadata name="SourceFile">3.pdf</Metadata>
<Metadata name="Plugin">PDFPlugin</Metadata>
<Metadata name="FileSize">218837</Metadata>
<Metadata name="FilenameRoot">3</Metadata>
<Metadata name="FileFormat">PDF</Metadata>
<Metadata name="srcicon">_iconpdf_</Metadata>
<Metadata name="srclink_file">doc.pdf</Metadata>
<Metadata name="srclinkFile">doc.pdf</Metadata>
<Metadata name="NumPages">7</Metadata>
<Metadata name="ex.ExifTool.ExifToolVersion">8.57</Metadata>
<Metadata name="ex.File.Directory">/Scratch/ak19/gs2-diffcol-26Mar2018/collect/DSpace-To-GS/import/3</Metadata>
<Metadata name="ex.File.FileModifyDate">2018:03:26 15:55:32+13:00</Metadata>
<Metadata name="ex.File.FileName">3.pdf</Metadata>
<Metadata name="ex.File.FilePermissions">775</Metadata>
<Metadata name="ex.File.FileSize">218837</Metadata>
<Metadata name="ex.File.FileType">PDF</Metadata>
<Metadata name="ex.File.MIMEType">application/pdf</Metadata>
<Metadata name="ex.PDF.Author">Bronwyn</Metadata>
<Metadata name="ex.PDF.CreateDate">1999:09:27 16:06:52</Metadata>
<Metadata name="ex.PDF.Creator">Microsoft Word</Metadata>
<Metadata name="ex.PDF.Linearized">false</Metadata>
<Metadata name="ex.PDF.PDFVersion">1.1</Metadata>
<Metadata name="ex.PDF.PageCount">7</Metadata>
<Metadata name="ex.PDF.Producer">Acrobat PDFWriter 2.0 for Macintosh</Metadata>
<Metadata name="ex.dc.Contributor">Ian Witten</Metadata>
<Metadata name="ex.dc.Contributor">Sally Jo Cunningham</Metadata>
<Metadata name="ex.dc.Contributor">Bill Rogers</Metadata>
<Metadata name="ex.dc.Contributor">Rodger McNab</Metadata>
<Metadata name="ex.dc.Contributor">Stefan Boddie</Metadata>
<Metadata name="ex.dc.Date^accessioned">2005-01-10T02:51:09Z</Metadata>
<Metadata name="ex.dc.Date^available">2005-01-10T02:51:09Z</Metadata>
<Metadata name="ex.dc.Date^issued">2005-01-10T02:51:09Z</Metadata>
<Metadata name="ex.dc.Language^iso">en</Metadata>
<Metadata name="ex.dc.Title">Distributing Digital Libraries on the Web, CD-ROMs, and Intranets: Same information, same look-and-feel, different media</Metadata>
<Metadata name="equivlink"></Metadata>
<Metadata name="Identifier">HASH016449897ef791600b644989</Metadata>
<Metadata name="lastmodified">1522032932</Metadata>
<Metadata name="lastmodifieddate">20180326</Metadata>
<Metadata name="oailastmodified">1522032971</Metadata>
<Metadata name="oailastmodifieddate">20180326</Metadata>
<Metadata name="assocfilepath">HASH0164.dir</Metadata>
<Metadata name="gsdlassocfile">doc.pdf:application/pdf:</Metadata>
</Description>
<Content>
<A name=1></a><b>Distributing Digital Libraries on the Web,</b><br>
<b>CD-ROMs, and Intranets:</b><br>
<b>Same information, same look-and-feel, different media</b><br>
Ian Witten, Sally Jo Cunningham, Bill Rogers, Rodger McNab, Stefan Boddie<br>
Department of Computer Science<br>
University of Waikato<br>
Hamilton, New Zealand<br>
<b>Abstract:</b> The Greenstone system from the New Zealand Digital Library provides a new way of making collections of information available in the same form over the World-Wide Web, on CD-ROM, or on local Intranets. Exactly the same information is available in each case, and exactly the same interface is used to access it. The New Zealand Digital Library is accessible over the Web and offers a wide variety of information collections. Sub-collections can be written to a CD-ROM, which can be used on a standalone PC by a single user. A local Web browser suffices to access the information on the disk just as though the PC were connected to the Internet. Simultaneously, if there is a network connection, the same disk acts as a network server to make exactly the same information available to others who need only use their standard Internet browser software. This technology has great appeal for many users, particularly those in developing nations where non-local Internet access can be precarious or prohibitively expensive.<br>
<b>1. Introduction</b><br>
The emerging digital library movement is a child of the Internet and the World-Wide Web. Spurred on by visions of an “information superhighway,” current digital library projects invariably concentrate on providing access to document collections over the Internet, where documents, users, and catalog may all be distributed widely. Often the search interface is WWW-based, in contrast to the telnet or phone-in access required by library OPACs and earlier commercial “online” bibliographic databases such as Dialog. Web-based digital libraries have significant advantages over their online predecessors. Users need not obtain and install search software on their own sites. In many areas Internet access incurs minimal charges, or at any rate is significantly cheaper than a direct telephone connection with the retrieval system. Finally, Web browsers provide a simple, standard means of access to a variety of digital library systems.<br>
However, practical experience in digital library development indicates that in many situations, universal access via the Internet is neither possible nor desirable. A business, for example, might desire a digital library to make its proprietary documents available to its employees, but only if the company’s security could be ensured by restricting access with an intranet. CD-ROM has been identified as the implementation platform of choice for collections targeted at large portions of the Third World; for many developing countries, particularly in Sub-Saharan Africa, Internet connections are still either non-existent, undependable, or prohibitively expensive to use. Despite its lowly status, the CD-ROM has many advantages. Relatively durable in the face of harsh environmental conditions, it incurs known, fixed costs for purchase and supporting hardware (White, 1992). It makes information accessible on a tangible medium that is under the user’s control and is not subject to capricious decisions by others. A CD-ROM based digital library carries the further advantage of providing full document contents: a significant drawback of bibliographic systems was that their users in developing countries could locate descriptions of relevant documents, but were then often unable to obtain the documents themselves (El-Hadidy, 1994; Chowdhury, 1996). Finally, while a CD-ROM holds a reasonable amount of material in textual form, digital videodisk technology is already available which can store 12 GB on a single disk, far larger than most extant textual digital libraries.<br>
<hr>
<A name=2></a>For these reasons the Greenstone digital library software developed by the New Zealand Digital Library project allows a collection developer to create a digital library that is WWW-based, intranet-based, or available on a standalone or networked CD-ROM. All platforms support exactly the same interface, and the same search and retrieval methods. This standardization reduces the system learning curve for intranet or CD-ROM users who have previous experience with WWW browsers, and conversely allows those users currently without Internet access to progress more easily to Web searching and browsing when it becomes available to them.<br>
An earlier version of this software has been used in a university-level distance learning course on computer literacy, where selected portions of various WWW sites were stored on CD-ROM for students to surf (Holmes and Rogers, 1997). Here, the primary advantages of avoiding an Internet connection were to smooth out variable page retrieval times, to avoid problems with off-site servers going down or being temporarily unavailable, and to eliminate communication costs. In secondary or primary school settings, this technique for capturing known portions of the WWW can be used to prevent students wasting lab time exploring sites that are irrelevant to the task at hand, or that are inappropriate for their age groups.<br>
The digital library collection described in this paper comprises a set of documents provided by the United Nations University, focusing primarily on food and nutrition. The goal of the United Nations University Press is to disseminate knowledge in the field of the global problems of human survival, development and welfare, in order to increase dynamic interaction in the world-wide community of learning and research. By making their documents available in a variety of formats (print, CD-ROM, WWW pages), this research and human development information can be distributed more widely, and in a form appropriate to the conditions required by information users.<br>
Section 2 describes the software architecture. Multimedia collections are supported, and a single collection may include text, images, audio, and even video clips. Compression technology is used to ensure that the greatest possible volume of information is packed into a limited storage space. The interface software combines easy-to-use browsing with powerful search facilities. As discussed in Section 3, several ways are provided to find information in a collection; a user can conduct keyword searches, access known documents by title, or browse subject “bookshelves”.<br>
<b>2. System architecture</b><br>
A great advantage of the WWW as a means of presenting and using information is that very little direct user interface programming is required. A system can generate simple text documents in HTML notation, and leave the task of display, printing, screen navigation, and so forth to a Web browser. As a result, the browser writer takes most of the burden of system dependence away from the application programmer. The CD version of the Greenstone library follows this structure: our software takes the form of a WWW server, communicating with an unmodified browser using IP networking software. While the primary goal is to have a system running on a stand-alone machine, the use of IP networking does also mean that the software will function as a WWW server over an external network. Figure 1 shows the general software organization. The gray box encloses the software components running on one machine.<br>
Ideally, the WWW server would be a standard piece of software, and a digital library would take exactly the same form on a single machine as it does on our larger WWW serving equipment. This did not prove possible for a number of reasons, the most significant of which was the amount of memory expected to be available on our target machines, which for this project include the older and smaller workstations commonly<br>
<hr>
<A name=3></a>in use in the Third World. The full digital library system on our WWW servers does make use of standard Internet server software. In the WWW version of our digital library architecture, pre- and post-processing of queries on the library is handled by tasks that are run via the CGI mechanism and that communicate via request queues with tasks running the MG document indexing and compression software (Witten et al., 1994). Much of the “glue” software is written in Perl (Wall et al., 1996) and requires the large Perl interpreter and software library to be in memory.<br>
In contrast, the CD-ROM version of the software is a single integrated piece of software incorporating the Web server, digital library pre/post processing, and MG. Only a single index need be in memory at any one time, as a CD-ROM usually only holds a single collection. All of the software is coded in C and C++ to avoid the significant overhead involved in using a Perl interpreter. The result is a system which will work satisfactorily on a workstation with 8 or 16 MB of main memory (depending on the memory requirements of the workstation’s operating system).<br>
A browser is directed to access the server in one of two ways. The simplest is to use the URL http://127.0.0.1 (127.0.0.1 means “local machine”). Once the first page is loaded, further pages are referenced relative to the starting page, and so are also obtained from the server. This is convenient in that it requires no set-up on the browser. The alternative is to set the browser to use 127.0.0.1 as its “proxy”. This means that all page requests are routed to the server. It functions like a fixed cache, satisfying requests when it can and passing demands that it cannot handle on to an external network (if available).<br>
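The relative-reference behaviour described above can be illustrated with a small Python sketch. It is only an illustration: the page paths (collection/search.html and so on) are hypothetical and are not taken from the Greenstone software; only the address http://127.0.0.1 comes from the text.<br>

```python
from urllib.parse import urljoin

# Starting page served by the local server (host address taken from the text;
# the file name index.html is an assumption made for this example).
START_PAGE = "http://127.0.0.1/index.html"

def resolve(link, base=START_PAGE):
    """Resolve a link found on a page the way a browser would."""
    return urljoin(base, link)

# Hypothetical relative links: both resolve to the local server.
local_links = [resolve("collection/search.html"), resolve("/images/cover.gif")]

# An absolute link to another host is left untouched; with the proxy
# set-up described above it would still be sent to 127.0.0.1 first.
remote_link = resolve("http://www.example.org/page.html")
```

Because every relative link resolves against the starting page, no browser configuration is needed for the first access method.<br>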
[Figure 1 diagram; labels recovered from the original: external network; internal network software; BROWSER; SERVER; Local File Retrieve; Local Text Database (MG); Local Non-Text Repository; Remote (WWW) access; CD; Special Processing]<br>
<b>Figure 1: Browser-Server Interface</b><br>
The server handles incoming page/file retrieval requests according to the requested item’s availability and form of storage. If a page is not available locally, the request may be passed on to an external network. If each page or document in a collection is stored in a separate file, then a local file request can access the item on the CD-ROM. However, in general we avoid storing a collection’s documents in separate files, because large numbers of files use CD-ROM space inefficiently. Instead, document files containing text are stored (and the extracted text is indexed) in an MG database, and non-text files are stored in a special repository file. The server has an index of the documents held in the MG database and the file repository. Incoming requests are checked against this index and may be retrieved from MG or the repository as appropriate. Major savings in collection storage requirements are possible by taking advantage of MG for text storage: typically text compresses to 25% of its original size, and the compressed index occupies around 7% of the size of the original text. This leads to a total storage requirement for the indexed collection of approximately one-third of the size of the original text alone. The system can also support a variety of types of<br>
<hr>
<A name=4></a>non-text items in the collection (audio, images, video clips) simply by including appropriate viewing utilities on the CD-ROM. For searching, the non-text items are represented by textual descriptions in the MG index.<br>
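As a rough check on the storage figures quoted above, the following Python sketch adds the two ratios. The 600 MB collection size is an invented example value; the 25% and 7% ratios are the typical figures given in the text.<br>

```python
# Typical ratios quoted in the text, as percentages of the original text size.
COMPRESSED_TEXT_PERCENT = 25
COMPRESSED_INDEX_PERCENT = 7

def indexed_size_mb(original_text_mb):
    """Estimate the size of the compressed text plus its index, in MB."""
    total_percent = COMPRESSED_TEXT_PERCENT + COMPRESSED_INDEX_PERCENT
    return original_text_mb * total_percent / 100

# Hypothetical 600 MB of raw text: 32% of it, roughly one-third.
estimate = indexed_size_mb(600)
```

The combined ratio is 32%, which is the “approximately one-third” stated above.<br>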
A request which requires some computation on the server, such as the submission of a query from a user, would normally be handled with CGI requests. On our system, such requests are invoked by URLs starting with http://127.0.0.1/server/ and are internally routed to handler routines within the server itself, particularly to MG components.<br>
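The request handling described in this section can be summarised in a minimal Python sketch. This is not the Greenstone implementation (which the text says is written in C and C++): the index contents, the paths, the order of the checks, and the function name route are all assumptions made for illustration. Only the /server/ prefix and the four kinds of source come from the text.<br>

```python
# Hypothetical index of items held on the CD-ROM: path -> (store, key).
INDEX = {
    "/unu/food/ch1.html": ("mg", 17),
    "/unu/food/cover.gif": ("repository", 4),
}
# Hypothetical set of items stored as separate files on the CD-ROM.
LOCAL_FILES = {"/help/about.html"}

def route(path, network_available=False):
    """Return a label naming the component that would serve the request."""
    if path.startswith("/server/"):
        return "internal handler"  # for example, a query passed to MG
    if path in INDEX:
        store, _key = INDEX[path]
        return "MG database" if store == "mg" else "non-text repository"
    if path in LOCAL_FILES:
        return "local file"
    if network_available:
        return "external network"
    return "not available"
```

For example, route("/server/query") yields "internal handler", while an unknown path is passed to the external network only when one is available.<br>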
The major implementation difficulty experienced was with the IP network software, on machines which did not have network cards or modem software. To avoid installation complexity we chose to implement our own network layer to be used on such machines. In the absence of networking software the server loads our internal network software and communicates using that.<br>
<b>3. Searching and navigating a collection</b><br>
The primary access method for documents in the United Nations University collection is keyword search (Figure 2a). The system supports searching over the <i>full</i> text of the document, not merely a document surrogate as is common in many commercial retrieval systems. While other collections we have built support a syntax for full Boolean searching, early user feedback from a similar document set (the Humanitarian Development collection, put together by the Global Help Project) indicated that Boolean searching was more confusing than helpful for the targeted users. Previous research suggests that difficulties with Boolean syntax and semantics are common, and are observed in diverse user groups (Borgman, 1996; Greene et al., 1990). Transaction log analysis over a number of library retrieval systems indicates that the most popular Boolean operator by far is the AND, with the Boolean OR and NOT rarely present in queries (Peters, 1993); we have confirmed this result in another New Zealand Digital Library collection (Jones et al., 1998). For all these reasons, the United Nations University interface default is ranked retrieval. However, to enable users to construct high-precision Boolean AND searches where necessary, selecting “search … for ALL the words” in the querying string produces the syntax-free equivalent of an AND query.<br>
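The difference between the ranked default and the “ALL the words” option can be sketched in Python as follows. The three-document collection is invented, and the score used here (a simple count of matching query words) is only a stand-in for the ranking actually computed by MG, which the text does not describe.<br>

```python
# Hypothetical mini-collection: document name -> text.
DOCS = {
    "a": "food and nutrition policy",
    "b": "nutrition in rural schools",
    "c": "water supply policy",
}

def words(text):
    """Split text into a set of case-folded words."""
    return set(text.lower().split())

def search_some(query):
    """Ranked retrieval: order documents by number of query words matched."""
    q = words(query)
    scored = [(len(q.intersection(words(t))), name) for name, t in DOCS.items()]
    hits = [(score, name) for score, name in scored if score != 0]
    return [name for score, name in sorted(hits, key=lambda h: (-h[0], h[1]))]

def search_all(query):
    """ALL the words: keep only documents containing every query word."""
    q = words(query)
    return sorted(name for name, t in DOCS.items() if q.issubset(words(t)))
```

With the query "nutrition policy", search_some returns all three documents with "a" ranked first, whereas search_all returns only "a", trading recall for precision without requiring any Boolean syntax.<br>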
<br>
Figure 2: (a) Initial search screen for the UNU collection and (b) search preferences page<br>
By default, search terms are stemmed and case differences are ignored. Most transaction log analysis from library online catalogs, digital libraries, and WWW search engines indicates that users tend to submit extremely brief queries. For example, the average query length for the New Zealand Digital Library’s <i>Computer Science Technical Report</i> collection is only 2.5 words (Jones et al., 1998), a typical figure mirrored in retrieval studies conducted over two decades (Sandore, 1993). With such<br>
<hr>
<A name=5></a>brief queries the major difficulty encountered with search results is low search recall; hence the system automatically expands the query through stemming and case folding. These defaults can be modified by<br>
The initial search screen (Figure 2a) also permits users to specify the “granularity” at which their search is done (that is, the size of the text against which the query is matched). Choices include <i>title</i>, <i>paragraph</i>, <i>same chapter or section</i>, and <i>book</i>. By selecting the smaller passage sizes, users can achieve a greater search precision, while selecting the larger ones tends to give a higher recall. Regardless of granularity, the results are always displayed in terms of a complete book, opened at the appropriate place.<br>
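The effect of granularity on precision and recall can be seen in a small Python sketch. The two-section book is invented, matching is simplified to “every query word must occur in the passage” with case folding only (no stemming), and the function names are hypothetical; the levels shown correspond to the paragraph, section, and book choices named above.<br>

```python
# Hypothetical book: a list of sections, each a list of paragraphs.
BOOK = [
    ["Rice is a staple food.", "Storage losses are high."],
    ["Vitamin A matters.", "Rice can be fortified."],
]

def matches(query, passage):
    """True if every case-folded query word occurs in the passage."""
    passage_words = set(passage.lower().replace(".", " ").split())
    return all(w in passage_words for w in query.lower().split())

def search(query, granularity):
    """Return index tuples of the passages at the chosen level that match."""
    if granularity == "paragraph":
        return [(s, p) for s, sec in enumerate(BOOK)
                for p, para in enumerate(sec) if matches(query, para)]
    if granularity == "section":
        return [(s,) for s, sec in enumerate(BOOK) if matches(query, " ".join(sec))]
    if granularity == "book":
        text = " ".join(" ".join(sec) for sec in BOOK)
        return [()] if matches(query, text) else []
    raise ValueError("unknown granularity: " + granularity)
```

The query "rice vitamin" matches no single paragraph, matches the second section, and matches the book as a whole: smaller passages give higher precision, larger ones higher recall.<br>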
Figure 3: Query results page<br>
We support browsing by taking advantage of the fact that the hierarchical structure of United Nations University Press documents is marked up in the document files. When an item in the “query results” list is selected (Figure 3), the user is presented with a photograph of the document’s front cover and a table of contents with an arrow marking the item’s position in the contents (Figure 4). Folders can be clicked open or closed, allowing the user to travel up and down the document’s structure (in Figure 5, moving from a report up to the section headings for that issue of the bulletin). Clicking on “expand contents” will expand out the whole table of contents so that the user can browse the titles of all chapters and subsections to get a detailed view of the entire contents. “Expand text” displays the whole text of the current section or book, which is particularly useful when printing a complete work.<br>
Figure 4: Viewing a selected item in the query results list<br>
<hr>
<A name=6></a>Figure 5: Moving up the document structure hierarchy<br>
Browsing or searching by subject is supported by clicking the “subjects” button on the menu options bar of any search or results page. This brings up a list of subjects, represented by bookshelves (Figure 6). Users can click on any bookshelf to look at books on that subject, and click on a book to read it. Similarly, clicking on the “titles” button allows the user to browse through an alphabetized list of titles. If the user is currently viewing a document when the “subjects” or “titles” button is clicked, s/he will be taken to the place in the subjects or titles list that corresponds to that book. This supports the user in browsing for books on the same subject, or for books with similar titles.<br>
Figure 6: Browsing by subject<br>
<b>4. Conclusions</b><br>
Despite near-universal current practice, the World-Wide Web is by no means the only way to deliver digital library services. Local networks and CD-ROM disks can be a viable alternative, and a necessary one in many operating environments. The humble CD-ROM can hold a lot of text, and DVD disks will enable easy distribution of very substantial collections.<br>
The challenge is to produce a scheme which can be used for distribution over each of these media, and look just the same to the user. The Greenstone software allows information to be made available in precisely the same form, using precisely the same interface, on a single-user (PC) computer, a local intranet, or the World-Wide Web. One reason for developing this technology was to permit access to important information in the Third World, which runs the risk of falling further behind because of inadequate network access. However, all who find the Internet capricious in terms of remote site availability, and suffer from highly variable and unpredictable network delays, will appreciate the advantages of having digital library information on site, whether in single-user or shared mode.<br>
<hr>
<A name=7></a>The United Nations University collection that we have described and illustrated is designed not, as most digital libraries seem to be, for technophiles, but for ordinary people with little or no computer experience. We have again run counter to common practice here to make the interface plain and easy to use. In a quest to improve usability for the ordinary person we have sacrificed features (actually deleted them from our software) that, although powerful, we have observed to be rarely employed by real users answering their real information needs.<br>
<b>References</b><br>
Borgman, C.L. (1996) Why are online catalogs still hard to use? <i>Journal of the American Society for Information Science</i> 47(7), pp. 493-503.<br>
Chowdhury, G.G. (1996) Developing modern information systems and services: Africa’s challenges for the future, <i>Online &amp; CDROM Review</i> 20(3), pp. 145-146.<br>
El-Hadidy, B. (1994) The breakeven point for using CD-ROM versus online: a case study for database access in a developing country, <i>Journal of the American Society for Information Science</i> 45(4), pp. 273-283.<br>
Greene, S.L., Devlin, S.J., Cannata, P.E., and Gomez, L.M. (1990) No Ifs, ANDs or Ors: a study of database querying, <i>International Journal of Man-Machine Studies</i> 32(3), pp. 303-326.<br>
Holmes, G., and Rogers, W.J. (1997) Gathering and indexing rich fragments of the World-Wide Web, <i>Proceedings of the International Conference on Computers in Education 1997</i> (Sarawak, Malaysia, Dec. 2-6), pp. 554-562.<br>
Jones, S., Cunningham, S.J., and McNab, R. (1998) An analysis of usage of a digital library, <i>Working Paper 98/13</i>, Department of Computer Science, University of Waikato (Hamilton, New Zealand).<br>
Peters, T. (1993) The history and development of transaction log analysis, <i>Library Hi-Tech</i> 11(2), pp. 41-66.<br>
Sandore, B. (1993) Applying the results of transaction log analysis, <i>Library Hi-Tech</i> 11(2), pp. 87-97.<br>
Wall, L., Christiansen, T., and Schwartz, R.L. (1996) <i>Programming Perl</i>. O’Reilly, Sebastopol (CA, USA).<br>
White, W.D. (1992) CD-ROM in developing countries, <i>CD-ROM Professional</i> (May), pp. 32-35.<br>
Witten, I.H., Moffat, A., and Bell, T.C. (1994) <i>Managing Gigabytes</i>. Van Nostrand Reinhold, New York, New York.<br>
<hr>
</Content>
</Section>
</Archive>