<A name=1></a><b>Distributing Digital Libraries on the Web,</b><br> <b>CD-ROMs, and Intranets:</b><br> <b>Same information, same look-and-feel, different media</b><br> Ian Witten, Sally Jo Cunningham, Bill Rogers, Rodger McNab, Stefan Boddie<br> Department of Computer Science<br> University of Waikato<br> Hamilton, New Zealand<br> <b>Abstract:</b> The Greenstone system from the New Zealand Digital Library provides a<br>new way of making collections of information available in the same form over<br>the World-Wide Web, on CD-ROM, or on local Intranets. Exactly the same<br>information is available in each case, and exactly the same interface is used to access it.<br>The New Zealand Digital Library is accessible over the Web and offers a wide variety<br>of information collections. Sub-collections can be written to a CD-ROM, which can be<br>used on a standalone PC by a single user. 
A local Web browser suffices to access the<br>information on the disk just as though the PC were connected to the Internet.<br>Simultaneously, if there is a network connection, the same disk acts as a network server<br>to make exactly the same information available to others who need only use their<br>standard Internet browser software. This technology has great appeal for many users,<br>particularly those in developing nations where non-local Internet access can be<br>precarious or prohibitively expensive.<br> <b>1. Introduction</b><br> The emerging digital library movement is a child of the Internet and the World-Wide<br>Web. Spurred on by visions of an “information superhighway,” current digital library<br>projects invariably concentrate on providing access to document collections over the<br>Internet, where documents, users, and catalog may all be distributed widely. Often the<br>search interface is WWW-based, in contrast to the telnet or phone-in access required by<br>library OPACS and earlier commercial “online” bibliographic databases such as Dialog.<br>Web-based digital libraries have significant advantages over their online predecessors.<br>Users need not obtain and install search software on their own sites. In many areas<br>Internet access incurs minimal charges, or at any rate is significantly cheaper than a<br>direct telephone connection with the retrieval system. Finally, Web browsers provide a<br>simple, standard means of access to a variety of digital library systems.<br> However, practical experience in digital library development indicates that in many<br>situations, universal access via the Internet is neither possible nor desirable. A<br>business, for example, might desire a digital library to make its proprietary documents<br>available to its employees, but only if the company’s security could be ensured by<br>restricting access with an intranet. 
CD-ROM has been identified as the implementation<br>platform of choice for collections targeted at large portions of the Third World; for<br>many developing countries, particularly in Sub-Saharan Africa, Internet connections are<br>still either non-existent, undependable, or prohibitively expensive to use. Despite its<br>lowly status, the CD-ROM has many advantages. Relatively durable in the face of harsh<br>environmental conditions, it incurs known, fixed costs for purchase and supporting<br>hardware (White, 1992). It makes information accessible on a tangible medium that is<br>under the user’s control and is not subject to capricious decisions by others. A CD-<br>ROM based digital library carries the further advantage of providing full document<br>contents—a significant drawback to bibliographic systems being that their users in<br>developing countries could locate descriptions of relevant documents, but were then<br>often unable to obtain the documents themselves (El-Hadidy, 1994; Chowdhury,<br>1996). Finally, while a CD-ROM holds a reasonable amount of material in textual form,<br>digital videodisk technology is already available which can store 12 Gb on a single<br>disk—far larger than most extant textual digital libraries.<br> <hr> <A name=2></a>For this reason the Greenstone digital library software developed by the New Zealand<br>Digital Library project allows a collection developer to create a digital library that is<br>WWW-based, intranet-based, or available on a standalone or networked CD-ROM. All<br>platforms support exactly the same interface, and the same search and retrieval<br>methods. 
This standardization reduces the system learning curve for intranet or CD-<br>ROM users who have previous experience with WWW browsers, and conversely<br>allows those users currently without Internet access to more easily progress to Web<br>searching and browsing when it becomes available to them.<br> An earlier version of this software has been used in a university-level distance learning<br>course on computer literacy, where selected portions of various WWW sites were<br>stored on CD-ROM for students to surf (Holmes and Rogers, 1997). Here, the primary<br>advantages of avoiding an Internet connection were to smooth out variable page<br>retrieval times, to avoid problems with off-site servers going down or being temporarily<br>unavailable, and to eliminate communication costs. In secondary or primary school<br>settings, this technique for capturing known portions of the WWW can be used to<br>prevent students wasting lab time exploring sites that are irrelevant to the task at hand, or<br>that are inappropriate for their age groups.<br> The digital library collection described in this paper comprises a set of documents<br>provided by the United Nations University, focusing primarily on food and nutrition.<br>The goal of the United Nations University Press is to disseminate knowledge in the<br>field of the global problems of human survival, development and welfare, in order to<br>increase dynamic interaction in the world-wide community of learning and research.<br>By making their documents available in a variety of formats (print, CD-ROM, WWW<br>pages), this research and human development information can be distributed more<br>widely, and in a form appropriate to the conditions required by information users.<br> Section 2 describes the software architecture. Multimedia collections are supported, and<br>a single collection may include text, images, audio, and even video clips. 
Compression<br>technology is used to ensure that the greatest possible volume of information is packed<br>into a limited storage space. The interface software combines easy-to-use browsing<br>with powerful search facilities. As discussed in Section 3, several ways are provided to<br>find information in a collection; a user can conduct keyword searches, access known<br>documents by title, or browse subject “bookshelves”.<br> <b>2. System architecture</b><br> A great advantage of the WWW as a means of presenting and using information is that<br>very little direct user interface programming is required. A system can generate simple<br>text documents in HTML notation, and leave the task of display, printing, screen<br>navigation, and so forth to a Web browser. As a result, the browser writer takes most<br>of the burden of system dependence away from the application programmer. The CD<br>version of the Greenstone library follows this structure: our software takes the form of<br>a WWW server, communicating with an unmodified browser using IP networking<br>software. While the primary goal is to have a system running on a stand-alone<br>machine, the use of IP networking does also mean that the software will function as a<br>WWW server over an external network. Figure 1 shows the general software<br>organization. The gray box encloses the software components running on one<br>machine.<br> Ideally, the WWW server would be a standard piece of software, and a digital library<br>would take exactly the same form on a single machine as it does on our larger WWW<br>serving equipment. This did not prove possible for a number of reasons, the most<br>significant of which was the amount of memory expected to be available on our target<br>machines, which for this project include the older and smaller workstations commonly<br> <hr> <A name=3></a>in use in the Third World. The full digital library system on our WWW servers does<br>make use of standard Internet server software. 
In the WWW version of our digital<br>library architecture, pre- and post-processing of queries on the library is handled by<br>tasks run via the CGI mechanism, which communicate via request queues with tasks<br>running the MG document indexing and compression software (Witten et al., 1994).<br>Much of the ‘glue’ software is written in Perl (Wall et al., 1996) and requires the large<br>Perl interpreter and software library to be in memory.<br> In contrast, the CD-ROM version of the software is a single integrated piece of software<br>incorporating the Web server, digital library pre/post processing, and MG. Only a<br>single index need be in memory at any one time, as a CD-ROM usually only holds a<br>single collection. All of the software is coded in C and C++ to avoid the significant<br>overhead involved in using a Perl interpreter. The result is a system which will work<br>satisfactorily on a workstation with 8 or 16 MB of main memory (depending on the<br>memory requirements of the workstation’s operating system).<br> A browser is directed to access the server in one of two ways. The simplest is to use<br>the URL http://127.0.0.1 (127.0.0.1 means ‘local machine’). Once the first page is<br>loaded, further pages are referenced relative to the starting page, and so are also<br>obtained from the server. This is convenient in that it requires no set-up on the<br>browser. The alternative is to set the browser to use 127.0.0.1 as its ‘proxy’. This<br>means that all page requests are routed to the server. 
It functions like a fixed cache,<br>satisfying requests when it can and passing demands that it cannot handle on to an<br>external network (if available).<br> [Figure 1 components: browser; internal network software; external network; server, with local file retrieval, local text database (MG), local non-text repository, remote (WWW) access, and special processing; CD]<br> <b>Figure 1: Browser-Server Interface</b><br> The server handles incoming page/file retrieval requests according to the requested<br>item’s availability and form of storage. If a page is not available locally, the request<br>may be passed on to an external network. If each page or document in a collection is<br>stored in a separate file, then a local file request can access the item on the CD-ROM.<br>However, in general we avoid storing a collection’s documents in separate files,<br>because large numbers of files use CD-ROM space inefficiently. Instead, document<br>files containing text are stored (and the extracted text is indexed) in an MG database,<br>and non-text files are stored in a special repository file. The server has an index of the<br>documents held in the MG database and the file repository. Incoming requests are<br>checked against this index and may be retrieved from MG or the repository as<br>appropriate. Major savings in collection storage requirements are possible by taking<br>advantage of MG for text storage: typically text compresses to 25% of its original size,<br>and the compressed index occupies around 7% of the size of the original text. This<br>leads to a total storage requirement for the indexed collection of approximately one-third<br>of the size of the original text alone. The system can also support a variety of types of<br> <hr> <A name=4></a>non-text items in the collection (audio, images, video clips) simply by including<br>appropriate viewing utilities on the CD-ROM. 
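The storage figures quoted above (compressed text at about 25% of the original size, compressed index at about 7%) can be checked with a few lines of Python. This is only an illustrative sketch: the function names and the 650 MB disc capacity are assumptions made for the example and are not part of Greenstone or MG.

```python
# Illustrative sketch only; these helpers are hypothetical, not part of MG or Greenstone.
TEXT_RATIO = 0.25   # compressed text as a fraction of the original text (figure from the paper)
INDEX_RATIO = 0.07  # compressed index as a fraction of the original text (figure from the paper)


def indexed_collection_size(original_bytes: float) -> float:
    """Estimate the size of the compressed text plus its compressed index."""
    return original_bytes * (TEXT_RATIO + INDEX_RATIO)


def max_original_text(capacity_bytes: float) -> float:
    """Estimate how much original text fits, once indexed, in the given capacity."""
    return capacity_bytes / (TEXT_RATIO + INDEX_RATIO)


if __name__ == "__main__":
    mb = 1024 * 1024
    cd_capacity = 650 * mb  # assumed nominal CD-ROM capacity, not a figure from the paper
    print(f"combined ratio: {TEXT_RATIO + INDEX_RATIO:.2f}")
    print(f"1000 MB of text needs about {indexed_collection_size(1000 * mb) / mb:.0f} MB")
    print(f"the assumed disc holds about {max_original_text(cd_capacity) / mb:.0f} MB of original text")
```

Under these assumptions the combined ratio is 0.32, which is the "approximately one-third" stated in the text. The estimate covers text only; non-text items held in the repository file are not included.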
For searching, the non-text items are<br>represented by textual descriptions in the MG index.<br> A request which requires some computation on the server, such as the submission of a<br>query from a user, would normally be handled with CGI requests. On our system,<br>such requests are invoked by URLs starting with http://127.0.0.1/server/. These are<br>internally routed to handler routines within the server itself, particularly to MG<br>components.<br> The major implementation difficulty experienced was with the IP network software, on<br>machines which did not have network cards or modem software. To avoid installation<br>complexity we chose to implement our own network layer to be used on such<br>machines. In the absence of networking software the server loads our internal network<br>software and communicates using that.<br> <b>3. Searching and navigating a collection</b><br> The primary access method for documents in the United Nations University collection<br>is keyword search (Figure 2a). The system supports searching over the <i>full</i> text of the<br>document, not merely a document surrogate as is common in many commercial<br>retrieval systems. While other collections we have built support a syntax for full<br>Boolean searching, early user feedback from a similar document set (the Humanitarian<br>Development collection, put together by the Global Help Project) indicated that Boolean<br>searching was more confusing than helpful for the targeted users. Previous research<br>suggests that difficulties with Boolean syntax and semantics are common, and are<br>observed in diverse user groups (Borgman, 1996; Greene et al., 1990). Transaction log<br>analysis over a number of library retrieval systems indicates that the most popular<br>Boolean operator by far is the AND, with the Boolean OR and NOT rarely present in<br>queries (Peters, 1993); we have confirmed this result in another New Zealand Digital<br>Library collection (Jones et al., 1998). 
For all these reasons, the United Nations<br>University interface default is ranked retrieval. However, to enable users to construct<br>high-precision Boolean AND searches where necessary, selecting “search…for ALL<br>the words” in the querying string produces the syntax-free equivalent of an AND query.<br> Figure 2: (a) Initial search screen for the UNU collection and (b) search preferences page<br> By default, search terms are stemmed and case differences are ignored. Most<br>transaction log analysis from library online catalogs, digital libraries, and WWW search<br>engines indicates that users tend to submit extremely brief queries. For example, the<br>average query length for the New Zealand Digital Library’s <i>Computer Science<br>Technical Report</i> collection is only 2.5 words (Jones et al., 1998), a typical figure<br>mirrored in retrieval studies conducted over two decades (Sandore, 1993). With such<br> <hr> <A name=5></a>brief queries the major difficulty encountered with search results is low search<br>recall; hence the system automatically expands the query through stemming and case<br>folding. These defaults can be modified on the search preferences page (Figure 2b).<br> The initial search screen (Figure 2a) also permits users to specify the “granularity” at<br>which their search is done (that is, the size of the text against which the query is<br>matched). Choices include <i>title</i>, <i>paragraph</i>, <i>same chapter or section</i>, and <i>book</i>. By<br>selecting the smaller passage sizes, users can achieve a greater search precision, while<br>selecting the larger ones tends to give a higher recall. Regardless of granularity, the<br>results are always displayed in terms of a complete book, opened at the appropriate<br>place.<br> Figure 3: Query results page<br> We support browsing by taking advantage of the fact that the hierarchical structure of<br>United Nations University Press documents is marked up in the document files. 
When<br>an item in the “query results” list is selected (Figure 3), the user is presented with a<br>photograph of the document’s front cover and a table of contents with an arrow<br>marking the item’s position in the contents (Figure 4). Folders can be clicked open or<br>closed, allowing the user to travel up and down the document’s structure (in Figure 5,<br>moving from a report up to the section headings for that issue of the bulletin). Clicking<br>on “expand contents” will expand out the whole table of contents so that the user can<br>browse the titles of all chapters and subsections to get a detailed view of the entire<br>contents. “Expand text” displays the whole text of the current section or book, which is<br>particularly useful when printing a complete work.<br> Figure 4: Viewing a selected item in the query results list<br> <hr> <A name=6></a>Figure 5: Moving up the document structure hierarchy<br> Browsing or searching by subject is supported by clicking the “subjects” button on the<br>menu options bar of any search or results page. This brings up a list of subjects,<br>represented by bookshelves (Figure 6). Users can click on any bookshelf to look at<br>books on that subject, and click on a book to read it. Similarly, clicking on the “titles”<br>button allows the user to browse through an alphabetized list of titles. If the user is<br>currently viewing a document when the “subjects” or “titles” button is clicked, s/he will<br>be taken to the place in the subjects or titles list that corresponds to that book. This<br>supports the user in browsing for books on the same subject, or for books with similar<br>titles.<br> Figure 6: Browsing by subject<br> <b>4. Conclusions</b><br> Despite near-universal current practice, the World-Wide Web is by no means the only<br>way to deliver digital library services. Local networks and CD-ROM disks can be a<br>viable alternative, and a necessary one in many operating environments. 
The humble<br>CD-ROM can hold a lot of text, and DVD disks will enable easy distribution of very<br>substantial collections.<br> The challenge is to produce a scheme which can be used for distribution over each of<br>these media, and look just the same to the user. The Greenstone software allows<br>information to be made available in precisely the same form, using precisely the same<br>interface, on a single-user (PC) computer, a local intranet, or the World-Wide Web.<br>One reason for developing this technology was to permit access to important<br>information in the Third World, which runs the risk of falling further behind because of<br>inadequate network access. However, all who find the Internet capricious in terms of<br>remote site availability, and suffer from highly variable and unpredictable network<br>delays, will appreciate the advantages of having digital library information on<br>site, whether in single-user or shared mode.<br> <hr> <A name=7></a>The United Nations University collection that we have described and illustrated is<br>designed not, as most digital libraries seem to be, for technophiles, but for ordinary<br>people with little or no computer experience. We have again run counter to common<br>practice here to make the interface plain and easy to use. In a quest to improve usability<br>for the ordinary person we have sacrificed features (actually deleted them from our<br>software) that, although powerful, we have observed to be rarely employed by real<br>users answering their real information needs.<br> <b>References</b><br> Borgman, C.L. (1996) Why are online catalogs still hard to use? <i>Journal of the<br>American Society for Information Science</i> 47(7), pp. 493-503.<br> Chowdhury, G.G. (1996) Developing modern information systems and services:<br> Africa’s challenges for the future, <i>Online &amp; CDROM Review</i> 20(3), pp. 145-146.<br> El-Hadidy, B. 
(1994) The breakeven point for using CD-ROM versus online: a case<br> study for database access in a developing country, <i>Journal of the American<br>Society for Information Science</i> 45(4), pp. 273-283.<br> Greene, S.L., Devlin, S.J., Cannata, P.E., and Gomez, L.M. (1990) No Ifs, ANDs or<br> Ors: a study of database querying, <i>International Journal of Man-Machine<br>Studies</i> 32(3), pp. 303-326.<br> Holmes, G., and Rogers, W.J. (1997) Gathering and indexing rich fragments of the<br> World-Wide Web, <i>Proceedings of the International Conference on Computers<br>in Education 1997</i> (Sarawak, Malaysia, Dec. 2-6), pp. 554-562.<br> Jones, S., Cunningham, S.J., and McNab, R. (1998) An analysis of usage of a digital<br> library, <i>Working Paper 98/13</i>, Department of Computer Science, University of<br>Waikato (Hamilton, New Zealand).<br> Peters, T. (1993) The history and development of transaction log analysis, <i>Library Hi-Tech</i><br> 11(2), pp. 41-66.<br> Sandore, B. (1993) Applying the results of transaction log analysis, <i>Library Hi-Tech</i><br> 11(2), pp. 87-97.<br> Wall, L., Christiansen, T., and Schwartz, R.L. (1996) <i>Programming Perl</i>. O’Reilly,<br> Sebastopol (CA, USA).<br> White, W.D. (1992) CD-ROM in developing countries, <i>CD-ROM Professional</i> (May),<br> pp. 32-35.<br> Witten, I.H., Moffat, A., and Bell, T.C. (1994) <i>Managing Gigabytes</i>. Van Nostrand<br> Reinhold, New York, New York.<br> <hr>