<a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>Applications for Bibliometric Research<br />in the Emerging Digital Libraries<br /></p><br /><p>Sally Jo Cunningham<br /></p><br /><p>Department of Computer Science<br /></p><br /><p>University of Waikato<br /></p><br /><p>Hamilton, New Zealand<br /></p><br /><p>email: sallyjo@waikato.ac.nz<br /></p><br /><p>Abstract: Large numbers of research documents have recently become available on<br /></p><br /><p>the Internet through “digital libraries”, and these collections are seeing high levels of<br /></p><br /><p>use by their related research communities. A secondary use for these document<br /></p><br /><p>repositories and indexes is as a platform for bibliometric research. We examine the<br /></p><br /><p>extent to which the new digital libraries support conventional bibliometric analysis, and<br /></p><br /><p>discuss shortcomings in their current forms. 
Interestingly, these electronic text<br /></p><br /><p>archives also provide opportunities for new types of studies: generally the full text of<br /></p><br /><p>documents is available for analysis, giving a finer grain of insight than abstract-only<br /></p><br /><p>online databases; these repositories often contain technical reports or pre-prints, the<br /></p><br /><p>“grey literature” that has been previously unavailable for analysis; and document<br /></p><br /><p>“usage” can be measured directly by recording user accesses, rather than studied<br /></p><br /><p>indirectly through document references.<br /></p><br /><p>1. Introduction<br /></p><br /><p>In recent years a number of &quot;digital libraries&quot; have become available through the<br /></p><br /><p>Internet. While the technology promises in the future to support large, heterogeneous<br /></p><br /><p>collections, at present the most widely used of the academically-focussed digital<br /></p><br /><p>libraries are generally repositories of one or two types of document (typically technical<br />reports, journal articles, pre-prints, or conference proceedings), grouped by discipline.</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>A distinguishing characteristic of these digital libraries is that the full text of documents<br /></p><br /><p>is often available for retrieval, as well as bibliographic records. The sciences are<br /></p><br /><p>represented much more heavily in the present crop of digital libraries than the social<br /></p><br /><p>sciences, arts, or humanities. They are maintained by professional societies,<br /></p><br /><p>universities, research laboratories, and even private individuals. 
Access is generally<br /></p><br /><p>free, both to search and to download documents.<br /></p><br /><p>The emergence of these subject-specific digital libraries is particularly important<br />given the pattern of access to materials presently employed by research scientists.<br /></p><br /><p>Informal exchanges of preprints, reprints, and photocopies of papers passed on by<br /></p><br /><p>colleagues currently are major venues for the transmission of scientific information<br />between researchers in the sciences. In one study, the dependence on these sources<br /></p><br /><p>ranges from 12% (for chemistry) to 39% (for mathematics) of all papers cited in<br />researchers' own publications [11]. A qualitative study of how computer<br />scientists locate and retrieve documents (computing is one of the domains considered<br />later in this paper) indicates that for that field, technical reports and research documents<br />found in various locations on the Internet are a preferred source of information [6].<br />Many of the digital library systems discussed in this paper are repositories for just this<br />type of literature. The documents tend to be of high quality: primarily technical<br /></p><br /><p>reports or working papers from research institutions (both academic and commercial),<br />as well as advance copies of work accepted for publication in conventional paper<br /></p><br /><p>journals. Moreover, these digital libraries are also coming to include refereed work<br />published digitally (in electronic journals). Anecdotal evidence suggests that in their<br />fields, these digital libraries are coming to be the resource of choice for locating cutting<br /></p><br /><p>edge work.<br /></p><br /><p>For specialized subjects such as high energy physics, this dependence on<br />informal or extra-library dissemination can be much higher. 
Ginsparg ([9], [10])<br />reports that fields in physics have traditionally relied heavily on preprint exchanges, and<br /></p><br /><p>the digital repositories of physics preprints begun in 1991 (the PHYSICS E-PRINT<br />ARCHIVES) have to a large extent supplanted conventional publishing and physical</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>paper mailing of technical reports. By providing ready access to information sources<br /></p><br /><p>that are already preferentially utilized by scientists, the digital libraries show potential to<br /></p><br /><p>increase access to information that until recently was expensive or difficult to acquire in<br /></p><br /><p>paper form. Indeed, in some fields (most notably physics) this process has already<br />begun, as researchers in less developed countries report access to ongoing research<br /></p><br /><p>through the Internet repositories that their local libraries could not afford to acquire<br /></p><br /><p>through conventional journal subscriptions ([9], [10]).<br />The primary use for new bibliographic resources is, of course, for the contents<br /></p><br /><p>of the documents involved. A secondary use for emerging resources is as a basis for<br /></p><br /><p>bibliometric analysis of the subject field. With the conventionally published scientific<br />literature, the sheer difficulty of accumulating statistics discouraged bibliometric<br /></p><br /><p>research until the advent of large bibliographic databases in the 1960's. Computerized<br /></p><br /><p>bibliographic databases sparked a significant increase in the number of large-scale<br /></p><br /><p>bibliographic studies, as significant portions of the collection and analysis of data could<br /></p><br /><p>be automated ([12], [13]). 
The availability of CD-ROM versions of bibliographic<br />databases has been of particular importance, since they provide a cheaper alternative to<br /></p><br /><p>the online commercial databases [3].<br />These computerized bibliographic resources have drawbacks, however. The<br /></p><br /><p>greatest is that the full text of documents is rarely available, and even abstracts are not<br /></p><br /><p>always present. This obviously limits the types of bibliometric research that can be<br /></p><br /><p>conducted solely through these databases. In addition, these databases are generally<br /></p><br /><p>limited to formally published documents (those appearing in selected books, journals,<br />and conference proceedings). The &quot;grey literature&quot; of technical reports, pre-prints, and<br />other works not formally published is largely ignored, and it is this absence of easy<br /></p><br /><p>access to these documents that has hampered the analysis of these important forms of<br /></p><br /><p>scientific communication.<br /></p><br /><p>The digital libraries currently in existence complement the online and CD-ROM<br /></p><br /><p>bibliographic databases. They are best suited for examinations of the &quot;physical&quot;<br /></p><br /><p>characteristics of documents (for example, document length), analysis based on</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>bibliographic information that can be automatically extracted from the document text or<br /></p><br /><p>the sometimes unevenly formatted bibliographic records (such as obsolescence<br />studies), and usage studies (geographic or institutional origin of users, date/time of<br />access, individual patterns of document retrieval, etc.). 
Because references are present<br />in the document file but not identified by field, co-citation and bibliographic coupling<br /></p><br /><p>research is not well-supported, and conducting these studies requires considerable<br /></p><br /><p>effort on the part of the researcher.<br /></p><br /><p>The variety of bibliographic repositories in the available digital libraries in itself<br /></p><br /><p>has great potential in conducting bibliometric research. Sigogneau et al [15] present a<br />case study illustrating the ways in which the strengths of different databases can be<br /></p><br /><p>played off each other; they conduct a fine-grained analysis of the emergence of research<br /></p><br /><p>fronts in molecular and cellular biology, and demonstrate that the observations gleaned<br /></p><br /><p>from two complementary bibliographic databases provide greater insight into their<br /></p><br /><p>problem. Similarly, it appears that the types of bibliographic data that can be gleaned<br /></p><br /><p>from the relatively unstructured digital libraries can be profitably combined with data<br /></p><br /><p>from online databases, CD-ROMS, and other more conventional bibliographic<br /></p><br /><p>resources.<br /></p><br /><p>This paper is organized as follows: Section 2 discusses the types of indexing<br /></p><br /><p>and searching available with current digital libraries; Section 3 gives examples of<br /></p><br /><p>conventional bibliometric techniques applied to Internet-accessible archives; Section 4<br /></p><br /><p>discusses opportunities to directly measure usage of documents and to detect<br /></p><br /><p>information-seeking patterns in researchers; and Section 5 presents our conclusions.<br /></p><br /><p>2. Indexing and searching in current digital libraries<br /></p><br /><p>At present, the types of indexing fields for most academically-oriented digital<br /></p><br /><p>library systems are limited. 
Many schemes index on user-supplied document<br /></p><br /><p>descriptions, abstracts, or similar document surrogates (for example, the PHYSICS E-<br />PRINT ARCHIVE [10], a collection of physics pre-prints and technical reports). As will</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>be discussed below, the quality of this user-provided data can be highly variable, and<br /></p><br /><p>may unfavorably impact the usefulness of the index for searching. Alternatively, a<br /></p><br /><p>designated site librarian may maintain a catalog (e.g., the WATERS [14] system, now<br />subsumed by NCSTRL (http://www.ncstrl.org/), both primarily collections of<br />computer science technical reports); in this case the quality of the bibliographic<br />information may be expected to be higher, but fewer sites will be likely to support<br /></p><br /><p>such a librarian and therefore fewer documents are likely to be included in the digital<br /></p><br /><p>library. In a “harvesting” system such as the computer science technical report<br /></p><br /><p>collections supported by HARVEST [2] or the NEW ZEALAND DIGITAL LIBRARY<br />computer science technical report collection ([16], [17]), documents are indexed from<br />passive repositories (that may not even be aware that their documents are being<br />included in the digital library). Harvesting systems therefore cannot rely on the<br />presence of bibliographic data of any sort.<br /></p><br /><p>Because of the relative paucity of high-quality bibliographic data available to<br /></p><br /><p>many of the current academically- or research-focussed digital library collections, their<br /></p><br /><p>search interfaces tend to be more primitive than those ordinarily found in online<br /></p><br /><p>bibliographic databases or library catalogs. 
Systems such as NCSTRL can support<br /></p><br /><p>author, title, and subject searching, but this more sophisticated search functionality<br />comes at the expense of requiring participating repositories to use specific software. As<br /></p><br /><p>a consequence, these latter systems may provide access to a smaller number of sites than<br /></p><br /><p>harvesting systems. Harvesters may access a broader range of providers, but at the<br /></p><br /><p>penalty of being limited to unfielded, keyword searches over the raw text of the<br /></p><br /><p>document or document surrogate.<br /></p><br /><p>Specifically, the indexing in existing digital libraries has a variety of shortcomings for<br /></p><br /><p>bibliometric applications:<br /></p><br /><p>• lack of fielded indexing: As noted above, some large and widely used digital<br />libraries (such as the computer science technical report collection of the NEW<br />ZEALAND DIGITAL LIBRARY) may lack formal cataloging entirely, and rely on</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>keyword searching over the raw document text. 
Obviously this makes field-<br /></p><br /><p>dependent analysis more difficult (for example, locating documents produced by<br />specific authors), and in the worst case may require a manual examination of all<br />files in the collection in order to reliably identify a desired document subset.<br /></p><br /><p>However, keyword search techniques that approximate fielded searching results<br /></p><br /><p>may suffice: for example in the NEW ZEALAND DIGITAL LIBRARY computer<br /></p><br /><p>science technical report collection, limiting the keyword search for “Johnson”<br /></p><br /><p>to a search of first pages only is likely to retrieve documents written by Johnson<br /></p><br /><p>(since for the majority of computer science technical reports, the first page<br />contains little more than author, title, date, and institution details).<br /></p><br /><p>A more principled approach to extracting bibliographic information is embodied<br /></p><br /><p>in the CiteSeer tool [1]. This software parses raw, unfielded academic<br />documents and attempts to identify such indexing information as author, title,<br /></p><br /><p>reference list, etc. Obviously such a tool cannot attain 100% accuracy over a<br /></p><br /><p>heterogeneous document collection, but in practice it appears useful in that it can<br /></p><br /><p>make a good first pass in processing a set of documents, providing an initial set<br /></p><br /><p>of parsed documents for analysis. The remaining (presumably much smaller) set<br />of unparsable documents can then be dealt with manually.<br /></p><br /><p>• lack of consistency in field formatting: Current digital libraries usually acquire<br />bibliographic information from either the authors of submitted articles or<br /></p><br /><p>automatic extraction routines (retrieving bibliographic details from catalog files<br />that may or may not be in a given document site, and that may or may not be in<br /></p><br /><p>an easily parsable form). 
Neither of these methods produces records with<br />standard formatting, which causes problems with automated bibliometric<br /></p><br /><p>analysis. Consider the following examples selected from entries in the hep-th<br /></p><br /><p>(high energy physics) collection of the PHYSICS E-PRINT ARCHIVES:</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>(i) Authors: A. Yu. Alekseev, V. Schomerus<br />(ii) Authors: Adel Bilal and Ian. I. Kogan<br />(iii) Authors: Paul S. Aspinwall and David R. Morrison (with an appendix <br /></p><br /><p>by Mark Gross)<br />(iv) Authors: A. H. Chamseddine and Herbi Dreiner (ETH-Zurich)<br /></p><br /><p>In this case, typical for existing digital libraries, there is no standardized format<br /></p><br /><p>for authors' names (here, appearing with full names, initials plus last name, and<br />a mixture of the two); no standard convention for separating author names<br />(here, either a comma or &quot;and&quot; is used); and parenthetical information can<br />include a variety of information such as the name of an associate author or the<br /></p><br /><p>institutional affiliations of an author. Manual processing or specially crafted<br /></p><br /><p>software would be required to reformat these fields for analysis.<br /></p><br /><p>• duplicate entries: Digital libraries that draw documents from a variety of sources<br /></p><br /><p>may inadvertently contain duplicate items. Unfortunately, the irregular<br /></p><br /><p>formatting of the bibliographic information makes it difficult to automatically<br /></p><br /><p>detect these duplicates.<br /></p><br /><p>• implicit field tagging: In some repositories, items are not explicitly tagged with<br />certain types of information – most commonly the document's date of<br /></p><br /><p>publication or production. 
Instead, the date is implicit in the document's title<br /></p><br /><p>(e.g., its numeration in a technical report series) or in the location of the document<br />in the file structure of the repository (e.g., separate directories exist for each<br />year). A second common piece of implicit data is the authors’ institutional<br />affiliations. This may be contained in the document itself (typically on a cover<br />page), or may be implicit in the document’s location (for example, a<br />corporation’s technical reports are stored in its ftp repository). Again, in these</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>cases special processing is required to append this field information to a<br /></p><br /><p>document record for bibliometric analysis. <br /></p><br /><p>• extraction of document text: Few of the documents stored in the research-<br />oriented digital libraries discussed in this paper are straight ASCII text; instead,<br /></p><br /><p>documents may appear in a variety of file formats, such as LaTeX, PostScript,<br /></p><br /><p>PDF, etc. If the contents of the documents are to be automatically processed<br /></p><br /><p>(for example, to count the words in a document, or to extract reference<br />publication dates for an obsolescence study), then the text must be extracted.<br />Utilities are available to convert most common document formats to ASCII.<br /></p><br /><p>It is likely that many of these problems will be addressed as the Internet-based<br /></p><br /><p>document indexing systems mature. Even minor changes can greatly increase the<br /></p><br /><p>usability of a bibliographic database for bibliometric research. For example, the<br /></p><br /><p>addition of an explicit date tag to many online databases in 1975 sparked new<br /></p><br /><p>applications in time series research [3].<br /></p><br /><p>3. 
Opportunities for applications of bibliometric techniques<br /></p><br /><p>One type of bibliometric research concentrates on quantifying fundamental,<br /></p><br /><p>structural details about a subject literature: how many items are published, how many<br />authors are publishing, over what time period documents are likely to be used, etc.<br /></p><br /><p>More complex studies analyze the relationships between documents, such as how<br /></p><br /><p>documents cluster into subjects. The following examples give a flavour of the<br />bibliometric research that is possible using the emerging digital libraries:<br /></p><br /><p>examining the “physical” characteristics of archived documents<br />One relatively straightforward type of bibliometric study characterizes the<br /></p><br /><p>formats of different literatures. For example, Figure 1 presents the range of the size</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>of computer science technical reports as measured by their length in pages. Of the<br /></p><br /><p>45,720 documents in the CSTR collection as of April 1998, nearly 1600 did not contain<br /></p><br /><p>page divisions in their files (and hence are excluded from analysis). Note that the<br />number of pages in the shorter documents (&lt;50 pages) falls into an approximately<br />normal distribution (slightly skewed to the left), while presumably the longer<br />documents represent Masters’ and Doctoral theses. A surprising number of documents<br /></p><br /><p>are very short (between one and 5 pages); these may represent the type of condensed<br />results frequently found in the “technical notes”, “short papers”, and “poster sessions”<br /></p><br /><p>of computing conferences and journals. 
The average number of pages per document,<br />27.5, appears to be slightly longer than the common upper bound for a computing<br /></p><br /><p>journal article, although this observation must be confirmed by a similar study of the<br />lengths of formally published computing articles.<br /></p><br /><p>This type of analysis is of particular interest for technical reports, since they<br /></p><br /><p>have not been studied in the same detail as formally published papers. A comparison of<br /></p><br /><p>the physical characteristics of the formal and informal literature could provide<br /></p><br /><p>supporting evidence for common beliefs about the relationship between the two types<br /></p><br /><p>of documents. For example, do publishing constraints force journal and proceedings<br />articles to be shorter than technical reports, and therefore presumably omit technical<br /></p><br /><p>details of findings? Do technical reports contain more/less extensive reference sections?<br /></p><br /><p>If reference sections of technical reports are longer than those of published articles, then<br /></p><br /><p>citation links are being omitted in published works; if technical reports contain fewer<br /></p><br /><p>references, then this may confirm earlier indications that computer scientists tend to<br /></p><br /><p>“research first” and do literature surveys later [6].<br /></p><br /><p>Figure 1. Range of sizes of CS technical reports, measured by number of pages<br /></p><br /><p>obsolescence studies.<br /></p><br /><p>A document is considered obsolete when it is no longer referenced by the<br /></p><br /><p>current literature. Typically, documents receive their greatest number and frequency of</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>citations immediately after publication, and the frequency of citation falls rapidly as time<br /></p><br /><p>passes. 
One technique for estimating the obsolescence rate of a body of literature– the<br /></p><br /><p>synchronous method – is to find the median date in the references of the documents.<br /></p><br /><p>This median date is subtracted from the year of publication for the documents, yielding<br /></p><br /><p>the median citation age. As would be expected, this median varies between the<br /></p><br /><p>disciplines. Typically the social sciences and arts have a higher median citation age<br /></p><br /><p>than the “hard” sciences and engineering, indicating that documents obsolesce more<br /></p><br /><p>quickly for the latter fields.<br /></p><br /><p>As noted in Section 2, references are not generally explicitly tagged in existing<br /></p><br /><p>digital repositories. However, reference dates can usually be extracted from the<br /></p><br /><p>document text by first locating the reference section (usually delimited by a &quot;references&quot;<br />or &quot;bibliography&quot; section heading), and then extracting all numbers in the appropriate<br />ranges for dates for the field under study.<br /></p><br /><p>To illustrate this process, 188 technical reports were sampled from Internet-<br /></p><br /><p>accessible repositories1 and used as source documents for a synchronous obsolescence<br /></p><br /><p>study. Conveniently, the repositories chosen organize technical reports into sub-<br /></p><br /><p>directories by their date of publication. The reference dates for each technical report<br /></p><br /><p>were automatically extracted by software that scanned the document’s file for numbers<br /></p><br /><p>of the form 19XX, since previous studies indicate that few if any computing reports<br /></p><br /><p>reference documents published in previous centuries [5]. 
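</p><br /><p>The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software used in the study: the function names, the heading pattern, and the rule that discards years later than the report's own publication year are all hypothetical choices, and the publication year is assumed to be supplied by the caller (in the study it came from the repository's sub-directory structure).</p>

```python
import re
from statistics import median

# Assumption: the reference list starts at a line consisting only of
# "References" or "Bibliography" (any capitalisation).
HEADING = re.compile(r"^[ \t]*(?:references|bibliography)[ \t]*$",
                     re.IGNORECASE | re.MULTILINE)

# Numbers of the form 19XX that are not part of a longer digit string.
YEAR = re.compile(r"(?<!\d)19\d\d(?!\d)")


def reference_years(text, publication_year):
    """Return the 19XX numbers found after the last reference heading."""
    headings = list(HEADING.finditer(text))
    if not headings:
        return []  # no recognisable reference section
    section = text[headings[-1].end():]
    years = [int(y) for y in YEAR.findall(section)]
    # Assumption: a number later than the report's own year is not a
    # citation date (it may be a page number or a report number).
    return [y for y in years if y <= publication_year]


def median_citation_age(text, publication_year):
    """Synchronous method: publication year minus median reference year."""
    years = reference_years(text, publication_year)
    if not years:
        return None
    return publication_year - median(years)
```

<p>For a report published in 1995 whose reference list cites works dated 1990, 1993 and 1994, this sketch gives a median reference year of 1993 and hence a median citation age of 2 years. Any 19XX number in the reference section is counted, so page ranges or report numbers that happen to fall in that range would still need manual checking.</p><br /><p>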
Table 1 presents the median<br />citation age calculated for these documents, broken down by repository and the year of<br /></p><br /><p>publication for the source documents from which the reference dates were extracted:<br /></p><br /><p>Table 1. Median citation ages for technical report repositories<br /></p><br /><p>The median citation age ranges between 2 and 4 years, which is consistent with<br /></p><br /><p>previous examinations of computing and information systems literature ([5], [4]).<br />When graphed, the distribution of reference dates shows the exponential curve typically<br /></p><br /><p>found in obsolescence studies, including the final droop due to an “immediacy effect”</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>as fewer very new documents are available for citation [7]. These types of results<br />provide confirmation that references used in computer science technical reports (the pre-<br />eminent “grey literature” of the computing field) conform to the same patterns as<br />references found in the formally published literature.<br /></p><br /><p>co-citation and bibliographic coupling studies<br /></p><br /><p>The rate at which documents are cited together (co-citation) or cite the same<br />documents (bibliographic coupling) can be used to produce &quot;maps&quot; of a subject<br />literature. These techniques rely on analysis of the references of documents, and these<br /></p><br /><p>references must be in a common format. While digital libraries contain full text of<br /></p><br /><p>documents, their references are not standardized, and indeed are not even tagged as<br /></p><br /><p>such. 
To perform these studies the references must be manually extracted and<br /></p><br /><p>processed–a tedious process that is only worthwhile for documents (such as technical<br />reports) that are not included in existing citation databases such as the Science Citation<br />Index and Social Science Citation Index.<br /></p><br /><p>detecting cycles or regularities in the rate of production of research<br />Analysis of trends in the production of technical reports can give indications<br /></p><br /><p>about working conditions that affect research; for example, is more research produced<br /></p><br /><p>over the summer, when the teaching load is lighter? or is research steadily produced<br /></p><br /><p>throughout the year?<br /></p><br /><p>Figure 2. Distribution of the number of documents submitted to hep-th, 1992-1994<br /></p><br /><p>Figures 2 and 3 present statistics on document accumulation in the hep-th (high<br />energy physics) e-print server, a part of the PHYSICS E-PRINT ARCHIVE. This system<br />is one of the oldest formal pre-print archives, and has become the primary means for<br /></p><br /><p>information dissemination in its field. Examination of these figures reveals several<br /></p><br /><p>trends. Clearly the absolute number of documents deposited in the repository has</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>tended to increase over the time period. For all three years, research production has its<br /></p><br /><p>lowest point in January and February, increases through May and June, then decreases<br /></p><br /><p>until August and September. At that point the rate of production steps up, reaching a<br /></p><br /><p>yearly peak in November and December. This pattern is less clear for 1992, which<br /></p><br /><p>might be expected as the archive was established in mid-1991.<br /></p><br /><p>Figure 3. 
Distribution of the percentage of documents submitted to hep-th, 1992-1994<br /></p><br /><p>4. Analysis of usage data<br /></p><br /><p>The emerging Internet-based digital libraries will permit research on scientific<br /></p><br /><p>information collection and use at a much finer grain than is possible with current paper<br /></p><br /><p>libraries or online bibliographic databases. Current bibliometric or scientometric<br /></p><br /><p>research of this type must measure information use indirectly – for example, through<br /></p><br /><p>examination of the list of references appended to published articles. However, it is well<br /></p><br /><p>known that authors do not necessarily include in the reference list all documents that<br /></p><br /><p>could have been cited, and conversely that not all references listed may have been<br /></p><br /><p>actually “used” in performing the research; citation behavior can be affected by a<br /></p><br /><p>number of motivating factors (Garfield lists 15 possible reasons in [8]).<br />Digital library transaction logs provide a powerful tool for direct analysis of<br /></p><br /><p>document “usage”: since digital libraries contain the actual document (rather than only a<br />document surrogate), the relative amount of “use” that a digital library’s clients make of<br />a given document can be estimated from the number of times the document file is<br /></p><br /><p>downloaded (and, presumably, the document is read). Note that file downloading is a<br />much stronger statement on the part of the user than, for example, having a<br /></p><br /><p>bibliographic record appear in the query result set for a conventional bibliographic<br /></p><br /><p>system; the user downloads only after the document has been found potentially relevant<br />through examination of its document surrogate. 
Additionally, downloading is<br /></p><br /><p>frequently time-consuming and sometimes costly (depending on local pricing for</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>Internet access). Downloaded documents are therefore highly likely at least to be<br />scanned, if not read closely. The transaction logs for a digital library can provide a<br /></p><br /><p>global picture of the use of documents in the collection, since all user interactions with<br /></p><br /><p>the library can be automatically logged for analysis. By contrast, it is of course<br /></p><br /><p>impossible to track usage of print bibliographies, and very difficult to monitor usage of<br /></p><br /><p>bibliographic data available on CD-ROM across more than one or two sites.<br /></p><br /><p>Furthermore, analysis of search requests by geographic location, institution,<br /></p><br /><p>and sometimes even individual user is also possible. As an example, Table 2 presents<br /></p><br /><p>a portion of the summary of usage statistics (broken down by domain code) for queries<br />to the computer science technical report collection of the NEW ZEALAND DIGITAL LIBRARY.<br /></p><br /><p>Examination of the data indicates that the heaviest use of the collection comes from<br /></p><br /><p>North America, Europe (particularly Germany and Finland), as well as the local New<br />Zealand community and nearby Australia. As expected for such a collection, a large<br /></p><br /><p>proportion of users are from educational (.edu) institutions; surprisingly, however, a<br />similar number of queries come from commercial (.com) organizations, indicating<br />perhaps that the documents are seeing use in commercial research and development<br /></p><br /><p>units.<br /></p><br /><p>Table 2. 
Accesses to the NEW ZEALAND DIGITAL LIBRARY CS collection by Domain<br />Code<br /></p><br /><p>Of course, usage levels can also be further broken down by IP number<br /></p><br /><p>(indicating institutions), and systems requiring users to register may also be able to<br />analyze usage on an individual basis. Since the query strings themselves are also<br /></p><br /><p>recorded in the transaction logs, this domain/institution/individual activity could also be<br /></p><br /><p>linked to specific subjects through the query terms. Summaries of this type could be<br />invaluable for studies of geographic diffusion and distribution of research topics.<br /></p><br /><p>Transaction log analysis can also indicate time-related patterns in the<br /></p><br /><p>information-seeking behavior of digital library users. As a sample of this type of<br /></p><br /><p>analysis, Paul Ginsparg notes a seven-day periodicity in the number of search requests</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>made to the PHYSICS E-PRINT archives (Figure 4, reproduced from [9]). From this he<br />deduces that many physicists do not yet have weekend access to the Internet (an<br />alternative, slightly more cynical hypothesis is that even high-energy theoretical<br /></p><br /><p>physicists take the weekend off).<br /></p><br /><p>Figure 4. Summary of search requests to the physics pre-print archives<br /></p><br /><p>5. Conclusion<br /></p><br /><p>This study suggests opportunities for conducting bibliometric research on the<br /></p><br /><p>evolving digital libraries. These repositories are suitable platforms for conventional<br /></p><br /><p>bibliometric techniques (such as obsolescence studies, quantification of physical<br />characteristics of documents comprising a subject literature, time analysis, etc.). 
The<br />ability to directly monitor access to documents in digital libraries also enables<br /></p><br /><p>researchers to explicitly quantify document usage, as well as to implicitly measure<br /></p><br /><p>usage through citations. Additional facilities could aid in the performance of<br /></p><br /><p>bibliographic experiments, such as: improved tagging of document fields; provision of<br /></p><br /><p>utilities to strip out titles, authors, etc. from common document formats; and the ability<br /></p><br /><p>to easily eliminate duplicate entries from downloaded library subsets. Unfortunately,<br /></p><br /><p>the most useful of these additional facilities – those associated with a higher degree of<br /></p><br /><p>cataloging – run counter to the underlying philosophy of many digital libraries: to<br /></p><br /><p>avoid, if possible, manual processing and formal cataloging of documents. While<br /></p><br /><p>adherence to this principle can limit the accuracy of fielded searching (or indeed,<br />preclude it altogether), it can also avoid the cataloging bottleneck and permit digital<br />libraries to provide access to larger numbers of documents.<br /></p><br /><p>The digital libraries complement the information currently available through<br /></p><br /><p>paper, online, and CD-ROM bibliographic resources. While these latter databases<br /></p><br /><p>generally have the advantage of standardized formatting of bibliographic fields, the<br /></p><br /><p>digital libraries are freely accessible, often contain &quot;grey literature&quot; that is otherwise</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>unavailable for analysis, and generally make the full text of documents available. 
The<br /></p><br /><p>insights gained from analysis of digital libraries will add to the store of &quot;information<br /></p><br /><p>about information&quot; that we have gained from older types of bibliographic repositories.<br /></p><br /><p>References<br /></p><br /><p>[1] Bollacker, K.D., S. Lawrence, and C.L. Giles, CiteSeer: An Autonomous Web<br />Agent for Automatic Retrieval and Identification of Interesting Publications,<br />Proceedings of the Second International Conference on Autonomous Agents<br />(Minneapolis/St. Paul, May 9-13), 1998.<br /></p><br /><p>[2] Bowman, C.M., P.B. Danzig, U. Manber, and M.F. Schwartz, Scalable Internet<br />resource discovery: Research problems and approaches, Communications of<br />the ACM 37(8) (1994) 98-107.<br /></p><br /><p>[3] Burton, Hilary D., Use of a virtual information system for bibliometric analysis,<br />Information Processing &amp; Management 24(1) (1988) 39-44.<br /></p><br /><p>[4] Cunningham, S.J., An empirical investigation of the obsolescence rate for<br />information systems literature, Library and Information Science<br />Research, 1996, http://library.fgcu.edu/iclc/lisrissu.htm<br /></p><br /><p>[5] Cunningham, S.J., and D. Bocock, Obsolescence of computing literature,<br />Scientometrics 34(2) (1995) 255-262.<br /></p><br /><p>[6] Cunningham, S.J., and Lynn Silipigni Connaway, Information searching<br />preferences and practices of computer science researchers, Proceedings of<br />OZCHI '96 (1996) 294-299.<br /></p><br /><p>[7] de Solla Price, D.J., Citation measures of hard science, soft science, technology,<br />and nonscience. In: C.E. Nelson and D.K. 
Pollock (eds), Communication<br />among scientists and engineers (Heath Lexington, 1970).<br /></p><br /><p>[8] Garfield, E., Citation Indexing: Its theory and application in Science, Technology<br />and Humanities (Wiley, 1979).</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p>[9] Ginsparg, P., After dinner remarks: 14 Oct ‘94 APS meeting at LANL, 1994<br />(&lt;URL: http://xxx.lanl.gov/blurb&gt;).<br /></p><br /><p>[10] Ginsparg, P., First steps towards electronic research communication, Computers<br />in Physics 8(4) (1994) 390-401.<br /></p><br /><p>[11] Hallmark, J., Scientists' access and retrieval of references cited in their recent<br />journal articles, College and Research Libraries 55(3) (1994) 199-210.<br /></p><br /><p>[12] Hawkins, D.T., Unconventional uses of on-line information retrieval systems:<br />on-line bibliometric studies, Journal of the American Society for Information<br />Science 28 (1977) 13-18.<br /></p><br /><p>[13] McGhee, P.E., P.R. Skinner, K. Roberto, N.J. Ridenour, and S.M. Larson,<br />Using online databases to study current research trends: an online bibliometric<br />study, Library and Information Science Research 9 (1987) 285-291.<br /></p><br /><p>[14] Maly, K., E.A. Fox, J.C. French, and A.L. Selman, Wide area technical report<br />server (Technical Report, Dept. of Computer Science, Old Dominion<br />University, 1994. Also available at &lt;URL:<br />http://www.cs.odu.edu/WATERS/WATERS-paper.ps&gt;).<br /></p><br /><p>[15] Sigogneau, M.J., S. Bain, J.P. Courtial, and H. Feillet, Scientific innovation in<br />bibliographical databases: a comparative study of the Science Citation Index<br />and the Pascal database, Scientometrics 22(1) (1991) 65-82.<br /></p><br /><p>[16] Witten, I.H., S.J. Cunningham, M. Vallabh, and T.C. 
Bell, A New Zealand<br /></p><br /><p>digital library for computer science research, Proceedings of Digital Libraries<br />'95 (1995) 25-30.<br /></p><br /><p>[17] Witten, I.H., C. Nevill-Manning, and S.J. Cunningham, A public library based<br />on full-text retrieval, Communications of the ACM 41(4), 1998, p. 71</p><br /><br /></div></div><br /><a name=0></a><div style="page-break-before:always; page-break-after:always"><div><p> <br /></p><br /><p>1Documents were randomly sampled from the DEC<br /></p><br /><p>(ftp://crl.dec.com/pub/DEC/CRL/tech-reports/), Sony<br />(ftp://ftp.csl.sony.co.jp/CSL/CSL-Papers), and Ohio (ftp://archive.cis.ohio-<br />state.edu/pub/tech-report/) technical report repositories</p><br /><br /></div></div><br />