root/other-projects/nightly-tasks/diffcol/trunk/model-collect/PDFBox/archives/HASH019c5dca.dir/doc.xml @ 27951

Revision 27951, 54.7 KB (checked in by ak19, 6 years ago)

Updating PDFBox collection with the extra metadata extracted (when using the PDFBox extension) sorted in doc.xml, for diffcol to give consistent results on CentOS and Ubuntu.

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
  <Description>
    <Metadata name="gsdldoctype">indexed_doc</Metadata>
    <Metadata name="Language">en</Metadata>
    <Metadata name="Encoding">utf8</Metadata>
    <Metadata name="Title">biblio_for_dl_scientometrics.do</Metadata>
    <Metadata name="URL">http://research/ak19/gs2-svn-Mar18/tmp/F297.html</Metadata>
    <Metadata name="UTF8URL">http://research/ak19/gs2-svn-Mar18/tmp/F297.html</Metadata>
    <Metadata name="gsdlsourcefilename">import/pdf03.pdf</Metadata>
    <Metadata name="gsdlconvertedfilename">/research/ak19/gs2-svn-Mar18/tmp/F297.html</Metadata>
    <Metadata name="OrigSource">F297.html</Metadata>
    <Metadata name="Source">pdf03.pdf</Metadata>
    <Metadata name="SourceFile">pdf03.pdf</Metadata>
    <Metadata name="Plugin">PDFPlugin</Metadata>
    <Metadata name="FileSize">35935</Metadata>
    <Metadata name="FilenameRoot">pdf03</Metadata>
    <Metadata name="FileFormat">PDF</Metadata>
    <Metadata name="srcicon">_iconpdf_</Metadata>
    <Metadata name="srclink_file">doc.pdf</Metadata>
    <Metadata name="srclinkFile">doc.pdf</Metadata>
    <Metadata name="NumPages">17</Metadata>
    <Metadata name="ex.ExifTool.ExifToolVersion">8.57</Metadata>
    <Metadata name="ex.File.Directory">/research/ak19/gs2-svn-Mar18/collect/PDFBox/import</Metadata>
    <Metadata name="ex.File.FileModifyDate">2013:08:01 15:24:13+12:00</Metadata>
    <Metadata name="ex.File.FileName">pdf03.pdf</Metadata>
    <Metadata name="ex.File.FilePermissions">644</Metadata>
    <Metadata name="ex.File.FileSize">35935</Metadata>
    <Metadata name="ex.File.FileType">PDF</Metadata>
    <Metadata name="ex.File.MIMEType">application/pdf</Metadata>
    <Metadata name="ex.PDF.Author">Bronwyn</Metadata>
    <Metadata name="ex.PDF.CreateDate">1999:09:27 16:05:06</Metadata>
    <Metadata name="ex.PDF.Creator">Microsoft Word</Metadata>
    <Metadata name="ex.PDF.Linearized">false</Metadata>
    <Metadata name="ex.PDF.PDFVersion">1.1</Metadata>
    <Metadata name="ex.PDF.PageCount">17</Metadata>
    <Metadata name="ex.PDF.Producer">Acrobat PDFWriter 2.0 for Macintosh</Metadata>
    <Metadata name="ex.PDF.Title">biblio_for_dl_scientometrics.do</Metadata>
    <Metadata name="Identifier">HASH019c5dca7f5bb781460a6b9c</Metadata>
    <Metadata name="lastmodified">1375327453</Metadata>
    <Metadata name="lastmodifieddate">20130801</Metadata>
    <Metadata name="oailastmodified">1375327484</Metadata>
    <Metadata name="oailastmodifieddate">20130801</Metadata>
    <Metadata name="assocfilepath">HASH019c5dca.dir</Metadata>
    <Metadata name="gsdlassocfile">doc.pdf:application/pdf:</Metadata>
  </Description>
  <Content>
&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;Applications for Bibliometric Research&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;in the Emerging Digital Libraries&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Sally Jo Cunningham&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Department of Computer Science&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;University of Waikato&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Hamilton, New Zealand&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;email:  sallyjo@waikato.ac.nz&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Abstract:  Large numbers of research documents have recently become available on&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;the Internet through  “digital libraries”, and these collections are seeing high levels of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;use by their related research communities. A secondary  use for these document&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;repositories and indexes is as a platform for bibliometric research.  We examine the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;extent to which the new digital libraries support conventional bibliometric analysis, and&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;discuss shortcomings in their current forms. 
Interestingly, these electronic text&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;archives also provide opportunities for new types of studies:  generally the full text of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;documents are available for analysis, giving a finer grain of insight than abstract-only&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;online databases;  these repositories often contain technical reports or pre-prints, the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;“grey literature” that has been previously unavailable for analysis; and document&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;“usage” can be measured directly by recording user accesses, rather than studied&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;indirectly through document references.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;1.  Introduction&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;In recent years a number of &amp;quot;digital libraries&amp;quot; have become available through the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Internet.  
While the technology promises in the future to support large, heterogeneous&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;collections, at present the most widely used of the academically-focussed digital&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;libraries are generally repositories of one or two types of document (typically technical&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;reports, journal articles, pre-prints, or conference proceedings), grouped by discipline.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;A distinguishing characteristic of these digital libraries is that the full text of documents&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;are often available for retrieval, as well as bibliographic records. The sciences are&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;represented much more heavily in the present crop of digital libraries than the social&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;sciences, arts, or humanities. They are maintained by professional societies,&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;universities, research laboratories, and even private individuals.  Access is generally&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;free, both to search and to download documents.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The emergence of these subject-specific digital libraries is particularly important&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;given the pattern of access to materials presently employed by research scientists.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Informal exchanges of preprints, reprints, and photocopies of papers passed on by&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;colleagues currently are major venues for the transmission of scientific information&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;between researchers in the sciences.  
In one study, the dependence on these sources&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;ranges from 12% (for chemistry)  to 39% (for mathematics) of all papers cited in&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;researchers' own publications [11]. A qualitative study of how computer&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;scientists locate and retrieve documents (computing is one of the domains considered&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;later in this paper) indicates that for that field, technical reports and research documents&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;found in various locations on the Internet are a preferred source of information [6].&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Many of the digital library systems discussed in this paper are repositories for just this&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;type of literature.  The documents tend to be of high quality:  primarily  technical&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;reports or working papers from research institutions (both academic and commercial),&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;as well as advance copies of work accepted for publication in conventional paper&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;journals. Moreover, these digital libraries are also coming to include refereed work&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;published digitally (in electronic journals).  Anecdotal evidence suggests that in their&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;fields, these digital libraries are coming to be the resource of choice for locating cutting&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;edge work.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;For specialized subjects such as high energy physics, this dependence on&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;informal or extra-library dissemination can be much higher. 
Ginsparg ([9], [10])&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;reports that fields in physics have traditionally relied heavily on preprint exchanges, and&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;the digital repositories of physics preprints begun in 1991 (the PHYSICS E-PRINT&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;ARCHIVES) have to a large extent supplanted conventional publishing and physical&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;paper mailing of technical reports.  By providing ready access to information sources&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;that are already preferentially utilized by scientists, the digital libraries show potential to&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;increase access to information that until recently was expensive or difficult to acquire in&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;paper form.  Indeed, in some fields (most notably physics) this process has already&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;begun, as researchers in less developed countries report access to ongoing research&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;through the Internet repositories that their local libraries could not afford to acquire&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;through conventional journal subscriptions ([9], [10]).&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The primary use for new bibliographic resources is, of course, for the contents&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;of the documents involved.  A secondary use for emerging resources is as a basis for&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;bibliometric analysis of the subject field.  
With the conventionally published scientific&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;literature, the sheer difficulty of accumulating statistics discouraged bibliometric&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;research until the advent of large bibliographic databases in the 1960's. Computerized&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;bibliographic databases sparked a significant increase in the number of large-scale&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;bibliographic studies, as significant portions of the collection and analysis of data could&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;be automated ([12], [13]).  The availability of CD-ROM versions of bibliographic&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;databases has been of particular importance, since they provide a cheaper alternative to&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;the online commercial databases [3].&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;These computerized bibliographic resources have drawbacks, however.  The&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;greatest is that the full text of documents are rarely available, and even abstracts are not&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;always present.  This obviously limits the types of bibliometric research that can be&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;conducted solely through these databases.  In addition, these databases are generally&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;limited to formally published documents (those appearing in selected books, journals,&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;and conference proceedings).  
The &amp;quot;grey literature&amp;quot; of technical reports, pre-prints, and&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;other works not formally published are largely ignored, and it is this absence of easy&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;access to these documents that has hampered the analysis of these important forms of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;scientific communication.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The digital libraries currently in existence complement the online and CD-ROM&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;bibliographic databases.  They are best suited for examinations of the &amp;quot;physical&amp;quot;&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;characteristics of documents (for example, document length), analysis based on&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;bibliographic information that can be automatically extracted from the document text or&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;the sometimes unevenly formatted bibliographic records (such as obsolescence&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;studies),  and usage studies (geographic or institutional origin of users, date/time of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;access, individual patterns of document retrieval, etc.).  
Because  references are present&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;in the document file but not identified by field, co-citation and bibliographic coupling&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;research is not well-supported, and conducting these studies requires considerable&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;effort on the part of the researcher.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The variety of bibliographic repositories in the available digital libraries in itself&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;has great potential in conducting bibliometric research.  Sigogneau et al [15] present a&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;case study illustrating the ways in which the strengths of different databases can be&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;played off each other; they conduct a fine-grained analysis of the emergence of research&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;fronts in molecular and cellular biology, and demonstrate that the observations gleaned&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;from two complementary bibliographic databases provide greater insight into their&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;problem. 
Similarly, it appears that  the types of bibliographic data that can be gleaned&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;from the relatively unstructured digital libraries can be profitably combined with data&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;from online databases, CD-ROMS, and other more conventional bibliographic&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;resources.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;This paper is organized as follows:  Section 2 discusses the types of indexing&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;and searching available with current digital libraries; Section 3 gives examples of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;conventional bibliometric techniques applied to Internet-accessible archives; Section 4&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;discusses opportunities to directly measure usage of documents and to detect&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;information-seeking patterns in researchers; and Section 5 presents our conclusions.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;2.  Indexing and searching in current digital libraries&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;At present, the types of indexing fields for most academically-oriented digital&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;library systems are limited.  Many schemes index on user-supplied document&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;descriptions, abstracts, or similar document surrogates (for example, the PHYSICS E-&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;PRINT ARCHIVE [10], a collection of physics pre-prints and technical reports). As will&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;be discussed below, the quality of this user-provided data can be highly variable, and&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;may unfavorably impact the usefulness of the index for searching. 
Alternatively, a&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;designated site librarian may maintain a catalog (eg, the WATERS [14] system, now&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;subsumed by NCSTRL (http://www.ncstrl.org/ ), both primarily collections of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;computer science technical reports);  in this case the quality of the bibliographic&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;information may be expected to be higher, but fewer sites will be likely to support&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;such a librarian and therefore fewer documents are likely to be included in the digital&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;library. In a “harvesting” system such as the computer science technical report&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;collections supported by HARVEST  [2] or the NEW ZEALAND DIGITAL LIBRARY&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;computer science technical report collection ([16], [17]), documents are indexed from&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;passive repositories (that may not even be aware that their documents are being&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;included in the digital library). Harvesting systems therefore cannot rely on the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;presence of bibliographic data of any sort.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Because of the relative paucity of high-quality bibliographic data available to&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;many of the current academically- or research-focussed digital library collections, their&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;search interfaces tend to be more primitive than those ordinarily found in online&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;bibliographic databases or library catalogs.  
Systems such as NCSTRL can support&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;author, title, and subject searching, but  this more sophisticated search functionality&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;comes at the expense of requiring participating repositories to use specific software.  As&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;a consequence, these latter systems may provide access to a smaller number of sites than&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;harvesting systems. Harvesters may access a broader range of providers, but at the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;penalty of being limited to unfielded, keyword searches over the raw text of the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;document or document surrogate.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Specifically, the indexing in existing digital libraries has a variety of shortcomings for&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;bibliometric applications:&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;• lack of fielded indexing:   As noted above, some large and widely used digital&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;libraries (such as the computer science technical report collection of the NEW&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;ZEALAND DIGITAL LIBRARY) may lack formal cataloging entirely, and rely on&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;keyword searching over the raw document text. 
Obviously this makes field-&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;dependent analysis more difficult (for example, locating documents produced by&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;specific authors), and in the worst case may require a manual examination of all&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;files in the collection in order to reliably identify a desired document subset.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;However, keyword search techniques that approximate fielded searching results&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;may suffice:  for example in the NEW ZEALAND DIGITAL LIBRARY computer&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;science technical report collection,  limiting the keyword search for “Johnson”&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;to a search of first pages only is likely to retrieve documents written by Johnson&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(since for the majority of computer science technical reports, the first page&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;contains little more than author, title, date, and institution details).&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;A more principled  approach to extracting bibliographic information is embodied&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;in the CiteSeer tool [1]. This software parses raw, unfielded academic&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;documents and attempts to identify such indexing information as author, title,&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;reference list, etc. Obviously such a tool cannot attain 100% accuracy over a&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;heterogeneous document collection, but in practice it appears useful in that it can&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;make a good first pass in processing a set of documents, providing an initial set&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;of parsed documents for analysis. 
The remaining (presumably much smaller) set&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;of unparsable documents can then be dealt with manually.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;• lack of consistency in field formatting:  Current digital libraries usually acquire&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;bibliographic information from either the authors of submitted articles or&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;automatic extraction routines (retrieving bibliographic details from catalog files&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;that may or may not be in a given document site, and that may or may not be in&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;an easily parsable form). Neither of these methods produce records with&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;standard formatting, which causes problems with automated bibliometric&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;analysis.   Consider the following examples selected from entries in the hep-th&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(high energy physics) collection of the PHYSICS E-PRINT ARCHIVES:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;(i) Authors: A. Yu. Alekseev, V. Schomerus&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(ii) Authors: Adel Bilal and Ian. I. Kogan&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(iii) Authors: Paul S. Aspinwall and David R. Morrison (with an appendix &lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;by Mark Gross)&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(iv) Authors: A. H. 
Chamseddine and Herbi Dreiner (ETH-Zurich)&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;In this case, typical for existing digital libraries, there is no standardized format&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;for authors' names (here, appearing with full names, initials plus last name, and&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;a mixture of the two); no standard convention for separating author names&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(here, either a comma or &amp;quot;and&amp;quot; is used); and parenthetical information can&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;include a variety of information such as the name of an associate author or the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;institutional affiliations of an author.  Manual processing or specially crafted&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;software would be required to reformat these fields for analysis.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;• duplicate entries:  Digital libraries that draw documents from a variety of sources&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;may inadvertently contain duplicate items. Unfortunately, the irregular&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;formatting of the bibliographic information makes it difficult to automatically&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;detect these duplicates.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;• implicit field tagging:  In some repositories, items are not explicitly tagged with&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;certain types of information – most commonly the document's date of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;publication or production.  Instead, the date is implicit in the document's title&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(eg, its numeration in a technical report series) or in the location of the document&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;in the file structure of the repository (eg, separate directories exist for each&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;year).  
A second common piece of implicit data is the authors’ institutional&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;affiliations.  This may be contained in the document itself (typically on a cover&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;page), or may be implicit in the document’s location (for example, a&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;corporation’s technical reports are stored in its ftp repository).  Again, in these&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;cases special processing is required to append this field information to a&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;document record for bibliometric analysis. &lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;• extraction of document text: Few of the documents stored in the research-&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;oriented digital libraries discussed in this paper are straight ascii text; instead,&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;documents may appear in a variety of file formats, such as LaTeX, PostScript,&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;PDF, etc.  If the contents of the documents are to be automatically processed&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(for example, to count the words in a document, or to extract reference&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;publication dates for an obsolescence study), then the text must be extracted.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Utilities are available to convert most common document formats to ascii.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;It is likely that many of these problems will be addressed as the Internet-based&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;document indexing systems mature.  Even minor changes can greatly increase the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;useability of a bibliographic database for bibliometric research.  
For example, the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;addition of an explicit date tag to many online databases in 1975 sparked new&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;applications in time series research [3].&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;3.  Opportunities for applications of bibliometric techniques&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;One type of bibliometric research concentrates on quantifying fundamental,&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;structural details about a subject literature:  how many items are published, how many&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;authors are publishing, over what time period documents are likely to be used, etc.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;More complex studies analyze the relationships between documents, such as how&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;documents cluster into subjects.  The following examples give a flavour of the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;bibliometric research that is possible using the emerging digital libraries:&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;examining the “physical” characteristics of archived documents&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;One relatively straightforward type of bibliometric study characterizes the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;formats of different literatures.   For example, Figure 1 presents the range of the size&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;of computer science technical reports as measured by their length in pages.   Of the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;45,720 documents in the CSTR collection as of April 1998, nearly 1600 did not contain&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;page divisions in their files (and hence are excluded from analysis). 
Note that the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;number of pages in the shorter documents (&amp;lt;50 pages) falls into an approximately&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;normal distribution (slightly skewed to the left), while presumably the longer&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;documents represent Masters’ and Doctoral theses. A surprising number of documents&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;are very short (between one and 5 pages); these may represent the type of condensed&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;results frequently found in the “technical notes”, “short papers”, and “poster sessions”&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;of computing conferences and journals. The average number of pages per document,&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;27.5, appears to be slightly longer than the common upper bound for a computing&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;journal article, although this observation must be confirmed by a similar study of the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;lengths of formally published computing articles.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;This type of analysis is of particular interest for technical reports, since they&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;have not been studied in the same detail as formally published papers.  A comparison of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;the physical characteristics of the formal and informal literature could provide&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;supporting evidence for common beliefs about the relationship between the two types&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;of documents. For example,  do publishing constraints force journal and proceedings&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;articles to be shorter than technical reports, and therefore presumably omit technical&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;details of findings?  
Do technical reports contain more/less extensive reference sections?&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;If reference sections of technical reports are longer than those of published articles, then&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;citation links are being omitted in published works; if technical reports contain fewer&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;references, then this may confirm earlier indications that computer scientists tend to&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;“research first” and do literature surveys later [6].&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Figure 1.  Range of sizes of CS technical reports, measured by number of pages&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;obsolescence studies.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;A document is considered obsolete when it is no longer referenced by the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;current literature. Typically, documents receive their greatest number and frequency of&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;citations immediately after publication, and the frequency of citation falls rapidly as time&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;passes. One technique for estimating the obsolescence rate of a body of  literature– the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;synchronous method –  is to find the median date in the references of the documents.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;This median date is subtracted from the year of publication for the documents, yielding&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;the median citation age.  As would be expected, this median varies between the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;disciplines.  
Typically the social sciences and arts have a higher median citation age&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;than the “hard” sciences and engineering, indicating that documents obsolesce more&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;quickly for the latter fields.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;As noted in Section 2, references are not generally explicitly tagged in existing&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;digital repositories.  However, reference dates can usually be extracted from the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;document text by first locating the reference section (usually delimited by a &amp;quot;references&amp;quot;&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;or &amp;quot;bibliography&amp;quot; section heading), and then extracting all numbers in the appropriate&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;ranges for dates  for the field under study.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;To illustrate this process, 188 technical reports were sampled from Internet-&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;accessible repositories1 and used as source documents for a synchronous obsolescence&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;study.  Conveniently, the repositories chosen organize technical reports into sub-&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;directories by their date of publication.  The reference dates for each technical report&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;were automatically extracted by software that scanned the document’s file for numbers&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;of the form 19XX, since previous studies indicate that few if any computing reports&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;reference documents published in previous centuries [5].  
Table 1 presents the median&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;citation age calculated for these documents, broken down by repository and the year of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;publication for the source documents from which the reference dates were extracted:&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Table 1.  Median citation ages for technical report repositories&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The median citation age ranges between 2 and 4 years, which is consistent with&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;previous examinations of computing and information systems literature ([5], [4]).&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;When graphed, the distribution of reference dates shows the exponential curve typically&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;found in obsolescence studies, including the final droop due to an “immediacy effect”&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;as fewer very new documents are available for citation [7].  These types of results&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;provide confirmation that references used in computer science technical reports (the pre-&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;eminent “grey literature” of  the computing field) conform to the same patterns as&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;references found in the formally published literature.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;co-citation and bibliographic coupling studies&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The rate at which documents cite each other (co-citation) or cite the same&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;documents (bibliographic coupling) can be used to produce &amp;quot;maps&amp;quot; of a subject&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;literature.  
These techniques rely on analysis of the references of documents, and these&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;references must be in a common format.  While digital libraries contain full text of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;documents, their references are not standardized, and indeed are not  even tagged as&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;such.  To perform these studies the references must be manually extracted and&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;processed–a tedious process that is only worthwhile for documents (such as technical&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;reports) that are not included in existing citation databases such as the Science Citation&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Index and Social Science Citation Index.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;detecting cycles or regularities in the rate of production of research&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Analysis of trends in the production of technical reports can give indications&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;about working conditions that affect research; for example, is more research produced&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;over the summer, when the teaching load is lighter?  or is research steadily produced&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;throughout the year?&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Figure 2.  Distribution of the number of documents submitted to hep-th, 1992-1994&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Figures 2 and 3 present statistics on document accumulation in the hep-th (high&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;energy physics) e-print server, a part of the PHYSICS E-PRINT ARCHIVE.  This system&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;is one of the oldest formal pre-print archives, and has become the primary means for&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;information dissemination in its field.  
Examination of these figures reveals several&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;trends.  Clearly the absolute number of documents deposited in the repository has&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;tended to increase over the time period.  For all three years, research production has its&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;lowest point in January and February, increases through May and June, then decreases&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;until August and September.  At that point the rate of production steps up, reaching a&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;yearly peak in November and December.  This pattern is less clear for 1992, which&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;might be expected as the archive was established in mid-1991.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Figure 3.  Distribution of the percentage of documents submitted to hep-th, 1992-1994&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;4.  Analysis of usage data&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The emerging Internet-based digital libraries will permit research on scientific&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;information collection and use at a much finer grain than is possible with current paper&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;libraries or online bibliographic databases.  Current bibliometric or scientometric&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;research of this type must measure information use indirectly – for example,  through&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;examination of the list of references appended to published articles.  
However, it is well&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;known that authors do not necessarily include in the reference list all documents that&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;could have been cited, and conversely that not all references listed may have been&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;actually “used” in performing the research; citation behavior can be affected by a&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;number of motivating factors (Garfield lists 15 possible reasons in [8]).&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Digital library transaction logs provide a powerful tool for direct analysis of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;document “usage”: since  digital libraries contain the actual document (rather than only a&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;document surrogate), the relative amount of “use” that a digital library’s clients make of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;a given document can be estimated from the number of times the document file is&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;downloaded (and, presumably, the document is read). Note that file downloading is a&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;much stronger statement on the part of the user than, for example, having a&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;bibliographic record appear in the query result set for a conventional bibliographic&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;system; the user downloads only after the document has been found potentially relevant&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;through examination of its document surrogate. Additionally,  downloading is&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;frequently time-consuming  and sometimes costly (depending on local pricing for&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;Internet access). 
Downloaded documents are therefore highly likely at least to be&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;scanned, if not read closely.  The transaction logs for a digital library can provide a&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;global picture of the use of documents in the collection, since all user interactions with&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;the library can be automatically logged  for analysis. By contrast, it is of course&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;impossible to track usage of print bibliographies, and very difficult to monitor usage of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;bibliographic data available on CD-ROM across more than one or two sites.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Furthermore, analysis of search requests by geographic location, institution,&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;and sometimes even individual user is also possible.  As an example, Table 2 presents&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;a portion of the summary of usage statistics (broken down by domain code) for queries&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;to the computer science technical collection of the NEW ZEALAND DIGITAL LIBRARY.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Examination of the data indicates that the heaviest use of the collection comes from&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;North America, Europe (particularly Germany and Finland), as well as the local New&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Zealand community and nearby Australia.  
As expected for such a collection, a large&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;proportion of users are from educational (.edu) institutions; surprisingly, however, a&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;similar number of queries come from commercial (.com) organizations, indicating&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;perhaps that the documents are seeing use in commercial research and development&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;units.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Table 2. Accesses to the NEW ZEALAND DIGITAL LIBRARY CS collection  by Domain&lt;br /&gt;Code&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Of course, usage levels can also be further broken down by IP number&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(indicating  institutions), and systems requiring users to register may also be able to&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;analyze usage on an individual basis. Since the query strings themselves are also&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;recorded in the transaction logs, this domain/institution/individual activity could also be&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;linked to specific subjects through the query terms.  Summaries of this type could be&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;invaluable for studies of geographic diffusion and distribution of research topics.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Transaction log analysis can also indicate time-related  patterns in the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;information seeking behavior of digital library users.   As a sample of this type of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;analysis,  Paul Ginsparg notes a seven day periodicity in the number of search requests&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;made to the PHYSICS E-PRINT archives (Figure 4, reproduced from [9]).  
From this he&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;adduces that many physicists do not yet have weekend access to the Internet (an&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;alternative, slightly more cynical hypothesis is that even high energy theoretical&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;physicists take the weekend off).&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Figure 4.  Summary of search requests to the physics pre-print archives&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;5.  Conclusion&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;This study suggests opportunities for conducting bibliometric research on the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;evolving digital libraries.  These repositories are suitable platforms for conventional&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;bibliometric techniques (such as obsolescence studies, quantification of physical&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;characteristics of documents comprising a subject literature, time analysis, etc.).  The&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;ability to directly monitor access to documents in digital libraries also enables&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;researchers to explicitly quantify document usage, as well as to implicitly measure&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;usage through citations.  Additional facilities could aid in the performance of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;bibliographic experiments, such as: improved tagging of document fields; provision of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;utilities to strip out titles, authors, etc. from common document formats; and the ability&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;to easily eliminate duplicate entries from downloaded library subsets.  
Unfortunately,&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;the most useful of these additional facilities – those associated with a higher degree of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;cataloging – run counter to the underlying philosophy of many digital libraries:  to&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;avoid, if possible,  manual processing and formal cataloging of documents.   While&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;adherence to this principle can limit the accuracy of fielded searching (or indeed,&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;preclude it altogether), it can also avoid the cataloging bottleneck and permit digital&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;libraries to provide access to larger numbers of documents.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The digital libraries complement the information currently available through&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;paper, online, and CD-ROM bibliographic resources.  While these latter databases&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;generally have the advantage of standardized formatting of bibliographic fields, the&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;digital libraries are freely accessible, often contain &amp;quot;grey literature&amp;quot; that is otherwise&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;unavailable for analysis, and generally make the full text of documents available.  The&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;insights gained from analysis of digital libraries will add to the store of &amp;quot;information&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;about information&amp;quot; that we have gained from older types of bibliographic repositories.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;References&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;[1] Bollacker, K.D., S. 
Lawrence, and C.L. Giles, CiteSeer: An Autonomous Web&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Agent for Automatic Retrieval and Identification of Interesting Publications,&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Proceedings of the Second International Conference on Autonomous Agents&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(Minneapolis/St. Paul, May 9-13), 1998.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;[2] Bowman, C.M., P.B. Danzig, U. Manber,  and M.F. Schwartz,  Scalable Internet&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;resource discovery:  Research problems and approaches, Communications of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;the ACM 37(8)  (1994)  98-107.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;[3] Burton, Hilary D. , Use of a virtual information system for bibliometric analysis,&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Information Processing &amp;amp; Management 24(1)  (1988) 39-44.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;[4] Cunningham, S.J., An empirical investigation of the obsolescence rate for&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;information systems literature, Library and Information Science&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Research, 1996, http://library.fgcu.edu/iclc/lisrissu.htm&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; [5] Cunningham, S.J., and D. Bocock, Obsolescence of computing literature.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Scientometrics  34(2)  (1995), pp. 255-262.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt; [6] Cunningham, S.J. and Lynn Silipigni Connaway, Information searching&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;preferences and practices of computer science researchers, Proceedings of&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;OZCHI '96 (1996)  294-299.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;[7] de Solla Price, D.J.,  Citation measures of hard science, soft science, technology,&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;and nonscience.  In: C.E. Nelson and D.K. 
Pollock (eds), Communication&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;among scientists and engineers  (Heath Lexington, 1970).&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;[8]  Garfield, E., Citation Indexing:  Its theory and application in Science, Technology&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;and Humanities (Wiley, 1979).&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;[9]  Ginsparg, P.  After dinner remarks:  14 Oct ‘94 APS meeting at LANL, 1994&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(&amp;lt;URL: http://xxx.lanl.gov/blurb&amp;gt; ).&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;[10] Ginsparg, P., First steps towards electronic research communication, Computers&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;in Physics 8(4) (1994)  390-401. &lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;[11] Hallmark,  J., Scientists' access and retrieval of references cited in their recent&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;journal articles,  College and Research Libraries 55(3)  (1994) 199-210.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;[12] Hawkins, D.T. , Unconventional uses of on-line information retrieval systems:&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;on-line bibliometric studies, Journal of the American Society for Information&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Science 28  (1977)  13-18.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;[13] McGhee, P.E. , P.R. Skinner, K. Roberto,  N.J. Ridenour,  and S.M. Larson,&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Using online databases to study current research trends:  an online bibliometric&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;study, Library and Information Science Research 9  (1987)   285-291.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;[14] Maly, K., E.A. Fox,  J.C. French,  and A.L. 
Selman,  Wide area technical report&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;server  (Technical  Report ,  Dept. of Computer Science, Old Dominion&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;University, 1994. Also available at   &amp;lt;URL:&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;http://www.cs.odu.edu/WATERS/WATERS-paper.ps&amp;gt; ).&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;[15] Sigogneau, M.J. , S. Bain, J.P. Courtial, and H. Feillet,  Scientific innovation in&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;bibliographical databases:  a comparative study of the Science Citation Index&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;and the Pascal database,  Scientometrics 22(1)   (1991)  65-82.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;[16] Witten, I.H., S.J. Cunningham, M. Vallabh,  and T.C. Bell,  A New Zealand&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;digital library for computer science research, Proceedings of Digital Libraries&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;'95 (1995) 25-30.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;[17]  Witten, I.H., C. Nevill-Manning, and S.J. Cunningham, A public library based&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;on full-text retrieval, Communications of the ACM 41(4), 1998, p. 71&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name=0&gt;&lt;/a&gt;&lt;div style=&quot;page-break-before:always; page-break-after:always&quot;&gt;&lt;div&gt;&lt;p&gt;                                    &lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;1Documents were randomly sampled from the DEC&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(ftp://crl.dec.com/pub/DEC/CRL/tech-reports/), Sony&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(ftp://ftp.csl.sony.co.jp/CSL/CSL-Papers), and Ohio (ftp://archive.cis.ohio-&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;state.edu/pub/tech-report/) technical report repositories&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;</Content>
51</Section>
52</Archive>