source: other-projects/nightly-tasks/diffcol/trunk/gs3-model-collect/Word-PDF-Basic/archives/HASH019c.dir/doc.xml@ 28241

Last change on this file since 28241 was 28241, checked in by ak19, 11 years ago

Rebuilt the GS3 model collection after the change over to using placeholders for standard GS path prefixes in the two archiveinf gdb files

File size: 40.0 KB
1<?xml version="1.0" encoding="utf-8" standalone="no"?>
2<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
3<Archive>
4<Section>
5 <Description>
6 <Metadata name="gsdldoctype">indexed_doc</Metadata>
7 <Metadata name="Language">en</Metadata>
8 <Metadata name="Encoding">utf8</Metadata>
9 <Metadata name="Author">Bronwyn</Metadata>
10 <Metadata name="Title">biblio_for_dl_scientometrics.do</Metadata>
11 <Metadata name="URL">http://research/ak19/gs3-svn-26Aug2013/web/sites/localsite/collect/Word-PDF-Basic/tmp/1378708746_1/pdf03.html</Metadata>
12 <Metadata name="UTF8URL">http://research/ak19/gs3-svn-26Aug2013/web/sites/localsite/collect/Word-PDF-Basic/tmp/1378708746_1/pdf03.html</Metadata>
13 <Metadata name="gsdlsourcefilename">import/pdf03.pdf</Metadata>
14 <Metadata name="gsdlconvertedfilename">tmp/1378708746_1/pdf03.html</Metadata>
15 <Metadata name="OrigSource">pdf03.html</Metadata>
16 <Metadata name="Source">pdf03.pdf</Metadata>
17 <Metadata name="SourceFile">pdf03.pdf</Metadata>
18 <Metadata name="Plugin">PDFPlugin</Metadata>
19 <Metadata name="FileSize">35935</Metadata>
20 <Metadata name="FilenameRoot">pdf03</Metadata>
21 <Metadata name="FileFormat">PDF</Metadata>
22 <Metadata name="srcicon">_iconpdf_</Metadata>
23 <Metadata name="srclink_file">doc.pdf</Metadata>
24 <Metadata name="srclinkFile">doc.pdf</Metadata>
25 <Metadata name="NumPages">17</Metadata>
26 <Metadata name="dc.Creator">Sally Jo Cunningham</Metadata>
27 <Metadata name="dc.Title">Applications for Bibliometric Research in the Emerging Digital Libraries</Metadata>
28 <Metadata name="Identifier">HASH019c5dca7f5bb781460a6b9c</Metadata>
29 <Metadata name="lastmodified">1378708193</Metadata>
30 <Metadata name="lastmodifieddate">20130909</Metadata>
31 <Metadata name="oailastmodified">1378708746</Metadata>
32 <Metadata name="oailastmodifieddate">20130909</Metadata>
33 <Metadata name="assocfilepath">HASH019c.dir</Metadata>
34 <Metadata name="gsdlassocfile">doc.pdf:application/pdf:</Metadata>
35 </Description>
36 <Content>
37&lt;A name=1&gt;&lt;/a&gt;&lt;b&gt;Applications for Bibliometric Research&lt;/b&gt;&lt;br&gt;
38&lt;b&gt;in the Emerging Digital Libraries&lt;/b&gt;&lt;br&gt;
39Sally Jo Cunningham&lt;br&gt;
40Department of Computer Science&lt;br&gt;
41University of Waikato&lt;br&gt;
42Hamilton, New Zealand&lt;br&gt;
43email: [email protected]&lt;br&gt;
44&lt;b&gt;Abstract:&lt;/b&gt; Large numbers of research documents have recently become available on&lt;br&gt;
45the Internet through “digital libraries”, and these collections are seeing high levels of&lt;br&gt;
46use by their related research communities. A secondary use for these document&lt;br&gt;
47repositories and indexes is as a platform for bibliometric research. We examine the&lt;br&gt;
48extent to which the new digital libraries support conventional bibliometric analysis, and&lt;br&gt;
49discuss shortcomings in their current forms. Interestingly, these electronic text&lt;br&gt;
50archives also provide opportunities for new types of studies: generally the full text of&lt;br&gt;
51documents is available for analysis, giving a finer grain of insight than abstract-only&lt;br&gt;
52online databases; these repositories often contain technical reports or pre-prints, the&lt;br&gt;
53“grey literature” that has been previously unavailable for analysis; and document&lt;br&gt;
54“usage” can be measured directly by recording user accesses, rather than studied&lt;br&gt;
55indirectly through document references.&lt;br&gt;
56&lt;b&gt;1. Introduction&lt;/b&gt;&lt;br&gt;
57In recent years a number of &amp;quot;digital libraries&amp;quot; have become available through the&lt;br&gt;
58Internet. While the technology promises in the future to support large, heterogeneous&lt;br&gt;
59collections, at present the most widely used of the academically-focussed digital&lt;br&gt;
60libraries are generally repositories of one or two types of document (typically technical&lt;br&gt;
61reports, journal articles, pre-prints, or conference proceedings), grouped by discipline.&lt;br&gt;
62&lt;hr&gt;
63&lt;A name=2&gt;&lt;/a&gt;A distinguishing characteristic of these digital libraries is that the full text of documents&lt;br&gt;
64is often available for retrieval, as well as bibliographic records. The sciences are&lt;br&gt;
65represented much more heavily in the present crop of digital libraries than the social&lt;br&gt;
66sciences, arts, or humanities. These libraries are maintained by professional societies,&lt;br&gt;
67universities, research laboratories, and even private individuals. Access is generally&lt;br&gt;
68free, both to search and to download documents.&lt;br&gt;
69The emergence of these subject-specific digital libraries is particularly important&lt;br&gt;
70given the pattern of access to materials presently employed by research scientists.&lt;br&gt;
71Informal exchanges of preprints, reprints, and photocopies of papers passed on by&lt;br&gt;
72colleagues currently are major venues for the transmission of scientific information&lt;br&gt;
73between researchers in the sciences. In one study, the dependence on these sources&lt;br&gt;
74ranges from 12% (for chemistry) to 39% (for mathematics) of all papers cited in&lt;br&gt;
75researchers' own publications [11]. A qualitative study of how computer&lt;br&gt;
76scientists locate and retrieve documents (computing is one of the domains considered&lt;br&gt;
77later in this paper) indicates that for that field, technical reports and research documents&lt;br&gt;
78found in various locations on the Internet are a preferred source of information [6].&lt;br&gt;
79Many of the digital library systems discussed in this paper are repositories for just this&lt;br&gt;
80type of literature. The documents tend to be of high quality: primarily technical&lt;br&gt;
81reports or working papers from research institutions (both academic and commercial),&lt;br&gt;
82as well as advance copies of work accepted for publication in conventional paper&lt;br&gt;
83journals. Moreover, these digital libraries are also coming to include refereed work&lt;br&gt;
84published digitally (in electronic journals). Anecdotal evidence suggests that in their&lt;br&gt;
85fields, these digital libraries are coming to be the resource of choice for locating cutting&lt;br&gt;
86edge work.&lt;br&gt;
87For specialized subjects such as high energy physics, this dependence on&lt;br&gt;
88informal or extra-library dissemination can be much higher. Ginsparg ([9], [10])&lt;br&gt;
89reports that fields in physics have traditionally relied heavily on preprint exchanges, and&lt;br&gt;
90the digital repositories of physics preprints begun in 1991 (the PHYSICS E-PRINT&lt;br&gt;
91ARCHIVES) have to a large extent supplanted conventional publishing and physical&lt;br&gt;
92&lt;hr&gt;
93&lt;A name=3&gt;&lt;/a&gt;paper mailing of technical reports. By providing ready access to information sources&lt;br&gt;
94that are already preferentially utilized by scientists, the digital libraries show potential to&lt;br&gt;
95increase access to information that until recently was expensive or difficult to acquire in&lt;br&gt;
96paper form. Indeed, in some fields (most notably physics) this process has already&lt;br&gt;
97begun, as researchers in less developed countries report access to ongoing research&lt;br&gt;
98through the Internet repositories that their local libraries could not afford to acquire&lt;br&gt;
99through conventional journal subscriptions ([9], [10]).&lt;br&gt;
100The primary use for new bibliographic resources is, of course, for the contents&lt;br&gt;
101of the documents involved. A secondary use for emerging resources is as a basis for&lt;br&gt;
102bibliometric analysis of the subject field. With the conventionally published scientific&lt;br&gt;
103literature, the sheer difficulty of accumulating statistics discouraged bibliometric&lt;br&gt;
104research until the advent of large bibliographic databases in the 1960's. Computerized&lt;br&gt;
105bibliographic databases sparked a significant increase in the number of large-scale&lt;br&gt;
106bibliographic studies, as significant portions of the collection and analysis of data could&lt;br&gt;
107be automated ([12], [13]). The availability of CD-ROM versions of bibliographic&lt;br&gt;
108databases has been of particular importance, since they provide a cheaper alternative to&lt;br&gt;
109the online commercial databases [3].&lt;br&gt;
110These computerized bibliographic resources have drawbacks, however. The&lt;br&gt;
111greatest is that the full text of documents is rarely available, and even abstracts are not&lt;br&gt;
112always present. This obviously limits the types of bibliometric research that can be&lt;br&gt;
113conducted &lt;i&gt;solely&lt;/i&gt; through these databases. In addition, these databases are generally&lt;br&gt;
114limited to formally published documents (those appearing in selected books, journals,&lt;br&gt;
115and conference proceedings). The &amp;quot;grey literature&amp;quot; of technical reports, pre-prints, and&lt;br&gt;
116other works not formally published is largely ignored, and the absence of easy&lt;br&gt;
117access to these documents has hampered the analysis of these important forms of&lt;br&gt;
118scientific communication.&lt;br&gt;
119The digital libraries currently in existence complement the online and CD-ROM&lt;br&gt;
120bibliographic databases. They are best suited for examinations of the &amp;quot;physical&amp;quot;&lt;br&gt;
121characteristics of documents (for example, document length), analysis based on&lt;br&gt;
122&lt;hr&gt;
123&lt;A name=4&gt;&lt;/a&gt;bibliographic information that can be automatically extracted from the document text or&lt;br&gt;
124the sometimes unevenly formatted bibliographic records (such as obsolescence&lt;br&gt;
125studies), and usage studies (geographic or institutional origin of users, date/time of&lt;br&gt;
126access, individual patterns of document retrieval, etc.). Because references are present&lt;br&gt;
127in the document file but not identified by field, co-citation and bibliographic coupling&lt;br&gt;
128research is not well-supported, and conducting these studies requires considerable&lt;br&gt;
129effort on the part of the researcher.&lt;br&gt;
130The variety of bibliographic repositories in the available digital libraries in itself&lt;br&gt;
131has great potential for conducting bibliometric research. Sigogneau et al. [15] present a&lt;br&gt;
132case study illustrating the ways in which the strengths of different databases can be&lt;br&gt;
133played off each other; they conduct a fine-grained analysis of the emergence of research&lt;br&gt;
134fronts in molecular and cellular biology, and demonstrate that the observations gleaned&lt;br&gt;
135from two complementary bibliographic databases provide greater insight into their&lt;br&gt;
136problem. Similarly, it appears that the types of bibliographic data that can be gleaned&lt;br&gt;
137from the relatively unstructured digital libraries can be profitably combined with data&lt;br&gt;
138from online databases, CD-ROMS, and other more conventional bibliographic&lt;br&gt;
139resources.&lt;br&gt;
140This paper is organized as follows: Section 2 discusses the types of indexing&lt;br&gt;
141and searching available with current digital libraries; Section 3 gives examples of&lt;br&gt;
142conventional bibliometric techniques applied to Internet-accessible archives; Section 4&lt;br&gt;
143discusses opportunities to directly measure usage of documents and to detect&lt;br&gt;
144information-seeking patterns in researchers; and Section 5 presents our conclusions.&lt;br&gt;
145&lt;b&gt;2. Indexing and searching in current digital libraries&lt;/b&gt;&lt;br&gt;
146At present, the types of indexing fields for most academically-oriented digital&lt;br&gt;
147library systems are limited. Many schemes index on user-supplied document&lt;br&gt;
148descriptions, abstracts, or similar document surrogates (for example, the PHYSICS E-&lt;br&gt;
149PRINT ARCHIVE [10], a collection of physics pre-prints and technical reports). As will&lt;br&gt;
150&lt;hr&gt;
151&lt;A name=5&gt;&lt;/a&gt;be discussed below, the quality of this user-provided data can be highly variable, and&lt;br&gt;
152may unfavorably impact the usefulness of the index for searching. Alternatively, a&lt;br&gt;
153designated site librarian may maintain a catalog (eg, the WATERS [14] system, now&lt;br&gt;
154subsumed by NCSTRL (http://www.ncstrl.org/), both primarily collections of&lt;br&gt;
155computer science technical reports); in this case the quality of the bibliographic&lt;br&gt;
156information may be expected to be higher, but fewer sites will be likely to support&lt;br&gt;
157such a librarian and therefore fewer documents are likely to be included in the digital&lt;br&gt;
158library. In a “harvesting” system such as the computer science technical report&lt;br&gt;
159collections supported by HARVEST [2] or the NEW ZEALAND DIGITAL LIBRARY&lt;br&gt;
160computer science technical report collection ([16], [17]), documents are indexed from&lt;br&gt;
161passive repositories (that may not even be aware that their documents are being&lt;br&gt;
162included in the digital library). Harvesting systems therefore cannot rely on the&lt;br&gt;
163presence of bibliographic data of any sort.&lt;br&gt;
164Because of the relative paucity of high-quality bibliographic data available to&lt;br&gt;
165many of the current academically- or research-focussed digital library collections, their&lt;br&gt;
166search interfaces tend to be more primitive than those ordinarily found in online&lt;br&gt;
167bibliographic databases or library catalogs. Systems such as NCSTRL can support&lt;br&gt;
168author, title, and subject searching, but this more sophisticated search functionality&lt;br&gt;
169comes at the expense of requiring participating repositories to use specific software. As&lt;br&gt;
170a consequence, these latter systems may provide access to a smaller number of sites than&lt;br&gt;
171harvesting systems. Harvesters may access a broader range of providers, but at the&lt;br&gt;
172penalty of being limited to unfielded, keyword searches over the raw text of the&lt;br&gt;
173document or document surrogate.&lt;br&gt;
174Specifically, the indexing in existing digital libraries has a variety of shortcomings for&lt;br&gt;
175bibliometric applications:&lt;br&gt;
176•&lt;br&gt;
177&lt;i&gt;lack of fielded indexing:&lt;/i&gt; As noted above, some large and widely used digital&lt;br&gt;
178libraries (such as the computer science technical report collection of the NEW&lt;br&gt;
179ZEALAND DIGITAL LIBRARY) may lack formal cataloging entirely, and rely on&lt;br&gt;
180&lt;hr&gt;
181&lt;A name=6&gt;&lt;/a&gt;keyword searching over the raw document text. Obviously this makes field-&lt;br&gt;
182dependent analysis more difficult (for example, locating documents produced by&lt;br&gt;
183specific authors), and in the worst case may require a manual examination of all&lt;br&gt;
184files in the collection in order to reliably identify a desired document subset.&lt;br&gt;
185However, keyword search techniques that approximate fielded searching results&lt;br&gt;
186may suffice: for example in the NEW ZEALAND DIGITAL LIBRARY computer&lt;br&gt;
187science technical report collection, limiting the keyword search for “Johnson”&lt;br&gt;
188to a search of first pages only is likely to retrieve documents written by Johnson&lt;br&gt;
189(since for the majority of computer science technical reports, the first page&lt;br&gt;
190contains little more than author, title, date, and institution details).&lt;br&gt;
191A more principled approach to extracting bibliographic information is embodied&lt;br&gt;
192in the CiteSeer tool [1]. This software parses raw, unfielded academic&lt;br&gt;
193documents and attempts to identify such indexing information as author, title,&lt;br&gt;
194reference list, etc. Obviously such a tool cannot attain 100% accuracy over a&lt;br&gt;
195heterogeneous document collection, but in practice it appears useful in that it can&lt;br&gt;
196make a good first pass in processing a set of documents, providing an initial set&lt;br&gt;
197of parsed documents for analysis. The remaining (presumably much smaller) set&lt;br&gt;
198of unparsable documents can then be dealt with manually.&lt;br&gt;
199•&lt;br&gt;
200&lt;i&gt;lack of consistency in field formatting:&lt;/i&gt; Current digital libraries usually acquire&lt;br&gt;
201bibliographic information from either the authors of submitted articles or&lt;br&gt;
202automatic extraction routines (retrieving bibliographic details from catalog files&lt;br&gt;
203that may or may not be in a given document site, and that may or may not be in&lt;br&gt;
204an easily parsable form). Neither of these methods produces records with&lt;br&gt;
205standard formatting, which causes problems with automated bibliometric&lt;br&gt;
206analysis. Consider the following examples selected from entries in the hep-th&lt;br&gt;
207(high energy physics) collection of the PHYSICS E-PRINT ARCHIVES:&lt;br&gt;
208&lt;hr&gt;
209&lt;A name=7&gt;&lt;/a&gt;(i)&lt;br&gt;
210Authors: A. Yu. Alekseev, V. Schomerus&lt;br&gt;
211(ii)&lt;br&gt;
212Authors: Adel Bilal and Ian. I. Kogan&lt;br&gt;
213(iii)&lt;br&gt;
214Authors: Paul S. Aspinwall and David R. Morrison (with an appendix &lt;br&gt;
215by Mark Gross)&lt;br&gt;
216(iv)&lt;br&gt;
217Authors: A. H. Chamseddine and Herbi Dreiner (ETH-Zurich)&lt;br&gt;
218In this case, typical for existing digital libraries, there is no standardized format&lt;br&gt;
219for authors' names (here, appearing with full names, initials plus last name, and&lt;br&gt;
220a mixture of the two); no standard convention for separating author names&lt;br&gt;
221(here, either a comma or &amp;quot;and&amp;quot; is used); and parenthetical information can&lt;br&gt;
222include a variety of information such as the name of an associate author or the&lt;br&gt;
223institutional affiliations of an author. Manual processing or specially crafted&lt;br&gt;
224software would be required to reformat these fields for analysis.&lt;br&gt;
225•&lt;br&gt;
226&lt;i&gt;duplicate entries: &lt;/i&gt; Digital libraries that draw documents from a variety of sources&lt;br&gt;
227may inadvertently contain duplicate items. Unfortunately, the irregular&lt;br&gt;
228formatting of the bibliographic information makes it difficult to automatically&lt;br&gt;
229detect these duplicates.&lt;br&gt;
230•&lt;br&gt;
231&lt;i&gt;implicit field tagging:&lt;/i&gt; In some repositories, items are not explicitly tagged with&lt;br&gt;
232certain types of information – most commonly the document's date of&lt;br&gt;
233publication or production. Instead, the date is implicit in the document's title&lt;br&gt;
234(eg, its numeration in a technical report series) or in the location of the document&lt;br&gt;
235in the file structure of the repository (eg, separate directories exist for each&lt;br&gt;
236year). A second common piece of implicit data is the authors’ institutional&lt;br&gt;
237affiliations. This may be contained in the document itself (typically on a cover&lt;br&gt;
238page), or may be implicit in the document’s location (for example, a&lt;br&gt;
239corporation’s technical reports are stored in its ftp repository). Again, in these&lt;br&gt;
240&lt;hr&gt;
241&lt;A name=8&gt;&lt;/a&gt;cases special processing is required to append this field information to a&lt;br&gt;
242document record for bibliometric analysis. &lt;br&gt;
243•&lt;br&gt;
244&lt;i&gt;extraction of document text:&lt;/i&gt; Few of the documents stored in the research-&lt;br&gt;
245oriented digital libraries discussed in this paper are straight ascii text; instead,&lt;br&gt;
246documents may appear in a variety of file formats, such as LaTeX, PostScript,&lt;br&gt;
247PDF, etc. If the contents of the documents are to be automatically processed&lt;br&gt;
248(for example, to count the words in a document, or to extract reference&lt;br&gt;
249publication dates for an obsolescence study), then the text must be extracted.&lt;br&gt;
250Utilities are available to convert most common document formats to ascii.&lt;br&gt;
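The specially crafted software suggested above for irregular author fields could start from something like the following minimal Python sketch. It is an illustration only and not software used in this study: the function names split_authors and name_key are hypothetical, parenthetical notes are simply discarded (so an appendix author such as the one in example (iii) is lost), and the last token of each name is assumed to be the surname.

```python
import re


def split_authors(field):
    """Split a free-form author field into a list of name strings."""
    # Drop parenthetical notes such as affiliations or appendix credits.
    without_notes = re.sub(r"\([^)]*\)", "", field)
    # Treat both commas and the standalone word "and" as separators.
    parts = re.split(r",|\band\b", without_notes)
    # Collapse runs of whitespace and discard empty fragments.
    return [" ".join(part.split()) for part in parts if part.strip()]


def name_key(name):
    """Build a rough matching key: (surname, first initial), lower-cased.

    Assumption: the surname is the last token, which fails for suffixes
    such as "Jr." and for surname-first naming conventions.
    """
    tokens = name.replace(".", " ").split()
    return (tokens[-1].lower(), tokens[0][0].lower())


if __name__ == "__main__":
    examples = [
        "A. Yu. Alekseev, V. Schomerus",
        "Adel Bilal and Ian. I. Kogan",
        "A. H. Chamseddine and Herbi Dreiner (ETH-Zurich)",
    ]
    for field in examples:
        names = split_authors(field)
        print(names, [name_key(n) for n in names])
```

A key of this kind is deliberately coarse: it lets "A. Yu. Alekseev" and a fully spelled-out form of the same name fall together, at the cost of occasionally merging distinct authors, so the output would still need manual checking.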
251It is likely that many of these problems will be addressed as the Internet-based&lt;br&gt;
252document indexing systems mature. Even minor changes can greatly increase the&lt;br&gt;
253usability of a bibliographic database for bibliometric research. For example, the&lt;br&gt;
254addition of an explicit date tag to many online databases in 1975 sparked new&lt;br&gt;
255applications in time series research [3].&lt;br&gt;
256&lt;b&gt;3. Opportunities for applications of bibliometric techniques&lt;/b&gt;&lt;br&gt;
257One type of bibliometric research concentrates on quantifying fundamental,&lt;br&gt;
258structural details about a subject literature: how many items are published, how many&lt;br&gt;
259authors are publishing, over what time period documents are likely to be used, etc.&lt;br&gt;
260More complex studies analyze the relationships between documents, such as how&lt;br&gt;
261documents cluster into subjects. The following examples give a flavour of the&lt;br&gt;
262bibliometric research that is possible using the emerging digital libraries:&lt;br&gt;
263&lt;i&gt;examining the “physical” characteristics of archived documents&lt;/i&gt;&lt;br&gt;
264One relatively straightforward type of bibliometric study characterizes the&lt;br&gt;
265formats of different literatures. For example, Figure 1 presents the range of sizes&lt;br&gt;
266&lt;hr&gt;
267&lt;A name=9&gt;&lt;/a&gt;of computer science technical reports as measured by their length in pages. Of the&lt;br&gt;
26845,720 documents in the CSTR collection as of April 1998, nearly 1600 did not contain&lt;br&gt;
269page divisions in their files (and hence are excluded from analysis). Note that the&lt;br&gt;
270number of pages in the shorter documents (&amp;lt;50 pages) falls into an approximately&lt;br&gt;
271normal distribution (slightly skewed to the left), while presumably the longer&lt;br&gt;
272documents represent Masters’ and Doctoral theses. A surprising number of documents&lt;br&gt;
273are very short (between one and 5 pages); these may represent the type of condensed&lt;br&gt;
274results frequently found in the “technical notes”, “short papers”, and “poster sessions”&lt;br&gt;
275of computing conferences and journals. The average number of pages per document,&lt;br&gt;
27627.5, appears to be slightly longer than the common upper bound for a computing&lt;br&gt;
277journal article, although this observation must be confirmed by a similar study of the&lt;br&gt;
278lengths of formally published computing articles.&lt;br&gt;
279This type of analysis is of particular interest for technical reports, since they&lt;br&gt;
280have not been studied in the same detail as formally published papers. A comparison of&lt;br&gt;
281the physical characteristics of the formal and informal literature could provide&lt;br&gt;
282supporting evidence for common beliefs about the relationship between the two types&lt;br&gt;
283of documents. For example, do publishing constraints force journal and proceedings&lt;br&gt;
284articles to be shorter than technical reports, and therefore presumably omit technical&lt;br&gt;
285details of findings? Do technical reports contain more/less extensive reference sections?&lt;br&gt;
286If reference sections of technical reports are longer than those of published articles, then&lt;br&gt;
287citation links are being omitted in published works; if technical reports contain fewer&lt;br&gt;
288references, then this may confirm earlier indications that computer scientists tend to&lt;br&gt;
289“research first” and do literature surveys later [6].&lt;br&gt;
290Figure 1. Range of sizes of CS technical reports, measured by number of pages&lt;br&gt;
291&lt;i&gt;obsolescence studies.&lt;/i&gt;&lt;br&gt;
292A document is considered obsolete when it is no longer referenced by the&lt;br&gt;
293current literature. Typically, documents receive their greatest number and frequency of&lt;br&gt;
294&lt;hr&gt;
295&lt;A name=10&gt;&lt;/a&gt;citations immediately after publication, and the frequency of citation falls rapidly as time&lt;br&gt;
296passes. One technique for estimating the obsolescence rate of a body of literature – the&lt;br&gt;
297&lt;i&gt;synchronous&lt;/i&gt; method – is to find the median date in the references of the documents.&lt;br&gt;
298This median date is subtracted from the year of publication for the documents, yielding&lt;br&gt;
299the &lt;i&gt;median citation age&lt;/i&gt;. As would be expected, this median varies between the&lt;br&gt;
300disciplines. Typically the social sciences and arts have a higher median citation age&lt;br&gt;
301than the “hard” sciences and engineering, indicating that documents obsolesce more&lt;br&gt;
302quickly for the latter fields.&lt;br&gt;
303As noted in Section 2, references are not generally explicitly tagged in existing&lt;br&gt;
304digital repositories. However, reference dates can usually be extracted from the&lt;br&gt;
305document text by first locating the reference section (usually delimited by a &amp;quot;references&amp;quot;&lt;br&gt;
306or &amp;quot;bibliography&amp;quot; section heading), and then extracting all numbers in the appropriate&lt;br&gt;
307ranges for dates for the field under study.&lt;br&gt;
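The extraction step just described, combined with the synchronous calculation, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the software used for the study: the function names are hypothetical, the reference section is assumed to begin at the last line consisting only of the word "references" or "bibliography", and every four-digit number of the form 19XX that does not postdate the source document is treated as a reference date (so page ranges such as 1901-1950 would be miscounted).

```python
import re
from statistics import median


def reference_years(text, publication_year):
    """Return candidate reference years found after the last reference heading."""
    # Assumption: the heading stands alone on its line, in any letter case.
    headings = list(
        re.finditer(r"(?im)^\s*(?:references|bibliography)\s*$", text)
    )
    if not headings:
        return []
    section = text[headings[-1].end():]
    candidates = [int(year) for year in re.findall(r"\b19\d\d\b", section)]
    # Keep only years that could have been cited by the source document.
    plausible = range(1900, publication_year + 1)
    return [year for year in candidates if year in plausible]


def median_citation_age(text, publication_year):
    """Publication year minus the median reference year, or None if no dates."""
    years = reference_years(text, publication_year)
    if not years:
        return None
    return publication_year - median(years)


if __name__ == "__main__":
    sample = (
        "Introduction, drafted in 1950.\n"
        "References\n"
        "[1] A. Author. A first title. 1990.\n"
        "[2] B. Author. A second title. 1994.\n"
        "[3] C. Author. A third title. 1996.\n"
    )
    print(reference_years(sample, 1997))
    print(median_citation_age(sample, 1997))
```

Documents for which the function returns None (no recognizable heading, or no dates) would have to be set aside for manual inspection.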
308To illustrate this process, 188 technical reports were sampled from Internet-&lt;br&gt;
309accessible repositories1 and used as source documents for a synchronous obsolescence&lt;br&gt;
310study. Conveniently, the repositories chosen organize technical reports into sub-&lt;br&gt;
311directories by their date of publication. The reference dates for each technical report&lt;br&gt;
312were automatically extracted by software that scanned the document’s file for numbers&lt;br&gt;
313of the form 19XX, since previous studies indicate that few if any computing reports&lt;br&gt;
314reference documents published in previous centuries [5]. Table 1 presents the median&lt;br&gt;
315citation age calculated for these documents, broken down by repository and the year of&lt;br&gt;
316publication for the source documents from which the reference dates were extracted:&lt;br&gt;
317Table 1. Median citation ages for technical report repositories&lt;br&gt;
318The median citation age ranges between 2 and 4 years, which is consistent with&lt;br&gt;
319previous examinations of computing and information systems literature ([5], [4]).&lt;br&gt;
320When graphed, the distribution of reference dates shows the exponential curve typically&lt;br&gt;
321found in obsolescence studies, including the final droop due to an “immediacy effect”&lt;br&gt;
322&lt;hr&gt;
323&lt;A name=11&gt;&lt;/a&gt;as fewer very new documents are available for citation [7]. These types of results&lt;br&gt;
324provide confirmation that references used in computer science technical reports (the&lt;br&gt;
325pre-eminent “grey literature” of the computing field) conform to the same patterns as&lt;br&gt;
326references found in the formally published literature.&lt;br&gt;
327&lt;i&gt;co-citation and bibliographic coupling studies&lt;/i&gt;&lt;br&gt;
328The rate at which documents are cited together (co-citation) or cite the same&lt;br&gt;
329documents (bibliographic coupling) can be used to produce &amp;quot;maps&amp;quot; of a subject&lt;br&gt;
330literature. These techniques rely on analysis of the references of documents, and these&lt;br&gt;
331references must be in a common format. While digital libraries contain full text of&lt;br&gt;
332documents, their references are not standardized, and indeed are not even tagged as&lt;br&gt;
333such. To perform these studies the references must be manually extracted and&lt;br&gt;
334processed–a tedious process that is only worthwhile for documents (such as technical&lt;br&gt;
335reports) that are not included in existing citation databases such as the Science Citation&lt;br&gt;
336Index and Social Science Citation Index.&lt;br&gt;
337&lt;i&gt;detecting cycles or regularities in the rate of production of research&lt;/i&gt;&lt;br&gt;
338Analysis of trends in the production of technical reports can give indications&lt;br&gt;
339about working conditions that affect research; for example, is more research produced&lt;br&gt;
340over the summer, when the teaching load is lighter, or is research steadily produced&lt;br&gt;
341throughout the year?&lt;br&gt;
342Figure 2. Distribution of the number of documents submitted to hep-th, 1992-1994&lt;br&gt;
343Figures 2 and 3 present statistics on document accumulation in the hep-th (high&lt;br&gt;
344energy physics) e-print server, a part of the PHYSICS E-PRINT ARCHIVE. This system&lt;br&gt;
345is one of the oldest formal pre-print archives, and has become the primary means for&lt;br&gt;
346information dissemination in its field. Examination of these figures reveals several&lt;br&gt;
347trends. Clearly the absolute number of documents deposited in the repository has&lt;br&gt;
348&lt;hr&gt;
349&lt;A name=12&gt;&lt;/a&gt;tended to increase over the time period. For all three years, research production has its&lt;br&gt;
350lowest point in January and February, increases through May and June, then decreases&lt;br&gt;
351until August and September. At that point the rate of production steps up, reaching a&lt;br&gt;
352yearly peak in November and December. This pattern is less clear for 1992, which&lt;br&gt;
353might be expected as the archive was established in mid-1991.&lt;br&gt;
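Because the absolute number of submissions grows from year to year, comparing seasonal patterns across years is easier when each month is expressed as a share of its own year's total. The following Python sketch shows one way such a normalization could be computed; the function name monthly_percentages is hypothetical, and the input is assumed to be a list of submission months (1 to 12) for a single year, already extracted from the archive listing.

```python
from collections import Counter


def monthly_percentages(submission_months):
    """Map each month (1-12) to its share of one year's submissions, in percent."""
    counts = Counter(submission_months)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Counter returns 0 for months with no submissions.
    return {month: 100.0 * counts[month] / total for month in range(1, 13)}


if __name__ == "__main__":
    # Toy input: four submissions, two in January, one each in the last two months.
    shares = monthly_percentages([1, 1, 11, 12])
    for month, share in shares.items():
        print(month, share)
```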
354Figure 3. Distribution of the percentage of documents submitted to hep-th, 1992-1994&lt;br&gt;
355&lt;b&gt;4. Analysis of usage data&lt;/b&gt;&lt;br&gt;
356The emerging Internet-based digital libraries will permit research on scientific&lt;br&gt;
357information collection and use at a much finer grain than is possible with current paper&lt;br&gt;
358libraries or online bibliographic databases. Current bibliometric or scientometric&lt;br&gt;
359research of this type must measure information use indirectly – for example, through&lt;br&gt;
360examination of the list of references appended to published articles. However, it is well&lt;br&gt;
361known that authors do not necessarily include in the reference list all documents that&lt;br&gt;
362could have been cited, and conversely that not all references listed may have been&lt;br&gt;
363actually “used” in performing the research; citation behavior can be affected by a&lt;br&gt;
364number of motivating factors (Garfield lists &lt;i&gt;15&lt;/i&gt; possible reasons in [8]).&lt;br&gt;
365Digital library transaction logs provide a powerful tool for direct analysis of&lt;br&gt;
366document “usage”: since digital libraries contain the actual document (rather than only a&lt;br&gt;
367document surrogate), the relative amount of “use” that a digital library’s clients make of&lt;br&gt;
368a given document can be estimated from the number of times the document file is&lt;br&gt;
369downloaded (and, presumably, the document is read). Note that file downloading is a&lt;br&gt;
370much stronger statement on the part of the user than, for example, having a&lt;br&gt;
371bibliographic record appear in the query result set for a conventional bibliographic&lt;br&gt;
372system; the user downloads only &lt;i&gt;after&lt;/i&gt; the document has been found potentially relevant&lt;br&gt;
373through examination of its document surrogate. Additionally, downloading is&lt;br&gt;
374frequently time-consuming and sometimes costly (depending on local pricing for&lt;br&gt;
375&lt;hr&gt;
376&lt;A name=13&gt;&lt;/a&gt;Internet access). Downloaded documents are therefore highly likely at least to be&lt;br&gt;
377scanned, if not read closely. The transaction logs for a digital library can provide a&lt;br&gt;
378global picture of the use of documents in the collection, since all user interactions with&lt;br&gt;
379the library can be automatically logged for analysis. By contrast, it is of course&lt;br&gt;
380impossible to track usage of print bibliographies, and very difficult to monitor usage of&lt;br&gt;
381bibliographic data available on CD-ROM across more than one or two sites.&lt;br&gt;
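As a minimal sketch of the download-count estimate described above, the following Python&lt;br&gt;
fragment tallies download events per document. The record layout (host, action, document id),&lt;br&gt;
the sample entries, and the function name download_counts are assumptions made for&lt;br&gt;
illustration; they do not reflect the log format of any particular digital library.&lt;br&gt;

```python
from collections import Counter

# Hypothetical transaction-log records: (client host, action, document id).
# The field layout is an assumption for illustration only.
SAMPLE_LOG = [
    ("cs.waikato.ac.nz", "query", None),
    ("cs.waikato.ac.nz", "download", "doc-17"),
    ("ibm.com", "download", "doc-17"),
    ("mit.edu", "download", "doc-42"),
    ("mit.edu", "query", None),
]


def download_counts(records):
    """Count download events per document id, ignoring other actions."""
    counts = Counter()
    for _host, action, doc_id in records:
        if action == "download" and doc_id is not None:
            counts[doc_id] += 1
    return counts


counts = download_counts(SAMPLE_LOG)
# Rank documents from most to least downloaded.
ranking = counts.most_common()
print(ranking)
```

A count of this kind measures downloads only; whether a downloaded document was read remains&lt;br&gt;
the presumption noted in the text.&lt;br&gt;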
Furthermore, analysis of search requests by geographic location, institution,&lt;br&gt;
and sometimes even individual user is also possible. As an example, Table 2 presents&lt;br&gt;
a portion of the summary of usage statistics (broken down by domain code) for queries&lt;br&gt;
to the computer science technical collection of the NEW ZEALAND DIGITAL LIBRARY.&lt;br&gt;
Examination of the data indicates that the heaviest use of the collection comes from&lt;br&gt;
North America and Europe (particularly Germany and Finland), as well as the local New&lt;br&gt;
Zealand community and nearby Australia. As expected for such a collection, a large&lt;br&gt;
proportion of users are from educational (.edu) institutions; surprisingly, however, a&lt;br&gt;
similar number of queries come from commercial (.com) organizations, indicating&lt;br&gt;
perhaps that the documents are seeing use in commercial research and development&lt;br&gt;
units.&lt;br&gt;
Table 2. Accesses to the NEW ZEALAND DIGITAL LIBRARY CS collection by Domain Code&lt;br&gt;
Of course, usage levels can also be further broken down by IP number&lt;br&gt;
(indicating institutions), and systems requiring users to register may also be able to&lt;br&gt;
analyze usage on an individual basis. Since the query strings themselves are also&lt;br&gt;
recorded in the transaction logs, this domain/institution/individual activity could also be&lt;br&gt;
linked to specific subjects through the query terms. Summaries of this type could be&lt;br&gt;
invaluable for studies of geographic diffusion and distribution of research topics.&lt;br&gt;
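The breakdown of queries by domain code can be sketched along the following lines. This is an&lt;br&gt;
illustrative Python fragment only: the host names are invented, the helper names domain_code&lt;br&gt;
and queries_by_domain are hypothetical, and the code assumes that each logged query carries a&lt;br&gt;
resolved host name rather than a numeric IP number.&lt;br&gt;

```python
from collections import Counter


def domain_code(host):
    """Return the last dot-separated label of a host name, lower-cased."""
    return host.rstrip(".").rsplit(".", 1)[-1].lower()


def queries_by_domain(hosts):
    """Tally one query per host entry, grouped by domain code."""
    return Counter(domain_code(h) for h in hosts)


# Invented client hosts, one entry per logged query.
hosts = [
    "cs.waikato.ac.nz",
    "ibm.com",
    "mit.edu",
    "Stanford.EDU",
    "tu-berlin.de",
]
tally = queries_by_domain(hosts)
print(tally.most_common())
```

Numeric IP numbers would first need to be mapped to host names or institutions before a tally&lt;br&gt;
of this kind is meaningful.&lt;br&gt;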
Transaction log analysis can also indicate time-related patterns in the&lt;br&gt;
information-seeking behavior of digital library users. As a sample of this type of&lt;br&gt;
analysis, Paul Ginsparg notes a seven-day periodicity in the number of search requests&lt;br&gt;
&lt;hr&gt;
&lt;A name=14&gt;&lt;/a&gt;made to the PHYSICS E-PRINT archives (Figure 4, reproduced from [9]). From this he&lt;br&gt;
infers that many physicists do not yet have weekend access to the Internet (an&lt;br&gt;
alternative, slightly more cynical hypothesis is that even high-energy theoretical&lt;br&gt;
physicists take the weekend off).&lt;br&gt;
Figure 4. Summary of search requests to the physics pre-print archives&lt;br&gt;
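One simple way to look for a weekly pattern of this kind is to count requests by day of the&lt;br&gt;
week. The Python sketch below does so on made-up data; the request dates, the per-day volumes,&lt;br&gt;
and the function name requests_per_weekday are assumptions for illustration and are not taken&lt;br&gt;
from the archive logs discussed in [9].&lt;br&gt;

```python
from collections import Counter
from datetime import date, timedelta

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def requests_per_weekday(request_dates):
    """Count requests by day of week (Monday first)."""
    counts = Counter(d.weekday() for d in request_dates)
    return {name: counts.get(i, 0) for i, name in enumerate(WEEKDAYS)}


# Made-up data: 14 consecutive days starting Monday 3 October 1994, with
# 10 requests on each weekday and 2 on each weekend day.
start = date(1994, 10, 3)
sample = []
for offset in range(14):
    day = start + timedelta(days=offset)
    per_day = 2 if day.weekday() in (5, 6) else 10
    sample.extend([day] * per_day)

profile = requests_per_weekday(sample)
print(profile)
```

A marked drop in the Saturday and Sunday counts relative to the weekdays would be consistent&lt;br&gt;
with a seven-day cycle, although it does not by itself explain the cause.&lt;br&gt;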
&lt;b&gt;5. Conclusion&lt;/b&gt;&lt;br&gt;
This study suggests opportunities for conducting bibliometric research on the&lt;br&gt;
evolving digital libraries. These repositories are suitable platforms for conventional&lt;br&gt;
bibliometric techniques (such as obsolescence studies, quantification of physical&lt;br&gt;
characteristics of documents comprising a subject literature, time analysis, etc.). The&lt;br&gt;
ability to directly monitor access to documents in digital libraries also enables&lt;br&gt;
researchers to explicitly quantify document usage, as well as to implicitly measure&lt;br&gt;
usage through citations. Additional facilities could aid in the performance of&lt;br&gt;
bibliographic experiments, such as: improved tagging of document fields; provision of&lt;br&gt;
utilities to strip out titles, authors, etc. from common document formats; and the ability&lt;br&gt;
to easily eliminate duplicate entries from downloaded library subsets. Unfortunately,&lt;br&gt;
the most useful of these additional facilities – those associated with a higher degree of&lt;br&gt;
cataloging – run counter to the underlying philosophy of many digital libraries: to&lt;br&gt;
avoid, if possible, manual processing and formal cataloging of documents. While&lt;br&gt;
adherence to this principle can limit the accuracy of fielded searching (or indeed,&lt;br&gt;
preclude it altogether), it can also avoid the cataloging bottleneck and permit digital&lt;br&gt;
libraries to provide access to larger numbers of documents.&lt;br&gt;
The digital libraries complement the information currently available through&lt;br&gt;
paper, online, and CD-ROM bibliographic resources. While these latter databases&lt;br&gt;
generally have the advantage of standardized formatting of bibliographic fields, the&lt;br&gt;
digital libraries are freely accessible, often contain &amp;quot;grey literature&amp;quot; that is otherwise&lt;br&gt;
&lt;hr&gt;
&lt;A name=15&gt;&lt;/a&gt;unavailable for analysis, and generally make the full text of documents available. The&lt;br&gt;
insights gained from analysis of digital libraries will add to the store of &amp;quot;information&lt;br&gt;
about information&amp;quot; that we have gained from older types of bibliographic repositories.&lt;br&gt;
&lt;b&gt;References&lt;/b&gt;&lt;br&gt;
[1] Bollacker, K.D., S. Lawrence, and C.L. Giles, CiteSeer: An Autonomous Web&lt;br&gt;
Agent for Automatic Retrieval and Identification of Interesting Publications,&lt;br&gt;
&lt;i&gt;Proceedings of the Second International Conference on Autonomous Agents&lt;/i&gt;&lt;br&gt;
(Minneapolis/St. Paul, May 9-13), 1998.&lt;br&gt;
[2] Bowman, C.M., P.B. Danzig, U. Manber, and M.F. Schwartz, Scalable Internet&lt;br&gt;
resource discovery: Research problems and approaches, &lt;i&gt;Communications of&lt;/i&gt;&lt;br&gt;
&lt;i&gt;the ACM 37(8)&lt;/i&gt; (1994) 98-107.&lt;br&gt;
[3] Burton, Hilary D., Use of a virtual information system for bibliometric analysis,&lt;br&gt;
&lt;i&gt;Information Processing &amp;amp; Management 24(1)&lt;/i&gt; (1988) 39-44.&lt;br&gt;
[4] Cunningham, S.J., An empirical investigation of the obsolescence rate for&lt;br&gt;
information systems literature, &lt;i&gt;Library and Information Science&lt;/i&gt;&lt;br&gt;
&lt;i&gt;Research&lt;/i&gt;, 1996, http://library.fgcu.edu/iclc/lisrissu.htm&lt;br&gt;
[5] Cunningham, S.J., and D. Bocock, Obsolescence of computing literature,&lt;br&gt;
&lt;i&gt;Scientometrics 34(2)&lt;/i&gt; (1995) 255-262.&lt;br&gt;
[6] Cunningham, S.J., and Lynn Silipigni Connaway, Information searching&lt;br&gt;
preferences and practices of computer science researchers, &lt;i&gt;Proceedings of&lt;/i&gt;&lt;br&gt;
&lt;i&gt;OZCHI '96&lt;/i&gt; (1996) 294-299.&lt;br&gt;
[7] de Solla Price, D.J., Citation measures of hard science, soft science, technology,&lt;br&gt;
and nonscience. In: C.E. Nelson and D.K. Pollock (eds), &lt;i&gt;Communication&lt;/i&gt;&lt;br&gt;
&lt;i&gt;among scientists and engineers&lt;/i&gt; (Heath Lexington, 1970).&lt;br&gt;
[8] Garfield, E., &lt;i&gt;Citation Indexing: Its theory and application in Science, Technology&lt;/i&gt;&lt;br&gt;
&lt;i&gt;and Humanities&lt;/i&gt; (Wiley, 1979).&lt;br&gt;
&lt;hr&gt;
&lt;A name=16&gt;&lt;/a&gt;[9] Ginsparg, P., After dinner remarks: 14 Oct ’94 APS meeting at LANL, 1994&lt;br&gt;
(&amp;lt;URL: http://xxx.lanl.gov/blurb&amp;gt;).&lt;br&gt;
[10] Ginsparg, P., First steps towards electronic research communication, &lt;i&gt;Computers&lt;/i&gt;&lt;br&gt;
&lt;i&gt;in Physics 8(4)&lt;/i&gt; (1994) 390-401.&lt;br&gt;
[11] Hallmark, J., Scientists' access and retrieval of references cited in their recent&lt;br&gt;
journal articles, &lt;i&gt;College and Research Libraries 55(3)&lt;/i&gt; (1994) 199-210.&lt;br&gt;
[12] Hawkins, D.T., Unconventional uses of on-line information retrieval systems:&lt;br&gt;
on-line bibliometric studies, &lt;i&gt;Journal of the American Society for Information&lt;/i&gt;&lt;br&gt;
&lt;i&gt;Science 28&lt;/i&gt; (1977) 13-18.&lt;br&gt;
[13] McGhee, P.E., P.R. Skinner, K. Roberto, N.J. Ridenour, and S.M. Larson,&lt;br&gt;
Using online databases to study current research trends: an online bibliometric&lt;br&gt;
study, &lt;i&gt;Library and Information Science Research 9&lt;/i&gt; (1987) 285-291.&lt;br&gt;
[14] Maly, K., E.A. Fox, J.C. French, and A.L. Selman, Wide area technical report&lt;br&gt;
server (&lt;i&gt;Technical Report&lt;/i&gt;, Dept. of Computer Science, Old Dominion&lt;br&gt;
University, 1994. Also available at&lt;br&gt;
&amp;lt;URL: http://www.cs.odu.edu/WATERS/WATERS-paper.ps&amp;gt;).&lt;br&gt;
[15] Sigogneau, M.J., S. Bain, J.P. Courtial, and H. Feillet, Scientific innovation in&lt;br&gt;
bibliographical databases: a comparative study of the Science Citation Index&lt;br&gt;
and the Pascal database, &lt;i&gt;Scientometrics 22(1)&lt;/i&gt; (1991) 65-82.&lt;br&gt;
[16] Witten, I.H., S.J. Cunningham, M. Vallabh, and T.C. Bell, A New Zealand&lt;br&gt;
digital library for computer science research, &lt;i&gt;Proceedings of Digital Libraries&lt;/i&gt;&lt;br&gt;
&lt;i&gt;'95&lt;/i&gt; (1995) 25-30.&lt;br&gt;
[17] Witten, I.H., C. Nevill-Manning, and S.J. Cunningham, A public library based&lt;br&gt;
on full-text retrieval, &lt;i&gt;Communications of the ACM 41(4)&lt;/i&gt; (1998) 71.&lt;br&gt;
&lt;hr&gt;
&lt;A name=17&gt;&lt;/a&gt; &lt;br&gt;
&lt;sup&gt;1&lt;/sup&gt;Documents were randomly sampled from the DEC&lt;br&gt;
(ftp://crl.dec.com/pub/DEC/CRL/tech-reports/), Sony&lt;br&gt;
(ftp://ftp.csl.sony.co.jp/CSL/CSL-Papers), and Ohio&lt;br&gt;
(ftp://archive.cis.ohio-state.edu/pub/tech-report/) technical report repositories.&lt;br&gt;
&lt;hr&gt;
</Content>
</Section>
</Archive>