source: other-projects/nightly-tasks/diffcol/trunk/gs3-model-collect/Word-PDF-Enhanced/archives/HASH019c.dir/doc.xml@ 30029

Last change on this file since 30029 was 30029, checked in by ak19, 9 years ago

Adding the Enhanced Word tutorial collection that uses Windows Scripting. Pre-built on Windows 7 64 bit.

1<?xml version="1.0" encoding="utf-8" standalone="no"?>
2<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
3<Archive>
4<Section>
5 <Description>
6 <Metadata name="gsdldoctype">indexed_doc</Metadata>
7 <Metadata name="Language">en</Metadata>
8 <Metadata name="Encoding">utf8</Metadata>
9 <Metadata name="Author">Bronwyn</Metadata>
10 <Metadata name="Title">biblio_for_dl_scientometrics.do</Metadata>
11 <Metadata name="URL">http://C:/Users/Anupama/GS307_13July2015/web/sites/localsite/collect/Word-PDF-Enhanced/tmp/1436775750/pdf03.html</Metadata>
12 <Metadata name="UTF8URL">http://C:/Users/Anupama/GS307_13July2015/web/sites/localsite/collect/Word-PDF-Enhanced/tmp/1436775750/pdf03.html</Metadata>
13 <Metadata name="gsdlsourcefilename">import\pdf03.pdf</Metadata>
14 <Metadata name="gsdlconvertedfilename">tmp\1436775750\pdf03.html</Metadata>
15 <Metadata name="OrigSource">pdf03.html</Metadata>
16 <Metadata name="Source">pdf03.pdf</Metadata>
17 <Metadata name="SourceFile">pdf03.pdf</Metadata>
18 <Metadata name="Plugin">PDFPlugin</Metadata>
19 <Metadata name="FileSize">35935</Metadata>
20 <Metadata name="FilenameRoot">pdf03</Metadata>
21 <Metadata name="FileFormat">PDF</Metadata>
22 <Metadata name="srcicon">_iconpdf_</Metadata>
23 <Metadata name="srclink_file">doc.pdf</Metadata>
24 <Metadata name="srclinkFile">doc.pdf</Metadata>
25 <Metadata name="NumPages">17</Metadata>
26 <Metadata name="dc.Creator">Sally Jo Cunningham</Metadata>
27 <Metadata name="dc.Title">Applications for Bibliometric Research in the Emerging Digital Libraries</Metadata>
28 <Metadata name="Identifier">HASH019c5dca7f5bb781460a6b9c</Metadata>
29 <Metadata name="lastmodified">1436763858</Metadata>
30 <Metadata name="lastmodifieddate">20150713</Metadata>
31 <Metadata name="oailastmodified">1436775750</Metadata>
32 <Metadata name="oailastmodifieddate">20150713</Metadata>
33 <Metadata name="assocfilepath">HASH019c.dir</Metadata>
34 <Metadata name="gsdlassocfile">doc.pdf:application/pdf:</Metadata>
35 </Description>
36 <Content>
37
38
39&lt;A name=1&gt;&lt;/a&gt;&lt;b&gt;Applications for Bibliometric Research&lt;/b&gt;&lt;br&gt;
40
41
42&lt;b&gt;in the Emerging Digital Libraries&lt;/b&gt;&lt;br&gt;
43
44
45Sally Jo Cunningham&lt;br&gt;
46
47
48Department of Computer Science&lt;br&gt;
49
50
51University of Waikato&lt;br&gt;
52
53
54Hamilton, New Zealand&lt;br&gt;
55
56
57email: [email protected]&lt;br&gt;
58
59
60&lt;b&gt;Abstract:&lt;/b&gt; Large numbers of research documents have recently become available on&lt;br&gt;
61
62
63the Internet through “digital libraries”, and these collections are seeing high levels of&lt;br&gt;
64
65
66use by their related research communities. A secondary use for these document&lt;br&gt;
67
68
69repositories and indexes is as a platform for bibliometric research. We examine the&lt;br&gt;
70
71
72extent to which the new digital libraries support conventional bibliometric analysis, and&lt;br&gt;
73
74
75discuss shortcomings in their current forms. Interestingly, these electronic text&lt;br&gt;
76
77
78archives also provide opportunities for new types of studies: generally the full text of&lt;br&gt;
79
80
81documents is available for analysis, giving a finer grain of insight than abstract-only&lt;br&gt;
82
83
84online databases; these repositories often contain technical reports or pre-prints, the&lt;br&gt;
85
86
87“grey literature” that has been previously unavailable for analysis; and document&lt;br&gt;
88
89
90“usage” can be measured directly by recording user accesses, rather than studied&lt;br&gt;
91
92
93indirectly through document references.&lt;br&gt;
94
95
96&lt;b&gt;1. Introduction&lt;/b&gt;&lt;br&gt;
97
98
99In recent years a number of &amp;quot;digital libraries&amp;quot; have become available through the&lt;br&gt;
100
101
102Internet. While the technology promises in the future to support large, heterogeneous&lt;br&gt;
103
104
105collections, at present the most widely used of the academically-focussed digital&lt;br&gt;
106
107
108libraries are generally repositories of one or two types of document (typically technical&lt;br&gt;
109
110
111reports, journal articles, pre-prints, or conference proceedings), grouped by discipline.&lt;br&gt;
112
113
114&lt;hr&gt;
115
116
117&lt;A name=2&gt;&lt;/a&gt;A distinguishing characteristic of these digital libraries is that the full text of documents&lt;br&gt;
118
119
120is often available for retrieval, as well as bibliographic records. The sciences are&lt;br&gt;
121
122
123represented much more heavily in the present crop of digital libraries than the social&lt;br&gt;
124
125
126sciences, arts, or humanities. They are maintained by professional societies,&lt;br&gt;
127
128
129universities, research laboratories, and even private individuals. Access is generally&lt;br&gt;
130
131
132free, both to search and to download documents.&lt;br&gt;
133
134
135The emergence of these subject-specific digital libraries is particularly important&lt;br&gt;
136
137
138given the pattern of access to materials presently employed by research scientists.&lt;br&gt;
139
140
141Informal exchanges of preprints, reprints, and photocopies of papers passed on by&lt;br&gt;
142
143
144colleagues currently are major venues for the transmission of scientific information&lt;br&gt;
145
146
147between researchers in the sciences. In one study, the dependence on these sources&lt;br&gt;
148
149
150ranges from 12% (for chemistry) to 39% (for mathematics) of all papers cited in&lt;br&gt;
151
152
153researchers' own publications [11]. A qualitative study of how computer&lt;br&gt;
154
155
156scientists locate and retrieve documents (computing is one of the domains considered&lt;br&gt;
157
158
159later in this paper) indicates that for that field, technical reports and research documents&lt;br&gt;
160
161
162found in various locations on the Internet are a preferred source of information [6].&lt;br&gt;
163
164
165Many of the digital library systems discussed in this paper are repositories for just this&lt;br&gt;
166
167
168type of literature. The documents tend to be of high quality: primarily technical&lt;br&gt;
169
170
171reports or working papers from research institutions (both academic and commercial),&lt;br&gt;
172
173
174as well as advance copies of work accepted for publication in conventional paper&lt;br&gt;
175
176
177journals. Moreover, these digital libraries are also coming to include refereed work&lt;br&gt;
178
179
180published digitally (in electronic journals). Anecdotal evidence suggests that in their&lt;br&gt;
181
182
183fields, these digital libraries are coming to be the resource of choice for locating cutting&lt;br&gt;
184
185
186edge work.&lt;br&gt;
187
188
189For specialized subjects such as high energy physics, this dependence on&lt;br&gt;
190
191
192informal or extra-library dissemination can be much higher. Ginsparg ([9], [10])&lt;br&gt;
193
194
195reports that fields in physics have traditionally relied heavily on preprint exchanges, and&lt;br&gt;
196
197
198the digital repositories of physics preprints begun in 1991 (the PHYSICS E-PRINT&lt;br&gt;
199
200
201ARCHIVES) have to a large extent supplanted conventional publishing and physical&lt;br&gt;
202
203
204&lt;hr&gt;
205
206
207&lt;A name=3&gt;&lt;/a&gt;paper mailing of technical reports. By providing ready access to information sources&lt;br&gt;
208
209
210that are already preferentially utilized by scientists, the digital libraries show potential to&lt;br&gt;
211
212
213increase access to information that until recently was expensive or difficult to acquire in&lt;br&gt;
214
215
216paper form. Indeed, in some fields (most notably physics) this process has already&lt;br&gt;
217
218
219begun, as researchers in less developed countries report access to ongoing research&lt;br&gt;
220
221
222through the Internet repositories that their local libraries could not afford to acquire&lt;br&gt;
223
224
225through conventional journal subscriptions ([9], [10]).&lt;br&gt;
226
227
228The primary use for new bibliographic resources is, of course, for the contents&lt;br&gt;
229
230
231of the documents involved. A secondary use for emerging resources is as a basis for&lt;br&gt;
232
233
234bibliometric analysis of the subject field. With the conventionally published scientific&lt;br&gt;
235
236
237literature, the sheer difficulty of accumulating statistics discouraged bibliometric&lt;br&gt;
238
239
240research until the advent of large bibliographic databases in the 1960's. Computerized&lt;br&gt;
241
242
243bibliographic databases sparked a significant increase in the number of large-scale&lt;br&gt;
244
245
246bibliographic studies, as significant portions of the collection and analysis of data could&lt;br&gt;
247
248
249be automated ([12], [13]). The availability of CD-ROM versions of bibliographic&lt;br&gt;
250
251
252databases has been of particular importance, since they provide a cheaper alternative to&lt;br&gt;
253
254
255the online commercial databases [3].&lt;br&gt;
256
257
258These computerized bibliographic resources have drawbacks, however. The&lt;br&gt;
259
260
261greatest is that the full text of documents is rarely available, and even abstracts are not&lt;br&gt;
262
263
264always present. This obviously limits the types of bibliometric research that can be&lt;br&gt;
265
266
267conducted &lt;i&gt;solely&lt;/i&gt; through these databases. In addition, these databases are generally&lt;br&gt;
268
269
270limited to formally published documents (those appearing in selected books, journals,&lt;br&gt;
271
272
273and conference proceedings). The &amp;quot;grey literature&amp;quot; of technical reports, pre-prints, and&lt;br&gt;
274
275
276other works not formally published are largely ignored, and it is this absence of easy&lt;br&gt;
277
278
279access to these documents that has hampered the analysis of these important forms of&lt;br&gt;
280
281
282scientific communication.&lt;br&gt;
283
284
285The digital libraries currently in existence complement the online and CD-ROM&lt;br&gt;
286
287
288bibliographic databases. They are best suited for examinations of the &amp;quot;physical&amp;quot;&lt;br&gt;
289
290
291characteristics of documents (for example, document length), analysis based on&lt;br&gt;
292
293
294&lt;hr&gt;
295
296
297&lt;A name=4&gt;&lt;/a&gt;bibliographic information that can be automatically extracted from the document text or&lt;br&gt;
298
299
300the sometimes unevenly formatted bibliographic records (such as obsolescence&lt;br&gt;
301
302
303studies), and usage studies (geographic or institutional origin of users, date/time of&lt;br&gt;
304
305
306access, individual patterns of document retrieval, etc.). Because references are present&lt;br&gt;
307
308
309in the document file but not identified by field, co-citation and bibliographic coupling&lt;br&gt;
310
311
312research is not well-supported, and conducting these studies requires considerable&lt;br&gt;
313
314
315effort on the part of the researcher.&lt;br&gt;
316
317
318The variety of bibliographic repositories in the available digital libraries in itself&lt;br&gt;
319
320
321has great potential in conducting bibliometric research. Sigogneau et al [15] present a&lt;br&gt;
322
323
324case study illustrating the ways in which the strengths of different databases can be&lt;br&gt;
325
326
327played off each other; they conduct a fine-grained analysis of the emergence of research&lt;br&gt;
328
329
330fronts in molecular and cellular biology, and demonstrate that the observations gleaned&lt;br&gt;
331
332
333from two complementary bibliographic databases provide greater insight into their&lt;br&gt;
334
335
336problem. Similarly, it appears that the types of bibliographic data that can be gleaned&lt;br&gt;
337
338
339from the relatively unstructured digital libraries can be profitably combined with data&lt;br&gt;
340
341
342from online databases, CD-ROMS, and other more conventional bibliographic&lt;br&gt;
343
344
345resources.&lt;br&gt;
346
347
348This paper is organized as follows: Section 2 discusses the types of indexing&lt;br&gt;
349
350
351and searching available with current digital libraries; Section 3 gives examples of&lt;br&gt;
352
353
354conventional bibliometric techniques applied to Internet-accessible archives; Section 4&lt;br&gt;
355
356
357discusses opportunities to directly measure usage of documents and to detect&lt;br&gt;
358
359
360information-seeking patterns in researchers; and Section 5 presents our conclusions.&lt;br&gt;
361
362
363&lt;b&gt;2. Indexing and searching in current digital libraries&lt;/b&gt;&lt;br&gt;
364
365
366At present, the types of indexing fields for most academically-oriented digital&lt;br&gt;
367
368
369library systems are limited. Many schemes index on user-supplied document&lt;br&gt;
370
371
372descriptions, abstracts, or similar document surrogates (for example, the PHYSICS E-&lt;br&gt;
373
374
375PRINT ARCHIVE [10], a collection of physics pre-prints and technical reports). As will&lt;br&gt;
376
377
378&lt;hr&gt;
379
380
381&lt;A name=5&gt;&lt;/a&gt;be discussed below, the quality of this user-provided data can be highly variable, and&lt;br&gt;
382
383
384may unfavorably impact the usefulness of the index for searching. Alternatively, a&lt;br&gt;
385
386
387designated site librarian may maintain a catalog (eg, the WATERS [14] system, now&lt;br&gt;
388
389
390subsumed by NCSTRL (http://www.ncstrl.org/), both primarily collections of&lt;br&gt;
391
392
393computer science technical reports); in this case the quality of the bibliographic&lt;br&gt;
394
395
396information may be expected to be higher, but fewer sites will be likely to support&lt;br&gt;
397
398
399such a librarian and therefore fewer documents are likely to be included in the digital&lt;br&gt;
400
401
402library. In a “harvesting” system such as the computer science technical report&lt;br&gt;
403
404
405collections supported by HARVEST [2] or the NEW ZEALAND DIGITAL LIBRARY&lt;br&gt;
406
407
408computer science technical report collection ([16], [17]), documents are indexed from&lt;br&gt;
409
410
411passive repositories (that may not even be aware that their documents are being&lt;br&gt;
412
413
414included in the digital library). Harvesting systems therefore cannot rely on the&lt;br&gt;
415
416
417presence of bibliographic data of any sort.&lt;br&gt;
418
419
420Because of the relative paucity of high-quality bibliographic data available to&lt;br&gt;
421
422
423many of the current academically- or research-focussed digital library collections, their&lt;br&gt;
424
425
426search interfaces tend to be more primitive than those ordinarily found in online&lt;br&gt;
427
428
429bibliographic databases or library catalogs. Systems such as NCSTRL can support&lt;br&gt;
430
431
432author, title, and subject searching, but this more sophisticated search functionality&lt;br&gt;
433
434
435comes at the expense of requiring participating repositories to use specific software. As&lt;br&gt;
436
437
438a consequence, these latter systems may provide access to a smaller number of sites than&lt;br&gt;
439
440
441harvesting systems. Harvesters may access a broader range of providers, but at the&lt;br&gt;
442
443
444penalty of being limited to unfielded, keyword searches over the raw text of the&lt;br&gt;
445
446
447document or document surrogate.&lt;br&gt;
448
449
450Specifically, the indexing in existing digital libraries has a variety of shortcomings for&lt;br&gt;
451
452
453bibliometric applications:&lt;br&gt;
454
455
456•&lt;br&gt;
457
458
459&lt;i&gt;lack of fielded indexing:&lt;/i&gt; As noted above, some large and widely used digital&lt;br&gt;
460
461
462libraries (such as the computer science technical report collection of the NEW&lt;br&gt;
463
464
465ZEALAND DIGITAL LIBRARY) may lack formal cataloging entirely, and rely on&lt;br&gt;
466
467
468&lt;hr&gt;
469
470
471&lt;A name=6&gt;&lt;/a&gt;keyword searching over the raw document text. Obviously this makes field-&lt;br&gt;
472
473
474dependent analysis more difficult (for example, locating documents produced by&lt;br&gt;
475
476
477specific authors), and in the worst case may require a manual examination of all&lt;br&gt;
478
479
480files in the collection in order to reliably identify a desired document subset.&lt;br&gt;
481
482
483However, keyword search techniques that approximate fielded searching results&lt;br&gt;
484
485
486may suffice: for example in the NEW ZEALAND DIGITAL LIBRARY computer&lt;br&gt;
487
488
489science technical report collection, limiting the keyword search for “Johnson”&lt;br&gt;
490
491
492to a search of first pages only is likely to retrieve documents written by Johnson&lt;br&gt;
493
494
495(since for the majority of computer science technical reports, the first page&lt;br&gt;
496
497
498contains little more than author, title, date, and institution details).&lt;br&gt;
499
500
501A more principled approach to extracting bibliographic information is embodied&lt;br&gt;
502
503
504in the CiteSeer tool [1]. This software parses raw, unfielded academic&lt;br&gt;
505
506
507documents and attempts to identify such indexing information as author, title,&lt;br&gt;
508
509
510reference list, etc. Obviously such a tool cannot attain 100% accuracy over a&lt;br&gt;
511
512
513heterogeneous document collection, but in practice it appears useful in that it can&lt;br&gt;
514
515
516make a good first pass in processing a set of documents, providing an initial set&lt;br&gt;
517
518
519of parsed documents for analysis. The remaining (presumably much smaller) set&lt;br&gt;
520
521
522of unparsable documents can then be dealt with manually.&lt;br&gt;
523
524
525•&lt;br&gt;
526
527
528&lt;i&gt;lack of consistency in field formatting:&lt;/i&gt; Current digital libraries usually acquire&lt;br&gt;
529
530
531bibliographic information from either the authors of submitted articles or&lt;br&gt;
532
533
534automatic extraction routines (retrieving bibliographic details from catalog files&lt;br&gt;
535
536
537that may or may not be in a given document site, and that may or may not be in&lt;br&gt;
538
539
540an easily parsable form). Neither of these methods produces records with&lt;br&gt;
541
542
543standard formatting, which causes problems with automated bibliometric&lt;br&gt;
544
545
546analysis. Consider the following examples selected from entries in the hep-th&lt;br&gt;
547
548
549(high energy physics) collection of the PHYSICS E-PRINT ARCHIVES:&lt;br&gt;
550
551
552&lt;hr&gt;
553
554
555&lt;A name=7&gt;&lt;/a&gt;(i)&lt;br&gt;
556
557
558Authors: A. Yu. Alekseev, V. Schomerus&lt;br&gt;
559
560
561(ii)&lt;br&gt;
562
563
564Authors: Adel Bilal and Ian. I. Kogan&lt;br&gt;
565
566
567(iii)&lt;br&gt;
568
569
570Authors: Paul S. Aspinwall and David R. Morrison (with an appendix &lt;br&gt;
571
572
573by Mark Gross)&lt;br&gt;
574
575
576(iv)&lt;br&gt;
577
578
579Authors: A. H. Chamseddine and Herbi Dreiner (ETH-Zurich)&lt;br&gt;
580
581
582In this case, typical for existing digital libraries, there is no standardized format&lt;br&gt;
583
584
585for authors' names (here, appearing with full names, initials plus last name, and&lt;br&gt;
586
587
588a mixture of the two); no standard convention for separating author names&lt;br&gt;
589
590
591(here, either a comma or &amp;quot;and&amp;quot; is used); and parenthetical information can&lt;br&gt;
592
593
594include a variety of information such as the name of an associate author or the&lt;br&gt;
595
596
597institutional affiliations of an author. Manual processing or specially crafted&lt;br&gt;
598
599
600software would be required to reformat these fields for analysis.&lt;br&gt;
601
602
603•&lt;br&gt;
604
605
606&lt;i&gt;duplicate entries: &lt;/i&gt; Digital libraries that draw documents from a variety of sources&lt;br&gt;
607
608
609may inadvertently contain duplicate items. Unfortunately, the irregular&lt;br&gt;
610
611
612formatting of the bibliographic information makes it difficult to automatically&lt;br&gt;
613
614
615detect these duplicates.&lt;br&gt;
616
617
618•&lt;br&gt;
619
620
621&lt;i&gt;implicit field tagging:&lt;/i&gt; In some repositories, items are not explicitly tagged with&lt;br&gt;
622
623
624certain types of information – most commonly the document's date of&lt;br&gt;
625
626
627publication or production. Instead, the date is implicit in the document's title&lt;br&gt;
628
629
630(eg, its numeration in a technical report series) or in the location of the document&lt;br&gt;
631
632
633in the file structure of the repository (eg, separate directories exist for each&lt;br&gt;
634
635
636year). A second common piece of implicit data is the authors’ institutional&lt;br&gt;
637
638
639affiliations. This may be contained in the document itself (typically on a cover&lt;br&gt;
640
641
642page), or may be implicit in the document’s location (for example, a&lt;br&gt;
643
644
645corporation’s technical reports are stored in its ftp repository). Again, in these&lt;br&gt;
646
647
648&lt;hr&gt;
649
650
651&lt;A name=8&gt;&lt;/a&gt;cases special processing is required to append this field information to a&lt;br&gt;
652
653
654document record for bibliometric analysis. &lt;br&gt;
655
656
657•&lt;br&gt;
658
659
660&lt;i&gt;extraction of document text:&lt;/i&gt; Few of the documents stored in the research-&lt;br&gt;
661
662
663oriented digital libraries discussed in this paper are straight ascii text; instead,&lt;br&gt;
664
665
666documents may appear in a variety of file formats, such as LaTeX, PostScript,&lt;br&gt;
667
668
669PDF, etc. If the contents of the documents are to be automatically processed&lt;br&gt;
670
671
672(for example, to count the words in a document, or to extract reference&lt;br&gt;
673
674
675publication dates for an obsolescence study), then the text must be extracted.&lt;br&gt;
676
677
678Utilities are available to convert most common document formats to ascii.&lt;br&gt;
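The author-field irregularities shown in the hep-th examples above could be handled by a small parser along the following lines. This is only a sketch under stated assumptions, not software used for any study reported here: the function name split_authors is hypothetical, and its rules (strip the "Authors:" label, discard parenthetical remarks, split on commas and on the word "and") were chosen to fit the four sample entries and will not cover every record in a real collection.

```python
import re

def split_authors(field):
    """Split a free-form author field into a list of author names.

    Assumed rules (hypothetical, chosen to fit the four sample entries):
    drop a leading "Authors:" label, discard parenthetical remarks,
    then split on commas and on the standalone word "and".
    """
    text = re.sub(r"^\s*Authors:\s*", "", field)
    # Parenthetical notes hold affiliations or associate authors, not names.
    text = re.sub(r"\([^)]*\)", " ", text)
    parts = re.split(r",|\band\b", text)
    # Collapse runs of whitespace and drop empty fragments.
    names = [" ".join(part.split()) for part in parts]
    return [name for name in names if name]

samples = [
    "Authors: A. Yu. Alekseev, V. Schomerus",
    "Authors: Adel Bilal and Ian. I. Kogan",
    "Authors: Paul S. Aspinwall and David R. Morrison (with an appendix by Mark Gross)",
    "Authors: A. H. Chamseddine and Herbi Dreiner (ETH-Zurich)",
]
for sample in samples:
    print(split_authors(sample))
```

Discarding the parenthetical text is a deliberate simplification: it loses the associate author in example (iii) and the affiliation in example (iv), so a fuller tool would need to classify that text instead of dropping it.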
679
680
681It is likely that many of these problems will be addressed as the Internet-based&lt;br&gt;
682
683
684document indexing systems mature. Even minor changes can greatly increase the&lt;br&gt;
685
686
687usability of a bibliographic database for bibliometric research. For example, the&lt;br&gt;
688
689
690addition of an explicit date tag to many online databases in 1975 sparked new&lt;br&gt;
691
692
693applications in time series research [3].&lt;br&gt;
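The implicit field tagging problem noted earlier in this section, where a publication year is carried only by a directory name, can often be handled with a few lines of code. The sketch below is illustrative only: the function year_from_path and the example paths are hypothetical, and it assumes that one directory component of the path is a four-digit year of the form 19XX.

```python
import re
from pathlib import PurePosixPath

def year_from_path(path):
    """Return the first directory component that is a 19XX year, else None."""
    # parts[:-1] skips the file name, so a report named 1993.ps is not misread.
    for part in PurePosixPath(path).parts[:-1]:
        if re.fullmatch(r"19\d{2}", part):
            return int(part)
    return None

# Hypothetical repository layout with one sub-directory per year.
print(year_from_path("pub/techreports/1994/tr-94-07.ps"))
print(year_from_path("pub/techreports/misc/overview.ps"))
```

The recovered year can then be appended to the document record as an explicit date field before any time series analysis is attempted.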
694
695
696&lt;b&gt;3. Opportunities for applications of bibliometric techniques&lt;/b&gt;&lt;br&gt;
697
698
699One type of bibliometric research concentrates on quantifying fundamental,&lt;br&gt;
700
701
702structural details about a subject literature: how many items are published, how many&lt;br&gt;
703
704
705authors are publishing, over what time period documents are likely to be used, etc.&lt;br&gt;
706
707
708More complex studies analyze the relationships between documents, such as how&lt;br&gt;
709
710
711documents cluster into subjects. The following examples give a flavour of the&lt;br&gt;
712
713
714bibliometric research that is possible using the emerging digital libraries:&lt;br&gt;
715
716
717&lt;i&gt;examining the “physical” characteristics of archived documents&lt;/i&gt;&lt;br&gt;
718
719
720One relatively straightforward type of bibliometric study characterizes the&lt;br&gt;
721
722
723formats of different literatures. For example, Figure 1 presents the range of the size&lt;br&gt;
724
725
726&lt;hr&gt;
727
728
729&lt;A name=9&gt;&lt;/a&gt;of computer science technical reports as measured by their length in pages. Of the&lt;br&gt;
730
731
73245,720 documents in the CSTR collection as of April 1998, nearly 1600 did not contain&lt;br&gt;
733
734
735page divisions in their files (and hence are excluded from analysis). Note that the&lt;br&gt;
736
737
738number of pages in the shorter documents (&amp;lt;50 pages) falls into an approximately&lt;br&gt;
739
740
741normal distribution (slightly skewed to the left), while presumably the longer&lt;br&gt;
742
743
744documents represent Masters’ and Doctoral theses. A surprising number of documents&lt;br&gt;
745
746
747are very short (between one and five pages); these may represent the type of condensed&lt;br&gt;
748
749
750results frequently found in the “technical notes”, “short papers”, and “poster sessions”&lt;br&gt;
751
752
753of computing conferences and journals. The average number of pages per document,&lt;br&gt;
754
755
75627.5, appears to be slightly longer than the common upper bound for a computing&lt;br&gt;
757
758
759journal article, although this observation must be confirmed by a similar study of the&lt;br&gt;
760
761
762lengths of formally published computing articles.&lt;br&gt;
763
764
765This type of analysis is of particular interest for technical reports, since they&lt;br&gt;
766
767
768have not been studied in the same detail as formally published papers. A comparison of&lt;br&gt;
769
770
771the physical characteristics of the formal and informal literature could provide&lt;br&gt;
772
773
774supporting evidence for common beliefs about the relationship between the two types&lt;br&gt;
775
776
777of documents. For example, do publishing constraints force journal and proceedings&lt;br&gt;
778
779
780articles to be shorter than technical reports, and therefore presumably omit technical&lt;br&gt;
781
782
783details of findings? Do technical reports contain more/less extensive reference sections?&lt;br&gt;
784
785
786If reference sections of technical reports are longer than those of published articles, then&lt;br&gt;
787
788
789citation links are being omitted in published works; if technical reports contain fewer&lt;br&gt;
790
791
792references, then this may confirm earlier indications that computer scientists tend to&lt;br&gt;
793
794
795“research first” and do literature surveys later [6].&lt;br&gt;
796
797
798Figure 1. Range of sizes of CS technical reports, measured by number of pages&lt;br&gt;
799
800
801&lt;i&gt;obsolescence studies.&lt;/i&gt;&lt;br&gt;
802
803
804A document is considered obsolete when it is no longer referenced by the&lt;br&gt;
805
806
807current literature. Typically, documents receive their greatest number and frequency of&lt;br&gt;
808
809
810&lt;hr&gt;
811
812
813&lt;A name=10&gt;&lt;/a&gt;citations immediately after publication, and the frequency of citation falls rapidly as time&lt;br&gt;
814
815
816passes. One technique for estimating the obsolescence rate of a body of literature – the&lt;br&gt;
817
818
819&lt;i&gt;synchronous&lt;/i&gt; method – is to find the median date in the references of the documents.&lt;br&gt;
820
821
822This median date is subtracted from the year of publication for the documents, yielding&lt;br&gt;
823
824
825the &lt;i&gt;median citation age&lt;/i&gt;. As would be expected, this median varies between the&lt;br&gt;
826
827
828disciplines. Typically the social sciences and arts have a higher median citation age&lt;br&gt;
829
830
831than the “hard” sciences and engineering, indicating that documents obsolesce more&lt;br&gt;
832
833
834quickly for the latter fields.&lt;br&gt;
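The synchronous calculation described above reduces to a single subtraction once the reference years are known. The following is a minimal sketch, assuming the reference years have already been collected into a list; the function name median_citation_age and the sample years are hypothetical.

```python
from statistics import median

def median_citation_age(publication_year, reference_years):
    """Synchronous method: publication year minus the median reference year."""
    if not reference_years:
        raise ValueError("no reference years supplied")
    return publication_year - median(reference_years)

# Hypothetical reference years for a report published in 1996.
print(median_citation_age(1996, [1985, 1990, 1993, 1994, 1995]))
```

For the hypothetical list, the median reference year is 1993, giving a median citation age of 3 years.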
835
836
837As noted in Section 2, references are not generally explicitly tagged in existing&lt;br&gt;
838
839
840digital repositories. However, reference dates can usually be extracted from the&lt;br&gt;
841
842
843document text by first locating the reference section (usually delimited by a &amp;quot;references&amp;quot;&lt;br&gt;
844
845
846or &amp;quot;bibliography&amp;quot; section heading), and then extracting all numbers in the appropriate&lt;br&gt;
847
848
849ranges for dates for the field under study.&lt;br&gt;
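The two-step extraction just described (find the reference section, then pull out the numbers that look like years) might be sketched as follows. The names HEADING, YEAR and reference_years, and the sample text, are hypothetical. The sketch assumes that the section heading sits alone on its own line and that every year of interest has the form 19XX.

```python
import re

# A heading line consisting only of "References" or "Bibliography".
HEADING = re.compile(r"^[ \t]*(references|bibliography)[ \t]*$",
                     re.IGNORECASE | re.MULTILINE)
# Four-digit numbers of the form 19XX.
YEAR = re.compile(r"\b19\d{2}\b")

def reference_years(text):
    """Return all 19XX numbers found after the last reference-section heading."""
    headings = list(HEADING.finditer(text))
    if not headings:
        return []
    section = text[headings[-1].end():]
    return [int(match) for match in YEAR.findall(section)]

sample = """Introduction
This report was written in 1997.
References
[1] A. Author. A hypothetical paper. 1989.
[2] B. Writer. Another hypothetical paper. 1994.
"""
print(reference_years(sample))
```

A scan of this kind also counts page numbers or report numbers that happen to fall in the same range, so the extracted years should be treated as an approximation, not an exact list.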
850
851
852To illustrate this process, 188 technical reports were sampled from&lt;br&gt;
853
854
855Internet-accessible repositories&lt;sup&gt;1&lt;/sup&gt; and used as source documents for a synchronous obsolescence&lt;br&gt;
856
857
858study. Conveniently, the repositories chosen organize technical reports into sub-&lt;br&gt;
859
860
861directories by their date of publication. The reference dates for each technical report&lt;br&gt;
862
863
864were automatically extracted by software that scanned the document’s file for numbers&lt;br&gt;
865
866
867of the form 19XX, since previous studies indicate that few if any computing reports&lt;br&gt;
868
869
870reference documents published in previous centuries [5]. Table 1 presents the median&lt;br&gt;
871
872
873citation age calculated for these documents, broken down by repository and the year of&lt;br&gt;
874
875
876publication for the source documents from which the reference dates were extracted:&lt;br&gt;
877
878
879Table 1. Median citation ages for technical report repositories&lt;br&gt;
880
881
882The median citation age ranges between 2 and 4 years, which is consistent with&lt;br&gt;
883
884
885previous examinations of computing and information systems literature ([5], [4]).&lt;br&gt;
886
887
888When graphed, the distribution of reference dates shows the exponential curve typically&lt;br&gt;
889
890
891found in obsolescence studies, including the final droop due to an “immediacy effect”&lt;br&gt;
892
893
894&lt;hr&gt;
895
896
897&lt;A name=11&gt;&lt;/a&gt;as fewer very new documents are available for citation [7]. These types of results&lt;br&gt;
898
899
900provide confirmation that references used in computer science technical reports (the pre-eminent&lt;br&gt;
901
902
903“grey literature” of the computing field) conform to the same patterns as&lt;br&gt;
904
905
906references found in the formally published literature.&lt;br&gt;
907
908
909&lt;i&gt;co-citation and bibliographic coupling studies&lt;/i&gt;&lt;br&gt;
910
911
912The rate at which documents are cited together (co-citation) or cite the same&lt;br&gt;
913
914
915documents (bibliographic coupling) can be used to produce &amp;quot;maps&amp;quot; of a subject&lt;br&gt;
916
917
918literature. These techniques rely on analysis of the references of documents, and these&lt;br&gt;
919
920
921references must be in a common format. While digital libraries contain full text of&lt;br&gt;
922
923
924documents, their references are not standardized, and indeed are not even tagged as&lt;br&gt;
925
926
927such. To perform these studies the references must be manually extracted and&lt;br&gt;
928
929
930processed–a tedious process that is only worthwhile for documents (such as technical&lt;br&gt;
931
932
933reports) that are not included in existing citation databases such as the Science Citation&lt;br&gt;
934
935
936Index and Social Science Citation Index.&lt;br&gt;
937
938
939&lt;i&gt;detecting cycles or regularities in the rate of production of research&lt;/i&gt;&lt;br&gt;
940
941
942Analysis of trends in the production of technical reports can give indications&lt;br&gt;
943
944
945about working conditions that affect research; for example, is more research produced&lt;br&gt;
946
947
948over the summer, when the teaching load is lighter? Or is research steadily produced&lt;br&gt;
949
950
951throughout the year?&lt;br&gt;
952
953
954Figure 2. Distribution of the number of documents submitted to hep-th, 1992-1994&lt;br&gt;
Figures 2 and 3 present statistics on document accumulation in the hep-th (high&lt;br&gt;
energy physics theory) e-print server, a part of the PHYSICS E-PRINT ARCHIVE. This system&lt;br&gt;
is one of the oldest formal pre-print archives, and has become the primary means for&lt;br&gt;
information dissemination in its field. Examination of these figures reveals several&lt;br&gt;
trends. Clearly the absolute number of documents deposited in the repository has&lt;br&gt;
&lt;hr&gt;
&lt;A name=12&gt;&lt;/a&gt;tended to increase over the time period. For all three years, research production has its&lt;br&gt;
lowest point in January and February, increases through May and June, then decreases&lt;br&gt;
until August and September. At that point the rate of production steps up, reaching a&lt;br&gt;
yearly peak in November and December. This pattern is less clear for 1992, which&lt;br&gt;
might be expected as the archive was established in mid-1991.&lt;br&gt;
Figure 3. Distribution of the percentage of documents submitted to hep-th, 1992-1994&lt;br&gt;
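
The two distributions named in the captions of Figures 2 and 3 can be sketched as follows,&lt;br&gt;
assuming a list of submission dates is available: counts per month, and each month's share&lt;br&gt;
of its year's total. The function names and sample dates are hypothetical and are not the&lt;br&gt;
hep-th data.&lt;br&gt;

```python
from collections import Counter
from datetime import date


def monthly_counts(dates):
    """Count submissions per (year, month)."""
    return Counter((d.year, d.month) for d in dates)


def monthly_percentages(dates):
    """Express each month's count as a percentage of its year's total."""
    counts = monthly_counts(dates)
    yearly_totals = Counter()
    for (year, _month), n in counts.items():
        yearly_totals[year] += n
    return {
        (year, month): 100.0 * n / yearly_totals[year]
        for (year, month), n in counts.items()
    }


# Hypothetical submission dates, not the hep-th data.
sample = [date(1993, 1, 5), date(1993, 11, 2), date(1993, 11, 20), date(1993, 12, 1)]
print(monthly_counts(sample)[(1993, 11)])
print(monthly_percentages(sample)[(1993, 11)])
```

Normalizing by the yearly total is what allows years with different absolute volumes to be&lt;br&gt;
compared for seasonal shape.&lt;br&gt;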
&lt;b&gt;4. Analysis of usage data&lt;/b&gt;&lt;br&gt;
The emerging Internet-based digital libraries will permit research on scientific&lt;br&gt;
information collection and use at a much finer grain than is possible with current paper&lt;br&gt;
libraries or online bibliographic databases. Current bibliometric or scientometric&lt;br&gt;
research of this type must measure information use indirectly – for example, through&lt;br&gt;
examination of the list of references appended to published articles. However, it is well&lt;br&gt;
known that authors do not necessarily include in the reference list all documents that&lt;br&gt;
could have been cited, and conversely that not all references listed may have been&lt;br&gt;
actually “used” in performing the research; citation behavior can be affected by a&lt;br&gt;
number of motivating factors (Garfield lists &lt;i&gt;15&lt;/i&gt; possible reasons in [8]).&lt;br&gt;
Digital library transaction logs provide a powerful tool for direct analysis of&lt;br&gt;
document “usage”: since digital libraries contain the actual document (rather than only a&lt;br&gt;
document surrogate), the relative amount of “use” that a digital library’s clients make of&lt;br&gt;
a given document can be estimated from the number of times the document file is&lt;br&gt;
downloaded (and, presumably, the document is read). Note that file downloading is a&lt;br&gt;
much stronger statement on the part of the user than, for example, having a&lt;br&gt;
bibliographic record appear in the query result set for a conventional bibliographic&lt;br&gt;
system; the user downloads only &lt;i&gt;after&lt;/i&gt; the document has been found potentially relevant&lt;br&gt;
through examination of its document surrogate. Additionally, downloading is&lt;br&gt;
frequently time-consuming and sometimes costly (depending on local pricing for&lt;br&gt;
&lt;hr&gt;
&lt;A name=13&gt;&lt;/a&gt;Internet access). Downloaded documents are therefore highly likely at least to be&lt;br&gt;
scanned, if not read closely. The transaction logs for a digital library can provide a&lt;br&gt;
global picture of the use of documents in the collection, since all user interactions with&lt;br&gt;
the library can be automatically logged for analysis. By contrast, it is of course&lt;br&gt;
impossible to track usage of print bibliographies, and very difficult to monitor usage of&lt;br&gt;
bibliographic data available on CD-ROM across more than one or two sites.&lt;br&gt;
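
A rough sketch of the download tally described above, under a purely hypothetical log&lt;br&gt;
layout (tab-separated timestamp, client host, action, and document id). Real transaction&lt;br&gt;
logs differ between systems, so the field positions and the action name are assumptions.&lt;br&gt;

```python
from collections import Counter


def download_counts(log_lines):
    """Tally download events per document from tab-separated log lines.

    Assumed (hypothetical) line format: timestamp, client host, action,
    document id. Lines with another action or a different field count are
    skipped.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 4 and fields[2] == "download":
            counts[fields[3]] += 1
    return counts


sample_log = [
    "1996-03-01T10:02\thost-a.example.edu\tquery\tbibliometrics",
    "1996-03-01T10:05\thost-a.example.edu\tdownload\ttr-17",
    "1996-03-02T09:30\thost-b.example.com\tdownload\ttr-17",
    "1996-03-02T09:41\thost-b.example.com\tdownload\ttr-04",
]
for doc_id, n in download_counts(sample_log).most_common():
    print(doc_id, n)
```

The resulting counts estimate relative use only; as the text notes, a download shows that a&lt;br&gt;
document was judged potentially relevant, not that it was read closely.&lt;br&gt;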
Furthermore, analysis of search requests by geographic location, institution,&lt;br&gt;
and sometimes even individual user is also possible. As an example, Table 2 presents&lt;br&gt;
a portion of the summary of usage statistics (broken down by domain code) for queries&lt;br&gt;
to the computer science technical report collection of the NEW ZEALAND DIGITAL LIBRARY.&lt;br&gt;
Examination of the data indicates that the heaviest use of the collection comes from&lt;br&gt;
North America and Europe (particularly Germany and Finland), as well as the local New&lt;br&gt;
Zealand community and nearby Australia. As expected for such a collection, a large&lt;br&gt;
proportion of users are from educational (.edu) institutions; surprisingly, however, a&lt;br&gt;
similar number of queries come from commercial (.com) organizations, indicating&lt;br&gt;
perhaps that the documents are seeing use in commercial research and development&lt;br&gt;
units.&lt;br&gt;
Table 2. Accesses to the NEW ZEALAND DIGITAL LIBRARY CS collection by Domain Code&lt;br&gt;
Of course, usage levels can also be further broken down by IP number&lt;br&gt;
(indicating institutions), and systems requiring users to register may also be able to&lt;br&gt;
analyze usage on an individual basis. Since the query strings themselves are also&lt;br&gt;
recorded in the transaction logs, this domain/institution/individual activity could also be&lt;br&gt;
linked to specific subjects through the query terms. Summaries of this type could be&lt;br&gt;
invaluable for studies of geographic diffusion and distribution of research topics.&lt;br&gt;
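
The breakdown by domain code, and the linking of domains to subjects through query terms,&lt;br&gt;
might be sketched as below. It assumes log entries have already been parsed into pairs of&lt;br&gt;
host name and query string, and that hosts are recorded as names rather than numeric IP&lt;br&gt;
addresses; the record shape, function names, and sample data are hypothetical.&lt;br&gt;

```python
from collections import Counter, defaultdict


def domain_code(host):
    """Return the last dot-separated label of a host name, lower-cased."""
    return host.rsplit(".", 1)[-1].lower()


def queries_by_domain(records):
    """Count queries and query terms per domain code.

    `records` is an iterable of (host, query string) pairs, an assumed
    (hypothetical) shape for parsed transaction log entries.
    """
    query_totals = Counter()
    term_counts = defaultdict(Counter)
    for host, query in records:
        code = domain_code(host)
        query_totals[code] += 1
        term_counts[code].update(query.lower().split())
    return query_totals, term_counts


# Hypothetical log records, not the library's actual data.
records = [
    ("cs.example.edu", "neural networks"),
    ("lab.example.com", "compression"),
    ("uni.example.de", "neural compression"),
    ("dept.example.edu", "digital libraries"),
]
totals, terms = queries_by_domain(records)
print(totals["edu"], totals["com"], totals["de"])
print(terms["edu"]["neural"])
```

The same grouping key could be swapped for an institution or a registered user id where&lt;br&gt;
the log records them.&lt;br&gt;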
Transaction log analysis can also indicate time-related patterns in the&lt;br&gt;
information seeking behavior of digital library users. As a sample of this type of&lt;br&gt;
analysis, Paul Ginsparg notes a seven-day periodicity in the number of search requests&lt;br&gt;
&lt;hr&gt;
&lt;A name=14&gt;&lt;/a&gt;made to the PHYSICS E-PRINT ARCHIVE (Figure 4, reproduced from [9]). From this he&lt;br&gt;
infers that many physicists do not yet have weekend access to the Internet (an&lt;br&gt;
alternative, slightly more cynical hypothesis is that even high energy theoretical&lt;br&gt;
physicists take the weekend off).&lt;br&gt;
Figure 4. Summary of search requests to the physics pre-print archives&lt;br&gt;
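
One way such a weekly cycle could be detected from daily request counts is to look for the&lt;br&gt;
lag at which the series correlates most strongly with itself. This is an illustrative&lt;br&gt;
technique, not necessarily the one used in [9]; the function names and the synthetic counts&lt;br&gt;
are hypothetical.&lt;br&gt;

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a non-constant `series` at a positive `lag`."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


def dominant_period(series, max_lag):
    """Return the lag in 1..max_lag with the highest autocorrelation."""
    return max(range(1, max_lag + 1), key=lambda lag: autocorrelation(series, lag))


# Hypothetical daily request counts: five busy weekdays, two quiet weekend days.
week = [120, 130, 125, 128, 118, 40, 35]
daily_requests = week * 8
print(dominant_period(daily_requests, 10))
```

On this synthetic eight-week series the strongest lag is 7. Real logs are noisier, so the&lt;br&gt;
peak would be less sharp and worth checking against a plot of the raw counts.&lt;br&gt;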
&lt;b&gt;5. Conclusion&lt;/b&gt;&lt;br&gt;
This study suggests opportunities for conducting bibliometric research on the&lt;br&gt;
evolving digital libraries. These repositories are suitable platforms for conventional&lt;br&gt;
bibliometric techniques (such as obsolescence studies, quantification of physical&lt;br&gt;
characteristics of documents comprising a subject literature, time analysis, etc.). The&lt;br&gt;
ability to directly monitor access to documents in digital libraries also enables&lt;br&gt;
researchers to explicitly quantify document usage, as well as to implicitly measure&lt;br&gt;
usage through citations. Additional facilities could aid in the performance of&lt;br&gt;
bibliographic experiments, such as: improved tagging of document fields; provision of&lt;br&gt;
utilities to strip out titles, authors, etc. from common document formats; and the ability&lt;br&gt;
to easily eliminate duplicate entries from downloaded library subsets. Unfortunately,&lt;br&gt;
the most useful of these additional facilities – those associated with a higher degree of&lt;br&gt;
cataloging – run counter to the underlying philosophy of many digital libraries: to&lt;br&gt;
avoid, if possible, manual processing and formal cataloging of documents. While&lt;br&gt;
adherence to this principle can limit the accuracy of fielded searching (or indeed,&lt;br&gt;
preclude it altogether), it can also avoid the cataloging bottleneck and permit digital&lt;br&gt;
libraries to provide access to larger numbers of documents.&lt;br&gt;
The digital libraries complement the information currently available through&lt;br&gt;
paper, online, and CD-ROM bibliographic resources. While these latter databases&lt;br&gt;
generally have the advantage of standardized formatting of bibliographic fields, the&lt;br&gt;
digital libraries are freely accessible, often contain &amp;quot;grey literature&amp;quot; that is otherwise&lt;br&gt;
&lt;hr&gt;
&lt;A name=15&gt;&lt;/a&gt;unavailable for analysis, and generally make the full text of documents available. The&lt;br&gt;
insights gained from analysis of digital libraries will add to the store of &amp;quot;information&lt;br&gt;
about information&amp;quot; that we have gained from older types of bibliographic repositories.&lt;br&gt;
&lt;b&gt;References&lt;/b&gt;&lt;br&gt;
[1] Bollacker, K.D., S. Lawrence, and C.L. Giles, CiteSeer: An Autonomous Web&lt;br&gt;
Agent for Automatic Retrieval and Identification of Interesting Publications,&lt;br&gt;
&lt;i&gt;Proceedings of the Second International Conference on Autonomous Agents&lt;/i&gt;&lt;br&gt;
(Minneapolis/St. Paul, May 9-13), 1998.&lt;br&gt;
[2] Bowman, C.M., P.B. Danzig, U. Manber, and M.F. Schwartz, Scalable Internet&lt;br&gt;
resource discovery: Research problems and approaches, &lt;i&gt;Communications of&lt;/i&gt;&lt;br&gt;
&lt;i&gt;the ACM 37(8)&lt;/i&gt; (1994) 98-107.&lt;br&gt;
[3] Burton, Hilary D., Use of a virtual information system for bibliometric analysis,&lt;br&gt;
&lt;i&gt;Information Processing &amp;amp; Management 24(1)&lt;/i&gt; (1988) 39-44.&lt;br&gt;
[4] Cunningham, S.J., An empirical investigation of the obsolescence rate for&lt;br&gt;
information systems literature, &lt;i&gt;Library and Information Science&lt;/i&gt;&lt;br&gt;
&lt;i&gt;Research&lt;/i&gt;, 1996, http://library.fgcu.edu/iclc/lisrissu.htm&lt;br&gt;
[5] Cunningham, S.J., and D. Bocock, Obsolescence of computing literature,&lt;br&gt;
&lt;i&gt;Scientometrics 34(2)&lt;/i&gt; (1995) 255-262.&lt;br&gt;
[6] Cunningham, S.J. and Lynn Silipigni Connaway, Information searching&lt;br&gt;
preferences and practices of computer science researchers, &lt;i&gt;Proceedings of&lt;/i&gt;&lt;br&gt;
&lt;i&gt;OZCHI '96&lt;/i&gt; (1996) 294-299.&lt;br&gt;
[7] de Solla Price, D.J., Citation measures of hard science, soft science, technology,&lt;br&gt;
and nonscience. In: C.E. Nelson and D.K. Pollock (eds), &lt;i&gt;Communication&lt;/i&gt;&lt;br&gt;
&lt;i&gt;among scientists and engineers&lt;/i&gt; (Heath Lexington, 1970).&lt;br&gt;
[8] Garfield, E., &lt;i&gt;Citation Indexing: Its theory and application in Science, Technology&lt;/i&gt;&lt;br&gt;
&lt;i&gt;and Humanities&lt;/i&gt; (Wiley, 1979).&lt;br&gt;
&lt;hr&gt;
&lt;A name=16&gt;&lt;/a&gt;[9] Ginsparg, P., After dinner remarks: 14 Oct ‘94 APS meeting at LANL, 1994&lt;br&gt;
(&amp;lt;URL: http://xxx.lanl.gov/blurb&amp;gt;).&lt;br&gt;
[10] Ginsparg, P., First steps towards electronic research communication, &lt;i&gt;Computers&lt;/i&gt;&lt;br&gt;
&lt;i&gt;in Physics 8(4)&lt;/i&gt; (1994) 390-401.&lt;br&gt;
[11] Hallmark, J., Scientists' access and retrieval of references cited in their recent&lt;br&gt;
journal articles, &lt;i&gt;College and Research Libraries 55(3)&lt;/i&gt; (1994) 199-210.&lt;br&gt;
[12] Hawkins, D.T., Unconventional uses of on-line information retrieval systems:&lt;br&gt;
on-line bibliometric studies, &lt;i&gt;Journal of the American Society for Information&lt;/i&gt;&lt;br&gt;
&lt;i&gt;Science 28&lt;/i&gt; (1977) 13-18.&lt;br&gt;
[13] McGhee, P.E., P.R. Skinner, K. Roberto, N.J. Ridenour, and S.M. Larson,&lt;br&gt;
Using online databases to study current research trends: an online bibliometric&lt;br&gt;
study, &lt;i&gt;Library and Information Science Research 9&lt;/i&gt; (1987) 285-291.&lt;br&gt;
[14] Maly, K., E.A. Fox, J.C. French, and A.L. Selman, Wide area technical report&lt;br&gt;
server (&lt;i&gt;Technical Report&lt;/i&gt;, Dept. of Computer Science, Old Dominion&lt;br&gt;
University, 1994. Also available at&lt;br&gt;
&amp;lt;URL: http://www.cs.odu.edu/WATERS/WATERS-paper.ps&amp;gt;).&lt;br&gt;
[15] Sigogneau, M.J., S. Bain, J.P. Courtial, and H. Feillet, Scientific innovation in&lt;br&gt;
bibliographical databases: a comparative study of the Science Citation Index&lt;br&gt;
and the Pascal database, &lt;i&gt;Scientometrics 22(1)&lt;/i&gt; (1991) 65-82.&lt;br&gt;
[16] Witten, I.H., S.J. Cunningham, M. Vallabh, and T.C. Bell, A New Zealand&lt;br&gt;
digital library for computer science research, &lt;i&gt;Proceedings of Digital Libraries&lt;/i&gt;&lt;br&gt;
&lt;i&gt;'95&lt;/i&gt; (1995) 25-30.&lt;br&gt;
[17] Witten, I.H., C. Nevill-Manning, and S.J. Cunningham, A public library based&lt;br&gt;
on full-text retrieval, &lt;i&gt;Communications of the ACM 41(4)&lt;/i&gt; (1998) 71.&lt;br&gt;
&lt;hr&gt;
&lt;A name=17&gt;&lt;/a&gt; &lt;br&gt;
1 Documents were randomly sampled from the DEC&lt;br&gt;
(ftp://crl.dec.com/pub/DEC/CRL/tech-reports/), Sony&lt;br&gt;
(ftp://ftp.csl.sony.co.jp/CSL/CSL-Papers), and Ohio&lt;br&gt;
(ftp://archive.cis.ohio-state.edu/pub/tech-report/) technical report repositories.&lt;br&gt;
&lt;hr&gt;
</Content>
</Section>
</Archive>