source: trunk/gsdl-documentation/manuals/xml-source/en/Paper_en.xml@ 14099

Last change on this file since 14099 was 14099, checked in by lh92, 17 years ago
Added the copyright information

1	<?xml version="1.0" encoding="UTF-8"?>
2	<Manual id="Paper" lang="en">
3	<Heading>
4	<Text id="1">GREENSTONE DIGITAL LIBRARY</Text>
5	</Heading>
6	<Title>
7	<Text id="2">FROM PAPER TO COLLECTION</Text>
8	</Title>
9	<Author>
10	<Text id="3">Dr Michel Loots, Dan Camarzan and Ian H. Witten</Text>
11	</Author>
12	<Affiliation>
13	<Text id="4">Human Info NGO, Belgium <br/>Simple Words, Romania <br/>University of Waikato, New Zealand</Text>
14	</Affiliation>
15	<SupplementaryText>
16	<Text id="manual_index">Back to manual index</Text>
17	<Text id="top_index">Back to top index</Text>
18	</SupplementaryText>
19	<Text id="5">Greenstone is a suite of software for building and distributing digital library collections. It provides a new way of organizing information and publishing it on the Internet or on CD-ROM. Greenstone is produced by the New Zealand Digital Library Project at the University of Waikato, and developed and distributed in cooperation with UNESCO and the Human Info NGO. It is open-source software, available from <i>http://greenstone.org</i> under the terms of the GNU General Public License.</Text>
20	<Comment>
21	<Text id="6">We want to ensure that this software works well for you. Please report any problems to <i>[email protected]</i></Text>
22	</Comment>
23	<Version>
24	<Text id="7">Greenstone gsdl-2.50</Text>
25	</Version>
26	<Date>
27	<Text id="8">March 2004</Text>
28	</Date>
29	<Section id="about_this_manual">
30	<Title>
31	<Text id="9">About this manual</Text>
32	</Title>
33	<Content>
34	<Text id="10">This document explains how to create CD-ROM collections from paper documents. It describes in full detail the procedures and economics involved in the scanning and optical character recognition (OCR) processes, so that you end up with text in the right format to apply the Greenstone software. It also describes how to create and edit the material associated with a collection.</Text>
35	<Text id="11">We have tried to be as plain as possible in our explanation. Reference to any trade mark or company product is purely for illustrative purposes, and does not imply that we endorse or favor this product over any other.</Text>
36	</Content>
37	</Section>
38	<Section id="companion_documents">
39	<Title>
40	<Text id="12">Companion documents</Text>
41	</Title>
42	<Content>
43	<Text id="13">The complete set of Greenstone documents includes five volumes:</Text>
44	<BulletList>
45	<Bullet>
46	<Text id="14">Greenstone Digital Library Installer's Guide</Text>
47	</Bullet>
48	<Bullet>
49	<Text id="15">Greenstone Digital Library User's Guide</Text>
50	</Bullet>
51	<Bullet>
52	<Text id="16">Greenstone Digital Library Developer's Guide</Text>
53	</Bullet>
54	<Bullet>
55	<Text id="17">Greenstone Digital Library: From Paper to Collection <i>(this document)</i></Text>
56	</Bullet>
57	<Bullet>
58	<Text id="18">Greenstone Digital Library: Using the Organizer</Text>
59	</Bullet>
60	</BulletList>
61	</Content>
62	</Section>
63	<Section id="copyright">
64	<Title>
65	<Text id="copyright-title">Copyright</Text>
66	</Title>
67	<Content>
68	<Text id="right-text-1">Copyright 2002 2003 2004 2005 2006 2007 by the <Link url="http://www.nzdl.org">New Zealand Digital Library Project</Link> at <Link url="http://www.waikato.ac.nz">the University of Waikato</Link>, New Zealand.</Text>
69	<Text id="right-text-2">Permission is granted to copy, distribute and/or modify this document under the terms of the <Link url="http://www.gnu.org/licenses/fdl.html">GNU Free Documentation License</Link>, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled <Link url="http://greenstonewiki.cs.waikato.ac.nz/wiki/gsdoc/GNUFDL.html">“GNU Free Documentation License.”</Link></Text>
70	</Content>
71	</Section>
72	<Section id="acknowledgements">
73	<Title>
74	<Text id="19">Acknowledgements</Text>
75	</Title>
76	<Content>
77	<Text id="20">The scanning operation and other know-how relating to the creation of collaborative non-profit collections have been developed by Dr Michel Loots, MD, of Human Info NGO and HumanityCD, Dan Camarzan of Simple Words, and their team of collaborators in Brasov, Romania.</Text>
78	<Text id="21">The Greenstone software is a collaborative effort between many people. Rodger McNab and Stefan Boddie are the principal architects and implementors. Contributions have been made by David Bainbridge, George Buchanan, Hong Chen, Michael Dewsnip, Katherine Don, Elke Duncker, Carl Gutwin, Geoff Holmes, Dana McKay, John McPherson, Craig Nevill-Manning, Dynal Patel, Gordon Paynter, Bernhard Pfahringer, Todd Reed, Bill Rogers, John Thompson, and Stuart Yeates. Other members of the New Zealand Digital Library project provided advice and inspiration in the design of the system: Mark Apperley, Sally Jo Cunningham, Matt Jones, Steve Jones, Te Taka Keegan, Michel Loots, Malika Mahoui, Gary Marsden, Dave Nichols and Lloyd Smith. We would also like to acknowledge all those who have contributed to the GNU-licensed packages included in this distribution: MG, GDBM, PDFTOHTML, PERL, WGET, WVWARE and XLHTML.</Text>
79	</Content>
80	</Section>
81	<Chapter id="introduction">
82	<Title>
83	<Text id="22">Introduction</Text>
84	</Title>
85	<Content>
86	<Text id="23">One goal of the Greenstone Digital Library software is to empower organizations such as universities, United Nations agencies, non-governmental organizations, non-profit organizations and governments to create varied collections of information that can be delivered online or on CD-ROM.</Text>
87	<Text id="24">Typical steps that have to be implemented are:</Text>
88	<NumberedList>
89	<NumberedItem>
90	<Text id="25">Selecting the documents to be included</Text>
91	</NumberedItem>
92	<NumberedItem>
93	<Text id="26">Securing copyright permission to use these documents in the digital library</Text>
94	</NumberedItem>
95	<NumberedItem>
96	<Text id="27">Scanning and applying OCR to hard-copy documents that are not available in digital form, to obtain an accurate digital version</Text>
97	</NumberedItem>
98	<NumberedItem>
99	<Text id="28">Converting all documents to a format (integrating text and images) that can be imported into Greenstone: preferably
100	HTML or Microsoft Word, although other formats are also covered, at varying levels of precision, by “plugins” (see the Greenstone User’s Guide)</Text>
101	</NumberedItem>
102	<NumberedItem>
103	<Text id="29">Tagging the chapters, paragraphs and images of the digital documents</Text>
104	</NumberedItem>
105	<NumberedItem>
106	<Text id="30">Organizing the collection into an optimally structured digital library</Text>
107	</NumberedItem>
108	<NumberedItem>
109	<Text id="31">Building the digital library using the Greenstone software</Text>
110	</NumberedItem>
111	<NumberedItem>
112	<Text id="32">Printing and distributing the collection on CD-ROM and/or distributing it over the Internet</Text>
113	</NumberedItem>
114	</NumberedList>
115	<Text id="33">In order to create a digital collection, the publications must be available in digital format. If books, newsletters or other documents are only available on paper, they will need to be scanned and processed into machine-readable form (step iii). Usually this is done using optical character recognition (OCR), but sometimes by manual retyping. This process is covered in Chapters 2-4 of this manual.</Text>
116	<Text id="34">Step v. enables the different parts of a document to be independently selected and displayed by readers in the final library, while step vi. involves assigning attributes to the documents such as subject categories, keywords and bibliographic data for ordering and searching the library. These steps are covered in Chapter 5 of this manual.</Text>
117	<Text id="35">This manual introduces many issues that affect the editorial process of creating a collection from paper. Before reading on, you should consider these questions:</Text>
118	<BulletList>
119	<Bullet>
120	<Text id="36">What is the goal of your collection?</Text>
121	</Bullet>
122	<Bullet>
123	<Text id="37">What is your target group?</Text>
124	</Bullet>
125	<Bullet>
126	<Text id="38">How big is it: local, regional, or global?</Text>
127	</Bullet>
128	<Bullet>
129	<Text id="39">How many documents are you making available?</Text>
130	</Bullet>
131	<Bullet>
132	<Text id="40">How many pages?</Text>
133	</Bullet>
134	<Bullet>
135	<Text id="41">How much graphics content?</Text>
136	</Bullet>
137	<Bullet>
138	<Text id="42">Does the material split into parts that will be consulted by a limited audience and parts that need to be disseminated widely?</Text>
139	</Bullet>
140	<Bullet>
141	<Text id="43">Are the documents already available electronically?</Text>
142	</Bullet>
143	<Bullet>
144	<Text id="44">If so, in which formats? (Note incidentally that PDF files are not automatically equivalent to digital full-text form, as they often contain only page images.)</Text>
145	</Bullet>
146	<Bullet>
147	<Text id="45">What is the copyright status of the documents?</Text>
148	</Bullet>
149	<Bullet>
150	<Text id="46">Who owns the copyright?</Text>
151	</Bullet>
152	<Bullet>
153	<Text id="47">Are there other organizations with the same target audience?</Text>
154	</Bullet>
155	<Bullet>
156	<Text id="48">Are you willing to collaborate with other groups?</Text>
157	</Bullet>
158	<Bullet>
159	<Text id="49">What budget is available for the whole project?</Text>
160	</Bullet>
161	<Bullet>
162	<Text id="50">What human resources are available (in person-months) for co-ordination, editing, scanning and programming?</Text>
163	</Bullet>
164	<Bullet>
165	<Text id="51">How many computers are available for this project?</Text>
166	</Bullet>
167	<Bullet>
168	<Text id="52">How many CD-ROMs do you want to distribute?</Text>
169	</Bullet>
170	<Bullet>
171	<Text id="53">Will they be free, or for sale?</Text>
172	</Bullet>
173	</BulletList>
174	</Content>
175	</Chapter>
176	<Chapter id="scanners_and_scanning">
177	<Title>
178	<Text id="54">Scanners and scanning</Text>
179	</Title>
180	<Content>
181	<Text id="55">The first step in converting paper documents into a digital library collection is to obtain images of all pages of all publications in digital format. The next stage is optical character recognition (OCR), and clean, high-quality images are essential for successful OCR. The digitization process requires a scanner capable of working at a resolution of 300 dpi (dots per inch). Most scanning can be done in black-and-white, but if color illustrations are included they must be scanned with a color scanner. In most cases the covers of the book contain colors and will have to be scanned as a color photographic image.</Text>
182	<Section id="scanners">
183	<Title>
184	<Text id="56">Scanners</Text>
185	</Title>
186	<Content>
187	<Text id="57">Scanners are available in all price ranges, and in all shapes and sizes. Prices range from $100 for flat-bed scanners to upwards of $50,000 for large industrial scanners from manufacturers such as Bell &amp; Howell.<FootnoteRef id="1"/> Many websites offer a wide range of scanners for sale. To locate them, just search for “scanners” in search engines like Google, Altavista, or Yahoo.</Text>
188	<Text id="58">The output of a scanned page is a computer file, usually stored in TIFF or Bitmap format. TIFF with Group IV compression (often written TIFF IV) is the best format to use. An average page scanned and converted to this format occupies only 50 KB, compared to perhaps 2 MB for the equivalent page in uncompressed Bitmap form.</Text>
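<Text id="example_storage_estimate">To give a feel for what these per-page sizes imply, the following Python sketch multiplies them out for a whole collection. It is only an illustration: the 10,000-page collection and the function name <i>storage_mb</i> are assumptions made for this example, and 1 MB is taken as 1000 KB.</Text>

```python
# Rough disk-space estimate from the per-page sizes quoted in the text.
KB_PER_PAGE_TIFF = 50       # compressed TIFF page, figure from the text
KB_PER_PAGE_BITMAP = 2000   # uncompressed Bitmap page, about 2 MB in the text


def storage_mb(pages, kb_per_page):
    """Return the approximate storage needed in megabytes (1 MB = 1000 KB)."""
    return pages * kb_per_page / 1000


if __name__ == "__main__":
    pages = 10000  # hypothetical collection size, not a figure from the manual
    print("Compressed TIFF:", storage_mb(pages, KB_PER_PAGE_TIFF), "MB")
    print("Uncompressed Bitmap:", storage_mb(pages, KB_PER_PAGE_BITMAP), "MB")
```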
189	<Subsection id="low-cost_flat-bed_scanner">
190	<Title>
191	<Text id="59">Low-cost flat-bed scanner</Text>
192	</Title>
193	<Content>
194	<Text id="60">Low-cost flat-bed units are the cheapest and most widely available type of scanner. There are many brands: HP, Agfa, Acer, etc. Prices range from $100 to $300. Both black-and-white and color images can be scanned. The low price allows each computer to have its own scanner.</Text>
195	<Text id="61">Disadvantages of these scanners include the medium quality of the result, the slow rate of scanning, unreliability in warm environments, and relatively frequent breakdown. Pages must be scanned manually, one by one. Each page must be positioned carefully on the scanning plate to ensure that it is aligned correctly. Productivity of these scanners is low. Despite manufacturers' claims that each page can be scanned in less than a minute, the fact is that rates exceeding twelve pages per hour are rarely achieved. The scanning process monopolizes the computer on which the work is being performed.</Text>
196	<Text id="62">Consequently these scanners are useful only for small jobs with limited numbers of pages: no more than 200 to 400 pages a month on a regular basis, or one-time jobs of up to 1000 or 2000 pages.</Text>
197	</Content>
198	</Subsection>
199	<Subsection id="low-end_scanner_with_sheet_feeder">
200	<Title>
201	<Text id="63">Low-end scanner with sheet feeder</Text>
202	</Title>
203	<Content>
204	<Text id="64">Low-end scanners with sheet feeders typically cost between $500 and $1200. Ten to fifty pages can be inserted, scanned and processed at once, so the operator does not have to attend constantly to the machine. This increases capacity to between 150 and 200 pages per day. These scanners are more robust, and have a longer lifespan before repair, usually in the range of 30,000 to 50,000 pages.</Text>
205	<Text id="65">A disadvantage is that only one side of the page is scanned at a time: the stack of pages must be reversed and rescanned in order to obtain an image of both sides. This often causes trouble, because sheet feeders are never completely reliable and pages sometimes jam.</Text>
206	<Text id="66">These scanners are useful for up to 1500 to 3000 pages a month.</Text>
207	</Content>
208	</Subsection>
209	<Subsection id="color_scanners">
210	<Title>
211	<Text id="67">Color scanners</Text>
212	</Title>
213	<Content>
214	<Text id="68">Any scanning operation invariably involves some color images, so a color scanner will always be required. Generally speaking, less than 5% of any publication contains color images, plus the cover. Thus a low-cost flat-bed scanner as described above suffices. It is advisable to select one capable of scanning at up to 600 dpi.</Text>
215	</Content>
216	</Subsection>
217	<Subsection id="professional_duplex_scanners">
218	<Title>
219	<Text id="69">Professional duplex scanners</Text>
220	</Title>
221	<Content>
222	<Text id="70">Professional scanners are reliable, heavy-duty machines capable of processing a large volume of pages, typically from 2000 to 10,000 pages per day. They have an automatic sheet-feeder tray system that processes batches of about 50 to 200 pages. The best and fastest are duplex machines that scan both sides of the page at once.</Text>
223	<Text id="71">Professional duplex scanners require a powerful computer with a hard disk of at least 10 to 20 GB. Prices range from $5000 to $50,000. For example, the Canon DR-6020 duplex scanner costs $5000 and works with double-sided documents. It has a capacity of about 2000 pages per day and a lifespan of 600,000 to 800,000 pages. Bell &amp; Howell and Fujitsu scanners range from $10,000 to $50,000 and have a lifespan of many millions of pages.</Text>
224	<Text id="72">Microfiche scanners cost from $15,000 for a semi-manual unit to $80,000 for one that operates fully automatically.</Text>
225	</Content>
226	</Subsection>
227	<Subsection id="scanning_programs">
228	<Title>
229	<Text id="73">Scanning programs</Text>
230	</Title>
231	<Content>
232	<Text id="74">Every scanner comes with its own software, which means that the program must be installed on the computer that manages the scanner. Some have a computer card that needs to be installed in your computer to speed up the scanning operation.</Text>
233	</Content>
234	</Subsection>
235	</Content>
236	</Section>
237	<Section id="preparing_the_documents">
238	<Title>
239	<Text id="75">Preparing the documents</Text>
240	</Title>
241	<Content>
242	<Text id="76">Before being scanned, documents must be properly prepared. Dusty documents must be cleaned, humid documents dried, clips removed, pages unfolded.</Text>
243	<Text id="77">The spine of each book should be removed by cutting it off, straight and precisely. Books provided by libraries must often be rebound, and if so you should be particularly careful when removing spines in order to facilitate smooth rebinding.</Text>
244	<Text id="78">If there are just a few documents, cutting can be done manually with a ruler and cutters. Be careful with your hands! For more documents, special manual cutting machines are available.</Text>
245	<Text id="79">For high volumes (more than 20 documents) we recommend asking a printer or copy-shop if you can use their professional cutting machine. Do not forget to remove metal clips, which could damage the cutting blades.</Text>
246	</Content>
247	</Section>
248	<Section id="the_scanning_process">
249	<Title>
250	<Text id="80">The scanning process</Text>
251	</Title>
252	<Content>
253	<Text id="81">Using software provided with the scanner, a digital image of each paper page is scanned and transformed into a Bitmap or TIFF image. These images should be stored on hard disk with standard filenames. The OCR process starts once some or all of a batch of documents have been scanned. It can be undertaken by the person who operates the scanner, or by someone else.</Text>
254	<Text id="82">Typically a scanning resolution of 300 dpi is needed, although sometimes 200 dpi is acceptable.</Text>
255	<Subsection id="quality_control">
256	<Title>
257	<Text id="83">Quality control</Text>
258	</Title>
259	<Content>
260	<Text id="84">The final goal of scanning is either to OCR the pages to obtain perfect word processor or HTML versions of the publications, or to produce enhanced image files such as PDF image files. In either case the quality of the image is very important. If quality is sub-standard, image files will not look good and will consume more memory. Image quality seriously affects the OCR process: with sub-standard quality, productivity deteriorates by up to 40%. OCR typically represents more than 90% of the total cost, so scanning quality can have a very substantial effect on the final cost.</Text>
261	<Text id="85">The quality of the TIFF file can be enhanced by adjusting the scanning process to each type of paper, using settings provided by the scanner software. Relatively transparent kinds of paper require a lighter setting; the contrast must be adjusted depending on the quality of printing, and so on.</Text>
262	<Text id="86">First divide the material into batches with similar paper and print qualities. Perform OCR tests on a sample from the first batch to determine the optimal settings. Then scan all material in this batch before proceeding to the next one.</Text>
263	</Content>
264	</Subsection>
265	<Subsection id="filename_conventions">
266	<Title>
267	<Text id="87">Filename conventions</Text>
268	</Title>
269	<Content>
270	<Text id="88">Give each book or document a job number or unique code, which will become the name of the folder that contains all TIFF images of the document. Depending on the computer system (DOS, Windows, UNIX, Linux, etc.), a filename can contain from 8 to 128 characters. We recommend restricting this unique document identifier to 8 to 16 characters. The first five characters might identify the document, the following letter might contain a language code, and the remaining characters might identify the particular page. For example, the identifier <i>u7548e12.tif</i> might identify the TIFF image of page 12 of a book written in English with code <i>u7548e</i>.</Text>
271	<Text id="89">Allocate one directory on the hard disk for scanning jobs, say <i>scanjobs</i>. Then make a subdirectory for each job. Within this make a subdirectory for each publication, say <i>u7548e</i> for the above document. Store all the TIFF images of the publication, including color images, in this folder.</Text>
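<Text id="example_filename_convention">The naming scheme described above could be generated by a short script such as the following Python sketch. The function names and the job name <i>job001</i> are hypothetical and are used only for illustration; paths are joined with forward slashes so that the result looks the same on every system.</Text>

```python
import posixpath


def page_filename(doc_code, language, page):
    """Build a name such as u7548e12.tif from a five-character document code,
    a one-letter language code and a page number."""
    return "{}{}{}.tif".format(doc_code, language, page)


def page_path(job, doc_code, language, page, root="scanjobs"):
    """Return the full path: scanning root, job folder, publication folder, image file."""
    publication = doc_code + language  # folder name, for example u7548e
    return posixpath.join(root, job, publication,
                          page_filename(doc_code, language, page))


if __name__ == "__main__":
    # "job001" is an invented job name used only for this example.
    print(page_path("job001", "u7548", "e", 12))
```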
272	</Content>
273	</Subsection>
274	</Content>
275	</Section>
276	<Section id="productivity_and_resources">
277	<Title>
278	<Text id="90">Productivity and resources</Text>
279	</Title>
280	<Content>
281	<Text id="91">You should not underestimate the magnitude of the scanning operation, and particularly of the OCR process that follows. It is best to consider scanning and OCR as completely separate activities. The optimal choice from an economic and practical point of view should be made individually for each one.</Text>
282	<Text id="92">Some points to consider are the investment in scanners and computers that is necessary; the availability of appropriate space and human resources; training the workforce; salary costs; the initial and total number of pages to be scanned; deadlines; and whether documents can be outsourced to third parties.</Text>
283	<Subsection id="scanning_costs">
284	<Title>
285	<Text id="93">Scanning costs</Text>
286	</Title>
287	<Content>
288	<Text id="94">An important decision is whether to invest in scanning equipment and perform all scanning oneself, or outsource it to a scanning company. The main considerations are:</Text>
289	<BulletList>
290	<Bullet>
291	<Text id="95">pressure of time for the scanning job;</Text>
292	</Bullet>
293	<Bullet>
294	<Text id="96">total number of pages;</Text>
295	</Bullet>
296	<Bullet>
297	<Text id="97">salary costs of those who perform the scanning.</Text>
298	</Bullet>
299	</BulletList>
300	<Text id="98">The people who perform the scanning must be highly motivated, technically skilled, and quality-oriented.</Text>
301	<Text id="99">The typical cost of scanning by a professional company is $0.06 per page. To this must be added the cost of shipment, which can be up to $0.03 per page for transport from developing countries to developed countries, and $0.015 per page for transport within countries.</Text>
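<Text id="example_outsourcing_cost">As a rough illustration, the outsourcing figures quoted above can be combined in a few lines of Python. The function name and the 10,000-page job are assumptions made for this example; the per-page rates are those given in the text, with the international shipping figure being the stated upper bound.</Text>

```python
SCANNING_RATE = 0.06            # typical professional scanning charge, dollars per page
SHIPPING_INTERNATIONAL = 0.03   # upper bound, developing to developed countries
SHIPPING_DOMESTIC = 0.015       # transport within countries


def outsourced_cost(pages, shipping_per_page):
    """Return the total cost in dollars of outsourcing a job, including shipment."""
    return pages * (SCANNING_RATE + shipping_per_page)


if __name__ == "__main__":
    pages = 10000  # hypothetical job size
    print("International:", round(outsourced_cost(pages, SHIPPING_INTERNATIONAL), 2))
    print("Domestic:", round(outsourced_cost(pages, SHIPPING_DOMESTIC), 2))
```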
302	<Text id="100">Table <CrossRef target="Table" ref="table_scanning_cost"/> estimates the cost of doing it yourself, using various scanner types. Note that all figures are approximate. They are provided as rough guidelines based on the authors' experience. The first three columns concern labor costs. The first is the capacity in pages/month, assuming full-time work. The second is the resources required in person-hours per page, obtained by dividing the number of working hours per month (assumed to be 180) by the pages/month capacity in the first column.</Text>
303	<Table id="table_scanning_cost">
304	<Title>
305	<Text id="101">Scanning cost</Text>
306	</Title>
307	<TableContent>
308	<tr>
309	<th width="90"/>
310	<th width="71">
311	<Text id="102">Capacity (pages/month)</Text>
312	</th>
313	<th width="75">
314	<Text id="103">Hours/page (180-hour month)</Text>
315	</th>
316	<th width="83">
317	<Text id="104">Cost/page (assuming $4/hour)</Text>
318	</th>
319	<th width="60">
320	<Text id="105">Scanner acquisition</Text>
321	</th>
322	<th width="66">
323	<Text id="106">Scanner lifespan (pages)</Text>
324	</th>
325	<th width="85">
326	<Text id="107">Outsourced pages for scanner cost<br/>(at $0.06 each)</Text>
327	</th>
328	</tr>
329	<tr>
330	<th width="90">
331	<Text id="108">Flat bed scanner</Text>
332	</th>
333	<th width="71">
334	<Text id="109">2,500</Text>
335	</th>
336	<th width="75">
337	<Text id="110">0.072</Text>
338	</th>
339	<th width="83">
340	<Text id="111">$0.288</Text>
341	</th>
342	<th width="60">
343	<Text id="112">$300</Text>
344	</th>
345	<th width="66">
346	<Text id="113">7,000</Text>
347	</th>
348	<th width="85">
349	<Text id="114">5,000</Text>
350	</th>
351	</tr>
352	<tr>
353	<th width="90">
354	<Text id="115">Scanner with sheet-feeder</Text>
355	</th>
356	<th width="71">
357	<Text id="116">8,000</Text>
358	</th>
359	<th width="75">
360	<Text id="117">0.0225</Text>
361	</th>
362	<th width="83">
363	<Text id="118">$0.09</Text>
364	</th>
365	<th width="60">
366	<Text id="119">$800</Text>
367	</th>
368	<th width="66">
369	<Text id="120">30,000</Text>
370	</th>
371	<th width="85">
372	<Text id="121">13,000</Text>
373	</th>
374	</tr>
375	<tr>
376	<th width="90">
377	<Text id="122">Professional: low-end duplex</Text>
378	</th>
379	<th width="71">
380	<Text id="123">40,000</Text>
381	</th>
382	<th width="75">
383	<Text id="124">0.0045</Text>
384	</th>
385	<th width="83">
386	<Text id="125">$0.018</Text>
387	</th>
388	<th width="60">
389	<Text id="126">$6,000</Text>
390	</th>
391	<th width="66">
392	<Text id="127">600,000</Text>
393	</th>
394	<th width="85">
395	<Text id="128">100,000</Text>
396	</th>
397	</tr>
398	<tr>
399	<th width="90">
400	<Text id="129">Professional: high-end duplex</Text>
401	</th>
402	<th width="71">
403	<Text id="130">150,000</Text>
404	</th>
405	<th width="75">
406	<Text id="131">0.0012</Text>
407	</th>
408	<th width="83">
409	<Text id="132">$0.0048</Text>
410	</th>
411	<th width="60">
412	<Text id="133">$50,000</Text>
413	</th>
414	<th width="66">
415	<Text id="134">8,000,000</Text>
416	</th>
417	<th width="85">
418	<Text id="135">833,000</Text>
419	</th>
420	</tr>
421	</TableContent>
422	</Table>
423	<Text id="136">To determine the price per page, multiply the total hourly salary costs in your situation by the second column of Table <CrossRef target="Table" ref="table_scanning_cost"/>. As an example, the third column gives the price of in-house scanning at a salary rate of $4/hour, not including investment costs.</Text>
424	<Text id="137">These calculations assume that the scanner is used for a sufficient volume to justify the investment. The final three columns of Table <CrossRef target="Table" ref="table_scanning_cost"/> give more information about the cost of the scanner itself. The first of these shows the acquisition cost of the scanner, and the next gives its expected lifetime. The last shows the number of pages that could be scanned commercially, at a cost of $0.06/page, for the price of the scanner alone.</Text>
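<Text id="example_scanning_cost">The arithmetic behind the table can be reproduced with a short Python sketch, which may help when substituting your own salary rate or scanner price. The function names are hypothetical; the capacities, prices, the 180-hour month and the $0.06 outsourcing rate are the approximate figures from the table.</Text>

```python
HOURS_PER_MONTH = 180     # working hours per month assumed by the table
OUTSOURCED_RATE = 0.06    # dollars per page charged by a scanning company


def hours_per_page(capacity_pages_per_month):
    """Person-hours needed per page (second column of the table)."""
    return HOURS_PER_MONTH / capacity_pages_per_month


def labor_cost_per_page(capacity_pages_per_month, hourly_salary):
    """Labor cost per page (third column when the salary is $4/hour)."""
    return hours_per_page(capacity_pages_per_month) * hourly_salary


def outsourced_pages_for(scanner_price):
    """Pages that could be scanned commercially for the price of the scanner."""
    return scanner_price / OUTSOURCED_RATE


if __name__ == "__main__":
    # (scanner type, capacity in pages/month, acquisition cost in dollars)
    scanners = [
        ("Flat bed scanner", 2500, 300),
        ("Scanner with sheet-feeder", 8000, 800),
        ("Professional: low-end duplex", 40000, 6000),
        ("Professional: high-end duplex", 150000, 50000),
    ]
    for name, capacity, price in scanners:
        print(name,
              round(hours_per_page(capacity), 4),
              round(labor_cost_per_page(capacity, 4), 4),
              round(outsourced_pages_for(price)))
```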
425	<Text id="138">Of course, many other factors affect the choice of scanner: availability of funds, need to minimize dependence on others, desire to build local capacity, obligations to libraries to scan books locally and not transport them, and so on.</Text>
426	<Text id="139">The above figures give some idea of the volume of pages needed to justify different levels of investment. Rarely will an institute or organization need to scan 800,000 pages. At such levels more complex issues arise, such as maintenance and the possibility of recouping costs by offering scanning services to others, that we will not discuss here.</Text>
427	<Text id="140">It is tempting to regard the development of scanning capacity as a commercial venture, particularly in developing countries. But one should always bear in mind that scanning is not a repetitive business. Once documents have been scanned, clients never place new orders for the same documents, no matter how good the relationship with the scanning company. From a commercial point of view, intensive marketing efforts are needed. We do not advise NGOs or other non-profit organizations to venture into this realm without thorough initial trials and a carefully considered business plan.</Text>
428	<Text id="141">In conclusion, if 10,000 to 50,000 pages are to be scanned, one should consider outsourcing the job. A low-end professional scanner costing about $6000 can only be justified if more than 100,000 pages have to be scanned. You might consider banding together with a few other institutions, perhaps NGOs or libraries, to purchase such a scanner.</Text>
429	</Content>
430	</Subsection>
431	</Content>
432	</Section>
433	</Content>
434	</Chapter>
435	<Chapter id="ocr">
436	<Title>
437	<Text id="142">OCR: Optical Character Recognition</Text>
438	</Title>
439	<Content>
440	<Text id="143">An optical character recognition or OCR system transforms a scanned image into text. The input is a digitized image in TIFF or Bitmap format, preferably a clean, high-quality image. The output is a word-processor or web file, typically in RTF, Word, or HTML format.</Text>
441	<Text id="144">The following steps are involved in converting paper documents to computer form:</Text>
442	<BulletList>
443	<Bullet>
444	<Text id="145">scanning;</Text>
445	</Bullet>
446	<Bullet>
447	<Text id="146">page layout analysis;</Text>
448	</Bullet>
449	<Bullet>
450	<Text id="147">recognition;</Text>
451	</Bullet>
452	<Bullet>
453	<Text id="148">scanning images and tables.</Text>
454	</Bullet>
455	</BulletList>
456	<Text id="149">Following these, you must perform quality checks on the resulting files, and save them in the appropriate format.</Text>
457	<Text id="150">Many good OCR programs are on the market, with prices ranging from $100 to $400.<FootnoteRef id="2"/> Examples, among many others, are:</Text>
458	<BulletList>
459	<Bullet>
460	<Text id="151"><i>Read-Iris</i> (http://www.readiris.com/)</Text>
461	</Bullet>
462	<Bullet>
463	<Text id="152"><i>Omnipage</i> (http://www.omnipage.com/)</Text>
464	</Bullet>
465	<Bullet>
466	<Text id="153"><i>Fine-Reader</i> (http://www.finereader.com/)</Text>
467	</Bullet>
468	</BulletList>
469	<Text id="154">All information, including lists of local distributors, can be found on the manufacturers' websites. In the authors' experience, the most user-friendly of these are Fine-Reader and Omnipage. Fine-Reader is the cheapest, costing about $100. It offers a great deal of flexibility, and the widest range of language options.</Text>
470	<Text id="155">A choice must be made between undertaking the scanning and OCR in-house or outsourcing it to a commercial organization. To do it in-house requires a scanner, an OCR software program, OCR skill development, and a quality-conscious, highly motivated workforce.</Text>
471	<Section id="the_ocr_process">
472	<Title>
473	<Text id="156">The OCR process</Text>
474	</Title>
475	<Content>
476	<Text id="157">The OCR process differs from one OCR program to another, and each one requires a considerable amount of learning. The program's manual will explain this process in detail. Four points deserve particular attention: quality control, tables, images, and specialized material such as formulas, foreign characters etc.</Text>
477	<Subsection id="quality_control_1">
478	<Title>
479	<Text id="158">Quality control</Text>
480	</Title>
481	<Content>
482	<Text id="159">We cannot place enough emphasis on quality control. Quality checks are best performed by native speakers, or by people with an excellent command of the language being checked. The best people are at the university or high-school level. We should also note that young people tend to sustain higher concentration than older people for this kind of work.</Text>
483	<Text id="160">Normally there are four quality checks.</Text>
484	<Text id="161">The first is performed at the same time as OCR. Every OCR program has a built-in spell-checker that highlights every suspect letter. At the same time the image of the word appears too, making it easy to check and correct the error.</Text>
485	<Text id="162">The second is a general check of the text once the OCR process is finished. Common errors are to miss a page, a paragraph, chapter titles, and so on. A general overview is necessary to check if pages are missing. It is essential to check titles, chapter headings, paragraphs, and tables.</Text>
486	<Text id="163">The third is a spelling check using Microsoft Word. This program has a dictionary that is often more sophisticated than the one embedded in OCR programs. By importing the book into Word and performing a spelling check there, more errors can be found and corrected. Be sure to add to the spell-checker any particularly difficult or error-prone words, or scientific and technical terms common in that type of publication.</Text>
487	<Text id="164">Finally, the completed document should be checked by an independent person who samples the complete book and checks for errors, problems with tables and images, tagging, and the general look of the resulting text. Only after this final check can a book be considered ready for digital dissemination.</Text>
488	</Content>
489	</Subsection>
490	<Subsection id="tables">
491	<Title>
492	<Text id="165">Tables</Text>
493	</Title>
494	<Content>
495	<Text id="166">OCR programs do not cope well with tables. Moreover, tables are hard to check. They contain many digits, sometimes with points and commas, and entries are easily misplaced into the wrong row or column. They require concentrated effort, intensive proof-reading, and good quality control. Tables can be handled in three fundamentally different ways.</Text>
496	<Text id="167">First, tables can be treated as images. This involves scanning them as black-and-white images and placing them in this form at the appropriate point in the document. This is the easiest solution. There are no errors, and the only time taken is that involved in creating the image. However, this solution consumes more memory than the others. Also, the resolution is not always sufficient when large tables are displayed on a computer screen. If you shrink the complete table to fit the screen, the resolution is too low to read comfortably. If you make the table wider than the screen, the user must scroll to see all columns and rows, and cannot get an overview of the contents.</Text>
497	<Text id="168">Second, tables can be recreated manually by making a table with the same number of rows and columns and filling the entries by typing them in, character by character.</Text>
498	<Text id="169">Third, the table can be OCR'd. This saves time compared to the manual process, but has a potential for more errors. Columns sometimes get merged, and commas and points are not recognized.</Text>
499	</Content>
500	</Subsection>
501	<Subsection id="images">
502	<Title>
503	<Text id="170">Images</Text>
504	</Title>
505	<Content>
506	<Text id="171">Publications contain three different general types of image:</Text>
507	<BulletList>
508	<Bullet>
509	<Text id="172">black and white line art;</Text>
510	</Bullet>
511	<Bullet>
512	<Text id="173">black and white photographs;</Text>
513	</Bullet>
514	<Bullet>
515	<Text id="174">color photographs.</Text>
516	</Bullet>
517	</BulletList>
518	<Text id="175">Black and white line art should be scanned in line art mode and saved as GIF or PNG files. Black and white photographs should be scanned in greyscale mode and saved as GIF or JPEG files. Color photographs should be scanned in color mode and saved as JPEG files. Generally speaking, medium-quality JPEG provides adequate resolution.</Text>
519	<Text id="176">For most collections, images consume the bulk of the space required on a hard-disk or CD-ROM. This makes it important to optimize each image for clarity and visibility, while minimizing its size. To save space you might drop some or all of the images if they are not relevant to the text.</Text>
520	<Text id="177">Images should be scanned separately, one by one. We recommend giving the image files a name that consists of the first five or six characters used to denote the document followed by the number of the page on which the image was found. An alternative, assuming each document is in its own directory, is to simply use the letter <i>p</i> followed by the page containing the image. If there are several images on a single page, append an additional letter <i>a</i>, <i>b</i>, <i>c</i>… to the filename. For example, if a JPEG image appeared on page 36 of the publication <i>u7548e</i> discussed earlier, it would be placed in a file named <i>u7548e36.jpg</i> or <i>p36.jpg</i>.</Text>
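The naming convention above can be sketched in a few lines of Python. The function name image_filename and its parameters are hypothetical, introduced only for illustration, and the sketch assumes no page carries more than 26 images.

```python
import string


def image_filename(page, doc_id=None, index=None, ext="jpg"):
    """Build an image file name from the page it was found on.

    doc_id: the first five or six characters that denote the document,
            or None to use the short "p" prefix (one directory per document).
    index:  0-based position of the image on the page when the page holds
            several images; None when the page has a single image.
    """
    prefix = doc_id if doc_id is not None else "p"
    # Several images on one page get an extra letter: a, b, c, ...
    suffix = string.ascii_lowercase[index] if index is not None else ""
    return f"{prefix}{page}{suffix}.{ext}"


print(image_filename(36, doc_id="u7548e"))  # u7548e36.jpg
print(image_filename(36))                   # p36.jpg
print(image_filename(36, index=1))          # p36b.jpg
```

Generating names this way, rather than typing them by hand, keeps them consistent when a batch-processing program later has to match each image to its page.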
521	<Text id="178">Once the images have been scanned, you can put batch-processing programs to work to resize or enhance all the images at once.</Text>
522	</Content>
523	</Subsection>
524	<Subsection id="specialized_material">
525	<Title>
526	<Text id="179">Specialized material</Text>
527	</Title>
528	<Content>
529	<Text id="180">Many documents contain specialized material such as special characters, formulas, and difficult pages. Special characters generally relate to different languages and diacritical marks. The language option for the OCR program should be set for the specific language being read. Formulas will have to be recreated manually. Sometimes this is not possible in the OCR program, but only in a word processor like Microsoft Word. Difficult pages that contain complex material, or that are so damaged that a clear image cannot be obtained, might have to be retyped manually.</Text>
530	</Content>
531	</Subsection>
532	</Content>
533	</Section>
534	<Section id="productivity_and_resources_1">
535	<Title>
536	<Text id="181">Productivity and resources</Text>
537	</Title>
538	<Content>
539	<Text id="182">As mentioned earlier, you should not underestimate the difficulty of OCR. Although the economic and practical options for OCR should be considered separately from scanning, similar points arise: the necessary investment in computers; the availability of human resources and management skills; training the workforce; salary costs; the total number of pages to be processed; and whether documents can be outsourced to third parties.</Text>
540	<Text id="183">In this section we share our experience of OCR operations in Belgium, Romania and India. All case studies, calculations and figures assume average situations, documents of standard difficulty (including tables and images) such as are found in most archives or libraries, very high-quality results, and a medium- to long-term operation.</Text>
541	<Subsection id="intensive_ocr">
542	<Title>
543	<Text id="184">Intensive OCR</Text>
544	</Title>
545	<Content>
546	<Text id="185">OCR is difficult. It demands great concentration and much skill. Before attaining peak productivity level and quality, a learning period of about six weeks is needed.</Text>
547	<Text id="186">Typically, best results and productivity are achieved during the first hours of each day. After three hours of OCR work, productivity declines very rapidly, perhaps to 50% of the initial level. After six hours most people become very tired.</Text>
548	<Text id="187">The same kind of evolution occurs over the initial weeks. In the first few weeks everyone achieves fairly high productivity, but after that up to two-thirds of people become bored and frustrated. These people either quit or perform poorly in terms of quality and productivity. Even those who pass the first three to five critical weeks and become part of the regular work team often leave in search of a better position after 6 to 12 months.</Text>
549	<Text id="188">The remarks made in Section <CrossRef target="Section" ref="the_ocr_process"/> about personnel apply particularly to intensive OCR. Quality checks are best undertaken by native speakers or people with a good command of the language being checked. Young people generally sustain higher concentration than older people for OCR work. As a rule of thumb, people aged between 18 and 23 years tend to be better suited than those over 25.</Text>
550	<Text id="189">Finally, OCR can be a boring job, which makes motivation and sustained commitment to quality exceptionally important.</Text>
551	<Text id="190">These facts about OCR lead to the following guidelines:</Text>
552	<BulletList>
553	<Bullet>
554	<Text id="191">Young people between 18 and 25 are best suited for this job.</Text>
555	</Bullet>
556	<Bullet>
557	<Text id="192">Because the first hours are always the most productive, the work should either be organized on a part-time basis or only the most motivated and concentrated people should be selected for full-time work.</Text>
558	</Bullet>
559	<Bullet>
560	<Text id="193">Two-thirds of people tend to quit or get bored after about three to five weeks. This translates into poorer quality and low productivity in the last weeks.</Text>
561	</Bullet>
562	<Bullet>
563	<Text id="194">A regular supply of work is needed to justify the necessary training, to maintain concentration, and to keep spirits high.</Text>
564	</Bullet>
565	</BulletList>
566	</Content>
567	</Subsection>
568	<Subsection id="achievable_productivity">
569	<Title>
570	<Text id="195">Achievable productivity</Text>
571	</Title>
572	<Content>
573	<Table id="table_ocr_productivity">
574	<Title>
575	<Text id="196">OCR productivity</Text>
576	</Title>
577	<TableContent>
578	<tr>
579	<th width="161"/>
580	<th width="142">
581	<Text id="197">Working hours/day</Text>
582	</th>
583	<th width="123">
584	<Text id="198">Pages/day</Text>
585	</th>
586	<th width="104">
587	<Text id="199">Pages/month</Text>
588	</th>
589	</tr>
590	<tr>
591	<th width="161">
592	<Text id="200">Initial training (6 weeks)</Text>
593	</th>
594	<th width="142">
595	<Text id="201">3</Text>
596	</th>
597	<th width="123">
598	<Text id="202">6</Text>
599	</th>
600	<th width="104">
601	<Text id="203">120</Text>
602	</th>
603	</tr>
604	<tr>
605	<th width="161">
606	<Text id="204">Optimal productivity level</Text>
607	</th>
608	<th width="142">
609	<Text id="205">3</Text>
610	</th>
611	<th width="123">
612	<Text id="206">9</Text>
613	</th>
614	<th width="104">
615	<Text id="207">150 to 200</Text>
616	</th>
617	</tr>
618	<tr>
619	<th width="161"/>
620	<th width="142">
621	<Text id="208">7</Text>
622	</th>
623	<th width="123">
624	<Text id="209">28</Text>
625	</th>
626	<th width="104">
627	<Text id="210">500 to 600</Text>
628	</th>
629	</tr>
630	</TableContent>
631	</Table>
632	<Text id="211">Table <CrossRef target="Table" ref="table_ocr_productivity"/> gives typical OCR productivity figures. Documents come in all sizes and qualities, and these figures assume that the mix of documents contains an average number of images or tables, say one image and one table of five rows by five columns every 8 pages. They also assume that the page images are of medium to high quality (note that, as discussed above, this depends on the quality of scanning) and that the OCR workers have a good command of the language.</Text>
633	<Text id="212">Table <CrossRef target="Table" ref="table_ocr_productivity"/> gives separate figures for people undergoing training and for those who have reached their optimal productivity level. If a member of the administrative staff were to allocate three hours a day to OCR, they could achieve 180 to 200 pages of OCR per month. For full-time staff with proper training, high concentration and dedication to quality, 500 to 600 pages a month can be achieved.</Text>
634	<Text id="213">However, the rates that are achieved on difficult pages of low quality, with many columns or many tables, are far lower: perhaps 300 to 400 pages per month for full-time work.</Text>
635	<Text id="214">Assume that the salary cost for dedicated and motivated full-time OCR workers is $400 per month, and that the overhead (including management costs, computers, office space, utilities, etc.) comes to another $300 to $400 per person per month. Then, at 500 to 600 pages per month, the cost of OCR comes to about $1.2 to $1.6 per page. Taking into account the training period, total volume, time-span, and layoff costs should the operation close down for lack of work, these figures rise to $1.5 to $2.5 per page.</Text>
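The per-page figure can be checked with a short calculation. This is a minimal sketch: the salary, overhead and productivity numbers come from the text and the productivity table, while the function and variable names are illustrative assumptions.

```python
def cost_per_page(monthly_cost, pages_per_month):
    """Cost of one OCR'd page for one full-time worker."""
    return monthly_cost / pages_per_month


SALARY = 400                  # dollars per month, from the text
OVERHEAD = (300, 400)         # dollars per person per month, from the text
PAGES_PER_MONTH = (500, 600)  # full-time optimal productivity, from the table

# Cheapest case: lowest overhead spread over the most pages.
low = cost_per_page(SALARY + OVERHEAD[0], PAGES_PER_MONTH[1])
# Dearest case: highest overhead spread over the fewest pages.
high = cost_per_page(SALARY + OVERHEAD[1], PAGES_PER_MONTH[0])

print(f"${low:.2f} to ${high:.2f} per page")  # $1.17 to $1.60 per page
```

The result agrees with the rounded range of about $1.2 to $1.6 per page. The higher $1.5 to $2.5 range also covers training, volume and layoff costs, which this sketch does not model.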
636	<Text id="215">The cost of in-house OCR should be weighed against the cost of outsourcing the work to a professional OCR company. These typically charge from $1.5 to $4 per page, including images and tables. Human Info NGO/Simple Words has such a unit in Romania, and charges humanitarian non-profit organizations a special price that ranges from $1.2 to $2 per page. Please contact us at [email protected] for further information and advice.</Text>
637	</Content>
638	</Subsection>
639	</Content>
640	</Section>
641	<Section id="alternatives_to_ocr">
642	<Title>
643	<Text id="216">Alternatives to OCR</Text>
644	</Title>
645	<Content>
646	<Text id="217">There are two alternatives to OCR that we discuss here.</Text>
647	<Subsection id="manual_retyping">
648	<Title>
649	<Text id="218">Manual retyping</Text>
650	</Title>
651	<Content>
652	<Text id="219">One, which eliminates most scanning as well, is to retype the documents manually, using a word processor. This still requires the images and front cover to be scanned, but the remaining pages need not be scanned, so one can dispense with both powerful scanners and OCR software.</Text>
653	<Text id="220">The people who do this work do not have to understand the text. They must be accurate typists and re-key exactly what they see. Retyping does introduce errors, and double-keying is often used to find and correct these. This method involves two people who independently re-key the same document, after which both digital versions are compared word for word using a special software program by an operator who has the original document in front of them. The assumption is that if the same word has been typed independently twice in the same way, it is correct. However, this is not always true, and for extremely high precision, triple-keying is performed.</Text>
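The comparison step in double-keying can be illustrated with Python's standard difflib module. This is only a sketch of the idea, not the special comparison program mentioned above; the function name and the sample sentences are made up for the example.

```python
import difflib


def keying_disagreements(first, second):
    """Return (words_from_first, words_from_second) pairs for every place
    where two independently keyed versions of the same text differ."""
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


typist_1 = "Scan the front cover and retype the remaining pages"
typist_2 = "Scan the front cover and retipe the remaining pages"

for left, right in keying_disagreements(typist_1, typist_2):
    print(left, "vs", right)  # ['retype'] vs ['retipe']
```

Only the disagreements need to be shown to the operator, who settles each one against the original document. Words that both typists got wrong in the same way produce no disagreement, which is why triple-keying is used when extremely high precision is required.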
654	<Text id="221">The advantage of rekeying is that it saves cost: no OCR program is needed, so the computers can be older, lower-range, or second-hand models, whereas powerful computers are needed for OCR. Also, the work can be performed by people with a lower level of skill. The disadvantages are that a training period of at least two months is needed, and that single keying usually produces too many errors, so double or triple keying is needed.</Text>
655	<Text id="222">The cost depends entirely on salary level. Typically, re-keyers in developing countries are paid on the order of $150/month. Their productivity could be twenty to thirty pages per day, corresponding to 400 pages per month, images included. With double-keying, this makes the total salary costs around $300 per month, plus overheads.</Text>
656	</Content>
657	</Subsection>
658	<Subsection id="image_files">
659	<Title>
660	<Text id="223">Image files</Text>
661	</Title>
662	<Content>
663	<Text id="224">A very low cost alternative to OCR is simply to use a PDF image version of the document pages. The cost is only a fraction of that of OCR: about $0.1 per page.</Text>
664	<Text id="225">Once scanning has been completed and TIFF files are available, an automatic converter (usually Adobe Acrobat or Adobe Photoshop) converts all TIFF files of book pages into PDF files.</Text>
665	<Text id="226">The downside is that these files are not searchable. Also, they are quite large: usually 50 Kb per page, plus or minus 20% depending on the quality of the original TIFF file.</Text>
666	<Text id="227">PDF image files are slow to download, and in developing countries downloading them is sometimes impossible or prohibitively expensive. They rarely fit on a floppy disk, and do not support text manipulation functions such as cut-and-paste.</Text>
667	<Text id="228">The PDF image file method should only be used if no OCR budget is available, and for documents that are likely to be used by a small number of people who have high-speed low-cost Internet access.</Text>
668	</Content>
669	</Subsection>
670	</Content>
671	</Section>
672	<Section id="combining_scanning_and_ocr">
673	<Title>
674	<Text id="229">Combining scanning and OCR</Text>
675	</Title>
676	<Content>
677	<Text id="230">If a scanner is connected directly to the computer that runs the OCR software, most OCR programs can scan a page and perform OCR immediately. Page-by-page scanning and OCR is a reasonable strategy for low volumes, but will prove time-consuming for bigger and more continuous jobs.</Text>
678	<Text id="231">For up to 100 to 150 pages per month, this solution may suffice. For higher volumes it is faster and more efficient to scan the document first, then perform OCR on all the pages as a separate step.</Text>
679	</Content>
680	</Section>
681	</Content>
682	</Chapter>
683	<Chapter id="three_examples">
684	<Title>
685	<Text id="232">Three examples: 1000 to 100,000 pages</Text>
686	</Title>
687	<Content>
688	<Section id="typical_small_collection">
689	<Title>
690	<Text id="233">Typical small collection: 500 to 1000 pages</Text>
691	</Title>
692	<Content>
693	<Text id="234">Most NGOs have 500 to 1000 pages to scan. This volume can be OCRed in-house if motivated volunteers are available.</Text>
694	<Part id="scanning">
695	<Title>
696	<Text id="235">Scanning</Text>
697	</Title>
698	<Content>
699	<Text id="236">The first step is to scan the publications to generate a high-quality TIFF file of each page, and a separate line-art, grey-scale or color bitmap image for each illustration. Assuming that 1000 pages have to be scanned, this might represent a part-time job of about one month, just for scanning. The TIFF files would consume 60 to 80 Mb of hard-disk space, and a good policy is to create a CD-R containing these files. A low-cost flatbed scanner of $100 to $300 will be sufficient for the job. Scanning can be done after working hours or during the weekends by a volunteer in the office or at home.</Text>
700	</Content>
701	</Part>
702	<Part id="ocr">
703	<Title>
704	<Text id="237">OCR</Text>
705	</Title>
706	<Content>
707	<Text id="238">The second step is OCR by another volunteer, or team of volunteers, skilled in language and correction. The TIFF files can either be shared between computers, or one computer can be used for the entire job. Typically, it will take five or six months of part-time labor (e.g. 20 hours a week) to convert 1000 pages into perfect Word or HTML documents.</Text>
708	</Content>
709	</Part>
710	<Part id="outsourcing">
711	<Title>
712	<Text id="239">Outsourcing</Text>
713	</Title>
714	<Content>
715	<Text id="240">An alternative is to outsource the scanning and OCR process. It would probably cost $1500 to $2000 to convert everything into perfect Word and HTML files.</Text>
716	</Content>
717	</Part>
718	</Content>
719	</Section>
720	<Section id="all_publications_from_an_organization">
721	<Title>
722	<Text id="241">All publications from an organization: 5000 pages</Text>
723	</Title>
724	<Content>
725	<Text id="242">Many larger organizations have archives of around 5000 pages of current or out-of-print books, journals, newsletters, grey literature, etc.</Text>
726	<Part id="scanning_1">
727	<Title>
728	<Text id="243">Scanning</Text>
729	</Title>
730	<Content>
731	<Text id="244">This is too much for a flat-bed scanner. Scanning should either be outsourced (approximately $400 for 5000 pages) or a sheet-feeder scanner purchased (approximately $900). Alternatively, a more expensive scanner could be bought together with a few other institutions or NGOs ($6000 costs divided by the number of participants). All 5000 pages in TIFF format will take about 300 to 400 Mb of hard-disk space. Again, a good policy is to create a CD-R containing these files.</Text>
732	</Content>
733	</Part>
734	<Part id="ocr_1">
735	<Title>
736	<Text id="245">OCR</Text>
737	</Title>
738	<Content>
739	<Text id="246">The second step is OCR by another volunteer, or team of volunteers, skilled in OCR and correction. Again, several computers might be used, or one computer for the whole job. It would take 25 to 30 months of half-time labor (assuming 20 hours a week) to convert 5000 pages into perfect Word or HTML. In practice this is too long and too computer-intensive to manage on a volunteer basis. One would have to pay volunteers, monitor them for performance and quality, provide adequate space, etc., in order to have the job finished within a reasonable time at a high level of quality.</Text>
740	<Text id="247">Alternatively one could create image PDF files, which would take 300 to 400 Mb of space and would be harder to download over the Internet.</Text>
741	</Content>
742	</Part>
743	<Part id="outsourcing_1">
744	<Title>
745	<Text id="248">Outsourcing</Text>
746	</Title>
747	<Content>
748	<Text id="249">An alternative is to outsource the scanning and OCR processes. It would probably cost $7500 to $10,000 to convert everything into perfect Word and HTML files.</Text>
749	</Content>
750	</Part>
751	</Content>
752	</Section>
753	<Section id="a_small_library">
754	<Title>
755	<Text id="250">A small library: 100,000 pages</Text>
756	</Title>
757	<Content>
758	<Text id="251">Larger organizations, universities, governments, and specialized libraries might have a whole library to digitize, say 100,000 pages. The first issue to consider is the copyright status of the publications. If they are not in the public domain, explicit permission to digitize them must be obtained from the copyright holders. You should also check whether the files are already available digitally.</Text>
759	<Part id="scanning_2">
760	<Title>
761	<Text id="252">Scanning</Text>
762	</Title>
763	<Content>
764	<Text id="253">The volume is too high for a sheet-feed scanner. Scanning should either be outsourced ($8000 for 100,000 pages), or a more expensive scanner purchased together with a few other institutions or NGOs ($6000 shared between the participants). 100,000 pages in TIFF format will take 6 to 8 Gb of hard-disk space. The best plan is to create a set of CD-R copies containing these files.</Text>
765	</Content>
766	</Part>
767	<Part id="ocr_2">
768	<Title>
769	<Text id="254">OCR</Text>
770	</Title>
771	<Content>
772	<Text id="255">The second step is OCR (or creation of PDF files for less widely used documents). It would take 500 to 700 months of half-time labor to convert 100,000 pages into perfect Word or HTML. This is impossible to realize with volunteers, and the job must be done on a professional basis.</Text>
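The labor estimates given for the three collection sizes in this chapter can be reproduced roughly from the productivity table. The sketch below assumes that half-time volunteers work at the table's optimal part-time rate of 150 to 200 pages per month. That pairing is an assumption made for illustration, and the names are hypothetical.

```python
# Optimal part-time productivity, taken from the OCR productivity table.
PAGES_PER_MONTH = (150, 200)


def months_of_part_time_labor(pages):
    """Return (fewest, most) months needed to OCR the given page count."""
    slow, fast = PAGES_PER_MONTH
    return pages / fast, pages / slow


for pages in (1000, 5000, 100_000):
    fewest, most = months_of_part_time_labor(pages)
    print(f"{pages} pages: {fewest:.0f} to {most:.0f} months")
```

The results land close to the ranges quoted in the text (five or six months, 25 to 30 months, and 500 to 700 months), which suggests those ranges are simple scalings of the same monthly rate.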
773	<Text id="256">To save cost, some of the less-frequently-used pages (say 80%, or 80,000 pages) could be transformed into PDF, and the other 20,000 pages into Word and HTML. The PDFs would take 4 to 6 Gb of space and be harder to download over the Internet, but would cost only $0.2 per page if created by a professional organization (a total of $16,000). If 80,000 PDF files were created from TIFF files by volunteers using PDF conversion programs like Adobe Acrobat, 10 to 20 months of part-time work would be necessary on a powerful computer.</Text>
774	</Content>
775	</Part>
776	<Part id="outsourcing_2">
777	<Title>
778	<Text id="257">Outsourcing</Text>
779	</Title>
780	<Content>
781	<Text id="258">An alternative is to outsource the work. If the 80% PDF and 20% HTML mix were maintained, the PDF would cost around $16,000 and the HTML $30,000 to $40,000, for a total budget of around $50,000. If everything were OCRed, it would cost $150,000 to $200,000 to convert the entire collection into perfect Word and HTML files.</Text>
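The budget figures above can be reproduced with a small calculation. In this sketch the PDF rate of $0.2 per page comes from the text. The OCR rate of $1.5 to $2 per page is inferred from the quoted totals ($150,000 to $200,000 for 100,000 pages) and is not a quoted price. Amounts are held in whole cents to avoid floating-point rounding, and all names are illustrative.

```python
PDF_RATE_CENTS = 20           # $0.2 per page, from the text
OCR_RATE_CENTS = (150, 200)   # $1.5 to $2 per page, inferred from the totals


def mixed_budget(total_pages, pdf_percent):
    """Return the (low, high) outsourcing cost in dollars for a collection
    where pdf_percent of the pages become image PDFs and the rest are OCR'd."""
    pdf_pages = total_pages * pdf_percent // 100
    ocr_pages = total_pages - pdf_pages
    pdf_cost = pdf_pages * PDF_RATE_CENTS
    low = (pdf_cost + ocr_pages * OCR_RATE_CENTS[0]) // 100
    high = (pdf_cost + ocr_pages * OCR_RATE_CENTS[1]) // 100
    return low, high


low, high = mixed_budget(100_000, 80)
print(f"80% PDF mix: ${low:,} to ${high:,}")   # 80% PDF mix: $46,000 to $56,000
low, high = mixed_budget(100_000, 0)
print(f"All OCR:     ${low:,} to ${high:,}")   # All OCR:     $150,000 to $200,000
```

The 80% PDF mix comes out at $46,000 to $56,000, in line with the rounded figure of around $50,000. Changing pdf_percent shows how quickly the budget grows as more of the collection is fully OCR'd.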
782	</Content>
783	</Part>
784	</Content>
785	</Section>
786	</Content>
787	</Chapter>
788	<Chapter id="creating_an_electronic_collection">
789	<Title>
790	<Text id="259">Creating an electronic collection</Text>
791	</Title>
792	<Content>
793	<Text id="260">Three important aspects should be kept in mind when deciding to create digital collections. First, the collection must be organized. The more content there is, the greater the need for indexes and powerful search systems. For collections of 3000 to 5000 pages or more, indexes and search systems are essential. Second, the needs of end-users must prevail. The target groups that will use the collection should be identified, and a process of regular consultation set up. Third, the available budget will determine how much can be done.</Text>
794	<Section id="methods_of_collection_building">
795	<Title>
796	<Text id="261">Methods of collection building</Text>
797	</Title>
798	<Content>
799	<Text id="262">There are many examples of excellent CD-ROMs that are created on the web-page model. HTML, PDF or Word documents are added and linked using hyperlinks. Navigation is made simple and attractive by the use of hyperlinks, frames, keywords, indexes and so on. Such systems work well up to a few thousand pages, but from 3000 to 5000 pages onwards it is important to have a well-structured collection and a powerful search facility. This is where the Greenstone software can help.</Text>
800	<Text id="263">The Greenstone Digital Library software creates a structured digital library including a very powerful search and retrieval engine. Up to 150,000 pages can be indexed on a single CD-ROM. Every CD-ROM can become an Internet server. Greenstone is open-source software, and is freely available under the GNU license.</Text>
801	<Text id="264">The companion manuals describe how to build Greenstone collections. There are essentially three different ways of building collections:</Text>
802	<BulletList>
803	<Bullet>
804	<Text id="265">The librarian interface</Text>
805	</Bullet>
806	<Bullet>
807	<Text id="266">The Collector</Text>
808	</Bullet>
809	<Bullet>
810	<Text id="267">Building from the command line.</Text>
811	</Bullet>
812	</BulletList>
813	<Text id="268">The first method is the “librarian” interface, described in the <i>Greenstone Digital Library User's Guide</i> (Chapter 3, “Making Greenstone Collections”). This is a comprehensive interactive facility for collection-building. With it, you can collect sets of documents, import or assign metadata, and build them into a Greenstone collection. The second method is the “Collector” subsystem, described in Chapter 4 of the <i>User's Guide</i>. This is an older facility that provides an alternative way of building collections of web pages or other documents. It guides you through a sequence of interactive web pages that request the information needed. However, it does not provide any way of adding metadata to the documents, and, because it is a web interface, it is not really suitable for collections that take more than a few minutes to build. The third method is to run the programs for collection-building directly from the command line; this is described in the <i>Greenstone Digital Library Developer's Guide</i> (Chapter 1). This gives more flexibility in running programs individually and saving intermediate results, which may be desirable for collections that take many hours to build. You will also need to read Chapter 2 of the <i>Developer's Guide</i> in order to harness the full power of Greenstone to build advanced collections.</Text>
814	<Text id="269">There is a fourth method for creating and editing the material associated with a collection, a program called the Collection Organizer. However, its functionality has been superseded by the librarian interface mentioned above. It is described in a legacy document entitled <i>Using the Organizer</i>.</Text>
815	</Content>
816	</Section>
817	<Section id="getting_started_in_seven_steps_and_15_minutes">
818	<Title>
819	<Text id="270">Getting started in seven steps and 15 minutes</Text>
820	</Title>
821	<Content>
822	<Text id="271">The best way of getting the look and feel of the librarian interface is to actually create a small test library. If you have 15 minutes please follow these steps and you will understand this program much better.</Text>
823	<Text id="272">Before getting started, first install Greenstone (see the <i>Greenstone Installer's Guide</i>) which includes the Demo collection in DLS format and its source files. <b>Note, if you wish to be able to add to your collection any of the 140 documents in the DLS collection (instead of just the 11 of these documents in the Greenstone Demo collection), you should install DLS as one of the sample Greenstone libraries.</b> The Demo and DLS collections will be installed in <i>C:\Program Files\gsdl\collect</i>, in subdirectories <i>demo</i> and <i>dls</i> respectively. If you previously installed Greenstone without DLS and wish to install it, then you may re-insert your Greenstone CD-ROM and add this collection. It is not necessary to uninstall Greenstone first.</Text>
824	<Text id="273">We suggest that you print the instructions below and follow them step by step:</Text>
825	<NumberedList>
826	<NumberedItem>
827	<Text id="274">Launch the librarian interface under Windows by selecting <i>Greenstone Digital Library</i> from the <i>Programs</i> section of the <i>Start</i> menu and choosing <i>Librarian Interface</i>. If you are using Unix, instead type</Text>
828	<CodeLine>cd ~/gsdl</CodeLine>
829	<CodeLine>cd gli</CodeLine>
830	<CodeLine>./gli.sh</CodeLine>
831	<Text id="275">where <i>~/gsdl</i> is the directory containing your Greenstone system.</Text>
832	</NumberedItem>
833	<NumberedItem>
834	<Text id="276">Select <i>New</i> from the File menu in the horizontal menu bar at the top of the window. Give it a title, for example “My First Collection,” and fill out your email address and a brief description of the collection. In the “Base this collection on” menu, choose “greenstone demo” or “Development Library Subset” (the effect is the same because these two collections have the same structure).</Text>
835	</NumberedItem>
836	<NumberedItem>
837	<Text id="277">Add some documents from the Demo collection (or the DLS collection if it is installed) to your new collection. To do this, double-click the <i>Greenstone Collections</i> folder in the left-hand panel, then double-click the collection you desire. The documents in it are displayed underneath. Select one of these, drag it, and drop it into the right-hand panel. This panel represents the collection you are building. Choose several documents and drag them into it one by one, or use multiple selection in the standard way.</Text>
838	</NumberedItem>
839	<NumberedItem>
840	<Text id="278">Add some of your own documents that are not in the Demo or DLS collections. Close the <i>Greenstone Collections</i> folder in the left-hand panel and double-click the <i>Local Filespace</i> folder. Navigate to a directory that contains some documents (e.g. small Word or HTML files). Drag a few of these into the right-hand panel to include them in your collection.</Text>
841	</NumberedItem>
842	<NumberedItem>
843	<Text id="279">Add metadata to the documents in your collection. So far you have been operating under the <i>Gather</i> panel, indicated by the <i>Gather</i> tab underneath the horizontal menu bar at the top of the window. Click the <i>Enrich</i> tab beside it. The documents in your collection now appear in the left-hand panel: click one and examine the metadata associated with it in the “<i>Element … Value</i>” table at the top right. Use the panel underneath to change individual values by selecting the desired <i>Element</i> and either choosing an existing value from the list or typing a new value into the box near the bottom. Add <i>Title</i>, <i>Organization</i>, and <i>Keyword</i> metadata to each of your own documents that you put in the collection. After you type each value you need to click <i>Append</i> to add that value to the metadata.</Text>
844	</NumberedItem>
845	<NumberedItem>
846	<Text id="280">Click the <i>Create</i> tab to leave the <i>Enrich</i> mode and create your new collection. Click the <i>Build Collection</i> button at the bottom. While the computer is building the collection you will receive some feedback on what it is doing.</Text>
847	</NumberedItem>
848	<NumberedItem>
849	<Text id="281">When it has finished, click the <i>Preview</i> tab to view the collection from within the librarian interface. Check the <i>titles a-z</i>, <i>organisations</i> and <i>how to</i> lists to ensure that your documents have been included in the collection. You will also find when you visit your Greenstone home page that the collection has been installed as one of the regular collections.</Text>
850	</NumberedItem>
851	</NumberedList>
852	</Content>
853	</Section>
854	</Content>
855	</Chapter>
856	<FootnoteList>
857	<Footnote id="1">
858	<Text id="282">All sums of money mentioned in this document are in US dollars, and were current in 2001.</Text>
859	</Footnote>
860	<Footnote id="2">
861	<Text id="283">Recall that all sums of money are expressed in 2001 US dollars.</Text>
862	</Footnote>
863	</FootnoteList>
864	</Manual>
