source: documented-examples/trunk/wrdpdf-e/resources/ 36257

Last change on this file since 36257 was 36257, checked in by anupama, 2 years ago

DEC collection Word-PDF for GS3. The collection descriptions are still mostly for GS2, but I am currently getting the DEC collections to build, work and display the existing collection text. Once all the collections are ported, the descriptions can be rewritten and made available for translation updates.

name=MSWord and PDF demonstration
shortDescription=<p>This collection demonstrates Greenstone's ability to build collections from documents provided in different formats. It contains a number of papers written by various members of the NZDL project in PDF, MSWord, RTF, and PostScript formats.</p>
description1=<p>The documents in this collection have been produced by members of the Department of Computer Science, University of Waikato. The University of Waikato holds copyright. They may be distributed freely, without any restrictions.</p>
description2=<h3>How the collection works</h3> <p> This collection's <a href="_httpcollection_/etc/collect.cfg" target=collect.cfg>configuration file</a> contains the four format-specific plugins <i>WordPlugin</i>, <i>RTFPlugin</i>, <i>PDFPlugin</i> and <i>PostScriptPlugin</i> (along with the standard four, <i>GreenstoneXMLPlugin</i>, <i>MetadataXMLPlugin</i>, <i>ArchivesInfPlugin</i> and <i>DirectoryPlugin</i>). The four format-specific plugins all extract <i>Title</i> and <i>Source</i> (i.e. filename) metadata.</p>
description3=<p>Greenstone contains third-party software that is used to convert Word, RTF, PDF and PostScript files into HTML. The Greenstone team does not maintain these modules, although we do try to include the latest versions with each Greenstone release. Bugs arise with unusual Word documents (e.g. from older Macintosh systems), and sometimes the text is badly extracted. Some PDF files have no machine-readable text at all, comprising instead a sequence of page <i>images</i> from which text can only be extracted by optical character recognition (OCR), which Greenstone does not attempt. If you encounter these problems, you can either remove the offending documents from your collection, or try using some of the advanced plugin options to process the documents in different ways. For more information, see the Enhanced PDF and Word tutorials on the <a href=''>Greenstone wiki</a>.</p>
description4=<p>The <a href="_httpcollection_/etc/collect.cfg" target=collect.cfg>configuration file</a> includes a single index, based on document text, and one classifier, an <i>AZList</i> based on <i>Title</i> metadata, shown <a href="?a=d&amp;cl=CL1">here</a> (the alphabetic selector is suppressed automatically because the collection contains only a few documents). However, no format statement is specified. In the absence of explicit information, Greenstone supplies sensible defaults. In this case, the default format for the classifier gives: <ul> <li> an icon for the HTML version of the document (the text that is actually indexed, essentially the same as the Greenstone Archive format); <li> an icon for the original version of the document (clicking it opens the document in its original form); <li> <i>Title</i> metadata, extracted from the document; <li> <i>Source</i> (i.e. filename) metadata, extracted from the document. </ul></p>
description5=<p>Here is a format statement that achieves exactly the same effect explicitly. It applies to all <i>VLists</i>, and so controls both the search results list and the alphabetic title browser. <pre> format VList " &lt;td&gt;[link][icon][/link]&lt;\/td&gt; &lt;td&gt;[srclink][srcicon][/srclink]&lt;\/td&gt; &lt;td&gt;[Title]&lt;br&gt;&lt;i&gt;([Source])&lt;/i&gt;&lt;/td&gt;" </pre></p>