Changeset 25966

2012-07-18T19:16:25+12:00

Added a tutorial for the Associated Files Example Collection. Adjusted the original on puka so that the format statements appear exactly as they need to be written in order to work.

1 added
1 edited


  • documentation/trunk/tutorials/xml-source/tutorial_en.xml

    r25965 r25966  
     1453<Tutorial id="associated_files">
     1455<Text id="assoc-files-0">Associated files: combining different versions of the same document together</Text>
     1457<Prerequisite id="word_pdf_collection"/>
     1458<Version initial="2.85" current="2.85"/>
     1461<Text id="assoc-files-1">This tutorial demonstrates how to combine Word and PDF versions of the same document in Greenstone. As an example, two identical articles about Greenstone are used: one in PDF format, the other in Word. </Text>
     1464<Text id="assoc-files-2">The key to how this collection is set up is that the Word and PDF versions of the document deliberately have the same filename&mdash;only the file extension is different. This is simple to achieve, as it reflects common practice when a document is published in PDF form. This convention is then exploited at build time by the <Format>associate_ext</Format> plugin option, which allows variants of a document to be grouped together and treated by Greenstone as a single document, based on similarity of filename.</Text>
     1467<Text id="assoc-files-3">In the example collection of this tutorial, we set this option in the WordPlugin to be <Format>pdf</Format>. The result of this setting is that it makes the Word version of the document the dominant form in the collection that is built&mdash;the text that Greenstone extracts for indexing purposes comes from the Word document&mdash;and any PDF version of the document with the same filename is bound to it as an associated file.</Text>
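The grouping idea described above can be illustrated outside Greenstone. The following is a minimal Python sketch, not Greenstone's own (Perl) implementation: the function name `group_by_root` and the sample file `notes.doc` are hypothetical, and the sketch assumes that files are matched purely on the part of the filename before the extension.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_root(filenames, dominant_ext, associate_exts):
    """Attach files with an associated extension to the same-named dominant file."""
    by_root = defaultdict(dict)
    for name in filenames:
        path = PurePath(name)
        # Index each file by its root ("greenstone1") and its extension ("doc").
        by_root[path.stem][path.suffix.lstrip(".").lower()] = name

    grouped = {}
    for variants in by_root.values():
        primary = variants.get(dominant_ext)
        if primary is None:
            continue  # no dominant file for this root; ignored in this sketch
        grouped[primary] = [variants[ext] for ext in associate_exts if ext in variants]
    return grouped


files = ["greenstone1.doc", "greenstone1.pdf", "notes.doc"]
result = group_by_root(files, dominant_ext="doc", associate_exts=["pdf"])
print(result)
```

Here `greenstone1.pdf` is attached to `greenstone1.doc`, while `notes.doc` stands alone because no PDF shares its root.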
     1470<Text id="assoc-files-4">Start a new collection called <b>Associated Files Example</b>, by selecting File &rarr; New. Enter an appropriate description for your collection.</Text>
     1473<Text id="assoc-files-5">Copy the files pdf03.pdf and word03.doc provided in sample_files &rarr; Word_and_PDF &rarr; Documents images into your new collection. Do this by dragging these files across from the filesystem view on the left of the <AutoText key="glidict::GUI.Gather"/> panel into the collection view on the right.</Text>
     1476<Text id="assoc-files-6">In the collection view, rename the 2 files you just copied to greenstone1.pdf and greenstone1.doc, respectively. This sets the input documents up to be in line with the objective of this tutorial: to work with documents of different formats that are named similarly and have identical contents.</Text>
     1479<Text id="assoc-files-7">Go to the <AutoText key="glidict::GUI.Design"/> panel. In <AutoText key="glidict::CDM.GUI.Indexes"/>, delete the index for ex.Source, and in <AutoText key="glidict::CDM.GUI.Classifiers"/>, delete the Browsing Classifier for ex.Source too, since we will not be making use of them.</Text>
     1482<Text id="assoc-files-8">In <AutoText key="glidict::CDM.GUI.Plugins"/>, select the <Autotext text="WordPlugin"/> and press the <AutoText key="glidict::CDM.PlugInManager.Configure" type="button"/> button.
     1483In the resulting popup, scroll down to find the associate_ext option, and set this option to <AutoText text="pdf" type="italics"/>.</Text>
     1484<Text id="assoc-files-9">Note 1: because this option is categorized under the <Autotext text="BasePlugin"/> heading, it is available across all the plugins provided by Greenstone. In our example, we happen to be binding a PDF document to a Word document; however, the option could equally be used to bind MP3 versions of files to PNG artwork of album covers.</Text>
     1485<Text id="assoc-files-10">Note 2: More than one filename extension can be provided as part of this option, separated by commas. For example, setting the value of associate_ext in <Autotext text="TextPlugin"/> to <Autotext text="avi,png" type="italics"/> would allow both an AVI video file (say an oral history interview) and a PNG image (say a picture of the interviewee taken at the time of the recording) to bind to a text version of the document (say a transcript of the interview). Both AVI and PNG versions of the file can be present at the same time, or only one of the two, or neither, and Greenstone will process the situation accordingly.</Text>
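To make the "both, one, or neither" behaviour of a comma-separated value concrete, here is a small Python sketch. It is an illustration only: the helper name `find_associates` and the `interview.*` filenames are hypothetical, and Greenstone's actual matching logic may differ in detail.

```python
def find_associates(primary, available, associate_ext):
    """List the files in `available` that share primary's root and a listed extension."""
    root = primary.rsplit(".", 1)[0]
    # Split the comma-separated option value into individual extensions.
    extensions = [ext.strip() for ext in associate_ext.split(",") if ext.strip()]
    candidates = [f"{root}.{ext}" for ext in extensions]
    return [name for name in candidates if name in available]


both = {"interview.txt", "interview.avi", "interview.png"}
only_png = {"interview.txt", "interview.png"}
neither = {"interview.txt"}

for available in (both, only_png, neither):
    print(find_associates("interview.txt", available, "avi,png"))
```

Each call returns whichever of the AVI and PNG files happen to exist, and an empty list when neither does.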
     1486<Text id="assoc-files-11">Note 3: The option <Format>associate_ext</Format> is in fact a simplified version of a more general option <Format>associate_tail_re</Format>. Using regular expression syntax, the latter provides a more powerful way of manipulating filenames. Rather than focus on just the filename extension, with <Format>associate_tail_re</Format>, one is able to group files together that share a similar filename root, but might start to differ in characters before the filename extension. For instance, the Word version of the document might be <Format>my-article.doc</Format> but the PDF version might be <Format>my-article-ver13.pdf</Format> reflecting the fact that the PDF file is saved in version 1.3 of this format. Using <Format>associate_tail_re</Format> (and a little bit of regular expression know-how!), such differences can be surmounted, and the two files still processed automatically as different versions of the same document.</Text>
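As a rough illustration of the regular-expression idea in Note 3, the Python sketch below strips a "tail" from each filename and compares what is left. The pattern shown is a hypothetical example written for the two filenames mentioned above; how Greenstone applies the value of associate_tail_re internally (it uses Perl regular expressions) is an assumption here, so treat this only as a way to reason about candidate patterns.

```python
import re

# Hypothetical tail pattern: an optional "-ver<digits>" suffix, then the extension.
TAIL_RE = re.compile(r"(-ver\d+)?\.(doc|pdf)$")


def root_of(filename):
    """Return the filename with the matched tail removed."""
    return TAIL_RE.sub("", filename)


word_root = root_of("my-article.doc")
pdf_root = root_of("my-article-ver13.pdf")
print(word_root, pdf_root, word_root == pdf_root)
```

Both filenames reduce to the same root, which is the condition under which they could be treated as versions of one document.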
     1489<Text id="assoc-files-12">If you're working with structured Word documents that contain formatted headings and you want better structured and formatted HTML versions of the documents to be generated by Greenstone from the Word format, optionally set the <Format>windows_scripting</Format> option for the <Autotext text="WordPlugin"/> if building on Windows, or turn on the <Format>open_office_scripting</Format> option if this extension has been added to your Greenstone installation and either OpenOffice or LibreOffice is available on your system.</Text>
     1490<Text id="assoc-files-13">Optionally set the <Autotext text="level1_heading" type="italics"/> option to <i>heading\s*1</i>, or whatever is appropriate for your documents if they use style information for headings that deviate from the norm for Word. Repeat as needed for <Autotext text="level2_heading" type="italics"/> and so forth. For more details on how to control sections within a Word document, see the <TutorialRef id="enhanced_word"/> tutorial.</Text>
     1493<Text id="assoc-files-14">In GLI, or otherwise, assign appropriate dc.Title and dc.Creator metadata to both your documents. Since the contents are identical, you can select the 2 documents in the <AutoText key="glidict::GUI.Enrich"/> panel, then set dc.Title and dc.Creator simultaneously for both.</Text>
     1496<Text id="assoc-files-15">Building the collection at this point will have the effect that internally Greenstone will have captured this relationship between the different file versions of the same documents; however, until we make some adjustments to the format statements, none of this will be visible to the end-user. The collection built at this point (with default settings) allows a user to search the text from the Word document, browse by title metadata and so on, but when it comes to the point of viewing a document there will only be the choice of viewing the Word version of the document, or the HTML version that Greenstone automatically generates by processing the Word document.</Text>
     1497<Text id="assoc-files-16">To go beyond this, the key change to make is to alter the part of the default VList statement that says:</Text>
     1498<Format><td valign="top">[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td></Format>
     1499<Text id="assoc-files-18">to:</Text>
     1500<Format><td valign="top">[ex.equivDocLink][ex.equivDocIcon][ex./equivDocLink]</td></Format>
     1501<Text id="assoc-files-19">Two things occur in this edit. The main change is the switch away from <Autotext text="ex.srclink" type="italics"/> and <Autotext text="ex.srcicon" type="italics"/>, which provide the link to the primary source document (the Word document), to a hyperlink around an icon for the document that Greenstone has associated as an equivalent document (the PDF version). The icon Greenstone chooses to show is based on the filename extension of the matching file it has found, in this case <img src="../tutorial_files/ipdf.gif"/>.</Text>
     1502<Text id="assoc-files-20">The second (more minor) change in this edit is to simplify the statement a bit. The original uses an <Format>{Or}</Format> statement to show a thumbnail version of the document, if Greenstone has one, in preference to the source icon. Since no thumbnails are generated in this collection, the statement has been simplified by eliminating the <Format>{Or}</Format> combination and going straight to the <Autotext text="ex.equivDocIcon" type="italics"/> metadata item.</Text>
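The fallback behaviour of {Or} can be pictured as "use the first value that is present". A minimal Python sketch of that idea follows. It is not Greenstone's implementation, and the function name `first_available` and the placeholder metadata values are hypothetical.

```python
def first_available(metadata, candidates, default=""):
    """Return the first non-empty metadata value among candidates, else default."""
    for key in candidates:
        value = metadata.get(key)
        if value:
            return value
    return default


# Placeholder values: this document has a source icon but no thumbnail.
doc = {"ex.srcicon": "source-icon-placeholder"}
print(first_available(doc, ["ex.thumbicon", "ex.srcicon"]))
print(first_available({}, ["dc.Title", "exp.Title", "ex.Title"], default="Untitled"))
```

With no thumbnail available, the first call falls through to the source icon, which is why dropping the {Or} loses nothing in a collection that generates no thumbnails.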
     1503<Text id="assoc-files-21">Switch to the <AutoText key="glidict::GUI.Format"/> panel and edit the format statement for VList (All).</Text>
     1504<Text id="assoc-files-22">Change:</Text>
     1506 <td valign="top">[link][icon][/link]</td><br />
     1507 <td valign="top">[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td><br />
     1508 <td valign="top">[highlight]<br />
     1509 {Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}<br />
     1510 [/highlight]{If}{[ex.Source],<br />
     1511 &lt;i&gt;([ex.Source])&lt;/i&gt;}</td>
     1513<Text id="assoc-files-23">To:</Text>
     1515 <td valign="top">[link][icon][/link]</td><br />
     1516 <td valign="top">[ex.equivDocLink][ex.equivDocIcon][ex./equivDocLink]</td><br />
     1517 <td valign="top">[highlight]<br />
     1518 {Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}<br />
     1519 [/highlight]{If}{[dc.Creator],: [sibling(All'\, '):dc.Creator]}</td><br />
     1521<Text id="assoc-files-24">Note: When Greenstone encounters a file that matches the provided <Format>associate_ext</Format> value (<Format>pdf</Format> in our case), it sets the metadata value <Autotext text="ex.equivDocIcon"/> for that document to be the macro <i>_iconXXX_</i>, where <i>XXX</i> is whatever the filename extension is (so <Autotext text="_iconpdf_" type="italics"/> in our case). As long as there is an existing macro defined for that combination of the word <i>icon</i> and the filename extension, then a suitable icon will be displayed when the document appears in a VList. For <i>pdf</i> the displayed icon will be <img src="../tutorial_files/ipdf.gif"/>.</Text>
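The naming rule in the note above, the word icon plus the filename extension wrapped in underscores, can be sketched in a few lines of Python. The function name is hypothetical, and lowercasing the extension is an assumption made for this sketch and not something the tutorial states.

```python
def equiv_doc_icon_macro(filename):
    """Build the _iconXXX_ macro name from a filename's extension."""
    # Assumes the filename contains a dot; lowercasing is this sketch's choice.
    extension = filename.rsplit(".", 1)[-1].lower()
    return f"_icon{extension}_"


print(equiv_doc_icon_macro("greenstone1.pdf"))
```

For the collection built in this tutorial, the call yields `_iconpdf_`, the macro named in the note.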
    14531525<MajorVersion number="2">
    14541526<Tutorial id="export_to_CDROM">