Timestamp:
2013-11-18T15:45:44+13:00
Author:
jlwhisler
Message:

Updated the associated files tutorial to be a bit more straightforward and focused on the task of associating one file with another. Modified the GS3 format statement so it is clearer exactly what changes are to be made and where.

File:
1 edited

  • documentation/trunk/tutorials/xml-source/tutorial_en.xml

    r28630 r28635  
    15491549<Content>
    15501550<Comment>
    1551 <Text id="assoc-files-1"><Synopsis>This tutorial demonstrates how to combine different versions of the same document together in Greenstone.</Synopsis> As an example, two identical articles about Greenstone are used, one is in PDF format, the other in Word.</Text>
    1552 </Comment>
     1551<Text id="assoc-files-1"><Synopsis>This tutorial demonstrates how to link different versions of the same document together in Greenstone.</Synopsis> As an example, two identical articles about Greenstone are used; one is in PDF format, the other in Word.</Text>
     1552</Comment>
     1553<!--
    15531554<Comment>
    15541555<Text id="assoc-files-2">The key to how this collection is set up is that the Word and PDF versions of the document deliberately have the same filename&mdash;only the file extension is different. This is something that is quite simple to achieve in practice, as it reflects common practice when a document is published in PDF form. This convention is then exploited by the <Format>associate_ext</Format> plugin option at build-time in Greenstone, an option that allows variants of a document to be grouped together and treated by Greenstone as a single document, based on similarity of filename.</Text>
     
    15571558<Text id="assoc-files-3">In the example collection of this tutorial, we set this option in the WordPlugin to be <Format>pdf</Format>. The result of this setting is that it makes the Word version of the document the dominant form in the collection that is built&mdash;the text that Greenstone extracts for indexing purposes comes from the Word document&mdash;and any PDF version of the document with the same filename is bound to it as an associated file.</Text>
    15581559</Comment>
     1560-->
    15591561<NumberedItem>
    15601562<Text id="assoc-files-4">Start a new collection called <b>Associated Files Example</b>, by selecting File &rarr; New. Enter an appropriate description for your collection.</Text>
    15611563</NumberedItem>
    15621564<NumberedItem>
    1563 <Text id="assoc-files-5">Copy the files pdf03.pdf and word03.doc provided in sample_files &rarr; Word_and_PDF &rarr; Documents into your new collection. Do this by dragging these files across from the filesystem view on the left of the <AutoText key="glidict::GUI.Gather"/> panel into the collection view on the right.</Text>
    1564 </NumberedItem>
    1565 <NumberedItem>
    1566 <Text id="assoc-files-6">In the collection view, right-click on each file you just copied and choose Rename to rename them to greenstone1.pdf and greenstone1.doc, respectively. This sets the input documents up to be in line with the objective of this tutorial: to work with documents of different formats that are named similarly and have identical contents.</Text>
    1567 </NumberedItem>
    1568 <NumberedItem>
    1569 <Text id="assoc-files-7">Go to the <AutoText key="glidict::GUI.Design"/> panel. In <AutoText key="glidict::CDM.GUI.Indexes"/>, delete the index for ex.Source, and in <AutoText key="glidict::CDM.GUI.Classifiers"/>, delete the Browsing Classifier for ex.Source too, since we will not be making use of them.</Text>
    1570 </NumberedItem>
    1571 <NumberedItem>
    1572 <Text id="assoc-files-8">In <AutoText key="glidict::CDM.GUI.Plugins"/>, select the <AutoText text="WordPlugin"/> and press the <AutoText key="glidict::CDM.PlugInManager.Configure" type="button"/> button.
    1573 In the resulting popup, scroll down to find the associate_ext option, and set this option to <AutoText text="pdf" type="italics"/>.</Text>
    1574 <Text id="assoc-files-9">Note 1: as this is an option that is categorized under the <AutoText text="BasePlugin"/> heading, it is therefore an option that is available across all the plugins provided by Greenstone. In our example, we happen to be binding a PDF document to a Word document, however it could equally be used to bind MP3 versions of files to PNG artwork of album covers.</Text>
    1575 <Text id="assoc-files-10">Note 2: More than one filename extension can be provided as part of this option, separated by a comma. For example, setting the value of the associate_ext in <AutoText text="TextPlugin"/> to <AutoText text="avi,png" type="italics"/> would allow both an AVI video file (say an oral history interview) and a PNG image (say a picture of the interviewee taken at the time of the recording) to bind to a text version of the document (say representing a transcript of the interview). Both AVI and PNG versions of the file can be present at the same time, or alternatively only one of the two file types need be present, or neither, and Greenstone will process the situation accordingly.</Text>
    1576 <Text id="assoc-files-11">Note 3: The option <Format>associate_ext</Format> is in fact a simplified version of a more general option <Format>associate_tail_re</Format>. Using regular expression syntax, the latter provides a more powerful way of manipulating filenames. Rather than focus on just the filename extension, with <Format>associate_tail_re</Format>, one is able to group files together that share a similar filename root, but might start to differ in characters before the filename extension. For instance, the Word version of the document might be <Format>my-article.doc</Format> but the PDF version might be <Format>my-article-ver13.pdf</Format> reflecting the fact that the PDF file is saved in version 1.3 of this format. Using <Format>associate_tail_re</Format> (and a little bit of regular expression know-how!), such differences can be surmounted, and the two files still processed automatically as different versions of the same document.</Text>
    1577 </NumberedItem>
    1578 <NumberedItem>
    1579 <Text id="assoc-files-12">If you're working with structured Word documents that contain formatted headings and you want better structured and formatted HTML versions of the documents to be generated by Greenstone from the Word format, optionally set the <Format>windows_scripting</Format> option for the <AutoText text="WordPlugin"/> if building on Windows. Alternatively, you can turn on the <Format>open_office_scripting</Format> option if this extension has been added to your Greenstone installation and if either OpenOffice or LibreOffice is available on your system.</Text>
    1580 <Text id="assoc-files-13">If you're using windows scripting, optionally set the <AutoText text="level1_heading" type="italics"/> to <i>heading\s*1</i>, or whatever is appropriate for your documents if they use style information for headings that deviate from the norm for Word. Repeat as is needed for <AutoText text="level2_heading" type="italics"/> and so forth. For more details on how to control sections within a Word document, see the <TutorialRef id="enhanced_word"/> tutorial.</Text>
    1581 </NumberedItem>
    1582 <NumberedItem>
    1583 <Text id="assoc-files-14">In GLI, or otherwise, assign appropriate dc.Title and dc.Creator metadata to both your documents. Since the contents are identical, you can select the 2 documents in the <AutoText key="glidict::GUI.Enrich"/> panel, then set dc.Title and dc.Creator simultaneously for both.</Text>
    1584 </NumberedItem>
    1585 <NumberedItem>
    1586 <Text id="assoc-files-15">Building the collection at this point will have the effect that internally Greenstone will have captured this relationship between the different file versions of the same documents; however, until we make some adjustments to the format statements, none of this will be visible to the end-user. The collection built at this point (with default settings) allows a user to search the text from the Word document, browse by title metadata and so on, but when it comes to the point of viewing a document there will only be the choice of viewing the Word version of the document, or the HTML version that Greenstone automatically generates by processing the Word document.</Text>
    1587 <Text id="assoc-files-16">To go beyond this, the key change to make is to alter the part of the <MajorVersion number="2">default VList statement that says:</MajorVersion><MajorVersion number="3"><AutoText text="documentNode"/> template of the <AutoText text="Browse" /> format statement which chooses between <Format>thumbicon</Format> and <Format>srcicon</Format>, and replace this with a reference to <Format>equivDocIcon</Format> instead.</MajorVersion></Text>
     1565<Text id="assoc-files-5">Copy the files pdf03.pdf and word03.doc provided in sample_files &rarr; Word_and_PDF &rarr; Documents into your new collection. Do this by dragging these files across from the filesystem view on the left of the <AutoText key="glidict::GUI.Gather"/> panel into the <b>Collection view</b> on the right.</Text>
     1566</NumberedItem>
     1567<NumberedItem>
     1568<Text id="assoc-files-6">In the collection view, right-click on each file and select <b>Rename</b>, renaming them greenstone1.pdf and greenstone1.doc, respectively.</Text>
     1569</NumberedItem>
     1570<NumberedItem>
     1571<Text id="assoc-files-14">In the <AutoText key="glidict::GUI.Enrich"/> panel, assign appropriate <b>dc.Title</b> and <b>dc.Creator</b> metadata to the documents. Since the contents are identical, you can select both documents and set metadata for them simultaneously.</Text>
     1572</NumberedItem>
     1573<Heading>
     1574<Text id="assoc-files-h1">Associating one document with another</Text>
     1575</Heading>
     1576<NumberedItem>
     1577<Text id="assoc-files-8">In <AutoText key="glidict::CDM.GUI.Plugins"/>, select the <AutoText text="WordPlugin"/> and press the <AutoText key="glidict::CDM.PlugInManager.Configure" type="button"/> button. In the resulting popup, scroll down to find the <Format>associate_ext</Format> option, and set this option to <AutoText text="pdf" type="italics"/>. Now, for Word documents, Greenstone will look for documents with exactly the same name but the PDF file extension. These PDFs will not be processed separately; instead, they will be associated with their equivalent Word documents. (Alternatively, you could make the PDF document the primary document by setting the <Format>associate_ext</Format> option in the <AutoText text="PDFPlugin"/> to <AutoText text="doc" type="italics"/>.)</Text>
     1578</NumberedItem>
     1579<NumberedItem>
     1580<Text id="assoc-files-12">Build the collection. Notice that only one document was considered for processing and included in the collection. Since the PDF version of the document is an associated document, it is not processed.</Text>
     1581</NumberedItem>
     1582<Heading>
     1583<Text id="assoc-files-h2">Linking to associated documents</Text>
     1584</Heading>
     1585<NumberedItem>
     1586<Text id="assoc-files-15">Greenstone has internally associated the PDF version with the Word version of the document. However, with the default format statement, the end-user will have no idea that the PDF version exists. The collection built at this point (with default settings) only gives the user the choice of viewing either the Word version or the Greenstone-generated HTML version of the document. They are not given the option to view the PDF version.</Text>
     1587<Text id="assoc-files-16">To allow users to view the PDF version of the document, <MajorVersion number="2">change the default VList statement from this:</MajorVersion><MajorVersion number="3">edit the <AutoText text="documentNode"/> template of the <AutoText text="Browse" /> format statement so that it references <Format>equivDocIcon</Format>, wrapped in a link to the PDF document (<Format>equivDocLink</Format>).</MajorVersion></Text>
    15881588<MajorVersion number="2">
    15891589<Format>
     
    15941594&lt;td valign="top"&gt;[ex.equivDocLink][ex.equivDocIcon][ex./equivDocLink]&lt;/td&gt;
    15951595</Format>
    1596 </MajorVersion>
    15971596<Text id="assoc-files-19">Two things occur in this replacement. The main difference is the switch from using <AutoText text="ex.srclink" type="italics"/> and <AutoText text="ex.srcicon" type="italics"/> that provides the link to the primary source document (which is the Word document), and replace it with a hyperlink around an icon to the document that Greenstone has associated as an equivalent document (which is the PDF version). The icon Greenstone chooses to show is based on the filename extension of the matching file it has found. In this case <img src="../tutorial_files/ipdf.gif"/>.</Text>
    1598 <MajorVersion number="2">
    15991597<Text id="assoc-files-20">The second (more minor) change in this edit is to simplify the statement a bit. The original uses an <Format>{Or}</Format> statement to show a thumbnail version of the document, if Greenstone has one, in preference over the source icon. Since in this collection we have no thumbnails generated, it has been simplified by eliminating the <Format>{Or}</Format> combination and going straight to the <AutoText text="ex.equivDocIcon" type="italics"/> metadata item.</Text>
    16001598<Text id="assoc-files-21">To make the change then, switch to the <AutoText key="glidict::GUI.Format"/> panel and edit the format statement for VList (All).</Text>
     
    16171615</MajorVersion>
    16181616<MajorVersion number="3">
    1619 <Text id="assoc-files-20-3">The second (more minor) change in this edit is to simplify the statement a bit. The original uses a <Format>&lt;gsf:choose-metadata/&gt;</Format> statement to show a thumbnail version of the document, if Greenstone has one, in preference over the source icon. Since in this collection we have no thumbnails generated, it has been simplified by eliminating the <Format>&lt;gsf:choose-metadata/&gt;</Format> combination and going straight to the <AutoText text="ex.equivDocIcon" type="italics"/> metadata item.</Text>
    1620 <Text id="assoc-files-21-3">To make the change then, switch to the <AutoText key="glidict::GUI.Format"/> panel and edit the <AutoText text="documentNode"/> template of the <AutoText text="Browse" /> format statement</Text>
    16211617<table>
    16221618<th>
     
    16271623<td valign="bottom">
    16281624<Format>
    1629 &lt;td valign=&quot;top&quot;&gt;<br />
    1630   <Tab n="1"/>&lt;gsf:link type=&quot;source&quot;&gt;<br />
    1631     <Tab n="2"/>&lt;gsf:choose-metadata&gt;<br />
    1632       <Tab n="3"/>&lt;gsf:metadata name=&quot;thumbicon&quot;/&gt;<br />
    1633       <Tab n="3"/>&lt;gsf:metadata name=&quot;srcicon&quot;/&gt;<br />
    1634     <Tab n="2"/>&lt;/gsf:choose-metadata&gt;<br />
    1635   <Tab n="1"/>&lt;/gsf:link&gt;<br />
    1636 &lt;/td&gt;<br />
    1637 &lt;td valign=&quot;top&quot;&gt;<br />
    1638   <Tab n="1"/>&lt;gsf:link type=&quot;document&quot;&gt;<br />
    1639     <Tab n="2"/>&lt;xsl:call-template name=&quot;choose-title&quot;/&gt;<br />
    1640   <Tab n="1"/>&lt;/gsf:link&gt;<br />
    1641   <Tab n="1"/>&lt;gsf:switch&gt;<br />
    1642     <Tab n="2"/>&lt;gsf:metadata name=&quot;Source&quot;/&gt;<br />
    1643     <Tab n="2"/>&lt;gsf:when test=&quot;exists&quot;&gt;<br />
    1644       <Tab n="3"/>&lt;br/&gt;&lt;i&gt;(&lt;gsf:metadata name=&quot;Source&quot;/&gt;)&lt;/i&gt;<br />
    1645     <Tab n="2"/>&lt;/gsf:when&gt;<br />
    1646   <Tab n="1"/>&lt;/gsf:switch&gt;<br />
    1647 &lt;/td&gt;
     1625  &lt;gsf:template match="documentNode"&gt;<br />
     1626    <Tab n="1"/>&lt;td valign="top"&gt;<br />
     1627      <Tab n="2"/>&lt;gsf:link type="document"&gt;<br />
     1628        <Tab n="3"/>&lt;gsf:icon type="document"/&gt;<br />
     1629      <Tab n="2"/>&lt;/gsf:link&gt;<br />
     1630    <Tab n="1"/>&lt;/td&gt;<br />
     1631    <Tab n="1"/>&lt;td valign="top"&gt;<br />
     1632      <Tab n="2"/>&lt;gsf:link type="source"&gt;<br />
     1633       <Tab n="3"/> &lt;gsf:choose-metadata&gt;<br />
     1634         <Tab n="4"/> &lt;gsf:metadata name="thumbicon"/&gt;<br />
     1635        <Tab n="4"/>  &lt;gsf:metadata name="srcicon"/&gt;<br />
     1636        <Tab n="3"/>&lt;/gsf:choose-metadata&gt;<br />
     1637      <Tab n="2"/>&lt;/gsf:link&gt;<br />
     1638<Tab n="1"/>&lt;/td&gt;<br /><br /><br /><br /><br /><br />
     1639    <Tab n="1"/>&lt;td valign="top"&gt;<br />
     1640      <Tab n="2"/>&lt;gsf:link type="document"&gt;<br />
     1641&lt;!--<br />
     1642Defined in the global format statement<br />
     1643--&gt;<br />
     1644       <Tab n="3"/> &lt;xsl:call-template name="choose-title"/&gt;<br />
     1645       <Tab n="3"/> &lt;gsf:switch&gt;<br />
     1646          <Tab n="3"/>&lt;gsf:metadata name="Source"/&gt;<br />
     1647          <Tab n="4"/>&lt;gsf:when test="exists"&gt;<br />
     1648            <Tab n="5"/>&lt;br/&gt;<br />
     1649            <Tab n="5"/>&lt;i&gt;(&lt;gsf:metadata name="Source"/&gt;)&lt;/i&gt;<br />
     1650          <Tab n="4"/>&lt;/gsf:when&gt;<br />
     1651        <Tab n="3"/>&lt;/gsf:switch&gt;<br />
     1652      <Tab n="2"/>&lt;/gsf:link&gt;<br />
     1653    <Tab n="1"/>&lt;/td&gt;<br />
     1654  &lt;/gsf:template&gt;<br />
    16481655</Format>
    16491656</td>
    16501657<td valign="bottom">
    16511658<Format>
    1652 &lt;td valign=&quot;top&quot;&gt;<br />
    1653   <highlight><Tab n="1"/>&lt;gsf:metadata name=&quot;equivDocLink&quot;/&gt;<br />
    1654   <Tab n="1"/>&lt;gsf:metadata name=&quot;equivDocIcon&quot;/&gt;<br />
    1655   <Tab n="1"/>&lt;gsf:metadata name=&quot;/equivDocLink&quot;/&gt;<br />
    1656   </highlight>
    1657 &lt;/td&gt;<br />
    1658 &lt;td valign=&quot;top&quot;&gt;<br />
    1659   <Tab n="1"/>&lt;gsf:link type=&quot;document&quot;&gt;<br />
    1660     <Tab n="2"/>&lt;xsl:call-template name=&quot;choose-title&quot;/&gt;<br />
    1661   <Tab n="1"/>&lt;/gsf:link&gt;<br />
    1662   <Tab n="1"/>&lt;gsf:switch&gt;
    1663     <highlight><br />
    1664     <Tab n="2"/>&lt;gsf:metadata name=&quot;dc.Creator&quot;/&gt;<br />
    1665     <Tab n="2"/>&lt;gsf:when test=&quot;exists&quot;&gt;<br />
    1666       <Tab n="3"/>&lt;br/&gt;&lt;gsf:metadata name=&quot;dc.Creator&quot; separator=&quot;, &quot;/&gt;<br />
    1667     <Tab n="2"/>&lt;/gsf:when&gt;
    1668   </highlight><br />
    1669   <Tab n="1"/>&lt;/gsf:switch&gt;<br />
    1670 &lt;/td&gt;
     1659  &lt;gsf:template match="documentNode"&gt;<br />
     1660    <Tab n="1"/>&lt;td valign="top"&gt;<br />
     1661      <Tab n="2"/>&lt;gsf:link type="document"&gt;<br />
     1662        <Tab n="3"/>&lt;gsf:icon type="document"/&gt;<br />
     1663      <Tab n="2"/>&lt;/gsf:link&gt;<br />
     1664    <Tab n="1"/>&lt;/td&gt;<br />
     1665    <Tab n="1"/>&lt;td valign="top"&gt;<br />
     1666      <Tab n="2"/>&lt;gsf:link type="source"&gt;<br />
     1667       <Tab n="3"/> &lt;gsf:choose-metadata&gt;<br />
     1668         <Tab n="4"/> &lt;gsf:metadata name="thumbicon"/&gt;<br />
     1669        <Tab n="4"/>  &lt;gsf:metadata name="srcicon"/&gt;<br />
     1670        <Tab n="3"/>&lt;/gsf:choose-metadata&gt;<br />
     1671      <Tab n="2"/>&lt;/gsf:link&gt;<br />
     1672<Tab n="1"/>&lt;/td&gt;<br />
     1673  <highlight><Tab n="1"/>&lt;td valign="top"&gt;<br />
     1674<Tab n="2"/>&lt;gsf:metadata name=&quot;equivDocLink&quot;/&gt;<br />
     1675  <Tab n="3"/>&lt;gsf:metadata name=&quot;equivDocIcon&quot;/&gt;<br />
     1676  <Tab n="2"/>&lt;gsf:metadata name=&quot;/equivDocLink&quot;/&gt;<br />
     1677<Tab n="1"/>&lt;/td&gt;<br /></highlight>
     1678    <Tab n="1"/>&lt;td valign="top"&gt;<br />
     1679      <Tab n="2"/>&lt;gsf:link type="document"&gt;<br />
     1680&lt;!--<br />
     1681Defined in the global format statement<br />
     1682--&gt;<br />
     1683       <Tab n="3"/> &lt;xsl:call-template name="choose-title"/&gt;<br />
     1684       <Tab n="3"/> &lt;gsf:switch&gt;<br />
     1685          <Tab n="3"/>&lt;gsf:metadata name="Source"/&gt;<br />
     1686          <Tab n="4"/>&lt;gsf:when test="exists"&gt;<br />
     1687            <Tab n="5"/>&lt;br/&gt;<br />
     1688            <Tab n="5"/>&lt;i&gt;(&lt;gsf:metadata name="Source"/&gt;)&lt;/i&gt;<br />
     1689          <Tab n="4"/>&lt;/gsf:when&gt;<br />
     1690        <Tab n="3"/>&lt;/gsf:switch&gt;<br />
     1691      <Tab n="2"/>&lt;/gsf:link&gt;<br />
     1692    <Tab n="1"/>&lt;/td&gt;<br />
     1693  &lt;/gsf:template&gt;<br />
    16711694</Format>
    16721695</td>
     
    16751698<br />
    16761699</MajorVersion>
    1677 <Text id="assoc-files-24">Note: When Greenstone encounters a file that matches the provided <Format>associate_ext</Format> value (<Format>pdf</Format> in our case), it sets the metadata value <AutoText text="ex.equivDocIcon"/> for that document to be the macro <i>_iconXXX_</i>, where <i>XXX</i> is whatever the filename extension is (so <AutoText text="_iconpdf_" type="italics"/> in our case). As long as there is an existing macro defined for that combination of the word <i>icon</i> and the filename extension, then a suitable icon will be displayed when the document appears in a VList. For <i>pdf</i> the displayed icon will be <img src="../tutorial_files/ipdf.gif"/>. <b>Build</b> the collection if you hadn't already done so and <b>preview</b> it now.</Text>
     1700<Text id="assoc-files-24">Note: When Greenstone encounters a file that matches the provided <Format>associate_ext</Format> value (<Format>pdf</Format> in our case), it sets the metadata value <AutoText text="ex.equivDocIcon"/> for that document to be the macro <i>_iconXXX_</i>, where <i>XXX</i> is whatever the filename extension is (so <AutoText text="_iconpdf_" type="italics"/> in our case). As long as there is an existing macro defined for that combination of the word <i>icon</i> and the filename extension, then a suitable icon will be displayed when the document appears in a VList. For <i>pdf</i> the displayed icon will be <img src="../tutorial_files/ipdf.gif"/>.</Text>
    16781701</NumberedItem>
    16791702</Content>
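The filename matching that the tutorial text in this changeset describes for `associate_ext` (and, in the removed Note 3, for `associate_tail_re`) can be illustrated with a small sketch. This is not Greenstone's implementation: the function names `group_by_associate_ext` and `strip_tail`, and the regular expression used below, are hypothetical and only mimic the behaviour described in the tutorial, assuming that files are paired purely on the part of the filename that remains once the extension (or a matching tail) is removed.

```python
import re
from collections import defaultdict


def group_by_associate_ext(filenames, primary_ext, associate_ext):
    """Map each primary file to the files sharing its root name.

    `associate_ext` is a comma-separated list such as "pdf" or "avi,png",
    mirroring how the tutorial says the option is written.
    (Hypothetical helper, not part of Greenstone.)
    """
    wanted = {ext.strip().lower() for ext in associate_ext.split(",")}
    primaries = {}
    associated = defaultdict(list)
    for name in filenames:
        root, dot, ext = name.rpartition(".")
        if not dot:
            continue  # no extension, nothing to match on
        ext = ext.lower()
        if ext == primary_ext.lower():
            primaries[root] = name
        elif ext in wanted:
            associated[root].append(name)
    return {name: sorted(associated[root]) for root, name in primaries.items()}


def strip_tail(filename, tail_re):
    """Remove the first match of `tail_re` to obtain the shared filename root.

    (Hypothetical helper; the real option may be applied differently.)
    """
    return re.sub(tail_re, "", filename, count=1)


if __name__ == "__main__":
    files = ["greenstone1.doc", "greenstone1.pdf", "notes.doc"]
    print(group_by_associate_ext(files, "doc", "pdf"))

    # Assumed pattern: an optional "-verNN" suffix followed by the extension.
    tail = r"(-ver\d+)?\.(doc|pdf)$"
    print(strip_tail("my-article.doc", tail))
    print(strip_tail("my-article-ver13.pdf", tail))
```

In the first call the Word file is the primary document and the PDF is attached to it, which matches the tutorial's statement that only one document is processed at build time. The second part shows why a tail pattern is more flexible than a plain extension list: both filenames reduce to the same root even though they differ before the extension.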