Changeset 32906

14.03.2019 21:13:46

New Enhanced PDF tutorial for the new PDFv2Plugin introduced in the upcoming GS3.09 release. This tutorial is written solely with GS3 in mind. For now, the old Enhanced PDF tutorial and the pdfbox tutorial are left in the tutorial xml file for GS2's sake, but both are commented out. PDFv1Plugin is the default PDF plugin for GS2, so either the commented-out versions of these tutorials should now be the tutorials for GS2, with caveats, or, if we ever write/adjust the Enhanced PDF tutorial for GS2 to use PDFv2Plugin, the first instruction would be to repeat the commented-out pdfbox tutorial and get GS2 users to download and set up the pdfbox extension.

1 modified


  • documentation/trunk/tutorials/xml-source/tutorial_en.xml

    r32905 r32906  
    12901291<Tutorial id="enhanced_pdf"> 
    1452 <!--<MajorVersion number="3"> 
     1453<MajorVersion number="3"> 
    14541455<Text id="fw-24a-3">Next we'll customize the <AutoText text="search"/> format statement to highlight the query terms in a PDF file when it is opened from the search result list. This requires Acrobat Reader version 7.0 or higher, and currently only works on a Microsoft Windows platform.</Text> 
    1500 </MajorVersion>--> 
     1502<Text id="fw-24i">When the PDF icons are clicked in the search results, Acrobat will open the file with the search window open and the query terms highlighted.</Text> 
     1506<Tutorial id="enhanced_pdf"> 
     1508<Text id="ep-1">Enhanced PDF handling</Text> 
     1510<SampleFiles folder="Word_and_PDF"/> 
     1511<Version initial="3.09" current="3.09"/> 
     1513  <Comment><Text id="ep-2">Prior to Greenstone 3.09, Greenstone shipped with a plugin called <AutoText text="PDFPlugin"/>. It was the plugin Greenstone used to convert PDF files to HTML using the third-party software <AutoText text="pdftohtml" type="italics"/>. PDFPlugin allowed users to view PDF documents even if they didn't have the PDF software installed. Unfortunately, the formatting of the resulting HTML files was sometimes poor. Earlier versions of this tutorial provided some instruction on extra options to the PDFPlugin for producing a nicer version for display.</Text> 
     1514  </Comment> 
     1515  <Comment><Text id="ep-2a">Furthermore, the older pdftohtml process could not cope with much newer versions of PDF unless PDFPlugin's <Format>pdfbox_conversion</Format> option was switched on.</Text> 
     1516  </Comment> 
     1517  <Comment><Text id="ep-2b">Starting with Greenstone 3.09, some older PDF processing functionality has been restructured into <AutoText text="PDFv1Plugin"/>, while the <Format>pdfbox_conversion</Format> option has been moved into <AutoText text="PDFv2Plugin"/>. PDFv2Plugin further makes use of the third-party software <AutoText text="xpdf-tools" type="italics"/>, which copes better with newer PDFs (without requiring the <Format>pdfbox_conversion</Format> option to be activated). PDFv2Plugin comes with several new preconfigured settings that produce output files in html, text, image, or image-and-text formats, which can better reflect the appearance of an input PDF document's pages. Behind the scenes, PDFv2Plugin is configured to use either the third-party xpdf-tools or the pdfbox software for each output setting.</Text> 
     1518  </Comment> 
     1519  <Comment><Text id="ep-2c">From Greenstone 3.09 onwards, PDFv2Plugin is added to a new collection's Document Plugins pipeline by default, in place of the now defunct PDFPlugin. In any instance where you particularly prefer the original PDFPlugin's HTML output for a PDF, you can now use PDFv1Plugin instead, as it still retains this functionality.</Text> 
     1520  </Comment> 
     1522<Text id="ep-3a">In the Librarian Interface, start a new collection called "PDF collection" and base it on <AutoText key="glidict::NewCollectionPrompt.NewCollection"/>.</Text> 
     1523<Text id="ep-3b">In the <AutoText key="glidict::GUI.Gather"/> panel, drag just the PDF documents from <Path>sample_files &rarr; Word_and_PDF &rarr; Documents</Path> into the new collection. Also drag in the PDF documents from <Path>sample_files &rarr; Word_and_PDF &rarr; difficult_pdf</Path>.</Text> 
     1524<Text id="ep-3b-1">In the <AutoText key="glidict::CDM.GUI.Plugins"/> section of the <AutoText key="glidict::GUI.Design"/> panel, you should find <AutoText text="PDFv2Plugin"/> in the plugins list (in place of the deprecated <AutoText text="PDFPlugin"/> that would have been present in the plugins list in older versions of Greenstone).</Text> 
     1525<Text id="ep-3c">Go to the <AutoText key="glidict::GUI.Create"/> panel and build the collection. Examine the output from the build process.</Text> 
     1526<Text id="ep-3d">If you had built the same collection with PDFv1Plugin instead of PDFv2Plugin, the build output would inform you that one of the documents could not be processed and you'd have seen the following building messages: "The file pdf05-notext.pdf was recognised but could not be processed by any plugin.", and "3 documents were processed and included in the collection. 1 was rejected".</Text> 
     1527<Text id="ep-3e">However, since you built the collection of 4 pdfs with PDFv2Plugin, you will notice that all 4 documents could be processed.</Text> 
     1530<Text id="ep-4a">Preview the collection and view the documents. Inspect <Path>pdf01</Path> and <Path>pdf03</Path> first. A table of contents is provided to the right. Clicking on a page in the table of contents will scroll to that page. Another way of navigating can be found to the left, where individual pages are listed vertically by page number and clicking the "plus" box next to a page will expand its contents. The pdfs have been sectionalised into groups of 10 pages, each group further containing a section for each individual page. If your pdf contains 10 or fewer pages, there won't be two levels of sectionalising, just one.</Text> 
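The two-level sectionalising described above can be illustrated with a minimal Python sketch. This is not the plugin's actual code: the function name `sectionalise` and its `group_size` parameter are hypothetical, and the exact grouping boundaries are an assumption based only on the description (groups of 10 pages, a single level when there are 10 or fewer pages).

```python
def sectionalise(page_count, group_size=10):
    """Return a table-of-contents shape for a PDF with page_count pages.

    Hypothetical illustration only: a flat list of page numbers when the
    document has group_size pages or fewer, otherwise a list of groups,
    each group holding up to group_size page numbers.
    """
    pages = list(range(1, page_count + 1))
    if page_count <= group_size:
        # Short document: a single level, one section per page.
        return pages
    # Longer document: two levels, groups of pages containing page sections.
    return [pages[i:i + group_size] for i in range(0, page_count, group_size)]


if __name__ == "__main__":
    print(sectionalise(4))
    print(sectionalise(23))
```

Under these assumptions, a 4-page document yields one flat list of pages, while a 23-page document yields three groups, the last holding pages 21 to 23.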
     1531<Text id="ep-4b">If you visit a given page and try to select and copy the text, you can. These are not entirely images of the pdf's pages (like screenshots of a pdf page), but are HTML pages that combine the images of the background of each pdf page with the actual text of that page superimposed. The latter is what makes the text selectable.</Text> 
     1532<Text id="ep-4c">If you return to GLI's Design pane and double click on PDFv2Plugin in Document Plugins, then you will see that the convert_to option is set to paged_pretty_html. This is the default PDF convert_to type and produces the kind of sectionalised HTML pages consisting of background images and superimposed text that you see with <Path>pdf01</Path> and <Path>pdf03</Path>.</Text> 
     1535<Text id="ep-5">Next preview <Path>pdf05-notext.pdf</Path>. This is also similarly sectionalised, but the text is not selectable. That's because the original PDF file <Path>pdf05-notext.pdf</Path> contained no text, only images of text.</Text> 
     1538<Text id="ep-6a">Now preview <Path>pdf06-weirdchars.pdf</Path>. Although also sectionalised, its contents look very strange. The reason for this will become apparent if you open the original document by double-clicking <Path>pdf06-weirdchars.pdf</Path> in GLI's Gather pane. Then in the open PDF, select as much of the text on its first page as possible. Copy that text and paste it in a text editor. You should see strange characters. This is why Greenstone's PDFv2Plugin wasn't able to extract legible text either.</Text> 
     1539<Text id="ep-6b">Although Greenstone has processed all 4 documents, <Path>pdf06-weirdchars.pdf</Path> can be made to look better.</Text> 
     1543<Text id="0333">Modes in the Librarian Interface</Text> 
     1546<Text id="0334">The Librarian Interface can operate in different modes. The default mode is <AutoText key="glidict::Preferences.Mode.Librarian"/> mode. We can use <AutoText key="glidict::Preferences.Mode.Expert"/> mode to work out why the pdf file could not be processed.</Text> 
     1549<Text id="0335">Use the <AutoText key="glidict::Menu.File_Options"/> item on the <AutoText key="glidict::Menu.File"/> menu, <AutoText key="glidict::Preferences.Mode"/> tab, to switch to <AutoText key="glidict::Preferences.Mode.Expert"/> mode and then build the collection again. The <AutoText key="glidict::GUI.Create"/> panel looks different in <AutoText key="glidict::Preferences.Mode.Expert"/> mode because it gives more options: locate the <AutoText key="glidict::CreatePane.Build_Collection" type="button"/> button, near the bottom of the window, and click it. Now a message appears saying that the file could not be processed, and why. Amongst all the output, we get the following message: "Error: PDF contains no extractable text. Could not convert pdf05-notext.pdf to HTML format". A PDF file cannot be converted to HTML if it has no extractable text.</Text> 
     1552<Text id="0336">We recommend that you switch back to <AutoText key="glidict::Preferences.Mode.Librarian"/> mode for subsequent exercises, to avoid confusion.</Text> 
     1556<Text id="ep-11">Using image format</Text> 
     1559<Text id="ep-12">PDF documents can be converted to a series of images, one per page. This requires ImageMagick and Ghostscript to be installed.</Text> 
     1562<Text id="ep-13">In the <AutoText key="glidict::CDM.GUI.Plugins"/> section, configure <AutoText text="PDFv2Plugin"/>. Set the <AutoText text="convert_to"/> option to one of the image types, e.g. <AutoText text="pagedimg_jpg"/>. </Text> 
     1564<MajorVersion number="3"> 
     1566<Text id="ep-14-3"><b>Build</b> the collection and <b>preview</b>.  
     1567All PDF documents have been processed again, still divided into a series of page sections, but this time one image per page. 
     1568Images from the document are now displayed instead of the extracted text. That means there's no selectable text for any of the 4 documents this time. The table of contents on the right now displays a horizontal scroller containing thumbnails of each page. <Path>pdf06-weirdchars.pdf</Path> displays nicely now.</Text> 
     1571<MajorVersion number="2"> 
     1573<Text id="ep-14"><b>Build</b> the collection and <b>preview</b>. All PDF documents (including pdf05-notext.pdf) have been processed and divided into sections, but each section displays <AutoText key="perlmodules::BaseImporter.dummy_text" type="quoted"/>. For the conversion to images for PDF documents, no text is extracted.</Text> 
     1576<Text id="ep-15">In order to view the documents properly, you will need to modify the format statement. In the <AutoText key="glidict::CDM.GUI.Formats"/> section on the <AutoText key="glidict::GUI.Format"/> panel, select the <AutoText text="DocumentText"/> format statement. Replace </Text> 
     1580<Text id="ep-16">with</Text> 
     1586<Text id="ep-18">Preview the collection. Images from the document are now displayed instead of the extracted text. Both <Path>pdf05-notext.pdf</Path> and <Path>pdf06-weirdchars.pdf</Path> display nicely now.</Text> 
     1588<Text id="ep-17">In this collection, we only have PDF documents and they have all been converted to images. If we had other document types in the collection, we should use a different format statement, such as:</Text> 
     1590{If}{[parent:FileFormat] eq PDF,[srcicon],[Text]} 
     1592<Text id="ep-17a"><AutoText text="FileFormat"/> is an extracted metadata item which shows the format of the source document. We can use this to test whether the documents are PDF or not: for PDF documents, display [srcicon], for other documents, display [Text].</Text> 
     1597<Text id="ep-19">Using <AutoText text="process_exp"/> to control document processing (advanced)</Text> 
     1600<Text id="ep-20">Processing all of the PDF documents using an image type may not give the best result for your collection. The images will look nice, but as no text is extracted, searching the full text will not be available for these documents. The best solution would be to process most of the PDF files as HTML, and only use the image format where HTML doesn't work.</Text> 
     1603<Text id="ep-21">We achieve this by putting the problem files into a separate folder, and adding another <AutoText text="PDFv2Plugin"/> plugin with different options.</Text>  
     1606<Text id="ep-23">Go to the <AutoText key="glidict::GUI.Gather"/> panel. Make a new folder called <AutoText text="notext" type="quoted"/>: right click in the collection panel and select <AutoText key="glidict::CollectionPopupMenu.New_Folder"/> from the menu. Change the <AutoText key="glidict::NewFolderOrFilePrompt.Folder_Name"/> to <AutoText text="notext" type="quoted"/>, and click <AutoText key="glidict::General.OK" type="button"/>.</Text> 
     1607<Text id="ep-23a">Move the two pdf files that have problems with html (<Path>pdf05-notext.pdf</Path> and <Path>pdf06-weirdchars.pdf</Path>) into this folder by drag and drop. We will set up the plugins so that PDF files in this <Path>notext</Path> folder are processed differently to the other PDF files.</Text> 
     1610<Text id="ep-24">Switch to the <AutoText key="glidict::CDM.GUI.Plugins"/> section of the <AutoText key="glidict::GUI.Design"/> panel. Add a second instance of <AutoText text="PDFv2Plugin"/> by selecting <AutoText text="PDFv2Plugin"/> from the <AutoText key="glidict::CDM.PlugInManager.PlugIn"/> drop-down list, and clicking <AutoText key="glidict::CDM.PlugInManager.Add" type="button"/>. This plugin will come after the first PDFv2Plugin instance, so we configure it to process PDF documents as sectionalised HTML. Leave the <AutoText text="convert_to"/> option on <AutoText text="paged_pretty_html"/>, and switch on the <AutoText text="use_sections"/> option. Click <AutoText key="glidict::General.OK" type="button"/>.</Text> 
     1613<Text id="ep-25">Configure the first PDF plugin, and set the <AutoText text="process_exp"/> option to <AutoText text="&quot;notext.*\.pdf&quot;"/>.</Text> 
     1616<Text id="ep-26">The two PDF plugins should have options like the following:</Text> 
     1618plugin PDFv2Plugin -convert_to pagedimg_jpg -process_exp "notext.*\.pdf"<br/> 
     1619plugin PDFv2Plugin -convert_to paged_pretty_html 
     1621<Text id="ep-27">The <AutoText text="paged_img" type="italics"/> version must come earlier in the list than the <AutoText text="html" type="italics"/> version. The <AutoText text="process_exp"/> option makes the first <AutoText text="PDFv2Plugin"/> process any PDF files in the <Path>notext</Path> directory. The second <AutoText text="PDFv2Plugin"/> will process any PDF files that are not processed by the first one.</Text> 
     1622<Text id="ep-28">Note that all plugins have the <AutoText text="process_exp"/> option, and this can be used to customize which documents are processed by which plugin.</Text> 
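The ordering rule above (the first plugin whose `process_exp` matches a file gets to process it) can be sketched in Python. This is an illustration only, not Greenstone's implementation: Greenstone plugins are written in Perl, so Python's `re` module stands in for Perl regular expressions here, the `pick_plugin` helper is hypothetical, and the `\.pdf$` pattern given to the second plugin is an assumed stand-in for its default expression.

```python
import re

# Plugins in the order they appear in the plugin list.
# The second pattern is an assumption standing in for the plugin's default.
PLUGINS = [
    ("PDFv2Plugin -convert_to pagedimg_jpg", re.compile(r"notext.*\.pdf")),
    ("PDFv2Plugin -convert_to paged_pretty_html", re.compile(r"\.pdf$", re.IGNORECASE)),
]


def pick_plugin(path, plugins=PLUGINS):
    """Return the first plugin whose expression matches the file path."""
    for name, pattern in plugins:
        if pattern.search(path):
            return name
    return None  # no plugin recognised the file


if __name__ == "__main__":
    for path in ("import/notext/pdf06-weirdchars.pdf", "import/pdf01.pdf", "import/readme.txt"):
        print(path, "->", pick_plugin(path))
```

In this sketch, swapping the two entries would send every PDF to the HTML instance, which is why the image instance has to come first. Note also that, under the search semantics assumed here, the expression matches any path containing `notext` before `.pdf`, so a file named `pdf05-notext.pdf` would match even outside the `notext` folder.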
     1624<MajorVersion number="2"> 
     1626<Text id="ep-30">Edit the <AutoText text="DocumentText"/> format statement. PDF files processed as HTML will not have images to display, so we need to make sure they get text displayed instead. Change <Format>[srcicon]</Format> to <Format>{If}{[NoText] eq "1",[srcicon],[Text]}</Format>.</Text> 
     1630<Text id="ep-33">Build and preview the collection. All PDF documents should look relatively nice. Try searching this collection. You will be able to search for the PDFs that were converted to HTML (try e.g. <AutoText text="bibliography" type="quoted"/>), but not the ones that were converted to images (try searching for <AutoText text="FAO" type="quoted"/> or <AutoText text="METS" type="quoted"/>).</Text> 
     1632<MajorVersion number="3"> 
     1634<Text id="ep-sd-1">Customising the table of contents section heading display</Text> 
     1637<Text id="ep-sd-2">In the table of contents (on the right), a section number and section title are displayed by default. For documents like these where the section titles are the same as the section numbers, this doesn't make much sense, as you end up with headings like "1 1". We can hide the section number from the display by adding some CSS style information.</Text> 
     1640<Text id="ep-sd-3">Click on the <AutoText text="display"/> format statement in the <AutoText key="glidict::CDM.GUI.Formats"/> list. Add the following to the start of the content:</Text> 
     1642    &lt;gsf:template name="additionalHeaderContent-collection"&gt;<br/> 
     1643    <Tab n="1"/>&lt;style&gt;span.tocSectionNumber { display: none; }&lt;/style&gt;<br/> 
     1644  &lt;/gsf:template&gt; 
     1648<Text id="ep-sd-4">Note that if you'd rather hide the title instead, you can use <AutoText type="italics" text="span.tocSectionTitle" /> in the above CSS code instead of <AutoText type="italics" text="span.tocSectionNumber" />.</Text> 
     1652<Text id="fw-24">Opening PDF files with query terms highlighted</Text> 
     1654<MajorVersion number="2"> 
     1656<Text id="fw-24a">Next we'll customize the <AutoText text="SearchVList"/> format statement to highlight the query terms in a PDF file when it is opened from the search result list. This requires Acrobat Reader version 7.0 or higher, and currently only works on a Microsoft Windows platform.</Text> 
     1659<Text id="fw-24c">The search terms are kept in the macro variable <AutoText text="_queryterms_"/>, and we append <AutoText text="#search=&quot;_queryterms_&quot;"/> to the end of a PDF file link to pass the query terms to the PDF.</Text> 
     1660<Text id="fw-24d"><AutoText text="PDFPlugin"/> saves each PDF file in a unique directory. You can use </Text> 
     1662<Text id="fw-24f">to refer to these files.</Text> 
     1665<Text id="fw-24g">Add <AutoText text="SearchVList"/> by selecting <AutoText text="Search"/> from the <AutoText key="glidict::CDM.FormatManager.Feature"/> drop down list, and <AutoText text="VList"/> from the <AutoText key="glidict::CDM.FormatManager.Part"/> list. Click <AutoText key="glidict::CDM.FormatManager.Add" type="button"/> to add the <AutoText text="SearchVList"/> format statement into the list of assigned formats. We need to test whether the file is a PDF file before linking to it, using <Format>{If}{[ex.FileFormat] eq 'PDF',,}</Format>. For PDF files, we use the above path format instead of the <Format>[ex.srclink]</Format> and <Format>[ex./srclink]</Format> variables to link to the file.</Text> 
     1666<Text id="fw-24b">The resulting format statement is:</Text> 
     1668&lt;td valign="top"&gt;[link][icon][/link]&lt;/td&gt;<br/> 
     1669&lt;td valign="top"&gt;<highlight>{If}{[ex.FileFormat] eq 'PDF', &lt;a 
     1671&lt;td valign="top"&gt;[highlight]<br/> 
     1677<MajorVersion number="3"> 
     1679<Text id="fw-24a-3">Next we'll customize the <AutoText text="search"/> format statement to highlight the query terms in a PDF file when it is opened from the search result list. This requires Acrobat Reader version 7.0 or higher, and currently only works on a Microsoft Windows platform.</Text> 
     1682<Text id="fw-24c-3">To highlight the query terms in a PDF document, we need to pass them into the PDF file by appending <AutoText text="#search=&quot;query&quot;"/> to the end of the document link. We need to create the link ourselves rather than using &lt;gsf:link type=&quot;source&quot;/&gt; in the format statement. </Text> 
     1683<Text id="fw-24d-3"><AutoText text="PDFPlugin"/> saves each PDF file in a unique directory for that document, and we can use</Text> 
     1684<Format>&lt;gsf:metadata name=&quot;httpPath&quot; type=&quot;collection&quot;/&gt;/index/assoc/&lt;gsf:metadata name=&quot;archivedir&quot;/&gt;/&lt;gsf:metadata name=&quot;srclinkFile&quot;/&gt;</Format> 
     1685<Text id="fw-24e-3">to refer to the PDF source file.  
     1686The search terms can be found in the &quot;q&quot; cgi parameter. You can access this using &lt;gsf:cgi-param name=&quot;q&quot;/&gt;.</Text> 
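The link described above is the collection's HTTP path, the document's associated-files directory, the source file name, and a `#search="..."` fragment joined together. A minimal Python sketch of that string assembly follows. The helper name `pdf_search_link` and all example values are hypothetical; in the tutorial itself the pieces come from the metadata and cgi parameter elements of the format statement, not from Python.

```python
def pdf_search_link(http_path, archivedir, srclink_file, query):
    """Build a PDF link that asks the viewer to highlight the query terms.

    Mirrors the pattern httpPath/index/assoc/archivedir/srclinkFile
    followed by the #search="query" fragment.
    """
    base = f"{http_path}/index/assoc/{archivedir}/{srclink_file}"
    return f'{base}#search="{query}"'


if __name__ == "__main__":
    # All values below are made up for illustration.
    print(pdf_search_link(
        "/greenstone3/sites/localsite/collect/pdfcol",
        "HASH0123.dir",
        "doc.pdf",
        "bibliography",
    ))
```

The sketch leaves the query text unescaped to match the literal form given in the tutorial; whether a given browser or viewer needs the fragment percent-encoded is not covered here.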
     1689<Text id="fw-24g-3">Select <AutoText text="search"/> in <AutoText key="glidict::CDM.GUI.Formats"/> for editing. We need to test whether the file is a PDF file before linking to it, using a test on whether the Greenstone extracted FileFormat metadata is PDF. For PDF files, we now generate the link explicitly.</Text> 
     1690<Text id="fw-24b-3">The resulting format statement is:</Text> 
     1692  &lt;td valign=&quot;top&quot;&gt;<br/> 
     1693    <Tab n="1"/>&lt;gsf:link type=&quot;document&quot;&gt;<br/> 
     1694        <Tab n="2"/>&lt;gsf:icon type=&quot;document&quot;/&gt;<br/> 
     1695    <Tab n="1"/>&lt;/gsf:link&gt;<br/> 
     1696  &lt;&#47;td&gt;<br/> 
     1697  <br /> 
     1698  &lt;td valign=&quot;top&quot;&gt;<br/> 
     1699  <highlight> 
     1700  &lt;gsf:switch&gt;<br/> 
     1701    <Tab n="1"/>&lt;gsf:metadata name=&quot;FileFormat&quot;/&gt;<br/> 
     1702    <Tab n="1"/>&lt;gsf:when test=&quot;equals&quot; test-value=&quot;PDF&quot;&gt;<br/> 
     1703        <Tab n="2"/>&lt;a&gt;&lt;xsl:attribute name=&quot;href&quot;&gt;&lt;gsf:metadata name=&quot;httpPath&quot; type=&quot;collection&quot;/&gt;/index/assoc/&lt;gsf:metadata name=&quot;archivedir&quot;/&gt;/&lt;gsf:metadata name=&quot;srclinkFile&quot;/&gt;#search=&amp;amp;quot;&lt;gsf:cgi-param name=&quot;query&quot;/&gt;&amp;amp;quot;&lt;/xsl:attribute&gt;<br/> 
     1704            <Tab n="3"/>&lt;gsf:choose-metadata&gt;<br/> 
     1705                <Tab n="4"/>&lt;gsf:metadata name=&quot;thumbicon&quot;/&gt;<br/> 
     1706                <Tab n="4"/>&lt;gsf:metadata name=&quot;srcicon&quot;/&gt;<br/> 
     1707            <Tab n="3"/>&lt;/gsf:choose-metadata&gt;<br/> 
     1708        <Tab n="2"/>&lt;/a&gt;<br/> 
     1709    <Tab n="1"/>&lt;/gsf:when&gt;<br/> 
     1710    <Tab n="1"/>&lt;gsf:otherwise&gt;<br/> 
     1711        <Tab n="2"/>&lt;gsf:link type=&quot;source&quot;&gt;<br/> 
     1712            <Tab n="3"/>&lt;gsf:choose-metadata&gt;<br/> 
     1713                <Tab n="4"/>&lt;gsf:metadata name=&quot;thumbicon&quot;/&gt;<br/> 
     1714                <Tab n="4"/>&lt;gsf:metadata name=&quot;srcicon&quot;/&gt;<br/> 
     1715            <Tab n="3"/>&lt;/gsf:choose-metadata&gt;<br/> 
     1716        <Tab n="2"/>&lt;/gsf:link&gt;<br/> 
     1717    <Tab n="1"/>&lt;/gsf:otherwise&gt;<br/> 
     1718  &lt;/gsf:switch&gt;</highlight><br/>   
     1719  &lt;&#47;td&gt;<br/> 
     1720  <br /> 
     1721&lt;td valign=&quot;top&quot;&gt;<br/> 
    15011726<Text id="fw-24i">When the PDF icons are clicked in the search results, Acrobat will open the file with the search window open and the query terms highlighted.</Text>