Changeset 32027 for documentation


Timestamp:
2017-10-05T23:01:50+13:00 (7 years ago)
Author:
ak19
Message:
  1. Beginning of a new tutorial, one on using the new UnknownConverterPlugin. Added the section on using the plugin with Icecite for converting PDFs to text. Tomorrow, will investigate its use with djvu conversion tool with djvu file formats and hopefully be able to add a section in the tutorial for that.
File:
1 edited

  • documentation/trunk/tutorials/xml-source/tutorial_en.xml

    r32021 r32027  
    775775</Comment>
    776776<NumberedItem>
    777 <Text id="images-gps-1">Create a new collection in GLI called <i>Images-GPS</i>. In the <b>Gather</b> panel, drag and drop the 4 folders in <Path>sample_files &rarr; images_gps</Path> from the Workspace view on the left into the Collection view on the right.</Text>
    778 </NumberedItem>
    779 <NumberedItem>
    780   <Text id="images-gps-1a">Since the images are organised by folder, we can easily assign <i>folder-level</i> metadata to the images which will help with classifying them. In the <b>Enrich</b> panel, select the <Path>eiffel-tower</Path> folder, and in its <b>dc.Title</b> field type <i>Eiffel Tower</i>. Since this metadata is assigned at folder level, it is inherited as <b>dc.Title</b> metadata by all the images in the folder.</Text>
     777<Text id="images-gps-1">Create a new collection in GLI called <i>Images-GPS</i>. In the <AutoText key="glidict::GUI.Gather"/> panel, drag and drop the 4 folders in <Path>sample_files &rarr; images_gps</Path> from the Workspace view on the left into the Collection view on the right.</Text>
     778</NumberedItem>
     779<NumberedItem>
     780  <Text id="images-gps-1a">Since the images are organised by folder, we can easily assign <i>folder-level</i> metadata to the images which will help with classifying them. In the <AutoText key="glidict::GUI.Enrich"/> panel, select the <Path>eiffel-tower</Path> folder, and in its <b>dc.Title</b> field type <i>Eiffel Tower</i>. Since this metadata is assigned at folder level, it is inherited as <b>dc.Title</b> metadata by all the images in the folder.</Text>
    781781  <Text id="images-gps-1b">When setting folder-level metadata like this, the default setting in GLI is to produce a popup window alerting you to the fact that the assigned metadata will be assigned
    782782  to all files and sub-folders contained in the selected folder.  For this collection, this is what we want, so press <i>OK</i> for the action to proceed.</Text>
     
    790790</NumberedItem>
    791791<NumberedItem>
    792 <Text id="images-gps-4">Now go to the <b>Create</b> panel and press <b>Build Collection</b>.</Text>
     792<Text id="images-gps-4">Now go to the <AutoText key="glidict::GUI.Create"/> panel and press <b>Build Collection</b>.</Text>
    793793</NumberedItem>
    794794<NumberedItem>
     
    803803  Each of these image files has metadata embedded in it&mdash;including GPS data&mdash;generated by the smartphone when the photo was taken. We can extract this metadata when the collection is built, and in particular, make use of the GPS metadata to provide map-based views of the collection to the user.</Text>
    804804
    805   <Text id="images-gps7">In the <b>Document Plugins</b> section of the <b>Design</b> panel, go down to the <b>select plugin to add</b> and choose the <b>EmbeddedMetadataPlugin</b>. Press the <b>Add Plugin</b> button, and then click <b>OK</b> to add it to the plugin list. Select this plugin in the list, then use the <b>Move Up</b> button to shift it upwards until it comes just after the GreenstoneXMLPlugin.</Text>
     805  <Text id="images-gps7">In the <AutoText key="glidict::CDM.GUI.Plugins"/> section of the <AutoText key="glidict::GUI.Design"/> panel, go down to the <b>select plugin to add</b> and choose the <AutoText text="EmbeddedMetadataPlugin"/>. Press the <b>Add Plugin</b> button, and then click <AutoText key="glidict::General.OK" type="button"/> to add it to the plugin list. Select this plugin in the list, then use the <AutoText key="glidict::CDM.Move.Move_Up" type="button"/> button to shift it upwards until it comes just after the GreenstoneXMLPlugin.</Text>
    806806</NumberedItem>
    807807
    808808<NumberedItem>
    809 <Text id="images-gps-8">Go to the <b>Create</b> panel and press <b>Build Collection</b>.</Text>
     809<Text id="images-gps-8">Go to the <AutoText key="glidict::GUI.Create"/> panel and press <AutoText key="glidict::CreatePane.Build_Collection" type="button"/>.</Text>
    810810</NumberedItem>
    811811
    812812<NumberedItem>
    813   <Text id="images-gps-9">Now go to the <b>Enrich</b> panel, expand the <b>eiffel-tower</b> folder and select the first image. Scroll down to see the metadata extracted during the building process. Among the extracted metadata, you will find several pieces of Latitude and Longitude metadata, which we will be taking advantage of shortly: ex.Latitude, ex.Longitude, ex.LatShort, and ex.LngShort.  </Text>
     813  <Text id="images-gps-9">Now go to the <AutoText key="glidict::GUI.Enrich"/> panel, expand the <b>eiffel-tower</b> folder and select the first image. Scroll down to see the metadata extracted during the building process. Among the extracted metadata, you will find several pieces of Latitude and Longitude metadata, which we will be taking advantage of shortly: ex.Latitude, ex.Longitude, ex.LatShort, and ex.LngShort.  </Text>
    814814</NumberedItem>
    815815 
     
    820820  Greenstone has a map view, based on the Google Maps API, that can make use of this location metadata.  The map view can be controlled to appear in different parts of the interface: as part of a collection's search results page when browsing the collection, and/or when viewing a document.  For this view to be operational in Greenstone, the collection must index the GPS metadata.</Text>
    821821<NumberedItem>
    822 <Text id="images-gps-11">In the <b>Search Indexes</b> section of the <b>Design</b> panel, press the <b>New Index...</b> button. Scroll down and tick the box for <b>ex.LatShort</b> and press <b>Add Index</b> to create an index on it. In like manner, create an index on <b>ex.Latitude</b>. Then another on <b>ex.LngShort</b>. And finally one on <b>ex.Longitude</b>.</Text>
    823 </NumberedItem>
    824 <NumberedItem>
    825 <Text id="images-gps-12">Select <b>Search</b> on the left of the <b>Format</b> panel. For the index on the combined longitude and latitude metadata, type <i>locations</i> as its display name.</Text>
    826 </NumberedItem>
    827 <NumberedItem>
    828 <Text id="images-gps-13">Now go to the <b>Create</b> panel and press <b>Build Collection</b>.</Text>
    829 </NumberedItem>
    830 <NumberedItem>
    831 <Text id="images-gps-14">To enable the map, go to the <b>Format Features</b> section of the <b>Format</b> panel, and select the <b>browse</b> format feature. In the editor below, enter the following format statement <i>above</i> the documentNode template:</Text>
     822<Text id="images-gps-11">In the <AutoText key="glidict::CDM.GUI.Indexes"/> section of the <AutoText key="glidict::GUI.Design"/> panel, press the <AutoText key="glidict::CDM.IndexManager.New_Index" type="button"/> button. Scroll down and tick the box for <b>ex.LatShort</b> and press <AutoText key="glidict::CDM.IndexManager.Add_Index" type="button"/> to create an index on it. In like manner, create an index on <b>ex.Latitude</b>. Then another on <b>ex.LngShort</b>. And finally one on <b>ex.Longitude</b>.</Text>
     823</NumberedItem>
     824<NumberedItem>
     825<Text id="images-gps-12">Select <b>Search</b> on the left of the <AutoText key="glidict::GUI.Format"/> panel. For the index on the combined longitude and latitude metadata, type <i>locations</i> as its display name.</Text>
     826</NumberedItem>
     827<NumberedItem>
     828<Text id="images-gps-13">Now go to the <AutoText key="glidict::GUI.Create"/> panel and press <AutoText key="glidict::CreatePane.Build_Collection" type="button"/>.</Text>
     829</NumberedItem>
     830<NumberedItem>
     831<Text id="images-gps-14">To enable the map, go to the <AutoText key="glidict::CDM.FormatManager.Feature"/> section of the <AutoText key="glidict::GUI.Format"/> panel, and select the <b>browse</b> format feature. In the editor below, enter the following format statement <i>above</i> the documentNode template:</Text>
    832832<Format>&lt;gsf:option name="mapEnabled" value="true" /&gt;</Format>
    833833</NumberedItem>
    834834<NumberedItem>
    835 <Text>Also in the <b>Format Features</b> section, select the <b>searchType</b> feature. Add <AutoText text="raw"/> to the list of search types.</Text>
    836 </NumberedItem>
    837 <NumberedItem>
    838 <Text id="images-gps-15">In the <b>Format</b> panel press the <b>Preview Collection</b> button, and click on the new browsing classifier (<Format>locations</Format>) and then click on a bookshelf icon. The page that opens up shows a Google map, with the locations of the images in the collection pinpointed on it. The map view can also scroll through all the images, locating each place and associated image in turn.</Text>
     835<Text>Also in the <AutoText key="glidict::CDM.FormatManager.Feature"/> section, select the <b>searchType</b> feature. Add <AutoText text="raw"/> to the list of search types.</Text>
     836</NumberedItem>
     837<NumberedItem>
     838<Text id="images-gps-15">In the <AutoText key="glidict::GUI.Format"/> panel press <AutoText key="glidict::CreatePane.Preview_Collection" type="button"/>, and click on the new browsing classifier (<Format>locations</Format>) and then click on a bookshelf icon. The page that opens up shows a Google map, with the locations of the images in the collection pinpointed on it. The map view can also scroll through all the images, locating each place and associated image in turn.</Text>
    839839</NumberedItem>
    840840<Text id="images-gps-16">
     
    851851  <NumberedItem>
    852852    <Text id="images-gps-17">
    853       To activate a map view when viewing the document, go to the <b>Format Features</b> section of the <b>Format</b> panel, and select the <b>display</b> format feature. In the editor below, enter the following format statement <i>after</i> the line <Format>&lt;gsf:option name="TOC" value="true" /&gt;</Format>:
     853      To activate a map view when viewing the document, go to the <AutoText key="glidict::CDM.FormatManager.Feature"/> section of the <AutoText key="glidict::GUI.Format"/> panel, and select the <b>display</b> format feature. In the editor below, enter the following format statement <i>after</i> the line <Format>&lt;gsf:option name="TOC" value="true" /&gt;</Format>:
    854854    </Text>
    855855    <Format>&lt;gsf:option name="mapEnabled" value="true" /&gt;</Format>
     
    859859  </NumberedItem>
    860860  <NumberedItem>
    861     <Text id="images-gps-18">Still in the <b>Format</b> panel press the <b>Preview Collection</b> button, browse or search to locate a document, and then view the document. The page that opens up shows a Google map with the location of the document (where the photo was taken), in addition to the screen-sized photo.</Text>
     861    <Text id="images-gps-18">Still in the <AutoText key="glidict::GUI.Format"/> panel press the <AutoText key="glidict::CreatePane.Preview_Collection" type="button"/> button, browse or search to locate a document, and then view the document. The page that opens up shows a Google map with the location of the document (where the photo was taken), in addition to the screen-sized photo.</Text>
    862862  </NumberedItem> 
    863863</Content>
     
    47764776<Text id="gli-oai-17">If you wish, you can now set up this collection in a manner similar to how the <b>backdrop</b> collection was set up in <TutorialRef id="simple_image_collection"/>. Don't forget to copy in any specific format statements, adjust them to use the <Format>ex.dc</Format> metadata instead of <Format>dc</Format> metadata, then <b>rebuild</b> and <b>preview</b> the collection.</Text>
    47774777</NumberedItem>
     4778</Content>
     4779</Tutorial>
     4780<Tutorial id="unknown_converter_plugin">
     4781<Title>
     4782<Text id="ucp-01">Using the UnknownConverterPlugin</Text>
     4783</Title>
     4784<SampleFiles folder="pdfbox"/>
     4785<Version initial="2.88" current="2.87"/>
     4786<Content>
     4787<Comment><Text id="ucp-02">The UnknownConverterPlugin builds on the idea of the UnknownPlugin, in that it can be configured to handle documents of unknown format and file extension. It can also be made to handle documents with known file extensions in a custom manner.</Text></Comment>
     4788<Comment><Text id="ucp-03">The UnknownConverterPlugin extends the UnknownPlugin's abilities by letting you launch a tool you have installed on your own operating system that can be run from the commandline to convert from the "unknown" file format to either text, html or gif/jpg/png images, or a folder of these. If you know how to launch this tool from the commandline to do the conversion, then you would configure the UnknownConverterPlugin by supplying the file format (file extension) of the documents it should process, the expected output file format (text, html or paged images), and the tool's conversion command that the UnknownConverterPlugin should launch to perform the conversion. In place of the input file and the output file or folder you provide placeholders in the command to run. Once configured, the UnknownConverterPlugin will be used during building to process documents that match the specified file format. The conversion tool will be launched with the command provided, and the expected output files as specified can then be processed by Greenstone in the usual manner.</Text></Comment>
     4789<Comment><Text id="ucp-04">An example would be djvu files, for which Greenstone provides no custom plugin. However, there's a free commandline tool available for unix systems that can convert from djvu to one of the text-based formats that Greenstone can process, text or html. So in this case, you could try using the UnknownConverterPlugin with the commandline tool on djvu files that you've gathered. The result should be that the djvu files in your collection are now searchable.</Text></Comment>
     4790<Comment><Text id="ucp-05">This part of the tutorial requires a Unix operating system. We're going to learn how to install the Icecite tool on a Linux system and then configure the UnknownConverterPlugin to use Icecite to process PDF files. Icecite (https://github.com/ckorzen/icecite) is an open-source tool that can do many things, including extracting text from a PDF.</Text></Comment>
     4791<Heading>
     4792<Text id="ucp-06">Using the Icecite tool to convert from PDF to text</Text>
     4793</Heading>
     4794<Comment>
     4795<Text id="ucp-07">As Icecite needs Java 8, you need to have either a JDK8 or a JRE8 installed in order to proceed with this portion of the tutorial.</Text>
     4796</Comment>
     4797<NumberedItem>
     4798<Text id="ucp-08">Grab the pre-compiled Icecite tarball from <Link>http://trac.greenstone.org/export/head/gs3-extensions/gs-icecite/gs-icecite.tar.gz</Link> and decompress it into your Greenstone installation's <Format>ext</Format> subfolder.</Text>
     4799<Text id="ucp-09">Now you're ready to test Icecite's PDF to text conversion abilities manually, by running Icecite from the command line.</Text>
     4800</NumberedItem>
     4801<NumberedItem>
     4802<Text id="ucp-10">Set up your environment for Java 8:</Text>
     4803<Format>export JAVA_HOME=/PATH/TO/YOUR-JAVA-8-HOME
     4804export PATH=$JAVA_HOME/bin:$PATH
     4805</Format>
     4806</NumberedItem>
     4807<NumberedItem>
     4808<Text id="ucp-11">You need to run Icecite from the same terminal in which you set up the Java 8 environment. Run it on a PDF as follows, after first replacing the &lt;PLACEHOLDERS&gt; below:</Text>
     4809<Format>java -classpath '.:/&lt;PATH-TO-GS-INSTALLATION&gt;/ext/icecite/gs-installed-jars/*:/&lt;PATH-TO-GS-INSTALLATION&gt;/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs &lt;/PATH/TO/YOUR.pdf&gt; &lt;/PATH/TO/CONVERTED.txt&gt;</Format>
     4810<Text id="ucp-12">It can take a while for the PDF to get converted, but once it has finished, you can inspect the text file produced, denoted by the placeholder string <Format>&lt;/PATH/TO/CONVERTED.txt&gt;</Format>.</Text>
     4811<Comment><Text id="ucp-13">You can experiment with using <Format>--feature words</Format> or <Format>--feature lines</Format> above, in place of <Format>--feature paragraphs</Format>, to find out the effect of such a change on the output file, particularly if <Format>--feature paragraphs</Format> does not produce the desired results for your PDFs.</Text></Comment>
     4812</NumberedItem>
     4813<Heading>
     4814<Text id="ucp-14">Using the UnknownConverterPlugin to launch Icecite from GLI to do the PDF to text conversion</Text>
     4815</Heading>
     4816<Comment><Text id="ucp-15">We're now ready to use the <AutoText text="UnknownConverterPlugin"/> to launch Icecite as the external tool to do the conversion, producing output that Greenstone's building scripts can ingest into Greenstone and index for searching.</Text></Comment>
     4817<NumberedItem><Text id="ucp-16">Run GLI</Text></NumberedItem>
     4818<NumberedItem><Text id="ucp-17">Create a new collection called Icecite. In the <AutoText key="glidict::GUI.Gather"/> pane, drop the sample PDF file into your collection.</Text></NumberedItem>
     4819<NumberedItem><Text id="ucp-18">Go to the <AutoText key="glidict::GUI.Design"/> pane and select <AutoText key="glidict::CDM.GUI.Plugins"/> from the list on the left. Add the <AutoText text="UnknownConverterPlugin"/>. Having tried out the Icecite conversion command manually in the previous part of this tutorial, we're now ready to use it when configuring the <AutoText text="UnknownConverterPlugin"/>. Click <AutoText key="glidict::CDM.PlugInManager.Configure" type="button"/> and set up the plugin with the following settings:</Text>
     4820<BulletList>
     4821<Bullet><Text id="ucp-19">set <Format>convert_to</Format> to the <Format>text</Format> option; this is the output format upon conversion</Text></Bullet>
     4822<Bullet><Text id="ucp-20">set <Format>mime type</Format> to <Format>application/pdf</Format></Text></Bullet>
     4823<Bullet><Text id="ucp-21">set <Format>process_extension</Format> to <Format>pdf</Format>; this is the input format of the files that this instance of the <AutoText text="UnknownConverterPlugin"/> will process</Text></Bullet>
     4824<Bullet><Text id="ucp-22">set the <Format>exec_cmd</Format> field to:</Text>
     4825<Text id="ucp-23"><Format>/PATH/TO/YOUR-JAVA-8-HOME/bin/java -classpath ':<MajorVersion number="2">%GSDLHOME</MajorVersion><MajorVersion number="3">%GSDL3SRCHOME</MajorVersion>/ext/icecite/gs-installed-jars/*:<MajorVersion number="2">%GSDLHOME</MajorVersion><MajorVersion number="3">%GSDL3SRCHOME</MajorVersion>/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs %INPUT_FILE %OUTPUT</Format></Text></Bullet>
     4826</BulletList>
     4827<Text id="ucp-24">Note: When filling in the <Format>exec_cmd</Format> field, leave the words with <Format>%</Format> signs in front of them intact. They are placeholders for Greenstone to replace.</Text>
     4828<Text id="ucp-25">However, you will need to adjust the above value for <Format>exec_cmd</Format> by finding out where your Java 8 is installed and replacing <Format>/PATH/TO/YOUR-JAVA-8-HOME</Format> with it. You need to provide the full path to the Java 8 executable because, at present, GLI binaries ship with Java 7, which is incompatible with the precompiled Icecite. And if you're running Greenstone from source, you may have a different version of Java set up in your environment too. By providing the full path to the Java 8 executable above, you force Icecite's PDF conversion program to run with Java 8.</Text>
     4829<Comment><Text id="ucp-26">If your Greenstone is installed in a location that contains spaces in the filepath, then ensure you have escaped double quotes (<Format>\&quot;</Format>) around each location referencing the Greenstone installation path except for the parameter value to <Format>-classpath</Format>.</Text></Comment>
     4830<Comment><Text id="ucp-27">The above command will use the java executable to run the Java Icecite program that does the actual PDF to text conversion. Greenstone will run the command given after first filling in <Format><MajorVersion number="2">%GSDLHOME</MajorVersion><MajorVersion number="3">%GSDL3SRCHOME</MajorVersion></Format>, <Format>%INPUT_FILE</Format> and <Format>%OUTPUT</Format> appropriately. <Format><MajorVersion number="2">%GSDLHOME</MajorVersion><MajorVersion number="3">%GSDL3SRCHOME</MajorVersion></Format> works out to be the Greenstone <MajorVersion number="2">2</MajorVersion><MajorVersion number="3">3</MajorVersion> installation directory, whereas <Format>%INPUT_FILE</Format> is whichever matching PDF it's processing and <Format>%OUTPUT</Format> is likewise the file (or folder of files) produced by the conversion process. In this case, the output type is txt, as that's what Icecite produces. Once the conversion to text has finished, Greenstone will be able to process it as usual, such as indexing the extracted contents to make the document searchable.</Text>
     4831</Comment>
     4832</NumberedItem>
     4833<NumberedItem><Text id="ucp-28">Having sufficiently configured the <AutoText text="UnknownConverterPlugin"/>, click the <AutoText key="glidict::General.OK" type="button"/> button to close its configuration dialog.</Text></NumberedItem>
     4834<NumberedItem><Text id="ucp-29">Select the <AutoText text="UnknownConverterPlugin"/> in the list of plugins and keep pressing the <AutoText key="glidict::CDM.Move.Move_Up" type="button"/> button to shift it upwards, until it appears in the plugin pipeline above the existing <AutoText text="PDFPlugin"/>, so that this instance of <AutoText text="UnknownConverterPlugin"/>, configured as it has now been to handle PDF files, will take precedence in processing such files.</Text></NumberedItem>
     4835<NumberedItem><Text id="ucp-30">Move to the <AutoText key="glidict::GUI.Create"/> pane and build the collection. As before, when the Icecite conversion utility is called by Greenstone's building process, the conversion will take some time. After a minute or so, the building will be done and you can preview the collection. Search for some terms.</Text></NumberedItem>
    47784836</Content>
    47794837</Tutorial>
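The placeholder handling described in ucp-24 and ucp-27 can be pictured with a small shell sketch. This is only an illustration of the idea, not Greenstone's own substitution code: `expand_exec_cmd` is a hypothetical helper, the sample paths are made up, and the sketch assumes none of the values contain `|`, `&` or backslashes (which `sed` would treat specially).

```shell
#!/bin/sh
# Hypothetical helper, for illustration only (not part of Greenstone).
# Usage: expand_exec_cmd TEMPLATE GS_HOME INPUT_FILE OUTPUT
# Replaces the %-prefixed placeholders in TEMPLATE with the given values
# and prints the resulting command line without running it.
expand_exec_cmd() {
    template=$1
    gs_home=$2
    input_file=$3
    output=$4
    # Assumption: values contain no '|', '&' or '\' characters.
    printf '%s\n' "$template" | sed \
        -e "s|%GSDL3SRCHOME|$gs_home|g" \
        -e "s|%GSDLHOME|$gs_home|g" \
        -e "s|%INPUT_FILE|$input_file|g" \
        -e "s|%OUTPUT|$output|g"
}

# Example call with made-up paths; the template is a shortened stand-in
# for the exec_cmd value given in the tutorial.
expand_exec_cmd \
    'java -classpath %GSDLHOME/ext/icecite/gs-installed-jars/* cli.PdfParserCommandLine --format txt %INPUT_FILE %OUTPUT' \
    /opt/greenstone /tmp/sample.pdf /tmp/sample.txt
```

Printing the expanded command instead of executing it makes it easy to check by eye that every placeholder was filled in before trying the real command in a terminal.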