Changeset 32032 for documentation/trunk


Timestamp:
2017-10-06T23:15:13+13:00
Author:
ak19
Message:

Adjusting the numbering of the Text elements in the latter (Icecite) section of the UnknownConverterPlugin to accommodate the numbering of the new DjVu section that has preceded it.
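
A renumbering like this one is done so that every Text id in the tutorial stays unique. As a minimal sketch of how that could be checked, the snippet below scans XML source for repeated Text ids. The function name `find_duplicate_text_ids` and the sample strings are hypothetical illustrations and are not part of the Greenstone documentation tooling.

```python
import re
from collections import Counter

# Matches the id attribute of a <Text ...> element, e.g. <Text id="ucp-46">.
TEXT_ID_PATTERN = re.compile(r'<Text\b[^>]*\bid="([^"]+)"')


def find_duplicate_text_ids(xml_source: str) -> list[str]:
    """Return the Text ids that occur more than once, in sorted order."""
    counts = Counter(TEXT_ID_PATTERN.findall(xml_source))
    return sorted(text_id for text_id, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Made-up sample: two Text elements share an id, as could happen when
    # a new section is inserted ahead of an already-numbered one.
    sample = (
        '<Text id="ucp-05">first section text</Text>\n'
        '<Text id="ucp-06">more first section text</Text>\n'
        '<Text id="ucp-06">heading of the later section</Text>\n'
    )
    print(find_duplicate_text_ids(sample))
```

A regular expression is used here rather than an XML parser so that the check does not depend on how the file's entities or custom elements are declared; it only looks at Text start tags.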

File:
1 edited

  • documentation/trunk/tutorials/xml-source/tutorial_en.xml

    r32031 r32032  
    48554855</NumberedItem>
    48564856<Heading>
    4857 <Text id="ucp-06">Using the Icecite's commandline tool to convert from PDF to text</Text>
    4858 </Heading>
    4859 <Comment><Text id="ucp-05"><Link url="https://github.com/ckorzen/icecite">Icecite</Link> is an open-source tool that can do many PDF related tasks, including extracting text from a PDF. In this part of the tutorial, we're going to learn how to run Icecite's PDF to text conversion utility from the command line. Based on that command, we'll configure the UnknownConverterPlugin to launch Icecite from GLI, to do the conversion on a PDF document in a Greenstone collection. This ends up being a useful exercise in instances where certain PDFs aren't recognised by Greenstone's PDFPlugin, even when <Format>pdfbox_conversion</Format> option (which uses the PDFBox tool for the conversion) is switched on. In such cases, you can use what you learn here.</Text></Comment>
    4860 <Comment>
    4861 <Text id="ucp-07">As Icecite needs Java 8, you need to have either a JDK8 or a JRE8 installed in order to proceed with this portion of the tutorial.</Text>
    4862 </Comment>
    4863 <NumberedItem>
    4864 <Text id="ucp-08">Grab the pre-compiled Icecite tarball from <Link>http://trac.greenstone.org/export/head/gs3-extensions/gs-icecite/gs-icecite.tar.gz</Link> and decompress it into your Greenstone installation's <Format>ext</Format> subfolder.</Text>
    4865 <Text id="ucp-09">Now you're ready to test Icecite's PDF to text conversion abilities manually, by running Icecite from the command line.</Text>
    4866 </NumberedItem>
    4867 <NumberedItem>
    4868 <Text id="ucp-10">Set up your environment for Java 8:</Text>
     4857<Text id="ucp-46">Using the Icecite's commandline tool to convert from PDF to text</Text>
     4858</Heading>
     4859<Comment><Text id="ucp-47"><Link url="https://github.com/ckorzen/icecite">Icecite</Link> is an open-source tool that can do many PDF related tasks, including extracting text from a PDF. In this part of the tutorial, we're going to learn how to run Icecite's PDF to text conversion utility from the command line. Based on that command, we'll configure the UnknownConverterPlugin to launch Icecite from GLI, to do the conversion on a PDF document in a Greenstone collection. This ends up being a useful exercise in instances where certain PDFs aren't recognised by Greenstone's PDFPlugin, even when <Format>pdfbox_conversion</Format> option (which uses the PDFBox tool for the conversion) is switched on. In such cases, you can use what you learn here.</Text></Comment>
     4860<Comment>
     4861<Text id="ucp-48">As Icecite needs Java 8, you need to have either a JDK8 or a JRE8 installed in order to proceed with this portion of the tutorial.</Text>
     4862</Comment>
     4863<NumberedItem>
     4864<Text id="ucp-49">Grab the pre-compiled Icecite tarball from <Link>http://trac.greenstone.org/export/head/gs3-extensions/gs-icecite/gs-icecite.tar.gz</Link> and decompress it into your Greenstone installation's <Format>ext</Format> subfolder.</Text>
     4865<Text id="ucp-50">Now you're ready to test Icecite's PDF to text conversion abilities manually, by running Icecite from the command line.</Text>
     4866</NumberedItem>
     4867<NumberedItem>
     4868<Text id="ucp-60">Set up your environment for Java 8:</Text>
    48694869<Format>export JAVA_HOME=/PATH/TO/YOUR-JAVA-8-HOME
    48704870export PATH=$JAVA_HOME/bin:$PATH
     
    48724872</NumberedItem>
    48734873<NumberedItem>
    4874 <Text id="ucp-11">You would need to run Icecite from the terminal wherein you set up the Java 8 environment. Run it on a PDF as follows, after first replacing the <Format>&lt;PLACEHOLDERS&gt;</Format> below:</Text>
     4874<Text id="ucp-61">You would need to run Icecite from the terminal wherein you set up the Java 8 environment. Run it on a PDF as follows, after first replacing the <Format>&lt;PLACEHOLDERS&gt;</Format> below:</Text>
    48754875<Format>java -classpath '.:/&lt;PATH-TO-GS-INSTALLTION&gt;/ext/icecite/gs-installed-jars/*:/&lt;PATH-TO-GS-INSTALLTION&gt;/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs &lt;/PATH/TO/YOUR.pdf&gt; &lt;/PATH/TO/CONVERTED.txt&gt;</Format>
    4876 <Text id="ucp-12">It can take a while for the PDF to get converted, but once it has finished, you can inspect the text file produced, denoted by the placeholder string <Format>&lt;/PATH/TO/CONVERTED.txt&gt;</Format></Text>
    4877 <Comment><Text id="ucp-13">You can experiment with using <Format>--feature words</Format> or <Format>--feature lines</Format> above, in place of <Format>--feature paragraphs</Format>, to find out the effect of such a change on the output file, particularly if <Format>--feature paragraphs</Format> does not produce the desired results for your PDFs.</Text></Comment>
    4878 </NumberedItem>
    4879 <Heading>
    4880 <Text id="ucp-14">Using the UnknownConverterPlugin to launch Icecite from GLI to do the PDF to text conversion</Text>
    4881 </Heading>
    4882 <Comment><Text id="ucp-15">We're now ready to use the <AutoText text="UnknownConverterPlugin"/> to launch Icecite as the external tool to do the conversion, producing output that Greenstone's building scripts can ingest into Greenstone and index for searching.</Text></Comment>
    4883 <NumberedItem><Text id="ucp-16">Run GLI</Text></NumberedItem>
    4884 <NumberedItem><Text id="ucp-17">Create a new collection called Icecite. In the <AutoText key="glidict::GUI.Gather"/> pane, drop in the sample PDF file into your collection.</Text></NumberedItem>
    4885 <NumberedItem><Text id="ucp-18">In the <AutoText key="glidict::GUI.Design"/> pane and select <AutoText key="glidict::CDM.GUI.Plugins"/> from the list on the left. Add the <AutoText text="UnknownConverterPlugin"/>. Having tried out the Icecite conversion command manually in the previous part of this tutorial, we're now ready to use it when configuring the <AutoText text="UnknownConverterPlugin"/>. Click <AutoText key="glidict::CDM.PlugInManager.Configure" type="button"/> and set up the plugin with the following settings:</Text>
     4876<Text id="ucp-62">It can take a while for the PDF to get converted, but once it has finished, you can inspect the text file produced, denoted by the placeholder string <Format>&lt;/PATH/TO/CONVERTED.txt&gt;</Format></Text>
     4877<Comment><Text id="ucp-63">You can experiment with using <Format>--feature words</Format> or <Format>--feature lines</Format> above, in place of <Format>--feature paragraphs</Format>, to find out the effect of such a change on the output file, particularly if <Format>--feature paragraphs</Format> does not produce the desired results for your PDFs.</Text></Comment>
     4878</NumberedItem>
     4879<Heading>
     4880<Text id="ucp-64">Using the UnknownConverterPlugin to launch Icecite from GLI to do the PDF to text conversion</Text>
     4881</Heading>
     4882<Comment><Text id="ucp-65">We're now ready to use the <AutoText text="UnknownConverterPlugin"/> to launch Icecite as the external tool to do the conversion, producing output that Greenstone's building scripts can ingest into Greenstone and index for searching.</Text></Comment>
     4883<NumberedItem><Text id="ucp-66">Run GLI</Text></NumberedItem>
     4884<NumberedItem><Text id="ucp-67">Create a new collection called Icecite. In the <AutoText key="glidict::GUI.Gather"/> pane, drop in the sample PDF file into your collection.</Text></NumberedItem>
     4885<NumberedItem><Text id="ucp-68">In the <AutoText key="glidict::GUI.Design"/> pane and select <AutoText key="glidict::CDM.GUI.Plugins"/> from the list on the left. Add the <AutoText text="UnknownConverterPlugin"/>. Having tried out the Icecite conversion command manually in the previous part of this tutorial, we're now ready to use it when configuring the <AutoText text="UnknownConverterPlugin"/>. Click <AutoText key="glidict::CDM.PlugInManager.Configure" type="button"/> and set up the plugin with the following settings:</Text>
    48864886<BulletList>
    4887 <Bullet><Text id="ucp-19">set <Format>convert_to</Format> to the <Format>text</Format> option, this is the output format upon conversion</Text></Bullet>
    4888 <Bullet><Text id="ucp-20">set <Format>mime_type</Format> to <Format>application/pdf</Format></Text></Bullet>
    4889 <Bullet><Text id="ucp-20a">set <Format>srcicon</Format> to the <Format>iconpdf</Format>, since Greenstone already knows about this macro and already has an icon for PDFs and knows to associate the two</Text></Bullet>
    4890 <Bullet><Text id="ucp-21">set <Format>process_extension</Format> to <Format>pdf</Format>, this is the input format of the files that this instance of the <AutoText text="UnknownConverterPlugin"/> will process</Text></Bullet>
    4891 <Bullet><Text id="ucp-22">set the <Format>exec_cmd</Format> field to:</Text>
     4887<Bullet><Text id="ucp-69">set <Format>convert_to</Format> to the <Format>text</Format> option, this is the output format upon conversion</Text></Bullet>
     4888<Bullet><Text id="ucp-70">set <Format>mime_type</Format> to <Format>application/pdf</Format></Text></Bullet>
     4889<Bullet><Text id="ucp-71">set <Format>srcicon</Format> to the <Format>iconpdf</Format>, since Greenstone already knows about this macro and already has an icon for PDFs and knows to associate the two</Text></Bullet>
     4890<Bullet><Text id="ucp-72">set <Format>process_extension</Format> to <Format>pdf</Format>, this is the input format of the files that this instance of the <AutoText text="UnknownConverterPlugin"/> will process</Text></Bullet>
     4891<Bullet><Text id="ucp-73">set the <Format>exec_cmd</Format> field to:</Text>
    48924892<Format>/PATH/TO/YOUR-JAVA-8-HOME/bin/java -classpath ':<MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion>/ext/icecite/gs-installed-jars/*:<MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion>/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs %%INPUT_FILE %%OUTPUT</Format></Bullet>
    48934893</BulletList>
    4894 <Text id="ucp-24">Note: When filling in the <Format>exec_cmd</Format> field, leave the words with <Format>%%</Format> signs in front of them intact. They are placeholders for Greenstone to replace.</Text>
    4895 <Text id="ucp-25">However, you will need to adjust the above value for <Format>exec_cmd</Format> by finding out where your Java 8 is installed and replacing <Format>/PATH/TO/YOUR-JAVA-8-HOME</Format> with it. The reason you need to provide the full path to the Java 8 executable is because, at present, GLI binaries ship with Java 7, which is incompatible with the precompiled Icecite. And if you're running Greenstone from a source, you may have a different version of Java set up in your environment too. However, by providing the full path to the Java 8 executable above, you force Icecite's PDF conversion program to run with Java 8.</Text>
    4896 <Comment><Text id="ucp-26">If your Greenstone is installed in a location that contains spaces in the filepath, then ensure you have escaped double quotes (<Format>\&quot;</Format>) around each location referencing the Greenstone installation path except for the parameter value to <Format>-classpath</Format>.</Text></Comment>
    4897 <Comment><Text id="ucp-27">The above command will use the java executable to run the java Icecite program that does the actual PDF to text conversion. Greenstone will run the command given after first filling in the <Format><MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion></Format>, <Format>%%INPUT_FILE</Format> and <Format>%%OUTPUT</Format> appropriately. <Format><MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion></Format> works out to be the Greenstone <MajorVersion number="2">2</MajorVersion><MajorVersion number="3">3</MajorVersion> installation directory, whereas <Format>%%INPUT_FILE</Format> is whichever matching PDF it's processing and <Format>%%OUTPUT</Format> is likewise the file (or folder of files) produced by the conversion process. In this case, the output type is txt, as that's what Icecite produces. Once the conversion to text has finished, Greenstone will be able to process it as usual, such as indexing the extracted contents to make the document searchable.</Text>
    4898 </Comment>
    4899 </NumberedItem>
    4900 <NumberedItem><Text id="ucp-28">Having sufficiently configured the <AutoText text="UnknownConverterPlugin"/>, click the <AutoText key="glidict::General.OK" type="button"/> button to close its configuration dialog.</Text></NumberedItem>
    4901 <NumberedItem><Text id="ucp-29">Select the <AutoText text="UnknownConverterPlugin"/> in the list of plugins and keep pressing the <AutoText key="glidict::CDM.Move.Move_Up" type="button"/> button to shift it upwards, until it appears in the plugin pipeline above the existing <AutoText text="PDFPlugin"/>, so that this instance of <AutoText text="UnknownConverterPlugin"/>, configured as it has now been to handle PDF files, will take precedence in processing such files.</Text></NumberedItem>
    4902 <NumberedItem><Text id="ucp-30">Move to the <AutoText key="glidict::GUI.Create"/> pane and build the collection. Once more, when Icecite conversion utility is called by Greenstone's building process, the conversion will take some time processing. But after a minute or so, the building will be done and you can Preview the collection. Search for some terms.</Text></NumberedItem>
     4894<Text id="ucp-74">Note: When filling in the <Format>exec_cmd</Format> field, leave the words with <Format>%%</Format> signs in front of them intact. They are placeholders for Greenstone to replace.</Text>
     4895<Text id="ucp-75">However, you will need to adjust the above value for <Format>exec_cmd</Format> by finding out where your Java 8 is installed and replacing <Format>/PATH/TO/YOUR-JAVA-8-HOME</Format> with it. The reason you need to provide the full path to the Java 8 executable is because, at present, GLI binaries ship with Java 7, which is incompatible with the precompiled Icecite. And if you're running Greenstone from a source, you may have a different version of Java set up in your environment too. However, by providing the full path to the Java 8 executable above, you force Icecite's PDF conversion program to run with Java 8.</Text>
     4896<Comment><Text id="ucp-76">If your Greenstone is installed in a location that contains spaces in the filepath, then ensure you have escaped double quotes (<Format>\&quot;</Format>) around each location referencing the Greenstone installation path except for the parameter value to <Format>-classpath</Format>.</Text></Comment>
     4897<Comment><Text id="ucp-77">The above command will use the java executable to run the java Icecite program that does the actual PDF to text conversion. Greenstone will run the command given after first filling in the <Format><MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion></Format>, <Format>%%INPUT_FILE</Format> and <Format>%%OUTPUT</Format> appropriately. <Format><MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion></Format> works out to be the Greenstone <MajorVersion number="2">2</MajorVersion><MajorVersion number="3">3</MajorVersion> installation directory, whereas <Format>%%INPUT_FILE</Format> is whichever matching PDF it's processing and <Format>%%OUTPUT</Format> is likewise the file (or folder of files) produced by the conversion process. In this case, the output type is txt, as that's what Icecite produces. Once the conversion to text has finished, Greenstone will be able to process it as usual, such as indexing the extracted contents to make the document searchable.</Text>
     4898</Comment>
     4899</NumberedItem>
     4900<NumberedItem><Text id="ucp-78">Having sufficiently configured the <AutoText text="UnknownConverterPlugin"/>, click the <AutoText key="glidict::General.OK" type="button"/> button to close its configuration dialog.</Text></NumberedItem>
     4901<NumberedItem><Text id="ucp-79">Select the <AutoText text="UnknownConverterPlugin"/> in the list of plugins and keep pressing the <AutoText key="glidict::CDM.Move.Move_Up" type="button"/> button to shift it upwards, until it appears in the plugin pipeline above the existing <AutoText text="PDFPlugin"/>, so that this instance of <AutoText text="UnknownConverterPlugin"/>, configured as it has now been to handle PDF files, will take precedence in processing such files.</Text></NumberedItem>
     4902<NumberedItem><Text id="ucp-80">Move to the <AutoText key="glidict::GUI.Create"/> pane and build the collection. Once more, when Icecite conversion utility is called by Greenstone's building process, the conversion will take some time processing. But after a minute or so, the building will be done and you can Preview the collection. Search for some terms.</Text></NumberedItem>
    49034903</Content>
    49044904</Tutorial>
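
The tutorial text in this diff says that Greenstone fills in the `%%`-prefixed words in `exec_cmd` before running the command. The following is only an illustrative sketch of that kind of substitution, not Greenstone's actual implementation. The function name `expand_exec_cmd` and all paths in the example are assumptions made up for the illustration.

```python
import re

# A placeholder is "%%" followed by an upper-case name such as INPUT_FILE.
PLACEHOLDER = re.compile(r"%%([A-Z0-9_]+)")


def expand_exec_cmd(template: str, values: dict[str, str]) -> str:
    """Replace each %%NAME in the template with values[NAME].

    Placeholders without a supplied value are left untouched.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return values.get(name, match.group(0))

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    # Shortened form of the exec_cmd value from the tutorial.
    template = (
        "java -classpath '%%GSDL3SRCHOME/ext/icecite/gs-installed-jars/*' "
        "cli.PdfParserCommandLine --format txt --feature paragraphs "
        "%%INPUT_FILE %%OUTPUT"
    )
    # Hypothetical values; a real build would supply its own paths.
    values = {
        "GSDL3SRCHOME": "/opt/greenstone3",
        "INPUT_FILE": "/tmp/sample.pdf",
        "OUTPUT": "/tmp/sample.txt",
    }
    print(expand_exec_cmd(template, values))
```

Doing the replacement in a single regular-expression pass means a substituted path that happens to contain `%%` is not expanded a second time.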