Changeset 32030

Timestamp:
06.10.2017 21:49:03
Author:
ak19
Message:

Minor improvements to the newly added UnknownConverterPlugin tutorial, before adding the new section, as that has yet to be formatted according to the tutorial XML markup.

Files:
1 modified

  • documentation/trunk/tutorials/xml-source/tutorial_en.xml

    r32027 r32030  
    47884788<Comment><Text id="ucp-03">The UnknownConverterPlugin extends the UnknownPlugin's abilities by letting you launch a tool you have installed on your own operating system that can be run from the commandline to convert from the "unknown" file format to either text, html or gif/jpg/png images, or a folder of these. If you know how to launch this tool from the commandline to do the conversion, then you would configure the UnknownConverterPlugin by supplying the file format (file extension) of the documents it should process, the expected output file format (text, html or paged images), and the tool's conversion command that the UnknownConverterPlugin should launch to perform the conversion. In place of the input file and the output file or folder you provide placeholders in the command to run. Once configured, the UnknownConverterPlugin will be used during building to process documents that match the specified file format. The conversion tool will be launched with the command provided, and the expected output files as specified can then be processed by Greenstone in the usual manner.</Text></Comment> 
    47894789<Comment><Text id="ucp-04">An example would be djvu files, for which Greenstone provides no custom plugin. However, there's a free commandline tool available for unix systems that can convert from djvu to one of the text based format that Greenstone can process, text or html. So in this case, you could try using the UnknownConverterPlugin with the commandline tool on djvu files that you've gathered. The result should be that the djvu in your collection are now searchable.</Text></Comment> 
    4790 <Comment><Text id="ucp-05">This part of the tutorial requires you to be working on a Unix operating system. In this part of the tutorial, we're going to learn how to install the Icecite tool on a Linux system and then configure the UnknownConverterPlugin to use Icecite to process PDF files. Icecite (https://github.com/ckorzen/icecite) is an open-source tool that can do many things, including extracting text from a PDF.</Text></Comment> 
     4790<Comment><Text id="ucp-05"><Link url="https://github.com/ckorzen/icecite">Icecite</Link> is an open-source tool that can do many PDF related tasks, including extracting text from a PDF. In this part of the tutorial, we're going to learn how to run Icecite's PDF to text conversion utility from the command line. Based on that command, we'll configure the UnknownConverterPlugin to launch Icecite from GLI, to do the conversion on a PDF document in a Greenstone collection. This ends up being a useful exercise in instances where certain PDFs aren't recognised by Greenstone's PDFPlugin, even when <Format>pdfbox_conversion</Format> option (which uses the PDFBox tool for the conversion) is switched on. In such cases, you can use what you learn here.</Text></Comment> 
    47914791<Heading> 
    47924792<Text id="ucp-06">Using the Icecite tool to convert from PDF to text</Text> 
     
    48064806</NumberedItem> 
    48074807<NumberedItem> 
    4808 <Text id="ucp-11">You would need to run Icecite from the terminal wherein you set up the Java 8 environment. Run it on a PDF as follows, after first replacing the &lt;PLACEHOLDERS&gt; below:</Text> 
     4808<Text id="ucp-11">You would need to run Icecite from the terminal wherein you set up the Java 8 environment. Run it on a PDF as follows, after first replacing the <Format>&lt;PLACEHOLDERS&gt;</Format> below:</Text> 
    48094809<Format>java -classpath '.:/&lt;PATH-TO-GS-INSTALLTION&gt;/ext/icecite/gs-installed-jars/*:/&lt;PATH-TO-GS-INSTALLTION&gt;/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs &lt;/PATH/TO/YOUR.pdf&gt; &lt;/PATH/TO/CONVERTED.txt&gt;</Format> 
    48104810<Text id="ucp-12">It can take a while for the PDF to get converted, but once it has finished, you can inspect the text file produced, denoted by the placeholder string <Format>&lt;/PATH/TO/CONVERTED.txt&gt;</Format></Text> 
     
    48204820<BulletList> 
    48214821<Bullet><Text id="ucp-19">set <Format>convert_to</Format> to the <Format>text</Format> option, this is the output format upon conversion</Text></Bullet> 
    4822 <Bullet><Text id="ucp-20">set <Format>mime type</Format> to <Format>application/pdf</Format></Text></Bullet> 
     4822<Bullet><Text id="ucp-20">set <Format>mime_type</Format> to <Format>application/pdf</Format></Text></Bullet> 
     4823<Bullet><Text id="ucp-20a">set <Format>srcicon</Format> to the <Format>iconpdf</Format>, since Greenstone already knows about this macro and already has an icon for PDFs and knows to associate the two</Text></Bullet> 
    48234824<Bullet><Text id="ucp-21">set <Format>process_extension</Format> to <Format>pdf</Format>, this is the input format of the files that this instance of the <AutoText text="UnknownConverterPlugin"/> will process</Text></Bullet> 
    48244825<Bullet><Text id="ucp-22">set the <Format>exec_cmd</Format> field to:</Text> 
    4825 <Text id="ucp-23"><Format>/PATH/TO/YOUR-JAVA-8-HOME/bin/java -classpath ':<MajorVersion number="2">%GSDLHOME</MajorVersion><MajorVersion number="3">%GSDL3SRCHOME</MajorVersion>/ext/icecite/gs-installed-jars/*:<MajorVersion number="2">%GSDLHOME</MajorVersion><MajorVersion number="3">%GSDL3SRCHOME</MajorVersion>/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs %INPUT_FILE %OUTPUT</Format></Text></Bullet> 
     4826<Format>/PATH/TO/YOUR-JAVA-8-HOME/bin/java -classpath ':<MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion>/ext/icecite/gs-installed-jars/*:<MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion>/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs %%INPUT_FILE %%OUTPUT</Format></Bullet> 
    48264827</BulletList> 
    4827 <Text id="ucp-24">Note: When filling in the <Format>exec_cmd</Format> field, leave the words with <Format>%</Format> signs in front of them intact. They are placeholders for Greenstone to replace.</Text> 
     4828<Text id="ucp-24">Note: When filling in the <Format>exec_cmd</Format> field, leave the words with <Format>%%</Format> signs in front of them intact. They are placeholders for Greenstone to replace.</Text> 
    48284829<Text id="ucp-25">However, you will need to adjust the above value for <Format>exec_cmd</Format> by finding out where your Java 8 is installed and replacing <Format>/PATH/TO/YOUR-JAVA-8-HOME</Format> with it. The reason you need to provide the full path to the Java 8 executable is because, at present, GLI binaries ship with Java 7, which is incompatible with the precompiled Icecite. And if you're running Greenstone from a source, you may have a different version of Java set up in your environment too. However, by providing the full path to the Java 8 executable above, you force Icecite's PDF conversion program to run with Java 8.</Text> 
    48294830<Comment><Text id="ucp-26">If your Greenstone is installed in a location that contains spaces in the filepath, then ensure you have escaped double quotes (<Format>\&quot;</Format>) around each location referencing the Greenstone installation path except for the parameter value to <Format>-classpath</Format>.</Text></Comment> 
    4830 <Comment><Text id="ucp-27">The above command will use the java executable to run the java Icecite program that does the actual PDF to text conversion. Greenstone will run the command given after first filling in the <Format><MajorVersion number="2">%GSDLHOME</MajorVersion><MajorVersion number="3">%GSDL3SRCHOME</MajorVersion></Format>, <Format>%INPUT_FILE</Format> and <Format>%OUTPUT</Format> appropriately. <Format><MajorVersion number="2">%GSDLHOME</MajorVersion><MajorVersion number="3">%GSDL3SRCHOME</MajorVersion></Format> works out to be the Greenstone <MajorVersion number="2">2</MajorVersion><MajorVersion number="3">3</MajorVersion> installation directory, whereas <Format>%INPUT_FILE</Format> is whichever matching PDF it's processing and <Format>%OUTPUT</Format> is likewise the file (or folder of files) produced by the conversion process. In this case, the output type is txt, as that's what Icecite produces. Once the conversion to text has finished, Greenstone will be able to process it as usual, such as indexing the extracted contents to make the document searchable.</Text> 
     4831<Comment><Text id="ucp-27">The above command will use the java executable to run the java Icecite program that does the actual PDF to text conversion. Greenstone will run the command given after first filling in the <Format><MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion></Format>, <Format>%%INPUT_FILE</Format> and <Format>%%OUTPUT</Format> appropriately. <Format><MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion></Format> works out to be the Greenstone <MajorVersion number="2">2</MajorVersion><MajorVersion number="3">3</MajorVersion> installation directory, whereas <Format>%%INPUT_FILE</Format> is whichever matching PDF it's processing and <Format>%%OUTPUT</Format> is likewise the file (or folder of files) produced by the conversion process. In this case, the output type is txt, as that's what Icecite produces. Once the conversion to text has finished, Greenstone will be able to process it as usual, such as indexing the extracted contents to make the document searchable.</Text> 
    48314832</Comment> 
    48324833</NumberedItem>
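
The tutorial text in this changeset says Greenstone fills in the placeholders in `exec_cmd` before launching the command. The following is a minimal Python sketch of that kind of substitution, not Greenstone's own code (the plugin's actual implementation is not shown in this changeset). The function name `fill_placeholders` and the example paths are hypothetical. The placeholder spellings are copied from the added revision of the XML, and whether the plugin itself expects one or two `%` signs is an assumption here. Only the Greenstone 2 variant (`%%GSDLHOME`) is shown.

```python
import re


def fill_placeholders(exec_cmd, values):
    """Replace every placeholder key found in exec_cmd with its value, in one pass.

    Hypothetical helper for illustration only; not part of Greenstone.
    """
    # Try longer keys first so a shorter placeholder can never shadow a longer one.
    keys = sorted(values, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # A function replacement avoids any backslash interpretation in the values.
    return pattern.sub(lambda m: values[m.group(0)], exec_cmd)


# The exec_cmd value from the tutorial, with the Java home left as its placeholder.
exec_cmd = (
    "/PATH/TO/YOUR-JAVA-8-HOME/bin/java -classpath "
    "':%%GSDLHOME/ext/icecite/gs-installed-jars/*:"
    "%%GSDLHOME/ext/icecite/pdf-cli/target/"
    "pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' "
    "cli.PdfParserCommandLine --format txt --feature paragraphs "
    "%%INPUT_FILE %%OUTPUT"
)

# Hypothetical values standing in for what Greenstone would supply at build time.
values = {
    "%%GSDLHOME": "/opt/greenstone",
    "%%INPUT_FILE": "/tmp/in/sample.pdf",
    "%%OUTPUT": "/tmp/out/sample.txt",
}

filled = fill_placeholders(exec_cmd, values)
print(filled)
```

The single-pass regular expression means a substituted value is never rescanned for further placeholders, which keeps the result predictable even if a file path happens to contain a `%` sign.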