Timestamp:
2017-10-24T15:49:19+13:00
Author:
ak19
Message:

Updating instructions to UnknownConverterPlugin tutorial now that the tutorial finally worked on Windows, both the djvu and icecite portions.

File:
1 edited

  • documentation/trunk/tutorials/xml-source/tutorial_en.xml

    r32044 r32053  
    48174817<Format>djvutxt input.djvu output.txt</Format>
    48184818<Text id="ucp-22">Open a DOS prompt on Windows or a terminal on Mac/Linux and experiment to see what it takes to convert your Greenstone installation's <Format>web/sites/localsite/collect/DjVuColl/superhero.djvu</Format> file.</Text>
    4819 <Text id="ucp-22a">You may have to invoke <Format>djvutxt</Format> using its full filepath, in which case the command would look like:</Text>
     4819<Text id="ucp-22a">You may have to invoke <Format>djvutxt</Format> using its full filepath, in which case on Windows the command would look like:</Text>
     4820<Format>C:\PATH\TO\YOUR\djvutxt C:\PATH\TO\GS\web\sites\localsite\collect\DjVuColl\superhero.djvu C:\PATH\TO\YOUR\GS\superhero.txt</Format>
     4821<Text id="ucp-22b">while on Unix systems the command would look like:</Text>
    48204822<Format>/PATH/TO/YOUR/djvutxt /PATH/TO/GS/web/sites/localsite/collect/DjVuColl/superhero.djvu /PATH/TO/YOUR/GS/superhero.txt</Format>
    48214823<Text id="ucp-23">Once you have the command working, inspect the output file. You should see mostly legible text in it. Only when you've been able to successfully complete this step should you proceed to the next steps.</Text>
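
The "mostly legible text" check in ucp-23 can also be done roughly by script. The following is a minimal sketch, not part of Greenstone or DjVuLibre: the function names and the 0.9 threshold are assumptions chosen for illustration, and a low score only suggests that the conversion command needs another look.

```python
from pathlib import Path


def legible_fraction(text: str) -> float:
    """Return the share of characters that are printable or whitespace.

    The Unicode replacement character (U+FFFD) is counted as illegible,
    because it marks bytes that could not be decoded.
    """
    if not text:
        return 0.0
    legible = sum(
        1 for ch in text
        if ch != "\ufffd" and (ch.isprintable() or ch.isspace())
    )
    return legible / len(text)


def looks_legible(path: str, threshold: float = 0.9) -> bool:
    """Heuristic only: True if most of the file decodes to readable text.

    The threshold is an assumed value, not one taken from the tutorial.
    """
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return legible_fraction(text) >= threshold
```

For example, `looks_legible("superhero.txt")` could be called on the file produced by `djvutxt` before moving on to the next steps.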
     
    48384840</NumberedItem>
    48394841<Heading><Text id="ucp-34">Associating an icon with DjVu documents in Greenstone</Text></Heading>
    4840 <NumberedItem><Text id="ucp-35">When previewing the search result, you may notice that there's no proper icon for the superhero.djvu document. The Greenstone extracted text variant of the document has an icon, a plain text one. However, the <Format>superhero.djvu</Format> has the "unknown document format" icon, the one with the question mark on it. We can change this.</Text>
     4842<NumberedItem><Text id="ucp-35">When previewing the search result, you may notice that there's no proper icon for the document <Format>superhero.djvu</Format>. The text variant of the document that Greenstone extracted has an icon, a plain text one. However, <Format>superhero.djvu</Format> itself has the "unknown document format" icon, the one with the question mark on it. We can change this.</Text>
    48414843</NumberedItem>
    48424844<NumberedItem><Text id="ucp-36">Go back to the <AutoText key="glidict::GUI.Design"/> pane to configure your <AutoText text="UnknownConverterPlugin"/> once more. This time, enable the <Format>srcicon</Format> field and set its value to <Format>icondjvu</Format>.</Text>
     
    48774879</NumberedItem>
    48784880<NumberedItem>
    4879 <Text id="ucp-61">You would need to run Icecite from the terminal wherein you set up the Java 8 environment. Run it on a PDF as follows, after first replacing the <Format>&lt;PLACEHOLDERS&gt;</Format> below:</Text>
    4880 <Format>java -classpath '.:/&lt;PATH-TO-GS-INSTALLTION&gt;/ext/icecite/gs-installed-jars/*:/&lt;PATH-TO-GS-INSTALLTION&gt;/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs &lt;/PATH/TO/YOUR.pdf&gt; &lt;/PATH/TO/CONVERTED.txt&gt;</Format>
     4881<Text id="ucp-61">You would need to run Icecite from the terminal wherein you set up the Java 8 environment. Run it on a PDF as follows, after first replacing the <Format>&lt;PLACEHOLDERS&gt;</Format> below.</Text>
     4882<BulletList>
     4883<Bullet>
     4884<Text id="ucp-61a">On Windows, the command will look as follows. Note the use of <i>double quotes</i> around the <Format>classpath</Format> value and the use of the semicolon as the path separator:</Text>
     4885<Format>java -classpath "&lt;DRIVE:\PATH-TO-GS-INSTALLATION&gt;\ext\icecite\gs-installed-jars\*;&lt;DRIVE:\PATH-TO-GS-INSTALLATION&gt;\ext\icecite\pdf-cli\target\pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar" cli.PdfParserCommandLine --format txt --feature paragraphs &lt;DRIVE:\FULL\PATH\TO\YOUR.pdf&gt; &lt;DRIVE:\FULL\PATH\TO\CONVERTED.txt&gt;</Format>
     4886</Bullet>
     4887<Bullet>
     4888<Text id="ucp-61b">On Unix systems, the command will be of the following form, where single quotes are acceptable around the value for <Format>classpath</Format> and where colon is the path separator:</Text>
     4889<Format>java -classpath '/&lt;PATH-TO-GS-INSTALLATION&gt;/ext/icecite/gs-installed-jars/*:/&lt;PATH-TO-GS-INSTALLATION&gt;/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs &lt;/PATH/TO/YOUR.pdf&gt; &lt;/PATH/TO/CONVERTED.txt&gt;</Format>
     4890</Bullet>
     4891</BulletList>
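
The two commands above differ only in the directory separator and in the character that joins the classpath entries. The following is a minimal sketch of that difference, with a hypothetical helper name (`icecite_classpath`) and made-up installation paths in the usage note. It builds only the classpath string and does not run Java.

```python
import ntpath
import posixpath

# Relative locations of the Icecite jars, as given in the tutorial commands.
JAR_DIR = ("ext", "icecite", "gs-installed-jars", "*")
CLI_JAR = (
    "ext", "icecite", "pdf-cli", "target",
    "pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar",
)


def icecite_classpath(gs_home: str, windows: bool) -> str:
    """Join the two classpath entries in the style of the target system.

    Windows uses backslashes inside paths and ';' between entries;
    Unix systems use forward slashes and ':'.
    """
    path_mod = ntpath if windows else posixpath
    separator = ";" if windows else ":"
    entries = [
        path_mod.join(gs_home, *JAR_DIR),
        path_mod.join(gs_home, *CLI_JAR),
    ]
    return separator.join(entries)
```

Calling `icecite_classpath(r"C:\GS", windows=True)` yields a value to wrap in double quotes, while `icecite_classpath("/opt/gs", windows=False)` yields one that can go in single quotes. Both paths are placeholders for a real installation directory.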
    48814892<Text id="ucp-62">It can take a while for the PDF to get converted, but once it has finished, you can inspect the text file produced, denoted by the placeholder string <Format>&lt;/PATH/TO/CONVERTED.txt&gt;</Format>.</Text>
    48824893<Comment><Text id="ucp-63">You can experiment with using <Format>--feature words</Format> or <Format>--feature lines</Format> above, in place of <Format>--feature paragraphs</Format>, to find out the effect of such a change on the output file, particularly if <Format>--feature paragraphs</Format> does not produce the desired results for your PDFs.</Text></Comment>
     
    48944905<Bullet><Text id="ucp-71">set <Format>srcicon</Format> to <Format>iconpdf</Format>, since Greenstone already knows about this macro, already has an icon for PDFs and knows to associate the two</Text></Bullet>
    48954906<Bullet><Text id="ucp-72">set <Format>process_extension</Format> to <Format>pdf</Format>, which is the input format of the files that this instance of the <AutoText text="UnknownConverterPlugin"/> will process</Text></Bullet>
    4896 <Bullet><Text id="ucp-73">set the <Format>exec_cmd</Format> field to:</Text>
    4897 <Format>/PATH/TO/YOUR-JAVA-8-HOME/bin/java -classpath ':<MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion>/ext/icecite/gs-installed-jars/*:<MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion>/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs %%INPUT_FILE %%OUTPUT</Format></Bullet>
     4907<Bullet><Text id="ucp-73">set the <Format>exec_cmd</Format> field as follows, depending on your operating system:</Text>
     4908<BulletList>
     4909<Bullet>
     4910<Text id="ucp-73a">on Windows:</Text>
     4911<Format>DRIVE:\PATH\TO\YOUR-JAVA-8-HOME\bin\java -classpath "<MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion>\ext\icecite\gs-installed-jars\*;<MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion>\ext\icecite\pdf-cli\target\pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar" cli.PdfParserCommandLine --format txt --feature paragraphs %%INPUT_FILE %%OUTPUT</Format>
     4912</Bullet>
     4913<Bullet>
     4914<Text id="ucp-73b">on Unix systems:</Text>
     4915<Format>/PATH/TO/YOUR-JAVA-8-HOME/bin/java -classpath '<MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion>/ext/icecite/gs-installed-jars/*:<MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion>/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs %%INPUT_FILE %%OUTPUT</Format>
     4916</Bullet>
     4917</BulletList>
     4918</Bullet>
    48984919</BulletList>
    48994920<Text id="ucp-74">Note: When filling in the <Format>exec_cmd</Format> field, leave the words with <Format>%%</Format> signs in front of them intact. They are placeholders for Greenstone to replace.</Text>
    4900 <Text id="ucp-75">However, you will need to adjust the above value for <Format>exec_cmd</Format> by finding out where your Java 8 is installed and replacing <Format>/PATH/TO/YOUR-JAVA-8-HOME</Format> with it. The reason you need to provide the full path to the Java 8 executable is because, at present, GLI binaries ship with Java 7, which is incompatible with the precompiled Icecite. And if you're running Greenstone from a source, you may have a different version of Java set up in your environment too. However, by providing the full path to the Java 8 executable above, you force Icecite's PDF conversion program to run with Java 8.</Text>
    4901 <Comment><Text id="ucp-76">If your Greenstone is installed in a location that contains spaces in the filepath, then ensure you have escaped double quotes (<Format>\&quot;</Format>) around each location referencing the Greenstone installation path except for the parameter value to <Format>-classpath</Format>.</Text></Comment>
     4921<Text id="ucp-75">You will, however, need to adjust the above value for <Format>exec_cmd</Format> by finding out where your Java 8 is installed and replacing <Format>/PATH/TO/YOUR-JAVA-8-HOME</Format> (<Format>DRIVE:\PATH\TO\YOUR-JAVA-8-HOME</Format> on Windows) with it. You need to provide the full path to the Java 8 executable because, at present, GLI binaries ship with Java 7, which is incompatible with the precompiled Icecite. If you're running Greenstone from source, you may also have a different version of Java set up in your environment. By providing the full path to the Java 8 executable above, you force Icecite's PDF conversion program to run with Java 8.</Text>
     4922<Comment><Text id="ucp-76">On Windows, if there are spaces in any filepaths in the command, <i>other than</i> in the parameter value to <Format>-classpath</Format>, remember to enclose those filepaths in double quotes escaped with a backslash, <Format>\"</Format>.</Text></Comment>
    49024923<Comment><Text id="ucp-77">The above command will use the java executable to run the Java Icecite program that does the actual PDF to text conversion. Greenstone will run the command given after first filling in the <Format><MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion></Format>, <Format>%%INPUT_FILE</Format> and <Format>%%OUTPUT</Format> appropriately. <Format><MajorVersion number="2">%%GSDLHOME</MajorVersion><MajorVersion number="3">%%GSDL3SRCHOME</MajorVersion></Format> works out to be the Greenstone <MajorVersion number="2">2</MajorVersion><MajorVersion number="3">3</MajorVersion> installation directory, whereas <Format>%%INPUT_FILE</Format> is whichever matching PDF it's processing and <Format>%%OUTPUT</Format> is likewise the file (or folder of files) produced by the conversion process. In this case, the output type is txt, as that's what Icecite produces. Once the conversion to text has finished, Greenstone will be able to process it as usual, such as indexing the extracted contents to make the document searchable.</Text>
    49034924</Comment>
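
The placeholder filling described in ucp-77 can be pictured with a short sketch. This is an illustration only and not Greenstone's actual implementation. The function name `fill_placeholders`, the shortened command template and the example paths are all assumptions.

```python
import re
from typing import Mapping


def fill_placeholders(template: str, values: Mapping[str, str]) -> str:
    """Replace each %%NAME in the template with its value in one pass.

    Longer names are tried first so that a name which is a prefix of
    another cannot match too early. A replacement function is used so
    that backslashes in Windows paths are inserted literally.
    """
    if not values:
        return template
    names = sorted(values, key=len, reverse=True)
    pattern = re.compile("%%(" + "|".join(re.escape(n) for n in names) + ")")
    return pattern.sub(lambda match: values[match.group(1)], template)


# Hypothetical, shortened exec_cmd value in the Unix style.
template = (
    "java -classpath '%%GSDLHOME/ext/icecite/gs-installed-jars/*' "
    "cli.PdfParserCommandLine --format txt %%INPUT_FILE %%OUTPUT"
)
command = fill_placeholders(
    template,
    {
        "GSDLHOME": "/opt/gs",        # assumed installation directory
        "INPUT_FILE": "/tmp/in.pdf",  # assumed PDF being processed
        "OUTPUT": "/tmp/out.txt",     # assumed conversion output
    },
)
```

For a Greenstone 3 command the same idea applies with `GSDL3SRCHOME` in place of `GSDLHOME`.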