source: gs3-extensions/gs-icecite/GS-Icecite-README@ 32860

Last change on this file since 32860 was 32051, checked in by ak19, 7 years ago

Fixes to get icecite to convert PDFs to txt on Windows. See added sections in the GS-Icecite-README file committed

IceCite obtained from https://github.com/ckorzen/icecite

IceCite for Greenstone was built on 19 July 2017 on the research net Linux machine. A later version, checked out from git and compiled successfully on 5 Oct 2017, produced strange sequences of alphanumeric characters interspersed with what could be the regular contents when run over the 24.pdf test file (the problematic PDF described further below). So we've since committed the version compiled on 19 July instead, as its conversions contained fewer strange contents.


LICENSE INFO

- Icecite has an Apache license: https://github.com/ckorzen/icecite/blob/master/LICENSE
  This is compatible with GPL3, which we use with GS3.

- The BouncyCastle jars used by Icecite have an MIT license, which Dr Bainbridge says we once already worked out was compatible with the license we use for GS(3).
  https://www.bouncycastle.org/licence.html


USING THE ICECITE TOOL TO CONVERT FROM PDF TO TXT
- Icecite needs Java 8. For compiling you need JDK 8; for running, either JDK 8 or JRE 8 will suffice.
- You will need Maven installed.
- You will need to be able to run git commands.

1. In order to compile Icecite, you will have to set up the environment for JDK 8:

    export JAVA_HOME=/opt/java8
    export PATH=$JAVA_HOME/bin:$PATH
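
Before compiling, it can help to confirm that the prerequisites listed above are actually on the PATH and that the Java found there is version 8. The following is only a sketch: the helper names java_major_version and missing_tools are made up for illustration, and it assumes "java -version" prints the version as a quoted string on its first line.

```shell
# Hypothetical helper: extract the major Java version from a version string
# such as "1.8.0_144" (Java 8) or "11.0.2" (Java 11).
java_major_version() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;   # legacy scheme: 1.8.0_144 -> 8
        *)   echo "$1" | cut -d. -f1 ;;   # newer scheme: 11.0.2 -> 11
    esac
}

# Hypothetical helper: print each of the named tools that is NOT on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools java mvn git)
if [ -n "$missing" ]; then
    echo "Missing from PATH:" $missing
else
    # "java -version" writes to stderr; the version is the quoted string on line 1.
    version=$(java -version 2>&1 | head -n 1 | cut -d'"' -f2)
    echo "Java major version: $(java_major_version "$version") (Icecite needs 8)"
fi
```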

2. PROXY STEP WHEN ON MACHINES THAT AREN'T ON THE RESEARCH NET:

WARNING: Behind a proxy, it's hard to compile successfully. "mvn install" gets stuck timing out while trying to download files, and the files it fails on differ between attempts. On the research net Linux machine, by contrast, "mvn install" works fine and compiles relatively quickly, taking no more than a couple of minutes.

If you're behind a proxy, make sure you've set the https_proxy environment variable correctly.
The proxy also needs to be set for Maven. Refer to http://maven.apache.org/guides/mini/guide-proxies.html and https://stackoverflow.com/questions/12807112/problems-after-maven-installation-mvn-install-tries-to-download-unreachable-fi

You can create a settings.xml file, if one does not already exist, put the contents shown on the Maven proxies guide page into it, and edit them accordingly.

e.g. emacs ~/.m2/settings.xml

    <!--http://maven.apache.org/guides/mini/guide-proxies.html-->
    <settings>
      <proxies>
        <proxy>
          <id>example-proxy</id>
          <active>true</active>
          <protocol>http</protocol>
          <host>proxy.cms.waikato.ac.nz</host>
          <port>3128</port>
          <username>USERNAME</username>
          <password>PWD</password>
          <nonProxyHosts>www.waikato.ac.nz|*.greenstone.org</nonProxyHosts>
        </proxy>
      </proxies>
    </settings>

(Check the permissions. The mvn install step seems to require that all users have read access to settings.xml, but the file will need to be made private afterwards as it contains the proxy password.)
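
The permission changes described in the note above could be scripted roughly as follows. This is a sketch under assumptions: the helper names are invented for illustration, and opening the file up only for the duration of "mvn install" is one reading of the note, not something Maven documents. The demonstration runs on a scratch file rather than on the real settings.xml.

```shell
# Hypothetical helper: print the permission string of a file, e.g. "-rw-------".
perm_string() {
    ls -ld "$1" | cut -c1-10
}

# Hypothetical helpers: make the file readable by all users for the build,
# then private again because it holds the proxy password.
open_for_build() { chmod 644 "$1"; }
make_private()   { chmod 600 "$1"; }

# Demonstration on a scratch file.
demo=$(mktemp)
open_for_build "$demo"; perm_string "$demo"   # prints -rw-r--r--
make_private "$demo";   perm_string "$demo"   # prints -rw-------
rm -f "$demo"

# Intended use around the build (run from icecite/pdf-parent):
#   open_for_build ~/.m2/settings.xml
#   mvn install
#   make_private ~/.m2/settings.xml
```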


3. Then get and compile Icecite, following the instructions at https://github.com/ckorzen/icecite

    git clone https://github.com/ckorzen/icecite.git --recursive
    cd icecite
    git pull --recurse-submodules
    cd pdf-parent/
    mvn install
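
Once "mvn install" has finished, one way to confirm that the build produced the PDF-CLI jar used in the later steps is to look for it under pdf-cli/target. A minimal sketch: find_cli_jar is a made-up helper name, and the checkout is assumed to live in ~/icecite as in the examples in this file.

```shell
# Hypothetical helper: print the path of the PDF-CLI jar under an Icecite
# checkout ($1), or return non-zero if the build has not produced one.
find_cli_jar() {
    for jar in "$1"/pdf-cli/target/pdf-cli-*-jar-with-dependencies.jar; do
        if [ -f "$jar" ]; then
            echo "$jar"
            return 0
        fi
    done
    return 1
}

# Assumes the checkout is in ~/icecite; adjust the path if yours differs.
if jar=$(find_cli_jar "$HOME/icecite"); then
    echo "Built: $jar"
else
    echo "No PDF-CLI jar found; check the output of mvn install"
fi
```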


4. Once compiled, run Icecite. The general instructions for running IceCite are at https://github.com/ckorzen/icecite

Remember, if you're running IceCite in a new terminal, ensure Java 8 is set up in the environment. This time around, either JDK 8 or JRE 8 will do.

    export JAVA_HOME=/opt/java8/
    export PATH=$JAVA_HOME/bin:$PATH


To use Icecite's PDF to text conversion, you need its "PDF-CLI" (PDF command line interface), which is located in icecite's pdf-cli subfolder. So go there and run the conversion executable:

    cd ../../
    cd icecite/pdf-cli
    java -jar target/pdf-cli-*-jar-with-dependencies.jar [options] <input> [<output>]


Example ways of running it:
    ~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature words ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted1.txt

    ~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature lines ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted2.txt

    ~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature paragraphs ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted3.txt

(Also tried with input file pdf01.pdf from the Reports collection.)

Use a terminal to try out each of the above.
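
The three example runs differ only in the --feature value, so they could be wrapped in a small loop. This is a sketch, not part of Icecite: the function name, the JAVA override variable and the output naming scheme (<name>-<feature>.txt) are all assumptions made for illustration.

```shell
# Hypothetical helper: convert one PDF with each of the three text features.
#   $1 = PDF-CLI jar, $2 = input PDF, $3 = existing output directory
# The JAVA variable is an illustrative override for the java executable.
convert_all_features() {
    jar=$1; pdf=$2; outdir=$3
    base=$(basename "$pdf" .pdf)
    for feature in words lines paragraphs; do
        "${JAVA:-java}" -jar "$jar" --format txt --feature "$feature" \
            "$pdf" "$outdir/$base-$feature.txt" || return 1
    done
}

# Usage, from within ~/icecite/pdf-cli:
#   convert_all_features target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
#       ~/Downloads/A9-access-best-practices.pdf ~/Desktop
```

Setting JAVA=echo before calling the function prints the arguments of each run instead of executing it, which is a cheap way to check the paths first.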


5. PDFBox failed to convert a problematic PDF file, 24.pdf, from a user on the mailing list. PDFBox's error message said there were no permissions to extract the contents of the PDF, yet Document Viewer and LibreOffice allowed text to be selected, copied and pasted from the PDF.

Running this file through Icecite originally resulted in the exception

    Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
        at org.apache.pdfbox.pdmodel.encryption.PDEncryption.<init>(PDEncryption.java:96)
        at org.apache.pdfbox.pdfparser.PDFParser.prepareDecryption(PDFParser.java:282)
        at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:199)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:249)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:803)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
        at parser.pdfbox.core.PdfStreamEngine.processFile(PdfStreamEngine.java:120)
        at parser.pdfbox.PdfBoxParser.parse(PdfBoxParser.java:44)
        at cli.PdfParserCommandLine.parse(PdfParserCommandLine.java:268)
        at cli.PdfParserCommandLine.processFile(PdfParserCommandLine.java:247)
        at cli.PdfParserCommandLine.process(PdfParserCommandLine.java:233)
        at cli.PdfParserCommandLine.main(PdfParserCommandLine.java:168)
    Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider
        at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 13 more


The solution was to:
a. Create a new folder called "gs-installed-jars" inside the checked-out "icecite" folder.

b. Obtain the bouncycastle (encryption?) jar files from https://www.bouncycastle.org/latest_releases.html

Download both jar files listed under the "Provider" column for the row "JDK 1.5 - JDK 1.8" (not sure that both are necessary) and put them in the icecite/gs-installed-jars folder.

More information on the bouncycastle Java Cryptography APIs is at https://www.bouncycastle.org/java.html

c. Then see https://stackoverflow.com/questions/15930782/call-java-jar-myfile-jar-with-additional-classpath-option
for how to run a Java programme when you have multiple jar files on the classpath, since java ignores -cp when it is run with -jar.


Therefore, now that we have the bouncycastle jar files, we convert PDF docs to text by running Icecite's PDF-CLI as in the following example:

    java -classpath ':/home/greenstone/icecite/gs-installed-jars/*:/home/greenstone/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature words ~/Desktop/24.pdf ~/Desktop/24converted.txt


Since we provide the absolute path to the jar nested within pdf-cli, we no longer need to cd into pdf-cli first to run the jar executable.
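
To avoid retyping the long classpath, it could be assembled from the location of the Icecite checkout. A minimal sketch for Linux: icecite_classpath is a made-up helper name, and the jar version (0.0.1-SNAPSHOT) is taken from the example above and may differ in other builds.

```shell
# Hypothetical helper: print the Linux classpath for a given Icecite checkout ($1).
# The "*" is left unexpanded on purpose: java itself expands "dir/*" to all
# jar files in dir, so the value must stay quoted when passed to java.
icecite_classpath() {
    printf '%s\n' "$1/gs-installed-jars/*:$1/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar"
}

# Show the classpath for the checkout location used in the example above.
icecite_classpath /home/greenstone/icecite

# Usage:
#   java -classpath "$(icecite_classpath /home/greenstone/icecite)" \
#       cli.PdfParserCommandLine --format txt --feature words \
#       ~/Desktop/24.pdf ~/Desktop/24converted.txt
```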


6. In order to get IceCite built on Linux to work on Windows, to convert PDF to txt, make the following 2 changes to both of the following Java files, found in icecite/commons/src/main/java/de/freiburg/iif/path/

- PathUtils.java
- LineReader.java

Changes to make:
a. Add the import statement
    import java.net.URISyntaxException;

b. Replace
    Path jarFile = Paths.get(codeSource.getLocation().getPath());
with
    // GREENSTONE MOD:
    // The following line causes problem on Windows with parsing
    // the cmdline args when running pdf-cli jar:
    //Path jarFile = Paths.get(codeSource.getLocation().getPath());
    // See https://stackoverflow.com/questions/43972777/exception-in-thread-main-java-nio-file-invalidpathexception-illegal-char
    // for the error message and solution
    Path jarFile = null;
    try {
        String jarPath = Paths.get(codeSource.getLocation().toURI()).toString();
        jarFile = Paths.get(jarPath);
    } catch(URISyntaxException e) {
        System.err.println("**** URISyntaxException. Couldn't convert CodeSource URL to URI: " + codeSource.getLocation());
        // fallback to old way that works on linux, since declaring this method as
        // "throws URISyntaxException" will require dealing with that bubbled up
        // exception in all calling methods. As this appears to be a common utility
        // method, that could make for a lot of calling code that needs editing
        jarFile = Paths.get(codeSource.getLocation().getPath());
    }

c. When running on either Linux or Windows, provide the full filepaths to both input and output files. Using ~/ in filepaths on Linux, to denote the home folder, is alright.
A Windows command looks as follows. Note the double quotes in place of single ones around the classpath value, and the Windows path separator (;) in the classpath. The backslashes in the classpath can also be written as forward slashes:

    java -classpath "C:\Path\to\GS3\ext\icecite\gs-installed-jars\*;C:\Path\to\GS3\icecite\pdf-cli\target\pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar" cli.PdfParserCommandLine --format txt --feature words C:\Path\to\24.pdf C:\Path\to\24converted.txt