source: gs3-extensions/gs-icecite/GS-Icecite-README@ 32024

Last change on this file since 32024 was 32024, checked in by ak19, 7 years ago

IceCite for Greenstone was built on 19 July 2017 on the research net linux machine. The version checked out from git and successfully compiled on 5 Oct 2017 produced strange sequences of alphanumeric characters interspersed with what could be the regular contents when run over the 24.pdf test file in step 4c. So committing the version compiled on 19 July instead, as it works.

IceCite for Greenstone was built on 19 July 2017 on the research net linux machine. The version checked out from git and successfully compiled on 5 Oct 2017 produced strange sequences of alphanumeric characters interspersed with what could be the regular contents when run over the 24.pdf test file in step 4c. So we've since committed the version compiled on 19 July instead.


LICENSE INFO

- Icecite has an Apache license: https://github.com/ckorzen/icecite/blob/master/LICENSE
  This is compatible with GPL3, which we use with GS3.

- The BouncyCastle jars used by Icecite have an MIT license, which Dr Bainbridge says we had already worked out to be compatible with the license we use for GS(3).
  https://www.bouncycastle.org/licence.html


USING THE ICECITE TOOL TO CONVERT FROM PDF TO TXT
- Icecite needs Java 8. For compiling, you need JDK 8; for running, either JDK 8 or JRE 8 will suffice.
- You will need maven installed.
- You will need to be able to run git commands.

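The prerequisites above can be checked up front. The following is a minimal sketch, assuming the three tools are invoked as java, mvn and git; the function name missing_tools is made up for illustration and is not part of Icecite or Greenstone.

```shell
# Print the name of every given command that cannot be found on the PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed command names for the three prerequisites listed above.
missing=$(missing_tools java mvn git)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
fi
```

This only confirms that the commands exist. It does not check that the java found is version 8.
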
1. To compile Icecite, first set up the environment for JDK 8:

    export JAVA_HOME=/opt/java8/
    export PATH=$JAVA_HOME/bin:$PATH

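To confirm that the exports took effect, the version reported by java can be inspected. This is a sketch under two assumptions: the version is on the first line of the "java -version" output (extra lines, such as those printed when JAVA_TOOL_OPTIONS is set, would break that), and Java 8 reports itself with a version string starting with 1.8. The function name is_java8_version is hypothetical.

```shell
# Succeed if the given line of "java -version" output reports Java 8,
# whose version string starts with 1.8 (e.g. java version "1.8.0_144").
# is_java8_version is a hypothetical helper name.
is_java8_version() {
    case "$1" in
        *'"1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# java -version writes to stderr, hence the 2>&1.
if command -v java >/dev/null 2>&1; then
    if is_java8_version "$(java -version 2>&1 | head -n 1)"; then
        echo "Java 8 is active"
    else
        echo "The java on the PATH is not Java 8"
    fi
fi
```
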
2. PROXY STEP WHEN ON MACHINES THAT AREN'T ON THE RESEARCH NET:

WARNING: Behind a proxy, it's hard to compile successfully. The build gets stuck and times out while downloading files, and it is a different file on each attempt to run "mvn install". By contrast, "mvn install" works fine on the research net linux machine and compiles relatively quickly, taking no more than a couple of minutes.

If you're behind a proxy, make sure you've set the https_proxy environment variable correctly.
The proxy also needs to be set for maven. Refer to http://maven.apache.org/guides/mini/guide-proxies.html and https://stackoverflow.com/questions/12807112/problems-after-maven-installation-mvn-install-tries-to-download-unreachable-fi

Create a settings.xml file if one does not already exist, put the contents shown in the maven guide into it and edit them to suit.

e.g. emacs ~/.m2/settings.xml

    <!--http://maven.apache.org/guides/mini/guide-proxies.html-->
    <settings>
      <proxies>
        <proxy>
          <id>example-proxy</id>
          <active>true</active>
          <protocol>http</protocol>
          <host>proxy.cms.waikato.ac.nz</host>
          <port>3128</port>
          <username>USERNAME</username>
          <password>PWD</password>
          <nonProxyHosts>www.waikato.ac.nz|*.greenstone.org</nonProxyHosts>
        </proxy>
      </proxies>
    </settings>

(Check the permissions. The mvn install step seems to require that all users have read access to settings.xml, but the file will need to be made private, as it contains the proxy password.)


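Before running "mvn install" behind a proxy, a quick check of the two proxy settings described above can save a failed attempt. This is only a sketch: settings_has_proxy is a hypothetical helper that looks for a <proxy> element in the file. It does not validate the XML, notice a commented-out entry, or test the connection.

```shell
# Succeed if the given maven settings file exists and mentions a <proxy>.
# settings_has_proxy is a hypothetical helper name.
settings_has_proxy() {
    [ -f "$1" ] && grep -q '<proxy>' "$1"
}

settings_file="$HOME/.m2/settings.xml"
if settings_has_proxy "$settings_file"; then
    echo "maven proxy configured in $settings_file"
    ls -l "$settings_file"    # shows who can currently read the file
else
    echo "no <proxy> entry found in $settings_file"
fi
[ -n "${https_proxy:-}" ] || echo "https_proxy is not set"
```
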
3. Then get and compile Icecite, following the instructions at https://github.com/ckorzen/icecite

    git clone https://github.com/ckorzen/icecite.git --recursive
    cd icecite
    git pull --recurse-submodules
    cd pdf-parent/
    mvn install


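One way to confirm that the build produced something runnable is to look for the PDF-CLI jar that the run commands in step 4 use. A minimal sketch, assuming icecite was cloned into the home folder as in the examples in these notes; find_cli_jar is a hypothetical helper name.

```shell
# Print the path of the PDF-CLI jar under the given icecite checkout, or
# fail if there isn't one. find_cli_jar is a hypothetical helper name; the
# jar location is taken from the run commands in these notes.
find_cli_jar() {
    for jar in "$1"/pdf-cli/target/pdf-cli-*-jar-with-dependencies.jar; do
        if [ -f "$jar" ]; then
            echo "$jar"
            return 0
        fi
    done
    return 1
}

# Assumes the checkout lives at ~/icecite.
find_cli_jar "$HOME/icecite" || echo "PDF-CLI jar not found; did mvn install succeed?"
```
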
4. Once compiled, run Icecite. The general instructions for running IceCite are at https://github.com/ckorzen/icecite

Remember, if you're running IceCite in a new terminal, ensure Java 8 is set up in the environment. This time around, it can be either JDK 8 or JRE 8.

    export JAVA_HOME=/opt/java8/
    export PATH=$JAVA_HOME/bin:$PATH


To use Icecite's PDF to text conversion, you will need its "PDF-CLI" (PDF command line interface). This is located in icecite's pdf-cli subfolder. So go there and run the conversion executable:

    cd ../../
    cd icecite/pdf-cli
    java -jar target/pdf-cli-*-jar-with-dependencies.jar [options] <input> [<output>]


Example ways of running it:
    ~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature words ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted1.txt

    ~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature lines ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted2.txt

    ~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature paragraphs ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted3.txt

(Also tried with the input file pdf01.pdf from the Reports collection.)

Use a terminal to try out each of the above.


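The three example runs differ only in the --feature value and the output file, so they can be wrapped in a loop. A sketch only: convert_all_features, the ICECITE_DRY_RUN variable and the output file naming are invented for illustration, while the java options are the ones used in the examples above.

```shell
# Convert one PDF to text once per feature (words, lines, paragraphs).
# convert_all_features and ICECITE_DRY_RUN are hypothetical names.
# Arguments: <path to PDF-CLI jar> <input pdf> <output folder>
convert_all_features() {
    base=$(basename "$2" .pdf)
    for feature in words lines paragraphs; do
        # With ICECITE_DRY_RUN set, the command is printed instead of run.
        ${ICECITE_DRY_RUN:+echo} java -jar "$1" --format txt \
            --feature "$feature" "$2" "$3/$base-$feature.txt"
    done
}

# Dry run: prints the three commands without needing java or the jar.
ICECITE_DRY_RUN=1
convert_all_features \
    target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    ~/Downloads/A9-access-best-practices.pdf ~/Desktop
unset ICECITE_DRY_RUN
```

With ICECITE_DRY_RUN unset, the same call runs the conversions for real. The relative jar path assumes it is run from within icecite/pdf-cli.
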
4. PDFBox failed to convert a problematic PDF file, 24.pdf, from a user on the mailing list. PDFBox's error message said there were no permissions to extract the contents of the PDF, yet Document Viewer and LibreOffice allowed text to be selected, copied and pasted from the PDF.

Running this file through icecite originally resulted in the exception

    Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
        at org.apache.pdfbox.pdmodel.encryption.PDEncryption.<init>(PDEncryption.java:96)
        at org.apache.pdfbox.pdfparser.PDFParser.prepareDecryption(PDFParser.java:282)
        at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:199)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:249)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:803)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
        at parser.pdfbox.core.PdfStreamEngine.processFile(PdfStreamEngine.java:120)
        at parser.pdfbox.PdfBoxParser.parse(PdfBoxParser.java:44)
        at cli.PdfParserCommandLine.parse(PdfParserCommandLine.java:268)
        at cli.PdfParserCommandLine.processFile(PdfParserCommandLine.java:247)
        at cli.PdfParserCommandLine.process(PdfParserCommandLine.java:233)
        at cli.PdfParserCommandLine.main(PdfParserCommandLine.java:168)
    Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider
        at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 13 more


The solution was to:
a. Create a new folder called "gs-installed-jars" inside the checked-out "icecite" folder.

b. Obtain the bouncycastle (encryption?) jar files from https://www.bouncycastle.org/latest_releases.html

Download both jar files listed under the "Provider" column for the row "JDK 1.5 - JDK 1.8" (not sure that both are necessary) and put them in the icecite/gs-installed-jars folder.

c. Then see https://stackoverflow.com/questions/15930782/call-java-jar-myfile-jar-with-additional-classpath-option
for how to run a java programme when you have multiple jar files on the classpath, as you can't combine -cp with -jar (java ignores the classpath option when -jar is given).


Therefore, now that we have the bouncycastle jar files, we run icecite's PDF-CLI to convert PDF docs to text as in the following example:

    java -classpath '.:/home/greenstone/icecite/gs-installed-jars/*:/home/greenstone/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature words ~/Desktop/24.pdf ~/Desktop/24converted.txt


Since we provide the absolute path to the jar nested within pdf-cli, we no longer need to cd into pdf-cli first to run the jar executable.

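The long classpath is easy to mistype, so it may help to build it from the location of the icecite checkout. A minimal sketch: the function names and the ICECITE_HOME variable are invented for illustration, and the jar name is the 0.0.1-SNAPSHOT build used throughout these notes.

```shell
# Build the classpath used in the command above from the location of the
# icecite checkout. icecite_classpath is a hypothetical helper name.
icecite_classpath() {
    # The * stays literal because it is inside double quotes: java itself
    # expands a trailing /* classpath entry to the jar files in that folder.
    printf '%s\n' ".:$1/gs-installed-jars/*:$1/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar"
}

# Run the converter from any folder. Arguments: <input pdf> <output txt>
# ICECITE_HOME is an assumed variable, defaulting to the path used above.
icecite_pdf_to_txt() {
    java -classpath "$(icecite_classpath "${ICECITE_HOME:-/home/greenstone/icecite}")" \
        cli.PdfParserCommandLine --format txt --feature words "$1" "$2"
}
```

For example: icecite_pdf_to_txt ~/Desktop/24.pdf ~/Desktop/24converted.txt
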