Changeset 34175 for gs2-extensions
- Timestamp: 2020-06-15T03:28:28+12:00
- File: 1 edited
gs2-extensions/gstika/trunk/GS_TIKA_README.txt
r34174  r34175
    28      28    1. HTML:
    29      29
    30          - GS3/gs2build/ext/tika>java -jar tika-app-*.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm
            30  + GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm
    31      31
    32      32    2. XHTML - looks the same as HTML:
    33      33
    34          - GS3/gs2build/ext/tika>java -jar tika-app-*.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
            34  + GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
    35      35
    36      36    3. PLAIN TEXT CONTENT - NO META:
    37      37
    38          - GS3/gs2build/ext/tika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
            38  + GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
    39      39
    40      40    a. PLAIN TEXT WITH META:
    41      41
    42          - GS3/gs2build/ext/tika>java -jar tika-app-*.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
            42  + GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
    43      43
    44      44    b. JUST META:
    45      45
    46          - GS3/gs2build/ext/tika>java -jar tika-app-*.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html)
            46  + GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html)
    47      47
    48      48    4. IMAGES CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):
    49      49
    50      50    Extracts all attachments (images etc) into specified dir (-z or --extract and then specify a dir for it)
    51          - GS3/gs2build/ext/tika>java -jar tika-app-*.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
            51  + GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
    52      52
    53      53
     …       …
    55      55    C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
    56      56    --------------------------------------------------------------
    57          - * GS3/gs2build/ext/tika>java -jar tika-app-*.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
            57  + * GS3/gs2build/ext/gstika>java -jar tika-app-*.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
    58      58
    59      59    INFO As a convenience, TikaCLI has turned on extraction of
     …       …
    72      72
    73      73
    74          - * GS3/gs2build/ext/tika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx
            74  + * GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx
    75      75
    76      76    Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
     …       …
   186     186    2. It stands alone and can be compiled and run against the tika-app-*.jar file on the classpath:
   187     187    To compile
   188          - GS3/gs2build/ext/tika>javac -cp `pwd`/tika-app-*.jar org/greenstone/tika/GSTikaCLI.java
           188  + GS3/gs2build/ext/gstika>javac -cp `pwd`/lib/tika-app-*.jar org/greenstone/tika/GSTikaCLI.java
   189     189    To run:
   190          - GS3/gs2build/ext/tika>java -cp "`pwd`/tika-app-*.jar:." org.greenstone.tika.GSTikaCLI --html-with-images <inputfilepath> > output.html
           190  + GS3/gs2build/ext/gstika>java -cp "`pwd`/lib/tika-app-*.jar:." org.greenstone.tika.GSTikaCLI --html-with-images <inputfilepath> > output.html
   191     191
   192     192    (Can pass existing flags, e.g. --html for html without images extracted)
     …       …
   194     194    To compile code that lives in a directory called "src" and compile it into a directory called "build":
   195     195
   196          - GS3/gs2build/ext/tika>javac -cp `pwd`/tika-app-*.jar -d `pwd`/build src/org/greenstone/tika/GSTikaCLI.java
           196  + GS3/gs2build/ext/gstika>javac -cp `pwd`/lib/tika-app-*.jar -d `pwd`/build src/org/greenstone/tika/GSTikaCLI.java
   197     197
   198     198    To run the compiled class that's now in folder "build":
   199          - GS3/gs2build/ext/tika>javac -cp "`pwd`/tika-app-*.jar:`pwd`/build" --html-with-images <inputfilepath> > output.html
           199  + GS3/gs2build/ext/gstika>javac -cp "`pwd`/lib/tika-app-*.jar:`pwd`/build" --html-with-images <inputfilepath> > output.html
   200     200
   201     201
     …       …
   215     215
   216     216
   217          - cd gs2build/ext/tika
           217  + cd gs2build/ext/gstika
   218     218    ./makeGSTikaCLI.sh
   219     219    ./GSTikaCLI.sh --html-with-images <inputfile> > <outputfile>
   220     220    e.g. ./GSTikaCLI.sh --html-with-imgs --pretty-print --encoding=UTF-8 tmp/<file>.docx > tmp/<file>.html
           221  +
           222  +
   221     223    --------------------------------------------------------------
   222     224    F. COMPILING TIKA FROM SOURCE
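
The single-output invocations in sections 1 to 3 of the README differ only in the flag passed to the Tika app jar, so they can be summarised in one small helper. The sketch below is illustrative only and is not part of this changeset: the function name `tika_cmd` and its mode names are hypothetical, and it assumes the command is meant to be run from GS3/gs2build/ext/gstika with `tika-app-*.jar` resolvable there, as in the README lines. It prints the command line rather than running it, so it can be inspected first.

```shell
#!/bin/sh
# Hypothetical helper (not in the repository): print the Tika command line
# for one of the README's single-output modes instead of executing it.
# Assumption: the printed command is run from GS3/gs2build/ext/gstika.
tika_cmd() {
    mode=$1    # one of: html, xml, text, text-main, metadata
    input=$2   # path to the input document, e.g. /PATH/TO/testword.docx
    case "$mode" in
        html|xml|text|text-main|metadata)
            # The jar glob is printed literally; the shell expands it only
            # when the printed command is actually run.
            printf 'java -jar tika-app-*.jar --%s %s\n' "$mode" "$input"
            ;;
        *)
            printf 'unknown mode: %s\n' "$mode" >&2
            return 1
            ;;
    esac
}
```

For example, `tika_cmd text-main /PATH/TO/testword.docx` prints the plain-text invocation from section 3; redirect the output of the real command to a file under ext/tmp as the README does.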