Changeset 35401 for main/trunk/greenstone2/ext/tika/README.txt
Timestamp: 2021-09-15T11:58:11+12:00
File: 1 edited
main/trunk/greenstone2/ext/tika/README.txt
--- main/trunk/greenstone2/ext/tika/README.txt (r34172)
+++ main/trunk/greenstone2/ext/tika/README.txt (r35401)
@@ -1,2 +1,14 @@
+--------------------------------------------------------------
+About tika-app.jar:
+--------------------------------------------------------------
+Last updated version is currently 1.24.1 (tika-app-1.24.1.jar)
+which can be found in the final line of output of running:
+java -jar %GSDLHOME%\ext\tika\tika-app.jar --version
+on Windows:
+or on Linux,
+java -jar $GSDLHOME/ext/tika/tika-app.jar --version
+
+
+
 --------------------------------------------------------------
 A. Some background information on Apache Tika and related:
@@ -28,26 +40,26 @@
 1. HTML:
 
-GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm
+GS3/gs2build/ext/tika>java -jar tika-app.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm
 
 2. XHTML - looks the same as HTML:
 
-GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
+GS3/gs2build/ext/tika>java -jar tika-app.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
 
 3. PLAIN TEXT CONTENT - NO META:
 
-GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
+GS3/gs2build/ext/tika>java -jar tika-app.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
 
 a. PLAIN TEXT WITH META:
 
-GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
+GS3/gs2build/ext/tika>java -jar tika-app.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
 
 b. JUST META:
 
-GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html)
+GS3/gs2build/ext/tika>java -jar tika-app.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html)
 
 4. IMAGES CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):
 
 Extracts all attachments (images etc) into specified dir (-z or --extract and then specify a dir for it)
-GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
+GS3/gs2build/ext/tika>java -jar tika-app.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
 
 
@@ -55,5 +67,5 @@
 C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
 --------------------------------------------------------------
-* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
+* GS3/gs2build/ext/tika>java -jar tika-app.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
 
 INFO As a convenience, TikaCLI has turned on extraction of
@@ -72,5 +84,5 @@
 
 
-* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx
+* GS3/gs2build/ext/tika>java -jar tika-app.jar --text-main /PATH/TO/testword.docx
 
 Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
@@ -89,5 +101,5 @@
 D. THE --encoding= FLAG TO TIKA
 --------------------------------------------------------------
-> java -jar tika-app-1.24.1.jar --help
+> java -jar tika-app.jar --help
 ...
 -eX or --encoding=X Use output encoding X
@@ -104,5 +116,5 @@
 COMPARE, noting also the case of the encoding in the Tika command, vs in the output:
 
-(1) >java -jar tika-app-1.24.1.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
+(1) >java -jar tika-app.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
 <?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
 <head>
@@ -110,20 +122,20 @@
 ...
 
-(2) >java -jar tika-app-1.24.1.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
+(2) >java -jar tika-app.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
 <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
 <head>
 ...
 
-(3) >java -jar tika-app-1.24.1.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
+(3) >java -jar tika-app.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
 <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
 <head>
 ...
 
-(4) >java -jar tika-app-1.24.1.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
+(4) >java -jar tika-app.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
 <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
 <head>
 ...
 
-(5) >java -jar tika-app-1.24.1.jar --encoding=nonexistent /Scratch/ak19/testword.docx
+(5) >java -jar tika-app.jar --encoding=nonexistent /Scratch/ak19/testword.docx
 Warning: The encoding 'nonexistent' is not supported by the Java runtime.
 Warning: encoding "nonexistent" not supported, using UTF-8
@@ -133,5 +145,5 @@
 
 (6) (Output to html)
-> java -jar tika-app-1.24.1.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
+> java -jar tika-app.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
 Warning: The encoding 'nonexistent' is not supported by the Java runtime.
 Warning: encoding "nonexistent" not supported, using UTF-8
@@ -144,5 +156,5 @@
 
 (7) (Output to html case 2)
-> java -jar tika-app-1.24.1.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
+> java -jar tika-app.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
 <html xmlns="http://www.w3.org/1999/xhtml">
 <head>
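Because the jar is now referenced by the version-less name tika-app.jar, the README's "About tika-app.jar" section relies on running it with --version and reading the final line of output to learn which release is bundled. The following is a minimal sketch of that check in Python, chosen here only because it behaves the same on Windows and Linux. The function names are hypothetical and not part of Greenstone or Tika, and the sketch assumes the version line is written to standard output while any warnings go to standard error.

```python
import os
import subprocess


def tika_jar_path(gsdlhome):
    """Build the path to the version-less tika-app.jar under GSDLHOME."""
    return os.path.join(gsdlhome, "ext", "tika", "tika-app.jar")


def last_nonempty_line(output):
    """Return the last non-blank line of some command output, or '' if none."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def tika_version_line(gsdlhome):
    """Run `java -jar tika-app.jar --version` and return its final output line.

    Assumption: the version is printed on stdout; stderr is captured
    separately so that warnings cannot end up as the last line.
    """
    result = subprocess.run(
        ["java", "-jar", tika_jar_path(gsdlhome), "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    return last_nonempty_line(result.stdout)
```

With Java on the PATH and GSDLHOME set, calling tika_version_line(os.environ["GSDLHOME"]) would return whatever the jar reports on its last line; the exact wording of that line is not assumed here.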