Changeset 34172
- Timestamp: 2020-06-14T19:11:13+12:00
- Location: main/trunk/greenstone2
- Files: 2 edited
main/trunk/greenstone2/collect/modelcol/etc/collectionConfig.xml
r34169 r34172
@@ -87,5 +87,5 @@
 <!-- Configuring an UnknownConverterPlugin for docx processing with Tika -->
 <plugin name="UnknownConverterPlugin">
-<option name="-exec_cmd" value="java -jar $GSDLHOME/ext/tika/tika-app-1.24.1.jar --html %%INPUT_FILE > %%OUTPUT"/>
+<option name="-exec_cmd" value="java -jar $GSDLHOME/ext/tika/tika-app-1.24.1.jar --html --pretty-print --encoding=UTF-8 %%INPUT_FILE > %%OUTPUT"/>
 <option name="-convert_to" value="html"/>
 <option name="-mime_type" value="application/vnd.openxmlformats-officedocument.wordprocessingml.document"/>
main/trunk/greenstone2/ext/tika/README.txt
r34171 r34172
@@ -14,5 +14,12 @@
 * Mime-types for docx and other office suite docs:
 https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc
 
+* Tesseract for OCR with Tika:
+https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
+Use Tika 1.14 to extract text from image by Tesseract OCR
+
+* API usage examples - if modifying Tika code:
+https://tika.apache.org/1.8/examples.html
+https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika
 
 --------------------------------------------------------------
@@ -78,3 +85,79 @@
 <ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>
 
+
 --------------------------------------------------------------
+D. THE --encoding= FLAG TO TIKA
+--------------------------------------------------------------
+> java -jar tika-app-1.24.1.jar --help
+...
+-eX or --encoding=X Use output encoding X
+...
+
+You can't specify invalid encodings (e.g. --encoding=nonexistent)
+It seems to be insensitive to case, e.g. --encoding=UTF-8, --encoding=utf-8, --encoding=iso-8859-1
+
+Since my tests have been to convert docs that contain ASCII using Tika,
+it's only obvious that the encoding flag has been taken into account in any way when the output is
+xhtml which is the default (or can pass in -x or --xml to get xhtml out).
+
+
+COMPARE, noting also the case of the encoding in the Tika command, vs in the output:
+
+(1) >java -jar tika-app-1.24.1.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
+<?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<meta name="date" content="2013-09-18T02:46:00Z"/>
+...
+
+(2) >java -jar tika-app-1.24.1.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
+<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+...
+
+(3) >java -jar tika-app-1.24.1.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
+<?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+...
+
+(4) >java -jar tika-app-1.24.1.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
+<?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+...
+
+(5) >java -jar tika-app-1.24.1.jar --encoding=nonexistent /Scratch/ak19/testword.docx
+Warning: The encoding 'nonexistent' is not supported by the Java runtime.
+Warning: encoding "nonexistent" not supported, using UTF-8
+<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+...
+
+(6) (Output to html)
+> java -jar tika-app-1.24.1.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
+Warning: The encoding 'nonexistent' is not supported by the Java runtime.
+Warning: encoding "nonexistent" not supported, using UTF-8
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+...
+The warning to STDERR is all that indicates that the encoding flag is taken into account
+when --html flag is turned. The actual html output sent to STDOUT makes no mention of any
+encoding in the file.
+
+(7) (Output to html case 2)
+> java -jar tika-app-1.24.1.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<meta name="date" content="2013-09-18T02:46:00Z"/>
+<meta name="Total-Time" content="5"/>
+...
+No warnings, but also no mention of the encoding in the html output.
+
+
+The warning messages in (6) indicate that the output encoding is also taken into account when
+the output format is set to html, by passing in the flag --html to tika.
+Since we use --html as the output format, and UTF-8 is the character encoding Greenstone prefers
+to work with, it therefore seems meaningful to set --encoding=UTF-8.
+
+Also passing in --pretty-print to get supposedly better formatted output.
+
+
+--------------------------------------------------------------
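
The README text in this changeset observes that with --html the output carries no encoding declaration, and that the tests used ASCII-only documents, for which UTF-8 and ISO-8859-1 output are byte-identical. One way to check the flag's effect directly might be to convert a document containing non-ASCII characters and test whether the resulting bytes are valid UTF-8. The following is a minimal sketch under that assumption and is not part of the changeset: the helper name is_utf8 and the file names are hypothetical, and it relies only on the standard iconv utility.

```shell
#!/bin/sh
# Hypothetical helper (not part of Greenstone or Tika):
# exits 0 if the file named by $1 is valid UTF-8, non-zero otherwise.
# iconv fails with an error when it meets a byte sequence that is
# not legal in the declared source encoding.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Assumed usage, with an input document that contains non-ASCII text
# (input.docx and out.html are placeholder names):
#   java -jar tika-app-1.24.1.jar --html --encoding=UTF-8 input.docx > out.html
#   is_utf8 out.html && echo "out.html is valid UTF-8"
```

Note that a pure-ASCII file always passes this check, so it only says something about --encoding when the input document contains characters outside ASCII.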