--------------------------------------------------------------
A. Some background information on Apache Tika and related:
--------------------------------------------------------------
* https://tika.apache.org/1.5/gettingstarted.html
  Refer to the heading "Using Tika as a command line utility" for the available command line options.

* https://tika.apache.org/download.html is where the tika-app-1.24.1.jar was downloaded from.
  (We don't need any of the other jars, as explained under the heading "Build artifacts" at
  https://tika.apache.org/1.5/gettingstarted.html)

* Apache 2.0 license
  https://tika.apache.org/license.html

* Mime-types for docx and other office suite docs:
  https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc

* Tesseract for OCR with Tika:
  https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
  Use Tika 1.14 to extract text from image by Tesseract OCR

* API usage examples - if modifying Tika code:
  https://tika.apache.org/1.8/examples.html
  https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika

--------------------------------------------------------------
B. Here are some examples of running Tika on the command line:
--------------------------------------------------------------
1. HTML:
   GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm

2. XHTML - looks the same as HTML:
   GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

3. PLAIN TEXT CONTENT - NO META:
   GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

   a. PLAIN TEXT WITH META:
      GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

   b. JUST META:
      GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

4. IMAGES (CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):
   Extracts all attachments (images etc) into the specified dir (-z or --extract, and then specify a dir for it with --extract-dir)
   GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx

--------------------------------------------------------------
C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
--------------------------------------------------------------
* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
  INFO  As a convenience, TikaCLI has turned on extraction of
  inline images for the PDFParser (TIKA-2374).
  Aside from the -z option, this is not the default behavior
  in Tika generally or in tika-server.
  Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
  WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
  See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
  for optional dependencies.
  Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
  WARNING: org.xerial's sqlite-jdbc is not loaded.
  Please provide the jar on your classpath to parse sqlite files.
  See tika-parsers/pom.xml for the correct version.

* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx
  Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
  WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
  See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
  for optional dependencies.
  Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
  WARNING: org.xerial's sqlite-jdbc is not loaded.
  Please provide the jar on your classpath to parse sqlite files.
  See tika-parsers/pom.xml for the correct version.

--------------------------------------------------------------
D. THE --encoding= FLAG TO TIKA
--------------------------------------------------------------
> java -jar tika-app-1.24.1.jar --help
    ...
    -eX or --encoding=X    Use output encoding X
    ...

You can't specify invalid encodings (e.g. --encoding=nonexistent): Tika warns and falls back to UTF-8, see (5) below.
The flag seems to be case-insensitive, e.g. --encoding=UTF-8, --encoding=utf-8, --encoding=iso-8859-1

Since my tests only converted docs containing ASCII, the only case where the encoding flag
visibly takes effect is when the output is XHTML, which is the default
(or pass in -x or --xml to get XHTML out).

COMPARE, noting also the case of the encoding in the Tika command vs in the output:

(1) >java -jar tika-app-1.24.1.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    ...

(2) >java -jar tika-app-1.24.1.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
    ...

(3) >java -jar tika-app-1.24.1.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    ...

(4) >java -jar tika-app-1.24.1.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
    ...

(5) >java -jar tika-app-1.24.1.jar --encoding=nonexistent /Scratch/ak19/testword.docx
    Warning: The encoding 'nonexistent' is not supported by the Java runtime.
    Warning: encoding "nonexistent" not supported, using UTF-8
    ...

(6) (Output to html)
    > java -jar tika-app-1.24.1.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
    Warning: The encoding 'nonexistent' is not supported by the Java runtime.
    Warning: encoding "nonexistent" not supported, using UTF-8
    ...

    The warning to STDERR is all that indicates that the encoding flag is taken into account when the --html flag is turned on.
    The actual HTML output sent to STDOUT makes no mention of any encoding in the file.

(7) (Output to html case 2)
    > java -jar tika-app-1.24.1.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    ...

    No warnings, but also no mention of the encoding in the HTML output.

The warning messages in (6) indicate that the output encoding is also taken into account
when the output format is set to HTML by passing the --html flag to Tika.

Since we use --html as the output format, and UTF-8 is the character encoding Greenstone
prefers to work with, it makes sense to set --encoding=UTF-8.
We also pass in --pretty-print, which is supposed to produce better formatted output.
--------------------------------------------------------------
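
The flags settled on above (--html, --encoding=UTF-8, --pretty-print) and the separate
extraction step from section B.4 could be wrapped in a small script. What follows is only a
minimal Python sketch, assuming java is on the PATH and the jar sits in the working
directory. The function names are made up for illustration and are not part of Tika or
Greenstone.

```python
import subprocess

# Jar name taken from these notes; adjust the path to wherever the jar lives.
TIKA_JAR = "tika-app-1.24.1.jar"


def tika_html_command(input_doc, jar=TIKA_JAR, encoding="UTF-8"):
    """Return the argument list that converts input_doc to HTML on STDOUT."""
    return [
        "java", "-jar", str(jar),
        "--html",
        f"--encoding={encoding}",
        "--pretty-print",
        str(input_doc),
    ]


def tika_extract_command(input_doc, extract_dir, jar=TIKA_JAR):
    """Return the argument list that extracts attachments (images etc) into extract_dir."""
    return [
        "java", "-jar", str(jar),
        "--extract",
        f"--extract-dir={extract_dir}",
        str(input_doc),
    ]


def convert(input_doc, output_html, extract_dir, jar=TIKA_JAR):
    """Hypothetical helper: run the HTML step, then the extraction step.

    HTML and images can't be produced in one Tika call, so this makes two.
    """
    with open(output_html, "wb") as out:
        # The document goes to STDOUT and warnings go to STDERR,
        # so only STDOUT is redirected into the output file.
        subprocess.run(tika_html_command(input_doc, jar), stdout=out, check=True)
    subprocess.run(tika_extract_command(input_doc, extract_dir, jar), check=True)
```

A call would look like convert("/PATH/TO/testword.docx", "/PATH/TO/GS3/gs2build/ext/tmp/testword.htm",
"/PATH/TO/GS3/gs2build/ext/tmp"). Building the command as a list rather than a single
string avoids shell quoting problems with paths that contain spaces.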