--------------------------------------------------------------
A. Some background information on Apache Tika and related topics:
--------------------------------------------------------------
* https://tika.apache.org/1.5/gettingstarted.html
Refer to the heading "Using Tika as a command line utility" for the available command line options

* https://tika.apache.org/download.html
is where the tika-app-1.24.1.jar was downloaded from
(We don't need any of the other jars, as explained under the heading "Build artifacts" at https://tika.apache.org/1.5/gettingstarted.html)

* Apache 2.0 license
	https://tika.apache.org/license.html

* Mime-types for docx and other office suite docs:	
	https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc

* Tesseract for OCR with Tika:
https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
Use Tika 1.14 to extract text from image by Tesseract OCR

* API usage examples - if modifying Tika code:
https://tika.apache.org/1.8/examples.html
https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika

--------------------------------------------------------------
B. Here are some examples of running Tika on the command line:
--------------------------------------------------------------
1. HTML:	

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm

2. XHTML - looks the same as HTML:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

3. PLAIN TEXT CONTENT - NO META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

  a. PLAIN TEXT WITH META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

  b. JUST META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
	
4. IMAGES (CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):

Extracts all attachments (images etc.) into the specified dir (pass in -z or --extract, then specify the dir with --extract-dir=)
GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx		
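
Since HTML output and attachment extraction need separate invocations, the two steps can be wrapped in one shell function. This is only a minimal sketch: the function name tika_html_and_images and the TIKA_JAR variable are made up for illustration and are not part of Tika or Greenstone. It uses only the --html, --extract and --extract-dir flags shown above.

```shell
# Hypothetical helper; TIKA_JAR and the function name are placeholders.
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"

# Usage: tika_html_and_images INPUT_DOC OUTPUT_DIR
tika_html_and_images() {
    local input="$1" outdir="$2" base
    # Strip the directory and the extension: /PATH/TO/testword.docx -> testword
    base="$(basename "${input%.*}")"
    mkdir -p "$outdir" || return 1
    # Step 1: the HTML is written to STDOUT, so redirect it into a file
    java -jar "$TIKA_JAR" --html "$input" > "$outdir/$base.html" || return 1
    # Step 2: a second run extracts the attachments (images etc.) into the same dir
    java -jar "$TIKA_JAR" --extract --extract-dir="$outdir" "$input"
}
```

Example call: tika_html_and_images /PATH/TO/testword.docx /PATH/TO/GS3/gs2build/ext/tmp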


--------------------------------------------------------------
C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
--------------------------------------------------------------
* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx

INFO  As a convenience, TikaCLI has turned on extraction of
inline images for the PDFParser (TIKA-2374).
Aside from the -z option, this is not the default behavior
in Tika generally or in tika-server.
Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.


* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
<ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>
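
To keep the INFO/WARNING lines out of the converted text, the two output streams can be redirected separately. This sketch assumes the log messages are written to STDERR (the usual default for Java's console logging, but not checked here for every message above). The function name tika_text_quiet and the TIKA_JAR variable are made up for illustration.

```shell
# Hypothetical helper; TIKA_JAR and the function name are placeholders.
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"

# Usage: tika_text_quiet INPUT_DOC OUTPUT_FILE [LOG_FILE]
# Without LOG_FILE, whatever Tika writes to STDERR is discarded.
tika_text_quiet() {
    local input="$1" output="$2" log="${3:-/dev/null}"
    # STDOUT (the document text) goes to OUTPUT_FILE, STDERR (assumed: the warnings) to LOG_FILE
    java -jar "$TIKA_JAR" --text-main "$input" > "$output" 2> "$log"
}
```

Example call: tika_text_quiet /PATH/TO/testword.docx /PATH/TO/GS3/gs2build/ext/tmp/testword.txt tika-warnings.log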


--------------------------------------------------------------
D. THE --encoding= FLAG TO TIKA
--------------------------------------------------------------
> java -jar tika-app-1.24.1.jar --help
  ...
  -eX or --encoding=X    Use output encoding X
  ...

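You can't get away with specifying an invalid encoding (e.g. --encoding=nonexistent): Tika warns and falls back to UTF-8, see (5) below.
The encoding name seems to be insensitive to case, e.g. --encoding=UTF-8, --encoding=utf-8 and --encoding=iso-8859-1 are all accepted.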

Since my tests only converted docs containing ASCII text, the converted text itself looks the same
under the encodings tried. The only visible sign that the encoding flag has been taken into account
is in xhtml output, where the XML declaration names the encoding. xhtml is the default output format
(or can pass in -x or --xml to get xhtml out).


COMPARE, noting also the case of the encoding in the Tika command, vs in the output:

(1) >java -jar tika-app-1.24.1.jar --encoding=utf-8 /Scratch/ak19/testword.docx
  <?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  <meta name="date" content="2013-09-18T02:46:00Z"/>
  ...

(2) >java -jar tika-app-1.24.1.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...

(3) >java -jar tika-app-1.24.1.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...
  
(4) >java -jar tika-app-1.24.1.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
     ...

(5) >java -jar tika-app-1.24.1.jar --encoding=nonexistent /Scratch/ak19/testword.docx
    Warning:  The encoding 'nonexistent' is not supported by the Java runtime.
    Warning: encoding "nonexistent" not supported, using UTF-8
    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...

(6) (Output to html)
    > java -jar tika-app-1.24.1.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
    Warning:  The encoding 'nonexistent' is not supported by the Java runtime.
    Warning: encoding "nonexistent" not supported, using UTF-8
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...
The warning sent to STDERR is all that indicates that the encoding flag is taken into account
when the --html flag is turned on. The actual html output sent to STDOUT makes no mention of any
encoding in the file.

(7) (Output to html case 2)
    > java -jar tika-app-1.24.1.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="date" content="2013-09-18T02:46:00Z"/>
    <meta name="Total-Time" content="5"/>
    ...
No warnings, but also no mention of the encoding in the html output.
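
Because the html output never names its encoding, the STDERR warning seen in (5) and (6) is the only thing a script can check to notice that a requested encoding was rejected. A minimal sketch of such a check follows. It assumes the warning text contains "not supported" as in the output above, and the function name tika_html_strict and the TIKA_JAR variable are made up for illustration.

```shell
# Hypothetical helper; TIKA_JAR and the function name are placeholders.
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"

# Usage: tika_html_strict ENCODING INPUT_DOC OUTPUT_FILE
# Returns non-zero if Tika fails or if it warns that ENCODING is not supported.
tika_html_strict() {
    local enc="$1" input="$2" output="$3" errlog rc
    errlog="$(mktemp)" || return 1
    # Keep STDERR in a temp file so it can be inspected after the run
    java -jar "$TIKA_JAR" --html --encoding="$enc" "$input" > "$output" 2> "$errlog"
    rc=$?
    # Assumption: the fallback warning contains the words "not supported"
    if grep -q "not supported" "$errlog"; then
        echo "Encoding '$enc' was rejected, Tika fell back to a default" >&2
        rc=1
    fi
    rm -f "$errlog"
    return "$rc"
}
```

Example call: tika_html_strict UTF-8 /Scratch/ak19/testword.docx testword.html || echo "conversion needs checking"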


The warning messages in (6) indicate that the output encoding is also taken into account when
the output format is set to html, by passing in the flag --html to tika.
Since we use --html as the output format, and UTF-8 is the character encoding Greenstone prefers
to work with, it therefore seems meaningful to set --encoding=UTF-8.

We also pass in --pretty-print, which is supposed to produce better formatted output.
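
Putting the flags chosen in this section together, the resulting invocation can be sketched as a small shell function. The function name tika_to_html_utf8 and the TIKA_JAR variable are made up for illustration; the flags are the ones discussed above.

```shell
# Hypothetical helper; TIKA_JAR and the function name are placeholders.
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"

# Usage: tika_to_html_utf8 INPUT_DOC OUTPUT_FILE
tika_to_html_utf8() {
    local input="$1" output="$2"
    # html output, UTF-8 as the output encoding, pretty-printed; STDOUT goes to OUTPUT_FILE
    java -jar "$TIKA_JAR" --html --encoding=UTF-8 --pretty-print "$input" > "$output"
}
```

Example call: tika_to_html_utf8 /PATH/TO/testword.docx /PATH/TO/GS3/gs2build/ext/tmp/testword.html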


--------------------------------------------------------------