source: main/trunk/greenstone2/ext/tika/README.txt@ 34169

Last change on this file since 34169 was 34169, checked in by ak19, 4 years ago

Everything GS3 needs to convert docx files to basic HTML (no images) out of the box. 1. Added the Tika jar with its Apache 2.0 licence, a handcrafted notice derived from the licence, and a README with links and examples of its use. 2. Updated the model collectionConfig.xml with a pre-configured UnknownConverterPlugin that uses the Tika jar to convert docx to basic HTML, so all future GS3 collections will have this set up in the document pipeline and be ready for docx files. When the chance arises, a model collection for GS2 that uses the UnknownConverterPlugin in this way needs to be set up too.

--------------------------------------------------------------
A. Some background information on Apache Tika and related:
--------------------------------------------------------------
* https://tika.apache.org/1.5/gettingstarted.html
Refer to the heading "Using Tika as a command line utility" for the available command line options

* https://tika.apache.org/download.html
is where the tika-app-1.24.1.jar was downloaded from
(We don't need any of the other jars, as explained under the heading "Build artifacts" at https://tika.apache.org/1.5/gettingstarted.html)

* Apache 2.0 license
 https://tika.apache.org/license.html

* Mime-types for docx and other office suite docs:
 https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc


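For quick reference, the MIME types of the three most common Office Open XML formats can be looked up with a small shell helper. This is only a sketch: office_mime_type is a hypothetical name, not something provided by Greenstone or Tika, and the match on the file extension is case-sensitive.

```shell
# Sketch only: office_mime_type is a hypothetical helper, not part of Greenstone or Tika.
# Prints the MIME type for a file name based on its (lower-case) extension.
# Returns non-zero for extensions it does not know about.
office_mime_type() {
    case "$1" in
        *.docx) echo "application/vnd.openxmlformats-officedocument.wordprocessingml.document" ;;
        *.xlsx) echo "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" ;;
        *.pptx) echo "application/vnd.openxmlformats-officedocument.presentationml.presentation" ;;
        *) return 1 ;;
    esac
}
```

Example use: office_mime_type /PATH/TO/testword.docx
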
--------------------------------------------------------------
B. Here are some examples of running Tika on the command line:
--------------------------------------------------------------
1. HTML:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm

2. XHTML - looks the same as HTML:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

3. PLAIN TEXT CONTENT - NO META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

 a. PLAIN TEXT WITH META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

 b. JUST META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

4. IMAGES (CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):

Extracts all attachments (images etc) into the specified dir (-z or --extract, and then specify a dir for it with --extract-dir)
GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx


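Since HTML and images can't be produced in one step, examples 1 and 4 can be chained in a small shell function. This is a minimal sketch, not part of Greenstone: the names docx_to_html_with_images, TIKA_JAR and DRY_RUN are made up for illustration, and it assumes it is run from the directory containing the jar (or that TIKA_JAR holds the jar's full path). Setting DRY_RUN=1 only prints the two commands, which is handy for checking the flags without Java installed.

```shell
# Sketch only: TIKA_JAR, DRY_RUN and docx_to_html_with_images are hypothetical names.
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"

# Usage: docx_to_html_with_images /PATH/TO/testword.docx /PATH/TO/outdir
docx_to_html_with_images() {
    input="$1"
    outdir="$2"
    name=$(basename "$input" .docx)

    # Dry run: print the two commands instead of executing them.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "java -jar $TIKA_JAR --html $input > $outdir/$name.html"
        echo "java -jar $TIKA_JAR --extract --extract-dir=$outdir $input"
        return 0
    fi

    mkdir -p "$outdir" || return 1
    # Step 1: basic HTML (no images) is written to stdout, so redirect it to a file.
    java -jar "$TIKA_JAR" --html "$input" > "$outdir/$name.html" || return 1
    # Step 2: a separate run extracts the attachments (images etc) into the same dir.
    java -jar "$TIKA_JAR" --extract --extract-dir="$outdir" "$input"
}
```

Example use: docx_to_html_with_images /PATH/TO/testword.docx /PATH/TO/GS3/gs2build/ext/tmp
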
--------------------------------------------------------------
C. COMPARE OUTPUT:
--------------------------------------------------------------
* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

INFO As a convenience, TikaCLI has turned on extraction of
inline images for the PDFParser (TIKA-2374).
Aside from the -z option, this is not the default behavior
in Tika generally or in tika-server.
Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.


* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
<ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>

--------------------------------------------------------------
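In the first run above the warnings still show on the console even though stdout is redirected to a file, which suggests they are written to stderr. If so, they can be kept apart from the converted output with a small wrapper. This is a sketch under that assumption: tika_quiet, JAVA, TIKA_JAR and TIKA_LOG are hypothetical names, not part of Greenstone or Tika.

```shell
# Sketch only: JAVA, TIKA_JAR, TIKA_LOG and tika_quiet are hypothetical names.
JAVA="${JAVA:-java}"
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"
TIKA_LOG="${TIKA_LOG:-tika-warnings.log}"

# Runs Tika with the given arguments. Whatever Tika writes to stdout
# (the converted document) is passed through untouched, while anything
# written to stderr (assumed here to be the warnings) is appended to TIKA_LOG.
tika_quiet() {
    "$JAVA" -jar "$TIKA_JAR" "$@" 2>> "$TIKA_LOG"
}
```

Example use: tika_quiet --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.txt
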