source: main/trunk/greenstone2/ext/tika/README.txt@ 34171

Last change on this file since 34171 was 34171, checked in by ak19, 4 years ago

--------------------------------------------------------------
A. Some background information on Apache Tika and related:
--------------------------------------------------------------
* https://tika.apache.org/1.5/gettingstarted.html
Refer to the heading "Using Tika as a command line utility" for available cmd line options

* https://tika.apache.org/download.html
is where the tika-app-1.24.1.jar was downloaded from
(We don't need any of the other jars, as explained under heading "Build artifacts" at https://tika.apache.org/1.5/gettingstarted.html)

* Apache 2.0 license
 https://tika.apache.org/license.html

* Mime-types for docx and other office suite docs:
 https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc


--------------------------------------------------------------
B. Here are some examples of running Tika on the command line:
--------------------------------------------------------------
1. HTML:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm

2. XHTML - looks the same as HTML:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

3. PLAIN TEXT CONTENT - NO META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

 a. PLAIN TEXT WITH META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

 b. JUST META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

4. IMAGES (CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):

Extracts all attachments (images etc) into specified dir (-z or --extract and then specify a dir for it)
GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
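
The conversions in examples 1 to 3 differ only in the flag passed to Tika, so they could be wrapped in a small shell helper. What follows is only a sketch: the names tika_flag, tika_convert, TIKA_JAR and RUN are made up for illustration and are not part of Tika or Greenstone. By default the helper just prints the command it would run; setting RUN=1 is assumed to execute it with java on the PATH.

```shell
#!/bin/sh
# Hypothetical helper, not part of Tika: wraps the commands from examples 1-3.
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"

# Map a short mode name to the Tika flag used in the examples above.
tika_flag() {
    case "$1" in
        html)  echo "--html" ;;
        xhtml) echo "--xml" ;;
        main)  echo "--text-main" ;;
        text)  echo "--text" ;;
        meta)  echo "--metadata" ;;
        *)     echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Usage: tika_convert MODE INPUT OUTPUT
# Prints the command by default; set RUN=1 to actually execute it.
tika_convert() {
    mode=$1; input=$2; output=$3
    flag=$(tika_flag "$mode") || return 1
    if [ "${RUN:-0}" = "1" ]; then
        java -jar "$TIKA_JAR" "$flag" "$input" > "$output"
    else
        echo "java -jar $TIKA_JAR $flag $input > $output"
    fi
}

tika_convert html /PATH/TO/testword.docx /PATH/TO/GS3/gs2build/ext/tmp/testword.htm
```

Image extraction (example 4) is left out on purpose, since it takes --extract-dir rather than a redirected output file.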


--------------------------------------------------------------
C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
--------------------------------------------------------------
* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx

INFO As a convenience, TikaCLI has turned on extraction of
inline images for the PDFParser (TIKA-2374).
Aside from the -z option, this is not the default behavior
in Tika generally or in tika-server.
Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.


* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
<ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>
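
When the extracted text is wanted on its own, the warnings can be sent to a separate file. The sketch below assumes the WARNING lines are written to standard error, which the listing above does not confirm. fake_tika is a made-up stand-in so the sketch runs without Java; in practice it would be replaced by the real "java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx" call. The file names testword.txt and tika-warnings.log are likewise only examples.

```shell
#!/bin/sh
# Stand-in for the real Tika call (assumption: warnings go to stderr,
# document text goes to stdout).
fake_tika() {
    echo "WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed." >&2
    echo "ACTUAL TEXT IN INPUT DOCUMENT"
}

outdir=$(mktemp -d)

# ">" captures stdout (the text), "2>" captures stderr (the warnings).
fake_tika > "$outdir/testword.txt" 2> "$outdir/tika-warnings.log"

cat "$outdir/testword.txt"
```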

--------------------------------------------------------------