--------------------------------------------------------------
A. Background information on Apache Tika and related topics:
--------------------------------------------------------------
* https://tika.apache.org/1.5/gettingstarted.html
See the heading "Using Tika as a command line utility" for the available command-line options.

* https://tika.apache.org/download.html
This is where tika-app-1.24.1.jar was downloaded from.
(We don't need any of the other jars, as explained under the heading "Build artifacts" at https://tika.apache.org/1.5/gettingstarted.html)

* Apache 2.0 license
https://tika.apache.org/license.html

* MIME types for docx and other office suite documents:
https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc


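To check which MIME type Tika itself assigns to a file, the command-line app's --detect option can be used. The following is only a minimal sketch: the function name is_docx and the TIKA_JAR variable are hypothetical, the jar is assumed to be in the current directory, and the MIME string is the standard one for .docx files.

```shell
# Assumed location of the jar; override TIKA_JAR if it lives elsewhere.
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"
# Standard MIME type for .docx (Office Open XML word-processing documents).
DOCX_MIME="application/vnd.openxmlformats-officedocument.wordprocessingml.document"

# Hypothetical helper: succeeds (exit 0) if Tika detects the file as docx.
is_docx() {
    local detected
    # --detect prints the detected MIME type on stdout; diagnostics are dropped.
    detected="$(java -jar "$TIKA_JAR" --detect "$1" 2>/dev/null)" || return 2
    [ "$detected" = "$DOCX_MIME" ]
}

# Usage (path is a placeholder):
#   is_docx /PATH/TO/testword.docx && echo "docx"
```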
--------------------------------------------------------------
B. Here are some examples of running Tika on the command line:
--------------------------------------------------------------
1. HTML:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm

2. XHTML - looks the same as HTML:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

3. PLAIN TEXT CONTENT - NO META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

a. PLAIN TEXT WITH META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

b. JUST META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

4. IMAGES (CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):

Extracts all attachments (images etc.) into the specified directory: pass -z or --extract, then give the directory with --extract-dir.
GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx


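Since HTML output and image extraction cannot be combined in a single invocation, a small wrapper can run the two passes back to back. The following is a minimal sketch, not part of the Tika distribution: the function name tika_html_with_images and the TIKA_JAR variable are hypothetical, and it relies only on the --html, --extract and --extract-dir options shown above.

```shell
# Assumed location of the jar; override TIKA_JAR if it lives elsewhere.
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"

# Hypothetical helper: run Tika twice, once for HTML and once for attachments.
tika_html_with_images() {
    local input="$1" outdir="$2"
    local name
    # Drop the directory, then the last extension: /a/b/testword.docx -> testword
    name="$(basename "$input")"
    name="${name%.*}"

    mkdir -p "$outdir" || return 1

    # Pass 1: HTML rendering, which Tika writes to stdout.
    java -jar "$TIKA_JAR" --html "$input" > "$outdir/$name.html" || return 1

    # Pass 2: attachments (images etc.) extracted into the same directory.
    java -jar "$TIKA_JAR" --extract --extract-dir="$outdir" "$input"
}

# Usage (paths are placeholders):
#   tika_html_with_images /PATH/TO/testword.docx /PATH/TO/GS3/gs2build/ext/tmp
```

Writing both results into one directory keeps the HTML file next to the extracted images; whether the HTML references those images by matching names is not guaranteed by this sketch.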
--------------------------------------------------------------
C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
--------------------------------------------------------------
* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx

INFO As a convenience, TikaCLI has turned on extraction of
inline images for the PDFParser (TIKA-2374).
Aside from the -z option, this is not the default behavior
in Tika generally or in tika-server.
Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.


* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
<ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>

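In both runs the INFO and WARNING lines are diagnostics rather than document content. The sketch below assumes they are written to stderr (the default for java.util.logging's console handler, whose format the timestamped WARNING lines match; this has not been verified for the INFO line), so the extracted text on stdout can be captured on its own. The function name tika_text_quiet, the TIKA_JAR variable and the log-file argument are hypothetical.

```shell
# Assumed location of the jar; override TIKA_JAR if it lives elsewhere.
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"

# Hypothetical helper: keep document text and diagnostics in separate files.
tika_text_quiet() {
    local input="$1" output="$2" logfile="$3"
    # stdout (document text)       -> $output
    # stderr (assumed diagnostics) -> $logfile
    java -jar "$TIKA_JAR" --text-main "$input" > "$output" 2> "$logfile"
}

# Usage (paths are placeholders):
#   tika_text_quiet /PATH/TO/testword.docx testword.txt tika-warnings.log
```

If the diagnostics turn out to appear in the output file anyway, the stderr assumption does not hold for that message and the logging configuration would need to be checked instead.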
--------------------------------------------------------------