README.txt@ 34169

Last change on this file since 34169 was 34169, checked in by ak19, 4 years ago

Everything GS3 needs to convert docx files to basic html (no images) out of the box. 1. Added the Tika jar with its Apache 2.0 licence, a handcrafted notice derived from the licence, and a Readme with links and examples of its use. 2. Updated the model collectionConfig.xml with a pre-configured UnknownConverterPlugin that uses the tika jar to convert docx to basic html, so all future GS3 collections will have this set up in the document pipeline and be ready for docx files. Still to do when the chance arises: set up a model collection for GS2 that uses the UnknownConverterPlugin in the same way.

--------------------------------------------------------------
A. Some background information on Apache Tika and related:
--------------------------------------------------------------
* https://tika.apache.org/1.5/gettingstarted.html
Refer to the heading "Using Tika as a command line utility" for the available command line options.

* https://tika.apache.org/download.html
is where the tika-app-1.24.1.jar was downloaded from.
(We don't need any of the other jars, as explained under the heading "Build artifacts" at https://tika.apache.org/1.5/gettingstarted.html)

* Apache 2.0 license
https://tika.apache.org/license.html

* Mime-types for docx and other office suite docs:
https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc


--------------------------------------------------------------
B. Here are some examples of running Tika on the command line:
--------------------------------------------------------------
1. HTML:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm

2. XHTML - looks the same as HTML:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

3. PLAIN TEXT CONTENT - NO META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

a. PLAIN TEXT WITH META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

b. JUST META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

4. IMAGES (CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):

Extracts all attachments (images etc) into the specified dir (-z or --extract, and then specify a dir for it with --extract-dir)
GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx


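The flags used in the examples above can be collected into a small helper. The following is only a sketch: tika_flag, tika_cmd and the TIKA_JAR variable are names made up for illustration and are not part of Tika or Greenstone. It prints the command instead of running it, so it can be tried without Java installed.

```shell
#!/bin/sh
# Sketch only: tika_flag, tika_cmd and TIKA_JAR are hypothetical names.
# TIKA_JAR defaults to the jar named in this README.
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"

# Map an output mode to the Tika command line flag used in section B.
tika_flag() {
    case "$1" in
        html)      echo "--html" ;;
        xhtml)     echo "--xml" ;;
        text-main) echo "--text-main" ;;
        text)      echo "--text" ;;
        meta)      echo "--metadata" ;;
        *)         echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the command line for a given mode and input file.
tika_cmd() {
    flag=$(tika_flag "$1") || return 1
    echo "java -jar $TIKA_JAR $flag $2"
}

tika_cmd html /PATH/TO/testword.docx
```

With TIKA_JAR unset, the last line prints: java -jar tika-app-1.24.1.jar --html /PATH/TO/testword.docx
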
--------------------------------------------------------------
C. COMPARE OUTPUT:
--------------------------------------------------------------
* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

INFO As a convenience, TikaCLI has turned on extraction of
inline images for the PDFParser (TIKA-2374).
Aside from the -z option, this is not the default behavior
in Tika generally or in tika-server.
Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.


* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
<ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>

--------------------------------------------------------------
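
In the first example of section C the INFO and WARNING messages still appeared on the console even though standard output was redirected to a file, which suggests they are written to standard error. Assuming that is the case, the messages can be kept out of the way by redirecting standard error separately. The sketch below uses a made-up helper name, run_with_log, and a made-up log file name, tika.log; neither is part of Tika or Greenstone.

```shell
#!/bin/sh
# Sketch only: run_with_log is a hypothetical helper name.
# Runs the given command, leaving its standard output untouched and
# appending whatever it writes to standard error to the named log file.
run_with_log() {
    log="$1"; shift
    "$@" 2>>"$log"
}

# Intended use (not executed here, needs Java and the Tika jar):
#   run_with_log tika.log java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
```

If the assumption holds, the redirected output file receives only the document text and the warnings accumulate in the log file.
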

source: main/trunk/greenstone2/ext/tika/README.txt@ 34169