--------------------------------------------------------------
CONTENTS:
--------------------------------------------------------------

A. Some background information on Apache Tika and related:
B. Here are some examples of running Tika on the command line:
C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
D. THE --encoding= FLAG TO TIKA
E. WRITING A CUSTOMISED TIKA-CLI TO OUTPUT HTML-WITH-IMAGES
F. COMPILING TIKA FROM SOURCE
G. GETTING TIKA TO WORK WITH TESSERACT TO OCR PDFs (tika-config.xml)

--------------------------------------------------------------
A. Some background information on Apache Tika and related:
--------------------------------------------------------------
* https://tika.apache.org/1.5/gettingstarted.html
  Refer to the heading "Using Tika as a command line utility" for the available command line options.

* https://tika.apache.org/download.html
  is where the tika-app-1.24.1.jar was downloaded from.
  (We don't need any of the other jars, as explained under the heading "Build artifacts" at https://tika.apache.org/1.5/gettingstarted.html)

* Apache 2.0 license
  https://tika.apache.org/license.html

* Mime-types for docx and other office suite docs:
  https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc

* Tesseract for OCR with Tika:
  https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
  Use Tika 1.14 to extract text from image by Tesseract OCR

* API usage examples - if modifying Tika code:
  https://tika.apache.org/1.8/examples.html
  https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika

--------------------------------------------------------------
B. Here are some examples of running Tika on the command line:
--------------------------------------------------------------
1. HTML:

   GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm

2. XHTML - looks the same as HTML:

   GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

3. PLAIN TEXT CONTENT - NO META:

   GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.txt

   a. PLAIN TEXT WITH META:

   GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.txt

   b. JUST META:

   GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.txt

4. IMAGES (CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):

   Extracts all attachments (images etc) into the specified dir (-z or --extract, then specify a dir with --extract-dir=)
   GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
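
The same command lines could also be launched from Java. What follows is only a minimal sketch and not part of Tika or Greenstone: the class name TikaCommand and its methods are hypothetical. It assumes a concrete jar path is passed in, because ProcessBuilder does not expand shell globs such as tika-app-*.jar. The output flag is one of those listed above (--html, --xml, --text-main, --text, --metadata).

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Tika): builds and runs the command lines shown above.
final class TikaCommand {

    // Equivalent of: java -jar <tikaJar> <outputFlag> <input>
    static List<String> build(Path tikaJar, String outputFlag, Path input) {
        return Arrays.asList("java", "-jar", tikaJar.toString(), outputFlag, input.toString());
    }

    // Runs the command, redirecting STDOUT to outFile (like "> file" in the shell).
    // STDERR is inherited, so Tika's warnings still show up on the console.
    static int run(Path tikaJar, String outputFlag, Path input, Path outFile)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(build(tikaJar, outputFlag, input));
        pb.redirectOutput(outFile.toFile());
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        return pb.start().waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; the jar and input paths here are placeholders.
        System.out.println(build(Paths.get("lib/tika-app-1.24.1.jar"), "--html", Paths.get("testword.docx")));
    }
}
```

Keeping STDERR separate from STDOUT matters here because Tika writes its warnings to STDERR, and they should not end up in the converted file.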


--------------------------------------------------------------
C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
--------------------------------------------------------------
* GS3/gs2build/ext/gstika>java -jar tika-app-*.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx

INFO As a convenience, TikaCLI has turned on extraction of
inline images for the PDFParser (TIKA-2374).
Aside from the -z option, this is not the default behavior
in Tika generally or in tika-server.
Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.


* GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
<ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>


--------------------------------------------------------------
D. THE --encoding= FLAG TO TIKA
--------------------------------------------------------------
> java -jar tika-app-*.jar --help
...
    -eX  or --encoding=X    Use output encoding X
...

Invalid encodings (e.g. --encoding=nonexistent) are rejected with a warning, and Tika falls back to UTF-8.
The flag seems to be insensitive to case, e.g. --encoding=UTF-8, --encoding=utf-8, --encoding=iso-8859-1

Since my tests only converted docs containing ASCII, the only visible evidence that the encoding
flag is taken into account is in xhtml output, where the encoding appears in the XML declaration.
xhtml is the default output format (or pass in -x or --xml to request it explicitly).


COMPARE, noting also the case of the encoding in the Tika command, vs in the output:

(1) >java -jar tika-app-*.jar --encoding=utf-8 /Scratch/ak19/testword.docx
<?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2013-09-18T02:46:00Z"/>
...

(2) >java -jar tika-app-*.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
...

(3) >java -jar tika-app-*.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
<?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
...

(4) >java -jar tika-app-*.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
<?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
...

(5) >java -jar tika-app-*.jar --encoding=nonexistent /Scratch/ak19/testword.docx
Warning: The encoding 'nonexistent' is not supported by the Java runtime.
Warning: encoding "nonexistent" not supported, using UTF-8
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
...

(6) (Output to html)
> java -jar tika-app-*.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
Warning: The encoding 'nonexistent' is not supported by the Java runtime.
Warning: encoding "nonexistent" not supported, using UTF-8
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
The warning on STDERR is all that indicates that the encoding flag is taken into account
when the --html flag is turned on. The actual html output sent to STDOUT makes no mention of any
encoding in the file.

(7) (Output to html case 2)
> java -jar tika-app-*.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2013-09-18T02:46:00Z"/>
<meta name="Total-Time" content="5"/>
...
No warnings, but also no mention of the encoding in the html output.


The warning messages in (6) indicate that the output encoding is also taken into account when
the output format is set to html by passing the --html flag to Tika.
Since we use --html as the output format, and UTF-8 is the character encoding Greenstone prefers
to work with, it seems sensible to set --encoding=UTF-8.

We also pass in --pretty-print to get supposedly better formatted output.
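
The encoding behaviour observed in (1) to (5) can be approximated outside Tika with the JDK's own charset lookup. This is a sketch under assumptions, not Tika's actual code: the class EncodingCheck and its resolve() method are hypothetical, and the fallback to UTF-8 simply mirrors the warning text seen in (5). Note that it returns the JDK's canonical charset name, which need not match the case Tika echoes in its output.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical pre-check for a value intended for Tika's --encoding flag.
final class EncodingCheck {

    // Returns the canonical name of the requested charset if this Java runtime
    // supports it, otherwise "UTF-8" (the fallback reported in Tika's warning).
    static String resolve(String requested) {
        try {
            if (Charset.isSupported(requested)) {
                // Lookup is case-insensitive: "iso-8859-1" and "ISO-8859-1" are the same charset.
                return Charset.forName(requested).name();
            }
        } catch (IllegalCharsetNameException e) {
            // Syntactically illegal names (e.g. containing spaces) are treated as unsupported.
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        for (String name : new String[] {"utf-8", "UTF-8", "iso-8859-1", "nonexistent"}) {
            System.out.println(name + " -> " + resolve(name));
        }
    }
}
```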


--------------------------------------------------------------
E. WRITING A CUSTOMISED TIKA-CLI TO OUTPUT HTML-WITH-IMAGES
--------------------------------------------------------------

The default Tika cli app accepts the --html and --xml (for xhtml) flags to output html and xhtml respectively.
To extract images, the Tika cli app needs to be run separately with the --extract flag and an optional --extract-dir=<dir>.
However, running --html and then --extract sequentially does not produce an html file that refers to the extracted
images, because the extracted images are renamed to rId<digit>_<imagefilename>.<ext>, while the generated html file
uses "embedded:<imagefilename>.<ext>" as the value of the src attributes of image elements.

So the problem is two-fold:
- Need to stop prefixing anything to the names of the extracted images.
- Need to remove the "embedded:" prefix from the img src attributes in the html produced. Ideally the string
  "embedded:" would never be prefixed at all, but that would require editing many source files in the Tika project rather than just one.
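
The mismatch between the two names can be illustrated with plain string handling. This is a sketch only: the class EmbeddedNames and its methods are hypothetical, and the pattern assumes the prefix is "rId" followed by one or more digits and an underscore, as described above.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the two naming schemes described above.
final class EmbeddedNames {

    // Assumed shape of the prefix added to extracted files: rId<digits>_
    private static final Pattern RID_PREFIX = Pattern.compile("^rId\\d+_");
    private static final String EMBEDDED = "embedded:";

    // "rId7_image1.png" -> "image1.png"; names without the prefix are returned unchanged.
    static String fromExtractedFile(String fileName) {
        return RID_PREFIX.matcher(fileName).replaceFirst("");
    }

    // "embedded:image1.png" -> "image1.png"; other src values are returned unchanged.
    static String fromImgSrc(String src) {
        return src.startsWith(EMBEDDED) ? src.substring(EMBEDDED.length()) : src;
    }

    public static void main(String[] args) {
        String onDisk = "rId7_image1.png";       // example name, as written by --extract
        String inHtml = "embedded:image1.png";   // example src, as written by --html
        System.out.println(onDisk.equals(inHtml));                                    // the raw names differ
        System.out.println(fromExtractedFile(onDisk).equals(fromImgSrc(inHtml)));     // both reduce to the same name
    }
}
```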

The solution turned out not to require compiling up apache-tika from source at all, but having a source checkout
to locate and modify code was handy.


SOLUTION TO OUTPUT (X)HTML WITH IMAGES EXTRACTED IN THE SAME LOCATION:
1. I wrote org.greenstone.tika.GSTikaCLI.java, which is based on Tika's TikaCLI.java
   with some minor modifications documented below.

2. It stands alone and can be compiled and run with the tika-app-*.jar file on the classpath:
   To compile:
      GS3/gs2build/ext/gstika>javac -cp `pwd`/lib/tika-app-*.jar org/greenstone/tika/GSTikaCLI.java
   To run:
      GS3/gs2build/ext/gstika>java -cp "`pwd`/lib/tika-app-*.jar:." org.greenstone.tika.GSTikaCLI --html-with-images <inputfilepath> > output.html

   (Can pass existing flags, e.g. --html for html without images extracted)

   To compile code that lives in a directory called "src" into a directory called "build":

      GS3/gs2build/ext/gstika>javac -cp `pwd`/lib/tika-app-*.jar -d `pwd`/build src/org/greenstone/tika/GSTikaCLI.java

   To run the compiled class that's now in folder "build":
      GS3/gs2build/ext/gstika>java -cp "`pwd`/lib/tika-app-*.jar:`pwd`/build" org.greenstone.tika.GSTikaCLI --html-with-images <inputfilepath> > output.html


3. GSTikaCLI.java is based on TikaCLI.java, with the modifications marked by comments mentioning "GSDL".

   a. The major change is that the inner class method FileEmbeddedDocumentExtractor.getOutputFile() no longer
      adds the unwanted "rId<digit>_" prefix to the filenames of the extracted images.

   b. The return type of the static method getTransformerHandler() is no longer TransformerHandler, but its superinterface ContentHandler.

      When the new --html-with-imgs (or xhtml-with-images) flag is passed into GSTikaCLI, getTransformerHandler()
      further processes the html/xml result it generates by removing the "embedded:" prefixes in img src attributes.
      This is done with code copied from tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java and then modified
      (look for the code about a ContentHandlerDecorator in TikaGUI.java).

   c. Other changes support the 2 new additional input flags --html-with-imgs and --xhtml-with-imgs, additionally
      call the image extraction functions, and ensure an extraction directory flag is still supported in this mode.
      (When no extraction directory is provided, the images are extracted into the same level as the input file.)
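
The idea in point b, rewriting img src attributes while the SAX events stream past, can be sketched with JDK classes alone. This is not the GSTikaCLI code and it does not use Tika's ContentHandlerDecorator: the JDK's XMLFilterImpl stands in for it, and the class EmbeddedSrcFilter with its imgSources() helper is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical SAX filter: strips the "embedded:" prefix from img src attributes
// and passes every other event through unchanged.
final class EmbeddedSrcFilter extends XMLFilterImpl {

    private static final String EMBEDDED = "embedded:";

    EmbeddedSrcFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("img".equals(localName)) {
            AttributesImpl copy = new AttributesImpl(atts);   // the incoming Attributes are read-only
            int i = copy.getIndex("src");
            if (i >= 0 && copy.getValue(i).startsWith(EMBEDDED)) {
                copy.setValue(i, copy.getValue(i).substring(EMBEDDED.length()));
            }
            atts = copy;
        }
        super.startElement(uri, localName, qName, atts);
    }

    // Parses an xhtml string through the filter and returns the src values seen downstream.
    static List<String> imgSources(String xhtml) {
        List<String> sources = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);   // so localName is "img" for xhtml elements
            XMLReader filter = new EmbeddedSrcFilter(factory.newSAXParser().getXMLReader());
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    if ("img".equals(localName)) {
                        sources.add(atts.getValue("src"));
                    }
                }
            });
            filter.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse xhtml", e);
        }
        return sources;
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<img src=\"embedded:image1.png\"/><img src=\"other.png\"/></body></html>";
        System.out.println(imgSources(xhtml));
    }
}
```

Filtering at the SAX level avoids re-parsing the generated html afterwards, which is presumably why the TikaGUI code takes the decorator approach.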


4. Next, I added a makeGSTikaCLI.sh script for compiling, and a GSTikaCLI.sh script that slightly simplifies running.


   cd gs2build/ext/gstika
   ./makeGSTikaCLI.sh
   ./GSTikaCLI.sh --html-with-images <inputfile> > <outputfile>
   e.g. ./GSTikaCLI.sh --html-with-imgs --pretty-print --encoding=UTF-8 tmp/<file>.docx > tmp/<file>.html


--------------------------------------------------------------
F. COMPILING TIKA FROM SOURCE
--------------------------------------------------------------

Refer to https://github.com/apache/tika

(a) Need Maven 3 to compile up Tika.
    export MAVEN_HOME=/Path/To/apache-maven3
    export PATH=$MAVEN_HOME/bin:$PATH

(b) Need to configure Maven to grab artifacts using https, since some are only available over https.
    Refer to https://stackoverflow.com/questions/25393298/what-is-the-correct-way-of-forcing-maven-to-use-https-for-maven-central
    which instructs adding the following to the <profiles> section of your $MAVEN_HOME/conf/settings.xml:

    <profile>
      <id>maven-https</id>
      <activation>
        <activeByDefault>true</activeByDefault>
      </activation>
      <repositories>
        <repository>
          <id>central</id>
          <url>https://repo1.maven.org/maven2</url>
          <snapshots>
            <enabled>false</enabled>
          </snapshots>
        </repository>
      </repositories>
      <pluginRepositories>
        <pluginRepository>
          <id>central</id>
          <url>https://repo1.maven.org/maven2</url>
          <snapshots>
            <enabled>false</enabled>
          </snapshots>
        </pluginRepository>
      </pluginRepositories>
    </profile>

(c) Grab tika from git and attempt to compile it with maven
    > git clone https://github.com/apache/tika.git
    > cd tika
    > mvn clean install
    Takes 42-45 mins to compile up!


This compiles up a version 2.0.0 tika-app jar file, whereas the precompiled downloadable jar is version 1.24.1.

Compiling Tika from source wasn't necessary to compile or run GSTikaCLI.java!
However, having the source code was useful for basing GSTikaCLI.java on TikaCLI.java.

--------------------------------------------------------------
G. GETTING TIKA TO WORK WITH TESSERACT TO OCR PDFs (tika-config.xml)
--------------------------------------------------------------

If you have Tesseract installed correctly, with its bin folder on PATH and the TESSDATA_PREFIX
environment variable set, the current version of Tika (tika-app-1.24.x.jar) will
turn on Tesseract OCR automatically for images.

But Tika is not configured out of the box to work with Tesseract to OCR PDFs (Tesseract
on its own does not OCR PDFs, only images).

To get Tika to work with Tesseract to OCR PDFs:
1. Must pass a tika-config.xml file to Tika, in which the TesseractOCRParser and PDFParser are
   configured correctly. Run as:
      java -jar tika-app-*.jar --config=<tika-config.xml>

2. The "outputType" param of the TesseractOCRParser in this config file must have one of
   these 2 values:
   a. "txt" - which requests Tesseract to output OCR as text
   b. "hocr" - which asks Tesseract to output OCR as html (hence the format is called hocr)

   For the hocr value to have any effect (else the PDF pages will not be OCR-ed), on the
   tesseract end, the $TESSDATA_PREFIX/configs/hocr file must exist and contain
   these values (given at
   https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
      tessedit_create_hocr 1
      hocr_font_info 0

   The latest Tesseract tarball should now contain this $TESSDATA_PREFIX/configs/hocr file.
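
Whether the hocr precondition is met can be checked before running Tika. The following is a rough sketch, not something shipped with Greenstone or Tika: the class HocrConfigCheck and its methods are hypothetical, and it only looks for the two settings quoted above in $TESSDATA_PREFIX/configs/hocr.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical pre-flight check for the Tesseract hocr config file described above.
final class HocrConfigCheck {

    // True if the lines contain both "tessedit_create_hocr 1" and "hocr_font_info 0".
    static boolean hasHocrSettings(List<String> lines) {
        boolean createHocr = false;
        boolean fontInfo = false;
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                if (parts[0].equals("tessedit_create_hocr") && parts[1].equals("1")) createHocr = true;
                if (parts[0].equals("hocr_font_info") && parts[1].equals("0")) fontInfo = true;
            }
        }
        return createHocr && fontInfo;
    }

    // Looks for <tessdataPrefix>/configs/hocr and checks its contents.
    static boolean hocrConfigOk(Path tessdataPrefix) {
        Path config = tessdataPrefix.resolve("configs").resolve("hocr");
        if (!Files.isRegularFile(config)) {
            return false;
        }
        try {
            return hasHocrSettings(Files.readAllLines(config));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String prefix = System.getenv("TESSDATA_PREFIX");
        boolean ok = prefix != null && hocrConfigOk(Paths.get(prefix));
        System.out.println(ok ? "hocr config found" : "hocr config missing or incomplete: use outputType txt");
    }
}
```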


I'm committing an appropriate tika-config.xml file (based on https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/) for GSTika, containing:

*************************************************************
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--
  (XML comments are only allowed after the XML declaration.)

  https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/
  which links to their sample tika-config.xml (copied below) which configures the PDF and OCR Parsers to behave just as the old PDFParser.props and OCR Parser properties files did.

  - new way of one tika-config.xml: https://github.com/o19s/pdf-discovery-demo/blob/crazy_tika_tesseract_inside_of_solr/ocr/tika-config.xml
  - old way of 2 props files: https://github.com/o19s/pdf-discovery-demo/tree/6f5b37305dd863a73af4617db64cbe853c5ecd2a/ocr/tika-properties/org/apache/tika/parser

  https://tika.apache.org/1.16/configuring.html
  https://issues.apache.org/jira/browse/TIKA-2624
-->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- Setting the following 2 params is unnecessary, since sourcing Greenstone puts the Tesseract binary
             on the path AND also sets the TESSDATA_PREFIX env var needed by Tesseract -->
        <!--
        <param name="tesseractPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/bin</param>
        <param name="tessdataPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/tessdata</param>
        -->

        <!-- IMPORTANT!! -->
        <param name="outputType" type="string">hocr</param><!-- Choose one of: hocr, txt -->
        <!-- hocr is preferred as Tesseract produces nicely formatted html that better reflects
             the placement of the original text in the scanned page. (Can compare running with hocr vs txt)

             However, initially, the above value had to be fixed as "txt", as outputType value = hocr prevented
             Tika+Tesseract from OCR-ing pdfs (no OCR output).
             That was the case until $GEXT_INSTALLED/tessdata/configs/hocr (tesseract config file) was created
             containing the specific property values in point 2b below.

             To get Tika to work with Tesseract to OCR pages of a scanned PDF:
             1. always pass in this file as __config=/path/to/tika-config.xml to the tika-app-*.jar cmd
                (the double underscore stands for a double hyphen, which XML comments cannot contain),
             2. AND do one of the following:
                a. Set the above outputType param to "txt" so Tesseract produces the OCR in .txt format, and things should work,
                b. OR if you want the OCR generated by Tesseract to be in hocr (html ocr) format, then set the outputType above
                   to "hocr" AND ensure a config file also called hocr exists in $TESSDATA_PREFIX/configs containing the following
                   (taken from
                   https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
                      tessedit_create_hocr 1
                      hocr_font_info 0

             More information about tesseract config options is available by running:
                tesseract __print-parameters
        -->
        <param name="language" type="string">eng</param>
        <param name="pageSegMode" type="string">1</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>

  </parsers>
</properties>
*************************************************************


--------------------------------------------------------------