Timestamp:
2021-09-15T11:58:11+12:00 (3 years ago)
Author:
anupama
Message:

Committing Dr Bainbridge's improvements to the Tika-preconfigured UnknownConverterPlugin:

1. Introducing the OS-agnostic %%GSDLHOME variable into the model collConfig.xml file, which UnknownConverterPlugin.pm will replace with $GSDLHOME or %GSDLHOME% as needed. The Perl file will now also handle GSDL3HOME and GSDL3SRCHOME similarly.
2. The tika-app-1.24.1.jar is now renamed to just tika-app.jar so that UnknownConverterPlugin's exec_cmd works on Windows too, where there is no file globbing or wildcard expansion to resolve tika-app*.jar as there was on Linux. The gs2build/ext/tika folder's README has been updated to mention the version number of the tika-app jar file we're using.
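
As an illustration of point 1, the placeholder rewrite could be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the Perl code in UnknownConverterPlugin.pm: the function name expand_home_placeholders, the windows flag and the sample command string are all hypothetical and chosen only for the example.

```python
import re

# Hypothetical helper for illustration only; the real substitution lives in
# UnknownConverterPlugin.pm (Perl) and may differ in detail.
_PLACEHOLDER = re.compile(r"%%(GSDLHOME|GSDL3HOME|GSDL3SRCHOME)\b")


def expand_home_placeholders(command, windows):
    """Rewrite %%NAME as %NAME% on Windows or as $NAME elsewhere."""
    def to_native(match):
        name = match.group(1)
        return f"%{name}%" if windows else f"${name}"
    return _PLACEHOLDER.sub(to_native, command)


# Sample command string (an assumption, not copied from collConfig.xml).
sample = "java -jar %%GSDLHOME/ext/tika/tika-app.jar --html"
print(expand_home_placeholders(sample, windows=False))
print(expand_home_placeholders(sample, windows=True))
```

The sketch only rewrites the variable syntax; it leaves path separators and the rest of the command untouched, and an already OS-specific reference such as %GSDLHOME% passes through unchanged.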

File:
1 edited

  • main/trunk/greenstone2/ext/tika/README.txt

r34172 r35401
            1  --------------------------------------------------------------
            2  About tika-app.jar:
            3  --------------------------------------------------------------
            4  Last updated version is currently 1.24.1 (tika-app-1.24.1.jar)
            5  which can be found in the final line of output of running:
            6      java -jar %GSDLHOME%\ext\tika\tika-app.jar --version
            7  on Windows:
            8  or on Linux,
            9      java -jar $GSDLHOME/ext/tika/tika-app.jar --version
           10
           11
           12
     1     13  --------------------------------------------------------------
     2     14  A. Some background information on Apache Tika and related:
     
    28     40  1. HTML:
    29     41
    30         GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm
           42  GS3/gs2build/ext/tika>java -jar tika-app.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm
    31     43
    32     44  2. XHTML - looks the same as HTML:
    33     45
    34         GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
           46  GS3/gs2build/ext/tika>java -jar tika-app.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
    35     47
    36     48  3. PLAIN TEXT CONTENT - NO META:
    37     49
    38         GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
           50  GS3/gs2build/ext/tika>java -jar tika-app.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
    39     51
    40     52    a. PLAIN TEXT WITH META:
    41     53
    42         GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
           54  GS3/gs2build/ext/tika>java -jar tika-app.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
    43     55
    44     56    b. JUST META:
    45     57
    46         GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html)
           58  GS3/gs2build/ext/tika>java -jar tika-app.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html)
    47     59
    48     60  4. IMAGES CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):
    49     61
    50     62  Extracts all attachments (images etc) into specified dir (-z or --extract and then specify a dir for it)
    51         GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
           63  GS3/gs2build/ext/tika>java -jar tika-app.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
    52     64
    53     65
     
    55     67  C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
    56     68  --------------------------------------------------------------
    57         * GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
           69  * GS3/gs2build/ext/tika>java -jar tika-app.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
    58     70
    59     71  INFO  As a convenience, TikaCLI has turned on extraction of
     
    72     84
    73     85
    74         * GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx
           86  * GS3/gs2build/ext/tika>java -jar tika-app.jar --text-main /PATH/TO/testword.docx
    75     87
    76     88  Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
     
    89    101  D. THE --encoding= FLAG TO TIKA
    90    102  --------------------------------------------------------------
    91         > java -jar tika-app-1.24.1.jar --help
          103  > java -jar tika-app.jar --help
    92    104    ...
    93    105    -eX or --encoding=X    Use output encoding X
     
   104    116  COMPARE, noting also the case of the encoding in the Tika command, vs in the output:
   105    117
   106         (1) >java -jar tika-app-1.24.1.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
          118  (1) >java -jar tika-app.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
   107    119    <?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
   108    120    <head>
     
   110    122    ...
   111    123
   112         (2) >java -jar tika-app-1.24.1.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
          124  (2) >java -jar tika-app.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
   113    125      <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
   114    126      <head>
   115    127      ...
   116    128
   117         (3) >java -jar tika-app-1.24.1.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
          129  (3) >java -jar tika-app.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
   118    130      <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
   119    131      <head>
   120    132      ...
   121    133
   122         (4) >java -jar tika-app-1.24.1.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
          134  (4) >java -jar tika-app.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
   123    135      <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
   124    136      <head>
   125    137       ...
   126    138
   127         (5) >java -jar tika-app-1.24.1.jar --encoding=nonexistent /Scratch/ak19/testword.docx
          139  (5) >java -jar tika-app.jar --encoding=nonexistent /Scratch/ak19/testword.docx
   128    140      Warning:  The encoding 'nonexistent' is not supported by the Java runtime.
   129    141      Warning: encoding "nonexistent" not supported, using UTF-8
     
   133    145
   134    146  (6) (Output to html)
   135             > java -jar tika-app-1.24.1.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
          147      > java -jar tika-app.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
   136    148      Warning:  The encoding 'nonexistent' is not supported by the Java runtime.
   137    149      Warning: encoding "nonexistent" not supported, using UTF-8
     
   144    156
   145    157  (7) (Output to html case 2)
   146             > java -jar tika-app-1.24.1.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
          158      > java -jar tika-app.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
   147    159      <html xmlns="http://www.w3.org/1999/xhtml">
   148    160      <head>