Timestamp: 2020-07-24T18:07:13+12:00
Author: ak19
Message:

More information about Tika, including an answer to Dr Bainbridge's question on whether the HTML output of the tika-cli really doesn't include the images. The Tika docs say the output formats are all for extracting text in those formats.

File:
1 edited

  • gs2-extensions/gstika/trunk/GS_TIKA_README.txt

--- gs2-extensions/gstika/trunk/GS_TIKA_README.txt	(revision 34187)
+++ gs2-extensions/gstika/trunk/GS_TIKA_README.txt	(revision 34291)
@@ -34,4 +34,21 @@
 https://tika.apache.org/1.8/examples.html
 https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika
+
+
+* HTML output is without images:
+  - https://tika.apache.org/1.8/examples.html#Picking_different_output_formats
+    Picking different output formats
+
+    With Tika, you can get the textual content of your files returned in a number of different formats. These can be plain text, html, xhtml, xhtml of one part of the file etc. This is controlled based on the ContentHandler you supply to the Parser.
+
+  - https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika
+    also seems to indicate that images are not part of the html output
+
+  - https://stackoverflow.com/questions/27623809/how-to-extract-title-body-and-images-from-html-with-apache-tika-parser
+
+* More reading:
+- https://medium.com/@simonli_18826/apache-tika-code-with-example-walkthroughs-d1b0c18d5b2d
+- https://www.manning.com/books/tika-in-action
+- https://livebook.manning.com/book/tika-in-action/chapter-2/48 (one of the free chapters)
 
 --------------------------------------------------------------
@@ -101,4 +118,11 @@
 D. THE --encoding= FLAG TO TIKA
 --------------------------------------------------------------
+
+https://livebook.manning.com/book/tika-in-action/chapter-2/48
+Contains this insightful segment about the encoding flag:
+    "Note that Tika will by default output text using the normal character encoding used on your computer. This is great if you’re using Tika with tools such as your command-line console window that expect this default character encoding, but may cause trouble otherwise. To avoid unexpected encoding problems, you can explicitly set the output encoding with the --encoding option..."
+
+
+
 > java -jar tika-app-*.jar --help
   ...
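
The encoding problem described in the quoted passage of section D can be illustrated without Tika itself. Below is a minimal sketch in plain Java (JDK only); the class name EncodingSketch, the helper roundTrip, and the sample string are hypothetical and not part of the README or of Tika. It shows how text written in one character encoding is misread by a consumer that expects another, which is the kind of trouble an explicit output encoding is meant to avoid.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {
    // Encode text to bytes in one charset, then decode those bytes in another,
    // mimicking a producer and a consumer that disagree about the encoding.
    public static String roundTrip(String text, Charset writeAs, Charset readAs) {
        byte[] bytes = text.getBytes(writeAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        String sample = "caf\u00e9"; // made-up sample text with one non-ASCII character

        // The platform default is what a tool falls back to when no encoding is given.
        System.out.println("Platform default charset: " + Charset.defaultCharset());

        // Producer and consumer agree: the text survives unchanged.
        String same = roundTrip(sample, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("Agreeing charsets preserve the text: " + same.equals(sample));

        // Producer writes ISO-8859-1, consumer reads UTF-8: the accented character is lost.
        String mixed = roundTrip(sample, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("Mismatched charsets preserve the text: " + mixed.equals(sample));
    }
}
```

With the Tika command line, the equivalent decision is made through the --encoding= flag that section D is about: stating the encoding explicitly removes the dependence on the platform default.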
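
The Tika documentation quoted in the changeset says the output format is controlled by the ContentHandler supplied to the Parser. That ContentHandler is the standard SAX interface org.xml.sax.ContentHandler, so the idea can be sketched with the JDK alone. The following is only an illustration under stated assumptions: the JDK's SAX parser stands in for a Tika Parser, and the class name, the TextCollector handler, and the sample markup are all hypothetical. It does not reproduce Tika's behaviour; it only shows how a handler decides what reaches the output, here by keeping character data and ignoring elements such as img.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextOnlyHandlerSketch {
    // Collects only character data; element events (including <img>) are ignored.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Parses a well-formed XHTML string with the JDK SAX parser (a stand-in for
    // a Tika Parser) and returns whatever text the handler collected.
    public static String extractText(String xhtml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample markup", e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        // Made-up sample document containing an image element.
        String sample = "<html><body><p>Hello</p><img src=\"pic.png\"/><p>World</p></body></html>";
        System.out.println(extractText(sample)); // prints "HelloWorld"
    }
}
```

A different handler receiving the same SAX events could instead serialise the element tags, which is how one parse can yield plain text or markup depending on the handler chosen.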