Changeset 34291

Timestamp:
24.07.2020 18:07:13
Author:
ak19
Message:

More information about Tika, including an answer to Dr Bainbridge's question on whether the HTML output of the tika-cli really doesn't include images. The Tika docs say the output formats are all for extracting text in those formats.

Files:
1 modified

  • gs2-extensions/gstika/trunk/GS_TIKA_README.txt

    r34187 r34291  
  34   34  https://tika.apache.org/1.8/examples.html
  35   35  https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika
       36
       37
       38  * HTML output is without images:
       39    - https://tika.apache.org/1.8/examples.html#Picking_different_output_formats
       40      Picking different output formats
       41
       42      With Tika, you can get the textual content of your files returned in a number of different formats. These can be plain text, html, xhtml, xhtml of one part of the file etc. This is controlled based on the ContentHandler you supply to the Parser.
       43
       44    - https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika
       45      also seems to indicate that images are not part of the html output
       46
       47    - https://stackoverflow.com/questions/27623809/how-to-extract-title-body-and-images-from-html-with-apache-tika-parser
       48
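The Tika examples page quoted above says the output format is controlled by the ContentHandler supplied to the Parser. The sketch below illustrates that idea only; it does not call Tika. It assumes the JDK's own SAX parser as a stand-in for a Tika Parser, and the class names HandlerSketch, TextOnlyHandler and MarkupHandler are made up for this illustration. Both handlers receive the same parse events; one keeps only the text, the other re-emits the markup as well.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustration only: uses the JDK SAX parser, not Tika.
public class HandlerSketch {

    /** Hypothetical handler that keeps character data and ignores all tags. */
    public static class TextOnlyHandler extends DefaultHandler {
        public final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Hypothetical handler that writes tags and attributes back out with the text (no escaping). */
    public static class MarkupHandler extends DefaultHandler {
        public final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                out.append(' ').append(atts.getQName(i))
                   .append("=\"").append(atts.getValue(i)).append('"');
            }
            out.append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Feeds the same input to whichever handler the caller supplies. */
    public static void parse(String xml, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<p>Hello <b>world</b></p>";

        TextOnlyHandler textOnly = new TextOnlyHandler();
        parse(xml, textOnly);
        System.out.println(textOnly.text);

        MarkupHandler markup = new MarkupHandler();
        parse(xml, markup);
        System.out.println(markup.out);
    }
}
```

The input and the parser are identical in both runs; only the handler differs, which is the same division of labour the Tika docs describe for choosing between plain text and (x)html output.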
       49  * More reading:
       50  - https://medium.com/@simonli_18826/apache-tika-code-with-example-walkthroughs-d1b0c18d5b2d
       51  - https://www.manning.com/books/tika-in-action
       52  - https://livebook.manning.com/book/tika-in-action/chapter-2/48 (one of the free chapters)
  36   53
  37   54  --------------------------------------------------------------
     
 101  118  D. THE --encoding= FLAG TO TIKA
 102  119  --------------------------------------------------------------
      120
      121  https://livebook.manning.com/book/tika-in-action/chapter-2/48
      122  Contains this insightful segment about the encoding flag:
      123      "Note that Tika will by default output text using the normal character encoding used on your computer. This is great if you’re using Tika with tools such as your command-line console window that expect this default character encoding, but may cause trouble otherwise. To avoid unexpected encoding problems, you can explicitly set the output encoding with the --encoding option..."
      124
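The "unexpected encoding problems" mentioned in the quote can be shown without Tika. The sketch below is an illustration under assumptions: the class name EncodingSketch and the helper roundTrip are made up here, and the sample word is arbitrary. It writes text as bytes in one charset and reads them back in another, which is what happens when a tool's output encoding and its consumer's expected encoding disagree.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: plain JDK code, not Tika.
public class EncodingSketch {

    /** Encodes text with one charset, then decodes the resulting bytes with another. */
    public static String roundTrip(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", one non-ASCII character

        // The platform default is what a tool uses when no encoding is given.
        System.out.println("platform default charset: " + Charset.defaultCharset());

        // Writer and reader agree: the text survives.
        System.out.println("matched:    "
                + roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8));

        // Writer used UTF-8, reader assumed ISO-8859-1: the accented letter is garbled.
        System.out.println("mismatched: "
                + roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

Setting the encoding explicitly on the producing side, as the --encoding option described in the quote does, removes the dependence on whatever the platform default happens to be.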
      125
      126
 103  127  > java -jar tika-app-*.jar --help
 104  128    ...