Changeset 34291 for gs2-extensions/gstika/trunk/GS_TIKA_README.txt
Timestamp: 2020-07-24T18:07:13+12:00
Files: 1 edited
gs2-extensions/gstika/trunk/GS_TIKA_README.txt
--- r34187
+++ r34291
@@ -34,4 +34,21 @@
 https://tika.apache.org/1.8/examples.html
 https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika
+
+
+* HTML output is without images:
+- https://tika.apache.org/1.8/examples.html#Picking_different_output_formats
+Picking different output formats
+
+With Tika, you can get the textual content of your files returned in a number of different formats. These can be plain text, html, xhtml, xhtml of one part of the file etc. This is controlled based on the ContentHandler you supply to the Parser.
+
+- https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika
+also seems to indicate that images are not part of the html output
+
+- https://stackoverflow.com/questions/27623809/how-to-extract-title-body-and-images-from-html-with-apache-tika-parser
+
+* More reading:
+- https://medium.com/@simonli_18826/apache-tika-code-with-example-walkthroughs-d1b0c18d5b2d
+- https://www.manning.com/books/tika-in-action
+- https://livebook.manning.com/book/tika-in-action/chapter-2/48 (one of the free chapters)
 
 --------------------------------------------------------------
@@ -101,4 +118,11 @@
 D. THE --encoding= FLAG TO TIKA
 --------------------------------------------------------------
+
+https://livebook.manning.com/book/tika-in-action/chapter-2/48
+Contains this insightful segment about the encoding flag:
+"Note that Tika will by default output text using the normal character encoding used on your computer. This is great if you're using Tika with tools such as your command-line console window that expect this default character encoding, but may cause trouble otherwise. To avoid unexpected encoding problems, you can explicitly set the output encoding with the --encoding option..."
+
+
+
 > java -jar tika-app-*.jar --help
 ...
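
The quoted note about the --encoding flag warns that output written in one character encoding can cause trouble when it is read as another. The following is a minimal sketch of that kind of mismatch, using only the Java standard library. It does not call Tika and is not Tika's implementation; the class name EncodingDemo and the helper roundTrip are hypothetical names chosen for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: plain JDK code, not part of Tika.
final class EncodingDemo {

    // Encode the text the way a producer would write it, then decode the
    // bytes the way a consumer would read them. When the two charsets
    // differ, non-ASCII characters can come back garbled.
    static String roundTrip(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        // "you're" spelled with a right single quotation mark (U+2019).
        String original = "you\u2019re";

        System.out.println("Platform default charset: " + Charset.defaultCharset());

        // Producer and consumer agree on UTF-8: the text survives.
        String matched = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        // Producer writes UTF-8, consumer assumes ISO-8859-1: the three
        // UTF-8 bytes of U+2019 are read back as three separate characters.
        String mismatched = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println("matched equals original:    " + matched.equals(original));
        System.out.println("mismatched equals original: " + mismatched.equals(original));
        System.out.println("original length: " + original.length()
                + ", mismatched length: " + mismatched.length());
    }
}
```

Passing an explicit value such as --encoding=UTF-8 to Tika pins the producer side of this round trip, so whatever reads the output only has to be told to expect the same encoding.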