Changeset 35530 for documentation
- Timestamp: 2021-09-29T18:21:43+13:00 (2 years ago)
- File: 1 edited
documentation/trunk/tutorials/xml-source/tutorial_en.xml
r34583  r35530
1853    1853    <Text id="ew-31">Look at the metadata for the two documents again in the <AutoText key="glidict::GUI.Enrich"/> panel. You should now see ex.Creator and ex.Subject metadata items. This metadata can now be used in display or browsing classifiers etc.</Text>
1854    1854    </NumberedItem>
        1855    <MajorVersion number="3">
        1856    <Heading>
        1857    <Text id="ew-32">Processing docx files</Text>
        1858    </Heading>
        1859    <NumberedItem>
        1860    <Text id="ew-33">Drag and drop the testword.docx file, or any Word document you have with a docx file extension, into the collection. <b>Build</b> the collection. With <AutoText text="windows_scripting"/> turned on, <Format>docx</Format> files, the newer Word document format, will now also be processed during the build. <b>Preview</b> the collection and look at the document view of the newly added Word document to see what the generated HTML version of the file looks like.</Text>
        1861    </NumberedItem>
        1862    <NumberedItem>
        1863    <Text id="ew-34">Now turn off <AutoText text="windows_scripting"/> in the <AutoText key="glidict::GUI.Design"/> panel, and <b>rebuild</b> the collection. All the documents should still be processed, because Greenstone's document plugin pipeline is now set up by default with an <AutoText text="UnknownConverterPlugin"/> configured to use <i>Apache Tika</i> to extract text from Word documents (including docx files). <b>Preview</b> the collection and revisit the document view of the docx file. This time, the HTML produced should look very different: much more basic. This is because <i>Tika</i> supports extracting text from different document formats, including Word documents, but is not optimised for HTML presentation. However, this does mean full text searching will be available for docx files too when Greenstone is installed out-of-the-box.</Text>
        1864    <Text id="ew-35">So at a pinch, you can always use Greenstone's now-default document plugin setup to process a collection that includes docx files. This at least supports full text searching of their contents, even if the document view (the HTML view) of docx files processed with Tika is not formatted as closely to the original source document. Presentation may be of secondary importance, since by default Greenstone provides a link to the source document in its original format anyway (in this case, a link to the docx file).</Text>
        1865    </NumberedItem>
        1866    </MajorVersion>
1855    1867    </Content>
1856    1868    </Tutorial>
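The added tutorial text says that text extraction keeps a docx file searchable but yields a much more basic HTML view. The sketch below illustrates why, under stated assumptions: it is not how Apache Tika or Greenstone's UnknownConverterPlugin work internally. It relies only on the fact that a docx file is a zip archive whose main content lives in word/document.xml. The function names `docx_to_plain_html` and `make_minimal_docx` are hypothetical, and the fixture is not a complete, Word-openable docx.

```python
import html
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_plain_html(docx_bytes: bytes) -> str:
    """Return very basic HTML: one <p> per non-empty Word paragraph.

    Run properties (bold, fonts, colours) are ignored, so the result is
    searchable text without the original formatting.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Concatenate the text of every <w:t> node in this paragraph.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(f"<p>{html.escape(text)}</p>")
    return "\n".join(paragraphs)


def make_minimal_docx(paragraph_texts):
    """Hypothetical fixture: an archive holding only word/document.xml.

    Every run is marked bold (<w:b/>) to show that formatting is dropped.
    """
    body = "".join(
        f"<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>{html.escape(t)}</w:t></w:r></w:p>"
        for t in paragraph_texts
    )
    document_xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = make_minimal_docx(["Hello & welcome", "Second paragraph"])
    print(docx_to_plain_html(sample))
```

The bold markup in the fixture does not survive into the output, which mirrors the trade-off the tutorial describes: the text remains available for full text indexing, while presentation is left to the link back to the original docx file.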