Timestamp:
2021-09-29T18:21:43+13:00
Author:
anupama
Message:

Enhanced the Word document handling tutorial: added instructions for processing a docx file with windows_scripting (since we use an updated vbs script to process docx files), and a step for processing the reports collection containing the docx file with windows_scripting off. In the latter case it should now fall back to the UnknownConverterPlugin configured to use tika for docx files, providing full text searching of docx files out of the box, but with lower quality html presentation than with windows_scripting.

File:
1 edited

  • documentation/trunk/tutorials/xml-source/tutorial_en.xml

    r34583 r35530  
    18531853<Text id="ew-31">Look at the metadata for the two documents again in the <AutoText key="glidict::GUI.Enrich"/> panel. You should now see ex.Creator and ex.Subject metadata items. This metadata can now be used in display or browsing classifiers etc.</Text>
    18541854</NumberedItem>
     1855<MajorVersion number="3">
     1856<Heading>
     1857<Text id="ew-32">Processing docx files</Text>
     1858</Heading>
     1859<NumberedItem>
     1860<Text id="ew-33">Drag and drop the testword.docx file, or any Word document you have with a docx file extension, into the collection. <b>Build</b> the collection. With <AutoText text="windows_scripting"/> turned on, <Format>docx</Format> files, the newer Word document format, will now also be processed during the build. <b>Preview</b> the collection and look at the document view of the newly added Word document to see what the generated HTML version of the file looks like.</Text>
     1861</NumberedItem>
     1862<NumberedItem>
     1863<Text id="ew-34">Now turn off <AutoText text="windows_scripting"/> in the <AutoText key="glidict::GUI.Design"/> panel, and <b>rebuild</b> the collection. All the documents should still be processed, because Greenstone's document plugin pipeline is now set up by default with an <AutoText text="UnknownConverterPlugin"/> configured to use <i>Apache Tika</i> to extract text from Word documents (including docx files). <b>Preview</b> the collection and revisit the document view of the docx file. This time, the HTML produced should look very different: much more basic. This is because <i>Tika</i> supports extracting text from different document formats, including Word documents, but is not optimised for HTML presentation. However, it does mean that full text searching will be available for docx files too, when Greenstone is installed out-of-the-box.</Text>
     1864<Text id="ew-35">So at a pinch, you can always use Greenstone's default document plugin setup to process a collection that includes docx files. This at least supports full text searching of their contents, even if the document view (the HTML view) of docx files processed with Tika may not look as well formatted as the original source document. Presentation may be of secondary importance, since by default Greenstone provides a link to the original source document in its original format anyway (in this case, a link to the docx file).</Text>
     1865</NumberedItem>
     1866</MajorVersion>
    18551867</Content>
    18561868</Tutorial>