Changeset 37524


Timestamp:
2023-03-16T20:34:57+13:00
Author:
anupama
Message:

Some changes to the Enhanced Word doc handling tutorial.

File:
1 edited

  • documentation/trunk/tutorials/xml-source/tutorial_en.xml

    r37354 r37524  
    1861 1861 </Heading>
    1862 1862 <NumberedItem>
    1863 <Text id="ew-33">Drag and drop the <Path>sample_files &rarr; Word_and_PDF &rarr; extra_docx &rarr; testword.docx</Path> file, or any Word doc you have with docx file extension, into the collection. <b>Build</b> the collection. With <AutoText text="windows_scripting"/> turned on, <Format>docx</Format> files, which are the newer version of word documents, will now also be processed during build. <b>Preview</b> the collection and have a look at the document view of the newly added word document in the collection, to see what the generated html version of the file looks like. (testword.docx is a very basic docx file, containing a few sentences and an image.)</Text>
    1864 </NumberedItem>
    1865 <NumberedItem>
    1866 <Text id="ew-34">Now turn off <AutoText text="windows_scripting"/> in the <AutoText key="glidict::GUI.Design"/> panel, and <b>rebuild</b> the collection again. All the documents should still be processed, because Greenstone's document plugin pipeline is now set up with an <AutoText text="UnknownConverterPlugin"/> configured to use <i>Apache Tika</i> to extract text from Word documents by default (including docx files). <b>Preview</b> the collection and revisit the document view of the docx file. This time, the html produced should look very different: much more basic. This is because <i>Tika</i> supports extracting text from different document formats, including word documents, but is not optimised for html presentation. However, this does mean full text searching will be available for docx files toom when Greenstone is installed out-of-the-box.</Text>
     1863<Text id="ew-33">Drag and drop the <Path>sample_files &rarr; Word_and_PDF &rarr; extra_docx &rarr; testword.docx</Path> file, or any Word doc you have with docx file extension, into the collection. In the <AutoText key="glidict::CDM.GUI.Plugins"/> section of the <AutoText key="glidict::GUI.Design"/> panel, use the <AutoText key="glidict::CDM.Move.Move_Down" type="button"/> to move the <AutoText text="UnknownConverterPlugin"/> in the plugins list to below the <AutoText text="WordPlugin"/> in the document plugin pipeline. <b>Build</b> the collection. With <AutoText text="windows_scripting"/> turned on, <Format>docx</Format> files, which are the newer version of word documents, will now also be processed during build. <b>Preview</b> the collection and have a look at the document view of the newly added word document in the collection, to see what the generated html version of the file looks like. (testword.docx is a very basic docx file, containing a few sentences and an image.)</Text>
     1864</NumberedItem>
     1865<NumberedItem>
     1866<Text id="ew-34">Now turn off <AutoText text="windows_scripting"/> in the <AutoText key="glidict::GUI.Design"/> panel, and <b>rebuild</b> the collection again. All the documents should still be processed, because Greenstone's document plugin pipeline is now set up with an <AutoText text="UnknownConverterPlugin"/> configured to use <i>Apache Tika</i> to extract text from Word documents by default (including docx files). <b>Preview</b> the collection and revisit the document view of the docx file. This time, the html produced should look very different: much more basic. This is because <i>Tika</i> supports extracting text from different document formats, including word documents, but is not optimised for html presentation. However, this does mean full text searching will be available for docx files too when Greenstone is installed out-of-the-box.</Text>
    1867 1867 <Text id="ew-35">So at a pinch, you can always use Greenstone's now default document plugins setup, to process a collection that includes docx files, to at least support full text searching of the contents of docx files, even if the document view (the HTML view) of docx files processed with Tika may not look as formatted as the original source document. Presentation may be of secondary importance, since by default Greenstone will anyway provide a link to the original source document in its original format (in this case, a link to the docx file).</Text>
     1868<Comment><Text id="ew-36">Above, we shifted the <AutoText text="UnknownConverterPlugin"/> that uses Apache Tika to below the <AutoText text="WordPlugin"/> in the document plugin pipeline, because we want to force <AutoText text="WordPlugin"/> to attempt to process all word documents first, when it recognises them. Apache Tika can always process Word documents, but we favour <AutoText text="WordPlugin"/> to try processing them first, including the newer docx files, which it can do when on Windows machines with Word installed and <AutoText text="windows_scripting"/> turned on. Turning off <AutoText text="windows_scripting"/> instructs the <AutoText text="WordPlugin"/> not to make use of Word to convert doc(x) files to html, and so <AutoText text="WordPlugin"/> is not able to process docx files. As a result, the document plugins in the pipeline pass the unprocessed docx file further down the pipeline to the <AutoText text="UnknownConverterPlugin"/> that is able to process the docx file as it's pre-configure to make use of Apache Tika to extract text from Word documents.</Text></Comment>
    1868 1869 </NumberedItem>
    1869 1870 </MajorVersion>
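
The ew-36 comment added in this changeset describes a fall-through: each file is offered to the plugins in list order, and the first plugin able to process it takes it. The sketch below is a toy model of that ordering only, not Greenstone's implementation. The names `Plugin`, `build_pipeline` and `first_accepting_plugin` are hypothetical, and the rule that WordPlugin still accepts legacy .doc files with windows_scripting off is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Plugin:
    """Hypothetical stand-in for a document plugin: a name plus an acceptance test."""
    name: str
    accepts: Callable[[str], bool]


def first_accepting_plugin(pipeline: List[Plugin], filename: str) -> Optional[str]:
    """Offer the file to each plugin in list order; the first one that accepts it wins."""
    for plugin in pipeline:
        if plugin.accepts(filename):
            return plugin.name
    return None  # no plugin in the list could process the file


def build_pipeline(windows_scripting: bool) -> List[Plugin]:
    """Model the ordering from the tutorial: WordPlugin above UnknownConverterPlugin."""

    def word_accepts(filename: str) -> bool:
        name = filename.lower()
        if name.endswith(".docx"):
            # Per the tutorial text, docx needs windows_scripting turned on.
            return windows_scripting
        # Assumption for illustration: legacy .doc is accepted either way.
        return name.endswith(".doc")

    def tika_accepts(filename: str) -> bool:
        # Per the tutorial text, the Tika-backed plugin can always take Word documents.
        return filename.lower().endswith((".doc", ".docx"))

    return [
        Plugin("WordPlugin", word_accepts),
        Plugin("UnknownConverterPlugin", tika_accepts),
    ]


if __name__ == "__main__":
    for scripting in (True, False):
        handler = first_accepting_plugin(build_pipeline(scripting), "testword.docx")
        print(f"windows_scripting={scripting}: {handler}")
```

In this model, swapping the two list entries would make the Tika-backed plugin claim every Word document first, which is the situation the move-down step in ew-33 is meant to avoid.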