Changeset 32978


Timestamp:
2019-04-04T19:30:25+13:00
Author:
ak19
Message:

Additional changes to tutorials after going through the first set of tutorials for linux testing.

File:
1 edited

  • documentation/trunk/tutorials/xml-source/tutorial_en.xml

    r32907 r32978  
    15111511<Version initial="3.09" current="3.09"/>
    15121512<Content>
    1513   <Comment><Text id="ep-2">Prior to Greenstone 3.09, Greenstone shipped with a plugin called <AutoText text="PDFPlugin"/>. It was the plugin Greenstone used to convert PDF files to HTML using the third-party software <AutoText text="pdftohtml.pl" type="italics"/>. PDFPlugin allowed users to view PDF documents even if they didn't have the PDF software installed. Unfortunately, sometimes the formatting of the resulting HTML files was not so good. Earlier versions of this tutorial would provide some instruction on extra options to the PDFPlugin for producing a nicer version for display.</Text>
     1513  <Comment><Text id="ep-2">Prior to Greenstone 3.09, Greenstone shipped with a plugin called <AutoText text="PDFPlugin"/>. It was the plugin Greenstone used to convert PDF files to HTML using the third-party software <AutoText text="pdftohtml.pl" type="italics"/>. PDFPlugin allowed users to view PDF documents even if they didn't have the PDF software installed. Unfortunately, sometimes the formatting of the resulting HTML files was not so good. Earlier versions of this tutorial would provide some instruction on extra options to the PDFPlugin for producing a nicer version for display. The older pdftohtml process could however not cope with much newer versions of PDF unless PDFPlugin's <Format>pdfbox_conversion</Format> option was switched on.</Text>
    15141514  </Comment>
    1515   <Comment><Text id="ep-2a">Furthermore, the older pdftohtml process could not cope with much newer versions of PDF unless PDFPlugin's <Format>pdfbox_conversion</Format> option was switched on.</Text>
    1516   </Comment>
    1517   <Comment><Text id="ep-2b">Starting with Greenstone 3.09, some older pdf processing functionality has been restructured into <AutoText text="PDFv1Plugin"/>, while shifting the <Format>pdfbox_conversion</Format> option into <AutoText text="PDFv2Plugin"/>. PDFv2Plugin further makes use of third-party software <AutoText text="xpdf-tools" type="italics"/>, which better copes with newer PDFs (without requiring the <Format>pdfbox_conversion</Format> option to be activated). PDFv2Plugin comes with several new preconfigured settings to produce output files in html, text, image or image and text formats, that can better reflect the appearance of an input PDF document's pages. Behind the scenes, PDFv2Plugin is configured to use the third-party xpdf-tools or pdfbox software for each output setting.</Text>
     1515  <Comment><Text id="ep-2b">Starting with Greenstone 3.09, some older pdf processing functionality has been restructured into <AutoText text="PDFv1Plugin"/>, while shifting the <Format>pdfbox_conversion</Format> option into <AutoText text="PDFv2Plugin"/>. PDFv2Plugin further makes use of third-party software <AutoText text="xpdf-tools" type="italics"/>, which better copes with newer PDFs, thus no longer requiring activating the <Format>pdfbox_conversion</Format> option when dealing with newer PDFs. PDFv2Plugin comes with several new preconfigured settings to produce output files in html, text, image or image and text formats, that can better reflect the appearance of an input PDF document's pages. Behind the scenes, PDFv2Plugin is configured to use the third-party xpdf-tools or pdfbox software for each output setting.</Text>
    15181516  </Comment>
    15191517  <Comment><Text id="ep-2c">From Greenstone 3.09 onwards, PDFv2Plugin is added to a new collection's Document Plugins pipleline by default, in place of the now defunct PDFPlugin. In any instance where you particularly prefer the original PDFPlugin's HTML output for a PDF, you can now use PDFv1Plugin instead, as it still retains this functionality.</Text>
     
    15281526</NumberedItem>
    15291527<NumberedItem>
    1530 <Text id="ep-4a">Preview the collection and view the documents. Inspect <Path>pdf01</Path> and <Path>pdf03</Path> first. There's a table of contents is provided to the right. Clicking on a page in the table of contents will scroll to that page. Another way of navigating can be found to the left, where individual pages are listed vertically by page number and clicking the "plus" box next to a page will expand its contents. The pdfs have been sectionalised into groups of 10 pages, each group further containing a section for each individual page. If your pdf contained 10 or fewer pages, there won't two levels of sectionalising, just one.</Text>
     1528<Text id="ep-4a">Preview the collection and view the documents. Inspect <Path>pdf01</Path> and <Path>pdf03</Path> first. There's a table of contents is provided to the right. Clicking on a page in the table of contents will scroll to that page. Another way of navigating can be found to the left, where individual pages are listed vertically by page number and clicking the "plus" box next to a page will expand its contents. The pdfs have been sectionalised into groups of 10 pages, each group further containing a section for each individual page. If your pdf contained 10 or fewer pages, there won't be two levels of sectionalising, just one.</Text>
    15311529<Text id="ep-4b">If you visit a given page and try to select and copy the text, you can. These are not entirely images of the pdf's pages (like screenshots of a pdf page), but are HTML pages that combine the images of the background of each pdf page with the actual text of that page superimposed. The latter is what makes the text selectable.</Text>
    15321530<Text id="ep-4c">If you return to GLI's Design pane and double click on PDFv2Plugin in Document Plugins, then you will see that the convert_to option is set to paged_pretty_html. This is the default PDF convert_to type and produces the kind of sectionalised HTML pages consisting of background images and superimposed text that you see with <Path>pdf01</Path> and <Path>pdf03</Path>.</Text>
     
    16081606</NumberedItem>
    16091607<NumberedItem>
    1610 <Text id="ep-24">Switch to the <AutoText key="glidict::CDM.GUI.Plugins"/> section of the <AutoText key="glidict::GUI.Design"/> panel. Add a second instance of <AutoText text="PDFv2Plugin"/> by selecting <AutoText text="PDFv2Plugin"/> from the <AutoText key="glidict::CDM.PlugInManager.PlugIn"/> drop-down list, and clicking <AutoText key="glidict::CDM.PlugInManager.Add" type="button"/>. This plugin will come after the first PDFv2Plugin instance, so we configure it to process PDF documents as sectionalised HTML. Leave the <AutoText text="convert_to"/> option on <AutoText text="paged_pretty_html"/>, and switch on the <AutoText text="use_sections"/> option. Click <AutoText key="glidict::General.OK" type="button"/>.</Text>
     1608<Text id="ep-24">Switch to the <AutoText key="glidict::CDM.GUI.Plugins"/> section of the <AutoText key="glidict::GUI.Design"/> panel. Add a second instance of <AutoText text="PDFv2Plugin"/> by selecting <AutoText text="PDFv2Plugin"/> from the <AutoText key="glidict::CDM.PlugInManager.PlugIn"/> drop-down list, and clicking <AutoText key="glidict::CDM.PlugInManager.Add" type="button"/>. This plugin will come after the first PDFv2Plugin instance, so we configure it to process PDF documents as sectionalised HTML by leaving the <AutoText text="convert_to"/> option on the default, <AutoText text="paged_pretty_html"/>. Click <AutoText key="glidict::General.OK" type="button"/>.</Text>
    16111609</NumberedItem>
    16121610<NumberedItem>
     
    21402138</Heading>
    21412139<Comment>
    2142 <Text id="0457a">Next we'll add an interactive hierarchical phrase browsing classifier to this collection. Java applet support is being or has been phased out in various browsers and browser versions. As a result the following will not work on <Link url="https://stackoverflow.com/questions/31816839/how-do-i-enable-java-in-microsoft-edge-web-browser">Microsoft Edge</Link> browsers, among others.</Text>
     2140<Text id="0457a">Next we'll add an interactive hierarchical phrase browsing classifier to this collection. Java applet support is being or has been phased out in various browsers and browser versions. As a result the following will not work on <Link url="https://stackoverflow.com/questions/31816839/how-do-i-enable-java-in-microsoft-edge-web-browser">Microsoft Edge</Link> and some other browsers.</Text>
    21432141</Comment>
    21442142<NumberedItem>
     
    21862184</NumberedItem>
    21872185<NumberedItem>
    2188 <Text id="0455">Search for the term <i>Mary</i> again, as that is likely to be common in all five index partitions, and check that the numbers of words (not documents) add up.</Text>
    2189 </NumberedItem>
    2190 <NumberedItem>
    21912186<Text id="0455a">The text in the drop down box on the search page is based on the filters each partition was built on. To change the text that is displayed, go to the <AutoText key="glidict::CDM.GUI.SearchMetadata"/> section of the <AutoText key="glidict::GUI.Format"/> panel. The single filter partitions have sensible default text, but the combined partition does not. Set the <AutoText key="glidict::CDM.SearchMetadataManager.Component_Name"/> for the combined partition to "all". <b>Preview</b> the collection.</Text>
    21922187</NumberedItem>
     2188<NumberedItem>
     2189<Text id="0455">Search for the term <i>Mary</i> again, as that is likely to be common in all five index partitions, and check that the numbers of words (not documents) in the search results for the 4 individual indexes add up to the number of words for the <i>all</i> index.</Text>
     2190</NumberedItem>
    21932191<Heading>
    21942192<Text id="0462">Controlling the building process</Text>
     
    21982196</Comment>
    21992197<NumberedItem>
    2200 <Text id="0463">Switch to the <AutoText key="glidict::GUI.Create"/> panel. Expand the top panel to be able to see the options for collection building. Scroll to view them all, then select <AutoText text="Import Options"/> on the left and view the options that are then displayed to the right. Select <AutoText text="maxdocs"/> and set its numeric counter to <AutoText text="3"/>. (When in GLI's <AutoText key="glidict::Preferences.Mode.Expert"/> Mode, the <AutoText text="maxdocs"/> option for the import process are located under the <AutoText text="Import Options"/> of the <AutoText key="glidict::GUI.Create"/> panel.) Now <b>build</b>.</Text>
     2198<Text id="0463">Switch to the <AutoText key="glidict::GUI.Create"/> panel. Expand the top panel to be able to see the options for collection building. Scroll to view them all. Select <AutoText text="maxdocs"/> and set its numeric counter to <AutoText text="3"/>. (When in GLI's <AutoText key="glidict::Preferences.Mode.Expert"/> Mode, the <AutoText text="maxdocs"/> option for the import process are located under the <AutoText text="Import Options"/> of the <AutoText key="glidict::GUI.Create"/> panel.) Now <b>build</b>.</Text>
    22012199</NumberedItem>
    22022200<NumberedItem>
     
    26202618<NumberedItem>
    26212619  <Text id="0417a">If your computer is behind a firewall or proxy server, you will need to edit the proxy settings in the Librarian Interface. Click the <AutoText key="glidict::Mirroring.Preferences" type="button"/> button. Switch on the <AutoText key="glidict::Preferences.Connection.Use_Proxy"/> checkbox. Enter the proxy server address and port number in the <AutoText key="glidict::Preferences.Connection.HTTP_Proxy_Host"/> and <AutoText key="glidict::Preferences.Connection.Proxy_Port"/> boxes.</Text>
    2622   <Text id ="0417b">URLs that start with <i>https</i>, or URLs that resolve to <i>https</i>, will additionally need the <AutoText key="glidict::Preferences.Connection.HTTP_Proxy_Host"/> and corresponding <AutoText key="glidict::Preferences.Connection.Proxy_Port"/> filled in too, before web pages can be downloaded from there.</Text>
     2620  <Text id ="0417b">URLs that start with <i>https</i>, or URLs that resolve to <i>https</i>, will additionally need the <AutoText key="glidict::Preferences.Connection.HTTPS_Proxy_Host"/> and corresponding <AutoText key="glidict::Preferences.Connection.Proxy_Port"/> filled in too, before web pages can be downloaded from there.</Text>
    26232621<Text id ="0417c">Websites at https URLs often have a security certificate, but not always. For instance, <Link>https://englishhistory.net</Link> does not have one. To instruct GLI to nevertheless download pages from <i>https</i> URLs that don't have a security certificate, you'll also need to switch on the <AutoText key="glidict::Preferences.Connection.No_Check_Certificate"/> checkbox.</Text>
    26242622  <Text id ="0417d">Once you've finished configuring the proxy settings, click <AutoText key="glidict::General.OK" type="button"/> to close the dialog.</Text>
     
    36253623</Heading>
    36263624<Comment>
    3627   <Text id="0612a-1">Java applet support is being or has been phased out in various browsers and browser versions. As a result the following step will not work on <Link url="https://stackoverflow.com/questions/31816839/how-do-i-enable-java-in-microsoft-edge-web-browser">Microsoft Edge</Link> browsers, among others. If you're using such a browser, you may skip this step.</Text>
     3625  <Text id="0612a-1">Java applet support is being or has been phased out in various browsers and browser versions. As a result the following step will not work on <Link url="https://stackoverflow.com/questions/31816839/how-do-i-enable-java-in-microsoft-edge-web-browser">Microsoft Edge</Link> and some other browsers. If you're using such a browser, you may skip this step.</Text>
    36283626</Comment>
    36293627<NumberedItem>
     
    36353633</Heading>
    36363634<NumberedItem>
    3637 <Text id="0613">To complete the collection, let's give it a new image for the <MajorVersion number="2">top left corner of the page</MajorVersion><MajorVersion number="3">link from the main page</MajorVersion>. Go to the <AutoText key="glidict::CDM.GUI.General"/> section of the <AutoText key="glidict::GUI.Format"/> panel. Use the browse button of <AutoText key="glidict::CDM.General.Icon_Collection"/> to select the following image:</Text>
     3635<Text id="0613">To complete the collection, let's give it a new image for the <MajorVersion number="2">top left corner of the page</MajorVersion><MajorVersion number="3">link from the main page</MajorVersion>. Go to the <AutoText key="glidict::CDM.GUI.General"/> section of the <AutoText key="glidict::GUI.Format"/> panel. Use the browse button of the <AutoText key="glidict::CDM.General.Icon_Collection"/> to select the following image:</Text>
     3636<Path>sample_files &rarr; beatles &rarr; advbeat_large &rarr; images &rarr; tile.jpg</Path>
     3637<Text id="0613b">You can also set an image for the link to the collection's home page here. For this, use the browse button of <AutoText key="glidict::CDM.General.Icon_Collection_Small"/> to select the following image:</Text>
    36383638<Path>sample_files &rarr; beatles &rarr; advbeat_large &rarr; images &rarr; beatlesmm.png</Path>
    3639 <Text id="0613a"><b>Preview</b> the collection, and make sure the new image appears.</Text>
     3639<Text id="0613b"><b>Preview</b> the collection, and make sure the new image appears on the collection's about page.</Text>
     3640<Text id="0613c">Also go to the digital library home page by clicking on the <i>My Greenstone Library</i> link at the top left. On the home page, look through the links to all the collections in your digital library to find the one to the Small Beatles collection. This link should now be denoted by an image bearing the text &quot;BeatlesMultimedia&quot;.</Text>
    36403641</NumberedItem>
    36413642<Heading>
     
    41484149<NumberedItem>
    41494150<Text id="0690h-1">In the <AutoText key="glidict::CDM.GUI.Formats"/> section of the <AutoText key="glidict::GUI.Format"/> panel, select <AutoText text="Search"/> in <AutoText key="glidict::CDM.FormatManager.Feature"/><MajorVersion number="3"> to adjust how search results are displayed.</MajorVersion><MajorVersion number="2">, and <AutoText text="VList"/> in <AutoText key="glidict::CDM.FormatManager.Part"/>. Click <AutoText key="glidict::CDM.FormatManager.Add" type="button"/> to add this format to the collection. The previous changes modified <AutoText text="VList"/>, so they will apply to all <AutoText text="VList"/>s that don't have specific format statements. These next changes are made to <AutoText text="SearchVList"/> so will only apply to search results. </MajorVersion></Text>
    4150 <Text id="0690i">The extracted Title for the current section is specified as <Format><MajorVersion number="2">[ex.Title]</MajorVersion><MajorVersion number="3">&lt;gsf:metadata name=&quot;Title&quot;/&gt;</MajorVersion></Format> while the Title for the parent section is <Format><MajorVersion number="2">[parent:ex.Title]</MajorVersion><MajorVersion number="3">&lt;gsf:metadata name=&quot;Title&quot; select=&quot;parent&quot;/&gt;</MajorVersion></Format>. Since the same <AutoText text="SearchVList"/> format statement is used when searching both whole newspapers and newspaper pages, we need to make sure it works in both cases.</Text>
     4151<Text id="0690i">The extracted Title for the current section is specified as <Format><MajorVersion number="2">[ex.Title]</MajorVersion><MajorVersion number="3">&lt;gsf:metadata name=&quot;Title&quot;/&gt;</MajorVersion></Format> while the Title for the parent section is <MajorVersion number="2"><Format>[parent:ex.Title]</Format></MajorVersion><MajorVersion number="3"><Format>&lt;gsf:metadata name=&quot;Title&quot; select=&quot;parent&quot;/&gt;</Format> (if using metadata assigned at the document or root level, this would be <Format>&lt;gsf:metadata name=&quot;Title&quot; select=&quot;root&quot;/&gt;</Format>)</MajorVersion>. Since the same <AutoText text="SearchVList"/> format statement is used when searching both whole newspapers and newspaper pages, we need to make sure it works in both cases.</Text>
    41514152<MajorVersion number="2">
    41524153<Text id="0690j">Set the format statement to the following text (it can be copied and pasted from the file <Path>sample_files &rarr; niupepa &rarr; formats &rarr; search_tweak.txt</Path>):</Text>
     
    41784179      <Tab n="1"/>&lt;i&gt;<br />
    41794180        <Tab n="2"/>&lt;gsf:choose-metadata&gt;<br />
     4181          <Tab n="3"/>&lt;gsf:metadata name=&quot;Date&quot; format=&quot;formatDate&quot; /&gt;<br />
    41804182          <Tab n="3"/>&lt;gsf:metadata name=&quot;Date&quot; select=&quot;parent&quot; format=&quot;formatDate&quot; /&gt;<br />
    4181           <Tab n="3"/>&lt;gsf:metadata name=&quot;Date&quot; format=&quot;formatDate&quot; /&gt;<br />
     4183      <Tab n="3"/>&lt;gsf:metadata name=&quot;Date&quot; select=&quot;root&quot; format=&quot;formatDate&quot; /&gt;<br />
    41824184          <Tab n="3"/>&lt;gsf:default&gt;undated&lt;/gsf:default&gt;<br />
    41834185        <Tab n="2"/>&lt;/gsf:choose-metadata&gt;<br />
     
    45184520        <Tab n="3"/>&lt;td&gt;Caption:&lt;/td&gt;<br />
    45194521        <Tab n="3"/>&lt;td&gt;&lt;i&gt;&lt;gsf:metadata name=&quot;ex.dc.Description&quot;/&gt;&lt;/i&gt;&lt;br/&gt;<br />
    4520         <Tab n="3"/>&lt;a&gt;&lt;xsl:attribute name=&quot;href&quot;&gt;&lt;gsf:metadata name=&quot;ex.dc.OrigURL&quot;/&gt;&lt;/xsl:attribute&gt;<br />
     4522        <Tab n="3"/>&lt;gsf:link type=&quot;source&quot;&gt;<br />
    45214523      <Tab n="4"/>original &lt;gsf:metadata name=&quot;ImageWidth&quot;/&gt;x&lt;gsf:metadata name=&quot;ImageHeight&quot;/&gt; &lt;gsf:metadata name=&quot;ImageType&quot;/&gt; available<br />
    4522     <Tab n="3"/>&lt;/a&gt;<br />
     4524    <Tab n="3"/>&lt;/gsf:link&gt;<br />
    45234525        <Tab n="3"/>&lt;/td&gt;<br />
    45244526      <Tab n="2"/>&lt;/tr&gt;<br />
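
Reviewer's sketch of the reordered `gsf:choose-metadata` hunk above (new lines 4181 to 4184). It assumes that `gsf:choose-metadata` emits the first listed metadata value that is available and falls back to `gsf:default` otherwise. Under that assumption, the new order tries the current section's Date, then the parent's, then the root's, then "undated". The function names and the dictionary representation of sections are hypothetical and not part of Greenstone, and the `formatDate` formatting step is not modelled.

```python
from typing import Mapping, Optional, Sequence


def choose_metadata(candidates: Sequence[Optional[str]], default: str) -> str:
    """Return the first candidate that is present and non-empty, else the default.

    Hypothetical stand-in for the assumed first-match behaviour of
    gsf:choose-metadata; not Greenstone code.
    """
    for value in candidates:
        if value:
            return value
    return default


def pick_date(
    section: Mapping[str, str],
    parent: Mapping[str, str],
    root: Mapping[str, str],
) -> str:
    """Mirror the order in the new hunk: section, then parent, then root."""
    return choose_metadata(
        [section.get("Date"), parent.get("Date"), root.get("Date")],
        default="undated",
    )
```

With this ordering, a page that carries its own Date is labelled with that Date rather than with the Date of the enclosing newspaper, and a value assigned only at the document root is still picked up before the "undated" fallback.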