Changeset 32978 for documentation
- Timestamp: 2019-04-04T19:30:25+13:00
- File: 1 edited
documentation/trunk/tutorials/xml-source/tutorial_en.xml
r32907 r32978
  1511 1511 <Version initial="3.09" current="3.09"/>
  1512 1512 <Content>
- 1513 <Comment><Text id="ep-2">Prior to Greenstone 3.09, Greenstone shipped with a plugin called <AutoText text="PDFPlugin"/>. It was the plugin Greenstone used to convert PDF files to HTML using the third-party software <AutoText text="pdftohtml.pl" type="italics"/>. PDFPlugin allowed users to view PDF documents even if they didn't have the PDF software installed. Unfortunately, sometimes the formatting of the resulting HTML files was not so good. Earlier versions of this tutorial would provide some instruction on extra options to the PDFPlugin for producing a nicer version for display. </Text>
+ 1513 <Comment><Text id="ep-2">Prior to Greenstone 3.09, Greenstone shipped with a plugin called <AutoText text="PDFPlugin"/>. It was the plugin Greenstone used to convert PDF files to HTML using the third-party software <AutoText text="pdftohtml.pl" type="italics"/>. PDFPlugin allowed users to view PDF documents even if they didn't have the PDF software installed. Unfortunately, sometimes the formatting of the resulting HTML files was not so good. Earlier versions of this tutorial would provide some instruction on extra options to the PDFPlugin for producing a nicer version for display. The older pdftohtml process could however not cope with much newer versions of PDF unless PDFPlugin's <Format>pdfbox_conversion</Format> option was switched on.</Text>
  1514 1514 </Comment>
- 1515 <Comment><Text id="ep-2a">Furthermore, the older pdftohtml process could not cope with much newer versions of PDF unless PDFPlugin's <Format>pdfbox_conversion</Format> option was switched on.</Text>
- 1516 </Comment>
- 1517 <Comment><Text id="ep-2b">Starting with Greenstone 3.09, some older pdf processing functionality has been restructured into <AutoText text="PDFv1Plugin"/>, while shifting the <Format>pdfbox_conversion</Format> option into <AutoText text="PDFv2Plugin"/>. PDFv2Plugin further makes use of third-party software <AutoText text="xpdf-tools" type="italics"/>, which better copes with newer PDFs (without requiring the <Format>pdfbox_conversion</Format> option to be activated). PDFv2Plugin comes with several new preconfigured settings to produce output files in html, text, image or image and text formats, that can better reflect the appearance of an input PDF document's pages. Behind the scenes, PDFv2Plugin is configured to use the third-party xpdf-tools or pdfbox software for each output setting.</Text>
+ 1515 <Comment><Text id="ep-2b">Starting with Greenstone 3.09, some older pdf processing functionality has been restructured into <AutoText text="PDFv1Plugin"/>, while shifting the <Format>pdfbox_conversion</Format> option into <AutoText text="PDFv2Plugin"/>. PDFv2Plugin further makes use of third-party software <AutoText text="xpdf-tools" type="italics"/>, which better copes with newer PDFs, thus no longer requiring activating the <Format>pdfbox_conversion</Format> option when dealing with newer PDFs. PDFv2Plugin comes with several new preconfigured settings to produce output files in html, text, image or image and text formats, that can better reflect the appearance of an input PDF document's pages. Behind the scenes, PDFv2Plugin is configured to use the third-party xpdf-tools or pdfbox software for each output setting.</Text>
  1518 1516 </Comment>
  1519 1517 <Comment><Text id="ep-2c">From Greenstone 3.09 onwards, PDFv2Plugin is added to a new collection's Document Plugins pipleline by default, in place of the now defunct PDFPlugin. In any instance where you particularly prefer the original PDFPlugin's HTML output for a PDF, you can now use PDFv1Plugin instead, as it still retains this functionality.</Text>
… …
  1528 1526 </NumberedItem>
  1529 1527 <NumberedItem>
- 1530 <Text id="ep-4a">Preview the collection and view the documents. Inspect <Path>pdf01</Path> and <Path>pdf03</Path> first. There's a table of contents is provided to the right. Clicking on a page in the table of contents will scroll to that page. Another way of navigating can be found to the left, where individual pages are listed vertically by page number and clicking the "plus" box next to a page will expand its contents. The pdfs have been sectionalised into groups of 10 pages, each group further containing a section for each individual page. If your pdf contained 10 or fewer pages, there won't two levels of sectionalising, just one.</Text>
+ 1528 <Text id="ep-4a">Preview the collection and view the documents. Inspect <Path>pdf01</Path> and <Path>pdf03</Path> first. There's a table of contents is provided to the right. Clicking on a page in the table of contents will scroll to that page. Another way of navigating can be found to the left, where individual pages are listed vertically by page number and clicking the "plus" box next to a page will expand its contents. The pdfs have been sectionalised into groups of 10 pages, each group further containing a section for each individual page. If your pdf contained 10 or fewer pages, there won't be two levels of sectionalising, just one.</Text>
  1531 1529 <Text id="ep-4b">If you visit a given page and try to select and copy the text, you can. These are not entirely images of the pdf's pages (like screenshots of a pdf page), but are HTML pages that combine the images of the background of each pdf page with the actual text of that page superimposed. The latter is what makes the text selectable.</Text>
  1532 1530 <Text id="ep-4c">If you return to GLI's Design pane and double click on PDFv2Plugin in Document Plugins, then you will see that the convert_to option is set to paged_pretty_html. This is the default PDF convert_to type and produces the kind of sectionalised HTML pages consisting of background images and superimposed text that you see with <Path>pdf01</Path> and <Path>pdf03</Path>.</Text>
… …
  1608 1606 </NumberedItem>
  1609 1607 <NumberedItem>
- 1610 <Text id="ep-24">Switch to the <AutoText key="glidict::CDM.GUI.Plugins"/> section of the <AutoText key="glidict::GUI.Design"/> panel. Add a second instance of <AutoText text="PDFv2Plugin"/> by selecting <AutoText text="PDFv2Plugin"/> from the <AutoText key="glidict::CDM.PlugInManager.PlugIn"/> drop-down list, and clicking <AutoText key="glidict::CDM.PlugInManager.Add" type="button"/>. This plugin will come after the first PDFv2Plugin instance, so we configure it to process PDF documents as sectionalised HTML. Leave the <AutoText text="convert_to"/> option on <AutoText text="paged_pretty_html"/>, and switch on the <AutoText text="use_sections"/> option. Click <AutoText key="glidict::General.OK" type="button"/>.</Text>
+ 1608 <Text id="ep-24">Switch to the <AutoText key="glidict::CDM.GUI.Plugins"/> section of the <AutoText key="glidict::GUI.Design"/> panel. Add a second instance of <AutoText text="PDFv2Plugin"/> by selecting <AutoText text="PDFv2Plugin"/> from the <AutoText key="glidict::CDM.PlugInManager.PlugIn"/> drop-down list, and clicking <AutoText key="glidict::CDM.PlugInManager.Add" type="button"/>. This plugin will come after the first PDFv2Plugin instance, so we configure it to process PDF documents as sectionalised HTML by leaving the <AutoText text="convert_to"/> option on the default, <AutoText text="paged_pretty_html"/>. Click <AutoText key="glidict::General.OK" type="button"/>.</Text>
  1611 1609 </NumberedItem>
  1612 1610 <NumberedItem>
… …
  2140 2138 </Heading>
  2141 2139 <Comment>
- 2142 <Text id="0457a">Next we'll add an interactive hierarchical phrase browsing classifier to this collection. Java applet support is being or has been phased out in various browsers and browser versions. As a result the following will not work on <Link url="https://stackoverflow.com/questions/31816839/how-do-i-enable-java-in-microsoft-edge-web-browser">Microsoft Edge</Link> browsers, among others.</Text>
+ 2140 <Text id="0457a">Next we'll add an interactive hierarchical phrase browsing classifier to this collection. Java applet support is being or has been phased out in various browsers and browser versions. As a result the following will not work on <Link url="https://stackoverflow.com/questions/31816839/how-do-i-enable-java-in-microsoft-edge-web-browser">Microsoft Edge</Link> and some other browsers.</Text>
  2143 2141 </Comment>
  2144 2142 <NumberedItem>
… …
  2186 2184 </NumberedItem>
  2187 2185 <NumberedItem>
- 2188 <Text id="0455">Search for the term <i>Mary</i> again, as that is likely to be common in all five index partitions, and check that the numbers of words (not documents) add up.</Text>
- 2189 </NumberedItem>
- 2190 <NumberedItem>
  2191 2186 <Text id="0455a">The text in the drop down box on the search page is based on the filters each partition was built on. To change the text that is displayed, go to the <AutoText key="glidict::CDM.GUI.SearchMetadata"/> section of the <AutoText key="glidict::GUI.Format"/> panel. The single filter partitions have sensible default text, but the combined partition does not. Set the <AutoText key="glidict::CDM.SearchMetadataManager.Component_Name"/> for the combined partition to "all". <b>Preview</b> the collection.</Text>
  2192 2187 </NumberedItem>
+ 2188 <NumberedItem>
+ 2189 <Text id="0455">Search for the term <i>Mary</i> again, as that is likely to be common in all five index partitions, and check that the numbers of words (not documents) in the search results for the 4 individual indexes add up to the number of words for the <i>all</i> index.</Text>
+ 2190 </NumberedItem>
  2193 2191 <Heading>
  2194 2192 <Text id="0462">Controlling the building process</Text>
… …
  2198 2196 </Comment>
  2199 2197 <NumberedItem>
- 2200 <Text id="0463">Switch to the <AutoText key="glidict::GUI.Create"/> panel. Expand the top panel to be able to see the options for collection building. Scroll to view them all, then select <AutoText text="Import Options"/> on the left and view the options that are then displayed to the right. Select <AutoText text="maxdocs"/> and set its numeric counter to <AutoText text="3"/>. (When in GLI's <AutoText key="glidict::Preferences.Mode.Expert"/> Mode, the <AutoText text="maxdocs"/> option for the import process are located under the <AutoText text="Import Options"/> of the <AutoText key="glidict::GUI.Create"/> panel.) Now <b>build</b>.</Text>
+ 2198 <Text id="0463">Switch to the <AutoText key="glidict::GUI.Create"/> panel. Expand the top panel to be able to see the options for collection building. Scroll to view them all. Select <AutoText text="maxdocs"/> and set its numeric counter to <AutoText text="3"/>. (When in GLI's <AutoText key="glidict::Preferences.Mode.Expert"/> Mode, the <AutoText text="maxdocs"/> option for the import process are located under the <AutoText text="Import Options"/> of the <AutoText key="glidict::GUI.Create"/> panel.) Now <b>build</b>.</Text>
  2201 2199 </NumberedItem>
  2202 2200 <NumberedItem>
… …
  2620 2618 <NumberedItem>
  2621 2619 <Text id="0417a">If your computer is behind a firewall or proxy server, you will need to edit the proxy settings in the Librarian Interface. Click the <AutoText key="glidict::Mirroring.Preferences" type="button"/> button. Switch on the <AutoText key="glidict::Preferences.Connection.Use_Proxy"/> checkbox. Enter the proxy server address and port number in the <AutoText key="glidict::Preferences.Connection.HTTP_Proxy_Host"/> and <AutoText key="glidict::Preferences.Connection.Proxy_Port"/> boxes.</Text>
- 2622 <Text id ="0417b">URLs that start with <i>https</i>, or URLs that resolve to <i>https</i>, will additionally need the <AutoText key="glidict::Preferences.Connection.HTTP_Proxy_Host"/> and corresponding <AutoText key="glidict::Preferences.Connection.Proxy_Port"/> filled in too, before web pages can be downloaded from there.</Text>
+ 2620 <Text id ="0417b">URLs that start with <i>https</i>, or URLs that resolve to <i>https</i>, will additionally need the <AutoText key="glidict::Preferences.Connection.HTTPS_Proxy_Host"/> and corresponding <AutoText key="glidict::Preferences.Connection.Proxy_Port"/> filled in too, before web pages can be downloaded from there.</Text>
  2623 2621 <Text id ="0417c">Websites at https URLs often have a security certificate, but not always. For instance, <Link>https://englishhistory.net</Link> does not have one. To instruct GLI to nevertheless download pages from <i>https</i> URLs that don't have a security certificate, you'll also need to switch on the <AutoText key="glidict::Preferences.Connection.No_Check_Certificate"/> checkbox.</Text>
  2624 2622 <Text id ="0417d">Once you've finished configuring the proxy settings, click <AutoText key="glidict::General.OK" type="button"/> to close the dialog.</Text>
… …
  3625 3623 </Heading>
  3626 3624 <Comment>
- 3627 <Text id="0612a-1">Java applet support is being or has been phased out in various browsers and browser versions. As a result the following step will not work on <Link url="https://stackoverflow.com/questions/31816839/how-do-i-enable-java-in-microsoft-edge-web-browser">Microsoft Edge</Link> browsers, among others. If you're using such a browser, you may skip this step.</Text>
+ 3625 <Text id="0612a-1">Java applet support is being or has been phased out in various browsers and browser versions. As a result the following step will not work on <Link url="https://stackoverflow.com/questions/31816839/how-do-i-enable-java-in-microsoft-edge-web-browser">Microsoft Edge</Link> and some other browsers. If you're using such a browser, you may skip this step.</Text>
  3628 3626 </Comment>
  3629 3627 <NumberedItem>
… …
  3635 3633 </Heading>
  3636 3634 <NumberedItem>
- 3637 <Text id="0613">To complete the collection, let's give it a new image for the <MajorVersion number="2">top left corner of the page</MajorVersion><MajorVersion number="3">link from the main page</MajorVersion>. Go to the <AutoText key="glidict::CDM.GUI.General"/> section of the <AutoText key="glidict::GUI.Format"/> panel. Use the browse button of <AutoText key="glidict::CDM.General.Icon_Collection"/> to select the following image:</Text>
+ 3635 <Text id="0613">To complete the collection, let's give it a new image for the <MajorVersion number="2">top left corner of the page</MajorVersion><MajorVersion number="3">link from the main page</MajorVersion>. Go to the <AutoText key="glidict::CDM.GUI.General"/> section of the <AutoText key="glidict::GUI.Format"/> panel. Use the browse button of the <AutoText key="glidict::CDM.General.Icon_Collection"/> to select the following image:</Text>
+ 3636 <Path>sample_files → beatles → advbeat_large → images → tile.jpg</Path>
+ 3637 <Text id="0613b">You can also set an image for the link to the collection's home page here. For this, use the browse button of <AutoText key="glidict::CDM.General.Icon_Collection_Small"/> to select the following image:</Text>
  3638 3638 <Path>sample_files → beatles → advbeat_large → images → beatlesmm.png</Path>
- 3639 <Text id="0613a"><b>Preview</b> the collection, and make sure the new image appears.</Text>
+ 3639 <Text id="0613b"><b>Preview</b> the collection, and make sure the new image appears on the collection's about page.</Text>
+ 3640 <Text id="0613c">Also go to the digital library home page by clicking on the <i>My Greenstone Library</i> link at the top left. On the home page, look through the links to all the collections in your digital library to find the one to the Small Beatles collection. This link should now be denoted by an image bearing the text "BeatlesMultimedia".</Text>
  3640 3641 </NumberedItem>
  3641 3642 <Heading>
… …
  4148 4149 <NumberedItem>
  4149 4150 <Text id="0690h-1">In the <AutoText key="glidict::CDM.GUI.Formats"/> section of the <AutoText key="glidict::GUI.Format"/> panel, select <AutoText text="Search"/> in <AutoText key="glidict::CDM.FormatManager.Feature"/><MajorVersion number="3"> to adjust how search results are displayed.</MajorVersion><MajorVersion number="2">, and <AutoText text="VList"/> in <AutoText key="glidict::CDM.FormatManager.Part"/>. Click <AutoText key="glidict::CDM.FormatManager.Add" type="button"/> to add this format to the collection. The previous changes modified <AutoText text="VList"/>, so they will apply to all <AutoText text="VList"/>s that don't have specific format statements. These next changes are made to <AutoText text="SearchVList"/> so will only apply to search results. </MajorVersion></Text>
- 4150 <Text id="0690i">The extracted Title for the current section is specified as <Format><MajorVersion number="2">[ex.Title]</MajorVersion><MajorVersion number="3"><gsf:metadata name="Title"/></MajorVersion></Format> while the Title for the parent section is <Format><MajorVersion number="2">[parent:ex.Title]</MajorVersion><MajorVersion number="3"><gsf:metadata name="Title" select="parent"/></MajorVersion></Format>. Since the same <AutoText text="SearchVList"/> format statement is used when searching both whole newspapers and newspaper pages, we need to make sure it works in both cases.</Text>
+ 4151 <Text id="0690i">The extracted Title for the current section is specified as <Format><MajorVersion number="2">[ex.Title]</MajorVersion><MajorVersion number="3"><gsf:metadata name="Title"/></MajorVersion></Format> while the Title for the parent section is <MajorVersion number="2"><Format>[parent:ex.Title]</Format></MajorVersion><MajorVersion number="3"><Format><gsf:metadata name="Title" select="parent"/></Format> (if using metadata assigned at the document or root level, this would be <Format><gsf:metadata name="Title" select="root"/></Format>)</MajorVersion>. Since the same <AutoText text="SearchVList"/> format statement is used when searching both whole newspapers and newspaper pages, we need to make sure it works in both cases.</Text>
  4151 4152 <MajorVersion number="2">
  4152 4153 <Text id="0690j">Set the format statement to the following text (it can be copied and pasted from the file <Path>sample_files → niupepa → formats → search_tweak.txt</Path>):</Text>
… …
  4178 4179 <Tab n="1"/><i><br />
  4179 4180 <Tab n="2"/><gsf:choose-metadata><br />
+ 4181 <Tab n="3"/><gsf:metadata name="Date" format="formatDate" /><br />
  4180 4182 <Tab n="3"/><gsf:metadata name="Date" select="parent" format="formatDate" /><br />
- 4181 <Tab n="3"/><gsf:metadata name="Date" format="formatDate" /><br />
+ 4183 <Tab n="3"/><gsf:metadata name="Date" select="root" format="formatDate" /><br />
  4182 4184 <Tab n="3"/><gsf:default>undated</gsf:default><br />
  4183 4185 <Tab n="2"/></gsf:choose-metadata><br />
… …
  4518 4520 <Tab n="3"/><td>Caption:</td><br />
  4519 4521 <Tab n="3"/><td><i><gsf:metadata name="ex.dc.Description"/></i><br/><br />
- 4520 <Tab n="3"/><a><xsl:attribute name="href"><gsf:metadata name="ex.dc.OrigURL"/></xsl:attribute><br />
+ 4522 <Tab n="3"/><gsf:link type="source"><br />
  4521 4523 <Tab n="4"/>original <gsf:metadata name="ImageWidth"/>x<gsf:metadata name="ImageHeight"/> <gsf:metadata name="ImageType"/> available<br />
- 4522 <Tab n="3"/></a><br />
+ 4524 <Tab n="3"/></gsf:link><br />
  4523 4525 <Tab n="3"/></td><br />
  4524 4526 <Tab n="2"/></tr><br />