Changeset 32027
- Timestamp:
- 2017-10-05T23:01:50+13:00
- File:
- 1 edited
documentation/trunk/tutorials/xml-source/tutorial_en.xml
r32021 r32027
 775 775 </Comment>
 776 776 <NumberedItem>
-777 <Text id="images-gps-1">Create a new collection in GLI called <i>Images-GPS</i>. In the <b>Gather</b> panel, drag and drop the 4 folders in <Path>sample_files → images_gps</Path> from the Workspace view on the left into the Collection view on the right.</Text>
-778 </NumberedItem>
-779 <NumberedItem>
-780 <Text id="images-gps-1a">Since the images are organised by folder, we can easily assign <i>folder-level</i> metadata to the images which will help with classifying them. In the <b>Enrich</b> panel, select the <Path>eiffel-tower</Path> folder, and in its <b>dc.Title</b> field type <i>Eiffel Tower</i>. Since this metadata is assigned at folder level, it is inherited as <b>dc.Title</b> metadata by all the images in the folder.</Text>
+777 <Text id="images-gps-1">Create a new collection in GLI called <i>Images-GPS</i>. In the <AutoText key="glidict::GUI.Gather"/> panel, drag and drop the 4 folders in <Path>sample_files → images_gps</Path> from the Workspace view on the left into the Collection view on the right.</Text>
+778 </NumberedItem>
+779 <NumberedItem>
+780 <Text id="images-gps-1a">Since the images are organised by folder, we can easily assign <i>folder-level</i> metadata to the images which will help with classifying them. In the <AutoText key="glidict::GUI.Enrich"/> panel, select the <Path>eiffel-tower</Path> folder, and in its <b>dc.Title</b> field type <i>Eiffel Tower</i>. Since this metadata is assigned at folder level, it is inherited as <b>dc.Title</b> metadata by all the images in the folder.</Text>
 781 781 <Text id="images-gps-1b">When setting folder-level metadata like this, the default setting in GLI is to produce a popup window alerting you to the fact that the assigned metadata will be assigned
 782 782 to all files and sub-folders contained in the selected folder. For this collection, this is what we want, so press <i>OK</i> for the action to proceed.</Text>
… …
 790 790 </NumberedItem>
 791 791 <NumberedItem>
-792 <Text id="images-gps-4">Now go to the <b>Create</b> panel and press <b>Build Collection</b>.</Text>
+792 <Text id="images-gps-4">Now go to the <AutoText key="glidict::GUI.Create"/> panel and press <b>Build Collection</b>.</Text>
 793 793 </NumberedItem>
 794 794 <NumberedItem>
… …
 803 803 Each of these image files has metadata embedded in it—including GPS data—generated by the smartphone when the photo was taken. We can extract this metadata when the collection is built, and in particular, make use of the GPS metadata to provide map-based views of the collection to the user.</Text>
 804 804
-805 <Text id="images-gps7">In the <b>Document Plugins</b> section of the <b>Design</b> panel, go down to the <b>select plugin to add</b> and choose the <b>EmbeddedMetadataPlugin</b>. Press the <b>Add Plugin</b> button, and then click <b>OK</b> to add it to the plugin list. Select this plugin in the list, then use the <b>Move Up</b> button to shift it upwards until it comes just after the GreenstoneXMLPlugin.</Text>
+805 <Text id="images-gps7">In the <AutoText key="glidict::CDM.GUI.Plugins"/> section of the <AutoText key="glidict::GUI.Design"/> panel, go down to the <b>select plugin to add</b> and choose the <AutoText text="EmbeddedMetadataPlugin"/>. Press the <b>Add Plugin</b> button, and then click <AutoText key="glidict::General.OK" type="button"/> to add it to the plugin list. Select this plugin in the list, then use the <AutoText key="glidict::CDM.Move.Move_Up" type="button"/> button to shift it upwards until it comes just after the GreenstoneXMLPlugin.</Text>
 806 806 </NumberedItem>
 807 807
 808 808 <NumberedItem>
-809 <Text id="images-gps-8">Go to the <b>Create</b> panel and press <b>Build Collection</b>.</Text>
+809 <Text id="images-gps-8">Go to the <AutoText key="glidict::GUI.Create"/> panel and press <AutoText key="glidict::CreatePane.Build_Collection" type="button"/>.</Text>
 810 810 </NumberedItem>
 811 811
 812 812 <NumberedItem>
-813 <Text id="images-gps-9">Now go to the <b>Enrich</b> panel, expand the <b>eiffel-tower</b> folder and select the first image. Scroll down to see the metadata extracted during the building process. Among the extracted metadata, you will find several pieces of Latitude and Longitude metadata, which we will be taking advantage of shortly: ex.Latitude, ex.Longitude, ex.LatShort, and ex.LngShort. </Text>
+813 <Text id="images-gps-9">Now go to the <AutoText key="glidict::GUI.Enrich"/> panel, expand the <b>eiffel-tower</b> folder and select the first image. Scroll down to see the metadata extracted during the building process. Among the extracted metadata, you will find several pieces of Latitude and Longitude metadata, which we will be taking advantage of shortly: ex.Latitude, ex.Longitude, ex.LatShort, and ex.LngShort. </Text>
 814 814 </NumberedItem>
 815 815
… …
 820 820 Greenstone has a map view, based on the Google Maps API, that can make use of this location metadata. The map view can be controlled to appear in different parts of the interface: as part of a collection's search results page when browsing the collection, and/or when viewing a document. For this view to be operational in a Greenstone, it is necessary for the collection to index the GPS metadata.</Text>
 821 821 <NumberedItem>
-822 <Text id="images-gps-11">In the <b>Search Indexes</b> section of the <b>Design</b> panel, press the <b>New Index...</b> button. Scroll down and tick the box for <b>ex.LatShort</b> and press <b>Add Index</b> to create an index on it. In like manner, create an index on <b>ex.Latitude</b>. Then another on <b>ex.LngShort</b>. And finally one on <b>ex.Longitude</b>.</Text>
-823 </NumberedItem>
-824 <NumberedItem>
-825 <Text id="images-gps-12">Select <b>Search</b> on the left of the <b>Format</b> panel. For the index on the combined longitude and latitude metadata, type <i>locations</i> as its display name.</Text>
-826 </NumberedItem>
-827 <NumberedItem>
-828 <Text id="images-gps-13">Now go to the <b>Create</b> panel and press <b>Build Collection</b>.</Text>
-829 </NumberedItem>
-830 <NumberedItem>
-831 <Text id="images-gps-14">To enable the map, go to the <b>Format Features</b> section of the <b>Format</b> panel, and select the <b>browse</b> format feature. In the editor below, enter the following format statement <i>above</i> the documentNode template:</Text>
+822 <Text id="images-gps-11">In the <AutoText key="glidict::CDM.GUI.Indexes"/> section of the <AutoText key="glidict::GUI.Design"/> panel, press the <AutoText key="glidict::CDM.IndexManager.New_Index" type="button"/> button. Scroll down and tick the box for <b>ex.LatShort</b> and press <AutoText key="glidict::CDM.IndexManager.Add_Index" type="button"/> to create an index on it. In like manner, create an index on <b>ex.Latitude</b>. Then another on <b>ex.LngShort</b>. And finally one on <b>ex.Longitude</b>.</Text>
+823 </NumberedItem>
+824 <NumberedItem>
+825 <Text id="images-gps-12">Select <b>Search</b> on the left of the <AutoText key="glidict::GUI.Format"/> panel. For the index on the combined longitude and latitude metadata, type <i>locations</i> as its display name.</Text>
+826 </NumberedItem>
+827 <NumberedItem>
+828 <Text id="images-gps-13">Now go to the <AutoText key="glidict::GUI.Create"/> panel and press <AutoText key="glidict::CreatePane.Build_Collection" type="button"/>.</Text>
+829 </NumberedItem>
+830 <NumberedItem>
+831 <Text id="images-gps-14">To enable the map, go to the <AutoText key="glidict::CDM.FormatManager.Feature"/> section of the <AutoText key="glidict::GUI.Format"/> panel, and select the <b>browse</b> format feature. In the editor below, enter the following format statement <i>above</i> the documentNode template:</Text>
 832 832 <Format><gsf:option name="mapEnabled" value="true" /></Format>
 833 833 </NumberedItem>
 834 834 <NumberedItem>
-835 <Text>Also in the <b>Format Features</b> section, select the <b>searchType</b> feature. Add <AutoText text="raw"/> to the list of search types.</Text>
-836 </NumberedItem>
-837 <NumberedItem>
-838 <Text id="images-gps-15">In the <b>Format</b> panel press the <b>Preview Collection</b> button, and click on the new browsing classifier (<Format>locations</Format>) and then click on a bookshelf icon. The page that opens up shows a Google map, with the locations of the images in the collection pinpointed on it. The map view can also scroll through all the images, locating each place and associated image in turn.</Text>
+835 <Text>Also in the <AutoText key="glidict::CDM.FormatManager.Feature"/> section, select the <b>searchType</b> feature. Add <AutoText text="raw"/> to the list of search types.</Text>
+836 </NumberedItem>
+837 <NumberedItem>
+838 <Text id="images-gps-15">In the <AutoText key="glidict::GUI.Format"/> panel press <AutoText key="glidict::CreatePane.Preview_Collection" type="button"/>, and click on the new browsing classifier (<Format>locations</Format>) and then click on a bookshelf icon. The page that opens up shows a Google map, with the locations of the images in the collection pinpointed on it. The map view can also scroll through all the images, locating each place and associated image in turn.</Text>
 839 839 </NumberedItem>
 840 840 <Text id="images-gps-16">
… …
 851 851 <NumberedItem>
 852 852 <Text id="images-gps-17">
-853 To activate a map view when viewing the document, go to the <b>Format Features</b> section of the <b>Format</b> panel, and select the <b>display</b> format feature. In the editor below, enter the following format statement <i>after</i> the line <Format><gsf:option name="TOC" value="true" /></Format>:
+853 To activate a map view when viewing the document, go to the <AutoText key="glidict::CDM.FormatManager.Feature"/> section of the <AutoText key="glidict::GUI.Format"/> panel, and select the <b>display</b> format feature. In the editor below, enter the following format statement <i>after</i> the line <Format><gsf:option name="TOC" value="true" /></Format>:
 854 854 </Text>
 855 855 <Format><gsf:option name="mapEnabled" value="true" /></Format>
… …
 859 859 </NumberedItem>
 860 860 <NumberedItem>
-861 <Text id="images-gps-18">Still in the <b>Format</b> panel press the <b>Preview Collection</b> button, browse or search to locate a document, and then view the document. The page that opens up shows a Google map, shows the location of the document (where the photo was taken), in addition to the screen-sized photo.</Text>
+861 <Text id="images-gps-18">Still in the <AutoText key="glidict::GUI.Format"/> panel press the <AutoText key="glidict::CreatePane.Preview_Collection" type="button"/> button, browse or search to locate a document, and then view the document. The page that opens up shows a Google map, shows the location of the document (where the photo was taken), in addition to the screen-sized photo.</Text>
 862 862 </NumberedItem>
 863 863 </Content>
… …
 4776 4776 <Text id="gli-oai-17">If you wish, you can now set up this collection in a manner similar to how the <b>backdrop</b> collection was set up in <TutorialRef id="simple_image_collection"/>. Don't forget to copy in any specific format statements, adjust them to use the <Format>ex.dc</Format> metadata instead of <Format>dc</Format> metadata, then <b>rebuild</b> and <b>preview</b> the collection.</Text>
 4777 4777 </NumberedItem>
+4778 </Content>
+4779 </Tutorial>
+4780 <Tutorial id="unknown_converter_plugin">
+4781 <Title>
+4782 <Text id="ucp-01">Using the UnknownConverterPlugin</Text>
+4783 </Title>
+4784 <SampleFiles folder="pdfbox"/>
+4785 <Version initial="2.88" current="2.87"/>
+4786 <Content>
+4787 <Comment><Text id="ucp-02">The UnknownConverterPlugin builds on the idea of the UnknownPlugin, in that it can be configured to handle documents of unknown format and file extension. It can also be made to handle documents with known file extensions in a custom manner.</Text></Comment>
+4788 <Comment><Text id="ucp-03">The UnknownConverterPlugin extends the UnknownPlugin's abilities by letting you launch a tool you have installed on your own operating system that can be run from the commandline to convert from the "unknown" file format to either text, html or gif/jpg/png images, or a folder of these. If you know how to launch this tool from the commandline to do the conversion, then you would configure the UnknownConverterPlugin by supplying the file format (file extension) of the documents it should process, the expected output file format (text, html or paged images), and the tool's conversion command that the UnknownConverterPlugin should launch to perform the conversion. In place of the input file and the output file or folder you provide placeholders in the command to run. Once configured, the UnknownConverterPlugin will be used during building to process documents that match the specified file format. The conversion tool will be launched with the command provided, and the expected output files as specified can then be processed by Greenstone in the usual manner.</Text></Comment>
+4789 <Comment><Text id="ucp-04">An example would be djvu files, for which Greenstone provides no custom plugin. However, there's a free commandline tool available for unix systems that can convert from djvu to one of the text based format that Greenstone can process, text or html. So in this case, you could try using the UnknownConverterPlugin with the commandline tool on djvu files that you've gathered. The result should be that the djvu in your collection are now searchable.</Text></Comment>
+4790 <Comment><Text id="ucp-05">This part of the tutorial requires you to be working on a Unix operating system. In this part of the tutorial, we're going to learn how to install the Icecite tool on a Linux system and then configure the UnknownConverterPlugin to use Icecite to process PDF files. Icecite (https://github.com/ckorzen/icecite) is an open-source tool that can do many things, including extracting text from a PDF.</Text></Comment>
+4791 <Heading>
+4792 <Text id="ucp-06">Using the Icecite tool to convert from PDF to text</Text>
+4793 </Heading>
+4794 <Comment>
+4795 <Text id="ucp-07">As Icecite needs Java 8, you need to have either a JDK8 or a JRE8 installed in order to proceed with this portion of the tutorial.</Text>
+4796 </Comment>
+4797 <NumberedItem>
+4798 <Text id="ucp-08">Grab the pre-compiled Icecite tarball from <Link>http://trac.greenstone.org/export/head/gs3-extensions/gs-icecite/gs-icecite.tar.gz</Link> and decompress it into your Greenstone installation's <Format>ext</Format> subfolder.</Text>
+4799 <Text id="ucp-09">Now you're ready to test Icecite's PDF to text conversion abilities manually, by running Icecite from the command line.</Text>
+4800 </NumberedItem>
+4801 <NumberedItem>
+4802 <Text id="ucp-10">Set up your environment for Java 8:</Text>
+4803 <Format>export JAVA_HOME=/PATH/TO/YOUR-JAVA-8-HOME
+4804 export PATH=$JAVA_HOME/bin:$PATH
+4805 </Format>
+4806 </NumberedItem>
+4807 <NumberedItem>
+4808 <Text id="ucp-11">You would need to run Icecite from the terminal wherein you set up the Java 8 environment. Run it on a PDF as follows, after first replacing the <PLACEHOLDERS> below:</Text>
+4809 <Format>java -classpath '.:/<PATH-TO-GS-INSTALLTION>/ext/icecite/gs-installed-jars/*:/<PATH-TO-GS-INSTALLTION>/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs </PATH/TO/YOUR.pdf> </PATH/TO/CONVERTED.txt></Format>
+4810 <Text id="ucp-12">It can take a while for the PDF to get converted, but once it has finished, you can inspect the text file produced, denoted by the placeholder string <Format></PATH/TO/CONVERTED.txt></Format></Text>
+4811 <Comment><Text id="ucp-13">You can experiment with using <Format>--feature words</Format> or <Format>--feature lines</Format> above, in place of <Format>--feature paragraphs</Format>, to find out the effect of such a change on the output file, particularly if <Format>--feature paragraphs</Format> does not produce the desired results for your PDFs.</Text></Comment>
+4812 </NumberedItem>
+4813 <Heading>
+4814 <Text id="ucp-14">Using the UnknownConverterPlugin to launch Icecite from GLI to do the PDF to text conversion</Text>
+4815 </Heading>
+4816 <Comment><Text id="ucp-15">We're now ready to use the <AutoText text="UnknownConverterPlugin"/> to launch Icecite as the external tool to do the conversion, producing output that Greenstone's building scripts can ingest into Greenstone and index for searching.</Text></Comment>
+4817 <NumberedItem><Text id="ucp-16">Run GLI</Text></NumberedItem>
+4818 <NumberedItem><Text id="ucp-17">Create a new collection called Icecite. In the <AutoText key="glidict::GUI.Gather"/> pane, drop in the sample PDF file into your collection.</Text></NumberedItem>
+4819 <NumberedItem><Text id="ucp-18">In the <AutoText key="glidict::GUI.Design"/> pane and select <AutoText key="glidict::CDM.GUI.Plugins"/> from the list on the left. Add the <AutoText text="UnknownConverterPlugin"/>. Having tried out the Icecite conversion command manually in the previous part of this tutorial, we're now ready to use it when configuring the <AutoText text="UnknownConverterPlugin"/>. Click <AutoText key="glidict::CDM.PlugInManager.Configure" type="button"/> and set up the plugin with the following settings:</Text>
+4820 <BulletList>
+4821 <Bullet><Text id="ucp-19">set <Format>convert_to</Format> to the <Format>text</Format> option, this is the output format upon conversion</Text></Bullet>
+4822 <Bullet><Text id="ucp-20">set <Format>mime type</Format> to <Format>application/pdf</Format></Text></Bullet>
+4823 <Bullet><Text id="ucp-21">set <Format>process_extension</Format> to <Format>pdf</Format>, this is the input format of the files that this instance of the <AutoText text="UnknownConverterPlugin"/> will process</Text></Bullet>
+4824 <Bullet><Text id="ucp-22">set the <Format>exec_cmd</Format> field to:</Text>
+4825 <Text id="ucp-23"><Format>/PATH/TO/YOUR-JAVA-8-HOME/bin/java -classpath ':<MajorVersion number="2">%GSDLHOME</MajorVersion><MajorVersion number="3">%GSDL3SRCHOME</MajorVersion>/ext/icecite/gs-installed-jars/*:<MajorVersion number="2">%GSDLHOME</MajorVersion><MajorVersion number="3">%GSDL3SRCHOME</MajorVersion>/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs %INPUT_FILE %OUTPUT</Format></Text></Bullet>
+4826 </BulletList>
+4827 <Text id="ucp-24">Note: When filling in the <Format>exec_cmd</Format> field, leave the words with <Format>%</Format> signs in front of them intact. They are placeholders for Greenstone to replace.</Text>
+4828 <Text id="ucp-25">However, you will need to adjust the above value for <Format>exec_cmd</Format> by finding out where your Java 8 is installed and replacing <Format>/PATH/TO/YOUR-JAVA-8-HOME</Format> with it. The reason you need to provide the full path to the Java 8 executable is because, at present, GLI binaries ship with Java 7, which is incompatible with the precompiled Icecite. And if you're running Greenstone from a source, you may have a different version of Java set up in your environment too. However, by providing the full path to the Java 8 executable above, you force Icecite's PDF conversion program to run with Java 8.</Text>
+4829 <Comment><Text id="ucp-26">If your Greenstone is installed in a location that contains spaces in the filepath, then ensure you have escaped double quotes (<Format>\"</Format>) around each location referencing the Greenstone installation path except for the parameter value to <Format>-classpath</Format>.</Text></Comment>
+4830 <Comment><Text id="ucp-27">The above command will use the java executable to run the java Icecite program that does the actual PDF to text conversion. Greenstone will run the command given after first filling in the <Format><MajorVersion number="2">%GSDLHOME</MajorVersion><MajorVersion number="3">%GSDL3SRCHOME</MajorVersion></Format>, <Format>%INPUT_FILE</Format> and <Format>%OUTPUT</Format> appropriately. <Format><MajorVersion number="2">%GSDLHOME</MajorVersion><MajorVersion number="3">%GSDL3SRCHOME</MajorVersion></Format> works out to be the Greenstone <MajorVersion number="2">2</MajorVersion><MajorVersion number="3">3</MajorVersion> installation directory, whereas <Format>%INPUT_FILE</Format> is whichever matching PDF it's processing and <Format>%OUTPUT</Format> is likewise the file (or folder of files) produced by the conversion process. In this case, the output type is txt, as that's what Icecite produces. Once the conversion to text has finished, Greenstone will be able to process it as usual, such as indexing the extracted contents to make the document searchable.</Text>
+4831 </Comment>
+4832 </NumberedItem>
+4833 <NumberedItem><Text id="ucp-28">Having sufficiently configured the <AutoText text="UnknownConverterPlugin"/>, click the <AutoText key="glidict::General.OK" type="button"/> button to close its configuration dialog.</Text></NumberedItem>
+4834 <NumberedItem><Text id="ucp-29">Select the <AutoText text="UnknownConverterPlugin"/> in the list of plugins and keep pressing the <AutoText key="glidict::CDM.Move.Move_Up" type="button"/> button to shift it upwards, until it appears in the plugin pipeline above the existing <AutoText text="PDFPlugin"/>, so that this instance of <AutoText text="UnknownConverterPlugin"/>, configured as it has now been to handle PDF files, will take precedence in processing such files.</Text></NumberedItem>
+4835 <NumberedItem><Text id="ucp-30">Move to the <AutoText key="glidict::GUI.Create"/> pane and build the collection. Once more, when Icecite conversion utility is called by Greenstone's building process, the conversion will take some time processing. But after a minute or so, the building will be done and you can Preview the collection. Search for some terms.</Text></NumberedItem>
 4778 4836 </Content>
 4779 4837 </Tutorial>
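The placeholder handling described in ucp-24 and ucp-27 can be illustrated with a small shell sketch. This is not Greenstone's own code: Greenstone performs the substitution itself during the build, and the helper name `fill_placeholders`, the install location `/opt/greenstone3` and the file names under `/tmp` are all assumptions made up for the example. The template is a shortened form of the tutorial's `exec_cmd` with a single classpath entry, written for Greenstone 3 (`%GSDL3SRCHOME`); Greenstone 2 would use `%GSDLHOME` instead.

```shell
#!/bin/sh
# Illustrative sketch only: Greenstone fills in the exec_cmd placeholders itself
# at build time. fill_placeholders is a hypothetical helper, not Greenstone code.

# Usage: fill_placeholders TEMPLATE GS_HOME INPUT_FILE OUTPUT
# Assumes the three values contain no '|', '&' or '\' characters, which are
# special in the sed expressions below.
fill_placeholders() {
    printf '%s\n' "$1" | sed \
        -e "s|%GSDL3SRCHOME|$2|g" \
        -e "s|%INPUT_FILE|$3|g" \
        -e "s|%OUTPUT|$4|g"
}

# Shortened form of the tutorial's exec_cmd (one classpath entry only).
template="/PATH/TO/YOUR-JAVA-8-HOME/bin/java -classpath ':%GSDL3SRCHOME/ext/icecite/gs-installed-jars/*' cli.PdfParserCommandLine --format txt --feature paragraphs %INPUT_FILE %OUTPUT"

# Hypothetical install location and file names, chosen for the example.
fill_placeholders "$template" /opt/greenstone3 /tmp/sample.pdf /tmp/sample.txt
```

Running the script prints the command with the three placeholders expanded, which is a convenient way to check quoting before pasting the template into the `exec_cmd` field. The `/PATH/TO/YOUR-JAVA-8-HOME` part is left as-is because, as ucp-25 explains, it has to be replaced by hand.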