Timestamp:
2019-03-14T19:57:30+13:00 (5 years ago)
Author:
ak19
Message:

Commandline Incremental Rebuilding tutorial modifications for GS3.09.

File:
1 edited

  • documentation/trunk/tutorials/xml-source/tutorial_en.xml

    r32903 r32904  
    52105210</NumberedItem>
    52115211<NumberedItem>
    5212 <Text id="ic-03">Do not build the collection in GLI. We'll be building and rebuilding manually, from the command-line terminal. So close GLI. You can choose to run the Greenstone server at any stage, however.</Text>
     5212  <Text id="ic-02a">Go to the <AutoText key="glidict::GUI.Design"/> panel &gt; <AutoText key="glidict::CDM.GUI.Indexes"/> and look for <AutoText key="glidict::CDM.LevelManager.Level_Title"/>. Make <AutoText key="glidict::CDM.LevelManager.Document"/> level searching the <AutoText text="default"/>.</Text>
     5213</NumberedItem>   
     5214<NumberedItem>
     5215<Text id="ic-03">Do not build the collection in GLI. We'll be building and rebuilding manually, from the command-line terminal. So close GLI once the files and folders have finished copying into your collection. You can choose to run the Greenstone server at any stage, however.</Text>
    52135216</NumberedItem>
    52145217<NumberedItem>
     
    52185221<Text id="ic-04a">Scroll down to the following line near the bottom:</Text>
    52195222<Format>&lt;importOption name=&quot;OIDtype&quot; value=&quot;dirname&quot;/&gt;</Format>
    5220 <Text id="ic-04b">Edit it to refer to the full filenames instead of directory names:</Text>
     5223<Text id="ic-04b">Edit this line to refer to the full filenames instead of directory names as follows, then save the file:</Text>
    52215224<Format>&lt;importOption name=&quot;OIDtype&quot; value=&quot;full_filename&quot;/&gt;</Format>
    52225225<Text id="ic-04c">The above step sets the identifiers used by Greenstone for this collection's documents to be based on their full filenames. Doing so will allow us to refer to the files by name in the &lt;Filename&gt; elements of any manifest file we use for incrementally building the collection. These &lt;Filename&gt; elements will then identify which files are to be indexed if newly added, and which are to be re-indexed, as should happen if a document or its metadata has been edited. (For specifying which files are to be deleted, the document identifier will be used instead of the filename.)</Text>
     5226<Comment>
     5227  <Text id="ic-04d">In this step you've learnt how to edit the collectionConfig.xml by hand. You can also edit the collectionConfig file from within GLI. In that case, with the collection open in GLI, you'd go to <AutoText key="glidict::Menu.Edit"/> &gt; <AutoText key="glidict::Menu.Edit_Config"/>. The XML editor that opens also validates any changes you make to the file, to help prevent you from leaving it in an invalid state. It provides the usual <AutoText key="glidict::General.Undo"/> and <AutoText key="glidict::General.Redo"/> buttons. You can use the <AutoText text="Find"/> toolbar at the bottom of the editor to locate text of interest in the collectionConfig file (e.g. search for "importoption"). Once you've finished editing the file, you'd press the <AutoText key="glidict::General.Save"/> button, which will save the changes, close the editor and immediately reload the collection in order to put your changes into effect. If you're not happy with your edits, you can press the <AutoText key="glidict::General.Cancel"/> button to close the editor without saving any changes.</Text>
     5228</Comment>
    52235229</MajorVersion>
    52245230</NumberedItem>
     
    52285234<Text id="ic-05a">On Linux or Macs, the general command is the same, but the installed location would be different and the slashes go the other way. For example, if installed in <Format>/Users/me/Greenstone3</Format>, you'd type the following and hit Enter:</Text>
    52295235<Format>cd /Users/me/Greenstone3</Format>
     5236<Text id="ic-05a-1">If there are any spaces in the filepath, put double quotes on either side of the filepath.</Text>
    52305237<Text id="ic-05b">Now you're ready to set up the Greenstone environment in your terminal. On Windows, type the following into your terminal and hit Enter again:</Text>
    52315238<Format><MajorVersion number="2">setup.bat</MajorVersion><MajorVersion number="3">gs3-setup.bat</MajorVersion></Format>
     
    52355242<Text id="ic-05e">With the terminal now operating within your Greenstone installation folder, and with the Greenstone environment now set up and ready, type the following commands to do a complete build of your new collection. Although the command contains the word "rebuild" in it, since this is the first time the collection's being built, it will just build it.</Text>
    52365243<Format>perl -S full-rebuild.pl <MajorVersion number="3">-site localsite</MajorVersion> incremen</Format>
    5237 <Text id="ic-05f">Preview the collection. If the Greenstone server is not running (as would happen if you had closed GLI and didn't start the standalone Greenstone Server Interface application), then run it from the Start Menu on Windows now. You could also run the Greenstone server by running the <Format><MajorVersion number="2">gs2-server.bat</MajorVersion><MajorVersion number="3">gs3-server.bat</MajorVersion></Format> script in the terminal if using a Windows, or running the <Format><MajorVersion number="2">gs2-server.sh</MajorVersion><MajorVersion number="3">gs3-server.sh</MajorVersion></Format> script from a Linux/Mac terminal.</Text>
    5238 <Text id="ic-05g">When previewing, try searching for "kouprey" and you should get results, as this term occurs in the document <i>b18ase</i>.</Text>
    5239 <Text id="ic-05h">For the rest of this tutorial exercise, leave open the terminal in which you have set up your Greenstone's environment. We'll be using it throughout.</Text>
     5244<Text id="ic-05f">For the rest of this tutorial exercise, leave open this terminal in which you have set up your Greenstone's environment. We'll be using it throughout.</Text>
     5245</NumberedItem>
     5246<NumberedItem>
     5247  <Text id="ic-05g">If the Greenstone server is not running (as would happen if you had closed GLI and didn't start the standalone Greenstone Server Interface application), then run it from the Start Menu on Windows now. You could also run the Greenstone server by running the <Format><MajorVersion number="2">gs2-server.bat</MajorVersion><MajorVersion number="3">gs3-server.bat</MajorVersion></Format> script in the terminal if you're trying this on a Windows machine, or by running the <Format><MajorVersion number="2">gs2-server.sh</MajorVersion><MajorVersion number="3">gs3-server.sh</MajorVersion></Format> script from a Linux/Mac terminal.</Text>
     5248</NumberedItem>
     5249<NumberedItem>
     5250  <Text id="ic-05h">Preview the <i>incremen</i> collection.</Text>
     5251  <Comment>
     5252    <Text id="ic-05i">Throughout this tutorial, when previewing an (incrementally) rebuilt collection, make sure to reload any web page in the collection in order to ensure you're seeing any changes you've made. A <Link url="https://www.getfilecloud.com/blog/2015/03/tech-tip-how-to-do-hard-refresh-in-browsers/">"force reload"</Link>, also referred to as a "hard refresh", is better: either hold down Ctrl while clicking the reload/refresh button, or press Ctrl+F5 in some browsers or Ctrl+Shift+R in others to make the browser do a force reload.</Text>
     5253  </Comment>
     5254<Text id="ic-05j">When previewing, try searching for "kouprey" and you should get results, as this term occurs in the document <i>b18ase</i>.</Text>
     5255<Text id="ic-05k">Next, try searching for "groundnuts" and no documents should match.</Text>
    52405256</NumberedItem>
    52415257<Heading>
     
    52625278&lt;/Manifest&gt;
    52635279</Format>
    5264 <Text id="ic-07d">The above lists the 3 main documents to be added/indexed by Greenstone (hence the keyword &lt;Index&gt;). Since these documents are located inside their own subfolders when copied into the <i>import</i> folder, the manifest file also indicates the relative folder structure of these documents, e.g. <i>"fb33fe/fb33fe.htm"</i> shows that the <i>fb33fe.htm</i> HTML document is located in the folder <i>fb33fe</i>. Only the main documents to be added are listed, not the associated image files also found at the same folder level, as Greenstone will track down all the image files referred to by the main html documents to be indexed and will process them as files associated with the html.</Text>
     5280<Text id="ic-07d">The above lists the 3 main documents to be added/indexed by Greenstone (hence the keyword &lt;Index&gt;). Since these documents are located inside their own subfolders when copied into the <i>import</i> folder, the manifest file also indicates the relative folder structure of these documents (relative to the collection), e.g. <i>"fb33fe/fb33fe.htm"</i> shows that the <i>fb33fe.htm</i> HTML document is located in the folder <i>fb33fe</i>. Only the main documents to be added are listed, not the associated image files also found at the same folder level, as Greenstone will track down all the image files referred to by the main html documents to be indexed and will process them as files associated with the html.</Text>
    52655281</NumberedItem>
    52665282<NumberedItem>
    52675283<Text id="ic-08">Return to the terminal you had left open. We can finally run the commands for the incremental build operation.</Text>
    52685284<Text id="ic-08a">Use the terminal to first run the incremental import stage:</Text>
    5269 <Format>perl -S incremental-import.pl -manifest manifests/add-new-files.xml <MajorVersion number="3">-site localsite</MajorVersion> incremen</Format>
     5285<Format>perl -S incremental-import.pl -incremental -manifest manifests/add-new-files.xml <MajorVersion number="3">-site localsite</MajorVersion> incremen</Format>
    52705286<Text id="ic-08b">Once that finishes running, start off the incremental buildcol stage of the build process:</Text>
    52715287<Format>perl -S incremental-buildcol.pl -activate <MajorVersion number="3">-site localsite</MajorVersion> incremen</Format>
    5272 <Text id="ic-08c">The incremental import command specifies the manifest file that Greenstone is to consult in order to work out which files should be processed and how (Indexed, Deleted or Reindexed). By the builcol stage, the specific files would then be ready for further incremental processing by the buildcol script. The activate flag to the incremental buildcol script tells Greenstone to (re-)activate the updated collection if the Greenstone server is running.</Text>
     5288<Text id="ic-08c">The incremental import command specifies the manifest file that Greenstone is to consult in order to work out which files should be processed and how (whether each is to be Indexed, Deleted or Reindexed). By the buildcol stage, the specific files would then be ready for further incremental processing by the buildcol script. The <Format>-activate</Format> flag to the incremental buildcol script tells Greenstone to (re-)activate the updated collection if the Greenstone server is running.</Text>
    52735289</NumberedItem>
    52745290<NumberedItem>
    52755291<Text id="ic-09">Preview the collection either by running the Greenstone Server Interface application, if it isn't already running, or by starting the Greenstone server from the command line with the command:</Text>
    52765292<Format><MajorVersion number="2">gsicontrol.bat web-start</MajorVersion><MajorVersion number="3">ant start</MajorVersion></Format>
    5277 <Text id="ic-09a">(To stop the Greenstone server at any point, use the command <Format><MajorVersion number="2">gsicontrol.bat web-stop</MajorVersion><MajorVersion number="3">ant stop</MajorVersion></Format>. To stop-and-start it, you'd use <Format><MajorVersion number="2">gsicontrol.bat web-restart</MajorVersion><MajorVersion number="3">ant restart</MajorVersion></Format>.<MajorVersion number="2"> On Linux/Mac, use the equivalent script <i>gsicontrol.sh</i> for each command, e.g. <Format>./gsicontrol.sh web-start</Format>.</MajorVersion>)</Text>
    5278 <Text id="ic-09b">When the server is runnning, preview your library home page, located by default at <Format><MajorVersion number="2">http://localhost:8282/greenstone/cgi-bin/library.cgi</MajorVersion><MajorVersion number="3">http://localhost:8383/greenstone3/library</MajorVersion></Format>. Visit the <i>Incremental with Manifests</i> collection and click on the Titles browser. There should be 3 additional documents now, and you should be able to search for terms that occur in them. For example, searching for "groundnuts" should return results, since this term occurs in the newly added document <i>fb33fe</i>.</Text>
     5293<Text id="ic-09a">(To stop the Greenstone server at any point, use the command <Format><MajorVersion number="2">gsicontrol.bat web-stop</MajorVersion><MajorVersion number="3">ant stop</MajorVersion></Format>. To stop-then-start it, you'd use <Format><MajorVersion number="2">gsicontrol.bat web-restart</MajorVersion><MajorVersion number="3">ant restart</MajorVersion></Format>.<MajorVersion number="2"> On Linux/Mac, use the equivalent script <i>gsicontrol.sh</i> for each command, e.g. <Format>./gsicontrol.sh web-start</Format>.</MajorVersion>)</Text>
     5294<Text id="ic-09b">When the server is running, preview your library home page, located by default at <Format><MajorVersion number="2">http://localhost:8282/greenstone/cgi-bin/library.cgi</MajorVersion><MajorVersion number="3">http://localhost:8383/greenstone3/library</MajorVersion></Format>. Visit the <i>Incremental with Manifests</i> collection and click on the Titles browser. There should be 3 additional documents now, and you should be able to search for terms that occur in them. For example, searching for "groundnuts" again should return a result this time, since this term occurs in the newly added document <i>fb33fe</i>.</Text>
    52795295</NumberedItem>
    52805296<Heading>
     
    53035319<Text id="ic-12">Finally, let's incrementally rebuild the collection, specifying the manifest file that Greenstone should use this time to carry out the incremental build operation. As before, there are two steps.</Text>
    53045320<Text id="ic-12a">First run the modified incremental import command:</Text>
    5305 <Format>perl -S incremental-import.pl -manifest manifests/delete-some-files.xml <MajorVersion number="3">-site localsite</MajorVersion> incremen</Format>
     5321<Format>perl -S incremental-import.pl -incremental -manifest manifests/delete-some-files.xml <MajorVersion number="3">-site localsite</MajorVersion> incremen</Format>
    53065322<Text id="ic-12b">When that has finished running, run the same incremental buildcol command as before (it doesn't change):</Text>
    53075323<Format>perl -S incremental-buildcol.pl -activate <MajorVersion number="3">-site localsite</MajorVersion> incremen</Format>
    53085324</NumberedItem>
    53095325<NumberedItem>
    5310 <Text id="ic-13">When it has finished, preview the collection once more and check that the 2 documents have been removed. They should not turn up in the browse classifiers, nor in search results. For example, search for "kouprey" again. Check that when you search for the term this time, that no documents matched the query. (Since it only occurs in document <i>b18ase</i>, which has now been removed.)</Text>
     5326<Text id="ic-13">When it has finished, preview the collection once more and check that the 2 documents have been removed. They should not turn up in the browse classifiers, nor in search results. For example, search for "kouprey" again. Check that when you search for the term this time, that no documents matched the query. (Since it only occurred in document <i>b18ase</i>, which has now been removed from the collection.)</Text>
    53115327</NumberedItem>
    53125328<Heading>
     
    53325348</NumberedItem>
    53335349<NumberedItem>
    5334 <Text id="ic-16"><MajorVersion number="3">Next, quit the Greenstone server application if it was running, so that the Greenstone server is stopped. </MajorVersion>Start up GLI. Open the incremen collection and go to the Enrich panel. Add or modify <i>dc.Title</i> metadata for the <i>b20cre</i> document. Do not accidentally build the collection using GLI.</Text>
    5335 </NumberedItem>
    5336 <NumberedItem>
    5337 <Text id="ic-17">Quit GLI.<MajorVersion number="3"> Optionally run the Greenstone server application.</MajorVersion></Text>
     5350<Text id="ic-16"><!--<MajorVersion number="3">Next, quit the Greenstone server application if it was running, so that the Greenstone server is stopped. </MajorVersion>-->Start up GLI. Open the <Format>incremen</Format> collection and go to the Enrich panel. Add or modify <i>dc.Title</i> metadata for the <i>b20cre</i> document. Do not accidentally build the collection using GLI.</Text>
     5351</NumberedItem>
     5352<NumberedItem>
     5353<Text id="ic-17">Quit GLI.<MajorVersion number="3"> Optionally run the Greenstone server application if it isn't already running.</MajorVersion></Text>
    53385354<Text id="ic-17a">In the above two steps, we've modified the text contents of document <i>fb34fe</i> and the metadata associated with <i>b20cre</i>. Our <i>mod-text-and-meta.xml</i> manifest file already indicates that these two files are to be reindexed, so we can go ahead and incrementally rebuild the collection with this manifest file.</Text>
    53395355</NumberedItem>
     
    53415357<Text id="ic-18">Run the incremental rebuild operation to re-process just these two files. To do so, pass the <Format>mod-text-and-meta.xml</Format> manifest file this time.</Text>
    53425358<Text id="ic-18a">First run:</Text>
    5343 <Format>perl -S incremental-import.pl -manifest manifests/mod-text-and-meta.xml <MajorVersion number="3">-site localsite</MajorVersion> incremen</Format>
     5359<Format>perl -S incremental-import.pl -incremental -manifest manifests/mod-text-and-meta.xml <MajorVersion number="3">-site localsite</MajorVersion> incremen</Format>
    53445360<Text id="ic-18b">Followed by:</Text>
    53455361<Format>perl -S incremental-buildcol.pl -activate <MajorVersion number="3">-site localsite</MajorVersion> incremen</Format>
    53465362</NumberedItem>
    53475363<NumberedItem>
    5348 <Text id="ic-19">Preview the collection once more. Check that the 2 documents contain your edits: try searching for any additional words you added. Also check the dc.Title metadata that you had modified can now be searched and appears as the title for the <i>b20cre</i> document in the Titles browsing classifier.</Text>
    5349 </NumberedItem>
    5350 <Text id="ic-20">In this tutorial, we looked at cutting down the amount of time spent on rebuilding a collection by manually controlling the rebuild operation so that it processes only what has changed. We do so by means of a manifest that specifies exactly which files need to be rebuilt and how (whether they need to be Indexed, Deleted or Reindexed). Greenstone also has an automatic incremental rebuild feature, sparing you the need to specify a manifest file in the <i>import</i> phase. Omitting the manifest argument in the above exercises activates this behaviour, however, this is typically slower, because Greenstone now needs to scan the entire <Format>import</Format> folder and compare this with the information in the <Format>archives</Format> folder to determine what has changed.</Text>
    5351 <Text id="ic-21">Now repeat all the above exercises in the same sequence once again, but with a new collection called <i>autoincr</i> also based on the <i>Demo</i> collection. But this time, don't pass in the manifest file as an argument to the <Format>import.pl</Format> script. After each incremental build, preview your <i>autoincr</i> collection to check that the Browsing classifiers contain the expected documents and that searching returns the expected results.</Text>
     5364<Text id="ic-19">Preview the collection once more. Check that the 2 documents contain your edits: try searching for any additional words you added and confirm that document fb34fe turns up in the results. Also check the dc.Title metadata that you had modified can now be searched and appears as the title for the <i>b20cre</i> document in the Titles browsing classifier.</Text>
     5365</NumberedItem>
     5366<Comment>
     5367<Text id="ic-20">In this tutorial, we looked at cutting down the amount of time spent on rebuilding a collection by manually controlling the rebuild operation so that it processes only what has changed. We do so by means of a manifest that specifies exactly which files need to be rebuilt and how (whether any need to be Indexed, Deleted or Reindexed). Greenstone also has an automatic incremental rebuild feature, sparing you the need to specify a manifest file in the <i>import</i> phase. Omitting the manifest argument in the above exercises activates this behaviour. However, this is typically slower, because Greenstone now needs to scan the entire <Format>import</Format> folder and compare this with the information in the <Format>archives</Format> folder to determine what has changed.</Text>
     5368</Comment>
     5369<NumberedItem>
     5370  <Text id="ic-21">Now repeat all the above exercises in the same sequence once again, but with a new collection called <i>autoincr</i> also based on the <i>Demo</i> collection (remember to set <Format>&lt;importOption name="OIDtype" value="full_filename"/&gt;</Format> in the collectionConfig.xml file once again). This time, however, <i>don't</i> pass in any manifest file as an argument to the <Format>incremental-import.pl</Format> script. So you'd be running these commands after each change:</Text>
     5371  <Format>
     5372    perl -S incremental-import.pl -incremental -site localsite autoincr<br />
     5373    perl -S incremental-buildcol.pl -activate -site localsite autoincr
     5374  </Format>
     5375  <Text id="ic-21a">After each incremental build, preview your <i>autoincr</i> collection to check that the browsing classifiers contain the expected documents and that searching returns the expected results.</Text>
     5376</NumberedItem>
    53525377<Heading><Text id="ic-21">Incrementally indexing automatically</Text></Heading>
    5353 <Text id="ic-22">Just as there is the command <Format>full-rebuild.pl</Format> to completely build a collection from scratch, there is also the command <Format>incremental-rebuild.pl</Format>. The final exercise you have just completed could equally have been achieved by running:</Text>
     5378<Text id="ic-22">Just as there is the command <Format>full-rebuild.pl</Format> to completely build a collection from scratch, there is also the command <Format>incremental-rebuild.pl</Format>. The final exercise you have just completed could equally have been achieved by running the following after each change:</Text>
    53545379<Format>perl -S incremental-rebuild.pl <MajorVersion number="3">-site localsite</MajorVersion> autoincr</Format>
    53555380<Text id="ic-23">For every collection, the <i>import</i> phase can be run incrementally (either using a manifest file or automatically), however, the ability for the <i>buildcol</i> phase to be incremental depends on the indexer in use. Lucene and Solr indexers support incremental indexing, but the MG and MGPP indexers do not. A warning is issued if you attempt to run the <i>buildcol</i> phase incrementally when the chosen indexer does not support this.</Text>
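Reviewer sketch (not part of the changeset): the revised tutorial runs the same pair of commands after every change, varying only the manifest argument and the collection name. The Python below is a minimal illustration of that pattern. The function name `incremental_commands` and its parameters are hypothetical; the script names and flags are copied from the `<Format>` lines in the diff above, and nothing here is part of Greenstone itself.

```python
from typing import List, Optional


def incremental_commands(collection: str,
                         site: Optional[str] = "localsite",
                         manifest: Optional[str] = None) -> List[List[str]]:
    """Return the import and buildcol command lines as argument lists.

    Hypothetical helper: it only assembles the commands shown in the
    tutorial text; it does not run them.
    """
    import_cmd = ["perl", "-S", "incremental-import.pl", "-incremental"]
    if manifest is not None:
        # Manifest-driven run; omit for the automatic incremental mode.
        import_cmd += ["-manifest", manifest]

    buildcol_cmd = ["perl", "-S", "incremental-buildcol.pl", "-activate"]

    if site is not None:
        # The tutorial wraps "-site localsite" in MajorVersion number="3",
        # so pass site=None to leave it out.
        import_cmd += ["-site", site]
        buildcol_cmd += ["-site", site]

    import_cmd.append(collection)
    buildcol_cmd.append(collection)
    return [import_cmd, buildcol_cmd]


if __name__ == "__main__":
    # Print the two commands for the "add new files" step of the tutorial.
    for cmd in incremental_commands("incremen",
                                    manifest="manifests/add-new-files.xml"):
        print(" ".join(cmd))
```

To actually run the commands, each list could be passed to `subprocess.run(cmd, check=True)`, assuming the script is started from the terminal in which the Greenstone environment has already been set up, as the tutorial requires.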