Changeset 24675

27.09.2011 12:55:07 (8 years ago)

Continued to document the parallel building extension and various attempts/explanations of current state of project

1 modified


  • gs2-extensions/parallel-building/trunk/src/README.txt

    r24589 r24675  
Requires the GS2-Extension "tdb-edit" to be installed, compiled and enabled, in order to allow multiple parallel readers/writers to access key datastore/base files (such as the archive-inf databases).
==== Supported InfoDBs ====
The "tdb-edit" extension was created specifically for parallel building, so it is 100% compatible. Included in this extension are versions of the GDBM tools txt2db and db2txt (named txt2dbl and db2txtl respectively) that use simple file locking to allow parallel readers/writers, albeit less efficiently. SQLite is currently not supported (see below). No other infodb types are currently supported.
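The "simple file locking" idea can be sketched as follows. This is a minimal illustration in Python, not the code used by txt2dbl/db2txtl: the sidecar `.lock` file, the tab-separated record format and the function names are all assumptions made for the example, and `fcntl.flock` is only available on Unix-like systems.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def locked(path, exclusive=True):
    """Hold an advisory lock on a sidecar file (path + '.lock') for the block."""
    # Hypothetical convention: the lock lives on a separate file next to the data.
    fd = os.open(path + ".lock", os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Writers take an exclusive lock; readers can share a lock with each other.
        fcntl.flock(fd, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def append_record(db_path, key, value):
    """Append one key/value record while holding the exclusive lock."""
    with locked(db_path, exclusive=True):
        with open(db_path, "a", encoding="utf-8") as handle:
            handle.write(f"{key}\t{value}\n")


def read_records(db_path):
    """Read all records while holding a shared lock."""
    with locked(db_path, exclusive=False):
        with open(db_path, encoding="utf-8") as handle:
            return [line.rstrip("\n").split("\t", 1) for line in handle]
```

Every process blocks until the lock is free, which is why this approach is safe but slower than a database with native concurrent access.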
=== GDBM ===
*NEW* GDBM support is now available through the 'gdbmserver' database driver. This makes use of a GDBMCLI called via a daemonized server (so the Greenstone processes become clients in a client/server model). This not only allows GDBM databases in parallel building, but will also improve serial building, as it reduces the number of GDBM database open/closes from 1-2 per document to 1-2 per build process. The gain won't be huge with 10 documents, especially given the extra cost of starting and stopping the GDBMServers and waiting on each, but with one million documents the difference is significant.
The GDBMServer class requires two non-standard CPAN packages, namely IPC::Run and Proc::Daemon. I've included the source for these in the packages folder, but for now compiling them is a manual process (go to packages, extract, go to folder, run "perl PREFIX=$GEXTPARALLELBUILDING", make, then make install).
Logs are written to the logs directory, but good luck making head or tail of them in a parallel building context (even with write buffering/synchronization the server communications are going to be all jumbled up).
=== SQLite ===
The SQLite version included in Greenstone circa 2.84, while allowing multiple readers and one active writer when using the CLI, does not support parallel writers. Any attempt to use the CLI to write to a database table already opened exclusively by another writer results in an "Error: database is locked" being thrown by the sqlite library and the write being ignored.
I can see two possible ways to work around this. The first, and I guess preferred, is to move away from using the CLI and instead use native Perl DBI/DBD. By doing so we can listen for, and react appropriately to, SQLITE_BUSY errors (by waiting and then retrying). We could even, conceivably, implement a locking scheme similar to HS's changes to GDBM. There may even be versions of the SQLite DBD (OpenPKG?) that allow you to specify that writes *must* succeed and to retry indefinitely until they do.
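The wait-then-retry idea can be sketched as below. This uses Python's standard `sqlite3` module purely for illustration; it is not the Perl DBI/DBD code proposed above, and the function name, retry count and back-off are assumptions. In Python a busy database surfaces as `sqlite3.OperationalError` with a "database is locked" message.

```python
import sqlite3
import time


def execute_with_retry(conn, sql, params=(), retries=5, delay=0.05):
    """Run one write statement, retrying while SQLite reports the database as busy."""
    for attempt in range(retries + 1):
        try:
            # The connection context manager commits on success, rolls back on error.
            with conn:
                return conn.execute(sql, params)
        except sqlite3.OperationalError as err:
            message = str(err)
            busy = "locked" in message or "busy" in message
            if not busy or attempt == retries:
                raise  # a different error, or we have run out of attempts
            time.sleep(delay * (attempt + 1))  # simple linear back-off
```

A bounded retry is used here deliberately: retrying forever would hide a writer that never releases its lock. Python's `sqlite3.connect()` also accepts a `timeout` argument that makes the library itself wait for a lock before giving up.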
As an alternative, I tried upgrading to the latest version of SQLite (3.7.8), which includes a new feature called Write Ahead Logging. Presumably this feature allows multiple writers on a single database, so it sounded perfect for our purpose. However, in practice, there still appear to be locking issues, with the occasional import/build throwing one or more of these errors:
* Error: database is locked
* Use of uninitialized value $filenames[0] in join or string at /research/jmt12/gsdl-64-svn/ext/parallel-building/perllib/ line 860.
* Unexpected EOF in "bitio_m_stdio.cpp" on line 22[1316641505][Worker2] complete
All subsequent builds (presumably on a now corrupt archives directory) without a successful import produce garbage. I include the SQLite source in the packages folder for anyone who wants to try and figure out why.
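One possible explanation for the remaining lock errors: according to the SQLite documentation, WAL mode lets readers run concurrently with a writer, but still allows only one writer at a time. The sketch below, in Python's standard `sqlite3` module and with hypothetical table and function names, shows a second connection reading successfully while its write is rejected. It illustrates the documented behaviour only and does not diagnose the Greenstone failures listed above.

```python
import sqlite3


def wal_concurrency_demo(path):
    """Return (journal mode, rows visible to a reader, whether a second writer was blocked)."""
    # timeout=0 makes a busy database fail immediately instead of waiting.
    first = sqlite3.connect(path, timeout=0, isolation_level=None)
    mode = first.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    first.execute("CREATE TABLE IF NOT EXISTS data (key TEXT PRIMARY KEY, value TEXT)")
    first.execute("BEGIN IMMEDIATE")  # the first writer takes the write lock
    first.execute("INSERT OR REPLACE INTO data VALUES ('a', '1')")

    second = sqlite3.connect(path, timeout=0, isolation_level=None)
    # Reading is not blocked by the open write transaction; it sees committed data only.
    rows_seen = second.execute("SELECT COUNT(*) FROM data").fetchone()[0]
    try:
        second.execute("INSERT OR REPLACE INTO data VALUES ('b', '2')")
        second_blocked = False
    except sqlite3.OperationalError:  # "database is locked"
        second_blocked = True

    first.execute("COMMIT")
    first.close()
    second.close()
    return mode, rows_seen, second_blocked
```

If this matches what the build processes are doing, WAL alone would not remove the need for writers to wait and retry.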
Meanwhile I've added a sanity test to that explicitly ignores -parallel if infodb type is sqlite (and prints out a warning about doing so).
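The shape of such a sanity test is roughly as follows. This is a Python sketch with hypothetical names (`effective_parallel`, its arguments and the warning text); the real check lives in the Perl build script and may differ.

```python
import sys


def effective_parallel(infodbtype, parallel_requested):
    """Return whether the build should actually run in parallel."""
    if parallel_requested and infodbtype.lower() == "sqlite":
        # SQLite cannot take parallel writers, so fall back to a serial build.
        print("Warning: -parallel ignored because infodb type is sqlite",
              file=sys.stderr)
        return False
    return parallel_requested
```

For example, `effective_parallel("sqlite", True)` returns `False` and prints the warning, while `effective_parallel("gdbm", True)` returns `True`.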
===== bin/script and perllib =====
In order to make this compatible with the latest advances in the main trunk (so not the 64bit version I've been testing on), I've implemented the parallel building using an SVN head version of, and perllib. I'll try to keep a list of the files I've changed here to aid in merging this code back into Greenstone:
* bin/script/ extended with a -parallel argument and the code to use it (basically, call builder methods to construct a 'recipe' of how the build should proceed, and then call mpibuilder with the recipe to do the actual build!)
* perllib/ made modifying the INC conditional on the paths not already existing in INC - otherwise all our clever INC building to support extensions is clobbered. Similar change for modifying the PATH.
* perllib/ added indexlevel variable to build_indexes() function (although it doesn't do anything in base) and also added a dummy prepare_build_recipe() function that just complains that the inheriting class should implement it.
* perllib/ modified classifier loader to use @INC rather than hardcoded locations - so extensions are properly supported (two places).
* perllib/ added indexname and indexlevel CLI configuration options
* perllib/ removed, as I want the one in TDB-Edit extension (with its dynamic database driver loading) to apply.
* perllib/ commented out hardcoded(?) path to a particular 64bit version of perl... is this from my earlier checkins?
* perllib/ added indexlevel parameter and code to ensure that, if specified, only that level is built (with appropriate testing for the non-lucene-supported 'paragraph' level). Added prepare_build_recipe() function that knows all phases in lucene builds can be run in parallel, with build_indexes being separable by level.
* perllib/ the most 'complex' recipe generator: mg creates the compress_text item, then adds all the build_index items as dependents. InfoDB can happen in parallel with compress_text, however.
* perllib/ the simplest recipe: mgpp can run all three of its phases in parallel.
* perllib/ see IncrementalBuildTools
* perllib/ see IncrementalBuildTools
* perllib/ see IncrementalBuildTools
* perllib/ made it only complain about periods (.) in the Identifier once - rather than once per document (which is a PITA when building one million documents).
* perllib/dbutil/ changed to call lock enabled versions of txt2db and db2txt.
* perllib/dbutil/ added WAL Pragma (for all the good it did). Also needed to redirect output (like for db_fast) as the WAL reports each type of action ("add","update", and "delete") that it has queued - very quickly becoming annoying.
* perllib/plugins/ making the "Global file scan..." comment obey verbosity.
* perllib/plugins/ see IncrementalBuildTools (in this case the path to cpan)
* perllib/plugins/ see IncrementalBuildTools (in this case the path to cpan)
* perllib/plugins/ see IncrementalBuildTools (in this case the path to cpan)
* perllib/plugins/ see IncrementalBuildTools (in this case the path to cpan)
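To make the build 'recipe' idea in the list above concrete, here is a minimal Python sketch that treats a recipe as a map from each step to the steps it depends on, and groups steps into waves that could run in parallel. The data structure, the step names other than compress_text, and the index levels are assumptions for illustration; the actual format passed to mpibuilder is not described here.

```python
def schedule(recipe):
    """Group recipe steps into waves; every step in a wave has all dependencies done."""
    remaining = {step: set(deps) for step, deps in recipe.items()}
    done = set()
    waves = []
    while remaining:
        ready = sorted(step for step, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("recipe has a dependency cycle or an unknown step")
        waves.append(ready)
        done.update(ready)
        for step in ready:
            del remaining[step]
    return waves


# mg: index building depends on compress_text; infodb is independent of it.
MG_RECIPE = {
    "compress_text": [],
    "infodb": [],
    "build_index:document": ["compress_text"],  # hypothetical index names
    "build_index:section": ["compress_text"],
}

# mgpp: all three phases are independent (phase names assumed).
MGPP_RECIPE = {"compress_text": [], "build_indexes": [], "infodb": []}
```

With these inputs, `schedule(MG_RECIPE)` yields two waves (compress_text and infodb first, then the two index builds), while `schedule(MGPP_RECIPE)` yields a single wave containing all three phases.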