Timestamp: 2009-09-10T10:46:36+12:00 (15 years ago)
Author: davidb
Message:

Opening of txt2db moved to earlier in the buildcol process. This was done to avoid a huge memory spike that occurred with incremental building. Previously we reconstructed all the documents from the GDBM database. Then the code added, edited, and removed documents as required (i.e. the incremental bit), and then it wrote everything back out to GDBM. The problem was that the reconstructed document set could grow quite large -- an example PagedImage collection of 100,000 documents took 2.4 GB when read in. When it got to the stage of opening a pipe to the database with open('|txt2db'), the fork() call that occurs inside this function requires the system to (briefly) hold *two* 2.4 GB processes, before quickly replacing the child process with the much smaller 'txt2db' process. It is at this point of duplication that the computer can run out of memory. In the PagedImage example, the machine had 2 GB of main memory and 2 GB of swap, so there was no way it could sustain two 2.4 GB processes.

Long explanation. The good news is that shifting the open() to before the documents are reconstructed solves the problem.
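
To make the ordering issue concrete, here is a minimal sketch of the pattern (not the actual basebuilder.pm code; the helper subs, record format, and database path are made up for illustration). Perl's open() on a '| command' pipe fork()s the current process and then exec()s the command in the child, so the pipe should be opened while the perl process is still small, before the large reconstruction step runs.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical stand-ins for the real reconstruction/serialisation code.
    sub reconstruct_all_documents { return [ { OID => "HASH0001" }, { OID => "HASH0002" } ]; }
    sub serialise_for_txt2db { my ($doc) = @_; return "[$doc->{OID}]\n<doctype>doc\n" . ("-" x 70) . "\n"; }

    my $infodb_file_path = "archiveinf-doc.gdb";    # example path only

    # Open the pipe to txt2db FIRST, while this process is still small.
    # open() on "| command" fork()s the current process and exec()s the
    # command in the child, so doing it now means the fork only has to
    # duplicate a small process image.
    open(my $infodb_handle, "| txt2db \"$infodb_file_path\"")
        or die "Couldn't run txt2db: $!";

    # Only now reconstruct the (potentially multi-GB) document set.
    my $reconstructed_docs = reconstruct_all_documents($infodb_file_path);

    # ... add/edit/remove documents here (the incremental bit) ...

    # Stream everything out through the already-open pipe.
    foreach my $doc (@$reconstructed_docs) {
        print $infodb_handle serialise_for_txt2db($doc);
    }
    close($infodb_handle);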

File: 1 edited

Legend: lines prefixed with + were added, lines prefixed with - were removed; other lines are unmodified context.
  • gsdl/trunk/perllib/basebuilder.pm

--- gsdl/trunk/perllib/basebuilder.pm (r20100)
+++ gsdl/trunk/perllib/basebuilder.pm (r20575)
@@ -357,5 +357,6 @@
 
     # Get info database file path
-    my $infodb_file_path = &dbutil::get_infodb_file_path($self->{'infodbtype'}, $self->{'collection'}, $textdir);
+    my $infodb_type = $self->{'infodbtype'};
+    my $infodb_file_path = &dbutil::get_infodb_file_path($infodb_type, $self->{'collection'}, $textdir);
 
     print $outhandle "\n*** creating the info database and processing associated files\n"
@@ -367,10 +368,27 @@
 
     my $reconstructed_docs = undef;
+    my $database_recs = undef;
+
     if ($self->{'keepold'}) {
-        # reconstruct doc_obj metadata from database for all docs
-        $reconstructed_docs = &classify::reconstruct_doc_objs_metadata($self->{'infodbtype'}, $infodb_file_path);
-    }
-
-    # set up the document processor
+        $database_recs = {};
+
+        &dbutil::read_infodb_file($infodb_type, $infodb_file_path, $database_recs);
+    }
+
+
+    # Important (for memory usage reasons) that we obtain the filehandle
+    # here for writing out to the database, rather than after
+    # $reconstructed_docs has been set up (assuming -keepold is on)
+    #
+    # This is because when we open a pipe to txt2db [using open()]
+    # this triggers a fork() followed by exec().  $reconstructed_docs
+    # can get very large, and so if we did the open() after this, it means
+    # the fork creates a clone of the *large* process image which (admittedly)
+    # is then quickly replaced in the execve() with the much smaller image for
+    # 'txt2db'.  The trouble is, in that for a seismic second caused by
+    # the fork(), the system really does need to have all that memory available
+    # even though it isn't ultimately used.  The result is an out of memory
+    # error.
+
     my ($infodb_handle);
     if ($self->{'debug'}) {
@@ -378,5 +396,5 @@
     }
     else {
-        $infodb_handle = &dbutil::open_infodb_write_handle($self->{'infodbtype'}, $infodb_file_path);
+        $infodb_handle = &dbutil::open_infodb_write_handle($infodb_type, $infodb_file_path);
         if (!defined($infodb_handle))
         {
@@ -386,5 +404,15 @@
     }
 
-    $self->{'buildproc'}->set_infodbtype ($self->{'infodbtype'});
+    if ($self->{'keepold'}) {
+        # reconstruct doc_obj metadata from database for all docs
+        $reconstructed_docs
+            = &classify::reconstruct_doc_objs_metadata($infodb_type,
+                                                       $infodb_file_path,
+                                                       $database_recs);
+    }
+
+    # set up the document processor
+
+    $self->{'buildproc'}->set_infodbtype ($infodb_type);
     $self->{'buildproc'}->set_output_handle ($infodb_handle);
     $self->{'buildproc'}->set_mode ('infodb');
@@ -420,5 +448,5 @@
 
     # output classification information
-    &classify::output_classify_info ($self->{'classifiers'}, $self->{'infodbtype'}, $infodb_handle,
+    &classify::output_classify_info ($self->{'classifiers'}, $infodb_type, $infodb_handle,
                                      $self->{'remove_empty_classifications'},
                                      $self->{'gli'});
@@ -434,7 +462,7 @@
                                   'thistype' => [ "Invisible" ],
                                   'contains' => [ join(";", @doc_list) ] };
-    &dbutil::write_infodb_entry($self->{'infodbtype'}, $infodb_handle, "browselist", $browselist_infodb);
-
-    &dbutil::close_infodb_write_handle($self->{'infodbtype'}, $infodb_handle) if !$self->{'debug'};
+    &dbutil::write_infodb_entry($infodb_type, $infodb_handle, "browselist", $browselist_infodb);
+
+    &dbutil::close_infodb_write_handle($infodb_type, $infodb_handle) if !$self->{'debug'};
 
     print STDERR "</Stage>\n" if $self->{'gli'};