Timestamp:
2009-09-10T10:46:36+12:00 (15 years ago)
Author:
davidb
Message:

Opening of txt2db moved to earlier in the buildcol process. This was done to avoid a huge memory spike that occurred with incremental building. Previously we reconstructed all the documents from the GDBM database. The code then added, edited and removed documents as required (i.e. the incremental bit), and wrote it all out to GDBM. The problem was that the reconstructed documents could grow quite large: an example PagedImage collection of 100000 documents took 2.4 GB when read in. When it got to the stage of opening a pipe to the database with open('|txt2db'), the fork() call that occurs inside this function requires the system to (briefly) have *two* 2.4 GB processes, before quickly replacing the child process with the much smaller 'txt2db' process. It is this duplication of the process that can cause a computer to run out of memory. In the PagedImage example, the machine had 2 GB of main memory and 2 GB of swap, 4 GB in total, so there was no way it could sustain two 2.4 GB processes (4.8 GB).

Long explanation. The good news is that shifting the open() to be before the documents are reconstructed solves the problem.
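
A minimal sketch of the reordering described above, not the actual buildcol code. The writer command is a hypothetical stand-in for txt2db (a small Perl child that copies STDIN to a temporary file), and the tab-separated record layout is invented for illustration. The point it shows is only the ordering: the pipe is opened, and so the fork() happens, while the parent process is still small, and the large hash is built afterwards.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for txt2db: a child that copies STDIN to a file.
my (undef, $out_path) = tempfile(UNLINK => 1);
my @writer_cmd = (
    $^X, '-e',
    'open(my $o, ">", shift) or die; print {$o} $_ while <STDIN>; close($o)',
    $out_path,
);

# 1. Start the writer first, while this process is still small.
#    The fork() inside open() then duplicates very little memory.
open(my $writer, '|-', @writer_cmd) or die "cannot start writer: $!";

# 2. Only now build the large in-memory structure
#    (illustrative data, standing in for the reconstructed documents).
my %database_recs = map { ("doc$_" => "record for doc$_") } 1 .. 1000;

# 3. Stream the records to the already-running child.
for my $key (sort keys %database_recs) {
    print {$writer} "$key\t$database_recs{$key}\n";
}

# close() on a pipe waits for the child to exit.
close($writer) or die "writer failed: $! (exit status $?)";
```

Swapping steps 1 and 2 would reproduce the original problem: the fork() would then run in a process that already holds the whole hash.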

File:
1 edited

  • gsdl/trunk/perllib/classify.pm

--- gsdl/trunk/perllib/classify.pm (r19772)
+++ gsdl/trunk/perllib/classify.pm (r20575)
@@ -234,14 +234,12 @@
     my $infodb_type = shift(@_);
     my $infodb_file_path = shift(@_);
-
-    my %database_recs;
-    &dbutil::read_infodb_file($infodb_type, $infodb_file_path, \%database_recs);
+    my $database_recs = shift(@_);
 
     # dig out top level doc sections
     my %top_sections = ();
     my %top_docnums = ();
-    foreach my $key ( keys %database_recs )
+    foreach my $key ( keys %$database_recs )
     {
-    my $md_rec = $database_recs{$key};
+    my $md_rec = $database_recs->{$key};
     my $md_hash = db_rec_to_hash($md_rec);
 
@@ -266,5 +264,5 @@
         add_section_content ($doc_obj, $top, $doc_db_hash);
         my $children = &get_children($doc_db_hash);
-        recurse_sections($doc_obj, $children, $oid, $top, \%database_recs);
+        recurse_sections($doc_obj, $children, $oid, $top, $database_recs);
 
     push(@all_docs,$doc_obj);
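
The change above replaces a locally read hash with a hash reference supplied by the caller, so the records are loaded once and shared rather than read again inside the function. A small sketch of that calling convention follows. The helper name `count_top_level`, the sample keys, and the rule that a top-level key contains no dot are all assumptions made for illustration and are not taken from classify.pm.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a hash reference, so the caller's hash
# is shared with the function instead of being copied or re-read.
sub count_top_level {
    my $database_recs = shift(@_);

    my $count = 0;
    foreach my $key ( keys %$database_recs ) {      # dereference the whole hash
        my $md_rec = $database_recs->{$key};        # dereference one element
        next unless defined $md_rec;
        # Assumption for this sketch: keys without a dot are top-level.
        $count++ if $key !~ /\./;
    }
    return $count;
}

# Illustrative records; the caller owns the hash and passes a reference.
my %database_recs = (
    'HASH01'   => 'a',
    'HASH01.1' => 'b',
    'HASH02'   => 'c',
);
my $top = count_top_level(\%database_recs);
print "$top top-level records\n";
```

Because the caller already holds a reference, it can be passed on unchanged to further calls, which is why the diff drops the `\%database_recs` form in the `recurse_sections` call.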