Changeset 33327 for main/trunk


Timestamp:
2019-07-18T22:45:22+12:00 (5 years ago)
Author:
ak19
Message:

Changes were required to get map coordinate metadata stored correctly in solr. These changes revealed that the way some index fields were stored, in solr but also in lucene, was not exactly correct and needed changing too.

1. Coordinate/CD, CoordShort/CS and GPSMapOverlayLabel/ML meta are now being stored. The schema elements created for these indexed fields notably need to be declared multivalued (multiple values per docOID) and of type=string rather than type=text_en_splitting, which the other meta have used so far. No term-related information is stored for them, as that doesn't appear important for these indexed fields.

2. Changes were required to solrbuildproc, and these changes were repeated in lucenebuildproc. In their code before this commit, a single <field name=... /> element was stored for all meta values in that field. This sort of worked so far because these fields were of type=text_en_splitting. However, it meant that, for example, all Coordinate meta for a docOID went into a single <field name=CD .../> element, separated by spaces, rather than into one <field name=CD .../> element per Coordinate meta. We wanted the latter behaviour for CD, CS and ML meta, but also for all other indexed meta fields such as TI for titles, and for indexed fields that combine multiple meta in one index, such as a hypothetical TT that includes dc.Title,ex.Title,text. In that case too we want a <field name=TT /> element for each title meta and for the text meta.

3. The num_processed_bytes calculation is left untouched: it still includes the encapsulating <field name=.../> element and has not been changed to count just the metadata value of each field. This is not only because the super -buildproc.pm classes calculate it to include the field, but also because basebuilder.pm defines num_processed_bytes as "the number of bytes actually passed to (mg) for the current index", and the lucene and mgpp buildprocs both include the enclosing element in the calculation, which seems deliberate. Further, num_processed_bytes contrasts with num_bytes, also declared and defined in basebuildproc.pm, as "The actual number of bytes in the collection, normally the same as what's processed during text compression". num_bytes seems to be what Dr Bainbridge had in mind today when he said that the enclosing <field/> element shouldn't be included in the calculation of num_processed_bytes. Since the definition of num_processed_bytes seems ambiguous to me now, I am leaving it alone until I have discussed it with Dr Bainbridge again, as changing it would mean changing many other places.
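
The difference described in point 2 can be illustrated with a small standalone sketch. This is not code from lucenebuildproc.pm: the helper names fields_per_value and field_joined and the sample coordinate values are hypothetical, and the real buildproc also runs &ghtml::htmlsafe and filter_text over the values, which is omitted here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of lucenebuildproc.pm): wrap every
# non-blank metadata value in its own element, as this commit does.
sub fields_per_value {
    my ($shortname, @values) = @_;
    my $text = "";
    foreach my $value (@values) {
        next unless $value =~ /\S/;    # skip empty or whitespace-only values
        $text .= "<$shortname index=\"1\">$value</$shortname>";
    }
    return $text;
}

# Hypothetical helper showing the pre-commit behaviour for contrast:
# all values are joined with spaces inside a single element.
sub field_joined {
    my ($shortname, @values) = @_;
    my $joined = join("", map { "$_ " } @values);
    return "<$shortname index=\"1\">$joined</$shortname>";
}

# Made-up Coordinate values for one docOID.
my @coords = ("-37.78 175.31", "-36.85 174.76");

print fields_per_value("CD", @coords), "\n";
print field_joined("CD", @coords), "\n";
```

With the joined form, the boundary between two coordinate values is lost once the field is a plain string, which is presumably why a multivalued type=string field needs one element per value.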

File:
1 edited

--- main/trunk/greenstone2/perllib/lucenebuildproc.pm (r28566)
+++ main/trunk/greenstone2/perllib/lucenebuildproc.pm (r33327)
@@ -260,21 +260,34 @@
 
         if ($section_text ne "") {
-            $new_text .= "$section_text ";
+
+            if ($self->{'allfields_index'}) {
+                $allfields_text .= "$section_text ";
+            }
+
+            if ($self->{'indexing_text'}) {
+                # add the tag
+                $new_text .= "<$shortname index=\"1\">$section_text</$shortname>";
+                $self->{'allindexfields'}->{$real_field} = 1;
+            } else {
+                $new_text .= "$section_text ";
+            }
         }
 
         foreach my $item (@metadata_list) {
             &ghtml::htmlsafe($item);
-            $new_text .= "$item ";
-        }
-
-        if ($self->{'allfields_index'}) {
-            $allfields_text .= $new_text;
-        }
-
-        if ($self->{'indexing_text'}) {
-            # add the tag
-            $new_text = "<$shortname index=\"1\">$new_text</$shortname>";
-            $self->{'allindexfields'}->{$real_field} = 1;
-        }
+
+            if ($self->{'allfields_index'}) {
+                $allfields_text .= "$item ";
+            }
+
+            if ($self->{'indexing_text'}) {
+                # add the tag
+                $new_text .= "<$shortname index=\"1\">$item</$shortname>";
+                $self->{'allindexfields'}->{$real_field} = 1;
+            } else {
+                $new_text .= "$item ";
+            }
+        } # end for loop processing @metadata_list
+
         # filter the text
         $new_text = $self->filter_text ($field, $new_text);
@@ -384,14 +397,25 @@
         push (@metadata_list, @section_metadata);
         }
-        my $new_text = "";
-        foreach my $item (@metadata_list) {
-        &ghtml::htmlsafe($item);
-        $new_text .= "$item";
-        }
-        if ($new_text =~ /\S/) {
-        $new_text = "<$sf_shortname index=\"1\" tokenize=\"0\">$new_text</$sf_shortname>";
-        # filter the text???
-        $text .= "$new_text"; # add it to the main text block
-        $self->{'actualsortfields'}->{$sfield} = 1;
+        # my $new_text = "";
+        # foreach my $item (@metadata_list) {
+        # &ghtml::htmlsafe($item);
+        # $new_text .= "$item"; # should be .="$item "; But will be commenting out and rewriting this entire thing, so it doesn't matter
+        # }
+        # if ($new_text =~ /\S/) {
+        # $new_text = "<$sf_shortname index=\"1\" tokenize=\"0\">$new_text</$sf_shortname>";
+        # # filter the text???
+        # $text .= "$new_text"; # add it to the main text block
+        # $self->{'actualsortfields'}->{$sfield} = 1;
+        # }
+
+        foreach my $item (@metadata_list) {
+            &ghtml::htmlsafe($item);
+            if ($item =~ /\S/) {
+                $item = "<$sf_shortname index=\"1\" tokenize=\"0\">$item</$sf_shortname>";
+                $text .= "$item"; # add it to the main text block
+            }
+        }
+        if(scalar @metadata_list > 0) {
+            $self->{'actualsortfields'}->{$sfield} = 1;
         }
     }