Timestamp:
2019-07-18T22:45:22+12:00 (5 years ago)
Author:
ak19
Message:

Getting map coordinate metadata stored correctly in solr required changes. These changes revealed that the way some index fields were stored, in solr but also in lucene, was not exactly correct and needed changing too.

1. Coordinate/CD, CoordShort/CS and GPSMapOverlayLabel/ML meta are now being stored. The schema elements created for these indexed fields notably need to declare that they are multivalued (multiple values per docOID) and of type=string rather than type=text_en_splitting, which the other meta have used so far. No term-related information is stored for them, as that doesn't appear important for these indexed fields.

2. Changes to solrbuildproc were required, and these changes were also repeated in lucenebuildproc. In their code before this commit, a single <field name=... /> element was stored for all meta values in that field. This sort of worked out so far because these fields were of type=text_en_splitting. However, it created the problem that, for example, all Coordinate meta for a docOID went into a single <field name=CD .../> element, separated by spaces, rather than into one <field name=CD .../> element per Coordinate meta. We wanted the latter behaviour for CD, CS and ML meta, but also for all other indexed meta fields such as TI for titles, and for indexed fields that combine multiple meta in one index, such as a hypothetical TT that would include dc.Title,ex.Title,text. In that case too we want a <field name=TT /> element for each title meta and for the text meta.

3. The num_processed_bytes calculation is left untouched. It still includes the encapsulating <field name=.../> element and has not been changed to be calculated over just the metadata value of each field. There are two reasons. First, the -buildproc.pm superclasses also calculate it to include the field element. Second, basebuildproc.pm defines num_processed_bytes as "the number of bytes actually passed to (mg) for the current index", and the lucene and mgpp buildprocs both include the enclosing element in the calculation, which seems deliberate. Further, num_processed_bytes contrasts with num_bytes, declared and defined in basebuildproc.pm too, as "The actual number of bytes in the collection, normally the same as what's processed during text compression". num_bytes seems to be what Dr Bainbridge had in mind today when he said that the enclosing <field/> element shouldn't actually be included in the calculation of num_processed_bytes. Since the definition of num_processed_bytes now seems ambiguous to me, I am leaving it alone until I have discussed it with Dr Bainbridge again, as it would otherwise need changing in many places.
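The difference described in point 2 can be illustrated with a minimal standalone sketch. It is not the solrbuildproc.pm code: the helper names fields_joined and fields_per_value and the sample coordinate strings are hypothetical, values are assumed to be HTML-escaped already, and skipping whitespace-only values is an assumption modelled on the /\S/ check that this changeset adds in one loop only.

```perl
use strict;
use warnings;

# Hypothetical helper: behaviour before this commit. All values for a
# shortname are joined with spaces inside ONE <field> element.
sub fields_joined {
    my ($shortname, @values) = @_;
    my $text = join(" ", @values);
    $text =~ s/^\s+|\s+$//g;    # trim leading/trailing white space
    return "<field name=\"$shortname\" >$text</field>\n";
}

# Hypothetical helper: behaviour after this commit. Each value gets its
# OWN <field> element, which is what a multivalued solr field expects.
sub fields_per_value {
    my ($shortname, @values) = @_;
    my $out = "";
    foreach my $value (@values) {
        (my $trimmed = $value) =~ s/^\s+|\s+$//g;
        # Assumption: skip values that contain only white space
        next unless $trimmed =~ /\S/;
        $out .= "<field name=\"$shortname\" >$trimmed</field>\n";
    }
    return $out;
}

# Made-up sample values, for illustration only
my @coords = ("-37.79 175.32", "-36.85 174.76");
print fields_joined("CD", @coords);
print fields_per_value("CD", @coords);
```

The first call prints a single CD element holding both coordinates separated by spaces, so a type=string field would see them as one value. The second prints two CD elements, one per coordinate.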

File:
1 edited

  • gs3-extensions/solr/trunk/src/perllib/solrbuildproc.pm

--- gs3-extensions/solr/trunk/src/perllib/solrbuildproc.pm (r32441)
+++ gs3-extensions/solr/trunk/src/perllib/solrbuildproc.pm (r33327)
@@ -202,5 +202,14 @@
 
 }
-sub create_shortname {
+
+# UNUSED now by default.
+# Georgy overrode the mgppbuildproc::create_shortname() method in commit 32441 to create the method below, to override the inherited
+# behaviour so that create_shortname() worked appropriately for his use cases involving multiple analyzers.
+# As a result, create_shortname() for solr no longer did a lookup into the %mgppbuildproc::static_indexfield_map for registered shortnames.
+# For the rest, this method is a copy mgppbuildproc::create_shortname().
+# But we want the original mgppbuildproc::create_shortname() behaviour restored, as it does the lookups into %static_indexfield_map that's necessary for us.
+# So we've renamed this function to create_shortname_multi_solr_analyzer below so it won't get called as default beahviour any more.
+# Rename to create_shortname() when requiring Georgy's behaviour.
+sub create_shortname_multi_solr_analyzer {
     my $self = shift(@_);
 
@@ -500,25 +509,40 @@
 
         if ($section_text ne "") {
-            $new_text .= "$section_text ";
+
+            if ($allfields_index) {
+                $allfields_text .= "$section_text ";
+            }
+
+            # Remove any leading or trailing white space
+            $section_text =~ s/\s+$//;
+            $section_text =~ s/^\s+//;
+
+            if ($self->{'indexing_text'}) {
+                # add the tag
+                $new_text .= "<field name=\"$shortname\" >$section_text</field>\n";
+            } else {
+                $new_text .= "$section_text ";
+            }
         }
 
         foreach my $item (@metadata_list) {
             &ghtml::htmlsafe($item);
-            $new_text .= "$item ";
-        }
-
-        if ($allfields_index) {
-            $allfields_text .= $new_text;
-        }
-
-        # Remove any leading or trailing white space
-        $new_text =~ s/\s+$//;
-        $new_text =~ s/^\s+//;
-
+
+            if ($allfields_index) {
+                $allfields_text .= "$item ";
+            }
+
+            # Remove any leading or trailing white space
+            $item =~ s/\s+$//;
+            $item =~ s/^\s+//;
+
+            if ($self->{'indexing_text'}) {
+                # add the tag
+                $new_text .= "<field name=\"$shortname\" >$item</field>\n";
+            } else {
+                $new_text .= "$item ";
+            }
+        } # end for loop processing @metadata_list
 
-        if ($self->{'indexing_text'}) {
-            # add the tag
-            $new_text = "<field name=\"$shortname\" >$new_text</field>\n";
-        }
         # filter the text
         $new_text = $self->filter_text ($field, $new_text);
@@ -669,10 +693,11 @@
 
         foreach my $item (@metadata_list) {
-        &ghtml::htmlsafe($item);
-
-        $item = "<field name=\"$sf_shortname\">$item</field>\n";
-        # filter the text???
-        $text .= "$item"; # add it to the main text block
-        #print "#### new_text: $item\n";
+            &ghtml::htmlsafe($item);
+            if ($item =~ /\S/) {
+                $item = "<field name=\"$sf_shortname\">$item</field>\n";
+                # filter the text???
+                $text .= "$item"; # add it to the main text block
+                #print "#### new_text: $item\n";
+            }
         }
         if(scalar @metadata_list > 0) {