Timestamp:
18.07.2019 22:45:22
Author:
ak19
Message:

Changes were required to get map coordinate metadata stored correctly in solr. These changes revealed that the way some index fields were stored, in solr but also in lucene, was not exactly correct and needed changing too.

1. Coordinate/CD, CoordShort/CS and GPSMapOverlayLabel/ML meta are now being stored. The schema elements created for these indexed fields need to declare them multivalued (multiple values per docOID) and of type=string, rather than type=text_en_splitting as used for the other meta so far. No term-related information is stored for them, as that doesn't appear important for these indexed fields.

2. Changes to solrbuildproc were required, and these were repeated in lucenebuildproc. In their code before this commit, a <field name=... /> element was stored once for all meta values in that field. This sort of worked so far because these fields had type=text_en_splitting. However, it meant that, for example, all Coordinate meta for a docOID went into a single <field name=CD .../> element, separated by spaces, rather than into a <field name=CD .../> element for each Coordinate meta. We wanted the latter behaviour for CD, CS and ML meta, but also for all other indexed meta fields such as TI for titles, and for indexed fields that combine multiple meta in one index, such as a hypothetical TT that would include dc.Title,ex.Title,text. In that case too we want a <field name=TT /> element for each title meta and for the text meta.

3. The num_processed_bytes calculation is left untouched: it still includes the encapsulating <field name=.../> element and has not been changed to be calculated over just the metadata value of each field. This is not only because the super -buildproc.pm classes calculate it to include the field, but also because basebuilder.pm defines num_processed_bytes as "the number of bytes actually passed to (mg) for the current index", and the lucene and mgpp buildprocs both include the enclosing element in the calculation, which seems deliberate. Further, num_processed_bytes contrasts with num_bytes, declared and defined in basebuildproc.pm too, as "The actual number of bytes in the collection, normally the same as what's processed during text compression". num_bytes seems to be what Dr Bainbridge had in mind today when he said that the enclosing <field/> element shouldn't actually be included in the calculation of num_processed_bytes. Since the definition of num_processed_bytes now seems ambiguous to me, I am leaving it alone until I have discussed it with Dr Bainbridge again, as it would otherwise need changing in many places.
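
The difference described in point 2 can be illustrated with a minimal Perl sketch. This is not the code from solrbuildproc.pm: the helper names single_field() and fields_per_value() are hypothetical, HTML escaping (done via &ghtml::htmlsafe in the real module) is left out, and the sample coordinate strings are made up. Skipping whitespace-only values is assumed here for all fields, modelled on the guard the commit adds around the $sf_shortname fields.

```perl
use strict;
use warnings;

# Hypothetical helper: the behaviour before this commit, where all
# values for a field are joined with spaces into ONE <field> element.
sub single_field {
    my ($shortname, @values) = @_;
    my $joined = join(" ", @values);
    $joined =~ s/^\s+|\s+$//g;    # trim leading/trailing white space
    return "<field name=\"$shortname\" >$joined</field>\n";
}

# Hypothetical helper: the behaviour after this commit, where each
# value gets its OWN <field> element (what a multivalued field expects).
sub fields_per_value {
    my ($shortname, @values) = @_;
    my $out = "";
    foreach my $value (@values) {
        # trim a copy so the caller's data is left untouched
        (my $trimmed = $value) =~ s/^\s+|\s+$//g;
        next unless $trimmed =~ /\S/;    # assumption: skip empty values
        $out .= "<field name=\"$shortname\" >$trimmed</field>\n";
    }
    return $out;
}

# Made-up sample values for the CD (Coordinate) field
my @coords = ("-37.78 175.31 ", " -36.85 174.76");
print "Before:\n", single_field("CD", @coords);
print "After:\n",  fields_per_value("CD", @coords);
```

With one element per value, a type=string field keeps each coordinate as a separate exact value instead of one space-separated string.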

Files:
1 modified

  • gs3-extensions/solr/trunk/src/perllib/solrbuildproc.pm

    r32441 r33327  
@@ -202,5 +202,14 @@
 
 }
-sub create_shortname {
+
+# UNUSED now by default.
+# Georgy overrode the mgppbuildproc::create_shortname() method in commit 32441 to create the method below, to override the inherited
+# behaviour so that create_shortname() worked appropriately for his use cases involving multiple analyzers.
+# As a result, create_shortname() for solr no longer did a lookup into the %mgppbuildproc::static_indexfield_map for registered shortnames.
+# For the rest, this method is a copy mgppbuildproc::create_shortname().
+# But we want the original mgppbuildproc::create_shortname() behaviour restored, as it does the lookups into %static_indexfield_map that's necessary for us.
+# So we've renamed this function to create_shortname_multi_solr_analyzer below so it won't get called as default beahviour any more.
+# Rename to create_shortname() when requiring Georgy's behaviour.
+sub create_shortname_multi_solr_analyzer {
     my $self = shift(@_);
 
     
@@ -500,25 +509,40 @@
 
         if ($section_text ne "") {
-            $new_text .= "$section_text ";
+
+            if ($allfields_index) {
+                $allfields_text .= "$section_text ";
+            }
+
+            # Remove any leading or trailing white space
+            $section_text =~ s/\s+$//;
+            $section_text =~ s/^\s+//;
+
+            if ($self->{'indexing_text'}) {
+                # add the tag
+                $new_text .= "<field name=\"$shortname\" >$section_text</field>\n";
+            } else {
+                $new_text .= "$section_text ";
+            }
         }
 
         foreach my $item (@metadata_list) {
             &ghtml::htmlsafe($item);
-            $new_text .= "$item ";
-        }
-
-        if ($allfields_index) {
-            $allfields_text .= $new_text;
-        }
-
-        # Remove any leading or trailing white space
-        $new_text =~ s/\s+$//;
-        $new_text =~ s/^\s+//;
-
+
+            if ($allfields_index) {
+                $allfields_text .= "$item ";
+            }
+
+            # Remove any leading or trailing white space
+            $item =~ s/\s+$//;
+            $item =~ s/^\s+//;
+
+            if ($self->{'indexing_text'}) {
+                # add the tag
+                $new_text .= "<field name=\"$shortname\" >$item</field>\n";
+            } else {
+                $new_text .= "$item ";
+            }
+        } # end for loop processing @metadata_list
 
-        if ($self->{'indexing_text'}) {
-            # add the tag
-            $new_text = "<field name=\"$shortname\" >$new_text</field>\n";
-        }
         # filter the text
         $new_text = $self->filter_text ($field, $new_text);
     
@@ -669,10 +693,11 @@
 
         foreach my $item (@metadata_list) {
-        &ghtml::htmlsafe($item);
-
-        $item = "<field name=\"$sf_shortname\">$item</field>\n";
-        # filter the text???
-        $text .= "$item"; # add it to the main text block
-        #print "#### new_text: $item\n";
+            &ghtml::htmlsafe($item);
+            if ($item =~ /\S/) {
+                $item = "<field name=\"$sf_shortname\">$item</field>\n";
+                # filter the text???
+                $text .= "$item"; # add it to the main text block
+                #print "#### new_text: $item\n";
+            }
         }
         if(scalar @metadata_list > 0) {