Changeset 33327

Timestamp:
18.07.2019 22:45:22 (5 weeks ago)
Author:
ak19
Message:

Changes were required to get map coordinate metadata stored correctly in solr. These changes revealed that the way some index fields were stored in solr, and also in lucene, was not quite correct and needed changing too.

1. Coordinate/CD, CoordShort/CS and GPSMapOverlayLabel/ML meta are now being stored. The schema elements created for these indexed fields need to say that they are multivalued (multiple values per docOID) and of type=string, rather than type=text_en_splitting as the other meta have been so far. No term-related information is stored for them, as that doesn't appear important for these indexed fields.

2. Changes were required to solrbuildproc, and these were repeated in lucenebuildproc. In their code before this commit, a single <field name=... /> element was stored for all the meta values in that field. This sort of worked so far because these fields were of type=text_en_splitting. However, it meant that, for example, all Coordinate meta for a docOID went into a single <field name=CD .../> element, separated by spaces, rather than into one <field name=CD .../> element per Coordinate meta. We wanted the latter behaviour for CD, CS and ML meta, but also for all other indexed meta fields such as TI for titles, and for indexed fields that combine multiple meta in one index, such as a hypothetical TT that includes dc.Title,ex.Title,text. In that case too we want a <field name=TT /> element for each title meta and for the text meta.

3. The num_processed_bytes calculation is left untouched: it still includes the encapsulating <field name=.../> element and has not been changed to be calculated over just the metadata value of each field. This is partly because it is calculated to include the field in the super -buildproc.pm classes, and partly because num_processed_bytes is defined in basebuildproc.pm as "the number of bytes actually passed to (mg) for the current index", where the lucene and mgpp buildprocs both include the enclosing element in the calculation, which seems deliberate. Further, num_processed_bytes contrasts with num_bytes, also declared and defined in basebuildproc.pm, as "The actual number of bytes in the collection, normally the same as what's processed during text compression". num_bytes seems to be what Dr Bainbridge had in mind today when he said that the enclosing <field/> element shouldn't be included in the calculation of num_processed_bytes. Since the definition of num_processed_bytes now seems ambiguous to me, I am leaving it alone until I have discussed it with Dr Bainbridge again, as it would otherwise need changing in many places.
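The difference described in point 2 can be illustrated with a minimal, standalone Perl sketch. This is not the code from solrbuildproc.pm: the helper names fields_per_value and single_field are hypothetical, the sample coordinate values are made up, and the &ghtml::htmlsafe escaping done by the real code is left out.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of solrbuildproc.pm): wraps each metadata
# value in its own <field> element, skipping values that are only whitespace.
sub fields_per_value {
    my ($shortname, @values) = @_;
    my $xml = "";
    foreach my $value (@values) {
        my $item = $value;      # copy, so the caller's list is not modified
        $item =~ s/^\s+//;      # remove leading white space
        $item =~ s/\s+$//;      # remove trailing white space
        next if $item eq "";
        $xml .= "<field name=\"$shortname\">$item</field>\n";
    }
    return $xml;
}

# Behaviour before this commit, for comparison: all values for the field
# are joined by spaces inside a single <field> element.
sub single_field {
    my ($shortname, @values) = @_;
    my $joined = join(" ", @values);
    return "<field name=\"$shortname\">$joined</field>\n";
}

my @coords = ("-37.78,175.31", "-36.85,174.76");    # made-up sample values

print single_field("CD", @coords);      # one element holding both coordinates
print fields_per_value("CD", @coords);  # one element per coordinate
```

With a type=string field, only the second form lets each coordinate be matched as a whole value, which is why the schema entries for CD, CS and ML are declared multiValued.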

Files:
3 modified

  • gs3-extensions/solr/trunk/src/perllib/solrbuilder.pm

    r32179 r33327  
    207207# build_cfg as, unlike in MGPP, we need these mappings in advance to configure 
    208208# Lucene/Solr. Unfortunately the original function found in mgbuilder.pm makes 
    209 # a mess of this - it only output fields that have been processed (none have) 
     209# a mess of this - it only outputs fields that have been processed (none have) 
    210210# and it has a hardcoded renaming for 'text' so it becomes 'TX' according to 
    211211# the schema but 'TE' according to XML sent to lucene_passes.pl/solr_passes.pl 
    212 # This version is dumber - just copy them all across verbatum - but works. We 
     212# This version is dumber - just copy them all across verbatim - but works. We 
    213213# do still need to support the special case of 'allfields' 
    214214sub make_final_field_list 
     
    286286        $schema_insert_xml .= "<field name=\"$field\" "; 
    287287 
    288         if($field eq "LA" || $field eq "LO") 
    289         { 
    290             $schema_insert_xml .=   "type=\"location\" "; 
     288        if($field eq "CD" || $field eq "CS") { 
     289            # Coordinate and CoordShort meta should not be split but treated as a whole string for searching. So type=string, not type=text_en_splitting             
     290            # Can't set to type="location", which uses solr.LatLonType, since type=location fields "must not be multivalued" as per conf/schema.xml.in. 
     291            # And we can have multiple Coordinate (and multiple CoordShort) meta for one doc, so multivalued=true. 
     292            # Not certain what to set stored to. As per conf/schema.xml.in, stored=false means "you only need to search on the field but 
     293            # don't need to return the original value". And they advise to set stored="false" for all fields possible (esp large fields)." 
     294            # But stored=false makes it not visible in Luke. So setting stored=true as for other fields 
     295            # TermVector: '"A term vector is a list of the document's terms and their number of occurrences in that documented." 
     296            # Each document has one term vector which is a list.' (http://makble.com/what-is-term-vector-in-lucene and lucene API for Field.TermVector) 
     297            # e.g. docA contains, "cat" 5 times, "dog" 10 times. We don't care to treat Coordinate meta as a term: not a "term" occurring 
     298            # in the doc, and don't care how often a Coordinate occurs in a document. 
     299            # Consequently, we don't care about term positions and term offsets for Coordinate meta either. 
     300             
     301            $schema_insert_xml .= "type=\"string\" indexed=\"true\" stored=\"true\" multiValued=\"true\" termVectors=\"false\" termPositions=\"false\" termOffsets=\"false\" />\n"; 
    291302        } 
    292 #       elsif ($field ne "ZZ" && $field ne "TX") 
    293 #       { 
    294 #           $schema_insert_xml .=   "type=\"string\" "; 
    295 #       } 
    296         else 
    297         { 
    298             #$schema_insert_xml .= "type=\"text_en_splitting\" "; 
    299  
    300             # original default solr field type for all fields is text_en_splitting 
    301             my $solrfieldtype = "text_en_splitting"; 
    302             if(defined $self->{'collect_cfg'}->{'indexfieldoptions'}->{$fullfieldname}->{'solrfieldtype'}) {     
    303             $solrfieldtype = $self->{'collect_cfg'}->{'indexfieldoptions'}->{$fullfieldname}->{'solrfieldtype'}; 
    304             #print STDERR "@@@@#### found TYPE: $solrfieldtype\n"; 
    305             } 
    306             $schema_insert_xml .= "type=\"$solrfieldtype\" "; 
     303         
     304        elsif($field eq "ML") {  
     305            # mapLabel: same attributes as for coord meta CD and CS above 
     306            # mapLabel is also like facets with type="string" to not get tokenized, and multiValued="true" to allow each shape's label to be stored distinctly 
     307            $schema_insert_xml .= "type=\"string\" indexed=\"true\" stored=\"true\" multiValued=\"true\" termVectors=\"false\" termPositions=\"false\" termOffsets=\"false\" />\n"; 
     308        } 
     309         
     310        else { 
     311            if($field eq "LT" || $field eq "LO") # full Latitude and Longitude coordinate meta, not the short variants (LatShort/LA and LongShort/LN) 
     312            { 
     313                # Latitude and Longitude is being phased out in favour of using Coord meta. 
     314                # However, if ever returning to using Lat and Lng instead of Coord meta, then the way the Lat Lng meta is currently written out for type="location" 
     315                # is in the wrong format. Lat and Lng shouldn't get written out separately but as: Lat,Lng 
     316                # It gets written out in solrbuildproc.pm, I think, so that would be where it needs to be corrected. 
     317                # For more info on type=location for our solr 4.7.2 or thereabouts, see https://web.archive.org/web/20160312154250/https://wiki.apache.org/solr/SpatialSearchDev 
     318                # which states: 
     319                #    When indexing, the format is something like: 
     320                #       <field name="store_lat_lon">12.34,-123.45</field> 
     321                # 
     322                $schema_insert_xml .=   "type=\"location\" ";                
     323            } 
    307324             
     325             
     326    #       elsif ($field ne "ZZ" && $field ne "TX") 
     327    #       { 
     328    #           $schema_insert_xml .=   "type=\"string\" "; 
     329    #       } 
     330            else 
     331            { 
     332                #$schema_insert_xml .= "type=\"text_en_splitting\" "; 
     333 
     334                # original default solr field type for all fields is text_en_splitting 
     335                my $solrfieldtype = "text_en_splitting"; 
     336                if(defined $self->{'collect_cfg'}->{'indexfieldoptions'}->{$fullfieldname}->{'solrfieldtype'}) {     
     337                $solrfieldtype = $self->{'collect_cfg'}->{'indexfieldoptions'}->{$fullfieldname}->{'solrfieldtype'}; 
     338                #print STDERR "@@@@#### found TYPE: $solrfieldtype\n"; 
     339                } 
     340                $schema_insert_xml .= "type=\"$solrfieldtype\" "; 
     341                 
     342            } 
     343            # set termVectors=\"true\" when term vectors info is required,  
     344            # see TermsResponse termResponse = solrResponse.getTermsResponse();  
     345            $schema_insert_xml .=  "indexed=\"true\" stored=\"true\" multiValued=\"true\" termVectors=\"true\" termPositions=\"true\" termOffsets=\"true\" />\n"; 
    308346        } 
    309         # set termVectors=\"true\" when term vectors info is required,  
    310         # see TermsResponse termResponse = solrResponse.getTermsResponse();  
    311         $schema_insert_xml .=  "indexed=\"true\" stored=\"true\" multiValued=\"true\" termVectors=\"true\" termPositions=\"true\" termOffsets=\"true\" />\n"; 
    312347    } 
    313348 
  • gs3-extensions/solr/trunk/src/perllib/solrbuildproc.pm

    r32441 r33327  
    202202     
    203203} 
    204 sub create_shortname { 
     204 
     205# UNUSED now by default. 
     206# Georgy overrode the mgppbuildproc::create_shortname() method in commit 32441 to create the method below, to override the inherited 
     207# behaviour so that create_shortname() worked appropriately for his use cases involving multiple analyzers. 
     208# As a result, create_shortname() for solr no longer did a lookup into the %mgppbuildproc::static_indexfield_map for registered shortnames. 
     209# For the rest, this method is a copy of mgppbuildproc::create_shortname(). 
     210# But we want the original mgppbuildproc::create_shortname() behaviour restored, as it does the lookups into %static_indexfield_map that's necessary for us. 
     211# So we've renamed this function to create_shortname_multi_solr_analyzer below so it won't get called as default behaviour any more. 
     212# Rename to create_shortname() when requiring Georgy's behaviour. 
     213sub create_shortname_multi_solr_analyzer { 
    205214    my $self = shift(@_); 
    206215 
     
    500509         
    501510        if ($section_text ne "") { 
    502             $new_text .= "$section_text "; 
     511             
     512            if ($allfields_index) { 
     513                $allfields_text .= "$section_text "; 
     514            } 
     515             
     516            # Remove any leading or trailing white space 
     517            $section_text =~ s/\s+$//; 
     518            $section_text =~ s/^\s+//; 
     519             
     520            if ($self->{'indexing_text'}) { 
     521                # add the tag 
     522                $new_text .= "<field name=\"$shortname\" >$section_text</field>\n"; 
     523            } else { 
     524                $new_text .= "$section_text "; 
     525            } 
    503526        } 
    504527         
    505528        foreach my $item (@metadata_list) { 
    506529            &ghtml::htmlsafe($item); 
    507             $new_text .= "$item "; 
    508         } 
    509  
    510         if ($allfields_index) { 
    511             $allfields_text .= $new_text; 
    512         } 
    513  
    514         # Remove any leading or trailing white space 
    515         $new_text =~ s/\s+$//; 
    516         $new_text =~ s/^\s+//; 
    517      
     530 
     531            if ($allfields_index) { 
     532                $allfields_text .= "$item "; 
     533            } 
     534 
     535            # Remove any leading or trailing white space 
     536            $item =~ s/\s+$//; 
     537            $item =~ s/^\s+//; 
     538             
     539            if ($self->{'indexing_text'}) { 
     540                # add the tag 
     541                $new_text .= "<field name=\"$shortname\" >$item</field>\n"; 
     542            } else { 
     543                $new_text .= "$item "; 
     544            } 
     545        } # end for loop processing @metadata_list 
    518546         
    519         if ($self->{'indexing_text'}) { 
    520             # add the tag 
    521             $new_text = "<field name=\"$shortname\" >$new_text</field>\n"; 
    522         } 
    523547        # filter the text 
    524548        $new_text = $self->filter_text ($field, $new_text); 
     
    669693 
    670694        foreach my $item (@metadata_list) { 
    671         &ghtml::htmlsafe($item); 
    672          
    673         $item = "<field name=\"$sf_shortname\">$item</field>\n"; 
    674         # filter the text??? 
    675         $text .= "$item"; # add it to the main text block 
    676         #print "#### new_text: $item\n"; 
     695            &ghtml::htmlsafe($item); 
     696            if ($item =~ /\S/) { 
     697                $item = "<field name=\"$sf_shortname\">$item</field>\n"; 
     698                # filter the text??? 
     699                $text .= "$item"; # add it to the main text block 
     700                #print "#### new_text: $item\n"; 
     701            } 
    677702        } 
    678703        if(scalar @metadata_list > 0) { 
  • main/trunk/greenstone2/perllib/lucenebuildproc.pm

    r28566 r33327  
    260260         
    261261        if ($section_text ne "") { 
    262             $new_text .= "$section_text "; 
     262             
     263            if ($self->{'allfields_index'}) { 
     264                $allfields_text .= "$section_text "; 
     265            } 
     266             
     267            if ($self->{'indexing_text'}) { 
     268                # add the tag 
     269                $new_text .= "<$shortname index=\"1\">$section_text</$shortname>"; 
     270                $self->{'allindexfields'}->{$real_field} = 1; 
     271            } else { 
     272                $new_text .= "$section_text "; 
     273            } 
    263274        } 
    264275         
    265276        foreach my $item (@metadata_list) { 
    266277            &ghtml::htmlsafe($item); 
    267             $new_text .= "$item "; 
    268         } 
    269  
    270         if ($self->{'allfields_index'}) { 
    271             $allfields_text .= $new_text; 
    272         } 
    273  
    274         if ($self->{'indexing_text'}) { 
    275             # add the tag 
    276             $new_text = "<$shortname index=\"1\">$new_text</$shortname>"; 
    277             $self->{'allindexfields'}->{$real_field} = 1; 
    278         } 
     278 
     279            if ($self->{'allfields_index'}) { 
     280                $allfields_text .= "$item "; 
     281            } 
     282 
     283            if ($self->{'indexing_text'}) { 
     284                # add the tag 
     285                $new_text .= "<$shortname index=\"1\">$item</$shortname>"; 
     286                $self->{'allindexfields'}->{$real_field} = 1; 
     287            } else { 
     288                $new_text .= "$item "; 
     289            } 
     290        } # end for loop processing @metadata_list 
     291         
    279292        # filter the text 
    280293        $new_text = $self->filter_text ($field, $new_text); 
     
    384397        push (@metadata_list, @section_metadata); 
    385398        } 
    386         my $new_text = ""; 
    387         foreach my $item (@metadata_list) { 
    388         &ghtml::htmlsafe($item); 
    389         $new_text .= "$item"; 
    390         } 
    391         if ($new_text =~ /\S/) { 
    392         $new_text = "<$sf_shortname index=\"1\" tokenize=\"0\">$new_text</$sf_shortname>"; 
    393         # filter the text??? 
    394         $text .= "$new_text"; # add it to the main text block 
    395         $self->{'actualsortfields'}->{$sfield} = 1; 
     399        # my $new_text = ""; 
     400        # foreach my $item (@metadata_list) { 
     401        # &ghtml::htmlsafe($item); 
     402        # $new_text .= "$item"; # should be .="$item "; But will be commenting out and rewriting this entire thing, so it doesn't matter 
     403        # } 
     404        # if ($new_text =~ /\S/) { 
     405        # $new_text = "<$sf_shortname index=\"1\" tokenize=\"0\">$new_text</$sf_shortname>"; 
     406        # # filter the text??? 
     407        # $text .= "$new_text"; # add it to the main text block 
     408        # $self->{'actualsortfields'}->{$sfield} = 1; 
     409        # } 
     410         
     411        foreach my $item (@metadata_list) { 
     412            &ghtml::htmlsafe($item); 
     413            if ($item =~ /\S/) { 
     414                $item = "<$sf_shortname index=\"1\" tokenize=\"0\">$item</$sf_shortname>"; 
     415                $text .= "$item"; # add it to the main text block 
     416            } 
     417        } 
     418        if(scalar @metadata_list > 0) { 
     419            $self->{'actualsortfields'}->{$sfield} = 1; 
    396420        } 
    397421    }