Timestamp:
18.07.2019 22:45:22
Author:
ak19
Message:

In order to get map coordinate metadata stored correctly in solr, changes were required. These changes revealed that the way in which some index fields were stored, in solr but also in lucene, was not exactly correct and required changing too.

1. Coordinate/CD, CoordShort/CS and GPSMapOverlayLabel/ML meta are now being stored. The schema elements created for these indexed fields notably need to say they're multivalued (multiple values per docOID) and of type=string rather than type=text_en_splitting, as the other meta have been so far. No term-related information is stored for them, as that doesn't appear important for these indexed fields.

2. Changes to solrbuildproc were required, and these changes were also repeated in lucenebuildproc. In their code before this commit, <field name=... /> elements were stored once for all meta elements in that field. It sort of worked out so far because of the type=text_en_splitting for these fields. However, this created the problem that, for example, all Coordinate meta for a docOID went into a single <field name=CD .../> element separated by spaces, rather than into a <field name=CD .../> element for each Coordinate meta. We wanted the latter behaviour for CD, CS and ML meta, but also for all other indexed meta fields such as TI for titles, and for indexed fields that include multiple meta in one index, such as a hypothetical TT that would include dc.Title,ex.Title,text. In that case too we want a <field name=TT /> element for each title meta and for the text meta.

3. The num_processed_bytes calculation is left untouched: it still includes the encapsulating <field name=.../> element and has not been changed to be calculated over just the metadata value of each field. This is not only because it is calculated to include the field in the super -buildproc.pm classes, but also because basebuildproc.pm defines num_processed_bytes as "the number of bytes actually passed to (mg) for the current index", and the lucene and mgpp buildprocs both include the enclosing element in the calculation, which seems deliberate. Further, num_processed_bytes contrasts with num_bytes, also declared and defined in basebuildproc.pm, as "The actual number of bytes in the collection, normally the same as what's processed during text compression". num_bytes seems to be what Dr Bainbridge had in mind today when he said that the enclosing <field/> element shouldn't actually be included in the calculation of num_processed_bytes. Since the definition of num_processed_bytes seems ambiguous to me now, I am leaving it alone until it has been discussed with Dr Bainbridge again, as otherwise there are many places where it would need changing.
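The difference between the old and new field output, and the two candidate ways of counting bytes, can be sketched in Perl as follows. This is a minimal illustration and not the actual solrbuildproc.pm code: the helper names fields_joined and fields_per_value, the exact shape of the <field> element and the sample coordinate values are all assumptions made for the example, and XML escaping of values is omitted.

```perl
use strict;
use warnings;

# Hypothetical helper: old behaviour, all values of one index field
# joined by spaces inside a single <field> element.
sub fields_joined {
    my ($name, @values) = @_;
    return "<field name=\"$name\">" . join(" ", @values) . "</field>\n";
}

# Hypothetical helper: new behaviour, one <field> element per meta value.
sub fields_per_value {
    my ($name, @values) = @_;
    return join("", map { "<field name=\"$name\">$_</field>\n" } @values);
}

# Made-up sample Coordinate values for a single docOID.
my @coords = ("-37.78,175.31", "-36.85,174.76");

my $old_xml = fields_joined("CD", @coords);
my $new_xml = fields_per_value("CD", @coords);
print $old_xml, $new_xml;

# Two ways of counting for the new output: including the enclosing
# <field> elements, or the meta values only. length() counts characters,
# which equals bytes for this ASCII-only sample.
my $bytes_with_element = length($new_xml);
my $bytes_values_only  = 0;
$bytes_values_only += length($_) for @coords;
print "with element: $bytes_with_element, values only: $bytes_values_only\n";
```

With type=string fields, only the per-value form lets each coordinate be matched as a whole string. The joined form would index the two coordinates as one combined value.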

Files:
1 modified

  • gs3-extensions/solr/trunk/src/perllib/solrbuilder.pm

--- gs3-extensions/solr/trunk/src/perllib/solrbuilder.pm (r32179)
+++ gs3-extensions/solr/trunk/src/perllib/solrbuilder.pm (r33327)
@@ -207,8 +207,8 @@
 # build_cfg as, unlike in MGPP, we need these mappings in advance to configure
 # Lucene/Solr. Unfortunately the original function found in mgbuilder.pm makes
-# a mess of this - it only output fields that have been processed (none have)
+# a mess of this - it only outputs fields that have been processed (none have)
 # and it has a hardcoded renaming for 'text' so it becomes 'TX' according to
 # the schema but 'TE' according to XML sent to lucene_passes.pl/solr_passes.pl
-# This version is dumber - just copy them all across verbatum - but works. We
+# This version is dumber - just copy them all across verbatim - but works. We
 # do still need to support the special case of 'allfields'
 sub make_final_field_list
@@ -286,28 +286,63 @@
         $schema_insert_xml .= "<field name=\"$field\" ";
 
-        if($field eq "LA" || $field eq "LO")
-        {
-            $schema_insert_xml .=   "type=\"location\" ";
+        if($field eq "CD" || $field eq "CS") {
+            # Coordinate and CoordShort meta should not be split but treated as a whole string for searching. So type=string, not type=text_en_splitting
+            # Can't set to type="location", which uses solr.LatLonType, since type=location fields "must not be multivalued" as per conf/schema.xml.in.
+            # And we can have multiple Coordinate (and multiple CoordShort) meta for one doc, so multivalued=true.
+            # Not certain what to set stored to. As per conf/schema.xml.in, stored=false means "you only need to search on the field but
+            # don't need to return the original value". And they advice to set stored="false" for all fields possible (esp large fields)."
+            # But stored=false makes it not visible in Luke. So setting stored=true as for other fields
+            # TermVector: '"A term vector is a list of the document's terms and their number of occurrences in that documented."
+            # Each document has one term vector which is a list.' (http://makble.com/what-is-term-vector-in-lucene and lucene API for Field.TermVector)
+            # e.g. docA contains, "cat" 5 times, "dog" 10 times. We don't care to treat Coordinate meta as a term: not a "term" occurring
+            # in the doc, and don't care how often a Coordinate occurs in a document.
+            # Consequently, we don't care about term positions and term offsets for Coordinate meta either.
+
+            $schema_insert_xml .= "type=\"string\" indexed=\"true\" stored=\"true\" multiValued=\"true\" termVectors=\"false\" termPositions=\"false\" termOffsets=\"false\" />\n";
         }
-#       elsif ($field ne "ZZ" && $field ne "TX")
-#       {
-#           $schema_insert_xml .=   "type=\"string\" ";
-#       }
-        else
-        {
-            #$schema_insert_xml .= "type=\"text_en_splitting\" ";
-
-            # original default solr field type for all fields is text_en_splitting
-            my $solrfieldtype = "text_en_splitting";
-            if(defined $self->{'collect_cfg'}->{'indexfieldoptions'}->{$fullfieldname}->{'solrfieldtype'}) {
-            $solrfieldtype = $self->{'collect_cfg'}->{'indexfieldoptions'}->{$fullfieldname}->{'solrfieldtype'};
-            #print STDERR "@@@@#### found TYPE: $solrfieldtype\n";
-            }
-            $schema_insert_xml .= "type=\"$solrfieldtype\" ";
+
+        elsif($field eq "ML") {
+            # mapLabel: same attributes as for coord meta CD and CS above
+            # mapLabel is also like facets with type="string" to not get tokenized, and multiValued="true" to allow each shape's label to be stored distinctly
+            $schema_insert_xml .= "type=\"string\" indexed=\"true\" stored=\"true\" multiValued=\"true\" termVectors=\"false\" termPositions=\"false\" termOffsets=\"false\" />\n";
+        }
+
+        else {
+            if($field eq "LT" || $field eq "LO") # full Latitude and Longitude coordinate meta, not the short variants (LatShort/LA and LongShort/LN)
+            {
+                # Latitude and Longitude is being phased out in favour of using Coord meta.
+                # However, if ever returning to using Lat and Lng instead of Coord meta, then the way the Lat Lng meta is currently written out for type="location"
+                # is in the wrong format. Lat and Lng shouldn't get written out separately but as: Lat,Lng
+                # It gets written out in solrbuildproc.pm, I think, so that would be where it needs to be corrected.
+                # For more info on type=location for our solr 4.7.2 or thereabouts, see https://web.archive.org/web/20160312154250/https://wiki.apache.org/solr/SpatialSearchDev
+                # which states:
+                #    When indexing, the format is something like:
+                #       <field name="store_lat_lon">12.34,-123.45</field>
+                #
+                $schema_insert_xml .=   "type=\"location\" ";
+            }
 
+
+    #       elsif ($field ne "ZZ" && $field ne "TX")
+    #       {
+    #           $schema_insert_xml .=   "type=\"string\" ";
+    #       }
+            else
+            {
+                #$schema_insert_xml .= "type=\"text_en_splitting\" ";
+
+                # original default solr field type for all fields is text_en_splitting
+                my $solrfieldtype = "text_en_splitting";
+                if(defined $self->{'collect_cfg'}->{'indexfieldoptions'}->{$fullfieldname}->{'solrfieldtype'}) {
+                $solrfieldtype = $self->{'collect_cfg'}->{'indexfieldoptions'}->{$fullfieldname}->{'solrfieldtype'};
+                #print STDERR "@@@@#### found TYPE: $solrfieldtype\n";
+                }
+                $schema_insert_xml .= "type=\"$solrfieldtype\" ";
+
+            }
+            # set termVectors=\"true\" when term vectors info is required,
+            # see TermsResponse termResponse = solrResponse.getTermsResponse();
+            $schema_insert_xml .=  "indexed=\"true\" stored=\"true\" multiValued=\"true\" termVectors=\"true\" termPositions=\"true\" termOffsets=\"true\" />\n";
         }
-        # set termVectors=\"true\" when term vectors info is required,
-        # see TermsResponse termResponse = solrResponse.getTermsResponse();
-        $schema_insert_xml .=  "indexed=\"true\" stored=\"true\" multiValued=\"true\" termVectors=\"true\" termPositions=\"true\" termOffsets=\"true\" />\n";
     }
 
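To see what the new branching produces for different fields, the attribute selection can be sketched as a standalone Perl function. This is a simplified sketch under stated assumptions, not the code in solrbuilder.pm: schema_field_xml is a hypothetical name, the per-field solrfieldtype lookup in collect_cfg is reduced to an optional second argument, and the CD/CS and ML branches are merged because they emit the same attributes.

```perl
use strict;
use warnings;

# Hypothetical standalone version of the new branch logic.
# $solrfieldtype stands in for the collect_cfg 'solrfieldtype' option.
sub schema_field_xml {
    my ($field, $solrfieldtype) = @_;
    my $xml = "<field name=\"$field\" ";

    if ($field eq "CD" || $field eq "CS" || $field eq "ML") {
        # Coordinates and map labels: matched as whole strings, no term info.
        $xml .= "type=\"string\" indexed=\"true\" stored=\"true\" multiValued=\"true\" "
              . "termVectors=\"false\" termPositions=\"false\" termOffsets=\"false\" />\n";
        return $xml;
    }

    if ($field eq "LT" || $field eq "LO") {
        $xml .= "type=\"location\" ";
    } else {
        # text_en_splitting remains the default type for all other fields.
        $solrfieldtype = "text_en_splitting" unless defined $solrfieldtype;
        $xml .= "type=\"$solrfieldtype\" ";
    }
    $xml .= "indexed=\"true\" stored=\"true\" multiValued=\"true\" "
          . "termVectors=\"true\" termPositions=\"true\" termOffsets=\"true\" />\n";
    return $xml;
}

# Print the generated schema line for two of the new fields and a title field.
print(schema_field_xml($_)) for qw(CD ML TI);
```

Calling schema_field_xml("TI", "string") shows how a collection-level solrfieldtype override would change only the type attribute while keeping the term vector settings.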