solrbuilder.pm

Timestamp:

2019-07-18T22:45:22+12:00

Author:

ak19

Message:

In order to get map coordinate metadata stored correctly in solr, changes were required. These changes revealed that the way in which some index fields were stored in solr, and also in lucene, was not exactly correct and required changing too.

1. Coordinate/CD, CoordShort/CS and GPSMapOverlayLabel/ML meta are now being stored. The schema elements created for these indexed fields notably need to say they're multivalued (multiple values per docOID) and are of type=string rather than type=text_en_splitting, as the other meta have been so far. No term-related information is being stored for them, as that doesn't appear important for these indexed fields.

2. Changes to solrbuildproc were required, and these changes were also repeated in lucenebuildproc. In their code before this commit, a <field name=... /> element was stored once for all meta elements in that field. It sort of worked out so far since the type was text_en_splitting for these fields. This however created the problem that, for example, all Coordinate meta for a docOID went into a single <field name=CD .../> element separated by spaces, rather than into a <field name=CD .../> element for each Coordinate meta. We wanted the latter behaviour for CD, CS and ML meta, but also for all other indexed meta fields such as TI for titles, and for indexed fields that include multiple meta in one index, such as a hypothetical TT where TT would include dc.Title,ex.Title,text. In that case too we want a <field name=TT /> element for each title meta and for the text meta.

3. The num_processed_bytes calculation is left untouched: it still includes the encapsulating <field name=.../> element and has not been changed to be calculated over just the metadata value of each field. This is not only because it is calculated to include the field in the super -buildproc.pm classes, but also because num_processed_bytes is defined in basebuilder.pm as "the number of bytes actually passed to (mg) for the current index", where the lucene and mgpp buildprocs both include the enclosing element in the calculation, which seems deliberate. Further, num_processed_bytes contrasts against num_bytes, declared and defined in basebuildproc.pm too as "The actual number of bytes in the collection, normally the same as what's processed during text compression". num_bytes seems to be what Dr Bainbridge had in mind today when he said that actually the enclosing <field/> element shouldn't be included in the calculation of num_processed_bytes. Since the definition of num_processed_bytes seems ambiguous to me now, I leave it alone until discussed with Dr Bainbridge again, as there are many places where it needs changing otherwise.
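
The difference described in point 2 of the message can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from solrbuildproc.pm or lucenebuildproc.pm: the helper names fields_joined and fields_per_value and the sample coordinate values are hypothetical, and XML escaping is omitted for brevity.

```perl
use strict;
use warnings;

# Hypothetical sample: two Coordinate meta values for a single document.
my @coords = ("-37.78 175.31", "-36.85 174.76");

# Behaviour before the commit (sketch): all values of a field share one
# <field> element, separated by spaces.
sub fields_joined {
    my ($name, @values) = @_;
    return "<field name=\"$name\">" . join(" ", @values) . "</field>\n";
}

# Behaviour after the commit (sketch): one <field> element per meta value.
sub fields_per_value {
    my ($name, @values) = @_;
    my $xml = "";
    foreach my $value (@values) {
        $xml .= "<field name=\"$name\">$value</field>\n";
    }
    return $xml;
}

print fields_joined("CD", @coords);
print fields_per_value("CD", @coords);
```

With a tokenizing type such as text_en_splitting the joined form is still split into terms at index time, which is presumably why it "sort of worked". With type=string the whole element content is indexed as a single untokenized value, so each coordinate needs its own element to be matched on its own.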

File:

1 edited

gs3-extensions/solr/trunk/src/perllib/solrbuilder.pm (modified) (2 diffs)


gs3-extensions/solr/trunk/src/perllib/solrbuilder.pm

-              r32179
+              r33327
 # build_cfg as, unlike in MGPP, we need these mappings in advance to configure
 # Lucene/Solr. Unfortunately the original function found in mgbuilder.pm makes
-# a mess of this - it only output fields that have been processed (none have)
+# a mess of this - it only outputs fields that have been processed (none have)
 # and it has a hardcoded renaming for 'text' so it becomes 'TX' according to
 # the schema but 'TE' according to XML sent to lucene_passes.pl/solr_passes.pl
-# This version is dumber - just copy them all across verbatum - but works. We
+# This version is dumber - just copy them all across verbatim - but works. We
 # do still need to support the special case of 'allfields'
 sub make_final_field_list
 …
         $schema_insert_xml .= "<field name=\"$field\" ";
-        if($field eq "LA" || $field eq "LO")
-        {
-            $schema_insert_xml .=   "type=\"location\" ";
+        if($field eq "CD" || $field eq "CS") {
+            # Coordinate and CoordShort meta should not be split but treated as a whole string for searching. So type=string, not type=text_en_splitting
+            # Can't set to type="location", which uses solr.LatLonType, since type=location fields "must not be multivalued" as per conf/schema.xml.in.
+            # And we can have multiple Coordinate (and multiple CoordShort) meta for one doc, so multivalued=true.
+            # Not certain what to set stored to. As per conf/schema.xml.in, stored=false means "you only need to search on the field but
+            # don't need to return the original value". And they advice to set stored="false" for all fields possible (esp large fields)."
+            # But stored=false makes it not visible in Luke. So setting stored=true as for other fields
+            # TermVector: '"A term vector is a list of the document's terms and their number of occurrences in that documented."
+            # Each document has one term vector which is a list.' (http://makble.com/what-is-term-vector-in-lucene and lucene API for Field.TermVector)
+            # e.g. docA contains, "cat" 5 times, "dog" 10 times. We don't care to treat Coordinate meta as a term: not a "term" occurring
+            # in the doc, and don't care how often a Coordinate occurs in a document.
+            # Consequently, we don't care about term positions and term offsets for Coordinate meta either.
+            $schema_insert_xml .= "type=\"string\" indexed=\"true\" stored=\"true\" multiValued=\"true\" termVectors=\"false\" termPositions=\"false\" termOffsets=\"false\" />\n";
+        }
-#       elsif ($field ne "ZZ" && $field ne "TX")
-#       {
-#           $schema_insert_xml .=   "type=\"string\" ";
-#       }
-        else
-        {
-            #$schema_insert_xml .= "type=\"text_en_splitting\" ";
-            # original default solr field type for all fields is text_en_splitting
-            my $solrfieldtype = "text_en_splitting";
-            if(defined $self->{'collect_cfg'}->{'indexfieldoptions'}->{$fullfieldname}->{'solrfieldtype'}) {
-            $solrfieldtype = $self->{'collect_cfg'}->{'indexfieldoptions'}->{$fullfieldname}->{'solrfieldtype'};
-            #print STDERR "@@@@#### found TYPE: $solrfieldtype\n";
-            }
-            $schema_insert_xml .= "type=\"$solrfieldtype\" ";
+        elsif($field eq "ML") {
+            # mapLabel: same attributes as for coord meta CD and CS above
+            # mapLabel is also like facets with type="string" to not get tokenized, and multiValued="true" to allow each shape's label to be stored distinctly
+            $schema_insert_xml .= "type=\"string\" indexed=\"true\" stored=\"true\" multiValued=\"true\" termVectors=\"false\" termPositions=\"false\" termOffsets=\"false\" />\n";
+        }
+        else {
+            if($field eq "LT" || $field eq "LO") # full Latitude and Longitude coordinate meta, not the short variants (LatShort/LA and LongShort/LN)
+            {
+                # Latitude and Longitude is being phased out in favour of using Coord meta.
+                # However, if ever returning to using Lat and Lng instead of Coord meta, then the way the Lat Lng meta is currently written out for type="location"
+                # is in the wrong format. Lat and Lng shouldn't get written out separately but as: Lat,Lng
+                # It gets written out in solrbuildproc.pm, I think, so that would be where it needs to be corrected.
+                # For more info on type=location for our solr 4.7.2 or thereabouts, see https://web.archive.org/web/20160312154250/https://wiki.apache.org/solr/SpatialSearchDev
+                # which states:
+                #    When indexing, the format is something like:
+                #       <field name="store_lat_lon">12.34,-123.45</field>
+                #
+                $schema_insert_xml .=   "type=\"location\" ";
+            }
+    #       elsif ($field ne "ZZ" && $field ne "TX")
+    #       {
+    #           $schema_insert_xml .=   "type=\"string\" ";
+    #       }
+            else
+            {
+                #$schema_insert_xml .= "type=\"text_en_splitting\" ";
+                # original default solr field type for all fields is text_en_splitting
+                my $solrfieldtype = "text_en_splitting";
+                if(defined $self->{'collect_cfg'}->{'indexfieldoptions'}->{$fullfieldname}->{'solrfieldtype'}) {
+                $solrfieldtype = $self->{'collect_cfg'}->{'indexfieldoptions'}->{$fullfieldname}->{'solrfieldtype'};
+                #print STDERR "@@@@#### found TYPE: $solrfieldtype\n";
+                }
+                $schema_insert_xml .= "type=\"$solrfieldtype\" ";
+            }
+            # set termVectors=\"true\" when term vectors info is required,
+            # see TermsResponse termResponse = solrResponse.getTermsResponse();
+            $schema_insert_xml .=  "indexed=\"true\" stored=\"true\" multiValued=\"true\" termVectors=\"true\" termPositions=\"true\" termOffsets=\"true\" />\n";
+        }
-        # set termVectors=\"true\" when term vectors info is required,
-        # see TermsResponse termResponse = solrResponse.getTermsResponse();
-        $schema_insert_xml .=  "indexed=\"true\" stored=\"true\" multiValued=\"true\" termVectors=\"true\" termPositions=\"true\" termOffsets=\"true\" />\n";
+    }
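
As a standalone restatement of the branching added in r33327, the attribute selection can be condensed into a single function. This is only a sketch: schema_field_xml is a hypothetical helper that does not exist in solrbuilder.pm, and the lookup in $self->{'collect_cfg'}->{'indexfieldoptions'} is replaced here by an optional second argument.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the new branching: returns the schema
# <field> element for a short field name such as "CD" or "TI".
sub schema_field_xml {
    my ($field, $type_override) = @_;
    my $xml = "<field name=\"$field\" ";

    if ($field eq "CD" || $field eq "CS" || $field eq "ML") {
        # Coordinate, CoordShort and map label meta: untokenized strings,
        # one value per element, no term information stored.
        $xml .= "type=\"string\" indexed=\"true\" stored=\"true\" multiValued=\"true\" "
              . "termVectors=\"false\" termPositions=\"false\" termOffsets=\"false\" />\n";
        return $xml;
    }

    my $type;
    if ($field eq "LT" || $field eq "LO") {
        # Full latitude/longitude meta keep type=location, as in the diff.
        $type = "location";
    } else {
        # Default type unless the collection configuration overrides it.
        $type = defined $type_override ? $type_override : "text_en_splitting";
    }
    $xml .= "type=\"$type\" indexed=\"true\" stored=\"true\" multiValued=\"true\" "
          . "termVectors=\"true\" termPositions=\"true\" termOffsets=\"true\" />\n";
    return $xml;
}

print schema_field_xml($_) for qw(CD ML LO TI);
```

Folding the CD/CS and ML branches together is a simplification made for this sketch; the committed code keeps them as separate branches that emit the same attributes.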

Changeset 33327 for gs3-extensions/solr/trunk/src/perllib/solrbuilder.pm