Changeset 12426

Timestamp:

2006-08-09T15:54:41+12:00

Author:

mdewsnip

Message:

Deleted the code for removing entities, since it seemed to be counterproductive (and was applied twice in many situations). When compressing the text, htmlsafe is called on the section text, so the XML will be valid in this case. When indexing the text, the HTML tags are stripped out ('strip_html' is always set for Lucene), so there is no problem in this case either.
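
To illustrate why the deleted substitutions could do more harm than good, here is a minimal sketch that applies the same two regular expressions to a few sample strings. The helper name `remove_entities` and the sample strings are hypothetical and exist only for this illustration; the two `s///` lines are copied from the removed code.

```perl
use strict;
use warnings;

# Hypothetical helper: wraps the two substitutions deleted in this changeset.
sub remove_entities {
    my ($text) = @_;
    $text =~ s/&\w{1,10};//g;   # drop anything shaped like an entity (&amp; &lt; ...)
    $text =~ s/&//g;            # drop every remaining ampersand
    return $text;
}

# Illustrative inputs (assumptions, not taken from a real collection).
my @samples = ('fish &amp; chips', 'AT&T', 'x &lt; y', 'caf&eacute;');
for my $sample (@samples) {
    print "'$sample' => '", remove_entities($sample), "'\n";
}
```

In each case the entity is deleted rather than decoded, so the character it stood for disappears from the text.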

File:

1 edited

trunk/gsdl/perllib/lucenebuildproc.pm (modified) (4 diffs)

trunk/gsdl/perllib/lucenebuildproc.pm

-              r12424
+              r12426
     # Unlike MG and MGPP, Lucene supports incremental building
     return 1;
 }
-sub preprocess_text {
-    my $self = shift (@_);
-    my ($text, $strip_html, $para) = @_;
-    # call the mgpp method first
-    my ($new_text) = $self->SUPER::preprocess_text($text, $strip_html, $para);
-    # remove entities
-    $new_text =~ s/&\w{1,10};//g;
-    # remove &
-    $new_text =~ s/&//g;
-    return $new_text;
-}
 …
             }
         }
         }
         else {
 …
             my $section_text = $doc_obj->get_text($section);
             if ($self->{'indexing_text'}) {
-                            # tag the text with <Text>...</Text>, add the <Paragraph> tags and strip out html if needed
+                            # tag the text with <Text>...</Text>, add the <Paragraph> tags and always strip out HTML
                 $new_text .= "$parastarttag<$shortname index=\"1\">\n";
                 if ($parastarttag ne "") {
-                $section_text = $self->preprocess_text($section_text, $self->{'strip_html'}, "</$shortname>$paraendtag$parastarttag<$shortname index=\"1\">");
+                $section_text = $self->preprocess_text($section_text, 1, "</$shortname>$paraendtag$parastarttag<$shortname index=\"1\">");
                 }
                 else {
                 # we don't want to individually tag each paragraph if not doing para indexing
-                $section_text = $self->preprocess_text($section_text, $self->{'strip_html'}, "");
+                $section_text = $self->preprocess_text($section_text, 1, "");
                 }
                 $new_text .= "$section_text</$shortname>$paraendtag\n";
 …
             $new_text .= "$parastarttag<$shortname index=\"1\">$item</$shortname>$paraendtag\n";
         }
-        # remove entities
-        $new_text =~ s/&\w{1,10};//g;
-        # remove &
-        $new_text =~ s/&//g;
         }
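
The commit message relies on the section text being escaped (via htmlsafe) before it is written out for compression, which keeps the XML valid without deleting anything. The sketch below shows the general idea of escaping rather than removing. The function `escape_for_xml` is a hypothetical stand-in written for this example; it is not Greenstone's htmlsafe implementation, and the exact set of characters that htmlsafe escapes is an assumption here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an htmlsafe-style escaper (not Greenstone's code).
sub escape_for_xml {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first, or the entities added below get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Illustrative input: the ampersand and tags survive as escaped text.
print escape_for_xml('AT&T <b>rocks</b>'), "\n";
```

Escaping preserves the original characters in a form an XML parser accepts, whereas the removed substitutions discarded them.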
