Changeset 2008 for trunk/gsdl

Timestamp:

2001-02-19T12:22:02+13:00

Author:

paynter

Message:

Marginally better support for non-English documents.

File:

1 edited

trunk/gsdl/perllib/classify/phind.pm (modified) (3 diffs)

trunk/gsdl/perllib/classify/phind.pm

-              r1949
+              r2008
     my $doclanguage = $doc_obj->get_metadata_element ($top_section, "Language");
     my $phrlanguage = $self->{'language_exp'};
+    print STDERR "+ CLASSIFY - doclanguage: $doclanguage, phrlanguage $phrlanguage \n";
     return if ($doclanguage && ($doclanguage !~ /$phrlanguage/i));
 …
     my ($language_exp, $text) = @_;
+    print STDERR "+ tokenising in $language_exp\n";
     if ($language_exp =~ /en/) {
     return &convert_gml_to_tokens_EN($text);
+    }
-    # FIRST, remove GML tags
     $_ = $text;
+    # Replace all whitespace with a simple space
+    s/\s+/ /gso;
+    # 1. remove GML tags
     # Remove everything that is in a tag
 …
     # Now we have the text, but it may contain HTML
     # elements coded as &gt; etc.  Remove these tags.
+    s/&amp;/&/sgo;
     s/&lt;/</sgo;
     s/&gt;/>/sgo;
-    s/\s+/ /sgo;
     s/\s*<p>\s*/ PARAGRAPHBREAK /isgo;
     s/\s*<br>\s*/ LINEBREAK /isgo;
     s/<[^>]*>/ /sgo;
+    # remove &amp; and other miscellaneous markup tags
+    s/&amp;/&/sgo;
+    s/&lt;/</sgo;
+    s/&gt;/>/sgo;
+    s/&amp;/&/sgo;
+    # replace<p> and <br> placeholders with carriage returns
+    # replace<p> and <br> placeholders with clause break symbol (\n)
+    s/\s+/ /gso;
     s/PARAGRAPHBREAK/\n/sgo;
     s/LINEBREAK/\n/sgo;
+    s/&([^;]+);/&unicode::ascii2utf8(\&ghtml::getcharequiv($1,0))/gse;
+    # Convert the remaining text to "clause format",
+    # This means removing all excess punctuation and garbage text,
+    # normalising valid punctuation to fullstops and commas,
+    # then putting one clause on each line.
+    # Insert newline when the end of a sentence is detected
+    # 2. Split the remaining text into space-delimited tokens
+    # Convert any HTML special characters (like &quot;) to their UTF8 equivalent
+    s/&([^;]+);/&unicode::ascii2utf8(\&ghtml::getcharequiv($1,1))/gse;
+    # Split text at word boundaries
+    s/\b/ /go;
+    # 3. Convert the remaining text to "clause format"
+    # Insert newline if the end of a sentence is detected
     # (delimter is:  "[\.\?\!]\s")
     s/\s*[\.\?\!]\s+/\n/go;
     # split numbers after four digits
     s/(\d\d\d\d)/$1 /go;
     # remove extra whitespace
+    # s/\s*[\.\?\!]\s+/\n/go;
+    # remove unnecessary punctuation and replace with clause break symbol (\n)
+    s/[^\w ]/\n/go;
+    # remove extraneous whitespace
     s/ +/ /sgo;
     s/^\s+//mgo;
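
The generic (non-English) branch added in this revision can be read as a short pipeline: collapse whitespace, decode escaped markup, turn `<p>` and `<br>` into clause breaks, split at word boundaries, then replace remaining punctuation with newlines. The following is a minimal, self-contained sketch of that reading, not the code from phind.pm. The function name `tokenise_generic_sketch` is hypothetical, the elided GML-tag removal is left out, and the entity conversion done through `ghtml::getcharequiv` is omitted so the sketch has no Greenstone dependencies. It also decodes `&amp;` last to avoid double-decoding, which differs from the order shown in the diff.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the generic tokeniser in this changeset.
# Assumptions: GML-tag removal and HTML entity conversion are omitted.
sub tokenise_generic_sketch {
    my ($text) = @_;

    # Replace all whitespace with a simple space
    $text =~ s/\s+/ /g;

    # Decode escaped markup characters (&amp; last, an assumption of this sketch)
    $text =~ s/&lt;/</g;
    $text =~ s/&gt;/>/g;
    $text =~ s/&amp;/&/g;

    # Mark paragraph and line breaks, then drop every other tag
    $text =~ s/\s*<p>\s*/ PARAGRAPHBREAK /ig;
    $text =~ s/\s*<br>\s*/ LINEBREAK /ig;
    $text =~ s/<[^>]*>/ /g;

    # Replace the placeholders with the clause break symbol (\n)
    $text =~ s/PARAGRAPHBREAK/\n/g;
    $text =~ s/LINEBREAK/\n/g;

    # Split text at word boundaries, and split numbers after four digits
    $text =~ s/\b/ /g;
    $text =~ s/(\d\d\d\d)/$1 /g;

    # Any character that is neither a word character nor a space
    # becomes a clause break
    $text =~ s/[^\w ]/\n/g;

    # Remove extraneous whitespace
    $text =~ s/ +/ /g;
    $text =~ s/^\s+//mg;

    return $text;
}

# Usage: print one clause per line for a small input
print tokenise_generic_sketch("Hello, world.<p>Second clause here!");
```

Because the final step treats every non-word, non-space character as a clause break, this approach needs no language-specific sentence rules, which fits the commit message's "marginally better" support: `\w` only covers the characters Perl considers word characters for the string's encoding.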
