Changeset 21876


Timestamp:
2010-04-13T15:29:42+12:00
Author:
kjdon
Message:

Only process text into English clauses if English is the only language, not for e.g. ar|en. Don't remove all non-\w characters, as this removes every non-alphanumeric character. I have made up a punctuation match instead: some characters are replaced with newlines, some with a space.

File:
1 edited

  • main/trunk/greenstone2/perllib/classify/Phind.pm
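The first change described in the message tightens the language test from a substring match (`/en/`) to an anchored match (`/^en$/`). The sketch below shows how the two patterns differ on a few language expressions. The helper name `is_english_only` and the sample values other than `ar|en` are illustrative assumptions, not part of Phind.pm.

```perl
use strict;
use warnings;

# Hypothetical helper (not in Phind.pm): true only when the language
# expression is exactly "en", mirroring the anchored test /^en$/.
sub is_english_only {
    my ($language_exp) = @_;
    return $language_exp =~ /^en$/ ? 1 : 0;
}

# Hypothetical helper: the old, unanchored test, which matches any
# expression that merely contains the letters "en".
sub contains_en {
    my ($language_exp) = @_;
    return $language_exp =~ /en/ ? 1 : 0;
}

# 'ar|en' comes from the commit message; the other values are made up.
for my $exp ('en', 'ar|en', 'en|fr', 'bengali') {
    printf "%-8s old=%d new=%d\n",
        $exp, contains_en($exp), is_english_only($exp);
}
```

With the unanchored pattern, a multi-language expression such as `ar|en` would be routed to the English-only tokenizer; the anchored pattern accepts only the bare `en`.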

--- main/trunk/greenstone2/perllib/classify/Phind.pm (r20454)
+++ main/trunk/greenstone2/perllib/classify/Phind.pm (r21876)
@@ -29,6 +29,4 @@
 # The Phind clasifier plugin.
 # Type "classinfo.pl Phind" at the command line for a summary.
-
-# 12/05/02 Added usage datastructure - John Thompson
 
 package Phind;
@@ -472,5 +470,5 @@
     }
 
-    if ($language_exp =~ /en/) {
+    if ($language_exp =~ /^en$/) {
     return &convert_gml_to_tokens_EN($text);
     }
@@ -504,6 +502,4 @@
 
 
-
-
     # 2. Split the remaining text into space-delimited tokens
 
@@ -513,5 +509,5 @@
     # Split text at word boundaries
     s/\b/ /go;
-
+   
     # 3. Convert the remaining text to "clause format"
 
@@ -521,5 +517,10 @@
 
     # remove unnecessary punctuation and replace with clause break symbol (\n)
-    s/[^\w ]/\n/go;
+    # the following very nicely removes all non alphanumeric characters. too bad if you are not using english...
+    #s/[^\w ]/\n/go;
+    # replace punct with new lines - is this what we want??
+    s/\s*[\?\;\:\!\,\.\"\[\]\{\}\(\)]\s*/\n/go; #"
+    # then remove other punct with space
+    s/[\'\`\\\_]/ /go;
 
     # remove extraneous whitespace
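The effect of the punctuation change can be sketched on a small sample. This is a minimal illustration, not the code path in Phind.pm: the function names `clause_breaks_old` and `clause_breaks_new` are hypothetical, and it assumes the text reaches the substitution as undecoded UTF-8 bytes, which is one plausible reason the old `[^\w ]` pattern damaged non-English text (on byte strings, Perl's `\w` matches only ASCII word characters by default).

```perl
use strict;
use warnings;

# Hypothetical wrapper around the old substitution: every character that
# is neither \w nor a space becomes a clause break.
sub clause_breaks_old {
    my ($text) = @_;
    $text =~ s/[^\w ]/\n/g;
    return $text;
}

# Hypothetical wrapper around the two new substitutions from the changeset.
# The /o flag is omitted here; it has no effect on constant patterns.
sub clause_breaks_new {
    my ($text) = @_;
    # Listed punctuation (plus surrounding whitespace) becomes a clause break.
    $text =~ s/\s*[\?\;\:\!\,\.\"\[\]\{\}\(\)]\s*/\n/g;
    # Apostrophe, backtick, backslash and underscore become a space.
    $text =~ s/[\'\`\\\_]/ /g;
    return $text;
}

# Assumed sample: "café, l'eau" as raw UTF-8 bytes (é is \xc3\xa9).
my $sample = "caf\xc3\xa9, l'eau";

for my $result (clause_breaks_old($sample), clause_breaks_new($sample)) {
    (my $visible = $result) =~ s/\n/\\n/g;   # make clause breaks visible
    print "$visible\n";
}
```

Under these assumptions the old pattern turns both bytes of the accented character into clause breaks, while the new patterns leave them intact and break only at the comma. Punctuation outside the two listed sets (for example non-ASCII quotation marks) passes through the new patterns unchanged.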