Changeset 21876

Timestamp:
13.04.2010 15:29:42 (9 years ago)
Author:
kjdon
Message:

Only process text into English clauses if English is the only language, not for e.g. ar|en. Don't remove all non-\w characters, since this removes every non-alphanumeric character. I have made up a punctuation match instead: some punctuation is replaced with newlines, some with spaces.
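
The two behaviours described in the message can be illustrated with a minimal Perl sketch. This is not the Phind.pm code itself: the helper names `is_english_only` and `to_clauses` are hypothetical and exist only for illustration, and the sample strings are made up. The regular expressions are copied from the changeset, minus the `o` modifier, which has no effect on patterns without interpolated variables.

```perl
use strict;
use warnings;

# Hypothetical helper: true only when the language expression is exactly "en".
# The old test, /en/, also matched combined expressions such as "ar|en".
sub is_english_only {
    my ($language_exp) = @_;
    return $language_exp =~ /^en$/ ? 1 : 0;
}

# Hypothetical helper applying the two substitutions from the changeset.
sub to_clauses {
    my ($text) = @_;
    # Clause-breaking punctuation (plus surrounding whitespace) becomes a newline.
    $text =~ s/\s*[\?\;\:\!\,\.\"\[\]\{\}\(\)]\s*/\n/g;
    # Apostrophes, backticks, backslashes and underscores become a space.
    $text =~ s/[\'\`\\\_]/ /g;
    return $text;
}

print is_english_only("en"), "\n";       # English only: English clause processing applies
print is_english_only("ar|en"), "\n";    # mixed languages: it does not
print to_clauses("Hello, world! It's here"), "\n";
```

With the old pattern `/en/`, the second call would also have returned true, which is the case the message says should no longer be treated as English.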

Files:
1 modified

  • main/trunk/greenstone2/perllib/classify/Phind.pm

    r20454 r21876
      29     29     # The Phind clasifier plugin.
      30     30     # Type "classinfo.pl Phind" at the command line for a summary.
      31          -
      32          - # 12/05/02 Added usage datastructure - John Thompson
      33     31
      34     32     package Phind;

     472    470         }
     473    471
     474          -     if ($language_exp =~ /en/) {
            472   +     if ($language_exp =~ /^en$/) {
     475    473         return &convert_gml_to_tokens_EN($text);
     476    474         }

     504    502
     505    503
     506          -
     507          -
     508    504         # 2. Split the remaining text into space-delimited tokens
     509    505

     513    509         # Split text at word boundaries
     514    510         s/\b/ /go;
     515          -
            511   +
     516    512         # 3. Convert the remaining text to "clause format"
     517    513

     521    517
     522    518         # remove unnecessary punctuation and replace with clause break symbol (\n)
     523          -     s/[^\w ]/\n/go;
            519   +     # the following very nicely removes all non alphanumeric characters. too bad if you are not using english...
            520   +     #s/[^\w ]/\n/go;
            521   +     # replace punct with new lines - is this what we want??
            522   +     s/\s*[\?\;\:\!\,\.\"\[\]\{\}\(\)]\s*/\n/go; #"
            523   +     # then remove other punct with space
            524   +     s/[\'\`\\\_]/ /go;
     524    525
     525    526         # remove extraneous whitespace