Changeset 14923

Timestamp:
2007-12-17T13:47:08+13:00
Author:
mdewsnip
Message:

Undid a change I made back in August 2006 regarding removing entities from the text. It turns out the removal is necessary because HTML-only entities like &nbsp; are not valid XML, so Lucene will barf on them. Improved the code so that only named entities are removed, while &#nnnn; and &#xhhhh; entities are kept (as these are valid XML).

File:
1 edited

  • gsdl/trunk/perllib/lucenebuildproc.pm

    r14068  r14923
    446     446         }
    447     447
            448         # It's important that we remove name entities because otherwise the text passed to Lucene for indexing
            449         #   may not be valid XML (eg. if HTML-only entities like &nbsp; are used)
            450         $new_text =~ s/&\w{1,10};//g;
            451         # Remove stray '&' characters, except in &#nnnn; or &#xhhhh; entities (which are valid XML)
            452         $new_text =~ s/&([^\#])/ $1/g;
            453
    448     454         return $new_text;
    449     455     }
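
The two substitutions added in this changeset can be tried in isolation. The following is a minimal sketch, not code from lucenebuildproc.pm: the helper name `strip_named_entities` and the sample strings are assumptions made for illustration, and only the two regular expressions are taken from the diff.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of lucenebuildproc.pm): applies the same two
# substitutions as the changeset to a copy of the input string.
sub strip_named_entities {
    my ($text) = @_;

    # Remove named entities such as &nbsp; or &eacute; (1 to 10 word
    # characters between '&' and ';'). Numeric entities start with '&#',
    # and '#' is not matched by \w, so they are left alone.
    $text =~ s/&\w{1,10};//g;

    # Replace a stray '&' with a space unless the next character is '#'.
    $text =~ s/&([^\#])/ $1/g;

    return $text;
}

# Illustrative inputs only.
my @samples = (
    "caf&eacute; au&nbsp;lait",
    "fish & chips",
    "&#233; and &#xE9;",
);

for my $sample (@samples) {
    printf "%-26s => %s\n", $sample, strip_named_entities($sample);
}
```

With these inputs, the named entities in the first string are dropped, the bare '&' in the second is replaced by a space, and the numeric entities in the third pass through unchanged. Two properties follow from the patterns as written: the first substitution also removes the predefined XML entities such as &amp; and &lt;, and the second needs a character after the '&', so a '&' at the very end of the string is not replaced.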