Changeset 8818


Timestamp: 2004-12-15T15:38:18+13:00
Author: mdewsnip
Message:

Title tags over multiple lines will now be removed correctly before classification by textcat.

File: 1 edited

  • trunk/gsdl/perllib/plugins/BasPlug.pm

    r8814  r8818
    650    650      # remove <title>stuff</title> -- as titles tend often to be in English
    651    651      # for foreign language documents
    652           - $text =~ s/<title>.*?<\/title>//i;
           652    + $text =~ s/<title>(.|\n)*?<\/title>//i;
    653    653
    654    654      # remove all HTML tags
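
The change above matters because, in Perl, `.` does not match a newline unless the `/s` modifier is given, so the r8814 pattern silently fails on a title that spans more than one line. The following is a minimal sketch of that difference, not code from BasPlug.pm: the sample HTML string and the helper names `strip_title_old`, `strip_title_new`, and `strip_title_s_flag` are hypothetical and exist only for illustration. The third helper shows the `/s` modifier as one possible alternative to the `(.|\n)` alternation; the changeset itself does not use it.

```perl
use strict;
use warnings;

# Hypothetical sample input: a <title> element split across two lines.
my $html = "<html><head><title>An English\nTitle</title></head><body>Texte</body></html>";

# Illustrative helpers only; none of these names appear in BasPlug.pm.

# r8814 pattern: "." cannot match "\n", so a multi-line title is left in place.
sub strip_title_old {
    my ($text) = @_;
    $text =~ s/<title>.*?<\/title>//i;
    return $text;
}

# r8818 pattern: the alternation (.|\n) also consumes newlines.
sub strip_title_new {
    my ($text) = @_;
    $text =~ s/<title>(.|\n)*?<\/title>//i;
    return $text;
}

# Assumed alternative (not in the changeset): /s makes "." match "\n" as well.
sub strip_title_s_flag {
    my ($text) = @_;
    $text =~ s/<title>.*?<\/title>//is;
    return $text;
}

print "old: ", strip_title_old($html), "\n";
print "new: ", strip_title_new($html), "\n";
print "/s:  ", strip_title_s_flag($html), "\n";
```

With this input, the old pattern returns the text unchanged, while the other two remove the whole title element, which is the behaviour the commit message describes.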