Changeset 33556


Timestamp: 2019-10-09T18:58:30+13:00
Author: ak19
Message:

Blacklisted Wikipedia pages that are actually in other languages but had found their way into the Common Crawl results for language code MRI.

File: 1 edited

  • gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt

    r33554 r33556  
      # Without either ^ or $ symbol, urls containing the given url will get blacklisted

    +
    + # wikipedia pages in
    + # ksh (a German dialect), ilo (Filippino), ty Tahitian, wa for Walons/Walloon,
    + # io (Ido version of Esperanto) and zh-min-nan (Min-Nan-Chinese) are not in the Maori language
    + # Not sure why Commoncrawl had found them for language code MRI
    + ksh.wikipedia.org
    + ilo.wikipedia.org
    + wa.wikipedia.org
    + ty.m.wikipedia.org
    + io.m.wikipedia.org
    + zh-min-nan.wikipedia.org
    + zh-min-nan.wiktionary.org

      # unwanted domains
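
The context comment in the filter file says that an entry with neither `^` nor `$` blacklists any URL containing it. The following is a minimal Python sketch of how such entries might be applied, not the extension's actual code. The function names (`load_filters`, `matches`, `is_blacklisted`) and the example URLs are hypothetical, and the handling of `^` (prefix), `$` (suffix), and both together (exact match) is an assumption inferred from that comment; only the plain "contains" case is stated in the file itself.

```python
def load_filters(text):
    """Parse filter-file text, skipping blank lines and '#' comment lines."""
    filters = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            filters.append(line)
    return filters


def matches(url, pattern):
    """Return True if url matches one filter entry.

    Assumed semantics: a leading '^' anchors at the start of the URL,
    a trailing '$' anchors at the end, both together mean an exact match.
    With neither anchor the entry matches anywhere inside the URL,
    which is the behaviour the filter file's comment describes.
    """
    starts = pattern.startswith("^")
    ends = pattern.endswith("$")
    core = pattern
    if starts:
        core = core[1:]
    if ends:
        core = core[:-1]
    if starts and ends:
        return url == core
    if starts:
        return url.startswith(core)
    if ends:
        return url.endswith(core)
    return core in url


def is_blacklisted(url, filters):
    """Return True if any filter entry matches the URL."""
    return any(matches(url, pattern) for pattern in filters)


# The entries added in this changeset, copied from the diff.
NEW_ENTRIES = """
# wikipedia pages in other languages
ksh.wikipedia.org
ilo.wikipedia.org
wa.wikipedia.org
ty.m.wikipedia.org
io.m.wikipedia.org
zh-min-nan.wikipedia.org
zh-min-nan.wiktionary.org
"""

filters = load_filters(NEW_ENTRIES)
print(is_blacklisted("https://ksh.wikipedia.org/wiki/Example", filters))
print(is_blacklisted("https://mi.wikipedia.org/wiki/Example", filters))
print(is_blacklisted("https://ty.wikipedia.org/wiki/Example", filters))
```

Under these assumed semantics, the `ty.m.wikipedia.org` and `io.m.wikipedia.org` entries would only catch the mobile hosts; a URL on `ty.wikipedia.org` does not contain either string and would pass through. Unanchored entries can also match longer hostnames that happen to end in the same text, so anchoring may be worth considering if that matters.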