Changeset 33556

Timestamp:
09.10.2019 18:58:30
Author:
ak19
Message:

Blacklisted Wikipedia pages that are actually in other languages but had found their way into the CommonCrawl MRI results.

Files:
1 modified

  • gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt

    r33554 r33556
      # Without either ^ or $ symbol, urls containing the given url will get blacklisted
      
    + 
    + # wikipedia pages in
    + # ksh (a German dialect), ilo (Filippino), ty Tahitian, wa for Walons/Walloon,
    + # io (Ido version of Esperanto) and zh-min-nan (Min-Nan-Chinese) are not in the Maori language
    + # Not sure why Commoncrawl had found them for language code MRI
    + ksh.wikipedia.org
    + ilo.wikipedia.org
    + wa.wikipedia.org
    + ty.m.wikipedia.org
    + io.m.wikipedia.org
    + zh-min-nan.wikipedia.org
    + zh-min-nan.wiktionary.org
      
      # unwanted domains
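
The filter file's own comment says that an entry without ^ or $ blacklists any URL containing it. The following is a minimal sketch of how entries like the ones added here might be applied. It is not the project's actual filter code: the function names `load_patterns`, `matches` and `is_blacklisted` are hypothetical, and the handling of ^ (treated as "URL starts with") and $ (treated as "URL ends with") is an assumption, since only the unanchored case is described in this changeset.

```python
def load_patterns(text):
    """Collect filter entries, skipping blank lines and # comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def matches(url, pattern):
    """Check one URL against one filter entry.

    Only the unanchored (substring) case is documented in the filter file.
    The ^ and $ cases below are assumptions made for this sketch.
    """
    anchored_start = pattern.startswith("^")
    anchored_end = pattern.endswith("$")
    core = pattern[1:] if anchored_start else pattern
    core = core[:-1] if anchored_end else core
    if not core:
        return False  # ignore entries that are only anchor symbols
    if anchored_start and anchored_end:
        return url == core  # assumed: exact match
    if anchored_start:
        return url.startswith(core)  # assumed: prefix match
    if anchored_end:
        return url.endswith(core)  # assumed: suffix match
    return core in url  # documented: substring match


def is_blacklisted(url, patterns):
    """True if any filter entry matches the URL."""
    return any(matches(url, p) for p in patterns)


# Entries added in this changeset.
FILTER_TEXT = """
# wikipedia pages in other languages
ksh.wikipedia.org
ilo.wikipedia.org
wa.wikipedia.org
ty.m.wikipedia.org
io.m.wikipedia.org
zh-min-nan.wikipedia.org
zh-min-nan.wiktionary.org
"""

PATTERNS = load_patterns(FILTER_TEXT)
```

Under substring matching, an entry such as ty.m.wikipedia.org covers only the mobile host, so a URL on ty.wikipedia.org would not be caught by it in this sketch.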