Changeset 33823

Timestamp:
13.01.2020 19:45:21
Author:
ak19
Message:

Recommitting mongo-data folder with renamed files with numbering.

Location:
other-projects/maori-lang-detection
Files:
31 added
2 modified

  • other-projects/maori-lang-detection/MoreReading/mongodb.txt

    r33816 r33823
    1043   1043      https://www.manualscat.com - Misdetection. Appears to be in German. Manuals pages.
    1044   1044   NL:
    1045          !!! - http://www.gouvernante.info and http://gouvernante.info - radio links to NZ websites not found by commoncrawl and which potentially have Maori language content. For example, http://irirangi.net/, https://www.atiawatoafm.com, www.maori.org.nz
           1045   (!!!) - http://www.gouvernante.info and http://gouvernante.info - radio links to NZ websites not found by commoncrawl and which potentially have Maori language content. For example, http://irirangi.net/, https://www.atiawatoafm.com, www.maori.org.nz [http://www.gouvernante.info/radio4.htm]
    1046   1046   - https://www.arrowhead.eu, https://arrowheadproject.azurewebsites.net, arrowhead.eu - misidentification of URL
    1047   1047   - tonhut.nl - misidentication
     
    1053   1053   - http://skimap.info/ - maps, NZ placenames in PDF
    1054   1054   DK:
    1055          !! -  http://akona.ngapuhitelevision.com, http://waiatarangatiratanga.ngapuhitelevision.com,
           1055   !! ++  http://akona.ngapuhitelevision.com, http://waiatarangatiratanga.ngapuhitelevision.com,
    1056   1056   http://jazz.ngapuhitelevision.com, http://ngapuhitelevision.com, http://ngapuhiradio.com,
    1057   1057   http://powhiri.ngapuhitelevision.com, http://komisch.ngapuhitelevision.com
     
    1234   1234
    1235   1235   # Just considering those sites outside NZ or not with .nz TLD:
    1236
    1237          db.getCollection('Websites').find({$and: [
    1238                          {geoLocationCountryCode: {$ne: "NZ"}},
    1239                          {domain: {$not: /\.nz/}},
    1240                          {numPagesContainingMRI: {$gt: 0}},
    1241                          {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
    1242                      ]}).count()
    1243
    1244          221 websites
    1245
    1246          # counts by country code excluding NZ related sites
    1247   1236   db.Websites.aggregate([
    1248   1237       {
     
    1268   1257
    1269   1258
           1259   # counts by country code excluding NZ related sites
           1260   db.getCollection('Websites').find({$and: [
           1261                   {geoLocationCountryCode: {$ne: "NZ"}},
           1262                   {domain: {$not: /\.nz/}},
           1263                   {numPagesContainingMRI: {$gt: 0}},
           1264                   {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
           1265               ]}).count()
           1266
           1267   221 websites
           1268
           1269
    1270   1270   # But to produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with .nz tld):
    1271   1271   db.getCollection('Websites').find({$and: [
     
    1301   1301
    1302   1302   -----------------------
           1303   US:
    1303   1304   Done: manually inspected 68/117 sites
           1305
           1306   TOTAL US: 4+7+7+4+3=25
    1304   1307
    1305   1308   DEFINITELY:
     
    1323   1326   X http://www.precious-testimonies.com, http://precious-testimonies.com/JesusDidItTranslations/JesusDidItMaoriTranslation.htm may be autotranslated as the Dutch page looks more like Danish or some Scandinavian language and the French page is missing accented characters.
    1324   1327
    1325          PROBABLY:
           1328   CHECK - PROBABLY:
    1326   1329   !! https://maorinews.com,
    1327   1330   !! http://maaori.com,
     
    1467   1470
    1468   1471
           1472
           1473   ---------------
           1474
           1475   MANUAL - TOTAL NUM SITES WITH SOME MRI CONTENT BY COUNTRY
           1476   NZ: 176
           1477   US: 25
           1478   AU: 3
           1479   FR: 1
           1480   DK: 2
           1481   (CA: 0.5)
           1482   DE: 2
           1483   IE (Ireland): 1
           1484   CZ: 1
           1485   ES: 1
           1486   BG: 1
           1487
           1488   TIDIED:
           1489   NZ: 176
           1490   US: 25
           1491   AU: 3
           1492   DE: 2
           1493   DK: 2
           1494   BG: 1
           1495   CZ: 1
           1496   ES: 1
           1497   FR: 1
           1498   IE: 1
           1499   TOTAL: 213
           1500
           1501
  • other-projects/maori-lang-detection/conf/url-blacklist-filter.txt

    r33800 r33823
    70     70     # more adult sites
    71     71     acba.osb-land.com
    72
           72     the-naked.com
           73     # the full URL is http://ww25.milfsplease.com, but don't know whether the ww25 prefix should be included or not
           74     ww25.milfsplease.com
           75     milfsplease.com
    73     76
    74     77     # just get rid of any URL containing "livejasmin"