Changeset 33823


Timestamp:
01/13/20 19:45:21 (12 months ago)
Author:
ak19
Message:

Recommitting mongo-data folder with renamed files with numbering.

Location:
other-projects/maori-lang-detection
Files:
31 added
2 edited

  • other-projects/maori-lang-detection/MoreReading/mongodb.txt

    r33816 r33823  
    10431043   https://www.manualscat.com - Misdetection. Appears to be in German. Manuals pages.
    10441044NL:
    1045 !!! - http://www.gouvernante.info and http://gouvernante.info - radio links to NZ websites not found by commoncrawl and which potentially have Maori language content. For example, http://irirangi.net/, https://www.atiawatoafm.com, www.maori.org.nz
     1045(!!!) - http://www.gouvernante.info and http://gouvernante.info - radio links to NZ websites not found by commoncrawl and which potentially have Maori language content. For example, http://irirangi.net/, https://www.atiawatoafm.com, www.maori.org.nz [http://www.gouvernante.info/radio4.htm]
    10461046- https://www.arrowhead.eu, https://arrowheadproject.azurewebsites.net, arrowhead.eu - misidentification of URL
    10471047- tonhut.nl - misidentification
     
    10531053- http://skimap.info/ - maps, NZ placenames in PDF
    10541054DK:
    1055 !! -  http://akona.ngapuhitelevision.com, http://waiatarangatiratanga.ngapuhitelevision.com,
     1055!! ++  http://akona.ngapuhitelevision.com, http://waiatarangatiratanga.ngapuhitelevision.com,
    10561056http://jazz.ngapuhitelevision.com, http://ngapuhitelevision.com, http://ngapuhiradio.com,
    10571057http://powhiri.ngapuhitelevision.com, http://komisch.ngapuhitelevision.com
     
    12341234
    12351235# Just considering those sites outside NZ or not with .nz TLD:
    1236 
    1237 db.getCollection('Websites').find({$and: [
    1238                 {geoLocationCountryCode: {$ne: "NZ"}},
    1239                 {domain: {$not: /\.nz/}},
    1240                 {numPagesContainingMRI: {$gt: 0}},
    1241                 {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
    1242             ]}).count()
    1243 
    1244 221 websites
    1245 
    1246 # counts by country code excluding NZ related sites
    12471236db.Websites.aggregate([
    12481237    {
     
    12681257
    12691258
     1259# counts by country code excluding NZ related sites
     1260db.getCollection('Websites').find({$and: [
     1261                {geoLocationCountryCode: {$ne: "NZ"}},
     1262                {domain: {$not: /\.nz/}},
     1263                {numPagesContainingMRI: {$gt: 0}},
     1264                {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
     1265            ]}).count()
     1266
     1267221 websites
     1268
     1269
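Reviewer's sketch (not part of the changeset): the `find(...)` filter in this hunk can be read as a plain JavaScript predicate, which makes the combined conditions easier to check by eye. The field names come from the query itself; the sample documents and the helper name `isOverseasCandidate` are hypothetical and purely illustrative.

```javascript
// Hypothetical sample documents; only the field names are taken from the query.
const sampleWebsites = [
  { domain: "example.com",    geoLocationCountryCode: "US", numPagesContainingMRI: 3, urlContainsLangCodeInPath: false },
  { domain: "example.co.nz",  geoLocationCountryCode: "US", numPagesContainingMRI: 5, urlContainsLangCodeInPath: false },
  { domain: "example.org",    geoLocationCountryCode: "NZ", numPagesContainingMRI: 2, urlContainsLangCodeInPath: false },
  { domain: "example.com.au", geoLocationCountryCode: "AU", numPagesContainingMRI: 1, urlContainsLangCodeInPath: true },
  { domain: "example.de",     geoLocationCountryCode: "DE", numPagesContainingMRI: 4, urlContainsLangCodeInPath: true },
  { domain: "example.fr",     geoLocationCountryCode: "FR", numPagesContainingMRI: 0, urlContainsLangCodeInPath: false },
];

// Mirrors the $and clause of the query, one condition per line.
function isOverseasCandidate(site) {
  return site.geoLocationCountryCode !== "NZ"          // {$ne: "NZ"}
    && !/\.nz/.test(site.domain)                       // {$not: /\.nz/}; unanchored, so ".nz" anywhere in the domain excludes the site
    && site.numPagesContainingMRI > 0                  // {$gt: 0}
    && (site.geoLocationCountryCode === "AU"           // $or: hosted in AU ...
        || site.urlContainsLangCodeInPath === false);  // ... or no language code in the URL path
}

const overseasCandidates = sampleWebsites.filter(isOverseasCandidate);
console.log(overseasCandidates.map(site => site.domain));
```

With these six made-up documents, only `example.com` and `example.com.au` pass. The AU site passes through the first `$or` branch even though its URL path contains a language code.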
    12701270# But to produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with .nz tld):
    12711271db.getCollection('Websites').find({$and: [
     
    13011301
    13021302-----------------------
     1303US:
    13031304Done: manually inspected 68/117 sites
     1305
     1306TOTAL US: 4+7+7+4+3=25
    13041307
    13051308DEFINITELY:
     
    13231326X http://www.precious-testimonies.com, http://precious-testimonies.com/JesusDidItTranslations/JesusDidItMaoriTranslation.htm may be autotranslated as the Dutch page looks more like Danish or some Scandinavian language and the French page is missing accented characters.
    13241327
    1325 PROBABLY:
     1328CHECK - PROBABLY:
    13261329!! https://maorinews.com,
    13271330!! http://maaori.com,
     
    14671470
    14681471
     1472
     1473---------------
     1474
     1475MANUAL - TOTAL NUM SITES WITH SOME MRI CONTENT BY COUNTRY
     1476NZ: 176
     1477US: 25
     1478AU: 3
     1479FR: 1
     1480DK: 2
     1481(CA: 0.5)
     1482DE: 2
     1483IE (Ireland): 1
     1484CZ: 1
     1485ES: 1
     1486BG: 1
     1487
     1488TIDIED:
     1489NZ: 176
     1490US: 25
     1491AU: 3
     1492DE: 2
     1493DK: 2
     1494BG: 1
     1495CZ: 1
     1496ES: 1
     1497FR: 1
     1498IE: 1
     1499TOTAL: 213
     1500
     1501
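Reviewer's sketch (not part of the changeset): the manual tallies added in this hunk can be re-added mechanically. The numbers are copied from the "TOTAL US" line and the "TIDIED" list; the variable names are hypothetical. The parenthesised "(CA: 0.5)" entry appears only in the untidied list, so it is left out here as it is in the tidied one.

```javascript
// Subtotals from the line "TOTAL US: 4+7+7+4+3=25".
const usSubtotals = [4, 7, 7, 4, 3];

// Per-country counts copied from the TIDIED list.
const tidiedCounts = { NZ: 176, US: 25, AU: 3, DE: 2, DK: 2, BG: 1, CZ: 1, ES: 1, FR: 1, IE: 1 };

const sumOf = values => values.reduce((total, value) => total + value, 0);

const usTotal = sumOf(usSubtotals);
const grandTotal = sumOf(Object.values(tidiedCounts));

console.log(`US: ${usTotal}, TOTAL: ${grandTotal}`);
```

Both sums agree with the figures recorded in the notes: 25 for the US and 213 overall.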
  • other-projects/maori-lang-detection/conf/url-blacklist-filter.txt

    r33800 r33823  
    7070# more adult sites
    7171acba.osb-land.com
    72 
     72the-naked.com
     73# the full URL is http://ww25.milfsplease.com, but don't know whether the ww25 prefix should be included or not
     74ww25.milfsplease.com
     75milfsplease.com
    7376
    7477# just get rid of any URL containing "livejasmin"