Changeset 33838

Timestamp:
16.01.2020 17:56:50 (5 weeks ago)
Author:
ak19
Message:

Updated after checking non-NZ and non-nz TLD sites with mi in URL path

Files:
1 modified

  • other-projects/maori-lang-detection/MoreReading/mongodb.txt

    r33823 r33838  
------------------------------

Need to inspect all URLs with mi in the URL path (mi.* or */mi) that are neither sites with an nz TLD nor sites originating in NZ:

db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /\.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
472

(vs:
db.getCollection('Websites').find({$and: [{numPagesInMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /\.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
209)
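
The domain filter in these counts depends on how the end-anchored pattern treats the dot. Below is a minimal plain-JavaScript sketch (runnable in Node, no database needed) comparing an unescaped dot with an escaped one. The sample domain strings are invented for illustration and are not taken from the Websites collection.

```javascript
// Hypothetical sample domains, for illustration only.
const sampleDomains = ["example.co.nz", "example.org", "examplenz", "mi.example.com"];

const unescaped = /.nz$/;  // "." matches ANY character before the trailing "nz"
const escaped = /\.nz$/;   // "\." matches only a literal dot before "nz"

// Keep the domains that do NOT match, mirroring {domain: {$not: <regex>}}.
const notNzUnescaped = sampleDomains.filter(d => !unescaped.test(d));
const notNzEscaped = sampleDomains.filter(d => !escaped.test(d));

console.log(notNzUnescaped);
console.log(notNzEscaped);
```

For real-world domains the two patterns should rarely differ, since nz is the only TLD ending in those letters, but the escaped form states the intent more precisely.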


db.Websites.aggregate([
    {
        $match: {
            $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /\.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]
        }
    },
    {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}, domain: {$addToSet: '$domain'}}},
    {$sort: {count: -1}}
])
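
To make explicit what the pipeline computes, here is a rough plain-JavaScript equivalent over an in-memory array. The sample records and the helper name groupByCountry are hypothetical; only the field names come from the query above. The sketch tests the domain suffix with an escaped dot, which is an assumption about the intent of the filter.

```javascript
// Hypothetical records using the field names from the query; all values are invented.
const websites = [
  {domain: "a.example.com", geoLocationCountryCode: "US", numPagesContainingMRI: 3, urlContainsLangCodeInPath: true},
  {domain: "b.example.com", geoLocationCountryCode: "US", numPagesContainingMRI: 1, urlContainsLangCodeInPath: true},
  {domain: "c.example.fr", geoLocationCountryCode: "FR", numPagesContainingMRI: 2, urlContainsLangCodeInPath: true},
  {domain: "d.example.com", geoLocationCountryCode: "NZ", numPagesContainingMRI: 5, urlContainsLangCodeInPath: true},
  {domain: "e.example.co.nz", geoLocationCountryCode: "US", numPagesContainingMRI: 4, urlContainsLangCodeInPath: true},
  {domain: "f.example.com", geoLocationCountryCode: "US", numPagesContainingMRI: 0, urlContainsLangCodeInPath: true},
  {domain: "g.example.com", geoLocationCountryCode: "CA", numPagesContainingMRI: 2, urlContainsLangCodeInPath: false},
];

// Hypothetical helper mirroring the $match, $group and $sort stages.
function groupByCountry(sites) {
  const groups = new Map();
  for (const s of sites) {
    // $match: all four conditions must hold.
    if (!(s.numPagesContainingMRI > 0)) continue;
    if (s.urlContainsLangCodeInPath !== true) continue;
    if (/\.nz$/.test(s.domain)) continue;            // assumption: literal ".nz" suffix
    if (s.geoLocationCountryCode === "NZ") continue;
    // $group: one bucket per country code, counting sites and collecting unique domains.
    const key = s.geoLocationCountryCode;
    if (!groups.has(key)) groups.set(key, {_id: key, count: 0, domain: new Set()});
    const g = groups.get(key);
    g.count += 1;
    g.domain.add(s.domain);                          // Set plays the role of $addToSet
  }
  // $sort: descending by count.
  return [...groups.values()]
    .map(g => ({_id: g._id, count: g.count, domain: [...g.domain]}))
    .sort((a, b) => b.count - a.count);
}

const result = groupByCountry(websites);
console.log(JSON.stringify(result, null, 2));
```

Each output entry corresponds to one country section in the notes that follow: the country code, how many matching sites it has, and the list of distinct domains to inspect by hand.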


Of interest or possible interest:
US:
!! http://indigenousblogs.com [15/18 blogs work]
X https://biblia.gospelprime.com.br - misdetection (containsMRI)
X ?https://follow3rs.com - seems dodgy and possibly autotranslated. Misspells "account" as "accout"
!! https://mi.m.wikipedia.org, https://mi.wikipedia.org
X https://usahello.org - autotranslated
X http://church-of-christ.org, http://www.church-of-christ.org - probably autotranslated, because https://church-of-christ.org/nl/ has "HET kerken van Christus", using "het" instead of the plural form
X https://www.livehoster.com
X http://www.americasportsfloor.com - product store. Misdetected
!! http://csunplugged.org, https://www.csunplugged.org - University of Canterbury NZ and site only available in EN, MI, DE, ES, CN
X https://mi.lawyers.cafe - autotranslated
    X https://mi.centr-zashity.ru - same as lawyers.cafe above: autotranslated
! https://policies.oclc.org - not completely translated. Copyright page, privacy statement and cookie statement pages appear to be in Maori. Not sure if autotranslated since other pages aren't available in MI. Dutch equivalent pages seem human translated.
X http://jobdescriptionsample.org - autotranslated
X http://mi.broadcastbeat.com - autotranslated product site
X http://www.samewe.net - autotranslated product site
X https://mi.kidspicturedictionary.com - autotranslated, but MAY BE USEFUL
X https://www.rikoooo.com - autotranslated

CN: -

FR:
? https://mi.phcoker.com - product site "Shangke Chemical Rapu + 86 (1812) 4514114 info@phcoker.com"
X http://www.gpedia.com - dodgy copy of wikipedia, see http://www.gpedia.com/nl/gpedia/Hoofdpagina

NL:
X http://www.martinvrijland.nl - wordpress, autotranslated

CA:
X https://www.wikiplanet.click - seems like a dodgy copy of wikipedia
X cloudsfeed.com - wordpress admin page


db.getCollection('Webpages').find({$and: [{isMRI: true}, {URL: /indigenousblogs\.com/}]})
=> http://indigenousblogs.com/mi/