Changeset 33838


Timestamp:
2020-01-16T17:56:50+13:00 (4 years ago)
Author:
ak19
Message:

Updated after checking non-NZ and non-nz TLD sites with mi in URL path

File:
1 edited

  • other-projects/maori-lang-detection/MoreReading/mongodb.txt

    r33823 r33838  
------------------------------

Need to inspect all those URLs with mi in URL path (mi.* or */mi) that are not sites with nz TLD or originating in NZ:

db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
472

(vs:
db.getCollection('Websites').find({$and: [{numPagesInMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
209)
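
One caveat about the domain filter used in both queries: in /.nz$/ the dot is unescaped, so it matches any character before "nz", not only a literal dot. The plain JavaScript sketch below illustrates the difference. The sample domains are made up for illustration and are not taken from the Websites collection. Whether any real domain in the data is affected is unknown, so the counts above stand as recorded.

```javascript
// Hypothetical sample domains, for illustration only (not from the Websites collection).
const sampleDomains = ["example.co.nz", "example.com", "examplenz", "mi.wikipedia.org"];

const looseNz = /.nz$/;   // as written in the queries: "." matches any character
const strictNz = /\.nz$/; // escaped dot: matches only a literal ".nz" suffix

// Domains each pattern would treat as having an nz TLD (and so exclude).
const excludedLoose = sampleDomains.filter((d) => looseNz.test(d));
const excludedStrict = sampleDomains.filter((d) => strictNz.test(d));

console.log("loose: ", excludedLoose);
console.log("strict:", excludedStrict);
```

If the stricter behaviour is wanted, the same escaped pattern could be substituted into the domain condition of the queries; that would be a change to the recorded queries, not something this changeset did.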


db.Websites.aggregate([
    {
        $match: {
            $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]
        }
    },
    {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}, domain: { $addToSet: '$domain' }}},
    { $sort : { count : -1} }
])
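
To make the shape of the aggregation result easier to picture, here is a minimal in-memory sketch of the same three stages ($match, $group, $sort) in plain JavaScript. The documents are hypothetical and only borrow the field names used in the pipeline. MongoDB's treatment of missing fields is not modelled, so this is an illustration of the logic and not a replacement for running the pipeline.

```javascript
// Hypothetical documents using the field names from the pipeline above.
const websites = [
  { domain: "a.example.com",   geoLocationCountryCode: "US", numPagesContainingMRI: 3, urlContainsLangCodeInPath: true },
  { domain: "b.example.org",   geoLocationCountryCode: "US", numPagesContainingMRI: 1, urlContainsLangCodeInPath: true },
  { domain: "c.example.fr",    geoLocationCountryCode: "FR", numPagesContainingMRI: 2, urlContainsLangCodeInPath: true },
  { domain: "d.example.co.nz", geoLocationCountryCode: "US", numPagesContainingMRI: 5, urlContainsLangCodeInPath: true },
  { domain: "e.example.com",   geoLocationCountryCode: "NZ", numPagesContainingMRI: 4, urlContainsLangCodeInPath: true },
  { domain: "f.example.com",   geoLocationCountryCode: "CA", numPagesContainingMRI: 0, urlContainsLangCodeInPath: true },
];

const nzTld = /.nz$/; // same pattern as in the pipeline

// $match: keep sites with MRI-containing pages and a language code in the URL
// that neither end in "nz" nor geolocate to NZ.
const matched = websites.filter((w) =>
  w.numPagesContainingMRI > 0 &&
  w.urlContainsLangCodeInPath === true &&
  !nzTld.test(w.domain) &&
  w.geoLocationCountryCode !== "NZ"
);

// $group: one entry per country code, with a count and a set of distinct domains.
const groups = new Map();
for (const w of matched) {
  const key = w.geoLocationCountryCode;
  if (!groups.has(key)) groups.set(key, { _id: key, count: 0, domain: new Set() });
  const g = groups.get(key);
  g.count += 1;
  g.domain.add(w.domain);
}

// $sort: descending by count.
const result = [...groups.values()]
  .map((g) => ({ _id: g._id, count: g.count, domain: [...g.domain] }))
  .sort((a, b) => b.count - a.count);

console.log(JSON.stringify(result, null, 2));
```

Each output entry corresponds to one country heading in the notes that follow, with the domain array listing the sites to inspect under it.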


Of interest or possible interest:
US:
!! http://indigenousblogs.com [15/18 blogs work]
X https://biblia.gospelprime.com.br - misdetection (containsMRI)
X ?https://follow3rs.com - seems dodgy and possibly autotranslated. Misspells "account" as "accout"
!! https://mi.m.wikipedia.org, https://mi.wikipedia.org
X https://usahello.org - autotranslated
X http://church-of-christ.org, http://www.church-of-christ.org - I think autotranslated, because https://church-of-christ.org/nl/ has "HET kerken van Christus" instead of the plural form ("de kerken")
X https://www.livehoster.com
X http://www.americasportsfloor.com - product store. Misdetected
!! http://csunplugged.org, https://www.csunplugged.org - University of Canterbury NZ and site only available in EN, MI, DE, ES, CN
X https://mi.lawyers.cafe - autotranslated
    X https://mi.centr-zashity.ru - same as lawyers.cafe above: autotranslated
! https://policies.oclc.org - not completely translated. Copyright page, privacy statement and cookie statement pages appear to be in Maori. Not sure if autotranslated since other pages aren't available in MI. Dutch equivalent pages seem human translated.
X http://jobdescriptionsample.org - autotranslated
X http://mi.broadcastbeat.com - autotranslated product site
X http://www.samewe.net - autotranslated product site
X https://mi.kidspicturedictionary.com - autotranslated, but MAY BE USEFUL
X https://www.rikoooo.com - autotranslated

CN: -

FR:
? https://mi.phcoker.com - product site "Shangke Chemical Rapu + 86 (1812) 4514114 [email protected]"
X http://www.gpedia.com - dodgy copy of wikipedia, see http://www.gpedia.com/nl/gpedia/Hoofdpagina

NL:
X http://www.martinvrijland.nl - wordpress, autotranslated

CA:
X https://www.wikiplanet.click (seems like a dodgy copy of wikipedia)
X cloudsfeed.com - wordpress admin page


db.getCollection('Webpages').find({$and: [{isMRI: true}, {URL: /indigenousblogs\.com/}]})
=> http://indigenousblogs.com/mi/