Changeset 33838
- Timestamp: 2020-01-16T17:56:50+13:00
- File: 1 edited
other-projects/maori-lang-detection/MoreReading/mongodb.txt
r33823 r33838

------------------------------

Need to inspect all sites with mi in the URL (mi.* or */mi) that neither have an nz TLD nor originate in NZ:

db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
472

(vs:
db.getCollection('Websites').find({$and: [{numPagesInMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
209)


db.Websites.aggregate([
    {
        $match: {
            $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]
        }
    },
    {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}, domain: { $addToSet: '$domain' }}},
    { $sort : { count : -1} }
])


Of interest or possible interest:
US:
    !! http://indigenousblogs.com [15/18 blogs work]
    X https://biblia.gospelprime.com.br - misdetection (containsMRI)
    X ?https://follow3rs.com - seems dodgy and possibly autotranslated. Misspells "account" as "accout"
    !! https://mi.m.wikipedia.org, https://mi.wikipedia.org
    X https://usahello.org - autotranslated
    X http://church-of-christ.org, http://www.church-of-christ.org - I think autotranslated, because https://church-of-christ.org/nl/ has "HET kerken van Christus" instead of the plural form
    X https://www.livehoster.com
    X http://www.americasportsfloor.com - product store. Misdetected
    !! http://csunplugged.org, https://www.csunplugged.org - University of Canterbury NZ and site only available in EN, MI, DE, ES, CN
    X https://mi.lawyers.cafe - autotranslated
    X https://mi.centr-zashity.ru - same as lawyers.cafe above: autotranslated
    ! https://policies.oclc.org - not completely translated. Copyright page, privacy statement and cookie statement pages appear to be in Maori. Not sure if autotranslated since other pages aren't available in MI. Dutch equivalent pages seem human translated.
    X http://jobdescriptionsample.org - autotranslated
    X http://mi.broadcastbeat.com - autotranslated product site
    X http://www.samewe.net - autotranslated product site
    X https://mi.kidspicturedictionary.com - autotranslated, but MAY BE USEFUL
    X https://www.rikoooo.com - autotranslated

CN: -

FR:
    ? https://mi.phcoker.com - product site "Shangke Chemical Rapu + 86 (1812) 4514114 [email protected]"
    X http://www.gpedia.com - dodgy copy of wikipedia, see http://www.gpedia.com/nl/gpedia/Hoofdpagina

NL:
    X http://www.martinvrijland.nl - wordpress, autotranslated

CA:
    X https://www.wikiplanet.click (seems like a dodgy copy of wikipedia)
    X cloudsfeed.com - wordpress admin page


db.getCollection('Webpages').find({$and: [{isMRI: true}, {URL: /indigenousblogs\.com/}]})
=> http://indigenousblogs.com/mi/
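
The $match/$group/$sort aggregation above can be sketched in plain JavaScript against an in-memory array, which may help when sanity-checking the filter logic without a MongoDB instance. This is only an illustration: the sample documents, domains and the helper names matchesFilter and groupByCountry are hypothetical and not part of the original notes. One assumption differs from the queries as written: the pattern /.nz$/ leaves the dot unescaped, so it matches any character followed by "nz"; the sketch uses /\.nz$/ instead, which could give slightly different counts from the 472 and 209 recorded above.

```javascript
// Hypothetical sample documents. Field names follow the Websites collection
// queried above; all values are invented for illustration only.
const websites = [
  { domain: "mi.example.org", numPagesContainingMRI: 3, urlContainsLangCodeInPath: true,  geoLocationCountryCode: "US" },
  { domain: "example.com",    numPagesContainingMRI: 1, urlContainsLangCodeInPath: true,  geoLocationCountryCode: "US" },
  { domain: "example.fr",     numPagesContainingMRI: 2, urlContainsLangCodeInPath: true,  geoLocationCountryCode: "FR" },
  { domain: "example.co.nz",  numPagesContainingMRI: 5, urlContainsLangCodeInPath: true,  geoLocationCountryCode: "US" }, // dropped: nz TLD
  { domain: "example.net",    numPagesContainingMRI: 4, urlContainsLangCodeInPath: true,  geoLocationCountryCode: "NZ" }, // dropped: located in NZ
  { domain: "example.info",   numPagesContainingMRI: 0, urlContainsLangCodeInPath: true,  geoLocationCountryCode: "CA" }, // dropped: no pages containing MRI
  { domain: "example.de",     numPagesContainingMRI: 2, urlContainsLangCodeInPath: false, geoLocationCountryCode: "DE" }, // dropped: no language code in URL
];

// In-memory counterpart of the $match stage.
// Assumption: the dot is escaped (/\.nz$/), unlike the queries in the notes.
function matchesFilter(site) {
  return site.numPagesContainingMRI > 0
    && site.urlContainsLangCodeInPath === true
    && !/\.nz$/.test(site.domain)
    && site.geoLocationCountryCode !== "NZ";
}

// In-memory counterpart of $group (count plus a set of domains per country)
// followed by $sort on count, descending.
function groupByCountry(sites) {
  const groups = new Map();
  for (const site of sites.filter(matchesFilter)) {
    const key = site.geoLocationCountryCode;
    if (!groups.has(key)) {
      groups.set(key, { _id: key, count: 0, domain: new Set() });
    }
    const group = groups.get(key);
    group.count += 1;
    group.domain.add(site.domain);
  }
  return [...groups.values()]
    .map(g => ({ _id: g._id, count: g.count, domain: [...g.domain] }))
    .sort((a, b) => b.count - a.count);
}

const result = groupByCountry(websites);
console.log(JSON.stringify(result, null, 2));
```

With the invented sample data, the US group collects two domains and the FR group one; the four remaining documents are each excluded by a different clause of the filter.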