- Timestamp:
- 2020-01-13T19:45:21+13:00
- File:
-
- 1 edited
other-projects/maori-lang-detection/MoreReading/mongodb.txt
--- other-projects/maori-lang-detection/MoreReading/mongodb.txt (r33816)
+++ other-projects/maori-lang-detection/MoreReading/mongodb.txt (r33823)
@@ -1043,5 +1043,5 @@
 https://www.manualscat.com - Misdetection. Appears to be in German. Manuals pages.
 NL:
-!!! - http://www.gouvernante.info and http://gouvernante.info - radio links to NZ websites not found by commoncrawl and which potentially have Maori language content. For example, http://irirangi.net/, https://www.atiawatoafm.com, www.maori.org.nz
+(!!!) - http://www.gouvernante.info and http://gouvernante.info - radio links to NZ websites not found by commoncrawl and which potentially have Maori language content. For example, http://irirangi.net/, https://www.atiawatoafm.com, www.maori.org.nz [http://www.gouvernante.info/radio4.htm]
 - https://www.arrowhead.eu, https://arrowheadproject.azurewebsites.net, arrowhead.eu - misidentification of URL
 - tonhut.nl - misidentication
@@ -1053,5 +1053,5 @@
 - http://skimap.info/ - maps, NZ placenames in PDF
 DK:
-!! -http://akona.ngapuhitelevision.com, http://waiatarangatiratanga.ngapuhitelevision.com,
+!! ++ http://akona.ngapuhitelevision.com, http://waiatarangatiratanga.ngapuhitelevision.com,
 http://jazz.ngapuhitelevision.com, http://ngapuhitelevision.com, http://ngapuhiradio.com,
 http://powhiri.ngapuhitelevision.com, http://komisch.ngapuhitelevision.com
@@ -1234,15 +1234,4 @@
 
 # Just considering those sites outside NZ or not with .nz TLD:
-
-db.getCollection('Websites').find({$and: [
-{geoLocationCountryCode: {$ne: "NZ"}},
-{domain: {$not: /\.nz/}},
-{numPagesContainingMRI: {$gt: 0}},
-{$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
-]}).count()
-
-221 websites
-
-# counts by country code excluding NZ related sites
 db.Websites.aggregate([
 {
@@ -1268,4 +1257,15 @@
 
 
+# counts by country code excluding NZ related sites
+db.getCollection('Websites').find({$and: [
+{geoLocationCountryCode: {$ne: "NZ"}},
+{domain: {$not: /\.nz/}},
+{numPagesContainingMRI: {$gt: 0}},
+{$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
+]}).count()
+
+221 websites
+
+
 # But to produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with .nz tld):
 db.getCollection('Websites').find({$and: [
@@ -1301,5 +1301,8 @@
 
 -----------------------
+US:
 Done: manually inspected 68/117 sites
+
+TOTAL US: 4+7+7+4+3=25
 
 DEFINITELY:
@@ -1323,5 +1326,5 @@
 X http://www.precious-testimonies.com, http://precious-testimonies.com/JesusDidItTranslations/JesusDidItMaoriTranslation.htm may be autotranslated as the Dutch page looks more like Danish or some Scandinavian language and the French page is missing accented characters.
 
-PROBABLY:
+CHECK - PROBABLY:
 !! https://maorinews.com,
 !! http://maaori.com,
@@ -1467,2 +1470,32 @@
 
 
+
+---------------
+
+MANUAL - TOTAL NUM SITES WITH SOME MRI CONTENT BY COUNTRY
+NZ: 176
+US: 25
+AU: 3
+FR: 1
+DK: 2
+(CA: 0.5)
+DE: 2
+IE (Ireland): 1
+CZ: 1
+ES: 1
+BG: 1
+
+TIDIED:
+NZ: 176
+US: 25
+AU: 3
+DE: 2
+DK: 2
+BG: 1
+CZ: 1
+ES: 1
+FR: 1
+IE: 1
+TOTAL: 213
+
+
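
The hand-entered tallies added in this changeset (the TIDIED per-country list and the "TOTAL US" sum) can be cross-checked with a few lines of plain JavaScript. This is only a sketch: the figures are copied from the changeset, while the names tidied, usParts and sum are introduced here for illustration and do not appear in mongodb.txt. It assumes, as the TIDIED list itself does, that the "(CA: 0.5)" entry is left out of the total.

```javascript
// Per-country counts copied from the TIDIED list in the changeset.
// The "(CA: 0.5)" entry from the MANUAL list is omitted, as it is in TIDIED.
const tidied = {
  NZ: 176, US: 25, AU: 3, DE: 2, DK: 2,
  BG: 1, CZ: 1, ES: 1, FR: 1, IE: 1,
};

// Sub-counts behind the "TOTAL US: 4+7+7+4+3=25" line.
const usParts = [4, 7, 7, 4, 3];

// Hypothetical helper for this sketch: adds up an array of numbers.
const sum = (values) => values.reduce((acc, n) => acc + n, 0);

const total = sum(Object.values(tidied));
const usTotal = sum(usParts);

console.log(`TOTAL: ${total}`);
console.log(`TOTAL US: ${usTotal}`);
```

Both sums agree with the recorded figures (TOTAL: 213 and TOTAL US: 25), so the tidied list is internally consistent.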