Timestamp:
2020-01-17T19:32:16+13:00
Author:
ak19
Message:

indigenousblogs.com did have one page actually in Maori (an XML feed), so this adds 1 to the table of counts for US sites with mi in the URL path that contained actual MRI.

File:
1 edited

  • other-projects/maori-lang-detection/MoreReading/mongodb.txt

    r33843 r33847  
    @@ -1488,5 +1488,5 @@
     TIDIED:
     NZ: 176
    -US: 25+3 from US with mi in URL path = 28
    +US: 25+4 from US with mi in URL path = 29
     AU: 3
     DE: 2
    @@ -1497,5 +1497,5 @@
     FR: 1
     IE: 1
    -TOTAL: 213+3 from US with mi in URL path = 216
    +TOTAL: 213+4 from US with mi in URL path = 217
     
     
    @@ -1525,5 +1525,5 @@
     Of interest or possible interest:
     US:
    -!! http://indigenousblogs.com [15/18 blogs work]
    +!! http://indigenousblogs.com [15/18 blogs work] - has one page in Maori (http://indigenousblogs.com/feeds/mi.xml)
     X https://biblia.gospelprime.com.br - misdetection (containsMRI)
     X ?https://follow3rs.com - seems dodgy and possibly auto-translated. Can't spell account, misspelled as accout
    @@ -1559,2 +1559,36 @@
     db.getCollection('Webpages').find({$and: [{isMRI: true}, {URL: /indigenousblogs\.com/}]})
     => http://indigenousblogs.com/mi/
    +
    +--------------------------
    +
    +
    +db.Websites.aggregate([
    +    {
    +        $match: {
    +            $and: [
    +                {geoLocationCountryCode: {$ne: "NZ"}},
    +                {domain: {$not: /\.nz/}},
    +                {numPagesContainingMRI: {$gt: 0}},
    +                {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
    +            ]
    +        }
    +    },
    +    { $unwind: "$geoLocationCountryCode" },
    +    {
    +        $group: {
    +            _id: {$toLower: '$geoLocationCountryCode'},
    +            count: { $sum: 1 },
    +            domain: { $addToSet: '$domain' },
    +            numPagesInMRI: { $addToSet: '$numPagesInMRI' },
    +            numPagesContainingMRI: { $addToSet: '$numPagesContainingMRI' },
    +            numPagesInMRICount: { $sum: '$numPagesInMRI' },
    +            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
    +        }
    +    },
    +    { $sort : { count : -1} }
    +]);
    +
    +
    +To convert json to csv
    +In gedit replace
    +\/\*\s*\d+\s*\*\/ => ,
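
The gedit replacement in the last lines of the diff could also be scripted. Below is a minimal sketch in JavaScript, assuming the query output separates result documents with /* n */ markers and that each document is otherwise valid JSON. The function names and the sample text are made up for illustration. Dropping the leading comma and wrapping the result in [ ] are extra steps not in the original note; they are only there so the result parses as a JSON array.

```javascript
// Sketch only: function names and the sample text are hypothetical.

// Same substitution as the gedit note: replace each /* n */ marker with a comma.
function markersToCommas(shellOutput) {
  return shellOutput.replace(/\/\*\s*\d+\s*\*\//g, ",");
}

// Assumption (not in the note): drop the leading comma and wrap in [ ]
// so the result parses as a JSON array.
function toJsonArray(shellOutput) {
  const body = markersToCommas(shellOutput).trim().replace(/^,/, "");
  return "[" + body + "]";
}

// Made-up sample with two result documents, each preceded by a marker.
const sample = [
  "/* 1 */",
  '{ "_id" : "xx", "count" : 2 }',
  "",
  "/* 2 */",
  '{ "_id" : "yy", "count" : 1 }',
].join("\n");

const docs = JSON.parse(toJsonArray(sample));
console.log(docs.length);
```

Once the output is a JSON array, turning it into CSV is a separate step that depends on which fields are wanted.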