Changeset 33872

Timestamp:
24.01.2020 21:44:04 (4 weeks ago)
Author:
ak19
Message:

1. Added the file containing the 255 random NZ page URLs to sample.
2. Minor updates to 2 existing counts files.
3. Recorded the isMRI aggregate command used to select the NZ domains to sample from. For NZ sites, containsMRI was not used to generate the samples.

Location:
other-projects/maori-lang-detection/mongodb-data
Files:
1 added
4 modified

  • other-projects/maori-lang-detection/mongodb-data/4counts_tentativeNonProductSites.json

    r33823 r33872  
    @@ -1,5 +1,5 @@
     /*
     
    -The websites that have some MRI detected AND which are either in NZ or with NZ TLD
    +All the websites that have some MRI detected AND which are either in NZ or with NZ TLD
     or (so if they're from overseas) don't contain /mi or mi.* in URL path.
     We'll include Australia, to get the valid "kiwiproperty.com" website,
  • other-projects/maori-lang-detection/mongodb-data/5counts_tentativeNonProductSites1.json

    r33823 r33872  
    @@ -1,5 +1,5 @@
     /*
     
    -The websites that have some MRI detected AND which are either in NZ or with NZ TLD
    +All the websites that have some MRI detected AND which are either in NZ or with NZ TLD
     or (so if they're from overseas) don't contain /mi or mi.* in URL path.
     We'll include Australia, to get the valid "kiwiproperty.com" website,
  • other-projects/maori-lang-detection/mongodb-data/6table_nonProductSites1_manualShortlist.json

    r33868 r33872  
    @@ -1,3 +1,26 @@
     /*
    +
    +db.Websites.aggregate([
    +    {
    +        $match: {
    +            $and: [
    +                {numPagesInMRI: {$gt: 0}},
    +                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
    +            ]
    +        }
    +    },
    +    { $unwind: "$geoLocationCountryCode" },
    +    {
    +        $group: {
    +            _id: "nz",
    +            count: { $sum: 1 },
    +            domain: { $addToSet: '$domain' },
    +            numPagesInMRICount: { $sum: '$numPagesInMRI' },
    +            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
    +        }
    +    },
    +    { $sort : { count : -1} }
    +]);
    +
     For sites originating in NZ or with nz TLD, none of the URLs are manually inspected and all URLs are accepted.
     
    @@ -118,5 +141,5 @@
     
     
    -OR is this better:
    +OR is this better (only numPagesINMRI):
     
     db.Websites.aggregate([
  • other-projects/maori-lang-detection/mongodb-data/tables.txt

    r33848 r33872  
    @@ -157,4 +157,5 @@
     ]);
     
    +
     NZ:
     db.Websites.aggregate([
    @@ -179,2 +180,26 @@
         { $sort : { count : -1} }
     ]);
    +
    +
    +BETTER, numPagesINMRI rather than containingMRI:
    +db.Websites.aggregate([
    +    {
    +        $match: {
    +            $and: [
    +                {numPagesInMRI: {$gt: 0}},
    +                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
    +            ]
    +        }
    +    },
    +    { $unwind: "$geoLocationCountryCode" },
    +    {
    +        $group: {
    +            _id: "nz",
    +            count: { $sum: 1 },
    +            domain: { $addToSet: '$domain' },
    +            numPagesInMRICount: { $sum: '$numPagesInMRI' },
    +            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
    +        }
    +    },
    +    { $sort : { count : -1} }
    +]);
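
To make the effect of the aggregation above easier to follow, here is a minimal in-memory sketch of its $match and $group stages in plain JavaScript. It is not part of the changeset. The sample documents and the names `sampleWebsites` and `summariseNzSites` are invented for illustration, and the $unwind stage is left out on the assumption that geoLocationCountryCode holds a single string per document. The final $sort is also omitted because grouping on the constant _id "nz" yields only one result document.

```javascript
// Hypothetical sample documents; these are NOT taken from the real Websites collection.
const sampleWebsites = [
  { domain: "example.co.nz", geoLocationCountryCode: "NZ", numPagesInMRI: 3, numPagesContainingMRI: 5 },
  { domain: "example.org.nz", geoLocationCountryCode: "US", numPagesInMRI: 2, numPagesContainingMRI: 2 },
  { domain: "example.com", geoLocationCountryCode: "NZ", numPagesInMRI: 0, numPagesContainingMRI: 4 },
  { domain: "example.net", geoLocationCountryCode: "AU", numPagesInMRI: 7, numPagesContainingMRI: 9 },
];

// Hypothetical helper mirroring the $match and $group stages.
function summariseNzSites(websites) {
  // $match: numPagesInMRI > 0 AND (country code is "NZ" OR domain matches /\.nz/)
  const matched = websites.filter(
    (site) =>
      site.numPagesInMRI > 0 &&
      (site.geoLocationCountryCode === "NZ" || /\.nz/.test(site.domain))
  );

  // $group with the constant _id "nz": every matched site lands in one bucket.
  return {
    _id: "nz",
    count: matched.length,                                   // { $sum: 1 }
    domain: [...new Set(matched.map((site) => site.domain))], // { $addToSet: '$domain' }
    numPagesInMRICount: matched.reduce((sum, site) => sum + site.numPagesInMRI, 0),
    numPagesContainingMRICount: matched.reduce((sum, site) => sum + site.numPagesContainingMRI, 0),
  };
}

const summary = summariseNzSites(sampleWebsites);
console.log(summary);

// The pattern /\.nz/ is unanchored, so it matches ".nz" anywhere in the domain.
// An anchored variant such as /\.nz$/ would only match a trailing ".nz".
const unanchoredHit = /\.nz/.test("example.nzsite.com");
const anchoredHit = /\.nz$/.test("example.nzsite.com");
console.log(unanchoredHit, anchoredHit);
```

In this sketch, example.com is excluded because it has no pages in MRI even though it has pages containing MRI, which is the distinction the commit message draws between the two counts. Whether the unanchored domain pattern matters in practice depends on the domains actually present in the collection.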