Changeset 33872 for other-projects


Timestamp:
2020-01-24T21:44:04+13:00
Author:
ak19
Message:
  1. Added the file containing the 255 random NZ page URLs to sample.
  2. Minor updates to 2 existing counts files.
  3. Recorded the isMRI aggregate command used to select the NZ domains to sample from; containsMRI was not used to generate the samples for NZ sites.
Location:
other-projects/maori-lang-detection/mongodb-data
Files:
1 added
4 edited

  • other-projects/maori-lang-detection/mongodb-data/4counts_tentativeNonProductSites.json

    r33823 r33872  
      1   1   /*
      2   2
      3     - The websites that have some MRI detected AND which are either in NZ or with NZ TLD
          3 + All the websites that have some MRI detected AND which are either in NZ or with NZ TLD
      4   4   or (so if they're from overseas) don't contain /mi or mi.* in URL path.
      5   5   We'll include Australia, to get the valid "kiwiproperty.com" website,
  • other-projects/maori-lang-detection/mongodb-data/5counts_tentativeNonProductSites1.json

    r33823 r33872  
      1   1   /*
      2   2
      3     - The websites that have some MRI detected AND which are either in NZ or with NZ TLD
          3 + All the websites that have some MRI detected AND which are either in NZ or with NZ TLD
      4   4   or (so if they're from overseas) don't contain /mi or mi.* in URL path.
      5   5   We'll include Australia, to get the valid "kiwiproperty.com" website,
  • other-projects/maori-lang-detection/mongodb-data/6table_nonProductSites1_manualShortlist.json

    r33868 r33872  
      1   1   /*
          2 +
          3 + db.Websites.aggregate([
          4 +     {
          5 +         $match: {
          6 +             $and: [
          7 +                 {numPagesInMRI: {$gt: 0}},
          8 +                 {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
          9 +             ]
         10 +         }
         11 +     },
         12 +     { $unwind: "$geoLocationCountryCode" },
         13 +     {
         14 +         $group: {
         15 +             _id: "nz",
         16 +             count: { $sum: 1 },
         17 +             domain: { $addToSet: '$domain' },
         18 +             numPagesInMRICount: { $sum: '$numPagesInMRI' },
         19 +             numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
         20 +         }
         21 +     },
         22 +     { $sort : { count : -1} }
         23 + ]);
         24 +
      2  25   For sites originating in NZ or with nz TLD, none of the URLs are manually inspected and all URLs are accepted.
      3  26
      …
    118 141
    119 142
    120     - OR is this better:
        143 + OR is this better (only numPagesINMRI):
    121 144
    122 145   db.Websites.aggregate([
  • other-projects/maori-lang-detection/mongodb-data/tables.txt

    r33848 r33872  
    157 157   ]);
    158 158
        159 +
    159 160   NZ:
    160 161   db.Websites.aggregate([
      …
    179 180       { $sort : { count : -1} }
    180 181   ]);
        182 +
        183 +
        184 + BETTER, numPagesINMRI rather than containingMRI:
        185 + db.Websites.aggregate([
        186 +     {
        187 +         $match: {
        188 +             $and: [
        189 +                 {numPagesInMRI: {$gt: 0}},
        190 +                 {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
        191 +             ]
        192 +         }
        193 +     },
        194 +     { $unwind: "$geoLocationCountryCode" },
        195 +     {
        196 +         $group: {
        197 +             _id: "nz",
        198 +             count: { $sum: 1 },
        199 +             domain: { $addToSet: '$domain' },
        200 +             numPagesInMRICount: { $sum: '$numPagesInMRI' },
        201 +             numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        202 +         }
        203 +     },
        204 +     { $sort : { count : -1} }
        205 + ]);
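
To make the effect of the recorded NZ aggregation easier to follow, here is a minimal sketch in plain JavaScript that mimics its $match, $unwind and $group stages on an in-memory array. It is not the project's code: the helper name summariseNzSites and all sample documents are hypothetical, and only the field names are taken from the pipeline above. The $unwind step assumes the MongoDB 3.2+ behaviour, where a scalar value is treated as a one-element array and documents with a missing or null value are dropped.

```javascript
// Hypothetical sample documents: the field names follow the recorded pipeline,
// but every value here is invented for illustration.
const websites = [
  { domain: "example.org.nz", geoLocationCountryCode: "NZ", numPagesInMRI: 3, numPagesContainingMRI: 5 },
  { domain: "example.com", geoLocationCountryCode: "US", numPagesInMRI: 2, numPagesContainingMRI: 2 },
  { domain: "sample.co.nz", geoLocationCountryCode: "AU", numPagesInMRI: 1, numPagesContainingMRI: 4 },
  { domain: "nomri.nz", geoLocationCountryCode: "NZ", numPagesInMRI: 0, numPagesContainingMRI: 1 },
];

// summariseNzSites is a hypothetical helper name, not part of the repository.
function summariseNzSites(docs) {
  // $match: at least one page in MRI AND (located in NZ OR domain matches /\.nz/).
  const matched = docs.filter(d =>
    d.numPagesInMRI > 0 &&
    (d.geoLocationCountryCode === "NZ" || /\.nz/.test(String(d.domain))));

  // $unwind: one output document per country code; documents whose
  // country code is missing, null or an empty array are dropped.
  const unwound = matched.flatMap(d => {
    const v = d.geoLocationCountryCode;
    if (v === undefined || v === null) return [];
    const codes = Array.isArray(v) ? v : [v];
    return codes.map(code => ({ ...d, geoLocationCountryCode: code }));
  });

  // $group with a constant _id: everything collapses into a single summary.
  // Like the aggregation, return no result at all when nothing matched.
  if (unwound.length === 0) return [];
  const domains = new Set(); // plays the role of $addToSet
  let inMRI = 0;
  let containingMRI = 0;
  for (const d of unwound) {
    domains.add(d.domain);
    inMRI += d.numPagesInMRI || 0;
    containingMRI += d.numPagesContainingMRI || 0;
  }
  return [{
    _id: "nz",
    count: unwound.length,
    domain: [...domains],
    numPagesInMRICount: inMRI,
    numPagesContainingMRICount: containingMRI,
  }];
}

console.log(JSON.stringify(summariseNzSites(websites), null, 2));

// The recorded pattern /\.nz/ is unanchored, so it also matches ".nz" in the
// middle of a domain. "shop.nzexample.com" is a made-up domain for illustration.
const unanchoredHit = /\.nz/.test("shop.nzexample.com");
const anchoredHit = /\.nz$/.test("shop.nzexample.com");
```

Because the group key is the constant "nz", the pipeline yields at most one result document, so the trailing $sort on count has nothing to reorder. If only domains that end in the nz TLD are intended, an anchored pattern such as /\.nz$/ would be one option. Whether the unanchored match matters for the actual data is not something this changeset shows.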