Timestamp:
2020-01-24T21:44:04+13:00
Author:
ak19
Message:
  1. Added the file containing the 255 random NZ page URLs to sample.
  2. Minor updates to 2 existing counts files.
  3. Recorded the isMRI aggregate command used for selecting NZ domains to sample from; for NZ sites, containsMRI was not used to generate samples.
File:
1 edited

  • other-projects/maori-lang-detection/mongodb-data/6table_nonProductSites1_manualShortlist.json

    r33868 r33872  
    @@ -1,3 +1,26 @@
     /*
    +
    +db.Websites.aggregate([
    +    {
    +        $match: {
    +            $and: [
    +                {numPagesInMRI: {$gt: 0}},
    +                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
    +            ]
    +        }
    +    },
    +    { $unwind: "$geoLocationCountryCode" },
    +    {
    +        $group: {
    +            _id: "nz",
    +            count: { $sum: 1 },
    +            domain: { $addToSet: '$domain' },
    +            numPagesInMRICount: { $sum: '$numPagesInMRI' },
    +            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
    +        }
    +    },
    +    { $sort : { count : -1} }
    +]);
    +
     For sites originating in NZ or with nz TLD, none of the URLs are manually inspected and all URLs are accepted.

     
    @@ -118,5 +141,5 @@


    -OR is this better:
    +OR is this better (only numPagesINMRI):

     db.Websites.aggregate([
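
To make the effect of the `$match` and `$group` stages recorded in this changeset easier to follow, here is a minimal plain-JavaScript sketch that mimics them on an in-memory array. It is not the project's code and does not talk to MongoDB: the function name `summariseNzSites` and the sample documents are hypothetical, only the field names are taken from the aggregate command. It assumes `geoLocationCountryCode` is always a single string, in which case the `$unwind` stage leaves documents unchanged, and it omits the final `$sort`, which has nothing to reorder when there is only one group.

```javascript
// Hypothetical sample documents. Field names follow the recorded pipeline;
// the values are invented purely for illustration.
const sampleWebsites = [
  { domain: "example.co.nz", geoLocationCountryCode: "US", numPagesInMRI: 4, numPagesContainingMRI: 6 },
  { domain: "example.org", geoLocationCountryCode: "NZ", numPagesInMRI: 2, numPagesContainingMRI: 2 },
  { domain: "example.com", geoLocationCountryCode: "US", numPagesInMRI: 9, numPagesContainingMRI: 9 },
  { domain: "example.net.nz", geoLocationCountryCode: "NZ", numPagesInMRI: 0, numPagesContainingMRI: 3 },
];

function summariseNzSites(websites) {
  // Mirrors $match: at least one page in MRI, and either an NZ geolocation
  // or ".nz" somewhere in the domain (same unanchored regex as the pipeline).
  const matched = websites.filter(
    (site) =>
      site.numPagesInMRI > 0 &&
      (site.geoLocationCountryCode === "NZ" || /\.nz/.test(site.domain))
  );

  // With no matching input, a $group stage emits no document at all.
  if (matched.length === 0) return null;

  // Mirrors $group with the constant _id "nz": all matches collapse into one summary.
  const summary = {
    _id: "nz",
    count: 0,
    domain: [],
    numPagesInMRICount: 0,
    numPagesContainingMRICount: 0,
  };
  for (const site of matched) {
    summary.count += 1; // { $sum: 1 }
    if (!summary.domain.includes(site.domain)) summary.domain.push(site.domain); // $addToSet
    summary.numPagesInMRICount += site.numPagesInMRI;
    summary.numPagesContainingMRICount += site.numPagesContainingMRI;
  }
  return summary;
}

const nzSummary = summariseNzSites(sampleWebsites);
console.log(nzSummary);
```

Note that the pattern `/\.nz/` is unanchored, so it matches `.nz` anywhere in the domain string, not only at the end. An anchored pattern such as `/\.nz$/` would restrict the match to the suffix; whether that distinction matters for this dataset is not something the changeset states.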