Changeset 33872 for other-projects
- Timestamp: 2020-01-24T21:44:04+13:00
- Location: other-projects/maori-lang-detection/mongodb-data
- Files: 1 added, 4 edited
other-projects/maori-lang-detection/mongodb-data/4counts_tentativeNonProductSites.json
--- r33823
+++ r33872
@@ -1,5 +1,5 @@
 /*
 
-The websites that have some MRI detected AND which are either in NZ or with NZ TLD
+All the websites that have some MRI detected AND which are either in NZ or with NZ TLD
 or (so if they're from overseas) don't contain /mi or mi.* in URL path.
 We'll include Australia, to get the valid "kiwiproperty.com" website,
other-projects/maori-lang-detection/mongodb-data/5counts_tentativeNonProductSites1.json
--- r33823
+++ r33872
@@ -1,5 +1,5 @@
 /*
 
-The websites that have some MRI detected AND which are either in NZ or with NZ TLD
+All the websites that have some MRI detected AND which are either in NZ or with NZ TLD
 or (so if they're from overseas) don't contain /mi or mi.* in URL path.
 We'll include Australia, to get the valid "kiwiproperty.com" website,
other-projects/maori-lang-detection/mongodb-data/6table_nonProductSites1_manualShortlist.json
--- r33868
+++ r33872
@@ -1,3 +1,26 @@
 /*
+
+db.Websites.aggregate([
+    {
+        $match: {
+            $and: [
+                {numPagesInMRI: {$gt: 0}},
+                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
+            ]
+        }
+    },
+    { $unwind: "$geoLocationCountryCode" },
+    {
+        $group: {
+            _id: "nz",
+            count: { $sum: 1 },
+            domain: { $addToSet: '$domain' },
+            numPagesInMRICount: { $sum: '$numPagesInMRI' },
+            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
+        }
+    },
+    { $sort : { count : -1} }
+]);
+
 For sites originating in NZ or with nz TLD, none of the URLs are manually inspected and all URLs are accepted.
 
@@ -118,5 +141,5 @@
 
 
-OR is this better :
+OR is this better (only numPagesINMRI):
 
 db.Websites.aggregate([
other-projects/maori-lang-detection/mongodb-data/tables.txt
--- r33848
+++ r33872
@@ -157,4 +157,5 @@
 ]);
 
+
 NZ:
 db.Websites.aggregate([
@@ -179,2 +180,26 @@
     { $sort : { count : -1} }
 ]);
+
+
+BETTER, numPagesINMRI rather than containingMRI:
+db.Websites.aggregate([
+    {
+        $match: {
+            $and: [
+                {numPagesInMRI: {$gt: 0}},
+                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
+            ]
+        }
+    },
+    { $unwind: "$geoLocationCountryCode" },
+    {
+        $group: {
+            _id: "nz",
+            count: { $sum: 1 },
+            domain: { $addToSet: '$domain' },
+            numPagesInMRICount: { $sum: '$numPagesInMRI' },
+            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
+        }
+    },
+    { $sort : { count : -1} }
+]);
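
Reviewer's sketch (not part of the changeset): the NZ aggregation added above can be hard to read stage by stage, so the following plain JavaScript emulates what it appears to compute on a handful of in-memory documents. The sample documents and the function name summariseNzSites are hypothetical. It also assumes geoLocationCountryCode is a scalar string, in which case $unwind passes each document through unchanged but, by default, drops documents where the field is missing or null. Note also that /\.nz/ is unanchored, so it matches ".nz" anywhere in the domain, not only as the TLD.

```javascript
// Hypothetical sample documents: field names follow the query above,
// all values are invented for illustration.
const websites = [
  { domain: "example.co.nz",  geoLocationCountryCode: "NZ", numPagesInMRI: 3, numPagesContainingMRI: 5 },
  { domain: "example.org.nz", geoLocationCountryCode: "US", numPagesInMRI: 2, numPagesContainingMRI: 2 },
  { domain: "example.com",    geoLocationCountryCode: "NZ", numPagesInMRI: 1, numPagesContainingMRI: 4 },
  // Overseas and no ".nz" in the domain: rejected by $match.
  { domain: "example.net",    geoLocationCountryCode: "US", numPagesInMRI: 7, numPagesContainingMRI: 9 },
  // No pages primarily in MRI: rejected by $match.
  { domain: "empty.nz",       geoLocationCountryCode: "NZ", numPagesInMRI: 0, numPagesContainingMRI: 6 },
  // Passes $match via the domain regex, but has no country code,
  // so a default $unwind would drop it.
  { domain: "nogeo.nz",       numPagesInMRI: 4, numPagesContainingMRI: 4 },
];

function summariseNzSites(docs) {
  // $match: some pages in MRI AND (located in NZ OR ".nz" somewhere in the domain).
  const matched = docs.filter(
    (d) => d.numPagesInMRI > 0 &&
           (d.geoLocationCountryCode === "NZ" || /\.nz/.test(d.domain))
  );

  // $unwind on a scalar field: one output document per input document,
  // except that missing/null fields produce no output (default behaviour).
  const unwound = matched.filter((d) => d.geoLocationCountryCode != null);

  // $group with a constant _id yields at most one group; none if no input.
  if (unwound.length === 0) {
    return [];
  }
  const sum = (field) => unwound.reduce((acc, d) => acc + (d[field] || 0), 0);
  return [{
    _id: "nz",
    count: unwound.length,                              // { $sum: 1 }
    domain: [...new Set(unwound.map((d) => d.domain))], // $addToSet
    numPagesInMRICount: sum("numPagesInMRI"),
    numPagesContainingMRICount: sum("numPagesContainingMRI"),
  }];
  // $sort by count is a no-op here because there is only one group.
}

const nzSummary = summariseNzSites(websites);
console.log(JSON.stringify(nzSummary, null, 2));
```

If the real collection contains .nz sites without a geoLocationCountryCode, the count from this pipeline would be lower than the number of documents that pass $match; whether that applies to this data set is not something the changeset shows.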