Changeset 33806 for other-projects

Timestamp:

2019-12-13T21:31:11+13:00

Author:

ak19

Message:

More MongoDB querying revealed that excluding tentative product sites (a site that has /mi in its URL path and originates from outside NZ) from the sites with numPagesCONTAININGMRI > 0 gives a result barely different from just querying numPagesCONTAININGMRI > 0. Sadly, a brief check of the domains in both result sets still turned up several auto-translated results. So maybe the test excluding tentativeProductSites should be repeated with numPagesINMRI > 0, to see whether that test can better discriminate between auto-translated sites and sites with proper Maori-language webpages.

Location:

other-projects/maori-lang-detection

Files:

4 added
1 edited

MoreReading/mongodb.txt (modified) (1 diff)
mongodb-data/counts_tentativeNonProductSites.json (added)
mongodb-data/geojson-features_tentativeNonProductSites.json (added)
mongodb-data/map_tentativeNonProductSites.png (added)
mongodb-data/multipoint_tentativeNonProductSites.json (added)

other-projects/maori-lang-detection/MoreReading/mongodb.txt

-              r33804
+              r33806
 (Related work for other languages to quantifiably answer that)
 data-preparation
 docs
+------------------------------------------
+BUILDING TOWARDS NEW MONGODB QUERY: Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
+---
+# https://stackoverflow.com/questions/16902930/mongodb-aggregation-framework-match-or
+# https://docs.mongodb.com/manual/reference/operator/query/and/
+# 1. all the websites which are from NZ:
+db.getCollection('Websites').find({geoLocationCountryCode: "NZ"}).count()
+# 2. all the websites that have /mi in URL path which are from NZ:
+db.getCollection('Websites').find({$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}).count()
+# 3. all the websites that don't have /mi in URL path
+db.getCollection('Websites').find({urlContainsLangCodeInPath: false}).count()
+# 4. all the websites that don't have /mi, or if they do are from NZ
+# (should be the sum of points 2 and 3 above)
+db.getCollection('Websites').find({$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}).count()
+# 5. All the websites that have at least 1 page detected as MRI AND either don't have /mi in URL path or, if they do, are from NZ
+# These are the TENTATIVE NON-PRODUCT SITES
+# Should be less than the point 4, but more than 1 to 3
+db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}).count()
+# 6. Now do the counts by country code of the above, by pasting the query of point 5 as the $match clause (i.e. without the .count() suffix)
+# Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
+db.Websites.aggregate([
+    {
+        $match: {$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}
+    },
+    { $unwind: "$geoLocationCountryCode" },
+    {
+        $group: {
+            _id: {$toLower: '$geoLocationCountryCode'},
+            count: { $sum: 1 }
+        }
+    },
+    { $sort : { count : -1} }
+]);
+The result is very close to the same aggregate on just numPagesContainingMRI.
+That's because very few websites both contain /mi/ in the URL path AND have numPagesContainingMRI > 0:
+db.Websites.aggregate([
+    {
+        $match: {
+            $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]
+        }
+    },
+    { $unwind: "$geoLocationCountryCode" },
+    {
+        $group: {
+            _id: {$toLower: '$geoLocationCountryCode'},
+            count: { $sum: 1 }
+        }
+    },
+    { $sort : { count : -1} }
+]);
+_id count
+us      4.0
+nz      4.0
+au      3.0
+ru      1.0
+de      1.0
+Total: 13 sites that have /mi/ and are detected as having MRI content, as confirmed by:
+db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()
+Of these 13, the 4 from NZ were already included in steps 5 and 6. So the difference is only 9 sites (13 - 4) that have /mi/ in the URL path.
+Let's get a listing of the sites' domains. 3 sites whose country codes are NOT NZ have an .nz TLD!
+/* 1 */
+{
+    "_id" : "nz",
+    "count" : 4.0,
+    "domain" : [
+        "http://firstworldwar.tki.org.nz",
+        "http://www.firstworldwar.tki.org.nz",
+        "https://admin.teara.govt.nz",
+        "http://community.nzdl.org"
+    ]
+}
+/* 2 */
+{
+    "_id" : "us",
+    "count" : 4.0,
+    "domain" : [
+        "https://sexualviolence.victimsinfo.govt.nz",
+        "https://follow3rs.com",
+        "http://www.church-of-christ.org",
+        "http://www.mytrickstips.com"
+    ]
+}
+/* 3 */
+{
+    "_id" : "au",
+    "count" : 3.0,
+    "domain" : [
+        "https://rapuatearatika.education.govt.nz",
+        "https://www.kiwiproperty.com",
+        "https://curriculumtool.education.govt.nz"
+    ]
+}
+/* 4 */
+{
+    "_id" : "ru",
+    "count" : 1.0,
+    "domain" : [
+        "http://www.treningmozga.com"
+    ]
+}
+/* 5 */
+{
+    "_id" : "de",
+    "count" : 1.0,
+    "domain" : [
+        "http://www.almancax.com" # Website to learn German, autotranslated
+    ]
+}
+But we're not catching a potentially large number of auto-translated sites, like
+- https://www.gigalight.com/all-languages.html
+- http://www.hzhinew.com/
+--------------
+GETTING TABLE DATA OUT OF MONGO DB:
+https://stackoverflow.com/questions/28733692/how-to-export-json-from-mongodb-using-robomongo
+"export to file" as in a spreadsheet like to a .csv?
+IMO this is the EASIEST way to do this in Robo 3T (formerly robomongo):
+.   In the top right of the Robo 3T GUI there is a "View Results in text mode" button, click it and copy everything
+.   paste everything into this website: https://json-csv.com/
+.   click the download button and now you have it in a spreadsheet.
+https://json-csv.com/
+---------------------
 /* 1 */
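
The filter and grouping logic of steps 5 and 6 in the diff above can be checked outside the database with a small plain-JavaScript sketch. It is only an illustration: the sample documents and the function names isTentativeNonProductSite and countsByCountry are hypothetical, and only the field names are taken from the queries in mongodb.txt. The sketch also collects each group's domains, as in the per-country listing in the diff. The changeset does not show the query that produced that listing, so treating it as a $group stage with an added accumulator such as $push is an assumption.

```javascript
// Hypothetical sample documents: field names follow the queries in mongodb.txt,
// the values are made up for illustration only.
const websites = [
  { domain: "http://a.example.nz",  geoLocationCountryCode: "NZ", urlContainsLangCodeInPath: true,  numPagesContainingMRI: 3 },
  { domain: "http://b.example.nz",  geoLocationCountryCode: "NZ", urlContainsLangCodeInPath: false, numPagesContainingMRI: 1 },
  { domain: "http://c.example.com", geoLocationCountryCode: "US", urlContainsLangCodeInPath: true,  numPagesContainingMRI: 5 },
  { domain: "http://d.example.com", geoLocationCountryCode: "US", urlContainsLangCodeInPath: false, numPagesContainingMRI: 2 },
  { domain: "http://e.example.de",  geoLocationCountryCode: "DE", urlContainsLangCodeInPath: false, numPagesContainingMRI: 0 },
];

// Mirrors the step 5 filter: at least one page detected as MRI AND
// (no /mi in the URL path OR /mi in the URL path and located in NZ).
function isTentativeNonProductSite(site) {
  const hasMriPages = site.numPagesContainingMRI > 0;
  const noLangCode = site.urlContainsLangCodeInPath === false;
  const langCodeFromNz =
    site.urlContainsLangCodeInPath === true && site.geoLocationCountryCode === "NZ";
  return hasMriPages && (noLangCode || langCodeFromNz);
}

// Mirrors the step 6 pipeline: group by lowercased country code, count the
// sites, and sort by count descending. Collecting the domains per group is
// the assumed extension described in the lead-in.
function countsByCountry(sites) {
  const groups = new Map();
  for (const site of sites.filter(isTentativeNonProductSite)) {
    const id = site.geoLocationCountryCode.toLowerCase();
    const group = groups.get(id) || { _id: id, count: 0, domain: [] };
    group.count += 1;
    group.domain.push(site.domain);
    groups.set(id, group);
  }
  return [...groups.values()].sort((a, b) => b.count - a.count);
}

const result = countsByCountry(websites);
console.log(JSON.stringify(result, null, 2));
```

In this sample, the US site with /mi in its path is excluded as a tentative product site, and the site with no MRI pages is excluded by the numPagesContainingMRI condition. A set of documents with known membership like this can help confirm that the nested $and/$or conditions select the intended sites before the query is run on the real collection.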
