Changeset 33806 for other-projects


Timestamp:
2019-12-13T21:31:11+13:00
Author:
ak19
Message:

Further MongoDB querying revealed that excluding tentative product sites (sites that have /mi in the URL path and originate from outside NZ) from the sites with numPagesCONTAININGMRI > 0 gives a result barely different from querying numPagesCONTAININGMRI > 0 alone. Sadly, a brief check of the domains in both result sets still turned up several auto-translated sites. So the test excluding tentativeProductSites should perhaps be repeated with numPagesINMRI > 0, to see whether that test can better discriminate between auto-translated sites and sites with proper Maori-language web pages.

Location:
other-projects/maori-lang-detection
Files:
4 added
1 edited

  • other-projects/maori-lang-detection/MoreReading/mongodb.txt

    r33804 r33806  
    (Related work for other languages to quantifiably answer that)

    data-preparation
    docs

    ------------------------------------------

    BUILDING TOWARDS NEW MONGODB QUERY: Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
    ---

    # https://stackoverflow.com/questions/16902930/mongodb-aggregation-framework-match-or
    # https://docs.mongodb.com/manual/reference/operator/query/and/

    # 1. All the websites that are from NZ:
    db.getCollection('Websites').find({geoLocationCountryCode: "NZ"}).count()
    128

    # 2. All the websites that have /mi in the URL path and are from NZ:
    db.getCollection('Websites').find({$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}).count()
    6

    # 3. All the websites that don't have /mi in the URL path:
    db.getCollection('Websites').find({urlContainsLangCodeInPath: false}).count()
    1292

    # 4. All the websites that don't have /mi in the URL path or, if they do, are from NZ
    # (should be the sum of points 2 and 3 above: 6 + 1292 = 1298)
    db.getCollection('Websites').find({$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}).count()
    1298

    # 5. All the websites that have at least 1 page detected as MRI AND either don't have /mi in the URL path or, if they do, are from NZ
    # These are the TENTATIVE NON-PRODUCT SITES
    # Should be no more than point 4, since this only adds a condition to that query
    db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}).count()
    859
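
The filter in point 5 can also be read as a plain predicate, which suggests a shorter form. The sketch below is illustrative only: `isTentativeNonProductSite`, `simplified`, `simplifiedFilter` and the sample documents are hypothetical, and the equivalence assumes `urlContainsLangCodeInPath` is always stored as a boolean (a document missing that field would be treated differently by the two MongoDB filters).

```javascript
// Plain-JavaScript reading of the point 5 filter (a sketch, not the stored query).
// Assumption: urlContainsLangCodeInPath is always true or false, never missing.
function isTentativeNonProductSite(site) {
  const hasMri = site.numPagesContainingMRI > 0;
  const noMiInPath = site.urlContainsLangCodeInPath === false;
  const miInPathAndNz =
    site.urlContainsLangCodeInPath === true && site.geoLocationCountryCode === "NZ";
  return hasMri && (noMiInPath || miInPathAndNz);
}

// Under the boolean assumption, (!mi || (mi && nz)) reduces to (!mi || nz),
// so the filter could be shortened to this shape:
const simplifiedFilter = {
  numPagesContainingMRI: { $gt: 0 },
  $or: [{ urlContainsLangCodeInPath: false }, { geoLocationCountryCode: "NZ" }],
};

// The same simplification as a predicate, to compare against the original form.
function simplified(site) {
  return (
    site.numPagesContainingMRI > 0 &&
    (!site.urlContainsLangCodeInPath || site.geoLocationCountryCode === "NZ")
  );
}

// Hypothetical sample documents covering every combination of the three fields.
const samples = [];
for (const pages of [0, 3]) {
  for (const mi of [true, false]) {
    for (const cc of ["NZ", "US"]) {
      samples.push({
        numPagesContainingMRI: pages,
        urlContainsLangCodeInPath: mi,
        geoLocationCountryCode: cc,
      });
    }
  }
}

// True when both predicates give the same answer on every sample.
const agree = samples.every((s) => isTentativeNonProductSite(s) === simplified(s));
console.log(agree);
```

Whether the shorter filter is safe on the real collection depends on that boolean assumption, so the original nested form remains the conservative choice.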

    # 6. Now do the counts by country code of the above, by pasting the query of point 5 as the $match clause (i.e. without the .count() suffix)
    # Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
    db.Websites.aggregate([
        {
            $match: {$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}
        },
        { $unwind: "$geoLocationCountryCode" },
        {
            $group: {
                _id: {$toLower: '$geoLocationCountryCode'},
                count: { $sum: 1 }
            }
        },
        { $sort : { count : -1} }
    ]);
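
Because the same grouping stages get reused with different `$match` clauses, the pipeline could be produced by a small helper. This is a minimal sketch assuming the field names used above; `countByCountryPipeline` and `tentativeNonProductSites` are hypothetical names, and the `$unwind` stage is kept exactly as in the original pipeline.

```javascript
// Build the "count by country code" pipeline for any $match filter.
// countByCountryPipeline is a hypothetical helper name, not part of MongoDB.
function countByCountryPipeline(matchFilter) {
  return [
    { $match: matchFilter },
    { $unwind: "$geoLocationCountryCode" },
    {
      $group: {
        _id: { $toLower: "$geoLocationCountryCode" },
        count: { $sum: 1 },
      },
    },
    { $sort: { count: -1 } },
  ];
}

// The point 5 filter (tentative non-product sites with MRI content).
const tentativeNonProductSites = {
  $and: [
    { numPagesContainingMRI: { $gt: 0 } },
    {
      $or: [
        { urlContainsLangCodeInPath: false },
        { $and: [{ urlContainsLangCodeInPath: true }, { geoLocationCountryCode: "NZ" }] },
      ],
    },
  ],
};

const pipeline = countByCountryPipeline(tentativeNonProductSites);
console.log(JSON.stringify(pipeline, null, 2));
```

In the shell, the resulting array would be passed to `db.Websites.aggregate(pipeline)`.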

    The result is very close to that of the same aggregate on numPagesContainingMRI > 0 alone.

    That's because very few websites both have /mi in the URL path AND have numPagesContainingMRI > 0:

    db.Websites.aggregate([
        {
            $match: {
                $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]
            }
        },
        { $unwind: "$geoLocationCountryCode" },
        {
            $group: {
                _id: {$toLower: '$geoLocationCountryCode'},
                count: { $sum: 1 }
            }
        },
        { $sort : { count : -1} }
    ]);


    _id     count
    us      4.0
    nz      4.0
    au      3.0
    ru      1.0
    de      1.0

    Total: 13 sites that have /mi in the URL path and are detected as having MRI content:
    db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()
    13

    Of these 13, the 4 from NZ were already included in steps 5 and 6. So the difference is only the 9 non-NZ sites that have /mi in the URL path.
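
The totals can be cross-checked from the per-country table. A small sketch using the counts listed above; the variable names are illustrative.

```javascript
// Per-country counts copied from the aggregate result above.
const counts = { us: 4, nz: 4, au: 3, ru: 1, de: 1 };

// Total number of sites with /mi in the URL path and MRI content.
const total = Object.values(counts).reduce((sum, n) => sum + n, 0);

// Sites among them that are not geolocated in NZ, i.e. the ones that
// the tentative non-product filter of point 5 excludes.
const nonNz = total - counts.nz;

console.log(total, nonNz);
```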


    Let's get a listing of the sites' domains. 3 sites whose country codes are NOT NZ have an NZ TLD!
    /* 1 */
    {
        "_id" : "nz",
        "count" : 4.0,
        "domain" : [
            "http://firstworldwar.tki.org.nz",
            "http://www.firstworldwar.tki.org.nz",
            "https://admin.teara.govt.nz",
            "http://community.nzdl.org"
        ]
    }

    /* 2 */
    {
        "_id" : "us",
        "count" : 4.0,
        "domain" : [
            "https://sexualviolence.victimsinfo.govt.nz",
            "https://follow3rs.com",
            "http://www.church-of-christ.org",
            "http://www.mytrickstips.com"
        ]
    }

    /* 3 */
    {
        "_id" : "au",
        "count" : 3.0,
        "domain" : [
            "https://rapuatearatika.education.govt.nz",
            "https://www.kiwiproperty.com",
            "https://curriculumtool.education.govt.nz"
        ]
    }

    /* 4 */
    {
        "_id" : "ru",
        "count" : 1.0,
        "domain" : [
            "http://www.treningmozga.com"
        ]
    }

    /* 5 */
    {
        "_id" : "de",
        "count" : 1.0,
        "domain" : [
            "http://www.almancax.com" # Website to learn German, autotranslated
        ]
    }
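
The mismatch between geolocated country and top-level domain can be picked out mechanically. This is a minimal sketch over the listing above, assuming a runtime that provides the standard `URL` class (for example Node.js); `listing` and `nzTldOutsideNz` are hypothetical names.

```javascript
// The grouped result above, copied as plain data.
const listing = [
  { _id: "nz", domain: [
    "http://firstworldwar.tki.org.nz",
    "http://www.firstworldwar.tki.org.nz",
    "https://admin.teara.govt.nz",
    "http://community.nzdl.org",
  ] },
  { _id: "us", domain: [
    "https://sexualviolence.victimsinfo.govt.nz",
    "https://follow3rs.com",
    "http://www.church-of-christ.org",
    "http://www.mytrickstips.com",
  ] },
  { _id: "au", domain: [
    "https://rapuatearatika.education.govt.nz",
    "https://www.kiwiproperty.com",
    "https://curriculumtool.education.govt.nz",
  ] },
  { _id: "ru", domain: ["http://www.treningmozga.com"] },
  { _id: "de", domain: ["http://www.almancax.com"] },
];

// Collect sites geolocated outside NZ whose hostname ends in the .nz TLD.
function nzTldOutsideNz(groups) {
  const hits = [];
  for (const group of groups) {
    if (group._id === "nz") continue; // geolocated in NZ, nothing to flag
    for (const site of group.domain) {
      if (new URL(site).hostname.endsWith(".nz")) {
        hits.push({ country: group._id, site: site });
      }
    }
  }
  return hits;
}

const hits = nzTldOutsideNz(listing);
console.log(hits);
```

A check like this only looks at the TLD, so it says nothing about whether the remaining non-NZ sites are auto-translated.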


    But we're not catching a potentially large number of auto-translated sites, like
    - https://www.gigalight.com/all-languages.html
    - http://www.hzhinew.com/


    --------------
    GETTING TABLE DATA OUT OF MONGO DB:

    https://stackoverflow.com/questions/28733692/how-to-export-json-from-mongodb-using-robomongo
    "export to file" as in a spreadsheet like to a .csv?

    IMO this is the EASIEST way to do this in Robo 3T (formerly robomongo):

     1.   In the top right of the Robo 3T GUI there is a "View Results in text mode" button, click it and copy everything

     2.   paste everything into this website: https://json-csv.com/

     3.   click the download button and now you have it in a spreadsheet.


    https://json-csv.com/
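
As an alternative to pasting results into a website, flat result documents can be turned into CSV with a few lines of JavaScript. This is a sketch only: `toCsv` is a hypothetical helper, it assumes flat documents (no nested objects or arrays), and it simply quotes every field rather than implementing a full CSV dialect.

```javascript
// Convert an array of flat objects to CSV text, one column per requested key.
function toCsv(rows, columns) {
  // Quote every value and double any embedded quotes.
  const quote = (value) => {
    const text = value === undefined || value === null ? "" : String(value);
    return '"' + text.replace(/"/g, '""') + '"';
  };
  const header = columns.map(quote).join(",");
  const lines = rows.map((row) => columns.map((c) => quote(row[c])).join(","));
  return [header].concat(lines).join("\n");
}

// The per-country counts from the aggregate result earlier in these notes.
const rows = [
  { _id: "us", count: 4 },
  { _id: "nz", count: 4 },
  { _id: "au", count: 3 },
  { _id: "ru", count: 1 },
  { _id: "de", count: 1 },
];

const csv = toCsv(rows, ["_id", "count"]);
console.log(csv);
```

Results containing arrays, such as the `domain` lists, would need flattening first.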


    ---------------------

    /* 1 */