Changeset 33824

Timestamp:
13.01.2020 20:14:59 (9 days ago)
Author:
ak19
Message:

More instructions and explaining the contents of the mongodb-data folder.

Files:
1 modified

  • other-projects/maori-lang-detection/hdfs-cc-work/GS_README.TXT

    r33815 r33824  
    898898# Just considering those sites outside NZ or not with .nz TLD: 
    899899 
    900 db.getCollection('Websites').find({$and: [ 
    901                 {geoLocationCountryCode: {$ne: "NZ"}}, 
    902                 {domain: {$not: /\.nz/}}, 
    903                 {numPagesContainingMRI: {$gt: 0}}, 
    904                 {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]} 
    905             ]}).count() 
    906  
    907 221 websites 
    908  
    909 # counts by country code excluding NZ related sites 
    910900db.Websites.aggregate([ 
    911901    { 
     
    931921 
    932922 
     923# counts by country code excluding NZ related sites 
     924 
     925db.getCollection('Websites').find({$and: [ 
     926                {geoLocationCountryCode: {$ne: "NZ"}}, 
     927                {domain: {$not: /\.nz/}}, 
     928                {numPagesContainingMRI: {$gt: 0}}, 
     929                {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]} 
     930            ]}).count() 
     931 
     932221 websites 
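
For readers without a MongoDB instance to hand, the filter above can be approximated in plain JavaScript. This is only a sketch: the sample documents and the helper name isNonNzCandidate are invented for illustration, and only the four field names are taken from the query itself.

```javascript
// Hypothetical sample documents; only the field names come from the query above.
const sites = [
  { domain: "example.com",    geoLocationCountryCode: "US", numPagesContainingMRI: 2, urlContainsLangCodeInPath: false },
  { domain: "example.com.au", geoLocationCountryCode: "AU", numPagesContainingMRI: 5, urlContainsLangCodeInPath: true },
  { domain: "example.co.nz",  geoLocationCountryCode: "US", numPagesContainingMRI: 4, urlContainsLangCodeInPath: false },
  { domain: "example.de",     geoLocationCountryCode: "DE", numPagesContainingMRI: 1, urlContainsLangCodeInPath: true },
  { domain: "example.net",    geoLocationCountryCode: "NZ", numPagesContainingMRI: 7, urlContainsLangCodeInPath: false },
  { domain: "example.fr",     geoLocationCountryCode: "FR", numPagesContainingMRI: 0, urlContainsLangCodeInPath: false },
];

// Mirrors the four $and clauses of the find() query, in the same order.
function isNonNzCandidate(site) {
  return site.geoLocationCountryCode !== "NZ"      // {$ne: "NZ"}
    && !/\.nz/.test(site.domain)                   // {$not: /\.nz/}
    && site.numPagesContainingMRI > 0              // {$gt: 0}
    && (site.geoLocationCountryCode === "AU"       // $or: AU sites are kept either way
        || site.urlContainsLangCodeInPath === false);
}

const selected = sites.filter(isNonNzCandidate);
console.log(selected.map((s) => s.domain));
```

Note that the regex /\.nz/ is unanchored, so it matches ".nz" anywhere in the domain and not only as the final TLD. Anchoring it (for example /\.nz$/) would be stricter, but might change the count reported above.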
     933 
     934 
    933935# But to produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with .nz tld): 
    934936db.getCollection('Websites').find({$and: [ 
     
    961963    { $sort : { count : -1} } 
    962964]); 
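
As a rough illustration of what a counts-by-country aggregate ending in a descending sort on count computes, here is a plain JavaScript sketch. The grouping key geoLocationCountryCode is an assumption based on the queries in this file, and the helper name countByCountry and the sample data are hypothetical.

```javascript
// Groups site documents by country code and sorts by count, highest first.
// Assumption: the grouping key is geoLocationCountryCode.
function countByCountry(siteDocs) {
  const tally = new Map();
  for (const site of siteDocs) {
    const code = site.geoLocationCountryCode;
    tally.set(code, (tally.get(code) || 0) + 1);
  }
  return [...tally.entries()]
    .map(([countryCode, count]) => ({ _id: countryCode, count }))
    .sort((a, b) => b.count - a.count); // same ordering as { $sort : { count : -1 } }
}

// Hypothetical sample input.
const sampleSites = [
  { geoLocationCountryCode: "US" },
  { geoLocationCountryCode: "NZ" },
  { geoLocationCountryCode: "NZ" },
  { geoLocationCountryCode: "AU" },
  { geoLocationCountryCode: "NZ" },
  { geoLocationCountryCode: "US" },
];

const countryCounts = countByCountry(sampleSites);
console.log(countryCounts);
```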
     965 
     966 
     967# Manually inspected the shortlist of 221 non-NZ websites to weed out those that aren't genuinely MRI (sites misdetected as MRI, autotranslated, or containing only placenames, etc.), then added the 176 NZ sites on top:
     968 
     969MANUAL - TOTAL NUM SITES WITH SOME MRI CONTENT BY COUNTRY: 
     970NZ: 176 
     971US: 25 
     972AU: 3 
     973DE: 2 
     974DK: 2 
     975BG: 1 
     976CZ: 1 
     977ES: 1 
     978FR: 1 
     979IE: 1 
     980TOTAL: 213 
     981 
     982Manually created a counts.json file for the above, named "6counts_nonProductSites1_manualShortlist.json".
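
Since these counts were compiled by hand, a quick arithmetic check may be useful. The sketch below simply sums the figures listed above. The object layout is an assumption made for illustration and is not necessarily the structure of the actual counts json file.

```javascript
// Manual shortlist counts, copied from the list above.
const manualCounts = {
  NZ: 176, US: 25, AU: 3, DE: 2, DK: 2,
  BG: 1, CZ: 1, ES: 1, FR: 1, IE: 1,
};

// Sum over all countries, and over the non-NZ countries only.
const total = Object.values(manualCounts).reduce((sum, n) => sum + n, 0);
const nonNzTotal = total - manualCounts.NZ;

console.log(total);      // 213, matching the TOTAL line
console.log(nonNzTotal); // 37 of the 221 non-NZ candidates survived manual inspection
```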
     983 
     984-------------------------------------------------------- 
     985APPENDIX: Legend of mongodb-data folder's contents  
     986-------------------------------------------------------- 
     9871. allCrawledSites: all sites from CommonCrawl where the content-language=MRI, which we then crawled with Nutch with depth=10. Some obvious auto-translated websites were skipped. 
     988 
     9892. sitesWithPagesInMRI: those sites from point 1 above that contained one or more pages for which OpenNLP detected MRI as the primary language 
     990 
     9913. sitesWithPagesContainingMRI.json: those sites from point 1 in which one or more pages contained at least one "sentence" for which the primary language detected by OpenNLP was MRI 
     992 
     9934. tentativeNonProductSites: sites of point 3 excluding those non-NZ sites that had "mi.*" or "*/mi" in the URL path 
     994 
     9955. tentativeNonProductSites1: similar to point 4, but "NZ sites" in this set are not just those detected as originating in NZ (hosted on NZ servers?) but also any with a TLD of .nz, regardless of the site's country of origin. 
     996 
     9976. nonProductSites1_manualShortlist: based on point 5, but with all the non-NZ sites manually inspected to remove any that were not actually sources of MRI content. Examples: sites whose content was in a different language misdetected by OpenNLP (and by CommonCrawl's language detection) as MRI, further sites that were autotranslated, and sites where the detected "MRI" content consisted of photo captions containing NZ placenames, which made up the "sentence(s)" detected as MRI. 
     998 
     999 
     1000a. All .json files that contain the "counts_" prefix are the counts by country code for each of the above variants. The comments section at the top of each such *counts_*.json file usually contains the mongodb query used to generate the json content of the file. 
     1001 
     1002b. All .json files that contain the "geojson-features_" or "multipoint_" prefix, for each of the above variants, are generated by running org/greenstone/atea/CountryCodeCountsMapData.java on the corresponding *counts_*.json file. 
     1003 
     1004Run as: 
     1005    cd maori-lang-detection/src 
     1006    java -cp ".:../conf:../lib/*" org/greenstone/atea/CountryCodeCountsMapData ../mongodb-data/[1-6]counts*.json 
     1007 
     1008This will then generate the *multipoint_*.json and *geojson-features_*.json files for any of the above 1-6 variants of the input counts json file. 
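
For orientation only, the sketch below shows how per-country counts could be turned into a GeoJSON FeatureCollection of the kind a map tool can display. It is not the implementation or the output format of CountryCodeCountsMapData.java: the function name toFeatureCollection, the property names, and the rough [longitude, latitude] placeholder coordinates are all assumptions made for illustration.

```javascript
// Rough placeholder [longitude, latitude] pairs, for illustration only.
const approxCentroids = {
  NZ: [174.0, -41.0],
  US: [-98.0, 39.0],
  AU: [134.0, -25.0],
};

// Builds one GeoJSON Point feature per country code that has a known centroid.
function toFeatureCollection(counts, centroids) {
  const features = Object.entries(counts)
    .filter(([code]) => code in centroids)
    .map(([code, count]) => ({
      type: "Feature",
      geometry: { type: "Point", coordinates: centroids[code] },
      properties: { countryCode: code, count }, // hypothetical property names
    }));
  return { type: "FeatureCollection", features };
}

const featureCollection = toFeatureCollection({ NZ: 176, US: 25, AU: 3 }, approxCentroids);
console.log(JSON.stringify(featureCollection, null, 2));
```

GeoJSON coordinates are ordered longitude first, then latitude, which is an easy detail to get backwards when checking the generated files by eye.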
     1009 
     1010c. All .png files that contain the "map_" prefix for each of the above variants were screenshots of the map generated by http://geojson.tools/ for each *geojson-features_*.json file. 
     1011GIMP was used to crop each screenshot to the area of interest. 
     1012 
    9631013 
    9641014--------------------------------------------------------