Changeset 33824


Timestamp: 2020-01-13T20:14:59+13:00
Author: ak19
Message: More instructions and explaining the contents of the mongodb-data folder.
File: 1 edited

  • other-projects/maori-lang-detection/hdfs-cc-work/GS_README.TXT

    r33815  r33824
    898     898     # Just considering those sites outside NZ or not with .nz TLD:
    899     899
    900             db.getCollection('Websites').find({$and: [
    901                             {geoLocationCountryCode: {$ne: "NZ"}},
    902                             {domain: {$not: /\.nz/}},
    903                             {numPagesContainingMRI: {$gt: 0}},
    904                             {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
    905                         ]}).count()
    906
    907             221 websites
    908
    909             # counts by country code excluding NZ related sites
    910     900     db.Websites.aggregate([
    911     901         {
     
    931     921
    932     922
            923     # counts by country code excluding NZ related sites
            924
            925     db.getCollection('Websites').find({$and: [
            926                     {geoLocationCountryCode: {$ne: "NZ"}},
            927                     {domain: {$not: /\.nz/}},
            928                     {numPagesContainingMRI: {$gt: 0}},
            929                     {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
            930                 ]}).count()
            931
            932     221 websites
            933
            934
    933     935     # But to produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with .nz tld):
    934     936     db.getCollection('Websites').find({$and: [
     
    961     963         { $sort : { count : -1} }
    962     964     ]);
            965
            966
            967     # Manually inspected the shortlist of 221 non-NZ websites to weed out those that aren't MRI (those misdetected as MRI, autotranslated, or just containing placenames etc.), then added the 176 NZ sites on top:
            968
            969     MANUAL - TOTAL NUM SITES WITH SOME MRI CONTENT BY COUNTRY:
            970     NZ: 176
            971     US: 25
            972     AU: 3
            973     DE: 2
            974     DK: 2
            975     BG: 1
            976     CZ: 1
            977     ES: 1
            978     FR: 1
            979     IE: 1
            980     TOTAL: 213
            981
            982     Manually created the counts.json file for the above, named "6counts_nonProductSites1_manualShortlist.json"
            983
            984     --------------------------------------------------------
            985     APPENDIX: Legend of mongodb-data folder's contents
            986     --------------------------------------------------------
            987     1. allCrawledSites: all sites from CommonCrawl with content-language=MRI, which we then crawled with Nutch at depth=10. Some obviously auto-translated websites were skipped.
            988
            989     2. sitesWithPagesInMRI: those sites of point 1 above which contained one or more pages that OpenNLP detected as having MRI as the primary language
            990
            991     3. sitesWithPagesContainingMRI.json: those sites of point 1 with one or more pages containing at least one "sentence" for which the primary language detected by OpenNLP was MRI
            992
            993     4. tentativeNonProductSites: sites of point 3 excluding those non-NZ sites that had "mi.*" or "*/mi" in the URL path
            994
            995     5. tentativeNonProductSites1: similar to point 4, but "NZ sites" in this set were not just those detected as originating in NZ (hosted on NZ servers?) but also any with a TLD of .nz, regardless of the site's country of origin.
            996
            997     6. nonProductSites1_manualShortlist: based on point 5, but with all the non-NZ sites manually inspected to remove any that were not actually sources of MRI content. For example: sites whose content was in a different language misdetected by OpenNLP (and CommonCrawl's language detection) as MRI, further sites that were autotranslated, and sites where the detected "MRI" content consisted of photos captioned with NZ placenames, which constituted the "sentence(s)" detected as MRI.
            998
            999
            1000    a. All .json files that contain the "counts_" prefix are the counts by country code for each of the above variants. The comments section at the top of each such *counts_*.json file usually contains the mongodb query used to generate the json content of the file.
            1001
            1002    b. All .json files that contain the "geojson-features_" or "multipoint_" prefix for each of the above variants are generated by running org/greenstone/atea/CountryCodeCountsMapData.java on the corresponding *counts_*.json file.
            1003
            1004    Run as:
            1005        cd maori-lang-detection/src
            1006        java -cp ".:../conf:../lib/*" org/greenstone/atea/CountryCodeCountsMapData ../mongodb-data/[1-6]counts*.json
            1007
            1008    This will then generate the *multipoint_*.json and *geojson-features_*.json files for any of the above 1-6 variants of the input counts json file.
            1009
            1010    c. All .png files that contain the "map_" prefix for each of the above variants are screenshots of the map generated by http://geojson.tools/ for each *geojson-features_*.json file.
            1011    GIMP was used to crop each screenshot to the area of interest.
            1012
    963     1013
    964     1014    --------------------------------------------------------
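
The non-NZ filter that the changeset moves within GS_README.TXT can be read as a plain predicate over one website record. The following is a minimal sketch in plain JavaScript, not the author's code: the field names come from the query in the diff, while the function name `matchesNonNZQuery` and the sample records are hypothetical and only illustrate which sites the `$and`/`$or` conditions keep.

```javascript
// Hypothetical helper mirroring the conditions of the find() query in the diff:
// not geolocated in NZ, no ".nz" in the domain, at least one page containing MRI,
// and either geolocated in AU or without a language code in the URL path.
function matchesNonNZQuery(site) {
  return site.geoLocationCountryCode !== "NZ"
    && !/\.nz/.test(site.domain)
    && site.numPagesContainingMRI > 0
    && (site.geoLocationCountryCode === "AU" || site.urlContainsLangCodeInPath === false);
}

// Made-up sample records (assumptions, not data from the Websites collection).
const sampleSites = [
  { domain: "example.co.nz",  geoLocationCountryCode: "US", numPagesContainingMRI: 5, urlContainsLangCodeInPath: false },
  { domain: "example.com",    geoLocationCountryCode: "NZ", numPagesContainingMRI: 4, urlContainsLangCodeInPath: false },
  { domain: "example.org",    geoLocationCountryCode: "US", numPagesContainingMRI: 3, urlContainsLangCodeInPath: true },
  { domain: "example.com.au", geoLocationCountryCode: "AU", numPagesContainingMRI: 2, urlContainsLangCodeInPath: true },
  { domain: "example.net",    geoLocationCountryCode: "DE", numPagesContainingMRI: 1, urlContainsLangCodeInPath: false },
  { domain: "example.de",     geoLocationCountryCode: "DE", numPagesContainingMRI: 0, urlContainsLangCodeInPath: false },
];

// Keep only the records the predicate accepts, analogous to find(...).
const kept = sampleSites.filter(matchesNonNZQuery);
console.log(kept.map((site) => site.domain));
```

In this sample only the AU-hosted site and the DE-hosted site without a language code in its path pass the filter, which corresponds to the kind of sites the README counts as non-NZ candidates before manual inspection.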
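
The manual per-country tally added in this changeset can be cross-checked against its stated total of 213. The sketch below, in plain JavaScript, sums the listed counts and builds a list sorted by descending count, as the `$sort` stage in the README's aggregate does. The record shape (`countryCode`, `count`) is an assumption for illustration and is not necessarily the layout of "6counts_nonProductSites1_manualShortlist.json".

```javascript
// Per-country counts copied from the manual tally in the README text.
const manualCounts = {
  NZ: 176, US: 25, AU: 3, DE: 2, DK: 2,
  BG: 1, CZ: 1, ES: 1, FR: 1, IE: 1,
};

// Sum the per-country counts to compare against the stated TOTAL.
const total = Object.values(manualCounts).reduce((sum, n) => sum + n, 0);

// Hypothetical record shape: one { countryCode, count } object per country,
// sorted by descending count.
const records = Object.entries(manualCounts)
  .map(([countryCode, count]) => ({ countryCode, count }))
  .sort((a, b) => b.count - a.count);

console.log(total);
console.log(JSON.stringify(records, null, 2));
```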