Changeset 33824

Timestamp:

2020-01-13T20:14:59+13:00

Author:

ak19

Message:

More instructions and explaining the contents of the mongodb-data folder.

File:

: 1 edited

other-projects/maori-lang-detection/hdfs-cc-work/GS_README.TXT (modified) (3 diffs)

other-projects/maori-lang-detection/hdfs-cc-work/GS_README.TXT

-              r33815
+              r33824
 # Just considering those sites outside NZ or not with .nz TLD:
-db.getCollection('Websites').find({$and: [
-                {geoLocationCountryCode: {$ne: "NZ"}},
-                {domain: {$not: /\.nz/}},
-                {numPagesContainingMRI: {$gt: 0}},
-                {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
-            ]}).count()
-websites
-# counts by country code excluding NZ related sites
 db.Websites.aggregate([
+    {
 …
+# counts by country code excluding NZ related sites
+db.getCollection('Websites').find({$and: [
+                {geoLocationCountryCode: {$ne: "NZ"}},
+                {domain: {$not: /\.nz/}},
+                {numPagesContainingMRI: {$gt: 0}},
+                {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
+            ]}).count()
+websites
 # But to produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with .nz tld):
 db.getCollection('Websites').find({$and: [
 …
     { $sort : { count : -1} }
 ]);
+# Manually inspected the shortlist of the 221 non-NZ websites to weed out those that aren't MRI (misdetected as MRI, autotranslated, or just containing placenames etc.), then added the 176 NZ sites on top:
+MANUAL - TOTAL NUM SITES WITH SOME MRI CONTENT BY COUNTRY:
+NZ: 176
+US: 25
+AU: 3
+DE: 2
+DK: 2
+BG: 1
+CZ: 1
+ES: 1
+FR: 1
+IE: 1
+TOTAL: 213
+Manually created a counts .json file for the above, named "6counts_nonProductSites1_manualShortlist.json"
+--------------------------------------------------------
+APPENDIX: Legend of mongodb-data folder's contents
+--------------------------------------------------------
+1. allCrawledSites: all sites from CommonCrawl where the content-language=MRI, which we then crawled with Nutch with depth=10. Some obviously auto-translated websites were skipped.
+2. sitesWithPagesInMRI: those sites of point 1 above that contained one or more pages for which OpenNLP detected MRI as the primary language.
+3. sitesWithPagesContainingMRI.json: those sites of point 1 where one or more pages contained at least one "sentence" for which the primary language detected by OpenNLP was MRI.
+4. tentativeNonProductSites: sites of point 3, excluding those non-NZ sites that had "mi.*" or "*/mi" in the URL path.
+5. tentativeNonProductSites1: similar to point 4, but "NZ sites" in this set were not just those detected as originating in NZ (hosted on NZ servers?) but also any with a TLD of .nz, regardless of the site's country of origin.
+6. nonProductSites1_manualShortlist: based on point 5, but with all the non-NZ sites manually inspected to remove any that were not actually sources of MRI content. Examples: sites whose content was in a different language misdetected by OpenNLP (and CommonCrawl's language detection) as MRI, further sites that were autotranslated, and sites where the "MRI" content detected consisted of photos captioned with NZ placenames, which made up the "sentence(s)" detected as MRI.
+a. All .json files that contain the "counts_" prefix are the counts by country code for each of the above variants. The comments section at the top of each such *counts_*.json file usually contains the mongodb query used to generate the json content of the file.
+b. All .json files that contain the "geojson-features_" or "multipoint_" prefix for each of the above variants are generated by running org/greenstone/atea/CountryCodeCountsMapData.java on the corresponding *counts_*.json file.
+Run as:
+    cd maori-lang-detection/src
+    java -cp ".:../conf:../lib/*" org/greenstone/atea/CountryCodeCountsMapData ../mongodb-data/[1-6]counts*.json
+This will then generate the *multipoint_*.json and *geojson-features_*.json files for any of the above 1-6 variants of the input counts json file.
+c. All .png files that contain the "map_" prefix for each of the above variants are screenshots of the map generated by http://geojson.tools/ for each *geojson-features_*.json file.
+GIMP was used to crop each screenshot to the area of interest.
 --------------------------------------------------------
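
The find() filter shown in full in the diff can be restated in plain JavaScript to make its logic easier to check. This is only a sketch: the field names are taken from the query, but the function name and the sample documents are made up for illustration, and the handling of missing fields is not guaranteed to match MongoDB's. The last lines simply re-add the manual per-country tally to confirm the stated total.

```javascript
// Plain-JavaScript restatement of the find() filter from the README.
// Field names come from the query; isNonNzCandidate is a hypothetical name.
function isNonNzCandidate(site) {
  const outsideNz = site.geoLocationCountryCode !== "NZ";
  // Unanchored pattern, exactly as written in the query.
  const noNzInDomain = !/\.nz/.test(site.domain);
  const hasMriPages = site.numPagesContainingMRI > 0;
  const auOrNoLangCode =
    site.geoLocationCountryCode === "AU" ||
    site.urlContainsLangCodeInPath === false;
  return outsideNz && noNzInDomain && hasMriPages && auOrNoLangCode;
}

// Hypothetical documents, for illustration only (not real crawl data).
const sampleSites = [
  { domain: "example.com", geoLocationCountryCode: "US",
    numPagesContainingMRI: 3, urlContainsLangCodeInPath: false },
  { domain: "example.co.nz", geoLocationCountryCode: "US",
    numPagesContainingMRI: 5, urlContainsLangCodeInPath: false },
  { domain: "example.com.au", geoLocationCountryCode: "AU",
    numPagesContainingMRI: 1, urlContainsLangCodeInPath: true },
  { domain: "example.de", geoLocationCountryCode: "DE",
    numPagesContainingMRI: 2, urlContainsLangCodeInPath: true },
];

const kept = sampleSites.filter(isNonNzCandidate).map((s) => s.domain);
console.log(kept);

// Manual tally from the README; check that it adds up to the stated total.
const manualCounts = {
  NZ: 176, US: 25, AU: 3, DE: 2, DK: 2, BG: 1, CZ: 1, ES: 1, FR: 1, IE: 1,
};
const total = Object.values(manualCounts).reduce((a, b) => a + b, 0);
console.log(total);
```

With these sample documents, the .co.nz site is dropped by the domain test and the DE site by the language-code test, while the AU site is kept even though its URL path contains a language code. Because /\.nz/ is unanchored, it matches ".nz" anywhere in the domain string, not only as the final TLD.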
