Changeset 33824
Timestamp: 2020-01-13T20:14:59+13:00
Files: 1 edited
Index: other-projects/maori-lang-detection/hdfs-cc-work/GS_README.TXT
--- other-projects/maori-lang-detection/hdfs-cc-work/GS_README.TXT (r33815)
+++ other-projects/maori-lang-detection/hdfs-cc-work/GS_README.TXT (r33824)
@@ -898,14 +898,4 @@
 # Just considering those sites outside NZ or not with .nz TLD:
 
-db.getCollection('Websites').find({$and: [
-{geoLocationCountryCode: {$ne: "NZ"}},
-{domain: {$not: /\.nz/}},
-{numPagesContainingMRI: {$gt: 0}},
-{$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
-]}).count()
-
-221 websites
-
-# counts by country code excluding NZ related sites
 db.Websites.aggregate([
 {
@@ -931,4 +921,16 @@
 
 
+# counts by country code excluding NZ related sites
+
+db.getCollection('Websites').find({$and: [
+{geoLocationCountryCode: {$ne: "NZ"}},
+{domain: {$not: /\.nz/}},
+{numPagesContainingMRI: {$gt: 0}},
+{$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
+]}).count()
+
+221 websites
+
+
 # But to produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with .nz tld):
 db.getCollection('Websites').find({$and: [
@@ -961,4 +963,52 @@
 { $sort : { count : -1} }
 ]);
+
+
+# Manually inspected shortlist of the 221 non-NZ websites to weed out those that aren't MRI (weeding out those misdetected as MRI, autotranslated or just contain placenames etc), and adding the 176 NZ on top:
+
+MANUAL - TOTAL NUM SITES WITH SOME MRI CONTENT BY COUNTRY:
+NZ: 176
+US: 25
+AU: 3
+DE: 2
+DK: 2
+BG: 1
+CZ: 1
+ES: 1
+FR: 1
+IE: 1
+TOTAL: 213
+
+Manually created counts.json file for above with name "6counts_nonProductSites1_manualShortlist.json"
+
+--------------------------------------------------------
+APPENDIX: Legend of mongodb-data folder's contents
+--------------------------------------------------------
+1. allCrawledSites: all sites from CommonCrawl where the content-language=MRI, which we then crawled with Nutch with depth=10. Some obvious auto-translated websites were skipped.
+
+2. sitesWithPagesInMRI: those sites of point 1 above which contained one or more pages that openNLP detected as MRI as primary language
+
+3. sitesWithPagesContainingMRI.json: those sites of point 1 where one or more pages containing at least one "sentence" for which the primary language detected by OpenNLP was MRI
+
+4. tentativeNonProductSites: sites of point 3 excluding those non-NZ sites that had "mi.*" or "*/mi" in the URL path
+
+5. tentativeNonProductSites1: similar to point 4, but "NZ sites" in this set were not just those that were detected as originating in NZ (hosted on NZ servers?) but also any with a TLD of .nz regardless of site's country of origin.
+
+6. nonProductSites1_manualShortlist: based on point 5, but manually inspected all the non-NZ sites for any that were not actually sources of MRI content. For example, sites where the content was in a different language misdetected by openNLP (and commoncrawl's language detection) as MRI, or any further sites that were autotranslated, sites where the "MRI" detected content were photos captioned with NZ placenames constituting the "sentence(s)" detected as being MRI.
+
+
+a. All .json files that contain the "counts_" prefix are the counts by country code for each of the above variants. The comments section at the top of each such *counts_*.json file usually contains the mongodb query used to generate the json content of the file.
+
+b. All .json files that contain "geojson-features_" and "multipoint_" prefix for each of the above variants are generated by running org/greenstone/atea/CountryCodeCountsMapData.java on the *counts_*.json file.
+
+Run as:
+cd maori-lang-detection/src
+java -cp ".:../conf:../lib/*" org/greenstone/atea/CountryCodeCountsMapData ../mongodb-data/[1-6]counts*.json
+
+This will then generate the *multipoint_*.json and *geojson-features_*.json files for any of the above 1-6 variants of the input counts json file.
+
+c. All .png files that contain the "map_" prefix for each of the above variants were screenshots of the map generated by http://geojson.tools/ for each *geojson-features_*.json file.
+GIMP was used to crop each screenshot to the area of interest.
+
 
 --------------------------------------------------------
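As a reading aid, the non-NZ `find(...).count()` query in this changeset can be approximated as a plain predicate over in-memory documents. This is only a sketch: the field names come from the query itself, but the sample documents, the function names `isNonNzCandidate` and `countByCountry`, and the values are hypothetical and are not part of the README or the Websites collection. It also shows that the regex `/\.nz/` is unanchored, so it excludes any domain containing ".nz" anywhere, not only as the TLD.

```javascript
// Hypothetical sample documents; values are made up for illustration only.
const sampleSites = [
  { domain: "example.com",    geoLocationCountryCode: "US", numPagesContainingMRI: 3, urlContainsLangCodeInPath: false },
  { domain: "example.co.nz",  geoLocationCountryCode: "US", numPagesContainingMRI: 5, urlContainsLangCodeInPath: false },
  { domain: "example.org",    geoLocationCountryCode: "NZ", numPagesContainingMRI: 2, urlContainsLangCodeInPath: false },
  { domain: "example.de",     geoLocationCountryCode: "DE", numPagesContainingMRI: 4, urlContainsLangCodeInPath: true },
  { domain: "example.com.au", geoLocationCountryCode: "AU", numPagesContainingMRI: 1, urlContainsLangCodeInPath: true },
  { domain: "example.net",    geoLocationCountryCode: "FR", numPagesContainingMRI: 0, urlContainsLangCodeInPath: false },
];

// Approximate in-memory equivalent of the $and / $or filter in the query.
function isNonNzCandidate(site) {
  return site.geoLocationCountryCode !== "NZ"          // {$ne: "NZ"}
    && !/\.nz/.test(site.domain)                       // {$not: /\.nz/}, unanchored match
    && site.numPagesContainingMRI > 0                  // {$gt: 0}
    && (site.geoLocationCountryCode === "AU"           // AU sites are kept regardless of URL path
        || site.urlContainsLangCodeInPath === false);  // others only without a language code in the path
}

// Tally matching sites per country code, similar in spirit to the counts files.
function countByCountry(sites) {
  const counts = {};
  for (const site of sites) {
    const code = site.geoLocationCountryCode;
    counts[code] = (counts[code] || 0) + 1;
  }
  return counts;
}

const candidates = sampleSites.filter(isNonNzCandidate);
const counts = countByCountry(candidates);
console.log(candidates.map((s) => s.domain), counts);
```

With these made-up documents only `example.com` and `example.com.au` pass the filter: the `.co.nz` domain, the NZ-located site, the non-AU site with a language code in its path, and the site with zero MRI pages are all excluded.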