Changeset 33807
- Timestamp: 2019-12-17T19:29:58+13:00
- File: 1 edited
other-projects/maori-lang-detection/MoreReading/mongodb.txt
--- r33806
+++ r33807
@@ -856,6 +856,89 @@
 - http://www.hzhinew.com/
 
+https://culturesconnection.com/manual-or-automatic-translation/
+Manual Or Automatic Translation?
+
+Automatic translation continues to improve day by day. However, it is still unable to reach perfect levels of accuracy and lacks a natural feel. Will it ever replace human translation?
 
 --------------
+Mr Bill Rogers' suggestions for beginnings of trying to sieve out the auto-translated sites:
+- skip .com. .co.<tld>. But .co.nz is also used for non-commercial sites or sites that nevertheless have high quality Maori language content.
+- change cut-off value of OpenNLP language prediction? But for sentences and overlapping sentences, we're not using the cut-off value, we're just checking the best predicted language regardless of confidence level for this.
+
+- Code for (a range of) loading of language options in auto-translated sites?
+
+====================
+
+# https://stackoverflow.com/questions/20175122/how-can-i-use-not-like-operator-in-mongodb
+
+Info on the sites with Maori language content that are either from NZ or have .nz domain (TLD):
+
+db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}]})
+
+db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}]}).count()
+183
+
+Inverse: the sites detected as containing at least 1 Maori language sentence that are NOT from NZ NOR have .nz domain ending (TLD):
+db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}]}).count()
+685
+
+The above two figures correctly add up to a total of 868 sites, which is the number of sites detected as containing at least 1 sentence in MRI:
+db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
+868
+
+Without those with /mi in path:
+db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: false}]}).count()
+
+Now let's get a listing of all 685 sites to be manually inspected to determine whether they're auto-translated:
+
+/*
+db.Websites.aggregate([
+    {
+        $match: {
+            $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: false}]
+        }
+    },
+    { $unwind: "$geoLocationCountryCode" },
+    {
+        $group: {
+            _id: {$toLower: '$geoLocationCountryCode'},
+            count: { $sum: 1 },
+            domain: { $addToSet: '$domain' }
+        }
+    },
+    { $sort : { count : -1} }
+]);
+*/
+db.Websites.aggregate([
+    {
+        $match: {
+            $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: {$ne: true}}, ]
+        }
+    },
+    { $unwind: "$geoLocationCountryCode" },
+    {
+        $group: {
+            _id: {$toLower: '$geoLocationCountryCode'},
+            count: { $sum: 1 },
+            domain: { $addToSet: '$domain' }
+        }
+    },
+    { $sort : { count : -1} }
+]);
+
+* CN: Only 1/113 sites from CN stood out as being of interest: http://kiwi2china.com/
+BUT: it's auto-translated (e.g. Dutch is clearly auto-translated), MRI not in default or any visible drop down list, and the domain changes once you view it in Dutch to https://nl.admission.nz/
+
+* FR: 35 sites from FR
+http://blueheavenisland.com - French Polynesia
+https://www.lexilogos.com/ -> takes me to NZ website MaoriDictionary.co.nz etc for translating words anyway
+http://kihikihi.fr/ -> travel (blog?). Appears to be Hawaiian related and not Maori.
+!! http://chantsdeluttes.free.fr/versionsinter/page%20maori.html -> Seems it may be a proper translation or composition, as Dutch and Flemish (and Groningense) versions are different songs by individual translators/composers
+http://splaf.free.fr/pfurb.html - Tahiti, French Polynesian, ... island names
+http://mi.fitnessrebates.com - Uses https://wordpress.org/plugins/weglot/ wordpress-compatible multilingual plugin, which ensures translated pages get indexed by google - exactly what we want to avoid
+*
+
+--------------
+
 GETTING TABLE DATA OUT OF MONGO DB:
 
@@ -876,486 +959,2 @@
 
 ---------------------
-
-/* 1 */
-{
-    "_id" : "US",
-    "count" : 93.0,
-    -95.8,40.33
-}
-
-/* 2 */
-{
-    "_id" : "AU",
-    "count" : 7.0,
-    135.8,-25.33
-}
-
-/* 3 */
-{
-    "_id" : "CN",
-    "count" : 7.0,
-    100.8,
-    32.33
-}
-
-/* 4 */
-{
-    "_id" : "NZ",
-    "count" : 5.0,
-    175.8,
-    -40.33
-}
-
-/* 5 */
-{
-    "_id" : "DE",
-    "count" : 5.0,
-    10.8,
-    50.33
-}
-
-/* 6 */
-{
-    "_id" : "HK",
-    "count" : 5.0,
-    114,
-    22.33
-}
-
-/* 7 */
-{
-    "_id" : "RU",
-    "count" : 4.0,
-    38.4,
-    55.5
-}
-
-/* 8 */
-{
-    "_id" : "JP",
-    "count" : 3.0,
-    137.8,
-    36
-}
-
-/* 9 */
-{
-    "_id" : "GB",
-    "count" : 3.0,
-    -2,
-    53.33
-}
-
-/* 10 */
-{
-    "_id" : "CA",
-    "count" : 2.0,
-    -105.8,
-    55.33
-}
-
-/* 11 */
-{
-    "_id" : "FR",
-    "count" : 2.0,
-    3,
-    47.33
-}
-
-/* 12 */
-{
-    "_id" : "DK",
-    "count" : 2.0,
-    9.5,
-    55.33
-}
-
-/* 13 British Virgin Islands */
-{
-    "_id" : "VG",
-    "count" : 2.0,
-    -64.8,
-    18.35
-}
-
-/* 14 Ukraine */
-{
-    "_id" : "UA",
-    "count" : 1.0,
-    31.5,
-    48.5
-}
-
-/* 15 */
-{
-    "_id" : "CZ",
-    "count" : 1.0,
-    16.2,
-    49.7
-}
-
-/* 16 Switzerland */
-{
-    "_id" : "CH",
-    "count" : 1.0,
-    8.5,
-    47
-}
-
-/* 17 Zuid-Afrika */
-{
-    "_id" : "ZA",
-    "count" : 1.0,
-    24.2,
-    -30.7
-}
-
-/* 18 */
-{
-    "_id" : "NL",
-    "count" : 1.0,
-    5.8,
-    52.33
-}
-
-/* 19 */
-{
-    "_id" : "KR",
-    "count" : 1.0,
-    127.8,
-    36.8
-}
-
-
-/** http://geojson.tools/
-
-
-{
-    "type": "MultiPoint",
-    "coordinates": [
-        [
-            -95.8,
-            40.33
-        ],
-        [
-            135.8,
-            -25.33
-        ],
-        [
-            100.8,
-            32.33
-        ],
-        [
-            175.8,
-            -40.33
-        ],
-        [
-            10.8,
-            50.33
-        ],
-        [
-            10.8,
-            50.33
-        ],
-        [
-            114,
-            22.33
-        ],
-        [
-            38.4,
-            55.5
-        ],
-        [
-            -2,
-            53.33
-        ],
-        [
-            137.8,
-            36
-        ],
-        [
-            -105.8,
-            55.33
-        ],
-        [
-            3,
-            47.33
-        ],
-        [
-            9.5,
-            55.33
-        ],
-        [
-            -64.8,
-            18.35
-        ],
-        [
-            31.5,
-            48.5
-        ],
-        [
-            16.2,
-            49.7
-        ],
-        [
-            8.5,
-            47
-        ],
-        [
-            24.2,
-            -30.7
-        ],
-        [
-            5.8,
-            52.33
-        ],
-        [
-            127.8,
-            36.8
-        ]
-    ]
-}
-
-*/
[r33806 lines 1120-1361, also removed, repeat lines 878-1119 above verbatim]
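The notes in this changeset observe that the NZ count (183) and the non-NZ count (685) add up to the 868 sites containing MRI. That follows because the second query is the logical negation of the first query's `$or` clause (NOT (A OR B) equals NOT A AND NOT B), so the two result sets partition the MRI sites. A minimal in-memory sketch of that reasoning is below. The sample documents and their values are invented for illustration; only the field names come from the queries. It also uses `/\.nz$/` with an escaped dot, which is an assumption about the intent: the committed pattern `/.nz$/` lets `.` match any character.

```javascript
// Hypothetical sample documents: field names follow the queries in the notes,
// the values are made up for this sketch.
const websites = [
  { domain: "example.co.nz", geoLocationCountryCode: "NZ", numPagesContainingMRI: 3 },
  { domain: "example.org.nz", geoLocationCountryCode: "US", numPagesContainingMRI: 1 },
  { domain: "example.com", geoLocationCountryCode: "NZ", numPagesContainingMRI: 2 },
  { domain: "example.fr", geoLocationCountryCode: "FR", numPagesContainingMRI: 5 },
  { domain: "example.cn", geoLocationCountryCode: "CN", numPagesContainingMRI: 0 },
];

// Escaped dot: matches only a literal ".nz" suffix (assumed intent).
const NZ_TLD = /\.nz$/;

// Equivalent of {numPagesContainingMRI: {$gt: 0}}.
const hasMRI = (site) => site.numPagesContainingMRI > 0;

// Equivalent of {$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz$/}]}.
const isNZ = (site) =>
  site.geoLocationCountryCode === "NZ" || NZ_TLD.test(site.domain);

const withMRI = websites.filter(hasMRI);
const nzSites = withMRI.filter(isNZ);

// Equivalent of the inverse query: not NZ-located AND not a .nz domain.
const nonNZSites = withMRI.filter(
  (site) => site.geoLocationCountryCode !== "NZ" && !NZ_TLD.test(site.domain)
);

// The two subsets are disjoint and together cover every site with MRI.
console.log(nzSites.length, nonNZSites.length, withMRI.length);
```

With this sample data the two subset sizes sum to the size of `withMRI`, mirroring the 183 + 685 = 868 check in the notes.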