Changeset 33807


Timestamp:
12/17/19 19:29:58 (16 months ago)
Author:
ak19
Message:

Trying to manually go through a shortlisted set of domains to see if they're auto-translated. 114 CN-origin sites, all skippable.

File:
1 edited

  • other-projects/maori-lang-detection/MoreReading/mongodb.txt

    r33806 r33807  
- http://www.hzhinew.com/

https://culturesconnection.com/manual-or-automatic-translation/
Manual Or Automatic Translation?

Automatic translation continues to improve day by day. However, it is still unable to reach perfect levels of accuracy and lacks a natural feel. Will it ever replace human translation?

--------------
Mr Bill Rogers' suggestions for a first attempt at sieving out the auto-translated sites:
- Skip .com and .co.<tld> domains. But .co.nz is also used for non-commercial sites, and for sites that nevertheless have high quality Maori language content.
- Change the cut-off value of the OpenNLP language prediction? But for sentences and overlapping sentences we're not using the cut-off value: we just check the best predicted language, regardless of its confidence level.

- Code for (a range of) loading of language options in auto-translated sites?
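
A minimal sketch of the first suggestion above, assuming plain domain strings as input. The helper name looksCommercial and the exemption list are illustrative assumptions, not part of the existing code, and the .co.nz exemption simply reflects the caveat noted in that bullet:

```javascript
// Hypothetical helper: flags domains ending in .com or .co.<tld>,
// but lets through any suffix we choose to exempt (here only .co.nz).
const EXEMPT_SUFFIXES = [".co.nz"]; // assumption: extend as needed

function looksCommercial(domain) {
  const d = domain.toLowerCase();
  if (EXEMPT_SUFFIXES.some((suffix) => d.endsWith(suffix))) {
    return false;
  }
  // Either ".com" or ".co." followed by a TLD of 2+ letters, anchored at the end.
  return /(\.com|\.co\.[a-z]{2,})$/.test(d);
}

console.log(looksCommercial("example.com"));    // true
console.log(looksCommercial("example.co.uk"));  // true
console.log(looksCommercial("example.co.nz"));  // false (exempted)
console.log(looksCommercial("example.fr"));     // false
```

This would only be a coarse first filter: as noted, a commercial-looking suffix says nothing definite about whether the Maori content was auto-translated.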

====================

# https://stackoverflow.com/questions/20175122/how-can-i-use-not-like-operator-in-mongodb

Info on the sites with Maori language content that are either located in NZ or have a .nz domain (TLD):

    db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}]})

    db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}]}).count()
    183

Note: the unescaped dot in /.nz$/ matches any character, so the pattern also matches a domain that merely ends in e.g. "xnz"; /\.nz$/ would be stricter. The counts recorded here were obtained with /.nz$/ as written.

Inverse: the sites detected as containing at least 1 Maori language sentence that are neither located in NZ nor have a .nz domain ending (TLD):
    db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}]}).count()
    685

The above two figures correctly add up (183 + 685) to a total of 868 sites, which is the number of sites detected as containing at least 1 sentence in MRI:
    db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
    868
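
Why the two counts must add up: the second query is the logical inverse of the first (NOT (A OR B) is the same as (NOT A) AND (NOT B)), so between them they split the MRI-containing sites into two disjoint sets. A small sketch in plain JavaScript, using made-up records rather than the real collection (only the field names are taken from the queries above):

```javascript
// Made-up sample records; field names follow the queries above.
const sites = [
  { domain: "a.co.nz",  geoLocationCountryCode: "NZ", numPagesContainingMRI: 3 },
  { domain: "b.org.nz", geoLocationCountryCode: "US", numPagesContainingMRI: 1 },
  { domain: "c.com",    geoLocationCountryCode: "NZ", numPagesContainingMRI: 2 },
  { domain: "d.fr",     geoLocationCountryCode: "FR", numPagesContainingMRI: 5 },
  { domain: "e.cn",     geoLocationCountryCode: "CN", numPagesContainingMRI: 0 },
];

const withMRI = sites.filter((s) => s.numPagesContainingMRI > 0);

// First query: located in NZ OR domain ends in .nz
const fromNZ = withMRI.filter(
  (s) => s.geoLocationCountryCode === "NZ" || /.nz$/.test(s.domain)
);

// Inverse query: NOT located in NZ AND domain does NOT end in .nz
const notFromNZ = withMRI.filter(
  (s) => s.geoLocationCountryCode !== "NZ" && !/.nz$/.test(s.domain)
);

console.log(fromNZ.length, notFromNZ.length, withMRI.length); // 3 1 4

// The unescaped dot matches any character; the escaped one only a literal dot.
console.log(/.nz$/.test("example.xnz"));  // true
console.log(/\.nz$/.test("example.xnz")); // false
```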

Without those with /mi in path:
    db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: false}]}).count()

Now let's get a listing of all 685 sites to be manually inspected to determine whether they're auto-translated:

/*
// Earlier version: {urlContainsLangCodeInPath: false} only matches documents
// where the field is present and set to false.
db.Websites.aggregate([
    {
        $match: {
            $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: false}]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' }
        }
    },
    { $sort : { count : -1} }
]);
*/
// Current version: {$ne: true} also matches documents in which the
// urlContainsLangCodeInPath field is missing altogether.
db.Websites.aggregate([
    {
        // Sites with MRI content that are neither located in NZ nor on a .nz domain
        $match: {
            $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: {$ne: true}}]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        // One group per lower-cased country code, with the distinct domains in each
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' }
        }
    },
    // Countries with the most sites first
    { $sort : { count : -1} }
]);
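
To make the shape of the aggregation output concrete, here is a rough plain-JavaScript equivalent of the $group and $sort stages, run over made-up records (the domains below are placeholders, not real results). The function name groupByCountry is an assumption for illustration only:

```javascript
// Groups site records by lower-cased country code, collects the distinct
// domains per code (like $addToSet), then sorts by count, descending.
function groupByCountry(sites) {
  const groups = new Map();
  for (const site of sites) {
    const code = site.geoLocationCountryCode.toLowerCase();
    if (!groups.has(code)) {
      groups.set(code, { _id: code, count: 0, domain: new Set() });
    }
    const group = groups.get(code);
    group.count += 1;
    group.domain.add(site.domain);
  }
  return [...groups.values()]
    .map((g) => ({ _id: g._id, count: g.count, domain: [...g.domain] }))
    .sort((a, b) => b.count - a.count);
}

// Placeholder input, standing in for the documents that pass the $match stage.
const matched = [
  { domain: "site-one.example", geoLocationCountryCode: "FR" },
  { domain: "site-two.example", geoLocationCountryCode: "CN" },
  { domain: "site-three.example", geoLocationCountryCode: "CN" },
];

console.log(JSON.stringify(groupByCountry(matched), null, 2));
```

Each output entry carries the country code, the number of sites and the list of domains, which is the listing to work through by hand, country by country.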

* CN: Only 1/113 sites from CN stood out as being of interest: http://kiwi2china.com/
     BUT: it's auto-translated (e.g. the Dutch is clearly auto-translated), MRI is not the default nor in any visible drop-down list, and once you view it in Dutch the domain changes to https://nl.admission.nz/

* FR: 35 sites from FR
    http://blueheavenisland.com - French Polynesia
    https://www.lexilogos.com/ -> takes me to the NZ website MaoriDictionary.co.nz etc. for translating words anyway
    http://kihikihi.fr/ -> travel (blog?). Appears to be Hawaiian-related and not Maori.
!!  http://chantsdeluttes.free.fr/versionsinter/page%20maori.html -> Seems it may be a proper translation or composition, as the Dutch and Flemish (and Groningense) versions are different songs by individual translators/composers
    http://splaf.free.fr/pfurb.html - Tahiti, French Polynesian, ... island names
    http://mi.fitnessrebates.com - Uses https://wordpress.org/plugins/weglot/, a WordPress-compatible multilingual plugin which ensures translated pages get indexed by Google: exactly what we want to avoid
*
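
The mi.fitnessrebates.com entry suggests that a language code can show up as a subdomain as well as in the path. A heuristic sketch of checking for both, assuming "mi" as the code of interest. The function name hasLangCode is hypothetical, and this is not how the existing urlContainsLangCodeInPath field is computed:

```javascript
// Hypothetical check: does the URL carry the language code either as the
// leading subdomain label (mi.example.com) or as a path segment (/mi/...)?
function hasLangCode(urlString, code = "mi") {
  const url = new URL(urlString);
  const labels = url.hostname.toLowerCase().split(".");
  const segments = url.pathname.toLowerCase().split("/").filter(Boolean);
  // Require at least three labels so a bare domain like "mi.com" is not
  // treated as a language subdomain.
  const inSubdomain = labels.length > 2 && labels[0] === code;
  const inPath = segments.includes(code);
  return inSubdomain || inPath;
}

console.log(hasLangCode("http://mi.fitnessrebates.com")); // true
console.log(hasLangCode("https://example.com/mi/about")); // true
console.log(hasLangCode("https://example.com/mission"));  // false
console.log(hasLangCode("http://kihikihi.fr/"));          // false
```

A match would only mark a site as a candidate for closer inspection: sites with genuine Maori content may use the same URL convention.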

--------------

GETTING TABLE DATA OUT OF MONGO DB:

     

---------------------

/* 1 */
{
    "_id" : "US",
    "count" : 93.0,
    -95.8, 40.33
}

/* 2 */
{
    "_id" : "AU",
    "count" : 7.0,
    135.8, -25.33
}

/* 3 */
{
    "_id" : "CN",
    "count" : 7.0,
    100.8, 32.33
}

/* 4 */
{
    "_id" : "NZ",
    "count" : 5.0,
    175.8, -40.33
}

/* 5 */
{
    "_id" : "DE",
    "count" : 5.0,
    10.8, 50.33
}

/* 6 */
{
    "_id" : "HK",
    "count" : 5.0,
    114, 22.33
}

/* 7 */
{
    "_id" : "RU",
    "count" : 4.0,
    38.4, 55.5
}

/* 8 */
{
    "_id" : "JP",
    "count" : 3.0,
    137.8, 36
}

/* 9 */
{
    "_id" : "GB",
    "count" : 3.0,
    -2, 53.33
}

/* 10 */
{
    "_id" : "CA",
    "count" : 2.0,
    -105.8, 55.33
}

/* 11 */
{
    "_id" : "FR",
    "count" : 2.0,
    3, 47.33
}

/* 12 */
{
    "_id" : "DK",
    "count" : 2.0,
    9.5, 55.33
}

/* 13 British Virgin Islands */
{
    "_id" : "VG",
    "count" : 2.0,
    -64.8, 18.35
}

/* 14 Ukraine */
{
    "_id" : "UA",
    "count" : 1.0,
    31.5, 48.5
}

/* 15 */
{
    "_id" : "CZ",
    "count" : 1.0,
    16.2, 49.7
}

/* 16 Switzerland */
{
    "_id" : "CH",
    "count" : 1.0,
    8.5, 47
}

/* 17 South Africa */
{
    "_id" : "ZA",
    "count" : 1.0,
    24.2, -30.7
}

/* 18 */
{
    "_id" : "NL",
    "count" : 1.0,
    5.8, 52.33
}

/* 19 */
{
    "_id" : "KR",
    "count" : 1.0,
    127.8, 36.8
}


/** http://geojson.tools/


{
  "type": "MultiPoint",
  "coordinates": [
    [-95.8, 40.33],
    [135.8, -25.33],
    [100.8, 32.33],
    [175.8, -40.33],
    [10.8, 50.33],
    [10.8, 50.33],
    [114, 22.33],
    [38.4, 55.5],
    [-2, 53.33],
    [137.8, 36],
    [-105.8, 55.33],
    [3, 47.33],
    [9.5, 55.33],
    [-64.8, 18.35],
    [31.5, 48.5],
    [16.2, 49.7],
    [8.5, 47],
    [24.2, -30.7],
    [5.8, 52.33],
    [127.8, 36.8]
  ]
}

*/
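
Rather than assembling the MultiPoint by hand, it could be generated from the per-country results. A minimal sketch, assuming a lookup table from country code to a [longitude, latitude] pair (GeoJSON puts longitude first). The names countryCoords and toMultiPoint are made up for illustration, and only three of the hand-assigned coordinates above are repeated here:

```javascript
// A few of the hand-assigned [longitude, latitude] pairs from the notes above.
const countryCoords = {
  US: [-95.8, 40.33],
  AU: [135.8, -25.33],
  NZ: [175.8, -40.33],
};

// Builds a GeoJSON MultiPoint from aggregation-style results ({_id, count}),
// skipping any country code that has no coordinates assigned.
function toMultiPoint(results, coords) {
  const known = (code) => Object.prototype.hasOwnProperty.call(coords, code);
  return {
    type: "MultiPoint",
    coordinates: results.filter((r) => known(r._id)).map((r) => coords[r._id]),
  };
}

const results = [
  { _id: "US", count: 93 },
  { _id: "AU", count: 7 },
  { _id: "NZ", count: 5 },
  { _id: "XX", count: 1 }, // made-up code with no coordinates: skipped
];

console.log(JSON.stringify(toMultiPoint(results, countryCoords)));
// {"type":"MultiPoint","coordinates":[[-95.8,40.33],[135.8,-25.33],[175.8,-40.33]]}
```

Generating the points from the results gives exactly one point per country, in the same order as the results.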