Changeset 33816

12/19/19 22:33:08 (16 months ago)

Finished manually going through the sites that I couldn't easily filter out as probably autotranslated. I listed the ones that appeared to have genuine content in the Maori language and crossed out the ones that were misdetected or otherwise irrelevant. Some are marked with a question mark where I was unsure.

1 edited


  • other-projects/maori-lang-detection/MoreReading/mongodb.txt

    r33813 r33816  
    10251025     BUT: it's auto-translated (e.g. Dutch is clearly auto-translated), MRI not in default or any visible drop down list, and the domain changes once you view it in Dutch to
    1027 * FR: 35 sites from FR
    1028 - French Polynesia
     1027* FR: 16 sites from FR
     1028, - misdetection. French Polynesia
    10291029 -> takes me to NZ website etc for translating words anyway
    10301030 -> travel (blog?). Appears to be Hawaiian related and not Maori.
    10311031!! -> Seems it may be a proper translation or composition, as Dutch and Flemish (and Groningense) versions are different songs by individual translators/composers
    10321032 - Tahiti, French Polynesian, ... island names
    1033 - Uses wordpress-compatible multilingual plugin, which ensures translated pages get indexed by google - exactly what we want to avoid
    1034 *
     1033X - Uses wordpress-compatible multilingual plugin, which ensures translated pages get indexed by google - exactly what we want to avoid
     1034 - misdetected a Japanese Zen Buddhist chant as MRI
     1035 - Rapa Nui Easter Island. Misdetected.
     1036 - autotranslated pages. Supposedly a GIF repository
     1037 - misdetection of the word "Retour"
     1038 - misdetection of Japanese hiragana etc, and French "faire", as MRI
     1039 - probably misdetection, see title
     1040 - Bora Bora, French Polynesia. Misdetected.
     1041 - appears to be related to Easter Island. Just 1 sentence however.
     1042 - misdetection. Hawaii.
     1043 - Misdetection. Appears to be in German. Manuals pages.
     1045!!! - and - radio links to NZ websites not found by commoncrawl and which potentially have Maori language content. For example,,,
     1046-,, - misidentification of URL
     1047- - misidentification
     1048?,,,, Feels autotranslated, but no language options visible. All SEO related
     1049- - Rapa Nui, Easter Island
     1050- - misidentified
     1051- - misidentification
     1052- - misidentification
     1053- - maps, NZ placenames in PDF
     1055!! -,,
     1058- - a photogallery page mentioning NZ placenames
     1060- AND - photos with Canadian placenames
     1061- - pages of photos, captions involving NZ placenames
     1062~;Uriwera;Moriori;Hivaoa;Kumulipo.htm - Maori-Polynesian comparative dictionary words listing
     1063- - misidentification
     1064- - misidentification
     1065- - no more than 1 sentence over multiple files. Appears to be photo captions of NZ placenames
    1038 - placenames, not meaningful
    1039 !! - some genuine pages (Rarotongan), but one page is in Czech that had a single word misidentified as MRI
     1067- - placenames, not meaningful
     1068!! and - some genuine pages (Rarotongan), but one page is in Czech that had a single word misidentified as MRI
    10401069~ - one page mentions "tiki" but the rest is in English. The other is an (English) caption of "Book of Tiki A Maori Maiden"
    10411070- herocity - autotranslated
    10451074~ - mentions NZ mountain names
    10461075- - misdetected European (Dutch) names as MRI
    1047 - - autotranslated
     1076X - autotranslated
    10481077- - pure German pages, misdetected "Automatik" as a Maori language word
    10491078- - 5 pages containing 1 sentence each but none with 2 sentences detected
     1079- - misdetection of German "Warum?" as MRI
     1080- - misdetected pages on Hawaiian volcanoes.
     1081- - photos from NZ captioned with NZ placenames
     1082- - misdetection
     1083- - NZ placenames. Lots of distances mentioning Waitangi. Nothing detected as containing more than 1 sentence.
     1084- - German misdetected as Maori.
     1085- - seems like a junk site with a random sentence autotranslated into many different languages. So one sentence possibly in Maori, but may not make sense.
     1086- - misdetection. Photos from Hawaii.
     1087!! - Universal Declaration of Human Rights. (Also on a Bulgarian site). Multiple translations available.
    10511088- ITALY:
    10521089 - NZ photogallery with each photo captioned by placename
    10621099- RUSSIA: - misidentification of an email address
    10631100- JAPAN: - many pages of scientific names of (plants?) which are often misdetected as MRI
    1064 !! Ireland, ie:
     1101!! - Ireland, ie:
    10651102- IRAN: - video title from MaoriTelevision website
    1066 ? - CZECH republic: - NZ job position title in MRI but rest in English
    1067 - SPAIN: - 2 uses of the same placename
     1103- CZECH republic:
     1104? - NZ job position title in MRI but rest in English
     1105!! and variant
     1106 - dating site. Misidentification.
     1107- SPAIN:
     1109 - 2 occurrences of the word "kiwi"
     1110 - 2 uses of the same placename
     1111 - Polynesian placenames
    10681112- SINGAPORE: - autotranslated
    10691113- TURKEY: - autotranslated
    11221166 - photo captions containing NZ placenames
    11231167 - photogallery captioned with NZ placenames
    11941239                {domain: {$not: /\.nz/}},
    11951240                {numPagesContainingMRI: {$gt: 0}},
    1196                 {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}           
     1241                {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
    11971242            ]}).count()
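
The filter in the query fragment above can be sketched in plain JavaScript against an in-memory array, which may help when checking by hand which sites it would count. This is only a rough illustration: the field names (`domain`, `numPagesContainingMRI`, `geoLocationCountryCode`, `urlContainsLangCodeInPath`) are taken from the fragment, but the sample documents, the `matchesFilter` helper and the assumption that the three clauses are combined with `$and` are hypothetical, since the enclosing operator and collection name are not visible here.

```javascript
// Hypothetical sample documents. Field names come from the query fragment;
// the values are invented purely for illustration.
const sites = [
  { domain: "example.co.nz", numPagesContainingMRI: 4, geoLocationCountryCode: "NZ", urlContainsLangCodeInPath: false },
  { domain: "example.com.au", numPagesContainingMRI: 2, geoLocationCountryCode: "AU", urlContainsLangCodeInPath: true },
  { domain: "example.fr", numPagesContainingMRI: 3, geoLocationCountryCode: "FR", urlContainsLangCodeInPath: true },
  { domain: "example.de", numPagesContainingMRI: 1, geoLocationCountryCode: "DE", urlContainsLangCodeInPath: false },
  { domain: "example.org", numPagesContainingMRI: 0, geoLocationCountryCode: "US", urlContainsLangCodeInPath: false },
];

// Assumption: the three clauses are AND-ed together (the enclosing
// operator is not shown in the fragment).
function matchesFilter(site) {
  // {domain: {$not: /\.nz/}}
  const notNzDomain = !/\.nz/.test(site.domain);
  // {numPagesContainingMRI: {$gt: 0}}
  const hasMriPages = site.numPagesContainingMRI > 0;
  // {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
  const auOrNoLangCode =
    site.geoLocationCountryCode === "AU" || site.urlContainsLangCodeInPath === false;
  return notNzDomain && hasMriPages && auOrNoLangCode;
}

const matchedDomains = sites.filter(matchesFilter).map((s) => s.domain);
console.log(matchedDomains.length, matchedDomains);
```

Note that the unanchored pattern `/\.nz/` also matches `.nz` appearing anywhere in the domain string, not only as the final suffix.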
    12531298    { $sort : { count : -1} }
     1303Done: manually inspected 68/117 sites
     1307+, [Universal declaration of Human Rights]
     1309+, [often English, but COMMUNITY]
     1311BIBLE/MOHAMMED/BAHAI TRANSLATIONS probably not auto translations:
     1312+, [Dutch seems to be properly translated, not auto-translated, so maybe MRI too]
     1314+, [possibly not autotranslated]
     1315+, [possibly not autotranslated]
     1317+, [probably real translations, as there are multiple Dutch translations from different sources provided]
     1318+, [doesn't appear autotranslated]
     1319X, [Hiri Motu, PNG language misdetected. Doesn't seem to have Maori]
     1320X, doesn't have Maori translation. Misdetected.
     1321X, - doesn't have Maori translations. Instead, Rongo-rongo, Kiribati (Micronesian) etc misdetected
     1322X, [misdetected, Papua New Guinea]
     1323X, may be autotranslated as the Dutch page looks more like Danish or some Scandinavian language and the French page is missing accented characters.
     1335?, [English, but community]
     1336?, [COMMUNITY, but English]
     1337?, [COMMUNITY?, in English, environment site]
     1339~, [1 page, image caption, Maori language warden position title with English sentence for appointment as warden]
     1341X, [English, background of NZ place]
     1342X,  - placenames
     1343X, [English, NZ place]
     1345MAYBE, INSPECT:
     1346?, [lots of English, but COMMUNITY, CULTURE]
     1350X, [misdetected, Hawaiian]
     1351X, [misdetected, Tahiti]
     1352X, [art site by different artists. A Chinese and another (possibly Japanese) name were misdetected]
     1353X,  [NZ school, but main page mostly in English. No pages with > 1 sentence detected as MRI]
     1355X,  [misdetected. Crawled content appears Polynesian not Maori]
     1356X, [NZ place photo caption]
     1357X, [mostly Cyrillic, with some NZ or Polynesian names misdetected]
     1358X - one page contained NZ placenames, another had a word misdetected
     1361+, [audio, with occasional English.]
     1362?, [video and audio, seems in English, but maybe CULTURAL/COMMUNITY?]
     1364X, timetable with occasional Maori language word
     1365+, is an image of Maori number names. But other page on is a NZ certificate or ID (in English) of a person's position.
     1366 - no page with more than 1 sentence detected. But short string of actual MRI content. Educator blog with pictures and English language content.
     1371  "karakia mo te moana - Google Search | Te Reo Maori Resources | Moana, Powerpoint tips, Google"
     1373  Maori Moko | Image | Moko Maori Tattoo & Portraits | TA MOKO | Maori tribe, Maori people, Maori art [COMMUNITY, CULTURE]
     1374[The other pinterest detected as numPagesContainingMRI > 0 was misdetected]
     1386X,  [Indonesian, misdetected]
     1387~, [title on page appears to be in MRI, but content appears to be in English and South/Central American. Internationally focussed content.]
     1388X, [NZ placename or institution]
     1389?, [te reo Maori related school activities. Described in English.]
     1390X, [blog in French, photo captions contain NZ placenames]
     1391X, [blog in French, Rapa Nui/Easter Island related content, misdetected.]
     1395??, feels like it's autotranslated, an image of text appears, but the text is in MRI [advertising for some weight loss gimmick]
     1403X,, [photo captions of ships. Sometimes misdetected Japanese words as MRI.]
     1404X,, [not more than 1 sentence detected as in MRI]
     1405X, [Name database, lots misdetected]
     1407STILL TO DO LIST:
     1409X, [misdetected 3 short English sentences as MRI]
     1410X, [misdetected short English sentence as MRI]
     1411X, [autotranslated product site]
     1412X, [linkfarm like site of related junk links, contained URLs misdetected as MRI]
     1414X, [contains URLs. URLs containing Aotearoa as substring detected as MRI. But no proper sentence content. ]
     1415X, [Hawaiian and Tahiti related content misdetected]
     1416X, [misdetected short English sentence as MRI]
     1417X,  [misdetected short English sentence as MRI]
     1418X, [image captions of "Wairua Warrior"]
     1420X, [Poker website? Looks autotranslated or Lorem Ipsum type of meaningless sentences.]
     1421X, Looks Polish or other East-European language. The NZ page had placenames detected.
     1423X, [detection and misdetection of author names of papers hosted]
     1425X, [.rar and media file links misdetected as MRI]
     1428X, [NZ place names for astronomical phenomena]
     1429X, [NZ placenames]
     1430X, [misdetection CD title]
     1431X, [Tahitian, Reo Tahiti, misdetected as MRI]
     1432X, [URL names, looked at several which were probably misdetected as MRI]
     1433X, [uses Google translate for auto-translation]
     1436??, [historical information, useful for CULTURE? e.g.]
     1438X, [Lots of names. And a few short sentences or words possibly in comments.]
     1439X, [Rapa Nui, Easter Island related content. Misdetected]
     1440X, [Blogspot. Title of a single page is in Maori. "Aotearoa ... kei te aroha au ki a koe"]
     1441X, [some source code related to languages' two letter codes]
     1443X, [Lots of misdetection based on word Kia.]
     1444??, [Similar looking science web sites for children. Uses auto-translation?]
     1446X, [place names. Pages about Solomon Islands. Misdetection of placenames.]
     1450X, [Has a page on Kia Kaha meaning, but URL redirects to a different low quality site with bad formatting and adverts. ]
     1451??, [Not sure if is autotranslated or not. Can't easily locate existence of Dutch or German translated pages. There's Tamil-Singapore, but no other Tamil. So maybe translations based on target buyer audience?]
     1452X, [URLs are misdetected as MRI]
     1453X, [appears to have misdetected a short phrase of 2 words, Kai Kaia, besides phrase words from other languages]
     1464  X, [NZ hotels, placenames]
     1465  X, [Single sentence misdetection]
     1466  X, [just a list of lots of words, possibly placenames. Some misdetected, e.g. Rapa Nui]