Changeset 33816


Timestamp: 2019-12-19T22:33:08+13:00
Author: ak19
Message:

Finished manually going through the sites that I couldn't easily filter out as probably autotranslated. I listed the ones that appeared to have genuine content in the Maori language and crossed out the ones that were misdetected or otherwise irrelevant. Some are question marked.

File: 1 edited

  • other-projects/maori-lang-detection/MoreReading/mongodb.txt

    r33813 r33816  
    10251025     BUT: it's auto-translated (e.g. Dutch is clearly auto-translated), MRI not in default or any visible drop down list, and the domain changes once you view it in Dutch to https://nl.admission.nz/
    10261026
    1027 * FR: 35 sites from FR
    1028     http://blueheavenisland.com - French Polynesia
     1027* FR: 16 sites from FR
     1028    http://blueheavenisland.com, http://www.blueheavenisland.com - misdetection. French Polynesia
    10291029    https://www.lexilogos.com/ -> takes me to NZ website MaoriDictionary.co.nz etc for translating words anyway
    10301030    http://kihikihi.fr/ -> travel (blog?). Appears to be Hawaiian related and not Maori.
    10311031!!  http://chantsdeluttes.free.fr/versionsinter/page%20maori.html -> Seems it may be a proper translation or composition, as Dutch and Flemish (and Groningense) versions are different songs by individual translators/composers
    10321032    http://splaf.free.fr/pfurb.html - Tahiti, French Polynesian, ... island names
    1033     http://mi.fitnessrebates.com - Uses https://wordpress.org/plugins/weglot/ wordpress-compatible multilingual plugin, which ensures translated pages get indexed by google - exactly what we want to avoid
    1034 *
    1035 
    1036 
     1033X    http://mi.fitnessrebates.com - Uses https://wordpress.org/plugins/weglot/ wordpress-compatible multilingual plugin, which ensures translated pages get indexed by google - exactly what we want to avoid
     1034   http://mahajana.net - misdetected a Japanese Zen Buddhist chant as MRI
     1035   http://rapanui.fr - Rapa Nui Easter Island. Misdetected.
     1036   http://www.gif.ovh - autotranslated pages. Supposedly a GIF repository
     1037   http://baladeornithologique.com - misdetection of the word "Retour"
     1038   http://www.gaudry.be - misdetection of Japanese hiragana etc, and French "faire", as MRI
     1039   http://www.gototahiti.net - probably misdetection, see title
     1040   http://www.maraamusurfskirace.com - Bora Bora, French Polynesia. Misdetected.
     1041   http://www.rongo-rongo.com - appears to be related to Easter Island. Just 1 sentence however.
     1042   http://pt.city-usa.net - misdetection. Hawaii.
     1043   https://www.manualscat.com - Misdetection. Appears to be in German. Manuals pages.
     1044NL:
     1045!!! - http://www.gouvernante.info and http://gouvernante.info - radio links to NZ websites not found by commoncrawl and which potentially have Maori language content. For example, http://irirangi.net/, https://www.atiawatoafm.com, www.maori.org.nz
     1046- https://www.arrowhead.eu, https://arrowheadproject.azurewebsites.net, arrowhead.eu - misidentification of URL
     1047- tonhut.nl - misidentification
     1048? http://nielsonboutique.co.uk, http://longhornlaw.net, http://tetsubo.org, http://hidsonphoto.com, http://wearehomework.com/- Feels autotranslated, but no language options visible. All SEO related
     1049- diverosa.com - Rapa Nui, Easter Island
     1050- nonlinear.demon.nl - misidentified
     1051- encyclo.co.uk - misidentification
     1052- henrifloor.nl - misidentification
     1053- http://skimap.info/ - maps, NZ placenames in PDF
     1054DK:
     1055!! -  http://akona.ngapuhitelevision.com, http://waiatarangatiratanga.ngapuhitelevision.com,
     1056http://jazz.ngapuhitelevision.com, http://ngapuhitelevision.com, http://ngapuhiradio.com,
     1057http://powhiri.ngapuhitelevision.com, http://komisch.ngapuhitelevision.com
     1058- http://www.rennertweb.de - a photogallery page mentioning NZ placenames
     1059CA:
     1060- http://bcmarina.com AND http://bckayak.com - photos with Canadian placenames
     1061- http://www.myrasplace.net - pages of photos, captions involving NZ placenames
     1062~ http://00.gs/Maniapoto;Uriwera;Moriori;Hivaoa;Kumulipo.htm - Maori-Polynesian comparative dictionary words listing
     1063- aguadilla.airport-authority.com - misidentification
     1064- https://articles.imperialtometric.com - misidentification
     1065- http://daandehn.com - no more than 1 sentence over multiple files. Appears to be photo captions of NZ placenames
    10371066DE:
    1038 http://etymologie.info/~e/n_/nz-___reg.html - placenames, not meaningful
    1039 !! https://www.cartogiraffe.com/ - some genuine pages (Rarotongan), but one page is in Czech that had a single word misidentified as MRI
     1067- http://etymologie.info/~e/n_/nz-___reg.html - placenames, not meaningful
     1068!! https://www.cartogiraffe.com/ and https://www.cartogiraffe.com - some genuine pages (Rarotongan), but one page is in Czech that had a single word misidentified as MRI
    10401069~ http://svenkirsten.com/ - one page mentions "tiki" but the rest is in English. The other is an (English) caption of "Book of Tiki A Maori Maiden"
    10411070- herocity - autotranslated
     
    10451074~ http://klaaskoehne.de/galleries/nzl-taranaki/index.html - mentions NZ mountain names
    10461075- http://www.nierstrasz.org/deGrauwRegister.rtf - misdetected European (Dutch) names as MRI
    1047 - https://afrikhepri.org/mi/ - autotranslated
     1076X https://afrikhepri.org/mi/ - autotranslated
    10481077- https://www.tvteile.de - pure German pages, misdetected "Automatik" as a Maori language word
    10491078- etoile-de-lune.net - 5 pages containing 1 sentence each but none with 2 sentences detected
    1050 
     1079- https://www.you-fly.com - misdetection of German "Warum?" as MRI
     1080- http://vulkane.ch - misdetected pages on Hawaiian volcanoes.
     1081- http://www.stephe.de - photos from NZ captioned with NZ placenames
     1082- http://insecta.pro - misdetection
     1083- http://m.distanta.1km.net - NZ placenames. Lots of distances mentioning Waitangi. Nothing detected as containing more than 1 sentence.
     1084- https://ersatzteile-fachversand.de - German misdetected as Maori.
     1085- https://laskar02cinta.page.tl/Info.htm - seems like a junk site with a random sentence autotranslated into many different languages. So one sentence possibly in Maori, but may not make sense.
     1086- http://www.behlig.de - misdetection. Photos from Hawaii.
     1087!! http://www.udhr.de - Universal Declaration of Human Rights. (Also on a Bulgarian site). Multiple translations available.
    10511088- ITALY:
    10521089  http://oipaz.net/IMG/GalleriaAotearoa/ - NZ photogallery with each photo captioned by placename
     
    10621099- RUSSIA: https://www.gismeteo.lv - misidentification of an email address
    10631100- JAPAN: http://yutaka.it-n.jp - many pages of scientific names of (plants?) which are often misdetected as MRI
    1064 !! Ireland, ie: https://coggle.it
     1101!! - Ireland, ie: https://coggle.it
    10651102- IRAN: https://www.dideo.ir/v/yt/d6cgya0ze-E - video title from MaoriTelevision website
    1066 ? - CZECH republic: https://www.fipojobs.com/new-zealand/jobs-work-p-1 - NZ job position title in MRI but rest in English
    1067 - SPAIN: http://www.info-hoteles.com/nz/2/hotels_lake_rotoiti.asp - 2 uses of the same placename
     1103- CZECH republic:
     1104?  https://www.fipojobs.com/new-zealand/jobs-work-p-1 - NZ job position title in MRI but rest in English
     1105!! http://www.henryklahola.nazory.cz/094.Maori.htm and http://henryklahola.nazory.cz variant
     1106  http://about.ilikeyou.com - dating site. Misidentification.
     1107- SPAIN:
     1108!! https://www.uv.es/~pla/red.net/intmaori.html
     1109  https://www.reclamaciondevuelos.com - 2 occurrences of the word "kiwi"
     1110  http://www.info-hoteles.com/nz/2/hotels_lake_rotoiti.asp - 2 uses of the same placename
     1111  http://www.cruceros-princess.mx/princessMX/Oferta_Cruzeiros_Polinesia.html - Polynesian placenames
    10681112- SINGAPORE: https://omg-solutions.com - autotranslated
    10691113- TURKEY: https://www.elitedeluxe.com.tr/mi/yatak-odasi-takimlari - autotranslated
     
    11221166    http://mikestephens.co.uk/ - photo captions containing NZ placenames
    11231167    http://www.woolrych.org/nzholiday2004/ - photogallery captioned with NZ placenames
     1168   
    11241169--------------
    11251170
     
    11941239                {domain: {$not: /\.nz/}},
    11951240                {numPagesContainingMRI: {$gt: 0}},
    1196                 {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}           
     1241                {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
    11971242            ]}).count()
    11981243
     
    12531298    { $sort : { count : -1} }
    12541299]);
     1300
     1301
     1302-----------------------
     1303Done: manually inspected 68/117 sites
     1304
     1305DEFINITELY:
     1306+ http://anglicanhistory.org,
     1307+ http://www.unicode.org, [Universal declaration of Human Rights]
     1308+ https://static-promote.weebly.com,
     1309+ http://aclhokiangarocks.blogspot.com, [often English, but COMMUNITY]
     1310
     1311BIBLE/MOHAMMED/BAHAI TRANSLATIONS probably not auto translations:
     1312+ http://bahaiprayers.net, [Dutch seems to be properly translated, not auto-translated, so maybe MRI too]
     1313+ https://biblehub.com,
     1314+ http://www.muhammad.com, [possibly not autotranslated]
     1315+ http://www.godrules.net, [possibly not autotranslated]
     1316+ http://m.biblepub.com,
     1317+ http://www.krassotkin.ru, [probably real translations, as there are multiple Dutch translations from different sources provided]
     1318+ http://www.gotquestions.org, [doesn't appear autotranslated]
     1319X https://ebible.org, [Hiri Motu, PNG language misdetected. Doesn't seem to have Maori]
     1320X https://www.bible.com, doesn't have Maori translation. Misdetected.
     1321X https://wol.jw.org, - doesn't have Maori translations. Instead, Rongo-rongo, Kiribati (Micronesian) etc misdetected
     1322X https://png.bible, [misdetected, Papua New Guinea]
     1323X http://www.precious-testimonies.com, http://precious-testimonies.com/JesusDidItTranslations/JesusDidItMaoriTranslation.htm may be autotranslated as the Dutch page looks more like Danish or some Scandinavian language and the French page is missing accented characters.
     1324
     1325PROBABLY:
     1326!! https://maorinews.com,
     1327!! http://maaori.com,
     1328!!+ http://kiaorahola.blogspot.com/
     1329+ https://kjohnsonnz.blogspot.com,
     1330+ http://pumanawawhangara.blogspot.com,
     1331+ http://dannykahei.tripod.com,
     1332+ http://burkekm001.tripod.com
     1333+ http://tkkpipipaopao.blogspot.com,
     1334+ http://manateina.blogspot.com,
     1335? tkkpipipaopao.blogspot.com? http://rangiwewehi.com, [English, but community]
     1336? https://www.terakau.org, [COMMUNITY, but English]
     1337? https://www.pipirikiapapatuanuku.org, [COMMUNITY?, in English, environment site]
     1338~ http://georgegi.tripod.com,
     1339~ http://ngarangatahi.tripod.com, [1 page, image caption, Maori language warden position title with English sentence for appointment as warden]
     1340X http://fhr.kiwicelts.com,
     1341X http://tkrow.tripod.com, [English, background of NZ place]
     1342X http://www.mkiwi.com,  - placenames
     1343X http://www.waimate.com, [English, NZ place]
     1344
     1345MAYBE, INSPECT:
     1346? https://www.natekore2018.com, [lots of English, but COMMUNITY, CULTURE]
     1347+ http://tatai09.blogspot.com,
     1348+ http://www.twttoa.com,
     1349+ http://tuhua2010.blogspot.com,
     1350X http://www.huapala.org, [misdetected, Hawaiian]
     1351X https://www.vaihaunui.net, [misdetected, Tahiti]
     1352X https://www.kaifineart.com, [art site by different artists. A Chinese and another (possibly Japanese) name were misdetected]
     1353X http://mahoraroom8.blogspot.com,  [NZ school, but main page mostly in English. No pages with > 1 sentence detected as MRI]
     1354+ http://piripi.blogspot.com,
     1355X http://www.hiroa.pf,  [misdetected. Crawled content appears Polynesian not Maori]
     1356X http://korora.econ.yale.edu, [NZ place photo caption]
     1357X https://www.poehalisnami.ua, [mostly Cyrillic, with some NZ or Polynesian names misdetected]
     1358X http://hannas-reiseblog.blogspot.com - one page contained NZ placenames, another had a word misdetected
     1359
     1360
     1361+ https://www.breaker.audio, [audio, with occasional English.]
     1362? https://livestream.com, [video and audio, seems in English, but maybe CULTURAL/COMMUNITY?]
     1363
     1364X https://docs.google.com, timetable with occasional Maori language word
     1365+ https://drive.google.com, https://drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview is an image of Maori number names. But other page on drive.google.com is a NZ certificate or ID (in English) of a person's position.
     1366http://ritusehji.blogspot.com - no page with more than 1 sentence detected. But short string of actual MRI content. Educator blog with pictures and English language content.
     1367
     1368
     1369PINTEREST
     1370+ https://in.pinterest.com/pin/317363104978423418/
     1371  "karakia mo te moana - Google Search | Te Reo Maori Resources | Moana, Powerpoint tips, Google"
     1372? https://za.pinterest.com/pin/524669425310419500/
     1373  Maori Moko | Image | Moko Maori Tattoo & Portraits | TA MOKO | Maori tribe, Maori people, Maori art [COMMUNITY, CULTURE]
     1374[The other pinterest detected as numPagesContainingMRI > 0 was misdetected]
     1375
     1376https://nl.pinterest.com,
     1377https://www.pinterest.jp,
     1378https://www.pinterest.it,
     1379https://www.pinterest.co.uk,
     1380https://www.pinterest.ca,
     1381https://za.pinterest.com,
     1382https://www.pinterest.fr,
     1383https://in.pinterest.com,
     1384
     1385MORE BLOGSPOTS
     1386X http://word-dialect.blogspot.com,  [Indonesian, misdetected]
     1387~ http://atopeconlostopes.blogspot.com, [title on page appears to be in MRI, but content appears to be in English and South/Central American. Internationally focussed content.]
     1388X http://lianzaconference2012.blogspot.com, [NZ placename or institution]
     1389? http://mrshamiltonskoolkidz.blogspot.com, [te reo Maori related school activities. Described in English.]
     1390X http://capsuraotearoa.blogspot.com, [blog in French, photo captions contain NZ placenames]
     1391X http://blogdepasopor.blogspot.com, [blog in French, Rapa Nui/Easter Island related content, misdetected.]
     1392
     1393
     1394UNLIKELY
     1395?? http://naturalfatburner.net, http://naturalfatburner.net/NoNonsenseTed/fatloss-mao/ feels like it's autotranslated, an image of text appears, but the text is in MRI [advertising for some weight loss gimmick]
     1396
     1397
     1398BLACKLIST:
     1399X http://ww25.milfsplease.com,
     1400X http://www.the-naked.com
     1401
     1402OTHER:
     1403X http://seapixonline.com, https://www.seapixonline.com, [photo captions of ships. Sometimes misdetected Japanese words as MRI.]
     1404X http://www.code-postal.com, https://www.code-postal.com, [not more than 1 sentence detected as in MRI]
     1405X https://www.dbnames.net, [Name database, lots misdetected]
     1406
     1407STILL TO DO LIST:
     1408
     1409X https://www.myadsclassified.com, [misdetected 3 short English sentences as MRI]
     1410X http://www.whoisthatr.com, [misdetected short English sentence as MRI]
     1411X https://www.oemsec.com, [autotranslated product site]
     1412X http://svenskadress.net, [linkfarm like site of related junk links, contained URLs misdetected as MRI]
     1413
     1414X https://www.webwiki.com, [contains URLs. URLs containing Aotearoa as substring detected as MRI. But no proper sentence content. ]
     1415X http://mikebonnice.com, [Hawaiian and Tahiti related content misdetected]
     1416X http://www.hudl.com, [misdetected short English sentence as MRI]
     1417X http://www.wikitree.com,  [misdetected short English sentence as MRI]
     1418X http://shuttersportnelson.photoshelter.com, [image captions of "Wairua Warrior"]
     1419
     1420X http://niken8media.logdown.com, [Poker website? Looks autotranslated or Lorem Ipsum type of meaningless sentences.]
     1421X https://www.podrozeady.com, Looks Polish or other East-European language. The NZ page https://www.podrozeady.com/NZ/4/ had placenames detected.
     1422
     1423X http://www.thesalmons.org, [detection and misdetection of author names of papers hosted]
     1424
     1425X http://linkvip.top, [.rar and media file links misdetected as MRI]
     1426
     1427
     1428X http://www.lunar-occultations.com, [NZ place names for astronomical phenomena]
     1429X http://shangrilapress.net, [NZ placenames]
     1430X http://malecek.com, [misdetection CD title]
     1431X https://www.blue-frontiers.com, [Tahitian, Reo Tahiti, misdetected as MRI]
     1432X http://www.whoisentry.com, [URL names, looked at several which were probably misdetected as MRI]
     1433X http://loquevendra318.com, [uses Google translate for auto-translation]
     1434
     1435
     1436?? http://www.forensicfashion.com, [historical information, useful for CULTURE? e.g. http://www.forensicfashion.com/1807MaoriChief.html]
     1437
     1438X http://www.eyecontactsite.com, [Lots of names. And a few short sentences or words possibly in comments.]
     1439X http://eartheum.com, [Rapa Nui, Easter Island related content. Misdetected]
     1440X http://www.steve-wheeler.co.uk, [Blogspot. Title of a single page is in Maori. "Aotearoa ... kei te aroha au ki a koe"]
     1441X https://chromium.googlesource.com, [some source code related to languages' two letter codes]
     1442
     1443X http://www.roadsmile.com, [Lots of misdetection based on word Kia.]
     1444?? https://www.knowatom.com, https://phet.colorado.edu [Similar looking science web sites for children. Uses auto-translation?]
     1445
     1446X https://www.indexmundi.com, [place names. Pages about Solomon Islands. Misdetection of placenames.]
     1447
     1448
     1449
     1450X http://wowwars.net, [Has a page on Kia Kaha meaning, but URL redirects to a different low quality site with bad formatting and adverts. ]
     1451?? https://www.hidroponia.org.mx, [Not sure if https://www.hidroponia.org.mx/index.php/idiomas/284-hydroponics-te-ahurea-wai-maori is autotranslated or not. Can't easily locate existence of Dutch or German translated pages. There's Tamil-Singapore, but no other Tamil. So maybe translations based on target buyer audience?]
     1452X http://www.v3whois.com, [URLs are misdetected as MRI]
     1453X http://rhymebrain.com, [appears to have misdetected a short phrase of 2 words, Kai Kaia, besides phrase words from other languages]
     1454
     1455
     1456X SINGLE SENTENCE DETECTED (NO MORE AND NOT PAGE:)
     1457  http://frontrowphotos.com,
     1458  http://www.pressreader.com,
     1459  https://www.nccri.ie,
     1460  http://takethatvacation.com,
     1461  http://worldradiomap.com,
     1462  http://www.namesdir.com,
     1463
     1464  X http://www.frogsonline.com, [NZ hotels, placenames]
     1465  X http://www.geni.com, [Single sentence misdetection]
     1466  X http://wikiedit.org, [just a list of lots of words, possibly placenames. Some misdetected, e.g. Rapa Nui]
     1467
     1468
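
The query fragment in the hunk around lines 1239-1242 of mongodb.txt combines three conditions: the domain does not match /\.nz/, at least one page contains MRI, and either the geolocation is "AU" or the URL has no language code in its path. The sketch below mirrors that logic in plain JavaScript so it can be tried without a database. It assumes the clauses are joined with a logical AND (the opening of the query is not visible in the hunk). The function name `matchesFilter` and the sample records are made up for illustration; only the field names come from the query.

```javascript
// Hypothetical helper mirroring the three filter clauses visible in the hunk.
// Assumption: the clauses are combined with a logical AND.
function matchesFilter(site) {
  // {domain: {$not: /\.nz/}} -> the regex is unanchored, as in the original
  const outsideNz = !/\.nz/.test(site.domain);
  // {numPagesContainingMRI: {$gt: 0}}
  const hasMri = site.numPagesContainingMRI > 0;
  // {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
  const auOrNoLangCode =
    site.geoLocationCountryCode === "AU" ||
    site.urlContainsLangCodeInPath === false;
  return outsideNz && hasMri && auOrNoLangCode;
}

// Made-up sample records; only the field names are taken from the query.
const sampleSites = [
  { domain: "example.nz", numPagesContainingMRI: 5, geoLocationCountryCode: "NZ", urlContainsLangCodeInPath: false },
  { domain: "example.com", numPagesContainingMRI: 3, geoLocationCountryCode: "AU", urlContainsLangCodeInPath: true },
  { domain: "example.org", numPagesContainingMRI: 2, geoLocationCountryCode: "FR", urlContainsLangCodeInPath: true },
  { domain: "example.net", numPagesContainingMRI: 0, geoLocationCountryCode: "DE", urlContainsLangCodeInPath: false },
  { domain: "example.de", numPagesContainingMRI: 1, geoLocationCountryCode: "DE", urlContainsLangCodeInPath: false },
];

// Domains that pass all three conditions.
const matched = sampleSites.filter(matchesFilter).map((s) => s.domain);
console.log(matched);
```

Because /\.nz/ is unanchored, it also excludes any domain that merely contains ".nz" somewhere in the middle; whether that matters for the real data is not something this sketch can tell.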