Timestamp:
2020-02-13T17:09:07+13:00
Author:
ak19
Message:

Shortlisted just the domain sites, by country, into ManualShortlist2.txt after taking the re-ingest into MongoDB into account. Then put all the shortlisted domains for which containsMRI=true, as per manual inspection, into a separate new file.

File:
1 edited

  • other-projects/maori-lang-detection/MoreReading/mongodb.txt

    r33913 r33914  
    11031103- RUSSIA: https://www.gismeteo.lv - misidentification of an email address
    11041104- JAPAN: http://yutaka.it-n.jp - many pages of scientific names of (plants?) which are often misdetected as MRI
    1105 !! - Ireland, ie: https://coggle.it
     1105!! - IRELAND, IE: https://coggle.it
    11061106- IRAN: https://www.dideo.ir/v/yt/d6cgya0ze-E - video title from MaoriTelevision website
    11071107- CZECH republic:
     
    13711371X https://docs.google.com, timetable with occasional Maori language word
    13721372+ https://drive.google.com, https://drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview is an image of Maori number names. But other page on drive.google.com is a NZ certificate or ID (in English) of a person's position.
    1373 http://ritusehji.blogspot.com - no page with more than 1 sentence detected. But short string of actual MRI content. Educator blog with pictures and English language content.
     1373~+ http://ritusehji.blogspot.com - no page with more than 1 sentence detected. But short string of actual MRI content. Educator blog with pictures and English language content.
    13741374
    13751375
     
    15411541X https://mi.lawyers.cafe - autotranslated
    15421542    X https://mi.centr-zashity.ru - same as lawyers.cafe above: autotranslated
    1543 ! https://policies.oclc.org - not completely translated. Copyright page, privacy statement and cookie statement pages appear to be in Maori. Not sure if autotranslated since other pages aren't available in MI. Dutch equivalent pages seem human translated.
     1543~! https://policies.oclc.org - not completely translated. Copyright page, privacy statement and cookie statement pages appear to be in Maori. Not sure if autotranslated since other pages aren't available in MI. Dutch equivalent pages seem human translated.
    15441544X http://jobdescriptionsample.org - autotranslated
    15451545X http://mi.broadcastbeat.com - autotranslated product site
     
    16191619   IT, AT, RO, CH, RU, BG, MX, JP, CN, IE, IR, FI same
    16201620
    1621 US gained 3:
    1622 anglican.org (NEW)
    1623 articles.imperialtometric.com (from CA)
    1624 daandehn.com (CA)
     1621US gained 3 + 1 from mi in URL path:
     1622+ anglican.org (NEW)
     1623X articles.imperialtometric.com (from CA)
     1624X daandehn.com (from CA)
     1625+ kiwiproperty.com (from AU)
    16251626
    16261627CA lost 2:
    1627 articles.imperialtometric.com (to US)
    1628 daandehn.com (to US)
     1628X articles.imperialtometric.com (to US)
     1629X daandehn.com (to US)
    16291630
    16301631AU:
    1631 lost kiwiproperty.com (to US - mi in URL path version file!)
     1632! lost kiwiproperty.com (to US - mi in URL path version file!)
    16321633
    16331634
    16341635CZ:
    1635 gained viveipcl.com (from UNKNOWN)
     1636X gained viveipcl.com (from UNKNOWN)
    16361637
    16371638UNKNOWN:
    1638 gained hitiaotera.com from IL
     1639X gained hitiaotera.com from IL
    16391640
    16401641IL:
    1641 lost one to (UNKNOWN)
    1642 
     1642X lost one (hitiaotera.com to UNKNOWN)
     1643
     1644
     1645FINAL SITE COUNT (contain >= 1 page with >= 1 MRI sentence)
     1646
     1647DK:
     1648http://ngapuhiradio.com
     1649http://ngapuhitelevision.com
     1650    [http://akona.ngapuhitelevision.com
     1651    http://waiatarangatiratanga.ngapuhitelevision.com
     1652    http://jazz.ngapuhitelevision.com
     1653    http://powhiri.ngapuhitelevision.com
     1654    http://komisch.ngapuhitelevision.com]
     1655
     1656DE
     1657http://www.udhr.de
     1658https://www.cartogiraffe.com/
     1659
     1660AU
     1661https://koreromaori.com
     1662(https://infogram.com/)
     1663
     1664FR
     1665http://chantsdeluttes.free.fr/
     1666
     1667ES
     1668https://www.uv.es/
     1669
     1670IE
     1671https://coggle.it
     1672
     1673CZ:
     1674http://www.henryklahola.nazory.cz
     1675
     1676BG:
     1677http://anitra.net/
     1678
     1679US finals:
     1680http://anglican.org
     1681http://anglicanhistory.org
     1682http://www.unicode.org
     1683https://static-promote.weebly.com
     1684http://aclhokiangarocks.blogspot.com
     1685http://bahaiprayers.net
     1686https://biblehub.com
     1687http://www.muhammad.com
     1688http://www.godrules.net
     1689http://m.biblepub.com
     1690http://www.krassotkin.ru
     1691http://www.gotquestions.org
     1692https://maorinews.com
     1693http://maaori.com
     1694http://kiaorahola.blogspot.com
     1695https://kjohnsonnz.blogspot.com
     1696http://pumanawawhangara.blogspot.com
     1697http://dannykahei.tripod.com
     1698http://burkekm001.tripod.com
     1699http://tkkpipipaopao.blogspot.com
     1700http://manateina.blogspot.com
     1701http://tatai09.blogspot.com
     1702http://www.twttoa.com
     1703http://tuhua2010.blogspot.com
     1704http://piripi.blogspot.com
     1705https://www.breaker.audio
     1706https://drive.google.com
     1707http://ritusehji.blogspot.com
     1708https://in.pinterest.com
     1709
     171029
     1711
     1712https://www.kiwiproperty.com
     1713http://indigenousblogs.com
     1714https://mi.m.wikipedia.org, https://mi.wikipedia.org
     1715http://csunplugged.org, https://www.csunplugged.org
     1716(https://policies.oclc.org)
     1717
     171834 incl with MI in URL Path
     1719
     1720
     1721---------------------
     1722NZ:
     1723    http://www.teipukarea.maori.nz
     1724        http://ngatipahauwera.co.nz
     1725        http://www.oag.govt.nz
     1726        https://sexualviolence.victimsinfo.govt.nz
     1727        http://tmoa.tki.org.nz
     1728        http://www.tewhanake.maori.nz
     1729        http://www.matarikifestival.org.nz
     1730        http://www.otepoti.school.nz
     1731        https://www.maoritelevision.com
     1732        http://pukapuka.nz
     1733        http://community.nzdl.org
     1734        http://maori.livingheritage.org.nz [http://www.livingheritage.org.nz]
     1735        http://pukoro.co.nz
     1736    https://cdn.tehiku.nz [DOMAIN: tehiku.nz]
     1737        http://www.runanga.co.nz
     1738        http://kuraaiwi.maori.nz
     1739        http://kurataiao.tki.org.nz
     1740        http://satellites.co.nz
     1741        http://teaohou.natlib.govt.nz
     1742        http://www.tuwharetoa.iwi.nz
     1743        https://www.terito.school.nz
     1744        https://ttw1.cwp.govt.nz
     1745        https://www.whanau-tahi.school.nz
     1746        https://e-ako-pangarau.nzmaths.co.nz
     1747        https://teaomaori.news
     1748        http://tetaurawhiri.govt.nz
     1749        https://www.tuiatematangi.ac.nz
     1750        http://animations.tewhanake.maori.nz
     1751        https://www.dnc.org.nz
     1752        http://firstworldwar.tki.org.nz [http://www.firstworldwar.tki.org.nz]
     1753        http://www.28maoribattalion.org.nz
     1754        http://www.tewikiotereomaori.co.nz
     1755        http://www.brettgraham.co.nz
     1756        https://hepatakakupu.nz
     1757    http://anglicanprayerbook.nz
     1758        http://arataua.nz
     1759        http://maori.tki.org.nz
     1760        https://paekupu.co.nz
     1761        https://haereheikaiako.co.nz
     1762        https://curriculumtool.education.govt.nz
     1763        http://kurakokiri.maori.nz [includes: http://www.kurakokiri.maori.nz]
     1764        http://www.kkmmaungarongo.co.nz
     1765        http://www.heartland.co.nz
     1766        http://oilcrash.com
     1767        http://www.kura-porirua.school.nz
     1768        https://www.sporty.co.nz
     1769        https://www.tematawai.maori.nz
     1770        https://www.terakipaewhenua.school.nz
     1771        http://www.tetaurawhiri.govt.nz
     1772        http://archive.stats.govt.nz
     1773        http://tiritiowaitangi.govt.nz
     1774        http://www.waiata.maori.nz [includes: http://waiata.maori.nz]
     1775        http://hana.co.nz
     1776        http://kaupare.co.nz
     1777        http://www.tereowrap.nz
     1778        http://www.hrc.co.nz
     1779        http://ngatiporoukiponeke.org.nz
     1780        http://rurued.school.nz
     1781        http://www.twtop.school.nz
     1782        http://www.huri-translations.pf
     1783        https://teara.govt.nz/ [https://admin.teara.govt.nz, http://blog.teara.govt.nz]
     1784        https://tiritiowaitangi.govt.nz
     1785        http://www.tmoa.tki.org.nz
     1786        https://www.komako.org.nz
     1787        http://www.wcl.govt.nz [included: http://kete.wcl.govt.nz]       
     1788        http://punareo.co.nz
     1789        https://rapuatearatika.education.govt.nz
     1790        http://tmmkkm.school.nz
     1791        http://www.cs.waikato.ac.nz
     1792        http://www.kupengahao.co.nz
     1793        https://www.hapuhauora.health.nz
     1794        http://cms.sunsmartschools.co.nz [http://sunsmartschools.co.nz/]
     1795        http://kuraproductions.co.nz
     1796        https://keepourmoneyclean.govt.nz
     1797        http://www.tekura.school.nz
     1798        http://www.tkkmmokopuna.school.nz
     1799        http://hangaraumatihiko.tki.org.nz
     1800        http://www.pakanae.maori.nz
     1801
     1802
     1803    http://holyspirit.nz
     1804    https://www.ngamanawainc.co.nz, [includes http://www.ngamanawainc.co.nz]
     1805    http://www.finlaysonpark.school.nz
     1806    http://www.w3vietnam.org.nz [includes http://w3vietnam.org.nz]
     1807    https://www.takitimu.ac.nz
     1808        https://kotahimiriona.co.nz
     1809        https://rehuamarae.co.nz
     1810        http://reoora.co.nz
     1811
     1812        https://manawatuheritage.pncc.govt.nz
     1813        http://rsnz.natlib.govt.nz
     1814        https://www.taitokerautrust.org.nz
     1815        http://tewikiotereomaori.nz
     1816        https://www.korokikahukura.co.nz
     1817        https://www.pinterest.nz
     1818        https://www.rereahu.maori.nz
     1819        http://givealittle.co.nz
     1820        https://kaiiwicamp.nz [includes http://kaiiwicamp.nz]
     1821        http://ngarauhuia.ngatiapakiterato.iwi.nz
     1822        https://m.wairarapatv.co.nz
     1823
     1824        http://avonside.net
     1825        http://www.maoriinvestments.co.nz
     1826        http://conference.tpwt.maori.nz
     1827        https://www.puau.school.nz
     1828        http://tehauora.org.nz
     1829
     1830        http://temahurehure.maori.nz
     1831        http://www.temarareo.org
     1832        http://www.tetaumuturunanga.iwi.nz
     1833        http://www.writersfestival.co.nz
     1834        http://www.kmk.maori.nz
     1835        https://www.stats.govt.nz [includes http://archive.stats.govt.nz]
     1836
     1837+?       http://ngatiwhakaue.iwi.nz
     1838+?       https://interactives.stuff.co.nz
     1839+?       http://whatonga.school.nz
     1840+?       https://player.vimeo.com
     1841+?       http://southerntribes.co.nz
     1842
     1843?X      https://www.e-agent.nz [includes: https://office.e-agent.nz, http://videos.e-agent.nz]