Timestamp:
2020-02-13T17:09:07+13:00 (4 years ago)
Author:
ak19
Message:

Shortlisted just the domain sites by country into ManualShortlisting2.txt, taking the reingest into MongoDB into account. Then put all the shortlisted domains for which containsMRI=true (as per manual inspection) into a separate new file.

File:
1 edited

  • other-projects/maori-lang-detection/mongodb-data/ManualShortlisting2.txt

    r33907 r33914  
     3. GRAND TOTALS
     
    -Count per country of web SITES that contain at least 1 web page containing at least 1 genuine MRI sentence:
    -
    +Count per country of web SITES that contain at least 1 web page containing at least 1 genuine MRI sentence. (Number in brackets for overseas is number of sites of that geolocation if nz TLDs were NOT grouped with NZ geolocation under "NZ". Number in brackets for NZ indicates the number of sites that are only of NZ geolocation ignoring nz TLDs hosted overseas. Numbers only present where different from counts of site by geolocation, which is the number indicated out of brackets.)
    +
    +OLD
     countryCode, num manually inspected sites as having pages containing MRI, num sites openNLP detected as having pages containing MRI
    -NZ: 126 actual sites out of 176 detected sites
    -US: 29 actual out of 486 detected sites
    -AU: 2 actual out of 21 detected sites
    +NZ: 126 actual sites out of 176 (89) detected sites
    +US: 29 actual out of 422 (486) detected sites
    +AU: 2 actual out of 5 (21) detected sites
     DE, Germany: 2 actual out of 27 detected sites
     DK, Denmark: 2 out of 8
     BG, Bulgaria: 1 out of 1
     CZ, Czech Republic: 1 out of 4
    -ES, Spain: 1 out of 7
    -FR, France: 1 out of 36
    +ES, Spain: 1 out of 5 (7)
    +FR, France: 1 out of 35 (36)
    +IE, Ireland: 1 out of 2
    +
    +NEW - Adjusted grand totals above with changes to values after reingesting into mongodb (the adjusted values are from section C below). The number in brackets here are the UNIQUE domain names/sites that OpenNLP detected as having pages containing MRI, where different.
    +
    +countryCode, num manually inspected sites as having pages containing MRI, num sites openNLP detected as having pages containing MRI
    +NZ: 124 (113 + 11 non-unique) actual sites out of 176 (159) detected sites
    +US: 32 actual out of 422 (405) detected sites
    +AU: 1 actual out of 5 detected sites
    +DE, Germany: 2 actual out of 26 (24) detected sites
    +DK, Denmark: 2 out of 8
    +BG, Bulgaria: 1 out of 1
    +CZ, Czech Republic: 1 out of 5 (4)
    +ES, Spain: 1 out of 5
    +FR, France: 1 out of 35 (34)
     IE, Ireland: 1 out of 2
     
    …
     
     ========================================
    +Adjusted grand totals in manualShortlisting.txt with the following.
    +
    +----------------------------------------------------------------------
    +C GEOLOCATION CHANGES AFTER REINGESTING UPON INTRODUCING ANGLICAN.ORG:
    +----------------------------------------------------------------------
    +NZ the same as before
    +   NL, DE, FR, DK, ES, GB same
    +   IT, AT, RO, CH, RU, BG, MX, JP, CN, IE, IR, FI same
    +
    +US gained 3:
    ++ anglican.org (NEW)
    +X articles.imperialtometric.com (from CA)
    +X daandehn.com (CA)
    +
    +CA lost 2:
    +X articles.imperialtometric.com (to US)
    +X daandehn.com (to US)
    +
    +AU:
    ++ ! lost kiwiproperty.com (to US - mi in URL path version file!)
    +
    +
    +CZ:
    +X gained viveipcl.com (from UNKNOWN)
    +
    +UNKNOWN:
    +X gained hitiaotera.com from IL
    +
    +IL:
    +X lost one (hitiaotera.com to UNKNOWN)
    +
    +-----------------
    +FINAL SITE COUNT (contain >= 1 page with >= 1 MRI sentence)
    +-----------------
    +DK (2):
    +http://ngapuhiradio.com
    +http://ngapuhitelevision.com
    +    [http://akona.ngapuhitelevision.com
    +    http://waiatarangatiratanga.ngapuhitelevision.com
    +    http://jazz.ngapuhitelevision.com
    +    http://powhiri.ngapuhitelevision.com
    +    http://komisch.ngapuhitelevision.com]
    +
    +DE (2)
    +http://www.udhr.de
    +https://www.cartogiraffe.com
    +
    +AU (1)
    +https://koreromaori.com
    +
    +FR (1)
    +http://chantsdeluttes.free.fr
    +
    +ES (1)
    +https://www.uv.es
    +
    +IE (1)
    +https://coggle.it
    +
    +CZ: (1)
    +http://www.henryklahola.nazory.cz
    +
    +BG: (1)
    +http://anitra.net
    +
    +US finals 31 (33):
    +http://anglican.org
    +http://anglicanhistory.org
    +http://www.unicode.org
    +https://static-promote.weebly.com
    +http://aclhokiangarocks.blogspot.com
    +http://bahaiprayers.net
    +https://biblehub.com
    +http://www.muhammad.com
    +http://www.godrules.net
    +http://m.biblepub.com
    +http://www.krassotkin.ru
    +http://www.gotquestions.org
    +https://maorinews.com
    +http://maaori.com
    +http://kiaorahola.blogspot.com
    +https://kjohnsonnz.blogspot.com
    +http://pumanawawhangara.blogspot.com
    +http://dannykahei.tripod.com
    +http://burkekm001.tripod.com
    +http://tkkpipipaopao.blogspot.com
    +http://manateina.blogspot.com
    +http://tatai09.blogspot.com
    +http://www.twttoa.com
    +http://tuhua2010.blogspot.com
    +http://piripi.blogspot.com
    +https://drive.google.com
    +https://in.pinterest.com
    ++? https://www.breaker.audio [AUDIO]
    ++X http://ritusehji.blogspot.com
    +27 (28)
    +
    +https://www.kiwiproperty.com
    +http://indigenousblogs.com
    +https://mi.m.wikipedia.org [https://mi.wikipedia.org]
    +http://csunplugged.org [includes https://www.csunplugged.org]
    +?~ https://policies.oclc.org
    +
    ++ 4 (5) = 31 (33) incl with MI in URL Path
    +
    +
    +NZ: 113 unique + 11 non-unique
    +http://www.teipukarea.maori.nz
    +http://ngatipahauwera.co.nz
    +http://www.oag.govt.nz
    +https://sexualviolence.victimsinfo.govt.nz
    +http://tmoa.tki.org.nz
    +http://www.tewhanake.maori.nz
    +http://www.matarikifestival.org.nz
    +http://www.otepoti.school.nz
    +https://www.maoritelevision.com
    +http://pukapuka.nz
    +http://community.nzdl.org
    +http://maori.livingheritage.org.nz [http://www.livingheritage.org.nz]
    +http://pukoro.co.nz
    +https://cdn.tehiku.nz [DOMAIN: tehiku.nz]
    +http://www.runanga.co.nz
    +http://kuraaiwi.maori.nz
    +http://kurataiao.tki.org.nz
    +http://satellites.co.nz
    +http://teaohou.natlib.govt.nz
    +http://www.tuwharetoa.iwi.nz
    +https://www.terito.school.nz
    +https://ttw1.cwp.govt.nz
    +https://www.whanau-tahi.school.nz
    +https://e-ako-pangarau.nzmaths.co.nz
    +https://teaomaori.news
    +http://tetaurawhiri.govt.nz
    +https://www.tuiatematangi.ac.nz
    +http://animations.tewhanake.maori.nz
    +https://www.dnc.org.nz
    +http://firstworldwar.tki.org.nz [http://www.firstworldwar.tki.org.nz]
    +http://www.28maoribattalion.org.nz
    +http://www.tewikiotereomaori.co.nz
    +http://www.brettgraham.co.nz
    +https://hepatakakupu.nz
    +http://anglicanprayerbook.nz
    +http://arataua.nz
    +http://maori.tki.org.nz
    +https://paekupu.co.nz
    +https://haereheikaiako.co.nz
    +https://curriculumtool.education.govt.nz
    +http://kurakokiri.maori.nz [includes: http://www.kurakokiri.maori.nz]
    +http://www.kkmmaungarongo.co.nz
    +http://www.heartland.co.nz
    +http://oilcrash.com
    +http://www.kura-porirua.school.nz
    +https://www.sporty.co.nz
    +https://www.tematawai.maori.nz
    +https://www.terakipaewhenua.school.nz
    +http://www.tetaurawhiri.govt.nz
    +http://archive.stats.govt.nz
    +http://tiritiowaitangi.govt.nz
    +http://www.waiata.maori.nz [includes: http://waiata.maori.nz]
    +http://hana.co.nz
    +http://kaupare.co.nz
    +http://www.tereowrap.nz
    +http://www.hrc.co.nz
    +http://ngatiporoukiponeke.org.nz
    +http://rurued.school.nz
    +http://www.twtop.school.nz
    +http://www.huri-translations.pf
    +https://teara.govt.nz [https://admin.teara.govt.nz, http://blog.teara.govt.nz]
    +https://tiritiowaitangi.govt.nz
    +http://www.tmoa.tki.org.nz
    +https://www.komako.org.nz
    +http://www.wcl.govt.nz [included:http://kete.wcl.govt.nz]
    +http://punareo.co.nz
    +https://rapuatearatika.education.govt.nz
    +http://tmmkkm.school.nz
    +http://www.cs.waikato.ac.nz
    +http://www.kupengahao.co.nz
    +https://www.hapuhauora.health.nz
    +http://cms.sunsmartschools.co.nz [http://sunsmartschools.co.nz/]
    +http://kuraproductions.co.nz
    +https://keepourmoneyclean.govt.nz
    +http://www.tekura.school.nz
    +http://www.tkkmmokopuna.school.nz
    +http://hangaraumatihiko.tki.org.nz
    +http://www.pakanae.maori.nz
    +--- 78+9
    +http://holyspirit.nz
    +https://www.ngamanawainc.co.nz [includes http://www.ngamanawainc.co.nz]
    +http://www.finlaysonpark.school.nz
    +http://www.w3vietnam.org.nz [includes http://w3vietnam.org.nz]
    +https://www.takitimu.ac.nz
    +https://kotahimiriona.co.nz
    +https://rehuamarae.co.nz
    +http://reoora.co.nz
    +https://manawatuheritage.pncc.govt.nz
    +http://rsnz.natlib.govt.nz
    +https://www.taitokerautrust.org.nz
    +http://tewikiotereomaori.nz
    +https://www.korokikahukura.co.nz
    +https://www.pinterest.nz
    +https://www.rereahu.maori.nz
    +http://givealittle.co.nz
    +https://kaiiwicamp.nz [includes http://kaiiwicamp.nz]
    +http://ngarauhuia.ngatiapakiterato.iwi.nz
    +https://m.wairarapatv.co.nz
    +http://avonside.net
    +http://www.maoriinvestments.co.nz
    +http://conference.tpwt.maori.nz
    +https://www.puau.school.nz
    +http://tehauora.org.nz
    +http://temahurehure.maori.nz
    +http://www.temarareo.org
    +http://www.tetaumuturunanga.iwi.nz
    +http://www.writersfestival.co.nz
    +http://www.kmk.maori.nz
    +https://www.stats.govt.nz [includes http://archive.stats.govt.nz]
    +---30+4
    ++? http://ngatiwhakaue.iwi.nz
    ++? https://interactives.stuff.co.nz
    ++? http://whatonga.school.nz
    ++? https://player.vimeo.com
    ++? http://southerntribes.co.nz
    +---78+30+(5)=113 unique + 11 non-unique
    +?X https://www.e-agent.nz [includes: https://office.e-agent.nz,http://videos.e-agent.nz]
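
The per-country counts in the FINAL SITE COUNT listing are maintained by hand, so a small script could cross-check each bracketed total against the number of site URLs listed under it. The following is only a minimal sketch, not part of the changeset: the function names `tally_sites` and `mismatches` and the `SAMPLE` text are hypothetical, and the parser only handles the simple header shapes ("DK (2):", "DE (2)", "CZ: (1)"). It would not cope with the US and NZ headers, or with the "+?" and "?X" markers, without extra rules. Indented or bracketed URLs are assumed to be subdomains of the site above and are not counted.

```python
import re
from urllib.parse import urlsplit

# Hypothetical sample in the same shape as the simple country blocks:
# a header such as "DK (2):" followed by one site URL per line.
SAMPLE = """\
DK (2):
http://ngapuhiradio.com
http://ngapuhitelevision.com
    [http://akona.ngapuhitelevision.com
    http://jazz.ngapuhitelevision.com]

DE (2)
http://www.udhr.de
https://www.cartogiraffe.com

CZ: (1)
http://www.henryklahola.nazory.cz
"""

# Matches "DK (2):", "DE (2)" and "CZ: (1)"; other header shapes are ignored.
HEADER = re.compile(r"^([A-Z]{2}):?\s*\((\d+)\):?\s*$")


def tally_sites(text):
    """Return {country: (claimed_count, [hosts])} for unindented URL lines."""
    result = {}
    country = None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            country = match.group(1)
            result[country] = (int(match.group(2)), [])
            continue
        # Assumption: only unindented lines starting with a URL are sites;
        # indented or bracketed lines are subdomains and are skipped.
        if country and line.startswith(("http://", "https://")):
            host = urlsplit(line.split()[0]).hostname
            result[country][1].append(host)
    return result


def mismatches(tally):
    """Return {country: (claimed, listed)} where the two counts differ."""
    return {
        country: (claimed, len(hosts))
        for country, (claimed, hosts) in tally.items()
        if claimed != len(hosts)
    }


tally = tally_sites(SAMPLE)
print(mismatches(tally))
```

The NZ running totals could be checked the same way by hand: 78 + 30 + 5 sums to the 113 unique sites stated in the listing.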