- Timestamp: 2020-02-13T17:09:07+13:00
- File: 1 edited
other-projects/maori-lang-detection/mongodb-data/ManualShortlisting2.txt
--- other-projects/maori-lang-detection/mongodb-data/ManualShortlisting2.txt (r33907)
+++ other-projects/maori-lang-detection/mongodb-data/ManualShortlisting2.txt (r33914)
@@ -2008,16 +2008,31 @@
 3. GRAND TOTALS
 
-Count per country of web SITES that contain at least 1 web page containing at least 1 genuine MRI sentence:
-
+Count per country of web SITES that contain at least 1 web page containing at least 1 genuine MRI sentence. (Number in brackets for overseas is number of sites of that geolocation if nz TLDs were NOT grouped with NZ geolocation under "NZ". Number in brackets for NZ indicates the number of sites that are only of NZ geolocation ignoring nz TLDs hosted overseas. Numbers only present where different from counts of site by geolocation, which is the number indicated out of brackets.)
+
+OLD
 countryCode, num manually inspected sites as having pages containing MRI, num sites openNLP detected as having pages containing MRI
-NZ: 126 actual sites out of 176 detected sites
-US: 29 actual out of 486 detected sites
-AU: 2 actual out of 21 detected sites
+NZ: 126 actual sites out of 176 (89) detected sites
+US: 29 actual out of 422 (486) detected sites
+AU: 2 actual out of 5 (21) detected sites
 DE, Germany: 2 actual out of 27 detected sites
 DK, Denmark: 2 out of 8
 BG, Bulgaria: 1 out of 1
 CZ, Czech Republic: 1 out of 4
-ES, Spain: 1 out of 7
-FR, France: 1 out of 36
+ES, Spain: 1 out of 5 (7)
+FR, France: 1 out of 35 (36)
+IE, Ireland: 1 out of 2
+
+NEW - Adjusted grand totals above with changes to values after reingesting into mongodb (the adjusted values are from section C below). The number in brackets here are the UNIQUE domain names/sites that OpenNLP detected as having pages containing MRI, where different.
+
+countryCode, num manually inspected sites as having pages containing MRI, num sites openNLP detected as having pages containing MRI
+NZ: 124 (113 + 11 non-unique) actual sites out of 176 (159) detected sites
+US: 32 actual out of 422 (405) detected sites
+AU: 1 actual out of 5 detected sites
+DE, Germany: 2 actual out of 26 (24) detected sites
+DK, Denmark: 2 out of 8
+BG, Bulgaria: 1 out of 1
+CZ, Czech Republic: 1 out of 5 (4)
+ES, Spain: 1 out of 5
+FR, France: 1 out of 35 (34)
 IE, Ireland: 1 out of 2
 
@@ -2026,2 +2041,226 @@
 
 ========================================
+Adjusted grand totals in manualShortlisting.txt with the following.
+
+----------------------------------------------------------------------
+C GEOLOCATION CHANGES AFTER REINGESTING UPON INTRODUCING ANGLICAN.ORG:
+----------------------------------------------------------------------
+NZ the same as before
+NL, DE, FR, DK, ES, GB same
+IT, AT, RO, CH, RU, BG, MX, JP, CN, IE, IR, FI same
+
+US gained 3:
++ anglican.org (NEW)
+X articles.imperialtometric.com (from CA)
+X daandehn.com (CA)
+
+CA lost 2:
+X articles.imperialtometric.com (to US)
+X daandehn.com (to US)
+
+AU:
++ ! lost kiwiproperty.com (to US - mi in URL path version file!)
+
+
+CZ:
+X gained viveipcl.com (from UNKNOWN)
+
+UNKNOWN:
+X gained hitiaotera.com from IL
+
+IL:
+X lost one (hitiaotera.com to UNKNOWN)
+
+-----------------
+FINAL SITE COUNT (contain >= 1 page with >= 1 MRI sentence)
+-----------------
+DK (2):
+http://ngapuhiradio.com
+http://ngapuhitelevision.com
+[http://akona.ngapuhitelevision.com
+http://waiatarangatiratanga.ngapuhitelevision.com
+http://jazz.ngapuhitelevision.com
+http://powhiri.ngapuhitelevision.com
+http://komisch.ngapuhitelevision.com]
+
+DE (2)
+http://www.udhr.de
+https://www.cartogiraffe.com
+
+AU (1)
+https://koreromaori.com
+
+FR (1)
+http://chantsdeluttes.free.fr
+
+ES (1)
+https://www.uv.es
+
+IE (1)
+https://coggle.it
+
+CZ: (1)
+http://www.henryklahola.nazory.cz
+
+BG: (1)
+http://anitra.net
+
+US finals 31 (33):
+http://anglican.org
+http://anglicanhistory.org
+http://www.unicode.org
+https://static-promote.weebly.com
+http://aclhokiangarocks.blogspot.com
+http://bahaiprayers.net
+https://biblehub.com
+http://www.muhammad.com
+http://www.godrules.net
+http://m.biblepub.com
+http://www.krassotkin.ru
+http://www.gotquestions.org
+https://maorinews.com
+http://maaori.com
+http://kiaorahola.blogspot.com
+https://kjohnsonnz.blogspot.com
+http://pumanawawhangara.blogspot.com
+http://dannykahei.tripod.com
+http://burkekm001.tripod.com
+http://tkkpipipaopao.blogspot.com
+http://manateina.blogspot.com
+http://tatai09.blogspot.com
+http://www.twttoa.com
+http://tuhua2010.blogspot.com
+http://piripi.blogspot.com
+https://drive.google.com
+https://in.pinterest.com
++? https://www.breaker.audio [AUDIO]
++X http://ritusehji.blogspot.com
+27 (28)
+
+https://www.kiwiproperty.com
+http://indigenousblogs.com
+https://mi.m.wikipedia.org [https://mi.wikipedia.org]
+http://csunplugged.org [includes https://www.csunplugged.org]
+?~ https://policies.oclc.org
+
++ 4 (5) = 31 (33) incl with MI in URL Path
+
+
+NZ: 113 unique + 11 non-unique
+http://www.teipukarea.maori.nz
+http://ngatipahauwera.co.nz
+http://www.oag.govt.nz
+https://sexualviolence.victimsinfo.govt.nz
+http://tmoa.tki.org.nz
+http://www.tewhanake.maori.nz
+http://www.matarikifestival.org.nz
+http://www.otepoti.school.nz
+https://www.maoritelevision.com
+http://pukapuka.nz
+http://community.nzdl.org
+http://maori.livingheritage.org.nz [http://www.livingheritage.org.nz]
+http://pukoro.co.nz
+https://cdn.tehiku.nz [DOMAIN: tehiku.nz]
+http://www.runanga.co.nz
+http://kuraaiwi.maori.nz
+http://kurataiao.tki.org.nz
+http://satellites.co.nz
+http://teaohou.natlib.govt.nz
+http://www.tuwharetoa.iwi.nz
+https://www.terito.school.nz
+https://ttw1.cwp.govt.nz
+https://www.whanau-tahi.school.nz
+https://e-ako-pangarau.nzmaths.co.nz
+https://teaomaori.news
+http://tetaurawhiri.govt.nz
+https://www.tuiatematangi.ac.nz
+http://animations.tewhanake.maori.nz
+https://www.dnc.org.nz
+http://firstworldwar.tki.org.nz [http://www.firstworldwar.tki.org.nz]
+http://www.28maoribattalion.org.nz
+http://www.tewikiotereomaori.co.nz
+http://www.brettgraham.co.nz
+https://hepatakakupu.nz
+http://anglicanprayerbook.nz
+http://arataua.nz
+http://maori.tki.org.nz
+https://paekupu.co.nz
+https://haereheikaiako.co.nz
+https://curriculumtool.education.govt.nz
+http://kurakokiri.maori.nz [includes: http://www.kurakokiri.maori.nz]
+http://www.kkmmaungarongo.co.nz
+http://www.heartland.co.nz
+http://oilcrash.com
+http://www.kura-porirua.school.nz
+https://www.sporty.co.nz
+https://www.tematawai.maori.nz
+https://www.terakipaewhenua.school.nz
+http://www.tetaurawhiri.govt.nz
+http://archive.stats.govt.nz
+http://tiritiowaitangi.govt.nz
+http://www.waiata.maori.nz [includes: http://waiata.maori.nz]
+http://hana.co.nz
+http://kaupare.co.nz
+http://www.tereowrap.nz
+http://www.hrc.co.nz
+http://ngatiporoukiponeke.org.nz
+http://rurued.school.nz
+http://www.twtop.school.nz
+http://www.huri-translations.pf
+https://teara.govt.nz [https://admin.teara.govt.nz, http://blog.teara.govt.nz]
+https://tiritiowaitangi.govt.nz
+http://www.tmoa.tki.org.nz
+https://www.komako.org.nz
+http://www.wcl.govt.nz [included:http://kete.wcl.govt.nz]
+http://punareo.co.nz
+https://rapuatearatika.education.govt.nz
+http://tmmkkm.school.nz
+http://www.cs.waikato.ac.nz
+http://www.kupengahao.co.nz
+https://www.hapuhauora.health.nz
+http://cms.sunsmartschools.co.nz [http://sunsmartschools.co.nz/]
+http://kuraproductions.co.nz
+https://keepourmoneyclean.govt.nz
+http://www.tekura.school.nz
+http://www.tkkmmokopuna.school.nz
+http://hangaraumatihiko.tki.org.nz
+http://www.pakanae.maori.nz
+--- 78+9
+http://holyspirit.nz
+https://www.ngamanawainc.co.nz [includes http://www.ngamanawainc.co.nz]
+http://www.finlaysonpark.school.nz
+http://www.w3vietnam.org.nz [includes http://w3vietnam.org.nz]
+https://www.takitimu.ac.nz
+https://kotahimiriona.co.nz
+https://rehuamarae.co.nz
+http://reoora.co.nz
+https://manawatuheritage.pncc.govt.nz
+http://rsnz.natlib.govt.nz
+https://www.taitokerautrust.org.nz
+http://tewikiotereomaori.nz
+https://www.korokikahukura.co.nz
+https://www.pinterest.nz
+https://www.rereahu.maori.nz
+http://givealittle.co.nz
+https://kaiiwicamp.nz [includes http://kaiiwicamp.nz]
+http://ngarauhuia.ngatiapakiterato.iwi.nz
+https://m.wairarapatv.co.nz
+http://avonside.net
+http://www.maoriinvestments.co.nz
+http://conference.tpwt.maori.nz
+https://www.puau.school.nz
+http://tehauora.org.nz
+http://temahurehure.maori.nz
+http://www.temarareo.org
+http://www.tetaumuturunanga.iwi.nz
+http://www.writersfestival.co.nz
+http://www.kmk.maori.nz
+https://www.stats.govt.nz [includes http://archive.stats.govt.nz]
+---30+4
++? http://ngatiwhakaue.iwi.nz
++? https://interactives.stuff.co.nz
++? http://whatonga.school.nz
++? https://player.vimeo.com
++? http://southerntribes.co.nz
+---78+30+(5)=113 unique + 11 non-unique
+?X https://www.e-agent.nz [includes: https://office.e-agent.nz,http://videos.e-agent.nz]
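
The running tallies in the NZ and US site lists ("78+30+(5)=113 unique + 11 non-unique" and "27 (28) ... + 4 (5) = 31 (33)") can be cross-checked with a few lines of Python. This is only a sketch: the variable names are hypothetical and the numbers are copied by hand from the tallies in the changed file, not read from it.

```python
# Subtotals copied by hand from the running tallies in the site lists.
# All variable names are illustrative only.
nz_unique_blocks = [78, 30, 5]   # "78+30+(5)"
nz_non_unique = 11               # "+ 11 non-unique"
us_blocks = [27, 4]              # "27 (28)" and "+ 4 (5)", unbracketed counts
us_blocks_bracketed = [28, 5]    # the bracketed counts from the same two tallies

nz_unique = sum(nz_unique_blocks)
nz_total = nz_unique + nz_non_unique
us_total = sum(us_blocks)
us_total_bracketed = sum(us_blocks_bracketed)

print(f"NZ: {nz_total} ({nz_unique} + {nz_non_unique} non-unique)")
print(f"US finals {us_total} ({us_total_bracketed})")
```

The printed NZ figure matches the adjusted grand total "NZ: 124 (113 + 11 non-unique)", and the US figure matches the list heading "US finals 31 (33)".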