257/260 pages detected by OpenNLP as being overall in MRI were genuinely overall in MRI from manual detection. This is about 98.8%.

Our sample size gives us 90% confidence, with a 5% margin of error, that OpenNLP's 98.8% accuracy rate holds for all URLs whose pages it detects as being overall in MRI.
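The relationship between sample size, confidence level and margin of error can be sketched with Cochran's formula for a proportion. This is an illustration only, not necessarily how the sample size above was chosen: the function name, the conservative default p=0.5 and the optional `population` argument are assumptions, and the size of the full URL set is not stated in these notes.

```python
import math
from statistics import NormalDist


def required_sample_size(confidence, margin, population=None, p=0.5):
    """Cochran's sample-size estimate for a proportion (hypothetical helper).

    confidence: e.g. 0.90 for 90% confidence
    margin:     margin of error, e.g. 0.05 for 5%
    population: total number of items, or None to skip the
                finite population correction
    p:          assumed proportion; 0.5 is the most conservative choice
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    if population is not None:
        # Finite population correction shrinks the required sample.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


# Sample size for 90% confidence and a 5% margin, no population correction.
print(required_sample_size(0.90, 0.05))

# Same settings with a made-up population of 1000, for illustration only.
print(required_sample_size(0.90, 0.05, population=1000))
```

The required size drops as the population gets smaller, so a sample size only makes sense relative to the number of URLs being sampled from.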

Our samples tell us something about precision, not recall; see
https://en.wikipedia.org/wiki/Precision_and_recall
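A minimal sketch of why the sample speaks to precision only, using the 257/260 figure reported at the top of these notes. The function and variable names are illustrative assumptions.

```python
def precision(true_positives, false_positives):
    """Share of pages flagged as MRI that really are MRI."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Share of all genuine MRI pages that were flagged.

    Needs the number of genuine MRI pages the detector missed.
    """
    return true_positives / (true_positives + false_negatives)


tp = 257       # sampled pages confirmed manually as overall in MRI
fp = 260 - tp  # sampled pages flagged by OpenNLP but not overall in MRI

print(f"precision = {precision(tp, fp):.3f}")

# recall(tp, fn) cannot be evaluated from this sample: every sampled page
# was one that OpenNLP flagged, so the number of MRI pages it missed (fn)
# is unknown.
```

Estimating recall would need a separate sample drawn from pages OpenNLP did not flag as MRI.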

SUMMARY of the 260 random web page URLs sampled:
================================================
* Only NZ and US had genuine pages in MRI
* 225 pages were from NZ (.nz and NZ origin) and the remaining 35 from US
* 2 NZ pages were not in NZ MRI (Rarotongan/Cook Islands Maori page, Tokelauan page),
  a 3rd had a single sentence in MRI but the rest of the page consisted of links with repeated English anchor text with digit suffixes (File###)

So 222 NZ pages and 35 US web pages were largely in MRI.

11 unique domains from US (10 if mi.wikipedia and mi.m.wikipedia are counted as one)
34 unique domains from NZ (35 if admin.teara is counted as distinct from teara),
33 unique domains from NZ after further skipping the site whose only page is in Cook Islands Maori.


NZ sites with many (>=6) sampled pages in MRI are:
tmoa.tki.org.nz (83)
tetaurawhiri.govt.nz (31)
tiritiowaitangi.govt.nz (17)
pukoro.co.nz (15)
waiata.maori.nz (9)
twtop.school.nz (7)
paekupu.co.nz (6)

Among the US sites, those with >=6 sampled pages in MRI are:
m.biblepub.com (11 pages) and mi.m.wikipedia.org (8), though mi.m.wikipedia.org pages usually have
individual words or short phrases in MRI rather than several contiguous sentences or paragraphs.


123 pages' contents are SIGNIFICANTLY_MAORI
35 contain MRI, but it's in NAV (navigation menus) or pictures of non-OCR-ed text, with practically no other text on the page
31 pages have one or more MAORI_PARAGRAPHS, with one or more other paragraphs in other languages
18 pages contain noticeably MIXED_TEXT in MRI and one or more other languages within a single paragraph, a set of sentences or a single sentence
15 pages contain POEMS_OR_SONGS
15 pages have a SINGLE_MRI_SENTENCE
13 pages have a set of singleton WORDS in MRI (often MRI language learning sites)
4 contain very LITTLE non-navigation TEXT of any kind
3 LINK_TEXT
3 pages contain non-nav text in OTHER_LANGUAGES (English, Tokelau, Cook Islands or Rarotongan Maori)
= 260 sampled web pages
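The tally above can be checked, and turned into shares of the sample, with a short script. The counts are copied from the list; the dictionary keys follow the capitalised labels, with `LITTLE_TEXT` being an assumed spelling for the "LITTLE ... TEXT" entry.

```python
# Manual content categories and their page counts, as listed above.
category_counts = {
    "SIGNIFICANTLY_MAORI": 123,
    "NAV": 35,
    "MAORI_PARAGRAPHS": 31,
    "MIXED_TEXT": 18,
    "POEMS_OR_SONGS": 15,
    "SINGLE_MRI_SENTENCE": 15,
    "WORDS": 13,
    "LITTLE_TEXT": 4,  # assumed key name
    "LINK_TEXT": 3,
    "OTHER_LANGUAGES": 3,
}

total = sum(category_counts.values())

# Percentage of the sample that falls in each category.
shares = {name: 100 * count / total for name, count in category_counts.items()}

for name, count in category_counts.items():
    print(f"{name:22s} {count:4d}  {shares[name]:5.1f}%")
print(f"{'TOTAL':22s} {total:4d}")
```

Summing the counts is a quick consistency check that every sampled page was assigned exactly one category.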