257/260 pages detected by OpenNLP as being overall in MRI were confirmed by manual inspection to be genuinely overall in MRI. This is about 98.8%.

With a sample of 260 pages, we can be 90% confident, within a 5% margin of error, that OpenNLP's 98.8% accuracy rate holds for all URLs whose pages it detects as being overall in MRI.
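
As a rough cross-check of the confidence statement above, the following sketch uses the standard normal approximation for a proportion. It assumes the worst-case proportion p = 0.5, since the text does not say how the sample size was chosen, and all function names are illustrative only.

```python
import math
from statistics import NormalDist
from typing import Optional


def z_score(confidence: float) -> float:
    """Two-sided critical value of the standard normal for a confidence level."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def worst_case_margin(sample_size: int, confidence: float) -> float:
    """Margin of error for a proportion, assuming the worst case p = 0.5."""
    return z_score(confidence) * math.sqrt(0.25 / sample_size)


def required_sample_size(confidence: float, margin: float,
                         population: Optional[int] = None) -> int:
    """Sample size for a target margin, again assuming p = 0.5.

    If a population size is given (an assumption, it is not stated in the
    text), the usual finite population correction is applied.
    """
    n = z_score(confidence) ** 2 * 0.25 / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


# Figures taken from the text: 257 of 260 sampled pages, 90% confidence.
print(f"observed rate: {257 / 260:.1%}")
print(f"worst-case margin for n=260: {worst_case_margin(260, 0.90):.1%}")
```

Under these assumptions, 260 pages give a worst-case margin of roughly 5.1% at 90% confidence, in line with the 5% quoted above. `required_sample_size(0.90, 0.05)` can be used the other way round to size a future sample.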

Our sample tells us about precision, not recall; see https://en.wikipedia.org/wiki/Precision_and_recall
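
To make the precision/recall distinction concrete, here is a minimal sketch using the sample figures. The sample only contains pages that OpenNLP flagged as MRI, so false negatives (MRI pages it missed) are unknown; the sketch represents that with `None`. The function names are illustrative, not part of any tool used here.

```python
from typing import Optional


def precision(true_positives: int, false_positives: int) -> float:
    """Share of pages flagged as MRI that really are MRI."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives: int, false_negatives: Optional[int]) -> Optional[float]:
    """Share of all real MRI pages that were flagged; None if unknowable."""
    if false_negatives is None:
        return None
    return true_positives / (true_positives + false_negatives)


# From the text: 260 flagged pages were checked by hand, 257 were genuine.
TRUE_POSITIVES = 257
FALSE_POSITIVES = 260 - TRUE_POSITIVES

sample_precision = precision(TRUE_POSITIVES, FALSE_POSITIVES)
# Pages that OpenNLP did NOT flag were never sampled, so recall is unknown.
sample_recall = recall(TRUE_POSITIVES, None)

print(f"precision: {sample_precision:.1%}")
print(f"recall: {sample_recall}")
```

Estimating recall would need a separate sample drawn from pages that were not detected as MRI.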

SUMMARY of the 260 random web page URLs sampled:
================================================
* Only NZ and US had genuine pages in MRI
* 225 pages were from NZ (.nz or NZ origin) and the remaining 35 were from the US
* 2 NZ pages were not in NZ MRI (a Rarotongan/Cook Islands Maori page and a Tokelauan page);
  a 3rd had a single sentence in MRI, but the rest of it was links with repeated English anchor text with digit suffixes (File###)

So 222 NZ pages and 35 US web pages were largely in MRI.

11 unique domains from US (10 if mi.wikipedia and mi.m.wikipedia are counted as one)
34 unique domains from NZ (35 if admin.teara is counted as distinct from teara),
33 unique domains from NZ after further skipping the site whose only page is in Cook Islands Maori.
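
The way the domain counts shift depending on whether mobile subdomains are merged can be sketched as follows. This is only an illustration: the URL paths are made-up placeholders, and the rule of dropping an `m` label from the hostname is an assumption about how "counted as one" might be implemented.

```python
from typing import Iterable, Set
from urllib.parse import urlsplit


def domain_of(url: str, merge_mobile: bool = False) -> str:
    """Return the hostname of a URL, optionally dropping a mobile 'm' label."""
    host = urlsplit(url).hostname or ""
    if merge_mobile:
        # Assumed rule: mi.m.wikipedia.org -> mi.wikipedia.org
        host = ".".join(label for label in host.split(".") if label != "m")
    return host


def unique_domains(urls: Iterable[str], merge_mobile: bool = False) -> Set[str]:
    """Set of distinct hostnames across the given URLs."""
    return {domain_of(url, merge_mobile) for url in urls}


# Hypothetical sample URLs; only the hostnames come from the text.
example_urls = [
    "https://mi.wikipedia.org/wiki/Example",
    "https://mi.m.wikipedia.org/wiki/Example",
    "https://m.biblepub.com/example",
]

print(len(unique_domains(example_urls)))
print(len(unique_domains(example_urls, merge_mobile=True)))
```

With the three placeholder URLs, the two Wikipedia hostnames count separately by default and collapse into one when `merge_mobile=True`.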


NZ sites with many (>=6) sampled pages in MRI are:
tmoa.tki.org.nz (83)
tetaurawhiri.govt.nz (31)
tiritiowaitangi.govt.nz (17)
pukoro.co.nz (15)
waiata.maori.nz (9)
twtop.school.nz (7)
paekupu.co.nz (6)

Among the US sites, those with >=6 sampled pages in MRI are:
m.biblepub.com (11 pages) and mi.m.wikipedia.org (8), though mi.m.wikipedia.org pages usually have
individual words or short phrases in MRI rather than several contiguous sentences or paragraphs.

123 pages' contents are SIGNIFICANTLY_MAORI
35 contain MRI, but only in NAV (navigation menus) or in pictures of non-OCR-ed text, with practically no other text on the page
31 pages have one or more MAORI_PARAGRAPHS, with one or more other paragraphs in other languages
18 pages contain noticeably MIXED_TEXT, in MRI and one or more other languages within a single paragraph, set of sentences or single sentence
15 pages contain POEMS_OR_SONGS
15 pages have a SINGLE_MRI_SENTENCE
13 pages have a set of singleton WORDS in MRI (often MRI language learning sites)
4 contain LITTLE non-navigation TEXT of any kind
3 LINK_TEXT
3 pages contain non-nav text in OTHER_LANGUAGES (English, Tokelauan, Cook Islands or Rarotongan Maori)
= 260 sampled web pages
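
The tallies in this summary can be checked with a few lines of arithmetic. The sketch below copies the counts from the text; the dictionary keys are shorthand, and `LITTLE_TEXT` in particular is an assumed label for the "LITTLE ... TEXT" category.

```python
# Category counts as listed in the summary.
category_counts = {
    "SIGNIFICANTLY_MAORI": 123,
    "NAV": 35,
    "MAORI_PARAGRAPHS": 31,
    "MIXED_TEXT": 18,
    "POEMS_OR_SONGS": 15,
    "SINGLE_MRI_SENTENCE": 15,
    "WORDS": 13,
    "LITTLE_TEXT": 4,  # assumed label, see note above
    "LINK_TEXT": 3,
    "OTHER_LANGUAGES": 3,
}

total_pages = sum(category_counts.values())

# Country split from the summary: 225 NZ + 35 US sampled pages,
# of which 3 NZ pages were not largely in MRI.
nz_pages, us_pages, nz_not_mri = 225, 35, 3
largely_mri = (nz_pages - nz_not_mri) + us_pages

# Share of each category among all sampled pages.
category_shares = {name: count / total_pages
                   for name, count in category_counts.items()}

print(total_pages, largely_mri)
for name, share in category_shares.items():
    print(f"{name}: {share:.1%}")
```

Both totals agree with the text: the categories add up to the 260 sampled pages, and 222 NZ plus 35 US pages give the 257 pages counted as genuinely in MRI.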