source: other-projects/maori-lang-detection/mongodb-data/random260_results.txt@ 33976

Last change on this file since 33976 was 33976, checked in by ak19, 4 years ago

Adding in what I could remember of Dr Bainbridge's statement about the accuracy and confidence level in the findings from our samples.

257/260 pages detected by OpenNLP as being overall in MRI were confirmed by manual inspection to be genuinely overall in MRI. This is about 98.8%.

Our sample size gives us 90% confidence, with a 5% margin of error, that OpenNLP's 98.8% accuracy rate is representative of all URLs whose pages it detects as being overall in MRI.
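As a rough check of these figures, the sketch below recomputes the accuracy and a worst-case margin of error for a sample of 260 at 90% confidence. It assumes the usual normal approximation for a proportion with p = 0.5 and no finite-population correction; the source does not say which formula was actually used, so this is illustrative only.

```python
from math import sqrt
from statistics import NormalDist

SAMPLE_SIZE = 260   # pages sampled
CONFIRMED = 257     # pages manually confirmed as overall in MRI
CONFIDENCE = 0.90   # confidence level quoted in the text

accuracy = CONFIRMED / SAMPLE_SIZE

# Two-sided z-score for the chosen confidence level (about 1.645 for 90%).
z = NormalDist().inv_cdf(1 - (1 - CONFIDENCE) / 2)

# Assumption: worst-case proportion p = 0.5, no finite-population correction.
margin_of_error = z * sqrt(0.5 * 0.5 / SAMPLE_SIZE)

print(f"accuracy: {accuracy:.1%}")
print(f"margin of error: {margin_of_error:.1%}")
```

Under these assumptions the margin comes out at roughly 5.1%, which is in line with the 5% quoted.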

SUMMARY of the 260 random web page URLs sampled:
================================================
* Only NZ and US had genuine pages in MRI
* 225 pages were from NZ (.nz and NZ origin) and the remaining 35 were from US
* 2 NZ pages were not in NZ MRI (a Rarotongan/Cook Islands Maori page and a Tokelauan page),
  a 3rd had a single sentence in MRI but the rest was links with repeated English anchor text with digit suffixes File###

So 222 NZ pages and 35 US web pages were largely in MRI.
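The arithmetic behind these counts can be checked with a short sketch. The variable names are illustrative only and are not taken from the original data files.

```python
nz_pages = 225            # sampled pages of NZ origin
us_pages = 35             # sampled pages of US origin
nz_not_largely_mri = 3    # Cook Islands Maori page, Tokelauan page, link-text page

nz_largely_mri = nz_pages - nz_not_largely_mri
largely_mri = nz_largely_mri + us_pages

# The two countries together account for the whole sample.
assert nz_pages + us_pages == 260

print(nz_largely_mri, largely_mri)
```

The resulting total of 257 matches the number of pages confirmed as overall in MRI out of the 260 sampled.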

11 unique domains from US (10 if mi.wikipedia and mi.m.wikipedia are counted as one)
34 unique domains from NZ (35 if admin.teara is counted as distinct from teara),
33 unique domains from NZ after further skipping the site whose only page is in Cook Islands Maori.
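How the domain counts shift depending on which hosts are merged can be sketched as follows. The URLs are hypothetical placeholders rather than entries from the sample (the `teara.example` hosts in particular are made up), and dropping the `m` and `admin` labels is an assumption made for illustration, not the procedure used to produce the counts above.

```python
from urllib.parse import urlsplit

# Hypothetical example URLs, not taken from the sampled data.
urls = [
    "https://mi.wikipedia.org/wiki/Example",
    "https://mi.m.wikipedia.org/wiki/Example",
    "https://teara.example/mi/page",
    "https://admin.teara.example/mi/page",
]

# Assumption: host labels that mark a variant of the same site.
MERGED_LABELS = {"m", "admin"}

def host(url):
    """Return the hostname of a URL."""
    return urlsplit(url).hostname

def merged_host(url):
    """Return the hostname with variant labels such as 'm' or 'admin' removed."""
    labels = host(url).split(".")
    return ".".join(label for label in labels if label not in MERGED_LABELS)

distinct = {host(u) for u in urls}
merged = {merged_host(u) for u in urls}

print(len(distinct), "domains when every host is counted separately")
print(len(merged), "domains when variant hosts are merged")
```

Counting with both rules side by side makes it explicit which convention a reported domain total relies on.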



NZ sites with many (>=6) sampled pages in MRI are:
tmoa.tki.org.nz (83)
tetaurawhiri.govt.nz (31)
tiritiowaitangi.govt.nz (17)
pukoro.co.nz (15)
waiata.maori.nz (9)
twtop.school.nz (7)
paekupu.co.nz (6)

Among the US sites, those with >=6 sampled pages in MRI are:
m.biblepub.com (11 pages) and mi.m.wikipedia.org (8), though mi.m.wikipedia.org pages usually have
individual words or short phrases in MRI rather than several contiguous sentences or paragraphs.


123 pages' contents are SIGNIFICANTLY_MAORI
35 pages contain MRI, but it's in NAV (navigation menus) or pictures of non-OCR-ed text, with practically no other text on the page
31 pages have one or more MAORI_PARAGRAPHS, with one or more other paragraphs in other languages
18 pages contain noticeably MIXED_TEXT in MRI and one or more other languages within a single paragraph, set of sentences or single sentence
15 pages contain POEMS_OR_SONGS
15 pages have a SINGLE_MRI_SENTENCE
13 pages have a set of singleton WORDS in MRI (often MRI language learning sites)
4 pages contain LITTLE non-navigation TEXT of any kind
3 pages are categorised as LINK_TEXT
3 pages contain non-nav text in OTHER_LANGUAGES (English, Tokelauan, Cook Islands or Rarotongan Maori)
= 260 sampled web pages
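A quick tally confirms that the category counts above add up to the full sample, and shows each category's share. The dictionary keys reuse the labels from the list; LITTLE_TEXT is an assumed label for the category of pages with little non-navigation text.

```python
category_counts = {
    "SIGNIFICANTLY_MAORI": 123,
    "NAV": 35,
    "MAORI_PARAGRAPHS": 31,
    "MIXED_TEXT": 18,
    "POEMS_OR_SONGS": 15,
    "SINGLE_MRI_SENTENCE": 15,
    "WORDS": 13,
    "LITTLE_TEXT": 4,   # assumed label, see the note above
    "LINK_TEXT": 3,
    "OTHER_LANGUAGES": 3,
}

total = sum(category_counts.values())

# Print each category with its count and its share of the sample.
for label, count in category_counts.items():
    print(f"{label:<20} {count:>3}  {count / total:.1%}")

print(f"{'TOTAL':<20} {total:>3}")
```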