source: other-projects/maori-lang-detection/mongodb-data/random260_results.txt@ 33977

Last change on this file since 33977 was 33977, checked in by ak19, 4 years ago

Added something on precision vs recall being applicable to our sampling and results. Dr Bainbridge brought up precision and recall and how one applied to our sample's situation.

File size: 2.3 KB
257 of the 260 sampled pages that OpenNLP detected as being overall in MRI were confirmed by manual inspection to be genuinely overall in MRI. This is about 98.8%.

With 90% confidence and a 5% margin of error, our sample size lets us take OpenNLP's 98.8% accuracy rate as representative of all URLs whose pages it detects as being overall in MRI.

Our sample tells us something about precision, not recall: every sampled page was one that OpenNLP had already detected as MRI, so we measured how many of its detections were correct, not how many MRI pages it missed. See
https://en.wikipedia.org/wiki/Precision_and_recall
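
As a rough check on the figures above, the sketch below recomputes the precision from the 257/260 count and estimates the margin of error that a sample of 260 gives at 90% confidence. It is only an illustration under stated assumptions: a normal approximation, the worst-case proportion p = 0.5, z = 1.645 for 90% confidence, and no finite-population correction (the total number of detected URLs is not given here). How the 5% figure was originally derived is not recorded in this file. The function names are made up for this example.

```python
import math


def precision(true_positives: int, flagged: int) -> float:
    """Fraction of flagged items that were genuinely positive."""
    return true_positives / flagged


def margin_of_error(n: int, z: float, p: float = 0.5) -> float:
    """Normal-approximation margin of error for a sample proportion.

    p = 0.5 is the worst case (widest interval). Assumes a large
    population, so no finite-population correction is applied.
    """
    return z * math.sqrt(p * (1 - p) / n)


Z_90 = 1.645  # two-sided z-score for 90% confidence

p = precision(257, 260)
moe = margin_of_error(260, Z_90)

print(f"precision = {p:.1%}")
print(f"worst-case margin of error at 90% confidence = {moe:.1%}")
```

Under these assumptions the margin of error comes out at roughly 5%, in line with the statement above.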

SUMMARY of the 260 random web page URLs sampled:
================================================
* Only NZ and US had genuine pages in MRI
* 225 pages were from NZ (.nz and NZ origin) and the remaining 35 from the US
* 2 NZ pages were not in NZ MRI (a Rarotongan/Cook Islands Maori page and a Tokelauan page);
  a 3rd had a single sentence in MRI, but the rest of it was links with repeated English anchor text with digit suffixes (File###)

So 222 NZ pages and 35 US web pages were largely in MRI.
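
The counts in this summary can be reconciled with the 257/260 figure by simple arithmetic. A minimal sketch, using only the numbers stated above (the variable names are illustrative):

```python
sampled = 260
nz_pages, us_pages = 225, 35

# 2 NZ pages in other languages, plus 1 page with only a single MRI sentence
nz_excluded = 2 + 1

nz_in_mri = nz_pages - nz_excluded
genuine = nz_in_mri + us_pages

print(f"NZ pages largely in MRI: {nz_in_mri}")
print(f"All pages largely in MRI: {genuine} of {sampled}")
```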

11 unique domains from the US (10 if mi.wikipedia and mi.m.wikipedia are counted as one)
34 unique domains from NZ (35 if admin.teara is counted as distinct from teara),
33 unique domains from NZ after further skipping the site whose only page is in Cook Islands Maori.
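
The domain counts depend on whether variant hostnames (mobile or admin subdomains) are merged. The sketch below shows one possible way to do that merging. It is not the method used to produce the counts above: the function `canonical_domain`, the set of labels to drop, and the `site.example.nz` hosts are all assumptions for illustration.

```python
def canonical_domain(host: str, drop_labels=("m", "admin")) -> str:
    """Drop variant labels (e.g. the mobile 'm') from a hostname."""
    labels = host.lower().split(".")
    return ".".join(label for label in labels if label not in drop_labels)


# Illustrative hosts only; the site.example.nz entries are placeholders.
hosts = [
    "mi.wikipedia.org",
    "mi.m.wikipedia.org",
    "site.example.nz",
    "admin.site.example.nz",
]

strict = set(hosts)                            # every hostname is distinct
merged = {canonical_domain(h) for h in hosts}  # variants counted as one

print(len(strict), "domains when counted strictly")
print(len(merged), "domains when variants are merged")
```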


NZ sites with many (>=6) sampled pages in MRI are:
tmoa.tki.org.nz (83)
tetaurawhiri.govt.nz (31)
tiritiowaitangi.govt.nz (17)
pukoro.co.nz (15)
waiata.maori.nz (9)
twtop.school.nz (7)
paekupu.co.nz (6)

Among the US sites, those with >=6 sampled pages in MRI are:
m.biblepub.com (11 pages) and mi.m.wikipedia.org (8), though mi.m.wikipedia.org pages usually have
individual words or short phrases in MRI rather than several contiguous sentences or paragraphs.


123 pages' contents are SIGNIFICANTLY_MAORI
35 pages contain MRI, but it's in NAV (navigation menus) or pictures of non-OCR-ed text, with practically no other text on the page
31 pages have one or more MAORI_PARAGRAPHS, with one or more other paragraphs in other languages
18 pages contain noticeably MIXED_TEXT in MRI and one or more other languages within a single paragraph, a set of sentences or a single sentence
15 pages contain POEMS_OR_SONGS
15 pages have a SINGLE_MRI_SENTENCE
13 pages have a set of singleton WORDS in MRI (often MRI language learning sites)
4 pages contain LITTLE of any non-navigation TEXT
3 LINK_TEXT
3 pages contain non-nav text in OTHER_LANGUAGES (English, Tokelau, Cook Islands or Rarotongan Maori)
= 260 sampled web pages
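
The category breakdown can be checked against the sample size with a quick tally. A minimal sketch using the counts listed above; the key `LITTLE_TEXT` is an assumed spelling of the "LITTLE ... TEXT" label, and the percentage shares are derived here rather than taken from the original notes.

```python
category_counts = {
    "SIGNIFICANTLY_MAORI": 123,
    "NAV": 35,
    "MAORI_PARAGRAPHS": 31,
    "MIXED_TEXT": 18,
    "POEMS_OR_SONGS": 15,
    "SINGLE_MRI_SENTENCE": 15,
    "WORDS": 13,
    "LITTLE_TEXT": 4,  # assumed label name
    "LINK_TEXT": 3,
    "OTHER_LANGUAGES": 3,
}

total = sum(category_counts.values())
print(f"total = {total}")

# Share of the sample in each category, largest first
for name, count in sorted(category_counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:<22}{count:>4}  {count / total:.1%}")
```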