source: other-projects/maori-lang-detection/mongodb-data/random260_results.txt@ 33966

Last change on this file since 33966 was 33966, checked in by ak19, 4 years ago

Added the origSequence and basicDomain columns to the random 260 web page URLs, a sorted spreadsheet version of it, and a summary of results from these samples.

File size: 1.9 KB

SUMMARY of the 260 random web page URLs sampled:
================================================
* Only NZ and US had genuine pages in MRI.
* 225 pages were NZ (.nz and NZ origin) and the remaining 35 were from the US.
* 2 NZ pages were not in NZ MRI (a Rarotongan/Cook Islands Maori page and a Tokelauan page);
  a 3rd had a single sentence in MRI, but the rest of it was links with repeated English anchor text with digit suffixes (File###).

So 222 NZ pages and 35 US web pages were largely in MRI.

11 unique domains from US (10 if mi.wikipedia and mi.m.wikipedia are counted as one)
34 unique domains from NZ (35 if admin.teara is counted as distinct from teara),
33 unique domains from NZ after further skipping the site with only a Cook Islands Maori page in it.
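
The alternative domain counts above depend on whether closely related hostnames are merged. The following is only a minimal sketch of one such merging rule, not the method used to produce these results: the function name merge_key, the set of dropped labels ("m", "admin") and the full hostnames in the example list are assumptions made for illustration.

```python
def merge_key(host, drop_labels=("m", "admin")):
    """Collapse a hostname by dropping selected subdomain labels (hypothetical rule)."""
    labels = host.lower().split(".")
    # Leave the last two labels untouched; filter only the subdomain part.
    subdomains = [label for label in labels[:-2] if label not in drop_labels]
    return ".".join(subdomains + labels[-2:])


# Illustrative hostnames only (assumed spellings of the sites named above).
hosts = [
    "mi.wikipedia.org",
    "mi.m.wikipedia.org",
    "teara.govt.nz",
    "admin.teara.govt.nz",
]

distinct_hosts = set(hosts)                  # every hostname counted separately
merged_hosts = {merge_key(h) for h in hosts}  # mobile/admin variants merged

print(len(distinct_hosts), len(merged_hosts))  # prints: 4 2
```

Counting with or without such a rule is what shifts a total by one, as in the 11 versus 10 and 34 versus 35 figures above.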

NZ sites with many (>=6) sampled pages in MRI are:
tmoa.tki.org.nz (83)
tetaurawhiri.govt.nz (31)
tiritiowaitangi.govt.nz (17)
pukoro.co.nz (15)
waiata.maori.nz (9)
twtop.school.nz (7)
paekupu.co.nz (6)

Among the US sites, those with >=6 sampled pages in MRI are:
m.biblepub.com (11 pages) and mi.m.wikipedia.org (8), though mi.m.wikipedia.org pages usually have
individual words or short phrases in MRI rather than several contiguous sentences or paragraphs.

123 pages' contents are SIGNIFICANTLY_MAORI
35 pages contain MRI, but only in NAV (navigation menus) or in pictures of non-OCR-ed text, with practically no other text on the page
31 pages have one or more MAORI_PARAGRAPHS, with one or more other paragraphs in other languages
18 pages contain noticeably MIXED_TEXT: MRI and one or more other languages within a single paragraph, a set of sentences or a single sentence
15 pages contain POEMS_OR_SONGS
15 pages have a SINGLE_MRI_SENTENCE
13 pages have a set of singleton WORDS in MRI (often MRI language learning sites)
4 pages contain LITTLE of any non-navigation TEXT
3 pages contain LINK_TEXT
3 pages contain non-nav text in OTHER_LANGUAGES (English, Tokelauan, Cook Islands or Rarotongan Maori)
= 260 sampled web pages
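
As a quick consistency check on the arithmetic in this summary, the figures can be re-added as below. This is a sketch only: the numbers are copied from the summary, while the dictionary keys are shorthand chosen here (LITTLE_TEXT in particular is an assumed label, not one used in the file).

```python
# Per-category page counts, copied from the list above.
category_counts = {
    "SIGNIFICANTLY_MAORI": 123,
    "NAV": 35,
    "MAORI_PARAGRAPHS": 31,
    "MIXED_TEXT": 18,
    "POEMS_OR_SONGS": 15,
    "SINGLE_MRI_SENTENCE": 15,
    "WORDS": 13,
    "LITTLE_TEXT": 4,  # assumed shorthand for "LITTLE of any non-navigation TEXT"
    "LINK_TEXT": 3,
    "OTHER_LANGUAGES": 3,
}
total_pages = sum(category_counts.values())

# Country split from the bullet points at the top of the summary.
nz_pages, us_pages = 225, 35
nz_excluded = 3  # Rarotongan page, Tokelauan page, and the single-sentence page
nz_largely_mri = nz_pages - nz_excluded

print(total_pages)          # 260
print(nz_pages + us_pages)  # 260
print(nz_largely_mri)       # 222
```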