Changeset 33849

Timestamp:
17.01.2020 22:22:18 (5 weeks ago)
Author:
ak19
Message:

One less Australian site as it was an infographic containing Maori words in English captions and graph legend.

Location:
other-projects/maori-lang-detection
Files:
2 modified

  • other-projects/maori-lang-detection/MoreReading/mongodb.txt

    r33847  r33849
    1154    1154    !!  https://www.kiwiproperty.com - e.g. https://www.kiwiproperty.com/the-base/mi/he-paepaki/ has some actual MRI sentences. [Not autotranslated]
    1155    1155    ?   http://fionajack.net - Wellington gallery of artist. A few occurrences of Kia Ora in a title like context (e.g. "Street Party Kia Ora! Kia Ora!")
    1156            !!  https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd  - site of individual pages (like docs.google.com). This one has a relevant infogram image.
            1156    X!!     https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd  - site of individual pages (like docs.google.com). This one has a relevant infogram image. But it's English with MRI in the image legend and captions.
    1157    1157    !!  https://koreromaori.com - some actual Maori language sentences
    1158    1158        http://theunderwaterworld.com/Galleries/Roimata/roimata-frame.html - placenames

    1310    1310    + http://www.unicode.org, [Universal declaration of Human Rights]
    1311    1311    + https://static-promote.weebly.com,
    1312            + http://aclhokiangarocks.blogspot.com, [often English, but COMMUNITY]
            1312    + http://aclhokiangarocks.blogspot.com, [often English, but COMMUNITY. At least short or partial MRI sentences.]
    1313    1313
    1314    1314    BIBLE/MOHAMMED/BAHAI TRANSLATIONS probably not auto translations:

    1329    1329    !! https://maorinews.com,
    1330    1330    !! http://maaori.com,
    1331            !!+ http://kiaorahola.blogspot.com/
            1331    !!+ http://kiaorahola.blogspot.com,
    1332    1332    + https://kjohnsonnz.blogspot.com,
    1333    1333    + http://pumanawawhangara.blogspot.com,
    1334    1334    + http://dannykahei.tripod.com,
    1335            + http://burkekm001.tripod.com
            1335    + http://burkekm001.tripod.com,
    1336    1336    + http://tkkpipipaopao.blogspot.com,
    1337    1337    + http://manateina.blogspot.com,

    1472    1472
    1473    1473    ---------------
            1474    All sites except NZ or .nz TLD where containingMRI=true manually inspected. Includes overseas sites with mi in URL path. All NZ sites passed through without inspection.
    1474    1475
    1475    1476    MANUAL - TOTAL NUM SITES WITH SOME MRI CONTENT BY COUNTRY

    1489    1490    NZ: 176
    1490    1491    US: 25+4 from US with mi in URL path = 29
    1491            AU: 3
            1492    AU: 2
    1492    1493    DE: 2
    1493    1494    DK: 2

    1497    1498    FR: 1
    1498    1499    IE: 1
    1499            TOTAL: 213+4 from US with mi in URL path = 217
            1500    TOTAL: 213+4 from US with mi in URL path = 216
    1500    1501
    1501    1502
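
    The arithmetic behind the revised TOTAL line can be checked with a minimal Python sketch. It uses only figures visible in the diff above (old total 213+4 = 217, AU reduced from 3 to 2); the variable names are illustrative and do not come from mongodb.txt.

```python
# Figures are taken from the diff; all names here are hypothetical.
old_inspected = 213   # old count excluding the US "mi in URL path" group
us_mi_in_path = 4     # US sites with mi in the URL path
old_total = old_inspected + us_mi_in_path

au_before, au_after = 3, 2
removed = au_before - au_after  # the one infogram site dropped in this changeset

new_total = old_total - removed
new_inspected = old_inspected - removed

print(old_total, new_total, new_inspected)
```

    Under these figures the new total of 216 corresponds to 212+4. The "213+4" on the added TOTAL line appears to be carried over unchanged from the previous revision.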
  • other-projects/maori-lang-detection/journal-paper/writeup

    r33842  r33849
    34      34      mri     0.0014          0.0017          0.0012
    35      35
    36              Over 1400 sites were detected and CommonCrawl returned over 1400 unique site domain containing pages it detected as Maori in the twelve-month period from Sep 2018 to Aug 2019. The above percentages are for the 3 final crawls (June to Aug 2019). Of these 1400 sites, 213+3 = 216 sites appeared to contain actual Maori language sentences composed by humans when manually inspected. The percentage of the high-quality web content that is in Maori may therefore be almost an order of magnitude less.
            36      Over 1400 sites were detected and CommonCrawl returned over 1400 unique site domain containing pages it detected as Maori in the twelve-month period from Sep 2018 to Aug 2019. The above percentages are for the 3 final crawls (June to Aug 2019). Of these 1400 sites, 216 sites appeared to contain actual Maori language sentences composed by humans when manually inspected. The percentage of the high-quality web content that is in Maori may therefore be almost an order of magnitude less.
    37      37
    38      38
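
    The "almost an order of magnitude" estimate in the writeup can be illustrated with a rough Python sketch. It assumes, as a simplification not stated in the writeup, that the share of genuinely Maori content tracks the share of manually confirmed sites (216 of roughly 1400), and that the three mri figures correspond to the three final crawls. The variable names are illustrative.

```python
# Rough sketch only: scales the reported mri percentages by the fraction
# of detected sites that manual inspection confirmed.
detected_sites = 1400    # "over 1400" in the writeup, so a lower bound
confirmed_sites = 216    # sites with human-composed Maori sentences
crawl_percentages = [0.0014, 0.0017, 0.0012]  # mri row from the writeup

confirmed_fraction = confirmed_sites / detected_sites
reduction_factor = detected_sites / confirmed_sites

# Assumption: page share scales with site share.
adjusted = [p * confirmed_fraction for p in crawl_percentages]

print(f"confirmed fraction: {confirmed_fraction:.3f}")
print(f"reduction factor: {reduction_factor:.1f}x")
print("adjusted percentages:", [f"{p:.5f}" for p in adjusted])
```

    With these inputs the reduction factor comes out at about 6.5, which is consistent with the writeup's wording of "almost an order of magnitude".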