Changeset 33849


Timestamp: 2020-01-17T22:22:18+13:00
Author: ak19
Message: One less Australian site as it was an infographic containing Maori words in English captions and graph legend.
Location: other-projects/maori-lang-detection
Files: 2 edited

  • other-projects/maori-lang-detection/MoreReading/mongodb.txt

    r33847 r33849  
    1154 1154  !!  https://www.kiwiproperty.com - e.g. https://www.kiwiproperty.com/the-base/mi/he-paepaki/ has some actual MRI sentences. [Not autotranslated]
    1155 1155  ?   http://fionajack.net - Wellington gallery of artist. A few occurrences of Kia Ora in a title like context (e.g. "Street Party Kia Ora! Kia Ora!")
    1156       !!  https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd  - site of individual pages (like docs.google.com). This one has a relevant infogram image.
         1156  X!!     https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd  - site of individual pages (like docs.google.com). This one has a relevant infogram image. But it's English with MRI in the image legend and captions.
    1157 1157  !!  https://koreromaori.com - some actual Maori language sentences
    1158 1158      http://theunderwaterworld.com/Galleries/Roimata/roimata-frame.html - placenames

    1310 1310  + http://www.unicode.org, [Universal declaration of Human Rights]
    1311 1311  + https://static-promote.weebly.com,
    1312       + http://aclhokiangarocks.blogspot.com, [often English, but COMMUNITY]
         1312  + http://aclhokiangarocks.blogspot.com, [often English, but COMMUNITY. At least short or partial MRI sentences.]
    1313 1313
    1314 1314  BIBLE/MOHAMMED/BAHAI TRANSLATIONS probably not auto translations:

    1329 1329  !! https://maorinews.com,
    1330 1330  !! http://maaori.com,
    1331       !!+ http://kiaorahola.blogspot.com/
         1331  !!+ http://kiaorahola.blogspot.com,
    1332 1332  + https://kjohnsonnz.blogspot.com,
    1333 1333  + http://pumanawawhangara.blogspot.com,
    1334 1334  + http://dannykahei.tripod.com,
    1335       + http://burkekm001.tripod.com
         1335  + http://burkekm001.tripod.com,
    1336 1336  + http://tkkpipipaopao.blogspot.com,
    1337 1337  + http://manateina.blogspot.com,

    1472 1472
    1473 1473  ---------------
         1474  All sites except NZ or .nz TLD where containingMRI=true manually inspected. Includes overseas sites with mi in URL path. All NZ sites passed through without inspection.
    1474 1475
    1475 1476  MANUAL - TOTAL NUM SITES WITH SOME MRI CONTENT BY COUNTRY

    1489 1490  NZ: 176
    1490 1491  US: 25+4 from US with mi in URL path = 29
    1491       AU: 3
         1492  AU: 2
    1492 1493  DE: 2
    1493 1494  DK: 2

    1497 1498  FR: 1
    1498 1499  IE: 1
    1499       TOTAL: 213+4 from US with mi in URL path = 217
         1500  TOTAL: 213+4 from US with mi in URL path = 216
    1500 1501
    1501 1502
  • other-projects/maori-lang-detection/journal-paper/writeup

    r33842 r33849  
      34   34  mri     0.0014          0.0017          0.0012
      35   35
      36       Over 1400 sites were detected and CommonCrawl returned over 1400 unique site domain containing pages it detected as Maori in the twelve-month period from Sep 2018 to Aug 2019. The above percentages are for the 3 final crawls (June to Aug 2019). Of these 1400 sites, 213+3 = 216 sites appeared to contain actual Maori language sentences composed by humans when manually inspected. The percentage of the high-quality web content that is in Maori may therefore be almost an order of magnitude less.
           36  Over 1400 sites were detected and CommonCrawl returned over 1400 unique site domain containing pages it detected as Maori in the twelve-month period from Sep 2018 to Aug 2019. The above percentages are for the 3 final crawls (June to Aug 2019). Of these 1400 sites, 216 sites appeared to contain actual Maori language sentences composed by humans when manually inspected. The percentage of the high-quality web content that is in Maori may therefore be almost an order of magnitude less.
      37   37
      38   38