Timestamp:
2020-02-13T17:09:07+13:00
Author:
ak19
Message:

Shortlisted just the domain sites by country into ManualShortlist2.txt, after taking the reingest into MongoDB into account. Then put all the shortlisted domains for which containsMRI=true, as per manual inspection, into a separate new file.

File:
1 edited

  • other-projects/maori-lang-detection/mongodb-data/ManualShortlisting.txt

    r33891  r33914
    1762    1762            "http://teaohou.natlib.govt.nz", 4/4, 2/4
    1763    1763            "http://www.tuwharetoa.iwi.nz", 2/3 0/3
    1764            +        "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY (But there are pages inMRI to be found by non-random sampling, e.g. http://auturoa.nz/KarakiaMoKuaToRangiTeRaa.html)
            1764    X        "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY (But there are pages inMRI to be found by non-random sampling, e.g. http://auturoa.nz/KarakiaMoKuaToRangiTeRaa.html)
    1765    1765            "https://www.terito.school.nz", 3/3, 0/2 total
    1766    1766            "https://ttw1.cwp.govt.nz", 3/3 3/3

    1991    1991    3. GRAND TOTALS
    1992    1992
    1993            Count per country of web SITES that contain at least 1 web page containing at least 1 genuine MRI sentence:
    1994
            1993    Count per country of web SITES that contain at least 1 web page containing at least 1 genuine MRI sentence. (Number in brackets for overseas is number of sites of that geolocation if nz TLDs were NOT grouped with NZ geolocation under "NZ". Number in brackets for NZ indicates the number of sites that are only of NZ geolocation ignoring nz TLDs hosted overseas.)
            1994
            1995    OLD
    1995    1996    countryCode, num manually inspected sites as having pages containing MRI, num sites openNLP detected as having pages containing MRI
    1996            NZ: 126 actual sites out of 176 detected sites
    1997            US: 29 actual out of 486 detected sites
    1998            AU: 2 actual out of 21 detected sites
            1997    NZ: 126 actual sites out of 176 (89) detected sites
            1998    US: 29 actual out of 422 (486) detected sites
            1999    AU: 2 actual out of 5 (21) detected sites
    1999    2000    DE, Germany: 2 actual out of 27 detected sites
    2000    2001    DK, Denmark: 2 out of 8
    2001    2002    BG, Bulgaria: 1 out of 1
    2002    2003    CZ, Czech Republic: 1 out of 4
    2003            ES, Spain: 1 out of 7
    2004            FR, France: 1 out of 36
            2004    ES, Spain: 1 out of 5 (7)
            2005    FR, France: 1 out of 35 (36)
    2005    2006    IE, Ireland: 1 out of 2
            2007
    2006    2008
    2007    2009    TOTAL: 166 sites of all the crawled sites where the crawled set of pages per site actually contained at least one sentence in Māori based on manual inspection.
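
The per-country figures on the r33914 side can be cross-checked against the TOTAL line with a few lines of Python. This is only a sketch: the dictionary is transcribed by hand from the diff above (using the unbracketed "detected" numbers), and the names `counts`, `total_actual`, `total_detected` and `ratio` are illustrative, not part of the project. Reading "actual out of detected" as a ratio is likewise an assumption made here for convenience.

```python
# (actual, detected) site counts per country, transcribed from the
# r33914 side of the diff; the bracketed alternative counts are ignored.
counts = {
    "NZ": (126, 176),
    "US": (29, 422),
    "AU": (2, 5),
    "DE": (2, 27),
    "DK": (2, 8),
    "BG": (1, 1),
    "CZ": (1, 4),
    "ES": (1, 5),
    "FR": (1, 35),
    "IE": (1, 2),
}

# Sum of manually confirmed sites; should agree with the TOTAL line.
total_actual = sum(actual for actual, _ in counts.values())
total_detected = sum(detected for _, detected in counts.values())

# Share of detected sites that manual inspection confirmed, per country
# (an assumed, illustrative reading of "actual out of detected").
ratio = {cc: actual / detected for cc, (actual, detected) in counts.items()}

assert total_actual == 166  # matches "TOTAL: 166 sites" in the file

for cc, (actual, detected) in counts.items():
    print(f"{cc}: {actual}/{detected} = {ratio[cc]:.1%}")
print(f"TOTAL: {total_actual} actual out of {total_detected} detected")
```

The per-country "actual" counts sum to 166, which agrees with the TOTAL line in the file.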