Changeset 33842

Timestamp:
16.01.2020 22:30:09 (5 weeks ago)
Author:
ak19
Message:

Jotted down some further paragraphs and notes of interest. Tentatively moved a part to intro.

Files:
1 modified

  • other-projects/maori-lang-detection/journal-paper/writeup

    r33839 r33842  
    55- web page, web site? 
    66- auto(-)translated, automatically translated 
     7- See <> 
     8 
     9TODO: 
     10- Crop map images to just the map bounds 
     11- Redo map for #6: add the 2 or 3 more US ones detected (after confirming if they were 3 or 2) 
     12- Tables for each map 
     13- scholar.google => low resource languages; bibtex 
     14 
     15 
     16 
     17Intro - TODO: NEED TO REWORK AND MOVE PARTS ELSEWHERE 
     18------------------- 
     19 
     20We considered a few ways of approaching the problem of locating Maori language text content on the web. An apparent one was to run an unbridled crawl of the internet using several seed URLs consisting of known major Maori language New Zealand websites. In looking into whether there were more straightforward means than crawling the entire internet and then deleting content not detected as Maori, we discovered Common Crawl (CC), which "builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone". [https://commoncrawl.org/]  
     21 
     22Common Crawl encourages use of their collected crawl data, provided by them as a "corpus for collaborative research, analysis and education", thereby reducing the potential burden on the web caused by many independent spiders crawling the internet for disparate research ends. In August 2018,  Common Crawl's crawling incorporated language detection [https://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/]. Starting a month later, they further enabled querying their Columnar Index for web page crawl data detected as matching desired languages [https://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/], which was relevant for our own study.  
     23 
     24Common Crawl, however, does not crawl every web site in its entirety, restricting the crawl depth for copyright and other reasons, and limits overlap between monthly crawls, aiming instead to provide a representative sampling of a broad cross-section of the web. They further take special note of minority languages: for instance, they described this aspect of their July 2019 sampling as containing "2 million URLs of pages written in 130 less-represented languages" [http://commoncrawl.org/2019/07/]. 
     25 
     26https://www.forbes.com/sites/kalevleetaru/2017/09/28/common-crawl-and-unlocking-web-archives-for-research/#7067d4313b83 
     27As Ms. Crouse [Director of Common Crawl] put it, “this is big data intended for machine learning/readability. Further, our intention for its use is for public benefit i.e. to encourage research and innovation, not direct consumption.” She noted that “from the layperson’s perspective, it is not at all trivial at present to extract a specific website’s content (that is, text) from a Common Crawl dataset. This task generally requires one to know how to install and run a Hadoop cluster, among other things. This is not structured data. Further it is likely that not all pages of that website will be included (depending on the parameters for depth set for the specific crawl).” This means that “the bulk of [Common Crawl’s] users are from the noncommercial, educational, and research sectors. At a higher level, it’s important to note that we provide a broad and representative sample of the web, in the form of web crawl data, each month. No one really knows how big the web is, and at present, we limit our monthly data publication to approximately 3 billion pages.” 
     28 
     29 
     30https://commoncrawl.github.io/cc-crawl-statistics/plots/languages 
     31crawl     CC-MAIN-2019-43  CC-MAIN-2019-47  CC-MAIN-2019-51 
     32language  %                %                % 
     33eng       43.2339          43.7573          43.5783 
     34mri       0.0014           0.0017           0.0012 
     35 
     36Common Crawl returned over 1400 unique site domains containing pages it detected as Maori in the twelve-month period from Sep 2018 to Aug 2019. The above percentages are for the 3 final crawls (June to Aug 2019). Of these 1400 sites, 213+3 = 216 sites appeared to contain actual Maori language sentences composed by humans when manually inspected. The percentage of the high-quality web content that is in Maori may therefore be almost an order of magnitude less. 
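
The site counts above can be turned into a rough ratio. The following is only a back-of-the-envelope sketch: it treats "over 1400" as exactly 1400, and the function name `confirmed_share` is a hypothetical helper introduced for illustration. On those assumptions, about 15% of the detected sites were confirmed, a reduction by a factor of roughly 6.5.

```python
def confirmed_share(detected_sites, confirmed_sites):
    """Return (share of detected sites that were confirmed, reduction factor)."""
    share = confirmed_sites / detected_sites
    reduction = detected_sites / confirmed_sites
    return share, reduction


# Figures from the text: about 1400 detected sites (assumed to be exactly
# 1400 here) and 213 + 3 sites confirmed by manual inspection.
share, reduction = confirmed_share(1400, 213 + 3)
print(f"confirmed share: {share:.1%}, reduction factor: {reduction:.1f}x")
```

Applying the same factor to the mri percentages in the table would be a further assumption, since those figures count pages rather than sites.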
     37 
     38 
    739 
    840Scope 
     
    1345Implementation 
    1446------------------- 
    15 We considered a few ways of approaching the problem of locating Maori language text content on the web. An apparent one was to run an unbridled crawl of the internet using several seed URLs consisting of known major Maori language New Zealand websites. In looking into whether there were more straightforward means than crawling the entire internet and then deleting content not detected as Maori, we discovered Common Crawl (CC), which "builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone". [https://commoncrawl.org/]  
     47The flowchart in Figure <> illustrates the process described in this section. 
    1648 
    1749Common Crawl's large data sets are stored on distributed file systems, and a distributed file system is likewise required to access their content. Crawl data of interest is requested and retrieved by querying their columnar index. Since September 2018, Common Crawl's columnar index has included the "content_languages" field, which stores the top detected language(s) for each crawled page. A request for a monthly crawl's data set can thus be restricted to just those pages matching the language(s) required, which suited our purposes. In our case, we requested crawled content that Common Crawl had stored as being "MRI", the three-letter language code for the Maori language, rather than crawled web pages for which MRI was but one among several detected languages. We obtained the results for 12 contiguous months' worth of Common Crawl's crawl data, spanning Sep 2018 to Aug 2019. The content was returned in WARC format, which our Common Crawl querying script then additionally converted to the WET format, a process that reduced HTML web pages to just their extracted text, as this was the portion of interest. 
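
As an illustration of the kind of columnar index query described in this paragraph, the sketch below builds an SQL string that selects the WARC record locations of pages whose only detected language is Maori. It is not the querying script used in this work. The table name `ccindex`, the helper name `build_language_query`, the lowercase code `mri`, and the example crawl identifier `CC-MAIN-2019-35` are all assumptions made for illustration; the column names are those we understand the public columnar index schema to use.

```python
import re


def build_language_query(crawl_id, language="mri", table="ccindex"):
    """Build SQL selecting WARC record locations for one crawl and one language.

    Matching content_languages by equality keeps only pages for which the
    given code is the sole detected language, not one among several.
    """
    # Reject unexpected characters, since the values are placed directly in SQL.
    for value in (crawl_id, language, table):
        if not re.fullmatch(r"[A-Za-z0-9_-]+", value):
            raise ValueError(f"unexpected characters in {value!r}")
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        f"FROM {table}\n"
        f"WHERE crawl = '{crawl_id}'\n"
        "  AND subset = 'warc'\n"
        f"  AND content_languages = '{language}'"
    )


# Hypothetical usage: one monthly crawl, pages detected as Maori only.
print(build_language_query("CC-MAIN-2019-35"))
```

Running one such query per monthly crawl would cover a twelve-month span; the returned file name, offset and length identify each WARC record to fetch.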
     
    2355 
    2456In the final phase, we queried the Nutch crawled data stored in MongoDB to answer some of our questions. 
     57 
    2558 
    2659 
     
    86119 
    87120 
     121 
     122 
    88123------------------------------------------ 
    89124UNUSED: