Changeset 33842


Timestamp: 2020-01-16T22:30:09+13:00
Author: ak19
Message: Jotted down some further paragraphs and notes of interest. Tentatively moved a part to intro.

File: 1 edited

  • other-projects/maori-lang-detection/journal-paper/writeup

r33839 → r33842
- web page, web site?
- auto(-)translated, automatically translated
- See <>

TODO:
- Crop map images to just the map bounds
- Redo map for #6: add the 2 or 3 more US ones detected (after confirming if they were 3 or 2)
- Tables for each map
- scholar.google => low resource languages; bibtex



Intro - TODO: NEED TO REWORK AND MOVE PARTS ELSEWHERE
-------------------

We considered several ways of approaching the problem of locating Maori language text content on the web. An obvious one was to run an unrestricted crawl of the internet using several seed URLs consisting of known major Maori language New Zealand websites. In looking into whether there were more straightforward means than crawling the entire internet and then discarding content not detected as Maori, we discovered Common Crawl (CC), which "builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone" [https://commoncrawl.org/].

Common Crawl encourages use of their collected crawl data, which they provide as a "corpus for collaborative research, analysis and education", thereby reducing the potential burden on the web caused by many independent spiders crawling the internet for disparate research ends. In August 2018, Common Crawl's crawling incorporated language detection [https://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/]. Starting a month later, they further enabled querying their columnar index for web pages detected as matching desired languages [https://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/], which was directly relevant to our own study.

Common Crawl, however, does not crawl every web site in its entirety: it restricts crawl depth for copyright and other reasons, and limits overlap between monthly crawls, aiming instead to provide a representative sample of a broad cross-section of the web. They further take special note of minority languages; for instance, they described this aspect of their July 2019 sampling as containing "2 million URLs of pages written in 130 less-represented languages" [http://commoncrawl.org/2019/07/].

https://www.forbes.com/sites/kalevleetaru/2017/09/28/common-crawl-and-unlocking-web-archives-for-research/#7067d4313b83
As Ms. Crouse [Director of Common Crawl] put it, “this is big data intended for machine learning/readability. Further, our intention for its use is for public benefit i.e. to encourage research and innovation, not direct consumption.” She noted that “from the layperson’s perspective, it is not at all trivial at present to extract a specific website’s content (that is, text) from a Common Crawl dataset. This task generally requires one to know how to install and run a Hadoop cluster, among other things. This is not structured data. Further it is likely that not all pages of that website will be included (depending on the parameters for depth set for the specific crawl).” This means that “the bulk of [Common Crawl’s] users are from the noncommercial, educational, and research sectors. At a higher level, it’s important to note that we provide a broad and representative sample of the web, in the form of web crawl data, each month. No one really knows how big the web is, and at present, we limit our monthly data publication to approximately 3 billion pages.”


https://commoncrawl.github.io/cc-crawl-statistics/plots/languages

language   CC-MAIN-2019-43   CC-MAIN-2019-47   CC-MAIN-2019-51
eng        43.2339%          43.7573%          43.5783%
mri        0.0014%           0.0017%           0.0012%

Common Crawl returned over 1400 unique site domains containing pages it detected as Maori over the twelve-month period from Sep 2018 to Aug 2019. The percentages above are for the 3 final crawls of that period (June to Aug 2019). Of these 1400-odd sites, 213+3 = 216 appeared on manual inspection to contain actual Maori language sentences composed by humans, roughly one site in 6.5. The percentage of high-quality web content that is in Maori may therefore be almost an order of magnitude less than the figures above.


Scope

…

Implementation
-------------------
The flowchart in Figure <> illustrates the process described in this section.

Common Crawl's large data sets are stored on distributed file systems, and accessing their content requires the same. Crawl data of interest is requested and retrieved by querying their columnar index. Since September 2018, Common Crawl's columnar index has included a "content_languages" field, which stores the top detected language(s) for each crawled page. A request for a monthly crawl's data set can thus be restricted to just those pages matching the required language(s), which suited our purposes. In our case, we requested crawled content that Common Crawl had stored as being "MRI", the 3-letter code for the Maori language, rather than crawled web pages for which MRI was but one among several detected languages. We obtained the results for 12 contiguous months' worth of Common Crawl's crawl data, spanning Sep 2018 to Aug 2019. The content was returned in WARC format, which our Common Crawl querying script then converted to WET format, a process that reduces HTML marked-up web pages to just their extracted text, as this was the portion of interest.
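
To make this step concrete, below is a minimal PySpark sketch of the kind of columnar-index query just described. The S3 path and the column names (crawl, subset, content_languages, url, warc_filename, warc_record_offset, warc_record_length) follow Common Crawl's published cc-index table schema; the crawl label, output path and Spark setup are illustrative assumptions rather than our actual script.

    # Sketch: query Common Crawl's columnar index for pages detected as Maori.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cc-mri-query").getOrCreate()

    # The columnar index is published as Parquet on S3, partitioned by crawl
    # and subset (the s3a:// scheme may be needed outside EMR).
    ccindex = spark.read.parquet("s3://commoncrawl/cc-index/table/cc-main/warc/")

    mri_pages = (ccindex
                 .filter(ccindex.crawl == "CC-MAIN-2019-35")    # Aug 2019 crawl (assumed label)
                 .filter(ccindex.subset == "warc")
                 .filter(ccindex.content_languages == "mri")    # MRI as the sole detected language
                 .select("url", "warc_filename",
                         "warc_record_offset", "warc_record_length"))

    mri_pages.write.csv("mri-warc-records", header=True)

The rows returned point at byte ranges within Common Crawl's WARC files. The second sketch fetches one such record over HTTP and strips its markup, roughly approximating the WARC-to-WET reduction. It assumes the warcio and beautifulsoup4 packages and Common Crawl's current data.commoncrawl.org endpoint, and is a simplification of both our script and Common Crawl's real WET extractor.

    # Sketch: fetch one WARC record by byte range and strip its markup.
    import io
    import requests
    from warcio.archiveiterator import ArchiveIterator
    from bs4 import BeautifulSoup

    def fetch_record_text(warc_filename, offset, length):
        """Download a single gzipped WARC record and return its visible text."""
        url = "https://data.commoncrawl.org/" + warc_filename
        rng = {"Range": "bytes={}-{}".format(offset, offset + length - 1)}
        resp = requests.get(url, headers=rng)
        resp.raise_for_status()
        for record in ArchiveIterator(io.BytesIO(resp.content)):
            if record.rec_type == "response":
                html = record.content_stream().read()
                # A crude text extraction; real WET generation is more involved.
                return BeautifulSoup(html, "html.parser").get_text(separator="\n")
        return None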
     

In the final phase, we queried the Nutch-crawled data stored in MongoDB to answer some of our questions.
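
As an illustration only, a query of this kind might look like the following pymongo sketch. The database, collection and field names here ("nutch_crawl", "websites", "numPagesInMRI") are hypothetical stand-ins, not our actual schema.

    # Sketch: ask the MongoDB store how many crawled sites contain Maori pages.
    # All names below (database, collection, fields) are hypothetical.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    db = client["nutch_crawl"]

    # How many sites have at least one page detected as Maori?
    num_mri_sites = db["websites"].count_documents({"numPagesInMRI": {"$gt": 0}})
    print("Sites with at least one MRI page:", num_mri_sites)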
     
…



------------------------------------------
UNUSED: