source: other-projects/maori-lang-detection/writeup@ 33825

Last change on this file since 33825 was 33825, checked in by ak19, 14 months ago

Beginnings of first draft of write up.

CommonCrawl (CC) "builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone". [] Its large data sets are stored on distributed file systems and require the same to access the content. Since September 2019, CommonCrawl has included a content_languages field in its columnar index for querying, which stores the top detected language(s) for each crawled page. A request for a monthly crawl's data set can thus be restricted to just those pages matching the required language(s). In our case, we requested crawled content that CommonCrawl had marked as being solely MRI, rather than pages for which MRI was one among several detected languages. We obtained the results for 12 months' worth of CommonCrawl's crawl data, from Sep 2018 up to Aug 2019. The content was returned in WARC format, which our CommonCrawl querying script then additionally converted to the WET format, containing just the extracted text content. The HTML markup and headers of web pages were not of interest to us, and this conversion saved us from having to strip away the HTML ourselves.
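The distinction between pages detected as solely MRI and pages where MRI is merely one of several detected languages can be illustrated with a small sketch. The dict-based row format below is an illustrative stand-in for rows of CommonCrawl's columnar index, in which content_languages holds the top detected language code(s), comma-separated when there are several; our actual queries ran against the index itself.

```python
# Sketch: keep only index rows whose sole detected language is MRI.
# The row dicts here are a simplified, assumed stand-in for rows of
# CommonCrawl's columnar index, where content_languages lists the top
# detected language codes (comma-separated when there are several).

def mri_only(rows):
    """Keep rows where MRI is the single detected language."""
    return [r for r in rows if r.get("content_languages") == "mri"]

rows = [
    {"url": "http://a.example/", "content_languages": "mri"},
    {"url": "http://b.example/", "content_languages": "mri,eng"},  # mixed: excluded
    {"url": "http://c.example/", "content_languages": "eng"},
]

selected = mri_only(rows)
```

A query filtering on the content_languages field of the real index behaves analogously: equality against "mri" excludes mixed-language pages, whereas a substring match would include them.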
Our next aim was to further inspect the websites in the CommonCrawl result set by crawling each site in greater depth with Nutch, with an eye toward running Apache OpenNLP's language detection on the text content of each crawled webpage of the site.
The multiple WET files obtained for each of the 12 months of CommonCrawl data were all processed together by our program. Its purpose was to further reduce the list of websites to crawl, by excluding blacklisted (adult) sites and obviously autotranslated product ("greylisted") sites, and to create a set of seed URLs and a regex-urlfilter.txt file for each remaining site to allow Nutch to crawl it. We used a crawl depth of 10. Although not sufficient for all crawled websites, further processing of the crawled set of webpages for a site could always indicate to us whether the website was of sufficient interest to warrant exhaustive re-crawling in future. (While waiting for Nutch to run over the curated list of sites, a few further sites were excluded from crawling when manual inspection determined they were just autotranslated product websites.)
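The filtering and Nutch configuration step can be sketched as follows. The domain lists below are hypothetical placeholders for our actual blacklist and greylist, and the exact include pattern is illustrative; Nutch's regex-urlfilter.txt format does, however, take one regular expression per line, prefixed with + (include) or - (exclude).

```python
import re
from urllib.parse import urlparse

# Hypothetical placeholder lists; the real blacklist held adult sites
# and the greylist held autotranslated product sites found by inspection.
BLACKLIST = {"adult.example"}
GREYLIST = {"autotranslated-shop.example"}

def keep_site(url):
    """Exclude blacklisted (adult) and greylisted (autotranslated) sites."""
    domain = urlparse(url).netloc
    return domain not in BLACKLIST and domain not in GREYLIST

def urlfilter_line(url):
    """One include rule for a site's regex-urlfilter.txt, confining
    Nutch's crawl to the seed site's own domain."""
    domain = urlparse(url).netloc
    return "+^https?://" + re.escape(domain) + "/"

candidates = [
    "http://maori-site.example/",
    "http://adult.example/",
    "http://autotranslated-shop.example/",
]
seeds = [u for u in candidates if keep_site(u)]
filters = [urlfilter_line(u) for u in seeds]
```

In practice each surviving site gets its own seed URL list and filter file, so that a single Nutch crawl (at depth 10, in our case) stays within that site.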
Nutch stores its crawl data in a database but can dump each website's contents into a text file. The subsequent phase involved processing the text dump of each website to split it into its individual webpages and to compute website-level and webpage-level metadata.
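A minimal sketch of this splitting step is below. The "URL:: " marker separating pages is an assumption about the dump layout (the exact format depends on which Nutch dump tool is used), and word count stands in for the richer per-page metadata actually computed.

```python
# Sketch: split a Nutch text dump into per-page records and compute
# simple webpage-level metadata (here, just a word count). The
# "URL:: " page marker is an assumed stand-in for the real dump format.

def split_dump(dump_text, marker="URL:: "):
    """Map each page URL in the dump to that page's text content."""
    pages = {}
    url, lines = None, []
    for line in dump_text.splitlines():
        if line.startswith(marker):
            if url is not None:
                pages[url] = "\n".join(lines)
            url, lines = line[len(marker):].strip(), []
        elif url is not None:
            lines.append(line)
    if url is not None:
        pages[url] = "\n".join(lines)
    return pages

def page_metadata(pages):
    """Webpage-level metadata: word count per page."""
    return {url: len(text.split()) for url, text in pages.items()}

dump = ("URL:: http://site.example/a\nkia ora koutou\n"
        "URL:: http://site.example/b\ntena koe")
pages = split_dump(dump)
meta = page_metadata(pages)
```

Website-level metadata (for example, the proportion of a site's pages detected as MRI) can then be aggregated from these per-page values.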