Changeset 33828


Timestamp: 2020-01-14T22:09:43+13:00
Author: ak19
Message: Additions and modifications to the write-up.

File: other-projects/maori-lang-detection/writeup (1 edited; r33825 to r33828)

Notes:
- Common Crawl is 2 words
- Maori needs macron
- web page, web site?
- auto(-)translated, automatically translated

Scope
-------------------
We limited our investigations at this stage to locating textual content in Māori on the web. This excludes audio-visual materials in Māori, as well as any Māori cultural and community content that may be presented in other languages such as English, despite the value of such content in creating an eventual digital repository of Māori resources for preservation and researchers.


Implementation
-------------------
We considered a few ways of approaching the problem of locating Māori language text content on the web. An obvious one was to run an unbridled crawl of the internet using several seed URLs consisting of known major Māori language New Zealand websites. In looking into whether there were more straightforward means than crawling the entire internet and then discarding content not detected as Māori, we discovered Common Crawl (CC), which "builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone". [https://commoncrawl.org/]

Common Crawl's large data sets are stored on distributed file systems, and accessing their content likewise requires distributed processing. Crawl data of interest is requested and retrieved by querying their columnar index. Since September 2019, Common Crawl have added the "content_languages" field to this index, which stores the top detected language(s) for each crawled page. A request for a monthly crawl's data set can thus be restricted to just those pages matching the required language(s), which suited our purposes. In our case, we requested crawled content that Common Crawl had recorded as "MRI", the 3-letter language code for the Māori language, rather than crawled web pages for which MRI was but one among several detected languages. We obtained the results for 12 contiguous months of Common Crawl's crawl data, spanning Sep 2018 to Aug 2019. The content was returned in WARC format, which our Common Crawl querying script then converted to the WET format, a process that reduced HTML-marked-up web pages to just their extracted text, as this was the portion of interest.
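The language restriction amounts to a filter over index records. A minimal sketch, assuming records carry a comma-separated content_languages value (the record shape here is illustrative, not Common Crawl's exact columnar index schema):

```javascript
// Sketch: keep only index records whose sole detected language is
// Maori ("mri" in ISO 639-3), as opposed to one language among several.
function isSolelyMaori(record) {
  const langs = (record.content_languages || "").split(",").filter(Boolean);
  return langs.length === 1 && langs[0] === "mri";
}

// Hypothetical sample records for illustration.
const sample = [
  { url: "https://example.org/a", content_languages: "mri" },
  { url: "https://example.org/b", content_languages: "mri,eng" },
  { url: "https://example.org/c", content_languages: "eng" },
];
const solelyMaori = sample.filter(isSolelyMaori).map((r) => r.url);
```

Only the first sample record survives the filter, matching the "MRI only" restriction described above.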

Our next aim was to further inspect the websites in the Common Crawl result set by crawling each site in greater depth using Apache Nutch, with an eye toward running Apache OpenNLP's language detection on the text content of each crawled web page. The purpose of applying this additional layer of language detection was to increase the accuracy of determining whether the language, and therefore also the web page or site at large, remained relevant. To this end, the multiple WET files obtained for the 12 months of Common Crawl data were first processed to further reduce the list of websites to crawl by excluding blacklisted (adult) and obviously auto-translated product (greylisted) sites. A set of seed URLs and URL exclusion/inclusion filters for each remaining website facilitated Nutch's crawling of them. We used a blanket crawl depth of 10. Although such a depth did not suffice to thoroughly crawl every website in the shortlist, subsequent analysis of the crawled set of web pages for a site could always indicate whether the site warranted exhaustive re-crawling in future. (While waiting for Nutch to run over the curated list of sites, a few further sites were excluded from crawling when manual inspection revealed they were just auto-translated product websites.)
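The shortlisting and filter-generation step can be sketched as follows. The domain names and list contents are hypothetical, and the include rule only approximates a Nutch regex-urlfilter line; the project's actual lists and filter files differ:

```javascript
// Sketch: drop blacklisted (adult) and greylisted (auto-translated product)
// domains, then emit a seed URL and an include filter per remaining site.
const blacklist = new Set(["adult-site.example"]);
const greylist = new Set(["autotranslated-shop.example"]);

// Escape regex metacharacters so a domain can be embedded in a filter rule.
function escapeForRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function crawlSpecs(domains) {
  return domains
    .filter((d) => !blacklist.has(d) && !greylist.has(d))
    .map((d) => ({
      seed: `https://${d}/`,
      // '+' marks an include rule; the anchor keeps the crawl within the site.
      includeFilter: `+^https?://${escapeForRegex(d)}/`,
    }));
}

const specs = crawlSpecs([
  "maori-language-site.example",
  "adult-site.example",
  "autotranslated-shop.example",
]);
```

Of the three hypothetical domains, only the first yields a seed URL and filter; the other two are excluded before Nutch ever sees them.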

Nutch stores its crawl data in a database but can dump each website's contents into a text file. The subsequent phase involved processing the text dump of each website crawled by Nutch, splitting it into its individual web pages and then computing metadata at both the website level and the page level. Both this metadata and the full text of each web page were stored in MongoDB. Page-level metadata included the primary language detected by OpenNLP for an entire page as well as for individual sentences of the page. Site-level metadata included a flag indicating whether any of a site's web pages, or any sentence in any of its web pages, was detected to be primarily Māori by OpenNLP. Further site-level metadata were a flag indicating whether a web page's URL contained the 2-letter code for Māori (mi) as prefix or suffix, and a field storing the site's originating country, to allow eventual filtering away of auto-translated websites in MongoDB query results.
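The URL flag can be expressed as a small predicate. This sketch mirrors the /mi(/) and http(s)://mi.* patterns referred to in the later queries, though the project's actual check may differ in detail:

```javascript
// Sketch of the site-level URL flag: does a URL carry the 2-letter Maori
// code "mi" as a subdomain prefix (http(s)://mi.*) or as a path segment
// (/mi at the end, or /mi/ anywhere in the path)?
function urlContainsLangCodeInPath(url) {
  const asSubdomain = /^https?:\/\/mi\./.test(url);
  const asPathSegment = /\/mi(\/|$)/.test(url);
  return asSubdomain || asPathSegment;
}
```

Note that a URL like https://example.org/minutes does not trigger the flag, since "mi" must be a whole subdomain or path segment.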

In the final phase, we queried the Nutch-crawled data stored in MongoDB to answer some of our questions.



Results and Discussion
-------------------------

It became apparent quite early on, when inspecting the web pages returned by querying Common Crawl for Māori as the content_languages value, that there were a lot of low-quality websites. A particular problem was the presence of many automatically translated product sites. Ideally, we wanted these removed from the final result set, both to get a more authentic picture of where on the internet real Māori language textual content was to be found, and because such data may ultimately go into a repository of high-quality Māori language materials for future analysis by researchers. At present, websites carry no consistent indicator of whether they were automatically translated or whether their textual content was composed by humans. This makes it hard to detect such sites programmatically so that they can be excluded when necessary. This investigation suggests there is a case for the World Wide Web Consortium to mandate metadata indicating whether a web page's content (or a subset of it) was automatically generated or composed by a human.


---
Some basic MongoDB queries with results:

# Num websites
db.getCollection('Websites').find({}).count()
1445

# Num webpages
db.getCollection('Webpages').find({}).count()
117496

# Number of websites with 1 or more pages detected as being in Māori (a positive numPagesInMRI)
db.getCollection('Websites').find({numPagesInMRI: {$gt: 0}}).count()
361

# Number of websites containing at least one sentence for which OpenNLP detected the best language to be MRI
db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
868

# The union of the above two equals the numPagesContainingMRI count, since every site counted by numPagesInMRI is also counted by numPagesContainingMRI:
db.getCollection('Websites').find({ $or: [ { numPagesInMRI: { $gt: 0 } }, { numPagesContainingMRI: {$gt: 0} } ] }).count()
868

# Number of webpages deemed to be overall in MRI (pages where isMRI=true)
db.getCollection('Webpages').find({isMRI: true}).count()
7818

# Number of webpages containing any number of MRI sentences
db.getCollection('Webpages').find({containsMRI: true}).count()
20371

# Number of websites with URLs containing /mi(/) OR http(s)://mi.*
db.getCollection('Websites').find({urlContainsLangCodeInPath: true}).count()
670

# Number of websites outside NZ containing /mi(/) OR http(s)://mi.* in any of their sub-URLs
db.getCollection('Websites').find({urlContainsLangCodeInPath: true, geoLocationCountryCode: {$ne: "NZ"}}).count()
656

# 14 websites in NZ with URLs containing /mi(/) OR http(s)://mi.*
db.getCollection('Websites').find({urlContainsLangCodeInPath: true, geoLocationCountryCode: "NZ"}).count()
14
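The per-country site tallies behind the GeoJSON plots described below amount to a simple group-by over the same site records. A minimal sketch, bucketing unknown origins separately (the plots place them on Antarctica, country code "AQ"); the field name mirrors the geoLocationCountryCode used in the queries above:

```javascript
// Sketch: count site records per originating country, mapping sites with
// no known origin to "AQ" (Antarctica) as the plots do.
function countsByCountry(sites) {
  const counts = {};
  for (const site of sites) {
    const code = site.geoLocationCountryCode || "AQ"; // unknown -> Antarctica
    counts[code] = (counts[code] || 0) + 1;
  }
  return counts;
}

// Hypothetical records for illustration.
const demoCounts = countsByCountry([
  { geoLocationCountryCode: "NZ" },
  { geoLocationCountryCode: "NZ" },
  { geoLocationCountryCode: "CN" },
  {},
]);
```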

---

GeoJSON plots, generated with http://geojson.tools/, display counts by country of website server origin (plotted on Antarctica where unknown) for:
(i) all the Nutch-crawled site-level data, consisting of over 1400 websites returned by Common Crawl as having content in Māori;
(ii) those sites of (i) containing one or more pages where the primary language detected by OpenNLP was Māori;
(iii) those sites of (i) containing any page where the primary language of one or more sentences was detected by OpenNLP to be Māori;
(iv) the sites from (iii), excluding any websites that have the two-letter code for Māori as URL prefix or suffix (mi.* or */mi) if they originate outside New Zealand or Australia, or if they have an .nz top-level domain regardless of country of origin. The assumption is that any non-NZ website using "mi" in the URL prefix or suffix is likely to be auto-translated; manual inspection confirmed this to be the case for Chinese-origin sites;
(v) the same as (iv), but grouping sites that originate in New Zealand or have an .nz top-level domain under the counts for New Zealand (NZ);
(vi) the sites of (v), excluding any that were misdetected as Māori, that contained only Māori New Zealand place names (such as in holiday photo captions), or that were still auto-translated websites. This gives a more accurate picture of the sites that contain genuine, higher-quality Māori language content.
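Under one reading of exclusion rule (iv), the filter over the stored site-level metadata can be sketched as below. The urlContainsLangCodeInPath and geoLocationCountryCode fields mirror the MongoDB queries above, while tld is a hypothetical field added for illustration:

```javascript
// Sketch of the plot (iv) exclusion: sites whose URLs carry the "mi" code
// are dropped if they originate outside NZ/AU, or if they have an .nz
// top-level domain regardless of origin; all other sites are kept.
function keepForPlotIV(site) {
  if (!site.urlContainsLangCodeInPath) return true; // no "mi" marker: keep
  const outsideNZAU = !["NZ", "AU"].includes(site.geoLocationCountryCode);
  const hasNZTld = site.tld === "nz";
  return !(outsideNZAU || hasNZTld);
}
```

This is a sketch of one interpretation of the rule, not the project's actual query; the original wording of (iv) leaves some room for reading the .nz clause differently.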


------------------------------------------
UNUSED:

Scope
-------------------

The study limits its investigation to locating textual content in Māori on the web, thus excluding audio-visual materials in Māori, or Māori cultural and community content that may be presented in non-Māori languages such as English.


Implementation
-------------------
We considered a few approaches to the problem of locating Māori language text content on the web. An obvious one was to run an unbridled crawl of the internet using several seed URLs consisting of known major Māori language New Zealand websites. In looking into whether there were more straightforward means than crawling the entire internet and then discarding content not detected as Māori, we discovered Common Crawl (CC), which "builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone". [https://commoncrawl.org/]

Common Crawl's large data sets are stored on distributed file systems, and accessing their content likewise requires distributed processing. Crawl data of interest is requested and retrieved by querying their columnar index. Since September 2019, Common Crawl have added the content_languages field to this index, which stores the top detected language(s) for each crawled page. A request for a monthly crawl's data set can thus be restricted to just those pages matching the required language(s), which suited our purposes. In our case, we requested crawled content that Common Crawl had recorded as "MRI", the 3-letter language code for the Māori language, rather than crawled web pages for which MRI was but one among several detected languages. We obtained the results for 12 contiguous months of Common Crawl's crawl data, spanning Sep 2018 to Aug 2019. The content was returned in WARC format, which our Common Crawl querying script then converted to the WET format, a process that reduced HTML-marked-up web pages to just their extracted text, as this was the portion of interest.

Our next aim was to further inspect the websites in the Common Crawl result set by crawling each site in greater depth using Apache Nutch, with an eye toward running Apache OpenNLP's language detection on the text content of each crawled web page. The purpose of applying this additional layer of language detection was to increase the accuracy of determining whether the language, and therefore also the web page or site at large, remained relevant. Our CCWETProcessor.java program processed the multiple WET files obtained for all 12 months of Common Crawl data together. The program was intended to further reduce the list of websites to crawl by excluding blacklisted (adult) and obviously auto-translated product (greylisted) sites, and to create a set of seed URLs and a regex-urlfilter.txt file for each remaining website to facilitate Nutch's crawling of it. We used a blanket crawl depth of 10. Although such a depth did not suffice to thoroughly crawl every website in the final list, subsequent analysis of the crawled set of web pages for a site could always indicate whether the website proved to be of sufficient interest to warrant exhaustive re-crawling in future. (While waiting for Nutch to run over the curated list of sites, a few further sites were excluded from crawling when manual inspection revealed they were just auto-translated product websites.)

Nutch stores its crawl data in a database but can dump each website's contents into a text file. The subsequent phase was running NutchTextDumpToMongoDB.java to process the text dump of each website crawled by Nutch, splitting it into its individual web pages and then computing metadata at both the website level and the page level. Both this metadata and the full text of each web page were stored in MongoDB. Page-level metadata included the primary language detected by OpenNLP for an entire page as well as for individual sentences of the page. Site-level metadata included a flag indicating whether any of a site's web pages, or any sentence in any of its web pages, was detected to be primarily Māori by OpenNLP. Further site-level metadata were a flag indicating whether a web page's URL contained the 2-letter code for Māori (mi) as prefix or suffix, and a field storing the site's originating country, to allow eventual filtering away of auto-translated websites in MongoDB query results.

In the final phase, we queried the Nutch-crawled data stored in MongoDB to answer some of our questions.

------------------------------------------




(iv) the sites from (iii) excluding any websites originating outside of New Zealand or Australia that have the two-letter code for Māori as URL prefix or suffix (mi.* or */mi),
(v) the sites from (iii) excluding any websites with either the .nz top-level domain or originating outside of New Zealand or Australia that have the two-letter code for Māori as URL prefix or suffix (mi.* or */mi),
