Changeset 33828


Timestamp: 2020-01-14T22:09:43+13:00
Author: ak19
Message: Additions and modifications to the write-up.

File: other-projects/maori-lang-detection/writeup (1 edited; r33825 to r33828)

Notes:
- Common Crawl is 2 words
- Maori needs macron
- web page, web site?
- auto(-)translated, automatically translated

Scope
-------------------
We limited our investigations at this stage to locating textual content in Māori on the web. This excludes audio-visual materials in Māori, as well as any Māori cultural and community content that may be presented in other languages such as English, despite the value of such content in creating an eventual digital repository of Māori resources for preservation and researchers.


Implementation
-------------------
We considered a few ways of approaching the problem of locating Māori language text content on the web. An obvious one was to run an unbridled crawl of the internet using several seed URLs consisting of known major Māori language New Zealand websites. In looking into whether there were more straightforward means than crawling the entire internet and then discarding content not detected as Māori, we discovered Common Crawl (CC), which "builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone". [https://commoncrawl.org/]

Common Crawl's large data sets are stored on distributed file systems, and accessing their content likewise requires distributed processing. Crawl data of interest is requested and retrieved by querying their columnar index. Since September 2019, Common Crawl have added the "content_languages" field to this index, which stores the top detected language(s) for each crawled page. A request for a monthly crawl's data set can thus be restricted to just those pages matching the required language(s), which suited our purposes. In our case, we requested crawled content that Common Crawl had recorded as "MRI", the 3-letter language code for the Māori language, rather than crawled web pages for which MRI was but one among several detected languages. We obtained the results for 12 contiguous months of Common Crawl's crawl data, spanning Sep 2018 to Aug 2019. The content was returned in WARC format, which our Common Crawl querying script then converted to the WET format, a process that reduced HTML-marked-up web pages to just their extracted text, as this was the portion of interest.
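The language restriction amounts to a filter over index records. A minimal sketch, assuming records carry a comma-separated content_languages value (the record shape here is illustrative, not Common Crawl's exact columnar index schema):

```javascript
// Sketch: keep only index records whose sole detected language is
// Maori ("mri" in ISO 639-3), as opposed to one language among several.
function isSolelyMaori(record) {
  const langs = (record.content_languages || "").split(",").filter(Boolean);
  return langs.length === 1 && langs[0] === "mri";
}

// Hypothetical sample records for illustration.
const sample = [
  { url: "https://example.org/a", content_languages: "mri" },
  { url: "https://example.org/b", content_languages: "mri,eng" },
  { url: "https://example.org/c", content_languages: "eng" },
];
const solelyMaori = sample.filter(isSolelyMaori).map((r) => r.url);
```

Only the first sample record survives the filter, matching the "MRI only" restriction described above.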

Our next aim was to further inspect the websites in the Common Crawl result set by crawling each site in greater depth using Apache Nutch, with an eye toward running Apache OpenNLP's language detection on the text content of each crawled web page. The purpose of applying this additional layer of language detection was to increase the accuracy of determining whether the language, and therefore also the web page or site at large, remained relevant. To this end, the multiple WET files obtained for the 12 months of Common Crawl data were first processed to further reduce the list of websites to crawl by excluding blacklisted (adult) and obviously auto-translated product (greylisted) sites. A set of seed URLs and URL exclusion/inclusion filters for each remaining website facilitated Nutch's crawling of them. We used a blanket crawl depth of 10. Although such a depth did not suffice to thoroughly crawl every website in the shortlist, subsequent analysis of the crawled set of web pages for a site could always indicate whether the site warranted exhaustive re-crawling in future. (While waiting for Nutch to run over the curated list of sites, a few further sites were excluded from crawling when manual inspection revealed they were just auto-translated product websites.)
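The shortlisting and filter-generation step can be sketched as follows. The domain names and list contents are hypothetical, and the include rule only approximates a Nutch regex-urlfilter line; the project's actual lists and filter files differ:

```javascript
// Sketch: drop blacklisted (adult) and greylisted (auto-translated product)
// domains, then emit a seed URL and an include filter per remaining site.
const blacklist = new Set(["adult-site.example"]);
const greylist = new Set(["autotranslated-shop.example"]);

// Escape regex metacharacters so a domain can be embedded in a filter rule.
function escapeForRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function crawlSpecs(domains) {
  return domains
    .filter((d) => !blacklist.has(d) && !greylist.has(d))
    .map((d) => ({
      seed: `https://${d}/`,
      // '+' marks an include rule; the anchor keeps the crawl within the site.
      includeFilter: `+^https?://${escapeForRegex(d)}/`,
    }));
}

const specs = crawlSpecs([
  "maori-language-site.example",
  "adult-site.example",
  "autotranslated-shop.example",
]);
```

Of the three hypothetical domains, only the first yields a seed URL and filter; the other two are excluded before Nutch ever sees them.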

Nutch stores its crawl data in a database but can dump each website's contents into a text file. The subsequent phase involved processing the text dump of each website crawled by Nutch, splitting it into its individual web pages and then computing metadata at both the website level and the page level. Both this metadata and the full text of each web page were stored in MongoDB. Page-level metadata included the primary language detected by OpenNLP for an entire page as well as for individual sentences of the page. Site-level metadata included a flag indicating whether any of a site's web pages, or any sentence in any of its web pages, was detected to be primarily Māori by OpenNLP. Further site-level metadata were a flag indicating whether a web page's URL contained the 2-letter code for Māori (mi) as prefix or suffix, and a field storing the site's originating country, to allow eventual filtering away of auto-translated websites in MongoDB query results.
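The URL flag can be expressed as a small predicate. This sketch mirrors the /mi(/) and http(s)://mi.* patterns referred to in the later queries, though the project's actual check may differ in detail:

```javascript
// Sketch of the site-level URL flag: does a URL carry the 2-letter Maori
// code "mi" as a subdomain prefix (http(s)://mi.*) or as a path segment
// (/mi at the end, or /mi/ anywhere in the path)?
function urlContainsLangCodeInPath(url) {
  const asSubdomain = /^https?:\/\/mi\./.test(url);
  const asPathSegment = /\/mi(\/|$)/.test(url);
  return asSubdomain || asPathSegment;
}
```

Note that a URL like https://example.org/minutes does not trigger the flag, since "mi" must be a whole subdomain or path segment.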

In the final phase, we queried the Nutch-crawled data stored in MongoDB to answer some of our questions.



Results and Discussion
-------------------------

It became apparent quite early on, when inspecting the web pages returned by querying Common Crawl for Māori as the content_languages value, that there were a lot of low-quality websites. A particular problem was the presence of many automatically translated product sites. Ideally, we wanted these removed from the final result set, both to get a more authentic picture of where on the internet real Māori language textual content was to be found, and because such data may ultimately go into a repository of high-quality Māori language materials for future analysis by researchers. At present, websites carry no consistent indicator of whether they were automatically translated or whether their textual content was composed by humans. This makes it hard to detect such sites programmatically so that they can be excluded when necessary. This investigation suggests there is a case for the World Wide Web Consortium to mandate metadata indicating whether a web page's content (or a subset of it) was automatically generated or composed by a human.


---
Some basic MongoDB queries with results:

# Num websites
db.getCollection('Websites').find({}).count()
1445

# Num webpages
db.getCollection('Webpages').find({}).count()
117496

# Number of websites with 1 or more pages detected as being in Māori (a positive numPagesInMRI)
db.getCollection('Websites').find({numPagesInMRI: {$gt: 0}}).count()
361

# Number of websites containing at least one sentence for which OpenNLP detected the best language to be MRI
db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
868

# The union of the above two equals the numPagesContainingMRI count, since every site counted by numPagesInMRI is also counted by numPagesContainingMRI:
db.getCollection('Websites').find({ $or: [ { numPagesInMRI: { $gt: 0 } }, { numPagesContainingMRI: {$gt: 0} } ] }).count()
868

# Number of webpages deemed to be overall in MRI (pages where isMRI=true)
db.getCollection('Webpages').find({isMRI: true}).count()
7818

# Number of webpages containing any number of MRI sentences
db.getCollection('Webpages').find({containsMRI: true}).count()
20371

# Number of websites with URLs containing /mi(/) OR http(s)://mi.*
db.getCollection('Websites').find({urlContainsLangCodeInPath: true}).count()
670

# Number of websites outside NZ containing /mi(/) OR http(s)://mi.* in any of their sub-URLs
db.getCollection('Websites').find({urlContainsLangCodeInPath: true, geoLocationCountryCode: {$ne: "NZ"}}).count()
656

# 14 websites in NZ with URLs containing /mi(/) OR http(s)://mi.*
db.getCollection('Websites').find({urlContainsLangCodeInPath: true, geoLocationCountryCode: "NZ"}).count()
14
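The per-country site tallies behind the GeoJSON plots described below amount to a simple group-by over the same site records. A minimal sketch, bucketing unknown origins separately (the plots place them on Antarctica, country code "AQ"); the field name mirrors the geoLocationCountryCode used in the queries above:

```javascript
// Sketch: count site records per originating country, mapping sites with
// no known origin to "AQ" (Antarctica) as the plots do.
function countsByCountry(sites) {
  const counts = {};
  for (const site of sites) {
    const code = site.geoLocationCountryCode || "AQ"; // unknown -> Antarctica
    counts[code] = (counts[code] || 0) + 1;
  }
  return counts;
}

// Hypothetical records for illustration.
const demoCounts = countsByCountry([
  { geoLocationCountryCode: "NZ" },
  { geoLocationCountryCode: "NZ" },
  { geoLocationCountryCode: "CN" },
  {},
]);
```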

---

GeoJSON plots, generated with http://geojson.tools/, display counts by country of website server origin (plotted on Antarctica where unknown) for:
(i) all the Nutch-crawled site-level data, consisting of over 1400 websites returned by Common Crawl as having content in Māori;
(ii) those sites of (i) containing one or more pages where the primary language detected by OpenNLP was Māori;
(iii) those sites of (i) containing any page where the primary language of one or more sentences was detected by OpenNLP to be Māori;
(iv) the sites from (iii), excluding any websites that have the two-letter code for Māori as URL prefix or suffix (mi.* or */mi) if they originate outside New Zealand or Australia, or if they have an .nz top-level domain regardless of country of origin. The assumption is that any non-NZ website using "mi" in the URL prefix or suffix is likely to be auto-translated; manual inspection confirmed this to be the case for Chinese-origin sites;
(v) the same as (iv), but grouping sites that originate in New Zealand or have an .nz top-level domain under the counts for New Zealand (NZ);
(vi) the sites of (v), excluding any that were misdetected as Māori, that contained only Māori New Zealand place names (such as in holiday photo captions), or that were still auto-translated websites. This gives a more accurate picture of the sites that contain genuine, higher-quality Māori language content.
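Under one reading of exclusion rule (iv), the filter over the stored site-level metadata can be sketched as below. The urlContainsLangCodeInPath and geoLocationCountryCode fields mirror the MongoDB queries above, while tld is a hypothetical field added for illustration:

```javascript
// Sketch of the plot (iv) exclusion: sites whose URLs carry the "mi" code
// are dropped if they originate outside NZ/AU, or if they have an .nz
// top-level domain regardless of origin; all other sites are kept.
function keepForPlotIV(site) {
  if (!site.urlContainsLangCodeInPath) return true; // no "mi" marker: keep
  const outsideNZAU = !["NZ", "AU"].includes(site.geoLocationCountryCode);
  const hasNZTld = site.tld === "nz";
  return !(outsideNZAU || hasNZTld);
}
```

This is a sketch of one interpretation of the rule, not the project's actual query; the original wording of (iv) leaves some room for reading the .nz clause differently.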


------------------------------------------
UNUSED:

Scope
-------------------

The study limits its investigation to locating textual content in Māori on the web, thus excluding audio-visual materials in Māori, or Māori cultural and community content that may be presented in non-Māori languages such as English.


Implementation
-------------------
We considered a few approaches to the problem of locating Māori language text content on the web. An obvious one was to run an unbridled crawl of the internet using several seed URLs consisting of known major Māori language New Zealand websites. In looking into whether there were more straightforward means than crawling the entire internet and then discarding content not detected as Māori, we discovered Common Crawl (CC), which "builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone". [https://commoncrawl.org/]

Common Crawl's large data sets are stored on distributed file systems, and accessing their content likewise requires distributed processing. Crawl data of interest is requested and retrieved by querying their columnar index. Since September 2019, Common Crawl have added the content_languages field to this index, which stores the top detected language(s) for each crawled page. A request for a monthly crawl's data set can thus be restricted to just those pages matching the required language(s), which suited our purposes. In our case, we requested crawled content that Common Crawl had recorded as "MRI", the 3-letter language code for the Māori language, rather than crawled web pages for which MRI was but one among several detected languages. We obtained the results for 12 contiguous months of Common Crawl's crawl data, spanning Sep 2018 to Aug 2019. The content was returned in WARC format, which our Common Crawl querying script then converted to the WET format, a process that reduced HTML-marked-up web pages to just their extracted text, as this was the portion of interest.

Our next aim was to further inspect the websites in the Common Crawl result set by crawling each site in greater depth using Apache Nutch, with an eye toward running Apache OpenNLP's language detection on the text content of each crawled web page. The purpose of applying this additional layer of language detection was to increase the accuracy of determining whether the language, and therefore also the web page or site at large, remained relevant. Our CCWETProcessor.java program processed the multiple WET files obtained for all 12 months of Common Crawl data together. The program was intended to further reduce the list of websites to crawl by excluding blacklisted (adult) and obviously auto-translated product (greylisted) sites, and to create a set of seed URLs and a regex-urlfilter.txt file for each remaining website to facilitate Nutch's crawling of it. We used a blanket crawl depth of 10. Although such a depth did not suffice to thoroughly crawl every website in the final list, subsequent analysis of the crawled set of web pages for a site could always indicate whether the website proved to be of sufficient interest to warrant exhaustive re-crawling in future. (While waiting for Nutch to run over the curated list of sites, a few further sites were excluded from crawling when manual inspection revealed they were just auto-translated product websites.)

Nutch stores its crawl data in a database but can dump each website's contents into a text file. The subsequent phase was running NutchTextDumpToMongoDB.java to process the text dump of each website crawled by Nutch, splitting it into its individual web pages and then computing metadata at both the website level and the page level. Both this metadata and the full text of each web page were stored in MongoDB. Page-level metadata included the primary language detected by OpenNLP for an entire page as well as for individual sentences of the page. Site-level metadata included a flag indicating whether any of a site's web pages, or any sentence in any of its web pages, was detected to be primarily Māori by OpenNLP. Further site-level metadata were a flag indicating whether a web page's URL contained the 2-letter code for Māori (mi) as prefix or suffix, and a field storing the site's originating country, to allow eventual filtering away of auto-translated websites in MongoDB query results.

In the final phase, we queried the Nutch-crawled data stored in MongoDB to answer some of our questions.

------------------------------------------




(iv) the sites from (iii) excluding any websites originating outside of New Zealand or Australia that have the two-letter code for Māori as URL prefix or suffix (mi.* or */mi),
(v) the sites from (iii) excluding any websites with either the .nz top-level domain or originating outside of New Zealand or Australia that have the two-letter code for Māori as URL prefix or suffix (mi.* or */mi),
