Notes:
- Common Crawl is 2 words
- Maori needs macron
- web page, web site?
- auto(-)translated, automatically translated

Scope
-------------------
We limited our investigation at this stage to locating textual content in Māori on the web. This excludes audio-visual materials in Māori, as well as Māori cultural and community content that may be presented in other languages such as English, despite the value of such content to an eventual digital repository of Māori resources for preservation and research.


Implementation
-------------------
We considered several ways of approaching the problem of locating Māori language text content on the web. An obvious one was to run an unrestricted crawl of the web, seeded with the URLs of known major Māori language New Zealand websites. While investigating whether there was a more straightforward means than crawling the entire web and then discarding content not detected as Māori, we discovered Common Crawl (CC), which "builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone" [https://commoncrawl.org/].

Common Crawl's large data sets are stored on distributed file systems, and the same is required to access their content. Crawl data of interest is requested and retrieved by querying Common Crawl's columnar index. Since September 2019, this index has included a "content_languages" field, which records the top detected language(s) for each crawled page. A request for a monthly crawl's data set can therefore be restricted to just those pages matching the required language(s), which suited our purposes. We requested crawled content that Common Crawl had recorded as being "MRI", the 3-letter language code for Māori, rather than pages for which MRI was only one among several detected languages. We obtained results for 12 contiguous months of Common Crawl data, spanning Sep 2018 to Aug 2019. The content was returned in WARC format, which our Common Crawl querying script then additionally converted to the WET format, a process that reduces HTML-marked-up web pages to just their extracted text, the portion of interest to us.
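
One common way to query the columnar index is with SQL over Common Crawl's public Parquet files, for instance through Amazon Athena or Apache Spark. The following is a minimal sketch of such a query using Spark's Java API; the S3 path is Common Crawl's published index location, while the crawl label (CC-MAIN-2019-35, the Aug 2019 crawl) and the output path are illustrative assumptions, and our own querying script may differ in its details.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class QueryCCIndexForMRI {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cc-index-mri").getOrCreate();

        // Common Crawl's columnar index is public Parquet data on S3,
        // partitioned by crawl and subset.
        Dataset<Row> index = spark.read()
                .parquet("s3://commoncrawl/cc-index/table/cc-main/warc/");
        index.createOrReplaceTempView("ccindex");

        // Restrict one monthly crawl to pages whose only detected
        // content language is Maori (ISO 639-3 code "mri").
        Dataset<Row> mri = spark.sql(
            "SELECT url, warc_filename, warc_record_offset, warc_record_length "
          + "FROM ccindex "
          + "WHERE crawl = 'CC-MAIN-2019-35' "   // assumed label for Aug 2019
          + "AND subset = 'warc' "
          + "AND content_languages = 'mri'");

        // The (filename, offset, length) triples locate the WARC records to
        // fetch, which can then be reduced to WET-style extracted text.
        mri.write().csv("hdfs:///user/cc/mri-records");   // assumed output path
    }
}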

Our next aim was to inspect the websites in the Common Crawl result set more closely by crawling each site in greater depth using Apache Nutch, with a view to running Apache OpenNLP language detection on the text content of every crawled web page of a site. The purpose of this additional layer of language detection was to increase our confidence in determining whether the language, and therefore the web page or the site at large, remained relevant. To this end, the multiple WET files obtained for the 12 months of Common Crawl data were first processed to reduce the list of websites to crawl, by excluding blacklisted (adult) sites and obviously auto-translated product (greylisted) sites. For each remaining website, a set of seed URLs and URL inclusion/exclusion filters was generated to drive Nutch's crawl of that site, as sketched below. We used a blanket crawl depth of 10. Although this depth did not suffice to crawl every shortlisted website exhaustively, subsequent analysis of the set of pages crawled for a site could always indicate whether the site warranted exhaustive re-crawling in future. (While waiting for Nutch to run over the curated list of sites, a few further sites were excluded from crawling when manual inspection revealed that they were merely auto-translated product websites.)
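
To make the per-site crawl configuration concrete, the sketch below writes a seed URL list and a Nutch regex-urlfilter.txt for a single site. The helper class, file layout and example domain are hypothetical; only the filter syntax is Nutch's own (Java regexes prefixed with '+' to include and '-' to exclude, applied in order).

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: emits the two files used to confine a Nutch crawl to one site.
public class SiteCrawlConfigWriter {

    public static void writeSiteConfig(String domain, List<String> seedUrls,
                                       Path outDir) throws IOException {
        Path siteDir = Files.createDirectories(outDir.resolve(domain));

        // Seed URLs: one URL per line, used to start the crawl of this site.
        Files.write(siteDir.resolve("seedURLs.txt"), seedUrls);

        // regex-urlfilter.txt: '+' lines include matching URLs, '-' lines
        // exclude them; the final "-." rejects everything else, so the crawl
        // stays within this site's domain and subdomains.
        List<String> filterRules = List.of(
                "+^https?://([a-z0-9-]+\\.)*" + domain.replace(".", "\\.") + "/",
                "-.");
        Files.write(siteDir.resolve("regex-urlfilter.txt"), filterRules);
    }

    public static void main(String[] args) throws IOException {
        // Example usage for a hypothetical site with a Maori-language section.
        writeSiteConfig("example.org",
                List.of("https://example.org/mi/"),
                Paths.get("to_crawl"));
    }
}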

Nutch stores its crawl data in a database but can dump each website's contents into a text file. The subsequent phase involved processing the text dump of each website crawled by Nutch, splitting it into its individual web pages and then computing metadata at both the website level and the page level. Both this metadata and the full text of each web page were stored in MongoDB. Page-level metadata included the primary language detected by OpenNLP for the page as a whole, as well as for each individual sentence of the page. Site-level metadata included a flag to indicate whether any of a site's web pages, or any sentence in any of its web pages, was detected by OpenNLP as being primarily Māori. Further site-level metadata comprised a flag indicating whether any of a site's page URLs contained the 2-letter code for Māori (mi) as a prefix or suffix, and a field storing the site's originating country, to allow auto-translated websites to be filtered out of MongoDB query results later.
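
The per-page and per-sentence language detection can be illustrated with Apache OpenNLP's language detector and its pre-trained model (langdetect-183.bin), which reports Māori under the ISO 639-3 code "mri". The sketch below is not our processing program itself: the naive sentence splitting is an illustrative assumption, and the isMRI/containsMRI names simply mirror the page-level fields queried later.

import java.io.File;
import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

public class PageLanguageSketch {
    public static void main(String[] args) throws Exception {
        // Pre-trained OpenNLP language detection model (covers 103 languages).
        LanguageDetectorModel model =
                new LanguageDetectorModel(new File("langdetect-183.bin"));
        LanguageDetectorME detector = new LanguageDetectorME(model);

        String pageText = "..."; // full text of one web page from the Nutch dump

        // Page-level detection: best language for the page as a whole.
        Language best = detector.predictLanguage(pageText);
        boolean isMRI = "mri".equals(best.getLang());

        // Sentence-level detection: naive split on sentence-final punctuation,
        // then detect each sentence separately.
        boolean containsMRI = false;
        for (String sentence : pageText.split("(?<=[.!?])\\s+")) {
            if ("mri".equals(detector.predictLanguage(sentence).getLang())) {
                containsMRI = true;
                break;
            }
        }

        // isMRI and containsMRI correspond to the page-level flags stored in
        // MongoDB; site-level counters aggregate them across a site's pages.
        System.out.println("isMRI=" + isMRI + " containsMRI=" + containsMRI);
    }
}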

In the final phase, we queried the Nutch-crawled data stored in MongoDB to answer some of our questions.


Results and Discussion
-------------------------

It became apparent quite early on, when inspecting the web pages returned by querying Common Crawl for Māori as the content language, that many of the websites were of low quality. A particular problem was the large number of automatically translated product sites. Ideally, we wanted these removed from the final result set, both to gain a more authentic picture of where on the web genuine Māori language textual content is to be found, and because such data might ultimately go into a repository of high-quality Māori language materials for future analysis by researchers. At present, websites carry no consistent indicator of whether they were automatically translated or whether their textual content was composed by humans. This makes it hard to detect such instances programmatically so that they can be excluded when necessary. This investigation suggests there is a case to be made for the World Wide Web Consortium to require metadata indicating whether a web page's content (or a subset of it) was automatically generated or naturally created by a human.


---
Some basic MongoDB queries with results:

# Num websites
db.getCollection('Websites').find({}).count()
1445

# Num webpages
db.getCollection('Webpages').find({}).count()
117496

# Find number of websites that have 1 or more pages detected as being in Māori (a positive numPagesInMRI)
db.getCollection('Websites').find({numPagesInMRI: { $gt: 0}}).count()
361

# Number of sites containing at least one sentence for which OpenNLP detected the best language = MRI
db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
868

# The union of the above two gives the same count as numPagesContainingMRI alone, since pages detected as primarily MRI also contain MRI sentences:
db.getCollection('Websites').find({ $or: [ { numPagesInMRI: { $gt: 0 } }, { numPagesContainingMRI: {$gt: 0} } ] } ).count()
868

# Find number of webpages that are deemed to be overall in MRI (pages where isMRI=true)
db.getCollection('Webpages').find({isMRI:true}).count()
7818

# Number of pages that contain any number of MRI sentences
db.getCollection('Webpages').find({containsMRI: true}).count()
20371

# Number of sites with URLs containing /mi(/) OR http(s)://mi.*
db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
670

# Number of websites outside NZ that contain /mi(/) OR http(s)://mi.* in any of their sub-URLs
db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
656

# 14 sites with URLs containing /mi(/) OR http(s)://mi.* that are in NZ
db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: "NZ"}).count()
14

---

GeoJSON plots, generated with http://geojson.tools/, display counts by country of website server origin (plotted on Antarctica where the country is unknown) for the following subsets; a sketch of the underlying count-by-country query follows this list:
(i) all of the Nutch-crawled site-level data, consisting of the over 1400 websites whose content Common Crawl returned as being in Māori,
(ii) those sites of (i) containing one or more pages where the primary language detected by OpenNLP was Māori,
(iii) those sites of (i) containing any page where the primary language of one or more sentences was detected by OpenNLP to be Māori,
(iv) the sites from (iii), excluding any website that has the two-letter code for Māori as URL prefix or suffix (mi.* or */mi) if it originates outside New Zealand or Australia, with a .nz top-level domain treated as a New Zealand origin regardless of the server's country. The assumption is that any non-NZ website using "mi" as a URL prefix or suffix is likely to be auto-translated; manual inspection confirmed this to be the case for sites of Chinese origin,
(v) the same as (iv), but grouping sites that originate in New Zealand or have a .nz top-level domain under the counts for New Zealand (NZ), and
(vi) the sites of (v), excluding any that were misdetected as Māori, that contained only Māori New Zealand place names (for example in holiday photo captions), or that still turned out to be auto-translated websites. This gives a more accurate picture of the sites that contain genuine or higher-quality Māori language content.
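
For illustration, the count-by-country figures behind these plots can be obtained with an aggregation over the site-level collection. The sketch below uses the MongoDB Java driver and shows the variant for subset (iii); the connection string and database name are placeholder assumptions, and subsets (iv)-(vi) would add the URL-code and country conditions described above to the match stage.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Arrays;

import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.match;
import static com.mongodb.client.model.Aggregates.sort;
import static com.mongodb.client.model.Filters.gt;
import static com.mongodb.client.model.Sorts.descending;

public class CountSitesByCountry {
    public static void main(String[] args) {
        // Placeholder connection string and database name.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> websites =
                    client.getDatabase("mriCrawlData").getCollection("Websites");

            // Subset (iii): sites with at least one sentence detected as MRI,
            // counted per originating country code (the geoLocationCountryCode
            // field; sites with unknown country are plotted on Antarctica).
            websites.aggregate(Arrays.asList(
                    match(gt("numPagesContainingMRI", 0)),
                    group("$geoLocationCountryCode", sum("count", 1)),
                    sort(descending("count"))
            )).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}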


------------------------------------------
UNUSED:

Scope
-------------------

The study limits its investigation to locating textual content in Maori on the web, thus excluding audio-visual materials in Maori, or Maori cultural and community content that may be presented in non-Maori languages such as English.



Implementation
-------------------
We considered a few approaches to the problem of locating Maori language text content on the web. An apparent one was to run an unbridled crawl of the internet using several seed URLs consisting of known major Maori language New Zealand websites. In looking into whether there were more straightforward means than crawling the entire internet and then deleting content not detected as Maori, we discovered Common Crawl (CC), which "builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone". [https://commoncrawl.org/]

Common Crawl's large data sets are stored on distributed file systems and the same is required to access their content. Crawl data of interest is requested and retrieved by querying their columnar index. Since September 2019, Common Crawl have added the content_languages field to their columnar index for querying, wherein is stored the top detected language(s) for each crawled page. A request for a monthly crawl's data set can thus be restricted to just those matching the language(s) required, which suited our purposes. In our case, we requested crawled content that Common Crawl had stored as being "MRI", the 3 letter language code for the Maori language, rather than crawled web pages for which MRI was but one among several detected languages. We obtained the results for 12 contiguous months worth of Common Crawl's crawl data, spanning Sep 2018 to Aug 2019. The content was returned in WARC format, which our Common Crawl querying script then additionally converted to the WET format, a process that reduced html marked up web pages to just their extracted text, as this was the portion of interest.

Our next aim was to further inspect the websites in the commoncrawl result set by crawling each site in greater depth using Apache Nutch, with an eye toward running Apache's Open NLP language detection on the text content of each crawled webpage of the site. The purpose of applying this additional layer of language detection to the content was to hopefully increase accuracy in determining whether the language, and therefore also the web page or site at large, remained relevant. Our CCWETProcwessor.java program processed the multiple WET files obtained for all of the 12 months of commoncrawl data together. The program was intended to further reduce the list of websites to crawl by excluding blacklisted (adult) and obviously autotranslated product (greylisted) sites, and to create a set of seedURLs and a regex-urlfilter.txt file for each remaining web site to facilitate Nutch's crawling of it. We used a blanket crawl depth of 10. Although such a depth did not suffice to thoroughly crawl every web site in the final list, subsequent analysis of the crawled set of web pages for a site could always indicate whether the web site proved to be sufficient interest to warrant exhaustive re-crawling in future. (While waiting for Nutch to run over the curated list of sites, a few further sites were excluded from crawling when manual inspection revealed they were just autotranslated product web sites.)

Nutch stores its crawl data in a database but can dump each website's contents into a text file. The subsequent phase was running NutchTextDumpToMongoDB.java to process the text dump of each web site crawled by Nutch, splitting it into its individual web pages and then computing metadata at both web site-level and page-level. Both this metadata and the full text of each web page were stored in MongoDB. Page-level metadata included storing the primary language detected by OpenNLP for an entire page as well as for individual sentences of the page. Site-level metadata stored a flag to indicate
whether any of a site's web pages or any sentence in any of its web pages were detected to be primarily Maori by OpenNLP. Some further site-level metadata were a flag indicating whether a web page's URL contained the 2 letter code for Maori (mi) as prefix or suffix and a field storing the site's originating country, to allow eventual filtering away of auto-translated web sites in MongoDB query results.

In the final phase, we queried the Nutch crawled data stored in MongoDB to answer some of our questions.

------------------------------------------


Common Crawl's large data sets are stored on distributed file systems and require the same to access the content by means of querying against their columnar index. Since September 2019, CommonCrawl have added the content_languages field to their columnar index for querying, wherein is stored the top detected language(s) for each crawled page. A request for a monthly crawl's data set can thus be restricted to just those matching the language(s) required. In our case, we requested crawled content that CommonCrawl had marked as being MRI, rather than pages for which MRI was one among several detected languages. We obtained the results for 12 months worth of CommonCrawl's crawl data, from Sep 2018 up to Aug 2019. The content was returned in WARC format, which our commoncrawl querying script then additionally converted to the WET format, containing just the extracted text contents, since the html markup and headers of web pages weren't of interest to us compared to being able to avoid parsing away the html ourselves.

Our next aim was to further inspect the websites in the commoncrawl result set by crawling each site in greater depth with Nutch, with an eye toward running Apache's Open NLP language detection on the text content of each crawled webpage of the site.

The multiple WET files obtained for each of the 12 months of commoncrawl data were all processed together by our CCWETProcwessor.java program. Its purpose was to further reduce the list of websites to crawl by excluding blacklisted (adult) and obviously autotranslated product (greylisted) sites, and to create a set of seedURLs and a regex-urlfilter.txt file for each site to allow Nutch to crawl it. We used a crawl depth of 10. Although not sufficient for all crawled websites, further processing of the crawled set of webpages for a site could always indicate to us whether the website was of sufficient interest to warrant exhaustive re-crawling in future. (While waiting for Nutch to run over the curated list of sites, a few further sites were excluded from crawling when manual inspection determined they were just autotranslated product web sites.)

Nutch stores its crawl data in a database but can dump each website's contents into a text file. The subsequent phase involved processing the text dump of each website to split it into its webpages and computing website-level and webpage-level metadata



----------------

We thus obtained multiple wet files for each of the 12 months of commoncrawl data. These were all processed together by our CCWETProcwessor.java program, which would exclude blacklisted (adult) sites and any obviously autotranslated product sites (which were "greylisted"), before producing a final list of websites to be inspected further by first crawling each site in greater depth with Nutch. For each site to be crawled, a list of seedURLs and regex-url-filter.txt was produced to work with Nutch.








(iv) the sites from (iii) excluding any websites originating outside of New Zealand or Australia that have the two letter code for Maori as URL prefix or suffix (mi.* or */mi),
(v) the sites from (iii) excluding any websites with either the .nz toplevel domain or originating outside of New Zealand or Australia that have the two letter code for Maori as URL prefix or suffix (mi.* or */mi),