1. Where on the web can Maori text be found?
2-letter langcode: MI
3-letter langcode: MRI


2. General limitations:
- only TEXT in Maori, not audio, video, etc.
- can't get at the deep web,
e.g. sites not linked up with the rest of the web,
large digital repositories where there are no
direct links to individual pages
but which are found only by searching


3. Initial consideration:
Do the exploratory crawl ourselves.

* unimpeded internet-wide crawl
* crawl just NZ (AU, UK) sites: limit by TLD

In both cases, start off with known NZ sites
acting as seed URLs for an exploratory search
via all linked sites.
Seed URLs could include NZ govt,
language resource sites, digital library sites,
Maori language blogs, community resource sites


4. Things to think about:
* web traps:
stuck crawling one or more pages forever.

Some crawling software deals with this
better than others, but problems remain

* disk space
In the early 2000s, the Internet Archive's
regular web-wide crawl was already in the petabytes.

To save space, we could analyse each site
once crawled and throw away unpromising ones
before crawling further

* when would we know we have enough data
to finally start analysing?


5. Alternatives to doing the
web-wide crawl ourselves:

Discovery of Ready-Made Crawl Data:
- payware sites that offer (query) access to
their web-wide crawl data for money
- free web crawl data offered by Common Crawl,
which encourages individuals, businesses and
institutions to use its crawl data,
so researchers won't burden the internet
with countless crawls for individual ends.

6. Common Crawl (CC) - limitations
- not exhaustive
  * crawls focus on breadth (representing
  a wide cross-section of the web), not full-depth
  crawls of sites, for copyright reasons among others.
  So we need to recrawl sites of interest
  at greater depth.
  * crawls are done monthly, trying to minimise
  overlap. So a month's crawl is not of
  the entire known web.
- needed an Amazon S3 (paid) account.
- distributed CC data needs a distributed
system to access/query it.
- Big data: still takes some time chugging away.


7. Advantages of using CC:
* Ready-made crawl data enriched with
metadata fields stored in a distributed DB
that you can run (distributed) queries against,
e.g. get all .nz TLD sites of a CC crawl.
* BETTER: Aug 2018 introduction of the "content-language"
metadata field, which stores the top few detected languages of
each web page in descending order.
Since Sep 2018, this field can be queried too!


8. Plan:
1. Query for MRI (Maori) as content-language
2. Pool the results of multiple contiguous months' worth of
crawl data, to construct a more complete cross-section of the web
3. Re-crawl each *site* (domain) found, at greater depth,
to hopefully crawl more sites fully than CC did.
(At least it's still not an exploratory search
of the entire internet.)
4. Run Apache OpenNLP language detection over
both downloaded web pages
AND individual sentences (ideally paragraphs...)
5. CC's language detector software wasn't Apache
OpenNLP, so it's still worth re-running detection over the recrawls.


9. * Initial testing effectively queried each CC crawl:
    get all webpages where
    content-language 'contains' MRI
But low-quality results!
e.g. single-word pages that weren't actually Maori.
* Ended up querying:
    content-language = MRI
(not just the primary language detected, but the
sole language detected)
Still some disappointing results, but far less common.
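
To make the 'contains' vs 'equals' distinction concrete, here is a sketch of the kind of SQL that can be run against Common Crawl's columnar URL index (the ccindex table, queryable via Amazon Athena). The exact queries we ran aren't reproduced here, and the crawl id is only an example.

// Sketch: two ways of querying CC's ccindex table for Maori pages.
// Column names follow Common Crawl's published ccindex schema;
// the crawl id 'CC-MAIN-2019-35' is an example.
public class CCIndexQueries {
    // Early attempt: MRI appears anywhere in content_languages
    // (primary or secondary detected language) - low-quality results.
    static final String CONTAINS_MRI =
        "SELECT url, warc_filename, warc_record_offset, warc_record_length "
      + "FROM ccindex "
      + "WHERE crawl = 'CC-MAIN-2019-35' AND subset = 'warc' "
      + "AND content_languages LIKE '%mri%'";

    // Refined query: MRI is the sole detected language of the page.
    static final String SOLE_MRI =
        "SELECT url, warc_filename, warc_record_offset, warc_record_length "
      + "FROM ccindex "
      + "WHERE crawl = 'CC-MAIN-2019-35' AND subset = 'warc' "
      + "AND content_languages = 'mri'";
}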


10. We began in July/Aug of 2018.
Queried Sep 2018 - Aug 2019 (12 months) of CC data.

Next, need to prepare the data for crawling locally:
- ensure unique domains across the CC crawl results
(see the sketch after this list),
- remove low-quality sites and process special sites,
- create seed URLs and regex filters for each site,
to recrawl at depth 10 with Apache Nutch
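
A minimal sketch of the first preparation step, deduplicating domains across the pooled CC result URLs (an illustrative approach only, not the project's actual code):

// Deduplicating domains from pooled CC result URLs (illustrative).
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class UniqueDomains {
    public static void main(String[] args) throws Exception {
        List<String> crawlResultUrls = Arrays.asList(
            "https://example.org/page1",
            "https://example.org/page2",
            "http://mi.example.net/");

        Set<String> domains = new TreeSet<>();
        for (String url : crawlResultUrls) {
            domains.add(new URI(url).getHost());  // one entry per unique host
        }
        domains.forEach(System.out::println);  // example.org, mi.example.net
    }
}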


11. Low-quality data
Countless auto-translated sites, like adult
and product sites:
- Blacklisted adult sites
- Greylisted obvious product sites providing (auto)
translations in countless languages of the globe.
But too many to go through.
Left this issue for "later" in the process pipeline.


12. Special-handling regex list for certain sites,
e.g. large sites.
Don't want to crawl all of blogspot or docs.google
or wikipedia, etc.
Instead crawl mi.wikipedia; <blogname>.blogspot;
docs.google/<individual-seed-page-id>
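
Per-site rules of this kind might look as follows in Nutch's own regex-urlfilter.txt format (a sketch; hostnames other than mi.wikipedia.org are placeholders):

# Sketch of per-site rules in Nutch's regex-urlfilter.txt format.
# Stay within the Maori-language Wikipedia:
+^https?://mi\.wikipedia\.org/
# Crawl one blog rather than all of blogspot:
+^https?://someblogname\.blogspot\.com/
# Exclude everything else:
-.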


13. <PROCESS FLOW CHART>


14. Stripping the HTML stripped out paragraph info,
so we had to deal with sentences as units.
But Apache OpenNLP language detection
prefers to work on >= 2 sentences at a time.

Still, in testing this, OpenNLP returned MRI as the
primary language for single sentences
as often as it did for 2 contiguous sentences,
but with a lower confidence level.
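
A minimal sketch of that comparison using OpenNLP's language detector API (the sample sentences and model file name are placeholders):

// Comparing OpenNLP language detection on one sentence vs two
// contiguous sentences (sample text and model path are placeholders).
import java.io.FileInputStream;
import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

public class SingleVsPair {
    public static void main(String[] args) throws Exception {
        LanguageDetectorModel model =
            new LanguageDetectorModel(new FileInputStream("langdetect-183.bin"));
        LanguageDetectorME detector = new LanguageDetectorME(model);

        String s1 = "Kei te pehea koe?";
        String s2 = "Kei te pai ahau.";

        Language single = detector.predictLanguage(s1);
        Language pair = detector.predictLanguage(s1 + " " + s2);

        // Both may come back as "mri", but the pair typically
        // scores a higher confidence than the lone sentence.
        System.out.println(single.getLang() + " " + single.getConfidence());
        System.out.println(pair.getLang() + " " + pair.getConfidence());
    }
}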


15. MongoDB webpage-level meta:
* URL,
* full page text of the downloaded webpage,
* "sentences" array (we trained a basic Apache
OpenNLP sentence model for MRI)
* isMRI? - whether OpenNLP detected MRI to be
the primary language of the overall page content
* containsMRI? - whether OpenNLP detected MRI as the
primary language of any sentence on the page


16. MongoDB website-level meta:
* domain,
* geo-location of the site's server,
* numPagesInMRI,
* numPagesContainingMRI,
* did_nutch_finish_crawling_site_fully?
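
A hedged sketch of what two such documents could look like when inserted with the MongoDB Java driver. The collection names ("webpages", "websites") and field spellings other than numPagesInMRI/numPagesContainingMRI are assumptions:

// Inserting one webpage-level and one website-level document.
import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;

public class StoreMeta {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("mridb");

            db.getCollection("webpages").insertOne(new Document()
                .append("URL", "https://example.org/a-page")
                .append("text", "...full text of the downloaded page...")
                .append("sentences", Arrays.asList("He aha tenei?", "Ko te kuri."))
                .append("isMRI", true)
                .append("containsMRI", true));

            db.getCollection("websites").insertOne(new Document()
                .append("domain", "example.org")
                .append("geoLocationCountryCode", "NZ")
                .append("numPagesInMRI", 3)
                .append("numPagesContainingMRI", 7)
                .append("nutchCrawlFinished", true));
        }
    }
}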


17. Querying MongoDB:
Simple queries:
* How many webSITES crawled?
(CC said these sites had MRI page(s))
* How many webPAGES crawled?
* How many PAGES with isMRI = true (OpenNLP)
* How many PAGES with containsMRI = true
* How many SITES where numPagesInMRI > 0
* How many SITES where numPagesContainingMRI > 0
(= sites with at least 1 webpage with at least one
sentence that OpenNLP detected as MRI)
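
With the Java driver, these simple counts could be expressed along the following lines (same assumed collection and field names as in the earlier sketch):

// The simple counts above, sketched with the MongoDB Java driver.
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Filters.gt;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;

public class SimpleCounts {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("mridb");

            long numSites = db.getCollection("websites").countDocuments();
            long numPages = db.getCollection("webpages").countDocuments();
            long pagesInMRI =
                db.getCollection("webpages").countDocuments(eq("isMRI", true));
            long pagesContainingMRI =
                db.getCollection("webpages").countDocuments(eq("containsMRI", true));
            long sitesWithMRIPages =
                db.getCollection("websites").countDocuments(gt("numPagesInMRI", 0));
            long sitesContainingMRI =
                db.getCollection("websites").countDocuments(gt("numPagesContainingMRI", 0));

            System.out.printf("%d sites, %d pages, %d pages in MRI, "
                    + "%d pages containing MRI, %d/%d sites%n",
                    numSites, numPages, pagesInMRI, pagesContainingMRI,
                    sitesWithMRIPages, sitesContainingMRI);
        }
    }
}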


After blacklisting, 1462 sites to crawl with Nutch, but a few were obvious
product sites, so these were removed before crawling or while crawling other sites.

After crawling:

# Num websites in MongoDB
1445

# Num webpages
117496

# Number of web SITES that contain 1 or more pages detected as being in Maori
# (num sites with a positive numPagesInMRI)
361

# Number of web SITES containing at least one page with at least one sentence
# for which OpenNLP detected the best language = MRI
# (num sites with a positive numPagesContainingMRI)
868

# Number of web PAGES that are deemed to be overall in MRI (pages where isMRI=true)
7818

# Number of web PAGES that contain any number of MRI sentences
20371

# Number of web SITES with crawled web pages that have any URLs containing
# /mi(/) OR http(s)://mi.*
670

# Number of web SITES outside NZ that contain /mi(/) OR http(s)://mi.*
# in any of their crawled webpage URLs
656

# Number of web SITES with page URLs containing /mi(/) OR http(s)://mi.*
# that are in NZ
14

# ATTEMPT TO FILTER OUT LIKELY AUTO-TRANSLATED SITES
# Count of all sites neither in NZ nor with a .nz TLD that don't have /mi(/)
# or http(s)://mi.* in the URL path of any crawled web pages of the site
220

# Count of websites that have at least 1 page containing at least one
# sentence detected as MRI AND which have mi in a webpage's URL path
491

# Websites that have some MRI detected AND which are either in NZ or have
# a .nz TLD, or (if from overseas) don't contain /mi or mi.* in a page's URL path
396

# Include Australia, to get the valid "kiwiproperty.com" website included
# in the result list
397

# Count of sites detected as containing at least one Maori sentence,
# excluding NZ-related sites and AU sites
221 websites

# But to produce the tentative non-product sites, we also want the aggregate
# for all NZ sites (from NZ or with a .nz TLD)
176

(Total is 221+176 = 397, which adds up.)

# Manually inspected the shortlist of the 221 non-NZ websites to weed out
# those that aren't MRI (misdetected as MRI, autotranslated, or just
# containing placenames etc.), then added the 176 NZ sites on top:

MANUAL INSPECTION: TOTAL COUNT BY COUNTRY OF SITES WITH AT LEAST ONE PAGE
CONTAINING ONE SENTENCE OF MRI CONTENT (numPagesContainingMRI > 0):
NZ: 126
US: 25+4
AU: 2
DE: 2
DK: 2
BG: 1
CZ: 1
ES: 1
FR: 1
IE: 1
TOTAL: 166

18. More complex MongoDB queries:
Count of SITES by site geolocation
where
- numPagesInMRI > 0
- numPagesContainingMRI > 0
(- AND miInURLPath = false for overseas sites)

Also: do the counts grouping NZ-origin sites
with ".nz" TLD sites (regardless of server
geo-origin) under NZ.
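
Such a per-country site count could be written with the Java driver's aggregation helpers, e.g. as below (the geolocation field name remains an assumption):

// Count of websites per country of server geolocation, for sites
// with numPagesContainingMRI > 0.
import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.match;
import static com.mongodb.client.model.Filters.gt;
import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

public class CountsByCountry {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            for (Document doc : client.getDatabase("mridb")
                    .getCollection("websites")
                    .aggregate(Arrays.asList(
                        match(gt("numPagesContainingMRI", 0)),
                        group("$geoLocationCountryCode", sum("count", 1))))) {
                System.out.println(doc.toJson());  // e.g. {"_id": "NZ", "count": ...}
            }
        }
    }
}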


19. Detected results can turn out low-quality:
- misdetection, e.g. Tongan, Kiribati, etc.
(languages not in the OpenNLP language model)
or ENG sentences with MRI words
detected as MRI sentences
- Maori personal and place names in
references and gallery photo captions
suffice to return sentences
and single-sentence pages as MRI
- autotranslated sites!!!!


20. Auto-translated content = UNWANTED

Don't want automatically translated sites
when building a corpus of high-quality Maori
language text for researchers to work with.

Also, it can be polluting:
auto-translated content can't serve as a
proper training data set to inform better
automatic translation in future either.


21. Heuristics for some detection
of auto-translated sites

Dr Dave Nichols suggested:
Find non-NZ sites that have /mi or mi.* in the URL
(the 2-letter code for Maori) and remove them,
as they're more likely to be product sites.

In practice: still had to wade through the list of all
overseas sites with page URLs containing "mi"
for the occasional exception.
And the reverse: some NZ sites with "mi" in any
web page's URL could be auto-translated product sites.
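
The URL test itself can be as simple as two regexes; the patterns below are assumptions meant to mirror the /mi(/) and http(s)://mi.* conditions used in the counts above:

// The "mi in URL" heuristic as two regexes (illustrative).
import java.util.regex.Pattern;

public class MiInURL {
    static final Pattern MI_PATH = Pattern.compile("/mi(/|$)");       // .../mi or .../mi/...
    static final Pattern MI_HOST = Pattern.compile("^https?://mi\\."); // mi. subdomain

    static boolean miInURL(String url) {
        return MI_PATH.matcher(url).find() || MI_HOST.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(miInURL("https://shop.example.com/mi/product")); // true
        System.out.println(miInURL("https://mi.example.com/"));             // true
        System.out.println(miInURL("https://example.nz/mince-pies"));       // false
    }
}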


22. Bigger problem:
Even if overseas sites with mi in page URLs
were filtered out, a large set of auto-translated
sites never use mi in the URL path.

PROBLEM: auto-translated sites can't be detected
automatically. Confirmed by Dr Stephen Joe,
Mr Bill Rogers, Dr Bainbridge.
Human, manual intervention is needed
to weed them out.


23. So we manually went through the MongoDB result list of
all websites with numPagesContainingMRI > 0
to shortlist just those websites which had any
webpage that truly contained at least one sentence
in MRI.

(Not even website[x].numPagesInMRI > 0)

24. Results
Results at website level (not webpage level).
<TABLES AND GEO-JSON MAPS>


25. Recommendation
There's a case to be made for WWW standards
to make it compulsory, including on legacy
sites, to include some indicator on each
webpage or even at paragraph level
(an HTML markup attribute comparable to "lang"?)
to denote whether the text content was formulated
by a human or auto-translated.

Or a processing sequence,
e.g. content-source="human, ocr, bot-translation"
for an automatic translation of a digitised book
by a human author.


26. Working on the final stages
- Code generates a random sample of webpage URLs
from the site listing, for which we can make
predictions at 90% confidence with a 5% margin
of error (see the sample-size sketch after this list).

Then need to go over each sample webpage URL
produced from the manually pruned webSITE listing,
and manually verify, in cases where a
webPAGE has isMRI=true, whether the page is
genuinely largely in Maori or not.

- Finish writing code to automatically run the
MongoDB queries I've manually run, to
summarise the results for generating tables
and geojson maps.
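
For reference, the standard sample-size calculation behind "90% confidence, 5% margin of error" is Cochran's formula with a finite population correction; whether the project's code uses exactly this is an assumption:

// Cochran's sample-size formula with finite population correction
// (the population figure is the containsMRI page count from above).
public class SampleSize {
    public static void main(String[] args) {
        double z = 1.645;       // z-score for 90% confidence
        double e = 0.05;        // 5% margin of error
        double p = 0.5;         // most conservative proportion
        int population = 20371; // pages with containsMRI = true

        double n0 = (z * z * p * (1 - p)) / (e * e); // infinite-population size
        double n = n0 / (1 + (n0 - 1) / population); // finite-population correction
        System.out.println((int) Math.ceil(n));      // ~268 for these inputs
    }
}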


27. Future work
- Knowing the site-level results, we
can fully recrawl those promising sites
that weren't fully crawled before
- Maybe retrain the OpenNLP language model
for Maori using the high-quality web pages found?


28. Wider Applicability
Repeating the process for other languages
not in wide use:

- CC prefers not to be burdened by data
requests for very common languages, but
low-resource languages are fine
- Check whether Apache OpenNLP supports the language,
else need to train and add a model.
- MongoDB queries need to be adjusted.
At present they're specific to Maori, e.g. its unique
geographic distribution: NZ + .nz TLD
treated specially vs overseas.
But for the French language, the TLDs of France,
Canada, New Caledonia etc. would need to be considered.