MRI_slideNotes.txt@ 33913

Last change on this file since 33913 was 33903, checked in by ak19, 4 years ago
My notes when preparing for today's meetings. Some of this may be useful to inform content of any presentation slides by Dr Bainbridge in future?
File size: 11.8 KB

1	1. Where on the web can Maori text be found?
2	2 letter-langcode: MI
3	3 letter-langcode: MRI
4
5
6	2. General limitations:
7	- only TEXT in Maori, not audio, video, etc
8	- can't get at the deep web
9	e.g. sites not linked up with rest of web,
10	large digital repositories where there's no
11	direct links to individual pages
12	but which are found only by searching
13
14
15	3. Initial consideration:
16	Do the exploratory Crawl ourselves.
17
18	* unimpeded internet-wide crawl
19	* crawl just NZ (AU, UK) sites: limit TLD
20
21	In both cases, start off with known NZ sites
22	acting as seed URLs for exploratory search
23	via all linked sites.
24	Seed URLs could include NZ govt,
25	language resource sites, digital library sites,
26	Maori language blogs, community resource sites
27
28
29	4. Things to think about:
30	* web traps:
31	stuck crawling one or more pages forever.
32
33	Some crawling software deals with this
34	better than others, but problems remain
35
36	* disk space
37	In the early 2000s, Internet Archive's
38	regular web-wide crawl was already in the petabytes.
39
40	To save space, we could analyse each site
41	once crawled and throw away unpromising ones
42	before crawling further
43
44	* when would we know we have enough data
45	to finally start analysing?
46
47
48	5. Alternative approaches to doing the
49	web-wide crawl ourselves:
50
51	Discovery of Ready-Made Crawl Data:
52	- payware site that offers access to
53	(query) its web-wide crawl data for money
54	- free web crawl data offered by Common Crawl,
55	which encourages individuals, businesses,
56	institutions to use its crawl data,
57	so researchers won't burden the internet
58	with countless crawls for individual ends.
59
60	6. Common Crawl (CC) - limitations
61	- not exhaustive
62	* crawls focus on breadth (representing
63	a wide cross-section of web), not full-depth
64	crawl of sites, for copyright reasons among others.
65	So need to recrawl sites of interest
66	at greater depth.
67	* crawls done monthly, trying to minimise
68	overlaps. So a month's crawl is not of
69	the entire known web.
70	- needed Amazon S3 (paid account).
71	- distributed CC data needs distributed
72	system to access/query.
73	- Big data: still takes some time chugging away.
74
75
76	7. Advantages of using CC:
77	* Ready-made crawl data enriched with
78	metadata fields, stored in a distributed DB
79	that you can run (distributed) queries against.
80	e.g. get all .nz TLD sites of a CC crawl.
81	* BETTER: Aug 2018 introduction of "content-language"
82	metadata field, stores top few detected languages of
83	each web page in descending order.
84	Since Sep 2018, this field can be queried too!
85
86
87	8. Plan:
88	1. Query for MRI (Maori) as content-language
89	2. Pool results of multiple contiguous months' worth of
90	crawl data, to construct a more complete cross-section of web
91	3. re-crawl each site (domain) found at greater depth
92	to hopefully crawl more sites fully than CC did.
93	(At least still not an exploratory search
94	of the entire internet.)
95	4. Run Apache OpenNLP language detection over
96	both downloaded web pages
97	AND individual sentences (ideally paragraphs...)
98	5. CC's language detector software wasn't Apache
99	OpenNLP, so still worth re-running over recrawls.
100
101
102	9. * Initial testing effectively queried each CC crawl:
103	get all webpages where
104	content-language 'contains' MRI
105	But low-quality results!
106	e.g. Single-word pages that weren't actually Maori.
107	* Ended up querying:
108	content-language = MRI
109	(not just primary language detected, but the
110	sole language detected)
111	Still some disappointing results, but far less common.
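
The difference between the two queries above can be sketched in Python. This is only an illustration, not the query that was actually run against CC's distributed index. It assumes the content-language field arrives as a comma-separated string of 3-letter codes in descending order (e.g. "mri,eng"); the function names and sample records are hypothetical.

```python
def parse_languages(content_language):
    """Split a comma-separated content-language value into lowercase codes."""
    if not content_language:
        return []
    return [code.strip().lower()
            for code in content_language.split(",") if code.strip()]


def contains_mri(content_language):
    """Loose query: MRI appears anywhere among the detected languages."""
    return "mri" in parse_languages(content_language)


def solely_mri(content_language):
    """Strict query: MRI is the one and only detected language."""
    return parse_languages(content_language) == ["mri"]


# Hypothetical index records, made up for illustration only.
records = [
    {"url": "https://example.org/a", "content_language": "eng,mri"},
    {"url": "https://example.org/b", "content_language": "mri"},
    {"url": "https://example.org/c", "content_language": "mri,eng"},
    {"url": "https://example.org/d", "content_language": "eng"},
]

loose = [r["url"] for r in records if contains_mri(r["content_language"])]
strict = [r["url"] for r in records if solely_mri(r["content_language"])]
```

The strict test also drops pages where MRI is primary but not sole (record c above), which is the trade-off accepted to cut down the low-quality hits.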
112
113
114	10. We were in July/Aug of 2018 when we began.
115	Queried Sep 2018 - Aug 2019 (12 months) CC data.
116
117	Next, need to prepare data for crawling locally:
118	- ensure unique domains across CC crawl results,
119	- remove low-quality sites and process special sites
120	- create seed URLs, regex filters for each site
121	to recrawl at depth 10 with Apache Nutch
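
A minimal sketch of the preparation steps above, assuming the pooled CC results are just a list of page URLs. The helper names are hypothetical. The filter lines follow the +/- prefix style of Nutch's regex-urlfilter.txt, but the exact patterns used in the project are not recorded in these notes.

```python
import re
from urllib.parse import urlsplit


def domain_of(url):
    """Return the URL's host, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def group_by_domain(urls):
    """Map each unique domain to the sorted, de-duplicated seed URLs for it."""
    sites = {}
    for url in urls:
        domain = domain_of(url)
        if domain:
            sites.setdefault(domain, set()).add(url)
    return {domain: sorted(seeds) for domain, seeds in sites.items()}


def nutch_filter_lines(domain):
    """Filter lines that accept this domain (and its subdomains) only."""
    accept = r"+^https?://([a-z0-9-]+\.)*" + re.escape(domain) + "/"
    reject_everything_else = "-."
    return [accept, reject_everything_else]


# Hypothetical pooled results from several monthly crawls.
pooled = [
    "https://www.example.co.nz/a",
    "http://example.co.nz/b",
    "https://www.example.co.nz/a",
]
sites = group_by_domain(pooled)
filters = {domain: nutch_filter_lines(domain) for domain in sites}
```

Each domain then gets its own seed list and filter file, so that a depth-10 recrawl stays within that site.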
122
123
124	11. Low quality data
125	Countless auto-translated sites like adult
126	and product sites:
127	- Blacklisted adult sites
128	- Greylisted obvious product sites providing (auto)
129	translations in countless languages of the globe.
130	But too many to go through.
131	Left this issue for "later" in the process pipeline.
132
133
134	12. Special handling regex list for certain sites
135	e.g. large sites.
136	Don't want to crawl all of blogspot or docs.google
137	or wikipedia, etc.
138	Instead crawl mi.wikipedia; <blogname>.blogspot;
139	docs.google/<individual-seed-page-id>
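
One way to express this special handling, as a sketch only: decide per seed URL how wide the recrawl may go. The host list below just mirrors the three examples above; the function name and the returned dict layout are assumptions, not the project's actual regex list.

```python
from urllib.parse import urlsplit

# Hosts too large to crawl whole: restrict to the exact subdomain.
LARGE_SITE_SUFFIXES = (".wikipedia.org", ".blogspot.com")


def crawl_scope(seed_url):
    """Return how widely a seed URL may be crawled."""
    parts = urlsplit(seed_url)
    host = (parts.hostname or "").lower()
    if host == "docs.google.com":
        # Only the individual seed page, never the rest of docs.google.
        return {"host": host, "path_prefix": parts.path, "subdomains": False}
    if host.endswith(LARGE_SITE_SUFFIXES):
        # e.g. mi.wikipedia.org or <blogname>.blogspot.com only.
        return {"host": host, "path_prefix": "/", "subdomains": False}
    # Ordinary site: crawl the whole domain, subdomains included.
    if host.startswith("www."):
        host = host[4:]
    return {"host": host, "path_prefix": "/", "subdomains": True}
```

For example, crawl_scope("https://mi.wikipedia.org/wiki/Aotearoa") confines the crawl to mi.wikipedia.org rather than all of wikipedia.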
140
141
142	13. <PROCESS FLOW CHART>
143
144
145	14. Stripping HTML also stripped paragraph info,
146	so had to deal with sentences as units.
147	But Apache OpenNLP language detection
148	prefers to work on >= 2 sentences at a time.
149
150	Still, in testing this, OpenNLP returned MRI as
151	primary language for single sentences
152	as often as it did for 2 contiguous sentences,
153	but at a lower confidence level.
154
155
156	15. MongoDB Webpage level meta:
157	* URL,
158	* full page text of downloaded webpage,
159	* "sentences" array (trained basic Apache
160	OpenNLP sentence model for MRI)
161	* isMRI? - whether openNLP detected MRI to be
162	the primary language of overall page content
163	* containsMRI? - whether openNLP detected MRI as
164	primary language of any sentence on the page
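
How isMRI and containsMRI relate can be sketched as below. The real pipeline used Apache OpenNLP (Java) for both sentence splitting and language detection. Here the detector is passed in as a function, and toy_detect is a deliberately crude stand-in invented for this example, NOT OpenNLP. The key names "URL" and "text" are also assumptions.

```python
def build_page_record(url, text, sentences, detect):
    """Build webpage-level metadata.

    detect(text) must return a (language_code, confidence) pair.
    """
    page_language, _ = detect(text)
    sentence_languages = [detect(sentence)[0] for sentence in sentences]
    return {
        "URL": url,
        "text": text,
        "sentences": sentences,
        # MRI is the primary language of the page as a whole.
        "isMRI": page_language == "mri",
        # MRI is the primary language of at least one sentence.
        "containsMRI": "mri" in sentence_languages,
    }


def toy_detect(text):
    """Toy stand-in detector: share of a few common Maori function words."""
    markers = {"te", "ko", "nga", "ki", "kei"}
    words = [w.strip(".,!?").lower() for w in text.split()]
    share = sum(w in markers for w in words) / len(words) if words else 0.0
    return ("mri", share) if share >= 0.3 else ("eng", 1.0 - share)


sentences = [
    "Ko te reo te mauri.",
    "This page is mostly in English with many more words here.",
]
record = build_page_record(
    "https://example.org/page", " ".join(sentences), sentences, toy_detect
)
```

The sample page ends up with isMRI false but containsMRI true: one Maori sentence on an otherwise English page.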
165
166
167	16. MongoDB Website level meta:
168	* domain,
169	* geo-location of site's server,
170	* numPagesInMRI,
171	* numPagesContainingMRI,
172	* did_nutch_finish_crawling_site_fully?
173
174
175	17. Querying MongoDB:
176	Simple queries:
177	* How many webSITES crawled?
178	CC said these sites had MRI page(s)
179	* How many webPAGES crawled?
180	* How many PAGES with isMRI = true (openNLP)
181	* How many PAGES with containsMRI = true
182	* How many SITES where numPagesInMRI > 0
183	* How many SITES where numPagesContainingMRI > 0
184	(= sites with at least 1 webpage with at least
185	1 sentence that openNLP detected as MRI)
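
The simple queries above map onto plain MongoDB filter documents. A sketch, assuming two collections (one per metadata level) whose field names match the lists in 15 and 16. run_counts expects pymongo Collection objects, or anything else with a count_documents(filter) method. The dictionary keys and the function name are made up for this example.

```python
# Filters over the website-level collection.
SITE_FILTERS = {
    "sites_crawled": {},
    "sites_with_pages_in_mri": {"numPagesInMRI": {"$gt": 0}},
    "sites_with_pages_containing_mri": {"numPagesContainingMRI": {"$gt": 0}},
}

# Filters over the webpage-level collection.
PAGE_FILTERS = {
    "pages_crawled": {},
    "pages_in_mri": {"isMRI": True},
    "pages_containing_mri": {"containsMRI": True},
}


def run_counts(websites, webpages):
    """Run every simple count and return the results keyed by name."""
    counts = {name: websites.count_documents(flt)
              for name, flt in SITE_FILTERS.items()}
    counts.update({name: webpages.count_documents(flt)
                   for name, flt in PAGE_FILTERS.items()})
    return counts
```

Keeping the filters as data rather than as separate hand-typed queries makes it easier to rerun all the counts after each change to the blacklist.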
186
187
188	After blacklisting, 1462 sites were left to crawl with Nutch, but a few were obvious product sites, so these were removed before crawling
189	or while crawling other sites.
190
191	After crawling,
192	# Num websites in MongoDB
193	1445
194
195	# Num webpages
196	117496
197
198	# The number of web SITES that contain 1 or more pages detected as being in Maori (num sites with a positive numPagesInMRI)
199	361
200
201	# Number of web SITES containing at least one page with at least one sentence for which OpenNLP detected the best language = MRI
202	# (Num sites with a positive numPagesContainingMRI)
203	868
204
205	# The number of web PAGES that are deemed to be overall in MRI (pages where isMRI=true)
206	7818
207
208	# Number of web PAGES that contain any number of MRI sentences
209	20371
210
211	# Number of web SITES with crawled web pages that have any URLs containing /mi(/) OR http(s)://mi.*
212	670
213
214	# Number of web SITES that are outside NZ that contain /mi(/) OR http(s)://mi.*
215	# in any of its crawled webpage urls
216	656
217
218	# 14 sites with page URLs containing /mi(/) OR http(s)://mi.* that are in NZ
219	14
220
221	# ATTEMPT TO FILTER OUT LIKELY AUTO-TRANSLATED SITES
222	# Get a count of all non-NZ (or .nz TLD) sites that don't have /mi(/) or http(s)://mi.*
223	# in the URL path of any crawled web pages of the site
224	220
225
226
227	# Count of websites that have at least 1 page containing at least one sentence detected as MRI
228	# AND which websites have mi in a webpage's URL path:
229	491
230
231
232	# The websites that have some MRI detected AND which are either NZ-related (in NZ or with .nz TLD)
233	# or, if they're from overseas, don't contain /mi or mi.* in any page's URL path:
234	396
235
236	# Include Australia, to get the valid "kiwiproperty.com" website included in the result list:
237	397
238
239
240	# counts of pages by country code excluding NZ related sites and AU sites
241	# that are detected as containing at least one Maori sentence:
242
243	221 websites
244
245
246	# But to produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with .nz tld):
247	176
248
249	(Total is 221+176 = 397, which adds up).
250
251
252	# Manually inspected shortlist of the 221 non-NZ websites to weed out those that aren't MRI (weeding out those misdetected as MRI, autotranslated or just contain placenames etc), and adding the 176 NZ on top:
253
254	MANUAL INSPECTION: TOTAL COUNT BY COUNTRY OF SITES WITH AT LEAST ONE PAGE CONTAINING ONE SENTENCE OF MRI CONTENT (numPagesContainingMRI > 0):
255	NZ: 126
256	US: 25+4
257	AU: 2
258	DE: 2
259	DK: 2
260	BG: 1
261	CZ: 1
262	ES: 1
263	FR: 1
264	IE: 1
265	TOTAL: 166
266
267	18. More complex MongoDB queries:
268	Count of SITES by site geolocation
269	where
270	- numPagesInMRI > 0
271	- numPagesContainingMRI > 0
272	(- AND miInURLPath for overseas sites = false)
273
274	Also: do the counts grouping NZ origin sites
275	with ".nz" TLD sites (regardless of server
276	geo-origin) under NZ.
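
A pure-Python equivalent of this grouping, as a sketch (the actual work was done with MongoDB queries, which are not reproduced here). The field names "domain" and "countryCode" are assumptions; miInURLPath, numPagesInMRI and numPagesContainingMRI are taken from these notes.

```python
from collections import Counter


def country_bucket(site):
    """Group .nz TLD sites under NZ regardless of where the server is."""
    if site["domain"].endswith(".nz") or site["countryCode"] == "NZ":
        return "NZ"
    return site["countryCode"]


def count_sites_by_country(sites, field="numPagesContainingMRI"):
    """Count sites per country where `field` > 0.

    Overseas sites with mi in a page URL path are skipped as likely
    auto-translated product sites.
    """
    counts = Counter()
    for site in sites:
        if site[field] <= 0:
            continue
        bucket = country_bucket(site)
        if bucket != "NZ" and site.get("miInURLPath", False):
            continue
        counts[bucket] += 1
    return counts


# Hypothetical site records for illustration.
sites = [
    {"domain": "example.co.nz", "countryCode": "US",
     "numPagesInMRI": 3, "numPagesContainingMRI": 4, "miInURLPath": True},
    {"domain": "example.com", "countryCode": "US",
     "numPagesInMRI": 0, "numPagesContainingMRI": 2, "miInURLPath": False},
    {"domain": "shop.example.de", "countryCode": "DE",
     "numPagesInMRI": 9, "numPagesContainingMRI": 9, "miInURLPath": True},
]
by_containing = count_sites_by_country(sites)
by_in_mri = count_sites_by_country(sites, field="numPagesInMRI")
```

Running it once per field gives the two sets of per-country counts listed above.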
277
278
279	19. Detected results can turn out low-quality:
280	- misdetection, e.g. Tongan, Kiribati, etc
281	(not in OpenNLP language model)
282	or ENG sentences with MRI words
283	detected as MRI sentences
284	- Maori personal and place names in
285	references and gallery photo captions
286	suffice to return sentences
287	and single-sentence pages as MRI
288	- autotranslated sites!!!!
289
290
291	20. Auto-translated content = UNWANTED
292
293	Don't want automatically translated sites
294	when building a corpus of high quality Maori
295	language text for researchers to work with.
296
297	Also, it can be polluting:
298	auto-translated content can't serve as
299	proper training data set to inform better
300	automatic translation in future either.
301
302
303	21. Heuristics for some detection
304	of auto-translated sites
305
306	Dr Dave Nichols suggested:
307	Find non-NZ sites that have /mi or mi.* in URL
308	(2 letter code for Maori) and remove them
309	as they're more likely to be product sites.
310
311	In practice: Still had to wade through list of all
312	overseas sites with page URLs containing "mi"
313	for the occasional exception.
314	And reverse: some NZ sites with "mi" in any
315	web page's URL could be auto-translated product sites.
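
The heuristic can be sketched as a regex test over a site's crawled page URLs. The pattern below is one reading of "/mi(/) OR http(s)://mi.*" and may differ from the pattern actually used; the site dict layout and function names are hypothetical.

```python
import re

# "mi" as a whole path segment, or as the leading subdomain.
MI_IN_URL = re.compile(r"^https?://mi\.|/mi(?:[/?#]|$)", re.IGNORECASE)


def has_mi_in_url(url):
    """True if the URL has /mi as a path segment or starts with mi."""
    return MI_IN_URL.search(url) is not None


def likely_auto_translated(site):
    """Flag overseas sites with mi in any crawled page URL.

    A flag only marks a site for manual checking: there are exceptions
    in both directions.
    """
    nz_related = site["countryCode"] == "NZ" or site["domain"].endswith(".nz")
    return not nz_related and any(has_mi_in_url(u) for u in site["urls"])
```

Note that mi.wikipedia.org would be flagged too, which is the kind of exception that still had to be picked out by hand.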
316
317
318	22. Bigger problem:
319	Even if overseas sites with mi in page URLs
320	were filtered out, a large set of auto-translated
321	sites never use mi in the URL path.
322
323	PROBLEM: can't detect auto-translated sites
324	automatically. Confirmed by Dr Stephen Joe,
325	Mr Bill Rogers, Dr Bainbridge.
326	Human, manual intervention needed
327	to weed them out.
328
329
330	23. So manually went through MongoDB result list of
331	all websites with numPagesContainingMRI > 0
332	to shortlist just those websites which had any
333	webpage that truly contained at least one sentence
334	in MRI.
335
336	(Not even website[x].numPagesInMRI > 0)
337
338
339	24. Results
340	Results at website level (not webpage level).
341	<TABLES AND GEO-JSON MAPS>
342
343
344	25. Recommendation
345	There's a case to be made for WWW standards
346	to make it compulsory, including on legacy
347	sites, to include some indicator on each
348	webpage or even at paragraph level
349	(HTML markup tag comparable to "lang"?)
350	to denote whether the text content is formulated
351	by a human or auto-translated.
352
353	Or a processing sequence,
354	e.g. content-source="human, ocr, bot-translation"
355	for an automatic translation of a digitised book
356	by a human author.
357
358
359	26. Working on the final stages
360	- Code generates a random sample of webpage URLs
361	from the site listing, large enough to make predictions
362	at a 90% confidence level with a 5% margin of error.
363
364	Then need to go over each sample webpage URL
365	produced from manually pruned webSITE listing,
366	and manually verify whether, in cases where a
367	webPAGE has isMRI=true, the page is
368	genuinely largely in Maori or not.
369
370	- Finish writing code to automatically run the
371	mongodb queries I've manually run, to
372	summarise the results for generating tables
373	and geojson maps.
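
For the random sample in the first point above, one common way to size it is Cochran's formula with a finite population correction. The notes don't record which formula the project code uses, so treat this as an assumption, along with the worst-case proportion p = 0.5 and the function names.

```python
import math
import random
from statistics import NormalDist


def sample_size(population, confidence=0.90, margin=0.05, p=0.5):
    """Sample size for estimating a proportion in a finite population."""
    if population <= 0:
        return 0
    # Two-sided z-score for the confidence level (about 1.645 for 90%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z * z * p * (1 - p) / (margin * margin)
    # Finite population correction.
    n = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(n))


def sample_urls(urls, seed=0, **kwargs):
    """Draw a reproducible random sample of URLs to verify manually."""
    rng = random.Random(seed)
    return rng.sample(urls, sample_size(len(urls), **kwargs))
```

Fixing the seed means the same URLs come out on every run, so the manual verification can be split over several sittings.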
374
375
376	27. Future work
377	- Knowing the site-level results,
378	can fully recrawl those promising sites
379	that weren't fully crawled before
380	- Maybe retrain OpenNLP language model
381	for Maori using high quality web pages found?
382
383
384	28. Wider Applicability
385	Repeating the process for other languages
386	not in wide use:
387
388	- CC prefers not to be burdened by data
389	requests for very common languages, but
390	low-resource languages are fine
391	- Check if Apache OpenNLP supports the language,
392	else need to train and add a model.
393	- MongoDB queries need to be adjusted.
394	At present specific to Maori, e.g. its unique
395	geographic distribution: NZ + .nz TLD
396	treated specially vs overseas.
397	But for the French language, France, Canada,
398	New Caledonia etc TLDs need to be considered.


source: other-projects/maori-lang-detection/journal-paper/MRI_slideNotes.txt@ 33913
