/*
For sites originating in NZ or with the .nz TLD, none of the URLs are manually inspected and all URLs are accepted.

For all countries except NZ, get the final column results with:
db.getCollection('Websites').find({domain:/coggle\.it/})
The URLs can then be checked with:
db.getCollection('Webpages').find({URL: /coggle\.it/, isMRI: true})
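The two queries above differ only in the collection, the field name and the domain, so the filter documents can be built from a plain domain string. The following is a minimal sketch under that assumption; the helper names domainRegex, siteFilter and mriPageFilter are hypothetical and not part of the original scripts.

```javascript
// Hypothetical helper: build a RegExp that matches a domain literally.
function domainRegex(domain) {
  // Escape RegExp metacharacters so that "." only matches a literal dot.
  const escaped = domain.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(escaped);
}

// Filter for the 'Websites' collection (field name taken from the query above).
function siteFilter(domain) {
  return { domain: domainRegex(domain) };
}

// Filter for the 'Webpages' collection, restricted to pages flagged isMRI.
function mriPageFilter(domain) {
  return { URL: domainRegex(domain), isMRI: true };
}

// Intended use in the mongo shell (not executed here):
// db.getCollection('Websites').find(siteFilter('coggle.it'))
// db.getCollection('Webpages').find(mriPageFilter('coggle.it'))
```

Escaping the domain avoids accidental matches such as "cogglexit" when the dot is left unescaped.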

NOTES:
1. DE:

"de","2.0","0+1","9+35 misdetected", http://www.cartogiraffe.com, https://www.cartogiraffe.com,
Ought to be 2+2 numPagesInMRICount and 9+2 numPagesContainingMRICount:
- Both cartogiraffe.com pages were identical and consisted mostly of MRI sentences, with only one name not being MRI. So isMRI should have been true for both pages.
- Only one of the 2 MRI translations of the Universal Declaration of Human Rights at http://www.udhr.de got downloaded. A total of 75 pages were downloaded, but more translated pages appeared to be on the site. Not sure why the crawl had a _SUCCESS file to indicate a completed download.
- http://www.udhr.de also had 35-1 non-MRI language translations of the Universal Declaration of Human Rights in which one or more sentences were misdetected as MRI. With the additional MRI page that didn't get downloaded, there should be 9+2 = 11 pages containing MRI.

So instead of
"de","2.0","1","44", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de
the row should be
"de","2.0","4","11", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de


"au","3.0",7+0+1,83+1+3,https://www.kiwiproperty.com, https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd,https://koreromaori.com

2. US:
aclhokiangarocks.blogspot.com contains at least one page with MRI paragraphs. See http://aclhokiangarocks.blogspot.com/feeds/posts/default under the section "Nga Tuhinga o tatou Tupuna".
Although this page was crawled by Nutch, the blog presents its contents in a complex way, so the text wasn't retrieved here. See also the dedicated page this text should have appeared on: http://aclhokiangarocks.blogspot.com/2012/05/nga-tuhinga-o-tatou-tupuna.html

"_id","siteCount","numPagesInMRICount","numPagesContainingMRICount","URLs of pages detected as inMRI"
"nz","176.0","4360","9641"
"us","29.0",
1+2+0+0+4+166+0+39 +257+2+21+12+25+13+53+0+1+0+1+11 +32+37+4 +0+0+0 = 681,
31+2+2+20+58+166+3+91 +258+2+25+12+66+22+53+6+1+1+2+10 +58+54+6 +1+2+1 = 953,
anglicanhistory.org,unicode.org,static-promote.weebly.com,aclhokiangarocks.blogspot.com,bahaiprayers.net,biblehub.com,muhammad.com,godrules.net,m.biblepub.com, krassotkin.ru,gotquestions.org,
maorinews.com,maaori.com,kiaorahola.blogspot.com,kjohnsonnz.blogspot.com,pumanawawhangara.blogspot.com,dannykahei.tripod.com,burkekm001.tripod.com,tkkpipipaopao.blogspot.com, manateina.blogspot.com,
tatai09.blogspot.com,twttoa.com,tuhua2010.blogspot.com,
breaker.audio,drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview,in.pinterest.com/pin/317363104978423418/
"au","2.0","8","86", https://www.kiwiproperty.com, https://koreromaori.com
"de","2.0","4","11", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de
"dk","2.0","4","7", *.ngapuhitelevision.com, *.ngapuhiradio.com
"bg","1.0","2","2", http://anitra.net/activism/humanrights/UDHR/mbf_print.htm, http://anitra.net/activism/humanrights/UDHR/rrt_print.htm
"cz","1.0","0","1", http://www.henryklahola.nazory.cz/094.Maori.htm, http://henryklahola.nazory.cz/094.Maori.htm
"es","1.0","1","1", https://www.uv.es/~pla/red.net/intmaori.html
"fr","1.0","1","1", http://chantsdeluttes.free.fr/versionsinter/page%20maori.html
"ie","1.0","1","3", https://coggle.it/diagram/WSYB0mLA2QABD5BH/t/ko-au-ko-koe

*/
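
The "us" row in the notes above records its per-site counts as sum expressions such as 1+2+0+... = 681. As a sanity check, such a cell can be evaluated mechanically. This is only a sketch: the function name sumCell is hypothetical, and the two strings are copied from the "us" row.

```javascript
// Evaluate a cell such as "7+0+1" by summing its "+"-separated integers.
function sumCell(cell) {
  return cell
    .split("+")
    .map((part) => Number.parseInt(part.trim(), 10))
    .reduce((total, n) => total + n, 0);
}

// Per-site counts for the "us" row, copied from the notes.
const usInMRI =
  "1+2+0+0+4+166+0+39 +257+2+21+12+25+13+53+0+1+0+1+11 +32+37+4 +0+0+0";
const usContainingMRI =
  "31+2+2+20+58+166+3+91 +258+2+25+12+66+22+53+6+1+1+2+10 +58+54+6 +1+2+1";

console.log(sumCell(usInMRI), sumCell(usContainingMRI)); // prints: 681 953
```

Both results agree with the values written after the "=" signs in the notes.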

"_id","siteCount","numPagesInMRICount","numPagesContainingMRICount","URLs of pages detected as inMRI"
"nz","176.0","4360","9641"
"us","29.0","681","953"
"au","2.0","8","86"
"de","2.0","4","11"
"dk","2.0","4","7"
"bg","1.0","2","2"
"cz","1.0","0","1"
"es","1.0","1","1"
"fr","1.0","1","1"
"ie","1.0","1","3"

Total sites containing MRI: 216
Total pages detected as being in MRI: 5062
Total pages detected as containing MRI sentences: 10706
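
The three totals are the column sums of the per-country rows. A minimal sketch of recomputing them from the CSV text follows; the names finalRows and columnTotals are hypothetical, and the parsing assumes that no field contains a comma, which holds for the rows above.

```javascript
// Per-country rows copied from the final table (header omitted).
const finalRows = `"nz","176.0","4360","9641"
"us","29.0","681","953"
"au","2.0","8","86"
"de","2.0","4","11"
"dk","2.0","4","7"
"bg","1.0","2","2"
"cz","1.0","0","1"
"es","1.0","1","1"
"fr","1.0","1","1"
"ie","1.0","1","3"`;

// Sum siteCount, numPagesInMRICount and numPagesContainingMRICount over all rows.
function columnTotals(csvText) {
  const totals = { sites: 0, inMRI: 0, containingMRI: 0 };
  for (const line of csvText.trim().split("\n")) {
    // Strip the surrounding quotes; the first field (country code) is skipped.
    const [, sites, inMRI, containing] = line
      .split(",")
      .map((field) => field.replace(/"/g, ""));
    totals.sites += Number(sites);
    totals.inMRI += Number(inMRI);
    totals.containingMRI += Number(containing);
  }
  return totals;
}

console.log(columnTotals(finalRows));
// prints: { sites: 216, inMRI: 5062, containingMRI: 10706 }
```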