Timestamp:
13.11.2019 23:08:37
Author:
ak19
Message:

Having finished sending all the crawl data to MongoDB:

1. Recrawled the 2 sites that I had earlier noted required recrawling: 00152 and 00332. Site 00152 required changes to how it was crawled: MP3 files needed to be blocked, as there were HBase error messages about key values being too large.

2. Modified the regex-urlfilter.GS_TEMPLATE file to block mp3 files in general for future crawls too (in the part of the file where jpg etc. were already blocked by nutch's default regex url filters).

3. Further had to restrict the 00152 site to be crawled only under its /maori/ sub-domain. Since the seedURL maori.html was not under a /maori/ url, this revealed that the CCWETProcessor code did not yet handle the case where a crawl is restricted to a subdomain (as expressed in the conf/sites-too-big-to-exhaustively-crawl file) but the seedURL itself doesn't match the restricting regex filters. So now, in such cases, CCWETProcessor adds the non-matching seedURLs to the filters too (so we get just the single page of each such seedURL), besides the filter on the requested subdomain, so that we follow all pages linked from the seedURLs that match the subdomain expression.

4. Adding to_crawl.tar.gz to svn: the tarball of the to_crawl sites that I actually ran nutch over, containing all the sites folders with the seedURL.txt and regex-urlfilter.txt files that batchcrawl.sh runs over. This didn't use the latest version of the sites folder and blacklist/whitelist files generated by CCWETProcessor, since the latest version was regenerated after the final modifications to CCWETProcessor, which were made after crawling had finished. But to_crawl.tar.gz does have a manually modified 00152, with the correct regex-urlfilter file, and uses the newer regex-urlfilter.GS_TEMPLATE file that blocks mp3 files.

5. crawledNode6.tar.gz now contains the dump output for sites 00152 and 00332, which were crawled on node6 today (after which their processed dump.txt file results were added into MongoDB).

7. MoreReading/mongodb.txt now contains the results of some queries I ran against the total nutch-crawled data.
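
The filter behaviour described in points 1 to 3 can be sketched roughly as follows. This is only an illustration in plain JavaScript, not the actual CCWETProcessor code (which is Java and is not part of this changeset). The function names, the example.org URLs and the exact mp3 rule are assumptions made for the example. It relies on Nutch's regex URL filter convention that each rule is a `+` or `-` followed by a regex, and that the first matching rule decides whether a URL is accepted.

```javascript
// Hypothetical sketch: none of these names exist in CCWETProcessor.

// Escape regex metacharacters so a URL can be used as a literal pattern.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build Nutch-style filter rules for a site whose crawl is restricted
// to one URL prefix (e.g. ".../maori/").
function buildFilterRules(seedURLs, restrictedPrefix) {
  const prefixPattern = "^" + escapeRegex(restrictedPrefix);
  const prefixRegex = new RegExp(prefixPattern);

  // Assumed form of the mp3 block; the real template line may differ.
  const rules = ["-\\.(mp3|MP3)$"];

  // A seedURL outside the restricted prefix gets its own exact-match rule,
  // so that single page is still fetched and its links can be followed.
  for (const seed of seedURLs) {
    if (!prefixRegex.test(seed)) {
      rules.push("+^" + escapeRegex(seed) + "$");
    }
  }

  rules.push("+" + prefixPattern); // everything under the restricted prefix
  rules.push("-.");                // reject all other URLs
  return rules;
}

// Apply the rules in order: the first rule whose regex matches wins.
function accepts(rules, url) {
  for (const rule of rules) {
    if (new RegExp(rule.slice(1)).test(url)) {
      return rule[0] === "+";
    }
  }
  return false; // no rule matched
}

const rules = buildFilterRules(
  ["https://example.org/maori.html"],
  "https://example.org/maori/"
);
console.log(rules.join("\n"));
```

With these rules the seed page and anything under /maori/ are accepted, while mp3 files and the rest of the site are rejected. The mp3 rule is placed first because, with first-match-wins ordering, it would otherwise never apply to files under the accepted prefix.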

Files:
1 modified

  • other-projects/maori-lang-detection/MoreReading/mongodb.txt

    r33653 r33666  
    https://docs.mongodb.com/manual/reference/method/db.collection.find/
    https://docs.mongodb.com/manual/reference/method/db.collection.find/#find-projection


    -------------------

    Some queries with results:

    # Num websites
    db.getCollection('Websites').find({}).count()
    1446

    # Num webpages
    db.getCollection('Webpages').find({}).count()
    75139

    # Find number of websites who have 1 or more pages in Maori (a positive numPagesInMRI)
    db.getCollection('Websites').find({numPagesInMRI: { $gt: 0}}).count()
    361

    # Find number of webpages that are deemed to be overall in MRI (pages where isMRI=true)
    db.getCollection('Webpages').find({isMRI:true}).count()
    X5224
    5215

    # Number of pages that contain any number of MRI sentences
    db.getCollection('Webpages').find({containsMRI: true}).count()
    12858

    # Number of sites with URLs containing /mi(/)
    db.getCollection('Websites').find({urlContainsLangCodeInpath:true}).count()
    153

    # Number of websites that are outside NZ that contain /mi(/) in any of its sub-urls
    db.getCollection('Websites').find({urlContainsLangCodeInpath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
    148

    # 5 sites with URLs containing /mi(/) that are in NZ
    db.getCollection('Websites').find({urlContainsLangCodeInpath:true, geoLocationCountryCode: "NZ"}).count()
    5

    # sort websites that contain /mi(/) in path by geoLocationCountryCode
    #    https://www.quackit.com/mongodb/tutorial/mongodb_sort_query_results.cfm
    db.getCollection('Websites').find({urlContainsLangCodeInpath:true}).sort({geoLocationCountryCode: 1})
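
The three /mi(/) counts above are consistent with each other: 148 sites outside NZ plus 5 in NZ gives the 153 total. A minimal sketch of what those three filters select, written as plain JavaScript over a few invented sample documents rather than against the real collection, is given below. Only the field names are taken from the queries; the documents and the `countByCountry` helper are made up for illustration. One point worth keeping in mind is that MongoDB's `$ne` also matches documents that lack the field entirely, so the "outside NZ" count can include sites with no recorded country code. The `!==` comparison below behaves the same way.

```javascript
// Invented sample documents: only the field names come from the queries above.
const websites = [
  { _id: "siteA", urlContainsLangCodeInpath: true,  geoLocationCountryCode: "NZ", numPagesInMRI: 12 },
  { _id: "siteB", urlContainsLangCodeInpath: true,  geoLocationCountryCode: "US", numPagesInMRI: 0 },
  { _id: "siteC", urlContainsLangCodeInpath: true,  numPagesInMRI: 3 }, // no country recorded
  { _id: "siteD", urlContainsLangCodeInpath: false, geoLocationCountryCode: "NZ", numPagesInMRI: 7 },
];

// Equivalent of find({urlContainsLangCodeInpath: true})
const withLangCode = websites.filter(s => s.urlContainsLangCodeInpath === true);

// Equivalent of adding geoLocationCountryCode: {$ne: "NZ"};
// like $ne, this also keeps documents where the field is missing.
const outsideNZ = withLangCode.filter(s => s.geoLocationCountryCode !== "NZ");

// Equivalent of adding geoLocationCountryCode: "NZ"
const insideNZ = withLangCode.filter(s => s.geoLocationCountryCode === "NZ");

// Hypothetical helper: tally sites per country code instead of sorting
// the full result list and counting by eye.
function countByCountry(sites) {
  const counts = {};
  for (const s of sites) {
    const code = s.geoLocationCountryCode ?? "(none)";
    counts[code] = (counts[code] || 0) + 1;
  }
  return counts;
}

console.log(withLangCode.length, outsideNZ.length, insideNZ.length);
console.log(countByCountry(withLangCode));
```

Against the real data, a per-country tally like this could presumably be obtained in the mongo shell with an aggregation using `$match` and `$group` stages, which may be more convenient than reading through the sorted output of the last query.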