Revision log for source: gs3-extensions

Rev      Age      Author  Log Message
@33606   4 years  ak19    1. Committing crawl data from node3 (2nd VM for nutch crawling). 2. …
@33605   4 years  ak19    Node 4 VM still works, but committing first set of crawled sites on there
@33604   4 years  ak19    1. Better output into possible-product-sites.txt including the …
@33603   4 years  ak19    Incorporating Dr Nichols suggestion to help weed out product sites: if …
@33602   4 years  ak19    1. The final csv file, mri-sentences.csv, is now written out. 2. Only …
@33601   4 years  ak19    Creates the 2nd csv file, with info about webpages. At present stores …
@33600   4 years  ak19    Work in progress of writing out CSV files. In future, may write the …
@33599   4 years  ak19    First one-third sites crawled. Committing to SVN despite the tarred …
@33598   4 years  ak19    More instructions on setting up Nutch now that I've remembered to …
@33597   4 years  ak19    Committing active version of template file which has a newline at end …
@33596   4 years  ak19    Adding in the nutch-site.xml and regex-urlfilter.GS_TEMPLATE template …
@33588   4 years  ak19    Committing the MRI sentence model that I'm actually using, the one in …
@33587   4 years  ak19    1. Better stats reporting on crawled sites: not just if a page was in …
@33586   4 years  ak19    Refactored MaoriTextDetector.java class into more general …
@33585   4 years  ak19    Much simpler way of using sentence and language detection model to …
@33584   4 years  ak19    Committing experimental version 2 using the sentence detector model, …
@33583   4 years  ak19    Committing experimental version 1 using the sentence detector model, …
@33582   5 years  ak19    NutchTextDumpProcessor prints each crawled site's stats: number of …
@33581   5 years  ak19    Minor fix. Noticed when looking for work I did on MRI sentence detection
@33580   5 years  ak19    Finally fixed the thus-far identified bugs when parsing dump.txt.
@33579   5 years  ak19    Debugging. Solved one problem.
@33578   5 years  ak19    Corrections for compiling the 2 new classes.
@33577   5 years  ak19    Forgot to adjust usage statement to say that silent mode was already …
@33576   5 years  ak19    Introducing 2 new Java files still being written and untested. …
@33575   5 years  ak19    Correcting usage string for CCWETProcessor before committing new java …
@33574   5 years  ak19    If nutch stores a crawled site in more than 1 file, then cat all of …
@33573   5 years  ak19    Forgot to document that spaces were also allowed as separator in the …
@33572   5 years  ak19    Only meant to store the wet.gz versions of these files, not also the …
@33571   5 years  ak19    Adding Dr Bainbridge's suggestion of appending the crawlId of each …
@33570   5 years  ak19    Need to check if UNFINISHED file actually exists before moving it …
@33569   5 years  ak19    1. batchcrawl.sh now does what it should have from the start, which is …
@33568   5 years  ak19    1. More sites greylisted and blacklisted, discovered as I attempted to …
@33567   5 years  ak19    batchcrawl.sh now supports -all flag (and prints usage on 0 args). The …
@33566   5 years  ak19    batchcrawl.sh script now supports taking a comma or space separated …
@33565   5 years  ak19    CCWETProcessor: domain url now goes in as a seedURL after the …
@33564   5 years  ak19    batchcrawl.sh now does the crawl and logs output of the crawl, dumps …
@33563   5 years  ak19    Committing inactive testing batch scripts (only creates the …
@33562   5 years  ak19    1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
@33561   5 years  ak19    1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated …
@33560   5 years  ak19    1. Incorporated Dr Bainbridge's suggested improvements: only when …
@33559   5 years  ak19    1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
@33558   5 years  ak19    Committing cumulative changes since last commit.
@33557   5 years  ak19    Implemented the topSitesMap of topsite domain to url pattern in the …
@33556   5 years  ak19    Blacklisted wikipedia pages that are actually in other languages which …
@33555   5 years  ak19    Modified top sites list as Dr Bainbridge described: suffixes for the …
@33554   5 years  ak19    Added more to blacklist and greylist. And removed remaining duplicates …
@33553   5 years  ak19    Comments
@33552   5 years  ak19    1. Code now processes ccrawldata folder, containing each individual …
@33551   5 years  ak19    Added in top 500 urls from moz.com/top500 and removed duplicates, and …
@33550   5 years  ak19    First stage of introducing sites-too-big-to-exhaustively-crawl.tx: …
@33549   5 years  ak19    All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 …
@33548   5 years  davidb  Include new wavesurfer sub-project to install
@33546   5 years  davidb  Initial cut at wave-surfer based JS audio player extension for Greenstone
@33545   5 years  ak19    Mainly changes to crawling-Nutch.txt and some minor changes to other …
@33543   5 years  ak19    Filled in some missing instructions
@33541   5 years  ak19    1. hdfs-cc-work/GS_README.txt now contains the complete instructions …
@33540   5 years  ak19    Since I wasn't getting further with nutch 2 to grab an entire site, I …
@33539   5 years  ak19    File rename
@33538   5 years  ak19    Some additions to the setup.sh script to query commoncrawl for MRI …
@33537   5 years  ak19    More nutch and general site mirroring related links
@33536   5 years  ak19    Changes required to the commoncrawl related Vagrant github project to …
@33535   5 years  ak19    1. New setup.sh script for on a hadoop system to setup the git …
@33534   5 years  ak19    Correction: toplevel script has to be placed inside cc-index-table not …
@33532   5 years  ak19    Found the other top 500 sites link again at last which Dr Bainbridge …
@33531   5 years  ak19    Added whitelist for mi.wikipedia.org, and updates to blacklist and …
@33530   5 years  ak19    Completed sentence that was left hanging.
@33529   5 years  ak19    Forgot to add most basic nutch links
@33528   5 years  ak19    Adding in Nutch links
@33527   5 years  ak19    Name change for folder
@33526   5 years  ak19    Moved hadoop related scripts from bin/script into hdfs-instructions
@33525   5 years  ak19    Rename before latest version
@33524   5 years  ak19    1. Further adjustments to documenting what we did to get things to run …
@33523   5 years  ak19    Instructional comment
@33522   5 years  ak19    Some comments and an improvement
@33519   5 years  ak19    Code still writes out the global seedURLs.txt and regex-urlfilter.txt …
@33518   5 years  ak19    Intermediate commit: got the seed urls file temporarily written out as …
@33517   5 years  ak19    1. Blacklists were introduced so that too many instances of camelcased …
@33516   5 years  ak19    Before I accidentally lose it, committing the script Dr Bainbridge …
@33515   5 years  ak19    Removed an unused function
@33514   5 years  ak19    Committing README on starting off with the vagrant VM for hadoop-spark …
@33513   5 years  ak19    Higher level script that runs against each named crawl since Sep 2018 …
@33503   5 years  ak19    More efficient blacklisting/greylisting/whitelisting now by reading in …
@33502   5 years  ak19    Current url pattern blacklist and greylist filter files. Used by …
@33501   5 years  ak19    Refactored code into 2 classes: The existing WETProcessor, which …
@33499   5 years  ak19    Explicitly adding in IAM policy configuration details instead of just …
@33498   5 years  ak19    Corrections to script. Modified the tests checking for file/dir …
@33497   5 years  ak19    First version of discard url filter file. Inefficient implementation. …
@33496   5 years  ak19    Minor changes to reading list file
@33495   5 years  ak19    Pruned out unused commands, added comments, marked unused variables to …
@33494   5 years  ak19    All in one script that takes as parameter a common crawl identifier of …
@33489   5 years  ak19    Handy file to not have to keep manually repeating commands when …
@33488   5 years  ak19    new function createSeedURLsFiles() in WETProcessor that replaces the …
@33480   5 years  ak19    Much harder to remove pages where words are fused together as some are …
@33471   5 years  ak19    Very minor changes.
@33470   5 years  ak19    A new script to reduce keepURLs.txt to unique URLs, 1 from each unique …
@33469   5 years  ak19    Don't want URLs with the word product(s) in them (but production …
@33468   5 years  ak19    More meaningful to (also) write out the keep vs discard URLs into keep …
@33467   5 years  ak19    Improved the code to use a static block to load the needed properties …
@33466   5 years  ak19    1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) …
@33465   5 years  ak19    Committing first version of the WETProcessor.java which takes a …