Revision log for source: gs3-extensions

Rev      Age      Author  Log Message
@33606   4 years  ak19    1. Committing crawl data from node3 (2nd VM for nutch crawling). 2. …
@33605   4 years  ak19    Node 4 VM still works, but committing first set of crawled sites on there
@33604   4 years  ak19    1. Better output into possible-product-sites.txt including the …
@33603   4 years  ak19    Incorporating Dr Nichols suggestion to help weed out product sites: if …
@33602   4 years  ak19    1. The final csv file, mri-sentences.csv, is now written out. 2. Only …
@33601   4 years  ak19    Creates the 2nd csv file, with info about webpages. At present stores …
@33600   4 years  ak19    Work in progress of writing out CSV files. In future, may write the …
@33599   4 years  ak19    First one-third sites crawled. Committing to SVN despite the tarred …
@33598   4 years  ak19    More instructions on setting up Nutch now that I've remembered to …
@33597   4 years  ak19    Committing active version of template file which has a newline at end …
@33596   4 years  ak19    Adding in the nutch-site.xml and regex-urlfilter.GS_TEMPLATE template …
@33588   4 years  ak19    Committing the MRI sentence model that I'm actually using, the one in …
@33587   4 years  ak19    1. Better stats reporting on crawled sites: not just if a page was in …
@33586   4 years  ak19    Refactored MaoriTextDetector.java class into more general …
@33585   4 years  ak19    Much simpler way of using sentence and language detection model to …
@33584   4 years  ak19    Committing experimental version 2 using the sentence detector model, …
@33583   4 years  ak19    Committing experimental version 1 using the sentence detector model, …
@33582   5 years  ak19    NutchTextDumpProcessor prints each crawled site's stats: number of …
@33581   5 years  ak19    Minor fix. Noticed when looking for work I did on MRI sentence detection
@33580   5 years  ak19    Finally fixed the thus-far identified bugs when parsing dump.txt.
@33579   5 years  ak19    Debugging. Solved one problem.
@33578   5 years  ak19    Corrections for compiling the 2 new classes.
@33577   5 years  ak19    Forgot to adjust usage statement to say that silent mode was already …
@33576   5 years  ak19    Introducing 2 new Java files still being written and untested. …
@33575   5 years  ak19    Correcting usage string for CCWETProcessor before committing new java …
@33574   5 years  ak19    If nutch stores a crawled site in more than 1 file, then cat all of …
@33573   5 years  ak19    Forgot to document that spaces were also allowed as separator in the …
@33572   5 years  ak19    Only meant to store the wet.gz versions of these files, not also the …
@33571   5 years  ak19    Adding Dr Bainbridge's suggestion of appending the crawlId of each …
@33570   5 years  ak19    Need to check if UNFINISHED file actually exists before moving it …
@33569   5 years  ak19    1. batchcrawl.sh now does what it should have from the start, which is …
@33568   5 years  ak19    1. More sites greylisted and blacklisted, discovered as I attempted to …
@33567   5 years  ak19    batchcrawl.sh now supports -all flag (and prints usage on 0 args). The …
@33566   5 years  ak19    batchcrawl.sh script now supports taking a comma or space separated …
@33565   5 years  ak19    CCWETProcessor: domain url now goes in as a seedURL after the …
@33564   5 years  ak19    batchcrawl.sh now does the crawl and logs output of the crawl, dumps …
@33563   5 years  ak19    Committing inactive testing batch scripts (only creates the …
@33562   5 years  ak19    1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
@33561   5 years  ak19    1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated …
@33560   5 years  ak19    1. Incorporated Dr Bainbridge's suggested improvements: only when …
@33559   5 years  ak19    1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
@33558   5 years  ak19    Committing cumulative changes since last commit.
@33557   5 years  ak19    Implemented the topSitesMap of topsite domain to url pattern in the …
@33556   5 years  ak19    Blacklisted wikipedia pages that are actually in other languages which …
@33555   5 years  ak19    Modified top sites list as Dr Bainbridge described: suffixes for the …
@33554   5 years  ak19    Added more to blacklist and greylist. And removed remaining duplicates …
@33553   5 years  ak19    Comments
@33552   5 years  ak19    1. Code now processes ccrawldata folder, containing each individual …
@33551   5 years  ak19    Added in top 500 urls from moz.com/top500 and removed duplicates, and …
@33550   5 years  ak19    First stage of introducing sites-too-big-to-exhaustively-crawl.tx: …
@33549   5 years  ak19    All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 …
@33548   5 years  davidb  Include new wavesurfer sub-project to install
@33546   5 years  davidb  Initial cut at wave-surfer based JS audio player extension for Greenstone
@33545   5 years  ak19    Mainly changes to crawling-Nutch.txt and some minor changes to other …
@33543   5 years  ak19    Filled in some missing instructions
@33541   5 years  ak19    1. hdfs-cc-work/GS_README.txt now contains the complete instructions …
@33540   5 years  ak19    Since I wasn't getting further with nutch 2 to grab an entire site, I …
@33539   5 years  ak19    File rename
@33538   5 years  ak19    Some additions to the setup.sh script to query commoncrawl for MRI …
@33537   5 years  ak19    More nutch and general site mirroring related links
@33536   5 years  ak19    Changes required to the commoncrawl related Vagrant github project to …
@33535   5 years  ak19    1. New setup.sh script for on a hadoop system to setup the git …
@33534   5 years  ak19    Correction: toplevel script has to be placed inside cc-index-table not …
@33532   5 years  ak19    Found the other top 500 sites link again at last which Dr Bainbridge …
@33531   5 years  ak19    Added whitelist for mi.wikipedia.org, and updates to blacklist and …
@33530   5 years  ak19    Completed sentence that was left hanging.
@33529   5 years  ak19    Forgot to add most basic nutch links
@33528   5 years  ak19    Adding in Nutch links
@33527   5 years  ak19    Name change for folder
@33526   5 years  ak19    Moved hadoop related scripts from bin/script into hdfs-instructions
@33525   5 years  ak19    Rename before latest version
@33524   5 years  ak19    1. Further adjustments to documenting what we did to get things to run …
@33523   5 years  ak19    Instructional comment
@33522   5 years  ak19    Some comments and an improvement
@33519   5 years  ak19    Code still writes out the global seedURLs.txt and regex-urlfilter.txt …
@33518   5 years  ak19    Intermediate commit: got the seed urls file temporarily written out as …
@33517   5 years  ak19    1. Blacklists were introduced so that too many instances of camelcased …
@33516   5 years  ak19    Before I accidentally lose it, committing the script Dr Bainbridge …
@33515   5 years  ak19    Removed an unused function
@33514   5 years  ak19    Committing README on starting off with the vagrant VM for hadoop-spark …
@33513   5 years  ak19    Higher level script that runs against each named crawl since Sep 2018 …
@33503   5 years  ak19    More efficient blacklisting/greylisting/whitelisting now by reading in …
@33502   5 years  ak19    Current url pattern blacklist and greylist filter files. Used by …
@33501   5 years  ak19    Refactored code into 2 classes: The existing WETProcessor, which …
@33499   5 years  ak19    Explicitly adding in IAM policy configuration details instead of just …
@33498   5 years  ak19    Corrections to script. Modified the tests checking for file/dir …
@33497   5 years  ak19    First version of discard url filter file. Inefficient implementation. …
@33496   5 years  ak19    Minor changes to reading list file
@33495   5 years  ak19    Pruned out unused commands, added comments, marked unused variables to …
@33494   5 years  ak19    All in one script that takes as parameter a common crawl identifier of …
@33489   5 years  ak19    Handy file to not have to keep manually repeating commands when …
@33488   5 years  ak19    new function createSeedURLsFiles() in WETProcessor that replaces the …
@33480   5 years  ak19    Much harder to remove pages where words are fused together as some are …
@33471   5 years  ak19    Very minor changes.
@33470   5 years  ak19    A new script to reduce keepURLs.txt to unique URLs, 1 from each unique …
@33469   5 years  ak19    Don't want URLs with the word product(s) in them (but production …
@33468   5 years  ak19    More meaningful to (also) write out the keep vs discard URLs into keep …
@33467   5 years  ak19    Improved the code to use a static block to load the needed properties …
@33466   5 years  ak19    1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) …
@33465   5 years  ak19    Committing first version of the WETProcessor.java which takes a …