Rev    | Age     | Author | Log Message
-------+---------+--------+------------
@33575 | 5 years | ak19   | Correcting usage string for CCWETProcessor before committing new java …
@33574 | 5 years | ak19   | If nutch stores a crawled site in more than 1 file, then cat all of …
@33573 | 5 years | ak19   | Forgot to document that spaces were also allowed as separator in the …
@33572 | 5 years | ak19   | Only meant to store the wet.gz versions of these files, not also the …
@33571 | 5 years | ak19   | Adding Dr Bainbridge's suggestion of appending the crawlId of each …
@33570 | 5 years | ak19   | Need to check if UNFINISHED file actually exists before moving it …
@33569 | 5 years | ak19   | 1. batchcrawl.sh now does what it should have from the start, which is …
@33568 | 5 years | ak19   | 1. More sites greylisted and blacklisted, discovered as I attempted to …
@33567 | 5 years | ak19   | batchcrawl.sh now supports -all flag (and prints usage on 0 args). The …
@33566 | 5 years | ak19   | batchcrawl.sh script now supports taking a comma or space separated …
@33565 | 5 years | ak19   | CCWETProcessor: domain url now goes in as a seedURL after the …
@33564 | 5 years | ak19   | batchcrawl.sh now does the crawl and logs output of the crawl, dumps …
@33563 | 5 years | ak19   | Committing inactive testing batch scripts (only creates the …
@33562 | 5 years | ak19   | 1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
@33561 | 5 years | ak19   | 1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated …
@33560 | 5 years | ak19   | 1. Incorporated Dr Bainbridge's suggested improvements: only when …
@33559 | 5 years | ak19   | 1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
@33558 | 5 years | ak19   | Committing cumulative changes since last commit.
@33557 | 5 years | ak19   | Implemented the topSitesMap of topsite domain to url pattern in the …
@33556 | 5 years | ak19   | Blacklisted wikipedia pages that are actually in other languages which …
@33555 | 5 years | ak19   | Modified top sites list as Dr Bainbridge described: suffixes for the …
@33554 | 5 years | ak19   | Added more to blacklist and greylist. And removed remaining duplicates …
@33553 | 5 years | ak19   | Comments
@33552 | 5 years | ak19   | 1. Code now processes ccrawldata folder, containing each individual …
@33551 | 5 years | ak19   | Added in top 500 urls from moz.com/top500 and removed duplicates, and …
@33550 | 5 years | ak19   | First stage of introducing sites-too-big-to-exhaustively-crawl.txt: …
@33549 | 5 years | ak19   | All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 …
@33548 | 5 years | davidb | Include new wavesurfer sub-project to install
@33546 | 5 years | davidb | Initial cut at wave-surfer based JS audio player extension for Greenstone
@33545 | 5 years | ak19   | Mainly changes to crawling-Nutch.txt and some minor changes to other …
@33543 | 5 years | ak19   | Filled in some missing instructions
@33541 | 5 years | ak19   | 1. hdfs-cc-work/GS_README.txt now contains the complete instructions …
@33540 | 5 years | ak19   | Since I wasn't getting further with nutch 2 to grab an entire site, I …
@33539 | 5 years | ak19   | File rename
@33538 | 5 years | ak19   | Some additions to the setup.sh script to query commoncrawl for MRI …
@33537 | 5 years | ak19   | More nutch and general site mirroring related links
@33536 | 5 years | ak19   | Changes required to the commoncrawl related Vagrant github project to …
@33535 | 5 years | ak19   | 1. New setup.sh script for on a hadoop system to setup the git …
@33534 | 5 years | ak19   | Correction: toplevel script has to be placed inside cc-index-table not …
@33532 | 5 years | ak19   | Found the other top 500 sites link again at last which Dr Bainbridge …
@33531 | 5 years | ak19   | Added whitelist for mi.wikipedia.org, and updates to blacklist and …
@33530 | 5 years | ak19   | Completed sentence that was left hanging.
@33529 | 5 years | ak19   | Forgot to add most basic nutch links
@33528 | 5 years | ak19   | Adding in Nutch links
@33527 | 5 years | ak19   | Name change for folder
@33526 | 5 years | ak19   | Moved hadoop related scripts from bin/script into hdfs-instructions
@33525 | 5 years | ak19   | Rename before latest version
@33524 | 5 years | ak19   | 1. Further adjustments to documenting what we did to get things to run …
@33523 | 5 years | ak19   | Instructional comment
@33522 | 5 years | ak19   | Some comments and an improvement
@33519 | 5 years | ak19   | Code still writes out the global seedURLs.txt and regex-urlfilter.txt …
@33518 | 5 years | ak19   | Intermediate commit: got the seed urls file temporarily written out as …
@33517 | 5 years | ak19   | 1. Blacklists were introduced so that too many instances of camelcased …
@33516 | 5 years | ak19   | Before I accidentally lose it, committing the script Dr Bainbridge …
@33515 | 5 years | ak19   | Removed an unused function
@33514 | 5 years | ak19   | Committing README on starting off with the vagrant VM for hadoop-spark …
@33513 | 5 years | ak19   | Higher level script that runs against each named crawl since Sep 2018 …
@33503 | 5 years | ak19   | More efficient blacklisting/greylisting/whitelisting now by reading in …
@33502 | 5 years | ak19   | Current url pattern blacklist and greylist filter files. Used by …
@33501 | 5 years | ak19   | Refactored code into 2 classes: The existing WETProcessor, which …
@33499 | 5 years | ak19   | Explicitly adding in IAM policy configuration details instead of just …
@33498 | 5 years | ak19   | Corrections to script. Modified the tests checking for file/dir …
@33497 | 5 years | ak19   | First version of discard url filter file. Inefficient implementation. …
@33496 | 5 years | ak19   | Minor changes to reading list file
@33495 | 5 years | ak19   | Pruned out unused commands, added comments, marked unused variables to …
@33494 | 5 years | ak19   | All in one script that takes as parameter a common crawl identifier of …
@33489 | 5 years | ak19   | Handy file to not have to keep manually repeating commands when …
@33488 | 5 years | ak19   | new function createSeedURLsFiles() in WETProcessor that replaces the …
@33480 | 5 years | ak19   | Much harder to remove pages where words are fused together as some are …
@33471 | 5 years | ak19   | Very minor changes.
@33470 | 5 years | ak19   | A new script to reduce keepURLs.txt to unique URLs, 1 from each unique …
@33469 | 5 years | ak19   | Don't want URLs with the word product(s) in them (but production …
@33468 | 5 years | ak19   | More meaningful to (also) write out the keep vs discard URLs into keep …
@33467 | 5 years | ak19   | Improved the code to use a static block to load the needed properties …
@33466 | 5 years | ak19   | 1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) …
@33465 | 5 years | ak19   | Committing first version of the WETProcessor.java which takes a …
@33457 | 5 years | ak19   | Got stage 1, the WARC to WET conversion, working, after necessary …
@33456 | 5 years | ak19   | Link to discussion on how to convert WARC to WET
@33448 | 5 years | ak19   | Minor clarification and inclusion of helpful command
@33446 | 5 years | ak19   | 1. Committing working version of export_maori_subset.sh which takes …
@33445 | 5 years | ak19   | The first working hadoop spark script for processing common crawl …
@33443 | 5 years | ak19   | More notes
@33442 | 5 years | ak19   | Updated gutil.jar file (with SafeProcess debugging)
@33441 | 5 years | ak19   | Adding further notes to do with running the CC-index examples on spark.
@33440 | 5 years | ak19   | Split file to move vagrant-spark-hadoop notes into own file.
@33428 | 5 years | ak19   | Working commoncrawl cc-warc-examples' WET wordcount example using …
@33425 | 5 years | ak19   | A few more links now that I got past getting the vagrant VM with spark …
@33423 | 5 years | ak19   | Adding in the link to the vagrant VM with Hadoop, Spark for cluster …
@33422 | 5 years | ak19   | Some more links.
@33419 | 5 years | ak19   | Last evening, I had found some links about how language-detection is …
@33414 | 5 years | ak19   | Adding important links
@33413 | 5 years | ak19   | Splitting the get_commoncrawl_nz_urls.sh script back into 2 scripts, …
@33412 | 5 years | ak19   | config command for wgetting a single file
@33411 | 5 years | ak19   | Newer version now doesn't mirror sites with wget but gets WET files …
@33410 | 5 years | ak19   | Committing some variable name changes before I replace this file with …
@33409 | 5 years | ak19   | Forgot to commit 2 files with links and shuffling some links around …
@33408 | 5 years | ak19   | Some rough notes. Will move into appropriate file later.
@33407 | 5 years | ak19   | gutil.jar was rebuilt yesterday in GS3 after a bugfix. Recommitting …
@33405 | 5 years | ak19   | Even though we're probably not going to use this code after all, will …
@33404 | 5 years | ak19   | 1. Links to other Java ways of extracting text from web content. 2. …