Rev    | Age     | Author | Log Message
-------+---------+--------+------------
@33575 | 5 years | ak19   | Correcting usage string for CCWETProcessor before committing new java …
@33574 | 5 years | ak19   | If nutch stores a crawled site in more than 1 file, then cat all of …
@33573 | 5 years | ak19   | Forgot to document that spaces were also allowed as separator in the …
@33572 | 5 years | ak19   | Only meant to store the wet.gz versions of these files, not also the …
@33571 | 5 years | ak19   | Adding Dr Bainbridge's suggestion of appending the crawlId of each …
@33570 | 5 years | ak19   | Need to check if UNFINISHED file actually exists before moving it …
@33569 | 5 years | ak19   | 1. batchcrawl.sh now does what it should have from the start, which is …
@33568 | 5 years | ak19   | 1. More sites greylisted and blacklisted, discovered as I attempted to …
@33567 | 5 years | ak19   | batchcrawl.sh now supports -all flag (and prints usage on 0 args). The …
@33566 | 5 years | ak19   | batchcrawl.sh script now supports taking a comma or space separated …
@33565 | 5 years | ak19   | CCWETProcessor: domain url now goes in as a seedURL after the …
@33564 | 5 years | ak19   | batchcrawl.sh now does the crawl and logs output of the crawl, dumps …
@33563 | 5 years | ak19   | Committing inactive testing batch scripts (only creates the …
@33562 | 5 years | ak19   | 1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
@33561 | 5 years | ak19   | 1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated …
@33560 | 5 years | ak19   | 1. Incorporated Dr Bainbridge's suggested improvements: only when …
@33559 | 5 years | ak19   | 1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
@33558 | 5 years | ak19   | Committing cumulative changes since last commit.
@33557 | 5 years | ak19   | Implemented the topSitesMap of topsite domain to url pattern in the …
@33556 | 5 years | ak19   | Blacklisted wikipedia pages that are actually in other languages which …
@33555 | 5 years | ak19   | Modified top sites list as Dr Bainbridge described: suffixes for the …
@33554 | 5 years | ak19   | Added more to blacklist and greylist. And removed remaining duplicates …
@33553 | 5 years | ak19   | Comments
@33552 | 5 years | ak19   | 1. Code now processes ccrawldata folder, containing each individual …
@33551 | 5 years | ak19   | Added in top 500 urls from moz.com/top500 and removed duplicates, and …
@33550 | 5 years | ak19   | First stage of introducing sites-too-big-to-exhaustively-crawl.txt: …
@33549 | 5 years | ak19   | All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 …
@33548 | 5 years | davidb | Include new wavesurfer sub-project to install
@33546 | 5 years | davidb | Initial cut at wave-surfer based JS audio player extension for Greenstone
@33545 | 5 years | ak19   | Mainly changes to crawling-Nutch.txt and some minor changes to other …
@33543 | 5 years | ak19   | Filled in some missing instructions
@33541 | 5 years | ak19   | 1. hdfs-cc-work/GS_README.txt now contains the complete instructions …
@33540 | 5 years | ak19   | Since I wasn't getting further with nutch 2 to grab an entire site, I …
@33539 | 5 years | ak19   | File rename
@33538 | 5 years | ak19   | Some additions to the setup.sh script to query commoncrawl for MRI …
@33537 | 5 years | ak19   | More nutch and general site mirroring related links
@33536 | 5 years | ak19   | Changes required to the commoncrawl related Vagrant github project to …
@33535 | 5 years | ak19   | 1. New setup.sh script for on a hadoop system to setup the git …
@33534 | 5 years | ak19   | Correction: toplevel script has to be placed inside cc-index-table not …
@33532 | 5 years | ak19   | Found the other top 500 sites link again at last which Dr Bainbridge …
@33531 | 5 years | ak19   | Added whitelist for mi.wikipedia.org, and updates to blacklist and …
@33530 | 5 years | ak19   | Completed sentence that was left hanging.
@33529 | 5 years | ak19   | Forgot to add most basic nutch links
@33528 | 5 years | ak19   | Adding in Nutch links
@33527 | 5 years | ak19   | Name change for folder
@33526 | 5 years | ak19   | Moved hadoop related scripts from bin/script into hdfs-instructions
@33525 | 5 years | ak19   | Rename before latest version
@33524 | 5 years | ak19   | 1. Further adjustments to documenting what we did to get things to run …
@33523 | 5 years | ak19   | Instructional comment
@33522 | 5 years | ak19   | Some comments and an improvement
@33519 | 5 years | ak19   | Code still writes out the global seedURLs.txt and regex-urlfilter.txt …
@33518 | 5 years | ak19   | Intermediate commit: got the seed urls file temporarily written out as …
@33517 | 5 years | ak19   | 1. Blacklists were introduced so that too many instances of camelcased …
@33516 | 5 years | ak19   | Before I accidentally lose it, committing the script Dr Bainbridge …
@33515 | 5 years | ak19   | Removed an unused function
@33514 | 5 years | ak19   | Committing README on starting off with the vagrant VM for hadoop-spark …
@33513 | 5 years | ak19   | Higher level script that runs against each named crawl since Sep 2018 …
@33503 | 5 years | ak19   | More efficient blacklisting/greylisting/whitelisting now by reading in …
@33502 | 5 years | ak19   | Current url pattern blacklist and greylist filter files. Used by …
@33501 | 5 years | ak19   | Refactored code into 2 classes: The existing WETProcessor, which …
@33499 | 5 years | ak19   | Explicitly adding in IAM policy configuration details instead of just …
@33498 | 5 years | ak19   | Corrections to script. Modified the tests checking for file/dir …
@33497 | 5 years | ak19   | First version of discard url filter file. Inefficient implementation. …
@33496 | 5 years | ak19   | Minor changes to reading list file
@33495 | 5 years | ak19   | Pruned out unused commands, added comments, marked unused variables to …
@33494 | 5 years | ak19   | All in one script that takes as parameter a common crawl identifier of …
@33489 | 5 years | ak19   | Handy file to not have to keep manually repeating commands when …
@33488 | 5 years | ak19   | new function createSeedURLsFiles() in WETProcessor that replaces the …
@33480 | 5 years | ak19   | Much harder to remove pages where words are fused together as some are …
@33471 | 5 years | ak19   | Very minor changes.
@33470 | 5 years | ak19   | A new script to reduce keepURLs.txt to unique URLs, 1 from each unique …
@33469 | 5 years | ak19   | Don't want URLs with the word product(s) in them (but production …
@33468 | 5 years | ak19   | More meaningful to (also) write out the keep vs discard URLs into keep …
@33467 | 5 years | ak19   | Improved the code to use a static block to load the needed properties …
@33466 | 5 years | ak19   | 1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) …
@33465 | 5 years | ak19   | Committing first version of the WETProcessor.java which takes a …
@33457 | 5 years | ak19   | Got stage 1, the WARC to WET conversion, working, after necessary …
@33456 | 5 years | ak19   | Link to discussion on how to convert WARC to WET
@33448 | 5 years | ak19   | Minor clarification and inclusion of helpful command
@33446 | 5 years | ak19   | 1. Committing working version of export_maori_subset.sh which takes …
@33445 | 5 years | ak19   | The first working hadoop spark script for processing common crawl …
@33443 | 5 years | ak19   | More notes
@33442 | 5 years | ak19   | Updated gutil.jar file (with SafeProcess debugging)
@33441 | 5 years | ak19   | Adding further notes to do with running the CC-index examples on spark.
@33440 | 5 years | ak19   | Split file to move vagrant-spark-hadoop notes into own file.
@33428 | 5 years | ak19   | Working commoncrawl cc-warc-examples' WET wordcount example using …
@33425 | 5 years | ak19   | A few more links now that I got past getting the vagrant VM with spark …
@33423 | 5 years | ak19   | Adding in the link to the vagrant VM with Hadoop, Spark for cluster …
@33422 | 5 years | ak19   | Some more links.
@33419 | 5 years | ak19   | Last evening, I had found some links about how language-detection is …
@33414 | 5 years | ak19   | Adding important links
@33413 | 5 years | ak19   | Splitting the get_commoncrawl_nz_urls.sh script back into 2 scripts, …
@33412 | 5 years | ak19   | config command for wgetting a single file
@33411 | 5 years | ak19   | Newer version now doesn't mirror sites with wget but gets WET files …
@33410 | 5 years | ak19   | Committing some variable name changes before I replace this file with …
@33409 | 5 years | ak19   | Forgot to commit 2 files with links and shuffling some links around …
@33408 | 5 years | ak19   | Some rough notes. Will move into appropriate file later.
@33407 | 5 years | ak19   | gutil.jar was rebuilt yesterday in GS3 after a bugfix. Recommitting …
@33405 | 5 years | ak19   | Even though we're probably not going to use this code after all, will …
@33404 | 5 years | ak19   | 1. Links to other Java ways of extracting text from web content. 2. …