|
|
@33623
|
4 years |
ak19 |
1. Incorporated Dr Nichols' earlier suggestion of storing page modified …
|
|
|
@33621
|
4 years |
ak19 |
Committing jotted-down MongoDB-related instructions from what Dr …
|
|
|
@33615
|
4 years |
ak19 |
1. Worked out how to configure log4j to log both to console and …
|
|
|
@33603
|
5 years |
ak19 |
Incorporating Dr Nichols' suggestion to help weed out product sites: if …
|
|
|
@33565
|
5 years |
ak19 |
CCWETProcessor: domain url now goes in as a seedURL after the …
|
|
|
@33558
|
5 years |
ak19 |
Committing cumulative changes since last commit.
|
|
|
@33545
|
5 years |
ak19 |
Mainly changes to crawling-Nutch.txt and some minor changes to other …
|
|
|
@33541
|
5 years |
ak19 |
1. hdfs-cc-work/GS_README.txt now contains the complete instructions …
|
|
|
@33540
|
5 years |
ak19 |
Since I wasn't getting further with Nutch 2 to grab an entire site, I …
|
|
|
@33537
|
5 years |
ak19 |
More Nutch and general site-mirroring links
|
|
|
@33529
|
5 years |
ak19 |
Forgot to add the most basic Nutch links
|
|
|
@33528
|
5 years |
ak19 |
Adding in Nutch links
|
|
|
@33499
|
5 years |
ak19 |
Explicitly adding in IAM policy configuration details instead of just …
|
|
|
@33496
|
5 years |
ak19 |
Minor changes to reading list file
|
|
|
@33467
|
5 years |
ak19 |
Improved the code to use a static block to load the needed properties …
|
|
|
@33457
|
5 years |
ak19 |
Got stage 1, the WARC to WET conversion, working, after necessary …
|
|
|
@33456
|
5 years |
ak19 |
Link to discussion on how to convert WARC to WET
|
|
|
@33448
|
5 years |
ak19 |
Minor clarification and inclusion of helpful command
|
|
|
@33446
|
5 years |
ak19 |
1. Committing working version of export_maori_subset.sh which takes …
|
|
|
@33443
|
5 years |
ak19 |
More notes
|
|
|
@33441
|
5 years |
ak19 |
Adding further notes on running the CC-index examples on Spark.
|
|
|
@33440
|
5 years |
ak19 |
Split file to move vagrant-spark-hadoop notes into their own file.
|
|
|
@33428
|
5 years |
ak19 |
Working commoncrawl cc-warc-examples' WET wordcount example using …
|
|
|
@33425
|
5 years |
ak19 |
A few more links now that I got past getting the vagrant VM with spark …
|
|
|
@33423
|
5 years |
ak19 |
Adding in the link to the vagrant VM with Hadoop, Spark for cluster …
|
|
|
@33422
|
5 years |
ak19 |
Some more links.
|
|
|
@33419
|
5 years |
ak19 |
Last evening, I had found some links about how language-detection is …
|
|
|
@33414
|
5 years |
ak19 |
Adding important links
|
|
|
@33409
|
5 years |
ak19 |
Forgot to commit 2 files with links and shuffling some links around …
|
|
|
@33408
|
5 years |
ak19 |
Some rough notes. Will move into appropriate file later.
|
|
|
@33404
|
5 years |
ak19 |
1. Links to other Java ways of extracting text from web content. 2. …
|
|
|
@33393
|
5 years |
ak19 |
Modified the get_commoncrawl_nz_urls.sh to also create a reduced urls …
|
|
|
@33391
|
5 years |
ak19 |
Some rough bash scripting lines that work but aren't complete.
|
|
|
@33376
|
5 years |
ak19 |
Links and extracts I've read so far on the Web Curator Tool (WCT), …
|