|
|
@33541
|
5 years |
ak19 |
1. hdfs-cc-work/GS_README.txt now contains the complete instructions …
|
|
|
@33540
|
5 years |
ak19 |
Since I wasn't getting further with nutch 2 to grab an entire site, I …
|
|
|
@33539
|
5 years |
ak19 |
File rename
|
|
|
@33538
|
5 years |
ak19 |
Some additions to the setup.sh script to query commoncrawl for MRI …
|
|
|
@33537
|
5 years |
ak19 |
More nutch and general site mirroring related links
|
|
|
@33536
|
5 years |
ak19 |
Changes required to the commoncrawl related Vagrant github project to …
|
|
|
@33535
|
5 years |
ak19 |
1. New setup.sh script for use on a hadoop system to set up the git …
|
|
|
@33534
|
5 years |
ak19 |
Correction: toplevel script has to be placed inside cc-index-table not …
|
|
|
@33533
|
5 years |
kjdon |
some collections might not have Title or root_Title metadata, so check …
|
|
|
@33532
|
5 years |
ak19 |
Found the other top 500 sites link again at last which Dr Bainbridge …
|
|
|
@33531
|
5 years |
ak19 |
Added whitelist for mi.wikipedia.org, and updates to blacklist and …
|
|
|
@33530
|
5 years |
ak19 |
Completed sentence that was left hanging.
|
|
|
@33529
|
5 years |
ak19 |
Forgot to add most basic nutch links
|
|
|
@33528
|
5 years |
ak19 |
Adding in Nutch links
|
|
|
@33527
|
5 years |
ak19 |
Name change for folder
|
|
|
@33526
|
5 years |
ak19 |
Moved hadoop related scripts from bin/script into hdfs-instructions
|
|
|
@33525
|
5 years |
ak19 |
Rename before latest version
|
|
|
@33524
|
5 years |
ak19 |
1. Further adjustments to documenting what we did to get things to run …
|
|
|
@33523
|
5 years |
ak19 |
Instructional comment
|
|
|
@33522
|
5 years |
ak19 |
Some comments and an improvement
|
|
|
@33521
|
5 years |
ak19 |
AUTOCOMMIT by gen-model-colls.sh script. Message: Redoing the CDS-ISIS …
|
|
|
@33520
|
5 years |
ak19 |
AUTOCOMMIT by gen-model-colls.sh script. Message: Redoing the CDS-ISIS …
|
|
|
@33519
|
5 years |
ak19 |
Code still writes out the global seedURLs.txt and regex-urlfilter.txt …
|
|
|
@33518
|
5 years |
ak19 |
Intermediate commit: got the seed urls file temporarily written out as …
|
|
|
@33517
|
5 years |
ak19 |
1. Blacklists were introduced so that too many instances of camelcased …
|
|
|
@33516
|
5 years |
ak19 |
Before I accidentally lose it, committing the script Dr Bainbridge …
|
|
|
@33515
|
5 years |
ak19 |
Removed an unused function
|
|
|
@33514
|
5 years |
ak19 |
Committing README on starting off with the vagrant VM for hadoop-spark …
|
|
|
@33513
|
5 years |
ak19 |
Higher level script that runs against each named crawl since Sep 2018 …
|
|
|
@33512
|
5 years |
ak19 |
AUTOCOMMIT by gen-model-colls.sh script. Message: Rebuilding all the …
|
|
|
@33511
|
5 years |
ak19 |
AUTOCOMMIT by gen-model-colls.sh script. Message: Rebuilding all the …
|
|
|
@33510
|
5 years |
kjdon |
isEditingTurnedOn renamed to isEditingAllowed, and added …
|
|
|
@33509
|
5 years |
kjdon |
only display Map GPS editing stuff if it's allowed in config file
|
|
|
@33508
|
5 years |
kjdon |
pass a param into readyPageForEditing - indicates whether to add the …
|
|
|
@33507
|
5 years |
kjdon |
moved canDoEditing variable code to top, so can be used everywhere in …
|
|
|
@33506
|
5 years |
kjdon |
need to check whether document editing is turned on, not just if the …
|
|
|
@33505
|
5 years |
kjdon |
allowUserComments option changed to start with lower case a, to match …
|
|
|
@33504
|
5 years |
kjdon |
allowDocumentEditing option changed to start with lower case a, to …
|
|
|
@33503
|
5 years |
ak19 |
More efficient blacklisting/greylisting/whitelisting now by reading in …
|
|
|
@33502
|
5 years |
ak19 |
Current url pattern blacklist and greylist filter files. Used by …
|
|
|
@33501
|
5 years |
ak19 |
Refactored code into 2 classes: The existing WETProcessor, which …
|
|
|
@33500
|
5 years |
ak19 |
ThemeRoller download functionality currently offline. So uploading the …
|
|
|
@33499
|
5 years |
ak19 |
Explicitly adding in IAM policy configuration details instead of just …
|
|
|
@33498
|
5 years |
ak19 |
Corrections to script. Modified the tests checking for file/dir …
|
|
|
@33497
|
5 years |
ak19 |
First version of discard url filter file. Inefficient implementation. …
|
|
|
@33496
|
5 years |
ak19 |
Minor changes to reading list file
|
|
|
@33495
|
5 years |
ak19 |
Pruned out unused commands, added comments, marked unused variables to …
|
|
|
@33494
|
5 years |
ak19 |
All in one script that takes as parameter a common crawl identifier of …
|
|
|
@33493
|
5 years |
kjdon |
if we are on a cross collection search page, the collection for each …
|
|
|
@33492
|
5 years |
kjdon |
not all ccs pages have a hierarchy element, so just test on s1.collection
|
|
|
@33491
|
5 years |
kjdon |
need to add optional args for doc links into the CCS format links. …
|
|
|
@33490
|
5 years |
kjdon |
changed default partition sizes back to 20, to match what was there …
|
|
|
@33489
|
5 years |
ak19 |
Handy file to not have to keep manually repeating commands when …
|
|
|
@33488
|
5 years |
ak19 |
new function createSeedURLsFiles() in WETProcessor that replaces the …
|
|
|
@33487
|
5 years |
kjdon |
added code to display any error messages
|
|
|
@33486
|
5 years |
kjdon |
reindented the page, added some extra links, and organised the items …
|
|
|
@33485
|
5 years |
kjdon |
removed an erroneous space
|
|
|
@33484
|
5 years |
kjdon |
some changes and additions to the debuginfo page texts
|
|
|
@33483
|
5 years |
kjdon |
added an explicit space after Error:
|
|
|
@33482
|
5 years |
kjdon |
changed standardize_capitalization to …
|
|
|
@33481
|
5 years |
kjdon |
a few more refinements to List strings
|
|
|
@33480
|
5 years |
ak19 |
Much harder to remove pages where words are fused together as some are …
|
|
|
@33479
|
5 years |
kjdon |
changed numeric option order to match letter options
|
|
|
@33478
|
5 years |
kjdon |
some refining of list option descriptions
|
|
|
@33477
|
5 years |
kjdon |
need to call setup_custom_sort to allow for collection's customsorttools.pm
|
|
|
@33476
|
5 years |
kjdon |
enabled having customsorttools in collection's perllib folder. you can …
|
|
|
@33475
|
5 years |
kjdon |
added numeric partition defaults to match partition type
|
|
|
@33474
|
5 years |
kjdon |
it turns out that childtype is not set in all cases, so put in the …
|
|
|
@33473
|
5 years |
kjdon |
still didn't get it quite right…
|
|
|
@33472
|
5 years |
kjdon |
forgot the -> to access member of a hash ref
|
|
|
@33471
|
5 years |
ak19 |
Very minor changes.
|
|
|
@33470
|
5 years |
ak19 |
A new script to reduce keepURLs.txt to unique URLs, 1 from each unique …
|
|
|
@33469
|
5 years |
ak19 |
Don't want URLs with the word product(s) in them (but production …
|
|
|
@33468
|
5 years |
ak19 |
More meaningful to (also) write out the keep vs discard URLs into keep …
|
|
|
@33467
|
5 years |
ak19 |
Improved the code to use a static block to load the needed properties …
|
|
|
@33466
|
5 years |
ak19 |
1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) …
|
|
|
@33465
|
5 years |
ak19 |
Committing first version of the WETProcessor.java which takes a …
|
|
|
@33464
|
5 years |
kjdon |
I committed the last changes by mistake, using the previous revision …
|
|
|
@33463
|
5 years |
kjdon |
fixed up some typos. removed use_hlist_for option. This is very hard …
|
|
|
@33462
|
5 years |
ak19 |
Tested new tomcat.allowLinking property on Windows too now and it …
|
|
|
@33461
|
5 years |
ak19 |
Implementing Diego Spano's suggested changes for tomcat's allowLinking …
|
|
|
@33460
|
5 years |
kjdon |
fixed up some typos. removed use_hlist_for option. This is very hard …
|
|
|
@33459
|
5 years |
kjdon |
small changes to some strings
|
|
|
@33458
|
5 years |
cpb16 |
Running new morphology version after quick meeting with David last …
|
|
|
@33457
|
5 years |
ak19 |
Got stage 1, the WARC to WET conversion, working, after necessary …
|
|
|
@33456
|
5 years |
ak19 |
Link to discussion on how to convert WARC to WET
|
|
|
@33455
|
5 years |
cpb16 |
Started implementing David's suggested morphology sequence, codeversion9
|
|
|
@33454
|
5 years |
kjdon |
updated metadata_selection_mode to be …
|
|
|
@33453
|
5 years |
kjdon |
the new and modified strings for revamped List classifier
|
|
|
@33452
|
5 years |
kjdon |
revamp of list classifier. More precise handling of numeric metadata …
|
|
|
@33451
|
5 years |
kjdon |
added a comment
|
|
|
@33450
|
5 years |
kjdon |
removed some unnecessary comments
|
|
|
@33449
|
5 years |
cpb16 |
terminal version executes correctly. (Didn't include init threshold in …
|
|
|
@33448
|
5 years |
ak19 |
Minor clarification and inclusion of helpful command
|
|
|
@33447
|
5 years |
cpb16 |
starting to implement terminal version of new morphology. need to fix. …
|
|
|
@33446
|
5 years |
ak19 |
1. Committing working version of export_maori_subset.sh which takes …
|
|
|
@33445
|
5 years |
ak19 |
The first working hadoop spark script for processing common crawl …
|
|
|
@33444
|
5 years |
cpb16 |
Have created a preprocess to remove large objects. …
|
|
|
@33443
|
5 years |
ak19 |
More notes
|
|
|
@33442
|
5 years |
ak19 |
Updated gutil.jar file (with SafeProcess debugging)
|
|
|