

23:16 Changeset [33503] by ak19
More efficient blacklisting/greylisting/whitelisting now by reading in the …
23:11 Changeset [33502] by ak19
Current url pattern blacklist and greylist filter files. Used by …
21:28 Changeset [33501] by ak19
Refactored code into 2 classes: The existing WETProcessor, which processes …
19:05 Changeset [33500] by ak19
ThemeRoller download functionality currently offline. So uploading the …
17:59 Changeset [33499] by ak19
Explicitly adding in IAM policy configuration details instead of just …
16:43 Changeset [33498] by ak19
Corrections to script. Modified the tests checking for file/dir existence …


21:17 Changeset [33497] by ak19
First version of discard url filter file. Inefficient implementation. …
19:23 Changeset [33496] by ak19
Minor changes to reading list file
19:19 Changeset [33495] by ak19
Pruned out unused commands, added comments, marked unused variables to be …


22:49 Changeset [33494] by ak19
All in one script that takes as parameter a common crawl identifier of the …


14:24 Changeset [33493] by kjdon
if we are on a cross collection search page, the collection for each …
13:43 Changeset [33492] by kjdon
not all CCS pages have a hierarchy element, so just test on s1.collection
13:23 Changeset [33491] by kjdon
need to add optional args for doc links into the CCS format links. Also, …
12:34 Changeset [33490] by kjdon
changed default partition sizes back to 20, to match what was there …


20:20 Changeset [33489] by ak19
Handy file to not have to keep manually repeating commands when deleting …


14:48 Changeset [33488] by ak19
new function createSeedURLsFiles() in WETProcessor that replaces the bash …
14:24 Changeset [33487] by kjdon
added code to display any error messages
14:23 Changeset [33486] by kjdon
reindented the page, added some extra links, and organised the items into …
14:22 Changeset [33485] by kjdon
removed an erroneous space
14:21 Changeset [33484] by kjdon
some changes and additions to the debuginfo page texts
14:20 Changeset [33483] by kjdon
added an explicit space after Error:
10:55 Changeset [33482] by kjdon
changed standardize_capitalization to …
10:41 Changeset [33481] by kjdon
a few more refinements to List strings


19:45 Changeset [33480] by ak19
Much harder to remove pages where words are fused together as some are …
14:55 Changeset [33479] by kjdon
changed numeric option order to match letter options
14:54 Changeset [33478] by kjdon
some refining of list option descriptions
12:30 Changeset [33477] by kjdon
need to call setup_custom_sort to allow for collection's …
12:30 Changeset [33476] by kjdon
enabled having customsorttools in collection's perllib folder. you can …
11:19 Changeset [33475] by kjdon
added numeric partition defaults to match partition type
11:04 Changeset [33474] by kjdon
it turns out that childtype is not set in all cases, so put in the default …
10:21 Changeset [33473] by kjdon
still didn't get it quite right…
09:54 Changeset [33472] by kjdon
forgot the -> to access member of a hash ref


22:57 Changeset [33471] by ak19
Very minor changes.
22:53 Changeset [33470] by ak19
A new script to reduce keepURLs.txt to unique URLs, 1 from each unique …
21:46 Changeset [33469] by ak19
Don't want URLs with the word product(s) in them (but production should be …
19:24 Changeset [33468] by ak19
More meaningful to (also) write out the keep vs discard URLs into keep and …
17:44 Changeset [33467] by ak19
Improved the code to use a static block to load the needed properties from …


21:37 Changeset [33466] by ak19
1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) files. …
20:00 Changeset [33465] by ak19
Committing first version of the WETProcessor.java which takes a .warc.wet …
14:21 Changeset [33464] by kjdon
I committed the last changes by mistake, using the previous revision log …
14:17 Changeset [33463] by kjdon
fixed up some typos. removed use_hlist_for option. This is very hard to …


20:10 Changeset [33462] by ak19
Tested new tomcat.allowLinking property on Windows too now and it behaves …
19:45 Changeset [33461] by ak19
Implementing Diego Spano's suggested changes for tomcat's allowLinking …


13:04 Changeset [33460] by kjdon
fixed up some typos. removed use_hlist_for option. This is very hard to …
12:06 Changeset [33459] by kjdon
small changes to some strings


14:30 Changeset [33458] by cpb16
Running new morphology version after quick meeting with David last week. …


19:01 Changeset [33457] by ak19
Got stage 1, the WARC to WET conversion, working, after necessary …
17:26 Changeset [33456] by ak19
Link to discussion on how to convert WARC to WET


14:45 Changeset [33455] by cpb16
Started implementing David's suggested morphology sequence, codeversion9


14:41 Changeset [33454] by kjdon
updated metadata_selection_mode to be metadata_selection_mode_within_level …
13:16 Changeset [33453] by kjdon
the new and modified strings for revamped List classifier
13:15 Changeset [33452] by kjdon
revamp of list classifier. More precise handling of numeric metadata …
12:55 Changeset [33451] by kjdon
added a comment
12:54 Changeset [33450] by kjdon
removed some unnecessary comments


17:08 Changeset [33449] by cpb16
terminal version executes correctly. (Didn't include init threshold in …


18:27 Changeset [33448] by ak19
Minor clarification and inclusion of helpful command
18:03 Changeset [33447] by cpb16
starting to implement terminal version of new morphology. need to fix. …


19:12 Changeset [33446] by ak19
1. Committing working version of export_maori_subset.sh which takes the …
17:01 Changeset [33445] by ak19
The first working hadoop spark script for processing common crawl data. …
16:57 Changeset [33444] by cpb16
Have created a preprocess to remove large objects. …


20:22 Changeset [33443] by ak19
More notes
19:30 Changeset [33442] by ak19
Updated gutil.jar file (with SafeProcess debugging)
19:30 Changeset [33441] by ak19
Adding further notes to do with running the CC-index examples on spark.
19:17 Changeset [33440] by ak19
Split file to move vagrant-spark-hadoop notes into own file.
17:03 Changeset [33439] by cpb16
Have created properties file and accessibility from …


17:14 Changeset [33438] by ak19
Forgot to commit a change made for Georgian.


16:44 Changeset [33437] by cpb16
made progress with morphology. Need to have a better area dimension …