Timeline



04.10.2019:

22:19 Changeset [33553] by ak19
Comments
22:00 Changeset [33552] by ak19
1. Code now processes ccrawldata folder, containing each individual common …
19:35 Changeset [33551] by ak19
Added in top 500 urls from moz.com/top500 and removed duplicates, and …
19:06 Changeset [33550] by ak19
First stage of introducing sites-too-big-to-exhaustively-crawl.txt: split …
18:29 Changeset [33549] by ak19
All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 (when …
14:36 Changeset [33548] by davidb
Include new wavesurfer sub-project to install
14:30 Changeset [33547] by davidb
Initial cut at wavesurfer JS audio player version of AMC music content …
14:19 Changeset [33546] by davidb
Initial cut at wave-surfer based JS audio player extension for Greenstone

03.10.2019:

22:38 Changeset [33545] by ak19
Mainly changes to crawling-Nutch.txt and some minor changes to other txt …
18:56 Changeset [33544] by ak19
1. Dr Bainbridge had the correct fix for solr dealing with phrase …

02.10.2019:

17:01 Changeset [33543] by ak19
Filled in some missing instructions
15:25 Changeset [33542] by kjdon
use_hlist_for option is no longer valid

01.10.2019:

22:27 Changeset [33541] by ak19
1. hdfs-cc-work/GS_README.txt now contains the complete instructions to …
21:40 Changeset [33540] by ak19
Since I wasn't getting further with nutch 2 to grab an entire site, I am …
21:36 Changeset [33539] by ak19
File rename
21:36 Changeset [33538] by ak19
Some additions to the setup.sh script to query commoncrawl for MRI data on …

30.09.2019:

22:51 Changeset [33537] by ak19
More nutch and general site mirroring related links
21:28 Changeset [33536] by ak19
Changes required to the commoncrawl related Vagrant github project to get …
19:20 Ticket #956 (Minor changes to config for Images GPS tutorial after 3.09) created by ak19
As at 30 Sep 2019, the Images GPS collection works the same with a caveat …
16:49 Changeset [33535] by ak19
1. New setup.sh script for use on a hadoop system to set up the git projects we …

27.09.2019:

17:05 Changeset [33534] by ak19
Correction: toplevel script has to be placed inside cc-index-table not its …
11:02 Changeset [33533] by kjdon
some collections might not have Title or root_Title metadata, so check …

26.09.2019:

23:06 Changeset [33532] by ak19
Found the other top 500 sites link again at last which Dr Bainbridge had …
23:03 Changeset [33531] by ak19
Added whitelist for mi.wikipedia.org, and updates to blacklist and …
22:41 Changeset [33530] by ak19
Completed sentence that was left hanging.
22:22 Changeset [33529] by ak19
Forgot to add most basic nutch links
21:47 Changeset [33528] by ak19
Adding in Nutch links
20:39 Changeset [33527] by ak19
Name change for folder
20:38 Changeset [33526] by ak19
Moved hadoop related scripts from bin/script into hdfs-instructions
20:35 Changeset [33525] by ak19
Rename before latest version
20:34 Changeset [33524] by ak19
1. Further adjustments to documenting what we did to get things to run on …
19:00 Changeset [33523] by ak19
Instructional comment
19:00 Changeset [33522] by ak19
Some comments and an improvement
17:49 Changeset [33521] by ak19
AUTOCOMMIT by gen-model-colls.sh script. Message: Redoing the CDS-ISIS …
17:49 Changeset [33520] by ak19
AUTOCOMMIT by gen-model-colls.sh script. Message: Redoing the CDS-ISIS …

24.09.2019:

21:40 Changeset [33519] by ak19
Code still writes out the global seedURLs.txt and regex-urlfilter.txt (in …
21:13 Changeset [33518] by ak19
Intermediate commit: got the seed urls file temporarily written out as …
20:30 Changeset [33517] by ak19
1. Blacklists were introduced so that too many instances of camelcased …
20:14 Changeset [33516] by ak19
Before I accidentally lose it, committing the script Dr Bainbridge wrote …
19:50 Changeset [33515] by ak19
Removed an unused function
19:44 Changeset [33514] by ak19
Committing README on starting off with the vagrant VM for hadoop-spark to …
19:15 Changeset [33513] by ak19
Higher level script that runs against each named crawl since Sep 2018 …
15:17 Changeset [33512] by ak19
AUTOCOMMIT by gen-model-colls.sh script. Message: Rebuilding all the …
15:16 Changeset [33511] by ak19
AUTOCOMMIT by gen-model-colls.sh script. Message: Rebuilding all the …
14:13 Changeset [33510] by kjdon
isEditingTurnedOn renamed to isEditingAllowed, and added …
14:12 Changeset [33509] by kjdon
only display Map GPS editing stuff if it's allowed in config file
14:07 Changeset [33508] by kjdon
pass a param into readyPageForEditing - indicates whether to add the …
13:24 Changeset [33507] by kjdon
moved canDoEditing variable code to top, so can be used everywhere in the …
13:04 Changeset [33506] by kjdon
need to check whether document editing is turned on, not just if the user …
12:55 Changeset [33505] by kjdon
allowUserComments option changed to start with lower case a, to match …
12:53 Changeset [33504] by kjdon
allowDocumentEditing option changed to start with lower case a, to match …
10:23 Ticket #955 (Use of GreenStone/Koha with Multimedia Production Management - e.g. ...) created by kjdon
This was a feature request added to sourceforge greenstone 3 project in …

23.09.2019:

23:16 Changeset [33503] by ak19
More efficient blacklisting/greylisting/whitelisting now by reading in the …
23:11 Changeset [33502] by ak19
Current url pattern blacklist and greylist filter files. Used by …
21:28 Changeset [33501] by ak19
Refactored code into 2 classes: The existing WETProcessor, which processes …
19:05 Changeset [33500] by ak19
ThemeRoller download functionality currently offline. So uploading the …
17:59 Changeset [33499] by ak19
Explicitly adding in IAM policy configuration details instead of just …
16:43 Changeset [33498] by ak19
Corrections to script. Modified the tests checking for file/dir existence …

22.09.2019:

21:17 Changeset [33497] by ak19
First version of discard url filter file. Inefficient implementation. …
19:23 Changeset [33496] by ak19
Minor changes to reading list file
19:19 Changeset [33495] by ak19
Pruned out unused commands, added comments, marked unused variables to be …

21.09.2019:

22:49 Changeset [33494] by ak19
All in one script that takes as parameter a common crawl identifier of the …

19.09.2019:

14:24 Changeset [33493] by kjdon
if we are on a cross collection search page, the collection for each …
13:43 Changeset [33492] by kjdon
not all ccs pages have a hierarchy element, so just test on s1.collection
13:23 Changeset [33491] by kjdon
need to add optional args for doc links into the CCS format links. Also, …
12:34 Changeset [33490] by kjdon
changed default partition sizes back to 20, to match what was there …

18.09.2019:

20:20 Changeset [33489] by ak19
Handy file to not have to keep manually repeating commands when deleting …

17.09.2019:

14:48 Changeset [33488] by ak19
new function createSeedURLsFiles() in WETProcessor that replaces the bash …
14:24 Changeset [33487] by kjdon
added code to display any error messages
14:23 Changeset [33486] by kjdon
reindented the page, added some extra links, and organised the items into …
14:22 Changeset [33485] by kjdon
removed an erroneous space
14:21 Changeset [33484] by kjdon
some changes and additions to the debuginfo page texts
14:20 Changeset [33483] by kjdon
added an explicit space after Error:
10:55 Changeset [33482] by kjdon
changed standardize_capitalization to …
10:41 Changeset [33481] by kjdon
a few more refinements to List strings

16.09.2019:

19:45 Changeset [33480] by ak19
Much harder to remove pages where words are fused together as some are …
14:55 Changeset [33479] by kjdon
changed numeric option order to match letter options
14:54 Changeset [33478] by kjdon
some refining of list option descriptions
12:30 Changeset [33477] by kjdon
need to call setup_custom_sort to allow for collection's …
12:30 Changeset [33476] by kjdon
enabled having customsorttools in collection's perllib folder. you can …
11:19 Changeset [33475] by kjdon
added numeric partition defaults to match partition type
11:04 Changeset [33474] by kjdon
it turns out that childtype is not set in all cases, so put in the default …
10:21 Changeset [33473] by kjdon
still didn't get it quite right…
09:54 Changeset [33472] by kjdon
forgot the -> to access member of a hash ref

13.09.2019:

22:57 Changeset [33471] by ak19
Very minor changes.
22:53 Changeset [33470] by ak19
A new script to reduce keepURLs.txt to unique URLs, 1 from each unique …
21:46 Changeset [33469] by ak19
Don't want URLs with the word product(s) in them (but production should be …
19:24 Changeset [33468] by ak19
More meaningful to (also) write out the keep vs discard URLs into keep and …
17:44 Changeset [33467] by ak19
Improved the code to use a static block to load the needed properties from …

12.09.2019:

21:37 Changeset [33466] by ak19
1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) files. …
20:00 Changeset [33465] by ak19
Committing first version of the WETProcessor.java which takes a .warc.wet …
14:21 Changeset [33464] by kjdon
I committed the last changes by mistake, using the previous revision log …
14:17 Changeset [33463] by kjdon
fixed up some typos. removed use_hlist_for option. This is very hard to …

11.09.2019:

20:10 Changeset [33462] by ak19
Tested new tomcat.allowLinking property on Windows too now and it behaves …
19:45 Changeset [33461] by ak19
Implementing Diego Spano's suggested changes for tomcat's allowLinking …

09.09.2019:

13:04 Changeset [33460] by kjdon
fixed up some typos. removed use_hlist_for option. This is very hard to …
12:06 Changeset [33459] by kjdon
small changes to some strings

07.09.2019:

14:30 Changeset [33458] by cpb16
Running new morphology version after quick meeting with David last week. …

05.09.2019:

19:01 Changeset [33457] by ak19
Got stage 1, the WARC to WET conversion, working, after necessary …
17:26 Changeset [33456] by ak19
Link to discussion on how to convert WARC to WET

04.09.2019:

14:45 Changeset [33455] by cpb16
Started implementing David's suggested morphology sequence, codeversion9