Timeline



2019-11-12:

21:33 Changeset [33657] by ak19
Some fixes after brief testing against 1/3 of the crawl. Restarted …
21:11 Changeset [33656] by ak19
Final minor changes before I start processing the crawls of node2.
20:56 Changeset [33655] by ak19
Minor change to print statement
20:54 Changeset [33654] by ak19
Removing jar file that wasn't used after all.
20:51 Changeset [33653] by ak19
1. As suggested by Dr Bainbridge, made the code changes to use Morphia …
20:41 Changeset [33652] by ak19
Introducing morphia subpackage
18:11 Changeset [33651] by ak19
1. Bugfix: overlappingSentences works. 2. storing numSentencesInMaor
12:06 Changeset [33650] by kjdon
updated to match the new xsl file names; lots of variable renames to …
12:04 Changeset [33649] by kjdon
renamed config_format and text_fragment_format to better represent …
12:04 Changeset [33648] by kjdon
changed the debuginfo xsl and strings to match the new o=xxx debug options
09:30 Changeset [33647] by kjdon
added/changed a few of the output values for debugging the transform

2019-11-11:

18:46 Changeset [33646] by ak19
Saving the mongodb queries and learning links that Dr Bainbridge found …
18:45 Changeset [33645] by ak19
Fix to 2 bugs when sending data to MongoDB: 1. overlappingSentences …
11:50 Changeset [33644] by ak19
Just committing the growing mongodb.txt file with links and …
11:46 Changeset [33643] by ak19
Brought the template log4j.properties.in back up to speed. I forgot it …
11:06 Changeset [33642] by ak19
Forgot to commit the java driver for mongodb when I committed the Java …
10:53 Changeset [33641] by kjdon
commented out some debug statements
10:48 Changeset [33640] by kjdon
oops, I must have 'tidied' up the file and then not compiled it to …
10:23 Changeset [33639] by kjdon
need to select child nodes, otherwise the gsf:default node ends up in …
10:22 Changeset [33638] by kjdon
gslib doesn't use xml-to-string.xsl. it's only used by formatmanager, …
10:21 Changeset [33637] by kjdon
we can now use gsf and gslib in layout files.
10:04 Changeset [33636] by kjdon
include means the stylesheet gets added inline, import means it gets …
09:38 Changeset [33635] by ak19
Maori-language-detection doesn't use Greenstone 3 at present, it's not …

2019-11-08:

23:59 Changeset [33634] by ak19
Rewrote NutchTextDumpProcessor as NutchTextDumpToMongoDB.java, which …
19:43 Changeset [33633] by ak19
1. TextLanguageDetector now has methods for collecting all sentences …

2019-11-07:

14:53 Changeset [33632] by kjdon
overhaul of TransformingReceptionist. changed the order of inlining …
14:52 Changeset [33631] by kjdon
added a bit more error reporting
14:44 Changeset [33630] by kjdon
minor comment changes
14:20 Changeset [33629] by kjdon
added methods using Parameter2 - for params with text node values
13:52 Changeset [33628] by kjdon
not sure why documentNode was a gsf:template here. Can't be like that …
09:28 Changeset [33627] by kjdon
removed unnecessary comments

2019-11-05:

21:59 Changeset [33626] by ak19
TODOs
21:58 Changeset [33625] by ak19
A file listing domains with seedurls containing /mi(/) that are …
21:48 Changeset [33624] by ak19
Some cleanup surrounding the now renamed function createSeedURLsFile, …
21:04 Changeset [33623] by ak19
1. Incorporated Dr Nichols' earlier suggestion of storing page modified …
15:42 Changeset [33622] by ak19
File rename

2019-11-04:

20:35 Changeset [33621] by ak19
Committing jotted down mongodb related instructions from what Dr …
14:24 Changeset [33620] by ak19
Final crawl, done on vagrant VM node6. Crawl site IDs 01407-01462.
11:36 Changeset [33619] by kjdon
need to handle the case where a collection file (eg image) gets …

2019-11-01:

20:14 Changeset [33618] by ak19
Adding in the download URL
17:13 Changeset [33617] by ak19
Node5 is now full and here is the finished crawl (up to and including …

2019-10-31:

20:05 Changeset [33616] by ak19
Beginnings of Java class that is to interact with MongoDB. I don't yet …
20:03 Changeset [33615] by ak19
1. Worked out how to configure log4j to log both to console and …
11:22 Changeset [33614] by kjdon
added a new line
11:18 Changeset [33613] by kjdon
added allowdocumentediting and allowmapgpsediting options, plus also …
11:00 Changeset [33612] by kjdon
work to do with params. add in default values to params if they are …
10:55 Changeset [33611] by kjdon
added global setting to params - these are for params that are valid …
10:54 Changeset [33610] by kjdon
USER_SESSION_CACHE_ATT moved to GSParams, as it is stored in session …

2019-10-30:

23:03 Changeset [33609] by ak19
The tar files containing the crawled sites data shouldn't be called …
23:02 Changeset [33608] by ak19
1. New script to export from HBase so that we could in theory reimport …

2019-10-29:

18:33 Changeset [33607] by ak19
Updated with the remaining successfully crawled sites on node4 before …
15:18 Changeset [33606] by ak19
1. Committing crawl data from node3 (2nd VM for nutch crawling). 2. …
14:54 Changeset [33605] by ak19
Node 4 VM still works, but committing first set of crawled sites on there

2019-10-24:

23:22 Changeset [33604] by ak19
1. Better output into possible-product-sites.txt including the …
22:04 Changeset [33603] by ak19
Incorporating Dr Nichols' suggestion to help weed out product sites: if …

2019-10-23:

23:49 Changeset [33602] by ak19
1. The final csv file, mri-sentences.csv, is now written out. 2. Only …
23:22 Changeset [33601] by ak19
Creates the 2nd csv file, with info about webpages. At present stores …
23:05 Changeset [33600] by ak19
Work in progress of writing out CSV files. In future, may write the …

2019-10-22:

20:49 Changeset [33599] by ak19
First one-third sites crawled. Committing to SVN despite the tarred …
20:19 Changeset [33598] by ak19
More instructions on setting up Nutch now that I've remembered to …
20:05 Changeset [33597] by ak19
Committing active version of template file which has a newline at end …
18:44 Changeset [33596] by ak19
Adding in the nutch-site.xml and regex-urlfilter.GS_TEMPLATE template …
14:05 Changeset [33595] by kjdon
new displayBaskets template - to avoid replicating code in query and …
14:00 Changeset [33594] by kjdon
call gslib:displayBasket instead of replicating the code here
13:59 Changeset [33593] by kjdon
the test for facets should be facetList/facet/count, as the facets get …
13:51 Changeset [33592] by kjdon
reindented the file
11:51 Changeset [33591] by kjdon
added in some strings for 'this collection contains x documents and …
11:12 Changeset [33590] by kjdon
added 'this collection contains X documents and was last build Y days …

2019-10-21:

21:45 Changeset [33589] by cpb16
final01. Need Map results still

2019-10-18:

23:20 Changeset [33588] by ak19
Committing the MRI sentence model that I'm actually using, the one in …
23:16 Changeset [33587] by ak19
1. Better stats reporting on crawled sites: not just if a page was in …
22:20 Changeset [33586] by ak19
Refactored MaoriTextDetector.java class into more general …
21:41 Changeset [33585] by ak19
Much simpler way of using sentence and language detection model to …
21:20 Changeset [33584] by ak19
Committing experimental version 2 using the sentence detector model, …
21:20 Changeset [33583] by ak19
Committing experimental version 1 using the sentence detector model, …

2019-10-17:

23:12 Changeset [33582] by ak19
NutchTextDumpProcessor prints each crawled site's stats: number of …
21:53 Changeset [33581] by ak19
Minor fix. Noticed when looking for work I did on MRI sentence detection
21:44 Changeset [33580] by ak19
Finally fixed the thus-far identified bugs when parsing dump.txt.
21:05 Changeset [33579] by ak19
Debugging. Solved one problem.
19:31 Changeset [33578] by ak19
Corrections for compiling the 2 new classes.
19:12 Changeset [33577] by ak19
Forgot to adjust usage statement to say that silent mode was already …

2019-10-16:

23:37 Changeset [33576] by ak19
Introducing 2 new Java files still being written and untested. …
23:36 Changeset [33575] by ak19
Correcting usage string for CCWETProcessor before committing new java …
23:35 Changeset [33574] by ak19
If nutch stores a crawled site in more than 1 file, then cat all of …
21:39 Changeset [33573] by ak19
Forgot to document that spaces were also allowed as separator in the …
21:18 Changeset [33572] by ak19
Only meant to store the wet.gz versions of these files, not also the …
21:11 Changeset [33571] by ak19
Adding Dr Bainbridge's suggestion of appending the crawlId of each …
20:04 Changeset [33570] by ak19
Need to check if UNFINISHED file actually exists before moving it …
20:00 Changeset [33569] by ak19
1. batchcrawl.sh now does what it should have from the start, which is …

2019-10-14:

23:36 Changeset [33568] by ak19
1. More sites greylisted and blacklisted, discovered as I attempted to …
22:40 Changeset [33567] by ak19
batchcrawl.sh now supports -all flag (and prints usage on 0 args). The …
22:07 Changeset [33566] by ak19
batchcrawl.sh script now supports taking a comma or space separated …
21:04 Changeset [33565] by ak19
CCWETProcessor: domain url now goes in as a seedURL after the …
21:01 Changeset [33564] by ak19
batchcrawl.sh now does the crawl and logs output of the crawl, dumps …