Timeline
2019-11-08:
- 23:59 Changeset [33634] by
- Rewrote NutchTextDumpProcessor as NutchTextDumpToMongoDB.java, which …
- 19:43 Changeset [33633] by
- 1. TextLanguageDetector now has methods for collecting all sentences …
2019-11-07:
- 14:53 Changeset [33632] by
- overhaul of TransformingReceptionist. changed the order of inlining …
- 14:52 Changeset [33631] by
- added a bit more error reporting
- 14:44 Changeset [33630] by
- minor comment changes
- 14:20 Changeset [33629] by
- added methods using Parameter2 - for params with text node values
- 13:52 Changeset [33628] by
- not sure why documentNode was a gsf:template here. Can't be like that …
- 09:28 Changeset [33627] by
- removed unnecessary comments
2019-11-05:
- 21:59 Changeset [33626] by
- TODOs
- 21:58 Changeset [33625] by
- A file listing domains with seedurls containing /mi(/) that are …
- 21:48 Changeset [33624] by
- Some cleanup surrounding the now renamed function createSeedURLsFile, …
- 21:04 Changeset [33623] by
- 1. Incorporated Dr Nichols' earlier suggestion of storing page modified …
- 15:42 Changeset [33622] by
- File rename
2019-11-04:
- 20:35 Changeset [33621] by
- Committing jotted down mongodb related instructions from what Dr …
- 14:24 Changeset [33620] by
- Final crawl, done on vagrant VM node6. Crawl site IDs 01407-01462.
- 11:36 Changeset [33619] by
- need to handle the case where a collection file (eg image) gets …
2019-11-01:
- 20:14 Changeset [33618] by
- Adding in the download URL
- 17:13 Changeset [33617] by
- Node5 is now full and here is the finished crawl (up to and including …
2019-10-31:
- 20:05 Changeset [33616] by
- Beginnings of Java class that is to interact with MongoDB. I don't yet …
- 20:03 Changeset [33615] by
- 1. Worked out how to configure log4j to log both to console and …
- 11:22 Changeset [33614] by
- added a new line
- 11:18 Changeset [33613] by
- added allowdocumentediting and allowmapgpsediting options, plus also …
- 11:00 Changeset [33612] by
- work to do with params. add in default values to params if they are …
- 10:55 Changeset [33611] by
- added global setting to params - these are for params that are valid …
- 10:54 Changeset [33610] by
- USER_SESSION_CACHE_ATT moved to GSParams, as it is stored in session …
2019-10-30:
- 23:03 Changeset [33609] by
- The tar files containing the crawled sites data shouldn't be called …
- 23:02 Changeset [33608] by
- 1. New script to export from HBase so that we could in theory reimport …
2019-10-29:
- 18:33 Changeset [33607] by
- Updated with the remaining successfully crawled sites on node4 before …
- 15:18 Changeset [33606] by
- 1. Committing crawl data from node3 (2nd VM for nutch crawling). 2. …
- 14:54 Changeset [33605] by
- Node 4 VM still works, but committing first set of crawled sites on there
2019-10-24:
- 23:22 Changeset [33604] by
- 1. Better output into possible-product-sites.txt including the …
- 22:04 Changeset [33603] by
- Incorporating Dr Nichols' suggestion to help weed out product sites: if …
2019-10-23:
- 23:49 Changeset [33602] by
- 1. The final csv file, mri-sentences.csv, is now written out. 2. Only …
- 23:22 Changeset [33601] by
- Creates the 2nd csv file, with info about webpages. At present stores …
- 23:05 Changeset [33600] by
- Work in progress of writing out CSV files. In future, may write the …
2019-10-22:
- 20:49 Changeset [33599] by
- First one-third of sites crawled. Committing to SVN despite the tarred …
- 20:19 Changeset [33598] by
- More instructions on setting up Nutch now that I've remembered to …
- 20:05 Changeset [33597] by
- Committing active version of template file which has a newline at end …
- 18:44 Changeset [33596] by
- Adding in the nutch-site.xml and regex-urlfilter.GS_TEMPLATE template …
- 14:05 Changeset [33595] by
- new displayBaskets template - to avoid replicating code in query and …
- 14:00 Changeset [33594] by
- call gslib:displayBasket instead of replicating the code here
- 13:59 Changeset [33593] by
- the test for facets should be facetList/facet/count, as the facets get …
- 13:51 Changeset [33592] by
- reindented the file
- 11:51 Changeset [33591] by
- added in some strings for 'this collection contains x documents and …
- 11:12 Changeset [33590] by
- added 'this collection contains X documents and was last built Y days …
2019-10-21:
- 21:45 Changeset [33589] by
- final01. Need Map results still
2019-10-18:
- 23:20 Changeset [33588] by
- Committing the MRI sentence model that I'm actually using, the one in …
- 23:16 Changeset [33587] by
- 1. Better stats reporting on crawled sites: not just if a page was in …
- 22:20 Changeset [33586] by
- Refactored MaoriTextDetector.java class into more general …
- 21:41 Changeset [33585] by
- Much simpler way of using sentence and language detection model to …
- 21:20 Changeset [33584] by
- Committing experimental version 2 using the sentence detector model, …
- 21:20 Changeset [33583] by
- Committing experimental version 1 using the sentence detector model, …
2019-10-17:
- 23:12 Changeset [33582] by
- NutchTextDumpProcessor prints each crawled site's stats: number of …
- 21:53 Changeset [33581] by
- Minor fix. Noticed when looking for work I did on MRI sentence detection
- 21:44 Changeset [33580] by
- Finally fixed the thus-far identified bugs when parsing dump.txt.
- 21:05 Changeset [33579] by
- Debugging. Solved one problem.
- 19:31 Changeset [33578] by
- Corrections for compiling the 2 new classes.
- 19:12 Changeset [33577] by
- Forgot to adjust usage statement to say that silent mode was already …
2019-10-16:
- 23:37 Changeset [33576] by
- Introducing 2 new Java files still being written and untested. …
- 23:36 Changeset [33575] by
- Correcting usage string for CCWETProcessor before committing new java …
- 23:35 Changeset [33574] by
- If nutch stores a crawled site in more than 1 file, then cat all of …
- 21:39 Changeset [33573] by
- Forgot to document that spaces were also allowed as separator in the …
- 21:18 Changeset [33572] by
- Only meant to store the wet.gz versions of these files, not also the …
- 21:11 Changeset [33571] by
- Adding Dr Bainbridge's suggestion of appending the crawlId of each …
- 20:04 Changeset [33570] by
- Need to check if UNFINISHED file actually exists before moving it …
- 20:00 Changeset [33569] by
- 1. batchcrawl.sh now does what it should have from the start, which is …
2019-10-14:
- 23:36 Changeset [33568] by
- 1. More sites greylisted and blacklisted, discovered as I attempted to …
- 22:40 Changeset [33567] by
- batchcrawl.sh now supports -all flag (and prints usage on 0 args). The …
- 22:07 Changeset [33566] by
- batchcrawl.sh script now supports taking a comma or space separated …
- 21:04 Changeset [33565] by
- CCWETProcessor: domain url now goes in as a seedURL after the …
- 21:01 Changeset [33564] by
- batchcrawl.sh now does the crawl and logs output of the crawl, dumps …
2019-10-11:
- 23:29 Changeset [33563] by
- Committing inactive testing batch scripts (only creates the …
- 21:52 Changeset [33562] by
- 1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
- 20:49 Changeset [33561] by
- 1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated …
2019-10-10:
- 23:49 Changeset [33560] by
- 1. Incorporated Dr Bainbridge's suggested improvements: only when …
- 23:44 Changeset [33559] by
- 1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
- 23:41 Changeset [33558] by
- Committing cumulative changes since last commit.
2019-10-09:
- 23:10 Changeset [33557] by
- Implemented the topSitesMap of topsite domain to url pattern in the …
- 18:58 Changeset [33556] by
- Blacklisted wikipedia pages that are actually in other languages which …
- 18:43 Changeset [33555] by
- Modified top sites list as Dr Bainbridge described: suffixes for the …
- 18:11 Changeset [33554] by
- Added more to blacklist and greylist. And removed remaining duplicates …