source: gs3-extensions

	Rev	Age	Author	Log Message
(edit)	@34373	4 years	davidb	The result of running gen-heatmap.js
(edit)	@34372	4 years	davidb	NodeJS code to generate a JSON heatmap to be used with WaveSurferJS
(edit)	@34371	4 years	davidb	Top-level scripting and checks so CLI is ready to operate with the …
(edit)	@34370	4 years	davidb	WaveSurfer-JS source files and top-up player
(edit)	@34369	4 years	davidb	Adding in NodeJS to compilation sequence, so wavesurfer-js can be …
(edit)	@34368	4 years	davidb	No longer needed
(edit)	@34367	4 years	davidb	Now supports https URLs as well
(edit)	@34362	4 years	davidb	First rough cut at some notes
(edit)	@34361	4 years	davidb	Collating of python essentia custom scripts and essentia perl plugin …
(edit)	@34360	4 years	davidb	Collating of python essentia custom scripts and essentia perl plugin code
(edit)	@34359	4 years	davidb	Needs to be updated to be brought back into line with setup.bash
(edit)	@34358	4 years	davidb	Changed to be a Greenstone3 extension
(edit)	@34356	4 years	davidb	Some initial work computing essentia audio features when the …
(edit)	@34355	4 years	davidb	Scripts for processing audio files and extracting audio features for ML
(edit)	@34354	4 years	davidb	Script to checkout/clone essentia from its GitHub repository
(edit)	@34353	4 years	davidb	Useful in combo with a python2 to create a virtualenv python2 under …
(edit)	@34349	4 years	davidb	Used to stand up a version of python where extra pip packages have …
(edit)	@34348	4 years	davidb	Adding in Essentia source code to go along with compile scripts
(edit)	@34347	4 years	davidb	Adding in Essentia compile scripts
(edit)	@34346	4 years	davidb	Further dir that needs to be installed as a header file area
(edit)	@34345	4 years	davidb	Already done in setup.bash
(edit)	@34344	4 years	davidb	Extended to now setup/install Eigen3
(edit)	@34343	4 years	davidb	Tweak to sourcing file
(edit)	@34342	4 years	davidb	Added block to set GSDLOS
(edit)	@34341	4 years	davidb	Shift to using cascade-make
(edit)	@34340	4 years	davidb	Added in cascade-make as an external property
(edit)	@34339	4 years	davidb	Some initial files to compile up essentia, used in the Mars extension …
(edit)	@34166	4 years	ak19	Adding Italian language translations of the gs3colcfg module. Many …
(edit)	@33997	4 years	davidb	Top-level folder for MARS related Greenstone3 code
(edit)	@33736	4 years	kjdon	fixed a spelling mistake
(edit)	@33635	4 years	ak19	Maori-language-detection doesn't use Greenstone 3 at present, it's not …
(edit)	@33634	4 years	ak19	Rewrote NutchTextDumpProcessor as NutchTextDumpToMongoDB.java, which …
(edit)	@33633	4 years	ak19	1. TextLanguageDetector now has methods for collecting all sentences …
(edit)	@33626	4 years	ak19	TODOs
(edit)	@33625	4 years	ak19	A file listing domains with seedurls containing /mi(/) that are …
(edit)	@33624	4 years	ak19	Some cleanup surrounding the now renamed function createSeedURLsFile, …
(edit)	@33623	4 years	ak19	1. Incorporated Dr Nichols' earlier suggestion of storing page modified …
(edit)	@33622	4 years	ak19	File rename
(edit)	@33621	4 years	ak19	Committing jotted down mongodb related instructions from what Dr …
(edit)	@33620	4 years	ak19	Final crawl, done on vagrant VM node6. Crawl site IDs 01407-01462.
(edit)	@33618	4 years	ak19	Adding in the download URL
(edit)	@33617	4 years	ak19	Node5 is now full and here is the finished crawl (up to and including …
(edit)	@33616	4 years	ak19	Beginnings of Java class that is to interact with MongoDB. I don't yet …
(edit)	@33615	4 years	ak19	1. Worked out how to configure log4j to log both to console and …
(edit)	@33609	4 years	ak19	The tar files containing the crawled sites data shouldn't be called …
(edit)	@33608	4 years	ak19	1. New script to export from HBase so that we could in theory reimport …
(edit)	@33607	4 years	ak19	Updated with the remaining successfully crawled sites on node4 before …
(edit)	@33606	4 years	ak19	1. Committing crawl data from node3 (2nd VM for nutch crawling). 2. …
(edit)	@33605	4 years	ak19	Node 4 VM still works, but committing first set of crawled sites on there
(edit)	@33604	5 years	ak19	1. Better output into possible-product-sites.txt including the …
(edit)	@33603	5 years	ak19	Incorporating Dr Nichols' suggestion to help weed out product sites: if …
(edit)	@33602	5 years	ak19	1. The final csv file, mri-sentences.csv, is now written out. 2. Only …
(edit)	@33601	5 years	ak19	Creates the 2nd csv file, with info about webpages. At present stores …
(edit)	@33600	5 years	ak19	Work in progress of writing out CSV files. In future, may write the …
(edit)	@33599	5 years	ak19	First one-third sites crawled. Committing to SVN despite the tarred …
(edit)	@33598	5 years	ak19	More instructions on setting up Nutch now that I've remembered to …
(edit)	@33597	5 years	ak19	Committing active version of template file which has a newline at end …
(edit)	@33596	5 years	ak19	Adding in the nutch-site.xml and regex-urlfilter.GS_TEMPLATE template …
(edit)	@33588	5 years	ak19	Committing the MRI sentence model that I'm actually using, the one in …
(edit)	@33587	5 years	ak19	1. Better stats reporting on crawled sites: not just if a page was in …
(edit)	@33586	5 years	ak19	Refactored MaoriTextDetector.java class into more general …
(edit)	@33585	5 years	ak19	Much simpler way of using sentence and language detection model to …
(edit)	@33584	5 years	ak19	Committing experimental version 2 using the sentence detector model, …
(edit)	@33583	5 years	ak19	Committing experimental version 1 using the sentence detector model, …
(edit)	@33582	5 years	ak19	NutchTextDumpProcessor prints each crawled site's stats: number of …
(edit)	@33581	5 years	ak19	Minor fix. Noticed when looking for work I did on MRI sentence detection
(edit)	@33580	5 years	ak19	Finally fixed the thus-far identified bugs when parsing dump.txt.
(edit)	@33579	5 years	ak19	Debugging. Solved one problem.
(edit)	@33578	5 years	ak19	Corrections for compiling the 2 new classes.
(edit)	@33577	5 years	ak19	Forgot to adjust usage statement to say that silent mode was already …
(edit)	@33576	5 years	ak19	Introducing 2 new Java files still being written and untested. …
(edit)	@33575	5 years	ak19	Correcting usage string for CCWETProcessor before committing new java …
(edit)	@33574	5 years	ak19	If nutch stores a crawled site in more than 1 file, then cat all of …
(edit)	@33573	5 years	ak19	Forgot to document that spaces were also allowed as separator in the …
(edit)	@33572	5 years	ak19	Only meant to store the wet.gz versions of these files, not also the …
(edit)	@33571	5 years	ak19	Adding Dr Bainbridge's suggestion of appending the crawlId of each …
(edit)	@33570	5 years	ak19	Need to check if UNFINISHED file actually exists before moving it …
(edit)	@33569	5 years	ak19	1. batchcrawl.sh now does what it should have from the start, which is …
(edit)	@33568	5 years	ak19	1. More sites greylisted and blacklisted, discovered as I attempted to …
(edit)	@33567	5 years	ak19	batchcrawl.sh now supports -all flag (and prints usage on 0 args). The …
(edit)	@33566	5 years	ak19	batchcrawl.sh script now supports taking a comma or space separated …
(edit)	@33565	5 years	ak19	CCWETProcessor: domain url now goes in as a seedURL after the …
(edit)	@33564	5 years	ak19	batchcrawl.sh now does the crawl and logs output of the crawl, dumps …
(edit)	@33563	5 years	ak19	Committing inactive testing batch scripts (only creates the …
(edit)	@33562	5 years	ak19	1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
(edit)	@33561	5 years	ak19	1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated …
(edit)	@33560	5 years	ak19	1. Incorporated Dr Bainbridge's suggested improvements: only when …
(edit)	@33559	5 years	ak19	1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
(edit)	@33558	5 years	ak19	Committing cumulative changes since last commit.
(edit)	@33557	5 years	ak19	Implemented the topSitesMap of topsite domain to url pattern in the …
(edit)	@33556	5 years	ak19	Blacklisted wikipedia pages that are actually in other languages which …
(edit)	@33555	5 years	ak19	Modified top sites list as Dr Bainbridge described: suffixes for the …
(edit)	@33554	5 years	ak19	Added more to blacklist and greylist. And removed remaining duplicates …
(edit)	@33553	5 years	ak19	Comments
(edit)	@33552	5 years	ak19	1. Code now processes ccrawldata folder, containing each individual …
(edit)	@33551	5 years	ak19	Added in top 500 urls from moz.com/top500 and removed duplicates, and …
(edit)	@33550	5 years	ak19	First stage of introducing sites-too-big-to-exhaustively-crawl.txt: …
(edit)	@33549	5 years	ak19	All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 …
(edit)	@33548	5 years	davidb	Include new wavesurfer sub-project to install
(edit)	@33546	5 years	davidb	Initial cut at wave-surfer based JS audio player extension for Greenstone
