source: gs3-extensions

Diff Rev Age Author Log Message
(edit) @34373   4 months davidb The result of running gen-heatmap.js
(edit) @34372   4 months davidb NodeJS code to generate a JSON heatmap to be used with WaveSurferJS
(edit) @34371   4 months davidb Top-level scripting and checks so CLI is ready to operate with the …
(edit) @34370   4 months davidb WaveSurfer-JS source files and top-up player
(edit) @34369   4 months davidb Adding in NodeJS to compilation sequence, so wavesurfer-js can be …
(edit) @34368   4 months davidb No longer needed
(edit) @34367   4 months davidb Now supports https URLs as well
(edit) @34362   5 months davidb First rough cut at some notes
(edit) @34361   5 months davidb Collating of python essentia custom scripts and essentia perl plugin …
(edit) @34360   5 months davidb Collating of python essentia custom scripts and essentia perl plugin code
(edit) @34359   5 months davidb Needs to be updated to be brought back into line with setup.bash
(edit) @34358   5 months davidb Changed to be a Greenstone3 extension
(edit) @34356   5 months davidb Some initial work computing essentia audio features when the …
(edit) @34355   5 months davidb Scripts for processing audio files and extracting audio features for ML
(edit) @34354   5 months davidb Script to checkout/clone essentia from its GitHub repository
(edit) @34353   5 months davidb Useful in combo with a python2 to create a virtualenv python2 under …
(edit) @34349   5 months davidb Used to stand up a version of python where extra pip packages have …
(edit) @34348   5 months davidb Adding in Essentia source code to go along with compile scripts
(edit) @34347   5 months davidb Adding in Essentia compile scripts
(edit) @34346   5 months davidb Further dir that needs to be installed as a header file area
(edit) @34345   5 months davidb Already done in setup.bash
(edit) @34344   5 months davidb Extended to now setup/install Eigen3
(edit) @34343   5 months davidb Tweak to sourcing file
(edit) @34342   5 months davidb Added block to set GSDLOS
(edit) @34341   5 months davidb Shift to using cascade-make
(edit) @34340   5 months davidb Added in cascade-make as an external property
(edit) @34339   5 months davidb Some initial files to compile up essentia, used in the Mars extension …
(edit) @34166   8 months ak19 Adding Italian language translations of the gs3colcfg module. Many …
(edit) @33997   11 months davidb Top-level folder for MARS related Greenstone3 code
(edit) @33736   14 months kjdon fixed a spelling mistake
(edit) @33635   15 months ak19 Maori-language-detection doesn't use Greenstone 3 at present, it's not …
(edit) @33634   15 months ak19 Rewrote NutchTextDumpProcessor as NutchTextDumpToMongoDB.java, which …
(edit) @33633   15 months ak19 1. TextLanguageDetector now has methods for collecting all sentences …
(edit) @33626   15 months ak19 TODOs
(edit) @33625   15 months ak19 A file listing domains with seedurls containing /mi(/) that are …
(edit) @33624   15 months ak19 Some cleanup surrounding the now renamed function createSeedURLsFile, …
(edit) @33623   15 months ak19 1. Incorporated Dr Nichols' earlier suggestion of storing page modified …
(edit) @33622   15 months ak19 File rename
(edit) @33621   15 months ak19 Committing jotted down mongodb related instructions from what Dr …
(edit) @33620   15 months ak19 Final crawl, done on vagrant VM node6. Crawl site IDs 01407-01462.
(edit) @33618   15 months ak19 Adding in the download URL
(edit) @33617   15 months ak19 Node5 is now full and here is the finished crawl (up to and including …
(edit) @33616   15 months ak19 Beginnings of Java class that is to interact with MongoDB. I don't yet …
(edit) @33615   15 months ak19 1. Worked out how to configure log4j to log both to console and …
(edit) @33609   15 months ak19 The tar files containing the crawled sites data shouldn't be called …
(edit) @33608   15 months ak19 1. New script to export from HBase so that we could in theory reimport …
(edit) @33607   15 months ak19 Updated with the remaining successfully crawled sites on node4 before …
(edit) @33606   15 months ak19 1. Committing crawl data from node3 (2nd VM for nutch crawling). 2. …
(edit) @33605   15 months ak19 Node 4 VM still works, but committing first set of crawled sites on there
(edit) @33604   15 months ak19 1. Better output into possible-product-sites.txt including the …
(edit) @33603   15 months ak19 Incorporating Dr Nichols' suggestion to help weed out product sites: if …
(edit) @33602   15 months ak19 1. The final csv file, mri-sentences.csv, is now written out. 2. Only …
(edit) @33601   15 months ak19 Creates the 2nd csv file, with info about webpages. At present stores …
(edit) @33600   15 months ak19 Work in progress of writing out CSV files. In future, may write the …
(edit) @33599   15 months ak19 First one-third sites crawled. Committing to SVN despite the tarred …
(edit) @33598   15 months ak19 More instructions on setting up Nutch now that I've remembered to …
(edit) @33597   15 months ak19 Committing active version of template file which has a newline at end …
(edit) @33596   15 months ak19 Adding in the nutch-site.xml and regex-urlfilter.GS_TEMPLATE template …
(edit) @33588   15 months ak19 Committing the MRI sentence model that I'm actually using, the one in …
(edit) @33587   15 months ak19 1. Better stats reporting on crawled sites: not just if a page was in …
(edit) @33586   15 months ak19 Refactored MaoriTextDetector.java class into more general …
(edit) @33585   15 months ak19 Much simpler way of using sentence and language detection model to …
(edit) @33584   15 months ak19 Committing experimental version 2 using the sentence detector model, …
(edit) @33583   15 months ak19 Committing experimental version 1 using the sentence detector model, …
(edit) @33582   16 months ak19 NutchTextDumpProcessor prints each crawled site's stats: number of …
(edit) @33581   16 months ak19 Minor fix. Noticed when looking for work I did on MRI sentence detection
(edit) @33580   16 months ak19 Finally fixed the thus-far identified bugs when parsing dump.txt.
(edit) @33579   16 months ak19 Debugging. Solved one problem.
(edit) @33578   16 months ak19 Corrections for compiling the 2 new classes.
(edit) @33577   16 months ak19 Forgot to adjust usage statement to say that silent mode was already …
(edit) @33576   16 months ak19 Introducing 2 new Java files still being written and untested. …
(edit) @33575   16 months ak19 Correcting usage string for CCWETProcessor before committing new java …
(edit) @33574   16 months ak19 If nutch stores a crawled site in more than 1 file, then cat all of …
(edit) @33573   16 months ak19 Forgot to document that spaces were also allowed as separator in the …
(edit) @33572   16 months ak19 Only meant to store the wet.gz versions of these files, not also the …
(edit) @33571   16 months ak19 Adding Dr Bainbridge's suggestion of appending the crawlId of each …
(edit) @33570   16 months ak19 Need to check if UNFINISHED file actually exists before moving it …
(edit) @33569   16 months ak19 1. batchcrawl.sh now does what it should have from the start, which is …
(edit) @33568   16 months ak19 1. More sites greylisted and blacklisted, discovered as I attempted to …
(edit) @33567   16 months ak19 batchcrawl.sh now supports -all flag (and prints usage on 0 args). The …
(edit) @33566   16 months ak19 batchcrawl.sh script now supports taking a comma or space separated …
(edit) @33565   16 months ak19 CCWETProcessor: domain url now goes in as a seedURL after the …
(edit) @33564   16 months ak19 batchcrawl.sh now does the crawl and logs output of the crawl, dumps …
(edit) @33563   16 months ak19 Committing inactive testing batch scripts (only creates the …
(edit) @33562   16 months ak19 1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
(edit) @33561   16 months ak19 1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated …
(edit) @33560   16 months ak19 1. Incorporated Dr Bainbridge's suggested improvements: only when …
(edit) @33559   16 months ak19 1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
(edit) @33558   16 months ak19 Committing cumulative changes since last commit.
(edit) @33557   16 months ak19 Implemented the topSitesMap of topsite domain to url pattern in the …
(edit) @33556   16 months ak19 Blacklisted wikipedia pages that are actually in other languages which …
(edit) @33555   16 months ak19 Modified top sites list as Dr Bainbridge described: suffixes for the …
(edit) @33554   16 months ak19 Added more to blacklist and greylist. And removed remaining duplicates …
(edit) @33553   16 months ak19 Comments
(edit) @33552   16 months ak19 1. Code now processes ccrawldata folder, containing each individual …
(edit) @33551   16 months ak19 Added in top 500 urls from moz.com/top500 and removed duplicates, and …
(edit) @33550   16 months ak19 First stage of introducing sites-too-big-to-exhaustively-crawl.txt: …
(edit) @33549   16 months ak19 All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 …
(edit) @33548   16 months davidb Include new wavesurfer sub-project to install
(edit) @33546   16 months davidb Initial cut at wave-surfer based JS audio player extension for Greenstone