@33629 | 5 years | kjdon | added methods using Parameter2 - for params with text node values
@33628 | 5 years | kjdon | not sure why documentNode was a gsf:template here. Can't be like that …
@33627 | 5 years | kjdon | removed unnecessary comments
@33626 | 5 years | ak19 | TODOs
@33625 | 5 years | ak19 | A file listing domains with seedurls containing /mi(/) that are …
@33624 | 5 years | ak19 | Some cleanup surrounding the now renamed function createSeedURLsFile, …
@33623 | 5 years | ak19 | 1. Incorporated Dr Nichols' earlier suggestion of storing page modified …
@33622 | 5 years | ak19 | File rename
@33621 | 5 years | ak19 | Committing jotted-down mongodb related instructions from what Dr …
@33620 | 5 years | ak19 | Final crawl, done on vagrant VM node6. Crawl site IDs 01407-01462.
@33619 | 5 years | kjdon | need to handle the case where a collection file (eg image) gets …
@33618 | 5 years | ak19 | Adding in the download URL
@33617 | 5 years | ak19 | Node5 is now full and here is the finished crawl (up to and including …
@33616 | 5 years | ak19 | Beginnings of Java class that is to interact with MongoDB. I don't yet …
@33615 | 5 years | ak19 | 1. Worked out how to configure log4j to log both to console and …
@33614 | 5 years | kjdon | added a new line
@33613 | 5 years | kjdon | added allowdocumentediting and allowmapgpsediting options, plus also …
@33612 | 5 years | kjdon | work to do with params. add in default values to params if they are …
@33611 | 5 years | kjdon | added global setting to params - these are for params that are valid …
@33610 | 5 years | kjdon | USER_SESSION_CACHE_ATT moved to GSParams, as it is stored in session …
@33609 | 5 years | ak19 | The tar files containing the crawled sites data shouldn't be called …
@33608 | 5 years | ak19 | 1. New script to export from HBase so that we could in theory reimport …
@33607 | 5 years | ak19 | Updated with the remaining successfully crawled sites on node4 before …
@33606 | 5 years | ak19 | 1. Committing crawl data from node3 (2nd VM for nutch crawling). 2. …
@33605 | 5 years | ak19 | Node 4 VM still works, but committing first set of crawled sites there
@33604 | 5 years | ak19 | 1. Better output into possible-product-sites.txt including the …
@33603 | 5 years | ak19 | Incorporating Dr Nichols' suggestion to help weed out product sites: if …
@33602 | 5 years | ak19 | 1. The final csv file, mri-sentences.csv, is now written out. 2. Only …
@33601 | 5 years | ak19 | Creates the 2nd csv file, with info about webpages. At present stores …
@33600 | 5 years | ak19 | Work in progress of writing out CSV files. In future, may write the …
@33599 | 5 years | ak19 | First one-third sites crawled. Committing to SVN despite the tarred …
@33598 | 5 years | ak19 | More instructions on setting up Nutch now that I've remembered to …
@33597 | 5 years | ak19 | Committing active version of template file which has a newline at end …
@33596 | 5 years | ak19 | Adding in the nutch-site.xml and regex-urlfilter.GS_TEMPLATE template …
@33595 | 5 years | kjdon | new displayBaskets template - to avoid replicating code in query and …
@33594 | 5 years | kjdon | call gslib:displayBasket instead of replicating the code here
@33593 | 5 years | kjdon | the test for facets should be facetList/facet/count, as the facets get …
@33592 | 5 years | kjdon | reindented the file
@33591 | 5 years | kjdon | added in some strings for 'this collection contains x documents and …
@33590 | 5 years | kjdon | added 'this collection contains X documents and was last built Y days …
@33589 | 5 years | cpb16 | final01. Need Map results still
@33588 | 5 years | ak19 | Committing the MRI sentence model that I'm actually using, the one in …
@33587 | 5 years | ak19 | 1. Better stats reporting on crawled sites: not just if a page was in …
@33586 | 5 years | ak19 | Refactored MaoriTextDetector.java class into more general …
@33585 | 5 years | ak19 | Much simpler way of using sentence and language detection model to …
@33584 | 5 years | ak19 | Committing experimental version 2 using the sentence detector model, …
@33583 | 5 years | ak19 | Committing experimental version 1 using the sentence detector model, …
@33582 | 5 years | ak19 | NutchTextDumpProcessor prints each crawled site's stats: number of …
@33581 | 5 years | ak19 | Minor fix. Noticed when looking for work I did on MRI sentence detection
@33580 | 5 years | ak19 | Finally fixed the thus-far identified bugs when parsing dump.txt.
@33579 | 5 years | ak19 | Debugging. Solved one problem.
@33578 | 5 years | ak19 | Corrections for compiling the 2 new classes.
@33577 | 5 years | ak19 | Forgot to adjust usage statement to say that silent mode was already …
@33576 | 5 years | ak19 | Introducing 2 new Java files still being written and untested. …
@33575 | 5 years | ak19 | Correcting usage string for CCWETProcessor before committing new java …
@33574 | 5 years | ak19 | If nutch stores a crawled site in more than 1 file, then cat all of …
@33573 | 5 years | ak19 | Forgot to document that spaces were also allowed as separator in the …
@33572 | 5 years | ak19 | Only meant to store the wet.gz versions of these files, not also the …
@33571 | 5 years | ak19 | Adding Dr Bainbridge's suggestion of appending the crawlId of each …
@33570 | 5 years | ak19 | Need to check if UNFINISHED file actually exists before moving it …
@33569 | 5 years | ak19 | 1. batchcrawl.sh now does what it should have from the start, which is …
@33568 | 5 years | ak19 | 1. More sites greylisted and blacklisted, discovered as I attempted to …
@33567 | 5 years | ak19 | batchcrawl.sh now supports -all flag (and prints usage on 0 args). The …
@33566 | 5 years | ak19 | batchcrawl.sh script now supports taking a comma or space separated …
@33565 | 5 years | ak19 | CCWETProcessor: domain url now goes in as a seedURL after the …
@33564 | 5 years | ak19 | batchcrawl.sh now does the crawl and logs output of the crawl, dumps …
@33563 | 5 years | ak19 | Committing inactive testing batch scripts (only creates the …
@33562 | 5 years | ak19 | 1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
@33561 | 5 years | ak19 | 1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated …
@33560 | 5 years | ak19 | 1. Incorporated Dr Bainbridge's suggested improvements: only when …
@33559 | 5 years | ak19 | 1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
@33558 | 5 years | ak19 | Committing cumulative changes since last commit.
@33557 | 5 years | ak19 | Implemented the topSitesMap of topsite domain to url pattern in the …
@33556 | 5 years | ak19 | Blacklisted wikipedia pages that are actually in other languages which …
@33555 | 5 years | ak19 | Modified top sites list as Dr Bainbridge described: suffixes for the …
@33554 | 5 years | ak19 | Added more to blacklist and greylist. And removed remaining duplicates …
@33553 | 5 years | ak19 | Comments
@33552 | 5 years | ak19 | 1. Code now processes ccrawldata folder, containing each individual …
@33551 | 5 years | ak19 | Added in top 500 urls from moz.com/top500 and removed duplicates, and …
@33550 | 5 years | ak19 | First stage of introducing sites-too-big-to-exhaustively-crawl.txt: …
@33549 | 5 years | ak19 | All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 …
@33548 | 5 years | davidb | Include new wavesurfer sub-project to install
@33547 | 5 years | davidb | Initial cut at wavesurfer JS audio player version of AMC music content …
@33546 | 5 years | davidb | Initial cut at wave-surfer based JS audio player extension for Greenstone
@33545 | 5 years | ak19 | Mainly changes to crawling-Nutch.txt and some minor changes to other …
@33544 | 5 years | ak19 | 1. Dr Bainbridge had the correct fix for solr dealing with phrase …
@33543 | 5 years | ak19 | Filled in some missing instructions
@33542 | 5 years | kjdon | use_hlist_for option is no longer valid
@33541 | 5 years | ak19 | 1. hdfs-cc-work/GS_README.txt now contains the complete instructions …
@33540 | 5 years | ak19 | Since I wasn't getting further with nutch 2 to grab an entire site, I …
@33539 | 5 years | ak19 | File rename
@33538 | 5 years | ak19 | Some additions to the setup.sh script to query commoncrawl for MRI …
@33537 | 5 years | ak19 | More nutch and general site mirroring related links
@33536 | 5 years | ak19 | Changes required to the commoncrawl related Vagrant github project to …
@33535 | 5 years | ak19 | 1. New setup.sh script for on a hadoop system to setup the git …
@33534 | 5 years | ak19 | Correction: toplevel script has to be placed inside cc-index-table not …
@33533 | 5 years | kjdon | some collections might not have Title or root_Title metadata, so check …
@33532 | 5 years | ak19 | Found the other top 500 sites link again at last which Dr Bainbridge …
@33531 | 5 years | ak19 | Added whitelist for mi.wikipedia.org, and updates to blacklist and …
@33530 | 5 years | ak19 | Completed sentence that was left hanging.