| @33524 | 5 years | ak19 | 1. Further adjustments to documenting what we did to get things to run … |
| @33523 | 5 years | ak19 | Instructional comment |
| @33522 | 5 years | ak19 | Some comments and an improvement |
| @33519 | 5 years | ak19 | Code still writes out the global seedURLs.txt and regex-urlfilter.txt … |
| @33518 | 5 years | ak19 | Intermediate commit: got the seed urls file temporarily written out as … |
| @33517 | 5 years | ak19 | 1. Blacklists were introduced so that too many instances of camelcased … |
| @33516 | 5 years | ak19 | Before I accidentally lose it, committing the script Dr Bainbridge … |
| @33515 | 5 years | ak19 | Removed an unused function |
| @33514 | 5 years | ak19 | Committing README on starting off with the vagrant VM for hadoop-spark … |
| @33513 | 5 years | ak19 | Higher level script that runs against each named crawl since Sep 2018 … |
| @33503 | 5 years | ak19 | More efficient blacklisting/greylisting/whitelisting now by reading in … |
| @33502 | 5 years | ak19 | Current url pattern blacklist and greylist filter files. Used by … |
| @33501 | 5 years | ak19 | Refactored code into 2 classes: The existing WETProcessor, which … |
| @33499 | 5 years | ak19 | Explicitly adding in IAM policy configuration details instead of just … |
| @33498 | 5 years | ak19 | Corrections to script. Modified the tests checking for file/dir … |
| @33497 | 5 years | ak19 | First version of discard url filter file. Inefficient implementation. … |
| @33496 | 5 years | ak19 | Minor changes to reading list file |
| @33495 | 5 years | ak19 | Pruned out unused commands, added comments, marked unused variables to … |
| @33494 | 5 years | ak19 | All in one script that takes as parameter a common crawl identifier of … |
| @33489 | 5 years | ak19 | Handy file to not have to keep manually repeating commands when … |
| @33488 | 5 years | ak19 | new function createSeedURLsFiles() in WETProcessor that replaces the … |
| @33480 | 5 years | ak19 | Much harder to remove pages where words are fused together as some are … |
| @33471 | 5 years | ak19 | Very minor changes. |
| @33470 | 5 years | ak19 | A new script to reduce keepURLs.txt to unique URLs, 1 from each unique … |
| @33469 | 5 years | ak19 | Don't want URLs with the word product(s) in them (but production … |
| @33468 | 5 years | ak19 | More meaningful to (also) write out the keep vs discard URLs into keep … |
| @33467 | 5 years | ak19 | Improved the code to use a static block to load the needed properties … |
| @33466 | 5 years | ak19 | 1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) … |
| @33465 | 5 years | ak19 | Committing first version of the WETProcessor.java which takes a … |
| @33457 | 5 years | ak19 | Got stage 1, the WARC to WET conversion, working, after necessary … |
| @33456 | 5 years | ak19 | Link to discussion on how to convert WARC to WET |
| @33448 | 5 years | ak19 | Minor clarification and inclusion of helpful command |
| @33446 | 5 years | ak19 | 1. Committing working version of export_maori_subset.sh which takes … |
| @33445 | 5 years | ak19 | The first working hadoop spark script for processing common crawl … |
| @33443 | 5 years | ak19 | More notes |
| @33442 | 5 years | ak19 | Updated gutil.jar file (with SafeProcess debugging) |
| @33441 | 5 years | ak19 | Adding further notes to do with running the CC-index examples on spark. |
| @33440 | 5 years | ak19 | Split file to move vagrant-spark-hadoop notes into own file. |
| @33428 | 5 years | ak19 | Working commoncrawl cc-warc-examples' WET wordcount example using … |
| @33425 | 5 years | ak19 | A few more links now that I got past getting the vagrant VM with spark … |
| @33423 | 5 years | ak19 | Adding in the link to the vagrant VM with Hadoop, Spark for cluster … |
| @33422 | 5 years | ak19 | Some more links. |
| @33419 | 5 years | ak19 | Last evening, I had found some links about how language-detection is … |
| @33414 | 5 years | ak19 | Adding important links |
| @33413 | 5 years | ak19 | Splitting the get_commoncrawl_nz_urls.sh script back into 2 scripts, … |
| @33412 | 5 years | ak19 | config command for wgetting a single file |
| @33411 | 5 years | ak19 | Newer version now doesn't mirror sites with wget but gets WET files … |
| @33410 | 5 years | ak19 | Committing some variable name changes before I replace this file with … |
| @33409 | 5 years | ak19 | Forgot to commit 2 files with links and shuffling some links around … |
| @33408 | 5 years | ak19 | Some rough notes. Will move into appropriate file later. |
| @33407 | 5 years | ak19 | gutil.jar was rebuilt yesterday in GS3 after a bugfix. Recommitting … |
| @33405 | 5 years | ak19 | Even though we're probably not going to use this code after all, will … |
| @33404 | 5 years | ak19 | 1. Links to other Java ways of extracting text from web content. 2. … |
| @33402 | 5 years | ak19 | Beginnings of the Java class to wget sites and process its pages to … |
| @33401 | 5 years | ak19 | MaoriTextDetector.class file now generated inside its package folder … |
| @33400 | 5 years | ak19 | 1. Setting up log4j.properties based on the macronizer's basic one … |
| @33399 | 5 years | ak19 | Putting properties files into the conf folder and keeping the lib … |
| @33398 | 5 years | ak19 | Committing the actual package structure and the updated README after … |
| @33397 | 5 years | ak19 | 1. Changing package structure and instructions on compiling/running as … |
| @33396 | 5 years | ak19 | Georgian language gs3colcfg module of GS interface. Many thanks to … |
| @33394 | 5 years | ak19 | 1. Started a file on feasibility with the data now available and some … |
| @33393 | 5 years | ak19 | Modified the get_commoncrawl_nz_urls.sh to also create a reduced urls … |
| @33392 | 5 years | ak19 | Kathy found a problem whereby she wanted to run consecutive buildcols … |
| @33391 | 5 years | ak19 | Some rough bash scripting lines that work but aren't complete. |
| @33390 | 5 years | ak19 | Minor message telling the user to wait for a task that takes some time. |
| @33388 | 5 years | kjdon | tidied up some debug statements |
| @33379 | 5 years | ak19 | New script to automate getting a file listing of the common crawl URL … |
| @33378 | 5 years | ak19 | New bin/script folder and relocating gen_SentenceDetection_model.sh to … |
| @33377 | 5 years | ak19 | Changes to get gen_SentenceDetection_model.sh to run still from the … |
| @33376 | 5 years | ak19 | Links and extracts I've read so far on the Web Curator Tool (WCT), … |
| @33372 | 5 years | kjdon | when writing out facets in buildConfig, need to get them from … |
| @33371 | 5 years | kjdon | separate sort and facet fields as the former needs to be single valued … |
| @33370 | 5 years | kjdon | use the new get_or_create_shortname instead of create_shortname |
| @33368 | 5 years | kjdon | sort fields cannot be multivalued. Facet fields need to be. So have … |
| @33359 | 5 years | davidb | solr needs to add shortnames to the fieldnamemap otherwise it won't … |
| @33358 | 5 years | ak19 | More minor changes to README |
| @33357 | 5 years | ak19 | Minor changes |
| @33356 | 5 years | ak19 | Updating script. Correction to a filepath different in the svn folder … |
| @33355 | 5 years | ak19 | Changes for adding in the new gen_SentenceDetection_model.sh script, … |
| @33350 | 5 years | ak19 | Better comments. Tested macronised vs unmacronised Māori language test … |
| @33339 | 5 years | ak19 | Updated README. |
| @33338 | 5 years | ak19 | 1. After renaming the java class, changed all occurrences of the old … |
| @33337 | 5 years | ak19 | Renaming the class to MaoriTextDetector, since it doesn't detect audio … |
| @33336 | 5 years | ak19 | Major rewrite to make this class more useful to callers. … |
| @33335 | 5 years | ak19 | First java file for Māori language detection using openNLP with the … |
| @33330 | 5 years | ak19 | Also rebuilt the solr demo collection with the changes to (solrbuilder … |
| @33327 | 5 years | ak19 | In order to get map coordinate metadata stored correctly in solr, … |
| @33315 | 5 years | ak19 | 1. Bugfix to issue discovered on windows: when the GS3 server isn't … |
| @33307 | 5 years | kjdon | updating solr.war to include my latest changes. TODO: does this war … |
| @33306 | 5 years | kjdon | we need to use (the new) level_ids list to determine which cores we … |
| @33065 | 5 years | ak19 | 3 new Georgian language files added, 2 of which automatically … |
| @32891 | 5 years | davidb | Additional error checking |
| @32890 | 5 years | davidb | No longer use the OAIConfig file |
| @32889 | 5 years | davidb | Some adjustments after testing |
| @32888 | 5 years | davidb | Also want to check and untar cantaloupe in this PREPARE file |
| @32886 | 5 years | davidb | Copy refactoring |
| @32885 | 5 years | davidb | Now in main Greenstone resources/iiif area |
| @32884 | 5 years | davidb | Edit to make more generic |
| @32883 | 5 years | davidb | Code tidy up |
| @32878 | 5 years | davidb | Changed to specify 'sites' as the path_prefix area within Greenstone, … |