|
|
@33496
|
5 years |
ak19 |
Minor changes to reading list file
|
|
|
@33495
|
5 years |
ak19 |
Pruned out unused commands, added comments, marked unused variables to …
|
|
|
@33494
|
5 years |
ak19 |
All in one script that takes as parameter a common crawl identifier of …
|
|
|
@33493
|
5 years |
kjdon |
if we are on a cross collection search page, the collection for each …
|
|
|
@33492
|
5 years |
kjdon |
not all ccs pages have a hierarchy element, so just test on s1.collection
|
|
|
@33491
|
5 years |
kjdon |
need to add optional args for doc links into the CCS format links. …
|
|
|
@33490
|
5 years |
kjdon |
changed default partition sizes back to 20, to match what was there …
|
|
|
@33489
|
5 years |
ak19 |
Handy file to not have to keep manually repeating commands when …
|
|
|
@33488
|
5 years |
ak19 |
new function createSeedURLsFiles() in WETProcessor that replaces the …
|
|
|
@33487
|
5 years |
kjdon |
added code to display any error messages
|
|
|
@33486
|
5 years |
kjdon |
reindented the page, added some extra links, and organised the items …
|
|
|
@33485
|
5 years |
kjdon |
removed an erroneous space
|
|
|
@33484
|
5 years |
kjdon |
some changes and additions to the debuginfo page texts
|
|
|
@33483
|
5 years |
kjdon |
added an explicit space after Error:
|
|
|
@33482
|
5 years |
kjdon |
changed standardize_capitalization to …
|
|
|
@33481
|
5 years |
kjdon |
a few more refinements to List strings
|
|
|
@33480
|
5 years |
ak19 |
Much harder to remove pages where words are fused together as some are …
|
|
|
@33479
|
5 years |
kjdon |
changed numeric option order to match letter options
|
|
|
@33478
|
5 years |
kjdon |
some refining of list option descriptions
|
|
|
@33477
|
5 years |
kjdon |
need to call setup_custom_sort to allow for collection's customsorttools.pm
|
|
|
@33476
|
5 years |
kjdon |
enabled having customsorttools in collection's perllib folder. you can …
|
|
|
@33475
|
5 years |
kjdon |
added numeric partition defaults to match partition type
|
|
|
@33474
|
5 years |
kjdon |
it turns out that childtype is not set in all cases, so put in the …
|
|
|
@33473
|
5 years |
kjdon |
still didn't get it quite right…
|
|
|
@33472
|
5 years |
kjdon |
forgot the -> to access member of a hash ref
|
|
|
@33471
|
5 years |
ak19 |
Very minor changes.
|
|
|
@33470
|
5 years |
ak19 |
A new script to reduce keepURLs.txt to unique URLs, 1 from each unique …
|
|
|
@33469
|
5 years |
ak19 |
Don't want URLs with the word product(s) in them (but production …
|
|
|
@33468
|
5 years |
ak19 |
More meaningful to (also) write out the keep vs discard URLs into keep …
|
|
|
@33467
|
5 years |
ak19 |
Improved the code to use a static block to load the needed properties …
|
|
|
@33466
|
5 years |
ak19 |
1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) …
|
|
|
@33465
|
5 years |
ak19 |
Committing first version of the WETProcessor.java which takes a …
|
|
|
@33464
|
5 years |
kjdon |
I committed the last changes by mistake, using the previous revision …
|
|
|
@33463
|
5 years |
kjdon |
fixed up some typos. removed use_hlist_for option. This is very hard …
|
|
|
@33462
|
5 years |
ak19 |
Tested new tomcat.allowLinking property on Windows too now and it …
|
|
|
@33461
|
5 years |
ak19 |
Implementing Diego Spano's suggested changes for tomcat's allowLinking …
|
|
|
@33460
|
5 years |
kjdon |
fixed up some typos. removed use_hlist_for option. This is very hard …
|
|
|
@33459
|
5 years |
kjdon |
small changes to some strings
|
|
|
@33458
|
5 years |
cpb16 |
Running new morphology version after quick meeting with david last …
|
|
|
@33457
|
5 years |
ak19 |
Got stage 1, the WARC to WET conversion, working, after necessary …
|
|
|
@33456
|
5 years |
ak19 |
Link to discussion on how to convert WARC to WET
|
|
|
@33455
|
5 years |
cpb16 |
Started implementing David's suggested morphology sequence, codeversion9
|
|
|
@33454
|
5 years |
kjdon |
updated metadata_selection_mode to be …
|
|
|
@33453
|
5 years |
kjdon |
the new and modified strings for revamped List classifier
|
|
|
@33452
|
5 years |
kjdon |
revamp of list classifier. More precise handling of numeric metadata …
|
|
|
@33451
|
5 years |
kjdon |
added a comment
|
|
|
@33450
|
5 years |
kjdon |
removed some unnecessary comments
|
|
|
@33449
|
5 years |
cpb16 |
terminal version executes correctly. (Didn't include init threshold in …
|
|
|
@33448
|
5 years |
ak19 |
Minor clarification and inclusion of helpful command
|
|
|
@33447
|
5 years |
cpb16 |
starting to implement terminal version of new morphology. need to fix. …
|
|
|
@33446
|
5 years |
ak19 |
1. Committing working version of export_maori_subset.sh which takes …
|
|
|
@33445
|
5 years |
ak19 |
The first working hadoop spark script for processing common crawl …
|
|
|
@33444
|
5 years |
cpb16 |
Have created a preprocess to remove large objects.
…
|
|
|
@33443
|
5 years |
ak19 |
More notes
|
|
|
@33442
|
5 years |
ak19 |
Updated gutil.jar file (with SafeProcess debugging)
|
|
|
@33441
|
5 years |
ak19 |
Adding further notes to do with running the CC-index examples on spark.
|
|
|
@33440
|
5 years |
ak19 |
Split file to move vagrant-spark-hadoop notes into own file.
|
|
|
@33439
|
5 years |
cpb16 |
Have created properties file and accessibility from …
|
|
|
@33438
|
5 years |
ak19 |
Forgot to commit a change made for Georgian.
|
|
|
@33437
|
5 years |
cpb16 |
made progress with morphology. Need to have a better area dimension …
|
|
|
@33436
|
5 years |
ak19 |
3 important changes for 2 separate bugfixes where one bugfix is …
|
|
|
@33435
|
5 years |
ak19 |
Georgian language translations for the language's new glihelp module …
|
|
|
@33434
|
5 years |
ak19 |
Correcting syntax errors in this bash script.
|
|
|
@33433
|
5 years |
ak19 |
New Georgian language translation for perlmodules module of the GS …
|
|
|
@33432
|
5 years |
ak19 |
New Georgian language translation for glidict module of the GS …
|
|
|
@33431
|
5 years |
ak19 |
Corrections of automated processing, noticed when processing Georgian …
|
|
|
@33430
|
5 years |
ak19 |
Undo call to to_utf8() on the query_string argument (arg[q]) to …
|
|
|
@33429
|
5 years |
kjdon |
fixed a bug in get_or_create_shortname where it wasn't storing the new …
|
|
|
@33428
|
5 years |
ak19 |
Working commoncrawl cc-warc-examples' WET wordcount example using …
|
|
|
@33427
|
5 years |
davidb |
Some initial files on how to get going
|
|
|
@33426
|
5 years |
davidb |
Folder for details on how to stand up the HTRC DevEnv locally
|
|
|
@33425
|
5 years |
ak19 |
A few more links now that I got past getting the vagrant VM with spark …
|
|
|
@33424
|
5 years |
ak19 |
Georgian (code ka) language translations for the gs3interface module …
|
|
|
@33423
|
5 years |
ak19 |
Adding in the link to the vagrant VM with Hadoop, Spark for cluster …
|
|
|
@33422
|
5 years |
ak19 |
Some more links.
|
|
|
@33421
|
5 years |
ak19 |
Forgot to fix up svn externals property for the Georgian …
|
|
|
@33420
|
5 years |
ak19 |
Update to svnproperty externals for the Georgian (code: ka) …
|
|
|
@33419
|
5 years |
ak19 |
Last evening, I had found some links about how language-detection is …
|
|
|
@33418
|
5 years |
cpb16 |
made progress with morphology, based on one image, need to refine …
|
|
|
@33417
|
5 years |
ak19 |
Georgian language translations for the coredm for GS2, gsinstaller …
|
|
|
@33416
|
5 years |
ak19 |
DEC collections weren't getting built on 32 bit linux VM after trying …
|
|
|
@33415
|
5 years |
cpb16 |
updated, after unable to commit due to setup.bash being out of date. …
|
|
|
@33414
|
5 years |
ak19 |
Adding important links
|
|
|
@33413
|
5 years |
ak19 |
Splitting the get_commoncrawl_nz_urls.sh script back into 2 scripts, …
|
|
|
@33412
|
5 years |
ak19 |
config command for wgetting a single file
|
|
|
@33411
|
5 years |
ak19 |
Newer version now doesn't mirror sites with wget but gets WET files …
|
|
|
@33410
|
5 years |
ak19 |
Committing some variable name changes before I replace this file with …
|
|
|
@33409
|
5 years |
ak19 |
Forgot to commit 2 files with links and shuffling some links around …
|
|
|
@33408
|
5 years |
ak19 |
Some rough notes. Will move into appropriate file later.
|
|
|
@33407
|
5 years |
ak19 |
gutil.jar was rebuilt yesterday in GS3 after a bugfix. Recommitting …
|
|
|
@33406
|
5 years |
kjdon |
if there is a semicolon after the file name, it ends up in the URL …
|
|
|
@33405
|
5 years |
ak19 |
Even though we're probably not going to use this code after all, will …
|
|
|
@33404
|
5 years |
ak19 |
1. Links to other Java ways of extracting text from web content. 2. …
|
|
|
@33403
|
5 years |
ak19 |
Mistake to do with launchdir in SafeProcess: if the environment for …
|
|
|
@33402
|
5 years |
ak19 |
Beginnings of the Java class to wget sites and process its pages to …
|
|
|
@33401
|
5 years |
ak19 |
MaoriTextDetector.class file now generated inside its package folder …
|
|
|
@33400
|
5 years |
ak19 |
1. Setting up log4j.properties based on the macronizer's basic one …
|
|
|
@33399
|
5 years |
ak19 |
Putting properties files into the conf folder and keeping the lib …
|
|
|
@33398
|
5 years |
ak19 |
Committing the actual package structure and the updated README after …
|
|
|
@33397
|
5 years |
ak19 |
1. Changing package structure and instructions on compiling/running as …
|
|
|