|
|
@33470
|
5 years |
ak19 |
A new script to reduce keepURLs.txt to unique URLs, 1 from each unique …
|
|
|
@33469
|
5 years |
ak19 |
Don't want URLs with the word product(s) in them (but production …
|
|
|
@33468
|
5 years |
ak19 |
More meaningful to (also) write out the keep vs discard URLs into keep …
|
|
|
@33467
|
5 years |
ak19 |
Improved the code to use a static block to load the needed properties …
|
|
|
@33466
|
5 years |
ak19 |
1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) …
|
|
|
@33465
|
5 years |
ak19 |
Committing first version of the WETProcessor.java which takes a …
|
|
|
@33464
|
5 years |
kjdon |
I committed the last changes by mistake, using the previous revision …
|
|
|
@33463
|
5 years |
kjdon |
fixed up some typos. removed use_hlist_for option. This is very hard …
|
|
|
@33462
|
5 years |
ak19 |
Tested new tomcat.allowLinking property on Windows too now and it …
|
|
|
@33461
|
5 years |
ak19 |
Implementing Diego Spano's suggested changes for tomcat's allowLinking …
|
|
|
@33460
|
5 years |
kjdon |
fixed up some typos. removed use_hlist_for option. This is very hard …
|
|
|
@33459
|
5 years |
kjdon |
small changes to some strings
|
|
|
@33458
|
5 years |
cpb16 |
Running new morphology version after quick meeting with David last …
|
|
|
@33457
|
5 years |
ak19 |
Got stage 1, the WARC to WET conversion, working, after necessary …
|
|
|
@33456
|
5 years |
ak19 |
Link to discussion on how to convert WARC to WET
|
|
|
@33455
|
5 years |
cpb16 |
Started implementing David's suggested morphology sequence, codeversion9
|
|
|
@33454
|
5 years |
kjdon |
updated metadata_selection_mode to be …
|
|
|
@33453
|
5 years |
kjdon |
the new and modified strings for revamped List classifier
|
|
|
@33452
|
5 years |
kjdon |
revamp of list classifier. More precise handling of numeric metadata …
|
|
|
@33451
|
5 years |
kjdon |
added a comment
|
|
|
@33450
|
5 years |
kjdon |
removed some unnecessary comments
|
|
|
@33449
|
5 years |
cpb16 |
terminal version executes correctly. (Didn't include init threshold in …
|
|
|
@33448
|
5 years |
ak19 |
Minor clarification and inclusion of helpful command
|
|
|
@33447
|
5 years |
cpb16 |
starting to implement terminal version of new morphology. need to fix. …
|
|
|
@33446
|
5 years |
ak19 |
1. Committing working version of export_maori_subset.sh which takes …
|
|
|
@33445
|
5 years |
ak19 |
The first working hadoop spark script for processing common crawl …
|
|
|
@33444
|
5 years |
cpb16 |
Have created a preprocess to remove large objects.
…
|
|
|
@33443
|
5 years |
ak19 |
More notes
|
|
|
@33442
|
5 years |
ak19 |
Updated gutil.jar file (with SafeProcess debugging)
|
|
|
@33441
|
5 years |
ak19 |
Adding further notes to do with running the CC-index examples on spark.
|
|
|
@33440
|
5 years |
ak19 |
Split file to move vagrant-spark-hadoop notes into own file.
|
|
|
@33439
|
5 years |
cpb16 |
Have created properties file and accessibility from …
|
|
|
@33438
|
5 years |
ak19 |
Forgot to commit a change made for Georgian.
|
|
|
@33437
|
5 years |
cpb16 |
made progress with morphology. Need to have a better area dimension …
|
|
|
@33436
|
5 years |
ak19 |
3 important changes for 2 separate bugfixes where one bugfix is …
|
|
|
@33435
|
5 years |
ak19 |
Georgian language translations for the language's new glihelp module …
|
|
|
@33434
|
5 years |
ak19 |
Correcting syntax errors in this bash script.
|
|
|
@33433
|
5 years |
ak19 |
New Georgian language translation for perlmodules module of the GS …
|
|
|
@33432
|
5 years |
ak19 |
New Georgian language translation for glidict module of the GS …
|
|
|
@33431
|
5 years |
ak19 |
Corrections of automated processing, noticed when processing Georgian …
|
|
|
@33430
|
5 years |
ak19 |
Undo call to to_utf8() on the query_string argument (arg[q]) to …
|
|
|
@33429
|
5 years |
kjdon |
fixed a bug in get_or_create_shortname where it wasn't storing the new …
|
|
|
@33428
|
5 years |
ak19 |
Working commoncrawl cc-warc-examples' WET wordcount example using …
|
|
|
@33427
|
5 years |
davidb |
Some initial files on how to get going
|
|
|
@33426
|
5 years |
davidb |
Folder for details on how to stand up the HTRC DevEnv locally
|
|
|
@33425
|
5 years |
ak19 |
A few more links now that I got past getting the vagrant VM with spark …
|
|
|
@33424
|
5 years |
ak19 |
Georgian (code ka) language translations for the gs3interface module …
|
|
|
@33423
|
5 years |
ak19 |
Adding in the link to the vagrant VM with Hadoop, Spark for cluster …
|
|
|
@33422
|
5 years |
ak19 |
Some more links.
|
|
|
@33421
|
5 years |
ak19 |
Forgot to fix up svn externals property for the Georgian …
|
|
|
@33420
|
5 years |
ak19 |
Update to svnproperty externals for the Georgian (code: ka) …
|
|
|
@33419
|
5 years |
ak19 |
Last evening, I had found some links about how language-detection is …
|
|
|
@33418
|
5 years |
cpb16 |
made progress with morphology, based on one image, need to refine …
|
|
|
@33417
|
5 years |
ak19 |
Georgian language translations for the coredm for GS2, gsinstaller …
|
|
|
@33416
|
5 years |
ak19 |
DEC collections weren't getting built on 32 bit linux VM after trying …
|
|
|
@33415
|
5 years |
cpb16 |
updated, after unable to commit due to setup.bash being out of date. …
|
|
|
@33414
|
5 years |
ak19 |
Adding important links
|
|
|
@33413
|
5 years |
ak19 |
Splitting the get_commoncrawl_nz_urls.sh script back into 2 scripts, …
|
|
|
@33412
|
5 years |
ak19 |
config command for wgetting a single file
|
|
|
@33411
|
5 years |
ak19 |
Newer version now doesn't mirror sites with wget but gets WET files …
|
|
|
@33410
|
5 years |
ak19 |
Committing some variable name changes before I replace this file with …
|
|
|
@33409
|
5 years |
ak19 |
Forgot to commit 2 files with links and shuffling some links around …
|
|
|
@33408
|
5 years |
ak19 |
Some rough notes. Will move into appropriate file later.
|
|
|
@33407
|
5 years |
ak19 |
gutil.jar was rebuilt yesterday in GS3 after a bugfix. Recommitting …
|
|
|
@33406
|
5 years |
kjdon |
if there is a semicolon after the file name, it ends up in the URL …
|
|
|
@33405
|
5 years |
ak19 |
Even though we're probably not going to use this code after all, will …
|
|
|
@33404
|
5 years |
ak19 |
1. Links to other Java ways of extracting text from web content. 2. …
|
|
|
@33403
|
5 years |
ak19 |
Mistake to do with launchdir in SafeProcess: if the environment for …
|
|
|
@33402
|
5 years |
ak19 |
Beginnings of the Java class to wget sites and process its pages to …
|
|
|
@33401
|
5 years |
ak19 |
MaoriTextDetector.class file now generated inside its package folder …
|
|
|
@33400
|
5 years |
ak19 |
1. Setting up log4j.properties based on the macronizer's basic one …
|
|
|
@33399
|
5 years |
ak19 |
Putting properties files into the conf folder and keeping the lib …
|
|
|
@33398
|
5 years |
ak19 |
Committing the actual package structure and the updated README after …
|
|
|
@33397
|
5 years |
ak19 |
1. Changing package structure and instructions on compiling/running as …
|
|
|
@33396
|
5 years |
ak19 |
Georgian language gs3colcfg module of GS interface. Many thanks to …
|
|
|
@33395
|
5 years |
ak19 |
Georgian language translation work for the gs3interface module of the …
|
|
|
@33394
|
5 years |
ak19 |
1. Started a file on feasibility with the data now available and some …
|
|
|
@33393
|
5 years |
ak19 |
Modified the get_commoncrawl_nz_urls.sh to also create a reduced urls …
|
|
|
@33392
|
5 years |
ak19 |
Kathy found a problem whereby she wanted to run consecutive buildcols …
|
|
|
@33391
|
5 years |
ak19 |
Some rough bash scripting lines that work but aren't complete.
|
|
|
@33390
|
5 years |
ak19 |
Minor message telling the user to wait for a task that takes some time.
|
|
|
@33389
|
5 years |
kjdon |
store csv field array associated with filename, because you might have …
|
|
|
@33388
|
5 years |
kjdon |
tidied up some debug statements
|
|
|
@33387
|
5 years |
kjdon |
removed all my debug statements
|
|
|
@33386
|
5 years |
kjdon |
modified the test for whether this is the selected node or not. can't …
|
|
|
@33385
|
5 years |
kjdon |
need to import response node as it is not part of same document
|
|
|
@33384
|
5 years |
cpb16 |
backup before intellij working
|
|
|
@33383
|
5 years |
kjdon |
some more work on the help page
|
|
|
@33382
|
5 years |
kjdon |
don't add collection/collname to pref and help link if collname is empty
|
|
|
@33381
|
5 years |
kjdon |
use nice /page/gsdl url for about greenstone page
|
|
|
@33380
|
5 years |
kjdon |
some more mods and strings for collection help page
|
|
|
@33379
|
5 years |
ak19 |
New script to automate getting a file listing of the common crawl URL …
|
|
|
@33378
|
5 years |
ak19 |
New bin/script folder and relocating gen_SentenceDetection_model.sh to …
|
|
|
@33377
|
5 years |
ak19 |
Changes to get gen_SentenceDetection_model.sh to run still from the …
|
|
|
@33376
|
5 years |
ak19 |
Links and extracts I've read so far on the Web Curator Tool (WCT), …
|
|
|
@33375
|
5 years |
cpb16 |
Full backup after running first successful highres classifier run
|
|
|
@33374
|
5 years |
davidb |
added in opt-doc-args-link variable otherwise the transform fails with …
|
|
|
@33373
|
5 years |
kjdon |
need to check for null result from getTextString - otherwise get a …
|
|
|
@33372
|
5 years |
kjdon |
when writing out facets in buildConfig, need to get them from …
|
|
|
@33371
|
5 years |
kjdon |
separate sort and facet fields as the former needs to be single valued …
|
|
|