

19:01 Changeset [33457] by ak19
Got stage 1, the WARC to WET conversion, working, after necessary …
17:26 Changeset [33456] by ak19
Link to discussion on how to convert WARC to WET


14:45 Changeset [33455] by cpb16
Started implementing David's suggested morphology sequence, codeversion9


14:41 Changeset [33454] by kjdon
updated metadata_selection_mode to be metadata_selection_mode_within_level …
13:16 Changeset [33453] by kjdon
the new and modified strings for revamped List classifier
13:15 Changeset [33452] by kjdon
revamp of list classifier. More precise handling of numeric metadata …
12:55 Changeset [33451] by kjdon
added a comment
12:54 Changeset [33450] by kjdon
removed some unnecessary comments


17:08 Changeset [33449] by cpb16
Terminal version executes correctly. (Didn't include init threshold in …


18:27 Changeset [33448] by ak19
Minor clarification and inclusion of helpful command
18:03 Changeset [33447] by cpb16
starting to implement terminal version of new morphology. need to fix. …


19:12 Changeset [33446] by ak19
1. Committing working version of export_maori_subset.sh which takes the …
17:01 Changeset [33445] by ak19
The first working hadoop spark script for processing common crawl data. …
16:57 Changeset [33444] by cpb16
//Have created a preprocess to remove large objects. …


20:22 Changeset [33443] by ak19
More notes
19:30 Changeset [33442] by ak19
Updated gutil.jar file (with SafeProcess debugging)
19:30 Changeset [33441] by ak19
Adding further notes to do with running the CC-index examples on spark.
19:17 Changeset [33440] by ak19
Split file to move vagrant-spark-hadoop notes into own file.
17:03 Changeset [33439] by cpb16
Have created properties file and accessibility from …


17:14 Changeset [33438] by ak19
Forgot to commit a change made for Georgian.


16:44 Changeset [33437] by cpb16
made progress with morphology. Need to have a better area dimension …


23:21 Changeset [33436] by ak19
3 important changes for 2 separate bugfixes where one bugfix is …
21:28 Changeset [33435] by ak19
Georgian language translations for the language's new glihelp module of …
21:22 Changeset [33434] by ak19
Correcting syntax errors in this bash script.


20:15 Changeset [33433] by ak19
New Georgian language translation for perlmodules module of the GS …
19:35 Changeset [33432] by ak19
New Georgian language translation for glidict module of the GS interface. …
19:18 Changeset [33431] by ak19
Corrections of automated processing, noticed when processing Georgian …
16:14 Ticket #954 (GTI: Correct Existing Translations form needs fixing and enhancement) created by ak19
GTI's "Correct Existing Translations" form needs 1. fixing: search term …
14:40 Changeset [33430] by ak19
Undo call to to_utf8() on the query_string argument (arg[q]) to prevent …
11:04 Changeset [33429] by kjdon
fixed a bug in get_or_create_shortname where it wasn't storing the new …


20:31 Changeset [33428] by ak19
Working commoncrawl cc-warc-examples' WET wordcount example using Hadoop. …
14:25 Changeset [33427] by davidb
Some initial files on how to get going
14:23 Changeset [33426] by davidb
Folder to detail how to stand up the HTRC DevEnv locally


22:15 Changeset [33425] by ak19
A few more links now that I got past getting the vagrant VM with spark and …
18:19 Changeset [33424] by ak19
Georgian (code ka) language translations for the gs3interface module of …


20:07 Changeset [33423] by ak19
Adding in the link to the vagrant VM with Hadoop, Spark for cluster …
17:52 Changeset [33422] by ak19
Some more links.
16:39 Changeset [33421] by ak19
Forgot to fix up svn externals property for the Georgian solr-jdbm-demo …
16:38 Changeset [33420] by ak19
Update to svnproperty externals for the Georgian (code: ka) gs3-collection …
16:20 Changeset [33419] by ak19
Last evening, I had found some links about how language-detection is done …
13:53 Changeset [33418] by cpb16
made progress with morphology, based on one image, need to refine further, …


19:55 Changeset [33417] by ak19
Georgian language translations for the coredm for GS2, gsinstaller (new) …
17:48 Changeset [33416] by ak19
DEC collections weren't getting built on 32 bit linux VM after trying to …
11:42 Changeset [33415] by cpb16
Updated after being unable to commit due to setup.bash being out of date. Added …


21:57 Changeset [33414] by ak19
Adding important links
21:57 Changeset [33413] by ak19
Splitting the get_commoncrawl_nz_urls.sh script back into 2 scripts, …
21:54 Changeset [33412] by ak19
config command for wgetting a single file
21:50 Changeset [33411] by ak19
Newer version now doesn't mirror sites with wget but gets WET files and …
21:48 Changeset [33410] by ak19
Committing some variable name changes before I replace this file with the …
15:59 Changeset [33409] by ak19
Forgot to commit 2 files with links and shuffling some links around into …
15:09 Changeset [33408] by ak19
Some rough notes. Will move into appropriate file later.
14:40 Changeset [33407] by ak19
gutil.jar was rebuilt yesterday in GS3 after a bugfix. Recommitting for …
12:17 Changeset [33406] by kjdon
if there is a semicolon after the file name, it ends up in the URL that …


20:37 Changeset [33405] by ak19
Even though we're probably not going to use this code after all, will …
20:35 Changeset [33404] by ak19
1. Links to other Java ways of extracting text from web content. 2. …
15:07 Changeset [33403] by ak19
Mistake to do with launchdir in SafeProcess: if the environment for the …


22:03 Changeset [33402] by ak19
Beginnings of the Java class to wget sites and process its pages to detect …
21:16 Changeset [33401] by ak19
MaoriTextDetector.class file now generated inside its package folder (for …
21:15 Changeset [33400] by ak19
1. Setting up log4j.properties based on the macronizer's basic one that I …
20:48 Changeset [33399] by ak19
Putting properties files into the conf folder and keeping the lib folder …
19:35 Changeset [33398] by ak19
Committing the actual package structure and the updated README after …
19:30 Changeset [33397] by ak19
1. Changing package structure and instructions on compiling/running as …
18:20 Changeset [33396] by ak19
Georgian language gs3colcfg module of GS interface. Many thanks to Vano …
18:03 Changeset [33395] by ak19
Georgian language translation work for the gs3interface module of the GS …


20:37 Changeset [33394] by ak19
1. Started a file on feasibility with the data now available and some …
18:57 Changeset [33393] by ak19
Modified the get_commoncrawl_nz_urls.sh to also create a reduced urls file …


15:15 Changeset [33392] by ak19
Kathy found a problem whereby she wanted to run consecutive buildcols …


19:11 Changeset [33391] by ak19
Some rough bash scripting lines that work but aren't complete.
17:31 Changeset [33390] by ak19
Minor message telling the user to wait for a task that takes some time.


13:19 Changeset [33389] by kjdon
store csv field array associated with filename, because you might have 2 …
11:46 Changeset [33388] by kjdon
tidied up some debug statements
11:33 Changeset [33387] by kjdon
removed all my debug statements
11:06 Changeset [33386] by kjdon
modified the test for whether this is the selected node or not. can't just …
