Diff Rev Age Author Log Message
(edit) @33496   5 years ak19 Minor changes to reading list file
(edit) @33495   5 years ak19 Pruned out unused commands, added comments, marked unused variables to …
(edit) @33494   5 years ak19 All in one script that takes as parameter a common crawl identifier of …
(edit) @33493   5 years kjdon if we are on a cross collection search page, the collection for each …
(edit) @33492   5 years kjdon not all ccs pages have a hierarchy element, so just test on s1.collection
(edit) @33491   5 years kjdon need to add optional args for doc links into the CCS format links. …
(edit) @33490   5 years kjdon changed default partition sizes back to 20, to match what was there …
(edit) @33489   5 years ak19 Handy file to not have to keep manually repeating commands when …
(edit) @33488   5 years ak19 new function createSeedURLsFiles() in WETProcessor that replaces the …
(edit) @33487   5 years kjdon added code to display any error messages
(edit) @33486   5 years kjdon reindented the page, added some extra links, and organised the items …
(edit) @33485   5 years kjdon removed an erroneous space
(edit) @33484   5 years kjdon some changes and additions to the debuginfo page texts
(edit) @33483   5 years kjdon added an explicit space after Error:
(edit) @33482   5 years kjdon changed standardize_capitalization to …
(edit) @33481   5 years kjdon a few more refinements to List strings
(edit) @33480   5 years ak19 Much harder to remove pages where words are fused together as some are …
(edit) @33479   5 years kjdon changed numeric option order to match letter options
(edit) @33478   5 years kjdon some refining of list option descriptions
(edit) @33477   5 years kjdon need to call setup_custom_sort to allow for collection's customsorttools.pm
(edit) @33476   5 years kjdon enabled having customsorttools in collection's perllib folder. you can …
(edit) @33475   5 years kjdon added numeric partition defaults to match partition type
(edit) @33474   5 years kjdon it turns out that childtype is not set in all cases, so put in the …
(edit) @33473   5 years kjdon still didn't get it quite right…
(edit) @33472   5 years kjdon forgot the -> to access member of a hash ref
(edit) @33471   5 years ak19 Very minor changes.
(edit) @33470   5 years ak19 A new script to reduce keepURLs.txt to unique URLs, 1 from each unique …
(edit) @33469   5 years ak19 Don't want URLs with the word product(s) in them (but production …
(edit) @33468   5 years ak19 More meaningful to (also) write out the keep vs discard URLs into keep …
(edit) @33467   5 years ak19 Improved the code to use a static block to load the needed properties …
(edit) @33466   5 years ak19 1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) …
(edit) @33465   5 years ak19 Committing first version of the WETProcessor.java which takes a …
(edit) @33464   5 years kjdon I committed the last changes by mistake, using the previous revision …
(edit) @33463   5 years kjdon fixed up some typos. removed use_hlist_for option. This is very hard …
(edit) @33462   5 years ak19 Tested new tomcat.allowLinking property on Windows too now and it …
(edit) @33461   5 years ak19 Implementing Diego Spano's suggested changes for tomcat's allowLinking …
(edit) @33460   5 years kjdon fixed up some typos. removed use_hlist_for option. This is very hard …
(edit) @33459   5 years kjdon small changes to some strings
(edit) @33458   5 years cpb16 Running new morphology version after quick meeting with david last …
(edit) @33457   5 years ak19 Got stage 1, the WARC to WET conversion, working, after necessary …
(edit) @33456   5 years ak19 Link to discussion on how to convert WARC to WET
(edit) @33455   5 years cpb16 Started implementing David's suggested morphology sequence, codeversion9
(edit) @33454   5 years kjdon updated metadata_selection_mode to be …
(edit) @33453   5 years kjdon the new and modified strings for revamped List classifier
(edit) @33452   5 years kjdon revamp of list classifier. More precise handling of numeric metadata …
(edit) @33451   5 years kjdon added a comment
(edit) @33450   5 years kjdon removed some unnecessary comments
(edit) @33449   5 years cpb16 terminal version executes correctly. (Didn't include init threshold in …
(edit) @33448   5 years ak19 Minor clarification and inclusion of helpful command
(edit) @33447   5 years cpb16 starting to implement terminal version of new morphology. need to fix. …
(edit) @33446   5 years ak19 1. Committing working version of export_maori_subset.sh which takes …
(edit) @33445   5 years ak19 The first working hadoop spark script for processing common crawl …
(edit) @33444   5 years cpb16 Have created a preprocess to remove large objects. …
(edit) @33443   5 years ak19 More notes
(edit) @33442   5 years ak19 Updated gutil.jar file (with SafeProcess debugging)
(edit) @33441   5 years ak19 Adding further notes to do with running the CC-index examples on spark.
(edit) @33440   5 years ak19 Split file to move vagrant-spark-hadoop notes into own file.
(edit) @33439   5 years cpb16 Have created properties file and accessibility from …
(edit) @33438   5 years ak19 Forgot to commit a change made for Georgian.
(edit) @33437   5 years cpb16 made progress with morphology. Need to have a better area dimension …
(edit) @33436   5 years ak19 3 important changes for 2 separate bugfixes where one bugfix is …
(edit) @33435   5 years ak19 Georgian language translations for the language's new glihelp module …
(edit) @33434   5 years ak19 Correcting syntax errors in this bash script.
(edit) @33433   5 years ak19 New Georgian language translation for perlmodules module of the GS …
(edit) @33432   5 years ak19 New Georgian language translation for glidict module of the GS …
(edit) @33431   5 years ak19 Corrections of automated processing, noticed when processing Georgian …
(edit) @33430   5 years ak19 Undo call to to_utf8() on the query_string argument (arg[q]) to …
(edit) @33429   5 years kjdon fixed a bug in get_or_create_shortname where it wasn't storing the new …
(edit) @33428   5 years ak19 Working commoncrawl cc-warc-examples' WET wordcount example using …
(edit) @33427   5 years davidb Some initial files on how to get going
(edit) @33426   5 years davidb Folder to details on how to standup the HTRC DevEnv locally
(edit) @33425   5 years ak19 A few more links now that I got past getting the vagrant VM with spark …
(edit) @33424   5 years ak19 Georgian (code ka) language translations for the gs3interface module …
(edit) @33423   5 years ak19 Adding in the link to the vagrant VM with Hadoop, Spark for cluster …
(edit) @33422   5 years ak19 Some more links.
(edit) @33421   5 years ak19 Forgot to fix up svn externals property for the Georgian …
(edit) @33420   5 years ak19 Update to svnproperty externals for the Georgian (code: ka) …
(edit) @33419   5 years ak19 Last evening, I had found some links about how language-detection is …
(edit) @33418   5 years cpb16 made progress with morphology, based one image, need to refine …
(edit) @33417   5 years ak19 Georgian language translations for the coredm for GS2, gsinstaller …
(edit) @33416   5 years ak19 DEC collections weren't getting built on 32 bit linux VM after trying …
(edit) @33415   5 years cpb16 updated, after unable to commit due to setup.bash being out of date. …
(edit) @33414   5 years ak19 Adding important links
(edit) @33413   5 years ak19 Splitting the get_commoncrawl_nz_urls.sh script back into 2 scripts, …
(edit) @33412   5 years ak19 config command for wgetting a single file
(edit) @33411   5 years ak19 Newer version now doesn't mirror sites with wget but gets WET files …
(edit) @33410   5 years ak19 Committing some variable name changes before I replace this file with …
(edit) @33409   5 years ak19 Forgot to commit 2 files with links and shuffling some links around …
(edit) @33408   5 years ak19 Some rough notes. Will move into appropriate file later.
(edit) @33407   5 years ak19 gutil.jar was rebuilt yesterday in GS3 after a bugfix. Recommitting …
(edit) @33406   5 years kjdon if there is a semicolon after the file name, it ends up in the URL …
(edit) @33405   5 years ak19 Even though we're probably not going to use this code after all, will …
(edit) @33404   5 years ak19 1. Links to other Java ways of extracting text from web content. 2. …
(edit) @33403   5 years ak19 Mistake to do with launchdir in SafeProcess: if the environment for …
(edit) @33402   5 years ak19 Beginnings of the Java class to wget sites and process its pages to …
(edit) @33401   5 years ak19 MaoriTextDetector.class file now generated inside its package folder …
(edit) @33400   5 years ak19 1. Setting up log4j.properties based on the macronizer's basic one …
(edit) @33399   5 years ak19 Putting properties files into the conf folder and keeping the lib …
(edit) @33398   5 years ak19 Committing the actual package structure and the updated README after …
(edit) @33397   5 years ak19 1. Changing package structure and instructions on compiling/running as …