Timestamp: 2019-09-13T17:44:41+12:00
Author: ak19
Message:

Improved the code to use a static block that loads the needed properties from config.properties and initialises some static final ints from them. The code now uses the logger for debugging. Added new properties to config.properties. Returned to using a counter, recordCount, that is re-zeroed for each WETProcessor: the count is only used to make filenames unique, and filename prefixes are already unique for each warc.wet file. Combining these prefixes with a per-file recordCount therefore assigns a unique filename to each WET record written out to a file. (A running total of all WET records across the processed warc.wet files is no longer needed to ensure unique filenames.) Everything still appears to work as in the previous commit, creating the discard and keep subfolders.
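
The two ideas in this message (a static block that reads config.properties into static final ints, and a per-processor counter combined with a per-file prefix) can be sketched roughly as below. This is only an illustration under stated assumptions, not the committed WETProcessor code: the class name WetNamingSketch, the property key min.content.length, its default of 100, and the filename pattern are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical sketch: names, property key and filename pattern are
// illustrative assumptions, not taken from the real WETProcessor.
class WetNamingSketch {
    // Assumed setting, initialised once when the class is loaded.
    static final int MIN_CONTENT_LENGTH;

    static {
        Properties props = new Properties();
        // config.properties is looked up on the classpath (e.g. a conf folder).
        try (InputStream in = WetNamingSketch.class.getClassLoader()
                .getResourceAsStream("config.properties")) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            // Could not read the file: the default below is used instead.
        }
        MIN_CONTENT_LENGTH = parseIntOrDefault(
                props.getProperty("min.content.length"), 100);
    }

    // Parses an int property value, falling back when it is missing or malformed.
    static int parseIntOrDefault(String value, int fallback) {
        if (value == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    private final String filePrefix; // assumed unique per warc.wet file
    private int recordCount = 0;     // starts at zero for every processor

    WetNamingSketch(String filePrefix) {
        this.filePrefix = filePrefix;
    }

    // Unique prefix + per-file count gives a unique name without a global total.
    String nextRecordFileName() {
        recordCount++;
        return filePrefix + "-" + recordCount + ".txt";
    }
}
```

Because each instance carries its own prefix, two processors can both emit a record numbered 1 without their output filenames colliding.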

File:
1 edited

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    r33457 r33467
     Sebastian
     
    +====================
    +wharariki:[239]/Scratch/ak19/gs3-extensions/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.WETProcessor ../tmp/processWET /Scratch/ak19/gs3-extensions/maori-lang-detection/tmp/processedWET
    +
    +wharariki:[188]/Scratch/ak19/gs3-extensions/maori-lang-detection/tmp/processedWET>ls keep | wc
    +   4090    4090   65440
    +wharariki:[189]/Scratch/ak19/gs3-extensions/maori-lang-detection/tmp/processedWET>ls discard | wc
    +   1515    1515   24240
    +
    +We keep 4090 WET records and are discarding 1515.
    +
     =======================
     Latest version of the index's schema: