Changeset 33394

Timestamp:
09.08.2019 20:37:23
Author:
ak19
Message:

1. Started a file on feasibility with the data now available, plus some links with interesting or useful information. 2. Minor simplification to the get_commoncrawl_nz_urls.sh script. 3. Added a config.props file to be used by Java. Couldn't find a wget configuration setting to limit mirroring of a site to a certain number of pages, but the overall download can be capped by size (--quota or -Q).
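
The size cap mentioned in point 3 can be set either on the command line or in a wgetrc file. A minimal sketch follows; the 100m value and the target URL are illustrative, not from this changeset:

```
# ~/.wgetrc — cap the total amount wget will download in one run.
# wget has no setting to stop a recursive mirror after N pages,
# but it will stop retrieving new files once this quota is exceeded.
quota = 100m

# Equivalent command-line form:
#   wget -Q 100m -r http://example.org/
#   wget --quota=100m -r http://example.org/
```

Note that the quota is checked between files, so a single file larger than the quota will still be downloaded in full.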

Location:
gs3-extensions/maori-lang-detection
Files:
3 added
1 modified

  • gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh

    r33393 → r33394

     158 158  uniq_urls_file=uniq-tld-nz-urls-`date +%F`.txt
     159 159  echo "Creating file $uniq_urls_file containing just the unique domains and subdomains..."
     160     -cat $outfile | cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
         160 +#cat $outfile | cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
         161 +<$outfile cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
         162 +
         163 +# cat is unnecessary: https://stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column/1915750
         164 +#    "You can dump the cat! Rather than piping into <the subsequent process>, just let <the subsequent process> read the file using <. Piping through cat is a common unnecessary complication used by novices. For large amounts of data there's a performance effect to be had."
     161 165
     162 166  # The first cut grabs the url field of the json.
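
The "useless use of cat" change above can be sketched on throwaway data; the file name and its contents here are illustrative only, not taken from the script:

```shell
#!/bin/sh
# Demonstrates that `cat file | cmd` and `cmd < file` feed a pipeline
# the same data — the redirection form just skips the extra cat process.
printf 'b\nb\na\na\n' > sample.txt

with_cat=$(cat sample.txt | uniq)   # pipes through an extra cat process
without_cat=$(uniq < sample.txt)    # uniq reads the file directly

[ "$with_cat" = "$without_cat" ] && echo "outputs identical"
rm -f sample.txt
```

For large inputs the redirection form avoids copying every byte through an extra process and pipe, which is the performance point the quoted Stack Overflow answer is making.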