Changeset 33394 for gs3-extensions


Timestamp:
2019-08-09T20:37:23+12:00
Author:
ak19
Message:
  1. Started a file on the feasibility of the task given the data now available, with some links to interesting or useful information.
  2. Minor simplification to the get_commoncrawl_nz_urls.sh script.
  3. Added a config.props file to be used by Java. Can't find a wget configuration setting to limit mirroring of a site to a certain number of pages, but the overall download can be limited by size (--quota or -Q).
Location:
gs3-extensions/maori-lang-detection
Files:
3 added
1 edited
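The commit message notes that wget has no setting to cap a mirror at a certain number of pages, but that total download size can be capped with `--quota` (`-Q`). A minimal sketch of how that flag would be used; the URL and quota value below are placeholders, not taken from the changeset, and the command is echoed rather than run so the sketch has no side effects:

```shell
# Sketch: mirror a site while capping cumulative download size.
# wget offers no "max pages per site" option; --quota (-Q) instead
# stops the download once the byte quota is exhausted.
url="https://example.org/"   # hypothetical site to mirror
quota="50m"                  # hypothetical cap: ~50 MB total

# Print the command that would be run (remove the echo to actually mirror).
echo wget --mirror --quota="$quota" "$url"
```

Note that `--quota` applies to the whole invocation's cumulative download, not per site, which matches the commit message's framing of it as an overall limit.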

  • gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh

    r33393          r33394
    158 158   uniq_urls_file=uniq-tld-nz-urls-`date +%F`.txt
    159 159   echo "Creating file $uniq_urls_file containing just the unique domains and subdomains..."
    160     - cat $outfile | cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
        160 + #cat $outfile | cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
        161 + <$outfile cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
        162 +
        163 + # cat is unnecessary: https://stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column/1915750
        164 + #    "You can dump the cat! Rather than piping into <the subsequent process>, just let <the subsequent process> read the file using <. Piping through cat is a common unnecessary complication used by novices. For large amounts of data there's a performance effect to be had."
    161 165
    162 166   # The first cut grabs the url field of the json.
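The diff above replaces `cat $outfile | cut ...` with input redirection, `<$outfile cut ...`, so the pipeline reads the file directly instead of going through an extra `cat` process. A minimal sketch of the equivalence, using a temporary file with made-up sample data rather than the script's real $outfile:

```shell
# Show that `<file cmd` produces the same result as `cat file | cmd`,
# without spawning the extra cat process.
tmpfile="$(mktemp)"
printf 'www.example.nz\nwww.example.nz\nother.example.nz\n' > "$tmpfile"

# Old style: cat feeds the file into the pipeline.
cat "$tmpfile" | sed 's@^www\.@@' | uniq

# New style: the shell redirects the file straight into sed's stdin.
<"$tmpfile" sed 's@^www\.@@' | uniq

rm -f "$tmpfile"
```

Both pipelines print `example.nz` and `other.example.nz`; the redirected form simply lets the first command read the file itself, which is what the Stack Overflow quote in the diff recommends.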