Timestamp: 2019-08-09T20:37:23+12:00
File: 1 edited
gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh
r33393  r33394
   158     158     uniq_urls_file=uniq-tld-nz-urls-`date +%F`.txt
   159     159     echo "Creating file $uniq_urls_file containing just the unique domains and subdomains..."
   160           - cat $outfile | cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
           160   + #cat $outfile | cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
           161   + <$outfile cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
           162   +
           163   + # cat is unnecessary: https://stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column/1915750
           164   + # "You can dump the cat! Rather than piping into <the subsequent process>, just let <the subsequent process> read the file using <. Piping through cat is a common unnecessary complication used by novices. For large amounts of data there's a performance effect to be had."
   161     165
   162     166     # The first cut grabs the url field of the json.
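The change above swaps `cat $outfile |` for an input redirection. The following is a minimal sketch of why the two forms should produce the same result, not part of the changeset itself. The sample index lines and the `extract` helper are hypothetical, made up for illustration; the real `$outfile` is produced earlier in the script and its exact format is assumed here. `sed -r` and `\s` assume GNU sed, as in the original line.

```shell
#!/bin/bash
# Hypothetical sample input (assumed format): urlkey, timestamp, then a JSON
# object whose first member is the url, so the url is the 4th space-separated field.
outfile=$(mktemp)
printf '%s\n' \
  'nz,co,example)/ 20190801000000 {"url": "http://www.example.co.nz/", "status": "200"}' \
  'nz,co,example)/about 20190801000001 {"url": "http://www.example.co.nz/about", "status": "200"}' \
  'nz,ac,sample)/ 20190801000002 {"url": "https://sample.ac.nz", "status": "200"}' \
  > "$outfile"

# Hypothetical helper wrapping the pipeline from the changeset, reading stdin.
extract() {
  cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq
}

# Old form: an extra cat process feeds the pipeline.
with_cat=$(cat "$outfile" | extract)

# New form: the shell opens the file as stdin of the pipeline directly.
with_redirect=$(<"$outfile" extract)

if [ "$with_cat" = "$with_redirect" ]; then
  echo "identical output"
fi
echo "$with_redirect"

rm -f "$outfile"
```

Note that `uniq` only collapses adjacent duplicate lines, so both forms rely on the input already being grouped by domain.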