Changeset 33393 for gs3-extensions


Timestamp: 2019-08-09T18:57:12+12:00
Author: ak19
Message:
Modified get_commoncrawl_nz_urls.sh to also create a reduced URLs file containing just the unique top-level sites.

Location: gs3-extensions/maori-lang-detection
Files: 2 edited

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

(diff r33391 → r33393; the working notes below were removed in r33393)

NEXT PROBLEMS: prefixes to the basic domain should not be counted separately.
e.g. cs.waikato.ac.nz and waikato.ac.nz should count as one?
    https://stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column

It's not enough to cut off http:// and then everything before the first '.', since some URLs won't have a prefix before the domain. How to detect which ones do and don't, and only attempt to remove the prefix from those URLs that have one?
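One possible approach (a rough sketch, not part of this changeset: the suffix list, the function name collapse_nz_domain, and the sample calls are all assumptions; a real solution would consult the Public Suffix List) is to keep three labels when the second-to-last label is a known second-level .nz suffix, and two labels otherwise:

    # Hypothetical helper: collapse a hostname to its registrable .nz domain.
    collapse_nz_domain() {
        echo "$1" | awk -F. '{
            # Assumed (incomplete) list of second-level suffixes under .nz:
            sl = "ac co geek gen govt health iwi maori mil net org school"
            n = NF
            if (n >= 3 && index(" " sl " ", " " $(n-1) " ") > 0)
                print $(n-2) "." $(n-1) "." $n    # cs.waikato.ac.nz -> waikato.ac.nz
            else
                print $(n-1) "." $n               # www.100health.nz -> 100health.nz
        }'
    }

    collapse_nz_domain cs.waikato.ac.nz    # -> waikato.ac.nz
    collapse_nz_domain www.100health.nz    # -> 100health.nz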

* With sed -r (extended regular expressions), the \( escape for ( and the \? escape for the optional-match ? can be left out:

    tikauka:[203]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | uniq | less
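For instance, the same extraction written with -r (a minimal demo, not from the notes; the sample line is invented to look like the field-4 values discussed here):

    echo '"https://www.massey.ac.nz./about",' | sed -r 's@(https?://)(www\.)?([^/]*).*@http://\3@' | sed -r 's@\.$@@; s@^"@@'
    # -> http://massey.ac.nz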

* Also want to get rid of the starting " and the ending ",

FINAL ONE:
    tikauka:[203]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | uniq | less

uniq only detects duplicates that appear on consecutive lines.
So if there are two different lines followed by a third line that duplicates the first, uniq won't detect it.
And this happens in our case because some URLs are http and some are https, some have www. and some don't, and Massey University's domain URL strangely ends with a '.' sometimes, though usually not.
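Sorting first makes all duplicates adjacent, so uniq can then collapse them (a minimal illustration with made-up domains):

    printf 'a.nz\nb.nz\na.nz\n' | uniq           # still 3 lines: the repeat of a.nz isn't adjacent
    printf 'a.nz\nb.nz\na.nz\n' | sort | uniq    # 2 lines: a.nz and b.nz
    printf 'a.nz\nb.nz\na.nz\n' | sort -u        # same result, in one step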

    tikauka:[199]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | uniq | less

    tikauka:[194]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | uniq | less
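The extra sed 's@\.$@@' in the first command matters because of the trailing-dot Massey case noted above; without it, uniq sees two distinct lines (a minimal demo with an invented pair):

    printf 'http://massey.ac.nz.\nhttp://massey.ac.nz\n' | uniq                    # 2 lines: trailing dot differs
    printf 'http://massey.ac.nz.\nhttp://massey.ac.nz\n' | sed 's@\.$@@' | uniq    # collapses to 1 line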

tikauka:[182]/Scratch/anupama/maori-lang-detection>echo "http://100health.nz/ProdList.asp?p=1&ClassID=196", | sed 's@\(https\?://[^/]*\).*@\1@'
http://100health.nz

tikauka:[178]/Scratch/anupama/maori-lang-detection>echo "http://100health.nz/ProdList.asp?p=1&ClassID=196", | sed 's@\(https\?://[^/]*\)@boo@'
boo/ProdList.asp?p=1&ClassID=196,

maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | less
    where
        cut -d ' ' -f4
    gets the 4th field (the URLs), with each field separated by a space instead of cut's default tab delimiter.
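For a record shaped like the lines in nz-only-TLDs-*.txt (the exact layout here is an assumption: space-separated, with the JSON url field fourth), cut -d ' ' -f4 pulls out the quoted URL, trailing comma and all:

    line='com,example)/ 20190807 {"url": "http://example.nz/page",'
    echo "$line" | cut -d ' ' -f4
    # -> "http://example.nz/page",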

(Unchanged lines, the only content left in CommonCrawl.txt at r33393:)
http://webdatacommons.org/
  • gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh

(diff r33390 → r33393)

@@ -154,5 +154,30 @@
 echo ""
 echo "The file $outfile has now been created, containing all the .nz domains for the crawl of $1"
-echo "Remember to delete the products in the tmp folder or the folder itself after inspecting its contents"
+echo ""
+
+uniq_urls_file=uniq-tld-nz-urls-`date +%F`.txt
+echo "Creating file $uniq_urls_file containing just the unique domains and subdomains..."
+cat $outfile | cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
+
+# The first cut grabs the url field of the json.
+# The second cut grabs the domain name from the url (located between the first // and the immediately following /).
+# The first sed then removes any trailing . (e.g. "massey.ac.nz." becomes "massey.ac.nz") followed by ", and any spaces before the end of the line,
+# and the final sed removes any "www." prefix.
+# Then we keep just the uniq urls out of all that.
+
+echo "File $uniq_urls_file containing just the unique .nz sites (domains and subdomains) to be used as seed urls has now been created."
+num_uniq_urls=`wc -l < $uniq_urls_file`
+total_urls=`wc -l < $outfile`
+echo ""
+echo ""
+echo "Summary:"
+echo "There were $num_uniq_urls unique sites"
+echo "out of a total of $total_urls urls in $outfile."
+echo ""
+echo ""
+
+echo ""
+echo "NEXT:"
+echo "Remember to delete the products in the tmp folder or the folder itself after inspecting its contents."
 echo ""
 exit 0
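Note that the new pipeline ends in a bare uniq, which (as the notes above observe) only collapses adjacent duplicates; unless $outfile is already grouped by domain, which the changeset doesn't state, a sort pass is safer. A hedged variant with the same cut/sed stages and sort -u replacing uniq:

    cat $outfile | cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | sort -u > $uniq_urls_file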