Changeset 33391 for gs3-extensions

Timestamp:

2019-08-07T19:11:12+12:00 (5 years ago)

Author:

ak19

Message:

Some rough bash scripting lines that work but aren't complete.

File:

: 1 edited

gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified) (1 diff)

gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

-              r33376
+              r33391
+NEXT PROBLEMS: subdomain prefixes on a base domain should not be counted separately.
+e.g. should cs.waikato.ac.nz
+and waikato.ac.nz count as one?
+    https://stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column
+It's not enough to cut off http:// and then everything before the first ".", since some URLs have no prefix before the domain. How can we detect which ones do, and strip the prefix only from those?
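One possible way to approach the question above is to keep a fixed number of trailing labels, depending on whether the label before .nz is a known second-level name. This is only a sketch: the function name base_domain and the SECOND_LEVEL list are made up for illustration, the list is deliberately incomplete, and the input is assumed to be bare hostnames (with http:// already stripped).

```shell
# ASSUMPTION: illustrative, incomplete list of .nz second-level names.
SECOND_LEVEL='ac co govt org net school geek gen iwi maori mil cri'

# Reads one hostname per line, prints the base domain for each.
base_domain() {
    awk -v sl="$SECOND_LEVEL" '
    BEGIN { n = split(sl, names, " "); for (i = 1; i <= n; i++) is_sl[names[i]] = 1 }
    {
        n = split($0, label, ".")
        # Keep 3 labels for names like waikato.ac.nz, 2 for names like 100health.nz.
        keep = (n >= 3 && (label[n-1] in is_sl)) ? 3 : 2
        if (keep > n) keep = n
        out = label[n]
        for (i = n - 1; i > n - keep; i--) out = label[i] "." out
        print out
    }'
}

printf '%s\n' cs.waikato.ac.nz waikato.ac.nz 100health.nz | base_domain | sort -u
```

With this heuristic cs.waikato.ac.nz and waikato.ac.nz both reduce to waikato.ac.nz, while a name registered directly under .nz is left at two labels. A hostname whose second-level name is missing from the list would be cut too short, so the list would need checking against the real data.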
+* With sed -r (extended regexes) the backslashes can be left out: ( instead of \( and ? instead of \?. The command below still uses the backslashed form:
+    tikauka:[203]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | uniq | less
+* Also want to get rid of the starting " and the ending ",
+FINAL ONE:
+    tikauka:[203]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | uniq | less
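The chain of sed calls above could probably be collapsed into a single invocation. A minimal sketch, assuming a sed that accepts -E (extended regexes, the more portable spelling of GNU's -r); the function name normalise_url and the sample inputs are made up for illustration. Unlike the chain above, it strips the leading " and trailing ", first, so the trailing-dot rule also applies to lines that end in .",

```shell
# Reads one quoted URL field per line, prints http://<host> without www. or a trailing dot.
normalise_url() {
    sed -E \
        -e 's@^"@@' \
        -e 's@",$@@' \
        -e 's@https?://(www\.)?([^/]*).*@http://\2@' \
        -e 's@\.$@@'
}

# Hypothetical sample fields, in the quoted form that cut -d ' ' -f4 yields.
printf '%s\n' \
    '"http://100health.nz/ProdList.asp?p=1&ClassID=196",' \
    '"https://www.massey.ac.nz./study",' \
    | normalise_url
```

With -E the groups are written ( ) and the optional marker is a plain ?, so only the literal dots still need a backslash.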
+uniq only detects duplicates that are on consecutive lines.
+So if there are 2 different lines followed by a 3rd line that's a duplicate of the first, uniq won't detect it.
+This happens in our case because some URLs are http and some https, and some have www and some don't. Also, Massey University's domain URL strangely ends with a "." sometimes, though usually not.
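The adjacency limitation can be seen, and worked around, with a few lines. This is a sketch with made-up sample lines; the helper names dedupe_sorted and dedupe_keep_order are hypothetical.

```shell
# Two ways to drop duplicates that are not on consecutive lines.
dedupe_sorted()     { sort -u; }             # sorts first, so output order changes
dedupe_keep_order() { awk '!seen[$0]++'; }   # keeps the first occurrence, preserves order

sample() { printf '%s\n' http://massey.ac.nz http://waikato.ac.nz http://massey.ac.nz; }

sample | uniq               # still prints all 3 lines: the duplicates are not adjacent
sample | dedupe_sorted      # prints 2 lines
sample | dedupe_keep_order  # prints 2 lines, in original order
```

Sorting before uniq (or using sort -u) is the usual choice when order doesn't matter; the awk form is an alternative when the original order should be kept.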
+    tikauka:[199]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | uniq | less
+    tikauka:[194]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | uniq | less
+tikauka:[182]/Scratch/anupama/maori-lang-detection>echo "http://100health.nz/ProdList.asp?p=1&ClassID=196", | sed 's@\(https\?://[^/]*\).*@\1@'
+http://100health.nz
+tikauka:[178]/Scratch/anupama/maori-lang-detection>echo "http://100health.nz/ProdList.asp?p=1&ClassID=196", | sed 's@\(https\?://[^/]*\)@boo@'
+boo/ProdList.asp?p=1&ClassID=196,
+maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | less
+    where
+        cut -d ' ' -f4
+    gets the 4th field (the urls) where each field is separated by a space instead of the default tab delimiter.
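A small illustration of the cut step described above. The record below is a made-up stand-in, since the real layout of nz-only-TLDs-2019-08-07.txt is not shown here; only the position and quoting of the 4th field are taken from the notes. The helper name fourth_field is hypothetical.

```shell
# Hypothetical record: f1, f2, f3 and f5 are placeholders for the unknown fields.
line='f1 f2 f3 "http://100health.nz/ProdList.asp?p=1&ClassID=196", f5'

# -d ' ' switches the delimiter from tab to a single space; -f4 selects field 4.
fourth_field() { cut -d ' ' -f4; }

printf '%s\n' "$line" | fourth_field
```

Note that cut treats every single space as a delimiter, so two spaces in a row produce an empty field between them; this sketch assumes the fields are separated by exactly one space.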
 http://webdatacommons.org/
