Changeset 33428


Timestamp:
2019-08-19T20:31:23+12:00
Author:
ak19
Message:

Working commoncrawl cc-warc-examples' WET wordcount example using Hadoop. And some more links.

File:
1 edited

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    r33425 r33428  
     1To run Firefox or any other graphical program inside the VM managed by vagrant, ssh with -Y (X11 forwarding) first onto analytics and then from analytics onto the vagrant VM:
     21. ssh analytics -Y
     32. [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh -- -Y
     4or
     5vagrant ssh -- -Y node1
     6(The -- separator tells the vagrant command that the subsequent -Y flag should be passed through to the ssh command that vagrant runs.)
     7
     8Only once you have ssh-ed with vagrant into the VM, whose hostname is "node1", do you have access to node1's assigned IP: 10.211.55.101
     9- Connecting machines, like analytics, must either access node1 itself or rely on port forwarding to view the VM's servers on localhost. For example, on analytics, the Yarn pages can be viewed at http://localhost:8088/
     10- If Firefox is launched inside the VM (that is, inside node1), pages can be reached on their respective ports at any of localhost, 10.211.55.101 or node1.
     11
     12
     13
     14
     15WET example from https://github.com/commoncrawl/cc-warc-examples
     16
     17vagrant@node1:~/cc-warc-examples$ hdfs dfs -mkdir /user/vagrant/data
     18vagrant@node1:~/cc-warc-examples$ hdfs dfs -put data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz hdfs:///user/vagrant/data/.
     19vagrant@node1:~/cc-warc-examples$ hdfs dfs -ls data
     20Found 1 items
     21-rw-r--r--   1 vagrant supergroup  154823265 2019-08-19 08:23 data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz
     22vagrant@node1:~/cc-warc-examples$ hadoop jar target/cc-warc-examples-0.3-SNAPSHOT-jar-with-dependencies.jar org.commoncrawl.examples.mapreduce.WETWordCount
     23
     24<ONCE FINISHED:>
     25
     26vagrant@node1:~/cc-warc-examples$ hdfs dfs -cat /tmp/cc/part*
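
The raw output of the cat command above is long, so ranking it by count can make it easier to read. The helper below is a minimal sketch, not part of cc-warc-examples: the name top_words is hypothetical, and it assumes each output line has the form word<TAB>count (the usual layout of Hadoop text output, which may not match what WETWordCount writes in every version).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the N most frequent words from
# "word<TAB>count" lines read on stdin (N defaults to 10).
# Assumption: the job output is tab-separated, one word per line.
top_words() {
  local n="${1:-10}"
  # Sort numerically on field 2 in descending order; ties are broken
  # by the word itself so the result is deterministic.
  LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1 | head -n "$n"
}

# Usage on node1, once the job has finished:
#   hdfs dfs -cat /tmp/cc/part* | top_words 20
```

Piping through a function keeps the HDFS read and the ranking separate, so the same helper also works on a local copy of the part files.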
     27
     28
     29
     30INFO ON HADOOP/HDFS:
     31https://www.bluedata.com/blog/2016/08/hadoop-hdfs-upgrades-painful/
     32
     33---------------
     34More examples to try:
     35https://github.com/commoncrawl/cc-warc-examples
     36
     37
     38A bit outdated?
     39https://www.journaldev.com/20342/apache-spark-example-word-count-program-java
     40https://www.journaldev.com/20261/apache-spark
     41
     42--------
     43
    1  44 sudo apt-get install maven
    2  45 (or sudo apt update