To run Firefox (or anything graphical) inside the VM run by vagrant, you have to ssh -Y onto analytics first and then, from analytics, onto the vagrant VM:
1. ssh analytics -Y
2. [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh -- -Y
or
vagrant ssh node1 -- -Y
(the -- separator tells the vagrant command that everything after it, here the -Y flag, should be passed to the ssh cmd that vagrant runs, so the VM name has to come before the --)
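One way to confirm that both -Y hops took effect is to look at DISPLAY inside node1 before launching Firefox, since ssh sets DISPLAY when X11 forwarding succeeds. A minimal sketch; the function name check_x11 is made up for illustration and is not part of vagrant or ssh:

```shell
# Hypothetical helper: reports whether this shell has an X11 display.
# ssh sets DISPLAY on the remote side when -Y forwarding succeeded.
check_x11() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "X11 forwarding looks active (DISPLAY=$DISPLAY)"
    else
        echo "DISPLAY is not set; reconnect with -Y on every hop"
        return 1
    fi
}

# Run inside node1 after: ssh analytics -Y, then vagrant ssh -- -Y
check_x11 || true
```

If the check fails on node1 but passes on analytics, the second hop is most likely the one missing its -Y.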

Only once you have ssh-ed with vagrant into the VM, whose hostname is "node1", do you have access to node1's assigned IP: 10.211.55.101
- Connecting machines, like analytics, must either access node1 or use port forwarding to view the VM's servers on localhost. For example, on analytics, you can view the Yarn pages at http://localhost:8088/
- If Firefox is launched inside the VM (so inside node1), then you can access the pages on their respective ports at any of localhost|10.211.55.101|node1.
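To spell out the three equivalent addresses from inside node1, here is a small sketch. The function name node1_urls is hypothetical; the IP and hostname are the ones from these notes, and 8088 is just the Yarn example port:

```shell
# Hypothetical helper: list the URLs that reach one service port
# when the browser runs inside node1 itself.
node1_urls() {
    port=$1
    for host in localhost 10.211.55.101 node1; do
        echo "http://$host:$port/"
    done
}

# Example: the Yarn pages
node1_urls 8088
```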

WET example from https://github.com/commoncrawl/cc-warc-examples

vagrant@node1:~/cc-warc-examples$ hdfs dfs -mkdir /user/vagrant/data
vagrant@node1:~/cc-warc-examples$ hdfs dfs -put data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz hdfs:///user/vagrant/data/.
vagrant@node1:~/cc-warc-examples$ hdfs dfs -ls data
Found 1 items
-rw-r--r-- 1 vagrant supergroup 154823265 2019-08-19 08:23 data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz
vagrant@node1:~/cc-warc-examples$ hadoop jar target/cc-warc-examples-0.3-SNAPSHOT-jar-with-dependencies.jar org.commoncrawl.examples.mapreduce.WETWordCount

<ONCE FINISHED:>

vagrant@node1:~/cc-warc-examples$ hdfs dfs -cat /tmp/cc/part*

INFO ON HADOOP/HDFS:
https://www.bluedata.com/blog/2016/08/hadoop-hdfs-upgrades-painful/

SPARK:
configure option example: https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions

LIKE '%isl%'

cd cc-index-table
APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
# $SPARK_ON_YARN (commented out; it would go right after spark-submit)
$SPARK_HOME/bin/spark-submit \
    --conf spark.hadoop.parquet.enable.dictionary=true \
    --conf spark.hadoop.parquet.enable.summary-metadata=false \
    --conf spark.sql.hive.metastorePartitionPruning=true \
    --conf spark.sql.parquet.filterPushdown=true \
    --conf spark.sql.parquet.mergeSchema=true \
    --class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR \
    --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
             FROM ccindex
             WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
    --numOutputPartitions 12 \
    --numRecordsPerWarcFile 20000 \
    --warcPrefix ICELANDIC-CC-2018-43 \
    s3://commoncrawl/cc-index/table/cc-main/warc/ \
    .../my_output_path/

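The query only changes in the crawl name and the language pattern (mri in the command, isl in the note above it), so it can be built once and reused. A sketch under that assumption; build_query is a made-up helper name, not something cc-index-table provides:

```shell
# Hypothetical helper: build the ccindex query for one crawl and one
# language code as it appears in content_languages (e.g. mri, isl).
build_query() {
    crawl=$1
    code=$2
    printf '%s\n' "SELECT url, warc_filename, warc_record_offset, warc_record_length FROM ccindex WHERE crawl = '$crawl' AND subset = 'warc' AND content_languages LIKE '%$code%'"
}

QUERY=$(build_query CC-MAIN-2019-30 mri)
echo "$QUERY"
# The result would then be passed to spark-submit as: --query "$QUERY"
```

Note that LIKE takes the pattern directly; there is no = between the column name and LIKE.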
----------------
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

https://stackoverflow.com/questions/39355354/spark-no-filesystem-for-scheme-https-cannot-load-files-from-amazon-s3
https://stackoverflow.com/questions/33356041/technically-what-is-the-difference-between-s3n-s3a-and-s3
"2018-01-10 Update Hadoop 3.0 has cut its s3: and s3n implementations: s3a is all you get. It is now significantly better than its predecessor and performs as least as good as the Amazon implementation. Amazon's "s3:" is still offered by EMR, which is their closed source client. Consult the EMR docs for more info."

1. https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark

"Having experienced first hand the difference between s3a and s3n - 7.9GB of data transferred on s3a was around ~7 minutes while 7.9GB of data on s3n took 73 minutes [us-east-1 to us-west-1 unfortunately in both cases; Redshift and Lambda being us-east-1 at this time] this is a very important piece of the stack to get correct and it's worth the frustration.

Here are the key parts, as of December 2015:

Your Spark cluster will need a Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and maybe missed it, the switch for using something other than 1.0 is to specify --hadoop-major-version 2 (which uses CDH 4.2 as of this writing).

You'll need to include what may at first seem to be an out of date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell using this along with the specific AWS SDK JARs for 1.10.8 hasn't broken anything.

You'll also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.

In spark.properties you probably want some settings that look like this:

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=ACCESSKEY
spark.hadoop.fs.s3a.secret.key=SECRETKEY

I've detailed this list in more detail on a post I wrote as I worked my way through this process. In addition I've covered all the exception cases I hit along the way and what I believe to be the cause of each and how to fix them."

2. The classpath used by hadoop can be found by running the command (https://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath):
hadoop classpath

3. Got hadoop-aws 2.7.6 jar
from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.6
and put it into /home/vagrant

4. https://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath
https://stackoverflow.com/questions/28520821/how-to-add-external-jar-to-hadoop-job/54459211#54459211
vagrant@node1:~$ export LIBJARS=/home/vagrant/hadoop-aws-2.7.6.jar
vagrant@node1:~$ export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
vagrant@node1:~$ hadoop classpath
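The sed substitution in step 4 only matters once LIBJARS holds more than one jar: LIBJARS is comma-separated, while HADOOP_CLASSPATH wants colons. A sketch with two jars; placing the AWS SDK jar in /home/vagrant as well is an assumption made only for this illustration:

```shell
# Comma-separated jar list. The second path is an assumed location,
# added here only to show what the substitution does.
LIBJARS=/home/vagrant/hadoop-aws-2.7.6.jar,/home/vagrant/aws-java-sdk-1.11.616.jar

# HADOOP_CLASSPATH uses colon-separated entries, so swap the commas.
HADOOP_CLASSPATH=$(echo "$LIBJARS" | sed 's/,/:/g')
export LIBJARS HADOOP_CLASSPATH

echo "$HADOOP_CLASSPATH"
# Afterwards, `hadoop classpath` should include both jars.
```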

5. https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760
"Download the aws sdk for java https://aws.amazon.com/sdk-for-java/ Uploaded it to the hadoop directory. You should see the aws-java-sdk-1.10.65.jar in /usr/hdp/2.4.0.0-169/hadoop/"

I got version 1.11

[Can't find a spark.properties file, but this seems to contain spark specific properties:
$SPARK_HOME/conf/spark-defaults.conf

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-properties.html
"The default Spark properties file is $SPARK_HOME/conf/spark-defaults.conf that could be overriden using spark-submit with the --properties-file command-line option."]
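Going by the quote above, the s3a settings from the Stack Overflow answer could be kept in a separate properties file and handed to spark-submit with --properties-file, rather than edited into spark-defaults.conf. A sketch only: the scratch file location is arbitrary, and ACCESSKEY/SECRETKEY are the placeholders from the quoted answer, not real values:

```shell
# Write the quoted s3a settings into a scratch properties file.
# mktemp picks an arbitrary path; any readable file would do.
PROPS=$(mktemp)
cat > "$PROPS" <<'EOF'
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=ACCESSKEY
spark.hadoop.fs.s3a.secret.key=SECRETKEY
EOF

cat "$PROPS"
# It would then be used as:
#   $SPARK_HOME/bin/spark-submit --properties-file "$PROPS" ...
```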

Can SUDO COPY the 2 jar files hadoop-aws-2.7.6.jar and aws-java-sdk-1.11.616.jar to:
/usr/local/hadoop/share/hadoop/common/
(else /usr/local/hadoop/share/hadoop/hdfs/hadoop-aws-2.7.6.jar)

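After copying, it is easy to check that both jars really ended up in the target directory. The helper below is a sketch; check_jars is a hypothetical name, and the jar names and directory are the ones from these notes:

```shell
# Hypothetical helper: report which of the two jars are present in a
# directory; returns non-zero if either one is missing.
check_jars() {
    dir=$1
    missing=0
    for jar in hadoop-aws-2.7.6.jar aws-java-sdk-1.11.616.jar; do
        if [ -f "$dir/$jar" ]; then
            echo "found   $dir/$jar"
        else
            echo "missing $dir/$jar"
            missing=1
        fi
    done
    return $missing
}

check_jars /usr/local/hadoop/share/hadoop/common \
    || echo "copy the missing jars there with sudo cp"
```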
--------
schema
https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/index.html

---------------
More examples to try:
https://github.com/commoncrawl/cc-warc-examples

A bit outdated?
https://www.journaldev.com/20342/apache-spark-example-word-count-program-java
https://www.journaldev.com/20261/apache-spark

--------

sudo apt-get install maven
(or sudo apt update
sudo apt install maven)
git clone https://github.com/commoncrawl/cc-index-table.git
cd cc-index-table
mvn package
vagrant@node1:~/cc-index-table$ ./src/script/convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table

spark:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-shell.html

============
Dr Bainbridge found the following vagrant file that will set up hadoop and spark presumably for cluster computing:

https://github.com/martinprobson/vagrant-hadoop-hive-spark

Vagrant:
* Guide: https://www.vagrantup.com/intro/getting-started/index.html
* Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
* vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
* https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
* https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
sudo apt-get -y install firefox
* vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a

* hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
* https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
---
==> node1: Forwarding ports...
    node1: 8080 (guest) => 8081 (host) (adapter 1)
    node1: 8088 (guest) => 8089 (host) (adapter 1)
    node1: 9083 (guest) => 9084 (host) (adapter 1)
    node1: 4040 (guest) => 4041 (host) (adapter 1)
    node1: 18888 (guest) => 18889 (host) (adapter 1)
    node1: 16010 (guest) => 16011 (host) (adapter 1)
    node1: 22 (guest) => 2200 (host) (adapter 1)
==> node1: Running 'pre-boot' VM customizations...
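The guest => host lines in this output say which localhost port on the host corresponds to each service inside node1. In this particular run, for example, guest port 8088 is listed against host port 8089. A sketch for pulling the mapping out of saved `vagrant up` output; forwarded_ports is a made-up helper name and the sample lines are copied from the log above:

```shell
# Sample lines copied from the `vagrant up` output in these notes.
LOG='    node1: 8080 (guest) => 8081 (host) (adapter 1)
    node1: 8088 (guest) => 8089 (host) (adapter 1)
    node1: 22 (guest) => 2200 (host) (adapter 1)'

# Hypothetical helper: reads vagrant output on stdin and prints one
# "guest PORT -> localhost:PORT" line per forwarded port.
forwarded_ports() {
    awk '$3 == "(guest)" && $4 == "=>" { print "guest " $2 " -> localhost:" $5 }'
}

echo "$LOG" | forwarded_ports
```

Lines that are not port mappings (the ==> status lines, for instance) are skipped because their third and fourth fields do not match.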

==> node1: Checking for guest additions in VM...
    node1: The guest additions on this VM do not match the installed version of
    node1: VirtualBox! In most cases this is fine, but in rare cases it can
    node1: prevent things such as shared folders from working properly. If you see
    node1: shared folder errors, please make sure the guest additions within the
    node1: virtual machine match the version of VirtualBox you have installed on
    node1: your host and reload your VM.
    node1:
    node1: Guest Additions Version: 5.1.38
    node1: VirtualBox Version: 5.2

------------