Changeset 33457
- Timestamp: 2019-09-05T19:01:36+12:00
- Location: gs3-extensions/maori-lang-detection/MoreReading
- Files: 2 edited
gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (r33456 → r33457)

https://gist.github.com/Smerity/afe7430fdb4371015466
https://github.com/commoncrawl/commoncrawl/issues/11
WARC TO WET:
https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

Sebastian Nagel
05/07/2017
…
Best,
Sebastian

=======================
Latest version of the index's schema:
…
tikauka:[142]/Scratch/anupama/maori-lang-detection>gunzip CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz

vagrant@node1:~/ia-hadoop-tools$ hdfs dfs -get /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz /home/vagrant/.

--------------------------------------------
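The gunzip step above can be rehearsed end-to-end on a synthetic stand-in file (the filename and contents below are made up for illustration; a real .warc.wet.gz first comes off the cluster via hdfs dfs -get as shown):

```shell
# Make a tiny stand-in for a .warc.wet.gz, test its integrity, decompress,
# and peek at the first line. sample.warc.wet.gz is a placeholder filename.
printf 'WARC/1.0\nWARC-Type: conversion\n' | gzip > sample.warc.wet.gz
gzip -t sample.warc.wet.gz && echo "archive intact"   # a corrupt file fails here
gunzip -f sample.warc.wet.gz                          # leaves sample.warc.wet
head -n 1 sample.warc.wet
```

The gzip -t check is worth running before gunzip, since (as seen later in these notes) a misconfigured job can emit .gz output that gunzip cannot process.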
gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (r33456 → r33457)

- Connecting machines, like analytics, must access node1 or use port forwarding to view the VM's servers on localhost. For example, on analytics, Yarn pages can be viewed at http://localhost:8088/
- If firefox is launched inside the VM (so inside node1), then pages can be accessed off their respective ports at any of localhost|10.211.55.101|node1.

===========================================
WARC TO WET
===========================================
https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

Sebastian Nagel
05/07/2017
Hi,

unfortunately, we do not provide (yet) text extracts of the CC-NEWS archives.

But it's easy to run the WET extractor on the WARC files, see:
https://groups.google.com/d/topic/common-crawl/imv4hlLob4s/discussion
https://groups.google.com/d/topic/common-crawl/b6yDG7EmnhM/discussion

That's what you have to do:

# download the WARC files and place them in a directory "warc/"
# create sibling folders wat and wet
# |
# |-- warc/
# |   |-- CC-NEWS-20161001224340-00008.warc.gz
# |   |-- CC-NEWS-20161017145313-00000.warc.gz
# |   `-- ...
# |
# |-- wat/
# |
# `-- wet/

git clone https://github.com/commoncrawl/ia-web-commons
cd ia-web-commons
mvn install

cd ..
git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package

java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
  -strictMode -skipExisting batch-id-xyz .../warc/*.warc.gz

The folders wat/ and wet/ will then contain the exports.

Best,
Sebastian

---

1. Following the above instructions, I first made a warc subfolder in hdfs:///user/vagrant/cc-mri-subset/,
then moved all the downloaded *.warc.gz files into it,
then created wat and wet subfolders alongside the warc folder.

2. Next, I ran the two git clone and mvn compile operations above.
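Step 1's warc/wat/wet layout can be sketched with the local-filesystem analogue below ("crawl-data" is a made-up stand-in path; on the cluster the same shape was created under hdfs:///user/vagrant/cc-mri-subset/ using hdfs dfs -mkdir -p and hdfs dfs -put instead):

```shell
# Local sketch of the warc/wat/wet sibling-folder layout described above.
# "crawl-data" is a placeholder; on HDFS use `hdfs dfs -mkdir -p` instead.
mkdir -p crawl-data/warc crawl-data/wat crawl-data/wet
# downloaded WARCs go into warc/ (placeholder filename from Sebastian's example):
touch crawl-data/warc/CC-NEWS-20161001224340-00008.warc.gz
ls crawl-data
```

The WEATGenerator job only needs wat/ and wet/ to exist alongside warc/; it fills them with the extracted files.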
The first, ia-web-commons, compiled successfully (despite some test failures).

3. BUT THE 2ND GIT PROJECT, ia-hadoop-tools, DIDN'T COMPILE AT FIRST, with the mvn package command failing:

git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package

The compile failed with a message about the JSONTokener constructor not taking a String object. It turned out that the version of the JSONTokener class being used was too old, whereas the necessary constructor is present in the most recent version, as seen in the API at https://docs.oracle.com/cd/E51273_03/ToolsAndFrameworks.110/apidoc/assembler/com/endeca/serialization/json/JSONTokener.html

So instead, I opened ia-hadoop-tools/pom.xml for editing and added the newest version of the org.json package's json artifact (see http://builds.archive.org/maven2/org/json/json/ for <version>) into the pom.xml's <dependencies> element, based on how this was done at https://bukkit.org/threads/problem-loading-libraries-with-maven-java-lang-noclassdeffounderror-org-json-jsonobject.428354/:

<dependency>
  <groupId>org.json</groupId>
  <artifactId>json</artifactId>
  <version>20131018</version>
</dependency>

Then I was able to run "mvn package" successfully.
(Maybe I could also have used a far more recent version, as seen in the version numbers at https://mvnrepository.com/artifact/org.json/json,
but I didn't want to go too far ahead in case there was other incompatibility.)

4. Next, I wanted to finally run the built executable to convert the warc files to wet files.

I had the warc files on the hadoop filesystem. The original instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to, however, were apparently for working with warcs stored on the local filesystem, as those instructions ran the regular java command rather than a hadoop command.
The regular java command did not work with the files being on the hadoop filesystem (attempt #1 below).

ATTEMPTS THAT DIDN'T WORK:
1. vagrant@node1:~/ia-hadoop-tools$ java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
2. vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz

The 2nd attempt, which uses a proper hadoop command, was based off https://www.tutorialspoint.com/map_reduce/implementation_in_hadoop.htm
It produced lots of errors, and the output wet (and wat) .gz files were all corrupt, as gunzip could not run successfully over them:

vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
19/09/05 05:57:22 INFO Configuration.deprecation: mapred.task.timeout is deprecated. Instead, use mapreduce.task.timeout
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100139-000000.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100141-000001.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100451-000002.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100453-000003.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100805-000004.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100809-000005.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101119-000006.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101119-000007.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101429-000008.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101429-000009.warc.gz
19/09/05 05:57:23 INFO client.RMProxy: Connecting to ResourceManager at node1/127.0.0.1:8032
19/09/05 05:57:23 INFO client.RMProxy: Connecting to ResourceManager at node1/127.0.0.1:8032
19/09/05 05:57:23 INFO mapred.FileInputFormat: Total input paths to process : 10
19/09/05 05:57:24 INFO mapreduce.JobSubmitter: number of splits:10
19/09/05 05:57:24 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1567397114047_0001
19/09/05 05:57:24 INFO impl.YarnClientImpl: Submitted application application_1567397114047_0001
19/09/05 05:57:24 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1567397114047_0001/
19/09/05 05:57:24 INFO mapreduce.Job: Running job: job_1567397114047_0001
19/09/05 05:57:31 INFO mapreduce.Job: Job job_1567397114047_0001 running in uber mode : false
19/09/05 05:57:31 INFO mapreduce.Job: map 0% reduce 0%
19/09/05 05:57:44 INFO mapreduce.Job: map 10% reduce 0%
19/09/05 05:57:44 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000002_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000004_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000001_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000005_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000000_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000003_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:46 INFO mapreduce.Job: map 0% reduce 0%
19/09/05 05:57:54 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000007_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:55 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000006_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:57 INFO mapreduce.Job: map 10% reduce 0%
19/09/05 05:57:57 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000009_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

19/09/05 05:57:58 INFO mapreduce.Job: map 20% reduce 0%
19/09/05 05:57:58 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000008_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:58:06 INFO mapreduce.Job: map 30% reduce 0%
19/09/05 05:58:08 INFO mapreduce.Job: map 60% reduce 0%
19/09/05 05:58:09 INFO mapreduce.Job: map 70% reduce 0%
19/09/05 05:58:10 INFO mapreduce.Job: map 80% reduce 0%
19/09/05 05:58:12 INFO mapreduce.Job: map 90% reduce 0%
19/09/05 05:58:13 INFO mapreduce.Job: map 100% reduce 0%
19/09/05 05:58:13 INFO mapreduce.Job: Job job_1567397114047_0001 completed successfully
19/09/05 05:58:13 INFO mapreduce.Job: Counters: 32
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=1239360
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1430
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=30
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
        Job Counters
                Failed map tasks=10
                Launched map tasks=20
                Other local map tasks=10
                Data-local map tasks=10
                Total time spent by all maps in occupied slots (ms)=208160
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=208160
                Total vcore-milliseconds taken by all map tasks=208160
                Total megabyte-milliseconds taken by all map tasks=213155840
        Map-Reduce Framework
                Map input records=10
                Map output records=0
                Input split bytes=1430
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=1461
                CPU time spent (ms)=2490
                Physical memory (bytes) snapshot=1564528640
                Virtual memory (bytes) snapshot=19642507264
                Total committed heap usage (bytes)=1126170624
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0
vagrant@node1:~/ia-hadoop-tools$

5. The error messages are all the same but not very informative:
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000001_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;

All the references I could find on Google indicated that the full version of the error message was that this method (com.google.common.io.ByteStreams.limit(...)) could not be located.
The page at http://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%[email protected]%3E
revealed that guava.jar contains the com.google.common.io.ByteStreams class.

TO GET THE EXECUTABLE TO WORK:
I located guava.jar, found there were 2 identical copies on the filesystem but that neither was on the hadoop classpath yet, so I copied one into a Hadoop classpath location. Then I was able to run the executable successfully and produce meaningful WET files at last from the WARC input files:

vagrant@node1:~$ locate guava.jar
/usr/share/java/guava.jar
/usr/share/maven/lib/guava.jar
vagrant@node1:~$ jar -tvf /usr/share/maven/lib/guava.jar | less
vagrant@node1:~$ jar -tvf /usr/share/java/guava.jar | less
# both contained the ByteStreams class

vagrant@node1:~$ cd -
/home/vagrant/ia-hadoop-tools
vagrant@node1:~/ia-hadoop-tools$ find . -name "guava.jar"
# None in the git project

vagrant@node1:~/ia-hadoop-tools$ hadoop classpath
/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
# guava.jar not on hadoop classpath yet

vagrant@node1:~/ia-hadoop-tools$ diff /usr/share/java/guava.jar /usr/share/maven/lib/guava.jar
# no differences, identical

vagrant@node1:~/ia-hadoop-tools$ hdfs dfs -put /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
put: `/usr/local/hadoop/share/hadoop/common/.': No such file or directory
# hadoop classpath locations are not on the hdfs filesystem

vagrant@node1:~/ia-hadoop-tools$ sudo cp /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
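The copy can be double-checked with a small loop over the two locations involved (the paths are the ones from the session above; on a machine without Hadoop installed the loop simply finds nothing and reports only that the check ran):

```shell
# Sketch: confirm guava.jar is now visible in a hadoop classpath directory
# as well as its original location. Paths are from the session above.
for d in /usr/local/hadoop/share/hadoop/common /usr/share/java; do
  if [ -e "$d/guava.jar" ]; then
    echo "found: $d/guava.jar"
  fi
done
echo "check done"
```

Note that cp (not hdfs dfs -put) is the right tool here, since the hadoop classpath entries are ordinary local directories.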
vagrant@node1:~/ia-hadoop-tools$ hadoop classpath
/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
# Copied guava.jar to somewhere on the existing hadoop classpath

vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
# Successful run

vagrant@node1:~$ hdfs dfs -get /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz /home/vagrant/.
vagrant@node1:~$ cd ..
vagrant@node1:~$ gunzip MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz
vagrant@node1:~$ less MAORI-CC-2019-30-20190902100139-000000.warc.wet
# Copied a WET output file from the hadoop filesystem to the local filesystem and inspected its contents. Works!

-----------------------------------