Changeset 33457


Timestamp:
2019-09-05T19:01:36+12:00
Author:
ak19
Message:

Got stage 1, the WARC to WET conversion, working after discovering and making the necessary adjustments to the online instructions. The instructions and code didn't work as-is; they were probably out of date.

Location:
gs3-extensions/maori-lang-detection/MoreReading
Files:
2 edited

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

r33456 → r33457

https://gist.github.com/Smerity/afe7430fdb4371015466
https://github.com/commoncrawl/commoncrawl/issues/11
WARC TO WET:
https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

Sebastian Nagel
05/07/2017
[...]
Best,
Sebastian

=======================
Latest version of the index's schema:
[...]
tikauka:[142]/Scratch/anupama/maori-lang-detection>gunzip CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz

vagrant@node1:~/ia-hadoop-tools$ hdfs dfs -get /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz /home/vagrant/.

--------------------------------------------
  • gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

r33456 → r33457

- Connecting machines, like analytics, must access node1 or use port forwarding to view the VM's servers on localhost. For example, on analytics, you can view the Yarn pages at http://localhost:8088/
- If firefox is launched inside the VM (so inside node1), then pages can be accessed off their respective ports at any of localhost|10.211.55.101|node1.
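
For example, a port-forwarding setup from a connecting machine could look like this (a sketch; the IP is the one above, and direct ssh access as the vagrant user is an assumption):
# forward local port 8088 to the Yarn web UI inside the VM
ssh -L 8088:localhost:8088 [email protected]
# then browse http://localhost:8088/ on the connecting machine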

===========================================
        WARC TO WET
===========================================
https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

Sebastian Nagel
05/07/2017
Hi,

unfortunately, we do not provide (yet) text extracts of the CC-NEWS archives.

But it's easy to run the WET extractor on the WARC files, see:
  https://groups.google.com/d/topic/common-crawl/imv4hlLob4s/discussion
  https://groups.google.com/d/topic/common-crawl/b6yDG7EmnhM/discussion

That's what you have to do:

# download the WARC files and place them in a directory "warc/"
# create sibling folders wat and wet
# |
# |-- warc/
# |   |-- CC-NEWS-20161001224340-00008.warc.gz
# |   |-- CC-NEWS-20161017145313-00000.warc.gz
# |   `-- ...
# |
# |-- wat/
# |
# `-- wet/

git clone https://github.com/commoncrawl/ia-web-commons
cd ia-web-commons
mvn install

cd ..
git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package

java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
   -strictMode -skipExisting batch-id-xyz .../warc/*.warc.gz

The folders wat/ and wet/ will then contain the exports.

Best,
Sebastian

---

1. So following the above instructions, I first made a warc subfolder in hdfs:///user/vagrant/cc-mri-subset/
Then I moved all the downloaded *.warc.gz files into there,
and created wat and wet subfolders in there alongside the warc folder.
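
In commands, that setup was roughly as follows (a sketch reconstructed after the fact, not the exact history; ~/warc-downloads is a hypothetical name for wherever the files were saved locally):
# create the warc/wat/wet sibling folders on HDFS
hdfs dfs -mkdir -p /user/vagrant/cc-mri-subset/warc
hdfs dfs -mkdir /user/vagrant/cc-mri-subset/wat
hdfs dfs -mkdir /user/vagrant/cc-mri-subset/wet
# move the downloaded warc.gz files onto HDFS
hdfs dfs -put ~/warc-downloads/*.warc.gz /user/vagrant/cc-mri-subset/warc/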

2. Next, I did the 2 git clone and mvn compile operations above.
The first, ia-web-commons, compiled successfully (despite some test failures).

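(As an aside: the test failures didn't block the install here, but if they had, Maven's standard flag for skipping tests would have been the workaround, a sketch:)
# skip the failing tests during install (standard Maven/surefire flag)
mvn install -DskipTests
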
3. BUT THE 2ND GIT PROJECT, ia-hadoop-tools, DIDN'T COMPILE AT FIRST, with the mvn package command failing:

git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package

The compile failed with a message about the JSONTokener constructor not taking a String object. It turned out that the version of the JSONTokener class being pulled in was too old; the necessary constructor is present in the most recent version, as seen in the API at https://docs.oracle.com/cd/E51273_03/ToolsAndFrameworks.110/apidoc/assembler/com/endeca/serialization/json/JSONTokener.html

So instead, I opened up ia-hadoop-tools/pom.xml for editing and added the newest version of the org.json package's json artifact (see http://builds.archive.org/maven2/org/json/json/ for <version>) into the pom.xml's <dependencies> element, based on how this was done at https://bukkit.org/threads/problem-loading-libraries-with-maven-java-lang-noclassdeffounderror-org-json-jsonobject.428354/:

   <dependency>
      <groupId>org.json</groupId>
      <artifactId>json</artifactId>
      <version>20131018</version>
   </dependency>

Then I was able to run "mvn package" successfully.
(Maybe I could also have added a far more recent version, as seen in the version numbers at https://mvnrepository.com/artifact/org.json/json,
but I didn't want to go too far ahead in case there was other incompatibility.)

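A quick way to confirm the new dependency actually resolved is Maven's standard dependency:tree goal (a sketch):
# list the resolved dependency tree and check the org.json artifact is in it
mvn dependency:tree | grep "org.json:json"
# expect a line like: org.json:json:jar:20131018:compile
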
4. Next, I wanted to finally run the built executable to convert the warc files to wet files.

I had the warc files on the hadoop filesystem. However, the original instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to were apparently for working with warcs stored on the local filesystem, as those instructions ran the regular java command rather than the hadoop command. The regular java command did not work with the files being on the hadoop filesystem (attempt #1 below).

ATTEMPTS THAT DIDN'T WORK:
1. vagrant@node1:~/ia-hadoop-tools$ java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
2. vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz


I based the 2nd attempt, which uses a proper hadoop command, on https://www.tutorialspoint.com/map_reduce/implementation_in_hadoop.htm
It produced lots of errors, and the output wet (and wat) .gz files were all corrupt, as gunzip could not successfully run over them:

vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
19/09/05 05:57:22 INFO Configuration.deprecation: mapred.task.timeout is deprecated. Instead, use mapreduce.task.timeout
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100139-000000.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100141-000001.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100451-000002.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100453-000003.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100805-000004.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100809-000005.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101119-000006.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101119-000007.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101429-000008.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101429-000009.warc.gz
19/09/05 05:57:23 INFO client.RMProxy: Connecting to ResourceManager at node1/127.0.0.1:8032
19/09/05 05:57:23 INFO client.RMProxy: Connecting to ResourceManager at node1/127.0.0.1:8032
19/09/05 05:57:23 INFO mapred.FileInputFormat: Total input paths to process : 10
19/09/05 05:57:24 INFO mapreduce.JobSubmitter: number of splits:10
19/09/05 05:57:24 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1567397114047_0001
19/09/05 05:57:24 INFO impl.YarnClientImpl: Submitted application application_1567397114047_0001
19/09/05 05:57:24 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1567397114047_0001/
19/09/05 05:57:24 INFO mapreduce.Job: Running job: job_1567397114047_0001
19/09/05 05:57:31 INFO mapreduce.Job: Job job_1567397114047_0001 running in uber mode : false
19/09/05 05:57:31 INFO mapreduce.Job:  map 0% reduce 0%
19/09/05 05:57:44 INFO mapreduce.Job:  map 10% reduce 0%
19/09/05 05:57:44 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000002_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000004_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000001_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000005_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000000_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000003_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:46 INFO mapreduce.Job:  map 0% reduce 0%
19/09/05 05:57:54 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000007_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:55 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000006_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:57 INFO mapreduce.Job:  map 10% reduce 0%
19/09/05 05:57:57 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000009_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

19/09/05 05:57:58 INFO mapreduce.Job:  map 20% reduce 0%
19/09/05 05:57:58 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000008_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:58:06 INFO mapreduce.Job:  map 30% reduce 0%
19/09/05 05:58:08 INFO mapreduce.Job:  map 60% reduce 0%
19/09/05 05:58:09 INFO mapreduce.Job:  map 70% reduce 0%
19/09/05 05:58:10 INFO mapreduce.Job:  map 80% reduce 0%
19/09/05 05:58:12 INFO mapreduce.Job:  map 90% reduce 0%
19/09/05 05:58:13 INFO mapreduce.Job:  map 100% reduce 0%
19/09/05 05:58:13 INFO mapreduce.Job: Job job_1567397114047_0001 completed successfully
19/09/05 05:58:13 INFO mapreduce.Job: Counters: 32
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=1239360
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1430
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=30
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Failed map tasks=10
        Launched map tasks=20
        Other local map tasks=10
        Data-local map tasks=10
        Total time spent by all maps in occupied slots (ms)=208160
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=208160
        Total vcore-milliseconds taken by all map tasks=208160
        Total megabyte-milliseconds taken by all map tasks=213155840
    Map-Reduce Framework
        Map input records=10
        Map output records=0
        Input split bytes=1430
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=1461
        CPU time spent (ms)=2490
        Physical memory (bytes) snapshot=1564528640
        Virtual memory (bytes) snapshot=19642507264
        Total committed heap usage (bytes)=1126170624
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
vagrant@node1:~/ia-hadoop-tools$

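(For the record, the corruption can be confirmed in-place, without copying anything out of HDFS; a sketch using the wet/ path from above:)
# gunzip -t tests a gzip stream read from stdin and exits non-zero if corrupt
for f in $(hdfs dfs -ls /user/vagrant/cc-mri-subset/wet/ | awk '{print $NF}' | grep 'warc.wet.gz$'); do
    hdfs dfs -cat "$f" | gunzip -t && echo "OK: $f" || echo "CORRUPT: $f"
done
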
5. The error messages are all the same but not very informative:
   19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000001_0, Status : FAILED
   Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;

All the references I could find on google indicated that the full version of the error message is that this method, com.google.common.io.ByteStreams.limit(...), could not be located.
The page at http://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%[email protected]%3E
revealed that guava.jar contains the com.google.common.io.ByteStreams class.

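One way to check whether a given guava.jar actually provides the method named in that descriptor, limit(java.io.InputStream, long), is javap (a sketch; the jar path is the copy located in the transcript below):
# print ByteStreams' method signatures and look for limit(InputStream, long)
javap -classpath /usr/share/java/guava.jar com.google.common.io.ByteStreams | grep limit
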
TO GET THE EXECUTABLE TO WORK:
I located guava.jar and found there were 2 identical copies on the filesystem, but neither was on the hadoop classpath yet, so I copied one into an existing Hadoop classpath location. Then I was able to successfully run the executable and produce meaningful WET files at last from the WARC input files:

vagrant@node1:~$ locate guava.jar
/usr/share/java/guava.jar
/usr/share/maven/lib/guava.jar
vagrant@node1:~$ jar -tvf /usr/share/maven/lib/guava.jar | less
vagrant@node1:~$ jar -tvf /usr/share/java/guava.jar | less
# both contained the ByteStreams class

vagrant@node1:~$ cd -
/home/vagrant/ia-hadoop-tools
vagrant@node1:~/ia-hadoop-tools$ find . -name "guava.jar"
# None in the git project

vagrant@node1:~/ia-hadoop-tools$ hadoop classpath
/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
# guava.jar not on the hadoop classpath yet

vagrant@node1:~/ia-hadoop-tools$ diff /usr/share/java/guava.jar /usr/share/maven/lib/guava.jar
# no differences, identical

vagrant@node1:~/ia-hadoop-tools$ hdfs dfs -put /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
put: `/usr/local/hadoop/share/hadoop/common/.': No such file or directory
# hadoop classpath locations are not on the hdfs filesystem

vagrant@node1:~/ia-hadoop-tools$ sudo cp /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
vagrant@node1:~/ia-hadoop-tools$ hadoop classpath
/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
# Copied guava.jar to somewhere on the existing hadoop classpath

vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
# Successful run

vagrant@node1:~$ hdfs dfs -get /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz /home/vagrant/.
vagrant@node1:~$ cd ..
vagrant@node1:~$ gunzip MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz
vagrant@node1:~$ less MAORI-CC-2019-30-20190902100139-000000.warc.wet
# Copied a WET output file from the hadoop filesystem to the local filesystem and inspected its contents. Works!
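
(A WET file can also be inspected straight out of HDFS, without the -get step; a sketch:)
# stream, decompress and page the first records of a WET file in-place
hdfs dfs -cat /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz | gunzip -c | head -50
# the file should start with a "WARC/1.0" warcinfo record, followed by
# WARC-Type: conversion records holding the extracted plain text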

-----------------------------------