Changeset 33524
- Timestamp: 2019-09-26T20:34:12+12:00
- Location: gs3-extensions/maori-lang-detection/hdfs-instructions
- Files: 23 added, 1 edited
gs3-extensions/maori-lang-detection/hdfs-instructions/Readme.txt
--- Readme.txt (r33514)
+++ Readme.txt (r33524)
@@ -213,8 +213,21 @@
 
 git clone https://github.com/commoncrawl/cc-index-table.git
-cd cc-index-table
-mvn package
-
-2. Although cc-index-table will compile successfully, it will throw an exception when it's eventually run. To fix that, edit the file "cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java" as follows:
+
+2. Modify the toplevel pom.xml file used by maven by changing the spark version used to 2.3.0 and adding a dependency for hadoop-aws 2.7.6, as indicated below:
+
+17c17,18
+< <spark.version>2.4.1</spark.version>
+---
+> <!--<spark.version>2.4.1</spark.version>-->
+> <spark.version>2.3.0</spark.version>
+135a137,143
+> <dependency>
+> <groupId>org.apache.hadoop</groupId>
+> <artifactId>hadoop-aws</artifactId>
+> <version>2.7.6</version>
+> </dependency>
+>
+
+3. Although cc-index-table will compile successfully after the above modifications, it will nevertheless throw an exception when it's eventually run. To fix that, edit the file "cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java" as follows:
 
 a. Set option(header) to false, since the csv file contains no header row, only data rows.
@@ -234,8 +247,10 @@
 .toJavaRDD();
 
-3. Now recompile cc-index-table with the above modifications:
-
-(cd cc-index-table)
-mvn package
+// TODO: link svn committed versions of orig and modified CCIndexWarcExport.java here.
+
+4. Now (re)compile cc-index-table with the above modifications:
+
+cd cc-index-table
+mvn package
 
 -------------------------------
@@ -254,5 +269,5 @@
 
 
-2. Add the following into ia-hadoop-tools/pom.xml, in the top of the <dependencies> element, so that maven finds an appropriate version of the org.json package and its JSONTokener :
+2. Add the following into ia-hadoop-tools/pom.xml, in the top of the <dependencies> element, so that maven finds an appropriate version of the org.json package and its JSONTokener (version number found off https://mvnrepository.com/artifact/org.json/json):
 
 <dependency>
@@ -261,4 +276,37 @@
 <version>20131018</version>
 </dependency>
+
+[
+UNFAMILAR CHANGES that I don't recollect making and that may have been a consequence of the change in step 1 above:
+a. These further differences show up between the original version of the file in pom.xml.orig and the modified new pom.xml:
+ia-hadoop-tools>diff pom.xml.orig pom.xml
+
+< <groupId>org.netpreserve.commons</groupId>
+< <artifactId>webarchive-commons</artifactId>
+< <version>1.1.1-SNAPSHOT</version>
+---
+> <groupId>org.commoncrawl</groupId>
+> <artifactId>ia-web-commons</artifactId>
+> <version>1.1.9-SNAPSHOT</version>
+
+b. I don't recollect changing or manually introducing any java files. I just followed the instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
+
+However, a diff -rq between the latest "ia-hadoop-tools" gitproject checked out a month after the "ia-hadoop-tools.orig" checkout I ran, shows the following differences in files which are not shown as recently modified in github itself in that same period.
+
+ia-hadoop-tools> diff -rq ia-hadoop-tools ia-hadoop-tools.orig/
+Files ia-hadoop-tools/src/main/java/org/archive/hadoop/io/HDFSTouch.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/io/HDFSTouch.java differ
+Only in ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs: ArchiveFileExtractor.java
+Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java differ
+Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/JobDriver.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/JobDriver.java differ
+Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java differ
+Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WATGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WATGenerator.java differ
+Only in ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs: WEATGenerator.java
+Files ia-hadoop-tools/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java differ
+Files ia-hadoop-tools/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java differ
+Files ia-hadoop-tools/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java differ
+Files ia-hadoop-tools/src/main/java/org/archive/server/FileBackedInputStream.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/FileBackedInputStream.java differ
+Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeClient.java differ
+Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeServer.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeServer.java differ
+]
 
 3. Now can compile ia-hadoop-tools:
@@ -298,5 +346,5 @@
 
 OUTPUT:
-After hours of processing , you should end up with:
+After hours of processing (leave it to run overnight), you should end up with:
 hdfs dfs -ls /user/vagrant/<crawl-timestamp>
 In particular, the zipped wet records at hdfs:///user/vagrant/<crawl-timestamp>/wet/
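
Reviewer's sketch (not part of the committed Readme): the spark.version edit in the new step 2 of the cc-index-table instructions could be scripted roughly as follows. The function name set_spark_version is hypothetical, and unlike the committed pom.xml diff, which keeps the old 2.4.1 line in an XML comment, this sketch overwrites the value in place and leaves the original in a pom.xml.orig backup. It assumes a pom.xml that has not been edited yet, and the hadoop-aws dependency would still need to be added by hand.

```shell
# Hypothetical helper: set <spark.version> in a Maven pom.xml.
# Usage: set_spark_version <path-to-pom.xml> <version>
# A copy of the unmodified file is kept next to it with an .orig suffix.
set_spark_version() {
  pom="$1"
  version="$2"
  # -i.orig edits in place and writes the backup; this form is accepted by
  # both GNU and BSD sed.
  sed -i.orig \
    "s|<spark.version>[^<]*</spark.version>|<spark.version>${version}</spark.version>|" \
    "$pom"
}
```

For the case in the Readme this would be called as `set_spark_version cc-index-table/pom.xml 2.3.0`, followed by `mvn package` from inside cc-index-table as in step 4.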