Changeset 31365 for other-projects/hathitrust/wcsa
- Timestamp:
- 2017-01-29T21:51:30+13:00 (7 years ago)
- File:
-
- 1 edited
Legend:
- Unmodified
- Added
- Removed
-
other-projects/hathitrust/wcsa/extracted-features-solr/trunk/solr-ingest/src/main/java/org/hathitrust/extractedfeatures/ProcessForCatalogLangCount.java
Diff r31364 → r31365 (old line numbers on the left, new on the right):

@@ execCatalogLangCountSparkDirect @@
 46  46      public void execCatalogLangCountSparkDirect()
 47  47      {
 48   -  -       String spark_app_name = generateSparkAppName("Spark-Direct + Per Volume");
 49   -  -
 50   -  -       SparkConf conf = new SparkConf().setAppName(spark_app_name);
  -  48  +       SparkConf conf = new SparkConf().setAppName("Spark-Direct + Per Volume: Downsample");
 51  49          JavaSparkContext jsc = new JavaSparkContext(conf);
 52  50
 … …

@@ new method sampleDown added before execCatalogLangCount @@
110 108      }
111 109
  - 110  +   public void sampleDown()
  - 111  +   {
  - 112  +       String spark_app_name = generateSparkAppName("Spark Cluster + Per Volume");
  - 113  +
  - 114  +       SparkConf conf = new SparkConf().setAppName(spark_app_name);
  - 115  +       JavaSparkContext jsc = new JavaSparkContext(conf);
  - 116  +       jsc.hadoopConfiguration().set("io.compression.codec.bzip2.library", "java-builtin");
  - 117  +
  - 118  +       String packed_sequence_path = "hdfs:///user/capitanu/data/packed-ef";
  - 119  +
  - 120  +       JavaPairRDD<Text, Text> input_pair_rdd = jsc.sequenceFile(packed_sequence_path, Text.class, Text.class);
  - 121  +
  - 122  +       JavaPairRDD<Text, Text> json_text_sample_rdd = input_pair_rdd.sample(false,0.0001,42);
  - 123  +
  - 124  +       String output_directory = "packed-ef-10000";
  - 125  +       json_text_sample_rdd.saveAsTextFile(output_directory);
  - 126  +
  - 127  +
  - 128  +
  - 129  +   }
112 130      public void execCatalogLangCount()
113 131      {
114   -  -
115 132
116 133          String spark_app_name = generateSparkAppName("YARN Cluster + Per Volume");
117 134
 … …

@@ driver switched from execCatalogLangCount() to sampleDown() @@
237 254              = new ProcessForCatalogLangCount(input_dir,json_list_filename,verbosity);
238 255
239   -  -       prep_for_lang.execCatalogLangCount();
  - 256  +       //prep_for_lang.execCatalogLangCount();
  - 257  +       prep_for_lang.sampleDown();
240 258
241 259      }
Note: See TracChangeset for help on using the changeset viewer.