I am trying to convert a JavaRDD to a Dataset for optimisation purposes, and I don't want to spin up an EMR cluster every time I want to test Spark. Is there a way to run Spark locally, and to set up a debugger for my code as well?
The code I want to test looks something like this (it compares RDD performance with Dataset performance):
// Imports needed for this snippet:
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public void runSparkLocal() {
    SparkSession spark = SparkSession.builder()
            .appName("DatasetVsRDD")
            .config("spark.master", "local")
            .getOrCreate();

    // Sample data: 100,000 (String, int) rows
    List<Row> data = new ArrayList<>();
    for (int i = 0; i < 100000; i++) {
        data.add(RowFactory.create("value_" + i, i));
    }
    StructType schema = new StructType()
            .add("col1", DataTypes.StringType)
            .add("col2", DataTypes.IntegerType);
    Dataset<Row> dataset = spark.createDataFrame(data, schema);

    // Processing with Dataset; count() is the action that triggers execution
    long datasetStartTime = System.currentTimeMillis();
    dataset.filter("col1 like 'value_%'").select("col2").count();
    long datasetEndTime = System.currentTimeMillis();
    System.out.println("Dataset Processing Time: " + (datasetEndTime - datasetStartTime) + " ms");

    // Processing with RDD, built from an identical DataFrame over the same data
    JavaRDD<Row> rdd = spark.createDataFrame(data, schema).toJavaRDD();
    long rddStartTime = System.currentTimeMillis();
    // Filter rows, then extract the integer column
    JavaRDD<Integer> filteredRDD = rdd.filter(row -> row.getString(0).startsWith("value_"))
            .map(row -> (Integer) row.get(1));
    long filteredCount = filteredRDD.count(); // Count the filtered integers
    long rddEndTime = System.currentTimeMillis();
    System.out.println("RDD Processing Time: " + (rddEndTime - rddStartTime) + " ms");

    spark.stop();
}
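
To be concrete, what I'm after is being able to run something like the minimal sketch below straight from an IDE, with breakpoints working. The class name LocalSparkCheck is just a placeholder I made up, and it assumes the spark-sql dependency is already on the classpath:

import org.apache.spark.sql.SparkSession;

public class LocalSparkCheck {
    public static void main(String[] args) {
        // "local[*]" runs Spark inside this JVM using all available cores,
        // so no EMR cluster is needed and an IDE debugger can attach normally.
        SparkSession spark = SparkSession.builder()
                .appName("LocalSparkCheck")
                .master("local[*]")
                .getOrCreate();

        // Trivial sanity check that the local session works
        System.out.println("Row count: " + spark.range(10).count());

        spark.stop();
    }
}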
I tried to find some articles about this, but none of them were detailed enough to get it working for me.