I am exploring Spark Connect.
I managed to spin up the Spark Connect server, Spark workers, and the Spark master in a dockerised setup.
Using simple PySpark code, I am able to push simple DataFrame queries to Spark via Spark Connect and have them executed. Here is a minimal example:
from pyspark.sql import SparkSession
def main():
    # Stop any classic local session first so the remote one takes over
    SparkSession.builder.master("local[*]").getOrCreate().stop()

    # Connect to the Spark Connect server
    spark = SparkSession.builder.appName("HelloSparkConnect").remote("sc://localhost:15002").getOrCreate()
    print("Connected to Spark Connect!")

    # Create a DataFrame with sample data
    data = [("Alice", 25), ("Bob", 30)]
    columns = ["Name", "Age"]
    df = spark.createDataFrame(data, columns)

    # Show the DataFrame
    df.show()

    # Stop the SparkSession
    spark.stop()

if __name__ == "__main__":
    main()
Challenge:
I want to accomplish the same in Java 8 or higher (preferably Java 17 or newer). I cannot find an official Java Spark Connect client yet. I tried with the latest 3.5.x versions of the Spark master and Spark Connect, and also with 4.x versions along with Java 17.
There is no support for the remote API in Java yet. remote is the builder call that connects to the Spark Connect server listening on port 15002.
See the call below:
SparkSession.builder.appName("HelloSparkConnect").remote("sc://localhost:15002").getOrCreate()
This documentation also says the same (it is not stated explicitly, but it only mentions client support for Python and Scala):
https://spark.apache.org/docs/latest/spark-connect-overview.html
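One workaround I am considering (untested, so treat this as an assumption on my part): since the Scala client ships as an ordinary JVM artifact (`org.apache.spark:spark-connect-client-jvm_2.12` for 3.5.x), it might be callable from plain Java. A minimal sketch, assuming that artifact is on the classpath and that its builder exposes `remote()` and `getOrCreate()` as the Scala docs suggest:

```java
// Hypothetical sketch: driving the Scala Spark Connect client from Java.
// Assumes org.apache.spark:spark-connect-client-jvm_2.12:3.5.x is on the
// classpath; note this SparkSession is NOT the classic spark-sql one.
import org.apache.spark.sql.SparkSession;

public class HelloSparkConnect {
    public static void main(String[] args) {
        // remote() points the builder at the Spark Connect server on port 15002
        SparkSession spark = SparkSession.builder()
                .remote("sc://localhost:15002")
                .getOrCreate();

        // Run a trivial query over the Connect gRPC channel
        spark.sql("SELECT 'Alice' AS Name, 25 AS Age").show();

        spark.stop();
    }
}
```

I have not verified that all of the client's Scala API surface is Java-friendly, and this obviously requires the dockerised Spark Connect server to be running, so I would still prefer an official Java client.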
Any idea when the Apache Spark community is planning to support a Java Spark Connect client?
A follow-up question: for which other languages (Go, Rust, etc.) can we expect a Spark Connect client?