I am trying to build a Spark 3.0 custom data source (either DataSource V1 or V2). There are a large number of tutorials online, but many of them use TableProvider. I am interested in using RelationProvider instead because of the ability to perform DDL operations.
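For concreteness, here is the V1 entry point I believe I would start from (the package and class names are my own placeholders, and using shortName() also means listing the class in a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file):

package com.example.azuresource

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider}

class DefaultSource extends RelationProvider with DataSourceRegister {

  // Lets callers write .format("azure") instead of the full package name.
  override def shortName(): String = "azure"

  // The .option(...) values from the reader arrive here as the parameters map.
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new RemoteTableRelation(sqlContext, parameters) // relation with the scan logic, sketched at the end of this question
}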
This new data source is for a remote database and needs to transfer a large amount of data, so pulling everything through a JDBC connection is not a good option. I have read how the Redshift connector (https://github.com/databricks/spark-redshift) handles this: it first exports the data to an S3 location and then parallelizes building the DataFrame from that S3 location.
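If I understand that approach correctly, stripped of the data source plumbing it boils down to the two steps below. jdbcURL and tempS3Dir are placeholders, the export statement is pseudo-SQL (Redshift uses UNLOAD; my database would need its own bulk-export command), and I am assuming the staged files end up as Parquet:

import java.sql.DriverManager

val jdbcURL = "jdbc:..."                          // placeholder connection string
val tempS3Dir = "s3a://my-bucket/tmp/unload-001/" // placeholder staging location

// Step 1: have the database export the table itself, so the bulk of the data moves
// database -> S3 instead of streaming through the JDBC socket.
val conn = DriverManager.getConnection(jdbcURL)
try {
  val exportSql = s"<database-specific bulk export of event into $tempS3Dir>" // placeholder
  conn.createStatement().execute(exportSql)
} finally {
  conn.close()
}

// Step 2: read the staged files; Spark schedules a task per file/split, so the
// DataFrame is materialized by many executors in parallel.
val eventsDF = sqlContext.read.parquet(tempS3Dir)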
The first thing I would like to be able to do is read data from the remote database using this intermediary S3 step. Could anyone suggest design choices that may be helpful, or point to tutorials on how to do this? The tutorials I have found that use JDBC do not partition reads of large tables.
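For reference, what I mean by partitioning is something like the column-range splitting the built-in JDBC source offers (the values below are made up); even this still pulls every row through JDBC connections, which is what I want to avoid:

// Built-in JDBC partitioned read, for comparison only: Spark issues numPartitions
// concurrent queries, each covering a slice of the partition column's value range.
val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("url", jdbcURL)
  .option("dbtable", "event")
  .option("partitionColumn", "event_id") // a numeric, date, or timestamp column
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "16")
  .load()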
So far, here are the links I have collected:
- https://github.com/databricks/spark-redshift
- https://github.com/snowflakedb/spark-snowflake
- https://blog.madhukaraphatak.com/spark-3-datasource-v2-part-3
- https://medium.com/@suffyan.asad1/apache-spark-creating-a-jdbc-read-only-data-source-in-apache-spark-3-2c8880b1263b
I would appreciate some help piecing these together to get this first read through an S3 staging location working!
For now, the overall goal is to do something like in the Redshift tutorial:
import sqlContext.implicits._

val eventsDF = sqlContext.read
  .format("azure")
  .option("url", jdbcURL)
  .option("tempdir", tempS3Dir)
  .option("dbtable", "event")
  .load()

eventsDF.show()
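My assumption is that the options above land in createRelation's parameters map, and that the relation it returns runs the export and then hands Spark the staged files, roughly like this (continuing the DefaultSource sketch from the top of the question; runExport and the Parquet staging format are my assumptions):

package com.example.azuresource

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

class RemoteTableRelation(
    override val sqlContext: SQLContext,
    parameters: Map[String, String]) extends BaseRelation with TableScan {

  private val url     = parameters("url")
  private val table   = parameters("dbtable")
  private val tempDir = parameters("tempdir")

  // Placeholder: the real schema would be looked up from the remote database's metadata,
  // and must match the rows produced by buildScan.
  override def schema: StructType = StructType(Nil)

  override def buildScan(): RDD[Row] = {
    // 1. Ask the database to bulk-export `table` into tempDir, so the data moves
    //    database -> S3 rather than through a single JDBC connection.
    runExport(url, table, tempDir)

    // 2. Read the staged files in parallel and hand their rows back to Spark.
    sqlContext.read.parquet(tempDir).rdd
  }

  // Hypothetical helper: issue the database-specific export statement over JDBC,
  // as in step 1 of the snippet near the top of the question.
  private def runExport(url: String, table: String, tempDir: String): Unit = ???
}

Is that roughly the right shape for the read path?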