I want to use the Spark SQL DataSourceV2 API to create a custom DataWriter that receives data in the internal ColumnarBatch representation, so that I can leverage the columnar layout to serialize the data efficiently before writing it out to my data storage.
I could not find out whether Spark SQL can call my custom DataWriter and pass it the data as ColumnarBatch rather than as InternalRow.
Can this even be done without changing the Spark source code? If so, how do I tell Spark that my DataWriter should work on ColumnarBatch (beyond making my custom DataWriter implement DataWriter&lt;ColumnarBatch&gt;)? See the sketch below for what I have in mind.
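To make the idea concrete, here is roughly the writer I would like Spark to invoke. The class name, the commit message type, and the serializeColumn helper are my own placeholders; the interfaces are from the Spark 3.x connector packages (org.apache.spark.sql.connector.write), so package names and signatures may differ slightly in the older sources.v2 API:

```scala
import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical commit message carrying whatever my sink needs at commit time.
case class MyCommitMessage(bytesWritten: Long) extends WriterCommitMessage

// The writer I would like Spark to call: it should receive whole
// ColumnarBatch objects instead of one InternalRow at a time.
class MyColumnarDataWriter extends DataWriter[ColumnarBatch] {
  private var bytesWritten = 0L

  override def write(batch: ColumnarBatch): Unit = {
    // Walk the batch column by column and serialize each ColumnVector
    // directly into my sink's columnar format (details omitted).
    var i = 0
    while (i < batch.numCols()) {
      val column = batch.column(i)
      // bytesWritten += serializeColumn(column, batch.numRows()) // hypothetical helper
      i += 1
    }
  }

  override def commit(): WriterCommitMessage = MyCommitMessage(bytesWritten)

  override def abort(): Unit = {
    // Drop any partially written output for this task.
  }

  override def close(): Unit = {
    // Flush and release the underlying storage client / buffers.
  }
}
```

This compiles, but I cannot see where to plug it in so that Spark actually hands it batches instead of rows.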
I have tried looking through the Spark source code to work out whether this is possible, but could not find the answer myself.
I have also looked at this question and the blog post it points to, but they both deal with writing row by row through the InternalRow interface, not with the ColumnarBatch representation.