TL;DR: I have a persisted pyspark dataframe that changes just by calling df.write... on it. Should that be possible? It happens because the dataframe comes from reading a delta table which then gets modified by that very write. See the code below for how to replicate it (on Spark >= 3.3).
Longer details:
I have a complicated script using pyspark, where I:
- Read dataframe from a previous run
- Add new data that is available since the previous run
- Save the dataframe back to where I read it from
- Use the updated dataframe in other ways – as inputs to other functions etc.
I found I was getting wrong results and tracked it down to the behaviour you can see in the simplified replication code below. My understanding is that Spark is "clever" and sees that the original data has changed, so it re-runs the full logical plan. However, I changed the data on purpose, and by persisting I did not want it to recalculate; adding a persist/cache does not help. The easy solution is to not save the dataframe until I have finished "using" it (sketched just below), but the pattern of reading, updating and then saving again feels so natural that I suspect it causes errors in lots of people's code…
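To be concrete, the "easy solution" is just reordering things so the overwrite is the very last step. A rough sketch only, reusing the same spark, table_path and new_extra_data names as the replication code further below:
df = spark.read.format("delta").load(table_path)
df = df.unionByName(new_extra_data)
df.persist()
# ... do all the analysis, plotting etc. with df first ...
df.count()
df.show()
# only overwrite the source table once nothing else will read df again
df.write.format("delta").mode("overwrite").save(table_path)
Here is the code to replicate the issue itself: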
import pyspark
print(f"{pyspark.__version__=}")
from pyspark.sql import functions as f
table_path = "/tmp/table123"
dbutils.fs.rm(table_path, recurse=True)
orig_data = spark.createDataFrame(data=[
{"tag": "tag-1", "time": 0, "value": 0},
])
orig_data.write.format("delta").mode("overwrite").save(table_path)
# Read in some data previously stored
df = spark.read.format("delta").load(table_path)
# Add new data from some made up data source
new_extra_data = spark.createDataFrame(data=[
{"tag": "tag-1", "time": 1, "value": 1},
])
df = df.unionByName(new_extra_data)
df.persist()
df.count() # action to trigger the plan so the persist happens
print(f"before saving df to the table, calling count after persist, {df.count()=}")
print(f"before saving df to the table, df.show outputs:")
df.show()
# save the new updated table with the new data
df.write.format("delta").mode("overwrite").save(table_path)
# now do something with the updated data, e.g. do some analysis, make a plot etc.
print(f"after saving df to the delta table, df.show outputs:")
df.show()
print(f"nAs you can see, just calling 'df.write...' has cause the df itself to change, despite being persisted")
print(f"The assertion of 'df.count() == 2' will {'pass' if df.count() == 2 else 'fail'}")
assert df.count() == 2
Output:
Screenshot of output from running in a notebook
This behaviour appears in Spark >= 3.3; earlier versions behave as I would expect.
Have I found a bug in pyspark / spark, or should I just use a different pattern? I would appreciate your views and insights.
When running the above code, I was expecting the dataframe to stay the same after the command df.write.format("delta").mode("overwrite").save(table_path); instead it gets modified and a new row is added.
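In case it is relevant: besides deferring the write, the alternative I am considering is to break the dataframe's lineage before overwriting the table, so that later actions cannot re-scan the modified source. This is only my own sketch using localCheckpoint(); I have not confirmed it is the recommended approach:
df = spark.read.format("delta").load(table_path)
df = df.unionByName(new_extra_data)
# materialise the rows and truncate the logical plan that points back at the delta table
df = df.localCheckpoint(eager=True)
# overwriting the source table should now no longer affect df
df.write.format("delta").mode("overwrite").save(table_path)
df.show()
assert df.count() == 2  # should pass, since the plan no longer references the table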