TL;DR: I have a persisted pyspark dataframe that changes just by calling df.write... on it. Should that be possible? It happens because the dataframe comes from reading a delta table which then gets modified by that very write. See the code below for how to replicate it (on Spark >= 3.3).
Longer details:
I have a complicated script using pyspark, where I:
- Read dataframe from a previous run
- Add new data that is available since the previous run
- Save the dataframe back to where I read it from
- Use the updated dataframe in other ways – as inputs to other functions etc.
I found I was getting wrong results and tracked it down to the behaviour you can see in the simplified replication code below. My understanding is that Spark is "clever" and sees that the original data has changed, so it re-runs the full logical plan. However, I changed the data on purpose, and by persisting I did not want it to recalculate; adding a persist/cache does not help. The easy solution is to not save the dataframe until I have finished "using" it (sketched just below), but the pattern of reading, updating and then saving again feels so natural that I suspect it causes errors in lots of people's code…
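To be concrete, the "easy solution" is just reordering things so the overwrite is the very last step. A rough sketch only, reusing the same spark, table_path and new_extra_data names as the replication code further below:
df = spark.read.format("delta").load(table_path)
df = df.unionByName(new_extra_data)
df.persist()
# ... do all the analysis, plotting etc. with df first ...
df.count()
df.show()
# only overwrite the source table once nothing else will read df again
df.write.format("delta").mode("overwrite").save(table_path)
Here is the code to replicate the issue itself: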
import pyspark
print(f"{pyspark.__version__=}")
from pyspark.sql import functions as f
table_path = "/tmp/table123"
dbutils.fs.rm(table_path, recurse=True)
orig_data = spark.createDataFrame(data=[
{"tag": "tag-1", "time": 0, "value": 0},
])
orig_data.write.format("delta").mode("overwrite").save(table_path)
# Read in some data previously stored
df = spark.read.format("delta").load(table_path)
# Add new data from some made up data source
new_extra_data = spark.createDataFrame(data=[
{"tag": "tag-1", "time": 1, "value": 1},
])
df = df.unionByName(new_extra_data)
df.persist()
df.count() # action to trigger the plan so the persist happens
print(f"before saving df to the table, calling count after persist, {df.count()=}")
print(f"before saving df to the table, df.show outputs:")
df.show()
# save the new updated table with the new data
df.write.format("delta").mode("overwrite").save(table_path)
# now do something with the updated data, e.g. do some analysis, make a plot etc.
print(f"after saving df to the delta table, df.show outputs:")
df.show()
print(f"nAs you can see, just calling 'df.write...' has cause the df itself to change, despite being persisted")
print(f"The assertion of 'df.count() == 2' will {'pass' if df.count() == 2 else 'fail'}")
assert df.count() == 2
Output:
Screenshot of output from running in a notebook
This behaviour appears in Spark >= 3.3; earlier versions behave as I would expect.
Have I found a bug in pyspark / spark, or should I just use a different pattern? I would appreciate your views and insights.
When running the above code, I was expecting the dataframe to stay the same after the command df.write.format("delta").mode("overwrite").save(table_path); instead it gets modified and a new row is added.
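In case it is relevant: besides deferring the write, the alternative I am considering is to break the dataframe's lineage before overwriting the table, so that later actions cannot re-scan the modified source. This is only my own sketch using localCheckpoint(); I have not confirmed it is the recommended approach:
df = spark.read.format("delta").load(table_path)
df = df.unionByName(new_extra_data)
# materialise the rows and truncate the logical plan that points back at the delta table
df = df.localCheckpoint(eager=True)
# overwriting the source table should now no longer affect df
df.write.format("delta").mode("overwrite").save(table_path)
df.show()
assert df.count() == 2  # should pass, since the plan no longer references the table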