I’m running some PySpark code like this to help clean up an input DataFrame by computing DataFrames containing the IDs I’d like to remove:
from pyspark import StorageLevel

# each of these DFs on its own has the same 2 cols (col_a, col_b), and maybe 5-10K rows each
df1 = evalFunc1(input_df, ... others)
df2 = evalFunc2(input_df, ... others)
df3 = evalFunc3(input_df, ... others)
# ...etc, more

unioned_df = (
    df1.union(df2).union(df3)
    # the rest of the unions
)
unioned_df.persist(StorageLevel.MEMORY_AND_DISK)

print(f"all row offers before filters: {input_df.count()}")  # maybe 150K max

cleaned_input_df = input_df.join(
    unioned_df,
    on=["col_a", "col_b"],
    how="left_anti",
)
cleaned_input_df.persist(StorageLevel.MEMORY_AND_DISK)
after_filter_count = cleaned_input_df.count()
print(f"rows after filter: {after_filter_count}")  # usually about 30K, so a big reduction
Normally the above takes about 2-3 minutes to execute; it’s a very small volume of data. However, my logic changed, and I realized I needed to move evalFunc3 down so it runs in a specific order: it does a similar thing, except it has to execute AFTER the other functions have run, against the already-filtered data. So the above turns into this:
df1 = evalFunc1(input_df, ... others)
df2 = evalFunc2(input_df, ... others)
# df3 = evalFunc3(input_df, ... others)  # no longer executing here
# ...etc, more

unioned_df = (
    df1.union(df2)
    # the rest of the unions (except not df3)
)
unioned_df.persist(StorageLevel.MEMORY_AND_DISK)

print(f"all row offers before filters: {input_df.count()}")  # still approx 150K max

cleaned_input_df = input_df.join(
    unioned_df,
    on=["col_a", "col_b"],
    how="left_anti",
)
cleaned_input_df.persist(StorageLevel.MEMORY_AND_DISK)
after_filter_count = cleaned_input_df.count()
print(f"rows after filter: {after_filter_count}")  # usually about 30K, so a big reduction

# now execute evalFunc3 with the now-smaller cleaned_input_df
df3 = evalFunc3(cleaned_input_df, ... others)
print("resulting rowcount", df3.count())  # force evaluation - this now takes 13 minutes

final_cleaned_input_df = cleaned_input_df.join(
    df3,
    on=["col_a", "col_b"],
    how="left_anti",
)
The difference is that the evaluation of df3 now takes more than 4 times as long as the entire original run, even though I’m doing the same logic on less data. The actual code’s query plan is enormous in both scenarios, something like 70K lines or more, so I can’t really use it to figure out what’s going on; all I can say is that the first version runs quickly, as expected. The data is never especially large, always under 200K rows, and no DataFrame in the above has more than 15 columns, all simple text and ints.
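In case it helps, this is roughly how I’m gauging the plan size rather than reading it in the notebook (a rough sketch; capture_plan is just a little helper of mine, not a library function). It just captures the explain() output and counts lines:

import io
from contextlib import redirect_stdout

def capture_plan(df, mode: str = "formatted") -> str:
    # df.explain() prints to stdout, so redirect it into a string buffer
    buf = io.StringIO()
    with redirect_stdout(buf):
        df.explain(mode=mode)
    return buf.getvalue()

plan_text = capture_plan(cleaned_input_df)
print(len(plan_text.splitlines()), "plan lines")  # tens of thousands of lines in both versions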
What causes the explosion in run time when my code is structured like the second code block?
I’m running this in a PySpark notebook on Databricks in Azure, on a Databricks Runtime 14.3 cluster with Photon enabled. I’m really scratching my head here. The cluster is also very generously sized for this workload: a single-node cluster, but with 128 GB of memory, so that should be plenty. And again, the first code block runs fine. Any idea what’s causing the increase? Thanks!