Suppose I have a table that contains customer orders, partitioned (only) by date_transaction. I want to find the maximum value of date_transaction.
Previously, I simply ran spark.sql("SELECT MAX(date_transaction) FROM orders").first() (the naive approach).
I recently discovered that I could instead use spark.sql("SHOW PARTITIONS orders").toPandas().max() plus some string cleaning (the SHOW PARTITIONS approach).
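For context, the string cleaning is roughly something like the sketch below (a minimal sketch, assuming the partitions come back as strings of the form date_transaction=yyyy-MM-dd):

# Assumption: SHOW PARTITIONS returns strings like 'date_transaction=2024-01-31'
partitions = spark.sql("SHOW PARTITIONS orders").toPandas()["partition"]
# Strip the prefix, then take the lexicographic max (safe for zero-padded yyyy-MM-dd dates)
max_date = partitions.str.replace("date_transaction=", "", regex=False).max()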
I noticed that using SHOW PARTITIONS makes the process much quicker, since Spark only reads the partition metadata instead of scanning the whole table.
One drawback I can think of: if the maximum date_transaction partition contains no data, SHOW PARTITIONS will still return it, whereas the naive approach would return the correct date (is it even possible to have empty partitions?).
I'm wondering if the SHOW PARTITIONS approach has any other drawbacks that I haven't thought of.
I would appreciate your input. Thank you and have a great day.
If you use the SHOW PARTITIONS approach, it will return all partition values (even if some of those partitions contain no data), whereas the SELECT MAX(date_transaction) approach ensures that only non-empty partitions are considered.
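Empty partitions are indeed possible, for example when a partition is registered in the metastore without any data ever being written to it. A quick illustration (a sketch, assuming orders is a Hive-style partitioned table; the '2099-01-01' partition is purely hypothetical):

# Assumption: 'orders' is a Hive-style partitioned table in the metastore.
# Adding a partition without writing data leaves it empty, yet it still appears in SHOW PARTITIONS.
spark.sql("ALTER TABLE orders ADD IF NOT EXISTS PARTITION (date_transaction='2099-01-01')")
spark.sql("SHOW PARTITIONS orders").show(truncate=False)      # lists date_transaction=2099-01-01
spark.sql("SELECT MAX(date_transaction) FROM orders").show()  # ignores the empty partition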
# You can combine the two approaches with an if/else check
import re

# Take the lexicographically largest partition string (safe for zero-padded yyyy-MM-dd dates)
max_partition = spark.sql("SHOW PARTITIONS orders").toPandas()['partition'].max()
max_partition_date = re.search(r'date_transaction=(\d{4}-\d{2}-\d{2})', max_partition).group(1)

# Check whether the maximum partition actually contains rows
result = spark.sql(f"SELECT COUNT(*) FROM orders WHERE date_transaction = '{max_partition_date}'").first()
if result[0] > 0:
    print(f"The maximum date with data is: {max_partition_date}")
else:
    # Fall back to the full scan only if the maximum partition turns out to be empty
    max_date_with_data = spark.sql("SELECT MAX(date_transaction) FROM orders").first()[0]
    print(f"The maximum date with actual data is: {max_date_with_data}")