Spark job spilling data vs OOM
I am using Spark SQL to run SQL jobs with 10G of executor memory.
When I monitor the job in the Spark UI, I can see that data is being spilled to disk and memory (expected, since I am doing some explode operations).
PySpark Window functions: Aggregation differs if WindowSpec has sorting
I am working through this example of aggregation functions for PySpark Window.
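A likely explanation: once a WindowSpec has an `orderBy`, Spark's default window frame becomes "unbounded preceding to current row", so aggregates turn into running aggregates instead of whole-partition values. The difference can be sketched in plain Python (illustrative functions, not PySpark API):

```python
# Sketch of why adding a sort to a WindowSpec changes aggregation
# results: without ORDER BY the frame is the whole partition; with
# ORDER BY the default frame ends at the current row.

def partition_sum(values):
    # No ordering: every row sees the sum over the whole partition.
    total = sum(values)
    return [total for _ in values]

def running_sum(values):
    # With ordering: each row sees the sum up to and including itself.
    out, acc = [], 0
    for v in values:
        acc += v
        out.append(acc)
    return out

vals = [10, 20, 30]
print(partition_sum(vals))  # [60, 60, 60]
print(running_sum(vals))    # [10, 30, 60]
```

To get whole-partition aggregates on a sorted window, the frame can be set explicitly, e.g. `rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)`.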
In Spark physical planning, what kind of optimization is happening?
I have a quote from a blog post on physical planning in Spark, and I want to understand what it means.
Max of a table partition column vs Max of result of show partitions of the same table
Suppose I have a table that contains customer orders, partitioned (only) by date_transaction. I want to find the maximum value of date_transaction.
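One common approach is to take the max over the output of `SHOW PARTITIONS` instead of scanning the data, since partition values come from the metastore. A plain-Python sketch of that parsing (names and formats are illustrative), with the caveat that the comparison is then on strings, which only matches the typed max when the format sorts lexicographically (e.g. `yyyy-MM-dd`):

```python
# SHOW PARTITIONS returns strings like "date_transaction=2023-07-15";
# taking the max over the parsed values avoids a full table scan.

partitions = [
    "date_transaction=2023-07-14",
    "date_transaction=2023-07-15",
    "date_transaction=2023-07-13",
]

def max_partition_value(rows, column="date_transaction"):
    # Keep only entries for the requested column and strip the prefix.
    values = [r.split("=", 1)[1] for r in rows if r.startswith(column + "=")]
    return max(values)

print(max_partition_value(partitions))  # 2023-07-15
```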
Value of another column in the same row as my last lag value
I have a time-series dataset and am looking to make a new column that represents the last reported (non-null) values. I think I have this part figured out, using a combination of lag and last.
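The usual PySpark pattern here is `F.last(col, ignorenulls=True).over(window)`, applied either to both columns or to a struct of the two so the companion value travels with its row. The underlying forward-fill logic can be sketched in plain Python (names are illustrative):

```python
# For each row, carry forward the last non-null value together with
# the companion column from the row where that value was reported,
# mimicking F.last(col, ignorenulls=True).over(window) on a struct.

def last_reported(values, other):
    out = []
    carried = (None, None)
    for v, o in zip(values, other):
        if v is not None:
            carried = (v, o)  # remember the value and its row's companion
        out.append(carried)
    return out

vals  = [1.5, None, None, 2.0, None]
other = ["a", "b", "c", "d", "e"]
print(last_reported(vals, other))
# [(1.5, 'a'), (1.5, 'a'), (1.5, 'a'), (2.0, 'd'), (2.0, 'd')]
```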
Spark weird broadcast size
I came across something very strange today.
ShutdownHookManager Error in Spark with Custom Temporary Directory Configuration
I am encountering a ShutdownHookManager error when running Spark with a custom temporary directory configuration. My directory structure and configuration details are as follows:
How to drop records after date based on condition
I'm looking for an elegant way to drop, for each ID, all records in a DataFrame that occur before the latest occurrence of 'TEST_COMPONENT' being 'UNSATISFACTORY', based on their 'TEST_DT' value.
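In PySpark this usually maps to a window max of the UNSATISFACTORY date per ID (or a groupBy/agg plus join) followed by a filter. The filtering logic itself can be sketched in plain Python; all column names come from the question, and the dict-of-rows representation is illustrative:

```python
# Keep, per ID, only the rows on or after the latest TEST_DT where
# TEST_COMPONENT == "UNSATISFACTORY"; IDs with no such row keep
# everything.

def drop_before_last_unsatisfactory(records):
    # Latest UNSATISFACTORY date per ID.
    cutoff = {}
    for r in records:
        if r["TEST_COMPONENT"] == "UNSATISFACTORY":
            d = cutoff.get(r["ID"])
            if d is None or r["TEST_DT"] > d:
                cutoff[r["ID"]] = r["TEST_DT"]
    return [r for r in records
            if r["ID"] not in cutoff or r["TEST_DT"] >= cutoff[r["ID"]]]

rows = [
    {"ID": 1, "TEST_DT": "2024-01-01", "TEST_COMPONENT": "OK"},
    {"ID": 1, "TEST_DT": "2024-01-02", "TEST_COMPONENT": "UNSATISFACTORY"},
    {"ID": 1, "TEST_DT": "2024-01-03", "TEST_COMPONENT": "OK"},
]
print(drop_before_last_unsatisfactory(rows))
# keeps only the 2024-01-02 and 2024-01-03 rows for ID 1
```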