Tag Archive for apache-spark, pyspark, databricks

Shuffle partitions conf changed in one of the paralleled Databricks tasks

I have a Databricks workflow that creates several entities (each represented as one task in the workflow). Some of these tasks had optimization issues, so I tuned them by hand by changing the shuffle partition count. When such a task finishes, I want the shuffle.partitions setting to revert to the default value (200).
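
A minimal sketch of one way to handle this, assuming the parallel tasks share a SparkSession (the session conf is shared, so an override in one task stays visible to the others until it is reset); the helper name `run_with_shuffle_partitions` and the example query are illustrative:

```python
# Minimal sketch: scope a shuffle-partitions override to a single task and
# restore the previous value afterwards, even if the task fails.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def run_with_shuffle_partitions(n, task_fn):
    """Run task_fn with spark.sql.shuffle.partitions set to n, then revert."""
    previous = spark.conf.get("spark.sql.shuffle.partitions")  # usually "200"
    spark.conf.set("spark.sql.shuffle.partitions", str(n))
    try:
        return task_fn()
    finally:
        # Restore the old value so parallel tasks see the default again.
        spark.conf.set("spark.sql.shuffle.partitions", previous)

# Hypothetical task body: build one entity with a hand-tuned shuffle size.
result = run_with_shuffle_partitions(
    64, lambda: spark.table("my_entity_source").groupBy("key").count()
)
```

Running each task on its own job cluster also avoids the issue, since separate clusters don't share a session conf.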

Databricks api list all jobs from workspace

I am trying to get all job data from my Databricks workspace. Basically, I need to put all job data into a DataFrame. There are more than 3000 jobs, so I need to use the page_token to traverse all pages.
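
A minimal sketch of the pagination loop, assuming the Jobs API 2.1 `/api/2.1/jobs/list` endpoint and a personal access token; the host and token values are placeholders:

```python
# Minimal sketch: page through the Jobs API with page_token / next_page_token
# and load every job into a pandas DataFrame.
import requests
import pandas as pd

HOST = "https://<workspace-url>"                               # placeholder
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # placeholder

def list_all_jobs():
    jobs, page_token = [], None
    while True:
        params = {"limit": 25}                 # max page size for jobs/list
        if page_token:
            params["page_token"] = page_token
        resp = requests.get(f"{HOST}/api/2.1/jobs/list",
                            headers=HEADERS, params=params)
        resp.raise_for_status()
        payload = resp.json()
        jobs.extend(payload.get("jobs", []))
        page_token = payload.get("next_page_token")
        if not page_token:                     # no more pages
            break
    return jobs

df = pd.json_normalize(list_all_jobs())        # one row per job
```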

Parallelize for loop in pyspark; one table per iteration

I’ve got a few dozen Spark tables in Databricks, with sizes between ~1 and ~20 GB, and I want to execute a function on each of these tables. Since there is no interdependency between the results of each query, this should be easy to parallelize.
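
A minimal sketch of one common approach, assuming the per-table function only triggers Spark actions: actions submitted from separate driver-side Python threads run as concurrent Spark jobs, so a thread pool is enough. The function `process_table` and the table names are placeholders:

```python
# Minimal sketch: run an independent per-table function concurrently by
# submitting Spark jobs from a driver-side thread pool.
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

tables = ["db.table_a", "db.table_b", "db.table_c"]  # placeholder names

def process_table(name):
    # Stand-in for the real per-table function.
    return name, spark.table(name).count()

# Cap the pool so the cluster scheduler isn't flooded; the concurrent jobs
# still share the cluster's cores.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_table, tables))
```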