Multiple Sinks Processing not persisting in Databricks Community Edition
I am just trying to use the Rate source with Structured Streaming to write to multiple table names per micro-batch, i.e. just refreshing the multiple-sinks logic in PySpark in preparation for a certification.
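A common way to drive multiple sinks from one stream is foreachBatch, writing the same micro-batch to each table. A minimal PySpark sketch, assuming a rate source; the table names (sink_table_a, sink_table_b) and checkpoint path are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rate source: emits (timestamp, value) rows at a fixed rate, handy for testing.
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

def write_to_multiple_sinks(batch_df, batch_id):
    # Cache the micro-batch so the two writes don't recompute the source.
    batch_df.persist()
    batch_df.write.mode("append").saveAsTable("sink_table_a")  # hypothetical table name
    batch_df.write.mode("append").saveAsTable("sink_table_b")  # hypothetical table name
    batch_df.unpersist()

query = (
    stream_df.writeStream
    .foreachBatch(write_to_multiple_sinks)
    .option("checkpointLocation", "/tmp/checkpoints/multi_sink_demo")  # hypothetical path
    .start()
)
```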
Shuffle partitions conf changed in one of the paralleled Databricks tasks
I have a Databricks workflow that creates several entities (each represented as one task in the workflow). Some of them had optimization issues, so I tuned them by hand by changing the shuffle partition count. When such a task finishes, I see that spark.sql.shuffle.partitions is reverted to the default value (200).
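spark.sql.shuffle.partitions set via spark.conf.set is a session-level setting, so each task that wants a non-default value has to set it in its own session. A short sketch, where the tuned value of 64 is just a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Session-level setting: only affects the Spark session of the task that sets it.
spark.conf.set("spark.sql.shuffle.partitions", 64)  # hypothetical tuned value

print(spark.conf.get("spark.sql.shuffle.partitions"))  # 64 inside this task

# Other tasks (and any new session created after this task finishes) fall back to
# the default unless they set the value themselves, or unless the cluster's Spark
# config pins spark.sql.shuffle.partitions for every session.
```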
Databricks api list all jobs from workspace
I am trying to get all job data from my Databricks workspace. Basically, I need to put all job data into a DataFrame. There are more than 3000 jobs, so I need to use the page_token to traverse all pages.
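A hedged sketch of that pagination loop against the Jobs API 2.1 /api/2.1/jobs/list endpoint, following next_page_token until it is absent; the workspace URL and token are placeholders you would replace:

```python
import requests
import pandas as pd

HOST = "https://<workspace-url>"   # placeholder workspace URL
TOKEN = "<personal-access-token>"  # placeholder token

def list_all_jobs():
    """Collect every job by following next_page_token across pages."""
    jobs, page_token = [], None
    while True:
        params = {"limit": 100}
        if page_token:
            params["page_token"] = page_token
        resp = requests.get(
            f"{HOST}/api/2.1/jobs/list",
            headers={"Authorization": f"Bearer {TOKEN}"},
            params=params,
        )
        resp.raise_for_status()
        payload = resp.json()
        jobs.extend(payload.get("jobs", []))
        page_token = payload.get("next_page_token")
        if not page_token:
            break
    return jobs

# Flatten the JSON into a pandas DataFrame; convert with spark.createDataFrame
# afterwards if a Spark DataFrame is required.
jobs_df = pd.json_normalize(list_all_jobs())
```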
Including a Period between Dictionaries in Apache Spark with Databricks
It would appear Databricks now require you to include the ‘Catalog Name’ when creating a Delta Table (well, at least in the way that I want to create a Delta Table).
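The dot-separated name is the three-level catalog.schema.table form. A minimal sketch, assuming Unity Catalog is enabled and using hypothetical names (my_catalog, my_schema, my_table) plus an existing DataFrame df:

```python
# Three-level namespace: catalog.schema.table (all names here are hypothetical).
spark.sql("CREATE CATALOG IF NOT EXISTS my_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS my_catalog.my_schema")

(df.write
   .format("delta")
   .mode("overwrite")
   .saveAsTable("my_catalog.my_schema.my_table"))

# Alternatively, set the defaults once and use a bare table name afterwards.
spark.sql("USE CATALOG my_catalog")
spark.sql("USE SCHEMA my_schema")
df.write.format("delta").mode("overwrite").saveAsTable("my_table")
```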
Dataframe.write() produces csv file on single node jobs cluster, but not on 2+1 nodes cluster
I’m writing (df.repartition(1).write.save()) an empty DataFrame to the driver’s local hard disk. I use header=True, so the expected output is a CSV file with just the header row.
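A sketch illustrating why the single-node and 2+1-node results differ: local (file:/) output paths are resolved on whichever node executes the write task, which is the driver only on a single-node cluster. Writing to DBFS or cloud storage (the dbfs:/tmp path below is hypothetical) behaves the same on both cluster shapes:

```python
from pyspark.sql.types import StructType, StructField, StringType

# `spark` is the session Databricks provides in notebooks.
schema = StructType([StructField("col_a", StringType()), StructField("col_b", StringType())])
empty_df = spark.createDataFrame([], schema)

# On a multi-node cluster the single write task typically runs on a worker, so a
# file:/ path would leave the output on that worker's disk, not the driver's.
(empty_df.repartition(1)
    .write.format("csv")
    .option("header", "true")
    .mode("overwrite")
    .save("dbfs:/tmp/empty_csv_demo"))  # hypothetical DBFS path
```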
How is the number of partitions 12 if I have only 2 cores available?
I am diving deep into PySpark and trying to understand its internals.
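The partition count and the core count are separate knobs: cores only bound how many partitions run concurrently, while the number of partitions comes from defaultParallelism, the number of input files/splits, or explicit repartitioning. A small sketch for inspecting both (the read path is hypothetical):

```python
# `spark` is the session Databricks provides in notebooks.
# Usually equals the total number of cores (e.g. 2 on a 2-core machine).
print(spark.sparkContext.defaultParallelism)

rdd = spark.sparkContext.parallelize(range(100))
print(rdd.getNumPartitions())  # defaults to defaultParallelism

# File-based reads split by size and file count, not by cores, so a read can
# easily yield 12 partitions on a 2-core machine (e.g. 12 input files, or splits
# governed by spark.sql.files.maxPartitionBytes).
df = spark.read.csv("/tmp/some_folder")  # hypothetical path
print(df.rdd.getNumPartitions())
```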
Failed to load preview: Notebook size exceeded the byte limit
Due to some large plotly plots in my Databricks notebook, I’m exceeding the file size limit of 10 MB and can’t work with the notebook anymore.
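One way to keep the notebook under the limit is to stop embedding the figures in cell output at all, for example by writing each Plotly figure to a standalone HTML file and clearing the existing outputs; the paths below are hypothetical:

```python
import os
import plotly.express as px

# Stand-in for one of the large figures.
fig = px.scatter(x=list(range(1000)), y=list(range(1000)))

# Write the figure to its own HTML file instead of rendering it inline
# (the inline output is what inflates the notebook past the size limit).
os.makedirs("/dbfs/tmp/plots", exist_ok=True)          # hypothetical directory
fig.write_html("/dbfs/tmp/plots/my_plot.html", include_plotlyjs="cdn")
```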
Parallelize for loop in pyspark; one table per iteration
I’ve got a few dozen spark tables in Databricks with sizes between ~1 and ~20 GB and want to execute a function on each of these tables. Since there is no interdependency between the results of each query, this should be easy to parallelize.
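Since the per-table work runs as Spark jobs on the cluster, a thread pool on the driver is usually enough to overlap them. A sketch with hypothetical table names and a stand-in count() as the per-table function:

```python
from concurrent.futures import ThreadPoolExecutor

# `spark` is the session Databricks provides in notebooks.
table_names = ["table_a", "table_b", "table_c"]  # hypothetical table names

def process_table(name):
    # Each call submits its own Spark jobs; the scheduler runs them concurrently
    # as long as the cluster has free task slots.
    df = spark.table(name)
    return name, df.count()  # stand-in for the real per-table function

# Threads (not processes) suffice because the heavy lifting happens on the
# cluster and the SparkSession can be shared across threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_table, table_names))

print(results)
```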