Difference between Spark SQL INSERT OVERWRITE TABLE and Spark DataFrame partitionBy with overwrite mode
I have been testing Spark jobs with Hive tables backed by Google Cloud Storage buckets.
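The post doesn't include the actual statements being compared; as a minimal PySpark sketch of the two write paths (table names, columns, and the GCS path are hypothetical):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("overwrite-comparison")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical source DataFrame, only to illustrate the two APIs.
df = spark.table("staging.events_src")

# Approach 1: Spark SQL INSERT OVERWRITE on a partitioned Hive table.
df.createOrReplaceTempView("events_src")
spark.sql("""
    INSERT OVERWRITE TABLE warehouse.events PARTITION (dt)
    SELECT id, payload, dt FROM events_src
""")

# Approach 2: DataFrameWriter with overwrite mode + partitionBy on a GCS path.
# With partitionOverwriteMode=dynamic, only the partitions present in df are
# rewritten; with the default "static" mode, existing partitions under the
# path are deleted first.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
(df.write
   .mode("overwrite")
   .partitionBy("dt")
   .parquet("gs://my-bucket/warehouse/events"))
```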
What does "shuffle intermediate buffer on the Map side" mean?
I am trying to understand Spark memory management and came across this blog. In it, the author mentions one of the uses of execution memory in Spark:
PySpark with liquid clustering
I have an existing DataFrame stored in Azure Storage. How can I enable liquid clustering on it?
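Liquid clustering is a property of a Delta table rather than of a DataFrame, so one possible sketch is to (re)write the data as a clustered Delta table and then run OPTIMIZE. This assumes a runtime that supports liquid clustering (e.g. Databricks or Delta Lake 3.1+); the catalog/table name, storage path, and clustering column below are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the existing data from Azure Storage (hypothetical abfss path).
df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/input")

# Write it out as a Delta table with a liquid clustering key.
df.createOrReplaceTempView("src")
spark.sql("""
    CREATE TABLE my_catalog.my_schema.events
    CLUSTER BY (device_id)
    AS SELECT * FROM src
""")

# Incrementally cluster data as it is written.
spark.sql("OPTIMIZE my_catalog.my_schema.events")
```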
collect_set and size – overrunning on medium-sized data
I have a dataframe containing the columns device_id, country, language, channel, genre, and a few other attributes. The data is partitioned by year, month, day, and hour.
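The post cuts off before the actual query; a minimal PySpark sketch of the collect_set/size pattern being described (the source path and filter values are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical partitioned source; partition columns year/month/day/hour as described.
df = spark.read.parquet("s3://my-bucket/events/")

distinct_counts = (
    df.where((F.col("year") == 2024) & (F.col("month") == 1))
      .groupBy("device_id", "country")
      .agg(
          F.collect_set("genre").alias("genres"),
          F.size(F.collect_set("genre")).alias("genre_count"),
      )
)
distinct_counts.show(truncate=False)
```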
Spark Catalog doesn’t see the database that I created
I have been learning Spark (3.5.0) and I tried out the following exercise:
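The exercise itself isn't shown; a minimal reproduction of the pattern in question (database and table names are hypothetical) would be:

```python
from pyspark.sql import SparkSession

# Note: without Hive support / a persistent metastore, databases created here
# live only in the in-memory catalog of the current session.
spark = SparkSession.builder.appName("catalog-exercise").getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
spark.sql("CREATE TABLE IF NOT EXISTS demo_db.people (name STRING, age INT)")

# Check what the catalog can actually see.
print([db.name for db in spark.catalog.listDatabases()])
print(spark.catalog.databaseExists("demo_db"))
```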
How to partition data in Spark when reading from a MySQL table with a string-type primary key
I’m reading data from a MySQL table in Spark. The table structure may look like:
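The built-in partitionColumn/lowerBound/upperBound options require a numeric, date, or timestamp column, so they don't apply directly to a string primary key. One commonly used alternative is to pass explicit predicates, one per partition; a sketch with hypothetical connection details, table name, and key ranges:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each predicate becomes one partition, i.e. one parallel JDBC query.
predicates = [
    "id >= 'a' AND id < 'h'",
    "id >= 'h' AND id < 'p'",
    "id >= 'p' AND id < 'z'",
    "id >= 'z' OR id < 'a'",   # catch-all for keys outside the ranges above
]

df = spark.read.jdbc(
    url="jdbc:mysql://mysql-host:3306/mydb",
    table="orders",
    predicates=predicates,
    properties={"user": "spark", "password": "secret",
                "driver": "com.mysql.cj.jdbc.Driver"},
)
print(df.rdd.getNumPartitions())  # equals len(predicates)
```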
EMR Spark shuffle FetchFailedException with 65 TB of data and AQE enabled
I am getting a Spark shuffle FetchFailedException while running Spark on EMR with 65 TB of input data. The job computes aggregated metrics with Spark SQL on top of S3 Parquet, reading 30 days' worth of data.
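The query isn't included; as a rough sketch of the workload shape being described (paths, columns, and metrics are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read ~30 days of Parquet from S3 (hypothetical date-partitioned layout).
days = [f"s3://my-bucket/events/dt=2024-01-{d:02d}/" for d in range(1, 31)]
events = spark.read.parquet(*days)

# Aggregated metrics via Spark SQL; the group-by keys and aggregates are placeholders.
events.createOrReplaceTempView("events")
metrics = spark.sql("""
    SELECT country, channel,
           COUNT(*)                  AS event_count,
           COUNT(DISTINCT device_id) AS devices
    FROM events
    GROUP BY country, channel
""")
metrics.write.mode("overwrite").parquet("s3://my-bucket/metrics/")
```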
Executor distribution across nodes in a cluster
How are executors of a Spark application distributed across the nodes of a cluster? Let’s say Spark is running in cluster mode with YARN as the resource manager. The cluster has 6 nodes, with 16 cores and 64 GB of memory each. With the following configuration, how are the executors distributed across the cluster:
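The post's actual configuration is cut off above; the values below are purely hypothetical, only to make the scenario concrete:

```python
from pyspark.sql import SparkSession

# Hypothetical sizing for a 6-node, 16-core, 64 GB-per-node cluster. YARN places
# executor containers wherever it has room, so e.g. 17 executors of 5 cores could
# land as 3 on five of the nodes and 2 on the sixth.
spark = (SparkSession.builder
         .appName("executor-distribution")
         .master("yarn")
         .config("spark.executor.instances", "17")
         .config("spark.executor.cores", "5")
         .config("spark.executor.memory", "19g")
         .getOrCreate())
```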
Number of cores in an executor and OOM error
I have read some articles on OOM errors in Spark executors, and a number of them mention high concurrency as one of the possible causes. I am aware that concurrency is determined by the number of cores, which determines the maximum number of tasks that can run within an executor.
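A back-of-the-envelope view of the relationship being asked about, with hypothetical numbers (the per-task share is only approximate, since execution memory is shared dynamically between running tasks):

```python
# Hypothetical executor sizing.
executor_cores  = 5          # spark.executor.cores
task_cpus       = 1          # spark.task.cpus
executor_memory = 10 * 1024  # spark.executor.memory in MB

# Maximum concurrent tasks per executor, and the rough memory share each one gets.
concurrent_tasks = executor_cores // task_cpus        # 5 tasks at once
memory_per_task  = executor_memory / concurrent_tasks  # ~2 GB each, roughly
print(concurrent_tasks, memory_per_task)
```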
Spark executor memory overhead
From this blog, I understand there is reserved memory within each executor, which amounts to a constant 300 MB. According to the article, as of Spark 1.6 this reserved memory can only be changed by recompiling Spark. In the Spark config docs there is spark.executor.memoryOverhead, which was introduced in Spark 2.3. Does this config determine the size of the reserved memory that was difficult to change in Spark 1.6+?
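For reference, a rough sketch of where each quantity enters the memory math as I read the docs (values hypothetical): the 300 MB reserved memory is carved out of the on-heap spark.executor.memory, while spark.executor.memoryOverhead is off-heap memory added on top of the heap when sizing the YARN/Kubernetes container.

```python
# Hypothetical numbers, just to show where each setting applies (Spark 3.x defaults).
executor_memory_mb = 10 * 1024   # spark.executor.memory (JVM heap)
reserved_mb        = 300         # fixed reserved memory carved out of the heap
memory_fraction    = 0.6         # spark.memory.fraction default

unified_mb   = (executor_memory_mb - reserved_mb) * memory_fraction  # execution + storage
overhead_mb  = max(384, int(0.10 * executor_memory_mb))              # memoryOverhead default
container_mb = executor_memory_mb + overhead_mb                      # what YARN actually allocates

print(unified_mb, overhead_mb, container_mb)
```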