Internals of worker/executor usage during coalesce/repartition
Let's say we have a Spark cluster with the configuration below.
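A minimal sketch of the distinction the title is asking about; the partition counts below are arbitrary, since the cluster configuration isn't shown here:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-vs-repartition").getOrCreate()

df = spark.range(0, 1_000_000, numPartitions=100)

# coalesce(10) narrows partitions without a shuffle: each output partition
# is built by merging existing partitions already held by an executor.
narrowed = df.coalesce(10)
print(narrowed.rdd.getNumPartitions())  # 10

# repartition(10) triggers a full shuffle: rows are redistributed across
# executors, which balances the data but costs network I/O.
shuffled = df.repartition(10)
print(shuffled.rdd.getNumPartitions())  # 10
```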
A column's values become all 1s after PySpark dataframe toPandas()
I encountered a strange error: all of the values in one column become exactly the same after toPandas().
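When this happens, a useful first step is to compare the Spark schema with the pandas dtypes after conversion; the column name `flag` below is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame([(1, True), (2, False), (3, True)], ["id", "flag"])
sdf.printSchema()            # Spark-side types before conversion

pdf = sdf.toPandas()
print(pdf.dtypes)            # pandas-side types after conversion
print(pdf["flag"].unique())  # verify the values survived the round trip
```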
Nested condition on simple data
I have a dataframe with three columns: two boolean and one string.
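A minimal sketch of a nested condition over such a dataframe, with hypothetical column names and labels:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(True, False, "a"), (False, True, "b"), (True, True, "c")],
    ["cond1", "cond2", "label"],
)

result = df.withColumn(
    "outcome",
    F.when(F.col("cond1") & F.col("cond2"), "both")
     .when(F.col("cond1"), "only_first")
     .otherwise(F.col("label")),  # fall back to the string column
)
result.show()
```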
Understanding Exchange in the Spark UI
I am learning Spark and came across the following behavior. I create a simple dataframe using range.
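A wide operation over such a dataframe is what produces an Exchange; a minimal sketch that surfaces it in the physical plan, the same operator the Spark UI renders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# range() itself needs no shuffle, but a groupBy forces one.
df = spark.range(0, 1000)
df.groupBy((df.id % 10).alias("bucket")).count().explain()
# The physical plan printed above contains "Exchange hashpartitioning(...)",
# which appears as an exchange node in the SQL tab of the Spark UI.
```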
Transform parquet table file by file
I have a big parquet table stored in multiple partitions, each partition containing multiple files, like:
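One possible per-file approach, sketched under the assumption of a local filesystem layout (an object store would need its own listing API instead of glob); the paths are placeholders:

```python
import glob
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

for path in glob.glob("/data/table/partition=*/part-*.parquet"):
    df = spark.read.parquet(path)  # one file becomes one small DataFrame
    transformed = df               # ... apply the per-file transformation here ...
    transformed.write.mode("append").parquet("/data/table_out")
```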
How to connect from my host system to a Spark Container?
I have a container running Spark on port 7077. I got the container's IP address through the docker inspect command, and I'm trying to create a SparkSession from my host machine (I'm using Jupyter on the host) against the container's cluster.
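A minimal sketch of what that session might look like; 172.17.0.2 stands in for whatever address docker inspect returned, and the spark.driver.host setting is an assumption that applies to Docker Desktop-style setups, where executors inside the container network need a route back to the host-side driver:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://172.17.0.2:7077")
    .appName("host-to-container")
    # the driver runs on the host, so executors must be able to reach it back:
    .config("spark.driver.host", "host.docker.internal")
    .getOrCreate()
)
spark.range(10).count()  # quick smoke test of the connection
```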
Reading a JSON file in PySpark with a nested structure: array of arrays of dictionaries
I am trying to read the following JSON data, which is inside a data.json file:
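Since the actual data.json isn't shown here, the sketch below assumes a nested array-of-arrays-of-dictionaries shape with hypothetical field names `records` and `items`:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# multiLine is needed when each JSON value spans multiple lines.
df = spark.read.json("data.json", multiLine=True)
df.printSchema()  # inspect what Spark inferred before flattening

# Two explodes flatten an array of arrays of dictionaries into rows.
flat = (
    df.select(F.explode("records").alias("inner"))     # outer array
      .select(F.explode("inner.items").alias("item"))  # inner array
      .select("item.*")                                # struct fields to columns
)
flat.show()
```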
Free up local checkpoint memory
I am writing an iterative algorithm and I am using localCheckpoint
to break the dataframe lineage as well as to persist the data for the next iteration. The problem is that unpersist
doesn't seem to work on a locally checkpointed dataframe, so the checkpointed data is never cleared (the Storage tab in the UI confirms this).
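A minimal sketch of the iterative pattern being described, so the symptom is reproducible; the unpersist call is the one the question reports as ineffective:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 1000).withColumn("v", F.rand())
for _ in range(5):
    prev = df
    # eager=True materializes the checkpoint and truncates the lineage
    df = df.withColumn("v", F.col("v") * 2).localCheckpoint(eager=True)
    prev.unpersist()  # reported as having no effect on the localCheckpoint
                      # blocks shown in the Storage tab
```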
Join text file content to existing dataframe
I have a metadata dataframe df_meta
that holds the URIs of text files whose content I need:
| id | s3_uri |
| --- | --- |
| 1 | s3://bucket1/prefix |
| 2 | s3://bucket2/another_prefix |
| 3 | s3://bucket3/a_different_prefix |
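One possible approach, sketched: read all the files in a single pass, tag each row with its source file, and join back to df_meta on the URI. This assumes spark.read.text can reach the buckets and that input_file_name() returns URIs matching the s3_uri column (a scheme mismatch such as s3:// vs. s3a:// would break the join):

```python
from pyspark.sql import functions as F

# collect the URIs from the metadata dataframe defined above
uris = [row.s3_uri for row in df_meta.select("s3_uri").collect()]

content = (
    spark.read.text(uris)                       # one row per line, all files
    .withColumn("s3_uri", F.input_file_name())  # tag each row with its source
    .groupBy("s3_uri")
    # note: collect_list does not guarantee the original line order
    .agg(F.concat_ws("\n", F.collect_list("value")).alias("content"))
)

joined = df_meta.join(content, on="s3_uri", how="left")
```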
Reading data from CSV in Spark
Thank you for taking the time to answer this question.
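For reference, a typical CSV read looks like the sketch below; the path and options are placeholders, since the question body isn't shown beyond the greeting:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("header", "true")       # first line holds column names
    .option("inferSchema", "true")  # let Spark guess column types
    .csv("/path/to/data.csv")
)
df.printSchema()
df.show(5)
```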