Split a value into the column name and the rest of the value
I have a list called segments containing column names. Each column value also contains the column name. For example, genre_list is a column which can have values such as genre_list_comedy or genre_list_drama_val_us.
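A minimal sketch of one way to do this split, assuming the values sit in a single column (called segment_value here) and that segments holds the known column-name prefixes; the sample data and names are made up:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed input shape: one column holding values that start with a known column name.
segments = ["genre_list", "lang_list"]
df = spark.createDataFrame(
    [("genre_list_comedy",), ("genre_list_drama_val_us",)],
    ["segment_value"],
)

# Pick the first known column name that prefixes the value, then take the remainder
# after that prefix and its trailing underscore.
col_name = F.coalesce(
    *[F.when(F.col("segment_value").startswith(s), F.lit(s)) for s in segments]
)
result = (
    df.withColumn("col_name", col_name)
      .withColumn("rest", F.expr("substring(segment_value, length(col_name) + 2)"))
)
result.show(truncate=False)
```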
Left Anti Join is not performing as expected – it is supposed to remove the records that are already present, but duplicates are appearing
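For context, a left anti join only drops left-side rows that have a match on the right; it does not deduplicate the left side itself, so repeated incoming rows survive. A small sketch with assumed column and table names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example: new_df holds incoming rows, existing_df holds rows already loaded.
existing_df = spark.createDataFrame([(1, "a")], ["id", "val"])
new_df = spark.createDataFrame([(1, "a"), (2, "b"), (2, "b")], ["id", "val"])

# The anti join removes id=1 (already present) but keeps both copies of id=2.
filtered = new_df.join(existing_df, on=["id"], how="left_anti")

# Deduplicate explicitly if only one copy of each incoming row is wanted.
deduped = filtered.dropDuplicates(["id"])
deduped.show()
```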
Is there a limit to the size of a path list that can be used in a DataFrameReader
I am doing
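Since the excerpt above is cut off, here is a generic, hedged illustration of passing a long list of paths to a DataFrameReader; the bucket layout is made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed layout: one directory per day. DataFrameReader accepts multiple paths
# either as separate arguments or as a list.
paths = [f"s3://bucket/data/day={i:02d}" for i in range(1, 32)]

df = spark.read.parquet(*paths)                    # varargs form
# df = spark.read.format("parquet").load(paths)    # list form also works
print(df.count())
```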
Error while calculating PySpark DataFrame size
I’m trying to calculate the DataFrame size to determine the number of partitions for repartitioning the DataFrame while writing to a Parquet file.
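One common way to get a rough size estimate is to read the optimizer's statistics through Spark's internal (non-public) API; this is only a sketch, assuming Spark 3.x and made-up paths, and the exact accessor chain can differ between versions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/to/input")          # hypothetical input

# Internal API: estimated size in bytes from the optimized logical plan's statistics.
size_in_bytes = int(str(df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()))

# Aim for roughly 128 MB per output partition.
target_bytes = 128 * 1024 * 1024
num_partitions = max(1, size_in_bytes // target_bytes)

df.repartition(num_partitions).write.mode("overwrite").parquet("/path/to/output")
```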
Create multiple DataFrames from JSON and write them in parallel in PySpark
I want to create multiple DataFrames from the given JSON and write them to Delta tables in parallel.
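A hedged sketch of one way to do this: split the source into per-entity DataFrames and drive the writes from a thread pool, since Spark actions are thread-safe on a single SparkSession. The input path, the entity column, and the output paths are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed input: one JSON source with an "entity" column marking which table a row belongs to.
src = spark.read.json("/path/to/input.json")
entities = ["orders", "customers", "items"]          # assumed entity names

def write_entity(entity):
    df = src.filter(F.col("entity") == entity)
    df.write.format("delta").mode("overwrite").save(f"/delta/{entity}")

# The thread pool lets the writes be scheduled concurrently instead of one after another.
with ThreadPoolExecutor(max_workers=3) as pool:
    list(pool.map(write_entity, entities))
```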
Load a Spark schema as a variable in PySpark
I have a PySpark job where I need to set the schema from a config, based on a parameter.
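A minimal sketch of one way to do this, assuming the config is a JSON file mapping dataset names to StructType.jsonValue() dictionaries; the file name, key, and input path are assumptions:

```python
import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

# Assumed config layout: {"sales": {...StructType json...}, "customers": {...}}
with open("schemas.json") as f:
    schema_config = json.load(f)

dataset_name = "sales"                    # would normally come from the job parameter
schema = StructType.fromJson(schema_config[dataset_name])

df = spark.read.schema(schema).json("/path/to/sales/")
```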
How to improve performance of a JDBC read in Spark
I am reading a large Oracle table over JDBC, with the number of partitions set to 8, on a standalone Spark setup with 8 cores and 64 GB of memory. It takes forever to load the data from an Oracle table with 2 million records.
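One frequent cause is that numPartitions only parallelises the read when partitionColumn, lowerBound, and upperBound are also supplied; otherwise the whole table comes through a single connection. A hedged sketch, with made-up connection details and a made-up numeric column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed connection details; TRANSACTION_ID stands in for a roughly uniform numeric column.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
    .option("dbtable", "SALES.TRANSACTIONS")
    .option("user", "scott")
    .option("password", "****")
    .option("partitionColumn", "TRANSACTION_ID")
    .option("lowerBound", "1")
    .option("upperBound", "2000000")
    .option("numPartitions", "8")
    .option("fetchsize", "10000")          # a larger fetch size cuts JDBC round trips
    .load()
)
```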
How can I start a PySpark file on a cluster if it is not supported by the documentation?
I’m trying to start a pyspark.py file in cluster mode but I receive this error:
Cluster deploy mode is currently not supported for python applications on standalone clusters.
So my question is: if PySpark uses Python, but Python is not supported in cluster mode, how can I use Spark?
How to ship & use custom python environment at all executor nodes in pyspark in a Jupyter notebook (running on dataproc)?
I am using pyspark in Jupyter notebook (running on dataproc) and have a custom python enviroment named m3.Linux_debian_11.py38.simple-env.tar.gz
which I want to ship to all the executor nodes and use this custom python library. I see that the custom enviroment (m3.Linux_debian_11.py38.simple-env.tar.gz) has been shipped to all executors in their current home directory. However the python version and libraries version it is using is the default one.
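A hedged sketch of the conda-pack style approach on YARN (which Dataproc uses): ship the archive with the job and point the executors at the Python inside it. These options must be set before the SparkSession (and its SparkContext) is created, so an already-running notebook session has to be stopped first:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Ship the archive; "#environment" is the directory name it is unpacked under.
    .config("spark.yarn.dist.archives",
            "m3.Linux_debian_11.py38.simple-env.tar.gz#environment")
    # Make executors run the interpreter from inside the unpacked archive.
    .config("spark.executorEnv.PYSPARK_PYTHON", "./environment/bin/python")
    .getOrCreate()
)

# Quick check that executors now report the shipped interpreter's version.
def python_version(_):
    import sys
    return sys.version

print(spark.sparkContext.parallelize([0], 1).map(python_version).collect())
```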
Filter the data on start and end days from a Delta table
I have a Delta table partitioned by year, month, and day. The partition columns are stored as strings.
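A hedged sketch of one way to filter such a table, assuming a made-up path and zero-padded month/day strings; rebuilding a date from the partition columns is convenient, though filtering on year, month, and day directly is what best preserves partition pruning:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed table location and zero-padded string partition values ("2023", "01", "05").
df = spark.read.format("delta").load("/path/to/delta_table")

with_date = df.withColumn(
    "part_date",
    F.to_date(F.concat_ws("-", "year", "month", "day"), "yyyy-MM-dd"),
)

start, end = "2023-01-01", "2023-03-31"
filtered = with_date.filter(F.col("part_date").between(start, end))
filtered.show()
```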