Tag Archive for apache-spark / pyspark

Split a value into column name and the rest of the value

I have a list called segments containing column names. Each column value also embeds the column name. For example, genre_list is a column whose values can be genre_list_comedy or genre_list_drama_val_us.
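One way to recover the two parts is to check each known column name as a prefix and strip it off. The sketch below is a plain-Python illustration under that assumption; the `segments` contents and the helper name `split_prefix` are hypothetical, and in a real PySpark job this logic could be wrapped in a UDF or replaced with built-in string functions.

```python
segments = ["genre_list", "actor_list"]  # hypothetical column names

def split_prefix(value, columns):
    """Return (column_name, rest) for a value like 'genre_list_comedy'.

    Falls back to (None, value) when no known column name is a prefix.
    """
    for col in columns:
        if value.startswith(col + "_"):
            return col, value[len(col) + 1:]
    return None, value

print(split_prefix("genre_list_comedy", segments))        # ('genre_list', 'comedy')
print(split_prefix("genre_list_drama_val_us", segments))  # ('genre_list', 'drama_val_us')
```

Note that the rest of the value ("drama_val_us") can itself contain underscores, which is why a naive `value.split("_")` is not enough; matching against the known column names avoids cutting the value in the wrong place.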

How to improve the performance of a JDBC read in Spark

I am reading a large Oracle table over JDBC with the number of partitions set to 8, on a standalone Spark setup with 8 cores and 64 GB of memory. Loading the table, which has 2 million records, takes forever.
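A common cause of this is that Spark only parallelizes a JDBC read when it is told how to split the table: `numPartitions` on its own does nothing without `partitionColumn`, `lowerBound`, and `upperBound`, and a larger `fetchsize` reduces round trips to Oracle. The sketch below shows the relevant options; the connection URL, table, credentials, and the `ID` column are hypothetical placeholders.

```python
# Hedged sketch: these are the standard Spark JDBC options for a
# partitioned read. All names/values below are illustrative assumptions.
jdbc_options = {
    "url": "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1",  # hypothetical DSN
    "dbtable": "BIG_TABLE",                             # hypothetical table
    "user": "scott",
    "password": "tiger",
    "driver": "oracle.jdbc.OracleDriver",
    # Split the read into 8 concurrent queries over a numeric key:
    # each partition reads a slice of [lowerBound, upperBound] on this column.
    "partitionColumn": "ID",
    "lowerBound": "1",
    "upperBound": "2000000",
    "numPartitions": "8",
    # Fetch rows in larger batches than the driver's small default.
    "fetchsize": "10000",
}

# In a live session (not runnable here without Spark and the Oracle driver):
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```

Without the partition-column trio, Spark issues a single query on one task, so the 8 cores sit idle regardless of `numPartitions`.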

How to ship & use a custom Python environment on all executor nodes in PySpark in a Jupyter notebook (running on Dataproc)?

I am using PySpark in a Jupyter notebook (running on Dataproc) and have a custom Python environment named m3.Linux_debian_11.py38.simple-env.tar.gz which I want to ship to all the executor nodes so they use this custom Python library. I can see that the custom environment (m3.Linux_debian_11.py38.simple-env.tar.gz) has been shipped to every executor's current working directory. However, the Python version and library versions actually in use are still the defaults.
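Shipping the archive alone is not enough: the executors must also be pointed at the Python interpreter inside the unpacked archive, otherwise they fall back to the default interpreter, which matches the symptom described. A minimal sketch, assuming the tarball was built with conda-pack or venv-pack and Spark 3.1+ (where `spark.archives` is available); the `#environment` alias and session-building code are illustrative.

```python
# Hedged sketch: the "#environment" suffix is the alias under which Spark
# unpacks the tarball in each executor's working directory.
archive = "m3.Linux_debian_11.py38.simple-env.tar.gz#environment"
env_python = "./environment/bin/python"  # interpreter inside the unpacked env

conf = {
    "spark.archives": archive,
    # Point executor Python at the shipped environment instead of the default.
    "spark.pyspark.python": env_python,
}

# These must be set before the SparkSession is created, e.g. in the notebook:
# from pyspark.sql import SparkSession
# spark = (SparkSession.builder
#          .config("spark.archives", archive)
#          .config("spark.pyspark.python", env_python)
#          .getOrCreate())
```

On a YARN cluster running an older Spark, the equivalent knobs are `spark.yarn.dist.archives` plus the `PYSPARK_PYTHON` environment variable; in either case, if the session already exists it must be stopped and recreated for the settings to take effect.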