Tag Archive for python, apache-spark, pyspark

How to use complex classes with Spark UDFs

Context: I have a job that generates a CSV from data in my company's data lake. The job is triggered once a day with a predefined configuration. It is implemented with Spark and Python and runs in an Airflow pipeline. The CSV is later uploaded to a particular customer. Case […]

Force no data exchange in pyspark when joining?

I am trying to perform joins, groupings, and similar operations more efficiently with pyspark by avoiding unnecessary exchanges. I have a situation where I first need to join a dataframe on columns (a, b, c), and later perform another join on columns (a, b, d).