Cleansing the data
Consider a scenario: I have two files, one a metadata file and the other a file of DataFrame values. The metadata file lists each column name and its data type.
How to loop over a list of tables containing business keys and pass each to a method that executes a data process
We do data validation by passing a source SQL view name like TEST_SCH.VIEWNAME and a target data lake Delta view like schema.deltaview, then comparing column and row counts. At the moment we have to write a separate script for every table pair.
How to calculate day difference with specified conditions between rows in pyspark
I am a PySpark beginner.
I have a DataFrame like the one below and want to calculate the day difference between the "first date" of a "Type D" row and the "last date" of the previous "Type I" row.
Getting a PySpark warning when doing a join operation with a column created by a pandas UDF
I have this code snippet to get the minimum distances of each user to all centroids:
How would you sort a column after applying a regex, moving all null values to the end, using Python and PySpark?
I currently apply basic asc/desc sort to a column in Python like this:
Merge rows in PySpark DataFrame based on partial overlapping values in a column
I would like to merge rows in PySpark DataFrame based on partial overlapping values in a column. Here’s a simplified example:
How to select items inside a Python list and add them to a DataFrame
I have a PySpark DataFrame with the columns below.
Pyspark Error: incorrect call to ‘column’
I am getting a PySpark reference error here (full code below): `if fi.stat().st_ctime >= today_midnight:`. I am working in Palantir Foundry, using a transform to rename a JSON file. Any idea what is causing this? Thanks!
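A reference (name) error on that line usually means `fi` or `today_midnight` is not defined in the scope where the comparison runs. The pure-Python sketch below shows both names defined: `today_midnight` as a POSIX timestamp (so it is comparable with the float from `Path.stat().st_ctime`) and `fi` bound by iterating a directory. The temp directory and file name are stand-ins; in Foundry the files would come from the transform's filesystem API instead.

```python
import tempfile
from datetime import datetime
from pathlib import Path

# Midnight of the current day as a POSIX timestamp, comparable with st_ctime
today_midnight = datetime.now().replace(
    hour=0, minute=0, second=0, microsecond=0
).timestamp()

# Stand-in directory and file; in Foundry, list files via the transform API
src_dir = Path(tempfile.mkdtemp())
(src_dir / "sample.json").write_text("{}")

recent = [
    fi for fi in src_dir.iterdir()
    if fi.is_file() and fi.stat().st_ctime >= today_midnight
]
print([p.name for p in recent])
```

If the comparison itself raises a TypeError rather than a NameError, check that `today_midnight` is a float timestamp and not a `datetime` object.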
Unable to display a DataFrame in a VS Code Jupyter notebook
I am trying to create a DataFrame in PySpark, but after writing all the steps properly, the program throws an error at the df.show() line.
Remove duplicate characters from a string: PySpark
I want to keep only the unique alphabetic characters in a PySpark string column. Please suggest a solution that does not use UDFs. I need a PySpark solution, not the various pure-Python solutions already on the forum. Thanks.