My PySpark DataFrame has a single String column that holds XML tag data.
I want to remove duplicate rows from the DataFrame. The usual DataFrame methods “dropDuplicates” and “distinct” are not working: the result still contains rows that I consider duplicates.
Please note that the column is stored as a String, and the actual content is XML tag data.
Please suggest a way to remove the duplicates from the DataFrame.
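One possible direction, sketched under an assumption the question does not confirm: the rows may look identical as XML but differ as strings (whitespace, attribute order), and “dropDuplicates” and “distinct” only compare exact string values. In that case, normalizing each XML string before deduplicating might help. The column name `xml_data` and the helper names `normalize_xml` and `drop_duplicate_xml_rows` are hypothetical, and `ET.canonicalize` requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def normalize_xml(xml_string):
    """Return a canonical form of an XML string, or a trimmed copy if it cannot be parsed."""
    if xml_string is None:
        return None
    try:
        # C14N 2.0 sorts attributes; strip_text drops leading/trailing whitespace in text nodes.
        return ET.canonicalize(xml_string, strip_text=True)
    except ET.ParseError:
        # Not a well-formed document (for example a fragment): fall back to trimming only.
        return xml_string.strip()


def drop_duplicate_xml_rows(df, xml_col="xml_data"):
    """Deduplicate a DataFrame on the normalized form of its XML column.

    `xml_col` is an assumed column name; replace it with the real one.
    """
    # Imported lazily so normalize_xml can be tried without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    normalize_udf = F.udf(normalize_xml, StringType())
    return (
        df.withColumn("_xml_norm", normalize_udf(F.col(xml_col)))
        .dropDuplicates(["_xml_norm"])  # compare normalized strings, keep one row per group
        .drop("_xml_norm")
    )
```

Usage would be something like `unique_df = drop_duplicate_xml_rows(df, "xml_data")`. If the strings differ only by surrounding whitespace, the built-in `trim` function followed by `dropDuplicates` may be enough and avoids the cost of a Python UDF.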