I have a PySpark DataFrame that needs to be ordered by a column (“Reference”).
The values in the column typically look like this:
["AA.1234.56", "AA.1101.88", "AA.904.33", "AA.8888.88"]
I already have a key function that sorts this list correctly:
myFunc = lambda x: [int(a) if a.isdigit() else a for a in x.split(".")]
which yields the required order:
["AA.904.33", "AA.1101.88", "AA.1234.56", "AA.8888.88"]
I want to order the DataFrame by applying this lambda. I tried sortByKey, but it is not clear how to apply it to just one specific column of the DataFrame. Any ideas?
A related, more general question: which kinds of use cases require converting a PySpark DataFrame to an RDD? The sortByKey function seems to apply only to RDDs, not to DataFrames.