My data looks like:
+----------+
|columnName|
+----------+
|     1 (1)|
|      null|
|   11 (10)|
|     2 (3)|
+----------+
The numbers may be either floats or integers, but for simplicity I used integers in the example.
My existing code does two things:
- If a column is being sorted, always move the rows with a null value for that column to the end
- Sort the remaining rows by the number in the parentheses
In the example above, I’d expect the sorted order of the numbers in parentheses to be 1, 3, 10, but instead they sort as 1, 10, 3. I think the sort is lexicographic, so “10” is compared character by character and lands before “3” because its first character is “1”.
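To illustrate what I suspect is happening, here is a minimal plain-Python sketch (not Spark): sorting the parenthesized values as strings reproduces the order I'm seeing, while sorting them as floats gives the order I expect. The variable names are just for illustration.

```python
# The numbers inside the parentheses from the example, kept as strings.
values = ["1", "10", "3"]

# String comparison goes character by character, so "10" sorts before "3".
as_strings = sorted(values)

# Numeric comparison after converting each value to a float.
as_numbers = sorted(values, key=float)

print(as_strings)  # ['1', '10', '3']
print(as_numbers)  # ['1', '3', '10']
```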
I’m trying to cast to a double, but can’t figure out why the cast isn’t helping. Does anyone know what’s wrong here? Thanks in advance.
from pyspark.sql import Column
from pyspark.sql import functions as F


def apply_regex(column: Column, pattern: str, group: int):
    extracted = F.regexp_extract(column, pattern, group)
    # regexp_extract returns "" when the pattern does not match
    return F.when(extracted == "", column).otherwise(extracted.cast("double"))


def sort(dataframe, column_name: str, direction: str):
    col = F.col(column_name)
    # Group 3 is the number inside the parentheses
    col = apply_regex(col, r"(\d+(\.\d+)?) \((\d+(\.\d+)?)\)", 3)
    col = F.when(col.isNull(), F.lit(None)).otherwise(col)
    sort_exprs = [col.asc() if direction == "asc" else col.desc()]
    return dataframe.orderBy(*sort_exprs)
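For reference, this is the behaviour I'm after, written as a plain-Python sketch rather than Spark. The names `PAREN_NUMBER`, `paren_value`, and `sort_cells` are hypothetical helpers made up for this illustration, and the pattern assumes every non-null cell looks like `<number> (<number>)`.

```python
import re
from typing import List, Optional

# Assumed cell format: "<number> (<number>)"; group 1 is the number in the parentheses.
PAREN_NUMBER = re.compile(r"\((\d+(?:\.\d+)?)\)")


def paren_value(cell: Optional[str]) -> Optional[float]:
    """Return the parenthesized number as a float, or None if there is none."""
    if cell is None:
        return None
    match = PAREN_NUMBER.search(cell)
    return float(match.group(1)) if match else None


def sort_cells(cells: List[Optional[str]], direction: str = "asc") -> List[Optional[str]]:
    """Sort by the parenthesized number; cells without one always go last."""
    with_value = [c for c in cells if paren_value(c) is not None]
    without_value = [c for c in cells if paren_value(c) is None]
    with_value.sort(key=paren_value, reverse=(direction != "asc"))
    return with_value + without_value


rows = ["1 (1)", None, "11 (10)", "2 (3)"]
print(sort_cells(rows, "asc"))   # ['1 (1)', '2 (3)', '11 (10)', None]
print(sort_cells(rows, "desc"))  # ['11 (10)', '2 (3)', '1 (1)', None]
```

The null row stays at the end in both directions, and the remaining rows are compared as numbers rather than as strings.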