I am working in Databricks, and I have a dataframe (let’s call it ‘df’) with several null rows within a column:
+—-+————+——-+
| id | date | value |
+—-+————+——-+
| 1 | 2024-01-01 | 0 |
| 1 | 2024-01-02 | null |
| 1 | 2024-01-03 | null |
| 1 | 2024-01-04 | null |
| 1 | 2024-01-05 | 4 |
| 1 | 2024-01-06 | null |
| 2 | 2024-03-20 | 0 |
| 2 | 2024-03-21 | null |
| 2 | 2024-03-22 | null |
| 2 | 2024-03-23 | 3 |
| 2 | 2024-03-24 | null |
| 2 | 2024-03-25 | 0 |
| 2 | 2024-03-26 | null |
| 2 | 2024-03-27 | null |
| 2 | 2024-03-28 | null |
| 2 | 2024-03-29 | 4 |
+—-+————+——-+
I need to partition by ‘id’ and order by ‘date’ to replace these nulls with the value of the previous row incremented by one, but I’m at a loss.
Pseudocode algorithm:
for (int i = 1; i < number_of_rows; i++)
{
if value[i] == null
{
value[i] = value[i-1] + 1
}
}
(it’d be okay to skip the first row and just accept the value as-is, null or not)
The output should therefore be the following:
+—-+————+——-+————–+
| id | date | value | value_filled |
+—-+————+——-+————–+
| 1 | 2024-01-01 | 0 | 0 |
| 1 | 2024-01-02 | null | 1 |
| 1 | 2024-01-03 | null | 2 |
| 1 | 2024-01-04 | null | 3 |
| 1 | 2024-01-05 | 4 | 4 |
| 1 | 2024-01-06 | null | 5 |
| 2 | 2024-03-20 | 0 | 0 |
| 2 | 2024-03-21 | null | 1 |
| 2 | 2024-03-22 | null | 3 |
| 2 | 2024-03-23 | 3 | 3 |
| 2 | 2024-03-24 | null | 4 |
| 2 | 2024-03-25 | 0 | 0 |
| 2 | 2024-03-26 | null | 1 |
| 2 | 2024-03-27 | null | 2 |
| 2 | 2024-03-28 | null | 3 |
| 2 | 2024-03-29 | 4 | 4 |
+—-+————+——-+————–+
I have already figured out how to replace the nulls with the previous value using the last function, e.g.:
w = Window.partitionBy('vin').orderBy('scan_date').rowsBetween(Window.unboundedPreceding, 0)
df = df.withColumn('value_filled', F.last('value', ignorenulls=True).over(w))
+—-+————+——-+————–+
| id | date | value | value_filled |
+—-+————+——-+————–+
| 1 | 2024-01-01 | 0 | 0 |
| 1 | 2024-01-02 | null | 0 |
| 1 | 2024-01-03 | null | 0 |
| 1 | 2024-01-04 | null | 0 |
| 1 | 2024-01-05 | 4 | 4 |
| 1 | 2024-01-06 | null | 4 |
| 2 | 2024-03-20 | 0 | 0 |
| 2 | 2024-03-21 | null | 0 |
| 2 | 2024-03-22 | null | 0 |
| 2 | 2024-03-23 | 3 | 3 |
| 2 | 2024-03-24 | null | 3 |
| 2 | 2024-03-25 | 0 | 0 |
| 2 | 2024-03-26 | null | 0 |
| 2 | 2024-03-27 | null | 0 |
| 2 | 2024-03-28 | null | 0 |
| 2 | 2024-03-29 | 4 | 4 |
+—-+————+——-+————–+
I’ve tried various approaches to add one to the previous column if the current column is null, but at most I can only get the first null in the group of nulls to update, such as:
w = Window.partitionBy('vin').orderBy('date').rowsBetween(Window.unboundedPreceding, 0)
df = df.withColumn('value', F.when(F.isnull(F.col('value')), F.last(F.col('value'),
ignorenulls=True).over(w) + 1).otherwise(F.col('value')))
+—-+————+——-+————–+
| id | date | value | value_filled |
+—-+————+——-+————–+
| 1 | 2024-01-01 | 0 | 0 |
| 1 | 2024-01-02 | null | 1 |
| 1 | 2024-01-03 | null | 1 |
| 1 | 2024-01-04 | null | 1 |
| 1 | 2024-01-05 | 4 | 4 |
| 1 | 2024-01-06 | null | 5 |
| 2 | 2024-03-20 | 0 | 0 |
| 2 | 2024-03-21 | null | 1 |
| 2 | 2024-03-22 | null | 1 |
| 2 | 2024-03-23 | 3 | 3 |
| 2 | 2024-03-24 | null | 4 |
| 2 | 2024-03-25 | 0 | 0 |
| 2 | 2024-03-26 | null | 1 |
| 2 | 2024-03-27 | null | 1 |
| 2 | 2024-03-28 | null | 1 |
| 2 | 2024-03-29 | 4 | 4 |
+—-+————+——-+————–+
Mikey Antonakakis is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.