I have some Parquet data that looks like this:
Name,Count,Result
ABC,500,"123,456,789,..."
ABC,499,"321,456,789..."
where Count is the number of values in the Result string. For any row with Count=499, I would like to replace Result with the value 32767 repeated 500 times. I'm hoping this could serve as an error marker of sorts. I thought about replacing it with np.nan, but I'm not sure how that would work with a string column. In Dask I was hoping this would be easy with:
fill_values_499 = np.array2string(np.full((500,), 32767), separator=",").strip("[]")
df[df["Count"] != 500]["Result"] = fill_values_499
But on examining the data afterwards I found that this had missed rows. I also couldn't find a way to treat the data as an array first and then do the fill based on the Count column.
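For reference, here is a minimal sketch of how the conditional fill might be written without chained indexing. `df[mask]["Result"] = ...` assigns into a temporary filtered frame, so the original is not reliably updated. The sketch uses `Series.mask` with `DataFrame.assign`, which exist in both pandas and Dask DataFrames. The demo runs on a small pandas frame, on the assumption that the same function would work on a Dask DataFrame. The names `EXPECTED_COUNT`, `SENTINEL`, `FILL` and `fill_short_rows` are made up for illustration. The fill string is built with a plain `join` because, as far as I can tell, `np.array2string` wraps long output across lines (default line width 75), which would put newlines and spaces into the string.

```python
import pandas as pd

# Hypothetical names, chosen for this sketch only.
EXPECTED_COUNT = 500
SENTINEL = 32767

# "32767,32767,...,32767" with exactly 500 entries and no line wrapping.
FILL = ",".join([str(SENTINEL)] * EXPECTED_COUNT)


def fill_short_rows(df):
    """Return a new frame where Result is replaced by FILL on rows
    whose Count differs from EXPECTED_COUNT."""
    bad = df["Count"] != EXPECTED_COUNT
    # mask() replaces values where the condition is True and keeps the rest.
    return df.assign(Result=df["Result"].mask(bad, FILL))


# Tiny stand-in for the real Parquet data.
demo = pd.DataFrame(
    {
        "Name": ["ABC", "ABC"],
        "Count": [500, 499],
        "Result": ["123,456,789", "321,456,789"],
    }
)
fixed = fill_short_rows(demo)
```

With Dask, the assumption is that `fill_short_rows(dd.read_parquet(...))` would stay lazy until the result is written out or computed.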
So overall:
- Can this operation be done in Dask? Maybe there’s a more efficient way to approach the problem.
- Could this operation be done in SQL, with something like DuckDB or SQLite? For my data this might be faster, since I'm working on a single machine.
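In SQL terms, the operation described above amounts to a conditional `UPDATE`. The following is a rough sketch using Python's built-in `sqlite3` module with an in-memory table. The table name `data` and the sample rows are assumptions for illustration, and loading the Parquet file into the table is not shown. Similar SQL would presumably work in DuckDB, which can also query Parquet files directly, but that is not tested here.

```python
import sqlite3

# Same fill string as in the question: 32767 repeated 500 times, comma-separated.
FILL = ",".join(["32767"] * 500)

conn = sqlite3.connect(":memory:")

# Hypothetical table mirroring the three columns from the question.
conn.execute('CREATE TABLE data ("Name" TEXT, "Count" INTEGER, "Result" TEXT)')
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [("ABC", 500, "123,456,789"), ("ABC", 499, "321,456,789")],
)

# Overwrite Result on every row whose Count is not 500.
conn.execute('UPDATE data SET "Result" = ? WHERE "Count" <> 500', (FILL,))
conn.commit()

rows = conn.execute(
    'SELECT "Count", "Result" FROM data ORDER BY "Count"'
).fetchall()
```

The fill string is passed as a bound parameter rather than pasted into the SQL text, which keeps the statement short and avoids quoting issues.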
EDIT: Another approach would be to explode the strings so that there is one number per row, but I think this transformation is very memory-intensive (at least in Dask), so it isn't practical.