I have data of this shape:
Name,Signal,Date
MyName,"1,2,3,4,5,6,7,8,9,10",19-04-2024
MyName,"1,2,3,4,5,6,7,8,9,10",19-04-2024
From here I've been trying to filter the Signal column by turning it into an array type. However, when using Dask specifically I've been running into some problems. Applying advice from that previous question and others, I'm currently using:
import dask.dataframe as dd
import pandas as pd

df = dd.read_csv("my_csv.csv")  # all columns are strings to start with
df["temp1"] = df["Signal"].map(lambda s: list(map(int, s.split(","))), meta=pd.Series("int32"))
df["temp1"].apply(sum, meta=pd.Series("int32")).compute()
# TypeError: unsupported operand type(s) for +: 'int' and 'str'
and the sum raises a TypeError.
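For comparison, the plain-pandas equivalent of the same steps runs without complaint (pdf here is just the same file loaded with pandas):

import pandas as pd

pdf = pd.read_csv("my_csv.csv")
pdf["temp1"] = pdf["Signal"].map(lambda s: list(map(int, s.split(","))))
print(pdf["temp1"].apply(sum))  # 55 for every row, as expected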
What's going wrong in the Dask version? I also tried many different options for the map, including NumPy array methods. I'd really like to use the column as an array so I can apply a function like:
from numba import njit

@njit
def filter_anomaly(row):
    if -100 in row or 300 in row:
        return False
    else:
        return True
and then add complexity as needed. I'm also interested in how this can be accelerated with CuPy without doing the comparisons row by row (otherwise the I/O overhead is too large); something like the sketch below is what I have in mind.
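This is only an untested sketch of the shape I'm after, not code I have working: it assumes every signal has the same length (10 here) so the column can be stacked into a single 2-D array, and it pulls everything into memory with compute() first.

import cupy as cp
import numpy as np

# row by row with the numba filter, how I'd expect to apply it today
mask = df["temp1"].map(filter_anomaly, meta=("temp1", "bool"))
clean = df[mask]

# what I'd like instead: stack all signals and compare on the GPU in one pass
in_mem = df.compute()
signals = cp.asarray(np.stack(in_mem["temp1"].to_list()))    # shape (n_rows, 10)
keep = ~((signals == -100) | (signals == 300)).any(axis=1)   # one vectorized check
clean_gpu = in_mem[cp.asnumpy(keep)]

If there's a way to keep this partitioned (map_partitions?) instead of computing everything up front, that would be ideal.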