I have some data (1M+ rows) that looks like the following, except that in reality the Signal column holds a much longer list:
Name, Signal, SName
Myname,"1,2,3,4,6,7,8,9,10",AnotherName
Myname,"1,1,1,1,5,3,2,7,3",AnotherName
I’m using dask to read the CSV and am then trying to convert the Signal column into a column of NumPy arrays. I tried the following:
import dask.dataframe as dd
import numpy as np
import pandas as pd

df = dd.read_csv("my_csv.csv")  # all columns are read as strings at first

# parse each comma-separated string into an int array
df["Signal"] = df["Signal"].apply(
    lambda s: np.fromiter(map(int, s.split(",")), dtype=np.int32).astype(int),
    meta=pd.Series("np.int32"),
)
At this point, I assumed I’d have an array-like object in each row, so I tried to run another apply with a function containing an array operation such as if val in arr. However, anything like this raises TypeError: 'in <string>' requires string as left operand, not int, so it looks like each element is still a string. What am I missing here?
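A quick way to check what the apply actually produced (same df as above; head() computes only the first partition, so it’s cheap even on 1M+ rows):

sample = df["Signal"].head()
print(type(sample.iloc[0]))  # <class 'str'> means the conversion isn't taking effect;
                             # <class 'numpy.ndarray'> means the problem is elsewhere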
- How can I cast my data into NumPy arrays so that I can run functions that operate on the Signal elements as arrays? (My best guess at a working version is sketched after this list.)
- Once that works, is there an easy way to GPU-accelerate it? Would the data be batched automatically so that the I/O overhead doesn’t outweigh the performance benefit? (A sketch of what I imagine this looks like also follows.)
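For the first question, this is my best guess at a self-contained working version. The tuple-form meta=("Signal", "object") is my reading of the dask docs for declaring an object column (one ndarray per row); I haven’t verified it against this exact data:

import dask.dataframe as dd
import numpy as np

df = dd.read_csv("my_csv.csv")

# meta declares the output: a column named Signal holding Python objects
df["Signal"] = df["Signal"].apply(
    lambda s: np.fromiter(map(int, s.split(",")), dtype=np.int32),
    meta=("Signal", "object"),
)

# an element-wise membership test should now see ndarrays, not strings
has_five = df["Signal"].apply(lambda arr: 5 in arr, meta=("Signal", "bool"))
print(has_five.head())

For the second question, my understanding is that RAPIDS’ dask-cudf is the usual route, and that dask’s partitioning already provides the batching: each partition becomes one cuDF DataFrame that is transferred to and processed on the GPU as a unit. A sketch of what I imagine this looks like, assuming a CUDA GPU with RAPIDS installed (I haven’t verified the list-column API):

import dask_cudf

gdf = dask_cudf.read_csv("my_csv.csv")

# cuDF's str.split should produce a list column that stays on the GPU
gdf["Signal"] = gdf["Signal"].str.split(",")
print(gdf.head())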