I have many rows with this shape (the Signal column is much longer in reality):
Name,Signal,Name2
Abc,"1,2,3,4,6,2,3,2,4,1,10,15,3,23,4",Abc2
Abc,"1,2,7,4,4,3,4,53,3,1,10,15,3,25,4",Abc3
The Signal column's length sometimes differs by one or two entries between rows. Using dask I can turn this into (CPU) numpy arrays and run numerical operations such as a sum:
import dask.dataframe as dd
import numpy as np
def sum_numpy(part):
    # per-row numpy sum over the partition
    return part.apply(lambda x: np.sum(x))
df = dd.read_csv("mycsv.csv")
# split each comma-separated string into a list of int32
df["raw_list"] = df.Signal.map_partitions(
    lambda s: list(map(np.int32, s.str.split(","))), meta=("raw_list", "object")
)
# build numpy arrays from those lists
df["n_d"] = df.raw_list.map_partitions(lambda s: np.array(s), meta=("n_d", "object"))
# per-row sums
df["sum"] = df.n_d.map_partitions(lambda s: sum_numpy(s), meta=("sum", "int64"))
However, when I replace dask with dask-cudf to try to accelerate this, the first map_partitions (the one that turns the column into a list of ints before using np.array) fails with:

TypingError: Failed in cuda mode pipeline (step: nopython frontend) Unknown attribute 'str' of type Masked(string_view)
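For completeness, this is roughly what the failing variant looks like (a minimal sketch, assuming nothing changes besides swapping in dask_cudf.read_csv):

import dask_cudf
import numpy as np

df = dask_cudf.read_csv("mycsv.csv")
# same conversion as the CPU version; this is where the TypingError above is raised
df["raw_list"] = df.Signal.map_partitions(
    lambda s: list(map(np.int32, s.str.split(","))), meta=("raw_list", "object")
)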
I also tried converting to CuPy arrays instead:
import cupy as cp
df["cupy"] = df.raw_list.apply(lambda s: cp.array(s), meta=("cupy", "object"))
print(df["cupy"].apply(lambda s: cp.sum(s), meta=("cpsum", "int64")).head())
but this leads to: AttributeError: 'str' object has no attribute 'sum'

which is strange, considering the approach is identical to the numpy version. Moreover, even if it did work, it would likely be slow: the device allocations wouldn't be blocked together into one array covering all the rows, which negates the GPU speedup.
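To illustrate what I mean by "blocked together", here is a toy sketch in plain CuPy (with made-up data) contrasting one host-to-device transfer per row against a single flattened transfer plus offsets:

import cupy as cp
import numpy as np

rows = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # toy stand-in for the parsed signals

# one device allocation + one host-to-device copy per row (what per-row cp.array does)
per_row = [cp.asarray(r, dtype=cp.int32) for r in rows]

# one allocation + one copy for everything, with offsets to recover the rows
flat = cp.asarray(np.concatenate([np.asarray(r, dtype=np.int32) for r in rows]))
offsets = np.cumsum([0] + [len(r) for r in rows])  # row i is flat[offsets[i]:offsets[i + 1]]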
My questions:

- How can I adapt my numpy method (converting the string data into arrays) to work with either cupy arrays or dask-cudf? (A sketch of what I've been experimenting with follows this list.)
- If I select some rows from the dataframe, how can I append them into one contiguous array?
- (Side question) Do I need map_partitions calling a function that contains apply, or can I use apply directly?
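In case it helps an answerer, this is the cudf-native direction I've been experimenting with for the first question. I'm not sure the str.split / .list accessors are the intended API for this, so treat it as a guess rather than a working solution:

import cudf

gdf = cudf.read_csv("mycsv.csv")
lists = gdf.Signal.str.split(",", expand=False)  # a list<string> column, one list per row
flat = lists.list.leaves.astype("int32")         # all elements as one flat, contiguous column
lengths = lists.list.len()                       # per-row lengths, to slice rows back out
arr = flat.values                                # .values on a cudf Series is a cupy array

Even if this is a sensible starting point, I still don't see how to get the per-row reductions (the sums) out of the flat column efficiently.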