I have a dataframe in pandas that looks like this:
>>> df[['BranchNumber', 'ModelArticleNumber', 'ActualSellingPrice']].info()
<class 'pandas.core.frame.DataFrame'>
Index: 447970 entries, 0 to 500734
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 BranchNumber 447970 non-null int64
1 ModelArticleNumber 447970 non-null object
2 ActualSellingPrice 447970 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 13.7+ MB
Given that there are 463 unique branch numbers and 5,235 model article numbers, I’d like to create a 463×5,235 sparse matrix with indices. The following kludge is how far I’ve gotten:
import numpy as np
import scipy
def index(df):
ix = { v: n for n, v in enumerate(df.unique()) }
return np.array(list(map(lambda value: ix[value], df)))
def csr(df, row_tag, col_tag, value_tag):
m = scipy.sparse.csr_matrix(
(
df[value_tag],
(
index(df[row_tag]),
index(df[col_tag])
)
)
)
return m
I’d like to be able to do things like m.T.dot(m)
efficiently, and retain the ability to see which key is associated with the rows and columns. I’ve looked through the documentation on pandas indices and arrays, and through the scipy docs on sparse matrices, and I’m just not there. Can somebody point me in the right direction?
I’d like to be able to do things like m.T.dot(m) efficiently, …
This design looks like a reasonable way to do that.
… and retain the ability to see which key is associated with the rows and columns.
That’s more complicated. SciPy sparse arrays don’t natively support labeled data. You could keep the row/column mapping around in a separate variable, though. You can use pd.factorize()
to create the row/column values and the row/column index simultaneously.
e.g.
def csr(df, row_tag, col_tag, value_tag):
rows, row_labels = pd.factorize(df[row_tag])
cols, col_labels = pd.factorize(df[col_tag])
m = scipy.sparse.csr_matrix(
(
df[value_tag],
(
rows,
cols
)
),
shape=(len(row_labels), len(col_labels))
)
return m, row_labels, col_labels
Then if you want to know the label for what is in m[10, 20]
, you can look at row_labels[10]
and col_labels[20]
.
Alternatively, you could use a sparse DataFrame. Pandas has a dtype, SparseDtype, which stores the data for a column in a sparse format. Using the previous csr()
method, you could do something like:
m, row_labels, col_labels = csr(df, 'BranchNumber', 'ModelArticleNumber', 'ActualSellingPrice')
df_sparse = pd.DataFrame.sparse.from_spmatrix(m, index=row_labels, columns=col_labels)
That gives you a labelled sparse DataFrame.