I am trying to use an R package from a Python script via rpy2. The package (msImpute) uses statistical approaches to replace NaN values in numerical data, and I have set up the Python code as best I can to call its functions. Here is a minimal reproducible example:
import pandas as pd
import numpy as np
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects.vectors import StrVector
from rpy2.robjects import pandas2ri
from rpy2.robjects import r
import string
import random
pandas2ri.activate()
utils = importr('utils')
msImpute = importr('msImpute')
# Function to generate random column names
def generate_column_names(n, suffixes):
    columns = []
    for _ in range(n):
        name = ''.join(random.choices(string.ascii_uppercase, k=3))  # Random 3-character string
        suffix = random.choice(suffixes)  # Randomly choose between "_Healthy" and "_Sick"
        columns.append(name + suffix)
    return columns
# Number of rows and columns
n_rows = 1000
n_cols = 15
# Generate random float values between 0 and 10
data = np.random.uniform(0, 10, size=(n_rows, n_cols))
# Introduce NaN values sporadically
nan_indices = np.random.choice([True, False], size=data.shape, p=[0.1, 0.9])
data[nan_indices] = np.nan
# Generate random column names
column_names = generate_column_names(n_cols, ["_Healthy", "_Sick"])
# Create the DataFrame
df = pd.DataFrame(data, columns=column_names)
df = df.replace(np.nan, "NA")
groups_list = [col.split("_")[-1] for col in df.columns]
groups_vector_r = StrVector(groups_list)
# Convert the DataFrame to an R data.frame, then to an R matrix
df_r = pandas2ri.py2rpy(df)
r_matrix = r('as.matrix')(df_r)
# Print the shape of the r_matrix to verify
print("Shape of r_matrix:", ro.r['dim'](r_matrix))
#r_matrix = r('as.matrix')(df_r)
imputed_r_matrix = msImpute.msImpute(r_matrix, method="v2-mnar", group=groups_vector_r)
However, the last line always fails with this error:
R[write to console]: Error in rowSums(!is.na(y)) : 'x' must be an array of at least two dimensions
Printing the shape with print("Shape of r_matrix:", ro.r['dim'](r_matrix)) gives:
Shape of r_matrix: [1] 15000
This is confusing to me, since the original pandas DataFrame has 1000 rows and 15 columns, yet its dimensions do not seem to survive the conversion to an R matrix. How can I retain the dimensionality? In example R code that uses the same package, the example data bundled with the package has dimensions 2357 x 13 (please see below).
#msImpute testing
library(msImpute)
library(limma)
library(rrcovNA)
library(imputeLCMD)
library(ggplot2)
library(patchwork)
library(ggsci)
library(ggExtra)
#msImpute
data(pxd010943)
y <- log2(data.matrix(pxd010943))
#group is a vector of strings for group names
group <- gsub("_[1234]","", colnames(y))
yimp <- msImpute(y, method="v2-mnar", group=group)
dim(y)
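Coming back to the Python side, here is a small pandas-only check (no rpy2 or R involved) of what the replace(np.nan, "NA") step in my script does to the column dtypes. The two-column frame and its column names are made up for illustration, and whether this step has anything to do with the lost dimensions is only my assumption, not something I have confirmed:

```python
import numpy as np
import pandas as pd

# Small stand-in for the 1000 x 15 frame above (hypothetical column names).
small = pd.DataFrame(
    {"ABC_Healthy": [1.0, np.nan, 3.0], "XYZ_Sick": [np.nan, 5.0, 6.0]}
)

# Replacing NaN with the string "NA" mixes floats and strings in each column,
# so the columns are no longer numeric.
replaced = small.replace(np.nan, "NA")

numeric_before = all(pd.api.types.is_numeric_dtype(t) for t in small.dtypes)
numeric_after = all(pd.api.types.is_numeric_dtype(t) for t in replaced.dtypes)

# Leaving NaN in place keeps a plain two-dimensional float array.
values = small.to_numpy(dtype=float)

print("numeric before replace:", numeric_before)
print("numeric after replace:", numeric_after)
print("shape of float array:", values.shape)
```

So after the replace the frame holds mixed float/string columns rather than plain floats, while the untouched frame still converts to a 2-D float array on the Python side.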