So, I have been looking through similar problems and questions for ways to fill a data structure iteratively, be it a Numpy array or a Pandas DataFrame for example, and I could not find a proper answer.
Many answers advocate simply not doing it; indeed, reallocating the Numpy array or Pandas DataFrame to the new size at every step is very costly.
On the other hand, unless I am missing something, whenever you need to build a whole dataset from different files, I fail to see what good ways there are.
It is even more of a problem when the dataset mixes different data types.
For example, the following has more or less the same issue as concatenating a DataFrame, even if it is faster and uses less memory:
import numpy as np
import pandas as pd

# lists to store metadata values for each file
l_fn = []
l_exp = []
l_tc = []
l_cor = []
mat_im = np.empty(0)
for f in files_list:
    # extract_data retrieves various metadata and data values from a file on drive
    # fn, tc are strings
    # exp, cor are integers but could be processed as strings
    # im is a 1D numpy array (row vector)
    fn, im, exp, tc, cor = extract_data(f)
    l_fn.append(fn)
    l_exp.append(exp)
    l_tc.append(tc)
    l_cor.append(cor)
    # stacking the row vectors into a matrix
    # reallocating issue: vstack copies the whole matrix at every iteration
    if mat_im.size == 0:
        mat_im = im
    else:
        mat_im = np.vstack((mat_im, im))
df1 = pd.DataFrame(list(zip(l_tc, l_exp, l_cor)))
df2 = pd.DataFrame(mat_im)
df = pd.concat([df1, df2], axis=1)
This seems neither elegant nor efficient. (Sorry for the variable names, I was not inspired at all.)
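If it helps make the "reallocating issue" concrete: the only workaround I can think of for the Numpy part is to collect the row vectors in a plain Python list and stack them once after the loop, as sketched below (reusing the same hypothetical files_list and extract_data as above), but that still feels like a workaround rather than a proper pattern.

l_im = []
for f in files_list:
    fn, im, exp, tc, cor = extract_data(f)
    l_fn.append(fn)
    l_exp.append(exp)
    l_tc.append(tc)
    l_cor.append(cor)
    l_im.append(im)           # appending to a list does not copy the data
mat_im = np.vstack(l_im)      # single allocation and copy, once, after the loop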
Of course, DataFrames can replace the lists and the numpy array, and while that is less ugly, it is worse both performance-wise and memory-wise.
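For reference, this is roughly what I mean by the all-DataFrame version (same hypothetical extract_data, column names omitted); growing the DataFrame inside the loop reallocates and copies everything at each iteration:

df = pd.DataFrame()
for f in files_list:
    fn, im, exp, tc, cor = extract_data(f)
    # one row per file: metadata first, then the pixel values
    row = pd.DataFrame([[fn, tc, exp, cor] + list(im)])
    # every concat reallocates and copies the whole DataFrame built so far
    df = pd.concat([df, row], ignore_index=True)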
What would be a good way to do it?