I want to load a pandas DataFrame into a Dask DataFrame using the dd.read_hdf()
method. I create a very basic pandas DataFrame, then I separate the values from the column headers and index and save them to an HDF5 file. I can read the HDF5 file back and recreate the original DataFrame, so that part looks OK.
However, I cannot read the HDF5 file with dd.read_hdf()
; it throws an error:
TypeError: An error occurred while calling the read_hdf method registered to the pandas backend.
Original Message: cannot create a storer if the object is not existing nor a value are passed
Below is a minimal example that reproduces the error above:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import h5py

def save_hdf5(df, hdf5_path):
    # Separate the DataFrame values, column names, and index
    values = df.values
    columns = df.columns.to_numpy()
    index = df.index.to_numpy()
    # Save to HDF5 file with different datasets
    with h5py.File(hdf5_path, 'w') as hdf:
        hdf.create_dataset('values', data=values)
        hdf.create_dataset('columns', data=columns)
        hdf.create_dataset('index', data=index)

def load_hdf5(hdf5_path):
    # Load the data from the HDF5 file
    with h5py.File(hdf5_path, 'r') as hdf:
        values = hdf['values'][:]                # Load entire dataset into memory
        columns = hdf['columns'][:].astype(str)  # Load and convert back to strings
        index = hdf['index'][:].astype(str)      # Load and convert back to strings
    return pd.DataFrame(values, columns=columns, index=index)

# Create a simple pandas DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
df.index = ['one', 'two', 'three']

# File path for the HDF5 file
my_path = 'my_df.h5'
save_hdf5(df, my_path)
my_df = load_hdf5(my_path)
If I now call dd.read_hdf(my_path, key="values")
I get the error above. What am I doing wrong? It looks to me as if the store is empty. Am I missing something really basic?
store = pd.HDFStore(my_path)
print(store.keys())