I have a list of files (X_files and Y_files) containing numpy arrays of shape N x … and N x …, where N is the sample dimension. I can load them into a tensorflow dataset like this:
import numpy as np
import tensorflow as tf

def load_data(X_file, Y_file):
    # Load all N samples of one file pair as two float32 arrays.
    X_data = np.load(X_file).astype(np.float32)
    Y_data = np.load(Y_file).astype(np.float32)
    return X_data, Y_data

def make_dataset(X_files, Y_files):  # wrapper name is only for illustration
    file_dataset = tf.data.Dataset.from_tensor_slices((X_files, Y_files))
    dataset = file_dataset.map(lambda X_file, Y_file: tf.numpy_function(
        load_data, [X_file, Y_file], [tf.float32, tf.float32]))
    return dataset
This works as expected. If I print the shapes:

for feat, labels in train_dataset:
    print(feat.shape, labels.shape)

I can see that the samples from each file have indeed been loaded as a single object:
(54, 24, 3600) (54, 24)
(61, 24, 3600) (61, 24)
(83, 24, 3600) (83, 24)
(83, 24, 3600) (83, 24)
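
(As an aside, for reproducibility: X_files and Y_files are just lists of .npy paths. Dummy files with the same layout could be generated like this, with the per-file sample counts and shapes taken from the printout above:)

# Hypothetical dummy data, only to make the example reproducible;
# the per-file sample counts N vary, the trailing dimensions match my data.
X_files, Y_files = [], []
for i, n in enumerate([54, 61, 83, 83]):
    x_path, y_path = f"X_{i}.npy", f"Y_{i}.npy"
    np.save(x_path, np.random.rand(n, 24, 3600).astype(np.float32))
    np.save(y_path, np.random.rand(n, 24).astype(np.float32))
    X_files.append(x_path)
    Y_files.append(y_path)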
What I then did was to use unbatch to flatten along the sample dimension, and then batch properly:
dataset = dataset.unbatch()
dataset = dataset.shuffle(buffer_size=100).batch(16)
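
As far as I understand, tf.numpy_function does not propagate static shape information, so I would expect the element spec to already be unknown at this point (a quick check; the comment shows what I expect rather than verified output):

print(dataset.element_spec)
# Expected (not verified), something like:
# (TensorSpec(shape=<unknown>, dtype=tf.float32, name=None),
#  TensorSpec(shape=<unknown>, dtype=tf.float32, name=None))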
Indeed, model.fit then complains about unknown shapes. I am quite sure the problem is in how this tensorflow dataset is constructed, because I am able to build a working tensorflow dataset using “from_generator”. But how can I modify this approach based on from_tensor_slices so that it does not return errors?
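
For reference, the working “from_generator” construction I have in mind looks roughly like this (a simplified sketch, function and variable names are just illustrative; the fixed per-sample shapes (24, 3600) and (24,) are taken from the printout above):

def sample_generator():
    # Yield one (X, Y) sample at a time from every file pair.
    for X_file, Y_file in zip(X_files, Y_files):
        X_data = np.load(X_file).astype(np.float32)
        Y_data = np.load(Y_file).astype(np.float32)
        for x, y in zip(X_data, Y_data):
            yield x, y

gen_dataset = tf.data.Dataset.from_generator(
    sample_generator,
    output_signature=(
        tf.TensorSpec(shape=(24, 3600), dtype=tf.float32),
        tf.TensorSpec(shape=(24,), dtype=tf.float32),
    ),
)
gen_dataset = gen_dataset.shuffle(buffer_size=100).batch(16)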