I’m training a Keras Sequential model on a binary classification problem. My dataset is stored in HDF format across many files, which are often too large to fit in memory. To handle this, I tried a TensorFlow input-pipeline approach based on tf.data.Dataset.from_generator. However, after many attempts (using resources like this post), I noticed that the trained model performs significantly worse.
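To give an idea of the kind of pipeline I mean, here is a rough sketch (file paths, the HDF key, the label column name and the chunk size are just placeholders, and the files are assumed to be written with format='table' so they can be read in chunks); the problem reproduces without HDF, so the replication code further down doesn’t use it at all:

import tensorflow as tf
import pandas as pd

n_features = 5  # placeholder; the real files have more columns

def hdf_generator(paths=("part_0.h5", "part_1.h5"), key="data", chunksize=100_000):
    """Stream (features, labels) chunks from a list of HDF files."""
    for path in paths:
        # Reading in chunks keeps memory bounded (requires table-format HDF files)
        for chunk in pd.read_hdf(path, key=key, chunksize=chunksize):
            features = chunk.drop(columns=["label"]).to_numpy("float32")
            labels = chunk[["label"]].to_numpy("int32")
            yield features, labels

ds = tf.data.Dataset.from_generator(
    hdf_generator,
    output_signature=(
        tf.TensorSpec(shape=(None, n_features), dtype=tf.float32),
        tf.TensorSpec(shape=(None, 1), dtype=tf.int32),
    ),
)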
I managed to replicate the issue with a generic dataset using the following code:
import tensorflow as tf
import tensorflow.keras as keras
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
def set_seeds(seed=999):
    tf.random.set_seed(seed)
    np.random.seed(seed)
    tf.keras.utils.set_random_seed(seed)  # Sets seeds for base Python, NumPy, and TF
n_features = 5
# Generate synthetic dataset and split into training and testing sets
X, y = make_classification(n_samples=1_000_000, n_features=n_features, n_classes=2, random_state=42)
X_train = pd.DataFrame(X)
y_train = pd.DataFrame(y)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
# Define a Sequential model
def define_model(n_features=n_features):
    model = keras.models.Sequential([
        keras.layers.BatchNormalization(input_shape=(n_features,)),
        keras.layers.Dense(10, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid')
    ])
    return model
def compile(model):
    # Compile the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
Training based on the DataFrame approach
When I train the model using pd.DataFrame, I get the following results:
# Train based on pd.DataFrame
set_seeds()
model = define_model()
compile(model)
history = model.fit(X_train, y_train, epochs=10, batch_size=10_000, validation_split=0.0, verbose=True)
ax = pd.DataFrame(history.history).plot()
ax.set_ylim(0, 1.0)
# Evaluate on the test sample
model.evaluate(X_test, y_test, batch_size=10_000)
This takes around 5 seconds and gives me an accuracy of ~0.96.
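(For scale: the training split has 800,000 rows, so with batch_size=10_000 each of these 10 epochs performs 80 gradient updates, i.e. about 800 updates in total.)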
Training based on the Dataset approach
However, when I train using tf.data.Dataset with the following code:
class Generator:
    def __call__(self):
        yield X_train.values, y_train.values

ds_train = tf.data.Dataset.from_generator(
    Generator(),
    output_signature=(
        tf.TensorSpec(shape=(None, n_features), dtype=tf.float32),
        tf.TensorSpec(shape=(None, 1), dtype=tf.int32)
    )
)
set_seeds()
model = define_model()
compile(model)
history = model.fit(ds_train, epochs=10, batch_size=10_000, verbose=True)
ax = pd.DataFrame(history.history).plot()
ax.set_ylim(0, 1.0)
model.evaluate(X_test, y_test, batch_size=10_000)
I get an accuracy of only ~0.6, although this run takes just about 2.5 seconds. To reach the same accuracy as the DataFrame approach, I need to train for around 400 epochs, which takes about 40 seconds.
Why does the Dataset approach perform worse?
I’m not sure whether I’m misunderstanding the usage of tf.data.Dataset at a fundamental level, but this behavior is unexpected. It seems as though each epoch in the Dataset approach is far less informative than an epoch in the DataFrame approach.
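A quick check of ds_train (as defined above) seems to back this up: the generator yields the whole training set as a single element, so the fit() progress bar shows a single step ("1/1") per epoch, i.e. one gradient update per epoch instead of the 80 updates per epoch of the DataFrame run. If I read the Keras docs correctly, the batch_size argument to fit() is ignored when x is a tf.data.Dataset, so it doesn’t re-batch anything here.

# Inspect what one "epoch" of ds_train actually contains
for features, labels in ds_train.take(1):
    print(features.shape, labels.shape)  # (800000, 5) (800000, 1): the entire training set in one element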
Does anyone have insights or suggestions on why this is happening and how to fix it?
I’m using TF 2.13.1 and Python 3.11.3.