Question

I am studying ML with TensorFlow, and I wanted to deliberately overfit a model to check whether my data formatting makes sense.

To achieve overfitting I am following a few ideas:

not applying any regularizer
using a large batch size
using a reasonable learning rate.

I am interested in minimizing the mean absolute error. Ideally, in my training, my MAE should get close to 0 for the batch, and follow a similar trend for the whole train dataset. I am scaling the train dataset as follows:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import joblib


def PrepareData(dataset_path):
    """Prepare and scale the data."""
    dataset = np.load(dataset_path, allow_pickle=True)
    X = dataset[:, :-1]
    y = dataset[:, -1]

    # Create and fit scalers
    feature_scaler = StandardScaler()
    X_scaled = feature_scaler.fit_transform(X)

    label_scaler = MinMaxScaler()
    y_scaled = label_scaler.fit_transform(y.reshape(-1, 1)).flatten()

    # Save scalers and reference data
    joblib.dump(feature_scaler, 'feature_scaler.pkl')
    joblib.dump(label_scaler, 'label_scaler.pkl')
    np.save('reference_X.npy', X)  # Save reference data
    np.save('reference_y.npy', y)

    # Reshape features for the model
    X_scaled = X_scaled[:, :, np.newaxis]

    return X_scaled, y_scaled, y, feature_scaler, label_scaler

I managed to get a sufficiently low MAE. Since this is for study purposes, I'm not interested in improving the model any further at this point.


Epoch 407/500
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - loss: 0.0251 - mean_absolute_error: 0.0251
Epoch 407: mean_absolute_error improved from 0.03041 to 0.02641, saving model to model_default.keras
Calculated Mean Absolute Error (Original Scale): 1.3444
Batch-wise Scaled MAE: 0.022792
Keras MAE (logs): 0.026412
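
Because the labels go through a MinMaxScaler, a scaled MAE and an original-scale MAE should differ only by one factor, the label range (max minus min). The following is a minimal sketch of that relation with made-up labels. The arrays and the `scale` helper are hypothetical, and plain NumPy is used to mirror MinMaxScaler's default [0, 1] range rather than the asker's actual pipeline.

```python
import numpy as np

# Hypothetical labels and predictions, not the data from the question.
y = np.array([-70.0, -68.5, -67.0, -20.0])
y_pred = np.array([-69.0, -68.0, -67.5, -21.0])

# Min-max scaling to [0, 1], as MinMaxScaler does with its default feature_range.
y_min, y_max = y.min(), y.max()


def scale(values):
    return (values - y_min) / (y_max - y_min)


mae_original = np.mean(np.abs(y_pred - y))
mae_scaled = np.mean(np.abs(scale(y_pred) - scale(y)))

# The offset y_min cancels in the difference, so only the range remains.
assert np.isclose(mae_original, mae_scaled * (y_max - y_min))
```

If the logged 1.3444 (original scale) and 0.026412 (scaled) describe the same predictions, the implied label range would be about 50.9.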

For testing, I am using some random samples from the exact same training dataset (just for practice and learning; I am aware that in principle the train, test, and validation sets should be different).
However, whenever I call the Test() function, I get predictions that differ widely from those the model produced during training!

Example:

Training:
Sample 154: Predicted = -69.3622, Actual = -67.8924, Absolute Error = 1.4698
Sample 194: Predicted = -66.9870, Actual = -67.3688, Absolute Error = 0.3818

Test:
Sample 154: Predicted: -190.6574, Actual: -67.8924, Diff: -122.7650
Sample 194: Predicted: -69.2418, Actual: -67.3688, Diff: -1.8730

I don't understand whether it's a scaling issue or whether I am missing something else.
Here is my Test() method:

import numpy as np
from keras.models import load_model
import joblib
from os import walk, path

# Load scalers and model
feature_scaler = joblib.load('feature_scaler.pkl') 
label_scaler = joblib.load('label_scaler.pkl') 
model = load_model('model_default.keras')


def Test(args_):
    ref_X = np.load('reference_X.npy')  # references
    # Iterate over every .npy file under ./predictions
    for folder, _, samples in walk('./predictions'):
        for file in samples:
            if file.endswith('npy'):
                name = file.split(".")[0]
                sample_path = path.join(folder, file)

                # Load the sample
                sample_data = np.load(sample_path)

                # Make sure the sample is 2D: (n_samples, n_features)
                if sample_data.ndim == 1:
                    sample_data = sample_data.reshape(1, -1)

                # Zero-pad if the sample has fewer features than the reference (none do)
                if sample_data.shape[1] < ref_X.shape[1]:
                    padding = np.zeros((sample_data.shape[0], ref_X.shape[1] - sample_data.shape[1]))
                    sample_data = np.hstack((sample_data, padding))

                # Scale the features with the scaler loaded from feature_scaler.pkl
                sample_scaled = feature_scaler.transform(sample_data)
                sample_scaled = sample_scaled.reshape(sample_scaled.shape[0], sample_scaled.shape[1], 1)

                # Predicting
                prediction_scaled = model.predict(sample_scaled, verbose=0)

                # converting from the prediction to the real data
                prediction_rescaled = label_scaler.inverse_transform(prediction_scaled)

                # Read from the source file
                real = None
                with open('source.txt', 'r') as comp:
                    for line in comp.readlines():
                        if name in line:
                            real = float(line.split()[3])
                            break

                # calculate differences
                if real is not None:
                    diff = prediction_rescaled[0][0] - real
                    print(f"{name} - Predicted: {prediction_rescaled[0][0]:.4f}, "
                          f"Real: {real:.4f}, Diff: {diff:.4f}")
                print("-" * 50)
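
Since the test samples are meant to be rows of the training set, one sanity check before scaling is to confirm that each loaded sample really equals a row of reference_X.npy (for instance, that the file does not also carry the label column). This is a minimal sketch under that assumption. `find_matching_row` and the toy arrays are hypothetical and only stand in for the real files.

```python
import numpy as np


def find_matching_row(sample, ref_X, atol=1e-8):
    """Return the index of the row in ref_X that equals sample, or None."""
    sample = np.asarray(sample, dtype=float).reshape(-1)
    if sample.shape[0] != ref_X.shape[1]:
        return None  # different number of features, so it cannot be the same row
    row_matches = np.all(np.isclose(ref_X, sample, atol=atol), axis=1)
    indices = np.where(row_matches)[0]
    return int(indices[0]) if indices.size else None


# Hypothetical data standing in for reference_X.npy and two sample files.
ref_X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(find_matching_row([4.0, 5.0, 6.0], ref_X))         # 1
print(find_matching_row([4.0, 5.0, 6.0, -67.9], ref_X))  # None
```

A sample that returns None here is not the same input the model saw during training, so its prediction would not be expected to match the training-time one.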

I just want to reproduce the training results, but I can't work out why I can't.
I tried:

different scaling
saving and loading the scaling factors as .pkl
recalculating the scaling factor on the training data
recalculating the scaling factor on the test data
using the data average, min, and max.
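
To rule out the .pkl round trip itself, it is possible to check that a scaler reloaded with joblib reproduces the transform of the fitted one. This is a minimal sketch with a made-up feature matrix and a temporary file path, both hypothetical. It says nothing about whether the sample files match the training rows.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features standing in for the training matrix X.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler = StandardScaler().fit(X)

# Save and reload the fitted scaler, as done with feature_scaler.pkl.
with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = os.path.join(tmp_dir, "feature_scaler.pkl")
    joblib.dump(scaler, scaler_path)
    reloaded = joblib.load(scaler_path)

# The reloaded scaler should give the same output as the original one.
same_transform = np.allclose(scaler.transform(X), reloaded.transform(X))
print(same_transform)
```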

Thanks!

Prediction on the same sample differs from training to testing