I don't understand why I need to retrain the model starting from the first model. Why does increasing the number of epochs not improve the results? I played with different hyperparameters, optimisers, and step sizes, and this is always the case.
Also, any suggestions for improvements will be appreciated.
I have a simple GCN model:
`import torch
import torch.nn.functional as F
from torch.nn import Linear, Dropout, BatchNorm1d
from torch_geometric.nn import GCNConv, GATv2Conv
from torch_geometric.nn import global_mean_pool as gap, global_max_pool as gmp

embedding_size = 60

class GCN(torch.nn.Module):
    """Graph Convolutional Network"""
    def __init__(self):
        super().__init__()
        # Two graph convolution layers: 4 input node features -> embedding_size
        self.gcn1 = GCNConv(4, embedding_size)
        self.gcn2 = GCNConv(embedding_size, embedding_size)
        # MLP head mapping the pooled graph embedding to one regression value
        self.out = Linear(embedding_size, 100)
        self.out2 = Linear(100, 50)
        self.out3 = Linear(50, 1)

    def forward(self, x, edge_index, batch_index):
        h = self.gcn1(x, edge_index).tanh()
        h = self.gcn2(h, edge_index).tanh()
        # Global max pooling: one embedding vector per graph
        hidden = torch.cat([gmp(h, batch_index)], dim=1)
        # Apply the final linear layers.
        hidden = self.out(hidden).tanh()
        hidden = self.out2(hidden).tanh()
        hidden = self.out3(hidden)
        # Both returned values are the prediction, shape [num_graphs, 1]
        return hidden, hidden

model = GCN()
print(model)
print("Number of parameters: ", sum(p.numel() for p in model.parameters()))

# Mean squared error
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, weight_decay=1e-2)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)
NUM_GRAPHS_PER_BATCH = 32`
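One thing that may be worth checking with this setup is the shape of the targets. `self.out3` produces predictions of shape `[num_graphs, 1]`. If `batch.y` is 1-D with shape `[num_graphs]` (an assumption, since the post does not show the dataset), `MSELoss` broadcasts the two to `[num_graphs, num_graphs]` and averages over every prediction/target pair instead of only the matching ones. A minimal sketch with made-up numbers:

```python
import warnings
import torch

loss_fn = torch.nn.MSELoss()

# Made-up values: three graphs whose predictions match their targets exactly.
pred = torch.tensor([[0.0], [1.0], [2.0]])  # shape [3, 1], like the output of self.out3
target = torch.tensor([0.0, 1.0, 2.0])      # shape [3], the ASSUMED shape of batch.y

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # PyTorch warns about the size mismatch
    broadcast_loss = loss_fn(pred, target).item()

# Flattening both tensors compares each prediction with its own target only.
aligned_loss = loss_fn(pred.view(-1), target.view(-1)).item()

print(f"broadcast: {broadcast_loss:.4f}, aligned: {aligned_loss:.4f}")
```

Here the broadcast loss is nonzero even though every prediction is perfect, which would cap how low the training loss can go. If `batch.y` already has shape `[num_graphs, 1]`, none of this applies.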
Training is done as follows:
from torch.utils.data import random_split
from torch_geometric.loader import DataLoader  # assumes PyG >= 2.0

# `data` is the full graph dataset (its construction is not shown here)
train_size = int(0.8 * len(data))
test_size = len(data) - train_size
train_dataset, test_dataset = random_split(data, [train_size, test_size])
train_loader = DataLoader(train_dataset, batch_size=NUM_GRAPHS_PER_BATCH, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=NUM_GRAPHS_PER_BATCH, shuffle=False)

def train(data):
    # One pass over the training set
    for batch in train_loader:
        batch = batch.to(device)
        optimizer.zero_grad()
        pred, embedding = model(batch.x.float(), batch.edge_index, batch.batch)
        # Flatten both tensors so the shapes match; a [B, 1] prediction
        # against a [B] target would broadcast to [B, B] inside MSELoss.
        loss = loss_fn(pred.view(-1), batch.y.float().view(-1))
        loss.backward()
        optimizer.step()
    # Note: these values come from the LAST batch of the epoch only
    return loss.item(), embedding, pred, batch.y.float()

print("Starting training...")
losses = []
losses_inter = []
integer_values = []
average_losses = []
for epoch in range(2500):
    loss, h, pred, target = train(data)
    losses.append(loss)
    losses_inter.append(loss)
    if epoch % 50 == 0:  # Check if 50 epochs have passed
        avg_loss = sum(losses_inter) / len(losses_inter)  # Calculate the average loss
        average_losses.append(avg_loss)  # Save the average loss
        integer_values.append(epoch)
        print(f"Epoch {epoch} | Train Loss {loss} |Train average loss {avg_loss}")
        losses_inter = []  # Reset the list of losses for the next 50 epochs
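Note that `train` returns the loss of the last mini-batch only, so the logged value depends on which graphs happen to land in that batch. A sample-weighted mean over all batches in the epoch is usually a steadier signal. The sketch below uses a hypothetical helper `epoch_mean_loss` and made-up batch losses; in the loop above it would be fed `loss.item()` and `batch.num_graphs` for each batch.

```python
def epoch_mean_loss(batch_losses, batch_sizes):
    """Sample-weighted mean of per-batch mean losses (hypothetical helper)."""
    if len(batch_losses) != len(batch_sizes):
        raise ValueError("need one batch size per batch loss")
    total = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)

# Made-up numbers: three full batches of 32 graphs and a final batch of 4
# graphs that happens to contain hard examples.
batch_losses = [0.002, 0.003, 0.002, 0.25]
batch_sizes = [32, 32, 32, 4]

last_batch = batch_losses[-1]
epoch_mean = epoch_mean_loss(batch_losses, batch_sizes)
print(f"last batch: {last_batch:.4f}, epoch mean: {epoch_mean:.4f}")
```

With these illustrative numbers the last-batch value is roughly twenty times the epoch mean, which is the kind of spike a small or unlucky final batch can produce in a per-epoch log.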
For the first run (first model), the loss goes as follows:
Epoch 0 | Train Loss at epoch 0.6653909683227539 |Train average loss of last 50 epochs0.6653909683227539
Epoch 50 | Train Loss at epoch 0.0515737310051918 |Train average loss of last 50 epochs0.3343683324754238
Epoch 100 | Train Loss at epoch 0.026481525972485542 |Train average loss of last 50 epochs0.07989441519603133
Epoch 150 | Train Loss at epoch 0.047521624714136124 |Train average loss of last 50 epochs0.07267563190311194
Epoch 200 | Train Loss at epoch 0.019531363621354103 |Train average loss of last 50 epochs0.0612673082947731
Epoch 250 | Train Loss at epoch 0.30581197142601013 |Train average loss of last 50 epochs0.08922389825806022
Epoch 300 | Train Loss at epoch 0.29909631609916687 |Train average loss of last 50 epochs0.0706970215588808
Epoch 350 | Train Loss at epoch 0.11375083774328232 |Train average loss of last 50 epochs0.06479552377015352
Epoch 400 | Train Loss at epoch 0.03928852081298828 |Train average loss of last 50 epochs0.07733491346240044
Epoch 450 | Train Loss at epoch 0.02452031895518303 |Train average loss of last 50 epochs0.07129390228539706
Epoch 500 | Train Loss at epoch 0.0247601717710495 |Train average loss of last 50 epochs0.07803433934226632
Epoch 550 | Train Loss at epoch 0.019612541422247887 |Train average loss of last 50 epochs0.06395480493083597
Epoch 600 | Train Loss at epoch 0.01744224689900875 |Train average loss of last 50 epochs0.0624170701391995
Epoch 650 | Train Loss at epoch 0.10861895978450775 |Train average loss of last 50 epochs0.10159282617270947
Epoch 700 | Train Loss at epoch 0.012816054746508598 |Train average loss of last 50 epochs0.057011911273002626
Epoch 750 | Train Loss at epoch 0.013633777387440205 |Train average loss of last 50 epochs0.06599821309559047
Epoch 800 | Train Loss at epoch 0.017795192077755928 |Train average loss of last 50 epochs0.05937261689454317
Epoch 850 | Train Loss at epoch 0.0060029239393770695 |Train average loss of last 50 epochs0.04518596758134663
Epoch 900 | Train Loss at epoch 0.014046485535800457 |Train average loss of last 50 epochs0.06654578611254693
Epoch 950 | Train Loss at epoch 0.013308274559676647 |Train average loss of last 50 epochs0.04094984469935298
Epoch 1000 | Train Loss at epoch 0.2524654269218445 |Train average loss of last 50 epochs0.047308804411441086
Epoch 1050 | Train Loss at epoch 0.0033438336104154587 |Train average loss of last 50 epochs0.07126852492801845
Epoch 1100 | Train Loss at epoch 0.005667633842676878 |Train average loss of last 50 epochs0.035457657966762784
Epoch 1150 | Train Loss at epoch 0.004261013586074114 |Train average loss of last 50 epochs0.03699902100954205
Epoch 1200 | Train Loss at epoch 0.00764879584312439 |Train average loss of last 50 epochs0.03940372625831515
Epoch 1250 | Train Loss at epoch 0.22968187928199768 |Train average loss of last 50 epochs0.04749706107191742
Epoch 1300 | Train Loss at epoch 0.004750204272568226 |Train average loss of last 50 epochs0.04106631238479167
Epoch 1350 | Train Loss at epoch 0.09794936329126358 |Train average loss of last 50 epochs0.04521390167530626
Epoch 1400 | Train Loss at epoch 0.008493292145431042 |Train average loss of last 50 epochs0.024447529329918324
Epoch 1450 | Train Loss at epoch 0.016368480399250984 |Train average loss of last 50 epochs0.048365002614445984
When training is re-run starting from the first model (I call the result the 2nd model), the model performs much better (at least visually), and the loss also seems to converge much better:
Epoch 0 | Train Loss at epoch 0.25503531098365784 |Train average loss of last 50 epochs0.25503531098365784
Epoch 50 | Train Loss at epoch 0.0031670196913182735 |Train average loss of last 50 epochs0.02515609229914844
Epoch 100 | Train Loss at epoch 0.001393115147948265 |Train average loss of last 50 epochs0.04640387752559036
Epoch 150 | Train Loss at epoch 0.0026145633310079575 |Train average loss of last 50 epochs0.04310374601744115
Epoch 200 | Train Loss at epoch 0.002247799886390567 |Train average loss of last 50 epochs0.050710707290563733
Epoch 250 | Train Loss at epoch 0.0020032026804983616 |Train average loss of last 50 epochs0.03414240623824298
Epoch 300 | Train Loss at epoch 0.22453083097934723 |Train average loss of last 50 epochs0.02633341144071892
Epoch 350 | Train Loss at epoch 0.0019714697264134884 |Train average loss of last 50 epochs0.03984586708480492
Epoch 400 | Train Loss at epoch 0.005678846966475248 |Train average loss of last 50 epochs0.03385940831620246
Epoch 450 | Train Loss at epoch 0.0022569075226783752 |Train average loss of last 50 epochs0.027584525449201466
Epoch 500 | Train Loss at epoch 0.0012922872556373477 |Train average loss of last 50 epochs0.011321051421109587
Epoch 550 | Train Loss at epoch 0.004552837461233139 |Train average loss of last 50 epochs0.021755292571615428
Epoch 600 | Train Loss at epoch 0.003204363863915205 |Train average loss of last 50 epochs0.04994374781381339
Epoch 650 | Train Loss at epoch 0.0016530484426766634 |Train average loss of last 50 epochs0.02917213011998683
Epoch 700 | Train Loss at epoch 0.0035411242861300707 |Train average loss of last 50 epochs0.0366158974042628
Epoch 750 | Train Loss at epoch 0.22690822184085846 |Train average loss of last 50 epochs0.03369493865640834
Epoch 800 | Train Loss at epoch 0.0024173096753656864 |Train average loss of last 50 epochs0.03216820436995477
Epoch 850 | Train Loss at epoch 0.0025715790688991547 |Train average loss of last 50 epochs0.03964726389851421
Epoch 900 | Train Loss at epoch 0.002828154945746064 |Train average loss of last 50 epochs0.04076477232156321
Epoch 950 | Train Loss at epoch 0.22405576705932617 |Train average loss of last 50 epochs0.024730477367993445
Epoch 1000 | Train Loss at epoch 0.0025856413412839174 |Train average loss of last 50 epochs0.024556294272188098
Epoch 1050 | Train Loss at epoch 0.001974760787561536 |Train average loss of last 50 epochs0.023162467416841536
Epoch 1100 | Train Loss at epoch 0.0030594163108617067 |Train average loss of last 50 epochs0.0349681565980427
Epoch 1150 | Train Loss at epoch 0.0016730850329622626 |Train average loss of last 50 epochs0.026076464417856188
Epoch 1200 | Train Loss at epoch 0.29780691862106323 |Train average loss of last 50 epochs0.05417488843668252
Epoch 1250 | Train Loss at epoch 0.22281265258789062 |Train average loss of last 50 epochs0.04079796551610343
Epoch 1300 | Train Loss at epoch 0.0022581592202186584 |Train average loss of last 50 epochs0.020289139293599875
Epoch 1350 | Train Loss at epoch 0.002116310875862837 |Train average loss of last 50 epochs0.04509262201841921
Epoch 1400 | Train Loss at epoch 0.002864697715267539 |Train average loss of last 50 epochs0.042254742274526504
Epoch 1450 | Train Loss at epoch 0.0026543678250163794 |Train average loss of last 50 epochs0.048286572040524334
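The post does not show how the second run is started. Assuming it reuses the trained weights but rebuilds the optimizer, Adam's running moment estimates start from zero again, which is not the same as simply training for more epochs. A minimal sketch of saving and restoring both pieces, using a stand-in `Linear` model and an in-memory buffer instead of the GCN and a file (both are illustrative choices):

```python
import io
import torch

torch.manual_seed(0)

def make_model_and_optimizer():
    # Stand-in for the GCN; optimizer settings copied from the post.
    model = torch.nn.Linear(4, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, weight_decay=1e-2)
    return model, optimizer

model, optimizer = make_model_and_optimizer()

# One training step on random data so that Adam has internal state to save.
x, y = torch.randn(8, 4), torch.randn(8, 1)
optimizer.zero_grad()
torch.nn.functional.mse_loss(model(x), y).backward()
optimizer.step()

# Save the weights and the optimizer state together.
buffer = io.BytesIO()
torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()}, buffer)

# "Second run": a fresh model and optimizer, restored from the checkpoint.
buffer.seek(0)
checkpoint = torch.load(buffer)
model2, optimizer2 = make_model_and_optimizer()
model2.load_state_dict(checkpoint["model"])
optimizer2.load_state_dict(checkpoint["optimizer"])  # skip this line to reset Adam

weights_match = all(
    torch.equal(p, q) for p, q in zip(model.parameters(), model2.parameters())
)
adam_state_restored = len(optimizer2.state_dict()["state"]) > 0
print(weights_match, adam_state_restored)
```

Comparing a continuation with and without the optimizer state restored is one way to test whether the optimizer reset is what makes the second run look different. If the notebook also re-executes `random_split`, the second run sees a different train/test split, which would be another difference to rule out.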
Also, if I run the model for 3000 epochs instead of 1500, the learnt model is similar to the first model.
I am puzzled as to why this keeps happening. Why do I have to retrain starting from the first model? It is baffling that increasing the number of epochs does not yield better results with the first model.
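To judge whether one model is really better than another (rather than visually, or from the last training batch), it may help to compute the loss on the held-out split, which the code above creates as `test_loader` but never uses. A minimal sketch, using a stand-in `Linear` model and a plain `TensorDataset` in place of the GCN and the graph dataset; the `evaluate` helper is hypothetical:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()
def evaluate(model, loader, device="cpu"):
    """Mean squared error over every sample in `loader` (hypothetical helper)."""
    model.eval()  # switch off dropout / batch-norm updates, if any
    squared_error, count = 0.0, 0
    for x, y in loader:
        pred = model(x.to(device)).view(-1)
        y = y.to(device).view(-1)
        squared_error += torch.sum((pred - y) ** 2).item()
        count += y.numel()
    model.train()  # back to training mode for the next epoch
    return squared_error / count

# Made-up regression data standing in for the held-out graphs.
torch.manual_seed(0)
features = torch.randn(100, 4)
targets = features.sum(dim=1)
test_loader = DataLoader(TensorDataset(features, targets), batch_size=32, shuffle=False)

model = torch.nn.Linear(4, 1)
test_mse = evaluate(model, test_loader)
print(f"held-out MSE: {test_mse:.4f}")
```

With the graph loader from the post, the loop body would instead call `model(batch.x.float(), batch.edge_index, batch.batch)` and read the targets from `batch.y`. Logging this value next to the training loss for both runs would show whether the second model is better on unseen graphs or only on the training batches.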