I want to build a model that generates OpenPose keypoints from a text description. For example, given the input "a man running", the output would look like the image I provided. Is there a model architecture you would recommend?
My data has the following properties:
- canvas_width: 900px
- canvas_height: 300px
- frames: 5 (5 people)
Expected output: (image)
I am trying to train an RNN for this task. I use a sentence transformer to embed the text, then pass the embedding to the RNN. The loss looks like the image below.
import torch
from sentence_transformers import SentenceTransformer

# Encode the whole sentence into a single 384-dimensional vector
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
text = "a man running"
text_input = torch.tensor(sentence_model.encode(text), dtype=torch.float)
(Image: loss curve with num_layers=3)
My RNN settings:
embedding_dim = 384
hidden_dim = 512
num_layers = 3
output_dim = 180
num_epochs = 100
learning_rate = 0.001
rnn_model = RNN(embedding_dim, hidden_dim, num_layers, output_dim)
The problem is that whatever I input, the output is the same every time. But when I change num_layers to 1 and keep the other settings the same, like this:
embedding_dim = 384
hidden_dim = 512
num_layers = 1
output_dim = 180
num_epochs = 100
learning_rate = 0.001
rnn_model = RNN(embedding_dim, hidden_dim, num_layers, output_dim)
the loss now looks like this:
(Image: loss curve with num_layers=1)
and the problem is gone.
I also tried to find the cause of the identical outputs. I checked the dataloader and the rest of the code and found nothing wrong. Only num_layers=3 triggers the problem, and num_layers=1 fixes it.
This is my training loop:
import numpy as np
import torch
import torch.nn as nn

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(rnn_model.parameters(), lr=learning_rate)
trainingEpoch_loss = []
validationEpoch_loss = []
for epoch in range(num_epochs):
    # Training phase
    step_loss = []
    rnn_model.train()
    for idx, train_inputs in enumerate(train_dataloader):
        optimizer.zero_grad()
        outputs = rnn_model(torch.unsqueeze(train_inputs['text'], dim=0))
        training_loss = criterion(outputs, train_inputs['poses'])
        training_loss.backward()
        optimizer.step()
        step_loss.append(training_loss.item())
        print(f'Epoch [{epoch+1}/{num_epochs}], Step [{idx+1}/{len(train_dataloader)}], Loss: {training_loss.item():.4f}')
    trainingEpoch_loss.append(np.array(step_loss).mean())

    # Validation phase: no gradients needed
    rnn_model.eval()
    validationStep_loss = []  # reset once per epoch, not once per batch
    with torch.no_grad():
        for idx, val_inputs in enumerate(val_dataloader):
            outputs = rnn_model(torch.unsqueeze(val_inputs['text'], dim=0))
            val_loss = criterion(outputs, val_inputs['poses'])
            validationStep_loss.append(val_loss.item())
    validationEpoch_loss.append(np.array(validationStep_loss).mean())
This is my inference code:
text = "a man running"
processed_text = torch.tensor(sentence_model.encode(text), dtype=torch.float)
output_poses = rnn_model(processed_text.unsqueeze(0))
print(output_poses.shape)  # shape=(1, 180)
# Each person has 36 values. The original data has 54 per person, but I kept
# only x and y and dropped the z axis. There are 5 people, so 5 * 36 = 180.
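To make the output layout concrete, here is a minimal sketch of how the 180 values could be split back into per-person keypoints. It assumes 18 keypoints per person (36 / 2, or 54 / 3 before dropping z), stored as interleaved x, y pairs, one person after another. That ordering is an assumption, and `drop_z` and `split_poses` are hypothetical helper names that do not appear in the code above.

```python
NUM_PEOPLE = 5
NUM_KEYPOINTS = 18  # assumed: 36 values / 2 coordinates (54 / 3 before dropping z)


def drop_z(flat_xyz):
    """Turn [x0, y0, z0, x1, y1, z1, ...] into [x0, y0, x1, y1, ...]."""
    if len(flat_xyz) % 3 != 0:
        raise ValueError("expected a multiple of 3 values")
    # Keep every value except the third one of each (x, y, z) triple.
    return [v for i, v in enumerate(flat_xyz) if i % 3 != 2]


def split_poses(flat_xy, num_people=NUM_PEOPLE, num_keypoints=NUM_KEYPOINTS):
    """Split a flat list of x, y values into one list of (x, y) tuples per person."""
    values_per_person = num_keypoints * 2
    if len(flat_xy) != num_people * values_per_person:
        raise ValueError("unexpected number of values")
    poses = []
    for p in range(num_people):
        start = p * values_per_person
        person = [
            (flat_xy[start + 2 * k], flat_xy[start + 2 * k + 1])
            for k in range(num_keypoints)
        ]
        poses.append(person)
    return poses
```

With a PyTorch tensor, the same split could be written as `output_poses.reshape(-1, 5, 18, 2)`, under the same ordering assumption.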
My questions are:
- Is there a model architecture you would recommend for this task other than an RNN?
- Why is the output the same for every input when num_layers=3? I'm very confused, because the loss wouldn't go down if the model were giving the same output, right? So it seems to give the same output only in the inference phase.
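One way to test the assumption behind the second question is to measure how much the outputs differ across inputs, and to compare against a constant prediction. The sketch below uses plain Python lists in place of model outputs. The numbers are made up for illustration, and `max_spread` and `mse` are hypothetical helpers. It shows that predicting the mean target for every input gives a lower MSE than predicting zeros, so a falling loss alone does not rule out a constant output.

```python
def max_spread(outputs):
    """Largest per-dimension range (max - min) across a list of output vectors."""
    return max(max(col) - min(col) for col in zip(*outputs))


def mse(preds, targets):
    """Mean squared error over all elements of two equally shaped nested lists."""
    total = 0.0
    count = 0
    for pred, target in zip(preds, targets):
        for a, b in zip(pred, target):
            total += (a - b) ** 2
            count += 1
    return total / count


# Toy targets standing in for pose vectors (made-up values, 2 dims instead of 180).
targets = [[100.0, 200.0], [120.0, 180.0], [140.0, 220.0]]

# Per-dimension mean of the targets: the best possible constant prediction under MSE.
mean_target = [sum(col) / len(targets) for col in zip(*targets)]

zeros_pred = [[0.0, 0.0] for _ in targets]   # e.g. an untrained model
const_pred = [mean_target for _ in targets]  # same output for every input

loss_zeros = mse(zeros_pred, targets)
loss_const = mse(const_pred, targets)

print("loss with zeros:", loss_zeros)
print("loss with constant mean output:", loss_const)
print("spread of constant outputs:", max_spread(const_pred))
```

Applied to the real model, `outputs` would hold the results of `rnn_model(...)` for several different sentences (for example via `.detach().squeeze(0).tolist()`). A spread near zero would confirm that the outputs are effectively constant.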
Expected answer:
- The model architecture best suited to my task. Any related papers or GitHub repos would be appreciated.
- An explanation of why the output is the same for every input when num_layers=3.