In the context of the Audio Spectrogram Transformer (AST), how should variable-length data be handled? Specifically, within a dataset, every spectrogram in a batch is padded to a common length, but different batches end up with different lengths. During training, is it necessary to reinitialize the model to match the length of the data in each batch?
import torch
from models import ASTModel  # from the AST repo's src/ directory; adjust the import path to your layout

if __name__ == '__main__':
    input_tdim = 100
    ast_mdl = ASTModel(input_tdim=input_tdim)
    # input a batch of 10 spectrograms, each with 100 time frames and 128 frequency bins
    test_input = torch.rand([10, input_tdim, 128])
    test_output = ast_mdl(test_input)
    # output should be in shape [10, 527], i.e., 10 samples, each with predictions over 527 classes
    print(test_output.shape)

    input_tdim = 256
    ast_mdl = ASTModel(input_tdim=input_tdim, label_dim=50, audioset_pretrain=True)
    # input a batch of 10 spectrograms, each with 256 time frames and 128 frequency bins
    test_input = torch.rand([10, input_tdim, 128])
    test_output = ast_mdl(test_input)
    # output should be in shape [10, 50], i.e., 10 samples, each with predictions over 50 classes
    print(test_output.shape)
Given the example code in the AST project on GitHub (shown above), each input length appears to be handled by constructing a separate model instance, one per value of input_tdim. This raises the question of whether the AST model actually needs to be reinitialized for every batch with a different length during training. The per-batch padding in question is sketched below.
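To make the setup concrete, here is a minimal sketch of the per-batch padding described above. The pad_collate function and the dataset it is attached to are hypothetical illustrations, not part of the AST codebase: each batch is padded to its own maximum time length, so different batches come out with different numbers of time frames.

import torch
from torch.utils.data import DataLoader

def pad_collate(batch):
    """Hypothetical collate_fn: pad each (spectrogram, label) pair in a
    batch to the batch's longest time dimension.
    Spectrograms are [time, 128] tensors of varying time length."""
    specs, labels = zip(*batch)
    max_len = max(s.shape[0] for s in specs)
    # pad tuple is (freq_left, freq_right, time_top, time_bottom):
    # only the end of the time dimension is padded with zeros
    padded = torch.stack([
        torch.nn.functional.pad(s, (0, 0, 0, max_len - s.shape[0]))
        for s in specs
    ])  # -> [batch, max_len, 128]
    return padded, torch.tensor(labels)

# loader = DataLoader(my_dataset, batch_size=10, collate_fn=pad_collate)
# my_dataset is a placeholder; different batches can now have different
# time lengths, which is exactly the situation asked about.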