I have been writing an audio detection model for a university project and have run into a problem. The model was trained on audio clips of 1.1 seconds duration, and when I evaluate it on longer audio using a sliding-window approach, the results are poor, even though the model's test accuracy is quite high (around 97%).
The university says the following about how to process the data for this task: “The last step is preparation for the upcoming challenge phase: Qualitatively evaluate the performance of your best classifier on the 4 scenes provided via Moodle. Those scenes are 6-24 seconds long and contain complete speech commands consisting of a device keyword and an action keyword (e.g., “Radio aus”). To apply the classifier from the previous stage to these longer recordings, they first need to be cut into shorter 1.1-second segments. To this end, use a sliding window with a hop size of 1 frame to extract feature sequences of 44 frames (44 sequence steps correspond to 1.1 seconds). The figure below illustrates the process of extracting snippets (red windows) from a mel-spectrogram of length N with a hop size of h frames and a window size of w.”
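If I read this correctly, a hop size of 1 frame should produce N - w + 1 overlapping windows from a spectrogram with N time frames. Below is a minimal sketch of how I understand the windowing; the tensor shape (n_mels, N) and the concrete numbers are just my assumptions for illustration:

import torch

# Dummy mel-spectrogram with shape (n_mels, N): 64 mel bins, N time frames.
# The shape and length are assumptions for illustration only.
n_mels, N = 64, 300
melspec = torch.randn(n_mels, N)

w, h = 44, 1  # window size and hop size in frames, as described above

# unfold along the time axis (dim 1) -> shape (n_mels, num_windows, w)
windows = melspec.unfold(1, w, h)
print(windows.shape[1], N - w + 1)  # both should equal N - w + 1 = 257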
My code currently looks like this:
import torchaudio
import torch
from collections import Counter


def sliding_window_segments(signal, frame_size, hop_size):
    # Slide a window of `frame_size` columns along dim 1 of `signal`
    # with a step of `hop_size` and collect the slices.
    segments = []
    for start in range(0, signal.size(1) - frame_size + 1, hop_size):
        end = start + frame_size
        segments.append(signal[:, start:end])
    return segments


# AudioClassifier, AudioUtil, ds, and normalize are defined in my training code (omitted here).
device = torch.device("mps")
model = AudioClassifier()
model.to(device)
model.eval()

sample_path = "evaluation_data/2_Florian_Heizung_aus.mp3"
aud = AudioUtil.open(sample_path)
melspec = AudioUtil.spectro_gram(aud, n_mels=64, n_fft=1024, hop_len=320)

frame_size = 44  # 44 frames correspond to 1.1 seconds
hop_size = 1

segments = sliding_window_segments(melspec, frame_size, hop_size)

predictions = []
idx2label = ds.idx2label

for segment in segments:
    segment = segment.unsqueeze(0)  # add a batch dimension
    segment = normalize(segment)
    with torch.no_grad():
        output = model(segment.to(device))
    predicted_label_idx = torch.argmax(output, dim=1).cpu().item()
    predictions.append(idx2label[predicted_label_idx])

# Majority vote over all window predictions
counter = Counter(predictions)
print(counter)
most_common_prediction = counter.most_common(1)[0][0]
print(f"Predicted label for the long audio: {most_common_prediction}")
Running this code gives the following output:

Counter({'Heizung': 14, 'Ofen': 7})
Predicted label for the long audio: Heizung

The majority vote happens to be correct, but the actual label of the audio is only the word Heizung, so I did not expect 'Ofen' to be predicted for a third of the windows.
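For debugging, I can also look at the per-window class probabilities with something like this (an untested sketch that reuses model, segments, normalize, idx2label, and device from the code above; torch.softmax is only there to make the outputs readable):

# Sketch: print the class probabilities for every window instead of only the argmax.
with torch.no_grad():
    for i, segment in enumerate(segments):
        segment = normalize(segment.unsqueeze(0)).to(device)
        probs = torch.softmax(model(segment), dim=1).squeeze(0).cpu()
        top_prob, top_idx = probs.max(dim=0)
        print(f"window {i}: {idx2label[top_idx.item()]} ({top_prob.item():.2f})")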
Could someone suggest some advice or a solution to this problem? I have tried the sliding-window approach as described above, but it still does not work as it should.