I’m trying to implement CNN-based audio denoising with torchaudio, following the MATLAB example below:
https://mathworks.com/help/deeplearning/ug/denoise-speech-using-deep-learning-networks.html
with PyTorch/torchaudio. In that approach, eight consecutive segments of a 256-point STFT of the noisy audio are used as the predictor, and the corresponding single STFT segment of the clean audio as the target.
However, I’m not sure how to implement a data loader for this. Do I need to generate both the overlapped noisy segments and the clean segments inside the Dataset class?
Also, what counts as one training sample in this application — the STFT of a whole audio file, or a single STFT segment? If one segment is treated as one sample, there could be a big discontinuity when stepping from one file to the next.
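Here is a rough sketch of what I have in mind, assuming the Dataset precomputes the STFT magnitudes per file and then indexes (file, frame) pairs, so that no 8-frame window ever crosses a file boundary. All names and parameter values (`DenoiseDataset`, `hop=64`, etc.) are my own guesses, not taken from the MATLAB example:

```python
# Sketch: Dataset yielding (8-frame noisy STFT, 1-frame clean STFT) pairs.
# Assumes the MATLAB recipe: 256-point FFT -> 129 one-sided bins, eight
# consecutive noisy frames as predictor, the matching clean frame as target.
import torch
from torch.utils.data import Dataset, DataLoader

class DenoiseDataset(Dataset):
    def __init__(self, noisy_waves, clean_waves, n_fft=256, hop=64, n_ctx=8):
        self.n_ctx = n_ctx
        self.noisy_specs, self.clean_specs = [], []
        self.samples = []  # (file_idx, frame_idx); windows never span files
        window = torch.hann_window(n_fft)
        for noisy, clean in zip(noisy_waves, clean_waves):
            ns = torch.stft(noisy, n_fft, hop_length=hop, window=window,
                            return_complex=True).abs()  # (129, T)
            cs = torch.stft(clean, n_fft, hop_length=hop, window=window,
                            return_complex=True).abs()  # (129, T)
            fidx = len(self.noisy_specs)
            self.noisy_specs.append(ns)
            self.clean_specs.append(cs)
            # a sample ends at frame t and uses frames t-7..t as predictor
            for t in range(n_ctx - 1, ns.shape[1]):
                self.samples.append((fidx, t))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        f, t = self.samples[i]
        x = self.noisy_specs[f][:, t - self.n_ctx + 1 : t + 1]  # (129, 8)
        y = self.clean_specs[f][:, t]                           # (129,)
        return x, y

# Usage sketch with synthetic waveforms of different lengths:
waves_noisy = [torch.randn(2048), torch.randn(1600)]
waves_clean = [torch.randn(2048), torch.randn(1600)]
ds = DenoiseDataset(waves_noisy, waves_clean)
loader = DataLoader(ds, batch_size=4, shuffle=True)
xb, yb = next(iter(loader))  # xb: (4, 129, 8), yb: (4, 129)
```

With this layout, one training sample is one 8-frame window paired with one clean frame, and because samples are indexed per file, shuffling at the sample level avoids any window that mixes two files.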