I’m trying to write a program in Python that will take an input of a .wav (sound) file, and determine whether the user is saying “yes” or “no”.
The issue is that the sound files are not always the same length.
I’m worried that with a static input dimension (e.g. 5 seconds of audio), I may have a sample that exceeds that length.
I recently read this paper written by Google’s DeepMind, which uses sound, but I can’t tell how they deal with this issue.
Any insights on how to allow my neural network to deal with a variable size input would be appreciated.
In general, most sound processing works like other natural language processing in that one of the first steps is to slice the data into basic tokens, i.e. words; in human hearing we separate words by the silence between them. Accordingly, you can pre-process to:
- Filter out sound outside the normal speech bandwidth (roughly the 300–3400 Hz telephone band); this is what telephone companies do to save bandwidth.
- Split each sample into chunks based on the gaps (silence) between words, as sketched below.
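Something like the following sketch covers both steps with SciPy. The band edges (the classic telephone band), the silence threshold, the minimum gap length, and the file name are all assumptions you would tune for your own recordings; it also assumes a mono signal.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def bandpass_speech(samples, rate, low_hz=300.0, high_hz=3400.0):
    """Keep only the normal speech band, discarding low rumble and high hiss."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=rate, output="sos")
    return sosfiltfilt(sos, samples)

def split_on_silence(samples, rate, threshold=0.02, min_gap_s=0.2):
    """Split a recording into word-sized chunks wherever the amplitude
    envelope stays below `threshold` (relative to peak) for `min_gap_s` seconds."""
    frame = int(0.02 * rate)                        # 20 ms analysis frames
    n_frames = len(samples) // frame
    env = np.abs(samples[: n_frames * frame]).reshape(n_frames, frame).mean(axis=1)
    loud = env > threshold * np.max(env)            # frames that contain speech
    gap_frames = int(min_gap_s * rate / frame)      # how many quiet frames make a gap
    chunks, start, quiet_run = [], None, 0
    for i, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = i                           # a new word begins
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= gap_frames:             # gap is long enough: close the chunk
                chunks.append(samples[start * frame : (i - quiet_run + 1) * frame])
                start, quiet_run = None, 0
    if start is not None:                           # speech ran to the end of the file
        chunks.append(samples[start * frame :])
    return chunks

rate, raw = wavfile.read("yes_or_no.wav")           # hypothetical mono input file
filtered = bandpass_speech(raw.astype(np.float32), rate)
words = split_on_silence(filtered, rate)            # each element is one word-sized chunk
```

Each element of `words` is then a short, roughly word-length array that you can treat as its own example for the network.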
This pre-processing is the equivalent of visual deep-learning systems standardising the size and bit depth of their images.
With some speakers who run their words together, the software will have problems, but so would most human listeners.