Problem Description
I want to use voice recognition as part of a hardware project, which I would like to be completely self-contained (I’m using small low-power, low-speed devices such as Arduinos, Raspberry Pis, Kinects, etc.; no traditional computer with an OS is involved, so it is a closed / self-contained project).
Voice recognition can be very complicated depending on the level of sophistication you desire. I have what I believe is a comparatively simple set of requirements. I only want to recognise my own voice, and I have a small dictionary of 20 or so words I’d like to recognise. Thus I don’t require complex speech-to-text and voice recognition libraries, or any of the excellent third-party software I find via Internet search engines (there is no shortage of these!). I believe my requirements are “simple enough” (within reason) that I can code my own solution. Has anyone written their own process like this, and is my method massively flawed? Is there a better way to do this without requiring a high level of mathematics or having to write a complex algorithm? That is the solution I have tried to think up below.
Solution Description
I will be writing this in C, but I wish to discuss a language-agnostic process, focusing on the process itself. So let’s ignore that if we can.
1 . I will pre-record my dictionary of words to match those being spoken. We can imagine I have 20 recordings of my 20 different words, or perhaps short phrases or sentences of two or three words. I believe this makes the process of comparing two recording files easier than actually converting the audio to text and comparing two strings.
2 . A microphone is connected to my hardware device running my code. [1] The code continuously takes fixed-length samples, say 10 msec in length, and stores 10 consecutive samples, in a circular logging style. [2] (I’m inventing these figures off the top of my head, so they are only examples to describe the process.)
[1] This would likely be connected through a band-pass filter and op-amp, as would the dictionary recordings be made, to keep the stored and collected audio samples smaller.
[2] I’m not sure exactly how I will take a sample; I need to work out a method where I produce a numerical figure (integer/float/double) that represents the audio of a 10 msec sample (perhaps a CRC value or MD5 sum of the audio sample), or a stream of figures (a stream of audio readings of frequencies perhaps). Ultimately a “sample” will be a numerical figure or figures. This part is going to be much more hardware-involved, so it’s not really for discussion here.
3 . The code looks at its stored 10 consecutive samples and looks for a volume increase to indicate a word or phrase is being said (a break from silence), and then increases its consecutive sample collecting to, say, 500 samples. That would mean it captures 5 seconds of audio in 10 msec samples.
It is these samples or “slices” that are compared between the stored sound and the captured sound. If a high enough percentage of captured samples match the equivalent stored ones, the code assumes it’s the same word.
The start of a stored recording of the word "hello" for example,
stored words are split into 10 msec samples also
Stored Sample No | 1| 2| 3| 4| 5| 6| 7| 8|
Stored Sample Value |27|38|41|16|59|77|200|78|
Incoming audio (me saying "hello") with some "blank" samples
at the start to symbolise silence
Incoming Sample No | 1| 2| 3| 4| 5| 6| 7| 8| 9|10| 11|12|
Incoming Sample Value | | | |20|27|38|46|16|59|77|200|78|
4 . Once the code has collected a full sample stream, it then chops off the blank samples at the start to produce the following audio recording. It could also shift the sample set backwards and forwards a few places to better align with the stored sample.
This produces a sample set like the below:
Stored Sample No | 1| 2| 3| 4| 5| 6| 7| 8|
Stored Sample Value |27|38|41|16|59|77|200|78|
Incoming Sample No |-1| 1| 2| 3| 4| 5| 6| 7| 8|
Incoming Sample Value |20|27|38|46|16|59|81|201|78|
5 . I believe that by having a percentage value for how close each sample must be (sample 7 differs by a value of 1, which is less than 1%) and a percentage value for the total number of samples which must be within their sample-matching percentage, the code has an easily tunable level of accuracy.
I have never done anything like this with audio before, and it could be a lot of work, which is why I am asking this question (you may already know the answer to be obvious, whatever that answer may be). I am hoping this won’t be a computationally massive task, as some of the hardware I will be using is low-spec stuff, in the hundreds of megahertz (maybe 1 GHz using an over-clocked Raspberry Pi). So this is a rather crude way to match audio samples using lower computational power. I’m not aiming for instant results, but less than 30 seconds for a decent proof of concept.
PS
I don’t have the rep to tag this with a new tag like “audio”, “audio recognition”, “voice”, “voice recognition” etc.
Well, I don’t believe the Arduino has the horsepower to do this; it operates at 16 MHz.
An Arduino has about 32K of memory. Even 20 words sampled as MP3 (smaller than WAV) wouldn’t fit in it, even though it’s only your own voice.
The Raspberry Pi might do the trick; it operates at 700 MHz and, depending on the version, might have 512 MB of memory. That’s still not a lot, though.
You might need a Fourier transform
(http://www.drdobbs.com/cpp/a-simple-and-efficient-fft-implementatio/199500857)
Or, if you intend to use volume, smooth the signal with an average over the previous samples, like
x[n] = ( x[n] + x[n-1] + x[n-2] + x[n-3] ) / 4 // that's quite simple; might need more taps
The next thing you need to do, I think, is plot these x values, and then you need some kind of slope detection on that line. Detecting commands based on volume alone otherwise depends a lot on distance to the microphone, while you would rather detect the pattern of the words. Then it depends a bit on how you record the slope so that the pattern will fit another time: one doesn’t speak at the exact tempo a computer can match, and a slope can be a bit steeper. In the end, I think it comes down to how steep those lines are and their length on the y axis; both should be within some average.
-
Arduino and Raspberry Pi are prototyping boards with little chips on them. You should focus on the chip first. Look for something with a DSP (digital signal processing) toolbox, maybe you already have a DSP toolbox and don’t know it. DSP toolboxes have algorithms on call like fft (fast fourier transform) and ifft (inverse fft) for fast frequency domain analysis.
-
Focus on your programmatic style:
Are your samples in a stack or a queue? You will want a queue for this type of data. A queue looks like:
Position NO  --|1|2|3|4|5|6|7|8|
Sample Value   |5|7|9|1|2|2|9|8|
Next iteration:
Position NO  --|1|2|3|4|5|6|7|8|
Sample Value   |0|5|7|9|1|2|2|9| -> First In, First Out (FIFO)
Notice how things shift toward the ‘right’? I think you described a “circular” algorithm. Just overwrite the oldest samples with the second-oldest samples, then overwrite the second-oldest samples with the third-oldest, and so on, all the way to the beginning of the queue, where you insert your newest data.
-
“The code is continuously taking fixed length samples, say 10msec” <– incorrect
Think this way: the code is discretely taking quantized (amplitude) samples at a sampling rate of, say, 10,000 samples per second, which makes each sample 0.1 ms apart. What is your sampling frequency? What is the bit depth of your quantizer? Lower numbers will help you free up memory. I would suggest a low sampling rate like 6,600 samples per second (enough, by Nyquist, for speech content below 3,300 Hz). I suspect 4 bits (16 levels) would be sufficient for recognition; that’s 3,300 bytes of recording per second. Now do an FFT and delete everything above 3,300 Hz (a telephony-style filter), and you have 1,650 bytes used for one second of sound. These DSP tricks will save a lot of memory.
I don’t know who thinks 512 MB is small. With the above info that is 300,000+ seconds of recording… over 3 days solid.
-
I think you will find the frequency domain (by using fft) to be a better environment to perform voice recognition.
I hope I didn’t confuse you worse 🙂