I’m encountering an issue with speech recognition when trying to process audio streamed from a client to a Python backend. Here’s a breakdown of the process:
Client-Side (JavaScript):
    audioRecorder.ondataavailable = function(event) {
        if (event.data && event.data.size > 0) {
            const reader = new FileReader();
            reader.onload = function() {
                const buffer = reader.result;
                const chunkSize = 1024 * 32;
                for (let i = 0; i < buffer.byteLength; i += chunkSize) {
                    const chunk = buffer.slice(i, i + chunkSize);
                    const uint8Array = new Uint8Array(chunk);
                    ws.send(JSON.stringify({ type: 'AUDIO_FRAME', data: Array.from(uint8Array) }));
                }
            };
            reader.readAsArrayBuffer(event.data);
        }
    };
Server-Side (Python):
    elif data['type'] == 'AUDIO_FRAME':
        audio_bytes = bytes(data['data'])
        audio_file.write(audio_bytes)
        recognize_speech(audio_bytes)
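For reference, the framing above amounts to the following round trip. This is a minimal sketch using only the standard library. The helper name `decode_audio_frame` is hypothetical and not part of my actual server code, and it assumes each message is a JSON object with a `type` field and a `data` list of byte values (0 to 255), as sent by the client.

```python
import json
from typing import Optional


def decode_audio_frame(message: str) -> Optional[bytes]:
    """Return the raw bytes carried by an AUDIO_FRAME message, else None.

    Hypothetical helper: assumes the message is a JSON object whose 'data'
    field is a list of integers in the range 0-255.
    """
    data = json.loads(message)
    if data.get('type') != 'AUDIO_FRAME':
        return None
    # bytes() raises ValueError if any value is outside 0-255.
    return bytes(data['data'])


# Simulate what the client sends for a 4-byte chunk.
message = json.dumps({'type': 'AUDIO_FRAME', 'data': [26, 69, 223, 163]})
frame = decode_audio_frame(message)
print(frame.hex())  # 1a45dfa3
```

The JSON step only changes how the bytes travel over the WebSocket. The bytes come out on the server exactly as the recorder produced them, in whatever audio format that is.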
Speech Recognition Function (Python):
    import speech_recognition as sr

    sample_rate = 48000   # samples per second
    sample_width = 2      # bytes per sample (16-bit)
    chunk_duration = 1    # seconds of audio per recognition request
    audio_buffer = b''
    recognizer = sr.Recognizer()

    def recognize_speech(audio_data):
        global audio_buffer
        audio_buffer += audio_data
        # Number of bytes in one chunk, treating the stream as raw mono PCM.
        buffer_size = sample_rate * sample_width * chunk_duration
        if len(audio_buffer) >= buffer_size:
            try:
                chunk_bytes = audio_buffer[:buffer_size]
                audio_buffer = audio_buffer[buffer_size:]
                audio = sr.AudioData(chunk_bytes, sample_rate, sample_width)
                text = recognizer.recognize_google(audio)
                print("Recognized:", text)
            except sr.UnknownValueError:
                print("Could not understand audio")
            except sr.RequestError as e:
                print(f"Request error from Google Speech Recognition service: {e}")
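The buffering logic above can be exercised without calling the recognizer. The sketch below uses only the standard library and assumes, as my code does, that the incoming bytes are raw 16-bit mono PCM at 48 kHz. The class name `PcmChunker` is hypothetical. Unlike `recognize_speech`, it drains every complete chunk available rather than at most one per call.

```python
class PcmChunker:
    """Accumulate raw PCM bytes and hand back fixed-duration chunks."""

    def __init__(self, sample_rate: int, sample_width: int, chunk_duration: int):
        # Bytes in one chunk: samples/second * bytes/sample * seconds (mono).
        self.chunk_size = sample_rate * sample_width * chunk_duration
        self._buffer = bytearray()

    def feed(self, data: bytes) -> list:
        """Append data and return a list of all complete chunks now available."""
        self._buffer += data
        chunks = []
        while len(self._buffer) >= self.chunk_size:
            chunks.append(bytes(self._buffer[:self.chunk_size]))
            del self._buffer[:self.chunk_size]
        return chunks

    @property
    def pending(self) -> int:
        """Number of buffered bytes that do not yet fill a chunk."""
        return len(self._buffer)


chunker = PcmChunker(sample_rate=48000, sample_width=2, chunk_duration=1)
frame = bytes(32 * 1024)  # same size as one client-side frame, all zeros

ready = []
for _ in range(3):
    ready += chunker.feed(frame)

print(chunker.chunk_size)  # 96000
print(len(ready))          # 1
print(chunker.pending)     # 2304
```

With 32 KiB frames, a one-second chunk (96000 bytes) becomes available on the third frame, with 2304 bytes carried over to the next chunk.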
However, although the audio is streamed to and stored on the server successfully, the speech recognition consistently raises an UnknownValueError. Interestingly, when I tested the speech recognition with live microphone input in a Jupyter notebook, it worked flawlessly.
I’d appreciate any insights or suggestions on how to resolve this issue and successfully transcribe the streamed audio.
I also tried writing the streamed chunks to an audio file on the server. I could play that file back and hear the audio clearly. However, when I stored the chunks inside the recognize_speech function instead, the resulting audio was not nearly as clear.
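Since the saved file plays fine while recognition fails, one thing I can check is whether the received bytes are raw PCM at all, or a container format. My understanding is that browser recorders such as `MediaRecorder` usually emit a container (this is an assumption about my setup), whereas `sr.AudioData` is handed the bytes as if they were raw PCM samples. A minimal sketch using only well-known file signatures; the function name `sniff_container` is hypothetical:

```python
from typing import Optional


def sniff_container(data: bytes) -> Optional[str]:
    """Guess the container format from the leading bytes, or None if unknown.

    Only the start of a stream carries these signatures, so this is meant
    for the first frame received, not for later frames.
    """
    if data.startswith(b'\x1a\x45\xdf\xa3'):
        return 'webm/matroska'  # EBML header ID
    if data.startswith(b'OggS'):
        return 'ogg'
    if data[:4] == b'RIFF' and data[8:12] == b'WAVE':
        return 'wav'
    return None  # possibly raw PCM, or a format not listed here


print(sniff_container(b'\x1a\x45\xdf\xa3' + bytes(16)))  # webm/matroska
print(sniff_container(b'RIFF' + bytes(4) + b'WAVE'))     # wav
print(sniff_container(bytes(16)))                        # None
```

If this reports a container for the first frame, the fixed-size slices passed to `sr.AudioData` would not be valid PCM samples, which might be related to the error I am seeing.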
Let me know if you need any more details.
Thank you!