I'm using Google Speech-to-Text and Deepgram to test this; both give me the same result.
When I make a Twilio call I sometimes get really weird behaviour: the audio seems to be repeated multiple times for no reason. I don't know if it's the way I'm handling the WebSocket or the way I'm sending the audio.
It may look like I said all of that very fast and at the same time, but that is not the case: when I speak I leave about a 5-second pause between sentences, yet they come back as if I had said everything at once.
This happens in only about 40% of the calls; the rest work as intended.
The response from Deepgram or Google Speech comes back like this:
non final: hello what's your name
non final: hello what's your name can
non final: hello what's your name can you
non final: hello what's your name can you hear
non final: hello what's your name can you hear me
non final: hello what's your name can you hear me hello
non final: hello what's your name can you hear me hello
non final: hello what's your name can you hear me hello
non final: hello what's your name can you hear me hello
non final: hello what's your name can you hear me hello are
non final: hello what's your name can you hear me hello are you
non final: hello what's your name can you hear me hello are you there
non final: hello what's your name can you hear me hello are you there
non final: hello what's your name can you hear me hello are you there hello
non final: hello what's your name can you hear me hello are you there hello
non final: hello what's your name can you hear me hello are you there hello bro
non final: hello what's your name can you hear me hello are you there hello bro
non final: hello what's your name can you hear me hello are you there hello bro
non final: hello what's your name can you hear me hello are you there hello bro
non final: hello what's your name can you hear me hello are you there hello bro what
non final: hello what's your name can you hear me hello are you there hello bro what
non final: hello what's your name can you hear me hello are you there hello bro what the
non final: hello what's your name can you hear me hello are you there hello bro what the
non final: hello what's your name can you hear me hello are you there hello bro what the hell
non final: hello what's your name can you hear me hello are you there hello bro what the hell
non final: hello what's your name can you hear me hello are you there hello bro what the hell is
non final: hello what's your name can you hear me hello are you there hello bro what the hell is
Below is the code that receives the call from Twilio:
# Initiate the connection and hand the call off to the WebSocket stream
@application.post('/call')  # make phone call (outgoing)
async def handle_incoming_calls(request: Request, From: Annotated[str, Form()], To: Annotated[str, Form()]):
    response = VoiceResponse()
    connect = Connect()
    URL = f"wss://{PUBLIC_URL}/stream"
    connect.stream(url=URL)
    response.append(connect)
    return Response(content=str(response), media_type='text/xml')
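For reference, this is roughly the TwiML that handler returns; a quick standalone check (with a placeholder PUBLIC_URL, not my real domain) prints:

from twilio.twiml.voice_response import VoiceResponse, Connect

PUBLIC_URL = "example.ngrok.io"  # placeholder host, not the real one
response = VoiceResponse()
connect = Connect()
connect.stream(url=f"wss://{PUBLIC_URL}/stream")
response.append(connect)
print(str(response))
# <?xml version="1.0" encoding="UTF-8"?>
# <Response><Connect><Stream url="wss://example.ngrok.io/stream" /></Connect></Response>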
That function sends the media stream to the /stream endpoint, which handles the WebSocket:
@application.websocket('/stream')
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    while True:
        await wait_for_user_input(websocket)
The wait_for_user_input function (with its streaming config):
config = RecognitionConfig(
    encoding=RecognitionConfig.AudioEncoding.MULAW,
    sample_rate_hertz=8000,
    language_code="en",
    model="telephony",
    enable_automatic_punctuation=False,
)
streaming_config = StreamingRecognitionConfig(config=config, interim_results=True)


async def wait_for_user_input(ws):
    transcript = ""
    transcript_ready = False

    def on_transcription_response(response):
        nonlocal transcript
        nonlocal transcript_ready
        if not response.results:
            return
        result = response.results[0]
        if not result.alternatives:
            return
        transcription = result.alternatives[0].transcript
        if result.is_final is True:
            print("\nTRANSCRIPT FINAL", transcription)
        else:
            print("non final:", transcription)

    print("WS connection opened")
    bridge = SpeechClientBridge(streaming_config, on_transcription_response)
    t = threading.Thread(target=bridge.start)
    t.start()

    while True:
        message = await ws.receive_text()
        if message is None:
            bridge.add_request(None)
            bridge.terminate()
            break
        data = json.loads(message)
        if data['event'] == 'media':
            media = data['media']
            if media['track'] == 'inbound':
                chunk = base64.b64decode(media["payload"])
                bridge.add_request(chunk)
        if data["event"] == "stop":
            print(f"Media WS: Received event 'stop': {message}")
            print("Stopping...")
            break
        if transcript_ready:
            print("Transcript: ", transcript)
            transcript_ready = False

    bridge.terminate()
    print("WS connection closed")
The same thing happens with Deepgram.
I tried different STT solutions and I run into the same issue with all of them.
Then I thought it was Twilio, but they checked all my calls and determined there were no issues with the calls themselves.