I have a Twilio Bi-Directional Media Streaming WebSocket set up, and I am trying to send audio data back to Twilio using Google TTS. Here is the code I am using:
    import base64
    import json

    from google.cloud import texttospeech

    # websocket, session_stream_sid and call_sid are defined elsewhere in my app.
    async def send_audio(text):
        client = texttospeech.TextToSpeechClient()
        synthesis_input = texttospeech.SynthesisInput(text=text)
        voice = texttospeech.VoiceSelectionParams(
            language_code="en-IN",
            name="en-IN-Standard-C",
            ssml_gender=texttospeech.SsmlVoiceGender.MALE,
        )
        # Request mu-law audio at 8000 Hz, which is what Twilio expects.
        audio_config = texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MULAW,
            sample_rate_hertz=8000,
            # speaking_rate=1.2
        )
        response = client.synthesize_speech(
            input=synthesis_input, voice=voice, audio_config=audio_config
        )
        # print(response)
        # Try every header length from 0 to 63 bytes and send each attempt.
        for x in range(0, 64):
            audio_content = response.audio_content
            decoded_audio = base64.b64decode(audio_content)
            trimmed_audio = decoded_audio[x:]
            re_encoded_audio = base64.b64encode(trimmed_audio).decode('utf-8')
            # print(re_encoded_audio)
            print(x)
            outbound_media_meta = {
                "payload": re_encoded_audio,
            }
            outbound_media = {
                "event": "media",
                "media": outbound_media_meta,
                "streamSid": session_stream_sid[call_sid]
            }
            json_message = json.dumps(outbound_media)
            await websocket.send_text(json_message)
I have a for loop around the code that sends the data, trimming anywhere from 0 to 63 bytes off the start of the audio, because I have learned from other threads that Google TTS prepends a header that needs to be removed (one says 44 bytes, the other 58 bytes). The messages are being sent over my WebSocket, and I hear a short, loud click/static noise but no comprehensible audio. The audio itself is generated properly by Google TTS.
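For reference, here is a minimal sketch of finding the audio by walking the RIFF chunks instead of guessing a fixed offset. A WAV file with a 16-byte `fmt ` chunk has a 44-byte header, while one with an 18-byte `fmt ` chunk plus a `fact` chunk (common for non-PCM formats such as mu-law) has a 58-byte header, which would explain the two numbers above. Which layout Google actually emits is not confirmed here. The function name `extract_wav_data` is hypothetical. It also assumes the input is raw bytes: as far as I understand, the Python client's `response.audio_content` is already `bytes`, and it is the REST API that returns a base64 string, so an extra `base64.b64decode` on the client library's result may itself corrupt the audio.

```python
import struct


def extract_wav_data(wav_bytes: bytes) -> bytes:
    """Return the body of the first 'data' chunk of a RIFF/WAVE byte string.

    Hypothetical helper: if no RIFF/WAVE signature is present, the input is
    assumed to be headerless audio already and is returned unchanged.
    """
    if wav_bytes[:4] != b"RIFF" or wav_bytes[8:12] != b"WAVE":
        return wav_bytes

    offset = 12  # skip 'RIFF', the 4-byte RIFF size, and 'WAVE'
    while offset + 8 <= len(wav_bytes):
        chunk_id = wav_bytes[offset:offset + 4]
        # Chunk sizes are little-endian unsigned 32-bit integers.
        (chunk_size,) = struct.unpack_from("<I", wav_bytes, offset + 4)
        body_start = offset + 8
        if chunk_id == b"data":
            return wav_bytes[body_start:body_start + chunk_size]
        # Skip this chunk; chunk bodies are padded to an even length.
        offset = body_start + chunk_size + (chunk_size & 1)

    raise ValueError("no 'data' chunk found in WAV bytes")
```

With this, the loop over `x` would not be needed: `raw_mulaw = extract_wav_data(response.audio_content)` would give the bytes to base64-encode once, assuming the response really is a WAV container.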
Here are some answers to questions that I stole from @wayeast in this thread. They all apply to my case as well.
Some answers to likely questions:
Twilio specifies that the audio sent to it should be audio/x-mulaw-encoded at 8000Hz sample rate. Are you sure what you’re sending meets that requirement? Yes. When making my TTS request to Google, I specify that I would like the audio mulaw-encoded at 8000Hz. Google sends audio back as base64-encoded bytes; when I decode the returned base64 string and save it to a file on disk, both ffprobe and soxi confirm that it is indeed mulaw-encoded and with a sample rate of 8000Hz.
When Google encodes its TTS audio as mulaw, it attaches a wav header to the result. Twilio says that media sent to it should not contain any headers. Are you sure you are only sending raw audio bytes to Twilio? Yes. When I get a result back from Google, I first decode the base64 string, then clip the first 44 bytes (the size of the wav header), and base64 encode only the remaining bytes to send to Twilio. I know that the bytes I have clipped are the right ones because I have written them to a file on disk, then imported them into Audacity as mulaw, 8000Hz raw audio data, and Audacity plays the audio just fine.
Are you base64-encoding your mulaw/8000 bytes correctly? I suppose this may be an unanswerable question (on my end), but I think so. If I write the base64 string (encoded with the “standard” engine) I send to Twilio to a file test.enc on disk, then run base64 test.enc -d > unk.dat, I can import unk.dat into Audacity as mulaw/8000 raw data and it plays. In my application, I have tried the gamut of common base64 engines: standard, standard-no-pad, url-safe, url-safe-no-pad. None of these produce good results.
Are you formatting your Media message to Twilio correctly? Yes. At least, the message I am sending looks like the example here.
Is the Media message you are sending to Twilio a WS Text message? Yes.
I remember seeing someone somewhere on the interwebs suggesting that it’s a good idea to send a Mark message after a Media message to Twilio. Did you try this for giggles? Yes.
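To make the message-format answers above concrete, here is a rough sketch of how the outbound Media and Mark messages could be built from headerless mu-law bytes. The helper names `build_media_messages` and `build_mark_message` are hypothetical. Splitting the audio into 160-byte frames (20 ms at 8000 Hz, one byte per sample) is an assumption made for illustration, not something I know Twilio to require. The JSON shapes follow the message examples in Twilio's Media Streams documentation as I understand them.

```python
import base64
import json

# Assumption: 20 ms frames of 8 kHz mu-law audio, 1 byte per sample.
FRAME_BYTES = 160


def build_media_messages(raw_mulaw: bytes, stream_sid: str,
                         frame_bytes: int = FRAME_BYTES) -> list[str]:
    """Return JSON text messages, one 'media' event per audio frame."""
    messages = []
    for start in range(0, len(raw_mulaw), frame_bytes):
        frame = raw_mulaw[start:start + frame_bytes]
        messages.append(json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            # Standard base64 of the raw audio bytes, with no WAV header.
            "media": {"payload": base64.b64encode(frame).decode("ascii")},
        }))
    return messages


def build_mark_message(stream_sid: str, name: str) -> str:
    """Return a JSON 'mark' event to send after the media messages."""
    return json.dumps({
        "event": "mark",
        "streamSid": stream_sid,
        "mark": {"name": name},
    })
```

Each returned string would then be sent as a WebSocket text message, for example with `await websocket.send_text(message)` as in my code above.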