Is there any approach or model architecture that can generate both text and speech, given speech and text (or either one) as input? I know this is possible with STS and TTS models, but the difference is that it would need to understand and produce different kinds of cues: voice intonation, non-verbal vocalizations (laughs etc.), pauses, emotions, and so on, and pair that with the intelligence of an LLM for producing coherent responses. Basically, the goal is to create something with the conversational abilities of ChatGPT-4o in terms of expressiveness. It should be one single model handling all modalities (a powerful LLM for intelligence, plus speech with the tones and details mentioned). Any known open-source resources would be helpful as well.
I have tried STS and TTS models, and LLMs combined with speech-to-speech models, but they don't capture all the details of the voice, like tone and emotion (a rough sketch of what I tried is below).
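For reference, this is roughly the cascade I tried; the model names are just examples, not a recommendation. The point is that every stage talks to the next through plain text, so prosody is discarded at the ASR step and the TTS step can only guess at expressiveness:

```python
from transformers import pipeline

# Cascaded pipeline: audio -> text -> text -> audio.
# Intonation, laughter, pauses, and emotion are lost at the ASR stage.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
llm = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
tts = pipeline("text-to-speech", model="suno/bark-small")

text = asr("user_turn.wav")["text"]                    # prosody lost here
reply = llm(text, max_new_tokens=128)[0]["generated_text"]
audio = tts(reply)                                     # expressiveness is the TTS model's guess
```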
I am wondering whether the following approach makes sense:
Taking the embeddings from a Llama model and generating voice from them with a vocoder (rough sketch below). The problem with this is that I'm not sure whether Llama's embeddings would carry any cues about voice tone.
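To make the idea concrete, here's a toy PyTorch sketch of what I mean. The `EmbeddingToMel` bridge module, its dimensions, and the upsampling factor are all made up; such an adapter would have to be trained on paired hidden-state/mel data before a vocoder (e.g. HiFi-GAN) could turn its output into a waveform:

```python
import torch
import torch.nn as nn

class EmbeddingToMel(nn.Module):
    """Hypothetical bridge: projects LLM hidden states to mel-spectrogram frames."""
    def __init__(self, llm_dim=4096, mel_bins=80, upsample=4):
        super().__init__()
        self.upsample = upsample
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, 512),
            nn.GELU(),
            nn.Linear(512, mel_bins * upsample),  # several mel frames per token
        )

    def forward(self, hidden_states):              # (batch, tokens, llm_dim)
        b, t, _ = hidden_states.shape
        mel = self.proj(hidden_states)             # (b, t, mel_bins * upsample)
        return mel.view(b, t * self.upsample, -1)  # (b, frames, mel_bins)

# Toy usage with random "Llama" hidden states; real ones would come from
# model(..., output_hidden_states=True).hidden_states[-1].
hidden = torch.randn(1, 16, 4096)
mel = EmbeddingToMel()(hidden)
print(mel.shape)  # torch.Size([1, 64, 80]) -> feed to a vocoder for audio
```

My worry, as said, is that these hidden states are trained purely for next-token prediction, so there may simply be no tone information in them for the adapter to recover.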
What I'm expecting is something like ChatGPT-4o: a model with strong intelligence coupled with human-like voice tone.