I am using Whisper and need to provide accurate results to my end users. My two options are:
- Using segment granularity, I get sentence-level segments but no word timestamps (see the request sketch after the two sample outputs):
{
  "id": 0,
  "seek": 3000,
  "start": 30.0,
  "end": 33.0,
  "text": " Mama, take this badge off of me",
  "tokens": [50364, 17775, 11, 747, 341, 25797, 766, 295, 385, 50514],
  "temperature": 0.0,
  "avg_logprob": -0.29641300439834595,
  "compression_ratio": 1.2115384340286255,
  "no_speech_prob": 0.31771889328956604
},
{
  "id": 1,
  "seek": 3000,
  "start": 37.0,
  "end": 40.0,
  "text": " I can't use it anymore",
  "tokens": [50714, 286, 393, 380, 764, 309, 3602, 50864],
  "temperature": 0.0,
  "avg_logprob": -0.29641300439834595,
  "compression_ratio": 1.2115384340286255,
  "no_speech_prob": 0.31771889328956604
},
- Using word granularity, I get word timestamps, which is better, but no sentence boundaries, so I can't really display my text properly:
{ "word": "Mama", "start": 30.0, "end": 30.639999389648438 },
{ "word": "take", "start": 30.920000076293945, "end": 30.920000076293945 },
{ "word": "this", "start": 30.920000076293945, "end": 31.360000610351562 },
{ "word": "badge", "start": 31.360000610351562, "end": 31.81999969482422 },
{ "word": "off", "start": 31.81999969482422, "end": 32.20000076293945 },
{ "word": "of", "start": 32.20000076293945, "end": 32.439998626708984 },
{ "word": "me", "start": 32.439998626708984, "end": 33.63999938964844 },
{ "word": "I", "start": 33.63999938964844, "end": 37.380001068115234 },
{ "word": "can't", "start": 37.380001068115234, "end": 37.81999969482422 },
{ "word": "use", "start": 37.81999969482422, "end": 38.279998779296875 },
{ "word": "it", "start": 38.279998779296875, "end": 38.939998626708984 },
{ "word": "anymore", "start": 38.939998626708984, "end": 40.459999084472656 },
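For context, both outputs come from the same endpoint; only the timestamp granularity changes. Here is a minimal sketch of the request, assuming the OpenAI-hosted API with the Python SDK (the file name is hypothetical):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("hurricane.mp3", "rb") as audio:  # hypothetical file name
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio,
        response_format="verbose_json",       # required to get timestamps
        timestamp_granularities=["segment"],  # ["word"] produces the second output
    )
```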
Am I missing something in the word-timestamps output that would let me reconstruct the sentences as Whisper heard them? Can you think of another way?
I would like to avoid calling the API twice, for financial reasons (and to save the planet ^^).
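One idea I am considering, but have not verified: since `timestamp_granularities` takes a list, maybe both granularities can be requested in a single call, and the words stitched back into the segments afterwards by comparing timestamps. A rough sketch of what I mean (attribute names assumed from the verbose_json outputs above; note the boundaries don't line up exactly, e.g. "me" ends at 33.64 while its segment ends at 33.0, so I match on the largest overlap rather than strict containment):

```python
from openai import OpenAI

client = OpenAI()

with open("hurricane.mp3", "rb") as audio:  # hypothetical file name
    t = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"],  # both in one request?
    )

def overlap(word, seg):
    # Length of the time interval shared by a word and a segment
    # (negative when they don't overlap at all).
    return min(word.end, seg.end) - max(word.start, seg.start)

# Attach each word to the segment it overlaps the most.
sentences = [{"text": s.text.strip(), "start": s.start, "end": s.end, "words": []}
             for s in t.segments]
for w in t.words:
    j = max(range(len(t.segments)), key=lambda k: overlap(w, t.segments[k]))
    sentences[j]["words"].append({"word": w.word, "start": w.start, "end": w.end})
```

If that works, it would keep everything in one API call; I just don't know whether requesting both granularities is billed once or twice.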
Thanks a lot for your ideas,