Getting chunk-level output with start and end timestamps with Whisper
I am using the Whisper3 model to transcribe several audio files. However, the output I get is a tensor. I would like to obtain text chunks with their corresponding start and end timestamps instead. Can someone please help me get this output with the approach I am already using? I do get the desired output if I use a pipeline with the “AutoModelForSpeechSeq2Seq” class instead of “WhisperForConditionalGeneration”, like below.