I am using the Whisper large-v3 model to transcribe several audio files. However, the output I am getting is a tensor of token IDs. I would like to obtain text chunks with corresponding start and end timestamps instead. Can someone please assist me in achieving this desired output while keeping the generate-based approach shown below? I do get the desired output if I use pipeline with the "AutoModelForSpeechSeq2Seq" class instead of "WhisperForConditionalGeneration" (a sketch of that pipeline call is at the end of this post for reference). My current code:
import librosa
from transformers import WhisperForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Whisper expects 16 kHz audio; audio_path is the path to one of my files
audio, sr = librosa.load(audio_path, sr=16000)
inputs = processor(audio, return_tensors="pt", truncation=False, padding="longest", return_attention_mask=True, sampling_rate=sr)

result = model.generate(**inputs)
decoded = processor.batch_decode(result, skip_special_tokens=True)
The result variable here is a tensor of token IDs, and decoded is plain text with no timing information, but I want the output in the form of text chunks with start and end timestamps. Please help.
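For reference, here is a minimal sketch of the pipeline approach that does give me the chunked output (the model id is the same as above; the "audio.mp3" path is a placeholder for one of my files):

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "openai/whisper-large-v3"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    return_timestamps=True,  # makes the pipeline emit chunk-level timestamps
)

result = pipe("audio.mp3")  # placeholder path
print(result["chunks"])

With return_timestamps=True, the pipeline returns a dict whose "chunks" key holds entries like {'timestamp': (0.0, 5.4), 'text': '...'}, which is exactly the format I want to get out of the generate-based code above.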