I need to perform semantic chunking of a video. So far, I have resampled the audio to 16000 Hz and used wav2vec2 to transcribe it.
I need to:
- Time-align the transcript with the audio: work out the methodology and steps for aligning the transcript with the audio, i.e. getting start/end timestamps for each word. (My current attempt is sketched after the code below.)
- Semantic chunking of the data: slice the data into audio-text pairs, using both semantic information from the text and voice activity information from the audio, with each audio chunk being less than 15 s long. (My current idea is sketched at the end of this post.)
Here is the code I have so far:
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
# Load the pre-trained Wav2Vec2 model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval() # Set the model to evaluation mode
# Function to process and transcribe audio in chunks
def transcribe_audio(input_file, chunk_duration=30):
    # Load the audio file
    speech, sr = librosa.load(input_file, sr=16000, mono=True)
    # Calculate the number of samples per chunk
    chunk_samples = int(sr * chunk_duration)
    # Initialize the transcript result
    transcript = ""
    # Process audio in chunks
    for start in range(0, len(speech), chunk_samples):
        # Get a chunk of speech
        end = start + chunk_samples
        speech_chunk = speech[start:end]
        # Convert the speech chunk to tensor
        input_values = processor(speech_chunk, return_tensors="pt", sampling_rate=sr).input_values
        # Perform inference
        with torch.no_grad():
            logits = model(input_values).logits
        # Decode the predicted ids to text
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = processor.batch_decode(predicted_ids)
        # Append the chunk transcription to the total transcription
        transcript += transcription[0] + ' '
    return transcript
file_path = 'resampled_audio.wav'
final_transcript = transcribe_audio(file_path)
print("Final transcription:", final_transcript)
Any help would be appreciated. Also, what are the general concepts I should study to learn these techniques? For the chunking step, I want to slice the data into audio-text pairs; my current idea is sketched below, and any help on how to do this properly would be much appreciated.
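To show where I'm stuck, here is how I currently imagine the chunking working: run a voice activity detector to get speech segments, greedily merge adjacent segments while the merged span stays under 15 s (so the cuts always fall in pauses the VAD found), and then attach each aligned word from the sketch above to the chunk that contains its midpoint. I'm assuming the Silero VAD torch.hub interface below (get_speech_timestamps returning sample-indexed segments); please correct me if that API is different. chunk_audio_text and the dicts it returns are again just my own placeholder names.

# Silero VAD via torch.hub (this is my understanding of its interface)
vad_model, vad_utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps = vad_utils[0]

def chunk_audio_text(speech, sr, words, max_len=15.0):
    # Speech regions found by the VAD, converted from sample indices to seconds
    segments = get_speech_timestamps(torch.from_numpy(speech), vad_model, sampling_rate=sr)
    segments = [(seg['start'] / sr, seg['end'] / sr) for seg in segments]

    # Greedily merge consecutive speech segments while the merged span stays under max_len,
    # so every cut point is a pause the VAD detected
    merged = []
    for start_s, end_s in segments:
        if merged and end_s - merged[-1][0] < max_len:
            merged[-1] = (merged[-1][0], end_s)
        else:
            merged.append((start_s, end_s))

    # Build the audio-text pairs: each aligned word goes to the chunk containing its midpoint
    pairs = []
    for start_s, end_s in merged:
        text = " ".join(w["word"] for w in words
                        if start_s <= (w["start"] + w["end"]) / 2 <= end_s)
        audio = speech[int(start_s * sr):int(end_s * sr)]
        pairs.append({"start": start_s, "end": end_s, "audio": audio, "text": text})
    return pairs

pairs = chunk_audio_text(speech, sr, word_timestamps)
for p in pairs[:3]:
    print(round(p["start"], 2), round(p["end"], 2), p["text"])

Two things I'm unsure about: a single VAD segment longer than 15 s would still need to be split (maybe at the largest gap between words?), and since the wav2vec2 transcript has no punctuation I don't know how to bring in the "semantic" part properly. Would running a punctuation-restoration or sentence-segmentation model over the transcript and preferring sentence boundaries as cut points be the usual approach?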