I am trying to extract a timecoded transcription from an audio file that I extracted from a transcoded video. The sequence is as follows (I am using GCP services):
- Upload a video to cloud storage
- Transcode the uploaded video into several video and audio settings
- Transcribe the audio from step 2 using the Speech-to-Text API
All of this is working fine, though with so-so transcription accuracy. Now, there's an article on the Speech-to-Text documentation page about optimising audio files for Speech-to-Text, but I can't work out the matching settings for my transcoding audio config.
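Based on my reading of that article (it recommends a lossless codec, a sample rate of at least 16 kHz, and one channel per speaker), my best guess at a transcription-oriented stream would be something like the sketch below, but I'm not sure these values are right, and the Transcoder API doesn't seem to offer a lossless audio codec at all:
import type { google } from '@google-cloud/video-transcoder/build/protos/protos'

// My guess at a dedicated stream just for transcription (values are guesses)
const sttStream: google.cloud.video.transcoder.v1.IElementaryStream = {
  key: 'audio-stream-stt',
  audioStream: {
    codec: 'aac', // no lossless option in the Transcoder API, as far as I can tell
    bitrateBps: 256000, // high bitrate to minimise lossy-compression artifacts
    channelCount: 1, // mono, since the article suggests one channel per speaker
    sampleRateHertz: 16000, // the article mentions 16 kHz or higher for speech
  },
}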
For comparison, the following is my current transcoding config for the audio streams:
import type { google } from '@google-cloud/video-transcoder/build/protos/protos'
export const config: google.cloud.video.transcoder.v1.IJobConfig = {
elementaryStreams: [
...
{
key: 'audio-stream0',
audioStream: {
codec: 'aac',
bitrateBps: 32000,
},
},
{
key: 'audio-stream1',
audioStream: {
codec: 'aac',
bitrateBps: 64000,
},
},
{
key: 'audio-stream2',
audioStream: {
codec: 'aac',
bitrateBps: 96000,
},
},
{
key: 'audio-stream3',
audioStream: {
codec: 'aac',
bitrateBps: 128000,
},
},
{
key: 'audio-stream4',
audioStream: {
codec: 'aac-he',
bitrateBps: 128000,
},
},
{
key: 'audio-stream5',
audioStream: {
codec: 'aac-he-v2',
bitrateBps: 128000,
},
},
{
key: 'audio-stream6',
audioStream: {
codec: 'aac',
bitrateBps: 128000,
sampleRateHertz: 44100,
},
},
],
muxStreams: [
...
{
key: 'aac-32000',
container: 'fmp4',
elementaryStreams: [
'audio-stream0',
],
},
{
key: 'aac-64000',
container: 'fmp4',
elementaryStreams: [
'audio-stream1',
],
},
{
key: 'aac-96000',
container: 'fmp4',
elementaryStreams: [
'audio-stream2',
],
},
{
key: 'aac-128000',
container: 'fmp4',
elementaryStreams: [
'audio-stream3',
],
},
{
key: 'aac-he-128000',
container: 'fmp4',
elementaryStreams: [
'audio-stream4',
],
},
{
key: 'aac-he-v2-128000',
container: 'fmp4',
elementaryStreams: [
'audio-stream5',
],
},
{
key: 'aac-128000-44100',
container: 'fmp4',
elementaryStreams: [
'audio-stream6',
],
},
],
manifests: [
{
fileName: 'manifest.mpd',
type: 'DASH',
muxStreams: [
'aac-he-v2-128000',
'aac-he-128000',
'aac-128000',
'aac-96000',
'aac-64000',
'aac-32000',
'aac-128000-44100',
],
},
],
}
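For context, I submit this config to the Transcoder API roughly like this (simplified; projectId, location, bucket, the input key, and the config import path are placeholders):
import { TranscoderServiceClient } from '@google-cloud/video-transcoder'
import { config } from './transcoding-config'

const transcoder = new TranscoderServiceClient()

// Create the transcoding job (placeholders for project/location/bucket/key)
const [job] = await transcoder.createJob({
  parent: transcoder.locationPath(projectId, location),
  job: {
    inputUri: `gs://${bucket}/${inputKey}`,
    outputUri: `gs://${bucket}/transcoded/`,
    config,
  },
})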
And here's my Speech-to-Text operation config:
const [operation] = await client.v1.longRunningRecognize({
config: {
encoding: 'LINEAR16', //tried MP3 too
sampleRateHertz: 48000, //tried 16000, 44100 too
languageCode: lang,
audioChannelCount: 2,
alternativeLanguageCodes: ['en'],
enableWordConfidence: true,
enableAutomaticPunctuation: true,
enableWordTimeOffsets: true,
useEnhanced: true,
enableSeparateRecognitionPerChannel: true,
enableSpokenPunctuation: {
value: true,
},
},
audio: {
uri: `gs://${process.env.GOOGLE_STORAGE_BUCKET}/${key}`,
},
})
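The timecodes themselves come through fine, by the way; this is roughly how I read the word-level offsets out of the result once the operation completes:
const [response] = await operation.promise()
for (const result of response.results ?? []) {
  // With enableWordTimeOffsets, each word carries startTime/endTime Durations
  for (const word of result.alternatives?.[0]?.words ?? []) {
    const start =
      Number(word.startTime?.seconds ?? 0) + (word.startTime?.nanos ?? 0) / 1e9
    console.log(`${start.toFixed(2)}s ${word.word}`)
  }
}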
I have tried a bunch of combinations for the transcoding audio config (e.g. bitrateBps and sampleRateHertz) as well as the Speech-to-Text operation config (e.g. sampleRateHertz and encoding), and the best I can get is 70-75% transcription accuracy.
I am shooting in the dark here, since I don't really know much about the audio technicalities. Is there an optimised combination of transcoding config and transcription config that would produce better transcription accuracy?