I am trying to extract a timecoded transcription from an audio file that I extracted from a transcoded video. The sequence is as follows (I am using GCP services):
- Upload a video to cloud storage
- Transcode the uploaded video into several video and audio settings
- Transcribe the audio from step 2 using the Speech-to-Text API
All of this is working fine, though with so-so transcription accuracy. Now, there's an article on the Speech-to-Text documentation page about optimising audio files for Speech-to-Text, but I can't work out the matching settings for my transcoding audio config.
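Based on my reading of that article (it recommends a lossless codec, a sample rate of at least 16 kHz, and one channel per speaker), my best guess at a transcription-oriented stream would be something like the sketch below, but I'm not sure these values are right, and the Transcoder API doesn't seem to offer a lossless audio codec at all:
import type { google } from '@google-cloud/video-transcoder/build/protos/protos'

// My guess at a dedicated stream just for transcription (values are guesses)
const sttStream: google.cloud.video.transcoder.v1.IElementaryStream = {
  key: 'audio-stream-stt',
  audioStream: {
    codec: 'aac', // no lossless option in the Transcoder API, as far as I can tell
    bitrateBps: 256000, // high bitrate to minimise lossy-compression artifacts
    channelCount: 1, // mono, since the article suggests one channel per speaker
    sampleRateHertz: 16000, // the article mentions 16 kHz or higher for speech
  },
}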
For comparison, the following is my current transcoding config for the audio streams:
import type { google } from '@google-cloud/video-transcoder/build/protos/protos'
export const config: google.cloud.video.transcoder.v1.IJobConfig = {
elementaryStreams: [
...
{
key: 'audio-stream0',
audioStream: {
codec: 'aac',
bitrateBps: 32000,
},
},
{
key: 'audio-stream1',
audioStream: {
codec: 'aac',
bitrateBps: 64000,
},
},
{
key: 'audio-stream2',
audioStream: {
codec: 'aac',
bitrateBps: 96000,
},
},
{
key: 'audio-stream3',
audioStream: {
codec: 'aac',
bitrateBps: 128000,
},
},
{
key: 'audio-stream4',
audioStream: {
codec: 'aac-he',
bitrateBps: 128000,
},
},
{
key: 'audio-stream5',
audioStream: {
codec: 'aac-he-v2',
bitrateBps: 128000,
},
},
{
key: 'audio-stream6',
audioStream: {
codec: 'aac',
bitrateBps: 128000,
sampleRateHertz: 44100,
},
},
],
muxStreams: [
...
{
key: 'aac-32000',
container: 'fmp4',
elementaryStreams: [
'audio-stream0',
],
},
{
key: 'aac-64000',
container: 'fmp4',
elementaryStreams: [
'audio-stream1',
],
},
{
key: 'aac-96000',
container: 'fmp4',
elementaryStreams: [
'audio-stream2',
],
},
{
key: 'aac-128000',
container: 'fmp4',
elementaryStreams: [
'audio-stream3',
],
},
{
key: 'aac-he-128000',
container: 'fmp4',
elementaryStreams: [
'audio-stream4',
],
},
{
key: 'aac-he-v2-128000',
container: 'fmp4',
elementaryStreams: [
'audio-stream5',
],
},
{
key: 'aac-128000-44100',
container: 'fmp4',
elementaryStreams: [
'audio-stream6',
],
},
],
manifests: [
{
fileName: 'manifest.mpd',
type: 'DASH',
muxStreams: [
'aac-he-v2-128000',
'aac-he-128000',
'aac-128000',
'aac-96000',
'aac-64000',
'aac-32000',
'aac-128000-44100',
],
},
],
}
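For context, I submit this config to the Transcoder API roughly like this (simplified; projectId, location, bucket, the input key, and the config import path are placeholders):
import { TranscoderServiceClient } from '@google-cloud/video-transcoder'
import { config } from './transcoding-config'

const transcoder = new TranscoderServiceClient()

// Create the transcoding job (placeholders for project/location/bucket/key)
const [job] = await transcoder.createJob({
  parent: transcoder.locationPath(projectId, location),
  job: {
    inputUri: `gs://${bucket}/${inputKey}`,
    outputUri: `gs://${bucket}/transcoded/`,
    config,
  },
})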
And here's my Speech-to-Text operation config:
const [operation] = await client.v1.longRunningRecognize({
config: {
encoding: 'LINEAR16', //tried MP3 too
sampleRateHertz: 48000, //tried 16000, 44100 too
languageCode: lang,
audioChannelCount: 2,
alternativeLanguageCodes: ['en'],
enableWordConfidence: true,
enableAutomaticPunctuation: true,
enableWordTimeOffsets: true,
useEnhanced: true,
enableSeparateRecognitionPerChannel: true,
enableSpokenPunctuation: {
value: true,
},
},
audio: {
uri: `gs://${process.env.GOOGLE_STORAGE_BUCKET}/${key}`,
},
})
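The timecodes themselves come through fine, by the way; this is roughly how I read the word-level offsets out of the result once the operation completes:
const [response] = await operation.promise()
for (const result of response.results ?? []) {
  // With enableWordTimeOffsets, each word carries startTime/endTime Durations
  for (const word of result.alternatives?.[0]?.words ?? []) {
    const start =
      Number(word.startTime?.seconds ?? 0) + (word.startTime?.nanos ?? 0) / 1e9
    console.log(`${start.toFixed(2)}s ${word.word}`)
  }
}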
I have tried a bunch of combinations for the transcoding audio config (e.g. bitrateBps and sampleRateHertz) as well as the Speech-to-Text operation config (e.g. sampleRateHertz and encoding), and the best I can get is 70-75% transcription accuracy.
I am shooting in the dark here, since I don't really know much about the audio technicalities. Is there an optimised combination of transcoding config and transcription config that would produce better transcription accuracy?