We’re a video streaming platform (e.g. FlixNet). A video has separate audio files for different languages, and each audio file has its own subtitles. We receive the subtitles from the vendor as plain text and use AI to convert that plain text into JSON. Each item in the JSON represents a time frame (start and end, in seconds from the beginning of the audio file) plus the subtitle text itself. Here is an example:
“Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam id
consectetur tellus, a malesuada lacus. Donec dignissim ornare
fringilla. Integer sed lorem vel mi dictum tempus vel et lorem. Sed
aliquam volutpat sem. Cras quam nulla, laoreet vitae leo cursus,
laoreet facilisis sapien. Aliquam vitae”
{
    "transcription": [
        {
            "start": 3,
            "end": 5,
            "text": "Lorem ipsum dolor sit amet,"
        },
        {
            "start": 6,
            "end": 7,
            "text": "consectetur adipiscing elit."
        },
        {
            "start": 8,
            "end": 10,
            "text": "Nam id consectetur tellus, a malesuada lacus."
        },
        {
            "start": 15,
            "end": 17,
            "text": "Donec dignissim ornare fringilla."
        },
        {
            "start": 19,
            "end": 22,
            "text": "Integer sed lorem vel mi dictum tempus vel et lorem."
        },
        {
            "start": 23,
            "end": 24,
            "text": "Sed aliquam volutpat sem. Cras quam nulla,"
        },
        {
            "start": 25,
            "end": 26,
            "text": "laoreet vitae leo cursus, laoreet"
        },
        {
            "start": 29,
            "end": 30,
            "text": "facilisis sapien. Aliquam vitae"
        }
    ]
}
Time frames vary in length, and there may be gaps when no one is talking. I don’t know the pattern by which the AI chops the text into JSON segments.
The issue:
The vendor may send us updated subtitles. For example: a changed translation, fixed typos, or an entire paragraph removed because it’s racist or 18+.
We’re using PHP 7.2. I need to write a PHP script that can identify the changes and update the subtitles in the JSON without triggering the AI again, because it costs money and the AI sometimes hallucinates.
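One approach I’m considering (just a sketch, PHP 7.2-compatible; `patchSubtitles`, `lcsAlign` and `tokenize` are names I made up): keep the original vendor text that produced the JSON, compute a word-level diff (longest common subsequence) between the old and the updated text, and map each JSON segment to its word range in the old text. Each segment is then rebuilt from the new words its range maps to; a segment whose words all disappeared (e.g. a removed paragraph) is dropped:

```php
<?php
// Split a transcript into words on whitespace (punctuation stays attached).
function tokenize($text) {
    return preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Word-level LCS alignment: returns pairs [oldIndex, newIndex] of equal words.
function lcsAlign(array $old, array $new) {
    $n = count($old); $m = count($new);
    $dp = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; $i--) {
        for ($j = $m - 1; $j >= 0; $j--) {
            $dp[$i][$j] = ($old[$i] === $new[$j])
                ? $dp[$i + 1][$j + 1] + 1
                : max($dp[$i + 1][$j], $dp[$i][$j + 1]);
        }
    }
    $pairs = []; $i = 0; $j = 0;
    while ($i < $n && $j < $m) {
        if ($old[$i] === $new[$j])            { $pairs[] = [$i, $j]; $i++; $j++; }
        elseif ($dp[$i + 1][$j] >= $dp[$i][$j + 1]) { $i++; }
        else                                   { $j++; }
    }
    return $pairs;
}

// $subtitle: decoded JSON ({"transcription": [...]}), $oldText: the vendor
// text the AI originally segmented, $newText: the updated vendor text.
function patchSubtitles(array $subtitle, $oldText, $newText) {
    $oldWords = tokenize($oldText);
    $newWords = tokenize($newText);
    $map = [];                                 // old word index -> new word index
    foreach (lcsAlign($oldWords, $newWords) as $p) {
        $map[$p[0]] = $p[1];
    }
    $result = [];
    $cursor = 0;                               // current position in $oldWords
    foreach ($subtitle['transcription'] as $seg) {
        $count = count(tokenize($seg['text']));
        $lo = null; $hi = null;                // surviving range in $newWords
        for ($k = $cursor; $k < $cursor + $count; $k++) {
            if (isset($map[$k])) {
                if ($lo === null) { $lo = $map[$k]; }
                $hi = $map[$k];
            }
        }
        $cursor += $count;
        if ($lo === null) {
            continue;                          // whole segment was deleted
        }
        // Rebuild the text, keeping replacement words between matched ones
        // (so a mid-segment typo fix like "wrld" -> "world" is picked up).
        $seg['text'] = implode(' ', array_slice($newWords, $lo, $hi - $lo + 1));
        $result[] = $seg;
    }
    return ['transcription' => $result];
}
```

Known limitations of this sketch: words inserted or changed at a segment boundary may land in the neighbouring segment or be lost, and timings are kept as-is even when text is removed, so the result would need eyeballing before shipping.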