Imagine I have two distinct OS processes (the actual OS is unimportant). Process A is responsible for playing a video file. Process B is responsible for playing the audio that accompanies the video file. Both processes are clients, connected via a local area network to a server.
Assuming that the video and audio streams are synchronized at the file level, what mechanisms would I use to coordinate both processes with the server so that, once instructed, they begin and continue playing in sync?
This feels like a common problem but I am struggling to find any detailed, practical solutions.
According to EBU Recommendation R37, “The relative timing of the sound and vision components of a television signal”, end-to-end audio/video sync should be within +40 ms and -60 ms (audio leading or lagging the video, respectively), and each individual stage should be within +5 ms and -15 ms.
This summary is taken from the Wikipedia page on audio-to-video synchronization.
This suggests that you need timing accuracy measured in tens of milliseconds.
Karl Bielefeldt’s suggestion of Precision Time Protocol was a good one, but it seems like overkill to me. PTP has sub-microsecond accuracy (on a local LAN), so it is three orders of magnitude (more than 1000 times) more accurate than we need, and consequently much more difficult to implement.
The much older and more widely available Network Time Protocol (NTP) should get clocks synchronised to within a millisecond on a LAN, which is an order of magnitude (more than 10 times) more accurate than we require. Even if your server and clients were on the Internet, you should be able to get clocks synchronised to within tens of milliseconds, provided you don’t have problems with asymmetric routes or network congestion.
NTP client/server software is standard on most operating systems; all you need to do is sync both clients to the same server. Note that even if each client is individually synced to the server with an accuracy of plus or minus 1 ms, with respect to each other they are only synchronised to within plus or minus 2 ms (one could be 1 ms ahead of the server while the other is 1 ms behind), but this is still well below the threshold of perception.
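As a quick sanity check, each client can measure how far its clock is from the shared time server. A minimal sketch, assuming the third-party ntplib package and a placeholder server name:

```python
# Measure the local clock's offset from the shared NTP server.
# Requires the third-party ntplib package (pip install ntplib);
# "ntp.example.local" is a placeholder for your LAN's time server.
import ntplib

def clock_offset_ms(server="ntp.example.local"):
    response = ntplib.NTPClient().request(server, version=3)
    return response.offset * 1000.0  # positive means the local clock is behind the server

if __name__ == "__main__":
    print(f"local clock offset: {clock_offset_ms():+.2f} ms")
```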
Once your system times are synchronised, each client would fill its initial buffer and inform the server of the earliest time at which it could guarantee to start playing its content. Once the server had received both times, it would send the later (worst-case) of the two back to both clients, and both would be expected to start playing at exactly that time.
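A minimal sketch of that rendezvous on the server side, assuming the two clients connect over TCP and exchange newline-delimited JSON messages (the message names, port number and safety margin are illustrative, not any standard protocol):

```python
import json
import socket

HOST, PORT = "0.0.0.0", 5000   # hypothetical rendezvous port
NUM_CLIENTS = 2                # the video client and the audio client

def run_rendezvous_server():
    with socket.create_server((HOST, PORT)) as srv:
        conns, ready_times = [], []

        # 1. Wait for both clients to report the earliest (NTP-synchronised)
        #    wall-clock time at which they can guarantee starting playback.
        while len(conns) < NUM_CLIENTS:
            conn, _ = srv.accept()
            msg = json.loads(conn.makefile().readline())
            assert msg["type"] == "ready"
            conns.append(conn)
            ready_times.append(msg["earliest_start"])

        # 2. The agreed start time is the worst case (latest) of the two,
        #    plus a small margin so the reply can reach both clients in time.
        start_at = max(ready_times) + 0.5   # 500 ms margin, illustrative

        # 3. Tell both clients to start at exactly that time.
        for conn in conns:
            conn.sendall((json.dumps({"type": "start", "start_at": start_at})
                          + "\n").encode())
            conn.close()

if __name__ == "__main__":
    run_rendezvous_server()
```

Each client, once its buffer is full, would send its "ready" message with the earliest timestamp at which it can start, then simply sleep until the agreed start_at time arrives on its own (synchronised) clock.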
Finally, since clocks can drift over time, your clients and server would have to keep synchronising clocks, and if the video drifts too far from the audio, you should duplicate or skip frames of video to maintain synchronisation. This should only be needed if you are running very long streams though.
Incidentally, the reason for adjusting the video rather than the audio is that we are far less likely to notice a 1 frame dup/skip in video (assuming 20fps or higher) than even a 1/60th of a second audio glitch.
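A sketch of that correction step, assuming hypothetical player objects that expose their current playback position in seconds plus repeat_frame()/skip_frame() operations:

```python
# Periodically compare the two playback positions and nudge the *video*
# back into line, one frame at a time. FRAME_DURATION and the player API
# (position(), repeat_frame(), skip_frame()) are illustrative.
FRAME_DURATION = 1.0 / 25.0  # 25 fps

def correct_drift(video_player, audio_player):
    drift = video_player.position() - audio_player.position()
    if drift > FRAME_DURATION:
        video_player.repeat_frame()   # video ahead: hold a frame so the audio catches up
    elif drift < -FRAME_DURATION:
        video_player.skip_frame()     # video behind: drop a frame to catch up
```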
Usually, this sort of thing is solved by using a transport link with strict timing specifications, like HDMI. For a TCP/IP network, there are systems like JACK that get you part of the way there for audio, but I’m not aware of a synchronized audio/video solution.
Essentially, you need a very accurate synchronized clock, using something like Precision Time Protocol, then in your packets, you specify precisely what time to play each frame/sample. Then you need real-time scheduling to make sure the deadlines are hit, and enough buffer to cover any network latency. Much easier said than done.
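As a rough sketch of the per-packet deadline idea, assuming the clocks are already synchronised and that each packet carries the absolute wall-clock time at which it should be presented (the packet shape and render callback are illustrative):

```python
import time

def play_stream(packets, render):
    """packets: iterable of (present_at, payload) tuples, where present_at is
    an absolute UNIX timestamp stamped by the server."""
    for present_at, payload in packets:
        delay = present_at - time.time()
        if delay > 0:
            time.sleep(delay)      # buffer depth must cover worst-case network latency
        elif delay < -0.015:       # hopelessly late (>15 ms): drop rather than slip
            continue
        render(payload)            # hand the frame/sample to the output device
```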
That’s the hard way. The easy way is to just make a best effort, then give the user controls to manually adjust the delay.
What about having the server send each block of the stream to the clients simultaneously, and not sending the next block until both clients have acknowledged they have completed playing the previous block? You’ll probably have to play with some kind of block buffers in the client, but that shouldn’t be hard. This way, the server is the one source of playing position, and everything will be synchronized on block boundaries. Block size will probably be determined by your network bandwidth (and possibly your source medium).
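A rough sketch of that lock-step loop on the server, assuming two already-connected TCP sockets and a simple length-prefix/ACK framing (all of which is illustrative):

```python
import struct

BLOCK_SIZE = 64 * 1024  # illustrative; tune to your bandwidth and source medium
ACK = b"\x06"           # single byte each client sends after *playing* a block

def serve_lockstep(video_src, audio_src, video_conn, audio_conn):
    while True:
        pairs = [(video_conn, video_src.read(BLOCK_SIZE)),
                 (audio_conn, audio_src.read(BLOCK_SIZE))]
        if not any(block for _, block in pairs):
            break
        # 1. Push each client's next block at (nominally) the same moment.
        for conn, block in pairs:
            conn.sendall(struct.pack("!I", len(block)) + block)
        # 2. Don't send another block until both clients acknowledge having
        #    finished playing this one, so playback re-aligns on every block.
        for conn, _ in pairs:
            assert conn.recv(1) == ACK
```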
My thinking is that there is no way to guarantee this kind of event at the process level. You can’t know what priority the OS will give to the receipt of any given event, the order in which the next relevant event will arrive, and so on.
To me it sounds like you need a third, marshalling application whose job is to receive the events, queue up the buffered media and synchronize the playback locally. That application’s sole responsibility would be to receive the two separate network streams and then hand each one to its local output device at the same moment.
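A sketch of what that local marshalling process might look like, assuming the network layer feeds decoded video frames and audio chunks into two queues and that push_video()/push_audio() drive the local outputs (all names are illustrative):

```python
import queue
import time

video_q = queue.Queue()
audio_q = queue.Queue()

TICK = 1.0 / 25.0  # release one video frame's worth of media per tick (25 fps)

def marshal(push_video, push_audio, prebuffer=25):
    # Wait until both buffers hold enough data to ride out network jitter.
    while video_q.qsize() < prebuffer or audio_q.qsize() < prebuffer:
        time.sleep(0.01)
    # Then release one video frame and its matching audio chunk per tick,
    # so both outputs are paced by a single local clock.
    next_tick = time.monotonic()
    while True:
        push_video(video_q.get())
        push_audio(audio_q.get())
        next_tick += TICK
        time.sleep(max(0.0, next_tick - time.monotonic()))
```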
As @TMN mentions, you’ll absolutely need some kind of buffering because network conditions are unpredictable, and in synchronized media even a few tens of milliseconds of misalignment can be noticeable.