I have developed a deep learning video captioning model that reads frames from a USB camera operating at 30 FPS. The model keeps one frame out of every 6 consecutive frames, effectively sampling at 5 FPS. It then processes a window of 16 sampled frames (collected over a 3.2-second interval) and generates a description of that window.
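To make the numbers concrete, here is a minimal sketch of the sampling arithmetic. The function names are hypothetical, and `keeps_up` encodes only one possible reading of "real-time" (inference on one window finishes before the next non-overlapping window has been collected); it is an assumption for illustration, not an established criterion.

```python
CAMERA_FPS = 30      # camera frame rate, from the setup above
FRAME_STRIDE = 6     # keep one frame out of every 6
WINDOW_FRAMES = 16   # sampled frames per caption


def sampled_fps(camera_fps: float, stride: int) -> float:
    """Effective frame rate after keeping every `stride`-th frame."""
    return camera_fps / stride


def window_seconds(camera_fps: float, stride: int, window_frames: int) -> float:
    """Wall-clock time needed to collect one window of sampled frames."""
    return window_frames / sampled_fps(camera_fps, stride)


def keeps_up(inference_seconds: float) -> bool:
    """Assumed criterion: inference is no slower than window collection.

    This assumes non-overlapping windows; a sliding window would
    tighten the budget.
    """
    budget = window_seconds(CAMERA_FPS, FRAME_STRIDE, WINDOW_FRAMES)
    return inference_seconds <= budget


if __name__ == "__main__":
    print(sampled_fps(CAMERA_FPS, FRAME_STRIDE))                    # 5.0
    print(window_seconds(CAMERA_FPS, FRAME_STRIDE, WINDOW_FRAMES))  # 3.2
```

Under this reading, the budget per caption would be the 3.2-second window rather than a fixed 1 second, but the caption would still lag the start of the window by the collection time plus the inference time.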
What criteria would qualify my model as “real-time”?
Should the inference time per window be less than 1 second for real-time video captioning, or is a different threshold appropriate?