I am currently training a YOLO (You Only Look Once) object detector for an application in an industrial environment. Since a fixed camera setup is used, the backgrounds of the images are camera-dependent but do not vary over time (apart from positional inaccuracies of the measured parts).
From a data-science point of view, this means: My training data can be logically grouped into N camera views, each containing a specific background.
My idea is the following:
Instead of shuffling all images within one big dataset, I shuffle only the order of the batches, so that camera views are never mixed within a single training batch. My hope is that this way the specific properties of each background view are learned more explicitly (see the minimal sketch below).
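To make this concrete, here is a minimal sketch of how I would implement it as a PyTorch batch sampler. The `camera_ids` list (mapping each dataset index to its camera view) and the batch size are placeholders for however this is stored in practice:

```python
import random
from collections import defaultdict

from torch.utils.data import DataLoader, Sampler


class PerViewBatchSampler(Sampler):
    """Yields batches whose samples all come from the same camera view."""

    def __init__(self, camera_ids, batch_size):
        # camera_ids: placeholder list mapping dataset index -> camera view.
        self.batch_size = batch_size
        self.by_view = defaultdict(list)
        for idx, cam in enumerate(camera_ids):
            self.by_view[cam].append(idx)

    def __iter__(self):
        batches = []
        for indices in self.by_view.values():
            random.shuffle(indices)  # shuffle images within each view
            for i in range(0, len(indices), self.batch_size):
                batches.append(indices[i:i + self.batch_size])
        random.shuffle(batches)  # shuffle only the order of the batches
        yield from batches

    def __len__(self):
        return sum(
            (len(v) + self.batch_size - 1) // self.batch_size
            for v in self.by_view.values()
        )


# Usage (dataset and camera_ids assumed to exist):
# loader = DataLoader(dataset, batch_sampler=PerViewBatchSampler(camera_ids, 16))
```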
The tradeoff would be to “sacrifice” some of the network's generalization capability (which is not needed in this particular application) in exchange for better performance in this specialized setting.
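If I understand mini-batch SGD correctly, the difference can be stated as follows: with global shuffling, each update (in expectation) averages the gradients over all views, whereas with my scheme every single update follows the gradient of one view only, and the views mix only across steps:

$$
g_t^{\text{shuffled}} \approx \frac{1}{N}\sum_{v=1}^{N} \nabla_\theta L(\theta_t; B_v)
\qquad\text{vs.}\qquad
g_t^{\text{per-view}} = \nabla_\theta L\bigl(\theta_t; B_{v(t)}\bigr)
$$

Here $L$ is the detection loss, $B_v$ a batch drawn from view $v$, and $v(t)$ the view assigned to step $t$. My scheme should therefore increase the variance of the gradient across steps, which is the effect I am unsure about.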
Does anyone here have an in-depth understanding of how the weight updates differ between these two approaches? Pointers to literature would also be appreciated, but I did not find similar problems in the time I researched.