I have an image dataset with 10 different object classes and their bounding boxes. I want to train a model that, given a set of objects, predicts the most probable position of each object on a canvas.
Inputs:
A canvas size (W×H), e.g. 1024×1024
The number of objects of each class (and maybe their sizes), e.g. 1 title 250×40, 1 text 100×100, 1 text 150×50, 10 symbols 50×50, 2 illustrations 200×200
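To make the inputs concrete, here is a minimal sketch of how I imagine encoding them. The names `ObjectSpec` and `expand_specs` are my own placeholders, not from any library, and the one-row-per-object layout is only an assumption about what a model might consume.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectSpec:
    """One requested object type (hypothetical structure, for illustration only)."""
    label: str      # object class, e.g. "title"
    width: int      # requested width in pixels
    height: int     # requested height in pixels
    count: int = 1  # how many objects of this type to place


def expand_specs(specs):
    """Flatten the specs into one (label, width, height) row per object instance."""
    rows = []
    for spec in specs:
        rows.extend([(spec.label, spec.width, spec.height)] * spec.count)
    return rows


canvas = (1024, 1024)  # (W, H)
specs = [
    ObjectSpec("title", 250, 40),
    ObjectSpec("text", 100, 100),
    ObjectSpec("text", 150, 50),
    ObjectSpec("symbol", 50, 50, count=10),
    ObjectSpec("illustration", 200, 200, count=2),
]
rows = expand_specs(specs)
print(len(rows))  # 15 objects to place in total
```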
Output:
I want to predict the X and Y position of each input object (or maybe its dimensions too, if we decide to simplify the inputs). Note that sometimes objects can overlap each other and sometimes they cannot. Also, the inputs are related to each other: for example, symbols are usually grouped and placed close together.
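For reference, this is roughly the output I have in mind, together with the two checks I would use to judge a predicted layout: whether a box stays inside the canvas, and how much two boxes overlap (intersection over union). This is only a sketch; `PlacedBox`, `iou`, and `inside_canvas` are hypothetical names, and I am assuming x and y refer to the top-left corner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlacedBox:
    """A predicted placement (hypothetical structure, for illustration only)."""
    label: str
    x: float       # left edge, assumed top-left origin
    y: float       # top edge
    width: float
    height: float


def iou(a, b):
    """Intersection over union of two axis-aligned boxes; 0.0 if they are disjoint."""
    ix = max(0.0, min(a.x + a.width, b.x + b.width) - max(a.x, b.x))
    iy = max(0.0, min(a.y + a.height, b.y + b.height) - max(a.y, b.y))
    inter = ix * iy
    union = a.width * a.height + b.width * b.height - inter
    return inter / union if union > 0 else 0.0


def inside_canvas(box, canvas_w, canvas_h):
    """True if the box lies entirely within the canvas."""
    return (
        box.x >= 0
        and box.y >= 0
        and box.x + box.width <= canvas_w
        and box.y + box.height <= canvas_h
    )


s1 = PlacedBox("symbol", 100, 100, 50, 50)
s2 = PlacedBox("symbol", 125, 100, 50, 50)  # shifted right by half a width
s3 = PlacedBox("symbol", 160, 100, 50, 50)  # close to s1 but not touching

print(iou(s1, s2))                    # overlapping pair, IoU > 0
print(iou(s1, s3))                    # disjoint pair, IoU == 0
print(inside_canvas(s1, 1024, 1024))  # box fits on the canvas
```

A per-class rule such as "symbols may not overlap illustrations" could then be expressed as a threshold on `iou` for the relevant label pairs.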
Can I use YOLO (or any other model) to predict the bounding boxes? What is the best starting point for me? Can you give me the big picture of a solution?