I am working on document layout analysis and have been exploring CNNs and transformer-based networks for this task. Typically, images are passed as 3-channel RGB inputs to these networks. However, my data source is in PDF format, from which I can extract the exact position and character information directly.
I am concerned that rendering this PDF data to images for analysis will throw away valuable positional and character information. My idea is to widen the CNN's input from the standard 3 RGB channels to a higher-dimensional input whose extra channels encode this positional and character information. A rough sketch of what I have in mind is below.
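To make the idea concrete, here is a minimal PyTorch sketch. It assumes the PDF has already been parsed into `(char_id, x0, y0, x1, y1)` tuples with normalized coordinates; `rasterize_pdf_tokens`, `N_EXTRA`, and the specific channel encoding are all hypothetical choices of mine, not an established method:

```python
import torch
import torch.nn as nn
import torchvision.models as models

H, W = 224, 224   # target raster size
N_EXTRA = 4       # extra channels: char-class id, x center, y center, text mask

def rasterize_pdf_tokens(tokens, h=H, w=W, n_extra=N_EXTRA):
    """Paint per-character PDF information into extra feature planes.

    `tokens` is a list of (char_id, x0, y0, x1, y1) with coordinates
    normalized to [0, 1]. This is a toy encoding: channel 0 holds a
    scaled character-class id, channels 1-2 hold the normalized box
    center, and channel 3 is a binary text-presence mask.
    """
    planes = torch.zeros(n_extra, h, w)
    for char_id, x0, y0, x1, y1 in tokens:
        r0, r1 = int(y0 * h), max(int(y1 * h), int(y0 * h) + 1)
        c0, c1 = int(x0 * w), max(int(x1 * w), int(x0 * w) + 1)
        planes[0, r0:r1, c0:c1] = char_id / 255.0   # crude character encoding
        planes[1, r0:r1, c0:c1] = (x0 + x1) / 2.0   # x center
        planes[2, r0:r1, c0:c1] = (y0 + y1) / 2.0   # y center
        planes[3, r0:r1, c0:c1] = 1.0               # text-presence mask
    return planes

# Widen the first conv of a standard backbone from 3 to 3 + N_EXTRA channels,
# copying the pretrained RGB weights and zero-initializing the new ones so
# the network initially behaves exactly like the original on the RGB part.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
old_conv = model.conv1
new_conv = nn.Conv2d(3 + N_EXTRA, old_conv.out_channels,
                     kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride,
                     padding=old_conv.padding,
                     bias=old_conv.bias is not None)
with torch.no_grad():
    new_conv.weight[:, :3] = old_conv.weight
    new_conv.weight[:, 3:] = 0.0
model.conv1 = new_conv

# Example forward pass: RGB render of the page + rasterized PDF channels.
rgb = torch.rand(1, 3, H, W)
extra = rasterize_pdf_tokens([(ord('A'), 0.1, 0.1, 0.15, 0.13)]).unsqueeze(0)
out = model(torch.cat([rgb, extra], dim=1))
```

The zero-initialization of the new input weights is just one way to avoid destroying the pretrained features at the start of fine-tuning; whether scalar character ids in a channel are a sensible encoding at all is exactly the part I am unsure about.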
I understand how CNNs work and strongly suspect that this approach may not work, but I would appreciate any feedback or suggestions from the community. Has anyone experimented with augmenting input channels in this way, or does anyone have insights into integrating positional and character data directly into CNNs?
Thanks in advance for your thoughts and advice!