Tags: python, database, parsing, automation, data-extraction

Parse the text of a research paper PDF so that each paragraph or section can be identified

I want to build a database with information about several research papers so that I can vectorize phrases or paragraphs for use by an AI. I am using Python to extract the text from a PDF, but I get a single string that I do not know how to parse. This is an example of the first page of a paper (I am an author of the paper, so there is no copyright problem with the data).
An additional problem is that not every paper has the same style or organization.
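
To make the goal concrete, here is a minimal sketch of the kind of parsing I have in mind. It uses only the standard library and assumes the PDF text is already in a string, that headings sit on their own lines, and that blank lines separate paragraphs. The heading list and the names `HEADING_RE` and `split_sections` are hypothetical, made up for illustration. Papers with a different layout would need other rules.

```python
import re

# Hypothetical list of common section headings; extend it for your own papers.
HEADING_RE = re.compile(
    r"^(?:\d+(?:\.\d+)*\.?\s+)?"  # optional numbering such as "1." or "2.3"
    r"(abstract|introduction|background|related work|materials and methods|"
    r"methods?|results|discussion|conclusions?|acknowledge?ments|references)"
    r"\s*:?$",
    re.IGNORECASE,
)


def split_sections(text):
    """Group paragraphs under the most recent heading-like line.

    Returns a list of {"heading": str, "paragraphs": [str, ...]} dicts.
    Text before the first recognized heading goes under "front matter".
    """
    sections = [{"heading": "front matter", "paragraphs": []}]
    buffer = []  # lines of the paragraph currently being assembled

    def flush():
        # Close the current paragraph and attach it to the latest section.
        if buffer:
            sections[-1]["paragraphs"].append(" ".join(buffer))
            buffer.clear()

    for raw in text.splitlines():
        line = raw.strip()
        match = HEADING_RE.match(line)
        if match:
            flush()
            sections.append({"heading": match.group(1).lower(), "paragraphs": []})
        elif not line:
            flush()  # a blank line ends the paragraph
        elif buffer and buffer[-1].endswith("-"):
            # Heuristic: rejoin a word hyphenated across a line break.
            # This also merges genuine hyphens (e.g. "well-" + "known").
            buffer[-1] = buffer[-1][:-1] + line
        else:
            buffer.append(line)
    flush()
    # Drop sections that ended up with no paragraphs.
    return [s for s in sections if s["paragraphs"]]
```

Each returned paragraph could then be stored as one database row together with its section heading, which gives a natural unit to vectorize.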