I am trying to split the text of a PDF into columns, using regular expressions to grab the values that sit between the column headers. The end goal is a CSV file.
The Year data is printed twice, but I want one list of years and one list of temperatures.
Also, I am only getting the first page of years from the PDF. Please advise on that too, if you can.
PDF found here: https://www.weather.gov/media/slc/ClimateBook/Annual%20Average%20Temperature%20By%20Year.pdf
import re
from pdfminer.high_level import extract_text
PDF_path = r"B:\PyPDF_to_CSV\Annual_Average_Temperature_By_Year.pdf"
text = extract_text(PDF_path)
Year, Temp = [], []
p1 = [m.start() for m in re.finditer('Year',text)]
p2 = [m.start() for m in re.finditer(r'Annual Average Temperature \(F\)',text)]
for el1,el2 in zip(p1,p2):
    Year.append(text[el1+len(p1):el2])
    Temp.append(text[el1+len(p2):el2])
print(Year,Temp)
Current Output:
['ar\n\nYear\n1875\n1876\n1877\n1878\n1879\n1880\n1881\n1882\n1883\n1884\n1885\n1886\n1887\n1888\n1889\n1890\n1891\n1892\n1893\n1894\n1895\n1896\n1897\n1898\n1899\n1900\n1901\n1902\n1903\n1904\n1905\n1906\n1907\n1908\n\n'] ['ear\n\nYear\n1875\n1876\n1877\n1878\n1879\n1880\n1881\n1882\n1883\n1884\n1885\n1886\n1887\n1888\n1889\n1890\n1891\n1892\n1893\n1894\n1895\n1896\n1897\n1898\n1899\n1900\n1901\n1902\n1903\n1904\n1905\n1906\n1907\n1908\n\n']
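Note that both slices in the loop above start at el1 and end at el2, and the offsets added are the lengths of the match lists rather than of the header strings, which would explain why the same Year block lands in both lists. Below is a minimal sketch of the slicing I am after. It assumes the extracted text repeats a "Year" header followed by the years, then an "Annual Average Temperature (F)" header followed by the temperatures, once per page. The sample string, its temperature values, and the helper name split_columns are made up for illustration and are not taken from the PDF.

```python
import re

TEMP_HEADER = "Annual Average Temperature (F)"

# One block = "Year" header, years, temperature header, temperatures.
# re.escape handles the literal parentheses in "(F)"; re.S lets "." span newlines.
BLOCK = re.compile(
    r"Year\n(.*?)" + re.escape(TEMP_HEADER) + r"(.*?)(?=Year\n|\Z)",
    re.S,
)

def split_columns(text):
    """Return (year, temperature) pairs from every header-delimited block."""
    rows = []
    for block in BLOCK.finditer(text):
        years = re.findall(r"\b\d{4}\b", block.group(1))
        temps = re.findall(r"\d+\.\d+", block.group(2))
        rows.extend((int(y), float(t)) for y, t in zip(years, temps))
    return rows

# Hypothetical stand-in for extract_text(PDF_path); the values are invented.
sample_text = (
    "Annual Average Temperature By Year\n\n"
    "Year\n1875\n1876\n\n"
    "Annual Average Temperature (F)\n51.2\n50.8\n\n"
    "Year\n1909\n1910\n\n"
    "Annual Average Temperature (F)\n49.9\n52.1\n"
)

rows = split_columns(sample_text)
print(rows)
```

Because finditer walks every block instead of zipping two position lists, later pages would be picked up too, provided the real text follows the assumed layout. Printing repr(text) first is a quick way to check that.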
As a side note, I have already converted this PDF to a nice CSV using the code from:
Merging tables that span multiple pages using Camelot
However, I am trying to learn regular expressions and simplify the code.
Thank you.