I'm looking to parse PDFs to pull out relevant info. I'm using pypdf and am able to extract text, but it’s a bit of a slog getting it into something usable, because the PDFs appear to be formatted rather than straight text.
For instance, I'm looking to extract ‘Asset’, ‘Transaction Type’ and ‘Amount’ from the table here:
https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2020/20017693.pdf
If I’m not able to extract the table (headers and all), I’d like to extract the ticker (e.g., ‘(CSCO)’), asset type (e.g., ‘[ST]’ here), transaction type (e.g., ‘S’) and amount individually.
The code below gets me all the text to parse, but what I’m returning so far is kind of janky, and I wonder if there’s a better way.
import pypdf
import io
import requests
import pandas as pd
from bs4 import BeautifulSoup
import pickle
import fnmatch
import re as rx
url = 'https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2020/20017693.pdf'
# Download the PDF into an in-memory buffer
c = io.BytesIO(requests.get(url).content)
pdf = pypdf.PdfReader(c)
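text = ""
for page in pdf.pages:
    text += page.extract_text() + "\n"
# One entry per extracted line of text
substring = text.split("\n")
# Lines containing a parenthesized ticker, e.g. '(CSCO)'
holding = fnmatch.filter(substring, '*(*)*')
# Lines containing a bracketed asset type, e.g. '[ST]'
htype = fnmatch.filter(substring, '*[[]*[]]*')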
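
For pulling out the individual fields instead of filtering whole lines, one option might be a single regular expression with named groups, applied line by line. The sketch below is only that: the sample lines are made up for illustration (including the dates), and it assumes each transaction comes out of extract_text() on one line, in the order ticker, asset type, transaction type, then amount, with transaction codes P, S or E. The names ROW and parse_rows are hypothetical, and the real output may be laid out differently.

```python
import re

# Hypothetical sample lines; real extract_text() output may be laid out differently.
sample_lines = [
    "Cisco Systems, Inc. (CSCO) [ST] S 02/20/2020 03/10/2020 $1,001 - $15,000",
    "Filing ID #20017693",
]

# Assumed field order on a line: (TICKER) [TYPE] <P|S|E> ... $amount[ - $amount]
ROW = re.compile(
    r"\((?P<ticker>[A-Z.]+)\)\s*"                # ticker in parentheses, e.g. (CSCO)
    r"\[(?P<asset_type>[A-Z]+)\]\s+"             # asset type in brackets, e.g. [ST]
    r"(?P<transaction>P|S|E)(?:\s*\(partial\))?\s+"  # assumed transaction codes
    r".*?"                                       # skip anything in between (e.g. dates)
    r"(?P<amount>\$[\d,]+(?:\s*-\s*\$[\d,]+)?)"  # single amount or a range
)

def parse_rows(lines):
    """Return one dict per line that matches the assumed row layout."""
    rows = []
    for line in lines:
        match = ROW.search(line)
        if match:
            rows.append(match.groupdict())
    return rows

rows = parse_rows(sample_lines)
print(rows)
```

Passing `substring` from the code above instead of `sample_lines` would run this over the real text, and the resulting list of dicts can go straight into `pd.DataFrame(rows)`. If a row turns out to be split across several extracted lines, the pattern would need adjusting, or the lines joined first.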