I'm trying to extract 10-Q filings from the SEC website. The extraction itself works, but I'm running into problems when processing and standardizing the data: the date shows up as Date: 5810-17-00, and there are numbers along the edges of each table.
Here is my standardize_columns function, where I attempt to fix the date and clean the data:
    import logging

    import numpy as np
    import pandas as pd

    def standardize_columns(financial_data):
        """Standardizes columns across multiple dataframes from different filings."""
        standardized_dfs = []
        all_columns = set()
        for entry in financial_data:
            if isinstance(entry, tuple) and len(entry) == 2:
                date, df = entry
                if not isinstance(df, pd.DataFrame):
                    try:
                        df = pd.DataFrame(df)
                        logging.info(f"Converted data to DataFrame for date {date}")
                    except Exception as e:
                        logging.error(f"Failed to convert data to DataFrame for date {date}: {str(e)}")
                        continue
                if not pd.api.types.is_numeric_dtype(df.columns):
                    df.columns = df.iloc[0]
                    df = df[1:].reset_index(drop=True)
                all_columns.update(df.columns)
        for entry in financial_data:
            if isinstance(entry, tuple) and len(entry) == 2 and isinstance(entry[1], pd.DataFrame):
                date, df = entry
                df = df.reindex(columns=all_columns, fill_value=np.nan)
                transposed_df = df.transpose()
                standardized_dfs.append((date, transposed_df))
        return standardized_dfs
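A note on the numbers along the table edges: they look like pandas' default integer row and column labels. In the function above, the header row is promoted only when the column labels are not numeric, so tables that still carry default integer labels are skipped. The cleaned df from the first loop is also never stored, so the second loop works on the original tables again. Below is a minimal sketch of one possible fix, assuming the real header sits in the first non-empty row. The helper name promote_header and the sample table are hypothetical, made up for illustration.

```python
import numpy as np
import pandas as pd


def promote_header(df):
    """Hypothetical helper: use the first row as column labels when the
    current labels are integers (e.g. pandas' defaults)."""
    # Drop rows and columns that contain nothing but NaN.
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    if not df.empty and pd.api.types.is_numeric_dtype(df.columns):
        header = df.iloc[0].tolist()
        df = df.iloc[1:].reset_index(drop=True)
        df.columns = header
    return df


# Made-up table with integer column labels, as in the logged output.
raw = pd.DataFrame(
    [
        ["Item", "Pricing Category", "Value"],
        ["Corporate debt securities", "Level 2", 1510],
        [np.nan, np.nan, np.nan],
    ],
    columns=[3, 5, 7],
)

clean = promote_header(raw)
# index=False leaves the integer row labels out of the printed table.
print(clean.to_string(index=False))
```

To carry this through to the second pass, the cleaned frames would need to be collected (for example in a list of (date, df) tuples) and reindexed from there, rather than reading financial_data again.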
The result:

    2024-05-30 23:04:30,333 - INFO - Date: 5810-17-00, Data:
    3 5 7 … 19 20 21
    0 NaN Assets Corporate debt securities … 2.20% Notes Due 2021 (1) 3.20% Notes Due 2026 (1) Interest rate swap (2)
    1 Pricing Category NaN Level 2 … Level 2 Level 2 Level 2
    2 NaN NaN NaN … NaN NaN NaN
    3 October 29, 2017 NaN $ … $ $ $
    4 October 29, 2017 NaN 1510 … 996 1007 3
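
A possible explanation for the date: 5810-17-00 is not a valid calendar date (month 17, day 00), but it has the shape of a fragment of an SEC accession number, which is written as a 10-digit filer ID, a 2-digit year, and a 6-digit sequence number. If the date is being pulled from a filing URL or path with an unanchored pattern such as \d{4}-\d{2}-\d{2}, it could match inside the accession number. This is only a guess, since the extraction code is not shown. The sketch below uses a hypothetical accession-style string, chosen only so that it contains the digits from the log, and a hypothetical helper parse_iso_date that rejects impossible dates.

```python
import re
from datetime import datetime

# Hypothetical accession-style string, made up for illustration.
accession = "0001045810-17-000123"

# An unanchored date-like pattern matches inside the accession number.
loose = re.search(r"\d{4}-\d{2}-\d{2}", accession)
print(loose.group())  # 5810-17-00


def parse_iso_date(text):
    """Return a date if text is a real YYYY-MM-DD calendar date, else None."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None


print(parse_iso_date(loose.group()))  # None (month 17 does not exist)
print(parse_iso_date("2017-10-29"))   # 2017-10-29
```

Validating every candidate this way at least surfaces bad matches early instead of letting them end up in the log as dates.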