I’m trying to scrape this page of Peruvian football teams. I’m able to extract the information, which is spread across multiple tables, each preceded by a header that names the region of the teams in that table.
My problem is organizing the data. Since there are multiple tables, I want to store the header of each table as another column and end up with one combined dataset.
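To make the goal concrete, here is a minimal sketch of the result I am after. It uses placeholder frames instead of the scraped tables, so the team, city, and region values (and the names `scraped`, `frames`, `combined`) are made up for illustration and are not data from the page. Each per-region table gets a `Region` column, and the pieces are stacked into one DataFrame.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped tables:
# (region heading, table) pairs filled with placeholder values.
scraped = [
    ("Region X", pd.DataFrame({"Equipo": ["Team A", "Team B"],
                               "Ciudad": ["City 1", "City 2"]})),
    ("Region Y", pd.DataFrame({"Equipo": ["Team C"],
                               "Ciudad": ["City 3"]})),
]

frames = []
for region, table_df in scraped:
    # assign() returns a copy with the extra column; table_df is left untouched.
    frames.append(table_df.assign(Region=region))

# ignore_index=True renumbers the rows instead of repeating each table's index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

The point of the sketch is that the region label is attached per table before concatenating, so every row keeps track of which table it came from.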
I share my code below:
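import requests
import pandas as pd
from bs4 import BeautifulSoup
# This scraper retrieves all football teams in Peru from Wikipedia.
url = "https://es.wikipedia.org/wiki/Anexo:Clubes_de_fútbol_del_Perú"
dfs = []
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# Keep only the tables whose header cells mention one of the expected columns.
tables = soup.find_all(
    lambda tag: tag.name == "table"
    and tag.select_one('th:-soup-contains("Equipo", "Ciudad", "Fundación", "Estadio", "Liga")'))
for table in tables:
    # Drop hidden elements so their text does not end up in the cells.
    for tag in table.select('[style="display:none"]'):
        tag.extract()
    # Parse once per table, after the inner loop has finished.
    df = pd.read_html(str(table))[0]
    # The nearest preceding heading holds the region name.
    df['Region'] = table.find_previous(["h3", "h2"]).span.text
    dfs.append(df)
# Stack all per-region tables into a single DataFrame with a fresh index.
df = pd.concat(dfs, ignore_index=True)
# Print the combined DataFrame, not the list of separate tables.
print(df)
#df.to_csv("/Users/home/Downloads/data_results_manager.csv", index=False)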
I think my problem is in how I append the tables, and in the second (inner) loop, this one:
for tag in table.select('[style="display:none"]'):
    tag.extract()
df = pd.read_html(str(table))[0]
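For context, this is what I expect that step to do, shown on a small inline HTML snippet rather than the real page. The markup, the `sortkey` text, and the team and region names are made up for illustration. The sketch removes the elements hidden with `display:none` so their text does not leak into the cells, then looks up the closest preceding heading.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating the structure described above:
# a heading followed by a table whose cell holds a hidden element.
html = """
<h2>Region X</h2>
<table>
  <tr><th>Equipo</th></tr>
  <tr><td><span style="display:none">sortkey</span>Team A</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Remove every hidden element from the table before reading its text.
for tag in table.select('[style="display:none"]'):
    tag.extract()

# With the hidden span gone, only the visible text remains in the cell.
cell_text = table.find("td").get_text(strip=True)
# find_previous walks backwards through the document to the nearest heading.
region = table.find_previous(["h2", "h3"]).get_text(strip=True)
print(cell_text, "|", region)
```

Note that the attribute selector only matches elements whose `style` value is exactly `display:none`; a variant such as `display: none;` would need a different selector.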
Any help fixing my code would be greatly appreciated.