I am trying to scrape data from tables on websites. An example page that seems to go wrong is here. In the original source HTML, there are 4 tables of class publicschematic, which are the only ones I care about. I have existing code to parse these into a Pandas DataFrame:
from io import StringIO
import pandas
import requests
from bs4 import BeautifulSoup

url = "https://postings.speechwire.com/c-postings-schem.php?groupingid=10&Submit=View%20postings&tournid=15214"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, "html.parser")
tables = soup.find_all("table", {"class": "publicschematic"})
for table in tables:
    # read_html returns a list of DataFrames, one per <table> it finds
    dfs = pandas.read_html(StringIO(str(table)), flavor="bs4")
    print(table)
However, BeautifulSoup seems to be adding a </table> tag to the first table before it actually occurs in the HTML.
The real HTML of the first table is something like this:
<table class='publicschematic' width='100%'>
<tr class="publicschematicheader">
<td colspan="5" align="CENTER" class="publicschematicheader publicschematic"><a name='r1'></a>Round 1 - Saturday, 8:00 AM</td>
</tr>
...
<td class="publicschematic publicschematicsectionname centered">B</td>
<td class="publicschematic centered">29C Brad Jensen</td>
<td class="publicschematic publicschematicroom centered">W303</td>
<td class="publicschematic centered">9B Audrey Drakos and Lena Drakos (AFF)</td>
<td class="publicschematic centered">18G Ethan Jani and Ashok Vasan (Neg)</td>
</tr>
...
</table>
On the other hand, the program prints this:
<table class="publicschematic" width="100%">
<tr class="publicschematicheader">
<td align="CENTER" class="publicschematicheader publicschematic" colspan="5"><a name="r1"></a>Round 1 - Saturday, 8:00 AM</td>
</tr>
<tr class="publicschematicsubheader">
<td class="publicschematic publicschematicsubheader centered">Sect.</td>
<td class="publicschematic publicschematicsubheader centered">Judge</td>
<td class="publicschematic publicschematicsubheader centered">Room</td>
<td class="publicschematic publicschematicsubheader centered" colspan="2">Competitors</td>
</tr>
<td class="publicschematic publicschematicsectionname centered">A</td>
<td class="publicschematic centered">21A Jade Felthoven</td>
<td class="publicschematic publicschematicroom centered">W301</td>
<td class="publicschematic centered">24A Liliana Smith and Vinisha Tripathi (AFF)</td>
<td class="publicschematic centered">28A Camryn Williams and Ethan Samuels (Neg)</td>
</table>
In the source HTML, there isn't actually a </table> tag there. Furthermore, this only happens with the first table on the page; all the other tables parse fine.
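One thing I noticed in the printed output above is that the data cells appear with no enclosing <tr>, so the source markup may be missing an opening <tr> for that row. Below is a minimal sketch of a check for that, using only the standard library's html.parser. The class name StrayEndTagFinder and the sample markup are hypothetical and are not taken from the actual page; I am also assuming the table sits inside an outer layout table, which I have not confirmed.

```python
from html.parser import HTMLParser


class StrayEndTagFinder(HTMLParser):
    """Hypothetical helper: record closing tags with no matching open tag."""

    # Void elements never get a closing tag, so they are not tracked.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.stray = []      # (tag, line number) for unmatched closing tags

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Only look for a match inside the innermost open table, so a stray
        # </tr> in a nested table is not matched against an outer <tr>.
        scope_start = 0
        if tag != "table" and "table" in self.open_tags:
            last_table = self.open_tags[::-1].index("table")
            scope_start = len(self.open_tags) - last_table
        if tag in self.open_tags[scope_start:]:
            # Pop back to the matching open tag, dropping unclosed children.
            while self.open_tags.pop() != tag:
                pass
        else:
            self.stray.append((tag, self.getpos()[0]))


# Made-up markup: the row on line 4 has cells and a </tr> but no opening <tr>.
sample = """<table><tr><td>
<table class='publicschematic'>
<tr><td>Sect.</td></tr>
<td>A</td>
</tr>
<tr><td>B</td></tr>
</table>
</td></tr></table>"""

finder = StrayEndTagFinder()
finder.feed(sample)
finder.close()
print(finder.stray)
```

If feeding resp.text from the real page into this reports a stray </tr> inside the first table, that would at least suggest the problem is in the page's markup rather than in my parsing code.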
Any suggestions?