I’m trying to create a script to scrape a few fields from tables based on their headers. The problem is that all the tables are not of the same length, along with their headers. Here are the HTML elements of the two tables with different lengths for your consideration.
from bs4 import BeautifulSoup
html_table1 = """
<thead>
<tr>
<td rowspan="1" data-sort-key="survey_date_surrkey"> </td>
<th scope="col" colspan="1" data-sort-key="cell-1"><span>2 Bedroom</span></th>
<th scope="col" colspan="1" data-sort-key="cell-2"><span>3 Bedroom +</span></th>
<th scope="col" colspan="1" data-sort-key="cell-3"><span>Total</span></th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row" class="first-cell" data-field="survey_date_surrkey" data-value="0">2010 October</th>
<td class="numericalData" data-field="cell-1" data-value="23">23</td>
<td class="numericalData" data-field="cell-2" data-value="5">5</td>
<td class="numericalData" data-field="cell-3" data-value="28">28</td>
</tr>
</tbody>
"""
html_table = """
<thead>
<tr>
<td rowspan="1" data-sort-key="survey_date_surrkey"> </td>
<th scope="col" colspan="2" data-sort-key="cell-1"><span>Bachelor</span></th>
<th scope="col" colspan="2" data-sort-key="cell-3"><span>1 Bedroom</span></th>
<th scope="col" colspan="2" data-sort-key="cell-5"><span>2 Bedroom</span></th>
<th scope="col" colspan="2" data-sort-key="cell-7"><span>3 Bedroom +</span></th>
<th scope="col" colspan="2" data-sort-key="cell-9"><span>Total</span></th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row" class="first-cell" data-field="survey_date_surrkey" data-value="0">1990 October</th>
<td class="numericalData" data-field="cell-1" data-value="0.0">0.0</td>
<td class="" data-field="cell-2" data-value="b ">b </td>
<td class="numericalData" data-field="cell-3" data-value="0.2">0.2</td>
<td class="" data-field="cell-4" data-value="a ">a </td>
<td class="numericalData" data-field="cell-5" data-value="0.0">0.0</td>
<td class="" data-field="cell-6" data-value="b ">b </td>
<td class="numericalData" data-field="cell-7" data-value="0.0">0.0</td>
<td class="" data-field="cell-8" data-value="b ">b </td>
<td class="numericalData" data-field="cell-9" data-value="0.1">0.1</td>
<td class="" data-field="cell-10" data-value="a ">a </td>
</tr>
</tbody>
"""
soup = BeautifulSoup(html_table1,'html.parser')
headers = [header.text.strip() for header in soup.select('thead th')]
non_class_presence = [item.text.strip() for item in soup.select("tbody tr td[class='']")]
index_num = len(soup.select("td.numericalData"))
try:
bachelor = soup.select("td.numericalData")[0].text
except IndexError: bachelor = ""
try:
one_bed = soup.select("td.numericalData")[1].text
except IndexError: one_bed = ""
try:
two_bed = soup.select("td.numericalData")[2].text
except IndexError: two_bed = ""
try:
three_bed = soup.select("td.numericalData")[3].text
except IndexError: three_bed = ""
try:
total_bed = soup.select("td.numericalData")[4].text
except IndexError: total_bed = ""
print(f"bachelor: {bachelor}none_bed: {one_bed}ntwo_bed: {two_bed}nthree_bed: {three_bed}ntotal_bed: {total_bed}")
When I run the script using the second html_table
, I get the following output: (correct result
)
bachelor: 0.0
one_bed: 0.2
two_bed: 0.0
three_bed: 0.0
total_bed: 0.1
However, the script produces errorneous results when I run it using html_table1
: (wrong result
)
bachelor: 23
one_bed: 5
two_bed: 28
three_bed:
total_bed:
Correct output for html_table1
:
bachelor:
one_bed:
two_bed: 23
three_bed: 5
total_bed: 28
How can I change the script’s logic such that, despite the table’s and its headers’ length, the results are accurately parsed?