Convert HTML table with a header to Json – Python-ThrowExceptions

Exception or error:

Suppose I have the following HTML table:

<table>
  <tr>
    <th>Name</th>
    <th>Age</th>
    <th>License</th>
    <th>Amount</th>
  </tr>
  <tr>
    <td>John</td>
    <td>28</td>
    <td>Y</td>
    <td>12.30</td>
  </tr>
  <tr>
    <td>Kevin</td>
    <td>25</td>
    <td>Y</td>
    <td>22.30</td>
  </tr>
  <tr>
    <td>Smith</td>
    <td>38</td>
    <td>Y</td>
    <td>52.20</td>
  </tr>
  <tr>
    <td>Stewart</td>
    <td>21</td>
    <td>N</td>
    <td>3.80</td>
  </tr>
</table>

I’d like to convert this table to JSON, potentially in the following format:

data= [
  { 
    Name: 'John',         
    Age: 28,
    License: 'Y',
    Amount: 12.30
  },
  { 
    Name: 'Kevin',         
    Age: 25,
    License: 'Y',
    Amount: 22.30
  },
  { 
    Name: 'Smith',         
    Age: 38,
    License: 'Y',
    Amount: 52.20
  },
  { 
    Name: 'Stewart',         
    Age: 21,
    License: 'N',
    Amount: 3.80
  }
];

I’ve seen another example that sort of does the above, which I found here.
However, there are a couple of things that I can’t get working given that answer. Those are:

  • It is limited to two rows on the table. If I add an additional row, I get an error:

print(json.dumps(OrderedDict(table_data))) ValueError: too many values
to unpack (expected 2)

  • The header rows of the table are not taken into account.

This is my code so far:

html_data = """
<table>
  <tr>
    <th>Name</th>
    <th>Age</th>
    <th>License</th>
    <th>Amount</th>
  </tr>
  <tr>
    <td>John</td>
    <td>28</td>
    <td>Y</td>
    <td>12.30</td>
  </tr>
  <tr>
    <td>Kevin</td>
    <td>25</td>
    <td>Y</td>
    <td>22.30</td>
  </tr>
  <tr>
    <td>Smith</td>
    <td>38</td>
    <td>Y</td>
    <td>52.20</td>
  </tr>
  <tr>
    <td>Stewart</td>
    <td>21</td>
    <td>N</td>
    <td>3.80</td>
  </tr>
</table>
"""

from bs4 import BeautifulSoup
from collections import OrderedDict
import json

table_data = [[cell.text for cell in row("td")]
                         for row in BeautifulSoup(html_data, features="lxml")("tr")]

print(json.dumps(OrderedDict(table_data)))

But I’m getting the following error:

print(json.dumps(OrderedDict(table_data))) ValueError: need more than
0 values to unpack

EDIT
The answer below works perfectly if there is only one table in the HTML. What if there are two tables? For example:

<html>
    <body>
        <h1>My Heading</h1>
        <p>Hello world</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>License</th>
                <th>Amount</th>
            </tr>
            <tr>
                <td>John</td>
                <td>28</td>
                <td>Y</td>
                <td>12.30</td>
            </tr>
            <tr>
                <td>Kevin</td>
                <td>25</td>
                <td>Y</td>
                <td>22.30</td>
            </tr>
            <tr>
                <td>Smith</td>
                <td>38</td>
                <td>Y</td>
                <td>52.20</td>
            </tr>
            <tr>
                <td>Stewart</td>
                <td>21</td>
                <td>N</td>
                <td>3.80</td>
            </tr>
        </table>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>License</th>
                <th>Amount</th>
            </tr>
            <tr>
                <td>Rich</td>
                <td>28</td>
                <td>Y</td>
                <td>12.30</td>
            </tr>
            <tr>
                <td>Kevin</td>
                <td>25</td>
                <td>Y</td>
                <td>22.30</td>
            </tr>
            <tr>
                <td>Smith</td>
                <td>38</td>
                <td>Y</td>
                <td>52.20</td>
            </tr>
            <tr>
                <td>Stewart</td>
                <td>21</td>
                <td>N</td>
                <td>3.80</td>
            </tr>
        </table>
    </body>
</html>

If I plug this in the below code, only the first table is shown as the JSON output.

How to solve:

This code does exactly what you want

from bs4 import BeautifulSoup
import json

xml_data = """
[[your xml data]]"""


if __name__ == '__main__':
    model = BeautifulSoup(xml_data, features='lxml')
    fields = []
    table_data = []
    for tr in model.table.find_all('tr', recursive=False):
        for th in tr.find_all('th', recursive=False):
            fields.append(th.text)
    for tr in model.table.find_all('tr', recursive=False):
        datum = {}
        for i, td in enumerate(tr.find_all('td', recursive=False)):
            datum[fields[i]] = td.text
        if datum:
            table_data.append(datum)

    print(json.dumps(table_data, indent=4))

###

You can use soup.find_all:

from bs4 import BeautifulSoup as soup
s = soup(html, 'html.parser').table
h, [_, *d] = [i.text for i in s.tr.find_all('th')], [[i.text for i in b.find_all('td')] for b in s.find_all('tr')]
result = [dict(zip(h, i)) for i in d]

Output:

[{'Name': 'John', 'Age': '28', 'License': 'Y', 'Amount': '12.30'}, {'Name': 'Kevin', 'Age': '25', 'License': 'Y', 'Amount': '22.30'}, {'Name': 'Smith', 'Age': '38', 'License': 'Y', 'Amount': '52.20'}, {'Name': 'Stewart', 'Age': '21', 'License': 'N', 'Amount': '3.80'}]

Leave a Reply

Your email address will not be published. Required fields are marked *