python – Get all HTML tags with Beautiful Soup

Exception or error:

I am trying to get a list of all HTML tags in a document using Beautiful Soup.

I see find_all(), but it looks like I have to know the name of the tag before I search.

If there is text like

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""

How would I get a list like

list_of_tags = ["<div>", "<div>", "<div class='magical'>", "<p>"]

I know how to do this with regex, but I am trying to learn BS4.

How to solve:

You don’t have to pass any arguments to find_all(). Called with no arguments, it returns every tag in the tree, recursively. Sample:

>>> from bs4 import BeautifulSoup
>>>
>>> html = """<div>something</div>
... <div>something else</div>
... <div class='magical'>hi there</div>
... <p>ok</p>"""
>>> soup = BeautifulSoup(html, "html.parser")
>>> [tag.name for tag in soup.find_all()]
['div', 'div', 'div', 'p']
>>> [str(tag) for tag in soup.find_all()]
['<div>something</div>', '<div>something else</div>', '<div class="magical">hi there</div>', '<p>ok</p>']
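
Note that neither expression above returns only the opening tags the question asks for. Here is a minimal sketch of one way to rebuild them from each tag's name and attrs. The helper opening_tag is hypothetical (it is not part of bs4), and it assumes attribute values contain no single quotes, since it does no escaping.

```python
from bs4 import BeautifulSoup


def opening_tag(tag):
    """Rebuild only the opening tag of a bs4 Tag (hypothetical helper)."""
    parts = [tag.name]
    for key, value in tag.attrs.items():
        # Multi-valued attributes such as class are stored as lists.
        if isinstance(value, list):
            value = " ".join(value)
        # Assumption: values contain no single quotes; nothing is escaped.
        parts.append("{}='{}'".format(key, value))
    return "<" + " ".join(parts) + ">"


html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""

soup = BeautifulSoup(html, "html.parser")
list_of_tags = [opening_tag(tag) for tag in soup.find_all()]
print(list_of_tags)
# ['<div>', '<div>', "<div class='magical'>", '<p>']
```

The parser normalizes the markup, so the original quoting style is not kept. The helper simply picks single quotes to match the list in the question.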

###

I thought I’d share my solution to a very similar question for those who find themselves here later.

Example

I needed to find all tags quickly but only wanted unique values. I’ll use the Python calendar module to demonstrate.

We’ll generate an HTML calendar, then parse it and collect only the unique tag names present.

The structure below is very similar to the above, but it passes the tag names to set() to drop duplicates:

>>> from bs4 import BeautifulSoup
>>> import calendar
>>>
>>> html_cal = calendar.HTMLCalendar().formatmonth(2020, 1)
>>> set(tag.name for tag in BeautifulSoup(html_cal, 'html.parser').find_all())
{'table', 'td', 'th', 'tr'}
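
A set has no defined order, so the names may print in any order. If the order of first appearance matters, one possible sketch uses dict.fromkeys, which drops duplicates while keeping insertion order (guaranteed from Python 3.7). The variable name unique_tags is only illustrative.

```python
import calendar

from bs4 import BeautifulSoup

html_cal = calendar.HTMLCalendar().formatmonth(2020, 1)
soup = BeautifulSoup(html_cal, "html.parser")

# dict keys are unique and keep insertion order, so each tag name
# appears once, in the order it is first seen in the document.
unique_tags = list(dict.fromkeys(tag.name for tag in soup.find_all()))
print(unique_tags)
# ['table', 'tr', 'th', 'td']
```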

###

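Alternatively, pass True to find_all(), which matches every tag (findAll is the older spelling of the same method):

# soup is a BeautifulSoup object, built as in the first answer
for tag in soup.find_all(True):
    print(tag.name)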
