python – Return text surrounded by double tag with BeautifulSoup-ThrowExceptions

Exception or error:

I am looping through a list of URLs. Each page contains between 1 and n descriptions, each of which is wrapped in a double p tag.

For each URL, soup.find(class_='view-content') returns markup like this:

# url 1
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One animal</p>
</p>
</div>
</div>
</div>

# url 2
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One person</p>
</p>
</div>
</div>
<div class="row">
<div class="description">
<p><p>Two people </p>
</p>
</div>
</div>
</div>

When I use

for d in soup.find(class_='view-content').find_all('p'):
    dd = d.contents[0]
    print(dd)

I get

<p>One animal</p>One animal
<p>One person</p>One person
<p>Two people</p>Two people

Instead of expected

One animal
One person
Two people

Any way to retrieve content surrounded by double p tags?

Edit: The following still returns the duplicated text, but at least without the p tags.

for d in soup.find_all("div", class_="view-content"):
    print(' '.join(i.text for i in d.find_all('p')[1:]))

How to solve:

Another solution, using SimplifiedDoc from simplified_scrapy:

from simplified_scrapy import SimplifiedDoc
html = '''
# url 1
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One animal</p>
</p>
</div>
</div>
</div>

# url 2
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One person</p>
</p>
</div>
</div>
<div class="row">
<div class="description">
<p><p>Two people </p>
</p>
</div>
</div>
</div>
'''
doc = SimplifiedDoc(html)
divs = doc.selects('div.view-content')
datas = []
for div in divs:
    datas.extend([p.text for p in div.ps])
print(datas)

Result:

['One animal', 'One person', 'Two people']
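
For comparison, the same result can probably be obtained with BeautifulSoup alone. The duplicates in the question appear because find_all('p') searches recursively, so it matches both the outer and the inner p tag. The sketch below keeps only p tags that contain no further p tag and skips any that end up empty. It assumes the html.parser backend, which keeps the nested p tags as written; other parsers such as lxml may restructure the markup, which is why empty tags are filtered out as well. The names html and texts are just for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question (both pages combined).
html = """
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One animal</p>
</p>
</div>
</div>
</div>
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One person</p>
</p>
</div>
</div>
<div class="row">
<div class="description">
<p><p>Two people </p>
</p>
</div>
</div>
</div>
"""

# Assumption: html.parser preserves the <p><p>...</p></p> nesting as written.
soup = BeautifulSoup(html, "html.parser")

texts = []
for container in soup.find_all(class_="view-content"):
    for p in container.find_all("p"):
        # Skip wrapper tags: an outer <p> still has a <p> inside it.
        if p.find("p") is not None:
            continue
        text = p.get_text(strip=True)
        # Skip empty tags that a stricter parser might leave behind.
        if text:
            texts.append(text)

print(texts)
```

With the sample markup above this should print ['One animal', 'One person', 'Two people'].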
