python – Some text is missing when reading HTML using urllib

Exception or error:

I use the following function to read the HTML of a web page (example link).

import http.cookiejar
import urllib.request

def read_html(url):
    # Create a custom opener with a User-agent header that also handles cookies.
    cookiejar = http.cookiejar.LWPCookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookiejar))
    opener.addheaders = [
        ('User-agent', 'Mozilla/5.0'),
        ('Content-Type', 'text/html; charset=utf-8'),
    ]

    # Make the opener the (global) default opener (urlopen will use it).
    urllib.request.install_opener(opener)

    # Open URL and read response.
    response = urllib.request.urlopen(url)
    html = response.read().decode('utf-8')  # assumes the page is UTF-8 encoded
    response.close()
    return html

Each line in the list of concentrates is a div with the class name ‘recline’, which holds the information about one flavour concentrate, such as its name, percentage, etc.

Using BeautifulSoup to extract the recline divs gives the following, for example (the same markup appears in the HTML text returned by read_html).

<div class="recline highlight flmis prmis">
 <div class="rlab" id="rfl1">
  <a href="">
   Acetyl Pyrazine 5% (
   <abbr title="The Flavor/Perfumer's Apprentice">
 <div class="runit" id="flu1">
 <div class="rdrops" id="fld1">
 <div class="rgrams" id="flg1">
 <div class="rpercent" id="flp1">

Note that the rdrops, rgrams and rpercent divs are missing the expected text (each contains just a newline character). Why might this be the case?
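
One plausible explanation, not confirmed by anything in this post, is that the page fills in these values with JavaScript after it loads. urllib only downloads the raw HTML and never runs scripts, so such fields would stay empty. A minimal sketch for checking which of these divs are empty in the raw HTML, using only the standard library html.parser as a stand-in for BeautifulSoup, might look like the following. The names EmptyDivFinder and find_empty_divs are hypothetical, and the sample markup (including the "ml" and "12" values) is made up for illustration.

```python
from html.parser import HTMLParser


class EmptyDivFinder(HTMLParser):
    """Collect the text found directly inside <div> elements with given classes."""

    def __init__(self, target_classes):
        super().__init__()
        self.target_classes = set(target_classes)
        self.stack = []  # one entry per open <div>: its id if targeted, else None
        self.text = {}   # div id -> text accumulated so far

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        div_id = attrs.get("id")
        if div_id and classes & self.target_classes:
            self.stack.append(div_id)
            self.text[div_id] = ""
        else:
            self.stack.append(None)

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Attribute text to the innermost open div, if that div is targeted.
        if self.stack and self.stack[-1] is not None:
            self.text[self.stack[-1]] += data


def find_empty_divs(html, target_classes=("rdrops", "rgrams", "rpercent")):
    """Return the sorted ids of targeted divs that contain only whitespace."""
    parser = EmptyDivFinder(target_classes)
    parser.feed(html)
    parser.close()
    return sorted(div_id for div_id, text in parser.text.items() if not text.strip())


# Made-up sample: two fields are empty, one is filled in.
sample = """
<div class="recline">
 <div class="rlab" id="rfl1"><a href="">Acetyl Pyrazine 5%</a></div>
 <div class="runit" id="flu1">ml</div>
 <div class="rdrops" id="fld1">
 </div>
 <div class="rgrams" id="flg1">
 </div>
 <div class="rpercent" id="flp1">12</div>
</div>
"""

print(find_empty_divs(sample))  # ['fld1', 'flg1']
```

If the same ids come back empty for the text returned by read_html but show values in a browser, that would be consistent with the values being generated client-side. In that case the data has to come from whatever the page's scripts compute or request, or from a tool that actually executes JavaScript.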

How to solve:
