html – Scraping escaped JSON data within a <script type="text/javascript"> in R

Exception or error:

I am trying to scrape the data behind the two graphs (Forsmark and Ringhals) on the following HTML page: https://group.vattenfall.com/se/var-verksamhet/vara-energislag/karnkraft/aktuell-karnkraftsproduktion

The data originate from script tags like this (fragment):

<script type="text/javascript">
/*<![CDATA[*/ productionData = JSON.parse("{\"timestamp\":1582642616000,\"powerPlant\":\"Ringhals\", // etc
</script>

I would like to get two data frames that look like these:

F1          F2          F3       
number      number      number 

and

R1          R2          R3       
number      number      number  

I tried to use XML and XPath to parse the HTML page but did not get anywhere with that.

Do you have any ideas?

Thanks!

How to solve:

Those charts are <iframe>s that load from their own pages (for Forsmark, https://gvp.vattenfall.com/sweden/produced-power/iframe/forsmark), so you should scrape those two pages directly.

This was an interesting challenge.

It is not too hard with rvest and jsonlite, which you will have to install if you don't already have them. On Windows, Rtools is only needed if you build jsonlite from source; the prebuilt CRAN binary installs without it.

Try this:

library('rvest')
library('jsonlite')

# URL of the Forsmark iframe (repeat the same steps for the other iframe)
url <- 'https://gvp.vattenfall.com/sweden/produced-power/iframe/forsmark'

# Download and parse the page
webpage <- read_html(url)

# Extract the script element. That's a CSS selector for the specific one that holds the json data
# You can find it in your browser's DevTools by finding the script element
# and right-clicking, choosing Copy > CSS Path/Selector
script_element <- html_nodes(webpage, 'body > section:nth-child(2) > script:nth-child(2)')

# Extract its string content
json <- html_text(script_element)

# Clean it up: strip the JavaScript that wraps the JSON.parse() argument
json <- gsub("\n        /*<![CDATA[*/\n        productionData = JSON.parse(", "", json, fixed=TRUE)
json <- gsub(");\n        /*]]>*/\n    ", "", json, fixed=TRUE)
# Drop the outer double quotes of the JavaScript string literal
json <- gsub("\"{", "{\"", json, fixed=TRUE)
json <- gsub("}\"", "}", json, fixed=TRUE)
json <- gsub("{\"\\\"", "{\\\"", json, fixed=TRUE)

# Unescape the remaining \" sequences and parse the JSON
data <- jsonlite::fromJSON(gsub("\\\"", "\"", json, fixed=TRUE))

Caveat: I’m not really an R expert, there is likely a more elegant way of doing this (particularly the data cleaning portion). But it works.
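
One possible way to shorten the cleanup above, offered as a sketch rather than a tested replacement: the argument of JSON.parse() is a double-quoted string whose only escapes here are \", so it is also a valid JSON string. Parsing it once yields the JSON text, and parsing that text yields the data. The variable names (script_text, js_literal, json_text) are made up for this example, and script_text is the sample text content from this answer rather than a live download. The approach assumes the literal never uses JavaScript-only escapes such as \' or \x41.

```r
library(jsonlite)

# Sample text content of the <script> tag, copied from this answer
script_text <- paste0(
  "\n        /*<![CDATA[*/\n        productionData = JSON.parse(\"",
  "{\\\"timestamp\\\":1582643336000,\\\"powerPlant\\\":\\\"Forsmark\\\",",
  "\\\"blockProductionDataList\\\":[",
  "{\\\"name\\\":\\\"F1\\\",\\\"production\\\":998.86194,\\\"percent\\\":99.88619},",
  "{\\\"name\\\":\\\"F2\\\",\\\"production\\\":1120.434,\\\"percent\\\":97.8545},",
  "{\\\"name\\\":\\\"F3\\\",\\\"production\\\":1189.7126,\\\"percent\\\":99.55754}]}",
  "\");\n        /*]]>*/\n    "
)

# 1. Keep only the quoted argument of JSON.parse(...), outer quotes included.
#    (?s) lets "." match newlines, so the whole text is replaced by group 1.
js_literal <- sub('(?s).*JSON\\.parse\\((".*")\\);.*', "\\1", script_text, perl = TRUE)

# 2. Decode the string literal. It is wrapped in [ ] so that the top-level
#    JSON value is an array, which fromJSON() returns as a character vector.
json_text <- fromJSON(paste0("[", js_literal, "]"))[[1]]

# 3. Parse the decoded JSON text itself.
data <- fromJSON(json_text)

print(data$blockProductionDataList)
```

With a live page, script_text would be the result of html_text(script_element). Because the regular expression looks for JSON.parse( rather than exact whitespace, it should tolerate changes in the indentation around the CDATA comment.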

For the record, in case the page changes: the code above takes this text content of the <script> tag as input:

"\n        /*<![CDATA[*/\n        productionData = JSON.parse(\"{\\\"timestamp\\\":1582643336000,\\\"powerPlant\\\":\\\"Forsmark\\\",\\\"blockProductionDataList\\\":[{\\\"name\\\":\\\"F1\\\",\\\"production\\\":998.86194,\\\"percent\\\":99.88619},{\\\"name\\\":\\\"F2\\\",\\\"production\\\":1120.434,\\\"percent\\\":97.8545},{\\\"name\\\":\\\"F3\\\",\\\"production\\\":1189.7126,\\\"percent\\\":99.55754}]}\");\n        /*]]>*/\n    "

and produces data in this format (the values below come from a later request, so they differ slightly from the sample input):

> data
$timestamp
[1] 1.582647e+12

$powerPlant
[1] "Forsmark"

$blockProductionDataList
  name production  percent
1   F1   997.7902 99.77902
2   F2  1131.6150 98.83100
3   F3  1190.0520 99.58594
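
To get from that result to the one-row layout asked for in the question (one column per block), something like the following might work. It is a minimal sketch: blocks is a hand-typed stand-in for data$blockProductionDataList using the numbers printed above, and forsmark is a name chosen for this example.

```r
# Stand-in for data$blockProductionDataList, typed in from the output above
blocks <- data.frame(
  name = c("F1", "F2", "F3"),
  production = c(997.7902, 1131.6150, 1190.0520),
  percent = c(99.77902, 98.83100, 99.58594),
  stringsAsFactors = FALSE
)

# Name each production value after its block, then turn the named values
# into a data frame with a single row and one column per block
forsmark <- as.data.frame(as.list(setNames(blocks$production, blocks$name)))

print(forsmark)
```

Running the same steps on the other iframe's data would give the second data frame with columns R1, R2 and R3.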
