Hey, I am trying to implement a program that gets URLs from the HTML of a website, but I only want the URLs from the body. Basically, I want to avoid ads and menus on the website and only get the links that are embedded in the actual article. Does anyone know of a good way of isolating the body HTML from the rest of the HTML without hardcoding how the body is designated for each website?
It is a simple process to scrape only specific parts of the HTML. For the most part, you can choose which elements of the page you want. Let’s say you only want the <div id="example">example</div> element: you can tell your scraper to pick up only that div and ignore everything outside it. Please check this example out.
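Here is a minimal sketch of that idea using only Python's standard library html.parser module. The names ContainerLinkParser, links_in_container and SAMPLE_HTML are made up for illustration, and the sample page is invented. It assumes reasonably well-formed markup (every non-void tag inside the div is closed) and that you already know the id of the container you want, so it does not solve the "works on any site without configuration" part of the question.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContainerLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside the element with a given id."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0   # 0 means we are currently outside the container
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self.depth > 0:
            # Already inside the container: go one level deeper.
            self.depth += 1
            if tag == "a" and attrs.get("href"):
                self.links.append(attrs["href"])
        elif attrs.get("id") == self.container_id:
            # Found the container itself.
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1


def links_in_container(html, container_id):
    """Return the hrefs found inside the element whose id is container_id."""
    parser = ContainerLinkParser(container_id)
    parser.feed(html)
    parser.close()
    return parser.links


# Invented sample page: a menu, the article div, and an ad block.
SAMPLE_HTML = """
<html><body>
  <div id="menu"><a href="/home">Home</a></div>
  <div id="example">
    <p>Read <a href="https://example.com/a">this</a><br>
    and <a href="https://example.com/b">that</a>.</p>
  </div>
  <div id="ads"><a href="https://ads.example.com">Ad</a></div>
</body></html>
"""

print(links_in_container(SAMPLE_HTML, "example"))
```

With the sample page above, only the two links inside the "example" div are collected; the menu and ad links are skipped. If you are fine with a third-party library, Beautiful Soup can express the same selection as a CSS selector such as "#example a[href]", which is usually more forgiving with messy real-world HTML.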