There is more than one way to extract content from a webpage. The CPAN module Web::Scraper provides a great way to scrape (hence the name) content out of webpages. It uses CSS selectors or XPath queries to extract data from the HTML. I prefer CSS selectors because I am already familiar with them. The module's documentation is decent, but examples are often easier to learn from. CSS selectors are common in the world of HTML: usually you use them to pick out certain elements in an HTML document and assign CSS styles to them. With Web::Scraper, however, you can use the same selectors to navigate through the document.
The challenge
There is a webpage of the German national broadcaster that hosts video blogs with streams from reporters all over the world. Unfortunately, their RSS feed does not contain the video streams themselves: each entry only links to a webpage with the new stream, so every time you want to watch a new episode you have to go through that page. I decided to pull their RSS feed, query the links inside it and check for video streams. I wrote a small wrapper around this so I can use a small Web::Simple based application to generate an RSS feed with the video streams included, which I can subscribe to. Voilà, there is my RSS feed with videos for my tablet.
The solution
Here are some elements which I wanted to extract and I’ll show you the CSS selector which I used to navigate through the content.
Here is the first snippet of the webpage that I want to extract:
<html>
….
<span>Videoblog …</span>
<h1>…</h1>
<p>…</p>
There is only one h1 headline inside the document. I wanted the content of the headline and the content of the paragraph that follows the h1 tag. Here is how to express that in the language of Web::Scraper.
my $scrap = scraper {
    process 'h1', headline => 'TEXT';
    process 'h1+p', story => 'TEXT';
    …
The h1 selector is pretty self-explanatory. The notation h1+p is the adjacent sibling combinator: it matches a p tag that immediately follows the h1 tag. The 'TEXT' says that I want to extract the text content enclosed by the matched tag.
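To try the two rules in isolation, here is a minimal sketch that runs them against an inline HTML string instead of the live page. The markup and its text are hypothetical stand-ins for the elided snippet above, and the variable names are my own.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical stand-in for the real page; only the structure matters.
my $html = <<'HTML';
<html><body>
<span>Videoblog</span>
<h1>Example headline</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body></html>
HTML

my $scraper = scraper {
    # Text of the (only) h1 element
    process 'h1', headline => 'TEXT';
    # Text of the p element that directly follows the h1
    process 'h1+p', story => 'TEXT';
};

# scrape() also accepts a URI object; a plain string is parsed as HTML.
my $result = $scraper->scrape($html);

print "Headline: $result->{headline}\n";
print "Story:    $result->{story}\n";
```

Only the first paragraph ends up in story, because the second p is not directly adjacent to the h1.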
The webpage has some download links for video streams on it. I wanted to grab the URLs of the streams and the title of the corresponding stream. Here is the HTML snippet.
<ul id=…>
<li>
<a href="…" title="…">
<span>…</span>
</a>
</li>
<li>…multiple times one per stream ….</li>
</ul>
Well, things get a little more complicated here. I am interested in the href attribute and in the content enclosed by the span tag. As there are many li elements, one for each stream, we need to store the return values in an array. So, let's look at the CSS selector.
…
process 'a.downloadLink', "formats[]" => scraper {
    process 'a.downloadLink[href]', link => '@href';
    process 'span.title', format => 'TEXT';
};
Huh, ok, so let's proceed step by step. The selector a.downloadLink selects every a tag with the class downloadLink. The trailing [] in formats[] tells Web::Scraper to collect the results for all matches in an array instead of keeping a single value. Each array entry is produced by the nested scraper, which extracts the href attribute ('@href') of the link and the text ('TEXT') of the span element with the class title.
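The nested scraper can be exercised the same way. This is a sketch under assumptions: the HTML below is a hypothetical stand-in, and I have assumed that the elided snippet carries the class names downloadLink and title, because the selectors refer to them. The id, URLs and titles are made up.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical markup; class names are assumed from the selectors above.
my $html = <<'HTML';
<html><body>
<ul id="downloads">
  <li><a class="downloadLink" href="http://example.com/high.mp4" title="High"><span class="title">MP4 (high)</span></a></li>
  <li><a class="downloadLink" href="http://example.com/low.mp4" title="Low"><span class="title">MP4 (low)</span></a></li>
</ul>
</body></html>
HTML

my $scraper = scraper {
    # One array entry per matching link, each built by the nested scraper
    process 'a.downloadLink', 'formats[]' => scraper {
        process 'a.downloadLink[href]', link   => '@href';
        process 'span.title',           format => 'TEXT';
    };
};

my $result = $scraper->scrape($html);

print "$_->{format}: $_->{link}\n" for @{ $result->{formats} };
```

If only the URLs are needed, a single rule such as process 'a.downloadLink', 'links[]' => '@href'; should do, without a nested scraper.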
The result is a data structure which can be processed. Here is an extract of it:
$VAR1 = {
    'formats' => [
        {
            'link' => ...,
            'format' => ...,
        },
        {
            'link' => ...,
            'format' => ...,
        },
    ],
    'story' => "…",
    'headline' => "…"
};
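Processing that structure is plain Perl. The sketch below walks a hash reference of the same shape; all values are hypothetical placeholders, since the real ones are elided above.

```perl
use strict;
use warnings;

# Hypothetical result with placeholder values, shaped like the dump above.
my $result = {
    headline => 'Example headline',
    story    => 'Example story text.',
    formats  => [
        { link => 'http://example.com/high.mp4', format => 'MP4 (high)' },
        { link => 'http://example.com/low.mp4',  format => 'MP4 (low)' },
    ],
};

# One line for the article itself, then one line per stream.
my @lines = ( sprintf '%s: %s', $result->{headline}, $result->{story} );
for my $stream ( @{ $result->{formats} } ) {
    push @lines, sprintf '%s => %s', $stream->{format}, $stream->{link};
}

print "$_\n" for @lines;
```

The same loop is where an RSS item or enclosure could be built for each stream.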
The display of the source code and the data structure is nowhere near perfect in this blog engine so I’ve provided a small gist for it on my GitHub account. You can find it here.
Once you have walked through the source code of the example, you'll know how to extract portions of a webpage using CSS selectors. You can use Mojolicious for this purpose, too! The Mojo::DOM::CSS module provides everything you need. Which one you choose is just a matter of taste. For long HTML documents I would go for Web::Scraper, which might be faster because it can use LibXML if properly installed (plus you'll have validation support if needed).
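For comparison, here is a rough sketch of the same extraction with Mojo::DOM, the front end through which Mojo::DOM::CSS selectors are normally used. As before, the HTML is a hypothetical stand-in with assumed class names, and the result layout simply mirrors the Web::Scraper output.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup; class names are assumed, as in the selectors above.
my $html = <<'HTML';
<html><body>
<h1>Example headline</h1>
<p>First paragraph.</p>
<ul>
  <li><a class="downloadLink" href="http://example.com/high.mp4"><span class="title">MP4 (high)</span></a></li>
  <li><a class="downloadLink" href="http://example.com/low.mp4"><span class="title">MP4 (low)</span></a></li>
</ul>
</body></html>
HTML

my $dom = Mojo::DOM->new($html);

my $result = {
    # at() returns the first element matching the selector
    headline => $dom->at('h1')->text,
    story    => $dom->at('h1 + p')->text,
    # find() returns a collection of all matches; build one hash per link
    formats  => $dom->find('a.downloadLink')->map(sub {
        +{
            link   => $_->attr('href'),
            format => $_->at('span.title')->text,
        };
    })->to_array,
};

print "$result->{headline}\n";
print "$_->{format}: $_->{link}\n" for @{ $result->{formats} };
```

Note that at() returns undef when nothing matches, so real code should check for that before calling text.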
Conclusion
CSS selectors provide a powerful way to parse HTML/XML content and extract data from it. Web::Scraper and Mojo::DOM::CSS are two modules on CPAN which are well maintained and documented and have active communities around them. I think Web::Scraper deserves to be included in the toolchain of all people working with Perl.
Here are some more links with some useful material: