Web Scraping 101 with PHP and Goutte

Published by Safeer

The Basics

I’ll be showing how to scrape pages easily using PHP and Goutte. To get started, download goutte.phar. Then create a file called Scrape.php in the same directory as goutte.phar and pop the following into it:

<?php

include_once __DIR__ . '/goutte.phar';

$goutte = new Goutte\Client();

This loads Goutte and creates a new Goutte client. Now let’s find a page to scrape. For this example we’ll be using a dummy page I created for the tutorial.

Now that we have a page, we need to find the sections to scrape. To do this, open the page in Google Chrome and inspect the element(s) you want to fetch: hover over the section you want to retrieve, right click and choose ‘Inspect Element’. From here we can see the details for this element, including its ‘id’ and ‘class’, and we can even grab its XPath by clicking ‘Copy XPath’. XPath is a language for addressing parts of an XML document; full information can be found on the W3C website.

I’ll quickly run through how to extract the data using the ‘id’, the ‘class’ and the XPath. We will be fetching the ordered list from the dummy page, so right click the first point in the list, click ‘Inspect Element’ and then copy the XPath. This will copy the following:

//*[@id="points"]/li[1]/p

Reading from left to right, this XPath matches any element with the id ‘points’, then the first li directly inside it, then the p directly inside that li. A number in square brackets after an element name is a position: [1] selects the first matching element at that level (XPath positions start at 1). As we’re interested in the whole ordered list, we can remove the [1] so that the expression matches the p inside every li. Using this we can now add the following to our Scrape.php file:

$baseURL = 'http://saf33r.com/';
$urlEndpoint = 'resources/scrape-this-page.html';

$domSelector = '//*[@id="points"]/li/p';

$crawler = $goutte->request('GET', $baseURL . $urlEndpoint);

$results = $crawler->filterXPath($domSelector)->each(function ($node, $i) {
    return $node->nodeValue; // $node is a DOMElement object
});

var_dump($results);

This code starts by defining the URL and the DOM selector we’re going to use, then makes a request to fetch the HTML of the given page and stores the result in $crawler. The next section filters the retrieved page using the expression in our $domSelector variable. For each match, the closure receives a DOMElement object and its index, so we can use any of the methods described for the PHP DOMElement class. The values returned by the closure are collected into an array, which we finally dump out. Here I’m only pulling nodeValue, which DOMElement inherits from DOMNode.
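To see what removing the [1] actually changes, here is a minimal sketch that runs both expressions with PHP’s built-in DOMDocument and DOMXPath rather than Goutte, so it works without the phar. The markup is a stand-in I made up for illustration (the real dummy page may differ), and $textOf is a hypothetical helper.

```php
<?php
// Stand-in markup; the structure of the real dummy page is an assumption.
$html = <<<HTML
<ol id="points">
  <li><p>First point</p></li>
  <li><p>Second point</p></li>
  <li><p>Third point</p></li>
</ol>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Hypothetical helper: run an expression and collect the text of each match.
$textOf = function (string $expression) use ($xpath): array {
    $values = [];
    foreach ($xpath->query($expression) as $node) {
        $values[] = $node->nodeValue;
    }
    return $values;
};

$firstOnly = $textOf('//*[@id="points"]/li[1]/p'); // p inside the first li only
$allPoints = $textOf('//*[@id="points"]/li/p');    // p inside every li

var_dump($firstOnly, $allPoints);
```

With the [1] in place only ‘First point’ comes back; without it, all three points do.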

To grab this data using the ‘id’ or ‘class’, we do the same as above but provide a CSS selector built from the ‘id’ or ‘class’ instead of the XPath. Hover over the section you want to grab and inspect the element, then copy its ‘class’ or ‘id’. After that, all we have to do is change filterXPath to filter, like so:

$results = $crawler->filter($domSelector)->each(function ($node, $i) {
    // ...
});

and change $domSelector to:

$domSelector = '#points li';

and to get this content using the ‘class’ we can just use:

$domSelector = '.points li';
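If it helps to relate the two styles, the sketch below writes out rough XPath equivalents of both CSS selectors and runs them with PHP’s built-in DOMXPath instead of Goutte. The markup is made up for illustration, and the XPath strings are my own hand translation, not necessarily what Goutte generates internally. Note that these selectors match the li elements themselves, whereas the earlier XPath matched the p inside each li.

```php
<?php
// Stand-in markup; the real page's structure is an assumption.
$html = <<<HTML
<ol id="points" class="points numbered">
  <li><p>First point</p></li>
  <li><p>Second point</p></li>
</ol>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Rough XPath equivalent of the CSS selector '#points li'.
$byId = $xpath->query('//*[@id="points"]//li');

// Rough XPath equivalent of '.points li'. The concat/normalize-space
// wrapper matches "points" as a whole word, so it still works when the
// class attribute holds several class names.
$byClass = $xpath->query(
    '//*[contains(concat(" ", normalize-space(@class), " "), " points ")]//li'
);

var_dump($byId->length, $byClass->length);
```

Both queries find the same two li elements here.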

That’s it; it’s rather simple to grab information from a page. I’ve added a few snippets below showing how to do some of the common things when scraping the DOM.

Important

Please keep the following in mind when scraping pages:

  • Ensure that you don’t hammer the server(s) that you are fetching data from. You can do this by simply adding a sleep(seconds) after each call, which makes the script stop and do nothing for the number of seconds you give it.
  • Whilst you’re experimenting with getting the data you want from a page, it’s better to use a dummy page which mirrors the page structure you want or, better still, a cached version.
  • Check whether an RSS feed, an API or another format is provided by the website you wish to scrape. This may be better, easier and more reliable in the long term, since page references and XPaths are more likely to change without notice than an official API.
  • Plan your code and implement ways to alert you when something fails repeatedly, as this could mean that the page structure has changed (amongst other things). It’s also best to keep your code well structured so you can use it for multiple cases and/or update it easily when the page structure does change.
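As a rough illustration of the first two points, here is a minimal sketch of a fetch wrapper that caches each page on disk and sleeps after every real request. Everything in it is hypothetical: politeFetch, the cache layout and the fake fetcher are names I made up, and in a real script the callable would wrap your Goutte request instead.

```php
<?php
/**
 * Fetch a URL through $fetch, caching the body on disk and pausing after
 * each real request. All names here are hypothetical.
 */
function politeFetch(string $url, callable $fetch, string $cacheDir, int $delaySeconds = 2): string
{
    $cacheFile = $cacheDir . '/' . md5($url) . '.html';

    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile); // cached copy: no request, no pause
    }

    $body = $fetch($url);
    file_put_contents($cacheFile, $body);
    sleep($delaySeconds); // give the server a rest before the next request

    return $body;
}

// A fresh cache directory for this run.
$cacheDir = sys_get_temp_dir() . '/scrape-cache-' . uniqid();
mkdir($cacheDir);

// Fake fetcher standing in for a real HTTP call; it counts how often it runs.
$requests = 0;
$fakeFetch = function (string $url) use (&$requests): string {
    $requests++;
    return '<ol id="points"><li><p>First point</p></li></ol>';
};

$first  = politeFetch('http://example.com/page.html', $fakeFetch, $cacheDir, 1);
$second = politeFetch('http://example.com/page.html', $fakeFetch, $cacheDir, 1);

var_dump($requests); // the second call is served from the cache
```

Because the second call hits the cache, the fetcher only runs once, which also makes this handy while you are still experimenting with selectors.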

Retrieve Element X

If you want to grab a single element from a list you can do the following:

$results = $crawler->filter($domSelector)->eq(0)->text();

The value in eq can be replaced with the zero-based position of the element you wish to retrieve. It’s also worth noting that you can grab the first or last element by replacing eq(position) with first() or last().

Checking for an Attribute and Retrieving a given Attribute

To check whether an element has a given attribute, and to read its value, we can use the hasAttribute() and getAttribute() methods on the $node element in the closure.

$results = $crawler->filter($domSelector)->each(function ($node, $i) {
    if ($node->hasAttribute('id')) {
        return $node->getAttribute('id');
    }
    return 'noID';
});
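Since hasAttribute() and getAttribute() come from PHP’s own DOMElement class, the same check can be tried without Goutte. The sketch below is one way to do that; the markup is invented for illustration.

```php
<?php
// Stand-in markup: the first li has an id, the second does not.
$html = '<ol id="points"><li id="intro"><p>First</p></li><li><p>Second</p></li></ol>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$ids = [];
$rawIds = [];
foreach ($xpath->query('//*[@id="points"]/li') as $node) {
    // Same logic as the closure above: fall back to 'noID' when absent.
    $ids[] = $node->hasAttribute('id') ? $node->getAttribute('id') : 'noID';

    // Without the check, a missing attribute comes back as an empty string.
    $rawIds[] = $node->getAttribute('id');
}

var_dump($ids, $rawIds);
```

The hasAttribute() check matters when you need to tell a missing attribute apart from one that is present but empty, because getAttribute() returns an empty string in both cases.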

Simpler way to extract wanted data

You can grab a list of attributes from a CSS Selector by running the following:

$results = $crawler->filter($domSelector)->extract(array('_text', 'class', 'id'));

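This will return a multidimensional array with one entry per matched element. Each entry is itself an array holding one value for each of the requested attributes, in the order given, where ‘_text’ stands for the element’s text. If there is no value, an empty string is put in that position.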
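To make the shape of that result concrete, here is a small sketch that mimics it with PHP’s built-in DOM classes. It is not Goutte’s implementation: extractAttributes is a hypothetical helper written only to reproduce the structure described above, and the markup is made up.

```php
<?php
/**
 * Hypothetical helper that mimics the result shape described above:
 * one row per node, one value per requested attribute.
 */
function extractAttributes(DOMNodeList $nodes, array $attributes): array
{
    $rows = [];
    foreach ($nodes as $node) {
        $row = [];
        foreach ($attributes as $attribute) {
            $row[] = $attribute === '_text'
                ? $node->nodeValue                 // '_text' means the text content
                : $node->getAttribute($attribute); // '' when the attribute is missing
        }
        $rows[] = $row;
    }
    return $rows;
}

$html = '<ol id="points"><li class="lead" id="intro">First</li><li>Second</li></ol>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$results = extractAttributes(
    $xpath->query('//*[@id="points"]/li'),
    ['_text', 'class', 'id']
);

var_dump($results);
```

Here the first row holds ‘First’, ‘lead’ and ‘intro’, while the second holds ‘Second’ followed by two empty strings.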