Web scraping is one of the most useful and least understood methods for journalists to gather data. It's the thing that helps you when, in your online research, you come across information that qualifies as data, but does not have a handy "Download" button. Here's your guide on how to get started – without any coding necessary.
Say, for example, I am looking for "coffee" on Amazon. When I hit "search", I get a list of results that's made to be easily readable for humans. But it doesn't help me much if I want to analyze the underlying data – say, how much the average coffee package costs or which brands dominate the Amazon coffee market. For that purpose, a handy table might be more practical.
One option, then, might be to copy the information on each result by hand. Let's say that takes my unpaid intern – whom I hired because I didn't want to do the job myself – 5 seconds for each search result. With 200,000 results, that still takes them more than a month, if they work full-time from 9 to 5 at constant speed, without pause.
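For the curious, here is that back-of-the-envelope estimate as a tiny Python sketch. The 5 seconds and the 200,000 results come from the example above; the 8-hour workday is my reading of "9 to 5", and weekends are ignored.

```python
# Rough estimate of how long copying every result by hand would take.
SECONDS_PER_RESULT = 5      # figure from the text
RESULTS = 200_000           # figure from the text
HOURS_PER_WORKDAY = 8       # assumption: "9 to 5" with no breaks

total_hours = SECONDS_PER_RESULT * RESULTS / 3600
workdays = total_hours / HOURS_PER_WORKDAY

print(f"{total_hours:.0f} hours, or about {workdays:.0f} full workdays")
```

Run as written, this prints 278 hours, or about 35 full workdays, which is roughly seven working weeks.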
So, even if I have an unpaid intern at hand, this way is just not practical. My main motivation for learning how to code has always been laziness: I hate doing the same boring thing twice, let alone 200,000 times. So, I learned how to scrape data.
Scrapers, in practice, are little programs that extract and refine information from web pages.
They can come in the form of point-and-click tools or scripts that you write in a programming language. Their big advantages are:
- They're much quicker than manual work,
- they automate the task of information extraction, and
- they can be recycled whenever you need to scrape the same website again.
If you need to scrape many differently structured sites, though, you'll quickly notice their biggest drawback: scrapers are pretty fragile. They have to be configured for the exact structure of one website. If the structure changes, your scraper might break and not produce the output you expect anymore.
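To make the fragility concrete, here is a minimal sketch of a hand-written scraper, using only Python's standard library. The two page snippets, the class names and the price are made up for illustration; the point is only that a scraper tied to one exact structure stops finding anything once that structure changes.

```python
from html.parser import HTMLParser


class PriceScraper(HTMLParser):
    """Collects the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        # The scraper is hard-wired to one tag name and one class name.
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())


def scrape_prices(html):
    scraper = PriceScraper()
    scraper.feed(html)
    return scraper.prices


# Two made-up versions of the same page, before and after a redesign.
old_page = '<li>Coffee A <span class="price">4.99</span></li>'
new_page = '<li>Coffee A <span class="cost">4.99</span></li>'

print(scrape_prices(old_page))  # ['4.99']
print(scrape_prices(new_page))  # [] (the scraper silently finds nothing)
```

Note that the second call does not raise an error. It just returns an empty list, which is why a broken scraper can go unnoticed for a while.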
Scrapers occupy an important place among the data sources available to data journalists. So let's get to it: how to scrape data yourself.
A bit of a damper first: scraping is one of the more advanced ways to gather data. Still, there are some tools you can – and should – start using immediately.
😊 Level 1: Capture tables from websites
This is the first step in your scraping career: there are extensions for Chrome ("Table Capture") and Firefox ("Table to Excel") that help you easily copy tables from websites into Excel or a similar program. They're the same program, they just have different names because why make it easy. (Note: if you used to have an add-on called "TableTools" in Firefox and wonder where it went: it's the "Table to Excel" one. Go ahead and reinstall it!)
With some tables, just marking them in your browser and using copy and paste will work, but it will often mess up the structure or the formatting of the table. These extensions spare you some trouble. If you have them installed, you can just go ahead and:
- Right-click on any table on a web page (try this list of countries in the Eurovision Song Contest, for example)
- Choose the option "Table to Excel – Display Inline" (or "Table Capture – Display Inline" if you're in Chrome). A field should appear in the upper left corner of the table that says something like "Copy to clipboard" and "Save to Excel"
- Click "Copy to Clipboard"
- Open Excel or your program of choice
- Paste the data. Voilà – a neatly formatted table.
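If you ever wonder what such an extension does under the hood, the idea can be sketched in a few lines of Python. This is not the extensions' actual code, just a rough stand-in using the standard library. The sample table is made up, and a real page would have to be downloaded first.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the cell texts of an HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample table, standing in for a table on a real web page.
sample_html = """
<table>
  <tr><th>Country</th><th>Debut year</th></tr>
  <tr><td>Country A</td><td>1956</td></tr>
  <tr><td>Country B</td><td>1994</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample_html)

# Write the rows as CSV text, which Excel and similar programs can open.
buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
print(buffer.getvalue())
```

The sketch ignores merged cells and nested tables, which is exactly the kind of detail the extensions handle for you.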
😳 Level 2: Scrape a single website with the "Scraper" extension
If you're feeling a little more adventurous, try the Scraper extension for Chrome. It can scrape more than just tables – it scrapes anything you can see on a website, with no programming knowledge necessary.
- Right-click one of the links on the page you want to scrape (this walkthrough uses the xkcd archive page) and
- Select "Scrape similar…".
A new window will open and, if you wait a minute, you'll see that the program has already tried to guess which elements of the web page you want information about. It saw that there are many links like the one you clicked on the site, and thought you might want to scrape all of them.
If you know your way around the inner workings of a website a bit, you'll recognize that the "XPath" field specifies which kinds of elements you want to extract. If you don't: don't worry about it for now, this is generated automatically if you click on the right element to scrape.
For links, it will extract the text and the URL by default. If you want to extract more properties, try this:
- Click the little green "+" next to "URL"
- Type "@title" in the new field that says "XPath", and "Date" (or anything, this is just the column name) where it says "Name"
- Click "Scrape" at the bottom
- Wait a second
Your window should now look like the one in the image. Congratulations: you've now also extracted the publication date of each comic. This only works with this example and only because, hidden in the source code of the website, the xkcd website administrators also specified a "title" attribute for each link (it makes a tooltip show up when you hover over each link) and wrote the publication date into that attribute.
You can see the structure of the xkcd archive page in the image below (see for yourself by right-clicking on the "Laptop Issues" link on the page and selecting "Inspect…" to open your browser's developer tools). The title and href attributes are the ones that the Scraper extension extracted from the page.
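For readers who want to see the same idea in code, here is a small Python sketch of what "extract the text, the URL and the title of every link" amounts to. The snippet is made up (link texts, URLs and dates are invented), and it assumes well-formed markup, which real web pages often are not. Python's built-in ElementTree also only understands a subset of XPath, so treat this as an illustration rather than a replacement for the extension.

```python
import xml.etree.ElementTree as ET

# Made-up snippet in the shape the text describes: links that carry
# an "href" and a "title" attribute. Texts, URLs and dates are invented.
snippet = """
<div>
  <a href="/2/" title="2006-1-2">Example comic B</a>
  <a href="/1/" title="2006-1-1">Example comic A</a>
</div>
"""

root = ET.fromstring(snippet)

rows = []
# ".//a" is an XPath expression meaning "every <a> element below here".
for link in root.findall(".//a"):
    rows.append({
        "Text": link.text,
        "URL": link.get("href"),    # what "@href" selects in XPath
        "Date": link.get("title"),  # what "@title" selects in XPath
    })

for row in rows:
    print(row)
```

Each dictionary corresponds to one row in the extension's preview table, with "Date" being the column name chosen in the steps above.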
Feel free to try the Scraper extension on any information you want to extract as well as on other elements – it doesn't just work on links. Once you have the information you want and the table preview looks good, just click "Copy to clipboard" or "Export to Google Docs…" and do some magic with your newly scraped data!
😨 Level 3: Scrape many web pages with the "Web Scraper" extension
Often, the data we want is not presented neatly on one page, but spread out over multiple pages. One very common example of this is a search that only displays a few results at a time. Even with our fancy extensions, this would reduce us to clicking through each results page and scraping them one by one. But we didn't want to do that anymore, remember? Thankfully, there's a programming-free solution for this as well. It's the Web Scraper extension for Chrome.
As you've probably already realized with the previous extension, you really need to know how websites work to build more complex scrapers. Web Scraper is still pretty interactive and doesn't require coding, but if you've never opened the developer tools in your browser before, it might get confusing pretty quickly.
The big advantage that Web Scraper has is that it lets you scrape not only one page, but go into its sub-pages as well. So, for example, if you want to scrape not only the titles and links of all xkcd comics, but also extract the direct URL of each image, you could make the extension click on the link to each comic in turn, get the image URL from each subpage and add it to your dataset.
I won't go into much detail about how this works here – that might be material for its own tutorial – but I do want to show you the basic process. You can see it in the video below.
- Web Scraper lives in the Developer Tools, so right-click > "Inspect…" and select the "Web Scraper" tab.
- Create a sitemap.
- Click into the sitemap and create a selector. This tells the program which elements you want to scrape information from.
- Click "Sitemap [name]" > "Scrape" and wait until it's done.
- Click "Sitemap [name]" > "Export data as CSV". Congratulations! You did it.
This produces the same output as we got with the Scraper plugin before. The exciting part starts once you add another layer to the process. It works pretty much the same way, as you can see in the second video:
- Click into the sitemap, click into the selector and create a new selector inside the first.
(You can see the hierarchy of selected elements by clicking on "Sitemap [name]" > "Selector graph".)
- Click "Sitemap [name]" > "Scrape" and wait until it's done. This might take a while depending on the number of sub-pages you need to loop through. Once it's done:
- Click "Sitemap [name]" > "Export data as CSV".
- Congratulations! You did it.
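The two-layer process above (one selector for the links, a nested one for each sub-page) can be sketched in Python as well. Everything here is hypothetical: the "website" is just a dictionary of invented pages, `fetch` is a stand-in for downloading a page, and the regular expressions are only there to keep the example short. They are not a robust way to parse real HTML.

```python
import re

# Stand-in for a website: a dict mapping URL to HTML. All pages, paths and
# file names are invented; a real scraper would download each page instead.
FAKE_SITE = {
    "/archive": '<a href="/1/">Example comic A</a> <a href="/2/">Example comic B</a>',
    "/1/": '<img src="/comics/example_a.png">',
    "/2/": '<img src="/comics/example_b.png">',
}


def fetch(url):
    """Hypothetical helper: return the HTML of a page."""
    return FAKE_SITE[url]


def scrape_archive(start_url):
    rows = []
    # Layer 1: every link on the start page (the first selector).
    links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', fetch(start_url))
    for url, title in links:
        # Layer 2: open the sub-page and pick the image URL (the nested selector).
        match = re.search(r'<img src="([^"]+)"', fetch(url))
        rows.append({
            "title": title,
            "url": url,
            "image": match.group(1) if match else None,
        })
    return rows


for row in scrape_archive("/archive"):
    print(row)
```

The loop is the part that takes time on a real site: one extra page has to be loaded for every link found on the first page.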
Even more tools
With this, you can tackle many of the scraping challenges that will present themselves to you. Of course, there are many more possible tools out there. I don't know most of them, and many of the more fancy ones are pretty expensive, but that should not keep you from knowing about them. So here's an incomplete list of other tools to check out:
Happy scraping, and don't panic!
As you can see, there's a bunch of options out there for those of you who do not code at all. Programming your own scrapers gives you even more freedom in the process and helps you get past any limitations of the tools I've just introduced. Still: if you know how to extract a table, and maybe even how to use the Scraper extension: awesome! You already know more than 99 percent of the population. And that fact is definitely true – we're data journalists, after all.
We hope this tutorial-slash-toolkit-overview has provided you with a good starting point for your scraping endeavours. Thanks for reading, and happy scraping!