This is Part I in our Data Scraping blog post series. Part II by Mike Klaczynski provides an Import.io walk-through and Part III by Isaac Obezo shows how to write scripts for scraping using Python.
You may have heard that August is Data Month. All month, we will be providing you with ideas on where to get data and how to use it in Tableau. Earlier, I showed you how to optimize your large data sets for performance and Ben Jones shared a few resources for Open Data. This is the first post in a series on Data Scraping, where we will be sharing a few different methods of collecting data from the web. Today’s method uses IFTTT.
IFTTT (pronounced like “gift” without the g) stands for “If this, then that.” It is a service that connects together different channels into rules to automate data collection, social networking posts, and email. One of the channels you can activate is for Google drive, making it easy to create IFTTT recipes that will translate into spreadsheets of data you can bring into Tableau. Here are a few examples of some IFTTT recipes:
Once you set up a recipe in IFTTT, it runs in the background for as long as you have it turned on. You never have to think about it. You just open up your Google Drive account, and rows of data will be waiting for you! Tableau Public user Mat Hughes used the Last.fm Scrobbles recipe to create a viz documenting his music listening habits. One of the competitors in our recent Social Media Viz Contest, Russell Spangler, used the Instagram hashtags recipe to create this dashboard of food popularity.
Last Spring, I used IFTTT to scrape Craigslist posts for tickets to the Sasquatch Music Festival. Sasquatch sold out in just under 2 hours this year, and many of my friends were unable to get tickets. I wanted to monitor Craigslist for tickets, to see if I could predict a drop in scalping prices and get a good deal for my friends. Craigslist is notoriously hard to scrape using traditional means, but using IFTTT and a little Excel wizardry, I was able to clean up the data and create this viz:
To get the data, I used the Craigslist Search recipe listed above. Whenever a new Craigslist listing was posted with the word “Sasquatch” in it, IFTTT wrote a row in my Google Spreadsheet. The data came out looking like this:
After I downloaded all the data, there were some additional clean-up tasks to be done. First, IFTTT always formats dates like “May 21, 2013 at 1:24 PM.” Having the “at” in there will keep Tableau from recognizing that this is date field. My simple solution was to do a “Find and Replace” in the date column, replacing “at” with a blank. Doing this allowed Tableau to recognize this field as a date.
The main piece of data I wanted from this project was the price of the tickets. The price is part of the title field, so I needed a way to separate it out. Since Craigslist post titles always follow the same structure: [Title (Location) $Price], I was able to use Text-to Columns in Excel to split up the title. I used $ as a delimiter to get the price out of the title. When I was done, I had a separate column just for price:
There were a few other data cleanup tasks to do: removing posts requesting tickets instead of selling, making sure that the list price was the price for 1 single general admission ticket, removing rows that didn’t have price data, etc.
Using IFTTT for data collection is an easy way for a non-programmer to get started with data scraping. There are some drawbacks; mainly that data may need some additional formatting work before bringing into Tableau. But it is still a great tool for the data scraping novice.
“Tableau Public is a new freeware app that intelligently suggests the right data visualization, and then lets you publish it to the Web as an embed.”