Protip: Wikipedia Is a Treasure Trove of Data Sets
It wasn't that long ago that I was in college, being warned by my teachers not to use Wikipedia to do research for papers. Believing that academics were often overly cautious about technology, I ignored them. The trick was to look at the footnotes of the Wikipedia article and then cite whatever they cited as the source. Sneaky, right?
These days, I'm sure it's hard for many of us to imagine a world in which we couldn't satisfy a sudden, burning desire to know which year Twinkies were invented, or look up who originally sang a favorite song, fall into a musical rabbit hole, and come out the other side with a real respect for the work of Burt Bacharach.
Wikipedia is perceived to be so important that Wired magazine even ran an article headlined "Why Wikipedia Is as Important as the Pyramids." The piece stated: "The site’s monumental compilation of 19 million entries in 282 languages has already had a greater cultural impact worldwide than most of the other 936 sites recognized for 'outstanding universal value' on the World Heritage List."
But Wikipedia isn't just a cultural treasure or a great research source for procrastinating college students. It's also a really great place to find data sets. Just ask Peter Gilks, who often uses Wikipedia as a source for his excellent vizzes like this one on the world's largest stadiums:
Take a look at the Wikipedia page on the world's largest stadiums. With the data neatly arranged in a table like that, it's easy to copy the table and paste it into Excel for cleanup, or even directly into Tableau!
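If copy and paste gets tedious, the same table can be pulled out with a few lines of code. Here is a minimal sketch using only Python's standard library. The HTML snippet, the stadium names, and the capacities are made up for illustration, and a real Wikipedia table will have more columns, footnote markers, and other wrinkles to clean up.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None outside a <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up sample standing in for a table copied from a Wikipedia page.
SAMPLE_HTML = """
<table class="wikitable">
<tr><th>Stadium</th><th>Capacity</th></tr>
<tr><td>Example Stadium A</td><td>100,000</td></tr>
<tr><td>Example Stadium B</td><td>95,500</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows

# Skip the header row and turn "100,000" style strings into integers.
records = [
    {"stadium": name, "capacity": int(capacity.replace(",", ""))}
    for name, capacity in rows[1:]
]

for record in records:
    print(record["stadium"], record["capacity"])
```

If you already have pandas installed, `pandas.read_html` is another option worth trying, since it returns every table it finds on a page as a DataFrame.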
As much fun as it can be to dig through random articles until something interesting comes up, Wikipedia actually has a very handy list of featured lists, many of which are already in a usable table format like the stadiums page.
Another cool thing you can do with all the information on Wikipedia is combine tables to dig deeper into a topic. For example, you could take this table of guest stars on The Simpsons and use the production code to join it to these tables of all the episodes, then see how certain guest stars correlate with the number of US viewers.
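To make the join idea concrete, here is a rough sketch in plain Python. Everything in it is a placeholder: the production codes, guest names, episode titles, viewer numbers, and the field names themselves are invented, so treat it as a shape to follow once you have the real tables in hand.

```python
# Made-up rows standing in for the guest star table.
guest_stars = [
    {"production_code": "X01", "guest": "Guest A"},
    {"production_code": "X02", "guest": "Guest B"},
    {"production_code": "X02", "guest": "Guest C"},
    {"production_code": "X99", "guest": "Guest D"},  # no matching episode
]

# Made-up rows standing in for the episode tables.
episodes = [
    {"production_code": "X01", "title": "Episode One", "us_viewers_millions": 10.0},
    {"production_code": "X02", "title": "Episode Two", "us_viewers_millions": 8.0},
    {"production_code": "X03", "title": "Episode Three", "us_viewers_millions": 9.0},
]


def join_on_code(guests, episode_rows):
    """Inner join: keep each guest row that has an episode with the same code."""
    by_code = {ep["production_code"]: ep for ep in episode_rows}
    joined = []
    for guest in guests:
        episode = by_code.get(guest["production_code"])
        if episode is not None:
            joined.append({**episode, **guest})
    return joined


joined = join_on_code(guest_stars, episodes)
for row in joined:
    print(row["guest"], row["title"], row["us_viewers_millions"])
```

This is an inner join, so guests without a matching episode drop out. In Tableau you would get the same effect by joining the two sheets on the production code field.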
Just because the data in a Wikipedia article isn't in a table doesn't mean you have to collect it by hand. I was looking at a list of Glee episodes and noticed that when you click through the link for each episode, you get a page with a fairly standard layout that always has this box in the corner:
There's some additional information here that wasn't in the original table: featured music. I used import.io to automatically visit each of these episode pages and pull the songs. I could even take it a step further and scrape info about the songs themselves from the links on the episode pages:
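import.io did the page visiting here, but the same idea can be sketched in code for anyone curious. The example below is not how import.io works internally. It is a small standard-library Python stand-in that assumes each episode page has an info box row labeled "Featured music" with the songs in a bulleted list. The URLs, page markup, and song names are all hypothetical, and the pages are served from a dictionary instead of the live site so the sketch runs offline.

```python
from html.parser import HTMLParser


class InfoboxRow(HTMLParser):
    """Collect the <li> items from the table row whose <th> label matches."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.items = []
        self._in_th = False
        self._in_target = False  # True after the matching <th> until the row ends
        self._in_li = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._in_th = True
            self._buf = []
        elif tag == "li" and self._in_target:
            self._in_li = True
            self._buf = []

    def handle_data(self, data):
        if self._in_th or self._in_li:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "th" and self._in_th:
            self._in_th = False
            self._in_target = "".join(self._buf).strip() == self.label
        elif tag == "li" and self._in_li:
            self._in_li = False
            self.items.append("".join(self._buf).strip())
        elif tag == "tr":
            self._in_target = False


def songs_by_episode(urls, fetch, label="Featured music"):
    """Visit each URL with the supplied fetch function and pull the labeled list."""
    result = {}
    for url in urls:
        parser = InfoboxRow(label)
        parser.feed(fetch(url))
        result[url] = parser.items
    return result


# Hypothetical pages standing in for real episode articles.
FAKE_PAGES = {
    "https://example.org/episode-1": """
        <table class="infobox">
        <tr><th>Directed by</th><td>Someone</td></tr>
        <tr><th>Featured music</th><td><ul>
            <li>Song One</li><li>Song Two</li>
        </ul></td></tr>
        </table>""",
    "https://example.org/episode-2": """
        <table class="infobox">
        <tr><th>Featured music</th><td><ul><li>Song Three</li></ul></td></tr>
        </table>""",
}

songs = songs_by_episode(list(FAKE_PAGES), FAKE_PAGES.__getitem__)
for url, titles in songs.items():
    print(url, titles)
```

To run this against real pages, you would swap `FAKE_PAGES.__getitem__` for a function that downloads the page, and check the actual markup first, since the layout assumed here may not match what the episode pages really use.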
Speaking of musical TV shows, I was delighted when I found out that the Wikipedia pages for Dancing with the Stars contained not only information on episodes, but all the scores for every dance that's ever been judged on that show! Using a combo of import.io, copy/paste, and some cleaning up in Excel, I got a pretty cool data set about the show's performances over the past 19 seasons. I've only barely scratched the surface of exploring the data, but here are some preliminary findings:
Have you explored Wikipedia data? Tell us about the data gems you've found in the comments below.