Starting a Data Discovery Project

By Lori Williams 2012/08/07

Sometimes you have data and get vague instructions to do something useful with it. Like, hey Lori, why don’t you take a look at what people are doing with Tableau Public? Great! Wait a minute. I have an overwhelming amount of data. Where should I start?

Over a couple of posts, I’m going to give a behind-the-scenes look at my Tableau Public analysis. This post covers the first stage in data discovery: figuring out what data you’ve got. I used three methods for getting a handle on the data: talking to other people, searching for documentation, and examining the data itself.

The most important method is talking with coworkers who are already database experts. I talked with one of our software engineers who is revising the script that feeds data into the Tableau Public database. He let me know that some of the tables track the history of a viz and that, for those tables, you have to add a filter on the date the data was loaded. I also spoke with two coworkers who have done projects with the same data. They gave me sample workbooks that showed me how to connect some of the most popular tables together, as well as helpful tips such as turning off auto-update (see image below), which is useful whenever you're working with big data.

In addition to talking with folks, I also searched for documentation on our internal wikis and servers. These were great supplemental materials to reference in my conversations. I found an old data dictionary that explained the unique identifiers and the fields for many of the tables. It helped me discover the names of the tables I wanted to use and focus my conversations on how to work with the data in those tables. So I could ask "What column should I use to connect the users and workbooks tables?" rather than asking which table contained data on users and workbooks.

Lastly, I looked at the data itself to spark questions and to check that what people told me about the data matched what I saw. Specifically, I looked at the number of records in each table and the number of records after connecting tables, and asked:

  • When I join users and workbooks together, do I still have the right number of workbooks (~60K)?
  • Are there missing data points in any of the fields I need to use?
  • Are there unusual values? (e.g., a new user with more than 100 workbooks)
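
I did these checks in Tableau, but the same three questions can be sketched in a few lines of Python with pandas. This is only an illustration, not the actual analysis: the table layout, the column names (user_id, workbook_id, title), and the tiny sample data are all made up for the example.

```python
import pandas as pd

def check_join(users, workbooks, key="user_id"):
    """Join workbooks to users and summarize basic sanity checks."""
    # A left join keeps every workbook row; indicator=True adds a "_merge"
    # column that marks workbooks with no matching user as "left_only".
    joined = workbooks.merge(users, on=key, how="left", indicator=True)
    return {
        "workbooks_before": len(workbooks),
        # If this is larger than workbooks_before, the key is duplicated in users.
        "rows_after_join": len(joined),
        "workbooks_without_user": int((joined["_merge"] == "left_only").sum()),
        # Count of missing values in each workbook column.
        "missing_per_column": workbooks.isna().sum().to_dict(),
    }

def heavy_users(workbooks, key="user_id", threshold=100):
    """Return users with more than `threshold` workbooks."""
    counts = workbooks.groupby(key).size()
    return counts[counts > threshold]

# Hypothetical sample data, just to make the sketch runnable.
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["a", "b", "c"]})
workbooks = pd.DataFrame({
    "workbook_id": [10, 11, 12, 13],
    "user_id": [1, 1, 2, 4],          # user 4 does not exist in users
    "title": ["x", None, "y", "z"],   # one missing title
})

report = check_join(users, workbooks)
unusual = heavy_users(workbooks, threshold=1)  # low threshold for the tiny sample
print(report)
print(unusual)
```

On real tables you would compare rows_after_join against the workbook count you expect (about 60K here) and set threshold to whatever counts as unusual for your data.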

After all that research, I'm ready to get to work answering questions. How do you create a picture of the average Tableau Public viz? That's the topic of my next post.