I’m using the following pipeline to ingest, process, and analyze data with Tableau on an AWS stack.
In this example, I’ll use the New York City Taxi data set as the source data. The data set has nine years’ worth of taxi rides activity—including pick-up and drop-off location, amount paid, payment type—captured in 1.2 billion records.
The data lands in S3. It is cleansed and partitioned via Amazon EMR and converted to an analytically optimized columnar Parquet format.
Note that you can point Tableau to the raw data in S3 (via Amazon Athena) as well as access the cleansed data with Tableau using Presto via your Amazon EMR cluster. Why might you want to use Tableau this early in the pipeline? Because sometimes you want to discover what’s out there and understand some questions worth asking before you even start the analysis.
Once you discover those questions and determine if this sort of analysis has long-term advantages, you can automate and optimize that pipeline, adding new data as soon as it arrives so you can get it to the processes and people that need it. You may also want to provision this data to a highly performant “Hotter” layer (Amazon Redshift or a Tableau extract) for repeated access.
As represented in the flow above, S3 contains the raw, denormalized taxi ride data at the timestamp level of granularity. This is the fact table. Amazon Redshift has the time dimensions broken out by date, month, and year, along with the taxi zone information.
Now let’s imagine that I’d like to know where and when taxi pickups happen on a certain date in a certain borough. With support for Amazon Redshift Spectrum, I can now join the S3 tables with the Amazon Redshift dimensions.
I can then analyze the data in Tableau to produce a borough-by-borough view of NYC ride density on Christmas Day 2015.
I could also hone in on just Manhattan to identify pick up “hotspots” where ride charges appear way higher than average.
With Amazon Redshift Spectrum, you now have a fast, cost-effective engine that minimizes data processed with dynamic partition pruning. Further improve query performance by reducing the data scanned. You could do this by partitioning and compressing data and by using a columnar format for storage.
At the end of the day, your choice of data source that you connect to in Tableau should be based on what variable you want to optimize for. For example, you may choose to connect live to Amazon Athena, Amazon Redshift, Amazon Redshift Spectrum, or bring a subset of your data into a Tableau extract.
Start by considering:
- Cost: Are you comfortable with the serverless cost model of Amazon Athena and potential full scans vs. the advantages of no setup?
- Performance: Do you want the throughput of local disk?
- Setup effort and time: Are you okay with the lead time of a Redshift cluster setup vs. just bringing everything into a Tableau extract?
To meet the many needs of our customers, Tableau’s approach is simple: it’s all about choice. This includes how you choose to connect to and analyze your data.
For more on how to approach data architecture decisions for the enterprise, watch this Big Data Strategy session my friend Robin Cottiss and I delivered at Tableau Conference 2017. We share several examples of companies leveraging the Tableau on AWS platform and provide a detailed run-through of the aforementioned demonstration.
If you’re curious to learn more about analyzing data with Tableau on Amazon Redshift we encourage you to check out the following resources: