Data Blending

Subject Area: Architecture

Level of complexity: Intermediate

Approximate Time to complete: 1 hour

Companion Workbook (twbx): n/a

Additional Resources on the web:

The simplest way to think of data blending is as an aggregated outer join. However, rather than creating a physical join of your data set, a data blend is a view of an aggregated join from disparate data sets. Data blending is a very useful tool, but there are some effects on performance and functionality.

Any time data blending is in action tableau will be querying multiple datasets. When using a single data set everything on the view is represented by a single VizQl query. When two data sets are blended together there is an increase in time to query and render due to an extra step that is being taken in the queries.

Scenario: one data set contains customer information along with their transactions and sales. A second data set contains a row of data with any transaction that had a returned item. Joining these two data sets will create a single table. When we want to look at the number of returns by customer we are writing a single query to the joined table ~ "Select Count(returns) from data group by customers". If we would like to create the same view with data blending there is an extra step that tableau must take to do this. Not only does tableau have to query each data set individually, but it then has to go through an extra step to match the linking fields between these two queries. In this case it links the customer names that are the same. This match is done in tableau with black magic. At a smaller scale these extra queries are negligible, but on a larger scale they can add up and cause a dip in performance.

The diagram above describes the process of a data blend. Each secondary data source must have a join key field that matches a primary key field being visualized by the primary data source. These secondary fields will also be aggregated at the level of the joining key field similar to a group by SQL statement.

There is no suggested limit on number of rows to use for data blending. Instead it is helpful to test it on your own data. If a blend is taking an unacceptable amount of time to query/render, than it is suggested to join the data sets for improved performance. A noticeable decrease in performance often occurs when data blending is applied to millions of rows.