Whitepaper
Data Blending: Dynamic Workload Driven Data Integration in Tableau
Data blending, a capability introduced in Tableau 6, has been a smash hit with our customers. Why? Because any user can combine data sources by simply dragging them into a single view for fast visual analysis. Whether the data source is an officially sanctioned data warehouse, a spreadsheet, or a .csv file sitting on a desktop, combining disparate sources for rapid analysis with Tableau’s data blending is easy and fast.
This paper, written by Kristi Morton of the University of Washington’s Computer Science department, introduces the power of data blending and walks through practical examples of how to leverage this Tableau capability. It is a helpful guide to both getting started and advanced techniques. If you have two or more data sources you’d like to analyze in Tableau, this paper is for you.
We've also pulled out the first several pages of the whitepaper for you to read. Download the PDF on the right to read the rest.
ABSTRACT
Tableau is a commercial business intelligence (BI) software tool that supports interactive, visual analysis of data. Armed with a visual interface to data and a focus on usability, Tableau enables a wide audience of end-users to gain insight into their datasets. The user experience is a fluid process of interaction in which exploring and visualizing data takes just a few simple drag-and-drop operations (no programming or DB experience necessary). In this context of exploratory, ad-hoc visual analysis, we describe a novel approach to integrating large, heterogeneous data sources. We present a new feature in Tableau called data blending, which gives users the ability to create data visualization mashups from structured, heterogeneous data sources dynamically without any upfront integration effort. Users can author visualizations that automatically integrate data from a variety of sources, including data warehouses, data marts, text files, spreadsheets, and data cubes. Because our data blending system is workload driven, we are able to bypass many of the pain points and uncertainty in creating mediated schemas and schema mappings in current pay-as-you-go integration systems.
1. INTRODUCTION
Unlike databases, human brains have limited capacity for managing and making sense of large collections of data. In database terms, the feat of gaining insight into big data is often accomplished by issuing aggregation and filter queries. Yet this approach is time-consuming, as the user is forced to 1) figure out what queries to write, 2) write the queries, 3) wait for the results to be returned in textual format, and finally 4) read through these textual summaries (often containing thousands of rows) to search for interesting patterns or anomalies. Tools like Tableau help bridge this gap by providing a visual interface to the data. This approach removes the burden of writing queries and lets users ask their questions through visual drag-and-drop operations (no queries or programming experience required). Additionally, answers are displayed visually, where patterns and outliers can quickly be identified.
Visualizations leverage the powerful human visual system to effectively digest large amounts of information. Figure 1 illustrates how visualization is a key component in the sensemaking model [1], i.e. the theory of how people search for, organize, and generate new knowledge from data. The process starts with some task or question that a knowledge worker (shown at the center) seeks to understand. In the first stage, the user forages for data that may contain relevant information for their analysis task. Next, they search for a visual structure that is appropriate for the data and instantiate that structure. At this point, the user interacts with the resulting visualization (e.g. drilling down to details or rolling up to summarize) to develop further insight. Once the necessary insight is obtained, the user can then make an informed decision and take action. This cycle is centered around and driven by the user and requires that the visualization system be flexible enough to support user feedback and allow alternative paths based on the needs of the user’s exploratory tasks. Most visualization tools [4, 5, 13], however, treat this cycle as a single, directed pipeline, and offer limited interaction with the user.
Moreover, users often want to ask their analytical questions over multiple data sources. However, the task of setting up data for integration is orthogonal to the analysis task at hand, requiring a context switch that interrupts the natural flow of the analysis cycle. We extend the visual analysis cycle with a new feature called data blending that allows the user to seamlessly combine and visualize data from multiple data sources on the fly. Our blending system issues live queries to each data source to extract the minimum information necessary to accomplish the visual analysis task. Often, the level of detail in the visualization is coarser than that of the data sets. Aggregation queries, therefore, are issued to each data source before the results are copied over and joined in Tableau’s local in-memory view. We refer to this type of join as a post-aggregate join and find it a natural fit for exploratory analysis, as less data is moved from the sources for each analytical task, resulting in a more responsive system. Finally, Tableau’s data blending feature automatically infers how to integrate the datasets on the fly, involving the user only in resolving conflicts. This system also addresses a few other key data integration challenges, including combining datasets with mismatched domains or different levels of detail, and handling dirty or missing data values. One interesting property of blending data in the context of a visualization is that the user can immediately observe any anomalies or problems through the resulting visualization.
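The post-aggregate join described above can be illustrated with a minimal Python sketch. It is not Tableau's implementation: the two in-memory SQLite databases, the table and column names, and the sample rows are all hypothetical stand-ins for independent data sources. Each source is asked only for region-level totals, and the small aggregated results are then joined locally. Keeping only the regions present in the first source is an assumption of this sketch.

```python
import sqlite3


def make_source(table, columns, rows):
    """Create an independent in-memory SQLite database standing in for one data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    return conn


# Hypothetical source 1: row-level sales records (made-up values).
sales = make_source(
    "sales", ["region", "amount"],
    [("East", 100.0), ("East", 50.0), ("West", 70.0)],
)

# Hypothetical source 2: city-level population records (made-up values).
population = make_source(
    "population", ["region", "city", "people"],
    [("East", "A", 1000), ("East", "B", 500), ("West", "C", 700), ("North", "D", 300)],
)


def post_aggregate_join(sales_conn, population_conn):
    """Aggregate at each source, then join the aggregated rows locally."""
    # Step 1: push the aggregation to each source, so only one row per
    # region travels back instead of every underlying record.
    sales_by_region = dict(
        sales_conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    )
    people_by_region = dict(
        population_conn.execute("SELECT region, SUM(people) FROM population GROUP BY region")
    )
    # Step 2: join the two small result sets in local memory on region.
    # Regions missing from the second source get None.
    return {
        region: (total, people_by_region.get(region))
        for region, total in sales_by_region.items()
    }


result = post_aggregate_join(sales, population)
```

Here three sales rows and four population rows are reduced to at most one row per region before anything is joined, which is the property the text credits for moving less data per analytical task.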
These design decisions were grounded in the needs of Tableau’s typical BI user base. Thanks to the availability of a wide variety of rich public datasets from sites like data.gov, many of Tableau’s users integrate data from external sources, such as the Web, or corporate data, such as internally curated Excel spreadsheets, into their enterprise data warehouses to do predictive, what-if analysis. However, the task of integrating external data sources into their enterprise systems is complicated. First, such repositories are under strict management by IT departments, and often IT does not have the bandwidth to incorporate and maintain each additional data source. Second, users often have restricted permissions and cannot add external data sources themselves. Such users cannot integrate their external and enterprise sources without having them co-located. An alternative approach is to move the data sets to a data repository that the user has access to, but moving large data is expensive and often untenable. We therefore architected data blending with the following principles in mind: 1) move as little data as possible, 2) push the computations to the data, and 3) automate the integration challenges as much as possible, involving the user only in resolving conflicts.
The rest of this paper is organized as follows. First, in Section 2, we present a brief overview of Tableau and data blending. Then, in Section 3, we discuss the use-case scenarios and underlying principles that drove the design of our data blending architecture. Section 4 describes our overall approach to data blending. Section 5 covers related work. Finally, in Section 6, we discuss interesting research directions, and in Section 7 we conclude.
2. BACKGROUND
In this section we describe Tableau and data blending at a high level. To make the concepts more concrete, we discuss a simple example that blends data from three data sources to produce a compelling visualization.
2.1 Tableau Overview
Tableau is a data visualization tool that sits between the end-user and the database and allows the user to create visualizations by dragging and dropping fields from their datasets onto a visual canvas. In response to these actions, Tableau generates formal VizQL (Visual Query Language) statements to build the requested visualization. VizQL is a structured query language with support for rendering graphics. Each VizQL statement is compiled into the SQL or MDX queries necessary to generate the data for the visualization.
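To give a rough feel for what compiling a visual specification into a query can mean, the toy Python sketch below turns a list of dimensions and aggregated measures into a SQL GROUP BY statement. It is purely illustrative: it is not VizQL syntax and not how Tableau's compiler works, and the function name `compile_view_to_sql`, its parameters, and the table and field names are all hypothetical.

```python
def compile_view_to_sql(table, dimensions, measures):
    """Build a SQL aggregate query from a toy description of a view.

    dimensions: field names to group by (e.g. fields dropped on rows/columns).
    measures:   (aggregate, field) pairs such as ("sum", "amount").

    Identifiers are interpolated as-is, so this sketch assumes trusted,
    simple names; a real system would quote or validate them.
    """
    allowed = {"SUM", "AVG", "MIN", "MAX", "COUNT"}
    select_parts = list(dimensions)
    for aggregate, field in measures:
        aggregate = aggregate.upper()
        if aggregate not in allowed:
            raise ValueError(f"unsupported aggregate: {aggregate}")
        select_parts.append(f"{aggregate}({field}) AS {aggregate.lower()}_{field}")
    sql = f"SELECT {', '.join(select_parts)} FROM {table}"
    if dimensions:
        # Grouping by the dimensions sets the level of detail of the result.
        sql += f" GROUP BY {', '.join(dimensions)}"
    return sql


query = compile_view_to_sql("sales", ["region"], [("sum", "amount")])
```

In this sketch the dimensions determine the level of detail of the result, and each measure becomes one aggregated column.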
2.2 Blending Data Overview
The data blending feature, released in Tableau 6.0, allows an end-user to dynamically combine and visualize data from multiple heterogeneous sources without any upfront integration effort. A user authors a visualization starting with a single data source – known as the primary – which establishes the context for subsequent blending operations in that visualization. Data blending begins when the user drags in fields from a different data source, known as a secondary data source. Blending happens automatically, and only requires user intervention to resolve conflicts. Thus the user can continue modifying the visualization, including bringing in additional secondary data sources, drilling down to finer-grained details, etc., without disrupting their analytical flow. The novelty of this approach is that the entire architecture supporting the task of integration is created at runtime and adapts to the evolving queries in typical analytical workflows.
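The primary/secondary relationship can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Tableau's algorithm: linking fields are guessed from field names the two sources share, every row of the primary source is kept, secondary values are attached where the linking fields match, and the secondary source is assumed to already hold one row per key. The function names `infer_link_fields` and `blend` and the sample data are hypothetical.

```python
def infer_link_fields(primary_rows, secondary_rows):
    """Guess linking fields as the field names both sources have in common."""
    if not primary_rows or not secondary_rows:
        return []
    return sorted(set(primary_rows[0]) & set(secondary_rows[0]))


def blend(primary_rows, secondary_rows, link_fields=None):
    """Attach secondary fields to primary rows that match on the linking fields."""
    if link_fields is None:
        link_fields = infer_link_fields(primary_rows, secondary_rows)
    if not link_fields:
        # The automatic guess failed, so the user has to resolve it.
        raise ValueError("no shared fields found; linking fields must be chosen manually")

    # Index the secondary source by its linking-field values
    # (assumes one secondary row per key, i.e. already aggregated).
    lookup = {tuple(row[f] for f in link_fields): row for row in secondary_rows}
    secondary_only = [f for f in secondary_rows[0] if f not in link_fields]

    blended = []
    for row in primary_rows:
        match = lookup.get(tuple(row[f] for f in link_fields), {})
        # The primary row is always kept; unmatched secondary fields become None.
        blended.append({**row, **{f: match.get(f) for f in secondary_only}})
    return blended


# Made-up sample data: the primary source sets the context of the view.
primary = [{"region": "East", "sales": 150}, {"region": "West", "sales": 70}]
secondary = [{"region": "East", "population": 1500}]

blended = blend(primary, secondary)
```

The `ValueError` branch stands in for the one case the text reserves for the user: when the automatic guess cannot decide how the sources relate.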