Tableau Prep: How it helps you build performant data prep flows
Learn more about Tableau Prep and how to navigate building and running flows with confidence.
Tableau Prep is a data exploration and preparation tool that provides a visual, no-code way for you to derive insights, by allowing you to interactively explore, clean, and shape your data in real-time before analysis. With Prep, you create flows, which are visualizations of the transformations that you’re going to perform on your data. Prep flows start with at least one Input step on the left, transformation steps, like Clean steps in the middle, and at least one Output step on the right.
Let’s dive into what happens behind the scenes, so that you can author and run your Prep flows with confidence.
There are two ways you can interact with Tableau Prep—Interactive Mode and Run Mode.
You’re in Interactive Mode when you’re authoring (creating or editing) a flow, like when you’re renaming columns in a Clean Step. Interactive Mode provides direct and interactive feedback as you clean, combine, and reshape your data. For example, if you join two tables together in a Join Step, you’ll see the final join result—down to the number of rows—immediately. At times, Prep may sample your data to ensure responsiveness, even when you’re working with large data sets.
You’re in Run Mode when you run your flow ad hoc in Tableau Prep Builder, Tableau Cloud, or Tableau Server, or when you run a scheduled flow in Tableau Prep Conductor. Unlike Interactive Mode, Run Mode generates output data for analysis, and never samples input data.
Just like cakes have multiple layers, Prep has multiple software layers. The only layer that you’ll directly interact with is the visualization layer. This is what you see when you use Prep, and shows your connectors, a graph of your flow details, profiles of your column summaries, and row level details. You can save your Prep Flow to a Tableau Prep flow (.tfl) file to use later, to share with others, or to run using Tableau Prep Conductor.
There are several more layers behind the scenes:
- A compiler validates that your flow doesn’t have errors, and if there are errors, exposes them to you in an understandable manner.
- Another compiler translates the details from your flow into a logical model that represents each part of your flow. The logical model looks like a tree and represents query operations like joins and calculations.
- Yet another compiler removes redundant parts of the logical model and transforms it into queries that can be run against your data.
Prep leverages the Tableau Query Layer that allows Prep users to connect to files, and relational, and NoSQL databases using more than 95 Tableau native connectors. The query layer is a building block that you’ll never see but that you’ll benefit from everytime you use Prep. The query layer performs a lot of heavy lifting to make sure your Prep flows run as fast as possible.
Here are some examples of how the query layer helps:
- Common subexpression optimizations: removes duplicate expressions to make queries more performant.
- Predicate Pushdown: pushes filters closer to database tables, to limit data movement.
- Aggregate Pushdown: pushes aggregations closer to database tables, to perform operations closer to the data.
- Cost Based Optimization: speeds up cross-database joins by minimizing data movement
- Converting Inner Joins to Semi Joins: makes joins more performant for databases that support SemiJoin.
- Converting Joins to Analytic Functions: can make joins run faster.
- Generates SQL in the native dialect of the source databases.
Putting it together to author a flow
When you build a new flow in Interactive Mode, you start off by adding an Input Step. You’ll notice that you don’t have the option to select between a live or extract connection, like you have in Tableau Desktop. Instead, Prep leverages the Query Layer to automatically select how to connect to your data based on the mode you’re in.
When you’re in Interactive Mode and connect to a new data source, or when you open an existing flow, Prep fetches data from the connector you configured in the Input Step. Prep places the data into Hyper, Tableau’s high performance, in-memory data engine.
Prep uses Hyper as a cache so that you can immediately validate your changes as they're happening without reloading data. The Pivot Step is an exception because it loads data directly from your connector. There are a couple ways to load new data into Prep’s cache: you can change any setting in the Input Step, like remove a column, or you can click the Refresh data button.
To make Prep responsive, Prep automatically applies sampling to load at most 1 million rows of data into Hyper. If you’d like to learn more about how sampling works then be sure to read the upcoming blog on sampling in Tableau Prep. To help out even more, Prep uses another cache to quickly remember visualizations, such as aggregations.
This cache is triggered on demand when you click on a step. This makes Prep load data faster when you navigate back to it. Depending on your flow, the caching may also be applied to steps to the left of the step you clicked on, making them faster to load as well. Making Prep as responsive as possible is top of mind for our engineers and we’re constantly making improvements behind the scenes.
Running your Prep Flow after you’ve authored it
You can generate output data for your Prep Flow in Run Mode by simply clicking Run Flow. Run Mode operates in a similar manner as Interactive Mode but there are two key differences:
- In Run Mode, Prep generates at least one data output rather than interactively visualizing your data.
- In Run Mode, Prep runs your flow against the entire data set. For example, if you set your sample size to a thousand rows, but your underlying data source has five million rows, the entire five-million-row data set is queried when you run your flow.
Manually running your flows works well when just starting out, but you may find that you need to automate the process to meet the needs of your organization. Tableau Prep Conductor, a part of the Data Management add-on, allows you to schedule and run workflows in a scalable, reliable, and secure fashion in Tableau Cloud or Tableau Server. Additionally, Prep Conductor allows you to centralize the scheduling, monitoring, and administration of your flow.
If you have a complex flow, consider dividing your steps into separate flows to organize your work if it isn't necessary for the cleaning operations to live in a single flow. Individual steps or even entire flows can be brought together later through Join or Union steps. If you’d like to learn more about how to optimally author your Prep Flows, then be sure to read the Best Practices to Improve the Performance of Your Data Preparation Flows blog.
How Tableau Prep helps you manage your data
Building your flow with Tableau Prep
When you are cleaning your data in any of the available step types (Clean, Union, Join, etc.), you are in what we call Interactive Mode in Tableau Prep. Interactive Mode delivers direct, interactive feedback as you clean, combine, and reshape your data. For example, if you join two tables together in a Join step, you will see the final join result—down to the number of rows—immediately. At times, Tableau Prep may also sample your data. Sampling ensures responsiveness to keep you in the flow of your task, even when you are working with large amounts of data in Tableau Prep. You can use the default sample amount or build a sample set by specifying a fixed number of rows.
How Tableau Prep caches data
When you connect to a data source in Tableau Prep, you will notice that you don’t have the option to select between a live or extract connection like in Tableau Desktop. Tableau Prep instead caches your data as Hyper extracts in our high performance, in-memory data engine. Tableau Prep does not cache everything, however, and the experience differs if it is your first time building a flow from scratch, versus coming back to edit an existing flow already built by a colleague.
Building a flow from scratch
Tableau Prep will always cache the results from your Input step. In the Input step, Tableau Prep queries the input tables from the source database or files, ingesting data into a Hyper extract which serves as the cache used as you continue to build your flow. We create this cache so that you can validate your changes as they're happening without slowing down the underlying database or your machine. When you clean or reshape your data after the Input step, the changes are applied to the data in our cache. (Tableau Prep doesn't query the source tables after the Input step when you’re in Interactive Mode.) What is cached will be based on how you configure your Input step. If you choose to sample, remove columns, or change a data type, these changes will affect the result. Your data is cached again throughout the flow when you add computationally expensive steps like Join steps or Union steps.
Editing a flow
If you go back and edit a pre-built flow, caching will depend on which step you open and begin to explore. Tableau Prep is equipped with a special algorithm that decides which step is best to cache. Depending on where you are in the flow, and what data is necessary to deliver a performant, visual, and interactive experience, Tableau Prep may cache your Input, Union, or Join steps, among others.
What happens when you pause updates?
Sometimes when you are building a flow in Tableau Prep, you may not need direct, live updates. There are times when you just need to go in an add a quick cleaning operation or input data transformations in bulk. In these scenarios, Tableau Prep allows you to pause data updates. When you choose to pause data updates, you are in Metadata Mode. Pausing data updates allows you to quickly make changes to your flow before generating your results.
Running your flow with Tableau Prep
When you run your flow and generate your output for analysis, you are in Run Mode and Tableau Prep runs your flow against the entire data set. For example, if you set your sample size to five thousand rows, but your underlying data source has five million rows, the entire five-million-row data set is queried when you run your flow. Tableau Prep has a smart execution engine that pushes down operations to your database when possible. This saves you time and resources because Tableau Prep conserves your machine’s processing power if you’re running your flow manually or through our command line interface; if you are scheduling your flows on Prep Conductor, your Server resources are conserved in a similar way. Did you know that you can create as many Output steps as you want in your flow? You have the flexibility to run one output at a time, or all outputs with the Run all Flows button at the top of the Tableau Prep interface.
Manually running flows works well when just starting out, but ultimately, you’ll need to automate the refresh process to properly scale your data preparation flow to meet the needs of your entire organization. Tableau Prep Conductor, a part of the Data Management add-on, allows users to schedule and run workflows in a scalable, reliable, and secure fashion in a Tableau Server or Online environment. Prep Conductor (shown below) allows you to centralize the scheduling, monitoring, and administration of your flow.
Now that we have demystified how Tableau Prep works under the hood, you can build and run your data preparation flows with confidence. Download a free trial of Tableau Prep and get started with this master list of learning resources.