In a previous post, we looked under the hood at how Tableau Prep Builder helps us build efficient flows through its native modes. It may be helpful to read that post to understand some of the terms we reference below, like interactive mode and run mode. In this blog, we build on that foundation and discuss design best practices that you can implement as you build data flows to ensure optimal performance.
Best practices for authoring your data preparation flow
Bring in the right data
The more data you bring in to Prep Builder while you’re in interactive mode, the more computationally “expensive” your flow will be. While there’s no row limit for working with data sets in Prep Builder, there are considerations to optimize performance—great power comes great responsibility after all. A simple yet powerful way to ensure high performance is working with only the data you need. When you run your flow, whether in Prep Builder or via Prep Conductor, changes are always applied to the entire data set and not just the sample, so that you can walk away with a clean, ready-to-analyze data set.
Oftentimes, your data only needs high-level restructuring, which doesn’t require insight into every individual row of data. Say you just need to pivot rows to columns, or union a few tables together. In scenarios like this, you can sample to speed things up. Prep Builder will automatically start sampling your data at just over one million records, but you can set your flow to sample regardless of whether this threshold has been met. Less data will lead to a faster authoring experience!
You'll notice you have two sampling methods available: a quick select sample and a random sample. Random sampling will provide a representation of all values in the dataset, whereas quick select sampling will bring in the first number of (x) rows from your data set, based on your sample size. Either can be effective, depending on what you need to accomplish.
Did you know you can filter data in the Input step, before you start cleaning, integrating, and reshaping your data? By filtering out any data that isn’t crucial to your workflow, you narrow down the scope of what Prep Builder must process. Using a smaller data set, when possible, will almost always guarantee better performance, because Prep Builder has less data to cache and query.
Pause your flow
At times, you may not need interactive feedback as you build your flow. If you need to add an operation or add data transformations in bulk, pause your flow. Just be aware that you won’t immediately see the results. When your flow is paused, interactive updates are paused, and queries are not generated. Pausing your flow allows you to quickly do what you need to clean your data and get to the final output faster.
Consider your data type
Any data you bring into Prep Builder will be assigned a data type. Tableau makes an educated guess, but you have the freedom to change this classification. Consider that a number or Boolean data type is easier to query than a string data type, because there are fewer possible values to be parsed. For example, say you have a numeric ID in your data that Prep Builder has classified as a string data type rather than a number type. You can change the data type from a string to a number (whole) data type to reduce query time.
Design with performance in mind
Identify flow objectives
When designing a Prep Builder flow, focus on a few key objectives. Consider dividing your steps into separate flows to organize your work if it isn't necessary for the cleaning operations to live in a single flow. Individual steps or even entire flows can be brought together later through join steps or union steps, or through copying and pasting steps where appropriate.
Filtering in the Clean step
The Clean step allows you to see a profiled summary of your entire data set, and easily identify and fix issues. It's also a great place to filter out unnecessary data to ensure better performance. Remember: less data means faster flow. Here are a few examples below.
Reusable steps (version 2019.3.2 or later)
If you find yourself applying the same cleaning operations or writing complex calculations repeatedly across multiple flows, use the reusable steps feature. You can save one or more steps as individual flows locally on your machine or publish them to share with other data preppers on Tableau Server or Tableau Online.
Save time by future proofing your flows. Future proofing will help you organize your work and leave breadcrumbs of what tasks were accomplished and where they are in a flow. If you share your flow with others, or publish it via Prep Conductor, a clean, organized flow gives visibility to others in the organization, fostering trust and confidence.
Name your steps
Organize your flow by naming steps as you build. This helps you and your colleagues understand why a step was added, and what tasks were accomplished within it.
In Prep Builder, you can leave comments on steps, on changes in the Changes pane, and within calculated fields. Help your teammates understand how the data was cleaned by quickly adding comments explaining what you did—especially if you plan on publishing your flows to Tableau Server or Online with Prep Conductor. These comments can also serve as helpful signposts and reminders for you when you revisit your own work.
Customize step colors
Right-click any step in the flow pane to change the color of the step. Changing step colors can help you visually organize your steps, especially in larger, more complex flows. For example, you may want to color code based on step type, to help you quickly identify all instances of joins. At other times, it may make sense to color-code based on task, like when two separate flows are eventually joined to create a single output.
Best practices for running your data preparation flow
Output only what you need
An Output step allows you to bring your prepared data out of Prep Builder and into the rest of the platform. Add a Clean step before your Output step to remove any extraneous columns or rows. Remember, when you run your flow, the entire database is queried. You can reduce query time and the load you’re putting on your computers and databases by outputting only necessary data. While a couple of extraneous columns shouldn’t impact performance in a serious way, your resulting flow will be cleaner and easier to work with both for you and others in your organization. (Read more about this process here, under “Running your flow with Tableau Prep.”)
Consider your output file type
In the Output step, you can configure Prep Builder to output the data as a Tableau extract or a CSV file. In testing, we found that an identical, large data set took 2 hours and 47 minutes to output as a CSV, but only 11 minutes to output as a Hyper extract. Extracts are better equipped to handle large data sets, and unless your workflow requires a flat file, we recommend using a Hyper extract. Outputting to an extract will also allow you to publish and maintain a single source of truth for your data on Tableau Server or Tableau Online if you’re using Prep Conductor to publish your flows.
As of 2020.3, Prep Builder will also be able to write back to your database.
Leverage the power of Tableau Server with Prep Conductor
Rather than relying on your personal computer to generate large outputs, you can instead leverage the power of Tableau Server. Tableau Server is scaled and configured to handle generating large outputs. Use Prep Conductor to run a completed, published flow rather than your computer. Then, you can manually generate the first output and schedule future refreshes on a schedule. (Learn how to isolate loads with Resource Blocks.)
Take advantage of incremental extracts (2020.2.1)
With the introduction of incremental extracts, you can configure your flows to refresh incrementally, meaning that only new rows will be retrieved when the data is refreshed. Incremental refreshes save you time and resources, especially on larger outputs. You can still schedule a full extract refresh as well—on a weekly or monthly cadence, for example.
Say you have transactional data—months of daily sales orders collected in a historical dataset that gets updated with new data every day. You don’t need to update the historical data in your flow; instead you only need to process the new rows. Prep Builder can save you time and resources through incremental extracts, by processing only the new data every day instead of the entire data set every time the flow is refreshed.
Armed with these design best practices, you can build performant data preparation flows with confidence. Next, read how Tableau Prep Builder works under the hood to help with performance. Download a free trial of Prep Builder and get started with this master list of learning resources.