Back in November, we introduced TabPy (currently in beta), making it possible to use Python scripts in Tableau calculated fields. When you pair Python’s machine-learning capabilities with the power of Tableau, you can rapidly develop advanced-analytics applications that can aid in various business tasks.

Let me show you what I mean with an example. Let’s say I’m trying to identify criminal hotspots in Seattle, my hometown. I’ll use data from the Seattle Police Department showing 911 calls for various types of criminal activity in the past few years.

Given the density of activity and the noise in GPS readings, it is hard to identify patterns in this data visually. Let’s see what we can find by applying some unsupervised machine learning.

Density-based spatial clustering of applications with noise (DBSCAN) is well suited to this job, and it conveniently ships with TabPy by default. It takes two parameters: the maximum allowed distance between points for them to be considered part of the same cluster, and the minimum number of nearby points required to constitute a cluster.
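To see how those two parameters behave, here is a minimal sketch using scikit-learn’s DBSCAN. The coordinates are made up for illustration, and the eps value is expressed in raw degrees rather than anything tuned for the actual Seattle data:

```python
# Hedged sketch: cluster 911-call coordinates with DBSCAN (scikit-learn).
# Coordinates and eps are illustrative, not the workbook's actual settings.
import numpy as np
from sklearn.cluster import DBSCAN

coords = np.array([
    [47.6097, -122.3331], [47.6100, -122.3329], [47.6099, -122.3335],  # tight downtown group
    [47.6694, -122.3849], [47.6690, -122.3852],                        # second group
    [47.5000, -122.2000],                                              # isolated point
])

# eps: maximum distance between points in the same cluster (degrees here);
# min_samples: minimum number of nearby points (including the point itself)
# required to form a cluster.
labels = DBSCAN(eps=0.001, min_samples=2).fit(coords).labels_
print(labels)  # points that belong to no cluster are labeled -1
```

Lowering eps or raising min_samples makes the algorithm stricter, which is exactly the kind of experimentation the next paragraph describes.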

This allows for experimenting with different distance and event-frequency criteria. Different settings can be more appropriate for downtown Seattle versus the suburbs; for a police officer looking for hotspots versus a tourist looking for places to avoid; or for someone looking for a house to rent or buy. You can download this example Tableau workbook here.

Embedding the Python code into Tableau worked great in this example. But in some cases, you may want to host your Python scripts outside Tableau workbooks so they are centralized and easier to manage or because the models themselves require upfront training.

To demonstrate, let’s use a data set on breast cancer cases in Wisconsin. And let’s see if we can train a model that can provide the correct diagnosis given a patient’s test results.

Let’s start with the model that is most easily accessible as part of our exploration: clustering, which we introduced in Tableau 10. When I simply double-click on clustering, I see that Tableau automatically finds two clusters corresponding to malignant and benign tumors, and identifies the cases with 92.2% accuracy. That’s pretty impressive considering I gave Tableau no hint whatsoever as to what the correct diagnoses were, or even that there had to be two categories.

But since we have this information in the data set, could we use a different algorithm that can learn from the actual diagnosis for these patients? Let’s try a variety of supervised machine-learning algorithms in Python and see how they will perform.

You can download the Jupyter notebook containing all the Python code used for model training and evaluation here.

Before you can use it, you need to start Jupyter by running jupyter notebook from a command prompt.

This will open Jupyter in your browser. If you downloaded the example notebook, you can navigate to the directory on this screen and click on BreastCancerExample.ipynb to open it.

Once the notebook loads, your browser window should look like this. Note that this notebook relies on many methods in scikit-learn 0.16.1, the version that ships with TabPy by default. Earlier or newer versions may lack some of these methods or use different names for them.

The notebook is extensively documented, so I won’t go into the details in this post. (In a few words, it fits Naïve Bayes, Logistic Regression, Support Vector Machine, and Gradient Boosted Tree models to the breast cancer data set, using a grid search with k-fold cross-validation to find the best model.) The sample is also meant to be a template into which you can easily swap different models, for example a neural network.
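The grid-search-with-cross-validation step can be sketched as follows. This is not the notebook’s exact code: the parameter grid is illustrative, the built-in breast cancer dataset stands in for the Wisconsin CSV, and the import path shown is the modern scikit-learn one (in the 0.16.x series bundled with TabPy, GridSearchCV lives in sklearn.grid_search instead):

```python
# Hedged sketch: grid search with k-fold cross-validation to pick the best
# gradient-boosted-tree model. Grid values are illustrative.
from sklearn.datasets import load_breast_cancer  # stand-in for the Wisconsin CSV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
}
# cv=5 means 5-fold cross-validation for every parameter combination.
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Swapping in a different model is just a matter of changing the estimator and the parameter grid, which is why the notebook works well as a template.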

Then it deploys the best model (Gradient Boosting in this case) as a function to the TabPy server so it can be used to classify new data from Tableau dashboards.
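A deployment of this kind can be sketched as below. The wrapper-function shape (Tableau passes each measure as a list, one entry per row) matches how TabPy calls deployed functions, but the model, feature handling, and endpoint description here are illustrative; the deploy call itself is shown commented out because it needs a running TabPy server, and it uses the 2017-era tabpy_client package:

```python
# Hedged sketch: train a model, wrap it in a function, and publish it to TabPy.
from sklearn.datasets import load_breast_cancer  # stand-in for the Wisconsin CSV
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

def diagnostics_demo(*features):
    # Tableau sends one list per measure (one value per row); stack them
    # into feature vectors and return one prediction per row.
    import numpy as np
    data = np.column_stack(features)
    return model.predict(data).tolist()

# Publishing the function (requires a running TabPy server):
# from tabpy_client import Client
# client = Client('http://localhost:9004/')
# client.deploy('DiagnosticsDemo', diagnostics_demo,
#               'Predicts breast cancer diagnosis', override=True)
```

Passing override=True on redeploys replaces an endpoint that already exists instead of raising an error.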

Now we can call the published function from Tableau with configurable parameters, so a user can enter values to get a prediction, all embedded in a nice dashboard.
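On the Tableau side, a call to a deployed endpoint typically takes the form of a SCRIPT_REAL calculated field that uses tabpy.query. The field names below are placeholders, not the workbook’s actual parameters:

```
SCRIPT_REAL("
return tabpy.query('DiagnosticsDemo', _arg1, _arg2)['response']
", SUM([Mean Radius]), SUM([Mean Texture]))
```

Each Tableau argument arrives in the script as _arg1, _arg2, and so on, and the deployed function’s result is read out of the 'response' field.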

You can download this example Tableau workbook here.

There are many more use cases for TabPy for data scientists. You might use it to build models for your HR department to predict attrition. You might help your sales department score leads. Or you might be an ISV creating vertical-specific advanced-analytics applications using Tableau to help renters make more educated decisions when picking their next home. Whatever your use case, TabPy can help take your analytics to 11.

To learn more about TabPy and to install it, visit our GitHub page. How are you using TabPy for advanced analytics? Tell us about your use case in the comments below.

Try Tableau 10.2

Try out all the new features in this post, and many more coming to Tableau Desktop, by signing up for Tableau’s beta program here. And visit our Coming Soon page to learn about all the features we're planning for Tableau 10.2.

Learn more about Tableau 10.2

Tableau 10.2 beta is here
Leverage the power of Python in Tableau with TabPy
Cut data-prep time with these enhancements in Tableau 10.2
Do more with your data on the web in Tableau Online 10.2




The link to the workbook requires a username and password.

Hi Bora,

First of all, amazing article. I loved it. I used TabPy and tried to publish the workbook. The workbook got published, but when I open it I get the error below:

An unexpected error occurred. If you continue to receive this error please contact your Tableau Server Administrator.

TableauException: An error occurred while communicating with the external service. Tableau is unable to connect to the service. Verify that the service is running and that you have access privileges.
2017-03-24 16:45:41.355 (WNVNMAofA8QAABFABIYAAAPl,0,0)

TabPy is running on a Linux server. Any suggestions would be really helpful.

Hi Ashwin,
Is TabPy configured on Tableau Server?

tabadmin stop
tabadmin set vizqlserver.extsvc.host hostnamegoeshere
tabadmin set vizqlserver.extsvc.port portnumbergoeshere
tabadmin configure
tabadmin start

If it is still not working it could be a firewall issue or did you have Rserve configured on the same server before?



Hi Bora,

Thanks for the great write-up. I'm walking through your code and attempting to reproduce it. So far I ran the code in your notebook, resolved some errors, and deployed the model. Next, I set up the calculation per your post. However, when I attempt to use it I get this error: The endpoint you're trying to query did not respond. Please make sure the endpoint exists and the correct set of arguments are provided.

I thought that perhaps I didn't run all of the code so I reran it and got this error: RuntimeError: An endpoint with that name ('DiagnosticsDemo') already exists. Use 'override = True' to force update an existing endpoint.

I'm confused that it exists but doesn't respond. What are possible reasons it wouldn't respond that I can work to resolve?


Hi Bora,

Firstly, thank you for the amazing tutorial. I'm getting an error when I open the breast cancer workbook. I can see the TabPy server is running perfectly, but Tableau shows "There is an error connecting to the predictive service" - endpoint doesn't exist, etc.

I've tried running individual pieces of the Jupyter notebook, fixed a few errors, and changed the CSV file path to my local directory. Nothing worked.

The Seattle criminal hotspots workbook is working perfectly, but there seems to be some problem with the breast cancer workbook. Please advise on how to approach this.

Thank you in advance!


The most likely answer is that the model wasn't successfully trained. When this happens, you can still publish a function to the TabPy server, but the error will surface when you try to run the model. On some Windows machines, this can happen due to a bug in Python when you try to run model fitting using all the CPU cores you have, which is controlled by the "n_jobs=-1" setting in the Jupyter notebook. I suspect this might be the issue. The solution is to set n_jobs=1, which will run the training using only a single core. After making this change, add the argument override=True to your deploy function and then run the entire Jupyter notebook (Cell > Run All); I suspect that will fix it.
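A minimal sketch of those two changes, with an illustrative parameter grid and dataset (the commented deploy line assumes a TabPy client object like the one in the example notebook):

```python
# Hedged sketch of the suggested fix: single-core fitting plus endpoint override.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      {"max_depth": [2, 3]},  # illustrative grid
                      cv=3,
                      n_jobs=1)  # single core: sidesteps the Windows multiprocessing bug
search.fit(X, y)

# When redeploying, override=True replaces the existing endpoint instead of
# raising "An endpoint with that name already exists":
# client.deploy('DiagnosticsDemo', diagnostics_demo, 'description', override=True)
```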
