Tableau Research is the industrial research team here at Tableau. Our job is to explore the road ahead for the rest of the development team and make recommendations for future work.
When Tableau released the DATEPARSE feature to parse dates, we found that 15% of the user-authored date format strings extracted from Tableau Public were invalid. The research team decided to focus on this problem because date formats are complex and occur in myriad forms. The challenge is to develop algorithms that detect such patterns and then to determine how accurate those algorithms are. Tableau Research consists of a diverse set of research scientists who share the goal of helping people see and understand data but approach it from different perspectives, ranging from color theory, storytelling, statistics, and HCI to natural language processing. The work on DATEPARSE reflected that diversity: it approached the same problem of better date parsing from two points of view, one focusing on pattern recognition and the other using natural language grammar rules. Because the approaches are complementary, each algorithm could be used to cross-validate the other and assess its precision and accuracy.
Often research exploration is written up as academic papers, but not every paper that the research group produces ends up being published. This can be for a number of reasons, but it isn't usually because the work is uninteresting. To deal with this problem, we have started a technical note system on the Tableau Research website and are happy to announce that the first note has been published: "Does Anybody Really Know What Time It Is: Automating the Extraction of Date Scalars" by Richard Wesley, Vidya Setlur, and Dan Cory. The work described was shipped as Automatic DATEPARSE for Tableau 10.2 in 2017. We will describe this work informally here, but please read the paper for more details.
One of the most common data preparation operations that users perform in Tableau is converting strings to dates. For many years, the only tools users had were various string functions, which led to complex, error-prone formulas. To help users with this problem, we introduced the DATEPARSE function to our calculation language. It uses a single formatting syntax for all databases and lets users write simple expressions to convert strings to dates.
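To make the contrast concrete, here is a minimal sketch in Python rather than Tableau's calculation language. It parses one sample string twice: once with only string slicing, and once with a single format pattern. The function names and the sample value are illustrative assumptions, and Python's `strptime` directives (`%m`, `%d`, and so on) are a different syntax from the format strings DATEPARSE accepts.

```python
from datetime import datetime


def parse_with_string_functions(text: str) -> datetime:
    """Hand-rolled parsing using only slicing and int().

    Assumes the fixed layout 'MM/dd/yyyy - HH:mm'; every offset below
    silently breaks if the input shifts by even one character.
    """
    month = int(text[0:2])
    day = int(text[3:5])
    year = int(text[6:10])
    hour = int(text[13:15])
    minute = int(text[16:18])
    return datetime(year, month, day, hour, minute)


def parse_with_format(text: str) -> datetime:
    """The same conversion expressed as one declarative format pattern."""
    return datetime.strptime(text, "%m/%d/%Y - %H:%M")


sample = "04/09/2014 - 23:47"  # illustrative value
by_slicing = parse_with_string_functions(sample)
by_format = parse_with_format(sample)
print(by_slicing == by_format)
```

Both functions return the same value, but the format-based version states the layout once instead of scattering it across five hard-coded offsets, which is roughly the simplification a format-driven function is meant to offer.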
This was great in theory, but after a few months in the wild, we decided to check on how it was working for our users. When we looked at their attempts on Tableau Public, we found a 15% syntax error rate. Even that figure assumes that the syntactically valid format strings are actually correct for the data they are applied to, which is hardly guaranteed.
The trouble is that users rarely need this functionality and are essentially relearning a complex mini-language every time they need to use it for a small problem. And it’s not just casual users either. We implemented our syntax and do a lot of research on time analytics, yet we had to relearn a lot of it every time we needed it! Why should our users have to be masters of a system that we ourselves are far from automatic at deploying?
A tale of two techniques
| Format | Example |
|---|---|
| EEE MMM dd HH:mm:ss zzz yyyy | Fri Apr 01 02:09:27 EDT 2011 |
| dd-MMM-yy hh.mm.ss.SSSSS a | 01-OCT-13 01.09.00.000000 PM |
| MM ''yyyy | 01 '2013 |
| MM/dd/yyyy - HH:mm | 04/09/2014 - 23:47 |