Does Anybody Really Know What Time It Is? Automating the Extraction of Date Scalars
Tableau Research White Paper (TR #2018-01)
Interactive visual analytics is an effective approach for the timely analysis of data. Users who are already engaged in interactive data analysis prefer to stay in the visualization environment for unavoidable data cleaning and preparation tasks so that they preserve their analytic flow. This has led visualization environments to include simple data preparation functions such as scalar parsing, pattern matching and categorical binning. One common scalar parsing task is extracting date and time data from string representations. Several relational database management systems (RDBMSs) include date parsing "mini-languages" to cover the wide range of possible formats, but analysis of user data from one visualization system shows that the parsing language syntax can be difficult for users to master.
In this paper, we present two algorithms for automatically deriving date formats from a column of data with minimal user disruption, one based on minimal entropy representations and another based on natural language processing techniques. Both achieve accuracies of over 95% on a large corpus of date columns extracted from an online data repository. One of the methods is also fast enough to produce results within the user's perceptual threshold. Moreover, we were able to avoid prohibitively expensive manual verification by using the algorithms to cross-check each other at scale.
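
To make the task concrete, the following is a minimal sketch of deriving a date format from a column of strings. It is not either of the two algorithms presented in this paper: it is a naive baseline that tries a small, hypothetical list of candidate formats and keeps the one that parses the largest fraction of values. The names `CANDIDATE_FORMATS`, `parse_rate`, and `infer_format` are illustrative assumptions, and the format strings use Python's `strptime` directives rather than any RDBMS mini-language.

```python
from datetime import datetime

# Hypothetical candidate list; a real system would need far more formats.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y", "%Y/%m/%d"]


def parse_rate(values, fmt):
    """Return the fraction of strings in `values` that parse under `fmt`."""
    if not values:
        return 0.0
    parsed = 0
    for value in values:
        try:
            datetime.strptime(value.strip(), fmt)
            parsed += 1
        except ValueError:
            # The string does not match this format; count it as a failure.
            pass
    return parsed / len(values)


def infer_format(values, candidates=CANDIDATE_FORMATS):
    """Pick the candidate with the highest parse rate (first one wins ties)."""
    best = max(candidates, key=lambda fmt: parse_rate(values, fmt))
    return best, parse_rate(values, best)


column = ["03/25/2018", "12/31/2017", "01/02/2018"]
best_format, rate = infer_format(column)
```

Here the value `01/02/2018` alone is ambiguous between month-first and day-first order, but the other two values rule out `%d/%m/%Y`, which is why the format is derived from the whole column rather than from a single string.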