When Tableau is combined with Apache Hadoop from Cloudera, you can get rapid analytics on top of massive amounts of semi-structured, unstructured and nested data.
What is Hadoop and Hive?
Hadoop is a software framework that enables distributed processing of large volumes of data (on the order of petabytes) on clusters of computer nodes using a simple programming model called MapReduce. Hadoop runs on commodity hardware and scales horizontally. One of Hadoop's special qualities is that it can process any kind of data: structured, semi-structured, unstructured and even nested data, which traditional relational databases don’t deal with very well. The upshot is that Hadoop enables enormous data volumes in many different formats to be stored and processed rapidly.
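To make the MapReduce model concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample log lines are made up for illustration, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical log lines standing in for files stored on the cluster.
    sample = ["error disk full", "warning disk slow", "error network down"]
    print(word_count(sample))
```

Because each line is mapped independently and each key is reduced independently, the same logic can be spread over as many machines as the data requires, which is what lets Hadoop scale horizontally.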
However, all of that data storage capability is wasted if there is no way to access it. Hive is a data warehouse infrastructure built on top of Hadoop that provides a SQL-like query language called HiveQL, which lets users who are more familiar with SQL query the data. With version 7.0, Tableau lets you connect to Apache Hadoop through Cloudera’s driver. Connecting to Hadoop in Tableau is exactly the same as connecting to any other data source, with no additional setup required. This puts the powerful capabilities of Hadoop into the hands of anyone who wants to analyze data.
How does Tableau enhance Hadoop?
Since Hadoop can scale to massive amounts of data, it is increasingly being used for storage, log file analysis, web analytics, data mining, live content serving, fraud detection and other data-intensive tasks. Tableau makes it easy for you to perform sophisticated visual analysis without having to learn HiveQL, while allowing the more advanced user to take full advantage of the Hive/MapReduce capabilities. A Tableau connection to Hadoop is displayed as a series of fields that you can drag and drop to build visualizations, as you can see below. It really is that simple.
However, Tableau doesn’t only simplify the Hadoop technology, it also extends it. Since the primitive data types supported by HiveQL are limited, Tableau users can take advantage of Tableau's ability to promote data types. For example, dates are stored as strings in Hadoop-Hive, and Tableau allows you to change this string date into a Date & Time, thereby allowing you to drill up and drill down on date dimensions. Connections to Hadoop-Hive may also use Custom HiveQL expressions, which allow for a more filtered or pre-aggregated data set. The Initial SQL setting in the Tableau Hadoop Hive connection allows you to specify any number of settings that Hive can use to improve the functionality and performance of the Hive queries. Tableau also handles Hive columns containing XML data by allowing you to write HiveQL to un-nest XML elements. These capabilities significantly extend the real-world usability of Hadoop clusters queried through Hive.
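The idea behind promoting a string date can be sketched in a few lines of Python. This only illustrates the concept and is not what Tableau runs internally: the helper names (`promote_date`, `drill_levels`, `count_by`), the sample rows and the `YYYY-MM-DD HH:MM:SS` string format are all assumptions.

```python
from collections import Counter
from datetime import datetime


def promote_date(raw, fmt="%Y-%m-%d %H:%M:%S"):
    """Turn a date stored as a string into a real datetime value.

    The default format is an assumption; adjust it to match the stored strings.
    """
    return datetime.strptime(raw, fmt)


def drill_levels(ts):
    """Derive the levels of a date hierarchy from a single timestamp."""
    return {
        "year": ts.year,
        "quarter": (ts.month - 1) // 3 + 1,
        "month": ts.month,
        "day": ts.day,
    }


def count_by(rows, level):
    """Count rows at one level of the hierarchy, e.g. per quarter or per month."""
    return Counter(drill_levels(promote_date(raw))[level] for raw in rows)


if __name__ == "__main__":
    # Hypothetical string dates as they might be stored in a Hive table.
    rows = ["2012-01-15 08:30:00", "2012-02-03 11:00:00", "2012-07-21 17:45:00"]
    print(count_by(rows, "quarter"))  # drilled up to quarters
    print(count_by(rows, "month"))    # drilled down to months
```

While the value is still a string, none of these levels are available; once it is a true date, the same rows can be grouped by year, quarter, month or day without changing the underlying data.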
Get more from Hadoop and Tableau with these resources:
Finding answers and making decisions with big data has always been challenging. Most legacy database technologies are slow, complex and difficult to learn. By combining the big data capabilities of Hadoop-Hive with Tableau’s ease of use, any organization can get far more out of its big data. For more information on how you can leverage Hadoop with Tableau, please refer to the following knowledge base articles:
Mythili Gopalakrishnan is a Solution Architect with the Tableau Professional Services team. She helps customers across a variety of industries achieve their goals and hone their analytical skills. She can be reached by email at firstname.lastname@example.org.
Want to know more about what others are doing with Hadoop and Tableau?
From customer stories and white papers to Tableau and Hadoop workbooks using real-world data to solve real problems, you can find out more about how to leverage Tableau to get the most out of Hadoop here:
Hadoop Customer Stories