If you have a Hadoop deployment, then you’re probably dealing with rapidly growing, unstructured data. Hadoop has become a critical part of the overall data ecosystem for many organizations because of its ability to store, process, and analyze petabytes of data. By storing data in a distributed file system (HDFS) across commodity servers, Hadoop enables organizations to affordably scale their infrastructure to their needs. Combined with its support for parallel processing across those servers, this makes storage and compute achievable at massive scale. Traditional databases struggle to retain data at this volume and have difficulty with formats such as JSON, XML, or RDF; Hadoop allows organizations to overcome both obstacles without a steep increase in IT infrastructure costs.

Tableau connects directly to multiple Hadoop distributions, so no matter what your environment looks like, business users can quickly and easily find insight within your Hadoop deployment.

Tableau’s Built-in Hadoop Connections:

  • Cloudera
  • Hortonworks
  • IBM InfoSphere BigInsights
  • MapR
  • Pivotal

Fast Hadoop Analysis
The power of Hadoop without the latency

Hadoop’s best-known drawback is its high latency. To get the benefit of ad hoc visualization at interactive speeds, queries need to return quickly. When you work with Hadoop and Tableau, you can connect live to your Hadoop cluster and then extract the data into Tableau’s fast in-memory data engine.

Tableau lets you bring your data into its fast, in-memory analytical engine. With this approach you can query an extract of data without waiting for MapReduce queries to complete. Click to refresh the extract or schedule automatic refreshes.

Native Connection
Native connectors to Cloudera Impala and Cloudera Hadoop, DataStax Enterprise, Hortonworks, and MapR Hadoop Distribution for Hadoop reporting and analysis

Unlike with other Hadoop analysis software, getting Hadoop data to work with Tableau is easy: just point at your cluster! You do need Hive installed on your Hadoop cluster; Hive is a common component that provides a SQL interface to Hadoop. Neither Tableau nor Hadoop needs any special configuration.

Cloudera Impala, Cloudera Hadoop, DataStax Enterprise, Hortonworks, and MapR Hadoop distributions are simply another data source in Tableau. You can connect with no programming and drag & drop to visualize your data.

Here we have weather data from a set of XML objects, now stored in a Hadoop cluster. Tableau’s powerful visualization capabilities let you create maps, charts and dashboards easily.

XML Support for Hadoop Data
Work with a variety of data, including XML

An important application of Hadoop and Hive together is working with a variety of data, such as XML files. This often means that you need to unpack nested data, perform data transformations and process URLs. Tableau supports a number of new string functions when working with Hive and Hadoop, including URL processing, regular expressions, and hex/binary numeric operators.
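To make those three categories concrete, here is a minimal sketch using Python's standard library rather than Tableau or Hive syntax. The URL, the `city` parameter, and the numeric value are invented for illustration and do not come from the data described here.

```python
import re
from urllib.parse import urlparse

# Hypothetical value of the kind that might sit in a raw log column.
url = "http://example.com/weather/report?city=Seattle&units=f"

# URL processing: split the URL into its components.
parts = urlparse(url)
host = parts.netloc    # host portion of the URL
query = parts.query    # everything after the "?"

# Regular expression: capture the value of the (assumed) city parameter.
match = re.search(r"city=([^&]+)", query)
city = match.group(1) if match else None

# Hex conversion: render an integer as an uppercase hexadecimal string.
value_hex = format(255, "X")

print(host, city, value_hex)
```

In Tableau, the equivalent operations would be written as calculated fields and evaluated by Hive on the cluster, so no separate script of this kind is needed.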

This weather data was stored as a series of XML files that were loaded into Hadoop and unpacked on the fly by the Tableau custom SQL connection – this is true flexibility, and almost like on-the-fly ETL.

Here we’re using the “XPATH” function to create a City field so that we can represent this data in a more traditional, relational way. XML functions are exposed in the Tableau calculations window when you’re working with Hive/Hadoop data, so you don’t need to do custom programming to work with XML objects.
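As a rough illustration of what an XPath lookup does to a record like this, the sketch below uses Python's standard library instead of Hive or Tableau. The XML layout, the element names, and the `extract_city` helper are assumptions made for the example; the actual weather files are not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical weather record; the real file layout is an assumption.
SAMPLE_RECORD = """
<observation>
  <location>
    <city>Seattle</city>
    <state>WA</state>
  </location>
  <temp_f>54.0</temp_f>
</observation>
"""


def extract_city(xml_text):
    """Return the text of the first location/city element, or None if absent."""
    root = ET.fromstring(xml_text.strip())
    # ElementTree supports a limited XPath subset; the path is relative to the root.
    return root.findtext("./location/city")


print(extract_city(SAMPLE_RECORD))
```

The idea carries over directly: a path expression picks one value out of each nested document, and the result becomes an ordinary column such as City that can be dragged into a view.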