Cloudera Impala Integration: Fast Analytics for Hadoop

By Ted Wasserman 29 Oct, 2012

Last week, I blogged about the Big Data announcements we made in conjunction with the Strata-Hadoopworld conference. One additional highlight didn't make it into that post because the technology had not yet been announced. It is a very significant announcement for customers using Tableau on top of Cloudera's Hadoop distribution (CDH). At the conference, Cloudera announced a new Hadoop technology called Impala, a real-time query processing engine for Hadoop. Cloudera chose Tableau as one of the first partners to integrate with Impala, and we previewed an early version of the Impala connector at the conference.

Cloudera Impala was developed in response to one of the biggest complaints about using Hadoop (and Hive) for analytics: latency. While Hadoop was great for batch-oriented workloads that churned through massive volumes of data, it did not provide a fast, interactive experience for users doing ad hoc analytics. Even very simple queries could take 20 seconds or more because of the MapReduce overhead occurring behind the scenes.

Impala bypasses the MapReduce layer in Hadoop, and with it the overhead that MapReduce adds to each query. Cloudera has reported that, in its internal tests, Impala is anywhere from 3x to 90x faster than Hive, depending on the type of query and workload. Customers should therefore see significant performance gains over Tableau's existing Hive connectivity.

Customers interested in trying out a preview version of the Impala connector should contact their Tableau account manager for details about how to participate in the early access program. For more information about Impala, I encourage you to read Cloudera's blog post that describes the technology in more detail.

PS: Many thanks to Franz Funk for creating the dashboard above that we showed at the Strata-Hadoopworld conference!