Whether it comes from web logs, thousands of XML files, or ecommerce systems, data is growing at a tremendous rate. Hadoop is an emerging technology that can help you deal with data that is big, unstructured, messy, or all three. Today, we're announcing native support for Apache Hadoop from Cloudera in Tableau 6.1.
The name Hadoop is often used to mean several related technologies (Hadoop, Map/Reduce, and Hive) that can be used together to help you work with that data.
Hadoop, together with the Map/Reduce framework, is a distributed system that lets you query across multiple and potentially different data sources at once. Technically, HDFS is the distributed file storage part of the system, and Map/Reduce is the programming model that processes queries in parallel across that system, while Hadoop is the overall technology for managing distributed execution and resilience to node failure. Hive is a technology that essentially lets you run SQL-like queries across the Hadoop cluster. It includes a number of functions that make it easier to process and transform data stored in Hadoop.
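To make the Map/Reduce idea concrete, here is a minimal single-machine sketch in plain Python. It does not use Hadoop at all; the log format, the function names, and the sample data are assumptions made up for illustration. It counts web log requests per status code by mapping each line to a key/value pair, grouping the pairs by key, and reducing each group to a total.

```python
from collections import defaultdict


def map_phase(log_line):
    """Emit (status_code, 1) for a log line of the assumed form 'path status'."""
    parts = log_line.split()
    if len(parts) == 2:  # silently skip malformed lines
        yield parts[1], 1


def shuffle(pairs):
    """Group all mapped values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(splits):
    """Run map over every line of every split, then shuffle and reduce."""
    mapped = [
        pair
        for split in splits
        for line in split
        for pair in map_phase(line)
    ]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Each inner list stands in for a block of data stored on a different node.
SPLITS = [
    ["/home 200", "/cart 500"],
    ["/home 200", "/missing 404", "/cart 200"],
]

RESULT = run_job(SPLITS)
print(RESULT)
```

On a real cluster, the map calls would run on the nodes that hold each block of data and the framework would handle the grouping step and any node failures. With Hive, roughly the same question could be asked as a query along the lines of `SELECT status, COUNT(*) FROM logs GROUP BY status`, where the table and column names are again hypothetical.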
Data is only getting larger and more complex, with new data sources like web logs, bar code data, and more. The most important data is also increasingly found in disparate places: XML files, databases, and various unstructured formats. Hadoop is a technology and open source project that is leading the way in dealing with these new mountains of data.
Tableau's mission is to help people see and understand data. That includes data that is not handled well by traditional databases, and that's why it's important to support Hadoop.
Example of the new string functions that can be used with Hadoop and Hive to work with XML objects and other data types.
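As a rough illustration of what a string function that works with XML does, the sketch below pulls a single value out of an XML fragment stored as a string. It uses Python's standard library rather than Hive or Tableau, and the helper name `xpath_string` and the sample order record are assumptions chosen for this example, not the product's actual function signatures.

```python
import xml.etree.ElementTree as ET


def xpath_string(xml_text, path):
    """Return the text of the first element matching path, or '' if none."""
    root = ET.fromstring(xml_text)
    node = root.find(path)  # ElementTree supports a limited XPath subset
    if node is None or node.text is None:
        return ""
    return node.text


# A made-up record, as it might appear in one column of a log table.
ORDER_XML = "<order><customer>Ann</customer><item sku='A1'>Widget</item></order>"

CUSTOMER = xpath_string(ORDER_XML, "./customer")
ITEM = xpath_string(ORDER_XML, "./item")
MISSING = xpath_string(ORDER_XML, "./discount")
print(CUSTOMER, ITEM, repr(MISSING))
```

The point of the sketch is the shape of the operation: a column holding raw XML goes in, and a plain string that can be used as a dimension or filter comes out.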
If you are running a Hadoop cluster with Hive, there are three ways to try it:
This page on Hadoop has more information as well as links to a demo, whitepapers and Cloudera's site.