TDWI's Intro to Big Data Analysis

Author
Philip Russom, Senior Manager, TDWI Research

Big data analytics is where advanced analytic techniques operate on big data sets—one of the most profound trends in business intelligence today. Using advanced analytics, businesses can study big data to understand the current state of the business and track still-evolving aspects such as customer behavior. This empowers users to explore granular details of business operations and customer interactions that seldom find their way into a data warehouse or standard report.

Big data analytics is the intersection of two technical entities that have come together. First, there’s big data for massive amounts of detailed information. Second, there’s advanced analytics, which can include predictive analytics, data mining, statistics, artificial intelligence, natural language processing, and so on. Put them together and you get big data analytics, the hottest new practice in BI.

A new flood of user organizations is commencing or expanding solutions for analytics with big data. To meet the demand, vendors have recently released numerous new products and functions, specifically for advanced forms of analytics (beyond OLAP and reporting) and analytic databases that can manage big data. This research report drills into all the aspects of big data analytics mentioned here to give users and their business sponsors a solid background for big data analytics, including business and technology drivers, successful business use cases, and common technology enablers. The report also uses survey data to project the future of the most common tool types, features, and functions associated with big data analytics, so users can apply this information to planning their own programs and technology stacks for big data analytics.


Executive Summary

Oddly enough, big data was a serious problem just a few years ago. When data volumes started skyrocketing in the early 2000s, storage and CPU technologies were overwhelmed by the numerous terabytes of big data—to the point that IT faced a data scalability crisis. Then we were once again snatched from the jaws of defeat by Moore’s law. Storage and CPUs not only developed greater capacity, speed, and intelligence; they also fell in price. Enterprises went from being unable to afford or manage big data to lavishing budgets on its collection and analysis.

Today, enterprises are exploring big data to discover facts they didn't know before. This is an important task right now because the recent economic recession forced deep changes into most businesses, especially those that depend on mass consumers. Using advanced analytics, businesses can study big data to understand the current state of the business and track still-evolving aspects such as customer behavior.

If you really want the lowdown on what’s happening in your business, you need large volumes of highly detailed data. If you truly want to see something you’ve never seen before, it helps to tap into data that’s never been tapped for business intelligence (BI) or analytics. Some of the untapped data will be foreign to you, coming from sensors, devices, third parties, Web applications, and social media. Some big data sources feed data unceasingly in real time. Put all that together, and you see that big data is not just about giant data volumes; it’s also about an extraordinary diversity of data types, delivered at various speeds and frequencies.

Note that two technical entities have come together. First, there’s big data for massive amounts of detailed information. Second, there’s advanced analytics, which is actually a collection of different tool types, including those based on predictive analytics, data mining, statistics, artificial intelligence, natural language processing, and so on. Put them together and you get big data analytics, the hottest new practice in BI today.

Of course, businesspeople can learn a lot about the business and their customers from BI programs and data warehouses. But big data analytics explores granular details of business operations and customer interactions that seldom find their way into a data warehouse or standard report. Some organizations are already managing big data in their enterprise data warehouses (EDWs), while others have designed their DWs for the well-understood, auditable, and squeaky clean data that the average business report demands. The former tend to manage big data in the EDW and execute most analytic processing there, whereas the latter tend to distribute their efforts onto secondary analytic platforms. There are also hybrid approaches.

Regardless of approach, user organizations are currently reevaluating their analytic portfolios. In response to the demand for platforms suited to big data analytics, vendors have released a slew of new product types including analytic databases, data warehouse appliances, columnar databases, no-SQL databases, distributed file systems, and so on. There is also a new slew of analytic tools.

This report drills into all the aspects of big data analytics mentioned here to give users and their business sponsors a solid background for big data analytics, including business and technology drivers, successful business use cases, and common technology enablers. The report also uses survey data to project the future of the most common tool types, features, and functions associated with big data analytics, so users can apply this information to planning their own programs and technology stacks for big data analytics.

Introduction to Big Data Analytics

Big data analytics is where advanced analytic techniques operate on big data sets. Hence, big data analytics is really about two things—big data and analytics—plus how the two have teamed up to create one of the most profound trends in business intelligence (BI) today. Let’s start by defining advanced analytics, then move on to big data and the combination of the two.

Defining Advanced Analytics as a Discovery Mission

According to a 2009 TDWI survey, 38% of organizations surveyed reported practicing advanced analytics, whereas 85% said they would be practicing it within three years. Why the rush to advanced analytics? First, change is rampant in business, as seen in the multiple “economies” we’ve gone through in recent years. Analytics helps us discover what has changed and how we should react. Second, as we crawl out of the recession and into the recovery, there are more and more business opportunities that should be seized. To that end, advanced analytics is the best way to discover new customer segments, identify the best suppliers, associate products of affinity, understand sales seasonality, and so on. For these reasons, TDWI has seen a steady stream of user organizations implementing analytics in recent years.

The rush to analytics means that many organizations are embracing advanced analytics for the first time, and hence are confused about how to go about it. Even if you have related experience in data warehousing, reporting, and online analytic processing (OLAP), you’ll find that the business and technical requirements are different for advanced forms of analytics. To help user organizations select the right form of analytics and prepare big data for analysis, this report will discuss new options for advanced analytics and analytic databases for big data so that users can make intelligent decisions as they embrace analytics.

Note that user organizations are implementing specific forms of analytics, particularly what is sometimes called advanced analytics. This is a collection of related techniques and tool types, usually including predictive analytics, data mining, statistical analysis, and complex SQL. We might also extend the list to cover data visualization, artificial intelligence, natural language processing, and database capabilities that support analytics (such as MapReduce, in-database analytics, in-memory databases, and columnar data stores).

Instead of “advanced analytics,” a better term would be “discovery analytics,” because that’s what users are trying to accomplish. (Some people call it “exploratory analytics.”) In other words, with big data analytics, the user is typically a business analyst who is trying to discover new business facts that no one in the enterprise knew before. To do that, the analyst needs large volumes of data with plenty of detail. This is often data that the enterprise has not yet tapped for analytics.

For example, in the middle of the recent economic recession, companies were constantly being hit by new forms of customer churn. To discover the root cause of the newest form of churn, a business analyst would grab several terabytes of detailed data drawn from operational applications to get a view of recent customer behaviors. The analyst might mix that data with historic data from a data warehouse. Dozens of queries later, the analyst would discover a new churn behavior in a subset of the customer base. With any luck, that discovery would lead to a metric, report, analytic model, or some other product of BI, through which the company could track and predict the new form of churn.

Discovery analytics against big data can be enabled by different types of analytic tools, including those based on SQL queries, data mining, statistical analysis, fact clustering, data visualization, natural language processing, text analytics, artificial intelligence, and so on. It’s quite an arsenal of tool types, and savvy users get to know their analytic requirements before deciding which tool type is appropriate to their needs.

All these techniques have been around for years, many of them appearing in the 1990s. The difference today is that far more user organizations are actually using them. That’s because most of these techniques adapt well to very large, multi-terabyte data sets with minimal data preparation. That brings us to big data.

Defining Big Data Via the Three Vs

Most definitions of big data focus on the size of data in storage. Size matters, but there are other important attributes of big data, namely data variety and data velocity. The three Vs of big data (volume, variety, and velocity) constitute a comprehensive definition, and they bust the myth that big data is only about data volume. In addition, each of the three Vs has its own ramifications for analytics.

Data volume as a defining attribute of big data.

It's obvious that data volume is the primary attribute of big data. With that in mind, most people define big data in terabytes, sometimes petabytes. Some organizations, however, find it more useful to quantify big data in terms of time. For example, due to the seven-year statute of limitations in the U.S., many firms prefer to keep seven years of data available for risk, compliance, and legal analysis.

The scope of big data affects its quantification, too. For example, in many organizations, the data collected for general data warehousing differs from data collected specifically for analytics. Different forms of analytics may have different data sets. Some analytic practices lead a business analyst or similar user to create ad hoc analytic data sets per analytic project. Then, there’s the entire enterprise, which in toto has its own, even larger scope of big data. Furthermore, each of these quantifications of big data grows continuously. All this makes big data for analytics a moving target that’s tough to quantify.

Data type variety as a defining attribute of big data.

One of the things that makes big data really big is that it’s coming from a greater variety of sources than ever before. Many of the newer ones are Web sources, including logs, clickstreams, and social media. Sure, user organizations have been collecting Web data for years. But, for most organizations, it’s been a kind of hoarding. We’ve seen similar untapped big data collected and hoarded, such as RFID data from supply chain applications, text data from call center applications, semistructured data from various business-to-business processes, and geospatial data in logistics. What’s changed is that far more users are now analyzing big data instead of merely hoarding it. The few organizations that have been analyzing this data now do so at a more complex and sophisticated level. Big data isn’t new, but the effective analytical leveraging of big data is.

The recent tapping of these sources for analytics means that so-called structured data (which previously held unchallenged hegemony in analytics) is now joined by unstructured data (text and human language) and semistructured data (XML, RSS feeds). There’s also data that’s hard to categorize, as it comes from audio, video, and other devices. Plus, multidimensional data can be drawn from a data warehouse to add historic context to big data. That’s a far more eclectic mix of data types than analytics has ever seen. So, with big data, variety is just as big as volume. In addition, variety and volume tend to fuel each other.

Data feed velocity as a defining attribute of big data.

Big data can be described by its velocity or speed. You may prefer to think of it as the frequency of data generation or the frequency of data delivery. For example, think of the stream of data coming off of any kind of device or sensor, say robotic manufacturing machines, thermometers sensing temperature, microphones listening for movement in a secure area, or video cameras scanning for a specific face in a crowd. The collection of big data in real time isn't new; many firms have been collecting clickstream data from Web sites for years, using streaming data to make purchase recommendations to Web visitors. With sensor and Web data flying at you relentlessly in real time, data volumes get big in a hurry. Even more challenging, the analytics that go with streaming data have to make sense of the data and possibly take action—all in real time.
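To make the idea of analyzing a stream as it arrives more concrete, here is a minimal sketch in Python. It is not taken from the report: the class name, the window size, the temperature limit, and the readings are all hypothetical. It shows one common pattern, a sliding window that keeps only the most recent readings and raises an alert as soon as their average crosses a limit.

```python
from collections import deque


class SlidingWindowMonitor:
    """Tracks the average of the most recent readings (hypothetical helper)."""

    def __init__(self, window_size, limit):
        # deque with maxlen discards the oldest reading automatically.
        self.window = deque(maxlen=window_size)
        self.limit = limit

    def observe(self, reading):
        """Add one reading; return the current average and an alert flag."""
        self.window.append(reading)
        average = sum(self.window) / len(self.window)
        return average, average > self.limit


# Assumed values for illustration: a 3-reading window and a limit of 75.0.
monitor = SlidingWindowMonitor(window_size=3, limit=75.0)
stream = [70.0, 71.0, 72.0, 80.0, 85.0, 90.0]  # made-up sensor readings

alerts = []
for reading in stream:
    average, alert = monitor.observe(reading)
    if alert:
        # Act on the reading immediately instead of storing it for later.
        alerts.append((reading, round(average, 2)))

print(alerts)
```

The point of the design is that each reading is evaluated when it arrives and only a small window is held in memory, so the volume of the full stream never has to be stored before a decision is made.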

Defining Big Data Analytics

Again, big data analytics is where advanced analytic techniques operate on big data. The definition is easy to understand, but do users actually use the term? To quantify this question, the survey for this report asked: “Which of the following best characterizes your familiarity with big data analytics and how you name it?” (See Figure 2.) The survey results show that most users understand the concept of big data analytics, whether they have a name for it or not:

Few respondents are unfamiliar with the concept. Only 7% report that they “haven’t seen or heard of anything resembling big data analytics.”

Most users surveyed don’t have a name for big data analytics. Even so, they understand the definition (65% of respondents).

Roughly a quarter of respondents have a name for big data analytics. Twenty-eight percent both understand the concept and have named it.

Most of the survey respondents who report having a name for big data analytics typed the name they use into the survey software. The name entered most often is the term used in this report: “big data analytics” (18% in Figure 3). Similar terms appeared, such as large-volume or large-data-set analytics (7%). Many use the popular term advanced analytics (12%), or they simply call it analytics (12%). A few common terms were entered, such as data warehousing (4%), data mining (2%), and predictive analytics (2%). A whopping 43% entered a unique name, showing that names for analytic methods are amazingly diverse.

Finally, a few survey respondents entered humorous but revealing terms such as honking big data, my day job, pain in the neck, and we-need-to-buy-more-hardware analytics.

Why Put Big Data and Analytics Together Now?

Big data provides gigantic statistical samples, which enhance analytic tool results. Most tools designed for data mining or statistical analysis tend to be optimized for large data sets. In fact, the general rule is that the larger the data sample, the more accurate are the statistics and other products of the analysis. Instead of using mining and statistical tools, many users generate or hand-code complex SQL, which parses big data in search of just the right customer segment, churn profile, or excessive operational cost. The newest generation of data visualization tools and in-database analytic functions likewise operate on big data.
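The rule that larger samples give more accurate statistics can be illustrated with the standard error of a sample mean, which shrinks in proportion to the square root of the sample size. The sketch below is an illustration only: the order values are simulated, and the mean, spread, and seed are assumptions rather than figures from the survey. Note that this applies to random sampling error; a larger sample does not by itself correct a biased data source.

```python
import math
import random
import statistics


def standard_error(sample):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def simulate_order_values(n, seed=42):
    """Hypothetical data: n order values drawn from a normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(100.0, 20.0) for _ in range(n)]


small = simulate_order_values(100)
large = simulate_order_values(10_000)

se_small = standard_error(small)
se_large = standard_error(large)

print(f"n=100     standard error of the mean: {se_small:.3f}")
print(f"n=10,000  standard error of the mean: {se_large:.3f}")
```

With 100 times as many records, the standard error falls by roughly a factor of ten, which is why an estimate computed over a very large data set can be trusted to finer precision than one computed over a small extract.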

Analytic tools and databases can now handle big data. They can also execute big queries and parse tables in record time. Recent generations of vendor tools and platforms have lifted us onto a new plateau of performance that is very compelling for applications involving big data.

The economics of analytics is now more embraceable than ever. This is due to a precipitous drop in the cost of data storage and processing bandwidth. The fact that tools and platforms for big data analytics are relatively affordable is significant because big data is not just for big business. Many small-to-midsize businesses (especially those deep into digital processes for sales, customer interactions, or supply chain) also need to manage and leverage big data.

There’s a lot to learn from messy data, as long as it’s big. Most modern tools and techniques for advanced analytics and big data are very tolerant of raw source data, with its transactional schema, non-standard data, and poor-quality data. That’s a good thing, because discovery and predictive analytics depend on lots of details—even questionable data. For example, analytic applications for fraud detection often depend on outliers and non-standard data as indications of fraud. So, be careful: If you apply ETL and data quality processes to big data as you do for a data warehouse, you run the risk of stripping out the very nuggets that make big data a treasure trove for advanced analytics.
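The warning about over-cleansing can be sketched in a few lines of Python. The transactions, the cutoff, and the scoring method here are assumptions chosen for illustration (a robust z-score based on the median absolute deviation, which the report does not prescribe). The same test that a cleansing step might use to discard "bad" rows is the one a fraud analyst would use to find suspects.

```python
import statistics

# Hypothetical card transactions: (transaction_id, amount in dollars).
transactions = [
    ("t1", 42.10), ("t2", 38.75), ("t3", 51.20), ("t4", 45.00),
    ("t5", 39.95), ("t6", 4980.00), ("t7", 47.30), ("t8", 44.15),
]


def robust_z_scores(amounts):
    """Distance from the median in units of the median absolute deviation."""
    med = statistics.median(amounts)
    mad = statistics.median([abs(a - med) for a in amounts])
    return [(a - med) / mad for a in amounts]


THRESHOLD = 5.0  # assumed cutoff, chosen only for this illustration

amounts = [amount for _, amount in transactions]
scores = robust_z_scores(amounts)

# A warehouse-style cleansing step would drop the outlying rows...
cleansed = [t for t, z in zip(transactions, scores) if abs(z) <= THRESHOLD]
# ...while a fraud-detection pass wants exactly those rows.
suspects = [t for t, z in zip(transactions, scores) if abs(z) > THRESHOLD]

print("kept after cleansing:", [t[0] for t in cleansed])
print("flagged for review:  ", [t[0] for t in suspects])
```

In this made-up data set, the one transaction that cleansing removes is the only one worth investigating, which is the risk the paragraph above describes.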

Big data is a special asset that merits leverage. That’s the real point of big data analytics. The new technologies and new best practices are fascinating, even mesmerizing, and there’s a certain macho coolness to working with dozens of terabytes. But don’t do it for the technology. Put big data and discovery analytics together for the new insights they give the business.

Analytics based on large data samples reveals and leverages business change. The recession has accelerated the already quickening pace of business. The recovery, though welcome, brings even more change. In fact, the average business has changed beyond all recognition because of the recent economic recession and recovery. The change has not gone unnoticed. Businesspeople now share a wholesale recognition that they must explore change just to understand the new state of the business.

Even more compelling, however, is the prospect of discovering problems that need fixing (such as new forms of customer churn and competitive pressure) and opportunities that merit leverage (such as new customer segments and sales prospects).

About the author

Philip Russom

Senior Manager, TDWI Research

Philip Russom is the senior manager of research and services at The Data Warehousing Institute (TDWI), where he oversees many of TDWI’s research-oriented publications, services, and events. Prior to joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research, Giga Information Group, and Hurwitz Group, as well as a contributing editor with Intelligent Enterprise and DM Review magazines.