Big Data: Powering the Next Industrial Revolution

Overview: What you'll learn

How would you like to be the person or the team at your company, big or small, that identifies what no one else is thinking about at your organization but should be? Think of the status you’d gain, the spotlight and corner office—and the new challenges you’ll be able to tackle that will rocket your company to the top.

The key is to start thinking about the problems you can now solve with data that could not be solved just a few years ago. The volume of data has exploded, and so have the tools for collecting, analyzing and making use of it. In this whitepaper, Abhishek Mehta discusses the power of using big data to solve problems once thought unsolvable, and how you can apply this knowledge at your own company.

We've also pulled out the first several pages of the whitepaper for you to read. Download the PDF on the right to read the rest.

Executive Summary

Data is a key ‘raw material’ for a variety of socioeconomic business systems. Unfortunately, the ability to maximize the use of and value from data has been underleveraged, until now. A massive change in the technology ecosystem, an explosion in sensors emitting new data and the sprouting of new business models are revolutionizing the data landscape, forever changing the way we look at, solve and commercialize business problems.

This is the world of Big Data, where ‘big’ is merely a reference to the size of the opportunity, not the inability to address it.

It is ushering us into the next industrial revolution, which will be powered by data factories, fed by the most valuable raw material of all, data.

The total value at stake - $5,000,000,000,000. Five trillion dollars.

Data - The New Normal

There is a new normal emerging for Big Data, one that shatters long-held standards and beliefs that were never that logical to begin with. Everything from what data is considered worthy of storage to how best to analyze and consume it has been redefined. The new ground rules may seem like a common-sense approach to data, but they have never before been implemented universally.

Data Trash - An Oxymoron

There is nothing, note nothing, called data trash. Irrespective of the source, structure, format, frequency and identity of the data, all data is valuable. In the new normal, data trash is an oxymoronic phrase. If a certain kind of data seems to have no value today, it is simply because we have not mined it enough or cannot yet comprehend what its use could be. For example, when Google started harnessing (satellite imagery), capturing (street views) and then sharing (for free) geographical data five years ago, not many people understood its real value. Five years on, think again.

Store all the data, all the time

Not only is there no such thing as data trash, we need to store all of the data being produced in the world today. Thanks to Moore’s law, the cost of storing a terabyte of data has fallen from $1 million in the 1970s to $50 today. Moreover, this rapid decline in storage cost, coupled with growing storage capacity in an ever smaller footprint, has rendered tape obsolete as a storage mechanism. An organization can finally store all of the data it creates, buys and consumes, for every individual customer, with all associated history and for every single interaction.

Size does not matter

If the data in all of the books in the Library of Congress were sized, it would total roughly 15 terabytes. Twitter generates about 15 terabytes of data a day. Walmart processes 1 million transactions per hour and stores around 3-5 petabytes of data (1 petabyte equals 1,000 terabytes). Google can process 1 petabyte of data each hour and has the ability to store the entire web on its servers (around 25 petabytes of data), many times over.

By 2010, all the digital data in existence totaled around 1,200 exabytes (1 exabyte equals 1 million terabytes). It has been estimated that if we were to record and size all the words ever spoken by mankind, the total would be around 5 exabytes.

And this gargantuan mound of data continues to grow at a rapid pace. With the explosion in sensors (there are 4 billion cell phones in the world) and the rise of ever more ambitious projects (the Large Hadron Collider generates 40 terabytes of data per second), data will continue to occupy the space it is given.

Beyond these massive numbers is a silver lining. The world has the capability to make Big Data small. The size of the data is no longer a limiting factor, but an opportunity to build a unique capability and competitive differentiator if you have the power to harness it all.
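To get a feel for the scale, the figures quoted in this section can be put on a single yardstick. The short sketch below converts each of them to terabytes using the decimal conversions given above and compares them with the Library of Congress estimate. The dictionary labels are illustrative names only, and the numbers are the estimates from the text, not fresh measurements.

```python
# Sizes quoted in the text, expressed in terabytes (decimal units).
TB_PER_PB = 1_000        # 1 petabyte = 1,000 terabytes
TB_PER_EB = 1_000_000    # 1 exabyte = 1 million terabytes

sizes_tb = {
    "Library of Congress books": 15,
    "Twitter, one day": 15,
    "The web as stored by Google": 25 * TB_PER_PB,
    "All words ever spoken": 5 * TB_PER_EB,
    "All digital data in 2010": 1_200 * TB_PER_EB,
}

# Use the Library of Congress estimate as the yardstick.
baseline = sizes_tb["Library of Congress books"]
for name, tb in sizes_tb.items():
    print(f"{name}: {tb:,} TB ({tb / baseline:,.0f}x the Library of Congress)")

# How many times over would all spoken words fit into the 2010 digital universe?
ratio = sizes_tb["All digital data in 2010"] / sizes_tb["All words ever spoken"]
print(f"Digital data in 2010 vs. all words ever spoken: {ratio:.0f}x")
```

On these estimates, the digital data of 2010 is 240 times the size of every word ever spoken.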

Structure does not matter either

5% of all data produced in 2010 was structured. The rest, using the much abused industry term, was ‘unstructured’ data. This term usually refers to data streams like email, voice, GPS, video and others we cannot comprehend. The value of it is unquestionable. How we uncover it is the challenge.

While I do not like the term ‘unstructured data’, I do agree in principle that the information data can give us is valuable, whether it sits in existing data warehouses where a format has been force-fitted to it (making it structured) or is stored as flat files waiting for the structure beneath to be found (hence ‘unstructured’ or, better said, complex).

Data is personal again

The ability to take data that exists in multiple silos, whether they are within one organization (multiple Systems of Record taking in data) or across many (commerce data, healthcare data, financial data, telecom data, online data, social data etc.), and tie it together at the individual level is game changing. This has been very successfully done by Google for web search. Every single IP address has its own customized ‘internet’, in essence enabling mass personalization of a very wide swath of data. The fact that tools can allow for this massive democratization of data and enable individualization to its nth degree is fascinating, as it renders sampling useless. And analyzing data opens up possibilities to address critical socio-economic problems that simply could not be solved before.
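As a minimal sketch of what tying silos together at the individual level might look like, the example below merges records from three made-up silos into one profile per person. The silo contents, the `customer_id` key and the `build_profiles` helper are all hypothetical and serve only as illustration. The sketch also assumes the silos already share a common ID and hold one record per customer, which real systems rarely offer for free.

```python
from collections import defaultdict

# Hypothetical records from three separate silos, keyed by a shared customer ID.
commerce = [{"customer_id": "c1", "last_purchase": "running shoes"},
            {"customer_id": "c2", "last_purchase": "blender"}]
telecom = [{"customer_id": "c1", "plan": "unlimited"}]
social = [{"customer_id": "c2", "followers": 340},
          {"customer_id": "c1", "followers": 12}]

def build_profiles(**silos):
    """Merge every silo's records into one profile per customer ID."""
    profiles = defaultdict(dict)
    for silo_name, records in silos.items():
        for record in records:
            fields = {k: v for k, v in record.items() if k != "customer_id"}
            # Namespace fields by silo so same-named fields never collide.
            profiles[record["customer_id"]][silo_name] = fields
    return dict(profiles)

profiles = build_profiles(commerce=commerce, telecom=telecom, social=social)
print(profiles["c1"])
```

Each profile keeps track of which silo a field came from, so the combined view can still be traced back to its systems of record.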

The Technology Mega-Shift

The driving force defining this new normal is a tectonic shift in technology that would not have been possible were it not for the dot com boom, which led to the creation of a worldwide platform that was infinitely powerful, scalable and cheap. Those developments have gathered pace in the last 5 years, as what were considered nascent projects then have now become the building blocks of the architecture shaping the next generation of Big Data capabilities.

Commodity hardware

Making $50 terabyte disks stand up to the rigors of more than a billion people simultaneously searching a massive online ‘database’ is not easy. Yet this is exactly what Google does every second. What they had to write the book on is how to take commodity hardware and, with the right fault tolerance, processor power and networking, mesh it into a single unit as powerful as a supercomputer, at $2,000 a box.

“Store everything” file systems

Storage systems are getting simpler and more powerful. Distributed file systems, such as Hadoop’s HDFS, now allow users and companies to store all kinds of data in one place. So whether you need to handle transactional data, emails, voice logs, GPS triangulations, video feeds, chat transcripts, photos, machine data or data not even in existence today, the good news is it can all be stored, indexed and processed on one platform.
Structured and unstructured data can finally co-exist, even be processed and tagged to one ID, all on the same platform.

Faster is better - Let’s game the network

It’s a rule of nature: when things get big, they get difficult to carry around. Newer technologies are removing the need to move data, or to process it, before you can use it. Two interesting tools, Tableau, a visual intelligence software tool, and Hadoop, an open source data operating system, are good examples of this. Both Tableau and Hadoop have leveraged a unique advance as well as a change in discipline.

Rather than move the data to your code, which is time consuming, messy and sometimes downright impossible (one should not try to move one petabyte of data through a one gigabit network), they process the data where it is stored. They take the code to the data (megabits moved through a gigabit line, which is a lot faster) and then bring the results back to the user (once again, a much smaller amount of data to move).
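A back-of-the-envelope calculation shows why. The sketch below compares pushing one petabyte through a one gigabit per second link with shipping a small job to the data and bringing a small result back. The 1 MB job size and 10 MB result size are hypothetical figures picked for illustration. The calculation also assumes decimal units, a fully saturated link and no protocol overhead, so real transfers would be slower still.

```python
# Back-of-the-envelope transfer times over a 1 gigabit-per-second link.
# Assumptions: decimal units, a fully saturated link, no protocol overhead.

BITS_PER_BYTE = 8
LINK_BITS_PER_SECOND = 1_000_000_000  # 1 gigabit per second

PETABYTE = 10**15
MEGABYTE = 10**6

def transfer_seconds(num_bytes, link_bps=LINK_BITS_PER_SECOND):
    """Seconds needed to push num_bytes through a link of link_bps."""
    return num_bytes * BITS_PER_BYTE / link_bps

# Moving the data to the code: one petabyte over the wire.
data_to_code = transfer_seconds(PETABYTE)

# Moving the code to the data: a hypothetical 1 MB job going out,
# plus a hypothetical 10 MB result set coming back.
code_to_data = transfer_seconds(1 * MEGABYTE) + transfer_seconds(10 * MEGABYTE)

print(f"data to code: {data_to_code / 86_400:.1f} days")
print(f"code to data: {code_to_data:.3f} seconds")
```

Under these assumptions the petabyte takes about three months to move, while the job and its results cross the same line in well under a second.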

This opens up a range of possibilities, from the ability to model an entire population, rather than samples, to reducing the time taken to experiment with new data. It also allows the user to model with all instances, including tail events, hence building inherently superior algorithms with much better predictive ability.

Massively-parallel processing (MPP) analytics

While massively parallel systems have existed for a while, MPP analytics takes the parallelism on offer and supercharges analytical processing as well. The ability to write algorithms that can now leverage the benefits of parallelism is the next big leap in computing. It finally gives mankind the ability to solve problems that were either too expensive, too time consuming or just computationally impossible to solve.
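The core pattern behind this kind of parallel analytics can be sketched in a few lines: split the data into partitions, compute a small partial result on each partition independently, then combine the partials into the final answer. The example below is only an illustration of that pattern, not how any particular MPP product works. The function names and the four partitions are hypothetical, and a thread pool on one machine stands in for the separate nodes of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_stats(partition):
    """Work done independently on one partition: return (sum, count)."""
    return sum(partition), len(partition)

def parallel_mean(partitions, max_workers=4):
    """Compute a global mean by combining per-partition partial results."""
    # Threads stand in for cluster nodes here; each handles one partition.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial_stats, partitions))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

# Hypothetical data already split across four "nodes".
partitions = [list(range(0, 250)), list(range(250, 500)),
              list(range(500, 750)), list(range(750, 1000))]

mean = parallel_mean(partitions)
print(mean)
```

Only the small `(sum, count)` pairs travel between workers, which is why the approach scales: the raw data never has to be gathered in one place.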

The work being done on the Human Genome is a good example. It was first decoded in 2003. It took 10 years to do it and, as you may know, it wasn’t an easy thing to do. Decoding the human genome involves taking three billion base pairs and then processing various combinations and analyzing all of it at scale. With new technologies and advancements, what took ten years to complete in 2003 is now down to a week, at a tenth of the cost.

Just imagine the challenging problems that can now be turned on their head. With these rapid advancements, we can start questioning the assumptions that we were forced to make due to the technological limitations imposed on us and solve problems we thought were not solvable.

Want to read more? Download the rest of the whitepaper!
