Floating the Elephant – Hadoop's Continued Emergence


With the rise of big data we continue to see the growing importance of Hadoop as one of the premier big data infrastructures. Not only is it a powerful framework for reliable, scalable, distributed computing and data storage, but it’s also a very affordable solution. As an open-source platform, Hadoop is quite accessible to users and user groups of all sizes.

If you are new to Hadoop, it is essentially a software framework designed for handling large data sets. This type of data processing and its underlying technology have been in place in the consumer realm for quite some time. As Hadoop pushes further into the enterprise sphere, some distro vendors are looking for concrete numbers that they can pin down. Charles Zedlewski, VP of product at Cloudera (a major provider of Hadoop solutions, and thus obviously biased in one direction), recently shared his thoughts on the direction of Hadoop.

“The opportunity for big data has been stymied by the limitations of today’s current data management architecture,” said Zedlewski, who called Hadoop “very attractive” for advanced analytics and data processing.

Enterprises that ingest massive amounts of data (50 terabytes per day, for instance) aren’t well served by ETL (extract, transform, load) systems. “It’s very common to hear about people who are starting to miss the ETL window,” Zedlewski said. “The number of hours it takes to pre-process data before they can make use out of it has grown from four hours to five, six. In many cases, the amount of time is exceeding 24 hours.”
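To put the ETL window in concrete terms, here is a rough back-of-the-envelope sketch of the sustained rate a pipeline would need in order to get through a 50-terabyte daily feed inside a given window. The function name, the decimal definition of a terabyte (10**12 bytes), and the assumption of a perfectly steady processing rate are all illustrative choices, not figures from Zedlewski.

```python
# Sustained throughput needed to finish a daily batch inside a processing window.
# Assumptions (not from the article): decimal terabytes (1 TB = 10**12 bytes)
# and a perfectly steady processing rate with no startup or idle time.

BYTES_PER_TB = 10**12


def required_mb_per_second(terabytes: float, window_hours: float) -> float:
    """Return the steady rate, in MB/s, needed to process `terabytes` in `window_hours`."""
    total_bytes = terabytes * BYTES_PER_TB
    window_seconds = window_hours * 3600
    return total_bytes / window_seconds / 10**6


# Compare a four-hour window, a six-hour window, and the full 24-hour day.
for hours in (4, 6, 24):
    rate = required_mb_per_second(50, hours)
    print(f"{hours:>2} h window: {rate:,.0f} MB/s")
```

Under these assumptions, even using the whole day requires a sustained rate of roughly 579 MB/s. Once pre-processing takes longer than 24 hours, each day's data arrives before the previous day's has been finished, and the backlog only grows.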

In a nutshell, what we are looking at here is the ability to process and analyze volumes of data that would otherwise be impossible to get through within a single 24-hour day. Hadoop provides the speed necessary to do so, at a fraction of the cost of alternatives.

While Hadoop is probably not as cheap as it will get, it is certainly significantly cheaper than the alternatives. According to Zedlewski, the cost per terabyte of those alternatives can run from $5,000 to $15,000. “If you look at databases,” he says, “data marts, data warehouses, and the hardware that supports them, it’s not uncommon to talk about numbers more like $10,000 or $15,000 a terabyte.”

Zedlewski admits that Hadoop still has far to go before it is able to achieve “real-time” results. This is mainly because it wasn’t designed that way. Moreover, this will always be a challenge when working at the scale of terabytes and petabytes. And, increasingly, people seem to be realizing that “real-time” data isn’t always “right-time” data.

Hadoop is very well respected in the data storage and analysis market. Even the US government has used it to determine the locations of terrorist land mines and to proactively fix military helicopters.

What may be more surprising is that a complete implementation and management system, “including hardware, software, and other expenses, comes to about $1,000 a terabyte,” roughly one-fifth to one-twentieth the cost of alternative technologies and methods.
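As a quick sanity check on those ratios, the sketch below divides the roughly $1,000-a-terabyte Hadoop figure by the per-terabyte prices quoted earlier for alternatives ($5,000, $10,000, and $15,000). The constant and function names are made up for illustration; only the dollar amounts come from the article.

```python
# Hadoop's cost as a fraction of the alternatives, using the per-terabyte
# figures quoted in the article. All names here are illustrative.
from fractions import Fraction

HADOOP_COST_PER_TB = 1_000
ALTERNATIVE_COSTS_PER_TB = (5_000, 10_000, 15_000)


def cost_fraction(hadoop_cost: int, alternative_cost: int) -> Fraction:
    """Return Hadoop's cost as an exact fraction of the alternative's cost."""
    return Fraction(hadoop_cost, alternative_cost)


for alt in ALTERNATIVE_COSTS_PER_TB:
    share = cost_fraction(HADOOP_COST_PER_TB, alt)
    print(f"${alt:,}/TB alternative: Hadoop costs {share} as much")
```

With the quoted prices, the fractions come out to 1/5, 1/10, and 1/15. The one-twentieth end of the range would correspond to an alternative costing about $20,000 a terabyte, a price point that is not quoted directly here.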


What are your experiences with Hadoop? Are you using it now? If not, what is stopping you?