Haven’t heard of Hortonworks? Well, you’ve no doubt heard of Yahoo!, and it was from Yahoo! that Hortonworks was born. The company was created a year ago as a spinout of the Yahoo! engineering team behind Hadoop, the open source implementation of the MapReduce method of processing large sets of data. The increasingly prominent Hadoop is now about to see the first production release of what Hortonworks intends to be the flagship Apache Hadoop distribution.
Hadoop is open source software that enables distributed parallel processing of huge amounts of data across inexpensive, commodity servers. Much of the consumer market already relies on Hadoop daily: it sits behind social media, airline ticket searches, dinner reservations, and the hundreds of millions of search queries made every day.
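To make the processing model concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, written in plain Python. It does not use Hadoop's actual APIs. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are assumptions made for illustration only.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def run_job(splits):
    """Run the three phases over a list of input splits.

    On a real cluster each split would be mapped on a different server;
    here the splits are simply processed one after another.
    """
    pairs = [
        pair
        for split in splits
        for record in split
        for pair in map_phase(record)
    ]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Two hypothetical input splits, each a list of text lines.
    splits = [["big data big"], ["data on commodity servers"]]
    print(run_job(splits))
```

Because each split is mapped independently and each key is reduced independently, both phases can in principle be spread across many inexpensive machines, which is the property the paragraph above describes.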
Version 1 of the Hortonworks Data Platform (HDP), to be released June 15, will be Hortonworks’ first production-ready product release. Hortonworks was set up a year ago by Yahoo, along with Benchmark Capital, to provide enterprise support for Hadoop, the large-scale data analysis platform. Yahoo played a pivotal role in the early development of Hadoop.
Hortonworks now competes with a number of other companies also offering support packages, including Cloudera, MapR and IBM. Microsoft has chosen Hortonworks’ Hadoop distribution for use on its Azure cloud service, though that service, promised by the end of 2011, has not debuted yet.
Like other commercial Hadoop packages, HDP bundles a number of open-source Hadoop components, including the latest versions of the Pig scripting engine, the Hive data warehousing software and the HBase database.
Beyond these basic components, Hortonworks has added a number of management and interoperability tools to the package, all of them also based on open-source projects.
To aid in management, the package includes a customized version of Apache Ambari, a Hadoop monitoring and lifecycle management program. With this software, an administrator can set up a single Hadoop instance across a number of servers. Once Hadoop is installed, the software then monitors performance of the servers as well as the Hadoop jobs themselves, presenting the data on a dashboard.
“The dashboards are customizable and the APIs [application programming interfaces] allow the management and monitoring functionality to be tied into third-party dashboards like Hewlett-Packard’s OpenView or Teradata’s Viewpoint,” Kreisa said.
With this release, the management tools can manage only a single cluster, though future versions may be able to manage multiple clusters, said Ari Zilka, Hortonworks chief products officer. The metrics captured include network utilization, throughput and latency, and usage of CPUs, memory and disks. Hadoop jobs are also measured, including the time it takes for a task to start, how many tasks are in the backlog, how many data blocks a task uses and where those data blocks are located.
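As a rough illustration of how job-level measurements like these could be rolled up for a dashboard, the sketch below summarizes a handful of task samples. It is not Ambari's data model or API. The `TaskSample` fields, the `summarize` function and the host names are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskSample:
    """One hypothetical measurement for a single Hadoop task."""
    task_id: str
    start_delay_s: float  # seconds between submission and start
    started: bool         # False means the task is still in the backlog
    block_hosts: dict     # data block id -> host that stores the block


def summarize(samples):
    """Aggregate task samples into backlog, start delay and block placement."""
    started = [s for s in samples if s.started]

    # Count how many of the referenced data blocks live on each host.
    blocks_per_host = {}
    for sample in samples:
        for host in sample.block_hosts.values():
            blocks_per_host[host] = blocks_per_host.get(host, 0) + 1

    return {
        "backlog": sum(1 for s in samples if not s.started),
        "mean_start_delay_s": (
            mean(s.start_delay_s for s in started) if started else None
        ),
        "blocks_per_host": blocks_per_host,
    }


if __name__ == "__main__":
    samples = [
        TaskSample("t1", 2.0, True, {"b1": "node-a", "b2": "node-b"}),
        TaskSample("t2", 4.0, True, {"b3": "node-a"}),
        TaskSample("t3", 0.0, False, {"b4": "node-b"}),
    ]
    print(summarize(samples))
```

Only started tasks contribute to the mean start delay, so tasks still waiting in the backlog do not skew it toward zero.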
The diagram below illustrates the Hortonworks Data Platform.
The system includes a metadata catalogue designed to make it easier to run queries against Hadoop data sets, benefiting data analysis and BI applications. The Hortonworks package also includes a graphical interface, via Talend Open Studio, for exploring, querying, and applying logical workflows to data sets. This is similar to the way a desktop application such as MySQL Workbench provides a GUI for working with databases as an alternative to writing SQL directly.
In addition, Hortonworks has announced a partnership with VMware to provide a set of tools to run HDP in high-availability (HA) mode.
This is an exciting development in the world of big data. We’ll be looking forward to where things go in the coming months.