With Supercomputing 2012 underway this week, Datanami published a fantastic article with insights into the top 20 lessons from 20 different sessions that enterprise IT decision makers should take away with them. The original article is quite lengthy so here is each point presented more simply with links to the events on the website.
If you are attending SC12, remember to stop by and see us at the booths of our partners CoolIT (Booth 4629) and Intel (Booth 2511).
- It’s crucial to have an overall technical understanding of exactly what Big Data is – To help people understand precisely this, the first day of the show will be full of tutorials from Dr. Alex Szalay of Johns Hopkins University – one of the undisputed leaders in scientific “big data” computing – along with Robert Grossman of the University of Chicago and Collin Bennett from the Open Data Group. They will cover many of the tools and techniques that can be used for managing and analyzing large datasets, as well as a variety of related subjects including GrayWulf architectures, NoSQL databases, and distributed file systems including Hadoop.
- Approaching data analytics today requires acknowledging the reality that petabytes are common – With data volumes doubling each year, dealing with petabyte-sized archives has become the new normal, increasing the complexity of the infrastructure needed to support this growing level of data. On Monday, Scott Klasky and Ranga Raju Vatsavai from Oak Ridge National Lab join with Manish Parashar for a day-long workshop to tackle this issue: how do we deal with growing data sizes and the infrastructures required to work with them?
- Heterogeneous computing models that utilize both CPU and GPU are more important than ever – We’ve known for quite a while that parallel computing on a GPU can dramatically improve performance when floating point calculations are involved. Now that a team from Ohio State University has shown that MapReduce is a suitable framework for simplified parallel application development for many classes of applications, the heterogeneous GPU computing model is critical for dividing the workloads involved.
- Internet services and HPC have common research issues between them – Most notably, these issues center around exploiting locality and providing efficient communication. Going further on this line of thought, Internet services and HPC have overlapping areas of problems as well. Three that recur frequently in both are data placement, data indexing, and data communication. Jack Dongarra from the University of Tennessee at Knoxville and Zhiwei Xu from the Chinese Academy of Sciences present their argument on Wednesday afternoon.
- HPC and Hadoop’s MapReduce share many characteristics – These features include large data volumes, variety of data types, high CPU utilization, and distributed system architecture. Why is this important? Similar approaches to both will complement one another. Several researchers, led by Gilad Shainer of Mellanox, present the idea that RDMA-capable programming models will enable efficient data transfers between computation nodes, and that the Hadoop MapReduce framework can be ported to RDMA.
- Exascale and big data I/O will continue to drive the major trends in HPC – Research, technology, production evolution, and more are being pushed by the approach of exascale computing. The past ten years have seen tremendous growth in speeds and a corresponding decrease in latencies, and enterprise is now incorporating these technologies at a rapid rate. But what will the next generation of architectures look like? Bill Boas from the InfiniBand Trade Association moderates this discussion on Wednesday afternoon.
- What happens in labs doesn’t just stay in labs – As the author writes, “the computer science business at national labs might appear far removed from daily enterprise computing reality,” but this isn’t the case as new tools and applications which have real-world impact are consistently tested and refined. A team from some of the top labs in the US present on open source visualization for massive, complex datasets. The enterprise should pay close attention at this full-day event today.
- Hadoop, despite its popularity, is still underutilized – Researchers from Carnegie Mellon and the University of Washington analyzed and compared Hadoop workloads from three different research clusters from an application-level perspective. Their conclusion is that “Hadoop usage is still in its adolescence” and a number of features and capabilities are not being taken advantage of. During their presentation they will also identify areas and opportunities for optimization.
- Exascale science translates to big data and is spurring R&D and innovation in networking technology – Fermi National Lab operates one of the largest big data challenge sites in the United States, and its data, particularly from the Large Hadron Collider, is globally distributed. FermiLab must deal with both scaling and wide-area distribution challenges in processing its data. The team discusses how evolving technologies help them meet these challenges, along with their current R&D efforts in a variety of areas including network I/O, middleware modification, and network path reconfiguration.
- Hybrid-core computing is a potentially effective way of handling big data – Kirby Collins from Convey Computer presents the latest innovations in Hybrid-Core computing and shows how these high-bandwidth, highly parallel, re-configurable architectures can address the explosion in data stored and processed in daily life more effectively than conventional architectures.
- MapReduce is a very useful tool for analyzing patterns in large-scale graphs – Researchers make the case for analyzing these patterns in social networks such as Facebook, Twitter, and LinkedIn. Single computers cannot process large-scale graphs with millions or billions of edges, hence the need for Hadoop and MapReduce. Among the many possible applications are community identification, blog analysis, and intrusion and spam detection – analyses that remain impossible on a single machine.
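To make the map/shuffle/reduce pattern behind such graph analyses concrete, here is a minimal single-machine sketch that computes vertex degrees from an edge list. The graph and node names are invented for illustration; a real Hadoop job would apply the same two phases to an edge list sharded across a cluster.

```python
from collections import defaultdict

# Hypothetical edge list for a tiny social graph; in practice this would be
# a multi-terabyte file split across a Hadoop cluster.
edges = [("alice", "bob"), ("bob", "carol"), ("alice", "carol"),
         ("carol", "dave"), ("dave", "alice")]

# Map phase: emit a count of 1 for each endpoint of each edge.
def map_edge(edge):
    src, dst = edge
    yield (src, 1)
    yield (dst, 1)

# Shuffle phase: group the intermediate (node, count) pairs by node.
grouped = defaultdict(list)
for edge in edges:
    for node, count in map_edge(edge):
        grouped[node].append(count)

# Reduce phase: sum the counts per node to get its degree.
degrees = {node: sum(counts) for node, counts in grouped.items()}
print(degrees)  # {'alice': 3, 'bob': 2, 'carol': 3, 'dave': 2}
```

Because the map and reduce functions see only one edge or one key at a time, the framework is free to run them on thousands of nodes in parallel, which is what makes billion-edge graphs tractable.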
- Graph analytics, not just conventional analytics, have important implications for HPC – Amar Shan from YarcData and Shoaib Mufti from Cray discuss how, even though big data has grown enormously in importance over the past 5 years, most data intensive computing remains focused on conventional analytics. They draw a distinction between conventional and graph analytics: conventional involves searching, aggregating and summarizing data sets while graph analytics go beyond to search for patterns of relationships. Shan and Mufti aim to bring together practitioners of graph analytics and discuss system architectures and software designed specifically for this application.
- Users are unaware of the true costs of data storage – Matthew Drahzal from IBM will argue that users are completely unaware of the challenges and costs of increasing data storage. Since the growth rate of stored data is higher than the areal-density growth rate of spinning disks, organizations are purchasing more disk and spending more IT budget on managing data. In other words, while the cost of computation is decreasing, the cost to store, move, and manage the resultant information is increasing. He also describes how IBM is working on new technologies and approaches to lower data cost-curves.
- A new query processing framework is needed for filtering large data sets – The traditional methods of filtering data stored in scientific databases have generally been considered impractical due to their immense resource requirements. Filtering has often been done during simulation or by loading snapshots into the aggregate memory of an HPC cluster. A team of researchers from Johns Hopkins University will describe a new query processing framework for the efficient evaluation of spatial filters on large numerical simulation datasets stored in a data-intensive cluster, performing filtering within the database and supporting large filter widths.
- Computation and 3D active stereo visualization have advanced data analysis – Two researchers from Los Alamos National Laboratory will share how innovation in computation and 3D active stereo visualization has improved our ability to analyze data. They will demonstrate the combination of computation and 3D stereo visualization for the analysis of large complex data sets, as well as host an open discussion of visualization and the new frontier of data analysis.
- BFS algorithms have been used to boost in-memory graph exploration performance – A team from Ohio State University will describe the challenges they faced in research they conducted involving graph explorations with a distributed in-memory approach. This team designed a family of highly-efficient Breadth-First Search (BFS) algorithms. They will discuss the successes from optimizing these BFS algorithms on the latest two generations of Blue Gene machines, as well as areas of improvement needed going forward.
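The level-synchronous frontier expansion at the heart of such BFS work can be sketched in a few lines. This is a minimal serial version on an invented toy graph; the OSU team's contribution is distributing the frontier and visited set across the memory of many Blue Gene nodes, which this sketch does not attempt.

```python
from collections import deque

# Toy adjacency-list graph (vertex ids and edges are illustrative only).
graph = {
    0: [1, 2],
    1: [0, 3],
    2: [0, 3],
    3: [1, 2, 4],
    4: [3],
}

def bfs_levels(graph, source):
    """Return the BFS level (hop distance) of every vertex reachable from source."""
    levels = {source: 0}
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for w in graph[v]:
            if w not in levels:          # first visit fixes the vertex's level
                levels[w] = levels[v] + 1
                frontier.append(w)
    return levels

print(bfs_levels(graph, 0))  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

In a distributed in-memory setting, each expansion of the frontier becomes a communication round between nodes, which is why the algorithm's performance hinges on the machine's interconnect.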
- Graph traversal is growing in its importance in HPC – Intel joins the same unit from Ohio State for another look at BFS algorithms used at massive scale. This session will explore graph traversal in a different light, noting that it goes far beyond the world of supercomputing benchmarks that test data-intensive computing performance. Graph traversal is used in many fields including social networks, bioinformatics, and HPC. They will address how this application is well-optimized for single-node CPUs, but current cluster implementations suffer from high-latency, large-volume inter-node communication, with low performance and energy efficiency. The team will describe the novel approach they used to obtain a 6.6X performance improvement and order-of-magnitude energy savings.
- With greater availability of data, efficiency in access, analysis, and distribution is more critical than ever – This is especially true in the HPC space where data stores can grow rapidly. While computation technology has kept pace, storage has lagged, “creating a barrier between researchers and their data.” This session examines how implementing scale-out storage can eliminate the storage bottleneck in HPC and put data immediately into the hands of those who need it most.
- Layer-4 relays can improve throughput and reduce impact of large end-to-end latencies – A team from Virginia Tech led by Torsten Hoefler from ETH Zurich will present their case that throughput can be improved if one can reduce the impact of large end-to-end latencies by introducing layer-4 relays along the path. Such relays would enable a cascade of TCP connections, each with lower latency, resulting in better aggregate throughput. The team argues that this would directly benefit typical applications as well as big data applications in distributed HPC.
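A rough back-of-the-envelope model shows why splitting a path helps; this is an illustration of the general window/RTT reasoning, not the team's actual analysis, and the window size and latencies are invented numbers.

```python
# Steady-state TCP throughput on one connection is roughly bounded by
# window_size / RTT. Splitting a high-latency path into k relayed TCP
# segments, each with ~RTT/k, raises each segment's bound k-fold; the
# cascade is limited by its slowest segment.

def tcp_throughput_bound(window_bytes, rtt_seconds):
    """Classic window/RTT upper bound on a single TCP connection, in bytes/sec."""
    return window_bytes / rtt_seconds

window = 64 * 1024          # 64 KiB window (illustrative)
end_to_end_rtt = 0.200      # 200 ms wide-area path (illustrative)

direct = tcp_throughput_bound(window, end_to_end_rtt)

# Insert 3 layer-4 relays -> 4 TCP segments, each with ~1/4 the latency.
segments = 4
relayed = min(tcp_throughput_bound(window, end_to_end_rtt / segments)
              for _ in range(segments))

print(f"direct:  {direct / 1e6:.2f} MB/s")   # 0.33 MB/s
print(f"relayed: {relayed / 1e6:.2f} MB/s")  # 1.31 MB/s
```

The model ignores relay buffering and processing overhead, which is part of what the session's measurements would need to account for.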
- Big data is needed to face the cyber-security threats of the modern world – Daniel M. Best from Pacific Northwest National Laboratory will explore how novel and efficient algorithms are needed to investigate and unravel threats in cyberspace. In this presentation, Best will bring together domain experts from various research communities to discuss current techniques and the grand challenges being researched.
Read the full piece at Datanami – 20 Lessons Enterprise CIOs Can Learn from Supercomputing
Every year, the annual Supercomputing Conference draws thousands of global leaders, practitioners, and students in high performance computing. Enterprise HPC and big data decision makers have increasingly found interest in this conference, seeking knowledge that will help them manage and grow their business. Learn more about SC12.