Key points from the Yahoo post below on betting on Apache Hive, Tez, and YARN:
- Pig was developed by Yahoo in 2007
- Yahoo adopted Hive in 2010 with limited usage
- Growth in the percentage of Hive jobs relative to overall Hadoop jobs
- Benchmark on 300 nodes (I would be happy if I had 5 nodes for my tests)
- Most of the queries failed on Shark with a 10TB dataset
- The in-memory approach fails when: a) data is too large to fit in memory; or b) memory resources have to be shared among tenants or users
Original Yahoo post:
Yahoo Betting on Apache Hive, Tez, and YARN
by The Hadoop Platforms Team
Low-latency SQL queries, Business Intelligence (BI), and Data Discovery on Big Data are some of the hottest topics in the industry these days, with a range of solutions coming to life lately to address them as either proprietary or open-source implementations on top of Hadoop. Some of the popular ones talked about in the Big Data communities are Hive, Presto, Impala, Shark, and Drill.
Hive’s Adoption at Yahoo
Yahoo has traditionally used Apache Pig, a technology developed at Yahoo in 2007, as the de facto platform for processing Big Data, accounting for well over half of all Hadoop jobs to date. One of the primary reasons for Pig’s success at Yahoo has been its ability to express complex processing needs well through feature-rich constructs and operators ideal for large-scale ETL pipelines, something that is not easy to do in SQL. Researchers and engineers working on data systems built on Hadoop at the time found it an order of magnitude better than working with the Java MapReduce APIs directly. Apache Pig settled in and quickly made a place for itself among developers.
Over time, and with increased adoption of the Hadoop platform across Yahoo, a SQL or SQL-like solution over Hadoop became necessary for the ad hoc analytics that Pig was not well suited for. SQL is the most widely used language for data analysis and manipulation, and Hadoop had started to reach beyond data scientists and engineers to downstream analysts and reporting teams. Apache Hive, originally developed at Facebook in 2007-2008, was a popular and scalable SQL-like solution available over Hadoop at the time, running in batch mode on Hadoop’s MapReduce engine. While Yahoo adopted Hive in 2010, its use remained limited.
On the other hand, MapReduce, Pig, and Hive, all running on top of Hadoop, raised concerns around sharing data among applications written using these different approaches. Pig and MapReduce’s tight coupling with the underlying data storage was also an issue in terms of managing schema and format changes. As a result, Apache HCatalog, a table and storage management layer, was conceived at Yahoo in 2010 to give MapReduce, Pig, and Hive a shared schema and data model through wrappers around Hive’s metastore. HCatalog eventually merged into the Hive project in 2013, but remained central to our effort to register all data on the platform in a common metastore and make it discoverable and sharable with controlled access.
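To make the idea concrete, here is a minimal sketch (the table and column names are hypothetical, not from the original post): a table registered once through Hive DDL lands in the shared metastore, and a Pig script can then read it through HCatalog without hard-coding paths, schemas, or storage formats.

    -- Register an illustrative table once in the shared metastore.
    CREATE TABLE page_views (
      user_id   STRING,
      url       STRING,
      view_time TIMESTAMP
    )
    PARTITIONED BY (dt STRING)
    STORED AS RCFILE;

    -- A Pig script can then load the same table via HCatalog, e.g.:
    --   A = LOAD 'default.page_views' USING org.apache.hive.hcatalog.pig.HCatLoader();
    -- with the schema and storage format resolved from the metastore.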
The Need for Interactive SQL on Hadoop
By mid-2012, the need to make SQL over Hadoop more interactive became material as specific use cases and requirements emerged. At the same time, Yahoo had undertaken a large effort to stabilize Hadoop 0.23 (the pre-Hadoop 2.x branch) and YARN in order to roll them out at scale on all our production clusters. YARN’s value propositions were absolutely clear. To address the interactive SQL use cases, we started exploring our options in parallel, and around the same time Project Stinger was announced as a community-driven project from Hortonworks to make Hive capable of handling a broad spectrum of SQL queries (from interactive to batch) while extending its analytics functions and standard SQL support. An early version of HiveServer2 also became available to address the concurrency and security issues in connecting to Hive over the standard ODBC and JDBC interfaces that BI and reporting tools like MicroStrategy and Tableau needed. We decided to stick with Hive and participate in its development and phased (Phases I, II, III) delivery. At this point, Hive also happens to be one of the fastest growing products in our platform technology stack (Fig 1), confirming that SQL on Hadoop is a hot topic for good reasons.
Fig 1. Growth in Hive jobs relative to overall Hadoop jobs
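As a small illustration of the HiveServer2 connectivity mentioned above (the host name is a placeholder; 10000 is HiveServer2’s default port), a BI tool or Hive’s Beeline shell connects over a standard JDBC URL and then issues plain SQL:

    -- From the Beeline shell shipped with Hive:
    --   !connect jdbc:hive2://hiveserver-host:10000/default
    -- Once connected, queries are ordinary HiveQL (table name is illustrative):
    SELECT COUNT(*) FROM page_views WHERE dt = '2014-01-01';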
Why Hive?
So, why did we stick with Hive or as one may say, bet on Hive? We did an evaluation of available solutions, and stayed the course we were on with Hive as the best solution for our users for several key reasons:
- Hive is the SQL standard for Hadoop that has been around for seven years, battle tested at scale, and widely used across industries
- A single solution that works across a broad spectrum of data volumes (more on this in the performance section)
- HCatalog, part of Hive, acts as the central metastore for facilitating interoperability among various Hadoop tools
- A vibrant community from many well known companies with top notch engineers and architects vested in its future
- A Top Level Project (TLP) with the Apache Software Foundation (ASF), which offers several advantages, including our deep familiarity with the ASF and the related Hadoop ecosystem projects under Apache, and clarity around making contributions to gain influence in the community, which may allow Yahoo to evolve Hive in a direction that meets our users’ needs
- Perhaps one of the few SQL-on-Hadoop solutions around that have been widely certified by BI vendors (an important distinction, as Hive is in many cases used directly by data analysts and reporting teams)
- Alleviating performance concerns through relentless phased delivery (Hive 0.11, 0.12, and 0.13) against the initially stated performance goals
Query Performance on Hive 0.13
Since performance was one of users’ biggest concerns with Hive 0.10, the version Yahoo was running, we focused our benchmarks on Hive’s query performance; that is not to say that the significant facelift in features in later versions of Hive wasn’t important.
In one of the recent performance benchmarks Yahoo’s Hive team conducted on the January version of Hive 0.13, we found the query execution times dramatically better than Hive 0.10 on a 300-node cluster. To give you an idea of the magnitude of the performance difference we observed, Fig 2 shows TPC-H benchmark results with a 100 GB dataset on Hive 0.10 with the RCFile (Row Columnar) format on Hadoop 0.23 (MapReduce on YARN) vs. Hive 0.13 with the ORC File (Optimized Row Columnar) format, Apache Tez on YARN, vectorization, and Hadoop 2.3. Security was turned off in both cases. With Hive 0.13, 18 out of 21 queries finished under 60 seconds, with the longest still under 80 seconds. Also, Hive 0.13 execution times were comparable to or better than Shark’s on a 100-node cluster.
Fig 2. Hive 0.10 vs. Hive 0.13 on 100 GB of data
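For readers who want a feel for the setup, the following is a rough sketch of the kind of session-level configuration behind the Hive 0.13 numbers above; the exact benchmark settings were not published in this post, and the table name follows the standard TPC-H schema:

    -- Run on Tez instead of MapReduce, and turn on vectorized execution
    -- (both available in Hive 0.13):
    SET hive.execution.engine=tez;
    SET hive.vectorized.execution.enabled=true;

    -- Convert a TPC-H table from its source format to ORC:
    CREATE TABLE lineitem_orc STORED AS ORC
    AS SELECT * FROM lineitem;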
At higher volumes of data (Figs 3 and 4), Hive 0.13 not only delivered significantly better query execution times but also completed queries successfully without failing. In our comparisons and observations with Shark, we saw most queries fail with the larger (10 TB) dataset. These same queries ran successfully, and much faster, on Hive 0.13, allowing for better scale. This was extremely critical for us, as we needed a single query and BI solution on the Hadoop grid regardless of dataset size. The Hive solution resonates with our users, as they do not have to worry about learning multiple technologies and discerning which solution to use when. A common solution also yields cost and operational efficiencies, since we only have to build, deploy, and maintain a single system.
Fig 3. Hive 0.10 vs. Hive 0.13 on 1 TB of data
Fig 4. Hive 0.10 vs. Hive 0.13 on 10 TB of data
The performance of Hive 0.13 is certainly impressive compared with its predecessors, but one must understand how these performance improvements came about. Several systems rely on caching data in memory to lower latency. While this works well for some use cases, the approach fails when the data is too large to fit in memory, or in a shared multi-tenant environment where memory resources have to be divided among tenants or users. Hive 0.13, on the other hand, achieves comparable performance through ORC, Tez, and vectorized query processing, which do not suffer from the issues noted above. On the flip side, building solutions in this manner certainly requires heavy engineering investment (hundreds of man-months in the case of Hive 0.13 since the start of 2013) for robustness and flexibility.
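To make the contrast concrete: ORC keeps data columnar, compressed, and lightly indexed on disk rather than pinned in RAM, so a table larger than cluster memory remains queryable. A minimal sketch (names are illustrative; the properties shown are standard ORC table properties):

    CREATE TABLE events_orc (
      event_id BIGINT,
      payload  STRING
    )
    STORED AS ORC
    TBLPROPERTIES (
      'orc.compress'     = 'ZLIB',  -- block compression on disk
      'orc.create.index' = 'true'   -- lightweight min/max indexes per stripe
    );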
Looking Ahead
We are excited about the work going on in the Hive community to take Hive 0.13 to the next level in subsequent releases, in terms of both features and performance, in particular the cost-based query optimizations and the ability to perform inserts, updates, and deletes with full ACID support.
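For a sense of where that is headed, here is a hedged sketch of what such ACID statements look like in the Hive community’s design (table and column names are illustrative; the design requires bucketed ORC tables marked transactional):

    CREATE TABLE users (
      id   BIGINT,
      name STRING
    )
    CLUSTERED BY (id) INTO 8 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    UPDATE users SET name = 'alice' WHERE id = 42;
    DELETE FROM users WHERE id = 43;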