May 29, 2014 | Bob Wilkinson
Presto Change-O: Tez Helps Hive to Near Presto Performance
This will be a two-part addendum to the Open-Source SQL-on-Hadoop Performance benchmark published by Radiant Advisors and John O’Brien. If you have not had a chance to read it yet, I highly recommend it. It’s not the end-all, be-all of benchmarks, but IMHO it’s a fair comparison of several different Open-Source SQL-on-Hadoop solutions over a representative analytics and reporting workload in the web analytics domain. My continuing assertion is that real-time/interactive SQL-on-Hadoop is going to be increasingly important in the enterprise as Hadoop adoption grows and that we have compelling solutions that can deliver the kind of performance required for advanced operational applications.
As an aside - the biggest criticism of the initial benchmark is that it is relatively “small” in Hadoop terms. I don’t dispute that claim at all - it was strictly a practical matter of how much we wanted (and were able…) to contribute to AWS profit margins, and not because any one of the technologies included cannot scale. If there is someone out there with resources to execute this on a larger scale, we would love to collaborate on the effort - I’m certainly confident with how InfiniDB scales and welcome the chance to see how the other technologies fare also.
For today’s update, I will focus on Hive performance using Tez. When we initially collaborated with Radiant on the benchmark, Hive/Tez support was not generally available and thus not included in the report. This time around I used the latest Hortonworks HDP 2.1 release with Hadoop 2.4.0, Hive 0.13.0, and Tez 0.4.0. The system configuration is identical to the prior benchmark effort - 5 m1.xlarge instances.
Enabling Tez support in HDP 2.1 was simple - just set the hive.execution.engine property as follows:
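For reference, here is a minimal sketch of what that setting typically looks like when issued from a Hive session. The property name comes from the text above. The exact statement used for these runs is an assumption, as is setting it per session rather than in hive-site.xml.

```sql
-- Assumed per-session form: switch Hive's execution engine from MapReduce to Tez.
SET hive.execution.engine=tez;

-- To fall back to classic MapReduce execution for comparison:
-- SET hive.execution.engine=mr;
```

Setting it per session makes it easy to run the same query under both engines back to back.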
This table summarizes the results using the same query numbers as reported in the benchmark. I included the Hive 0.12 and Presto 0.57 performance numbers as-is from the report for reference purposes - the Hive 0.13/Tez and InfiniDB 4.5.1 numbers are new.
As you can see, the Tez team has done a great job improving the performance of Hive. Hive is slightly slower in aggregate but actually beat Presto on 3 of the 10 queries and was able to run 8 of the 10 queries in total, whereas Hive 0.12 only ran 4. For clarification, the “TSTC*” note in the table is my “too-slow-to-care” designation. This benchmark is all about interactive SQL performance, so I arbitrarily threw out anything that was going to run for hours.
A few caveats about the results. First, in the report, for all technologies except Hive, we performed an explicit restart/flush operation before each query to ensure a clean/consistent initial state. I have to plead ignorance with respect to any caching in Tez that could benefit performance across query executions, but restarting core Hadoop services is generally a painfully slow process, so I continue to skip that for the Hive/Tez runs here. Second, Hive/Tez seemed constrained to a “mostly correct” answer on Adhoc Q-06 and Analytic Q-10 - in subsequent runs of the same query, Hive included a varying handful of “bonus” rows in the otherwise correct result set. I’m sure these bugs are things the Hive development teams will take care of over time.
Hopefully this has been helpful - I love to see progress like this in the space because it means that more and more people will have access to and expect the level of performance that is possible with modern SQL-on-Hadoop technology. If you have any questions or are interested in collaborating on future benchmarking efforts, please get in touch with me at firstname.lastname@example.org, or on Twitter @bobwilkinon20.