June 14, 2014 | Bob Wilkinson
InfiniDB on MapR: The Fast Gets Faster
In my last blog post, I promised a two-part follow-on to the Open-Source SQL-on-Hadoop Performance benchmark published by Radiant Advisors and John O’Brien. When I wrote that statement, I intended part two to cover Shark SQL. There has been a good amount of market “noise” about Shark of late, and it has come up in customer conversations more frequently. I set out with good intentions and the expectation of finding another viable query engine competitor. Well, imagine my surprise when, after several hours of reading, searching, and effort spread over a few days, I could not produce a working Shark installation. I mean no disrespect to the community working on Shark, and I am sure I could have eventually made it work, because people clearly do. However, it’s hard to dispute that there are some challenges when the latest version (0.9.1) was released in April 2014, while the most recent install documentation covers version 0.8.0 and has not been updated since December 2013. I decided to move on, and I hope to revisit Shark at a later date if and when the project resolves some of the basic installation and documentation issues.
As an alternative, I decided to pair InfiniDB with the MapR Hadoop distribution. Initially, we focused InfiniDB on the Cloudera and Hortonworks distributions because they seem to have the broadest market acceptance. However, I have been intrigued by what MapR is doing with their MapR Filesystem, and I wanted to see whether their performance claims hold up when integrated with InfiniDB.
This update is a bit different in the sense that it is not a competitive update, but rather an update focused on the impact of the MapR Filesystem on InfiniDB, as compared to the standard Apache Hadoop File System (HDFS). The environment for this test is identical to all previous work - 5 m1.xlarge AWS instances. For the MapR cluster, I used the 3.1 M3 version (based on Core Hadoop 1.0.3) with one control node and four data nodes. For the HDFS comparison, I used a Cloudera 5.0.1 install (based on Core Hadoop 2.3.0).
Before jumping into the results, a few comments on integration with the MapR Filesystem. MapR offers two methods for integration - either a native “C” library API implementation (libMapRClient.so) that bypasses Java/JNI, or a fully compatible libhdfs.so that does use Java/JNI. I decided to try both and see whether the difference was noticeable. The integration with InfiniDB encountered only a few hiccups. First, the MapR native “C” library does not implement hdfsCopy(), which we use in one area, so I had to write my own implementation in our HDFS “plugin” code. Second, there was some inconsistency with regard to rewriting a file, which required a minor workaround in our plugin while the MapR dev team looks at the issue.
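For readers wondering what replacing a missing copy call can look like, here is a minimal sketch in Python. It is not the InfiniDB plugin code. It only illustrates the general idea of emulating a copy with a read/write loop over two open file handles. The function name `copy_stream`, the chunk size, and the in-memory buffers standing in for filesystem handles are all assumptions made for illustration.

```python
import io


def copy_stream(src, dst, chunk_size=64 * 1024):
    """Copy every byte from src to dst in fixed-size chunks.

    src and dst are any file-like objects opened in binary mode.
    Returns the number of bytes copied. Hypothetical helper, not a
    MapR or HDFS API.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty read signals end of file
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# In-memory buffers stand in for real source and destination files.
src = io.BytesIO(b"column data " * 1000)
dst = io.BytesIO()
copied = copy_stream(src, dst, chunk_size=4096)
```

Copying in bounded chunks keeps memory use flat regardless of file size, which matters when the files involved are large column segments.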
Here is a chart showing query times for the original benchmark queries run in three different modes (MapR native, MapR libhdfs, Apache HDFS):
In nearly all cases (Q9 was a virtual dead heat), the queries run faster against the MapR Filesystem compared to HDFS. Some queries showed more improvement than others, but in aggregate, the full query set ran ~12% faster. This is not as high as typical MapR claims. However, InfiniDB is exceptionally efficient with I/O, so logically, the gains due to the MapR filesystem are limited to the I/O portion of the queries. The other interesting observation was that performance with the native library was all but indistinguishable from the libhdfs version. This result surprised me and I don’t have any explanation for it other than to conclude that the Java/JNI layer is clearly not a bottleneck.
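To make the arithmetic behind these two points concrete, here is a small sketch. The first function computes an aggregate improvement as the reduction in total run time across a query set, which is one reasonable reading of "ran ~12% faster" but an assumption on my part. The second applies the familiar Amdahl-style bound: if only the I/O portion of a query speeds up, the overall gain is capped by that portion. All numbers in the example are made-up placeholders, not benchmark measurements.

```python
def percent_faster(baseline_times, new_times):
    """Percent reduction in total run time across a set of queries."""
    baseline_total = sum(baseline_times)
    new_total = sum(new_times)
    return (baseline_total - new_total) / baseline_total * 100.0


def overall_reduction(io_fraction, io_speedup):
    """Percent reduction in total time when only the I/O part speeds up.

    io_fraction: share of baseline time spent in I/O (0.0 to 1.0).
    io_speedup:  factor by which the I/O part gets faster (e.g. 1.5).
    """
    new_time = (1.0 - io_fraction) + io_fraction / io_speedup
    return (1.0 - new_time) * 100.0


# Placeholder timings in seconds, purely illustrative.
hdfs_times = [10.0, 20.0, 30.0]
mapr_times = [9.0, 18.0, 27.0]
aggregate = percent_faster(hdfs_times, mapr_times)

# A query spending 40% of its time in I/O, with I/O 1.5x faster.
capped = overall_reduction(io_fraction=0.4, io_speedup=1.5)
```

The second function shows why an engine that spends little of its time in I/O sees a modest overall gain even when the filesystem itself is considerably faster.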
Before leaving these two clusters, I decided to experiment with a few more queries that I knew would be I/O intensive. The first two were COUNT(*) queries, one touching a single column and one touching two columns, and each was essentially a column scan operation in InfiniDB. The third was a DML statement that involved updating a particular column. As a side note, InfiniDB is one of the few Hadoop query engines to support a full DML syntax, and it is relatively efficient at it because the InfiniDB file format stores each column in separate files.
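As a rough illustration of why column-per-file storage helps with both scans and single-column updates, here is a toy model in Python. It is not InfiniDB's file format or API. The class `ToyColumnStore` and its methods are hypothetical, and each in-memory list stands in for a separate column file. The point is only that a count or an update on one column never touches the others.

```python
class ToyColumnStore:
    """Toy column store: one list per column stands in for one file per column."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}
        self.columns_read = []     # log of columns scanned
        self.columns_written = []  # log of columns rewritten

    def count_where(self, column, predicate):
        """Count matching rows by scanning a single column."""
        self.columns_read.append(column)
        return sum(1 for value in self.columns[column] if predicate(value))

    def update(self, column, new_value_fn):
        """Rewrite one column in place; other columns are untouched."""
        self.columns_read.append(column)
        self.columns_written.append(column)
        self.columns[column] = [new_value_fn(v) for v in self.columns[column]]


store = ToyColumnStore({
    "id": [1, 2, 3, 4],
    "region": ["east", "west", "east", "north"],
    "price": [10, 25, 40, 5],
})

expensive = store.count_where("price", lambda p: p > 20)
store.update("price", lambda p: p * 2)
```

In a row-oriented layout the same update would read and rewrite whole rows, so the I/O cost would grow with table width rather than with the size of the one column being changed.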
The single column count and the DML statement showed almost 30% improvement. This result was very nice and consistent with the hypothesis that I/O intensive queries will show greater performance differentiation. The two column count differential was more in line with the other queries at around 12%. This result was a bit of a “head scratcher”, but consistent across runs and across different columns.
I hope that this has been helpful. We will be looking to add official MapR support to our distribution in an upcoming release. In the meantime, if you are interested in the combination of MapR and InfiniDB, please get in touch with me and we will make it happen - firstname.lastname@example.org, or on Twitter @bobwilkinson20.