July 11, 2014 | Bob Wilkinson
Making Sense of the "Fastest SQL-on-Hadoop" Morass
There’s a mind-numbing number of products and technologies all claiming to be the “Fastest SQL-on-Hadoop”. At the recent Hortonworks and Yahoo Hadoop Summit, this prompted Gartner’s Nick Heudecker (@nheudecker) to tweet that at least eight companies on the show floor were proclaiming that tagline, and fellow analyst Merv Adrian (@merv) to tweet that “Fastest SQL-on-Hadoop” has become its own product category. Those of us in the space who eat, drink, and breathe Big Data SQL can usually cut through much of the hype and classify technologies into a few different buckets. However, for the typical technology consumer in IT or product development, this can be incredibly challenging. In this blog post, I’ll present a broad taxonomy that can be applied to help make sense of the “Fastest SQL-on-Hadoop” morass.
First and foremost, when introducing InfiniDB to someone, I always like to begin by discussing the two dimensions of Big Data and Data Warehousing: structured vs. unstructured, and online transaction processing (OLTP) vs. analytics.
The first dimension relates to the characteristics of the data itself. I have argued in earlier blog posts that virtually all data has structure. However, in this context, I more specifically use “structured” to mean data that can be interpreted with a SQL schema. “Unstructured” covers data held in any non-SQL data store, most commonly JSON document or key-value stores.
The second dimension is about the characteristics of the workload. OLTP applications perform an extremely high volume of operations, each touching a small amount of data, such as order processing or transaction processing. Analytics systems, by contrast, handle a lower volume of operations, but each one works across a very large amount of data. For example, a telecom company might aggregate mobile network activity for 24 hours to identify “top talkers”.
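To make the workload dimension concrete, here is a minimal Python sketch of the two access patterns. It uses the standard-library sqlite3 module purely as a stand-in so that it runs anywhere; it is not a Hadoop engine, and the table name, column names, and sample rows are all hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of hypothetical call records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE calls ("
        "call_id INTEGER PRIMARY KEY, subscriber TEXT, minutes INTEGER)"
    )
    rows = [
        (1, "alice", 30),
        (2, "bob", 5),
        (3, "alice", 45),
        (4, "carol", 20),
        (5, "bob", 10),
        (6, "carol", 60),
    ]
    conn.executemany("INSERT INTO calls VALUES (?, ?, ?)", rows)
    return conn


def lookup_call(conn, call_id):
    """OLTP-style access: fetch a single row by its key."""
    cur = conn.execute(
        "SELECT subscriber, minutes FROM calls WHERE call_id = ?",
        (call_id,),
    )
    return cur.fetchone()


def top_talkers(conn, limit):
    """Analytics-style access: aggregate over every row in the table."""
    cur = conn.execute(
        "SELECT subscriber, SUM(minutes) AS total_minutes "
        "FROM calls GROUP BY subscriber "
        "ORDER BY total_minutes DESC, subscriber ASC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(lookup_call(conn, 3))   # one small operation
    print(top_talkers(conn, 2))   # one operation over all the data
```

The point lookup reads one row no matter how large the table grows, while the “top talkers” query has to read every row before it can answer. That difference is why the two workloads tend to favor different engines.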
Let’s consider a two-by-two matrix and drill down a bit into each quadrant:
Quadrant 1: Unstructured OLTP - This quadrant includes products like Apache HBase, MongoDB, Cassandra, Couchbase, etc. While some of these technologies claim the ability to do analytics, the reality is that they are better at taking over OLTP workloads from the “legacy” general-purpose DBMSs than at handling any serious analytical workload. By definition, these technologies don’t play directly into SQL-on-Hadoop (they are “unstructured” after all). Rather, as a testament to the ubiquity of SQL, most of these technologies support an “SQL-like” language. In terms of SQL-on-Hadoop specifically, the Apache Phoenix project is “A SQL Layer over HBase”.
Quadrant 2: Structured OLTP - This is the domain where Oracle, MySQL and the other general-purpose DBMSs reign supreme. From an SQL-on-Hadoop standpoint, there is one notable player - SpliceMachine. SpliceMachine aims to compete and win against Oracle with a Hadoop- and HBase-derived technology stack. They even go so far as to use “the only Hadoop RDBMS” as the core part of their tagline. SpliceMachine has momentum, having closed a $15M B-round earlier this year.
Let’s pause for a minute before covering the last two quadrants and note that when the players in these two quadrants claim that their solutions are the “Fastest SQL-on-Hadoop”, they really mean that their solutions process the most SQL transactions per second on Hadoop.
Now, moving on…
Quadrant 3: Unstructured Analytics - This is admittedly the hardest category to quantify and the one with the least applicability to SQL-on-Hadoop. The best-known name here is probably Splunk. In their words, “Splunk is a distributed, non-relational, semi-structured database with an implicit time dimension”. In layman’s terms, Splunk consumes text data (often machine-generated log data) and applies some implicit structure to enable analytics. I would also put technologies like Elasticsearch in this quadrant. Solutions like Splunk and Elasticsearch tend to have their own custom query languages and haven’t made any noticeable move towards SQL.
Quadrant 4: Structured Analytics - Last, but certainly not least, I get to the home quadrant of InfiniDB. Solutions in this space are designed to ingest large volumes of data (typically in “batch” fashion) and then support fast and efficient query access across data volumes that are often extreme. In the context of Hadoop, this quadrant is by far the most crowded. In addition to InfiniDB, the open-source SQL solutions emerging from the Hadoop community live in this quadrant, including Apache Hive, Presto, Impala, Apache Tajo, and Shark. Several of the major commercial vendors of analytics databases, including Pivotal (an EMC subsidiary, with Hawq), Actian, HP Vertica, etc., have also announced products that belong in this quadrant. There are also several new commercial solutions from companies such as Jethrodata and Hadapt.
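For readers who think better in code, the four quadrants can be sketched as a small lookup table. This is only an illustrative Python sketch: the enum names, dictionary names, and helper functions are hypothetical, and the product lists are simply the examples named in this post, not an exhaustive catalog.

```python
from enum import Enum


class Structure(Enum):
    UNSTRUCTURED = "unstructured"
    STRUCTURED = "structured"


class Workload(Enum):
    OLTP = "oltp"
    ANALYTICS = "analytics"


# Quadrant numbering follows the order used in this post.
QUADRANTS = {
    (Structure.UNSTRUCTURED, Workload.OLTP): 1,
    (Structure.STRUCTURED, Workload.OLTP): 2,
    (Structure.UNSTRUCTURED, Workload.ANALYTICS): 3,
    (Structure.STRUCTURED, Workload.ANALYTICS): 4,
}

# Example products named in the post, keyed by quadrant number.
EXAMPLES = {
    1: ["Apache HBase", "MongoDB", "Cassandra", "Couchbase"],
    2: ["Oracle", "MySQL", "SpliceMachine"],
    3: ["Splunk", "Elasticsearch"],
    4: ["InfiniDB", "Apache Hive", "Presto", "Impala", "Apache Tajo", "Shark"],
}


def quadrant_for(structure, workload):
    """Return the quadrant number for a data/workload combination."""
    return QUADRANTS[(structure, workload)]


def examples_for(structure, workload):
    """Return the example products the post places in that quadrant."""
    return EXAMPLES[quadrant_for(structure, workload)]


if __name__ == "__main__":
    print(quadrant_for(Structure.STRUCTURED, Workload.ANALYTICS))
    print(examples_for(Structure.UNSTRUCTURED, Workload.ANALYTICS))
```

Answering the two questions (what does the data look like, and what does the workload look like) narrows the field to one quadrant before any “fastest” claim needs to be weighed.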
To wrap up, let’s revisit “Fastest SQL-on-Hadoop” in the context of Quadrant 4. I suggest we start with the premise that each solution is probably the “fastest” at something; otherwise, it would have no reason to exist. The better question is, "Are they fast enough for your actual use cases and are the other aspects of the solution (license, price, maturity, robustness, etc.) a match for your business needs?" In broad terms, most of the true “native” Hadoop query engines are very good at ETL-style queries and longer-running “batch” queries. Hybrid solutions like InfiniDB, Impala, Hawq, Actian, etc. are better suited to real-time/interactive queries. Some of the commercial solutions may be very good across different query workloads but carry a high price tag.
Ultimately, the “Art” of Big Data is about marrying different technologies across these quadrants to solve the problem at hand. If we can help you with your Big Data challenges, or with determining whether InfiniDB might be the “Fastest SQL-on-Hadoop” solution for your application, please get in touch with me at email@example.com, or on Twitter @bobwilkinson20.