June 17, 2014 | Dipti Joshi
Getting to Know InfiniDB
If you need analytics insight into huge data sets in real-time, you probably at some point googled “Big Data”, “Columnar Database”, “SQL on Hadoop”, “MySQL Storage Engine”, “Massively Parallel Database” or “Distributed Database”. And it is highly likely that you came across InfiniDB as one of the results. In this post, I will help you get to know InfiniDB and to see how these google keywords apply to InfiniDB.
Before getting into the architecture of InfiniDB, I should mention that InfiniDB is a GPLv2 open source platform delivered as MySQL Storage engine. It’s available on premise, in the cloud, and on Hadoop. Each version of InfiniDB comes with an open-source core and an optional Enterprise Subscription. The InfiniDB Enterprise Subscription includes enhanced server software and enterprise-level SLA support. InfiniDB can run locally on your server farm using local disks, GlusterFS, Hadoop(HDFS) on Linux platforms as well as server farm using Windows platform. RPMs and binary are available at www.infinidb.co/downloads. You can also find community machine Image for InfiniDB on Amazon AMI market place. For those wanting to contribute to InfiniDB core the source code is available at www.github.com/infinidb
Now let us decipher those google keywords as they apply to InfiniDB.
InfiniDB – a Massively Parallel, Distributed Database
InfiniDB consists of two main components – a User Module and a Performance Module. Each installation of InfiniDB can have one or more User Modules and one or more Performance Module. The User Module running on a server acts as a MySQL Front End processing incoming MySQL connection requests and SQL queries on each user connection as shown in Figure below. Once the User Module receives an SQL query, it processes that SQL query and distributes query operations across the Performance Modules. The Performance modules executes the query operations in a distributed manner and reads/writes the InfiniDB Columnar data file on distributed data storage across the Performance Modules, and then returns intermediate query operation results to the User Modules. In this sense, the User Module is the master node and the Performance modules are slave nodes as shown in the figure below.
InfiniDB- a MySQL Storage Engine
InfiniDB is a MySQL storage engine and plugs into MySQL front end using MySQL Storage Engine API. To use InfiniDB as a storage engine for a table, simply use standard “Engine=InfiniDB” when creating table in MySQL as shown below
CREATE TABLE `game_warehouse`.`dim_title` (
Once you declare a table to use InfiniDB as table, the table is stored using InfiniDB’s scalable columnar file format and queried via InfiniDB’s query engine.
InfiniDB – A Columnar Database
InfiniDB’s columnar architecture stores data for a table, column by column. i.e. each column of a table is stored as a separate file as oppose to traditional row based database where all the columns for a row are physically stored together. Such partitioning of data by columns is called vertical partitioning. Such storage allows InfiniDB to touch only those columns for a query that are either requested in projection list or are in filter condition of the query. This obviously gives performance gains to select query performed on large numbers of rows in InfiniDB.
Horizontal partitioning provides even further performance gains to InfiniDB. InfiniDB does horizontal partitioning of data across the Performance Modules by range of rows. When data is loaded from a central location into a User Module, data is equally distributed across Performance Module for fair balancing, and then meta data is stored within in-memory structure as to what range of rows are on which partition and which Performance Module. This allows InfiniDB to further narrow disk block touches when performing filtering operation for a query and thus accelerating query performance.
Both the horizontal and the vertical of partitioning are automatically handled by InfiniDB - a user does not have to specify which columns or row ranges to partition.
InfiniDB – SQL on Hadoop?
When InfiniDB is running on a Hadoop Cluster, the User Module runs on the Hadoop Name Node and the Performance Module runs on the Hadoop Data Node as shown in figure below:
The InfiniDB Performance Module uses libhdfs.so to natively read and write to HDFS without going through map-reduce layer.
InfiniDB – The Big Data Platform
InfiniDB is suited for reporting or analysis of millions-billions of rows from data sets containing millions-trillions of rows. It is not suited for (semi) transactional/OLTP workloads. Nor does not replace ETL or Map/Reduce processing.
So why would one choose InfiniDB? The top three reasons people choose InfiniDB are:
Scalable and Fast
You can find benchmark report from Radiant Advisor showing InfiniDB’s superior performance at http://infinidb.co/system/files/RadiantAdvisors_Benchmark_SQL-on-Hadoop_2014Q1_0.pdf. The summary of the results from this bench mark are shown below
Queries 1 through 4 are reporting type query, queries 5 to 7 are adhoc type queires while queries 8 through 10 are analytic queries. In this benchmark, InfiniDB was 40 to 100 times faster than Hive, 4 to 12 faster than Impala, 10 to 50 times faster than Presto.
Percona has also performed a benchmark for InfiniDB, you can find it at http://www.mysqlperformanceblog.com/2014/06/02/using-infinidb-engine-mysql-hadoop-cluster-data-analytics/
From ground up InfiniDB is simple to deploy and use. Being a MySQL storage engine, there is zero learning curve for MySQL users. DBA and data architect find themselves liberated from having to design indexes, materialized views or manual partitioning - while still delivering the real-time performance that the business units demand.
In my next blog post we will go deeper into InfiniDB’s integration with HDFS and InfiniDB’s columnar data file structure. Meanwhile if you have any more questions or suggestion on InfiniDB, do write to me at email@example.com