June 25, 2014 | Dipti Joshi
InfiniDB on HDFS
In my blog post – “Getting to Know InfiniDB”, I introduced you to InfiniDB. As promised, in this blog post I will take you deeper into InfiniDB’s integration with Hadoop Distributed File System (HDFS) and InfiniDB’s columnar data file structure.
As I mentioned in my last post, InfiniDB is a distributed, massively parallel, columnar SQL database that can run on premise, in the cloud or on a Hadoop cluster. When InfiniDB is running on a Hadoop Cluster, the User Module runs on the Hadoop Name Node, while the Performance Module runs on the Hadoop Data Node.
The User Module running on a server acts as a MySQL Front End processing incoming MySQL connection requests and SQL queries on each user connection. Once the User Module receives an SQL query, it processes the SQL query and distributes query operations across the Performance Modules. The Performance modules executes the query operations in a distributed manner and reads/writes the InfiniDB Columnar data files, and then returns intermediate query operation results to the User Modules. The User Module takes the query operations results from all the Performance Module queries and returns the final results to the user.
How does InfiniDB Query Engine access HDFS?
As we look at the InfiniDB Query Engine’s working with HDFS, let us see where InfiniDB sits in the Hadoop Stack:
Unlike Apache Hive and Apache Pig, InfiniDB is a non-map/reduce engine – that is, InfiniDB does not use the map/reduce framework of Apache Hadoop. Rather, InfiniDB natively interfaces with HDFS via libhdfs.so. In a Map/Reduce based Apache Hadoop application, the name node process (Job Tracker or Resource Manager) dispatches a Map/Reduce Task on data nodes and a data node process (Task Tracker or Node Manager) executes the map tasks on individual data nodes. In the case of InfiniDB, this Map/Reduce framework is not used, as InfiniDB already has its own computational distributed framework that is optimized for a real time analytics query load.
InfiniDB Columnar File
InfiniDB’s columnar architecture stores data for a table, column by column (i.e. each column of a table is stored as a separate file as oppose to traditional row based database where all the columns for a row are physically stored together). Such partitioning of data by columns is called vertical partitioning. Further, InfiniDB does horizontal partitioning of data across Performance Modules by range of rows. Both horizontal and vertical partitioning are automatically handled by InfiniDB. A user does not have to specify which columns or row ranges to partition.
Physical Data Storage
Each column of a table is stored independently in 8M rows logical measure called an extent. An extent is physically stored as collection of contiguous blocks - each 8K bytes - within a segment file. For string columns that are longer than 8 characters, main column file stores store indexes referring to a dictionary file containing the actual values. With respect to the placement of segment files on a disk subsystem, there is a configuration available in InfiniDB to specify the disk resources to be used for InfiniDB storage. InfiniDB spreads all segment files across all specified disk resources to maximize the physical read rate of the data.
A partition is collectively all of the column segment files and dictionary files belonging to an extent for a table. InfiniDB refers to such partitioning as casual partitioning. The partition is stored as hierarchical directory structure organized by segment files.
By default, the data stored in the column files is compressed.
InfiniDB meta store maps file structure/location to DB schema as well as information used for partitioning. The key meta-structure that powers InfiniDB’s performance is the extent map - a catalog of all extents and their corresponding blocks along with the minimum and maximum values for a column’s data within an extent. When performing queries, InfiniDB first eliminates the extents by only taking into consideration only the extents for the column in join and filter conditions. Then, using the minimum and maximum value for the extents for join, columns are filtered further and additional extents eliminated. Let us look at an example:
Eliminating an extent when a column scan involves a filter. That filter value is compared to the minimum and maximum values stored in each extent for the column:
In the above figure, if a WHERE column filter of “COL1 BETWEEN 220 AND 250” is specified, InfiniDB will eliminate extents 1, 2 and 4 from being scanned, saving ¾ of the I/ O and many comparison operations.
If the extent cannot possibly contain any rows that would match the filter, that extent is ignored entirely. Additionally, since each extent eliminated for a column also eliminates the same extents for all the other columns in a table, impossible filter chains can be quickly identified without requiring I/O. For example, take the following two columns and their Extent Map information:
If a column WHERE filter of “COL1 BETWEEN 220 AND 250 AND COL2 < 10000” is specified, InfiniDB will eliminate extents 1, 2 and 4 from the first column filter. Then, looking at just the matching extents for COL2 (i.e. just extent 3), it will determine that no extents match and return zero rows without doing any I/O at all.
There is still a question of resource management when InfiniDB is running on an HDFS stack. As I mentioned earlier, InfiniDB natively accesses HDFS bypassing map reduce layer. Currently, InfiniDB uses cgroup for resource management when deployed in Hadoop environment (YARN support is planned). I will go in further details about cgroup usage in a future post.
In my next blog post we will look at how to load data in and extract data from InfiniDB. Write to me at email@example.com or follow me @dipti_smg for more insights into InfiniDB.