July 15, 2014 | Dipti Joshi
Getting In and Out of InfiniDB
In this series so far, I have introduced you to InfiniDB in “Getting to Know InfiniDB”, and taken you deeper into InfiniDB’s HDFS integration and columnar data files in “InfiniDB on HDFS”. In this blog post, I will show you how to load data into and extract data from InfiniDB.
In any InfiniDB installation, the fastest way to bulk load data into InfiniDB is the cpimport utility. Alternatively, from the SQL interface, I can use "LOAD DATA INFILE" or "INSERT INTO ... SELECT FROM". However, cpimport is 4 to 6 times faster when loading large amounts of data. For example, loading 6 million rows into the lineitem table from TPC-H took 19 seconds with cpimport, compared to 60 seconds with LOAD DATA INFILE and 105 seconds with INSERT INTO ... SELECT FROM. For the same table, a 60 million row load took 142 seconds with cpimport, 581 seconds with LOAD DATA INFILE, and 1015 seconds with INSERT INTO ... SELECT FROM.
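For reference, the two SQL alternatives look roughly like this; a minimal sketch, with hypothetical table and file names, written to a script file that would be run via the mysql client:

```shell
# Sketch: the two slower SQL alternatives to cpimport, saved to a
# script file. Table and file names here are illustrative only.
cat > load_alternatives.sql <<'SQL'
LOAD DATA INFILE '/home/mydata/lineitem.tbl' INTO TABLE lineitem
  FIELDS TERMINATED BY '|';
INSERT INTO lineitem SELECT * FROM lineitem_staging;
SQL
```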
Before using cpimport, I prepare my data in CSV format. If I want to load data from a CSV file, I use this command:
cpimport dbName tblName [loadFile]
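As a minimal sketch, a load file can be prepared like this (table name and values are hypothetical; cpimport's default field delimiter is the pipe character, and -s overrides it):

```shell
# Sketch: prepare a pipe-delimited load file for cpimport. The table
# and its rows are hypothetical; '|' is cpimport's default delimiter.
printf '%s\n' \
  '1|widget|9.99' \
  '2|gadget|4.50' > mytable.tbl

# Then, on a running InfiniDB instance:
# cpimport mydb mytable mytable.tbl
```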
Sometimes the data I want to load into InfiniDB is the output of another process, and I simply pipe it into cpimport via standard input. For example, if I have a MyISAM table in a MySQL database that I want to migrate to InfiniDB, I use the following command:
mysql -e 'select * from source_table;' -N source_db | cpimport dest_db dest_tbl -s '\t'
Single File Central Input: Data Source at UM
In installations where my source data files are on a UM node, I use mode 1 of cpimport. The mode tells cpimport whether the source file resides on the UM node or on the PM nodes:
cpimport -m1 mytest mytable mytable.tbl
Distributed Input: Data Source at PMs
If I have partitioned load files, in which my data has been divided into n files where n is the number of PM nodes, I use cpimport in mode 2 or mode 3.
In mode 2, the cpimport command is executed on the UM node, but the source data files are on the PM nodes:
cpimport -m2 testdb mytable /home/mydata/mytable.tbl
Mode 3 allows the cpimport command to be run on individual PM nodes:
cpimport -m3 testdb mytable /home/mydata/mytable.tbl
Irrespective of the cpimport mode, I can continue to run read queries against the table(s) being loaded. Notice I mentioned table(s): I can load multiple tables concurrently by launching multiple cpimport jobs at the same time. Customers especially like mode 3 because they can load data on each PM node independently, and if cpimport fails on one PM node due to data format errors, it does not affect the cpimport processes on the other PM nodes.
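Launching several concurrent loads can be sketched with ordinary shell job control. In this sketch a stand-in function replaces the real cpimport call so it runs without an InfiniDB install; table names are hypothetical:

```shell
# Sketch: one load job per table, launched in parallel with '&' and
# collected with 'wait'. load_table is a stand-in for the real call:
#   cpimport mydb "$1" "$1.tbl"
load_table() {
  echo "loading $1"
}
for t in orders lineitem customer; do
  load_table "$t" > "$t.load.log" &
done
wait
```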
Binary Source File
In addition to text CSV files, cpimport can also import a binary data file of fixed-length records with no column or line delimiters. Customers that produce binary log files or binary device statistics from network devices typically use this mode. An example of loading a binary file:
cpimport -I1 mydb mytable sourcefile.bin
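To illustrate the file shape, here is a minimal sketch that builds a fixed-length record file for a hypothetical table with two CHAR columns (for CHAR columns the fixed-width bytes themselves are the record layout; numeric columns would instead need their native binary representation):

```shell
# Sketch: fixed-length records with no delimiters, for a hypothetical
# table (col1 CHAR(6), col2 CHAR(10)). Each record is exactly 16
# bytes: a zero-padded id followed by a space-padded name.
{
  printf '%06d%-10s' 1 widget
  printf '%06d%-10s' 2 gadget
} > sourcefile.bin

# Then: cpimport -I1 mydb mytable sourcefile.bin
```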
To export bulk data out of InfiniDB, I use a MySQL, ODBC, or JDBC client to execute queries against an InfiniDB database instance. When the data set to export is not too large and export speed is not critical, I do a central export at the UM node: I make an SQL connection to the mysql server on the UM node and write the query results to an output file there. But if I want high performance, I do a distributed extract using InfiniDB's LOCAL PM query feature to extract data at each individual PM node. This way I can extract data in parallel on all PM nodes, and if an ETL process then transforms the extracted data, it can work on each PM node's data set in parallel.
To enable the LOCAL PM Query feature when installing InfiniDB, I make sure that I answer "yes" to "Do you want to install UM on each PM node?" in the postConfigure step. Doing so enables the PM node to perform direct SQL queries on the data that is local to the particular PM node. Now, for a single table query, I simply export the query result to a file like this:
idbmysql -e 'SELECT * from tbl1' -N sourcedb > outputfile
If I have to extract results of joining a fact table and a dimension table, I pass the following SQL statements to the SQL connection:
select fact.column, dim.column2 from fact join dim using (key) where idbPm(fact.key) = idbLocalPm();
The idbPm(fact.key) = idbLocalPm() predicate restricts the fact table to the rows stored on the local PM node, while still joining them against all rows of the dimension table across the PMs, so each PM produces a correct partial result. Since dimension tables are typically much smaller than fact tables, this still performs much better than running the entire query centrally at the UM node.
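The whole distributed extract can be sketched as one parallel fan-out followed by a concatenation. Hostnames are hypothetical, and a stand-in function replaces the real per-PM idbmysql invocation so the sketch runs anywhere:

```shell
# Sketch: run the LOCAL PM extract on every PM node in parallel, then
# concatenate the per-node slices. extract_slice is a stand-in for:
#   ssh "$1" "idbmysql -e \"$QUERY\" -N sourcedb"
extract_slice() {
  echo "rows from $1"
}
for pm in pm1 pm2 pm3; do
  extract_slice "$pm" > "slice_$pm.out" &
done
wait
cat slice_pm*.out > full_extract.out
```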
In my next blog post, we will see how Apache Sqoop, integrated with cpimport and the parallel data extract method described in this post, can be used to load and extract data in a Hadoop environment. Write to me at email@example.com or follow me @dipti_smg for more insights into InfiniDB.