ETL delete before insert

7 posts / 0 new
Last post
joegh
joegh's picture
Offline
Last seen: 1 month 1 week ago
Joined: Jan 17 2014
Junior Boarder

Posts: 10

Joe Wu
ETL delete before insert

Hello,

As a ETL process, we always do data deleting before incremental load to avoid data duplication. e.g. The previous loading contains error records or incomplete. if we load data for 2014-01-16, we may first delete from table where date='2014-01-16', then load the data again.

If the partition can be create as demand, we can partition data by date and drop special partition. But InfiniDB partition column automatically and can't be controlled. Data from same day may divide to 2-3 partition or some section in one partition and others in another partition. So it can't be deleted by dropping partitions.

Is there any efficient way to solve such problem. Thank you!

ads2people
ads2people's picture
Offline
Last seen: 2 days 1 hour ago
Joined: Jan 18 2014
Junior Boarder

Posts: 2

Marcel Ads2people
Re: ETL delete before insert

We had so much problems with deletes on infinidb.

The solution: DO NOT DELETE !

Better create new temp_table. Import the fresh data with cpimport then truncate the old table and rename the temp table. Thats how infinidb is working without any problem. Forget partitions.. forget delete small amounts and so on. At some point your infinidb will get into trouble because of these massive DML.. delete operations.

just my 2 cents ;)

Cheers
Marcel

joegh
joegh's picture
Offline
Last seen: 1 month 1 week ago
Joined: Jan 17 2014
Junior Boarder

Posts: 10

Joe Wu
Re: ETL delete before insert

Hi Marcel,

Thank you for your reply.

But the problem is we store data in a big table or fact table, including historical data. That means the table contains data for many months even more than years. It can not be truncated because historical data is big and impossible to be loaded again from source. :(

If we create new table for everyday's data, it's hard to query data among multi days.

So we are looking for other solutions for this problem.

Thank you for your reminding about the defect of delete operations. We will try to avoid massive DML operations. :)

ads2people
ads2people's picture
Offline
Last seen: 2 days 1 hour ago
Joined: Jan 18 2014
Junior Boarder

Posts: 2

Marcel Ads2people
Re: ETL delete before insert

We had similar problems like you described:

We have to replace/refresh statitic data every day for 90 days.. So we created a separate table for these 90 days which we truncate every day like i described. Even for our biggest table (800 and more days) we sometimes need to delete data. in this case we export ONLY THE DATA we need to a file (CSV) , truncate the big table and create a new table with the new data. We do this also once a day.

for the big table (export is about 13 GB or so) it does not take longer than 30min on single server installation.

Tip: Use to combine the 2 tables.. Its fast !

joegh
joegh's picture
Offline
Last seen: 1 month 1 week ago
Joined: Jan 17 2014
Junior Boarder

Posts: 10

Joe Wu
Re: ETL delete before insert

Hi Marcel,

It may solve the problem when the big table is less than 100G. But if the data up to TBs, it hard to do so. It will cost lots of hours to export and load everyday.

So if there is any more efficient ways?

jmagana
jmagana's picture
Offline
Last seen: 1 month 1 week ago
Joined: Apr 3 2013
Junior Boarder

Posts: 5

JOSHUA MAGANA , BI Architect
Gender: Male
Re: ETL delete before insert

I have gotten somewhat creative in my solution to this issue. I have a "Delta" and "Delete" table, which is just an ID column. Let me start by stating that I use Talend to perform the following task. I lookup and insert the surrogate key for all the delta records, which is the updates, and inserts and put the surrogate keys into the "Delta" table. I then
INSERT INTO DELTA_TABLE "Delta" Keys
SELECT COUNT(*) FROM DELTA JOIN DIM > IF COUNT(*) > 0
DO
INSERT INTO DELETE_TABLE SELECT KEY FROM DELTA_TABLE ORDER BY KEY LIMIT N;
DELETE FROM DIM WHERE KEY in (SELECT KEY FROM DELETE_TABLE);
DELETE FROM DELTA_TABLE WHERE KEY IN (SELECT KEY FROM DELETE_TABLE);
TRUNCATE DELETE_TABLE;
SELECT COUNT(*) FROM DELTA_TABLE JOIN DIM ON KEY > IF COUNT(*) > 0 LOOP AGAIN

Just shower, rinse, repeat until DIM table has all the delta records expunged and then do a cpimport of deltas from flat file.

joegh
joegh's picture
Offline
Last seen: 1 month 1 week ago
Joined: Jan 17 2014
Junior Boarder

Posts: 10

Joe Wu
Re: ETL delete before insert

I have got a nice solution.

Befor import data to table, disable all partitions using the function of calDisablePartitionsByValue. Then drop the partition if there is same range data already imported to insure data duplication will not occur. Using cpimport to load data to table, and that will create new partitions and make all data imported at this time in fixed partitions.

After data import finished, make all partitions enable using the function of calEnablePartitionsByValue. And it's all done.

The only problem to use this method is when you loading data, the table can not be accessed. But if it's a very short time, the problem can be ignored. ;)