To Normalize or not?

srkiNZ84 (Srdan Dukic)
To Normalize or not?

Hi,

Currently I am working with a very large table of approx. 300M rows in SQL Server which is constantly being updated (the table is used to record log entries). The table schema is quite inefficient/inelegant in my opinion, as it is a very flat structure consisting of about 15 columns, most of which are of type VARCHAR(255) NULL. The table also contains an extremely large amount of duplicate data.

As a part of setting up a BI/Analysis tool, I would like to do two things (in no particular order):

1. Import the data into Calpont's InfiniDB

2. Take this database and normalize it, turning as many of the VARCHAR columns as possible into INTs acting as foreign keys to tables of distinct values.

Essentially what this step requires is a 'SELECT DISTINCT [column name] INTO [new table]', where [new table] has an auto-increment primary key column. The transformation should then add a foreign key column to the original table pointing to the value in [new table], and finally drop the [column name] column.
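
Roughly, for a single column, I picture the transformation looking something like the sketch below (table and column names are just placeholders; I'm using a log_entries table with a source_ip column as an example):

[code]
-- 1. Create the lookup table with an auto-increment primary key
--    and fill it with the distinct values.
CREATE TABLE dim_source_ip (
    source_ip_id INT IDENTITY(1,1) PRIMARY KEY,
    source_ip    VARCHAR(255) NOT NULL
);

INSERT INTO dim_source_ip (source_ip)
SELECT DISTINCT source_ip
FROM   log_entries
WHERE  source_ip IS NOT NULL;       -- NULLs would need separate handling

-- 2. Add the foreign key column to the original table and populate it.
ALTER TABLE log_entries ADD source_ip_id INT NULL;

UPDATE l
SET    l.source_ip_id = d.source_ip_id
FROM   log_entries AS l
JOIN   dim_source_ip AS d ON d.source_ip = l.source_ip;

-- 3. Drop the original VARCHAR column once the new key is verified.
ALTER TABLE log_entries DROP COLUMN source_ip;
[/code]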

I've been told on a [url=http://forums.pentaho.org/showthread.php?p=232920#post232920]separate forum[/url] that there is no point in normalizing the data, and that restructuring the schema this way might even degrade performance in InfiniDB. Is this true? Am I better off keeping the data as one large table of VARCHAR values?

Thank you

davidhill (David Hill)
Re: To Normalize or not?

Thanks for checking out InfiniDB; we appreciate any feedback you have.

We are looking into your request and should have a response posted within the next day.

Thanks, David

jtommaney (Jim Tommaney)
Re: To Normalize or not?

First of all thanks for the good questions.

There are a couple of things in place here that may help guide your answer with Calpont’s InfiniDB engine.

First, a brief description of our dictionary handling, i.e. strings longer than 8 bytes. We will actually store those in a separate structure and store an 8-byte key pointing to the strings. So, in a way the normalization you describe is already happening under the covers without you having to create separate structures. There can be some places where you may know there are only a small number of possible strings, and you could implement a normalization using a 2 byte or a 1 byte key that would save space vs. our basic implementation.
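
For example (a hypothetical sketch, not a recommendation for your actual schema), a column known to hold only a handful of distinct long strings could be keyed with a 1-byte TINYINT instead of relying on the built-in 8-byte dictionary token:

[code]
-- Hypothetical example: log_source has only a handful of distinct long strings,
-- so a 1-byte TINYINT key is smaller than the 8-byte dictionary token that would
-- otherwise be stored for each long VARCHAR value.
CREATE TABLE dim_log_source (
    log_source_id TINYINT,
    log_source    VARCHAR(255)
) ENGINE=infinidb;

CREATE TABLE log_entries (
    log_time      DATETIME,
    log_source_id TINYINT,        -- replaces the original VARCHAR(255) column
    message       VARCHAR(255)
) ENGINE=infinidb;
[/code]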

The whole question about flat vs. normalized is complex, but is based on two features:

1. We do have multi-threaded and optionally distributed scan operations against the fact table that allow us to churn through even de-normalized data very quickly. The column storage works well with these operations by significantly reducing I/O for many analytic queries. So we can run well against the flat data.

2. However, we also have multi-threaded and optionally distributed hash-join operations that allow for joining to the fact table very quickly. In addition, we handle a large number of joins very well:

Join rate trended through 20 dimensions:
http://www.infinidb.org/myblog-admin/profiling-infinidb-multi-join.html

Star Schema Benchmark:
http://www.infinidb.org/myblog-admin/mysql-parallel-query-processing-of-...

Are we generally faster if you don’t have to join? Yes, but that isn’t always the complete picture. If you save enough bytes in storage, then InfiniDB can be faster when joining. In addition, saving bytes may allow more queries to be satisfied from cache, gaining overall acceleration.

So, I would use these guidelines, in this general order:

1) Tight data type declarations. Because we don’t (yet) have compression for non-dictionary data types, if you declare a BIGINT when you only need an INT, you will be storing and moving around twice as much data as needed.
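
For instance (hypothetical columns, purely illustrative), the tighter declarations below halve or better the bytes stored and moved per value compared with declaring everything as BIGINT:

[code]
-- Hypothetical fact table: each column sized to the smallest type that fits the data.
CREATE TABLE fact_log (
    event_id  INT,       -- 4 bytes instead of BIGINT's 8
    status_cd SMALLINT,  -- 2 bytes where a small code range is enough
    flag_cd   TINYINT    -- 1 byte for yes/no style flags
) ENGINE=infinidb;
[/code]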

2) Use codes and flags extensively. If you can replace ‘Promotional Code 001’ - a varchar(40) - with the value ‘P001’ - a varchar(4) - there can be significant space and performance benefits. And the 4-byte string is probably faster as a fact table column. So any ETL transform process that allows for replacing a long string with a short code is worth considering.
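
As an illustration (table and column names here are made up), such a transform might derive the short code while loading a staging table, so that only the varchar(4) value ever reaches the fact table:

[code]
-- Hypothetical ETL step: map the long description to a short code during the load,
-- so the fact table only carries the 4-character code.
INSERT INTO fact_sales (sale_id, promo_cd)
SELECT s.sale_id,
       CASE s.promo_desc
           WHEN 'Promotional Code 001' THEN 'P001'
           WHEN 'Promotional Code 002' THEN 'P002'
           ELSE 'P000'                        -- unknown / no promotion
       END AS promo_cd
FROM   staging_sales AS s;
[/code]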

srkiNZ84 (Srdan Dukic)
Re: To Normalize or not?

Thank you for the reply. I didn't realise that InfiniDB treated long strings that way. From your description, I'm guessing that it doesn't do de-duplication of values, i.e. if you have two strings with the same value, they will be stored separately and referenced by two different 8-byte keys (as opposed to both being referenced by the exact same 8-byte key)?

I will try to get both types of schema (normalized and flat table) set up and benchmark them to see what kind of performance difference there is, if any.

My feeling is that the normalized dataset will most likely outperform the flat table, because I'm testing this out on a spare development machine with a small amount of RAM, a slow hard disk, but a fast processor. My hope is to get the foreign key tables small enough that they fit into memory (I assume that the fact table is kept in memory?).

The fact that InfiniDB doesn't support auto-incrementing columns is a bit of a pain, as I'm probably going to have to import and transform the data into an InnoDB table to generate the keys, and then do a straight copy of the data into InfiniDB.
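
Roughly, I'm picturing something like the following (table names are placeholders): generate the surrogate keys in an InnoDB AUTO_INCREMENT table, then copy the keyed rows into InfiniDB:

[code]
-- 1. Generate surrogate keys in InnoDB, which supports AUTO_INCREMENT.
CREATE TABLE dim_hostname_keys (
    hostname_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    hostname    VARCHAR(255) NOT NULL
) ENGINE=InnoDB;

INSERT INTO dim_hostname_keys (hostname)
SELECT DISTINCT hostname FROM staging_log;

-- 2. Copy the keyed rows straight into the InfiniDB dimension table.
CREATE TABLE dim_hostname (
    hostname_id INT,
    hostname    VARCHAR(255)
) ENGINE=infinidb;

INSERT INTO dim_hostname
SELECT hostname_id, hostname FROM dim_hostname_keys;
[/code]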

Will post the results of my benchmarks once I have them.

pedrorjbr (Pedro Magalhaes)
Re: To Normalize or not?

Do you have any answer about your first question?
"From your description, I'm guessing that it doesn't do de-duplication of values i.e. if you have two strings with the same value they will point to two different 8-byte keys, whose values are the same?"

I would like to know too.

Thanks

wweeks (Walt Weeks)
Re: To Normalize or not?

Hi Pedro,

InfiniDB will reuse the same string values for columns that have few unique values, provided you use the cpimport program for your loads. The strings will be reused as long as all of the values fit into a single 8K block. Each row will get its own string once the values exceed a single block. There's a little more to it, as a column is broken into multiple store files and the rule above applies to each store file.

INSERT and LOAD DATA INFILE do not currently reuse the string values. We have an internal enhancement request open to reuse strings for those as well. Using cpimport is the way to go if your tables have millions of rows, as it is much faster than LOAD DATA INFILE.

Thanks very much for posting and we appreciate your checking out InfiniDB.

Walt Weeks

sanjv (Sanjay Vekaria)
Re: To Normalize or not?

[b]jtommaney wrote:[/b]

Generally very long strings with few distinct values will benefit from normalization, especially when a smaller key can be used. However, these may also be good candidates for replacement with shorter codes.

What is a very long string?

In my fact table, I am currently holding a 4-digit id for "person", and then have a separate PERSON table with id, person_name, city, state, country. person_name is a varchar of about 30 characters max; city and country are varchar of about 20 max. There are only about 400 entries in this table and it is unlikely to ever grow above 1000.
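
For reference, the split we have looks roughly like this (column types are approximate):

[code]
-- Roughly our current layout (types approximate); the fact table carries only person_id.
CREATE TABLE person (
    person_id   SMALLINT,      -- the 4-digit id; ~400 rows today, unlikely to pass 1000
    person_name VARCHAR(30),
    city        VARCHAR(20),
    state       VARCHAR(20),
    country     VARCHAR(20)
) ENGINE=infinidb;
[/code]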

Would you recommend just holding these items in my fact table, or are we doing the right thing by splitting this way?

davestokes (David Stokes)
Re: To Normalize or not?

What is a very long string? Well, earlier in this thread you probably read:

[i]First, a brief description of our dictionary handling, i.e. strings longer than 8 bytes. We will actually store those in a separate structure and store an 8-byte key pointing to the strings. So, in a way the normalization you describe is already happening under the covers without you having to create separate structures. There can be some places where you may know there are only a small number of possible strings, and you could implement a normalization using a 2 byte or a 1 byte key that would save space vs. our basic implementation. [/i]

My [u]personal[/u] preference would be to keep the data in your fact table, especially since it is not going to grow too large in total number of records. This simplifies the overall design and will help others work with the schema without having to join several tables to get a name from an id number. But if the number of records starts to grow dramatically, or you start to store many strings longer than 8 bytes, you may want to revisit this decision.

qerub (Christoffer Sawicki)
Re: To Normalize or not?

"wweeks" post=660 wrote:
Insert and load data infile do not currently reuse the string values. We have an internal future enhancement open to reuse strings for those. Using cpimport is the way to go if you are tables with millions as it is much faster than load data infile.

What's the status of this enhancement?

radams (Robert Adams)
Re: To Normalize or not?

Hi,

The enhancement was completed and strings are reused until they exceed an 8K block.

cpimport is still much faster than load data infile.

Thanks,

Robert
Calpont

sanjv (Sanjay Vekaria)
Re: To Normalize or not?

Hi - just to support Robert's post: we used to use LOAD DATA INFILE because it seemed the easier option, since we had to run several text manipulation scripts on the data before we uploaded it.

We persisted with cpimport and, using awk commands, are now able to do the same thing MUCH faster.

I really recommend persisting with cpimport for a strong long-term uploading solution.