insert only unique values into a table, ignore rest?

Started by George Nychis · about 19 years ago · 8 messages · general
#1George Nychis
gnychis@cmu.edu

Hi,

I have approximately 2 billion data entries that I would like to insert into a database.
Each entry consists of:
INT BOOLEAN INT BOOLEAN

I want to populate a table such that it only contains the unique rows, all other data
should be thrown out. I would say a significant amount of the insertions are going to
fail due to unique constraints. The unique constraint is on the two integers, not on the
booleans.

Using mysql, I was able to do this with the following query, for all data files (25574
data files total):

mysql -e "use connectivity; LOAD DATA INFILE '/path/to/file' IGNORE INTO TABLE ipconn
FIELDS TERMINATED BY ' ';"

What I *think* mysql did was sort each data file and do a sort of merge sort between the
data I was inserting and the data in the database. It would insert the first unique
instance of a row it saw, and reject all other insertions that violated the unique
constraint due to the "IGNORE".

From what I understand, this functionality is not in postgresql. Fine, I certainly can't
change that. But I am looking for a comparable solution for the size of my data.

One solution is to have a temporary table, insert all 2 billion rows, and then copy the
distinct entries to another table. This would be like one massive sort?

Is this the only/best solution using postgresql?

Thanks!
George

#2Scott Marlowe
smarlowe@g2switchworks.com
In reply to: George Nychis (#1)
Re: insert only unique values into a table, ignore rest?

On Mon, 2007-01-08 at 14:58, George Nychis wrote:

> Hi,
>
> I have approximately 2 billion data entries that I would like to insert into a database.
> Each entry consists of:
> INT BOOLEAN INT BOOLEAN
>
> I want to populate a table such that it only contains the unique rows, all other data
> should be thrown out. I would say a significant amount of the insertions are going to
> fail due to unique constraints. The unique constraint is on the two integers, not on the
> booleans.
>
> Using mysql, I was able to do this with the following query, for all data files (25574
> data files total):
> mysql -e "use connectivity; LOAD DATA INFILE '/path/to/file' IGNORE INTO TABLE ipconn
> FIELDS TERMINATED BY ' ';"

A quick question. Could you run selects or other inserts on that table
while the load data infile was running? Cause I'm guessing that it
basically locked the whole table while running.

> What I *think* mysql did was sort each data file and do a sort of merge sort between the
> data I was inserting and the data in the database. It would insert the first unique
> instance of a row it saw, and reject all other insertions that violated the unique
> constraint due to the "IGNORE".

Me too. Which would require "one big lock" on the table, which would
mean no parallel access.

It's also likely that it used a temp table, which doubled the size of the
database while you were inserting.

> From what I understand, this functionality is not in postgresql. Fine, I certainly can't
> change that. But I am looking for a comparable solution for the size of my data.
>
> One solution is to have a temporary table, insert all 2 billion rows, and then copy the
> distinct entries to another table. This would be like one massive sort?
>
> Is this the only/best solution using postgresql?

TANSTAAFL. PostgreSQL is designed so that you can run an import process
on that table while 100 other users still access it at the same time.
Because of that, you don't get to do dirty, nasty things under the
sheets that allow for super easy data loading and merging like you got
with MySQL. Apples and Oranges.

Assuming you're loading into an empty table, the load to temp, select
distinct out and into the final table seems reasonable, and should run
reasonably fast. If you need to load to an existing table, it might get
a little more complex.
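The load-to-temp-then-select-distinct approach might be sketched as follows. This is only an illustration: the staging-table and column names (ipconn_staging, src, src_ok, dst, dst_ok) are hypothetical, since the real schema isn't shown in the thread.

```sql
-- Staging table with no constraints, so every row loads without error.
-- (Hypothetical column names; the real schema isn't shown in the thread.)
CREATE TEMP TABLE ipconn_staging (
    src    integer,
    src_ok boolean,
    dst    integer,
    dst_ok boolean
);

-- Bulk-load each data file; COPY is far faster than row-by-row INSERTs.
COPY ipconn_staging FROM '/path/to/file' WITH DELIMITER AS ' ';

-- Keep one row per (src, dst) pair. DISTINCT ON keeps an arbitrary first
-- row's boolean values, which is comparable to MySQL's IGNORE keeping
-- the first row seen and discarding later duplicates.
INSERT INTO ipconn (src, src_ok, dst, dst_ok)
SELECT DISTINCT ON (src, dst) src, src_ok, dst, dst_ok
FROM ipconn_staging;
```

Note that DISTINCT ON is a PostgreSQL extension; a plain SELECT DISTINCT over all four columns would instead keep rows that differ only in the booleans, which is not what the unique constraint here calls for.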

#3George Nychis
gnychis@cmu.edu
In reply to: Scott Marlowe (#2)
Re: insert only unique values into a table, ignore rest?

Scott Marlowe wrote:

> On Mon, 2007-01-08 at 14:58, George Nychis wrote:
>
>> SNIP
>
> A quick question. Could you run selects or other inserts on that table
> while the load data infile was running? Cause I'm guessing that it
> basically locked the whole table while running.

What does this have to do with my question? I don't need to run selects or inserts on the
table while the load data is running...

>> What I *think* mysql did was sort each data file and do a sort of merge sort between the
>> data I was inserting and the data in the database. It would insert the first unique
>> instance of a row it saw, and reject all other insertions that violated the unique
>> constraint due to the "IGNORE".
>
> Me too. Which would require "one big lock" on the table which would
> mean no parallel access.

That's fine, it doesn't matter.

> It's also likely that it used a temp table which doubled the size of the
> database while you were inserting.
>
>> One solution is to have a temporary table, insert all 2 billion rows, and then copy the
>> distinct entries to another table. This would be like one massive sort?
>>
>> Is this the only/best solution using postgresql?
>
> TANSTAAFL. PostgreSQL is designed so that you can run an import process
> on that table while 100 other users still access it at the same time.
> Because of that, you don't get to do dirty, nasty things under the
> sheets that allow for super easy data loading and merging like you got
> with MySQL. Apples and Oranges.
>
> Assuming you're loading into an empty table, the load to temp, select
> distinct out and into the final table seems reasonable, should run
> reasonably fast. If you need to load to an existing table, it might get
> a little more complex.

The goal is not to run queries while the data is being inserted... I am wondering if the
postgresql method I have mentioned to actually insert and get only distinct values is the
most optimal, and whether it would produce the same results as the method I explained in mysql.

#4Scott Marlowe
smarlowe@g2switchworks.com
In reply to: George Nychis (#3)
Re: insert only unique values into a table, ignore rest?

On Mon, 2007-01-08 at 15:52, George Nychis wrote:

> Scott Marlowe wrote:
>
>> SNIP
>
> What does this have to do with my question? I don't need to run selects or inserts on the
> table while the load data is running...
>
>> SNIP
>>
>> Assuming you're loading into an empty table, the load to temp, select
>> distinct out and into the final table seems reasonable, should run
>> reasonably fast. If you need to load to an existing table, it might get
>> a little more complex.
>
> The goal is not to run queries while the data is being inserted... I am wondering if the
> postgresql method I have mentioned to actually insert and get only distinct values is the
> most optimal, and whether it would produce the same results as the method I explained in mysql.

Did I fail to answer your question?

Sorry if I gave you more information than you needed. Please feel free
to ask someone else next time.

#5George Nychis
gnychis@cmu.edu
In reply to: Scott Marlowe (#4)
Re: insert only unique values into a table, ignore rest?

Scott Marlowe wrote:

> On Mon, 2007-01-08 at 15:52, George Nychis wrote:
>
>> Scott Marlowe wrote:
>>
>>> SNIP
>>>
>>> Assuming you're loading into an empty table, the load to temp, select
>>> distinct out and into the final table seems reasonable, should run
>>> reasonably fast. If you need to load to an existing table, it might get
>>> a little more complex.
>>
>> The goal is not to run queries while the data is being inserted... I am wondering if the
>> postgresql method I have mentioned to actually insert and get only distinct values is the
>> most optimal, and whether it would produce the same results as the method I explained in mysql.
>
> Did I fail to answer your question?
>
> Sorry if I gave you more information than you needed. Please feel free
> to ask someone else next time.

ahhh i missed your last paragraph... so much text. Actually yeah that answers my
question, thank you. I guess it's more a single run-through than the mysql method, which
was piece-wise.

Thanks for the help/response.

- George

#6Scott Marlowe
smarlowe@g2switchworks.com
In reply to: George Nychis (#5)
Re: insert only unique values into a table, ignore rest?

On Mon, 2007-01-08 at 15:59, George Nychis wrote:

> Scott Marlowe wrote:
>
>> SNIP
>>
>> Assuming you're loading into an empty table, the load to temp, select
>> distinct out and into the final table seems reasonable, should run
>> reasonably fast. If you need to load to an existing table, it might get
>> a little more complex.
>
> ahhh i missed your last paragraph... so much text. Actually yeah that answers my
> question, thank you. I guess it's more a single run-through than the mysql method, which
> was piece-wise.

Note that things will go faster if you do your initial data load using
"copy from stdin" for the initial bulk data load. Individual inserts in
postgresql are quite costly compared to mysql; it's the transactional
overhead. By grouping them together you can make things much faster.
copy from stdin does all the inserts in one big transaction.

If you use insert statements, wrap them in a begin; end; pair to make
them be one transaction. Not as fast as copy, due to parsing, but still
much faster than individual transactions.
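A sketch of the two options just described, assuming the same hypothetical ipconn_staging table as before:

```sql
-- Fastest: load each file with COPY, which runs as a single transaction.
-- From psql, \copy reads the file on the client side:
--   \copy ipconn_staging FROM '/path/to/file' WITH DELIMITER AS ' '

-- If you must use INSERT statements, batch them inside one transaction
-- so the commit overhead is paid once per batch rather than once per row.
-- (The values here are arbitrary sample data.)
BEGIN;
INSERT INTO ipconn_staging VALUES (167772161, true, 167772162, false);
INSERT INTO ipconn_staging VALUES (167772161, true, 167772163, true);
-- ... thousands more rows per batch ...
COMMIT;
```

The batching matters because each commit forces the write-ahead log to disk; with one transaction per row, that sync cost dominates the load time.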

#7Jeremy Haile
jhaile@fastmail.fm
In reply to: Scott Marlowe (#6)
Re: insert only unique values into a table, ignore rest?

> Note that things will go faster if you do your initial data load using
> "copy from stdin" for the initial bulk data load. individual inserts in
> postgresql are quite costly compared to mysql. It's the transactional
> overhead. by grouping them together you can make things much faster.
> copy from stdin does all the inserts in one big transaction.

You could do "copy from file" as well right? (no performance difference
compared to "copy from stdin") I do this all the time.

Also - maybe I misunderstand something, but why does PostgreSQL's
implementation prohibit it from ignoring insert errors during a copy?
If you added a unique constraint to the table before copying, PostgreSQL
would generate errors due to the unique constraint violation - so I
don't think any additional locking would be required for it to simply
say "If there is an error while copying in, ignore it and continue
inserting other rows"

PostgreSQL's copy command doesn't currently support this, so the temp
table followed by a distinct select is the way to go. But I didn't
follow all of the talk about it requiring locking the table and being
inherently impossible for PostgreSQL to support.

I've wanted a similar feature. I select rows into a table on a regular
basis. I'd like to be able to overlap old values and have PostgreSQL
ignore failed inserts. SQL Server offers a flag that allows you to
ignore inserts whose primary key already exists in the table. The only
solution in PostgreSQL is to run a query to manually delete the
duplicate rows from a temp table before inserting - which takes much
more time.
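The manual-delete workaround described in the last paragraph can also be folded into the insert itself with an anti-join, so duplicates are skipped rather than deleted first. As before, the table and column names are hypothetical. (For what it's worth, PostgreSQL releases from 9.5 onward added INSERT ... ON CONFLICT DO NOTHING, which provides exactly this ignore-on-duplicate behavior directly.)

```sql
-- Insert only rows whose (src, dst) key is not already in the target
-- table; duplicates inside the staging table itself are collapsed by
-- DISTINCT ON. Table and column names are hypothetical.
INSERT INTO ipconn (src, src_ok, dst, dst_ok)
SELECT DISTINCT ON (s.src, s.dst) s.src, s.src_ok, s.dst, s.dst_ok
FROM ipconn_staging s
WHERE NOT EXISTS (
    SELECT 1
    FROM ipconn t
    WHERE t.src = s.src
      AND t.dst = s.dst
);
```

Unlike row-by-row inserts that trip the unique constraint, this runs as one set-oriented statement, so nothing has to fail and be ignored.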

#8George Nychis
gnychis@cmu.edu
In reply to: Jeremy Haile (#7)
Re: insert only unique values into a table, ignore rest?

Jeremy Haile wrote:

>> SNIP
>
> You could do "copy from file" as well right? (no performance difference
> compared to "copy from stdin") I do this all the time.
>
> Also - maybe I misunderstand something, but why does PostgreSQL's
> implementation prohibit it from ignoring insert errors during a copy?
> If you added a unique constraint to the table before copying, PostgreSQL
> would generate errors due to the unique constraint violation - so I
> don't think any additional locking would be required for it to simply
> say "If there is an error while copying in, ignore it and continue
> inserting other rows"
>
> PostgreSQL's copy command doesn't currently support this, so the temp
> table followed by a distinct select is the way to go. But I didn't
> follow all of the talk about it requiring locking the table and being
> inherently impossible for PostgreSQL to support.
>
> I've wanted a similar feature. I select rows into a table on a regular
> basis. I'd like to be able to overlap old values and have PostgreSQL
> ignore failed inserts. SQL Server offers a flag that allows you to
> ignore inserts whose primary key already exists in the table. The only
> solution in PostgreSQL is to run a query to manually delete the
> duplicate rows from a temp table before inserting - which takes much
> more time.

I would also like this feature... :) (obviously)

I also didn't exactly follow the locking discussion; I don't need it as far as I know.

- George