Bulk Inserts

Started by Souvik Bhattacherjee, over 6 years ago · 11 messages · general
#1 Souvik Bhattacherjee
kivuosb@gmail.com

Hi,

I'm trying to measure the performance of the following: Multiple txns
inserting tuples into a table concurrently vs single txn doing the whole
insertion.

*new table created as:*
create table tab2 (
id serial,
attr1 integer not null,
attr2 integer not null,
primary key(id)
);

*EXP 1: inserts with multiple txn:*
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where attr2
= 10);
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where attr2
= 20);

note: attr2 has only two values 10 and 20

*EXP 2: inserts with a single txn:*
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1);

I also performed another experiment as follows:
*EXP 3:* select attr1, attr2 into tab2 from tab1;

The observation here is that EXP 3 is much faster than EXP 2, probably due to
bulk inserts used by Postgres. However, I could not find a way to insert id
values into tab2 using EXP 3. Also, select .. into .. from .. throws an error
if we create the table first and then populate the tuples using the command.

I have the following questions:
1. Is it possible to have an id column in tab2 and perform a bulk insert
using select .. into .. from .. or using some other means?
2. If a table is already created, is it possible to do bulk inserts via
multiple txns inserting into the same table (EXP 3)?

Best,
-SB

#2 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Souvik Bhattacherjee (#1)
Re: Bulk Inserts

On 8/9/19 3:06 PM, Souvik Bhattacherjee wrote:

Hi,

I'm trying to measure the performance of the following: Multiple txns
inserting tuples into a table concurrently vs single txn doing the whole
insertion.

*new table created as:*
create table tab2 (
id serial,
attr1 integer not null,
attr2 integer not null,
primary key(id)
);

*EXP 1: inserts with multiple txn:*
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
attr2 = 10);
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
attr2 = 20);

note: attr2 has only two values 10 and 20

*EXP 2: inserts with a single txn:*
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1);

I also performed another experiment as follows:
*EXP 3:* select attr1, attr2 into tab2 from tab1;

The observation here is that EXP 3 is much faster than EXP 2, probably due to
bulk inserts used by Postgres. However, I could not find a way to insert
id values into tab2 using EXP 3. Also, select .. into .. from .. throws an
error if we create the table first and then populate the tuples using the
command.

Yes, as SELECT INTO is functionally the same as CREATE TABLE AS:

https://www.postgresql.org/docs/11/sql-selectinto.html

I have the following questions:
1. Is it possible to have an id column in tab2 and perform a bulk insert
using select .. into .. from .. or using some other means?

Not using SELECT INTO for reasons given above.
Though it is possible to SELECT INTO as you show in EXP 3 and then:
alter table tab2 add column id serial primary key;
EXP 2 shows the other means.
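
Sketched out, that sequence would be (table names as in the thread):

```sql
-- Bulk-create tab2 without the id column, as in EXP 3 ...
select attr1, attr2 into tab2 from tab1;

-- ... then add the serial primary key; ids are backfilled for existing rows.
alter table tab2 add column id serial primary key;
```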

2. If a table is already created, is it possible to do bulk inserts via
multiple txns inserting into the same table (EXP 3)?

Yes, but you will need some code, via a client or a function, that batches
the inserts for you.

Best,
-SB

--
Adrian Klaver
adrian.klaver@aklaver.com

#3 Souvik Bhattacherjee
kivuosb@gmail.com
In reply to: Adrian Klaver (#2)
Re: Bulk Inserts

Hi Adrian,

Thanks for the response.

Yes, but you will need some code, via a client or a function, that batches
the inserts for you.

Could you please elaborate a bit on how EXP 1 could be performed such that
it uses bulk inserts?

Best,
-SB


#4 Rob Sargent
robjsargent@gmail.com
In reply to: Souvik Bhattacherjee (#3)
Re: Bulk Inserts

Top-posting (i.e. putting your reply at the top) is discouraged here.
Does this appeal to you:
COPY (SELECT * FROM relation) TO ... (https://www.postgresql.org/docs/10/sql-copy.html)
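
As a sketch of that round trip (the file path is illustrative): COPY targets a
file or program, not another table, so a table-to-table transfer goes through
an intermediate.

```sql
-- Server-side COPY needs filesystem access on the server
-- (superuser or the pg_read/write_server_files roles);
-- from psql, \copy runs the same statements through the client connection.
copy (select attr1, attr2 from tab1) to '/tmp/tab1.csv' with (format csv);
copy tab2 (attr1, attr2) from '/tmp/tab1.csv' with (format csv);
```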

#5 Souvik Bhattacherjee
kivuosb@gmail.com
In reply to: Rob Sargent (#4)
Re: Bulk Inserts

Does this appeal to you:
COPY (SELECT * FROM relation) TO ...
(https://www.postgresql.org/docs/10/sql-copy.html)

Not sure if COPY can be used to transfer data between tables.


#6 Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Rob Sargent (#4)
Quoting style (was: Bulk Inserts)

On 2019-08-10 21:01:50 -0600, Rob Sargent wrote:

On Aug 10, 2019, at 8:47 PM, Souvik Bhattacherjee <kivuosb@gmail.com>
wrote:

Hi Adrian,

Thanks for the response.

Yes, but you will need some code, via a client or a function, that batches
the inserts for you.

Could you please elaborate a bit on how EXP 1 could be performed such that
it uses bulk inserts?

Best,
-SB

On Fri, Aug 9, 2019 at 7:26 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

[70 lines of full quote removed]

Top-posting (i.e. putting your reply at the top) is discouraged here.

He didn't really top-post. He quoted the relevant part of Adrian's
posting and then wrote his reply below that. This is the style I prefer,
because it makes it really clear what one is replying to.

After his reply, he quoted Adrian's posting again, this time completely.
I think this is unnecessary and confusing (you apparently didn't even
see that he quoted something above his reply). But it's not as bad as
quoting everything below the answer (or, as you did, quoting everything
before the answer, which I think is even worse: if I don't see any
original content within the first 100 lines or so, I usually skip the
rest).

hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

#7 Rob Sargent
robjsargent@gmail.com
In reply to: Peter J. Holzer (#6)
Re: Quoting style (was: Bulk Inserts)

Sorry. I thought I had cut most of the redundancy.


#8 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Souvik Bhattacherjee (#3)
Re: Bulk Inserts

On 8/10/19 7:47 PM, Souvik Bhattacherjee wrote:

Hi Adrian,

Thanks for the response.

Yes, but you will need some code, via a client or a function, that batches
the inserts for you.

Could you please elaborate a bit on how EXP 1 could be performed such
that it uses bulk inserts?

I guess it comes down to what you define as bulk inserts. From your OP:

EXP 1: inserts with multiple txn:
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
attr2 = 10);
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
attr2 = 20);

If the selects are returning more than one row then you are already
doing bulk inserts. If they are returning single rows, or you want to
batch them, then you need some sort of code to do that. Something
like (pseudo-Python-like code):

attr2_vals = [(10, 20, 30, 40), (50, 60, 70, 80)]

for val_batch in attr2_vals:
    BEGIN
    for id in val_batch:
        insert into tab2 (attr1, attr2) (select attr1, attr2
            from tab1 where attr2 = id)
    COMMIT
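
Spelled out as plain SQL, one transaction per batch of attr2 values (the
values here are taken from the pseudocode above):

```sql
begin;
insert into tab2 (attr1, attr2)
    select attr1, attr2 from tab1 where attr2 in (10, 20, 30, 40);
commit;

begin;
insert into tab2 (attr1, attr2)
    select attr1, attr2 from tab1 where attr2 in (50, 60, 70, 80);
commit;
```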


--
Adrian Klaver
adrian.klaver@aklaver.com

#9 Souvik Bhattacherjee
kivuosb@gmail.com
In reply to: Adrian Klaver (#8)
Re: Bulk Inserts

If the selects are returning more than one row then you are already
doing bulk inserts. If they are returning single rows, or you want to
batch them, then you need some sort of code to do that. Something
like (pseudo-Python-like code):

attr2_vals = [(10, 20, 30, 40), (50, 60, 70, 80)]

for val_batch in attr2_vals:
    BEGIN
    for id in val_batch:
        insert into tab2 (attr1, attr2) (select attr1, attr2
            from tab1 where attr2 = id)
    COMMIT

For *EXP 1: inserts with multiple txn:*
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where attr2
= 10);
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where attr2
= 20);

tab1 has ~6M rows and there are only two values for the attribute attr2 in
tab1, which are evenly distributed. So, yes, I guess I'm already doing
batching here.

Also, I ran the following two statements to see if their performance is
comparable. STMT 1 always runs faster on my machine, but the difference
is at most a couple of seconds.

STMT 1: select attr1, attr2 into tab2 from tab1;
STMT 2: insert into tab2 (select attr1, attr2 from tab1);

However, adding the serial id column as an ALTER TABLE statement actually
takes more time than inserting the tuples, so the combined total time is
more than double the time taken to insert the tuples into tab2 without the
serial id column.

Best,
-SB


#10 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Souvik Bhattacherjee (#9)
Re: Bulk Inserts

On 8/13/19 6:34 AM, Souvik Bhattacherjee wrote:

If the selects are returning more than one row then you are already
doing bulk inserts. If they are returning single rows, or you want to
batch them, then you need some sort of code to do that. Something
like (pseudo-Python-like code):

attr2_vals = [(10, 20, 30, 40), (50, 60, 70, 80)]

for val_batch in attr2_vals:
    BEGIN
    for id in val_batch:
        insert into tab2 (attr1, attr2) (select attr1, attr2
            from tab1 where attr2 = id)
    COMMIT

For *EXP 1: inserts with multiple txn:*
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
attr2 = 10);
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
attr2 = 20);

tab1 has ~6M rows and there are only two values for the attribute attr2 in
tab1 which are evenly distributed. So, yes, I guess I'm already doing
batching
here.

Also, I ran the following two statements to see if their performance is
comparable. STMT 1 always runs faster on my machine, but the difference
is at most a couple of seconds.

STMT 1: select attr1, attr2 into tab2 from tab1;
STMT 2: insert into tab2 (select attr1, attr2 from tab1);

All I have left is:

select row_number() over (order by attr2, attr1) as id, attr1,
attr2 into tab2 from tab1;

That will not create a serial type in the id column though. You can
attach a sequence to that column. Something like:

1) create sequence tab2_id start <max id + 1> owned by tab2.id;

2) alter table tab2 alter COLUMN id set default nextval('tab2_id');
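
Put together, as a sketch (using setval to start the sequence past the
current max id, rather than a literal start value):

```sql
-- Number the rows while copying them; id is a plain bigint here, not serial.
select row_number() over (order by attr2, attr1) as id, attr1, attr2
into tab2 from tab1;

-- Attach a sequence so future inserts keep numbering where row_number left off.
create sequence tab2_id owned by tab2.id;
select setval('tab2_id', (select max(id) from tab2));  -- next nextval() = max + 1
alter table tab2 alter column id set default nextval('tab2_id');
```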

However, adding the serial id column as an ALTER TABLE statement actually
takes more time than inserting the tuples, so the combined total time is
more than double the time taken to insert the tuples into tab2 without the
serial id column.


--
Adrian Klaver
adrian.klaver@aklaver.com

#11 Souvik Bhattacherjee
kivuosb@gmail.com
In reply to: Adrian Klaver (#10)
Re: Bulk Inserts

All I have left is:

select row_number() over (order by attr2, attr1) as id, attr1,
attr2 into tab2 from tab1;

That will not create a serial type in the id column though. You can
attach a sequence to that column. Something like:

1) create sequence tab2_id start <max id + 1> owned by tab2.id;

2) alter table tab2 alter COLUMN id set default nextval('tab2_id');

Thanks. This is a bit indirect but works fine. Performance-wise, this turns
out to be the best approach when inserting rows from one table into another
(new) table with a serial id column in the new table.

Best,
-SB
