Inserts and bad performance
Greetings
I am inserting a large number of rows, 5, 10, 15 million. The python code commits every 5000 inserts. The table has partitioned children.
At first, when there were a low number of rows inserted, the inserts would run at a good clip - 30-50K inserts per second. Now, after inserting, oh, say 1.5 billion rows, the insert rate has dropped to around 5000 inserts per second. I dropped the unique index, rebuilt the other indexes, and no change. The instance is 16 vCPU and 64GB RAM.
I'm perplexed, I can't seem to find any reason for the slowdown...
Thanks,
pg
Phil Godfrin | Database Administration
NOV
NOV US | Engineering Data
9720 Beechnut St | Houston, Texas 77036
M 281.825.2311
E Philippe.Godfrin@nov.com
On Wed, Nov 24, 2021 at 07:15:31PM +0000, Godfrin, Philippe E wrote:
Greetings
I am inserting a large number of rows, 5, 10, 15 million. The python code commits every 5000 inserts. The table has partitioned children. At first, when there were a low number of rows inserted, the inserts would run at a good clip - 30-50K inserts per second. Now, after inserting, oh, say 1.5 billion rows, the insert rate has dropped to around 5000 inserts per second. I dropped the unique index, rebuilt the other indexes, and no change. The instance is 16 vCPU and 64GB RAM.
I'm perplexed, I can't seem to find any reason for the slowdown...
Thanks,
pg
Hi,
With not much information, it may be I/O related. CPU and RAM cannot fix
that once items need to be written to disk. Are there any errors in the
logs, or are the CPUs maxed out?
Regards,
Ken
My apologies for the dearth of details. No on both the CPU and the errors. But I do believe it is I/O related, I just can't find it.
I thought maybe it was index page splitting, so I altered the unique index with fillfactor=40 and reindexed. No change.
I then dropped the unique index. No change.
I thought maybe it was checkpoint timeouts, but there was no correlation.
Oddly enough, other jobs running concurrently, which are also inserting (most likely into different partitions), are running about 2x faster than others.
I'm rather perplexed.
pg
-----Original Message-----
From: Kenneth Marshall <ktm@rice.edu>
Sent: Wednesday, November 24, 2021 1:20 PM
To: Godfrin, Philippe E <Philippe.Godfrin@nov.com>
Cc: pgsql-general@lists.postgresql.org
Subject: [EXTERNAL] Re: Inserts and bad performance
On Wed, Nov 24, 2021 at 07:15:31PM +0000, Godfrin, Philippe E wrote:
Greetings
I am inserting a large number of rows, 5, 10, 15 million. The python code commits every 5000 inserts. The table has partitioned children. At first, when there were a low number of rows inserted, the inserts would run at a good clip - 30-50K inserts per second. Now, after inserting, oh, say 1.5 billion rows, the insert rate has dropped to around 5000 inserts per second. I dropped the unique index, rebuilt the other indexes, and no change. The instance is 16 vCPU and 64GB RAM.
I'm perplexed, I can't seem to find any reason for the slowdown...
Thanks,
pg
Hi,
With not much information, it may be I/O related. CPU and RAM cannot fix that once items need to be written to disk. Are there any errors in the logs, or are the CPUs maxed out?
Regards,
Ken
"Godfrin, Philippe E" <Philippe.Godfrin@nov.com> writes:
I am inserting a large number of rows, 5, 10, 15 million. The python code commits every 5000 inserts. The table has partitioned children.
At first, when there were a low number of rows inserted, the inserts would run at a good clip - 30-50K inserts per second. Now, after inserting, oh, say 1.5 billion rows, the insert rate has dropped to around 5000 inserts per second. I dropped the unique index, rebuilt the other indexes, and no change. The instance is 16 vCPU and 64GB RAM.
Can you drop the indexes and not rebuild them till after the bulk load is
done? Once the indexes exceed available RAM, insert performance is going
to fall off a cliff, except maybe for indexes that are receiving purely
sequential inserts (so that only the right end of the index gets touched).
Also see
https://www.postgresql.org/docs/current/populate.html
regards, tom lane
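Tom's drop-then-rebuild approach can be sketched in Python. The table and index names below are hypothetical, and the helper only sequences the DDL statements so the flow can be reviewed before running anything against a real database:

```python
# Sketch of the drop-then-rebuild pattern Tom describes, assuming a
# psycopg2-style connection. Table and index names are hypothetical.

def bulk_load_ddl(index_defs):
    """Given index name -> CREATE INDEX statement, return the DDL to
    run before the bulk load (drops) and after it (rebuilds)."""
    drops = ["DROP INDEX IF EXISTS %s;" % name for name in index_defs]
    rebuilds = list(index_defs.values())
    return drops, rebuilds

drops, rebuilds = bulk_load_ddl({
    "measurements_ts_idx":
        "CREATE INDEX measurements_ts_idx ON measurements (ts);",
})
# Run `drops`, load the data, then run `rebuilds`, so each index is
# built once at the end instead of being maintained row by row.
```

The payoff is that a bulk index build is a sort, which stays fast even when the finished index would not fit in RAM.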
On Wed, Nov 24, 2021 at 2:15 PM Godfrin, Philippe E <
Philippe.Godfrin@nov.com> wrote:
Greetings
I am inserting a large number of rows, 5,10, 15 million. The python code
commits every 5000 inserts. The table has partitioned children.
On the Python client side, if you're using psycopg, you should consider
using COPY instead of INSERT if you're not:
https://www.psycopg.org/psycopg3/docs/basic/copy.html#copy
And if using psycopg2, execute_batch might be of value:
https://www.psycopg.org/docs/extras.html?highlight=insert#psycopg2.extras.execute_batch
Regards,
Gavin
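As a rough sketch of what Gavin suggests for psycopg2 (table and column names here are hypothetical; only the buffer-building helper runs without a live connection):

```python
import io

def rows_to_copy_buffer(rows):
    """Serialize rows into a tab-delimited buffer for COPY ... FROM STDIN."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(str(col) for col in row) + "\n")
    buf.seek(0)
    return buf

def load_batch(conn, rows):
    """Load one batch with copy_expert instead of row-by-row INSERTs."""
    with conn.cursor() as cur:
        cur.copy_expert(
            "COPY measurements (id, ts, value) FROM STDIN",
            rows_to_copy_buffer(rows),
        )
    conn.commit()
    # psycopg2.extras.execute_batch(cur, "INSERT ...", rows, page_size=5000)
    # is the lighter-weight alternative when COPY isn't an option.
```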
Hi Tom. Good point about the index paging out of the buffer. I did that and no change. I do have the shared buffers at 40GB, so there's a good bit there, but I also did all those things on the page you referred to, except for using COPY. At this point the data has not been scrubbed, so I'm trapping data errors and duplicates. I am curious though, as a sidebar, why COPY is considered faster than inserts. I was unable to get COPY faster than around 25K inserts a second (pretty fast anyway). Frankly, initially I was running 3 concurrent insert jobs and getting 90K ins/sec! But after a certain number of records, the speed just dropped off.
pg
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Wednesday, November 24, 2021 1:32 PM
To: Godfrin, Philippe E <Philippe.Godfrin@nov.com>
Cc: pgsql-general@lists.postgresql.org
Subject: [EXTERNAL] Re: Inserts and bad performance
"Godfrin, Philippe E" <Philippe.Godfrin@nov.com> writes:
I am inserting a large number of rows, 5, 10, 15 million. The python code commits every 5000 inserts. The table has partitioned children.
At first, when there were a low number of rows inserted, the inserts would run at a good clip - 30-50K inserts per second. Now, after inserting, oh, say 1.5 billion rows, the insert rate has dropped to around 5000 inserts per second. I dropped the unique index, rebuilt the other indexes, and no change. The instance is 16 vCPU and 64GB RAM.
Can you drop the indexes and not rebuild them till after the bulk load is
done? Once the indexes exceed available RAM, insert performance is going
to fall off a cliff, except maybe for indexes that are receiving purely
sequential inserts (so that only the right end of the index gets touched).
Also see
regards, tom lane
The notion of writing COPY blocks asynchronously is very interesting.
From: Gavin Roy <gavinr@aweber.com>
Sent: Wednesday, November 24, 2021 1:50 PM
To: Godfrin, Philippe E <Philippe.Godfrin@nov.com>
Cc: pgsql-general@lists.postgresql.org
Subject: [EXTERNAL] Re: Inserts and bad performance
On Wed, Nov 24, 2021 at 2:15 PM Godfrin, Philippe E <Philippe.Godfrin@nov.com> wrote:
Greetings
I am inserting a large number of rows, 5,10, 15 million. The python code commits every 5000 inserts. The table has partitioned children.
On the Python client side, if you're using psycopg, you should consider using COPY instead of INSERT if you're not:
And if using psycopg2, execute_batch might be of value:
Regards,
Gavin
How many partitions? How many rows do they have when performance is slowing
considerably? Does this table get many updates or is it insert only? What
version of PostgreSQL? Are the inserts randomly distributed among the
partitions or targeting one or a few partitions? Are you able to capture an
example and run it in a transaction with explain (analyze, buffers,
verbose) and then rollback?
*Michael Lewis | Database Engineer*
*Entrata*
On Thu, 25 Nov 2021 at 08:59, Godfrin, Philippe E
<Philippe.Godfrin@nov.com> wrote:
Hi Tom. Good point about the index paging out of the buffer. I did that and no change. I do have the shared buffers at 40GB, so there's a good bit there, but I also did all those things on the page you referred to, except for using COPY. At this point the data has not been scrubbed, so I'm trapping data errors and duplicates. I am curious though, as a sidebar, why COPY is considered faster than inserts. I was unable to get COPY faster than around 25K inserts a second (pretty fast anyway). Frankly, initially I was running 3 concurrent insert jobs and getting 90K ins/sec! But after a certain number of records, the speed just dropped off.
EXPLAIN (ANALYZE, BUFFERS) works with INSERTs. You just need to be
aware that using ANALYZE will perform the actual insert too. So you
might want to use BEGIN; and ROLLBACK; if it's not data that you want
to keep.
SET track_io_timing = on; might help you too.
David
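David's suggestion could look like the following sketch (a psycopg2-style connection is assumed; the insert really executes under ANALYZE, which is why the transaction is rolled back):

```python
def explain_wrap(sql):
    """Prefix a statement with the EXPLAIN options David suggests."""
    return "EXPLAIN (ANALYZE, BUFFERS, VERBOSE) " + sql

def profile_insert(conn, insert_sql, params):
    """Run one batch's INSERT under EXPLAIN ANALYZE, then discard it."""
    with conn.cursor() as cur:
        cur.execute("SET track_io_timing = on;")
        cur.execute(explain_wrap(insert_sql), params)
        plan = [line for (line,) in cur.fetchall()]
    conn.rollback()  # ANALYZE performed the actual insert; roll it back
    return plan
```

With track_io_timing on, the BUFFERS output includes read/write times, which should show directly whether the slowdown is I/O.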
On 11/24/21 1:15 PM, Godfrin, Philippe E wrote:
[snip]
I dropped the unique index, rebuilt the other indexes and no change.
IMNSHO, this is the worst possible approach. Drop everything *except* the
unique index, and then (if possible) sort the input file by the unique
key. That'll increase buffered IO; otherwise, you're bopping all around
the filesystem.
Using a bulk loader, if possible, would increase speeds.
--
Angular momentum makes the world go 'round.
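Ron's sorting point can be sketched in a couple of lines; the position of the unique key within each row is an assumption here:

```python
def sort_by_unique_key(rows, key_index=0):
    """Order input rows by the unique key so inserts walk the index
    sequentially (mostly right-edge page touches) instead of hopping
    randomly across index pages and data files."""
    return sorted(rows, key=lambda row: row[key_index])

batch = [(42, "c"), (7, "a"), (19, "b")]
ordered = sort_by_unique_key(batch)
```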
Excellent idea David, silly me, I didn't think of that. For the other questions:
How many partitions?
14
How many rows do they have when performance is slowing considerably?
Not sure, maybe in the low millions
Does this table get many updates or is it insert only?
insert
What version of PostgreSQL?
13
Are the inserts randomly distributed among the partitions or targeting one or a few partitions?
Sequentially, one partition at a time, so each set of runs inserts across each partition.
Are you able to capture an example and run it in a transaction with explain (analyze, buffers, verbose) and then rollback?
Yes, I'm looking into that
pg
-----Original Message-----
From: David Rowley <dgrowleyml@gmail.com>
Sent: Wednesday, November 24, 2021 7:13 PM
To: Godfrin, Philippe E <Philippe.Godfrin@nov.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>; pgsql-general@lists.postgresql.org
Subject: Re: [EXTERNAL] Re: Inserts and bad performance
On Thu, 25 Nov 2021 at 08:59, Godfrin, Philippe E <Philippe.Godfrin@nov.com> wrote:
Hi Tom. Good point about the index paging out of the buffer. I did that and no change. I do have the shared buffers at 40GB, so there's a good bit there, but I also did all those things on the page you referred to, except for using COPY. At this point the data has not been scrubbed, so I'm trapping data errors and duplicates. I am curious though, as a sidebar, why COPY is considered faster than inserts. I was unable to get COPY faster than around 25K inserts a second (pretty fast anyway). Frankly, initially I was running 3 concurrent insert jobs and getting 90K ins/sec! But after a certain number of records, the speed just dropped off.
EXPLAIN (ANALYZE, BUFFERS) works with INSERTs. You just need to be aware that using ANALYZE will perform the actual insert too. So you might want to use BEGIN; and ROLLBACK; if it's not data that you want to keep.
SET track_io_timing = on; might help you too.
David
Hi Gavin – thanks, I hadn't realized that about psycopg. I'm on the earlier version, so I can't use what you recommended at this point. But I did use copy_expert.
Interestingly enough, the performance of the COPY statement is only slightly better than the inserts, as I was running inserts with 5000-row VALUES clauses. In the end, the current config couldn't keep up with the WAL creation, so I turned all that off. But still no perf gains. I also turned off fsync and set the kernel dirty-page settings to 10% and 98%…
I wonder if there's a better load product than COPY? But I'd still like to know what separates COPY from bulk inserts…
pf
From: Gavin Roy <gavinr@aweber.com>
Sent: Wednesday, November 24, 2021 1:50 PM
To: Godfrin, Philippe E <Philippe.Godfrin@nov.com>
Cc: pgsql-general@lists.postgresql.org
Subject: [EXTERNAL] Re: Inserts and bad performance
On Wed, Nov 24, 2021 at 2:15 PM Godfrin, Philippe E <Philippe.Godfrin@nov.com> wrote:
Greetings
I am inserting a large number of rows, 5,10, 15 million. The python code commits every 5000 inserts. The table has partitioned children.
On the Python client side, if you're using psycopg, you should consider using COPY instead of INSERT if you're not:
And if using psycopg2, execute_batch might be of value:
Regards,
Gavin
Right you are sir! I figured that out a few hours ago!
pg
From: Ron <ronljohnsonjr@gmail.com>
Sent: Wednesday, November 24, 2021 10:58 PM
To: pgsql-general@lists.postgresql.org
Subject: [EXTERNAL] Re: Inserts and bad performance
[snip]
I dropped the unique index, rebuilt the other indexes and no change.
IMNSHO, this is the worst possible approach. Drop everything except the unique index, and then (if possible) sort the input file by the unique key. That'll increase buffered IO; otherwise, you're bopping all around the filesystem.
Using a bulk loader if possible would increase speeds
--
Angular momentum makes the world go 'round.
On 28/11/21 17:17, Godfrin, Philippe E wrote:
Right you are sir! I figured that out a few hours ago!
pg
*From:* Ron <ronljohnsonjr@gmail.com>
*Sent:* Wednesday, November 24, 2021 10:58 PM
*To:* pgsql-general@lists.postgresql.org
*Subject:* [EXTERNAL] Re: Inserts and bad performance
On 11/24/21 1:15 PM, Godfrin, Philippe E wrote:
[snip]
I dropped the unique index, rebuilt the other indexes and no change.
IMNSHO, this is the worst possible approach. Drop everything *except*
the unique index, and then (if possible) sort the input file by the
unique key. That'll increase buffered IO; otherwise, you're bopping
all around the filesystem. Using a bulk loader if possible would increase speeds.
--
Angular momentum makes the world go 'round.
Please don't top post!
Cheers,
Gavin
Ok, thanks
--
Sent from Mail.ru app for Android
Wednesday, 24 November 2021, 11:28pm +03:00 from Godfrin, Philippe E <philippe.godfrin@nov.com>:
The notion of writing COPY blocks asynchronously is very interesting.
From: Gavin Roy <gavinr@aweber.com>
Sent: Wednesday, November 24, 2021 1:50 PM
To: Godfrin, Philippe E <Philippe.Godfrin@nov.com>
Cc: pgsql-general@lists.postgresql.org
Subject: [EXTERNAL] Re: Inserts and bad performance
On Wed, Nov 24, 2021 at 2:15 PM Godfrin, Philippe E <Philippe.Godfrin@nov.com> wrote:
Greetings
I am inserting a large number of rows, 5,10, 15 million. The python code commits every 5000 inserts. The table has partitioned children.
On the Python client side, if you're using psycopg, you should consider using COPY instead of INSERT if you're not:
https://www.psycopg.org/psycopg3/docs/basic/copy.html#copy
And if using psycopg2, execute_batch might be of value:
https://www.psycopg.org/docs/extras.html?highlight=insert#psycopg2.extras.execute_batch
Regards,
Gavin