BUG #17652: Performance degradation after migrating from v13.4 to v14.4
The following bug has been logged on the website:
Bug reference: 17652
Logged by: Michael Guissine
Email address: mguissine@gmail.com
PostgreSQL version: 14.4
Operating system: AWS RDS
Description:
Hello,
We recently upgraded our AWS RDS instances to PG 14.4 and are seeing weird
performance degradation. The issue seems to be isolated to one specific use
case: querying a partitioned table with a large number of elements in the
IN clause. In our setup, we observed that Postgres switches from an Index
Scan to a Bitmap Index Scan/Bitmap Heap Scan if the number of elements in
the IN clause is above 60; below 60, an Index Scan is performed on all
partitions.
The query and query plan are attached below. We also tried turning off
enable_bitmapscan and observed that the query performs well, which was our
experience in the previous version (PG 13.4).
https://explain.depesz.com/s/F587
Thank you!
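For illustration, a query of the shape described might look like the following sketch. The identifiers follow the obfuscated definitions given later in the thread, and the literal values are invented:

```sql
-- Hypothetical sketch; real identifiers are obfuscated and values invented.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM partitioned_table
WHERE partition_key IN (101, 102, 103 /* ... 60+ elements ... */)
  AND text = 'exact-value';

-- Re-running after disabling bitmap scans for the session reproduces
-- the fast (plain index scan) plan the reporter saw on 13.4:
SET enable_bitmapscan = off;
```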
On Wed, Oct 19, 2022 at 12:21 PM PG Bug reporting form <
noreply@postgresql.org> wrote:
We recently upgraded our AWS RDS instances to PG 14.4 and are seeing weird
performance degradation.
First things first, did you do an ANALYZE on the database after the
upgrade? (Maybe RDS does it for you, I don't know. But it isn't worth
looking into anything else until you make sure that that has been done)
Cheers,
Jeff
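A minimal post-upgrade maintenance pass along the lines Jeff suggests might be (sketch; `partitioned_table` is the obfuscated table name used later in this thread):

```sql
-- pg_upgrade does not carry optimizer statistics over, so rebuild them
-- for the whole database:
ANALYZE;

-- Or, more thoroughly, per table:
VACUUM (ANALYZE) partitioned_table;
```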
Hi Jeff,
Sorry, I should've mentioned that. Yes, we did ANALYZE all the tables after
the upgrade; we even went ahead and REINDEXed all indexes and then also ran
VACUUM ANALYZE, but we are still facing the same performance issue.
On Wed, Oct 19, 2022 at 1:09 PM Jeff Janes <jeff.janes@gmail.com> wrote:
Hi,
On 2022-10-19 16:11:15 +0000, PG Bug reporting form wrote:
We recently upgraded our AWS RDS instances to PG 14.4 and are seeing weird
performance degradation.
Are they actually configured the same? Of particular interest would be
shared_buffers, effective_cache_size, random_page_cost, seq_page_cost,
cpu_index_tuple_cost, cpu_operator_cost, cpu_tuple_cost, work_mem.
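One way to compare these is to run the same catalog query on both instances and diff the output (a sketch; pg_settings is the standard view):

```sql
-- Dump the planner cost/memory settings Andres lists, for comparison
-- between the 13.4 and 14.4 instances:
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size',
               'random_page_cost', 'seq_page_cost',
               'cpu_index_tuple_cost', 'cpu_operator_cost',
               'cpu_tuple_cost', 'work_mem')
ORDER BY name;
```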
Your explain shows that most of the time is spent doing IO:
Append (cost=0.43..93144.17 rows=121 width=1121) (actual time=225005.053..1160202.658 rows=7 loops=1)
Buffers: shared hit=379846 read=2382552 dirtied=57320 written=231982
I/O Timings: read=1113819.977 write=2850.752
Is there a chance at least part of this is due to a cold cache after
upgrading?
What is the definition of the indexes on the partitions?
Based on the names of the indexes I'd guess that the reason the bitmap scans
are bad is that the index includes the partition_key column as a leading
column, but skips over that column and just filters by "text".
I'm a bit confused by your plan showing:
Rows Removed by Index Recheck: 62841
and
Heap Blocks: exact=46111
in the same node.
I haven't rechecked the code, but I don't think that should happen for btree
indexes. It could however for gin, bloom, hash etc.
Query and query plan are attached below. We also tried to turn off
enable_bitmapscan and observed that query is performing well which was our
experience in the previous version (pg 13.4)
Do you know what the plan was in 13?
Greetings,
Andres Freund
On Wed, Oct 19, 2022 at 1:30 PM Michael Guissine <mguissine@gmail.com>
wrote:
Hi Jeff,
Sorry I should've mentioned that. Yes we did ANALYZE all the tables after
upgrade, we even went ahead and REINDEX all indexes and then also run
VACUUM ANALYZE but still facing same performance issue
I'm guessing that paritioned_table_hash_p3_partition_key_text_idx is a
pg_trgm index, and that you are a victim of the change "Allow GiST/GIN
pg_trgm indexes to do equality lookups". It is not surprising that this
change would bite someone. If you can redo the upgrade but this time
refrain from upgrading pg_trgm from 1.5 to 1.6, that should fix the issue
(assuming RDS even gives you enough control to do that), or drop the
extension and recreate it specifically as version 1.5 (which unfortunately
would cascade to dropping all the indexes, and they would need to be
recreated after--even though the index structures themselves would require
no change).
I have a file which would let you downgrade the extension without needing
to drop all the pg_trgm indexes, but there is no way for you to deploy that
to RDS.
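The version check, and the drop-and-recreate route Jeff describes, could look like this (sketch; note the CASCADE drops every pg_trgm index, which must then be recreated):

```sql
-- Check which pg_trgm version is installed:
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pg_trgm';

-- Recreate the extension pinned at 1.5, which lacks the equality
-- operator. WARNING: this cascades to all pg_trgm indexes.
DROP EXTENSION pg_trgm CASCADE;
CREATE EXTENSION pg_trgm VERSION '1.5';
```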
Cheers,
Jeff
Thank you Andres,
The column names in the plan and queries are obfuscated. A cold cache
should not be an issue: we upgraded 3 days ago and have since
reindexed/re-analyzed everything, plus we have an active application running
against this database, so the data should be cached.
Here is an obfuscated table definition. There are multiple indexes on those
columns to support different query patterns; note that the
partitioned_table_partition_key_start_time_text index was added after the
upgrade, as we thought it might help the optimizer come up with a better
plan, but it didn't.
Partitioned table "public.partitioned_table"
          Column           |            Type             | Collation | Nullable |                    Default
---------------------------+-----------------------------+-----------+----------+-----------------------------------------------
 id                        | bigint                      |           | not null | nextval('partitioned_table_id_seq'::regclass)
 partition_key             | integer                     |           | not null |
 text                      | text                        |           |          |
 description               | text                        |           |          |
 ... (more fields) ...
Partition key: HASH (partition_key)
Indexes:
    "partitioned_table_pkey" PRIMARY KEY, btree (partition_key, id)
    "partitioned_table_partition_key_start_time_index" btree (partition_key, start_time)
    "partitioned_table_partition_key_start_time_text" btree (partition_key, start_time, text)
    "partitioned_table_partition_key_text_idx" gin (partition_key, text gin_trgm_ops)
    "partitioned_table_text_tsv_index" gin (to_tsvector('simple'::regconfig, text)) WHERE text IS NOT NULL
Number of partitions: 128 (Use \d+ to list them.)
On Wed, Oct 19, 2022 at 2:46 PM Andres Freund <andres@anarazel.de> wrote:
paritioned_table_hash_p3_partition_key_text_idx is in fact a pg_trgm index,
you are right!
Interesting: why would it use that index at all on some partitions but not
on others? The query is a simple match on 3 columns, and we have other
indexes to cover it.
On Wed, Oct 19, 2022 at 3:01 PM Jeff Janes <jeff.janes@gmail.com> wrote:
Hi,
On 2022-10-19 15:32:17 -0400, Michael Guissine wrote:
The column names in the plan and queries are obfuscated
I guessed so :)
Indexes:
"partitioned_table_pkey" PRIMARY KEY, btree (partition_key, id)
"partitioned_table_partition_key_start_time_index" btree
(partition_key, start_time)
"partitioned_table_partition_key_start_time_text" btree
(partition_key, start_time, text)
"partitioned_table_partition_key_text_idx" gin (partition_key,
text gin_trgm_ops)
"partitioned_table_text_tsv_index" gin
(to_tsvector('simple'::regconfig, text)) WHERE text IS NOT NULL
Number of partitions: 128 (Use \d+ to list them.)
Ok, that explains why we see the massive rechecks. For the bitmap scans we end
up using a gin index instead of the btree index.
The reason you're seeing this in 14 is likely the following entry in the
release notes:
Allow GiST/GIN pg_trgm indexes to do equality lookups (Julien Rouhaud)
This is similar to LIKE except no wildcards are honored.
CCed author/committer of that change. I suspect this would need proper
costing to avoid being used inadvertently.
I assume you need the pg_trgm index?
In theory you could likely work around the issue by reverting the relevant
catalog change:
+ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
+ OPERATOR 11 pg_catalog.= (text, text);
by issuing
ALTER OPERATOR FAMILY gin_trgm_ops USING gin DROP OPERATOR 11 (text, text);
but I suspect you won't be allowed to do that in RDS.
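For completeness, one can check whether the equality operator is registered in the operator family before (or after) attempting that workaround. A sketch against the system catalogs, assuming the standard pg_amop/pg_opfamily layout:

```sql
-- List the operator registered at strategy number 11 (equality) in the
-- gin_trgm_ops operator family; pg_trgm 1.6 adds it, 1.5 lacks it.
SELECT amopstrategy, amopopr::regoperator
FROM pg_amop
WHERE amopfamily = (SELECT oid FROM pg_opfamily
                    WHERE opfname = 'gin_trgm_ops')
  AND amopstrategy = 11;
```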
Greetings,
Andres Freund
On Wed, Oct 19, 2022 at 1:12 PM Andres Freund <andres@anarazel.de> wrote:
The reason you're seeing this in 14 is likely the following entry in the
release notes:
Allow GiST/GIN pg_trgm indexes to do equality lookups (Julien Rouhaud)
This is similar to LIKE except no wildcards are honored.
CCed author/committer of that change. I suspect this would need proper
costing to avoid being used inadvertently.
I am reminded of this recent thread:
/messages/by-id/3188617.44csPzL39Z@aivenronan
(Perhaps you were thinking of the same thing already, but thought I'd
mention it, just in case.)
--
Peter Geoghegan
Hi,
On 2022-10-19 13:20:37 -0700, Peter Geoghegan wrote:
On Wed, Oct 19, 2022 at 1:12 PM Andres Freund <andres@anarazel.de> wrote:
The reason you're seeing this in 14 is likely the following entry in the
release notes:
Allow GiST/GIN pg_trgm indexes to do equality lookups (Julien Rouhaud)
This is similar to LIKE except no wildcards are honored.
CCed author/committer of that change. I suspect this would need proper
costing to avoid being used inadvertently.
I am reminded of this recent thread:
/messages/by-id/3188617.44csPzL39Z@aivenronan
(Perhaps you were thinking of the same thing already, but thought I'd
mention it, just in case.)
No, I hadn't seen that one. That indeed looks quite relevant...
Greetings,
Andres Freund
Thank you all for such super quick responses. We are considering removing
the pg_trgm index and hope that will help!
On Wed, Oct 19, 2022 at 4:36 PM Andres Freund <andres@anarazel.de> wrote:
On Wed, Oct 19, 2022 at 3:44 PM Michael Guissine <mguissine@gmail.com>
wrote:
paritioned_table_hash_p3_partition_key_text_idx is in fact pg_trgm index,
you are right!
interesting, why would it use that index at all on some partitions but not
on others? the query is a simple match on 3 columns and we have other
indexes to cover it
The two index choices are (falsely) similar in cost, so small
differences in the selectivity estimates are enough to drive the choice
from one to the other from partition to partition.
There are (at least) two planner deficiencies at work here.
For the actually faster index, it thinks it will need to re-descend the
index one time for every value in the IN-list on the leading column, and
each of those re-descents is going to end up on a different leaf page,
charging about one random_page_cost for each one. But it is counting a
re-descent for every item in the IN-list, for every partition. But it only
needs to re-descend for the items of the IN-list which also happen to match
the remainder for that hash partition. This is incorrectly inflating the
cost of the faster index. (You could argue that if the executor really
does do the re-descents, then it is correct for the planner to account for
them. But I suspect that many of the hopeless re-descents may end up on
leaf pages already in memory. Anyway, fixing the planner and fixing the
executor would probably go hand in hand)
For the slower index, the gin code doesn't estimate how many rechecks there
will be, only how many end results there will be. But the rechecks are
where most of the true cost is. So that makes this one end up with an
incorrectly low cost. The overestimation of one and the underestimation of
the other bring the two estimates close enough together that random
variations in the stats can drive the choice between them. Now you could
say "all the row estimates are 1, so how could there be any variation"?
But notice that for the bitmap heap scans, there is a filter step which
removes around 1000 rows from each partition. We know how many rows were
actually removed from each partition, but we are not told how many rows the
planner thought it would remove. Variations in this unseen estimated
filtered number would lead to variations in the cost estimate.
While fixing the planner would be nice, I think the best current solution
for you is to change the order of columns in the 3-column index to be
(text, start_time, key). Since the condition on "text" is extremely
selective, the planner should know that by the time it gets to the key
column it can check it with a single hash-table probe rather than a bunch
of costly re-descents. This should be estimated to be faster than either
of the existing two indexes, and also should actually be faster.
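Jeff's suggestion could be implemented roughly as below. The index name is illustrative, and per the earlier table definition the "key" column is presumably partition_key:

```sql
-- Put the highly selective "text" column first, so each partition needs
-- only one descent of the btree rather than one per IN-list element.
-- (CREATE INDEX on a partitioned table recurses to all partitions;
-- CONCURRENTLY is not supported on partitioned tables in PG 14.)
CREATE INDEX partitioned_table_text_start_time_key_idx
    ON partitioned_table (text, start_time, partition_key);
```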
Cheers,
Jeff
Hi Jeff, this is super useful; thanks for spending your time explaining the
internals. In our case we ended up dropping the trgm index and everything
is back to normal, but I like your idea of reversing the column order and
will give it a try!
~ Michael
On Thu, Oct 20, 2022 at 1:03 AM Jeff Janes <jeff.janes@gmail.com> wrote: