Performance Issue with Hash Partition Query Execution in PostgreSQL 16

Started by Ramakrishna mover 1 year ago8 messagesgeneral
Jump to latest
#1Ramakrishna m
ram.pgdb@gmail.com

Hi Team,

One of the queries, which retrieves a single record from a table with 16
hash partitions, is taking more than 10 seconds to execute. In contrast,
when we run the same query manually, it completes within milliseconds. This
issue is causing exhaustion of the application pools. Do we have any bugs
in postgrs16 hash partitions? Please find the attached log, table, and
execution plan.

size of the each partitions : 300GB
Index Size : 12GB

Postgres Version : 16.x
Shared Buffers : 75 GB
Effective_cache : 175 GB
Work _mem : 4MB
Max_connections : 3000

OS : Ubuntu 22.04
Ram : 384 GB
CPU : 64

Please let us know if you need any further information or if there are
additional details required.

Regards,
Ram.

Attachments:

Postgres_hash_partion_issue.sqlapplication/octet-stream; name=Postgres_hash_partion_issue.sqlDownload
#2Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Ramakrishna m (#1)
Re: Performance Issue with Hash Partition Query Execution in PostgreSQL 16

On Wed, 2024-11-06 at 22:43 +0530, Ramakrishna m wrote:

One of the queries, which retrieves a single record from a table with 16 hash partitions,
is taking more than 10 seconds to execute. In contrast, when we run the same query manually,
it completes within milliseconds.

If I read your file right, the slow execution has an additional "LIMIT 55".
That can make a difference.

In general, you can use auto_explain to get the execution plan of the slow query.

Yours,
Laurenz Albe

#3David Mullineux
dmullx@gmail.com
In reply to: Ramakrishna m (#1)
Re: Performance Issue with Hash Partition Query Execution in PostgreSQL 16

It might be worth eliminating the use of cached plans here. Is your app
using prepared statements at all?
Point is that if the optimizer sees the same prepared query , 5 times, the
it locks the plan that it found at that time. This is a good trade off as
it avoids costly planning-time for repetitive queries. But if you are
manually querying, the a custom plan will be generated anew.
A quick analyze of the table should reset the stats and invalidate any
cached plans.
This may not be your problem just worth eliminating it from the list of
potential causes.

On Wed, 6 Nov 2024, 17:14 Ramakrishna m, <ram.pgdb@gmail.com> wrote:

Show quoted text

Hi Team,

One of the queries, which retrieves a single record from a table with 16
hash partitions, is taking more than 10 seconds to execute. In contrast,
when we run the same query manually, it completes within milliseconds. This
issue is causing exhaustion of the application pools. Do we have any bugs
in postgrs16 hash partitions? Please find the attached log, table, and
execution plan.

size of the each partitions : 300GB
Index Size : 12GB

Postgres Version : 16.x
Shared Buffers : 75 GB
Effective_cache : 175 GB
Work _mem : 4MB
Max_connections : 3000

OS : Ubuntu 22.04
Ram : 384 GB
CPU : 64

Please let us know if you need any further information or if there are
additional details required.

Regards,
Ram.

#4ravi k
ravisql09@gmail.com
In reply to: David Mullineux (#3)
Re: Performance Issue with Hash Partition Query Execution in PostgreSQL 16

Hi ,

Thanks for the suggestions.

Two more observations:

1) no sequence scan noticed from pg_stat_user_tables ( hope stats are
accurate in postgres 16) if parameter sniffing happens the possibility of
going to sequence scan is more right.

2) no blockings or IO issue during the time.

3) even with limit clause if touch all partitions also it could have been
completed in milliseconds as this is just one record.

4) auto_explain in prod we cannot enable as this is expensive and with high
TPS we may face latency issues and lower environment this issue cannot be
reproduced,( this is happening out of Million one case)

This looks puzzle to us, just in case anyone experianced pls share your
experience.

Regards,
Ravi

On Thu, 7 Nov, 2024, 3:41 am David Mullineux, <dmullx@gmail.com> wrote:

Show quoted text

It might be worth eliminating the use of cached plans here. Is your app
using prepared statements at all?
Point is that if the optimizer sees the same prepared query , 5 times,
the it locks the plan that it found at that time. This is a good trade off
as it avoids costly planning-time for repetitive queries. But if you are
manually querying, the a custom plan will be generated anew.
A quick analyze of the table should reset the stats and invalidate any
cached plans.
This may not be your problem just worth eliminating it from the list of
potential causes.

On Wed, 6 Nov 2024, 17:14 Ramakrishna m, <ram.pgdb@gmail.com> wrote:

Hi Team,

One of the queries, which retrieves a single record from a table with 16
hash partitions, is taking more than 10 seconds to execute. In contrast,
when we run the same query manually, it completes within milliseconds. This
issue is causing exhaustion of the application pools. Do we have any bugs
in postgrs16 hash partitions? Please find the attached log, table, and
execution plan.

size of the each partitions : 300GB
Index Size : 12GB

Postgres Version : 16.x
Shared Buffers : 75 GB
Effective_cache : 175 GB
Work _mem : 4MB
Max_connections : 3000

OS : Ubuntu 22.04
Ram : 384 GB
CPU : 64

Please let us know if you need any further information or if there are
additional details required.

Regards,
Ram.

#5David Mullineux
dmullx@gmail.com
In reply to: ravi k (#4)
Re: Performance Issue with Hash Partition Query Execution in PostgreSQL 16

Just spotted a potential problem. The indexed column is a bigint. Are you,
in your prepared statement passing a string or a big int ?
I notice your plan is doing an implicit type conversion when you run it
manually.
Sometimes the wrong type will make it not use the index.

On Fri, 8 Nov 2024, 03:07 ravi k, <ravisql09@gmail.com> wrote:

Show quoted text

Hi ,

Thanks for the suggestions.

Two more observations:

1) no sequence scan noticed from pg_stat_user_tables ( hope stats are
accurate in postgres 16) if parameter sniffing happens the possibility of
going to sequence scan is more right.

2) no blockings or IO issue during the time.

3) even with limit clause if touch all partitions also it could have been
completed in milliseconds as this is just one record.

4) auto_explain in prod we cannot enable as this is expensive and with
high TPS we may face latency issues and lower environment this issue cannot
be reproduced,( this is happening out of Million one case)

This looks puzzle to us, just in case anyone experianced pls share your
experience.

Regards,
Ravi

On Thu, 7 Nov, 2024, 3:41 am David Mullineux, <dmullx@gmail.com> wrote:

It might be worth eliminating the use of cached plans here. Is your app
using prepared statements at all?
Point is that if the optimizer sees the same prepared query , 5 times,
the it locks the plan that it found at that time. This is a good trade off
as it avoids costly planning-time for repetitive queries. But if you are
manually querying, the a custom plan will be generated anew.
A quick analyze of the table should reset the stats and invalidate any
cached plans.
This may not be your problem just worth eliminating it from the list of
potential causes.

On Wed, 6 Nov 2024, 17:14 Ramakrishna m, <ram.pgdb@gmail.com> wrote:

Hi Team,

One of the queries, which retrieves a single record from a table with 16
hash partitions, is taking more than 10 seconds to execute. In contrast,
when we run the same query manually, it completes within milliseconds. This
issue is causing exhaustion of the application pools. Do we have any bugs
in postgrs16 hash partitions? Please find the attached log, table, and
execution plan.

size of the each partitions : 300GB
Index Size : 12GB

Postgres Version : 16.x
Shared Buffers : 75 GB
Effective_cache : 175 GB
Work _mem : 4MB
Max_connections : 3000

OS : Ubuntu 22.04
Ram : 384 GB
CPU : 64

Please let us know if you need any further information or if there are
additional details required.

Regards,
Ram.

#6ravi k
ravisql09@gmail.com
In reply to: David Mullineux (#5)
Re: Performance Issue with Hash Partition Query Execution in PostgreSQL 16

Sorry, it was typo. Bind variable is bigint only.

Thanks

On Fri, 8 Nov, 2024, 7:09 pm David Mullineux, <dmullx@gmail.com> wrote:

Show quoted text

Just spotted a potential problem. The indexed column is a bigint. Are you,
in your prepared statement passing a string or a big int ?
I notice your plan is doing an implicit type conversion when you run it
manually.
Sometimes the wrong type will make it not use the index.

On Fri, 8 Nov 2024, 03:07 ravi k, <ravisql09@gmail.com> wrote:

Hi ,

Thanks for the suggestions.

Two more observations:

1) no sequence scan noticed from pg_stat_user_tables ( hope stats are
accurate in postgres 16) if parameter sniffing happens the possibility of
going to sequence scan is more right.

2) no blockings or IO issue during the time.

3) even with limit clause if touch all partitions also it could have been
completed in milliseconds as this is just one record.

4) auto_explain in prod we cannot enable as this is expensive and with
high TPS we may face latency issues and lower environment this issue cannot
be reproduced,( this is happening out of Million one case)

This looks puzzle to us, just in case anyone experianced pls share your
experience.

Regards,
Ravi

On Thu, 7 Nov, 2024, 3:41 am David Mullineux, <dmullx@gmail.com> wrote:

It might be worth eliminating the use of cached plans here. Is your app
using prepared statements at all?
Point is that if the optimizer sees the same prepared query , 5 times,
the it locks the plan that it found at that time. This is a good trade off
as it avoids costly planning-time for repetitive queries. But if you are
manually querying, the a custom plan will be generated anew.
A quick analyze of the table should reset the stats and invalidate any
cached plans.
This may not be your problem just worth eliminating it from the list of
potential causes.

On Wed, 6 Nov 2024, 17:14 Ramakrishna m, <ram.pgdb@gmail.com> wrote:

Hi Team,

One of the queries, which retrieves a single record from a table with
16 hash partitions, is taking more than 10 seconds to execute. In contrast,
when we run the same query manually, it completes within milliseconds. This
issue is causing exhaustion of the application pools. Do we have any bugs
in postgrs16 hash partitions? Please find the attached log, table, and
execution plan.

size of the each partitions : 300GB
Index Size : 12GB

Postgres Version : 16.x
Shared Buffers : 75 GB
Effective_cache : 175 GB
Work _mem : 4MB
Max_connections : 3000

OS : Ubuntu 22.04
Ram : 384 GB
CPU : 64

Please let us know if you need any further information or if there are
additional details required.

Regards,
Ram.

#7David Mullineux
dmullx@gmail.com
In reply to: ravi k (#6)
Re: Performance Issue with Hash Partition Query Execution in PostgreSQL 16

Thanks for correction. At this point I would be trying to modify
plan_cache_mode
for the session which uses the bond variable. alter it so that
plan_cache_mode=force_custom_plan
One hypothesis is that, a bad plan got cached for that SQL pattern.
Obviously, when you run it *manually* you are always getting a *custom*
plan as it's not a prepared statement.

On Sat, 9 Nov 2024, 03:46 ravi k, <ravisql09@gmail.com> wrote:

Show quoted text

Sorry, it was typo. Bind variable is bigint only.

Thanks

On Fri, 8 Nov, 2024, 7:09 pm David Mullineux, <dmullx@gmail.com> wrote:

Just spotted a potential problem. The indexed column is a bigint. Are
you, in your prepared statement passing a string or a big int ?
I notice your plan is doing an implicit type conversion when you run it
manually.
Sometimes the wrong type will make it not use the index.

On Fri, 8 Nov 2024, 03:07 ravi k, <ravisql09@gmail.com> wrote:

Hi ,

Thanks for the suggestions.

Two more observations:

1) no sequence scan noticed from pg_stat_user_tables ( hope stats are
accurate in postgres 16) if parameter sniffing happens the possibility of
going to sequence scan is more right.

2) no blockings or IO issue during the time.

3) even with limit clause if touch all partitions also it could have
been completed in milliseconds as this is just one record.

4) auto_explain in prod we cannot enable as this is expensive and with
high TPS we may face latency issues and lower environment this issue cannot
be reproduced,( this is happening out of Million one case)

This looks puzzle to us, just in case anyone experianced pls share your
experience.

Regards,
Ravi

On Thu, 7 Nov, 2024, 3:41 am David Mullineux, <dmullx@gmail.com> wrote:

It might be worth eliminating the use of cached plans here. Is your app
using prepared statements at all?
Point is that if the optimizer sees the same prepared query , 5 times,
the it locks the plan that it found at that time. This is a good trade off
as it avoids costly planning-time for repetitive queries. But if you are
manually querying, the a custom plan will be generated anew.
A quick analyze of the table should reset the stats and invalidate any
cached plans.
This may not be your problem just worth eliminating it from the list
of potential causes.

On Wed, 6 Nov 2024, 17:14 Ramakrishna m, <ram.pgdb@gmail.com> wrote:

Hi Team,

One of the queries, which retrieves a single record from a table with
16 hash partitions, is taking more than 10 seconds to execute. In contrast,
when we run the same query manually, it completes within milliseconds. This
issue is causing exhaustion of the application pools. Do we have any bugs
in postgrs16 hash partitions? Please find the attached log, table, and
execution plan.

size of the each partitions : 300GB
Index Size : 12GB

Postgres Version : 16.x
Shared Buffers : 75 GB
Effective_cache : 175 GB
Work _mem : 4MB
Max_connections : 3000

OS : Ubuntu 22.04
Ram : 384 GB
CPU : 64

Please let us know if you need any further information or if there are
additional details required.

Regards,
Ram.

#8ravi k
ravisql09@gmail.com
In reply to: David Mullineux (#7)
Re: Performance Issue with Hash Partition Query Execution in PostgreSQL 16

Thanks for the advice!

I am planing to set session level!

but before that one more observations noticed i.e One more table has same
issue, which is having similar like hash partitions.

And I scheduled manual analyze for all parent hash tables(thus all stats
will update together).

After this change I didn't noticed the issue, not sure does this addressed
issue or not, just monitoring if this not works will set custom plan in
session level.

I have seen in SQL server parameter sniffing regularly but in postgres I
never experienced. I am still wondering does this sniffing or not as from
stats I didn't notice any sequence scan.

Best,

On Sat, 9 Nov, 2024, 3:40 pm David Mullineux, <dmullx@gmail.com> wrote:

Show quoted text

Thanks for correction. At this point I would be trying to modify
plan_cache_mode
for the session which uses the bond variable. alter it so that
plan_cache_mode=force_custom_plan
One hypothesis is that, a bad plan got cached for that SQL pattern.
Obviously, when you run it *manually* you are always getting a *custom*
plan as it's not a prepared statement.

On Sat, 9 Nov 2024, 03:46 ravi k, <ravisql09@gmail.com> wrote:

Sorry, it was typo. Bind variable is bigint only.

Thanks

On Fri, 8 Nov, 2024, 7:09 pm David Mullineux, <dmullx@gmail.com> wrote:

Just spotted a potential problem. The indexed column is a bigint. Are
you, in your prepared statement passing a string or a big int ?
I notice your plan is doing an implicit type conversion when you run it
manually.
Sometimes the wrong type will make it not use the index.

On Fri, 8 Nov 2024, 03:07 ravi k, <ravisql09@gmail.com> wrote:

Hi ,

Thanks for the suggestions.

Two more observations:

1) no sequence scan noticed from pg_stat_user_tables ( hope stats are
accurate in postgres 16) if parameter sniffing happens the possibility of
going to sequence scan is more right.

2) no blockings or IO issue during the time.

3) even with limit clause if touch all partitions also it could have
been completed in milliseconds as this is just one record.

4) auto_explain in prod we cannot enable as this is expensive and with
high TPS we may face latency issues and lower environment this issue cannot
be reproduced,( this is happening out of Million one case)

This looks puzzle to us, just in case anyone experianced pls share your
experience.

Regards,
Ravi

On Thu, 7 Nov, 2024, 3:41 am David Mullineux, <dmullx@gmail.com> wrote:

It might be worth eliminating the use of cached plans here. Is your
app using prepared statements at all?
Point is that if the optimizer sees the same prepared query , 5 times,
the it locks the plan that it found at that time. This is a good trade off
as it avoids costly planning-time for repetitive queries. But if you are
manually querying, the a custom plan will be generated anew.
A quick analyze of the table should reset the stats and invalidate any
cached plans.
This may not be your problem just worth eliminating it from the list
of potential causes.

On Wed, 6 Nov 2024, 17:14 Ramakrishna m, <ram.pgdb@gmail.com> wrote:

Hi Team,

One of the queries, which retrieves a single record from a table with
16 hash partitions, is taking more than 10 seconds to execute. In contrast,
when we run the same query manually, it completes within milliseconds. This
issue is causing exhaustion of the application pools. Do we have any bugs
in postgrs16 hash partitions? Please find the attached log, table, and
execution plan.

size of the each partitions : 300GB
Index Size : 12GB

Postgres Version : 16.x
Shared Buffers : 75 GB
Effective_cache : 175 GB
Work _mem : 4MB
Max_connections : 3000

OS : Ubuntu 22.04
Ram : 384 GB
CPU : 64

Please let us know if you need any further information or if there
are additional details required.

Regards,
Ram.