Suddenly all queries moved to seq scan

Started by Sreejith Pover 1 year ago6 messagesgeneral
Jump to latest
#1Sreejith P
sreejith@lifetrenz.com

Hi,

We are using PostgresQL 10 in our production database. We have around 890
req /s request on peak time.

We have 1 primary and 4 slave databases as well in the same postgres
cluster.

2 days back we applied some patches in the primary server and restarted. We
didn't do anything on the secondary server.

Next day, After 18 hours all our queries from secondary servers started
taking too much time. queries were working in 2 sec started taking 80
seconds. Almost all queries behaved the same way.

After half an hour of outage we restarted all db servers and system back to
normal.

Still we are not able to understand the root case. We couldn't find any
error log or fatal errors. During the incident, in one of the read server
disks was full. We couldn't see any replication lag or query
cancellation due to replication.

please help

Regards
Sreejith

--

 

*Solutions for Care Anywhere*
*dWise HealthCare IT Solutions Pvt.
Ltd.* | www.lifetrenz.com <http://www.lifetrenz.com&gt;
*Disclaimer*:
The
information and attachments contained in this email are intended
for
exclusive use of the addressee(s) and may contain confidential or
privileged information. If you are not the intended recipient, please
notify the sender immediately and destroy all copies of this message and

any attachments. The views expressed in this email are, unless
otherwise
stated, those of the author and not those of dWise HealthCare IT Solutions
or its management.

#2Achilleas Mantzios
a.mantzios@cloud.gatewaynet.com
In reply to: Sreejith P (#1)
Re: Suddenly all queries moved to seq scan

On 11/20/24 12:50, Sreejith P wrote:

Hi,

We are using PostgresQL 10 in our production database. We have around
890 req /s request on peak time.

We have 1 primary and 4 slave databases as well in the same postgres
cluster.

2 days back we applied some patches in the primary server and
restarted. We didn't do anything on the secondary server.

Next day, After 18 hours all our queries from secondary servers
started taking too much time.  queries were working in 2 sec started
taking 80 seconds. Almost all queries behaved the same way.

After half an hour of outage we restarted all db servers and system
back to normal.

Still we are not able to understand the root case. We couldn't find
any error log or fatal errors.  During the incident, in  one of the
read server disks was full. We couldn't see any replication lag or
query cancellation due to replication.

please help

Still you say seq scan, how do you know that? You run explain (buffers,
analyze, settings) ? If this is indeed the case, then you need to check
your indexes. Check when they were last used/scanned and if they are
usable. Did the disk error had any of your PostgreSQL tablespaces /
tables/indexes  ?

Show quoted text

Regards
Sreejith

*Solutions for Care Anywhere*

------------------------------------------------------------------------
*dWise HealthCare IT Solutions Pvt. Ltd.*| www.lifetrenz.com
<http://www.lifetrenz.com&gt;
*Disclaimer*: The information and attachments contained in this email
are intended for exclusive use of the addressee(s) and may contain
confidential or privileged information. If you are not the intended
recipient, please notify the sender immediately and destroy all copies
of this message and any attachments. The views expressed in this email
are, unless otherwise stated, those of the author and not those of
dWise HealthCare IT Solutions or its management.

#3Daniel Gustafsson
daniel@yesql.se
In reply to: Sreejith P (#1)
Re: Suddenly all queries moved to seq scan

On 20 Nov 2024, at 11:50, Sreejith P <sreejith@lifetrenz.com> wrote:

We are using PostgresQL 10 in our production database. We have around 890 req /s request on peak time.

PostgreSQL 10 is well out of support and does not receive bugfixes or security
fixes, you should plan a migration to a supported version sooner rather than
later.

2 days back we applied some patches in the primary server and restarted. We didn't do anything on the secondary server.

Patches to the operating system, postgres, another application?

Next day, After 18 hours all our queries from secondary servers started taking too much time. queries were working in 2 sec started taking 80 seconds. Almost all queries behaved the same way.

After half an hour of outage we restarted all db servers and system back to normal.

Still we are not able to understand the root case. We couldn't find any error log or fatal errors. During the incident, in one of the read server disks was full. We couldn't see any replication lag or query cancellation due to replication.

You say that all queries started doing sequential scans, is that an assumption
from queries being slow or did you capture plans for the queries which be
compared against "normal" production plans?

--
Daniel Gustafsson

#4Efrain J. Berdecia
ejberdecia@yahoo.com
In reply to: Daniel Gustafsson (#3)
Re: Suddenly all queries moved to seq scan

Make sure to run analyze on the entire database, possibly using vacuumdb would be faster.
Also, check for invalid indexes.
Efrain J. Berdecia

On Wednesday, November 20, 2024 at 08:02:36 AM EST, Daniel Gustafsson <daniel@yesql.se> wrote:

On 20 Nov 2024, at 11:50, Sreejith P <sreejith@lifetrenz.com> wrote:

We are using PostgresQL 10 in our production database.  We have around 890 req /s request on peak time.

PostgreSQL 10 is well out of support and does not receive bugfixes or security
fixes, you should plan a migration to a supported version sooner rather than
later.

2 days back we applied some patches in the primary server and restarted. We didn't do anything on the secondary server.

Patches to the operating system, postgres, another application?

Next day, After 18 hours all our queries from secondary servers started taking too much time.  queries were working in 2 sec started taking 80 seconds. Almost all queries behaved the same way.

After half an hour of outage we restarted all db servers and system back to normal.

Still we are not able to understand the root case. We couldn't find any error log or fatal errors.  During the incident, in  one of the read server disks was full. We couldn't see any replication lag or query cancellation due to replication.

You say that all queries started doing sequential scans, is that an assumption
from queries being slow or did you capture plans for the queries which be
compared against "normal" production plans?

--
Daniel Gustafsson

#5Sreejith P
sreejith@lifetrenz.com
In reply to: Daniel Gustafsson (#3)
Re: Suddenly all queries moved to seq scan

On 20 Nov 2024, at 6:32 PM, Daniel Gustafsson <daniel@yesql.se> wrote:

On 20 Nov 2024, at 11:50, Sreejith P <sreejith@lifetrenz.com> wrote:

We are using PostgresQL 10 in our production database. We have around 890 req /s request on peak time.

PostgreSQL 10 is well out of support and does not receive bugfixes or security
fixes, you should plan a migration to a supported version sooner rather than
later.

2 days back we applied some patches in the primary server and restarted. We didn't do anything on the secondary server.

Patches to the operating system, postgres, another application?

PostgreSQL Common 10.23-6

Next day, After 18 hours all our queries from secondary servers started taking too much time. queries were working in 2 sec started taking 80 seconds. Almost all queries behaved the same way.

After half an hour of outage we restarted all db servers and system back to normal.

Still we are not able to understand the root case. We couldn't find any error log or fatal errors. During the incident, in one of the read server disks was full. We couldn't see any replication lag or query cancellation due to replication.

You say that all queries started doing sequential scans, is that an assumption
from queries being slow or did you capture plans for the queries which be
compared against "normal" production plans?.

Queries were taking 20 ms started taking 60 seconds. So have done SQL analyse to understand about query plan. There we found that query planner taking seq scan instead in index scan.

I would like to add one ore point. A delete query were running in DB from 2 days for deleting around 80 million records.

--
Daniel Gustafsson

--

 

*Solutions for Care Anywhere*
*dWise HealthCare IT Solutions Pvt.
Ltd.* | www.lifetrenz.com <http://www.lifetrenz.com&gt;
*Disclaimer*:
The
information and attachments contained in this email are intended
for
exclusive use of the addressee(s) and may contain confidential or
privileged information. If you are not the intended recipient, please
notify the sender immediately and destroy all copies of this message and

any attachments. The views expressed in this email are, unless
otherwise
stated, those of the author and not those of dWise HealthCare IT Solutions
or its management.

#6Justin
zzzzz.graf@gmail.com
In reply to: Sreejith P (#5)
Re: Suddenly all queries moved to seq scan

On Wed, Nov 20, 2024 at 9:09 AM Sreejith P <sreejith@lifetrenz.com> wrote:

Queries were taking 20 ms started taking 60 seconds. So have done SQL
analyse to understand about query plan. There we found that query planner
taking seq scan instead in index scan.

I would like to add one ore point. A delete query were running in DB from
2 days for deleting around 80 million records.

This can cause this specific problem where the number of dead tuples and
lack of autovacuum running can cause the statistics to favor a sequential
scan over an index scan. Taking into account the length of time the delete
took it would hold a number of datapages and tuples in a lock state, which
can lead to blocking queries and prevent autovacuum/analyze.

It is best to do bulk deletes in batches and have a rest period between
batches to allow autovacuum and analyze to keep up.. Doing deletes in
batches reduces the number of resources being consumed.