BUG #14171: Wrong FSM file after switching hot standby to master

Started by Timofei Dynikovalmost 10 years ago5 messagesbugs

timofeid@outlook.com

almost 10 years ago

The following bug has been logged on the website:

Bug reference: 14171
Logged by: Timofei Dynikov
Email address: timofeid@outlook.com
PostgreSQL version: 9.4.4
Operating system: RHEL 6.6
Description:

We have an installation of Postgres 9.4.4(PostgreSQL 9.4.4 on
x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat
4.4.7-11), 64-bit) on RHEL 6.6. DB installed on 2 nodes, 1 node is master,
another node is hot standby(streaming replication). DB is monitored by
pacеmaker pgsql agent.
Sometimes we have troubles with fsm-files. In case:
• master instance is switching to another node(failover or switchover) on
highload
• Hot standby node restart and run as master succesfully.
• After that we sometimes get FSM files pointing to non-existent blocks in
the table, so subsequent insert operations on such tables fails with error
message like following: 'could not read block XX in file "base/YYYY/ZZZZZ"'.
The issue can be resolved by either deleting of wrong FSM file (while
database is stopped) or performing VACUUM FULL on erroneous table. The
problem is usually observed on relatively small tables (e.g. up to 30
blocks) which are often cleaned out (having most rows deleted).
Does anybody already faced such behavior? What can be the root cause of such
problems? Are there any recommendations on how to avoid them?

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Michael Paquier

michael@paquier.xyz

almost 10 years ago

In reply to: Timofei Dynikov (#1)

Re: BUG #14171: Wrong FSM file after switching hot standby to master

On Wed, Jun 1, 2016 at 10:48 PM, <timofeid@outlook.com> wrote:

We have an installation of Postgres 9.4.4(PostgreSQL 9.4.4 on
x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat
4.4.7-11), 64-bit) on RHEL 6.6. DB installed on 2 nodes, 1 node is master,
another node is hot standby(streaming replication). DB is monitored by
pacеmaker pgsql agent.

You surely want to update to 9.4.8 first. You are missing many bug fixes.

Sometimes we have troubles with fsm-files. In case:
• master instance is switching to another node(failover or switchover) on
highload
• Hot standby node restart and run as master succesfully.
• After that we sometimes get FSM files pointing to non-existent blocks in
the table, so subsequent insert operations on such tables fails with error
message like following: 'could not read block XX in file "base/YYYY/ZZZZZ"'.
The issue can be resolved by either deleting of wrong FSM file (while
database is stopped) or performing VACUUM FULL on erroneous table. The
problem is usually observed on relatively small tables (e.g. up to 30
blocks) which are often cleaned out (having most rows deleted).
Does anybody already faced such behavior? What can be the root cause of such
problems? Are there any recommendations on how to avoid them?

Andres, do you think that c6ff84b0 can help here? Those symptoms look
rather similar to some missing invalidation messages on the standby.
--
Michael

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Andres Freund

andres@anarazel.de

almost 10 years ago

In reply to: Michael Paquier (#2)

Re: BUG #14171: Wrong FSM file after switching hot standby to master

On 2016-06-02 16:42:35 +0900, Michael Paquier wrote:

Sometimes we have troubles with fsm-files. In case:
• master instance is switching to another node(failover or switchover) on
highload
• Hot standby node restart and run as master succesfully.
• After that we sometimes get FSM files pointing to non-existent blocks in
the table, so subsequent insert operations on such tables fails with error
message like following: 'could not read block XX in file "base/YYYY/ZZZZZ"'.
The issue can be resolved by either deleting of wrong FSM file (while
database is stopped) or performing VACUUM FULL on erroneous table. The
problem is usually observed on relatively small tables (e.g. up to 30
blocks) which are often cleaned out (having most rows deleted).
Does anybody already faced such behavior? What can be the root cause of such
problems? Are there any recommendations on how to avoid them?

Andres, do you think that c6ff84b0 can help here? Those symptoms look
rather similar to some missing invalidation messages on the standby.

If there was a restart involved, it seems unlikely that that'll be
relevant. Timofei, do I understand correctly that the problem persists
across restarts?

Regards,

Andres

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Timofei Dynikov

timofeid@outlook.com

almost 10 years ago

In reply to: Andres Freund (#3)

Re: BUG #14171: Wrong FSM file after switching hot standby to master

Date: Thu, 2 Jun 2016 07:42:32 -0700
From: andres@anarazel.de
To: michael.paquier@gmail.com
CC: timofeid@outlook.com; pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #14171: Wrong FSM file after switching hot standby to master

On 2016-06-02 16:42:35 +0900, Michael Paquier wrote:

Sometimes we have troubles with fsm-files. In case:
• master instance is switching to another node(failover or switchover) on
highload
• Hot standby node restart and run as master succesfully.
• After that we sometimes get FSM files pointing to non-existent blocks in
the table, so subsequent insert operations on such tables fails with error
message like following: 'could not read block XX in file "base/YYYY/ZZZZZ"'.
The issue can be resolved by either deleting of wrong FSM file (while
database is stopped) or performing VACUUM FULL on erroneous table. The
problem is usually observed on relatively small tables (e.g. up to 30
blocks) which are often cleaned out (having most rows deleted).
Does anybody already faced such behavior? What can be the root cause of such
problems? Are there any recommendations on how to avoid them?

Andres, do you think that c6ff84b0 can help here? Those symptoms look
rather similar to some missing invalidation messages on the standby.

If there was a restart involved, it seems unlikely that that'll be
relevant. Timofei, do I understand correctly that the problem persists
across restarts?

Regards,

Andres

Yes, problem persists across restarts. We can resolve problem only by performing VACUUM FULL or delete inconsistent FSM file.

Regards,Timofei Dynikov

Michael Paquier

michael@paquier.xyz

almost 10 years ago

In reply to: Timofei Dynikov (#4)

Re: BUG #14171: Wrong FSM file after switching hot standby to master

On Fri, Jun 3, 2016 at 7:09 PM, Timofei Dynikov <timofeid@outlook.com> wrote:

Date: Thu, 2 Jun 2016 07:42:32 -0700 andres@anarazel.de wrote:
If there was a restart involved, it seems unlikely that that'll be
relevant. Timofei, do I understand correctly that the problem persists
across restarts?

Yes, problem persists across restarts. We can resolve problem only by
performing VACUUM FULL or delete inconsistent FSM file.

pacemaker removes recovery.conf and then restarts the node at
failover, so the node moves on with a crash recovery on the same
timeline in this case. I recall seeing cases where a relation file was
truncated when crash recovery began in 9.4.4, that got fixed in 9.4.5.
The environment where this happened made it hard to compile to
reproduce it but I somewhat diagnosed this as being a side effect of
be25a08, that e118555 fixed afterwards, at least I did not see
anything else that could have been the origin of the problem between
9.4.4 and 9.4.5. The problem was in the same way happening on a small
table, one that had no more than 5 tuples, and those were removed
quite frequently to the table was most of the time empty, however when
crash recovery began it had some records.

Could you update to at least 9.4.5 and see if the problem goes away?
We may as well have another problem hidden here..
--
Michael

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs