Help: Postgres Replication issues with pacemaker

Started by Shital A over 6 years ago · 2 messages · general

brightuser2019@gmail.com

over 6 years ago

Hello,

We have set up an active-passive cluster using streaming replication on RHEL
7.5, and we are testing Pacemaker for automated failover.
We are seeing the following issues with this setup:

1. When we trigger a failover by killing the primary (killall -9 postgres)
while data is being written to it, the standby does not come up in sync.
In Pacemaker, crm_mon -Afr shows the standby as disconnected, in the
HS:alone state.

In the PostgreSQL log, we see the following errors:

< 2019-09-20 17:07:46.266 IST > LOG: entering standby mode
< 2019-09-20 17:07:46.267 IST > LOG: database system was not properly shut down; automatic recovery in progress
< 2019-09-20 17:07:46.270 IST > LOG: redo starts at 1/680A2188
< 2019-09-20 17:07:46.370 IST > LOG: consistent recovery state reached at 1/6879D9F8
< 2019-09-20 17:07:46.370 IST > LOG: database system is ready to accept read only connections
cp: cannot stat '/var/lib/pgsql/9.6/data/archivedir/000000010000000100000068': No such file or directory
< 2019-09-20 17:07:46.751 IST > LOG: statement: select pg_is_in_recovery()
< 2019-09-20 17:07:46.782 IST > LOG: statement: show synchronous_standby_names
< 2019-09-20 17:07:50.993 IST > LOG: statement: select pg_is_in_recovery()
< 2019-09-20 17:07:53.395 IST > LOG: started streaming WAL from primary at 1/68000000 on timeline 1
< 2019-09-20 17:07:53.436 IST > LOG: invalid contrecord length 2662 at 1/6879D9F8
< 2019-09-20 17:07:53.438 IST > FATAL: terminating walreceiver process due to administrator command
cp: cannot stat '/var/lib/pgsql/9.6/data/archivedir/00000002.history': No such file or directory
cp: cannot stat '/var/lib/pgsql/9.6/data/archivedir/000000010000000100000068': No such file or directory

When we restart PostgreSQL on the standby with pg_ctl restart, the standby
starts syncing.

2. After the standby syncs following the pg_ctl restart described above, we
found that one or two records are missing on the standby.
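The WAL file names in the cp: cannot stat lines follow directly from the LSNs printed in the log. Below is a minimal sketch of that arithmetic, assuming the default 16 MB WAL segment size; the helper names parse_lsn, wal_segment_name and lsn_diff_bytes are hypothetical (on 9.6 the server offers pg_xlogfile_name() and pg_xlog_location_diff() for similar purposes). It shows that the redo start (1/680A2188), the invalid contrecord (1/6879D9F8) and the streaming start (1/68000000) all fall in segment 000000010000000100000068, the same file cp could not find in the archive.

```python
# Minimal sketch of PostgreSQL LSN / WAL segment arithmetic.
# Assumption: default 16 MB WAL segments (9.6 can be built with another size).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
SEGMENTS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE  # 256 for 16 MB segments


def parse_lsn(lsn: str) -> int:
    """Turn an LSN printed as 'X/Y' (two hex numbers) into a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_segment_name(lsn: str, timeline: int = 1) -> str:
    """Name of the WAL segment file containing the byte at `lsn`."""
    segno = parse_lsn(lsn) // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        timeline,
        segno // SEGMENTS_PER_XLOGID,
        segno % SEGMENTS_PER_XLOGID,
    )


def lsn_diff_bytes(a: str, b: str) -> int:
    """Bytes of WAL between two LSNs, computed as a - b."""
    return parse_lsn(a) - parse_lsn(b)


if __name__ == "__main__":
    # LSNs taken from the log in this thread.
    for lsn in ("1/680A2188", "1/6879D9F8", "1/68000000"):
        print(lsn, "->", wal_segment_name(lsn))
    # Size of WAL between the redo start point and the invalid-contrecord position.
    print(lsn_diff_bytes("1/6879D9F8", "1/680A2188"), "bytes")
```

Because replication here is asynchronous, the primary can acknowledge commits that the standby has not yet received. Applying lsn_diff_bytes() to the primary's last written LSN and the standby's last received LSN would give the amount of WAL that is lost if the primary is killed at that moment. That is one general way committed rows can be absent from a standby after a failover; whether it explains the one or two missing records here is not confirmed by the log.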

We need help understanding:
1. Why does the standby fail to start in the first place and complain about
missing WAL files?
2. Could the missing records be related to the unsuccessful failover?

If you have faced this issue or know about it, please let us know.

Replication is asynchronous.
The recovery.conf file has a restore_command that uses cp.
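A restore_command must exit nonzero when the requested file is not in the archive, and PostgreSQL routinely asks for files that are not there, including .history files such as 00000002.history. The PostgreSQL documentation states that this is not an error condition, so some cp: cannot stat lines are expected on a standby. Below is a minimal Python sketch of that contract. The script name restore_wal.py, its main() entry point and the wiring restore_command = 'python3 /path/to/restore_wal.py %f %p' are hypothetical; the archive path is taken from the log above.

```python
"""Hypothetical restore_command helper with the same contract as `cp %f %p`.

Illustrative wiring in recovery.conf (an assumption, not from the thread):
    restore_command = 'python3 /path/to/restore_wal.py %f %p'
"""
import shutil
import sys
from pathlib import Path
from typing import List

# Assumption: archive directory taken from the log in this thread.
ARCHIVE_DIR = Path("/var/lib/pgsql/9.6/data/archivedir")


def restore(wal_name: str, dest_path: str, archive_dir: Path = ARCHIVE_DIR) -> int:
    """Copy `wal_name` from the archive to `dest_path`; return an exit status."""
    src = archive_dir / wal_name
    if not src.is_file():
        # A nonzero status tells recovery the file is not in the archive;
        # PostgreSQL expects this for many requests (e.g. *.history files).
        print(f"restore_wal: {src} not found in archive", file=sys.stderr)
        return 1
    # %p is relative to the data directory, which is the working directory
    # when PostgreSQL runs restore_command.
    shutil.copyfile(src, dest_path)
    return 0


def main(argv: List[str]) -> int:
    """Entry point: argv is [script, %f, %p]."""
    if len(argv) != 3:
        print("usage: restore_wal.py WALFILE DESTPATH", file=sys.stderr)
        return 2
    return restore(argv[1], argv[2])
```

Run as a script, the entry point would be sys.exit(main(sys.argv)); it is left out here so the module can be imported and tested. Like cp, the helper only reports whether the file was found; whether recovery then falls back to streaming is decided by PostgreSQL.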

Thanks.

Shital A

brightuser2019@gmail.com

over 6 years ago

In reply to: Shital A (#1)

Re: Help: Postgres Replication issues with pacemaker

On Mon, 23 Sep 2019, 00:46 Shital A, <brightuser2019@gmail.com> wrote:
> [quoted original message omitted; identical to the first post above]

Hello Team,

Any ideas?

Thanks.
