Fwd: Postgres HA - pacemaker RA do not support auto failback

Started by Shital Aover 6 years ago5 messagesgeneral
Jump to latest
#1Shital A
brightuser2019@gmail.com

Hello,

Postgres version : 9.6
OS:Rhel 7.6

We are working on HA setup for postgres cluster of two nodes in
active-passive mode.

Installed:
Pacemaker 1.1.19
Corosync 2.4.3

The pacemaker agent with this installation doesn't support automatic
failback. What I mean by that is explained below:
1. Cluster is setup like A - B with A as master.
2. Kill services on A, node B will come up as master.
3. node A is ready to join the cluster, we have to delete the lock file it
creates on any one of the node and execute the cleanup command to get the
node back as standby

Step 3 is manual so HA is not achieved in real sense.

Please help to check:
1. Is there any version of the resouce agent which supports automatic
failback? To avoid generation of lock file and deleting it.

2. If there is no such support, what checks should be added in pgsql RA to
achieve

Please suggest.
Thanks.

#2Adrien Nayrat
adrien.nayrat@anayrat.info
In reply to: Shital A (#1)
Re: Fwd: Postgres HA - pacemaker RA do not support auto failback

On 8/12/19 2:57 PM, Shital A wrote:

Hello,

Postgres version : 9.6
OS:Rhel 7.6

We are working on HA setup for postgres cluster of two nodes in active-passive
mode. 

Installed: 
Pacemaker 1.1.19
Corosync 2.4.3

The pacemaker agent with this installation doesn't support automatic failback.
What I mean by that is explained below:
1. Cluster is setup like A - B with A as master. 
2. Kill services on A, node B will come up as master.
3. node A is ready to join the cluster, we have to delete the lock file it
creates on any one of the node and execute the cleanup command to get the node
back as standby

Step 3 is manual so HA is not achieved in real sense. 

Please help to check:
1. Is there any version of the resouce agent which supports automatic failback?
To avoid generation of lock file and deleting it.

2. If there is no such support, what checks should be added in pgsql RA to achieve 

Please suggest.
Thanks.

Hello,

Which RA did you use? AFAIK there is two RA :
- pgsql : https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
- PAF : http://clusterlabs.github.io/PAF/ (Which is different from pgsql RA :
http://clusterlabs.github.io/PAF/FAQ.html#why-new-ra-for-postgresql)

If I am right, PAF should handle the case when there no need to perform pgrewind
(if the old master is not too advanced in the transaction log).

Regards,

--
Adrien NAYRAT
https://blog.anayrat.info

#3Shital A
brightuser2019@gmail.com
In reply to: Adrien Nayrat (#2)
Re: Fwd: Postgres HA - pacemaker RA do not support auto failback

On Mon, 12 Aug 2019, 18:54 Adrien Nayrat, <adrien.nayrat@anayrat.info>
wrote:

On 8/12/19 2:57 PM, Shital A wrote:

Hello,

Postgres version : 9.6
OS:Rhel 7.6

We are working on HA setup for postgres cluster of two nodes in

active-passive

mode.

Installed:
Pacemaker 1.1.19
Corosync 2.4.3

The pacemaker agent with this installation doesn't support automatic

failback.

What I mean by that is explained below:
1. Cluster is setup like A - B with A as master.
2. Kill services on A, node B will come up as master.
3. node A is ready to join the cluster, we have to delete the lock file

it

creates on any one of the node and execute the cleanup command to get

the node

back as standby

Step 3 is manual so HA is not achieved in real sense.

Please help to check:
1. Is there any version of the resouce agent which supports automatic

failback?

To avoid generation of lock file and deleting it.

2. If there is no such support, what checks should be added in pgsql RA

to achieve

Please suggest.
Thanks.

Hello,

Which RA did you use? AFAIK there is two RA :
- pgsql :
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
- PAF : http://clusterlabs.github.io/PAF/ (Which is different from pgsql
RA :
http://clusterlabs.github.io/PAF/FAQ.html#why-new-ra-for-postgresql)

If I am right, PAF should handle the case when there no need to perform
pgrewind
(if the old master is not too advanced in the transaction log).

Regards,

--
Adrien NAYRAT
https://blog.anayrat.info

Hello

We are using the pgsql RA installed with pacemaker.

Is PAF recommended over pgsql?

Has anyone changed pgsql to handle the lock file and other cases that might
not have been handled?

Please advise.

Thanks.

#4Shital A
brightuser2019@gmail.com
In reply to: Shital A (#3)
Re: Fwd: Postgres HA - pacemaker RA do not support auto failback

On Tue, 13 Aug 2019, 11:50 Shital A, <brightuser2019@gmail.com> wrote:

On Mon, 12 Aug 2019, 18:54 Adrien Nayrat, <adrien.nayrat@anayrat.info>
wrote:

On 8/12/19 2:57 PM, Shital A wrote:

Hello,

Postgres version : 9.6
OS:Rhel 7.6

We are working on HA setup for postgres cluster of two nodes in

active-passive

mode.

Installed:
Pacemaker 1.1.19
Corosync 2.4.3

The pacemaker agent with this installation doesn't support automatic

failback.

What I mean by that is explained below:
1. Cluster is setup like A - B with A as master.
2. Kill services on A, node B will come up as master.
3. node A is ready to join the cluster, we have to delete the lock file

it

creates on any one of the node and execute the cleanup command to get

the node

back as standby

Step 3 is manual so HA is not achieved in real sense.

Please help to check:
1. Is there any version of the resouce agent which supports automatic

failback?

To avoid generation of lock file and deleting it.

2. If there is no such support, what checks should be added in pgsql RA

to achieve

Please suggest.
Thanks.

Hello,

Which RA did you use? AFAIK there is two RA :
- pgsql :
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
- PAF : http://clusterlabs.github.io/PAF/ (Which is different from pgsql
RA :
http://clusterlabs.github.io/PAF/FAQ.html#why-new-ra-for-postgresql)

If I am right, PAF should handle the case when there no need to perform
pgrewind
(if the old master is not too advanced in the transaction log).

Regards,

--
Adrien NAYRAT
https://blog.anayrat.info

Hello

We are using the pgsql RA installed with pacemaker.

Is PAF recommended over pgsql?

Has anyone changed pgsql to handle the lock file and other cases that
might not have been handled?

Please advise.

Thanks.

Looks like the previous email bounced, received undelivered message.
Sending again. We are using the pgsql RA installed with pacemaker and seems
like it prevents failback by creating a lock file. We want to implement
auto failback and need pointers on how to implement.

Thanks.

In reply to: Shital A (#4)
Re: Postgres HA - pacemaker RA do not support auto failback

On Tue, 13 Aug 2019 16:38:44 +0530
Shital A <brightuser2019@gmail.com> wrote:

On Tue, 13 Aug 2019, 11:50 Shital A, <brightuser2019@gmail.com> wrote:

On Mon, 12 Aug 2019, 18:54 Adrien Nayrat, <adrien.nayrat@anayrat.info>
wrote:

On 8/12/19 2:57 PM, Shital A wrote:

Postgres version : 9.6
OS:Rhel 7.6

We are working on HA setup for postgres cluster of two nodes in
active-passive mode.

Installed:
Pacemaker 1.1.19
Corosync 2.4.3

The pacemaker agent with this installation doesn't support automatic
failback.
What I mean by that is explained below:
1. Cluster is setup like A - B with A as master.
2. Kill services on A, node B will come up as master.
3. node A is ready to join the cluster, we have to delete the lock file
it creates on any one of the node and execute the cleanup command to
get the node back as standby

Step 3 is manual so HA is not achieved in real sense.

Please help to check:
1. Is there any version of the resouce agent which supports automatic
failback?
To avoid generation of lock file and deleting it.

2. If there is no such support, what checks should be added in pgsql RA
to achieve

Which RA did you use? AFAIK there is two RA :
- pgsql :
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
- PAF : http://clusterlabs.github.io/PAF/ (Which is different from pgsql
RA :
http://clusterlabs.github.io/PAF/FAQ.html#why-new-ra-for-postgresql)

If I am right, PAF should handle the case when there no need to perform
pgrewind
(if the old master is not too advanced in the transaction log).

PAF only make sure controlled failover (aka switchover) performs safely so the
old master hook back as a standby correctly.

Should a failure occurs with an automatic failover, PAF has no dark magic to
automagicaly failoback the failing master as a standby.

We are using the pgsql RA installed with pacemaker.

Is PAF recommended over pgsql?

Yes, as much as the historical RA.

Has anyone changed pgsql to handle the lock file and other cases that
might not have been handled?

I don't know. However, I do think it would be either presumptuous to climb this
road today.

Looks like the previous email bounced, received undelivered message.
Sending again. We are using the pgsql RA installed with pacemaker and seems
like it prevents failback by creating a lock file. We want to implement
auto failback and need pointers on how to implement.

The lock file is here because PgSQL has no safety belt against data corruption
when an old master hook on a new one while its LSN was further away from
the timeline fork. You must take care of this by yourself.

Moreover, should a failover occurs, you should probably check WHY it occurred
and fix things before performing a blindly failback of a potential wounded
instance. A wounded instance can hurt your cluster in many different and
inconfortable ways.

Last, as Adrien wrote, you can manually use pg_rewind to revert the old master
to the timeline fork, as far as you enabled checksums or 'wal_log_hints=on' and
kept enough WAL around.

Another option is just to make sure you are able to build a new standby as was
fast as possible.

But remember: if your data is important, use quorum, fencing and watchdog.

Good luck.