Fast switchover

Started by legrand legrand, 7 months ago, 8 messages, general
#1legrand legrand
legrand_legrand@hotmail.com

Hello all the readers,

For some projects we need a fast manual switchover to support near-zero-downtime maintenance
(I am not talking about automated failover as provided by HA tools, just planned, controlled operations).

Physical replication switchover itself:
- Replication before the switchover should be synchronous, or replication lag should be kept under control, to prevent data loss.
- Switchover duration does not seem "compressible" below a few seconds (because of primary shutdown, promotion, new standby catch-up, ...).
- The application retry strategy (after disconnection) should be tuned with a proper retry delay. A pooler or a specific driver may help.

Might logical replication (bi-directional, with one instance RW and the other RO) be a better solution?
This solution is more complex because of sequences, DDL, large objects, and conflict resolution (if any),
but the switchover should be faster ...

What could we expect in terms of downtime in both worlds?
Are there any logical replication managers or admin tools available (preferably open source)?
Any feedback is welcome.

Thanks in advance
Regards
PAscal

#2Ron
ronljohnsonjr@gmail.com
In reply to: legrand legrand (#1)
Re: Fast switchover

On Mon, Sep 8, 2025 at 11:03 AM legrand legrand <legrand_legrand@hotmail.com>
wrote:

Hello all the readers,

For some projects we need a fast *manual* switchover to address Near Zero
downtime maintenance
(not speaking here about automated failover like those provided by HA
tools, but just planned, controlled operations)

Database Physical replication switchover itself:
- initial replication (before switchover) should be synchronous or
replication LAG should be controlled to prevent data loss.
- Switchover duration seems not "compressible" under a few seconds
(because of primary shutdown, promotion, new standby catch up, ...)
- Application retry strategy (after disconnection) should be tuned using
proper retry delay. Pooler or specific driver may help.

There will always be a few seconds delay while the applications reconnect.

Do the applications connect via a VIP? That's simpler for the application.

This is what I do from the not-yet-new-primary:

1. psql -h $CurrentPrimary -c "ALTER SYSTEM SET synchronous_standby_names TO '*';"
2. Wait a few seconds.
3. ssh $CurrentPrimary sudo ip del $VIP # cmd is more complicated, but you get the idea
4. ssh $CurrentPrimary pg_ctl stop -mfast # to kill connections, has to happen, no matter the solution.
5. pg_ctl promote
6. sudo ip add $VIP
7. Replicate from new-primary to new-replica "at leisure".

No retry delay, since the application directly goes to the new server.
Steps 3-6 are in a script, and what pgpool does, except I do it. #4 is by
far the slowest. ssh authentication delay in #3 and #4 are nonexistent if
you have "pre-created" an ssh socket.
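
A minimal sketch of how steps 3-6 might be scripted, run from the standby that is about to be promoted. It is not Ron's actual script. The host name, VIP, interface name, data directory and socket path are placeholders, and the "ip addr del/add ... dev ..." form is only a guess at the "more complicated" command. By default it prints the commands instead of running them. The steps are chained with && so that the promotion is not attempted if the old primary failed to stop. Step 1 is left out; note that ALTER SYSTEM only takes effect after a configuration reload (for example SELECT pg_reload_conf()).

```shell
# Placeholder values (assumptions, not taken from the thread).
CURRENT_PRIMARY="${CURRENT_PRIMARY:-pg1.example.internal}"
VIP_CIDR="${VIP_CIDR:-192.0.2.10/24}"
IFACE="${IFACE:-eth0}"
PGDATA="${PGDATA:-/var/lib/postgresql/data}"
SSH_SOCKET="${SSH_SOCKET:-$HOME/.ssh/switchover-%r@%h:%p}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Run a command on the current primary, reusing a shared ssh control
# socket so that repeated calls skip the authentication delay.
on_primary() {
    run ssh -o ControlMaster=auto -o ControlPath="$SSH_SOCKET" \
        -o ControlPersist=10m "$CURRENT_PRIMARY" "$@"
}

switchover() {
    on_primary sudo ip addr del "$VIP_CIDR" dev "$IFACE" &&   # step 3
    on_primary pg_ctl stop -m fast -D "$PGDATA" &&            # step 4
    run pg_ctl promote -D "$PGDATA" &&                        # step 5
    run sudo ip addr add "$VIP_CIDR" dev "$IFACE"             # step 6
}
```

Calling "switchover" with the default DRY_RUN=1 lists the four commands in order; "DRY_RUN=0 switchover" would execute them.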

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

#3Klaus Darilion
klaus.darilion@nic.at
In reply to: Ron (#2)
RE: Fast switchover

On Monday, September 8, 2025 6:10 PM, Ron Johnson <ronljohnsonjr@gmail.com> wrote:

This is what I do from the not-yet-new-primary:

1. psql -h $CurrentPrimary -c "ALTER SYSTEM SET synchronous_standby_names TO '*';"
2. Wait a few seconds.
3. ssh $CurrentPrimary sudo ip del $VIP # cmd is more complicated, but you get the idea
4. ssh $CurrentPrimary pg_ctl stop -mfast # to kill connections, has to happen, no matter the solution.

If you remove the VIP in step 3, the TCP connections on the client side are broken (and may hang around), and they will not be properly terminated when you stop PostgreSQL in step 4. That may delay clients in detecting the broken TCP connection and reconnecting to the server (depending on the network/firewall configuration on the servers). A faster reconnect may be achievable if you first stop PostgreSQL and then remove the VIP.
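
Independently of the ordering, client-side settings can bound how long a dead connection goes unnoticed. Below is a small sketch that builds a libpq connection string with TCP keepalive and timeout parameters. The parameter names are standard libpq connection options (tcp_user_timeout needs PostgreSQL 12 or later on the client side), but the host, database name and timing values are illustrative assumptions that would need tuning per environment.

```shell
# Build a libpq connection string with short timeouts.
# With these (assumed) values an idle dead peer should be noticed after
# roughly 5 s idle + 3 probes x 2 s; tcp_user_timeout is in milliseconds.
build_conninfo() {
    host="$1"; dbname="$2"
    printf 'host=%s dbname=%s connect_timeout=2 keepalives=1 keepalives_idle=5 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=10000' \
        "$host" "$dbname"
}

# Usage (commented out so that the sketch has no side effects):
# psql "$(build_conninfo 192.0.2.10 appdb)" -c 'SELECT 1'
```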

Regards
Klaus

#4Ron
ronljohnsonjr@gmail.com
In reply to: Klaus Darilion (#3)
Re: Fast switchover

On Mon, Sep 8, 2025 at 12:37 PM Klaus Darilion <klaus.darilion@nic.at>
wrote:

If you remove the VIP in step 3, the TCP connections on the client side
are broken (may hang around), and will not be properly terminated if you
stop postgresql in step 4. That may cause delays on the client detecting
the broken TCP connection and reconnecting to the server (depending on the
network/firewall configuration on the servers). Maybe faster reconnect can
be achieved if you first stop postgresql, and then remove the VIP.

Interesting. Thanks.


#5Laurenz Albe
laurenz.albe@cybertec.at
In reply to: legrand legrand (#1)
Re: Fast switchover

On Mon, 2025-09-08 at 15:03 +0000, legrand legrand wrote:

For some projects we need a fast manual switchover to address Near Zero downtime maintenance
(not speaking here about automated failover like those provided by HA tools, but just planned, controlled operations)

Database Physical replication switchover itself:
- initial replication (before switchover) should be synchronous or replication LAG should be controlled to prevent data loss.
- Switchover duration seems not "compressible" under a few seconds (because of primary shutdown, promotion, new standby catch up, ...)
- Application retry strategy (after disconnection) should be tuned using proper retry delay. Pooler or specific driver may help.

There is no need for synchronous replication; you cannot lose data with a switchover,
if you do it right:

- run a CHECKPOINT on the primary (to speed up the shutdown)
- when the checkpoint is done, perform a clean shutdown
- when the primary is down, promote the standby

The primary will transmit *all* data to the standby before it shuts down.
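
The three steps above could be sketched as follows, run from the standby. This is an illustration under assumptions, not a procedure given in the thread: it assumes ssh access to the primary, the same data directory path on both hosts, and that psql without -h reaches the local standby. The extra LSN comparison is a rough sanity check added here: after a clean shutdown, pg_controldata on the old primary reports the location of the shutdown checkpoint, and the standby should have received WAL beyond that point before it is promoted.

```shell
# Convert an LSN such as "0/3000060" to a decimal byte position.
lsn_to_dec() {
    hi="${1%%/*}"; lo="${1##*/}"
    echo $(( (0x$hi << 32) + 0x$lo ))
}

# True if the first LSN (standby receive position) is strictly beyond
# the second one (start of the primary's shutdown checkpoint record).
standby_is_past() {
    [ "$(lsn_to_dec "$1")" -gt "$(lsn_to_dec "$2")" ]
}

# clean_switchover PRIMARY_HOST PGDATA  (both arguments are placeholders)
clean_switchover() {
    primary_host="$1"; pgdata="$2"
    psql -h "$primary_host" -c 'CHECKPOINT' &&
    ssh "$primary_host" pg_ctl stop -m fast -D "$pgdata" &&
    ckpt="$(ssh "$primary_host" pg_controldata "$pgdata" |
            sed -n 's/^Latest checkpoint location: *//p')" &&
    recv="$(psql -At -c 'SELECT pg_last_wal_receive_lsn()')" &&
    standby_is_past "$recv" "$ckpt" &&
    pg_ctl promote -D "$pgdata"
}
```

Because every step is chained with &&, the standby is only promoted if the checkpoint, the shutdown and the LSN check all succeeded.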

May logical replication ( bi-directional, with one instance RW and the other RO) be a better solution ?

I'd say no.

what could we expect (in term of downtime in both worlds) ?

Usually seconds, so plan for ten minutes.

Yours,
Laurenz Albe

#6Klaus Darilion
klaus.darilion@nic.at
In reply to: Laurenz Albe (#5)
RE: Fast switchover

what could we expect (in term of downtime in both worlds) ?

Usually seconds, so plan for ten minutes.

*lol*
So true ...

#7legrand legrand
legrand_legrand@hotmail.com
In reply to: Laurenz Albe (#5)
Re : Fast switchover

Hi Laurenz,
Thank you for your answer

For some projects we need a fast manual switchover to address Near Zero downtime maintenance

I forgot to say that the application would not be stopped during maintenance…

There is no need for synchronous replication; you cannot lose data with a switchover,
if you do it right:

Ok

May logical replication ( bi-directional, with one instance RW and the other RO) be a better solution ?

I'd say no.

Really ?

what could we expect (in term of downtime in both worlds) ?

Usually seconds, so plan for ten minutes.

Brrr, I was thinking about a more reliable process.

Regards
PAscal

#8Laurenz Albe
laurenz.albe@cybertec.at
In reply to: legrand legrand (#7)
Re: Re : Fast switchover

On Tue, 2025-09-09 at 05:07 +0000, legrand legrand wrote:

what could we expect (in term of downtime in both worlds) ?

Usually seconds, so plan for ten minutes.

Brrr, I was thinking about a more reliable process.

If you want more reliable numbers, make a test run on your system.
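
One way such a test run might be instrumented is to poll the database through the application's connection path while the switchover happens and record the longest run of failed probes. The sketch below is only an assumption about how to do that; the function name, the probe command and the polling interval are made up for illustration, and a fractional sleep interval relies on a sleep implementation that accepts it (GNU and BSD do).

```shell
# measure_outage COUNT INTERVAL PROBE...
# Runs PROBE every INTERVAL seconds, COUNT times, and prints the largest
# number of consecutive failed probes that was observed.
measure_outage() {
    count="$1"; interval="$2"; shift 2
    longest=0; current=0; i=0
    while [ "$i" -lt "$count" ]; do
        if "$@" >/dev/null 2>&1; then
            current=0
        else
            current=$((current + 1))
            if [ "$current" -gt "$longest" ]; then
                longest="$current"
            fi
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "$longest"
}

# Usage (commented out; address and database name are placeholders):
# PGCONNECT_TIMEOUT=1 measure_outage 600 0.2 \
#     psql -h 192.0.2.10 -d appdb -Atc 'SELECT 1'
```

The observed downtime is then approximately the printed count multiplied by the interval plus the time each failed probe takes, which is why a short connect timeout matters.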

Yours,
Laurenz Albe