broken backup trail in case of quickly patroni switchback and forth

Started by Zwettler Markus (OIZ), over 6 years ago, 18 messages, general
#1 Zwettler Markus (OIZ)
Markus.Zwettler@zuerich.ch

We use Patroni to manage our Postgres standby databases.

We take our (WAL) backups on the primary side based on intervals and thresholds.
Our archived WAL files are written to a local WAL directory first and moved to tape afterwards.

We had a case where Patroni switched sides back and forth quickly, e.g.:
12:00h: primary - standby
12:05h: standby - primary
12:10h: primary - standby

We realised that in this scenario we will not have a WAL backup of the WAL files generated between 12:05h and 12:10h.

How can we make sure that the whole WAL sequence trail is backed up? Any ideas?

- Markus

#2 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Zwettler Markus (OIZ) (#1)
Re: broken backup trail in case of quickly patroni switchback and forth

On 11/7/19 5:52 AM, Zwettler Markus (OIZ) wrote:
> How can we make sure that the whole WAL sequence trail is backed up? Any ideas?

Probably best to ask the Patroni folks:

https://github.com/zalando/patroni#community

--
Adrian Klaver
adrian.klaver@aklaver.com

#3 Zwettler Markus (OIZ)
Markus.Zwettler@zuerich.ch
In reply to: Adrian Klaver (#2)
AW: broken backup trail in case of quickly patroni switchback and forth

I already asked the Patroni folks. They told me this is not related to Patroni but to PostgreSQL. ;-)

- Markus

#4 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Zwettler Markus (OIZ) (#3)
Re: AW: broken backup trail in case of quickly patroni switchback and forth

On 11/7/19 7:18 AM, Zwettler Markus (OIZ) wrote:
> I already asked the Patroni folks. They told me this is not related to Patroni but Postgresql. ;-)

Hard to say without more information:

1) Postgres version

2) Setup/config info

3) Details of what happened between 12:00 and 12:10

--
Adrian Klaver
adrian.klaver@aklaver.com

#5 Zwettler Markus (OIZ)
Markus.Zwettler@zuerich.ch
In reply to: Adrian Klaver (#4)
AW: AW: broken backup trail in case of quickly patroni switchback and forth

1) 9.6

2)
$ cat postgresql.conf
# Do not edit this file manually!
# It will be overwritten by Patroni!
include 'postgresql.base.conf'

cluster_name = 'pcl_l702'
hot_standby = 'on'
hot_standby_feedback = 'True'
listen_addresses = 'localhost,tstm49003.tstglobal.tst.loc,pcl_l702.tstglobal.tst.loc'
max_connections = '100'
max_locks_per_transaction = '64'
max_prepared_transactions = '0'
max_replication_slots = '10'
max_wal_senders = '10'
max_worker_processes = '8'
port = '5436'
track_commit_timestamp = 'off'
wal_keep_segments = '8'
wal_level = 'replica'
wal_log_hints = 'on'
hba_file = '/pgdata/pcl_l702/pg_hba.conf'
ident_file = '/pgdata/pcl_l702/pg_ident.conf'
$
$
$
$ cat postgresql.base.conf
datestyle = 'iso, mdy'
default_text_search_config = 'pg_catalog.english'
dynamic_shared_memory_type = posix
lc_messages = 'en_US.UTF-8'
lc_monetary = 'de_CH.UTF-8'
lc_numeric = 'de_CH.UTF-8'
lc_time = 'de_CH.UTF-8'
logging_collector = on
log_directory = 'pg_log'
log_rotation_age = 1d
log_rotation_size = 0
log_timezone = 'Europe/Vaduz'
log_truncate_on_rotation = on
max_connections = 100
timezone = 'Europe/Vaduz'
archive_command = 'test ! -f /tmp/pg_archive_backup_running_on_pcl_l702* && rsync --checksum %p /pgxlog_archive/pcl_l702/%f'
archive_mode = on
archive_timeout = 1800
cluster_name = pcl_l702
cron.database_name = 'pdb_l72_oiz'
# effective_cache_size
listen_addresses = '*'
log_connections = on
log_destination = 'stderr, csvlog'
log_disconnections = on
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_line_prefix = '%t : %h=>%u@%d : %p-%c-%v : %e '
log_statement = 'ddl'
max_wal_senders = 5
port = 5436
shared_buffers = 512MB
shared_preload_libraries = 'auto_explain, pg_stat_statements, pg_cron, pg_statsinfo'
wal_buffers = 16MB
wal_compression = on
wal_level = replica
# work_mem

3)
12:00h: primary - standby
=> Some clients committed some transactions; failover
12:05h: standby - primary
=> Some clients connected + committed some transactions; failover
12:10h: primary - standby

#6 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Zwettler Markus (OIZ) (#5)
Re: AW: AW: broken backup trail in case of quickly patroni switchback and forth

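On 11/7/19 7:47 AM, Zwettler Markus (OIZ) wrote:
> 3)
> 12:00h: primary - standby
> => Some clients committed some transactions; Failover
> 12:05h: standby - primary
> => Some clients connected + committed some transactions; Failover
> 12:10h: primary - standby

I am heading out the door, so I will not have time to look at this until
later. For those who get a chance before then, it would be nice to have
the Patroni conf file information as well. The Patroni information may
answer the question, but in case it does not: what actually is a failover
in 3) above?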

--
Adrian Klaver
adrian.klaver@aklaver.com

#7 Zwettler Markus (OIZ)
Markus.Zwettler@zuerich.ch
In reply to: Adrian Klaver (#6)
AW: AW: AW: broken backup trail in case of quickly patroni switchback and forth

3)
Patroni only does failovers, even in the case of a regular shutdown of the primary. A failover is a promotion of the standby plus an automatic reinstatement (pg_rewind or pg_basebackup) of the former primary.

Time: role site 1 - role site 2
====================
12:00h: primary - standby
=> Some clients committed some transactions; primary stopped => failover to standby
12:05h: standby - primary
=> Some clients connected + committed some transactions; primary stopped => failover to standby
12:10h: primary - standby

Patroni.yml)
$ cat pcl_l702.yml
scope: pcl_l702
name: pcl_l702@tstm49003
namespace: /patroni/

log:
  level: DEBUG
  dir: /opt/app/patroni/etc/log/
  file_num: 10
  file_size: 104857600

restapi:
  listen: tstm49003.tstglobal.tst.loc:8010
  connect_address: tstm49003.tstglobal.tst.loc:8010

etcd:
  hosts: etcdlab01.tstglobal.tst.loc:2379,etcdlab02.tstglobal.tst.loc:2379,etcdlab03.tstglobal.tst.loc:2379,etcdlab04.tstglobal.tst.loc:2379,etcdlab05.tstglobal.tst.loc:2379
  username: patroni
  password: censored

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    master_start_timeout: 300
    synchronous_mode: true
    postgresql:
      use_pg_rewind: true
      use_slots: true

  # NO BOOTSTRAPPING USED
  method: do_not_bootstrap
  do_not_bootstrap:
    command: /bin/false

postgresql:
  authentication:
    replication:
      username: repadmin
      password: censored
    superuser:
      username: patroni
      password: censored
  callbacks:
    on_reload: /opt/app/patroni/etc/callback_patroni.sh
    on_restart: /opt/app/patroni/etc/callback_patroni.sh
    on_role_change: /opt/app/patroni/etc/callback_patroni.sh
    on_start: /opt/app/patroni/etc/callback_patroni.sh
    on_stop: /opt/app/patroni/etc/callback_patroni.sh
  connect_address: tstm49003.tstglobal.tst.loc:5436
  database: pcl_l702
  data_dir: /pgdata/pcl_l702
  bin_dir: /usr/pgsql-9.6/bin
  listen: localhost,tstm49003.tstglobal.tst.loc,pcl_l702.tstglobal.tst.loc:5436
  pgpass: /home/postgres/.pgpass_patroni
  recovery_conf:
    restore_command: cp /pgxlog_archive/pcl_l702/%f %p
  parameters:
    hot_standby_feedback: on
    wal_keep_segments: 64
  use_pg_rewind: true

watchdog:
  mode: automatic
  device: /dev/watchdog
  safety_margin: 5

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false

-----Original Message-----
From: Adrian Klaver <adrian.klaver@aklaver.com>
Sent: Thursday, 7 November 2019 17:06
To: Zwettler Markus (OIZ) <Markus.Zwettler@zuerich.ch>; pgsql-general@lists.postgresql.org
Subject: Re: AW: AW: broken backup trail in case of quickly patroni switchback and forth

#8 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Zwettler Markus (OIZ) (#1)
Re: broken backup trail in case of quickly patroni switchback and forth

On Thu, 2019-11-07 at 13:52 +0000, Zwettler Markus (OIZ) wrote:
> How can we make sure that the whole WAL sequence trail is backed up? Any ideas?

You'll have to archive WAL from both machines. Then you have everything you should need.

Make sure "recovery_target_timeline = 'latest'" so that recovery will
follow the timeline jumps.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
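
To check that the WAL archived on both machines really adds up to an unbroken trail, one could merge the two archive listings and look for missing segments per timeline. The following is only a sketch: the function names `parse_wal_name` and `find_gaps` are made up for illustration, it assumes the default 16 MB segment size (256 segments per log number), and it only detects holes inside a timeline, not whether the timelines connect properly at each promotion.

```python
import re
from collections import defaultdict

# Assumption: default 16 MB WAL segments, i.e. 256 segments per "log" number.
SEGMENTS_PER_LOG = 256
# A WAL segment file name is 24 upper-case hex digits: timeline, log, segment.
WAL_NAME = re.compile(r"^([0-9A-F]{8})([0-9A-F]{8})([0-9A-F]{8})$")


def parse_wal_name(name):
    """Return (timeline, segment_number), or None if name is not a WAL segment."""
    match = WAL_NAME.match(name)
    if match is None:
        return None  # e.g. .history, .backup or .partial files
    timeline, log, seg = (int(part, 16) for part in match.groups())
    return timeline, log * SEGMENTS_PER_LOG + seg


def find_gaps(*archive_listings):
    """Merge file listings from several archives and report missing segments.

    Returns {timeline: [missing segment numbers]} for timelines with holes.
    """
    segments = defaultdict(set)
    for listing in archive_listings:
        for name in listing:
            parsed = parse_wal_name(name)
            if parsed is not None:
                timeline, segno = parsed
                segments[timeline].add(segno)

    gaps = {}
    for timeline, present in segments.items():
        expected = set(range(min(present), max(present) + 1))
        missing = sorted(expected - present)
        if missing:
            gaps[timeline] = missing
    return gaps
```

Calling `find_gaps(listing_site1, listing_site2)` with the file names from each machine's archive directory returns an empty dict when no segment is missing inside any timeline.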

#9 Brad Nicholson
bradn@ca.ibm.com
In reply to: Zwettler Markus (OIZ) (#7)
Re: AW: AW: AW: broken backup trail in case of quickly patroni switchback and forth

"Zwettler Markus (OIZ)" <Markus.Zwettler@zuerich.ch> wrote on 2019/11/07 11:32:42 AM:

> 3)
> Patroni does only failovers. Also in case of regular shutdown of the
> primary. A failover is a promote of the standby + automatic
> reinstate (pg_rewind or pg_basebackup) of the former primary.

This is not accurate. Patroni does controlled switchovers as well as
failovers. A controlled switchover issues a fast shutdown to Postgres; a
hard one issues an immediate shutdown. From that point on, what matters is
how Postgres responds to those shutdown requests.

A fast shutdown will attempt to ensure that the WAL stream is transmitted to
the replica and that the WAL files are archived. An immediate shutdown will
not do any of this. The following issue explains more about when Patroni may
choose an immediate shutdown (it might not be totally accurate anymore, as
it is a year old).

https://github.com/zalando/patroni/issues/837#issuecomment-433686687

I agree with the Patroni folks that this is not a Patroni issue, but simply
how Postgres responds to the required shutdown types.

#10 Zwettler Markus (OIZ)
Markus.Zwettler@zuerich.ch
In reply to: Laurenz Albe (#8)
AW: broken backup trail in case of quickly patroni switchback and forth

1) If I got you right this means enabling archiving on both machines (archive_mode=on, archive_command=cp...). Yes?

2) Will the latest transactions on the actual primary be archived (copied from pg_xlog to the local archive_directory) before this primary is reinstated as new standby?

Thanks,
Markus

#11 Zwettler Markus (OIZ)
Markus.Zwettler@zuerich.ch
In reply to: Brad Nicholson (#9)
AW: AW: AW: AW: broken backup trail in case of quickly patroni switchback and forth

It depends. It is a switchover if Patroni could do a clean shutdown. But it might start killing processes after a certain period if a normal shutdown after SIGTERM didn't happen. This would not be a switchover anymore. In other words, there is no guarantee of a "clean" switchover. This might be the reason why the Patroni guys always talk about failover only.

It's not a Patroni issue but it's triggered by Patroni as it will do "some kind of switchover" on a regular shutdown.

#12 Brad Nicholson
bradn@ca.ibm.com
In reply to: Zwettler Markus (OIZ) (#11)
Re: AW: AW: AW: AW: broken backup trail in case of quickly patroni switchback and forth

"Zwettler Markus (OIZ)" <Markus.Zwettler@zuerich.ch> wrote on 2019/11/08 07:51:33 AM:

> It depends. It is a switchover if Patroni could do a clean shutdown.
> But, it might start killing processes after a certain period if a
> normal shutdown after SIGTERM didn't happen. This would not be a
> switchover anymore. In other words there is no guarantee for a
> "clean" switchover. This might be the reason why the Patroni guys
> are always talking about failover only.

If it can't do a clean shutdown, that points to something wrong with
Postgres itself. Why doesn't a fast shutdown work for you in those cases?

> It's not a Patroni issue but it's triggered by Patroni as it will do
> "some kind of switchover" on a regular shutdown.

Sure, but you should be looking at why Postgres can't shut down cleanly.

How are you telling Patroni to switch over? Are you using the Patroni
switchover command via patronictl or the API, or sending a signal to the
Patroni process? I think the explicit switchover command will not behave
this way. It will return a 503 if it can't switch over and will not change
the primary (that is something you can confirm with the Patroni developers).

Brad.

#13 Zwettler Markus (OIZ)
Markus.Zwettler@zuerich.ch
In reply to: Brad Nicholson (#12)
AW: AW: AW: AW: AW: broken backup trail in case of quickly patroni switchback and forth

Let me clarify: "But, it might start killing processes after a certain period if a _fast_ shutdown after SIGTERM didn't happen".

I am talking about stopping the Patroni master process with a systemd script.

From: Brad Nicholson <bradn@ca.ibm.com>
Sent: Friday, 8 November 2019 15:58
To: Zwettler Markus (OIZ) <Markus.Zwettler@zuerich.ch>
Cc: Adrian Klaver <adrian.klaver@aklaver.com>; pgsql-general@lists.postgresql.org
Subject: Re: AW: AW: AW: AW: broken backup trail in case of quickly patroni switchback and forth

#14 Brad Nicholson
bradn@ca.ibm.com
In reply to: Zwettler Markus (OIZ) (#13)
Re: AW: AW: AW: AW: AW: broken backup trail in case of quickly patroni switchback and forth

"Zwettler Markus (OIZ)" <Markus.Zwettler@zuerich.ch> wrote on 2019/11/08 11:02:49 AM:

> Let me clarify: "But, it might start killing processes after a
> certain period if a _fast_ shutdown after SIGTERM didn't happen".
>
> I am talking about stopping the Patroni master process with a systemd
> script.

Use the switchover functionality in Patroni first, and gate your shutdown
via systemd on the success of that operation.

Brad.

#15 Zwettler Markus (OIZ)
Markus.Zwettler@zuerich.ch
In reply to: Brad Nicholson (#14)
AW: AW: AW: AW: AW: AW: broken backup trail in case of quickly patroni switchback and forth

How exactly? Please clarify.

#16 Brad Nicholson
bradn@ca.ibm.com
In reply to: Zwettler Markus (OIZ) (#15)
Re: AW: AW: AW: AW: AW: AW: broken backup trail in case of quickly patroni switchback and forth

"Zwettler Markus (OIZ)" <Markus.Zwettler@zuerich.ch> wrote on 2019/11/08 11:27:00 AM:

> How exactly? Please clarify.

(please don't top post, makes the replies hard to follow)

patronictl switchover <clustername>

follow the prompts

there is also a /switchover API endpoint you can use.

Brad

#17 Zwettler Markus (OIZ)
Markus.Zwettler@zuerich.ch
In reply to: Brad Nicholson (#16)
AW: AW: AW: AW: AW: AW: AW: broken backup trail in case of quickly patroni switchback and forth

I wondered about your "patronictl switchover + systemd" hint. How would you do ("gate") this combination?

Markus

#18 Brad Nicholson
bradn@ca.ibm.com
In reply to: Zwettler Markus (OIZ) (#17)
Re: AW: AW: AW: AW: AW: AW: AW: broken backup trail in case of quickly patroni switchback and forth

"Zwettler Markus (OIZ)" <Markus.Zwettler@zuerich.ch> wrote on 2019/11/08 11:54:14 AM:

> I wondered about your "patronictl switchover + systemd" hint. How
> would you do ("gate") this combination?

Change whatever process you are using today to shut things down so that it
calls the Patroni switchover first, checks error codes, etc.
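
One way to sketch this gating, using the /switchover REST endpoint mentioned earlier in the thread: ask Patroni to switch over and stop the service only if that succeeded. This is an illustration, not a tested procedure. The helper names `request_switchover` and `gated_stop` are invented here, the JSON fields `leader` and `candidate` and the use of HTTP 200 as the only success code are assumptions to verify against the documentation of the Patroni version in use, and connection errors are simply left to propagate.

```python
import json
import urllib.error
import urllib.request


def request_switchover(api_url, leader, candidate=None, timeout=60):
    """POST to Patroni's /switchover endpoint and return the HTTP status code."""
    payload = {"leader": leader}  # assumed field name, check your Patroni docs
    if candidate:
        payload["candidate"] = candidate
    request = urllib.request.Request(
        api_url.rstrip("/") + "/switchover",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # Non-2xx answers arrive as exceptions, e.g. the 503 mentioned above.
        return error.code


def gated_stop(switchover, stop_service):
    """Run the switchover; stop the service only if it reported HTTP 200."""
    status = switchover()
    if status != 200:
        return False  # leave the primary running and investigate first
    stop_service()
    return True
```

With the configuration posted earlier in the thread, `switchover` could be `lambda: request_switchover("http://tstm49003.tstglobal.tst.loc:8010", "pcl_l702@tstm49003")` and `stop_service` a call such as `subprocess.run(["systemctl", "stop", "patroni"], check=True)`, where the unit name `patroni` is a guess.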