BUG #15638: pg_basebackup with --wal-method=stream incorrectly generates WAL segment created during backup
The following bug has been logged on the website:
Bug reference: 15638
Logged by: Andre Piwoni
Email address: apiwoni@webmd.net
PostgreSQL version: 10.5
Operating system: CentOS 7.3
Description:
When a new slave is created by taking a base backup from the primary using
pg_basebackup with the --wal-method=stream option, the WAL file generated
during the backup is different (as compared with diff or cmp) from the one on
the master and in the WAL archive directory. Furthermore, this file does not
exist in pg_wal/archive_status with a .done extension on the new slave, though
it exists in the pg_wal directory, resulting in a failed attempt to archive
this file when the slave node is promoted to master.
2019-02-15 14:15:58.872 PST [5369] DETAIL: The failed archive command was: test ! -f /mnt/pgsql/archive/000000010000000000000002 && cp pg_wal/000000010000000000000002 /mnt/pgsql/archive/000000010000000000000002
pg_basebackup with the --wal-method=fetch option correctly creates this WAL
segment on the slave, both in pg_wal/archive_status with a .done extension and
in the pg_wal directory, and its contents match those on the old primary and
in the WAL archive directory.
Here are the contents of the old primary, the new primary, and the WAL archive:
OLD PRIMARY
/var/lib/pgsql/10/data/pg_wal/:
-rw-------. 1 postgres postgres 16777216 Feb 15 13:39 000000010000000000000001
-rw-------. 1 postgres postgres 16777216 Feb 15 13:40 000000010000000000000002
-rw-------. 1 postgres postgres 302 Feb 15 13:40 000000010000000000000002.00000028.backup
-rw-------. 1 postgres postgres 16777216 Feb 15 13:40 000000010000000000000003
drwx------. 2 postgres postgres 133 Feb 15 13:40 archive_status
/var/lib/pgsql/10/data/pg_wal/archive_status:
-rw-------. 1 postgres postgres 0 Feb 15 13:39 000000010000000000000001.done
-rw-------. 1 postgres postgres 0 Feb 15 13:40 000000010000000000000002.00000028.backup.done
-rw-------. 1 postgres postgres 0 Feb 15 13:40 000000010000000000000002.done
WAL ARCHIVE
-rw-rw-rw-. 1 root root 16777216 Feb 15 13:39 000000010000000000000001
-rw-rw-rw-. 1 root root 16777216 Feb 15 13:40 000000010000000000000002
-rw-rw-rw-. 1 root root 302 Feb 15 13:40 000000010000000000000002.00000028.backup
NEW PRIMARY (created with --wal-method=stream)
/var/lib/pgsql/10/data/pg_wal/:
-rw-------. 1 postgres root 16777216 Feb 15 13:40 000000010000000000000002
-rw-------. 1 postgres postgres 16777216 Feb 15 13:40 000000010000000000000003
drwx------. 2 postgres root 6 Feb 15 13:40 archive_status
/var/lib/pgsql/10/data/pg_wal/archive_status:
total 0
Please note the lack of a 000000010000000000000002.done file in
/var/lib/pgsql/10/data/pg_wal/archive_status.
When using a backup taken with --wal-method=fetch this file is created in
/var/lib/pgsql/10/data/pg_wal/archive_status and the NEW_PRIMARY won't try to
archive the segment, which would fail because it was already archived when the
NEW_PRIMARY was created as a slave of the OLD_PRIMARY.
The consequence of all this is that failover does not work correctly.
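The failure is mechanical: in the reported archive_command, test ! -f exits non-zero once the destination exists, so the whole command fails and the server keeps retrying. A throwaway-shell reproduction of just that shell behavior (temporary paths, not real WAL):

```shell
#!/bin/sh
# Reproduce why "test ! -f dest && cp src dest" fails on a second
# archive attempt: once dest exists, test exits 1 and so does the
# whole command, which PostgreSQL treats as an archiving failure.
archive=$(mktemp -d)
src=$(mktemp)
printf 'wal' > "$src"
dest="$archive/000000010000000000000002"

test ! -f "$dest" && cp "$src" "$dest"
first=$?                              # first attempt: exit 0 (archived)
test ! -f "$dest" && cp "$src" "$dest"
second=$?                             # retry: exit 1 (destination exists)
echo "first=$first second=$second"
```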
On Sat, Feb 16, 2019 at 12:26:13AM +0000, PG Bug reporting form wrote:
When a new slave is created by taking a base backup from the primary using
pg_basebackup with the --wal-method=stream option, the WAL file generated
during the backup is different (as compared with diff or cmp) from the one on
the master and in the WAL archive directory. Furthermore, this file does not
exist in pg_wal/archive_status with a .done extension on the new slave, though
it exists in the pg_wal directory, resulting in a failed attempt to archive
this file when the slave node is promoted to master.
2019-02-15 14:15:58.872 PST [5369] DETAIL: The failed archive command was: test ! -f /mnt/pgsql/archive/000000010000000000000002 && cp pg_wal/000000010000000000000002 /mnt/pgsql/archive/000000010000000000000002
How do you promote your standby? In Postgres 10, the last, partial
WAL segment of a past timeline generated at promotion is renamed
.partial to avoid any conflicts, so this should normally not
happen if you do not use archive_mode = always.
Please note that your archive command is not safe. For one, it does
not sync the archived segment before archive_command returns to the
backend.
--
Michael
I did a quick test using nearly empty databases. I did not do the promote,
just the base backup with the two different methods.
When I did it on macOS and PostgreSQL 11.1, the .done file existed only
under the data directory I created with -X fetch. The WAL files were the same.
When I did it on FreeBSD and PostgreSQL 10.6, the .done file existed only
under the -X fetch directory, and the WAL files were also different. I
don't know whether it is a problem or not, but I could reproduce it on the
first attempt.
This was after the two base backups:
$ pg_basebackup -p 5433 -v -R -P -D 1 -X fetch
$ pg_basebackup -p 5433 -v -R -P -D 2 -X stream
$ diff -ur 1/pg_wal/ 2/pg_wal/
Only in 1/pg_wal/: 00000001000000000000000C
Only in 1/pg_wal/: 00000001000000000000000D
Files 1/pg_wal/00000001000000000000000E and 2/pg_wal/00000001000000000000000E differ
Only in 1/pg_wal/archive_status: 00000001000000000000000C.done
Only in 1/pg_wal/archive_status: 00000001000000000000000D.done
Only in 1/pg_wal/archive_status: 00000001000000000000000E.done
$ less log/1/2019-02-16_19-48-29.log
2019-02-16 19:48:29 CET LOG: database system was interrupted; last known up at 2019-02-16 19:44:45 CET
2019-02-16 19:48:29 CET LOG: entering standby mode
2019-02-16 19:48:29 CET LOG: redo starts at 0/C000028
2019-02-16 19:48:29 CET LOG: consistent recovery state reached at 0/C000130
2019-02-16 19:48:29 CET LOG: database system is ready to accept read only connections
2019-02-16 19:48:29 CET LOG: started streaming WAL from primary at 0/D000000 on timeline 1
$ less log/2/2019-02-16_19-48-34.log
2019-02-16 19:48:34 CET LOG: database system was interrupted; last known up at 2019-02-16 19:45:15 CET
2019-02-16 19:48:34 CET LOG: entering standby mode
2019-02-16 19:48:34 CET LOG: redo starts at 0/E000028
2019-02-16 19:48:34 CET LOG: consistent recovery state reached at 0/E000130
2019-02-16 19:48:34 CET LOG: database system is ready to accept read only connections
2019-02-16 19:48:34 CET LOG: started streaming WAL from primary at 0/F000000 on timeline 1
$ diff -ur 1/base/ 2/base/
Files 1/base/16386/pg_internal.init and 2/base/16386/pg_internal.init differ
I did nothing except for starting the two clusters. There was no activity
on the master. I did not promote.
M.
On 2/16/19 6:25 PM, Michael Paquier wrote:
How do you promote your standby? In Postgres 10, the last, partial
WAL segment of a past timeline generated at promotion is renamed
.partial to avoid any conflicts, so this should normally not
happen if you do not use archive_mode = always.
This issue concerns only the segment that contains WAL records from
pg_basebackup. For example, I have reproduced this issue on the master
branch, getting the following content in the archive directory after standby
promotion:
000000010000000000000001.node1
000000010000000000000002.00000028.backup.node1
000000010000000000000002.node1
000000010000000000000002.node2
000000010000000000000003.node1
000000010000000000000004.node1
000000010000000000000005.partial.node2
000000020000000000000005.node2
00000002.history.node2
Each WAL segment was appended with a .<node> suffix to distinguish its
source. Initially node1 was the master and node2 the standby.
The segment 000000010000000000000002 has the following content:
rmgr: Standby len (rec/tot): 50/50, tx: 0, lsn: 0/02000028, prev 0/015DB060, desc: RUNNING_XACTS nextXid 474 latestCompletedXid 473 oldestRunningXid 474
rmgr: XLOG len (rec/tot): 106/106, tx: 0, lsn: 0/02000060, prev 0/02000028, desc: CHECKPOINT_ONLINE redo 0/2000028; tli 1; prev tli 1; fpw true; xid 0:474; oid 12690; multi 1; offset 0; oldest xid 467 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 474; online
rmgr: XLOG len (rec/tot): 34/34, tx: 0, lsn: 0/020000D0, prev 0/02000060, desc: BACKUP_END 0/2000028
rmgr: XLOG len (rec/tot): 24/24, tx: 0, lsn: 0/020000F8, prev 0/020000D0, desc: SWITCH
On the current state of the master branch, the contents of the two segments
are the same:
$ md5sum 000000010000000000000002.node1 000000010000000000000002.node2
252e3e8cf3a85f218bb467005c565e3c 000000010000000000000002.node1
252e3e8cf3a85f218bb467005c565e3c 000000010000000000000002.node2
But on PG 10.6 the contents differ at their tails:
$ cmp 000000010000000000000002.node1 000000010000000000000002.node2
000000010000000000000002.node1 000000010000000000000002.node2 differ: byte 131073, line 1
while the logical part (the prefix) is the same; the LSN of the last record
(SWITCH) in this segment is 0/02000130, with a length of 24 bytes.
Is this correct behavior? And what would be the correct, canonical form of
an archive command, taking this issue into account?
--
Regards,
Maksim Milyutin
I call pg_ctl -D /var/lib/pgsql/10/data promote to upgrade the slave to
master when failover happens. archive_mode is set to "on", not "always".
I repoint a slave to the new master by stopping it, updating recovery.conf
and restarting it. Let me know if I'm doing it wrong.
I think this problem is created before promotion, when the new slave is
created using pg_basebackup with --wal-method=stream, and manifests when the
actual promotion happens. What I'm trying to say is that the .partial
extension does not seem to be the issue here, but rather the lack of a .done
extension.
Wouldn't you agree that, given the directory contents below, when failover
happens and the new slave is promoted to primary, then with the fetch method
the new master will not archive segment 000000010000000000000002 because it
is marked .done, but with the stream method the segment is not marked .done,
so archiving it will be attempted again? This obviously fails because the
segment has already been archived by the initial master. If you disagree,
can you explain the purpose of the .done extension in the
pg_wal/archive_status directory?
Below are the contents of the directories after the base backup but before
promoting the new slave to master.
PRIMARY
/var/lib/pgsql/10/data/pg_wal/:
total 49156
-rw-------. 1 postgres postgres 16777216 Feb 15 13:39 000000010000000000000001
-rw-------. 1 postgres postgres 16777216 Feb 15 13:40 000000010000000000000002
-rw-------. 1 postgres postgres 302 Feb 15 13:40 000000010000000000000002.00000028.backup
-rw-------. 1 postgres postgres 16777216 Feb 15 13:40 000000010000000000000003
drwx------. 2 postgres postgres 133 Feb 15 13:40 archive_status
/var/lib/pgsql/10/data/pg_wal/archive_status:
total 0
-rw-------. 1 postgres postgres 0 Feb 15 13:39 000000010000000000000001.done
-rw-------. 1 postgres postgres 0 Feb 15 13:40 000000010000000000000002.00000028.backup.done
-rw-------. 1 postgres postgres 0 Feb 15 13:40 000000010000000000000002.done
WAL ARCHIVE
-rw-rw-rw-. 1 root root 16777216 Feb 15 13:39 000000010000000000000001
-rw-rw-rw-. 1 root root 16777216 Feb 15 13:40 000000010000000000000002
-rw-rw-rw-. 1 root root 302 Feb 15 13:40 000000010000000000000002.00000028.backup
NEW SLAVE (stream method)
-rw-------. 1 postgres root 16777216 Feb 15 13:40 000000010000000000000002
-rw-------. 1 postgres postgres 16777216 Feb 15 13:40 000000010000000000000003
drwx------. 2 postgres root 6 Feb 15 13:40 archive_status
/var/lib/pgsql/10/data/pg_wal/archive_status:
total 0
NEW SLAVE (fetch method)
-rw-------. 1 postgres root 16777216 Feb 15 13:24 000000010000000000000002
-rw-------. 1 postgres postgres 16777216 Feb 15 13:24 000000010000000000000003
drwx------. 2 postgres root 43 Feb 15 13:24 archive_status
/var/lib/pgsql/10/data/pg_wal/archive_status:
total 0
-rw-------. 1 postgres root 0 Feb 15 13:24 000000010000000000000002.done
Eventual failover(s) produced this structure on master:
/var/lib/pgsql/10/data/pg_wal/archive_status:
total 0
-rw-------. 1 postgres postgres 0 Feb 15 14:15 000000010000000000000002.ready
Hence attempt to archive again:
2019-02-15 14:15:58.872 PST [5369] DETAIL: The failed archive command was: test ! -f /mnt/pgsql/archive/000000010000000000000002 && cp pg_wal/000000010000000000000002 /mnt/pgsql/archive/000000010000000000000002
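For reference, the purpose of archive_status that Andre asks about can be sketched as follows: the archiver looks for <segment>.ready markers, runs archive_command for each, and renames the marker to <segment>.done on success, so a segment carrying a .done marker is never archived again. A minimal shell simulation of that protocol (throwaway directories and an inlined archive command, not the server's actual code):

```shell
#!/bin/sh
# Simulate the archiver's .ready -> .done protocol with throwaway dirs.
set -e
wal=$(mktemp -d)                       # stand-in for pg_wal
mkdir "$wal/archive_status"
archive=$(mktemp -d)                   # stand-in for the WAL archive

seg=000000010000000000000002
printf 'segment data' > "$wal/$seg"
: > "$wal/archive_status/$seg.ready"   # server: segment ready to archive

for marker in "$wal/archive_status"/*.ready; do
    name=$(basename "$marker" .ready)
    # the configured archive_command runs here:
    if test ! -f "$archive/$name" && cp "$wal/$name" "$archive/$name"; then
        mv "$marker" "$wal/archive_status/$name.done"  # success: mark done
    fi    # on failure the .ready marker stays and the segment is retried
done
```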
--
*Andre Piwoni*
Sr. Software Developer, BI/Database
*Web*MD Health Services
Mobile: 801.541.4722
www.webmdhealthservices.com
By the way, you can assert this structure, including the now-explained diff
in the WAL segment taken during backup, for the fetch vs. stream WAL method
by creating a master and then creating a new slave by taking a base backup
using both methods:
pg_basebackup -h new_slave -D /var/lib/pgsql/10/data -v -P -R -U replication --wal-method=stream
pg_basebackup -h new_slave -D /var/lib/pgsql/10/data -v -P -R -U replication --wal-method=fetch
Please note that to see these differences I don't even have to do a
failover; however, these structural (.done extension) differences seem to
affect subsequent failovers. It seems that the promoted slave uses its own
pg_wal directory with archive_status to determine what should be archived
next.
On Tue, Feb 19, 2019 at 09:25:57AM -0800, Andre Piwoni wrote:
I call pg_ctl -D /var/lib/pgsql/10/data promote to upgrade the slave to
master when failover happens. archive_mode is set to "on", not "always".
I repoint a slave to the new master by stopping it, updating recovery.conf
and restarting it. Let me know if I'm doing it wrong.
As long as you stop the primary cleanly (smart or fast mode), so that the
primary has the chance to send its shutdown checkpoint record to the
standby and to make sure that the standby has flushed the record,
that's safe.
I think this problem is created before promotion, when the new slave is
created using pg_basebackup with --wal-method=stream, and manifests when the
actual promotion happens. What I'm trying to say is that the .partial
extension does not seem to be the issue here, but rather the lack of a .done
extension.
Well, sure. If you begin by reusing an old backup, you risk archiving the
same segment multiple times if you use the same archive location for all
your servers. Since 9.5 this can get even more complicated, as archive_mode
has gained an "always" mode, which makes standbys archive segments while in
recovery too, giving users a switch for more archiving redundancy; this is
useful when working with async standbys across multiple sites. My point is
that this stuff has always worked this way, and people do not actually
complain about the difference made in archive_status/ when using the stream
or fetch methods with pg_basebackup. From what I can see as well, your
archive_command is actually unsafe on several points, so my take is that
you should design it more carefully, or rely on an existing backup solution
developed by experts in PostgreSQL backups.
And no, it is not safe to change a behavior that other people may rely
on heavily for their solutions since pg_basebackup got smarter with
its stream mode.
--
Michael
The reason you may not see a lot of people complaining about fetch vs.
stream mode while using pg_basebackup seems to be largely related to the
way many people implement failover. Namely, when you look at the majority
of failover scripts online, when failover happens these scripts take a
pg_basebackup (directly or via repmgr) on the slave(s), as opposed to using
WAL archiving plus streaming for recovery, which is extremely inefficient
for large databases. See my point here:
https://www.pgpool.net/pipermail/pgpool-general/2019-February/006464.html
This is ridiculous for a WAL-based system, but I suspect a lot of people
are bumping into the same issue, given that you mentioned this behavior has
not changed.
Also, what do you mean that my archive command is not safe? It is straight
from the PostgreSQL docs and other resources:
https://www.postgresql.org/docs/10/continuous-archiving.html
https://www.opsdash.com/blog/postgresql-wal-archiving-backup.html
The PostgreSQL docs do not mandate a separate WAL archive location for each
server. There is a mention of making sure, when cleaning up the WAL archive
location, that all slaves have processed the segments, which makes sense.
I'm not going to use replication slots either, because they require very
sophisticated monitoring and maintenance to ensure database space does not
blow up.
So far I have had no problems doing multiple failovers when creating slaves
with a base backup in fetch mode, but your response is not reassuring. The
only downside I see from the docs is that the primary may have removed a
needed WAL segment during heavy load (default max_wal_size is 1GB), in
which case pg_basebackup would fail.
It seems you agree that a segment created during pg_basebackup with stream
mode, which is not marked with a .done extension, would be archived again.
I also understand that the stream method is the default during backup, so
changing this behavior may have an impact. If there's any free backup
solution you could recommend, please let me know.
Thanks,
Andre Piwoni
On Wed, Feb 20, 2019 at 09:58:12AM -0800, Andre Piwoni wrote:
Also, what do you mean that my archive command is not safe? It is straight
from the PostgreSQL docs and other resources:
https://www.postgresql.org/docs/10/continuous-archiving.html
https://www.opsdash.com/blog/postgresql-wal-archiving-backup.html
The docs are not the best thing on this side... One immediate problem
one could note is that cp does not flush the segment. So if your host
crashes before the OS flushes the copied data then the WAL segment
data is lost. The flush needs to happen before the archive command
returns its status to the server.
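One way to address the flush problem is sketched below as a small shell
function. Everything here is an illustration, not a documented
archive_command: the function name is hypothetical, the temp-file-then-
rename dance is one common pattern, and passing a file name to sync
requires GNU coreutils 8.24 or newer.

```shell
# Hypothetical helper for a flush-safe archive step; a sketch only.
# Usage: archive_wal <path-to-segment> <segment-file-name> <archive-dir>
archive_wal() {
    src="$1"; fname="$2"; dest="$3/$fname"
    [ ! -f "$dest" ] || return 1       # never overwrite an archived segment
    cp "$src" "$dest.tmp" || return 1  # copy under a temporary name first
    sync "$dest.tmp" || return 1       # flush file data before claiming success
                                       # (sync FILE needs GNU coreutils 8.24+)
    mv "$dest.tmp" "$dest" || return 1 # rename into the final name
    sync                               # coarse flush of the rename itself
}
```

A wrapper script invoking archive_wal "$1" "$2" /mnt/pgsql/archive could
then be configured as archive_command = '/path/to/script %p %f'; the
point is only that the data must be durable before the command reports
success to the server.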
It seems you agree that a segment created during pg_basebackup with stream
mode, which is not marked with a .done extension, would be
archived again. I also understand the stream method is the default during
backup, so changing this behavior may have an impact. If there's any free
solution you could recommend for a backup strategy, please let me know.
pgBackRest and repmgr are tools mentioned on the lists.
--
Michael
On 2/19/19 9:08 PM, Andre Piwoni wrote:
By the way, you can verify this structure, including the now-explained
diff in the WAL segment taken during the backup, for the fetch vs. stream
WAL methods, by creating a master and then creating a new slave by taking
a base backup using both methods:

pg_basebackup -h new_slave -D /var/lib/pgsql/10/data -v -P -R -U
replication --wal-method=stream
pg_basebackup -h new_slave -D /var/lib/pgsql/10/data -v -P -R -U
replication --wal-method=fetch
I have verified on different versions of PostgreSQL whether the
conflicting segments received from the master node and from pg_basebackup
--wal-method=stream are identical. On PG11+ they are binary identical.
However, on PG10 and lower the segments are binary different. The logical
contents are the same, while the meaningless tails differ.
In the case of pg_basebackup --wal-method=fetch the segments are entirely
the same, and this case doesn't produce any problem in archiving.
I think backup tools, as well as custom archive scripts, have to take
these points into account.
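Both observations can be checked by hand after taking a base backup.
A sketch only: the segment name and data-directory paths are the
illustrative ones from earlier in the thread, and in practice the two
copies live on different hosts.

```shell
# Compare the segment taken during the backup with the primary's copy,
# and check whether a .done status file was written for it.
SEG=000000010000000000000002
PRIMARY_WAL=/var/lib/pgsql/10/data/pg_wal   # as seen on the primary
STANDBY_WAL=/var/lib/pgsql/10/data/pg_wal   # as seen on the new standby

if cmp -s "$PRIMARY_WAL/$SEG" "$STANDBY_WAL/$SEG"; then
    echo "segment is binary identical (expected on PG11+, or with fetch)"
else
    echo "segment differs (seen on PG10 and older with --wal-method=stream)"
fi

if [ -f "$STANDBY_WAL/archive_status/$SEG.done" ]; then
    echo ".done present: the segment will not be archived again"
else
    echo ".done missing: the segment will be re-archived after promotion"
fi
```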
--
Regards,
Maksim Milyutin
Thank you for checking. The binary diffs I had were small and at the tail,
even though there were no changes made to the database, so it makes sense
that these diffs were logically meaningless.
Were you able to confirm that a slave created with
--wal-method=fetch had a record in the pg_wal/archive_status directory
with a .done extension after running pg_basebackup, and no such record
with --wal-method=stream? I'm also wondering if PG11 has the same behavior.
On 2/21/19 9:57 PM, Andre Piwoni wrote:
Were you able to confirm that a slave created with
--wal-method=fetch had a record in the pg_wal/archive_status directory
with a .done extension after running pg_basebackup, and no such record
with --wal-method=stream? I'm also wondering if PG11 has the same behavior.
Yes, that's so. The same behavior is present on PG11 and on the master branch.
--
Regards,
Maksim Milyutin