Replication failure, slave requesting old segments
Dear Experts,
I recently set up replication for the first time. It seemed to be
working OK in my initial tests, but didn't cope when the slave was
down for a longer period. This is all with the Debian stable
packages of PostgreSQL 9.6. My replication setup involves a third
server, "backup", to and from which WAL files are copied using scp
(and essentially never deleted), plus streaming replication in
"hot standby" mode.
On the master, I have:
wal_level = replica
archive_mode = on
archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
scp %p backup:backup/postgresql/archivedir/%f'
On the slave I have:
standby_mode = 'on'
primary_conninfo = 'user=postgres host=master port=5432'
restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
hot_standby = on
This all seemed to work OK until the slave was shut down for a few days.
(During this time the master was also down for a shorter period.)
When the slave came back up it successfully copied several WAL files
from the backup server until it reached one that hasn't been created yet:
2018-08-10 22:37:38.322 UTC [615] LOG: restored log file "0000000100000007000000CC" from archive
2018-08-10 22:38:02.011 UTC [615] LOG: restored log file "0000000100000007000000CD" from archive
2018-08-10 22:38:29.802 UTC [615] LOG: restored log file "0000000100000007000000CE" from archive
2018-08-10 22:38:55.973 UTC [615] LOG: restored log file "0000000100000007000000CF" from archive
2018-08-10 22:39:24.964 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
At this point there was a temporary problem due to a firewall rule that
blocked the replication connection to the master:
2018-08-10 22:41:35.865 UTC [1289] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
The slave then entered a loop, retrying to scp the next file from the backup
archive (and failing) and connecting to the master for streaming replication
(and also failing). That is as expected, except for the "invalid magic number":
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-10 22:41:37.018 UTC [615] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
2018-08-10 22:43:46.936 UTC [1445] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
Note that it's complaining about an invalid magic number in log segment D1, which is
the one that has just failed to scp. That looked suspicious to me and I remembered
notes in the docs about ensuring that the archive and/or restore commands return an
error exit status when they fail; I checked that and yes scp does exit(1) when the
requested file doesn't exist:
$ scp backup:/egheriugherg /tmp/a
scp: /egheriugherg: No such file or directory
$ echo $?
1
Anyway, the slave continued in its retrying loop as expected, except that each time
it re-fetched the previous segment, D0; this seems a bit peculiar:
2018-08-10 22:44:17.796 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-10 22:46:28.727 UTC [1628] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
(Note the "invalid magic number" was only reported the first time.)
Eventually I fixed the networking issue that had prevented the replication connection
to master and it could start streaming:
2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
So here is the real problem which worries me: it seems to ask the master for
segment D0, which is one that it did successfully restore from the backup earlier
on, and the master - not unexpectedly - reports that it has already been removed.
I tried restarting the slave:
2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
2018-08-11 00:12:16.068 UTC [7953] LOG: database system is ready to accept read only connections
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:12:17.057 UTC [7954] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
2018-08-11 00:12:17.242 UTC [8034] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:12:17.242 UTC [8034] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
Note that again it failed to get D1 from the backup, as expected, and then
reports "invalid magic number" in that file, and again tries to start
streaming from a segment that the master doesn't have.
Eventually I decided to try and cause the D1 file to exist in the backup:
db=> create table waste_space as select * from large_table;
That caused four new WAL segments to be copied to the backup, and restored
by the slave:
2018-08-11 00:55:49.741 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
2018-08-11 00:56:12.304 UTC [7954] LOG: restored log file "0000000100000007000000D1" from archive
2018-08-11 00:56:35.481 UTC [7954] LOG: restored log file "0000000100000007000000D2" from archive
2018-08-11 00:56:57.443 UTC [7954] LOG: restored log file "0000000100000007000000D3" from archive
2018-08-11 00:57:21.723 UTC [7954] LOG: restored log file "0000000100000007000000D4" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
2018-08-11 00:57:22.915 UTC [7954] LOG: unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
2018-08-11 00:57:23.114 UTC [12348] LOG: started streaming WAL from primary at 7/D5000000 on timeline 1
Note the "unexpected pageaddr" message. Despite that, the streaming started
and everything was working correctly. Hurray!
I remain concerned, however, that something bad happened and I'd like to
understand it. Can anyone help?
Thanks,
Phil.
On 08/11/2018 08:18 AM, Phil Endecott wrote:
I remain concerned, however, that something bad happened and I'd like to
understand it. Can anyone help?
Looks like the master recycled the WAL's while the slave could not connect.
Might want to take a look at replication slots to keep WAL's from being
recycled:
https://www.postgresql.org/docs/9.6/static/warm-standby.html#STREAMING-REPLICATION-SLOTS
with the heads up that slots do not limit the space taken up by WAL's.
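A minimal sketch of what that looks like on 9.6 (the slot name is just an
example):
-- on the master, with max_replication_slots set above 0 in postgresql.conf
SELECT pg_create_physical_replication_slot('standby1');
# in recovery.conf on the slave, alongside primary_conninfo
primary_slot_name = 'standby1'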
The "unexpected pageaddr" probably is for the same reason, per comments in
xlogreader.c:
"
/*
* Check that the address on the page agrees with what we expected.
* This check typically fails when an old WAL segment is recycled,
* and hasn't yet been overwritten with new data yet.
*/
...
report_invalid_record(state,
"unexpected pageaddr %X/%X in log segment %s,
offset %u",
(uint32) (hdr->xlp_pageaddr >> 32), (uint32) hdr->xlp_pageaddr,
fname,
offset);
"
--
Adrian Klaver
adrian.klaver@aklaver.com
Hi Adrian,
Adrian Klaver wrote:
Looks like the master recycled the WAL's while the slave could not connect.
Yes but... why is that a problem? The master is copying the WALs to
the backup server using scp, where they remain forever. The slave gets
them from there before it starts streaming. So it shouldn't matter
if the master recycles them, as the slave should be able to get everything
using the combination of scp and then streaming.
Am I missing something about how this sort of replication is supposed to
work?
Thanks, Phil.
On 08/11/2018 12:42 PM, Phil Endecott wrote:
Hi Adrian,
Adrian Klaver wrote:
Looks like the master recycled the WAL's while the slave could not
connect.
Yes but... why is that a problem? The master is copying the WALs to
the backup server using scp, where they remain forever. The slave gets
To me it looks like that did not happen:
2018-08-11 00:05:50.364 UTC [615] LOG: restored log file
"0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file
or directory
2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from
primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from
WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has
already been removed
Above 0000000100000007000000D0 is gone/recycled on the master and the
archived version does not seem to be complete as the streaming
replication is trying to find it.
Below you kick the master and it coughs up the files to the archive
including *D0 and *D1 on up to *D4 and then the streaming picks using *D5.
2018-08-11 00:55:49.741 UTC [7954] LOG: restored log file
"0000000100000007000000D0" from archive
2018-08-11 00:56:12.304 UTC [7954] LOG: restored log file
"0000000100000007000000D1" from archive
2018-08-11 00:56:35.481 UTC [7954] LOG: restored log file
"0000000100000007000000D2" from archive
2018-08-11 00:56:57.443 UTC [7954] LOG: restored log file
"0000000100000007000000D3" from archive
2018-08-11 00:57:21.723 UTC [7954] LOG: restored log file
"0000000100000007000000D4" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file
or directory
2018-08-11 00:57:22.915 UTC [7954] LOG: unexpected pageaddr 7/C7000000
in log segment 0000000100000007000000D5, offset 0
2018-08-11 00:57:23.114 UTC [12348] LOG: started streaming WAL from
primary at 7/D5000000 on timeline 1
Best guess is the archiving did not work as expected during:
"(During this time the master was also down for a shorter period.)"
them from there before it starts streaming. So it shouldn't matter
if the master recycles them, as the slave should be able to get everything
using the combination of scp and then streaming.
Am I missing something about how this sort of replication is supposed to
work?
Thanks, Phil.
--
Adrian Klaver
adrian.klaver@aklaver.com
Greetings,
* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
Dear Experts,
Since you're asking ...
I recently set up replication for the first time. It seemed to be
working OK in my initial tests, but didn't cope when the slave was
down for a longer period. This is all with the Debian stable
packages of PostgreSQL 9.6. My replication setup involves a third
server, "backup", to and from which WAL files are copied using scp
(and essentially never deleted), plus streaming replication in
"hot standby" mode.
On the master, I have:
wal_level = replica
archive_mode = on
archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
scp %p backup:backup/postgresql/archivedir/%f'
This is really not a sufficient or particularly intelligent
archive_command. In general, I'd strongly recommend against trying to
roll your own archiving/backup solution for PostgreSQL. In particular,
the above is quite expensive, doesn't address the case where a file is
only partially copied to the backup server, and doesn't ensure that the
file is completely written and fsync'd on the backup server meaning that
a failure will likely result in WAL going missing. There's much better
solutions out there and which you really should be using instead of
trying to build your own. In particular, my preference is pgbackrest
(though I'm one of the folks in its development, to be clear), but
there's other options such as barman or WAL-E which I believe also
address the concerns raised above.
On the slave I have:
standby_mode = 'on'
primary_conninfo = 'user=postgres host=master port=5432'
restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
hot_standby = on
This all seemed to work OK until the slave was shut down for a few days.
(During this time the master was also down for a shorter period.)
When the slave came back up it successfully copied several WAL files
from the backup server until it reached one that hasn't been created yet:
That all sounds fine.
2018-08-10 22:37:38.322 UTC [615] LOG: restored log file "0000000100000007000000CC" from archive
2018-08-10 22:38:02.011 UTC [615] LOG: restored log file "0000000100000007000000CD" from archive
2018-08-10 22:38:29.802 UTC [615] LOG: restored log file "0000000100000007000000CE" from archive
2018-08-10 22:38:55.973 UTC [615] LOG: restored log file "0000000100000007000000CF" from archive
2018-08-10 22:39:24.964 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
At this point there was a temporary problem due to a firewall rule that
blocked the replication connection to the master:
2018-08-10 22:41:35.865 UTC [1289] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
Well, that's certainly not good.
The slave then entered a loop, retrying to scp the next file from the backup
archive (and failing) and connecting to the master for streaming replication
(and also failing). That is as expected, except for the "invalid magic number":
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-10 22:41:37.018 UTC [615] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
2018-08-10 22:43:46.936 UTC [1445] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
Note that the invalid magic number error is just LOG-level, and that
PostgreSQL will certainly ask for WAL files which don't exist in the
archive yet.
Note that it's complaining about an invalid magic number in log segment D1, which is
the one that has just failed to scp. That looked suspicious to me and I remembered
notes in the docs about ensuring that the archive and/or restore commands return an
error exit status when they fail; I checked that and yes scp does exit(1) when the
requested file doesn't exist:
Sure, but scp doesn't produce any output when it's successful, so are
you sure that an scp wasn't performed after the "no such file or
directory" log message, one which perhaps performed a partial copy
(see above about *exactly* that concern being mentioned by me, even
before getting down to this part, due to using these simple unix
commands...)?
$ scp backup:/egheriugherg /tmp/a
scp: /egheriugherg: No such file or directory
$ echo $?
1
scp has no idea if it made a copy of a partial file though.
Anyway, the slave continued in its retrying loop as expected, except that each time
it re-fetched the previous segment, D0; this seems a bit peculiar:
2018-08-10 22:44:17.796 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-10 22:46:28.727 UTC [1628] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
Having it go back to D0 is at least a bit interesting. I wonder if
somehow it was partial.. Have you looked at the files on the archive
server to make sure they're all 16MB and the same size?
(Note the "invalid magic number" was only reported the first time.)
Part of the issue here is that there's no logging happening with these
scp commands, so we don't really know what happened, exactly. I'm
wondering if perhaps the scp has *just* started copying the file from
the primary to the backup server when the scp from the backup server to
the replica started and it ended up copying a zero-byte file, though if
that was the case then the next time around the file should have been
there, unless there was some network hiccup between the primary and the
backup server that caused the copy to fail...
Eventually I fixed the networking issue that had prevented the replication connection
to master and it could start streaming:
2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
So here is the real problem which worries me: it seems to ask the master for
segment D0, which is one that it did successfully restore from the backup earlier
on, and the master - not unexpectedly - reports that it has already been removed.
If D0 has been removed from the primary then it should be on the backup
server, right? Are you sure it actually is, and that it's a complete
segment? I'm starting to think that you're in exactly the predicament
that I mentioned above- maybe D0 was a partial copy to the backup
server, and that failed and was returned to PG as failed but then it
tried to run again and the stupid 'test' command returned success
because the file exists, even though it wasn't complete...
I tried restarting the slave:
2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
2018-08-11 00:12:16.068 UTC [7953] LOG: database system is ready to accept read only connections
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:12:17.057 UTC [7954] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
2018-08-11 00:12:17.242 UTC [8034] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:12:17.242 UTC [8034] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
Note that again it failed to get D1 from the backup, as expected, and then
reports "invalid magic number" in that file, and again tries to start
streaming from a segment that the master doesn't have.
Eventually I decided to try and cause the D1 file to exist in the backup:
db=> create table waste_space as select * from large_table;
That caused four new WAL segments to be copied to the backup, and restored
by the slave:
2018-08-11 00:55:49.741 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
2018-08-11 00:56:12.304 UTC [7954] LOG: restored log file "0000000100000007000000D1" from archive
2018-08-11 00:56:35.481 UTC [7954] LOG: restored log file "0000000100000007000000D2" from archive
2018-08-11 00:56:57.443 UTC [7954] LOG: restored log file "0000000100000007000000D3" from archive
2018-08-11 00:57:21.723 UTC [7954] LOG: restored log file "0000000100000007000000D4" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
2018-08-11 00:57:22.915 UTC [7954] LOG: unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
2018-08-11 00:57:23.114 UTC [12348] LOG: started streaming WAL from primary at 7/D5000000 on timeline 1
Note the "unexpected pageaddr" message. Despite that, the streaming started
and everything was working correctly. Hurray!
So, just to be clear, those LOG-level messages aren't generally an
issue, they're just LOGs. We expect to see them.
I remain concerned, however, that something bad happened and I'd like to
understand it. Can anyone help?
Really hard to tell from this. On the one hand, it seems like maybe the
primary was done with D0, but hadn't finished with D1 yet, but the
replica hadn't made that transition from D0 -> D1, hence why it was
still asking for D0. I'm certainly concerned that there might have
been some amount of partial-WAL-file copying happening, but PG is pretty
good at dealing with a variety of odd cases and so I tend to doubt that
the replica actually ended up being corrupted, but I'll again strongly
advocate for using a proven solution which avoids these risks by doing
things like copying into temporary files on the backup server and then
using an atomic 'mv' to make sure that anyone reading from the archive
wouldn't ever see a partially-written file, and, just to double-check,
also calculate a hash of every WAL file written into the archive and
re-verify that hash when reading it back out. That's what we do in
pgbackrest, at least.
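To make that concrete, here is a rough sketch only (this is *not* pgbackrest,
and the script name and paths are invented) of an archiver that does the
temp-file-plus-rename-plus-checksum dance on the backup server; the master's
archive_command would become something like
'cat %p | ssh backup archive-wal.sh %f':
#!/bin/sh
# archive-wal.sh <segment-name>  -- hypothetical wrapper on the backup server,
# reads the segment on stdin; paths are illustrative only.
set -e
dir=$HOME/backup/postgresql/archivedir
seg="$1"
test ! -f "$dir/$seg"                # still refuse to overwrite a segment
tmp="$dir/.$seg.partial"
cat > "$tmp"                         # write incoming data under a temp name
sha256sum < "$tmp" > "$tmp.sha256"   # checksum, re-verified at restore time
sync "$tmp" "$tmp.sha256"            # flush file data (GNU coreutils sync
                                     # accepts file arguments since 8.24)
mv "$tmp.sha256" "$dir/$seg.sha256"
mv "$tmp" "$dir/$seg"                # atomic rename into the final name
sync "$dir"                          # flush the directory entry too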
Thanks!
Stephen
Hi Stephen,
Stephen Frost wrote:
* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
scp %p backup:backup/postgresql/archivedir/%f'
This is really not a sufficient or particularly intelligent
archive_command. In general, I'd strongly recommend against trying to
roll your own archiving/backup solution for PostgreSQL. In particular,
the above is quite expensive, doesn't address the case where a file is
only partially copied to the backup server, and doesn't ensure that the
file is completely written and fsync'd on the backup server meaning that
a failure will likely result in WAL going missing. There's much better
solutions out there and which you really should be using instead of
trying to build your own. In particular, my preference is pgbackrest
(though I'm one of the folks in its development, to be clear), but
there's other options such as barman or WAL-E which I believe also
address the concerns raised above.
Hmmm, well I based it on what I read in the documentation:
https://www.postgresql.org/docs/9.6/static/continuous-archiving.html
The expense is not an issue in this case. I would be surprised if
scp were to exit(0) after only partially copying a file but perhaps
you know something I don't about that.
I know that rsync creates a temporary file and moves it into place when
all the data has been transferred, which scp doesn't do; perhaps that
would be safer.
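For example (only a guess at what that would look like, and as far as I know
rsync still doesn't fsync anything, so it would only address the partial-file
problem):
archive_command = 'rsync --ignore-existing %p backup:backup/postgresql/archivedir/%f'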
The master was only writing WAL files maybe 3 or 4 times per day at
the time, so any scenario that requires its crash to have occurred
exactly as it was doing an scp involves a substantial coincidence.
I am limited in what I can run on the backup server.
Having it go back to D0 is at least a bit interesting. I wonder if
somehow it was partial.. Have you looked at the files on the archive
server to make sure they're all 16MB and the same size?
Yes, all the files on the backup including D0 were 16 MB. But:
2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
Are the last two log lines above telling us anything useful? Is that
saying that, of the 16 MB (0x1000000 byte) WAL file it restored only as
far as byte 0xf956c0 or 0xfff088? Is that what we would expect? Is it
trying to use the streaming connection to get the missing bytes from
FFFF088 to FFFFFFFF? Is that just an empty gap at the end of the file
due to the next record being too big to fit?
db=> create table waste_space as select * from large_table;
That caused four new WAL segments to be copied to the backup, and restored
by the slave:
Note again that this was sufficient to make it start working again,
which I think is important.
Really hard to tell from this. On the one hand, it seems like maybe the
primary was done with D0, but hadn't finished with D1 yet, but the
replica hadn't made that transition from D0 -> D1, hence why it was
still asking for D0.
Yes. What is actually supposed to happen? What exactly do you mean by
"make the transition"?
I'm certainly concerned that there might have
been some amount of partial-WAL-file copying happening, but PG is pretty
good at dealing with a variety of odd cases and so I tend to doubt that
the replica actually ended up being corrupted, but I'll again strongly
advocate for using a proven solution which avoids these risks by doing
things like copying into temporary files on the backup server and then
using an atomic 'mv' to make sure that anyone reading from the archive
wouldn't ever see a partially-written file, and, just to double-check,
also calculate a hash of every WAL file written into the archive and
re-verify that hash when reading it back out. That's what we do in
pgbackrest, at least.
Noted. But I don't see any evidence that that was actually the root
cause here, and it relies on an unlikely timing coincidence. This is
the very first time that I've had a replica recover after a significant
downtime, and it failed.
Regards, Phil.
Hi Adrian,
Adrian Klaver wrote:
On 08/11/2018 12:42 PM, Phil Endecott wrote:
Hi Adrian,
Adrian Klaver wrote:
Looks like the master recycled the WAL's while the slave could not
connect.
Yes but... why is that a problem? The master is copying the WALs to
the backup server using scp, where they remain forever. The slave gets
To me it looks like that did not happen:
2018-08-11 00:05:50.364 UTC [615] LOG: restored log file
"0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file
or directory
2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from
primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from
WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has
already been removed
Above 0000000100000007000000D0 is gone/recycled on the master and the
archived version does not seem to be complete as the streaming
replication is trying to find it.
The files on the backup server were all 16 MB.
Below you kick the master and it coughs up the files to the archive
including *D0 and *D1 on up to *D4 and then the streaming picks using *D5.
When I kicked it, the master wrote D1 to D4 to the backup. It did not
change D0 (its modification time on the backup is from before the "kick").
The slave re-read D0, again, as it had been doing throughout this period,
and then read D1 to D4.
Best guess is the archiving did not work as expected during:
"(During this time the master was also down for a shorter period.)"
Around the time the master was down, the WAL segment names were CB and CC.
Files CD to CF were written between the master coming up and the slave
coming up. The slave had no trouble restoring those segments when it started.
The problematic segments D0 and D1 were the ones that were "current"
when the
slave restarted, at which time the master was up consistently.
Regards, Phil.
Phil Endecott wrote:
On the master, I have:
wal_level = replica
archive_mode = on
archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
scp %p backup:backup/postgresql/archivedir/%f'
On the slave I have:
standby_mode = 'on'
primary_conninfo = 'user=postgres host=master port=5432'
restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
hot_standby = on
2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
I am wondering if I need to set wal_keep_segments to at least 1 or 2 for
this to work. I currently have it unset and I believe the default is 0.
My understanding was that when using archive_command/restore_command to copy
WAL segments it would not be necessary to use wal_keep_segments to retain
files in pg_xlog on the server; the slave can get everything using a
combination of copying files using the restore_command and streaming.
But these lines from the log:
2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
make me think that there is an issue when the slave reaches the end of the
copied WAL file. I speculate that the useful content of this WAL segment
ends at FFF088, which is followed by an empty gap due to record sizes. But
the slave tries to start streaming from this point, D0FFF088, not D1000000.
If the master still had a copy of segment D0 then it would be able to stream
this gap followed by the real content in the current segment D1.
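To put numbers on that speculation: segment D0 covers 7/D0000000 up to
7/D1000000, i.e. 0x1000000 = 16 MB, so if the usable content really does end
at 7/D0FFF088 that leaves 0x1000000 - 0xFFF088 = 0xF78 = 3960 bytes of gap
before the segment boundary.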
Does that make any sense at all?
Regards, Phil.
On 08/12/2018 12:25 PM, Phil Endecott wrote:
Hi Adrian,
Adrian Klaver wrote:
On 08/11/2018 12:42 PM, Phil Endecott wrote:
Hi Adrian,
Adrian Klaver wrote:
Looks like the master recycled the WAL's while the slave could not
connect.
Yes but... why is that a problem? The master is copying the WALs to
the backup server using scp, where they remain forever. The slave gets
To me it looks like that did not happen:
2018-08-11 00:05:50.364 UTC [615] LOG: restored log file
"0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such
file or directory
2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from
primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from
WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has
already been removed
Above 0000000100000007000000D0 is gone/recycled on the master and the
archived version does not seem to be complete as the streaming
replication is trying to find it.
The files on the backup server were all 16 MB.
WAL files are created/recycled as 16 MB files, which is not the same as
saying they are complete for the purposes of restoring. In other words
you could be looking at a 16 MB file full of 0's.
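One way to check what is really in the archived copy (assuming pg_xlogdump
from 9.6 is installed somewhere that can read the file; the path is just an
example):
pg_xlogdump /path/to/archivedir/0000000100000007000000D0 | tail -n 3
A segment that is nothing but recycled zeros should fail immediately rather
than list records.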
Below you kick the master and it coughs up the files to the archive
including *D0 and *D1 on up to *D4 and then the streaming picks using
*D5.
When I kicked it, the master wrote D1 to D4 to the backup. It did not
change D0 (its modification time on the backup is from before the "kick").
The slave re-read D0, again, as it had been doing throughout this period,
and then read D1 to D4.
Well something happened because the slave could not get all the
information it needed from the D0 in the archive and was trying to get
it from the master's pg_xlog.
Best guess is the archiving did not work as expected during:
"(During this time the master was also down for a shorter period.)"
Around the time the master was down, the WAL segment names were CB and CC.
Files CD to CF were written between the master coming up and the slave
coming up. The slave had no trouble restoring those segments when it
started.
The problematic segments D0 and D1 were the ones that were "current"
when the
slave restarted, at which time the master was up consistently.
Regards, Phil.
--
Adrian Klaver
adrian.klaver@aklaver.com
On 08/12/2018 12:53 PM, Phil Endecott wrote:
Phil Endecott wrote:
On the master, I have:
wal_level = replica
archive_mode = on
archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
scp %p backup:backup/postgresql/archivedir/%f'
On the slave I have:
standby_mode = 'on'
primary_conninfo = 'user=postgres host=master port=5432'
restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
hot_standby = on
2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
I am wondering if I need to set wal_keep_segments to at least 1 or 2 for
this to work. I currently have it unset and I believe the default is 0.
Given that WAL's are only 16 MB I would probably bump it up to be on the
safe side, or use:
https://www.postgresql.org/docs/9.6/static/warm-standby.html
26.2.6. Replication Slots
Though the above does not limit storage of WAL's, so a long outage could
result in WAL's piling up.
My understanding was that when using archive_command/restore_command to copy
WAL segments it would not be necessary to use wal_keep_segments to retain
files in pg_xlog on the server; the slave can get everything using a
combination of copying files using the restore_command and streaming.
But these lines from the log:
2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
make me think that there is an issue when the slave reaches the end of the
copied WAL file. I speculate that the useful content of this WAL segment
ends at FFF088, which is followed by an empty gap due to record sizes. But
the slave tries to start streaming from this point, D0FFF088, not D1000000.
If the master still had a copy of segment D0 then it would be able to stream
this gap followed by the real content in the current segment D1.
Does that make any sense at all?
Regards, Phil.
--
Adrian Klaver
adrian.klaver@aklaver.com
Greetings,
* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
Stephen Frost wrote:
* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
scp %p backup:backup/postgresql/archivedir/%f'
This is really not a sufficient or particularly intelligent
archive_command. In general, I'd strongly recommend against trying to
roll your own archiving/backup solution for PostgreSQL. In particular,
the above is quite expensive, doesn't address the case where a file is
only partially copied to the backup server, and doesn't ensure that the
file is completely written and fsync'd on the backup server meaning that
a failure will likely result in WAL going missing. There's much better
solutions out there and which you really should be using instead of
trying to build your own. In particular, my preference is pgbackrest
(though I'm one of the folks in its development, to be clear), but
there's other options such as barman or WAL-E which I believe also
address the concerns raised above.
Hmmm, well I based it on what I read in the documentation:
https://www.postgresql.org/docs/9.6/static/continuous-archiving.html
The expense is not an issue in this case. I would be surprised if
scp were to exit(0) after only partially copying a file but perhaps
you know something I don't about that.
The scp might not, but so what? If a partial file remains on the far
side, then a subsequent invocation of that archive_command will come
back saying that the file is there. Further, the file isn't going to
actually be sync'd to the filesystem on the far side, so even if the scp
returns success, a failure on the backup server could result in loss of
the WAL file.
I know that rsync creates a temporary file and moves it into place when
all the data has been transferred, which scp doesn't do; perhaps that
would be safer.
No, what would be better is using a tool which will actually fsync() the
file on the backup server after writing it out there, making sure it's
been stored to the filesystem, and then to perform an atomic mv
operation to put it in place, and then to fsync() the directory to
ensure that change has also been sync'd to disk.
Even better is to calculate a hash of the file and to reverify that on
restore, to make sure that it doesn't end up getting corrupted on the
backup server somehow.
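On the restore side, continuing the same made-up sketch, a wrapper would
re-check that hash before handing the segment to PostgreSQL, e.g. with
restore_command = 'restore-wal.sh %f %p' (again, the script name and layout
are illustrative only):
#!/bin/sh
# restore-wal.sh <segment-name> <destination-path>
set -e
seg="$1"; dest="$2"
scp "backup:backup/postgresql/archivedir/$seg" "$dest.tmp"
scp "backup:backup/postgresql/archivedir/$seg.sha256" "$dest.sha256"
# recompute the checksum and compare with the one recorded at archive time
sha256sum < "$dest.tmp" | cmp -s - "$dest.sha256"
rm -f "$dest.sha256"
mv "$dest.tmp" "$dest"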
The master was only writing WAL files maybe 3 or 4 times per day at
the time, so any scenario that requires its crash to have occurred
exactly as it was doing an scp involves a substantial coincidence.
While I get that it's unlikely to be relevant to this particular case, I
wouldn't recommend depending on never having that coincidence happen.
What I've seen quite a bit of, just so you know, is that things tend to
all break down at about the same time- so right when something bad is
happening on the primary, the backup server runs into issues too. This
might be due to networking breaking down in a particular area all at
once or shared storage having some issue but it's happened often enough
that relying on uncommon things to avoid each other is something I've
got a hard time doing these days.
I am limited in what I can run on the backup server.
You could consider backing up locally using a tool which is built for
PostgreSQL and then sync'ing the results of that backup over to the
backup server, perhaps. Or point out that you need a backup server
where you don't have those restrictions in order to properly have
backups. Or use an external service like s3 (or a clone) to back up to.
Having it go back to D0 is at least a bit interesting. I wonder if
somehow it was partial.. Have you looked at the files on the archive
server to make sure they're all 16MB and the same size?
Yes, all the files on the backup including D0 were 16 MB. But:
Ok. Did you see any failures on the sending side when archive_command
was being run? PG shouldn't start trying to archive the file until it's
done with it, of course, so unless something failed right at the end of
the scp and left the file at 16MB but it wasn't actually filled..
2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
Are the last two log lines above telling us anything useful? Is that
saying that, of the 16 MB (0x1000000 byte) WAL file it restored only as
far as byte 0xf956c0 or 0xfff088? Is that what we would expect? Is it
trying to use the streaming connection to get the missing bytes from
FFFF088 to FFFFFFFF? Is that just an empty gap at the end of the file
due to the next record being too big to fit?
The short answer is that, yes, the next record was likely too big to
fit, but that's what the replica was trying to figure out and couldn't
because D0 was gone from the primary already. One thing which should
help this would be to use physical replication slots on the primary,
which would keep it from throwing away WAL files until it knows the
replica has them, but that runs the risk of ending up with lots of extra
WAL on the primary if the replica is gone for a while. If you'd prefer
to avoid that then having wal_keep_segments set to '1' would avoid this
particular issue as well, I'd expect.
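Concretely, that would just be this in the primary's postgresql.conf (it can
be changed with a reload, no restart needed):
wal_keep_segments = 1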
I do wonder if perhaps we should just default to having it as '1' to
avoid exactly this case, as it seems like perhaps PG archived D0 and
then flipped to D1 and got rid of D0, which is all pretty reasonable,
except that a replica trying to catch up is going to end up asking for
D0 from the primary because it didn't know if there was anything else
that should have been in D0..
db=> create table waste_space as select * from large_table;
That caused four new WAL segments to be copied to the backup, and
restored by the slave:
Note again that this was sufficient to make it start working again,
which I think is important.
I agree that it's certainly interesting though presumably what ends up
happening is that the replica finally gets to D1 and is then able to
connect to the primary before the primary has finished with D1 and it's
able to start tracking along with the primary.
Really hard to tell from this. On the one hand, it seems like maybe the
primary was done with D0, but hadn't finished with D1 yet, but the
replica hadn't made that transition from D0 -> D1, hence why it was
still asking for D0.
Yes. What is actually supposed to happen? What exactly do you mean by
"make the transition"?
Simply moving along the WAL stream to get to the D1 WAL file. This all
more-or-less looks like what's "supposed to happen" except that the lack
of additional writes to the primary meant that things appeared to get
"stuck" here, and then got "unstuck" once you created some data. Looks
like this might be a bit of an edge case that we aren't handling very
cleanly.
I'm certainly concerned that there might have
been some amount of partial-WAL-file copying happening, but PG is pretty
good at dealing with a variety of odd cases and so I tend to doubt that
the replica actually ended up being corrupted, but I'll again strongly
advocate for using a proven solution which avoids these risks by doing
things like copying into temporary files on the backup server and then
using an atomic 'mv' to make sure that anyone reading from the archive
wouldn't ever see a partially-written file, and, just to double-check,
also calculate a hash of every WAL file written into the archive and
re-verify that hash when reading it back out. That's what we do in
pgbackrest, at least.
Noted. But I don't see any evidence that that was actually the root
cause here, and it relies on an unlikely timing coincidence. This is
the very first time that I've had a replica recover after a significant
downtime, and it failed.
Failed is really over-stating things, isn't it? The replica caught all
the way up to D0 just fine, but it just couldn't get past that because
the primary had removed D0 already, and the replica couldn't actually
reconnect to the primary to start streaming again, but it's not like it
would have been all *that* far behind the primary since the primary was
still working on D1, and once the primary did write a few things, the
replica was able to start streaming again.
If this can be reproduced and then a fix put in place for it, I'm all
for that, to be clear, but I don't see this as a particularly big issue
or a systemic problem with the way WAL replay and replication work.
Thanks!
Stephen
Stephen Frost wrote:
* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
Stephen Frost wrote:
* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
Are the last two log lines above telling us anything useful? Is that
saying that, of the 16 MB (0x1000000 byte) WAL file it restored only as
far as byte 0xf956c0 or 0xfff088? Is that what we would expect? Is it
trying to use the streaming connection to get the missing bytes from
FFFF088 to FFFFFFFF? Is that just an empty gap at the end of the file
due to the next record being too big to fit?
The short answer is that, yes, the next record was likely too big to
fit, but that's what the replica was trying to figure out and couldn't
because D0 was gone from the primary already. One thing which should
help this would be to use physical replication slots on the primary,
which would keep it from throwing away WAL files until it knows the
replica has them, but that runs the risk of ending up with lots of extra
WAL on the primary if the replica is gone for a while. If you'd prefer
to avoid that then having wal_keep_segments set to '1' would avoid this
particular issue as well, I'd expect.
I do wonder if perhaps we should just default to having it as '1' to
avoid exactly this case, as it seems like perhaps PG archived D0 and
then flipped to D1 and got rid of D0, which is all pretty reasonable,
except that a replica trying to catch up is going to end up asking for
D0 from the primary because it didn't know if there was anything else
that should have been in D0..
OK. I think this is perhaps a documentation bug, maybe a missing
warning when the master reads its configuration, and maybe (as you say)
a bad default value.
Specifically, section 26.2.5 of the docs says:
"If you use streaming replication without file-based continuous archiving,
the server might recycle old WAL segments before the standby has received
them. If this occurs, the standby will need to be reinitialized from a new
base backup. You can avoid this by setting wal_keep_segments to a value
large enough to ensure that WAL segments are not recycled too early, or by
configuring a replication slot for the standby. If you set up a WAL archive
that's accessible from the standby, these solutions are not required, since
the standby can always use the archive to catch up provided it retains enough
segments."
OR, maybe the WAL reader that processes the files that restore_command fetches
could be smart enough to realise that it can skip over the gap at the end?
Anyway. Do others agree that my issue was the result of
wal_keep_segments=0 ?
Regards, Phil.
Greetings,
* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
OK. I think this is perhaps a documentation bug, maybe a missing
warning when the master reads its configuration, and maybe (as you say)
a bad default value.
If we consider it to be an issue worthy of a change then we should
probably just change the default value, and maybe not even allow it to
be set lower than '1'.
Specifically, section 26.2.5 of the docs says:
"If you use streaming replication without file-based continuous archiving,
the server might recycle old WAL segments before the standby has received
them. If this occurs, the standby will need to be reinitialized from a new
base backup. You can avoid this by setting wal_keep_segments to a value
large enough to ensure that WAL segments are not recycled too early, or by
configuring a replication slot for the standby. If you set up a WAL archive
that's accessible from the standby, these solutions are not required, since
the standby can always use the archive to catch up provided it retains enough
segments."
OR, maybe the WAL reader that processes the files that restore_command fetches
could be smart enough to realise that it can skip over the gap at the end?
That strikes me as a whole lot more complication in something we'd
rather not introduce additional complications into without very good
reason. Then again, there was just a nearby discussion about how it'd
be nice if the backend could realize when a WAL file is complete, and I
do think that'll be more of an issue when users start configuring larger
WAL files, so perhaps we should figure out a way to handle this..
Anyway. Do others agree that my issue was the result of wal_keep_segments=0 ?
Yeah, I've not spent much time actually looking at code around this, so
it'd be nice to get:
a) a reproducible case demonstrating it's happening
b) testing to see how long it's been this way
c) if setting wal_keep_segments=1 fixes it
d) perhaps some thought around if we could address this some other way
Thanks!
Stephen
On 08/12/2018 02:56 PM, Phil Endecott wrote:
Stephen Frost wrote:
Specifically, section 26.2.5 of the docs says:
"If you use streaming replication without file-based continuous archiving,
the server might recycle old WAL segments before the standby has received
them. If this occurs, the standby will need to be reinitialized from a new
base backup. You can avoid this by setting wal_keep_segments to a value
large enough to ensure that WAL segments are not recycled too early, or by
configuring a replication slot for the standby. If you set up a WAL archive
that's accessible from the standby, these solutions are not required, since
the standby can always use the archive to catch up provided it retains enough
segments."
OR, maybe the WAL reader that processes the files that restore_command fetches
could be smart enough to realise that it can skip over the gap at the end?
Anyway. Do others agree that my issue was the result of
wal_keep_segments=0 ?
Only as a sub-issue of the slave losing contact with the master. The
basic problem is maintaining two separate operations, archiving and
streaming, in sync. If either or some combination of both lose
synchronization then it is anyone's guess on what is appropriate for
wal_keep_segments.
Regards, Phil.
--
Adrian Klaver
adrian.klaver@aklaver.com
On 08/12/2018 03:54 PM, Stephen Frost wrote:
Greetings,
* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
OK. I think this is perhaps a documentation bug, maybe a missing
warning when the master reads its configuration, and maybe (as you say)
a bad default value.
If we consider it to be an issue worthy of a change then we should
probably just change the default value, and maybe not even allow it to
be set lower than '1'.
I would say leave the default at 0 as it leaves no doubt that you are
performing without a net. A setting of '1' implies you are covered and
for a fast moving cluster or slow moving one with sufficient downtime
that would not be the case. Better to let the end user know this is not
a simple problem and some thought needs to go into configuration.
--
Adrian Klaver
adrian.klaver@aklaver.com
Greetings,
* Adrian Klaver (adrian.klaver@aklaver.com) wrote:
On 08/12/2018 03:54 PM, Stephen Frost wrote:
* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
OK. I think this is perhaps a documentation bug, maybe a missing
warning when the master reads its configuration, and maybe (as you say)
a bad default value.
If we consider it to be an issue worthy of a change then we should
probably just change the default value, and maybe not even allow it to
be set lower than '1'.
I would say leave the default at 0 as it leaves no doubt that you are
performing without a net. A setting of '1' implies you are covered and for a
fast moving cluster or slow moving one with sufficient downtime that would
not be the case. Better to let the end user know this is not a simple
problem and some thought needs to go into configuration.
Uh, this specific case is where there *is* a 'safety net' though-
archive command and restore command were configured and being used, so I
don't buy off on this argument at all.
Maybe we just internally bump wal_keep_segments to '1' to avoid this
specific risk without actually changing the default or making people
change their existing configurations, but if this is really what's
happening then I don't think the answer is "don't do anything."
Thanks!
Stephen
Adrian Klaver wrote:
On 08/12/2018 02:56 PM, Phil Endecott wrote:
Anyway. Do others agree that my issue was the result of
wal_keep_segments=0 ?
Only as a sub-issue of the slave losing contact with the master. The
basic problem is maintaining two separate operations, archiving and
streaming, in sync. If either or some combination of both lose
synchronization then it is anyone's guess on what is appropriate for
wal_keep_segments.
Really? I thought the intention was that the system should be
able to recover reliably when the slave reconnects after a
period of downtime, subject only to there being sufficient
network/CPU/disk bandwidth etc. for it to eventually catch up.
If that's not true, I think the docs need an even more extensive
overhaul! Suggestion for the paragraph that I quoted before from
26.2.5:
"If you set up a WAL archive that's accessible from the standby,
it's anyone's guess what is appropriate for wal_keep_segments."
Regards, Phil.
Adrian Klaver wrote:
On 08/12/2018 03:54 PM, Stephen Frost wrote:
Greetings,
* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
OK. I think this is perhaps a documentation bug, maybe a missing
warning when the master reads its configuration, and maybe (as you say)
a bad default value.
If we consider it to be an issue worthy of a change then we should
probably just change the default value, and maybe not even allow it to
be set lower than '1'.
I would say leave the default at 0 as it leaves no doubt that you are
performing without a net. A setting of '1' implies you are covered and
for a fast moving cluster or slow moving one with sufficient downtime
that would not be the case.
Can you explain how it can fail in the case of a "slow moving cluster with
sufficient downtime"?
It seems to me that if I have correctly understood what happened in this
case then 0, the default, really cannot ever work properly when you have
enabled WAL archiving plus streaming.
Better to let the end user know this is not
a simple problem and some thought needs to go into configuration.
I certainly agree that this is "not a simple problem", having read something
like 18,000 words of documentation multiple times and, apparently, still
got it wrong in multiple ways.
Regards, Phil.
Greetings,
* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
Adrian Klaver wrote:
On 08/12/2018 03:54 PM, Stephen Frost wrote:
Greetings,
* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
OK. I think this is perhaps a documentation bug, maybe a missing
warning when the master reads its configuration, and maybe (as you say)
a bad default value.
If we consider it to be an issue worthy of a change then we should
probably just change the default value, and maybe not even allow it to
be set lower than '1'.
I would say leave the default at 0 as it leaves no doubt that you are
performing without a net. A setting of '1' implies you are covered and for
a fast moving cluster or slow moving one with sufficient downtime that
would not be the case.
Can you explain how it can fail in the case of a "slow moving cluster with
sufficient downtime"?
I'm guessing 'sufficient downtime' here is, basically, 'offline until
the next checkpoint', which isn't actually all *that* much time.
It seems to me that if I have correctly understood what happened in this
case then 0, the default, really cannot ever work properly when you have
enabled WAL archiving plus streaming.
Well, it's not like it'd work without WAL archiving either, though
that's perhaps more obviously true.
Thanks!
Stephen
Greetings,
* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
Adrian Klaver wrote:
On 08/12/2018 02:56 PM, Phil Endecott wrote:
Anyway. Do others agree that my issue was the result of
wal_keep_segments=0 ?
Only as a sub-issue of the slave losing contact with the master. The basic
problem is maintaining two separate operations, archiving and streaming,
in sync. If either or some combination of both lose synchronization then
it is anyone's guess on what is appropriate for wal_keep_segments.
Uh, no, having an archive_command and a restore_command configured
correctly should remove the need to worry about what wal_keep_segments is
set to because anything not on the primary really should be available
through what's been archived and PG shouldn't have any trouble figuring
that out and working with it.
If all you've got is streaming replication then, sure, you have no idea
what to set wal_keep_segments to because the replica could be offline
for an indeterminate amount of time, but as long as you're keeping track
of all the WAL through archive_command, that shouldn't be an issue.
Really? I thought the intention was that the system should be
able to recover reliably when the slave reconnects after a
period of downtime, subject only to there being sufficient
network/CPU/disk bandwidth etc. for it to eventually catch up.
Yes, that's correct, the replica should always be able to catch back up
presuming there's no gaps in the WAL between when the replica failed and
where the primary is at.
Thanks!
Stephen