Walsender may fail to send wal to the end.
Hello, I happened to see a doubious behavior of walsender.
On a replication set with wal_keep_size/(segments) = 0, running the
following command on the primary causes walsender to fail to send up
to the final shutdown checkpoint record to the standby.
(create table t in advance)
psql -c 'insert into t values(0); select pg_switch_wal();'; pg_ctl stop
The primary complains like this:
2021-03-26 17:59:29.324 JST [checkpointer][140697] LOG: shutting down
2021-03-26 17:59:29.387 JST [walsender][140816] ERROR: requested WAL segment 000000010000000000000032 has already been removed
2021-03-26 17:59:29.387 JST [walsender][140816] STATEMENT: START_REPLICATION 0/32000000 TIMELINE 1
2021-03-26 17:59:29.394 JST [postmaster][140695] LOG: database system is shut down
This is because XLogSendPhysical detects removal of the wal segment
currently reading by shutdown checkpoint. However, there' no fear of
overwriting of WAL segments at the time.
So I think we can omit the call to CheckXLogRemoved() while
MyWalSnd->state is WALSNDSTTE_STOPPING because the state comes after
the shutdown checkpoint completes.
Of course that doesn't help if walsender was running two segments
behind. There still could be a small window for the failure. But it's
a great help to save the case of just 1 segment behind.
Is it worth fixing?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
phys_walsnd_reads_removed_segs_while_shutdown.patchtext/x-patch; charset=us-asciiDownload+12-3
Hi,
On 2021-03-26 18:20:14 +0900, Kyotaro Horiguchi wrote:
This is because XLogSendPhysical detects removal of the wal segment
currently reading by shutdown checkpoint. However, there' no fear of
overwriting of WAL segments at the time.So I think we can omit the call to CheckXLogRemoved() while
MyWalSnd->state is WALSNDSTTE_STOPPING because the state comes after
the shutdown checkpoint completes.Of course that doesn't help if walsender was running two segments
behind. There still could be a small window for the failure. But it's
a great help to save the case of just 1 segment behind.
-1. This seems like a bandaid to make a broken configuration work a tiny
bit better, without actually being meaningfully better.
Greetings,
Andres Freund
On Fri, Mar 26, 2021 at 10:16:40AM -0700, Andres Freund wrote:
Hi,
On 2021-03-26 18:20:14 +0900, Kyotaro Horiguchi wrote:
This is because XLogSendPhysical detects removal of the wal segment
currently reading by shutdown checkpoint. However, there' no fear of
overwriting of WAL segments at the time.So I think we can omit the call to CheckXLogRemoved() while
MyWalSnd->state is WALSNDSTTE_STOPPING because the state comes after
the shutdown checkpoint completes.Of course that doesn't help if walsender was running two segments
behind. There still could be a small window for the failure. But it's
a great help to save the case of just 1 segment behind.-1. This seems like a bandaid to make a broken configuration work a tiny
bit better, without actually being meaningfully better.
Agreed. Still, wouldn't it be better to avoid such configurations and
protect a bit things with a check on the new value?
--
Michael
At Mon, 29 Mar 2021 14:47:33 +0900, Michael Paquier <michael@paquier.xyz> wrote in
On Fri, Mar 26, 2021 at 10:16:40AM -0700, Andres Freund wrote:
Hi,
On 2021-03-26 18:20:14 +0900, Kyotaro Horiguchi wrote:
This is because XLogSendPhysical detects removal of the wal segment
currently reading by shutdown checkpoint. However, there' no fear of
overwriting of WAL segments at the time.So I think we can omit the call to CheckXLogRemoved() while
MyWalSnd->state is WALSNDSTTE_STOPPING because the state comes after
the shutdown checkpoint completes.Of course that doesn't help if walsender was running two segments
behind. There still could be a small window for the failure. But it's
a great help to save the case of just 1 segment behind.-1. This seems like a bandaid to make a broken configuration work a tiny
bit better, without actually being meaningfully better.Agreed. Still, wouldn't it be better to avoid such configurations and
protect a bit things with a check on the new value?
The repro was a bit artificial but the symptom happened without
pg_switch_wal() and no load. It caused just by shutting down of
primary. If it is normal behavior for walsenders to fail to send the
last shutdown record to standby while fast shutdown, we should refuse
to startup at least wal sender if wal_keep_size = 0.
I can guess two ways to do that.
1. refuse to start server if wal_keep_size = 0 when max_wal_senders > 0.
2. refuse to start wal sender if wal_keep_size= 0.
2 looks like broken. 1 is somewhat annoying.. However, since
max_wal_senders already premises wal_level > minimal, we can accept
that restriction?
<start serer>
FATAL: WAL streaming (max_wal_senders > 0) requires wal_level "replica" or "logical"
<Mmm. wal_level, fixed, then retry starting server>
FATAL: WAL streaming (max_wal_senders > 0) requires wal_keep_size to be at least 1MB
<Oops!>
Of couse we can list all incompatible parameters at once.
FATAL: WAL streaming (max_wal_senders > 0) requires wal_level "replica" or "logical" and wal_keep_size to be at least 1MB
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Greetings,
* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
At Mon, 29 Mar 2021 14:47:33 +0900, Michael Paquier <michael@paquier.xyz> wrote in
On Fri, Mar 26, 2021 at 10:16:40AM -0700, Andres Freund wrote:
On 2021-03-26 18:20:14 +0900, Kyotaro Horiguchi wrote:
This is because XLogSendPhysical detects removal of the wal segment
currently reading by shutdown checkpoint. However, there' no fear of
overwriting of WAL segments at the time.So I think we can omit the call to CheckXLogRemoved() while
MyWalSnd->state is WALSNDSTTE_STOPPING because the state comes after
the shutdown checkpoint completes.Of course that doesn't help if walsender was running two segments
behind. There still could be a small window for the failure. But it's
a great help to save the case of just 1 segment behind.-1. This seems like a bandaid to make a broken configuration work a tiny
bit better, without actually being meaningfully better.Agreed. Still, wouldn't it be better to avoid such configurations and
protect a bit things with a check on the new value?
I have a hard time agreeing that this is somehow a 'broken'
configuration, instead it looks like a race condition that wasn't
considered and should be addressed. If there's zero lag then we really
should allow the final WAL to get sent to the replica.
The repro was a bit artificial but the symptom happened without
pg_switch_wal() and no load. It caused just by shutting down of
primary. If it is normal behavior for walsenders to fail to send the
last shutdown record to standby while fast shutdown, we should refuse
to startup at least wal sender if wal_keep_size = 0.I can guess two ways to do that.
Both of which will break things for people, so this certainly isn't a
great approach, and besides, if archiving is happening with
archive_command and the replica has a restore command then it should be
able to follow that just fine, no? So we'd have to also check if
archive_command has been set up and hope the admin has a restore
command. Having to go through that dance instead of just making sure to
push out the last WAL to the replica seems a bit silly though.
Thanks,
Stephen
At Mon, 29 Mar 2021 11:41:32 -0400, Stephen Frost <sfrost@snowman.net> wrote in
Greetings,
* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
At Mon, 29 Mar 2021 14:47:33 +0900, Michael Paquier <michael@paquier.xyz> wrote in
On Fri, Mar 26, 2021 at 10:16:40AM -0700, Andres Freund wrote:
On 2021-03-26 18:20:14 +0900, Kyotaro Horiguchi wrote:
This is because XLogSendPhysical detects removal of the wal segment
currently reading by shutdown checkpoint. However, there' no fear of
overwriting of WAL segments at the time.So I think we can omit the call to CheckXLogRemoved() while
MyWalSnd->state is WALSNDSTTE_STOPPING because the state comes after
the shutdown checkpoint completes.Of course that doesn't help if walsender was running two segments
behind. There still could be a small window for the failure. But it's
a great help to save the case of just 1 segment behind.-1. This seems like a bandaid to make a broken configuration work a tiny
bit better, without actually being meaningfully better.Agreed. Still, wouldn't it be better to avoid such configurations and
protect a bit things with a check on the new value?I have a hard time agreeing that this is somehow a 'broken'
configuration, instead it looks like a race condition that wasn't
considered and should be addressed. If there's zero lag then we really
should allow the final WAL to get sent to the replica.
My unstated point was switching primary/secondary roles in a
replication set where both host have separate archives, by the steps
"fast shutdown primary"->"promote standby"->"attach the old primary as
new standby", wihtout a need of synchronizing old primary's archive to
that of the new standby before starting the new standby. I thought
that should work even if wal_keep_size = 0.
The repro was a bit artificial but the symptom happened without
pg_switch_wal() and no load. It caused just by shutting down of
primary. If it is normal behavior for walsenders to fail to send the
last shutdown record to standby while fast shutdown, we should refuse
to startup at least wal sender if wal_keep_size = 0.I can guess two ways to do that.
Both of which will break things for people, so this certainly isn't a
great approach, and besides, if archiving is happening with
archive_command and the replica has a restore command then it should be
Right.
able to follow that just fine, no? So we'd have to also check if
archive_command has been set up and hope the admin has a restore
Yeah, that sounds stupid (or kind of impossible).
command. Having to go through that dance instead of just making sure to
push out the last WAL to the replica seems a bit silly though.
Sounds reasonable to me.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center