prevent immature WAL streaming

Started by Alvaro Herreraover 4 years ago121 messageshackers
Jump to latest
#1Alvaro Herrera
alvherre@2ndquadrant.com

Included 蔡梦娟 and Jakub Wartak because they've expressed interest on
this topic -- notably [2]/messages/by-id/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com ("Bug on update timing of walrcv->flushedUpto
variable").

As mentioned in the course of thread [1]/messages/by-id/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com, we're missing a fix for
streaming replication to avoid sending records that the primary hasn't
fully flushed yet. This patch is a first attempt at fixing that problem
by retreating the LSN reported as FlushPtr whenever a segment is
registered, based on the understanding that if no registration exists
then the LogwrtResult.Flush pointer can be taken at face value; but if a
registration exists, then we have to stream only till the start LSN of
that registered entry.

This patch is probably incomplete. First, I'm not sure that logical
replication is affected by this problem. I think it isn't, because
logical replication will halt until the record can be read completely --
maybe I'm wrong and there is a way for things to go wrong with logical
replication as well. But also, I need to look at the other uses of
GetFlushRecPtr() and see if those need to change to the new function too
or they can remain what they are now.

I'd also like to have tests. That seems moderately hard, but if we had
WAL-molasses that could be used in walreceiver, it could be done. (It
sounds easier to write tests with a molasses-archive_command.)

[1]: /messages/by-id/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
[2]: /messages/by-id/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/

Attachments:

0001-Don-t-stream-non-final-WAL-segments.patchtext/x-diff; charset=utf-8Download+37-6
#2Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Alvaro Herrera (#1)
Re: prevent immature WAL streaming

At Mon, 23 Aug 2021 18:52:17 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

Included 蔡梦娟 and Jakub Wartak because they've expressed interest on
this topic -- notably [2] ("Bug on update timing of walrcv->flushedUpto
variable").

As mentioned in the course of thread [1], we're missing a fix for
streaming replication to avoid sending records that the primary hasn't
fully flushed yet. This patch is a first attempt at fixing that problem
by retreating the LSN reported as FlushPtr whenever a segment is
registered, based on the understanding that if no registration exists
then the LogwrtResult.Flush pointer can be taken at face value; but if a
registration exists, then we have to stream only till the start LSN of
that registered entry.

This patch is probably incomplete. First, I'm not sure that logical
replication is affected by this problem. I think it isn't, because
logical replication will halt until the record can be read completely --
maybe I'm wrong and there is a way for things to go wrong with logical
replication as well. But also, I need to look at the other uses of
GetFlushRecPtr() and see if those need to change to the new function too
or they can remain what they are now.

I'd also like to have tests. That seems moderately hard, but if we had
WAL-molasses that could be used in walreceiver, it could be done. (It
sounds easier to write tests with a molasses-archive_command.)

[1] /messages/by-id/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
[2] /messages/by-id/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com

(I'm not sure what "WAL-molasses" above expresses, same as "sugar"?)

For our information, this issue is related to the commit 0668719801
which makes XLogPageRead restart reading a (continued or
segments-spanning) record with switching sources. In that thread, I
modifed the code to cause a server crash under the desired situation.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3Nathan Bossart
nathandbossart@gmail.com
In reply to: Kyotaro Horiguchi (#2)
Re: prevent immature WAL streaming

On 8/23/21, 3:53 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:

As mentioned in the course of thread [1], we're missing a fix for
streaming replication to avoid sending records that the primary hasn't
fully flushed yet. This patch is a first attempt at fixing that problem
by retreating the LSN reported as FlushPtr whenever a segment is
registered, based on the understanding that if no registration exists
then the LogwrtResult.Flush pointer can be taken at face value; but if a
registration exists, then we have to stream only till the start LSN of
that registered entry.

I wonder if we need to move the call to RegisterSegmentBoundary() to
somewhere before WALInsertLockRelease() for this to work as intended.
Right now, boundary registration could take place after the flush
pointer has been advanced, which means GetSafeFlushRecPtr() could
still return an unsafe position.

Nathan

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Nathan Bossart (#3)
Re: prevent immature WAL streaming

On 2021-Aug-24, Bossart, Nathan wrote:

I wonder if we need to move the call to RegisterSegmentBoundary() to
somewhere before WALInsertLockRelease() for this to work as intended.
Right now, boundary registration could take place after the flush
pointer has been advanced, which means GetSafeFlushRecPtr() could
still return an unsafe position.

Yeah, you're right -- that's a definite risk. I didn't try to reproduce
a problem with that, but it is seems pretty obvious that it can happen.

I didn't have a lot of luck with a reliable reproducer script. I was
able to reproduce the problem starting with Ryo Matsumura's script and
attaching a replica; most of the time the replica would recover by
restarting from a streaming position earlier than where the problem
occurred; but a few times it would just get stuck with a WAL segment
containing a bogus record. Then, after patch, the problem no longer
occurs.

I attach the patch with the change you suggested.

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Tom: There seems to be something broken here.
Teodor: I'm in sackcloth and ashes... Fixed.
http://archives.postgresql.org/message-id/482D1632.8010507@sigaev.ru

Attachments:

v2-0001-Don-t-stream-non-final-WAL-segments.patchtext/x-diff; charset=utf-8Download+62-22
#5Nathan Bossart
nathandbossart@gmail.com
In reply to: Alvaro Herrera (#4)
Re: prevent immature WAL streaming

On 8/24/21, 4:03 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:

On 2021-Aug-24, Bossart, Nathan wrote:

I wonder if we need to move the call to RegisterSegmentBoundary() to
somewhere before WALInsertLockRelease() for this to work as intended.
Right now, boundary registration could take place after the flush
pointer has been advanced, which means GetSafeFlushRecPtr() could
still return an unsafe position.

Yeah, you're right -- that's a definite risk. I didn't try to reproduce
a problem with that, but it is seems pretty obvious that it can happen.

I didn't have a lot of luck with a reliable reproducer script. I was
able to reproduce the problem starting with Ryo Matsumura's script and
attaching a replica; most of the time the replica would recover by
restarting from a streaming position earlier than where the problem
occurred; but a few times it would just get stuck with a WAL segment
containing a bogus record. Then, after patch, the problem no longer
occurs.

If moving RegisterSegmentBoundary() is sufficient to prevent the flush
pointer from advancing before we register the boundary, I bet we could
also remove the WAL writer nudge.

Another interesting thing I see is that the boundary stored in
earliestSegBoundary is not necessarily the earliest one. It's just
the first one that has been registered. I did this for simplicity for
the .ready file fix, but I can see it causing problems here. I think
we can move earliestSegBoundary backwards as long as it is greater
than lastNotifiedSeg + 1. However, it's still not necessarily the
earliest one if we copied latestSegBoundary to earliestSegBoundary in
NotifySegmentsReadyForArchive(). To handle this, we could track
several boundaries in an array, but then we'd have to hold the safe
LSN back whenever the array overflowed and we started forgetting
boundaries.

Perhaps there's a simpler solution. I'll keep thinking...

Nathan

#6Jakub Wartak
Jakub.Wartak@tomtom.com
In reply to: Alvaro Herrera (#4)
RE: prevent immature WAL streaming

Hi Álvaro, -hackers,

I attach the patch with the change you suggested.

I've gave a shot to to the v02 patch on top of REL_12_STABLE (already including 5065aeafb0b7593c04d3bc5bc2a86037f32143fc). Previously(yesterday) without the v02 patch I was getting standby corruption always via simulation by having separate /pg_xlog dedicated fs, and archive_mode=on, wal_keep_segments=120, archive_command set to rsync to different dir on same fs, wal_init_zero at default(true).

Today (with v02) I've got corruption in only initial 2 runs out of ~ >30 tries on standby. Probably the 2 failures were somehow my fault (?) or some rare condition (and in 1 of those 2 cases simply restarting standby did help). To be honest I've tried to force this error, but with v02 I simply cannot force this error anymore, so that's good! :)

I didn't have a lot of luck with a reliable reproducer script. I was able to
reproduce the problem starting with Ryo Matsumura's script and attaching
a replica; most of the time the replica would recover by restarting from a
streaming position earlier than where the problem occurred; but a few
times it would just get stuck with a WAL segment containing a bogus
record.

In order to get reliable reproducer and get proper the fault injection instead of playing with really filling up fs, apparently one could substitute fd with fd of /dev/full using e.g. dup2() so that every write is going to throw this error too:

root@hive:~# ./t & # simple while(1) { fprintf() flush () } testcase
root@hive:~# ls -l /proc/27296/fd/3
lrwx------ 1 root root 64 Aug 25 06:22 /proc/27296/fd/3 -> /tmp/testwrite
root@hive:~# gdb -q -p 27296
-- 1089 is bitmask O_WRONLY|..
(gdb) p dup2(open("/dev/full", 1089, 0777), 3)
$1 = 3
(gdb) c
Continuing.
==>
fflush/write(): : No space left on device

So I've also tried to be malicious while writing to the DB and inject ENOSPCE near places like:

a) XLogWrite()->XLogFileInit() near line 3322 // assuming: if (wal_init_zero) is true, one gets classic "PANIC: could not write to file "pg_wal/xlogtemp.90670": No space left on device"
b) XLogWrite() near line 2547 just after pg_pwrite // one can get "PANIC: could not write to log file 000000010000003B000000A8 at offset 0, length 15466496: No space left on device" (that would be possible with wal_init_zero=false?)
c) XLogWrite() near line 2592 // just before issue_xlog_fsync to get "PANIC: could not fdatasync file "000000010000004300000004": Invalid argument" that would pretty much mean same as above but with last possible offset near end of WAL?

This was done with gdb voodoo:
handle SIGUSR1 noprint nostop
break xlog.c:<LINE> // https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/transam/xlog.c#L3311
c
print fd or openLogFile -- to verify it is 3
p dup2(open("/dev/full", 1089, 0777), 3) -- during most of walwriter runtime it has current log as fd=3

After restarting master and inspecting standby - in all of those above 3 cases - the standby didn't inhibit the "invalid contrecord length" at least here, while without corruption this v02 patch it is notorious. So if it passes the worst-case code review assumptions I would be wondering if it shouldn't even be committed as it stands right now.

-J.

#7Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Nathan Bossart (#5)
Re: prevent immature WAL streaming

On 2021-Aug-24, Bossart, Nathan wrote:

If moving RegisterSegmentBoundary() is sufficient to prevent the flush
pointer from advancing before we register the boundary, I bet we could
also remove the WAL writer nudge.

Can you elaborate on this? I'm not sure I see the connection.

Another interesting thing I see is that the boundary stored in
earliestSegBoundary is not necessarily the earliest one. It's just
the first one that has been registered. I did this for simplicity for
the .ready file fix, but I can see it causing problems here.

Hmm, is there really a problem here? Surely the flush point cannot go
past whatever has been written. If somebody is writing an earlier
section of WAL, then we cannot move the flush pointer to a later
position. So it doesn't matter if the earliest point we have registered
is the true earliest -- we only care for it to be the earliest that is
past the flush point.

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/

#8Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#2)
Re: prevent immature WAL streaming

On Mon, Aug 23, 2021 at 11:04 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Mon, 23 Aug 2021 18:52:17 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

I'd also like to have tests. That seems moderately hard, but if we had
WAL-molasses that could be used in walreceiver, it could be done. (It
sounds easier to write tests with a molasses-archive_command.)

[1] /messages/by-id/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
[2] /messages/by-id/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com

(I'm not sure what "WAL-molasses" above expresses, same as "sugar"?)

I think, but am not 100% sure, that "molasses" here is being used to
refer to fault injection.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#8)
Re: prevent immature WAL streaming

On 8/25/21, 5:33 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:

On 2021-Aug-24, Bossart, Nathan wrote:

If moving RegisterSegmentBoundary() is sufficient to prevent the flush
pointer from advancing before we register the boundary, I bet we could
also remove the WAL writer nudge.

Can you elaborate on this? I'm not sure I see the connection.

The reason we are moving RegisterSegmentBoundary() to before
WALInsertLockRelease() is because we believe it will prevent boundary
registration from taking place after the flush pointer has already
advanced past the boundary in question. We had added the WAL writer
nudge to make sure we called NotifySegmentsReadyForArchive() whenever
that happened.

If moving boundary registration to before we release the lock(s) is
enough to prevent the race condition with the flush pointer, then ISTM
we no longer have to worry about nudging the WAL writer.

Another interesting thing I see is that the boundary stored in
earliestSegBoundary is not necessarily the earliest one. It's just
the first one that has been registered. I did this for simplicity for
the .ready file fix, but I can see it causing problems here.

Hmm, is there really a problem here? Surely the flush point cannot go
past whatever has been written. If somebody is writing an earlier
section of WAL, then we cannot move the flush pointer to a later
position. So it doesn't matter if the earliest point we have registered
is the true earliest -- we only care for it to be the earliest that is
past the flush point.

Let's say we have the following situation (F = flush, E = earliest
registered boundary, and L = latest registered boundary), and let's
assume that each segment has a cross-segment record that ends in the
next segment.

F E L
|-----|-----|-----|-----|-----|-----|-----|-----|
1 2 3 4 5 6 7 8

Then, we write out WAL to disk and create .ready files as needed. If
we didn't flush beyond the latest registered boundary, the latest
registered boundary now becomes the earliest boundary.

F E
|-----|-----|-----|-----|-----|-----|-----|-----|
1 2 3 4 5 6 7 8

At this point, the earliest segment boundary past the flush point is
before the "earliest" boundary we are tracking.

Nathan

#10Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jakub Wartak (#6)
Re: prevent immature WAL streaming

On 2021-Aug-25, Jakub Wartak wrote:

In order to get reliable reproducer and get proper the fault injection
instead of playing with really filling up fs, apparently one could
substitute fd with fd of /dev/full using e.g. dup2() so that every
write is going to throw this error too:

Oh, this is a neat trick that I didn't know about. Thanks.

After restarting master and inspecting standby - in all of those above
3 cases - the standby didn't inhibit the "invalid contrecord length"
at least here, while without corruption this v02 patch it is
notorious. So if it passes the worst-case code review assumptions I
would be wondering if it shouldn't even be committed as it stands
right now.

Well, Nathan is right that this patch isn't really closing the hole
because we aren't tracking segment boundaries that aren't "earliest" nor
"latest" at the moment of registration. The simplistic approach seems
okay for the archive case, but not for the replication case.

I also noticed today (facepalm) that the patch doesn't work unless you
have archiving enabled, because the registration code is only invoked
when XLogArchivingActive(). Doh. This part is easy to solve. The
other doesn't look easy ...

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/

#11Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#10)
Re: prevent immature WAL streaming

BTW while going about testing this, I noticed that we forbid
pg_walfile_name() while in recovery. That restriction was added by
commit 370f770c15a4 because ThisTimeLineID was not set correctly during
recovery. That was supposed to be fixed by commit 1148e22a82ed, so I
thought that it should be possible to remove the restriction. However,
I did that per the attached patch, but was quickly disappointed because
ThisTimeLineID seems to remain zero in a standby for reasons that I
didn't investigate.

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"The problem with the facetime model is not just that it's demoralizing, but
that the people pretending to work interrupt the ones actually working."
(Paul Graham)

Attachments:

0001-Don-t-forbid-pg_walfile_name-in-recovery-mode.patchtext/x-diff; charset=utf-8Download+0-8
#12Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Nathan Bossart (#9)
Re: prevent immature WAL streaming

At Wed, 25 Aug 2021 18:18:59 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in

On 8/25/21, 5:33 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:

On 2021-Aug-24, Bossart, Nathan wrote:

Another interesting thing I see is that the boundary stored in
earliestSegBoundary is not necessarily the earliest one. It's just
the first one that has been registered. I did this for simplicity for
the .ready file fix, but I can see it causing problems here.

Hmm, is there really a problem here? Surely the flush point cannot go
past whatever has been written. If somebody is writing an earlier
section of WAL, then we cannot move the flush pointer to a later
position. So it doesn't matter if the earliest point we have registered
is the true earliest -- we only care for it to be the earliest that is
past the flush point.

Let's say we have the following situation (F = flush, E = earliest
registered boundary, and L = latest registered boundary), and let's
assume that each segment has a cross-segment record that ends in the
next segment.

F E L
|-----|-----|-----|-----|-----|-----|-----|-----|
1 2 3 4 5 6 7 8

Then, we write out WAL to disk and create .ready files as needed. If
we didn't flush beyond the latest registered boundary, the latest
registered boundary now becomes the earliest boundary.

F E
|-----|-----|-----|-----|-----|-----|-----|-----|
1 2 3 4 5 6 7 8

At this point, the earliest segment boundary past the flush point is
before the "earliest" boundary we are tracking.

We know we have some cross-segment records in the regin [E L] so we
cannot add a .ready file if flush is in the region. I haven't looked
the latest patch (or I may misunderstand the discussion here) but I
think we shouldn't move E before F exceeds previous (or in the first
picture above) L. Things are done that way in my ancient proposal in
[1]: /messages/by-id/20201216.110120.887433782054853494.horikyota.ntt@gmail.com

https://www.postgresql.org/message-id/attachment/117052/v4-0001-Avoid-archiving-a-WAL-segment-that-continues-to-t.patch
+    if (LogwrtResult.Write < firstSegContRecStart ||
+        lastSegContRecEnd <= LogwrtResult.Write)
+    {
        <notify the last segment>

By doing so, at the time of the second picutre, the pointers are set as:

E F L
|-----|-----|-----|-----|-----|-----|-----|-----|
1 2 3 4 5 6 7 8

Then the poiter are cleard at the time F reaches L,

F
|-----|-----|-----|-----|-----|-----|-----|-----|
1 2 3 4 5 6 7 8

Isn't this work correctly? As I think I mentioned in the thread, I
don't think we don't have so many (more than several, specifically)
segments in a region [E L].

[1]: /messages/by-id/20201216.110120.887433782054853494.horikyota.ntt@gmail.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Alvaro Herrera (#11)
Re: prevent immature WAL streaming

At Wed, 25 Aug 2021 20:20:04 -0400, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote in

BTW while going about testing this, I noticed that we forbid
pg_walfile_name() while in recovery. That restriction was added by
commit 370f770c15a4 because ThisTimeLineID was not set correctly during
recovery. That was supposed to be fixed by commit 1148e22a82ed, so I
thought that it should be possible to remove the restriction. However,
I did that per the attached patch, but was quickly disappointed because
ThisTimeLineID seems to remain zero in a standby for reasons that I
didn't investigate.

On a intermediate node of a cascading replication set, timeline id on
walsender and walrecever can differ and ordinary backends cannot
decide which to use.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#14Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#8)
Re: prevent immature WAL streaming

(this is off-topic here)

At Wed, 25 Aug 2021 09:56:56 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

On Mon, Aug 23, 2021 at 11:04 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Mon, 23 Aug 2021 18:52:17 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

I'd also like to have tests. That seems moderately hard, but if we had
WAL-molasses that could be used in walreceiver, it could be done. (It
sounds easier to write tests with a molasses-archive_command.)

[1] /messages/by-id/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
[2] /messages/by-id/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com

(I'm not sure what "WAL-molasses" above expresses, same as "sugar"?)

I think, but am not 100% sure, that "molasses" here is being used to
refer to fault injection.

Oh. That makes sense, thanks.

I sometimes inject artificial faults (a server crash, in this case) to
create specific on-disk states but I cannot imagine that that kind of
machinery can be statically placed in the source tree..

--
Kyotaro Horiguchi
NTT Open Source Software Center

#15Nathan Bossart
nathandbossart@gmail.com
In reply to: Kyotaro Horiguchi (#14)
Re: prevent immature WAL streaming

On 8/25/21, 5:40 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:

At Wed, 25 Aug 2021 18:18:59 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in

Let's say we have the following situation (F = flush, E = earliest
registered boundary, and L = latest registered boundary), and let's
assume that each segment has a cross-segment record that ends in the
next segment.

F E L
|-----|-----|-----|-----|-----|-----|-----|-----|
1 2 3 4 5 6 7 8

Then, we write out WAL to disk and create .ready files as needed. If
we didn't flush beyond the latest registered boundary, the latest
registered boundary now becomes the earliest boundary.

F E
|-----|-----|-----|-----|-----|-----|-----|-----|
1 2 3 4 5 6 7 8

At this point, the earliest segment boundary past the flush point is
before the "earliest" boundary we are tracking.

We know we have some cross-segment records in the regin [E L] so we
cannot add a .ready file if flush is in the region. I haven't looked
the latest patch (or I may misunderstand the discussion here) but I
think we shouldn't move E before F exceeds previous (or in the first
picture above) L. Things are done that way in my ancient proposal in
[1].

The strategy in place ensures that we track a boundary that doesn't
change until the flush position passes it as well as the latest
registered boundary. I think it is important that any segment
boundary tracking mechanism does at least those two things. I don't
see how we could do that if we didn't update E until F passed both E
and L.

Nathan

#16Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Nathan Bossart (#15)
Re: prevent immature WAL streaming

At Thu, 26 Aug 2021 03:24:48 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in

On 8/25/21, 5:40 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:

At Wed, 25 Aug 2021 18:18:59 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in

Let's say we have the following situation (F = flush, E = earliest
registered boundary, and L = latest registered boundary), and let's
assume that each segment has a cross-segment record that ends in the
next segment.

F E L
|-----|-----|-----|-----|-----|-----|-----|-----|
1 2 3 4 5 6 7 8

Then, we write out WAL to disk and create .ready files as needed. If
we didn't flush beyond the latest registered boundary, the latest
registered boundary now becomes the earliest boundary.

F E
|-----|-----|-----|-----|-----|-----|-----|-----|
1 2 3 4 5 6 7 8

At this point, the earliest segment boundary past the flush point is
before the "earliest" boundary we are tracking.

We know we have some cross-segment records in the regin [E L] so we
cannot add a .ready file if flush is in the region. I haven't looked
the latest patch (or I may misunderstand the discussion here) but I
think we shouldn't move E before F exceeds previous (or in the first
picture above) L. Things are done that way in my ancient proposal in
[1].

The strategy in place ensures that we track a boundary that doesn't
change until the flush position passes it as well as the latest
registered boundary. I think it is important that any segment
boundary tracking mechanism does at least those two things. I don't
see how we could do that if we didn't update E until F passed both E
and L.

(Sorry, but I didn't get you clearly. So the discussion below might be
pointless.)

The ancient patch did:

If a flush didn't reach E, we can archive finished segments.

If a flush ends between E and L, we shouldn't archive finshed segments
at all. L can be moved further while this state, while E sits on the
same location while this state.

Once a flush passes L, we can archive all finished segments and can
erase both E and L.

A drawback of this strategy is that the region [E L] can contain gaps
(that is, segment boundaries that is not bonded by a continuation
record) and archive can be excessively retarded. Perhaps if flush
goes behind write head by more than two segments, the probability of
creating the gaps would be higher.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#17Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#16)
Re: prevent immature WAL streaming

So I'm again distracted by something else, so here's what will have to
pass for v3 for the time being.

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/

Attachments:

v3-replication-segment-boundaries.patchtext/x-diff; charset=utf-8Download+97-59
#18Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#1)
Re: prevent immature WAL streaming

Hi,

On 2021-08-23 18:52:17 -0400, Alvaro Herrera wrote:

Included 蔡梦娟 and Jakub Wartak because they've expressed interest on
this topic -- notably [2] ("Bug on update timing of walrcv->flushedUpto
variable").

As mentioned in the course of thread [1], we're missing a fix for
streaming replication to avoid sending records that the primary hasn't
fully flushed yet. This patch is a first attempt at fixing that problem
by retreating the LSN reported as FlushPtr whenever a segment is
registered, based on the understanding that if no registration exists
then the LogwrtResult.Flush pointer can be taken at face value; but if a
registration exists, then we have to stream only till the start LSN of
that registered entry.

I'm doubtful that the approach of adding awareness of record boundaries
is a good path to go down:

- It adds nontrivial work to hot code paths to handle an edge case,
rather than making rare code paths more expensive.

- There are very similar issues with promotions of replicas (consider
what happens if we need to promote with the end of local WAL spanning
a segment boundary, and what happens to cascading replicas). We have
some logic to try to deal with that, but it's pretty grotty and I
think incomplete.

- It seems to make some future optimizations harder - we should work
towards replicating data sooner, rather than the opposite. Right now
that's a major bottleneck around syncrep.

- Once XLogFlush() for some LSN returned we can write that LSN to
disk. The LSN doesn't necessarily have to correspond to a specific
on-disk location (it could e.g. be the return value from
GetFlushRecPtr()). But "rewinding" to before the last record makes that
problematic.

- I suspect that schemes with heuristic knowledge of segment boundary
spanning records have deadlock or at least latency spike issues. What
if synchronous commit needs to flush up to a certain record boundary,
but streaming rep doesn't replicate it out because there's segment
spanning records both before and after?

I think a better approach might be to handle this on the WAL layout
level. What if we never overwrite partial records but instead just
skipped over them during decoding?

Of course there's some difficulties with that - the checksum and the
length from the record header aren't going to be meaningful.

But we could deal with that using a special flag in the
XLogPageHeaderData.xlp_info of the following page. If that flag is set,
xlp_rem_len could contain the checksum of the partial record.

Greetings,

Andres Freund

#19Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#18)
Re: prevent immature WAL streaming

On 2021-Aug-30, Andres Freund wrote:

I'm doubtful that the approach of adding awareness of record boundaries
is a good path to go down:

Honestly, I do not like it one bit and if I can avoid relying on them
while making the whole thing work correctly, I am happy. Clearly it
wasn't a problem for the ancient recovery-only WAL design, but as soon
as we added replication on top the whole issue of continuation records
became a bug.

I do think that the code should be first correct and second performant,
though.

- There are very similar issues with promotions of replicas (consider
what happens if we need to promote with the end of local WAL spanning
a segment boundary, and what happens to cascading replicas). We have
some logic to try to deal with that, but it's pretty grotty and I
think incomplete.

Ouch, I hadn't thought of cascading replicas.

- It seems to make some future optimizations harder - we should work
towards replicating data sooner, rather than the opposite. Right now
that's a major bottleneck around syncrep.

Absolutely.

I think a better approach might be to handle this on the WAL layout
level. What if we never overwrite partial records but instead just
skipped over them during decoding?

Maybe this is a workable approach, let's work it out fully.

Let me see if I understand what you mean:
* We would remove the logic to inhibit archiving and streaming-
replicating the tail end of a split WAL record; that logic deals with
bytes only, so doesn't have to be aware of record boundaries.
* On WAL replay, we ignore records that are split across a segment
boundary and whose checksum does not match.
* On WAL write ... ?

How do we detect after recovery that a record that was being written,
and potentially was sent to the archive, needs to be "skipped"?

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/

#20Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#19)
Re: prevent immature WAL streaming

Hi,

On 2021-08-31 09:56:30 -0400, Alvaro Herrera wrote:

On 2021-Aug-30, Andres Freund wrote:

I think a better approach might be to handle this on the WAL layout
level. What if we never overwrite partial records but instead just
skipped over them during decoding?

Maybe this is a workable approach, let's work it out fully.

Let me see if I understand what you mean:
* We would remove the logic to inhibit archiving and streaming-
replicating the tail end of a split WAL record; that logic deals with
bytes only, so doesn't have to be aware of record boundaries.
* On WAL replay, we ignore records that are split across a segment
boundary and whose checksum does not match.
* On WAL write ... ?

I was thinking that on a normal WAL write we'd do nothing. Instead we would
have dedicated code at the end of recovery that, if the WAL ends in a partial
record, changes the page following the "valid" portion of the WAL to indicate
that an incomplete record is to be skipped.

Of course, we need to be careful to not weaken WAL validity checking too
much. How about the following:

If we're "aborting" a continued record, we set XLP_FIRST_IS_ABORTED_PARTIAL on
the page at which we do so (i.e. the page after the valid end of the WAL).

On a page with XLP_FIRST_IS_ABORTED_PARTIAL we expect a special type of record
to start just after the page header. That record contains sufficient
information for us to verify the validity of the partial record (since its
checksum and length aren't valid, and may not even be all readable if the
record header itself was split). I think it would make sense to include the
LSN of the aborted record, and a checksum of the partial data.

How do we detect after recovery that a record that was being written,
and potentially was sent to the archive, needs to be "skipped"?

I think we can just read the WAL and see if it ends with a partial
record. It'd add a bit of complication to the error checking in xlogreader,
because we'd likely want to treat verification from page headers a bit
different from verification due to record data. But that seems doable.

Does this make sense?

Greetings,

Andres Freund

#21Fujii Masao
masao.fujii@gmail.com
In reply to: Andres Freund (#20)
#22Andres Freund
andres@anarazel.de
In reply to: Fujii Masao (#21)
#23Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#22)
#24Fujii Masao
masao.fujii@gmail.com
In reply to: Andres Freund (#22)
#25Andres Freund
andres@anarazel.de
In reply to: Fujii Masao (#24)
#26Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#25)
#27Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#24)
#28Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#27)
#29Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Alvaro Herrera (#28)
#30Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#29)
#31Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#30)
#32Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#29)
#33Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#31)
#34Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#33)
#35Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#32)
#36Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#35)
#37Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#35)
#38Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#37)
#39Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#37)
#40Nathan Bossart
nathandbossart@gmail.com
In reply to: Alvaro Herrera (#39)
#41Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Nathan Bossart (#40)
#42Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#41)
#43Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#42)
#44Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Alvaro Herrera (#43)
#45Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#44)
#46Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#44)
#47Nathan Bossart
nathandbossart@gmail.com
In reply to: Alvaro Herrera (#46)
#48Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#46)
#49Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Nathan Bossart (#47)
#50Nathan Bossart
nathandbossart@gmail.com
In reply to: Alvaro Herrera (#49)
#51Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Nathan Bossart (#50)
#52Nathan Bossart
nathandbossart@gmail.com
In reply to: Alvaro Herrera (#51)
#53Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Nathan Bossart (#52)
#54Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#53)
#55Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#54)
#56Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#55)
#57Hannu Krosing
hannu@tm.ee
In reply to: Alvaro Herrera (#56)
#58Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#56)
#59Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#58)
#60Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#58)
#61Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Hannu Krosing (#57)
#62Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#60)
#63Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#62)
#64Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Dunstan (#63)
#65Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#63)
#66Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#65)
#67Andrew Dunstan
andrew@dunslane.net
In reply to: Alvaro Herrera (#64)
#68Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#67)
#69Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#63)
#70Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#69)
#71Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#65)
#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#71)
#73Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#72)
#74Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#73)
#75Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#74)
#76Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#75)
#77Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#76)
#78Amul Sul
sulamul@gmail.com
In reply to: Alvaro Herrera (#60)
#79Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amul Sul (#78)
#80Amul Sul
sulamul@gmail.com
In reply to: Alvaro Herrera (#79)
#81Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amul Sul (#80)
#82Amul Sul
sulamul@gmail.com
In reply to: Alvaro Herrera (#81)
#83Jakub Wartak
Jakub.Wartak@tomtom.com
In reply to: Alvaro Herrera (#59)
#84Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#82)
#85Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amul Sul (#82)
#86Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#84)
#87Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#76)
#88Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#87)
#89Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#88)
#90Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#89)
#91Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#90)
#92Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#90)
#93Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#91)
#94Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#92)
#95Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#94)
#96Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#95)
#97Amul Sul
sulamul@gmail.com
In reply to: Alvaro Herrera (#85)
#98Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#97)
#99Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amul Sul (#98)
#100Amul Sul
sulamul@gmail.com
In reply to: Kyotaro Horiguchi (#99)
#101Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amul Sul (#100)
#102Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#96)
#103Alexander Lakhin
exclusion@gmail.com
In reply to: Alvaro Herrera (#96)
#104Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alexander Lakhin (#103)
#105Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#102)
#106Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#105)
In reply to: Tom Lane (#105)
#108Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#106)
#109Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#108)
#110Amul Sul
sulamul@gmail.com
In reply to: Alvaro Herrera (#109)
#111Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amul Sul (#110)
#112Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#111)
#113Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#112)
#114Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#113)
#115Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#114)
#116Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#115)
#117Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#116)
#118Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#117)
#119Amul Sul
sulamul@gmail.com
In reply to: Tom Lane (#117)
#120Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amul Sul (#119)
#121Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#35)