prevent immature WAL streaming
Included 蔡梦娟 and Jakub Wartak because they've expressed interest in
this topic -- notably [2] ("Bug on update timing of walrcv->flushedUpto
variable").
As mentioned in the course of thread [1], we're missing a fix for
streaming replication to avoid sending records that the primary hasn't
fully flushed yet. This patch is a first attempt at fixing that problem
by retreating the LSN reported as FlushPtr whenever a segment is
registered, based on the understanding that if no registration exists
then the LogwrtResult.Flush pointer can be taken at face value; but if a
registration exists, then we have to stream only till the start LSN of
that registered entry.
This patch is probably incomplete. First, I'm not sure that logical
replication is affected by this problem. I think it isn't, because
logical replication will halt until the record can be read completely --
maybe I'm wrong and there is a way for things to go wrong with logical
replication as well. But also, I need to look at the other uses of
GetFlushRecPtr() and see if those need to change to the new function too
or they can remain what they are now.
I'd also like to have tests. That seems moderately hard, but if we had
WAL-molasses that could be used in walreceiver, it could be done. (It
sounds easier to write tests with a molasses-archive_command.)
[1]: /messages/by-id/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
[2]: /messages/by-id/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com
--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Attachments:
0001-Don-t-stream-non-final-WAL-segments.patch (text/x-diff; +37 -6)
At Mon, 23 Aug 2021 18:52:17 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in
I'd also like to have tests. That seems moderately hard, but if we had
WAL-molasses that could be used in walreceiver, it could be done. (It
sounds easier to write tests with a molasses-archive_command.)
(I'm not sure what "WAL-molasses" above expresses, same as "sugar"?)
For our information, this issue is related to commit 0668719801, which
makes XLogPageRead restart reading a (continued or segment-spanning)
record when switching sources. (In that thread, I modified the code to
cause a server crash under the desired situation.)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 8/23/21, 3:53 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
As mentioned in the course of thread [1], we're missing a fix for
streaming replication to avoid sending records that the primary hasn't
fully flushed yet. This patch is a first attempt at fixing that problem
by retreating the LSN reported as FlushPtr whenever a segment is
registered, based on the understanding that if no registration exists
then the LogwrtResult.Flush pointer can be taken at face value; but if a
registration exists, then we have to stream only till the start LSN of
that registered entry.
I wonder if we need to move the call to RegisterSegmentBoundary() to
somewhere before WALInsertLockRelease() for this to work as intended.
Right now, boundary registration could take place after the flush
pointer has been advanced, which means GetSafeFlushRecPtr() could
still return an unsafe position.
Nathan
On 2021-Aug-24, Bossart, Nathan wrote:
I wonder if we need to move the call to RegisterSegmentBoundary() to
somewhere before WALInsertLockRelease() for this to work as intended.
Right now, boundary registration could take place after the flush
pointer has been advanced, which means GetSafeFlushRecPtr() could
still return an unsafe position.
Yeah, you're right -- that's a definite risk. I didn't try to reproduce
a problem with that, but it seems pretty obvious that it can happen.
I didn't have a lot of luck with a reliable reproducer script. I was
able to reproduce the problem starting with Ryo Matsumura's script and
attaching a replica; most of the time the replica would recover by
restarting from a streaming position earlier than where the problem
occurred; but a few times it would just get stuck with a WAL segment
containing a bogus record. Then, after the patch, the problem no longer
occurs.
I attach the patch with the change you suggested.
--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Tom: There seems to be something broken here.
Teodor: I'm in sackcloth and ashes... Fixed.
http://archives.postgresql.org/message-id/482D1632.8010507@sigaev.ru
Attachments:
v2-0001-Don-t-stream-non-final-WAL-segments.patch (text/x-diff; +62 -22)
On 8/24/21, 4:03 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
Yeah, you're right -- that's a definite risk. I didn't try to reproduce
a problem with that, but it seems pretty obvious that it can happen.
If moving RegisterSegmentBoundary() is sufficient to prevent the flush
pointer from advancing before we register the boundary, I bet we could
also remove the WAL writer nudge.
Another interesting thing I see is that the boundary stored in
earliestSegBoundary is not necessarily the earliest one. It's just
the first one that has been registered. I did this for simplicity for
the .ready file fix, but I can see it causing problems here. I think
we can move earliestSegBoundary backwards as long as it is greater
than lastNotifiedSeg + 1. However, it's still not necessarily the
earliest one if we copied latestSegBoundary to earliestSegBoundary in
NotifySegmentsReadyForArchive(). To handle this, we could track
several boundaries in an array, but then we'd have to hold the safe
LSN back whenever the array overflowed and we started forgetting
boundaries.
Perhaps there's a simpler solution. I'll keep thinking...
Nathan
Hi Álvaro, -hackers,
I attach the patch with the change you suggested.
I've given the v02 patch a shot on top of REL_12_STABLE (already including 5065aeafb0b7593c04d3bc5bc2a86037f32143fc). Previously (yesterday), without the v02 patch, I was always getting standby corruption via simulation: a separate dedicated filesystem for pg_xlog, archive_mode=on, wal_keep_segments=120, archive_command set to rsync to a different dir on the same fs, and wal_init_zero at its default (true).
Today (with v02) I got corruption on the standby only in the initial 2 runs out of more than ~30 tries. Probably those 2 failures were somehow my fault (?) or some rare condition (and in 1 of the 2 cases simply restarting the standby helped). To be honest, I've tried to force this error, but with v02 I simply cannot force it anymore, so that's good! :)
I didn't have a lot of luck with a reliable reproducer script. I was able to
reproduce the problem starting with Ryo Matsumura's script and attaching
a replica; most of the time the replica would recover by restarting from a
streaming position earlier than where the problem occurred; but a few
times it would just get stuck with a WAL segment containing a bogus
record.
In order to get a reliable reproducer and do proper fault injection, instead of playing with really filling up the fs, apparently one can substitute the fd with an fd for /dev/full using e.g. dup2(), so that every write throws this error too:
root@hive:~# ./t & # simple while(1) { fprintf() flush () } testcase
root@hive:~# ls -l /proc/27296/fd/3
lrwx------ 1 root root 64 Aug 25 06:22 /proc/27296/fd/3 -> /tmp/testwrite
root@hive:~# gdb -q -p 27296
-- 1089 is bitmask O_WRONLY|..
(gdb) p dup2(open("/dev/full", 1089, 0777), 3)
$1 = 3
(gdb) c
Continuing.
==>
fflush/write(): : No space left on device
So I've also tried to be malicious while writing to the DB and inject ENOSPC near places like:
a) XLogWrite()->XLogFileInit() near line 3322 // assuming: if (wal_init_zero) is true, one gets classic "PANIC: could not write to file "pg_wal/xlogtemp.90670": No space left on device"
b) XLogWrite() near line 2547 just after pg_pwrite // one can get "PANIC: could not write to log file 000000010000003B000000A8 at offset 0, length 15466496: No space left on device" (that would be possible with wal_init_zero=false?)
c) XLogWrite() near line 2592 // just before issue_xlog_fsync to get "PANIC: could not fdatasync file "000000010000004300000004": Invalid argument" that would pretty much mean same as above but with last possible offset near end of WAL?
This was done with gdb voodoo:
handle SIGUSR1 noprint nostop
break xlog.c:<LINE> // https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/transam/xlog.c#L3311
c
print fd or openLogFile -- to verify it is 3
p dup2(open("/dev/full", 1089, 0777), 3) -- during most of walwriter runtime it has current log as fd=3
After restarting the master and inspecting the standby - in all of the above 3 cases - the standby didn't exhibit the "invalid contrecord length" error here, while without this v02 patch the corruption is notorious. So if it passes the worst-case code review assumptions, I'd wonder whether it shouldn't be committed as it stands right now.
-J.
On 2021-Aug-24, Bossart, Nathan wrote:
If moving RegisterSegmentBoundary() is sufficient to prevent the flush
pointer from advancing before we register the boundary, I bet we could
also remove the WAL writer nudge.
Can you elaborate on this? I'm not sure I see the connection.
Another interesting thing I see is that the boundary stored in
earliestSegBoundary is not necessarily the earliest one. It's just
the first one that has been registered. I did this for simplicity for
the .ready file fix, but I can see it causing problems here.
Hmm, is there really a problem here? Surely the flush point cannot go
past whatever has been written. If somebody is writing an earlier
section of WAL, then we cannot move the flush pointer to a later
position. So it doesn't matter if the earliest point we have registered
is the true earliest -- we only care for it to be the earliest that is
past the flush point.
--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
On Mon, Aug 23, 2021 at 11:04 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Mon, 23 Aug 2021 18:52:17 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in
I'd also like to have tests. That seems moderately hard, but if we had
WAL-molasses that could be used in walreceiver, it could be done. (It
sounds easier to write tests with a molasses-archive_command.)

(I'm not sure what "WAL-molasses" above expresses, same as "sugar"?)
I think, but am not 100% sure, that "molasses" here is being used to
refer to fault injection.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 8/25/21, 5:33 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
On 2021-Aug-24, Bossart, Nathan wrote:
If moving RegisterSegmentBoundary() is sufficient to prevent the flush
pointer from advancing before we register the boundary, I bet we could
also remove the WAL writer nudge.

Can you elaborate on this? I'm not sure I see the connection.
The reason we are moving RegisterSegmentBoundary() to before
WALInsertLockRelease() is because we believe it will prevent boundary
registration from taking place after the flush pointer has already
advanced past the boundary in question. We had added the WAL writer
nudge to make sure we called NotifySegmentsReadyForArchive() whenever
that happened.
If moving boundary registration to before we release the lock(s) is
enough to prevent the race condition with the flush pointer, then ISTM
we no longer have to worry about nudging the WAL writer.
Another interesting thing I see is that the boundary stored in
earliestSegBoundary is not necessarily the earliest one. It's just
the first one that has been registered. I did this for simplicity for
the .ready file fix, but I can see it causing problems here.

Hmm, is there really a problem here? Surely the flush point cannot go
past whatever has been written. If somebody is writing an earlier
section of WAL, then we cannot move the flush pointer to a later
position. So it doesn't matter if the earliest point we have registered
is the true earliest -- we only care for it to be the earliest that is
past the flush point.
Let's say we have the following situation (F = flush, E = earliest
registered boundary, and L = latest registered boundary), and let's
assume that each segment has a cross-segment record that ends in the
next segment.
         F        E                       L
|-----|-----|-----|-----|-----|-----|-----|-----|
   1     2     3     4     5     6     7     8
Then, we write out WAL to disk and create .ready files as needed. If
we didn't flush beyond the latest registered boundary, the latest
registered boundary now becomes the earliest boundary.
                              F           E
|-----|-----|-----|-----|-----|-----|-----|-----|
   1     2     3     4     5     6     7     8
At this point, the earliest segment boundary past the flush point is
before the "earliest" boundary we are tracking.
Nathan
On 2021-Aug-25, Jakub Wartak wrote:
In order to get reliable reproducer and get proper the fault injection
instead of playing with really filling up fs, apparently one could
substitute fd with fd of /dev/full using e.g. dup2() so that every
write is going to throw this error too:
Oh, this is a neat trick that I didn't know about. Thanks.
After restarting master and inspecting standby - in all of those above
3 cases - the standby didn't inhibit the "invalid contrecord length"
at least here, while without corruption this v02 patch it is
notorious. So if it passes the worst-case code review assumptions I
would be wondering if it shouldn't even be committed as it stands
right now.
Well, Nathan is right that this patch isn't really closing the hole
because we aren't tracking segment boundaries that aren't "earliest" nor
"latest" at the moment of registration. The simplistic approach seems
okay for the archive case, but not for the replication case.
I also noticed today (facepalm) that the patch doesn't work unless you
have archiving enabled, because the registration code is only invoked
when XLogArchivingActive(). Doh. This part is easy to solve. The
other doesn't look easy ...
--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
BTW while going about testing this, I noticed that we forbid
pg_walfile_name() while in recovery. That restriction was added by
commit 370f770c15a4 because ThisTimeLineID was not set correctly during
recovery. That was supposed to be fixed by commit 1148e22a82ed, so I
thought that it should be possible to remove the restriction. However,
I did that per the attached patch, but was quickly disappointed because
ThisTimeLineID seems to remain zero in a standby for reasons that I
didn't investigate.
--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"The problem with the facetime model is not just that it's demoralizing, but
that the people pretending to work interrupt the ones actually working."
(Paul Graham)
Attachments:
0001-Don-t-forbid-pg_walfile_name-in-recovery-mode.patch (text/x-diff; +0 -8)
At Wed, 25 Aug 2021 18:18:59 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
At this point, the earliest segment boundary past the flush point is
before the "earliest" boundary we are tracking.
We know we have some cross-segment records in the region [E L] so we
cannot add a .ready file if flush is in the region. I haven't looked at
the latest patch (or I may misunderstand the discussion here) but I
think we shouldn't move E before F exceeds the previous (or in the
first picture above) L. Things are done that way in my ancient proposal
in [1]:
https://www.postgresql.org/message-id/attachment/117052/v4-0001-Avoid-archiving-a-WAL-segment-that-continues-to-t.patch
+ if (LogwrtResult.Write < firstSegContRecStart ||
+ lastSegContRecEnd <= LogwrtResult.Write)
+ {
<notify the last segment>
By doing so, at the time of the second picture, the pointers are set as:
                  E           F           L
|-----|-----|-----|-----|-----|-----|-----|-----|
   1     2     3     4     5     6     7     8
Then the pointers are cleared at the time F reaches L:
                                          F
|-----|-----|-----|-----|-----|-----|-----|-----|
   1     2     3     4     5     6     7     8
Doesn't this work correctly? As I think I mentioned in that thread, I
don't think we'd have many (more than several, specifically) segments
in a region [E L].
[1]: /messages/by-id/20201216.110120.887433782054853494.horikyota.ntt@gmail.com
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Wed, 25 Aug 2021 20:20:04 -0400, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote in
BTW while going about testing this, I noticed that we forbid
pg_walfile_name() while in recovery. [...] ThisTimeLineID seems to
remain zero in a standby for reasons that I didn't investigate.
On an intermediate node of a cascading replication setup, the timeline
id on the walsender and walreceiver can differ, and ordinary backends
cannot decide which to use.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
(this is off-topic here)
At Wed, 25 Aug 2021 09:56:56 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
(I'm not sure what "WAL-molasses" above expresses, same as "sugar"?)
I think, but am not 100% sure, that "molasses" here is being used to
refer to fault injection.
Oh. That makes sense, thanks.
I sometimes inject artificial faults (a server crash, in this case) to
create specific on-disk states, but I cannot imagine that that kind of
machinery can be statically placed in the source tree.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 8/25/21, 5:40 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
We know we have some cross-segment records in the region [E L] so we
cannot add a .ready file if flush is in the region. I haven't looked at
the latest patch (or I may misunderstand the discussion here) but I
think we shouldn't move E before F exceeds the previous (or in the
first picture above) L.
The strategy in place ensures that we track a boundary that doesn't
change until the flush position passes it as well as the latest
registered boundary. I think it is important that any segment
boundary tracking mechanism does at least those two things. I don't
see how we could do that if we didn't update E until F passed both E
and L.
Nathan
At Thu, 26 Aug 2021 03:24:48 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
The strategy in place ensures that we track a boundary that doesn't
change until the flush position passes it as well as the latest
registered boundary. I think it is important that any segment
boundary tracking mechanism does at least those two things. I don't
see how we could do that if we didn't update E until F passed both E
and L.
(Sorry, but I didn't get your point clearly, so the discussion below
might be pointless.)
The ancient patch did this:

If a flush didn't reach E, we can archive finished segments.

If a flush ends between E and L, we shouldn't archive finished segments
at all. L can be moved further in this state, while E sits at the same
location.

Once a flush passes L, we can archive all finished segments and can
erase both E and L.
A drawback of this strategy is that the region [E L] can contain gaps
(that is, segment boundaries that are not bound by a continuation
record), and archiving can be excessively delayed. Perhaps if the flush
lags the write head by more than two segments, the probability of
creating such gaps would be higher.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
I'm again distracted by something else, so here's what will have to
pass for v3 for the time being.
--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Attachments:
v3-replication-segment-boundaries.patch (text/x-diff; +97 -59)
Hi,
On 2021-08-23 18:52:17 -0400, Alvaro Herrera wrote:
This patch is a first attempt at fixing that problem by retreating the
LSN reported as FlushPtr whenever a segment is registered, based on the
understanding that if no registration exists then the LogwrtResult.Flush
pointer can be taken at face value; but if a registration exists, then
we have to stream only till the start LSN of that registered entry.
I'm doubtful that the approach of adding awareness of record boundaries
is a good path to go down:
- It adds nontrivial work to hot code paths to handle an edge case,
rather than making rare code paths more expensive.
- There are very similar issues with promotions of replicas (consider
what happens if we need to promote with the end of local WAL spanning
a segment boundary, and what happens to cascading replicas). We have
some logic to try to deal with that, but it's pretty grotty and I
think incomplete.
- It seems to make some future optimizations harder - we should work
towards replicating data sooner, rather than the opposite. Right now
that's a major bottleneck around syncrep.
- Once XLogFlush() for some LSN has returned, we can write that LSN to
disk. The LSN doesn't necessarily have to correspond to a specific
on-disk location (it could e.g. be the return value from
GetFlushRecPtr()). But "rewinding" to before the last record makes that
problematic.
- I suspect that schemes with heuristic knowledge of segment boundary
spanning records have deadlock or at least latency spike issues. What
if synchronous commit needs to flush up to a certain record boundary,
but streaming rep doesn't replicate it out because there are
segment-spanning records both before and after?
I think a better approach might be to handle this on the WAL layout
level. What if we never overwrite partial records but instead just
skipped over them during decoding?
Of course there are some difficulties with that - the checksum and the
length from the record header aren't going to be meaningful.
But we could deal with that using a special flag in the
XLogPageHeaderData.xlp_info of the following page. If that flag is set,
xlp_rem_len could contain the checksum of the partial record.
Greetings,
Andres Freund
On 2021-Aug-30, Andres Freund wrote:
I'm doubtful that the approach of adding awareness of record boundaries
is a good path to go down:
Honestly, I do not like it one bit and if I can avoid relying on them
while making the whole thing work correctly, I am happy. Clearly it
wasn't a problem for the ancient recovery-only WAL design, but as soon
as we added replication on top the whole issue of continuation records
became a bug.
I do think that the code should be first correct and second performant,
though.
- There are very similar issues with promotions of replicas (consider
what happens if we need to promote with the end of local WAL spanning
a segment boundary, and what happens to cascading replicas). We have
some logic to try to deal with that, but it's pretty grotty and I
think incomplete.
Ouch, I hadn't thought of cascading replicas.
- It seems to make some future optimizations harder - we should work
towards replicating data sooner, rather than the opposite. Right now
that's a major bottleneck around syncrep.
Absolutely.
I think a better approach might be to handle this on the WAL layout
level. What if we never overwrite partial records but instead just
skipped over them during decoding?
Maybe this is a workable approach, let's work it out fully.
Let me see if I understand what you mean:
* We would remove the logic to inhibit archiving and streaming-
replicating the tail end of a split WAL record; that logic deals with
bytes only, so doesn't have to be aware of record boundaries.
* On WAL replay, we ignore records that are split across a segment
boundary and whose checksum does not match.
* On WAL write ... ?
How do we detect after recovery that a record that was being written,
and potentially was sent to the archive, needs to be "skipped"?
--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Hi,
On 2021-08-31 09:56:30 -0400, Alvaro Herrera wrote:
On 2021-Aug-30, Andres Freund wrote:
I think a better approach might be to handle this on the WAL layout
level. What if we never overwrite partial records but instead just
skipped over them during decoding?

Maybe this is a workable approach, let's work it out fully.
Let me see if I understand what you mean:
* We would remove the logic to inhibit archiving and streaming-
replicating the tail end of a split WAL record; that logic deals with
bytes only, so doesn't have to be aware of record boundaries.
* On WAL replay, we ignore records that are split across a segment
boundary and whose checksum does not match.
* On WAL write ... ?
I was thinking that on a normal WAL write we'd do nothing. Instead we would
have dedicated code at the end of recovery that, if the WAL ends in a partial
record, changes the page following the "valid" portion of the WAL to indicate
that an incomplete record is to be skipped.
Of course, we need to be careful to not weaken WAL validity checking too
much. How about the following:
If we're "aborting" a continued record, we set XLP_FIRST_IS_ABORTED_PARTIAL on
the page at which we do so (i.e. the page after the valid end of the WAL).
On a page with XLP_FIRST_IS_ABORTED_PARTIAL we expect a special type of record
to start just after the page header. That record contains sufficient
information for us to verify the validity of the partial record (since its
checksum and length aren't valid, and may not even be all readable if the
record header itself was split). I think it would make sense to include the
LSN of the aborted record, and a checksum of the partial data.
How do we detect after recovery that a record that was being written,
and potentially was sent to the archive, needs to be "skipped"?
I think we can just read the WAL and see if it ends with a partial
record. It'd add a bit of complication to the error checking in xlogreader,
because we'd likely want to treat verification from page headers a bit
different from verification due to record data. But that seems doable.
Does this make sense?
Greetings,
Andres Freund