pgsql: Fix WAL replay in presence of an incomplete record
Physical replication always ships WAL segment files to replicas once
they are complete. This is a problem if one WAL record is split across
a segment boundary and the primary server crashes before writing down
the segment with the next portion of the WAL record: WAL writing after
crash recovery would happily resume at the point where the broken record
started, overwriting that record ... but any standby or backup may have
already received a copy of that segment, and they will not rewind it.
This causes standbys to stop following the primary after the latter
crashes:
LOG: invalid contrecord length 7262 at A8/D9FFFBC8
because the standby is still trying to read the continuation record
(contrecord) for the original long WAL record, but it is not there and
it will never be. A workaround is to stop the replica, delete the WAL
file, and restart it -- at which point a fresh copy is brought over from
the primary. But that's pretty labor intensive, and I bet many users
would just give up and re-clone the standby instead.
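The manual workaround above can be sketched as a short shell script. The paths and segment name here are placeholders (the segment name is derived from the LSN in the log message above, assuming timeline 1 and the default 16MB segment size; in practice, take it from the standby's own "invalid contrecord" error), and a dry-run guard is included so nothing destructive runs by default:

```shell
# Hypothetical paths: adjust PGDATA, and take the segment name from the
# standby's "invalid contrecord" error before running this for real.
PGDATA=${PGDATA:-/var/lib/postgresql/data}
BAD_SEG=${BAD_SEG:-00000001000000A8000000D9}

# Dry-run guard: with DRYRUN=1 (the default) commands are only printed.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = 1 ]; then echo "would run: $*"; else "$@"; fi }

run pg_ctl -D "$PGDATA" stop -m fast     # stop the standby
run rm "$PGDATA/pg_wal/$BAD_SEG"         # delete the broken segment
run pg_ctl -D "$PGDATA" start            # restart; a fresh copy is fetched from the primary
```

With DRYRUN=0 the same script performs the actual steps; reviewing the printed commands first is the point of the guard.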
A fix for this problem was already attempted in commit 515e3d84a0b5, but
it only addressed the scenario of WAL archiving, so
streaming replication would still be a problem (as well as other things
such as taking a filesystem-level backup while the server is down after
having crashed), and it had performance scalability problems too; so it
had to be reverted.
This commit fixes the problem using an approach suggested by Andres
Freund, whereby the initial portion(s) of the split-up WAL record are
kept, and a special type of WAL record is written where the contrecord
was lost, so that WAL replay in the replica knows to skip the broken
parts. With this approach, we can continue to stream/archive segment
files as soon as they are complete, and replay of the broken records
will proceed across the crash point without a hitch.
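As a sketch of how the fix can be observed in practice: pg_waldump (a standard PostgreSQL tool) should show the new record type near the crash point after the patched primary recovers. The record name OVERWRITE_CONTRECORD is taken from this commit's test file name; the exact descriptor text in the dump output is an assumption, and the dry-run guard keeps the command from running by default:

```shell
# Inspect WAL around the crash point. The start LSN is the one from the
# standby's log message above; adjust PGDATA and the LSN to your case.
PGDATA=${PGDATA:-/var/lib/postgresql/data}
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = 1 ]; then echo "would run: $*"; else "$@"; fi }

# After crash recovery on a patched primary, a record describing the
# overwritten contrecord should appear in the dump (record name per this
# commit; exact output format is an assumption).
run pg_waldump -p "$PGDATA/pg_wal" -s A8/D9FFFBC8
```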
Because a new type of WAL record is added, users should be careful to
upgrade standbys first, primaries later. Otherwise they risk the standby
being unable to start if the primary happens to write such a record.
A new TAP test that exercises this is added, but its portability remains
to be seen.
This has been wrong since the introduction of physical replication, so
backpatch all the way back. In stable branches, keep the new
XLogReaderState members at the end of the struct, to avoid an ABI
break.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Nathan Bossart <bossartn@amazon.com>
Discussion: /messages/by-id/202108232252.dh7uxf6oxwcy@alvherre.pgsql
Branch
------
master
Details
-------
https://git.postgresql.org/pg/commitdiff/ff9f111bce24fd9bbca7a20315586de877d74923
Modified Files
--------------
src/backend/access/rmgrdesc/xlogdesc.c | 12 ++
src/backend/access/transam/xlog.c | 154 +++++++++++++++++-
src/backend/access/transam/xlogreader.c | 40 ++++-
src/include/access/xlog_internal.h | 11 +-
src/include/access/xlogreader.h | 10 ++
src/include/catalog/pg_control.h | 2 +
src/test/recovery/t/026_overwrite_contrecord.pl | 207 ++++++++++++++++++++++++
src/test/recovery/t/idiosyncratic_copy | 20 +++
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 450 insertions(+), 7 deletions(-)
Hi Alvaro,
On Wed, Sep 29, 2021 at 02:40:29PM +0000, Alvaro Herrera wrote:
Fix WAL replay in presence of an incomplete record
[...]
src/test/recovery/t/026_overwrite_contrecord.pl | 207 ++++++++++++++++++++++++
src/test/recovery/t/idiosyncratic_copy | 20 +++
The buildfarm is saying that this test fails on Windows:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2021-09-29%2020%3A00%3A01
Sep 29 17:27:23 t/026_overwrite_contrecord..........FAILED--Further testing stopped: command "pg_basebackup -D...
[...]
pg_basebackup: error: connection to server at "127.0.0.1", port 55644 failed: FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "pgrunner", no encryption
+# Second test: a standby that receives WAL via archive/restore commands.
+$node = PostgresNode->new('primary2');
+$node->init(
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
The error is here, where you need to set has_streaming => 1 to set up
primary2 correctly on Windows (see 992d353).
Thanks,
--
Michael
On 2021-Sep-30, Michael Paquier wrote:
Hi Alvaro,
On Wed, Sep 29, 2021 at 02:40:29PM +0000, Alvaro Herrera wrote:
Fix WAL replay in presence of an incomplete record
[...]
src/test/recovery/t/026_overwrite_contrecord.pl | 207 ++++++++++++++++++++++++
src/test/recovery/t/idiosyncratic_copy | 20 +++
The buildfarm is saying that this test fails on Windows:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2021-09-29%2020%3A00%3A01
Sep 29 17:27:23 t/026_overwrite_contrecord..........FAILED--Further testing stopped: command "pg_basebackup -D...
[...]
pg_basebackup: error: connection to server at "127.0.0.1", port 55644 failed: FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "pgrunner", no encryption
Thanks. We had already discussed this in the other thread and I opted
to call ->set_replication_conf instead:
/messages/by-id/202109292127.7q66qhxhde67@alvherre.pgsql
According to Andres, there's still going to be a failure for other
reasons, but let's see what happens.
--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
<inflex> really, I see PHP as like a strange amalgamation of C, Perl, Shell
<crab> inflex: you know that "amalgam" means "mixture with mercury",
more or less, right?
<crab> i.e., "deadly poison"
[ I'm working on the release notes ]
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Fix WAL replay in presence of an incomplete record
...
Because a new type of WAL record is added, users should be careful to
upgrade standbys first, primaries later. Otherwise they risk the standby
being unable to start if the primary happens to write such a record.
Is there really any point in issuing such advice? IIUC, the standbys
would be unable to proceed anyway in case of a primary crash at the
wrong time, because an un-updated primary would send them inconsistent
WAL. If anything, it seems like it might be marginally better to
update the primary first, reducing the window for it to send WAL that
the standbys will *never* be able to handle. Then, if it crashes, at
least the WAL contains something the standbys can process once you
update them.
Or am I missing something?
regards, tom lane
On 2021-Nov-04, Tom Lane wrote:
Is there really any point in issuing such advice? IIUC, the standbys
would be unable to proceed anyway in case of a primary crash at the
wrong time, because an un-updated primary would send them inconsistent
WAL. If anything, it seems like it might be marginally better to
update the primary first, reducing the window for it to send WAL that
the standbys will *never* be able to handle. Then, if it crashes, at
least the WAL contains something the standbys can process once you
update them.
Yes -- in production settings, it is better to be able to shut down the
standbys in a scheduled manner than to find out, after updating the
primary, that your standbys are suddenly inaccessible until you take the
further action of updating them.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Si no sabes adonde vas, es muy probable que acabes en otra parte.
On 2021-Nov-05, Alvaro Herrera wrote:
On 2021-Nov-04, Tom Lane wrote:
the standbys
would be unable to proceed anyway in case of a primary crash at the
wrong time, because an un-updated primary would send them inconsistent
WAL. If anything, it seems like it might be marginally better to
update the primary first, reducing the window for it to send WAL that
the standbys will *never* be able to handle. Then, if it crashes, at
least the WAL contains something the standbys can process once you
update them.
I suppose the strategy is useless if the primary never crashes. If the
situation does occur, users can handle it the same way they've handled
it thus far: manually delete the segment from the standby and restart.
At least they know what to do and may even have already automated it.
The other situation is new and would need somebody, possibly roused
abruptly from their sleep, to figure out why their standbys refuse to
continue replication in a novel way.
--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
"Porque Kim no hacía nada, pero, eso sí,
con extraordinario éxito" ("Kim", Kipling)