Incorrect snapshots while promoting hot standby node when 2PC is used
Hi,
Michael Paquier (running locally I think), and subsequently Thomas Munro
(noticing [1]), privately reported that they noticed an assertion failure in
GetSnapshotData(). Both reasonably were wondering if that's related to the
snapshot scalability patches.
Michael reported the following assertion failure in 023_pitr_prepared_xact.pl:
TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2468, PID: 22901)
The failure was triggered by one of the new TAP tests,
023_pitr_prepared_xact.pl, after recovering a 2PC transaction that
used a transaction ID that matches RecentXmin:
(gdb) p RecentXmin
$1 = 588
(gdb) p TransactionXmin
$2 = 589
I tried for a while to reproduce that, but couldn't. Adding a bunch of
debugging output and increasing the log level does show the problem pretty
clearly, however, without tripping any asserts:
2021-04-21 17:55:54.287 PDT [1829098] [unknown] LOG: setting xmin to 588
...
2021-04-21 17:55:54.377 PDT [1829049] DEBUG: removing all KnownAssignedXids
2021-04-21 17:55:54.377 PDT [1829049] DEBUG: release all standby locks
...
2021-04-21 17:55:54.396 PDT [1829100] [unknown] LOG: setting xmin to 589
...
2021-04-21 17:55:55.379 PDT [1829048] LOG: database system is ready to accept connections
...
2021-04-21 17:55:55.380 PDT [1829120] LOG: setting xmin to 588
...
2021-04-21 17:55:55.386 PDT [1829126] [unknown] LOG: setting xmin to 588
2021-04-21 17:55:55.387 PDT [1829126] 023_pitr_prepared_xact.pl LOG: statement: COMMIT PREPARED 'fooinsert';
...
2021-04-21 17:55:55.428 PDT [1829128] [unknown] LOG: setting xmin to 589
So there's clear evidence of xmin flip-flopping between 588 and 589.
After looking some more, the bug isn't even that subtle. And it's definitely not
new - it has likely existed since the introduction of hot standby.
The sequence in StartupXLOG() leading to the issue is the following:
1) RecoverPreparedTransactions();
2) ShutdownRecoveryTransactionEnvironment();
3) XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
Because 2) resets the KnownAssignedXids machinery, snapshots that happen
between 2) and 3) will not actually look at the procarray to compute
snapshots, as that only happens within
snapshot->takenDuringRecovery = RecoveryInProgress();
if (!snapshot->takenDuringRecovery)
and RecoveryInProgress() checks XLogCtl->SharedRecoveryState !=
RECOVERY_STATE_DONE, which is set in 3).
So snapshots within that window will always be "empty", i.e. xmin is
latestCompletedXid and xmax is latestCompletedXid + 1. Once we reach 3), we'll
look at the procarray, which then leads to xmin going back to 588.
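To make the window concrete, here is an abridged sketch of that branch of
GetSnapshotData() (paraphrased, not the verbatim procarray.c code;
CollectXidsFromProcArray() is a made-up stand-in for the real procarray loop):

snapshot->takenDuringRecovery = RecoveryInProgress();

if (!snapshot->takenDuringRecovery)
{
    /* normal running: gather in-progress XIDs from the procarray */
    xmin = CollectXidsFromProcArray(snapshot);  /* hypothetical stand-in */
}
else
{
    /*
     * Recovery: in-progress XIDs come from KnownAssignedXids instead.  In
     * the window between 2) and 3) that array has already been reset while
     * RecoveryInProgress() still returns true, so the snapshot comes out
     * "empty" and ignores the just-recovered prepared transaction.
     */
    subcount = KnownAssignedXidsGetAndSetXmin(snapshot->xip, &xmin, xmax);
}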
I think that this can lead to data corruption, because a too new xmin horizon
could lead to rows from a prepared transaction getting hint bitted as dead (or
perhaps even pruned, although that's probably harder to hit). Due to the too
new xmin horizon we won't treat rows by the prepared xact as in-progress, and
TransactionIdDidCommit() will return false, as the transaction didn't commit
yet. Which afaict can result in row versions created by the prepared
transaction being invisible even after the prepared transaction commits.
Without prepared transactions there probably isn't an issue, because there
shouldn't be any other in-progress xids at that point?
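To spell out the hint-bit hazard, here is a paraphrased fragment of the xmin
handling in HeapTupleSatisfiesMVCC() (simplified, not the verbatim code):

if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
    return false;       /* considered still running: invisible, no hint set */
else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
    SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                HeapTupleHeaderGetRawXmin(tuple));
else
{
    /*
     * Neither "in progress" per the snapshot nor committed.  With the too
     * new snapshot xmin, the prepared transaction's XID falls below
     * snapshot->xmin, so XidInMVCCSnapshot() says "not running", while
     * TransactionIdDidCommit() says "not committed" because the xact is
     * merely prepared.  The tuple gets hinted as coming from an aborted
     * transaction and stays invisible even after COMMIT PREPARED.
     */
    SetHintBits(tuple, buffer, HEAP_XMIN_INVALID, InvalidTransactionId);
    return false;
}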
I think to fix the issue we'd have to move
ShutdownRecoveryTransactionEnvironment() to after XLogCtl->SharedRecoveryState
= RECOVERY_STATE_DONE.
The acquisition of ProcArrayLock() in
ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds()
should prevent the data from being removed between the RecoveryInProgress()
and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData().
I haven't yet figured out whether there would be a problem with deferring the
other tasks in ShutdownRecoveryTransactionEnvironment() until after
RECOVERY_STATE_DONE.
I think we ought to introduce assertions that have a higher chance of
catching cases like this. The window for hitting the new assertion that
Michael ran into is pretty darn small (xmin needs to move backwards
between two snapshot computations inside a single transaction). I
*think* we can safely assert that xmin doesn't move backwards globally,
if we store it as a 64bit xid, and don't perform that check in
walsender?
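A minimal sketch of that idea, assuming it runs inside GetSnapshotData() where
xmin and the 64-bit latest_completed are already at hand; lastComputedXmin64
is a made-up shared counter and locking details are glossed over:

#ifdef USE_ASSERT_CHECKING
if (!am_walsender)
{
    /* widen xmin to 64 bits relative to the latest completed fxid */
    uint64      xmin64 = U64FromFullTransactionId(
                            FullXidRelativeTo(latest_completed, xmin));

    /* hypothetical shared counter; a real patch would have to define,
     * initialize and update it under adequate locking */
    Assert(xmin64 >= pg_atomic_read_u64(&lastComputedXmin64));
    pg_atomic_write_u64(&lastComputedXmin64, xmin64);
}
#endif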
Greetings,
Andres Freund
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-04-20%2003%3A04%3A04
Hi Andres!
On 23 Apr 2021, at 01:36, Andres Freund <andres@anarazel.de> wrote:
So snapshots within that window will always be "empty", i.e. xmin is
latestCompletedXid and xmax is latestCompletedXid + 1. Once we reach 3), we'll
look at the procarray, which then leads to xmin going back to 588.
I think that this can lead to data corruption, because a too new xmin horizon
could lead to rows from a prepared transaction getting hint bitted as dead (or
perhaps even pruned, although that's probably harder to hit). Due to the too
new xmin horizon we won't treat rows by the prepared xact as in-progress, and
TransactionIdDidCommit() will return false, as the transaction didn't commit
yet. Which afaict can result in row versions created by the prepared
transaction being invisible even after the prepared transaction commits.
Without prepared transactions there probably isn't an issue, because there
shouldn't be any other in-progress xids at that point?
I'm investigating a somewhat similar case.
We have an OLTP sharded installation where shards are almost always under rebalancing. Data movement is implemented with 2pc.
Switchover happens quite often due to datacenter drills. The installation is running on PostgreSQL 12.6.
In January, heapcheck of a backup reported some suspicious problems:
ERROR: Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=1
ERROR: Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470241, offnum(tuple)=1
ERROR: Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470242, offnum(tuple)=1
...
and so on for ~100 pages - tuples with lp==1 were not frozen.
We froze the tuples with pg_dirty_hands and ran VACUUM (DISABLE_PAGE_SKIPPING) on the table.
At the end of March the same shard struck again with:
ERROR: Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=42
....
around ~1040 blocks (starting from the same 1470240!) had a non-frozen tuple at lp==42.
I've run
update s3.objects_65 set created = created where ctid = '(1470241,42)' returning *;
After that, heapcheck showed a VM problem:
ERROR: XX001: Found non all-visible tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=42
LOCATION: collect_corrupt_items, heap_check.c:186
VACUUM fixed it with warnings.
WARNING: 01000: page is not marked all-visible but visibility map bit is set in relation "objects_65" page 1470240
and failed on the next page:
ERROR: XX001: found xmin 1650436694 from before relfrozenxid 1752174172
LOCATION: heap_prepare_freeze_tuple, heapam.c:6172
I ran the UPDATE for all tuples in heapcheck's ctid list and a subsequent VACUUM (without page skipping). This satisfied the corruption monitoring.
Can this case be related to the problem that you described?
Or perhaps it looks more like a hardware problem? data_checksums is on, but a few years ago we observed SSD firmware that was losing updates while still passing checksums. I'm sure that we would benefit from having a separate relation fork for checksums or LSNs.
We observe similar cases 3-5 times a year. To date, no data has been lost due to this, but it's somewhat annoying.
BTW I'd say that such things are an argument for back-porting pg_surgery and heapcheck to old versions.
Thanks!
Best regards, Andrey Borodin.
Hi,
On 2021-05-01 17:35:09 +0500, Andrey Borodin wrote:
I'm investigating a somewhat similar case.
We have an OLTP sharded installation where shards are almost always under rebalancing. Data movement is implemented with 2pc.
Switchover happens quite often due to datacenter drills. The installation is running on PostgreSQL 12.6.
If you still have the data, it would be useful if you could check whether the
LSNs of the corrupted pages are from shortly after the standby
promotion/switchover.
Can this case be related to the problem that you described?
It seems possible, but it's hard to know without a lot more information.
Or perhaps it looks more like a hardware problem? data_checksums is
on, but a few years ago we observed SSD firmware that was losing
updates while still passing checksums. I'm sure that we would benefit from
having a separate relation fork for checksums or LSNs.
Right - checksums are "page local". They can only detect if a page is
corrupted, not if e.g. an older version of the page (with correct
checksum) has been restored. While there are schemes to have stronger
error detection properties, they do come with substantial overhead (at
least the ones I can think of right now).
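As a rough illustration (page_looks_ok() is made up, but pg_checksum_page() is
the real verification primitive):

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"

static bool
page_looks_ok(char *page, BlockNumber blkno)
{
    uint16      stored = ((PageHeader) page)->pd_checksum;
    uint16      computed = pg_checksum_page(page, blkno);

    /*
     * True for the current version of the page *and* for any stale version
     * that was once written correctly; nothing on the page itself can tell
     * them apart.  Detecting a lost update needs state kept outside the
     * page, e.g. its expected LSN.
     */
    return stored == computed;
}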
Greetings,
Andres Freund
On 3 May 2021, at 23:10, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-05-01 17:35:09 +0500, Andrey Borodin wrote:
I'm investigating a somewhat similar case.
We have an OLTP sharded installation where shards are almost always under rebalancing. Data movement is implemented with 2pc.
Switchover happens quite often due to datacenter drills. The installation is running on PostgreSQL 12.6.If you still have the data it would be useful if you could check if the
LSNs of the corrupted pages are LSNs from shortly after standby
promotion/switchover?
That's a neat idea, I'll check if I can restore a backup with the corruptions.
I have a test cluster with corruptions, but it has undergone tens of switchovers...
Or perhaps it looks more like a hardware problem? data_checksums is
on, but a few years ago we observed SSD firmware that was losing
updates while still passing checksums. I'm sure that we would benefit from
having a separate relation fork for checksums or LSNs.
Right - checksums are "page local". They can only detect if a page is
corrupted, not if e.g. an older version of the page (with correct
checksum) has been restored. While there are schemes to have stronger
error detection properties, they do come with substantial overhead (at
least the ones I can think of right now).
We could have a PTRACK-like fork with page LSNs. It could be flushed on checkpoint and restored from WAL on crash, so we could always detect a stale page version. Like LSN-track :) We would get much faster rewind and delta-backups for free.
Though I don't think it's worth the effort until we at least checksum the CLOG.
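A hypothetical sketch of what the read-time check could look like
(lsn_fork_get() and the fork itself don't exist; this is only the shape of the
idea):

XLogRecPtr  page_lsn = PageGetLSN(page);
XLogRecPtr  expected_lsn = lsn_fork_get(rel, MAIN_FORKNUM, blkno);  /* hypothetical */

if (page_lsn < expected_lsn)
    ereport(ERROR,
            (errcode(ERRCODE_DATA_CORRUPTED),
             errmsg("stale page version: block %u has LSN %X/%X, expected at least %X/%X",
                    blkno,
                    LSN_FORMAT_ARGS(page_lsn),
                    LSN_FORMAT_ARGS(expected_lsn))));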
Thanks!
Best regards, Andrey Borodin.
Andres Freund <andres@anarazel.de> writes:
Michael Paquier (running locally I think), and subsequently Thomas Munro
(noticing [1]), privately reported that they noticed an assertion failure in
GetSnapshotData(). Both reasonably were wondering if that's related to the
snapshot scalability patches.
Michael reported the following assertion failure in 023_pitr_prepared_xact.pl:
TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2468, PID: 22901)
mantid just showed a failure that looks like the same thing, at
least it's also in 023_pitr_prepared_xact.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mantid&dt=2021-05-03%2013%3A07%3A06
The assertion line number is rather different though:
TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2094, PID: 1163004)
and interestingly, this happened in a parallel worker:
postgres: node_pitr: parallel worker for PID 1162998 (ExceptionalCondition+0x7a)[0x946eca]
postgres: node_pitr: parallel worker for PID 1162998 (GetSnapshotData+0x897)[0x7ef327]
postgres: node_pitr: parallel worker for PID 1162998 (GetNonHistoricCatalogSnapshot+0x4a)[0x986a5a]
postgres: node_pitr: parallel worker for PID 1162998 (systable_beginscan+0x189)[0x4fffe9]
postgres: node_pitr: parallel worker for PID 1162998 [0x937336]
postgres: node_pitr: parallel worker for PID 1162998 [0x937743]
postgres: node_pitr: parallel worker for PID 1162998 (RelationIdGetRelation+0x85)[0x93f155]
postgres: node_pitr: parallel worker for PID 1162998 (relation_open+0x5c)[0x4a348c]
postgres: node_pitr: parallel worker for PID 1162998 (index_open+0x6)[0x5007a6]
postgres: node_pitr: parallel worker for PID 1162998 (systable_beginscan+0x177)[0x4fffd7]
postgres: node_pitr: parallel worker for PID 1162998 [0x937336]
postgres: node_pitr: parallel worker for PID 1162998 [0x93e1c1]
postgres: node_pitr: parallel worker for PID 1162998 (RelationIdGetRelation+0xbd)[0x93f18d]
postgres: node_pitr: parallel worker for PID 1162998 (relation_open+0x5c)[0x4a348c]
postgres: node_pitr: parallel worker for PID 1162998 (table_open+0x6)[0x532656]
postgres: node_pitr: parallel worker for PID 1162998 [0x937306]
postgres: node_pitr: parallel worker for PID 1162998 [0x93e1c1]
postgres: node_pitr: parallel worker for PID 1162998 (RelationIdGetRelation+0xbd)[0x93f18d]
postgres: node_pitr: parallel worker for PID 1162998 (relation_open+0x5c)[0x4a348c]
postgres: node_pitr: parallel worker for PID 1162998 (table_open+0x6)[0x532656]
postgres: node_pitr: parallel worker for PID 1162998 [0x92c4f1]
postgres: node_pitr: parallel worker for PID 1162998 (SearchCatCache1+0x176)[0x92e236]
postgres: node_pitr: parallel worker for PID 1162998 (TupleDescInitEntry+0xb3)[0x4a7eb3]
postgres: node_pitr: parallel worker for PID 1162998 [0x688f30]
postgres: node_pitr: parallel worker for PID 1162998 (ExecInitResultTupleSlotTL+0x1b)[0x68a8fb]
postgres: node_pitr: parallel worker for PID 1162998 (ExecInitResult+0x92)[0x6af902]
postgres: node_pitr: parallel worker for PID 1162998 (ExecInitNode+0x446)[0x684b06]
postgres: node_pitr: parallel worker for PID 1162998 (standard_ExecutorStart+0x269)[0x67d7b9]
postgres: node_pitr: parallel worker for PID 1162998 (ParallelQueryMain+0x1a3)[0x681d83]
postgres: node_pitr: parallel worker for PID 1162998 (ParallelWorkerMain+0x408)[0x53c218]
postgres: node_pitr: parallel worker for PID 1162998 (StartBackgroundWorker+0x23f)[0x774ebf]
postgres: node_pitr: parallel worker for PID 1162998 [0x780f3d]
regards, tom lane
Hi,
On 2021-05-04 12:32:34 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Michael Paquier (running locally I think), and subsequently Thomas Munro
(noticing [1]), privately reported that they noticed an assertion failure in
GetSnapshotData(). Both reasonably were wondering if that's related to the
snapshot scalability patches.
Michael reported the following assertion failure in 023_pitr_prepared_xact.pl:
TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2468, PID: 22901)
mantid just showed a failure that looks like the same thing, at
least it's also in 023_pitr_prepared_xact.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mantid&dt=2021-05-03%2013%3A07%3A06
The assertion line number is rather different though:
TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2094, PID: 1163004)
I managed to hit that one as well, and it's also what fairywren hit - the
assertions at lines 2094 and 2468 are basically copies of the same check, and
which one is hit is a question of timing.
and interestingly, this happened in a parallel worker:
I think the issue can be hit (or rather detected) whenever a transaction
builds one snapshot while in recovery, and a second one during
end-of-recovery. The parallel query here is just
2021-05-03 09:18:35.602 EDT [1162987:6] DETAIL: Failed process was running: SELECT pg_is_in_recovery() = 'f';
(parallel due to force_parallel_mode) - which of course is likely to run
during end-of-recovery
So it does seem like the same bug of resetting the KnownAssignedXids
stuff too early.
Greetings,
Andres Freund
On Thu, Apr 22, 2021 at 01:36:03PM -0700, Andres Freund wrote:
The sequence in StartupXLOG() leading to the issue is the following:
1) RecoverPreparedTransactions();
2) ShutdownRecoveryTransactionEnvironment();
3) XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
Because 2) resets the KnownAssignedXids machinery, snapshots that happen
between 2) and 3) will not actually look at the procarray to compute
snapshots, as that only happens within
snapshot->takenDuringRecovery = RecoveryInProgress();
if (!snapshot->takenDuringRecovery)
and RecoveryInProgress() checks XLogCtl->SharedRecoveryState !=
RECOVERY_STATE_DONE, which is set in 3).
Oh, indeed. It is easy to see RecentXmin jumping back and forth while
running 023_pitr_prepared_xact.pl with a small sleep added just after
ShutdownRecoveryTransactionEnvironment().
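Something along these lines widens the window enough to see it (a debugging
hack only, not proposed code; the 5-second value is arbitrary):

/* in StartupXLOG(), with the current ordering */
if (standbyState != STANDBY_DISABLED)
    ShutdownRecoveryTransactionEnvironment();

pg_usleep(5 * 1000000L);    /* debugging only: widen the window before
                             * SharedRecoveryState becomes RECOVERY_STATE_DONE */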
Without prepared transactions there probably isn't an issue, because there
shouldn't be any other in-progress xids at that point?
Yes, there should not be any as far as I recall. 2PC is kind of
special with its fake ProcArray entries.
I think to fix the issue we'd have to move
ShutdownRecoveryTransactionEnvironment() to after XLogCtl->SharedRecoveryState
= RECOVERY_STATE_DONE.
The acquisition of ProcArrayLock() in
ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds()
should prevent the data from being removed between the RecoveryInProgress()
and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData().
I haven't yet figured out whether there would be a problem with deferring the
other tasks in ShutdownRecoveryTransactionEnvironment() until after
RECOVERY_STATE_DONE.
Hmm. This would mean releasing all the exclusive locks tracked by a
standby, as of StandbyReleaseAllLocks(), after opening the instance
for writes after a promotion. I don't think that's unsafe, but it
would be intrusive.
Anyway, isn't the issue ExpireAllKnownAssignedTransactionIds() itself,
where we should try to not wipe out the 2PC entries to make sure that
all those snapshots still see the 2PC transactions as something to
count on? I am attaching a crude patch to show the idea.
Thoughts?
--
Michael
Attachments:
2pc-recovery-snap-v1.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 42a89fc5dc..2e819163c4 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -4853,10 +4853,24 @@ KnownAssignedXidsRemovePreceding(TransactionId removeXid)
if (!TransactionIdIsValid(removeXid))
{
- elog(trace_recovery(DEBUG4), "removing all KnownAssignedXids");
- pArray->numKnownAssignedXids = 0;
- pArray->headKnownAssignedXids = pArray->tailKnownAssignedXids = 0;
- return;
+ tail = pArray->tailKnownAssignedXids;
+ head = pArray->headKnownAssignedXids;
+
+ for (i = tail; i < head; i++)
+ {
+ if (KnownAssignedXidsValid[i])
+ {
+ TransactionId knownXid = KnownAssignedXids[i];
+
+ if (!StandbyTransactionIdIsPrepared(knownXid))
+ {
+ KnownAssignedXidsValid[i] = false;
+ count++;
+ }
+ }
+ }
+
+ goto recompute;
}
elog(trace_recovery(DEBUG4), "prune KnownAssignedXids to %u", removeXid);
@@ -4885,6 +4899,9 @@ KnownAssignedXidsRemovePreceding(TransactionId removeXid)
}
}
+ /* fall through */
+recompute:
+
pArray->numKnownAssignedXids -= count;
Assert(pArray->numKnownAssignedXids >= 0);
Hi,
On 2021-05-26 16:57:31 +0900, Michael Paquier wrote:
Yes, there should not be any as far as I recall. 2PC is kind of
special with its fake ProcArray entries.
It's really quite an awful design :(
I think to fix the issue we'd have to move
ShutdownRecoveryTransactionEnvironment() to after XLogCtl->SharedRecoveryState
= RECOVERY_STATE_DONE.
The acquisition of ProcArrayLock() in
ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds()
should prevent the data from being removed between the RecoveryInProgress()
and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData().
I haven't yet figured out whether there would be a problem with deferring the
other tasks in ShutdownRecoveryTransactionEnvironment() until after
RECOVERY_STATE_DONE.
Hmm. This would mean releasing all the exclusive locks tracked by a
standby, as of StandbyReleaseAllLocks(), after opening the instance
for writes after a promotion. I don't think that's unsafe, but it
would be intrusive.
Why would it be intrusive? We're talking a split second here, no? More
importantly, I don't think it's correct to release the locks at that
point.
Anyway, isn't the issue ExpireAllKnownAssignedTransactionIds() itself,
where we should try to not wipe out the 2PC entries to make sure that
all those snapshots still see the 2PC transactions as something to
count on? I am attaching a crude patch to show the idea.
I don't think that's sufficient. We can't do most of the other stuff in
ShutdownRecoveryTransactionEnvironment() before changing
XLogCtl->SharedRecoveryState either. As long as the other backends think
we are in recovery, we shouldn't release e.g. the virtual transaction.
Greetings,
Andres Freund
On Thu, May 27, 2021 at 10:01:49AM -0700, Andres Freund wrote:
Why would it be intrusive? We're talking a split second here, no? More
importantly, I don't think it's correct to release the locks at that
point.
I have been looking at all that for the last couple of days, and
checked the code to make sure that relying on RecoveryInProgress() as
the tipping point is logically correct in terms of virtual XID,
snapshot build and KnownAssignedXids cleanup. This stuff is tricky
enough that I may have missed something, but my impression (and
testing) is that we should be safe.
I am adding this patch to the next CF for now. More eyes are needed.
--
Michael
Attachments:
end-recovery-2pc.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 441a9124cd..40c56d9012 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8033,13 +8033,6 @@ StartupXLOG(void)
/* Reload shared-memory state for prepared transactions */
RecoverPreparedTransactions();
- /*
- * Shutdown the recovery environment. This must occur after
- * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
- */
- if (standbyState != STANDBY_DISABLED)
- ShutdownRecoveryTransactionEnvironment();
-
/* Shut down xlogreader */
if (readFile >= 0)
{
@@ -8087,6 +8080,18 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Shutdown the recovery environment. This must occur after
+ * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
+ * and after switching SharedRecoveryState to RECOVERY_STATE_DONE so as
+ * any session building a snapshot will not rely on KnownAssignedXids as
+ * RecoveryInProgress() is disabled at this stage. This is particularly
+ * critical for prepared 2PC transactions, that may still need to be
+ * included in snapshots once recovery has ended.
+ */
+ if (standbyState != STANDBY_DISABLED)
+ ShutdownRecoveryTransactionEnvironment();
+
/*
* If there were cascading standby servers connected to us, nudge any wal
* sender processes to notice that we've been promoted.
On Mon, May 31, 2021 at 09:37:17PM +0900, Michael Paquier wrote:
I have been looking at all that for the last couple of days, and
checked the code to make sure that relying on RecoveryInProgress() as
the tipping point is logically correct in terms of virtual XID,
snapshot build and KnownAssignedXids cleanup. This stuff is tricky
enough that I may have missed something, but my impression (and
testing) is that we should be safe.
A couple of months later, I have looked back at this thread and this
report. I have rechecked all the standby handling and snapshot builds
involving KnownAssignedXids and looked at all the phases that are
getting called until we call ShutdownRecoveryTransactionEnvironment()
to release these, and I don't think that there is a problem with the
solution proposed here. So I propose to move on and apply this
patch. Please let me know if there are any objections.
--
Michael
On Fri, Oct 01, 2021 at 02:11:15PM +0900, Michael Paquier wrote:
A couple of months later, I have looked back at this thread and this
report. I have rechecked all the standby handling and snapshot builds
involving KnownAssignedXids and looked at all the phases that are
getting called until we call ShutdownRecoveryTransactionEnvironment()
to release these, and I don't think that there is a problem with the
solution proposed here. So I propose to move on and apply this
patch. Please let me know if there are any objections.
Okay, I have worked more on that today, did more tests and applied the
fix as of 8a42379.
--
Michael
On 2021-10-04 17:27:44 +0900, Michael Paquier wrote:
On Fri, Oct 01, 2021 at 02:11:15PM +0900, Michael Paquier wrote:
A couple of months later, I have looked back at this thread and this
report. I have rechecked all the standby handling and snapshot builds
involving KnownAssignedXids and looked at all the phases that are
getting called until we call ShutdownRecoveryTransactionEnvironment()
to release these, and I don't think that there is a problem with the
solution proposed here. So I propose to move on and apply this
patch. Please let me know if there are any objections.
Okay, I have worked more on that today, did more tests and applied the
fix as of 8a42379.
Thanks for remembering this!