[PATCH] Replica sends an incorrect epoch in its hot standby feedback to the Master

Started by Palamadai, Ekaalmost 6 years ago8 messages
#1Palamadai, Eka
ekanatha@amazon.com
1 attachment(s)

Hello Postgres hackers,
The below problem occurs in Postgres versions 11, 10, and 9.6. However, it doesn’t occur since Postgres version 12, since the commit [6]https://github.com/postgres/postgres/commit/2fc7af5e966043a412e8e69c135fae55a2db6d4f to add basic infrastructure for 64-bit transaction IDs indirectly fixed it.
Problem
The replica sends an incorrect epoch in its hot standby feedback to the master in the scenario outlined below, where a checkpoint is interleaved with the execution of 2 transactions at the master. The incorrect epoch in the feedback causes the master to ignore the “oldest Xmin” X sent by the replica. If a heap page prune[1]https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/access/heap/pruneheap.c#L74 or vacuum were executed at the master immediately thereafter, they may use a newer “oldest Xmin” Y > X, and prematurely delete a tuple T such that X < t_xmax (T) < Y, which is still in use at the replica as part of a long running read query Q. Subsequently, when the replica replays the deletion of T as part of its WAL replay, it cancels the long running query Q causing unnecessary pain to customers.

Scenario
The experimental setup is to start 2 write transactions and get their transaction ids, then create a checkpoint, and then have the oldest of the 2 transactions perform a write operation. The WAL replay of that write operation breaks the monotonically increasing property of “nextXid” tracked by the replica, and causes it to send a wrong epoch in its next feedback.

1. Set the following parameters on the replica and the master.
* replica:

i. set hot_standby = on and hot_standby_feedback = on to provide hot standby feedback.

ii. set log_min_messages = DEBUG2 so that hot standby feedback[2]https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/replication/walreceiver.c#L1244 gets logged.

* master:

i. set checkpoint_timeout = 3600 (1 hour) to turn off automatic checkpoints during the experiment.

ii. set autovacuum = off to avoid interference from the vacuum process.

1. Make sure no other read/write queries get executed on the replica and the master during this experiment, since they might change the nextXid on the replica. To achieve this, I set up a master and replica running Postgres 11 “locally” on my desktop, with the above parameter settings.
2. Start a psql client to the master, and do the following:

a. Create a test table, say “test_epoch_table”. Execute “CREATE TABLE test_epoch_table(id INT, description CHAR(200));”.

b. Begin a transaction, say A, and get the current transaction ID. Execute “begin; select txid_current();”.

1. Start a second psql client to the master. Begin a transaction, say B, and get the current transaction ID. Execute “begin; select txid_current();”.
2. Start a third psql client to the master, and create a checkpoint. Execute “checkpoint;”.
3. From transaction A, insert a tuple into “test_epoch_table”. Execute “insert into test_epoch_table(id, description) values(1, 'one');”
4. Open the Postgres log file on the replica and look for the latest hot standby feedback. The log message will show that the replica sent an incorrect epoch of 1, instead of 0.

Analysis
The variable “ShmemVariableCache->nextXid” (or “nextXid” for short) should be monotonically increasing unless it wraps around to the next epoch. However, in the above sequence, this property is violated on the replica in the function “RecordKnownAssignedTransactionIds”[3]https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/ipc/procarray.c#L3259, when the WAL replay for the insertion at step 6 is executed at the replica.

For example, before step 3b is executed, suppose the nextXid at the replica and master is 100, and that the variable “latestObservedXid”, which tracks the latest observed Xid at the replica, is 99. Now, suppose the Xid for transaction A is 100, and that for transaction B is 101. After step 4 is executed, the latestObservedXid at the replica still remains at 99, since the WAL replay for RUNNING_XACTS doesn’t advance latestObservedXid, once the replica reaches the STANDBY_SNAPSHOT_READY state. See [4]https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/ipc/procarray.c#L693 for reference. After step 5 is executed, the WAL replay of the checkpoint advances nextXid at the replica to 102, but doesn’t invoke RecordKnownAssignedTransactionIds, and hence latestObservedXid still remains 99. After step 6 is executed, the WAL replay of the insertion invokes RecordKnownAssignedTransactionIds with an input Xid of 100, advances latestObservedXid to 100, and sets nextXid at the replica to 101 (at line [3]https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/ipc/procarray.c#L3259) breaking its monotonicity. The hot standby feedback, which is generated immediately after the WAL replay of the insertion, invokes the function “GetNextXidAndEpoch”, which incorrectly calculates the epoch [5]https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/access/transam/xlog.c#L8444.

Versions impacted by this problem: 11, 10, and 9.6

Attached is the git diff of the patch for version 11, which is a 1-line change excluding the curly braces. Patch for versions 10 and 9.6 involve the same change.

Please let me know if you have any questions.

Thanks,
Eka Palamadai
Amazon Web Services

[1]: https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/access/heap/pruneheap.c#L74
[2]: https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/replication/walreceiver.c#L1244
[3]: https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/ipc/procarray.c#L3259
[4]: https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/ipc/procarray.c#L693
[5]: https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/access/transam/xlog.c#L8444
[6]: https://github.com/postgres/postgres/commit/2fc7af5e966043a412e8e69c135fae55a2db6d4f

Attachments:

git_diffapplication/octet-stream; name=git_diffDownload
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ddd3461d56..c907c26588 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -3256,7 +3256,10 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 		next_expected_xid = latestObservedXid;
 		TransactionIdAdvance(next_expected_xid);
 		LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
-		ShmemVariableCache->nextXid = next_expected_xid;
+		if (TransactionIdFollows(next_expected_xid, ShmemVariableCache->nextXid))
+		{
+			ShmemVariableCache->nextXid = next_expected_xid;
+		}
 		LWLockRelease(XidGenLock);
 	}
 }
#2Thomas Munro
thomas.munro@gmail.com
In reply to: Palamadai, Eka (#1)
Re: [PATCH] Replica sends an incorrect epoch in its hot standby feedback to the Master

On Fri, Feb 7, 2020 at 1:03 PM Palamadai, Eka <ekanatha@amazon.com> wrote:

The below problem occurs in Postgres versions 11, 10, and 9.6. However, it doesn’t occur since Postgres version 12, since the commit [6] to add basic infrastructure for 64-bit transaction IDs indirectly fixed it.

I'm happy that that stuff is already fixing bugs we didn't know we
had, but, yeah, it looks like it really only fixed it incidentally by
moving all the duplicated "assign if higher" code into a function, not
through the magical power of 64 bit xids.

The replica sends an incorrect epoch in its hot standby feedback to the master in the scenario outlined below, where a checkpoint is interleaved with the execution of 2 transactions at the master. The incorrect epoch in the feedback causes the master to ignore the “oldest Xmin” X sent by the replica. If a heap page prune[1] or vacuum were executed at the master immediately thereafter, they may use a newer “oldest Xmin” Y > X, and prematurely delete a tuple T such that X < t_xmax (T) < Y, which is still in use at the replica as part of a long running read query Q. Subsequently, when the replica replays the deletion of T as part of its WAL replay, it cancels the long running query Q causing unnecessary pain to customers.

Ouch. Thanks for this analysis!

The variable “ShmemVariableCache->nextXid” (or “nextXid” for short) should be monotonically increasing unless it wraps around to the next epoch. However, in the above sequence, this property is violated on the replica in the function “RecordKnownAssignedTransactionIds”[3], when the WAL replay for the insertion at step 6 is executed at the replica.

I haven't tried your repro or studied this closely yet, but yes, that
assignment to nextXid does indeed look pretty fishy. Other similar
code elsewhere always does a check like in your patch, before
clobbering nextXid.

#3Palamadai, Eka
ekanatha@amazon.com
In reply to: Thomas Munro (#2)
Re: [PATCH] Replica sends an incorrect epoch in its hot standby feedback to the Master

Thanks a lot for the feedback. Please let me know if you have any further comments. Meanwhile, I have also added this patch to "Commitfest 2020-03" at https://commitfest.postgresql.org/27/2464.

Thanks,
Eka Palamadai
Amazon Web Services

On 2/11/20, 11:28 PM, "Thomas Munro" <thomas.munro@gmail.com> wrote:

On Fri, Feb 7, 2020 at 1:03 PM Palamadai, Eka <ekanatha@amazon.com> wrote:

The below problem occurs in Postgres versions 11, 10, and 9.6. However, it doesn’t occur since Postgres version 12, since the commit [6] to add basic infrastructure for 64-bit transaction IDs indirectly fixed it.

I'm happy that that stuff is already fixing bugs we didn't know we
had, but, yeah, it looks like it really only fixed it incidentally by
moving all the duplicated "assign if higher" code into a function, not
through the magical power of 64 bit xids.

The replica sends an incorrect epoch in its hot standby feedback to the master in the scenario outlined below, where a checkpoint is interleaved with the execution of 2 transactions at the master. The incorrect epoch in the feedback causes the master to ignore the “oldest Xmin” X sent by the replica. If a heap page prune[1] or vacuum were executed at the master immediately thereafter, they may use a newer “oldest Xmin” Y > X, and prematurely delete a tuple T such that X < t_xmax (T) < Y, which is still in use at the replica as part of a long running read query Q. Subsequently, when the replica replays the deletion of T as part of its WAL replay, it cancels the long running query Q causing unnecessary pain to customers.

Ouch. Thanks for this analysis!

The variable “ShmemVariableCache->nextXid” (or “nextXid” for short) should be monotonically increasing unless it wraps around to the next epoch. However, in the above sequence, this property is violated on the replica in the function “RecordKnownAssignedTransactionIds”[3], when the WAL replay for the insertion at step 6 is executed at the replica.

I haven't tried your repro or studied this closely yet, but yes, that
assignment to nextXid does indeed look pretty fishy. Other similar
code elsewhere always does a check like in your patch, before
clobbering nextXid.

#4Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Palamadai, Eka (#3)
Re: [PATCH] Replica sends an incorrect epoch in its hot standby feedback to the Master

On Fri, Feb 21, 2020 at 8:15 PM Palamadai, Eka <ekanatha@amazon.com> wrote:

Please, do not top post on this list.

Thanks a lot for the feedback. Please let me know if you have any further

comments. Meanwhile, I have also added this patch to "Commitfest 2020-03"
at https://commitfest.postgresql.org/27/2464.

Apparently, there are a couple of duplicate entries in the commitfest as:
https://commitfest.postgresql.org/27/2463/ and
https://commitfest.postgresql.org/27/2462/

Could close those as "withdrawn"?

Regards,

Juan José Santamaría Flecha

#5David Steele
david@pgmasters.net
In reply to: Juan José Santamaría Flecha (#4)
Re: [PATCH] Replica sends an incorrect epoch in its hot standby feedback to the Master

On 3/2/20 10:33 AM, Juan José Santamaría Flecha wrote:

On Fri, Feb 21, 2020 at 8:15 PM Palamadai, Eka <ekanatha@amazon.com
<mailto:ekanatha@amazon.com>> wrote:

Please, do not top post on this list.

Thanks a lot for the feedback. Please let me know if you have any
further comments. Meanwhile, I have also added this patch to
"Commitfest 2020-03" at https://commitfest.postgresql.org/27/2464.

Apparently, there are a couple of duplicate entries in the commitfest
as: https://commitfest.postgresql.org/27/2463/ and
https://commitfest.postgresql.org/27/2462/

Could close those as "withdrawn"?

I have marked the duplicate entries 2462 and 2463 as withdrawn.

--
-David
david@pgmasters.net

#6Thomas Munro
thomas.munro@gmail.com
In reply to: Palamadai, Eka (#3)
1 attachment(s)
Re: [PATCH] Replica sends an incorrect epoch in its hot standby feedback to the Master

On Sat, Feb 22, 2020 at 6:10 AM Palamadai, Eka <ekanatha@amazon.com> wrote:

Thanks a lot for the feedback. Please let me know if you have any further comments. Meanwhile, I have also added this patch to "Commitfest 2020-03" at https://commitfest.postgresql.org/27/2464.

Thanks for the excellent reproducer for this obscure bug. You said
the problem exists in 9.6-11, but I'm also able to reproduce it in
9.5. That's the oldest supported release, but it probably goes back
further. I confirmed that this patch fixes the immediate problem.
I've attached a version of your patch with a commit message, to see if
anyone has more feedback on this.

Attachments:

0001-Fix-nextXid-tracking-bug-on-standbys-9.5-11-only.patchtext/x-patch; charset=US-ASCII; name=0001-Fix-nextXid-tracking-bug-on-standbys-9.5-11-only.patchDownload
From 26740a9a670161f27c07c3c83c1056f8b4373d35 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 11 Mar 2020 17:02:02 +1300
Subject: [PATCH] Fix nextXid tracking bug on standbys (9.5-11 only).

RecordKnownAssignedTransactionIds() should never move
nextXid backwards.  Before this commit, that could happen
if some other code path had advanced it without advancing
latestObservedXid.

One consequence is that a well timed XLOG_CHECKPOINT_ONLINE
could cause hot standby feedback messages to get confused
and report an xmin from a future epoch, potentially allowing
vacuum to run too soon on the primary.

Repair, by making sure RecordKnownAssignedTransactionIds()
can only move nextXid forwards.

In release 12 and master, this was already done by commit
2fc7af5e, which consolidated similar code and straightened
out this bug.  Back-patch to supported releases before that.

Author: Eka Palamadai
Discussion: https://postgr.es/m/98BB4805-D0A2-48E1-96F4-15014313EADC@amazon.com
---
 src/backend/storage/ipc/procarray.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ddd3461d56..50b9aea683 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -3256,7 +3256,8 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 		next_expected_xid = latestObservedXid;
 		TransactionIdAdvance(next_expected_xid);
 		LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
-		ShmemVariableCache->nextXid = next_expected_xid;
+		if (TransactionIdFollows(next_expected_xid, ShmemVariableCache->nextXid))
+			ShmemVariableCache->nextXid = next_expected_xid;
 		LWLockRelease(XidGenLock);
 	}
 }
-- 
2.20.1

#7Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#6)
Re: [PATCH] Replica sends an incorrect epoch in its hot standby feedback to the Master

On Wed, Mar 11, 2020 at 7:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, Feb 22, 2020 at 6:10 AM Palamadai, Eka <ekanatha@amazon.com> wrote:

Thanks a lot for the feedback. Please let me know if you have any further comments. Meanwhile, I have also added this patch to "Commitfest 2020-03" at https://commitfest.postgresql.org/27/2464.

Thanks for the excellent reproducer for this obscure bug. You said
the problem exists in 9.6-11, but I'm also able to reproduce it in
9.5. That's the oldest supported release, but it probably goes back
further. I confirmed that this patch fixes the immediate problem.
I've attached a version of your patch with a commit message, to see if
anyone has more feedback on this.

Pushed.

#8Palamadai, Eka
ekanatha@amazon.com
In reply to: Thomas Munro (#7)
Re: [PATCH] Replica sends an incorrect epoch in its hot standby feedback to the Master

Thanks a lot for reviewing and pushing this.

Eka

On 3/12/20, 1:23 AM, "Thomas Munro" <thomas.munro@gmail.com> wrote:

CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.

On Wed, Mar 11, 2020 at 7:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, Feb 22, 2020 at 6:10 AM Palamadai, Eka <ekanatha@amazon.com> wrote:

Thanks a lot for the feedback. Please let me know if you have any further comments. Meanwhile, I have also added this patch to "Commitfest 2020-03" at https://commitfest.postgresql.org/27/2464.

Thanks for the excellent reproducer for this obscure bug. You said
the problem exists in 9.6-11, but I'm also able to reproduce it in
9.5. That's the oldest supported release, but it probably goes back
further. I confirmed that this patch fixes the immediate problem.
I've attached a version of your patch with a commit message, to see if
anyone has more feedback on this.

Pushed.