[BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

Started by Oh, Mike almost 5 years ago · 137 messages
#1Oh, Mike
minsoo@amazon.com
1 attachment(s)

Sending this to pgsql-hackers list to create a CommitFest entry with the attached patch proposal.

Hello,
We noticed that logical replication can fail when a Standby::RUNNING_XACTS record is generated in the middle of a catalog-modifying transaction and logical decoding then has to restart from that RUNNING_XACTS WAL entry.
The Standby::RUNNING_XACTS record is generated periodically (roughly every 15s by default) or during a CHECKPOINT operation.

Detailed problem description:
Tested on 11.8 & current master.
The logical replication slot's restart_lsn can advance in the middle of an open txn that modified the catalog (e.g. a TRUNCATE operation).
Should logical decoding have to restart, it could fail with an error like this:
ERROR: could not map filenode "base/13237/442428"

Currently, the system relies on processing Heap2::NEW_CID to keep track of catalog modifying (sub)transactions.
This context is lost if the logical decoding has to restart from a Standby::RUNNING_XACTS that is written between the NEW_CID record and its parent txn commit.
If the logical stream restarts from this restart_lsn, then it doesn't have the xid responsible for modifying the catalog.
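
To illustrate the failure mode described above, here is a toy model in C. It is a sketch, not PostgreSQL code: the record kinds, the ToyRecord struct, the toy_wal array, and commit_seen_as_catalog_change() are all invented for this example. The only thing it models is that the decoder learns about a catalog-modifying xid from a NEW_CID record seen earlier in the same decoding pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical, heavily simplified record kinds; real WAL records carry far more. */
typedef enum { REC_NEW_CID, REC_RUNNING_XACTS, REC_COMMIT } RecKind;

typedef struct
{
    RecKind       kind;
    TransactionId xid;      /* ignored for REC_RUNNING_XACTS in this toy */
} ToyRecord;

/* Toy WAL: xid 700 modifies the catalog, a RUNNING_XACTS record is logged, then 700 commits. */
const ToyRecord toy_wal[] = {
    {REC_NEW_CID, 700},
    {REC_RUNNING_XACTS, 0},
    {REC_COMMIT, 700},
};

/*
 * Decode records [start, nrecs) and report whether the commit of 'xid' is
 * recognized as catalog-modifying. The knowledge comes only from a NEW_CID
 * record for that xid seen earlier in the same pass.
 */
bool
commit_seen_as_catalog_change(const ToyRecord *recs, size_t nrecs,
                              size_t start, TransactionId xid)
{
    bool        saw_new_cid = false;

    assert(start <= nrecs);

    for (size_t i = start; i < nrecs; i++)
    {
        if (recs[i].kind == REC_NEW_CID && recs[i].xid == xid)
            saw_new_cid = true;
        else if (recs[i].kind == REC_COMMIT && recs[i].xid == xid)
            return saw_new_cid;
    }
    return false;           /* commit record not reached */
}
```

Decoding toy_wal from index 0 recognizes the commit of xid 700 as catalog-modifying. Decoding from index 1 (the RUNNING_XACTS record, standing in for restart_lsn) does not, which is the situation the report describes.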

Repro steps:
1. We need to generate the Standby::RUNNING_XACTS record deterministically using CHECKPOINT, so we lengthen LOG_SNAPSHOT_INTERVAL_MS with the following patch:
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 3e6ffb05b9..b776e8d566 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -76,7 +76,7 @@ int              BgWriterDelay = 200;
  * Interval in which standby snapshots are logged into the WAL stream, in
  * milliseconds.
  */
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 1500000
2. Create a table:
postgres=# create table bdt (a int);
CREATE TABLE
3. Create a logical replication slot:
postgres=# select  pg_create_logical_replication_slot('bdt_slot','test_decoding');
pg_create_logical_replication_slot
------------------------------------
(bdt_slot,0/FFAA1C70)
(1 row)
4. Start reading the slot in a shell (keep the shell so that we can stop reading later):
./bin/pg_recvlogical --slot bdt_slot --start -f bdt.out -d postgres
5. Execute the workload across 2 different clients in the following order:
Session1:
begin;
savepoint b1;
truncate bdt;

Session2:
select * from pg_replication_slots; /* keep note of the confirmed_flush_lsn */
checkpoint;
/* Repeat the following query until the confirmed_flush_lsn changes */
select * from pg_replication_slots;

Once confirmed_flush_lsn changes:
Session1:
end;
begin;
insert into bdt values (1);
Session2:
select * from pg_replication_slots; /* keep note of both restart_lsn AND the confirmed_flush_lsn */
checkpoint;
/* Repeat the following query until both restart_lsn AND confirmed_flush_lsn change */
select * from pg_replication_slots;
6. Stop the pg_recvlogical (Control-C)
7. Then commit the insert txn:
Session1:
end;
8. Get/peek the replication slot changes
postgres=# select * from pg_logical_slot_get_changes('bdt_slot', null, null);
ERROR: could not map filenode "base/13237/442428" to relation OID

Proposed solution:
If we're decoding the commit record of a catalog-modifying transaction, check whether its xid is among the xids in the RUNNING_XACTS record processed at the restart_lsn. If so, add its xid and subxacts to the committed txns list in the logical decoding snapshot.
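
The check described above can be sketched in standalone C as follows. This is a minimal illustration under stated assumptions, not the attached patch: toy_xid_cmp(), toy_sort_xids(), and toy_force_catalog_change() are hypothetical names, and the boolean commit_has_invals stands in for testing XACT_XINFO_HAS_INVALS on the parsed commit record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Plain unsigned comparison, similar in spirit to PostgreSQL's xidComparator. */
int
toy_xid_cmp(const void *a, const void *b)
{
    TransactionId xa = *(const TransactionId *) a;
    TransactionId xb = *(const TransactionId *) b;

    return (xa > xb) - (xa < xb);
}

/* Sort the xids remembered from the RUNNING_XACTS record so bsearch() can be used. */
void
toy_sort_xids(TransactionId *xids, size_t n)
{
    if (n > 1)
        qsort(xids, n, sizeof(TransactionId), toy_xid_cmp);
}

/*
 * Return true when a commit should be treated as catalog-modifying even
 * though no NEW_CID record was decoded for it: the commit carries
 * invalidations and its xid was running at the restart point.
 */
bool
toy_force_catalog_change(bool commit_has_invals, TransactionId xid,
                         const TransactionId *sorted_running, size_t n)
{
    if (!commit_has_invals || n == 0)
        return false;

    assert(sorted_running != NULL);

    return bsearch(&xid, sorted_running, n,
                   sizeof(TransactionId), toy_xid_cmp) != NULL;
}
```

With running xids {742, 700, 731} sorted once, a commit of xid 731 that carries invalidations is forced into the snapshot, while a commit without invalidations, or one whose xid was not running at the restart point, is not.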

Please refer to the attachment for the proposed patch.

Thanks,
Mike

Attachments:

logical_decoding_xact_bookkeep.patch (application/octet-stream)
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f59613..2512912 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -42,6 +42,7 @@
 #include "replication/reorderbuffer.h"
 #include "replication/snapbuild.h"
 #include "storage/standby.h"
+#include "utils/builtins.h"
 
 typedef struct XLogRecordBuffer
 {
@@ -85,6 +86,9 @@ static bool DecodeTXNNeedSkip(LogicalDecodingContext *ctx,
 							  XLogRecordBuffer *buf, Oid dbId,
 							  RepOriginId origin_id);
 
+/* record previous restart_lsn running xacts */
+xl_running_xacts *last_running = NULL;
+
 /*
  * Take every XLogReadRecord()ed record and perform the actions required to
  * decode it using the output plugin already setup in the logical decoding
@@ -402,6 +406,28 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			{
 				xl_running_xacts *running = (xl_running_xacts *) XLogRecGetData(r);
 
+				/* record restart_lsn running xacts */
+				if (MyReplicationSlot && (buf->origptr == MyReplicationSlot->data.restart_lsn))
+				{
+					if (last_running)
+						free(last_running);
+
+					last_running = NULL;
+
+					/*
+					 * xl_running_xacts contains a xids Flexible Array
+					 * and its size is subxcnt + xcnt.
+					 * Take that into account while allocating
+					 * the memory for last_running.
+					 */
+					last_running = (xl_running_xacts *) malloc(sizeof(xl_running_xacts)
+																+ sizeof(TransactionId )
+																* (running->subxcnt + running->xcnt));
+					memcpy(last_running, running, sizeof(xl_running_xacts)
+														 + (sizeof(TransactionId)
+														 * (running->subxcnt + running->xcnt)));
+				}
+
 				SnapBuildProcessRunningXacts(builder, buf->origptr, running);
 
 				/*
@@ -678,6 +704,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	TimestampTz commit_time = parsed->xact_time;
 	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
 	int			i;
+	bool force_travel_and_snapshot = false;
 
 	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
@@ -685,8 +712,29 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * Check if the commit contain catalog invalidations
+	 * and if the xid was part of the restart_lsn
+	 * running ones
+	 */
+	if ((parsed->xinfo & XACT_XINFO_HAS_INVALS) && last_running)
+	{
+		/* make last_running->xids bsearch()able */
+		qsort(last_running->xids,
+			  last_running->subxcnt + last_running->xcnt,
+			  sizeof(TransactionId), xidComparator);
+
+		/*
+		 * Is this xid part of the known running ones
+		 * in the restart_lsn RUNNING_XACT entry?
+		 */
+		force_travel_and_snapshot = TransactionIdInArray(xid, last_running->xids,
+														 last_running->subxcnt
+														 + last_running->xcnt);
+	}
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
-					   parsed->nsubxacts, parsed->subxacts);
+					   parsed->nsubxacts, parsed->subxacts, force_travel_and_snapshot);
 
 	/* ----
 	 * Check whether we are interested in this specific transaction, and tell
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 91600ac..4a5610b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -4892,7 +4892,7 @@ ApplyLogicalMappingFile(HTAB *tuplecid_data, Oid relid, const char *fname)
 /*
  * Check whether the TransactionId 'xid' is in the pre-sorted array 'xip'.
  */
-static bool
+bool
 TransactionIdInArray(TransactionId xid, TransactionId *xip, Size num)
 {
 	return bsearch(&xid, xip, num,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ed3acad..6be39d1 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -910,15 +910,19 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
  */
 void
 SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
-				   int nsubxacts, TransactionId *subxacts)
+				   int nsubxacts, TransactionId *subxacts, bool force_travel_and_snapshot)
 {
 	int			nxact;
-
-	bool		needs_snapshot = false;
-	bool		needs_timetravel = false;
-	bool		sub_needs_timetravel = false;
+	bool		needs_snapshot;
+	bool		needs_timetravel;
+	bool		sub_needs_timetravel;
 
 	TransactionId xmax = xid;
+	/*
+	 * if force_travel_and_snapshot is set to true
+	 * then proceed as if there are any catalog modifying subxacts.
+	 */
+	needs_snapshot = needs_timetravel = sub_needs_timetravel = force_travel_and_snapshot;
 
 	/*
 	 * Transactions preceding BUILDING_SNAPSHOT will neither be decoded, nor
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..7fc79f3 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -685,4 +685,6 @@ void		ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
 void		StartupReorderBuffer(void);
 
+extern bool TransactionIdInArray(TransactionId xid, TransactionId *xip, Size num);
+
 #endif
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..46f7df0 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -80,7 +80,7 @@ extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
-							   TransactionId *subxacts);
+							   TransactionId *subxacts, bool force_travel_and_snapshot);
 extern bool SnapBuildProcessChange(SnapBuild *builder, TransactionId xid,
 								   XLogRecPtr lsn);
 extern void SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
#2ahsan hadi
ahsan.hadi@gmail.com
In reply to: Oh, Mike (#1)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested

I have also seen this error with logical replication using the pglogical extension. Will this patch also address the similar problem with pglogical?

#3Japin Li
japinli@hotmail.com
In reply to: ahsan hadi (#2)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Fri, 07 May 2021 at 19:50, ahsan hadi <ahsan.hadi@gmail.com> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested

I have also seen this error with logical replication using the pglogical extension. Will this patch also address the similar problem with pglogical?

Is there a test case to reproduce this problem (using pglogical)?
I have run into it too, but I have not found a case that reproduces it.

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

#4Ahsan Hadi
ahsan.hadi@gmail.com
In reply to: Japin Li (#3)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Sat, May 8, 2021 at 8:17 AM Japin Li <japinli@hotmail.com> wrote:

On Fri, 07 May 2021 at 19:50, ahsan hadi <ahsan.hadi@gmail.com> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested

I have also seen this error with logical replication using the pglogical extension. Will this patch also address the similar problem with pglogical?

Is there a test case to reproduce this problem (using pglogical)?
I have run into it too, but I have not found a case that reproduces it.

I have seen a user run into this with pglogical; the error is produced
after logical decoding finds an inconsistent point. However, we haven't been
able to reproduce the user's scenario locally.

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

--
Highgo Software (Canada/China/Pakistan)
URL : http://www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
EMAIL: ahsan.hadi@highgo.ca

#5Oh, Mike
minsoo@amazon.com
In reply to: ahsan hadi (#2)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

This patch should address the same problem for pglogical as well.

Thanks,
Mike

On 6/4/21, 3:55 PM, "ahsan hadi" <ahsan.hadi@gmail.com> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested

I have also seen this error with logical replication using the pglogical extension. Will this patch also address the similar problem with pglogical?

#6Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Oh, Mike (#1)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Mar 16, 2021 at 1:35 AM Oh, Mike <minsoo@amazon.com> wrote:

Sending this to pgsql-hackers list to create a CommitFest entry with the attached patch proposal.

Hello,

We noticed that the logical replication could fail when the Standby::RUNNING_XACT record is generated in the middle of a catalog modifying transaction and if the logical decoding has to restart from the RUNNING_XACT WAL entry.

The Standby::RUNNING_XACT record is generated periodically (roughly every 15s by default) or during a CHECKPOINT operation.

Detailed problem description:

Tested on 11.8 & current master.

The logical replication slot restart_lsn advances in the middle of an open txn that modified the catalog (e.g. TRUNCATE operation).

Should the logical decoding has to restart it could fail with an error like this:

ERROR: could not map filenode "base/13237/442428"

Thank you for reporting the issue.

I could reproduce this issue by the steps you shared.

Currently, the system relies on processing Heap2::NEW_CID to keep track of catalog modifying (sub)transactions.

This context is lost if the logical decoding has to restart from a Standby::RUNNING_XACTS that is written between the NEW_CID record and its parent txn commit.

If the logical stream restarts from this restart_lsn, then it doesn't have the xid responsible for modifying the catalog.

I agree with your analysis. Since we don't use the commit WAL record to
track transactions that have modified system catalogs, if we decode
only the commit record of such a transaction, we cannot know that the
transaction modified system catalogs, and subsequent transactions end
up scanning the system catalogs with the wrong snapshot.

With the patch, if the commit WAL record has the XACT_XINFO_HAS_INVALS
flag and its xid is included in the RUNNING_XACTS record written at
restart_lsn, we forcibly add the top XID and its sub XIDs to the
snapshot as committed transactions that have modified system
catalogs. I might be missing something about your patch, but I have
some comments on this approach:

1. The commit WAL record may not have an invalidation message for system
catalogs (e.g., when the commit record has only an invalidation message
for the relcache) even if it has the XACT_XINFO_HAS_INVALS flag. In this
case, the transaction is wrongly added to the snapshot; is that okay?

2. We might add a subtransaction XID as a committed transaction that
has modified system catalogs even if it actually didn't. As the
comment in SnapBuildBuildSnapshot() describes, we track only the
transactions that have modified the system catalog and store them in
the snapshot (in the 'xip' array). The patch could break that assumption.
However, I'm really not sure how to deal with this point. We cannot
know which subtransaction has actually modified system catalogs by
using only the commit WAL record.

3. The patch covers only the case where the restart_lsn exactly
matches the LSN of RUNNING_XACT. I wonder if there could be a case
where the decoding starts at a WAL record other than RUNNING_XACT but
the next WAL record is RUNNING_XACT.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#7Drouvot, Bertrand
bdrouvot@amazon.com
In reply to: Masahiko Sawada (#6)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

Hi,

On 7/29/21 10:25 AM, Masahiko Sawada wrote:

Thank you for reporting the issue.

I could reproduce this issue by the steps you shared.

Thanks for looking at it!

Currently, the system relies on processing Heap2::NEW_CID to keep track of catalog modifying (sub)transactions.

This context is lost if the logical decoding has to restart from a Standby::RUNNING_XACTS that is written between the NEW_CID record and its parent txn commit.

If the logical stream restarts from this restart_lsn, then it doesn't have the xid responsible for modifying the catalog.

I agree with your analysis. Since we don’t use commit WAL record to
track the transaction that has modified system catalogs, if we decode
only the commit record of such transaction, we cannot know the
transaction has been modified system catalogs, resulting in the
subsequent transaction scans system catalog with the wrong snapshot.

Right.

With the patch, if the commit WAL record has a XACT_XINFO_HAS_INVALS
flag and its xid is included in RUNNING_XACT record written at
restart_lsn, we forcibly add the top XID and its sub XIDs as a
committed transaction that has modified system catalogs to the
snapshot. I might be missing something about your patch but I have
some comments on this approach:

1. Commit WAL record may not have invalidation message for system
catalogs (e.g., when commit record has only invalidation message for
relcache) even if it has XACT_XINFO_HAS_INVALS flag.

Right, good point (create policy for example would lead to an
invalidation for relcache only).

In this case, the
transaction wrongly is added to the snapshot, is that okay?

This transaction is a committed one, and IIUC the snapshot would be used
only for catalog visibility, so I don't see any issue with adding it to
the snapshot. What do you think?

2. We might add a subtransaction XID as a committed transaction that
has modified system catalogs even if it actually didn't.

Right, like when needs_timetravel is true.

As the
comment in SnapBuildBuildSnapshot() describes, we track only the
transactions that have modified the system catalog and store in the
snapshot (in the ‘xip' array). The patch could break that assumption.

Right. It looks to me that breaking this assumption is not an issue.

IIUC currently the committed ones that are not modifying the catalog are
not stored "just" because we don't need them.

However, I’m really not sure how to deal with this point. We cannot
know which subtransaction has actually modified system catalogs by
using only the commit WAL record.

Right, unless we rewrite this patch so that a commit WAL record will
produce this information.

3. The patch covers only the case where the restart_lsn exactly
matches the LSN of RUNNING_XACT.

Right.

I wonder if there could be a case
where the decoding starts at a WAL record other than RUNNING_XACT but
the next WAL record is RUNNING_XACT.

Not sure, but could a restart_lsn not be a RUNNING_XACTS?

Thanks

Bertrand

#8osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Oh, Mike (#1)
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

Hi

On Tuesday, March 16, 2021 1:35 AM Oh, Mike <minsoo@amazon.com> wrote:

We noticed that the logical replication could fail when the
Standby::RUNNING_XACT record is generated in the middle of a catalog
modifying transaction and if the logical decoding has to restart from the
RUNNING_XACT
WAL entry.

...

Proposed solution:
If we’re decoding a catalog modifying commit record, then check whether
it’s part of the RUNNING_XACT xid’s processed @ the restart_lsn. If so,
then add its xid & subxacts in the committed txns list in the logical decoding
snapshot.

Please refer to the attachment for the proposed patch.

Let me share some review comments for the patch.

(1) last_running declaration

Wouldn't it be better to declare this variable static,
since it isn't used anywhere else?

@@ -85,6 +86,9 @@ static bool DecodeTXNNeedSkip(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf, Oid dbId,
RepOriginId origin_id);

+/* record previous restart_lsn running xacts */
+xl_running_xacts *last_running = NULL;

(2) DecodeStandbyOp's memory free

I'm not sure when we would reach this condition
with last_running already allocated,
but do you need to free its xid array here as well
if last_running isn't NULL?
Otherwise, we'll lose the chance to do so after this.

+                               /* record restart_lsn running xacts */
+                               if (MyReplicationSlot && (buf->origptr == MyReplicationSlot->data.restart_lsn))
+                               {
+                                       if (last_running)
+                                               free(last_running);
+
+                                       last_running = NULL;

(3) suggestion of small readability improvement

We calculate the same size twice here and in DecodeCommit.
I suggest declaring a variable that stores the computed size,
which would shorten the code.

+                                       /*
+                                        * xl_running_xacts contains a xids Flexible Array
+                                        * and its size is subxcnt + xcnt.
+                                        * Take that into account while allocating
+                                        * the memory for last_running.
+                                        */
+                                       last_running = (xl_running_xacts *) malloc(sizeof(xl_running_xacts)
+                                                                                                                               + sizeof(TransactionId )
+                                                                                                                               * (running->subxcnt + running->xcnt));
+                                       memcpy(last_running, running, sizeof(xl_running_xacts)
+                                                                                                                + (sizeof(TransactionId)
+                                                                                                                * (running->subxcnt + running->xcnt)));
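
The suggestion in (3) can be sketched as follows. This is a standalone illustration under stated assumptions, not a revision of the patch: ToyRunningXacts is a cut-down stand-in for xl_running_xacts that keeps only the fields used here, and toy_copy_running() is a hypothetical helper. Standalone C has no palloc(), so the sketch uses malloc() with an explicit NULL check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

/* Cut-down stand-in for xl_running_xacts: only the fields used in this sketch. */
typedef struct
{
    int           xcnt;     /* number of top-level xids */
    int           subxcnt;  /* number of subtransaction xids */
    TransactionId xids[];   /* flexible array with xcnt + subxcnt entries */
} ToyRunningXacts;

/*
 * Copy a record that ends in a flexible array. The total size is computed
 * once and reused for both the allocation and the copy.
 * Returns NULL if the allocation fails; the caller frees the result.
 */
ToyRunningXacts *
toy_copy_running(const ToyRunningXacts *running)
{
    assert(running != NULL && running->xcnt >= 0 && running->subxcnt >= 0);

    size_t      nxids = (size_t) running->xcnt + (size_t) running->subxcnt;
    size_t      size = offsetof(ToyRunningXacts, xids)
                       + nxids * sizeof(TransactionId);
    ToyRunningXacts *copy = malloc(size);

    if (copy == NULL)
        return NULL;

    memcpy(copy, running, size);
    return copy;
}
```

Because the xids live inside the same allocation as the header, a single free() of the copy releases the array as well.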

Best Regards,
Takamichi Osumi

#9Jeremy Schneider
schnjere@amazon.com
In reply to: Masahiko Sawada (#6)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On 7/29/21 01:25, Masahiko Sawada wrote:

On Tue, Mar 16, 2021 at 1:35 AM Oh, Mike <minsoo@amazon.com> wrote:

Sending this to pgsql-hackers list to create a CommitFest entry with the attached patch proposal.

...

Detailed problem description:

Tested on 11.8 & current master.

The logical replication slot restart_lsn advances in the middle of an open txn that modified the catalog (e.g. TRUNCATE operation).

Should the logical decoding has to restart it could fail with an error like this:

ERROR: could not map filenode "base/13237/442428"

Thank you for reporting the issue.

I could reproduce this issue by the steps you shared.

I also noticed a bug report earlier this year with another PG user
reporting the same error - on version 12.3

/messages/by-id/16812-3d9df99bd77ff616@postgresql.org

Today I received a report from a new PG user of this same error message
causing their logical replication to break. This customer was also
running PostgreSQL 12.3 on both source and target side.

Haven't yet dumped WAL or anything, but wanted to point out that the
error is being seen in the wild - I hope we can get a version of this
patch committed soon, as it will help with at least one cause.

-Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services

#10osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: osumi.takamichi@fujitsu.com (#8)
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Friday, September 24, 2021 5:03 PM I wrote:

On Tuesday, March 16, 2021 1:35 AM Oh, Mike <minsoo@amazon.com> wrote:

We noticed that the logical replication could fail when the
Standby::RUNNING_XACT record is generated in the middle of a catalog
modifying transaction and if the logical decoding has to restart from
the RUNNING_XACT WAL entry.

...

Proposed solution:
If we’re decoding a catalog modifying commit record, then check
whether it’s part of the RUNNING_XACT xid’s processed @ the
restart_lsn. If so, then add its xid & subxacts in the committed txns
list in the logical decoding snapshot.

Please refer to the attachment for the proposed patch.

Let me share some review comments for the patch.

....

(3) suggestion of small readability improvement

We calculate the same size twice here and in DecodeCommit.
I suggest declaring a variable that stores the computed size, which
would shorten the code.

+                                       /*
+                                        * xl_running_xacts contains a xids Flexible Array
+                                        * and its size is subxcnt + xcnt.
+                                        * Take that into account while allocating
+                                        * the memory for last_running.
+                                        */
+                                       last_running = (xl_running_xacts *) malloc(sizeof(xl_running_xacts)
+                                                                               + sizeof(TransactionId )
+                                                                               * (running->subxcnt + running->xcnt));
+                                       memcpy(last_running, running, sizeof(xl_running_xacts)
+                                                                               + (sizeof(TransactionId)
+                                                                               * (running->subxcnt + running->xcnt)));

Let me add one more basic review comment in DecodeStandbyOp().

Why do you call raw malloc directly?
There is no check of whether the return value is NULL.
Did you intend to call palloc here instead?

Best Regards,
Takamichi Osumi

#11Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Drouvot, Bertrand (#7)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Thu, Sep 23, 2021 at 5:44 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:

Hi,

On 7/29/21 10:25 AM, Masahiko Sawada wrote:

Thank you for reporting the issue.

I could reproduce this issue by the steps you shared.

Thanks for looking at it!

Currently, the system relies on processing Heap2::NEW_CID to keep track of catalog modifying (sub)transactions.

This context is lost if the logical decoding has to restart from a Standby::RUNNING_XACTS that is written between the NEW_CID record and its parent txn commit.

If the logical stream restarts from this restart_lsn, then it doesn't have the xid responsible for modifying the catalog.

I agree with your analysis. Since we don’t use commit WAL record to
track the transaction that has modified system catalogs, if we decode
only the commit record of such transaction, we cannot know the
transaction has been modified system catalogs, resulting in the
subsequent transaction scans system catalog with the wrong snapshot.

Right.

With the patch, if the commit WAL record has a XACT_XINFO_HAS_INVALS
flag and its xid is included in RUNNING_XACT record written at
restart_lsn, we forcibly add the top XID and its sub XIDs as a
committed transaction that has modified system catalogs to the
snapshot. I might be missing something about your patch but I have
some comments on this approach:

1. Commit WAL record may not have invalidation message for system
catalogs (e.g., when commit record has only invalidation message for
relcache) even if it has XACT_XINFO_HAS_INVALS flag.

Right, good point (create policy for example would lead to an
invalidation for relcache only).

In this case, the
transaction wrongly is added to the snapshot, is that okay?

This transaction is a committed one, and IIUC the snapshot would be used
only for catalog visibility, so i don't see any issue to add it in the
snapshot, what do you think?

It seems to me that it's no problem since we always mark the transaction
as catalog-changed when decoding XLOG_XACT_INVALIDATIONS records.

2. We might add a subtransaction XID as a committed transaction that
has modified system catalogs even if it actually didn't.

Right, like when needs_timetravel is true.

As the
comment in SnapBuildBuildSnapshot() describes, we track only the
transactions that have modified the system catalog and store in the
snapshot (in the ‘xip' array). The patch could break that assumption.

Right. It looks to me that breaking this assumption is not an issue.

IIUC currently the committed ones that are not modifying the catalog are
not stored "just" because we don't need them.

However, I’m really not sure how to deal with this point. We cannot
know which subtransaction has actually modified system catalogs by
using only the commit WAL record.

Right, unless we rewrite this patch so that a commit WAL record will
produce this information.

3. The patch covers only the case where the restart_lsn exactly
matches the LSN of RUNNING_XACT.

Right.

I wonder if there could be a case
where the decoding starts at a WAL record other than RUNNING_XACT but
the next WAL record is RUNNING_XACT.

Not sure, but could a restart_lsn not be a RUNNING_XACTS?

I guess the decoding always starts from RUNNING_XACTS.

After more thought, I think that the basic approach of the proposed
patch is probably a good idea: we add to the snapshot any xid whose
commit record has XACT_XINFO_HAS_INVALS. The problem as I see it is
that while decoding a COMMIT record we cannot know which transactions
(top transaction or subtransactions) actually did catalog changes. But
given that we always mark the transaction as catalog-changed even if
XLOG_XACT_INVALIDATIONS has only a relcache invalidation message, it
seems to be no problem. Therefore, in the reported cases, we can
probably add both the top transaction xid and its subtransaction xids
to the snapshot.

Regarding the patch details, I have two comments:

---
+ if ((parsed->xinfo & XACT_XINFO_HAS_INVALS) && last_running)
+ {
+     /* make last_running->xids bsearch()able */
+     qsort(last_running->xids,
+              last_running->subxcnt + last_running->xcnt,
+              sizeof(TransactionId), xidComparator);

The patch does qsort() every time the commit record has
XACT_XINFO_HAS_INVALS. IIUC the only xids we need to remember are
those recorded in the first replayed XLOG_RUNNING_XACTS,
right? If so, we need to do qsort() only once, can remove an xid from
the array once it gets committed, and can eventually make
last_running empty so that we can skip even TransactionIdInArray().
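
A minimal standalone sketch of that idea follows: sort once, remove each xid when its commit is seen, and skip the search once the list is empty. It is an illustration only; ToyLastRunning, toy_last_running_init(), and toy_last_running_take() are hypothetical names and not part of either patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

/* xids remembered from the first RUNNING_XACTS record decoded (toy structure). */
typedef struct
{
    TransactionId *xids;    /* sorted ascending by toy_last_running_init() */
    size_t         nxids;
} ToyLastRunning;

/* Plain unsigned comparison, similar in spirit to PostgreSQL's xidComparator. */
int
toy_cmp_xid(const void *a, const void *b)
{
    TransactionId xa = *(const TransactionId *) a;
    TransactionId xb = *(const TransactionId *) b;

    return (xa > xb) - (xa < xb);
}

/* Remember the array (caller-owned storage) and sort it exactly once. */
void
toy_last_running_init(ToyLastRunning *lr, TransactionId *xids, size_t n)
{
    assert(lr != NULL);
    lr->xids = xids;
    lr->nxids = n;
    if (n > 1)
        qsort(xids, n, sizeof(TransactionId), toy_cmp_xid);
}

/*
 * Called when a commit is decoded. Returns true if xid was in the remembered
 * set and removes it, so the set shrinks toward empty as commits arrive.
 */
bool
toy_last_running_take(ToyLastRunning *lr, TransactionId xid)
{
    assert(lr != NULL);

    if (lr->nxids == 0)
        return false;       /* nothing left to track: skip the search */

    TransactionId *hit = bsearch(&xid, lr->xids, lr->nxids,
                                 sizeof(TransactionId), toy_cmp_xid);

    if (hit == NULL)
        return false;

    /* Close the gap; the remaining entries stay sorted. */
    size_t      idx = (size_t) (hit - lr->xids);

    memmove(hit, hit + 1, (lr->nxids - idx - 1) * sizeof(TransactionId));
    lr->nxids--;
    return true;
}
```

Removing with memmove() keeps the array sorted, so no further qsort() is needed, and the early return on an empty list is what lets later commits skip the lookup entirely.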

---
last_running is allocated by malloc() and isn't freed even
after logical decoding finishes.

Another idea to fix this problem would be that before calling
SnapBuildCommitTxn() we create transaction entries in ReorderBuffer
for (sub)transactions whose COMMIT record has XACT_XINFO_HAS_INVALS,
and then mark all of them as catalog-changed by calling
ReorderBufferXidSetCatalogChanges(). I've attached a PoC patch for
this idea. What the patch does is essentially the same as what the
proposed patch does. But the patch doesn't modify the
SnapBuildCommitTxn(). And we remember the list of last running
transactions in reorder buffer and the list is periodically purged
during decoding RUNNING_XACTS records, eventually making it empty.
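
The periodic purge described above can be sketched as an in-place compaction, which avoids a scratch array. This is a minimal standalone illustration, not the attached patch: ToyXid, toy_xid_precedes() and toy_purge_older_than() are hypothetical names, and the wraparound-aware comparison is only modeled on how xid ordering is usually done.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;    /* hypothetical stand-in for TransactionId */

/*
 * Wraparound-aware "a is older than b" for normal xids, using modulo-2^32
 * arithmetic.  The cast relies on two's-complement conversion, which holds
 * on common platforms.
 */
int
toy_xid_precedes(ToyXid a, ToyXid b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Drop every remembered xid that is older than oldest_running, keeping the
 * relative order of the survivors.  Returns the new number of entries.
 */
size_t
toy_purge_older_than(ToyXid *xids, size_t n, ToyXid oldest_running)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (!toy_xid_precedes(xids[i], oldest_running))
            xids[kept++] = xids[i];
    }
    return kept;
}
```

Calling this on each later RUNNING_XACTS record with that record's oldest running xid shrinks the list until it is empty.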

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

poc_remember_last_running_xacts.patch (application/octet-stream)
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 2874dc0612..c919d0fa77 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -405,6 +405,26 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			{
 				xl_running_xacts *running = (xl_running_xacts *) XLogRecGetData(r);
 
+				/*
+				 * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+				 * if the transaction has changed the catalog, and that information
+				 * is not serialized to SnapBuilder.  Therefore, if the logical
+				 * decoding decodes the commit record of the transaction that actually
+				 * has done catalog changes without these records, we miss to add
+				 * the xid to the snapshot so up creating the wrong snapshot. To
+				 * avoid such a problem, if the COMMIT record of the xid listed in
+				 * the RUNNING_XACTS record read at the start of logical decoding
+				 * has XACT_XINFO_HAS_INVALS flag, we mark both the top transaction
+				 * and its substransactions as containing catalog changes (see also
+				 * ReorderBufferSetLastRunningXactsCatalogChanges()). Since we cannot
+				 * know which transactions actually have done catalog changes only
+				 * by reading the COMMIT record we do that for both.  So we might
+				 * mark an xid that actually has not done that but it’s not a
+				 * problem since we use historic snapshot only for reading system
+				 * catalogs.
+				 */
+				ReorderBufferProcessLastRunningXacts(ctx->reorder, running);
+
 				SnapBuildProcessRunningXacts(builder, buf->origptr, running);
 
 				/*
@@ -689,6 +709,15 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * Set the last running xacts as containing catalog change if necessary.
+	 * This must be done before SnapBuildCommitTxn() so that we include catalog
+	 * change transactions to the historic snapshot.
+	 */
+	ReorderBufferSetLastRunningXactsCatalogChanges(ctx->reorder, xid, parsed->xinfo,
+												   parsed->nsubxacts, parsed->subxacts,
+												   buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 46e66608cf..e2b688e107 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -346,6 +346,9 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	buffer->last_running_xacts = NULL;
+	buffer->n_last_running_xacts = -1;
+
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
 	buffer->spillBytes = 0;
@@ -5155,3 +5158,91 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+void
+ReorderBufferProcessLastRunningXacts(ReorderBuffer *rb, xl_running_xacts *running)
+{
+	/* Quick exit if there is no longer last running xacts */
+	if (likely(rb->n_last_running_xacts == 0))
+		return;
+
+	/* First call, build the last running xact list */
+	if (rb->n_last_running_xacts == -1)
+	{
+		int nxacts = running->subxcnt + running->xcnt;
+		Size sz = sizeof(TransactionId) * nxacts;
+
+		rb->last_running_xacts = MemoryContextAlloc(rb->context, sz);
+		memcpy(rb->last_running_xacts, running->xids, sz);
+		qsort(rb->last_running_xacts, nxacts, sizeof(TransactionId), xidComparator);
+
+		rb->n_last_running_xacts = nxacts;
+
+		return;
+	}
+
+	/*
+	 * Purge xids in the last running xacts list if we can do that for at least
+	 * one xid.
+	 */
+	if (NormalTransactionIdPrecedes(rb->last_running_xacts[0],
+									running->oldestRunningXid))
+	{
+		TransactionId *workspace;
+		int nxids = 0;
+
+		workspace = MemoryContextAlloc(rb->context, sizeof(TransactionId) * rb->n_last_running_xacts);
+		for (int i = 0; i < rb->n_last_running_xacts; i++)
+		{
+			if (NormalTransactionIdPrecedes(rb->last_running_xacts[i],
+											running->oldestRunningXid))
+				;	/* remove */
+			else
+				workspace[nxids++] = rb->last_running_xacts[i];
+		}
+
+		if (nxids > 0)
+			memcpy(rb->last_running_xacts, workspace, sizeof(TransactionId) * nxids);
+		else
+		{
+			pfree(rb->last_running_xacts);
+			rb->last_running_xacts = NULL;
+		}
+
+		rb->n_last_running_xacts = nxids;
+	}
+}
+
+void
+ReorderBufferSetLastRunningXactsCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+											   uint32 xinfo, int subxcnt,
+											   TransactionId *subxacts, XLogRecPtr lsn)
+{
+	void *test;
+
+	/*
+	 * Skip if there is no longer last running xacts information or the COMMIT record
+	 * doesn't have invalidation message, which is a common case.
+	 */
+	if (likely(rb->n_last_running_xacts == 0 || !(xinfo & XACT_XINFO_HAS_INVALS)))
+		return;
+
+	test = bsearch(&xid, rb->last_running_xacts, rb->n_last_running_xacts,
+				   sizeof(TransactionId), xidComparator);
+
+	/*
+	 * If this committed transaction is the one that was running at the time when
+	 * decoding the first RUNNING_XACTS record and have done catalog changes, we
+	 * can mark the top transaction and its subtransactions as catalog-changes.
+	 */
+	if (test != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(rb, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(rb, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(rb, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5b40ff75f7..234d7c0c61 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -12,6 +12,7 @@
 #include "access/htup_details.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -593,6 +594,9 @@ struct ReorderBuffer
 	/* memory accounting */
 	Size		size;
 
+	TransactionId *last_running_xacts;
+	int n_last_running_xacts;	/* -1 for initial value */
+
 	/*
 	 * Statistics about transactions spilled to disk.
 	 *
@@ -682,4 +686,9 @@ void		ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
 void		StartupReorderBuffer(void);
 
+void		ReorderBufferProcessLastRunningXacts(ReorderBuffer *rb, xl_running_xacts *running);
+void		ReorderBufferSetLastRunningXactsCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+														   uint32 xinfo, int subxcnt,
+														   TransactionId *subxacts, XLogRecPtr lsn);
+
 #endif
#12osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Masahiko Sawada (#11)
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Thursday, October 7, 2021 1:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Regarding the patch details, I have two comments:

---
+ if ((parsed->xinfo & XACT_XINFO_HAS_INVALS) && last_running) {
+     /* make last_running->xids bsearch()able */
+     qsort(last_running->xids,
+              last_running->subxcnt + last_running->xcnt,
+              sizeof(TransactionId), xidComparator);

The patch does qsort() every time the commit record has
XACT_XINFO_HAS_INVALS. IIUC the only xids we need to remember are
those recorded in the first replayed XLOG_RUNNING_XACTS, right? If so,
we need to do qsort() only once, can remove an xid from the array once it gets
committed, and can eventually make last_running empty so that we can
skip even TransactionIdInArray().

---
Also, last_running is allocated by malloc() and isn't freed even after
logical decoding finishes.

Another idea to fix this problem would be that before calling
SnapBuildCommitTxn() we create transaction entries in ReorderBuffer for
(sub)transactions whose COMMIT record has XACT_XINFO_HAS_INVALS,
and then mark all of them as catalog-changed by calling
ReorderBufferXidSetCatalogChanges(). I've attached a PoC patch for this idea.
What the patch does is essentially the same as what the proposed patch does.
But the patch doesn't modify the SnapBuildCommitTxn(). And we remember
the list of last running transactions in reorder buffer and the list is periodically
purged during decoding RUNNING_XACTS records, eventually making it
empty.

Thanks for the patch.

I conducted a quick check of the PoC.

check-world passed with your patch applied on top of HEAD.
Also, the original scenario described in [1] looks fine
with your revised patch and the LOG_SNAPSHOT_INTERVAL_MS expansion in the procedure.

The last command in the provided steps produced the output below.

postgres=# select * from pg_logical_slot_get_changes('bdt_slot', null, null);
    lsn    | xid |                  data
-----------+-----+----------------------------------------
 0/1560020 | 710 | BEGIN 710
 0/1560020 | 710 | table public.bdt: INSERT: a[integer]:1
 0/1560140 | 710 | COMMIT 710

Minor comments on the DecodeStandbyOp changes that I noticed right away:

(1) A minor suggestion for your comment.

+                                * has done catalog changes without these records, we miss to add
+                                * the xid to the snapshot so up creating the wrong snapshot. To

"miss to add" should be "miss adding" or "fail to add".
And is "up creating" natural in this sentence?

(2) There is a full-width apostrophe between "it" and "s" in the next sentence.

+ * mark an xid that actually has not done that but it’s not a

[1]: /messages/by-id/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2@amazon.com

Best Regards,
Takamichi Osumi

#13Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Masahiko Sawada (#11)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

At Thu, 7 Oct 2021 13:20:14 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in

Another idea to fix this problem would be that before calling
SnapBuildCommitTxn() we create transaction entries in ReorderBuffer
for (sub)transactions whose COMMIT record has XACT_XINFO_HAS_INVALS,
and then mark all of them as catalog-changed by calling
ReorderBufferXidSetCatalogChanges(). I've attached a PoC patch for
this idea. What the patch does is essentially the same as what the
proposed patch does. But the patch doesn't modify the
SnapBuildCommitTxn(). And we remember the list of last running
transactions in reorder buffer and the list is periodically purged
during decoding RUNNING_XACTS records, eventually making it empty.

I came up with a third way. SnapBuildCommitTxn already properly
handles the case where a ReorderBufferTXN has
RBTXN_HAS_CATALOG_CHANGES set. So this issue can be resolved by creating
such ReorderBufferTXNs in SnapBuildProcessRunningXacts.

One problem is that this change creates a case where multiple
ReorderBufferTXNs share the same first_lsn. I haven't come up with a
clean way to avoid relaxing the restriction in AssertTXNLsnOrder.
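
To make the first_lsn concern concrete, here is a small standalone sketch of a strict versus relaxed ordering check. It is not the real AssertTXNLsnOrder(); ToyLsn and toy_first_lsns_ordered() are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t ToyLsn;    /* hypothetical stand-in for XLogRecPtr */

/*
 * Check that the first_lsn values of top-level transactions are ordered.
 * With allow_equal = false, each value must be strictly higher than the
 * previous one; with allow_equal = true, neighbours may share an LSN, as
 * happens when one RUNNING_XACTS record creates several entries at once.
 */
bool
toy_first_lsns_ordered(const ToyLsn *first_lsn, size_t n, bool allow_equal)
{
    for (size_t i = 1; i < n; i++)
    {
        if (first_lsn[i - 1] > first_lsn[i])
            return false;
        if (!allow_equal && first_lsn[i - 1] == first_lsn[i])
            return false;
    }
    return true;
}
```

Two entries registered from the same record fail the strict check and pass the relaxed one, which is why the PoC changes the assertion from "<" to "<=".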

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

register_running_xacts_at_logical_decoding_start_PoC.txt (text/plain; charset=us-ascii)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 46e66608cf..503116764f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -887,9 +887,14 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 		if (cur_txn->end_lsn != InvalidXLogRecPtr)
 			Assert(cur_txn->first_lsn <= cur_txn->end_lsn);
 
-		/* Current initial LSN must be strictly higher than previous */
+		/*
+		 * Current initial LSN must be strictly higher than previous, except when
+		 * this transaction is created by XLOG_RUNNING_XACTS.  If one
+		 * XLOG_RUNNING_XACTS creates multiple transactions, they share the
+		 * same LSN. See SnapBuildProcessRunningXacts.
+		 */
 		if (prev_first_lsn != InvalidXLogRecPtr)
-			Assert(prev_first_lsn < cur_txn->first_lsn);
+			Assert(prev_first_lsn <= cur_txn->first_lsn);
 
 		/* known-as-subtxn txns must not be listed */
 		Assert(!rbtxn_is_known_subxact(cur_txn));
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a5333349a8..58859112dc 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1097,6 +1097,20 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
 	{
+		/*
+		 * At the time we passed the first XLOG_RUNNING_XACTS record, the
+		 * transactions notified by the record may have updated
+		 * catalogs. Register the transactions with marking them as having
+		 * caused catalog changes.  The worst misbehavior here is some spurious
+		 * invalidation at decoding start.
+		 */
+		if (builder->state == SNAPBUILD_START)
+		{
+			for (int i = 0 ; i < running->xcnt + running->subxcnt ; i++)
+				ReorderBufferXidSetCatalogChanges(builder->reorder,
+												  running->xids[i], lsn);
+		}
+
 		/* returns false if there's no point in performing cleanup just yet */
 		if (!SnapBuildFindSnapshot(builder, lsn, running))
 			return;
#14Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Kyotaro Horiguchi (#13)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Fri, Oct 8, 2021 at 4:50 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 7 Oct 2021 13:20:14 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in

Another idea to fix this problem would be that before calling
SnapBuildCommitTxn() we create transaction entries in ReorderBuffer
for (sub)transactions whose COMMIT record has XACT_XINFO_HAS_INVALS,
and then mark all of them as catalog-changed by calling
ReorderBufferXidSetCatalogChanges(). I've attached a PoC patch for
this idea. What the patch does is essentially the same as what the
proposed patch does. But the patch doesn't modify the
SnapBuildCommitTxn(). And we remember the list of last running
transactions in reorder buffer and the list is periodically purged
during decoding RUNNING_XACTS records, eventually making it empty.

I came up with a third way. SnapBuildCommitTxn already properly
handles the case where a ReorderBufferTXN has
RBTXN_HAS_CATALOG_CHANGES set. So this issue can be resolved by creating
such ReorderBufferTXNs in SnapBuildProcessRunningXacts.

Thank you for the idea and patch!

It's much simpler than mine. I think that creating an entry for a
catalog-changed transaction in the reorder buffer before
SnapBuildCommitTxn() is the right direction.

After more thought, given that DDL is less likely to happen than DML in
practice, perhaps we can always mark both the top transaction and its
subtransactions as containing catalog changes if the commit record has
XACT_XINFO_HAS_INVALS? I believe this is not likely to lead to
overhead in practice. That way, the patch could be simpler and
wouldn't need the change to AssertTXNLsnOrder().
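
The "always mark on commit" rule can be sketched with a toy model. This is only an illustration: ToyTxn, toy_mark_on_commit() and the TOY_XINFO_HAS_INVALS bit are hypothetical and stand in for the reorder buffer entries and the XACT_XINFO_HAS_INVALS flag discussed above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit standing in for XACT_XINFO_HAS_INVALS. */
#define TOY_XINFO_HAS_INVALS (1U << 3)

typedef struct ToyTxn
{
    uint32_t xid;
    bool     catalog_changed;
} ToyTxn;

/*
 * If the commit record carries invalidation messages, conservatively mark
 * the top-level transaction and all of its subtransactions as having
 * changed the catalog.  Returns the number of transactions marked.
 */
size_t
toy_mark_on_commit(uint32_t xinfo, ToyTxn *top, ToyTxn *subs, size_t nsubs)
{
    if (!(xinfo & TOY_XINFO_HAS_INVALS))
        return 0;           /* common case: plain DML commit */

    top->catalog_changed = true;
    for (size_t i = 0; i < nsubs; i++)
        subs[i].catalog_changed = true;

    return nsubs + 1;
}
```

The cost is over-marking: a subtransaction that made no catalog change is still flagged, which is the trade-off being weighed here.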

I've attached another PoC patch. Also, I've added the tests for this
issue in test_decoding.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

poc_mark_all_txns_catalog_change.patch (application/octet-stream)
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b879..4553252d75 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot
+	twophase_snapshot catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..bc142fc384
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..43c0b64289
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,32 @@
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the second checkpoint execution.  This transaction must be marked as
+# containing catalog changes during decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 2874dc0612..8af420ccea 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -689,6 +689,25 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * Mark the top transaction and its subtransactions as containing catalog
+	 * changes, if the commit record has invalidation message.  This is necessary
+	 * for the case where we decode only the commit record of the transaction
+	 * that actually has done catalog changes.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+	{
+		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+
+		for (int i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAssignChild(ctx->reorder, xid, parsed->subxacts[i],
+									 buf->origptr);
+			ReorderBufferXidSetCatalogChanges(ctx->reorder, parsed->subxacts[i],
+											  buf->origptr);
+		}
+	}
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
#15Drouvot, Bertrand
bdrouvot@amazon.com
In reply to: Masahiko Sawada (#14)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

Hi,

On 10/11/21 8:27 AM, Masahiko Sawada wrote:

On Fri, Oct 8, 2021 at 4:50 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 7 Oct 2021 13:20:14 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in

Another idea to fix this problem would be that before calling
SnapBuildCommitTxn() we create transaction entries in ReorderBuffer
for (sub)transactions whose COMMIT record has XACT_XINFO_HAS_INVALS,
and then mark all of them as catalog-changed by calling
ReorderBufferXidSetCatalogChanges(). I've attached a PoC patch for
this idea. What the patch does is essentially the same as what the
proposed patch does. But the patch doesn't modify the
SnapBuildCommitTxn(). And we remember the list of last running
transactions in reorder buffer and the list is periodically purged
during decoding RUNNING_XACTS records, eventually making it empty.

I came up with a third way. SnapBuildCommitTxn already properly
handles the case where a ReorderBufferTXN has
RBTXN_HAS_CATALOG_CHANGES set. So this issue can be resolved by creating
such ReorderBufferTXNs in SnapBuildProcessRunningXacts.

Thank you for the idea and patch!

Thank you both for your new patch proposals!

I liked Sawada's one but also do "prefer" Horiguchi's one.

It's much simpler than mine. I think that creating an entry for a
catalog-changed transaction in the reorder buffer before
SnapBuildCommitTxn() is the right direction.

+1

After more thought, given that DDL is less likely to happen than DML in
practice, perhaps we can always mark both the top transaction and its
subtransactions as containing catalog changes if the commit record has
XACT_XINFO_HAS_INVALS? I believe this is not likely to lead to
overhead in practice. That way, the patch could be simpler and
wouldn't need the change to AssertTXNLsnOrder().

I've attached another PoC patch. Also, I've added the tests for this
issue in test_decoding.

Thanks!

It looks good to me, just have a remark about the comment:

+   /*
+    * Mark the top transaction and its subtransactions as containing catalog
+    * changes, if the commit record has invalidation message. This is necessary
+    * for the case where we decode only the commit record of the transaction
+    * that actually has done catalog changes.
+    */

What about?

    /*
     * Mark the top transaction and its subtransactions as containing catalog
     * changes, if the commit record has invalidation message. This is necessary
     * for the case where we did not decode the transaction that did the catalog
     * change(s) (the decoding restarted after). So that we are decoding only the
     * commit record of the transaction that actually has done catalog changes.
     */

Thanks

Bertrand

#16Jeremy Schneider
schneider@ardentperf.com
In reply to: Masahiko Sawada (#14)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On 10/10/21 23:27, Masahiko Sawada wrote:

After more thought, given that DDL is less likely to happen than DML in
practice, ...

I haven't looked closely at the patch, but I'd be careful about
workloads where people create and drop "temporary tables". I've seen
this pattern used a few times, especially by developers who came from a
SQL server background, for some reason.

I certainly don't think we need to optimize for this workload - which is
not a best practice on PostgreSQL. I'd just want to be careful not to
make PostgreSQL logical replication crumble underneath it, if PG was
previously keeping up with difficulty. That would be a sad upgrade
experience!

-Jeremy

--
http://about.me/jeremy_schneider

#17osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Masahiko Sawada (#14)
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Monday, October 11, 2021 3:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Oct 8, 2021 at 4:50 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:

At Thu, 7 Oct 2021 13:20:14 +0900, Masahiko Sawada
<sawada.mshk@gmail.com> wrote in

Another idea to fix this problem would be that before calling
SnapBuildCommitTxn() we create transaction entries in ReorderBuffer
for (sub)transactions whose COMMIT record has XACT_XINFO_HAS_INVALS,
and then mark all of them as catalog-changed by calling
ReorderBufferXidSetCatalogChanges(). I've attached a PoC patch for
this idea. What the patch does is essentially the same as what the
proposed patch does. But the patch doesn't modify the
SnapBuildCommitTxn(). And we remember the list of last running
transactions in reorder buffer and the list is periodically purged
during decoding RUNNING_XACTS records, eventually making it empty.

I came up with a third way. SnapBuildCommitTxn already properly
handles the case where a ReorderBufferTXN has
RBTXN_HAS_CATALOG_CHANGES set. So this issue can be resolved by creating
such ReorderBufferTXNs in SnapBuildProcessRunningXacts.

Thank you for the idea and patch!

It's much simpler than mine. I think that creating an entry for a catalog-changed
transaction in the reorder buffer before
SnapBuildCommitTxn() is the right direction.

After more thought, given that DDL is less likely to happen than DML in practice,
perhaps we can always mark both the top transaction and its subtransactions
as containing catalog changes if the commit record has
XACT_XINFO_HAS_INVALS? I believe this is not likely to lead to overhead in
practice. That way, the patch could be simpler and wouldn't need the
change to AssertTXNLsnOrder().

I've attached another PoC patch. Also, I've added the tests for this issue in
test_decoding.

I also felt that your patch addresses the problem in a good way.
Even without the xid being set by NEW_CID decoding, as in the original scenario,
we can set the catalog-change flag.

One really minor comment:
in DecodeCommit(), you don't need to declare i. It's already defined at the top of the function.

+ for (int i = 0; i < parsed->nsubxacts; i++)

Best Regards,
Takamichi Osumi

#18Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Jeremy Schneider (#16)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Oct 13, 2021 at 7:55 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:

On 10/10/21 23:27, Masahiko Sawada wrote:

After more thought, given that DDL is less likely to happen than DML in
practice, ...

I haven't looked closely at the patch, but I'd be careful about
workloads where people create and drop "temporary tables". I've seen
this pattern used a few times, especially by developers who came from a
SQL server background, for some reason.

True. But since the snapshot builder is designed on the same
assumption, it should not be problematic. It keeps track of the
committed catalog-modifying transactions instead of keeping track of
all running transactions. See the header comment of snapbuild.c.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#19Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Masahiko Sawada (#14)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

At Mon, 11 Oct 2021 15:27:41 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in

On Fri, Oct 8, 2021 at 4:50 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

I came up with a third way. SnapBuildCommitTxn already properly
handles the case where a ReorderBufferTXN has
RBTXN_HAS_CATALOG_CHANGES set. So this issue can be resolved by creating
such ReorderBufferTXNs in SnapBuildProcessRunningXacts.

Thank you for the idea and patch!

It's much simpler than mine. I think that creating an entry for a
catalog-changed transaction in the reorder buffer before
SnapBuildCommitTxn() is the right direction.

After more thought, given that DDL is less likely to happen than DML in
practice, perhaps we can always mark both the top transaction and its
subtransactions as containing catalog changes if the commit record has
XACT_XINFO_HAS_INVALS? I believe this is not likely to lead to
overhead in practice. That way, the patch could be simpler and
wouldn't need the change to AssertTXNLsnOrder().

I've attached another PoC patch. Also, I've added the tests for this
issue in test_decoding.

Thanks for the test script. (I did that with the TAP framework, but the
isolation tester version is simpler.)

It adds a call to ReorderBufferAssignChild, but subtransactions are
usually assigned to the top level elsewhere. In addition,
ReorderBufferCommitChild(), called just afterwards, does the same thing. We
are adding a third call to the same function, which looks a bit odd.

And I'm not sure it is wise to always mark all subtransactions as "catalog
changed" when the top transaction has XACT_XINFO_HAS_INVALS. The
reason I did that in the snapshot building phase is to avoid adding
to DecodeCommit extra code that is needed only while a
transaction that has been running since before replication start survives.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#20osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Kyotaro Horiguchi (#19)
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Thursday, October 14, 2021 11:21 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Mon, 11 Oct 2021 15:27:41 +0900, Masahiko Sawada
<sawada.mshk@gmail.com> wrote in

On Fri, Oct 8, 2021 at 4:50 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

I came up with a third way. SnapBuildCommitTxn already properly
handles the case where a ReorderBufferTXN has
RBTXN_HAS_CATALOG_CHANGES set. So this issue can be resolved by creating
such ReorderBufferTXNs in SnapBuildProcessRunningXacts.

Thank you for the idea and patch!

It's much simpler than mine. I think that creating an entry for a
catalog-changed transaction in the reorder buffer before
SnapBuildCommitTxn() is the right direction.

After more thought, given that DDL is less likely to happen than DML in
practice, perhaps we can always mark both the top transaction and its
subtransactions as containing catalog changes if the commit record has
XACT_XINFO_HAS_INVALS? I believe this is not likely to lead to
overhead in practice. That way, the patch could be simpler and
wouldn't need the change to AssertTXNLsnOrder().

I've attached another PoC patch. Also, I've added the tests for this
issue in test_decoding.

Thanks for the test script. (I did that with the TAP framework, but the isolation tester
version is simpler.)

It adds a call to ReorderBufferAssignChild, but subtransactions are usually
assigned to the top level elsewhere. In addition,
ReorderBufferCommitChild(), called just afterwards, does the same thing. We are
adding a third call to the same function, which looks a bit odd.

It can be odd. However, we
have a check at the top of ReorderBufferAssignChild
to judge if the sub transaction is already associated or not
and skip the processings if it is.

And I'm not sure it is wise to mark all subtransactions as "catalog changed"
always when the top transaction is XACT_XINFO_HAS_INVALS.

In order to avoid this,
can't we have a new flag (for example, in the reorderbuffer struct) to check
whether we started decoding from RUNNING_XACTS, which is similar to the first patch of [1],
and use it in DecodeCommit? This still adds some special-case code
to DecodeCommit, and this solution becomes a bit similar to the other previous patches, though.

[1]: /messages/by-id/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2@amazon.com

Best Regards,
Takamichi Osumi

#21Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#20)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

At Tue, 19 Oct 2021 02:45:24 +0000, "osumi.takamichi@fujitsu.com" <osumi.takamichi@fujitsu.com> wrote in

On Thursday, October 14, 2021 11:21 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

It adds a call to ReorderBufferAssignChild, but usually subtransactions are
assigned to the top level elsewhere. In addition to that,
ReorderBufferCommitChild(), called just later, does the same thing. We are
adding the third call to the same function, which looks a bit odd.

It can be odd. However, we
have a check at the top of ReorderBufferAssignChild
that judges whether the subtransaction is already associated
and skips the processing if it is.

My question was why we need to make the extra call to
ReorderBufferCommitChild when XACT_XINFO_HAS_INVALS is set, in spite of the
existing call to the same function that is made unconditionally. It
doesn't cost much, but it's not free either.

And I'm not sure it is wise to mark all subtransactions as "catalog changed"
always when the top transaction is XACT_XINFO_HAS_INVALS.

In order to avoid this,
can't we have a new flag (for example, in the reorderbuffer struct) to check
whether we started decoding from RUNNING_XACTS, which is similar to the first patch of [1],
and use it in DecodeCommit? This still adds some special-case code
to DecodeCommit, and this solution becomes a bit similar to the other previous patches, though.

If it is somehow wrong in any sense that we add subtransactions in
SnapBuildProcessRunningXacts (for example, if we should avoid relaxing
the assertion condition), I think we would go another way. Otherwise
we don't even need that additional flag. (But Sawada-san's recent PoC
also needs that relaxation.)

AFAICS, and unless I'm missing something (the odds of that are relatively
high :p), we need the specially added subtransactions only for the
transactions that were running when we passed the first RUNNING_XACTS,
because otherwise (substantial) subtransactions are assigned to
the top level by the first record of the subtransaction.

Before reaching consistency, DecodeCommit feeds the subtransactions to
ReorderBufferForget individually, so the subtransactions need not be
assigned to the top transaction at all. Since the
subtransactions added by the first RUNNING_XACTS are processed that
way, we don't need to call ReorderBufferCommitChild
for such subtransactions in the first place.

[1] - /messages/by-id/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2@amazon.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#22Drouvot, Bertrand
bdrouvot@amazon.com
In reply to: Kyotaro Horiguchi (#21)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

Hi,

I just rebased the last POC version linked to the CF entry (a minor change
in contrib/test_decoding/Makefile), as it was failing on the CF bot.

Thanks

Bertrand

Attachments:

poc_mark_all_txns_catalog_change.patch (text/plain)
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 36929dd97d..05c0c5a2f8 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,7 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	sequence
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..bc142fc384
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..43c0b64289
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,32 @@
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the second checkpoint execution.  This transaction must be marked as
+# containing catalog changes during decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 18cf931822..ade48bd71e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -629,6 +629,25 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * Mark the top transaction and its subtransactions as containing catalog
+	 * changes, if the commit record has invalidation message.  This is necessary
+	 * for the case where we decode only the commit record of the transaction
+	 * that actually has done catalog changes.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+	{
+		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+
+		for (int i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAssignChild(ctx->reorder, xid, parsed->subxacts[i],
+									 buf->origptr);
+			ReorderBufferXidSetCatalogChanges(ctx->reorder, parsed->subxacts[i],
+											  buf->origptr);
+		}
+	}
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
#23Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#14)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, Oct 11, 2021 at 11:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

It's much simpler than mine. I think that creating an entry of a
catalog-changed transaction in the reorder buffer before
SnapBuildCommitTxn() is the right direction.

After more thought, given that DDL is less likely to happen than DML in
practice, probably we can always mark both the top transaction and its
subtransactions as containing catalog changes if the commit record has
XACT_XINFO_HAS_INVALS?

I have some observations and thoughts on this work.

1.
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the second checkpoint execution.  This transaction must be marked as
+# containing catalog changes during decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"

In the first line of the comment, do you mean to say "... record emitted
during the first checkpoint"? Only then can it start from the
commit record of the transaction that has performed the truncate.

2.
+	/*
+	 * Mark the top transaction and its subtransactions as containing catalog
+	 * changes, if the commit record has invalidation message.  This is necessary
+	 * for the case where we decode only the commit record of the transaction
+	 * that actually has done catalog changes.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+	{
+		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+
+		for (int i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAssignChild(ctx->reorder, xid, parsed->subxacts[i],
+									 buf->origptr);
+			ReorderBufferXidSetCatalogChanges(ctx->reorder, parsed->subxacts[i],
+											  buf->origptr);
+		}
+	}
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);

Marking it before SnapBuildCommitTxn has one disadvantage: we
sometimes do this work even if the snapshot state is SNAPBUILD_START
or SNAPBUILD_BUILDING_SNAPSHOT, in which case SnapBuildCommitTxn
wouldn't do anything. Also, while this will fix the issue, it
seems we need to do this work even when we have already marked
the txn as having catalog changes, and there are probably cases where we
mark them when it is not required, as discussed in this thread.

I think if we don't have any better ideas then we should go with
either this or one of the other proposals in this thread. The other
idea that occurred to me is whether we can somehow update the snapshot
we have serialized on disk about this information. On each
running_xact record when we serialize the snapshot, we also try to
purge the committed xacts (via SnapBuildPurgeCommittedTxn). So, during
that we can check if there are committed xacts to be purged and if we
have previously serialized the snapshot for the prior running xact
record, if so, we can update it with the list of xacts that have
catalog changes. If this is feasible then I think we need to somehow
remember the point where we last serialized the snapshot (maybe by
using builder->last_serialized_snapshot). Even if this is feasible, we
may not be able to do this in back-branches because of the disk-format
change required for this.

Thoughts?

--
With Regards,
Amit Kapila.

#24Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amit Kapila (#23)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

At Sat, 21 May 2022 15:35:58 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

I think if we don't have any better ideas then we should go with
either this or one of the other proposals in this thread. The other
idea that occurred to me is whether we can somehow update the snapshot
we have serialized on disk about this information. On each
running_xact record when we serialize the snapshot, we also try to
purge the committed xacts (via SnapBuildPurgeCommittedTxn). So, during
that we can check if there are committed xacts to be purged and if we
have previously serialized the snapshot for the prior running xact
record, if so, we can update it with the list of xacts that have
catalog changes. If this is feasible then I think we need to somehow
remember the point where we last serialized the snapshot (maybe by
using builder->last_serialized_snapshot). Even if this is feasible, we
may not be able to do this in back-branches because of the disk-format
change required for this.

Thoughts?

I haven't looked at it closely, but it seems to work. I'm not sure how much
the spurious invalidations at replication start affect performance,
but it is promising if the impact is significant. That being said, I'm
a bit negative about doing that at the post-beta1 stage.

I thought for a moment that RUNNING_XACT might be able to contain
invalidation information, but that seems too complex to do at such
a frequency.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#25Amit Kapila
amit.kapila16@gmail.com
In reply to: Kyotaro Horiguchi (#24)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, May 23, 2022 at 10:03 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

I haven't looked at it closely, but it seems to work. I'm not sure how much
the spurious invalidations at replication start affect performance,
but it is promising if the impact is significant.

It seems Sawada-San's patch does this at each commit, not at the start
of replication, and I think that is required because we need this each
time replication restarts. So, I feel this will be an ongoing
overhead for spurious cases with the current approach.

That being said, I'm
a bit negative about doing that at the post-beta1 stage.

Fair point. We can do it early in PG-16 if the approach is
feasible, and backpatch something along the lines of what Sawada-San or you
proposed.

--
With Regards,
Amit Kapila.

#26Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#25)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, May 23, 2022 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 23, 2022 at 10:03 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Sat, 21 May 2022 15:35:58 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

I think if we don't have any better ideas then we should go with
either this or one of the other proposals in this thread. The other
idea that occurred to me is whether we can somehow update the snapshot
we have serialized on disk about this information. On each
running_xact record when we serialize the snapshot, we also try to
purge the committed xacts (via SnapBuildPurgeCommittedTxn). So, during
that we can check if there are committed xacts to be purged and if we
have previously serialized the snapshot for the prior running xact
record, if so, we can update it with the list of xacts that have
catalog changes. If this is feasible then I think we need to somehow
remember the point where we last serialized the snapshot (maybe by
using builder->last_serialized_snapshot). Even if this is feasible, we
may not be able to do this in back-branches because of the disk-format
change required for this.

Thoughts?

It seems to work, could you draft the patch?

I haven't looked at it closely, but it seems to work. I'm not sure how much
the spurious invalidations at replication start affect performance,
but it is promising if the impact is significant.

It seems Sawada-San's patch does this at each commit, not at the start
of replication, and I think that is required because we need this each
time replication restarts. So, I feel this will be an ongoing
overhead for spurious cases with the current approach.

That being said, I'm
a bit negative about doing that at the post-beta1 stage.

Fair point. We can do it early in PG-16 if the approach is
feasible, and backpatch something along the lines of what Sawada-San or you
proposed.

+1.

I proposed two approaches, [1] and [2], and I prefer [1].
Horiguchi-san's idea [3] also looks good, but I think it's better to
somehow deal with the problem he mentioned:

One problem with this is that the change creates a case where multiple
ReorderBufferTXNs share the same first_lsn. I haven't come up with a
clean idea to avoid relaxing the restriction of AssertTXNLsnOrder.
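
To make the ordering concern concrete, here is a minimal sketch of a strict versus relaxed ordering check over a list of first_lsn values. It is an illustration only: toy_lsn_order_ok and its allow_equal parameter are hypothetical, and the sketch assumes that the existing assertion expects each transaction's first_lsn to be strictly greater than the previous one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * Check that first_lsn values are ordered.  With allow_equal = false the
 * order must be strictly increasing; with allow_equal = true, neighbours
 * may share the same LSN (the "relaxed" form discussed above).
 */
static bool
toy_lsn_order_ok(const XLogRecPtr *first_lsn, size_t n, bool allow_equal)
{
	assert(first_lsn != NULL || n == 0);

	for (size_t i = 1; i < n; i++)
	{
		if (first_lsn[i] < first_lsn[i - 1])
			return false;		/* out of order in either mode */
		if (!allow_equal && first_lsn[i] == first_lsn[i - 1])
			return false;		/* duplicates rejected in strict mode */
	}
	return true;
}
```

If several entries were created at the LSN of the same RUNNING_XACTS record, only the relaxed form would accept them, which is the trade-off being weighed here.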

Regards,

[1]: /messages/by-id/CAD21AoAn-k6OpZ6HSAH_G91tpTXR6KYvkf663kg6EqW-f6sz1w@mail.gmail.com
[2]: /messages/by-id/CAD21AoD00wV4gt-53ze+ZB8n4bqJrdH8J_UnDHddy8S2A+a25g@mail.gmail.com
[3]: /messages/by-id/20211008.165055.1621145185927268721.horikyota.ntt@gmail.com

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#27Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#26)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, May 24, 2022 at 7:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

It seems to work, could you draft the patch?

I can help with the review and discussion.

--
With Regards,
Amit Kapila.

#28Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#27)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, May 24, 2022 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, May 24, 2022 at 7:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

It seems to work, could you draft the patch?

I can help with the review and discussion.

Okay, I'll draft the patch for this idea.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#29Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#28)
3 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, May 25, 2022 at 12:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, May 24, 2022 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, May 24, 2022 at 7:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

It seems to work, could you draft the patch?

I can help with the review and discussion.

Okay, I'll draft the patch for this idea.

I've attached three POC patches:

poc_remember_last_running_xacts_v2.patch is a rebased version of my
previous proposal[1]. It is based on the original proposal: we
remember the running-xacts list of the first decoded
RUNNING_XACTS record and, when decoding a commit record, check whether it
has XACT_XINFO_HAS_INVALS and whether its xid is in the list. This doesn't
require any file format changes, but the transaction will end up being
added to the snapshot even if it has only relcache invalidations.
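
The commit-time check described for the first patch could look roughly like the following sketch. It is not taken from the patch: ToyInitialRunning and toy_commit_needs_catalog_mark are hypothetical names, and TOY_XINFO_HAS_INVALS is an illustrative flag bit that is not claimed to match the value in the real headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Illustrative flag bit only; stands in for XACT_XINFO_HAS_INVALS. */
#define TOY_XINFO_HAS_INVALS (1U << 3)

/* Xids listed in the first decoded RUNNING_XACTS record (hypothetical). */
typedef struct ToyInitialRunning
{
	const TransactionId *xip;
	size_t		xcnt;
} ToyInitialRunning;

/*
 * Decide whether a committing transaction should be treated as having
 * changed the catalog: its commit record carries invalidations AND it was
 * already running at the first RUNNING_XACTS record.
 */
static bool
toy_commit_needs_catalog_mark(const ToyInitialRunning *running,
							  TransactionId xid, uint32_t xinfo)
{
	assert(running != NULL);

	if ((xinfo & TOY_XINFO_HAS_INVALS) == 0)
		return false;

	for (size_t i = 0; i < running->xcnt; i++)
	{
		if (running->xip[i] == xid)
			return true;
	}
	return false;
}
```

Because the flag is set for any invalidation message, a check of this shape also returns true for a transaction that produced only relcache invalidations, which is the false positive mentioned above.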

poc_add_running_catchanges_xacts_to_serialized_snapshot.patch is a
patch for the idea Amit Kapila proposed, with some changes. The basic
approach is to remember the list of xids that had changed catalogs and
were still running when the snapshot was serialized. The list of xids is kept
in the SnapBuild struct and is written to and restored from the
serialized snapshot. When decoding a commit record, we check whether the
transaction is already marked as having catalog changes or its xid is in the
list. If so, we add it to the snapshot. Unlike the first patch, it
adds only transactions that have actually changed catalogs, but as Amit
mentioned before, this idea cannot be back-patched because it changes the
on-disk format of the serialized snapshot.
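
The lookup side of this second approach can be sketched as a sorted xid array searched with bsearch(). This is a simplified model under stated assumptions: ToyCatChanges, toy_catchanges_set, toy_catchanges_contains and toy_xid_cmp are hypothetical names, and toy_xid_cmp is a plain unsigned comparison standing in for whatever comparator the real code uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Plain unsigned comparison; a stand-in comparator for this sketch. */
static int
toy_xid_cmp(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	if (x < y)
		return -1;
	if (x > y)
		return 1;
	return 0;
}

/* Hypothetical model of the list kept alongside the snapshot builder. */
typedef struct ToyCatChanges
{
	size_t		xcnt;
	TransactionId *xip;			/* sorted with toy_xid_cmp */
} ToyCatChanges;

/* Install a new list; the array is sorted in place and not copied. */
static void
toy_catchanges_set(ToyCatChanges *cc, TransactionId *xids, size_t n)
{
	assert(cc != NULL);

	if (n > 1)
		qsort(xids, n, sizeof(TransactionId), toy_xid_cmp);
	cc->xip = xids;
	cc->xcnt = n;
}

/* True if xid is in the remembered list. */
static bool
toy_catchanges_contains(const ToyCatChanges *cc, TransactionId xid)
{
	assert(cc != NULL);

	if (cc->xcnt == 0)
		return false;			/* avoid handing a NULL array to bsearch */
	return bsearch(&xid, cc->xip, cc->xcnt,
				   sizeof(TransactionId), toy_xid_cmp) != NULL;
}
```

Keeping the array sorted makes each commit-time lookup logarithmic in the list size, and the same array can be written out as-is when the list is serialized.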

poc_add_regression_tests.patch adds regression tests for this bug. The
regression tests are required for both HEAD and back-patching, but I've
kept this patch separate so that the above two patches can be tested easily.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

poc_add_running_catchanges_xacts_to_serialized_snapshot.patch (application/x-patch)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8da5f9089c..19123cbfa3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -4821,6 +4821,45 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	txn->toast_hash = NULL;
 }
 
+/*
+ * Return palloc'ed array of the transactions that have changed catalogs.
+ * The returned array is sorted in xidComparator order.
+ *
+ * The caller must free the returned array when done with it.
+ */
+TransactionId *
+ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb, size_t *xcnt_p)
+{
+	HASH_SEQ_STATUS hash_seq;
+	ReorderBufferTXNByIdEnt *ent;
+	TransactionId *xids;
+	size_t	xcnt = 0;
+	size_t	xcnt_space = 64; /* arbitrary number */
+
+	xids = (TransactionId *) palloc(sizeof(TransactionId) * xcnt_space);
+
+	hash_seq_init(&hash_seq, rb->by_txn);
+	while ((ent = hash_seq_search(&hash_seq)) != NULL)
+	{
+		ReorderBufferTXN *txn = ent->txn;
+
+		if (!rbtxn_has_catalog_changes(txn))
+			continue;
+
+		if (xcnt >= xcnt_space)
+		{
+			xcnt_space *= 2;
+			xids = repalloc(xids, sizeof(TransactionId) * xcnt_space);
+		}
+
+		xids[xcnt++] = txn->xid;
+	}
+
+	qsort(xids, xcnt, sizeof(TransactionId), xidComparator);
+
+	*xcnt_p = xcnt;
+	return xids;
+}
 
 /* ---------------------------------------
  * Visibility support for logical decoding
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1119a12db9..c57d5f91d9 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -241,6 +241,26 @@ struct SnapBuild
 		 */
 		TransactionId *xip;
 	}			committed;
+
+	/*
+	 * Array of transactions that were running when the snapshot serialization
+	 * and changed system catalogs, but that are not committed.
+	 *
+	 * We normally rely on HEAP2_NEW_CID records and XLOG_XACT_INVALIDATIONS to
+	 * know if the transaction has changed the catalog. But it could happen that
+	 * the logical decoding decodes only the commit record of the transaction.
+	 * This array keeps track of the transactions that were running but not
+	 * committed when serializing and restoring a snapshot, and is used to add
+	 * such transactions to the snapshot.
+	 */
+	struct
+	{
+		/* number of transactions */
+		size_t		xcnt;
+
+		/* This array must be sorted in xidComparator order */
+		TransactionId *xip;
+	}			catchanges;
 };
 
 /*
@@ -306,6 +326,9 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 		palloc0(builder->committed.xcnt_space * sizeof(TransactionId));
 	builder->committed.includes_all_transactions = true;
 
+	builder->catchanges.xcnt = 0;
+	builder->catchanges.xip = NULL;
+
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
@@ -983,7 +1006,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		 * Add subtransaction to base snapshot if catalog modifying, we don't
 		 * distinguish to toplevel transactions there.
 		 */
-		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid) ||
+			bsearch(&xid, builder->catchanges.xip, builder->catchanges.xcnt,
+					sizeof(TransactionId), xidComparator) != NULL)
 		{
 			sub_needs_timetravel = true;
 			needs_snapshot = true;
@@ -1012,7 +1037,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 
 	/* if top-level modified catalog, it'll need a snapshot */
-	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid) ||
+		bsearch(&xid, builder->catchanges.xip, builder->catchanges.xcnt,
+				sizeof(TransactionId), xidComparator) != NULL)
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes",
 			 xid);
@@ -1438,6 +1465,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
  *
  * struct SnapBuildOnDisk;
  * TransactionId * committed.xcnt; (*not xcnt_space*)
+ * TransactionId * catchanges.xcnt;
  *
  */
 typedef struct SnapBuildOnDisk
@@ -1578,8 +1606,17 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 				(errcode_for_file_access(),
 				 errmsg("could not remove file \"%s\": %m", tmppath)));
 
+	/*
+	 * Update the transactions that are running and changes catalogs that are
+	 * not committed.
+	 */
+	if (builder->catchanges.xip)
+		pfree(builder->catchanges.xip);
+	builder->catchanges.xip = ReorderBufferGetCatalogChangesXacts(builder->reorder,
+																  &builder->catchanges.xcnt);
+
 	needed_length = sizeof(SnapBuildOnDisk) +
-		sizeof(TransactionId) * builder->committed.xcnt;
+		sizeof(TransactionId) * (builder->committed.xcnt + builder->catchanges.xcnt);
 
 	ondisk_c = MemoryContextAllocZero(builder->context, needed_length);
 	ondisk = (SnapBuildOnDisk *) ondisk_c;
@@ -1598,6 +1635,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	ondisk->builder.snapshot = NULL;
 	ondisk->builder.reorder = NULL;
 	ondisk->builder.committed.xip = NULL;
+	ondisk->builder.catchanges.xip = NULL;
 
 	COMP_CRC32C(ondisk->checksum,
 				&ondisk->builder,
@@ -1609,6 +1647,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
 	ondisk_c += sz;
 
+	/* copy catalog-changes xacts */
+	sz = sizeof(TransactionId) * builder->catchanges.xcnt;
+	memcpy(ondisk_c, builder->catchanges.xip, sz);
+	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+	ondisk_c += sz;
+
 	FIN_CRC32C(ondisk->checksum);
 
 	/* we have valid data now, open tempfile and write it there */
@@ -1832,6 +1876,33 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
 
+	/* restore catalog-changes xacts information */
+	sz = sizeof(TransactionId) * ondisk.builder.catchanges.xcnt;
+	ondisk.builder.catchanges.xip = MemoryContextAllocZero(builder->context, sz);
+	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+	readBytes = read(fd, ondisk.builder.catchanges.xip, sz);
+	pgstat_report_wait_end();
+	if (readBytes != sz)
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+
+		if (readBytes < 0)
+		{
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("could not read file \"%s\": read %d of %zu",
+							path, readBytes, sz)));
+	}
+	COMP_CRC32C(checksum, ondisk.builder.catchanges.xip, sz);
+
 	if (CloseTransientFile(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -1885,6 +1956,14 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	ondisk.builder.committed.xip = NULL;
 
+	builder->catchanges.xcnt = ondisk.builder.catchanges.xcnt;
+	if (builder->catchanges.xcnt > 0)
+	{
+		pfree(builder->committed.xip);
+		builder->catchanges.xip = ondisk.builder.catchanges.xip;
+	}
+	ondisk.builder.catchanges.xip = NULL;
+
 	/* our snapshot is not interesting anymore, build a new one */
 	if (builder->snapshot != NULL)
 	{
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4a01f877e5..07e378d3ef 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -677,6 +677,7 @@ extern void ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
 extern void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 extern ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 extern TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
+extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb, size_t *xcnt_p);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
poc_add_regression_tests.patch (application/x-patch)
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..fba67c49d6
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,33 @@
+# Test decoding only the commit record of a transaction that has changed the catalog.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# catalog-changes while decoding the COMMIT record and the decoding of the INSERT
+# record must read the pg_class with the correct historic snapshot.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
poc_remember_last_running_xacts_v2.patch (application/x-patch)
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index aa2427ba73..13d0c16541 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -343,6 +343,26 @@ standby_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			{
 				xl_running_xacts *running = (xl_running_xacts *) XLogRecGetData(r);
 
+				/*
+				 * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+				 * if the transaction has changed the catalog, and that information
+				 * is not serialized to SnapBuilder.  Therefore, if the logical
+				 * decoding decodes the commit record of a transaction that actually
+				 * made catalog changes without decoding these records, we fail to
+				 * add the xid to the snapshot and end up creating a wrong snapshot.
+				 * To avoid such a problem, if the COMMIT record of an xid listed in
+				 * the RUNNING_XACTS record read at the start of logical decoding
+				 * has XACT_XINFO_HAS_INVALS flag, we mark both the top transaction
+				 * and its subtransactions as containing catalog changes (see also
+				 * ReorderBufferSetLastRunningXactsCatalogChanges()). Since we cannot
+				 * know which transactions actually have done catalog changes only
+				 * by reading the COMMIT record, we do that for both.  So we might
+				 * mark an xid that actually has not done that, but it's not a
+				 * problem since we use the historic snapshot only for reading system
+				 * catalogs.
+				 */
+				ReorderBufferProcessLastRunningXacts(ctx->reorder, running);
+
 				SnapBuildProcessRunningXacts(builder, buf->origptr, running);
 
 				/*
@@ -627,6 +647,15 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * Set the last running xacts as containing catalog change if necessary.
+	 * This must be done before SnapBuildCommitTxn() so that we include catalog
+	 * change transactions to the historic snapshot.
+	 */
+	ReorderBufferSetLastRunningXactsCatalogChanges(ctx->reorder, xid, parsed->xinfo,
+												   parsed->nsubxacts, parsed->subxacts,
+												   buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8da5f9089c..fa5785e679 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -353,6 +353,9 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	buffer->last_running_xacts = NULL;
+	buffer->n_last_running_xacts = -1;
+
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
 	buffer->spillBytes = 0;
@@ -5161,3 +5164,103 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+void
+ReorderBufferProcessLastRunningXacts(ReorderBuffer *rb, xl_running_xacts *running)
+{
+	/* Quick exit if there is no longer last running xacts */
+	if (likely(rb->n_last_running_xacts == 0))
+		return;
+
+	/* First call, build the last running xact list */
+	if (rb->n_last_running_xacts == -1)
+	{
+		int nxacts = running->subxcnt + running->xcnt;
+		Size sz = sizeof(TransactionId) * nxacts;
+
+		rb->last_running_xacts = MemoryContextAlloc(rb->context, sz);
+		memcpy(rb->last_running_xacts, running->xids, sz);
+		qsort(rb->last_running_xacts, nxacts, sizeof(TransactionId), xidComparator);
+
+		rb->n_last_running_xacts = nxacts;
+
+		return;
+	}
+
+	/*
+	 * Purge xids in the last running xacts list if we can do that for at least
+	 * one xid.
+	 */
+	if (NormalTransactionIdPrecedes(rb->last_running_xacts[0],
+									running->oldestRunningXid))
+	{
+		TransactionId *workspace;
+		int nxids = 0;
+
+		workspace = MemoryContextAlloc(rb->context, sizeof(TransactionId) * rb->n_last_running_xacts);
+		for (int i = 0; i < rb->n_last_running_xacts; i++)
+		{
+			if (NormalTransactionIdPrecedes(rb->last_running_xacts[i],
+											running->oldestRunningXid))
+				;	/* remove */
+			else
+				workspace[nxids++] = rb->last_running_xacts[i];
+		}
+
+		if (nxids > 0)
+			memcpy(rb->last_running_xacts, workspace, sizeof(TransactionId) * nxids);
+		else
+		{
+			pfree(rb->last_running_xacts);
+			rb->last_running_xacts = NULL;
+		}
+
+		rb->n_last_running_xacts = nxids;
+	}
+}
+
+void
+ReorderBufferSetLastRunningXactsCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+											   uint32 xinfo, int subxcnt,
+											   TransactionId *subxacts, XLogRecPtr lsn)
+{
+	void *test;
+
+	/*
+	 * Skip if there is no longer last running xacts information or the COMMIT record
+	 * doesn't have invalidation message, which is a common case.
+	 */
+	if (likely(rb->n_last_running_xacts == 0 || !(xinfo & XACT_XINFO_HAS_INVALS)))
+		return;
+
+	test = bsearch(&xid, rb->last_running_xacts, rb->n_last_running_xacts,
+				   sizeof(TransactionId), xidComparator);
+
+	if (test == NULL)
+	{
+		for (int i = 0; i < subxcnt; i++)
+		{
+			test = bsearch(&subxacts[i], rb->last_running_xacts, rb->n_last_running_xacts,
+						   sizeof(TransactionId), xidComparator);
+
+			if (test != NULL)
+				break;
+		}
+	}
+
+	/*
+	 * If this committed transaction is the one that was running at the time when
+	 * decoding the first RUNNING_XACTS record and have done catalog changes, we
+	 * can mark the top transaction and its subtransactions as catalog-changes.
+	 */
+	if (test != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(rb, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(rb, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(rb, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4a01f877e5..c06df80e08 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -12,6 +12,7 @@
 #include "access/htup_details.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -593,6 +594,9 @@ struct ReorderBuffer
 	/* memory accounting */
 	Size		size;
 
+	TransactionId *last_running_xacts;
+	int n_last_running_xacts;	/* -1 for initial value */
+
 	/*
 	 * Statistics about transactions spilled to disk.
 	 *
@@ -682,4 +686,9 @@ extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
 extern void StartupReorderBuffer(void);
 
+void		ReorderBufferProcessLastRunningXacts(ReorderBuffer *rb, xl_running_xacts *running);
+void		ReorderBufferSetLastRunningXactsCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+														   uint32 xinfo, int subxcnt,
+														   TransactionId *subxacts, XLogRecPtr lsn);
+
 #endif
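
The purge step in ReorderBufferProcessLastRunningXacts() above drops every remembered xid that precedes the oldest running xid of a later RUNNING_XACTS record. A minimal standalone sketch of that filtering follows. It is not the patch's code: `xid_precedes` is a hypothetical stand-in for NormalTransactionIdPrecedes(), `purge_old_xids` is an illustrative name, and the array is compacted in place so no separately sized workspace buffer is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Hypothetical stand-in for NormalTransactionIdPrecedes(): a wraparound-aware
 * "a is older than b" test on 32-bit xids.  The unsigned-to-signed cast
 * relies on two's-complement behavior, as on all common platforms.
 */
static int
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Remove every xid that precedes oldest_running from xids[0..n-1],
 * compacting the survivors to the front of the same array and keeping
 * their relative order.  Returns the number of surviving entries.
 */
static size_t
purge_old_xids(TransactionId *xids, size_t n, TransactionId oldest_running)
{
	size_t		kept = 0;

	assert(xids != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (!xid_precedes(xids[i], oldest_running))
			xids[kept++] = xids[i];
	}

	return kept;
}
```

Calling `purge_old_xids()` on a sorted list keeps it sorted, so a later bsearch() over the first `kept` entries remains valid. A caller would free the array once the returned count reaches zero, as the patch does.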
#30Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#29)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, May 25, 2022 at 12:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

poc_add_regression_tests.patch adds regression tests for this bug. The
regression tests are required for both HEAD and back-patching but I've
separated this patch for testing the above two patches easily.

Few comments on the test case patch:
===============================
1.
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# catalog-changes while decoding the COMMIT record and the decoding of the INSERT
+# record must read the pg_class with the correct historic snapshot.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"

Will this test always work? What if we get an additional running_xact
record between steps "s0_commit" and "s0_begin" that is logged via
bgwriter? You can mimic that by adding an additional checkpoint
between those two steps. If we do that, the test will pass even
without the patch because I think the last decoding will start
decoding from this new running_xact record.

2.
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0'); }

It is better to skip empty transactions by using 'skip-empty-xacts' to
avoid any transaction from a background process like autovacuum. We
have previously seen some buildfarm failures due to that.

3. Did you intentionally omit the .out from the test case patch?

4.
This transaction must be marked as
+# catalog-changes while decoding the COMMIT record and the decoding of the INSERT
+# record must read the pg_class with the correct historic snapshot.

/marked as catalog-changes/marked as containing catalog changes

--
With Regards,
Amit Kapila.

#31Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#30)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jun 7, 2022 at 9:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, May 25, 2022 at 12:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

poc_add_regression_tests.patch adds regression tests for this bug. The
regression tests are required for both HEAD and back-patching but I've
separated this patch for testing the above two patches easily.

Thank you for the comments.

Few comments on the test case patch:
===============================
1.
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# catalog-changes while decoding the COMMIT record and the decoding of the INSERT
+# record must read the pg_class with the correct historic snapshot.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"

Will this test always work? What if we get an additional running_xact
record between steps "s0_commit" and "s0_begin" that is logged via
bgwriter? You can mimic that by adding an additional checkpoint
between those two steps. If we do that, the test will pass even
without the patch because I think the last decoding will start
decoding from this new running_xact record.

Right. Depending on the timing, the test could pass even without the
fix, but the timing never makes it fail. I think we would need to
somehow stop bgwriter to make the test case stable, but that seems
unrealistic. Do you have any better ideas?

2.
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0'); }

It is better to skip empty transactions by using 'skip-empty-xacts' to
avoid any transaction from a background process like autovacuum. We
have previously seen some buildfarm failures due to that.

Agreed.

3. Did you intentionally omit the .out from the test case patch?

No, I'll add the .out file in the next version of the patch.

4.
This transaction must be marked as
+# catalog-changes while decoding the COMMIT record and the decoding of the INSERT
+# record must read the pg_class with the correct historic snapshot.

/marked as catalog-changes/marked as containing catalog changes

Agreed.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#32Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#31)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, Jun 13, 2022 at 8:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jun 7, 2022 at 9:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, May 25, 2022 at 12:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

poc_add_regression_tests.patch adds regression tests for this bug. The
regression tests are required for both HEAD and back-patching but I've
separated this patch for testing the above two patches easily.

Thank you for the comments.

Few comments on the test case patch:
===============================
1.
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# catalog-changes while decoding the COMMIT record and the decoding of the INSERT
+# record must read the pg_class with the correct historic snapshot.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"

Will this test always work? What if we get an additional running_xact
record between steps "s0_commit" and "s0_begin" that is logged via
bgwriter? You can mimic that by adding an additional checkpoint
between those two steps. If we do that, the test will pass even
without the patch because I think the last decoding will start
decoding from this new running_xact record.

Right. Depending on the timing, the test could pass even without the
fix, but the timing never makes it fail. I think we would need to
somehow stop bgwriter to make the test case stable, but that seems
unrealistic.

Agreed. In my local testing for this case, I usually increase
LOG_SNAPSHOT_INTERVAL_MS to avoid such a situation, but I understand
that is not practical in a regression test.

Do you have any
better ideas?

No, I don't have any better ideas. I think it is better to add some
information related to this in the comments because it may help to
improve the test in the future if we come up with a better idea.

--
With Regards,
Amit Kapila.

#33Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#32)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jun 14, 2022 at 3:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 13, 2022 at 8:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jun 7, 2022 at 9:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, May 25, 2022 at 12:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

poc_add_regression_tests.patch adds regression tests for this bug. The
regression tests are required for both HEAD and back-patching but I've
separated this patch for testing the above two patches easily.

Thank you for the comments.

Few comments on the test case patch:
===============================
1.
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# catalog-changes while decoding the COMMIT record and the decoding of the INSERT
+# record must read the pg_class with the correct historic snapshot.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"

Will this test always work? What if we get an additional running_xact
record between steps "s0_commit" and "s0_begin" that is logged via
bgwriter? You can mimic that by adding an additional checkpoint
between those two steps. If we do that, the test will pass even
without the patch because I think the last decoding will start
decoding from this new running_xact record.

Right. Depending on the timing, the test could pass even without the
fix, but the timing never makes it fail. I think we would need to
somehow stop bgwriter to make the test case stable, but that seems
unrealistic.

Agreed. In my local testing for this case, I usually increase
LOG_SNAPSHOT_INTERVAL_MS to avoid such a situation, but I understand
that is not practical in a regression test.

Do you have any
better ideas?

No, I don't have any better ideas. I think it is better to add some
information related to this in the comments because it may help to
improve the test in the future if we come up with a better idea.

I don't have any better ideas for making it stable either, so I agree.
I've attached an updated version of the patch that adds the regression
tests.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

poc_add_regression_tests_v2.patch (application/octet-stream)
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..bffd856bbb
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has changed
+# the catalog.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
#34Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#29)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached three POC patches:

I think it will be a good idea if you can add a short commit message
at least to say which patch is proposed for HEAD and which one is for
back branches. Also, it would be good if you can add some description
of the fix in the commit message. Let's remove poc* from the patch
name.

Review poc_add_running_catchanges_xacts_to_serialized_snapshot
=====================================================
1.
+ /*
+ * Array of transactions that were running when the snapshot serialization
+ * and changed system catalogs,

The part of the sentence after serialization is not very clear.

2.
- if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+ if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid) ||
+ bsearch(&xid, builder->catchanges.xip, builder->catchanges.xcnt,
+ sizeof(TransactionId), xidComparator) != NULL)

Why are you using xid instead of subxid in the bsearch call? If there
is a valid reason, can we add a comment saying why it is okay to use
xid? Note, though, that we use subxid when adding to the committed xact
array, so I am not sure this is a good idea, but I might be missing
something.

Suggestions for improvement in comments:
-       /*
-        * Update the transactions that are running and changes catalogs that are
-        * not committed.
-        */
+       /* Update the catalog modifying transactions that are yet not committed. */
        if (builder->catchanges.xip)
                pfree(builder->catchanges.xip);
        builder->catchanges.xip = ReorderBufferGetCatalogChangesXacts(builder->reorder,
@@ -1647,7 +1644,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
        COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
        ondisk_c += sz;
-       /* copy catalog-changes xacts */
+       /* copy catalog modifying xacts */
        sz = sizeof(TransactionId) * builder->catchanges.xcnt;
        memcpy(ondisk_c, builder->catchanges.xip, sz);
        COMP_CRC32C(ondisk->checksum, ondisk_c, sz);

--
With Regards,
Amit Kapila.
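
Regarding comment 2 above: inside the loop over subtransactions, the lookup key passed to bsearch() is the top-level xid rather than the subxid being examined. A minimal standalone sketch of a per-subxid lookup in a sorted xid array follows, for illustration only. `xid_cmp` is a stand-in that orders xids numerically like xidComparator, and `xid_in_sorted_array` and `count_catalog_subxacts` are hypothetical helpers that do not exist in PostgreSQL. Whether the patch should look up subxid is for the author to confirm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Stand-in comparator: plain numeric ordering of xids. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	if (xa > xb)
		return 1;
	if (xa < xb)
		return -1;
	return 0;
}

/* Return true if xid is present in the sorted array xip[0..xcnt-1]. */
static bool
xid_in_sorted_array(TransactionId xid, const TransactionId *xip, size_t xcnt)
{
	/* An empty array may have a NULL base pointer; do not hand it to bsearch. */
	if (xcnt == 0)
		return false;

	return bsearch(&xid, xip, xcnt, sizeof(TransactionId), xid_cmp) != NULL;
}

/*
 * Count how many of the committing subtransactions appear in the sorted
 * catchanges array.  Each subxid is looked up individually; the top-level
 * xid plays no part in the search key.
 */
static size_t
count_catalog_subxacts(const TransactionId *subxacts, size_t nsubxacts,
					   const TransactionId *catchanges, size_t xcnt)
{
	size_t		hits = 0;

	for (size_t i = 0; i < nsubxacts; i++)
	{
		if (xid_in_sorted_array(subxacts[i], catchanges, xcnt))
			hits++;
	}

	return hits;
}
```

With a catchanges array of {731, 733} and committing subxids {731, 732, 733}, the per-subxid lookup finds two matches. Searching for a top-level xid of 730 on every iteration would find none.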

#35Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#34)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, Jul 4, 2022 at 6:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached three POC patches:

I think it will be a good idea if you can add a short commit message
at least to say which patch is proposed for HEAD and which one is for
back branches. Also, it would be good if you can add some description
of the fix in the commit message. Let's remove poc* from the patch
name.

Review poc_add_running_catchanges_xacts_to_serialized_snapshot
=====================================================

Few more comments:
1.
+
+ /* This array must be sorted in xidComparator order */
+ TransactionId *xip;
+ } catchanges;
 };

This array contains the transaction ids for subtransactions as well. I
think it is better to mention that in the comments.

2. Do we ever remove transaction ids from the catchanges->xip array?
If not, is there a reason for that? I think we can remove an entry
either at commit/abort or even immediately after adding the xid/subxid
to the committed->xip array.

3.
+ if (readBytes != sz)
+ {
+ int save_errno = errno;
+
+ CloseTransientFile(fd);
+
+ if (readBytes < 0)
+ {
+ errno = save_errno;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m", path)));
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not read file \"%s\": read %d of %zu",
+ path, readBytes, sz)));
+ }

This is the fourth instance of similar error handling code in
SnapBuildRestore(). Isn't it better to extract this into a separate
function?

4.
+TransactionId *
+ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb, size_t *xcnt_p)
+{
+ HASH_SEQ_STATUS hash_seq;
+ ReorderBufferTXNByIdEnt *ent;
+ TransactionId *xids;
+ size_t xcnt = 0;
+ size_t xcnt_space = 64; /* arbitrary number */
+
+ xids = (TransactionId *) palloc(sizeof(TransactionId) * xcnt_space);
+
+ hash_seq_init(&hash_seq, rb->by_txn);
+ while ((ent = hash_seq_search(&hash_seq)) != NULL)
+ {
+ ReorderBufferTXN *txn = ent->txn;
+
+ if (!rbtxn_has_catalog_changes(txn))
+ continue;

It would be better to allocate memory the first time we have to store
xids. There is a good chance that many a time this function will do
just palloc without having to store any xid.

5. Do you think we should do some performance testing for a mix of
ddl/dml workload to see if it adds any overhead in decoding due to
serialize/restore doing additional work? I don't think it should add
any meaningful overhead but OTOH there is no harm in doing some
testing of the same.
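
To make point 4 concrete, here is a minimal standalone sketch of deferring the allocation until the first xid actually has to be stored. It is only an illustration under some assumptions: it uses plain realloc() rather than palloc()/repalloc(), and the names XidList and xid_list_append are made up for this example and are not part of the patch.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Hypothetical growable xid list; not a PostgreSQL structure. */
typedef struct XidList
{
	TransactionId *xids;		/* stays NULL until the first append */
	size_t		cnt;			/* number of stored xids */
	size_t		space;			/* allocated capacity, in elements */
} XidList;

/* Append xid, allocating on first use. Returns 0 on success, -1 on OOM. */
int
xid_list_append(XidList *list, TransactionId xid)
{
	if (list->cnt >= list->space)
	{
		/* first call allocates 64 slots (arbitrary), later calls double */
		size_t		newspace = (list->space == 0) ? 64 : list->space * 2;
		TransactionId *tmp = realloc(list->xids,
									 sizeof(TransactionId) * newspace);

		if (tmp == NULL)
			return -1;
		list->xids = tmp;
		list->space = newspace;
	}

	list->xids[list->cnt++] = xid;
	return 0;
}
```

With this shape, a caller that finds no catalog modifying transaction never allocates anything and simply ends up with a NULL array and a count of zero.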

--
With Regards,
Amit Kapila.

#36Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#34)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, Jul 4, 2022 at 9:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached three POC patches:

I think it will be a good idea if you can add a short commit message
at least to say which patch is proposed for HEAD and which one is for
back branches. Also, it would be good if you can add some description
of the fix in the commit message. Let's remove poc* from the patch
name.

Updated.

Review poc_add_running_catchanges_xacts_to_serialized_snapshot
=====================================================
1.
+ /*
+ * Array of transactions that were running when the snapshot serialization
+ * and changed system catalogs,

The part of the sentence after serialization is not very clear.

Updated.

2.
- if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+ if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid) ||
+ bsearch(&xid, builder->catchanges.xip, builder->catchanges.xcnt,
+ sizeof(TransactionId), xidComparator) != NULL)

Why are you using xid instead of subxid in bsearch call? Can we add a
comment to say why it is okay to use xid if there is a valid reason?
But note, we are using subxid to add to the committed xact array so
not sure if this is a good idea but I might be missing something.

You're right, subxid should be used here.
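
For reference, a minimal standalone sketch of the lookup with the subxid as the search key. It is an illustration only: xid_cmp is a stand-in that orders xids numerically, which is what I assume xidComparator does, and xid_in_sorted_array is a made-up name rather than a function in the patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Plain numeric ordering, standing in for xidComparator (assumption). */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	if (x > y)
		return 1;
	if (x < y)
		return -1;
	return 0;
}

/* Return true if xid is present in the sorted array xip[0 .. xcnt-1]. */
bool
xid_in_sorted_array(TransactionId xid, const TransactionId *xip, size_t xcnt)
{
	/* avoid handing a possibly-NULL base pointer to bsearch() */
	if (xcnt == 0)
		return false;

	return bsearch(&xid, xip, xcnt, sizeof(TransactionId), xid_cmp) != NULL;
}
```

In the subtransaction loop the key has to be the subxid of the current iteration; passing the top-level xid would test the same value on every iteration.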

Suggestions for improvement in comments:
-       /*
-        * Update the transactions that are running and changes catalogs that are
-        * not committed.
-        */
+       /* Update the catalog modifying transactions that are yet not committed. */
        if (builder->catchanges.xip)
                pfree(builder->catchanges.xip);
        builder->catchanges.xip = ReorderBufferGetCatalogChangesXacts(builder->reorder,
@@ -1647,7 +1644,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
        COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
        ondisk_c += sz;
-       /* copy catalog-changes xacts */
+       /* copy catalog modifying xacts */
        sz = sizeof(TransactionId) * builder->catchanges.xcnt;
        memcpy(ondisk_c, builder->catchanges.xip, sz);
        COMP_CRC32C(ondisk->checksum, ondisk_c, sz);

Updated.

I'll post a new version of the patch in the next email, along with replies to the other comments.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#37Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#36)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Jul 6, 2022 at 7:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I'll post a new version of the patch in the next email, along with replies to the other comments.

Okay, thanks for working on this. Few comments/suggestions on
poc_remember_last_running_xacts_v2 patch:

1.
+ReorderBufferSetLastRunningXactsCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+    uint32 xinfo, int subxcnt,
+    TransactionId *subxacts, XLogRecPtr lsn)
+{
...
...
+
+ test = bsearch(&xid, rb->last_running_xacts, rb->n_last_running_xacts,
+    sizeof(TransactionId), xidComparator);
+
+ if (test == NULL)
+ {
+ for (int i = 0; i < subxcnt; i++)
+ {
+ test = bsearch(&subxacts[i], rb->last_running_xacts, rb->n_last_running_xacts,
+    sizeof(TransactionId), xidComparator);
...

Is there ever a possibility that the top transaction id is not in the
running_xacts list but one of its subxids is present? If yes, it is
not very obvious at least to me so adding a comment here could be
useful. If not, then why do we need this additional check for each of
the sub-transaction ids?

2.
@@ -627,6 +647,15 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
commit_time = parsed->origin_timestamp;
}

+ /*
+ * Set the last running xacts as containing catalog change if necessary.
+ * This must be done before SnapBuildCommitTxn() so that we include catalog
+ * change transactions to the historic snapshot.
+ */
+ ReorderBufferSetLastRunningXactsCatalogChanges(ctx->reorder, xid, parsed->xinfo,
+    parsed->nsubxacts, parsed->subxacts,
+    buf->origptr);
+
  SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
     parsed->nsubxacts, parsed->subxacts);

As mentioned previously, marking it before SnapBuildCommitTxn has one
disadvantage: we sometimes do this work even when the snapshot state
is SNAPBUILD_START or SNAPBUILD_BUILDING_SNAPSHOT, in which case
SnapBuildCommitTxn wouldn't do anything. Can we instead check whether
the particular txn has invalidations and is present in the
last_running_xacts list, along with the
ReorderBufferXidHasCatalogChanges check? I think that has the
additional advantage that we don't need this extra marking if the xact
is already marked as containing catalog changes.

3.
+ /*
+ * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+ * if the transaction has changed the catalog, and that information
+ * is not serialized to SnapBuilder.  Therefore, if the logical
+ * decoding decodes the commit record of the transaction that actually
+ * has done catalog changes without these records, we miss to add
+ * the xid to the snapshot so up creating the wrong snapshot.

The part of the sentence "... snapshot so up creating the wrong
snapshot." is not clear. Also, in this comment you have used two
spaces after a full stop in one place and one space in another. Let's
follow the nearby code practice and use a single space before a new
sentence.

4.
+void
+ReorderBufferProcessLastRunningXacts(ReorderBuffer *rb, xl_running_xacts *running)
+{
+ /* Quick exit if there is no longer last running xacts */
+ if (likely(rb->n_last_running_xacts == 0))
+ return;
+
+ /* First call, build the last running xact list */
+ if (rb->n_last_running_xacts == -1)
+ {
+ int nxacts = running->subxcnt + running->xcnt;
+ Size sz = sizeof(TransactionId) * nxacts;;
+
+ rb->last_running_xacts = MemoryContextAlloc(rb->context, sz);
+ memcpy(rb->last_running_xacts, running->xids, sz);
+ qsort(rb->last_running_xacts, nxacts, sizeof(TransactionId), xidComparator);
+
+ rb->n_last_running_xacts = nxacts;
+
+ return;
+ }

a. Can we add the function header comments for this function?
b. We seem to be tracking the running_xact information for the first
running_xact record after start/restart. The name last_running_xacts
doesn't sound appropriate for that, how about initial_running_xacts?

5.
+ /*
+ * Purge xids in the last running xacts list if we can do that for at least
+ * one xid.
+ */
+ if (NormalTransactionIdPrecedes(rb->last_running_xacts[0],
+ running->oldestRunningXid))

I think it would be a good idea to add a few lines here explaining why
it is safe to purge. IIUC, it is because the commit for those xacts
would have already been processed and we don't need such a xid
anymore.

6. As per the discussion above in this thread, having
XACT_XINFO_HAS_INVALS in the commit record doesn't necessarily
indicate that the xact has catalog changes. Can we add a comment
somewhere saying that in such a case we can't tell whether the txn has
catalog changes, but we still mark it as having catalog changes? Can
you please share one example of this case?
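
To illustrate the purge discussed in point 5, here is a standalone sketch of dropping every xid that precedes oldestRunningXid. It rests on a few assumptions: xid_precedes mimics a wraparound-aware comparison through a signed 32-bit difference, purge_old_xids is a made-up name, and the loop compacts the array by scanning all elements rather than relying on the sort order, so it is not necessarily how the patch does it.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Wraparound-aware "a precedes b" for normal xids (signed 32-bit difference). */
static int
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Drop every xid that precedes oldest_running, keeping the relative order of
 * the surviving entries. Returns the new number of elements.
 */
size_t
purge_old_xids(TransactionId *xids, size_t n, TransactionId oldest_running)
{
	size_t		kept = 0;

	for (size_t i = 0; i < n; i++)
	{
		/* an xid at or after oldest_running may still be running: keep it */
		if (!xid_precedes(xids[i], oldest_running))
			xids[kept++] = xids[i];
	}

	return kept;
}
```

The entries that get dropped are exactly those whose commit or abort must already have been decoded, which is the property the requested comment should spell out.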

--
With Regards,
Amit Kapila.

#38Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#35)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 5, 2022 at 8:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 4, 2022 at 6:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached three POC patches:

I think it will be a good idea if you can add a short commit message
at least to say which patch is proposed for HEAD and which one is for
back branches. Also, it would be good if you can add some description
of the fix in the commit message. Let's remove poc* from the patch
name.

Review poc_add_running_catchanges_xacts_to_serialized_snapshot
=====================================================

Few more comments:

Thank you for the comments.

1.
+
+ /* This array must be sorted in xidComparator order */
+ TransactionId *xip;
+ } catchanges;
};

This array contains the transaction ids for subtransactions as well. I
think it is better to mention that in the comments.

Updated.

2. Do we ever remove transaction ids from the catchanges->xip array?

No.

If not, is there a reason for that? I think we can remove an entry
either at commit/abort or even immediately after adding the xid/subxid
to the committed->xip array.

It might be a good idea, but I'm concerned that removing an XID from
the array at every commit/abort, or after adding it to the
committed->xip array, might be costly because the array has to be
adjusted to keep its order. Removing XIDs from the array would make
bsearch faster, but the array is updated reasonably often anyway
(every 15 sec).

3.
+ if (readBytes != sz)
+ {
+ int save_errno = errno;
+
+ CloseTransientFile(fd);
+
+ if (readBytes < 0)
+ {
+ errno = save_errno;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m", path)));
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not read file \"%s\": read %d of %zu",
+ path, readBytes, sz)));
+ }

This is the fourth instance of similar error handling code in
SnapBuildRestore(). Isn't it better to extract this into a separate
function?

Good idea, updated.

4.
+TransactionId *
+ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb, size_t *xcnt_p)
+{
+ HASH_SEQ_STATUS hash_seq;
+ ReorderBufferTXNByIdEnt *ent;
+ TransactionId *xids;
+ size_t xcnt = 0;
+ size_t xcnt_space = 64; /* arbitrary number */
+
+ xids = (TransactionId *) palloc(sizeof(TransactionId) * xcnt_space);
+
+ hash_seq_init(&hash_seq, rb->by_txn);
+ while ((ent = hash_seq_search(&hash_seq)) != NULL)
+ {
+ ReorderBufferTXN *txn = ent->txn;
+
+ if (!rbtxn_has_catalog_changes(txn))
+ continue;

It would be better to allocate memory the first time we have to store
xids. There is a good chance that many a time this function will do
just palloc without having to store any xid.

Agreed.

5. Do you think we should do some performance testing for a mix of
ddl/dml workload to see if it adds any overhead in decoding due to
serialize/restore doing additional work? I don't think it should add
any meaningful overhead but OTOH there is no harm in doing some
testing of the same.

Yes, it would be worth trying. I also believe this change doesn't
introduce noticeable overhead but let's check just in case.
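
Coming back to point 2, a rough standalone sketch of the cost I have in mind when removing one XID from a sorted array while keeping its order: the lookup is a binary search, but closing the gap needs a memmove over the tail of the array. This is an illustration only; xid_cmp stands in for xidComparator (assumed to be a plain numeric comparison) and remove_xid_sorted is a hypothetical helper that is not in the patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

/* Plain numeric ordering, standing in for xidComparator (assumption). */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	if (x > y)
		return 1;
	if (x < y)
		return -1;
	return 0;
}

/*
 * Remove xid from the sorted array xip[0 .. *xcnt-1], preserving the order.
 * Returns true and decrements *xcnt if the xid was found.
 */
bool
remove_xid_sorted(TransactionId *xip, size_t *xcnt, TransactionId xid)
{
	TransactionId *hit;
	size_t		idx;

	if (*xcnt == 0)
		return false;

	hit = bsearch(&xid, xip, *xcnt, sizeof(TransactionId), xid_cmp);
	if (hit == NULL)
		return false;

	/* shift the tail down by one slot: O(n) in the worst case */
	idx = (size_t) (hit - xip);
	memmove(hit, hit + 1, sizeof(TransactionId) * (*xcnt - idx - 1));
	(*xcnt)--;
	return true;
}
```

Doing this for every xid and subxid at each commit/abort is the adjustment cost mentioned above, whereas leaving stale entries in place only makes each bsearch slightly slower until the array is next replaced.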

I've attached an updated patch.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

0001-Add-catalog-modifying-transactions-to-logical-decodi.patch (application/octet-stream)
From 98a6ea9d18e0d7a864549db4e388f6e767be0edb Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 6 Jul 2022 12:53:36 +0900
Subject: [PATCH] Add catalog modifying transactions to logical decoding
 serialized snapshot.

Previously, we relied on HEAP2_NEW_CID records and
XLOG_XACT_INVALIDATIONS records to know whether a transaction had
modified the catalog, and that information was not serialized to the
snapshot. Therefore, if logical decoding decoded only the commit
record of a transaction that had actually modified a catalog, we
missed adding its XID to the snapshot and ended up looking at catalogs
with the wrong snapshot.

To fix this problem, this change adds the list of transaction IDs and
sub-transaction IDs that have modified catalogs and were running when
the snapshot was serialized to the serialized snapshot. When decoding
a COMMIT record, we check both the list and the ReorderBuffer to see
if the transaction has modified catalogs.

Since this adds additional information to the serialized snapshot, we
cannot backpatch it. For back branches, we take another approach:
remember the running-xacts list of the first decoded RUNNING_XACTS
record and, when decoding a commit record that has
XACT_XINFO_HAS_INVALS, check whether the transaction's XID is in that
list. This doesn't require any file format changes, but the
transaction will end up being added to the snapshot even if it has
only relcache invalidations.
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++
 .../specs/catalog_change_snapshot.spec        |  39 ++++
 .../replication/logical/reorderbuffer.c       |  41 ++++
 src/backend/replication/logical/snapbuild.c   | 193 +++++++++++-------
 src/include/replication/reorderbuffer.h       |   1 +
 6 files changed, 248 insertions(+), 72 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..bffd856bbb
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has modified
+# catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8da5f9089c..266376583f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -4821,6 +4821,47 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	txn->toast_hash = NULL;
 }
 
+/*
+ * Return palloc'ed array of the transactions that have changed catalogs.
+ * The returned array is sorted in xidComparator order.
+ *
+ * The caller must free the returned array when done with it.
+ */
+TransactionId *
+ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb, size_t *xcnt_p)
+{
+	HASH_SEQ_STATUS hash_seq;
+	ReorderBufferTXNByIdEnt *ent;
+	TransactionId *xids = NULL;
+	size_t	xcnt = 0;
+	size_t	xcnt_space = 64; /* arbitrary number */
+
+	hash_seq_init(&hash_seq, rb->by_txn);
+	while ((ent = hash_seq_search(&hash_seq)) != NULL)
+	{
+		ReorderBufferTXN *txn = ent->txn;
+
+		if (!rbtxn_has_catalog_changes(txn))
+			continue;
+
+		/* Initialize XID array */
+		if (xcnt == 0)
+			xids = (TransactionId *) palloc(sizeof(TransactionId) * xcnt_space);
+
+		if (xcnt >= xcnt_space)
+		{
+			xcnt_space *= 2;
+			xids = repalloc(xids, sizeof(TransactionId) * xcnt_space);
+		}
+
+		xids[xcnt++] = txn->xid;
+	}
+
+	qsort(xids, xcnt, sizeof(TransactionId), xidComparator);
+
+	*xcnt_p = xcnt;
+	return xids;
+}
 
 /* ---------------------------------------
  * Visibility support for logical decoding
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1119a12db9..e3e7c3dd23 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -241,6 +241,28 @@ struct SnapBuild
 		 */
 		TransactionId *xip;
 	}			committed;
+
+	/*
+	 * Array of transactions and subtransactions that had modified catalogs
+	 * and were running when the snapshot was serialized.
+	 *
+	 * We normally rely on HEAP2_NEW_CID records and XLOG_XACT_INVALIDATIONS to
+	 * know if the transaction has changed the catalog. But it could happen that
+	 * the logical decoding decodes only the commit record of the transaction.
+	 * This array keeps track of the transactions that have modified catalogs
+	 * and were running when serializing a snapshot, and this array is used to
+	 * add such transactions to the snapshot.
+	 *
+	 * This field is updated when restoring a serialized snapshot.
+	 */
+	struct
+	{
+		/* number of transactions */
+		size_t		xcnt;
+
+		/* This array must be sorted in xidComparator order */
+		TransactionId *xip;
+	}			catchanges;
 };
 
 /*
@@ -262,6 +284,8 @@ static void SnapBuildSnapIncRefcount(Snapshot snap);
 
 static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
+static inline bool SnapBuildXidHasCatalogChange(SnapBuild *builder, TransactionId xid);
+
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
 static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
@@ -269,6 +293,7 @@ static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutof
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
 static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path);
 
 /*
  * Allocate a new snapshot builder.
@@ -306,6 +331,9 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 		palloc0(builder->committed.xcnt_space * sizeof(TransactionId));
 	builder->committed.includes_all_transactions = true;
 
+	builder->catchanges.xcnt = 0;
+	builder->catchanges.xip = NULL;
+
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
@@ -983,7 +1011,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		 * Add subtransaction to base snapshot if catalog modifying, we don't
 		 * distinguish to toplevel transactions there.
 		 */
-		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (SnapBuildXidHasCatalogChange(builder, subxid))
 		{
 			sub_needs_timetravel = true;
 			needs_snapshot = true;
@@ -1012,7 +1040,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 
 	/* if top-level modified catalog, it'll need a snapshot */
-	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (SnapBuildXidHasCatalogChange(builder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes",
 			 xid);
@@ -1089,6 +1117,25 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 }
 
+/*
+ * Check both the snapshot and the reorder buffer to see if the given
+ * transaction has modified catalogs.
+ */
+static inline bool
+SnapBuildXidHasCatalogChange(SnapBuild *builder, TransactionId xid)
+{
+	if (builder->catchanges.xcnt > 0)
+	{
+		if (bsearch(&xid, builder->catchanges.xip, builder->catchanges.xcnt,
+					sizeof(TransactionId), xidComparator) != NULL)
+			return true;
+
+		/* fall through to check the reorder buffer */
+	}
+
+	return ReorderBufferXidHasCatalogChanges(builder->reorder, xid);
+}
+
 
 /* -----------------------------------
  * Snapshot building functions dealing with xlog records
@@ -1438,6 +1485,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
  *
  * struct SnapBuildOnDisk;
  * TransactionId * committed.xcnt; (*not xcnt_space*)
+ * TransactionId * catchanges.xcnt;
  *
  */
 typedef struct SnapBuildOnDisk
@@ -1493,6 +1541,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 {
 	Size		needed_length;
 	SnapBuildOnDisk *ondisk = NULL;
+	MemoryContext	old_ctx;
 	char	   *ondisk_c;
 	int			fd;
 	char		tmppath[MAXPGPATH];
@@ -1578,8 +1627,22 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 				(errcode_for_file_access(),
 				 errmsg("could not remove file \"%s\": %m", tmppath)));
 
+	/*
+	 * Update the catalog modifying transactions that are yet not committed.
+	 *
+	 * We switch the memory context in order to make sure that the space for
+	 * catalog modifying transactions is allocated in the snapshot builder
+	 * context.
+	 */
+	if (builder->catchanges.xip)
+		pfree(builder->catchanges.xip);
+	old_ctx = MemoryContextSwitchTo(builder->context);
+	builder->catchanges.xip = ReorderBufferGetCatalogChangesXacts(builder->reorder,
+																  &builder->catchanges.xcnt);
+	MemoryContextSwitchTo(old_ctx);
+
 	needed_length = sizeof(SnapBuildOnDisk) +
-		sizeof(TransactionId) * builder->committed.xcnt;
+		sizeof(TransactionId) * (builder->committed.xcnt + builder->catchanges.xcnt);
 
 	ondisk_c = MemoryContextAllocZero(builder->context, needed_length);
 	ondisk = (SnapBuildOnDisk *) ondisk_c;
@@ -1598,6 +1661,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	ondisk->builder.snapshot = NULL;
 	ondisk->builder.reorder = NULL;
 	ondisk->builder.committed.xip = NULL;
+	ondisk->builder.catchanges.xip = NULL;
 
 	COMP_CRC32C(ondisk->checksum,
 				&ondisk->builder,
@@ -1609,6 +1673,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
 	ondisk_c += sz;
 
+	/* copy catalog modifying xacts */
+	sz = sizeof(TransactionId) * builder->catchanges.xcnt;
+	memcpy(ondisk_c, builder->catchanges.xip, sz);
+	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+	ondisk_c += sz;
+
 	FIN_CRC32C(ondisk->checksum);
 
 	/* we have valid data now, open tempfile and write it there */
@@ -1707,7 +1777,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	int			fd;
 	char		path[MAXPGPATH];
 	Size		sz;
-	int			readBytes;
 	pg_crc32c	checksum;
 
 	/* no point in loading a snapshot if we're already there */
@@ -1739,29 +1808,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
 
 	/* read statically sized portion of snapshot */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk, SnapBuildOnDiskConstantSize);
-	pgstat_report_wait_end();
-	if (readBytes != SnapBuildOnDiskConstantSize)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes,
-							(Size) SnapBuildOnDiskConstantSize)));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk, SnapBuildOnDiskConstantSize, path);
 
 	if (ondisk.magic != SNAPBUILD_MAGIC)
 		ereport(ERROR,
@@ -1781,57 +1828,21 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 				SnapBuildOnDiskConstantSize - SnapBuildOnDiskNotChecksummedSize);
 
 	/* read SnapBuild */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk.builder, sizeof(SnapBuild));
-	pgstat_report_wait_end();
-	if (readBytes != sizeof(SnapBuild))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sizeof(SnapBuild))));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk.builder, sizeof(SnapBuild), path);
 	COMP_CRC32C(checksum, &ondisk.builder, sizeof(SnapBuild));
 
 	/* restore committed xacts information */
 	sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
 	ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, ondisk.builder.committed.xip, sz);
-	pgstat_report_wait_end();
-	if (readBytes != sz)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sz)));
-	}
+	SnapBuildRestoreContents(fd, (char *) ondisk.builder.committed.xip, sz, path);
 	COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
 
+	/* restore catalog modifying xacts information */
+	sz = sizeof(TransactionId) * ondisk.builder.catchanges.xcnt;
+	ondisk.builder.catchanges.xip = MemoryContextAllocZero(builder->context, sz);
+	SnapBuildRestoreContents(fd, (char *) ondisk.builder.catchanges.xip, sz, path);
+	COMP_CRC32C(checksum, ondisk.builder.catchanges.xip, sz);
+
 	if (CloseTransientFile(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -1885,6 +1896,14 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	ondisk.builder.committed.xip = NULL;
 
+	/* set catalog modifying transactions */
+	if (builder->catchanges.xip)
+		pfree(builder->catchanges.xip);
+	builder->catchanges.xcnt = ondisk.builder.catchanges.xcnt;
+	builder->catchanges.xip = ondisk.builder.catchanges.xip;
+
+	ondisk.builder.catchanges.xip = NULL;
+
 	/* our snapshot is not interesting anymore, build a new one */
 	if (builder->snapshot != NULL)
 	{
@@ -1909,6 +1928,38 @@ snapshot_not_interesting:
 	return false;
 }
 
+/*
+ * Read the contents of the serialized snapshot to the dest.
+ */
+static void
+SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
+{
+	int			readBytes;
+
+	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+	readBytes = read(fd, dest, size);
+	pgstat_report_wait_end();
+	if (readBytes != size)
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+
+		if (readBytes < 0)
+		{
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("could not read file \"%s\": read %d of %zu",
+							path, readBytes, size)));
+	}
+}
+
 /*
  * Remove all serialized snapshots that are not required anymore because no
  * slot can need them. This doesn't actually have to run during a checkpoint,
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4a01f877e5..07e378d3ef 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -677,6 +677,7 @@ extern void ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
 extern void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 extern ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 extern TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
+extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb, size_t *xcnt_p);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
-- 
2.24.3 (Apple Git-128)

#39Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#38)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Jul 6, 2022 at 12:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 5, 2022 at 8:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

2. Are we anytime removing transaction ids from catchanges->xip array?

No.

If not, is there a reason for the same? I think we can remove it
either at commit/abort or even immediately after adding the xid/subxid
to committed->xip array.

It might be a good idea but I'm concerned that removing XID from the
array at every commit/abort or after adding it to committed->xip array
might be costly as it requires adjustment of the array to keep its
order. Removing XIDs from the array would make bsearch faster but the
array is updated reasonably often (every 15 sec).

Fair point. However, I am slightly worried that we are unnecessarily
searching in this new array even when ReorderBufferTxn has the
required information. To avoid that, in function
SnapBuildXidHasCatalogChange(), we can first check
ReorderBufferXidHasCatalogChanges() and then check the array if the
first check doesn't return true. Also, by the way, do we need to
always keep builder->catchanges.xip updated via SnapBuildRestore()?
Isn't it sufficient that we just read and throw away contents from a
snapshot if builder->catchanges.xip is non-NULL?
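
A minimal standalone sketch of the lookup order suggested above: consult the per-transaction flag first, and binary-search the restored array only when the flag is not set. The names DemoTxn and demo_* are hypothetical stand-ins, not the patch's code, and xid_cmp assumes the same plain numeric ordering as xidComparator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Hypothetical stand-in for the flag kept per transaction in the reorder buffer. */
typedef struct DemoTxn
{
	TransactionId xid;
	bool		has_catalog_changes;
} DemoTxn;

/* Plain numeric ordering, used as a stand-in for xidComparator. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	if (xa == xb)
		return 0;
	return (xa > xb) ? 1 : -1;
}

/* First check: the in-memory flag (a linear scan here, only for the demo). */
static bool
demo_txn_has_catalog_changes(const DemoTxn *txns, size_t ntxns, TransactionId xid)
{
	for (size_t i = 0; i < ntxns; i++)
	{
		if (txns[i].xid == xid)
			return txns[i].has_catalog_changes;
	}
	return false;
}

/* Second check, reached only if the first fails: the sorted restored array. */
static bool
demo_xid_has_catalog_changes(const DemoTxn *txns, size_t ntxns,
							 const TransactionId *catchange_xip,
							 size_t catchange_xcnt,
							 TransactionId xid)
{
	assert(catchange_xcnt == 0 || catchange_xip != NULL);

	if (demo_txn_has_catalog_changes(txns, ntxns, xid))
		return true;

	return catchange_xcnt > 0 &&
		bsearch(&xid, catchange_xip, catchange_xcnt,
				sizeof(TransactionId), xid_cmp) != NULL;
}
```

With this ordering, the array search is skipped entirely once the restored array is empty.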

I had additionally thought about whether we can further optimize this
solution to store this additional information only when we need to
serialize for a checkpoint record, but I think that won't work because
the walsender can restart even without a restart of the server, in
which case the same problem can occur. I am not sure if there is a way
to further optimize this solution; let me know if you have any ideas.

--
With Regards,
Amit Kapila.

#40Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#39)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Jul 6, 2022 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 6, 2022 at 12:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 5, 2022 at 8:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

2. Are we anytime removing transaction ids from catchanges->xip array?

No.

If not, is there a reason for the same? I think we can remove it
either at commit/abort or even immediately after adding the xid/subxid
to committed->xip array.

It might be a good idea but I'm concerned that removing XID from the
array at every commit/abort or after adding it to committed->xip array
might be costly as it requires adjustment of the array to keep its
order. Removing XIDs from the array would make bsearch faster but the
array is updated reasonably often (every 15 sec).

Fair point. However, I am slightly worried that we are unnecessarily
searching in this new array even when ReorderBufferTxn has the
required information. To avoid that, in function
SnapBuildXidHasCatalogChange(), we can first check
ReorderBufferXidHasCatalogChanges() and then check the array if the
first check doesn't return true. Also, by the way, do we need to
always keep builder->catchanges.xip updated via SnapBuildRestore()?
Isn't it sufficient that we just read and throw away contents from a
snapshot if builder->catchanges.xip is non-NULL?

IIUC catchanges.xip is restored only once when restoring a consistent
snapshot via SnapBuildRestore(). I think it's necessary to set
catchanges.xip for later use in SnapBuildXidHasCatalogChange(). Or did
you mean via SnapBuildSerialize()?

I had additionally thought about whether we can further optimize this
solution to store this additional information only when we need to
serialize for a checkpoint record, but I think that won't work because
the walsender can restart even without a restart of the server, in
which case the same problem can occur.

Yes, probably we need to write catalog modifying transactions for
every serialized snapshot.

I am not sure if there is a way to further optimize this
solution; let me know if you have any ideas.

I suppose that writing additional information to serialized snapshots
would not be a noticeable overhead since we need 4 bytes per
transaction and we would not expect there is a huge number of
concurrent catalog modifying transactions. But both collecting catalog
modifying transactions (especially when there are many ongoing
transactions) and bsearch'ing on the XID list every time decoding the
COMMIT record could bring overhead.

A solution for the first point would be to keep track of catalog
modifying transactions by using a linked list so that we can avoid
checking all ongoing transactions.
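
A minimal sketch of that idea, assuming a singly linked list that holds one node per catalog-modifying transaction. CatChangeNode and collect_catchange_xids are hypothetical names for illustration, and malloc stands in for palloc. Only the listed transactions are visited, and the result is sorted so that it can be searched with bsearch later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Hypothetical list node; one exists per catalog-modifying transaction. */
typedef struct CatChangeNode
{
	TransactionId xid;
	struct CatChangeNode *next;
} CatChangeNode;

/* Plain numeric ordering, used as a stand-in for xidComparator. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	if (xa == xb)
		return 0;
	return (xa > xb) ? 1 : -1;
}

/*
 * Copy the XIDs on the list into a malloc'ed array sorted by xid_cmp.
 * Returns NULL (with *xcnt_p set to 0) for an empty list or on allocation
 * failure; the caller frees the result.
 */
static TransactionId *
collect_catchange_xids(const CatChangeNode *head, size_t *xcnt_p)
{
	size_t		xcnt = 0;
	size_t		i = 0;
	TransactionId *xids;

	assert(xcnt_p != NULL);
	*xcnt_p = 0;

	/* First pass: count the list entries. */
	for (const CatChangeNode *n = head; n != NULL; n = n->next)
		xcnt++;
	if (xcnt == 0)
		return NULL;

	xids = malloc(xcnt * sizeof(TransactionId));
	if (xids == NULL)
		return NULL;

	/* Second pass: copy the XIDs, then sort them. */
	for (const CatChangeNode *n = head; n != NULL; n = n->next)
		xids[i++] = n->xid;
	qsort(xids, xcnt, sizeof(TransactionId), xid_cmp);

	*xcnt_p = xcnt;
	return xids;
}
```

The cost is proportional to the number of catalog-modifying transactions rather than to the number of all ongoing transactions.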

Regarding the second point, on reflection, I think we need to look up
the XID list only until all XIDs in the list are committed or aborted.
We can remove XIDs from the list after adding them to committed.xip, as
you suggested. Or, when decoding a RUNNING_XACTS record, we can remove
XIDs older than builder->xmin from the list, like we do for
committed.xip in SnapBuildPurgeCommittedTxn().
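
A minimal sketch of the second option, assuming the array is sorted so that purgeable entries come first. xid_precedes and purge_older_than are hypothetical helpers for illustration; the comparison is modulo 2^32, in the style of TransactionIdPrecedes for normal XIDs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;

/* Wraparound-aware "a precedes b" for normal XIDs (modulo-2^32 comparison). */
static int
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Drop the leading entries of xip[] that precede xmin and return the number
 * of surviving entries, which are moved to the front of the array.
 */
static size_t
purge_older_than(TransactionId *xip, size_t xcnt, TransactionId xmin)
{
	size_t		off = 0;
	size_t		surviving;

	assert(xip != NULL || xcnt == 0);

	/* Find the first entry that is still interesting. */
	while (off < xcnt && xid_precedes(xip[off], xmin))
		off++;

	surviving = xcnt - off;
	if (off > 0 && surviving > 0)
		memmove(xip, xip + off, surviving * sizeof(TransactionId));

	return surviving;
}
```

Because this runs once per RUNNING_XACTS record rather than at every commit or abort, the cost of shifting the array is paid infrequently.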

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#41Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#40)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Thu, Jul 7, 2022 at 8:21 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 6, 2022 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 6, 2022 at 12:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 5, 2022 at 8:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

2. Are we anytime removing transaction ids from catchanges->xip array?

No.

If not, is there a reason for the same? I think we can remove it
either at commit/abort or even immediately after adding the xid/subxid
to committed->xip array.

It might be a good idea but I'm concerned that removing XID from the
array at every commit/abort or after adding it to committed->xip array
might be costly as it requires adjustment of the array to keep its
order. Removing XIDs from the array would make bsearch faster but the
array is updated reasonably often (every 15 sec).

Fair point. However, I am slightly worried that we are unnecessarily
searching in this new array even when ReorderBufferTxn has the
required information. To avoid that, in function
SnapBuildXidHasCatalogChange(), we can first check
ReorderBufferXidHasCatalogChanges() and then check the array if the
first check doesn't return true. Also, by the way, do we need to
always keep builder->catchanges.xip updated via SnapBuildRestore()?
Isn't it sufficient that we just read and throw away contents from a
snapshot if builder->catchanges.xip is non-NULL?

IIUC catchanges.xip is restored only once when restoring a consistent
snapshot via SnapBuildRestore(). I think it's necessary to set
catchanges.xip for later use in SnapBuildXidHasCatalogChange(). Or did
you mean via SnapBuildSerialize()?

Sorry, I got confused about the way restore is used. You are right, it
will be done once. My main worry is that we shouldn't look at
builder->catchanges.xip array on an ongoing basis which I think can be
dealt with by one of the ideas you mentioned below. But, I think we
can still follow the other suggestion related to moving
ReorderBufferXidHasCatalogChanges() check prior to checking array.

I had additionally thought about whether we can further optimize this
solution to store this additional information only when we need to
serialize for a checkpoint record, but I think that won't work because
the walsender can restart even without a restart of the server, in
which case the same problem can occur.

Yes, probably we need to write catalog modifying transactions for
every serialized snapshot.

I am not sure if there is a way to further optimize this
solution; let me know if you have any ideas.

I suppose that writing additional information to serialized snapshots
would not be a noticeable overhead since we need 4 bytes per
transaction and we would not expect there is a huge number of
concurrent catalog modifying transactions. But both collecting catalog
modifying transactions (especially when there are many ongoing
transactions) and bsearch'ing on the XID list every time decoding the
COMMIT record could bring overhead.

A solution for the first point would be to keep track of catalog
modifying transactions by using a linked list so that we can avoid
checking all ongoing transactions.

This sounds reasonable to me.

Regarding the second point, on reflection, I think we need to look up
the XID list only until all XIDs in the list are committed or aborted.
We can remove XIDs from the list after adding them to committed.xip, as
you suggested. Or, when decoding a RUNNING_XACTS record, we can remove
XIDs older than builder->xmin from the list, like we do for
committed.xip in SnapBuildPurgeCommittedTxn().

I think doing it along with RUNNING_XACTS should be fine. At each
commit/abort, the cost could be high because we need to maintain the
sort order. In general, I feel either of these should be okay because
once the array becomes empty, it won't be used again and there won't
be any operation related to it during ongoing replication.

--
With Regards,
Amit Kapila.

#42Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#41)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Thu, Jul 7, 2022 at 3:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 7, 2022 at 8:21 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 6, 2022 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 6, 2022 at 12:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 5, 2022 at 8:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

2. Are we anytime removing transaction ids from catchanges->xip array?

No.

If not, is there a reason for the same? I think we can remove it
either at commit/abort or even immediately after adding the xid/subxid
to committed->xip array.

It might be a good idea but I'm concerned that removing XID from the
array at every commit/abort or after adding it to committed->xip array
might be costly as it requires adjustment of the array to keep its
order. Removing XIDs from the array would make bsearch faster but the
array is updated reasonably often (every 15 sec).

Fair point. However, I am slightly worried that we are unnecessarily
searching in this new array even when ReorderBufferTxn has the
required information. To avoid that, in function
SnapBuildXidHasCatalogChange(), we can first check
ReorderBufferXidHasCatalogChanges() and then check the array if the
first check doesn't return true. Also, by the way, do we need to
always keep builder->catchanges.xip updated via SnapBuildRestore()?
Isn't it sufficient that we just read and throw away contents from a
snapshot if builder->catchanges.xip is non-NULL?

IIUC catchanges.xip is restored only once when restoring a consistent
snapshot via SnapBuildRestore(). I think it's necessary to set
catchanges.xip for later use in SnapBuildXidHasCatalogChange(). Or did
you mean via SnapBuildSerialize()?

Sorry, I got confused about the way restore is used. You are right, it
will be done once. My main worry is that we shouldn't look at
builder->catchanges.xip array on an ongoing basis which I think can be
dealt with by one of the ideas you mentioned below. But, I think we
can still follow the other suggestion related to moving
ReorderBufferXidHasCatalogChanges() check prior to checking array.

Agreed. I've incorporated this change in the new version patch.

I had additionally thought about whether we can further optimize this
solution to store this additional information only when we need to
serialize for a checkpoint record, but I think that won't work because
the walsender can restart even without a restart of the server, in
which case the same problem can occur.

Yes, probably we need to write catalog modifying transactions for
every serialized snapshot.

I am not sure if there is a way to further optimize this
solution; let me know if you have any ideas.

I suppose that writing additional information to serialized snapshots
would not be a noticeable overhead since we need 4 bytes per
transaction and we would not expect there is a huge number of
concurrent catalog modifying transactions. But both collecting catalog
modifying transactions (especially when there are many ongoing
transactions) and bsearch'ing on the XID list every time decoding the
COMMIT record could bring overhead.

A solution for the first point would be to keep track of catalog
modifying transactions by using a linked list so that we can avoid
checking all ongoing transactions.

This sounds reasonable to me.

Regarding the second point, on reflection, I think we need to look up
the XID list only until all XIDs in the list are committed or aborted.
We can remove XIDs from the list after adding them to committed.xip, as
you suggested. Or, when decoding a RUNNING_XACTS record, we can remove
XIDs older than builder->xmin from the list, like we do for
committed.xip in SnapBuildPurgeCommittedTxn().

I think doing it along with RUNNING_XACTS should be fine. At each
commit/abort, the cost could be high because we need to maintain the
sort order. In general, I feel either of these should be okay because
once the array becomes empty, it won't be used again and there won't
be any operation related to it during ongoing replication.

I've attached the new version patch that incorporates the comments and
the optimizations discussed above.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

v2-0001-Add-catalog-modifying-transactions-to-logical-dec.patch (application/octet-stream)
From a3c31df27d9d7720669496ccf5e491df4468d338 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 6 Jul 2022 12:53:36 +0900
Subject: [PATCH v2] Add catalog modifying transactions to logical decoding
 serialized snapshot.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this change adds the list of transaction IDs and
sub-transaction IDs that have modified catalogs and are running at the
time of snapshot serialization to the serialized snapshot. When decoding
a COMMIT record, we check both the list and the ReorderBuffer to see if
the transaction has modified catalogs.

Since this adds additional information to the serialized snapshot, we
cannot backpatch it. For back branches, we take another approach:
remember the last-running-xacts list of the first decoded
RUNNING_XACTS record and check whether a transaction whose commit record
has XACT_XINFO_HAS_INVALS has its XID in the list. This doesn't
require any file format changes, but the transaction will end up being
added to the snapshot even if it has only relcache invalidations.

This commit bumps SNAPBUILD_VERSION because of change in SnapBuild.
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++
 .../specs/catalog_change_snapshot.spec        |  39 +++
 .../replication/logical/reorderbuffer.c       |  66 ++++-
 src/backend/replication/logical/snapbuild.c   | 228 ++++++++++++------
 src/include/replication/reorderbuffer.h       |  11 +
 6 files changed, 311 insertions(+), 79 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..bffd856bbb
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has modified
+# the catalog.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record, and the decoding
+# of the INSERT record must read pg_class with the correct historic snapshot.
+#
+# Note that if bgwriter wrote a RUNNING_XACTS record between "s0_commit" and
+# "s0_begin", this doesn't happen, as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS, but neither is practical in tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 88a37fde72..b081a94c24 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -366,6 +366,7 @@ ReorderBufferAllocate(void)
 
 	dlist_init(&buffer->toplevel_by_lsn);
 	dlist_init(&buffer->txns_by_base_snapshot_lsn);
+	dlist_init(&buffer->catchange_txns);
 
 	/*
 	 * Ensure there's no stale data from prior uses of this slot, in case some
@@ -1526,7 +1527,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
-	 * Remove TXN from its containing list.
+	 * Remove TXN from its containing lists.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
@@ -1535,6 +1536,9 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 */
 	dlist_delete(&txn->node);
 
+	if (rbtxn_has_catalog_changes(txn))
+		dlist_delete(&txn->catchange_node);
+
 	/* now remove reference from buffer */
 	hash_search(rb->by_txn,
 				(void *) &txn->xid,
@@ -3278,7 +3282,11 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	if (!rbtxn_has_catalog_changes(txn))
+	{
+		dlist_push_tail(&rb->catchange_txns, &txn->catchange_node);
+		txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	}
 
 	/*
 	 * Mark top-level transaction as having catalog changes too if one of its
@@ -3286,8 +3294,60 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	 * conveniently check just top-level transaction and decide whether to
 	 * build the hash table or not.
 	 */
-	if (txn->toptxn != NULL)
+	if (txn->toptxn != NULL && !rbtxn_has_catalog_changes(txn->toptxn))
+	{
+		dlist_push_tail(&rb->catchange_txns, &txn->toptxn->catchange_node);
 		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	}
+}
+
+/*
+ * Return palloc'ed array of the transactions that have changed catalogs.
+ * The returned array is sorted in xidComparator order.
+ *
+ * The caller must free the returned array when done with it.
+ */
+TransactionId *
+ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb, size_t *xcnt_p)
+{
+	dlist_iter iter;
+	TransactionId *xids = NULL;
+	size_t	xcnt = 0;
+	size_t	xcnt_space = 64; /* arbitrary number */
+
+	/* Quick return if the list is empty */
+	if (dlist_is_empty(&rb->catchange_txns))
+	{
+		*xcnt_p = 0;
+		return NULL;
+	}
+
+	dlist_foreach(iter, &rb->catchange_txns)
+	{
+		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN,
+												catchange_node,
+												iter.cur);
+
+		Assert(rbtxn_has_catalog_changes(txn));
+
+		/* Initialize XID array */
+		if (xcnt == 0)
+			xids = (TransactionId *) palloc(sizeof(TransactionId) * xcnt_space);
+
+		if (xcnt >= xcnt_space)
+		{
+			xcnt_space *= 2;
+			xids = repalloc(xids, sizeof(TransactionId) * xcnt_space);
+		}
+
+		xids[xcnt++] = txn->xid;
+	}
+
+	if (xcnt > 0)
+		qsort(xids, xcnt, sizeof(TransactionId), xidComparator);
+
+	*xcnt_p = xcnt;
+	return xids;
 }
 
 /*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 73c0f15214..2da45c7727 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -241,6 +241,30 @@ struct SnapBuild
 		 */
 		TransactionId *xip;
 	}			committed;
+
+	/*
+	 * Array of transactions and subtransactions that had modified catalogs
+	 * and were running when the snapshot was serialized.
+	 *
+	 * We normally rely on HEAP2_NEW_CID and XLOG_XACT_INVALIDATIONS records to
+	 * know if the transaction has changed the catalog. But it could happen that
+	 * the logical decoding decodes only the commit record of the transaction.
+	 * This array keeps track of the transactions that have modified catalogs
+	 * and were running when serializing a snapshot, and this array is used to
+	 * add such transactions to the snapshot.
+	 *
+	 * This array is set once when restoring the snapshot. Xids are removed
+	 * from the array when decoding xl_running_xacts records, so the array
+	 * eventually becomes empty.
+	 */
+	struct
+	{
+		/* number of transactions */
+		size_t		xcnt;
+
+		/* This array must be sorted in xidComparator order */
+		TransactionId *xip;
+	}			catchange;
 };
 
 /*
@@ -262,6 +286,8 @@ static void SnapBuildSnapIncRefcount(Snapshot snap);
 
 static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
+static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid);
+
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
 static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
@@ -269,6 +295,7 @@ static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutof
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
 static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path);
 
 /*
  * Allocate a new snapshot builder.
@@ -306,6 +333,9 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 		palloc0(builder->committed.xcnt_space * sizeof(TransactionId));
 	builder->committed.includes_all_transactions = true;
 
+	builder->catchange.xcnt = 0;
+	builder->catchange.xip = NULL;
+
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
@@ -888,9 +918,10 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed or containing catalog
+ * changes that are smaller than ->xmin. Those won't ever get checked via
+ * the ->committed array and ->catchange, respectively. The committed xids will
+ * get checked via the clog machinery.
  */
 static void
 SnapBuildPurgeCommittedTxn(SnapBuild *builder)
@@ -928,6 +959,39 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* purge xids in ->catchange.xip as well */
+	if (builder->catchange.xcnt > 0)
+	{
+		/*
+		 * Since catchange.xip is sorted, we find the lower bound of
+		 * xids that are still interesting.
+		 */
+		for (off = 0; off < builder->catchange.xcnt; off++)
+		{
+			/* use macros to check xids for speed */
+			if (TransactionIdEquals(builder->catchange.xip[off],
+									builder->xmin) ||
+				NormalTransactionIdFollows(builder->catchange.xip[off],
+										   builder->xmin))
+				break;
+		}
+
+		surviving_xids = builder->catchange.xcnt - off;
+		if (surviving_xids > 0)
+			memmove(builder->catchange.xip, &(builder->catchange.xip[off]),
+					surviving_xids * sizeof(TransactionId));
+		else
+		{
+			/* catchange list becomes empty */
+			pfree(builder->catchange.xip);
+			builder->catchange.xip = NULL;
+		}
+
+		elog(DEBUG3, "purged catalog modifying transactions from %u to %u",
+			 (uint32) builder->catchange.xcnt, (uint32) surviving_xids);
+		builder->catchange.xcnt = surviving_xids;
+	}
 }
 
 /*
@@ -983,7 +1047,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		 * Add subtransaction to base snapshot if catalog modifying, we don't
 		 * distinguish to toplevel transactions there.
 		 */
-		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (SnapBuildXidHasCatalogChanges(builder, subxid))
 		{
 			sub_needs_timetravel = true;
 			needs_snapshot = true;
@@ -1012,7 +1076,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 
 	/* if top-level modified catalog, it'll need a snapshot */
-	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (SnapBuildXidHasCatalogChanges(builder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes",
 			 xid);
@@ -1089,6 +1153,21 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 }
 
+/*
+ * Check both the reorder buffer and the snapshot to see if the given
+ * transaction has modified catalogs.
+ */
+static inline bool
+SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid)
+{
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return true;
+
+	/* Check the catchange XID array */
+	return ((builder->catchange.xcnt > 0) &&
+			(bsearch(&xid, builder->catchange.xip, builder->catchange.xcnt,
+					 sizeof(TransactionId), xidComparator) != NULL));
+}
 
 /* -----------------------------------
  * Snapshot building functions dealing with xlog records
@@ -1438,6 +1517,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
  *
  * struct SnapBuildOnDisk;
  * TransactionId * committed.xcnt; (*not xcnt_space*)
+ * TransactionId * catchange.xcnt;
  *
  */
 typedef struct SnapBuildOnDisk
@@ -1467,7 +1547,7 @@ typedef struct SnapBuildOnDisk
 	offsetof(SnapBuildOnDisk, version)
 
 #define SNAPBUILD_MAGIC 0x51A1E001
-#define SNAPBUILD_VERSION 4
+#define SNAPBUILD_VERSION 5
 
 /*
  * Store/Load a snapshot from disk, depending on the snapshot builder's state.
@@ -1493,6 +1573,8 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 {
 	Size		needed_length;
 	SnapBuildOnDisk *ondisk = NULL;
+	TransactionId	*catchange_xip = NULL;
+	size_t		catchange_xcnt = 0;
 	char	   *ondisk_c;
 	int			fd;
 	char		tmppath[MAXPGPATH];
@@ -1578,8 +1660,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 				(errcode_for_file_access(),
 				 errmsg("could not remove file \"%s\": %m", tmppath)));
 
+	/* Get the catalog modifying transactions that are not yet committed */
+	catchange_xip = ReorderBufferGetCatalogChangesXacts(builder->reorder,
+														&catchange_xcnt);
+
 	needed_length = sizeof(SnapBuildOnDisk) +
-		sizeof(TransactionId) * builder->committed.xcnt;
+		sizeof(TransactionId) * (builder->committed.xcnt + catchange_xcnt);
 
 	ondisk_c = MemoryContextAllocZero(builder->context, needed_length);
 	ondisk = (SnapBuildOnDisk *) ondisk_c;
@@ -1598,6 +1684,9 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	ondisk->builder.snapshot = NULL;
 	ondisk->builder.reorder = NULL;
 	ondisk->builder.committed.xip = NULL;
+	ondisk->builder.catchange.xip = NULL;
+	/* update catchange only on disk data */
+	ondisk->builder.catchange.xcnt = catchange_xcnt;
 
 	COMP_CRC32C(ondisk->checksum,
 				&ondisk->builder,
@@ -1609,6 +1698,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
 	ondisk_c += sz;
 
+	/* copy catalog modifying xacts */
+	sz = sizeof(TransactionId) * catchange_xcnt;
+	memcpy(ondisk_c, catchange_xip, sz);
+	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+	ondisk_c += sz;
+
 	FIN_CRC32C(ondisk->checksum);
 
 	/* we have valid data now, open tempfile and write it there */
@@ -1694,6 +1789,8 @@ out:
 	/* be tidy */
 	if (ondisk)
 		pfree(ondisk);
+	if (catchange_xip)
+		pfree(catchange_xip);
 }
 
 /*
@@ -1707,7 +1804,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	int			fd;
 	char		path[MAXPGPATH];
 	Size		sz;
-	int			readBytes;
 	pg_crc32c	checksum;
 
 	/* no point in loading a snapshot if we're already there */
@@ -1739,29 +1835,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
 
 	/* read statically sized portion of snapshot */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk, SnapBuildOnDiskConstantSize);
-	pgstat_report_wait_end();
-	if (readBytes != SnapBuildOnDiskConstantSize)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes,
-							(Size) SnapBuildOnDiskConstantSize)));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk, SnapBuildOnDiskConstantSize, path);
 
 	if (ondisk.magic != SNAPBUILD_MAGIC)
 		ereport(ERROR,
@@ -1781,57 +1855,21 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 				SnapBuildOnDiskConstantSize - SnapBuildOnDiskNotChecksummedSize);
 
 	/* read SnapBuild */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk.builder, sizeof(SnapBuild));
-	pgstat_report_wait_end();
-	if (readBytes != sizeof(SnapBuild))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sizeof(SnapBuild))));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk.builder, sizeof(SnapBuild), path);
 	COMP_CRC32C(checksum, &ondisk.builder, sizeof(SnapBuild));
 
 	/* restore committed xacts information */
 	sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
 	ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, ondisk.builder.committed.xip, sz);
-	pgstat_report_wait_end();
-	if (readBytes != sz)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sz)));
-	}
+	SnapBuildRestoreContents(fd, (char *) ondisk.builder.committed.xip, sz, path);
 	COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
 
+	/* restore catalog modifying xacts information */
+	sz = sizeof(TransactionId) * ondisk.builder.catchange.xcnt;
+	ondisk.builder.catchange.xip = MemoryContextAllocZero(builder->context, sz);
+	SnapBuildRestoreContents(fd, (char *) ondisk.builder.catchange.xip, sz, path);
+	COMP_CRC32C(checksum, ondisk.builder.catchange.xip, sz);
+
 	if (CloseTransientFile(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -1885,6 +1923,14 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	ondisk.builder.committed.xip = NULL;
 
+	/* set catalog modifying transactions */
+	if (builder->catchange.xip)
+		pfree(builder->catchange.xip);
+	builder->catchange.xcnt = ondisk.builder.catchange.xcnt;
+	builder->catchange.xip = ondisk.builder.catchange.xip;
+
+	ondisk.builder.catchange.xip = NULL;
+
 	/* our snapshot is not interesting anymore, build a new one */
 	if (builder->snapshot != NULL)
 	{
@@ -1909,6 +1955,38 @@ snapshot_not_interesting:
 	return false;
 }
 
+/*
+ * Read the contents of the serialized snapshot to the dest.
+ */
+static void
+SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
+{
+	int			readBytes;
+
+	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+	readBytes = read(fd, dest, size);
+	pgstat_report_wait_end();
+	if (readBytes != size)
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+
+		if (readBytes < 0)
+		{
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("could not read file \"%s\": read %d of %zu",
+							path, readBytes, size)));
+	}
+}
+
 /*
  * Remove all serialized snapshots that are not required anymore because no
  * slot can need them. This doesn't actually have to run during a checkpoint,
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d109d0baed..3c8ee6841e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -380,6 +380,11 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	node;
 
+	/*
+	 * A node in the list of catalog modifying transactions
+	 */
+	dlist_node	catchange_node;
+
 	/*
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
@@ -526,6 +531,11 @@ struct ReorderBuffer
 	 */
 	dlist_head	txns_by_base_snapshot_lsn;
 
+	/*
+	 * Transactions and subtransactions that have modified system catalogs.
+	 */
+	dlist_head	catchange_txns;
+
 	/*
 	 * one-entry sized cache for by_txn. Very frequently the same txn gets
 	 * looked up over and over again.
@@ -677,6 +687,7 @@ extern void ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
 extern void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 extern ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 extern TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
+extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb, size_t *xcnt_p);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
-- 
2.24.3 (Apple Git-128)
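
The catchange lookup in the patch above depends on keeping the XID array sorted so that bsearch() can be used. A minimal standalone sketch of that pattern follows. The names xid_t, xid_cmp, xid_array_sort and xid_array_contains are hypothetical stand-ins for TransactionId, xidComparator and the PostgreSQL code; this is an illustration under those assumptions, not the patch's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for PostgreSQL's 32-bit TransactionId. */
typedef uint32_t xid_t;

/* Plain numeric ordering, shared by qsort() and bsearch(). */
static int
xid_cmp(const void *a, const void *b)
{
    xid_t x = *(const xid_t *) a;
    xid_t y = *(const xid_t *) b;

    if (x > y)
        return 1;
    if (x < y)
        return -1;
    return 0;
}

/* Sort the array once so that later lookups can use binary search. */
void
xid_array_sort(xid_t *xip, size_t xcnt)
{
    if (xcnt > 1)
        qsort(xip, xcnt, sizeof(xid_t), xid_cmp);
}

/* Return true if xid is present in the sorted array xip[0..xcnt-1]. */
bool
xid_array_contains(const xid_t *xip, size_t xcnt, xid_t xid)
{
    /* A non-empty array must have storage behind it. */
    assert(xcnt == 0 || xip != NULL);

    /* Guard the empty case so bsearch() is never handed a NULL base. */
    if (xcnt == 0)
        return false;

    return bsearch(&xid, xip, xcnt, sizeof(xid_t), xid_cmp) != NULL;
}
```

The explicit xcnt check mirrors the `builder->catchange.xcnt > 0` test in SnapBuildXidHasCatalogChanges(), where the array pointer is NULL when no catalog modifying transactions are tracked.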

#43Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#42)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Fri, Jul 8, 2022 at 6:45 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Jul 7, 2022 at 3:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 7, 2022 at 8:21 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached the new version patch that incorporates the comments and
the optimizations discussed above.

Thanks, few minor comments:
1.
In ReorderBufferGetCatalogChangesXacts(), isn't it better to use the
list length of 'catchange_txns' to allocate xids array? If we can do
so, then we will save the need to repalloc as well.

2.
/* ->committed manipulation */
static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);

The above comment also needs to be changed.

3. As SnapBuildPurgeCommittedTxn() removes xacts both from committed
and catchange arrays, the function name is no longer appropriate.
We can either rename to something like SnapBuildPurgeOlderTxn() or
move the catchange logic to a different function and call it from
SnapBuildProcessRunningXacts.

4.
+ if (TransactionIdEquals(builder->catchange.xip[off],
+ builder->xmin) ||
+ NormalTransactionIdFollows(builder->catchange.xip[off],
+    builder->xmin))

Can we use TransactionIdFollowsOrEquals() instead of above?

5. Comment change suggestion:
/*
  * Remove knowledge about transactions we treat as committed or
containing catalog
  * changes that are smaller than ->xmin. Those won't ever get checked via
- * the ->committed array and ->catchange, respectively. The committed xids will
- * get checked via the clog machinery.
+ * the ->committed or ->catchange array, respectively. The committed xids will
+ * get checked via the clog machinery. We can ideally remove the transaction
+ * from catchange array once it is finished (committed/aborted) but that could
+ * be costly as we need to maintain the xids order in the array.
  */

Apart from the above, I think there are pending comments for the
back-branch patch and some performance testing of this work.
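
To illustrate point 4, here is a small standalone sketch of why the two-clause test and a single "follows or equals" test agree for normal XIDs. It assumes the usual modulo-2^32 comparison for normal transaction IDs; xid_t, FIRST_NORMAL_XID and both function names are hypothetical stand-ins, not PostgreSQL's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for PostgreSQL's 32-bit TransactionId. */
typedef uint32_t xid_t;

/* Assumption: XIDs below this value are special (invalid, bootstrap, frozen). */
#define FIRST_NORMAL_XID ((xid_t) 3)

bool
xid_is_normal(xid_t xid)
{
    return xid >= FIRST_NORMAL_XID;
}

/* The two-clause form from the patch: equal, or strictly newer. */
bool
xid_equals_or_normal_follows(xid_t a, xid_t b)
{
    /* The "normal" comparison is only defined for normal XIDs. */
    assert(xid_is_normal(a) && xid_is_normal(b));

    /* Signed difference modulo 2^32 (assumes two's complement conversion). */
    return a == b || (int32_t) (a - b) > 0;
}

/* The single-call form suggested in the review. */
bool
xid_follows_or_equals(xid_t a, xid_t b)
{
    /* Special XIDs compare numerically; they sort before all normal XIDs. */
    if (!xid_is_normal(a) || !xid_is_normal(b))
        return a >= b;

    return (int32_t) (a - b) >= 0;
}
```

For normal XIDs the two functions return the same result, including across wraparound: XID 5 counts as newer than XID 4294967290 because the signed difference is positive.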

--
With Regards,
Amit Kapila.

#44Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#43)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Fri, Jul 8, 2022 at 3:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 8, 2022 at 6:45 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Jul 7, 2022 at 3:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 7, 2022 at 8:21 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached the new version patch that incorporates the comments and
the optimizations discussed above.

Thanks, few minor comments:

Thank you for the comments.

1.
In ReorderBufferGetCatalogChangesXacts(), isn't it better to use the
list length of 'catchange_txns' to allocate xids array? If we can do
so, then we will save the need to repalloc as well.

Since ReorderBufferGetCatalogChangesXacts() collects all ongoing
catalog modifying transactions, the length of the array could be
bigger than the one taken last time. We can start with the previous
length but I think we cannot remove the need for repalloc.

2.
/* ->committed manipulation */
static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);

The above comment also needs to be changed.

3. As SnapBuildPurgeCommittedTxn() removes xacts both from committed
and catchange arrays, the function name is no longer appropriate.
We can either rename to something like SnapBuildPurgeOlderTxn() or
move the catchange logic to a different function and call it from
SnapBuildProcessRunningXacts.

4.
+ if (TransactionIdEquals(builder->catchange.xip[off],
+ builder->xmin) ||
+ NormalTransactionIdFollows(builder->catchange.xip[off],
+    builder->xmin))

Can we use TransactionIdFollowsOrEquals() instead of above?

5. Comment change suggestion:
/*
* Remove knowledge about transactions we treat as committed or
containing catalog
* changes that are smaller than ->xmin. Those won't ever get checked via
- * the ->committed array and ->catchange, respectively. The committed xids will
- * get checked via the clog machinery.
+ * the ->committed or ->catchange array, respectively. The committed xids will
+ * get checked via the clog machinery. We can ideally remove the transaction
+ * from catchange array once it is finished (committed/aborted) but that could
+ * be costly as we need to maintain the xids order in the array.
*/

Agreed with the above comments.

Apart from the above, I think there are pending comments for the
back-branch patch and some performance testing of this work.

Right. I'll incorporate all comments I got so far into these patches
and submit them. Also, will do some benchmark tests.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#45Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#44)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Fri, Jul 8, 2022 at 12:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 8, 2022 at 3:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

1.
In ReorderBufferGetCatalogChangesXacts(), isn't it better to use the
list length of 'catchange_txns' to allocate xids array? If we can do
so, then we will save the need to repalloc as well.

Since ReorderBufferGetCatalogChangesXacts() collects all ongoing
catalog modifying transactions, the length of the array could be
bigger than the one taken last time. We can start with the previous
length but I think we cannot remove the need for repalloc.

It is using the list "catchange_txns" to form xid array which
shouldn't change for the duration of
ReorderBufferGetCatalogChangesXacts(). Then the caller frees the xid
array after its use. Next time in
ReorderBufferGetCatalogChangesXacts(), the fresh allocation for xid
array happens, so not sure why repalloc would be required?

--
With Regards,
Amit Kapila.

#46Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#45)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Fri, Jul 8, 2022 at 5:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 8, 2022 at 12:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 8, 2022 at 3:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

1.
In ReorderBufferGetCatalogChangesXacts(), isn't it better to use the
list length of 'catchange_txns' to allocate xids array? If we can do
so, then we will save the need to repalloc as well.

Since ReorderBufferGetCatalogChangesXacts() collects all ongoing
catalog modifying transactions, the length of the array could be
bigger than the one taken last time. We can start with the previous
length but I think we cannot remove the need for repalloc.

It is using the list "catchange_txns" to form xid array which
shouldn't change for the duration of
ReorderBufferGetCatalogChangesXacts(). Then the caller frees the xid
array after its use. Next time in
ReorderBufferGetCatalogChangesXacts(), the fresh allocation for xid
array happens, so not sure why repalloc would be required?

Oops, I mistook catchange_txns for catchange->xcnt. You're right.
Starting with the length of catchange_txns should be sufficient.
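
A rough standalone sketch of the agreed approach: size the XID array once from the list length, fill it, and sort it, so no repalloc is needed. It assumes the list keeps an element count; txn_node, txn_list and both function names are hypothetical, and malloc() stands in for palloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for PostgreSQL's 32-bit TransactionId. */
typedef uint32_t xid_t;

/* Hypothetical list of catalog modifying transactions. */
typedef struct txn_node
{
    xid_t            xid;
    struct txn_node *next;
} txn_node;

typedef struct txn_list
{
    txn_node *head;
    size_t    count;        /* kept up to date on every insert */
} txn_list;

static int
xid_cmp(const void *a, const void *b)
{
    xid_t x = *(const xid_t *) a;
    xid_t y = *(const xid_t *) b;

    return (x > y) - (x < y);
}

/* Link a caller-owned node at the front of the list. */
void
txn_list_push(txn_list *list, txn_node *node, xid_t xid)
{
    node->xid = xid;
    node->next = list->head;
    list->head = node;
    list->count++;
}

/*
 * Return a freshly allocated, sorted array of the XIDs in the list and
 * store its length in *xcnt_p. Returns NULL (and length 0) for an empty
 * list or on allocation failure. The caller frees the array.
 */
xid_t *
txn_list_get_sorted_xids(const txn_list *list, size_t *xcnt_p)
{
    xid_t  *xids;
    size_t  n = 0;

    *xcnt_p = 0;
    if (list->count == 0)
        return NULL;

    /* The list cannot change during this call, so one allocation suffices. */
    xids = malloc(sizeof(xid_t) * list->count);
    if (xids == NULL)
        return NULL;

    for (const txn_node *node = list->head; node != NULL; node = node->next)
    {
        assert(n < list->count);
        xids[n++] = node->xid;
    }
    assert(n == list->count);

    qsort(xids, n, sizeof(xid_t), xid_cmp);
    *xcnt_p = n;
    return xids;
}
```

Because each call allocates a fresh array that the caller frees after use, the previous call's length never matters.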

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#47Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#37)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Jul 6, 2022 at 3:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 6, 2022 at 7:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I'll post a new version of the patch in the next email, along with replies to the other comments.

Okay, thanks for working on this. Few comments/suggestions on
poc_remember_last_running_xacts_v2 patch:

1.
+ReorderBufferSetLastRunningXactsCatalogChanges(ReorderBuffer *rb,
TransactionId xid,
+    uint32 xinfo, int subxcnt,
+    TransactionId *subxacts, XLogRecPtr lsn)
+{
...
...
+
+ test = bsearch(&xid, rb->last_running_xacts, rb->n_last_running_xacts,
+    sizeof(TransactionId), xidComparator);
+
+ if (test == NULL)
+ {
+ for (int i = 0; i < subxcnt; i++)
+ {
+ test = bsearch(&subxacts[i], rb->last_running_xacts, rb->n_last_running_xacts,
+    sizeof(TransactionId), xidComparator);
...

Is there ever a possibility that the top transaction id is not in the
running_xacts list but one of its subxids is present? If yes, it is
not very obvious at least to me so adding a comment here could be
useful. If not, then why do we need this additional check for each of
the sub-transaction ids?

I think there is no possibility. The check for subtransactions is not necessary.

2.
@@ -627,6 +647,15 @@ DecodeCommit(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf,
commit_time = parsed->origin_timestamp;
}

+ /*
+ * Set the last running xacts as containing catalog change if necessary.
+ * This must be done before SnapBuildCommitTxn() so that we include catalog
+ * change transactions to the historic snapshot.
+ */
+ ReorderBufferSetLastRunningXactsCatalogChanges(ctx->reorder, xid,
parsed->xinfo,
+    parsed->nsubxacts, parsed->subxacts,
+    buf->origptr);
+
SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
parsed->nsubxacts, parsed->subxacts);

As mentioned previously as well, marking it before SnapBuildCommitTxn
has one disadvantage, we sometimes do this work even if the snapshot
state is SNAPBUILD_START or SNAPBUILD_BUILDING_SNAPSHOT in which case
SnapBuildCommitTxn wouldn't do anything. Can we instead check whether
the particular txn has invalidations and is present in the
last_running_xacts list along with the check
ReorderBufferXidHasCatalogChanges? I think that has the additional
advantage that we don't need this additional marking if the xact is
already marked as containing catalog changes.

Agreed.

3.
+ /*
+ * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+ * if the transaction has changed the catalog, and that information
+ * is not serialized to SnapBuilder.  Therefore, if the logical
+ * decoding decodes the commit record of the transaction that actually
+ * has done catalog changes without these records, we miss to add
+ * the xid to the snapshot so up creating the wrong snapshot.

The part of the sentence "... snapshot so up creating the wrong
snapshot." is not clear. In this comment, at one place you have used
two spaces after a full stop, and at another place, there is one
space. I think let's follow nearby code practice to use a single space
before a new sentence.

Agreed.

4.
+void
+ReorderBufferProcessLastRunningXacts(ReorderBuffer *rb,
xl_running_xacts *running)
+{
+ /* Quick exit if there is no longer last running xacts */
+ if (likely(rb->n_last_running_xacts == 0))
+ return;
+
+ /* First call, build the last running xact list */
+ if (rb->n_last_running_xacts == -1)
+ {
+ int nxacts = running->subxcnt + running->xcnt;
+ Size sz = sizeof(TransactionId) * nxacts;;
+
+ rb->last_running_xacts = MemoryContextAlloc(rb->context, sz);
+ memcpy(rb->last_running_xacts, running->xids, sz);
+ qsort(rb->last_running_xacts, nxacts, sizeof(TransactionId), xidComparator);
+
+ rb->n_last_running_xacts = nxacts;
+
+ return;
+ }

a. Can we add the function header comments for this function?

Updated.

b. We seem to be tracking the running_xact information for the first
running_xact record after start/restart. The name last_running_xacts
doesn't sound appropriate for that, how about initial_running_xacts?

Sounds good, updated.

5.
+ /*
+ * Purge xids in the last running xacts list if we can do that for at least
+ * one xid.
+ */
+ if (NormalTransactionIdPrecedes(rb->last_running_xacts[0],
+ running->oldestRunningXid))

I think it would be a good idea to add a few lines here explaining why
it is safe to purge. IIUC, it is because the commit for those xacts
would have already been processed and we don't need such a xid
anymore.

Right, updated.

6. As per the discussion above in this thread having
XACT_XINFO_HAS_INVALS in the commit record doesn't indicate that the
xact has catalog changes, so can we add somewhere in comments that for
such a case we can't distinguish whether the txn has catalog change
but we still mark the txn has catalog changes?

Agreed.

Can you please share one example for this case?

I think it depends on what we did in the transaction but one example I
have is that a commit record for ALTER DATABASE has only a snapshot
invalidation message:

=# alter database postgres set log_statement to 'all';
ALTER DATABASE

$ pg_waldump $PGDATA/pg_wal/000000010000000000000001 | tail -1
rmgr: Transaction len (rec/tot): 66/ 66, tx: 821, lsn: 0/019B50A8, prev 0/019B5070, desc: COMMIT 2022-07-11 21:38:44.036513 JST; inval msgs: snapshot 2964

I've attached an updated patch, please review it.
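
For readers skimming the thread, the back-branch approach can be sketched in a few standalone functions: keep the XIDs from the first running-xacts record in a sorted array, drop entries once they are older than the oldest running XID, and at commit time treat a transaction as catalog modifying only if its commit record carries invalidations and its XID is still in the array. The sketch assumes modulo-2^32 XID comparison for purging; initial_xacts, initial_xacts_purge and commit_needs_catalog_mark are hypothetical names, not the functions in the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for PostgreSQL's 32-bit TransactionId. */
typedef uint32_t xid_t;

/* XIDs that were running when decoding started, sorted numerically. */
typedef struct initial_xacts
{
    xid_t  *xids;
    size_t  nxids;
} initial_xacts;

static int
xid_cmp(const void *a, const void *b)
{
    xid_t x = *(const xid_t *) a;
    xid_t y = *(const xid_t *) b;

    return (x > y) - (x < y);
}

/* Wraparound-aware "a is older than b" for normal XIDs. */
static bool
xid_precedes(xid_t a, xid_t b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Drop every XID older than oldest_running: such transactions have
 * finished, so their commit records were already processed. Compacts in
 * place, which preserves the sort order. Returns the surviving count.
 */
size_t
initial_xacts_purge(initial_xacts *ix, xid_t oldest_running)
{
    size_t kept = 0;

    assert(ix != NULL);
    for (size_t i = 0; i < ix->nxids; i++)
    {
        if (!xid_precedes(ix->xids[i], oldest_running))
            ix->xids[kept++] = ix->xids[i];
    }
    ix->nxids = kept;
    return kept;
}

/*
 * Decide at commit time whether to mark the transaction as containing
 * catalog changes. This can give false positives, because invalidation
 * messages do not prove that the catalog was modified.
 */
bool
commit_needs_catalog_mark(const initial_xacts *ix, xid_t xid, bool has_invals)
{
    if (!has_invals || ix->nxids == 0)
        return false;

    return bsearch(&xid, ix->xids, ix->nxids, sizeof(xid_t), xid_cmp) != NULL;
}
```

Once the array is empty, the commit-time check becomes a cheap early return, which is the common case after decoding has run for a while.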

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

REL14-v1-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch (application/octet-stream)
From e4526ad81d1665ea82a0c7667c25d2ae1ace45b2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 11 Jul 2022 21:49:06 +0900
Subject: [PATCH v1] Fix catalog lookup with the wrong snapshot during logical
 decoding.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this changes the reorder buffer so that it
remembers the initial running transactions recorded in the first
xl_running_xacts record that we decoded, and marks a transaction as
containing catalog changes if it is in that list and its commit record
has XACT_XINFO_HAS_INVALS.

This can produce false positives: we could end up adding a transaction
that didn't change the catalog to the snapshot, since we cannot tell
whether a transaction has catalog changes by checking only the COMMIT
record. The record doesn't say which (sub)transaction has catalog
changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that
the transaction changed the catalog. This is not a problem, however,
since we use the historic snapshot only for reading system catalogs.

On the master branch, we took a more future-proof approach of writing
catalog modifying transactions to the serialized snapshot. But we
cannot backpatch that because it changes SnapBuild.
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++++
 src/backend/replication/logical/decode.c      |  17 +++
 .../replication/logical/reorderbuffer.c       | 104 ++++++++++++++++++
 src/include/replication/reorderbuffer.h       |  36 ++++++
 6 files changed, 241 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b879..4553252d75 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot
+	twophase_snapshot catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..bffd856bbb
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has modified the
+# catalog.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 92dfafc632..8929ef5cc3 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -407,6 +407,9 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			{
 				xl_running_xacts *running = (xl_running_xacts *) XLogRecGetData(r);
 
+				/* Process the initial running transactions, if any */
+				ReorderBufferProcessInitialXacts(ctx->reorder, running);
+
 				SnapBuildProcessRunningXacts(builder, buf->origptr, running);
 
 				/*
@@ -691,6 +694,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. We check if it's in the list of the initial running transactions
+	 * and mark it as containing catalog change if necessary.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * catalog change transactions to the historic snapshot.
+	 */
+	if ((parsed->xinfo & XACT_XINFO_HAS_INVALS) != 0)
+		ReorderBufferInitialXactsSetCatalogChanges(ctx->reorder, xid,
+												   parsed->nsubxacts,
+												   parsed->subxacts,
+												   buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e59d1396b5..8ef1d310e4 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -346,6 +346,9 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	buffer->initial_running_xacts = NULL;
+	buffer->n_initial_running_xacts = 0;
+
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
 	buffer->spillBytes = 0;
@@ -5154,3 +5157,104 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Process an xl_running_xacts record: on the first call, remember the running
+ * transactions; on later calls, remove those that aren't needed anymore.
+ */
+void
+ReorderBufferProcessInitialXacts(ReorderBuffer *rb, xl_running_xacts *running)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild  *builder = ctx->snapshot_builder;
+	TransactionId *workspace;
+	int			surviving_xids = 0;
+
+	/* Build the initial running transactions list for the first call */
+	if (unlikely(SnapBuildCurrentState(builder) == SNAPBUILD_START))
+	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		Assert(rb->n_initial_running_xacts == 0);
+
+		rb->n_initial_running_xacts = nxacts;
+		rb->initial_running_xacts = MemoryContextAlloc(rb->context, sz);
+		memcpy(rb->initial_running_xacts, running->xids, sz);
+		qsort(rb->initial_running_xacts, nxacts, sizeof(TransactionId),
+			  xidComparator);
+
+		return;
+	}
+
+	/* Quick exit if there are no initial running transactions */
+	if (likely(rb->n_initial_running_xacts == 0))
+		return;
+
+	/* bound check: return unless at least one transaction can be removed */
+	if (!NormalTransactionIdPrecedes(rb->initial_running_xacts[0],
+									 running->oldestRunningXid))
+		return;
+
+	/*
+	 * Remove transactions that would have been processed and we don't need to
+	 * keep track of anymore.
+	 */
+	workspace = MemoryContextAlloc(rb->context, sizeof(TransactionId) * rb->n_initial_running_xacts);
+	for (int i = 0; i < rb->n_initial_running_xacts; i++)
+	{
+		if (NormalTransactionIdPrecedes(rb->initial_running_xacts[i],
+										running->oldestRunningXid))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = rb->initial_running_xacts[i];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(rb->initial_running_xacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(rb->initial_running_xacts);
+		rb->initial_running_xacts = NULL;
+	}
+
+	rb->n_initial_running_xacts = surviving_xids;
+	pfree(workspace);
+}
+
+/*
+ * If the given xid is in the list of the initial running xacts, mark both it
+ * and its subtransactions as containing catalog changes, if not already marked.
+ */
+void
+ReorderBufferInitialXactsSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+										   int subxcnt, TransactionId *subxacts,
+										   XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (likely(rb->n_initial_running_xacts == 0 ||
+			   ReorderBufferXidHasCatalogChanges(rb, xid)))
+		return;
+
+	/*
+	 * If this committed transaction was running when the first RUNNING_XACTS
+	 * record was decoded and has made catalog changes, we can mark both the
+	 * top-level transaction and its subtransactions as containing catalog
+	 * changes.
+	 */
+	if (bsearch(&xid, rb->initial_running_xacts, rb->n_initial_running_xacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(rb, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(rb, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(rb, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ba257d81b5..fe0f52d4e1 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -12,6 +12,7 @@
 #include "access/htup_details.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -589,6 +590,35 @@ struct ReorderBuffer
 	/* memory accounting */
 	Size		size;
 
+	/*
+	 * Array of transactions and subtransactions that were running when
+	 * the xl_running_xacts record that we decoded first was written.
+	 * The array is sorted in xidComparator order. Xids are removed from
+	 * the array when decoding xl_running_xacts records, so the array
+	 * eventually becomes empty.
+	 *
+	 * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+	 * if the transaction has changed the catalog, and that information
+	 * is not serialized to SnapBuilder. Therefore, if the logical
+	 * decoding decodes the commit record of a transaction that actually
+	 * has done catalog changes without these records, we fail to add
+	 * the xid to the snapshot, and end up looking at catalogs with the
+	 * wrong snapshot. To avoid this problem, if the COMMIT record of
+	 * an xid listed in initial_running_xacts has the XACT_XINFO_HAS_INVALS
+	 * flag, we mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 *
+	 * We could end up adding a transaction that didn't change the catalog
+	 * to the snapshot since we cannot distinguish whether the transaction
+	 * has catalog changes only by checking the COMMIT record. It doesn't
+	 * have the information on which (sub) transaction has catalog changes,
+	 * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+	 * transaction has catalog changes. But that is not a problem since
+	 * we use the historic snapshot only for reading system catalogs.
+	 */
+	TransactionId *initial_running_xacts;
+	int n_initial_running_xacts;
+
 	/*
 	 * Statistics about transactions spilled to disk.
 	 *
@@ -678,4 +708,10 @@ void		ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
 void		StartupReorderBuffer(void);
 
+void		ReorderBufferProcessInitialXacts(ReorderBuffer *rb,
+											 xl_running_xacts *running);
+void		ReorderBufferInitialXactsSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+													   int subxcnt,
+													   TransactionId *subxacts,
+													   XLogRecPtr lsn);
 #endif
-- 
2.24.3 (Apple Git-128)
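
For readers following the back-branch approach in the patch above, the pruning step can be sketched outside PostgreSQL roughly as follows. This is only an illustration under stated assumptions, not the patch's code: `xid_t`, `xid_precedes`, and `prune_initial_xacts` are hypothetical names, `malloc` stands in for `MemoryContextAlloc`, and the comparison merely imitates the modulo-2^32 ordering used for normal XIDs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t xid_t;			/* hypothetical stand-in for TransactionId */

/* Modulo-2^32 "a is older than b" test (assumed to mirror normal-XID ordering). */
static int
xid_precedes(xid_t a, xid_t b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Drop every xid older than oldest_running from xids[0..n-1], preserving
 * the order of the survivors.  Returns the number of surviving entries.
 */
static size_t
prune_initial_xacts(xid_t *xids, size_t n, xid_t oldest_running)
{
	xid_t	   *workspace;
	size_t		surviving = 0;

	if (n == 0)
		return 0;

	/* The workspace holds xids, so its size is count times element size. */
	workspace = malloc(n * sizeof(xid_t));
	assert(workspace != NULL);

	for (size_t i = 0; i < n; i++)
	{
		if (!xid_precedes(xids[i], oldest_running))
			workspace[surviving++] = xids[i];
	}

	memcpy(xids, workspace, surviving * sizeof(xid_t));
	free(workspace);
	return surviving;
}
```

A caller would invoke `prune_initial_xacts()` once per decoded running-xacts record, passing that record's oldest running xid, and stop tracking entirely once it returns 0.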

#48Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#46)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Fri, Jul 8, 2022 at 8:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 8, 2022 at 5:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 8, 2022 at 12:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 8, 2022 at 3:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

1.
In ReorderBufferGetCatalogChangesXacts(), isn't it better to use the
list length of 'catchange_txns' to allocate xids array? If we can do
so, then we will save the need to repalloc as well.

Since ReorderBufferGetCatalogChangesXacts() collects all ongoing
catalog modifying transactions, the length of the array could be
bigger than the one taken last time. We can start with the previous
length but I think we cannot remove the need for repalloc.

It is using the list "catchange_txns" to form xid array which
shouldn't change for the duration of
ReorderBufferGetCatalogChangesXacts(). Then the caller frees the xid
array after its use. Next time in
ReorderBufferGetCatalogChangesXacts(), the fresh allocation for xid
array happens, so not sure why repalloc would be required?

Oops, I mistook catchange_txns for catchange->xcnt. You're right.
Starting with the length of catchange_txns should be sufficient.

I've attached an updated patch.

While trying this idea, I noticed there is no API to get the length of
a dlist, as we discussed off-list. An alternative idea was to use List
(T_XidList), but I'm not sure it's a great idea: deleting an xid
from the list is O(N), we would need to implement list_delete_xid, and
we would need to make sure list nodes are allocated in the reorder
buffer context. So in the patch, I added a variable, catchange_ntxns,
to keep track of the length of the dlist. Please review it.
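
The idea of keeping a counter next to the list can be sketched in standalone C as below. This is a simplified illustration, not the patch itself: `txn_node`, `txn_list`, and the `txn_list_*` functions are hypothetical names that only imitate an intrusive doubly-linked list, and `malloc` stands in for `palloc`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t xid_t;			/* hypothetical stand-in for TransactionId */

typedef struct txn_node
{
	struct txn_node *prev;
	struct txn_node *next;
	xid_t		xid;
} txn_node;

typedef struct txn_list
{
	txn_node	head;			/* circular sentinel node */
	size_t		count;			/* maintained on every push and delete */
} txn_list;

static void
txn_list_init(txn_list *l)
{
	l->head.prev = l->head.next = &l->head;
	l->count = 0;
}

static void
txn_list_push_tail(txn_list *l, txn_node *n)
{
	n->prev = l->head.prev;
	n->next = &l->head;
	l->head.prev->next = n;
	l->head.prev = n;
	l->count++;
}

static void
txn_list_delete(txn_list *l, txn_node *n)
{
	assert(l->count > 0);
	n->prev->next = n->next;
	n->next->prev = n->prev;
	l->count--;
}

static int
cmp_xid(const void *a, const void *b)
{
	xid_t		x = *(const xid_t *) a;
	xid_t		y = *(const xid_t *) b;

	return (x > y) - (x < y);
}

/* Return a malloc'ed, sorted array of all xids; the caller frees it. */
static xid_t *
txn_list_sorted_xids(const txn_list *l, size_t *count_p)
{
	xid_t	   *xids;
	size_t		i = 0;

	*count_p = l->count;
	if (l->count == 0)
		return NULL;

	/* The counter gives the exact size up front, so no regrowth is needed. */
	xids = malloc(l->count * sizeof(xid_t));
	assert(xids != NULL);

	for (const txn_node *n = l->head.next; n != &l->head; n = n->next)
		xids[i++] = n->xid;

	assert(i == l->count);
	qsort(xids, i, sizeof(xid_t), cmp_xid);
	return xids;
}
```

Because the counter is updated in the same places where nodes are linked and unlinked, deletion stays O(1) and the array can be allocated once with the right size.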

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

master-v3-0001-Add-catalog-modifying-transactions-to-logical-dec.patch (application/octet-stream)
From 052abb0d79914282712d826a822c545d3df5330c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 6 Jul 2022 12:53:36 +0900
Subject: [PATCH v3] Add catalog modifying transactions to logical decoding
 serialized snapshot.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this change adds the list of transaction IDs and
sub-transaction IDs that have modified catalogs and were running when
the snapshot was serialized to the serialized snapshot. When decoding a
COMMIT record, we check both the list and the ReorderBuffer to see if
the transaction has modified catalogs.

Since this adds additional information to the serialized snapshot, we
cannot backpatch it. For back branches, we take another approach:
remember the last-running-xacts list of the first decoded
RUNNING_XACTS record, and check whether the transaction's commit record
has XACT_XINFO_HAS_INVALS and its XID is in the list. This doesn't
require any file format changes, but the transaction will end up being
added to the snapshot even if it has only relcache invalidations.

This commit bumps SNAPBUILD_VERSION because of the change in SnapBuild.
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++
 .../specs/catalog_change_snapshot.spec        |  39 +++
 .../replication/logical/reorderbuffer.c       |  72 +++++-
 src/backend/replication/logical/snapbuild.c   | 235 ++++++++++++------
 src/include/replication/reorderbuffer.h       |  12 +
 6 files changed, 320 insertions(+), 84 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..bffd856bbb
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has modified
+# catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 88a37fde72..b71103c60e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -349,6 +349,8 @@ ReorderBufferAllocate(void)
 	buffer->by_txn_last_xid = InvalidTransactionId;
 	buffer->by_txn_last_txn = NULL;
 
+	buffer->catchange_ntxns = 0;
+
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
 	buffer->size = 0;
@@ -366,6 +368,7 @@ ReorderBufferAllocate(void)
 
 	dlist_init(&buffer->toplevel_by_lsn);
 	dlist_init(&buffer->txns_by_base_snapshot_lsn);
+	dlist_init(&buffer->catchange_txns);
 
 	/*
 	 * Ensure there's no stale data from prior uses of this slot, in case some
@@ -1526,7 +1529,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
-	 * Remove TXN from its containing list.
+	 * Remove TXN from its containing lists.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
@@ -1535,6 +1538,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 */
 	dlist_delete(&txn->node);
 
+	if (rbtxn_has_catalog_changes(txn))
+	{
+		dlist_delete(&txn->catchange_node);
+		rb->catchange_ntxns--;
+
+		Assert(rb->catchange_ntxns >= 0);
+	}
+
 	/* now remove reference from buffer */
 	hash_search(rb->by_txn,
 				(void *) &txn->xid,
@@ -3275,10 +3286,16 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 								  XLogRecPtr lsn)
 {
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	if (!rbtxn_has_catalog_changes(txn))
+	{
+		txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &txn->catchange_node);
+		rb->catchange_ntxns++;
+	}
 
 	/*
 	 * Mark top-level transaction as having catalog changes too if one of its
@@ -3286,8 +3303,55 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	 * conveniently check just top-level transaction and decide whether to
 	 * build the hash table or not.
 	 */
-	if (txn->toptxn != NULL)
-		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	toptxn = txn->toptxn;
+	if (toptxn != NULL && !rbtxn_has_catalog_changes(toptxn))
+	{
+		toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &toptxn->catchange_node);
+		rb->catchange_ntxns++;
+	}
+}
+
+/*
+ * Return palloc'ed array of the transactions that have changed catalogs.
+ * The returned array is sorted in xidComparator order.
+ *
+ * The caller must free the returned array when done with it.
+ */
+TransactionId *
+ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb, size_t *xcnt_p)
+{
+	dlist_iter iter;
+	TransactionId *xids = NULL;
+	size_t	xcnt = 0;
+
+	/* Quick return if the list is empty */
+	if (dlist_is_empty(&rb->catchange_txns))
+	{
+		Assert(rb->catchange_ntxns == 0);
+
+		*xcnt_p = 0;
+		return NULL;
+	}
+
+	/* Initialize XID array */
+	xids = (TransactionId *) palloc(sizeof(TransactionId) * rb->catchange_ntxns);
+	dlist_foreach(iter, &rb->catchange_txns)
+	{
+		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN,
+												catchange_node,
+												iter.cur);
+
+		Assert(rbtxn_has_catalog_changes(txn));
+
+		xids[xcnt++] = txn->xid;
+	}
+
+	if (xcnt > 0)
+		qsort(xids, xcnt, sizeof(TransactionId), xidComparator);
+
+	*xcnt_p = xcnt;
+	return xids;
 }
 
 /*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 73c0f15214..d015c06ced 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -241,6 +241,30 @@ struct SnapBuild
 		 */
 		TransactionId *xip;
 	}			committed;
+
+	/*
+	 * Array of transactions and subtransactions that had modified catalogs
+	 * and were running when the snapshot was serialized.
+	 *
+	 * We normally rely on HEAP2_NEW_CID and XLOG_XACT_INVALIDATIONS records to
+	 * know if the transaction has changed the catalog. But it could happen that
+	 * the logical decoding decodes only the commit record of the transaction.
+	 * This array keeps track of the transactions that have modified catalogs
+	 * and were running when serializing a snapshot, and this array is used to
+	 * add such transactions to the snapshot.
+	 *
+	 * This array is set once when restoring the snapshot; xids are removed
+	 * from the array when decoding xl_running_xacts records, and the array
+	 * eventually becomes empty.
+	 */
+	struct
+	{
+		/* number of transactions */
+		size_t		xcnt;
+
+		/* This array must be sorted in xidComparator order */
+		TransactionId *xip;
+	}			catchange;
 };
 
 /*
@@ -250,8 +274,8 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/* ->committed and ->catchange manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -262,6 +286,8 @@ static void SnapBuildSnapIncRefcount(Snapshot snap);
 
 static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
+static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid);
+
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
 static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
@@ -269,6 +295,7 @@ static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutof
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
 static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path);
 
 /*
  * Allocate a new snapshot builder.
@@ -306,6 +333,9 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 		palloc0(builder->committed.xcnt_space * sizeof(TransactionId));
 	builder->committed.includes_all_transactions = true;
 
+	builder->catchange.xcnt = 0;
+	builder->catchange.xip = NULL;
+
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
@@ -888,12 +918,15 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed or containing catalog
+ * changes that are smaller than ->xmin. Those won't ever get checked via
+ * the ->committed array and ->catchange, respectively. The committed xids will
+ * get checked via the clog machinery. Ideally we could remove a transaction
+ * from the catchange array once it is finished (committed/aborted), but that
+ * could be costly as we need to maintain the xid order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -928,6 +961,36 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* purge xids in ->catchange as well */
+	if (builder->catchange.xcnt > 0)
+	{
+		/*
+		 * Since catchange.xip is sorted, we find the lower bound of
+		 * xids that are still interesting.
+		 */
+		for (off = 0; off < builder->catchange.xcnt; off++)
+		{
+			if (TransactionIdFollowsOrEquals(builder->catchange.xip[off],
+											 builder->xmin))
+				break;
+		}
+
+		surviving_xids = builder->catchange.xcnt - off;
+		if (surviving_xids > 0)
+			memmove(builder->catchange.xip, &(builder->catchange.xip[off]),
+					surviving_xids * sizeof(TransactionId));
+		else
+		{
+			/* catchange list becomes empty */
+			pfree(builder->catchange.xip);
+			builder->catchange.xip = NULL;
+		}
+
+		elog(DEBUG3, "purged catalog modifying transactions from %d to %d",
+			 (uint32) builder->catchange.xcnt, surviving_xids);
+		builder->catchange.xcnt = surviving_xids;
+	}
 }
 
 /*
@@ -983,7 +1046,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		 * Add subtransaction to base snapshot if catalog modifying, we don't
 		 * distinguish to toplevel transactions there.
 		 */
-		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (SnapBuildXidHasCatalogChanges(builder, subxid))
 		{
 			sub_needs_timetravel = true;
 			needs_snapshot = true;
@@ -1012,7 +1075,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 
 	/* if top-level modified catalog, it'll need a snapshot */
-	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (SnapBuildXidHasCatalogChanges(builder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes",
 			 xid);
@@ -1089,6 +1152,21 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 }
 
+/*
+ * Check both the reorder buffer and the snapshot to see if the given
+ * transaction has modified catalogs.
+ */
+static inline bool
+SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid)
+{
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return true;
+
+	/* Check the catchange XID array */
+	return ((builder->catchange.xcnt > 0) &&
+			(bsearch(&xid, builder->catchange.xip, builder->catchange.xcnt,
+					 sizeof(TransactionId), xidComparator) != NULL));
+}
 
 /* -----------------------------------
  * Snapshot building functions dealing with xlog records
@@ -1135,7 +1213,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1438,6 +1516,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
  *
  * struct SnapBuildOnDisk;
  * TransactionId * committed.xcnt; (*not xcnt_space*)
+ * TransactionId * catchange.xcnt;
  *
  */
 typedef struct SnapBuildOnDisk
@@ -1467,7 +1546,7 @@ typedef struct SnapBuildOnDisk
 	offsetof(SnapBuildOnDisk, version)
 
 #define SNAPBUILD_MAGIC 0x51A1E001
-#define SNAPBUILD_VERSION 4
+#define SNAPBUILD_VERSION 5
 
 /*
  * Store/Load a snapshot from disk, depending on the snapshot builder's state.
@@ -1493,6 +1572,8 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 {
 	Size		needed_length;
 	SnapBuildOnDisk *ondisk = NULL;
+	TransactionId	*catchange_xip = NULL;
+	size_t		catchange_xcnt = 0;
 	char	   *ondisk_c;
 	int			fd;
 	char		tmppath[MAXPGPATH];
@@ -1578,8 +1659,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 				(errcode_for_file_access(),
 				 errmsg("could not remove file \"%s\": %m", tmppath)));
 
+	/* Get the catalog modifying transactions that are not yet committed */
+	catchange_xip = ReorderBufferGetCatalogChangesXacts(builder->reorder,
+														&catchange_xcnt);
+
 	needed_length = sizeof(SnapBuildOnDisk) +
-		sizeof(TransactionId) * builder->committed.xcnt;
+		sizeof(TransactionId) * (builder->committed.xcnt + catchange_xcnt);
 
 	ondisk_c = MemoryContextAllocZero(builder->context, needed_length);
 	ondisk = (SnapBuildOnDisk *) ondisk_c;
@@ -1598,6 +1683,9 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	ondisk->builder.snapshot = NULL;
 	ondisk->builder.reorder = NULL;
 	ondisk->builder.committed.xip = NULL;
+	ondisk->builder.catchange.xip = NULL;
+	/* update catchange only on disk data */
+	ondisk->builder.catchange.xcnt = catchange_xcnt;
 
 	COMP_CRC32C(ondisk->checksum,
 				&ondisk->builder,
@@ -1609,6 +1697,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
 	ondisk_c += sz;
 
+	/* copy catalog modifying xacts */
+	sz = sizeof(TransactionId) * catchange_xcnt;
+	memcpy(ondisk_c, catchange_xip, sz);
+	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+	ondisk_c += sz;
+
 	FIN_CRC32C(ondisk->checksum);
 
 	/* we have valid data now, open tempfile and write it there */
@@ -1694,6 +1788,8 @@ out:
 	/* be tidy */
 	if (ondisk)
 		pfree(ondisk);
+	if (catchange_xip)
+		pfree(catchange_xip);
 }
 
 /*
@@ -1707,7 +1803,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	int			fd;
 	char		path[MAXPGPATH];
 	Size		sz;
-	int			readBytes;
 	pg_crc32c	checksum;
 
 	/* no point in loading a snapshot if we're already there */
@@ -1739,29 +1834,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
 
 	/* read statically sized portion of snapshot */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk, SnapBuildOnDiskConstantSize);
-	pgstat_report_wait_end();
-	if (readBytes != SnapBuildOnDiskConstantSize)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes,
-							(Size) SnapBuildOnDiskConstantSize)));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk, SnapBuildOnDiskConstantSize, path);
 
 	if (ondisk.magic != SNAPBUILD_MAGIC)
 		ereport(ERROR,
@@ -1781,57 +1854,21 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 				SnapBuildOnDiskConstantSize - SnapBuildOnDiskNotChecksummedSize);
 
 	/* read SnapBuild */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk.builder, sizeof(SnapBuild));
-	pgstat_report_wait_end();
-	if (readBytes != sizeof(SnapBuild))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sizeof(SnapBuild))));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk.builder, sizeof(SnapBuild), path);
 	COMP_CRC32C(checksum, &ondisk.builder, sizeof(SnapBuild));
 
 	/* restore committed xacts information */
 	sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
 	ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, ondisk.builder.committed.xip, sz);
-	pgstat_report_wait_end();
-	if (readBytes != sz)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sz)));
-	}
+	SnapBuildRestoreContents(fd, (char *) ondisk.builder.committed.xip, sz, path);
 	COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
 
+	/* restore catalog modifying xacts information */
+	sz = sizeof(TransactionId) * ondisk.builder.catchange.xcnt;
+	ondisk.builder.catchange.xip = MemoryContextAllocZero(builder->context, sz);
+	SnapBuildRestoreContents(fd, (char *) ondisk.builder.catchange.xip, sz, path);
+	COMP_CRC32C(checksum, ondisk.builder.catchange.xip, sz);
+
 	if (CloseTransientFile(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -1885,6 +1922,14 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	ondisk.builder.committed.xip = NULL;
 
+	/* set catalog modifying transactions */
+	if (builder->catchange.xip)
+		pfree(builder->catchange.xip);
+	builder->catchange.xcnt = ondisk.builder.catchange.xcnt;
+	builder->catchange.xip = ondisk.builder.catchange.xip;
+
+	ondisk.builder.catchange.xip = NULL;
+
 	/* our snapshot is not interesting anymore, build a new one */
 	if (builder->snapshot != NULL)
 	{
@@ -1909,6 +1954,38 @@ snapshot_not_interesting:
 	return false;
 }
 
+/*
+ * Read the contents of the serialized snapshot to the dest.
+ */
+static void
+SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
+{
+	int			readBytes;
+
+	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+	readBytes = read(fd, dest, size);
+	pgstat_report_wait_end();
+	if (readBytes != size)
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+
+		if (readBytes < 0)
+		{
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("could not read file \"%s\": read %d of %zu",
+							path, readBytes, size)));
+	}
+}
+
 /*
  * Remove all serialized snapshots that are not required anymore because no
  * slot can need them. This doesn't actually have to run during a checkpoint,
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d109d0baed..7446911df1 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -380,6 +380,11 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	node;
 
+	/*
+	 * A node in the list of catalog modifying transactions
+	 */
+	dlist_node	catchange_node;
+
 	/*
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
@@ -526,6 +531,12 @@ struct ReorderBuffer
 	 */
 	dlist_head	txns_by_base_snapshot_lsn;
 
+	/*
+	 * Transactions and subtransactions that have modified system catalogs.
+	 */
+	dlist_head	catchange_txns;
+	int			catchange_ntxns;
+
 	/*
 	 * one-entry sized cache for by_txn. Very frequently the same txn gets
 	 * looked up over and over again.
@@ -677,6 +688,7 @@ extern void ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
 extern void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 extern ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 extern TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
+extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb, size_t *xcnt_p);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
-- 
2.24.3 (Apple Git-128)

#49Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#48)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 12, 2022 at 9:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 8, 2022 at 8:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 8, 2022 at 5:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 8, 2022 at 12:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 8, 2022 at 3:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

1.
In ReorderBufferGetCatalogChangesXacts(), isn't it better to use the
list length of 'catchange_txns' to allocate xids array? If we can do
so, then we will save the need to repalloc as well.

Since ReorderBufferGetCatalogChangesXacts() collects all ongoing
catalog modifying transactions, the length of the array could be
bigger than the one taken last time. We can start with the previous
length but I think we cannot remove the need for repalloc.

It is using the list "catchange_txns" to form xid array which
shouldn't change for the duration of
ReorderBufferGetCatalogChangesXacts(). Then the caller frees the xid
array after its use. Next time in
ReorderBufferGetCatalogChangesXacts(), the fresh allocation for xid
array happens, so not sure why repalloc would be required?

Oops, I mistook catchange_txns for catchange->xcnt. You're right.
Starting with the length of catchange_txns should be sufficient.

I've attached an updated patch.

While trying this idea, I noticed there is no API to get the length of
a dlist, as we discussed off-list. An alternative idea was to use a List
(T_XidList), but I'm not sure it's a great idea: deleting an xid from
the list is O(N), we would need to implement list_delete_xid, and we
would need to make sure list nodes are allocated in the reorder buffer
context. So in the patch, I added a variable, catchange_ntxns, to keep
track of the length of the dlist. Please review it.
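
As an illustration of the idea above (keeping an explicit counter next to an intrusive doubly-linked list that has no length API), here is a minimal standalone sketch. The names `node`, `counted_list`, and the `counted_list_*` helpers are hypothetical and only loosely modeled on a dlist; this is not the patch's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical intrusive list node, embedded in the owning struct. */
typedef struct node
{
	struct node *prev;
	struct node *next;
} node;

/* Circular list with a sentinel head plus an explicit element count. */
typedef struct counted_list
{
	node		head;
	int			count;
} counted_list;

static void
counted_list_init(counted_list *l)
{
	l->head.prev = l->head.next = &l->head;
	l->count = 0;
}

/* O(1) append; the counter is bumped together with the link update. */
static void
counted_list_push_tail(counted_list *l, node *n)
{
	n->prev = l->head.prev;
	n->next = &l->head;
	l->head.prev->next = n;
	l->head.prev = n;
	l->count++;
}

/* O(1) unlink; the counter must be decremented at every delete site. */
static void
counted_list_delete(counted_list *l, node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = NULL;
	l->count--;
	assert(l->count >= 0);
}
```

With this layout, the array for the xids can be sized from `count` in one allocation, at the cost of keeping the counter in sync wherever nodes are added or removed.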

I'm doing benchmark tests and will share the results.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#50shiy.fnst@fujitsu.com
shiy.fnst@fujitsu.com
In reply to: Masahiko Sawada (#48)
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 12, 2022 8:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached an updated patch.

While trying this idea, I noticed there is no API to get the length of
a dlist, as we discussed off-list. An alternative idea was to use a List
(T_XidList), but I'm not sure it's a great idea: deleting an xid from
the list is O(N), we would need to implement list_delete_xid, and we
would need to make sure list nodes are allocated in the reorder buffer
context. So in the patch, I added a variable, catchange_ntxns, to keep
track of the length of the dlist. Please review it.

Thanks for your patch. Here are some comments on the master patch.

1.
In catalog_change_snapshot.spec, should we use "RUNNING_XACTS record" instead of
"RUNNING_XACT record" / "XACT_RUNNING record" in the comment?

2.
+		 * Since catchange.xip is sorted, we find the lower bound of
+		 * xids that sill are interesting.

Typo?
"sill" -> "still"

3.
+	 * This array is set once when restoring the snapshot, xids are removed
+	 * from the array when decoding xl_running_xacts record, and then eventually
+	 * becomes an empty.
+			/* catchange list becomes an empty */
+			pfree(builder->catchange.xip);
+			builder->catchange.xip = NULL;

Should "becomes an empty" be modified to "becomes empty"?

4.
+ * changes that are smaller than ->xmin. Those won't ever get checked via
+ * the ->committed array and ->catchange, respectively. The committed xids will

Should we change
"the ->committed array and ->catchange"
to
"the ->committed or ->catchange array"
?
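
For readers following the comment quoted in item 2, the "lower bound in a sorted array" step can be sketched as below. This is a simplified standalone illustration, not the patch's code: `xid_t`, `xid_array`, and `xid_array_purge_older` are hypothetical names, the comparison is plain numeric order, and transaction ID wraparound is deliberately ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t xid_t;			/* stand-in for TransactionId */

typedef struct xid_array
{
	size_t		xcnt;			/* number of entries */
	xid_t	   *xip;			/* sorted ascending; NULL when empty */
} xid_array;

/*
 * Drop all entries smaller than xmin.  Because the array is sorted, the
 * surviving entries form a suffix, so one scan finds the cut-off point.
 * Assumption: no wraparound handling, unlike real xid comparisons.
 */
static void
xid_array_purge_older(xid_array *arr, xid_t xmin)
{
	size_t		off;

	if (arr->xcnt == 0)
		return;

	for (off = 0; off < arr->xcnt; off++)
	{
		if (arr->xip[off] >= xmin)
			break;
	}

	if (off == arr->xcnt)
	{
		/* nothing is interesting anymore; the array becomes empty */
		free(arr->xip);
		arr->xip = NULL;
		arr->xcnt = 0;
		return;
	}

	/* slide the surviving suffix to the front */
	memmove(arr->xip, arr->xip + off, (arr->xcnt - off) * sizeof(xid_t));
	arr->xcnt -= off;
}
```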

Regards,
Shi yu

#51Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#49)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 12, 2022 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 12, 2022 at 9:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 8, 2022 at 8:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 8, 2022 at 5:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 8, 2022 at 12:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 8, 2022 at 3:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

1.
In ReorderBufferGetCatalogChangesXacts(), isn't it better to use the
list length of 'catchange_txns' to allocate xids array? If we can do
so, then we will save the need to repalloc as well.

Since ReorderBufferGetCatalogChangesXacts() collects all ongoing
catalog modifying transactions, the length of the array could be
bigger than the one taken last time. We can start with the previous
length but I think we cannot remove the need for repalloc.

It is using the list "catchange_txns" to form xid array which
shouldn't change for the duration of
ReorderBufferGetCatalogChangesXacts(). Then the caller frees the xid
array after its use. Next time in
ReorderBufferGetCatalogChangesXacts(), the fresh allocation for xid
array happens, so not sure why repalloc would be required?

Oops, I mistook catchange_txns for catchange->xcnt. You're right.
Starting with the length of catchange_txns should be sufficient.

I've attached an updated patch.

While trying this idea, I noticed there is no API to get the length of
a dlist, as we discussed off-list. An alternative idea was to use a List
(T_XidList), but I'm not sure it's a great idea: deleting an xid from
the list is O(N), we would need to implement list_delete_xid, and we
would need to make sure list nodes are allocated in the reorder buffer
context. So in the patch, I added a variable, catchange_ntxns, to keep
track of the length of the dlist. Please review it.

I'm doing benchmark tests and will share the results.

I've done benchmark tests to measure the overhead introduced by doing
bsearch() every time we decode a commit record. I've simulated a
very intensive situation where we decode 1M commit records while
keeping the builder->catchange.xip array, but the overhead is negligible:

HEAD: 584 ms
Patched: 614 ms

I've attached the benchmark script I used. With
LOG_SNAPSHOT_INTERVAL_MS increased to 90000, the last decoding by
pg_logical_slot_get_changes() decodes 1M commit records while keeping
catalog modifying transactions.
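
To make the measured operation concrete, the per-commit lookup being benchmarked is essentially a binary search over a sorted xid array. The following standalone sketch uses the C library's bsearch(); `xid_t`, `xid_cmp`, and `xid_in_sorted_array` are hypothetical names chosen for illustration, and the comparator is plain numeric order in the spirit of xidComparator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t xid_t;			/* stand-in for TransactionId */

/* Plain numeric three-way comparison for qsort()/bsearch(). */
static int
xid_cmp(const void *a, const void *b)
{
	xid_t		x = *(const xid_t *) a;
	xid_t		y = *(const xid_t *) b;

	if (x == y)
		return 0;
	return (x > y) ? 1 : -1;
}

/* Return true if xid is present in the sorted array xip[0..xcnt-1]. */
static bool
xid_in_sorted_array(xid_t xid, const xid_t *xip, size_t xcnt)
{
	/* the common case once the array has been emptied */
	if (xcnt == 0)
		return false;

	return bsearch(&xid, xip, xcnt, sizeof(xid_t), xid_cmp) != NULL;
}
```

The lookup costs O(log N) comparisons per commit record, and nothing beyond the emptiness check once the array has been purged.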

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

bench.spec (application/octet-stream)
#52Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#51)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 12, 2022 at 11:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 12, 2022 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I'm doing benchmark tests and will share the results.

I've done benchmark tests to measure the overhead introduced by doing
bsearch() every time we decode a commit record. I've simulated a
very intensive situation where we decode 1M commit records while
keeping the builder->catchange.xip array, but the overhead is negligible:

HEAD: 584 ms
Patched: 614 ms

I've attached the benchmark script I used. With
LOG_SNAPSHOT_INTERVAL_MS increased to 90000, the last decoding by
pg_logical_slot_get_changes() decodes 1M commit records while keeping
catalog modifying transactions.

Thanks for the test. We should also see how it performs when (a) we
don't change LOG_SNAPSHOT_INTERVAL_MS, and (b) we have more DDL xacts
so that the array to search is somewhat bigger

--
With Regards,
Amit Kapila.

#53Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#52)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 12, 2022 at 3:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 12, 2022 at 11:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 12, 2022 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I'm doing benchmark tests and will share the results.

I've done benchmark tests to measure the overhead introduced by doing
bsearch() every time we decode a commit record. I've simulated a
very intensive situation where we decode 1M commit records while
keeping the builder->catchange.xip array, but the overhead is negligible:

HEAD: 584 ms
Patched: 614 ms

I've attached the benchmark script I used. With
LOG_SNAPSHOT_INTERVAL_MS increased to 90000, the last decoding by
pg_logical_slot_get_changes() decodes 1M commit records while keeping
catalog modifying transactions.

Thanks for the test. We should also see how it performs when (a) we
don't change LOG_SNAPSHOT_INTERVAL_MS,

What point do you want to see in this test? I think the performance
overhead depends on how many times we do bsearch() and how many
transactions are in the list. I increased this value to easily
simulate the situation where we decode many commit records while
keeping catalog modifying transactions. But even if we don't change
this value, the result would not change if we don't change how many
commit records we decode.

and (b) we have more DDL xacts
so that the array to search is somewhat bigger

I've done the same performance tests while creating 64 catalog
modifying transactions. The result is:

HEAD: 595 ms
Patched: 628 ms

There was no big overhead.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#54Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#53)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 12, 2022 at 1:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 12, 2022 at 3:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 12, 2022 at 11:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 12, 2022 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I'm doing benchmark tests and will share the results.

I've done benchmark tests to measure the overhead introduced by doing
bsearch() every time we decode a commit record. I've simulated a
very intensive situation where we decode 1M commit records while
keeping the builder->catchange.xip array, but the overhead is negligible:

HEAD: 584 ms
Patched: 614 ms

I've attached the benchmark script I used. With
LOG_SNAPSHOT_INTERVAL_MS increased to 90000, the last decoding by
pg_logical_slot_get_changes() decodes 1M commit records while keeping
catalog modifying transactions.

Thanks for the test. We should also see how it performs when (a) we
don't change LOG_SNAPSHOT_INTERVAL_MS,

What point do you want to see in this test? I think the performance
overhead depends on how many times we do bsearch() and how many
transactions are in the list.

Right, I am not expecting any visible performance difference in this
case. This is to ensure that we are not incurring any overhead in the
more usual scenarios (or default cases). As per my understanding, the
purpose of increasing the value of LOG_SNAPSHOT_INTERVAL_MS is to
simulate a stress case for the changes made by the patch, and keeping
its value default will test the more usual scenarios.

I increased this value to easily
simulate the situation where we decode many commit records while
keeping catalog modifying transactions. But even if we don't change
this value, the result would not change if we don't change how many
commit records we decode.

and (b) we have more DDL xacts
so that the array to search is somewhat bigger

I've done the same performance tests while creating 64 catalog
modifying transactions. The result is:

HEAD: 595 ms
Patched: 628 ms

There was no big overhead.

Yeah, especially considering you have simulated a stress case for the patch.

--
With Regards,
Amit Kapila.

#55shiy.fnst@fujitsu.com
shiy.fnst@fujitsu.com
In reply to: Masahiko Sawada (#48)
1 attachment(s)
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 12, 2022 8:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached an updated patch.

Hi,

I hit a segmentation fault in the test_decoding test after applying the patch for
the master branch. The backtrace is attached.

It happened when executing the following code because it tried to free a NULL
pointer (catchange_xip).

/* be tidy */
if (ondisk)
pfree(ondisk);
+ if (catchange_xip)
+ pfree(catchange_xip);
}

It seems to be related to a configure option. I could reproduce it when using
`./configure --enable-debug`.
But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og -ggdb"`.

Regards,
Shi yu

Attachments:

backtrace.txt (text/plain)
#56Masahiko Sawada
sawada.mshk@gmail.com
In reply to: shiy.fnst@fujitsu.com (#55)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 12, 2022 at 5:58 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

On Tue, Jul 12, 2022 8:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached an updated patch.

Hi,

I hit a segmentation fault in the test_decoding test after applying the patch for
the master branch. The backtrace is attached.

Thank you for testing the patch!

It happened when executing the following code because it tried to free a NULL
pointer (catchange_xip).

/* be tidy */
if (ondisk)
pfree(ondisk);
+ if (catchange_xip)
+ pfree(catchange_xip);
}

It seems to be related to a configure option. I could reproduce it when using
`./configure --enable-debug`.
But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og -ggdb"`.

Hmm, I could not reproduce this problem even if I use ./configure
--enable-debug. And it's weird that we checked if catchange_xip is not
null but we did pfree for it:

#1 pfree (pointer=0x0) at mcxt.c:1177
#2 0x000000000078186b in SnapBuildSerialize (builder=0x1fd5e78,
lsn=25719712) at snapbuild.c:1792

Is it reproducible in your environment? If so, could you test it again
with the following changes?

diff --git a/src/backend/replication/logical/snapbuild.c
b/src/backend/replication/logical/snapbuild.c
index d015c06ced..a6e76e3781 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1788,7 +1788,7 @@ out:
    /* be tidy */
    if (ondisk)
        pfree(ondisk);
-   if (catchange_xip)
+   if (catchange_xip != NULL)
        pfree(catchange_xip);
 }

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#57Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#56)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 12, 2022 at 2:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 12, 2022 at 5:58 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

It happened when executing the following code because it tried to free a NULL
pointer (catchange_xip).

/* be tidy */
if (ondisk)
pfree(ondisk);
+ if (catchange_xip)
+ pfree(catchange_xip);
}

It seems to be related to a configure option. I could reproduce it when using
`./configure --enable-debug`.
But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og -ggdb"`.

Hmm, I could not reproduce this problem even if I use ./configure
--enable-debug. And it's weird that we checked if catchange_xip is not
null but we did pfree for it:

Yeah, this looks weird to me as well but one difference in running
tests could be the timing of WAL LOG for XLOG_RUNNING_XACTS. That may
change the timing of SnapBuildSerialize. The other thing we can try is
checking the value of catchange_xcnt before pfree.

BTW, I think ReorderBufferGetCatalogChangesXacts should have an Assert
to ensure rb->catchange_ntxns and xcnt are equal. We can probably then
avoid having xcnt_p as an out parameter as the caller can use
rb->catchange_ntxns instead.

--
With Regards,
Amit Kapila.

#58Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#57)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 12, 2022 at 7:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 12, 2022 at 2:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 12, 2022 at 5:58 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

It happened when executing the following code because it tried to free a NULL
pointer (catchange_xip).

/* be tidy */
if (ondisk)
pfree(ondisk);
+ if (catchange_xip)
+ pfree(catchange_xip);
}

It seems to be related to a configure option. I could reproduce it when using
`./configure --enable-debug`.
But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og -ggdb"`.

Hmm, I could not reproduce this problem even if I use ./configure
--enable-debug. And it's weird that we checked if catchange_xip is not
null but we did pfree for it:

Yeah, this looks weird to me as well but one difference in running
tests could be the timing of WAL LOG for XLOG_RUNNING_XACTS. That may
change the timing of SnapBuildSerialize. The other thing we can try is
checking the value of catchange_xcnt before pfree.

Yeah, we can try that.

While reading the code, I realized that we try to pfree both ondisk
and catchange_xip even when we jump to 'out:':

out:
ReorderBufferSetRestartPoint(builder->reorder,
builder->last_serialized_snapshot);
/* be tidy */
if (ondisk)
pfree(ondisk);
if (catchange_xip)
pfree(catchange_xip);

But we use both ondisk and catchange_xip only if we didn't jump to
'out:'. If this problem is related to compiler optimization with
'goto' statement, moving them before 'out:' might be worth trying.
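
The restructuring suggested above can be sketched in isolation as follows. This is a generic illustration of the pattern, not a diagnosis of the reported crash and not the patch's code: `serialize_sketch` and `restart_points_set` are hypothetical, and malloc()/free() stand in for palloc()/pfree().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Counts how often the shared tail after the label ran (illustration only). */
static int	restart_points_set = 0;

/*
 * The buffer is allocated and released entirely before the label, so the
 * early-exit path that jumps to 'out' never touches it.  Only work needed
 * on both paths stays after the label.
 */
static bool
serialize_sketch(bool already_serialized)
{
	char	   *ondisk;
	bool		wrote = false;

	if (already_serialized)
		goto out;

	ondisk = malloc(64);
	if (ondisk != NULL)
	{
		ondisk[0] = '\0';		/* stand-in for building and writing data */
		wrote = true;
		free(ondisk);			/* be tidy before reaching the label */
	}

out:
	restart_points_set++;		/* runs on both paths */
	return wrote;
}
```

An alternative with the same effect is to initialize every pointer to NULL at declaration and keep the guarded frees after the label.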

BTW, I think ReorderBufferGetCatalogChangesXacts should have an Assert
to ensure rb->catchange_ntxns and xcnt are equal. We can probably then
avoid having xcnt_p as an out parameter as the caller can use
rb->catchange_ntxns instead.

Agreed.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#59Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#54)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 12, 2022 at 5:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 12, 2022 at 1:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 12, 2022 at 3:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 12, 2022 at 11:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 12, 2022 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I'm doing benchmark tests and will share the results.

I've done benchmark tests to measure the overhead introduced by doing
bsearch() every time we decode a commit record. I've simulated a
very intensive situation where we decode 1M commit records while
keeping the builder->catchange.xip array, but the overhead is negligible:

HEAD: 584 ms
Patched: 614 ms

I've attached the benchmark script I used. With
LOG_SNAPSHOT_INTERVAL_MS increased to 90000, the last decoding by
pg_logical_slot_get_changes() decodes 1M commit records while keeping
catalog modifying transactions.

Thanks for the test. We should also see how it performs when (a) we
don't change LOG_SNAPSHOT_INTERVAL_MS,

What point do you want to see in this test? I think the performance
overhead depends on how many times we do bsearch() and how many
transactions are in the list.

Right, I am not expecting any visible performance difference in this
case. This is to ensure that we are not incurring any overhead in the
more usual scenarios (or default cases). As per my understanding, the
purpose of increasing the value of LOG_SNAPSHOT_INTERVAL_MS is to
simulate a stress case for the changes made by the patch, and keeping
its value default will test the more usual scenarios.

Agreed.

I've done simple benchmark tests to decode 100k pgbench transactions:

HEAD: 10.34 s
Patched: 10.29 s

I've attached an updated patch that incorporated comments from Amit and Shi.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

v4-0001-Add-catalog-modifying-transactions-to-logical-dec.patch (application/x-patch)
From 28ca92c9d95cd05a26a7db6e54704f92b1846943 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 6 Jul 2022 12:53:36 +0900
Subject: [PATCH v4] Add catalog modifying transactions to logical decoding
 serialized snapshot.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this change adds the list of transaction IDs and
sub-transaction IDs that have modified catalogs and are running at
snapshot serialization time to the serialized snapshot. When decoding a
COMMIT record, we check both the list and the ReorderBuffer to see if
the transaction has modified catalogs.

Since this adds additional information to the serialized snapshot, we
cannot backpatch it. For back branches, we take another approach:
remember the last-running-xacts list of the first decoded
RUNNING_XACTS record and check whether a transaction's commit record
has XACT_XINFO_HAS_INVALS and its XID is in the list. This doesn't
require any file format changes, but the transaction will end up being
added to the snapshot even if it has only relcache invalidations.

This commit bumps SNAPBUILD_VERSION because of change in SnapBuild.
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++
 .../specs/catalog_change_snapshot.spec        |  39 +++
 .../replication/logical/reorderbuffer.c       |  69 ++++-
 src/backend/replication/logical/snapbuild.c   | 235 ++++++++++++------
 src/include/replication/reorderbuffer.h       |  12 +
 6 files changed, 317 insertions(+), 84 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..42bad9a45b
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has changed
+# the catalog.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 88a37fde72..d7f430623d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -349,6 +349,8 @@ ReorderBufferAllocate(void)
 	buffer->by_txn_last_xid = InvalidTransactionId;
 	buffer->by_txn_last_txn = NULL;
 
+	buffer->catchange_ntxns = 0;
+
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
 	buffer->size = 0;
@@ -366,6 +368,7 @@ ReorderBufferAllocate(void)
 
 	dlist_init(&buffer->toplevel_by_lsn);
 	dlist_init(&buffer->txns_by_base_snapshot_lsn);
+	dlist_init(&buffer->catchange_txns);
 
 	/*
 	 * Ensure there's no stale data from prior uses of this slot, in case some
@@ -1526,7 +1529,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
-	 * Remove TXN from its containing list.
+	 * Remove TXN from its containing lists.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
@@ -1535,6 +1538,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 */
 	dlist_delete(&txn->node);
 
+	if (rbtxn_has_catalog_changes(txn))
+	{
+		dlist_delete(&txn->catchange_node);
+		rb->catchange_ntxns--;
+
+		Assert(rb->catchange_ntxns >= 0);
+	}
+
 	/* now remove reference from buffer */
 	hash_search(rb->by_txn,
 				(void *) &txn->xid,
@@ -3275,10 +3286,16 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 								  XLogRecPtr lsn)
 {
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	if (!rbtxn_has_catalog_changes(txn))
+	{
+		txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &txn->catchange_node);
+		rb->catchange_ntxns++;
+	}
 
 	/*
 	 * Mark top-level transaction as having catalog changes too if one of its
@@ -3286,8 +3303,52 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	 * conveniently check just top-level transaction and decide whether to
 	 * build the hash table or not.
 	 */
-	if (txn->toptxn != NULL)
-		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	toptxn = txn->toptxn;
+	if (toptxn != NULL && !rbtxn_has_catalog_changes(toptxn))
+	{
+		toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &toptxn->catchange_node);
+		rb->catchange_ntxns++;
+	}
+}
+
+/*
+ * Return palloc'ed array of the transactions that have changed catalogs.
+ * The returned array is sorted in xidComparator order.
+ *
+ * The caller must free the returned array when done with it.
+ */
+TransactionId *
+ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb)
+{
+	dlist_iter iter;
+	TransactionId *xids = NULL;
+	size_t	xcnt = 0;
+
+	/* Quick return if the list is empty */
+	if (dlist_is_empty(&rb->catchange_txns))
+	{
+		Assert(rb->catchange_ntxns == 0);
+		return NULL;
+	}
+
+	/* Initialize XID array */
+	xids = (TransactionId *) palloc(sizeof(TransactionId) * rb->catchange_ntxns);
+	dlist_foreach(iter, &rb->catchange_txns)
+	{
+		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN,
+												catchange_node,
+												iter.cur);
+
+		Assert(rbtxn_has_catalog_changes(txn));
+
+		xids[xcnt++] = txn->xid;
+	}
+
+	qsort(xids, xcnt, sizeof(TransactionId), xidComparator);
+
+	Assert((xcnt > 0) && (xcnt == rb->catchange_ntxns));
+	return xids;
 }
 
 /*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 73c0f15214..c482e906b0 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -241,6 +241,30 @@ struct SnapBuild
 		 */
 		TransactionId *xip;
 	}			committed;
+
+	/*
+	 * Array of transactions and subtransactions that had modified catalogs
+	 * and were running when the snapshot was serialized.
+	 *
+	 * We normally rely on HEAP2_NEW_CID and XLOG_XACT_INVALIDATIONS records to
+	 * know if the transaction has changed the catalog. But it could happen that
+	 * the logical decoding decodes only the commit record of the transaction.
+	 * This array keeps track of the transactions that have modified catalogs
+	 * and were running when serializing a snapshot, and this array is used to
+	 * add such transactions to the snapshot.
+	 *
+	 * This array is set once when restoring the snapshot, xids are removed
+	 * from the array when decoding xl_running_xacts record, and then eventually
+	 * becomes empty.
+	 */
+	struct
+	{
+		/* number of transactions */
+		size_t		xcnt;
+
+		/* This array must be sorted in xidComparator order */
+		TransactionId *xip;
+	}			catchange;
 };
 
 /*
@@ -250,8 +274,8 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/* ->committed and ->catchange manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -262,6 +286,8 @@ static void SnapBuildSnapIncRefcount(Snapshot snap);
 
 static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
+static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid);
+
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
 static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
@@ -269,6 +295,7 @@ static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutof
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
 static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path);
 
 /*
  * Allocate a new snapshot builder.
@@ -306,6 +333,9 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 		palloc0(builder->committed.xcnt_space * sizeof(TransactionId));
 	builder->committed.includes_all_transactions = true;
 
+	builder->catchange.xcnt = 0;
+	builder->catchange.xip = NULL;
+
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
@@ -888,12 +918,15 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed or containing catalog
+ * changes that are smaller than ->xmin. Those won't ever get checked via
+ * the ->committed or ->catchange array, respectively. The committed xids will
+ * get checked via the clog machinery. We can ideally remove the transaction
+ * from catchange array once it is finished (committed/aborted) but that could
+ * be costly as we need to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -928,6 +961,36 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* purge xids in ->catchange as well */
+	if (builder->catchange.xcnt > 0)
+	{
+		/*
+		 * Since catchange.xip is sorted, we find the lower bound of
+		 * xids that still are interesting.
+		 */
+		for (off = 0; off < builder->catchange.xcnt; off++)
+		{
+			if (TransactionIdFollowsOrEquals(builder->catchange.xip[off],
+											 builder->xmin))
+				break;
+		}
+
+		surviving_xids = builder->catchange.xcnt - off;
+		if (surviving_xids > 0)
+			memmove(builder->catchange.xip, &(builder->catchange.xip[off]),
+					surviving_xids * sizeof(TransactionId));
+		else
+		{
+			/* catchange list becomes empty */
+			pfree(builder->catchange.xip);
+			builder->catchange.xip = NULL;
+		}
+
+		elog(DEBUG3, "purged catalog modifying transactions from %d to %d",
+			 (uint32) builder->catchange.xcnt, surviving_xids);
+		builder->catchange.xcnt = surviving_xids;
+	}
 }
 
 /*
@@ -983,7 +1046,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		 * Add subtransaction to base snapshot if catalog modifying, we don't
 		 * distinguish to toplevel transactions there.
 		 */
-		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (SnapBuildXidHasCatalogChanges(builder, subxid))
 		{
 			sub_needs_timetravel = true;
 			needs_snapshot = true;
@@ -1012,7 +1075,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 
 	/* if top-level modified catalog, it'll need a snapshot */
-	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (SnapBuildXidHasCatalogChanges(builder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes",
 			 xid);
@@ -1089,6 +1152,21 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 }
 
+/*
+ * Check both the reorder buffer and the snapshot to see if the given
+ * transaction has modified catalogs.
+ */
+static inline bool
+SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid)
+{
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return true;
+
+	/* Check the catchange XID array */
+	return ((builder->catchange.xcnt > 0) &&
+			(bsearch(&xid, builder->catchange.xip, builder->catchange.xcnt,
+					 sizeof(TransactionId), xidComparator) != NULL));
+}
 
 /* -----------------------------------
  * Snapshot building functions dealing with xlog records
@@ -1135,7 +1213,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1438,6 +1516,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
  *
  * struct SnapBuildOnDisk;
  * TransactionId * committed.xcnt; (*not xcnt_space*)
+ * TransactionId * catchange.xcnt;
  *
  */
 typedef struct SnapBuildOnDisk
@@ -1467,7 +1546,7 @@ typedef struct SnapBuildOnDisk
 	offsetof(SnapBuildOnDisk, version)
 
 #define SNAPBUILD_MAGIC 0x51A1E001
-#define SNAPBUILD_VERSION 4
+#define SNAPBUILD_VERSION 5
 
 /*
  * Store/Load a snapshot from disk, depending on the snapshot builder's state.
@@ -1493,6 +1572,8 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 {
 	Size		needed_length;
 	SnapBuildOnDisk *ondisk = NULL;
+	TransactionId	*catchange_xip = NULL;
+	size_t		catchange_xcnt;
 	char	   *ondisk_c;
 	int			fd;
 	char		tmppath[MAXPGPATH];
@@ -1578,8 +1659,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 				(errcode_for_file_access(),
 				 errmsg("could not remove file \"%s\": %m", tmppath)));
 
+	/* Get the catalog modifying transactions that are yet not committed */
+	catchange_xip = ReorderBufferGetCatalogChangesXacts(builder->reorder);
+	catchange_xcnt = builder->reorder->catchange_ntxns;
+
 	needed_length = sizeof(SnapBuildOnDisk) +
-		sizeof(TransactionId) * builder->committed.xcnt;
+		sizeof(TransactionId) * (builder->committed.xcnt + catchange_xcnt);
 
 	ondisk_c = MemoryContextAllocZero(builder->context, needed_length);
 	ondisk = (SnapBuildOnDisk *) ondisk_c;
@@ -1598,6 +1683,9 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	ondisk->builder.snapshot = NULL;
 	ondisk->builder.reorder = NULL;
 	ondisk->builder.committed.xip = NULL;
+	ondisk->builder.catchange.xip = NULL;
+	/* update catchange only on disk data */
+	ondisk->builder.catchange.xcnt = catchange_xcnt;
 
 	COMP_CRC32C(ondisk->checksum,
 				&ondisk->builder,
@@ -1609,6 +1697,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
 	ondisk_c += sz;
 
+	/* copy catalog modifying xacts */
+	sz = sizeof(TransactionId) * catchange_xcnt;
+	memcpy(ondisk_c, catchange_xip, sz);
+	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+	ondisk_c += sz;
+
 	FIN_CRC32C(ondisk->checksum);
 
 	/* we have valid data now, open tempfile and write it there */
@@ -1694,6 +1788,8 @@ out:
 	/* be tidy */
 	if (ondisk)
 		pfree(ondisk);
+	if (catchange_xip)
+		pfree(catchange_xip);
 }
 
 /*
@@ -1707,7 +1803,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	int			fd;
 	char		path[MAXPGPATH];
 	Size		sz;
-	int			readBytes;
 	pg_crc32c	checksum;
 
 	/* no point in loading a snapshot if we're already there */
@@ -1739,29 +1834,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
 
 	/* read statically sized portion of snapshot */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk, SnapBuildOnDiskConstantSize);
-	pgstat_report_wait_end();
-	if (readBytes != SnapBuildOnDiskConstantSize)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes,
-							(Size) SnapBuildOnDiskConstantSize)));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk, SnapBuildOnDiskConstantSize, path);
 
 	if (ondisk.magic != SNAPBUILD_MAGIC)
 		ereport(ERROR,
@@ -1781,57 +1854,21 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 				SnapBuildOnDiskConstantSize - SnapBuildOnDiskNotChecksummedSize);
 
 	/* read SnapBuild */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk.builder, sizeof(SnapBuild));
-	pgstat_report_wait_end();
-	if (readBytes != sizeof(SnapBuild))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sizeof(SnapBuild))));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk.builder, sizeof(SnapBuild), path);
 	COMP_CRC32C(checksum, &ondisk.builder, sizeof(SnapBuild));
 
 	/* restore committed xacts information */
 	sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
 	ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, ondisk.builder.committed.xip, sz);
-	pgstat_report_wait_end();
-	if (readBytes != sz)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sz)));
-	}
+	SnapBuildRestoreContents(fd, (char *) ondisk.builder.committed.xip, sz, path);
 	COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
 
+	/* restore catalog modifying xacts information */
+	sz = sizeof(TransactionId) * ondisk.builder.catchange.xcnt;
+	ondisk.builder.catchange.xip = MemoryContextAllocZero(builder->context, sz);
+	SnapBuildRestoreContents(fd, (char *) ondisk.builder.catchange.xip, sz, path);
+	COMP_CRC32C(checksum, ondisk.builder.catchange.xip, sz);
+
 	if (CloseTransientFile(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -1885,6 +1922,14 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	ondisk.builder.committed.xip = NULL;
 
+	/* set catalog modifying transactions */
+	if (builder->catchange.xip)
+		pfree(builder->catchange.xip);
+	builder->catchange.xcnt = ondisk.builder.catchange.xcnt;
+	builder->catchange.xip = ondisk.builder.catchange.xip;
+
+	ondisk.builder.catchange.xip = NULL;
+
 	/* our snapshot is not interesting anymore, build a new one */
 	if (builder->snapshot != NULL)
 	{
@@ -1909,6 +1954,38 @@ snapshot_not_interesting:
 	return false;
 }
 
+/*
+ * Read the contents of the serialized snapshot to the dest.
+ */
+static void
+SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
+{
+	int			readBytes;
+
+	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+	readBytes = read(fd, dest, size);
+	pgstat_report_wait_end();
+	if (readBytes != size)
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+
+		if (readBytes < 0)
+		{
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("could not read file \"%s\": read %d of %zu",
+							path, readBytes, sizeof(SnapBuild))));
+	}
+}
+
 /*
  * Remove all serialized snapshots that are not required anymore because no
  * slot can need them. This doesn't actually have to run during a checkpoint,
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d109d0baed..fd84f175c0 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -380,6 +380,11 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	node;
 
+	/*
+	 * A node in the list of catalog modifying transactions
+	 */
+	dlist_node	catchange_node;
+
 	/*
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
@@ -526,6 +531,12 @@ struct ReorderBuffer
 	 */
 	dlist_head	txns_by_base_snapshot_lsn;
 
+	/*
+	 * Transactions and subtransactions that have modified system catalogs.
+	 */
+	dlist_head	catchange_txns;
+	int			catchange_ntxns;
+
 	/*
 	 * one-entry sized cache for by_txn. Very frequently the same txn gets
 	 * looked up over and over again.
@@ -677,6 +688,7 @@ extern void ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
 extern void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 extern ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 extern TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
+extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
-- 
2.24.3 (Apple Git-128)
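
As a reading aid for the patch above: SnapBuildXidHasCatalogChanges falls back to a bsearch() over catchange.xip, which is why the array has to stay in xidComparator order. The following is a minimal standalone sketch of that sort-then-search pairing, not code from the patch. The names xid_cmp, catchange_sort and catchange_contains are hypothetical, and xid_cmp is assumed to be a plain numeric comparison (no wraparound handling).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Plain numeric ordering; assumed to match what xidComparator provides. */
int
xid_cmp(const void *a, const void *b)
{
    TransactionId xa = *(const TransactionId *) a;
    TransactionId xb = *(const TransactionId *) b;

    if (xa > xb)
        return 1;
    if (xa < xb)
        return -1;
    return 0;
}

/* Sort once (at serialization time) so later lookups can use bsearch(). */
void
catchange_sort(TransactionId *xip, size_t xcnt)
{
    if (xcnt > 1)
        qsort(xip, xcnt, sizeof(TransactionId), xid_cmp);
}

/* True if xid is present in the sorted array; an empty array never matches. */
bool
catchange_contains(const TransactionId *xip, size_t xcnt, TransactionId xid)
{
    if (xcnt == 0)
        return false;
    return bsearch(&xid, xip, xcnt, sizeof(TransactionId), xid_cmp) != NULL;
}
```

The sort and the search must use the same comparator; searching an array sorted under a different ordering gives undefined results from bsearch().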

#60Masahiko Sawada
sawada.mshk@gmail.com
In reply to: shiy.fnst@fujitsu.com (#50)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 12, 2022 at 12:40 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

On Tue, Jul 12, 2022 8:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached an updated patch.

While trying this idea, I noticed there is no API to get the length of
a dlist, as we discussed off-list. An alternative idea was to use a List
(T_XidList), but I'm not sure it's a great idea: deleting an xid from
the list is O(N), we would need to implement list_delete_xid, and we
would need to make sure list nodes are allocated in the reorder buffer
context. So in the patch, I added a variable, catchange_ntxns, to keep
track of the length of the dlist. Please review it.

Thanks for your patch. Here are some comments on the master patch.

Thank you for the comments.

1.
In catalog_change_snapshot.spec, should we use "RUNNING_XACTS record" instead of
"RUNNING_XACT record" / "XACT_RUNNING record" in the comment?

2.
+                * Since catchange.xip is sorted, we find the lower bound of
+                * xids that sill are interesting.

Typo?
"sill" -> "still"

3.
+        * This array is set once when restoring the snapshot, xids are removed
+        * from the array when decoding xl_running_xacts record, and then eventually
+        * becomes an empty.
+                       /* catchange list becomes an empty */
+                       pfree(builder->catchange.xip);
+                       builder->catchange.xip = NULL;

Should "becomes an empty" be modified to "becomes empty"?

4.
+ * changes that are smaller than ->xmin. Those won't ever get checked via
+ * the ->committed array and ->catchange, respectively. The committed xids will

Should we change
"the ->committed array and ->catchange"
to
"the ->committed or ->catchange array"
?

Agreed with all the above comments. These are incorporated in the
latest v4 patch I just sent[1].

Regards,

[1]: /messages/by-id/CAD21AoAyNPrOFg+QGh+=4205TU0=yrE+QyMgzStkH85uBZXptQ@mail.gmail.com

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
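
The counter mentioned in this message (catchange_ntxns, kept next to the dlist because the dlist has no length API) is an instance of a general pattern: an intrusive doubly linked list paired with an explicit element count that every insert and delete keeps in step. The sketch below illustrates the pattern only; Node, CountedList and the counted_list_* functions are hypothetical names and do not reproduce PostgreSQL's ilist.h.

```c
#include <assert.h>
#include <stddef.h>

/* Intrusive list node, embedded in the owning struct. */
typedef struct Node
{
    struct Node *prev;
    struct Node *next;
} Node;

/* Circular list with a sentinel head plus an explicit length. */
typedef struct CountedList
{
    Node        head;
    size_t      count;
} CountedList;

void
counted_list_init(CountedList *l)
{
    l->head.prev = l->head.next = &l->head;
    l->count = 0;
}

/* O(1) append; the counter is updated in the same place as the links. */
void
counted_list_push_tail(CountedList *l, Node *n)
{
    n->next = &l->head;
    n->prev = l->head.prev;
    l->head.prev->next = n;
    l->head.prev = n;
    l->count++;
}

/* O(1) unlink of a node known to be on the list. */
void
counted_list_delete(CountedList *l, Node *n)
{
    assert(l->count > 0);
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = NULL;
    l->count--;
}
```

Keeping the count beside the list makes sizing an array for the list contents O(1), at the cost of having to update the counter on every code path that links or unlinks a node.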

#61shiy.fnst@fujitsu.com
shiy.fnst@fujitsu.com
In reply to: Masahiko Sawada (#56)
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 12, 2022 5:23 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 12, 2022 at 5:58 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

It happened when executing the following code because it tried to free a NULL
pointer (catchange_xip).

/* be tidy */
if (ondisk)
pfree(ondisk);
+ if (catchange_xip)
+ pfree(catchange_xip);
}

It seems to be related to configure option. I could reproduce it when using
`./configure --enable-debug`.
But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og -ggdb"`.

Hmm, I could not reproduce this problem even if I use ./configure
--enable-debug. And it's weird that we checked if catchange_xip is not
null but we did pfree for it:

#1 pfree (pointer=0x0) at mcxt.c:1177
#2 0x000000000078186b in SnapBuildSerialize (builder=0x1fd5e78,
lsn=25719712) at snapbuild.c:1792

Is it reproducible in your environment?

Thanks for your reply! Yes, it is reproducible. And I also reproduced it on the
v4 patch you posted [1].

[1]: /messages/by-id/CAD21AoAyNPrOFg+QGh+=4205TU0=yrE+QyMgzStkH85uBZXptQ@mail.gmail.com

If so, could you test it again
with the following changes?

diff --git a/src/backend/replication/logical/snapbuild.c
b/src/backend/replication/logical/snapbuild.c
index d015c06ced..a6e76e3781 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1788,7 +1788,7 @@ out:
/* be tidy */
if (ondisk)
pfree(ondisk);
-   if (catchange_xip)
+   if (catchange_xip != NULL)
pfree(catchange_xip);
}

I tried this and could still reproduce the problem.

Besides, I tried the suggestion from Amit [2], it could be fixed by checking
the value of catchange_xcnt instead of catchange_xip before pfree.

[2]: /messages/by-id/CAA4eK1+XPdm8G=EhUJA12Pi1YvQAfcz2=kTd9a4BjVx4=gk-MA@mail.gmail.com

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index c482e906b0..68b9c4ef7d 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1573,7 +1573,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
        Size            needed_length;
        SnapBuildOnDisk *ondisk = NULL;
        TransactionId   *catchange_xip = NULL;
-       size_t          catchange_xcnt;
+       size_t          catchange_xcnt = 0;
        char       *ondisk_c;
        int                     fd;
        char            tmppath[MAXPGPATH];
@@ -1788,7 +1788,7 @@ out:
        /* be tidy */
        if (ondisk)
                pfree(ondisk);
-       if (catchange_xip)
+       if (catchange_xcnt != 0)
                pfree(catchange_xip);
 }

Regards,
Shi yu

#62Masahiko Sawada
sawada.mshk@gmail.com
In reply to: shiy.fnst@fujitsu.com (#61)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Thu, Jul 14, 2022 at 11:16 AM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

On Tue, Jul 12, 2022 5:23 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 12, 2022 at 5:58 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

It happened when executing the following code because it tried to free a NULL
pointer (catchange_xip).

/* be tidy */
if (ondisk)
pfree(ondisk);
+ if (catchange_xip)
+ pfree(catchange_xip);
}

It seems to be related to configure option. I could reproduce it when using
`./configure --enable-debug`.
But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og -ggdb"`.

Hmm, I could not reproduce this problem even if I use ./configure
--enable-debug. And it's weird that we checked if catchange_xip is not
null but we did pfree for it:

#1 pfree (pointer=0x0) at mcxt.c:1177
#2 0x000000000078186b in SnapBuildSerialize (builder=0x1fd5e78,
lsn=25719712) at snapbuild.c:1792

Is it reproducible in your environment?

Thanks for your reply! Yes, it is reproducible. And I also reproduced it on the
v4 patch you posted [1].

Thank you for testing!

[1] /messages/by-id/CAD21AoAyNPrOFg+QGh+=4205TU0=yrE+QyMgzStkH85uBZXptQ@mail.gmail.com

If so, could you test it again
with the following changes?

diff --git a/src/backend/replication/logical/snapbuild.c
b/src/backend/replication/logical/snapbuild.c
index d015c06ced..a6e76e3781 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1788,7 +1788,7 @@ out:
/* be tidy */
if (ondisk)
pfree(ondisk);
-   if (catchange_xip)
+   if (catchange_xip != NULL)
pfree(catchange_xip);
}

I tried this and could still reproduce the problem.

Does the backtrace still show we attempt to pfree a null-pointer?

Besides, I tried the suggestion from Amit [2], it could be fixed by checking
the value of catchange_xcnt instead of catchange_xip before pfree.

Could you check if this problem occurred when we reached there via
the goto path, i.e., did we call ReorderBufferGetCatalogChangesXacts() or
not?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
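
The question in this message is whether the cleanup label can be reached by a goto before ReorderBufferGetCatalogChangesXacts() has run. The sketch below shows only that control-flow shape; it is not an analysis of the reported crash, whose cause the thread has not established at this point. serialize_sketch and its parameters are hypothetical, malloc/free stand in for palloc/pfree, and the explicit NULL test is kept because pfree, unlike free, must not be handed a NULL pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for a serialize routine with an early exit.
 * Returns the number of xids "written", 0 on the early-exit path,
 * or -1 on allocation failure.
 */
int
serialize_sketch(bool already_serialized, size_t nxids)
{
    unsigned int *xids = NULL;  /* set at declaration: "out" may run first */
    size_t      xcnt = 0;       /* likewise */
    int         rc = 0;

    if (already_serialized)
        goto out;               /* jumps over the allocation below */

    xcnt = nxids;
    if (xcnt > 0)
    {
        xids = malloc(sizeof(*xids) * xcnt);
        if (xids == NULL)
        {
            rc = -1;
            goto out;
        }
    }
    rc = (int) xcnt;            /* pretend the array was written out */

out:
    /* be tidy: only release what was actually allocated */
    if (xids != NULL)
        free(xids);
    return rc;
}
```

Every variable that the cleanup code reads has a defined value at its declaration, so the label is safe to reach from any goto, including ones that precede the allocation.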

#63Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#62)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Thu, Jul 14, 2022 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Jul 14, 2022 at 11:16 AM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

On Tue, Jul 12, 2022 5:23 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 12, 2022 at 5:58 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

It happened when executing the following code because it tried to free a NULL
pointer (catchange_xip).

/* be tidy */
if (ondisk)
pfree(ondisk);
+ if (catchange_xip)
+ pfree(catchange_xip);
}

It seems to be related to configure option. I could reproduce it when using
`./configure --enable-debug`.
But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og -ggdb"`.

Hmm, I could not reproduce this problem even if I use ./configure
--enable-debug. And it's weird that we checked if catchange_xip is not
null but we did pfree for it:

#1 pfree (pointer=0x0) at mcxt.c:1177
#2 0x000000000078186b in SnapBuildSerialize (builder=0x1fd5e78,
lsn=25719712) at snapbuild.c:1792

Is it reproducible in your environment?

Thanks for your reply! Yes, it is reproducible. And I also reproduced it on the
v4 patch you posted [1].

Thank you for testing!

I've found out the exact cause of this problem and how to fix it. I'll
submit an updated patch next week with my analysis.

Thank you for testing and providing additional information off-list, Shi yu.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#64shiy.fnst@fujitsu.com
shiy.fnst@fujitsu.com
In reply to: Masahiko Sawada (#47)
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, Jul 11, 2022 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached an updated patch, please review it.

Thanks for your patch. Here are some comments for the REL14-v1 patch.

1.
+ Size sz = sizeof(TransactionId) * nxacts;;

There is a redundant semicolon at the end.

2.
+ workspace = MemoryContextAlloc(rb->context, rb->n_initial_running_xacts);

Should it be:
+ workspace = MemoryContextAlloc(rb->context, sizeof(TransactionId) * rb->n_initial_running_xacts);

3.
+	/* bound check if there is at least one transaction to be removed */
+	if (NormalTransactionIdPrecedes(rb->initial_running_xacts[0],
+									running->oldestRunningXid))
+		return;
+

Here, I think it should return if rb->initial_running_xacts[0] is older than
oldestRunningXid, right? Should it be changed to:

+	if (!NormalTransactionIdPrecedes(rb->initial_running_xacts[0],
+									running->oldestRunningXid))
+		return;

4.
+ if ((parsed->xinfo & XACT_XINFO_HAS_INVALS) != 0)

Maybe we can change it like the following, to be consistent with other places in
this file. It's also fine if you don't change it.

+ if (parsed->xinfo & XACT_XINFO_HAS_INVALS)

Regards,
Shi yu
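
Items 2 and 3 above concern two separate details: an allocation must be sized in bytes (element size times element count), and the early return should fire when there is nothing to remove. A minimal sketch of both follows, using the negated condition proposed in item 3. xid_precedes and purge_older_xids are hypothetical names, malloc/free stand in for MemoryContextAlloc/pfree, the array is assumed to be ordered oldest first, and the wraparound-aware comparison is assumed to apply to normal XIDs only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

/* Wraparound-aware "a is older than b", using modulo-2^32 arithmetic. */
bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Remove every xid older than oldest_running from xacts[0..n), keeping
 * the order of the rest. Returns the number of surviving entries.
 */
size_t
purge_older_xids(TransactionId *xacts, size_t n, TransactionId oldest_running)
{
    TransactionId *workspace;
    size_t      survivors = 0;
    size_t      i;

    /* Nothing to remove if the list is empty or its oldest entry survives. */
    if (n == 0 || !xid_precedes(xacts[0], oldest_running))
        return n;

    /* Size in bytes: sizeof(element) * count, not the bare count. */
    workspace = malloc(sizeof(TransactionId) * n);
    if (workspace == NULL)
        return n;               /* sketch only: leave the array untouched */

    for (i = 0; i < n; i++)
    {
        if (!xid_precedes(xacts[i], oldest_running))
            workspace[survivors++] = xacts[i];
    }

    memcpy(xacts, workspace, sizeof(TransactionId) * survivors);
    free(workspace);
    return survivors;
}
```

Passing only the element count as the allocation size would reserve a quarter of the needed bytes for 4-byte XIDs, so the copy loop would write past the end of the workspace.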

#65osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Masahiko Sawada (#59)
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Thursday, July 14, 2022 10:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached an updated patch that incorporated comments from Amit and Shi.

Hi,

Minor comments for v4.

(1) typo in the commit message

"When decoding a COMMIT record, we check both the list and the ReorderBuffer to see if
if the transaction has modified catalogs."

There are two 'if's in succession in the last sentence of the second paragraph.

(2) The header comment for the spec test

+# Test that decoding only the commit record of the transaction that have
+# catalog-changed.

Rewording of this part looks required, because "test that ... " requires a complete sentence
after that, right ?

(3) SnapBuildRestore

snapshot_not_interesting:
if (ondisk.builder.committed.xip != NULL)
pfree(ondisk.builder.committed.xip);
return false;
}

Do we need to add pfree for ondisk.builder.catchange.xip after the 'snapshot_not_interesting' label ?

(4) SnapBuildPurgeOlderTxn

+               elog(DEBUG3, "purged catalog modifying transactions from %d to %d",
+                        (uint32) builder->catchange.xcnt, surviving_xids);

To make this part more aligned with existing codes,
probably we can have a look at another elog for debug in the same function.

We should use %u for casted xcnt & surviving_xids,
while adding a format for xmin if necessary ?

Best Regards,
Takamichi Osumi
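
Item (4) above is about matching printf-style conversions to their argument types: a value cast to an unsigned 32-bit integer goes with %u, not %d. The sketch below shows that pairing with snprintf standing in for elog. format_purge_message is a hypothetical helper, and the exact message wording, including the trailing xmin field, is an assumption rather than text from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a purge debug message. The counts are size_t and are cast to
 * a 32-bit unsigned type so that each %u has a matching argument.
 * Returns the value of snprintf().
 */
int
format_purge_message(char *buf, size_t buflen,
                     size_t before, size_t after, uint32_t xmin)
{
    return snprintf(buf, buflen,
                    "purged catalog modifying transactions from %u to %u, xmin: %u",
                    (unsigned int) before,
                    (unsigned int) after,
                    (unsigned int) xmin);
}
```

An alternative that avoids the casts is %zu for the size_t values; the casted %u form is shown here because it is what the review comment suggests.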

#66Masahiko Sawada
sawada.mshk@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#65)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Fri, Jul 15, 2022 at 10:43 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Thursday, July 14, 2022 10:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached an updated patch that incorporated comments from Amit and Shi.

Hi,

Minor comments for v4.

Thank you for the comments!

(1) typo in the commit message

"When decoding a COMMIT record, we check both the list and the ReorderBuffer to see if
if the transaction has modified catalogs."

There are two 'if's in succession in the last sentence of the second paragraph.

(2) The header comment for the spec test

+# Test that decoding only the commit record of the transaction that have
+# catalog-changed.

Rewording of this part looks required, because "test that ... " requires a complete sentence
after that, right ?

(3) SnapBuildRestore

snapshot_not_interesting:
if (ondisk.builder.committed.xip != NULL)
pfree(ondisk.builder.committed.xip);
return false;
}

Do we need to add pfree for ondisk.builder.catchange.xip after the 'snapshot_not_interesting' label ?

(4) SnapBuildPurgeOlderTxn

+               elog(DEBUG3, "purged catalog modifying transactions from %d to %d",
+                        (uint32) builder->catchange.xcnt, surviving_xids);

To make this part more aligned with existing codes,
probably we can have a look at another elog for debug in the same function.

We should use %u for casted xcnt & surviving_xids,
while adding a format for xmin if necessary ?

I agreed with all the above comments and incorporated them into the
updated patch.

This patch should have the fix for the issue that Shi yu reported. Shi
yu, could you please test it again with this patch?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

v5-0001-Add-catalog-modifying-transactions-to-logical-dec.patch (application/octet-stream)
From 7b2f8a7f730333ab2e7d587a38c8310cd567decf Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 6 Jul 2022 12:53:36 +0900
Subject: [PATCH v5] Add catalog modifying transactions to logical decoding
 serialized snapshot.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this change adds the list of transaction IDs and
sub-transaction IDs, that have modified catalogs and are running when
snapshot serialization, to the serialized snapshot. When decoding a
COMMIT record, we check both the list and the ReorderBuffer to see if
the transaction has modified catalogs.

Since this adds additional information to the serialized snapshot, we
cannot backpatch it. For back branches, we take another approach;
remember the last-running-xacts list of the first decoded
RUNNING_XACTS record and check if the transaction whose commit record
has XACT_XINFO_HAS_INVALS and whose XID is in the list. This doesn't
require any file format changes but the transaction will end up being
added to the snapshot even if it has only relcache invalidations.

This commit bumps SNAPBUILD_VERSION because of change in SnapBuild.
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 +++
 .../specs/catalog_change_snapshot.spec        |  39 +++
 .../replication/logical/reorderbuffer.c       |  69 ++++-
 src/backend/replication/logical/snapbuild.c   | 258 ++++++++++++------
 src/include/replication/reorderbuffer.h       |  12 +
 6 files changed, 336 insertions(+), 88 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..d79f9bb415
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 88a37fde72..d7f430623d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -349,6 +349,8 @@ ReorderBufferAllocate(void)
 	buffer->by_txn_last_xid = InvalidTransactionId;
 	buffer->by_txn_last_txn = NULL;
 
+	buffer->catchange_ntxns = 0;
+
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
 	buffer->size = 0;
@@ -366,6 +368,7 @@ ReorderBufferAllocate(void)
 
 	dlist_init(&buffer->toplevel_by_lsn);
 	dlist_init(&buffer->txns_by_base_snapshot_lsn);
+	dlist_init(&buffer->catchange_txns);
 
 	/*
 	 * Ensure there's no stale data from prior uses of this slot, in case some
@@ -1526,7 +1529,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
-	 * Remove TXN from its containing list.
+	 * Remove TXN from its containing lists.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
@@ -1535,6 +1538,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 */
 	dlist_delete(&txn->node);
 
+	if (rbtxn_has_catalog_changes(txn))
+	{
+		dlist_delete(&txn->catchange_node);
+		rb->catchange_ntxns--;
+
+		Assert(rb->catchange_ntxns >= 0);
+	}
+
 	/* now remove reference from buffer */
 	hash_search(rb->by_txn,
 				(void *) &txn->xid,
@@ -3275,10 +3286,16 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 								  XLogRecPtr lsn)
 {
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	if (!rbtxn_has_catalog_changes(txn))
+	{
+		txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &txn->catchange_node);
+		rb->catchange_ntxns++;
+	}
 
 	/*
 	 * Mark top-level transaction as having catalog changes too if one of its
@@ -3286,8 +3303,52 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	 * conveniently check just top-level transaction and decide whether to
 	 * build the hash table or not.
 	 */
-	if (txn->toptxn != NULL)
-		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	toptxn = txn->toptxn;
+	if (toptxn != NULL && !rbtxn_has_catalog_changes(toptxn))
+	{
+		toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &toptxn->catchange_node);
+		rb->catchange_ntxns++;
+	}
+}
+
+/*
+ * Return palloc'ed array of the transactions that have changed catalogs.
+ * The returned array is sorted in xidComparator order.
+ *
+ * The caller must free the returned array when done with it.
+ */
+TransactionId *
+ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb)
+{
+	dlist_iter iter;
+	TransactionId *xids = NULL;
+	size_t	xcnt = 0;
+
+	/* Quick return if the list is empty */
+	if (dlist_is_empty(&rb->catchange_txns))
+	{
+		Assert(rb->catchange_ntxns == 0);
+		return NULL;
+	}
+
+	/* Initialize XID array */
+	xids = (TransactionId *) palloc(sizeof(TransactionId) * rb->catchange_ntxns);
+	dlist_foreach(iter, &rb->catchange_txns)
+	{
+		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN,
+												catchange_node,
+												iter.cur);
+
+		Assert(rbtxn_has_catalog_changes(txn));
+
+		xids[xcnt++] = txn->xid;
+	}
+
+	qsort(xids, xcnt, sizeof(TransactionId), xidComparator);
+
+	Assert((xcnt > 0) && (xcnt == rb->catchange_ntxns));
+	return xids;
 }
 
 /*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 73c0f15214..f25c4a1157 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -241,6 +241,30 @@ struct SnapBuild
 		 */
 		TransactionId *xip;
 	}			committed;
+
+	/*
+	 * Array of transactions and subtransactions that had modified catalogs
+	 * and were running when the snapshot was serialized.
+	 *
+	 * We normally rely on HEAP2_NEW_CID and XLOG_XACT_INVALIDATIONS records to
+	 * know if the transaction has changed the catalog. But it could happen that
+	 * the logical decoding decodes only the commit record of the transaction.
+	 * This array keeps track of the transactions that have modified catalogs
+	 * and were running when serializing a snapshot, and this array is used to
+	 * add such transactions to the snapshot.
+	 *
+	 * This array is set once when restoring the snapshot, xids are removed
+	 * from the array when decoding xl_running_xacts record, and then eventually
+	 * becomes empty.
+	 */
+	struct
+	{
+		/* number of transactions */
+		size_t		xcnt;
+
+		/* This array must be sorted in xidComparator order */
+		TransactionId *xip;
+	}			catchange;
 };
 
 /*
@@ -250,8 +274,8 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/* ->committed and ->catchange manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -262,6 +286,8 @@ static void SnapBuildSnapIncRefcount(Snapshot snap);
 
 static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
+static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid);
+
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
 static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
@@ -269,6 +295,7 @@ static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutof
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
 static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path);
 
 /*
  * Allocate a new snapshot builder.
@@ -306,6 +333,9 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 		palloc0(builder->committed.xcnt_space * sizeof(TransactionId));
 	builder->committed.includes_all_transactions = true;
 
+	builder->catchange.xcnt = 0;
+	builder->catchange.xip = NULL;
+
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
@@ -888,12 +918,15 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed or containing catalog
+ * changes that are smaller than ->xmin. Those won't ever get checked via
+ * the ->committed or ->catchange array, respectively. The committed xids will
+ * get checked via the clog machinery. We can ideally remove the transaction
+ * from catchange array once it is finished (committed/aborted) but that could
+ * be costly as we need to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -928,6 +961,37 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* purge xids in ->catchange as well */
+	if (builder->catchange.xcnt > 0)
+	{
+		/*
+		 * Since catchange.xip is sorted, we find the lower bound of
+		 * xids that still are interesting.
+		 */
+		for (off = 0; off < builder->catchange.xcnt; off++)
+		{
+			if (TransactionIdFollowsOrEquals(builder->catchange.xip[off],
+											 builder->xmin))
+				break;
+		}
+
+		surviving_xids = builder->catchange.xcnt - off;
+		if (surviving_xids > 0)
+			memmove(builder->catchange.xip, &(builder->catchange.xip[off]),
+					surviving_xids * sizeof(TransactionId));
+		else
+		{
+			/* catchange list becomes empty */
+			pfree(builder->catchange.xip);
+			builder->catchange.xip = NULL;
+		}
+
+		elog(DEBUG3, "purged catalog modifying transactions from %u to %u, xmin %u",
+			 (uint32) builder->catchange.xcnt, (uint32) surviving_xids,
+			builder->xmin);
+		builder->catchange.xcnt = surviving_xids;
+	}
 }
 
 /*
@@ -983,7 +1047,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		 * Add subtransaction to base snapshot if catalog modifying, we don't
 		 * distinguish to toplevel transactions there.
 		 */
-		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (SnapBuildXidHasCatalogChanges(builder, subxid))
 		{
 			sub_needs_timetravel = true;
 			needs_snapshot = true;
@@ -1012,7 +1076,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 
 	/* if top-level modified catalog, it'll need a snapshot */
-	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (SnapBuildXidHasCatalogChanges(builder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes",
 			 xid);
@@ -1089,6 +1153,21 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 }
 
+/*
+ * Check both the reorder buffer and the snapshot to see if the given
+ * transaction has modified catalogs.
+ */
+static inline bool
+SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid)
+{
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return true;
+
+	/* Check the catchange XID array */
+	return ((builder->catchange.xcnt > 0) &&
+			(bsearch(&xid, builder->catchange.xip, builder->catchange.xcnt,
+					 sizeof(TransactionId), xidComparator) != NULL));
+}
 
 /* -----------------------------------
  * Snapshot building functions dealing with xlog records
@@ -1135,7 +1214,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1438,6 +1517,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
  *
  * struct SnapBuildOnDisk;
  * TransactionId * committed.xcnt; (*not xcnt_space*)
+ * TransactionId * catchange.xcnt;
  *
  */
 typedef struct SnapBuildOnDisk
@@ -1467,7 +1547,7 @@ typedef struct SnapBuildOnDisk
 	offsetof(SnapBuildOnDisk, version)
 
 #define SNAPBUILD_MAGIC 0x51A1E001
-#define SNAPBUILD_VERSION 4
+#define SNAPBUILD_VERSION 5
 
 /*
  * Store/Load a snapshot from disk, depending on the snapshot builder's state.
@@ -1493,6 +1573,8 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 {
 	Size		needed_length;
 	SnapBuildOnDisk *ondisk = NULL;
+	TransactionId	*catchange_xip = NULL;
+	size_t		catchange_xcnt;
 	char	   *ondisk_c;
 	int			fd;
 	char		tmppath[MAXPGPATH];
@@ -1578,8 +1660,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 				(errcode_for_file_access(),
 				 errmsg("could not remove file \"%s\": %m", tmppath)));
 
+	/* Get the catalog modifying transactions that are not yet committed */
+	catchange_xip = ReorderBufferGetCatalogChangesXacts(builder->reorder);
+	catchange_xcnt = builder->reorder->catchange_ntxns;
+
 	needed_length = sizeof(SnapBuildOnDisk) +
-		sizeof(TransactionId) * builder->committed.xcnt;
+		sizeof(TransactionId) * (builder->committed.xcnt + catchange_xcnt);
 
 	ondisk_c = MemoryContextAllocZero(builder->context, needed_length);
 	ondisk = (SnapBuildOnDisk *) ondisk_c;
@@ -1598,16 +1684,31 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	ondisk->builder.snapshot = NULL;
 	ondisk->builder.reorder = NULL;
 	ondisk->builder.committed.xip = NULL;
+	ondisk->builder.catchange.xip = NULL;
+	/* update catchange only on disk data */
+	ondisk->builder.catchange.xcnt = catchange_xcnt;
 
 	COMP_CRC32C(ondisk->checksum,
 				&ondisk->builder,
 				sizeof(SnapBuild));
 
 	/* copy committed xacts */
-	sz = sizeof(TransactionId) * builder->committed.xcnt;
-	memcpy(ondisk_c, builder->committed.xip, sz);
-	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
-	ondisk_c += sz;
+	if (builder->committed.xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * builder->committed.xcnt;
+		memcpy(ondisk_c, builder->committed.xip, sz);
+		COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+		ondisk_c += sz;
+	}
+
+	/* copy catalog modifying xacts */
+	if (catchange_xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * catchange_xcnt;
+		memcpy(ondisk_c, catchange_xip, sz);
+		COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+		ondisk_c += sz;
+	}
 
 	FIN_CRC32C(ondisk->checksum);
 
@@ -1694,6 +1795,8 @@ out:
 	/* be tidy */
 	if (ondisk)
 		pfree(ondisk);
+	if (catchange_xip)
+		pfree(catchange_xip);
 }
 
 /*
@@ -1707,7 +1810,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	int			fd;
 	char		path[MAXPGPATH];
 	Size		sz;
-	int			readBytes;
 	pg_crc32c	checksum;
 
 	/* no point in loading a snapshot if we're already there */
@@ -1739,29 +1841,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
 
 	/* read statically sized portion of snapshot */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk, SnapBuildOnDiskConstantSize);
-	pgstat_report_wait_end();
-	if (readBytes != SnapBuildOnDiskConstantSize)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes,
-							(Size) SnapBuildOnDiskConstantSize)));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk, SnapBuildOnDiskConstantSize, path);
 
 	if (ondisk.magic != SNAPBUILD_MAGIC)
 		ereport(ERROR,
@@ -1781,56 +1861,26 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 				SnapBuildOnDiskConstantSize - SnapBuildOnDiskNotChecksummedSize);
 
 	/* read SnapBuild */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk.builder, sizeof(SnapBuild));
-	pgstat_report_wait_end();
-	if (readBytes != sizeof(SnapBuild))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sizeof(SnapBuild))));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk.builder, sizeof(SnapBuild), path);
 	COMP_CRC32C(checksum, &ondisk.builder, sizeof(SnapBuild));
 
 	/* restore committed xacts information */
-	sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
-	ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, ondisk.builder.committed.xip, sz);
-	pgstat_report_wait_end();
-	if (readBytes != sz)
+	if (ondisk.builder.committed.xcnt > 0)
 	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
+		sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
+		ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
+		SnapBuildRestoreContents(fd, (char *) ondisk.builder.committed.xip, sz, path);
+		COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
+	}
 
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sz)));
+	/* restore catalog modifying xacts information */
+	if (ondisk.builder.catchange.xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * ondisk.builder.catchange.xcnt;
+		ondisk.builder.catchange.xip = MemoryContextAllocZero(builder->context, sz);
+		SnapBuildRestoreContents(fd, (char *) ondisk.builder.catchange.xip, sz, path);
+		COMP_CRC32C(checksum, ondisk.builder.catchange.xip, sz);
 	}
-	COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
 
 	if (CloseTransientFile(fd) != 0)
 		ereport(ERROR,
@@ -1885,6 +1935,14 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	ondisk.builder.committed.xip = NULL;
 
+	/* set catalog modifying transactions */
+	if (builder->catchange.xip)
+		pfree(builder->catchange.xip);
+	builder->catchange.xcnt = ondisk.builder.catchange.xcnt;
+	builder->catchange.xip = ondisk.builder.catchange.xip;
+
+	ondisk.builder.catchange.xip = NULL;
+
 	/* our snapshot is not interesting anymore, build a new one */
 	if (builder->snapshot != NULL)
 	{
@@ -1906,9 +1964,43 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 snapshot_not_interesting:
 	if (ondisk.builder.committed.xip != NULL)
 		pfree(ondisk.builder.committed.xip);
+	if (ondisk.builder.catchange.xip != NULL)
+		pfree(ondisk.builder.catchange.xip);
 	return false;
 }
 
+/*
+ * Read the contents of the serialized snapshot to the dest.
+ */
+static void
+SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
+{
+	int			readBytes;
+
+	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+	readBytes = read(fd, dest, size);
+	pgstat_report_wait_end();
+	if (readBytes != size)
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+
+		if (readBytes < 0)
+		{
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("could not read file \"%s\": read %d of %zu",
+							path, readBytes, size)));
+	}
+}
+
 /*
  * Remove all serialized snapshots that are not required anymore because no
  * slot can need them. This doesn't actually have to run during a checkpoint,
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d109d0baed..fd84f175c0 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -380,6 +380,11 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	node;
 
+	/*
+	 * A node in the list of catalog modifying transactions
+	 */
+	dlist_node	catchange_node;
+
 	/*
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
@@ -526,6 +531,12 @@ struct ReorderBuffer
 	 */
 	dlist_head	txns_by_base_snapshot_lsn;
 
+	/*
+	 * Transactions and subtransactions that have modified system catalogs.
+	 */
+	dlist_head	catchange_txns;
+	int			catchange_ntxns;
+
 	/*
 	 * one-entry sized cache for by_txn. Very frequently the same txn gets
 	 * looked up over and over again.
@@ -677,6 +688,7 @@ extern void ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
 extern void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 extern ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 extern TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
+extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
-- 
2.24.3 (Apple Git-128)
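
The lookup and purge that the patch above performs on the catchange array can be sketched in isolation. This is only an illustration under stated assumptions, not the patch's code: `Xid`, `xid_cmp`, `xid_in_sorted_array`, `xid_follows_or_equals` and `purge_older_than` are hypothetical stand-ins for TransactionId, xidComparator, the bsearch call in SnapBuildXidHasCatalogChanges, TransactionIdFollowsOrEquals (normal XIDs only) and the catchange part of SnapBuildPurgeOlderTxn.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;			/* hypothetical stand-in for TransactionId */

/* Plain numeric ordering, with no wraparound logic (assumed comparator). */
static int
xid_cmp(const void *a, const void *b)
{
	Xid			xa = *(const Xid *) a;
	Xid			xb = *(const Xid *) b;

	if (xa > xb)
		return 1;
	if (xa < xb)
		return -1;
	return 0;
}

/* True if xid is present in the sorted array xip[0..xcnt). */
static bool
xid_in_sorted_array(const Xid *xip, size_t xcnt, Xid xid)
{
	assert(xip != NULL || xcnt == 0);

	return xcnt > 0 &&
		bsearch(&xid, xip, xcnt, sizeof(Xid), xid_cmp) != NULL;
}

/* Wraparound-aware "a >= b", simplified to normal XIDs only. */
static bool
xid_follows_or_equals(Xid a, Xid b)
{
	return (int32_t) (a - b) >= 0;
}

/*
 * Drop the leading entries that are older than xmin and shift the
 * survivors to the front. Returns the new element count.
 */
static size_t
purge_older_than(Xid *xip, size_t xcnt, Xid xmin)
{
	size_t		off;

	/* The array is sorted, so stop at the first entry still of interest. */
	for (off = 0; off < xcnt; off++)
	{
		if (xid_follows_or_equals(xip[off], xmin))
			break;
	}

	if (xcnt - off > 0)
		memmove(xip, &xip[off], (xcnt - off) * sizeof(Xid));

	return xcnt - off;
}
```

A caller would sort the array once with qsort and xid_cmp, query it with xid_in_sorted_array at each commit, and call purge_older_than whenever xmin advances. Keeping the array sorted is what lets the purge stop at the first surviving entry.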

#67Masahiko Sawada
sawada.mshk@gmail.com
In reply to: shiy.fnst@fujitsu.com (#64)
6 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Fri, Jul 15, 2022 at 3:32 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

On Mon, Jul 11, 2022 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached an updated patch, please review it.

Thanks for your patch. Here are some comments for the REL14-v1 patch.

1.
+ Size sz = sizeof(TransactionId) * nxacts;;

There is a redundant semicolon at the end.

2.
+ workspace = MemoryContextAlloc(rb->context, rb->n_initial_running_xacts);

Should it be:
+ workspace = MemoryContextAlloc(rb->context, sizeof(TransactionId) * rb->n_initial_running_xacts);

3.
+       /* bound check if there is at least one transaction to be removed */
+       if (NormalTransactionIdPrecedes(rb->initial_running_xacts[0],
+                                                                       running->oldestRunningXid))
+               return;
+

Here, I think it should return if rb->initial_running_xacts[0] is older than
oldestRunningXid, right? Should it be changed to:

+       if (!NormalTransactionIdPrecedes(rb->initial_running_xacts[0],
+                                                                       running->oldestRunningXid))
+               return;

4.
+ if ((parsed->xinfo & XACT_XINFO_HAS_INVALS) != 0)

Maybe we can change it like the following, to be consistent with other places in
this file. It's also fine if you don't change it.

+ if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
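
Comments 2 and 3 can be illustrated with a small standalone sketch. It is not the REL14 patch code: `Xid`, `xid_precedes` and `purge_initial_xacts` are hypothetical names, malloc stands in for MemoryContextAlloc, and the array is assumed to be sorted in ascending order. It shows the scratch buffer sized in bytes and the early return taken only when the oldest entry is not older than the cutoff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;			/* hypothetical stand-in for TransactionId */

/* True if a is older than b (modular comparison, normal XIDs only). */
static bool
xid_precedes(Xid a, Xid b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Remove XIDs older than oldest_running from the sorted array
 * xacts[0..*n). Returns false only if the scratch allocation fails.
 */
static bool
purge_initial_xacts(Xid *xacts, size_t *n, Xid oldest_running)
{
	Xid		   *workspace;
	size_t		surviving = 0;

	assert(xacts != NULL && n != NULL);

	if (*n == 0)
		return true;

	/* Nothing to remove unless the oldest entry precedes the cutoff. */
	if (!xid_precedes(xacts[0], oldest_running))
		return true;

	/* Size the scratch buffer in bytes, not in number of elements. */
	workspace = malloc(sizeof(Xid) * *n);
	if (workspace == NULL)
		return false;

	for (size_t i = 0; i < *n; i++)
	{
		if (!xid_precedes(xacts[i], oldest_running))
			workspace[surviving++] = xacts[i];
	}

	memcpy(xacts, workspace, sizeof(Xid) * surviving);
	free(workspace);
	*n = surviving;
	return true;
}
```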

Thank you for the comments!

I've attached patches for all supported branches including the master.
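
For readers skimming the back-branch patches, the decision made at commit time can be sketched roughly as follows. This is a simplified illustration, not the attached code: `SKETCH_XINFO_HAS_INVALS`, `Xid`, `xid_cmp` and `should_mark_catalog_change` are hypothetical names, only the top-level XID is checked, and the list of initial running transactions is assumed to be sorted in plain numeric order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Xid;			/* hypothetical stand-in for TransactionId */

/* Illustrative flag bit standing in for XACT_XINFO_HAS_INVALS. */
#define SKETCH_XINFO_HAS_INVALS (1U << 3)

/* Plain numeric ordering, matching the assumed sort order of the list. */
static int
xid_cmp(const void *a, const void *b)
{
	Xid			xa = *(const Xid *) a;
	Xid			xb = *(const Xid *) b;

	if (xa > xb)
		return 1;
	if (xa < xb)
		return -1;
	return 0;
}

/*
 * Decide whether a committing transaction should be treated as having
 * modified catalogs: its commit record carries invalidations and its
 * XID was running in the first decoded running-xacts record.
 */
static bool
should_mark_catalog_change(uint32_t xinfo, Xid xid,
						   const Xid *initial, size_t ninitial)
{
	assert(initial != NULL || ninitial == 0);

	if ((xinfo & SKETCH_XINFO_HAS_INVALS) == 0)
		return false;

	return ninitial > 0 &&
		bsearch(&xid, initial, ninitial, sizeof(Xid), xid_cmp) != NULL;
}
```

Because the invalidation flag alone cannot tell catalog changes apart from relcache-only invalidations, a check of this shape can return true for a transaction that never touched a catalog.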

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

REL13-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch (application/octet-stream)
From 02e669a57d782b4488860ace7167bcff09af397f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 11 Jul 2022 21:49:06 +0900
Subject: [PATCH v6] Fix catalog lookup with the wrong snapshot during logical
 decoding.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this changes the reorder buffer so that it
remembers the initial running transactions written in the
xl_running_xacts record that we decoded first, and marks a
transaction as containing catalog changes if it’s in the list of the
initial running transactions and its commit record has
XACT_XINFO_HAS_INVALS.

This can produce false positives: we could end up adding a transaction
that didn't change the catalog to the snapshot, since we cannot tell
whether the transaction has catalog changes only by checking the
COMMIT record. It doesn’t have the information on which (sub)
transaction has catalog changes, and XACT_XINFO_HAS_INVALS doesn't
necessarily indicate that the transaction has catalog changes. But
this isn't a problem since we use the historic snapshot only for
reading system catalogs.

On the master branch, we took a more future-proof approach of writing
catalog modifying transactions to the serialized snapshot. But we
cannot backpatch it because of change in SnapBuild.

Back-patch to all supported releases.
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 +++++++
 .../specs/catalog_change_snapshot.spec        |  39 ++++++
 src/backend/replication/logical/decode.c      |  17 +++
 .../replication/logical/reorderbuffer.c       | 116 ++++++++++++++++++
 src/include/replication/reorderbuffer.h       |  36 ++++++
 6 files changed, 253 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c582a5..6ec09ab192 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of the transaction that have
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5a2b828aa3..e2fa5ae6b5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -316,6 +316,9 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			{
 				xl_running_xacts *running = (xl_running_xacts *) XLogRecGetData(r);
 
+				/* Process the initial running transactions, if any */
+				ReorderBufferProcessInitialXacts(ctx->reorder, running);
+
 				SnapBuildProcessRunningXacts(builder, buf->origptr, running);
 
 				/*
@@ -572,6 +575,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. We check if it's in the list of the initial running transactions
+	 * and then mark it as containing catalog change.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * catalog change transactions to the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		ReorderBufferInitialXactsSetCatalogChanges(ctx->reorder, xid,
+												   parsed->nsubxacts,
+												   parsed->subxacts,
+												   buf->origptr);
+
 	/*
 	 * Process invalidation messages, even if we're not interested in the
 	 * transaction's contents, since the various caches need to always be
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ef8c2ea6df..e6905d9264 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -318,6 +318,9 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	buffer->initial_running_xacts = NULL;
+	buffer->n_initial_running_xacts = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3859,3 +3862,116 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Process the transactions in xl_running_xacts record, and remember the
+ * transactions first and later remove those that aren't needed anymore.
+ *
+ * We can ideally remove the transactions from the initial running xacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
+ */
+void
+ReorderBufferProcessInitialXacts(ReorderBuffer *rb, xl_running_xacts *running)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild  *builder = ctx->snapshot_builder;
+	TransactionId *workspace;
+	int			surviving_xids = 0;
+
+	/* Build the initial running transactions list for the first call */
+	if (unlikely(SnapBuildCurrentState(builder) == SNAPBUILD_START))
+	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		Assert(rb->n_initial_running_xacts == 0);
+
+		rb->n_initial_running_xacts = nxacts;
+		rb->initial_running_xacts = MemoryContextAlloc(rb->context, sz);
+		memcpy(rb->initial_running_xacts, running->xids, sz);
+		qsort(rb->initial_running_xacts, nxacts, sizeof(TransactionId),
+			  xidComparator);
+
+		return;
+	}
+
+	/* Quick exit if there are no initial running transactions */
+	if (likely(rb->n_initial_running_xacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(rb->initial_running_xacts[0],
+									 running->oldestRunningXid))
+		return;
+
+	/*
+	 * Remove transactions that have already been processed and that we no
+	 * longer need to keep track of.
+	 *
+	 * The purged array must also be sorted in xidComparator order.
+	 */
+	workspace = MemoryContextAlloc(rb->context,
+								   rb->n_initial_running_xacts * sizeof(TransactionId));
+	for (int i = 0; i < rb->n_initial_running_xacts; i++)
+	{
+		if (NormalTransactionIdPrecedes(rb->initial_running_xacts[i],
+										running->oldestRunningXid))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = rb->initial_running_xacts[i];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(rb->initial_running_xacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(rb->initial_running_xacts);
+		rb->initial_running_xacts = NULL;
+	}
+
+	elog(DEBUG3, "purged catalog modifying transactions from %u to %u, oldest running xid %u",
+		 (uint32) rb->n_initial_running_xacts,
+		 (uint32) surviving_xids,
+		 running->oldestRunningXid);
+
+	rb->n_initial_running_xacts = surviving_xids;
+	pfree(workspace);
+}
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark both it
+ * and its subtransactions as containing catalog changes if not yet marked.
+ */
+void
+ReorderBufferInitialXactsSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+										   int subxcnt, TransactionId *subxacts,
+										   XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (likely(rb->n_initial_running_xacts == 0 ||
+			   ReorderBufferXidHasCatalogChanges(rb, xid)))
+		return;
+
+	/*
+	 * If this committed transaction is the one that was running at the time
+	 * when decoding the first RUNNING_XACTS record and have done catalog
+	 * changes, we can mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 */
+	if (bsearch(&xid, rb->initial_running_xacts, rb->n_initial_running_xacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(rb, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(rb, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(rb, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5347597e92..d90640ead8 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -12,6 +12,7 @@
 #include "access/htup_details.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -408,6 +409,35 @@ struct ReorderBuffer
 
 	XLogRecPtr	current_restart_decoding_lsn;
 
+	/*
+	 * Array of transactions and subtransactions that were running when
+	 * the xl_running_xacts record that we decoded first was written.
+	 * The array is sorted in xidComparator order. Xids are removed from
+	 * the array when decoding xl_running_xacts records, and the array
+	 * eventually becomes empty.
+	 *
+	 * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+	 * if the transaction has changed the catalog, and that information
+	 * is not serialized to SnapBuilder. Therefore, if the logical
+	 * decoding decodes the commit record of the transaction that actually
+	 * has done catalog changes without these records, we fail to add
+	 * the xid to the snapshot, and end up looking at catalogs with the
+	 * wrong snapshot. To avoid this problem, if the COMMIT record of
+	 * the xid listed in initial_running_xacts has XACT_XINFO_HAS_INVALS
+	 * flag, we mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 *
+	 * We could end up adding the transaction that didn't change catalog
+	 * to the snapshot since we cannot distinguish whether the transaction
+	 * has catalog changes only by checking the COMMIT record. It doesn't
+	 * have the information on which (sub) transaction has catalog changes,
+	 * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+	 * transaction has catalog change. But it doesn't become a problem since
+	 * we use historic snapshot only for reading system catalogs.
+	 */
+	TransactionId *initial_running_xacts;
+	int n_initial_running_xacts;
+
 	/* buffer for disk<->memory conversions */
 	char	   *outbuf;
 	Size		outbufsize;
@@ -465,4 +495,10 @@ void		ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
 void		StartupReorderBuffer(void);
 
+void		ReorderBufferProcessInitialXacts(ReorderBuffer *rb,
+											 xl_running_xacts *running);
+void		ReorderBufferInitialXactsSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+													   int subxcnt,
+													   TransactionId *subxacts,
+													   XLogRecPtr lsn);
 #endif
-- 
2.24.3 (Apple Git-128)
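
For readers following the patch above, the purge step in ReorderBufferProcessInitialXacts() can be modeled outside the server roughly as below. This is only a standalone sketch, not the patch's code: Xid, xid_cmp, xid_precedes and prune_initial_xacts are hypothetical names, Xid stands in for TransactionId, and the wraparound-aware comparison is assumed to behave like NormalTransactionIdPrecedes (signed difference of two normal xids). Unlike the patch, which copies survivors through a workspace array, the sketch compacts the array in place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Xid;			/* hypothetical stand-in for TransactionId */

/* Plain numeric ordering, used to sort the array before pruning. */
static int
xid_cmp(const void *a, const void *b)
{
	Xid			xa = *(const Xid *) a;
	Xid			xb = *(const Xid *) b;

	return (xa > xb) - (xa < xb);
}

/* Wraparound-aware "a is older than b"; assumed valid for normal xids only. */
static int
xid_precedes(Xid a, Xid b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Drop every xid older than oldest_running from the array, in place.
 * Survivors keep their relative order, so a sorted array stays sorted.
 * Returns the number of xids that remain.
 */
static size_t
prune_initial_xacts(Xid *xids, size_t n, Xid oldest_running)
{
	size_t		kept = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (!xid_precedes(xids[i], oldest_running))
			xids[kept++] = xids[i];
	}
	return kept;
}
```

A caller would sort the xids taken from the first RUNNING_XACTS record once with qsort() and xid_cmp, then call prune_initial_xacts() with the oldestRunningXid of each later record until the returned length reaches zero.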

Attachment: REL12-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch (application/octet-stream)
From ef780b39be74ca91ed94031451a823268c1e05ee Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Sun, 17 Jul 2022 07:19:00 +0900
Subject: [PATCH v6] Fix catalog lookup with the wrong snapshot during logical
 decoding.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this changes the reorder buffer so that it
remembers the initial running transactions written in the
xl_running_xacts record that we decoded first, and marks a
transaction as containing catalog changes if it’s in the list of the
initial running transactions and its commit record has
XACT_XINFO_HAS_INVALS.

This can produce false positives; we could end up adding a transaction
that didn't change the catalog to the snapshot, since we cannot
distinguish whether the transaction has catalog changes only by
checking the COMMIT record. It doesn’t have the information on which
(sub) transaction has catalog changes, and XACT_XINFO_HAS_INVALS
doesn't necessarily indicate that the transaction has catalog changes.
But this is not a problem since we use the historic snapshot only for
reading system catalogs.

On the master branch, we took a more future-proof approach of writing
catalog modifying transactions to the serialized snapshot. But we
cannot backpatch it because of the change in SnapBuild.

Back-patch to all supported releases.
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 +++++++
 .../specs/catalog_change_snapshot.spec        |  39 ++++++
 src/backend/replication/logical/decode.c      |  17 +++
 .../replication/logical/reorderbuffer.c       | 116 ++++++++++++++++++
 src/include/replication/reorderbuffer.h       |  36 ++++++
 6 files changed, 253 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c582a5..6ec09ab192 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 60d07ce4eb..94983644a1 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -319,6 +319,9 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			{
 				xl_running_xacts *running = (xl_running_xacts *) XLogRecGetData(r);
 
+				/* Process the initial running transactions, if any */
+				ReorderBufferProcessInitialXacts(ctx->reorder, running);
+
 				SnapBuildProcessRunningXacts(builder, buf->origptr, running);
 
 				/*
@@ -575,6 +578,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. We check if it's in the list of the initial running transactions
+	 * and then mark it as containing catalog change.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * catalog change transactions to the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		ReorderBufferInitialXactsSetCatalogChanges(ctx->reorder, xid,
+												   parsed->nsubxacts,
+												   parsed->subxacts,
+												   buf->origptr);
+
 	/*
 	 * Process invalidation messages, even if we're not interested in the
 	 * transaction's contents, since the various caches need to always be
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 475f76fa5e..314111dcd2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -279,6 +279,9 @@ ReorderBufferAllocate(void)
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
 
+	buffer->initial_running_xacts = NULL;
+	buffer->n_initial_running_xacts = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3593,3 +3596,116 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Process the transactions in xl_running_xacts record, and remember the
+ * transactions first and later remove those that aren't needed anymore.
+ *
+ * We can ideally remove the transactions from the initial running xacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
+ */
+void
+ReorderBufferProcessInitialXacts(ReorderBuffer *rb, xl_running_xacts *running)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild  *builder = ctx->snapshot_builder;
+	TransactionId *workspace;
+	int			surviving_xids = 0;
+
+	/* Build the initial running transactions list for the first call */
+	if (unlikely(SnapBuildCurrentState(builder) == SNAPBUILD_START))
+	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		Assert(rb->n_initial_running_xacts == 0);
+
+		rb->n_initial_running_xacts = nxacts;
+		rb->initial_running_xacts = MemoryContextAlloc(rb->context, sz);
+		memcpy(rb->initial_running_xacts, running->xids, sz);
+		qsort(rb->initial_running_xacts, nxacts, sizeof(TransactionId),
+			  xidComparator);
+
+		return;
+	}
+
+	/* Quick exit if there are no initial running transactions */
+	if (likely(rb->n_initial_running_xacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(rb->initial_running_xacts[0],
+									 running->oldestRunningXid))
+		return;
+
+	/*
+	 * Remove transactions that have already been processed and that we no
+	 * longer need to keep track of.
+	 *
+	 * The purged array must also be sorted in xidComparator order.
+	 */
+	workspace = MemoryContextAlloc(rb->context,
+								   rb->n_initial_running_xacts * sizeof(TransactionId));
+	for (int i = 0; i < rb->n_initial_running_xacts; i++)
+	{
+		if (NormalTransactionIdPrecedes(rb->initial_running_xacts[i],
+										running->oldestRunningXid))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = rb->initial_running_xacts[i];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(rb->initial_running_xacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(rb->initial_running_xacts);
+		rb->initial_running_xacts = NULL;
+	}
+
+	elog(DEBUG3, "purged catalog modifying transactions from %u to %u, oldest running xid %u",
+		 (uint32) rb->n_initial_running_xacts,
+		 (uint32) surviving_xids,
+		 running->oldestRunningXid);
+
+	rb->n_initial_running_xacts = surviving_xids;
+	pfree(workspace);
+}
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark both it
+ * and its subtransactions as containing catalog changes if not yet marked.
+ */
+void
+ReorderBufferInitialXactsSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+										   int subxcnt, TransactionId *subxacts,
+										   XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (likely(rb->n_initial_running_xacts == 0 ||
+			   ReorderBufferXidHasCatalogChanges(rb, xid)))
+		return;
+
+	/*
+	 * If this committed transaction is the one that was running at the time
+	 * when decoding the first RUNNING_XACTS record and have done catalog
+	 * changes, we can mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 */
+	if (bsearch(&xid, rb->initial_running_xacts, rb->n_initial_running_xacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(rb, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(rb, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(rb, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bc97b08a90..d6466a4e20 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -12,6 +12,7 @@
 #include "access/htup_details.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -385,6 +386,35 @@ struct ReorderBuffer
 
 	XLogRecPtr	current_restart_decoding_lsn;
 
+	/*
+	 * Array of transactions and subtransactions that were running when
+	 * the xl_running_xacts record that we decoded first was written.
+	 * The array is sorted in xidComparator order. Xids are removed from
+	 * the array when decoding xl_running_xacts records, and the array
+	 * eventually becomes empty.
+	 *
+	 * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+	 * if the transaction has changed the catalog, and that information
+	 * is not serialized to SnapBuilder. Therefore, if the logical
+	 * decoding decodes the commit record of the transaction that actually
+	 * has done catalog changes without these records, we fail to add
+	 * the xid to the snapshot, and end up looking at catalogs with the
+	 * wrong snapshot. To avoid this problem, if the COMMIT record of
+	 * the xid listed in initial_running_xacts has XACT_XINFO_HAS_INVALS
+	 * flag, we mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 *
+	 * We could end up adding the transaction that didn't change catalog
+	 * to the snapshot since we cannot distinguish whether the transaction
+	 * has catalog changes only by checking the COMMIT record. It doesn't
+	 * have the information on which (sub) transaction has catalog changes,
+	 * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+	 * transaction has catalog change. But it doesn't become a problem since
+	 * we use historic snapshot only for reading system catalogs.
+	 */
+	TransactionId *initial_running_xacts;
+	int n_initial_running_xacts;
+
 	/* buffer for disk<->memory conversions */
 	char	   *outbuf;
 	Size		outbufsize;
@@ -439,4 +469,10 @@ void		ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
 void		StartupReorderBuffer(void);
 
+void		ReorderBufferProcessInitialXacts(ReorderBuffer *rb,
+											 xl_running_xacts *running);
+void		ReorderBufferInitialXactsSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+													   int subxcnt,
+													   TransactionId *subxacts,
+													   XLogRecPtr lsn);
 #endif
-- 
2.24.3 (Apple Git-128)
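
The commit-time decision made by the DecodeCommit() hunk together with ReorderBufferInitialXactsSetCatalogChanges() can be summarized with a small standalone model. It is a sketch under assumptions, not server code: Xid, COMMIT_HAS_INVALS, xid_cmp and should_mark_catalog_change are hypothetical names, the flag bit value is made up to stand in for XACT_XINFO_HAS_INVALS, and the already_marked argument stands in for the result of ReorderBufferXidHasCatalogChanges().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Xid;				/* hypothetical stand-in for TransactionId */

/* Hypothetical flag bit; stands in for XACT_XINFO_HAS_INVALS. */
#define COMMIT_HAS_INVALS 0x01

/* Plain numeric ordering; the array must be sorted with the same order. */
static int
xid_cmp(const void *a, const void *b)
{
	Xid			xa = *(const Xid *) a;
	Xid			xb = *(const Xid *) b;

	return (xa > xb) - (xa < xb);
}

/*
 * Decide whether a committing transaction should be treated as catalog
 * modifying even though none of its NEW_CID records were decoded.
 * initial_xacts must be sorted in xid_cmp order.
 */
static bool
should_mark_catalog_change(const Xid *initial_xacts, size_t n_initial,
						   Xid xid, uint32_t xinfo, bool already_marked)
{
	/* No invalidation messages in the commit record: nothing to do. */
	if ((xinfo & COMMIT_HAS_INVALS) == 0)
		return false;

	/* Nothing remembered, or the transaction is already marked. */
	if (n_initial == 0 || already_marked)
		return false;

	/* Mark only transactions that were running at the first record. */
	return bsearch(&xid, initial_xacts, n_initial,
				   sizeof(Xid), xid_cmp) != NULL;
}
```

The model also shows where the false positive described in the commit message comes from: any transaction in the initial list whose commit record carries invalidation messages is marked, whether or not it actually changed a catalog.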

Attachment: REL14-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch (application/octet-stream)
From 7fddf151b27e185cf27ae0f6d1a965a0b928c22a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 11 Jul 2022 21:49:06 +0900
Subject: [PATCH v6] Fix catalog lookup with the wrong snapshot during logical
 decoding.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this changes the reorder buffer so that it
remembers the initial running transactions written in the
xl_running_xacts record that we decoded first, and marks a
transaction as containing catalog changes if it’s in the list of the
initial running transactions and its commit record has
XACT_XINFO_HAS_INVALS.

This can produce false positives; we could end up adding a transaction
that didn't change the catalog to the snapshot, since we cannot
distinguish whether the transaction has catalog changes only by
checking the COMMIT record. It doesn’t have the information on which
(sub) transaction has catalog changes, and XACT_XINFO_HAS_INVALS
doesn't necessarily indicate that the transaction has catalog changes.
But this is not a problem since we use the historic snapshot only for
reading system catalogs.

On the master branch, we took a more future-proof approach of writing
catalog modifying transactions to the serialized snapshot. But we
cannot backpatch it because of the change in SnapBuild.

Back-patch to all supported releases.
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 +++++++
 .../specs/catalog_change_snapshot.spec        |  39 ++++++
 src/backend/replication/logical/decode.c      |  17 +++
 .../replication/logical/reorderbuffer.c       | 116 ++++++++++++++++++
 src/include/replication/reorderbuffer.h       |  36 ++++++
 6 files changed, 253 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b879..4553252d75 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot
+	twophase_snapshot catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 92dfafc632..8d72c5af1f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -407,6 +407,9 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			{
 				xl_running_xacts *running = (xl_running_xacts *) XLogRecGetData(r);
 
+				/* Process the initial running transactions, if any */
+				ReorderBufferProcessInitialXacts(ctx->reorder, running);
+
 				SnapBuildProcessRunningXacts(builder, buf->origptr, running);
 
 				/*
@@ -691,6 +694,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. We check if it's in the list of the initial running transactions
+	 * and then mark it as containing catalog change.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * catalog change transactions to the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		ReorderBufferInitialXactsSetCatalogChanges(ctx->reorder, xid,
+												   parsed->nsubxacts,
+												   parsed->subxacts,
+												   buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e59d1396b5..bef9fb9bb2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -346,6 +346,9 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	buffer->initial_running_xacts = NULL;
+	buffer->n_initial_running_xacts = 0;
+
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
 	buffer->spillBytes = 0;
@@ -5154,3 +5157,116 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Process the transactions in xl_running_xacts record, and remember the
+ * transactions first and later remove those that aren't needed anymore.
+ *
+ * We can ideally remove the transactions from the initial running xacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
+ */
+void
+ReorderBufferProcessInitialXacts(ReorderBuffer *rb, xl_running_xacts *running)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild  *builder = ctx->snapshot_builder;
+	TransactionId *workspace;
+	int			surviving_xids = 0;
+
+	/* Build the initial running transactions list for the first call */
+	if (unlikely(SnapBuildCurrentState(builder) == SNAPBUILD_START))
+	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		Assert(rb->n_initial_running_xacts == 0);
+
+		rb->n_initial_running_xacts = nxacts;
+		rb->initial_running_xacts = MemoryContextAlloc(rb->context, sz);
+		memcpy(rb->initial_running_xacts, running->xids, sz);
+		qsort(rb->initial_running_xacts, nxacts, sizeof(TransactionId),
+			  xidComparator);
+
+		return;
+	}
+
+	/* Quick exit if there is no initial running transactions */
+	if (likely(rb->n_initial_running_xacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(rb->initial_running_xacts[0],
+									 running->oldestRunningXid))
+		return;
+
+	/*
+	 * Remove transactions that would have been processed and we don't need to
+	 * keep track of anymore.
+	 *
+	 * The purged array must also be sorted in xidComparator order.
+	 */
+	workspace = MemoryContextAlloc(rb->context,
+								   rb->n_initial_running_xacts * sizeof(TransactionId));
+	for (int i = 0; i < rb->n_initial_running_xacts; i++)
+	{
+		if (NormalTransactionIdPrecedes(rb->initial_running_xacts[i],
+										running->oldestRunningXid))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = rb->initial_running_xacts[i];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(rb->initial_running_xacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(rb->initial_running_xacts);
+		rb->initial_running_xacts = NULL;
+	}
+
+	elog(DEBUG3, "purged catalog modifying transactions from %u to %u, oldest running xid %u",
+		 (uint32) rb->n_initial_running_xacts,
+		 (uint32) surviving_xids,
+		 running->oldestRunningXid);
+
+	rb->n_initial_running_xacts = surviving_xids;
+	pfree(workspace);
+}
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark both it
+ * and its subtransactions as containing catalog changes if not yet.
+ */
+void
+ReorderBufferInitialXactsSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+										   int subxcnt, TransactionId *subxacts,
+										   XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (likely(rb->n_initial_running_xacts == 0 ||
+			   ReorderBufferXidHasCatalogChanges(rb, xid)))
+		return;
+
+	/*
+	 * If this committed transaction is the one that was running at the time
+	 * when decoding the first RUNNING_XACTS record and have done catalog
+	 * changes, we can mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 */
+	if (bsearch(&xid, rb->initial_running_xacts, rb->n_initial_running_xacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(rb, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(rb, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(rb, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ba257d81b5..fe0f52d4e1 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -12,6 +12,7 @@
 #include "access/htup_details.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -589,6 +590,35 @@ struct ReorderBuffer
 	/* memory accounting */
 	Size		size;
 
+	/*
+	 * Array of transactions and subtransactions that were running when
+	 * the xl_running_xacts record that we decoded first was written.
+	 * The array is sorted in xidComparator order. Xids are removed from
+	 * the array when decoding xl_running_xacts record, and then the array
+	 * eventually becomes empty.
+	 *
+	 * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+	 * if the transaction has changed the catalog, and that information
+	 * is not serialized to SnapBuilder. Therefore, if the logical
+	 * decoding decodes the commit record of the transaction that actually
+	 * has done catalog changes without these records, we fail to add
+	 * the xid to the snapshot, and end up looking at catalogs with the
+	 * wrong snapshot. To avoid this problem, if the COMMIT record of
+	 * the xid listed in initial_running_xacts has XACT_XINFO_HAS_INVALS
+	 * flag, we mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 *
+	 * We could end up adding the transaction that didn't change catalog
+	 * to the snapshot since we cannot distinguish whether the transaction
+	 * has catalog changes only by checking the COMMIT record. It doesn't
+	 * have the information on which (sub) transaction has catalog changes,
+	 * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+	 * transaction has catalog change. But it doesn't become a problem since
+	 * we use historic snapshot only for reading system catalogs.
+	 */
+	TransactionId *initial_running_xacts;
+	int n_initial_running_xacts;
+
 	/*
 	 * Statistics about transactions spilled to disk.
 	 *
@@ -678,4 +708,10 @@ void		ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
 void		StartupReorderBuffer(void);
 
+void		ReorderBufferProcessInitialXacts(ReorderBuffer *rb,
+											 xl_running_xacts *running);
+void		ReorderBufferInitialXactsSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+													   int subxcnt,
+													   TransactionId *subxacts,
+													   XLogRecPtr lsn);
 #endif
-- 
2.24.3 (Apple Git-128)
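
The bookkeeping that the back-branch patch adds to the reorder buffer can be sketched outside PostgreSQL as follows. This is only a minimal illustration, not the patch itself: `Xid`, `InitialXacts` and the three `initial_xacts_*` functions are hypothetical names invented here, the purge compacts the array in place instead of using a workspace buffer, and xids are compared as plain unsigned integers, so wraparound is deliberately ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;			/* stand-in for TransactionId (assumption) */

/* Hypothetical container, loosely modeled on the two new ReorderBuffer fields. */
typedef struct InitialXacts
{
	Xid		   *xids;			/* sorted ascending, NULL when empty */
	int			nxids;
} InitialXacts;

/* Plain numeric ordering, used for both qsort() and bsearch(). */
static int
xid_cmp(const void *a, const void *b)
{
	Xid			x = *(const Xid *) a;
	Xid			y = *(const Xid *) b;

	return (x > y) - (x < y);
}

/* Remember the xids listed in the first running-xacts record we see. */
static void
initial_xacts_remember(InitialXacts *ix, const Xid *running, int n)
{
	assert(n >= 0);
	ix->xids = NULL;
	ix->nxids = 0;
	if (n == 0)
		return;

	ix->xids = malloc(sizeof(Xid) * (size_t) n);
	if (ix->xids == NULL)
		abort();
	memcpy(ix->xids, running, sizeof(Xid) * (size_t) n);
	qsort(ix->xids, (size_t) n, sizeof(Xid), xid_cmp);
	ix->nxids = n;
}

/* Drop every xid older than oldest_running; the survivors stay sorted. */
static void
initial_xacts_purge(InitialXacts *ix, Xid oldest_running)
{
	int			kept = 0;

	for (int i = 0; i < ix->nxids; i++)
	{
		if (ix->xids[i] >= oldest_running)
			ix->xids[kept++] = ix->xids[i];
	}

	ix->nxids = kept;
	if (kept == 0)
	{
		free(ix->xids);
		ix->xids = NULL;
	}
}

/*
 * Decide at commit time whether the transaction should be treated as a
 * catalog-modifying one: its commit record carries invalidations and it
 * was running when decoding started.
 */
static bool
initial_xacts_needs_catalog_mark(const InitialXacts *ix, Xid xid,
								 bool commit_has_invals)
{
	if (!commit_has_invals || ix->nxids == 0)
		return false;

	return bsearch(&xid, ix->xids, (size_t) ix->nxids,
				   sizeof(Xid), xid_cmp) != NULL;
}
```

A caller would invoke `initial_xacts_remember()` once, `initial_xacts_purge()` for each later running-xacts record, and `initial_xacts_needs_catalog_mark()` for each commit record. As in the patch, a transaction that has invalidations but no catalog changes is still reported, which is the false positive the commit message describes.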

Attachment: REL11-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch (application/octet-stream)
From 99f61380e562afbe70f2c0bbd5b025d18c5c36f8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Sun, 17 Jul 2022 07:30:23 +0900
Subject: [PATCH v6] Fix catalog lookup with the wrong snapshot during logical
 decoding.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this changes the reorder buffer so that it
remembers the initial running transaction written in the
xl_running_xacts record that we decoded first, and mark the
transaction as containing catalog changes if it’s in the list of the
initial running transactions and its commit record has
XACT_XINFO_HAS_INVALS.

This can yield false positives; we could end up adding a transaction that
didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the
COMMIT record. It doesn’t have the information on which (sub)
transaction has catalog changes, and XACT_XINFO_HAS_INVALS doesn't
necessarily indicate that the transaction has catalog change. But it
doesn't become a problem since we use historic snapshot only for
reading system catalogs.

On the master branch, we took a more future-proof approach of writing
catalog modifying transactions to the serialized snapshot. But we
cannot backpatch it because of change in SnapBuild.

Back-patch to all supported releases.
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 +++++++
 .../specs/catalog_change_snapshot.spec        |  39 ++++++
 src/backend/replication/logical/decode.c      |  17 +++
 .../replication/logical/reorderbuffer.c       | 116 ++++++++++++++++++
 src/include/replication/reorderbuffer.h       |  36 ++++++
 6 files changed, 253 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 65a91a8014..973b94738a 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -51,7 +51,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_check) \
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c085f7b0f3..9c483ddc12 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -320,6 +320,9 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			{
 				xl_running_xacts *running = (xl_running_xacts *) XLogRecGetData(r);
 
+				/* Process the initial running transactions, if any */
+				ReorderBufferProcessInitialXacts(ctx->reorder, running);
+
 				SnapBuildProcessRunningXacts(builder, buf->origptr, running);
 
 				/*
@@ -576,6 +579,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. We check if it's in the list of the initial running transactions
+	 * and then mark it as containing catalog change.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * catalog change transactions to the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		ReorderBufferInitialXactsSetCatalogChanges(ctx->reorder, xid,
+												   parsed->nsubxacts,
+												   parsed->subxacts,
+												   buf->origptr);
+
 	/*
 	 * Process invalidation messages, even if we're not interested in the
 	 * transaction's contents, since the various caches need to always be
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 6c9b5dbced..495135a9f6 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -271,6 +271,9 @@ ReorderBufferAllocate(void)
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
 
+	buffer->initial_running_xacts = NULL;
+	buffer->n_initial_running_xacts = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3571,3 +3574,116 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Process the transactions in xl_running_xacts record, and remember the
+ * transactions first and later remove those that aren't needed anymore.
+ *
+ * We can ideally remove the transactions from the initial running xacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
+ */
+void
+ReorderBufferProcessInitialXacts(ReorderBuffer *rb, xl_running_xacts *running)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild  *builder = ctx->snapshot_builder;
+	TransactionId *workspace;
+	int			surviving_xids = 0;
+
+	/* Build the initial running transactions list for the first call */
+	if (unlikely(SnapBuildCurrentState(builder) == SNAPBUILD_START))
+	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		Assert(rb->n_initial_running_xacts == 0);
+
+		rb->n_initial_running_xacts = nxacts;
+		rb->initial_running_xacts = MemoryContextAlloc(rb->context, sz);
+		memcpy(rb->initial_running_xacts, running->xids, sz);
+		qsort(rb->initial_running_xacts, nxacts, sizeof(TransactionId),
+			  xidComparator);
+
+		return;
+	}
+
+	/* Quick exit if there is no initial running transactions */
+	if (likely(rb->n_initial_running_xacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(rb->initial_running_xacts[0],
+									 running->oldestRunningXid))
+		return;
+
+	/*
+	 * Remove transactions that would have been processed and we don't need to
+	 * keep track of anymore.
+	 *
+	 * The purged array must also be sorted in xidComparator order.
+	 */
+	workspace = MemoryContextAlloc(rb->context,
+								   rb->n_initial_running_xacts * sizeof(TransactionId));
+	for (int i = 0; i < rb->n_initial_running_xacts; i++)
+	{
+		if (NormalTransactionIdPrecedes(rb->initial_running_xacts[i],
+										running->oldestRunningXid))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = rb->initial_running_xacts[i];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(rb->initial_running_xacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(rb->initial_running_xacts);
+		rb->initial_running_xacts = NULL;
+	}
+
+	elog(DEBUG3, "purged catalog modifying transactions from %u to %u, oldest running xid %u",
+		 (uint32) rb->n_initial_running_xacts,
+		 (uint32) surviving_xids,
+		 running->oldestRunningXid);
+
+	rb->n_initial_running_xacts = surviving_xids;
+	pfree(workspace);
+}
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark both it
+ * and its subtransactions as containing catalog changes if not yet.
+ */
+void
+ReorderBufferInitialXactsSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+										   int subxcnt, TransactionId *subxacts,
+										   XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (likely(rb->n_initial_running_xacts == 0 ||
+			   ReorderBufferXidHasCatalogChanges(rb, xid)))
+		return;
+
+	/*
+	 * If this committed transaction is the one that was running at the time
+	 * when decoding the first RUNNING_XACTS record and have done catalog
+	 * changes, we can mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 */
+	if (bsearch(&xid, rb->initial_running_xacts, rb->n_initial_running_xacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(rb, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(rb, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(rb, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3686cd2800..e8c6b661ba 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -12,6 +12,7 @@
 #include "access/htup_details.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -390,6 +391,35 @@ struct ReorderBuffer
 
 	XLogRecPtr	current_restart_decoding_lsn;
 
+	/*
+	 * Array of transactions and subtransactions that were running when
+	 * the xl_running_xacts record that we decoded first was written.
+	 * The array is sorted in xidComparator order. Xids are removed from
+	 * the array when decoding xl_running_xacts record, and then the array
+	 * eventually becomes empty.
+	 *
+	 * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+	 * if the transaction has changed the catalog, and that information
+	 * is not serialized to SnapBuilder. Therefore, if the logical
+	 * decoding decodes the commit record of the transaction that actually
+	 * has done catalog changes without these records, we fail to add
+	 * the xid to the snapshot, and end up looking at catalogs with the
+	 * wrong snapshot. To avoid this problem, if the COMMIT record of
+	 * the xid listed in initial_running_xacts has XACT_XINFO_HAS_INVALS
+	 * flag, we mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 *
+	 * We could end up adding the transaction that didn't change catalog
+	 * to the snapshot since we cannot distinguish whether the transaction
+	 * has catalog changes only by checking the COMMIT record. It doesn't
+	 * have the information on which (sub) transaction has catalog changes,
+	 * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+	 * transaction has catalog change. But it doesn't become a problem since
+	 * we use historic snapshot only for reading system catalogs.
+	 */
+	TransactionId *initial_running_xacts;
+	int n_initial_running_xacts;
+
 	/* buffer for disk<->memory conversions */
 	char	   *outbuf;
 	Size		outbufsize;
@@ -444,4 +474,10 @@ void		ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
 void		StartupReorderBuffer(void);
 
+void		ReorderBufferProcessInitialXacts(ReorderBuffer *rb,
+											 xl_running_xacts *running);
+void		ReorderBufferInitialXactsSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+													   int subxcnt,
+													   TransactionId *subxacts,
+													   XLogRecPtr lsn);
 #endif
-- 
2.24.3 (Apple Git-128)
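
One detail worth keeping in mind when reading the purge loop: the array is sorted with xidComparator, while the purge test uses NormalTransactionIdPrecedes(). As far as I understand, the former orders xids by plain numeric value and the latter by signed difference modulo 2^32, so the two can disagree near xid wraparound. The sketch below only illustrates the two orderings; `Xid`, `xid_numeric_less` and `xid_circular_precedes` are hypothetical names, not PostgreSQL functions.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;			/* stand-in for TransactionId (assumption) */

/* Plain unsigned ordering: the order a sort by numeric value produces. */
static bool
xid_numeric_less(Xid a, Xid b)
{
	return a < b;
}

/*
 * Circular ordering for normal xids: a precedes b when the 32-bit
 * difference, read as a signed value, is negative.  The conversion to
 * int32_t relies on the usual two's-complement behavior.
 */
static bool
xid_circular_precedes(Xid a, Xid b)
{
	return (int32_t) (a - b) < 0;
}
```

For xids that are numerically close, both functions agree. For an xid just below 2^32 and one just above the wraparound point, the circular test says the large value precedes the small one, while the numeric test says the opposite.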

Attachment: REL10-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch (application/octet-stream)
From 6140ffb31521bd2bc89201bcb7b550a9adf992ee Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Sun, 17 Jul 2022 21:28:35 +0900
Subject: [PATCH v6] Fix catalog lookup with the wrong snapshot during logical
 decoding.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this changes the reorder buffer so that it
remembers the initial running transaction written in the
xl_running_xacts record that we decoded first, and mark the
transaction as containing catalog changes if it’s in the list of the
initial running transactions and its commit record has
XACT_XINFO_HAS_INVALS.

This can yield false positives; we could end up adding a transaction that
didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the
COMMIT record. It doesn’t have the information on which (sub)
transaction has catalog changes, and XACT_XINFO_HAS_INVALS doesn't
necessarily indicate that the transaction has catalog change. But it
doesn't become a problem since we use historic snapshot only for
reading system catalogs.

On the master branch, we took a more future-proof approach of writing
catalog modifying transactions to the serialized snapshot. But we
cannot backpatch it because of change in SnapBuild.

Back-patch to all supported releases.
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  41 +++++++
 .../specs/catalog_change_snapshot.spec        |  39 ++++++
 src/backend/replication/logical/decode.c      |  17 +++
 .../replication/logical/reorderbuffer.c       | 116 ++++++++++++++++++
 src/include/replication/reorderbuffer.h       |  36 ++++++
 6 files changed, 250 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 2db2b2774b..73bc0fe1fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -51,7 +51,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_check) \
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..15f9540b3f
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,41 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6f8920f52c..fa0e9b1f38 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -316,6 +316,9 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			{
 				xl_running_xacts *running = (xl_running_xacts *) XLogRecGetData(r);
 
+				/* Process the initial running transactions, if any */
+				ReorderBufferProcessInitialXacts(ctx->reorder, running);
+
 				SnapBuildProcessRunningXacts(builder, buf->origptr, running);
 
 				/*
@@ -552,6 +555,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. We check if it's in the list of the initial running transactions
+	 * and then mark it as containing catalog change.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * catalog-changing transactions in the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		ReorderBufferInitialXactsSetCatalogChanges(ctx->reorder, xid,
+												   parsed->nsubxacts,
+												   parsed->subxacts,
+												   buf->origptr);
+
 	/*
 	 * Process invalidation messages, even if we're not interested in the
 	 * transaction's contents, since the various caches need to always be
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 061555652c..661436622f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -272,6 +272,9 @@ ReorderBufferAllocate(void)
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
 
+	buffer->initial_running_xacts = NULL;
+	buffer->n_initial_running_xacts = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3509,3 +3512,116 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Process the transactions in xl_running_xacts record, and remember the
+ * transactions first and later remove those that aren't needed anymore.
+ *
+ * Ideally we would remove a transaction from the initial running xacts array
+ * once it has finished (committed/aborted), but that could be costly as we
+ * need to maintain the xid order in the array.
+ */
+void
+ReorderBufferProcessInitialXacts(ReorderBuffer *rb, xl_running_xacts *running)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild  *builder = ctx->snapshot_builder;
+	TransactionId *workspace;
+	int			surviving_xids = 0;
+
+	/* Build the initial running transactions list for the first call */
+	if (unlikely(SnapBuildCurrentState(builder) == SNAPBUILD_START))
+	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		Assert(rb->n_initial_running_xacts == 0);
+
+		rb->n_initial_running_xacts = nxacts;
+		rb->initial_running_xacts = MemoryContextAlloc(rb->context, sz);
+		memcpy(rb->initial_running_xacts, running->xids, sz);
+		qsort(rb->initial_running_xacts, nxacts, sizeof(TransactionId),
+			  xidComparator);
+
+		return;
+	}
+
+	/* Quick exit if there are no initial running transactions */
+	if (likely(rb->n_initial_running_xacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(rb->initial_running_xacts[0],
+									 running->oldestRunningXid))
+		return;
+
+	/*
+	 * Remove transactions that would have been processed and that we no longer
+	 * need to keep track of.
+	 *
+	 * The purged array must also be sorted in xidComparator order.
+	 */
+	workspace = MemoryContextAlloc(rb->context,
+								   rb->n_initial_running_xacts * sizeof(TransactionId));
+	for (int i = 0; i < rb->n_initial_running_xacts; i++)
+	{
+		if (NormalTransactionIdPrecedes(rb->initial_running_xacts[i],
+										running->oldestRunningXid))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = rb->initial_running_xacts[i];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(rb->initial_running_xacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(rb->initial_running_xacts);
+		rb->initial_running_xacts = NULL;
+	}
+
+	elog(DEBUG3, "purged catalog modifying transactions from %u to %u, oldest running xid %u",
+		 (uint32) rb->n_initial_running_xacts,
+		 (uint32) surviving_xids,
+		 running->oldestRunningXid);
+
+	rb->n_initial_running_xacts = surviving_xids;
+	pfree(workspace);
+}
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark both it
+ * and its subtransactions as containing catalog changes if not yet.
+ */
+void
+ReorderBufferInitialXactsSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+										   int subxcnt, TransactionId *subxacts,
+										   XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (likely(rb->n_initial_running_xacts == 0 ||
+			   ReorderBufferXidHasCatalogChanges(rb, xid)))
+		return;
+
+	/*
+	 * If this committed transaction was running when we decoded the first
+	 * RUNNING_XACTS record and has made catalog
+	 * changes, we can mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 */
+	if (bsearch(&xid, rb->initial_running_xacts, rb->n_initial_running_xacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(rb, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(rb, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(rb, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d4555676d6..956d4e3329 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -12,6 +12,7 @@
 #include "access/htup_details.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -376,6 +377,35 @@ struct ReorderBuffer
 
 	XLogRecPtr	current_restart_decoding_lsn;
 
+	/*
+	 * Array of transactions and subtransactions that were running when
+	 * the xl_running_xacts record that we decoded first was written.
+	 * The array is sorted in xidComparator order. Xids are removed from
+	 * the array when decoding xl_running_xacts records, and the array
+	 * eventually becomes empty.
+	 *
+	 * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+	 * if the transaction has changed the catalog, and that information
+	 * is not serialized to SnapBuilder. Therefore, if the logical
+	 * decoding decodes the commit record of the transaction that actually
+	 * has done catalog changes without these records, we fail to add
+	 * the xid to the snapshot, and end up looking at catalogs with the
+	 * wrong snapshot. To avoid this problem, if the COMMIT record of
+	 * an xid listed in initial_running_xacts has the XACT_XINFO_HAS_INVALS
+	 * flag, we mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 *
+	 * We could end up adding the transaction that didn't change catalog
+	 * to the snapshot since we cannot distinguish whether the transaction
+	 * has catalog changes only by checking the COMMIT record. It doesn't
+	 * have the information on which (sub) transaction has catalog changes,
+	 * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+	 * transaction has catalog change. But it doesn't become a problem since
+	 * we use historic snapshot only for reading system catalogs.
+	 */
+	TransactionId *initial_running_xacts;
+	int n_initial_running_xacts;
+
 	/* buffer for disk<->memory conversions */
 	char	   *outbuf;
 	Size		outbufsize;
@@ -427,4 +457,10 @@ void		ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
 void		StartupReorderBuffer(void);
 
+void		ReorderBufferProcessInitialXacts(ReorderBuffer *rb,
+											 xl_running_xacts *running);
+void		ReorderBufferInitialXactsSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
+													   int subxcnt,
+													   TransactionId *subxacts,
+													   XLogRecPtr lsn);
 #endif
-- 
2.24.3 (Apple Git-128)
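
The back-branch patch above comes down to three operations on a sorted xid array: remember the xids listed in the first decoded RUNNING_XACTS record, purge those older than oldestRunningXid on later records, and binary-search the array when a COMMIT record carries XACT_XINFO_HAS_INVALS. The following is only a minimal standalone sketch of that idea, not PostgreSQL code. It assumes plain numeric xid ordering (no wraparound handling), and the names InitialXacts, initial_xacts_remember, initial_xacts_purge and commit_needs_catalog_mark are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

/* Hypothetical holder for the xids seen in the first RUNNING_XACTS record. */
typedef struct InitialXacts
{
	TransactionId *xids;		/* sorted ascending, NULL when empty */
	size_t		nxids;
} InitialXacts;

/* Simplified stand-in for xidComparator: plain numeric order. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	return (x > y) - (x < y);
}

/* Copy and sort the xids that were running at the first RUNNING_XACTS. */
static void
initial_xacts_remember(InitialXacts *ix, const TransactionId *running, size_t n)
{
	ix->xids = NULL;
	ix->nxids = 0;
	if (n == 0)
		return;

	ix->xids = malloc(n * sizeof(TransactionId));
	if (ix->xids == NULL)
		abort();				/* sketch: treat allocation failure as fatal */
	memcpy(ix->xids, running, n * sizeof(TransactionId));
	qsort(ix->xids, n, sizeof(TransactionId), xid_cmp);
	ix->nxids = n;
}

/* Drop every xid older than oldest_running; returns the survivor count. */
static size_t
initial_xacts_purge(InitialXacts *ix, TransactionId oldest_running)
{
	size_t		off = 0;

	/* The array is sorted, so the xids to drop form a prefix. */
	while (off < ix->nxids && ix->xids[off] < oldest_running)
		off++;

	ix->nxids -= off;
	if (ix->nxids > 0)
		memmove(ix->xids, ix->xids + off, ix->nxids * sizeof(TransactionId));
	else
	{
		free(ix->xids);
		ix->xids = NULL;
	}
	return ix->nxids;
}

/* Should this commit be treated as catalog-modifying? */
static bool
commit_needs_catalog_mark(const InitialXacts *ix, TransactionId xid,
						  bool has_invals)
{
	if (!has_invals || ix->nxids == 0)
		return false;
	return bsearch(&xid, ix->xids, ix->nxids,
				   sizeof(TransactionId), xid_cmp) != NULL;
}
```

Because the sketch ignores xid wraparound, it can purge a sorted prefix with a single memmove. The patch instead filters the array through a workspace, since it compares with NormalTransactionIdPrecedes while the array is kept in xidComparator order.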

Attachment: master-v6-0001-Add-catalog-modifying-transactions-to-logical-dec.patch (application/octet-stream)
From 832c8b155da37dd2505218a366053e9c74b4203a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 6 Jul 2022 12:53:36 +0900
Subject: [PATCH v6] Add catalog modifying transactions to logical decoding
 serialized snapshot.

Previously, we relied on HEAP2_NEW_CID records and XLOG_XACT_INVALIDATIONS
records to know whether a transaction had modified the catalog, and that
information was not serialized to the snapshot. Therefore, if logical
decoding decoded only the commit record of a transaction that had
actually modified a catalog, we failed to add its XID to the snapshot
and ended up looking at catalogs with the wrong snapshot.

To fix this problem, this change adds to the serialized snapshot the
list of transaction IDs and sub-transaction IDs that had modified
catalogs and were still running when the snapshot was serialized. When
decoding a COMMIT record, we check both the list and the ReorderBuffer
to see if the transaction has modified catalogs.

Since this adds additional information to the serialized snapshot, we
cannot backpatch it. For back branches, we take another approach:
remember the last-running-xacts list of the first decoded
RUNNING_XACTS record and, when decoding a commit record that has
XACT_XINFO_HAS_INVALS, check whether its XID is in the list. This
doesn't require any file format changes, but the transaction will end
up being added to the snapshot even if it has only relcache invalidations.

This commit bumps SNAPBUILD_VERSION because of change in SnapBuild.

Back-patch to all supported releases.
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 +++
 .../specs/catalog_change_snapshot.spec        |  39 +++
 .../replication/logical/reorderbuffer.c       |  69 ++++-
 src/backend/replication/logical/snapbuild.c   | 261 ++++++++++++------
 src/include/replication/reorderbuffer.h       |  12 +
 6 files changed, 339 insertions(+), 88 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec
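
The on-disk layout described in the commit message (a fixed-size header carrying both counts, then the committed xids, then the catalog-modifying xids) can be illustrated with a small standalone sketch. This is not the SnapBuildOnDisk format: the SnapHeader struct, SKETCH_VERSION, and the snap_serialize / snap_restore_catchange helpers are hypothetical, and the CRC, magic number and file I/O of the real code are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

#define SKETCH_VERSION 5		/* hypothetical; mirrors the version bump */

/* Hypothetical fixed-size header; the real struct carries much more. */
typedef struct SnapHeader
{
	uint32_t	version;
	uint32_t	committed_xcnt;
	uint32_t	catchange_xcnt;
} SnapHeader;

/* Write header, committed xids, then catalog-modifying xids into one buffer. */
static unsigned char *
snap_serialize(const TransactionId *committed, uint32_t ncommitted,
			   const TransactionId *catchange, uint32_t ncatchange,
			   size_t *len_out)
{
	SnapHeader	hdr = {SKETCH_VERSION, ncommitted, ncatchange};
	size_t		len = sizeof(hdr) +
		sizeof(TransactionId) * ((size_t) ncommitted + ncatchange);
	unsigned char *buf = malloc(len);
	unsigned char *p = buf;

	assert(len_out != NULL);
	if (buf == NULL)
		return NULL;

	memcpy(p, &hdr, sizeof(hdr));
	p += sizeof(hdr);

	/* Each array is copied only when non-empty, as in the patch. */
	if (ncommitted > 0)
	{
		memcpy(p, committed, sizeof(TransactionId) * ncommitted);
		p += sizeof(TransactionId) * ncommitted;
	}
	if (ncatchange > 0)
		memcpy(p, catchange, sizeof(TransactionId) * ncatchange);

	*len_out = len;
	return buf;
}

/*
 * Read back the catalog-modifying xids.  Returns 0 on success, -1 on a
 * version mismatch, short buffer or allocation failure.  The caller frees
 * *catchange_out.
 */
static int
snap_restore_catchange(const unsigned char *buf, size_t len,
					   TransactionId **catchange_out, uint32_t *ncatchange_out)
{
	SnapHeader	hdr;
	size_t		need;

	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));
	if (hdr.version != SKETCH_VERSION)
		return -1;

	need = sizeof(hdr) + sizeof(TransactionId) *
		((size_t) hdr.committed_xcnt + hdr.catchange_xcnt);
	if (len < need)
		return -1;

	*ncatchange_out = hdr.catchange_xcnt;
	*catchange_out = NULL;
	if (hdr.catchange_xcnt > 0)
	{
		size_t		sz = sizeof(TransactionId) * hdr.catchange_xcnt;

		*catchange_out = malloc(sz);
		if (*catchange_out == NULL)
			return -1;
		/* The catchange array sits right after the committed array. */
		memcpy(*catchange_out,
			   buf + sizeof(hdr) + sizeof(TransactionId) * hdr.committed_xcnt,
			   sz);
	}
	return 0;
}
```

Rejecting any buffer whose version differs is the reason a format change like this needs a version bump: a reader built for the old layout would otherwise misread the trailing array.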

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 88a37fde72..d7f430623d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -349,6 +349,8 @@ ReorderBufferAllocate(void)
 	buffer->by_txn_last_xid = InvalidTransactionId;
 	buffer->by_txn_last_txn = NULL;
 
+	buffer->catchange_ntxns = 0;
+
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
 	buffer->size = 0;
@@ -366,6 +368,7 @@ ReorderBufferAllocate(void)
 
 	dlist_init(&buffer->toplevel_by_lsn);
 	dlist_init(&buffer->txns_by_base_snapshot_lsn);
+	dlist_init(&buffer->catchange_txns);
 
 	/*
 	 * Ensure there's no stale data from prior uses of this slot, in case some
@@ -1526,7 +1529,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
-	 * Remove TXN from its containing list.
+	 * Remove TXN from its containing lists.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
@@ -1535,6 +1538,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 */
 	dlist_delete(&txn->node);
 
+	if (rbtxn_has_catalog_changes(txn))
+	{
+		dlist_delete(&txn->catchange_node);
+		rb->catchange_ntxns--;
+
+		Assert(rb->catchange_ntxns >= 0);
+	}
+
 	/* now remove reference from buffer */
 	hash_search(rb->by_txn,
 				(void *) &txn->xid,
@@ -3275,10 +3286,16 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 								  XLogRecPtr lsn)
 {
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	if (!rbtxn_has_catalog_changes(txn))
+	{
+		txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &txn->catchange_node);
+		rb->catchange_ntxns++;
+	}
 
 	/*
 	 * Mark top-level transaction as having catalog changes too if one of its
@@ -3286,8 +3303,52 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	 * conveniently check just top-level transaction and decide whether to
 	 * build the hash table or not.
 	 */
-	if (txn->toptxn != NULL)
-		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	toptxn = txn->toptxn;
+	if (toptxn != NULL && !rbtxn_has_catalog_changes(toptxn))
+	{
+		toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &toptxn->catchange_node);
+		rb->catchange_ntxns++;
+	}
+}
+
+/*
+ * Return palloc'ed array of the transactions that have changed catalogs.
+ * The returned array is sorted in xidComparator order.
+ *
+ * The caller must free the returned array when done with it.
+ */
+TransactionId *
+ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb)
+{
+	dlist_iter iter;
+	TransactionId *xids = NULL;
+	size_t	xcnt = 0;
+
+	/* Quick return if the list is empty */
+	if (dlist_is_empty(&rb->catchange_txns))
+	{
+		Assert(rb->catchange_ntxns == 0);
+		return NULL;
+	}
+
+	/* Initialize XID array */
+	xids = (TransactionId *) palloc(sizeof(TransactionId) * rb->catchange_ntxns);
+	dlist_foreach(iter, &rb->catchange_txns)
+	{
+		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN,
+												catchange_node,
+												iter.cur);
+
+		Assert(rbtxn_has_catalog_changes(txn));
+
+		xids[xcnt++] = txn->xid;
+	}
+
+	qsort(xids, xcnt, sizeof(TransactionId), xidComparator);
+
+	Assert((xcnt > 0) && (xcnt == rb->catchange_ntxns));
+	return xids;
 }
 
 /*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 73c0f15214..dce8da9e25 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -241,6 +241,30 @@ struct SnapBuild
 		 */
 		TransactionId *xip;
 	}			committed;
+
+	/*
+	 * Array of transactions and subtransactions that had modified catalogs
+	 * and were running when the snapshot was serialized.
+	 *
+	 * We normally rely on HEAP2_NEW_CID and XLOG_XACT_INVALIDATIONS records to
+	 * know if the transaction has changed the catalog. But it could happen that
+	 * the logical decoding decodes only the commit record of the transaction.
+	 * This array keeps track of the transactions that have modified catalogs
+	 * and were running when serializing a snapshot, and this array is used to
+	 * add such transactions to the snapshot.
+	 *
+	 * This array is set once when restoring the snapshot, xids are removed
+	 * from the array when decoding xl_running_xacts record, and then eventually
+	 * becomes empty.
+	 */
+	struct
+	{
+		/* number of transactions */
+		size_t		xcnt;
+
+		/* This array must be sorted in xidComparator order */
+		TransactionId *xip;
+	}			catchange;
 };
 
 /*
@@ -250,8 +274,8 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/* ->committed and ->catchange manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -262,6 +286,8 @@ static void SnapBuildSnapIncRefcount(Snapshot snap);
 
 static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
+static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid);
+
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
 static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
@@ -269,6 +295,7 @@ static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutof
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
 static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path);
 
 /*
  * Allocate a new snapshot builder.
@@ -306,6 +333,9 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 		palloc0(builder->committed.xcnt_space * sizeof(TransactionId));
 	builder->committed.includes_all_transactions = true;
 
+	builder->catchange.xcnt = 0;
+	builder->catchange.xip = NULL;
+
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
@@ -888,12 +918,15 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed or containing catalog
+ * changes that are smaller than ->xmin. Those won't ever get checked via
+ * the ->committed or ->catchange array, respectively. The committed xids will
+ * get checked via the clog machinery. We can ideally remove the transaction
+ * from catchange array once it is finished (committed/aborted) but that could
+ * be costly as we need to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -928,6 +961,40 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/*
+	 * purge xids in ->catchange as well. The purged array must also be
+	 * sorted in xidComparator order.
+	 */
+	if (builder->catchange.xcnt > 0)
+	{
+		/*
+		 * Since catchange.xip is sorted, we find the lower bound of
+		 * xids that are still interesting.
+		 */
+		for (off = 0; off < builder->catchange.xcnt; off++)
+		{
+			if (TransactionIdFollowsOrEquals(builder->catchange.xip[off],
+											 builder->xmin))
+				break;
+		}
+
+		surviving_xids = builder->catchange.xcnt - off;
+		if (surviving_xids > 0)
+			memmove(builder->catchange.xip, &(builder->catchange.xip[off]),
+					surviving_xids * sizeof(TransactionId));
+		else
+		{
+			/* catchange list becomes empty */
+			pfree(builder->catchange.xip);
+			builder->catchange.xip = NULL;
+		}
+
+		elog(DEBUG3, "purged catalog modifying transactions from %u to %u, xmin %u",
+			 (uint32) builder->catchange.xcnt, (uint32) surviving_xids,
+			builder->xmin);
+		builder->catchange.xcnt = surviving_xids;
+	}
 }
 
 /*
@@ -983,7 +1050,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		 * Add subtransaction to base snapshot if catalog modifying, we don't
 		 * distinguish to toplevel transactions there.
 		 */
-		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (SnapBuildXidHasCatalogChanges(builder, subxid))
 		{
 			sub_needs_timetravel = true;
 			needs_snapshot = true;
@@ -1012,7 +1079,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 
 	/* if top-level modified catalog, it'll need a snapshot */
-	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (SnapBuildXidHasCatalogChanges(builder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes",
 			 xid);
@@ -1089,6 +1156,21 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 }
 
+/*
+ * Check both the reorder buffer and the snapshot to see if the given
+ * transaction has modified catalogs.
+ */
+static inline bool
+SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid)
+{
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return true;
+
+	/* Check the catchange XID array */
+	return ((builder->catchange.xcnt > 0) &&
+			(bsearch(&xid, builder->catchange.xip, builder->catchange.xcnt,
+					 sizeof(TransactionId), xidComparator) != NULL));
+}
 
 /* -----------------------------------
  * Snapshot building functions dealing with xlog records
@@ -1135,7 +1217,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1438,6 +1520,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
  *
  * struct SnapBuildOnDisk;
  * TransactionId * committed.xcnt; (*not xcnt_space*)
+ * TransactionId * catchange.xcnt;
  *
  */
 typedef struct SnapBuildOnDisk
@@ -1467,7 +1550,7 @@ typedef struct SnapBuildOnDisk
 	offsetof(SnapBuildOnDisk, version)
 
 #define SNAPBUILD_MAGIC 0x51A1E001
-#define SNAPBUILD_VERSION 4
+#define SNAPBUILD_VERSION 5
 
 /*
  * Store/Load a snapshot from disk, depending on the snapshot builder's state.
@@ -1493,6 +1576,8 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 {
 	Size		needed_length;
 	SnapBuildOnDisk *ondisk = NULL;
+	TransactionId	*catchange_xip = NULL;
+	size_t		catchange_xcnt;
 	char	   *ondisk_c;
 	int			fd;
 	char		tmppath[MAXPGPATH];
@@ -1578,8 +1663,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 				(errcode_for_file_access(),
 				 errmsg("could not remove file \"%s\": %m", tmppath)));
 
+	/* Get the catalog modifying transactions that are not yet committed */
+	catchange_xip = ReorderBufferGetCatalogChangesXacts(builder->reorder);
+	catchange_xcnt = builder->reorder->catchange_ntxns;
+
 	needed_length = sizeof(SnapBuildOnDisk) +
-		sizeof(TransactionId) * builder->committed.xcnt;
+		sizeof(TransactionId) * (builder->committed.xcnt + catchange_xcnt);
 
 	ondisk_c = MemoryContextAllocZero(builder->context, needed_length);
 	ondisk = (SnapBuildOnDisk *) ondisk_c;
@@ -1598,16 +1687,31 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	ondisk->builder.snapshot = NULL;
 	ondisk->builder.reorder = NULL;
 	ondisk->builder.committed.xip = NULL;
+	ondisk->builder.catchange.xip = NULL;
+	/* update catchange only on disk data */
+	ondisk->builder.catchange.xcnt = catchange_xcnt;
 
 	COMP_CRC32C(ondisk->checksum,
 				&ondisk->builder,
 				sizeof(SnapBuild));
 
 	/* copy committed xacts */
-	sz = sizeof(TransactionId) * builder->committed.xcnt;
-	memcpy(ondisk_c, builder->committed.xip, sz);
-	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
-	ondisk_c += sz;
+	if (builder->committed.xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * builder->committed.xcnt;
+		memcpy(ondisk_c, builder->committed.xip, sz);
+		COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+		ondisk_c += sz;
+	}
+
+	/* copy catalog modifying xacts */
+	if (catchange_xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * catchange_xcnt;
+		memcpy(ondisk_c, catchange_xip, sz);
+		COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+		ondisk_c += sz;
+	}
 
 	FIN_CRC32C(ondisk->checksum);
 
@@ -1694,6 +1798,8 @@ out:
 	/* be tidy */
 	if (ondisk)
 		pfree(ondisk);
+	if (catchange_xip)
+		pfree(catchange_xip);
 }
 
 /*
@@ -1707,7 +1813,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	int			fd;
 	char		path[MAXPGPATH];
 	Size		sz;
-	int			readBytes;
 	pg_crc32c	checksum;
 
 	/* no point in loading a snapshot if we're already there */
@@ -1739,29 +1844,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
 
 	/* read statically sized portion of snapshot */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk, SnapBuildOnDiskConstantSize);
-	pgstat_report_wait_end();
-	if (readBytes != SnapBuildOnDiskConstantSize)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes,
-							(Size) SnapBuildOnDiskConstantSize)));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk, SnapBuildOnDiskConstantSize, path);
 
 	if (ondisk.magic != SNAPBUILD_MAGIC)
 		ereport(ERROR,
@@ -1781,56 +1864,26 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 				SnapBuildOnDiskConstantSize - SnapBuildOnDiskNotChecksummedSize);
 
 	/* read SnapBuild */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk.builder, sizeof(SnapBuild));
-	pgstat_report_wait_end();
-	if (readBytes != sizeof(SnapBuild))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sizeof(SnapBuild))));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk.builder, sizeof(SnapBuild), path);
 	COMP_CRC32C(checksum, &ondisk.builder, sizeof(SnapBuild));
 
 	/* restore committed xacts information */
-	sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
-	ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, ondisk.builder.committed.xip, sz);
-	pgstat_report_wait_end();
-	if (readBytes != sz)
+	if (ondisk.builder.committed.xcnt > 0)
 	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
+		sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
+		ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
+		SnapBuildRestoreContents(fd, (char *) ondisk.builder.committed.xip, sz, path);
+		COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
+	}
 
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sz)));
+	/* restore catalog modifying xacts information */
+	if (ondisk.builder.catchange.xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * ondisk.builder.catchange.xcnt;
+		ondisk.builder.catchange.xip = MemoryContextAllocZero(builder->context, sz);
+		SnapBuildRestoreContents(fd, (char *) ondisk.builder.catchange.xip, sz, path);
+		COMP_CRC32C(checksum, ondisk.builder.catchange.xip, sz);
 	}
-	COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
 
 	if (CloseTransientFile(fd) != 0)
 		ereport(ERROR,
@@ -1885,6 +1938,14 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	ondisk.builder.committed.xip = NULL;
 
+	/* set catalog modifying transactions */
+	if (builder->catchange.xip)
+		pfree(builder->catchange.xip);
+	builder->catchange.xcnt = ondisk.builder.catchange.xcnt;
+	builder->catchange.xip = ondisk.builder.catchange.xip;
+
+	ondisk.builder.catchange.xip = NULL;
+
 	/* our snapshot is not interesting anymore, build a new one */
 	if (builder->snapshot != NULL)
 	{
@@ -1906,9 +1967,43 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 snapshot_not_interesting:
 	if (ondisk.builder.committed.xip != NULL)
 		pfree(ondisk.builder.committed.xip);
+	if (ondisk.builder.catchange.xip != NULL)
+		pfree(ondisk.builder.catchange.xip);
 	return false;
 }
 
+/*
+ * Read the contents of the serialized snapshot to the dest.
+ */
+static void
+SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
+{
+	int			readBytes;
+
+	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+	readBytes = read(fd, dest, size);
+	pgstat_report_wait_end();
+	if (readBytes != size)
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+
+		if (readBytes < 0)
+		{
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("could not read file \"%s\": read %d of %zu",
+							path, readBytes, size)));
+	}
+}
+
 /*
  * Remove all serialized snapshots that are not required anymore because no
  * slot can need them. This doesn't actually have to run during a checkpoint,
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d109d0baed..fd84f175c0 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -380,6 +380,11 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	node;
 
+	/*
+	 * A node in the list of catalog modifying transactions
+	 */
+	dlist_node	catchange_node;
+
 	/*
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
@@ -526,6 +531,12 @@ struct ReorderBuffer
 	 */
 	dlist_head	txns_by_base_snapshot_lsn;
 
+	/*
+	 * Transactions and subtransactions that have modified system catalogs.
+	 */
+	dlist_head	catchange_txns;
+	int			catchange_ntxns;
+
 	/*
 	 * one-entry sized cache for by_txn. Very frequently the same txn gets
 	 * looked up over and over again.
@@ -677,6 +688,7 @@ extern void ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
 extern void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 extern ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 extern TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
+extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
-- 
2.24.3 (Apple Git-128)

#68shiy.fnst@fujitsu.com
shiy.fnst@fujitsu.com
In reply to: Masahiko Sawada (#66)
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Fri, Jul 15, 2022 10:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

This patch should have the fix for the issue that Shi yu reported. Shi
yu, could you please test it again with this patch?

Thanks for updating the patch!
I have tested and confirmed that the problem I found has been fixed.

Regards,
Shi yu

#69Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#66)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Fri, Jul 15, 2022 at 8:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

This patch should have the fix for the issue that Shi yu reported. Shi
yu, could you please test it again with this patch?

Can you explain the cause of the failure and your fix for the same?

--
With Regards,
Amit Kapila.

#70Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#67)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Sun, Jul 17, 2022 at 6:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 15, 2022 at 3:32 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

I've attached patches for all supported branches including the master.

For back branch patches,
* Wouldn't it be better to move the purge logic into a
SnapBuildPurge* function for the sake of consistency?
* Do we really need ReorderBufferInitialXactsSetCatalogChanges()?
Can't we instead have a function similar to
SnapBuildXidHasCatalogChanges() as we have for the master branch? That
would avoid calling it when the snapshot state is SNAPBUILD_START or
SNAPBUILD_BUILDING_SNAPSHOT.

--
With Regards,
Amit Kapila.

#71Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#69)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, Jul 18, 2022 at 1:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 15, 2022 at 8:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

This patch should have the fix for the issue that Shi yu reported. Shi
yu, could you please test it again with this patch?

Can you explain the cause of the failure and your fix for the same?

@@ -1694,6 +1788,8 @@ out:
    /* be tidy */
    if (ondisk)
        pfree(ondisk);
+   if (catchange_xip)
+       pfree(catchange_xip);

Regarding the above code in the previous version of the patch, looking at the
generated assembler code shared by Shi yu offlist, I realized that the
"if (catchange_xip)" check is removed (folded) by gcc optimization. This is
because we dereference catchange_xip (via memcpy) before the null-pointer
check, as follows:

+   /* copy catalog modifying xacts */
+   sz = sizeof(TransactionId) * catchange_xcnt;
+   memcpy(ondisk_c, catchange_xip, sz);
+   COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+   ondisk_c += sz;

Since sz is 0 in this case, memcpy doesn't actually do anything.

By checking the assembler code, I've confirmed that gcc performs this
optimization for this code, and that setting the
-fno-delete-null-pointer-checks flag prevents the if statement from
being folded. I've also confirmed that adding a
"catchange.xcnt > 0" check before the null-pointer check prevents
it as well. Adding a "catchange.xcnt > 0" check looks more robust. I've
added a similar check for builder->committed.xcnt as well for
consistency, since builder->committed.xip could have no transactions.
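
A minimal sketch of the guarded copy described above. The helper name copy_xids and the TransactionId typedef are stand-ins for illustration and are not taken from the patch. The point is only that memcpy is never reached with a possibly-NULL source when the count is zero, so the compiler has no basis for assuming the pointer is non-null later on.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;	/* stand-in for PostgreSQL's typedef */

/*
 * Hypothetical helper: append xcnt xids to dst and return the advanced
 * write position.  memcpy is called only when there is something to copy,
 * so xip may legitimately be NULL when xcnt == 0.
 */
static char *
copy_xids(char *dst, const TransactionId *xip, size_t xcnt)
{
	if (xcnt > 0)
	{
		size_t		sz = sizeof(TransactionId) * xcnt;

		memcpy(dst, xip, sz);
		dst += sz;
	}
	return dst;
}
```

With this shape, a later "if (xip) pfree(xip);" style check stays meaningful, because no earlier call has told the compiler that xip cannot be NULL.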

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#72Masahiko Sawada
sawada.mshk@gmail.com
In reply to: shiy.fnst@fujitsu.com (#68)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, Jul 18, 2022 at 12:28 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

On Fri, Jul 15, 2022 10:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

This patch should have the fix for the issue that Shi yu reported. Shi
yu, could you please test it again with this patch?

Thanks for updating the patch!
I have tested and confirmed that the problem I found has been fixed.

Thank you for testing!

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#73Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#71)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 19, 2022 at 6:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jul 18, 2022 at 1:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 15, 2022 at 8:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

This patch should have the fix for the issue that Shi yu reported. Shi
yu, could you please test it again with this patch?

Can you explain the cause of the failure and your fix for the same?

@@ -1694,6 +1788,8 @@ out:
/* be tidy */
if (ondisk)
pfree(ondisk);
+   if (catchange_xip)
+       pfree(catchange_xip);

Regarding the above code in the previous version of the patch, looking at the
generated assembler code shared by Shi yu offlist, I realized that the
"if (catchange_xip)" check is removed (folded) by gcc optimization. This is
because we dereference catchange_xip (via memcpy) before the null-pointer
check, as follows:

+   /* copy catalog modifying xacts */
+   sz = sizeof(TransactionId) * catchange_xcnt;
+   memcpy(ondisk_c, catchange_xip, sz);
+   COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+   ondisk_c += sz;

Since sz is 0 in this case, memcpy doesn't actually do anything.

By checking the assembler code, I've confirmed that gcc performs this
optimization for this code, and that setting the
-fno-delete-null-pointer-checks flag prevents the if statement from
being folded. I've also confirmed that adding a
"catchange.xcnt > 0" check before the null-pointer check prevents
it as well. Adding a "catchange.xcnt > 0" check looks more robust. I've
added a similar check for builder->committed.xcnt as well for
consistency, since builder->committed.xip could have no transactions.

Good work. I wonder whether, without comments, this may create a problem
in the future. OTOH, I don't see adding a "catchange.xcnt > 0" check before
freeing the memory as any less robust. Also, for consistency, we can use
a similar check based on xcnt in SnapBuildRestore to free the
memory in the code below:
+ /* set catalog modifying transactions */
+ if (builder->catchange.xip)
+ pfree(builder->catchange.xip);

--
With Regards,
Amit Kapila.

#74Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#73)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 19, 2022 at 1:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 19, 2022 at 6:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jul 18, 2022 at 1:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 15, 2022 at 8:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

This patch should have the fix for the issue that Shi yu reported. Shi
yu, could you please test it again with this patch?

Can you explain the cause of the failure and your fix for the same?

@@ -1694,6 +1788,8 @@ out:
/* be tidy */
if (ondisk)
pfree(ondisk);
+   if (catchange_xip)
+       pfree(catchange_xip);

Regarding the above code in the previous version of the patch, looking at the
generated assembler code shared by Shi yu offlist, I realized that the
"if (catchange_xip)" check is removed (folded) by gcc optimization. This is
because we dereference catchange_xip (via memcpy) before the null-pointer
check, as follows:

+   /* copy catalog modifying xacts */
+   sz = sizeof(TransactionId) * catchange_xcnt;
+   memcpy(ondisk_c, catchange_xip, sz);
+   COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+   ondisk_c += sz;

Since sz is 0 in this case, memcpy doesn't actually do anything.

By checking the assembler code, I've confirmed that gcc performs this
optimization for this code, and that setting the
-fno-delete-null-pointer-checks flag prevents the if statement from
being folded. I've also confirmed that adding a
"catchange.xcnt > 0" check before the null-pointer check prevents
it as well. Adding a "catchange.xcnt > 0" check looks more robust. I've
added a similar check for builder->committed.xcnt as well for
consistency, since builder->committed.xip could have no transactions.

Good work. I wonder without comments this may create a problem in the
future. OTOH, I don't see adding a check "catchange.xcnt > 0" before
freeing the memory any less robust. Also, for consistency, we can use
a similar check based on xcnt in the SnapBuildRestore to free the
memory in the below code:
+ /* set catalog modifying transactions */
+ if (builder->catchange.xip)
+ pfree(builder->catchange.xip);

I would hesitate to add comments about preventing this particular
optimization. I think we do null-pointer-check-then-pfree in many places.
It seems to me that checking the array length before memcpy is more
natural than checking both the array length and the array existence
before pfree.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#75osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Masahiko Sawada (#67)
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Sunday, July 17, 2022 9:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached patches for all supported branches including the master.

Hi,

Minor comments for REL14.

(1) There are some foreign characters in the patches (in the commit message)

When I looked at your patches for the back branches in an editor,
I could see some unfamiliar full-width characters, as in the two cases below,
mainly around single quotes in the sentences.

Could you please check all the patches,
perhaps with a tool that helps detect this kind of character?

* the 2nd paragraph of the commit message

...mark the transaction as containing catalog changes if it窶冱 in the list of the
initial running transactions ...

* the 3rd paragraph of the same

It doesn窶冲 have the information on which (sub) transaction has catalog changes....

FYI, this comment applies to other patches for REL13, REL12, REL11, REL10.

(2) typo in the commit message

FROM:
To fix this problem, this change the reorder buffer so that...
TO:
To fix this problem, this changes the reorder buffer so that...

(3) typo in ReorderBufferProcessInitialXacts

+       /*
+        * Remove transactions that would have been processed and we don't need to
+        * keep track off anymore.

Kindly change
FROM:
keep track off
TO:
keep track of

Best Regards,
Takamichi Osumi

#76Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amit Kapila (#73)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

At Tue, 19 Jul 2022 10:17:15 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

Good work. I wonder without comments this may create a problem in the
future. OTOH, I don't see adding a check "catchange.xcnt > 0" before
freeing the memory any less robust. Also, for consistency, we can use
a similar check based on xcnt in the SnapBuildRestore to free the
memory in the below code:
+ /* set catalog modifying transactions */
+ if (builder->catchange.xip)
+ pfree(builder->catchange.xip);

But xip must be positive there. We can add a comment that explains that.

+	 * Array of transactions and subtransactions that had modified catalogs
+	 * and were running when the snapshot was serialized.
+	 *
+	 * We normally rely on HEAP2_NEW_CID and XLOG_XACT_INVALIDATIONS records to
+	 * know if the transaction has changed the catalog. But it could happen that
+	 * the logical decoding decodes only the commit record of the transaction.
+	 * This array keeps track of the transactions that have modified catalogs

(Might be only me, but) "track" makes me think that xids are added and
removed by activities. On the other hand, the array just remembers the
catalog-modifying xids from the last life until all the xids in the list
are gone.

+	 * and were running when serializing a snapshot, and this array is used to
+	 * add such transactions to the snapshot.
+	 *
+	 * This array is set once when restoring the snapshot, xids are removed

(So I want to add "only" between "are" and "removed").

+	 * from the array when decoding xl_running_xacts record, and then eventually
+	 * becomes empty.

+ catchange_xip = ReorderBufferGetCatalogChangesXacts(builder->reorder);

catchange_xip is allocated in the current context, but ondisk is
allocated in builder->context. That seems somewhat inconsistent (even if
the current context is the same as builder->context).

+ if (builder->committed.xcnt > 0)
+ {

It seems to me committed.xip is always non-null, so we don't need this.
I don't strongly object to doing that, though.

-	 * Remove TXN from its containing list.
+	 * Remove TXN from its containing lists.

The comment body only describes txn->node. I think we need to
add a description for catchange_node as well.

+ Assert((xcnt > 0) && (xcnt == rb->catchange_ntxns));

(xcnt > 0) is obvious here (otherwise dlist_foreach would be broken).
(xcnt == rb->catchange_ntxns) is not what should be checked here. The
assert just requires that catchange_txns and catchange_ntxns are
consistent, so it should be checked just after dlist_empty, I think.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#77Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#70)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, Jul 18, 2022 at 8:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jul 17, 2022 at 6:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 15, 2022 at 3:32 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

I've attached patches for all supported branches including the master.

For back branch patches,
* Wouldn't it be better to move the purge logic into a
SnapBuildPurge* function for the sake of consistency?

Agreed.

* Do we really need ReorderBufferInitialXactsSetCatalogChanges()?
Can't we instead have a function similar to
SnapBuildXidHasCatalogChanges() as we have for the master branch? That
will avoid calling it when the snapshot
state is SNAPBUILD_START or SNAPBUILD_BUILDING_SNAPSHOT

Seems a good idea. We would need to pass the information about
(parsed->xinfo & XACT_XINFO_HAS_INVALS) to the function but probably
we can change ReorderBufferXidHasCatalogChanges() so that it checks
the RBTXN_HAS_CATALOG_CHANGES flag and then the initial running xacts
array.

BTW on backbranches, I think that the reason why we add
initial_running_xacts stuff to ReorderBuffer is that we cannot modify
SnapBuild that could be serialized. Can we add a (private) array for
the initial running xacts in snapbuild.c instead of adding new
variables to ReorderBuffer? That way, the code would become more
consistent with the changes on the master branch.
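
A rough sketch of the lookup order suggested above: check the per-transaction flag first, then fall back to the initial running xacts array. The function name, the flag constant, and the parameter list are hypothetical and only mirror the idea. The sketch also assumes the array is kept sorted in plain numeric order, so xid wraparound is not considered.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;	/* stand-in for PostgreSQL's typedef */

/* Hypothetical stand-in for the RBTXN_HAS_CATALOG_CHANGES flag bit */
#define TXN_HAS_CATALOG_CHANGES 0x0001

/* Plain numeric comparison of two xids (assumption: no wraparound) */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	return (xa > xb) - (xa < xb);
}

/*
 * Return true if the transaction is already flagged as catalog-modifying,
 * or if its xid appears in the sorted array of initial running xacts.
 */
static bool
xid_has_catalog_changes(uint32_t txn_flags, TransactionId xid,
						const TransactionId *initial_xacts, size_t n)
{
	if (txn_flags & TXN_HAS_CATALOG_CHANGES)
		return true;

	/* Nothing to search; also avoids passing a NULL base to bsearch */
	if (n == 0)
		return false;

	return bsearch(&xid, initial_xacts, n,
				   sizeof(TransactionId), xid_cmp) != NULL;
}
```

Checking the flag first keeps the common case cheap; the array only matters for transactions whose catalog changes were logged before the restart point.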

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#78Masahiko Sawada
sawada.mshk@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#75)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 19, 2022 at 4:28 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Sunday, July 17, 2022 9:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached patches for all supported branches including the master.

Hi,

Minor comments for REL14.

(1) There are some foreign characters in the patches (in the commit message)

When I had a look at your patch for back branches with some editor,
I could see some unfamiliar full-width characters like below two cases,
mainly around "single quotes" in the sentences.

Could you please check the entire patches,
probably by some tool that helps you to detect this kind of characters ?

* the 2nd paragraph of the commit message

...mark the transaction as containing catalog changes if it窶冱 in the list of the
initial running transactions ...

* the 3rd paragraph of the same

It doesn窶冲 have the information on which (sub) transaction has catalog changes....

FYI, this comment applies to other patches for REL13, REL12, REL11, REL10.

(2) typo in the commit message

FROM:
To fix this problem, this change the reorder buffer so that...
TO:
To fix this problem, this changes the reorder buffer so that...

(3) typo in ReorderBufferProcessInitialXacts

+       /*
+        * Remove transactions that would have been processed and we don't need to
+        * keep track off anymore.

Kindly change
FROM:
keep track off
TO:
keep track of

Thank you for the comments! I'll address these comments in the next
version patch.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#79Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Masahiko Sawada (#74)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

At Tue, 19 Jul 2022 16:02:26 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in

On Tue, Jul 19, 2022 at 1:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Good work. I wonder without comments this may create a problem in the
future. OTOH, I don't see adding a check "catchange.xcnt > 0" before
freeing the memory any less robust. Also, for consistency, we can use
a similar check based on xcnt in the SnapBuildRestore to free the
memory in the below code:
+ /* set catalog modifying transactions */
+ if (builder->catchange.xip)
+ pfree(builder->catchange.xip);

I would hesitate to add comments about preventing the particular
optimization. I think we do null-pointer-check-then-pfree in many places.
It seems to me that checking the array length before memcpy is more
natural than checking both the array length and the array existence
before pfree.

Anyway, according to the commit message of 46ab07ffda, POSIX forbids
memcpy(NULL, NULL, 0). It seems to me that this is the cause of the
false (or over) optimization. So if we add a comment, it would be
for memcpy, not pfree.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#80Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#79)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

At Tue, 19 Jul 2022 16:57:14 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Tue, 19 Jul 2022 16:02:26 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in

On Tue, Jul 19, 2022 at 1:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Good work. I wonder without comments this may create a problem in the
future. OTOH, I don't see adding a check "catchange.xcnt > 0" before
freeing the memory any less robust. Also, for consistency, we can use
a similar check based on xcnt in the SnapBuildRestore to free the
memory in the below code:
+ /* set catalog modifying transactions */
+ if (builder->catchange.xip)
+ pfree(builder->catchange.xip);

I would hesitate to add comments about preventing the particular
optimization. I think we do null-pointer-check-then-pfree in many places.
It seems to me that checking the array length before memcpy is more
natural than checking both the array length and the array existence
before pfree.

Anyway, according to the commit message of 46ab07ffda, POSIX forbids
memcpy(NULL, NULL, 0). It seems to me that it is the cause of the
false (or over) optimization. So if we add some comment, it would be
for memcpy, not pfree..

For clarity, I meant that I don't think we need that comment.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#81Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Kyotaro Horiguchi (#76)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 19, 2022 at 4:35 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Thank you for the comments!

At Tue, 19 Jul 2022 10:17:15 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

Good work. I wonder without comments this may create a problem in the
future. OTOH, I don't see adding a check "catchange.xcnt > 0" before
freeing the memory any less robust. Also, for consistency, we can use
a similar check based on xcnt in the SnapBuildRestore to free the
memory in the below code:
+ /* set catalog modifying transactions */
+ if (builder->catchange.xip)
+ pfree(builder->catchange.xip);

But xip must be positive there. We can add a comment that explains that.

Yes, if we add a comment for it, we would probably need to explain gcc's
optimization, which seems like too much to me.

+        * Array of transactions and subtransactions that had modified catalogs
+        * and were running when the snapshot was serialized.
+        *
+        * We normally rely on HEAP2_NEW_CID and XLOG_XACT_INVALIDATIONS records to
+        * know if the transaction has changed the catalog. But it could happen that
+        * the logical decoding decodes only the commit record of the transaction.
+        * This array keeps track of the transactions that have modified catalogs

(Might be only me, but) "track" makes me think that xids are added and
removed by activities. On the other hand, the array just remembers the
catalog-modifying xids from the last life until all the xids in the list
are gone.

+        * and were running when serializing a snapshot, and this array is used to
+        * add such transactions to the snapshot.
+        *
+        * This array is set once when restoring the snapshot, xids are removed

(So I want to add "only" between "are" and "removed").

+        * from the array when decoding xl_running_xacts record, and then eventually
+        * becomes empty.

Agreed. Will fix.
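
To illustrate the lifecycle that the comment describes (the array is filled once at restore time, entries are only ever removed, and it eventually becomes empty), here is a small sketch. The function name purge_older_xids is hypothetical, and the plain ">=" comparison ignores xid wraparound, which real code would have to handle.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in for PostgreSQL's typedef */

/*
 * Drop every xid older than oldest_running from xip[0..xcnt-1], compacting
 * the survivors to the front in their original order.  Returns the new
 * count.  Assumption: plain numeric ordering, no wraparound handling.
 */
static size_t
purge_older_xids(TransactionId *xip, size_t xcnt,
				 TransactionId oldest_running)
{
	size_t		kept = 0;

	for (size_t i = 0; i < xcnt; i++)
	{
		if (xip[i] >= oldest_running)
			xip[kept++] = xip[i];
	}
	return kept;
}
```

Calling this each time a running-xacts record is decoded shrinks the array monotonically; nothing is ever added back after the restore.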

+ catchange_xip = ReorderBufferGetCatalogChangesXacts(builder->reorder);

catchange_xip is allocated in the current context, but ondisk is
allocated in builder->context. That seems somewhat inconsistent (even if
the current context is the same as builder->context).

Right. I thought that since the lifetime of catchange_xip is short,
until the end of SnapBuildSerialize() function we didn't need to
allocate it in builder->context. But given ondisk, we need to do that
for catchange_xip as well. Will fix it.

+ if (builder->committed.xcnt > 0)
+ {

It seems to me committed.xip is always non-null, so we don't need this.
I don't strongly object to doing that, though.

But committed.xcnt could be 0, right? We don't need to copy anything
by calling memcpy with size = 0 in this case. Also, it looks more
consistent with what we do for catchange_xcnt.

-        * Remove TXN from its containing list.
+        * Remove TXN from its containing lists.

The comment body only describes txn->node. I think we need to
add a description for catchange_node as well.

Will add.

+ Assert((xcnt > 0) && (xcnt == rb->catchange_ntxns));

(xcnt > 0) is obvious here (otherwise means dlist_foreach is broken..).
(xcnt == rb->catchange_ntxns) is not what should be checked here. The
assert just requires that catchange_txns and catchange_ntxns are
consistent so it should be checked just after dlist_empty.. I think.

If we want to check that catchange_txns and catchange_ntxns are
consistent, shouldn't we check (xcnt == rb->catchange_ntxns) as well?
This function requires the caller to use rb->catchange_ntxns as the
length of the returned array. I think this assertion ensures that the
actual length of the array is consistent with the length we
pre-calculated.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#82Amit Kapila
amit.kapila16@gmail.com
In reply to: Kyotaro Horiguchi (#80)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 19, 2022 at 1:43 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 19 Jul 2022 16:57:14 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Tue, 19 Jul 2022 16:02:26 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in

On Tue, Jul 19, 2022 at 1:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Good work. I wonder without comments this may create a problem in the
future. OTOH, I don't see adding a check "catchange.xcnt > 0" before
freeing the memory any less robust. Also, for consistency, we can use
a similar check based on xcnt in the SnapBuildRestore to free the
memory in the below code:
+ /* set catalog modifying transactions */
+ if (builder->catchange.xip)
+ pfree(builder->catchange.xip);

I would hesitate to add comments about preventing the particular
optimization. I think we do null-pointer-check-then-pfree in many places.
It seems to me that checking the array length before memcpy is more
natural than checking both the array length and the array existence
before pfree.

Anyway, according to the commit message of 46ab07ffda, POSIX forbids
memcpy(NULL, NULL, 0). It seems to me that it is the cause of the
false (or over) optimization. So if we add some comment, it would be
for memcpy, not pfree..

For clarity, I meant that I don't think we need that comment.

Fair enough. I think commit 46ab07ffda clearly explains why it is a
good idea to add a check as Sawada-San did in his latest version. I
also agree that we don't need any comment for this change.

--
With Regards,
Amit Kapila.

#83Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#77)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 19, 2022 at 1:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jul 18, 2022 at 8:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jul 17, 2022 at 6:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 15, 2022 at 3:32 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

I've attached patches for all supported branches including the master.

For back branch patches,
* Wouldn't it be better to move the purge logic into a
SnapBuildPurge* function for the sake of consistency?

Agreed.

* Do we really need ReorderBufferInitialXactsSetCatalogChanges()?
Can't we instead have a function similar to
SnapBuildXidHasCatalogChanges() as we have for the master branch? That
will avoid calling it when the snapshot
state is SNAPBUILD_START or SNAPBUILD_BUILDING_SNAPSHOT

Seems a good idea. We would need to pass the information about
(parsed->xinfo & XACT_XINFO_HAS_INVALS) to the function but probably
we can change ReorderBufferXidHasCatalogChanges() so that it checks
the RBTXN_HAS_CATALOG_CHANGES flag and then the initial running xacts
array.

Let's try to keep this as similar to the master branch patch as possible.

BTW on backbranches, I think that the reason why we add
initial_running_xacts stuff to ReorderBuffer is that we cannot modify
SnapBuild that could be serialized. Can we add a (private) array for
the initial running xacts in snapbuild.c instead of adding new
variables to ReorderBuffer?

While thinking about this, I wonder if the current patch for back
branches can lead to an ABI break as it changes the exposed structure?
If so, it may be another reason to change it to some other way
probably as you are suggesting.

--
With Regards,
Amit Kapila.

#84Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#81)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 19, 2022 at 2:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 19, 2022 at 4:35 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

+ Assert((xcnt > 0) && (xcnt == rb->catchange_ntxns));

(xcnt > 0) is obvious here (otherwise means dlist_foreach is broken..).
(xcnt == rb->catchange_ntxns) is not what should be checked here. The
assert just requires that catchange_txns and catchange_ntxns are
consistent so it should be checked just after dlist_empty.. I think.

If we want to check if catchange_txns and catchange_ntxns are
consistent, should we check (xcnt == rb->catchange_ntxns) as well, no?
This function requires the caller to use rb->catchange_ntxns as the
length of the returned array. I think this assertion ensures that the
actual length of the array is consistent with the length we
pre-calculated.

Right, so, I think it is better to keep this assertion but remove
(xcnt > 0) part as pointed out by Horiguchi-San.
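
As a sketch of the assertion placement agreed on here: the empty case is handled up front, and after the loop only the equality between the collected count and the pre-computed count is asserted. The list type below is a simplified singly linked stand-in for dlist, and all names are hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;	/* stand-in for PostgreSQL's typedef */

/* Simplified stand-ins for the dlist of catalog-modifying transactions */
typedef struct TxnNode
{
	TransactionId xid;
	struct TxnNode *next;
} TxnNode;

typedef struct TxnList
{
	TxnNode    *head;
	size_t		ntxns;			/* pre-computed length of the list */
} TxnList;

/*
 * Return a malloc'd array of the xids in the list, or NULL if the list is
 * empty (or allocation fails).  The caller uses list->ntxns as the length.
 */
static TransactionId *
get_catchange_xids(const TxnList *list)
{
	TransactionId *xids;
	size_t		xcnt = 0;

	if (list->head == NULL)
		return NULL;

	xids = malloc(sizeof(TransactionId) * list->ntxns);
	if (xids == NULL)
		return NULL;

	for (const TxnNode *n = list->head; n != NULL; n = n->next)
	{
		/* Stop before overrunning the array if the counter is too small */
		if (xcnt >= list->ntxns)
		{
			free(xids);
			abort();
		}
		xids[xcnt++] = n->xid;
	}

	/* The collected count must match the length the caller will assume */
	assert(xcnt == list->ntxns);
	return xids;
}
```

The (xcnt > 0) half of the original assertion adds nothing here, since the loop body runs at least once whenever the list is non-empty.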

--
With Regards,
Amit Kapila.

#85Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#83)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 19, 2022 at 9:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 19, 2022 at 1:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jul 18, 2022 at 8:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jul 17, 2022 at 6:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 15, 2022 at 3:32 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

I've attached patches for all supported branches including the master.

For back branch patches,
* Wouldn't it be better to move purge logic into the function
SnapBuildPurge* function for the sake of consistency?

Agreed.

* Do we really need ReorderBufferInitialXactsSetCatalogChanges()?
Can't we instead have a function similar to
SnapBuildXidHasCatalogChanges() as we have for the master branch? That
will avoid calling it when the snapshot
state is SNAPBUILD_START or SNAPBUILD_BUILDING_SNAPSHOT

Seems a good idea. We would need to pass the information about
(parsed->xinfo & XACT_XINFO_HAS_INVALS) to the function but probably
we can change ReorderBufferXidHasCatalogChanges() so that it checks
the RBTXN_HAS_CATALOG_CHANGES flag and then the initial running xacts
array.

Let's try to keep this as much similar to the master branch patch as possible.

BTW on backbranches, I think that the reason why we add
initial_running_xacts stuff to ReorderBuffer is that we cannot modify
SnapBuild that could be serialized. Can we add a (private) array for
the initial running xacts in snapbuild.c instead of adding new
variables to ReorderBuffer?

While thinking about this, I wonder if the current patch for back
branches can lead to an ABI break as it changes the exposed structure?
If so, it may be another reason to change it to some other way
probably as you are suggesting.

Yeah, it changes the size of ReorderBuffer, which is not good.
Changing the function names and arguments would also break ABI. So
probably we cannot do the above idea of removing
ReorderBufferInitialXactsSetCatalogChanges() as well.
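
As a rough illustration of the size concern, the sketch below compares two versions of a struct. The struct names and members are hypothetical and much smaller than the real ReorderBuffer; the sketch only assumes that the new fields would be appended at the end. Existing members keep their offsets in that case, but any code compiled against the old header that depends on the struct's size (embedding it, allocating it, or copying it) would disagree with the new layout.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical layout that already-built extensions were compiled against. */
typedef struct DemoBufferV1
{
    void       *by_txn;
    uint64_t    current_restart_decoding_lsn;
} DemoBufferV1;

/* Hypothetical layout after a minor release appends two members. */
typedef struct DemoBufferV2
{
    void       *by_txn;
    uint64_t    current_restart_decoding_lsn;
    TransactionId *initial_running_xacts;   /* appended */
    size_t      n_initial_running_xacts;    /* appended */
} DemoBufferV2;

/* Returns 1 if the pre-existing member is still at the same offset. */
int
demo_offsets_preserved(void)
{
    return offsetof(DemoBufferV1, current_restart_decoding_lsn) ==
        offsetof(DemoBufferV2, current_restart_decoding_lsn);
}

/* Returns 1 if sizeof() differs between the two versions. */
int
demo_size_changed(void)
{
    return sizeof(DemoBufferV2) != sizeof(DemoBufferV1);
}
```

Keeping the extra state out of the exposed struct altogether sidesteps the question, at the cost of holding it somewhere private to the module.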

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#86Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Masahiko Sawada (#81)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

At Tue, 19 Jul 2022 17:31:07 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in

On Tue, Jul 19, 2022 at 4:35 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 19 Jul 2022 10:17:15 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

Good work. I wonder without comments this may create a problem in the
future. OTOH, I don't see adding a check "catchange.xcnt > 0" before
freeing the memory any less robust. Also, for consistency, we can use
a similar check based on xcnt in the SnapBuildRestore to free the
memory in the below code:
+ /* set catalog modifying transactions */
+ if (builder->catchange.xip)
+ pfree(builder->catchange.xip);

But xip must be positive there. We can add a comment that explains that.

Yes, if we add a comment for it, we would probably need to explain a gcc
optimization, but that seems like too much to me.

Ah, sorry. I confused it with another place in SnapBuildPurgeCommittedTxn.
I agree with you that we don't need an additional comment *there*.

+ catchange_xip = ReorderBufferGetCatalogChangesXacts(builder->reorder);

catchange_xip is allocated in the current context, but ondisk is
allocated in builder->context. I find this somewhat inconsistent (even if
the current context is the same as builder->context).

Right. I thought that since the lifetime of catchange_xip is short,
until the end of SnapBuildSerialize() function we didn't need to
allocate it in builder->context. But given ondisk, we need to do that
for catchange_xip as well. Will fix it.

Thanks.

+ if (builder->committed.xcnt > 0)
+ {

It seems to me committed.xip is always non-null, so we don't need this.
I don't strongly object to doing that, though.

But committed.xcnt could be 0, right? We don't need to copy anything
by calling memcpy with size = 0 in this case. Also, it looks more
consistent with what we do for catchange_xcnt.

Mmm. The patch changed that behavior. AllocateSnapshotBuilder always
allocates the array with a fixed size. SnapBuildAddCommittedTxn still
assumes builder->committed.xip is non-NULL. SnapBuildRestore *kept*
ondisk.builder.committed.xip populated with a valid array pointer. But
the patch allows committed.xip to be NULL, so in that case
SnapBuildAddCommittedTxn calls repalloc(NULL), which triggers an assertion
failure.

+ Assert((xcnt > 0) && (xcnt == rb->catchange_ntxns));

(xcnt > 0) is obvious here (otherwise means dlist_foreach is broken..).
(xcnt == rb->catchange_ntxns) is not what should be checked here. The
assert just requires that catchange_txns and catchange_ntxns are
consistent so it should be checked just after dlist_empty.. I think.

If we want to check if catchange_txns and catchange_ntxns are
consistent, should we check (xcnt == rb->catchange_ntxns) as well, no?
This function requires the caller to use rb->catchange_ntxns as the
length of the returned array. I think this assertion ensures that the
actual length of the array is consistent with the length we
pre-calculated.

Sorry again. Please forget the comment about xcnt == rb->catchange_ntxns..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#87Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Kyotaro Horiguchi (#86)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Jul 20, 2022 at 9:58 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 19 Jul 2022 17:31:07 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in

On Tue, Jul 19, 2022 at 4:35 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 19 Jul 2022 10:17:15 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

Good work. I wonder without comments this may create a problem in the
future. OTOH, I don't see adding a check "catchange.xcnt > 0" before
freeing the memory any less robust. Also, for consistency, we can use
a similar check based on xcnt in the SnapBuildRestore to free the
memory in the below code:
+ /* set catalog modifying transactions */
+ if (builder->catchange.xip)
+ pfree(builder->catchange.xip);

But xip must be positive there. We can add a comment that explains that.

Yes, if we add a comment for it, we would probably need to explain a gcc
optimization, but that seems like too much to me.

Ah, sorry. I confused it with another place in SnapBuildPurgeCommittedTxn.
I agree with you that we don't need an additional comment *there*.

+ catchange_xip = ReorderBufferGetCatalogChangesXacts(builder->reorder);

catchange_xip is allocated in the current context, but ondisk is
allocated in builder->context. I find this somewhat inconsistent (even if
the current context is the same as builder->context).

Right. I thought that since the lifetime of catchange_xip is short,
until the end of SnapBuildSerialize() function we didn't need to
allocate it in builder->context. But given ondisk, we need to do that
for catchange_xip as well. Will fix it.

Thanks.

+ if (builder->committed.xcnt > 0)
+ {

It seems to me committed.xip is always non-null, so we don't need this.
I don't strongly object to doing that, though.

But committed.xcnt could be 0, right? We don't need to copy anything
by calling memcpy with size = 0 in this case. Also, it looks more
consistent with what we do for catchange_xcnt.

Mmm. The patch changed that behavior. AllocateSnapshotBuilder always
allocates the array with a fixed size. SnapBuildAddCommittedTxn still
assumes builder->committed.xip is non-NULL. SnapBuildRestore *kept*
ondisk.builder.committed.xip populated with a valid array pointer. But
the patch allows committed.xip to be NULL, so in that case
SnapBuildAddCommittedTxn calls repalloc(NULL), which triggers an assertion
failure.

IIUC the patch doesn't allow committed.xip to be NULL since we don't
overwrite it if builder->committed.xcnt is 0 (i.e.,
ondisk.builder.committed.xip is NULL):

builder->committed.xcnt = ondisk.builder.committed.xcnt;
/* We only allocated/stored xcnt, not xcnt_space xids ! */
/* don't overwrite preallocated xip, if we don't have anything here */
if (builder->committed.xcnt > 0)
{
pfree(builder->committed.xip);
builder->committed.xcnt_space = ondisk.builder.committed.xcnt;
builder->committed.xip = ondisk.builder.committed.xip;
}
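
The same guard can be exercised in isolation with the sketch below. It is a simplified model and not the snapbuild.c code: `DemoCommitted` and `demo_restore_committed()` are hypothetical names, and malloc/free replace the memory-context allocator. It shows that a restored count of zero leaves the preallocated array in place, so later code that grows the array never sees a NULL pointer.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Hypothetical stand-in for the builder's "committed" sub-struct. */
typedef struct DemoCommitted
{
    size_t      xcnt;           /* number of valid entries */
    size_t      xcnt_space;     /* allocated capacity */
    TransactionId *xip;         /* array, expected to be non-NULL in dst */
} DemoCommitted;

/*
 * Adopt the array loaded from disk, but only when it holds something.
 * With xcnt == 0 the preallocated dst->xip is kept untouched.
 */
void
demo_restore_committed(DemoCommitted *dst, DemoCommitted *ondisk)
{
    dst->xcnt = ondisk->xcnt;

    if (ondisk->xcnt > 0)
    {
        free(dst->xip);
        dst->xcnt_space = ondisk->xcnt;
        dst->xip = ondisk->xip;
        ondisk->xip = NULL;     /* ownership moved to dst */
    }
}
```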

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#88Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#85)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 19, 2022 at 7:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 19, 2022 at 9:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 19, 2022 at 1:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

BTW on backbranches, I think that the reason why we add
initial_running_xacts stuff to ReorderBuffer is that we cannot modify
SnapBuild that could be serialized. Can we add a (private) array for
the initial running xacts in snapbuild.c instead of adding new
variables to ReorderBuffer?

While thinking about this, I wonder if the current patch for back
branches can lead to an ABI break as it changes the exposed structure?
If so, it may be another reason to change it to some other way
probably as you are suggesting.

Yeah, it changes the size of ReorderBuffer, which is not good.

So, are you planning to give a try with your idea of making a private
array for the initial running xacts? I am not sure but I guess you are
proposing to add it in SnapBuild structure, if so, that seems safe as
that structure is not exposed.

Changing the function names and arguments would also break ABI. So
probably we cannot do the above idea of removing
ReorderBufferInitialXactsSetCatalogChanges() as well.

Why do you think we can't remove
ReorderBufferInitialXactsSetCatalogChanges() from the back branch
patch? I think we don't need to change the existing function
ReorderBufferXidHasCatalogChanges() but instead can have a wrapper
like SnapBuildXidHasCatalogChanges() similar to master branch patch.

--
With Regards,
Amit Kapila.

#89Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#88)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Jul 20, 2022 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 19, 2022 at 7:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 19, 2022 at 9:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 19, 2022 at 1:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

BTW on backbranches, I think that the reason why we add
initial_running_xacts stuff to ReorderBuffer is that we cannot modify
SnapBuild that could be serialized. Can we add a (private) array for
the initial running xacts in snapbuild.c instead of adding new
variables to ReorderBuffer?

While thinking about this, I wonder if the current patch for back
branches can lead to an ABI break as it changes the exposed structure?
If so, it may be another reason to change it to some other way
probably as you are suggesting.

Yeah, it changes the size of ReorderBuffer, which is not good.

So, are you planning to give a try with your idea of making a private
array for the initial running xacts?

Yes.

I am not sure but I guess you are
proposing to add it in SnapBuild structure, if so, that seems safe as
that structure is not exposed.

We cannot add it in SnapBuild as it gets serialized to the disk.

Changing the function names and arguments would also break ABI. So
probably we cannot do the above idea of removing
ReorderBufferInitialXactsSetCatalogChanges() as well.

Why do you think we can't remove
ReorderBufferInitialXactsSetCatalogChanges() from the back branch
patch? I think we don't need to change the existing function
ReorderBufferXidHasCatalogChanges() but instead can have a wrapper
like SnapBuildXidHasCatalogChanges() similar to master branch patch.

IIUC we need to change SnapBuildCommitTxn() but it's exposed.

Currently, we call like DecodeCommit() -> SnapBuildCommitTxn() ->
ReorderBufferXidHasCatalogChanges(). If we have a wrapper function, we
call like DecodeCommit() -> SnapBuildCommitTxn() ->
SnapBuildXidHasCatalogChanges() ->
ReorderBufferXidHasCatalogChanges(). In
SnapBuildXidHasCatalogChanges(), we need to check if the transaction
has XACT_XINFO_HAS_INVALS, which means DecodeCommit() needs to pass
either parsed->xinfo or ((parsed->xinfo & XACT_XINFO_HAS_INVALS) != 0)
down to SnapBuildXidHasCatalogChanges(). However, since
SnapBuildCommitTxn(), between DecodeCommit() and
SnapBuildXidHasCatalogChanges(), is exposed we cannot change it.

Another idea would be to have functions, say
SnapBuildCommitTxnWithXInfo() and SnapBuildCommitTxn_ext(). The latter
does actual work of handling transaction commits and both
SnapBuildCommitTxn() and SnapBuildCommit() call
SnapBuildCommitTxnWithXInfo() with different arguments.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#90Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#89)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Jul 20, 2022 at 9:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 20, 2022 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 19, 2022 at 7:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Why do you think we can't remove
ReorderBufferInitialXactsSetCatalogChanges() from the back branch
patch? I think we don't need to change the existing function
ReorderBufferXidHasCatalogChanges() but instead can have a wrapper
like SnapBuildXidHasCatalogChanges() similar to master branch patch.

IIUC we need to change SnapBuildCommitTxn() but it's exposed.

Currently, we call like DecodeCommit() -> SnapBuildCommitTxn() ->
ReorderBufferXidHasCatalogChanges(). If we have a wrapper function, we
call like DecodeCommit() -> SnapBuildCommitTxn() ->
SnapBuildXidHasCatalogChanges() ->
ReorderBufferXidHasCatalogChanges(). In
SnapBuildXidHasCatalogChanges(), we need to check if the transaction
has XACT_XINFO_HAS_INVALS, which means DecodeCommit() needs to pass
either parsed->xinfo or ((parsed->xinfo & XACT_XINFO_HAS_INVALS) != 0)
down to SnapBuildXidHasCatalogChanges(). However, since
SnapBuildCommitTxn(), between DecodeCommit() and
SnapBuildXidHasCatalogChanges(), is exposed we cannot change it.

Agreed.

Another idea would be to have functions, say
SnapBuildCommitTxnWithXInfo() and SnapBuildCommitTxn_ext(). The latter
does actual work of handling transaction commits and both
SnapBuildCommitTxn() and SnapBuildCommit() call
SnapBuildCommitTxnWithXInfo() with different arguments.

Do you want to say DecodeCommit() instead of SnapBuildCommit() in
above para? Yet another idea could be to have another flag
RBTXN_HAS_INVALS which will be set by DecodeCommit for top-level TXN.
Then, we can retrieve it even for each of the subtxn's if and when
required.

--
With Regards,
Amit Kapila.

#91Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Masahiko Sawada (#87)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

At Wed, 20 Jul 2022 10:58:16 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in

On Wed, Jul 20, 2022 at 9:58 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Mmm. The patch changed that behavior. AllocateSnapshotBuilder always
allocates the array with a fixed size. SnapBuildAddCommittedTxn still
assumes builder->committed.xip is non-NULL. SnapBuildRestore *kept*
ondisk.builder.committed.xip populated with a valid array pointer. But
the patch allows committed.xip to be NULL, so in that case
SnapBuildAddCommittedTxn calls repalloc(NULL), which triggers an assertion
failure.

IIUC the patch doesn't allow committed.xip to be NULL since we don't
overwrite it if builder->committed.xcnt is 0 (i.e.,
ondisk.builder.committed.xip is NULL):

I meant that ondisk.builder.committed.xip can be NULL.. But looking
again that cannot be. I don't understand what I was looking at at
that time.

So, sorry for the noise.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#92Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Kyotaro Horiguchi (#91)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Jul 20, 2022 at 4:16 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Wed, 20 Jul 2022 10:58:16 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in

On Wed, Jul 20, 2022 at 9:58 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Mmm. The patch changed that behavior. AllocateSnapshotBuilder always
allocates the array with a fixed size. SnapBuildAddCommittedTxn still
assumes builder->committed.xip is non-NULL. SnapBuildRestore *kept*
ondisk.builder.committed.xip populated with a valid array pointer. But
the patch allows committed.xip to be NULL, so in that case
SnapBuildAddCommittedTxn calls repalloc(NULL), which triggers an assertion
failure.

IIUC the patch doesn't allow committed.xip to be NULL since we don't
overwrite it if builder->committed.xcnt is 0 (i.e.,
ondisk.builder.committed.xip is NULL):

I meant that ondisk.builder.committed.xip can be NULL.. But looking
again that cannot be. I don't understand what I was looking at at
that time.

So, sorry for the noise.

No problem. Thank you for your review and comments!

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#93Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#90)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Jul 20, 2022 at 2:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 20, 2022 at 9:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 20, 2022 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 19, 2022 at 7:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Why do you think we can't remove
ReorderBufferInitialXactsSetCatalogChanges() from the back branch
patch? I think we don't need to change the existing function
ReorderBufferXidHasCatalogChanges() but instead can have a wrapper
like SnapBuildXidHasCatalogChanges() similar to master branch patch.

IIUC we need to change SnapBuildCommitTxn() but it's exposed.

Currently, we call like DecodeCommit() -> SnapBuildCommitTxn() ->
ReorderBufferXidHasCatalogChanges(). If we have a wrapper function, we
call like DecodeCommit() -> SnapBuildCommitTxn() ->
SnapBuildXidHasCatalogChanges() ->
ReorderBufferXidHasCatalogChanges(). In
SnapBuildXidHasCatalogChanges(), we need to check if the transaction
has XACT_XINFO_HAS_INVALS, which means DecodeCommit() needs to pass
either parsed->xinfo or ((parsed->xinfo & XACT_XINFO_HAS_INVALS) != 0)
down to SnapBuildXidHasCatalogChanges(). However, since
SnapBuildCommitTxn(), between DecodeCommit() and
SnapBuildXidHasCatalogChanges(), is exposed we cannot change it.

Agreed.

Another idea would be to have functions, say
SnapBuildCommitTxnWithXInfo() and SnapBuildCommitTxn_ext(). The latter
does actual work of handling transaction commits and both
SnapBuildCommitTxn() and SnapBuildCommit() call
SnapBuildCommitTxnWithXInfo() with different arguments.

Do you want to say DecodeCommit() instead of SnapBuildCommit() in
above para?

I meant that we will call like DecodeCommit() ->
SnapBuildCommitTxnWithXInfo() -> SnapBuildCommitTxn_ext(has_invals =
true) -> SnapBuildXidHasCatalogChanges(has_invals = true) -> ... If
SnapBuildCommitTxn() gets called, it calls SnapBuildCommitTxn_ext()
with has_invals = false and behaves the same as before.
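
A minimal sketch of that call structure follows. All names prefixed with `demo_`/`DEMO_` are hypothetical, the bit value of `DEMO_XINFO_HAS_INVALS` is an assumption for the example, and the "work" is reduced to two counters. The sketch also derives has_invals from the xinfo bits rather than passing a constant true, which is one possible reading of the idea. What it shows is the shape: the already-exported function keeps its signature and forwards has_invals = false, while a new entry point carries the extra information.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical flag bit, standing in for XACT_XINFO_HAS_INVALS. */
#define DEMO_XINFO_HAS_INVALS (1U << 3)

/* Hypothetical builder state, reduced to counters for the example. */
typedef struct DemoBuilder
{
    int         ncommits;
    int         ncommits_with_invals;
} DemoBuilder;

/* New internal worker: free to take the extra argument. */
void
demo_commit_txn_ext(DemoBuilder *builder, TransactionId xid, bool has_invals)
{
    (void) xid;                 /* unused in this reduced model */
    builder->ncommits++;
    if (has_invals)
        builder->ncommits_with_invals++;
}

/* Existing exported function: signature unchanged, behaves as before. */
void
demo_commit_txn(DemoBuilder *builder, TransactionId xid)
{
    demo_commit_txn_ext(builder, xid, false);
}

/* New entry point for the decoder, which knows the commit record's xinfo. */
void
demo_commit_txn_with_xinfo(DemoBuilder *builder, TransactionId xid,
                           uint32_t xinfo)
{
    /* Parenthesize the mask: != binds tighter than & in C. */
    demo_commit_txn_ext(builder, xid, (xinfo & DEMO_XINFO_HAS_INVALS) != 0);
}
```

External callers of the old function are unaffected, which is what keeps this shape ABI-compatible; the cost is one more pair of functions to carry in future back-patches.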

Yet another idea could be to have another flag
RBTXN_HAS_INVALS which will be set by DecodeCommit for top-level TXN.
Then, we can retrieve it even for each of the subtxn's if and when
required.

Do you mean that when checking if the subtransaction has catalog
changes, we check if its top-level XID has this new flag? Why do we
need the new flag?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#94Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#93)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Jul 20, 2022 at 1:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 20, 2022 at 2:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 20, 2022 at 9:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Another idea would be to have functions, say
SnapBuildCommitTxnWithXInfo() and SnapBuildCommitTxn_ext(). The latter
does actual work of handling transaction commits and both
SnapBuildCommitTxn() and SnapBuildCommit() call
SnapBuildCommitTxnWithXInfo() with different arguments.

Do you want to say DecodeCommit() instead of SnapBuildCommit() in
above para?

I meant that we will call like DecodeCommit() ->
SnapBuildCommitTxnWithXInfo() -> SnapBuildCommitTxn_ext(has_invals =
true) -> SnapBuildXidHasCatalogChanges(has_invals = true) -> ... If
SnapBuildCommitTxn() gets called, it calls SnapBuildCommitTxn_ext()
with has_invals = false and behaves the same as before.

Okay, understood. This will work.

Yet another idea could be to have another flag
RBTXN_HAS_INVALS which will be set by DecodeCommit for top-level TXN.
Then, we can retrieve it even for each of the subtxn's if and when
required.

Do you mean that when checking if the subtransaction has catalog
changes, we check if its top-level XID has this new flag?

Yes.

Why do we
need the new flag?

This is required if we don't want to introduce a new set of functions
as you proposed above. I am not sure which one is better w.r.t back
patching effort later but it seems to me using flag stuff would make
future back patches easier if we make any changes in
SnapBuildCommitTxn.

--
With Regards,
Amit Kapila.

#95Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#94)
2 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Jul 20, 2022 at 5:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 20, 2022 at 1:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 20, 2022 at 2:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 20, 2022 at 9:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Another idea would be to have functions, say
SnapBuildCommitTxnWithXInfo() and SnapBuildCommitTxn_ext(). The latter
does actual work of handling transaction commits and both
SnapBuildCommitTxn() and SnapBuildCommit() call
SnapBuildCommitTxnWithXInfo() with different arguments.

Do you want to say DecodeCommit() instead of SnapBuildCommit() in
above para?

I meant that we will call like DecodeCommit() ->
SnapBuildCommitTxnWithXInfo() -> SnapBuildCommitTxn_ext(has_invals =
true) -> SnapBuildXidHasCatalogChanges(has_invals = true) -> ... If
SnapBuildCommitTxn() gets called, it calls SnapBuildCommitTxn_ext()
with has_invals = false and behaves the same as before.

Okay, understood. This will work.

Yet another idea could be to have another flag
RBTXN_HAS_INVALS which will be set by DecodeCommit for top-level TXN.
Then, we can retrieve it even for each of the subtxn's if and when
required.

Do you mean that when checking if the subtransaction has catalog
changes, we check if its top-level XID has this new flag?

Yes.

Why do we
need the new flag?

This is required if we don't want to introduce a new set of functions
as you proposed above. I am not sure which one is better w.r.t back
patching effort later but it seems to me using flag stuff would make
future back patches easier if we make any changes in
SnapBuildCommitTxn.

Understood.

I've implemented this idea as well for discussion. Both patches have
the common change to remember the initial running transactions and to
purge them when decoding xl_running_xacts records. The difference is
how to mark the transactions as needing to be added to the snapshot.

In v7-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch,
when the transaction is in the initial running xacts list and its
commit record has the XACT_XINFO_HAS_INVALS flag, we mark both the top
transaction and all of its subtransactions as containing catalog changes
(which also means creating ReorderBufferTXN entries for them). These
transactions are added to the snapshot in SnapBuildCommitTxn() since
ReorderBufferXidHasCatalogChanges() returns true for them.
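
The decision described in this paragraph can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: `demo_should_mark_catalog_change()` and `DEMO_XINFO_HAS_INVALS` are hypothetical, the initial running xacts array is assumed to be sorted in plain numeric order, and XID wraparound is ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Hypothetical flag bit, standing in for XACT_XINFO_HAS_INVALS. */
#define DEMO_XINFO_HAS_INVALS (1U << 3)

/* Plain numeric comparison; ignores XID wraparound for simplicity. */
static int
demo_xid_cmp(const void *a, const void *b)
{
    TransactionId xa = *(const TransactionId *) a;
    TransactionId xb = *(const TransactionId *) b;

    return (xa > xb) - (xa < xb);
}

/*
 * Decide whether a committing transaction should be treated as having
 * catalog changes: its commit record carries invalidations AND it was
 * already running when decoding started (so its earlier records may have
 * been skipped). The array must be sorted ascending.
 */
bool
demo_should_mark_catalog_change(const TransactionId *initial_running,
                                size_t n, TransactionId xid, uint32_t xinfo)
{
    if ((xinfo & DEMO_XINFO_HAS_INVALS) == 0)
        return false;
    if (n == 0)
        return false;

    return bsearch(&xid, initial_running, n, sizeof(TransactionId),
                   demo_xid_cmp) != NULL;
}
```

A true result here can be a false positive, since invalidation messages do not prove that the catalog was modified; the sketch only reproduces the membership-plus-flag test.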

In poc_mark_top_txn_has_inval.patch, when the transaction is in the
initial running xacts list and its commit record has the
XACT_XINFO_HAS_INVALS flag, we set a new flag, say
RBTXN_COMMIT_HAS_INVALS, only on the top transaction. In
SnapBuildCommitTxn(), if the top transaction has the
RBTXN_COMMIT_HAS_INVALS flag, we add all of its subtransactions to the
snapshot without checking ReorderBufferXidHasCatalogChanges() for them.
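
For comparison, the flag-based variant might look like the sketch below. Everything in it is hypothetical (`DemoTxn`, the `DEMO_TXN_*` bits, `demo_collect_snapshot_xids()`), the real SnapBuildCommitTxn() does considerably more, and the sketch assumes the top transaction itself is also added when the flag is set. It only shows how a single flag on the top-level entry can replace per-subtransaction checks.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical per-transaction flag bits. */
#define DEMO_TXN_HAS_CATALOG_CHANGES 0x01
#define DEMO_TXN_COMMIT_HAS_INVALS   0x02

/* Hypothetical, heavily reduced transaction entry. */
typedef struct DemoTxn
{
    TransactionId xid;
    uint32_t    flags;
} DemoTxn;

/*
 * Collect the XIDs that should go into the snapshot at commit time.
 * "out" must have room for nsubs + 1 entries. Returns the number written.
 */
size_t
demo_collect_snapshot_xids(const DemoTxn *top, const DemoTxn *subs,
                           size_t nsubs, TransactionId *out)
{
    size_t      n = 0;
    size_t      i;
    int         force = (top->flags & DEMO_TXN_COMMIT_HAS_INVALS) != 0;

    for (i = 0; i < nsubs; i++)
    {
        /* With the top-level flag set, skip the per-subxact check. */
        if (force || (subs[i].flags & DEMO_TXN_HAS_CATALOG_CHANGES) != 0)
            out[n++] = subs[i].xid;
    }

    if (force || (top->flags & DEMO_TXN_HAS_CATALOG_CHANGES) != 0)
        out[n++] = top->xid;

    return n;
}
```

Because the subtransactions are taken from the commit record rather than looked up individually, no per-subtransaction entries need to exist for the forced case.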

A difference between the two ideas is the scope of changes: the former
changes only snapbuild.c but the latter changes both snapbuild.c and
reorderbuffer.c. Moreover, while the former uses the existing flag,
the latter adds a new flag to the reorder buffer just for this case.
In that respect I think the former idea is simpler. An advantage of
the latter idea, however, is that it avoids creating ReorderBufferTXN
entries for subtransactions.

Overall I prefer the former for now but I'd like to hear what others think.

FWIW, I didn't try the idea of adding wrapper functions since it would
be costly in terms of back patching effort in the future.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

v7-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch (application/x-patch)
From 725249d8ec56a7ea3219df801dc7b93f93eea145 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 11 Jul 2022 21:49:06 +0900
Subject: [PATCH v7] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the initial running transaction written in the
xl_running_xacts record that we decoded first, and mark the
transaction as containing catalog changes if it's in the list of the
initial running transactions and its commit record has
XACT_XINFO_HAS_INVALS.

This can produce false positives; we could end up adding a transaction
that didn't change the catalog to the snapshot, since we cannot
determine whether the transaction has catalog changes only by checking
the COMMIT record. The record doesn't have information on which (sub)
transaction has catalog changes, and XACT_XINFO_HAS_INVALS doesn't
necessarily indicate that the transaction has catalog changes. But this
isn't a problem since we use the historic snapshot only for
reading system catalogs.

On the master branch, we took a more future-proof approach -- writing
catalog modifying transactions to the serialized snapshot. But we
cannot backpatch it because of change in SnapBuild. Also, to avoid ABI
breakage, we store the array of the initial running transactions in the
static variables InitialRunningXacts and NInitialRunningXacts, but not
in ReorderBuffer.

Back-patch to all supported releases.

Reported-by: Mike Oh <minsoo@amazon.com>
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Bertrand Drouvot <bdrouvot@amazon.com>
Reviewed-by: Shi yu <shiy.fnst@fujitsu.com>
Reviewed-by: Ahsan Hadi <ahsan.hadi@gmail.com>
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
Backpatch-through: 10
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 ++
 src/backend/replication/logical/snapbuild.c   | 143 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   5 +
 6 files changed, 240 insertions(+), 8 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b879..4553252d75 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot
+	twophase_snapshot catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 92dfafc632..a164442436 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -691,6 +691,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. We check if it's in the list of the initial running transactions
+	 * and then mark it as containing catalog change.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * catalog change transactions to the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		SnapBuildInitialXactSetCatalogChanges(ctx->snapshot_builder,
+											  xid,
+											  parsed->nsubxacts,
+											  parsed->subxacts,
+											  buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 6df602485b..3e9d1a7931 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -250,8 +250,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded first was written.
+ * The array is sorted in xidComparator order. Xids are only removed
+ * from the array when decoding xl_running_xacts record, and then
+ * the array eventually becomes an empty. This array is allocated in
+ * builder->context so its lifetime is the same as the snapshot builder.
+ *
+ * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+ * if the transaction has changed the catalog, and that information
+ * is not serialized to SnapBuilder. Therefore, if the logical
+ * decoding decodes the commit record of the transaction that actually
+ * has done catalog changes without these records, we fail to add
+ * the xid to the snapshot, and end up looking at catalogs with the
+ * wrong snapshot. To avoid this problem, if the COMMIT record of
+ * the xid listed in InitialRunningXacts has XACT_XINFO_HAS_INVALS
+ * flag, we mark both the top transaction and its subtransactions
+ * as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But it doesn't become a problem since
+ * we use historic snapshot only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -879,12 +909,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -919,6 +954,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there is no initial running transactions */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also be
+	 * sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1104,6 +1182,21 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
 	{
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded first was written.
+		 */
+		if (builder->state == SNAPBUILD_START)
+		{
+			int			nxacts = running->subxcnt + running->xcnt;
+			Size		sz = sizeof(TransactionId) * nxacts;
+
+			NInitialRunningXacts = nxacts;
+			InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+			memcpy(InitialRunningXacts, running->xids, sz);
+			qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+		}
+
 		/* returns false if there's no point in performing cleanup just yet */
 		if (!SnapBuildFindSnapshot(builder, lsn, running))
 			return;
@@ -1126,7 +1219,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1993,3 +2086,39 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark both it
+ * and its subtransactions as containing catalog changes if not yet.
+ */
+void
+SnapBuildInitialXactSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+									  int subxcnt, TransactionId *subxacts,
+									  XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (likely((NInitialRunningXacts == 0) ||
+			   ReorderBufferXidHasCatalogChanges(builder->reorder, xid)))
+		return;
+
+	/*
+	 * If this committed transaction is the one that was running at the time
+	 * when decoding the first RUNNING_XACTS record and have done catalog
+	 * changes, we can mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 */
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 3604621e88..d8f15d70e0 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -90,4 +90,9 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildInitialXactSetCatalogChanges(SnapBuild *builder,
+												  TransactionId xid,
+												  int subxcnt,
+												  TransactionId *subxacts,
+												  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)

poc_mark_top_txn_has_inval.patch (application/x-patch)
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b879..4553252d75 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot
+	twophase_snapshot catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 92dfafc632..364d832e6d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -691,6 +691,17 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. We check if it's in the list of the initial running transactions
+	 * and then mark it as containing catalog change.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * catalog change transactions to the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		ReorderBufferXidSetCommitHasInvals(ctx->reorder, xid, buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e59d1396b5..1a081ca966 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3300,6 +3300,38 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
 	return rbtxn_has_catalog_changes(txn);
 }
 
+/*
+ * Mark that the commit record of the transaction has invalidation messages.
+ */
+void
+ReorderBufferXidSetCommitHasInvals(ReorderBuffer *rb, TransactionId xid,
+								   XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+
+	txn->txn_flags |= RBTXN_COMMIT_HAS_INVALS;
+
+	/* The flag is set only on the top-level transaction */
+}
+
+/*
+ * Query whether the commit of the transaction has invalidation messages.
+ */
+bool
+ReorderBufferXidCommitHasInvals(ReorderBuffer *rb, TransactionId xid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+	if (txn == NULL)
+		return false;
+
+	return rbtxn_commit_has_invals(txn);
+}
+
 /*
  * ReorderBufferXidHasBaseSnapshot
  *		Have we already set the base snapshot for the given txn/subtxn?
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 6df602485b..5af8b959c8 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -250,8 +250,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded first was written.
+ * The array is sorted in xidComparator order. Xids are only removed
+ * from the array when decoding xl_running_xacts record, and then
+ * the array eventually becomes an empty. This array is allocated in
+ * builder->context so its lifetime is the same as the snapshot builder.
+ *
+ * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+ * if the transaction has changed the catalog, and that information
+ * is not serialized to SnapBuilder. Therefore, if the logical
+ * decoding decodes the commit record of the transaction that actually
+ * has done catalog changes without these records, we fail to add
+ * the xid to the snapshot, and end up looking at catalogs with the
+ * wrong snapshot. To avoid this problem, if the COMMIT record of
+ * the xid listed in InitialRunningXacts has XACT_XINFO_HAS_INVALS
+ * flag, we mark both the top transaction and its subtransactions
+ * as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But it doesn't become a problem since
+ * we use historic snapshot only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -879,12 +909,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -919,6 +954,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there is no initial running transactions */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also be
+	 * sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -934,6 +1012,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	bool		needs_timetravel = false;
 	bool		sub_needs_timetravel = false;
 
+	bool		top_has_invals = false;
+
 	TransactionId xmax = xid;
 
 	/*
@@ -966,6 +1046,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		}
 	}
 
+	top_has_invals = ReorderBufferXidCommitHasInvals(builder->reorder, xid);
 	for (nxact = 0; nxact < nsubxacts; nxact++)
 	{
 		TransactionId subxid = subxacts[nxact];
@@ -974,7 +1055,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		 * Add subtransaction to base snapshot if catalog modifying, we don't
 		 * distinguish to toplevel transactions there.
 		 */
-		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (top_has_invals ||
+			ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
 		{
 			sub_needs_timetravel = true;
 			needs_snapshot = true;
@@ -1003,7 +1085,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 
 	/* if top-level modified catalog, it'll need a snapshot */
-	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (top_has_invals ||
+		ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes",
 			 xid);
@@ -1104,6 +1187,21 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
 	{
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded first was written.
+		 */
+		if (builder->state == SNAPBUILD_START)
+		{
+			int			nxacts = running->subxcnt + running->xcnt;
+			Size		sz = sizeof(TransactionId) * nxacts;
+
+			NInitialRunningXacts = nxacts;
+			InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+			memcpy(InitialRunningXacts, running->xids, sz);
+			qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+		}
+
 		/* returns false if there's no point in performing cleanup just yet */
 		if (!SnapBuildFindSnapshot(builder, lsn, running))
 			return;
@@ -1126,7 +1224,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ba257d81b5..ac00575db1 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -176,6 +176,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_PARTIAL_CHANGE  0x0020
 #define RBTXN_PREPARE             0x0040
 #define RBTXN_SKIPPED_PREPARE	  0x0080
+#define RBTXN_COMMIT_HAS_INVALS	  0x0100
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -207,6 +208,12 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_HAS_PARTIAL_CHANGE) != 0 \
 )
 
+/* Does the commit record of the transaction have invalidation messages? */
+#define rbtxn_commit_has_invals(txn) \
+( \
+	 ((txn)->txn_flags & RBTXN_COMMIT_HAS_INVALS) != 0 \
+)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -665,6 +672,10 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+void		ReorderBufferXidSetCommitHasInvals(ReorderBuffer *rb, TransactionId xid,
+											   XLogRecPtr lsn);
+bool		ReorderBufferXidCommitHasInvals(ReorderBuffer *rb, TransactionId xid);
+
 bool		ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 											 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
 											 TimestampTz prepare_time,
#96Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#95)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Fri, Jul 22, 2022 at 11:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 20, 2022 at 5:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 20, 2022 at 1:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

This is required if we don't want to introduce a new set of functions
as you proposed above. I am not sure which one is better w.r.t back
patching effort later but it seems to me using flag stuff would make
future back patches easier if we make any changes in
SnapBuildCommitTxn.

Understood.

I've implemented this idea as well for discussion. Both patches have
the common change to remember the initial running transactions and to
purge them when decoding xl_running_xacts records. The difference is
how to mark the transactions as needing to be added to the snapshot.
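
The change common to both patches can be sketched in isolation as below.
This is a simplified standalone model, not the PostgreSQL code: the names
Xid, initial_xacts_remember, initial_xacts_contains and
initial_xacts_purge are hypothetical, malloc/free stand in for the
builder's memory context, and a plain integer comparison replaces the
wraparound-aware NormalTransactionIdPrecedes() used in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;               /* stand-in for TransactionId */

static Xid *initial_xacts = NULL;   /* sorted copy of the first running-xacts list */
static int  n_initial_xacts = 0;

/* Plain numeric ordering, in the spirit of xidComparator. */
static int
xid_cmp(const void *a, const void *b)
{
    Xid x = *(const Xid *) a;
    Xid y = *(const Xid *) b;

    return (x > y) - (x < y);
}

/* Remember the xids listed in the first running-xacts record, sorted. */
static void
initial_xacts_remember(const Xid *xids, int n)
{
    assert(n >= 0);

    free(initial_xacts);
    initial_xacts = NULL;
    n_initial_xacts = 0;
    if (n == 0)
        return;

    initial_xacts = malloc(sizeof(Xid) * (size_t) n);
    if (initial_xacts == NULL)
        abort();
    memcpy(initial_xacts, xids, sizeof(Xid) * (size_t) n);
    qsort(initial_xacts, (size_t) n, sizeof(Xid), xid_cmp);
    n_initial_xacts = n;
}

/* Was this xid running when decoding started? */
static int
initial_xacts_contains(Xid xid)
{
    if (n_initial_xacts == 0)
        return 0;
    return bsearch(&xid, initial_xacts, (size_t) n_initial_xacts,
                   sizeof(Xid), xid_cmp) != NULL;
}

/* Drop every xid older than xmin; survivors keep their sorted order. */
static void
initial_xacts_purge(Xid xmin)
{
    int kept = 0;

    for (int i = 0; i < n_initial_xacts; i++)
        if (initial_xacts[i] >= xmin)   /* simplification: ignores wraparound */
            initial_xacts[kept++] = initial_xacts[i];

    n_initial_xacts = kept;
    if (kept == 0)
    {
        free(initial_xacts);
        initial_xacts = NULL;
    }
}
```

Keeping the array sorted is what makes the lookup at commit time a
binary search; purging in one pass per xl_running_xacts record preserves
that order without per-transaction removals.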

In v7-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch,
when the transaction is in the initial running xact list and its
commit record has the XACT_XINFO_HAS_INVALS flag, we mark both the top
transaction and all of its subtransactions as containing catalog changes
(which also means creating ReorderBufferTXN entries for them). These
transactions are added to the snapshot in SnapBuildCommitTxn() since
ReorderBufferXidHasCatalogChanges() returns true for them.
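
The marking step of this first approach can be sketched roughly as
follows. It is a toy model under stated assumptions, not the patch
itself: ToyTxn, toy_lookup and toy_mark_initial_xact are hypothetical
names, a small fixed array stands in for the reorder buffer, and the
initial running xacts list is passed in already sorted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Xid;               /* stand-in for TransactionId */

#define TOY_MAX_TXNS 16

/* Toy stand-in for a ReorderBufferTXN entry. */
typedef struct ToyTxn
{
    Xid  xid;
    bool has_catalog_changes;
} ToyTxn;

static ToyTxn toy_txns[TOY_MAX_TXNS];
static int    toy_ntxns = 0;

/* Find the entry for xid, optionally creating it. */
static ToyTxn *
toy_lookup(Xid xid, bool create)
{
    for (int i = 0; i < toy_ntxns; i++)
        if (toy_txns[i].xid == xid)
            return &toy_txns[i];
    if (!create)
        return NULL;
    if (toy_ntxns >= TOY_MAX_TXNS)
        abort();
    toy_txns[toy_ntxns].xid = xid;
    toy_txns[toy_ntxns].has_catalog_changes = false;
    return &toy_txns[toy_ntxns++];
}

static bool
toy_has_catalog_changes(Xid xid)
{
    ToyTxn *txn = toy_lookup(xid, false);

    return txn != NULL && txn->has_catalog_changes;
}

static int
xid_cmp(const void *a, const void *b)
{
    Xid x = *(const Xid *) a;
    Xid y = *(const Xid *) b;

    return (x > y) - (x < y);
}

/*
 * If the commit record carries invalidations and the top-level xid was
 * running when decoding started, mark it and every subtransaction.
 * Entries are created on demand, which is the cost this approach pays.
 */
static void
toy_mark_initial_xact(const Xid *initial, int ninitial,
                      Xid xid, const Xid *subxacts, int nsubxacts,
                      bool commit_has_invals)
{
    if (!commit_has_invals || ninitial == 0 || toy_has_catalog_changes(xid))
        return;
    if (bsearch(&xid, initial, (size_t) ninitial, sizeof(Xid), xid_cmp) == NULL)
        return;

    toy_lookup(xid, true)->has_catalog_changes = true;
    for (int i = 0; i < nsubxacts; i++)
        toy_lookup(subxacts[i], true)->has_catalog_changes = true;
}
```

Only the top-level xid is looked up in the initial list; the
subtransactions are marked unconditionally once the top-level xid
matches.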

In poc_mark_top_txn_has_inval.patch, when the transaction is in the
initial running xacts list and its commit record has the
XACT_XINFO_HAS_INVALS flag, we set a new flag, say
RBTXN_COMMIT_HAS_INVALS, only on the top transaction.

It seems that the patch is missing the check for whether the xid is in
the initial running xacts list?

In SnapBuildCommitTxn(), we add all subtransactions to
the snapshot without checking ReorderBufferXidHasCatalogChanges() for
them if their top transaction has the RBTXN_COMMIT_HAS_INVALS
flag.
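
The decision this second approach makes at commit time can be sketched
as below. This is only an illustration: ToyTxn, the TOY_* flag values
and toy_sub_needs_timetravel are hypothetical, and a NULL subtransaction
pointer models the case where no entry was ever created for the
subtransaction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag bits; the values are arbitrary for this sketch. */
#define TOY_HAS_CATALOG_CHANGES  0x0001u
#define TOY_COMMIT_HAS_INVALS    0x0100u

/* Toy stand-in for a ReorderBufferTXN entry. */
typedef struct ToyTxn
{
    uint32_t xid;
    uint32_t flags;
} ToyTxn;

/*
 * Should this subtransaction be added to the snapshot?  Yes if the
 * top-level commit carried invalidations, or if the subtransaction is
 * itself known to have changed the catalog.  "sub" may be NULL when no
 * entry exists for it.
 */
static bool
toy_sub_needs_timetravel(const ToyTxn *top, const ToyTxn *sub)
{
    bool top_has_invals = (top->flags & TOY_COMMIT_HAS_INVALS) != 0;
    bool sub_has_catalog =
        sub != NULL && (sub->flags & TOY_HAS_CATALOG_CHANGES) != 0;

    return top_has_invals || sub_has_catalog;
}
```

Because the flag on the top-level entry alone decides the outcome, no
per-subtransaction entry has to exist for the subtransaction to be added
to the snapshot.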

A difference between the two ideas is the scope of changes: the former
changes only snapbuild.c but the latter changes both snapbuild.c and
reorderbuffer.c. Moreover, while the former uses the existing flag,
the latter adds a new flag to the reorder buffer to deal with only
this case. I think the former idea is simpler in that respect. But
an advantage of the latter idea is that it avoids creating
ReorderBufferTXN entries for subtransactions.

Overall I prefer the former for now but I'd like to hear what others think.

I agree that the latter idea can have better performance in extremely
special scenarios but introducing a new flag for the same sounds a bit
ugly to me. So, I would also prefer to go with the former idea,
however, I would also like to hear what Horiguchi-San and others have
to say.

A few comments on v7-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during:
1.
+void
+SnapBuildInitialXactSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+   int subxcnt, TransactionId *subxacts,
+   XLogRecPtr lsn)
+{

I think it is better to name this function as
SnapBuildXIDSetCatalogChanges as we use this to mark a particular
transaction as having catalog changes.

2. Changed/added a few comments in the attached.

--
With Regards,
Amit Kapila.

Attachments:

v7-0001-diff-amit.patch (application/octet-stream)
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 78b9ca48c7..05b9c41e88 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -629,11 +629,13 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 
 	/*
 	 * If the COMMIT record has invalidation messages, it could have catalog
-	 * changes. We check if it's in the list of the initial running transactions
-	 * and then mark it as containing catalog change.
+	 * changes. It is possible that we didn't mark this transaction as
+	 * containing catalog changes when the decoding starts from a commit record
+	 * without decoding the transaction's other changes. So, we ensure to mark
+	 * such transactions as containing catalog change.
 	 *
 	 * This must be done before SnapBuildCommitTxn() so that we can include
-	 * catalog change transactions to the historic snapshot.
+	 * these transactions in the historic snapshot.
 	 */
 	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
 		SnapBuildInitialXactSetCatalogChanges(ctx->snapshot_builder,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 2585922dad..3cdee00b9b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -253,10 +253,10 @@ static bool ExportInProgress = false;
 /*
  * Array of transactions and subtransactions that were running when
  * the xl_running_xacts record that we decoded first was written.
- * The array is sorted in xidComparator order. Xids are only removed
- * from the array when decoding xl_running_xacts record, and then
- * the array eventually becomes an empty. This array is allocated in
- * builder->context so its lifetime is the same as the snapshot builder.
+ * The array is sorted in xidComparator order. We remove Xids from
+ * this array once they are old enough to no longer matter. This array
+ * is allocated in builder->context so its lifetime is the same as the
+ * snapshot builder.
  *
  * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
  * if the transaction has changed the catalog, and that information
@@ -2095,8 +2095,9 @@ CheckPointSnapBuild(void)
 }
 
 /*
- * If the given xid is in the list of the initial running xacts, we mark both it
- * and its subtransactions as containing catalog changes if not yet.
+ * If the given xid is in the list of the initial running xacts, we mark the
+ * transaction and its subtransactions as containing catalog changes. See
+ * comments
  */
 void
 SnapBuildInitialXactSetCatalogChanges(SnapBuild *builder, TransactionId xid,
#97Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#96)
6 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Sat, Jul 23, 2022 at 8:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 22, 2022 at 11:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 20, 2022 at 5:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 20, 2022 at 1:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

This is required if we don't want to introduce a new set of functions
as you proposed above. I am not sure which one is better w.r.t back
patching effort later but it seems to me using flag stuff would make
future back patches easier if we make any changes in
SnapBuildCommitTxn.

Understood.

I've implemented this idea as well for discussion. Both patches have
the common change to remember the initial running transactions and to
purge them when decoding xl_running_xacts records. The difference is
how to mark the transactions as needing to be added to the snapshot.
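
As a rough illustration of the common purge step, here is a minimal sketch, assuming xids are plain 32-bit values compared modulo 2^32. The names xid_precedes and purge_older_xids are hypothetical and are not part of either patch; the patches use a temporary workspace array, while this toy version compacts in place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Wraparound-aware "a is older than b" test for normal xids. */
static int
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Drop every xid older than xmin from xids[0..n), compacting in place and
 * preserving the relative order of the survivors.  Returns the number of
 * surviving xids.  (Hypothetical helper, for illustration only.)
 */
static size_t
purge_older_xids(TransactionId *xids, size_t n, TransactionId xmin)
{
	size_t		surviving = 0;

	assert(xids != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (!xid_precedes(xids[i], xmin))
			xids[surviving++] = xids[i];
	}
	return surviving;
}
```

Because the survivors keep their relative order, a sorted input stays sorted, which is what a later binary search over the array relies on.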

In v7-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch,
when the transaction is in the initial running xacts list and its
commit record has the XACT_XINFO_HAS_INVALS flag, we mark both the top
transaction and all its subtransactions as containing catalog changes
(which also means creating ReorderBufferTXN entries for them). These
transactions are added to the snapshot in SnapBuildCommitTxn() since
ReorderBufferXidHasCatalogChanges() returns true for them.
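
The lookup-and-mark logic just described can be sketched outside PostgreSQL roughly as follows. This is a toy model under stated assumptions: the marked[] array is a hypothetical stand-in for the reorder buffer, and commit_set_catalog_changes is an illustrative name, not the function from the patch. The only property borrowed from the patch is that the initial running xacts array is sorted by plain numeric order, so bsearch() can be used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Plain numeric ordering, matching how the array is assumed to be sorted. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	return (xa > xb) - (xa < xb);
}

/* Hypothetical stand-in for the reorder buffer: a list of marked xids. */
#define MAX_MARKED 64
static TransactionId marked[MAX_MARKED];
static size_t nmarked = 0;

static void
mark_catalog_change(TransactionId xid)
{
	if (nmarked < MAX_MARKED)
		marked[nmarked++] = xid;
}

/*
 * On commit: if the record carries invalidations and the top-level xid was
 * in the initial running list, mark it and all of its subtransactions.
 * Returns true if anything was marked.
 */
static bool
commit_set_catalog_changes(const TransactionId *initial, size_t ninitial,
						   bool has_invals, TransactionId xid,
						   const TransactionId *subxacts, size_t nsub)
{
	if (!has_invals || ninitial == 0)
		return false;

	if (bsearch(&xid, initial, ninitial,
				sizeof(TransactionId), xid_cmp) == NULL)
		return false;

	mark_catalog_change(xid);
	for (size_t i = 0; i < nsub; i++)
		mark_catalog_change(subxacts[i]);
	return true;
}
```

Note that only the top-level xid is looked up; the subtransactions are marked unconditionally once the top-level transaction matches, which is where the false positives discussed in the commit message come from.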

In poc_mark_top_txn_has_inval.patch, when the transaction is in the
initial running xacts list and its commit record has the
XACT_XINFO_HAS_INVALS flag, we set a new flag, say
RBTXN_COMMIT_HAS_INVALS, only on the top transaction.

It seems that the patch is missing the check for whether the xid is in
the initial running xacts list?

Oops, right.

In SnapBuildCommitTxn(), we add all subtransactions to
the snapshot without checking ReorderBufferXidHasCatalogChanges() for
subtransactions if its top transaction has the RBTXN_COMMIT_HAS_INVALS
flag.

A difference between the two ideas is the scope of changes: the former
changes only snapbuild.c but the latter changes both snapbuild.c and
reorderbuffer.c. Moreover, while the former uses the existing flag,
the latter adds a new flag to the reorder buffer just to deal with
this case. I think the former idea is simpler in that respect. But
an advantage of the latter idea is that it avoids creating
ReorderBufferTXN entries for subtransactions.

Overall I prefer the former for now but I'd like to hear what others think.
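
To make the trade-off concrete, here is a toy model of the two marking strategies. Everything in it is hypothetical (ToyTxn, its fields, and the four functions are illustrative names only); it merely counts how many transaction entries each idea has to touch and shows which flag decides whether a subtransaction goes into the snapshot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy transaction entry; field names are illustrative only. */
typedef struct ToyTxn
{
	bool		has_catalog_changes;	/* existing per-transaction flag */
	bool		commit_has_invals;		/* proposed top-level-only flag */
} ToyTxn;

/*
 * Former idea: flag the top transaction and every subtransaction, which
 * needs one entry per subtransaction.  Returns the entries touched.
 */
static size_t
mark_former(ToyTxn *top, ToyTxn *subs, size_t nsubs)
{
	top->has_catalog_changes = true;
	for (size_t i = 0; i < nsubs; i++)
		subs[i].has_catalog_changes = true;
	return 1 + nsubs;
}

/* Latter idea: flag only the top transaction; no subtransaction entries. */
static size_t
mark_latter(ToyTxn *top)
{
	top->commit_has_invals = true;
	return 1;
}

/* Former: a subtransaction is added only if its own flag is set. */
static bool
sub_in_snapshot_former(const ToyTxn *sub)
{
	return sub != NULL && sub->has_catalog_changes;
}

/* Latter: the top-level flag decides; sub may be NULL (no entry exists). */
static bool
sub_in_snapshot_latter(const ToyTxn *top, const ToyTxn *sub)
{
	return top->commit_has_invals ||
		(sub != NULL && sub->has_catalog_changes);
}
```

With N subtransactions the former touches N + 1 entries and the latter exactly one, at the cost of an extra flag and a second code path in the commit-time check.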

I agree that the latter idea can perform better in some very rare
scenarios, but introducing a new flag just for that sounds a bit
ugly to me. So, I would also prefer to go with the former idea;
however, I would also like to hear what Horiguchi-San and others have
to say.

Agreed.

A few comments on v7-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during:
1.
+void
+SnapBuildInitialXactSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+   int subxcnt, TransactionId *subxacts,
+   XLogRecPtr lsn)
+{

I think it is better to name this function as
SnapBuildXIDSetCatalogChanges as we use this to mark a particular
transaction as having catalog changes.

2. Changed/added a few comments in the attached.

Thank you for the comments.

I've attached updated versions of the patches for the master and back branches.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

REL14-v8-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch (application/octet-stream)
From 09b47b0cd9bfbbb64b1f41b5fae5169a40845e68 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 11 Jul 2022 21:49:06 +0900
Subject: [PATCH v8] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the transactions listed as running in the xl_running_xacts
record that we decoded first, and marks a transaction as containing
catalog changes if it's in the list of the initial running
transactions and its commit record has XACT_XINFO_HAS_INVALS. To avoid
ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
rather than in SnapBuild or ReorderBuffer.

This approach has false positives; we could end up adding a
transaction that didn't change the catalog to the snapshot, since we
cannot tell whether the transaction has catalog changes only by
checking the COMMIT record. It doesn't have the information on
which (sub) transaction has catalog changes, and XACT_XINFO_HAS_INVALS
doesn't necessarily indicate that the transaction has catalog
changes. But that isn't a problem since we use the historic snapshot
only for system catalog lookups.

On the master branch, we took a more future-proof approach -- writing
catalog modifying transactions to the serialized snapshot. But we
cannot backpatch it because of change in SnapBuild.

Back-patch to all supported releases.

Reported-by: Mike Oh <minsoo@amazon.com>
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Bertrand Drouvot <bdrouvot@amazon.com>
Reviewed-by: Shi yu <shiy.fnst@fujitsu.com>
Reviewed-by: Ahsan Hadi <ahsan.hadi@gmail.com>
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
Backpatch-through: 10
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 ++
 src/backend/replication/logical/snapbuild.c   | 143 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 238 insertions(+), 8 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b879..4553252d75 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot
+	twophase_snapshot catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 92dfafc632..6fefe9e964 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -691,6 +691,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. It is possible that we didn't mark this transaction as
+	 * containing catalog changes when the decoding starts from a commit record
+	 * without decoding the transaction's other changes. So, we ensure to mark
+	 * such transactions as containing catalog change.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * these transactions in the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 6df602485b..17e93aac67 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -250,8 +250,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded first was written.
+ * The array is sorted in xidComparator order. We remove Xids from
+ * this array once they are old enough to no longer matter. This array
+ * is allocated in builder->context so its lifetime is the same as the
+ * snapshot builder.
+ *
+ * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+ * if the transaction has changed the catalog, and that information
+ * is not serialized to SnapBuilder. Therefore, if the logical
+ * decoding decodes the commit record of the transaction that actually
+ * has done catalog changes without these records, we fail to add
+ * the xid to the snapshot, and end up looking at catalogs with the
+ * wrong snapshot. To avoid this problem, if the COMMIT record of
+ * the xid listed in InitialRunningXacts has the XACT_XINFO_HAS_INVALS
+ * flag, we mark both the top transaction and its subtransactions
+ * as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But it doesn't become a problem since
+ * we use historic snapshot only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -879,12 +909,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -919,6 +954,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there is no initial running transactions */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also be
+	 * sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1104,6 +1182,21 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
 	{
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded first was written.
+		 */
+		if (builder->state == SNAPBUILD_START)
+		{
+			int			nxacts = running->subxcnt + running->xcnt;
+			Size		sz = sizeof(TransactionId) * nxacts;
+
+			NInitialRunningXacts = nxacts;
+			InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+			memcpy(InitialRunningXacts, running->xids, sz);
+			qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+		}
+
 		/* returns false if there's no point in performing cleanup just yet */
 		if (!SnapBuildFindSnapshot(builder, lsn, running))
 			return;
@@ -1126,7 +1219,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1993,3 +2086,39 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark the
+ * transaction and its subtransactions as containing catalog changes. See
+ * comments
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (likely((NInitialRunningXacts == 0) ||
+			   ReorderBufferXidHasCatalogChanges(builder->reorder, xid)))
+		return;
+
+	/*
+	 * If this committed transaction is the one that was running at the time
+	 * when decoding the first RUNNING_XACTS record and have done catalog
+	 * changes, we can mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 */
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 3604621e88..a19b59e100 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -90,4 +90,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)

REL10-v8-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch (application/octet-stream)
From 83341fb2d72e30bb8ec75783e58990d63bd5d1ca Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Sun, 17 Jul 2022 21:28:35 +0900
Subject: [PATCH v8] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the transactions listed as running in the xl_running_xacts
record that we decoded first, and marks a transaction as containing
catalog changes if it's in the list of the initial running
transactions and its commit record has XACT_XINFO_HAS_INVALS. To avoid
ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
rather than in SnapBuild or ReorderBuffer.

This approach has false positives; we could end up adding a
transaction that didn't change the catalog to the snapshot, since we
cannot tell whether the transaction has catalog changes only by
checking the COMMIT record. It doesn't have the information on
which (sub) transaction has catalog changes, and XACT_XINFO_HAS_INVALS
doesn't necessarily indicate that the transaction has catalog
changes. But that isn't a problem since we use the historic snapshot
only for system catalog lookups.

On the master branch, we took a more future-proof approach -- writing
catalog modifying transactions to the serialized snapshot. But we
cannot backpatch it because of change in SnapBuild.

Back-patch to all supported releases.

Reported-by: Mike Oh <minsoo@amazon.com>
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Bertrand Drouvot <bdrouvot@amazon.com>
Reviewed-by: Shi yu <shiy.fnst@fujitsu.com>
Reviewed-by: Ahsan Hadi <ahsan.hadi@gmail.com>
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
Backpatch-through: 10
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  41 +++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 ++
 src/backend/replication/logical/snapbuild.c   | 143 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 235 insertions(+), 8 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 2db2b2774b..73bc0fe1fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -51,7 +51,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_check) \
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..15f9540b3f
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,41 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6f8920f52c..fd4d457e64 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -552,6 +552,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. It is possible that we didn't mark this transaction as
+	 * containing catalog changes when the decoding starts from a commit record
+	 * without decoding the transaction's other changes. So, we ensure to mark
+	 * such transactions as containing catalog change.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * these transactions in the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
+
 	/*
 	 * Process invalidation messages, even if we're not interested in the
 	 * transaction's contents, since the various caches need to always be
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1010a2e869..9519b953d2 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -258,8 +258,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded first was written.
+ * The array is sorted in xidComparator order. We remove Xids from
+ * this array once they are old enough to no longer matter. This array
+ * is allocated in builder->context so its lifetime is the same as the
+ * snapshot builder.
+ *
+ * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+ * if the transaction has changed the catalog, and that information
+ * is not serialized to SnapBuilder. Therefore, if the logical
+ * decoding decodes the commit record of the transaction that actually
+ * has done catalog changes without these records, we fail to add
+ * the xid to the snapshot, and end up looking at catalogs with the
+ * wrong snapshot. To avoid this problem, if the COMMIT record of
+ * the xid listed in InitialRunningXacts has the XACT_XINFO_HAS_INVALS
+ * flag, we mark both the top transaction and its subtransactions
+ * as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But it doesn't become a problem since
+ * we use historic snapshot only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -896,12 +926,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -936,6 +971,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there is no initial running transactions */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also be
+	 * sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1121,6 +1199,21 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
 	{
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded first was written.
+		 */
+		if (builder->state == SNAPBUILD_START)
+		{
+			int			nxacts = running->subxcnt + running->xcnt;
+			Size		sz = sizeof(TransactionId) * nxacts;
+
+			NInitialRunningXacts = nxacts;
+			InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+			memcpy(InitialRunningXacts, running->xids, sz);
+			qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+		}
+
 		/* returns false if there's no point in performing cleanup just yet */
 		if (!SnapBuildFindSnapshot(builder, lsn, running))
 			return;
@@ -1143,7 +1236,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1997,3 +2090,39 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark the
+ * transaction and its subtransactions as containing catalog changes. See
+ * the comments above InitialRunningXacts for details.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (likely((NInitialRunningXacts == 0) ||
+			   ReorderBufferXidHasCatalogChanges(builder->reorder, xid)))
+		return;
+
+	/*
+	 * If this committed transaction is the one that was running at the time
+	 * when decoding the first RUNNING_XACTS record and have done catalog
+	 * changes, we can mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 */
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index b95f56eec3..7a796ce136 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 							 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
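
For readers following the snapbuild.c hunks above, the bookkeeping can be sketched outside the backend roughly as follows. This is only an illustration under stated assumptions: Xid, InitialXacts and the initial_xacts_* functions are hypothetical names that do not exist in PostgreSQL, malloc stands in for allocation in builder->context, the purge compacts in place instead of copying through a workspace array, and xid wraparound is ignored entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;			/* stand-in for TransactionId (assumption) */

/* Plain unsigned ordering, in the spirit of xidComparator. */
static int
xid_cmp(const void *a, const void *b)
{
	Xid			x = *(const Xid *) a;
	Xid			y = *(const Xid *) b;

	return (x > y) - (x < y);
}

/* Hypothetical container for the xids remembered at startup. */
typedef struct InitialXacts
{
	Xid		   *xids;			/* sorted ascending, NULL when empty */
	size_t		n;
} InitialXacts;

/* Copy and sort the xids taken from the first running-xacts record. */
static void
initial_xacts_remember(InitialXacts *ix, const Xid *running, size_t n)
{
	ix->xids = NULL;
	ix->n = 0;
	if (n == 0)
		return;

	ix->xids = malloc(n * sizeof(Xid));
	assert(ix->xids != NULL);
	memcpy(ix->xids, running, n * sizeof(Xid));
	qsort(ix->xids, n, sizeof(Xid), xid_cmp);
	ix->n = n;
}

/* Binary search; returns 1 if xid was running at startup, else 0. */
static int
initial_xacts_contains(const InitialXacts *ix, Xid xid)
{
	if (ix->n == 0)
		return 0;
	return bsearch(&xid, ix->xids, ix->n, sizeof(Xid), xid_cmp) != NULL;
}

/* Drop every xid below xmin; survivors keep their sorted order. */
static void
initial_xacts_purge(InitialXacts *ix, Xid xmin)
{
	size_t		keep = 0;

	for (size_t i = 0; i < ix->n; i++)
	{
		if (ix->xids[i] >= xmin)
			ix->xids[keep++] = ix->xids[i];
	}

	ix->n = keep;
	if (keep == 0)
	{
		free(ix->xids);
		ix->xids = NULL;
	}
}
```

The sorted array makes the per-commit membership test a binary search, and purging by xmin lets the array shrink to empty so that the fast-path check on the count applies to every later commit.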

Attachment: REL13-v8-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch (application/octet-stream)
From a01f57fec21e03bb09563f66c3e26816538dd78a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 11 Jul 2022 21:49:06 +0900
Subject: [PATCH v8] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to the snapshot. Therefore, if the logical
decoding decodes only the commit record of a transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the running transactions written in the xl_running_xacts
record that we decoded first, and marks a transaction as containing
catalog changes if it's in the list of the initial running
transactions and its commit record has XACT_XINFO_HAS_INVALS. To avoid
ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
rather than in SnapBuild or ReorderBuffer.

This approach has false positives; we could end up adding a
transaction that didn't change the catalog to the snapshot since we
cannot distinguish whether the transaction has catalog changes only by
checking the COMMIT record. It doesn't have the information on
which (sub) transaction has catalog changes, and XACT_XINFO_HAS_INVALS
doesn't necessarily indicate that the transaction has catalog
changes. But that isn't a problem since we use the historic snapshot
only for system catalog lookups.

On the master branch, we took a more future-proof approach -- writing
catalog modifying transactions to the serialized snapshot. But we
cannot backpatch it because of the change in SnapBuild.

Back-patch to all supported releases.

Reported-by: Mike Oh <minsoo@amazon.com>
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Bertrand Drouvot <bdrouvot@amazon.com>
Reviewed-by: Shi yu <shiy.fnst@fujitsu.com>
Reviewed-by: Ahsan Hadi <ahsan.hadi@gmail.com>
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
Backpatch-through: 10
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 ++
 src/backend/replication/logical/snapbuild.c   | 143 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 238 insertions(+), 8 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c582a5..6ec09ab192 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5a2b828aa3..33a2fd2ee8 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -572,6 +572,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. It is possible that we didn't mark this transaction as
+	 * containing catalog changes when the decoding starts from a commit record
+	 * without decoding the transaction's other changes. So, we make sure to
+	 * mark such transactions as containing catalog changes.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * these transactions in the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
+
 	/*
 	 * Process invalidation messages, even if we're not interested in the
 	 * transaction's contents, since the various caches need to always be
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index be46bf0363..b5a9b1e5fb 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -252,8 +252,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded first was written.
+ * The array is sorted in xidComparator order. We remove Xids from
+ * this array when they become old enough to matter. This array is
+ * allocated in builder->context so its lifetime is the same as the
+ * snapshot builder.
+ *
+ * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+ * if the transaction has changed the catalog, and that information
+ * is not serialized to SnapBuilder. Therefore, if the logical
+ * decoding decodes the commit record of the transaction that actually
+ * has done catalog changes without these records, we fail to add
+ * the xid to the snapshot, and end up looking at catalogs with the
+ * wrong snapshot. To avoid this problem, if the COMMIT record of
+ * an xid listed in InitialRunningXacts has the XACT_XINFO_HAS_INVALS
+ * flag, we mark both the top transaction and its subtransactions
+ * as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But it doesn't become a problem since
+ * we use historic snapshot only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -890,12 +920,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -930,6 +965,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there is no initial running transactions */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also be
+	 * sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1115,6 +1193,21 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
 	{
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded first was written.
+		 */
+		if (builder->state == SNAPBUILD_START)
+		{
+			int			nxacts = running->subxcnt + running->xcnt;
+			Size		sz = sizeof(TransactionId) * nxacts;
+
+			NInitialRunningXacts = nxacts;
+			InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+			memcpy(InitialRunningXacts, running->xids, sz);
+			qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+		}
+
 		/* returns false if there's no point in performing cleanup just yet */
 		if (!SnapBuildFindSnapshot(builder, lsn, running))
 			return;
@@ -1137,7 +1230,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -2030,3 +2123,39 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark the
+ * transaction and its subtransactions as containing catalog changes. See
+ * the comments above InitialRunningXacts for details.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (likely((NInitialRunningXacts == 0) ||
+			   ReorderBufferXidHasCatalogChanges(builder->reorder, xid)))
+		return;
+
+	/*
+	 * If this committed transaction is the one that was running at the time
+	 * when decoding the first RUNNING_XACTS record and have done catalog
+	 * changes, we can mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 */
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index b048dc7484..17d2f93300 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
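
One detail in the patch above that may be worth spelling out: InitialRunningXacts is sorted and searched with xidComparator, while the purge loop tests entries with NormalTransactionIdPrecedes. As far as I understand the backend sources, the former is a plain unsigned comparison and the latter compares modulo 2^32, so the two orderings agree except when the xids involved straddle a wraparound. The sketch below uses hypothetical stand-ins (Xid, FIRST_NORMAL_XID, xid_plain_precedes, xid_circular_precedes) and not the real macros; the signed cast relies on the usual two's-complement conversion.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Xid;			/* stand-in for TransactionId (assumption) */

/* Assumed to mirror FirstNormalTransactionId; lower values are special. */
#define FIRST_NORMAL_XID ((Xid) 3)

/* Plain unsigned ordering: the order a sort by raw value produces. */
static int
xid_plain_precedes(Xid a, Xid b)
{
	return a < b;
}

/*
 * Circular ordering for normal xids: a precedes b when the signed
 * 32-bit difference is negative, i.e. a is "behind" b on the xid ring.
 */
static int
xid_circular_precedes(Xid a, Xid b)
{
	assert(a >= FIRST_NORMAL_XID && b >= FIRST_NORMAL_XID);
	return (int32_t) (a - b) < 0;
}
```

For example, with a = 4294967000 and b = 100 the plain comparison says b comes first, while the circular comparison says a precedes b, because b was assigned after the counter wrapped. For xids that are numerically close and on the same side of the wrap, both functions return the same answer.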

Attachment: REl11-v8-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch (application/octet-stream)
From 950f16af07d4ddf448f2fe0d66a52b856aa825c9 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Sun, 17 Jul 2022 07:30:23 +0900
Subject: [PATCH v8] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to the snapshot. Therefore, if the logical
decoding decodes only the commit record of a transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the running transactions written in the xl_running_xacts
record that we decoded first, and marks a transaction as containing
catalog changes if it's in the list of the initial running
transactions and its commit record has XACT_XINFO_HAS_INVALS. To avoid
ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
rather than in SnapBuild or ReorderBuffer.

This approach has false positives; we could end up adding a
transaction that didn't change the catalog to the snapshot since we
cannot distinguish whether the transaction has catalog changes only by
checking the COMMIT record. It doesn't have the information on
which (sub) transaction has catalog changes, and XACT_XINFO_HAS_INVALS
doesn't necessarily indicate that the transaction has catalog
changes. But that isn't a problem since we use the historic snapshot
only for system catalog lookups.

On the master branch, we took a more future-proof approach -- writing
catalog modifying transactions to the serialized snapshot. But we
cannot backpatch it because of the change in SnapBuild.

Back-patch to all supported releases.

Reported-by: Mike Oh <minsoo@amazon.com>
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Bertrand Drouvot <bdrouvot@amazon.com>
Reviewed-by: Shi yu <shiy.fnst@fujitsu.com>
Reviewed-by: Ahsan Hadi <ahsan.hadi@gmail.com>
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
Backpatch-through: 10
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 ++
 src/backend/replication/logical/snapbuild.c   | 143 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 238 insertions(+), 8 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 65a91a8014..973b94738a 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -51,7 +51,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_check) \
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c085f7b0f3..beadabb804 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -576,6 +576,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. It is possible that we didn't mark this transaction as
+	 * containing catalog changes when the decoding starts from a commit record
+	 * without decoding the transaction's other changes. So, we make sure to
+	 * mark such transactions as containing catalog changes.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * these transactions in the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
+
 	/*
 	 * Process invalidation messages, even if we're not interested in the
 	 * transaction's contents, since the various caches need to always be
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1c52bc64e3..8190e7a911 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -258,8 +258,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded first was written.
+ * The array is sorted in xidComparator order. We remove Xids from
+ * this array once they are too old to matter. This array is
+ * allocated in builder->context so its lifetime is the same as the
+ * snapshot builder.
+ *
+ * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+ * if the transaction has changed the catalog, and that information
+ * is not serialized to the snapshot. Therefore, if the logical
+ * decoding decodes the commit record of a transaction that actually
+ * has done catalog changes without these records, we fail to add
+ * the xid to the snapshot, and end up looking at catalogs with the
+ * wrong snapshot. To avoid this problem, if the COMMIT record of
+ * an xid listed in InitialRunningXacts has the XACT_XINFO_HAS_INVALS
+ * flag, we mark both the top transaction and its subtransactions
+ * as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But it doesn't become a problem since
+ * we use historic snapshot only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -896,12 +926,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -936,6 +971,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also be
+	 * sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1121,6 +1199,21 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
 	{
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded first was written.
+		 */
+		if (builder->state == SNAPBUILD_START)
+		{
+			int			nxacts = running->subxcnt + running->xcnt;
+			Size		sz = sizeof(TransactionId) * nxacts;
+
+			NInitialRunningXacts = nxacts;
+			InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+			memcpy(InitialRunningXacts, running->xids, sz);
+			qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+		}
+
 		/* returns false if there's no point in performing cleanup just yet */
 		if (!SnapBuildFindSnapshot(builder, lsn, running))
 			return;
@@ -1143,7 +1236,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1996,3 +2089,39 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark the
+ * transaction and its subtransactions as containing catalog changes. See
+ * the comments above InitialRunningXacts.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (likely((NInitialRunningXacts == 0) ||
+			   ReorderBufferXidHasCatalogChanges(builder->reorder, xid)))
+		return;
+
+	/*
+	 * If this committed transaction is one that was running at the time
+	 * the first decoded RUNNING_XACTS record was written and has done catalog
+	 * changes, we can mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 */
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 1df66a3c75..4df3c3f2f7 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 							 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
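
For readers skimming the back-branch patch above, the lookup it adds to DecodeCommit() can be sketched outside the server roughly as follows. This is only an illustration under stated assumptions: `InitialXacts`, `initial_xacts_remember()` and `commit_needs_catalog_mark()` are hypothetical names that do not exist in PostgreSQL, `xid_cmp()` merely imitates the plain numeric ordering of xidComparator, and malloc stands in for MemoryContextAlloc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;	/* stand-in for the server's TransactionId */

/* Plain numeric ordering, imitating xidComparator (assumption for the demo). */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	if (x > y)
		return 1;
	if (x < y)
		return -1;
	return 0;
}

/* Hypothetical holder for the xids of the first decoded running-xacts record. */
typedef struct InitialXacts
{
	TransactionId *xids;
	size_t		nxids;
} InitialXacts;

/* Copy the running xids and sort them so that bsearch() can be used later. */
static bool
initial_xacts_remember(InitialXacts *ix, const TransactionId *running, size_t n)
{
	assert(ix != NULL);
	ix->xids = NULL;
	ix->nxids = 0;
	if (n == 0)
		return true;

	ix->xids = malloc(n * sizeof(TransactionId));
	if (ix->xids == NULL)
		return false;
	memcpy(ix->xids, running, n * sizeof(TransactionId));
	qsort(ix->xids, n, sizeof(TransactionId), xid_cmp);
	ix->nxids = n;
	return true;
}

/*
 * Decide whether a committing xid should be marked as catalog-changing:
 * its commit record must carry invalidations and the xid must have been
 * running when the first running-xacts record was written.
 */
static bool
commit_needs_catalog_mark(const InitialXacts *ix, TransactionId xid,
						  bool has_invals)
{
	assert(ix != NULL);
	if (!has_invals || ix->nxids == 0)
		return false;
	return bsearch(&xid, ix->xids, ix->nxids,
				   sizeof(TransactionId), xid_cmp) != NULL;
}
```

A caller would invoke `initial_xacts_remember()` once, when the first running-xacts record is seen, and `commit_needs_catalog_mark()` for each commit record. As in the patch, a true result can be a false positive, since the invalidation flag alone does not prove a catalog change.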

REl12-v8-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch (application/octet-stream)
From 6eb109bf6429c2da07b96d26b65571c3172ec568 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Sun, 17 Jul 2022 07:19:00 +0900
Subject: [PATCH v8] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the running transactions listed in the xl_running_xacts
record that we decoded first, and marks a transaction as containing
catalog changes if it's in the list of the initial running
transactions and its commit record has XACT_XINFO_HAS_INVALS. To avoid
ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
rather than in SnapBuild or ReorderBuffer.

This approach has false positives; we could end up adding a
transaction that didn't change the catalog to the snapshot, since we
cannot tell whether the transaction has catalog changes only by
checking the COMMIT record. The record doesn't say which
(sub)transaction has catalog changes, and XACT_XINFO_HAS_INVALS
doesn't necessarily indicate that the transaction has catalog
changes. But this isn't a problem since we use the historic snapshot
only for system catalog lookups.

On the master branch, we took a more future-proof approach -- writing
catalog modifying transactions to the serialized snapshot. But we
cannot backpatch it because of the change in SnapBuild.

Back-patch to all supported releases.

Reported-by: Mike Oh <minsoo@amazon.com>
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Bertrand Drouvot <bdrouvot@amazon.com>
Reviewed-by: Shi yu <shiy.fnst@fujitsu.com>
Reviewed-by: Ahsan Hadi <ahsan.hadi@gmail.com>
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
Backpatch-through: 10
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 ++
 src/backend/replication/logical/snapbuild.c   | 143 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 238 insertions(+), 8 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c582a5..6ec09ab192 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 60d07ce4eb..56a0e3e255 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -575,6 +575,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. It is possible that we didn't mark this transaction as
+	 * containing catalog changes when the decoding starts from a commit record
+	 * without decoding the transaction's other changes. So, we ensure to mark
+	 * such transactions as containing catalog change.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * these transactions in the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
+
 	/*
 	 * Process invalidation messages, even if we're not interested in the
 	 * transaction's contents, since the various caches need to always be
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 5a1bce5acc..9d1039f7df 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -257,8 +257,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded first was written.
+ * The array is sorted in xidComparator order. We remove Xids from
+ * this array once they are too old to matter. This array is
+ * allocated in builder->context so its lifetime is the same as the
+ * snapshot builder.
+ *
+ * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+ * if the transaction has changed the catalog, and that information
+ * is not serialized to the snapshot. Therefore, if the logical
+ * decoding decodes the commit record of a transaction that actually
+ * has done catalog changes without these records, we fail to add
+ * the xid to the snapshot, and end up looking at catalogs with the
+ * wrong snapshot. To avoid this problem, if the COMMIT record of
+ * an xid listed in InitialRunningXacts has the XACT_XINFO_HAS_INVALS
+ * flag, we mark both the top transaction and its subtransactions
+ * as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But it doesn't become a problem since
+ * we use historic snapshot only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -895,12 +925,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -935,6 +970,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also be
+	 * sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1120,6 +1198,21 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
 	{
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded first was written.
+		 */
+		if (builder->state == SNAPBUILD_START)
+		{
+			int			nxacts = running->subxcnt + running->xcnt;
+			Size		sz = sizeof(TransactionId) * nxacts;
+
+			NInitialRunningXacts = nxacts;
+			InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+			memcpy(InitialRunningXacts, running->xids, sz);
+			qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+		}
+
 		/* returns false if there's no point in performing cleanup just yet */
 		if (!SnapBuildFindSnapshot(builder, lsn, running))
 			return;
@@ -1142,7 +1235,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -2035,3 +2128,39 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark the
+ * transaction and its subtransactions as containing catalog changes. See
+ * the comments above InitialRunningXacts.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (likely((NInitialRunningXacts == 0) ||
+			   ReorderBufferXidHasCatalogChanges(builder->reorder, xid)))
+		return;
+
+	/*
+	 * If this committed transaction is one that was running at the time
+	 * the first decoded RUNNING_XACTS record was written and has done catalog
+	 * changes, we can mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 */
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 3acf68f5bd..2eb9532a1b 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
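
The purge step that SnapBuildPurgeOlderTxn() gains in the patch above can also be illustrated in isolation. The following is a minimal sketch, not the patch's code: it compacts the array in place rather than through a separate workspace allocation, and `xid_precedes()` and `purge_older_xids()` are hypothetical helpers. The comparison uses the same signed 32-bit difference as NormalTransactionIdPrecedes, so it is only meaningful for normal (non-special) xids.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in for the server's TransactionId */

/*
 * Wraparound-aware "a logically precedes b" for normal xids, using the
 * same signed-difference arithmetic as NormalTransactionIdPrecedes.
 */
static int
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Drop every xid that precedes xmin, keeping the survivors in their
 * original relative order. Returns the number of surviving xids.
 */
static size_t
purge_older_xids(TransactionId *xids, size_t n, TransactionId xmin)
{
	size_t		kept = 0;

	assert(xids != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
	{
		if (!xid_precedes(xids[i], xmin))
			xids[kept++] = xids[i];	/* survivor: shift it down */
	}
	return kept;
}
```

Because the survivors keep their relative order, an array that was sorted before the purge is still sorted afterwards, which is the property the later bsearch() lookup depends on.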

master-v8-0001-Add-catalog-modifying-transactions-to-logical-dec.patch (application/octet-stream)
From 27c6daab9a4787683c60445fc6cc5dba550bdeb2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 6 Jul 2022 12:53:36 +0900
Subject: [PATCH v8] Add catalog modifying transactions to logical decoding
 serialized snapshot.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this change adds the list of transaction IDs and
sub-transaction IDs that have modified catalogs and are running at
the time of snapshot serialization to the serialized snapshot. When
decoding a COMMIT record, we check both the list and the ReorderBuffer
to see if the transaction has modified catalogs.

Since this adds additional information to the serialized snapshot, we
cannot backpatch it. For back branches, we take another approach:
remember the running-xacts list of the first decoded RUNNING_XACTS
record, and when a transaction's commit record has
XACT_XINFO_HAS_INVALS and its XID is in that list, mark the
top-level transaction and its subtransactions as containing catalog
changes. This doesn't require any file format changes, but the
transaction will end up being added to the snapshot even if it has
only relcache invalidations. That isn't a problem, as we use the
historic snapshot only for catalog lookups.

This commit bumps SNAPBUILD_VERSION because of the change in SnapBuild.

Back-patch to all supported releases.

Reported-by: Mike Oh <minsoo@amazon.com>
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Bertrand Drouvot <bdrouvot@amazon.com>
Reviewed-by: Shi yu <shiy.fnst@fujitsu.com>
Reviewed-by: Ahsan Hadi <ahsan.hadi@gmail.com>
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
Backpatch-through: 10
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 +++
 .../specs/catalog_change_snapshot.spec        |  39 +++
 .../replication/logical/reorderbuffer.c       |  71 ++++-
 src/backend/replication/logical/snapbuild.c   | 269 ++++++++++++------
 src/include/replication/reorderbuffer.h       |  12 +
 6 files changed, 347 insertions(+), 90 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 88a37fde72..0b2d9b7930 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -349,6 +349,8 @@ ReorderBufferAllocate(void)
 	buffer->by_txn_last_xid = InvalidTransactionId;
 	buffer->by_txn_last_txn = NULL;
 
+	buffer->catchange_ntxns = 0;
+
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
 	buffer->size = 0;
@@ -366,6 +368,7 @@ ReorderBufferAllocate(void)
 
 	dlist_init(&buffer->toplevel_by_lsn);
 	dlist_init(&buffer->txns_by_base_snapshot_lsn);
+	dlist_init(&buffer->catchange_txns);
 
 	/*
 	 * Ensure there's no stale data from prior uses of this slot, in case some
@@ -1526,14 +1529,22 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
-	 * Remove TXN from its containing list.
+	 * Remove TXN from its containing lists.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
 	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
-	 * from the LSN-ordered list of toplevel TXNs.
+	 * from the LSN-ordered list of toplevel TXNs. We remove TXN from
+	 * the list of catalog modifying transactions as well.
 	 */
 	dlist_delete(&txn->node);
+	if (rbtxn_has_catalog_changes(txn))
+	{
+		dlist_delete(&txn->catchange_node);
+		rb->catchange_ntxns--;
+
+		Assert(rb->catchange_ntxns >= 0);
+	}
 
 	/* now remove reference from buffer */
 	hash_search(rb->by_txn,
@@ -3275,10 +3286,16 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 								  XLogRecPtr lsn)
 {
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	if (!rbtxn_has_catalog_changes(txn))
+	{
+		txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &txn->catchange_node);
+		rb->catchange_ntxns++;
+	}
 
 	/*
 	 * Mark top-level transaction as having catalog changes too if one of its
@@ -3286,8 +3303,52 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	 * conveniently check just top-level transaction and decide whether to
 	 * build the hash table or not.
 	 */
-	if (txn->toptxn != NULL)
-		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	toptxn = txn->toptxn;
+	if (toptxn != NULL && !rbtxn_has_catalog_changes(toptxn))
+	{
+		toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &toptxn->catchange_node);
+		rb->catchange_ntxns++;
+	}
+}
+
+/*
+ * Return palloc'ed array of the transactions that have changed catalogs.
+ * The returned array is sorted in xidComparator order.
+ *
+ * The caller must free the returned array when done with it.
+ */
+TransactionId *
+ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb)
+{
+	dlist_iter iter;
+	TransactionId *xids = NULL;
+	size_t	xcnt = 0;
+
+	/* Quick return if the list is empty */
+	if (dlist_is_empty(&rb->catchange_txns))
+	{
+		Assert(rb->catchange_ntxns == 0);
+		return NULL;
+	}
+
+	/* Initialize XID array */
+	xids = (TransactionId *) palloc(sizeof(TransactionId) * rb->catchange_ntxns);
+	dlist_foreach(iter, &rb->catchange_txns)
+	{
+		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN,
+												catchange_node,
+												iter.cur);
+
+		Assert(rbtxn_has_catalog_changes(txn));
+
+		xids[xcnt++] = txn->xid;
+	}
+
+	qsort(xids, xcnt, sizeof(TransactionId), xidComparator);
+
+	Assert(xcnt == rb->catchange_ntxns);
+	return xids;
 }
 
 /*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 73c0f15214..4f766866b2 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -241,6 +241,30 @@ struct SnapBuild
 		 */
 		TransactionId *xip;
 	}			committed;
+
+	/*
+	 * Array of transactions and subtransactions that had modified catalogs
+	 * and were running when the snapshot was serialized.
+	 *
+	 * We normally rely on HEAP2_NEW_CID and XLOG_XACT_INVALIDATIONS records to
+	 * know if the transaction has changed the catalog. But it could happen that
+	 * the logical decoding decodes only the commit record of the transaction.
+	 * This array stores the transactions that have modified catalogs and were
+	 * running when serializing a snapshot, and this array is used to add such
+	 * transactions to the snapshot.
+	 *
+	 * This array is set once when restoring the snapshot. We remove xids from
+	 * this array once they are too old to matter, and then it eventually
+	 * becomes empty.
+	 */
+	struct
+	{
+		/* number of transactions */
+		size_t		xcnt;
+
+		/* This array must be sorted in xidComparator order */
+		TransactionId *xip;
+	}			catchange;
 };
 
 /*
@@ -250,8 +274,8 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/* ->committed and ->catchange manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -262,6 +286,8 @@ static void SnapBuildSnapIncRefcount(Snapshot snap);
 
 static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
+static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid);
+
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
 static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
@@ -269,6 +295,7 @@ static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutof
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
 static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path);
 
 /*
  * Allocate a new snapshot builder.
@@ -306,6 +333,9 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 		palloc0(builder->committed.xcnt_space * sizeof(TransactionId));
 	builder->committed.includes_all_transactions = true;
 
+	builder->catchange.xcnt = 0;
+	builder->catchange.xip = NULL;
+
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
@@ -888,12 +918,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed or containing catalog
+ * changes that are smaller than ->xmin. Those won't ever get checked via
+ * the ->committed or ->catchange array, respectively. The committed xids will
+ * get checked via the clog machinery.
+ *
+ * Ideally we would remove a transaction from the catchange array as soon
+ * as it finishes (commits or aborts), but that could be costly because we
+ * need to keep the xids in the array sorted.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -928,6 +963,40 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/*
+	 * purge xids in ->catchange as well. The purged array must also be
+	 * sorted in xidComparator order.
+	 */
+	if (builder->catchange.xcnt > 0)
+	{
+		/*
+		 * Since catchange.xip is sorted, we find the lower bound of
+		 * xids that still are interesting.
+		 */
+		for (off = 0; off < builder->catchange.xcnt; off++)
+		{
+			if (TransactionIdFollowsOrEquals(builder->catchange.xip[off],
+											 builder->xmin))
+				break;
+		}
+
+		surviving_xids = builder->catchange.xcnt - off;
+		if (surviving_xids > 0)
+			memmove(builder->catchange.xip, &(builder->catchange.xip[off]),
+					surviving_xids * sizeof(TransactionId));
+		else
+		{
+			/* catchange list becomes empty */
+			pfree(builder->catchange.xip);
+			builder->catchange.xip = NULL;
+		}
+
+		elog(DEBUG3, "purged catalog modifying transactions from %u to %u, xmin %u",
+			 (uint32) builder->catchange.xcnt, (uint32) surviving_xids,
+			builder->xmin);
+		builder->catchange.xcnt = surviving_xids;
+	}
 }
 
 /*
@@ -983,7 +1052,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		 * Add subtransaction to base snapshot if catalog modifying, we don't
 		 * distinguish to toplevel transactions there.
 		 */
-		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (SnapBuildXidHasCatalogChanges(builder, subxid))
 		{
 			sub_needs_timetravel = true;
 			needs_snapshot = true;
@@ -1012,7 +1081,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 
 	/* if top-level modified catalog, it'll need a snapshot */
-	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (SnapBuildXidHasCatalogChanges(builder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes",
 			 xid);
@@ -1089,6 +1158,21 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 }
 
+/*
+ * Check the reorder buffer and the snapshot to see if the given transaction has
+ * modified catalogs.
+ */
+static inline bool
+SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid)
+{
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return true;
+
+	/* Check the catchange XID array */
+	return ((builder->catchange.xcnt > 0) &&
+			(bsearch(&xid, builder->catchange.xip, builder->catchange.xcnt,
+					 sizeof(TransactionId), xidComparator) != NULL));
+}
 
 /* -----------------------------------
  * Snapshot building functions dealing with xlog records
@@ -1135,7 +1219,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1438,6 +1522,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
  *
  * struct SnapBuildOnDisk;
  * TransactionId * committed.xcnt; (*not xcnt_space*)
+ * TransactionId * catchange.xcnt;
  *
  */
 typedef struct SnapBuildOnDisk
@@ -1467,7 +1552,7 @@ typedef struct SnapBuildOnDisk
 	offsetof(SnapBuildOnDisk, version)
 
 #define SNAPBUILD_MAGIC 0x51A1E001
-#define SNAPBUILD_VERSION 4
+#define SNAPBUILD_VERSION 5
 
 /*
  * Store/Load a snapshot from disk, depending on the snapshot builder's state.
@@ -1493,6 +1578,9 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 {
 	Size		needed_length;
 	SnapBuildOnDisk *ondisk = NULL;
+	TransactionId	*catchange_xip = NULL;
+	MemoryContext	old_ctx;
+	size_t		catchange_xcnt;
 	char	   *ondisk_c;
 	int			fd;
 	char		tmppath[MAXPGPATH];
@@ -1578,10 +1666,16 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 				(errcode_for_file_access(),
 				 errmsg("could not remove file \"%s\": %m", tmppath)));
 
+	old_ctx = MemoryContextSwitchTo(builder->context);
+
+	/* Get the catalog modifying transactions that are not yet committed */
+	catchange_xip = ReorderBufferGetCatalogChangesXacts(builder->reorder);
+	catchange_xcnt = builder->reorder->catchange_ntxns;
+
 	needed_length = sizeof(SnapBuildOnDisk) +
-		sizeof(TransactionId) * builder->committed.xcnt;
+		sizeof(TransactionId) * (builder->committed.xcnt + catchange_xcnt);
 
-	ondisk_c = MemoryContextAllocZero(builder->context, needed_length);
+	ondisk_c = palloc0(needed_length);
 	ondisk = (SnapBuildOnDisk *) ondisk_c;
 	ondisk->magic = SNAPBUILD_MAGIC;
 	ondisk->version = SNAPBUILD_VERSION;
@@ -1598,16 +1692,31 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	ondisk->builder.snapshot = NULL;
 	ondisk->builder.reorder = NULL;
 	ondisk->builder.committed.xip = NULL;
+	ondisk->builder.catchange.xip = NULL;
+	/* update catchange only on disk data */
+	ondisk->builder.catchange.xcnt = catchange_xcnt;
 
 	COMP_CRC32C(ondisk->checksum,
 				&ondisk->builder,
 				sizeof(SnapBuild));
 
 	/* copy committed xacts */
-	sz = sizeof(TransactionId) * builder->committed.xcnt;
-	memcpy(ondisk_c, builder->committed.xip, sz);
-	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
-	ondisk_c += sz;
+	if (builder->committed.xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * builder->committed.xcnt;
+		memcpy(ondisk_c, builder->committed.xip, sz);
+		COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+		ondisk_c += sz;
+	}
+
+	/* copy catalog modifying xacts */
+	if (catchange_xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * catchange_xcnt;
+		memcpy(ondisk_c, catchange_xip, sz);
+		COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+		ondisk_c += sz;
+	}
 
 	FIN_CRC32C(ondisk->checksum);
 
@@ -1688,12 +1797,16 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	 */
 	builder->last_serialized_snapshot = lsn;
 
+	MemoryContextSwitchTo(old_ctx);
+
 out:
 	ReorderBufferSetRestartPoint(builder->reorder,
 								 builder->last_serialized_snapshot);
 	/* be tidy */
 	if (ondisk)
 		pfree(ondisk);
+	if (catchange_xip)
+		pfree(catchange_xip);
 }
 
 /*
@@ -1707,7 +1820,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	int			fd;
 	char		path[MAXPGPATH];
 	Size		sz;
-	int			readBytes;
 	pg_crc32c	checksum;
 
 	/* no point in loading a snapshot if we're already there */
@@ -1739,29 +1851,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
 
 	/* read statically sized portion of snapshot */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk, SnapBuildOnDiskConstantSize);
-	pgstat_report_wait_end();
-	if (readBytes != SnapBuildOnDiskConstantSize)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes,
-							(Size) SnapBuildOnDiskConstantSize)));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk, SnapBuildOnDiskConstantSize, path);
 
 	if (ondisk.magic != SNAPBUILD_MAGIC)
 		ereport(ERROR,
@@ -1781,56 +1871,26 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 				SnapBuildOnDiskConstantSize - SnapBuildOnDiskNotChecksummedSize);
 
 	/* read SnapBuild */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk.builder, sizeof(SnapBuild));
-	pgstat_report_wait_end();
-	if (readBytes != sizeof(SnapBuild))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sizeof(SnapBuild))));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk.builder, sizeof(SnapBuild), path);
 	COMP_CRC32C(checksum, &ondisk.builder, sizeof(SnapBuild));
 
 	/* restore committed xacts information */
-	sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
-	ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, ondisk.builder.committed.xip, sz);
-	pgstat_report_wait_end();
-	if (readBytes != sz)
+	if (ondisk.builder.committed.xcnt > 0)
 	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
+		sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
+		ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
+		SnapBuildRestoreContents(fd, (char *) ondisk.builder.committed.xip, sz, path);
+		COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
+	}
 
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sz)));
+	/* restore catalog modifying xacts information */
+	if (ondisk.builder.catchange.xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * ondisk.builder.catchange.xcnt;
+		ondisk.builder.catchange.xip = MemoryContextAllocZero(builder->context, sz);
+		SnapBuildRestoreContents(fd, (char *) ondisk.builder.catchange.xip, sz, path);
+		COMP_CRC32C(checksum, ondisk.builder.catchange.xip, sz);
 	}
-	COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
 
 	if (CloseTransientFile(fd) != 0)
 		ereport(ERROR,
@@ -1885,6 +1945,13 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	ondisk.builder.committed.xip = NULL;
 
+	/* set catalog modifying transactions */
+	if (builder->catchange.xip)
+		pfree(builder->catchange.xip);
+	builder->catchange.xcnt = ondisk.builder.catchange.xcnt;
+	builder->catchange.xip = ondisk.builder.catchange.xip;
+	ondisk.builder.catchange.xip = NULL;
+
 	/* our snapshot is not interesting anymore, build a new one */
 	if (builder->snapshot != NULL)
 	{
@@ -1906,9 +1973,43 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 snapshot_not_interesting:
 	if (ondisk.builder.committed.xip != NULL)
 		pfree(ondisk.builder.committed.xip);
+	if (ondisk.builder.catchange.xip != NULL)
+		pfree(ondisk.builder.catchange.xip);
 	return false;
 }
 
+/*
+ * Read the contents of the serialized snapshot to the dest.
+ */
+static void
+SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
+{
+	int			readBytes;
+
+	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+	readBytes = read(fd, dest, size);
+	pgstat_report_wait_end();
+	if (readBytes != size)
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+
+		if (readBytes < 0)
+		{
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("could not read file \"%s\": read %d of %zu",
+							path, readBytes, size)));
+	}
+}
+
 /*
  * Remove all serialized snapshots that are not required anymore because no
  * slot can need them. This doesn't actually have to run during a checkpoint,
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d109d0baed..fd84f175c0 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -380,6 +380,11 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	node;
 
+	/*
+	 * A node in the list of catalog modifying transactions
+	 */
+	dlist_node	catchange_node;
+
 	/*
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
@@ -526,6 +531,12 @@ struct ReorderBuffer
 	 */
 	dlist_head	txns_by_base_snapshot_lsn;
 
+	/*
+	 * Transactions and subtransactions that have modified system catalogs.
+	 */
+	dlist_head	catchange_txns;
+	int			catchange_ntxns;
+
 	/*
 	 * one-entry sized cache for by_txn. Very frequently the same txn gets
 	 * looked up over and over again.
@@ -677,6 +688,7 @@ extern void ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
 extern void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 extern ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 extern TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
+extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
-- 
2.24.3 (Apple Git-128)
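
For readers following the master-branch patch above: its core idea is a sorted array of catalog-modifying XIDs that is searched with bsearch() and trimmed from the front once xmin advances. The following is only a minimal standalone sketch of that idea, not the patch's code. The names Xid, xid_cmp, catchange_contains and catchange_purge are made up for illustration, and the purge uses a plain unsigned comparison, whereas the patch uses the wraparound-aware TransactionIdFollowsOrEquals().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;			/* hypothetical stand-in for TransactionId */

/* Plain unsigned ordering; assumed to match what the patch calls xidComparator order. */
static int
xid_cmp(const void *a, const void *b)
{
	Xid			x = *(const Xid *) a;
	Xid			y = *(const Xid *) b;

	return (x > y) - (x < y);
}

/* Membership test on a sorted array, in the spirit of SnapBuildXidHasCatalogChanges(). */
static bool
catchange_contains(const Xid *xip, size_t xcnt, Xid xid)
{
	return xcnt > 0 &&
		bsearch(&xid, xip, xcnt, sizeof(Xid), xid_cmp) != NULL;
}

/*
 * Drop every leading entry that is older than xmin and return the new count.
 * Because the array is sorted, the survivors are a suffix and stay sorted.
 * NOTE: this sketch ignores XID wraparound.
 */
static size_t
catchange_purge(Xid *xip, size_t xcnt, Xid xmin)
{
	size_t		off = 0;

	assert(xip != NULL || xcnt == 0);

	while (off < xcnt && xip[off] < xmin)
		off++;

	if (off > 0 && off < xcnt)
		memmove(xip, xip + off, (xcnt - off) * sizeof(Xid));

	return xcnt - off;
}
```

A caller would qsort() the array once with xid_cmp, use catchange_contains() when it sees a commit record, and call catchange_purge() each time a new xmin is learned from a running-xacts record.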

#98Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#97)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, Jul 25, 2022 at 10:45 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sat, Jul 23, 2022 at 8:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 22, 2022 at 11:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 20, 2022 at 5:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 20, 2022 at 1:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

This is required if we don't want to introduce a new set of functions
as you proposed above. I am not sure which one is better w.r.t back
patching effort later but it seems to me using flag stuff would make
future back patches easier if we make any changes in
SnapBuildCommitTxn.

Understood.

I've implemented this idea as well for discussion. Both patches have
the common change to remember the initial running transactions and to
purge them when decoding xl_running_xacts records. The difference is
how to mark the transactions as needing to be added to the snapshot.

In v7-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch,
when the transaction is in the initial running xacts list and its
commit record has the XACT_XINFO_HAS_INVALS flag, we mark both the top
transaction and all of its subtransactions as containing catalog changes
(which also means creating ReorderBufferTXN entries for them). These
transactions are added to the snapshot in SnapBuildCommitTxn() since
ReorderBufferXidHasCatalogChanges() returns true for them.
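
To make the rule described above concrete, here is a rough standalone sketch of the decision taken at commit time. It is not the code from the patch: InitialRunning, COMMIT_HAS_INVALS and mark_catalog_changes_at_commit are hypothetical names, the flag value is arbitrary, and the "marking" is modelled by copying XIDs into an output array instead of touching ReorderBufferTXN entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Xid;			/* hypothetical stand-in for TransactionId */

/* Hypothetical stand-in for XACT_XINFO_HAS_INVALS; the value is arbitrary. */
#define COMMIT_HAS_INVALS 0x01u

/* Sorted XIDs taken from the first running-xacts record that was decoded. */
typedef struct InitialRunning
{
	const Xid  *xip;
	size_t		xcnt;
} InitialRunning;

static int
xid_cmp(const void *a, const void *b)
{
	Xid			x = *(const Xid *) a;
	Xid			y = *(const Xid *) b;

	return (x > y) - (x < y);
}

static bool
initial_running_contains(const InitialRunning *ir, Xid xid)
{
	return ir->xcnt > 0 &&
		bsearch(&xid, ir->xip, ir->xcnt, sizeof(Xid), xid_cmp) != NULL;
}

/*
 * If the commit record carries invalidations and the top-level xid was
 * already running when decoding started, treat the top-level transaction
 * and all of its subtransactions as catalog-modifying.
 *
 * 'marked' must have room for nsubxacts + 1 entries.  Returns the number
 * of xids written to it (0 when nothing needs to be marked).
 */
static size_t
mark_catalog_changes_at_commit(const InitialRunning *ir, uint32_t xinfo,
							   Xid xid, const Xid *subxacts, size_t nsubxacts,
							   Xid *marked)
{
	size_t		n = 0;

	if ((xinfo & COMMIT_HAS_INVALS) == 0)
		return 0;
	if (!initial_running_contains(ir, xid))
		return 0;

	marked[n++] = xid;
	for (size_t i = 0; i < nsubxacts; i++)
		marked[n++] = subxacts[i];

	return n;
}
```

Note that the sketch, like the description, can produce false positives: the commit record does not say which (sub)transaction actually changed the catalog, so every subtransaction is marked.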

In poc_mark_top_txn_has_inval.patch, when the transaction is in the
initial running xacts list and its commit record has the
XACT_XINFO_HAS_INVALS flag, we set a new flag, say
RBTXN_COMMIT_HAS_INVALS, only on the top transaction.

It seems that the patch is missing the check for whether the xid is in
the initial running xacts list?

Oops, right.

In SnapBuildCommitTxn(), we add all subtransactions to
the snapshot without checking ReorderBufferXidHasCatalogChanges() for
them if their top transaction has the RBTXN_COMMIT_HAS_INVALS
flag.

A difference between the two ideas is the scope of changes: the former
changes only snapbuild.c but the latter changes both snapbuild.c and
reorderbuffer.c. Moreover, while the former uses the existing flag,
the latter adds a new flag to the reorder buffer just to deal with
this case. I think the former idea is simpler in that respect. But
an advantage of the latter idea is that it avoids creating
ReorderBufferTXN entries for subtransactions.

Overall I prefer the former for now but I'd like to hear what others think.

I agree that the latter idea can have better performance in extremely
special scenarios but introducing a new flag for the same sounds a bit
ugly to me. So, I would also prefer to go with the former idea,
however, I would also like to hear what Horiguchi-San and others have
to say.

Agreed.

A few comments on v7-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during:
1.
+void
+SnapBuildInitialXactSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+   int subxcnt, TransactionId *subxacts,
+   XLogRecPtr lsn)
+{

I think it is better to name this function as
SnapBuildXIDSetCatalogChanges as we use this to mark a particular
transaction as having catalog changes.

2. Changed/added a few comments in the attached.

Thank you for the comments.

I've attached updated version patches for the master and back branches.

I've attached the patch for REL15 that I forgot.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

REL15-v8-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch (application/octet-stream)
From b4ce41c18f35ee32fb82b509e193645b8270edbd Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v8] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XLOG_XACT_INVALIDATIONS
records to know if the transaction has modified the catalog, and that
information is not serialized to the snapshot. Therefore, if logical
decoding decodes only the commit record of a transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the running transactions written in the xl_running_xacts
record that we decoded first, and marks a transaction as containing
catalog changes if it's in the list of the initial running
transactions and its commit record has XACT_XINFO_HAS_INVALS. To avoid
ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
rather than in SnapBuild or ReorderBuffer.

This approach has false positives; we could end up adding a
transaction that didn't change the catalog to the snapshot since we cannot
distinguish whether the transaction has catalog changes only by
checking the COMMIT record. It doesn't have the information on
which (sub) transaction has catalog changes, and XACT_XINFO_HAS_INVALS
doesn't necessarily indicate that the transaction has catalog
changes. But that is not a problem since we use the historic snapshot
only for system catalog lookups.

On the master branch, we took a more future-proof approach -- writing
catalog modifying transactions to the serialized snapshot. But we
cannot backpatch it because of change in SnapBuild.

Back-patch to all supported releases.

Reported-by: Mike Oh <minsoo@amazon.com>
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Bertrand Drouvot <bdrouvot@amazon.com>
Reviewed-by: Shi yu <shiy.fnst@fujitsu.com>
Reviewed-by: Ahsan Hadi <ahsan.hadi@gmail.com>
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
Backpatch-through: 10
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 ++
 src/backend/replication/logical/snapbuild.c   | 143 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 238 insertions(+), 8 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index aa2427ba73..312cce827c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -627,6 +627,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. It is possible that we didn't mark this transaction as
+	 * containing catalog changes when the decoding starts from a commit record
+	 * without decoding the transaction's other changes. So, we ensure to mark
+	 * such transactions as containing catalog change.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * these transactions in the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1119a12db9..d075e5ee4e 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -250,8 +250,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded first was written.
+ * The array is sorted in xidComparator order. We remove xids from
+ * this array once they are old enough to no longer matter. This array
+ * is allocated in builder->context so its lifetime is the same as the
+ * snapshot builder.
+ *
+ * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+ * if the transaction has changed the catalog, and that information
+ * is not serialized to the snapshot. Therefore, if logical
+ * decoding decodes the commit record of a transaction that actually
+ * made catalog changes without decoding these records, we fail to add
+ * the xid to the snapshot, and end up looking at catalogs with the
+ * wrong snapshot. To avoid this problem, if the COMMIT record of
+ * an xid listed in InitialRunningXacts has the XACT_XINFO_HAS_INVALS
+ * flag, we mark both the top transaction and its subtransactions
+ * as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But it doesn't become a problem since
+ * we use historic snapshot only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -888,12 +918,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -928,6 +963,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also be
+	 * sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1113,6 +1191,21 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
 	{
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded first was written.
+		 */
+		if (builder->state == SNAPBUILD_START)
+		{
+			int			nxacts = running->subxcnt + running->xcnt;
+			Size		sz = sizeof(TransactionId) * nxacts;
+
+			NInitialRunningXacts = nxacts;
+			InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+			memcpy(InitialRunningXacts, running->xids, sz);
+			qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+		}
+
 		/* returns false if there's no point in performing cleanup just yet */
 		if (!SnapBuildFindSnapshot(builder, lsn, running))
 			return;
@@ -1135,7 +1228,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -2000,3 +2093,39 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark the
+ * transaction and its subtransactions as containing catalog changes. See
+ * the comments above InitialRunningXacts.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (likely((NInitialRunningXacts == 0) ||
+			   ReorderBufferXidHasCatalogChanges(builder->reorder, xid)))
+		return;
+
+	/*
+	 * If this committed transaction is one that was running when the first
+	 * decoded RUNNING_XACTS record was written and has done catalog
+	 * changes, we can mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 */
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index d179251aad..53d83f348a 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -91,4 +91,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
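
For readers following the purge step in SnapBuildPurgeOlderTxn() in the patch above, here is a minimal standalone sketch of the same idea: drop every xid that precedes xmin while keeping the survivors in their original order. It is not the patch code. The names sketch_xid_precedes() and sketch_purge_older() are hypothetical, the array is compacted in place rather than through a temporary workspace, and the comparison is only an assumption about how a wraparound-aware "precedes" test behaves for normal xids.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Assumption: wraparound-aware "a precedes b" for normal xids, using a
 * modulo-2^32 difference. Hypothetical helper, not a PostgreSQL API.
 */
int
sketch_xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Remove every xid that precedes xmin from xip[0..xcnt), preserving the
 * relative order of the survivors. Returns the number of surviving xids.
 */
size_t
sketch_purge_older(TransactionId *xip, size_t xcnt, TransactionId xmin)
{
	size_t		surviving = 0;

	assert(xip != NULL || xcnt == 0);

	for (size_t off = 0; off < xcnt; off++)
	{
		/* keep only xids that are not older than xmin */
		if (!sketch_xid_precedes(xip[off], xmin))
			xip[surviving++] = xip[off];
	}
	return surviving;
}
```

With xip = {90, 95, 100, 120} and xmin = 100, the call returns 2 and leaves 100 and 120 at the front of the array.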

#99shiy.fnst@fujitsu.com
shiy.fnst@fujitsu.com
In reply to: Masahiko Sawada (#98)
1 attachment(s)
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

Hi,

I ran some performance tests for the master branch patch (based on the v6 patch)
to see whether the bsearch() added by this patch causes any overhead.

I ran each test three times and took the average.

The results are as follows; a bar chart is attached.

case 1
---------
No catalog modifying transaction.
Decode 800k pgbench transactions. (8 clients, 100k transactions per client)

master 7.5417
patched 7.4107

case 2
---------
There's one catalog modifying transaction.
Decode 100k/500k/1M transactions.

         100k    500k    1M
master   0.0576  0.1491  0.4346
patched  0.0586  0.1500  0.4344

case 3
---------
There are 64 catalog modifying transactions.
Decode 100k/500k/1M transactions.

         100k    500k    1M
master   0.0600  0.1666  0.4876
patched  0.0620  0.1653  0.4795

(Because the result of case 3 shows an overhead of about 3% when decoding
100k transactions with 64 catalog modifying transactions, I also tested the
next run of 100k xacts, with and without catalog modifying transactions, to
see whether the patch affects subsequent decoding.)

case 4.1
---------
After the test steps in case 3 (64 catalog modifying transactions, decode 100k
transactions), run 100k xacts and then decode.

master 0.3699
patched 0.3701

case 4.2
---------
After the test steps in case 3 (64 catalog modifying transactions, decode 100k
transactions), run 64 DDLs(without checkpoint) and 100k xacts, then decode.

master 0.3687
patched 0.3696

Summary of the tests:
After applying this patch, there is an overhead of about 3% when decoding
100k transactions with 64 catalog modifying transactions. This is an extreme
case, so maybe it's okay. Cases 4.1 and 4.2 show that the patch has no
effect on subsequent decoding. In the other cases, there are no significant
differences.
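
As a worked check of the "about 3%" figure, the relative difference can be computed directly from the case 3 numbers. The helper below is only an illustrative sketch; sketch_overhead_pct() is a made-up name, and it was not part of the test scripts.

```c
#include <assert.h>

/*
 * Relative change of the patched timing versus master, in percent.
 * A positive result means the patched build was slower.
 */
double
sketch_overhead_pct(double master, double patched)
{
	assert(master > 0.0);
	return (patched - master) / master * 100.0;
}
```

For case 3 at 100k transactions, sketch_overhead_pct(0.0600, 0.0620) comes out at roughly 3.3. For case 1, sketch_overhead_pct(7.5417, 7.4107) is roughly -1.7, i.e. the patched run was slightly faster there.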

For your information, here are the parameters specified in postgresql.conf in
the test.

shared_buffers = 8GB
checkpoint_timeout = 30min
max_wal_size = 20GB
min_wal_size = 10GB
autovacuum = off

Regards,
Shi yu

Attachments:

performance_test.png (image/png)
#100Masahiko Sawada
sawada.mshk@gmail.com
In reply to: shiy.fnst@fujitsu.com (#99)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, Jul 25, 2022 at 7:57 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

> Hi,
>
> I did some performance test for the master branch patch (based on v6 patch) to
> see if the bsearch() added by this patch will cause any overhead.

Thank you for doing performance tests!

> Summary of the tests:
> After applying this patch, there is an overhead of about 3% in the case decoding
> 100k transactions with 64 catalog modifying transactions. This is an extreme
> case, so maybe it's okay.

Yes. If we're worried about the overhead and doing bsearch() is the
cause, probably we can try simplehash instead of the array.

One possible improvement is to pass parsed->xinfo down to
SnapBuildXidHasCatalogChanges() and return from that function
before doing bsearch() if parsed->xinfo doesn't have
XACT_XINFO_HAS_INVALS set. That would save calling bsearch() for
non-catalog-modifying transactions. Is it worth trying?
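
A rough standalone sketch of that early return, for illustration only. It models just the sorted-array lookup, not the ReorderBuffer check or anything else in the real function. All names prefixed with sketch_ or SKETCH_ are hypothetical, the flag value is a stand-in for XACT_XINFO_HAS_INVALS, and the comparator is assumed to order xids numerically the way xidComparator does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Stand-in for XACT_XINFO_HAS_INVALS; the value is illustrative only. */
#define SKETCH_XINFO_HAS_INVALS (1U << 3)

/* Counts how many times the binary search is actually performed. */
int			sketch_search_calls = 0;

/* Plain numeric ordering of xids (assumed to match xidComparator). */
static int
sketch_xid_cmp(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	if (x > y)
		return 1;
	if (x < y)
		return -1;
	return 0;
}

/*
 * Return true if xid is in the sorted array xip[0..xcnt). The commit
 * record's xinfo is checked first so that transactions without
 * invalidation messages never reach bsearch().
 */
bool
sketch_xid_has_catalog_changes(const TransactionId *xip, size_t xcnt,
							   TransactionId xid, uint32_t xinfo)
{
	assert(xip != NULL || xcnt == 0);

	/* cheap early exit: no invalidations, so skip the lookup */
	if ((xinfo & SKETCH_XINFO_HAS_INVALS) == 0)
		return false;

	if (xcnt == 0)
		return false;

	sketch_search_calls++;
	return bsearch(&xid, xip, xcnt, sizeof(TransactionId),
				   sketch_xid_cmp) != NULL;
}
```

The point of the ordering is that the flag test is a single bit check, so the per-commit cost for ordinary transactions no longer depends on the size of the array.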

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#101Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#100)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 26, 2022 at 7:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> Yes. If we're worried about the overhead and doing bsearch() is the
> cause, probably we can try simplehash instead of the array.

I am not sure we need to go that far for such an extreme corner
case. Let's first try your idea below.

> An improvement idea is that we pass the parsed->xinfo down to
> SnapBuildXidHasCatalogChanges(), and then return from that function
> before doing bsearch() if the parsed->xinfo doesn't have
> XACT_XINFO_HAS_INVALS. That would save calling bsearch() for
> non-catalog-modifying transactions. Is it worth trying?

I think this is worth trying and this might reduce some of the
overhead as well in the case presented by Shi-San.

--
With Regards,
Amit Kapila.

#102Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#101)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 26, 2022 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> > Yes. If we're worried about the overhead and doing bsearch() is the
> > cause, probably we can try simplehash instead of the array.
>
> I am not sure we need to go that far for such an extreme corner
> case. Let's first try your idea below.
>
> > An improvement idea is that we pass the parsed->xinfo down to
> > SnapBuildXidHasCatalogChanges(), and then return from that function
> > before doing bsearch() if the parsed->xinfo doesn't have
> > XACT_XINFO_HAS_INVALS. That would save calling bsearch() for
> > non-catalog-modifying transactions. Is it worth trying?
>
> I think this is worth trying and this might reduce some of the
> overhead as well in the case presented by Shi-San.

Okay, I've attached an updated patch that implements the above idea.
Shi yu, could you please run the performance tests again to see whether
it helps reduce the overhead?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

master-v9-0001-Add-catalog-modifying-transactions-to-logical-dec.patch (application/octet-stream)
From ad94fd644e6f3fc110ba5b55348df63290e5189a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 6 Jul 2022 12:53:36 +0900
Subject: [PATCH v9] Add catalog modifying transactions to logical decoding
 serialized snapshot.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this change adds the list of transaction IDs and
sub-transaction IDs that have modified catalogs and are running at the
time of snapshot serialization to the serialized snapshot. When decoding
a COMMIT record, we check both the list and the ReorderBuffer to see if
the transaction has modified catalogs.

Since this adds additional information to the serialized snapshot, we
cannot backpatch it. For back branches, we take another approach:
remember the list of running transactions from the first decoded
RUNNING_XACTS record and, when a transaction's commit record has
XACT_XINFO_HAS_INVALS and its XID is in that list, mark the top-level
transaction and its subtransactions as containing catalog changes. This
doesn't require any file format changes, but the transaction will end
up being added to the snapshot even if it has only relcache
invalidations. That is not a problem because we use the historic
snapshot only for catalog lookups.

This commit bumps SNAPBUILD_VERSION because of change in SnapBuild.

Back-patch to all supported releases.

Reported-by: Mike Oh <minsoo@amazon.com>
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Bertrand Drouvot <bdrouvot@amazon.com>
Reviewed-by: Shi yu <shiy.fnst@fujitsu.com>
Reviewed-by: Ahsan Hadi <ahsan.hadi@gmail.com>
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
Backpatch-through: 10
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 +++
 .../specs/catalog_change_snapshot.spec        |  39 +++
 src/backend/replication/logical/decode.c      |   3 +-
 .../replication/logical/reorderbuffer.c       |  71 ++++-
 src/backend/replication/logical/snapbuild.c   | 280 ++++++++++++------
 src/include/replication/reorderbuffer.h       |  12 +
 src/include/replication/snapbuild.h           |   2 +-
 8 files changed, 360 insertions(+), 93 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c5c6a2ba68..1667d720b1 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -628,7 +628,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
-					   parsed->nsubxacts, parsed->subxacts);
+					   parsed->nsubxacts, parsed->subxacts,
+					   parsed->xinfo);
 
 	/* ----
 	 * Check whether we are interested in this specific transaction, and tell
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 88a37fde72..0b2d9b7930 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -349,6 +349,8 @@ ReorderBufferAllocate(void)
 	buffer->by_txn_last_xid = InvalidTransactionId;
 	buffer->by_txn_last_txn = NULL;
 
+	buffer->catchange_ntxns = 0;
+
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
 	buffer->size = 0;
@@ -366,6 +368,7 @@ ReorderBufferAllocate(void)
 
 	dlist_init(&buffer->toplevel_by_lsn);
 	dlist_init(&buffer->txns_by_base_snapshot_lsn);
+	dlist_init(&buffer->catchange_txns);
 
 	/*
 	 * Ensure there's no stale data from prior uses of this slot, in case some
@@ -1526,14 +1529,22 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
-	 * Remove TXN from its containing list.
+	 * Remove TXN from its containing lists.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
 	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
-	 * from the LSN-ordered list of toplevel TXNs.
+	 * from the LSN-ordered list of toplevel TXNs. We remove TXN from
+	 * the list of catalog modifying transactions as well.
 	 */
 	dlist_delete(&txn->node);
+	if (rbtxn_has_catalog_changes(txn))
+	{
+		dlist_delete(&txn->catchange_node);
+		rb->catchange_ntxns--;
+
+		Assert(rb->catchange_ntxns >= 0);
+	}
 
 	/* now remove reference from buffer */
 	hash_search(rb->by_txn,
@@ -3275,10 +3286,16 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 								  XLogRecPtr lsn)
 {
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	if (!rbtxn_has_catalog_changes(txn))
+	{
+		txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &txn->catchange_node);
+		rb->catchange_ntxns++;
+	}
 
 	/*
 	 * Mark top-level transaction as having catalog changes too if one of its
@@ -3286,8 +3303,52 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	 * conveniently check just top-level transaction and decide whether to
 	 * build the hash table or not.
 	 */
-	if (txn->toptxn != NULL)
-		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	toptxn = txn->toptxn;
+	if (toptxn != NULL && !rbtxn_has_catalog_changes(toptxn))
+	{
+		toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &toptxn->catchange_node);
+		rb->catchange_ntxns++;
+	}
+}
+
+/*
+ * Return palloc'ed array of the transactions that have changed catalogs.
+ * The returned array is sorted in xidComparator order.
+ *
+ * The caller must free the returned array when done with it.
+ */
+TransactionId *
+ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb)
+{
+	dlist_iter iter;
+	TransactionId *xids = NULL;
+	size_t	xcnt = 0;
+
+	/* Quick return if the list is empty */
+	if (dlist_is_empty(&rb->catchange_txns))
+	{
+		Assert(rb->catchange_ntxns == 0);
+		return NULL;
+	}
+
+	/* Initialize XID array */
+	xids = (TransactionId *) palloc(sizeof(TransactionId) * rb->catchange_ntxns);
+	dlist_foreach(iter, &rb->catchange_txns)
+	{
+		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN,
+												catchange_node,
+												iter.cur);
+
+		Assert(rbtxn_has_catalog_changes(txn));
+
+		xids[xcnt++] = txn->xid;
+	}
+
+	qsort(xids, xcnt, sizeof(TransactionId), xidComparator);
+
+	Assert(xcnt == rb->catchange_ntxns);
+	return xids;
 }
 
 /*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 73c0f15214..ab6bae3b6e 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -241,6 +241,30 @@ struct SnapBuild
 		 */
 		TransactionId *xip;
 	}			committed;
+
+	/*
+	 * Array of transactions and subtransactions that had modified catalogs
+	 * and were running when the snapshot was serialized.
+	 *
+	 * We normally rely on HEAP2_NEW_CID and XLOG_XACT_INVALIDATIONS records to
+	 * know if the transaction has changed the catalog. But it could happen that
+	 * the logical decoding decodes only the commit record of the transaction.
+	 * This array stores the transactions that have modified catalogs and were
+	 * running when serializing a snapshot, and this array is used to add such
+	 * transactions to the snapshot.
+	 *
+	 * This array is set once when restoring the snapshot. We remove xids from
+	 * this array when they become old enough to no longer matter, and it
+	 * eventually becomes empty.
+	 */
+	struct
+	{
+		/* number of transactions */
+		size_t		xcnt;
+
+		/* This array must be sorted in xidComparator order */
+		TransactionId *xip;
+	}			catchange;
 };
 
 /*
@@ -250,8 +274,8 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/* ->committed and ->catchange manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -262,6 +286,9 @@ static void SnapBuildSnapIncRefcount(Snapshot snap);
 
 static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
+static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
+												 uint32 xinfo);
+
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
 static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
@@ -269,6 +296,7 @@ static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutof
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
 static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path);
 
 /*
  * Allocate a new snapshot builder.
@@ -306,6 +334,9 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 		palloc0(builder->committed.xcnt_space * sizeof(TransactionId));
 	builder->committed.includes_all_transactions = true;
 
+	builder->catchange.xcnt = 0;
+	builder->catchange.xip = NULL;
+
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
@@ -888,12 +919,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed or containing catalog
+ * changes that are smaller than ->xmin. Those won't ever get checked via
+ * the ->committed or ->catchange array, respectively. The committed xids will
+ * get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction
+ * from catchange array once it is finished (committed/aborted) but that could
+ * be costly as we need to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -928,6 +964,40 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/*
+	 * purge xids in ->catchange as well. The purged array must also be
+	 * sorted in xidComparator order.
+	 */
+	if (builder->catchange.xcnt > 0)
+	{
+		/*
+		 * Since catchange.xip is sorted, we find the lower bound of
+		 * xids that still are interesting.
+		 */
+		for (off = 0; off < builder->catchange.xcnt; off++)
+		{
+			if (TransactionIdFollowsOrEquals(builder->catchange.xip[off],
+											 builder->xmin))
+				break;
+		}
+
+		surviving_xids = builder->catchange.xcnt - off;
+		if (surviving_xids > 0)
+			memmove(builder->catchange.xip, &(builder->catchange.xip[off]),
+					surviving_xids * sizeof(TransactionId));
+		else
+		{
+			/* catchange list becomes empty */
+			pfree(builder->catchange.xip);
+			builder->catchange.xip = NULL;
+		}
+
+		elog(DEBUG3, "purged catalog modifying transactions from %u to %u, xmin %u",
+			 (uint32) builder->catchange.xcnt, (uint32) surviving_xids,
+			builder->xmin);
+		builder->catchange.xcnt = surviving_xids;
+	}
 }
 
 /*
@@ -935,7 +1005,7 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
  */
 void
 SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
-				   int nsubxacts, TransactionId *subxacts)
+				   int nsubxacts, TransactionId *subxacts, uint32 xinfo)
 {
 	int			nxact;
 
@@ -983,7 +1053,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		 * Add subtransaction to base snapshot if catalog modifying, we don't
 		 * distinguish to toplevel transactions there.
 		 */
-		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (SnapBuildXidHasCatalogChanges(builder, subxid, xinfo))
 		{
 			sub_needs_timetravel = true;
 			needs_snapshot = true;
@@ -1012,7 +1082,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 
 	/* if top-level modified catalog, it'll need a snapshot */
-	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes",
 			 xid);
@@ -1089,6 +1159,29 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 }
 
+/*
+ * Check the reorder buffer and the snapshot to see if the given transaction has
+ * modified catalogs.
+ */
+static inline bool
+SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
+							  uint32 xinfo)
+{
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return true;
+
+	/*
+	 * If the commit record of the transaction does not have invalidation
+	 * messages, it did not change catalogs for sure.
+	 */
+	if (!(xinfo & XACT_XINFO_HAS_INVALS))
+		return false;
+
+	/* Check the catchange XID array */
+	return ((builder->catchange.xcnt > 0) &&
+			(bsearch(&xid, builder->catchange.xip, builder->catchange.xcnt,
+					 sizeof(TransactionId), xidComparator) != NULL));
+}
 
 /* -----------------------------------
  * Snapshot building functions dealing with xlog records
@@ -1135,7 +1228,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1438,6 +1531,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
  *
  * struct SnapBuildOnDisk;
  * TransactionId * committed.xcnt; (*not xcnt_space*)
+ * TransactionId * catchange.xcnt;
  *
  */
 typedef struct SnapBuildOnDisk
@@ -1467,7 +1561,7 @@ typedef struct SnapBuildOnDisk
 	offsetof(SnapBuildOnDisk, version)
 
 #define SNAPBUILD_MAGIC 0x51A1E001
-#define SNAPBUILD_VERSION 4
+#define SNAPBUILD_VERSION 5
 
 /*
  * Store/Load a snapshot from disk, depending on the snapshot builder's state.
@@ -1493,6 +1587,9 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 {
 	Size		needed_length;
 	SnapBuildOnDisk *ondisk = NULL;
+	TransactionId	*catchange_xip = NULL;
+	MemoryContext	old_ctx;
+	size_t		catchange_xcnt;
 	char	   *ondisk_c;
 	int			fd;
 	char		tmppath[MAXPGPATH];
@@ -1578,10 +1675,16 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 				(errcode_for_file_access(),
 				 errmsg("could not remove file \"%s\": %m", tmppath)));
 
+	old_ctx = MemoryContextSwitchTo(builder->context);
+
+	/* Get the catalog modifying transactions that are not yet committed */
+	catchange_xip = ReorderBufferGetCatalogChangesXacts(builder->reorder);
+	catchange_xcnt = builder->reorder->catchange_ntxns;
+
 	needed_length = sizeof(SnapBuildOnDisk) +
-		sizeof(TransactionId) * builder->committed.xcnt;
+		sizeof(TransactionId) * (builder->committed.xcnt + catchange_xcnt);
 
-	ondisk_c = MemoryContextAllocZero(builder->context, needed_length);
+	ondisk_c = palloc0(needed_length);
 	ondisk = (SnapBuildOnDisk *) ondisk_c;
 	ondisk->magic = SNAPBUILD_MAGIC;
 	ondisk->version = SNAPBUILD_VERSION;
@@ -1598,16 +1701,31 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	ondisk->builder.snapshot = NULL;
 	ondisk->builder.reorder = NULL;
 	ondisk->builder.committed.xip = NULL;
+	ondisk->builder.catchange.xip = NULL;
+	/* update catchange only on disk data */
+	ondisk->builder.catchange.xcnt = catchange_xcnt;
 
 	COMP_CRC32C(ondisk->checksum,
 				&ondisk->builder,
 				sizeof(SnapBuild));
 
 	/* copy committed xacts */
-	sz = sizeof(TransactionId) * builder->committed.xcnt;
-	memcpy(ondisk_c, builder->committed.xip, sz);
-	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
-	ondisk_c += sz;
+	if (builder->committed.xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * builder->committed.xcnt;
+		memcpy(ondisk_c, builder->committed.xip, sz);
+		COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+		ondisk_c += sz;
+	}
+
+	/* copy catalog modifying xacts */
+	if (catchange_xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * catchange_xcnt;
+		memcpy(ondisk_c, catchange_xip, sz);
+		COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+		ondisk_c += sz;
+	}
 
 	FIN_CRC32C(ondisk->checksum);
 
@@ -1688,12 +1806,16 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	 */
 	builder->last_serialized_snapshot = lsn;
 
+	MemoryContextSwitchTo(old_ctx);
+
 out:
 	ReorderBufferSetRestartPoint(builder->reorder,
 								 builder->last_serialized_snapshot);
 	/* be tidy */
 	if (ondisk)
 		pfree(ondisk);
+	if (catchange_xip)
+		pfree(catchange_xip);
 }
 
 /*
@@ -1707,7 +1829,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	int			fd;
 	char		path[MAXPGPATH];
 	Size		sz;
-	int			readBytes;
 	pg_crc32c	checksum;
 
 	/* no point in loading a snapshot if we're already there */
@@ -1739,29 +1860,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
 
 	/* read statically sized portion of snapshot */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk, SnapBuildOnDiskConstantSize);
-	pgstat_report_wait_end();
-	if (readBytes != SnapBuildOnDiskConstantSize)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes,
-							(Size) SnapBuildOnDiskConstantSize)));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk, SnapBuildOnDiskConstantSize, path);
 
 	if (ondisk.magic != SNAPBUILD_MAGIC)
 		ereport(ERROR,
@@ -1781,56 +1880,26 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 				SnapBuildOnDiskConstantSize - SnapBuildOnDiskNotChecksummedSize);
 
 	/* read SnapBuild */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk.builder, sizeof(SnapBuild));
-	pgstat_report_wait_end();
-	if (readBytes != sizeof(SnapBuild))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sizeof(SnapBuild))));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk.builder, sizeof(SnapBuild), path);
 	COMP_CRC32C(checksum, &ondisk.builder, sizeof(SnapBuild));
 
 	/* restore committed xacts information */
-	sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
-	ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, ondisk.builder.committed.xip, sz);
-	pgstat_report_wait_end();
-	if (readBytes != sz)
+	if (ondisk.builder.committed.xcnt > 0)
 	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
+		sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
+		ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
+		SnapBuildRestoreContents(fd, (char *) ondisk.builder.committed.xip, sz, path);
+		COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
+	}
 
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sz)));
+	/* restore catalog modifying xacts information */
+	if (ondisk.builder.catchange.xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * ondisk.builder.catchange.xcnt;
+		ondisk.builder.catchange.xip = MemoryContextAllocZero(builder->context, sz);
+		SnapBuildRestoreContents(fd, (char *) ondisk.builder.catchange.xip, sz, path);
+		COMP_CRC32C(checksum, ondisk.builder.catchange.xip, sz);
 	}
-	COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
 
 	if (CloseTransientFile(fd) != 0)
 		ereport(ERROR,
@@ -1885,6 +1954,13 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	ondisk.builder.committed.xip = NULL;
 
+	/* set catalog modifying transactions */
+	if (builder->catchange.xip)
+		pfree(builder->catchange.xip);
+	builder->catchange.xcnt = ondisk.builder.catchange.xcnt;
+	builder->catchange.xip = ondisk.builder.catchange.xip;
+	ondisk.builder.catchange.xip = NULL;
+
 	/* our snapshot is not interesting anymore, build a new one */
 	if (builder->snapshot != NULL)
 	{
@@ -1906,9 +1982,43 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 snapshot_not_interesting:
 	if (ondisk.builder.committed.xip != NULL)
 		pfree(ondisk.builder.committed.xip);
+	if (ondisk.builder.catchange.xip != NULL)
+		pfree(ondisk.builder.catchange.xip);
 	return false;
 }
 
+/*
+ * Read the contents of the serialized snapshot to the dest.
+ */
+static void
+SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
+{
+	int			readBytes;
+
+	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+	readBytes = read(fd, dest, size);
+	pgstat_report_wait_end();
+	if (readBytes != size)
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+
+		if (readBytes < 0)
+		{
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("could not read file \"%s\": read %d of %zu",
+							path, readBytes, size)));
+	}
+}
+
 /*
  * Remove all serialized snapshots that are not required anymore because no
  * slot can need them. This doesn't actually have to run during a checkpoint,
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d109d0baed..fd84f175c0 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -380,6 +380,11 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	node;
 
+	/*
+	 * A node in the list of catalog modifying transactions
+	 */
+	dlist_node	catchange_node;
+
 	/*
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
@@ -526,6 +531,12 @@ struct ReorderBuffer
 	 */
 	dlist_head	txns_by_base_snapshot_lsn;
 
+	/*
+	 * Transactions and subtransactions that have modified system catalogs.
+	 */
+	dlist_head	catchange_txns;
+	int			catchange_ntxns;
+
 	/*
 	 * one-entry sized cache for by_txn. Very frequently the same txn gets
 	 * looked up over and over again.
@@ -677,6 +688,7 @@ extern void ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
 extern void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 extern ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 extern TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
+extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index d179251aad..e6adea24f2 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -82,7 +82,7 @@ extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
-							   TransactionId *subxacts);
+							   TransactionId *subxacts, uint32 xinfo);
 extern bool SnapBuildProcessChange(SnapBuild *builder, TransactionId xid,
 								   XLogRecPtr lsn);
 extern void SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
-- 
2.24.3 (Apple Git-128)
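
The purge step that the patch above adds to SnapBuildPurgeOlderTxn() can be illustrated with a small standalone sketch. This is not the patch's code: the patch finds a lower bound in the sorted array and memmove()s the tail, whereas the sketch below filters each entry individually, so it does not depend on how the array order relates to XID wraparound. The names xid_precedes() and purge_older_xids() are hypothetical, and xid_precedes() is only a simplified stand-in for PostgreSQL's wraparound-aware comparison of normal XIDs.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Simplified wraparound-aware "a precedes b" test for normal XIDs.
 * Hypothetical stand-in for PostgreSQL's NormalTransactionIdPrecedes().
 */
static int
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Drop every xid that precedes xmin, compacting the array in place.
 * Survivors keep their relative order, so a sorted input stays sorted.
 * Returns the number of surviving entries.
 */
static size_t
purge_older_xids(TransactionId *xids, size_t n, TransactionId xmin)
{
	size_t		kept = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (!xid_precedes(xids[i], xmin))
			xids[kept++] = xids[i];
	}

	return kept;
}
```

A caller would replace its stored count with the returned value and free the array once that value reaches zero, which mirrors what the patch does when the catchange list becomes empty.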

#103shiy.fnst@fujitsu.com
shiy.fnst@fujitsu.com
In reply to: Masahiko Sawada (#102)
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 26, 2022 3:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jul 26, 2022 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Jul 26, 2022 at 7:00 AM Masahiko Sawada

<sawada.mshk@gmail.com> wrote:

On Mon, Jul 25, 2022 at 7:57 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

case 3
---------
There are 64 catalog modifying transactions.
Decode 100k/500k/1M transactions.

          100k     500k     1M
master    0.0600   0.1666   0.4876
patched   0.0620   0.1653   0.4795

Summary of the tests:
After applying this patch, there is an overhead of about 3% in the case of
decoding 100k transactions with 64 catalog modifying transactions. This is an
extreme case, so maybe it's okay.

Yes. If we're worried about the overhead and doing bsearch() is the
cause, probably we can try simplehash instead of the array.

I am not sure if we need to go that far for this extremely corner
case. Let's first try your below idea.

An improvement idea is that we pass the parsed->xinfo down to
SnapBuildXidHasCatalogChanges(), and then return from that function
before doing bsearch() if the parsed->xinfo doesn't have
XACT_XINFO_HAS_INVALS. That would save calling bsearch() for
non-catalog-modifying transactions. Is it worth trying?

I think this is worth trying and this might reduce some of the
overhead as well in the case presented by Shi-San.

Okay, I've attached an updated patch that does the above idea. Could
you please do the performance tests again to see if the idea can help
reduce the overhead, Shi yu?

Thanks for the improvement. I tested the v9 patch on the case that showed
overhead before (decoding 100k transactions with 64 catalog modifying
transactions). The results are as follows.

master 0.0607
patched 0.0613

There's almost no difference compared with master (less than 1%), which looks
good to me.

Regards,
Shi yu
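
The fast path discussed in this exchange (return early when the commit record carries no invalidation messages, and only otherwise bsearch() the sorted XID array) can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the patch itself: SKETCH_XINFO_HAS_INVALS is a hypothetical stand-in for XACT_XINFO_HAS_INVALS, and xid_cmp() and xid_has_catalog_changes() are illustrative names. The array is assumed to be sorted in plain numeric order, as xidComparator would leave it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Hypothetical stand-in for the XACT_XINFO_HAS_INVALS flag bit. */
#define SKETCH_XINFO_HAS_INVALS (1U << 3)

/* Plain numeric comparison, matching the order the array is sorted in. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	if (xa == xb)
		return 0;
	return (xa > xb) ? 1 : -1;
}

/*
 * Return true if xid is listed in the sorted array of catalog-modifying
 * transactions. The bsearch() is skipped entirely when the commit record
 * has no invalidation messages, which is the common case.
 */
static bool
xid_has_catalog_changes(const TransactionId *sorted, size_t n,
						TransactionId xid, uint32_t xinfo)
{
	/* No invalidations in the commit record: no catalog changes. */
	if (!(xinfo & SKETCH_XINFO_HAS_INVALS))
		return false;

	return n > 0 &&
		bsearch(&xid, sorted, n, sizeof(TransactionId), xid_cmp) != NULL;
}
```

Because most transactions do not touch the catalog, the flag test is likely to short-circuit for nearly every commit, which is consistent with the overhead dropping to under 1% in the measurement above.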

#104Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#98)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, Jul 25, 2022 at 11:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jul 25, 2022 at 10:45 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached the patch for REL15 that I forgot.

I feel the place to remember running xacts information in
SnapBuildProcessRunningXacts is not appropriate. Because in cases
where there are no running xacts or when xl_running_xact is old enough
that we can't use it, we don't need that information. I feel we need
it only when we have to reuse the already serialized snapshot, so,
won't it be better to initialize at that place in
SnapBuildFindSnapshot()? I have changed accordingly in the attached
and apart from that slightly modified the comments and commit message.
Do let me know what you think of the attached?

--
With Regards,
Amit Kapila.

Attachments:

REL15_v9-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch (application/octet-stream)
From 415f48ea772c3119625dc2fd8ca0915dbbbb752b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v9] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the running transactions written in the xl_running_xacts record
that we decoded first, and mark the transaction as containing catalog
changes if it's in the list of the initial running transactions and its
commit record has XACT_XINFO_HAS_INVALS. To avoid ABI breakage, we store
the array of the initial running transactions in the static variables
InitialRunningXacts and NInitialRunningXacts, instead of storing those in
SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog change. But that won't be a problem since
we use snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../expected/catalog_change_snapshot.out           |  44 ++++++
 .../specs/catalog_change_snapshot.spec             |  39 +++++
 src/backend/replication/logical/decode.c           |  15 ++
 src/backend/replication/logical/snapbuild.c        | 158 +++++++++++++++++++--
 src/include/replication/snapbuild.h                |   3 +
 6 files changed, 248 insertions(+), 13 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906..c7ce603 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000..dc4f9b7
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000..662760f
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index aa2427b..ea8a216 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -627,6 +627,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. It is possible that we didn't mark this transaction as
+	 * containing catalog changes when the decoding starts from a commit
+	 * record without decoding the transaction's other changes. So, we ensure
+	 * to mark such transactions as containing catalog change.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * these transactions in the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1119a12..d264609 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -250,8 +250,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded first was written.
+ * The array is sorted in xidComparator order. We remove Xids from
+ * this array when they become old enough to no longer matter. This array is
+ * allocated in builder->context so its lifetime is the same as the
+ * snapshot builder.
+ *
+ * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+ * if the transaction has changed the catalog, and that information
+ * is not serialized to SnapBuilder. Therefore, after the restart, if the
+ * logical decoding decodes the commit record of the transaction that
+ * actually has done catalog changes without these records, we fail to
+ * add the xid to the snapshot, and end up looking at catalogs with the
+ * wrong snapshot. To avoid this problem, if the COMMIT record of the
+ * xid listed in InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we
+ * mark both the top transaction and its subtransactions as containing
+ * catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -888,12 +918,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -928,6 +963,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there is no initial running transactions */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1135,7 +1213,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1283,11 +1361,30 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		return false;
 	}
 	/* b) valid on disk state and not building full snapshot */
-	else if (!builder->building_full_snapshot &&
-			 SnapBuildRestore(builder, lsn))
+	else if (!builder->building_full_snapshot)
 	{
-		/* there won't be any state to cleanup */
-		return false;
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded first was written. We
+		 * use this later to identify the transactions have performed catalog
+		 * changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		if (builder->state == SNAPBUILD_START)
+		{
+			int			nxacts = running->subxcnt + running->xcnt;
+			Size		sz = sizeof(TransactionId) * nxacts;
+
+			NInitialRunningXacts = nxacts;
+			InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+			memcpy(InitialRunningXacts, running->xids, sz);
+			qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+		}
+
+		if (SnapBuildRestore(builder, lsn))
+		{
+			/* there won't be any state to cleanup */
+			return false;
+		}
 	}
 
 	/*
@@ -1302,7 +1399,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 * as running, might already have inserted their commit record - it's
 	 * infeasible to change that with locking.
 	 */
-	else if (builder->state == SNAPBUILD_START)
+	if (builder->state == SNAPBUILD_START)
 	{
 		builder->state = SNAPBUILD_BUILDING_SNAPSHOT;
 		builder->next_phase_at = running->nextXid;
@@ -2000,3 +2097,40 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark the
+ * transaction and its subtransactions as containing catalog changes. See
+ * comments for NInitialRunningXacts and InitialRunningXacts for additional
+ * info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (likely((NInitialRunningXacts == 0) ||
+			   ReorderBufferXidHasCatalogChanges(builder->reorder, xid)))
+		return;
+
+	/*
+	 * If this committed transaction is the one that was running at the time
+	 * when decoding the first RUNNING_XACTS record and have done catalog
+	 * changes, we can mark both the top transaction and its subtransactions
+	 * as containing catalog changes.
+	 */
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index d179251..53d83f3 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -91,4 +91,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
1.8.3.1

#105Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#104)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Jul 27, 2022 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 25, 2022 at 11:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jul 25, 2022 at 10:45 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached the patch for REL15 that I forgot.

I feel the place to remember running xacts information in
SnapBuildProcessRunningXacts is not appropriate. Because in cases
where there are no running xacts or when xl_running_xact is old enough
that we can't use it, we don't need that information. I feel we need
it only when we have to reuse the already serialized snapshot, so,
won't it be better to initialize at that place in
SnapBuildFindSnapshot()?

Good point, agreed.

I have changed accordingly in the attached
and apart from that slightly modified the comments and commit message.
Do let me know what you think of the attached?

Would it be better to remember the initial running xacts after
SnapBuildRestore() returns true? Otherwise, we could end up
allocating InitialRunningXacts multiple times, leaking the old
ones, if there are no serialized snapshots that we are interested in.

---
+               if (builder->state == SNAPBUILD_START)
+               {
+                       int                     nxacts =
running->subxcnt + running->xcnt;
+                       Size            sz = sizeof(TransactionId) * nxacts;
+
+                       NInitialRunningXacts = nxacts;
+                       InitialRunningXacts =
MemoryContextAlloc(builder->context, sz);
+                       memcpy(InitialRunningXacts, running->xids, sz);
+                       qsort(InitialRunningXacts, nxacts,
sizeof(TransactionId), xidComparator);
+               }

We should allocate the memory for InitialRunningXacts only when
(running->subxcnt + running->xcnt) > 0.
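
To illustrate the two points above (no repeated allocation that leaks the old array, and no allocation when nothing was running), here is a minimal standalone sketch. It is not the patch itself: it uses malloc/free in place of MemoryContextAlloc/pfree, and the names remember_running_xacts, initial_running_xacts, n_initial_running_xacts and xid_cmp are hypothetical stand-ins. The comparator orders xids as plain unsigned integers, as xidComparator does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

/* Hypothetical stand-ins for InitialRunningXacts / NInitialRunningXacts. */
static TransactionId *initial_running_xacts = NULL;
static int	n_initial_running_xacts = 0;

/* Plain unsigned ordering of two xids. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	if (x > y)
		return 1;
	if (x < y)
		return -1;
	return 0;
}

/*
 * Remember a sorted copy of the running xids. Returns the number of xids
 * remembered, or -1 if the allocation failed.
 */
static int
remember_running_xacts(const TransactionId *xids, int nxacts)
{
	assert(nxacts >= 0);

	/* Release any earlier copy so that repeated calls cannot leak it. */
	free(initial_running_xacts);
	initial_running_xacts = NULL;
	n_initial_running_xacts = 0;

	/* Nothing was running: do not allocate at all. */
	if (nxacts == 0)
		return 0;

	initial_running_xacts = malloc(sizeof(TransactionId) * (size_t) nxacts);
	if (initial_running_xacts == NULL)
		return -1;

	memcpy(initial_running_xacts, xids, sizeof(TransactionId) * (size_t) nxacts);
	qsort(initial_running_xacts, (size_t) nxacts, sizeof(TransactionId), xid_cmp);
	n_initial_running_xacts = nxacts;
	return nxacts;
}
```

Calling remember_running_xacts() a second time replaces the first array instead of orphaning it. Moving the call so that it runs only after a successful restore, as suggested above, would avoid the repeated calls in the first place.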

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#106Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#105)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Thu, Jul 28, 2022 at 7:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 27, 2022 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have changed accordingly in the attached
and apart from that slightly modified the comments and commit message.
Do let me know what you think of the attached?

It would be better to remember the initial running xacts after
SnapBuildRestore() returns true? Because otherwise, we could end up
allocating InitialRunningXacts multiple times while leaking the old
ones if there are no serialized snapshots that we are interested in.

Right, this makes sense. But note that you can no longer have a check
(builder->state == SNAPBUILD_START) which I believe is not required.
We need to do this after restore, in whichever state snapshot was as
any state other than SNAPBUILD_CONSISTENT can have commits without all
their changes.

Accordingly, I think the comment: "Remember the transactions and
subtransactions that were running when xl_running_xacts record that we
decoded first was written." needs to be slightly modified to something
like: "Remember the transactions and subtransactions that were running
when xl_running_xacts record that we decoded was written.". Change
this if it is used at any other place in the patch.

---
+               if (builder->state == SNAPBUILD_START)
+               {
+                       int                     nxacts =
running->subxcnt + running->xcnt;
+                       Size            sz = sizeof(TransactionId) * nxacts;
+
+                       NInitialRunningXacts = nxacts;
+                       InitialRunningXacts =
MemoryContextAlloc(builder->context, sz);
+                       memcpy(InitialRunningXacts, running->xids, sz);
+                       qsort(InitialRunningXacts, nxacts,
sizeof(TransactionId), xidComparator);
+               }

We should allocate the memory for InitialRunningXacts only when
(running->subxcnt + running->xcnt) > 0.

There is no harm in doing that but ideally, that case would have been
covered by an earlier check "if (running->oldestRunningXid ==
running->nextXid)" which suggests "No transactions were running, so we
can jump to consistent."

Kindly make the required changes and submit the back branch patches again.

--
With Regards,
Amit Kapila.

#107Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#106)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Thu, Jul 28, 2022 at 12:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 28, 2022 at 7:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 27, 2022 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have changed accordingly in the attached
and apart from that slightly modified the comments and commit message.
Do let me know what you think of the attached?

It would be better to remember the initial running xacts after
SnapBuildRestore() returns true? Because otherwise, we could end up
allocating InitialRunningXacts multiple times while leaking the old
ones if there are no serialized snapshots that we are interested in.

Right, this makes sense. But note that you can no longer have a check
(builder->state == SNAPBUILD_START) which I believe is not required.
We need to do this after restore, in whichever state snapshot was as
any state other than SNAPBUILD_CONSISTENT can have commits without all
their changes.

Right.

Accordingly, I think the comment: "Remember the transactions and
subtransactions that were running when xl_running_xacts record that we
decoded first was written." needs to be slightly modified to something
like: "Remember the transactions and subtransactions that were running
when xl_running_xacts record that we decoded was written.". Change
this if it is used at any other place in the patch.

Agreed.

---
+               if (builder->state == SNAPBUILD_START)
+               {
+                       int                     nxacts =
running->subxcnt + running->xcnt;
+                       Size            sz = sizeof(TransactionId) * nxacts;
+
+                       NInitialRunningXacts = nxacts;
+                       InitialRunningXacts =
MemoryContextAlloc(builder->context, sz);
+                       memcpy(InitialRunningXacts, running->xids, sz);
+                       qsort(InitialRunningXacts, nxacts,
sizeof(TransactionId), xidComparator);
+               }

We should allocate the memory for InitialRunningXacts only when
(running->subxcnt + running->xcnt) > 0.

There is no harm in doing that but ideally, that case would have been
covered by an earlier check "if (running->oldestRunningXid ==
running->nextXid)" which suggests "No transactions were running, so we
can jump to consistent."

You're right.

While editing back branch patches, I realized that the following
(parsed->xinfo & XACT_XINFO_HAS_INVALS) and (parsed->nmsgs > 0) are
equivalent:

+   /*
+    * If the COMMIT record has invalidation messages, it could have catalog
+    * changes. It is possible that we didn't mark this transaction as
+    * containing catalog changes when the decoding starts from a commit
+    * record without decoding the transaction's other changes. So, we ensure
+    * to mark such transactions as containing catalog change.
+    *
+    * This must be done before SnapBuildCommitTxn() so that we can include
+    * these transactions in the historic snapshot.
+    */
+   if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+       SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+                                     parsed->nsubxacts, parsed->subxacts,
+                                     buf->origptr);
+
    /*
     * Process invalidation messages, even if we're not interested in the
     * transaction's contents, since the various caches need to always be
     * consistent.
     */
    if (parsed->nmsgs > 0)
    {
        if (!ctx->fast_forward)
            ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
                                          parsed->nmsgs, parsed->msgs);
        ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
    }

If that's right, I think we can merge these if branches. We can call
ReorderBufferXidSetCatalogChanges() for top-txn and in
SnapBuildXidSetCatalogChanges() we mark its subtransactions if top-txn
is in the list. What do you think?
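
For readers following along, the "is top-txn in the list" test is a binary search over the sorted xid array. A minimal standalone sketch of just that lookup, with hypothetical names (xid_in_sorted_list, xid_cmp) and without any of the reorder buffer calls:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Plain unsigned ordering, the same order the array was sorted in. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	if (x > y)
		return 1;
	if (x < y)
		return -1;
	return 0;
}

/*
 * Return true if xid appears in the sorted array xids[0..n-1].
 * An empty list never matches, which mirrors the quick exit taken
 * when there is no initial running xacts information.
 */
static bool
xid_in_sorted_list(TransactionId xid, const TransactionId *xids, size_t n)
{
	assert(xids != NULL || n == 0);

	if (n == 0)
		return false;

	return bsearch(&xid, xids, n, sizeof(TransactionId), xid_cmp) != NULL;
}
```

Only when this returns true for the top-level xid would the subtransactions listed in the commit record also be marked. The lookup is correct only if the array was sorted with the same comparator.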

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#108Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#107)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Thu, Jul 28, 2022 at 11:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Jul 28, 2022 at 12:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 28, 2022 at 7:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

While editing back branch patches, I realized that the following
(parsed->xinfo & XACT_XINFO_HAS_INVALS) and (parsed->nmsgs > 0) are
equivalent:

+   /*
+    * If the COMMIT record has invalidation messages, it could have catalog
+    * changes. It is possible that we didn't mark this transaction as
+    * containing catalog changes when the decoding starts from a commit
+    * record without decoding the transaction's other changes. So, we ensure
+    * to mark such transactions as containing catalog change.
+    *
+    * This must be done before SnapBuildCommitTxn() so that we can include
+    * these transactions in the historic snapshot.
+    */
+   if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+       SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+                                     parsed->nsubxacts, parsed->subxacts,
+                                     buf->origptr);
+
/*
* Process invalidation messages, even if we're not interested in the
* transaction's contents, since the various caches need to always be
* consistent.
*/
if (parsed->nmsgs > 0)
{
if (!ctx->fast_forward)
ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
parsed->nmsgs, parsed->msgs);
ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
}

If that's right, I think we can merge these if branches. We can call
ReorderBufferXidSetCatalogChanges() for top-txn and in
SnapBuildXidSetCatalogChanges() we mark its subtransactions if top-txn
is in the list. What do you think?

Note that this code doesn't exist in 14 and 15, so we need to create
different patches for those. BTW, how in 13 and lower versions did we
identify and mark subxacts as having catalog changes without our
patch?

--
With Regards,
Amit Kapila.

#109Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#108)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Thu, Jul 28, 2022 at 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 28, 2022 at 11:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Jul 28, 2022 at 12:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 28, 2022 at 7:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

While editing back branch patches, I realized that the following
(parsed->xinfo & XACT_XINFO_HAS_INVALS) and (parsed->nmsgs > 0) are
equivalent:

+   /*
+    * If the COMMIT record has invalidation messages, it could have catalog
+    * changes. It is possible that we didn't mark this transaction as
+    * containing catalog changes when the decoding starts from a commit
+    * record without decoding the transaction's other changes. So, we ensure
+    * to mark such transactions as containing catalog change.
+    *
+    * This must be done before SnapBuildCommitTxn() so that we can include
+    * these transactions in the historic snapshot.
+    */
+   if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+       SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+                                     parsed->nsubxacts, parsed->subxacts,
+                                     buf->origptr);
+
/*
* Process invalidation messages, even if we're not interested in the
* transaction's contents, since the various caches need to always be
* consistent.
*/
if (parsed->nmsgs > 0)
{
if (!ctx->fast_forward)
ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
parsed->nmsgs, parsed->msgs);
ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
}

If that's right, I think we can merge these if branches. We can call
ReorderBufferXidSetCatalogChanges() for top-txn and in
SnapBuildXidSetCatalogChanges() we mark its subtransactions if top-txn
is in the list. What do you think?

Note that this code doesn't exist in 14 and 15, so we need to create
different patches for those.

Right.

BTW, how in 13 and lower versions did we
identify and mark subxacts as having catalog changes without our
patch?

I think we use XLOG_HEAP_INPLACE and XLOG_HEAP2_NEW_CID to mark subxacts as well.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#110Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#109)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Thu, Jul 28, 2022 at 12:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Jul 28, 2022 at 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

While editing back branch patches, I realized that the following
(parsed->xinfo & XACT_XINFO_HAS_INVALS) and (parsed->nmsgs > 0) are
equivalent:

+   /*
+    * If the COMMIT record has invalidation messages, it could have catalog
+    * changes. It is possible that we didn't mark this transaction as
+    * containing catalog changes when the decoding starts from a commit
+    * record without decoding the transaction's other changes. So, we ensure
+    * to mark such transactions as containing catalog change.
+    *
+    * This must be done before SnapBuildCommitTxn() so that we can include
+    * these transactions in the historic snapshot.
+    */
+   if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+       SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+                                     parsed->nsubxacts, parsed->subxacts,
+                                     buf->origptr);
+
/*
* Process invalidation messages, even if we're not interested in the
* transaction's contents, since the various caches need to always be
* consistent.
*/
if (parsed->nmsgs > 0)
{
if (!ctx->fast_forward)
ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
parsed->nmsgs, parsed->msgs);
ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
}

If that's right, I think we can merge these if branches. We can call
ReorderBufferXidSetCatalogChanges() for top-txn and in
SnapBuildXidSetCatalogChanges() we mark its subtransactions if top-txn
is in the list. What do you think?

Note that this code doesn't exist in 14 and 15, so we need to create
different patches for those.

Right.

Okay, then this sounds reasonable to me.

--
With Regards,
Amit Kapila.

#111Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#102)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Jul 26, 2022 at 1:22 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Okay, I've attached an updated patch that does the above idea. Could
you please do the performance tests again to see if the idea can help
reduce the overhead, Shi yu?

While reviewing the patch for HEAD, I have changed a few comments. See
attached, if you agree with these changes then include them in the
next version.

--
With Regards,
Amit Kapila.

Attachments:

master_v9_amit.diff.patch (application/octet-stream)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ab6bae3b6e..ca56f7bfe3 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -247,15 +247,18 @@ struct SnapBuild
 	 * and were running when the snapshot was serialized.
 	 *
 	 * We normally rely on HEAP2_NEW_CID and XLOG_XACT_INVALIDATIONS records to
-	 * know if the transaction has changed the catalog. But it could happen that
-	 * the logical decoding decodes only the commit record of the transaction.
-	 * This array stores the transactions that have modified catalogs and were
-	 * running when serializing a snapshot, and this array is used to add such
-	 * transactions to the snapshot.
+	 * know if the transaction has changed the catalog. But it could happen
+	 * that the logical decoding decodes only the commit record of the
+	 * transaction after restoring the previously serialized snapshot in which
+	 * case we will miss adding the xid to the snapshot and end up looking at
+	 * the catalogs with the wrong snapshot.
 	 *
-	 * This array is set once when restoring the snapshot, we remove xids from
-	 * this array when they become old enough to matter, and then it eventually
-	 * becomes empty.
+	 * Now to avoid the above problem, we serialize the transactions that had
+	 * modified the catalogs and are still running at the time of snapshot
+	 * serialization. We fill this array while restoring the snapshot and then
+	 * refer it while decoding commit to ensure if the xact has modified the
+	 * catalog. We remove xids from this array when they become old enough to
+	 * matter, and then it eventually becomes empty.
 	 */
 	struct
 	{
@@ -924,9 +927,9 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
  * the ->committed or ->catchange array, respectively. The committed xids will
  * get checked via the clog machinery.
  *
- * We can ideally remove the transaction
- * from catchange array once it is finished (committed/aborted) but that could
- * be costly as we need to maintain the xids order in the array.
+ * We can ideally remove the transaction from catchange array once it is
+ * finished (committed/aborted) but that could be costly as we need to maintain
+ * the xids order in the array.
  */
 static void
 SnapBuildPurgeOlderTxn(SnapBuild *builder)
@@ -1171,8 +1174,8 @@ SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 		return true;
 
 	/*
-	 * If the commit record of the transaction does not have invalidation
-	 * messages, it did not change catalogs for sure.
+	 * The transactions that have changed catalogs must have invalidation
+	 * info.
 	 */
 	if (!(xinfo & XACT_XINFO_HAS_INVALS))
 		return false;
#112Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#111)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Thu, Jul 28, 2022 at 3:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 26, 2022 at 1:22 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Okay, I've attached an updated patch that does the above idea. Could
you please do the performance tests again to see if the idea can help
reduce the overhead, Shi yu?

While reviewing the patch for HEAD, I have changed a few comments. See
attached, if you agree with these changes then include them in the
next version.

I have another comment on this patch:
SnapBuildPurgeOlderTxn()
{
...
+ if (surviving_xids > 0)
+ memmove(builder->catchange.xip, &(builder->catchange.xip[off]),
+ surviving_xids * sizeof(TransactionId))
...

For this code to hit, we must have a situation where one or more of
the xacts in this array must be still running. And, if that is true,
we would not have started from the restart point where the
corresponding snapshot (that contains the still running xacts) has
been serialized because we advance the restart point to not before the
oldest running xacts restart_decoding_lsn. This may not be easy to
understand so let me take an example to explain. Say we have two
transactions t1 and t2, and both have made catalog changes. We want a
situation where one of those gets purged and the other remains in
builder->catchange.xip array. I have tried variants of the below
sequence to see if I can get into the required situation but am not
able to make it.

Session-1
Checkpoint -1;
T1
DDL

Session-2
T2
DDL

Session-3
Checkpoint-2;
pg_logical_slot_get_changes()
-- Here when we serialize the snapshot corresponding to
CHECKPOINT-2's running_xact record, we will serialize both t1 and t2
as catalog-changing xacts.

Session-1
T1
Commit;

Checkpoint;
pg_logical_slot_get_changes()
-- Here we will restore from Checkpoint-1's serialized snapshot and
won't be able to move restart_point to Checkpoint-2 because T2 is
still open.

Now, as per my understanding, it is only possible to move
restart_point to Checkpoint-2 if T2 gets committed/rolled-back in
which case we will never have that in surviving_xids array after the
purge.
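
For reference, the purge in question relies on the array being sorted, so the entries to drop always form a prefix and the survivors can be shifted down with one memmove. A minimal standalone sketch, under two assumptions: xids are compared as plain unsigned integers (the patch uses wraparound-aware comparisons), and the helper name purge_older_xids is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;

/*
 * Drop every xid smaller than xmin from the sorted array xids[0..n-1]
 * and return the number of surviving entries, which end up at the
 * front of the array in their original order.
 */
static size_t
purge_older_xids(TransactionId *xids, size_t n, TransactionId xmin)
{
	size_t		off = 0;
	size_t		surviving;

	assert(xids != NULL || n == 0);

	/* The array is sorted, so find the first xid that is still needed. */
	while (off < n && xids[off] < xmin)
		off++;

	surviving = n - off;

	/* Shift the survivors down; nothing to do if none or all survive. */
	if (surviving > 0 && off > 0)
		memmove(xids, &xids[off], surviving * sizeof(TransactionId));

	return surviving;
}
```

If the reasoning above holds, the return value in practice would only ever be 0 or n, and the memmove branch would never run.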

It is possible I am missing something here. Do let me know your thoughts.

--
With Regards,
Amit Kapila.

#113Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#112)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Thu, Jul 28, 2022 at 8:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 28, 2022 at 3:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 26, 2022 at 1:22 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Okay, I've attached an updated patch that does the above idea. Could
you please do the performance tests again to see if the idea can help
reduce the overhead, Shi yu?

While reviewing the patch for HEAD, I have changed a few comments. See
attached, if you agree with these changes then include them in the
next version.

I have another comment on this patch:
SnapBuildPurgeOlderTxn()
{
...
+ if (surviving_xids > 0)
+ memmove(builder->catchange.xip, &(builder->catchange.xip[off]),
+ surviving_xids * sizeof(TransactionId))
...

For this code to hit, we must have a situation where one or more of
the xacts in this array must be still running. And, if that is true,
we would not have started from the restart point where the
corresponding snapshot (that contains the still running xacts) has
been serialized because we advance the restart point to not before the
oldest running xacts restart_decoding_lsn. This may not be easy to
understand so let me take an example to explain. Say we have two
transactions t1 and t2, and both have made catalog changes. We want a
situation where one of those gets purged and the other remains in
builder->catchange.xip array. I have tried variants of the below
sequence to see if I can get into the required situation but am not
able to make it.

Session-1
Checkpoint -1;
T1
DDL

Session-2
T2
DDL

Session-3
Checkpoint-2;
pg_logical_slot_get_changes()
-- Here when we serialize the snapshot corresponding to
CHECKPOINT-2's running_xact record, we will serialize both t1 and t2
as catalog-changing xacts.

Session-1
T1
Commit;

Checkpoint;
pg_logical_slot_get_changes()
-- Here we will restore from Checkpoint-1's serialized snapshot and
won't be able to move restart_point to Checkpoint-2 because T2 is
still open.

Now, as per my understanding, it is only possible to move
restart_point to Checkpoint-2 if T2 gets committed/rolled-back in
which case we will never have that in surviving_xids array after the
purge.

It is possible I am missing something here. Do let me know your thoughts.

Yeah, your description makes sense to me. I've also considered how to
hit this path but I guess it is never hit. Thinking of it in another
way, first of all, at least 2 catalog modifying transactions have to
be running while writing a xl_running_xacts. The serialized snapshot
that is written when we decode the first xl_running_xact has two
transactions. Then, one of them is committed before the second
xl_running_xacts. The second serialized snapshot has only one
transaction. Then, the transaction is also committed after that. Now,
in order to execute the path, we need to start decoding from the first
serialized snapshot. However, if we start from there, we cannot decode
the full contents of the transaction that was committed later.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#114Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#113)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Fri, Jul 29, 2022 at 5:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Jul 28, 2022 at 8:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 28, 2022 at 3:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have another comment on this patch:
SnapBuildPurgeOlderTxn()
{
...
+ if (surviving_xids > 0)
+ memmove(builder->catchange.xip, &(builder->catchange.xip[off]),
+ surviving_xids * sizeof(TransactionId))
...

For this code to hit, we must have a situation where one or more of
the xacts in this array must be still running. And, if that is true,
we would not have started from the restart point where the
corresponding snapshot (that contains the still running xacts) has
been serialized because we advance the restart point to not before the
oldest running xacts restart_decoding_lsn. This may not be easy to
understand so let me take an example to explain. Say we have two
transactions t1 and t2, and both have made catalog changes. We want a
situation where one of those gets purged and the other remains in
builder->catchange.xip array. I have tried variants of the below
sequence to see if I can get into the required situation but am not
able to make it.

Session-1
Checkpoint-1;
T1
DDL

Session-2
T2
DDL

Session-3
Checkpoint-2;
pg_logical_slot_get_changes()
-- Here when we serialize the snapshot corresponding to
CHECKPOINT-2's running_xact record, we will serialize both t1 and t2
as catalog-changing xacts.

Session-1
T1
Commit;

Checkpoint;
pg_logical_slot_get_changes()
-- Here we will restore from Checkpoint-1's serialized snapshot and
won't be able to move restart_point to Checkpoint-2 because T2 is
still open.

Now, as per my understanding, it is only possible to move
restart_point to Checkpoint-2 if T2 gets committed/rolled-back in
which case we will never have that in surviving_xids array after the
purge.

It is possible I am missing something here. Do let me know your thoughts.

Yeah, your description makes sense to me. I've also considered how to
hit this path but I guess it is never hit. Thinking of it in another
way, first of all, at least 2 catalog modifying transactions have to
be running while writing a xl_running_xacts. The serialized snapshot
that is written when we decode the first xl_running_xacts has two
transactions. Then, one of them is committed before the second
xl_running_xacts. The second serialized snapshot has only one
transaction. Then, the transaction is also committed after that. Now,
in order to execute the path, we need to start decoding from the first
serialized snapshot. However, if we start from there, we cannot decode
the full contents of the transaction that was committed later.

I think then we should change this code in the master branch patch
with an additional comment along the lines of: "Either all the xacts got
purged or none. It is only possible to partially remove the xids from
this array if one or more of the xids are still running but not all.
That can happen if we start decoding from a point (LSN where the
snapshot state became consistent) where all the xacts in this array were
running and then at least one of those got committed and a few are
still running. We will never start from such a point because we won't
move the slot's restart_lsn past the point where the oldest running
transaction's restart_decoding_lsn is."

I suggest keeping the back branches as they are w.r.t. this change, so
that if this logic proves to be faulty it won't affect the stable
branches. We can always back-patch this small change if required.
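
The purge step discussed above can be sketched in isolation. This is only a minimal illustration, not the patch itself: the names Xid and purge_older_xids are hypothetical, and plain "<" stands in for the wraparound-aware comparison that the real code needs. It shows why the memmove branch only matters in the partial case, where some xids are purged and some survive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for PostgreSQL's TransactionId. */
typedef uint32_t Xid;

/*
 * Drop every xid older than xmin from a sorted array and return the number
 * of survivors.  Plain "<" is used for brevity; it ignores xid wraparound,
 * which real code has to handle.
 */
static size_t
purge_older_xids(Xid *xip, size_t xcnt, Xid xmin)
{
    size_t      off = 0;
    size_t      surviving;

    assert(xip != NULL || xcnt == 0);

    /* The array is sorted, so the xids to purge form a prefix. */
    while (off < xcnt && xip[off] < xmin)
        off++;

    surviving = xcnt - off;

    /* Shift the survivors to the front; the regions may overlap. */
    if (off > 0 && surviving > 0)
        memmove(xip, &xip[off], surviving * sizeof(Xid));

    return surviving;
}
```

If the reasoning in this message holds, a caller would only ever see a result of 0 or xcnt, so the memmove would never execute. The sketch keeps the partial case anyway so that the function stays correct if that reasoning turns out to be wrong.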

--
With Regards,
Amit Kapila.

#115 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#114)
7 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Fri, Jul 29, 2022 at 3:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 29, 2022 at 5:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Jul 28, 2022 at 8:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 28, 2022 at 3:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have another comment on this patch:
SnapBuildPurgeOlderTxn()
{
...
+ if (surviving_xids > 0)
+ memmove(builder->catchange.xip, &(builder->catchange.xip[off]),
+ surviving_xids * sizeof(TransactionId))
...

For this code to hit, we must have a situation where one or more of
the xacts in this array must be still running. And, if that is true,
we would not have started from the restart point where the
corresponding snapshot (that contains the still running xacts) has
been serialized because we advance the restart point to not before the
oldest running xact's restart_decoding_lsn. This may not be easy to
understand so let me take an example to explain. Say we have two
transactions t1 and t2, and both have made catalog changes. We want a
situation where one of those gets purged and the other remains in
builder->catchange.xip array. I have tried variants of the below
sequence to see if I can get into the required situation but am not
able to make it.

Session-1
Checkpoint-1;
T1
DDL

Session-2
T2
DDL

Session-3
Checkpoint-2;
pg_logical_slot_get_changes()
-- Here when we serialize the snapshot corresponding to
CHECKPOINT-2's running_xact record, we will serialize both t1 and t2
as catalog-changing xacts.

Session-1
T1
Commit;

Checkpoint;
pg_logical_slot_get_changes()
-- Here we will restore from Checkpoint-1's serialized snapshot and
won't be able to move restart_point to Checkpoint-2 because T2 is
still open.

Now, as per my understanding, it is only possible to move
restart_point to Checkpoint-2 if T2 gets committed/rolled-back in
which case we will never have that in surviving_xids array after the
purge.

It is possible I am missing something here. Do let me know your thoughts.

Yeah, your description makes sense to me. I've also considered how to
hit this path but I guess it is never hit. Thinking of it in another
way, first of all, at least 2 catalog modifying transactions have to
be running while writing a xl_running_xacts. The serialized snapshot
that is written when we decode the first xl_running_xacts has two
transactions. Then, one of them is committed before the second
xl_running_xacts. The second serialized snapshot has only one
transaction. Then, the transaction is also committed after that. Now,
in order to execute the path, we need to start decoding from the first
serialized snapshot. However, if we start from there, we cannot decode
the full contents of the transaction that was committed later.

I think then we should change this code in the master branch patch
with an additional comment along the lines of: "Either all the xacts got
purged or none. It is only possible to partially remove the xids from
this array if one or more of the xids are still running but not all.
That can happen if we start decoding from a point (LSN where the
snapshot state became consistent) where all the xacts in this array were
running and then at least one of those got committed and a few are
still running. We will never start from such a point because we won't
move the slot's restart_lsn past the point where the oldest running
transaction's restart_decoding_lsn is."

Agreed.

I suggest keeping the back branches as they are w.r.t. this change, so
that if this logic proves to be faulty it won't affect the stable
branches. We can always back-patch this small change if required.

Yes, during the PG16 release cycle we will have time to evaluate
whether the approach in the master branch is correct. We can always
back-patch that part later.

I've attached updated patches for all branches. Please review them.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

REL15_v10-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From 748e398923331223d6544c8e1236bde573cf3042 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v10] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on some WAL record types such as HEAP2_NEW_CID
to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the
restart, if the logical decoding decodes only the commit record of the
transaction that actually has modified a catalog, we missed adding its
XID to the snapshot. We ended up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the running transactions written in the xl_running_xacts record
that we decoded first, and mark the transaction as containing catalog
changes if it's in the list of the initial running transactions and its
commit record have XACT_XINFO_HAS_INVALS. To avoid ABI breakage, we store
the array of the initial running transactions in the static variables
InitialRunningXacts and NInitialRunningXacts, instead of storing those in
SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog change. But that won't be a problem since
we use snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 ++
 src/backend/replication/logical/snapbuild.c   | 142 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 237 insertions(+), 8 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of the transaction that have
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the XACT_RUNNING record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index aa2427ba73..ea8a2166ab 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -627,6 +627,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. It is possible that we didn't mark this transaction as
+	 * containing catalog changes when the decoding starts from a commit
+	 * record without decoding the transaction's other changes. So, we ensure
+	 * to mark such transactions as containing catalog change.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * these transactions in the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1119a12db9..e22827caf6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -250,8 +250,37 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become old enough to matter, and then it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction in
+ * which case we will miss adding the xid to the snapshot and end up looking
+ * at the catalogs with the wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -888,12 +917,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -928,6 +962,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there is no initial running transactions */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1135,7 +1212,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1286,6 +1363,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded was written. We use
+		 * this later to identify the transactions have performed catalog
+		 * changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -2000,3 +2091,40 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark the
+ * transaction and its subtransactions as containing catalog changes. See
+ * comments for NInitialRunningXacts and InitialRunningXacts for additional
+ * info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (likely((NInitialRunningXacts == 0) ||
+			   ReorderBufferXidHasCatalogChanges(builder->reorder, xid)))
+		return;
+
+	/*
+	 * If this committed transaction is the one that was running at the time
+	 * when decoding the RUNNING_XACTS record and have done catalog changes,
+	 * we can mark both the top transaction and its subtransactions as
+	 * containing catalog changes.
+	 */
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index d179251aad..53d83f348a 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -91,4 +91,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
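
The lookup idea behind the back-branch patch above can be sketched as a standalone program fragment. This is a rough illustration under stated assumptions rather than the patch's code: Xid, InitialRunning, COMMIT_HAS_INVALS, xid_cmp, remember_running and commit_needs_catalog_mark are all hypothetical names, malloc replaces the memory-context allocation, and the check that the transaction is already marked is left out. It folds the flag test from DecodeCommit and the array search from SnapBuildXidSetCatalogChanges into one predicate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for PostgreSQL's TransactionId. */
typedef uint32_t Xid;

/* Hypothetical flag bit standing in for XACT_XINFO_HAS_INVALS. */
#define COMMIT_HAS_INVALS 0x01u

/* Sorted copy of the xids found in the first decoded running-xacts record. */
typedef struct InitialRunning
{
    Xid        *xids;
    size_t      nxids;
} InitialRunning;

/* Plain numeric ordering, shared by qsort and bsearch. */
static int
xid_cmp(const void *a, const void *b)
{
    Xid         x = *(const Xid *) a;
    Xid         y = *(const Xid *) b;

    return (x > y) - (x < y);
}

/* Copy and sort the running xids; returns false if allocation fails. */
static bool
remember_running(InitialRunning *st, const Xid *running, size_t n)
{
    assert(st != NULL);

    st->xids = NULL;
    st->nxids = 0;
    if (n == 0)
        return true;

    st->xids = malloc(n * sizeof(Xid));
    if (st->xids == NULL)
        return false;

    memcpy(st->xids, running, n * sizeof(Xid));
    qsort(st->xids, n, sizeof(Xid), xid_cmp);
    st->nxids = n;
    return true;
}

/*
 * A commit is treated as catalog-changing when it carries invalidation
 * messages and its xid was running at the remembered record.
 */
static bool
commit_needs_catalog_mark(const InitialRunning *st, Xid xid, uint32_t xinfo)
{
    if (st->nxids == 0 || (xinfo & COMMIT_HAS_INVALS) == 0)
        return false;

    return bsearch(&xid, st->xids, st->nxids, sizeof(Xid), xid_cmp) != NULL;
}
```

As the commit message notes, this test can produce false positives, because invalidation messages do not necessarily imply a catalog change. Keeping the array sorted is what allows the binary search on every decoded commit.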

REL13_v10-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From db54b55e1d8a19d3ad3459158cf069eb4e47694c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v10] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on some WAL record types such as HEAP2_NEW_CID
to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the
restart, if the logical decoding decodes only the commit record of the
transaction that actually has modified a catalog, we missed adding its
XID to the snapshot. We ended up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the running transactions written in the xl_running_xacts record
that we decoded first, and mark the transaction as containing catalog
changes if it's in the list of the initial running transactions and its
commit record have XACT_XINFO_HAS_INVALS. To avoid ABI breakage, we store
the array of the initial running transactions in the static variables
InitialRunningXacts and NInitialRunningXacts, instead of storing those in
SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog change. But that won't be a problem since
we use snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 +-
 src/backend/replication/logical/snapbuild.c   | 137 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 231 insertions(+), 9 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c582a5..6ec09ab192 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of the transaction that have
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the XACT_RUNNING record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5a2b828aa3..87cbd08e85 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -582,7 +582,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		if (!ctx->fast_forward)
 			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+		/*
+		 * If the COMMIT record has invalidation messages, it could have catalog
+		 * changes. It is possible that we didn't mark this transaction and
+		 * its subtransactions as containing catalog changes when the decoding
+		 * starts from a commit record without decoding the transaction's other
+		 * changes. Therefore, we ensure to mark such transactions as containing
+		 * catalog change.
+		 *
+		 * This must be done before SnapBuildCommitTxn() so that we can include
+		 * these transactions in the historic snapshot.
+		 */
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index be46bf0363..d88cbebb9b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -252,8 +252,37 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become too old to matter, so it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction in
+ * which case we will miss adding the xid to the snapshot and end up looking
+ * at the catalogs with the wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -890,12 +919,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -930,6 +964,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1137,7 +1214,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1288,6 +1365,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when the xl_running_xacts record that we decoded was written. We
+		 * use this later to identify the transactions that have performed
+		 * catalog changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -2030,3 +2121,35 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark its
+ * subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+	/* Skip if there is no initial running xacts information */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/*
+	 * If this committed transaction was running when the RUNNING_XACTS
+	 * record that we decoded was written and has made catalog changes,
+	 * we can mark its subtransactions as containing catalog changes.
+	 */
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index b048dc7484..17d2f93300 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
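
The patch above keeps the initially running xids in a sorted array, looks them up with bsearch(), and drops entries that precede xmin. The following is a minimal standalone sketch of that bookkeeping, not the patch's code. The names Xid, RunningSet, running_set_init, running_set_contains and running_set_purge are hypothetical stand-ins, malloc/free replace the memory-context allocation, and the purge compacts in place rather than using a separate workspace array as the patch does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;			/* stand-in for TransactionId */

typedef struct
{
	Xid		   *xids;			/* sorted in plain unsigned order */
	size_t		n;
} RunningSet;

/* Plain unsigned ordering with no wraparound logic, in the spirit of xidComparator. */
static int
xid_cmp(const void *a, const void *b)
{
	Xid			x = *(const Xid *) a;
	Xid			y = *(const Xid *) b;

	return (x > y) - (x < y);
}

/* Modulo-2^32 "a precedes b" test, valid for normal xids only. */
static int
xid_precedes(Xid a, Xid b)
{
	return (int32_t) (a - b) < 0;
}

/* Copy and sort the xids. Returns 0 on success, -1 on allocation failure. */
static int
running_set_init(RunningSet *set, const Xid *xids, size_t n)
{
	assert(set != NULL);
	set->xids = NULL;
	set->n = 0;
	if (n == 0)
		return 0;
	set->xids = malloc(n * sizeof(Xid));
	if (set->xids == NULL)
		return -1;
	memcpy(set->xids, xids, n * sizeof(Xid));
	qsort(set->xids, n, sizeof(Xid), xid_cmp);
	set->n = n;
	return 0;
}

/* Binary search; the array must still be sorted by xid_cmp. */
static int
running_set_contains(const RunningSet *set, Xid xid)
{
	if (set->n == 0)
		return 0;
	return bsearch(&xid, set->xids, set->n, sizeof(Xid), xid_cmp) != NULL;
}

/* Drop every xid that precedes xmin; survivors keep their relative order. */
static void
running_set_purge(RunningSet *set, Xid xmin)
{
	size_t		keep = 0;

	for (size_t i = 0; i < set->n; i++)
	{
		if (!xid_precedes(set->xids[i], xmin))
			set->xids[keep++] = set->xids[i];
	}
	set->n = keep;
	if (keep == 0)
	{
		free(set->xids);
		set->xids = NULL;
	}
}
```

Because survivors are copied in their original order, the array stays sorted after a purge, so later bsearch() lookups remain valid without re-sorting.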

Attachment: REL14_v10-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From ea0306e91772d652d216f9e5efa4a0eb653f133c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v10] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on some WAL record types such as HEAP2_NEW_CID
to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the
restart, if the logical decoding decodes only the commit record of the
transaction that actually has modified a catalog, we missed adding its
XID to the snapshot. We ended up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the running transactions written in the xl_running_xacts record
that we decoded first, and marks the transaction as containing catalog
changes if it's in the list of the initial running transactions and its
commit record has XACT_XINFO_HAS_INVALS. To avoid ABI breakage, we store
the array of the initial running transactions in the static variables
InitialRunningXacts and NInitialRunningXacts, instead of storing those in
SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog change. But that won't be a problem since
we use snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 ++
 src/backend/replication/logical/snapbuild.c   | 142 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 237 insertions(+), 8 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b879..4553252d75 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot
+	twophase_snapshot catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record, and the decoding
+# of the INSERT record must read pg_class with the correct historic snapshot.
+#
+# Note that if bgwriter writes a RUNNING_XACTS record between "s0_commit" and
+# "s0_begin", this doesn't happen, because the decoding then starts from the
+# RUNNING_XACTS record written by bgwriter.  One might think we could either stop
+# the bgwriter or increase LOG_SNAPSHOT_INTERVAL_MS, but neither is practical in tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 92dfafc632..5a440e6eb7 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -691,6 +691,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. It is possible that we didn't mark this transaction as
+	 * containing catalog changes when the decoding starts from a commit
+	 * record without decoding the transaction's other changes. So, we make
+	 * sure to mark such transactions as containing catalog changes.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * these transactions in the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 6df602485b..e775301023 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -250,8 +250,37 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become too old to matter, so it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction in
+ * which case we will miss adding the xid to the snapshot and end up looking
+ * at the catalogs with the wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -879,12 +908,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -919,6 +953,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1126,7 +1203,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1277,6 +1354,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when the xl_running_xacts record that we decoded was written. We
+		 * use this later to identify the transactions that have performed
+		 * catalog changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -1993,3 +2084,40 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark the
+ * transaction and its subtransactions as containing catalog changes. See
+ * comments for NInitialRunningXacts and InitialRunningXacts for additional
+ * info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (likely((NInitialRunningXacts == 0) ||
+			   ReorderBufferXidHasCatalogChanges(builder->reorder, xid)))
+		return;
+
+	/*
+	 * If this committed transaction was running when the RUNNING_XACTS
+	 * record that we decoded was written and has made catalog changes,
+	 * we can mark both the top transaction and its subtransactions as
+	 * containing catalog changes.
+	 */
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 3604621e88..a19b59e100 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -90,4 +90,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
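
The REL14 variant above marks a committing transaction only when three conditions line up: its commit record carries invalidation messages, it was not already marked, and its xid is in the initial running set. A small sketch of that decision, under stated assumptions, may make the ordering of the checks easier to follow. CommitFacts, should_mark_catalog_changes and SKETCH_XINFO_HAS_INVALS are hypothetical names for illustration, and the flag's bit value here is a placeholder rather than a value taken from the PostgreSQL headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder flag bit for this sketch; not copied from PostgreSQL headers. */
#define SKETCH_XINFO_HAS_INVALS (1U << 3)

/* The three facts the decision depends on, gathered by the caller. */
typedef struct
{
	uint32_t	xinfo;				/* flags parsed from the commit record */
	bool		in_initial_running;	/* xid found in the initial running set */
	bool		already_marked;		/* marked earlier, e.g. via a NEW_CID record */
} CommitFacts;

/*
 * Return true if the committing transaction (and its subtransactions)
 * should now be marked as containing catalog changes.
 */
static bool
should_mark_catalog_changes(const CommitFacts *c)
{
	/* No invalidation messages: the commit cannot have changed catalogs. */
	if ((c->xinfo & SKETCH_XINFO_HAS_INVALS) == 0)
		return false;

	/* Already marked while decoding earlier records: nothing to do. */
	if (c->already_marked)
		return false;

	/* Only transactions seen running at the restart point need this path. */
	return c->in_initial_running;
}
```

As the commit message notes, a true result can be a false positive: invalidation messages do not prove that the catalog was changed. That is tolerated because the snapshot built during decoding is only used to read system catalogs.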

Attachment: REL12_v10-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From c55fc9f7a24c5e95467606d2968a6d1969c7f9a8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v10] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on some WAL record types such as HEAP2_NEW_CID
to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the
restart, if the logical decoding decodes only the commit record of the
transaction that actually has modified a catalog, we missed adding its
XID to the snapshot. We ended up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the running transactions written in the xl_running_xacts record
that we decoded first, and marks the transaction as containing catalog
changes if it's in the list of the initial running transactions and its
commit record has XACT_XINFO_HAS_INVALS. To avoid ABI breakage, we store
the array of the initial running transactions in the static variables
InitialRunningXacts and NInitialRunningXacts, instead of storing those in
SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog change. But that won't be a problem since
we use snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 +-
 src/backend/replication/logical/snapbuild.c   | 137 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 231 insertions(+), 9 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c582a5..6ec09ab192 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record, and the decoding
+# of the INSERT record must read pg_class with the correct historic snapshot.
+#
+# Note that if bgwriter writes a RUNNING_XACTS record between "s0_commit" and
+# "s0_begin", this doesn't happen, because the decoding then starts from the
+# RUNNING_XACTS record written by bgwriter.  One might think we could either stop
+# the bgwriter or increase LOG_SNAPSHOT_INTERVAL_MS, but neither is practical in tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 60d07ce4eb..19cd0bf76a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -585,7 +585,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		if (!ctx->fast_forward)
 			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+		/*
+		 * If the COMMIT record has invalidation messages, it could have catalog
+		 * changes. It is possible that we didn't mark this transaction and
+		 * its subtransactions as containing catalog changes when the decoding
+		 * starts from a commit record without decoding the transaction's other
+		 * changes. Therefore, we make sure such transactions are marked as
+		 * containing catalog changes.
+		 *
+		 * This must be done before SnapBuildCommitTxn() so that we can include
+		 * these transactions in the historic snapshot.
+		 */
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 5a1bce5acc..567d823324 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -257,8 +257,37 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array once they
+ * are too old to matter, and then it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction in
+ * which case we will miss adding the xid to the snapshot and end up looking
+ * at the catalogs with the wrong snapshot.
+ *
+ * To avoid the above problem, if the COMMIT record of an xid listed in
+ * InitialRunningXacts has the XACT_XINFO_HAS_INVALS flag, we mark both the
+ * top-level transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding a transaction that didn't change the catalog
+ * to the snapshot, since we cannot tell whether a transaction has
+ * catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub)transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog changes. But that won't be a problem since we
+ * use the snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -895,12 +924,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -935,6 +969,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/* Nothing to do if even the first xid is not older than xmin */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1142,7 +1219,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1293,6 +1370,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when the xl_running_xacts record that we decoded was written. We
+		 * use this later to identify the transactions that have performed
+		 * catalog changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -2035,3 +2126,35 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark its
+ * subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+	/* Skip if there is no initial running xacts information */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/*
+	 * If this committed transaction was running when the RUNNING_XACTS
+	 * record that we decoded was written and has made catalog changes,
+	 * we also mark its subtransactions as containing catalog changes.
+	 */
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 3acf68f5bd..2eb9532a1b 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
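
For readers skimming the back-branch patch above, the core idea (remember the xids from the decoded RUNNING_XACTS record in sorted order, then look the committing xid up when its COMMIT record carries invalidations) can be sketched in isolation. This is only a rough standalone illustration, not the patch's code: RunningXids, running_xids_remember, running_xids_contains and should_mark_subxacts are hypothetical names, malloc stands in for the builder's memory context, and xid_cmp assumes the plain numeric ordering that xidComparator uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

/* Plain numeric ordering (assumed to match xidComparator). */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	if (x > y)
		return 1;
	if (x < y)
		return -1;
	return 0;
}

/* Hypothetical stand-in for InitialRunningXacts/NInitialRunningXacts. */
typedef struct RunningXids
{
	TransactionId *xids;		/* sorted with xid_cmp, or NULL if empty */
	size_t		nxids;
} RunningXids;

/* Copy and sort the xids of a RUNNING_XACTS record; false on OOM. */
static bool
running_xids_remember(RunningXids *r, const TransactionId *xids, size_t n)
{
	assert(r != NULL);

	r->xids = NULL;
	r->nxids = 0;
	if (n == 0)
		return true;

	r->xids = malloc(n * sizeof(TransactionId));
	if (r->xids == NULL)
		return false;

	memcpy(r->xids, xids, n * sizeof(TransactionId));
	qsort(r->xids, n, sizeof(TransactionId), xid_cmp);
	r->nxids = n;
	return true;
}

/* Binary search; only valid because the array is kept sorted. */
static bool
running_xids_contains(const RunningXids *r, TransactionId xid)
{
	if (r->nxids == 0)
		return false;
	return bsearch(&xid, r->xids, r->nxids,
				   sizeof(TransactionId), xid_cmp) != NULL;
}

/*
 * Decide whether the subtransactions of a committing transaction should be
 * marked as catalog-modifying: only when the COMMIT record has invalidation
 * messages and the xid was running at the RUNNING_XACTS record.
 */
static bool
should_mark_subxacts(const RunningXids *r, TransactionId xid, bool has_invals)
{
	return has_invals && running_xids_contains(r, xid);
}
```

With xids {742, 740, 745} remembered, should_mark_subxacts(&r, 742, true) is true, while the same call with has_invals set to false, or with an xid that was not in the list, is false. As in the patch, a true result can be a false positive, since invalidation messages do not prove a catalog change.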

Attachment: master_v10-0001-Add-catalog-modifying-transactions-to-logical-de.patch (application/octet-stream)
From 8ea358da34eb45064aa4c2071304d51d4651adc4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 6 Jul 2022 12:53:36 +0900
Subject: [PATCH v10] Add catalog modifying transactions to logical decoding
 serialized snapshot.

Previously, we relied on some WAL record types such as HEAP2_NEW_CID
to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the
restart, if the logical decoding decodes only the commit record of the
transaction that actually has modified a catalog, we missed adding its
XID to the snapshot. We ended up looking at catalogs with the wrong
snapshot.

To fix this problem, this change adds the list of transaction IDs and
sub-transaction IDs, that have modified catalogs and are running during
snapshot serialization, to the serialized snapshot. After restart or
otherwise, when we restore from such a serialized snapshot, the
corresponding list is restored in memory. Now, when decoding a COMMIT
record, we check both the list and the ReorderBuffer to see if the
transaction has modified catalogs.

Since this adds additional information to the serialized snapshot, we
cannot backpatch it. For back branches, we took another approach.
We remember the last-running-xacts list of the decoded RUNNING_XACTS
record after restoring the previously serialized snapshot. Then, we mark
the transaction as containing catalog changes if it's in the list of
initial running transactions and its commit record has
XACT_XINFO_HAS_INVALS. This doesn't require any file format changes but
the transaction will end up being added to the snapshot even if it has
only relcache invalidations. But that won't be a problem since we use the
snapshot built during decoding only to read system catalogs.

This commit bumps SNAPBUILD_VERSION because of a change in SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 +++
 .../specs/catalog_change_snapshot.spec        |  39 +++
 src/backend/replication/logical/decode.c      |   3 +-
 .../replication/logical/reorderbuffer.c       |  71 ++++-
 src/backend/replication/logical/snapbuild.c   | 273 ++++++++++++------
 src/include/replication/reorderbuffer.h       |  12 +
 src/include/replication/snapbuild.h           |   2 +-
 8 files changed, 353 insertions(+), 93 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c5c6a2ba68..1667d720b1 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -628,7 +628,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
-					   parsed->nsubxacts, parsed->subxacts);
+					   parsed->nsubxacts, parsed->subxacts,
+					   parsed->xinfo);
 
 	/* ----
 	 * Check whether we are interested in this specific transaction, and tell
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 88a37fde72..0b2d9b7930 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -349,6 +349,8 @@ ReorderBufferAllocate(void)
 	buffer->by_txn_last_xid = InvalidTransactionId;
 	buffer->by_txn_last_txn = NULL;
 
+	buffer->catchange_ntxns = 0;
+
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
 	buffer->size = 0;
@@ -366,6 +368,7 @@ ReorderBufferAllocate(void)
 
 	dlist_init(&buffer->toplevel_by_lsn);
 	dlist_init(&buffer->txns_by_base_snapshot_lsn);
+	dlist_init(&buffer->catchange_txns);
 
 	/*
 	 * Ensure there's no stale data from prior uses of this slot, in case some
@@ -1526,14 +1529,22 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
-	 * Remove TXN from its containing list.
+	 * Remove TXN from its containing lists.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
 	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
-	 * from the LSN-ordered list of toplevel TXNs.
+	 * from the LSN-ordered list of toplevel TXNs. We remove TXN from
+	 * the list of catalog modifying transactions as well.
 	 */
 	dlist_delete(&txn->node);
+	if (rbtxn_has_catalog_changes(txn))
+	{
+		dlist_delete(&txn->catchange_node);
+		rb->catchange_ntxns--;
+
+		Assert(rb->catchange_ntxns >= 0);
+	}
 
 	/* now remove reference from buffer */
 	hash_search(rb->by_txn,
@@ -3275,10 +3286,16 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 								  XLogRecPtr lsn)
 {
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	if (!rbtxn_has_catalog_changes(txn))
+	{
+		txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &txn->catchange_node);
+		rb->catchange_ntxns++;
+	}
 
 	/*
 	 * Mark top-level transaction as having catalog changes too if one of its
@@ -3286,8 +3303,52 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	 * conveniently check just top-level transaction and decide whether to
 	 * build the hash table or not.
 	 */
-	if (txn->toptxn != NULL)
-		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	toptxn = txn->toptxn;
+	if (toptxn != NULL && !rbtxn_has_catalog_changes(toptxn))
+	{
+		toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &toptxn->catchange_node);
+		rb->catchange_ntxns++;
+	}
+}
+
+/*
+ * Return palloc'ed array of the transactions that have changed catalogs.
+ * The returned array is sorted in xidComparator order.
+ *
+ * The caller must free the returned array when done with it.
+ */
+TransactionId *
+ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb)
+{
+	dlist_iter iter;
+	TransactionId *xids = NULL;
+	size_t	xcnt = 0;
+
+	/* Quick return if the list is empty */
+	if (dlist_is_empty(&rb->catchange_txns))
+	{
+		Assert(rb->catchange_ntxns == 0);
+		return NULL;
+	}
+
+	/* Initialize XID array */
+	xids = (TransactionId *) palloc(sizeof(TransactionId) * rb->catchange_ntxns);
+	dlist_foreach(iter, &rb->catchange_txns)
+	{
+		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN,
+												catchange_node,
+												iter.cur);
+
+		Assert(rbtxn_has_catalog_changes(txn));
+
+		xids[xcnt++] = txn->xid;
+	}
+
+	qsort(xids, xcnt, sizeof(TransactionId), xidComparator);
+
+	Assert(xcnt == rb->catchange_ntxns);
+	return xids;
 }
 
 /*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 73c0f15214..aee69d160a 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -241,6 +241,33 @@ struct SnapBuild
 		 */
 		TransactionId *xip;
 	}			committed;
+
+	/*
+	 * Array of transactions and subtransactions that had modified catalogs
+	 * and were running when the snapshot was serialized.
+	 *
+	 * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+	 * if the transaction has changed the catalog. But it could happen that
+	 * the logical decoding decodes only the commit record of the transaction
+	 * after restoring the previously serialized snapshot in which case we
+	 * will miss adding the xid to the snapshot and end up looking at the
+	 * catalogs with the wrong snapshot.
+	 *
+	 * To avoid the above problem, we serialize the transactions that had
+	 * modified the catalogs and are still running at the time of snapshot
+	 * serialization. We fill this array while restoring the snapshot and then
+	 * refer to it while decoding a commit to check whether the xact has
+	 * modified the catalog. We remove xids from this array once they are too
+	 * old to matter, and then it eventually becomes empty.
+	 */
+	struct
+	{
+		/* number of transactions */
+		size_t		xcnt;
+
+		/* This array must be sorted in xidComparator order */
+		TransactionId *xip;
+	}			catchange;
 };
 
 /*
@@ -250,8 +277,8 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/* ->committed and ->catchange manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -262,6 +289,9 @@ static void SnapBuildSnapIncRefcount(Snapshot snap);
 
 static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
+static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
+												 uint32 xinfo);
+
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
 static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
@@ -269,6 +299,7 @@ static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutof
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
 static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path);
 
 /*
  * Allocate a new snapshot builder.
@@ -306,6 +337,9 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 		palloc0(builder->committed.xcnt_space * sizeof(TransactionId));
 	builder->committed.includes_all_transactions = true;
 
+	builder->catchange.xcnt = 0;
+	builder->catchange.xip = NULL;
+
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
@@ -888,12 +922,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed or containing catalog
+ * changes that are smaller than ->xmin. Those won't ever get checked via
+ * the ->committed or ->catchange array, respectively. The committed xids will
+ * get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from catchange array once it is
+ * finished (committed/aborted) but that could be costly as we need to maintain
+ * the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -928,6 +967,30 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/*
+	 * Either all the xacts got purged or none. It is only possible to
+	 * partially remove the xids from this array if one or more of the xids
+	 * are still running but not all. That can happen if we start decoding
+	 * from a point (LSN where the snapshot state became consistent) where all
+	 * the xacts in this array were running and then at least one of those
+	 * got committed while a few are still running. We will never start from
+	 * such a point because we won't move the slot's restart_lsn past the
+	 * oldest running transaction's restart_decoding_lsn.
+	 */
+	if (likely(builder->catchange.xcnt == 0 ||
+			   TransactionIdFollowsOrEquals(builder->catchange.xip[0],
+											builder->xmin)))
+		return;
+
+	Assert(TransactionIdFollows(builder->xmin,
+								builder->catchange.xip[builder->catchange.xcnt - 1]));
+	pfree(builder->catchange.xip);
+	builder->catchange.xip = NULL;
+	builder->catchange.xcnt = 0;
+
+	elog(DEBUG3, "purged catalog modifying transactions, oldest running xid %u",
+		 builder->xmin);
 }
 
 /*
@@ -935,7 +998,7 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
  */
 void
 SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
-				   int nsubxacts, TransactionId *subxacts)
+				   int nsubxacts, TransactionId *subxacts, uint32 xinfo)
 {
 	int			nxact;
 
@@ -983,7 +1046,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		 * Add subtransaction to base snapshot if catalog modifying, we don't
 		 * distinguish to toplevel transactions there.
 		 */
-		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (SnapBuildXidHasCatalogChanges(builder, subxid, xinfo))
 		{
 			sub_needs_timetravel = true;
 			needs_snapshot = true;
@@ -1012,7 +1075,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 
 	/* if top-level modified catalog, it'll need a snapshot */
-	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes",
 			 xid);
@@ -1089,6 +1152,29 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 }
 
+/*
+ * Check the reorder buffer and the snapshot to see if the given transaction has
+ * modified catalogs.
+ */
+static inline bool
+SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
+							  uint32 xinfo)
+{
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return true;
+
+	/*
+	 * The transactions that have changed catalogs must have invalidation
+	 * info.
+	 */
+	if (!(xinfo & XACT_XINFO_HAS_INVALS))
+		return false;
+
+	/* Check the catchange XID array */
+	return ((builder->catchange.xcnt > 0) &&
+			(bsearch(&xid, builder->catchange.xip, builder->catchange.xcnt,
+					 sizeof(TransactionId), xidComparator) != NULL));
+}
 
 /* -----------------------------------
  * Snapshot building functions dealing with xlog records
@@ -1135,7 +1221,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1438,6 +1524,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
  *
  * struct SnapBuildOnDisk;
  * TransactionId * committed.xcnt; (*not xcnt_space*)
+ * TransactionId * catchange.xcnt;
  *
  */
 typedef struct SnapBuildOnDisk
@@ -1467,7 +1554,7 @@ typedef struct SnapBuildOnDisk
 	offsetof(SnapBuildOnDisk, version)
 
 #define SNAPBUILD_MAGIC 0x51A1E001
-#define SNAPBUILD_VERSION 4
+#define SNAPBUILD_VERSION 5
 
 /*
  * Store/Load a snapshot from disk, depending on the snapshot builder's state.
@@ -1493,6 +1580,9 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 {
 	Size		needed_length;
 	SnapBuildOnDisk *ondisk = NULL;
+	TransactionId *catchange_xip = NULL;
+	MemoryContext old_ctx;
+	size_t		catchange_xcnt;
 	char	   *ondisk_c;
 	int			fd;
 	char		tmppath[MAXPGPATH];
@@ -1578,10 +1668,16 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 				(errcode_for_file_access(),
 				 errmsg("could not remove file \"%s\": %m", tmppath)));
 
+	old_ctx = MemoryContextSwitchTo(builder->context);
+
+	/* Get the catalog modifying transactions that are not yet committed */
+	catchange_xip = ReorderBufferGetCatalogChangesXacts(builder->reorder);
+	catchange_xcnt = builder->reorder->catchange_ntxns;
+
 	needed_length = sizeof(SnapBuildOnDisk) +
-		sizeof(TransactionId) * builder->committed.xcnt;
+		sizeof(TransactionId) * (builder->committed.xcnt + catchange_xcnt);
 
-	ondisk_c = MemoryContextAllocZero(builder->context, needed_length);
+	ondisk_c = palloc0(needed_length);
 	ondisk = (SnapBuildOnDisk *) ondisk_c;
 	ondisk->magic = SNAPBUILD_MAGIC;
 	ondisk->version = SNAPBUILD_VERSION;
@@ -1598,16 +1694,31 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	ondisk->builder.snapshot = NULL;
 	ondisk->builder.reorder = NULL;
 	ondisk->builder.committed.xip = NULL;
+	ondisk->builder.catchange.xip = NULL;
+	/* update catchange only in the on-disk data */
+	ondisk->builder.catchange.xcnt = catchange_xcnt;
 
 	COMP_CRC32C(ondisk->checksum,
 				&ondisk->builder,
 				sizeof(SnapBuild));
 
 	/* copy committed xacts */
-	sz = sizeof(TransactionId) * builder->committed.xcnt;
-	memcpy(ondisk_c, builder->committed.xip, sz);
-	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
-	ondisk_c += sz;
+	if (builder->committed.xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * builder->committed.xcnt;
+		memcpy(ondisk_c, builder->committed.xip, sz);
+		COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+		ondisk_c += sz;
+	}
+
+	/* copy catalog modifying xacts */
+	if (catchange_xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * catchange_xcnt;
+		memcpy(ondisk_c, catchange_xip, sz);
+		COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+		ondisk_c += sz;
+	}
 
 	FIN_CRC32C(ondisk->checksum);
 
@@ -1688,12 +1799,16 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	 */
 	builder->last_serialized_snapshot = lsn;
 
+	MemoryContextSwitchTo(old_ctx);
+
 out:
 	ReorderBufferSetRestartPoint(builder->reorder,
 								 builder->last_serialized_snapshot);
 	/* be tidy */
 	if (ondisk)
 		pfree(ondisk);
+	if (catchange_xip)
+		pfree(catchange_xip);
 }
 
 /*
@@ -1707,7 +1822,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	int			fd;
 	char		path[MAXPGPATH];
 	Size		sz;
-	int			readBytes;
 	pg_crc32c	checksum;
 
 	/* no point in loading a snapshot if we're already there */
@@ -1739,29 +1853,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
 
 	/* read statically sized portion of snapshot */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk, SnapBuildOnDiskConstantSize);
-	pgstat_report_wait_end();
-	if (readBytes != SnapBuildOnDiskConstantSize)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes,
-							(Size) SnapBuildOnDiskConstantSize)));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk, SnapBuildOnDiskConstantSize, path);
 
 	if (ondisk.magic != SNAPBUILD_MAGIC)
 		ereport(ERROR,
@@ -1781,56 +1873,26 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 				SnapBuildOnDiskConstantSize - SnapBuildOnDiskNotChecksummedSize);
 
 	/* read SnapBuild */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk.builder, sizeof(SnapBuild));
-	pgstat_report_wait_end();
-	if (readBytes != sizeof(SnapBuild))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sizeof(SnapBuild))));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk.builder, sizeof(SnapBuild), path);
 	COMP_CRC32C(checksum, &ondisk.builder, sizeof(SnapBuild));
 
 	/* restore committed xacts information */
-	sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
-	ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, ondisk.builder.committed.xip, sz);
-	pgstat_report_wait_end();
-	if (readBytes != sz)
+	if (ondisk.builder.committed.xcnt > 0)
 	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
+		sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
+		ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
+		SnapBuildRestoreContents(fd, (char *) ondisk.builder.committed.xip, sz, path);
+		COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
+	}
 
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sz)));
+	/* restore catalog modifying xacts information */
+	if (ondisk.builder.catchange.xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * ondisk.builder.catchange.xcnt;
+		ondisk.builder.catchange.xip = MemoryContextAllocZero(builder->context, sz);
+		SnapBuildRestoreContents(fd, (char *) ondisk.builder.catchange.xip, sz, path);
+		COMP_CRC32C(checksum, ondisk.builder.catchange.xip, sz);
 	}
-	COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
 
 	if (CloseTransientFile(fd) != 0)
 		ereport(ERROR,
@@ -1885,6 +1947,13 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	ondisk.builder.committed.xip = NULL;
 
+	/* set catalog modifying transactions */
+	if (builder->catchange.xip)
+		pfree(builder->catchange.xip);
+	builder->catchange.xcnt = ondisk.builder.catchange.xcnt;
+	builder->catchange.xip = ondisk.builder.catchange.xip;
+	ondisk.builder.catchange.xip = NULL;
+
 	/* our snapshot is not interesting anymore, build a new one */
 	if (builder->snapshot != NULL)
 	{
@@ -1906,9 +1975,43 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 snapshot_not_interesting:
 	if (ondisk.builder.committed.xip != NULL)
 		pfree(ondisk.builder.committed.xip);
+	if (ondisk.builder.catchange.xip != NULL)
+		pfree(ondisk.builder.catchange.xip);
 	return false;
 }
 
+/*
+ * Read the contents of the serialized snapshot to the dest.
+ */
+static void
+SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
+{
+	int			readBytes;
+
+	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+	readBytes = read(fd, dest, size);
+	pgstat_report_wait_end();
+	if (readBytes != size)
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+
+		if (readBytes < 0)
+		{
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("could not read file \"%s\": read %d of %zu",
+							path, readBytes, size)));
+	}
+}
+
 /*
  * Remove all serialized snapshots that are not required anymore because no
  * slot can need them. This doesn't actually have to run during a checkpoint,
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d109d0baed..fd84f175c0 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -380,6 +380,11 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	node;
 
+	/*
+	 * A node in the list of catalog modifying transactions
+	 */
+	dlist_node	catchange_node;
+
 	/*
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
@@ -526,6 +531,12 @@ struct ReorderBuffer
 	 */
 	dlist_head	txns_by_base_snapshot_lsn;
 
+	/*
+	 * Transactions and subtransactions that have modified system catalogs.
+	 */
+	dlist_head	catchange_txns;
+	int			catchange_ntxns;
+
 	/*
 	 * one-entry sized cache for by_txn. Very frequently the same txn gets
 	 * looked up over and over again.
@@ -677,6 +688,7 @@ extern void ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
 extern void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 extern ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 extern TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
+extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index d179251aad..e6adea24f2 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -82,7 +82,7 @@ extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
-							   TransactionId *subxacts);
+							   TransactionId *subxacts, uint32 xinfo);
 extern bool SnapBuildProcessChange(SnapBuild *builder, TransactionId xid,
 								   XLogRecPtr lsn);
 extern void SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
-- 
2.24.3 (Apple Git-128)
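
For readers following the master-branch patch above: the serialized snapshot becomes a fixed-size header followed by two variable-length xid arrays (committed xacts, then catalog modifying xacts), and each array is written or read only when its count is non-zero. Below is a minimal standalone sketch of that layout, not the patch's code. DemoHeader, DEMO_MAGIC, demo_serialize and demo_restore are hypothetical names, the header is reduced to three fields, and the CRC computation is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;

/* Hypothetical, simplified header; the real on-disk struct has more fields. */
typedef struct DemoHeader
{
	uint32_t	magic;
	uint32_t	committed_xcnt;
	uint32_t	catchange_xcnt;
} DemoHeader;

#define DEMO_MAGIC 0x51A1E001u

/* Build header + committed xids + catalog-changing xids in one buffer. */
char *
demo_serialize(const Xid *committed, uint32_t ncommitted,
			   const Xid *catchange, uint32_t ncatchange, size_t *len)
{
	DemoHeader	hdr = {DEMO_MAGIC, ncommitted, ncatchange};
	char	   *buf;
	char	   *p;

	*len = sizeof(DemoHeader) +
		sizeof(Xid) * ((size_t) ncommitted + ncatchange);
	buf = calloc(1, *len);
	if (buf == NULL)
		return NULL;

	p = buf;
	memcpy(p, &hdr, sizeof(hdr));
	p += sizeof(hdr);

	/* copy each array only if it is non-empty, as the patch does */
	if (ncommitted > 0)
	{
		memcpy(p, committed, sizeof(Xid) * ncommitted);
		p += sizeof(Xid) * ncommitted;
	}
	if (ncatchange > 0)
		memcpy(p, catchange, sizeof(Xid) * ncatchange);

	return buf;
}

/* Read the buffer back; returns 0 on success, -1 on a bad or short buffer. */
int
demo_restore(const char *buf, size_t len,
			 Xid **committed, uint32_t *ncommitted,
			 Xid **catchange, uint32_t *ncatchange)
{
	DemoHeader	hdr;
	size_t		sz;

	*committed = *catchange = NULL;
	*ncommitted = *ncatchange = 0;

	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));
	if (hdr.magic != DEMO_MAGIC)
		return -1;
	if (len != sizeof(hdr) +
		sizeof(Xid) * ((size_t) hdr.committed_xcnt + hdr.catchange_xcnt))
		return -1;
	buf += sizeof(hdr);

	if (hdr.committed_xcnt > 0)
	{
		sz = sizeof(Xid) * hdr.committed_xcnt;
		if ((*committed = malloc(sz)) == NULL)
			return -1;
		memcpy(*committed, buf, sz);
		buf += sz;
	}
	if (hdr.catchange_xcnt > 0)
	{
		sz = sizeof(Xid) * hdr.catchange_xcnt;
		if ((*catchange = malloc(sz)) == NULL)
		{
			free(*committed);
			*committed = NULL;
			return -1;
		}
		memcpy(*catchange, buf, sz);
	}
	*ncommitted = hdr.committed_xcnt;
	*ncatchange = hdr.catchange_xcnt;
	return 0;
}
```

Because the header in this layout grows by one count, a reader built for the old layout would misread the file. That is presumably why the real patch touches SNAPBUILD_VERSION-protected code and why this approach is not used on the back branches.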

Attachment: REL10_v10-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From e58d030287759fcf3793eca8c6632e5ae1638300 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v10] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on WAL record types such as HEAP2_NEW_CID to know
whether a transaction had modified the catalog, and that information is
not serialized to the snapshot. Therefore, if logical decoding after a
restart decoded only the commit record of a transaction that had
actually modified a catalog, we failed to add its XID to the snapshot
and ended up looking at catalogs with the wrong snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the running transactions written in the xl_running_xacts record
that we decoded first, and marks a transaction as containing catalog
changes if it is in the list of the initial running transactions and its
commit record has XACT_XINFO_HAS_INVALS. To avoid ABI breakage, we store
the array of the initial running transactions in the static variables
InitialRunningXacts and NInitialRunningXacts, instead of storing those in
SnapBuild or ReorderBuffer.

This approach can produce false positives; we could end up adding a
transaction that didn't change the catalog to the snapshot, since we
cannot tell whether a transaction has catalog changes just by checking
the COMMIT record. The record doesn't say which (sub)transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog changes. But that won't be a problem
since we use the snapshot built during decoding only to read system
catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot, which avoids
the above false positives. But we cannot backpatch it because it changes
SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  41 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  16 +-
 src/backend/replication/logical/snapbuild.c   | 137 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 229 insertions(+), 9 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 2db2b2774b..73bc0fe1fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -51,7 +51,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_check) \
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..15f9540b3f
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,41 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6f8920f52c..3233104fc9 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -561,7 +561,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	{
 		ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 									  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+
+		/*
+		 * If the COMMIT record has invalidation messages, it could have catalog
+		 * changes. It is possible that we didn't mark this transaction and
+		 * its subtransactions as containing catalog changes when the decoding
+		 * starts from a commit record without decoding the transaction's other
+		 * changes. Therefore, we make sure to mark such transactions as
+		 * containing catalog changes.
+		 *
+		 * This must be done before SnapBuildCommitTxn() so that we can include
+		 * these transactions in the historic snapshot.
+		 */
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1010a2e869..33a659d5b6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -258,8 +258,37 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become old enough to matter, and then it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction in
+ * which case we will miss adding the xid to the snapshot and end up looking
+ * at the catalogs with the wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -896,12 +925,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * Ideally, we would remove a transaction from the InitialRunningXacts array
+ * as soon as it finishes (commits/aborts), but that could be costly as we
+ * need to maintain the xid order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -936,6 +970,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1143,7 +1220,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1294,6 +1371,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when the xl_running_xacts record that we decoded was written. We use
+		 * this later to identify the transactions that have performed catalog
+		 * changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -1997,3 +2088,35 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark
+ * its subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+	/* Skip if there is no initial running xacts information */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/*
+	 * If this committed transaction was running when the RUNNING_XACTS
+	 * record we decoded was written and has made catalog changes, we also
+	 * mark its subtransactions as containing catalog changes.
+	 */
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index b95f56eec3..7a796ce136 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 							 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
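
The back-branch approach above boils down to three operations on a sorted xid array: remember the xids from the first decoded xl_running_xacts record, look up a committing xid with a binary search, and drop xids that precede xmin. Here is a rough standalone sketch of that bookkeeping, not the patch's code. DemoRunning and the demo_* functions are hypothetical names, malloc/free stand in for the memory-context allocator, and the purge compacts the array in place where the patch copies through a workspace array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;

typedef struct DemoRunning
{
	Xid		   *xids;			/* sorted in plain numeric order */
	size_t		n;
} DemoRunning;

/* Plain numeric comparison, suitable for qsort() and bsearch(). */
int
demo_xid_cmp(const void *a, const void *b)
{
	Xid			x = *(const Xid *) a;
	Xid			y = *(const Xid *) b;

	return (x > y) - (x < y);
}

/* Wraparound-aware "a is older than b" test (modulo-2^32 arithmetic). */
bool
demo_xid_precedes(Xid a, Xid b)
{
	return (int32_t) (a - b) < 0;
}

/* Copy and sort the xids that were running at the first decoded record. */
void
demo_remember(DemoRunning *r, const Xid *xids, size_t n)
{
	r->xids = NULL;
	r->n = 0;
	if (n == 0)
		return;
	r->xids = malloc(n * sizeof(Xid));
	if (r->xids == NULL)
		return;
	memcpy(r->xids, xids, n * sizeof(Xid));
	qsort(r->xids, n, sizeof(Xid), demo_xid_cmp);
	r->n = n;
}

/* Was this xid in the initial running set? */
bool
demo_was_initially_running(const DemoRunning *r, Xid xid)
{
	if (r->n == 0)
		return false;			/* quick exit, the common case */
	return bsearch(&xid, r->xids, r->n, sizeof(Xid), demo_xid_cmp) != NULL;
}

/* Drop every xid older than xmin; the survivors stay in sorted order. */
void
demo_purge_older(DemoRunning *r, Xid xmin)
{
	size_t		surviving = 0;

	for (size_t off = 0; off < r->n; off++)
	{
		if (!demo_xid_precedes(r->xids[off], xmin))
			r->xids[surviving++] = r->xids[off];
	}
	r->n = surviving;
	if (surviving == 0)
	{
		free(r->xids);
		r->xids = NULL;
	}
}
```

In the patch, the lookup result only decides whether the subtransactions listed in the COMMIT record are also marked as containing catalog changes; the top-level transaction is marked unconditionally whenever the record carries invalidation messages.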

Attachment: REL11_v10-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From 6315672c9af2dca8a604c31621ecfc831e13c84c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v10] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on WAL record types such as HEAP2_NEW_CID to know
whether a transaction had modified the catalog, and that information is
not serialized to the snapshot. Therefore, if logical decoding after a
restart decoded only the commit record of a transaction that had
actually modified a catalog, we failed to add its XID to the snapshot
and ended up looking at catalogs with the wrong snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the running transactions written in the xl_running_xacts record
that we decoded first, and marks a transaction as containing catalog
changes if it is in the list of the initial running transactions and its
commit record has XACT_XINFO_HAS_INVALS. To avoid ABI breakage, we store
the array of the initial running transactions in the static variables
InitialRunningXacts and NInitialRunningXacts, instead of storing those in
SnapBuild or ReorderBuffer.

This approach can produce false positives; we could end up adding a
transaction that didn't change the catalog to the snapshot, since we
cannot tell whether a transaction has catalog changes just by checking
the COMMIT record. The record doesn't say which (sub)transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog changes. But that won't be a problem
since we use the snapshot built during decoding only to read system
catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot, which avoids
the above false positives. But we cannot backpatch it because it changes
SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 +-
 src/backend/replication/logical/snapbuild.c   | 137 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 231 insertions(+), 9 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 65a91a8014..973b94738a 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -51,7 +51,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_check) \
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACT record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c085f7b0f3..dc83743c38 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -586,7 +586,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		if (!ctx->fast_forward)
 			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+		/*
+		 * If the COMMIT record has invalidation messages, it could have catalog
+		 * changes. It is possible that we didn't mark this transaction and
+		 * its subtransactions as containing catalog changes when the decoding
+		 * starts from a commit record without decoding the transaction's other
+		 * changes. Therefore, we make sure to mark such transactions as
+		 * containing catalog changes.
+		 *
+		 * This must be done before SnapBuildCommitTxn() so that we can include
+		 * these transactions in the historic snapshot.
+		 */
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1c52bc64e3..b3b30f8840 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -258,8 +258,37 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array once
+ * they become too old to matter, and then it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction in
+ * which case we will miss adding the xid to the snapshot and end up looking
+ * at the catalogs with the wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -896,12 +925,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -936,6 +970,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1143,7 +1220,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1294,6 +1371,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when the xl_running_xacts record that we decoded was written. We
+		 * use this later to identify the transactions that have performed
+		 * catalog changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -1996,3 +2087,35 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark
+ * its subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+	/* Skip if there is no initial running xacts information */
+	if (likely(NInitialRunningXacts == 0))
+		return;
+
+	/*
+	 * If this committed transaction was listed as running in the decoded
+	 * RUNNING_XACTS record and has made catalog changes, we can mark its
+	 * subtransactions as containing catalog changes.
+	 */
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 1df66a3c75..4df3c3f2f7 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 							 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
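
The back-branch patch above relies on two operations over a sorted array of xids: a binary-search membership test (as in SnapBuildXidSetCatalogChanges) and a purge of entries older than xmin (as in SnapBuildPurgeOlderTxn). The following is a minimal standalone sketch of both, not the PostgreSQL code itself. `Xid`, `xid_cmp`, `xid_precedes`, `xid_in_sorted` and `purge_older` are hypothetical names standing in for TransactionId, xidComparator and NormalTransactionIdPrecedes, and the purge compacts in place instead of copying through a workspace buffer as the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Xid;			/* hypothetical stand-in for TransactionId */

/* Plain numeric ordering, the order the array is assumed to be sorted in. */
static int
xid_cmp(const void *a, const void *b)
{
	Xid			x = *(const Xid *) a;
	Xid			y = *(const Xid *) b;

	if (x > y)
		return 1;
	if (x < y)
		return -1;
	return 0;
}

/* Wraparound-aware "a is older than b" test for normal xids. */
static int
xid_precedes(Xid a, Xid b)
{
	return (int32_t) (a - b) < 0;
}

/* Return 1 if xid is present in the sorted array, 0 otherwise. */
static int
xid_in_sorted(const Xid *xids, size_t n, Xid xid)
{
	if (n == 0)
		return 0;
	return bsearch(&xid, xids, n, sizeof(Xid), xid_cmp) != NULL;
}

/*
 * Drop every entry that precedes xmin, keeping the survivors in their
 * original (sorted) order. Returns the number of surviving entries.
 */
static size_t
purge_older(Xid *xids, size_t n, Xid xmin)
{
	size_t		kept = 0;
	size_t		i;

	assert(xids != NULL || n == 0);

	for (i = 0; i < n; i++)
	{
		if (!xid_precedes(xids[i], xmin))
			xids[kept++] = xids[i];
	}
	return kept;
}
```

A caller would sort the array once with `qsort(xids, n, sizeof(Xid), xid_cmp)`, consult `xid_in_sorted` when a commit record is decoded, and call `purge_older` whenever xmin advances.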

#116Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#115)
7 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, Aug 1, 2022 at 7:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 29, 2022 at 3:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I've attached updated patches for all branches. Please review them.

Thanks, the patches look mostly good to me. I have made minor edits:
removed 'likely' from a few places as it doesn't seem to be adding
much value, and changed comments in a few places. I was also getting a
compilation error in v11/v10 (snapbuild.c:2111:3: error: ‘for’ loop
initial declarations are only allowed in C99 mode), which I have fixed.
See attached. Unless there are major comments/suggestions, I am
planning to push this the day after tomorrow (Wednesday) after another
pass.
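
The compiler error quoted above is what older C modes report when a loop counter is declared inside the `for` statement, as in `for (int i = 0; ...)`. Presumably the fix moves the declaration to the top of the enclosing block. Below is a minimal sketch of that pattern, assuming a hypothetical helper `mark_subxacts` and type `Xid` that are not part of the patch:

```c
#include <assert.h>

typedef unsigned int Xid;		/* hypothetical stand-in for TransactionId */

/*
 * Set marked[i] to 1 for every valid (non-zero) subtransaction id and
 * return how many were marked. The loop counter is declared at the top
 * of the block, which compiles in C89 mode as well as C99.
 */
static int
mark_subxacts(const Xid *subxacts, int subxcnt, unsigned char *marked)
{
	int			i;
	int			nmarked = 0;

	assert(subxcnt >= 0);

	for (i = 0; i < subxcnt; i++)
	{
		if (subxacts[i] != 0)
		{
			marked[i] = 1;
			nmarked++;
		}
	}
	return nmarked;
}
```

The same source then builds unchanged on branches that require C89 and on those that allow C99.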

--
With Regards,
Amit Kapila.

Attachments:

master_v11-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From 32bcb82edaeebf9cd381851773ed3b06a8626e37 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 6 Jul 2022 12:53:36 +0900
Subject: [PATCH v11] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this change adds the list of transaction IDs and
sub-transaction IDs, that have modified catalogs and are running during
snapshot serialization, to the serialized snapshot. After restart or
otherwise, when we restore from such a serialized snapshot, the
corresponding list is restored in memory. Now, when decoding a COMMIT
record, we check both the list and the ReorderBuffer to see if the
transaction has modified catalogs.

Since this adds additional information to the serialized snapshot, we
cannot backpatch it. For back branches, we took another approach.
We remember the last-running-xacts list of the decoded RUNNING_XACTS
record after restoring the previously serialized snapshot. Then, we mark
the transaction as containing catalog changes if it's in the list of
initial running transactions and its commit record has
XACT_XINFO_HAS_INVALS. This doesn't require any file format changes but
the transaction will end up being added to the snapshot even if it has
only relcache invalidations. That is not a problem, since the snapshot
built during decoding is used only to read system catalogs.

This commit bumps SNAPBUILD_VERSION because of a change in SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../expected/catalog_change_snapshot.out           |  44 ++++
 .../specs/catalog_change_snapshot.spec             |  39 +++
 src/backend/replication/logical/decode.c           |   3 +-
 src/backend/replication/logical/reorderbuffer.c    |  71 +++++-
 src/backend/replication/logical/snapbuild.c        | 273 ++++++++++++++-------
 src/include/replication/reorderbuffer.h            |  12 +
 src/include/replication/snapbuild.h                |   2 +-
 8 files changed, 353 insertions(+), 93 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906..c7ce603 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000..dc4f9b7
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000..2971ddc
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c5c6a2b..1667d72 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -628,7 +628,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
-					   parsed->nsubxacts, parsed->subxacts);
+					   parsed->nsubxacts, parsed->subxacts,
+					   parsed->xinfo);
 
 	/* ----
 	 * Check whether we are interested in this specific transaction, and tell
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 88a37fd..f704613 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -349,6 +349,8 @@ ReorderBufferAllocate(void)
 	buffer->by_txn_last_xid = InvalidTransactionId;
 	buffer->by_txn_last_txn = NULL;
 
+	buffer->catchange_ntxns = 0;
+
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
 	buffer->size = 0;
@@ -366,6 +368,7 @@ ReorderBufferAllocate(void)
 
 	dlist_init(&buffer->toplevel_by_lsn);
 	dlist_init(&buffer->txns_by_base_snapshot_lsn);
+	dlist_init(&buffer->catchange_txns);
 
 	/*
 	 * Ensure there's no stale data from prior uses of this slot, in case some
@@ -1526,14 +1529,22 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
-	 * Remove TXN from its containing list.
+	 * Remove TXN from its containing lists.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
 	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
-	 * from the LSN-ordered list of toplevel TXNs.
+	 * from the LSN-ordered list of toplevel TXNs. We remove TXN from the list
+	 * of catalog modifying transactions as well.
 	 */
 	dlist_delete(&txn->node);
+	if (rbtxn_has_catalog_changes(txn))
+	{
+		dlist_delete(&txn->catchange_node);
+		rb->catchange_ntxns--;
+
+		Assert(rb->catchange_ntxns >= 0);
+	}
 
 	/* now remove reference from buffer */
 	hash_search(rb->by_txn,
@@ -3275,10 +3286,16 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 								  XLogRecPtr lsn)
 {
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	if (!rbtxn_has_catalog_changes(txn))
+	{
+		txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &txn->catchange_node);
+		rb->catchange_ntxns++;
+	}
 
 	/*
 	 * Mark top-level transaction as having catalog changes too if one of its
@@ -3286,8 +3303,52 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	 * conveniently check just top-level transaction and decide whether to
 	 * build the hash table or not.
 	 */
-	if (txn->toptxn != NULL)
-		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	toptxn = txn->toptxn;
+	if (toptxn != NULL && !rbtxn_has_catalog_changes(toptxn))
+	{
+		toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &toptxn->catchange_node);
+		rb->catchange_ntxns++;
+	}
+}
+
+/*
+ * Return palloc'ed array of the transactions that have changed catalogs.
+ * The returned array is sorted in xidComparator order.
+ *
+ * The caller must free the returned array when done with it.
+ */
+TransactionId *
+ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	TransactionId *xids = NULL;
+	size_t		xcnt = 0;
+
+	/* Quick return if the list is empty */
+	if (dlist_is_empty(&rb->catchange_txns))
+	{
+		Assert(rb->catchange_ntxns == 0);
+		return NULL;
+	}
+
+	/* Initialize XID array */
+	xids = (TransactionId *) palloc(sizeof(TransactionId) * rb->catchange_ntxns);
+	dlist_foreach(iter, &rb->catchange_txns)
+	{
+		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN,
+												catchange_node,
+												iter.cur);
+
+		Assert(rbtxn_has_catalog_changes(txn));
+
+		xids[xcnt++] = txn->xid;
+	}
+
+	qsort(xids, xcnt, sizeof(TransactionId), xidComparator);
+
+	Assert(xcnt == rb->catchange_ntxns);
+	return xids;
 }
 
 /*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 73c0f15..6f64d01 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -241,6 +241,33 @@ struct SnapBuild
 		 */
 		TransactionId *xip;
 	}			committed;
+
+	/*
+	 * Array of transactions and subtransactions that had modified catalogs
+	 * and were running when the snapshot was serialized.
+	 *
+	 * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+	 * if the transaction has changed the catalog. But it could happen that
+	 * the logical decoding decodes only the commit record of the transaction
+	 * after restoring the previously serialized snapshot in which case we
+	 * will miss adding the xid to the snapshot and end up looking at the
+	 * catalogs with the wrong snapshot.
+	 *
+	 * Now to avoid the above problem, we serialize the transactions that had
+	 * modified the catalogs and are still running at the time of snapshot
+	 * serialization. We fill this array while restoring the snapshot and then
+	 * refer to it while decoding a commit to check whether the xact has
+	 * modified the catalog. We remove xids from this array once they become
+	 * too old to matter, and then it eventually becomes empty.
+	 */
+	struct
+	{
+		/* number of transactions */
+		size_t		xcnt;
+
+		/* This array must be sorted in xidComparator order */
+		TransactionId *xip;
+	}			catchange;
 };
 
 /*
@@ -250,8 +277,8 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/* ->committed and ->catchange manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -262,6 +289,9 @@ static void SnapBuildSnapIncRefcount(Snapshot snap);
 
 static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
+static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
+												 uint32 xinfo);
+
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
 static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
@@ -269,6 +299,7 @@ static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutof
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
 static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path);
 
 /*
  * Allocate a new snapshot builder.
@@ -306,6 +337,9 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 		palloc0(builder->committed.xcnt_space * sizeof(TransactionId));
 	builder->committed.includes_all_transactions = true;
 
+	builder->catchange.xcnt = 0;
+	builder->catchange.xip = NULL;
+
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
@@ -888,12 +922,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed or containing catalog
+ * changes that are smaller than ->xmin. Those won't ever get checked via
+ * the ->committed or ->catchange array, respectively. The committed xids will
+ * get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from catchange array once it is
+ * finished (committed/aborted) but that could be costly as we need to maintain
+ * the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -928,6 +967,30 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/*
+	 * Either all the xacts got purged or none. It is only possible to
+	 * partially remove the xids from this array if one or more of the xids
+	 * are still running but not all. That can happen if we start decoding
+	 * from a point (LSN where the snapshot state became consistent) where all
+	 * the xacts in this were running and then at least one of those got
+	 * committed and a few are still running. We will never start from such a
+	 * point because we won't move the slot's restart_lsn past the point where
+	 * the oldest running transaction’s restart_decoding_lsn is.
+	 */
+	if (builder->catchange.xcnt == 0 ||
+		TransactionIdFollowsOrEquals(builder->catchange.xip[0],
+									 builder->xmin))
+		return;
+
+	Assert(TransactionIdFollows(builder->xmin,
+								builder->catchange.xip[builder->catchange.xcnt - 1]));
+	pfree(builder->catchange.xip);
+	builder->catchange.xip = NULL;
+	builder->catchange.xcnt = 0;
+
+	elog(DEBUG3, "purged catalog modifying transactions, oldest running xid %u",
+		 builder->xmin);
 }
 
 /*
@@ -935,7 +998,7 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
  */
 void
 SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
-				   int nsubxacts, TransactionId *subxacts)
+				   int nsubxacts, TransactionId *subxacts, uint32 xinfo)
 {
 	int			nxact;
 
@@ -983,7 +1046,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		 * Add subtransaction to base snapshot if catalog modifying, we don't
 		 * distinguish to toplevel transactions there.
 		 */
-		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (SnapBuildXidHasCatalogChanges(builder, subxid, xinfo))
 		{
 			sub_needs_timetravel = true;
 			needs_snapshot = true;
@@ -1012,7 +1075,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 
 	/* if top-level modified catalog, it'll need a snapshot */
-	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes",
 			 xid);
@@ -1089,6 +1152,29 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 }
 
+/*
+ * Check the reorder buffer and the snapshot to see if the given transaction has
+ * modified catalogs.
+ */
+static inline bool
+SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
+							  uint32 xinfo)
+{
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return true;
+
+	/*
+	 * The transactions that have changed catalogs must have invalidation
+	 * info.
+	 */
+	if (!(xinfo & XACT_XINFO_HAS_INVALS))
+		return false;
+
+	/* Check the catchange XID array */
+	return ((builder->catchange.xcnt > 0) &&
+			(bsearch(&xid, builder->catchange.xip, builder->catchange.xcnt,
+					 sizeof(TransactionId), xidComparator) != NULL));
+}
 
 /* -----------------------------------
  * Snapshot building functions dealing with xlog records
@@ -1135,7 +1221,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1438,6 +1524,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
  *
  * struct SnapBuildOnDisk;
  * TransactionId * committed.xcnt; (*not xcnt_space*)
+ * TransactionId * catchange.xcnt;
  *
  */
 typedef struct SnapBuildOnDisk
@@ -1467,7 +1554,7 @@ typedef struct SnapBuildOnDisk
 	offsetof(SnapBuildOnDisk, version)
 
 #define SNAPBUILD_MAGIC 0x51A1E001
-#define SNAPBUILD_VERSION 4
+#define SNAPBUILD_VERSION 5
 
 /*
  * Store/Load a snapshot from disk, depending on the snapshot builder's state.
@@ -1493,6 +1580,9 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 {
 	Size		needed_length;
 	SnapBuildOnDisk *ondisk = NULL;
+	TransactionId *catchange_xip = NULL;
+	MemoryContext old_ctx;
+	size_t		catchange_xcnt;
 	char	   *ondisk_c;
 	int			fd;
 	char		tmppath[MAXPGPATH];
@@ -1578,10 +1668,16 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 				(errcode_for_file_access(),
 				 errmsg("could not remove file \"%s\": %m", tmppath)));
 
+	old_ctx = MemoryContextSwitchTo(builder->context);
+
+	/* Get the catalog modifying transactions that are not yet committed */
+	catchange_xip = ReorderBufferGetCatalogChangesXacts(builder->reorder);
+	catchange_xcnt = builder->reorder->catchange_ntxns;
+
 	needed_length = sizeof(SnapBuildOnDisk) +
-		sizeof(TransactionId) * builder->committed.xcnt;
+		sizeof(TransactionId) * (builder->committed.xcnt + catchange_xcnt);
 
-	ondisk_c = MemoryContextAllocZero(builder->context, needed_length);
+	ondisk_c = palloc0(needed_length);
 	ondisk = (SnapBuildOnDisk *) ondisk_c;
 	ondisk->magic = SNAPBUILD_MAGIC;
 	ondisk->version = SNAPBUILD_VERSION;
@@ -1598,16 +1694,31 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	ondisk->builder.snapshot = NULL;
 	ondisk->builder.reorder = NULL;
 	ondisk->builder.committed.xip = NULL;
+	ondisk->builder.catchange.xip = NULL;
+	/* update catchange only on disk data */
+	ondisk->builder.catchange.xcnt = catchange_xcnt;
 
 	COMP_CRC32C(ondisk->checksum,
 				&ondisk->builder,
 				sizeof(SnapBuild));
 
 	/* copy committed xacts */
-	sz = sizeof(TransactionId) * builder->committed.xcnt;
-	memcpy(ondisk_c, builder->committed.xip, sz);
-	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
-	ondisk_c += sz;
+	if (builder->committed.xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * builder->committed.xcnt;
+		memcpy(ondisk_c, builder->committed.xip, sz);
+		COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+		ondisk_c += sz;
+	}
+
+	/* copy catalog modifying xacts */
+	if (catchange_xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * catchange_xcnt;
+		memcpy(ondisk_c, catchange_xip, sz);
+		COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+		ondisk_c += sz;
+	}
 
 	FIN_CRC32C(ondisk->checksum);
 
@@ -1688,12 +1799,16 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	 */
 	builder->last_serialized_snapshot = lsn;
 
+	MemoryContextSwitchTo(old_ctx);
+
 out:
 	ReorderBufferSetRestartPoint(builder->reorder,
 								 builder->last_serialized_snapshot);
 	/* be tidy */
 	if (ondisk)
 		pfree(ondisk);
+	if (catchange_xip)
+		pfree(catchange_xip);
 }
 
 /*
@@ -1707,7 +1822,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	int			fd;
 	char		path[MAXPGPATH];
 	Size		sz;
-	int			readBytes;
 	pg_crc32c	checksum;
 
 	/* no point in loading a snapshot if we're already there */
@@ -1739,29 +1853,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
 
 	/* read statically sized portion of snapshot */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk, SnapBuildOnDiskConstantSize);
-	pgstat_report_wait_end();
-	if (readBytes != SnapBuildOnDiskConstantSize)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes,
-							(Size) SnapBuildOnDiskConstantSize)));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk, SnapBuildOnDiskConstantSize, path);
 
 	if (ondisk.magic != SNAPBUILD_MAGIC)
 		ereport(ERROR,
@@ -1781,56 +1873,26 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 				SnapBuildOnDiskConstantSize - SnapBuildOnDiskNotChecksummedSize);
 
 	/* read SnapBuild */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk.builder, sizeof(SnapBuild));
-	pgstat_report_wait_end();
-	if (readBytes != sizeof(SnapBuild))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sizeof(SnapBuild))));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk.builder, sizeof(SnapBuild), path);
 	COMP_CRC32C(checksum, &ondisk.builder, sizeof(SnapBuild));
 
 	/* restore committed xacts information */
-	sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
-	ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, ondisk.builder.committed.xip, sz);
-	pgstat_report_wait_end();
-	if (readBytes != sz)
+	if (ondisk.builder.committed.xcnt > 0)
 	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
+		sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
+		ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
+		SnapBuildRestoreContents(fd, (char *) ondisk.builder.committed.xip, sz, path);
+		COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
+	}
 
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sz)));
+	/* restore catalog modifying xacts information */
+	if (ondisk.builder.catchange.xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * ondisk.builder.catchange.xcnt;
+		ondisk.builder.catchange.xip = MemoryContextAllocZero(builder->context, sz);
+		SnapBuildRestoreContents(fd, (char *) ondisk.builder.catchange.xip, sz, path);
+		COMP_CRC32C(checksum, ondisk.builder.catchange.xip, sz);
 	}
-	COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
 
 	if (CloseTransientFile(fd) != 0)
 		ereport(ERROR,
@@ -1885,6 +1947,13 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	ondisk.builder.committed.xip = NULL;
 
+	/* set catalog modifying transactions */
+	if (builder->catchange.xip)
+		pfree(builder->catchange.xip);
+	builder->catchange.xcnt = ondisk.builder.catchange.xcnt;
+	builder->catchange.xip = ondisk.builder.catchange.xip;
+	ondisk.builder.catchange.xip = NULL;
+
 	/* our snapshot is not interesting anymore, build a new one */
 	if (builder->snapshot != NULL)
 	{
@@ -1906,10 +1975,44 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 snapshot_not_interesting:
 	if (ondisk.builder.committed.xip != NULL)
 		pfree(ondisk.builder.committed.xip);
+	if (ondisk.builder.catchange.xip != NULL)
+		pfree(ondisk.builder.catchange.xip);
 	return false;
 }
 
 /*
+ * Read the contents of the serialized snapshot to the dest.
+ */
+static void
+SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
+{
+	int			readBytes;
+
+	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+	readBytes = read(fd, dest, size);
+	pgstat_report_wait_end();
+	if (readBytes != size)
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+
+		if (readBytes < 0)
+		{
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("could not read file \"%s\": read %d of %zu",
+							path, readBytes, size)));
+	}
+}
+
+/*
  * Remove all serialized snapshots that are not required anymore because no
  * slot can need them. This doesn't actually have to run during a checkpoint,
  * but it's a convenient point to schedule this.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d109d0b..fd84f17 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -381,6 +381,11 @@ typedef struct ReorderBufferTXN
 	dlist_node	node;
 
 	/*
+	 * A node in the list of catalog modifying transactions
+	 */
+	dlist_node	catchange_node;
+
+	/*
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
@@ -527,6 +532,12 @@ struct ReorderBuffer
 	dlist_head	txns_by_base_snapshot_lsn;
 
 	/*
+	 * Transactions and subtransactions that have modified system catalogs.
+	 */
+	dlist_head	catchange_txns;
+	int			catchange_ntxns;
+
+	/*
 	 * one-entry sized cache for by_txn. Very frequently the same txn gets
 	 * looked up over and over again.
 	 */
@@ -677,6 +688,7 @@ extern void ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
 extern void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 extern ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 extern TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
+extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index d179251..e6adea2 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -82,7 +82,7 @@ extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
-							   TransactionId *subxacts);
+							   TransactionId *subxacts, uint32 xinfo);
 extern bool SnapBuildProcessChange(SnapBuild *builder, TransactionId xid,
 								   XLogRecPtr lsn);
 extern void SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
-- 
1.8.3.1
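
The on-disk layout the patch above produces (fixed-size header, then the committed xids, then the catalog-modifying xids) can be sketched in isolation. This is only an illustrative sketch, not the patch's code: DemoHeader, demo_serialize and demo_catchange_at are hypothetical names, the header is a simplified stand-in for SnapBuildOnDisk, and the CRC computation is left out entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

/* Hypothetical, simplified stand-in for the fixed-size part of the file. */
typedef struct DemoHeader
{
	uint32_t	committed_xcnt;	/* number of committed xids that follow */
	uint32_t	catchange_xcnt;	/* number of catalog-modifying xids after those */
} DemoHeader;

/* Serialize header + committed xids + catalog-modifying xids into one buffer. */
static char *
demo_serialize(const TransactionId *committed, uint32_t ncommitted,
			   const TransactionId *catchange, uint32_t ncatchange,
			   size_t *len)
{
	DemoHeader	hdr = {ncommitted, ncatchange};
	size_t		needed = sizeof(DemoHeader) +
		sizeof(TransactionId) * ((size_t) ncommitted + ncatchange);
	char	   *buf = calloc(1, needed);
	char	   *p = buf;

	if (buf == NULL)
		return NULL;

	memcpy(p, &hdr, sizeof(hdr));
	p += sizeof(hdr);

	/* Skip empty arrays, as the patch does, so memcpy never sees NULL. */
	if (ncommitted > 0)
	{
		memcpy(p, committed, sizeof(TransactionId) * ncommitted);
		p += sizeof(TransactionId) * ncommitted;
	}
	if (ncatchange > 0)
	{
		memcpy(p, catchange, sizeof(TransactionId) * ncatchange);
		p += sizeof(TransactionId) * ncatchange;
	}

	/* The write cursor must end exactly at the computed length. */
	assert((size_t) (p - buf) == needed);
	*len = needed;
	return buf;
}

/* Read back the i-th catalog-modifying xid from a serialized buffer. */
static TransactionId
demo_catchange_at(const char *buf, uint32_t i)
{
	DemoHeader	hdr;
	TransactionId xid;

	memcpy(&hdr, buf, sizeof(hdr));
	assert(i < hdr.catchange_xcnt);
	memcpy(&xid,
		   buf + sizeof(hdr) +
		   sizeof(TransactionId) * ((size_t) hdr.committed_xcnt + i),
		   sizeof(xid));
	return xid;
}
```

A reader of such a buffer has to know both counts before it can locate the second array, which is presumably why the patch stores catchange.xcnt in the checksummed builder struct before writing the arrays.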

Attachment: REL15_v11-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From 33b9a7002f3f0f2733f02c676b43fbdb6e70600e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v11] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. The COMMIT record doesn't say which (sub)transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has a catalog change. But that won't be a problem since
we use the snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../expected/catalog_change_snapshot.out           |  44 +++++++
 .../specs/catalog_change_snapshot.spec             |  39 ++++++
 src/backend/replication/logical/decode.c           |  15 +++
 src/backend/replication/logical/snapbuild.c        | 137 +++++++++++++++++++--
 src/include/replication/snapbuild.h                |   3 +
 6 files changed, 232 insertions(+), 8 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906..c7ce603 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000..dc4f9b7
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000..662760f
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record, and the decoding
+# of the INSERT record must read pg_class with the correct historic snapshot.
+#
+# Note that if bgwriter writes a RUNNING_XACTS record between "s0_commit" and
+# "s0_begin", this doesn't happen, as the decoding then starts from the
+# RUNNING_XACTS record written by bgwriter.  One might think we can either stop
+# the bgwriter or increase LOG_SNAPSHOT_INTERVAL_MS, but that's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index aa2427b..ea8a216 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -627,6 +627,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. It is possible that we didn't mark this transaction as
+	 * containing catalog changes when the decoding starts from a commit
+	 * record without decoding the transaction's other changes. So, we make
+	 * sure to mark such transactions as containing catalog changes.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * these transactions in the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1119a12..385817e 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -250,8 +250,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become too old to matter, and then it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -888,12 +918,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * Ideally, we would remove a transaction from the InitialRunningXacts array
+ * as soon as it finishes (commits or aborts), but that could be costly as we
+ * would need to maintain the xid order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -928,6 +963,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1135,7 +1213,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1286,6 +1364,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when the xl_running_xacts record that we decoded was written. We
+		 * use this later to identify the transactions that have performed
+		 * catalog changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -2000,3 +2092,34 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark the
+ * transaction and its subtransactions as containing catalog changes. See
+ * comments for NInitialRunningXacts and InitialRunningXacts for additional
+ * info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (NInitialRunningXacts == 0 ||
+		ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index d179251..53d83f3 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -91,4 +91,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
1.8.3.1
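
The back-branch approach above boils down to three operations on a sorted xid array: remember the running xids, look an xid up at commit time, and purge entries older than xmin. The following is a minimal standalone sketch of that idea, not the patch itself. All demo_* names are hypothetical, the "commit record has invalidations" check is reduced to a bool parameter, and the purge compacts the array in place rather than going through a workspace buffer as the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Plain numeric ordering, used for both qsort and bsearch. */
static int
demo_xid_cmp(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	return (x > y) - (x < y);
}

/* Wraparound-aware "a is older than b" (modulo-2^32 comparison). */
static bool
demo_xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/* Sort the remembered xids so later lookups can use binary search. */
static void
demo_remember(TransactionId *xids, size_t n)
{
	assert(xids != NULL || n == 0);
	if (n > 0)
		qsort(xids, n, sizeof(TransactionId), demo_xid_cmp);
}

/*
 * Decide at commit time whether to treat xid as catalog-modifying: only if
 * its commit record carried invalidations AND it was initially running.
 */
static bool
demo_should_mark(const TransactionId *xids, size_t n,
				 TransactionId xid, bool commit_has_invals)
{
	if (!commit_has_invals || n == 0)
		return false;
	return bsearch(&xid, xids, n, sizeof(TransactionId),
				   demo_xid_cmp) != NULL;
}

/* Drop xids older than xmin, keeping the rest in order; returns new count. */
static size_t
demo_purge_older(TransactionId *xids, size_t n, TransactionId xmin)
{
	size_t		kept = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (!demo_xid_precedes(xids[i], xmin))
			xids[kept++] = xids[i];
	}
	return kept;
}
```

Because the lookup only answers "was this xid running at the restart point", it can report a transaction that never touched the catalog; that is the false positive the commit message describes as harmless.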

Attachment: REL14_v11-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From f2ff48209f82d942417389cc0e67e828252c2309 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v11] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. The COMMIT record doesn't say which (sub)transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has a catalog change. But that won't be a problem since
we use the snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../expected/catalog_change_snapshot.out           |  44 +++++++
 .../specs/catalog_change_snapshot.spec             |  39 ++++++
 src/backend/replication/logical/decode.c           |  15 +++
 src/backend/replication/logical/snapbuild.c        | 137 +++++++++++++++++++--
 src/include/replication/snapbuild.h                |   3 +
 6 files changed, 232 insertions(+), 8 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b..4553252 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot
+	twophase_snapshot catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000..dc4f9b7
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000..662760f
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record, and the decoding
+# of the INSERT record must read pg_class with the correct historic snapshot.
+#
+# Note that if bgwriter writes a RUNNING_XACTS record between "s0_commit" and
+# "s0_begin", this doesn't happen, as the decoding then starts from the
+# RUNNING_XACTS record written by bgwriter.  One might think we can either stop
+# the bgwriter or increase LOG_SNAPSHOT_INTERVAL_MS, but that's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 92dfafc..5a440e6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -691,6 +691,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. It is possible that we didn't mark this transaction as
+	 * containing catalog changes when the decoding starts from a commit
+	 * record without decoding the transaction's other changes. So we make
+	 * sure to mark such transactions as containing catalog changes.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * these transactions in the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 6df6024..ac09f0e 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -250,8 +250,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array once
+ * they become too old to matter, so it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * To avoid the above problem, if the COMMIT record of an xid listed in
+ * InitialRunningXacts has the XACT_XINFO_HAS_INVALS flag, we mark both the
+ * top-level transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding a transaction that didn't change the catalog
+ * to the snapshot, since we cannot tell whether a transaction has
+ * catalog changes just by checking its COMMIT record. The record doesn't
+ * say which (sub)transaction has catalog changes, and
+ * XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog changes. But that won't be a problem, since we
+ * use the snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -879,12 +909,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -919,6 +954,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1126,7 +1204,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1277,6 +1355,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when the xl_running_xacts record that we decoded was written. We
+		 * use this later to identify the transactions that have performed
+		 * catalog changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -1993,3 +2085,34 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark the
+ * transaction and its subtransactions as containing catalog changes. See
+ * comments for NInitialRunningXacts and InitialRunningXacts for additional
+ * info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (NInitialRunningXacts == 0 ||
+		ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 3604621..a19b59e 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -90,4 +90,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
1.8.3.1
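
For readers following the patch above, here is a minimal standalone sketch of the bookkeeping it adds: a sorted array of xids that supports binary-search lookup and a purge of everything older than xmin. It is not the PostgreSQL code. The names RunningXacts, running_xacts_init, running_xacts_contains and running_xacts_purge are hypothetical, malloc/free stand in for the memory-context calls, and the purge compacts in place rather than copying through a workspace array as the patch does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

/* Plain numeric ordering, in the spirit of xidComparator. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	if (xa > xb)
		return 1;
	if (xa < xb)
		return -1;
	return 0;
}

/* Wraparound-aware "a is older than b" test (modulo-2^32 arithmetic). */
static int
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/* Hypothetical container for the remembered running xids. */
typedef struct RunningXacts
{
	TransactionId *xids;		/* sorted with xid_cmp */
	int			nxids;
} RunningXacts;

/* Copy and sort the xids; returns 0 on success, -1 if allocation fails. */
static int
running_xacts_init(RunningXacts *rx, const TransactionId *xids, int n)
{
	assert(n >= 0);
	rx->xids = NULL;
	rx->nxids = 0;
	if (n == 0)
		return 0;

	rx->xids = malloc(sizeof(TransactionId) * (size_t) n);
	if (rx->xids == NULL)
		return -1;
	memcpy(rx->xids, xids, sizeof(TransactionId) * (size_t) n);
	qsort(rx->xids, (size_t) n, sizeof(TransactionId), xid_cmp);
	rx->nxids = n;
	return 0;
}

/* Binary search; valid only because the array is kept sorted. */
static int
running_xacts_contains(const RunningXacts *rx, TransactionId xid)
{
	if (rx->nxids == 0)
		return 0;
	return bsearch(&xid, rx->xids, (size_t) rx->nxids,
				   sizeof(TransactionId), xid_cmp) != NULL;
}

/* Drop every xid that precedes xmin; survivors keep their relative order. */
static void
running_xacts_purge(RunningXacts *rx, TransactionId xmin)
{
	int			kept = 0;

	for (int i = 0; i < rx->nxids; i++)
	{
		if (!xid_precedes(rx->xids[i], xmin))
			rx->xids[kept++] = rx->xids[i];
	}
	rx->nxids = kept;

	if (kept == 0)
	{
		free(rx->xids);			/* free(NULL) is a no-op */
		rx->xids = NULL;
	}
}
```

A caller would initialize the structure once from the running-xacts list, call running_xacts_contains() when a commit record is decoded, and call running_xacts_purge() each time xmin advances. Because purging only removes elements and never reorders them, the array stays sorted and the binary search remains valid.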

Attachment: REL13_v11-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From 070c7091aec51adaa13d9b8587f1a5e8679a5011 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v11] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog change. But that won't be a problem since
we use snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../expected/catalog_change_snapshot.out           |  44 +++++++
 .../specs/catalog_change_snapshot.spec             |  39 ++++++
 src/backend/replication/logical/decode.c           |  15 ++-
 src/backend/replication/logical/snapbuild.c        | 133 +++++++++++++++++++--
 src/include/replication/snapbuild.h                |   3 +
 6 files changed, 227 insertions(+), 9 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..6ec09ab 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000..dc4f9b7
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000..662760f
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record, and the decoding
+# of the INSERT record must read pg_class with the correct historic snapshot.
+#
+# Note that if bgwriter writes a RUNNING_XACTS record between "s0_commit" and
+# "s0_begin", this doesn't happen, as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we could either stop the bgwriter
+# or increase LOG_SNAPSHOT_INTERVAL_MS, but neither is practical in tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5a2b828..87cbd08e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -582,7 +582,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		if (!ctx->fast_forward)
 			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+		/*
+		 * If the COMMIT record has invalidation messages, it could have catalog
+		 * changes. It is possible that we didn't mark this transaction and
+		 * its subtransactions as containing catalog changes when the decoding
+		 * starts from a commit record without decoding the transaction's other
+		 * changes. Therefore, we make sure to mark such transactions as
+		 * containing catalog changes.
+		 *
+		 * This must be done before SnapBuildCommitTxn() so that we can include
+		 * these transactions in the historic snapshot.
+		 */
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index be46bf0..83e4026 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -252,8 +252,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array once
+ * they become too old to matter, so it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * To avoid the above problem, if the COMMIT record of an xid listed in
+ * InitialRunningXacts has the XACT_XINFO_HAS_INVALS flag, we mark both the
+ * top-level transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding a transaction that didn't change the catalog
+ * to the snapshot, since we cannot tell whether a transaction has
+ * catalog changes just by checking its COMMIT record. The record doesn't
+ * say which (sub)transaction has catalog changes, and
+ * XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog changes. But that won't be a problem, since we
+ * use the snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -890,12 +920,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -930,6 +965,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1137,7 +1215,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1288,6 +1366,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when the xl_running_xacts record that we decoded was written. We
+		 * use this later to identify the transactions that have performed
+		 * catalog changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -2030,3 +2122,30 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark its
+ * subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+	/* Skip if there is no initial running xacts information */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index b048dc7..17d2f93 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
1.8.3.1
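
The commit-time decision in the REL13 patch above can be modelled in isolation. The sketch below is a toy, not the backend code: SKETCH_HAS_INVALS is a placeholder bit standing in for XACT_XINFO_HAS_INVALS, CommitInfo and MarkedSet are hypothetical stand-ins for the parsed commit record and the reorder buffer's per-transaction flag, and the membership test is a linear scan where the patch uses bsearch on a sorted array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Placeholder flag bit for this sketch only. */
#define SKETCH_HAS_INVALS (1U << 3)

/* Hypothetical, trimmed-down view of a parsed commit record. */
typedef struct CommitInfo
{
	TransactionId xid;
	uint32_t	xinfo;
	int			nsubxacts;
	const TransactionId *subxacts;
} CommitInfo;

/* Toy set of xids marked as "has catalog changes". */
#define MAX_MARKED 16
typedef struct MarkedSet
{
	TransactionId xids[MAX_MARKED];
	int			n;
} MarkedSet;

static bool
xid_in(const TransactionId *xids, int n, TransactionId xid)
{
	for (int i = 0; i < n; i++)
	{
		if (xids[i] == xid)
			return true;
	}
	return false;
}

static bool
marked_contains(const MarkedSet *set, TransactionId xid)
{
	return xid_in(set->xids, set->n, xid);
}

static void
marked_add(MarkedSet *set, TransactionId xid)
{
	if (marked_contains(set, xid))
		return;					/* already marked */
	assert(set->n < MAX_MARKED);
	set->xids[set->n++] = xid;
}

/*
 * At commit time: a record without invalidations marks nothing.  With
 * invalidations, the top-level xid is always marked; its subtransactions are
 * marked only if the top-level xid was in the initial running list, i.e. its
 * earlier records may have been skipped by the restart.
 */
static void
mark_catalog_changes_at_commit(MarkedSet *marked, const CommitInfo *c,
							   const TransactionId *initial_running,
							   int ninitial)
{
	if (!(c->xinfo & SKETCH_HAS_INVALS))
		return;

	marked_add(marked, c->xid);

	if (!xid_in(initial_running, ninitial, c->xid))
		return;

	for (int i = 0; i < c->nsubxacts; i++)
		marked_add(marked, c->subxacts[i]);
}
```

The asymmetry is deliberate in this model: a transaction that started after the restart point has all of its records decoded, so its subtransactions are marked through the normal path, whereas one that was already running at the restart point needs the conservative marking, at the cost of the false positives the commit message describes.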

Attachment: REL12_v11-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From 7b84a4d2b9e4b7bc0b432c338c3d0d0032ec0e09 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v11] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog change. But that won't be a problem since
we use snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../expected/catalog_change_snapshot.out           |  44 +++++++
 .../specs/catalog_change_snapshot.spec             |  39 ++++++
 src/backend/replication/logical/decode.c           |  15 ++-
 src/backend/replication/logical/snapbuild.c        | 133 +++++++++++++++++++--
 src/include/replication/snapbuild.h                |   3 +
 6 files changed, 227 insertions(+), 9 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..6ec09ab 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000..dc4f9b7
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000..662760f
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record, and the decoding
+# of the INSERT record must read pg_class with the correct historic snapshot.
+#
+# Note that if bgwriter writes a RUNNING_XACTS record between "s0_commit" and
+# "s0_begin", this doesn't happen, as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we could either stop the bgwriter
+# or increase LOG_SNAPSHOT_INTERVAL_MS, but neither is practical in tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 60d07ce..19cd0bf 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -585,7 +585,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		if (!ctx->fast_forward)
 			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+		/*
+		 * If the COMMIT record has invalidation messages, it could have catalog
+		 * changes. It is possible that we didn't mark this transaction and
+		 * its subtransactions as containing catalog changes when the decoding
+		 * starts from a commit record without decoding the transaction's other
+		 * changes. Therefore, we make sure to mark such transactions as
+		 * containing catalog changes.
+		 *
+		 * This must be done before SnapBuildCommitTxn() so that we can include
+		 * these transactions in the historic snapshot.
+		 */
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 5a1bce5..bee10ff 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -257,8 +257,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array once
+ * they are older than ->xmin, so it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -895,12 +925,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -935,6 +970,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1142,7 +1220,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1293,6 +1371,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when the xl_running_xacts record that we decoded was written. We
+		 * use this later to identify the transactions that have performed
+		 * catalog changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -2035,3 +2127,30 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark
+ * its subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+	/* Skip if there is no initial running xacts information */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 3acf68f..2eb9532 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
1.8.3.1
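
For readers following the patch, the "remember, then look up" part of the fix can be modeled outside PostgreSQL roughly as below. This is only a sketch under stated assumptions: `TransactionId` is replaced by a plain `uint32_t`, `malloc` stands in for `MemoryContextAlloc`, and the names `InitialRunning`, `xid_cmp`, `remember_running` and `was_initially_running` are hypothetical, not part of the patch. It shows why the array is sorted once with `qsort` (so that the later check at commit time can use `bsearch`), not how the backend actually manages memory.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;	/* stand-in for PostgreSQL's TransactionId */

/* Plain numeric ordering, in the spirit of xidComparator. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	if (xa > xb)
		return 1;
	if (xa < xb)
		return -1;
	return 0;
}

/* Hypothetical holder for the remembered running-xacts list. */
typedef struct InitialRunning
{
	TransactionId *xids;
	int			nxids;
} InitialRunning;

/* Copy the xids of a running-xacts record and sort the copy once. */
static InitialRunning
remember_running(const TransactionId *running, int n)
{
	InitialRunning r = {NULL, n};

	if (n > 0)
	{
		r.xids = malloc(sizeof(TransactionId) * n);
		assert(r.xids != NULL);
		memcpy(r.xids, running, sizeof(TransactionId) * n);
		qsort(r.xids, n, sizeof(TransactionId), xid_cmp);
	}
	return r;
}

/* Binary search; returns nonzero if xid was in the remembered list. */
static int
was_initially_running(const InitialRunning *r, TransactionId xid)
{
	if (r->nxids == 0)
		return 0;
	return bsearch(&xid, r->xids, r->nxids,
				   sizeof(TransactionId), xid_cmp) != NULL;
}
```

In the patch, a positive lookup at commit time is what triggers marking the subtransactions as containing catalog changes; the sketch stops at the lookup itself.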

REL11_v11-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From cee70ac1e8199ca415aface489808c01058b779b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v11] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog change. But that won't be a problem since
we use snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../expected/catalog_change_snapshot.out           |  44 +++++++
 .../specs/catalog_change_snapshot.spec             |  39 ++++++
 src/backend/replication/logical/decode.c           |  15 ++-
 src/backend/replication/logical/snapbuild.c        | 134 +++++++++++++++++++--
 src/include/replication/snapbuild.h                |   3 +
 6 files changed, 228 insertions(+), 9 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 65a91a8..973b947 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -51,7 +51,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_check) \
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000..dc4f9b7
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000..662760f
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c085f7b..dc83743 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -586,7 +586,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		if (!ctx->fast_forward)
 			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+		/*
+		 * If the COMMIT record has invalidation messages, it could have catalog
+		 * changes. It is possible that we didn't mark this transaction and
+		 * its subtransactions as containing catalog changes when the decoding
+		 * starts from a commit record without decoding the transaction's other
+		 * changes. Therefore, we make sure to mark such transactions as
+		 * containing catalog changes.
+		 *
+		 * This must be done before SnapBuildCommitTxn() so that we can include
+		 * these transactions in the historic snapshot.
+		 */
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1c52bc6..e1d20e2 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -258,8 +258,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array once
+ * they are older than ->xmin, so it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -896,12 +926,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -936,6 +971,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1143,7 +1221,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1294,6 +1372,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when the xl_running_xacts record that we decoded was written. We
+		 * use this later to identify the transactions that have performed
+		 * catalog changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -1996,3 +2088,31 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark
+ * its subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	int		i;
+	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+	/* Skip if there is no initial running xacts information */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		for (i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 1df66a3..4df3c3f 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 							 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
1.8.3.1
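
The purge step added to SnapBuildPurgeOlderTxn() can likewise be sketched in isolation. Assumptions, all mine rather than the patch's: `TransactionId` is a plain `uint32_t`; `xid_precedes` is a hypothetical helper doing the modulo-2^32 signed-difference comparison that NormalTransactionIdPrecedes performs for normal xids; and `purge_older` compacts the array in place, whereas the patch copies survivors through a temporary workspace array. Either way the survivors keep their relative order, so a sorted array stays sorted for later `bsearch` calls.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in for PostgreSQL's TransactionId */

/*
 * Wraparound-aware "a precedes b": the unsigned difference is reinterpreted
 * as a signed 32-bit value (hypothetical helper, normal xids only).
 */
static int
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Drop every xid that precedes xmin, compacting the array in place.
 * Returns the number of surviving xids; their relative order is unchanged.
 */
static int
purge_older(TransactionId *xids, int n, TransactionId xmin)
{
	int			surviving = 0;

	assert(n >= 0);
	for (int off = 0; off < n; off++)
	{
		if (!xid_precedes(xids[off], xmin))
			xids[surviving++] = xids[off];
	}
	return surviving;
}
```

A caller would invoke `purge_older` each time xmin advances and free the array once the returned count reaches zero, which corresponds to the `pfree(InitialRunningXacts)` branch in the patch.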

REL10_v11-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From ce18adbdf7b33c745aa96701b85b4f691e3068cb Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v11] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog change. But that won't be a problem since
we use snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  41 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  16 ++-
 src/backend/replication/logical/snapbuild.c   | 134 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 226 insertions(+), 9 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 2db2b2774b..73bc0fe1fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -51,7 +51,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_check) \
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..15f9540b3f
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,41 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..662760fbcf
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the XACT_RUNNING record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACT
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6f8920f52c..3233104fc9 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -561,7 +561,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	{
 		ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 									  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+
+		/*
+		 * If the COMMIT record has invalidation messages, it could have catalog
+		 * changes. It is possible that we didn't mark this transaction and
+		 * its subtransactions as containing catalog changes when the decoding
+		 * starts from a commit record without decoding the transaction's other
+		 * changes. Therefore, we ensure to mark such transactions as containing
+		 * catalog change.
+		 *
+		 * This must be done before SnapBuildCommitTxn() so that we can include
+		 * these transactions in the historic snapshot.
+		 */
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1010a2e869..1acb44c686 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -258,8 +258,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become old enough to matter, and then it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its substransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitailRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -896,12 +926,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -936,6 +971,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there is no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1143,7 +1221,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1294,6 +1372,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded was written. We use
+		 * this later to identify the transactions have performed catalog
+		 * changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -1997,3 +2089,31 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark the
+ * its subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	int		i;
+	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+	/* Skip if there is no initial running xacts information */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		for (i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index b95f56eec3..7a796ce136 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 							 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.28.0.windows.1
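
The membership test that SnapBuildXidSetCatalogChanges() performs in the patch above can be sketched in isolation. This is only an illustration under stated assumptions: InitialXacts, initial_xacts_remember(), initial_xacts_contains() and xid_cmp() are hypothetical names, malloc() stands in for MemoryContextAlloc(), and xid_cmp() assumes the same plain numeric ordering that xidComparator uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

/* Plain numeric ordering (assumed to match xidComparator; not wraparound-aware). */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	if (x > y)
		return 1;
	if (x < y)
		return -1;
	return 0;
}

/* Hypothetical stand-in for the static InitialRunningXacts state. */
typedef struct InitialXacts
{
	TransactionId *xids;
	int			nxids;
} InitialXacts;

/* Copy the running-xacts list and sort it so that bsearch() works later. */
static InitialXacts
initial_xacts_remember(const TransactionId *running, int n)
{
	InitialXacts s = {NULL, 0};

	assert(n >= 0);
	if (n == 0)
		return s;

	s.xids = malloc(sizeof(TransactionId) * (size_t) n);
	if (s.xids == NULL)
		return s;				/* allocation failed: remember nothing */

	memcpy(s.xids, running, sizeof(TransactionId) * (size_t) n);
	qsort(s.xids, (size_t) n, sizeof(TransactionId), xid_cmp);
	s.nxids = n;
	return s;
}

/* Was xid running when the RUNNING_XACTS record was written? */
static int
initial_xacts_contains(const InitialXacts *s, TransactionId xid)
{
	if (s->nxids == 0)
		return 0;
	return bsearch(&xid, s->xids, (size_t) s->nxids,
				   sizeof(TransactionId), xid_cmp) != NULL;
}
```

In the patch, a hit in this lookup is what triggers marking the subtransactions listed in the commit record as containing catalog changes.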

#117Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amit Kapila (#116)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

At Mon, 1 Aug 2022 20:01:00 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

On Mon, Aug 1, 2022 at 7:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 29, 2022 at 3:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I've attached updated patches for all branches. Please review them.

Thanks, the patches look mostly good to me. I have made minor edits by
removing 'likely' from a few places as those don't seem to be adding
much value, changed comments at a few places, and was getting
a compilation error in v11/10 (snapbuild.c:2111:3: error: ‘for’ loop
initial declarations are only allowed in C99 mode), which I have fixed.
See attached. Unless there are major comments/suggestions, I am
planning to push this the day after tomorrow (by Wednesday) after another
pass.

master:
+ * Read the contents of the serialized snapshot to the dest.

Do we need the "the" before the "dest"?

+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+
+		if (readBytes < 0)
+		{
+			errno = save_errno;
+			ereport(ERROR,

Do we need the CloseTransientFile(fd) there? This call requires errno
to be remembered, but OpenTransientFile'd files are closed at
transaction end anyway. Actually, CloseTransientFile() is not called
before erroring out in other places.

+ * from the LSN-ordered list of toplevel TXNs. We remove TXN from the list

We remove "the" TXN?

+	if (dlist_is_empty(&rb->catchange_txns))
+	{
+		Assert(rb->catchange_ntxns == 0);
+		return NULL;
+	}

It seems that the assert is far simpler than dlist_is_empty(). Why
don't we swap the conditions for if() and Assert() in the above?

+ * the oldest running transaction窶冱 restart_decoding_lsn is.

The line contains broken characters.

+	 * Either all the xacts got purged or none. It is only possible to
+	 * partially remove the xids from this array if one or more of the xids
+	 * are still running but not all. That can happen if we start decoding

Assuming this, the comment below seems to be getting stale.

+	 * catalog. We remove xids from this array when they become old enough to
+	 * matter, and then it eventually becomes empty.

"We discard this array when all the xids it contains are gone. See
SnapBuildPurgeOlderTxn for details." or something like that?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#118Amit Kapila
amit.kapila16@gmail.com
In reply to: Kyotaro Horiguchi (#117)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Aug 2, 2022 at 12:00 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Mon, 1 Aug 2022 20:01:00 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

On Mon, Aug 1, 2022 at 7:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 29, 2022 at 3:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I've attached updated patches for all branches. Please review them.

Thanks, the patches look mostly good to me. I have made minor edits by
removing 'likely' from a few places as those don't seem to be adding
much value, changed comments at a few places, and was getting
a compilation error in v11/10 (snapbuild.c:2111:3: error: ‘for’ loop
initial declarations are only allowed in C99 mode), which I have fixed.
See attached. Unless there are major comments/suggestions, I am
planning to push this the day after tomorrow (by Wednesday) after another
pass.

+       {
+               int                     save_errno = errno;
+
+               CloseTransientFile(fd);
+
+               if (readBytes < 0)
+               {
+                       errno = save_errno;
+                       ereport(ERROR,

Do we need the CloseTransientFile(fd) there? This call requires errno
to be remembered, but OpenTransientFile'd files are closed at
transaction end anyway. Actually, CloseTransientFile() is not called
before erroring out in other places.

But this part of the code is just a copy of the existing code. See:

- if (readBytes != sizeof(SnapBuild))
- {
- int save_errno = errno;
-
- CloseTransientFile(fd);
-
- if (readBytes < 0)
- {
- errno = save_errno;
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not read file \"%s\": %m", path)));
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("could not read file \"%s\": read %d of %zu",
- path, readBytes, sizeof(SnapBuild))));
- }

We just moved it to a separate function as the same code is being
duplicated in multiple places.
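
The pattern under discussion (save errno, close the descriptor, then report the read failure) can be sketched outside the backend. This is a minimal illustration, not the helper from the patch: read_exact_or_close() and ReadStatus are hypothetical names, plain read()/close() stand in for the transient-file API, and a status code is returned where the backend would call ereport(ERROR).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

typedef enum ReadStatus
{
	READ_OK,					/* exactly `size` bytes were read */
	READ_IO_ERROR,				/* read() failed; errno describes why */
	READ_SHORT					/* fewer bytes than requested (corrupt file) */
} ReadStatus;

/*
 * Read exactly `size` bytes from fd into dest.  On any failure the
 * descriptor is closed before returning, and errno from read() is
 * restored because close() may overwrite it.
 */
static ReadStatus
read_exact_or_close(int fd, void *dest, size_t size)
{
	ssize_t		nread;

	assert(dest != NULL);
	nread = read(fd, dest, size);

	if (nread != (ssize_t) size)
	{
		int			save_errno = errno;

		close(fd);				/* may clobber errno */

		if (nread < 0)
		{
			errno = save_errno;
			return READ_IO_ERROR;
		}
		return READ_SHORT;
	}

	return READ_OK;
}
```

The save_errno dance is only needed because of the explicit close; dropping the close, as suggested in the review, would also remove the need to save errno.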

--
With Regards,
Amit Kapila.

#119shiy.fnst@fujitsu.com
shiy.fnst@fujitsu.com
In reply to: Amit Kapila (#116)
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, Aug 1, 2022 10:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 1, 2022 at 7:46 AM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

On Fri, Jul 29, 2022 at 3:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I've attached updated patches for all branches. Please review them.

Thanks, the patches look mostly good to me. I have made minor edits by
removing 'likely' from a few places as those don't seem to be adding
much value, changed comments at a few places, and was getting
a compilation error in v11/10 (snapbuild.c:2111:3: error: ‘for’ loop
initial declarations are only allowed in C99 mode), which I have fixed.
See attached. Unless there are major comments/suggestions, I am
planning to push this the day after tomorrow (by Wednesday) after another
pass.

Thanks for updating the patch.

Here are some minor comments:

1.
patches for REL10 ~ REL13:
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark the
+ * its subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.

"mark the its subtransactions"
->
"mark its subtransactions"

2.
patches for REL10 ~ REL15:
In the comment in catalog_change_snapshot.spec, maybe we can use "RUNNING_XACTS"
instead of "RUNNING_XACT" and "XACT_RUNNING", same as the patch for the master branch.

Regards,
Shi yu

#120Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amit Kapila (#118)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

At Tue, 2 Aug 2022 13:54:43 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

On Tue, Aug 2, 2022 at 12:00 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

+       {
+               int                     save_errno = errno;
+
+               CloseTransientFile(fd);
+
+               if (readBytes < 0)
+               {
+                       errno = save_errno;
+                       ereport(ERROR,

Do we need the CloseTransientFile(fd) there? This call requires errno
to be remembered, but OpenTransientFile'd files are closed at
transaction end anyway. Actually, CloseTransientFile() is not called
before erroring out in other places.

..

We just moved it to a separate function as the same code is being
duplicated in multiple places.

There are code paths that don't call CloseTransientFile() explicitly,
too. If there were no need for save_errno there, that'd be fine. But
otherwise I guess we prefer to let the orphaned fds be closed by ERROR,
and I don't think we need to preserve the less-preferred code pattern
(if we actually prefer not to have the explicit call).

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#121Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Kyotaro Horiguchi (#120)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Aug 3, 2022 at 10:20 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 2 Aug 2022 13:54:43 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

On Tue, Aug 2, 2022 at 12:00 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

+       {
+               int                     save_errno = errno;
+
+               CloseTransientFile(fd);
+
+               if (readBytes < 0)
+               {
+                       errno = save_errno;
+                       ereport(ERROR,

Do we need the CloseTransientFile(fd) there? This call requires errno
to be remembered, but OpenTransientFile'd files are closed at
transaction end anyway. Actually, CloseTransientFile() is not called
before erroring out in other places.

..

We just moved it to a separate function as the same code is being
duplicated in multiple places.

There are code paths that don't call CloseTransientFile() explicitly,
too. If there were no need for save_errno there, that'd be fine. But
otherwise I guess we prefer to let the orphaned fds be closed by ERROR,
and I don't think we need to preserve the less-preferred code pattern
(if we actually prefer not to have the explicit call).

Looking at other code in snapbuild.c, we call CloseTransientFile()
before erroring out in SnapBuildSerialize(). I think it's better to
keep it consistent with the nearby code in this patch. I think if we
prefer the style of closing the file by ereport(ERROR), it should be
done for all of them in a separate patch.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#122Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#121)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Aug 3, 2022 at 7:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Aug 3, 2022 at 10:20 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 2 Aug 2022 13:54:43 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

On Tue, Aug 2, 2022 at 12:00 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

+       {
+               int                     save_errno = errno;
+
+               CloseTransientFile(fd);
+
+               if (readBytes < 0)
+               {
+                       errno = save_errno;
+                       ereport(ERROR,

Do we need the CloseTransientFile(fd) there? This call requires errno
to be remembered, but OpenTransientFile'd files are closed at
transaction end anyway. Actually, CloseTransientFile() is not called
before erroring out in other places.

..

We just moved it to a separate function as the same code is being
duplicated in multiple places.

There are code paths that don't call CloseTransientFile() explicitly,
too. If there were no need for save_errno there, that'd be fine. But
otherwise I guess we prefer to let the orphaned fds be closed by ERROR,
and I don't think we need to preserve the less-preferred code pattern
(if we actually prefer not to have the explicit call).

Looking at other code in snapbuild.c, we call CloseTransientFile()
before erroring out in SnapBuildSerialize(). I think it's better to
keep it consistent with the nearby code in this patch. I think if we
prefer the style of closing the file by ereport(ERROR), it should be
done for all of them in a separate patch.

+1. I also feel it is better to change it in a separate patch as this
is not a pattern introduced by this patch.

--
With Regards,
Amit Kapila.

#123Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Kyotaro Horiguchi (#117)
7 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Aug 2, 2022 at 3:30 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Mon, 1 Aug 2022 20:01:00 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

On Mon, Aug 1, 2022 at 7:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 29, 2022 at 3:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I've attached updated patches for all branches. Please review them.

Thanks, the patches look mostly good to me. I have made minor edits by
removing 'likely' from a few places as those don't seem to be adding
much value, changed comments at a few places, and was getting
a compilation error in v11/10 (snapbuild.c:2111:3: error: ‘for’ loop
initial declarations are only allowed in C99 mode), which I have fixed.
See attached. Unless there are major comments/suggestions, I am
planning to push this the day after tomorrow (by Wednesday) after another
pass.

master:
+ * Read the contents of the serialized snapshot to the dest.

Do we need the "the" before the "dest"?

Fixed.

+       {
+               int                     save_errno = errno;
+
+               CloseTransientFile(fd);
+
+               if (readBytes < 0)
+               {
+                       errno = save_errno;
+                       ereport(ERROR,

Do we need the CloseTransientFile(fd) there? This call requires errno
to be remembered, but OpenTransientFile'd files are closed at
transaction end anyway. Actually, CloseTransientFile() is not called
before erroring out in other places.

As Amit mentioned, it's just moved from SnapBuildRestore(). Looking at
other code in snapbuild.c, we call CloseTransientFile() before erroring
out. I think it's better to keep it consistent with the nearby code.

+ * from the LSN-ordered list of toplevel TXNs. We remove TXN from the list

We remove "the" TXN?

Fixed.

+       if (dlist_is_empty(&rb->catchange_txns))
+       {
+               Assert(rb->catchange_ntxns == 0);
+               return NULL;
+       }

It seems that the assert is far simpler than dlist_is_empty(). Why
don't we swap the conditions for if() and Assert() in the above?

Changed.
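
For illustration, the swapped form suggested in the review might look like the sketch below. It is an assumption about the shape of the change, not the committed code: ListNode, CatChangeState, the list_* helpers and first_catchange_txn() are hypothetical stand-ins for dlist and the ReorderBuffer fields, and plain assert() stands in for Assert().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, standing in for dlist. */
typedef struct ListNode
{
	struct ListNode *prev;
	struct ListNode *next;
} ListNode;

/* Stand-in for the two ReorderBuffer fields named in the review. */
typedef struct CatChangeState
{
	ListNode	head;			/* list of catalog-changing transactions */
	int			catchange_ntxns;	/* number of entries in the list */
} CatChangeState;

static void
list_init(ListNode *head)
{
	head->next = head->prev = head;
}

static bool
list_is_empty(const ListNode *head)
{
	return head->next == head;
}

static void
list_push_tail(ListNode *head, ListNode *node)
{
	node->next = head;
	node->prev = head->prev;
	head->prev->next = node;
	head->prev = node;
}

/*
 * Return the first entry, or NULL if there is none.  The counter is the
 * runtime test; the list check only runs in assert-enabled builds.
 */
static ListNode *
first_catchange_txn(CatChangeState *rb)
{
	if (rb->catchange_ntxns == 0)
	{
		assert(list_is_empty(&rb->head));
		return NULL;
	}

	return rb->head.next;
}
```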

+ * the oldest running transaction窶冱 restart_decoding_lsn is.

The line contains broken characters.

Fixed.

+        * Either all the xacts got purged or none. It is only possible to
+        * partially remove the xids from this array if one or more of the xids
+        * are still running but not all. That can happen if we start decoding

Assuming this, the comment below seems to be getting stale.

+        * catalog. We remove xids from this array when they become old enough to
+        * matter, and then it eventually becomes empty.

"We discard this array when all the xids it contains are gone. See
SnapBuildPurgeOlderTxn for details." or something like that?

Changed to:

We discard this array when all the xids in the list become old enough
to matter. See SnapBuildPurgeOlderTxn for details.
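
The purge step that this comment refers to can be sketched as a standalone function. This is a simplified illustration rather than the code in the patch: purge_older_xids() and xid_precedes() are hypothetical names, malloc()/free() replace the memory-context calls, and the array is compacted in place instead of going through a workspace buffer. Loop variables are declared up front, since the thread notes that the older branches do not compile C99-style loop declarations.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/*
 * Wraparound-aware "a precedes b" for normal xids, using the same modular
 * comparison as NormalTransactionIdPrecedes.
 */
static int
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Drop every xid older than xmin from *xids, keeping the remaining entries
 * in their original (sorted) order.  Returns the number of surviving xids.
 * When nothing survives, the array is freed and *xids is set to NULL.
 */
static int
purge_older_xids(TransactionId **xids, int n, TransactionId xmin)
{
	int			off;
	int			surviving = 0;

	assert(n >= 0);

	for (off = 0; off < n; off++)
	{
		if (xid_precedes((*xids)[off], xmin))
			continue;			/* old enough: remove */
		(*xids)[surviving++] = (*xids)[off];
	}

	if (surviving == 0 && *xids != NULL)
	{
		free(*xids);
		*xids = NULL;
	}

	return surviving;
}
```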

I've attached updated patches that incorporated the above comments as
well as the comments from Shi yu. Please review them.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

REL13_v12-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From c661f6637a289ec9b2bae8bf68300c4042e7c79b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v12] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog change. But that won't be a problem since
we use snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 +-
 src/backend/replication/logical/snapbuild.c   | 133 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 227 insertions(+), 9 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c582a5..6ec09ab192 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of the transaction that have
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5a2b828aa3..87cbd08e85 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -582,7 +582,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		if (!ctx->fast_forward)
 			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+		/*
+		 * If the COMMIT record has invalidation messages, it could have catalog
+		 * changes. It is possible that we didn't mark this transaction and
+		 * its subtransactions as containing catalog changes when the decoding
+		 * starts from a commit record without decoding the transaction's other
+		 * changes. Therefore, we make sure to mark such transactions as containing
+		 * catalog changes.
+		 *
+		 * This must be done before SnapBuildCommitTxn() so that we can include
+		 * these transactions in the historic snapshot.
+		 */
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index be46bf0363..d407fb3440 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -252,8 +252,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become too old to matter, and then it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -890,12 +920,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -930,6 +965,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1137,7 +1215,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1288,6 +1366,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when the xl_running_xacts record that we decoded was written. We use
+		 * this later to identify the transactions that have performed catalog
+		 * changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -2030,3 +2122,30 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark its
+ * subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+	/* Skip if there is no initial running xacts information */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index b048dc7484..17d2f93300 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
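
The purge step that the patch above adds to SnapBuildPurgeOlderTxn() can be illustrated outside the backend. The following is only a minimal sketch under stated assumptions: `xid_t`, `xid_precedes()` and `purge_older_xids()` are hypothetical stand-ins for TransactionId, NormalTransactionIdPrecedes() and the patched function, and the sketch compacts the array in place instead of copying through a workspace buffer as the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t xid_t;			/* hypothetical stand-in for TransactionId */

/*
 * Wraparound-aware "a precedes b" test for normal xids, using a
 * modulo-2^32 comparison.  Assumption: both arguments are normal xids.
 */
static int
xid_precedes(xid_t a, xid_t b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Drop every xid that precedes xmin and compact the survivors in place.
 * Relative order is preserved, so a sorted input stays sorted.
 * Returns the number of surviving entries.
 */
static size_t
purge_older_xids(xid_t *xids, size_t n, xid_t xmin)
{
	size_t		kept = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (!xid_precedes(xids[i], xmin))
			xids[kept++] = xids[i];
	}
	return kept;
}
```

Calling `purge_older_xids()` with an xmin equal to one of the entries keeps that entry, since only xids strictly preceding xmin are dropped. Preserving the order matters because the lookup in SnapBuildXidSetCatalogChanges() relies on the array staying sorted.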

Attachment: REL12_v12-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From 0fa949642483f54505f56b043a171dd5f949f974 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v12] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog change. But that won't be a problem since
we use snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 +-
 src/backend/replication/logical/snapbuild.c   | 133 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 227 insertions(+), 9 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c582a5..6ec09ab192 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 60d07ce4eb..19cd0bf76a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -585,7 +585,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		if (!ctx->fast_forward)
 			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+		/*
+		 * If the COMMIT record has invalidation messages, it could have catalog
+		 * changes. It is possible that we didn't mark this transaction and
+		 * its subtransactions as containing catalog changes when the decoding
+		 * starts from a commit record without decoding the transaction's other
+		 * changes. Therefore, we make sure to mark such transactions as containing
+		 * catalog changes.
+		 *
+		 * This must be done before SnapBuildCommitTxn() so that we can include
+		 * these transactions in the historic snapshot.
+		 */
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 5a1bce5acc..cd091bb724 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -257,8 +257,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become too old to matter, and then it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -895,12 +925,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -935,6 +970,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1142,7 +1220,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1293,6 +1371,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when the xl_running_xacts record that we decoded was written. We use
+		 * this later to identify the transactions that have performed catalog
+		 * changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -2035,3 +2127,30 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark its
+ * subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+	/* Skip if there is no initial running xacts information */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 3acf68f5bd..2eb9532a1b 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
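
The membership test in SnapBuildXidSetCatalogChanges() depends on the array having been sorted when it was captured in SnapBuildFindSnapshot(). A minimal standalone sketch of that pairing follows. All names here (`xid_t`, `xid_cmp`, `remember_running_xids`, `was_initially_running`) are hypothetical and only mirror the shape of the patch; `malloc()` stands in for MemoryContextAlloc(), and the comparator is assumed to use plain numeric ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t xid_t;			/* hypothetical stand-in for TransactionId */

/* Plain numeric ordering; assumed here to match what qsort/bsearch need. */
static int
xid_cmp(const void *a, const void *b)
{
	xid_t		x = *(const xid_t *) a;
	xid_t		y = *(const xid_t *) b;

	return (x > y) - (x < y);
}

/*
 * Copy the xids taken from a running-xacts record and sort the copy so
 * that later lookups can use bsearch().  Returns NULL for an empty list
 * or on allocation failure; the caller frees the result.
 */
static xid_t *
remember_running_xids(const xid_t *running, size_t n)
{
	xid_t	   *copy;

	if (n == 0)
		return NULL;
	copy = malloc(n * sizeof(xid_t));
	if (copy == NULL)
		return NULL;
	memcpy(copy, running, n * sizeof(xid_t));
	qsort(copy, n, sizeof(xid_t), xid_cmp);
	return copy;
}

/* True if xid was in the remembered (sorted) list. */
static bool
was_initially_running(const xid_t *sorted, size_t n, xid_t xid)
{
	if (n == 0)
		return false;
	return bsearch(&xid, sorted, n, sizeof(xid_t), xid_cmp) != NULL;
}
```

In the patch, a hit in this lookup is what triggers assigning each subtransaction from the commit record to its parent and marking it as containing catalog changes. The same comparator has to be used for both the sort and the search, otherwise bsearch() can miss entries.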

Attachment: REL15_v12-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From eac68bff268d59271b1884996b8456d7cf65c272 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v12] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog change. But that won't be a problem since
we use snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 ++
 src/backend/replication/logical/snapbuild.c   | 137 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 232 insertions(+), 8 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index aa2427ba73..ea8a2166ab 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -627,6 +627,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. It is possible that we didn't mark this transaction as
+	 * containing catalog changes when the decoding starts from a commit
+	 * record without decoding the transaction's other changes. So, we ensure
+	 * to mark such transactions as containing catalog change.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * these transactions in the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1119a12db9..385817e295 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -250,8 +250,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become old enough to no longer matter, and it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -888,12 +918,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -928,6 +963,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1135,7 +1213,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1286,6 +1364,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded was written. We use
+		 * this later to identify the transactions that have performed catalog
+		 * changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -2000,3 +2092,34 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark the
+ * transaction and its subtransactions as containing catalog changes. See
+ * comments for NInitialRunningXacts and InitialRunningXacts for additional
+ * info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (NInitialRunningXacts == 0 ||
+		ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index d179251aad..53d83f348a 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -91,4 +91,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
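
For readers skimming the back-branch patch above, here is a minimal standalone sketch of its core idea, not the patch itself. It keeps a sorted copy of the xids from the RUNNING_XACTS record, treats a committing xid as catalog-modifying only if it is in that list and its commit record carries the invalidation flag, and drops entries once they fall behind xmin. Everything prefixed with sketch_ or Sketch is a hypothetical name. The sketch assumes plain malloc/free instead of memory contexts, uses plain integer comparison instead of NormalTransactionIdPrecedes (so xid wraparound is ignored), and leaves out subtransactions and the ReorderBuffer entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

/* Illustrative stand-in for XACT_XINFO_HAS_INVALS (bit value assumed). */
#define SKETCH_XINFO_HAS_INVALS (1U << 3)

/* Sorted copy of the xids listed in the decoded RUNNING_XACTS record. */
typedef struct SketchInitialRunning
{
	TransactionId *xids;
	size_t		nxids;
} SketchInitialRunning;

/* Plain numeric ordering, in the spirit of xidComparator. */
static int
sketch_xid_cmp(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	return (x > y) - (x < y);
}

/* Copy and sort the running xids; returns false on allocation failure. */
static bool
sketch_remember(SketchInitialRunning *s, const TransactionId *xids, size_t n)
{
	assert(s != NULL);
	s->xids = NULL;
	s->nxids = 0;
	if (n == 0)
		return true;
	s->xids = malloc(n * sizeof(TransactionId));
	if (s->xids == NULL)
		return false;
	memcpy(s->xids, xids, n * sizeof(TransactionId));
	qsort(s->xids, n, sizeof(TransactionId), sketch_xid_cmp);
	s->nxids = n;
	return true;
}

/* True if a commit of xid should be marked as containing catalog changes. */
static bool
sketch_commit_needs_catalog_mark(const SketchInitialRunning *s,
								 TransactionId xid, uint32_t xinfo)
{
	if (s->nxids == 0 || !(xinfo & SKETCH_XINFO_HAS_INVALS))
		return false;
	return bsearch(&xid, s->xids, s->nxids, sizeof(TransactionId),
				   sketch_xid_cmp) != NULL;
}

/* Drop xids older than xmin; survivors stay in sorted order. */
static void
sketch_purge_older(SketchInitialRunning *s, TransactionId xmin)
{
	size_t		kept = 0;

	for (size_t i = 0; i < s->nxids; i++)
	{
		if (s->xids[i] >= xmin)
			s->xids[kept++] = s->xids[i];
	}
	s->nxids = kept;
	if (kept == 0)
	{
		free(s->xids);
		s->xids = NULL;
	}
}
```

Roughly, sketch_remember corresponds to what the patch does in SnapBuildFindSnapshot after a successful SnapBuildRestore, sketch_commit_needs_catalog_mark to the XACT_XINFO_HAS_INVALS test in DecodeCommit combined with the lookup in SnapBuildXidSetCatalogChanges, and sketch_purge_older to the new tail of SnapBuildPurgeOlderTxn. As the patch comments note, the flag alone can mark a transaction that did not actually change the catalog, which is tolerated because the snapshot is only used to read system catalogs.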

Attachment: master_v12-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From 2b92e601427c7f89673727dd6a152ff46706f796 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 6 Jul 2022 12:53:36 +0900
Subject: [PATCH v12] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this change adds the list of transaction IDs and
sub-transaction IDs, that have modified catalogs and are running during
snapshot serialization, to the serialized snapshot. After restart or
otherwise, when we restore from such a serialized snapshot, the
corresponding list is restored in memory. Now, when decoding a COMMIT
record, we check both the list and the ReorderBuffer to see if the
transaction has modified catalogs.

Since this adds additional information to the serialized snapshot, we
cannot backpatch it. For back branches, we took another approach.
We remember the last-running-xacts list of the decoded RUNNING_XACTS
record after restoring the previously serialized snapshot. Then, we mark
the transaction as containing catalog changes if it's in the list of
initial running transactions and its commit record has
XACT_XINFO_HAS_INVALS. This doesn't require any file format changes but
the transaction will end up being added to the snapshot even if it has
only relcache invalidations. But that won't be a problem since we use
snapshot built during decoding only to read system catalogs.

This commit bumps SNAPBUILD_VERSION because of a change in SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 +++
 .../specs/catalog_change_snapshot.spec        |  39 +++
 src/backend/replication/logical/decode.c      |   3 +-
 .../replication/logical/reorderbuffer.c       |  71 ++++-
 src/backend/replication/logical/snapbuild.c   | 273 ++++++++++++------
 src/include/replication/reorderbuffer.h       |  12 +
 src/include/replication/snapbuild.h           |   2 +-
 8 files changed, 353 insertions(+), 93 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of the transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c5c6a2ba68..1667d720b1 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -628,7 +628,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
-					   parsed->nsubxacts, parsed->subxacts);
+					   parsed->nsubxacts, parsed->subxacts,
+					   parsed->xinfo);
 
 	/* ----
 	 * Check whether we are interested in this specific transaction, and tell
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 88a37fde72..1c21a1d14b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -349,6 +349,8 @@ ReorderBufferAllocate(void)
 	buffer->by_txn_last_xid = InvalidTransactionId;
 	buffer->by_txn_last_txn = NULL;
 
+	buffer->catchange_ntxns = 0;
+
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
 	buffer->size = 0;
@@ -366,6 +368,7 @@ ReorderBufferAllocate(void)
 
 	dlist_init(&buffer->toplevel_by_lsn);
 	dlist_init(&buffer->txns_by_base_snapshot_lsn);
+	dlist_init(&buffer->catchange_txns);
 
 	/*
 	 * Ensure there's no stale data from prior uses of this slot, in case some
@@ -1526,14 +1529,22 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
-	 * Remove TXN from its containing list.
+	 * Remove TXN from its containing lists.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
 	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
-	 * from the LSN-ordered list of toplevel TXNs.
+	 * from the LSN-ordered list of toplevel TXNs. We remove the TXN from the
+	 * list of catalog modifying transactions as well.
 	 */
 	dlist_delete(&txn->node);
+	if (rbtxn_has_catalog_changes(txn))
+	{
+		dlist_delete(&txn->catchange_node);
+		rb->catchange_ntxns--;
+
+		Assert(rb->catchange_ntxns >= 0);
+	}
 
 	/* now remove reference from buffer */
 	hash_search(rb->by_txn,
@@ -3275,10 +3286,16 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 								  XLogRecPtr lsn)
 {
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	if (!rbtxn_has_catalog_changes(txn))
+	{
+		txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &txn->catchange_node);
+		rb->catchange_ntxns++;
+	}
 
 	/*
 	 * Mark top-level transaction as having catalog changes too if one of its
@@ -3286,8 +3303,52 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	 * conveniently check just top-level transaction and decide whether to
 	 * build the hash table or not.
 	 */
-	if (txn->toptxn != NULL)
-		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	toptxn = txn->toptxn;
+	if (toptxn != NULL && !rbtxn_has_catalog_changes(toptxn))
+	{
+		toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &toptxn->catchange_node);
+		rb->catchange_ntxns++;
+	}
+}
+
+/*
+ * Return palloc'ed array of the transactions that have changed catalogs.
+ * The returned array is sorted in xidComparator order.
+ *
+ * The caller must free the returned array when done with it.
+ */
+TransactionId *
+ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	TransactionId *xids = NULL;
+	size_t		xcnt = 0;
+
+	/* Quick return if the list is empty */
+	if (rb->catchange_ntxns == 0)
+	{
+		Assert(dlist_is_empty(&rb->catchange_txns));
+		return NULL;
+	}
+
+	/* Initialize XID array */
+	xids = (TransactionId *) palloc(sizeof(TransactionId) * rb->catchange_ntxns);
+	dlist_foreach(iter, &rb->catchange_txns)
+	{
+		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN,
+												catchange_node,
+												iter.cur);
+
+		Assert(rbtxn_has_catalog_changes(txn));
+
+		xids[xcnt++] = txn->xid;
+	}
+
+	qsort(xids, xcnt, sizeof(TransactionId), xidComparator);
+
+	Assert(xcnt == rb->catchange_ntxns);
+	return xids;
 }
 
 /*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 73c0f15214..1ff2c12240 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -241,6 +241,33 @@ struct SnapBuild
 		 */
 		TransactionId *xip;
 	}			committed;
+
+	/*
+	 * Array of transactions and subtransactions that had modified catalogs
+	 * and were running when the snapshot was serialized.
+	 *
+	 * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+	 * if the transaction has changed the catalog. But it could happen that
+	 * the logical decoding decodes only the commit record of the transaction
+	 * after restoring the previously serialized snapshot in which case we
+	 * will miss adding the xid to the snapshot and end up looking at the
+	 * catalogs with the wrong snapshot.
+	 *
+	 * Now to avoid the above problem, we serialize the transactions that had
+	 * modified the catalogs and are still running at the time of snapshot
+	 * serialization. We fill this array while restoring the snapshot and then
+	 * refer to it while decoding commit to check whether the xact has modified
+	 * the catalog. We discard this array when all the xids in the list become
+	 * old enough to no longer matter. See SnapBuildPurgeOlderTxn for details.
+	 */
+	struct
+	{
+		/* number of transactions */
+		size_t		xcnt;
+
+		/* This array must be sorted in xidComparator order */
+		TransactionId *xip;
+	}			catchange;
 };
 
 /*
@@ -250,8 +277,8 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/* ->committed and ->catchange manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -262,6 +289,9 @@ static void SnapBuildSnapIncRefcount(Snapshot snap);
 
 static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
+static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
+												 uint32 xinfo);
+
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
 static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
@@ -269,6 +299,7 @@ static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutof
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
 static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path);
 
 /*
  * Allocate a new snapshot builder.
@@ -306,6 +337,9 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 		palloc0(builder->committed.xcnt_space * sizeof(TransactionId));
 	builder->committed.includes_all_transactions = true;
 
+	builder->catchange.xcnt = 0;
+	builder->catchange.xip = NULL;
+
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
@@ -888,12 +922,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed or containing catalog
+ * changes that are smaller than ->xmin. Those won't ever get checked via
+ * the ->committed or ->catchange array, respectively. The committed xids will
+ * get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from catchange array once it is
+ * finished (committed/aborted) but that could be costly as we need to maintain
+ * the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -928,6 +967,30 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/*
+	 * Either all the xacts got purged or none. It is only possible to
+	 * partially remove the xids from this array if one or more of the xids
+	 * are still running but not all. That can happen if we start decoding
+	 * from a point (LSN where the snapshot state became consistent) where all
+	 * the xacts in this were running and then at least one of those got
+	 * committed and a few are still running. We will never start from such a
+	 * point because we won't move the slot's restart_lsn past the point where
+	 * the oldest running transaction's restart_decoding_lsn is.
+	 */
+	if (builder->catchange.xcnt == 0 ||
+		TransactionIdFollowsOrEquals(builder->catchange.xip[0],
+									 builder->xmin))
+		return;
+
+	Assert(TransactionIdFollows(builder->xmin,
+								builder->catchange.xip[builder->catchange.xcnt - 1]));
+	pfree(builder->catchange.xip);
+	builder->catchange.xip = NULL;
+	builder->catchange.xcnt = 0;
+
+	elog(DEBUG3, "purged catalog modifying transactions, oldest running xid %u",
+		 builder->xmin);
 }
 
 /*
@@ -935,7 +998,7 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
  */
 void
 SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
-				   int nsubxacts, TransactionId *subxacts)
+				   int nsubxacts, TransactionId *subxacts, uint32 xinfo)
 {
 	int			nxact;
 
@@ -983,7 +1046,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		 * Add subtransaction to base snapshot if catalog modifying, we don't
 		 * distinguish to toplevel transactions there.
 		 */
-		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (SnapBuildXidHasCatalogChanges(builder, subxid, xinfo))
 		{
 			sub_needs_timetravel = true;
 			needs_snapshot = true;
@@ -1012,7 +1075,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 
 	/* if top-level modified catalog, it'll need a snapshot */
-	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes",
 			 xid);
@@ -1089,6 +1152,29 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 }
 
+/*
+ * Check the reorder buffer and the snapshot to see if the given transaction has
+ * modified catalogs.
+ */
+static inline bool
+SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
+							  uint32 xinfo)
+{
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return true;
+
+	/*
+	 * The transactions that have changed catalogs must have invalidation
+	 * info.
+	 */
+	if (!(xinfo & XACT_XINFO_HAS_INVALS))
+		return false;
+
+	/* Check the catchange XID array */
+	return ((builder->catchange.xcnt > 0) &&
+			(bsearch(&xid, builder->catchange.xip, builder->catchange.xcnt,
+					 sizeof(TransactionId), xidComparator) != NULL));
+}
 
 /* -----------------------------------
  * Snapshot building functions dealing with xlog records
@@ -1135,7 +1221,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1438,6 +1524,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
  *
  * struct SnapBuildOnDisk;
  * TransactionId * committed.xcnt; (*not xcnt_space*)
+ * TransactionId * catchange.xcnt;
  *
  */
 typedef struct SnapBuildOnDisk
@@ -1467,7 +1554,7 @@ typedef struct SnapBuildOnDisk
 	offsetof(SnapBuildOnDisk, version)
 
 #define SNAPBUILD_MAGIC 0x51A1E001
-#define SNAPBUILD_VERSION 4
+#define SNAPBUILD_VERSION 5
 
 /*
  * Store/Load a snapshot from disk, depending on the snapshot builder's state.
@@ -1493,6 +1580,9 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 {
 	Size		needed_length;
 	SnapBuildOnDisk *ondisk = NULL;
+	TransactionId *catchange_xip = NULL;
+	MemoryContext old_ctx;
+	size_t		catchange_xcnt;
 	char	   *ondisk_c;
 	int			fd;
 	char		tmppath[MAXPGPATH];
@@ -1578,10 +1668,16 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 				(errcode_for_file_access(),
 				 errmsg("could not remove file \"%s\": %m", tmppath)));
 
+	old_ctx = MemoryContextSwitchTo(builder->context);
+
+	/* Get the catalog modifying transactions that are not yet committed */
+	catchange_xip = ReorderBufferGetCatalogChangesXacts(builder->reorder);
+	catchange_xcnt = builder->reorder->catchange_ntxns;
+
 	needed_length = sizeof(SnapBuildOnDisk) +
-		sizeof(TransactionId) * builder->committed.xcnt;
+		sizeof(TransactionId) * (builder->committed.xcnt + catchange_xcnt);
 
-	ondisk_c = MemoryContextAllocZero(builder->context, needed_length);
+	ondisk_c = palloc0(needed_length);
 	ondisk = (SnapBuildOnDisk *) ondisk_c;
 	ondisk->magic = SNAPBUILD_MAGIC;
 	ondisk->version = SNAPBUILD_VERSION;
@@ -1598,16 +1694,31 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	ondisk->builder.snapshot = NULL;
 	ondisk->builder.reorder = NULL;
 	ondisk->builder.committed.xip = NULL;
+	ondisk->builder.catchange.xip = NULL;
+	/* update catchange only on disk data */
+	ondisk->builder.catchange.xcnt = catchange_xcnt;
 
 	COMP_CRC32C(ondisk->checksum,
 				&ondisk->builder,
 				sizeof(SnapBuild));
 
 	/* copy committed xacts */
-	sz = sizeof(TransactionId) * builder->committed.xcnt;
-	memcpy(ondisk_c, builder->committed.xip, sz);
-	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
-	ondisk_c += sz;
+	if (builder->committed.xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * builder->committed.xcnt;
+		memcpy(ondisk_c, builder->committed.xip, sz);
+		COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+		ondisk_c += sz;
+	}
+
+	/* copy catalog modifying xacts */
+	if (catchange_xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * catchange_xcnt;
+		memcpy(ondisk_c, catchange_xip, sz);
+		COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+		ondisk_c += sz;
+	}
 
 	FIN_CRC32C(ondisk->checksum);
 
@@ -1688,12 +1799,16 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	 */
 	builder->last_serialized_snapshot = lsn;
 
+	MemoryContextSwitchTo(old_ctx);
+
 out:
 	ReorderBufferSetRestartPoint(builder->reorder,
 								 builder->last_serialized_snapshot);
 	/* be tidy */
 	if (ondisk)
 		pfree(ondisk);
+	if (catchange_xip)
+		pfree(catchange_xip);
 }
 
 /*
@@ -1707,7 +1822,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	int			fd;
 	char		path[MAXPGPATH];
 	Size		sz;
-	int			readBytes;
 	pg_crc32c	checksum;
 
 	/* no point in loading a snapshot if we're already there */
@@ -1739,29 +1853,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
 
 	/* read statically sized portion of snapshot */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk, SnapBuildOnDiskConstantSize);
-	pgstat_report_wait_end();
-	if (readBytes != SnapBuildOnDiskConstantSize)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes,
-							(Size) SnapBuildOnDiskConstantSize)));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk, SnapBuildOnDiskConstantSize, path);
 
 	if (ondisk.magic != SNAPBUILD_MAGIC)
 		ereport(ERROR,
@@ -1781,56 +1873,26 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 				SnapBuildOnDiskConstantSize - SnapBuildOnDiskNotChecksummedSize);
 
 	/* read SnapBuild */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk.builder, sizeof(SnapBuild));
-	pgstat_report_wait_end();
-	if (readBytes != sizeof(SnapBuild))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sizeof(SnapBuild))));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk.builder, sizeof(SnapBuild), path);
 	COMP_CRC32C(checksum, &ondisk.builder, sizeof(SnapBuild));
 
 	/* restore committed xacts information */
-	sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
-	ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, ondisk.builder.committed.xip, sz);
-	pgstat_report_wait_end();
-	if (readBytes != sz)
+	if (ondisk.builder.committed.xcnt > 0)
 	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
+		sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
+		ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
+		SnapBuildRestoreContents(fd, (char *) ondisk.builder.committed.xip, sz, path);
+		COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
+	}
 
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sz)));
+	/* restore catalog modifying xacts information */
+	if (ondisk.builder.catchange.xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * ondisk.builder.catchange.xcnt;
+		ondisk.builder.catchange.xip = MemoryContextAllocZero(builder->context, sz);
+		SnapBuildRestoreContents(fd, (char *) ondisk.builder.catchange.xip, sz, path);
+		COMP_CRC32C(checksum, ondisk.builder.catchange.xip, sz);
 	}
-	COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
 
 	if (CloseTransientFile(fd) != 0)
 		ereport(ERROR,
@@ -1885,6 +1947,13 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	ondisk.builder.committed.xip = NULL;
 
+	/* set catalog modifying transactions */
+	if (builder->catchange.xip)
+		pfree(builder->catchange.xip);
+	builder->catchange.xcnt = ondisk.builder.catchange.xcnt;
+	builder->catchange.xip = ondisk.builder.catchange.xip;
+	ondisk.builder.catchange.xip = NULL;
+
 	/* our snapshot is not interesting anymore, build a new one */
 	if (builder->snapshot != NULL)
 	{
@@ -1906,9 +1975,43 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 snapshot_not_interesting:
 	if (ondisk.builder.committed.xip != NULL)
 		pfree(ondisk.builder.committed.xip);
+	if (ondisk.builder.catchange.xip != NULL)
+		pfree(ondisk.builder.catchange.xip);
 	return false;
 }
 
+/*
+ * Read the contents of the serialized snapshot to 'dest'.
+ */
+static void
+SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
+{
+	int			readBytes;
+
+	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+	readBytes = read(fd, dest, size);
+	pgstat_report_wait_end();
+	if (readBytes != size)
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+
+		if (readBytes < 0)
+		{
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("could not read file \"%s\": read %d of %zu",
+							path, readBytes, size)));
+	}
+}
+
 /*
  * Remove all serialized snapshots that are not required anymore because no
  * slot can need them. This doesn't actually have to run during a checkpoint,
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d109d0baed..fd84f175c0 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -380,6 +380,11 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	node;
 
+	/*
+	 * A node in the list of catalog modifying transactions
+	 */
+	dlist_node	catchange_node;
+
 	/*
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
@@ -526,6 +531,12 @@ struct ReorderBuffer
 	 */
 	dlist_head	txns_by_base_snapshot_lsn;
 
+	/*
+	 * Transactions and subtransactions that have modified system catalogs.
+	 */
+	dlist_head	catchange_txns;
+	int			catchange_ntxns;
+
 	/*
 	 * one-entry sized cache for by_txn. Very frequently the same txn gets
 	 * looked up over and over again.
@@ -677,6 +688,7 @@ extern void ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
 extern void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 extern ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 extern TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
+extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index d179251aad..e6adea24f2 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -82,7 +82,7 @@ extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
-							   TransactionId *subxacts);
+							   TransactionId *subxacts, uint32 xinfo);
 extern bool SnapBuildProcessChange(SnapBuild *builder, TransactionId xid,
 								   XLogRecPtr lsn);
 extern void SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
-- 
2.24.3 (Apple Git-128)
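
For readers following the master-branch patch above: the serialized snapshot now carries a second variable-length array (the catalog modifying xids) after the committed xids, and both are covered by the checksum. Below is a minimal, self-contained sketch of that layout idea. It is not the PostgreSQL code: DemoHeader, demo_checksum, demo_serialize and demo_restore are hypothetical names, the header is far smaller than SnapBuildOnDisk, and a toy rolling checksum stands in for CRC32C.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;			/* stand-in for TransactionId */

/* Hypothetical fixed-size header; the real SnapBuildOnDisk differs. */
typedef struct DemoHeader
{
	uint32_t	magic;
	uint32_t	checksum;		/* toy checksum, NOT CRC32C */
	uint32_t	committed_xcnt;
	uint32_t	catchange_xcnt;
} DemoHeader;

#define DEMO_MAGIC 0x51A1E001u

/* Toy rolling checksum, used only to keep the sketch dependency-free. */
static uint32_t
demo_checksum(uint32_t sum, const void *data, size_t len)
{
	const unsigned char *p = data;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31u + p[i];
	return sum;
}

/* Write header, then committed xids, then catalog modifying xids. */
static unsigned char *
demo_serialize(const Xid *committed, uint32_t ncommitted,
			   const Xid *catchange, uint32_t ncatchange, size_t *out_len)
{
	size_t		len = sizeof(DemoHeader) +
		sizeof(Xid) * ((size_t) ncommitted + ncatchange);
	unsigned char *buf = calloc(1, len);
	unsigned char *cur;
	DemoHeader	hdr = {DEMO_MAGIC, 0, ncommitted, ncatchange};
	uint32_t	sum;

	if (buf == NULL)
		return NULL;
	assert(out_len != NULL);
	cur = buf + sizeof(DemoHeader);
	sum = demo_checksum(0, &hdr.committed_xcnt, sizeof(uint32_t) * 2);

	/* Skip empty arrays, as the patch does, to avoid memcpy from NULL. */
	if (ncommitted > 0)
	{
		memcpy(cur, committed, sizeof(Xid) * ncommitted);
		sum = demo_checksum(sum, cur, sizeof(Xid) * ncommitted);
		cur += sizeof(Xid) * ncommitted;
	}
	if (ncatchange > 0)
	{
		memcpy(cur, catchange, sizeof(Xid) * ncatchange);
		sum = demo_checksum(sum, cur, sizeof(Xid) * ncatchange);
	}
	hdr.checksum = sum;
	memcpy(buf, &hdr, sizeof(DemoHeader));
	*out_len = len;
	return buf;
}

/* Read both arrays back; returns 0 on success, -1 on any mismatch. */
static int
demo_restore(const unsigned char *buf, size_t len,
			 Xid **committed, uint32_t *ncommitted,
			 Xid **catchange, uint32_t *ncatchange)
{
	DemoHeader	hdr;
	const unsigned char *cur;
	size_t		szc, szk;

	*committed = *catchange = NULL;
	*ncommitted = *ncatchange = 0;
	if (len < sizeof(DemoHeader))
		return -1;
	memcpy(&hdr, buf, sizeof(DemoHeader));
	if (hdr.magic != DEMO_MAGIC ||
		hdr.committed_xcnt > len / sizeof(Xid) ||
		hdr.catchange_xcnt > len / sizeof(Xid))
		return -1;
	szc = sizeof(Xid) * hdr.committed_xcnt;
	szk = sizeof(Xid) * hdr.catchange_xcnt;
	if (len != sizeof(DemoHeader) + szc + szk)
		return -1;
	cur = buf + sizeof(DemoHeader);
	if (demo_checksum(demo_checksum(0, &hdr.committed_xcnt,
									sizeof(uint32_t) * 2),
					  cur, szc + szk) != hdr.checksum)
		return -1;
	if (szc > 0 && (*committed = malloc(szc)) == NULL)
		return -1;
	if (szk > 0 && (*catchange = malloc(szk)) == NULL)
	{
		free(*committed);
		*committed = NULL;
		return -1;
	}
	if (szc > 0)
		memcpy(*committed, cur, szc);
	if (szk > 0)
		memcpy(*catchange, cur + szc, szk);
	*ncommitted = hdr.committed_xcnt;
	*ncatchange = hdr.catchange_xcnt;
	return 0;
}
```

Because the header layout changes, a reader has to reject files written in the old format, which is presumably why the patch bumps SNAPBUILD_VERSION from 4 to 5.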

Attachment: REL14_v12-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From cb05f04be5bc67b4cb4f1750555a03cfc3a6838c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v12] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach can produce false positives: we could end up adding a
transaction that didn't change the catalog to the snapshot, since we cannot
tell whether a transaction has catalog changes only by checking the COMMIT
record. The record doesn't say which (sub)transaction has catalog changes,
and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
transaction changed the catalog. But that won't be a problem since we use
the snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 ++
 src/backend/replication/logical/snapbuild.c   | 137 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 232 insertions(+), 8 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b879..4553252d75 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot
+	twophase_snapshot catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of the transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 92dfafc632..5a440e6eb7 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -691,6 +691,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. It is possible that we didn't mark this transaction as
+	 * containing catalog changes when the decoding starts from a commit
+	 * record without decoding the transaction's other changes. So, we make
+	 * sure to mark such transactions as containing catalog changes.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * these transactions in the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 6df602485b..ac09f0e4f9 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -250,8 +250,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become too old to matter, so the array eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -879,12 +909,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * Ideally we would remove a transaction from the InitialRunningXacts array
+ * as soon as it finishes (commits/aborts), but that could be costly as we
+ * need to maintain the xid order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -919,6 +954,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* Bounds check: exit unless there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1126,7 +1204,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1277,6 +1355,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when the xl_running_xacts record that we decoded was written. We
+		 * use this later to identify the transactions that have performed
+		 * catalog changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -1993,3 +2085,34 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark the
+ * transaction and its subtransactions as containing catalog changes. See
+ * comments for NInitialRunningXacts and InitialRunningXacts for additional
+ * info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (NInitialRunningXacts == 0 ||
+		ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 3604621e88..a19b59e100 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -90,4 +90,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
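
To make the back-branch approach above easier to follow: the idea is to keep a sorted copy of the xids from the RUNNING_XACTS record, look up each committing xid in it when the COMMIT record carries XACT_XINFO_HAS_INVALS, and drop entries once they fall below xmin. Here is a rough standalone sketch of that bookkeeping. All names (InitialXacts, initial_xacts_remember, commit_implies_catalog_change, initial_xacts_purge) are made up for illustration. It compacts the array in place rather than through a workspace copy, and it uses plain numeric comparison for the purge, so unlike NormalTransactionIdPrecedes it ignores xid wraparound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;			/* stand-in for TransactionId */

/* Hypothetical container for the remembered running xids. */
typedef struct InitialXacts
{
	Xid		   *xids;			/* sorted ascending, NULL when empty */
	size_t		n;
} InitialXacts;

/* Plain numeric ordering, the same ordering xidComparator uses. */
static int
xid_cmp(const void *a, const void *b)
{
	Xid			x = *(const Xid *) a;
	Xid			y = *(const Xid *) b;

	return (x > y) - (x < y);
}

/* Copy and sort the xids listed in a running-xacts record. */
static bool
initial_xacts_remember(InitialXacts *s, const Xid *running, size_t n)
{
	assert(s != NULL);
	free(s->xids);
	s->xids = NULL;
	s->n = 0;
	if (n == 0)
		return true;
	s->xids = malloc(n * sizeof(Xid));
	if (s->xids == NULL)
		return false;
	memcpy(s->xids, running, n * sizeof(Xid));
	qsort(s->xids, n, sizeof(Xid), xid_cmp);
	s->n = n;
	return true;
}

/*
 * Treat a committing xid as catalog modifying only if its commit record
 * carried invalidations AND it was running at the remembered record.
 * This can yield false positives, as the commit message explains.
 */
static bool
commit_implies_catalog_change(const InitialXacts *s, Xid xid, bool has_invals)
{
	if (!has_invals || s->n == 0)
		return false;
	return bsearch(&xid, s->xids, s->n, sizeof(Xid), xid_cmp) != NULL;
}

/* Drop xids older than xmin; order is preserved, so the array stays sorted. */
static void
initial_xacts_purge(InitialXacts *s, Xid xmin)
{
	size_t		keep = 0;

	for (size_t i = 0; i < s->n; i++)
	{
		if (s->xids[i] >= xmin)	/* simplification: no wraparound handling */
			s->xids[keep++] = s->xids[i];
	}
	s->n = keep;
	if (keep == 0)
	{
		free(s->xids);
		s->xids = NULL;
	}
}
```

Keeping the array sorted is what makes the per-commit lookup a cheap bsearch; the cost is that entries are only removed in bulk during the purge, not when an individual transaction finishes.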

Attachment: REL11_v12-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From 86c25a8b7cd6c37fee9624953512176c35e61296 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v12] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach can produce false positives: we could end up adding a
transaction that didn't change the catalog to the snapshot, since we cannot
tell whether a transaction has catalog changes only by checking the COMMIT
record. The record doesn't say which (sub)transaction has catalog changes,
and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
transaction changed the catalog. But that won't be a problem since we use
the snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 +-
 src/backend/replication/logical/snapbuild.c   | 134 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 228 insertions(+), 9 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 65a91a8014..973b94738a 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -51,7 +51,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_check) \
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of the transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c085f7b0f3..dc83743c38 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -586,7 +586,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		if (!ctx->fast_forward)
 			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+		/*
+		 * If the COMMIT record has invalidation messages, it could have catalog
+		 * changes. It is possible that we didn't mark this transaction and
+		 * its subtransactions as containing catalog changes when the decoding
+		 * starts from a commit record without decoding the transaction's other
+		 * changes. Therefore, we ensure to mark such transactions as containing
+		 * catalog change.
+		 *
+		 * This must be done before SnapBuildCommitTxn() so that we can include
+		 * these transactions in the historic snapshot.
+		 */
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1c52bc64e3..4f6a440546 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -258,8 +258,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become old enough to matter, and then it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -896,12 +926,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -936,6 +971,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1143,7 +1221,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1294,6 +1372,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded was written. We use
+		 * this later to identify the transactions that have performed catalog
+		 * changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -1996,3 +2088,31 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark its
+ * subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	int		i;
+	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+	/* Skip if there is no initial running xacts information */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		for (i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 1df66a3c75..4df3c3f2f7 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 							 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)

REL10_v12-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From 9722d357a704fdd3aff6449e2ee42444ff87f6fa Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v12] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog change. But that won't be a problem since
we use snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  41 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  16 ++-
 src/backend/replication/logical/snapbuild.c   | 135 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 227 insertions(+), 9 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 2db2b2774b..73bc0fe1fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -51,7 +51,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_check) \
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..15f9540b3f
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,41 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6f8920f52c..3233104fc9 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -561,7 +561,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	{
 		ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 									  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+
+		/*
+		 * If the COMMIT record has invalidation messages, it could have catalog
+		 * changes. It is possible that we didn't mark this transaction and
+		 * its subtransactions as containing catalog changes when the decoding
+		 * starts from a commit record without decoding the transaction's other
+		 * changes. Therefore, we ensure to mark such transactions as containing
+		 * catalog change.
+		 *
+		 * This must be done before SnapBuildCommitTxn() so that we can include
+		 * these transactions in the historic snapshot.
+		 */
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1010a2e869..7c637a4019 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -258,8 +258,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become old enough to matter, and then it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -896,12 +926,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -936,6 +971,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1143,7 +1221,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1294,6 +1372,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded was written. We use
+		 * this later to identify the transactions that have performed catalog
+		 * changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -1997,3 +2089,32 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark its
+ * subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+	/* Skip if there is no initial running xacts information */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		int		i;
+
+		for (i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index b95f56eec3..7a796ce136 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 							 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)

#124Masahiko Sawada
sawada.mshk@gmail.com
In reply to: shiy.fnst@fujitsu.com (#119)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Tue, Aug 2, 2022 at 5:31 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

On Mon, Aug 1, 2022 10:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 1, 2022 at 7:46 AM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

On Fri, Jul 29, 2022 at 3:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I've attached updated patches for all branches. Please review them.

Thanks, the patches look mostly good to me. I have made minor edits:
removed 'likely' from a few places, as it didn't seem to add much
value, and changed comments in a few places. I was also getting a
compilation error in v11/10 (snapbuild.c:2111:3: error: ‘for’ loop
initial declarations are only allowed in C99 mode), which I have fixed.
See attached. Unless there are major comments/suggestions, I am
planning to push this the day after tomorrow (by Wednesday) after
another pass.

Thanks for updating the patch.

Here are some minor comments:

1.
patches for REL10 ~ REL13:
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark the
+ * its subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.

"mark the its subtransactions"
->
"mark its subtransactions"

2.
patches for REL10 ~ REL15:
In the comment in catalog_change_snapshot.spec, maybe we can use "RUNNING_XACTS"
instead of "RUNNING_XACT" and "XACT_RUNNING", same as the patch for master branch.

Thank you for the comments! These have been incorporated in the latest
version v12 patch I just submitted.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#125Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amit Kapila (#122)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

At Wed, 3 Aug 2022 08:51:40 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

On Wed, Aug 3, 2022 at 7:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Looking at other code in snapbuild.c, we call CloseTransientFile()
before erroring out in SnapBuildSerialize(). I think it's better to
keep this patch consistent with the nearby code. If we prefer the
style of letting ereport(ERROR) close the file, that should be done
for all of them in a separate patch.

+1. I also feel it is better to change it in a separate patch as this
is not a pattern introduced by this patch.

Agreed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
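
For readers following along, the "close before erroring out" style discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions: plain POSIX write()/close() and fprintf() stand in for PostgreSQL's transient-file and ereport() machinery, and write_or_close() is a hypothetical helper, not a function from snapbuild.c.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): write len bytes to fd.
 * On failure, close the descriptor first and only then report the
 * error, preserving the errno of the failed write().
 * Returns 0 on success, -1 on failure.
 */
static int
write_or_close(int fd, const void *buf, size_t len)
{
	ssize_t		written;

	errno = 0;
	written = write(fd, buf, len);
	if (written < 0 || (size_t) written != len)
	{
		/* a short write leaves errno unset; assume out of disk space */
		int			save_errno = errno ? errno : ENOSPC;

		close(fd);				/* release the descriptor first ... */
		fprintf(stderr, "could not write to file: %s\n",
				strerror(save_errno));	/* ... then report the error */
		errno = save_errno;
		return -1;
	}

	return 0;
}
```

The alternative style mentioned in the thread would skip the explicit close and rely on error cleanup to release the file; the sketch only shows the explicit variant.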

#126shiy.fnst@fujitsu.com
shiy.fnst@fujitsu.com
In reply to: Masahiko Sawada (#123)
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Aug 3, 2022 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached updated patches that incorporated the above comments as
well as the comments from Shi yu. Please review them.

Thanks for updating the patch.

I noticed that in SnapBuildXidSetCatalogChanges(), "i" is declared inside the if
branch in the REL10 patch but at the top of the function in the REL11 patch.
Maybe we can modify the REL11 patch to be consistent with the REL10 patch.

The rest of the patch looks good to me.

Regards,
Shi yu
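
As a rough, standalone illustration of the lookup being discussed (and of the block-scoped declaration of "i" that also compiles on pre-C99 branches), here is a minimal sketch. It assumes a simplified TransactionId typedef and a hypothetical mark_subxacts() helper that copies subtransaction xids into an output array; the actual patch calls ReorderBufferAssignChild() and ReorderBufferXidSetCatalogChanges() instead.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for PostgreSQL's TransactionId (assumption). */
typedef unsigned int TransactionId;

/* Plain unsigned comparison, the same ordering xidComparator uses. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	if (x > y)
		return 1;
	if (x < y)
		return -1;
	return 0;
}

/*
 * Hypothetical helper: if xid is found in the sorted initial_running
 * array, copy its subtransactions into marked[] and return how many
 * were copied; otherwise return 0.
 */
static int
mark_subxacts(const TransactionId *initial_running, size_t n_initial,
			  TransactionId xid,
			  const TransactionId *subxacts, int subxcnt,
			  TransactionId *marked)
{
	assert(subxcnt >= 0);

	/* Skip if there is no initial running xacts information */
	if (n_initial == 0)
		return 0;

	if (bsearch(&xid, initial_running, n_initial,
				sizeof(TransactionId), xid_cmp) != NULL)
	{
		int			i;		/* declared at block start, not in for() */

		for (i = 0; i < subxcnt; i++)
			marked[i] = subxacts[i];
		return subxcnt;
	}

	return 0;
}
```

With a sorted array {100, 105, 110}, looking up xid 105 marks all of its subtransactions, while xid 104 marks none. bsearch() requires the array to be sorted with the same comparator, which is why the patch runs qsort() on the array when it is first populated.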

#127Masahiko Sawada
sawada.mshk@gmail.com
In reply to: shiy.fnst@fujitsu.com (#126)
7 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Aug 3, 2022 at 3:52 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

On Wed, Aug 3, 2022 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached updated patches that incorporated the above comments as
well as the comments from Shi yu. Please review them.

Thanks for updating the patch.

I noticed that in SnapBuildXidSetCatalogChanges(), "i" is declared inside the if
branch in the REL10 patch but at the top of the function in the REL11 patch.
Maybe we can modify the REL11 patch to be consistent with the REL10 patch.

The rest of the patch looks good to me.

Oops, thanks for pointing it out. I've fixed it and attached updated
patches for all branches, so that the patch version number stays the
same across branches. The REL12 through master patches are unchanged
from v12.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

REL15_v13-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From eac68bff268d59271b1884996b8456d7cf65c272 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v13] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach can produce false positives: we could end up adding a
transaction that didn't change the catalog to the snapshot, since we cannot
tell whether a transaction has catalog changes just by checking the COMMIT
record. The record doesn't say which (sub)transaction has catalog changes,
and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the transaction
has catalog changes. But that won't be a problem since we use the snapshot
built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 ++
 src/backend/replication/logical/snapbuild.c   | 137 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 232 insertions(+), 8 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index aa2427ba73..ea8a2166ab 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -627,6 +627,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. It is possible that we didn't mark this transaction as
+	 * containing catalog changes when the decoding starts from a commit
+	 * record without decoding the transaction's other changes. So, we ensure
+	 * to mark such transactions as containing catalog change.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * these transactions in the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1119a12db9..385817e295 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -250,8 +250,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become too old to matter, and then it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -888,12 +918,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -928,6 +963,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* Check the smallest xid to see if there is anything to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * Purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1135,7 +1213,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1286,6 +1364,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when the xl_running_xacts record that we decoded was written. We
+		 * use this later to identify the transactions that have performed
+		 * catalog changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -2000,3 +2092,34 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark the
+ * transaction and its subtransactions as containing catalog changes. See
+ * comments for NInitialRunningXacts and InitialRunningXacts for additional
+ * info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (NInitialRunningXacts == 0 ||
+		ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index d179251aad..53d83f348a 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -91,4 +91,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
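
The back-branch approach in the patch above can be illustrated outside PostgreSQL with a small standalone sketch. It is not the patch's code: the names Xid, InitialRunning, COMMIT_HAS_INVALS, remember_running, commit_implies_catalog_change and purge_older are hypothetical stand-ins, memory comes from malloc rather than a memory context, and the purge compacts the array in place instead of using a workspace buffer. It only shows the three steps the commit message describes: remember the running xids in sorted order, treat a commit as catalog-modifying when its xid is in that list and the commit carries the invalidation flag, and drop xids that precede xmin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;            /* hypothetical stand-in for TransactionId */

/* Hypothetical stand-in for the XACT_XINFO_HAS_INVALS bit of a commit record. */
#define COMMIT_HAS_INVALS 0x1u

/* Sorted list of xids that were running when decoding restarted. */
typedef struct InitialRunning
{
    Xid    *xids;                /* ascending numeric order, or NULL if empty */
    size_t  n;
} InitialRunning;

/* Plain numeric comparison, usable with both qsort() and bsearch(). */
static int
xid_cmp(const void *a, const void *b)
{
    Xid xa = *(const Xid *) a;
    Xid xb = *(const Xid *) b;

    return (xa > xb) - (xa < xb);
}

/* Wraparound-aware "a is older than b" test (modulo-2^32 arithmetic). */
static bool
xid_precedes(Xid a, Xid b)
{
    return (int32_t) (a - b) < 0;
}

/* Copy and sort the running xids. Returns false if allocation fails. */
static bool
remember_running(InitialRunning *ir, const Xid *running, size_t n)
{
    ir->xids = NULL;
    ir->n = 0;
    if (n == 0)
        return true;

    ir->xids = malloc(n * sizeof(Xid));
    if (ir->xids == NULL)
        return false;

    memcpy(ir->xids, running, n * sizeof(Xid));
    qsort(ir->xids, n, sizeof(Xid), xid_cmp);
    ir->n = n;
    return true;
}

/*
 * A commit is treated as catalog-modifying only if it carries the
 * invalidation flag AND its xid was in the remembered list. This can
 * yield false positives, as the commit message above notes.
 */
static bool
commit_implies_catalog_change(const InitialRunning *ir, Xid xid, uint32_t xinfo)
{
    if (ir->n == 0 || (xinfo & COMMIT_HAS_INVALS) == 0)
        return false;

    return bsearch(&xid, ir->xids, ir->n, sizeof(Xid), xid_cmp) != NULL;
}

/* Drop every xid that precedes xmin; the survivors keep their order. */
static void
purge_older(InitialRunning *ir, Xid xmin)
{
    size_t kept = 0;

    for (size_t i = 0; i < ir->n; i++)
    {
        if (!xid_precedes(ir->xids[i], xmin))
            ir->xids[kept++] = ir->xids[i];
    }

    if (kept == 0)
    {
        free(ir->xids);
        ir->xids = NULL;
    }
    ir->n = kept;
}
```

A caller would invoke remember_running() once after restoring a serialized snapshot, commit_implies_catalog_change() for each decoded commit, and purge_older() whenever a new xmin becomes known. Keeping the array sorted is what allows the bsearch() lookup. The sketch uses a numeric sort together with a wraparound-aware purge test, as the patch does.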

master_v13-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From 2b92e601427c7f89673727dd6a152ff46706f796 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 6 Jul 2022 12:53:36 +0900
Subject: [PATCH v13] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this change adds the list of transaction IDs and
sub-transaction IDs, that have modified catalogs and are running during
snapshot serialization, to the serialized snapshot. After restart or
otherwise, when we restore from such a serialized snapshot, the
corresponding list is restored in memory. Now, when decoding a COMMIT
record, we check both the list and the ReorderBuffer to see if the
transaction has modified catalogs.

Since this adds additional information to the serialized snapshot, we
cannot backpatch it. For back branches, we took another approach.
We remember the last-running-xacts list of the decoded RUNNING_XACTS
record after restoring the previously serialized snapshot. Then, we mark
the transaction as containing catalog changes if it's in the list of
initial running transactions and its commit record has
XACT_XINFO_HAS_INVALS. This doesn't require any file format changes but
the transaction will end up being added to the snapshot even if it has
only relcache invalidations. But that won't be a problem since we use
snapshot built during decoding only to read system catalogs.

This commit bumps SNAPBUILD_VERSION because of a change in SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 +++
 .../specs/catalog_change_snapshot.spec        |  39 +++
 src/backend/replication/logical/decode.c      |   3 +-
 .../replication/logical/reorderbuffer.c       |  71 ++++-
 src/backend/replication/logical/snapbuild.c   | 273 ++++++++++++------
 src/include/replication/reorderbuffer.h       |  12 +
 src/include/replication/snapbuild.h           |   2 +-
 8 files changed, 353 insertions(+), 93 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_error
+	twophase_snapshot slot_creation_error catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c5c6a2ba68..1667d720b1 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -628,7 +628,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
-					   parsed->nsubxacts, parsed->subxacts);
+					   parsed->nsubxacts, parsed->subxacts,
+					   parsed->xinfo);
 
 	/* ----
 	 * Check whether we are interested in this specific transaction, and tell
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 88a37fde72..1c21a1d14b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -349,6 +349,8 @@ ReorderBufferAllocate(void)
 	buffer->by_txn_last_xid = InvalidTransactionId;
 	buffer->by_txn_last_txn = NULL;
 
+	buffer->catchange_ntxns = 0;
+
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
 	buffer->size = 0;
@@ -366,6 +368,7 @@ ReorderBufferAllocate(void)
 
 	dlist_init(&buffer->toplevel_by_lsn);
 	dlist_init(&buffer->txns_by_base_snapshot_lsn);
+	dlist_init(&buffer->catchange_txns);
 
 	/*
 	 * Ensure there's no stale data from prior uses of this slot, in case some
@@ -1526,14 +1529,22 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
-	 * Remove TXN from its containing list.
+	 * Remove TXN from its containing lists.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
 	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
-	 * from the LSN-ordered list of toplevel TXNs.
+	 * from the LSN-ordered list of toplevel TXNs. We remove the TXN from the
+	 * list of catalog modifying transactions as well.
 	 */
 	dlist_delete(&txn->node);
+	if (rbtxn_has_catalog_changes(txn))
+	{
+		dlist_delete(&txn->catchange_node);
+		rb->catchange_ntxns--;
+
+		Assert(rb->catchange_ntxns >= 0);
+	}
 
 	/* now remove reference from buffer */
 	hash_search(rb->by_txn,
@@ -3275,10 +3286,16 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 								  XLogRecPtr lsn)
 {
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	if (!rbtxn_has_catalog_changes(txn))
+	{
+		txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &txn->catchange_node);
+		rb->catchange_ntxns++;
+	}
 
 	/*
 	 * Mark top-level transaction as having catalog changes too if one of its
@@ -3286,8 +3303,52 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	 * conveniently check just top-level transaction and decide whether to
 	 * build the hash table or not.
 	 */
-	if (txn->toptxn != NULL)
-		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+	toptxn = txn->toptxn;
+	if (toptxn != NULL && !rbtxn_has_catalog_changes(toptxn))
+	{
+		toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		dlist_push_tail(&rb->catchange_txns, &toptxn->catchange_node);
+		rb->catchange_ntxns++;
+	}
+}
+
+/*
+ * Return palloc'ed array of the transactions that have changed catalogs.
+ * The returned array is sorted in xidComparator order.
+ *
+ * The caller must free the returned array when done with it.
+ */
+TransactionId *
+ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	TransactionId *xids = NULL;
+	size_t		xcnt = 0;
+
+	/* Quick return if the list is empty */
+	if (rb->catchange_ntxns == 0)
+	{
+		Assert(dlist_is_empty(&rb->catchange_txns));
+		return NULL;
+	}
+
+	/* Initialize XID array */
+	xids = (TransactionId *) palloc(sizeof(TransactionId) * rb->catchange_ntxns);
+	dlist_foreach(iter, &rb->catchange_txns)
+	{
+		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN,
+												catchange_node,
+												iter.cur);
+
+		Assert(rbtxn_has_catalog_changes(txn));
+
+		xids[xcnt++] = txn->xid;
+	}
+
+	qsort(xids, xcnt, sizeof(TransactionId), xidComparator);
+
+	Assert(xcnt == rb->catchange_ntxns);
+	return xids;
 }
 
 /*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 73c0f15214..1ff2c12240 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -241,6 +241,33 @@ struct SnapBuild
 		 */
 		TransactionId *xip;
 	}			committed;
+
+	/*
+	 * Array of transactions and subtransactions that had modified catalogs
+	 * and were running when the snapshot was serialized.
+	 *
+	 * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+	 * if the transaction has changed the catalog. But it could happen that
+	 * the logical decoding decodes only the commit record of the transaction
+	 * after restoring the previously serialized snapshot in which case we
+	 * will miss adding the xid to the snapshot and end up looking at the
+	 * catalogs with the wrong snapshot.
+	 *
+	 * Now to avoid the above problem, we serialize the transactions that had
+	 * modified the catalogs and are still running at the time of snapshot
+	 * serialization. We fill this array while restoring the snapshot and then
+	 * refer to it while decoding a commit to check whether the xact has
+	 * modified the catalog. We discard this array when all the xids in the
+	 * list become too old to matter. See SnapBuildPurgeOlderTxn for details.
+	 */
+	struct
+	{
+		/* number of transactions */
+		size_t		xcnt;
+
+		/* This array must be sorted in xidComparator order */
+		TransactionId *xip;
+	}			catchange;
 };
 
 /*
@@ -250,8 +277,8 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/* ->committed and ->catchange manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -262,6 +289,9 @@ static void SnapBuildSnapIncRefcount(Snapshot snap);
 
 static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
+static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
+												 uint32 xinfo);
+
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
 static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
@@ -269,6 +299,7 @@ static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutof
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
 static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path);
 
 /*
  * Allocate a new snapshot builder.
@@ -306,6 +337,9 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 		palloc0(builder->committed.xcnt_space * sizeof(TransactionId));
 	builder->committed.includes_all_transactions = true;
 
+	builder->catchange.xcnt = 0;
+	builder->catchange.xip = NULL;
+
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
@@ -888,12 +922,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed or containing catalog
+ * changes that are smaller than ->xmin. Those won't ever get checked via
+ * the ->committed or ->catchange array, respectively. The committed xids will
+ * get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from catchange array once it is
+ * finished (committed/aborted) but that could be costly as we need to maintain
+ * the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -928,6 +967,30 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/*
+	 * Either all the xacts got purged or none. It is only possible to
+	 * partially remove the xids from this array if one or more of the xids
+	 * are still running but not all. That can happen if we start decoding
+	 * from a point (LSN where the snapshot state became consistent) where all
+	 * the xacts in this were running and then at least one of those got
+	 * committed and a few are still running. We will never start from such a
+	 * point because we won't move the slot's restart_lsn past the point where
+	 * the oldest running transaction's restart_decoding_lsn is.
+	 */
+	if (builder->catchange.xcnt == 0 ||
+		TransactionIdFollowsOrEquals(builder->catchange.xip[0],
+									 builder->xmin))
+		return;
+
+	Assert(TransactionIdFollows(builder->xmin,
+								builder->catchange.xip[builder->catchange.xcnt - 1]));
+	pfree(builder->catchange.xip);
+	builder->catchange.xip = NULL;
+	builder->catchange.xcnt = 0;
+
+	elog(DEBUG3, "purged catalog modifying transactions, oldest running xid %u",
+		 builder->xmin);
 }
 
 /*
@@ -935,7 +998,7 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
  */
 void
 SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
-				   int nsubxacts, TransactionId *subxacts)
+				   int nsubxacts, TransactionId *subxacts, uint32 xinfo)
 {
 	int			nxact;
 
@@ -983,7 +1046,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		 * Add subtransaction to base snapshot if catalog modifying, we don't
 		 * distinguish to toplevel transactions there.
 		 */
-		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (SnapBuildXidHasCatalogChanges(builder, subxid, xinfo))
 		{
 			sub_needs_timetravel = true;
 			needs_snapshot = true;
@@ -1012,7 +1075,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 
 	/* if top-level modified catalog, it'll need a snapshot */
-	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes",
 			 xid);
@@ -1089,6 +1152,29 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	}
 }
 
+/*
+ * Check the reorder buffer and the snapshot to see if the given transaction has
+ * modified catalogs.
+ */
+static inline bool
+SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
+							  uint32 xinfo)
+{
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return true;
+
+	/*
+	 * The transactions that have changed catalogs must have invalidation
+	 * info.
+	 */
+	if (!(xinfo & XACT_XINFO_HAS_INVALS))
+		return false;
+
+	/* Check the catchange XID array */
+	return ((builder->catchange.xcnt > 0) &&
+			(bsearch(&xid, builder->catchange.xip, builder->catchange.xcnt,
+					 sizeof(TransactionId), xidComparator) != NULL));
+}
 
 /* -----------------------------------
  * Snapshot building functions dealing with xlog records
@@ -1135,7 +1221,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1438,6 +1524,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
  *
  * struct SnapBuildOnDisk;
  * TransactionId * committed.xcnt; (*not xcnt_space*)
+ * TransactionId * catchange.xcnt;
  *
  */
 typedef struct SnapBuildOnDisk
@@ -1467,7 +1554,7 @@ typedef struct SnapBuildOnDisk
 	offsetof(SnapBuildOnDisk, version)
 
 #define SNAPBUILD_MAGIC 0x51A1E001
-#define SNAPBUILD_VERSION 4
+#define SNAPBUILD_VERSION 5
 
 /*
  * Store/Load a snapshot from disk, depending on the snapshot builder's state.
@@ -1493,6 +1580,9 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 {
 	Size		needed_length;
 	SnapBuildOnDisk *ondisk = NULL;
+	TransactionId *catchange_xip = NULL;
+	MemoryContext old_ctx;
+	size_t		catchange_xcnt;
 	char	   *ondisk_c;
 	int			fd;
 	char		tmppath[MAXPGPATH];
@@ -1578,10 +1668,16 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 				(errcode_for_file_access(),
 				 errmsg("could not remove file \"%s\": %m", tmppath)));
 
+	old_ctx = MemoryContextSwitchTo(builder->context);
+
+	/* Get the catalog modifying transactions that are not yet committed */
+	catchange_xip = ReorderBufferGetCatalogChangesXacts(builder->reorder);
+	catchange_xcnt = builder->reorder->catchange_ntxns;
+
 	needed_length = sizeof(SnapBuildOnDisk) +
-		sizeof(TransactionId) * builder->committed.xcnt;
+		sizeof(TransactionId) * (builder->committed.xcnt + catchange_xcnt);
 
-	ondisk_c = MemoryContextAllocZero(builder->context, needed_length);
+	ondisk_c = palloc0(needed_length);
 	ondisk = (SnapBuildOnDisk *) ondisk_c;
 	ondisk->magic = SNAPBUILD_MAGIC;
 	ondisk->version = SNAPBUILD_VERSION;
@@ -1598,16 +1694,31 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	ondisk->builder.snapshot = NULL;
 	ondisk->builder.reorder = NULL;
 	ondisk->builder.committed.xip = NULL;
+	ondisk->builder.catchange.xip = NULL;
+	/* update catchange only in the on-disk data */
+	ondisk->builder.catchange.xcnt = catchange_xcnt;
 
 	COMP_CRC32C(ondisk->checksum,
 				&ondisk->builder,
 				sizeof(SnapBuild));
 
 	/* copy committed xacts */
-	sz = sizeof(TransactionId) * builder->committed.xcnt;
-	memcpy(ondisk_c, builder->committed.xip, sz);
-	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
-	ondisk_c += sz;
+	if (builder->committed.xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * builder->committed.xcnt;
+		memcpy(ondisk_c, builder->committed.xip, sz);
+		COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+		ondisk_c += sz;
+	}
+
+	/* copy catalog modifying xacts */
+	if (catchange_xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * catchange_xcnt;
+		memcpy(ondisk_c, catchange_xip, sz);
+		COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+		ondisk_c += sz;
+	}
 
 	FIN_CRC32C(ondisk->checksum);
 
@@ -1688,12 +1799,16 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	 */
 	builder->last_serialized_snapshot = lsn;
 
+	MemoryContextSwitchTo(old_ctx);
+
 out:
 	ReorderBufferSetRestartPoint(builder->reorder,
 								 builder->last_serialized_snapshot);
 	/* be tidy */
 	if (ondisk)
 		pfree(ondisk);
+	if (catchange_xip)
+		pfree(catchange_xip);
 }
 
 /*
@@ -1707,7 +1822,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	int			fd;
 	char		path[MAXPGPATH];
 	Size		sz;
-	int			readBytes;
 	pg_crc32c	checksum;
 
 	/* no point in loading a snapshot if we're already there */
@@ -1739,29 +1853,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
 
 	/* read statically sized portion of snapshot */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk, SnapBuildOnDiskConstantSize);
-	pgstat_report_wait_end();
-	if (readBytes != SnapBuildOnDiskConstantSize)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes,
-							(Size) SnapBuildOnDiskConstantSize)));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk, SnapBuildOnDiskConstantSize, path);
 
 	if (ondisk.magic != SNAPBUILD_MAGIC)
 		ereport(ERROR,
@@ -1781,56 +1873,26 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 				SnapBuildOnDiskConstantSize - SnapBuildOnDiskNotChecksummedSize);
 
 	/* read SnapBuild */
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, &ondisk.builder, sizeof(SnapBuild));
-	pgstat_report_wait_end();
-	if (readBytes != sizeof(SnapBuild))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sizeof(SnapBuild))));
-	}
+	SnapBuildRestoreContents(fd, (char *) &ondisk.builder, sizeof(SnapBuild), path);
 	COMP_CRC32C(checksum, &ondisk.builder, sizeof(SnapBuild));
 
 	/* restore committed xacts information */
-	sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
-	ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
-	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, ondisk.builder.committed.xip, sz);
-	pgstat_report_wait_end();
-	if (readBytes != sz)
+	if (ondisk.builder.committed.xcnt > 0)
 	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
+		sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
+		ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
+		SnapBuildRestoreContents(fd, (char *) ondisk.builder.committed.xip, sz, path);
+		COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
+	}
 
-		if (readBytes < 0)
-		{
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", path)));
-		}
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read file \"%s\": read %d of %zu",
-							path, readBytes, sz)));
+	/* restore catalog modifying xacts information */
+	if (ondisk.builder.catchange.xcnt > 0)
+	{
+		sz = sizeof(TransactionId) * ondisk.builder.catchange.xcnt;
+		ondisk.builder.catchange.xip = MemoryContextAllocZero(builder->context, sz);
+		SnapBuildRestoreContents(fd, (char *) ondisk.builder.catchange.xip, sz, path);
+		COMP_CRC32C(checksum, ondisk.builder.catchange.xip, sz);
 	}
-	COMP_CRC32C(checksum, ondisk.builder.committed.xip, sz);
 
 	if (CloseTransientFile(fd) != 0)
 		ereport(ERROR,
@@ -1885,6 +1947,13 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	ondisk.builder.committed.xip = NULL;
 
+	/* set catalog modifying transactions */
+	if (builder->catchange.xip)
+		pfree(builder->catchange.xip);
+	builder->catchange.xcnt = ondisk.builder.catchange.xcnt;
+	builder->catchange.xip = ondisk.builder.catchange.xip;
+	ondisk.builder.catchange.xip = NULL;
+
 	/* our snapshot is not interesting anymore, build a new one */
 	if (builder->snapshot != NULL)
 	{
@@ -1906,9 +1975,43 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 snapshot_not_interesting:
 	if (ondisk.builder.committed.xip != NULL)
 		pfree(ondisk.builder.committed.xip);
+	if (ondisk.builder.catchange.xip != NULL)
+		pfree(ondisk.builder.catchange.xip);
 	return false;
 }
 
+/*
+ * Read the contents of the serialized snapshot to 'dest'.
+ */
+static void
+SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
+{
+	int			readBytes;
+
+	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+	readBytes = read(fd, dest, size);
+	pgstat_report_wait_end();
+	if (readBytes != size)
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+
+		if (readBytes < 0)
+		{
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("could not read file \"%s\": read %d of %zu",
+							path, readBytes, size)));
+	}
+}
+
 /*
  * Remove all serialized snapshots that are not required anymore because no
  * slot can need them. This doesn't actually have to run during a checkpoint,
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d109d0baed..fd84f175c0 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -380,6 +380,11 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	node;
 
+	/*
+	 * A node in the list of catalog modifying transactions
+	 */
+	dlist_node	catchange_node;
+
 	/*
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
@@ -526,6 +531,12 @@ struct ReorderBuffer
 	 */
 	dlist_head	txns_by_base_snapshot_lsn;
 
+	/*
+	 * Transactions and subtransactions that have modified system catalogs.
+	 */
+	dlist_head	catchange_txns;
+	int			catchange_ntxns;
+
 	/*
 	 * one-entry sized cache for by_txn. Very frequently the same txn gets
 	 * looked up over and over again.
@@ -677,6 +688,7 @@ extern void ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
 extern void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 extern ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 extern TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
+extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index d179251aad..e6adea24f2 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -82,7 +82,7 @@ extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
-							   TransactionId *subxacts);
+							   TransactionId *subxacts, uint32 xinfo);
 extern bool SnapBuildProcessChange(SnapBuild *builder, TransactionId xid,
 								   XLogRecPtr lsn);
 extern void SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
-- 
2.24.3 (Apple Git-128)
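
For readers following the master-branch patch above: the lookup in SnapBuildXidHasCatalogChanges() comes down to a flag check followed by a binary search over a sorted xid array. What follows is only a minimal standalone sketch of that idea, not the patch code. The names xid_cmp, xid_has_catalog_changes and SKETCH_XINFO_HAS_INVALS (including its value) are hypothetical stand-ins, and the reorder-buffer check that the real function performs first is omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Hypothetical stand-in for XACT_XINFO_HAS_INVALS; the value is illustrative. */
#define SKETCH_XINFO_HAS_INVALS (1U << 3)

/* Plain unsigned comparison, the same ordering the array was sorted with. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	if (x > y)
		return 1;
	if (x < y)
		return -1;
	return 0;
}

/*
 * Return true if the commit record carries invalidations AND the xid is in
 * the sorted array of catalog-modifying transactions.
 */
bool
xid_has_catalog_changes(const TransactionId *xip, size_t xcnt,
						TransactionId xid, uint32_t xinfo)
{
	/* A transaction that changed catalogs must have invalidation info. */
	if (!(xinfo & SKETCH_XINFO_HAS_INVALS))
		return false;

	/* An empty array cannot contain the xid; do not call bsearch on it. */
	if (xcnt == 0)
		return false;

	return bsearch(&xid, xip, xcnt, sizeof(TransactionId), xid_cmp) != NULL;
}
```

The flag check comes first because it is cheap and, per the patch comment, a transaction without invalidation messages cannot have changed catalogs, so the binary search is skipped for most commits.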

REL13_v13-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From c661f6637a289ec9b2bae8bf68300c4042e7c79b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v13] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach can produce false positives; we could end up adding a
transaction that didn't change the catalog to the snapshot, since we cannot
tell whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog changes. But that won't be a problem since
we use the snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 +-
 src/backend/replication/logical/snapbuild.c   | 133 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 227 insertions(+), 9 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c582a5..6ec09ab192 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5a2b828aa3..87cbd08e85 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -582,7 +582,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		if (!ctx->fast_forward)
 			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+		/*
+		 * If the COMMIT record has invalidation messages, it could have catalog
+		 * changes. It is possible that we didn't mark this transaction and
+		 * its subtransactions as containing catalog changes when the decoding
+		 * starts from a commit record without decoding the transaction's other
+		 * changes. Therefore, we make sure to mark such transactions as containing
+		 * catalog changes.
+		 *
+		 * This must be done before SnapBuildCommitTxn() so that we can include
+		 * these transactions in the historic snapshot.
+		 */
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index be46bf0363..d407fb3440 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -252,8 +252,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array once
+ * they are older than ->xmin and no longer matter, so it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog changes. But that won't be a problem since we
+ * use the snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -890,12 +920,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * Ideally, we would remove a transaction from the InitialRunningXacts array
+ * as soon as it finishes (commits or aborts), but that could be costly
+ * because we need to keep the xids in the array sorted.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -930,6 +965,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* Bound check: quick exit if there is no transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * Purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1137,7 +1215,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1288,6 +1366,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when the xl_running_xacts record that we decoded was written. We
+		 * use this later to identify the transactions that have performed
+		 * catalog changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -2030,3 +2122,30 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark its
+ * subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+	/* Skip if there is no initial running xacts information */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index b048dc7484..17d2f93300 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
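
The purge step that SnapBuildPurgeOlderTxn() applies to InitialRunningXacts in the back-branch patch above can be looked at in isolation. This is a minimal sketch under stated assumptions, not the patch code: the helper names xid_precedes and purge_older_xids are hypothetical, the array is compacted in place rather than through a workspace array as the patch does, and all xids are assumed to be normal (non-special) xids, as with NormalTransactionIdPrecedes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Wraparound-aware "a is older than b" for normal xids, using the same
 * signed-difference trick as NormalTransactionIdPrecedes.
 */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Drop every xid older than xmin from xip[0..xcnt) and return the number of
 * survivors. Survivors keep their relative order, so an array that was
 * sorted on entry is still sorted afterwards and remains valid for bsearch.
 */
size_t
purge_older_xids(TransactionId *xip, size_t xcnt, TransactionId xmin)
{
	size_t		surviving = 0;

	for (size_t off = 0; off < xcnt; off++)
	{
		if (xid_precedes(xip[off], xmin))
			continue;			/* older than xmin: remove */
		xip[surviving++] = xip[off];
	}

	return surviving;
}
```

Because the survivors are copied in their original order, no re-sort is needed after the purge, which is the property the patch comment insists on ("The purged array must also be sorted in xidComparator order").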

REL12_v13-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From 0fa949642483f54505f56b043a171dd5f949f974 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v13] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach can produce false positives; we could end up adding a
transaction that didn't change the catalog to the snapshot, since we cannot
tell whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog changes. But that won't be a problem since
we use the snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 +-
 src/backend/replication/logical/snapbuild.c   | 133 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 227 insertions(+), 9 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c582a5..6ec09ab192 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 60d07ce4eb..19cd0bf76a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -585,7 +585,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		if (!ctx->fast_forward)
 			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+		/*
+		 * If the COMMIT record has invalidation messages, it could have catalog
+		 * changes. It is possible that we didn't mark this transaction and
+		 * its subtransactions as containing catalog changes when the decoding
+		 * starts from a commit record without decoding the transaction's other
+		 * changes. Therefore, we ensure to mark such transactions as containing
+		 * catalog change.
+		 *
+		 * This must be done before SnapBuildCommitTxn() so that we can include
+		 * these transactions in the historic snapshot.
+		 */
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 5a1bce5acc..cd091bb724 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -257,8 +257,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become too old to matter, and then it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -895,12 +925,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -935,6 +970,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1142,7 +1220,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1293,6 +1371,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded was written. We use
+		 * this later to identify the transactions that have performed catalog
+		 * changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -2035,3 +2127,30 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark its
+ * subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+	/* Skip if there is no initial running xacts information */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 3acf68f5bd..2eb9532a1b 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
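
For readers following the back-branch approach, the bookkeeping the patch above adds to snapbuild.c (copy the running xids, sort them, look them up with bsearch(), and drop entries older than xmin) can be sketched outside PostgreSQL. This is only a minimal illustration under stated assumptions: Xid, xid_cmp, xid_precedes, remember_running, was_initially_running and purge_older are hypothetical stand-ins, not PostgreSQL functions, and the sketch purges in place rather than through a workspace array as the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;			/* hypothetical stand-in for TransactionId */

/* Plain numeric ordering, the same kind of ordering xidComparator gives. */
static int
xid_cmp(const void *a, const void *b)
{
	Xid			x = *(const Xid *) a;
	Xid			y = *(const Xid *) b;

	return (x > y) - (x < y);
}

/* Wraparound-aware "a is older than b"; valid for normal xids only. */
static bool
xid_precedes(Xid a, Xid b)
{
	return (int32_t) (a - b) < 0;
}

/* Copy the running xids and sort the copy so bsearch() can be used later. */
static Xid *
remember_running(const Xid *xids, size_t n)
{
	Xid		   *copy = malloc(n * sizeof(Xid));

	if (copy == NULL)
		return NULL;
	memcpy(copy, xids, n * sizeof(Xid));
	qsort(copy, n, sizeof(Xid), xid_cmp);
	return copy;
}

/* Was this xid running when the remembered record was written? */
static bool
was_initially_running(const Xid *sorted, size_t n, Xid xid)
{
	return n > 0 &&
		bsearch(&xid, sorted, n, sizeof(Xid), xid_cmp) != NULL;
}

/*
 * Drop every entry older than xmin and return the new count.  Survivors
 * keep their relative order, so the array stays sorted.
 */
static size_t
purge_older(Xid *sorted, size_t n, Xid xmin)
{
	size_t		kept = 0;

	assert(sorted != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
	{
		if (!xid_precedes(sorted[i], xmin))
			sorted[kept++] = sorted[i];
	}
	return kept;
}
```

A caller would invoke remember_running() once after restoring a serialized snapshot, call purge_older() whenever xmin advances, and consult was_initially_running() when a commit record is decoded. Keeping the array sorted is what makes the bsearch() lookup valid, which is the reason the patch comment gives for not removing individual xids as soon as they finish.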

Attachment: REL14_v13-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From cb05f04be5bc67b4cb4f1750555a03cfc3a6838c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v13] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog change. But that won't be a problem since
we use snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 ++
 src/backend/replication/logical/snapbuild.c   | 137 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 232 insertions(+), 8 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b879..4553252d75 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot
+	twophase_snapshot catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 92dfafc632..5a440e6eb7 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -691,6 +691,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
+	/*
+	 * If the COMMIT record has invalidation messages, it could have catalog
+	 * changes. It is possible that we didn't mark this transaction as
+	 * containing catalog changes when the decoding starts from a commit
+	 * record without decoding the transaction's other changes. So, we ensure
+	 * to mark such transactions as containing catalog change.
+	 *
+	 * This must be done before SnapBuildCommitTxn() so that we can include
+	 * these transactions in the historic snapshot.
+	 */
+	if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
+
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 6df602485b..ac09f0e4f9 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -250,8 +250,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become too old to matter, and then it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -879,12 +909,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -919,6 +954,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there are no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1126,7 +1204,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1277,6 +1355,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded was written. We use
+		 * this later to identify the transactions that have performed catalog
+		 * changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -1993,3 +2085,34 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * If the given xid is in the list of the initial running xacts, we mark the
+ * transaction and its subtransactions as containing catalog changes. See
+ * comments for NInitialRunningXacts and InitialRunningXacts for additional
+ * info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	/*
+	 * Skip if there is no initial running xacts information or the
+	 * transaction is already marked as containing catalog changes.
+	 */
+	if (NInitialRunningXacts == 0 ||
+		ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+		for (int i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 3604621e88..a19b59e100 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -90,4 +90,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 										 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
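
The commit-time decision in the REL14 patch above (act only when the commit record carries invalidations, skip when nothing was remembered or the transaction is already marked, otherwise mark the top-level xid and all of its subtransactions) can be modelled with a toy tracker. This is a rough sketch under stated assumptions, not the patch's code: ToyTracker, ToyCommit, COMMIT_HAS_INVALS, contains, mark and on_commit are all hypothetical names, and a linear scan stands in for bsearch() and for the reorder buffer lookups.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;			/* hypothetical stand-in for TransactionId */

#define COMMIT_HAS_INVALS 0x01	/* hypothetical stand-in for XACT_XINFO_HAS_INVALS */
#define MAX_MARKED 16			/* arbitrary capacity for this toy model */

typedef struct ToyCommit
{
	Xid			xid;			/* top-level transaction */
	uint32_t	xinfo;			/* flag bits of the commit record */
	size_t		nsubxacts;
	const Xid  *subxacts;
} ToyCommit;

typedef struct ToyTracker
{
	const Xid  *initial_running;	/* xids remembered after restore */
	size_t		ninitial;
	Xid			catalog_xids[MAX_MARKED];	/* marked as catalog-changing */
	size_t		ncatalog;
} ToyTracker;

static bool
contains(const Xid *xids, size_t n, Xid xid)
{
	for (size_t i = 0; i < n; i++)
	{
		if (xids[i] == xid)
			return true;
	}
	return false;
}

static void
mark(ToyTracker *t, Xid xid)
{
	if (contains(t->catalog_xids, t->ncatalog, xid))
		return;
	assert(t->ncatalog < MAX_MARKED);
	t->catalog_xids[t->ncatalog++] = xid;
}

/* Returns true if this commit caused new xids to be marked. */
static bool
on_commit(ToyTracker *t, const ToyCommit *c)
{
	/* No invalidation messages: nothing to infer from the commit record. */
	if ((c->xinfo & COMMIT_HAS_INVALS) == 0)
		return false;

	/* Nothing remembered, or the transaction is already marked. */
	if (t->ninitial == 0 || contains(t->catalog_xids, t->ncatalog, c->xid))
		return false;

	/* Only transactions that were running at the restart point qualify. */
	if (!contains(t->initial_running, t->ninitial, c->xid))
		return false;

	mark(t, c->xid);
	for (size_t i = 0; i < c->nsubxacts; i++)
		mark(t, c->subxacts[i]);
	return true;
}
```

Because the commit record does not say which (sub)transaction touched the catalog, the model marks all of them, which is the false positive the commit message describes and accepts.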

Attachment: REL11_v13-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From 201ef206988ebee808ad3a72f9ce808c8935ee98 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v13] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog change. But that won't be a problem since
we use snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  44 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  15 +-
 src/backend/replication/logical/snapbuild.c   | 135 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 229 insertions(+), 9 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 65a91a8014..973b94738a 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -51,7 +51,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_check) \
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c085f7b0f3..dc83743c38 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -586,7 +586,20 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		if (!ctx->fast_forward)
 			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+		/*
+		 * If the COMMIT record has invalidation messages, it could have catalog
+		 * changes. It is possible that we didn't mark this transaction and
+		 * its subtransactions as containing catalog changes when the decoding
+		 * starts from a commit record without decoding the transaction's other
+		 * changes. Therefore, we ensure to mark such transactions as containing
+		 * catalog change.
+		 *
+		 * This must be done before SnapBuildCommitTxn() so that we can include
+		 * these transactions in the historic snapshot.
+		 */
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1c52bc64e3..538d757b7d 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -258,8 +258,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become old enough not to matter, and then it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -896,12 +926,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -936,6 +971,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there is no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1143,7 +1221,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1294,6 +1372,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded was written. We use
+		 * this later to identify the transactions have performed catalog
+		 * changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -1996,3 +2088,32 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark its
+ * subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+	/* Skip if there is no initial running xacts information */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		int		i;
+
+		for (i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 1df66a3c75..4df3c3f2f7 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 							 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
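
The commit message above describes remembering the xids from the decoded RUNNING_XACTS record in a sorted array and later checking whether a committing transaction is in that array. A minimal, standalone sketch of that lookup follows. It is not the patch's code: `InitialXacts`, `remember_running_xacts`, `was_initially_running` and `xid_cmp` are hypothetical names, `TransactionId` is a local stand-in typedef, and plain `malloc` replaces the PostgreSQL memory-context allocation used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;	/* stand-in for the PostgreSQL typedef */

/* Plain numeric ordering; assumed to mirror what xidComparator does. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	if (x > y)
		return 1;
	if (x < y)
		return -1;
	return 0;
}

/* Hypothetical container for the xids remembered at restore time. */
typedef struct InitialXacts
{
	TransactionId *xids;
	size_t		nxids;
} InitialXacts;

/* Copy the running xids and sort them so that bsearch() can be used later. */
static InitialXacts
remember_running_xacts(const TransactionId *running, size_t n)
{
	InitialXacts r = {NULL, 0};

	if (n == 0)
		return r;
	assert(running != NULL);

	r.xids = malloc(n * sizeof(TransactionId));
	if (r.xids == NULL)
		return r;				/* caller sees an empty list */

	memcpy(r.xids, running, n * sizeof(TransactionId));
	qsort(r.xids, n, sizeof(TransactionId), xid_cmp);
	r.nxids = n;
	return r;
}

/* True if xid was running when the RUNNING_XACTS record was written. */
static bool
was_initially_running(const InitialXacts *r, TransactionId xid)
{
	if (r->nxids == 0)
		return false;
	return bsearch(&xid, r->xids, r->nxids,
				   sizeof(TransactionId), xid_cmp) != NULL;
}
```

In the patch, a hit in this lookup at commit time (together with XACT_XINFO_HAS_INVALS on the commit record) is what leads to the subtransactions also being marked as containing catalog changes; the top-level xid is marked regardless of the lookup.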

Attachment: REL10_v13-0001-Fix-catalog-lookup-with-the-wrong-snapshot-durin.patch (application/octet-stream)
From 9722d357a704fdd3aff6449e2ee42444ff87f6fa Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jul 2022 14:02:50 +0900
Subject: [PATCH v13] Fix catalog lookup with the wrong snapshot during logical
 decoding.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, after the restart,
if the logical decoding decodes only the commit record of the transaction
that has actually modified a catalog, we will miss adding its XID to the
snapshot. Thus, we will end up looking at catalogs with the wrong
snapshot.

To fix this problem, this changes the snapshot builder so that it
remembers the last-running-xacts list of the decoded RUNNING_XACTS record
after restoring the previously serialized snapshot. Then, we mark the
transaction as containing catalog changes if it's in the list of initial
running transactions and its commit record has XACT_XINFO_HAS_INVALS. To
avoid ABI breakage, we store the array of the initial running transactions
in the static variables InitialRunningXacts and NInitialRunningXacts,
instead of storing those in SnapBuild or ReorderBuffer.

This approach has a false positive; we could end up adding the transaction
that didn't change catalog to the snapshot since we cannot distinguish
whether the transaction has catalog changes only by checking the COMMIT
record. It doesn't have the information on which (sub) transaction has
catalog changes, and XACT_XINFO_HAS_INVALS doesn't necessarily indicate
that the transaction has catalog change. But that won't be a problem since
we use snapshot built during decoding only to read system catalogs.

On the master branch, we took a more future-proof approach by writing
catalog modifying transactions to the serialized snapshot which avoids the
above false positive. But we cannot backpatch it because of a change in
the SnapBuild.

Reported-by: Mike Oh
Author: Masahiko Sawada
Reviewed-by: Amit Kapila, Shi yu, Takamichi Osumi, Kyotaro Horiguchi, Bertrand Drouvot, Ahsan Hadi
Backpatch-through: 10
Discussion: https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/catalog_change_snapshot.out      |  41 ++++++
 .../specs/catalog_change_snapshot.spec        |  39 +++++
 src/backend/replication/logical/decode.c      |  16 ++-
 src/backend/replication/logical/snapbuild.c   | 135 +++++++++++++++++-
 src/include/replication/snapbuild.h           |   3 +
 6 files changed, 227 insertions(+), 9 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 2db2b2774b..73bc0fe1fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -51,7 +51,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_check) \
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 0000000000..15f9540b3f
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,41 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_truncate: TRUNCATE tbl1;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
new file mode 100644
index 0000000000..2971ddc69c
--- /dev/null
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -0,0 +1,39 @@
+# Test decoding only the commit record of a transaction that has
+# modified catalogs.
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1 (val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); }
+step "s0_begin" { BEGIN; }
+step "s0_savepoint" { SAVEPOINT sp1; }
+step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_checkpoint" { CHECKPOINT; }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
+
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# containing catalog changes while decoding the COMMIT record and the decoding
+# of the INSERT record must read the pg_class with the correct historic snapshot.
+#
+# Note that in a case where bgwriter wrote the RUNNING_XACTS record between "s0_commit"
+# and "s0_begin", this doesn't happen as the decoding starts from the RUNNING_XACTS
+# record written by bgwriter.  One might think we can either stop the bgwriter or
+# increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6f8920f52c..3233104fc9 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -561,7 +561,21 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	{
 		ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 									  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+
+		/*
+		 * If the COMMIT record has invalidation messages, it could have catalog
+		 * changes. It is possible that we didn't mark this transaction and
+		 * its subtransactions as containing catalog changes when the decoding
+		 * starts from a commit record without decoding the transaction's other
+		 * changes. Therefore, we ensure to mark such transactions as containing
+		 * catalog change.
+		 *
+		 * This must be done before SnapBuildCommitTxn() so that we can include
+		 * these transactions in the historic snapshot.
+		 */
+		SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+									  parsed->nsubxacts, parsed->subxacts,
+									  buf->origptr);
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1010a2e869..7c637a4019 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -258,8 +258,38 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* ->committed manipulation */
-static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
+/*
+ * Array of transactions and subtransactions that were running when
+ * the xl_running_xacts record that we decoded was written. The array is
+ * sorted in xidComparator order. We remove xids from this array when
+ * they become old enough not to matter, and then it eventually becomes empty.
+ * This array is allocated in builder->context so its lifetime is the same
+ * as the snapshot builder.
+ *
+ * We normally rely on some WAL record types such as HEAP2_NEW_CID to know
+ * if the transaction has changed the catalog. But it could happen that the
+ * logical decoding decodes only the commit record of the transaction after
+ * restoring the previously serialized snapshot in which case we will miss
+ * adding the xid to the snapshot and end up looking at the catalogs with the
+ * wrong snapshot.
+ *
+ * Now to avoid the above problem, if the COMMIT record of the xid listed in
+ * InitialRunningXacts has XACT_XINFO_HAS_INVALS flag, we mark both the top
+ * transaction and its subtransactions as containing catalog changes.
+ *
+ * We could end up adding the transaction that didn't change catalog
+ * to the snapshot since we cannot distinguish whether the transaction
+ * has catalog changes only by checking the COMMIT record. It doesn't
+ * have the information on which (sub) transaction has catalog changes,
+ * and XACT_XINFO_HAS_INVALS doesn't necessarily indicate that the
+ * transaction has catalog change. But that won't be a problem since we
+ * use snapshot built during decoding only for reading system catalogs.
+ */
+static TransactionId *InitialRunningXacts = NULL;
+static int	NInitialRunningXacts = 0;
+
+/* ->committed and InitialRunningXacts manipulation */
+static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
 static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
@@ -896,12 +926,17 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 }
 
 /*
- * Remove knowledge about transactions we treat as committed that are smaller
- * than ->xmin. Those won't ever get checked via the ->committed array but via
- * the clog machinery, so we don't need to waste memory on them.
+ * Remove knowledge about transactions we treat as committed and the initial
+ * running transactions that are smaller than ->xmin. Those won't ever get
+ * checked via the ->committed or InitialRunningXacts array, respectively.
+ * The committed xids will get checked via the clog machinery.
+ *
+ * We can ideally remove the transaction from InitialRunningXacts array
+ * once it is finished (committed/aborted) but that could be costly as we need
+ * to maintain the xids order in the array.
  */
 static void
-SnapBuildPurgeCommittedTxn(SnapBuild *builder)
+SnapBuildPurgeOlderTxn(SnapBuild *builder)
 {
 	int			off;
 	TransactionId *workspace;
@@ -936,6 +971,49 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	builder->committed.xcnt = surviving_xids;
 
 	pfree(workspace);
+
+	/* Quick exit if there is no initial running transactions */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	/* bound check if there is at least one transaction to remove */
+	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
+									 builder->xmin))
+		return;
+
+	/*
+	 * purge xids in InitialRunningXacts as well. The purged array must also
+	 * be sorted in xidComparator order.
+	 */
+	workspace =
+		MemoryContextAlloc(builder->context,
+						   NInitialRunningXacts * sizeof(TransactionId));
+	surviving_xids = 0;
+	for (off = 0; off < NInitialRunningXacts; off++)
+	{
+		if (NormalTransactionIdPrecedes(InitialRunningXacts[off],
+										builder->xmin))
+			;					/* remove */
+		else
+			workspace[surviving_xids++] = InitialRunningXacts[off];
+	}
+
+	if (surviving_xids > 0)
+		memcpy(InitialRunningXacts, workspace,
+			   sizeof(TransactionId) * surviving_xids);
+	else
+	{
+		pfree(InitialRunningXacts);
+		InitialRunningXacts = NULL;
+	}
+
+	elog(DEBUG3, "purged initial running transactions from %u to %u, oldest running xid %u",
+		 (uint32) NInitialRunningXacts,
+		 (uint32) surviving_xids,
+		 builder->xmin);
+
+	NInitialRunningXacts = surviving_xids;
+	pfree(workspace);
 }
 
 /*
@@ -1143,7 +1221,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeCommittedTxn(builder);
+	SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1294,6 +1372,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
+		int			nxacts = running->subxcnt + running->xcnt;
+		Size		sz = sizeof(TransactionId) * nxacts;
+
+		/*
+		 * Remember the transactions and subtransactions that were running
+		 * when xl_running_xacts record that we decoded was written. We use
+		 * this later to identify the transactions have performed catalog
+		 * changes. See SnapBuildXidSetCatalogChanges.
+		 */
+		NInitialRunningXacts = nxacts;
+		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		memcpy(InitialRunningXacts, running->xids, sz);
+		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+
 		/* there won't be any state to cleanup */
 		return false;
 	}
@@ -1997,3 +2089,32 @@ CheckPointSnapBuild(void)
 	}
 	FreeDir(snap_dir);
 }
+
+/*
+ * Mark the transaction as containing catalog changes. In addition, if the
+ * given xid is in the list of the initial running xacts, we mark its
+ * subtransactions as well. See comments for NInitialRunningXacts and
+ * InitialRunningXacts for additional info.
+ */
+void
+SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt,
+							  TransactionId *subxacts, XLogRecPtr lsn)
+{
+	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
+
+	/* Skip if there is no initial running xacts information */
+	if (NInitialRunningXacts == 0)
+		return;
+
+	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
+				sizeof(TransactionId), xidComparator) != NULL)
+	{
+		int		i;
+
+		for (i = 0; i < subxcnt; i++)
+		{
+			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
+			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
+		}
+	}
+}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index b95f56eec3..7a796ce136 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -88,4 +88,7 @@ extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
 							 struct xl_running_xacts *running);
 extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
 
+extern void SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+										  int subxcnt, TransactionId *subxacts,
+										  XLogRecPtr lsn);
 #endif							/* SNAPBUILD_H */
-- 
2.24.3 (Apple Git-128)
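
The purge step in SnapBuildPurgeOlderTxn() above drops every remembered xid that precedes the builder's xmin while keeping the survivors in their original order. A minimal sketch of that filtering follows. It is an illustration only: `purge_older_xids` and `xid_precedes` are hypothetical names, `TransactionId` is a local stand-in typedef, and the sketch compacts the array in place rather than copying through a workspace buffer as the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in for the PostgreSQL typedef */

/*
 * Wraparound-aware "a is older than b" for normal xids, using the usual
 * modulo-2^32 signed-difference idiom.
 */
static int
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Remove all xids older than xmin, preserving the relative order of the
 * remaining entries.  Returns the number of surviving xids.
 */
static size_t
purge_older_xids(TransactionId *xids, size_t n, TransactionId xmin)
{
	size_t		kept = 0;

	assert(xids != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (!xid_precedes(xids[i], xmin))
			xids[kept++] = xids[i];	/* survivor: shift it down */
	}
	return kept;
}
```

Because survivors are copied in their original order, an array that was sorted before the purge remains sorted afterwards, which is what a later bsearch() over it relies on.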

#128Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#127)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Wed, Aug 3, 2022 at 1:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Aug 3, 2022 at 3:52 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:

On Wed, Aug 3, 2022 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached updated patches that incorporated the above comments as
well as the comments from Shi yu. Please review them.

Thanks for updating the patch.

I noticed that in SnapBuildXidSetCatalogChanges(), "i" is initialized in the if
branch in REL10 patch, which is different from REL11 patch. Maybe we can modify
REL11 patch to be consistent with REL10 patch.

The rest of the patch looks good to me.

Oops, thanks for pointing it out. I've fixed it and attached updated
patches for all branches so as not to confuse the patch version. There
is no update from v12 patch on REL12 - master patches.

Thanks for the updated patches, the changes look good to me.
Horiguchi-San, and others, do you have any further comments on this or
do you want to spend time in review of it? If not, I would like to
push this after the current minor version release.

--
With Regards,
Amit Kapila.

#129Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#128)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Mon, Aug 8, 2022 at 9:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Aug 3, 2022 at 1:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Oops, thanks for pointing it out. I've fixed it and attached updated
patches for all branches so as not to confuse the patch version. There
is no update from v12 patch on REL12 - master patches.

Thanks for the updated patches, the changes look good to me.
Horiguchi-San, and others, do you have any further comments on this or
do you want to spend time in review of it? If not, I would like to
push this after the current minor version release.

Pushed.

--
With Regards,
Amit Kapila.

#130Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#129)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Thu, Aug 11, 2022 at 3:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 8, 2022 at 9:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Aug 3, 2022 at 1:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Oops, thanks for pointing it out. I've fixed it and attached updated
patches for all branches so as not to confuse the patch version. There
is no update from v12 patch on REL12 - master patches.

Thanks for the updated patches, the changes look good to me.
Horiguchi-San, and others, do you have any further comments on this or
do you want to spend time in review of it? If not, I would like to
push this after the current minor version release.

Pushed.

Thank you!

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#131Drouvot, Bertrand
bdrouvot@amazon.com
In reply to: Amit Kapila (#129)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

Hi,

On 8/11/22 8:10 AM, Amit Kapila wrote:

On Mon, Aug 8, 2022 at 9:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Aug 3, 2022 at 1:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Oops, thanks for pointing it out. I've fixed it and attached updated
patches for all branches so as not to confuse the patch version. There
is no update from v12 patch on REL12 - master patches.

Thanks for the updated patches, the changes look good to me.
Horiguchi-San, and others, do you have any further comments on this or
do you want to spend time in review of it? If not, I would like to
push this after the current minor version release.

Pushed.

Thank you!

I just marked the corresponding CF entry [1] as committed.

[1]: https://commitfest.postgresql.org/39/3041/

Regards,

--

Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com

#132Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#114)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Fri, Jul 29, 2022 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, your description makes sense to me. I've also considered how to
hit this path but I guess it is never hit. Thinking of it in another
way, first of all, at least 2 catalog modifying transactions have to
be running while writing a xl_running_xacts. The serialized snapshot
that is written when we decode the first xl_running_xact has two
transactions. Then, one of them is committed before the second
xl_running_xacts. The second serialized snapshot has only one
transaction. Then, the transaction is also committed after that. Now,
in order to execute the path, we need to start decoding from the first
serialized snapshot. However, if we start from there, we cannot decode
the full contents of the transaction that was committed later.

I think then we should change this code in the master branch patch
with an additional comment on the lines of: "Either all the xacts got
purged or none. It is only possible to partially remove the xids from
this array if one or more of the xids are still running but not all.
That can happen if we start decoding from a point (LSN where the
snapshot state became consistent) where all the xacts in this were
running and then at least one of those got committed and a few are
still running. We will never start from such a point because we won't
move the slot's restart_lsn past the point where the oldest running
transaction's restart_decoding_lsn is."

Unfortunately, this theory doesn't turn out to be true. While
investigating the latest buildfarm failure [1], I see that it is
possible that only part of the xacts in the restored catalog modifying
xacts list needs to be purged. See the attached where I have
demonstrated it via a reproducible test. It seems the point we were
missing was that to start from a point where two or more catalog
modifying were serialized, it requires another open transaction before
both get committed, and then we need the checkpoint or other way to
force running_xacts record in-between the commit of initial two
catalog modifying xacts. There could possibly be other ways as well
but the theory above wasn't correct.
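
To make the partial-purge case concrete, here is a minimal standalone sketch of purging a sorted xid list against an advancing xmin. It is not the PostgreSQL code: the helper names are hypothetical, the xid values used in the note below (740, 741, 742) are made up, and the comparison ignores the special (non-normal) xids that the real TransactionIdFollowsOrEquals() handles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;

/*
 * Wraparound-aware "id1 >= id2" for normal xids, using the usual
 * modulo-2^32 signed-difference trick.  Simplified: special xids are
 * not treated separately here.
 */
static int
xid_follows_or_equals(TransactionId id1, TransactionId id2)
{
	return (int32_t) (id1 - id2) >= 0;
}

/*
 * Remove every xid older than xmin from the front of the sorted array
 * xip[0..xcnt-1], sliding the survivors down.  Returns the number of
 * surviving entries, which may be anything from 0 to xcnt.
 */
static size_t
purge_older_than(TransactionId *xip, size_t xcnt, TransactionId xmin)
{
	size_t		off;
	size_t		surviving;

	assert(xip != NULL || xcnt == 0);

	/* Find the first xid that is still interesting (>= xmin). */
	for (off = 0; off < xcnt; off++)
	{
		if (xid_follows_or_equals(xip[off], xmin))
			break;
	}

	surviving = xcnt - off;
	if (surviving > 0)
		memmove(xip, &xip[off], surviving * sizeof(TransactionId));

	return surviving;
}
```

Mapped onto the attached test, suppose s0's TRUNCATE ran as xid 740, s2's TRUNCATE as 741 and s0's later INSERT as 742. The list restored from the first serialized snapshot is {740, 741}. Decoding the second running_xacts record (oldest running xid 741) leaves {741}, and decoding the third (oldest running xid 742) leaves an empty list. The middle step is exactly the partial purge that the old assertion ruled out.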

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2022-08-25%2004%3A15%3A34

--
With Regards,
Amit Kapila.

Attachments:

fix_purge_catchange_xacts_1.patch (application/x-patch)
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
index dc4f9b7..d2a4bdf 100644
--- a/contrib/test_decoding/expected/catalog_change_snapshot.out
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 3 sessions
 
 starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
 step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
@@ -42,3 +42,57 @@ COMMIT
 stop    
 (1 row)
 
+
+starting permutation: s0_init s0_begin s0_truncate s2_begin s2_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s2_commit s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_truncate: TRUNCATE tbl1;
+step s2_begin: BEGIN;
+step s2_truncate: TRUNCATE tbl2;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s2_commit: COMMIT;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl2: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
index 2971ddc..ff8f684 100644
--- a/contrib/test_decoding/specs/catalog_change_snapshot.spec
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -3,12 +3,15 @@
 setup
 {
     DROP TABLE IF EXISTS tbl1;
+    DROP TABLE IF EXISTS tbl2;
     CREATE TABLE tbl1 (val1 integer, val2 integer);
+    CREATE TABLE tbl2 (val1 integer, val2 integer);
 }
 
 teardown
 {
     DROP TABLE tbl1;
+    DROP TABLE tbl2;
     SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
 }
 
@@ -26,6 +29,12 @@ setup { SET synchronous_commit=on; }
 step "s1_checkpoint" { CHECKPOINT; }
 step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
 
+session "s2"
+setup { SET synchronous_commit=on; }
+step "s2_begin" { BEGIN; }
+step "s2_truncate" { TRUNCATE tbl2; }
+step "s2_commit" { COMMIT; }
+
 # For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
 # only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
 # during the first checkpoint execution.  This transaction must be marked as
@@ -37,3 +46,14 @@ step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_
 # record written by bgwriter.  One might think we can either stop the bgwriter or
 # increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
 permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
+
+# Test that we can purge the old catalog modifying transactions after restoring
+# them from the serialized snapshot. The first checkpoint will serialize the list
+# of two catalog modifying xacts. The purpose of the second checkpoint is to allow
+# partial pruning of the list of catalog modifying xact. The third checkpoint
+# followed by get_changes establishes a restart_point at the first checkpoint LSN.
+# The last get_changes will start decoding from the first checkpoint which
+# restores the list of catalog modifying xacts and then while decoding the second
+# checkpoint record it prunes one of the xacts in that list and when decoding the
+# next checkpoint, it will completely prune that list.
+permutation "s0_init" "s0_begin" "s0_truncate" "s2_begin" "s2_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s2_commit" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1ff2c12..cbf16c0 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -969,28 +969,40 @@ SnapBuildPurgeOlderTxn(SnapBuild *builder)
 	pfree(workspace);
 
 	/*
-	 * Either all the xacts got purged or none. It is only possible to
-	 * partially remove the xids from this array if one or more of the xids
-	 * are still running but not all. That can happen if we start decoding
-	 * from a point (LSN where the snapshot state became consistent) where all
-	 * the xacts in this were running and then at least one of those got
-	 * committed and a few are still running. We will never start from such a
-	 * point because we won't move the slot's restart_lsn past the point where
-	 * the oldest running transaction's restart_decoding_lsn is.
+	 * Purge xids in ->catchange as well. The purged array must also be
+	 * sorted in xidComparator order.
 	 */
-	if (builder->catchange.xcnt == 0 ||
-		TransactionIdFollowsOrEquals(builder->catchange.xip[0],
-									 builder->xmin))
-		return;
+	if (builder->catchange.xcnt > 0)
+	{
+		/*
+		 * Since catchange.xip is sorted, we find the lower bound of xids that
+		 * are still interesting.
+		 */
+		for (off = 0; off < builder->catchange.xcnt; off++)
+		{
+			if (NormalTransactionIdPrecedes(builder->catchange.xip[off],
+											builder->xmin))
+				break;
+		}
 
-	Assert(TransactionIdFollows(builder->xmin,
-								builder->catchange.xip[builder->catchange.xcnt - 1]));
-	pfree(builder->catchange.xip);
-	builder->catchange.xip = NULL;
-	builder->catchange.xcnt = 0;
+		surviving_xids = builder->catchange.xcnt - off;
 
-	elog(DEBUG3, "purged catalog modifying transactions, oldest running xid %u",
-		 builder->xmin);
+		if (surviving_xids > 0)
+		{
+			memmove(builder->catchange.xip, &(builder->catchange.xip[off]),
+					surviving_xids * sizeof(TransactionId));
+		}
+		else
+		{
+			pfree(builder->catchange.xip);
+			builder->catchange.xip = NULL;
+		}
+
+		elog(DEBUG3, "purged catalog modifying transactions from %u to %u, xmin: %u, xmax: %u",
+			 (uint32) builder->catchange.xcnt, (uint32) surviving_xids,
+			 builder->xmin, builder->xmax);
+		builder->catchange.xcnt = surviving_xids;
+	}
 }
 
 /*
#133Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#132)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Sat, Aug 27, 2022 at 3:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 29, 2022 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, your description makes sense to me. I've also considered how to
hit this path but I guess it is never hit. Thinking of it in another
way, first of all, at least 2 catalog modifying transactions have to
be running while writing a xl_running_xacts. The serialized snapshot
that is written when we decode the first xl_running_xact has two
transactions. Then, one of them is committed before the second
xl_running_xacts. The second serialized snapshot has only one
transaction. Then, the transaction is also committed after that. Now,
in order to execute the path, we need to start decoding from the first
serialized snapshot. However, if we start from there, we cannot decode
the full contents of the transaction that was committed later.

I think then we should change this code in the master branch patch
with an additional comment on the lines of: "Either all the xacts got
purged or none. It is only possible to partially remove the xids from
this array if one or more of the xids are still running but not all.
That can happen if we start decoding from a point (LSN where the
snapshot state became consistent) where all the xacts in this were
running and then at least one of those got committed and a few are
still running. We will never start from such a point because we won't
move the slot's restart_lsn past the point where the oldest running
transaction's restart_decoding_lsn is."

Unfortunately, this theory doesn't turn out to be true. While
investigating the latest buildfarm failure [1], I see that it is
possible that only part of the xacts in the restored catalog modifying
xacts list needs to be purged. See the attached where I have
demonstrated it via a reproducible test. It seems the point we were
missing was that to start from a point where two or more catalog
modifying xacts were serialized, it requires another open transaction before
both get committed, and then we need the checkpoint or other way to
force running_xacts record in-between the commit of initial two
catalog modifying xacts. There could possibly be other ways as well
but the theory above wasn't correct.

Thank you for the analysis and the patch. I have the same conclusion.
Since we took this approach only on the master the back branches are
not affected.

The new test scenario makes sense to me and looks better than the one
I have. Regarding the fix, I think we should use
TransactionIdFollowsOrEquals() instead of
NormalTransactionIdPrecedes():

 +       for (off = 0; off < builder->catchange.xcnt; off++)
 +       {
 +           if (NormalTransactionIdPrecedes(builder->catchange.xip[off],
 +                                           builder->xmin))
 +               break;
 +       }
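
The difference between the two stop conditions can be checked in isolation. The sketch below is a standalone model, not PostgreSQL code: the helper names are hypothetical and the comparisons skip the special-xid handling of the real macros and functions. It returns the offset at which the scan of a sorted xid array stops for a given predicate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Simplified modulo-2^32 comparisons for normal xids. */
static int
xid_precedes(TransactionId id1, TransactionId id2)
{
	return (int32_t) (id1 - id2) < 0;
}

static int
xid_follows_or_equals(TransactionId id1, TransactionId id2)
{
	return (int32_t) (id1 - id2) >= 0;
}

typedef int (*xid_pred) (TransactionId, TransactionId);

/*
 * Scan xip[0..xcnt-1] and return the first offset at which
 * stop(xip[off], xmin) holds, or xcnt if it never does.  Everything
 * before the returned offset would be purged by the caller.
 */
static size_t
first_stop_offset(const TransactionId *xip, size_t xcnt,
				  TransactionId xmin, xid_pred stop)
{
	size_t		off;

	assert(xip != NULL || xcnt == 0);

	for (off = 0; off < xcnt; off++)
	{
		if (stop(xip[off], xmin))
			break;
	}
	return off;
}
```

With the array {740, 741} and xmin 741, stopping on xid_follows_or_equals yields offset 1, so only the stale xid 740 is purged. Stopping on xid_precedes yields offset 0, so nothing is purged. Worse, with the array {741} and xmin 741 the xid_precedes variant never stops and reports offset 1, which would drop an xid that is still running.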

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#134Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#133)
1 attachment(s)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Sat, Aug 27, 2022 at 1:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sat, Aug 27, 2022 at 3:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 29, 2022 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, your description makes sense to me. I've also considered how to
hit this path but I guess it is never hit. Thinking of it in another
way, first of all, at least 2 catalog modifying transactions have to
be running while writing a xl_running_xacts. The serialized snapshot
that is written when we decode the first xl_running_xact has two
transactions. Then, one of them is committed before the second
xl_running_xacts. The second serialized snapshot has only one
transaction. Then, the transaction is also committed after that. Now,
in order to execute the path, we need to start decoding from the first
serialized snapshot. However, if we start from there, we cannot decode
the full contents of the transaction that was committed later.

I think then we should change this code in the master branch patch
with an additional comment on the lines of: "Either all the xacts got
purged or none. It is only possible to partially remove the xids from
this array if one or more of the xids are still running but not all.
That can happen if we start decoding from a point (LSN where the
snapshot state became consistent) where all the xacts in this were
running and then at least one of those got committed and a few are
still running. We will never start from such a point because we won't
move the slot's restart_lsn past the point where the oldest running
transaction's restart_decoding_lsn is."

Unfortunately, this theory doesn't turn out to be true. While
investigating the latest buildfarm failure [1], I see that it is
possible that only part of the xacts in the restored catalog modifying
xacts list needs to be purged. See the attached where I have
demonstrated it via a reproducible test. It seems the point we were
missing was that to start from a point where two or more catalog
modifying xacts were serialized, it requires another open transaction before
both get committed, and then we need the checkpoint or other way to
force running_xacts record in-between the commit of initial two
catalog modifying xacts. There could possibly be other ways as well
but the theory above wasn't correct.

Thank you for the analysis and the patch. I have the same conclusion.
Since we took this approach only on the master the back branches are
not affected.

The new test scenario makes sense to me and looks better than the one
I have. Regarding the fix, I think we should use
TransactionIdFollowsOrEquals() instead of
NormalTransactionIdPrecedes():

+       for (off = 0; off < builder->catchange.xcnt; off++)
+       {
+           if (NormalTransactionIdPrecedes(builder->catchange.xip[off],
+                                           builder->xmin))
+               break;
+       }

Right, fixed.

--
With Regards,
Amit Kapila.

Attachments:

v2-0001-Fix-the-incorrect-assertion-introduced-in-commit-.patch (application/octet-stream)
From f7f64b12287f44e58966c2e24503d8969f92ca68 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 27 Aug 2022 15:39:38 +0530
Subject: [PATCH v3] Fix the incorrect assertion introduced in commit
 7f13ac8123.

It has been incorrectly assumed in commit 7f13ac8123 that we can either
purge all or none in the catalog modifying xids list retrieved from a
serialized snapshot. It is quite possible that some of the xids in that
array are old enough to be pruned but not others.

As per buildfarm

Author: Amit Kapila and Masahiko Sawada
Reviewed-by: Masahiko Sawada
Discussion: https://postgr.es/m/CAA4eK1LBtv6ayE+TvCcPmC-xse=DVg=SmbyQD1nv_AaqcpUJEg@mail.gmail.com
---
 .../expected/catalog_change_snapshot.out           | 56 +++++++++++++++++++++-
 .../specs/catalog_change_snapshot.spec             | 20 ++++++++
 src/backend/replication/logical/snapbuild.c        | 50 +++++++++++--------
 3 files changed, 106 insertions(+), 20 deletions(-)

diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
index dc4f9b7..d2a4bdf 100644
--- a/contrib/test_decoding/expected/catalog_change_snapshot.out
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 3 sessions
 
 starting permutation: s0_init s0_begin s0_savepoint s0_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
 step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
@@ -42,3 +42,57 @@ COMMIT
 stop    
 (1 row)
 
+
+starting permutation: s0_init s0_begin s0_truncate s2_begin s2_truncate s1_checkpoint s1_get_changes s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s2_commit s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_truncate: TRUNCATE tbl1;
+step s2_begin: BEGIN;
+step s2_truncate: TRUNCATE tbl2;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl1: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s2_commit: COMMIT;
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                   
+---------------------------------------
+BEGIN                                  
+table public.tbl2: TRUNCATE: (no-flags)
+COMMIT                                 
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
index 2971ddc..ff8f684 100644
--- a/contrib/test_decoding/specs/catalog_change_snapshot.spec
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -3,12 +3,15 @@
 setup
 {
     DROP TABLE IF EXISTS tbl1;
+    DROP TABLE IF EXISTS tbl2;
     CREATE TABLE tbl1 (val1 integer, val2 integer);
+    CREATE TABLE tbl2 (val1 integer, val2 integer);
 }
 
 teardown
 {
     DROP TABLE tbl1;
+    DROP TABLE tbl2;
     SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
 }
 
@@ -26,6 +29,12 @@ setup { SET synchronous_commit=on; }
 step "s1_checkpoint" { CHECKPOINT; }
 step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); }
 
+session "s2"
+setup { SET synchronous_commit=on; }
+step "s2_begin" { BEGIN; }
+step "s2_truncate" { TRUNCATE tbl2; }
+step "s2_commit" { COMMIT; }
+
 # For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
 # only its COMMIT record, because it starts from the RUNNING_XACTS record emitted
 # during the first checkpoint execution.  This transaction must be marked as
@@ -37,3 +46,14 @@ step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_
 # record written by bgwriter.  One might think we can either stop the bgwriter or
 # increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
 permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
+
+# Test that we can purge the old catalog modifying transactions after restoring
+# them from the serialized snapshot. The first checkpoint will serialize the list
+# of two catalog modifying xacts. The purpose of the second checkpoint is to allow
+# partial pruning of the list of catalog modifying xact. The third checkpoint
+# followed by get_changes establishes a restart_point at the first checkpoint LSN.
+# The last get_changes will start decoding from the first checkpoint which
+# restores the list of catalog modifying xacts and then while decoding the second
+# checkpoint record it prunes one of the xacts in that list and when decoding the
+# next checkpoint, it will completely prune that list.
+permutation "s0_init" "s0_begin" "s0_truncate" "s2_begin" "s2_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s2_commit" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1ff2c12..bf72ad4 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -969,28 +969,40 @@ SnapBuildPurgeOlderTxn(SnapBuild *builder)
 	pfree(workspace);
 
 	/*
-	 * Either all the xacts got purged or none. It is only possible to
-	 * partially remove the xids from this array if one or more of the xids
-	 * are still running but not all. That can happen if we start decoding
-	 * from a point (LSN where the snapshot state became consistent) where all
-	 * the xacts in this were running and then at least one of those got
-	 * committed and a few are still running. We will never start from such a
-	 * point because we won't move the slot's restart_lsn past the point where
-	 * the oldest running transaction's restart_decoding_lsn is.
+	 * Purge xids in ->catchange as well. The purged array must also be sorted
+	 * in xidComparator order.
 	 */
-	if (builder->catchange.xcnt == 0 ||
-		TransactionIdFollowsOrEquals(builder->catchange.xip[0],
-									 builder->xmin))
-		return;
+	if (builder->catchange.xcnt > 0)
+	{
+		/*
+		 * Since catchange.xip is sorted, we find the lower bound of xids that
+		 * are still interesting.
+		 */
+		for (off = 0; off < builder->catchange.xcnt; off++)
+		{
+			if (TransactionIdFollowsOrEquals(builder->catchange.xip[off],
+											 builder->xmin))
+				break;
+		}
 
-	Assert(TransactionIdFollows(builder->xmin,
-								builder->catchange.xip[builder->catchange.xcnt - 1]));
-	pfree(builder->catchange.xip);
-	builder->catchange.xip = NULL;
-	builder->catchange.xcnt = 0;
+		surviving_xids = builder->catchange.xcnt - off;
 
-	elog(DEBUG3, "purged catalog modifying transactions, oldest running xid %u",
-		 builder->xmin);
+		if (surviving_xids > 0)
+		{
+			memmove(builder->catchange.xip, &(builder->catchange.xip[off]),
+					surviving_xids * sizeof(TransactionId));
+		}
+		else
+		{
+			pfree(builder->catchange.xip);
+			builder->catchange.xip = NULL;
+		}
+
+		elog(DEBUG3, "purged catalog modifying transactions from %u to %u, xmin: %u, xmax: %u",
+			 (uint32) builder->catchange.xcnt, (uint32) surviving_xids,
+			 builder->xmin, builder->xmax);
+		builder->catchange.xcnt = surviving_xids;
+	}
 }
 
 /*
-- 
1.8.3.1

#135Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#134)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Sat, Aug 27, 2022 at 7:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Aug 27, 2022 at 1:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sat, Aug 27, 2022 at 3:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 29, 2022 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, your description makes sense to me. I've also considered how to
hit this path but I guess it is never hit. Thinking of it in another
way, first of all, at least 2 catalog modifying transactions have to
be running while writing a xl_running_xacts. The serialized snapshot
that is written when we decode the first xl_running_xact has two
transactions. Then, one of them is committed before the second
xl_running_xacts. The second serialized snapshot has only one
transaction. Then, the transaction is also committed after that. Now,
in order to execute the path, we need to start decoding from the first
serialized snapshot. However, if we start from there, we cannot decode
the full contents of the transaction that was committed later.

I think then we should change this code in the master branch patch
with an additional comment on the lines of: "Either all the xacts got
purged or none. It is only possible to partially remove the xids from
this array if one or more of the xids are still running but not all.
That can happen if we start decoding from a point (LSN where the
snapshot state became consistent) where all the xacts in this were
running and then at least one of those got committed and a few are
still running. We will never start from such a point because we won't
move the slot's restart_lsn past the point where the oldest running
transaction's restart_decoding_lsn is."

Unfortunately, this theory doesn't turn out to be true. While
investigating the latest buildfarm failure [1], I see that it is
possible that only part of the xacts in the restored catalog modifying
xacts list needs to be purged. See the attached where I have
demonstrated it via a reproducible test. It seems the point we were
missing was that to start from a point where two or more catalog
modifying xacts were serialized, it requires another open transaction before
both get committed, and then we need the checkpoint or other way to
force running_xacts record in-between the commit of initial two
catalog modifying xacts. There could possibly be other ways as well
but the theory above wasn't correct.

Thank you for the analysis and the patch. I have the same conclusion.
Since we took this approach only on master, the back branches are
not affected.

The new test scenario makes sense to me and looks better than the one
I have. Regarding the fix, I think we should use
TransactionIdFollowsOrEquals() instead of
NormalTransactionIdPrecedes():

+       for (off = 0; off < builder->catchange.xcnt; off++)
+       {
+           if (NormalTransactionIdPrecedes(builder->catchange.xip[off],
+                                           builder->xmin))
+               break;
+       }

Right, fixed.

Thank you for updating the patch! It looks good to me.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#136Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#135)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

On Sat, Aug 27, 2022 at 7:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sat, Aug 27, 2022 at 7:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Aug 27, 2022 at 1:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think then we should change this code in the master branch patch
with an additional comment on the lines of: "Either all the xacts got
purged or none. It is only possible to partially remove the xids from
this array if one or more of the xids are still running but not all.
That can happen if we start decoding from a point (LSN where the
snapshot state became consistent) where all the xacts in this were
running and then at least one of those got committed and a few are
still running. We will never start from such a point because we won't
move the slot's restart_lsn past the point where the oldest running
transaction's restart_decoding_lsn is."

Unfortunately, this theory doesn't turn out to be true. While
investigating the latest buildfarm failure [1], I see that it is
possible that only part of the xacts in the restored catalog modifying
xacts list needs to be purged. See the attached where I have
demonstrated it via a reproducible test. It seems the point we were
missing was that to start from a point where two or more catalog
modifying xacts were serialized, it requires another open transaction before
both get committed, and then we need a checkpoint or some other way to
force a running_xacts record in between the commits of the initial two
catalog modifying xacts. There could possibly be other ways as well
but the theory above wasn't correct.

Thank you for the analysis and the patch. I have the same conclusion.
Since we took this approach only on master, the back branches are
not affected.

The new test scenario makes sense to me and looks better than the one
I have. Regarding the fix, I think we should use
TransactionIdFollowsOrEquals() instead of
NormalTransactionIdPrecedes():

+       for (off = 0; off < builder->catchange.xcnt; off++)
+       {
+           if (NormalTransactionIdPrecedes(builder->catchange.xip[off],
+                                           builder->xmin))
+               break;
+       }

Right, fixed.

Thank you for updating the patch! It looks good to me.

Pushed.

--
With Regards,
Amit Kapila.

#137Maxim Orlov
orlovmg@gmail.com
In reply to: Amit Kapila (#136)
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

Pushed.

--
With Regards,
Amit Kapila.

Hi!

While working on the 64-bit XIDs patch set, I stumbled into problems with
the contrib/test_decoding/catalog_change_snapshot test [0].

AFAICS, the problem is not related to the 64-bit XIDs patch set; the
problem is in the InitialRunningXacts array, allocated in
builder->context. Do we really need it to be allocated that way?

[0]: /messages/by-id/CACG=ezZoz_KG+Ryh9MrU_g5e0HiVoHocEvqFF=NRrhrwKmEQJQ@mail.gmail.com

--
Best regards,
Maxim Orlov.