Fix for segfault in logical replication on master

Started by Mark Dilger over 4 years ago, 19 messages
#1 Mark Dilger
mark.dilger@enterprisedb.com
1 attachment(s)

Hi Amit,

In commit e7eea52b2d, you introduced a new function, RelationGetIdentityKeyBitmap(), which uses some odd logic for determining if a relation has a replica identity index. That code segfaults under certain conditions. A test case to demonstrate that is attached. Prior to patching the code, this new test gets stuck waiting for replication to finish, which never happens. You have to break out of the test and check tmp_check/log/021_no_replica_identity_publisher.log.

I believe this bit of logic in src/backend/utils/cache/relcache.c:

indexDesc = RelationIdGetRelation(relation->rd_replidindex);
for (i = 0; i < indexDesc->rd_index->indnatts; i++)

is unsafe without further checks, which are also attached.

Would you mind taking a look?

Attachments:

v1-0001-Fixing-bug-in-logical-replication.patch (application/octet-stream; x-unix-mode=0644) Download
From b19cd78d656321f3bf5fa4015020f249f4419d81 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Wed, 16 Jun 2021 21:26:32 -0700
Subject: [PATCH v1] Fixing bug in logical replication

When replicating DML for a table with an index but no REPLICA
IDENTITY, the publisher was crashing with a segmentation fault, at
least under certain conditions.  Adding a fix for that along with a
simple test demonstrating the problem.
---
 src/backend/utils/cache/relcache.c            | 21 ++++++-
 .../subscription/t/021_no_replica_identity.pl | 57 +++++++++++++++++++
 2 files changed, 77 insertions(+), 1 deletion(-)
 create mode 100644 src/test/subscription/t/021_no_replica_identity.pl

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index fd05615e76..f40f294864 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5266,8 +5266,27 @@ RelationGetIdentityKeyBitmap(Relation relation)
 	if (indexoidlist == NIL)
 		return NULL;
 
-	/* Add referenced attributes to idindexattrs */
+	/* Fall out if the replica identity index is invalid */
+	if (!OidIsValid(relation->rd_replidindex))
+		return NULL;
+
+	/* Look up the description for the replica identity index */
 	indexDesc = RelationIdGetRelation(relation->rd_replidindex);
+
+	/*
+	 * Fall out if the description is not for an index, suggesting
+	 * affairs have changed since we looked.  XXX Should we log a
+	 * complaint here?
+	 */
+	if (!indexDesc)
+		return NULL;
+	if (!indexDesc->rd_index)
+	{
+		RelationClose(indexDesc);
+		return NULL;
+	}
+
+	/* Add referenced attributes to idindexattrs */
 	for (i = 0; i < indexDesc->rd_index->indnatts; i++)
 	{
 		int			attrnum = indexDesc->rd_index->indkey.values[i];
diff --git a/src/test/subscription/t/021_no_replica_identity.pl b/src/test/subscription/t/021_no_replica_identity.pl
new file mode 100644
index 0000000000..9a6c8f5f74
--- /dev/null
+++ b/src/test/subscription/t/021_no_replica_identity.pl
@@ -0,0 +1,57 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test of logical replication in the absence of a proper replica
+# identity index
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+my $cmd;
+
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->start;
+
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+$cmd = qq(
+CREATE TABLE tbl (i INTEGER);
+CREATE INDEX tbl_idx ON tbl(i);
+);
+
+$node_publisher->safe_psql('postgres', $cmd);
+$node_subscriber->safe_psql('postgres', $cmd);
+
+$cmd = qq(
+CREATE PUBLICATION pub FOR TABLE tbl);
+$node_publisher->safe_psql('postgres', $cmd);
+
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$cmd = qq(
+CREATE SUBSCRIPTION sub
+	CONNECTION '$publisher_connstr'
+	PUBLICATION pub);
+$node_subscriber->safe_psql('postgres', $cmd);
+
+$node_publisher->wait_for_catchup("sub");
+
+my $synced_query =
+  "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('s', 'r');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+$cmd = qq(INSERT INTO tbl (i) VALUES (1));
+$node_publisher->safe_psql('postgres', $cmd);
+
+$node_publisher->wait_for_catchup("sub");
+
+is($node_subscriber->safe_psql('postgres', q(SELECT i FROM tbl)),
+   1, "value replicated to subscriber without replica identity index");
+
+$node_subscriber->stop('smart');
+$node_publisher->stop('smart');
-- 
2.21.1 (Apple Git-122.3)

#2 osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Mark Dilger (#1)
RE: Fix for segfault in logical replication on master

On Thursday, June 17, 2021 1:31 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

In commit e7eea52b2d, you introduced a new function,
RelationGetIdentityKeyBitmap(), which uses some odd logic for determining
if a relation has a replica identity index. That code segfaults under certain
conditions. A test case to demonstrate that is attached. Prior to patching
the code, this new test gets stuck waiting for replication to finish, which never
happens. You have to break out of the test and check
tmp_check/log/021_no_replica_identity_publisher.log.

I believe this bit of logic in src/backend/utils/cache/relcache.c:

indexDesc = RelationIdGetRelation(relation->rd_replidindex);
for (i = 0; i < indexDesc->rd_index->indnatts; i++)

is unsafe without further checks, also attached.

Would you mind taking a look?

Hi, Mark

Thanks for your report.
I started to analyze your report and
will reply after my idea to your modification is settled.

Best Regards,
Takamichi Osumi

#3 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: osumi.takamichi@fujitsu.com (#2)
Re: Fix for segfault in logical replication on master

On Jun 16, 2021, at 10:19 PM, osumi.takamichi@fujitsu.com wrote:

I started to analyze your report and
will reply after my idea to your modification is settled.

Thank you.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Mark Dilger (#3)
RE: Fix for segfault in logical replication on master

On Thursday, June 17, 2021 2:43 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

On Jun 16, 2021, at 10:19 PM, osumi.takamichi@fujitsu.com wrote:

I started to analyze your report and
will reply after my idea to your modification is settled.

Thank you.

I'll share my first analysis.

In commit e7eea52b2d, you introduced a new function,
RelationGetIdentityKeyBitmap(), which uses some odd logic for determining
if a relation has a replica identity index. That code segfaults under certain
conditions. A test case to demonstrate that is attached. Prior to patching
the code, this new test gets stuck waiting for replication to finish, which never
happens. You have to break out of the test and check
tmp_check/log/021_no_replica_identity_publisher.log.

I believe this bit of logic in src/backend/utils/cache/relcache.c:

indexDesc = RelationIdGetRelation(relation->rd_replidindex);
for (i = 0; i < indexDesc->rd_index->indnatts; i++)

is unsafe without further checks, also attached.

You are absolutely right.
I checked the crash scenario and reproduced the core,
which has a null indexDesc. Also, rd_replidindex must be checked beforehand,
as you do in your patch, because having an index does not necessarily
mean having a replica identity index. As proof of this, the OID stored in
rd_replidindex in the scenario is 0. OTOH, I've confirmed that your new test
passes with your fix.

Also, your test looks essentially minimal (suitable for the problem) to me.

* RelationGetIdentityKeyBitmap
+       /*
+        * Fall out if the description is not for an index, suggesting
+        * affairs have changed since we looked.  XXX Should we log a
+        * complaint here?
+        */
+       if (!indexDesc)
+               return NULL;
+       if (!indexDesc->rd_index)
+       {
+               RelationClose(indexDesc);
+               return NULL;
+       }
For the 1st check, isn't it better to use RelationIsValid()?
I agree with having the check itself, of course.

Additionally, in what kind of actual scenario did you think
we would reach the part that would "log a complaint"?

I'm going to spend some time analyzing RelationGetIndexAttrBitmap next,
to see whether similar hazards can happen there, because
RelationGetIdentityKeyBitmap's logic mainly comes from that function.

Best Regards,
Takamichi Osumi

#5 Amit Kapila
amit.kapila16@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#4)
Re: Fix for segfault in logical replication on master

On Thu, Jun 17, 2021 at 4:09 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Thursday, June 17, 2021 2:43 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

On Jun 16, 2021, at 10:19 PM, osumi.takamichi@fujitsu.com wrote:

I started to analyze your report and
will reply after my idea to your modification is settled.

Thank you.

I'll share my first analysis.

In commit e7eea52b2d, you introduced a new function,
RelationGetIdentityKeyBitmap(), which uses some odd logic for determining
if a relation has a replica identity index. That code segfaults under certain
conditions. A test case to demonstrate that is attached. Prior to patching
the code, this new test gets stuck waiting for replication to finish, which never
happens. You have to break out of the test and check
tmp_check/log/021_no_replica_identity_publisher.log.

I believe this bit of logic in src/backend/utils/cache/relcache.c:

indexDesc = RelationIdGetRelation(relation->rd_replidindex);
for (i = 0; i < indexDesc->rd_index->indnatts; i++)

is unsafe without further checks, also attached.

You are absolutely right.
I checked the crash scenario and reproduced the core,
which has a null indexDesc. Also, rd_replidindex must be checked beforehand
as you included in your patch, because having an index does not necessarily
mean to have a replica identity index. As the proof of this, the oid of
rd_replidindex in the scenario is 0. OTOH, I've confirmed your new test
has passed with your fix.

Also, your test looks essentially minimum(suitable for the problem) to me.

* RelationGetIdentityKeyBitmap
+       /*
+        * Fall out if the description is not for an index, suggesting
+        * affairs have changed since we looked.  XXX Should we log a
+        * complaint here?
+        */
+       if (!indexDesc)
+               return NULL;
+       if (!indexDesc->rd_index)
+       {
+               RelationClose(indexDesc);
+               return NULL;
+       }
For the 1st check, isn't it better to use RelationIsValid() ?
I agree with having the check itself of course, though.

Additionally, In what kind of actual scenario, did you think that
we come to the part to "log a complaint" ?

Yeah, I think that part is not required unless there is some case
where it can happen. I guess we might want to have an elog at that
place with a check like:

    if (!RelationIsValid(relation))
        elog(ERROR, "could not open relation with OID %u", relid);

--
With Regards,
Amit Kapila.

#6 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: osumi.takamichi@fujitsu.com (#4)
Re: Fix for segfault in logical replication on master

On Jun 17, 2021, at 3:39 AM, osumi.takamichi@fujitsu.com wrote:

For the 1st check, isn't it better to use RelationIsValid() ?

Yes, you are right.

Additionally, In what kind of actual scenario, did you think that
we come to the part to "log a complaint" ?

The way that RelationGetIndexList assigns rd_replidindex to the Relation seems to lack sufficient locking. After scanning pg_index to find indexes associated with the relation, pg_index is closed and the access share lock released. I couldn't prove to myself that, by the time we use the rd_replidindex field thus computed, it is safe to assume that the Oid stored there still refers to an index. The most likely problem would be that the index has since been dropped in a concurrent transaction, but it also seems just barely possible that the Oid has been reused and refers to something else, a table perhaps.

The check that I added is not completely bulletproof, because the new object reusing that Oid could be a different index, and we'd be none the wiser. Do you think we should do something about that? I felt the checks I put in place were very cheap and would work in almost all cases. In any event, they seemed better than no checks, which is what we have now.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7 Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#6)
Re: Fix for segfault in logical replication on master

On Thu, Jun 17, 2021 at 6:50 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 17, 2021, at 3:39 AM, osumi.takamichi@fujitsu.com wrote:

For the 1st check, isn't it better to use RelationIsValid() ?

Yes, you are right.

Additionally, In what kind of actual scenario, did you think that
we come to the part to "log a complaint" ?

The way that RelationGetIndexList assigns rd_replidindex to the Relation seems to lack sufficient locking. After scanning pg_index to find indexes associated with the relation, pg_index is closed and the access share lock released. I couldn't prove to myself that by the time we use the rd_replidindex field thus computed that it was safe to assume that the Oid stored there still refers to an index. The most likely problem would be that the index has since been dropped in a concurrent transaction, but it also seems just barely possible that the Oid has been reused and refers to something else, a table perhaps.

I think such a problem won't happen because we are using historic
snapshots in this context. We rely on that in a similar way in
reorderbuffer.c, see ReorderBufferProcessTXN.
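
A rough sketch of the ordering in question, heavily simplified from reorderbuffer.c (the variable names here are illustrative, not the exact ones):

    /* The historic snapshot is installed first ... */
    SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
    ...
    /* ... then the relation is opened under that snapshot ... */
    relation = RelationIdGetRelation(reloid);
    /* ... and only then is the output plugin callback invoked. */
    rb->apply_change(rb, txn, relation, change);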

--
With Regards,
Amit Kapila.

#8 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amit Kapila (#7)
1 attachment(s)
Re: Fix for segfault in logical replication on master

On Jun 17, 2021, at 6:40 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think such a problem won't happen because we are using historic
snapshots in this context. We rely on that in a similar way in
reorderbuffer.c, see ReorderBufferProcessTXN.

I think you are right, but that's the part I have trouble fully convincing myself is safe. We certainly have an historic snapshot when we call RelationGetIndexList, but that has an early exit if the relation already has fields set, and we don't know if those fields were set before or after the historic snapshot was taken. Within the context of the pluggable infrastructure, I think we're safe. The only caller of RelationGetIdentityKeyBitmap() in core is logicalrep_write_attrs(), which is only called by logicalrep_write_rel(), which is only called by send_relation_and_attrs(), which is only called by maybe_send_schema(), which is called by pgoutput_change() and pgoutput_truncate(), both being callbacks in core's logical replication plugin.

ReorderBufferProcessTXN calls SetupHistoricSnapshot before opening the relation and then calling ReorderBufferApplyChange to invoke the plugin on that opened relation, so the relation's fields could not have been set up before the snapshot was taken. Any other plugin would similarly get invoked after that same logic, so they'd be fine, too. The problem would only be if somebody called RelationGetIdentityKeyBitmap() or one of its calling functions from outside that infrastructure. Is that worth worrying about? The function comments for those mention having an historic snapshot, and the Assert will catch code that doesn't have one, but I wonder how much of a trap for the unwary that is, considering that somebody might open the relation and look up indexes for the relation before taking an historic snapshot and calling these functions.

I thought it was cheap enough to check that the relation we open is an index, because if it is not, we'll segfault when accessing fields of the relation->rd_index struct. I wouldn't necessarily advocate doing any really expensive checks here, but a quick sanity check seemed worth the effort. If you don't want to commit that part, I'm not going to put up a huge fuss.

Since neither of you knew why I was performing that check, it is clear that my code comment was insufficient. I have added a more detailed code comment to explain the purpose of the check. I also changed the first check to use RelationIsValid(), as suggested upthread.

Attachments:

v2-0001-Fixing-bug-in-logical-replication.patch (application/octet-stream; x-unix-mode=0644) Download
From a3d5879e868cbd4269747b07ecf1b3f16bcd2496 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Wed, 16 Jun 2021 21:26:32 -0700
Subject: [PATCH v2] Fixing bug in logical replication

When replicating DML for a table with an index but no REPLICA
IDENTITY, the publisher was crashing with a segmentation fault, at
least under certain conditions.  Adding a fix for that along with a
simple test demonstrating the problem.
---
 src/backend/utils/cache/relcache.c            | 36 +++++++++++-
 .../subscription/t/021_no_replica_identity.pl | 57 +++++++++++++++++++
 2 files changed, 92 insertions(+), 1 deletion(-)
 create mode 100644 src/test/subscription/t/021_no_replica_identity.pl

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index fd05615e76..f57a3b1dc8 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5266,8 +5266,42 @@ RelationGetIdentityKeyBitmap(Relation relation)
 	if (indexoidlist == NIL)
 		return NULL;
 
-	/* Add referenced attributes to idindexattrs */
+	/* Fall out if the replica identity index is invalid */
+	if (!OidIsValid(relation->rd_replidindex))
+		return NULL;
+
+	/* Look up the description for the replica identity index */
 	indexDesc = RelationIdGetRelation(relation->rd_replidindex);
+
+	/*
+	 * Fall out if the description is not for an index.  The historic snapshot
+	 * should have been setup before the relation was opened, but if the caller
+	 * managed to set relation->rd_replidindex before taking a snapshot, there
+	 * is a chance that the index was dropped in between those two actions.
+	 * (The replication infrastructure in core should *never* do this, but we
+	 * can't anticipate how third party code might reuse these functions.)  If
+	 * so, the index relation will likely be merely invalid, but in the
+	 * extremely unlikely case that the Oid tracked by rd_replidindex got
+	 * reassigned to something else, we could have just opened a relation of
+	 * type other than index.  Beware that this race condition does not require
+	 * the Oid range to wrap around while we're doing logical replication.  The
+	 * Oid for the index could already have been quite old and nearing the
+	 * point where it would get reused if available.  We perform a half-measure
+	 * to check that the relation opened really is an index, though we do
+	 * nothing to recheck that it is an index associated with our table
+	 * relation.  Doing more would require reopening pg_index and rescanning
+	 * which hardly seems worth the performance penalty for something which is
+	 * arguably user error anyway.
+	 */
+	if (!RelationIsValid(indexDesc))
+		return NULL;
+	if (!indexDesc->rd_index)
+	{
+		RelationClose(indexDesc);
+		return NULL;
+	}
+
+	/* Add referenced attributes to idindexattrs */
 	for (i = 0; i < indexDesc->rd_index->indnatts; i++)
 	{
 		int			attrnum = indexDesc->rd_index->indkey.values[i];
diff --git a/src/test/subscription/t/021_no_replica_identity.pl b/src/test/subscription/t/021_no_replica_identity.pl
new file mode 100644
index 0000000000..9a6c8f5f74
--- /dev/null
+++ b/src/test/subscription/t/021_no_replica_identity.pl
@@ -0,0 +1,57 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test of logical replication in the absence of a proper replica
+# identity index
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+my $cmd;
+
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->start;
+
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+$cmd = qq(
+CREATE TABLE tbl (i INTEGER);
+CREATE INDEX tbl_idx ON tbl(i);
+);
+
+$node_publisher->safe_psql('postgres', $cmd);
+$node_subscriber->safe_psql('postgres', $cmd);
+
+$cmd = qq(
+CREATE PUBLICATION pub FOR TABLE tbl);
+$node_publisher->safe_psql('postgres', $cmd);
+
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$cmd = qq(
+CREATE SUBSCRIPTION sub
+	CONNECTION '$publisher_connstr'
+	PUBLICATION pub);
+$node_subscriber->safe_psql('postgres', $cmd);
+
+$node_publisher->wait_for_catchup("sub");
+
+my $synced_query =
+  "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('s', 'r');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+$cmd = qq(INSERT INTO tbl (i) VALUES (1));
+$node_publisher->safe_psql('postgres', $cmd);
+
+$node_publisher->wait_for_catchup("sub");
+
+is($node_subscriber->safe_psql('postgres', q(SELECT i FROM tbl)),
+   1, "value replicated to subscriber without replica identity index");
+
+$node_subscriber->stop('smart');
+$node_publisher->stop('smart');
-- 
2.21.1 (Apple Git-122.3)

#9 Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#8)
Re: Fix for segfault in logical replication on master

On Thu, Jun 17, 2021 at 9:26 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 17, 2021, at 6:40 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think such a problem won't happen because we are using historic
snapshots in this context. We rely on that in a similar way in
reorderbuffer.c, see ReorderBufferProcessTXN.

I think you are right, but that's the part I have trouble fully convincing myself is safe. We certainly have an historic snapshot when we call RelationGetIndexList, but that has an early exit if the relation already has fields set, and we don't know if those fields were set before or after the historic snapshot was taken. Within the context of the pluggable infrastructure, I think we're safe. The only caller of RelationGetIdentityKeyBitmap() in core is logicalrep_write_attrs(), which is only called by logicalrep_write_rel(), which is only called by send_relation_and_attrs(), which is only called by maybe_send_schema(), which is called by pgoutput_change() and pgoutput_truncate(), both being callbacks in core's logical replication plugin.

ReorderBufferProcessTXN calls SetupHistoricSnapshot before opening the relation and then calling ReorderBufferApplyChange to invoke the plugin on that opened relation, so the relation's fields could not have been setup before the snapshot was taken. Any other plugin would similarly get invoked after that same logic, so they'd be fine, too. The problem would only be if somebody called RelationGetIdentityKeyBitmap() or one of its calling functions from outside that infrastructure. Is that worth worrying about? The function comments for those mention having an historic snapshot, and the Assert will catch if code doesn't have one, but I wonder how much of a trap for the unwary that is, considering that somebody might open the relation and lookup indexes for the relation before taking an historic snapshot and calling these functions.

I think in such a case the caller must call InvalidateSystemCaches
before setting up a historic snapshot; otherwise, there could be other
problems as well.
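
As a sketch only, for a hypothetical out-of-core caller: the functions below exist in core, but everything around them is illustrative:

    InvalidateSystemCaches();       /* discard cache state that may predate the snapshot */
    SetupHistoricSnapshot(snapshot, tuplecids);
    rel = RelationIdGetRelation(reloid);
    idattrs = RelationGetIdentityKeyBitmap(rel);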

I thought it was cheap enough to check that the relation we open is an index, because if it is not, we'll segfault when accessing fields of the relation->rd_index struct. I wouldn't necessarily advocate doing any really expensive checks here, but a quick sanity check seemed worth the effort.

I am not questioning the cost of these sanity checks. I suggest you
raise an elog rather than return NULL, because if this happens there is
definitely some problem and continuing won't be a good idea.
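
In other words, something along these lines (a sketch; compare the relcache.c hunks quoted later in this thread):

    /* Look up the description for the replica identity index */
    indexDesc = RelationIdGetRelation(relation->rd_replidindex);

    if (!RelationIsValid(indexDesc))
        elog(ERROR, "could not open relation with OID %u",
             relation->rd_replidindex);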

--
With Regards,
Amit Kapila.

#10 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#9)
Re: Fix for segfault in logical replication on master

On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I thought it was cheap enough to check that the relation we open is an index, because if it is not, we'll segfault when accessing fields of the relation->rd_index struct. I wouldn't necessarily advocate doing any really expensive checks here, but a quick sanity check seemed worth the effort.

I am not telling you anything about the cost of these sanity checks. I
suggest you raise elog rather than return NULL because if this happens
there is definitely some problem and continuing won't be a good idea.

Pushed, after making the above change. Additionally, I have moved the
test case to the existing file 001_rep_changes instead of creating a
new one as the test seems to fit there and I was not sure if the test
for just this case deserves a new file.

--
With Regards,
Amit Kapila.

#11 Japin Li
japinli@hotmail.com
In reply to: Amit Kapila (#10)
Re: Fix for segfault in logical replication on master

On Sat, 19 Jun 2021 at 17:18, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I thought it was cheap enough to check that the relation we open is an index, because if it is not, we'll segfault when accessing fields of the relation->rd_index struct. I wouldn't necessarily advocate doing any really expensive checks here, but a quick sanity check seemed worth the effort.

I am not telling you anything about the cost of these sanity checks. I
suggest you raise elog rather than return NULL because if this happens
there is definitely some problem and continuing won't be a good idea.

Pushed, after making the above change. Additionally, I have moved the
test case to the existing file 001_rep_changes instead of creating a
new one as the test seems to fit there and I was not sure if the test
for just this case deserves a new file.

Hi, Amit

Sorry for the late reply.

When we find that the relation has no replica identity index, I think we should
free the memory of the indexoidlist, since we free the memory owned by
indexoidlist at the end of RelationGetIdentityKeyBitmap().

    if (!OidIsValid(relation->rd_replidindex))
    {
        list_free(indexoidlist);
        return NULL;
    }

Or we can free the memory owned by indexoidlist right after checking whether it
is NIL, because we do not use it later.
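
A sketch of that second alternative, with the surrounding lines taken from the existing function, freeing the list right after the NIL check since it is not needed afterwards:

    indexoidlist = RelationGetIndexList(relation);

    /* Fall out if no indexes (but relhasindex was set) */
    if (indexoidlist == NIL)
        return NULL;

    /* The list itself is not used beyond this point. */
    list_free(indexoidlist);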

If we do not free the memory, there might be a memory leak when
relation->rd_replidindex is invalid. Am I right?

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

#12 Amit Kapila
amit.kapila16@gmail.com
In reply to: Japin Li (#11)
Re: Fix for segfault in logical replication on master

On Mon, Jun 21, 2021 at 1:30 PM Japin Li <japinli@hotmail.com> wrote:

On Sat, 19 Jun 2021 at 17:18, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Or we can free the memory owned by indexoidlist after check whether it is NIL,
because we do not use it in the later.

Valid point. But I am thinking do we really need to fetch and check
indexoidlist here?

--
With Regards,
Amit Kapila.

#13 Japin Li
japinli@hotmail.com
In reply to: Amit Kapila (#12)
Re: Fix for segfault in logical replication on master

On Mon, 21 Jun 2021 at 16:22, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 1:30 PM Japin Li <japinli@hotmail.com> wrote:

On Sat, 19 Jun 2021 at 17:18, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Or we can free the memory owned by indexoidlist after check whether it is NIL,
because we do not use it in the later.

Valid point. But I am thinking do we really need to fetch and check
indexoidlist here?

IMO, we should not fetch and check the indexoidlist here, since we do not
use it. However, we should use RelationGetIndexList() to update
relation->rd_replidindex, so we do need to fetch the indexoidlist; maybe we
can use the following code:

    indexoidlist = RelationGetIndexList(relation);
    list_free(indexoidlist);

Or is there any function that only updates relation->rd_replidindex
or related fields, but does not fetch the indexoidlist?

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

#14 Amit Kapila
amit.kapila16@gmail.com
In reply to: Japin Li (#13)
Re: Fix for segfault in logical replication on master

On Mon, Jun 21, 2021 at 2:06 PM Japin Li <japinli@hotmail.com> wrote:

On Mon, 21 Jun 2021 at 16:22, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 1:30 PM Japin Li <japinli@hotmail.com> wrote:

On Sat, 19 Jun 2021 at 17:18, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Or we can free the memory owned by indexoidlist after check whether it is NIL,
because we do not use it in the later.

Valid point. But I am thinking do we really need to fetch and check
indexoidlist here?

IMO, we shold not fetch and check the indexoidlist here, since we do not
use it. However, we should use RelationGetIndexList() to update the
reladion->rd_replidindex, so we should fetch the indexoidlist, maybe we
can use the following code:

indexoidlist = RelationGetIndexList(relation);
list_free(indexoidlist);

Or does there any function that only update the relation->rd_replidindex
or related fields, but do not fetch the indexoidlist?

How about RelationGetReplicaIndex? It fetches the indexlist only when
required and frees it immediately. But otherwise, currently, there
shouldn't be any memory leak because we allocate this in "logical
replication output context" which is reset after processing each
change message, see pgoutput_change.
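
A minimal sketch of what that suggestion would look like inside RelationGetIdentityKeyBitmap (essentially what the patch posted later in this thread does):

    Oid        replidindex = RelationGetReplicaIndex(relation);

    /* Fall out if there is no replica identity index */
    if (!OidIsValid(replidindex))
        return NULL;

    indexDesc = RelationIdGetRelation(replidindex);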

--
With Regards,
Amit Kapila.

#15 Japin Li
japinli@hotmail.com
In reply to: Amit Kapila (#14)
Re: Fix for segfault in logical replication on master

On Mon, 21 Jun 2021 at 17:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 2:06 PM Japin Li <japinli@hotmail.com> wrote:

On Mon, 21 Jun 2021 at 16:22, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 1:30 PM Japin Li <japinli@hotmail.com> wrote:

On Sat, 19 Jun 2021 at 17:18, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Or we can free the memory owned by indexoidlist after check whether it is NIL,
because we do not use it in the later.

Valid point. But I am thinking do we really need to fetch and check
indexoidlist here?

IMO, we shold not fetch and check the indexoidlist here, since we do not
use it. However, we should use RelationGetIndexList() to update the
reladion->rd_replidindex, so we should fetch the indexoidlist, maybe we
can use the following code:

indexoidlist = RelationGetIndexList(relation);
list_free(indexoidlist);

Or does there any function that only update the relation->rd_replidindex
or related fields, but do not fetch the indexoidlist?

How about RelationGetReplicaIndex? It fetches the indexlist only when
required and frees it immediately. But otherwise, currently, there
shouldn't be any memory leak because we allocate this in "logical
replication output context" which is reset after processing each
change message, see pgoutput_change.

Thanks for your explanation. It might not be a memory leak; however, it's
a little confusing when we free the memory of the indexoidlist in one place
but not in another.

I attached a patch to fix this. Any thoughts?

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d55ae016d0..94fbf1aa19 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5244,9 +5244,9 @@ Bitmapset *
 RelationGetIdentityKeyBitmap(Relation relation)
 {
        Bitmapset  *idindexattrs = NULL;        /* columns in the replica identity */
-       List       *indexoidlist;
        Relation        indexDesc;
        int                     i;
+       Oid                     replidindex;
        MemoryContext oldcxt;

/* Quick exit if we already computed the result */
@@ -5260,18 +5260,14 @@ RelationGetIdentityKeyBitmap(Relation relation)
/* Historic snapshot must be set. */
Assert(HistoricSnapshotActive());

-       indexoidlist = RelationGetIndexList(relation);
-
-       /* Fall out if no indexes (but relhasindex was set) */
-       if (indexoidlist == NIL)
-               return NULL;
+       replidindex = RelationGetReplicaIndex(relation);
        /* Fall out if there is no replica identity index */
-       if (!OidIsValid(relation->rd_replidindex))
+       if (!OidIsValid(replidindex))
                return NULL;
        /* Look up the description for the replica identity index */
-       indexDesc = RelationIdGetRelation(relation->rd_replidindex);
+       indexDesc = RelationIdGetRelation(replidindex);

if (!RelationIsValid(indexDesc))
elog(ERROR, "could not open relation with OID %u",
@@ -5295,7 +5291,6 @@ RelationGetIdentityKeyBitmap(Relation relation)
}

RelationClose(indexDesc);
- list_free(indexoidlist);

/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_idattr);

#16 Amit Kapila
amit.kapila16@gmail.com
In reply to: Japin Li (#15)
Re: Fix for segfault in logical replication on master

On Mon, Jun 21, 2021 at 4:09 PM Japin Li <japinli@hotmail.com> wrote:

On Mon, 21 Jun 2021 at 17:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 2:06 PM Japin Li <japinli@hotmail.com> wrote:

On Mon, 21 Jun 2021 at 16:22, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 1:30 PM Japin Li <japinli@hotmail.com> wrote:

On Sat, 19 Jun 2021 at 17:18, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Or we can free the memory owned by indexoidlist after check whether it is NIL,
because we do not use it in the later.

Valid point. But I am thinking do we really need to fetch and check
indexoidlist here?

IMO, we shold not fetch and check the indexoidlist here, since we do not
use it. However, we should use RelationGetIndexList() to update the
reladion->rd_replidindex, so we should fetch the indexoidlist, maybe we
can use the following code:

indexoidlist = RelationGetIndexList(relation);
list_free(indexoidlist);

Or does there any function that only update the relation->rd_replidindex
or related fields, but do not fetch the indexoidlist?

How about RelationGetReplicaIndex? It fetches the indexlist only when
required and frees it immediately. But otherwise, currently, there
shouldn't be any memory leak because we allocate this in "logical
replication output context" which is reset after processing each
change message, see pgoutput_change.

Thanks for your explanation. It might not be a memory leak, however it's
a little confuse when we free the memory of the indexoidlist in one place,
but not free it in another place.

I attached a patch to fix this. Any thoughts?

Your patch appears to be on the lines we discussed but I would prefer
to get it done after Beta2 as this is just a minor code improvement.
Can you please send the change as a patch file instead of copy-pasting
the diff at the end of the email?

--
With Regards,
Amit Kapila.

#17 Japin Li
japinli@hotmail.com
In reply to: Amit Kapila (#16)
1 attachment(s)
Re: Fix for segfault in logical replication on master

On Mon, 21 Jun 2021 at 19:06, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 4:09 PM Japin Li <japinli@hotmail.com> wrote:

On Mon, 21 Jun 2021 at 17:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 2:06 PM Japin Li <japinli@hotmail.com> wrote:

On Mon, 21 Jun 2021 at 16:22, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 1:30 PM Japin Li <japinli@hotmail.com> wrote:

On Sat, 19 Jun 2021 at 17:18, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Or we can free the memory owned by indexoidlist after check whether it is NIL,
because we do not use it in the later.

Valid point. But I am thinking do we really need to fetch and check
indexoidlist here?

IMO, we shold not fetch and check the indexoidlist here, since we do not
use it. However, we should use RelationGetIndexList() to update the
reladion->rd_replidindex, so we should fetch the indexoidlist, maybe we
can use the following code:

indexoidlist = RelationGetIndexList(relation);
list_free(indexoidlist);

Or does there any function that only update the relation->rd_replidindex
or related fields, but do not fetch the indexoidlist?

How about RelationGetReplicaIndex? It fetches the indexlist only when
required and frees it immediately. But otherwise, currently, there
shouldn't be any memory leak because we allocate this in "logical
replication output context" which is reset after processing each
change message, see pgoutput_change.

Thanks for your explanation. It might not be a memory leak, however it's
a little confuse when we free the memory of the indexoidlist in one place,
but not free it in another place.

I attached a patch to fix this. Any thoughts?

Your patch appears to be on the lines we discussed but I would prefer
to get it done after Beta2 as this is just a minor code improvement.
Can you please send the change as a patch file instead of copy-pasting
the diff at the end of the email?

Thanks for your review! Attached v1 patch.

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

Attachments:

v1-0001-Minor-code-beautification-for-RelationGetIdentity.patch (text/x-patch) Download
From 0582d92ba4df517f3e67717751cddb8916c8ab9f Mon Sep 17 00:00:00 2001
From: Japin Li <japinli@hotmail.com>
Date: Tue, 22 Jun 2021 09:59:38 +0800
Subject: [PATCH v1] Minor code beautification for RelationGetIdentityKeyBitmap

---
 src/backend/utils/cache/relcache.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d55ae016d0..94fbf1aa19 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5244,9 +5244,9 @@ Bitmapset *
 RelationGetIdentityKeyBitmap(Relation relation)
 {
 	Bitmapset  *idindexattrs = NULL;	/* columns in the replica identity */
-	List	   *indexoidlist;
 	Relation	indexDesc;
 	int			i;
+	Oid			replidindex;
 	MemoryContext oldcxt;
 
 	/* Quick exit if we already computed the result */
@@ -5260,18 +5260,14 @@ RelationGetIdentityKeyBitmap(Relation relation)
 	/* Historic snapshot must be set. */
 	Assert(HistoricSnapshotActive());
 
-	indexoidlist = RelationGetIndexList(relation);
-
-	/* Fall out if no indexes (but relhasindex was set) */
-	if (indexoidlist == NIL)
-		return NULL;
+	replidindex = RelationGetReplicaIndex(relation);
 
 	/* Fall out if there is no replica identity index */
-	if (!OidIsValid(relation->rd_replidindex))
+	if (!OidIsValid(replidindex))
 		return NULL;
 
 	/* Look up the description for the replica identity index */
-	indexDesc = RelationIdGetRelation(relation->rd_replidindex);
+	indexDesc = RelationIdGetRelation(replidindex);
 
 	if (!RelationIsValid(indexDesc))
 		elog(ERROR, "could not open relation with OID %u",
@@ -5295,7 +5291,6 @@ RelationGetIdentityKeyBitmap(Relation relation)
 	}
 
 	RelationClose(indexDesc);
-	list_free(indexoidlist);
 
 	/* Don't leak the old values of these bitmaps, if any */
 	bms_free(relation->rd_idattr);
-- 
2.30.2

#18 osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Japin Li (#17)
RE: Fix for segfault in logical replication on master

On Tuesday, June 22, 2021 11:08 AM Japin Li <japinli@hotmail.com> wrote:

On Mon, 21 Jun 2021 at 19:06, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 4:09 PM Japin Li <japinli@hotmail.com> wrote:

On Mon, 21 Jun 2021 at 17:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 2:06 PM Japin Li <japinli@hotmail.com> wrote:

On Mon, 21 Jun 2021 at 16:22, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 1:30 PM Japin Li <japinli@hotmail.com> wrote:

On Sat, 19 Jun 2021 at 17:18, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Or we can free the memory owned by indexoidlist after check
whether it is NIL, because we do not use it in the later.

Valid point. But I am thinking do we really need to fetch and
check indexoidlist here?

IMO, we should not fetch and check the indexoidlist here, since we
do not use it. However, we should use RelationGetIndexList() to update
relation->rd_replidindex, so we do need to fetch the indexoidlist;
maybe we can use the following code:

indexoidlist = RelationGetIndexList(relation);
list_free(indexoidlist);

Or does there any function that only update the
relation->rd_replidindex or related fields, but do not fetch the

indexoidlist?

How about RelationGetReplicaIndex? It fetches the indexlist only
when required and frees it immediately. But otherwise, currently,
there shouldn't be any memory leak because we allocate this in
"logical replication output context" which is reset after
processing each change message, see pgoutput_change.

Thanks for your explanation. It might not be a memory leak, however
it's a little confuse when we free the memory of the indexoidlist in
one place, but not free it in another place.

I attached a patch to fix this. Any thoughts?

Your patch appears to be on the lines we discussed but I would prefer
to get it done after Beta2 as this is just a minor code improvement.
Can you please send the change as a patch file instead of copy-pasting
the diff at the end of the email?

Thanks for your review! Attached v1 patch.

Your patch can be applied to HEAD. I also reviewed it, and it seems OK.
make check-world passed with your patch in my environment as well.

Best Regards,
Takamichi Osumi

#19 Amit Kapila
amit.kapila16@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#18)
Re: Fix for segfault in logical replication on master

On Tue, Jun 22, 2021 at 9:18 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Tuesday, June 22, 2021 11:08 AM Japin Li <japinli@hotmail.com> wrote:

Your patch appears to be on the lines we discussed but I would prefer
to get it done after Beta2 as this is just a minor code improvement.
Can you please send the change as a patch file instead of copy-pasting
the diff at the end of the email?

Thanks for your review! Attached v1 patch.

Your patch can be applied to the HEAD.
And, I also reviewed your patch, which seems OK.
Make check-world has passed with your patch in my env as well.

Pushed, thanks!

--
With Regards,
Amit Kapila.