segmentation fault when cassert enabled

Started by Jehan-Guillaume de Rorthais about 6 years ago, 17 messages

#1Jehan-Guillaume de Rorthais
jgdr@dalibo.com
1 attachment(s)

Hi list,

When investigating the bug reported in the thread "logical replication -
negative bitmapset member not allowed", I found a way to make PostgreSQL
segfault, but only when cassert is enabled.

See the scenario in attachment.

When executed against binaries compiled with --enable-cassert, I get the
following error in the logs:

LOG: 00000: background worker "logical replication worker" (PID 761) was
terminated by signal 11: Segmentation fault

Here is the stack trace:

#0 in slot_store_cstrings (slot=0x55a3c6973b48, rel=0x55a3c6989468,
values=0x7ffe08ae67b0) at worker.c:330
#1 in apply_handle_update (s=0x7ffe08aeddb0) at worker.c:712
#2 in apply_dispatch (s=0x7ffe08aeddb0) at worker.c:968
#3 in LogicalRepApplyLoop (last_received=87957952) at worker.c:1175
#4 in ApplyWorkerMain (main_arg=0) at worker.c:1733
#5 in StartBackgroundWorker () at bgworker.c:834
#6 in do_start_bgworker (rw=0x55a3c68c16d0) at postmaster.c:5763
#7 in maybe_start_bgworkers () at postmaster.c:5976
#8 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5161
#9 <signal handler called>
#10 in __GI___select (nfds=6, readfds=0x7ffe08aee680, writefds=0x0,
exceptfds=0x0, timeout=0x7ffe08aee700)
at ../sysdeps/unix/sysv/linux/select.c:41
#11 in ServerLoop () at postmaster.c:1668
#12 in PostmasterMain (argc=3, argv=0x55a3c6899820) at postmaster.c:1377
#13 in main (argc=3, argv=0x55a3c6899820) at main.c:228

It leads to this conditional test in worker.c:slot_store_cstrings():

for (i = 0; i < natts; i++)
{
    [...]
    if (!att->attisdropped && remoteattnum >= 0 &&
        values[remoteattnum] != NULL)

In gdb, I found that remoteattnum does not seem to be correctly initialized
for the last column the scenario adds to pgbench_branches:

(gdb) p remoteattnum
$1 = 32639
(gdb) p i
$2 = 3

I haven't had time to dig further yet. However, I don't understand why this
crash is only triggered when cassert is enabled.

Regards,

Attachments:

logrep-cassert-crash.bash (application/octet-stream)
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jehan-Guillaume de Rorthais (#1)
Re: segmentation fault when cassert enabled

Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes:

When investigating the bug reported in the thread "logical replication -
negative bitmapset member not allowed", I found a way to make PostgreSQL
segfault, but only when cassert is enabled.
...
I haven't had time to dig further yet. However, I don't understand why this
crash is only triggered when cassert is enabled.

Most likely, it's not so much assertions that provoke the crash as
CLOBBER_FREED_MEMORY, ie the actual problem here is use of already-freed
memory.
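
As an illustration, here is a minimal standalone C sketch of why the 32639
value seen in gdb above points at poisoned memory; the 0x7F fill byte assumed
here is the wipe pattern used by CLOBBER_FREED_MEMORY builds:

/*
 * Illustrative only (not PostgreSQL source): freed chunks are overwritten
 * with 0x7F bytes, so an int16 such as remoteattnum read from wiped memory
 * comes out as 0x7F7F = 32639, matching the gdb output above.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	unsigned char	wiped[2];
	int16_t			attnum;

	memset(wiped, 0x7F, sizeof(wiped));	/* simulate the clobber pattern */
	memcpy(&attnum, wiped, sizeof(attnum));
	printf("%d\n", attnum);				/* prints 32639 */
	return 0;
}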

regards, tom lane

#3Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#2)
Re: segmentation fault when cassert enabled

At Fri, 25 Oct 2019 12:28:38 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes:

When investigating the bug reported in the thread "logical replication -
negative bitmapset member not allowed", I found a way to make PostgreSQL
segfault, but only when cassert is enabled.
...
I haven't had time to dig further yet. However, I don't understand why this
crash is only triggered when cassert is enabled.

Most likely, it's not so much assertions that provoke the crash as
CLOBBER_FREED_MEMORY, ie the actual problem here is use of already-freed
memory.

Agreed.

By the way, I didn't get a crash from Jehan's script with an
--enable-cassert build of the master HEAD from a few days ago.

FWIW, I sometimes got SEGV-ish crashes or mysterious misbehavior when
some structs were changed and I didn't do "make clean". Rarely, I
needed "make distclean". (Yeah, I didn't usually turn on
--enable-depend.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#4Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Tom Lane (#2)
Re: segmentation fault when cassert enabled

On Fri, 25 Oct 2019 12:28:38 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes:

When investigating the bug reported in the thread "logical replication -
negative bitmapset member not allowed", I found a way to make PostgreSQL
segfault, but only when cassert is enabled.
...
I haven't had time to dig further yet. However, I don't understand why this
crash is only triggered when cassert is enabled.

Most likely, it's not so much assertions that provoke the crash as
CLOBBER_FREED_MEMORY, ie the actual problem here is use of already-freed
memory.

Thank you. Indeed, enabling CLOBBER_FREED_MEMORY on its own is enough to
trigger the segfault.

In fact, valgrind detects it as an uninitialised value, whether or not
CLOBBER_FREED_MEMORY is defined:

Conditional jump or move depends on uninitialised value(s)
at 0x43F410: slot_modify_cstrings (worker.c:398)
by 0x43FBE9: apply_handle_update (worker.c:744)
by 0x440088: apply_dispatch (worker.c:968)
by 0x4405D7: LogicalRepApplyLoop (worker.c:1175)
by 0x440CD0: ApplyWorkerMain (worker.c:1733)
by 0x411C34: StartBackgroundWorker (bgworker.c:834)
by 0x41EA24: do_start_bgworker (postmaster.c:5763)
by 0x41EB6F: maybe_start_bgworkers (postmaster.c:5976)
by 0x41F562: sigusr1_handler (postmaster.c:5161)
by 0x48A072F: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
by 0x4B31FF6: select (select.c:41)
by 0x41FDDE: ServerLoop (postmaster.c:1668)
Uninitialised value was created by a heap allocation
at 0x5C579B: palloc (mcxt.c:949)
by 0x437116: logicalrep_rel_open (relation.c:270)
by 0x43FA8F: apply_handle_update (worker.c:684)
by 0x440088: apply_dispatch (worker.c:968)
by 0x4405D7: LogicalRepApplyLoop (worker.c:1175)
by 0x440CD0: ApplyWorkerMain (worker.c:1733)
by 0x411C34: StartBackgroundWorker (bgworker.c:834)
by 0x41EA24: do_start_bgworker (postmaster.c:5763)
by 0x41EB6F: maybe_start_bgworkers (postmaster.c:5976)
by 0x41F562: sigusr1_handler (postmaster.c:5161)
by 0x48A072F: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
by 0x4B31FF6: select (select.c:41)

My best bet so far is that logicalrep_relmap_invalidate_cb is not called after
the DDL on the subscriber, so the relmap cache is not invalidated. We then end
up with slot->tts_tupleDescriptor->natts greater than rel->remoterel->natts in
slot_store_cstrings, leading to the overflow on attrmap and the SIGSEGV.

I haven't followed this path further yet.

By the way, I noticed attrmap is declared as AttrNumber * in struct
LogicalRepRelMapEntry, AttrNumber being typedef'd as an int16. However, attrmap
is allocated based on sizeof(int) in logicalrep_rel_open:

entry->attrmap = palloc(desc->natts * sizeof(int));

It doesn't look like a major problem; it just allocates more memory than
needed.
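
The obvious correction would be to size the array on AttrNumber instead; a
one-line sketch (hypothetical, not necessarily the exact committed change):

entry->attrmap = palloc(desc->natts * sizeof(AttrNumber));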

Regards,

#5Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Jehan-Guillaume de Rorthais (#4)
Re: segmentation fault when cassert enabled

On 2019-11-05 17:29, Jehan-Guillaume de Rorthais wrote:

My best bet so far is that logicalrep_relmap_invalidate_cb is not called after
the DDL on the subscriber, so the relmap cache is not invalidated. We then end
up with slot->tts_tupleDescriptor->natts greater than rel->remoterel->natts in
slot_store_cstrings, leading to the overflow on attrmap and the SIGSEGV.

It looks like something like that is happening. But it shouldn't.
Different table schemas on publisher and subscriber are well supported,
so this must be an edge case of some kind. Please continue investigating.

By the way, I noticed attrmap is declared as AttrNumber * in struct
LogicalRepRelMapEntry, AttrNumber being typedef'd as an int16. However, attrmap
is allocated based on sizeof(int) in logicalrep_rel_open:

entry->attrmap = palloc(desc->natts * sizeof(int));

It doesn't look like a major problem; it just allocates more memory than
needed.

Right. I have committed a fix for this.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Kyotaro Horiguchi (#3)
Re: segmentation fault when cassert enabled

On Mon, 28 Oct 2019 16:47:02 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Fri, 25 Oct 2019 12:28:38 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes:

When investigating the bug reported in the thread "logical replication -
negative bitmapset member not allowed", I found a way to make PostgreSQL
segfault, but only when cassert is enabled.
...
I haven't had time to dig further yet. However, I don't understand why this
crash is only triggered when cassert is enabled.

Most likely, it's not so much assertions that provoke the crash as
CLOBBER_FREED_MEMORY, ie the actual problem here is use of already-freed
memory.

Agreed.

By the way, I didn't get a crash from Jehan's script with an
--enable-cassert build of the master HEAD from a few days ago.

I am now working with HEAD and I can confirm I am able to make it crash 99% of
the time using my script.
It feels like a race condition between cache invalidation and record
processing in worker.c. Make sure you have enough write activity during the
test.

FWIW, I sometimes got SEGV-ish crashes or mysterious misbehavior when
some structs were changed and I didn't do "make clean". Rarely, I
needed "make distclean". (Yeah, I didn't usually turn on
--enable-depend.)

I'm paranoid, I always do:

* make distclean
* git reset; git clean -df
* ./configure && make install

Regards,

#7Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Peter Eisentraut (#5)
1 attachment(s)
Re: segmentation fault when cassert enabled

On Wed, 6 Nov 2019 14:34:38 +0100
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

On 2019-11-05 17:29, Jehan-Guillaume de Rorthais wrote:

My best bet so far is that logicalrep_relmap_invalidate_cb is not called
after the DDL on the subscriber, so the relmap cache is not invalidated. We
then end up with slot->tts_tupleDescriptor->natts greater than
rel->remoterel->natts in slot_store_cstrings, leading to the overflow on
attrmap and the SIGSEGV.

It looks like something like that is happening. But it shouldn't.
Different table schemas on publisher and subscriber are well supported,
so this must be an edge case of some kind. Please continue investigating.

I've been able to find the origin of the crash, but it was a long journey.

<debugger hard life>

I was unable to debug using gdb record because of this famous error:

Process record does not support instruction 0xc5 at address 0x1482758a4b30.

Program stopped.
__memset_avx2_unaligned_erms ()
at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:168
168 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such
file or directory.

Trying with rr, I had constant "stack depth limit exceeded" errors, even with
an unlimited limit. Is it worth opening a discussion or a wiki page about these
tools? Peter, it looks like you have some experience with rr, any feedback?

Finally, Julien Rouhaud spent some time with me after work hours, answering
my questions about some parts of the code, and pointed me to the excellent
backtrace_functions GUC addition from a few days ago. This finally did the
trick to find out what was happening. Many thanks Julien!

</debugger hard life>

Back to the bug itself. Consider a working logical replication setup with
constant update/insert activity, e.g. pgbench running against the provider.

Now, on the subscriber side, a session issues an "ALTER TABLE ADD
COLUMN" on a subscribed table, e.g. pgbench_branches. A cache invalidation
message is then pending for this table.

In the meantime, the logical replication worker receives an UPDATE to apply. It
opens the local relation using "logicalrep_rel_open". It finds that the related
entry in LogicalRepRelMap is valid, so it does not update its attrmap
and directly opens and locks the local relation:

/* Need to update the local cache? */
if (!OidIsValid(entry->localreloid))
{
[...updates attrmap here...]
}
else
entry->localrel = table_open(entry->localreloid, lockmode);

However, when locking the table, the code in LockRelationOid() actually
processes any pending invalidation messages:

LockRelationOid(Oid relid, LOCKMODE lockmode)
{
[...]
/*
* Now that we have the lock, check for invalidation messages, so that we
* will update or flush any stale relcache entry before we try to use it.
* RangeVarGetRelid() specifically relies on us for this. We can skip
* this in the not-uncommon case that we already had the same type of lock
* being requested, since then no one else could have modified the
* relcache entry in an undesirable way. (In the case where our own xact
* modifies the rel, the relcache update happens via
* CommandCounterIncrement, not here.)
*
* However, in corner cases where code acts on tables (usually catalogs)
* recursively, we might get here while still processing invalidation
* messages in some outer execution of this function or a sibling. The
* "cleared" status of the lock tells us whether we really are done
* absorbing relevant inval messages.
*/
if (res != LOCKACQUIRE_ALREADY_CLEAR)
{
AcceptInvalidationMessages();
MarkLockClear(locallock);
}
}

We end up with attrmap referencing N columns while the relcache references N+1
columns. Later, in apply_handle_update(), we build a TupleTableSlot based on
the relcache representation, and we crash by reading past the end of attrmap
while filling this larger slot in slot_store_cstrings().
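
To make the mismatch concrete, here is a toy, self-contained C simulation (not
PostgreSQL code, all names invented): the map is built for the remote column
count, but the loop is driven by the local tuple descriptor, which now reports
one extra attribute; the memory just beyond the map is modeled explicitly here
and filled with the 0x7F clobber byte.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	int		remote_natts = 3;	/* attrmap was built for 3 columns */
	int		local_natts = 4;	/* relcache now reports 4 columns */
	int16_t	storage[4];
	int16_t *attrmap = storage;	/* only the first remote_natts entries are valid */

	memset(storage, 0x7F, sizeof(storage));	/* model what lies beyond the map */
	for (int i = 0; i < remote_natts; i++)
		attrmap[i] = (int16_t) i;			/* valid mappings */

	/* loop driven by the local descriptor: the last entry is garbage (32639) */
	for (int i = 0; i < local_natts; i++)
		printf("attrmap[%d] = %d\n", i, attrmap[i]);

	return 0;
}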

Please find attached a bugfix proposal. It simply moves the attrmap update to
after the table_open() call.

Lastly, I was wondering whether entry->attrmap should be pfree'd before
palloc'ing it again during the rebuild, to avoid a memory leak?
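
Something along these lines, purely as a sketch of the question (the sizing on
AttrNumber is the corrected form discussed earlier):

if (entry->attrmap != NULL)
	pfree(entry->attrmap);
entry->attrmap = palloc(desc->natts * sizeof(AttrNumber));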

Regards,

Attachments:

0001-Fix-subscriber-invalid-memory-access-on-DDL.patch (text/x-patch)
From 4295b277952a46378f01211bd37075f20223aadc Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Mon, 25 Nov 2019 15:21:38 +0100
Subject: [PATCH] Fix subscriber invalid memory access on DDL

Before this patch, the attrmap structure mapping the local fields
to remote ones for a subscribed relation was rebuild before handling
pending invalidation messages and potential relcache updates. This
was leading to an invalid memory access by overflowing the attrmap
when building the related tuple slot received from the provider.
---
 src/backend/replication/logical/relation.c | 57 ++++++++++++++++------
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index b386f8460d..cfc34d49e0 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -220,6 +220,8 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 {
 	LogicalRepRelMapEntry *entry;
 	bool		found;
+	Oid		localreloid = InvalidOid;
+	LogicalRepRelation	*remoterel;
 
 	if (LogicalRepRelMap == NULL)
 		logicalrep_relmap_init();
@@ -232,29 +234,56 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 		elog(ERROR, "no relation map entry for remote relation ID %u",
 			 remoteid);
 
-	/* Need to update the local cache? */
+	remoterel = &entry->remoterel;
+
+	/*
+	 * entry->localreloid is set to InvalidOid when the local relation is
+	 * either unknown or invalidated.
+	 * Moreover, when opening and locking a relation, pending invalidation
+	 * messages are processed.
+	 * Because of that, we first need to open the local relation, then rebuild
+	 * the mapping of local attribute if the relation is invalidated.
+	 */
+
+	/*
+	 * Unknown or invalidated local relation. First look for its oid and open
+	 * it. We will build the mapping of local attributes after.
+	 */
 	if (!OidIsValid(entry->localreloid))
 	{
 		Oid			relid;
-		int			i;
-		int			found;
-		Bitmapset  *idkey;
-		TupleDesc	desc;
-		LogicalRepRelation *remoterel;
-		MemoryContext oldctx;
-
-		remoterel = &entry->remoterel;
 
-		/* Try to find and lock the relation by name. */
+		/* Try to find the relation by name */
 		relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,
 											  remoterel->relname, -1),
-								 lockmode, true);
+								 NoLock, true);
 		if (!OidIsValid(relid))
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 					 errmsg("logical replication target relation \"%s.%s\" does not exist",
 							remoterel->nspname, remoterel->relname)));
-		entry->localrel = table_open(relid, NoLock);
+
+		localreloid = relid;
+	}
+	else
+		localreloid = entry->localreloid;
+
+	/* Actually open and lock the related local relation */
+	entry->localrel = table_open(localreloid, lockmode);
+
+	/*
+	 * Pending invalidation messages might have been processed while grabbing
+	 * the lock during table_open, applying some DDL operations that are then
+	 * reflected in the relcache.
+	 * Because of that, we need to rebuild the mapping of local attribute after
+	 * the relation is opened.
+	 */
+	if (!OidIsValid(entry->localreloid)) {
+		int			found;
+		Bitmapset  *idkey;
+		TupleDesc	desc;
+		MemoryContext oldctx;
+		int			i;
 
 		/* Check for supported relkind. */
 		CheckSubscriptionRelkind(entry->localrel->rd_rel->relkind,
@@ -348,10 +377,8 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 			}
 		}
 
-		entry->localreloid = relid;
+		entry->localreloid = localreloid;
 	}
-	else
-		entry->localrel = table_open(entry->localreloid, lockmode);
 
 	if (entry->state != SUBREL_STATE_READY)
 		entry->state = GetSubscriptionRelState(MySubscription->oid,
-- 
2.20.1

#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#7)
Re: segmentation fault when cassert enabled

On Mon, Nov 25, 2019 at 8:25 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

On Wed, 6 Nov 2019 14:34:38 +0100
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

On 2019-11-05 17:29, Jehan-Guillaume de Rorthais wrote:

My best bet so far is that logicalrep_relmap_invalidate_cb is not called
after the DDL on the subscriber, so the relmap cache is not invalidated. We
then end up with slot->tts_tupleDescriptor->natts greater than
rel->remoterel->natts in slot_store_cstrings, leading to the overflow on
attrmap and the SIGSEGV.

It looks like something like that is happening. But it shouldn't.
Different table schemas on publisher and subscriber are well supported,
so this must be an edge case of some kind. Please continue investigating.

I've been able to find the origin of the crash, but it was a long journey.

<debugger hard life>

I was unable to debug using gdb record because of this famous error:

Process record does not support instruction 0xc5 at address 0x1482758a4b30.

Program stopped.
__memset_avx2_unaligned_erms ()
at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:168
168 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such
file or directory.

Trying with rr, I had constant "stack depth limit exceeded" errors, even with
an unlimited limit. Is it worth opening a discussion or a wiki page about these
tools? Peter, it looks like you have some experience with rr, any feedback?

Finally, Julien Rouhaud spent some time with me after work hours, answering
my questions about some parts of the code, and pointed me to the excellent
backtrace_functions GUC addition from a few days ago. This finally did the
trick to find out what was happening. Many thanks Julien!

</debugger hard life>

Back to the bug itself. Consider a working logical replication setup with
constant update/insert activity, e.g. pgbench running against the provider.

Now, on the subscriber side, a session issues an "ALTER TABLE ADD
COLUMN" on a subscribed table, e.g. pgbench_branches. A cache invalidation
message is then pending for this table.

In the meantime, the logical replication worker receives an UPDATE to apply. It
opens the local relation using "logicalrep_rel_open". It finds that the related
entry in LogicalRepRelMap is valid, so it does not update its attrmap
and directly opens and locks the local relation:

- /* Try to find and lock the relation by name. */
+ /* Try to find the relation by name */
  relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,\
    remoterel->relname, -1),
- lockmode, true);
+ NoLock, true);

I think we can't do this because it could lead to locking the wrong
reloid. See RangeVarGetRelidExtended. It ensures that, after locking
the relation (which includes accepting invalidation messages), the
reloid is correct. I think changing the code in the way you are
suggesting could lead to locking an incorrect reloid.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#9vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#8)
1 attachment(s)
Re: segmentation fault when cassert enabled

On Fri, Dec 6, 2019 at 5:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 25, 2019 at 8:25 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

On Wed, 6 Nov 2019 14:34:38 +0100
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

On 2019-11-05 17:29, Jehan-Guillaume de Rorthais wrote:

My best bet so far is that logicalrep_relmap_invalidate_cb is not called
after the DDL on the subscriber, so the relmap cache is not invalidated. We
then end up with slot->tts_tupleDescriptor->natts greater than
rel->remoterel->natts in slot_store_cstrings, leading to the overflow on
attrmap and the SIGSEGV.

It looks like something like that is happening. But it shouldn't.
Different table schemas on publisher and subscriber are well supported,
so this must be an edge case of some kind. Please continue investigating.

I've been able to find the origin of the crash, but it was a long journey.

<debugger hard life>

I was unable to debug using gdb record because of this famous error:

Process record does not support instruction 0xc5 at address 0x1482758a4b30.

Program stopped.
__memset_avx2_unaligned_erms ()
at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:168
168 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such
file or directory.

Trying with rr, I had constant "stack depth limit exceeded" errors, even with
an unlimited limit. Is it worth opening a discussion or a wiki page about these
tools? Peter, it looks like you have some experience with rr, any feedback?

Finally, Julien Rouhaud spent some time with me after work hours, answering
my questions about some parts of the code, and pointed me to the excellent
backtrace_functions GUC addition from a few days ago. This finally did the
trick to find out what was happening. Many thanks Julien!

</debugger hard life>

Back to the bug itself. Consider a working logical replication setup with
constant update/insert activity, e.g. pgbench running against the provider.

Now, on the subscriber side, a session issues an "ALTER TABLE ADD
COLUMN" on a subscribed table, e.g. pgbench_branches. A cache invalidation
message is then pending for this table.

In the meantime, the logical replication worker receives an UPDATE to apply. It
opens the local relation using "logicalrep_rel_open". It finds that the related
entry in LogicalRepRelMap is valid, so it does not update its attrmap
and directly opens and locks the local relation:

- /* Try to find and lock the relation by name. */
+ /* Try to find the relation by name */
relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,\
remoterel->relname, -1),
- lockmode, true);
+ NoLock, true);

I think we can't do this because it could lead to locking the wrong
reloid. See RangeVarGetRelidExtended. It ensures that, after locking
the relation (which includes accepting invalidation messages), the
reloid is correct. I think changing the code in the way you are
suggesting could lead to locking an incorrect reloid.

I have made changes to address the comment provided. The patch for the
same is attached. I could not add a test case for this scenario as it is based
on a timing issue.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Fix-subscriber-invalid-memory-access-on-DDL.patch (application/x-patch)
From 9bec5e5250c5af034096ed9ab966c2c237518976 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>, vignesh.c <vignesh21@gmail.com>
Date: Fri, 13 Dec 2019 11:59:13 +0530
Subject: [PATCH] Fix subscriber invalid memory access on DDL

Before this patch, the attrmap structure mapping the local fields
to remote ones for a subscribed relation was rebuild before handling
pending invalidation messages and potential relcache updates. This
was leading to an invalid memory access by overflowing the attrmap
when building the related tuple slot received from the provider.
---
 src/backend/replication/logical/relation.c | 53 ++++++++++++++++++++++--------
 1 file changed, 39 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index b386f84..9009ebc 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -220,6 +220,8 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 {
 	LogicalRepRelMapEntry *entry;
 	bool		found;
+	Oid			relid = InvalidOid;
+	LogicalRepRelation	*remoterel;
 
 	if (LogicalRepRelMap == NULL)
 		logicalrep_relmap_init();
@@ -232,20 +234,24 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 		elog(ERROR, "no relation map entry for remote relation ID %u",
 			 remoteid);
 
-	/* Need to update the local cache? */
-	if (!OidIsValid(entry->localreloid))
-	{
-		Oid			relid;
-		int			i;
-		int			found;
-		Bitmapset  *idkey;
-		TupleDesc	desc;
-		LogicalRepRelation *remoterel;
-		MemoryContext oldctx;
+	remoterel = &entry->remoterel;
 
-		remoterel = &entry->remoterel;
+	/*
+	 * entry->localreloid is set to InvalidOid when the local relation is
+	 * either unknown or invalidated.
+	 * Moreover, when opening and locking a relation, pending invalidation
+	 * messages are processed.
+	 * Because of that, we first need to open the local relation, then rebuild
+	 * the mapping of local attribute if the relation is invalidated.
+	 */
 
-		/* Try to find and lock the relation by name. */
+	/*
+	 * Unknown or invalidated local relation. First look for its oid and open
+	 * it. We will build the mapping of local attributes after.
+	 */
+	if (!OidIsValid(entry->localreloid))
+	{
+		/* Try to find the relation by name */
 		relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,
 											  remoterel->relname, -1),
 								 lockmode, true);
@@ -256,6 +262,27 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 							remoterel->nspname, remoterel->relname)));
 		entry->localrel = table_open(relid, NoLock);
 
+	}
+	else
+	{
+		relid = entry->localreloid;
+		entry->localrel = table_open(entry->localreloid, lockmode);
+	}
+
+	/*
+	 * Pending invalidation messages might have been processed while grabbing
+	 * the lock during table_open, applying some DDL operations that are then
+	 * reflected in the relcache.
+	 * Because of that, we need to rebuild the mapping of local attribute after
+	 * the relation is opened.
+	 */
+	if (!OidIsValid(entry->localreloid)) {
+		int			found;
+		Bitmapset  *idkey;
+		TupleDesc	desc;
+		MemoryContext oldctx;
+		int			i;
+
 		/* Check for supported relkind. */
 		CheckSubscriptionRelkind(entry->localrel->rd_rel->relkind,
 								 remoterel->nspname, remoterel->relname);
@@ -350,8 +377,6 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 
 		entry->localreloid = relid;
 	}
-	else
-		entry->localrel = table_open(entry->localreloid, lockmode);
 
 	if (entry->state != SUBREL_STATE_READY)
 		entry->state = GetSubscriptionRelState(MySubscription->oid,
-- 
1.8.3.1

#10Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#9)
1 attachment(s)
Re: segmentation fault when cassert enabled

On Fri, Dec 13, 2019 at 12:10 PM vignesh C <vignesh21@gmail.com> wrote:

I have made changes to address the comment provided. The patch for the
same is attached. I could not add a test case for this scenario as it is based
on a timing issue.
Thoughts?

I agree that this is a timing issue. I also don't see a way to write
a reproducible test for this. However, I could reproduce it via
debugger consistently by following the below steps. I have updated a
few comments and commit messages in the attached patch.

Peter E., Petr J or anyone else, do you have comments or objections on
this patch? If none, then I am planning to commit (and backpatch)
this patch in a few days time.

Test steps to reproduce the issue.
Set up
---------
set up master and subscriber nodes.
In code, add a while(true) in apply_handle_update() before the call to
logicalrep_rel_open(). This is to ensure that we can debug the replay of the
Update operation on the subscriber (a sketch of this hook follows after these
steps).

Master
-----------
Create table t1(c1 int);
Create publication pub_t1 for table t1;
Alter table t1 replica identity full;

Subscriber
-------------
Create table t1(c1 int);
CREATE SUBSCRIPTION sub_t1 CONNECTION 'host=localhost port=5432
dbname=postgres' PUBLICATION pub_t1;

Master
----------
Insert into t1 values(1); --this will create LogicalRepRelMap entry
for t1 on subscriber.

Subscriber
----------
Select * from t1; -- This should display the data inserted in master.

Master
----------
Update t1 set c1 = 2 where c1=1;

Now, on the subscriber, attach a debugger to the apply worker, debug
logicalrep_rel_open, and stop just before the table_open call.

Subscriber
-----------
Alter table t1 add c2 int;

Now, continue in the debugger; it should crash in slot_store_cstrings()
because rel->attrmap is not updated.
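
For reference, the debugging hook mentioned in the setup step could look
roughly like this inside apply_handle_update(), just before the
logicalrep_rel_open() call (a temporary hack to be removed after the
experiment; the flag name is invented):

{
	/* park the apply worker until a debugger attaches and clears the flag */
	volatile bool	hold = true;

	while (hold)
		pg_usleep(100000L);	/* from the debugger: set hold = false to proceed */
}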

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Fix-subscriber-invalid-memory-access-on-DDL.amit.1.patch (application/octet-stream)
From 9b4b92bf35517d95983c1d386944e90f0984b59b Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Mon, 16 Dec 2019 15:23:46 +0530
Subject: [PATCH] Fix subscriber invalid memory access on DDL.

This patch allows building the local cache that maps the local attribute
numbers to remote ones for a subscribed relation to be built after
processing pending invalidation messages and potential relcache updates.
Without this, the attributes in the local cache don't tally with the
updated relcache entry leading to invalid memory access.

Reported-by Jehan-Guillaume de Rorthais
Author: Jehan-Guillaume de Rorthais and Vignesh C
Reviewed-by: Amit Kapila
Backpatch-through: 10
Discussion: https://postgr.es/m/20191025175929.7e90dbf5@firost
---
 src/backend/replication/logical/relation.c | 40 +++++++++++++++++++-----------
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index b386f84..ea47138 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -220,6 +220,8 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 {
 	LogicalRepRelMapEntry *entry;
 	bool		found;
+	Oid			relid = InvalidOid;
+	LogicalRepRelation *remoterel;
 
 	if (LogicalRepRelMap == NULL)
 		logicalrep_relmap_init();
@@ -232,20 +234,17 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 		elog(ERROR, "no relation map entry for remote relation ID %u",
 			 remoteid);
 
-	/* Need to update the local cache? */
+	remoterel = &entry->remoterel;
+
+	/*
+	 * When opening and locking a relation, pending invalidation messages are
+	 * processed which can invalidate the relation.  We need to update the
+	 * local cache both when we are first time accessing the relation and when
+	 * the relation is invalidated (aka entry->localreloid is set InvalidOid).
+	 */
 	if (!OidIsValid(entry->localreloid))
 	{
-		Oid			relid;
-		int			i;
-		int			found;
-		Bitmapset  *idkey;
-		TupleDesc	desc;
-		LogicalRepRelation *remoterel;
-		MemoryContext oldctx;
-
-		remoterel = &entry->remoterel;
-
-		/* Try to find and lock the relation by name. */
+		/* Try to find and lock the relation by name */
 		relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,
 											  remoterel->relname, -1),
 								 lockmode, true);
@@ -256,6 +255,21 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 							remoterel->nspname, remoterel->relname)));
 		entry->localrel = table_open(relid, NoLock);
 
+	}
+	else
+	{
+		relid = entry->localreloid;
+		entry->localrel = table_open(entry->localreloid, lockmode);
+	}
+
+	if (!OidIsValid(entry->localreloid))
+	{
+		int			found;
+		Bitmapset  *idkey;
+		TupleDesc	desc;
+		MemoryContext oldctx;
+		int			i;
+
 		/* Check for supported relkind. */
 		CheckSubscriptionRelkind(entry->localrel->rd_rel->relkind,
 								 remoterel->nspname, remoterel->relname);
@@ -350,8 +364,6 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 
 		entry->localreloid = relid;
 	}
-	else
-		entry->localrel = table_open(entry->localreloid, lockmode);
 
 	if (entry->state != SUBREL_STATE_READY)
 		entry->state = GetSubscriptionRelState(MySubscription->oid,
-- 
1.8.3.1

#11Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Amit Kapila (#10)
Re: segmentation fault when cassert enabled

On 2019-12-16 11:11, Amit Kapila wrote:

I agree that this is a timing issue. I also don't see a way to write
a reproducible test for this. However, I could reproduce it via
debugger consistently by following the below steps. I have updated a
few comments and commit messages in the attached patch.

Peter E., Petr J or anyone else, do you have comments or objections on
this patch? If none, then I am planning to commit (and backpatch)
this patch in a few days time.

The patch seems fine to me. Writing a test seems hard. Let's skip it.

The commit message has a duplicate "building"/"built" in the first sentence.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: vignesh C (#9)
Re: segmentation fault when cassert enabled

On Fri, 13 Dec 2019 12:10:07 +0530
vignesh C <vignesh21@gmail.com> wrote:

On Fri, Dec 6, 2019 at 5:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 25, 2019 at 8:25 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

On Wed, 6 Nov 2019 14:34:38 +0100
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

On 2019-11-05 17:29, Jehan-Guillaume de Rorthais wrote:

My best bet so far is that logicalrep_relmap_invalidate_cb is not called
after the DDL on the subscriber, so the relmap cache is not invalidated. We
then end up with slot->tts_tupleDescriptor->natts greater than
rel->remoterel->natts in slot_store_cstrings, leading to the overflow on
attrmap and the SIGSEGV.

It looks like something like that is happening. But it shouldn't.
Different table schemas on publisher and subscriber are well supported,
so this must be an edge case of some kind. Please continue
investigating.

I've been able to find the origin of the crash, but it was a long journey.

<debugger hard life>

I was unable to debug using gdb record because of this famous error:

Process record does not support instruction 0xc5 at address
0x1482758a4b30.

Program stopped.
__memset_avx2_unaligned_erms ()
at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:168
168 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such
file or directory.

Trying with rr, I had constant "stack depth limit exceeded" errors, even with
an unlimited limit. Is it worth opening a discussion or a wiki page about these
tools? Peter, it looks like you have some experience with rr, any feedback?

Finally, Julien Rouhaud spent some time with me after work hours, answering
my questions about some parts of the code, and pointed me to the excellent
backtrace_functions GUC addition from a few days ago. This finally did the
trick to find out what was happening. Many thanks Julien!

</debugger hard life>

Back to the bug itself. Consider a working logical replication setup with
constant update/insert activity, e.g. pgbench running against the provider.

Now, on the subscriber side, a session issues an "ALTER TABLE ADD
COLUMN" on a subscribed table, e.g. pgbench_branches. A cache invalidation
message is then pending for this table.

In the meantime, the logical replication worker receives an UPDATE to apply. It
opens the local relation using "logicalrep_rel_open". It finds that the related
entry in LogicalRepRelMap is valid, so it does not update its attrmap
and directly opens and locks the local relation:

- /* Try to find and lock the relation by name. */
+ /* Try to find the relation by name */
relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,\
remoterel->relname, -1),
- lockmode, true);
+ NoLock, true);

I think we can't do this because it could lead to locking the wrong
reloid. See RangeVarGetRelidExtended. It ensures that, after locking
the relation (which includes accepting invalidation messages), the
reloid is correct. I think changing the code in the way you are
suggesting could lead to locking an incorrect reloid.

Sorry for the delay, I couldn't answer earlier.

To be honest, I was wondering about that. Since we keep the relid in the cache
and use it for cache invalidation, I thought it might be fragile. But then - as
far as I could find - the only way to change the relid is to drop and create a
new table. I wasn't sure it could really cause a race condition there, given
the impact of such commands on logical replication.

But now, I realize I should have gone all the way through and closed this
potential bug as well. Thank you.

I have made changes to address the comment provided. The patch for the
same is attached. I could not add a test case for this scenario as it is based
on a timing issue.

Thank you for this fix Vignesh!

#13Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Peter Eisentraut (#11)
Re: segmentation fault when cassert enabled

On Mon, 16 Dec 2019 13:27:43 +0100
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

On 2019-12-16 11:11, Amit Kapila wrote:

I agree that this is a timing issue. I also don't see a way to write
a reproducible test for this. However, I could reproduce it via
debugger consistently by following the below steps. I have updated a
few comments and commit messages in the attached patch.

Peter E., Petr J or anyone else, do you have comments or objections on
this patch? If none, then I am planning to commit (and backpatch)
this patch in a few days time.

The patch seems fine to me. Writing a test seems hard. Let's skip it.

The commit message has a duplicate "building"/"built" in the first sentence.

I think the sentence is quite long. I had to re-read it to get it.

What about:

This patch allows building the local relmap cache for a subscribed relation
after processing pending invalidation messages and potential relcache
updates.

Regards,

#14Amit Kapila
amit.kapila16@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#13)
1 attachment(s)
Re: segmentation fault when cassert enabled

On Mon, Dec 16, 2019 at 9:16 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

On Mon, 16 Dec 2019 13:27:43 +0100
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

On 2019-12-16 11:11, Amit Kapila wrote:

I agree that this is a timing issue. I also don't see a way to write
a reproducible test for this. However, I could reproduce it via
debugger consistently by following the below steps. I have updated a
few comments and commit messages in the attached patch.

Peter E., Petr J or anyone else, do you have comments or objections on
this patch? If none, then I am planning to commit (and backpatch)
this patch in a few days time.

The patch seems fine to me. Writing a test seems hard. Let's skip it.

Okay.

The commit message has a duplicate "building"/"built" in the first sentence.

I think the sentence is quite long. I had to re-read it to get it.

What about:

This patch allows building the local relmap cache for a subscribed relation
after processing pending invalidation messages and potential relcache
updates.

Attached patch with updated commit message based on suggestions. I am
planning to commit this tomorrow unless there are more comments.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Fix-subscriber-invalid-memory-access-on-DDL.amit.2.patch (application/octet-stream)
From 15d6db4c64d641ec31f1dc2caa0fe1f0900b3887 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Mon, 16 Dec 2019 15:23:46 +0530
Subject: [PATCH] Fix subscriber invalid memory access on DDL.

This patch allows building the local relmap cache for a subscribed
relation after processing pending invalidation messages and potential
relcache updates.  Without this, the attributes in the local cache don't
tally with the updated relcache entry leading to invalid memory access.

Reported-by Jehan-Guillaume de Rorthais
Author: Jehan-Guillaume de Rorthais and Vignesh C
Reviewed-by: Amit Kapila
Backpatch-through: 10
Discussion: https://postgr.es/m/20191025175929.7e90dbf5@firost
---
 src/backend/replication/logical/relation.c | 40 +++++++++++++++++++-----------
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index b386f84..ea47138 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -220,6 +220,8 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 {
 	LogicalRepRelMapEntry *entry;
 	bool		found;
+	Oid			relid = InvalidOid;
+	LogicalRepRelation *remoterel;
 
 	if (LogicalRepRelMap == NULL)
 		logicalrep_relmap_init();
@@ -232,20 +234,17 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 		elog(ERROR, "no relation map entry for remote relation ID %u",
 			 remoteid);
 
-	/* Need to update the local cache? */
+	remoterel = &entry->remoterel;
+
+	/*
+	 * When opening and locking a relation, pending invalidation messages are
+	 * processed which can invalidate the relation.  We need to update the
+	 * local cache both when we are first time accessing the relation and when
+	 * the relation is invalidated (aka entry->localreloid is set InvalidOid).
+	 */
 	if (!OidIsValid(entry->localreloid))
 	{
-		Oid			relid;
-		int			i;
-		int			found;
-		Bitmapset  *idkey;
-		TupleDesc	desc;
-		LogicalRepRelation *remoterel;
-		MemoryContext oldctx;
-
-		remoterel = &entry->remoterel;
-
-		/* Try to find and lock the relation by name. */
+		/* Try to find and lock the relation by name */
 		relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,
 											  remoterel->relname, -1),
 								 lockmode, true);
@@ -256,6 +255,21 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 							remoterel->nspname, remoterel->relname)));
 		entry->localrel = table_open(relid, NoLock);
 
+	}
+	else
+	{
+		relid = entry->localreloid;
+		entry->localrel = table_open(entry->localreloid, lockmode);
+	}
+
+	if (!OidIsValid(entry->localreloid))
+	{
+		int			found;
+		Bitmapset  *idkey;
+		TupleDesc	desc;
+		MemoryContext oldctx;
+		int			i;
+
 		/* Check for supported relkind. */
 		CheckSubscriptionRelkind(entry->localrel->rd_rel->relkind,
 								 remoterel->nspname, remoterel->relname);
@@ -350,8 +364,6 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 
 		entry->localreloid = relid;
 	}
-	else
-		entry->localrel = table_open(entry->localreloid, lockmode);
 
 	if (entry->state != SUBREL_STATE_READY)
 		entry->state = GetSubscriptionRelState(MySubscription->oid,
-- 
1.8.3.1

#15vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#14)
1 attachment(s)
Re: segmentation fault when cassert enabled

On Tue, Dec 17, 2019 at 10:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 16, 2019 at 9:16 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

On Mon, 16 Dec 2019 13:27:43 +0100
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

On 2019-12-16 11:11, Amit Kapila wrote:

I agree that this is a timing issue. I also don't see a way to write
a reproducible test for this. However, I could reproduce it via
debugger consistently by following the below steps. I have updated a
few comments and commit messages in the attached patch.

Peter E., Petr J or anyone else, do you have comments or objections on
this patch? If none, then I am planning to commit (and backpatch)
this patch in a few days time.

The patch seems fine to me. Writing a test seems hard. Let's skip it.

Okay.

The commit message has a duplicate "building"/"built" in the first sentence.

I think the sentence is quite long. I had to re-read it to get it.

What about:

This patch allows building the local relmap cache for a subscribed relation
after processing pending invalidation messages and potential relcache
updates.

Attached patch with updated commit message based on suggestions. I am
planning to commit this tomorrow unless there are more comments.

While testing the patch on back versions, I found that the patch does
not apply on the PG 11 & PG 10 branches. The attached patch has the changes for
the PG 11 & PG 10 branches. The only difference in the patch is that table_open
needed to be changed to heap_open. I have verified the patch on the back
branches and found it to be working. For PG 12 & current, the previous patch
from Amit needs to be used, so I'm not reattaching it here.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Fix-subscriber-invalid-memory-access-on-DDL_PG_11.patch (text/x-patch)
From 2c4123d514473a9203c3617655229286a84487e6 Mon Sep 17 00:00:00 2001
From: vignesh <vignesh@localhost.localdomain>
Date: Tue, 17 Dec 2019 17:05:35 +0530
Subject: [PATCH] Fix subscriber invalid memory access on DDL.

This patch allows building the local relmap cache for a subscribed
relation after processing pending invalidation messages and potential
relcache updates.  Without this, the attributes in the local cache don't
tally with the updated relcache entry leading to invalid memory access.

Reported-by Jehan-Guillaume de Rorthais
Author: Jehan-Guillaume de Rorthais and Vignesh C
Reviewed-by: Amit Kapila
Backpatch-through: 10
Discussion: https://postgr.es/m/20191025175929.7e90dbf5@firost
---
 src/backend/replication/logical/relation.c | 40 +++++++++++++++++++-----------
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index cf6f0a7..e1bbabd 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -220,6 +220,8 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 {
 	LogicalRepRelMapEntry *entry;
 	bool		found;
+	Oid			relid = InvalidOid;
+	LogicalRepRelation *remoterel;
 
 	if (LogicalRepRelMap == NULL)
 		logicalrep_relmap_init();
@@ -232,20 +234,17 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 		elog(ERROR, "no relation map entry for remote relation ID %u",
 			 remoteid);
 
-	/* Need to update the local cache? */
+	remoterel = &entry->remoterel;
+
+	/*
+	 * When opening and locking a relation, pending invalidation messages are
+	 * processed which can invalidate the relation.  We need to update the
+	 * local cache both when we are first time accessing the relation and when
+	 * the relation is invalidated (aka entry->localreloid is set InvalidOid).
+	 */
 	if (!OidIsValid(entry->localreloid))
 	{
-		Oid			relid;
-		int			i;
-		int			found;
-		Bitmapset  *idkey;
-		TupleDesc	desc;
-		LogicalRepRelation *remoterel;
-		MemoryContext oldctx;
-
-		remoterel = &entry->remoterel;
-
-		/* Try to find and lock the relation by name. */
+		/* Try to find and lock the relation by name */
 		relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,
 											  remoterel->relname, -1),
 								 lockmode, true);
@@ -256,6 +255,21 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 							remoterel->nspname, remoterel->relname)));
 		entry->localrel = heap_open(relid, NoLock);
 
+	}
+	else
+	{
+		relid = entry->localreloid;
+		entry->localrel = heap_open(entry->localreloid, lockmode);
+	}
+
+	if (!OidIsValid(entry->localreloid))
+	{
+		int			found;
+		Bitmapset  *idkey;
+		TupleDesc	desc;
+		MemoryContext oldctx;
+		int			i;
+
 		/* Check for supported relkind. */
 		CheckSubscriptionRelkind(entry->localrel->rd_rel->relkind,
 								 remoterel->nspname, remoterel->relname);
@@ -350,8 +364,6 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 
 		entry->localreloid = relid;
 	}
-	else
-		entry->localrel = heap_open(entry->localreloid, lockmode);
 
 	if (entry->state != SUBREL_STATE_READY)
 		entry->state = GetSubscriptionRelState(MySubscription->oid,
-- 
1.8.3.1

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#15)
Re: segmentation fault when cassert enabled

On Tue, Dec 17, 2019 at 6:01 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, Dec 17, 2019 at 10:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached patch with updated commit message based on suggestions. I am
planning to commit this tomorrow unless there are more comments.

While testing the patch on back versions, I found that the patch does
not apply on the PG 11 & PG 10 branches. The attached patch has the changes for
the PG 11 & PG 10 branches. The only difference in the patch is that table_open
needed to be changed to heap_open. I have verified the patch on the back
branches and found it to be working. For PG 12 & current, the previous patch
from Amit needs to be used, so I'm not reattaching it here.

Pushed.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#17Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Amit Kapila (#16)
Re: segmentation fault when cassert enabled

On Wed, 18 Dec 2019 08:46:01 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 17, 2019 at 6:01 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, Dec 17, 2019 at 10:09 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:

Attached patch with updated commit message based on suggestions. I am
planning to commit this tomorrow unless there are more comments.

While testing the patch on back versions, I found that the patch does
not apply on the PG 11 & PG 10 branches. The attached patch has the changes for
the PG 11 & PG 10 branches. The only difference in the patch is that table_open
needed to be changed to heap_open. I have verified the patch on the back
branches and found it to be working. For PG 12 & current, the previous patch
from Amit needs to be used, so I'm not reattaching it here.

Pushed.

Thanks!