logical replication worker accesses catalogs in error context callback

Started by Andres Freundover 5 years ago36 messageshackers
Jump to latest
#1Andres Freund
andres@anarazel.de

Hi,

Due to a debug ereport I just noticed that worker.c's
slot_store_error_callback is doing something quite dangerous:

static void
slot_store_error_callback(void *arg)
{
SlotErrCallbackArg *errarg = (SlotErrCallbackArg *) arg;
LogicalRepRelMapEntry *rel;
char *remotetypname;
Oid remotetypoid,
localtypoid;

/* Nothing to do if remote attribute number is not set */
if (errarg->remote_attnum < 0)
return;

rel = errarg->rel;
remotetypoid = rel->remoterel.atttyps[errarg->remote_attnum];

/* Fetch remote type name from the LogicalRepTypMap cache */
remotetypname = logicalrep_typmap_gettypname(remotetypoid);

/* Fetch local type OID from the local sys cache */
localtypoid = get_atttype(rel->localreloid, errarg->local_attnum + 1);

errcontext("processing remote data for replication target relation \"%s.%s\" column \"%s\", "
"remote type %s, local type %s",
rel->remoterel.nspname, rel->remoterel.relname,
rel->remoterel.attnames[errarg->remote_attnum],
remotetypname,
format_type_be(localtypoid));
}

that's not code that can run in an error context callback. It's
theoretically possible (but unlikely) that
logicalrep_typmap_gettypname() is safe to run in an error context
callback. But get_atttype() definitely isn't.

get_attype() may do catalog accesses. That definitely can't happen
inside an error context callback - the entire transaction might be
borked at this point!

I think this was basically broken from the start in

commit 665d1fad99e7b11678b0d5fa24d2898424243cd6
Author: Peter Eisentraut <peter_e@gmx.net>
Date: 2017-01-19 12:00:00 -0500

Logical replication

but then got a lot worse in

commit 24c0a6c649768f428929e76dd7f5012ec9b93ce1
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: 2018-03-14 21:34:26 -0300

logical replication: fix OID type mapping mechanism

Greetings,

Andres Freund

#2Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#1)
Re: logical replication worker accesses catalogs in error context callback

On Wed, Jan 6, 2021 at 11:02 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

Due to a debug ereport I just noticed that worker.c's
slot_store_error_callback is doing something quite dangerous:

static void
slot_store_error_callback(void *arg)
{
SlotErrCallbackArg *errarg = (SlotErrCallbackArg *) arg;
LogicalRepRelMapEntry *rel;
char *remotetypname;
Oid remotetypoid,
localtypoid;

/* Nothing to do if remote attribute number is not set */
if (errarg->remote_attnum < 0)
return;

rel = errarg->rel;
remotetypoid = rel->remoterel.atttyps[errarg->remote_attnum];

/* Fetch remote type name from the LogicalRepTypMap cache */
remotetypname = logicalrep_typmap_gettypname(remotetypoid);

/* Fetch local type OID from the local sys cache */
localtypoid = get_atttype(rel->localreloid, errarg->local_attnum + 1);

errcontext("processing remote data for replication target relation \"%s.%s\" column \"%s\", "
"remote type %s, local type %s",
rel->remoterel.nspname, rel->remoterel.relname,
rel->remoterel.attnames[errarg->remote_attnum],
remotetypname,
format_type_be(localtypoid));
}

that's not code that can run in an error context callback. It's
theoretically possible (but unlikely) that
logicalrep_typmap_gettypname() is safe to run in an error context
callback. But get_atttype() definitely isn't.

get_attype() may do catalog accesses. That definitely can't happen
inside an error context callback - the entire transaction might be
borked at this point!

You're right. Perhaps calling to format_type_be() is also dangerous
since it does catalog access. We should have added the local type
names to SlotErrCallbackArg so we avoid catalog access in the error
context.

I'll try to fix this.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#3Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#2)
Re: logical replication worker accesses catalogs in error context callback

On Wed, Jan 6, 2021 at 11:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 6, 2021 at 11:02 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

Due to a debug ereport I just noticed that worker.c's
slot_store_error_callback is doing something quite dangerous:

static void
slot_store_error_callback(void *arg)
{
SlotErrCallbackArg *errarg = (SlotErrCallbackArg *) arg;
LogicalRepRelMapEntry *rel;
char *remotetypname;
Oid remotetypoid,
localtypoid;

/* Nothing to do if remote attribute number is not set */
if (errarg->remote_attnum < 0)
return;

rel = errarg->rel;
remotetypoid = rel->remoterel.atttyps[errarg->remote_attnum];

/* Fetch remote type name from the LogicalRepTypMap cache */
remotetypname = logicalrep_typmap_gettypname(remotetypoid);

/* Fetch local type OID from the local sys cache */
localtypoid = get_atttype(rel->localreloid, errarg->local_attnum + 1);

errcontext("processing remote data for replication target relation \"%s.%s\" column \"%s\", "
"remote type %s, local type %s",
rel->remoterel.nspname, rel->remoterel.relname,
rel->remoterel.attnames[errarg->remote_attnum],
remotetypname,
format_type_be(localtypoid));
}

that's not code that can run in an error context callback. It's
theoretically possible (but unlikely) that
logicalrep_typmap_gettypname() is safe to run in an error context
callback. But get_atttype() definitely isn't.

get_attype() may do catalog accesses. That definitely can't happen
inside an error context callback - the entire transaction might be
borked at this point!

You're right. Perhaps calling to format_type_be() is also dangerous
since it does catalog access. We should have added the local type
names to SlotErrCallbackArg so we avoid catalog access in the error
context.

I'll try to fix this.

Attached the patch that fixes this issue.

Since logicalrep_typmap_gettypname() could search the sys cache by
calling to format_type_be(), I stored both local and remote type names
to SlotErrCallbackArg so that we can just set the names in the error
callback without sys cache lookup.

Please review it.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

Attachments:

fix_slot_store_error_callback.patchapplication/octet-stream; name=fix_slot_store_error_callback.patchDownload+28-22
#4Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Masahiko Sawada (#3)
Re: logical replication worker accesses catalogs in error context callback

On Thu, Jan 7, 2021 at 5:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 6, 2021 at 11:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 6, 2021 at 11:02 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

Due to a debug ereport I just noticed that worker.c's
slot_store_error_callback is doing something quite dangerous:

static void
slot_store_error_callback(void *arg)
{
SlotErrCallbackArg *errarg = (SlotErrCallbackArg *) arg;
LogicalRepRelMapEntry *rel;
char *remotetypname;
Oid remotetypoid,
localtypoid;

/* Nothing to do if remote attribute number is not set */
if (errarg->remote_attnum < 0)
return;

rel = errarg->rel;
remotetypoid = rel->remoterel.atttyps[errarg->remote_attnum];

/* Fetch remote type name from the LogicalRepTypMap cache */
remotetypname = logicalrep_typmap_gettypname(remotetypoid);

/* Fetch local type OID from the local sys cache */
localtypoid = get_atttype(rel->localreloid, errarg->local_attnum + 1);

errcontext("processing remote data for replication target relation \"%s.%s\" column \"%s\", "
"remote type %s, local type %s",
rel->remoterel.nspname, rel->remoterel.relname,
rel->remoterel.attnames[errarg->remote_attnum],
remotetypname,
format_type_be(localtypoid));
}

that's not code that can run in an error context callback. It's
theoretically possible (but unlikely) that
logicalrep_typmap_gettypname() is safe to run in an error context
callback. But get_atttype() definitely isn't.

get_attype() may do catalog accesses. That definitely can't happen
inside an error context callback - the entire transaction might be
borked at this point!

You're right. Perhaps calling to format_type_be() is also dangerous
since it does catalog access. We should have added the local type
names to SlotErrCallbackArg so we avoid catalog access in the error
context.

I'll try to fix this.

Attached the patch that fixes this issue.

Since logicalrep_typmap_gettypname() could search the sys cache by
calling to format_type_be(), I stored both local and remote type names
to SlotErrCallbackArg so that we can just set the names in the error
callback without sys cache lookup.

Please review it.

Patch looks good, except one minor comment - can we store
rel->remoterel.atttyps[remoteattnum] into a local variable and use it
in logicalrep_typmap_gettypname, just to save on indentation?

I quickly searched in places where error callbacks are being used, I
think we need a similar kind of fix for conversion_error_callback in
postgres_fdw.c, because get_attname and get_rel_name are being used
which do catalogue lookups. ISTM that all the other error callbacks
are good in the sense that they are not doing sys catalogue lookups.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#5Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Bharath Rupireddy (#4)
Re: logical replication worker accesses catalogs in error context callback

On Sat, Jan 9, 2021 at 2:57 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Thu, Jan 7, 2021 at 5:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 6, 2021 at 11:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 6, 2021 at 11:02 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

Due to a debug ereport I just noticed that worker.c's
slot_store_error_callback is doing something quite dangerous:

static void
slot_store_error_callback(void *arg)
{
SlotErrCallbackArg *errarg = (SlotErrCallbackArg *) arg;
LogicalRepRelMapEntry *rel;
char *remotetypname;
Oid remotetypoid,
localtypoid;

/* Nothing to do if remote attribute number is not set */
if (errarg->remote_attnum < 0)
return;

rel = errarg->rel;
remotetypoid = rel->remoterel.atttyps[errarg->remote_attnum];

/* Fetch remote type name from the LogicalRepTypMap cache */
remotetypname = logicalrep_typmap_gettypname(remotetypoid);

/* Fetch local type OID from the local sys cache */
localtypoid = get_atttype(rel->localreloid, errarg->local_attnum + 1);

errcontext("processing remote data for replication target relation \"%s.%s\" column \"%s\", "
"remote type %s, local type %s",
rel->remoterel.nspname, rel->remoterel.relname,
rel->remoterel.attnames[errarg->remote_attnum],
remotetypname,
format_type_be(localtypoid));
}

that's not code that can run in an error context callback. It's
theoretically possible (but unlikely) that
logicalrep_typmap_gettypname() is safe to run in an error context
callback. But get_atttype() definitely isn't.

get_attype() may do catalog accesses. That definitely can't happen
inside an error context callback - the entire transaction might be
borked at this point!

You're right. Perhaps calling to format_type_be() is also dangerous
since it does catalog access. We should have added the local type
names to SlotErrCallbackArg so we avoid catalog access in the error
context.

I'll try to fix this.

Attached the patch that fixes this issue.

Since logicalrep_typmap_gettypname() could search the sys cache by
calling to format_type_be(), I stored both local and remote type names
to SlotErrCallbackArg so that we can just set the names in the error
callback without sys cache lookup.

Please review it.

Patch looks good, except one minor comment - can we store
rel->remoterel.atttyps[remoteattnum] into a local variable and use it
in logicalrep_typmap_gettypname, just to save on indentation?

Thank you for reviewing the patch!

Agreed. Attached the updated patch.

I quickly searched in places where error callbacks are being used, I
think we need a similar kind of fix for conversion_error_callback in
postgres_fdw.c, because get_attname and get_rel_name are being used
which do catalogue lookups. ISTM that all the other error callbacks
are good in the sense that they are not doing sys catalogue lookups.

Indeed. If we need to disallow the catalog lookup during executing
error callbacks we might want to have an assertion checking that in
SearchCatCacheInternal(), in addition to Assert(IsTransactionState()).

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

Attachments:

fix_slot_store_error_callback_v2.patchapplication/octet-stream; name=fix_slot_store_error_callback_v2.patchDownload+28-22
#6Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Masahiko Sawada (#5)
Re: logical replication worker accesses catalogs in error context callback

On Mon, Jan 11, 2021 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Agreed. Attached the updated patch.

Thanks for the updated patch. Looks like the comment crosses the 80
char limit, I have corrected it. And also changed the variable name
from remotetypeid to remotetypid, so that logicalrep_typmap_gettypname
will not cross the 80 char limit. And also added a commit message.
Attaching v3 patch, please have a look. Both make check and make
check-world passes.

I quickly searched in places where error callbacks are being used, I
think we need a similar kind of fix for conversion_error_callback in
postgres_fdw.c, because get_attname and get_rel_name are being used
which do catalogue lookups. ISTM that all the other error callbacks
are good in the sense that they are not doing sys catalogue lookups.

Indeed. If we need to disallow the catalog lookup during executing
error callbacks we might want to have an assertion checking that in
SearchCatCacheInternal(), in addition to Assert(IsTransactionState()).

I tried to add(as attached in
v1-0001-Avoid-Sys-Cache-Lookups-in-Error-Callbacks.patch) the
Assert(!error_context_stack); in SearchCatCacheInternal, initdb itself
fails [1]running bootstrap script ... TRAP: FailedAssertion("error_context_stack", File: "catcache.c", Line: 1220, PID: 310728) /home/bharath/workspace/postgres/instnew/bin/postgres(ExceptionalCondition+0xd0)[0x56056984c8ba] /home/bharath/workspace/postgres/instnew/bin/postgres(+0x76eb1a)[0x560569826b1a] /home/bharath/workspace/postgres/instnew/bin/postgres(SearchCatCache1+0x3a)[0x5605698269af] /home/bharath/workspace/postgres/instnew/bin/postgres(SearchSysCache1+0xc1)[0x5605698448b2] /home/bharath/workspace/postgres/instnew/bin/postgres(get_typtype+0x1f)[0x56056982e389] /home/bharath/workspace/postgres/instnew/bin/postgres(CheckAttributeType+0x29)[0x5605692bafe9] /home/bharath/workspace/postgres/instnew/bin/postgres(CheckAttributeNamesTypes+0x2c9)[0x5605692bafac] /home/bharath/workspace/postgres/instnew/bin/postgres(heap_create_with_catalog+0x11f)[0x5605692bc436] /home/bharath/workspace/postgres/instnew/bin/postgres(boot_yyparse+0x7f0)[0x5605692a0d3f] /home/bharath/workspace/postgres/instnew/bin/postgres(+0x1ecb36)[0x5605692a4b36] /home/bharath/workspace/postgres/instnew/bin/postgres(AuxiliaryProcessMain+0x5e0)[0x5605692a4997] /home/bharath/workspace/postgres/instnew/bin/postgres(main+0x268)[0x5605694c6bce] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fa2777ce0b3] /home/bharath/workspace/postgres/instnew/bin/postgres(_start+0x2e)[0x5605691794de]. That means, we are doing a bunch of catalogue access from
error context callbacks. Given this, I'm not quite sure whether we can
have such an assertion in SearchCatCacheInternal.

Although unrelated to what we are discussing here - when I looked at
SearchCatCacheInternal, I found that the function SearchCatCache has
SearchCatCacheInternal in the function comment, I think we should
correct it. Thoughts? If okay, I will post a separate patch for this
minor comment fix.

[1]: running bootstrap script ... TRAP: FailedAssertion("error_context_stack", File: "catcache.c", Line: 1220, PID: 310728) /home/bharath/workspace/postgres/instnew/bin/postgres(ExceptionalCondition+0xd0)[0x56056984c8ba] /home/bharath/workspace/postgres/instnew/bin/postgres(+0x76eb1a)[0x560569826b1a] /home/bharath/workspace/postgres/instnew/bin/postgres(SearchCatCache1+0x3a)[0x5605698269af] /home/bharath/workspace/postgres/instnew/bin/postgres(SearchSysCache1+0xc1)[0x5605698448b2] /home/bharath/workspace/postgres/instnew/bin/postgres(get_typtype+0x1f)[0x56056982e389] /home/bharath/workspace/postgres/instnew/bin/postgres(CheckAttributeType+0x29)[0x5605692bafe9] /home/bharath/workspace/postgres/instnew/bin/postgres(CheckAttributeNamesTypes+0x2c9)[0x5605692bafac] /home/bharath/workspace/postgres/instnew/bin/postgres(heap_create_with_catalog+0x11f)[0x5605692bc436] /home/bharath/workspace/postgres/instnew/bin/postgres(boot_yyparse+0x7f0)[0x5605692a0d3f] /home/bharath/workspace/postgres/instnew/bin/postgres(+0x1ecb36)[0x5605692a4b36] /home/bharath/workspace/postgres/instnew/bin/postgres(AuxiliaryProcessMain+0x5e0)[0x5605692a4997] /home/bharath/workspace/postgres/instnew/bin/postgres(main+0x268)[0x5605694c6bce] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fa2777ce0b3] /home/bharath/workspace/postgres/instnew/bin/postgres(_start+0x2e)[0x5605691794de]
running bootstrap script ... TRAP:
FailedAssertion("error_context_stack", File: "catcache.c", Line: 1220,
PID: 310728)
/home/bharath/workspace/postgres/instnew/bin/postgres(ExceptionalCondition+0xd0)[0x56056984c8ba]
/home/bharath/workspace/postgres/instnew/bin/postgres(+0x76eb1a)[0x560569826b1a]
/home/bharath/workspace/postgres/instnew/bin/postgres(SearchCatCache1+0x3a)[0x5605698269af]
/home/bharath/workspace/postgres/instnew/bin/postgres(SearchSysCache1+0xc1)[0x5605698448b2]
/home/bharath/workspace/postgres/instnew/bin/postgres(get_typtype+0x1f)[0x56056982e389]
/home/bharath/workspace/postgres/instnew/bin/postgres(CheckAttributeType+0x29)[0x5605692bafe9]
/home/bharath/workspace/postgres/instnew/bin/postgres(CheckAttributeNamesTypes+0x2c9)[0x5605692bafac]
/home/bharath/workspace/postgres/instnew/bin/postgres(heap_create_with_catalog+0x11f)[0x5605692bc436]
/home/bharath/workspace/postgres/instnew/bin/postgres(boot_yyparse+0x7f0)[0x5605692a0d3f]
/home/bharath/workspace/postgres/instnew/bin/postgres(+0x1ecb36)[0x5605692a4b36]
/home/bharath/workspace/postgres/instnew/bin/postgres(AuxiliaryProcessMain+0x5e0)[0x5605692a4997]
/home/bharath/workspace/postgres/instnew/bin/postgres(main+0x268)[0x5605694c6bce]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fa2777ce0b3]
/home/bharath/workspace/postgres/instnew/bin/postgres(_start+0x2e)[0x5605691794de]

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v3-0001-fix-slot-store-error-callback.patchapplication/x-patch; name=v3-0001-fix-slot-store-error-callback.patchDownload+28-23
v1-0001-Avoid-Sys-Cache-Lookups-in-Error-Callbacks.patchapplication/x-patch; name=v1-0001-Avoid-Sys-Cache-Lookups-in-Error-Callbacks.patchDownload+7-1
#7Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Bharath Rupireddy (#6)
Re: logical replication worker accesses catalogs in error context callback

On Mon, Jan 11, 2021 at 4:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Mon, Jan 11, 2021 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Agreed. Attached the updated patch.

Thanks for the updated patch. Looks like the comment crosses the 80
char limit, I have corrected it. And also changed the variable name
from remotetypeid to remotetypid, so that logicalrep_typmap_gettypname
will not cross the 80 char limit. And also added a commit message.
Attaching v3 patch, please have a look. Both make check and make
check-world passes.

Thanks! The change looks good to me.

I quickly searched in places where error callbacks are being used, I
think we need a similar kind of fix for conversion_error_callback in
postgres_fdw.c, because get_attname and get_rel_name are being used
which do catalogue lookups. ISTM that all the other error callbacks
are good in the sense that they are not doing sys catalogue lookups.

Indeed. If we need to disallow the catalog lookup during executing
error callbacks we might want to have an assertion checking that in
SearchCatCacheInternal(), in addition to Assert(IsTransactionState()).

I tried to add(as attached in
v1-0001-Avoid-Sys-Cache-Lookups-in-Error-Callbacks.patch) the
Assert(!error_context_stack); in SearchCatCacheInternal, initdb itself
fails [1]. That means, we are doing a bunch of catalogue access from
error context callbacks. Given this, I'm not quite sure whether we can
have such an assertion in SearchCatCacheInternal.

I think checking !error_context_stack is not a correct check if we're
executing an error context callback since it's a stack to store
callbacks. It can be non-NULL by setting an error callback, see
setup_parser_errposition_callback() for instance. Probably we need to
check if (recursion_depth > 0) and elevel. Attached a patch for that
as an example.

Although unrelated to what we are discussing here - when I looked at
SearchCatCacheInternal, I found that the function SearchCatCache has
SearchCatCacheInternal in the function comment, I think we should
correct it. Thoughts? If okay, I will post a separate patch for this
minor comment fix.

Perhaps you mean this?

/*
* SearchCatCacheInternal
*
* This call searches a system cache for a tuple, opening the relation
* if necessary (on the first access to a particular cache).
*
* The result is NULL if not found, or a pointer to a HeapTuple in
* the cache. The caller must not modify the tuple, and must call
* ReleaseCatCache() when done with it.
*
* The search key values should be expressed as Datums of the key columns'
* datatype(s). (Pass zeroes for any unused parameters.) As a special
* exception, the passed-in key for a NAME column can be just a C string;
* the caller need not go to the trouble of converting it to a fully
* null-padded NAME.
*/
HeapTuple
SearchCatCache(CatCache *cache,

Looking at commit 141fd1b66 it intentionally changed to
SearchCatCacheInternal from SearchCatCache but it seems to me that
it's a typo.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

Attachments:

avoid_sys_cache_lookup_in_error_callback_v2.patchapplication/octet-stream; name=avoid_sys_cache_lookup_in_error_callback_v2.patchDownload+30-0
#8Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Masahiko Sawada (#7)
Re: logical replication worker accesses catalogs in error context callback

On Tue, Jan 12, 2021 at 9:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jan 11, 2021 at 4:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Mon, Jan 11, 2021 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Agreed. Attached the updated patch.

Thanks for the updated patch. Looks like the comment crosses the 80
char limit, I have corrected it. And also changed the variable name
from remotetypeid to remotetypid, so that logicalrep_typmap_gettypname
will not cross the 80 char limit. And also added a commit message.
Attaching v3 patch, please have a look. Both make check and make
check-world passes.

Thanks! The change looks good to me.

Thanks!

I quickly searched in places where error callbacks are being used, I
think we need a similar kind of fix for conversion_error_callback in
postgres_fdw.c, because get_attname and get_rel_name are being used
which do catalogue lookups. ISTM that all the other error callbacks
are good in the sense that they are not doing sys catalogue lookups.

Indeed. If we need to disallow the catalog lookup during executing
error callbacks we might want to have an assertion checking that in
SearchCatCacheInternal(), in addition to Assert(IsTransactionState()).

I tried to add(as attached in
v1-0001-Avoid-Sys-Cache-Lookups-in-Error-Callbacks.patch) the
Assert(!error_context_stack); in SearchCatCacheInternal, initdb itself
fails [1]. That means, we are doing a bunch of catalogue access from
error context callbacks. Given this, I'm not quite sure whether we can
have such an assertion in SearchCatCacheInternal.

I think checking !error_context_stack is not a correct check if we're
executing an error context callback since it's a stack to store
callbacks. It can be non-NULL by setting an error callback, see
setup_parser_errposition_callback() for instance.

Thanks! Yes, you are right, even though we are not processing the
actual error callback, the error_context_stack can be non-null, hence
the Assert(!error_context_stack); doesn't make any sense.

Probably we need to check if (recursion_depth > 0) and elevel.
Attached a patch for that as an example.

IIUC, we must be knowing in SearchCatCacheInternal, whether errstart
is called with elevel >= ERROR and we have recursion_depth > 0. If
both are true, then the assertion in SearchCatCacheInternal should
fail. I see that in your patch, elevel is being fetched from
errordata, that's fine. What I'm not quite clear is the following
assumption:

+    /* If we doesn't set any error data yet, assume it's an error */
+    if (errordata_stack_depth == -1)
+        return true;

Is it always true that we are in error processing when
errordata_stack_depth is -1, what happens if errordata_stack_depth <
-1? Maybe I'm missing something.

IMHO, adding an assertion in SearchCatCacheInternal(which is a most
generic code part within the server) with a few error context global
variables may not be always safe. Because we might miss using the
error context variables properly. Instead, we could add a comment in
ErrorContextCallback structure saying something like, "it's not
recommended to access any system catalogues within an error context
callback when the callback is expected to be called while processing
an error, because the transaction might have been broken in that
case." And let the future callback developers take care of it.

Thoughts?

As I said earlier [1], currently only two(there could be more) error
context callbacks access the sys catalogues, one is in
slot_store_error_callback which will be fixed with your patch. Another
is in conversion_error_callback, we can also fix this by storing the
relname, attname and other required info in ConversionLocation,
something like the attached. Please have a look.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-0001-Avoid-Catalogue-Accesses-In-conversion_error_call.patchapplication/x-patch; name=v1-0001-Avoid-Catalogue-Accesses-In-conversion_error_call.patchDownload+68-35
#9Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Bharath Rupireddy (#8)
Re: logical replication worker accesses catalogs in error context callback

On Tue, Jan 12, 2021 at 6:33 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Tue, Jan 12, 2021 at 9:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jan 11, 2021 at 4:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Mon, Jan 11, 2021 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Agreed. Attached the updated patch.

Thanks for the updated patch. Looks like the comment crosses the 80
char limit, I have corrected it. And also changed the variable name
from remotetypeid to remotetypid, so that logicalrep_typmap_gettypname
will not cross the 80 char limit. And also added a commit message.
Attaching v3 patch, please have a look. Both make check and make
check-world passes.

Thanks! The change looks good to me.

Thanks!

I quickly searched in places where error callbacks are being used, I
think we need a similar kind of fix for conversion_error_callback in
postgres_fdw.c, because get_attname and get_rel_name are being used
which do catalogue lookups. ISTM that all the other error callbacks
are good in the sense that they are not doing sys catalogue lookups.

Indeed. If we need to disallow the catalog lookup during executing
error callbacks we might want to have an assertion checking that in
SearchCatCacheInternal(), in addition to Assert(IsTransactionState()).

I tried to add(as attached in
v1-0001-Avoid-Sys-Cache-Lookups-in-Error-Callbacks.patch) the
Assert(!error_context_stack); in SearchCatCacheInternal, initdb itself
fails [1]. That means, we are doing a bunch of catalogue access from
error context callbacks. Given this, I'm not quite sure whether we can
have such an assertion in SearchCatCacheInternal.

I think checking !error_context_stack is not a correct check if we're
executing an error context callback since it's a stack to store
callbacks. It can be non-NULL by setting an error callback, see
setup_parser_errposition_callback() for instance.

Thanks! Yes, you are right, even though we are not processing the
actual error callback, the error_context_stack can be non-null, hence
the Assert(!error_context_stack); doesn't make any sense.

Probably we need to check if (recursion_depth > 0) and elevel.
Attached a patch for that as an example.

IIUC, we must be knowing in SearchCatCacheInternal, whether errstart
is called with elevel >= ERROR and we have recursion_depth > 0. If
both are true, then the assertion in SearchCatCacheInternal should
fail. I see that in your patch, elevel is being fetched from
errordata, that's fine. What I'm not quite clear is the following
assumption:

+    /* If we doesn't set any error data yet, assume it's an error */
+    if (errordata_stack_depth == -1)
+        return true;

Is it always true that we are in error processing when
errordata_stack_depth is -1, what happens if errordata_stack_depth <
-1? Maybe I'm missing something.

You're right. I missed something.

IMHO, adding an assertion in SearchCatCacheInternal(which is a most
generic code part within the server) with a few error context global
variables may not be always safe. Because we might miss using the
error context variables properly. Instead, we could add a comment in
ErrorContextCallback structure saying something like, "it's not
recommended to access any system catalogues within an error context
callback when the callback is expected to be called while processing
an error, because the transaction might have been broken in that
case." And let the future callback developers take care of it.

Thoughts?

That sounds good to me. But I still also see the value to add an
assertion in SearchCatCacheInternal(). If we had it, we could find
these two bugs earlier.

Anyway, this seems to be unrelated to this bug fixing so we can start
a new thread for that.

As I said earlier [1], currently only two(there could be more) error
context callbacks access the sys catalogues, one is in
slot_store_error_callback which will be fixed with your patch. Another
is in conversion_error_callback, we can also fix this by storing the
relname, attname and other required info in ConversionLocation,
something like the attached. Please have a look.

Thoughts?

Thank you for the patch!

Here are some comments:

+static void set_error_callback_info(ConversionLocation *errpos,
+                                          Relation rel, int cur_attno,
+                                          ForeignScanState *fsstate);

I'm concerned a bit that the function name is generic. How about
set_conversion_error_callback_info() or something to make the name
clear?

---
+static void
+conversion_error_callback(void *arg)
+{
+       ConversionLocation *errpos = (ConversionLocation *) arg;
+
+       if (errpos->show_generic_message)
+               errcontext("processing expression at position %d in
select list",
+                                       errpos->cur_attno);
+

I think we can set this generic message to the error context when
errpos->relname is NULL instead of using show_generic_message.

---
+       /*
+        * Set error context callback info, so that we could avoid accessing
+        * the system catalogues while processing the error in
+        * conversion_error_callback. System catalogue accesses are not safe in
+        * error context callbacks because the transaction might have been
+        * broken by then.
+        */
+       set_error_callback_info(&errpos, rel, i, fsstate);

Looking at other code, we use "catalog" rather than "catalogue" in
most places. Is it better to use "catalog" for consistency? FYI, the
"catalogue" is used at only three places in the server code:

$ git grep "catalogue" -- '*.[ch]'
src/backend/access/brin/brin.c: * This routine scans the complete
index looking for uncatalogued index pages,
src/backend/catalog/pg_constraint.c: * given relation; or InvalidOid
if no such index is catalogued.
src/backend/executor/execMain.c: * As in case of the catalogued
constraints, we treat a NULL result as

FYI I've added those bug fixes to the next Commitfest[1]https://commitfest.postgresql.org/32/2955/.

Regards,

[1]: https://commitfest.postgresql.org/32/2955/

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#10Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Masahiko Sawada (#9)
Re: logical replication worker accesses catalogs in error context callback

On Tue, Jan 26, 2021 at 11:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

IMHO, adding an assertion in SearchCatCacheInternal(which is a most
generic code part within the server) with a few error context global
variables may not be always safe. Because we might miss using the
error context variables properly. Instead, we could add a comment in
ErrorContextCallback structure saying something like, "it's not
recommended to access any system catalogues within an error context
callback when the callback is expected to be called while processing
an error, because the transaction might have been broken in that
case." And let the future callback developers take care of it.

Thoughts?

That sounds good to me. But I still also see the value to add an
assertion in SearchCatCacheInternal(). If we had it, we could find
these two bugs earlier.

Anyway, this seems to be unrelated to this bug fixing so we can start
a new thread for that.

+1 to start a new thread for that.

As I said earlier [1], currently only two(there could be more) error
context callbacks access the sys catalogues, one is in
slot_store_error_callback which will be fixed with your patch. Another
is in conversion_error_callback, we can also fix this by storing the
relname, attname and other required info in ConversionLocation,
something like the attached. Please have a look.

Thoughts?

Thank you for the patch!

Here are some comments:

Thanks for the review comments.

+static void set_error_callback_info(ConversionLocation *errpos,
+                                          Relation rel, int cur_attno,
+                                          ForeignScanState *fsstate);

I'm concerned a bit that the function name is generic. How about
set_conversion_error_callback_info() or something to make the name
clear?

Done.

---
+static void
+conversion_error_callback(void *arg)
+{
+       ConversionLocation *errpos = (ConversionLocation *) arg;
+
+       if (errpos->show_generic_message)
+               errcontext("processing expression at position %d in
select list",
+                                       errpos->cur_attno);
+

I think we can set this generic message to the error context when
errpos->relname is NULL instead of using show_generic_message.

Right. Done.

---
+       /*
+        * Set error context callback info, so that we could avoid accessing
+        * the system catalogues while processing the error in
+        * conversion_error_callback. System catalogue accesses are not safe in
+        * error context callbacks because the transaction might have been
+        * broken by then.
+        */
+       set_error_callback_info(&errpos, rel, i, fsstate);

Looking at other code, we use "catalog" rather than "catalogue" in
most places. Is it better to use "catalog" for consistency? FYI, the
"catalogue" is used at only three places in the server code:

Changed it to "catalog".

FYI I've added those bug fixes to the next Commitfest - https://commitfest.postgresql.org/32/2955/

Thanks. I'm attaching 2 patches to make it easy for reviewing and also
they will get a chance to be tested by cf bot.

0001 - for avoiding system catalog access in slot_store_error_callback.
0002 - for avoiding system catalog access in conversion_error_callback

Please review it further.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v3-0001-Avoid-Catalogue-Accesses-In-slot_store_error_call.patchapplication/octet-stream; name=v3-0001-Avoid-Catalogue-Accesses-In-slot_store_error_call.patchDownload+27-22
v3-0002-Avoid-Catalogue-Accesses-In-conversion_error_call.patchapplication/octet-stream; name=v3-0002-Avoid-Catalogue-Accesses-In-conversion_error_call.patchDownload+62-36
#11Zhihong Yu
zyu@yugabyte.com
In reply to: Bharath Rupireddy (#10)
Re: logical replication worker accesses catalogs in error context callback

For 0002, a few comments on the description:

bq. Avoid accessing system catalogues inside conversion_error_callback

catalogues -> catalog

bq. error context callback, because the the entire transaction might

There is redundant 'the'

bq. Since get_attname() and get_rel_name() could search the sys cache by
store the required information

store -> storing

Cheers

On Tue, Jan 26, 2021 at 5:17 PM Bharath Rupireddy <
bharath.rupireddyforpostgres@gmail.com> wrote:

Show quoted text

On Tue, Jan 26, 2021 at 11:29 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

IMHO, adding an assertion in SearchCatCacheInternal(which is a most
generic code part within the server) with a few error context global
variables may not be always safe. Because we might miss using the
error context variables properly. Instead, we could add a comment in
ErrorContextCallback structure saying something like, "it's not
recommended to access any system catalogues within an error context
callback when the callback is expected to be called while processing
an error, because the transaction might have been broken in that
case." And let the future callback developers take care of it.

Thoughts?

That sounds good to me. But I still also see the value to add an
assertion in SearchCatCacheInternal(). If we had it, we could find
these two bugs earlier.

Anyway, this seems to be unrelated to this bug fixing so we can start
a new thread for that.

+1 to start a new thread for that.

As I said earlier [1], currently only two(there could be more) error
context callbacks access the sys catalogues, one is in
slot_store_error_callback which will be fixed with your patch. Another
is in conversion_error_callback, we can also fix this by storing the
relname, attname and other required info in ConversionLocation,
something like the attached. Please have a look.

Thoughts?

Thank you for the patch!

Here are some comments:

Thanks for the review comments.

+static void set_error_callback_info(ConversionLocation *errpos,
+                                          Relation rel, int cur_attno,
+                                          ForeignScanState *fsstate);

I'm concerned a bit that the function name is generic. How about
set_conversion_error_callback_info() or something to make the name
clear?

Done.

---
+static void
+conversion_error_callback(void *arg)
+{
+       ConversionLocation *errpos = (ConversionLocation *) arg;
+
+       if (errpos->show_generic_message)
+               errcontext("processing expression at position %d in
select list",
+                                       errpos->cur_attno);
+

I think we can set this generic message to the error context when
errpos->relname is NULL instead of using show_generic_message.

Right. Done.

---
+       /*
+        * Set error context callback info, so that we could avoid

accessing

+        * the system catalogues while processing the error in
+        * conversion_error_callback. System catalogue accesses are not

safe in

+ * error context callbacks because the transaction might have

been

+        * broken by then.
+        */
+       set_error_callback_info(&errpos, rel, i, fsstate);

Looking at other code, we use "catalog" rather than "catalogue" in
most places. Is it better to use "catalog" for consistency? FYI, the
"catalogue" is used at only three places in the server code:

Changed it to "catalog".

FYI I've added those bug fixes to the next Commitfest -

https://commitfest.postgresql.org/32/2955/

Thanks. I'm attaching 2 patches to make it easy for reviewing and also
they will get a chance to be tested by cf bot.

0001 - for avoiding system catalog access in slot_store_error_callback.
0002 - for avoiding system catalog access in conversion_error_callback

Please review it further.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#12Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Zhihong Yu (#11)
Re: logical replication worker accesses catalogs in error context callback

On Wed, Jan 27, 2021 at 7:48 AM Zhihong Yu <zyu@yugabyte.com> wrote:

For 0002, a few comments on the description:

bq. Avoid accessing system catalogues inside conversion_error_callback

catalogues -> catalog

bq. error context callback, because the the entire transaction might

There is redundant 'the'

bq. Since get_attname() and get_rel_name() could search the sys cache by
store the required information

store -> storing

Thanks for pointing to the changes in the commit message. I corrected
them. Attaching v4 patch set, consider it for further review.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v4-0001-Avoid-Catalogue-Accesses-In-slot_store_error_call.patchapplication/x-patch; name=v4-0001-Avoid-Catalogue-Accesses-In-slot_store_error_call.patchDownload+27-22
v4-0002-Avoid-Catalogue-Accesses-In-conversion_error_call.patchapplication/x-patch; name=v4-0002-Avoid-Catalogue-Accesses-In-conversion_error_call.patchDownload+62-36
#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Bharath Rupireddy (#12)
Re: logical replication worker accesses catalogs in error context callback

On Wed, Jan 27, 2021 at 9:38 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Wed, Jan 27, 2021 at 7:48 AM Zhihong Yu <zyu@yugabyte.com> wrote:

Thanks for pointing to the changes in the commit message. I corrected
them. Attaching v4 patch set, consider it for further review.

About 0001, have we tried to reproduce the actual bug here which means
when the error_callback is called we should face some problem? I feel
with the correct testcase we should hit the Assert
(Assert(IsTransactionState());) in SearchCatCacheInternal because
there we expect the transaction to be in a valid state. I understand
that the transaction is in a broken state at that time but having a
testcase to hit the actual bug makes it easy to test the fix.

--
With Regards,
Amit Kapila.

#14Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#13)
Re: logical replication worker accesses catalogs in error context callback

Hi,

On 2021-01-28 11:14:03 +0530, Amit Kapila wrote:

On Wed, Jan 27, 2021 at 9:38 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Wed, Jan 27, 2021 at 7:48 AM Zhihong Yu <zyu@yugabyte.com> wrote:

Thanks for pointing to the changes in the commit message. I corrected
them. Attaching v4 patch set, consider it for further review.

About 0001, have we tried to reproduce the actual bug here which means
when the error_callback is called we should face some problem? I feel
with the correct testcase we should hit the Assert
(Assert(IsTransactionState());) in SearchCatCacheInternal because
there we expect the transaction to be in a valid state. I understand
that the transaction is in a broken state at that time but having a
testcase to hit the actual bug makes it easy to test the fix.

I think it's easy to hit - add a trivial DEBUG message to XLogWrite(),
and run anything involving logical rep using a low enough log
level. There might even be enough messages with a low enough level
already somewhere in the relevant paths.

Greetings,

Andres Freund

#15Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Amit Kapila (#13)
Re: logical replication worker accesses catalogs in error context callback

On Thu, Jan 28, 2021 at 11:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 27, 2021 at 9:38 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Wed, Jan 27, 2021 at 7:48 AM Zhihong Yu <zyu@yugabyte.com> wrote:

Thanks for pointing to the changes in the commit message. I corrected
them. Attaching v4 patch set, consider it for further review.

About 0001, have we tried to reproduce the actual bug here which means
when the error_callback is called we should face some problem? I feel
with the correct testcase we should hit the Assert
(Assert(IsTransactionState());) in SearchCatCacheInternal because
there we expect the transaction to be in a valid state. I understand
that the transaction is in a broken state at that time but having a
testcase to hit the actual bug makes it easy to test the fix.

I have not tried hitting the Assert(IsTransactionState() in
SearchCatCacheInternal. To do that, I need to figure out hitting
"incorrect binary data format in logical replication column" error in
either slot_modify_data or slot_store_data so that we will enter the
error callback slot_store_error_callback and then IsTransactionState()
should return false i.e. txn shouldn't be in TRANS_INPROGRESS. I'm not
sure how to do these things. I also looked at the commits 42c80c696e9
and ddfc9cb05 that added Assert(IsTransactionState()); in
SearchCatCacheInternal and RelationIdGetRelation for any relevant
tests or discussions, but I couldn't find anything.

Any further ideas in reproducing the issue?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#16Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#14)
Re: logical replication worker accesses catalogs in error context callback

On Sat, Jan 30, 2021 at 8:23 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-01-28 11:14:03 +0530, Amit Kapila wrote:

On Wed, Jan 27, 2021 at 9:38 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Wed, Jan 27, 2021 at 7:48 AM Zhihong Yu <zyu@yugabyte.com> wrote:

Thanks for pointing to the changes in the commit message. I corrected
them. Attaching v4 patch set, consider it for further review.

About 0001, have we tried to reproduce the actual bug here which means
when the error_callback is called we should face some problem? I feel
with the correct testcase we should hit the Assert
(Assert(IsTransactionState());) in SearchCatCacheInternal because
there we expect the transaction to be in a valid state. I understand
that the transaction is in a broken state at that time but having a
testcase to hit the actual bug makes it easy to test the fix.

I think it's easy to hit - add a trivial DEBUG message to XLogWrite(),
and run anything involving logical rep using a low enough log
level. There might even be enough messages with a low enough level
already somewhere in the relevant paths.

I'm not sure how adding a DEBUG message to XLogWrite() would ensure
the logical replication worker on the subscriber system enters
slot_store_error_callback and hits the Assert(IsTransactionState());?
Could you please elaborate? Or I may be missing something here to
understand.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#17Andres Freund
andres@anarazel.de
In reply to: Bharath Rupireddy (#16)
Re: logical replication worker accesses catalogs in error context callback

Hi,

On 2021-02-03 16:35:29 +0530, Bharath Rupireddy wrote:

On Sat, Jan 30, 2021 at 8:23 AM Andres Freund <andres@anarazel.de> wrote:

On 2021-01-28 11:14:03 +0530, Amit Kapila wrote:

On Wed, Jan 27, 2021 at 9:38 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Wed, Jan 27, 2021 at 7:48 AM Zhihong Yu <zyu@yugabyte.com> wrote:

Thanks for pointing to the changes in the commit message. I corrected
them. Attaching v4 patch set, consider it for further review.

About 0001, have we tried to reproduce the actual bug here which means
when the error_callback is called we should face some problem? I feel
with the correct testcase we should hit the Assert
(Assert(IsTransactionState());) in SearchCatCacheInternal because
there we expect the transaction to be in a valid state. I understand
that the transaction is in a broken state at that time but having a
testcase to hit the actual bug makes it easy to test the fix.

I think it's easy to hit - add a trivial DEBUG message to XLogWrite(),
and run anything involving logical rep using a low enough log
level. There might even be enough messages with a low enough level
already somewhere in the relevant paths.

I'm not sure how adding a DEBUG message to XLogWrite() would ensure
the logical replication worker on the subscriber system enters
slot_store_error_callback and hits the Assert(IsTransactionState());?
Could you please elaborate? Or I may be missing something here to
understand.

XLogWrite() is in a critical section, the DEBUG messages triggers error
context callbacks to be called, the callbacks allocate memory, which
triggers an assertion.

Greetings,

Andres Freund

#18Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#17)
Re: logical replication worker accesses catalogs in error context callback

On Thu, Feb 4, 2021 at 5:16 AM Andres Freund <andres@anarazel.de> wrote:

About 0001, have we tried to reproduce the actual bug here which means
when the error_callback is called we should face some problem? I feel
with the correct testcase we should hit the Assert
(Assert(IsTransactionState());) in SearchCatCacheInternal because
there we expect the transaction to be in a valid state. I understand
that the transaction is in a broken state at that time but having a
testcase to hit the actual bug makes it easy to test the fix.

I think it's easy to hit - add a trivial DEBUG message to XLogWrite(),
and run anything involving logical rep using a low enough log
level. There might even be enough messages with a low enough level
already somewhere in the relevant paths.

I'm not sure how adding a DEBUG message to XLogWrite() would ensure
the logical replication worker on the subscriber system enters
slot_store_error_callback and hits the Assert(IsTransactionState());?
Could you please elaborate? Or I may be missing something here to
understand.

XLogWrite() is in a critical section, the DEBUG messages triggers error
context callbacks to be called, the callbacks allocate memory, which
triggers an assertion.

I see that XLogWrite() can be called from logical replication
worker(apply_handle_commit->apply_handle_commit_internal->CommitTransactionCommand->CommitTransaction->RecordTransactionCommit->XLogFlush->XLogWrite
--> by now the txn state is not TRANS_INPROGRESS, so
IsTransactionState() can return false. Adding a DEBUG message there
can call the error context callback.

But the error context callback slot_store_error_callback, gets used in
slot_store_data and slot_modify_data. In both places we just register
the callback before an attribute processing for loop and deregister
after it. So, when we are in those for loops, we expect to be in
TRANS_INPROGRESS, see ensure_transaction before slot_store_data and
the same is true for slot_modify_data. So, to really hit the assert
failure in the catalogue access from slot_store_error_callback, first,
we shouldn't be in TRANS_INPROGRESS in those for loops and have a
DEBUG message. I'm not exactly sure how we achieve this.

Am I missing something?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Bharath Rupireddy (#15)
Re: logical replication worker accesses catalogs in error context callback

On Wed, Feb 3, 2021 at 4:31 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Thu, Jan 28, 2021 at 11:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 27, 2021 at 9:38 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Wed, Jan 27, 2021 at 7:48 AM Zhihong Yu <zyu@yugabyte.com> wrote:

Thanks for pointing to the changes in the commit message. I corrected
them. Attaching v4 patch set, consider it for further review.

About 0001, have we tried to reproduce the actual bug here which means
when the error_callback is called we should face some problem? I feel
with the correct testcase we should hit the Assert
(Assert(IsTransactionState());) in SearchCatCacheInternal because
there we expect the transaction to be in a valid state. I understand
that the transaction is in a broken state at that time but having a
testcase to hit the actual bug makes it easy to test the fix.

I have not tried hitting the Assert(IsTransactionState() in
SearchCatCacheInternal. To do that, I need to figure out hitting
"incorrect binary data format in logical replication column" error in
either slot_modify_data or slot_store_data so that we will enter the
error callback slot_store_error_callback and then IsTransactionState()
should return false i.e. txn shouldn't be in TRANS_INPROGRESS.

Even, if you hit that via debugger it will be sufficient or you can
write another elog/ereport there to achieve the same. The exact test
case to hit that error is not mandatory.

--
With Regards,
Amit Kapila.

#20Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Amit Kapila (#19)
Re: logical replication worker accesses catalogs in error context callback

On Thu, Feb 4, 2021 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

About 0001, have we tried to reproduce the actual bug here which means
when the error_callback is called we should face some problem? I feel
with the correct testcase we should hit the Assert
(Assert(IsTransactionState());) in SearchCatCacheInternal because
there we expect the transaction to be in a valid state. I understand
that the transaction is in a broken state at that time but having a
testcase to hit the actual bug makes it easy to test the fix.

I have not tried hitting the Assert(IsTransactionState() in
SearchCatCacheInternal. To do that, I need to figure out hitting
"incorrect binary data format in logical replication column" error in
either slot_modify_data or slot_store_data so that we will enter the
error callback slot_store_error_callback and then IsTransactionState()
should return false i.e. txn shouldn't be in TRANS_INPROGRESS.

Even, if you hit that via debugger it will be sufficient or you can
write another elog/ereport there to achieve the same. The exact test
case to hit that error is not mandatory.

Thanks Amit. I verified it with gdb. I attached gdb to the logical
replication worker. In slot_store_data's for loop, I intentionally set
CurrentTransactionState->state = TRANS_DEFAULT, and jumped to the
existing error "incorrect binary data format in logical replication
column", so that the slot_store_error_callback is called. While we are
in the error context callback:

On master: since the system catalogues are accessed in
slot_store_error_callback, the Assert(IsTransactionState() in
SearchCatCacheInternal failed and the error we intend to see is not
logged and we see below in the subscriber server log and the session
in the subscriber gets restarted.
2021-02-04 17:26:27.517 IST [2269230] ERROR: could not send data to
WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2021-02-04 17:26:27.518 IST [2269190] LOG: background worker "logical
replication worker" (PID 2269230) exited with exit code 1

With patch: since we avoided system catalogue access in
slot_store_error_callback, we see the error that we intentionally
jumped to, in the subscriber server log.
2021-02-04 17:27:37.542 IST [2269424] ERROR: incorrect binary data
format in logical replication column 1
2021-02-04 17:27:37.542 IST [2269424] CONTEXT: processing remote data
for replication target relation "public.t1" column "a1", remote type
integer, local type integer

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#21Amit Kapila
amit.kapila16@gmail.com
In reply to: Bharath Rupireddy (#20)
#22Andres Freund
andres@anarazel.de
In reply to: Bharath Rupireddy (#18)
#23Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Amit Kapila (#21)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bharath Rupireddy (#12)
#25Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Tom Lane (#24)
#26Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#25)
#27Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Tom Lane (#24)
#28Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#27)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bharath Rupireddy (#28)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#29)
#31Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Tom Lane (#29)
#32Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Tom Lane (#30)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bharath Rupireddy (#31)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bharath Rupireddy (#32)
#35Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Tom Lane (#34)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bharath Rupireddy (#35)