BUG #18055: logical decoding core on AllocateSnapshotBuilder()

Started by PG Bug reporting formover 2 years ago22 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 18055
Logged by: ocean li
Email address: ocean_li_996@163.com
PostgreSQL version: 11.9
Operating system: centos7 5.10.84 x86_64
Description:

For testing logical decoding module, *pg_logical_slot_get_changes* function
is used. Sometimes i got an core whose stack was like that:
==>
#0 0x00007f744a7b9277 in raise () from /lib64/libc.so.6
#1 0x00007f744a7ba968 in abort () from /lib64/libc.so.6
#2 0x00000000010edd37 in ExceptionalCondition (conditionName=0x17e2b58
"!(NInitialRunningXacts == 0 && InitialRunningXacts == ((void *)0))",
errorType=0x17e2b45 "FailedAssertion", fileName=0x17e2b39 "snapbuild.c",
lineNumber=381) at assert.c:46
#3 0x0000000000e60b46 in AllocateSnapshotBuilder (reorder=0x551ea98,
xmin_horizon=0, start_lsn=1267160216, need_full_snapshot=false) at
snapbuild.c:381
#4 0x0000000000e50f70 in StartupDecodingContext (output_plugin_options=0x0,
start_lsn=1267160216, xmin_horizon=0, need_full_snapshot=false,
fast_forward=false, read_page=0xe53023 <logical_read_local_xlog_page>,
prepare_write=0xe52df6 <LogicalOutputPrepareWrite>, do_write=0xe52e24
<LogicalOutputWrite>, update_progress=0x0) at logical.c:191
#5 0x0000000000e518b8 in CreateDecodingContext (start_lsn=1267160216,
output_plugin_options=0x0, fast_forward=false, read_page=0xe53023
<logical_read_local_xlog_page>, prepare_write=0xe52df6
<LogicalOutputPrepareWrite>, do_write=0xe52e24 <LogicalOutputWrite>,
update_progress=0x0) at logical.c:486
#6 0x0000000000e53735 in pg_logical_slot_get_changes_guts
(fcinfo=0x7ffcd879e3d0, confirm=true, binary=false) at logicalfuncs.c:259
#7 0x0000000000e53b1c in pg_logical_slot_get_changes
(fcinfo=0x7ffcd879e3d0) at logicalfuncs.c:393
#8 0x00000000010ff89e in FunctionCallInvokeCheckSPL (fcinfo=0x7ffcd879e3d0)
at fmgr.c:2262
...
==>
And in level #3 of stack above, NInitialRunningXacts is 2 and
InitialRunningXacts is not NULL observed in one of cores.

Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently,
the core, as far as i know, maybe caused by this way: an ERROR raised when
calling *pg_logical_slot_get_changes_guts* function. The code part of
PG_CATCH() doses not reset NInitialRunningXacts and InitialRunningXacts.
Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
Unfortunately, I couldn't find a minimal reproduction case. However, I
observed an *ERROR: canceling statement due to statement timeout* logged
before each core occurred. (For some reason, I can't provide the information
of log)

#2ocean_li_996
ocean_li_996@163.com
In reply to: PG Bug reporting form (#1)
Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

By applying the test.patch in attachment, I got similar core stacks following the procedure:
==>
postgres=# alter system set wal_level to 'logical';
-- restart server
postgres=# set statement_timeout to 1000;
SET
postgres=# select pg_create_logical_replication_slot('test', 'test_decoding', 0);
WARNING: NInitialRunningXacts is 0 , InitialRunningXacts is (nil)
pg_create_logical_replication_slot
------------------------------------
(test,0/423999D8)
(1 row)
postgres=# select pg_logical_slot_get_changes('test', NULL, NULL);
WARNING: NInitialRunningXacts is 0 , InitialRunningXacts is (nil)
ERROR: canceling statement due to statement timeout
postgres=# select pg_logical_slot_get_changes('test', NULL, NULL);
WARNING: NInitialRunningXacts is 3 , InitialRunningXacts is 0x373bc18
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
==>

And core stacks got above are
==>
(gdb) bt
#0 0x00007fbf2c987277 in raise () from /lib64/libc.so.6
#1 0x00007fbf2c988968 in abort () from /lib64/libc.so.6
#2 0x0000000000deb38e in ExceptionalCondition (
conditionName=0x11f5d70 "!(NInitialRunningXacts == 0 && InitialRunningXacts == ((void *)0))",
errorType=0x11f5d60 "FailedAssertion", fileName=0x11f5d00 "snapbuild.c", lineNumber=385)
at assert.c:46
#3 0x0000000000b5d7d2 in AllocateSnapshotBuilder (reorder=0x35795d8, xmin_horizon=0,
start_lsn=1111114144, need_full_snapshot=false) at snapbuild.c:385
#4 0x0000000000b4d4c2 in StartupDecodingContext (output_plugin_options=0x0, start_lsn=1111114144,
xmin_horizon=0, need_full_snapshot=false, fast_forward=false,
read_page=0xb4f5a5 ,
prepare_write=0xb4f378 , do_write=0xb4f3a6 ,
update_progress=0x0) at logical.c:191
#5 0x0000000000b4de0a in CreateDecodingContext (start_lsn=1111114144, output_plugin_options=0x0,
fast_forward=false, read_page=0xb4f5a5 ,
prepare_write=0xb4f378 , do_write=0xb4f3a6 ,
update_progress=0x0) at logical.c:486
#6 0x0000000000b4fcb7 in pg_logical_slot_get_changes_guts (fcinfo=0x344fbe8, confirm=true, binary=false)
at logicalfuncs.c:260
#7 0x0000000000b500b1 in pg_logical_slot_get_changes (fcinfo=0x344fbe8) at logicalfuncs.c:395

==>

In my view, this bug can be fix by using fix.patch in attachment.

Regards.

| |
ocean_li_996
|
|
ocean_li_996@163.com
|
---- Replied Message ----
| From | PG Bug reporting form<noreply@postgresql.org> |
| Date | 8/15/2023 00:05 |
| To | <pgsql-bugs@lists.postgresql.org> |
| Cc | <ocean_li_996@163.com> |
| Subject | BUG #18055: logical decoding core on AllocateSnapshotBuilder() |
The following bug has been logged on the website:

Bug reference: 18055
Logged by: ocean li
Email address: ocean_li_996@163.com
PostgreSQL version: 11.9
Operating system: centos7 5.10.84 x86_64
Description:

For testing logical decoding module, *pg_logical_slot_get_changes* function
is used. Sometimes i got an core whose stack was like that:
==>
#0 0x00007f744a7b9277 in raise () from /lib64/libc.so.6
#1 0x00007f744a7ba968 in abort () from /lib64/libc.so.6
#2 0x00000000010edd37 in ExceptionalCondition (conditionName=0x17e2b58
"!(NInitialRunningXacts == 0 && InitialRunningXacts == ((void *)0))",
errorType=0x17e2b45 "FailedAssertion", fileName=0x17e2b39 "snapbuild.c",
lineNumber=381) at assert.c:46
#3 0x0000000000e60b46 in AllocateSnapshotBuilder (reorder=0x551ea98,
xmin_horizon=0, start_lsn=1267160216, need_full_snapshot=false) at
snapbuild.c:381
#4 0x0000000000e50f70 in StartupDecodingContext (output_plugin_options=0x0,
start_lsn=1267160216, xmin_horizon=0, need_full_snapshot=false,
fast_forward=false, read_page=0xe53023 <logical_read_local_xlog_page>,
prepare_write=0xe52df6 <LogicalOutputPrepareWrite>, do_write=0xe52e24
<LogicalOutputWrite>, update_progress=0x0) at logical.c:191
#5 0x0000000000e518b8 in CreateDecodingContext (start_lsn=1267160216,
output_plugin_options=0x0, fast_forward=false, read_page=0xe53023
<logical_read_local_xlog_page>, prepare_write=0xe52df6
<LogicalOutputPrepareWrite>, do_write=0xe52e24 <LogicalOutputWrite>,
update_progress=0x0) at logical.c:486
#6 0x0000000000e53735 in pg_logical_slot_get_changes_guts
(fcinfo=0x7ffcd879e3d0, confirm=true, binary=false) at logicalfuncs.c:259
#7 0x0000000000e53b1c in pg_logical_slot_get_changes
(fcinfo=0x7ffcd879e3d0) at logicalfuncs.c:393
#8 0x00000000010ff89e in FunctionCallInvokeCheckSPL (fcinfo=0x7ffcd879e3d0)
at fmgr.c:2262
...
==>
And in level #3 of stack above, NInitialRunningXacts is 2 and
InitialRunningXacts is not NULL observed in one of cores.

Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently,
the core, as far as i know, maybe caused by this way: an ERROR raised when
calling *pg_logical_slot_get_changes_guts* function. The code part of
PG_CATCH() doses not reset NInitialRunningXacts and InitialRunningXacts.
Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
Unfortunately, I couldn't find a minimal reproduction case. However, I
observed an *ERROR: canceling statement due to statement timeout* logged
before each core occurred. (For some reason, I can't provide the information
of log)

Attachments:

test.patchapplication/octet-stream; name=test.patchDownload+13-0
fix.patchapplication/octet-stream; name=fix.patchDownload+11-1
#3Amit Kapila
amit.kapila16@gmail.com
In reply to: PG Bug reporting form (#1)
Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

On Mon, Aug 14, 2023 at 11:06 PM PG Bug reporting form
<noreply@postgresql.org> wrote:

Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently,
the core, as far as i know, maybe caused by this way: an ERROR raised when
calling *pg_logical_slot_get_changes_guts* function. The code part of
PG_CATCH() doses not reset NInitialRunningXacts and InitialRunningXacts.
Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
Unfortunately, I couldn't find a minimal reproduction case. However, I
observed an *ERROR: canceling statement due to statement timeout* logged
before each core occurred. (For some reason, I can't provide the information
of log)

Thanks for the report. I think your analysis is correct and I see that
you have proposed a patch in the next email. I'll look into it and
share my feedback.

--
With Regards,
Amit Kapila.

#4Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: PG Bug reporting form (#1)
RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form <noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 18055
Logged by: ocean li
Email address: ocean_li_996@163.com
PostgreSQL version: 11.9
Operating system: centos7 5.10.84 x86_64
Description:

For testing logical decoding module, *pg_logical_slot_get_changes* function
is used. Sometimes i got an core whose stack was like that:
...
And in level #3 of stack above, NInitialRunningXacts is 2 and
InitialRunningXacts is not NULL observed in one of cores.

Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently, the
core, as far as i know, maybe caused by this way: an ERROR raised when calling
*pg_logical_slot_get_changes_guts* function. The code part of
PG_CATCH() doses not reset NInitialRunningXacts and InitialRunningXacts.
Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
Unfortunately, I couldn't find a minimal reproduction case. However, I
observed an *ERROR: canceling statement due to statement timeout* logged
before each core occurred. (For some reason, I can't provide the information of
log)

Thanks for the report and the fix!

I can also reproduce by the steps[1]Session 1: DROP TABLE IF EXISTS tbl1; CREATE TABLE tbl1 (val1 integer, val2 integer); CREATE TABLE user_cat (val1 integer) WITH (user_catalog_table = true); SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); BEGIN; SAVEPOINT sp1; INSERT INTO tbl1 VALUES (1); Session 2: CHECKPOINT; SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); Session 1: TRUNCATE tbl1; COMMIT; BEGIN; INSERT INTO tbl1 VALUES (1); Session 2: CHECKPOINT; SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); Session 1: COMMIT; in PG15~11(Note that we need to change
LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra running xacts wal
records).

About the patch:

---
/* free context, call shutdown callback */
FreeDecodingContext(ctx);

		ReplicationSlotRelease();
		InvalidateSystemCaches();
	}
 	PG_CATCH();
 	{
+
+		/* free context, call shutdown callback */
+		FreeDecodingContext(ctx);
+
+		ReplicationSlotRelease();

I think we could not directly call the cleanup functions here. There could be
two risks:
1) it's possible either 'ctx' hasn't been initialized before the error happen,
we don't need to clean it up in this case
2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up normally
before entering the catch() block which means we will double cleanup(free) it.

To improve this, I think we can use PG_FINALLY() here and move all these cleanup functions
in it and do the null check for 'ctx' before cleaning up.

Just share one small patch based on your fix for reference, feel free to update it.

[1]: Session 1: DROP TABLE IF EXISTS tbl1; CREATE TABLE tbl1 (val1 integer, val2 integer); CREATE TABLE user_cat (val1 integer) WITH (user_catalog_table = true); SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); BEGIN; SAVEPOINT sp1; INSERT INTO tbl1 VALUES (1); Session 2: CHECKPOINT; SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); Session 1: TRUNCATE tbl1; COMMIT; BEGIN; INSERT INTO tbl1 VALUES (1); Session 2: CHECKPOINT; SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); Session 1: COMMIT;
Session 1:
DROP TABLE IF EXISTS tbl1;
CREATE TABLE tbl1 (val1 integer, val2 integer);
CREATE TABLE user_cat (val1 integer) WITH (user_catalog_table = true);
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
BEGIN;
SAVEPOINT sp1;
INSERT INTO tbl1 VALUES (1);
Session 2:
CHECKPOINT;
SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
Session 1:
TRUNCATE tbl1;
COMMIT;
BEGIN;
INSERT INTO tbl1 VALUES (1);
Session 2:
CHECKPOINT;
SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
Session 1:
COMMIT;

Session 2:
-- Use gdb or modify the code to make the first function call fail.
SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');

-- Since the previous command failed and the NInitialRunningXacts was not reset, we should see the Assert() failure this time.
SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');

Best Regards,
Hou zj

Attachments:

v2-0001-Cleanup-decoding-context-and-slots-in-error-cases.patchapplication/octet-stream; name=v2-0001-Cleanup-decoding-context-and-slots-in-error-cases.patchDownload+7-10
#5Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#4)
Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form <noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 18055
Logged by: ocean li
Email address: ocean_li_996@163.com
PostgreSQL version: 11.9
Operating system: centos7 5.10.84 x86_64
Description:

For testing logical decoding module, *pg_logical_slot_get_changes* function
is used. Sometimes i got an core whose stack was like that:
...
And in level #3 of stack above, NInitialRunningXacts is 2 and
InitialRunningXacts is not NULL observed in one of cores.

Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently, the
core, as far as i know, maybe caused by this way: an ERROR raised when calling
*pg_logical_slot_get_changes_guts* function. The code part of
PG_CATCH() doses not reset NInitialRunningXacts and InitialRunningXacts.
Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
Unfortunately, I couldn't find a minimal reproduction case. However, I
observed an *ERROR: canceling statement due to statement timeout* logged
before each core occurred. (For some reason, I can't provide the information of
log)

Thanks for the report and the fix!

I can also reproduce by the steps[1] in PG15~11(Note that we need to change
LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra running xacts wal
records).

About the patch:

---
/* free context, call shutdown callback */
FreeDecodingContext(ctx);

ReplicationSlotRelease();
InvalidateSystemCaches();
}
PG_CATCH();
{
+
+               /* free context, call shutdown callback */
+               FreeDecodingContext(ctx);
+
+               ReplicationSlotRelease();

I think we could not directly call the cleanup functions here. There could be
two risks:
1) it's possible either 'ctx' hasn't been initialized before the error happen,
we don't need to clean it up in this case
2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up normally
before entering the catch() block which means we will double cleanup(free) it.

To improve this, I think we can use PG_FINALLY() here and move all these cleanup functions
in it and do the null check for 'ctx' before cleaning up.

Just share one small patch based on your fix for reference, feel free to update it.

- LogicalDecodingContext *ctx;
+ LogicalDecodingContext *ctx = NULL;

Don't we need to use volatile for ctx? See the following comment in
elog.h: "Note: if a local variable of the function containing PG_TRY
is modified in the PG_TRY section and used in the PG_CATCH section,
that variable must be declared "volatile" for POSIX compliance."?

--
With Regards,
Amit Kapila.

#6Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#5)
Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

On Thu, Aug 17, 2023 at 3:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form <noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 18055
Logged by: ocean li
Email address: ocean_li_996@163.com
PostgreSQL version: 11.9
Operating system: centos7 5.10.84 x86_64
Description:

For testing logical decoding module, *pg_logical_slot_get_changes* function
is used. Sometimes i got an core whose stack was like that:
...
And in level #3 of stack above, NInitialRunningXacts is 2 and
InitialRunningXacts is not NULL observed in one of cores.

Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently, the
core, as far as i know, maybe caused by this way: an ERROR raised when calling
*pg_logical_slot_get_changes_guts* function. The code part of
PG_CATCH() doses not reset NInitialRunningXacts and InitialRunningXacts.
Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
Unfortunately, I couldn't find a minimal reproduction case. However, I
observed an *ERROR: canceling statement due to statement timeout* logged
before each core occurred. (For some reason, I can't provide the information of
log)

Thanks for the report and the fix!

I can also reproduce by the steps[1] in PG15~11(Note that we need to change
LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra running xacts wal
records).

About the patch:

---
/* free context, call shutdown callback */
FreeDecodingContext(ctx);

ReplicationSlotRelease();
InvalidateSystemCaches();
}
PG_CATCH();
{
+
+               /* free context, call shutdown callback */
+               FreeDecodingContext(ctx);
+
+               ReplicationSlotRelease();

I think we could not directly call the cleanup functions here. There could be
two risks:
1) it's possible either 'ctx' hasn't been initialized before the error happen,
we don't need to clean it up in this case
2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up normally
before entering the catch() block which means we will double cleanup(free) it.

To improve this, I think we can use PG_FINALLY() here and move all these cleanup functions
in it and do the null check for 'ctx' before cleaning up.

Just share one small patch based on your fix for reference, feel free to update it.

- LogicalDecodingContext *ctx;
+ LogicalDecodingContext *ctx = NULL;

Don't we need to use volatile for ctx? See the following comment in
elog.h: "Note: if a local variable of the function containing PG_TRY
is modified in the PG_TRY section and used in the PG_CATCH section,
that variable must be declared "volatile" for POSIX compliance."?

BTW, don't we need a similar change in pg_logical_replication_slot_advance()?

--
With Regards,
Amit Kapila.

#7Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#6)
RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

On Thursday, August 17, 2023 6:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 17, 2023 at 3:33 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form

<noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 18055
Logged by: ocean li
Email address: ocean_li_996@163.com
PostgreSQL version: 11.9
Operating system: centos7 5.10.84 x86_64
Description:

For testing logical decoding module, *pg_logical_slot_get_changes*
function is used. Sometimes i got an core whose stack was like that:
...
And in level #3 of stack above, NInitialRunningXacts is 2 and
InitialRunningXacts is not NULL observed in one of cores.

Using of NInitialRunningXacts and InitialRunningXacts are clear.
Currently, the core, as far as i know, maybe caused by this way:
an ERROR raised when calling
*pg_logical_slot_get_changes_guts* function. The code part of
PG_CATCH() doses not reset NInitialRunningXacts and

InitialRunningXacts.

Then, calling pg_logical_slot_get_changes_guts again, the core may

occur.

Unfortunately, I couldn't find a minimal reproduction case.
However, I observed an *ERROR: canceling statement due to
statement timeout* logged before each core occurred. (For some
reason, I can't provide the information of
log)

Thanks for the report and the fix!

I can also reproduce by the steps[1] in PG15~11(Note that we need to
change LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra
running xacts wal records).

About the patch:

---
/* free context, call shutdown callback */
FreeDecodingContext(ctx);

ReplicationSlotRelease();
InvalidateSystemCaches();
}
PG_CATCH();
{
+
+               /* free context, call shutdown callback */
+               FreeDecodingContext(ctx);
+
+               ReplicationSlotRelease();

I think we could not directly call the cleanup functions here. There
could be two risks:
1) it's possible either 'ctx' hasn't been initialized before the
error happen, we don't need to clean it up in this case
2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up
normally before entering the catch() block which means we will double

cleanup(free) it.

To improve this, I think we can use PG_FINALLY() here and move all
these cleanup functions in it and do the null check for 'ctx' before cleaning

up.

Just share one small patch based on your fix for reference, feel free to

update it.

- LogicalDecodingContext *ctx;
+ LogicalDecodingContext *ctx = NULL;

Don't we need to use volatile for ctx? See the following comment in
elog.h: "Note: if a local variable of the function containing PG_TRY
is modified in the PG_TRY section and used in the PG_CATCH section,
that variable must be declared "volatile" for POSIX compliance."?

Thanks for pointing it out, I think we need to use volatile for it.

BTW, don't we need a similar change in pg_logical_replication_slot_advance()?

You are right, I can reproduce the same following the steps in [1]/messages/by-id/OS0PR01MB57160A7DD10B517AFDC73378941AA@OS0PR01MB5716.jpnprd01.prod.outlook.com and
replace the last two pg_logical_slot_get_changes function calls with two
"SELECT pg_replication_slot_advance('isolation_slot', (select
pg_current_wal_lsn()));"

Here is a new version patch based on PG15 STABLE which addressed above comments.

[1]: /messages/by-id/OS0PR01MB57160A7DD10B517AFDC73378941AA@OS0PR01MB5716.jpnprd01.prod.outlook.com

Best Regards,
Hou zj

Attachments:

v3-0001-cleanup-decoding-context-and-slots-in-error-cases.patchapplication/octet-stream; name=v3-0001-cleanup-decoding-context-and-slots-in-error-cases.patchDownload+14-18
#8Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#7)
Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

On Thu, Aug 17, 2023 at 9:10 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thursday, August 17, 2023 6:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 17, 2023 at 3:33 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form

<noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 18055
Logged by: ocean li
Email address: ocean_li_996@163.com
PostgreSQL version: 11.9
Operating system: centos7 5.10.84 x86_64
Description:

For testing logical decoding module, *pg_logical_slot_get_changes*
function is used. Sometimes i got an core whose stack was like that:
...
And in level #3 of stack above, NInitialRunningXacts is 2 and
InitialRunningXacts is not NULL observed in one of cores.

Using of NInitialRunningXacts and InitialRunningXacts are clear.
Currently, the core, as far as i know, maybe caused by this way:
an ERROR raised when calling
*pg_logical_slot_get_changes_guts* function. The code part of
PG_CATCH() doses not reset NInitialRunningXacts and

InitialRunningXacts.

Then, calling pg_logical_slot_get_changes_guts again, the core may

occur.

Unfortunately, I couldn't find a minimal reproduction case.
However, I observed an *ERROR: canceling statement due to
statement timeout* logged before each core occurred. (For some
reason, I can't provide the information of
log)

Thanks for the report and the fix!

I can also reproduce by the steps[1] in PG15~11(Note that we need to
change LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra
running xacts wal records).

About the patch:

---
/* free context, call shutdown callback */
FreeDecodingContext(ctx);

ReplicationSlotRelease();
InvalidateSystemCaches();
}
PG_CATCH();
{
+
+               /* free context, call shutdown callback */
+               FreeDecodingContext(ctx);
+
+               ReplicationSlotRelease();

I think we could not directly call the cleanup functions here. There
could be two risks:
1) it's possible either 'ctx' hasn't been initialized before the
error happen, we don't need to clean it up in this case
2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up
normally before entering the catch() block which means we will double

cleanup(free) it.

To improve this, I think we can use PG_FINALLY() here and move all
these cleanup functions in it and do the null check for 'ctx' before cleaning

up.

Just share one small patch based on your fix for reference, feel free to

update it.

- LogicalDecodingContext *ctx;
+ LogicalDecodingContext *ctx = NULL;

Don't we need to use volatile for ctx? See the following comment in
elog.h: "Note: if a local variable of the function containing PG_TRY
is modified in the PG_TRY section and used in the PG_CATCH section,
that variable must be declared "volatile" for POSIX compliance."?

Thanks for pointing it out, I think we need to use volatile for it.

BTW, don't we need a similar change in pg_logical_replication_slot_advance()?

You are right, I can reproduce the same following the steps in [1] and
replace the last two pg_logical_slot_get_changes function calls with two
"SELECT pg_replication_slot_advance('isolation_slot', (select
pg_current_wal_lsn()));"

I've not tested but the same is true in a case where
pg_create_logical_replication_slot() raises an error (e.g., cancelled
while DecodingContextFindStartpoint()) after creating the decoding
context?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#9Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Masahiko Sawada (#8)
RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

On Thursday, August 17, 2023 10:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Aug 17, 2023 at 9:10 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:

On Thursday, August 17, 2023 6:07 PM Amit Kapila

<amit.kapila16@gmail.com> wrote:

On Thu, Aug 17, 2023 at 3:33 PM Amit Kapila
<amit.kapila16@gmail.com>
wrote:

On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form

<noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 18055
Logged by: ocean li
Email address: ocean_li_996@163.com
PostgreSQL version: 11.9
Operating system: centos7 5.10.84 x86_64
Description:

For testing logical decoding module,
*pg_logical_slot_get_changes* function is used. Sometimes i got an

core whose stack was like that:

...
And in level #3 of stack above, NInitialRunningXacts is 2 and
InitialRunningXacts is not NULL observed in one of cores.

Using of NInitialRunningXacts and InitialRunningXacts are clear.
Currently, the core, as far as i know, maybe caused by this way:
an ERROR raised when calling
*pg_logical_slot_get_changes_guts* function. The code part of
PG_CATCH() doses not reset NInitialRunningXacts and

InitialRunningXacts.

Then, calling pg_logical_slot_get_changes_guts again, the core
may

occur.

Unfortunately, I couldn't find a minimal reproduction case.
However, I observed an *ERROR: canceling statement due to
statement timeout* logged before each core occurred. (For
some reason, I can't provide the information of
log)

Thanks for the report and the fix!

I can also reproduce by the steps[1] in PG15~11(Note that we
need to change LOG_SNAPSHOT_INTERVAL_MS to a bigger number to
avoid extra running xacts wal records).

About the patch:

---
/* free context, call shutdown callback */
FreeDecodingContext(ctx);

ReplicationSlotRelease();
InvalidateSystemCaches();
}
PG_CATCH();
{
+
+               /* free context, call shutdown callback */
+               FreeDecodingContext(ctx);
+
+               ReplicationSlotRelease();

I think we could not directly call the cleanup functions here.
There could be two risks:
1) it's possible either 'ctx' hasn't been initialized before the
error happen, we don't need to clean it up in this case
2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up
normally before entering the catch() block which means we will
double

cleanup(free) it.

To improve this, I think we can use PG_FINALLY() here and move
all these cleanup functions in it and do the null check for
'ctx' before cleaning

up.

Just share one small patch based on your fix for reference, feel
free to

update it.

- LogicalDecodingContext *ctx;
+ LogicalDecodingContext *ctx = NULL;

Don't we need to use volatile for ctx? See the following comment
in
elog.h: "Note: if a local variable of the function containing
PG_TRY is modified in the PG_TRY section and used in the PG_CATCH
section, that variable must be declared "volatile" for POSIX compliance."?

Thanks for pointing it out, I think we need to use volatile for it.

BTW, don't we need a similar change in

pg_logical_replication_slot_advance()?

You are right, I can reproduce the same following the steps in [1] and
replace the last two pg_logical_slot_get_changes function calls with
two "SELECT pg_replication_slot_advance('isolation_slot', (select
pg_current_wal_lsn()));"

I've not tested but the same is true in a case where
pg_create_logical_replication_slot() raises an error (e.g., cancelled while
DecodingContextFindStartpoint()) after creating the decoding context?

Indeed, I can reproduce this case[1]~~ reproduction steps (Need to change LOG_SNAPSHOT_INTERVAL_MS to a bigger number) as well, thanks!

Here is a new version patch which addressed this based on PG15.

Apart from this, I am thinking we’d better also apply the fix on HEAD even if
it doesn't have such flag and Assert. Because in error case(HEAD), we will skip
calling shutdown callbacks for output plugins, so it could cause other
problems. For example, in pgoutput we free and reset the hash table in shutdown
callbacks, if we skip this, and next time we consume changes from another slot,
we would use an expired hash table which can cause problems. I can provide
patches for all branches if the fix is OK.

[1]: ~~ reproduction steps (Need to change LOG_SNAPSHOT_INTERVAL_MS to a bigger number)
~~ reproduction steps (Need to change LOG_SNAPSHOT_INTERVAL_MS to a bigger number)

session 1:
-----
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_1', 'test_decoding');
-----

session 2:
-----
- Start a transaction
BEGIN;
CREATE TABLE test(a int);
INSERT INTO test VALUES(1);
-----

session 3:
-----
- Note, to let the current slot restore another snapshot, we need to use gdb to block the current backend at
- SnapBuildFindSnapshot(), the backend should have logged the running_xacts WAL
- before reaching SnapBuildFindSnapshot.

SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_2', 'test_decoding');
-----

session 1:
-----
- The current backend will serialize the snapshot when decoding the running_xacts WAL, and the snapshot
- can be used by other slots (e.g. isolation_slot_2)

SELECT data FROM pg_logical_slot_get_changes('isolation_slot_1', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
-----

session 3:
-----
- Release the process and we can see if restore the snaphost and set NInitialRunningXacts to 1.
- Then let it report an ERROR and try to create the slot again after ERROR.
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_2', 'test_decoding');
-----

Best Regards,
Hou zj

Attachments:

v4-0001-cleanup-decoding-context-in-error-cases.patchapplication/octet-stream; name=v4-0001-cleanup-decoding-context-in-error-cases.patchDownload+29-27
#10ocean_li_996
ocean_li_996@163.com
In reply to: Zhijie Hou (Fujitsu) (#9)
Re:RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

Excited! Thanks for the discussion and the fix!

The bug seems to be solved well. And usage of PG_FINALLY can make it more elegantly. However, for the version lacking of PG_FINALLY, should the patch be written like that:

---
/* free context, call shutdown callback */
FreeDecodingContext(ctx);

ReplicationSlotRelease();
InvalidateSystemCaches();
}
PG_CATCH();
{
+
+    /* free context, call shutdown callback */
+     if (ctx != NULL)
+             FreeDecodingContext(ctx);
+
+    ReplicationSlotRelease();

----

Best regards!

At 2023-08-18 12:21:53, "Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com> wrote:

Here is a new version patch which addressed this based on PG15.

#11Andres Freund
andres@anarazel.de
In reply to: Zhijie Hou (Fujitsu) (#9)
Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

Hi,

On 2023-08-18 04:21:53 +0000, Zhijie Hou (Fujitsu) wrote:

From ee1dfccc0306812c356c84bbd7e2558f27d7d348 Mon Sep 17 00:00:00 2001
From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
Date: Thu, 17 Aug 2023 19:29:34 +0800
Subject: [PATCH v4] cleanup decoding context in error cases

Some of the management functions for logical decoding didn't clean up the
decoding context when an error occurs during decoding. This can
result in accessing unfreed resources and assert failure the next time we call
these functions. Fix it by cleaning up the decoding context and slots in error
cases.

I don't think this is the right fix - at all. We shouldn't run arbitrary code
after a failure, which we do by calling the shutdown callback. It's not going
to expect to be called in a context where you're not allowed to do very much.

Consider what would happen if FreeDecodingContext() or such failed - we'd not
have executed InvalidateSystemCaches(), and therefore would continue with
corrupted system caches.

There's also no need to free memory here, that's already done via the memory
context cleanup of the "surrounding" memory context (i.e. CurrentMemoryContext
when StartupDecodingContext() is called)

I think the real problem is that 272248a0c added InitialRunningXacts a global
variable. If it just lived in in struct SnapBuild, this whole thing wouldn't
be an issue? The commit justified this choice with avoiding an ABI breakage -
but isn't that bogus? The struct is private to snapbuild.c. It doesn't live on
disk (that's SnapBuildOnDisk). There's no ABI to preserve?

Greetings,

Andres Freund

#12Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#11)
Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

On Mon, Aug 21, 2023 at 6:05 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

I think the real problem is that 272248a0c added InitialRunningXacts a global
variable. If it just lived in in struct SnapBuild, this whole thing wouldn't
be an issue? The commit justified this choice with avoiding an ABI breakage -
but isn't that bogus? The struct is private to snapbuild.c. It doesn't live on
disk (that's SnapBuildOnDisk).

No, since SnapBuildOnDisk contains SnapBuild, if we add something to
SnapBuild, the on-disk format compatibility would break. See:

/*
* We store current state of struct SnapBuild on disk in the following manner:
*
* struct SnapBuildOnDisk;
* TransactionId * running.xcnt_space;
* TransactionId * committed.xcnt; (*not xcnt_space*)
*
*/
typedef struct SnapBuildOnDisk
{
/* first part of this struct needs to be version independent */

/* data not covered by checksum */
uint32 magic;
pg_crc32c checksum;

/* data covered by checksum */

/* version, in case we want to support pg_upgrade */
uint32 version;
/* how large is the on disk data, excluding the constant sized part */
uint32 length;

/* version dependent part */
SnapBuild builder;
:

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#11)
Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

On Mon, Aug 21, 2023 at 2:35 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-08-18 04:21:53 +0000, Zhijie Hou (Fujitsu) wrote:

From ee1dfccc0306812c356c84bbd7e2558f27d7d348 Mon Sep 17 00:00:00 2001
From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
Date: Thu, 17 Aug 2023 19:29:34 +0800
Subject: [PATCH v4] cleanup decoding context in error cases

Some of the management functions for logical decoding didn't clean up the
decoding context when an error occurs during decoding. This can
result in accessing unfreed resources and assert failure the next time we call
these functions. Fix it by cleaning up the decoding context and slots in error
cases.

I don't think this is the right fix - at all. We shouldn't run arbitrary code
after a failure, which we do by calling the shutdown callback.

But OTOH, it can prevent freeing some global memory like in the case
of pgoutput_shutdown which frees some memory allocated in
CacheMemoryContext. Also, as pointed out by Hou-San[1]/messages/by-id/OS0PR01MB57167C62D7DA4A8EBBC92B0A941BA@OS0PR01MB5716.jpnprd01.prod.outlook.com, it can lead to
unwanted behavior as next time we can access some invalid entries.

The other alternatives to fix are (a) Have some Reset* kind of
function (similar to ResetReindexState) to reset the required
variables and call at AbortTransaction time. The minor risk could be
that if any extension uses the same function name as ours but not sure
if it is worth worrying about. (b) Initialize the required variables
in AllocateSnapshotBuilder().

[1]: /messages/by-id/OS0PR01MB57167C62D7DA4A8EBBC92B0A941BA@OS0PR01MB5716.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.

#14Andres Freund
andres@anarazel.de
In reply to: Masahiko Sawada (#12)
Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

Hi,

On 2023-08-21 21:41:46 +0900, Masahiko Sawada wrote:

On Mon, Aug 21, 2023 at 6:05 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

I think the real problem is that 272248a0c added InitialRunningXacts a global
variable. If it just lived in in struct SnapBuild, this whole thing wouldn't
be an issue? The commit justified this choice with avoiding an ABI breakage -
but isn't that bogus? The struct is private to snapbuild.c. It doesn't live on
disk (that's SnapBuildOnDisk).

No, since SnapBuildOnDisk contains SnapBuild, if we add something to
SnapBuild, the on-disk format compatibility would break. See:

Err, yes. That was a brainfart on my part. But we could still just have
handled that via the 'version' field.

Greetings,

Andres Freund

#15Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#13)
Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

Hi,

On 2023-08-21 19:07:16 +0530, Amit Kapila wrote:

On Mon, Aug 21, 2023 at 2:35 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-08-18 04:21:53 +0000, Zhijie Hou (Fujitsu) wrote:

From ee1dfccc0306812c356c84bbd7e2558f27d7d348 Mon Sep 17 00:00:00 2001
From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
Date: Thu, 17 Aug 2023 19:29:34 +0800
Subject: [PATCH v4] cleanup decoding context in error cases

Some of the management functions for logical decoding didn't clean up the
decoding context when an error occurs during decoding. This can
result in accessing unfreed resources and assert failure the next time we call
these functions. Fix it by cleaning up the decoding context and slots in error
cases.

I don't think this is the right fix - at all. We shouldn't run arbitrary code
after a failure, which we do by calling the shutdown callback.

But OTOH, it can prevent freeing some global memory like in the case
of pgoutput_shutdown which frees some memory allocated in
CacheMemoryContext. Also, as pointed out by Hou-San[1], it can lead to
unwanted behavior as next time we can access some invalid entries.

To me the code in pgoutput is just bad. For one, it uses a global variable
for non-global state - instead it should store the data in
LogicalDecodingContext->output_plugin_private. Secondly, it should not use
CacheMemoryContext, given that the cache apparently is local to a individual
pgoutput instance.

But: Is there actually a danger of invalid entries being accessed? It looks to
me like pgoutput uses invalidation callbacks etc, which would prevent the
cache from being stale?

The other alternatives to fix are (a) Have some Reset* kind of
function (similar to ResetReindexState) to reset the required
variables and call at AbortTransaction time.

-1. This makes things global concerns that shouldn't be.

If we really need something to clean this up, I'd look at
MemoryContextRegisterResetCallback().

Greetings,

Andres Freund

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#15)
Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

On Mon, Aug 21, 2023 at 11:57 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-08-21 19:07:16 +0530, Amit Kapila wrote:

On Mon, Aug 21, 2023 at 2:35 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-08-18 04:21:53 +0000, Zhijie Hou (Fujitsu) wrote:

From ee1dfccc0306812c356c84bbd7e2558f27d7d348 Mon Sep 17 00:00:00 2001
From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
Date: Thu, 17 Aug 2023 19:29:34 +0800
Subject: [PATCH v4] cleanup decoding context in error cases

Some of the management functions for logical decoding didn't clean up the
decoding context when an error occurs during decoding. This can
result in accessing unfreed resources and assert failure the next time we call
these functions. Fix it by cleaning up the decoding context and slots in error
cases.

I don't think this is the right fix - at all. We shouldn't run arbitrary code
after a failure, which we do by calling the shutdown callback.

But OTOH, it can prevent freeing some global memory like in the case
of pgoutput_shutdown which frees some memory allocated in
CacheMemoryContext. Also, as pointed out by Hou-San[1], it can lead to
unwanted behavior as next time we can access some invalid entries.

To me the code in pgoutput is just bad. For one, it uses a global variable
for non-global state - instead it should store the data in
LogicalDecodingContext->output_plugin_private. Secondly, it should not use
CacheMemoryContext, given that the cache apparently is local to a individual
pgoutput instance.

I agree that this is a better way and we should change the code. I
think we can probably allocate this in RelationSyncCache.

But: Is there actually a danger of invalid entries being accessed? It looks to
me like pgoutput uses invalidation callbacks etc, which would prevent the
cache from being stale?

Actually, during invalidation, we simply mark entries as invalid, and
during the next access, we validate those by freeing the old stuff and
then build it from scratch. I feel if we do what you said in the
previous point this shouldn't be a problem either.

The other alternatives to fix are (a) Have some Reset* kind of
function (similar to ResetReindexState) to reset the required
variables and call at AbortTransaction time.

-1. This makes things global concerns that shouldn't be.

If we really need something to clean this up, I'd look at
MemoryContextRegisterResetCallback().

+1. This sounds like a better idea.

--
With Regards,
Amit Kapila.

#17Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#16)
RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

On Wednesday, August 23, 2023 4:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 21, 2023 at 11:57 PM Andres Freund <andres@anarazel.de>
wrote:

On 2023-08-21 19:07:16 +0530, Amit Kapila wrote:

On Mon, Aug 21, 2023 at 2:35 AM Andres Freund <andres@anarazel.de>

wrote:

On 2023-08-18 04:21:53 +0000, Zhijie Hou (Fujitsu) wrote:

From ee1dfccc0306812c356c84bbd7e2558f27d7d348 Mon Sep 17
00:00:00 2001
From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
Date: Thu, 17 Aug 2023 19:29:34 +0800
Subject: [PATCH v4] cleanup decoding context in error cases

Some of the management functions for logical decoding didn't
clean up the decoding context when an error occurs during
decoding. This can result in accessing unfreed resources and
assert failure the next time we call these functions. Fix it by
cleaning up the decoding context and slots in error cases.

I don't think this is the right fix - at all. We shouldn't run
arbitrary code after a failure, which we do by calling the shutdown

callback.

The other alternatives to fix are (a) Have some Reset* kind of
function (similar to ResetReindexState) to reset the required
variables and call at AbortTransaction time.

-1. This makes things global concerns that shouldn't be.

If we really need something to clean this up, I'd look at
MemoryContextRegisterResetCallback().

+1. This sounds like a better idea.

Here is a small patch based on this idea.
I registered the callback under snapshot builder context
and the callback will reset the variables.

Best Regards,
Hou zj

Attachments:

0001-Reset-InitialRunningXacts-at-memory-context-reset-ca.patchapplication/octet-stream; name=0001-Reset-InitialRunningXacts-at-memory-context-reset-ca.patchDownload+16-5
#18Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Andres Freund (#14)
RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

On Tuesday, August 22, 2023 2:19 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-08-21 21:41:46 +0900, Masahiko Sawada wrote:

On Mon, Aug 21, 2023 at 6:05 AM Andres Freund <andres@anarazel.de>

wrote:

Hi,

I think the real problem is that 272248a0c added InitialRunningXacts
a global variable. If it just lived in in struct SnapBuild, this
whole thing wouldn't be an issue? The commit justified this choice
with avoiding an ABI breakage - but isn't that bogus? The struct is
private to snapbuild.c. It doesn't live on disk (that's SnapBuildOnDisk).

No, since SnapBuildOnDisk contains SnapBuild, if we add something to
SnapBuild, the on-disk format compatibility would break. See:

Err, yes. That was a brainfart on my part. But we could still just have handled that
via the 'version' field.

I thought you mean we can add the InitialRunningXacts into the SnapBuild struct
while avoid serializing them to the disk, so that the disk file size doesn't
change and the reported problem also get fixed.

I think, normally, we need to increment the snapbuild version when changing the
struct size, but this sounds a bit complex as we need to increment the number in all
support back-branches which have different SNAPBUILD_VERSION.

Or if we agree we don't change the version number in this case as we don't plan
to change the disk file size(only change the struct size), it seems work,
although it looks a bit hacky.

Best Regards,
Hou zj

#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#17)
Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

On Fri, Aug 25, 2023 at 9:36 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wednesday, August 23, 2023 4:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 21, 2023 at 11:57 PM Andres Freund <andres@anarazel.de>
wrote:

-1. This makes things global concerns that shouldn't be.

If we really need something to clean this up, I'd look at
MemoryContextRegisterResetCallback().

+1. This sounds like a better idea.

Here is a small patch based on this idea.
I registered the callback under snapshot builder context
and the callback will reset the variables.

LGTM. Unless Andres or Sawada-San thinks otherwise, I would like to
proceed with this fix.

--
With Regards,
Amit Kapila.

#20Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#19)
RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

On Tuesday, September 5, 2023 7:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 25, 2023 at 9:36 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:

On Wednesday, August 23, 2023 4:47 PM Amit Kapila

<amit.kapila16@gmail.com> wrote:

On Mon, Aug 21, 2023 at 11:57 PM Andres Freund <andres@anarazel.de>
wrote:

-1. This makes things global concerns that shouldn't be.

If we really need something to clean this up, I'd look at
MemoryContextRegisterResetCallback().

+1. This sounds like a better idea.

Here is a small patch based on this idea.
I registered the callback under snapshot builder context and the
callback will reset the variables.

LGTM. Unless Andres or Sawada-San thinks otherwise, I would like to proceed
with this fix.

Thanks, attach the patches for all support branches. I have confirmed that they can
pass the regression test and fix the issue.

Best Regards,
Hou zj

Attachments:

0001-PG14-Reset-InitialRunningXacts-at-memory-context-res.patchapplication/octet-stream; name=0001-PG14-Reset-InitialRunningXacts-at-memory-context-res.patchDownload+16-5
0001-PG11-12-13-Reset-InitialRunningXacts-at-memory-context-res.patchapplication/octet-stream; name=0001-PG11-12-13-Reset-InitialRunningXacts-at-memory-context-res.patchDownload+16-5
0001-PG15-Reset-InitialRunningXacts-at-memory-context-reset-ca.patchapplication/octet-stream; name=0001-PG15-Reset-InitialRunningXacts-at-memory-context-reset-ca.patchDownload+16-5
#21Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#20)
#22Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#21)