BUG #18641: Logical decoding of two-phase commit fails with TOASTed default values
The following bug has been logged on the website:
Bug reference: 18641
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 17.0
Operating system: Ubuntu 22.04
Description:
Logical decoding of a transaction like:
BEGIN;
INSERT INTO test_tab VALUES(1);
PREPARE TRANSACTION 'pt';
where test_tab defined as:
CREATE TABLE test_tab(a int primary key,
t text DEFAULT 'Some TOASTable value';
for a subscription created with two_phase = on, fails as below:
2024-09-28 06:44:50.708 UTC [3741774:6][client backend][6/2:740] LOG:
statement: PREPARE TRANSACTION 'pt';
2024-09-28 06:44:50.709 UTC [3741774:7][client backend][:0] LOG:
disconnection: session time: 0:00:00.006 user=law database=postgres
host=[local]
2024-09-28 06:44:50.713 UTC [3741741:17][walsender][25/0:0] ERROR:
unexpected table_index_fetch_tuple call during logical decoding
2024-09-28 06:44:50.713 UTC [3741741:18][walsender][25/0:0] BACKTRACE:
table_index_fetch_tuple at tableam.h:1253:3
index_fetch_heap at indexam.c:637:10
index_getnext_slot at indexam.c:697:6
systable_getnext_ordered at genam.c:717:5
heap_fetch_toast_slice at heaptoast.c:698:17
table_relation_fetch_toast_slice at tableam.h:1924:1
toast_fetch_datum at detoast.c:379:2
detoast_attr at detoast.c:123:10
pg_detoast_datum_packed at fmgr.c:1867:10
text_to_cstring at varlena.c:220:23
AttrDefaultFetch at relcache.c:4537:17
RelationBuildTupleDesc at relcache.c:697:4
RelationBuildDesc at relcache.c:1188:24
RelationIdGetRelation at relcache.c:2116:7
ReorderBufferProcessTXN at reorderbuffer.c:2246:17
ReorderBufferReplay at reorderbuffer.c:2725:2
ReorderBufferPrepare at reorderbuffer.c:2822:2
DecodePrepare at decode.c:826:2
xact_decode at decode.c:347:5
LogicalDecodingProcessRecord at decode.c:123:1
XLogSendLogical at walsender.c:3413:33
WalSndLoop at walsender.c:2814:4
StartLogicalReplication at walsender.c:1523:2
exec_replication_command at walsender.c:2148:5
PostgresMain at postgres.c:4763:11
BackendInitialize at backend_startup.c:123:1
postmaster_child_launch at launch_backend.c:281:9
BackendStartup at postmaster.c:3593:8
ServerLoop at postmaster.c:1677:10
PostmasterMain at postmaster.c:1372:11
startup_hacks at main.c:217:1
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)
[0x7f30a90d7e40]
postgres: node_publisher: walsender law postgres [local]
START_REPLICATION(_start+0x25) [0x5647b6023565]
2024-09-28 06:44:50.713 UTC [3741741:19][walsender][25/0:0] LOG: released
logical replication slot "test_sub"
2024-09-28 06:44:50.713 UTC [3741741:20][walsender][25/0:0] LOG: could not
send data to client: Broken pipe
2024-09-28 06:44:50.713 UTC [3741741:21][walsender][25/0:0] FATAL:
connection to client lost
The issue can be easily reproduced with 022_twophase_cascade.pl modified
like this:
@@ -67,0 +68 @@ $node_A->safe_psql('postgres',
+my $default = join('', map {chr(65 + rand 26)} (1 .. 10000));
@@ -69 +70 @@ $node_B->safe_psql('postgres',
- "CREATE TABLE test_tab (a int primary key, b bytea, c timestamptz
DEFAULT now(), d bigint DEFAULT 999)"
+ "CREATE TABLE test_tab (a int primary key, b bytea, c timestamptz
DEFAULT now(), d bigint DEFAULT 999, t text DEFAULT '$default')"
@@ -72 +73 @@ $node_C->safe_psql('postgres',
- "CREATE TABLE test_tab (a int primary key, b bytea, c timestamptz
DEFAULT now(), d bigint DEFAULT 999)"
+ "CREATE TABLE test_tab (a int primary key, b bytea, c timestamptz
DEFAULT now(), d bigint DEFAULT 999, t text DEFAULT '$default')"
Reproduced on REL_15_STABLE (starting from a8fd13cab) .. master.
Hi, Alexander and hackers
The following bug has been logged on the website:
Bug reference: 18641
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 17.0
Operating system: Ubuntu 22.04
Description:Logical decoding of a transaction like:
BEGIN;
INSERT INTO test_tab VALUES(1);
PREPARE TRANSACTION 'pt';where test_tab defined as:
CREATE TABLE test_tab(a int primary key,
t text DEFAULT 'Some TOASTable value';for a subscription created with two_phase = on, fails as below:
2024-09-28 06:44:50.708 UTC [3741774:6][client backend][6/2:740] LOG:
statement: PREPARE TRANSACTION 'pt';
2024-09-28 06:44:50.709 UTC [3741774:7][client backend][:0] LOG:
disconnection: session time: 0:00:00.006 user=law database=postgres
host=[local]
2024-09-28 06:44:50.713 UTC [3741741:17][walsender][25/0:0] ERROR:
unexpected table_index_fetch_tuple call during logical decoding
2024-09-28 06:44:50.713 UTC [3741741:18][walsender][25/0:0] BACKTRACE:
table_index_fetch_tuple at tableam.h:1253:3
index_fetch_heap at indexam.c:637:10
index_getnext_slot at indexam.c:697:6
systable_getnext_ordered at genam.c:717:5
heap_fetch_toast_slice at heaptoast.c:698:17
table_relation_fetch_toast_slice at tableam.h:1924:1
toast_fetch_datum at detoast.c:379:2
detoast_attr at detoast.c:123:10
pg_detoast_datum_packed at fmgr.c:1867:10
text_to_cstring at varlena.c:220:23
AttrDefaultFetch at relcache.c:4537:17
RelationBuildTupleDesc at relcache.c:697:4
RelationBuildDesc at relcache.c:1188:24
RelationIdGetRelation at relcache.c:2116:7
ReorderBufferProcessTXN at reorderbuffer.c:2246:17
ReorderBufferReplay at reorderbuffer.c:2725:2
ReorderBufferPrepare at reorderbuffer.c:2822:2
DecodePrepare at decode.c:826:2
xact_decode at decode.c:347:5
LogicalDecodingProcessRecord at decode.c:123:1
XLogSendLogical at walsender.c:3413:33
WalSndLoop at walsender.c:2814:4
StartLogicalReplication at walsender.c:1523:2
exec_replication_command at walsender.c:2148:5
PostgresMain at postgres.c:4763:11
BackendInitialize at backend_startup.c:123:1
postmaster_child_launch at launch_backend.c:281:9
BackendStartup at postmaster.c:3593:8
ServerLoop at postmaster.c:1677:10
PostmasterMain at postmaster.c:1372:11
startup_hacks at main.c:217:1
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)
[0x7f30a90d7e40]
postgres: node_publisher: walsender law postgres [local]
START_REPLICATION(_start+0x25) [0x5647b6023565]
2024-09-28 06:44:50.713 UTC [3741741:19][walsender][25/0:0] LOG: released
logical replication slot "test_sub"
2024-09-28 06:44:50.713 UTC [3741741:20][walsender][25/0:0] LOG: could not
send data to client: Broken pipe
2024-09-28 06:44:50.713 UTC [3741741:21][walsender][25/0:0] FATAL:
connection to client lostThe issue can be easily reproduced with 022_twophase_cascade.pl modified like this: @@ -67,0 +68 @@ $node_A->safe_psql('postgres', +my $default = join('', map {chr(65 + rand 26)} (1 .. 10000)); @@ -69 +70 @@ $node_B->safe_psql('postgres', - "CREATE TABLE test_tab (a int primary key, b bytea, c timestamptz DEFAULT now(), d bigint DEFAULT 999)" + "CREATE TABLE test_tab (a int primary key, b bytea, c timestamptz DEFAULT now(), d bigint DEFAULT 999, t text DEFAULT '$default')" @@ -72 +73 @@ $node_C->safe_psql('postgres', - "CREATE TABLE test_tab (a int primary key, b bytea, c timestamptz DEFAULT now(), d bigint DEFAULT 999)" + "CREATE TABLE test_tab (a int primary key, b bytea, c timestamptz DEFAULT now(), d bigint DEFAULT 999, t text DEFAULT '$default')"Reproduced on REL_15_STABLE (starting from a8fd13cab) .. master.
Thank you for reporting the issue.
I was able to reproduce the issue by modifying 022_twophase_cascade.pl
accordingly.
The scan for toast index is actually done under systable_getnext_ordered,
where HandleConcurrentAbort() is called. So it seems to me that this
scan is actually safe for concurrent abort in logical decoding.
Logic around HandleConcurrentAbort is intorduced
https://github.com/postgres/postgres/commit/7259736a6e5b7c7588fff9578370736a6648acbb.
Though I may not understand the logic around HandleConcurrentAbort
fully and I am not sure not-setting bsysscan at
systable_beginscan_ordered is intentional,
it seems to me setting and unsetting a bsysscan flag in
systable_beginscan_ordered and systable_endscan_ordered would resolve
the issue.
Patch is attached.
Regards,
Takeshi Ideriha
Attachments:
0001-Set-and-unset-bsysscan-flag-in-systable_-_ordered.patchapplication/octet-stream; name=0001-Set-and-unset-bsysscan-flag-in-systable_-_ordered.patchDownload+17-1
On Mon, Sep 30, 2024 at 6:46 AM Takeshi Ideriha
<iderihatakeshi@gmail.com> wrote:
Patch is attached.
Thanks for the patch. I'll look into it.
--
With Regards,
Amit Kapila.
On Mon, Sep 30, 2024 at 6:46 AM Takeshi Ideriha
<iderihatakeshi@gmail.com> wrote:
Thank you for reporting the issue.
I was able to reproduce the issue by modifying 022_twophase_cascade.pl
accordingly.The scan for toast index is actually done under systable_getnext_ordered,
where HandleConcurrentAbort() is called. So it seems to me that this
scan is actually safe for concurrent abort in logical decoding.
Logic around HandleConcurrentAbort is intorduced
https://github.com/postgres/postgres/commit/7259736a6e5b7c7588fff9578370736a6648acbb.Though I may not understand the logic around HandleConcurrentAbort
fully and I am not sure not-setting bsysscan at
systable_beginscan_ordered is intentional,
it seems to me setting and unsetting a bsysscan flag in
systable_beginscan_ordered and systable_endscan_ordered would resolve
the issue.
We forgot to set/unset the flag in functions
systable_beginscan_ordered and systable_endscan_ordered. BTW,
shouldn't this occur even without prepare transaction? If so, we need
to backpatch this till 14.
Also, it is better to have a test for this, and let's ensure that the
new test doesn't increase the regression time too much if possible.
One minor point:
+
+ /*
+ * If CheckXidAlive is set then set a flag to indicate that system table
The indentation in the first comment line seems off.
--
With Regards,
Amit Kapila.
Hi, Amit
(adding hackers for discussion)
We forgot to set/unset the flag in functions
systable_beginscan_ordered and systable_endscan_ordered. BTW,
Thank you for the clarification.
shouldn't this occur even without prepare transaction? If so, we need
to backpatch this till 14.
Yes, it occurred also at PG14.
I did some tests using 015_stream.pl with some modification like
below, which tests the subscription about stream is on but two-phase
is off.
The same issue occurred at both current head source code and PG14.
```
--- a/src/test/subscription/t/015_stream.pl
+++ b/src/test/subscription/t/015_stream.pl
@@ -134,9 +134,11 @@ my $node_subscriber =
PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
$node_subscriber->start;
+my $default = join('', map {chr(65 + rand 26)} (1 .. 10000));
+
# Create some preexisting content on publisher
$node_publisher->safe_psql('postgres',
- "CREATE TABLE test_tab (a int primary key, b bytea)");
+ "CREATE TABLE test_tab (a int primary key, b bytea, t text
DEFAULT '$default')");
$node_publisher->safe_psql('postgres',
"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
@@ -144,7 +146,7 @@ $node_publisher->safe_psql('postgres', "CREATE
TABLE test_tab_2 (a int)");
# Setup structure on subscriber
$node_subscriber->safe_psql('postgres',
- "CREATE TABLE test_tab (a int primary key, b bytea, c
timestamptz DEFAULT now(), d bigint DEFAULT 999)"
+ "CREATE TABLE test_tab (a int primary key, b bytea, t text
DEFAULT '$default', c timestamptz DEFAULT now(), d bigint DEFAULT
999)"
);
```
Logs from head source:
```
2024-10-01 12:34:56.694 UTC [575202] LOG: starting PostgreSQL 18devel
on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat
11.4.1-2), 64-bit
...
2024-10-01 12:34:57.506 UTC [575258] 015_stream.pl LOG: statement: BEGIN;
2024-10-01 12:34:57.506 UTC [575258] 015_stream.pl LOG: statement:
INSERT INTO test_tab SELECT i, sha256(i::text::bytea) FROM
generate_series(3, 5000) s(i);
2024-10-01 12:34:57.524 UTC [575242] tap_sub ERROR: unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 12:34:57.524 UTC [575242] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.525 UTC [575242] tap_sub LOG: released logical
replication slot "tap_sub"
2024-10-01 12:34:57.525 UTC [575242] tap_sub LOG: could not send data
to client: Broken pipe
2024-10-01 12:34:57.525 UTC [575242] tap_sub FATAL: connection to client lost
...
2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.829 UTC [575260] tap_sub LOG: starting logical
decoding for slot "tap_sub"
2024-10-01 12:34:57.829 UTC [575260] tap_sub DETAIL: Streaming
transactions committing after 0/1583A68, reading WAL from 0/1583A30.
2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.829 UTC [575260] tap_sub LOG: logical decoding
found consistent point at 0/1583A30
2024-10-01 12:34:57.829 UTC [575260] tap_sub DETAIL: There are no
running transactions.
2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.831 UTC [575260] tap_sub ERROR: unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 12:34:57.831 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.832 UTC [575260] tap_sub LOG: released logical
replication slot "tap_sub"
2024-10-01 12:34:57.832 UTC [575260] tap_sub LOG: could not send data
to client: Broken pipe
2024-10-01 12:34:57.832 UTC [575260] tap_sub FATAL: connection to client lost
```
Logs from PG14 source:
```
2024-10-01 13:20:29.409 UTC [593696] LOG: starting PostgreSQL 14.9 on
x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat
11.4.1-2), 64-bit
...
2024-10-01 13:20:31.285 UTC [593750] 015_stream.pl LOG: statement: BEGIN;
2024-10-01 13:20:31.285 UTC [593750] 015_stream.pl LOG: statement:
INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,
5000) s(i);
2024-10-01 13:20:31.301 UTC [593740] tap_sub ERROR: unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 13:20:31.301 UTC [593740] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.302 UTC [593740] tap_sub LOG: could not send data
to client: Broken pipe
2024-10-01 13:20:31.302 UTC [593740] tap_sub FATAL: connection to client lost
...
2024-10-01 13:20:31.830 UTC [593756] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.830 UTC [593756] tap_sub LOG: logical decoding
found consistent point at 0/172C758
2024-10-01 13:20:31.830 UTC [593756] tap_sub DETAIL: There are no
running transactions.
2024-10-01 13:20:31.830 UTC [593756] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.834 UTC [593756] tap_sub ERROR: unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 13:20:31.834 UTC [593756] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.835 UTC [593756] tap_sub LOG: could not send data
to client: Broken pipe
2024-10-01 13:20:31.835 UTC [593756] tap_sub FATAL: connection to client lost
```
Also, it is better to have a test for this, and let's ensure that the
new test doesn't increase the regression time too much if possible.
Sure. I'm going to add some test codes in a few days.
The indentation in the first comment line seems off.
Opps, thank you. Fixed.
regards,
Takeshi Ideriha
Attachments:
0001-Set-and-unset-bsysscan-flag-in-systable_-_ordered-v2.patchapplication/x-patch; name=0001-Set-and-unset-bsysscan-flag-in-systable_-_ordered-v2.patchDownload+16-1
Hi, Amit
(Sending this again due to previous email got moderated because I
tried to multi post to hackers and bugs.)
We forgot to set/unset the flag in functions
systable_beginscan_ordered and systable_endscan_ordered. BTW,
Thank you for the clarification.
shouldn't this occur even without prepare transaction? If so, we need
to backpatch this till 14.
Yes, it occurred also at PG14.
I did some tests using 015_stream.pl with some modification like
below, which tests the subscription about stream is on but two-phase
is off.
The same issue occurred at both current head source code and PG14.
```
--- a/src/test/subscription/t/015_stream.pl
+++ b/src/test/subscription/t/015_stream.pl
@@ -134,9 +134,11 @@ my $node_subscriber =
PostgreSQL::Test::Cluster->new
('subscriber');
$node_subscriber->init;
$node_subscriber->start;
+my $default = join('', map {chr(65 + rand 26)} (1 .. 10000));
+
# Create some preexisting content on publisher
$node_publisher->safe_psql('postgres',
- "CREATE TABLE test_tab (a int primary key, b bytea)");
+ "CREATE TABLE test_tab (a int primary key, b bytea, t text
DEFAULT '$default')");
$node_publisher->safe_psql('postgres',
"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
@@ -144,7 +146,7 @@ $node_publisher->safe_psql('postgres', "CREATE
TABLE test_tab_2 (a int)");
# Setup structure on subscriber
$node_subscriber->safe_psql('postgres',
- "CREATE TABLE test_tab (a int primary key, b bytea, c
timestamptz DEFAULT now(), d bigint DEFAULT 999)"
+ "CREATE TABLE test_tab (a int primary key, b bytea, t text
DEFAULT '$default', c timestamptz DEFAULT now(), d bigint DEFAULT
999)"
);
```
Logs from head source:
```
2024-10-01 12:34:56.694 UTC [575202] LOG: starting PostgreSQL 18devel
on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat
11.4.1-2), 64-bit
...
2024-10-01 12:34:57.506 UTC [575258] 015_stream.pl LOG: statement: BEGIN;
2024-10-01 12:34:57.506 UTC [575258] 015_stream.pl LOG: statement:
INSERT INTO test_tab SELECT i, sha256(i::text::bytea) FROM
generate_series(3, 5000) s(i);
2024-10-01 12:34:57.524 UTC [575242] tap_sub ERROR: unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 12:34:57.524 UTC [575242] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.525 UTC [575242] tap_sub LOG: released logical
replication slot "tap_sub"
2024-10-01 12:34:57.525 UTC [575242] tap_sub LOG: could not send data
to client: Broken pipe
2024-10-01 12:34:57.525 UTC [575242] tap_sub FATAL: connection to client lost
...
2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.829 UTC [575260] tap_sub LOG: starting logical
decoding for slot "tap_sub"
2024-10-01 12:34:57.829 UTC [575260] tap_sub DETAIL: Streaming
transactions committing after 0/1583A68, reading WAL from 0/1583A30.
2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.829 UTC [575260] tap_sub LOG: logical decoding
found consistent point at 0/1583A30
2024-10-01 12:34:57.829 UTC [575260] tap_sub DETAIL: There are no
running transactions.
2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.831 UTC [575260] tap_sub ERROR: unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 12:34:57.831 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.832 UTC [575260] tap_sub LOG: released logical
replication slot "tap_sub"
2024-10-01 12:34:57.832 UTC [575260] tap_sub LOG: could not send data
to client: Broken pipe
2024-10-01 12:34:57.832 UTC [575260] tap_sub FATAL: connection to client lost
```
Logs from PG14 source:
```
2024-10-01 13:20:29.409 UTC [593696] LOG: starting PostgreSQL 14.9 on
x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat
11.4.1-2), 64-bit
...
2024-10-01 13:20:31.285 UTC [593750] 015_stream.pl LOG: statement: BEGIN;
2024-10-01 13:20:31.285 UTC [593750] 015_stream.pl LOG: statement:
INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,
5000) s(i);
2024-10-01 13:20:31.301 UTC [593740] tap_sub ERROR: unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 13:20:31.301 UTC [593740] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.302 UTC [593740] tap_sub LOG: could not send data
to client: Broken pipe
2024-10-01 13:20:31.302 UTC [593740] tap_sub FATAL: connection to client lost
...
2024-10-01 13:20:31.830 UTC [593756] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.830 UTC [593756] tap_sub LOG: logical decoding
found consistent point at 0/172C758
2024-10-01 13:20:31.830 UTC [593756] tap_sub DETAIL: There are no
running transactions.
2024-10-01 13:20:31.830 UTC [593756] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.834 UTC [593756] tap_sub ERROR: unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 13:20:31.834 UTC [593756] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.835 UTC [593756] tap_sub LOG: could not send data
to client: Broken pipe
2024-10-01 13:20:31.835 UTC [593756] tap_sub FATAL: connection to client lost
```
Also, it is better to have a test for this, and let's ensure that the
new test doesn't increase the regression time too much if possible.
Sure. I'm going to add some test codes in a few days.
The indentation in the first comment line seems off.
Opps, thank you. Fixed.
regards,
Takeshi Ideriha
Attachments:
0001-Set-and-unset-bsysscan-flag-in-systable_-_ordered-v2.patchapplication/octet-stream; name=0001-Set-and-unset-bsysscan-flag-in-systable_-_ordered-v2.patchDownload+16-1
On Tue, Oct 1, 2024 at 7:07 PM Takeshi Ideriha <iderihatakeshi@gmail.com> wrote:
We forgot to set/unset the flag in functions
systable_beginscan_ordered and systable_endscan_ordered. BTW,Thank you for the clarification.
shouldn't this occur even without prepare transaction? If so, we need
to backpatch this till 14.Yes, it occurred also at PG14.
I did some tests using 015_stream.pl with some modification like
below, which tests the subscription about stream is on but two-phase
is off.
The same issue occurred at both current head source code and PG14.``` --- a/src/test/subscription/t/015_stream.pl +++ b/src/test/subscription/t/015_stream.pl @@ -134,9 +134,11 @@ my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber'); $node_subscriber->init; $node_subscriber->start;+my $default = join('', map {chr(65 + rand 26)} (1 .. 10000)); + # Create some preexisting content on publisher $node_publisher->safe_psql('postgres', - "CREATE TABLE test_tab (a int primary key, b bytea)"); + "CREATE TABLE test_tab (a int primary key, b bytea, t text DEFAULT '$default')"); $node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");@@ -144,7 +146,7 @@ $node_publisher->safe_psql('postgres', "CREATE
TABLE test_tab_2 (a int)");# Setup structure on subscriber $node_subscriber->safe_psql('postgres', - "CREATE TABLE test_tab (a int primary key, b bytea, c timestamptz DEFAULT now(), d bigint DEFAULT 999)" + "CREATE TABLE test_tab (a int primary key, b bytea, t text DEFAULT '$default', c timestamptz DEFAULT now(), d bigint DEFAULT 999)" ); ```Logs from head source:
```
2024-10-01 12:34:56.694 UTC [575202] LOG: starting PostgreSQL 18devel
on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat
11.4.1-2), 64-bit
...
2024-10-01 12:34:57.506 UTC [575258] 015_stream.pl LOG: statement: BEGIN;
2024-10-01 12:34:57.506 UTC [575258] 015_stream.pl LOG: statement:
INSERT INTO test_tab SELECT i, sha256(i::text::bytea) FROM
generate_series(3, 5000) s(i);
2024-10-01 12:34:57.524 UTC [575242] tap_sub ERROR: unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 12:34:57.524 UTC [575242] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.525 UTC [575242] tap_sub LOG: released logical
replication slot "tap_sub"
2024-10-01 12:34:57.525 UTC [575242] tap_sub LOG: could not send data
to client: Broken pipe
2024-10-01 12:34:57.525 UTC [575242] tap_sub FATAL: connection to client lost
...
2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.829 UTC [575260] tap_sub LOG: starting logical
decoding for slot "tap_sub"
2024-10-01 12:34:57.829 UTC [575260] tap_sub DETAIL: Streaming
transactions committing after 0/1583A68, reading WAL from 0/1583A30.
2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.829 UTC [575260] tap_sub LOG: logical decoding
found consistent point at 0/1583A30
2024-10-01 12:34:57.829 UTC [575260] tap_sub DETAIL: There are no
running transactions.
2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.831 UTC [575260] tap_sub ERROR: unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 12:34:57.831 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.832 UTC [575260] tap_sub LOG: released logical
replication slot "tap_sub"
2024-10-01 12:34:57.832 UTC [575260] tap_sub LOG: could not send data
to client: Broken pipe
2024-10-01 12:34:57.832 UTC [575260] tap_sub FATAL: connection to client lost
```Logs from PG14 source:
```
2024-10-01 13:20:29.409 UTC [593696] LOG: starting PostgreSQL 14.9 on
x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat
11.4.1-2), 64-bit
...
2024-10-01 13:20:31.285 UTC [593750] 015_stream.pl LOG: statement: BEGIN;
2024-10-01 13:20:31.285 UTC [593750] 015_stream.pl LOG: statement:
INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,
5000) s(i);
2024-10-01 13:20:31.301 UTC [593740] tap_sub ERROR: unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 13:20:31.301 UTC [593740] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.302 UTC [593740] tap_sub LOG: could not send data
to client: Broken pipe
2024-10-01 13:20:31.302 UTC [593740] tap_sub FATAL: connection to client lost
...
2024-10-01 13:20:31.830 UTC [593756] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.830 UTC [593756] tap_sub LOG: logical decoding
found consistent point at 0/172C758
2024-10-01 13:20:31.830 UTC [593756] tap_sub DETAIL: There are no
running transactions.
2024-10-01 13:20:31.830 UTC [593756] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.834 UTC [593756] tap_sub ERROR: unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 13:20:31.834 UTC [593756] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.835 UTC [593756] tap_sub LOG: could not send data
to client: Broken pipe
2024-10-01 13:20:31.835 UTC [593756] tap_sub FATAL: connection to client lost
```Also, it is better to have a test for this, and let's ensure that the
new test doesn't increase the regression time too much if possible.Sure. I'm going to add some test codes in a few days.
The above test shown in the email will work to test this issue.
However, we should use 'debug_logical_replication_streaming' as that
will help to reproduce the issue with just one write operation. Is it
possible to add the test in 018_stream_subxact_abort where we already
use it?
--
With Regards,
Amit Kapila.
On Tue, Oct 1, 2024 at 7:07 PM Takeshi Ideriha <iderihatakeshi@gmail.com> wrote:
Sending it again to avoid getting it stuck because the original was
replied to both -hackers and -bugs.
We forgot to set/unset the flag in functions
systable_beginscan_ordered and systable_endscan_ordered. BTW,Thank you for the clarification.
shouldn't this occur even without prepare transaction? If so, we need
to backpatch this till 14.Yes, it occurred also at PG14.
I did some tests using 015_stream.pl with some modification like
below, which tests the subscription about stream is on but two-phase
is off.
The same issue occurred at both current head source code and PG14.``` --- a/src/test/subscription/t/015_stream.pl +++ b/src/test/subscription/t/015_stream.pl @@ -134,9 +134,11 @@ my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber'); $node_subscriber->init; $node_subscriber->start;+my $default = join('', map {chr(65 + rand 26)} (1 .. 10000)); + # Create some preexisting content on publisher $node_publisher->safe_psql('postgres', - "CREATE TABLE test_tab (a int primary key, b bytea)"); + "CREATE TABLE test_tab (a int primary key, b bytea, t text DEFAULT '$default')"); $node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");@@ -144,7 +146,7 @@ $node_publisher->safe_psql('postgres', "CREATE
TABLE test_tab_2 (a int)");# Setup structure on subscriber $node_subscriber->safe_psql('postgres', - "CREATE TABLE test_tab (a int primary key, b bytea, c timestamptz DEFAULT now(), d bigint DEFAULT 999)" + "CREATE TABLE test_tab (a int primary key, b bytea, t text DEFAULT '$default', c timestamptz DEFAULT now(), d bigint DEFAULT 999)" ); ```Logs from head source:
```
2024-10-01 12:34:56.694 UTC [575202] LOG: starting PostgreSQL 18devel
on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat
11.4.1-2), 64-bit
...
2024-10-01 12:34:57.506 UTC [575258] 015_stream.pl LOG: statement: BEGIN;
2024-10-01 12:34:57.506 UTC [575258] 015_stream.pl LOG: statement:
INSERT INTO test_tab SELECT i, sha256(i::text::bytea) FROM
generate_series(3, 5000) s(i);
2024-10-01 12:34:57.524 UTC [575242] tap_sub ERROR: unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 12:34:57.524 UTC [575242] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.525 UTC [575242] tap_sub LOG: released logical
replication slot "tap_sub"
2024-10-01 12:34:57.525 UTC [575242] tap_sub LOG: could not send data
to client: Broken pipe
2024-10-01 12:34:57.525 UTC [575242] tap_sub FATAL: connection to client lost
...
2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.829 UTC [575260] tap_sub LOG: starting logical
decoding for slot "tap_sub"
2024-10-01 12:34:57.829 UTC [575260] tap_sub DETAIL: Streaming
transactions committing after 0/1583A68, reading WAL from 0/1583A30.
2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.829 UTC [575260] tap_sub LOG: logical decoding
found consistent point at 0/1583A30
2024-10-01 12:34:57.829 UTC [575260] tap_sub DETAIL: There are no
running transactions.
2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.831 UTC [575260] tap_sub ERROR: unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 12:34:57.831 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.832 UTC [575260] tap_sub LOG: released logical
replication slot "tap_sub"
2024-10-01 12:34:57.832 UTC [575260] tap_sub LOG: could not send data
to client: Broken pipe
2024-10-01 12:34:57.832 UTC [575260] tap_sub FATAL: connection to client lost
```Logs from PG14 source:
```
2024-10-01 13:20:29.409 UTC [593696] LOG: starting PostgreSQL 14.9 on
x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat
11.4.1-2), 64-bit
...
2024-10-01 13:20:31.285 UTC [593750] 015_stream.pl LOG: statement: BEGIN;
2024-10-01 13:20:31.285 UTC [593750] 015_stream.pl LOG: statement:
INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,
5000) s(i);
2024-10-01 13:20:31.301 UTC [593740] tap_sub ERROR: unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 13:20:31.301 UTC [593740] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.302 UTC [593740] tap_sub LOG: could not send data
to client: Broken pipe
2024-10-01 13:20:31.302 UTC [593740] tap_sub FATAL: connection to client lost
...
2024-10-01 13:20:31.830 UTC [593756] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.830 UTC [593756] tap_sub LOG: logical decoding
found consistent point at 0/172C758
2024-10-01 13:20:31.830 UTC [593756] tap_sub DETAIL: There are no
running transactions.
2024-10-01 13:20:31.830 UTC [593756] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.834 UTC [593756] tap_sub ERROR: unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 13:20:31.834 UTC [593756] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.835 UTC [593756] tap_sub LOG: could not send data
to client: Broken pipe
2024-10-01 13:20:31.835 UTC [593756] tap_sub FATAL: connection to client lost
```Also, it is better to have a test for this, and let's ensure that the
new test doesn't increase the regression time too much if possible.Sure. I'm going to add some test codes in a few days.
The above test shown in the email will work to test this issue.
However, we should use 'debug_logical_replication_streaming' as that
will help to reproduce the issue with just one write operation. Is it
possible to add the test in 018_stream_subxact_abort where we already
use it?
--
With Regards,
Amit Kapila.
On Saturday, October 5, 2024 6:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Sending it again to avoid getting it stuck because the original was replied to both -hackers and -bugs.
On Tue, Oct 1, 2024 at 7:07 PM Takeshi Ideriha
<iderihatakeshi@gmail.com>
wrote:We forgot to set/unset the flag in functions
systable_beginscan_ordered and systable_endscan_ordered. BTW,Thank you for the clarification.
shouldn't this occur even without prepare transaction? If so, we
need to backpatch this till 14.Yes, it occurred also at PG14.
I did some tests using 015_stream.pl with some modification like
below, which tests the subscription about stream is on but two-phase
is off.
The same issue occurred at both current head source code and PG14.``` --- a/src/test/subscription/t/015_stream.pl +++ b/src/test/subscription/t/015_stream.pl @@ -134,9 +134,11 @@ my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber'); $node_subscriber->init; $node_subscriber->start;+my $default = join('', map {chr(65 + rand 26)} (1 .. 10000)); + # Create some preexisting content on publisher $node_publisher->safe_psql('postgres', - "CREATE TABLE test_tab (a int primary key, b bytea)"); + "CREATE TABLE test_tab (a int primary key, b bytea, t text DEFAULT '$default')"); $node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");@@ -144,7 +146,7 @@ $node_publisher->safe_psql('postgres', "CREATE
TABLE test_tab_2 (a int)");# Setup structure on subscriber $node_subscriber->safe_psql('postgres', - "CREATE TABLE test_tab (a int primary key, b bytea, c timestamptz DEFAULT now(), d bigint DEFAULT 999)" + "CREATE TABLE test_tab (a int primary key, b bytea, t text DEFAULT '$default', c timestamptz DEFAULT now(), d bigint DEFAULT 999)"Also, it is better to have a test for this, and let's ensure that
the new test doesn't increase the regression time too much if possible.Sure. I'm going to add some test codes in a few days.
The above test shown in the email will work to test this issue.
However, we should use 'debug_logical_replication_streaming' as that
will help to reproduce the issue with just one write operation. Is it
possible to add the test in 018_stream_subxact_abort where we already use it?
I think it might be better to put the test in test_decoding/sql/stream.sql as
the bug is only related to the decoding part. Writing it in sql test also seems cheaper.
Just share a top-up patch 0002 which includes the tests I suggested for
reference. The test would fail without the fix.
Node that I didn't change any logic in 0001 as the fix looks good to
me. I just ran the pgindent for the 0001.
Best Regards,
Hou zj
Attachments:
v3-0001-Set-and-unset-bsysscan-flag-in-systable_-_ordered.patchapplication/octet-stream; name=v3-0001-Set-and-unset-bsysscan-flag-in-systable_-_ordered.patchDownload+16-1
v3-0002-add-a-testcase-in-stream.sql.patchapplication/octet-stream; name=v3-0002-add-a-testcase-in-stream.sql.patchDownload+44-1
On Monday, October 7, 2024 1:57 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
On Saturday, October 5, 2024 6:12 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:Sending it again to avoid getting it stuck because the original was replied to
both -hackers and -bugs.On Tue, Oct 1, 2024 at 7:07 PM Takeshi Ideriha
<iderihatakeshi@gmail.com>
wrote:We forgot to set/unset the flag in functions
systable_beginscan_ordered and systable_endscan_ordered. BTW,Thank you for the clarification.
shouldn't this occur even without prepare transaction? If so, we
need to backpatch this till 14.Yes, it occurred also at PG14.
I did some tests using 015_stream.pl with some modification like
below, which tests the subscription about stream is on but two-phase
is off.
The same issue occurred at both current head source code and PG14.``` --- a/src/test/subscription/t/015_stream.pl +++ b/src/test/subscription/t/015_stream.pl @@ -134,9 +134,11 @@ my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber'); $node_subscriber->init; $node_subscriber->start;+my $default = join('', map {chr(65 + rand 26)} (1 .. 10000)); + # Create some preexisting content on publisher $node_publisher->safe_psql('postgres', - "CREATE TABLE test_tab (a int primary key, b bytea)"); + "CREATE TABLE test_tab (a int primary key, b bytea, t text DEFAULT '$default')"); $node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");@@ -144,7 +146,7 @@ $node_publisher->safe_psql('postgres', "CREATE
TABLE test_tab_2 (a int)");# Setup structure on subscriber $node_subscriber->safe_psql('postgres', - "CREATE TABLE test_tab (a int primary key, b bytea, c timestamptz DEFAULT now(), d bigint DEFAULT 999)" + "CREATE TABLE test_tab (a int primary key, b bytea, t text DEFAULT '$default', c timestamptz DEFAULT now(), d bigint DEFAULT 999)"Also, it is better to have a test for this, and let's ensure that
the new test doesn't increase the regression time too much if possible.Sure. I'm going to add some test codes in a few days.
The above test shown in the email will work to test this issue.
However, we should use 'debug_logical_replication_streaming' as that
will help to reproduce the issue with just one write operation. Is it
possible to add the test in 018_stream_subxact_abort where we already useit?
I think it might be better to put the test in test_decoding/sql/stream.sql as the
bug is only related to the decoding part. Writing it in sql test also seems
cheaper.Just share a top-up patch 0002 which includes the tests I suggested for
reference. The test would fail without the fix.Node that I didn't change any logic in 0001 as the fix looks good to me. I just ran
the pgindent for the 0001.
While testing the fix on back-branches, I realized that we don't have the
debug_logical_replication_streaming GUC on PG14 and 15. So I tried to generate
enough data to exceed the logical_decoding_work_mem in the test instead. Please
see the attachment for reference. I also rebased 0001 on back-branches, so just
share them here as well.
Best Regards,
Hou zj