BUG #18815: Logical replication worker Segmentation fault
The following bug has been logged on the website:
Bug reference: 18815
Logged by: Sergey Belyashov
Email address: sergey.belyashov@gmail.com
PostgreSQL version: 17.3
Operating system: Debian bookworm x86_64
Description:
Today I tried to upgrade my cluster from postgresql-16 to postgresql-17, and
it was successful until I restored some logical replication subscriptions.
When a subscription is activated and the first data arrive, the server logs:
2025-02-17 13:34:08.975 [98417] LOG: logical replication apply worker for
subscription "node4_closed_sessions_sub" has started
2025-02-17 13:34:11.213 [62583] LOG: background worker "logical
replication apply worker" (PID 98417) was terminated by signal 11:
Segmentation fault
2025-02-17 13:34:11.213 [62583] LOG: terminating any other active server
processes
2025-02-17 13:34:11.240 [62583] LOG: all server processes terminated;
reinitializing
2025-02-17 13:34:11.310 [98418] LOG: database system was interrupted; last
known up at 2025-02-17 13:22:08
and then restarts.
The kernel logged the following info:
[94740743.468001] postgres[98417]: segfault at 10 ip 0000562b2b74d69c sp
00007fff284a7320 error 4 in postgres[562b2b6bb000+595000]
[94740743.468173] Code: 1f 80 00 00 00 00 44 89 e0 48 8b 15 56 0b 82 00 f7
d0 48 98 4c 8b 3c c2 eb 99 0f 1f 40 00 55 48 89 e5 53 48 89
fb 48 83 ec 08 <8b> 7f 10 e8 4c b1 32 00 8b 7b 14 85 ff 75 15 48 89 df 48
8b 5d f8
After some investigation I found that the segfault is caused by one type of
subscription: subscriptions for huge partitioned tables on the publisher and
subscriber (via the root table), created with copy_data = false
(the source table is updated by inserts and partition detaches, it is huge,
and the data transfer is not compressed, so an initial copy could take days).
The segfault does not occur immediately after the subscription is created,
but once data starts arriving from the publisher. The subscriber then
restarts, recovers, runs the subscription again, hits the segfault, and
repeats until the subscription is disabled.
Subscriptions for (small) tables without partitions work fine.
The publisher server version makes no difference: both version 16 and
version 17 publishers cause the segfault on the subscriber (version 17.3).
PostgreSQL versions 12-16 worked for years without any segfault with the
same partitioned tables and publications/subscriptions.
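For illustration, the kind of setup described above would look roughly like
this (the table, publication, and connection names are hypothetical;
publishing "via root" is assumed to mean publish_via_partition_root, and the
subscription option is spelled copy_data):
-- publisher: publish the partitioned table through its root
CREATE PUBLICATION sessions_pub FOR TABLE closed_sessions
    WITH (publish_via_partition_root = true);
-- subscriber: skip the initial copy because the table is huge
CREATE SUBSCRIPTION node4_closed_sessions_sub
    CONNECTION 'host=publisher dbname=mydb'
    PUBLICATION sessions_pub
    WITH (copy_data = false);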
postgresql-17=17.3-3.pgdg120+1 installed from the repository:
http://apt.postgresql.org/pub/repos/apt/ bookworm-pgdg main
PG Bug reporting form <noreply@postgresql.org> writes:
After some investigation I found that the segfault is caused by one type of
subscription: subscriptions for huge partitioned tables on the publisher and
subscriber (via the root table), created with copy_data = false
(the source table is updated by inserts and partition detaches, it is huge,
and the data transfer is not compressed, so an initial copy could take days).
The segfault does not occur immediately after the subscription is created,
but once data starts arriving from the publisher. The subscriber then
restarts, recovers, runs the subscription again, hits the segfault, and
repeats until the subscription is disabled.
This is not enough information for anyone else to reproduce the
problem; it very likely depends on details that you haven't mentioned.
Can you create a reproducer case? I'm hoping for a script that sets
up the necessary tables and subscriptions and populates the tables
with enough dummy data to cause the failure.
Something that might be less work for you is to get a stack trace
from the crash:
https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend
However, I make no promises that we can isolate the cause from
just a stack trace. A reproducer would be much better.
regards, tom lane
Hi,
I think this backtrace will help.
Core was generated by `postgres: 17/main: logical replication apply
worker for subscription 602051860'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00005635402c869c in brinRevmapTerminate (revmap=0x0)
at ./build/../src/backend/access/brin/brin_revmap.c:102
(gdb) backtrace
#0 0x00005635402c869c in brinRevmapTerminate (revmap=0x0)
at ./build/../src/backend/access/brin/brin_revmap.c:102
#1 0x00005635402bfddd in brininsertcleanup (index=<optimized out>,
indexInfo=<optimized out>)
at ./build/../src/backend/access/brin/brin.c:515
#2 0x0000563540479309 in ExecCloseIndices
(resultRelInfo=resultRelInfo@entry=0x563541cab8d0)
at ./build/../src/backend/executor/execIndexing.c:248
#3 0x000056354048067f in ExecCleanupTupleRouting
(mtstate=0x563541c6db58, proute=0x563541cab638)
at ./build/../src/backend/executor/execPartition.c:1270
#4 0x00005635405c89f7 in finish_edata (edata=0x563541ca0fa8)
at ./build/../src/backend/replication/logical/worker.c:718
#5 0x00005635405cc6c4 in apply_handle_insert (s=0x7f61d2a3a1d8)
at ./build/../src/backend/replication/logical/worker.c:2438
#6 apply_dispatch (s=s@entry=0x7ffd30d95a70) at
./build/../src/backend/replication/logical/worker.c:3296
#7 0x00005635405cdb7f in LogicalRepApplyLoop (last_received=106949425100872)
at ./build/../src/backend/replication/logical/worker.c:3587
#8 start_apply (origin_startpos=origin_startpos@entry=0)
at ./build/../src/backend/replication/logical/worker.c:4429
#9 0x00005635405ce11f in run_apply_worker () at
./build/../src/backend/replication/logical/worker.c:4550
#10 ApplyWorkerMain (main_arg=<optimized out>) at
./build/../src/backend/replication/logical/worker.c:4719
#11 0x0000563540594bf8 in BackgroundWorkerMain (startup_data=<optimized out>,
startup_data_len=<optimized out>) at
./build/../src/backend/postmaster/bgworker.c:848
#12 0x0000563540596daa in postmaster_child_launch
(child_type=child_type@entry=B_BG_WORKER,
startup_data=startup_data@entry=0x563541bc3618 "",
startup_data_len=startup_data_len@entry=1472,
client_sock=client_sock@entry=0x0) at
./build/../src/backend/postmaster/launch_backend.c:277
#13 0x0000563540598f88 in do_start_bgworker (rw=0x563541bc3618)
at ./build/../src/backend/postmaster/postmaster.c:4272
#14 maybe_start_bgworkers () at
./build/../src/backend/postmaster/postmaster.c:4503
#15 0x0000563540599fea in process_pm_pmsignal () at
./build/../src/backend/postmaster/postmaster.c:3776
#16 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1669
#17 0x000056354059c7c8 in PostmasterMain (argc=argc@entry=5,
argv=argv@entry=0x563541b229c0)
at ./build/../src/backend/postmaster/postmaster.c:1374
#18 0x00005635402bf5b1 in main (argc=5, argv=0x563541b229c0) at
./build/../src/backend/main/main.c:199
The destination (subscriber) table has two timestamp columns, "started" and
"closed", with a BRIN index on each of them. The table is partitioned by
range on the "closed" column, and each partition is split into 6
subpartitions by list (on the remainder of id).
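A rough sketch of that layout (table and partition names are hypothetical,
and only one range partition with one of its list subpartitions is shown):
CREATE TABLE closed_sessions (
    id      bigint NOT NULL,
    started timestamp NOT NULL,
    closed  timestamp NOT NULL
) PARTITION BY RANGE (closed);
-- a BRIN index on each of the two timestamp columns
CREATE INDEX ON closed_sessions USING brin (started);
CREATE INDEX ON closed_sessions USING brin (closed);
-- range partition on "closed", itself list-partitioned on id % 6
CREATE TABLE closed_sessions_2025_02 PARTITION OF closed_sessions
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01')
    PARTITION BY LIST ((id % 6));
CREATE TABLE closed_sessions_2025_02_r0 PARTITION OF closed_sessions_2025_02
    FOR VALUES IN (0);
-- ... and likewise for remainders 1 through 5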
Best regards,
Sergey Belyashov
Sergey Belyashov <sergey.belyashov@gmail.com> writes:
I think backtrace will help.
Core was generated by `postgres: 17/main: logical replication apply
worker for subscription 602051860'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00005635402c869c in brinRevmapTerminate (revmap=0x0)
at ./build/../src/backend/access/brin/brin_revmap.c:102
(gdb) backtrace
#0 0x00005635402c869c in brinRevmapTerminate (revmap=0x0)
at ./build/../src/backend/access/brin/brin_revmap.c:102
#1 0x00005635402bfddd in brininsertcleanup (index=<optimized out>,
indexInfo=<optimized out>)
at ./build/../src/backend/access/brin/brin.c:515
#2 0x0000563540479309 in ExecCloseIndices
(resultRelInfo=resultRelInfo@entry=0x563541cab8d0)
at ./build/../src/backend/executor/execIndexing.c:248
Thanks! It seems clear from that that the fault is basically in
brininsertcleanup(), which is trashing the BrinInsertState but
leaving indexInfo->ii_AmCache still pointing at it, so that
the next brininsert() will think it has a valid cache entry.
I suspect that the attached will fix it. What I don't understand
is why it's apparently so hard to trigger the crash, because it
looks to me like any two successive insert commands on the same
BRIN index should hit this.
BTW, I'm also a bit suspicious of the comment's claim that the
brinDesc doesn't need cleanup. That looks like a potential
memory leak.
regards, tom lane
Attachments:
fix-brininsertcleanup.patch (text/x-diff)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 4265687afa..60320440fc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -511,16 +511,18 @@ brininsertcleanup(Relation index, IndexInfo *indexInfo)
BrinInsertState *bistate = (BrinInsertState *) indexInfo->ii_AmCache;
/* bail out if cache not initialized */
- if (indexInfo->ii_AmCache == NULL)
+ if (bistate == NULL)
return;
+ /* do this first to avoid dangling pointer if we fail partway through */
+ indexInfo->ii_AmCache = NULL;
+
/*
* Clean up the revmap. Note that the brinDesc has already been cleaned up
* as part of its own memory context.
*/
brinRevmapTerminate(bistate->bis_rmAccess);
- bistate->bis_rmAccess = NULL;
- bistate->bis_desc = NULL;
+ pfree(bistate);
}
/*
I wrote:
I suspect that the attached will fix it. What I don't understand
is why it's apparently so hard to trigger the crash, because it
looks to me like any two successive insert commands on the same
BRIN index should hit this.
Oh, wait: I was confusing ii_AmCache with rd_amcache in the index's
relcache entry. This coding would absolutely not work with rd_amcache
since that's persistent. It mostly works with the IndexInfo field
though, since an IndexInfo typically only survives per-query.
Evidently there's some path in logical replication that will re-use an
IndexInfo across multiple distinct insertion operations, and that's
what breaks it.
BTW, I'm also a bit suspicious of the comment's claim that the
brinDesc doesn't need cleanup. That looks like a potential
memory leak.
This concern still stands.
regards, tom lane
Further to this ... I'd still really like to have a reproducer.
While brininsertcleanup is clearly being less robust than it should
be, I now suspect that there is another bug somewhere further down
the call stack. We're getting to this point via ExecCloseIndices,
and that should be paired with ExecOpenIndices, and that would have
created a fresh IndexInfo. So it looks a lot like some path in a
logrep worker is able to call ExecCloseIndices twice on the same
working data. That would probably lead to a "releasing a lock you
don't own" error if we weren't hitting this crash first.
regards, tom lane
I wrote:
Further to this ... I'd still really like to have a reproducer.
While brininsertcleanup is clearly being less robust than it should
be, I now suspect that there is another bug somewhere further down
the call stack. We're getting to this point via ExecCloseIndices,
and that should be paired with ExecOpenIndices, and that would have
created a fresh IndexInfo. So it looks a lot like some path in a
logrep worker is able to call ExecCloseIndices twice on the same
working data. That would probably lead to a "releasing a lock you
don't own" error if we weren't hitting this crash first.
Hmm ... I tried modifying ExecCloseIndices to blow up if called
twice, as in the attached. This gets through core regression
just fine, but it blows up in three different subscription TAP
tests, all with a stack trace matching Sergey's:
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f064bfe3e65 in __GI_abort () at abort.c:79
#2 0x00000000009e9253 in ExceptionalCondition (
conditionName=conditionName@entry=0xb8717b "indexDescs[i] != NULL",
fileName=fileName@entry=0xb87139 "execIndexing.c",
lineNumber=lineNumber@entry=249) at assert.c:66
#3 0x00000000006f0b13 in ExecCloseIndices (
resultRelInfo=resultRelInfo@entry=0x2f11c18) at execIndexing.c:249
#4 0x00000000006f86d8 in ExecCleanupTupleRouting (mtstate=0x2ef92d8,
proute=0x2ef94e8) at execPartition.c:1273
#5 0x0000000000848cb6 in finish_edata (edata=0x2ef8f50) at worker.c:717
#6 0x000000000084d0a0 in apply_handle_insert (s=<optimized out>)
at worker.c:2460
#7 apply_dispatch (s=<optimized out>) at worker.c:3389
#8 0x000000000084e494 in LogicalRepApplyLoop (last_received=25066600)
at worker.c:3680
#9 start_apply (origin_startpos=0) at worker.c:4507
#10 0x000000000084e711 in run_apply_worker () at worker.c:4629
#11 ApplyWorkerMain (main_arg=<optimized out>) at worker.c:4798
#12 0x00000000008138f9 in BackgroundWorkerMain (startup_data=<optimized out>,
startup_data_len=<optimized out>) at bgworker.c:842
The problem seems to be that apply_handle_insert_internal does
ExecOpenIndices and then ExecCloseIndices, and then
ExecCleanupTupleRouting does ExecCloseIndices again, which nicely
explains why brininsertcleanup blows up if you happen to have a BRIN
index involved. What it doesn't explain is how come we don't see
other symptoms from the duplicate index_close calls, regardless of
index type. I'd have expected an assertion failure from
RelationDecrementReferenceCount, and/or an assertion failure for
nonzero rd_refcnt at transaction end, and/or a "you don't own a lock
of type X" gripe from LockRelease. We aren't getting any of those,
but why not, if this code is as broken as I think it is?
(On closer inspection, we seem to have about 99% broken relcache.c's
ability to notice rd_refcnt being nonzero at transaction end, but
the other two things should still be happening.)
regards, tom lane
Attachments:
complain-about-duplicate-ExecCloseIndices.patch (text/x-diff)
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 7c87f012c3..a264a2edbc 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -246,14 +246,15 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
for (i = 0; i < numIndices; i++)
{
- if (indexDescs[i] == NULL)
- continue; /* shouldn't happen? */
+ Assert(indexDescs[i] != NULL);
/* Give the index a chance to do some post-insert cleanup */
index_insert_cleanup(indexDescs[i], indexInfos[i]);
/* Drop lock acquired by ExecOpenIndices */
index_close(indexDescs[i], RowExclusiveLock);
+
+ indexDescs[i] = NULL;
}
/*
I wrote:
The problem seems to be that apply_handle_insert_internal does
ExecOpenIndices and then ExecCloseIndices, and then
ExecCleanupTupleRouting does ExecCloseIndices again, which nicely
explains why brininsertcleanup blows up if you happen to have a BRIN
index involved. What it doesn't explain is how come we don't see
other symptoms from the duplicate index_close calls, regardless of
index type.
Hah, I see it: this bug is specific to the apply_handle_tuple_routing
code path. The ResultRelInfo we're dealing with is made by
ExecFindPartition, which does ExecOpenIndices on it. Then
apply_handle_tuple_routing calls apply_handle_insert_internal, which
does ExecOpenIndices *again* on the same ResultRelInfo. That gets a
second refcount and second lock on the index(es), and leaks all the
IndexInfo data structures made by the first call. When done,
apply_handle_insert_internal does ExecCloseIndices, releasing one
refcount and lock. Then, back in apply_handle_tuple_routing, we do
ExecCloseIndices again. So the refcounts and locks balance out, and
it very accidentally fails to crash, so long as nobody is expecting
the contents of the IndexInfos to match up.
The "Move the tuple into the new partition" path in
apply_handle_tuple_routing's UPDATE case is even squirrelier. That
does another ExecFindPartition and then apply_handle_insert_internal,
but there's no ExecCloseIndices call balancing the ExecFindPartition,
meaning it looks like it's leaking refcount and lock. Experimentation
shows that it's not, because the matching ExecCloseIndices is
eventually done by the ExecCleanupTupleRouting call in finish_edata
after returning to the outer apply_handle_update.
So if you ask me, this is impossibly convoluted and in need of a
significant rewrite. It's not okay to be opening/closing the
ResultRelInfo's indexes twice (not least because "speculative"
is different in the two open calls). And it's not okay to have
such radically different paths for cleaning up the index-opening
done by ExecFindPartition. It's even more not okay that there's
not one word of commentary about this complexity, strongly suggesting
that the original author didn't quite understand how it worked either.
regards, tom lane
Hi,
Do I need to apply this patch for debugging purposes?
I want to remove the BRIN indexes from the active partitions and start
replication; when the issue is fixed, I will add the BRIN indexes back.
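For example, something along these lines on the subscriber (index and table
names are hypothetical):
-- drop the BRIN indexes so the apply worker no longer touches them
DROP INDEX closed_sessions_started_brin_idx;
DROP INDEX closed_sessions_closed_brin_idx;
-- once a fixed build is running, recreate them
CREATE INDEX ON closed_sessions USING brin (started);
CREATE INDEX ON closed_sessions USING brin (closed);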
Best regards,
Sergey Belyashov
Sergey Belyashov <sergey.belyashov@gmail.com> writes:
Do I need to apply this patch for debugging purposes?
I think we've debugged enough to understand what's happening.
Thanks for the report!
If you just want to get some work done, applying that patch
should be enough to prevent the crash, although I want to
change some other things too.
regards, tom lane
Thank you very much
[ redirecting to -hackers for patch discussion ]
Here's a completed set of patches for bug #18815 [1] (which,
briefly, is that the logrep worker opens/closes indexes
multiple times and thereby breaks brininsertcleanup).
0001 adds a subscriber-side BRIN index to 013_partition.pl, which
replicates the crash Sergey reported. I've got mixed feelings about
whether to actually commit this: it's certainly useful for testing
this fix, but without context it looks like a pretty random thing to
do. It could be justified perhaps on the grounds of testing
aminsertcleanup callbacks within replication, since BRIN is the
only core index type that has such a callback.
0002 fixes the crash by hardening brininsertcleanup against multiple
calls. I think that this is patching a symptom not the root cause,
but it still seems like good defensive programming.
0003 adds Asserts to ExecOpenIndices and ExecCloseIndices to check
that they're not called more than once per ResultRelInfo, and thereby
exposes what I consider the root cause: apply_handle_tuple_routing
opens the indexes twice and then closes them twice. This somewhat
accidentally leaves us with zero refcounts and zero locks on the
indexes, so in the absence of aminsertcleanup callbacks the only real
reason to complain is the duplicative construction of the IndexInfo
structures. But the double call of brininsertcleanup breaks the
existing coding of that function, and it might well break third-party
aminsertcleanup callbacks if there are any. So I think we should
institute a policy that this coding pattern is forbidden.
And finally, 0004 fixes worker.c to not do that. This turned out
simpler than I thought, because I was misled by believing that the
ExecOpenIndices/ExecCloseIndices calls in apply_handle_tuple_routing
itself do something useful. They don't, and should be nuked outright.
I don't intend 0003 for back-patch, but the rest of this has to go
back as far as the bug can be observed, which I didn't test yet.
To be clear, 0004 would fix the issue even without 0002, but
I still think we should back-patch both.
regards, tom lane
Attachments:
v1-0001-Demonstrate-crash-in-brininsertcleanup-during-log.patch (text/x-diff)
From 70071453e273e676267a58b6ad31a0886b0fec13 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 18 Feb 2025 12:18:05 -0500
Subject: [PATCH v1 1/4] Demonstrate crash in brininsertcleanup during logical
replication.
A cross-partition update crashes if the subscriber's table has
a BRIN index. This is easily demonstrated by adding such an
index to 013_partition.pl. I'm not sure if we actually want
to include this in the final patchset, so it's a separate patch
for now.
Bug: #18815
Reported-by: Sergey Belyashov <sergey.belyashov@gmail.com>
Discussion: https://postgr.es/m/18815-2a0407cc7f40b327@postgresql.org
Backpatch-through: TBD
---
src/test/subscription/t/013_partition.pl | 3 +++
1 file changed, 3 insertions(+)
diff --git a/src/test/subscription/t/013_partition.pl b/src/test/subscription/t/013_partition.pl
index 14a3beae6e..c1f3ef781b 100644
--- a/src/test/subscription/t/013_partition.pl
+++ b/src/test/subscription/t/013_partition.pl
@@ -49,6 +49,9 @@ $node_publisher->safe_psql('postgres',
$node_subscriber1->safe_psql('postgres',
"CREATE TABLE tab1 (c text, a int PRIMARY KEY, b text) PARTITION BY LIST (a)"
);
+$node_subscriber1->safe_psql('postgres',
+ "CREATE INDEX ON tab1 USING brin (c)"
+);
$node_subscriber1->safe_psql('postgres',
"CREATE TABLE tab1_1 (b text, c text DEFAULT 'sub1_tab1', a int NOT NULL)"
);
--
2.43.5
v1-0002-Harden-brininsertcleanup-against-repeat-calls.patch (text/x-diff)
From badeb15ca1261a3681603ab7fe0a3898a640fdfc Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 18 Feb 2025 12:25:49 -0500
Subject: [PATCH v1 2/4] Harden brininsertcleanup against repeat calls.
Leaving an empty BrinInsertState struct linked in ii_AmCache
confuses both brininsertcleanup itself and brininsert, were that
to be called again. We should fully remove it, instead.
This stops the crash reported in bug #18815. I think it's fixing
the symptom rather than the root cause, but nonetheless there's
little argument for brininsertcleanup to be leaving an obviously
broken data structure behind.
Bug: #18815
Reported-by: Sergey Belyashov <sergey.belyashov@gmail.com>
Discussion: https://postgr.es/m/18815-2a0407cc7f40b327@postgresql.org
Backpatch-through: TBD
---
src/backend/access/brin/brin.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 4265687afa..60320440fc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -511,16 +511,18 @@ brininsertcleanup(Relation index, IndexInfo *indexInfo)
BrinInsertState *bistate = (BrinInsertState *) indexInfo->ii_AmCache;
/* bail out if cache not initialized */
- if (indexInfo->ii_AmCache == NULL)
+ if (bistate == NULL)
return;
+ /* do this first to avoid dangling pointer if we fail partway through */
+ indexInfo->ii_AmCache = NULL;
+
/*
* Clean up the revmap. Note that the brinDesc has already been cleaned up
* as part of its own memory context.
*/
brinRevmapTerminate(bistate->bis_rmAccess);
- bistate->bis_rmAccess = NULL;
- bistate->bis_desc = NULL;
+ pfree(bistate);
}
/*
--
2.43.5
v1-0003-Assert-that-ExecOpenIndices-and-ExecCloseIndices-.patch (text/x-diff)
From 484115dc9598cc496040ef0058239f8f61fea457 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 18 Feb 2025 12:47:57 -0500
Subject: [PATCH v1 3/4] Assert that ExecOpenIndices and ExecCloseIndices are
not repeated.
These functions should be called at most once per ResultRelInfo;
it's wasteful to do otherwise, and certainly the pattern of
opening twice and then closing twice is a bad idea. Moreover,
index_insert_cleanup functions might not be prepared to be
called twice, as the just-hardened code in BRIN demonstrates.
Sadly, logical replication's apply_handle_tuple_routing() does
exactly that, meaning that either hunk of this patch is
sufficient to make it crash.
While other patches in this series are back-patch candidates,
this one should only be applied to HEAD, as perhaps there
are extensions doing the now-forbidden coding pattern.
We shouldn't break them in minor releases.
Bug: #18815
Reported-by: Sergey Belyashov <sergey.belyashov@gmail.com>
Discussion: https://postgr.es/m/18815-2a0407cc7f40b327@postgresql.org
---
src/backend/executor/execIndexing.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 7c87f012c3..742f3f8c08 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -181,6 +181,9 @@ ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative)
if (len == 0)
return;
+ /* This Assert will fail if ExecOpenIndices is called twice */
+ Assert(resultRelInfo->ri_IndexRelationDescs == NULL);
+
/*
* allocate space for result arrays
*/
@@ -246,19 +249,23 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
for (i = 0; i < numIndices; i++)
{
- if (indexDescs[i] == NULL)
- continue; /* shouldn't happen? */
+ /* This Assert will fail if ExecCloseIndices is called twice */
+ Assert(indexDescs[i] != NULL);
/* Give the index a chance to do some post-insert cleanup */
index_insert_cleanup(indexDescs[i], indexInfos[i]);
/* Drop lock acquired by ExecOpenIndices */
index_close(indexDescs[i], RowExclusiveLock);
+
+ /* Mark the index as closed */
+ indexDescs[i] = NULL;
}
/*
- * XXX should free indexInfo array here too? Currently we assume that
- * such stuff will be cleaned up automatically in FreeExecutorState.
+ * We don't attempt to free the IndexInfo data structures or the arrays,
+ * instead assuming that such stuff will be cleaned up automatically in
+ * FreeExecutorState.
*/
}
--
2.43.5
v1-0004-Avoid-duplicate-ExecOpenIndices-ExecCloseIndices-.patch (text/x-diff)
From 71a286cd08c840f438bf4ec71f59d1ac4615662e Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 18 Feb 2025 15:09:19 -0500
Subject: [PATCH v1 4/4] Avoid duplicate ExecOpenIndices/ExecCloseIndices in
logrep worker.
During a cross-partition update, apply_handle_tuple_routing calls
ExecFindPartition which does ExecOpenIndices (and expects that
ExecCleanupTupleRouting will close the indexes again). Therefore,
we must avoid re-opening/closing the table's indexes when the
update code path calls apply_handle_insert_internal or
apply_handle_delete_internal.
That leaves us with only one caller of each function that needs
indexes to be opened/closed, so we can just move those calls
to the callers.
Also, remove the entirely-duplicative open/close calls within
apply_handle_tuple_routing itself.
Bug: #18815
Reported-by: Sergey Belyashov <sergey.belyashov@gmail.com>
Discussion: https://postgr.es/m/18815-2a0407cc7f40b327@postgresql.org
Backpatch-through: TBD
---
src/backend/replication/logical/worker.c | 30 +++++++++++++-----------
1 file changed, 16 insertions(+), 14 deletions(-)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f09ab41c60..a79c80e656 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2454,8 +2454,13 @@ apply_handle_insert(StringInfo s)
apply_handle_tuple_routing(edata,
remoteslot, NULL, CMD_INSERT);
else
- apply_handle_insert_internal(edata, edata->targetRelInfo,
- remoteslot);
+ {
+ ResultRelInfo *relinfo = edata->targetRelInfo;
+
+ ExecOpenIndices(relinfo, true);
+ apply_handle_insert_internal(edata, relinfo, remoteslot);
+ ExecCloseIndices(relinfo);
+ }
finish_edata(edata);
@@ -2482,16 +2487,13 @@ apply_handle_insert_internal(ApplyExecutionData *edata,
{
EState *estate = edata->estate;
- /* We must open indexes here. */
- ExecOpenIndices(relinfo, true);
+ /* Caller will not have done this bit. */
+ Assert(relinfo->ri_onConflictArbiterIndexes == NIL);
InitConflictIndexes(relinfo);
/* Do the insert. */
TargetPrivilegesCheck(relinfo->ri_RelationDesc, ACL_INSERT);
ExecSimpleRelationInsert(relinfo, estate, remoteslot);
-
- /* Cleanup. */
- ExecCloseIndices(relinfo);
}
/*
@@ -2816,8 +2818,14 @@ apply_handle_delete(StringInfo s)
apply_handle_tuple_routing(edata,
remoteslot, NULL, CMD_DELETE);
else
- apply_handle_delete_internal(edata, edata->targetRelInfo,
+ {
+ ResultRelInfo *relinfo = edata->targetRelInfo;
+
+ ExecOpenIndices(relinfo, false);
+ apply_handle_delete_internal(edata, relinfo,
remoteslot, rel->localindexoid);
+ ExecCloseIndices(relinfo);
+ }
finish_edata(edata);
@@ -2851,7 +2859,6 @@ apply_handle_delete_internal(ApplyExecutionData *edata,
bool found;
EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1, NIL);
- ExecOpenIndices(relinfo, false);
found = FindReplTupleInLocalRel(edata, localrel, remoterel, localindexoid,
remoteslot, &localslot);
@@ -2892,7 +2899,6 @@ apply_handle_delete_internal(ApplyExecutionData *edata,
}
/* Cleanup. */
- ExecCloseIndices(relinfo);
EvalPlanQualEnd(&epqstate);
}
@@ -3131,7 +3137,6 @@ apply_handle_tuple_routing(ApplyExecutionData *edata,
* work already done above to find the local tuple in the
* partition.
*/
- ExecOpenIndices(partrelinfo, true);
InitConflictIndexes(partrelinfo);
EvalPlanQualSetSlot(&epqstate, remoteslot_part);
@@ -3181,8 +3186,6 @@ apply_handle_tuple_routing(ApplyExecutionData *edata,
get_namespace_name(RelationGetNamespace(partrel_new)),
RelationGetRelationName(partrel_new));
- ExecOpenIndices(partrelinfo, false);
-
/* DELETE old tuple found in the old partition. */
EvalPlanQualSetSlot(&epqstate, localslot);
TargetPrivilegesCheck(partrelinfo->ri_RelationDesc, ACL_DELETE);
@@ -3217,7 +3220,6 @@ apply_handle_tuple_routing(ApplyExecutionData *edata,
remoteslot_part);
}
- ExecCloseIndices(partrelinfo);
EvalPlanQualEnd(&epqstate);
}
break;
--
2.43.5
I wrote:
I don't intend 0003 for back-patch, but the rest of this has to go
back as far as the bug can be observed, which I didn't test yet.
Ah: this bug is new in v17, because neither brininsertcleanup nor
the whole aminsertcleanup infrastructure exist before that.
So that leaves us with a question of whether 0004 should be
back-patched any further than v17. There's some attraction
to doing so to keep the branches in sync; but there's already
quite a lot of cross-branch churn in worker.c, so on the whole
I'm inclined to just do v17.
regards, tom lane
Hi,
The 4th patch does not apply to the current REL_17_STABLE branch.
There are a lot of differences from master in worker.c.
Best regards,
Sergey Belyashov
Sergey Belyashov <sergey.belyashov@gmail.com> writes:
The 4th patch does not apply to the current REL_17_STABLE branch.
Yeah, I saw that, didn't work on fixing it yet. It looked like
the diffs were mostly about the ExecOpen/CloseIndices calls in
apply_handle_tuple_routing, which we already knew are completely
bogus. So I'm hopeful that just removing whichever of them are
in v17 will do the job --- but it'll require a full re-test
to be sure :-(
regards, tom lane
Sergey Belyashov <sergey.belyashov@gmail.com> writes:
The 4th patch does not apply to the current REL_17_STABLE branch.
There are a lot of differences from master in worker.c.
If you want to try the committed v17 patch, see
Thanks again for the report!
regards, tom lane
I have applied the patch. Replication now works. Thank you.
Regards
On Wed, Feb 19, 2025 at 4:38 AM Tom Lane wrote:
Hi,
My colleague Nisha reported off-list about a crash in logical replication that
occurs during unique constraint violations on leaf partitions. Upon
investigation, I confirmed that this crash started happening after commit
9ff6867.
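The failing scenario is roughly the following, adapted from the TAP test in
the attached patch (publication/subscription setup omitted): the subscriber's
table is partitioned, and the leaf partition already contains the row that
the publisher later inserts, so applying the remote insert runs into
unique-key conflicts on the leaf.
-- publisher
CREATE TABLE conf_tab_2 (a int PRIMARY KEY, b int UNIQUE, c int UNIQUE);
-- subscriber: same table name, but partitioned and with unique constraints
CREATE TABLE conf_tab_2 (a int PRIMARY KEY, b int, c int, UNIQUE (a, b))
    PARTITION BY RANGE (a);
CREATE TABLE conf_tab_2_p1 PARTITION OF conf_tab_2
    FOR VALUES FROM (MINVALUE) TO (100);
-- subscriber already has the row ...
INSERT INTO conf_tab_2 VALUES (55, 2, 3);
-- ... so replaying the same insert from the publisher hits unique-key
-- conflicts on the leaf partition conf_tab_2_p1
INSERT INTO conf_tab_2 VALUES (55, 2, 3);   -- executed on the publisher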
The problem arises because unique key conflict detection relied on the
ExecOpenIndices call in apply_handle_insert_internal and
apply_handle_tuple_routing with the speculative parameter set to true to
construct necessary index information. However, these openings were redundant,
as indexes had already been opened during target partition searches via the
parent table (e.g., using ExecFindPartition). Hence, they were removed in
commit 9ff6867. Unfortunately, ExecFindPartition opens indexes without
constructing the needed index information for conflict detection, which leads
to the crash.
To fix it, I tried to add a detect_conflict flag in ModifyTableState, enabling
ExecFindPartition() to internally build the required index information in this
case, like the attachment.
An alternative approach may be to delay index information initialization until
immediately before executing the actual update or insert. E.g., do that in
InitConflictIndexes(). This approach can also avoid unnecessary construction of
index information when the tuple to be updated is not found, or during a
cross-partition update where index information for the source partition is not
required. However, it introduces an additional location for index
initialization, potentially increasing maintenance efforts.
Best Regards,
Hou zj
Attachments:
v1-0001-Fix-crashes-in-logical-replication-during-unique-.patch (application/octet-stream)
From ff466fcd130d8152c24182d449d1b896b3bc48d2 Mon Sep 17 00:00:00 2001
From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
Date: Mon, 24 Mar 2025 16:40:26 +0800
Subject: [PATCH v1] Fix crashes in logical replication during unique
constraint violations on leaf partitions.
This commit addresses a logical replication crash introduced after commit
9ff6867, which occurs when incoming INSERT or UPDATE operations violate unique
constraints on leaf partitions.
The issue arises because the unique key conflict detection depended on the last
ExecOpenIndices call to construct necessary index information, where the
speculative parameter was set to true. However, this index opening was
redundant because the indexes were already opened during target partition
identification via the parent table (e.g., through ExecFindPartition) and hence
was removed in 9ff6867. Unfortunately, ExecFindPartition opens indexes without
constructing index information required for conflict detection, leading to the
crash.
To address this, a detect_conflict flag is introduced in ModifyTableState,
enabling ExecFindPartition() to internally build the required index information
for conflict detection.
---
src/backend/executor/execPartition.c | 2 +-
src/backend/replication/logical/worker.c | 2 ++
src/include/nodes/execnodes.h | 9 ++++++
src/test/subscription/t/035_conflicts.pl | 37 +++++++++++++++++++++++-
4 files changed, 48 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 84ccd7d457d..2eb8af0e0da 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -539,7 +539,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
if (partrel->rd_rel->relhasindex &&
leaf_part_rri->ri_IndexRelationDescs == NULL)
ExecOpenIndices(leaf_part_rri,
- (node != NULL &&
+ mtstate->detect_conflict || (node != NULL &&
node->onConflictAction != ONCONFLICT_NONE));
/*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e3b2b144942..b2013929756 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2990,6 +2990,8 @@ apply_handle_tuple_routing(ApplyExecutionData *edata,
mtstate->ps.state = estate;
mtstate->operation = operation;
mtstate->resultRelInfo = relinfo;
+ mtstate->detect_conflict = (operation == CMD_INSERT ||
+ operation == CMD_UPDATE);
/* ... as is PartitionTupleRouting. */
edata->proute = proute = ExecSetupPartitionTupleRouting(estate, parentrel);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e42f9f9f957..0e33a90e676 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1458,6 +1458,15 @@ typedef struct ModifyTableState
List *mt_updateColnosLists;
List *mt_mergeActionLists;
List *mt_mergeJoinConditions;
+
+ /*
+ * Whether to detect conflicts within the target relation and potentially
+ * take actions other than throwing an error.
+ *
+ * Primarily used in scenarios without a real plan node, like the logical
+ * replication apply worker.
+ */
+ bool detect_conflict;
} ModifyTableState;
/* ----------------
diff --git a/src/test/subscription/t/035_conflicts.pl b/src/test/subscription/t/035_conflicts.pl
index 3a4d44e1d0e..2a7a8239a29 100644
--- a/src/test/subscription/t/035_conflicts.pl
+++ b/src/test/subscription/t/035_conflicts.pl
@@ -25,14 +25,23 @@ $node_subscriber->start;
$node_publisher->safe_psql('postgres',
"CREATE TABLE conf_tab (a int PRIMARY KEY, b int UNIQUE, c int UNIQUE);");
+$node_publisher->safe_psql('postgres',
+ "CREATE TABLE conf_tab_2 (a int PRIMARY KEY, b int UNIQUE, c int UNIQUE);");
+
# Create same table on subscriber
$node_subscriber->safe_psql('postgres',
"CREATE TABLE conf_tab (a int PRIMARY key, b int UNIQUE, c int UNIQUE);");
+$node_subscriber->safe_psql(
+ 'postgres', qq[
+ CREATE TABLE conf_tab_2 (a int PRIMARY KEY, b int, c int, unique(a,b)) PARTITION BY RANGE (a);
+ CREATE TABLE conf_tab_2_p1 PARTITION OF conf_tab_2 FOR VALUES FROM (MINVALUE) TO (100);
+]);
+
# Setup logical replication
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
$node_publisher->safe_psql('postgres',
- "CREATE PUBLICATION pub_tab FOR TABLE conf_tab");
+ "CREATE PUBLICATION pub_tab FOR TABLE conf_tab, conf_tab_2");
# Create the subscription
my $appname = 'sub_tab';
@@ -110,4 +119,30 @@ $node_subscriber->wait_for_log(
pass('multiple_unique_conflicts detected during update');
+# Truncate table to get rid of the error
+$node_subscriber->safe_psql('postgres', "TRUNCATE conf_tab;");
+
+
+##################################################
+# Test multiple_unique_conflicts due to INSERT on a leaf partition
+##################################################
+
+# Insert data in the subscriber table
+$node_subscriber->safe_psql('postgres',
+ "INSERT INTO conf_tab_2 VALUES (55,2,3);");
+
+# Insert data in the publisher table
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab_2 VALUES (55,2,3);");
+
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab_2_p1\": conflict=multiple_unique_conflicts.*
+.*Key already exists in unique index \"conf_tab_2_p1_pkey\".*
+.*Key \(a\)=\(55\); existing local tuple \(55, 2, 3\); remote tuple \(55, 2, 3\).*
+.*Key already exists in unique index \"conf_tab_2_p1_a_b_key\".*
+.*Key \(a, b\)=\(55, 2\); existing local tuple \(55, 2, 3\); remote tuple \(55, 2, 3\)./,
+ $log_offset);
+
+pass('multiple_unique_conflicts detected on a leaf partition during insert');
+
done_testing();
--
2.30.0.windows.2
On Wed, Mar 26, 2025 at 10:38 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
My colleague Nisha reported off-list about a crash in logical replication that
occurs during unique constraint violations on leaf partitions. Upon
investigation, I confirmed that this crash started happening after commit
9ff6867.
Even though the commit 9ff68679b5 is backpatched to 17, the proposed
patch needs to be applied only for HEAD because the requirement for
additional conflict information is new to HEAD, right?
The problem arises because unique key conflict detection relied on the
ExecOpenIndices call in apply_handle_insert_internal and
apply_handle_tuple_routing with the speculative parameter set to true to
construct necessary index information. However, these openings were redundant,
as indexes had already been opened during target partition searches via the
parent table (e.g., using ExecFindPartition). Hence, they were removed in
commit 9ff6867. Unfortunately, ExecFindPartition opens indexes without
constructing the needed index information for conflict detection, which leads
to the crash.
To fix it, I tried to add a detect_conflict flag in ModifyTableState, enabling
ExecFindPartition() to internally build the required index information in this
case, like the attachment.
I have a question about the additional information built in
ExecOpenIndices(). We built the required information when the index is
not used for exclusion constraint whereas we set the specConflict flag
in ExecInsertIndexTuples() even for the exclusion constraints case.
The specConflict flag is used by the caller to report conflicts. So,
won't we need to build the required information in ExecOpenIndices()
even when the index is used for exclusion constraint? What is the
behavior of conflict reporting code in case of exclusion constraints?
An alternative approach may be to delay index information initialization until
immediately before executing the actual update or insert. E.g., do that in
InitConflictIndexes().
Right, this is also worth considering. But let us first conclude on
the existing approach to build the required information in
ExecOpenIndices().
--
With Regards,
Amit Kapila.
On Mon, Apr 7, 2025 at 5:37 PM Amit Kapila wrote:
On Wed, Mar 26, 2025 at 10:38 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
My colleague Nisha reported off-list about a crash in logical replication that
occurs during unique constraint violations on leaf partitions. Upon
investigation, I confirmed that this crash started happening after commit
9ff6867.
Even though the commit 9ff68679b5 is backpatched to 17, the proposed
patch needs to be applied only for HEAD because the requirement for
additional conflict information is new to HEAD, right?
Right.
The problem arises because unique key conflict detection relied on the
ExecOpenIndices call in apply_handle_insert_internal and
apply_handle_tuple_routing with the speculative parameter set to true to
construct necessary index information. However, these openings were redundant,
as indexes had already been opened during target partition searches via the
parent table (e.g., using ExecFindPartition). Hence, they were removed in
commit 9ff6867. Unfortunately, ExecFindPartition opens indexes without
constructing the needed index information for conflict detection, which leads
to the crash.
To fix it, I tried to add a detect_conflict flag in ModifyTableState, enabling
ExecFindPartition() to internally build the required index information in this
case, like the attachment.
I have a question about the additional information built in
ExecOpenIndices(). We built the required information when the index is
not used for exclusion constraint whereas we set the specConflict flag
in ExecInsertIndexTuples() even for the exclusion constraints case.
The specConflict flag is used by the caller to report conflicts. So,
won't we need to build the required information in ExecOpenIndices()
even when the index is used for exclusion constraint?
I think the index information used for detecting an exclusion conflict is
always initialized in ExecOpenIndices -> BuildIndexInfo ->
RelationGetExclusionInfo, regardless of the input parameter.
What is the
behavior of conflict reporting code in case of exclusion constraints?
Under logical replication context, since we do not detect conflicts for exclusion
constraints, it would simply report the original constraint violation ERROR.
Best Regards,
Hou zj
On Mon, Apr 7, 2025 at 3:28 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
What is the
behavior of conflict reporting code in case of exclusion constraints?
Under logical replication context, since we do not detect conflicts for exclusion
constraints, it would simply report the original constraint violation ERROR.
Fair enough. On considering it again, I find your idea of building
conflict-related information when it is actually required sounds
better, as it may also save us performance in some corner cases.
--
With Regards,
Amit Kapila.
On Tue, Apr 8, 2025 at 2:00 PM Amit Kapila wrote:
On Mon, Apr 7, 2025 at 3:28 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:
What is the
behavior of conflict reporting code in case of exclusion constraints?
Under logical replication context, since we do not detect conflicts
for exclusion constraints, it would simply report the original constraint violation ERROR.
Fair enough. On considering it again, I find your idea of building conflict-related
information when it is actually required sounds better, as it may also save us
performance in some corner cases.
Thanks for reviewing.
After studying more on this alternative, I think it can be improved further.
Specifically, we can delay index info initialization until the
FindConflictTuple() function, where the index info is actually used. This
can minimize overhead associated with building index info when no conflict
exists.
I generated a profile on the HEAD to check the impact of
BuildSpeculativeIndexInfo() during bulk insert operations (without conflicts),
and the results were notable:
--14.68%--BuildSpeculativeIndexInfo
|--5.34%--IndexAmTranslateCompareType
| --5.05%--GetIndexAmRoutineByAmId
| |--2.26%--GetIndexAmRoutine
| |--1.68%--SearchSysCache1
| --0.52%--ReleaseSysCache
|--5.29%--get_opfamily_member
| |--4.06%--SearchSysCache4
| --0.50%--ReleaseSysCache
|
|--2.66%--get_opcode
| |--1.74%--SearchSysCache1
And this function disappeared after delaying the init to FindConflictTuple().
To further analyze performance, I measured the time taken to apply a
transaction that inserts 1,000,000 rows into a table with three unique indexes:
HEAD: 6267 ms
Improved alternative: 5593 ms
It shows about 11% performance improvement with the refined approach.
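(For reference, (6267 - 5593) / 6267 ≈ 0.108, i.e. roughly an 11% reduction
in apply time.)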
I am attaching the V2 patch, which implements this idea by postponing index
info initialization until FindConflictTuple(). I confirmed that it passes
the regression and pgindent checks.
Best Regards,
Hou zj
Attachments:
v2-0001-Fix-crashes-in-logical-replication-during-unique-.patch (application/octet-stream)
From 50c3d20b8257994a4c9548267142d2ac794657df Mon Sep 17 00:00:00 2001
From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
Date: Tue, 25 Mar 2025 12:58:34 +0800
Subject: [PATCH v2] Fix crashes in logical replication during unique
constraint violations on leaf partitions.
This commit addresses a logical replication crash introduced after commit
9ff6867, which occurs when incoming INSERT or UPDATE operations violate unique
constraints on leaf partitions.
The issue arises because the unique key conflict detection depended on the last
ExecOpenIndices call to construct necessary index information, where the
speculative parameter was set to true. However, this index opening was
redundant because the indexes were already opened during target partition
identification via the parent table (e.g., through ExecFindPartition) and hence
was removed in 9ff6867. Unfortunately, ExecFindPartition opens indexes without
constructing index information required for conflict detection, leading to the
crash.
While one solution could be to adjust ExecFindPartition to build this necessary
index information by adding a function parameter, this commit opts for a more
efficient approach: delaying index information initialization until a potential
conflict is detected. This avoids unnecessary index information construction
when there is no conflict.
---
src/backend/executor/execIndexing.c | 5 ++--
src/backend/executor/execReplication.c | 31 ++++++++++++++++++++
src/backend/replication/logical/worker.c | 4 +--
src/test/subscription/t/035_conflicts.pl | 37 +++++++++++++++++++++++-
4 files changed, 71 insertions(+), 6 deletions(-)
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index e3fe9b78bb5..bdf862b2406 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -214,9 +214,8 @@ ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative)
ii = BuildIndexInfo(indexDesc);
/*
- * If the indexes are to be used for speculative insertion or conflict
- * detection in logical replication, add extra information required by
- * unique index entries.
+ * If the indexes are to be used for speculative insertion, add extra
+ * information required by unique index entries.
*/
if (speculative && ii->ii_Unique && !indexDesc->rd_index->indisexclusion)
BuildSpeculativeIndexInfo(indexDesc, ii);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index ede89ea3cf9..c9b2aedbeef 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -431,6 +431,30 @@ retry:
return found;
}
+/*
+ * Build additional index information necessary for conflict detection.
+ */
+static void
+BuildConflictIndexInfo(ResultRelInfo *resultRelInfo, Oid conflictindex)
+{
+ for (int i = 0; i < resultRelInfo->ri_NumIndices; i++)
+ {
+ Relation indexRelation = resultRelInfo->ri_IndexRelationDescs[i];
+ IndexInfo *indexRelationInfo = resultRelInfo->ri_IndexRelationInfo[i];
+
+ if (conflictindex != RelationGetRelid(indexRelation))
+ continue;
+
+ /*
+ * This Assert will fail if BuildSpeculativeIndexInfo() is called
+ * twice for the given index.
+ */
+ Assert(indexRelationInfo->ii_UniqueOps == NULL);
+
+ BuildSpeculativeIndexInfo(indexRelation, indexRelationInfo);
+ }
+}
+
/*
* Find the tuple that violates the passed unique index (conflictindex).
*
@@ -452,6 +476,13 @@ FindConflictTuple(ResultRelInfo *resultRelInfo, EState *estate,
*conflictslot = NULL;
+ /*
+ * Build additional information required to detect constraints violations.
+ * Refer to check_exclusion_or_unique_constraint() for details on how this
+ * information is used.
+ */
+ BuildConflictIndexInfo(resultRelInfo, conflictindex);
+
retry:
if (ExecCheckIndexConstraints(resultRelInfo, slot, estate,
&conflictTid, &slot->tts_tid,
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e3b2b144942..5ce596f4576 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2457,7 +2457,7 @@ apply_handle_insert(StringInfo s)
{
ResultRelInfo *relinfo = edata->targetRelInfo;
- ExecOpenIndices(relinfo, true);
+ ExecOpenIndices(relinfo, false);
apply_handle_insert_internal(edata, relinfo, remoteslot);
ExecCloseIndices(relinfo);
}
@@ -2680,7 +2680,7 @@ apply_handle_update_internal(ApplyExecutionData *edata,
MemoryContext oldctx;
EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1, NIL);
- ExecOpenIndices(relinfo, true);
+ ExecOpenIndices(relinfo, false);
found = FindReplTupleInLocalRel(edata, localrel,
&relmapentry->remoterel,
diff --git a/src/test/subscription/t/035_conflicts.pl b/src/test/subscription/t/035_conflicts.pl
index 3a4d44e1d0e..2a7a8239a29 100644
--- a/src/test/subscription/t/035_conflicts.pl
+++ b/src/test/subscription/t/035_conflicts.pl
@@ -25,14 +25,23 @@ $node_subscriber->start;
$node_publisher->safe_psql('postgres',
"CREATE TABLE conf_tab (a int PRIMARY KEY, b int UNIQUE, c int UNIQUE);");
+$node_publisher->safe_psql('postgres',
+ "CREATE TABLE conf_tab_2 (a int PRIMARY KEY, b int UNIQUE, c int UNIQUE);");
+
# Create same table on subscriber
$node_subscriber->safe_psql('postgres',
"CREATE TABLE conf_tab (a int PRIMARY key, b int UNIQUE, c int UNIQUE);");
+$node_subscriber->safe_psql(
+ 'postgres', qq[
+ CREATE TABLE conf_tab_2 (a int PRIMARY KEY, b int, c int, unique(a,b)) PARTITION BY RANGE (a);
+ CREATE TABLE conf_tab_2_p1 PARTITION OF conf_tab_2 FOR VALUES FROM (MINVALUE) TO (100);
+]);
+
# Setup logical replication
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
$node_publisher->safe_psql('postgres',
- "CREATE PUBLICATION pub_tab FOR TABLE conf_tab");
+ "CREATE PUBLICATION pub_tab FOR TABLE conf_tab, conf_tab_2");
# Create the subscription
my $appname = 'sub_tab';
@@ -110,4 +119,30 @@ $node_subscriber->wait_for_log(
pass('multiple_unique_conflicts detected during update');
+# Truncate table to get rid of the error
+$node_subscriber->safe_psql('postgres', "TRUNCATE conf_tab;");
+
+
+##################################################
+# Test multiple_unique_conflicts due to INSERT on a leaf partition
+##################################################
+
+# Insert data in the subscriber table
+$node_subscriber->safe_psql('postgres',
+ "INSERT INTO conf_tab_2 VALUES (55,2,3);");
+
+# Insert data in the publisher table
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab_2 VALUES (55,2,3);");
+
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab_2_p1\": conflict=multiple_unique_conflicts.*
+.*Key already exists in unique index \"conf_tab_2_p1_pkey\".*
+.*Key \(a\)=\(55\); existing local tuple \(55, 2, 3\); remote tuple \(55, 2, 3\).*
+.*Key already exists in unique index \"conf_tab_2_p1_a_b_key\".*
+.*Key \(a, b\)=\(55, 2\); existing local tuple \(55, 2, 3\); remote tuple \(55, 2, 3\)./,
+ $log_offset);
+
+pass('multiple_unique_conflicts detected on a leaf partition during insert');
+
done_testing();
--
2.31.1
On Tue, Apr 8, 2025 at 12:26 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
I am attaching the V2 patch, which implements this idea by postponing index
info initialization until FindConflictTuple(). I confirmed that it passes
the regression and pgindent checks.
Thanks, I pushed this patch yesterday.
--
With Regards,
Amit Kapila.