"ERROR: deadlock detected" when replicating TRUNCATE
While doing logical replication testing we encountered a problem which
causes a deadlock error to be logged when replicating a TRUNCATE
simultaneously to 2 Subscriptions.
e.g.
----------
2021-05-12 11:30:19.457 AEST [11393] ERROR: deadlock detected
2021-05-12 11:30:19.457 AEST [11393] DETAIL: Process 11393 waits for
ShareLock on transaction 597; blocked by process 11582.
Process 11582 waits for ShareLock on relation 16384 of database 14896;
blocked by process 11393.
----------
At this time, both the subscriber (apply worker) processes are
executing inside the ExecuteTruncateGuts function simultaneously and
they become co-dependent.
e.g.
----------
Process 11582
(gdb) bt
#0 0x00007fa1979515e3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x000000000093e5d0 in WaitEventSetWaitBlock (set=0x2facac8,
cur_timeout=-1, occurred_events=0x7ffed5fdff00, nevents=1) at
latch.c:1450
#2 0x000000000093e468 in WaitEventSetWait (set=0x2facac8, timeout=-1,
occurred_events=0x7ffed5fdff00, nevents=1, wait_event_info=50331648)
at latch.c:1396
#3 0x000000000093d8cd in WaitLatch (latch=0x7fa191042654,
wakeEvents=33, timeout=0, wait_event_info=50331648) at latch.c:473
#4 0x00000000009660f0 in ProcSleep (locallock=0x2fd06d8,
lockMethodTable=0xcd90a0 <default_lockmethod>) at proc.c:1361
#5 0x0000000000954fd5 in WaitOnLock (locallock=0x2fd06d8,
owner=0x2fd9a48) at lock.c:1858
#6 0x0000000000953c14 in LockAcquireExtended (locktag=0x7ffed5fe0370,
lockmode=5, sessionLock=false, dontWait=false, reportMemoryError=true,
locallockp=0x7ffed5fe0368) at lock.c:1100
#7 0x00000000009511f1 in LockRelationOid (relid=16384, lockmode=5) at
lmgr.c:117
#8 0x000000000049e779 in relation_open (relationId=16384, lockmode=5)
at relation.c:56
#9 0x00000000005652ff in table_open (relationId=16384, lockmode=5) at
table.c:43
#10 0x00000000005c8b5a in reindex_relation (relid=16384, flags=1,
params=0x7ffed5fe05f0) at index.c:3990
#11 0x00000000006d2c85 in ExecuteTruncateGuts
(explicit_rels=0x3068aa8, relids=0x3068b00, relids_extra=0x3068b58,
relids_logged=0x3068bb0, behavior=DROP_RESTRICT, restart_seqs=false)
at tablecmds.c:2036
#12 0x00000000008ebc50 in apply_handle_truncate (s=0x7ffed5fe08d0) at
worker.c:2252
#13 0x00000000008ebe94 in apply_dispatch (s=0x7ffed5fe08d0) at worker.c:2308
#14 0x00000000008ec424 in LogicalRepApplyLoop (last_received=24192624)
at worker.c:2564
----------
Process 11393
(gdb) bt
#0 0x00007fa197917f90 in __nanosleep_nocancel () from /lib64/libc.so.6
#1 0x00007fa197917e44 in sleep () from /lib64/libc.so.6
#2 0x0000000000950f84 in DeadLockReport () at deadlock.c:1151
#3 0x0000000000955013 in WaitOnLock (locallock=0x2fd05d0,
owner=0x2fd9a48) at lock.c:1873
#4 0x0000000000953c14 in LockAcquireExtended (locktag=0x7ffed5fe01d0,
lockmode=5, sessionLock=false, dontWait=false, reportMemoryError=true,
locallockp=0x0) at lock.c:1100
#5 0x00000000009531bc in LockAcquire (locktag=0x7ffed5fe01d0,
lockmode=5, sessionLock=false, dontWait=false) at lock.c:751
#6 0x0000000000951d86 in XactLockTableWait (xid=597,
rel=0x7fa1986e9e08, ctid=0x7ffed5fe0284, oper=XLTW_Update) at
lmgr.c:674
#7 0x00000000004f84d8 in heap_update (relation=0x7fa1986e9e08,
otid=0x3067dc4, newtup=0x3067dc0, cid=0, crosscheck=0x0, wait=true,
tmfd=0x7ffed5fe03b0, lockmode=0x7ffed5fe03ac) at heapam.c:3549
#8 0x00000000004fa5dd in simple_heap_update (relation=0x7fa1986e9e08,
otid=0x3067dc4, tup=0x3067dc0) at heapam.c:4222
#9 0x00000000005c9932 in CatalogTupleUpdate (heapRel=0x7fa1986e9e08,
otid=0x3067dc4, tup=0x3067dc0) at indexing.c:312
#10 0x0000000000af5774 in RelationSetNewRelfilenode
(relation=0x7fa1986fdc80, persistence=112 'p') at relcache.c:3707
#11 0x00000000006d2b4d in ExecuteTruncateGuts
(explicit_rels=0x30688b8, relids=0x3068910, relids_extra=0x3068968,
relids_logged=0x30689c0, behavior=DROP_RESTRICT, restart_seqs=false)
at tablecmds.c:2012
#12 0x00000000008ebc50 in apply_handle_truncate (s=0x7ffed5fe08d0) at
worker.c:2252
#13 0x00000000008ebe94 in apply_dispatch (s=0x7ffed5fe08d0) at worker.c:2308
----------
The essence of the trouble seems to be that the apply_handle_truncate
function never anticipated it may end up truncating the same table
from 2 separate workers (subscriptions) like this test case is doing.
Probably this is quite an old problem because the
apply_handle_truncate code has not changed much for a long time. The
code of the apply_handle_truncate function (worker.c) has a very similar
pattern to the ExecuteTruncate function (tablecmds.c), but
ExecuteTruncate is using a more powerful AccessExclusiveLock than
apply_handle_truncate was using.
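Just to illustrate that point, a plain TRUNCATE can be seen holding
AccessExclusiveLock on its target; here is a rough SQL sketch (the table
name is only illustrative, borrowed from the attached test):
----------
BEGIN;
TRUNCATE tab5;
-- inspect the locks this backend holds on the table;
-- this shows mode = AccessExclusiveLock, granted = t
SELECT relation::regclass, mode, granted
FROM pg_locks
WHERE pid = pg_backend_pid() AND relation = 'tab5'::regclass;
COMMIT;
----------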
~~
PSA a patch to make the apply_handle_truncate use AccessExclusiveLock
same as the ExecuteTruncate function does.
PSA a patch adding a test for this scenario.
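For anyone wanting to reproduce this by hand, the scenario boils down to
something like the following (just a rough sketch; the names follow the
attached test, the connection string is elided, and the test patch does
the equivalent via the TAP framework):
----------
-- on the publisher
CREATE TABLE tab5 (a int);
CREATE PUBLICATION pub5 FOR TABLE tab5;
INSERT INTO tab5 VALUES (1), (2), (3);

-- on the subscriber: two subscriptions for the same published table
CREATE TABLE tab5 (a int);
CREATE SUBSCRIPTION sub5_1 CONNECTION '...' PUBLICATION pub5;
CREATE SUBSCRIPTION sub5_2 CONNECTION '...' PUBLICATION pub5;

-- back on the publisher: both apply workers then replicate this TRUNCATE
-- at about the same time, which is when the deadlock can occur
TRUNCATE tab5;
----------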
--------
Kind Regards,
Peter Smith.
Fujitsu Australia
Attachments:
v2_test_for_deadlock.patch
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index 2d49f2537b..5a0df32b21 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -6,16 +6,20 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 11;
+use Test::More tests => 14;
# setup
my $node_publisher = get_new_node('publisher');
$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(max_logical_replication_workers = 6));
$node_publisher->start;
my $node_subscriber = get_new_node('subscriber');
$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+ qq(max_logical_replication_workers = 6));
$node_subscriber->start;
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
@@ -187,3 +191,52 @@ $result = $node_subscriber->safe_psql('postgres',
"SELECT count(*), min(a), max(a) FROM tab1");
is($result, qq(0||),
'truncate replicated in synchronous logical replication');
+
+# test that truncate works for synchronous logical replication when
+# there are multiple subscriptions for a single table
+
+$node_publisher->safe_psql('postgres',
+ "CREATE TABLE tab5 (a int)");
+
+$node_subscriber->safe_psql('postgres',
+ "CREATE TABLE tab5 (a int)");
+
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION pub5 FOR TABLE tab5");
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION sub5_1 CONNECTION '$publisher_connstr' PUBLICATION pub5"
+);
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION sub5_2 CONNECTION '$publisher_connstr' PUBLICATION pub5"
+);
+
+$node_publisher->safe_psql('postgres',
+ "ALTER SYSTEM SET synchronous_standby_names TO 'any 2(sub5_1, sub5_2)'");
+$node_publisher->safe_psql('postgres', "SELECT pg_reload_conf()");
+
+# insert data to truncate
+
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab5 VALUES (1), (2), (3)");
+
+$node_publisher->wait_for_catchup('sub5_1');
+$node_publisher->wait_for_catchup('sub5_2');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*), min(a), max(a) FROM tab5");
+is($result, qq(6|1|3), 'check synchronous logical replication');
+
+$node_publisher->safe_psql('postgres', "TRUNCATE tab5");
+
+$node_publisher->wait_for_catchup('sub5_1');
+$node_publisher->wait_for_catchup('sub5_2');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*), min(a), max(a) FROM tab5");
+is($result, qq(0||),
+ 'truncate replicated in synchronous logical replication');
+
+# check deadlocks detected
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT deadlocks FROM pg_stat_database WHERE datname='postgres'");
+is($result, qq(0), 'no deadlocks detected');
v2-0001-Fix-deadlock-for-multiple-replicating-truncates-o.patch
From d2e55920d341daa2b613ee8f02640a1035ac8a35 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 13 May 2021 11:53:29 +1000
Subject: [PATCH v2] Fix deadlock for multiple replicating truncates of same
table.
The ExecuteTruncate uses AccessExclusiveLock, but the apply_handle_truncate only used
RowExclusiveLock. This could lead to a deadlock detected error in scenarios where there
are 2 subscribers for a single table which is being truncated.
Fix to use the same lock level as ExecuteTruncate.
---
src/backend/replication/logical/worker.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 1432554..60bf7f7 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1818,6 +1818,7 @@ apply_handle_truncate(StringInfo s)
List *relids = NIL;
List *relids_logged = NIL;
ListCell *lc;
+ LOCKMODE lockmode = AccessExclusiveLock;
if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
return;
@@ -1831,14 +1832,14 @@ apply_handle_truncate(StringInfo s)
LogicalRepRelId relid = lfirst_oid(lc);
LogicalRepRelMapEntry *rel;
- rel = logicalrep_rel_open(relid, RowExclusiveLock);
+ rel = logicalrep_rel_open(relid, lockmode);
if (!should_apply_changes_for_rel(rel))
{
/*
* The relation can't become interesting in the middle of the
* transaction so it's safe to unlock it.
*/
- logicalrep_rel_close(rel, RowExclusiveLock);
+ logicalrep_rel_close(rel, lockmode);
continue;
}
@@ -1856,7 +1857,7 @@ apply_handle_truncate(StringInfo s)
{
ListCell *child;
List *children = find_all_inheritors(rel->localreloid,
- RowExclusiveLock,
+ lockmode,
NULL);
foreach(child, children)
@@ -1876,7 +1877,7 @@ apply_handle_truncate(StringInfo s)
*/
if (RELATION_IS_OTHER_TEMP(childrel))
{
- table_close(childrel, RowExclusiveLock);
+ table_close(childrel, lockmode);
continue;
}
--
1.8.3.1
On Mon, May 17, 2021 at 12:30 PM Peter Smith <smithpb2250@gmail.com> wrote:
While doing logical replication testing we encountered a problem which
causes a deadlock error to be logged when replicating a TRUNCATE
simultaneously to 2 Subscriptions.
e.g.
----------
2021-05-12 11:30:19.457 AEST [11393] ERROR: deadlock detected
2021-05-12 11:30:19.457 AEST [11393] DETAIL: Process 11393 waits for
ShareLock on transaction 597; blocked by process 11582.
Process 11582 waits for ShareLock on relation 16384 of database 14896;
blocked by process 11393.
----------
At this time, both the subscriber (apply worker) processes are
executing inside the ExecuteTruncateGuts function simultaneously and
they become co-dependent.
e.g.
----------
Process 11582
(gdb) bt
#0 0x00007fa1979515e3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x000000000093e5d0 in WaitEventSetWaitBlock (set=0x2facac8,
cur_timeout=-1, occurred_events=0x7ffed5fdff00, nevents=1) at
latch.c:1450
#2 0x000000000093e468 in WaitEventSetWait (set=0x2facac8, timeout=-1,
occurred_events=0x7ffed5fdff00, nevents=1, wait_event_info=50331648)
at latch.c:1396
#3 0x000000000093d8cd in WaitLatch (latch=0x7fa191042654,
wakeEvents=33, timeout=0, wait_event_info=50331648) at latch.c:473
#4 0x00000000009660f0 in ProcSleep (locallock=0x2fd06d8,
lockMethodTable=0xcd90a0 <default_lockmethod>) at proc.c:1361
..
----------
Process 11393
(gdb) bt
#0 0x00007fa197917f90 in __nanosleep_nocancel () from /lib64/libc.so.6
#1 0x00007fa197917e44 in sleep () from /lib64/libc.so.6
#2 0x0000000000950f84 in DeadLockReport () at deadlock.c:1151
#3 0x0000000000955013 in WaitOnLock (locallock=0x2fd05d0,
owner=0x2fd9a48) at lock.c:1873
..
----------
The essence of the trouble seems to be that the apply_handle_truncate
function never anticipated it may end up truncating the same table
from 2 separate workers (subscriptions) like this test case is doing.
Probably this is quite an old problem because the
apply_handle_truncate code has not changed much for a long time.
Yeah, have you checked it in the back branches?
I am also able to reproduce this and have analyzed the cause of the above
error. In the above, process 11393 waits while updating a pg_class tuple
via RelationSetNewRelfilenode(), because that tuple has already been
updated by process 11582 (with transaction id 597, which is not yet
committed). Meanwhile, process 11582 waits to acquire ShareLock on
relation 16384, which has already been locked in RowExclusiveLock mode by
process 11393 in apply_handle_truncate. So, both processes end up waiting
on each other, which causes a deadlock.
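To illustrate the conflicting lock levels at the SQL level, here is a
rough sketch using two plain psql sessions (the table name is only
illustrative); SHARE conflicts with ROW EXCLUSIVE, so the second session
blocks until the first one commits:
----------
-- session 1: the lock level apply_handle_truncate was taking
BEGIN;
LOCK TABLE tab5 IN ROW EXCLUSIVE MODE;

-- session 2: the lock level reindex_relation needs; this blocks,
-- because SHARE conflicts with ROW EXCLUSIVE
BEGIN;
LOCK TABLE tab5 IN SHARE MODE;
----------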
PSA a patch adding a test for this scenario.
+
+$node_publisher->safe_psql('postgres',
+ "ALTER SYSTEM SET synchronous_standby_names TO 'any 2(sub5_1, sub5_2)'");
+$node_publisher->safe_psql('postgres', "SELECT pg_reload_conf()");
Do you really need these steps to reproduce the problem? IIUC, this
has nothing to do with synchronous replication.
--
With Regards,
Amit Kapila.
On Mon, May 17, 2021 at 12:30 PM Peter Smith <smithpb2250@gmail.com> wrote:
The essence of the trouble seems to be that the apply_handle_truncate
function never anticipated it may end up truncating the same table
from 2 separate workers (subscriptions) like this test case is doing.
Probably this is quite an old problem because the
apply_handle_truncate code has not changed much for a long time. The
code of apply_handle_truncate function (worker.c) has a very similar
pattern to the ExecuteTruncate function (tablecmds.c) but the
ExecuteTruncate is using a more powerful AccessExclusiveLock than the
apply_handle_truncate was using.
Right, that's a problem.
PSA a patch to make the apply_handle_truncate use AccessExclusiveLock
same as the ExecuteTruncate function does.
I think the fix makes sense to me.
PSA a patch adding a test for this scenario.
I am not sure this test case is exactly targeting the problematic
behavior because that will depend upon the order of execution of the
apply workers, right?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Monday, May 17, 2021 5:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote
+$node_publisher->safe_psql('postgres',
+ "ALTER SYSTEM SET synchronous_standby_names TO 'any 2(sub5_1, sub5_2)'");
+$node_publisher->safe_psql('postgres', "SELECT pg_reload_conf()");
Do you really need these steps to reproduce the problem? IIUC, this
has nothing to do with synchronous replication.
Agreed.
I tested in asynchronous mode, and could reproduce this problem, too.
The attached patch removed the steps for setting synchronous replication.
And the test passed after applying Peter's patch.
Please take it as your reference.
Regards
Tang
Attachments:
v3_test_for_deadlock.patch
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index 2d49f2537b..d91a1e7577 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -6,16 +6,20 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 11;
+use Test::More tests => 14;
# setup
my $node_publisher = get_new_node('publisher');
$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(max_logical_replication_workers = 6));
$node_publisher->start;
my $node_subscriber = get_new_node('subscriber');
$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+ qq(max_logical_replication_workers = 6));
$node_subscriber->start;
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
@@ -187,3 +191,48 @@ $result = $node_subscriber->safe_psql('postgres',
"SELECT count(*), min(a), max(a) FROM tab1");
is($result, qq(0||),
'truncate replicated in synchronous logical replication');
+
+# test that truncate works for synchronous logical replication when
+# there are multiple subscriptions for a single table
+
+$node_publisher->safe_psql('postgres',
+ "CREATE TABLE tab5 (a int)");
+
+$node_subscriber->safe_psql('postgres',
+ "CREATE TABLE tab5 (a int)");
+
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION pub5 FOR TABLE tab5");
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION sub5_1 CONNECTION '$publisher_connstr' PUBLICATION pub5"
+);
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION sub5_2 CONNECTION '$publisher_connstr' PUBLICATION pub5"
+);
+
+# insert data to truncate
+
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab5 VALUES (1), (2), (3)");
+
+$node_publisher->wait_for_catchup('sub5_1');
+$node_publisher->wait_for_catchup('sub5_2');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*), min(a), max(a) FROM tab5");
+is($result, qq(6|1|3), 'check synchronous logical replication');
+
+$node_publisher->safe_psql('postgres', "TRUNCATE tab5");
+
+$node_publisher->wait_for_catchup('sub5_1');
+$node_publisher->wait_for_catchup('sub5_2');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*), min(a), max(a) FROM tab5");
+is($result, qq(0||),
+ 'truncate replicated in synchronous logical replication');
+
+# check deadlocks detected
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT deadlocks FROM pg_stat_database WHERE datname='postgres'");
+is($result, qq(0), 'no deadlocks detected');
On Mon, May 17, 2021 at 3:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, May 17, 2021 at 12:30 PM Peter Smith <smithpb2250@gmail.com> wrote:
PSA a patch adding a test for this scenario.
I am not sure this test case is exactly targeting the problematic
behavior because that will depend upon the order of execution of the
apply workers, right?
Yeah, so we can't guarantee that this test will always reproduce the
problem but OTOH, I have tried two times and it reproduced both times.
I guess we don't have a similar test where Truncate will replicate to
two subscriptions, otherwise, we would have caught such a problem.
Having said that, I am fine with leaving this test if others feel so
on the grounds that it won't always lead to the problem reported.
--
With Regards,
Amit Kapila.
On Mon, May 17, 2021 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, May 17, 2021 at 3:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, May 17, 2021 at 12:30 PM Peter Smith <smithpb2250@gmail.com> wrote:
PSA a patch adding a test for this scenario.
I am not sure this test case is exactly targeting the problematic
behavior because that will depend upon the order of execution of the
apply workers, right?
Yeah, so we can't guarantee that this test will always reproduce the
problem but OTOH, I have tried two times and it reproduced both times.
I guess we don't have a similar test where Truncate will replicate to
two subscriptions, otherwise, we would have caught such a problem.
Having said that, I am fine with leaving this test if others feel so
on the grounds that it won't always lead to the problem reported.
Although it is not guaranteed to reproduce the scenario every time, it
is testing a new scenario, so +1 for the test.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, May 17, 2021 at 8:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, May 17, 2021 at 3:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, May 17, 2021 at 12:30 PM Peter Smith <smithpb2250@gmail.com> wrote:
PSA a patch adding a test for this scenario.
I am not sure this test case is exactly targeting the problematic
behavior because that will depend upon the order of execution of the
apply workers, right?
Yeah, so we can't guarantee that this test will always reproduce the
problem but OTOH, I have tried two times and it reproduced both times.
I guess we don't have a similar test where Truncate will replicate to
two subscriptions, otherwise, we would have caught such a problem.
Having said that, I am fine with leaving this test if others feel so
on the grounds that it won't always lead to the problem reported.
If there is any concern that the problem won't always happen then I
think we should just increase the number of subscriptions.
Having more simultaneous subscriptions (e.g. I have tried 4) will
make it much more likely for the test to encounter the deadlock, and
it probably would also be quite a useful worker stress test in its
own right.
~~
Also, should this test be in the 010_truncate.pl, or does it belong in
the 100_bugs.pl? (I don't know what the rules are for when a test
gets put into 100_bugs.pl)
----------
Kind Regards,
Peter Smith.
Fujitsu Australia
On Mon, May 17, 2021 at 6:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, May 17, 2021 at 12:30 PM Peter Smith <smithpb2250@gmail.com> wrote:
[...]
The essence of the trouble seems to be that the apply_handle_truncate
function never anticipated it may end up truncating the same table
from 2 separate workers (subscriptions) like this test case is doing.
Probably this is quite an old problem because the
apply_handle_truncate code has not changed much for a long time.
Yeah, have you checked it in the back branches?
Yes, the apply_handle_truncate function was introduced in April/2018 [1].
REL_11_0 was tagged in Oct/2018.
The "ERROR: deadlock detected" log is reproducible in PG 11.0.
----------
[1]: https://github.com/postgres/postgres/commit/039eb6e92f20499ac36cc74f8a5cef7430b706f6
Kind Regards,
Peter Smith.
Fujitsu Australia.
On Tue, May 18, 2021 at 6:19 AM Peter Smith <smithpb2250@gmail.com> wrote:
On Mon, May 17, 2021 at 8:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, May 17, 2021 at 3:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, May 17, 2021 at 12:30 PM Peter Smith <smithpb2250@gmail.com> wrote:
PSA a patch adding a test for this scenario.
I am not sure this test case is exactly targeting the problematic
behavior because that will depend upon the order of execution of the
apply workers, right?
Yeah, so we can't guarantee that this test will always reproduce the
problem but OTOH, I have tried two times and it reproduced both times.
I guess we don't have a similar test where Truncate will replicate to
two subscriptions, otherwise, we would have caught such a problem.
Having said that, I am fine with leaving this test if others feel so
on the grounds that it won't always lead to the problem reported.
If there is any concern that the problem won't always happen then I
think we should just increase the number of subscriptions.
Having more simultaneous subscriptions (e.g. I have tried 4) will
make it much more likely for the test to encounter the deadlock, and
it probably would also be quite a useful worker stress test in its
own right.
I don't think we need to go that far.
~~
Also, should this test be in the 010_truncate.pl,
+1 for keeping it in 010_truncate.pl but remove the synchronous part
of it from the testcase and comments as that is not required.
--
With Regards,
Amit Kapila.
On Tue, May 18, 2021 at 6:52 AM Peter Smith <smithpb2250@gmail.com> wrote:
Yeah, have you checked it in the back branches?
Yes, the apply_handle_truncate function was introduced in April/2018 [1].
REL_11_0 was tagged in Oct/2018.
The "ERROR: deadlock detected" log is reproducible in PG 11.0.
Okay, I have prepared the patches for all branches (11...HEAD). Each
version needs minor changes in the test, the code doesn't need much
change. Some notable changes in the tests:
1. I have removed the conf change for max_logical_replication_workers
on the publisher node. We only need this for the subscriber node.
2. After creating the new subscriptions wait for initial
synchronization as we do for other tests.
3. synchronous_standby_names need to be reset for the previous test.
This is only required for HEAD.
4. In PG-11, we need to specify the application_name in the connection
string, otherwise it takes the testcase file name as the application_name.
This is the same as what other tests do in PG11.
Can you please once verify the attached patches?
--
With Regards,
Amit Kapila.
Attachments:
v4-0001-Fix-deadlock-for-multiple-replicating-truncates-11.patch
From ce8b7d461a65700e615e4421cf4cf7a03850ffaa Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 19 May 2021 18:05:25 +0530
Subject: [PATCH v11] Fix deadlock for multiple replicating truncates of the
same table.
While applying the truncate change, the logical apply worker acquires
RowExclusiveLock on the relation being truncated. This allowed truncate on
the relation at a time by two apply workers which lead to a deadlock. The
reason was that one of the workers after updating the pg_class tuple tries
to acquire SHARE lock on the relation and started to wait for the second
worker which has acquired RowExclusiveLock on the relation. And when the
second worker tries to update the pg_class tuple, it starts to wait for
the first worker which leads to a deadlock. Fix it by acquiring
AccessExclusiveLock on the relation before applying the truncate change as
we do for normal truncate operation.
Author: Peter Smith, test case by Haiying Tang
Reviewed-by: Dilip Kumar, Amit Kapila
Backpatch-through: 11
Discussion: https://postgr.es/m/CAHut+PsNm43p0jM+idTvWwiGZPcP0hGrHMPK9TOAkc+a4UpUqw@mail.gmail.com
---
src/backend/replication/logical/worker.c | 5 +--
src/test/subscription/t/010_truncate.pl | 54 +++++++++++++++++++++++++++++++-
2 files changed, 56 insertions(+), 3 deletions(-)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index bd27ef3..db52e4a 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -950,6 +950,7 @@ apply_handle_truncate(StringInfo s)
List *relids = NIL;
List *relids_logged = NIL;
ListCell *lc;
+ LOCKMODE lockmode = AccessExclusiveLock;
ensure_transaction();
@@ -960,14 +961,14 @@ apply_handle_truncate(StringInfo s)
LogicalRepRelId relid = lfirst_oid(lc);
LogicalRepRelMapEntry *rel;
- rel = logicalrep_rel_open(relid, RowExclusiveLock);
+ rel = logicalrep_rel_open(relid, lockmode);
if (!should_apply_changes_for_rel(rel))
{
/*
* The relation can't become interesting in the middle of the
* transaction so it's safe to unlock it.
*/
- logicalrep_rel_close(rel, RowExclusiveLock);
+ logicalrep_rel_close(rel, lockmode);
continue;
}
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index de1443b..033108a 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -3,7 +3,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 9;
+use Test::More tests => 12;
# setup
@@ -13,6 +13,8 @@ $node_publisher->start;
my $node_subscriber = get_new_node('subscriber');
$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+ qq(max_logical_replication_workers = 6));
$node_subscriber->start;
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
@@ -158,3 +160,53 @@ is($result, qq(0||), 'truncate of multiple tables some not published');
$result = $node_subscriber->safe_psql('postgres',
"SELECT count(*), min(a), max(a) FROM tab2");
is($result, qq(3|1|3), 'truncate of multiple tables some not published');
+
+# test that truncate works for logical replication when there are multiple
+# subscriptions for a single table
+
+$node_publisher->safe_psql('postgres',
+ "CREATE TABLE tab5 (a int)");
+
+$node_subscriber->safe_psql('postgres',
+ "CREATE TABLE tab5 (a int)");
+
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION pub5 FOR TABLE tab5");
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION sub5_1 CONNECTION '$publisher_connstr application_name=sub5_1' PUBLICATION pub5"
+);
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION sub5_2 CONNECTION '$publisher_connstr application_name=sub5_2' PUBLICATION pub5"
+);
+
+# wait for initial data sync
+$node_subscriber->poll_query_until('postgres', $synced_query)
+ or die "Timed out while waiting for subscriber to synchronize data";
+
+# insert data to truncate
+
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab5 VALUES (1), (2), (3)");
+
+$node_publisher->wait_for_catchup('sub5_1');
+$node_publisher->wait_for_catchup('sub5_2');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*), min(a), max(a) FROM tab5");
+is($result, qq(6|1|3), 'insert replicated for multiple subscriptions');
+
+$node_publisher->safe_psql('postgres', "TRUNCATE tab5");
+
+$node_publisher->wait_for_catchup('sub5_1');
+$node_publisher->wait_for_catchup('sub5_2');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*), min(a), max(a) FROM tab5");
+is($result, qq(0||),
+ 'truncate replicated for multiple subscriptions');
+
+# check deadlocks
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT deadlocks FROM pg_stat_database WHERE datname='postgres'");
+is($result, qq(0), 'no deadlocks detected');
+
--
1.8.3.1
v4-0001-Fix-deadlock-for-multiple-replicating-truncates-12.patch
From e5e6d8b7b4e945e54630d516500dc5079ed92643 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Thu, 20 May 2021 10:07:53 +0530
Subject: [PATCH v12] Fix deadlock for multiple replicating truncates of the
same table.
While applying the truncate change, the logical apply worker acquires
RowExclusiveLock on the relation being truncated. This allowed truncate on
the relation at a time by two apply workers which lead to a deadlock. The
reason was that one of the workers after updating the pg_class tuple tries
to acquire SHARE lock on the relation and started to wait for the second
worker which has acquired RowExclusiveLock on the relation. And when the
second worker tries to update the pg_class tuple, it starts to wait for
the first worker which leads to a deadlock. Fix it by acquiring
AccessExclusiveLock on the relation before applying the truncate change as
we do for normal truncate operation.
Author: Peter Smith, test case by Haiying Tang
Reviewed-by: Dilip Kumar, Amit Kapila
Backpatch-through: 11
Discussion: https://postgr.es/m/CAHut+PsNm43p0jM+idTvWwiGZPcP0hGrHMPK9TOAkc+a4UpUqw@mail.gmail.com
---
src/backend/replication/logical/worker.c | 5 +--
src/test/subscription/t/010_truncate.pl | 54 +++++++++++++++++++++++++++++++-
2 files changed, 56 insertions(+), 3 deletions(-)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9f0d13c..583752c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -936,6 +936,7 @@ apply_handle_truncate(StringInfo s)
List *relids = NIL;
List *relids_logged = NIL;
ListCell *lc;
+ LOCKMODE lockmode = AccessExclusiveLock;
ensure_transaction();
@@ -946,14 +947,14 @@ apply_handle_truncate(StringInfo s)
LogicalRepRelId relid = lfirst_oid(lc);
LogicalRepRelMapEntry *rel;
- rel = logicalrep_rel_open(relid, RowExclusiveLock);
+ rel = logicalrep_rel_open(relid, lockmode);
if (!should_apply_changes_for_rel(rel))
{
/*
* The relation can't become interesting in the middle of the
* transaction so it's safe to unlock it.
*/
- logicalrep_rel_close(rel, RowExclusiveLock);
+ logicalrep_rel_close(rel, lockmode);
continue;
}
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..4d7d26a 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -3,7 +3,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 9;
+use Test::More tests => 12;
# setup
@@ -13,6 +13,8 @@ $node_publisher->start;
my $node_subscriber = get_new_node('subscriber');
$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+ qq(max_logical_replication_workers = 6));
$node_subscriber->start;
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
@@ -158,3 +160,53 @@ is($result, qq(0||), 'truncate of multiple tables some not published');
$result = $node_subscriber->safe_psql('postgres',
"SELECT count(*), min(a), max(a) FROM tab2");
is($result, qq(3|1|3), 'truncate of multiple tables some not published');
+
+# test that truncate works for logical replication when there are multiple
+# subscriptions for a single table
+
+$node_publisher->safe_psql('postgres',
+ "CREATE TABLE tab5 (a int)");
+
+$node_subscriber->safe_psql('postgres',
+ "CREATE TABLE tab5 (a int)");
+
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION pub5 FOR TABLE tab5");
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION sub5_1 CONNECTION '$publisher_connstr' PUBLICATION pub5"
+);
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION sub5_2 CONNECTION '$publisher_connstr' PUBLICATION pub5"
+);
+
+# wait for initial data sync
+$node_subscriber->poll_query_until('postgres', $synced_query)
+ or die "Timed out while waiting for subscriber to synchronize data";
+
+# insert data to truncate
+
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab5 VALUES (1), (2), (3)");
+
+$node_publisher->wait_for_catchup('sub5_1');
+$node_publisher->wait_for_catchup('sub5_2');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*), min(a), max(a) FROM tab5");
+is($result, qq(6|1|3), 'insert replicated for multiple subscriptions');
+
+$node_publisher->safe_psql('postgres', "TRUNCATE tab5");
+
+$node_publisher->wait_for_catchup('sub5_1');
+$node_publisher->wait_for_catchup('sub5_2');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*), min(a), max(a) FROM tab5");
+is($result, qq(0||),
+ 'truncate replicated for multiple subscriptions');
+
+# check deadlocks
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT deadlocks FROM pg_stat_database WHERE datname='postgres'");
+is($result, qq(0), 'no deadlocks detected');
+
--
1.8.3.1
v4-0001-Fix-deadlock-for-multiple-replicating-truncates-13.patch
From 0cb648eeeb064aac2f36b4c5de02c3c3e7770d2f Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Thu, 20 May 2021 11:04:37 +0530
Subject: [PATCH v13] Fix deadlock for multiple replicating truncates of the
same table.
While applying the truncate change, the logical apply worker acquires
RowExclusiveLock on the relation being truncated. This allowed truncate on
the relation at a time by two apply workers which lead to a deadlock. The
reason was that one of the workers after updating the pg_class tuple tries
to acquire SHARE lock on the relation and started to wait for the second
worker which has acquired RowExclusiveLock on the relation. And when the
second worker tries to update the pg_class tuple, it starts to wait for
the first worker which leads to a deadlock. Fix it by acquiring
AccessExclusiveLock on the relation before applying the truncate change as
we do for normal truncate operation.
Author: Peter Smith, test case by Haiying Tang
Reviewed-by: Dilip Kumar, Amit Kapila
Backpatch-through: 11
Discussion: https://postgr.es/m/CAHut+PsNm43p0jM+idTvWwiGZPcP0hGrHMPK9TOAkc+a4UpUqw@mail.gmail.com
---
src/backend/replication/logical/worker.c | 9 +++---
src/test/subscription/t/010_truncate.pl | 53 +++++++++++++++++++++++++++++++-
2 files changed, 57 insertions(+), 5 deletions(-)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ff887ea..e25ad67 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1248,6 +1248,7 @@ apply_handle_truncate(StringInfo s)
List *relids = NIL;
List *relids_logged = NIL;
ListCell *lc;
+ LOCKMODE lockmode = AccessExclusiveLock;
ensure_transaction();
@@ -1258,14 +1259,14 @@ apply_handle_truncate(StringInfo s)
LogicalRepRelId relid = lfirst_oid(lc);
LogicalRepRelMapEntry *rel;
- rel = logicalrep_rel_open(relid, RowExclusiveLock);
+ rel = logicalrep_rel_open(relid, lockmode);
if (!should_apply_changes_for_rel(rel))
{
/*
* The relation can't become interesting in the middle of the
* transaction so it's safe to unlock it.
*/
- logicalrep_rel_close(rel, RowExclusiveLock);
+ logicalrep_rel_close(rel, lockmode);
continue;
}
@@ -1283,7 +1284,7 @@ apply_handle_truncate(StringInfo s)
{
ListCell *child;
List *children = find_all_inheritors(rel->localreloid,
- RowExclusiveLock,
+ lockmode,
NULL);
foreach(child, children)
@@ -1303,7 +1304,7 @@ apply_handle_truncate(StringInfo s)
*/
if (RELATION_IS_OTHER_TEMP(childrel))
{
- table_close(childrel, RowExclusiveLock);
+ table_close(childrel, lockmode);
continue;
}
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..1f3719c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -3,7 +3,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 9;
+use Test::More tests => 12;
# setup
@@ -13,6 +13,8 @@ $node_publisher->start;
my $node_subscriber = get_new_node('subscriber');
$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+ qq(max_logical_replication_workers = 6));
$node_subscriber->start;
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
@@ -158,3 +160,52 @@ is($result, qq(0||), 'truncate of multiple tables some not published');
$result = $node_subscriber->safe_psql('postgres',
"SELECT count(*), min(a), max(a) FROM tab2");
is($result, qq(3|1|3), 'truncate of multiple tables some not published');
+
+# test that truncate works for logical replication when there are multiple
+# subscriptions for a single table
+
+$node_publisher->safe_psql('postgres',
+ "CREATE TABLE tab5 (a int)");
+
+$node_subscriber->safe_psql('postgres',
+ "CREATE TABLE tab5 (a int)");
+
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION pub5 FOR TABLE tab5");
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION sub5_1 CONNECTION '$publisher_connstr' PUBLICATION pub5"
+);
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION sub5_2 CONNECTION '$publisher_connstr' PUBLICATION pub5"
+);
+
+# wait for initial data sync
+$node_subscriber->poll_query_until('postgres', $synced_query)
+ or die "Timed out while waiting for subscriber to synchronize data";
+
+# insert data to truncate
+
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab5 VALUES (1), (2), (3)");
+
+$node_publisher->wait_for_catchup('sub5_1');
+$node_publisher->wait_for_catchup('sub5_2');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*), min(a), max(a) FROM tab5");
+is($result, qq(6|1|3), 'insert replicated for multiple subscriptions');
+
+$node_publisher->safe_psql('postgres', "TRUNCATE tab5");
+
+$node_publisher->wait_for_catchup('sub5_1');
+$node_publisher->wait_for_catchup('sub5_2');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*), min(a), max(a) FROM tab5");
+is($result, qq(0||),
+ 'truncate replicated for multiple subscriptions');
+
+# check deadlocks
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT deadlocks FROM pg_stat_database WHERE datname='postgres'");
+is($result, qq(0), 'no deadlocks detected');
--
1.8.3.1
v4-0001-Fix-deadlock-for-multiple-replicating-truncates-HEAD.patch
From 053b97dbfffd88ae7240c3ae4b36df175666e714 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 19 May 2021 15:39:38 +0530
Subject: [PATCH v14] Fix deadlock for multiple replicating truncates of the
same table.
While applying the truncate change, the logical apply worker acquires
RowExclusiveLock on the relation being truncated. This allowed truncate on
the relation at a time by two apply workers which lead to a deadlock. The
reason was that one of the workers after updating the pg_class tuple tries
to acquire SHARE lock on the relation and started to wait for the second
worker which has acquired RowExclusiveLock on the relation. And when the
second worker tries to update the pg_class tuple, it starts to wait for
the first worker which leads to a deadlock. Fix it by acquiring
AccessExclusiveLock on the relation before applying the truncate change as
we do for normal truncate operation.
Author: Peter Smith, test case by Haiying Tang
Reviewed-by: Dilip Kumar, Amit Kapila
Backpatch-through: 11
Discussion: https://postgr.es/m/CAHut+PsNm43p0jM+idTvWwiGZPcP0hGrHMPK9TOAkc+a4UpUqw@mail.gmail.com
---
src/backend/replication/logical/worker.c | 9 ++---
src/test/subscription/t/010_truncate.pl | 57 +++++++++++++++++++++++++++++++-
2 files changed, 61 insertions(+), 5 deletions(-)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 1432554..60bf7f7 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1818,6 +1818,7 @@ apply_handle_truncate(StringInfo s)
List *relids = NIL;
List *relids_logged = NIL;
ListCell *lc;
+ LOCKMODE lockmode = AccessExclusiveLock;
if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
return;
@@ -1831,14 +1832,14 @@ apply_handle_truncate(StringInfo s)
LogicalRepRelId relid = lfirst_oid(lc);
LogicalRepRelMapEntry *rel;
- rel = logicalrep_rel_open(relid, RowExclusiveLock);
+ rel = logicalrep_rel_open(relid, lockmode);
if (!should_apply_changes_for_rel(rel))
{
/*
* The relation can't become interesting in the middle of the
* transaction so it's safe to unlock it.
*/
- logicalrep_rel_close(rel, RowExclusiveLock);
+ logicalrep_rel_close(rel, lockmode);
continue;
}
@@ -1856,7 +1857,7 @@ apply_handle_truncate(StringInfo s)
{
ListCell *child;
List *children = find_all_inheritors(rel->localreloid,
- RowExclusiveLock,
+ lockmode,
NULL);
foreach(child, children)
@@ -1876,7 +1877,7 @@ apply_handle_truncate(StringInfo s)
*/
if (RELATION_IS_OTHER_TEMP(childrel))
{
- table_close(childrel, RowExclusiveLock);
+ table_close(childrel, lockmode);
continue;
}
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index 2d49f25..065f5b0 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -6,7 +6,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 11;
+use Test::More tests => 14;
# setup
@@ -16,6 +16,8 @@ $node_publisher->start;
my $node_subscriber = get_new_node('subscriber');
$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+ qq(max_logical_replication_workers = 6));
$node_subscriber->start;
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
@@ -187,3 +189,56 @@ $result = $node_subscriber->safe_psql('postgres',
"SELECT count(*), min(a), max(a) FROM tab1");
is($result, qq(0||),
'truncate replicated in synchronous logical replication');
+
+$node_publisher->safe_psql('postgres',
+ "ALTER SYSTEM RESET synchronous_standby_names");
+$node_publisher->safe_psql('postgres', "SELECT pg_reload_conf()");
+
+# test that truncate works for logical replication when there are multiple
+# subscriptions for a single table
+
+$node_publisher->safe_psql('postgres',
+ "CREATE TABLE tab5 (a int)");
+
+$node_subscriber->safe_psql('postgres',
+ "CREATE TABLE tab5 (a int)");
+
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION pub5 FOR TABLE tab5");
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION sub5_1 CONNECTION '$publisher_connstr' PUBLICATION pub5"
+);
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION sub5_2 CONNECTION '$publisher_connstr' PUBLICATION pub5"
+);
+
+# wait for initial data sync
+$node_subscriber->poll_query_until('postgres', $synced_query)
+ or die "Timed out while waiting for subscriber to synchronize data";
+
+# insert data to truncate
+
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab5 VALUES (1), (2), (3)");
+
+$node_publisher->wait_for_catchup('sub5_1');
+$node_publisher->wait_for_catchup('sub5_2');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*), min(a), max(a) FROM tab5");
+is($result, qq(6|1|3), 'insert replicated for multiple subscriptions');
+
+$node_publisher->safe_psql('postgres', "TRUNCATE tab5");
+
+$node_publisher->wait_for_catchup('sub5_1');
+$node_publisher->wait_for_catchup('sub5_2');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*), min(a), max(a) FROM tab5");
+is($result, qq(0||),
+ 'truncate replicated for multiple subscriptions');
+
+# check deadlocks
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT deadlocks FROM pg_stat_database WHERE datname='postgres'");
+is($result, qq(0), 'no deadlocks detected');
--
1.8.3.1
On Thursday, May 20, 2021 3:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Okay, I have prepared the patches for all branches (11...HEAD). Each
version needs minor changes in the test, the code doesn't need much
change. Some notable changes in the tests:
1. I have removed the conf change for max_logical_replication_workers
on the publisher node. We only need this for the subscriber node.
2. After creating the new subscriptions wait for initial
synchronization as we do for other tests.
3. synchronous_standby_names need to be reset for the previous test.
This is only required for HEAD.
4. In PG-11, we need to specify the application_name in the connection
string, otherwise it takes the testcase file name as the application_name.
This is the same as what other tests do in PG11.
Can you please once verify the attached patches?
I have tested your patches for all branches (11...HEAD). All of them passed. BTW, I also confirmed that the bug exists in these branches without your fix.
The changes in tests LGTM.
But I saw whitespace warnings when applying the patches for PG11 and PG12; please take a look at this.
Regards
Tang
On Thu, May 20, 2021 at 12:46 PM tanghy.fnst@fujitsu.com
<tanghy.fnst@fujitsu.com> wrote:
On Thursday, May 20, 2021 3:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Okay, I have prepared the patches for all branches (11...HEAD). Each
version needs minor changes in the test, the code doesn't need much
change. Some notable changes in the tests:
1. I have removed the conf change for max_logical_replication_workers
on the publisher node. We only need this for the subscriber node.
2. After creating the new subscriptions wait for initial
synchronization as we do for other tests.
3. synchronous_standby_names need to be reset for the previous test.
This is only required for HEAD.
4. In PG-11, we need to specify the application_name in the connection
string, otherwise it takes the testcase file name as the application_name.
This is the same as what other tests do in PG11.
Can you please once verify the attached patches?
I have tested your patches for all branches (11...HEAD). All of them passed. BTW, I also confirmed that the bug exists in these branches without your fix.
The changes in tests LGTM.
But I saw whitespace warnings when applying the patches for PG11 and PG12; please take a look at this.
Thanks, I have pushed after fixing the whitespace warning.
--
With Regards,
Amit Kapila.