subscriptionCheck failures

Started by Thomas Munroalmost 5 years ago7 messages
#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Thomas Munro (#1)
Re: subscriptionCheck failures

On Mon, Mar 15, 2021 at 6:00 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Hi,

This seems to be a new low frequency failure, I didn't see it mentioned already:

Thanks for reporting, I'll look into it.

--
With Regards,
Amit Kapila.

#3osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#2)
RE: subscriptionCheck failures

Hello

On Tuesday, March 16, 2021 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 15, 2021 at 6:00 PM Thomas Munro
<thomas.munro@gmail.com> wrote:

Hi,

This seems to be a new low frequency failure, I didn't see it mentioned

already:

Oh, this is the test I wrote and included as part of the commit ce0fdbfe
# Failed test 'DROP SUBSCRIPTION during error can clean up the slots on the publisher'
# at t/004_sync.pl line 171.
# got: '1'
# expected: '0'
# Looks like you failed 1 test of 8.
I apologize this and will check the reason.

Best Regards,
Takamichi Osumi

#4Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#2)
1 attachment(s)
Re: subscriptionCheck failures

On Tue, Mar 16, 2021 at 9:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 15, 2021 at 6:00 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Hi,

This seems to be a new low frequency failure, I didn't see it mentioned already:

Thanks for reporting, I'll look into it.

By looking at the logs [1]2021-02-23 09:57:47.593 UTC [6034d19b.291aed:7] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_16396_sync_16384_6932396177428838370" LOGICAL pgoutput USE_SNAPSHOT 2021-02-23 09:57:47.593 UTC [6034d19b.291aed:8] STATEMENT: CREATE_REPLICATION_SLOT "pg_16396_sync_16384_6932396177428838370" LOGICAL pgoutput USE_SNAPSHOT 2021-02-23 09:57:47.664 UTC [6034d19b.291ae2:14] LOG: disconnection: session time: 0:00:00.130 user=andres database=postgres host=[local] 2021-02-23 09:57:47.686 UTC [6034d19b.291af3:1] LOG: connection received: host=[local] 2021-02-23 09:57:47.687 UTC [6034d19b.291af3:2] LOG: replication connection authorized: user=andres application_name=tap_sub 2021-02-23 09:57:47.688 UTC [6034d19b.291af3:3] LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2021-02-23 09:57:47.688 UTC [6034d19b.291af3:4] LOG: received replication command: DROP_REPLICATION_SLOT pg_16396_sync_16384_6932396177428838370 WAIT 2021-02-23 09:57:47.688 UTC [6034d19b.291af3:5] STATEMENT: DROP_REPLICATION_SLOT pg_16396_sync_16384_6932396177428838370 WAIT 2021-02-23 09:57:47.688 UTC [6034d19b.291af3:6] ERROR: replication slot "pg_16396_sync_16384_6932396177428838370" does not exist 2021-02-23 09:57:47.688 UTC [6034d19b.291af3:7] STATEMENT: DROP_REPLICATION_SLOT pg_16396_sync_16384_6932396177428838370 WAIT in the buildfarm, I think I know what is
going on here. After Create Subscription, the tablesync worker is
launched and tries to create the slot for doing the initial copy but
before it could finish creating the slot, we issued the Drop
Subscription which first stops the tablesync worker and then tried to
drop its slot. Now, it is quite possible that by the time Drop
Subscription tries to drop the tablesync slot, it is not yet created.
We treat this condition okay and just Logs the message. I don't think
this is an issue because anyway generally such a slot created on the
server will be dropped before we persist it but the test was checking
the existence of slots on server before it gets dropped. I think we
can avoid such a situation by preventing cancel/die interrupts while
creating tablesync slot.

This is a timing issue, so I have reproduced it via debugger and
tested that the attached patch fixes it.

[1]: 2021-02-23 09:57:47.593 UTC [6034d19b.291aed:7] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_16396_sync_16384_6932396177428838370" LOGICAL pgoutput USE_SNAPSHOT 2021-02-23 09:57:47.593 UTC [6034d19b.291aed:8] STATEMENT: CREATE_REPLICATION_SLOT "pg_16396_sync_16384_6932396177428838370" LOGICAL pgoutput USE_SNAPSHOT 2021-02-23 09:57:47.664 UTC [6034d19b.291ae2:14] LOG: disconnection: session time: 0:00:00.130 user=andres database=postgres host=[local] 2021-02-23 09:57:47.686 UTC [6034d19b.291af3:1] LOG: connection received: host=[local] 2021-02-23 09:57:47.687 UTC [6034d19b.291af3:2] LOG: replication connection authorized: user=andres application_name=tap_sub 2021-02-23 09:57:47.688 UTC [6034d19b.291af3:3] LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2021-02-23 09:57:47.688 UTC [6034d19b.291af3:4] LOG: received replication command: DROP_REPLICATION_SLOT pg_16396_sync_16384_6932396177428838370 WAIT 2021-02-23 09:57:47.688 UTC [6034d19b.291af3:5] STATEMENT: DROP_REPLICATION_SLOT pg_16396_sync_16384_6932396177428838370 WAIT 2021-02-23 09:57:47.688 UTC [6034d19b.291af3:6] ERROR: replication slot "pg_16396_sync_16384_6932396177428838370" does not exist 2021-02-23 09:57:47.688 UTC [6034d19b.291af3:7] STATEMENT: DROP_REPLICATION_SLOT pg_16396_sync_16384_6932396177428838370 WAIT
2021-02-23 09:57:47.593 UTC [6034d19b.291aed:7] LOG: received
replication command: CREATE_REPLICATION_SLOT
"pg_16396_sync_16384_6932396177428838370" LOGICAL pgoutput
USE_SNAPSHOT
2021-02-23 09:57:47.593 UTC [6034d19b.291aed:8] STATEMENT:
CREATE_REPLICATION_SLOT "pg_16396_sync_16384_6932396177428838370"
LOGICAL pgoutput USE_SNAPSHOT
2021-02-23 09:57:47.664 UTC [6034d19b.291ae2:14] LOG: disconnection:
session time: 0:00:00.130 user=andres database=postgres host=[local]
2021-02-23 09:57:47.686 UTC [6034d19b.291af3:1] LOG: connection
received: host=[local]
2021-02-23 09:57:47.687 UTC [6034d19b.291af3:2] LOG: replication
connection authorized: user=andres application_name=tap_sub
2021-02-23 09:57:47.688 UTC [6034d19b.291af3:3] LOG: statement:
SELECT pg_catalog.set_config('search_path', '', false);
2021-02-23 09:57:47.688 UTC [6034d19b.291af3:4] LOG: received
replication command: DROP_REPLICATION_SLOT
pg_16396_sync_16384_6932396177428838370 WAIT
2021-02-23 09:57:47.688 UTC [6034d19b.291af3:5] STATEMENT:
DROP_REPLICATION_SLOT pg_16396_sync_16384_6932396177428838370 WAIT
2021-02-23 09:57:47.688 UTC [6034d19b.291af3:6] ERROR: replication
slot "pg_16396_sync_16384_6932396177428838370" does not exist
2021-02-23 09:57:47.688 UTC [6034d19b.291af3:7] STATEMENT:
DROP_REPLICATION_SLOT pg_16396_sync_16384_6932396177428838370 WAIT

--
With Regards,
Amit Kapila.

Attachments:

v1-0001-Prevent-cancel-die-interrupts-while-creating-tabl.patchapplication/octet-stream; name=v1-0001-Prevent-cancel-die-interrupts-while-creating-tabl.patchDownload
From e89b00fb9bbc26608086e92b382e5451e20adf50 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 16 Mar 2021 12:07:46 +0530
Subject: [PATCH v1] Prevent cancel/die interrupts while creating tablesync
 slot.

---
 src/backend/replication/logical/tablesync.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e7ac0..53788685398 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1051,9 +1051,16 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * Create a new permanent logical decoding slot. This slot will be used
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
+	 *
+	 * Prevent cancel/die interrupts while creating slot here because it is
+	 * possible that before the server finishes this command a concurrent drop
+	 * subscrition happens which would complete without removing this slot
+	 * leading to a dangling slot on the server.
 	 */
+	HOLD_INTERRUPTS();
 	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
+	RESUME_INTERRUPTS();
 
 	/*
 	 * Setup replication origin tracking. The purpose of doing this before the
-- 
2.28.0.windows.1

#5vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#4)
Re: subscriptionCheck failures

On Tue, Mar 16, 2021 at 12:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 16, 2021 at 9:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 15, 2021 at 6:00 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Hi,

This seems to be a new low frequency failure, I didn't see it mentioned already:

Thanks for reporting, I'll look into it.

By looking at the logs [1] in the buildfarm, I think I know what is
going on here. After Create Subscription, the tablesync worker is
launched and tries to create the slot for doing the initial copy but
before it could finish creating the slot, we issued the Drop
Subscription which first stops the tablesync worker and then tried to
drop its slot. Now, it is quite possible that by the time Drop
Subscription tries to drop the tablesync slot, it is not yet created.
We treat this condition okay and just Logs the message. I don't think
this is an issue because anyway generally such a slot created on the
server will be dropped before we persist it but the test was checking
the existence of slots on server before it gets dropped. I think we
can avoid such a situation by preventing cancel/die interrupts while
creating tablesync slot.

This is a timing issue, so I have reproduced it via debugger and
tested that the attached patch fixes it.

Thanks for the patch.
I was able to reproduce the issue using debugger by making it wait at
CreateReplicationSlot. After applying the patch the issue gets solved.

Few minor comments:
1) subscrition should be subscription in the below change:
+ * Prevent cancel/die interrupts while creating slot here because it is
+ * possible that before the server finishes this command a concurrent drop
+ * subscrition happens which would complete without removing this slot
+ * leading to a dangling slot on the server.
  */
2) "finishes this command a concurrent drop" should be "finishes this
command, a concurrent drop" in the below change:
+ * Prevent cancel/die interrupts while creating slot here because it is
+ * possible that before the server finishes this command a concurrent drop
+ * subscrition happens which would complete without removing this slot
+ * leading to a dangling slot on the server.
  */

Regards,
Vignesh

#6osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: vignesh C (#5)
RE: subscriptionCheck failures

Hi

On Tuesday, March 16, 2021 4:15 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, Mar 16, 2021 at 12:29 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Mar 16, 2021 at 9:00 AM Amit Kapila <amit.kapila16@gmail.com>

wrote:

On Mon, Mar 15, 2021 at 6:00 PM Thomas Munro

<thomas.munro@gmail.com> wrote:

Hi,

This seems to be a new low frequency failure, I didn't see it mentioned

already:

Thanks for reporting, I'll look into it.

By looking at the logs [1] in the buildfarm, I think I know what is
going on here. After Create Subscription, the tablesync worker is
launched and tries to create the slot for doing the initial copy but
before it could finish creating the slot, we issued the Drop
Subscription which first stops the tablesync worker and then tried to
drop its slot. Now, it is quite possible that by the time Drop
Subscription tries to drop the tablesync slot, it is not yet created.
We treat this condition okay and just Logs the message. I don't think
this is an issue because anyway generally such a slot created on the
server will be dropped before we persist it but the test was checking
the existence of slots on server before it gets dropped. I think we
can avoid such a situation by preventing cancel/die interrupts while
creating tablesync slot.

This is a timing issue, so I have reproduced it via debugger and
tested that the attached patch fixes it.

Thanks for the patch.
I was able to reproduce the issue using debugger by making it wait at
CreateReplicationSlot. After applying the patch the issue gets solved.

I really appreciate everyone's help.

For the double check, I utilized the patch and debugger too.
I also put one while loop at the top of CreateReplicationSlot to control walsender.

Without the patch, DROP SUBSCRIPTION goes forward,
even when the table sync worker cannot move by the CreateReplicationSlot loop,
and the table sync worker is killed by DROP SUBSCRIPTION command.
On the other hand, with the patch contents, I observed that
DROP SUBSCRIPTION hangs and waits
until I release the walsender process from CreateReplicationSlot.
After this, the command drops two slots like below.

NOTICE: dropped replication slot "pg_16391_sync_16385_6940222843739406079" on publisher
NOTICE: dropped replication slot "mysub1" on publisher
DROP SUBSCRIPTION

To me, this correctly works because
the timing I put the while loop and stops the walsender
makes the DROP SUBSCRIPTION affects two slots. Any comments ?

Best Regards,
Takamichi Osumi

#7Amit Kapila
amit.kapila16@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#6)
Re: subscriptionCheck failures

On Tue, Mar 16, 2021 at 6:22 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

To me, this correctly works because
the timing I put the while loop and stops the walsender
makes the DROP SUBSCRIPTION affects two slots. Any comments ?

No, your testing looks fine. I have also done the similar test. Pushed!

--
With Regards,
Amit Kapila.