potential stuck lock in SaveSlotToPath()

Started by Peter Eisentrautalmost 6 years ago16 messages
#1Peter Eisentraut
peter.eisentraut@2ndquadrant.com
1 attachment(s)

When SaveSlotToPath() is called with elevel=LOG, the early exits don't
release the slot's io_in_progress_lock. Fix attached.

This could result in a walsender being stuck on the lock forever. A
possible way to get into this situation is if the offending code paths
are triggered in a low disk space situation. (This is how it was found;
maybe there are other ways.)

Pavan Deolasee and Craig Ringer worked on this issue. I'm forwarding it
on their behalf.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Drop-slot-s-LWLock-before-returning-from-SaveSlotToP.patchtext/plain; charset=UTF-8; name=0001-Drop-slot-s-LWLock-before-returning-from-SaveSlotToP.patch; x-mac-creator=0; x-mac-type=0Download
From b33a856c68e1c48c8663a760f79b2b3d3bf0088c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Wed, 18 Mar 2020 16:24:59 +0100
Subject: [PATCH] Drop slot's LWLock before returning from SaveSlotToPath()

When SaveSlotToPath() is called with elevel=LOG, the early exits didn't
release the slot's io_in_progress_lock.

This could result in a walsender being stuck on the lock forever.  A
possible way to get into this situation is if the offending code paths
are triggered in a low disk space situation.

Author: Pavan Deolasee <pavan.deolasee@2ndquadrant.com>
Reported-by: Craig Ringer <craig@2ndquadrant.com>
---
 src/backend/replication/slot.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 1cec53d748..1bff2d6185 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1256,6 +1256,13 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	fd = OpenTransientFile(tmppath, O_CREAT | O_EXCL | O_WRONLY | PG_BINARY);
 	if (fd < 0)
 	{
+		/*
+		 * If not an ERROR, then release the lock before returning the control
+		 * back to the caller. In case of an ERROR, the error recovery path
+		 * should automatically release the lock, but no harm in explicitly
+		 * releaseing even in that case.
+		 */
+		LWLockRelease(&slot->io_in_progress_lock);
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m",
@@ -1287,6 +1294,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 		pgstat_report_wait_end();
 		CloseTransientFile(fd);
+		LWLockRelease(&slot->io_in_progress_lock);
 
 		/* if write didn't set errno, assume problem is no disk space */
 		errno = save_errno ? save_errno : ENOSPC;
@@ -1306,6 +1314,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 		pgstat_report_wait_end();
 		CloseTransientFile(fd);
+		LWLockRelease(&slot->io_in_progress_lock);
 		errno = save_errno;
 		ereport(elevel,
 				(errcode_for_file_access(),
@@ -1317,6 +1326,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	if (CloseTransientFile(fd) != 0)
 	{
+		LWLockRelease(&slot->io_in_progress_lock);
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m",
@@ -1327,6 +1337,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	/* rename to permanent file, fsync file and directory */
 	if (rename(tmppath, path) != 0)
 	{
+		LWLockRelease(&slot->io_in_progress_lock);
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-- 
2.25.0

#2Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#1)
Re: potential stuck lock in SaveSlotToPath()

Hi,

On 2020-03-18 16:46:23 +0100, Peter Eisentraut wrote:

When SaveSlotToPath() is called with elevel=LOG, the early exits don't
release the slot's io_in_progress_lock. Fix attached.

I'm a bit confused as to why we we ever call it with elevel = LOG
(i.e. why we have the elevel parameter at all). That seems to have been
there from the start, so it's either me or Robert that's to blame. But I
can't immediately see a reason for it?

Greetings,

Andres Freund

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#2)
Re: potential stuck lock in SaveSlotToPath()

On 2020-Mar-18, Andres Freund wrote:

Hi,

On 2020-03-18 16:46:23 +0100, Peter Eisentraut wrote:

When SaveSlotToPath() is called with elevel=LOG, the early exits don't
release the slot's io_in_progress_lock. Fix attached.

I'm a bit confused as to why we we ever call it with elevel = LOG
(i.e. why we have the elevel parameter at all). That seems to have been
there from the start, so it's either me or Robert that's to blame. But I
can't immediately see a reason for it?

I guess you didn't want failure to save a slot be a reason to abort a
checkpoint.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Thomas Munro
thomas.munro@gmail.com
In reply to: Peter Eisentraut (#1)
Re: potential stuck lock in SaveSlotToPath()

On Thu, Mar 19, 2020 at 4:46 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

[patch]

+ * releaseing even in that case.

Typo.

#5Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#3)
Re: potential stuck lock in SaveSlotToPath()

Hi,

On 2020-03-18 16:54:19 -0300, Alvaro Herrera wrote:

On 2020-Mar-18, Andres Freund wrote:

On 2020-03-18 16:46:23 +0100, Peter Eisentraut wrote:

When SaveSlotToPath() is called with elevel=LOG, the early exits don't
release the slot's io_in_progress_lock. Fix attached.

I'm a bit confused as to why we we ever call it with elevel = LOG
(i.e. why we have the elevel parameter at all). That seems to have been
there from the start, so it's either me or Robert that's to blame. But I
can't immediately see a reason for it?

I guess you didn't want failure to save a slot be a reason to abort a
checkpoint.

I don't see a valid reason for that though - if anything it's dangerous,
because we're not persistently saving the slot. It should fail the
checkpoint imo. Robert, do you have an idea?

Greetings,

Andres Freund

#6Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#5)
Re: potential stuck lock in SaveSlotToPath()

On Wed, Mar 18, 2020 at 4:25 PM Andres Freund <andres@anarazel.de> wrote:

I don't see a valid reason for that though - if anything it's dangerous,
because we're not persistently saving the slot. It should fail the
checkpoint imo. Robert, do you have an idea?

Well, the comment atop SaveSlotToPath says:

* This needn't actually be part of a checkpoint, but it's a convenient
* location.

And I agree with that.

Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
right for the early-exit cases either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Robert Haas (#6)
Re: potential stuck lock in SaveSlotToPath()

On 2020-03-19 16:38, Robert Haas wrote:

Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
right for the early-exit cases either.

There appear to be appropriate pgstat_report_wait_end() calls. What are
you seeing?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#8Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#7)
Re: potential stuck lock in SaveSlotToPath()

On Fri, Mar 20, 2020 at 11:32 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 2020-03-19 16:38, Robert Haas wrote:

Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
right for the early-exit cases either.

There appear to be appropriate pgstat_report_wait_end() calls. What are
you seeing?

Oh, you're right. I think I got confused because the rename() and
close() don't have that, but those don't have a wait event set either.
Sorry for the noise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Robert Haas (#8)
Re: potential stuck lock in SaveSlotToPath()

On 2020-03-20 16:38, Robert Haas wrote:

On Fri, Mar 20, 2020 at 11:32 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 2020-03-19 16:38, Robert Haas wrote:

Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
right for the early-exit cases either.

There appear to be appropriate pgstat_report_wait_end() calls. What are
you seeing?

Oh, you're right. I think I got confused because the rename() and
close() don't have that, but those don't have a wait event set either.
Sorry for the noise.

Any concerns about applying and backpatching the patch I posted?

The talk about reorganizing this code doesn't seem very concrete at the
moment and would probably not be backpatch material anyway.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Eisentraut (#9)
Re: potential stuck lock in SaveSlotToPath()

On 2020-Mar-25, Peter Eisentraut wrote:

On 2020-03-20 16:38, Robert Haas wrote:

On Fri, Mar 20, 2020 at 11:32 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 2020-03-19 16:38, Robert Haas wrote:

Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
right for the early-exit cases either.

There appear to be appropriate pgstat_report_wait_end() calls. What are
you seeing?

Oh, you're right. I think I got confused because the rename() and
close() don't have that, but those don't have a wait event set either.
Sorry for the noise.

Any concerns about applying and backpatching the patch I posted?

It looks a straight bug fix to me, I agree it should be back-patched.

The talk about reorganizing this code doesn't seem very concrete at the
moment and would probably not be backpatch material anyway.

Agreed on both counts.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#9)
Re: potential stuck lock in SaveSlotToPath()

On Wed, Mar 25, 2020 at 6:13 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

Any concerns about applying and backpatching the patch I posted?

Not from me.

The talk about reorganizing this code doesn't seem very concrete at the
moment and would probably not be backpatch material anyway.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Robert Haas (#11)
Re: potential stuck lock in SaveSlotToPath()

On 2020-03-25 17:56, Robert Haas wrote:

On Wed, Mar 25, 2020 at 6:13 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

Any concerns about applying and backpatching the patch I posted?

Not from me.

The talk about reorganizing this code doesn't seem very concrete at the
moment and would probably not be backpatch material anyway.

+1.

committed and backpatched

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#12)
Re: potential stuck lock in SaveSlotToPath()

On Thu, Mar 26, 2020 at 02:16:05PM +0100, Peter Eisentraut wrote:

committed and backpatched

The patch committed does that in three places:
/* rename to permanent file, fsync file and directory */
if (rename(tmppath, path) != 0)
{
+ LWLockRelease(&slot->io_in_progress_lock);
ereport(elevel,
(errcode_for_file_access(),
errmsg("could not rename file \"%s\" to \"%s\": %m",

But why do you assume that LWLockRelease() never changes errno? It
seems to me that you should save errno before calling LWLockRelease(),
and then restore it back before using %m in the log message, no? See
for example the case where trace_lwlocks is set.
--
Michael

#14Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Michael Paquier (#13)
1 attachment(s)
Re: potential stuck lock in SaveSlotToPath()

On 2020-03-27 08:48, Michael Paquier wrote:

On Thu, Mar 26, 2020 at 02:16:05PM +0100, Peter Eisentraut wrote:

committed and backpatched

The patch committed does that in three places:
/* rename to permanent file, fsync file and directory */
if (rename(tmppath, path) != 0)
{
+ LWLockRelease(&slot->io_in_progress_lock);
ereport(elevel,
(errcode_for_file_access(),
errmsg("could not rename file \"%s\" to \"%s\": %m",

But why do you assume that LWLockRelease() never changes errno? It
seems to me that you should save errno before calling LWLockRelease(),
and then restore it back before using %m in the log message, no? See
for example the case where trace_lwlocks is set.

Good catch. How about the attached patch?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Save-errno-across-LWLockRelease-calls.patchtext/plain; charset=UTF-8; name=0001-Save-errno-across-LWLockRelease-calls.patch; x-mac-creator=0; x-mac-type=0Download
From 484d734e9a985fa3e9242644ca9173658b9efe2c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Wed, 1 Apr 2020 16:24:47 +0200
Subject: [PATCH] Save errno across LWLockRelease() calls

Fixup for "Drop slot's LWLock before returning from SaveSlotToPath()"

Reported-by: Michael Paquier <michael@paquier.xyz>
---
 src/backend/replication/slot.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index d90c7235e9..47851ec4c1 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1259,9 +1259,13 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 		/*
 		 * If not an ERROR, then release the lock before returning.  In case
 		 * of an ERROR, the error recovery path automatically releases the
-		 * lock, but no harm in explicitly releasing even in that case.
+		 * lock, but no harm in explicitly releasing even in that case.  Note
+		 * that LWLockRelease() could affect errno.
 		 */
+		int			save_errno = errno;
+
 		LWLockRelease(&slot->io_in_progress_lock);
+		errno = save_errno;
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m",
@@ -1325,7 +1329,10 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	if (CloseTransientFile(fd) != 0)
 	{
+		int			save_errno = errno;
+
 		LWLockRelease(&slot->io_in_progress_lock);
+		errno = save_errno;
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m",
@@ -1336,7 +1343,10 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	/* rename to permanent file, fsync file and directory */
 	if (rename(tmppath, path) != 0)
 	{
+		int			save_errno = errno;
+
 		LWLockRelease(&slot->io_in_progress_lock);
+		errno = save_errno;
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-- 
2.26.0

#15Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#14)
Re: potential stuck lock in SaveSlotToPath()

On Wed, Apr 01, 2020 at 04:26:25PM +0200, Peter Eisentraut wrote:

Good catch. How about the attached patch?

WFM. Another trick would be to call LWLockRelease() after generating
the log, but I find your patch more consistent with the surroundings.
--
Michael

#16Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Michael Paquier (#15)
Re: potential stuck lock in SaveSlotToPath()

On 2020-04-02 08:21, Michael Paquier wrote:

On Wed, Apr 01, 2020 at 04:26:25PM +0200, Peter Eisentraut wrote:

Good catch. How about the attached patch?

WFM. Another trick would be to call LWLockRelease() after generating
the log, but I find your patch more consistent with the surroundings.

done

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services