potential stuck lock in SaveSlotToPath()

Started by Peter Eisentrautabout 6 years ago16 messageshackers
Jump to latest
#1Peter Eisentraut
peter_e@gmx.net

When SaveSlotToPath() is called with elevel=LOG, the early exits don't
release the slot's io_in_progress_lock. Fix attached.

This could result in a walsender being stuck on the lock forever. A
possible way to get into this situation is if the offending code paths
are triggered in a low disk space situation. (This is how it was found;
maybe there are other ways.)

Pavan Deolasee and Craig Ringer worked on this issue. I'm forwarding it
on their behalf.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Drop-slot-s-LWLock-before-returning-from-SaveSlotToP.patchtext/plain; charset=UTF-8; name=0001-Drop-slot-s-LWLock-before-returning-from-SaveSlotToP.patch; x-mac-creator=0; x-mac-type=0Download+11-1
#2Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#1)
Re: potential stuck lock in SaveSlotToPath()

Hi,

On 2020-03-18 16:46:23 +0100, Peter Eisentraut wrote:

When SaveSlotToPath() is called with elevel=LOG, the early exits don't
release the slot's io_in_progress_lock. Fix attached.

I'm a bit confused as to why we we ever call it with elevel = LOG
(i.e. why we have the elevel parameter at all). That seems to have been
there from the start, so it's either me or Robert that's to blame. But I
can't immediately see a reason for it?

Greetings,

Andres Freund

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#2)
Re: potential stuck lock in SaveSlotToPath()

On 2020-Mar-18, Andres Freund wrote:

Hi,

On 2020-03-18 16:46:23 +0100, Peter Eisentraut wrote:

When SaveSlotToPath() is called with elevel=LOG, the early exits don't
release the slot's io_in_progress_lock. Fix attached.

I'm a bit confused as to why we we ever call it with elevel = LOG
(i.e. why we have the elevel parameter at all). That seems to have been
there from the start, so it's either me or Robert that's to blame. But I
can't immediately see a reason for it?

I guess you didn't want failure to save a slot be a reason to abort a
checkpoint.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Thomas Munro
thomas.munro@gmail.com
In reply to: Peter Eisentraut (#1)
Re: potential stuck lock in SaveSlotToPath()

On Thu, Mar 19, 2020 at 4:46 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

[patch]

+ * releaseing even in that case.

Typo.

#5Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#3)
Re: potential stuck lock in SaveSlotToPath()

Hi,

On 2020-03-18 16:54:19 -0300, Alvaro Herrera wrote:

On 2020-Mar-18, Andres Freund wrote:

On 2020-03-18 16:46:23 +0100, Peter Eisentraut wrote:

When SaveSlotToPath() is called with elevel=LOG, the early exits don't
release the slot's io_in_progress_lock. Fix attached.

I'm a bit confused as to why we we ever call it with elevel = LOG
(i.e. why we have the elevel parameter at all). That seems to have been
there from the start, so it's either me or Robert that's to blame. But I
can't immediately see a reason for it?

I guess you didn't want failure to save a slot be a reason to abort a
checkpoint.

I don't see a valid reason for that though - if anything it's dangerous,
because we're not persistently saving the slot. It should fail the
checkpoint imo. Robert, do you have an idea?

Greetings,

Andres Freund

#6Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#5)
Re: potential stuck lock in SaveSlotToPath()

On Wed, Mar 18, 2020 at 4:25 PM Andres Freund <andres@anarazel.de> wrote:

I don't see a valid reason for that though - if anything it's dangerous,
because we're not persistently saving the slot. It should fail the
checkpoint imo. Robert, do you have an idea?

Well, the comment atop SaveSlotToPath says:

* This needn't actually be part of a checkpoint, but it's a convenient
* location.

And I agree with that.

Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
right for the early-exit cases either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#6)
Re: potential stuck lock in SaveSlotToPath()

On 2020-03-19 16:38, Robert Haas wrote:

Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
right for the early-exit cases either.

There appear to be appropriate pgstat_report_wait_end() calls. What are
you seeing?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#8Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#7)
Re: potential stuck lock in SaveSlotToPath()

On Fri, Mar 20, 2020 at 11:32 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 2020-03-19 16:38, Robert Haas wrote:

Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
right for the early-exit cases either.

There appear to be appropriate pgstat_report_wait_end() calls. What are
you seeing?

Oh, you're right. I think I got confused because the rename() and
close() don't have that, but those don't have a wait event set either.
Sorry for the noise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#8)
Re: potential stuck lock in SaveSlotToPath()

On 2020-03-20 16:38, Robert Haas wrote:

On Fri, Mar 20, 2020 at 11:32 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 2020-03-19 16:38, Robert Haas wrote:

Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
right for the early-exit cases either.

There appear to be appropriate pgstat_report_wait_end() calls. What are
you seeing?

Oh, you're right. I think I got confused because the rename() and
close() don't have that, but those don't have a wait event set either.
Sorry for the noise.

Any concerns about applying and backpatching the patch I posted?

The talk about reorganizing this code doesn't seem very concrete at the
moment and would probably not be backpatch material anyway.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Eisentraut (#9)
Re: potential stuck lock in SaveSlotToPath()

On 2020-Mar-25, Peter Eisentraut wrote:

On 2020-03-20 16:38, Robert Haas wrote:

On Fri, Mar 20, 2020 at 11:32 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 2020-03-19 16:38, Robert Haas wrote:

Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
right for the early-exit cases either.

There appear to be appropriate pgstat_report_wait_end() calls. What are
you seeing?

Oh, you're right. I think I got confused because the rename() and
close() don't have that, but those don't have a wait event set either.
Sorry for the noise.

Any concerns about applying and backpatching the patch I posted?

It looks a straight bug fix to me, I agree it should be back-patched.

The talk about reorganizing this code doesn't seem very concrete at the
moment and would probably not be backpatch material anyway.

Agreed on both counts.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#9)
Re: potential stuck lock in SaveSlotToPath()

On Wed, Mar 25, 2020 at 6:13 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

Any concerns about applying and backpatching the patch I posted?

Not from me.

The talk about reorganizing this code doesn't seem very concrete at the
moment and would probably not be backpatch material anyway.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#11)
Re: potential stuck lock in SaveSlotToPath()

On 2020-03-25 17:56, Robert Haas wrote:

On Wed, Mar 25, 2020 at 6:13 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

Any concerns about applying and backpatching the patch I posted?

Not from me.

The talk about reorganizing this code doesn't seem very concrete at the
moment and would probably not be backpatch material anyway.

+1.

committed and backpatched

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#12)
Re: potential stuck lock in SaveSlotToPath()

On Thu, Mar 26, 2020 at 02:16:05PM +0100, Peter Eisentraut wrote:

committed and backpatched

The patch committed does that in three places:
/* rename to permanent file, fsync file and directory */
if (rename(tmppath, path) != 0)
{
+ LWLockRelease(&slot->io_in_progress_lock);
ereport(elevel,
(errcode_for_file_access(),
errmsg("could not rename file \"%s\" to \"%s\": %m",

But why do you assume that LWLockRelease() never changes errno? It
seems to me that you should save errno before calling LWLockRelease(),
and then restore it back before using %m in the log message, no? See
for example the case where trace_lwlocks is set.
--
Michael

#14Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#13)
Re: potential stuck lock in SaveSlotToPath()

On 2020-03-27 08:48, Michael Paquier wrote:

On Thu, Mar 26, 2020 at 02:16:05PM +0100, Peter Eisentraut wrote:

committed and backpatched

The patch committed does that in three places:
/* rename to permanent file, fsync file and directory */
if (rename(tmppath, path) != 0)
{
+ LWLockRelease(&slot->io_in_progress_lock);
ereport(elevel,
(errcode_for_file_access(),
errmsg("could not rename file \"%s\" to \"%s\": %m",

But why do you assume that LWLockRelease() never changes errno? It
seems to me that you should save errno before calling LWLockRelease(),
and then restore it back before using %m in the log message, no? See
for example the case where trace_lwlocks is set.

Good catch. How about the attached patch?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Save-errno-across-LWLockRelease-calls.patchtext/plain; charset=UTF-8; name=0001-Save-errno-across-LWLockRelease-calls.patch; x-mac-creator=0; x-mac-type=0Download+11-2
#15Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#14)
Re: potential stuck lock in SaveSlotToPath()

On Wed, Apr 01, 2020 at 04:26:25PM +0200, Peter Eisentraut wrote:

Good catch. How about the attached patch?

WFM. Another trick would be to call LWLockRelease() after generating
the log, but I find your patch more consistent with the surroundings.
--
Michael

#16Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#15)
Re: potential stuck lock in SaveSlotToPath()

On 2020-04-02 08:21, Michael Paquier wrote:

On Wed, Apr 01, 2020 at 04:26:25PM +0200, Peter Eisentraut wrote:

Good catch. How about the attached patch?

WFM. Another trick would be to call LWLockRelease() after generating
the log, but I find your patch more consistent with the surroundings.

done

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services