Graceful way to handle too many locks

Started by Chris Cleveland, about 1 year ago. 3 messages
#1Chris Cleveland
ccleveland@dieselpoint.com

In my extension I got a mystery error:

TRAP: failed Assert("InterruptHoldoffCount > 0"), File: "lwlock.c", Line:
1869, PID: 62663
0 postgres 0x000000010135adb4 ExceptionalCondition + 108
1 postgres 0x00000001012235ec LWLockRelease + 1456
2 postgres 0x00000001011faebc UnlockReleaseBuffer + 24

Turns out there was a bug in my extension where I was getting a share lock
on a particular index page over and over. Oddly, the error showed up not
when I was getting the locks, but when I released them. Any time I locked
the index page more than ~200 times, this error would show up on release.

Questions:

1. Why is the limit on the number of locks so low? I thought that when
getting a share lock, all it did was bump a reference count.

2. Is there a way to get this to fail gracefully, that is, with an error
message that makes sense, and kicks in at the moment you go over the limit,
instead of later?

--
Chris Cleveland
312-339-2677 mobile

#2Tomas Vondra
tomas@vondra.me
In reply to: Chris Cleveland (#1)
Re: Graceful way to handle too many locks

On 11/13/24 20:05, Chris Cleveland wrote:

> In my extension I got a mystery error:
>
> TRAP: failed Assert("InterruptHoldoffCount > 0"), File: "lwlock.c",
> Line: 1869, PID: 62663
> 0 postgres 0x000000010135adb4 ExceptionalCondition + 108
> 1 postgres 0x00000001012235ec LWLockRelease + 1456
> 2 postgres 0x00000001011faebc UnlockReleaseBuffer + 24

> Turns out there was a bug in my extension where I was getting a share
> lock on a particular index page over and over. Oddly, the error showed
> up not when I was getting the locks, but when I released them. Any time
> I locked the index page more than ~200 times, this error would show up
> on release.
>
> Questions:
>
> 1. Why is the limit on the number of locks so low? I thought that when
> getting a share lock, all it did was bump a reference count.

Because good code shouldn't really need to hold more than 200 LWLocks at
once. Note this limit does not apply to row locks, relation locks, and
so on.

> 2. Is there a way to get this to fail gracefully, that is, with an error
> message that makes sense, and kicks in at the moment you go over the
> limit, instead of later?

Not really; the limit of 200 LWLocks is hard-coded, so the only solution
is to not acquire that many of them (in a single backend). But I wonder
if you're actually hitting that limit, because that should trigger

/* Ensure we will have room to remember the lock */
if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
    elog(ERROR, "too many LWLocks taken");

and not the assert. That suggests your extension does something wrong
with its HOLD_INTERRUPTS() / RESUME_INTERRUPTS() bookkeeping.

regards

--
Tomas Vondra

#3Robert Haas
robertmhaas@gmail.com
In reply to: Chris Cleveland (#1)
Re: Graceful way to handle too many locks

On Wed, Nov 13, 2024 at 2:05 PM Chris Cleveland
<ccleveland@dieselpoint.com> wrote:

> In my extension I got a mystery error:
>
> TRAP: failed Assert("InterruptHoldoffCount > 0"), File: "lwlock.c", Line: 1869, PID: 62663
> 0 postgres 0x000000010135adb4 ExceptionalCondition + 108
> 1 postgres 0x00000001012235ec LWLockRelease + 1456
> 2 postgres 0x00000001011faebc UnlockReleaseBuffer + 24
>
> Turns out there was a bug in my extension where I was getting a share lock on a particular index page over and over. Oddly, the error showed up not when I was getting the locks, but when I released them. Any time I locked the index page more than ~200 times, this error would show up on release.

I wonder how you managed to avoid hitting this check in LWLockAcquire:

/* Ensure we will have room to remember the lock */
if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
    elog(ERROR, "too many LWLocks taken");

> 1. Why is the limit on the number of locks so low? I thought that when getting a share lock, all it did was bump a reference count.

200 LWLocks is an ENORMOUS number of LWLocks to be holding at once.
Except in very specific circumstances such as the one mentioned in the
comment for MAX_SIMUL_LWLOCKS, holding more than 1 or 2 or MAYBE 3
LWLocks simultaneously is a recipe for disaster. One issue is that
there's no deadlock checking, and it's easy to bring down an entire
system. Another issue is that other code will be expecting you to
release the lock quickly and you may cause the entire system to pile
up behind whichever lock you're holding. Details aside, you're only
supposed to hold an LWLock while you're actively looking at the
in-memory data structure it protects. If you need to keep a buffer
around for a longer time, you can hold a buffer pin for a longer time,
but the time for which you actually hold the lock needs to be minimal.
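The pattern described above -- pin for as long as you need the page, lock only while actually reading it -- looks roughly like this against the bufmgr API (a sketch; examine_page, rel, and blkno are hypothetical names, not code from the extension in question):

```c
#include "postgres.h"
#include "storage/bufmgr.h"

static void
examine_page(Relation rel, BlockNumber blkno)
{
    Buffer      buf;

    buf = ReadBuffer(rel, blkno);           /* takes a pin, no lock */

    LockBuffer(buf, BUFFER_LOCK_SHARE);     /* lock only while inspecting */
    /* ... look at BufferGetPage(buf) ... */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* drop the lock right away */

    /*
     * The pin alone keeps the buffer from being evicted, so it is safe to
     * hold it for longer; just reacquire the content lock before touching
     * the page contents again.
     */
    ReleaseBuffer(buf);                     /* finally drop the pin */
}
```

This keeps the lock-hold window minimal while the pin provides the longer-lived stability; UnlockReleaseBuffer() is simply the lock-drop and pin-drop combined.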

--
Robert Haas
EDB: http://www.enterprisedb.com