hash_xlog_split_allocate_page: failed to acquire cleanup lock
Hi,
One CI run for the meson branch just failed in a way I hadn't seen before on
windows, when nothing had changed on windows
https://cirrus-ci.com/task/6111743586861056
027_stream_regress.pl ended up failing due to a timeout, which in turn was
caused by the standby crashing.
2022-08-10 01:46:20.731 GMT [2212][startup] PANIC: hash_xlog_split_allocate_page: failed to acquire cleanup lock
2022-08-10 01:46:20.731 GMT [2212][startup] CONTEXT: WAL redo at 0/7A6EED8 for Hash/SPLIT_ALLOCATE_PAGE: new_bucket 31, meta_page_masks_updated F, issplitpoint_changed F; blkref #0: rel 1663/16384/24210, blk 23; blkref #1: rel 1663/16384/24210, blk 45; blkref #2: rel 1663/16384/24210, blk 0
abort() has been called
2022-08-10 01:46:31.919 GMT [7560][checkpointer] LOG: restartpoint starting: time
2022-08-10 01:46:32.430 GMT [8304][postmaster] LOG: startup process (PID 2212) was terminated by exception 0xC0000354
The relevant code triggering it:
newbuf = XLogInitBufferForRedo(record, 1);
_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
xlrec->new_bucket_flag, true);
if (!IsBufferCleanupOK(newbuf))
elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire cleanup lock");
Why do we just crash if we don't already have a cleanup lock? That can't be
right. Or is there supposed to be a guarantee this can't happen?
Greetings,
Andres Freund
On Aug 9, 2022, at 7:26 PM, Andres Freund <andres@anarazel.de> wrote:
The relevant code triggering it:
newbuf = XLogInitBufferForRedo(record, 1);
_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
xlrec->new_bucket_flag, true);
if (!IsBufferCleanupOK(newbuf))
elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire cleanup lock");
Why do we just crash if we don't already have a cleanup lock? That can't be
right. Or is there supposed to be a guarantee this can't happen?
Perhaps the code assumes that when xl_hash_split_allocate_page record was written, the new_bucket field referred to an unused page, and so during replay it should also refer to an unused page, and being unused, that nobody will have it pinned. But at least in heap we sometimes pin unused pages just long enough to examine them and to see that they are unused. Maybe something like that is happening here?
I'd be curious to see the count returned by BUF_STATE_GET_REFCOUNT(LockBufHdr(newbuf)) right before this panic. If it's just 1, then it's not another backend, but our own, and we'd want to debug why we're pinning the same page twice (or more) while replaying wal. Otherwise, maybe it's a race condition with some other process that transiently pins a buffer and occasionally causes this code to panic?
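To make the suggested diagnostic concrete, here is a toy model of how the shared refcount is packed into the buffer-state word; the names and bit layout are invented for illustration (the real definitions live in buf_internals.h), not PostgreSQL's actual constants.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: the refcount lives in the low bits of a packed state word.
 * The width chosen here is illustrative only. */
#define TOY_REFCOUNT_BITS 18
#define TOY_REFCOUNT_MASK ((1u << TOY_REFCOUNT_BITS) - 1)

/* Analogue of BUF_STATE_GET_REFCOUNT(): extract the shared pin count. */
static uint32_t toy_buf_state_get_refcount(uint32_t buf_state)
{
    return buf_state & TOY_REFCOUNT_MASK;
}

/* The diagnostic being proposed: right before the PANIC, a refcount of 1
 * would mean the only pin is our own (a redo-side bug); anything greater
 * would point at a transient pin from another process. */
static int toy_pin_is_ours_alone(uint32_t buf_state)
{
    return toy_buf_state_get_refcount(buf_state) == 1;
}
```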
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 10, 2022 at 3:21 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
On Aug 9, 2022, at 7:26 PM, Andres Freund <andres@anarazel.de> wrote:
The relevant code triggering it:newbuf = XLogInitBufferForRedo(record, 1);
_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
xlrec->new_bucket_flag, true);
if (!IsBufferCleanupOK(newbuf))
elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire cleanup lock");Why do we just crash if we don't already have a cleanup lock? That can't be
right. Or is there supposed to be a guarantee this can't happen?Perhaps the code assumes that when xl_hash_split_allocate_page record was written, the new_bucket field referred to an unused page, and so during replay it should also refer to an unused page, and being unused, that nobody will have it pinned. But at least in heap we sometimes pin unused pages just long enough to examine them and to see that they are unused. Maybe something like that is happening here?
Here's an email about that:
/messages/by-id/CAE9k0P=OXww6RQCGrmDNa8=L3EeB01SGbYuP23y-qZJ=4td38Q@mail.gmail.com
I'd be curious to see the count returned by BUF_STATE_GET_REFCOUNT(LockBufHdr(newbuf)) right before this panic. If it's just 1, then it's not another backend, but our own, and we'd want to debug why we're pinning the same page twice (or more) while replaying wal. Otherwise, maybe it's a race condition with some other process that transiently pins a buffer and occasionally causes this code to panic?
But which backend could that be? We aren't starting any at that point
in the test.
Someone might wonder if it's the startup process itself via the new
WAL prefetching machinery, but that doesn't pin pages, it only probes
the buffer mapping table to see if future pages are cached already
(see bufmgr.c PrefetchSharedBuffer()). (This is a topic I've thought
about a bit because I have another installment of recovery prefetching
in development using real AIO that *does* pin pages in advance, and
has to deal with code that wants cleanup locks like this...)
It's possible that git log src/backend/access/hash/ can explain a
behaviour change, as there were some recent changes there, but it's
not jumping out at me. Maybe 4f1f5a7f "Remove fls(), use
pg_leftmost_one_pos32() instead." has a maths error, but I don't see
it. Maybe e09d7a12 "Improve speed of hash index build." accidentally
reaches a new state and triggers a latent bug. Maybe a latent bug
showed up now just because we started testing recovery not too long
ago... but all of that still needs another backend involved. We can
see which blocks the startup process has pinned, 23 != 45. Hmmm.
Hi,
On 2022-08-09 20:21:19 -0700, Mark Dilger wrote:
On Aug 9, 2022, at 7:26 PM, Andres Freund <andres@anarazel.de> wrote:
The relevant code triggering it:
newbuf = XLogInitBufferForRedo(record, 1);
_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
xlrec->new_bucket_flag, true);
if (!IsBufferCleanupOK(newbuf))
elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire cleanup lock");
Why do we just crash if we don't already have a cleanup lock? That can't be
right. Or is there supposed to be a guarantee this can't happen?
Perhaps the code assumes that when xl_hash_split_allocate_page record was
written, the new_bucket field referred to an unused page, and so during
replay it should also refer to an unused page, and being unused, that nobody
will have it pinned. But at least in heap we sometimes pin unused pages
just long enough to examine them and to see that they are unused. Maybe
something like that is happening here?
I don't think it's a safe assumption that nobody would hold a pin on such a
page during recovery. While not the case here, somebody else could have used
pg_prewarm to read it in.
But also, the checkpointer or bgwriter could have it temporarily pinned, to
write it out, or another backend could try to write it out as a victim buffer
and have it temporarily pinned.
static int
SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
{
...
/*
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
As you can see we acquire a pin without holding a lock on the page (and that
can't be changed!).
I assume this is trying to defend against some sort of deadlock by not
actually getting a cleanup lock (by passing get_cleanup_lock = true to
XLogReadBufferForRedoExtended()).
I don't think it's possible to rely on a dirty page to never be pinned by
another backend. All you can rely on with a cleanup lock is that there's no
*prior* references to the buffer, and thus it's safe to reorganize the buffer,
because the pin-holder hasn't yet gotten a lock on the page.
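A minimal sketch of the cleanup-lock semantics being described, using an invented toy buffer type (these names are not PostgreSQL's): a cleanup lock holds only when we have the exclusive content lock and ours is the sole pin, so any transient pin from the checkpointer, bgwriter, or a victim-buffer write makes the check fail.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy buffer: a shared pin count plus whether we hold the content lock
 * exclusively.  Purely illustrative; not PostgreSQL's representation. */
typedef struct ToyBuffer
{
    int  refcount;        /* pins held by all backends, ours included */
    bool exclusive_lock;  /* we hold the content lock exclusively */
} ToyBuffer;

/* Cleanup lock = exclusive content lock + no pin other than our own.
 * This guarantees no *prior* reference exists, but nothing stops a new
 * transient pin from having appeared before we checked. */
static bool toy_is_buffer_cleanup_ok(const ToyBuffer *buf)
{
    return buf->exclusive_lock && buf->refcount == 1;
}
```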
I'd be curious to see the count returned by
BUF_STATE_GET_REFCOUNT(LockBufHdr(newbuf)) right before this panic. If it's
just 1, then it's not another backend, but our own, and we'd want to debug
why we're pinning the same page twice (or more) while replaying wal.
This was the first time in a couple hundred runs that I have seen this, so
I don't think it's that easily debuggable for me.
Otherwise, maybe it's a race condition with some other process that
transiently pins a buffer and occasionally causes this code to panic?
As pointed out above, it's legal to have a transient pin on a page, so this
just looks like a bad assumption in the hash code to me.
Greetings,
Andres Freund
On Wed, Aug 10, 2022 at 5:28 PM Andres Freund <andres@anarazel.de> wrote:
I don't think it's a safe assumption that nobody would hold a pin on such a
page during recovery. While not the case here, somebody else could have used
pg_prewarm to read it in.
But also, the checkpointer or bgwriter could have it temporarily pinned, to
write it out, or another backend could try to write it out as a victim buffer
and have it temporarily pinned.
Right, of course. So it's just that hash indexes didn't get xlog'd
until 2017, and still aren't very popular, and then recovery didn't
get regression tested until 2021, so nobody ever hit it.
On Wed, Aug 10, 2022 at 5:35 PM Thomas Munro <thomas.munro@gmail.com> wrote:
2021
Or, rather, 14 days into 2022 :-)
On Wed, Aug 10, 2022 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-08-09 20:21:19 -0700, Mark Dilger wrote:
On Aug 9, 2022, at 7:26 PM, Andres Freund <andres@anarazel.de> wrote:
The relevant code triggering it:
newbuf = XLogInitBufferForRedo(record, 1);
_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
xlrec->new_bucket_flag, true);
if (!IsBufferCleanupOK(newbuf))
elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire cleanup lock");
Why do we just crash if we don't already have a cleanup lock? That can't be
right. Or is there supposed to be a guarantee this can't happen?
Perhaps the code assumes that when xl_hash_split_allocate_page record was
written, the new_bucket field referred to an unused page, and so during
replay it should also refer to an unused page, and being unused, that nobody
will have it pinned. But at least in heap we sometimes pin unused pages
just long enough to examine them and to see that they are unused. Maybe
something like that is happening here?
I don't think it's a safe assumption that nobody would hold a pin on such a
page during recovery. While not the case here, somebody else could have used
pg_prewarm to read it in.
But also, the checkpointer or bgwriter could have it temporarily pinned, to
write it out, or another backend could try to write it out as a victim buffer
and have it temporarily pinned.
static int
SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
{
...
/*
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
As you can see we acquire a pin without holding a lock on the page (and that
can't be changed!).
I think this could be the probable reason for failure though I didn't
try to debug/reproduce this yet. AFAIU, this is possible during
recovery/replay of WAL record XLOG_HASH_SPLIT_ALLOCATE_PAGE as via
XLogReadBufferForRedoExtended, we can mark the buffer dirty while
restoring from full page image. OTOH, because during normal operation
we didn't mark the page dirty SyncOneBuffer would have skipped it due
to check (if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))).
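The skip condition quoted above can be modelled with toy bitflags (the flag values here are placeholders; the real ones are in buf_internals.h): SyncOneBuffer only considers, and thus only pins, buffers that are both valid and dirty, which per this analysis the new bucket page only becomes on the standby once redo restores it from a full-page image.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy versions of the buffer-state flags; values are illustrative. */
#define TOY_BM_VALID (1u << 0)
#define TOY_BM_DIRTY (1u << 1)

/* Mirror of the quoted SyncOneBuffer test: skip unless the buffer is
 * both valid and dirty. */
static bool toy_sync_would_consider(uint32_t buf_state)
{
    return (buf_state & TOY_BM_VALID) && (buf_state & TOY_BM_DIRTY);
}
```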
I assume this is trying to defend against some sort of deadlock by not
actually getting a cleanup lock (by passing get_cleanup_lock = true to
XLogReadBufferForRedoExtended()).
IIRC, this is just following what we do during normal operation and
based on the theory that the meta-page is not updated yet so no
backend will access it. I think we can do what you wrote unless there
is some other reason behind this failure.
--
With Regards,
Amit Kapila.
On Wed, Aug 10, 2022 at 12:39 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Here's an email about that:
/messages/by-id/CAE9k0P=OXww6RQCGrmDNa8=L3EeB01SGbYuP23y-qZJ=4td38Q@mail.gmail.com
Hmm. If I'm reading that email correctly, it indicates that I noticed
this problem before commit and asked for it to be changed, but then
for some reason it wasn't changed and I still committed it.
I can't immediately think of a reason why it wouldn't be safe to
insist on acquiring a cleanup lock there.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Aug 10, 2022 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 10, 2022 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-08-09 20:21:19 -0700, Mark Dilger wrote:
On Aug 9, 2022, at 7:26 PM, Andres Freund <andres@anarazel.de> wrote:
The relevant code triggering it:
newbuf = XLogInitBufferForRedo(record, 1);
_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
xlrec->new_bucket_flag, true);
if (!IsBufferCleanupOK(newbuf))
elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire cleanup lock");
Why do we just crash if we don't already have a cleanup lock? That can't be
right. Or is there supposed to be a guarantee this can't happen?
Perhaps the code assumes that when xl_hash_split_allocate_page record was
written, the new_bucket field referred to an unused page, and so during
replay it should also refer to an unused page, and being unused, that nobody
will have it pinned. But at least in heap we sometimes pin unused pages
just long enough to examine them and to see that they are unused. Maybe
something like that is happening here?
I don't think it's a safe assumption that nobody would hold a pin on such a
page during recovery. While not the case here, somebody else could have used
pg_prewarm to read it in.
But also, the checkpointer or bgwriter could have it temporarily pinned, to
write it out, or another backend could try to write it out as a victim buffer
and have it temporarily pinned.
static int
SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
{
...
/*
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);As you can see we acquire a pin without holding a lock on the page (and that
can't be changed!).I think this could be the probable reason for failure though I didn't
try to debug/reproduce this yet. AFAIU, this is possible during
recovery/replay of WAL record XLOG_HASH_SPLIT_ALLOCATE_PAGE as via
XLogReadBufferForRedoExtended, we can mark the buffer dirty while
restoring from full page image. OTOH, because during normal operation
we didn't mark the page dirty SyncOneBuffer would have skipped it due
to check (if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))).
I'm trying to simulate the scenario in streaming replication using the below:
CREATE TABLE pvactst (i INT, a INT[], p POINT) with (autovacuum_enabled = off);
CREATE INDEX hash_pvactst ON pvactst USING hash (i);
INSERT INTO pvactst SELECT i, array[1,2,3], point(i, i+1) FROM
generate_series(1,1000) i;
With the above scenario, it will be able to replay allocation of page
for split operation. I will slightly change the above statements and
try to debug and see if we can make the background writer process to
pin this buffer and simulate the scenario. I will post my findings
once I'm done with the analysis.
Regards,
Vignesh
Hi,
On 2022-08-10 14:52:36 +0530, Amit Kapila wrote:
I think this could be the probable reason for failure though I didn't
try to debug/reproduce this yet. AFAIU, this is possible during
recovery/replay of WAL record XLOG_HASH_SPLIT_ALLOCATE_PAGE as via
XLogReadBufferForRedoExtended, we can mark the buffer dirty while
restoring from full page image. OTOH, because during normal operation
we didn't mark the page dirty SyncOneBuffer would have skipped it due
to check (if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))).
I think there might still be short-lived references from other paths, even if
not marked dirty, but it isn't really important.
I assume this is trying to defend against some sort of deadlock by not
actually getting a cleanup lock (by passing get_cleanup_lock = true to
XLogReadBufferForRedoExtended()).
IIRC, this is just following what we do during normal operation and
based on the theory that the meta-page is not updated yet so no
backend will access it. I think we can do what you wrote unless there
is some other reason behind this failure.
Well, it's not really the same if you silently continue in normal operation
and PANIC during recovery... If it's an optional operation the tiny race
around not getting the cleanup lock is fine, but it's a totally different
story during recovery.
Greetings,
Andres Freund
On Wed, Aug 10, 2022 at 1:28 AM Andres Freund <andres@anarazel.de> wrote:
I assume this is trying to defend against some sort of deadlock by not
actually getting a cleanup lock (by passing get_cleanup_lock = true to
XLogReadBufferForRedoExtended()).
I had that thought too, but I don't *think* it's the case. This
function acquires a lock on the oldest bucket page, then on the new
bucket page. We could deadlock if someone who holds a pin on the new
bucket page tries to take a content lock on the old bucket page. But
who would do that? The new bucket page isn't yet linked from the
metapage at this point, so no scan should do that. There can be no
concurrent writers during replay. I think that if someone else has the
new page pinned they probably should not be taking content locks on
other buffers at the same time.
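The lock ordering being reasoned about can be sketched in pseudocode (the "peer" here is the hypothetical misbehaving process, not any known code path):

```
-- redo of SPLIT_ALLOCATE_PAGE (startup process):
lock(old_bucket_page)            -- content lock, taken first
cleanup_lock(new_bucket_page)    -- would wait until refcount == 1

-- hypothetical deadlocking peer:
pin(new_bucket_page)             -- blocks the cleanup lock above
lock(old_bucket_page)            -- blocks behind the startup process
```

The argument is that no such peer exists: the new bucket page is not yet reachable from the metapage, and there are no concurrent writers during replay.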
So maybe we can just apply something like the attached.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
hash-xlog-split-allocate-page-v1.patch (application/octet-stream, +2 -3)
Robert Haas <robertmhaas@gmail.com> writes:
I had that thought too, but I don't *think* it's the case. This
function acquires a lock on the oldest bucket page, then on the new
bucket page. We could deadlock if someone who holds a pin on the new
bucket page tries to take a content lock on the old bucket page. But
who would do that? The new bucket page isn't yet linked from the
metapage at this point, so no scan should do that. There can be no
concurrent writers during replay. I think that if someone else has the
new page pinned they probably should not be taking content locks on
other buffers at the same time.
Agreed, the core code shouldn't do that, but somebody doing random stuff
with pageinspect functions could probably make a query do this.
See [1]; unless we're going to reject that bug with "don't do that",
I'm not too comfortable with this line of reasoning.
regards, tom lane
[1] /messages/by-id/17568-ef121b956ec1559c@postgresql.org
On Tue, Aug 16, 2022 at 5:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
I had that thought too, but I don't *think* it's the case. This
function acquires a lock on the oldest bucket page, then on the new
bucket page. We could deadlock if someone who holds a pin on the new
bucket page tries to take a content lock on the old bucket page. But
who would do that? The new bucket page isn't yet linked from the
metapage at this point, so no scan should do that. There can be no
concurrent writers during replay. I think that if someone else has the
new page pinned they probably should not be taking content locks on
other buffers at the same time.
Agreed, the core code shouldn't do that, but somebody doing random stuff
with pageinspect functions could probably make a query do this.
See [1]; unless we're going to reject that bug with "don't do that",
I'm not too comfortable with this line of reasoning.
I don't see the connection. The problem there has to do with bypassing
shared buffers, but this operation isn't bypassing shared buffers.
What sort of random things would someone do with pageinspect functions
that would hold buffer pins on one buffer while locking another one?
The functions in hashfuncs.c don't even seem like they would access
multiple buffers in total, let alone at overlapping times. And I don't
think that a query pageinspect could realistically be suspended while
holding a buffer pin either. If you wrapped it in a cursor it'd be
suspended before or after accessing any given buffer, not right in the
middle of that operation.
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
What sort of random things would someone do with pageinspect functions
that would hold buffer pins on one buffer while locking another one?
The functions in hashfuncs.c don't even seem like they would access
multiple buffers in total, let alone at overlapping times. And I don't
think that a query pageinspect could realistically be suspended while
holding a buffer pin either. If you wrapped it in a cursor it'd be
suspended before or after accessing any given buffer, not right in the
middle of that operation.
pin != access. Unless things have changed really drastically since
I last looked, a seqscan will sit on a buffer pin throughout the
series of fetches from a single page.
Admittedly, that's about *heap* page pins while indexscans have
different rules. But I recall that btrees at least use persistent
pins as well.
It may be that there is indeed no way to make this happen with
available SQL tools. But I wouldn't put a lot of money on that,
and even less that it'll stay true in the future.
Having said that, you're right that this is qualitatively different
from the other bug, in that this is a deadlock not apparent data
corruption. However, IIUC it's an LWLock deadlock, which we don't
handle all that nicely.
regards, tom lane
Hi,
On 2022-08-16 17:02:27 -0400, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
I had that thought too, but I don't *think* it's the case. This
function acquires a lock on the oldest bucket page, then on the new
bucket page. We could deadlock if someone who holds a pin on the new
bucket page tries to take a content lock on the old bucket page. But
who would do that? The new bucket page isn't yet linked from the
metapage at this point, so no scan should do that. There can be no
concurrent writers during replay. I think that if someone else has the
new page pinned they probably should not be taking content locks on
other buffers at the same time.
Agreed, the core code shouldn't do that, but somebody doing random stuff
with pageinspect functions could probably make a query do this.
See [1]; unless we're going to reject that bug with "don't do that",
I'm not too comfortable with this line of reasoning.
I don't think we can defend against lwlock deadlocks where somebody doesn't
follow the AM's deadlock avoidance strategy. I.e. it's fine to pin and lock
pages from some AM without knowing that AM's rules, as long as you only block
while holding a pin/lock of a single page. But it is *not* ok to block waiting
for an lwlock / pin while already holding an lwlock / pin on some other
buffer. If we were concerned about this we'd have to basically throw many of
our multi-page operations that rely on lock order logic out.
Greetings,
Andres Freund
Hi,
On 2022-08-16 19:55:18 -0400, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
What sort of random things would someone do with pageinspect functions
that would hold buffer pins on one buffer while locking another one?
The functions in hashfuncs.c don't even seem like they would access
multiple buffers in total, let alone at overlapping times. And I don't
think that a query pageinspect could realistically be suspended while
holding a buffer pin either. If you wrapped it in a cursor it'd be
suspended before or after accessing any given buffer, not right in the
middle of that operation.
pin != access. Unless things have changed really drastically since
I last looked, a seqscan will sit on a buffer pin throughout the
series of fetches from a single page.
That's still the case. But for heap that shouldn't be a problem, because we'll
never try to take a cleanup lock while holding another page locked (nor even
another heap page pinned, I think).
I find it *highly* suspect that hash needs to acquire a cleanup lock while
holding another buffer locked. The recovery aspect alone makes that seem quite
unwise. Even if there's possibly no deadlock here for some reason or another.
Looking at the non-recovery code makes me even more suspicious:
/*
* Physically allocate the new bucket's primary page. We want to do this
* before changing the metapage's mapping info, in case we can't get the
* disk space. Ideally, we don't need to check for cleanup lock on new
* bucket as no other backend could find this bucket unless meta page is
* updated. However, it is good to be consistent with old bucket locking.
*/
buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
if (!IsBufferCleanupOK(buf_nblkno))
{
_hash_relbuf(rel, buf_oblkno);
_hash_relbuf(rel, buf_nblkno);
goto fail;
}
_hash_getnewbuf() calls _hash_pageinit() which calls PageInit(), which
memset(0)s the whole page. What does it even mean to check whether you
effectively have a cleanup lock after you zeroed out the page?
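The ordering problem being pointed out can be made explicit in pseudocode: the page is reinitialized before the check, so by the time the check could fail, the destructive work has already happened.

```
buf = _hash_getnewbuf(...)        -- already calls PageInit(): page zeroed here
if not IsBufferCleanupOK(buf):    -- "cleanup lock" checked only afterwards
    goto fail                     -- too late for any concurrent pin-holder
```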
Reading the README and the comment above makes me wonder if this whole cleanup
lock business here is just cargo culting and could be dropped?
Admittedly, that's about *heap* page pins while indexscans have
different rules. But I recall that btrees at least use persistent
pins as well.
I think that's been changed, although not in an unproblematic way.
Having said that, you're right that this is qualitatively different
from the other bug, in that this is a deadlock not apparent data
corruption. However, IIUC it's an LWLock deadlock, which we don't
handle all that nicely.
Theoretically the startup side could be interrupted. Except that we don't
accept the startup process dying...
Greetings,
Andres Freund
On Wed, Aug 17, 2022 at 6:27 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-08-16 19:55:18 -0400, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
What sort of random things would someone do with pageinspect functions
that would hold buffer pins on one buffer while locking another one?
The functions in hashfuncs.c don't even seem like they would access
multiple buffers in total, let alone at overlapping times. And I don't
think that a query pageinspect could realistically be suspended while
holding a buffer pin either. If you wrapped it in a cursor it'd be
suspended before or after accessing any given buffer, not right in the
middle of that operation.
pin != access. Unless things have changed really drastically since
I last looked, a seqscan will sit on a buffer pin throughout the
series of fetches from a single page.
That's still the case. But for heap that shouldn't be a problem, because we'll
never try to take a cleanup lock while holding another page locked (nor even
another heap page pinned, I think).
I find it *highly* suspect that hash needs to acquire a cleanup lock while
holding another buffer locked. The recovery aspect alone makes that seem quite
unwise. Even if there's possibly no deadlock here for some reason or another.
Looking at the non-recovery code makes me even more suspicious:
/*
* Physically allocate the new bucket's primary page. We want to do this
* before changing the metapage's mapping info, in case we can't get the
* disk space. Ideally, we don't need to check for cleanup lock on new
* bucket as no other backend could find this bucket unless meta page is
* updated. However, it is good to be consistent with old bucket locking.
*/
buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
if (!IsBufferCleanupOK(buf_nblkno))
{
_hash_relbuf(rel, buf_oblkno);
_hash_relbuf(rel, buf_nblkno);
goto fail;
}
_hash_getnewbuf() calls _hash_pageinit() which calls PageInit(), which
memset(0)s the whole page. What does it even mean to check whether you
effectively have a cleanup lock after you zeroed out the page?
Reading the README and the comment above makes me wonder if this whole cleanup
lock business here is just cargo culting and could be dropped?
I think it is okay to not acquire a clean-up lock on the new bucket
page both in recovery and non-recovery paths. It is primarily required
on the old bucket page to avoid concurrent scans/inserts. As mentioned
in the comments and as per my memory serves, it is mainly for keeping
it consistent with old bucket locking.
--
With Regards,
Amit Kapila.
On Tue, Aug 16, 2022 at 8:38 PM Andres Freund <andres@anarazel.de> wrote:
I don't think we can defend against lwlock deadlocks where somebody doesn't
follow the AM's deadlock avoidance strategy.
That's a good way of putting it. Tom seems to be postulating that
maybe someone can use random tools that exist to take buffer locks and
pins in arbitrary order, and if that is true then you can make any AM
deadlock. I think it isn't true, though, and I think if it were true
the right fix would be to remove the tools that are letting people do
that.
There's also zero evidence that this was ever intended as a deadlock
avoidance maneuver. I think that we are only hypothesizing that it was
intended that way because the code looks weird. But I think the email
discussion shows that I thought it was wrong at the time it was
committed, and just missed the fact that the final version of the
patch hadn't fixed it. And if it *were* a deadlock avoidance maneuver
it would still be pretty broken, because it would make the startup
process error out and the whole system go down.
Regarding the question of whether we need a cleanup lock on the new
bucket I am not really seeing the advantage of going down that path.
Simply fixing this code to take a cleanup lock instead of hoping that
it always gets one by accident is low risk and should fix the observed
problem. Getting rid of the cleanup lock will be more invasive and I'd
like to see some evidence that it's a necessary step before we take
the risk of breaking things.
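A pseudocode sketch of the shape such a fix might take (this is an assumption about the approach, not the actual contents of the attached patch): replace the init-then-check dance with a read mode that acquires the cleanup lock up front, e.g. via XLogReadBufferForRedoExtended with a zero-and-cleanup-lock mode.

```
-- before: initialize, then hope we happen to hold a cleanup lock
newbuf = XLogInitBufferForRedo(record, 1)
if not IsBufferCleanupOK(newbuf):
    elog(PANIC, ...)

-- after (sketch): acquire the cleanup lock as part of reading the buffer,
-- waiting out any transient pin instead of PANICing
XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_AND_CLEANUP_LOCK,
                              true /* get_cleanup_lock */, &newbuf)
```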
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2022-08-17 10:18:14 +0530, Amit Kapila wrote:
Looking at the non-recovery code makes me even more suspicious:
/*
* Physically allocate the new bucket's primary page. We want to do this
* before changing the metapage's mapping info, in case we can't get the
* disk space. Ideally, we don't need to check for cleanup lock on new
* bucket as no other backend could find this bucket unless meta page is
* updated. However, it is good to be consistent with old bucket locking.
*/
buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
if (!IsBufferCleanupOK(buf_nblkno))
{
_hash_relbuf(rel, buf_oblkno);
_hash_relbuf(rel, buf_nblkno);
goto fail;
}
_hash_getnewbuf() calls _hash_pageinit() which calls PageInit(), which
memset(0)s the whole page. What does it even mean to check whether you
effectively have a cleanup lock after you zeroed out the page?
Reading the README and the comment above makes me wonder if this whole cleanup
lock business here is just cargo culting and could be dropped?
I think it is okay to not acquire a clean-up lock on the new bucket
page both in recovery and non-recovery paths. It is primarily required
on the old bucket page to avoid concurrent scans/inserts. As mentioned
in the comments and as per my memory serves, it is mainly for keeping
it consistent with old bucket locking.
It's not keeping it consistent with bucket locking to zero out a page before
getting a cleanup lock, hopefully at least. This code is just broken on
multiple fronts, and consistency isn't a defense.
Greetings,
Andres Freund
Hi,
On 2022-08-17 08:25:06 -0400, Robert Haas wrote:
Regarding the question of whether we need a cleanup lock on the new
bucket I am not really seeing the advantage of going down that path.
Simply fixing this code to take a cleanup lock instead of hoping that
it always gets one by accident is low risk and should fix the observed
problem. Getting rid of the cleanup lock will be more invasive and I'd
like to see some evidence that it's a necessary step before we take
the risk of breaking things.
Given that the cleanup locks in question are "taken" *after* re-initializing
the page, I'm doubtful that's a sane path forward. It seems quite likely to
mislead somebody to rely on it working as a cleanup lock in the future.
Greetings,
Andres Freund