Inval reliability, especially for inplace updates

Started by Noah Misch · almost 2 years ago · 41 messages · hackers
#1 Noah Misch
noah@leadboat.com

/messages/by-id/20240512232923.aa.nmisch@google.com wrote:

Separable, nontrivial things not fixed in the attached patch stack:

- Inplace update uses transactional CacheInvalidateHeapTuple(). ROLLBACK of
CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
still seen in inplace-inval.spec. CacheInvalidateRelmap() does this right.

I plan to fix that like CacheInvalidateRelmap(): send the inval immediately,
inside the critical section. Send it in heap_xlog_inplace(), too. The
interesting decision is how to handle RelationCacheInitFilePreInvalidate(),
which has an unlink_initfile() that can fail with e.g. EIO. Options:

1. Unlink during critical section, and accept that EIO becomes PANIC. Replay
may reach the same EIO, and the system won't reopen to connections until
the storage starts cooperating. Interaction with checkpoints is not ideal.
If we checkpoint and then crash between inplace XLogInsert() and inval,
we'd be relying on StartupXLOG() -> RelationCacheInitFileRemove(). That
uses elevel==LOG, so replay would neglect to PANIC on EIO.

2. Unlink before critical section, so normal xact abort suffices. This would
hold RelCacheInitLock and a buffer content lock at the same time. In
RecordTransactionCommit(), it would hold RelCacheInitLock and e.g. slru
locks at the same time.

The PANIC risk of (1) seems similar to the risk of PANIC at
RecordTransactionCommit() -> XLogFlush(), which hasn't been a problem. The
checkpoint-related risk bothers me more, and (1) generally makes it harder to
reason about checkpoint interactions. The lock order risk of (2) feels
tolerable. I'm leaning toward (2), but that might change. Other preferences?
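To make the rollback hazard concrete, here is a toy model (illustrative Python, not PostgreSQL code; the catalog and cache shapes are invented) contrasting a transactional inval, which ROLLBACK discards, with an immediate inval, for an inplace catalog change that survives the rollback:

```python
class Backend:
    """A backend with a private relcache built from the shared catalog."""
    def __init__(self, catalog):
        self.catalog = catalog
        self.cache = {}

    def receive_inval(self, rel):
        self.cache.pop(rel, None)      # discard the cached entry

    def lookup(self, rel):
        if rel not in self.cache:
            self.cache[rel] = dict(self.catalog[rel])  # rebuild from catalog
        return self.cache[rel]

def run(send_immediately):
    catalog = {"t": {"relhasindex": False}}
    xact_inval_queue = []              # transactional inval queue
    other = Backend(catalog)
    other.lookup("t")                  # other backend caches the old tuple

    # CREATE INDEX does an inplace update of pg_class.relhasindex;
    # that change is permanent even if the transaction later aborts.
    catalog["t"] = {"relhasindex": True}
    if send_immediately:
        other.receive_inval("t")       # like CacheInvalidateRelmap()
    else:
        xact_inval_queue.append("t")   # like CacheInvalidateHeapTuple()

    xact_inval_queue.clear()           # ROLLBACK discards the queued inval

    return other.lookup("t")["relhasindex"]

print(run(send_immediately=False))  # False: relhasindex=t lost (the bug)
print(run(send_immediately=True))   # True: immediate send fixes it
```

Since the inplace change persists across the abort, only the immediate send leaves other backends able to observe relhasindex=t.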

Another decision is what to do about LogLogicalInvalidations(). Currently,
inplace update invalidations do reach WAL via LogLogicalInvalidations() at the
next CCI. Options:

a. Within logical decoding, cease processing invalidations for inplace
updates. Inplace updates don't affect storage or interpretation of table
rows, so they don't affect logicalrep_write_tuple() outcomes. If they did,
invalidations wouldn't make it work. Decoding has no way to retrieve a
snapshot-appropriate version of the inplace-updated value.

b. Make heap_decode() of XLOG_HEAP_INPLACE recreate the invalidation. This
would be, essentially, cheap insurance against invalidations having a
benefit I missed in (a).

I plan to pick (a).

- AtEOXact_Inval(true) is outside the RecordTransactionCommit() critical
section, but it is critical. We must not commit transactional DDL without
other backends receiving an inval. (When the inplace inval becomes
nontransactional, it will face the same threat.)

This faces the same RelationCacheInitFilePreInvalidate() decision, and I think
the conclusion should be the same as for inplace update.

Thanks,
nm

#2 Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#1)
Re: Inval reliability, especially for inplace updates

On Wed, May 22, 2024 at 05:05:48PM -0700, Noah Misch wrote:

/messages/by-id/20240512232923.aa.nmisch@google.com wrote:

Separable, nontrivial things not fixed in the attached patch stack:

- Inplace update uses transactional CacheInvalidateHeapTuple(). ROLLBACK of
CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
still seen in inplace-inval.spec. CacheInvalidateRelmap() does this right.

I plan to fix that like CacheInvalidateRelmap(): send the inval immediately,
inside the critical section. Send it in heap_xlog_inplace(), too.

a. Within logical decoding, cease processing invalidations for inplace

I'm attaching the implementation. This applies atop the v3 patch stack from
/messages/by-id/20240614003549.c2.nmisch@google.com, but the threads are
mostly orthogonal and intended for independent review. Translating a tuple
into inval messages uses more infrastructure than relmapper, which needs just
a database ID. Hence, this ended up more like a miniature of inval.c's
participation in the transaction commit sequence.

I waffled on whether to back-patch inplace150-inval-durability-atcommit. The
consequences of that bug are plenty bad, but reaching them requires an error
between TransactionIdCommitTree() and AtEOXact_Inval(). I've not heard
reports of that, and I don't have a recipe for making it happen on demand.
For now, I'm leaning toward back-patch. The main risk would be me overlooking
an LWLock deadlock scenario reachable from the new, earlier RelCacheInitLock
timing. Alternatives for RelCacheInitLock:

- RelCacheInitLock before PreCommit_Notify(), because notify concurrency
matters more than init file concurrency. I chose this.
- RelCacheInitLock after PreCommit_Notify(), because PreCommit_Notify() uses a
heavyweight lock, giving it less risk of undetected deadlock.
- Replace RelCacheInitLock with a heavyweight lock, and keep it before
PreCommit_Notify().
- Fold PreCommit_Inval() back into AtCommit_Inval(), accepting that EIO in
unlink_initfile() will PANIC.

Opinions on that?

The patch changes xl_heap_inplace of XLOG_HEAP_INPLACE. For back branches, we
could choose between:

- Same change, no WAL version bump. Standby must update before primary. This
is best long-term, but the transition is more disruptive. I'm leaning
toward this one, but the second option isn't bad:

- heap_xlog_inplace() could set the shared-inval-queue overflow signal on
every backend. This is more wasteful, but inplace updates might be rare
enough (~once per VACUUM) to make it tolerable.

- Use LogStandbyInvalidations() just after XLOG_HEAP_INPLACE. This isn't
correct if one ends recovery between the two records, but you'd need to be
unlucky to notice. Noticing would need a procedure like the following. A
hot standby backend populates a relcache entry, then does DDL on the rel
after recovery ends.

Future cleanup work could eliminate LogStandbyInvalidations() and the case of
!markXidCommitted && nmsgs != 0. Currently, the src/test/regress suite still
reaches that case:

- AlterDomainDropConstraint() queues an inval even if !found; it can stop
that.

- ON COMMIT DELETE ROWS nontransactionally rebuilds an index, which sends a
relcache inval. The point of that inval is, I think, to force access
methods like btree and hash to reload the metapage copy that they store in
rd_amcache. Since no assigned XID implies no changes to the temp index, the
no-XID case could simply skip the index rebuild. (temp.sql reaches this
with a read-only transaction that selects from an ON COMMIT DELETE ROWS
table. Realistic usage will tend not to do that.) ON COMMIT DELETE ROWS
has another preexisting problem for indexes, mentioned in a code comment.

Thanks,
nm

Attachments:

inplace160-inval-durability-inplace-v1.patch (text/plain; charset=us-ascii) +362-145
inplace130-AtEOXact_RelationCache-comments-v1.patch (text/plain; charset=us-ascii) +10-23
inplace140-heapam_xlog-comment-v1.patch (text/plain; charset=us-ascii) +0-1
inplace150-inval-durability-atcommit-v1.patch (text/plain; charset=us-ascii) +129-71
#3 Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#2)
Re: Inval reliability, especially for inplace updates

On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote:

I'm attaching the implementation.

I'm withdrawing inplace150-inval-durability-atcommit-v1.patch, having found
two major problems so far:

1. It sends transactional invalidation messages before
ProcArrayEndTransaction(), so other backends can read stale data.

2. It didn't make the equivalent changes for COMMIT PREPARED.

#4 Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#2)
Re: Inval reliability, especially for inplace updates

On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote:

On Wed, May 22, 2024 at 05:05:48PM -0700, Noah Misch wrote:

/messages/by-id/20240512232923.aa.nmisch@google.com wrote:

Separable, nontrivial things not fixed in the attached patch stack:

- Inplace update uses transactional CacheInvalidateHeapTuple(). ROLLBACK of
CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
still seen in inplace-inval.spec. CacheInvalidateRelmap() does this right.

I plan to fix that like CacheInvalidateRelmap(): send the inval immediately,
inside the critical section. Send it in heap_xlog_inplace(), too.

a. Within logical decoding, cease processing invalidations for inplace

I'm attaching the implementation. This applies atop the v3 patch stack from
/messages/by-id/20240614003549.c2.nmisch@google.com, but the threads are
mostly orthogonal and intended for independent review. Translating a tuple
into inval messages uses more infrastructure than relmapper, which needs just
a database ID. Hence, this ended up more like a miniature of inval.c's
participation in the transaction commit sequence.

I waffled on whether to back-patch inplace150-inval-durability-atcommit

That inplace150 patch turned out to be unnecessary. Contrary to the
"noncritical resource releasing" comment some lines above
AtEOXact_Inval(true), the actual behavior is already to promote ERROR to
PANIC. An ERROR just before or after sending invals becomes PANIC, "cannot
abort transaction %u, it was already committed". Since
inplace130-AtEOXact_RelationCache-comments existed to clear the way for
inplace150, inplace130 also becomes unnecessary. I've removed both from the
attached v2 patch stack.

The patch changes xl_heap_inplace of XLOG_HEAP_INPLACE. For back branches, we
could choose between:

- Same change, no WAL version bump. Standby must update before primary. This
is best long-term, but the transition is more disruptive. I'm leaning
toward this one, but the second option isn't bad:

- heap_xlog_inplace() could set the shared-inval-queue overflow signal on
every backend. This is more wasteful, but inplace updates might be rare
enough (~once per VACUUM) to make it tolerable.

- Use LogStandbyInvalidations() just after XLOG_HEAP_INPLACE. This isn't
correct if one ends recovery between the two records, but you'd need to be
unlucky to notice. Noticing would need a procedure like the following. A
hot standby backend populates a relcache entry, then does DDL on the rel
after recovery ends.

That still holds.

Attachments:

inplace140-heapam_xlog-comment-v2.patch (text/plain; charset=us-ascii) +0-1
inplace160-inval-durability-inplace-v2.patch (text/plain; charset=us-ascii) +378-151
#5 Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#4)
Re: Inval reliability, especially for inplace updates

Hi,

On 2024-06-17 16:58:54 -0700, Noah Misch wrote:

On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote:

On Wed, May 22, 2024 at 05:05:48PM -0700, Noah Misch wrote:

/messages/by-id/20240512232923.aa.nmisch@google.com wrote:

Separable, nontrivial things not fixed in the attached patch stack:

- Inplace update uses transactional CacheInvalidateHeapTuple(). ROLLBACK of
CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
still seen in inplace-inval.spec. CacheInvalidateRelmap() does this right.

I plan to fix that like CacheInvalidateRelmap(): send the inval immediately,
inside the critical section. Send it in heap_xlog_inplace(), too.

I'm worried this might cause its own set of bugs, e.g. if there are any places
that, possibly accidentally, rely on the invalidation from the inplace update
to also cover separate changes.

Have you considered instead submitting these invalidations during abort as
well?

a. Within logical decoding, cease processing invalidations for inplace

I'm attaching the implementation. This applies atop the v3 patch stack from
/messages/by-id/20240614003549.c2.nmisch@google.com, but the threads are
mostly orthogonal and intended for independent review. Translating a tuple
into inval messages uses more infrastructure than relmapper, which needs just
a database ID. Hence, this ended up more like a miniature of inval.c's
participation in the transaction commit sequence.

I waffled on whether to back-patch inplace150-inval-durability-atcommit

That inplace150 patch turned out to be unnecessary. Contrary to the
"noncritical resource releasing" comment some lines above
AtEOXact_Inval(true), the actual behavior is already to promote ERROR to
PANIC. An ERROR just before or after sending invals becomes PANIC, "cannot
abort transaction %u, it was already committed".

Relying on that, instead of explicit critical sections, seems fragile to me.
IIRC some of the behaviour around errors around transaction commit/abort has
changed a bunch of times. Tying correctness into something that could be
changed for unrelated reasons doesn't seem great.

I'm not sure it holds true even today - what if the transaction didn't have an
xid? Then RecordTransactionAbort() wouldn't trigger
"cannot abort transaction %u, it was already committed"
I think?

- Same change, no WAL version bump. Standby must update before primary. This
is best long-term, but the transition is more disruptive. I'm leaning
toward this one, but the second option isn't bad:

Hm. The inplace record doesn't use the length of the "main data" record
segment for anything, from what I can tell. If records by an updated primary
were replayed by an old standby, it'd just ignore the additional data, afaict?

I think with the code as-is, the situation with an updated standby replaying
an old primary's record would actually be worse - it'd afaict just assume the
now-longer record contained valid fields, despite those just pointing into
uninitialized memory. I think the replay routine would have to check the
length of the main data and execute the invalidation conditionally.
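The conditional-replay idea can be sketched like this (illustrative Python; the record layout, field widths, and inval-message encoding are invented, not the real xl_heap_inplace format): the redo routine unpacks only the base struct unconditionally and executes the invalidation only when the main data is long enough to carry it.

```python
import struct

BASE_FMT = "<H"                      # hypothetical base struct: offset number only
BASE_LEN = struct.calcsize(BASE_FMT)

def redo_inplace(main_data):
    """Tolerate both old records (offnum only) and new records with invals."""
    (offnum,) = struct.unpack_from(BASE_FMT, main_data)
    invals = []
    if len(main_data) > BASE_LEN:    # new-format record: inval payload follows
        (nmsgs,) = struct.unpack_from("<H", main_data, BASE_LEN)
        off = BASE_LEN + 2
        for _ in range(nmsgs):
            (msg,) = struct.unpack_from("<I", main_data, off)
            invals.append(msg)
            off += 4
    return offnum, invals

old_rec = struct.pack("<H", 7)                    # old primary's record
new_rec = struct.pack("<HHII", 7, 2, 1259, 2662)  # new record plus 2 invals
print(redo_inplace(old_rec))   # (7, [])
print(redo_inplace(new_rec))   # (7, [1259, 2662])
```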

- heap_xlog_inplace() could set the shared-inval-queue overflow signal on
every backend. This is more wasteful, but inplace updates might be rare
enough (~once per VACUUM) to make it tolerable.

We already set that surprisingly frequently, as
a) The size of the sinval queue is small
b) If a backend is busy, it does not process catchup interrupts
(i.e. executing queries, waiting for a lock prevents processing)
c) There's no deduplication of invals, we often end up sending the same inval
over and over.

So I suspect this might not be too bad, compared to the current badness.

At least for core code. I guess there could be extension code triggering
inplace updates more frequently? But I'd hope they'd do it not on catalog
tables... Except that we wouldn't know that that's the case during replay,
it's not contained in the record.
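A toy model of that overflow behavior (illustrative Python; nothing like the real shared-memory sinval queue): a reader that falls more than the queue size behind is marked "reset" and must discard all its caches, which is also what setting the overflow signal on every backend would force.

```python
QUEUE_SIZE = 4

class SinvalQueue:
    def __init__(self):
        self.head = 0                 # next write position (monotonic)
        self.readers = {}             # backend -> next read position, or "reset"

    def attach(self, backend):
        self.readers[backend] = self.head

    def send(self, msg):
        self.head += 1
        # any reader now more than QUEUE_SIZE behind has overflowed
        for b, pos in self.readers.items():
            if pos != "reset" and self.head - pos > QUEUE_SIZE:
                self.readers[b] = "reset"

    def catch_up(self, backend):
        pos = self.readers[backend]
        self.readers[backend] = self.head
        return "reset" if pos == "reset" else self.head - pos  # msgs to apply

q = SinvalQueue()
q.attach("idle")
for _ in range(3):
    q.send("inval")
print(q.catch_up("idle"))   # 3: normal catchup, applies 3 messages

q.attach("busy")
for _ in range(5):          # a busy backend never processes catchup interrupts
    q.send("inval")
print(q.catch_up("busy"))   # 'reset': must invalidate everything
```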

- Use LogStandbyInvalidations() just after XLOG_HEAP_INPLACE. This isn't
correct if one ends recovery between the two records, but you'd need to be
unlucky to notice. Noticing would need a procedure like the following. A
hot standby backend populates a relcache entry, then does DDL on the rel
after recovery ends.

Hm. The problematic case presumably involves an access exclusive lock? If so,
could we do LogStandbyInvalidations() *before* logging the WAL record for the
inplace update? The invalidations can't be processed by other backends until
the exclusive lock has been released, which should avoid the race?

Greetings,

Andres Freund

#6 Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#5)
Re: Inval reliability, especially for inplace updates

On Mon, Jun 17, 2024 at 06:57:30PM -0700, Andres Freund wrote:

On 2024-06-17 16:58:54 -0700, Noah Misch wrote:

On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote:

On Wed, May 22, 2024 at 05:05:48PM -0700, Noah Misch wrote:

/messages/by-id/20240512232923.aa.nmisch@google.com wrote:

Separable, nontrivial things not fixed in the attached patch stack:

- Inplace update uses transactional CacheInvalidateHeapTuple(). ROLLBACK of
CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
still seen in inplace-inval.spec. CacheInvalidateRelmap() does this right.

I plan to fix that like CacheInvalidateRelmap(): send the inval immediately,
inside the critical section. Send it in heap_xlog_inplace(), too.

I'm worried this might cause its own set of bugs, e.g. if there are any places
that, possibly accidentally, rely on the invalidation from the inplace update
to also cover separate changes.

Good point. I do have index_update_stats() still doing an ideally-superfluous
relcache update for that reason. Taking that further, it would be cheap
insurance to have the inplace update do a transactional inval in addition to
its immediate inval. Future master-only work could remove the transactional
one. How about that?

Have you considered instead submitting these invalidations during abort as
well?

I had not. Hmmm. If the lock protocol in README.tuplock (after patch
inplace120) told SearchSysCacheLocked1() to do systable scans instead of
syscache reads, that could work. Would need to ensure a PANIC if transaction
abort doesn't reach the inval submission. Overall, it would be harder to
reason about the state of caches, but I suspect the patch would be smaller.
How should we choose between those strategies?

a. Within logical decoding, cease processing invalidations for inplace

I'm attaching the implementation. This applies atop the v3 patch stack from
/messages/by-id/20240614003549.c2.nmisch@google.com, but the threads are
mostly orthogonal and intended for independent review. Translating a tuple
into inval messages uses more infrastructure than relmapper, which needs just
a database ID. Hence, this ended up more like a miniature of inval.c's
participation in the transaction commit sequence.

I waffled on whether to back-patch inplace150-inval-durability-atcommit

That inplace150 patch turned out to be unnecessary. Contrary to the
"noncritical resource releasing" comment some lines above
AtEOXact_Inval(true), the actual behavior is already to promote ERROR to
PANIC. An ERROR just before or after sending invals becomes PANIC, "cannot
abort transaction %u, it was already committed".

Relying on that, instead of explicit critical sections, seems fragile to me.
IIRC some of the behaviour around errors around transaction commit/abort has
changed a bunch of times. Tying correctness into something that could be
changed for unrelated reasons doesn't seem great.

Fair enough. It could still be a good idea for master, but given I missed a
bug in inplace150-inval-durability-atcommit-v1.patch far worse than the ones
$SUBJECT fixes, let's not risk it in back branches.

I'm not sure it holds true even today - what if the transaction didn't have an
xid? Then RecordTransactionAbort() wouldn't trigger
"cannot abort transaction %u, it was already committed"
I think?

I think that's right. As the inplace160-inval-durability-inplace-v2.patch
edits to xact.c say, the concept of invals in XID-less transactions is buggy
at its core. Fortunately, after that patch, we use them only for two things
that could themselves stop with something roughly as simple as the attached.

- Same change, no WAL version bump. Standby must update before primary. This
is best long-term, but the transition is more disruptive. I'm leaning
toward this one, but the second option isn't bad:

Hm. The inplace record doesn't use the length of the "main data" record
segment for anything, from what I can tell. If records by an updated primary
were replayed by an old standby, it'd just ignore the additional data, afaict?

Agreed, but ...

I think with the code as-is, the situation with an updated standby replaying
an old primary's record would actually be worse - it'd afaict just assume the
now-longer record contained valid fields, despite those just pointing into
uninitialized memory. I think the replay routine would have to check the
length of the main data and execute the invalidation conditionally.

I anticipated back branches supporting a new XLOG_HEAP_INPLACE_WITH_INVAL
alongside the old XLOG_HEAP_INPLACE. Updated standbys would run both fine,
and old binaries consuming new WAL would PANIC, "heap_redo: unknown op code".
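A sketch of that dispatch (illustrative Python; the opcode values are made up): an updated binary handles both record types, while an old binary hits the unknown-opcode PANIC on the new one.

```python
XLOG_HEAP_INPLACE = 0x60             # hypothetical opcode values
XLOG_HEAP_INPLACE_WITH_INVAL = 0x70

def heap_redo(info, updated_binary):
    handlers = {XLOG_HEAP_INPLACE: "replay inplace"}
    if updated_binary:
        handlers[XLOG_HEAP_INPLACE_WITH_INVAL] = "replay inplace + send invals"
    if info not in handlers:
        raise SystemExit("PANIC: heap_redo: unknown op code %u" % info)
    return handlers[info]

print(heap_redo(XLOG_HEAP_INPLACE_WITH_INVAL, updated_binary=True))
# An old binary consuming new WAL would instead reach the PANIC:
# heap_redo(XLOG_HEAP_INPLACE_WITH_INVAL, updated_binary=False)
```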

- heap_xlog_inplace() could set the shared-inval-queue overflow signal on
every backend. This is more wasteful, but inplace updates might be rare
enough (~once per VACUUM) to make it tolerable.

We already set that surprisingly frequently, as
a) The size of the sinval queue is small
b) If a backend is busy, it does not process catchup interrupts
(i.e. executing queries, waiting for a lock prevents processing)
c) There's no deduplication of invals, we often end up sending the same inval
over and over.

So I suspect this might not be too bad, compared to the current badness.

That is good. We might be able to do the overflow signal once at end of
recovery, like RelationCacheInitFileRemove() does for the init file. That's
mildly harder to reason about, but it would be cheaper. Hmmm.

At least for core code. I guess there could be extension code triggering
inplace updates more frequently? But I'd hope they'd do it not on catalog
tables... Except that we wouldn't know that that's the case during replay,
it's not contained in the record.

For what it's worth, from a grep of PGXN, only citus does inplace updates.

- Use LogStandbyInvalidations() just after XLOG_HEAP_INPLACE. This isn't
correct if one ends recovery between the two records, but you'd need to be
unlucky to notice. Noticing would need a procedure like the following. A
hot standby backend populates a relcache entry, then does DDL on the rel
after recovery ends.

Hm. The problematic case presumably involves an access exclusive lock? If so,
could we do LogStandbyInvalidations() *before* logging the WAL record for the
inplace update? The invalidations can't be processed by other backends until
the exclusive lock has been released, which should avoid the race?

A lock forces a backend to drain the inval queue before using the locked
object, but it doesn't stop the backend from draining the queue and
repopulating cache entries earlier. For example, pg_describe_object() can
query many syscaches without locking underlying objects. Hence, the inval
system relies on the buffer change getting fully visible to catcache queries
before the sinval message enters the shared queue.
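That ordering requirement can be modeled as follows (illustrative Python, not PostgreSQL code): only the change-before-message order prevents an unlocked reader from caching stale data with no inval left to evict it.

```python
def run(change_first):
    catalog = {"v": "old"}
    queue = []                 # shared sinval queue
    cache = {}                 # another backend's private cache

    def drain_and_read():
        # An unlocked backend may drain invals and repopulate its cache
        # at any moment (e.g. pg_describe_object()-style syscache reads).
        for rel in queue:
            cache.pop(rel, None)
        queue.clear()
        cache.setdefault("v", catalog["v"])
        return cache["v"]

    if change_first:
        catalog["v"] = "new"   # buffer change fully visible...
        queue.append("v")      # ...before the sinval message is queued
    else:
        queue.append("v")      # wrong order: message precedes the change
        drain_and_read()       # backend drains, then caches stale "old"
        catalog["v"] = "new"

    return drain_and_read()

print(run(change_first=True))   # 'new'
print(run(change_first=False))  # 'old': stale entry survives, no inval left
```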

Thanks,
nm

#7 Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#6)
Re: Inval reliability, especially for inplace updates

On Tue, Jun 18, 2024 at 08:23:49AM -0700, Noah Misch wrote:

On Mon, Jun 17, 2024 at 06:57:30PM -0700, Andres Freund wrote:

On 2024-06-17 16:58:54 -0700, Noah Misch wrote:

On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote:

On Wed, May 22, 2024 at 05:05:48PM -0700, Noah Misch wrote:

/messages/by-id/20240512232923.aa.nmisch@google.com wrote:

Separable, nontrivial things not fixed in the attached patch stack:

- Inplace update uses transactional CacheInvalidateHeapTuple(). ROLLBACK of
CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
still seen in inplace-inval.spec. CacheInvalidateRelmap() does this right.

I plan to fix that like CacheInvalidateRelmap(): send the inval immediately,
inside the critical section. Send it in heap_xlog_inplace(), too.

I'm worried this might cause its own set of bugs, e.g. if there are any places
that, possibly accidentally, rely on the invalidation from the inplace update
to also cover separate changes.

Good point. I do have index_update_stats() still doing an ideally-superfluous
relcache update for that reason. Taking that further, it would be cheap
insurance to have the inplace update do a transactional inval in addition to
its immediate inval. Future master-only work could remove the transactional
one. How about that?

Have you considered instead submitting these invalidations during abort as
well?

I had not. Hmmm. If the lock protocol in README.tuplock (after patch
inplace120) told SearchSysCacheLocked1() to do systable scans instead of
syscache reads, that could work. Would need to ensure a PANIC if transaction
abort doesn't reach the inval submission. Overall, it would be harder to
reason about the state of caches, but I suspect the patch would be smaller.
How should we choose between those strategies?

a. Within logical decoding, cease processing invalidations for inplace

I'm attaching the implementation. This applies atop the v3 patch stack from
/messages/by-id/20240614003549.c2.nmisch@google.com, but the threads are
mostly orthogonal and intended for independent review. Translating a tuple
into inval messages uses more infrastructure than relmapper, which needs just
a database ID. Hence, this ended up more like a miniature of inval.c's
participation in the transaction commit sequence.

I waffled on whether to back-patch inplace150-inval-durability-atcommit

That inplace150 patch turned out to be unnecessary. Contrary to the
"noncritical resource releasing" comment some lines above
AtEOXact_Inval(true), the actual behavior is already to promote ERROR to
PANIC. An ERROR just before or after sending invals becomes PANIC, "cannot
abort transaction %u, it was already committed".

Relying on that, instead of explicit critical sections, seems fragile to me.
IIRC some of the behaviour around errors around transaction commit/abort has
changed a bunch of times. Tying correctness into something that could be
changed for unrelated reasons doesn't seem great.

Fair enough. It could still be a good idea for master, but given I missed a
bug in inplace150-inval-durability-atcommit-v1.patch far worse than the ones
$SUBJECT fixes, let's not risk it in back branches.

I'm not sure it holds true even today - what if the transaction didn't have an
xid? Then RecordTransactionAbort() wouldn't trigger
"cannot abort transaction %u, it was already committed"
I think?

I think that's right. As the inplace160-inval-durability-inplace-v2.patch
edits to xact.c say, the concept of invals in XID-less transactions is buggy
at its core. Fortunately, after that patch, we use them only for two things
that could themselves stop with something roughly as simple as the attached.

Now actually attached.


- Same change, no WAL version bump. Standby must update before primary. This
is best long-term, but the transition is more disruptive. I'm leaning
toward this one, but the second option isn't bad:

Hm. The inplace record doesn't use the length of the "main data" record
segment for anything, from what I can tell. If records by an updated primary
were replayed by an old standby, it'd just ignore the additional data, afaict?

Agreed, but ...

I think with the code as-is, the situation with an updated standby replaying
an old primary's record would actually be worse - it'd afaict just assume the
now-longer record contained valid fields, despite those just pointing into
uninitialized memory. I think the replay routine would have to check the
length of the main data and execute the invalidation conditionally.

I anticipated back branches supporting a new XLOG_HEAP_INPLACE_WITH_INVAL
alongside the old XLOG_HEAP_INPLACE. Updated standbys would run both fine,
and old binaries consuming new WAL would PANIC, "heap_redo: unknown op code".

- heap_xlog_inplace() could set the shared-inval-queue overflow signal on
every backend. This is more wasteful, but inplace updates might be rare
enough (~once per VACUUM) to make it tolerable.

We already set that surprisingly frequently, as
a) The size of the sinval queue is small
b) If a backend is busy, it does not process catchup interrupts
(i.e. executing queries, waiting for a lock prevents processing)
c) There's no deduplication of invals, we often end up sending the same inval
over and over.

So I suspect this might not be too bad, compared to the current badness.

That is good. We might be able to do the overflow signal once at end of
recovery, like RelationCacheInitFileRemove() does for the init file. That's
mildly harder to reason about, but it would be cheaper. Hmmm.

At least for core code. I guess there could be extension code triggering
inplace updates more frequently? But I'd hope they'd do it not on catalog
tables... Except that we wouldn't know that that's the case during replay,
it's not contained in the record.

For what it's worth, from a grep of PGXN, only citus does inplace updates.

- Use LogStandbyInvalidations() just after XLOG_HEAP_INPLACE. This isn't
correct if one ends recovery between the two records, but you'd need to be
unlucky to notice. Noticing would need a procedure like the following. A
hot standby backend populates a relcache entry, then does DDL on the rel
after recovery ends.

Hm. The problematic case presumably involves an access exclusive lock? If so,
could we do LogStandbyInvalidations() *before* logging the WAL record for the
inplace update? The invalidations can't be processed by other backends until
the exclusive lock has been released, which should avoid the race?

A lock forces a backend to drain the inval queue before using the locked
object, but it doesn't stop the backend from draining the queue and
repopulating cache entries earlier. For example, pg_describe_object() can
query many syscaches without locking underlying objects. Hence, the inval
system relies on the buffer change getting fully visible to catcache queries
before the sinval message enters the shared queue.

Thanks,
nm

Attachments:

inval-requires-xid-v0.patch (text/plain; charset=us-ascii) +7-6
#8 Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#4)
Re: Inval reliability, especially for inplace updates

On Mon, Jun 17, 2024 at 04:58:54PM -0700, Noah Misch wrote:

attached v2 patch stack.

Rebased. This applies on top of three patches from
/messages/by-id/20240629024251.03.nmisch@google.com. I'm attaching those
to placate cfbot, but this thread is for review of the last patch only.

Attachments:

inplace090-LOCKTAG_TUPLE-eoxact-v5.patch (text/plain; charset=us-ascii) +5-0
inplace110-successors-v5.patch (text/plain; charset=us-ascii) +832-123
inplace120-locktag-v5.patch (text/plain; charset=us-ascii) +409-52
inplace160-inval-durability-inplace-v3.patch (text/plain; charset=us-ascii) +386-153
#9 Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#6)
Re: Inval reliability, especially for inplace updates

On Tue, Jun 18, 2024 at 08:23:49AM -0700, Noah Misch wrote:

On Mon, Jun 17, 2024 at 06:57:30PM -0700, Andres Freund wrote:

On 2024-06-17 16:58:54 -0700, Noah Misch wrote:

That inplace150 patch turned out to be unnecessary. Contrary to the
"noncritical resource releasing" comment some lines above
AtEOXact_Inval(true), the actual behavior is already to promote ERROR to
PANIC. An ERROR just before or after sending invals becomes PANIC, "cannot
abort transaction %u, it was already committed".

Relying on that, instead of explicit critical sections, seems fragile to me.
IIRC some of the behaviour around errors around transaction commit/abort has
changed a bunch of times. Tying correctness into something that could be
changed for unrelated reasons doesn't seem great.

Fair enough. It could still be a good idea for master, but given I missed a
bug in inplace150-inval-durability-atcommit-v1.patch far worse than the ones
$SUBJECT fixes, let's not risk it in back branches.

What are your thoughts on whether a change to explicit critical sections
should be master-only vs. back-patched? I have a feeling your comment pointed
to something I'm still missing, but I don't know where to look next.

#10 Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#6)
Re: Inval reliability, especially for inplace updates

On Tue, Jun 18, 2024 at 08:23:49AM -0700, Noah Misch wrote:

On Mon, Jun 17, 2024 at 06:57:30PM -0700, Andres Freund wrote:

On 2024-06-17 16:58:54 -0700, Noah Misch wrote:

On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote:

On Wed, May 22, 2024 at 05:05:48PM -0700, Noah Misch wrote:

/messages/by-id/20240512232923.aa.nmisch@google.com wrote:

Separable, nontrivial things not fixed in the attached patch stack:

- Inplace update uses transactional CacheInvalidateHeapTuple(). ROLLBACK of
CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
still seen in inplace-inval.spec. CacheInvalidateRelmap() does this right.

I plan to fix that like CacheInvalidateRelmap(): send the inval immediately,
inside the critical section. Send it in heap_xlog_inplace(), too.

I'm worried this might cause its own set of bugs, e.g. if there are any places
that, possibly accidentally, rely on the invalidation from the inplace update
to also cover separate changes.

Good point. I do have index_update_stats() still doing an ideally-superfluous
relcache update for that reason. Taking that further, it would be cheap
insurance to have the inplace update do a transactional inval in addition to
its immediate inval. Future master-only work could remove the transactional
one. How about that?

Restoring the transactional inval seemed good to me, so I've rebased and
included that. This applies on top of three patches from
/messages/by-id/20240822073200.4f.nmisch@google.com. I'm attaching those
to placate cfbot, but this thread is for review of the last patch only.

Attachments:

inplace090-LOCKTAG_TUPLE-eoxact-v9.patch (text/plain; +10 -0)
inplace110-successors-v9.patch (text/plain; +802 -154)
inplace120-locktag-v9.patch (text/plain; +494 -57)
inplace160-inval-durability-inplace-v4.patch (text/plain; +392 -143)
#11Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#10)
Re: Inval reliability, especially for inplace updates

Rebased.

Attachments:

inplace160-inval-durability-inplace-v5.patch (text/plain; +392 -143)
#12Nitin Motiani
nitinmotiani@google.com
In reply to: Noah Misch (#11)
Re: Inval reliability, especially for inplace updates

On Sat, Oct 12, 2024 at 5:47 PM Noah Misch <noah@leadboat.com> wrote:

Rebased.

Hi,

I have a couple of questions :

1. In heap_inplace_update_and_unlock, currently both buffer and tuple
are unlocked outside the critical section. Why do we have to move the
buffer unlock within the critical section here? My guess is that it
needs to be unlocked for the inplace invals to be processed. But what
is the reasoning behind that?

2. Is there any benefit in CacheInvalidateHeapTupleCommon taking the
prepare_callback argument? Wouldn't it be simpler to just pass an
InvalidationInfo* to the function?

Also is inval-requires-xid-v0.patch planned to be fixed up to inplace160?

Thanks

#13Noah Misch
noah@leadboat.com
In reply to: Nitin Motiani (#12)
Re: Inval reliability, especially for inplace updates

On Sat, Oct 12, 2024 at 06:05:06PM +0530, Nitin Motiani wrote:

1. In heap_inplace_update_and_unlock, currently both buffer and tuple
are unlocked outside the critical section. Why do we have to move the
buffer unlock within the critical section here? My guess is that it
needs to be unlocked for the inplace invals to be processed. But what
is the reasoning behind that?

AtInplace_Inval() acquires SInvalWriteLock. There are two reasons to want to
release the buffer lock before acquiring SInvalWriteLock:

1. Otherwise, we'd need to maintain the invariant that no other part of the
system tries to lock the buffer while holding SInvalWriteLock. (That would
cause an undetected deadlock.)

2. Concurrency is better if we release a no-longer-needed LWLock before doing
something time-consuming, like acquiring another LWLock potentially is.

Inplace invals do need to happen in the critical section, because we've
already written the change to shared buffers, making it the new authoritative
value. If we fail to invalidate, other backends may continue operating with
stale caches.

2. Is there any benefit in CacheInvalidateHeapTupleCommon taking the
prepare_callback argument? Wouldn't it be simpler to just pass an
InvalidationInfo* to the function?

CacheInvalidateHeapTupleCommon() has three conditions that cause it to return
without invoking the callback. Every heap_update() calls
CacheInvalidateHeapTuple(). In typical performance-critical systems, non-DDL
changes dwarf DDL. Hence, the overwhelming majority of heap_update() calls
involve !IsCatalogRelation(). I wouldn't want to allocate InvalidationInfo in
DDL-free transactions. To pass in InvalidationInfo*, I suppose I'd move those
three conditions to a function and make the callers look like:

CacheInvalidateHeapTuple(Relation relation,
                         HeapTuple tuple,
                         HeapTuple newtuple)
{
    if (NeedsInvalidateHeapTuple(relation))
        CacheInvalidateHeapTupleCommon(relation, tuple, newtuple,
                                       PrepareInvalidationState());
}

I don't have a strong preference between that and the callback way.

Also is inval-requires-xid-v0.patch planned to be fixed up to inplace160?

I figure I'll pursue that on a different thread, after inplace160 and
inplace180. If there's cause to pursue it earlier, let me know.

Thanks,
nm

#14Nitin Motiani
nitinmotiani@google.com
In reply to: Noah Misch (#13)
Re: Inval reliability, especially for inplace updates

On Sun, Oct 13, 2024 at 6:15 AM Noah Misch <noah@leadboat.com> wrote:

On Sat, Oct 12, 2024 at 06:05:06PM +0530, Nitin Motiani wrote:

1. In heap_inplace_update_and_unlock, currently both buffer and tuple
are unlocked outside the critical section. Why do we have to move the
buffer unlock within the critical section here? My guess is that it
needs to be unlocked for the inplace invals to be processed. But what
is the reasoning behind that?

AtInplace_Inval() acquires SInvalWriteLock. There are two reasons to want to
release the buffer lock before acquiring SInvalWriteLock:

1. Otherwise, we'd need to maintain the invariant that no other part of the
system tries to lock the buffer while holding SInvalWriteLock. (That would
cause an undetected deadlock.)

2. Concurrency is better if we release a no-longer-needed LWLock before doing
something time-consuming, like acquiring another LWLock potentially is.

Inplace invals do need to happen in the critical section, because we've
already written the change to shared buffers, making it the new authoritative
value. If we fail to invalidate, other backends may continue operating with
stale caches.

Thanks for the clarification.

2. Is there any benefit in CacheInvalidateHeapTupleCommon taking the
prepare_callback argument? Wouldn't it be simpler to just pass an
InvalidationInfo* to the function?

CacheInvalidateHeapTupleCommon() has three conditions that cause it to return
without invoking the callback. Every heap_update() calls
CacheInvalidateHeapTuple(). In typical performance-critical systems, non-DDL
changes dwarf DDL. Hence, the overwhelming majority of heap_update() calls
involve !IsCatalogRelation(). I wouldn't want to allocate InvalidationInfo in
DDL-free transactions. To pass in InvalidationInfo*, I suppose I'd move those
three conditions to a function and make the callers look like:

CacheInvalidateHeapTuple(Relation relation,
                         HeapTuple tuple,
                         HeapTuple newtuple)
{
    if (NeedsInvalidateHeapTuple(relation))
        CacheInvalidateHeapTupleCommon(relation, tuple, newtuple,
                                       PrepareInvalidationState());
}

I don't have a strong preference between that and the callback way.

Thanks. I would have probably done it using the
NeedsInvalidateHeapTuple. But I don't have a strong enough preference
to change it from the callback way. So the current approach seems
good.

Also is inval-requires-xid-v0.patch planned to be fixed up to inplace160?

I figure I'll pursue that on a different thread, after inplace160 and
inplace180. If there's cause to pursue it earlier, let me know.

Sure. Can be done in a different thread.

Thanks,
Nitin Motiani
Google

#15Nitin Motiani
nitinmotiani@google.com
In reply to: Nitin Motiani (#14)
Re: Inval reliability, especially for inplace updates

On Mon, Oct 14, 2024 at 3:15 PM Nitin Motiani <nitinmotiani@google.com> wrote:

On Sun, Oct 13, 2024 at 6:15 AM Noah Misch <noah@leadboat.com> wrote:

On Sat, Oct 12, 2024 at 06:05:06PM +0530, Nitin Motiani wrote:

1. In heap_inplace_update_and_unlock, currently both buffer and tuple
are unlocked outside the critical section. Why do we have to move the
buffer unlock within the critical section here? My guess is that it
needs to be unlocked for the inplace invals to be processed. But what
is the reasoning behind that?

AtInplace_Inval() acquires SInvalWriteLock. There are two reasons to want to
release the buffer lock before acquiring SInvalWriteLock:

1. Otherwise, we'd need to maintain the invariant that no other part of the
system tries to lock the buffer while holding SInvalWriteLock. (That would
cause an undetected deadlock.)

2. Concurrency is better if we release a no-longer-needed LWLock before doing
something time-consuming, like acquiring another LWLock potentially is.

Inplace invals do need to happen in the critical section, because we've
already written the change to shared buffers, making it the new authoritative
value. If we fail to invalidate, other backends may continue operating with
stale caches.

Thanks for the clarification.

2. Is there any benefit in CacheInvalidateHeapTupleCommon taking the
prepare_callback argument? Wouldn't it be simpler to just pass an
InvalidationInfo* to the function?

CacheInvalidateHeapTupleCommon() has three conditions that cause it to return
without invoking the callback. Every heap_update() calls
CacheInvalidateHeapTuple(). In typical performance-critical systems, non-DDL
changes dwarf DDL. Hence, the overwhelming majority of heap_update() calls
involve !IsCatalogRelation(). I wouldn't want to allocate InvalidationInfo in
DDL-free transactions. To pass in InvalidationInfo*, I suppose I'd move those
three conditions to a function and make the callers look like:

CacheInvalidateHeapTuple(Relation relation,
                         HeapTuple tuple,
                         HeapTuple newtuple)
{
    if (NeedsInvalidateHeapTuple(relation))
        CacheInvalidateHeapTupleCommon(relation, tuple, newtuple,
                                       PrepareInvalidationState());
}

I don't have a strong preference between that and the callback way.

Thanks. I would have probably done it using the
NeedsInvalidateHeapTuple. But I don't have a strong enough preference
to change it from the callback way. So the current approach seems
good.

Also is inval-requires-xid-v0.patch planned to be fixed up to inplace160?

I figure I'll pursue that on a different thread, after inplace160 and
inplace180. If there's cause to pursue it earlier, let me know.

Sure. Can be done in a different thread.

I tested the patch locally and it works. And I have no other question
regarding the structure. So this patch looks good to me to commit.

Thanks,
Nitin Motiani
Google

#16Noah Misch
noah@leadboat.com
In reply to: Nitin Motiani (#15)
Re: Inval reliability, especially for inplace updates

With the releases wrapping in 2.5 weeks, I'm ambivalent about pushing this
before the release or after. Pushing before means fewer occurrences of
corruption, but pushing after gives more bake time to discover these changes
were defective. It's hard to predict which helps users more, on a
risk-adjusted basis. I'm leaning toward pushing this week. Opinions?

On Sun, Oct 20, 2024 at 06:41:37PM +0530, Nitin Motiani wrote:

I tested the patch locally and it works. And I have no other question
regarding the structure. So this patch looks good to me to commit.

Thanks. While resolving a back-branch merge conflict, I found
AtEOXact_Inval() and AtEOSubXact_Inval() were skipping inplaceInvalInfo tasks
if transInvalInfo==NULL. If one PreInplace_Inval() failed, the session's next
inplace update would get an assertion failure. Non-assert builds left
inplaceInvalInfo pointing to freed memory, but I didn't find a reachable
malfunction. Separately, the xact.c comment edit wasn't reflecting that v4
brought back the transactional inval. v6 fixes those. Regarding the
back-branch alternatives to the WAL format change:

On Tue, Jun 18, 2024 at 08:23:49AM -0700, Noah Misch wrote:

On Mon, Jun 17, 2024 at 06:57:30PM -0700, Andres Freund wrote:

On 2024-06-17 16:58:54 -0700, Noah Misch wrote:

On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote:

- heap_xlog_inplace() could set the shared-inval-queue overflow signal on
every backend. This is more wasteful, but inplace updates might be rare
enough (~once per VACUUM) to make it tolerable.

We already set that surprisingly frequently, as
a) The size of the sinval queue is small
b) If a backend is busy, it does not process catchup interrupts
(i.e. executing queries, waiting for a lock prevents processing)
c) There's no deduplication of invals, we often end up sending the same inval
over and over.

So I suspect this might not be too bad, compared to the current badness.

That is good.

I benchmarked that by hacking 027_stream_regress.pl to run "pgbench
--no-vacuum --client=4 -T 30 -b select-only" on the standby, concurrent with
the primary running the regression tests. Standby clients acted on sinval
resetState ~16k times, and pgbench tps decreased 4.5%. That doesn't
necessarily mean the real-life cost would be unacceptable, but it was enough
of a decrease that I switched to the next choice:

We might be able to do the overflow signal once at end of
recovery, like RelationCacheInitFileRemove() does for the init file. That's
mildly harder to reason about, but it would be cheaper. Hmmm.

I'm attaching the branch-specific patches for that and for the main fix.
Other notes from back-patching:

- All branches change the ABI of PrepareToInvalidateCacheTuple(), a function
catcache.c exports for the benefit of inval.c. No PGXN extension calls
that, and I can't think of a use case in extensions.

- Due to v15 commit 3aafc03, the patch for v14 differs to an unusual degree
from the patch for v15+v16. The difference is mostly mechanical, though.

Thanks,
nm

Attachments:

inplace160-inval-durability-inplace-v6_17.patch (text/plain; +320 -137)
inplace160-inval-durability-inplace-v6.patch (text/plain; +403 -147)
inplace160-inval-durability-inplace-v6_16.patch (text/plain; +320 -132)
inplace155-backbranch-inval-v1_14.patch (text/plain; +67 -0)
inplace155-backbranch-inval-v1_16.patch (text/plain; +67 -0)
inplace155-backbranch-inval-v1_17.patch (text/plain; +68 -0)
inplace160-inval-durability-inplace-v6_13.patch (text/plain; +289 -117)
inplace160-inval-durability-inplace-v6_14.patch (text/plain; +290 -118)
#17Nitin Motiani
nitinmotiani@google.com
In reply to: Noah Misch (#16)
Re: Inval reliability, especially for inplace updates

On Thu, Oct 24, 2024 at 8:24 AM Noah Misch <noah@leadboat.com> wrote:

With the releases wrapping in 2.5 weeks, I'm ambivalent about pushing this
before the release or after. Pushing before means fewer occurrences of
corruption, but pushing after gives more bake time to discover these changes
were defective. It's hard to predict which helps users more, on a
risk-adjusted basis. I'm leaning toward pushing this week. Opinions?

I lean towards pushing after the release. This is based on my
assumption that since this bug has been around for a while, it is
(probably) not hit often. And a few weeks delay is better than
introducing a new defect.

Thanks

#18Noah Misch
noah@leadboat.com
In reply to: Nitin Motiani (#17)
Re: Inval reliability, especially for inplace updates

On Mon, Oct 28, 2024 at 02:27:03PM +0530, Nitin Motiani wrote:

On Thu, Oct 24, 2024 at 8:24 AM Noah Misch <noah@leadboat.com> wrote:

With the releases wrapping in 2.5 weeks, I'm ambivalent about pushing this
before the release or after. Pushing before means fewer occurrences of
corruption, but pushing after gives more bake time to discover these changes
were defective. It's hard to predict which helps users more, on a
risk-adjusted basis. I'm leaning toward pushing this week. Opinions?

I lean towards pushing after the release. This is based on my
assumption that since this bug has been around for a while, it is
(probably) not hit often. And a few weeks' delay is better than
introducing a new defect.

I had pushed this during the indicated week, before your mail. Reverting it
is an option. Let's see if more opinions arrive.

#19Alexander Lakhin
exclusion@gmail.com
In reply to: Noah Misch (#18)
Re: Inval reliability, especially for inplace updates

Hello Noah,

31.10.2024 04:39, Noah Misch wrote:

I had pushed this during the indicated week, before your mail. Reverting it
is an option. Let's see if more opinions arrive.

I've accidentally discovered an incorrect behaviour caused by commit
4eac5a1fa. Running this script:
for ((j=1; j<=100; j++)); do
echo "iteration $j"

cat << 'EOF' | timeout 60 psql >>psql-$SID.log || { res=1; echo "hanged on iteration $j"; break; }
SELECT format('CREATE TABLE t%s (a int, b text);', g) FROM generate_series(1, 50) g
\gexec

SELECT format('DROP TABLE t%s;', g) FROM generate_series(1, 50) g
\gexec
EOF
done

with
autovacuum = on
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_analyze_threshold = 1

in parallel using separate servers (the full script is attached), like:
parallel -j40 --linebuffer --tag .../reproi.sh ::: `seq 40`

I can catch the following:
...
3       hanged on iteration 51
...
19      hanged on iteration 64
...
39      hanged on iteration 99

And after the script run, I see the server processes hanging:
law      1081433       1  0 16:22 ?        00:00:00 .../usr/local/pgsql/bin/postgres
law      1081452 1081433  0 16:22 ?        00:00:00 postgres: checkpointer
law      1081453 1081433  0 16:22 ?        00:00:00 postgres: background writer
law      1081460 1081433  0 16:22 ?        00:00:00 postgres: walwriter
law      1081462 1081433  0 16:22 ?        00:00:00 postgres: autovacuum launcher
law      1081464 1081433  0 16:22 ?        00:00:00 postgres: logical replication launcher
law      1143065 1081433  0 16:32 ?        00:00:00 postgres: postgres postgres [local] CREATE TABLE
law      1143263 1081433  0 16:32 ?        00:00:00 postgres: autovacuum worker postgres
law      1143320 1081433  0 16:32 ?        00:00:00 postgres: autovacuum worker postgres
law      1143403 1081433  0 16:32 ?        00:00:00 postgres: autovacuum worker

Attaching to process 1143065
...
(gdb) bt
#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=0,
futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0,
futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fed9a8171b8, expected=expected@entry=0,
clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139
#3  0x00007feda4674c5f in do_futex_wait (sem=sem@entry=0x7fed9a8171b8, abstime=0x0, clockid=0) at
./nptl/sem_waitcommon.c:111
#4  0x00007feda4674cf8 in __new_sem_wait_slow64 (sem=0x7fed9a8171b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
#5  0x0000561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a8171b8) at pg_sema.c:327
#6  0x0000561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318
#7  0x0000561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182
#8  0x0000561dd6d4f506 in heapam_index_fetch_tuple (scan=0x561dd8cb6588, tid=0x561dd8cb64d0, snapshot=0x561dd8bfee28,
slot=0x561dd8cb75a0, call_again=0x561dd8cb64d6, all_dead=0x7ffdd63842c6) at heapam_handler.c:146
...
(the full backtrace is attached)

All three autovacuum workers (1143263, 1143320, 1143403) are also waiting
for the same buffer lock:
#5  0x0000561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a817338) at pg_sema.c:327
#6  0x0000561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318
#7  0x0000561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182

Probably, this can be reproduced with VACUUM pg_class/pg_type/..., but I
haven't found out the exact combination needed yet.

Also as a side note, these processes can't be terminated with SIGTERM, I
have to kill them.

Initially I saw this on a slowed down VM, but with the attached patch
applied I could reproduce it on my workstation too.

Best regards,
Alexander

Attachments:

reproi.sh (application/x-shellscript)
delay-before-lockbuffer.patch (text/x-patch; +4 -0)
1143065-bt.txt (text/plain)
#20Noah Misch
noah@leadboat.com
In reply to: Alexander Lakhin (#19)
Re: Inval reliability, especially for inplace updates

On Thu, Oct 31, 2024 at 05:00:02PM +0300, Alexander Lakhin wrote:

I've accidentally discovered an incorrect behaviour caused by commit
4eac5a1fa. Running this script:

Thanks. This looks important.

parallel -j40 --linebuffer --tag .../reproi.sh ::: `seq 40`

This didn't reproduce it for me, at -j20, -j40, or -j80. I tested at commit
fb7e27a. At what commit(s) does it reproduce for you? At what commits, if
any, did your test not reproduce this?

All three autovacuum workers (1143263, 1143320, 1143403) are also waiting
for the same buffer lock:
#5  0x0000561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a817338) at pg_sema.c:327
#6  0x0000561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318
#7  0x0000561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182

Can you share the full backtrace for the autovacuum workers?

This looks like four backends all waiting for BUFFER_LOCK_SHARE on the same
pg_class page. One backend is in CREATE TABLE, and three are in autovacuum.
There are no other apparent processes that would hold the
BUFFER_LOCK_EXCLUSIVE blocking these four processes.

Also as a side note, these processes can't be terminated with SIGTERM, I
have to kill them.

That suggests they're trying to acquire one LWLock while holding another.
I'll recreate your CREATE TABLE stack trace and study its conditions. It's
not readily clear to me how that would end up holding relevant lwlocks.

Guessing how this happened did lead me to a bad decision in commit a07e03f,
but I expect fixing that bad decision won't fix the hang you captured. That
commit made index_update_stats() needlessly call RelationGetNumberOfBlocks()
and visibilitymap_count() with a pg_class heap buffer lock held. Both do I/O,
and the latter can exclusive-lock a visibility map buffer. The attached patch
corrects that. Since the hang you captured involved a pg_class heap buffer
lock, I don't think this patch will fix that hang. The other inplace updaters
are free from similar badness.
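For clarity, the reordering described above might look roughly like this (a non-runnable sketch, not quoted from the attached patch; the function names are PostgreSQL's, but the surrounding variables and call sequence are assumptions for illustration):

```c
/* Sketch only: do the I/O-prone work before locking the pg_class buffer. */
relpages = RelationGetNumberOfBlocks(rel);       /* can read from disk */
visibilitymap_count(rel, &relallvisible, NULL);  /* can lock a VM buffer */

/*
 * ... only now take the pg_class heap buffer lock and apply the inplace
 * update using the precomputed values ...
 */
```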

Attachments:

inplace230-index_update_stats-io-before-buflock-v1.patch (text/plain; +40 -22)
#21Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#20)
#22Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#21)
#23Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#22)
#24Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#23)
#25Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#23)
#26Paul Jungwirth
pj@illuminatedcomputing.com
In reply to: Noah Misch (#25)
#27Paul Jungwirth
pj@illuminatedcomputing.com
In reply to: Paul Jungwirth (#26)
#28Noah Misch
noah@leadboat.com
In reply to: Paul Jungwirth (#27)
#29Paul Jungwirth
pj@illuminatedcomputing.com
In reply to: Noah Misch (#28)
#30Noah Misch
noah@leadboat.com
In reply to: Paul Jungwirth (#29)
#31Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#30)
#32Paul Jungwirth
pj@illuminatedcomputing.com
In reply to: Noah Misch (#31)
#33Noah Misch
noah@leadboat.com
In reply to: Paul Jungwirth (#32)
#34Alexander Lakhin
exclusion@gmail.com
In reply to: Noah Misch (#33)
#35Noah Misch
noah@leadboat.com
In reply to: Alexander Lakhin (#34)
#36Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Noah Misch (#16)
#37Noah Misch
noah@leadboat.com
In reply to: Mark Dilger (#36)
#38Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Noah Misch (#37)
#39Noah Misch
noah@leadboat.com
In reply to: Mark Dilger (#38)
#40Peter Eisentraut
peter_e@gmx.net
In reply to: Noah Misch (#39)
#41Noah Misch
noah@leadboat.com
In reply to: Peter Eisentraut (#40)